Medium-sized - Page 4

63

GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines

The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For me ...

Khan

2021-08-24

NLP

45

Tunisian Arabish Corpus (TArC)

The dataset was extracted from social media and amounts to 43,313 tokens. The classification task consists in cat ...

Khan

2021-08-24

NLP

101

TalkDown: A Corpus for Condescension Detection in Context

Condescending language use is caustic; it can bring dialogues to an end and bifurcate communities. Thus, systems for ...

Khan

2021-08-24

NLP

44

BillSum: A Corpus for Automatic Summarization of US Legislation

Automatic summarization methods have been studied on a variety of domains, including news and scientific articles. Y ...

Khan

2021-08-24

NLP

214

QASC: A Dataset for Question Answering via Sentence Composition

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

Khan

2021-08-24

NLP

33

A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Automated fact-checking based on machine learning is a promising approach to identify false information distributed ...

Khan

2021-08-24

NLP

41

Gutenberg Book Corpus

The dataset contains 60,000 e-books. lang: multilingual, iterations: 60 0, file_type: text, task: text corpora

Khan

2021-08-24

NLP

74

Arabic Speech Corpus

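The dataset was recorded in a professional studio in South Levantine Arabic (Damascian accent). Synthesized speech produced as output using this corpus has yielded high-quality --, file_type ...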

Khan

2021-08-24

NLP

144

NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts

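This paper introduces the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish). The corpus is part of ongoing research and currently includes ...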

Khan

2021-08-24

NLP

39

Polish Parliamentary Corpus

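The dataset collects linguistically analyzed documents from the proceedings of the Polish Parliament, Sejm, and Senate sittings. It is based on the Polish Sejm Corpus. lang: Polish, iterations: 3,000+, ...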

Khan

2021-08-24