Khan - Page 7 - Data Mart

NLP

0 0

NCBI Disease Corpus

Dataset contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a Me ...

Khan

2021-08-24

NLP

0 0

Tunisian Arabish Corpus (TArC)

Dataset extracted from social media, totaling 43,313 tokens. The classification task consists in cat ...

Khan

2021-08-24

NLP

0 0

CaSiNo: A Corpus of Campsite Negotiation Dialogues for Automatic Negotiation Systems

Automated systems that negotiate with humans have broad applications in pedagogy and conversational AI. To advance t ...

Khan

2021-08-24

NLP

0 0

TalkDown: A Corpus for Condescension Detection in Context

Condescending language use is caustic; it can bring dialogues to an end and bifurcate communities. Thus, systems for ...

Khan

2021-08-24

NLP

0 0

BillSum: A Corpus for Automatic Summarization of US Legislation

Automatic summarization methods have been studied on a variety of domains, including news and scientific articles. Y ...

Khan

2021-08-24

NLP

0 0

QASC: A Dataset for Question Answering via Sentence Composition

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

Khan

2021-08-24

NLP

0 0

A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Automated fact-checking based on machine learning is a promising approach to identify false information distributed ...

Khan

2021-08-24

NLP

0 0

A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

The lack of large-scale datasets has been a major hindrance to the development of NLP tasks such as spelling correct ...

Khan

2021-08-24

NLP

0 0

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

This paper introduces RyanSpeech, a new speech corpus for research on automated text-to-speech (TTS) systems. Public ...

Khan

2021-08-24

NLP

0 0

Gutenberg Books Corpus

Dataset contains 60,000 eBooks, lang: multilingual, instances: 60 0, file_type: text, task: Text Corpora

Khan

2021-08-24

NLP

0 0

Open Research Corpus

Dataset contains over 39 million published research papers in computer science, neuroscience, and biomedicine, file_type ...

Khan

2021-08-24

NLP

0 0

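WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection

With the spread of online social networks, it is increasingly difficult to monitor all user-generated content. Automating the moderation of inappropriate content exchanged on the Internet has therefore become a ...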

Khan

2021-08-24

NLP

0 0

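Arabic Speech Corpus

Dataset was recorded in South Levantine Arabic (Damascian accent) in a professional studio. Speech synthesized using this corpus was of high quality, file_type ...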

Khan

2021-08-24

NLP

0 0

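NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts

This paper introduces the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish). The corpus is part of ongoing research and currently consists of ...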

Khan

2021-08-24

NLP

0 0

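Polish Parliamentary Corpus

Dataset collects linguistically analyzed documents from the proceedings of the Polish Parliament, Sejm, and Senate. It is based on the Polish Sejm Corpus, lang: Polish, instances: 3,000+, ...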

Khan

2021-08-24

NLP

0 0

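A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context

Common grounding is the process of creating, repairing, and updating mutual understanding, which is a critical aspect of sophisticated human communication. However, traditional dialogue systems have limited capability of establishing common ground, ...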

Khan

2021-08-24

NLP

0 0

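Excitement Datasets

Dataset contains negative feedback from customers in which they state their reasons for dissatisfaction with a given company. The datasets are available in English and Italian, lang: Italian, English ...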

Khan

2021-08-24

NLP

0 0

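New Datasets and a Benchmark of Document Network Embedding Methods for Scientific Expert Finding

The scientific literature is growing faster than ever. Because of the increasing number of publications and the ever-growing diversity of fields of expertise, finding an expert in a particular scientific domain has never been as ...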

Khan

2021-08-24

NLP

0 0

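Generating Datasets with Pretrained Language Models

To obtain high-quality sentence embeddings from pretrained language models, they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter ...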

Khan

2021-08-24

NLP

0 0

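Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion

Given a small set of seed entities (e.g., "United States", "Russia"), corpus-based set expansion is to induce an extensive set of entities that share the same semantic class (in this example ...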

Khan

2021-08-24