Generating Datasets with Pretrained Language Models

Resource Introduction

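To obtain high-quality sentence embeddings from pretrained language models, the models must either be augmented with additional pretraining objectives or be finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how large pretrained language models can be leveraged to obtain high-quality embeddings without requiring any labeled data, finetuning, or modifications to the pretraining objective: we use their generative abilities to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models. Our fully unsupervised approach outperforms strong baselines on several English semantic textual similarity datasets.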
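The generation step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the prompts, labels, or model, so the instruction wording, the three similarity labels, and the names `INSTRUCTIONS`, `build_prompt`, `generate_pairs`, and `echo_generator` are all hypothetical. The language model is passed in as a plain callable so the sketch runs offline with a stand-in.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical instruction templates, one per similarity label.
# The wording and the label values are assumptions, not taken from the source.
INSTRUCTIONS = {
    1.0: "Write two sentences that mean the same thing.",
    0.5: "Write two sentences that are somewhat similar.",
    0.0: "Write two sentences that are on completely different topics.",
}


def build_prompt(instruction: str, first_sentence: str) -> str:
    """Format the instruction and first sentence, leaving the second sentence open."""
    return (
        f"Task: {instruction}\n"
        f'Sentence 1: "{first_sentence}"\n'
        'Sentence 2: "'
    )


def generate_pairs(
    sentences: Iterable[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, float]]:
    """Build (sentence 1, sentence 2, label) triples from model continuations."""
    pairs = []
    for sentence in sentences:
        for label, instruction in INSTRUCTIONS.items():
            continuation = generate(build_prompt(instruction, sentence))
            # Keep only the text before the closing quotation mark.
            second = continuation.split('"')[0].strip()
            if second:  # skip empty generations
                pairs.append((sentence, second, label))
    return pairs


def echo_generator(prompt: str) -> str:
    """Stand-in for a real language model so the sketch runs without downloads."""
    first = prompt.split('Sentence 1: "')[1].split('"')[0]
    return f'{first} (paraphrase)" trailing text'


if __name__ == "__main__":
    for row in generate_pairs(["A man is playing a guitar."], echo_generator):
        print(row)
```

In practice `generate` would wrap a large pretrained language model, and the resulting triples would serve as training data for finetuning a much smaller sentence-embedding model; neither of those steps is shown here.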


QASC: A Dataset for Question Answering via Sentence Composition

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

2021-08-24 213
