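A Dataset of German Legal Documents for Named Entity Recognition

Resource description

We describe a dataset developed for named entity recognition in German federal court decisions. It comprises approximately 67,000 sentences and more than 2 million tokens. The resource contains 54,000 manually annotated entities, mapped to 19 fine-grained semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, brand, law, ordinance, European legal norm, regulation, contract, court decision, and legal literature. In addition, the legal documents were automatically annotated with more than 35,000 TimeML-based time expressions. The dataset is available in CoNLL-2002 format under a CC-BY 4.0 license and was developed for training an NER service for German legal documents in the EU project Lynx.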
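Because the dataset is distributed in CoNLL-2002 format, a small reader can illustrate how such a file might be consumed. The sketch below assumes the usual CoNLL-2002 layout (one token per line, the tag in the last whitespace-separated column, a blank line between sentences) and BIO-style tags. The sample sentences and the label names `COURT` and `JUDGE` are hypothetical illustrations, not taken from the dataset's actual files or tag set.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-2002 style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the entity tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def count_entities(sentences: List[Sentence]) -> Counter:
    """Count entities per class by counting B- tags (assumes BIO tagging)."""
    counts: Counter = Counter()
    for sentence in sentences:
        for _, tag in sentence:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Hypothetical sample; the label names are assumptions for illustration.
SAMPLE = """\
Der O
Bundesgerichtshof B-COURT
hat O
entschieden O
. O

Richter O
Müller B-JUDGE
stimmte O
zu O
. O
"""

sentences = read_conll(SAMPLE.splitlines())
entity_counts = count_entities(sentences)
```

To apply the same functions to a downloaded file, pass an open file handle to `read_conll` in place of `SAMPLE.splitlines()`; the actual column layout and tag names should be checked against the dataset's own documentation first.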


QASC: A Dataset for Question Answering via Sentence Composition

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging, as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved, as this model still lags 20% behind human performance.

2021-08-24 249
