荷兰语,该语料库包含两种类型的学生文本:作文和评论。涉及作者(性别、年龄、性取向、来源地区、性格概况)和文档(时间、 ...
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 differ ...
ParaCrawl 是一套大型平行公司,通过广泛的网络爬行工作,为所有欧盟官方语言提供往返英语的辅助。从识别带有翻译文本的网站 ...
Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews ...
N-grams are fixed size tuples of items. In this case the items are words extracted from the Google Books corpus. The ...
294,000 science-relevant tuples
PMC 开放访问 (OA) 子集,其中包含 PMC 中包含具有机器可读知识
日本令牌词典,用于与MeCab。
日语词典和文字嵌入用于自然语言处理。苏达奇迪克是日本令牌(形态分析仪)苏达奇的词典。chiVe是日本预训单词嵌入(单词载 ...
A corpus of web crawl data composed of over 50 billion web pages.