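Datasets (master collection)

Datasets · sophie · 8 months ago · 3346 views

This thread collects datasets from every industry, found both inside and outside Extreme Mart (极市), with download links, and will be updated continuously. Recommendations of resources not yet listed are welcome, so they can be shared with CV developers across the country. (To contribute a resource, or if a download fails, leave a reply in this thread.)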

(1) Face datasets

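5 million faces | 15 free face recognition datasets
https://bbs.cvmart.net/topics/457

The world's largest face alignment dataset
https://www.adrianbulat.com/face-alignment/

A comprehensive roundup of face recognition datasets, with bundled downloads
https://bbs.cvmart.net/topics/1582

Google releases a face forensics benchmark dataset for detecting and countering Deepfake face swaps
https://bbs.cvmart.net/topics/1055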


(2) Traffic and automotive datasets

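Traffic sign detection and recognition dataset
https://cg.cs.tsinghua.edu.cn/traffic-sign/

Audi releases A2D2, a large-scale autonomous driving dataset
https://www.a2d2.audi/a2d2/en.html

A2D2 autonomous driving dataset: 2D semantic segmentation, 3D point cloud segmentation, 3D bounding boxes, and vehicle data
https://www.audi-electronics-venture.de/aev/web/en/driving-dataset.html

Large-scale image dataset for vehicle re-identification in urban traffic surveillance
GitHub: https://github.com/VehicleReId/VeRidataset
Project page: https://vehiclereid.github.io/VeRi/

Download: email your name and affiliation to xinchenliu@bupt.edu.cn to request access.

Berkeley releases BDD100K, the largest autonomous driving video dataset to date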
Reference
1.https://bdd-data.berkeley.edu/
2.https://arxiv.org/abs/1805.04687
3.https://bdd-data.berkeley.edu/wad-2018.html
4.https://www.getnexar.com/
5.https://deepdrive.berkeley.edu/
6.https://arxiv.org/abs/1612.01079
7.https://bdd-data.berkeley.edu/wad-2018.html
8.https://bdd-data.berkeley.edu/login.html
9.https://github.com/ucbdrive/bdd-data

Original article:
https://bair.berkeley.edu/blog/2018/05/30/bdd/

(3)

(5) CIFAR datasets

CIFAR-10:
https://hyper.ai/datasets/4926

CIFAR-100:
https://hyper.ai/datasets/4929

The CIFAR-10 dataset contains 60,000 color images of size 32x32, divided into 10 classes with 6,000 images per class.
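To make those numbers concrete, here is a minimal sketch. It assumes the layout used by the Python version of CIFAR-10 (an assumption, not something stated in this post): each image is one flattened row of 3072 bytes, with the 1024 red values first, then the green, then the blue, each plane in row-major order. `pixel_at` and `raw_dataset_bytes` are hypothetical helper names.

```python
WIDTH = HEIGHT = 32
CHANNELS = 3
ROW_BYTES = WIDTH * HEIGHT * CHANNELS  # 3072 bytes per image


def pixel_at(row, x, y):
    """Return the (R, G, B) values at column x, line y of one flattened image row.

    Assumes the row stores three consecutive 32x32 colour planes (R, G, B).
    """
    if len(row) != ROW_BYTES:
        raise ValueError("expected %d bytes, got %d" % (ROW_BYTES, len(row)))
    plane = WIDTH * HEIGHT
    offset = y * WIDTH + x
    return row[offset], row[plane + offset], row[2 * plane + offset]


def raw_dataset_bytes(num_images=60000):
    """Size of the uncompressed pixel data for num_images images."""
    return num_images * ROW_BYTES


# A synthetic image: every red value is 1, green is 2, blue is 3.
fake_row = bytes([1]) * 1024 + bytes([2]) * 1024 + bytes([3]) * 1024
print(pixel_at(fake_row, 5, 7))
print(raw_dataset_bytes())
```

Under these assumptions the raw pixel data for all 60,000 images comes to roughly 184 MB, small enough to hold in memory on most machines.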


Machine learning

[Amazon Web Services datasets](http://aws.amazon.com/datasets)

[Machine learning dataset repository](http://mldata.org/)

[Machine learning sample databases](http://kdd.ics.uci.edu/)

Welcome to the UC Irvine Machine Learning Repository!

[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php)

[Seven time series datasets for machine learning](https://machinelearningmastery.com/time-series-datasets-for-machine-learning/)

The world's largest dataset of chess games.

[Million Base](http://www.top-5000.nl/pgn.htm)

[MNIST](http://yann.lecun.com/exdb/mnist/)

[Poker Hand dataset](http://archive.ics.uci.edu/ml/datasets/Poker+Hand)

[Arcade Universe](https://github.com/caglar/Arcade-Universe)

Distinguishes between 3 simple shapes.

[Baby AI Shapes Dataset](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIShapesDatasets)

A question-image-answer dataset.

[Baby AI Image And Question Dataset](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIImageAndQuestionDatasets)

[Deep Vs Shallow Comparison ICML2007](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007)

[MnistVariations](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations)

Distinguishes wide rectangles from tall rectangles.

[RectanglesData](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData)

Distinguishes convex shapes from non-convex shapes.

[ConvexNonConvex](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/ConvexNonConvex)

Controls the degree of correlation in the backgrounds of noisy MNIST.

[BackgroundCorrelation](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BackgroundCorrelation)

[Google Natural Questions](https://ai.google.com/research/NaturalQuestions)

Diversity in Faces (DiF) is a large and diverse dataset intended to advance research on fairness and accuracy in facial recognition technology.

[IBM Diversity in Faces Dataset](https://www.research.ibm.com/artificial-intelligence/trusted-ai/diversity-in-faces/)

This dataset contains 22M questions about a variety of everyday images. Each image is associated with a scene graph of its objects, attributes, and relations, built on a new version of Visual Genome.

[GQA](https://cs.stanford.edu/people/dorarad/gqa/)

Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision.

[Facebook BISON](http://hexianghu.com/bison/)

Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding.

[Visual Commonsense Reasoning (VCR)](https://visualcommonsense.com/)

[Youtube-8M 2018](https://research.google.com/youtube8m/index.html)

[Chinese Text in the Wild](https://ctwdataset.github.io/)

The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located. You can browse the recognized drawings on quickdraw.withgoogle.com/data.

[The Quick, Draw! Dataset](https://github.com/googlecreativelab/quickdraw-dataset)


This is an anonymized Douban dataset containing 129,490 unique users and 58,541 unique movie items.

[Douban](http://socialcomputing.asu.edu/datasets/Douban)

Epinions is a website where people can review products. Users can register for free and start writing subjective reviews about many different types of items (software, music, television shows, hardware, office appliances, ...). A peculiar characteristic of Epinions is that users are paid according to how useful a review is found to be (Income Share program).

[Epinions](http://www.trustlet.org/epinions.html)

Flixster is a social movie site that lets users share movie ratings, discover new movies, and meet others with similar taste in movies.

[Flixster](http://socialcomputing.asu.edu/datasets/Flixster)

MyPersonality is a popular Facebook application that allows users to take real psychometric tests, and allows us to record (with consent!) their psychological and Facebook profiles. Currently, our database contains more than 6,000,000 test results, together with more than 4,000,000 individual Facebook profiles.

[MyPersonality](http://mypersonality.org/wiki/doku.php)

Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.

[MovieLens](https://grouplens.org/datasets/movielens/)
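The figures quoted for this benchmark also say something about how sparse the rating matrix is. The following is a small sketch using only the rounded counts from the description above; `rating_density` is a hypothetical helper, and the result is approximate because the counts are rounded.

```python
def rating_density(num_ratings, num_users, num_items):
    """Fraction of the user x item matrix that actually holds a rating."""
    return num_ratings / (num_users * num_items)


# Rounded counts quoted above: 20 million ratings, 138,000 users, 27,000 movies.
density = rating_density(20_000_000, 138_000, 27_000)
print("about {:.2%} of all user/movie pairs are rated".format(density))
```

With roughly half a percent of the cells filled, collaborative filtering code for this dataset normally works on sparse structures rather than a dense users-by-movies array.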

Anonymous Ratings from the Jester Online Joke Recommender System.

[Jester](http://eigentaste.berkeley.edu/dataset/)

Book-Crossing Dataset.

[BookCrossing](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)

[LastFM](https://grouplens.org/datasets/hetrec-2011/)

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries.

[Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia)

The files found here are complete copies of the OpenStreetMap.org database, including editing history. These are published under the Open Data Commons Open Database License 1.0.

[OpenStreetMap](https://planet.openstreetmap.org/planet/full-history/)

Hermes is Lab41's foray into recommender systems. It explores how to choose a recommender system for a new application by analyzing the performance of multiple recommender system algorithms on a variety of datasets.

[Hermes (Python code on GitHub)](https://github.com/lab41/hermes)

Recommendation and Ratings Public Data Sets For Machine Learning.

[Gist](https://gist.github.com/entaroadun/1653794)

The Yelp dataset is a subset of our businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files, use it to teach students about databases, to learn NLP, or for sample production data while you learn how to make mobile apps.

[Yelp](https://www.yelp.com/dataset)

The CiteULike database is potentially useful for researchers in various fields. Physicists and computer scientists have expressed an interest in trying to analyse the structure of the data, and frequently ask for datasets to be made available. Previously this was done on an ad-hoc basis, and it relied on us remembering to update the data file. Now, there is an automatic process which runs every night producing a snapshot summary of what articles have been posted with which tags.

[CiteULike](http://www.citeulike.org/faq/data.adp)

The data set contains anonymized users' shopping logs from the 6 months before and on the "Double 11" day, and label information indicating whether they are repeat buyers. Due to privacy issues, the data is sampled in a biased way, so statistical results on this data set will deviate from the actual figures of Tmall.com. (To be updated.)

[TaoBao](https://tianchi.aliyun.com/dataset/)

Public government datasets

The US National Center for Education Statistics

The UK Data Centre

Data USA

Data.gov

Chronic disease data

Survey of the finances of US school systems.

[School system finances](https://catalog.data.gov/dataset/annual-survey-of-school-system-finances)

Data on how local food choices affect diet in the United States.

[Food Environment Atlas](https://catalog.data.gov/dataset/food-environment-atlas-f4a22)

The Harvard Library provides open access to our metadata through bibliographic datasets and APIs.

[Harvard Library APIs & Datasets](https://library.harvard.edu/services-tools/harvard-library-apis-datasets#Harvard-Library-Bibliographic-Dataset)

[EU gender statistics database](http://eige.europa.eu/gender-statistics)

[Netherlands national geological research data](http://www.nationaalgeoregister.nl/geonetwork/srv/dut/search#fast=index&from=1&to=50&any_OR_geokeyword_OR_title_OR_keyword=landinrichting*&relation=within)

(No description provided.)

[UNDP projects](http://open.undp.org/#2016)

Collates data from the body which oversees the network of EV charge points across the Republic of Ireland and Northern Ireland.

[Irish Electric Vehicle Charge Point Status](http://www.mlopt.com/?p=6598)

HighD: highway drone dataset

MURA


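Finance

The International Monetary Fund publishes data on international finance, debt rates, foreign exchange reserves, commodity prices, and investments.

[IMF Data](https://www.imf.org/en/Data)

A good source of financial and economic data.

[Quandl](https://www.quandl.com/)

Datasets covering demographics and a large number of economic and development indicators from around the world.

[World Bank Open Data](https://data.worldbank.org/)

Up-to-date information on the world's financial markets, including stock price indexes, commodities, and foreign exchange.

[Financial Times Market Data](https://markets.ft.com/data/)

Examine and analyze data on internet search activity and trending news stories around the world.

[Google Trends](http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0)

A good place to find US macroeconomic data.

[American Economic Association (AEA)](https://www.aeaweb.org/resources/data/us-macro-regional)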


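Image processing

MNIST is a database of handwritten digits with a training set of 60,000 samples and a test set of 10,000 samples. Each sample image is 28x28 pixels, and it is an introductory dataset in machine learning.

[MNIST](http://yann.lecun.com/exdb/mnist/)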
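For readers who want to open the image files directly, here is a minimal sketch. It assumes the IDX layout the files are distributed in (an assumption, not stated in this post): after gzip decompression, a big-endian header of four 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. `parse_idx_images` is a hypothetical helper name.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions


def parse_idx_images(blob):
    """Split an already-decompressed IDX image file into per-image byte strings."""
    magic, count, rows, cols = struct.unpack(">IIII", blob[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError("unexpected magic number %d" % magic)
    size = rows * cols
    pixels = blob[16:]
    if len(pixels) != count * size:
        raise ValueError("pixel data does not match the header")
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


# Synthetic stand-in for a real file: two blank 28x28 images.
fake_file = struct.pack(">IIII", IDX_IMAGE_MAGIC, 2, 28, 28) + bytes(2 * 28 * 28)
images, rows, cols = parse_idx_images(fake_file)
print(len(images), rows, cols, len(images[0]))
```

A real file would first be read with `gzip.open(path, "rb").read()`; the synthetic bytes here only stand in for that step.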

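The ImageNet project is a large visual database for research on visual object recognition software. More than 14 million image URLs have been hand-annotated by ImageNet to indicate which objects are pictured; bounding boxes are also provided for at least one million of the images.

[ImageNet](http://www.image-net.org/)

Another widely used image dataset besides ImageNet, covering general image understanding and captioning.

[MS COCO](http://cocodataset.org/)

100 different objects imaged at every angle through a 360° rotation.

[COIL100](http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)

A very detailed visual knowledge base with annotations for about 100K images.

[Visual Genome](http://visualgenome.org/)

13,000 labeled face images for developing applications that involve face recognition.

[Labelled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/)

Contains 20,580 images covering 120 dog breeds.

[Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/)

A very specific dataset, useful because most scene recognition models perform better on outdoor scenes. It contains 67 indoor categories and 15,620 images in total.

[Indoor Scene Recognition](http://web.mit.edu/torralba/www/indoor.html)

We measure human scene classification performance on the SUN database and compare it with computational methods.

[SUN dataset](https://vision.princeton.edu/projects/2010/SUN/)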

This data consists of 640 black and white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and size.

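[CMU Face Images](http://kdd.ics.uci.edu/databases/faces/faces.html)

Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks

[Dataset of 500,000 logos](https://data.vision.ee.ethz.ch/sagea/lld/)

Mapillary Vistas, the world’s largest and most diverse publicly available, pixel-accurately and instance-specifically annotated street-level imagery dataset, will empower autonomous mobility and transport at the global scale.

[Large-scale street-level image (segmentation) dataset](https://blog.mapillary.com/product/2017/05/03/mapillary-vistas-dataset.html)

[The Cityscapes Dataset](https://github.com/mcordts/cityscapesScripts)

Commonly used as a sanity-check baseline for models. A dataset of 28x28 black-and-white handwritten digit images. It is an easy dataset, so the fact that your model works on MNIST does not mean it is actually effective.

[MNIST](https://pjreddie.com/projects/mnist-in-csv/)

32x32 color images. Not used much anymore, but still usable as a sanity-check benchmark for your model.

[CIFAR 10 and CIFAR 100](https://www.cs.toronto.edu/~kriz/cifar.html)

The de facto standard image dataset for new algorithms. Many companies offering image API services return labels through their REST interfaces that closely resemble the 1000-category WordNet hierarchy used by ImageNet.

[ImageNet](http://image-net.org/)

Commonly used for scene understanding and related auxiliary tasks such as room layout estimation and saliency prediction.

[LSUN](http://lsun.cs.princeton.edu/2016/)

Commonly used for general image segmentation and classification. Not especially useful for building real-world image annotation, but fine as a baseline.

[PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/)

House numbers from Google Street View. Think of it as a recurrent MNIST in the wild.

[SVHN](http://ufldl.stanford.edu/housenumbers/)

A very detailed visual knowledge base with deep captioning of about 100K images.

[Visual Genome](http://visualgenome.org/)

[Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/)

[Mut1ny](http://www.mut1ny.com/face-headsegmentation-dataset)

An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.

[STL-10](http://cs.stanford.edu/~acoates/stl10/)

Binocular images of toy figurines under various lighting conditions and poses.

[NORB](http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/)

General image segmentation and classification. Not especially useful for building real-world image annotation, but useful as a baseline.

[Pascal VOC](http://pascallin.ecs.soton.ac.uk/challenges/VOC/)

Dataset hosted at archive.org covering music released around the world, for use in image processing research.

[One Million Audio Cover Images](https://archive.org/details/audio-covers)

A new dataset and benchmark for continuous object recognition.

[Core50](https://vlomonaco.github.io/core50/)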

Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN).

[NVIDIA Flickr-Faces-HQ](https://github.com/NVlabs/ffhq-dataset)

[Danbooru2018](https://www.gwern.net/Danbooru2018)

15,851,536 boxes on 600 categories; 2,785,498 instance segmentations on 350 categories; 36,464,560 image-level labels on 19,959 categories; 391,073 relationship annotations of 329 relationships. Extension: 478,000 crowdsourced images with 6,000+ categories.

[Google Open Images V4](https://storage.googleapis.com/openimages/web/index.html)

Image annotation and data management at scale. Keep all your data privately in one place in a single format. Annotate with powerful tools: polygons, rectangles, tags, 3D cuboids. Utilize the latest deep learning to label data up to 10x faster.

[Supervisely Person](https://supervise.ly/)

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It includes 265,016 images (COCO and abstract scenes); at least 3 questions (5.4 questions on average) per image; 10 ground truth answers per question; 3 plausible (but likely incorrect) answers per question; and an automatic evaluation metric.

[VQA Visual Question Answering](https://visualqa.org/)

Realism from sunlight to sensor: Synscapes is created with an end-to-end approach to realism, accurately capturing the effects of everything from illumination by sun and sky, to the scene's geometric and material composition, to the optics, sensor and processing of the camera system. 25,000 procedural and unique images: the images in the dataset do not follow a driven path through a single virtual world. Instead, an entirely unique scene was procedurally generated for each of the twenty-five thousand images. As a result, the dataset contains a wide range of variations and unique combinations of features.

[Synscapes](https://7dlabs.com/synscapes-overview)

This repository introduces the open-source project dubbed Tencent ML-Images, which publishes: (1) ML-Images, the largest open-source multi-label image database, including 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories; and (2) a ResNet-101 model, pre-trained on ML-Images, which achieves a top-1 accuracy of 80.73% on ImageNet via transfer learning.

[Tencent ML-Images](https://github.com/Tencent/tencent-ml-images)

fastMRI is a collaborative research project from Facebook AI Research (FAIR) and NYU Langone Health to investigate the use of AI to make MRI scans up to 10 times faster.

[fastMRI Dataset](http://fastmri.org/)

[Mapillary Vistas](https://www.mapillary.com/dataset/vistas)

The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference. The semantic categories of Places are defined by their function: the labels represent the entry-level of an environment. To illustrate, the dataset has different categories of bedrooms, or streets, etc., as one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, a hotel bedroom or a nursery.

[Places2](http://places2.csail.mit.edu/)


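Sentiment analysis

Features product reviews from Amazon.

[Multidomain Sentiment analysis dataset](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/)

An older, relatively small dataset for binary sentiment classification, with 25,000 movie reviews.

[IMDB reviews](http://ai.stanford.edu/~amaas/data/sentiment/)

[Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/code.html)

[Sentiment140](http://help.sentiment140.com/for-students/)

Twitter data about US airlines from February 2015, classified as positive, negative, or neutral.

[Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment)

With this data, sentiment analysis can be run on the last statements of inmates facing execution.

[Last words of executed death row inmates since 1984](http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html)

What do users of Reddit (a news site) care about?

[2.5 million Reddit posts](https://github.com/umbrae/reddit-top-2.5-million)

[How do Americans meet their partners?](https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/30103?q=&paging.rows=25&sortBy=10)

[Are jobs today worse than jobs in the past?](https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/7610?q=&paging.rows=25&sortBy=9)

I often wonder whether anyone has taken an online personality test and found themselves more neurotic than most people. A large amount of usable data is available from many online personality testing projects; by comparing one person's answers with everyone else's, you can pick out the more neurotic respondents.

[Online personality tests](https://openpsychometrics.org/_rawdata/)

An older academic dataset.

[Multi-domain sentiment analysis](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/)

A standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.

[Stanford Sentiment Treebank](http://nlp.stanford.edu/sentiment/code.html)

We present a multimodal dataset for the analysis of human affective states. The electroencephalogram (EEG) and peripheral physiological signals of 32 participants were recorded as each watched 40 one-minute excerpts of music videos. Participants rated each video in terms of arousal, valence, like/dislike, dominance, and familiarity. For 22 of the 32 participants, frontal face video was also recorded. A novel method for stimulus selection is used, utilizing retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool.

[DEAPdataset](http://www.eecs.qmul.ac.uk/mmv/datasets/deap/index.html)

Annotated sentiment datasets.

[Dai-labor](http://www.dai-labor.de/en/competence_centers/irml/datasets/)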


Natural language processing

Contains about 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings, and the plaintext review.

[Amazon Reviews](https://snap.stanford.edu/data/web-Amazon.html)

A collection of words from Google Books.

[Google Books Ngrams](https://aws.amazon.com/cn/datasets/google-books-ngrams/)

681,288 blog posts gathered from blogger.com. Each blog contains at least 200 occurrences of commonly used English words.

[Blogger Corpus](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm)

[Wikipedia Links data](https://code.google.com/p/wiki-links/downloads/list)

1.3 million pairs of text chunks from the records of the 36th Canadian Parliament.

[Hansards text chunks of Canadian Parliament](https://www.isi.edu/natural-language/download/hansard/)

A dataset of 5,574 English SMS messages, each labeled as spam or legitimate.

[SMS Spam Collection in English](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)

The open dataset released by Yelp contains more than 5 million reviews.

[Yelp Reviews](https://www.yelp.com/dataset)

A large spam email dataset, useful for spam filtering.

[UCI's Spambase](https://archive.ics.uci.edu/ml/datasets/Spambase)

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

[NLP-progress: a big list tracking the latest progress and related resources in NLP](https://github.com/sebastianruder/NLP-progress)

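A Chinese news (text) classification dataset from Toutiao.

[Toutiao Chinese news (text) classification dataset](https://github.com/fateleak/toutiao-text-classfication-dataset)

More than 10,000 traffic sign annotations covering thousands of physically distinct traffic signs in the Flanders region of Belgium.

[KUL Belgium Traffic Sign Dataset](http://www.vision.ee.ethz.ch/~timofter/traffic_signs/)

RACE Reading Comprehension Task

[Large-scale (English exam) reading comprehension dataset with 28K passages and 100K questions](https://github.com/qizhex/RACE_AR_baselines)

Natural Language for Visual Reasoning

[NLVR: a grounded natural language dataset (reasoning about object grouping, counts, comparisons, and spatial relations)](http://lic.nlp.cornell.edu/nlvr/)

Stanford's question answering dataset: a question answering and reading comprehension dataset spanning a broad range of topics, in which the answer to every question is a span of text.

[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)

Eight datasets for text classification, commonly used as baselines for text classification algorithms. Sample sizes range from 120K to 3.6M and the number of classes from 2 to 14. They include datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.

[Text Classification](https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M)

The first dataset released by Quora, containing duplicate / semantic similarity labels.

[Question Pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)

Manually generated factoid question/answer pairs with difficulty ratings, based on Wikipedia articles.

[CMU Q/A](http://www.cs.cmu.edu/~ark/QA-data/)

A sophisticated, artificially built dataset for research on stateful natural language understanding.

[Maluuba](https://datasets.maluuba.com/)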

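A large, general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.

[Billion Words](http://www.statmt.org/lm-benchmark/)

A petabyte-scale crawl of the web, most frequently used for learning word embeddings.

[Common Crawl](https://commoncrawl.org/the-data/)

A reading comprehension and question answering dataset from FAIR (Facebook AI Research).

[bAbi](https://research.fb.com/downloads/babi/)

A baseline dataset of (question + context, answer) pairs extracted from children's books available through Project Gutenberg. Useful for question answering, reading comprehension, and factoid look-up tasks.

[The Children’s Book Test](https://research.fb.com/downloads/babi/)

Stanford Sentiment Treebank

One of the classic datasets for text classification, usually used as a benchmark for pure classification or as a validation of information retrieval or indexing algorithms.

[20 Newsgroups](http://qwone.com/~jason/20Newsgroups/)

An older, relatively small dataset for binary sentiment classification.

[IMDB](http://ai.stanford.edu/~amaas/data/sentiment/)

An older, classic spam email dataset from the famous UCI Machine Learning Repository. Because it includes details of how the dataset was curated, it may be an interesting baseline for learning personalized spam filtering.

[UCI’s Spambase](https://archive.ics.uci.edu/ml/datasets/Spambase)

Movie review data of various sizes, commonly used as a baseline for collaborative filtering algorithms.

[MovieLens](https://grouplens.org/datasets/movielens/)

A baseline of (question + context, answer) pairs extracted from children's books available through Project Gutenberg. Used for question answering (reading comprehension) and factoid look-up.

[The Children’s Book Test](http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz)

From the Book-Crossing community. Contains 1,149,780 ratings of about 271,379 books from 278,858 users.

[Book-Crossing dataset](http://www.informatik.uni-freiburg.de/~cziegler/BX/)

[Maluuba News QA](https://datasets.maluuba.com/NewsQA)

The Yelp dataset is a subset of Yelp's business, review, and user data for use in NLP.

[Yelp Open Dataset](https://www.yelp.com/dataset)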

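Over one billion public comments posted to Reddit between 2007 and 2015, for training language algorithms.

[Complete Public Reddit Comments Corpus](https://archive.org/details/2015_reddit_comments_corpus)

Google has released the Natural Questions dataset, which contains 300,000 naturally occurring questions with human-annotated answers along with 16,000 examples, and has launched a question answering challenge based on it. It is expected to become the SQuAD of natural language understanding.

[Natural Questions](https://ai.google.com/research/NaturalQuestions)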

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

[Stanford Question Answering Dataset (SQuAD) 2.0](https://rajpurkar.github.io/SQuAD-explorer/)
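The "answer is a span, or the question is unanswerable" structure can be sketched as follows. The snippet assumes the nesting and field names of the published SQuAD 2.0 JSON files (`data`, `paragraphs`, `context`, `qas`, `is_impossible`, `answers`, `text`, `answer_start`); the tiny `SAMPLE` record and the helper `iter_answer_spans` are made up for illustration.

```python
# Hand-written sample in the assumed SQuAD 2.0 layout (not real dataset content).
SAMPLE = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The hood of the car was opened on Monday.",
            "qas": [
                {"id": "q1", "question": "When was the hood opened?",
                 "is_impossible": False,
                 "answers": [{"text": "Monday", "answer_start": 34}]},
                {"id": "q2", "question": "Who opened the trunk?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}


def iter_answer_spans(dataset):
    """Yield (question id, (start, end)) pairs, or (id, None) if unanswerable."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    yield qa["id"], None
                    continue
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The answer text must literally occur at that offset.
                    if context[start:end] != answer["text"]:
                        raise ValueError("span mismatch for " + qa["id"])
                    yield qa["id"], (start, end)


spans = list(iter_answer_spans(SAMPLE))
print(spans)
```

Checking each offset against the context is a cheap way to catch encoding or preprocessing mistakes before training on the spans.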

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word’s translation into English (and corresponding images.)

[Massively Multilingual Image Dataset (MMID)](http://multilingual-images.org/)

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). SWAG (Situations With Adversarial Generations) is a large-scale dataset for this task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning.

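[SWAG](https://rowanzellers.com/swag/)

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as coca.

[CoQA](https://stanfordnlp.github.io/coqa/)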

Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets. To do well on it, systems must generalize well to not only new SQL queries but also new database schemas.

[Spider 1.0](https://yale-lily.github.io/spider)

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It is collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

[HotpotQA](https://hotpotqa.github.io/)

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

[AmazonReviews](http://jmcauley.ucsd.edu/data/amazon/)


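Autonomous driving

Currently the largest dataset for autonomous driving AI. It contains more than 100,000 videos covering over 1,100 hours of driving across different times of day and weather conditions. The annotated images come from the New York and San Francisco areas.

[Berkeley DeepDrive BDD100k](https://bdd-data.berkeley.edu/)

A large-scale autonomous driving dataset opened up by Baidu's Apollo program. It defines 26 different semantic classes such as cars, bicycles, pedestrians, buildings, and street lights.

[Baidu Apolloscapes](http://apolloscape.auto/)

More than 7 hours of highway driving. Details include the car's speed, acceleration, steering angle, and GPS coordinates.

[Comma.ai](https://archive.org/details/comma-dataset)

More than 100 repetitions of the same route through Oxford, UK, captured over a period of a year. The dataset captures different combinations of weather, traffic, and pedestrians, along with long-term changes such as construction and roadworks.

[Oxford's Robotic Car](http://robotcar-dataset.robots.ox.ac.uk/)

A large dataset recording urban street scenes in 50 different cities.

[Cityscape Dataset](https://www.cityscapes-dataset.com/)

This dataset is useful for perception and navigation in autonomous vehicles. It is heavily skewed toward roads in developed countries.

[CCSAD Dataset](http://aplicaciones.cimat.mx/Personal/jbhayet/ccsad-dataset)

A sample of the 1,000+ hours of multi-sensor driving data collected at AgeLab.

[MIT AGE Lab](http://lexfridman.com/carsync/)

This dataset includes traffic signs, vehicle detection, traffic lights, and trajectory patterns.

[LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets](http://cvrr.ucsd.edu/LISA/datasets.html)

More than 10,000 traffic sign annotations based on thousands of distinct traffic signs in the Flanders region of Belgium.

[KUL Belgium Traffic Sign Dataset](http://www.vision.ee.ethz.ch/~timofter/traffic_signs/)

A first public look at 2 million kilometers of trip data.

[Uber 2B trip data](https://movement.uber.com/cities?lang=pt-BR)

New York City taxi trip records (2013): detailed trip location data for New York City taxis, with attributes including medallion, hack license, vendor ID, rate code, store-and-forward flag, pickup datetime, dropoff datetime, passenger count, trip time in seconds, trip distance, and the latitude and longitude coordinates of the pickup and dropoff locations. The data was obtained by an individual from NYC’s Taxi and Limousine Commission through a request under the Freedom of Information Law (FOIL).

[New York City taxi trip records (2013)](http://chriswhong.com/open-data/foil_nyc_taxi/)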
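A minimal sketch of how the attributes listed above might be used. The exact CSV column names, the timestamp format, and the single sample row are assumptions made for illustration (the real files may differ), and `trip_summaries` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# Hypothetical header and row modelled on the attribute list above.
SAMPLE_CSV = (
    "medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,"
    "pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,"
    "trip_distance,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude\n"
    "M1,H1,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.0,"
    "-73.978165,40.757977,-73.989838,40.751171\n"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def trip_summaries(text):
    """Return duration in seconds and average speed (mph) for each trip row."""
    summaries = []
    for row in csv.DictReader(io.StringIO(text)):
        pickup = datetime.strptime(row["pickup_datetime"], TIME_FORMAT)
        dropoff = datetime.strptime(row["dropoff_datetime"], TIME_FORMAT)
        seconds = (dropoff - pickup).total_seconds()
        miles = float(row["trip_distance"])
        # Guard against zero-length trips, which show up in real logs.
        mph = miles / (seconds / 3600) if seconds > 0 else None
        summaries.append({"seconds": seconds, "mph": mph})
    return summaries


summaries = trip_summaries(SAMPLE_CSV)
print(summaries)
```

Recomputing the duration from the two timestamps and comparing it with the recorded trip time is a simple consistency check worth running before any modelling.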

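Taxi trip records for the city of Chicago from 2013 to the present, including taxi ID, trip start time, trip end time, trip miles, pickup and dropoff census tracts, and pickup and dropoff community areas.

[Chicago taxi trip records from 2013 to the present](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew)

The self-driving car dataset from Udacity's open autonomous driving course, which aims to build an open-source self-driving car project.

[Udacity self-driving car](https://www.udacity.com/self-driving-car)

[Uber pickups in New York City](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city)

2005-2015

[UK car accident data (2005-2015)](https://www.kaggle.com/silicon99/dft-accident-data)

Speeding is dangerous.

[Chicago speed violation data](https://www.kaggle.com/chicagopolice/speed-violations)

KITTI is a suite of machine vision task data aimed at autonomous driving, covering stereo, optical flow, visual odometry, 3D object detection, 3D object tracking, and more. The data was recorded by a car equipped with two color and two grayscale cameras, a 360-degree laser scanner, and a GPS localization system while driving around the mid-size city of Karlsruhe.

[KITTI autonomous driving data](http://www.cvlibs.net/datasets/kitti/)

The German Traffic Sign Recognition Benchmark (GTSRB) is a German traffic sign dataset for pattern recognition techniques that help drivers recognize traffic signs.

[German traffic sign recognition data](http://benchmark.ini.rub.de)

Traffic Lights Recognition (TLR) is a video dataset for traffic light recognition. The traffic light videos were captured on real roads at a resolution of 640x480 and are provided by a French university.

[Traffic light recognition video data](http://www.lara.prd.fr/benchmarks/trafficlightsrecognition)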

Explore 100,000 HD video sequences of over 1,100-hour driving experience across many different times in the day, weather conditions, and driving scenarios. Our video sequences also include GPS locations, IMU data, and timestamps.

[Berkeley Deep Drive (BDD100K)](https://bdd-data.berkeley.edu/)

comma.ai presents comma2k19, a dataset of over 33 hours of commute on California's 280 highway. This means 2019 segments, 1 minute long each, on a 20km section of highway driving between California's San Jose and San Francisco. comma2k19 is a fully reproducible and scalable dataset. The data was collected using comma EONs, which have sensors similar to those of any modern smartphone, including a road-facing camera, phone GPS, thermometers and a 9-axis IMU. Additionally, the EON captures raw GNSS measurements and all CAN data sent by the car with a comma grey panda.

[Comma 2k19](https://github.com/commaai/comma2k19)

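Welcome to the HD1K Benchmark Suite, an autonomous driving dataset and optical flow benchmark. The dataset was created by the Heidelberg Collaboratory for Image Processing in close cooperation with Robert Bosch GmbH.

[HD1K Benchmark Suite](http://hci-benchmark.org/)

ApolloScape, part of the Apollo autonomous driving project, is a research-oriented project that aims to foster innovation in all aspects of autonomous driving, from perception and navigation to control. It provides open access to semantically annotated (pixel-level) street view images and to simulation tools that support user-defined policies.

[ApolloScape](http://apolloscape.auto/)

nuScenes is a large-scale autonomous driving dataset. Its features are: a full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS); 1,000 scenes of 20 seconds each; 1,400,000 camera images; 390,000 lidar sweeps; two diverse cities, Boston and Singapore; left- versus right-hand traffic; detailed map information; manual annotations for 23 object classes; 1.4M 3D bounding boxes annotated at 2 Hz; and attributes such as visibility, activity, and pose.

[nuScenes](https://www.nuscenes.org/)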
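The headline figures above can be cross-checked with a little arithmetic. This is a back-of-the-envelope sketch that uses only the rounded numbers quoted in the description, so the results are approximate; `nuscenes_back_of_envelope` is a hypothetical helper.

```python
SCENES = 1000
SCENE_SECONDS = 20
ANNOTATION_HZ = 2
BOXES_3D = 1_400_000
CAMERA_IMAGES = 1_400_000
CAMERAS = 6


def nuscenes_back_of_envelope():
    """Derive rough per-frame statistics from the quoted totals."""
    total_seconds = SCENES * SCENE_SECONDS
    keyframes = total_seconds * ANNOTATION_HZ
    return {
        "hours": total_seconds / 3600,
        "annotated_keyframes": keyframes,
        "boxes_per_keyframe": BOXES_3D / keyframes,
        "images_per_camera_per_second": CAMERA_IMAGES / (CAMERAS * total_seconds),
    }


stats = nuscenes_back_of_envelope()
for name, value in stats.items():
    print(name, round(value, 2))
```

Under these numbers the dataset covers a little over five and a half hours of driving, with about 40,000 annotated keyframes and around 35 boxes per keyframe on average.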

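Research data

About 32,000 color images collected on Mars by the Curiosity rover, showing a variety of Martian geographic and geological features such as mountains and valleys, craters, dunes, and rocky terrain.

[Mars dataset](https://dominikschmidt.xyz/mars32k/)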

Volcanoes on Venus - JARtool experiments

[Volcanoes on Venus](http://kdd.ics.uci.edu/databases/volcanoes/volcanoes.html)

The National UFO Reporting Center Online Database

[UFO reports](http://www.nuforc.org/webreports.html)

9/11

[WikiLeaks 9/11 pager intercepts](http://911.wikileaks.org/files/index.html)

HadCRUT4 is a global temperature dataset, providing gridded temperature anomalies across the world as well as averages for the hemispheres and the globe as a whole. CRUTEM4 and HadSST3 are the land and ocean components of this overall dataset, respectively.

[Temperature](https://crudata.uea.ac.uk/cru/data/temperature/#datter)

NSSDCA supports the space science research community, the education enterprise, and the general public. Further information on NSSDCA's designated community is available.

[NASA](https://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html)

Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

[E. coli genes](http://kdd.ics.uci.edu/databases/ecoli/ecoli.html)

Data giving characteristics of each ORF (potential gene) in the M. tuberculosis bacterium. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

[M. tuberculosis genes](http://kdd.ics.uci.edu/databases/tb/tb.html)

The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles.

[El Nino Data](http://kdd.ics.uci.edu/databases/el_nino/el_nino.html)

This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second.

[EEG Database](http://kdd.ics.uci.edu/databases/eeg/eeg.html)
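The sampling figures in that description fit together as shown in this small sketch, which uses only the numbers quoted above (64 electrodes, 256 Hz, 1 second); `eeg_trial_shape` is a hypothetical helper.

```python
def eeg_trial_shape(sample_rate_hz=256, duration_s=1.0, electrodes=64):
    """Return (samples per channel, sample period in ms, values per trial)."""
    samples = int(sample_rate_hz * duration_s)
    period_ms = 1000 / sample_rate_hz
    return samples, period_ms, samples * electrodes


samples, period_ms, values = eeg_trial_shape()
print(samples, period_ms, values)
```

The 3.9 ms epoch quoted in the description is simply the sample period 1/256 s, and one trial is a 64 x 256 array of measurements.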

A SERVICE OF NASA EXOPLANET SCIENCE INSTITUTE

[NASA EXOPLANET ARCHIVE](https://exoplanetarchive.ipac.caltech.edu/index.html)

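GET-Evidence provides downloadable public genomes. I suspect Steven Pinker's personal data is in there too, so perhaps you could even clone one of your own.

[Want to clone yourself?](http://evidence.pgp-hms.org/download)

Ever wanted a detailed breakdown of foods? (Thanks, Canada.)

[Listing foods](http://foodb.ca/foods?utf8=%E2%9C%93&q[name_cont]=&q[name_scientific_cont]=&q[food_group_cont]=&q[food_subgroup_cont]=&button=)

Preventing human extinction is a top priority! Even the smallest extinction risk deserves vigilance.

[Records of meteorite impacts on Earth from 2500 BC to 2012](https://www.analyticbridge.datasciencecentral.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized)

[How much do gender and mental illness affect crime?](http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/27521?q=&paging.rows=25&sortBy=10)

It also includes a large amount of relationship data and biomarker data.

[Adolescent health data](https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/21600?q=&paging.rows=25&sortBy=9)

Feed this data to a neural network and it might produce some earthquake predictions.

[All earthquake data between the years 1000 and 1903](https://www.globalquakemodel.org/single-post/2017/05/17/Global-Historical-Earthquake-Archive-and-Catalogue-1000-1903)

Global vector data under a free license. It includes (an older version of) the US Census Bureau's TIGER data.

[OpenStreetMap](https://wiki.openstreetmap.org/wiki/Planet.osm)

Satellite imagery of the entire surface of the Earth, updated every few weeks.

[Landsat8](https://landsat.usgs.gov/landsat-8)

A dataset of atmospheric conditions across the US from Doppler radar scans.

[NEXRAD](https://governmentshutdown.noaa.gov/)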

The Surveillance Atlas of Infectious Diseases is a tool that interacts with the latest available data about a number of infectious diseases. The interface allows users to interact and manipulate the data to produce a variety of tables and maps. The information contained in the dataset provided through ATLAS is made available by ECDC collating data from the Member States collected through The European Surveillance System (TESSy).

[EU Surveillance Atlas of Infectious Diseases](http://ecdc.europa.eu/en/data-tools/atlas/Pages/atlas.aspx)

The Training and Test Sets each consist of 15 biological activity data sets in comma separated value (CSV) format. Each row of data corresponds to a chemical structure represented by molecular descriptors.

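[Merck Molecular Activity Challenge](http://www.kaggle.com/c/MerckActivity/data)

The Musk dataset describes molecules that occur in different conformations. Each molecule is either a musk or a non-musk, and one of its conformations determines this property.

[Musk dataset](https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2))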

A communal biometrics framework supporting the development of open algorithms and reproducible evaluations.

[Open Source Biometric Recognition](http://openbiometrics.org/)

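MIMIC-CXR is a large, publicly available database of chest X-rays from patients at a medical center between 2011 and 2016. It contains 371,920 chest X-rays associated with 227,943 imaging studies.

[MIMIC-CXR](https://www.physionet.org/physiobank/database/mimiccxr/)

[Spacecraft Pose Estimation Dataset (SPEED)](https://kelvins.esa.int/satellite-pose-estimation-challenge/home/)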


University public datasets

69 GB large-scale drone (campus) image dataset (Stanford)

CUHK Face Sketch Database (CUFS)

[Face sketch dataset (CUHK)](http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html)

The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

[Natural language inference (textual entailment annotation) dataset (NYU)](https://www.nyu.edu/projects/bowman/multinli/)

[Berkeley image segmentation dataset BSDS500 (Berkeley)](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html)

The Oxford-IIIT Pet Dataset

[Pet image (segmentation) dataset (Oxford)](http://www.robots.ox.ac.uk/~vgg/data/pets/)

MIT

[ADE20K dataset for scene perception, parsing, segmentation, and multi-object recognition (MIT)](https://groups.csail.mit.edu/vision/datasets/ADE20K/)

We introduce the Multimodal Dyadic Behavior (MMDB) dataset, a unique collection of multimodal (video, audio, and physiological) recordings of the social and communicative behavior of toddlers.

[Multimodal Dyadic Behavior dataset (GaTech)](http://www.cbi.gatech.edu/mmdb/)

The Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan, USA.

[ICPSR](https://www.icpsr.umich.edu/icpsrweb/ICPSR/)

The University of Maryland maintains the Global Terrorism Database, a dataset of 113,000 terrorist incidents. You can download it after filling out a form.

[Global Terrorism Database](https://www.start.umd.edu/gtd/contact/)


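Speech

2000 HUB5 English

An audiobook dataset containing text and speech. Nearly 500 hours of clean speech read by multiple speakers, organized by book chapter.

[LibriSpeech](http://www.openslr.org/12/)

A clean speech dataset of accented English, useful when you need robustness to different accents or intonations.

[VoxForge](http://www.voxforge.org/)

An English-only speech recognition dataset.

[TIMIT](https://catalog.ldc.upenn.edu/LDC93S1)

A noisy speech recognition challenge dataset. It contains real, simulated, and clean recordings: nearly 9,000 real recordings from 4 speakers in 4 noisy locations.

[CHIME](http://spandh.dcs.shef.ac.uk/chime_challenge/data.html)

A music recommendation dataset that includes the underlying social network data and other metadata useful for hybrid recommender systems.

[Last.fm](https://grouplens.org/datasets/hetrec-2011/)

Audio transcriptions of TED talks: 1,495 TED talk recordings along with their full-text transcripts.

[TED-LIUM](http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus)

Classical piano pieces.

[Piano-midi.de](http://www.piano-midi.de/)

More than 1,000 folk tunes.

[Nottingham](http://abc.sourceforge.net/NMD/)

An electronic library of classical music scores.

[MuseData](http://musedata.stanford.edu/)

Four-part harmonized chorales.

[JSB Chorales](http://www.jsbchorales.net/index.shtml)

An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

[Google Audioset](https://research.google.com/audioset/)

This repository collects the public sound library of Yellowstone National Park. The sounds include those of the natural environment and of animals. You can use them as background audio for work, study, or meditation.

[Yellowstone sound library](https://github.com/rosuH/YSL)

Voice is natural, voice is human. That is why we want to create usable voice technology for our machines. But building a voice system requires a very large amount of voice data. Most of the data held by large companies is not available to the public. We think that stifles innovation, so we launched the Common Voice project to help make voice recognition open and accessible to everyone.

[Mozilla Common Voice](https://voice.mozilla.org/zh-CN)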

The Voices Obscured in Complex Environmental Settings (VOiCES) corpus is a creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.

[VOiCES](https://voices18.github.io/)

More updates to come...

WeChat official account: 极市平台 (ID: extrememart)
The latest CV content, delivered every day
