| 1 | +# Collection of Dataset URLs |
| 2 | + |
| 3 | +--- |
| 4 | + |
| 5 | +http://archive.ics.uci.edu/ml/index.php |
| 6 | +http://aws.amazon.com/publicdatasets/ |
| 7 | +http://www.kaggle.com/competitions |
| 8 | +http://www.kdnuggets.com/datasets/index.html |
| 9 | +https://mp.weixin.qq.com/s?__biz=MzI4ODU5NjQ3OQ==&mid=2247483972&idx=1&sn=c7f7bbb3312934468912705d74d7c07f&chksm=ec3d4ad4db4ac3c25b6dbf7ce002195d3086e075b4b8087a252735a71c370e78fce074e832ef&mpshare=1&scene=23&srcid=1212moFaKMWXuk3LOpfsACna#rd |
| 10 | +http://mp.weixin.qq.com/s/4eDan-7KNnwgVP0DT96gzQ |
| 11 | +http://mp.weixin.qq.com/s/tKc72xnqu4R4wkrVbK_bXA (mostly China-focused; includes tools) |
| 12 | +https://www.quandl.com/ |
| 13 | + |
| 14 | +http://archive.ics.uci.edu/ml/datasets.html |
| 15 | + |
| 16 | +## 20 GB Financial Industry Dataset |
| 17 | +http://mp.weixin.qq.com/s/_NS0UUDr84yq0rLg7jfr5g |
| 18 | + |
| 19 | +## Image Data |
| 20 | +http://labelme.csail.mit.edu/Release3.0/index.php?message=1 |
| 21 | +http://www.image-net.org/index |
| 22 | + |
| 23 | +## Andrew Ng Medical Data |
| 24 | +http://mp.weixin.qq.com/s/M3s3z3YnEBvUxpDVGFVKHw |
| 25 | + |
| 26 | +## Imagery Data |
| 27 | +http://www.91weitu.com/ |
| 28 | + |
| 29 | +## Weather |
| 30 | +http://172.16.14.141:9100/ |
| 31 | + |
| 32 | +## Web Crawler Tools |
| 33 | +https://www.oschina.net/p/beanbun |
| 34 | +https://mp.weixin.qq.com/s/5rtoVnhYcVZpuRszr88diQ |
| 35 | +https://gitee.com/xiyouMc/pornhubbot |
| 36 | +https://gitee.com/l-weiwei/spiderman |
| 37 | +https://gitee.com/flashsword20/webmagic |
| 38 | + |
| 39 | +## Classical Chinese Poetry |
| 40 | +https://github.com/chinese-poetry/chinese-poetry |
| 41 | + |
| 42 | +## Datasets |
| 43 | +Neural networks used for supervised learning are notoriously data hungry, which is why open datasets are such an important contribution to the research community. The following are a few datasets that stood out this year: |
| 44 | + |
| 45 | +- YouTube Bounding Boxes |
| 46 | +- Google QuickDraw Data |
| 47 | +- DeepMind Open Source Datasets |
| 48 | +- Google Speech Commands Dataset |
| 49 | +- Atomic Visual Actions |
| 50 | +- Several updates to the Open Images data set |
| 51 | +- NSynth dataset of annotated musical notes |
| 52 | +- Quora Question Pairs |
| 53 | + |
| 54 | + |
| 55 | +## Public Data Sets on Amazon Web Services (AWS) |
| 56 | +http://aws.amazon.com/datasets |
| 57 | +Since 2008, Amazon has provided developers with tens of terabytes of data for development. |
| 58 | + |
| 59 | +## Yahoo! Webscope |
| 60 | +http://webscope.sandbox.yahoo.com/index.php |
| 61 | + |
| 62 | +## KONECT: A Collection of Network Datasets |
| 63 | +http://konect.uni-koblenz.de/ |
| 64 | + |
| 65 | +## Stanford Large Network Dataset Collection |
| 66 | +http://snap.stanford.edu/data/index.html |
| 67 | + |
| 68 | +## Security-Related Datasets |
| 69 | +http://www.secrepo.com/ |
| 70 | + |
| 71 | + |
| 72 | +## Several Internet-Related Datasets |
| 73 | +1. Dataset for "Statistics and Social Network of YouTube Videos" |
| 74 | +http://netsg.cs.sfu.ca/youtubedata/ |
| 75 | + |
| 76 | +2. 1998 World Cup Web Site Access Logs |
| 77 | +http://ita.ee.lbl.gov/html/contrib/WorldCup.html |
| 78 | +This dataset covers the 1998 World Cup period. Over the 92 days from 1998/04/26 to 1998/07/26, the site received 1,352,804,107 requests. |
| 79 | + |
| 80 | +3. Page view statistics for Wikimedia projects |
| 81 | +http://dammit.lt/wikistats/ |
| 82 | + |
| 83 | +4. AOL Search Query Logs - RP |
| 84 | +http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs |
| 85 | + |
| 86 | +5. livedoor gourmet |
| 87 | +http://blog.livedoor.jp/techblog/archives/65836960.html |
| 88 | + |
| 89 | +## Large-Scale Image Datasets |
| 90 | +1. ImageNet |
| 91 | +http://www.image-net.org/ |
| 92 | +Contains 14 million images. |
| 93 | + |
| 94 | +2. Tiny Images Dataset |
| 95 | +http://horatio.cs.nyu.edu/mit/tiny/data/index.html |
| 96 | +Contains 80 million 32x32 images. |
| 97 | + |
| 98 | +3. MirFlickr1M |
| 99 | +http://press.liacs.nl/mirflickr/ |
| 100 | +A set of 1 million images from Flickr. |
| 101 | + |
| 102 | +4. CoPhIR |
| 103 | +http://cophir.isti.cnr.it/whatis.html |
| 104 | +A set of 106 million images from Flickr. |
| 105 | + |
| 106 | +5. SBU captioned photo dataset |
| 107 | +http://dsl1.cewit.stonybrook.edu/~vicente/sbucaptions/ |
| 108 | +A set of 1 million images from Flickr. |
| 109 | + |
| 110 | +6. Large-Scale Image Annotation using Visual Synset (ICCV 2011) |
| 111 | +http://cpl.cc.gatech.edu/projects/VisualSynset/ |
| 112 | +Contains 200 million images. |
| 113 | + |
| 114 | +7. NUS-WIDE |
| 115 | +http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm |
| 116 | +A set of 270,000 images from Flickr. |
| 117 | + |
| 118 | +8. SUN dataset |
| 119 | +http://people.csail.mit.edu/jxiao/SUN/ |
| 120 | +Contains 130,000 images. |
| 121 | + |
| 122 | +9. MSRA-MM |
| 123 | +http://research.microsoft.com/en-us/projects/msrammdata/ |
| 124 | +Contains 1 million images and 23,000 videos. |
| 125 | + |
| 126 | +10. TRECVID |
| 127 | +http://trecvid.nist.gov/ |
| 128 | + |
| 129 | +Stack Overflow Dump Files |
| 130 | +7.3G stackoverflow.com-Posts.7z |
| 131 | +573.1K stackoverflow.com-Tags.7z |
| 132 | +153.0M stackoverflow.com-Users.7z |
| 133 | +2.2G stackoverflow.com-Comments.7z |
| 134 | + |
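| 135 | +So far, no company or organization in China appears to have opened up its own datasets. Hopefully some companies will open their datasets to researchers as well, which would help advance large-scale data processing in China. |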
| 136 | + |
| 137 | +## 2014/07/07: Yahoo Releases a Huge Flickr Dataset of 100 Million Images and Videos |
| 138 | +http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for |
| 139 | + |
| 140 | +## 100+ Interesting Datasets |
| 141 | +http://www.csdn.net/article/2014-06-06/2820111-100-Interesting-Data-Sets-for-Statistics |
| 142 | + |
| 143 | + |
| 144 | + |
| 145 | + |