datasets.txt 9.9 KB

  1. laion/OIG
  2. tatsu-lab/alpaca
  3. fka/awesome-chatgpt-prompts
  4. Anthropic/hh-rlhf
  5. sahil2801/CodeAlpaca-20k
  6. JosephusCheung/GuanacoDataset
  7. nomic-ai/gpt4all_prompt_generations
  8. https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
  9. # openai/summarize_from_feedback
  10. # openai/webgpt_comparisons
  11. # Simontwice/premise_selection_in_isabelle
  12. hoskinson-center/proofnet
  13. ehartford/oa_leet10k
  14. ehartford/leet10k-alpaca
  15. QingyiSi/Alpaca-CoT
  16. https://github.com/Nan-Do/LeetCodeContestsDataset
  17. Muennighoff/flan
  18. BAAI/COIG
  19. # JavaFXpert/gpt-math-techniques
  20. gsm8k
  21. MU-NLPC/Calc-aqua_rat
  22. MU-NLPC/Calc-gsm8k
  23. reasoning-machines/gsm-hard
  24. https://github.com/reasoning-machines/pal.git
  25. anon8231489123/ShareGPT_Vicuna_unfiltered
  26. # openai_humaneval
  27. https://raw.githubusercontent.com/openai/human-eval/master/data/HumanEval.jsonl.gz
  28. Muennighoff/flan
  29. HuggingFaceM4/COCO
  30. https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
  31. http://images.cocodataset.org/zips/train2014.zip
  32. https://github.com/ymcui/Chinese-LLaMA-Alpaca/raw/main/data/alpaca_data_zh_51k.json
  33. akoksal/LongForm
  34. liuhaotian/LLaVA-Instruct-150K
  35. liuhaotian/LLaVA-CC3M-Pretrain-595K
  36. fnlp/moss-002-sft-data
  37. databricks/databricks-dolly-15k
  38. # https://github.com/anthropics/evals
  39. Anthropic/model-written-evals
  40. https://github.com/microsoft/AGIEval
  41. https://github.com/EleutherAI/lm-evaluation-harness
  42. https://github.com/thisserand/alpaca-lora-finetune-language
  43. https://github.com/loubnabnl/bloom-code-evaluation
  44. yahma/alpaca-cleaned
  45. RyokoAI/ShareGPT52K
  46. YeungNLP/firefly-train-1.1M
  47. BelleGroup/train_2M_CN
  48. BelleGroup/multiturn_chat_0.8M
  49. BelleGroup/school_math_0.25M
  50. BelleGroup/generated_chat_0.4M
  51. https://github.com/yizhongw/self-instruct
  52. https://github.com/lamini-ai/lamini
  53. https://github.com/nelson-liu/evaluating-verifiability-in-generative-search-engines
  54. https://github.com/thu-coai/Safety-Prompts
  55. https://github.com/csitfun/LogiQA2.0
  56. https://github.com/nlpdata/c3
  57. https://github.com/nlpdata/dream
  58. https://github.com/terryyz/llm-code-eval
  59. ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered
  60. nomic-ai/gpt4all-j-prompt-generations
  61. bigcode/ta-prompt
  62. MBZUAI/LaMini-instruction
  63. OpenAssistant/oasst1
  64. junelee/wizard_vicuna_70k
  65. ehartford/wizard_vicuna_70k_unfiltered
  66. hoskinson-center/minif2f-lean4
  67. mosaicml/dolly_hhrlhf
  68. MBZUAI/Bactrian-X
  69. teknium/GPT4-LLM-Cleaned
  70. spdenisov/tatoeba
  71. spdenisov/udt_alpaca
  72. spdenisov/enwiktionary
  73. 0x22almostEvil/multilingual-wikihow-qa-16k
  74. 0x22almostEvil/reasoning-gsm-qna-oa
  75. 0x22almostEvil/reasoning_bg_oa
  76. 0x22almostEvil/tatoeba-mt-qna-oa
  77. clips/20Q
  78. allenai/soda
  79. victor123/evol_instruct_70k
  80. cryscan/multilingual-share
  81. stanfordnlp/SHP
  82. wangrui6/Zhihu-KOL
  83. liyucheng/zhihu_rlhf_3k
  84. liyucheng/zhihu_26k
  85. PKU-Alignment/PKU-SafeRLHF-10K
  86. allenai/prosocial-dialog
  87. roneneldan/TinyStories
  88. sambanovasystems/xOA22
  89. sambanovasystems/x-self-instruct-seed-32
  90. IlyaGusev/gpt_roleplay_realm
  91. iamketan25/roleplay-instructions-dataset
  92. huggingface-tools/default-endpoints
  93. ashiyakatuka11/empathetic_dialogues_context
  94. Abirate/english_quotes
  95. openaccess-ai-collective/oasst1-guanaco-extended
  96. winddude/reddit_finance_43_250k
  97. https://github.com/lupantech/PromptPG
  98. openllmplayground/pandagpt_visual_instruction_dataset
  99. timdettmers/openassistant-guanaco
  100. ehartford/samantha-data
  101. ehartford/based
  102. renumics/cifar100-enriched
  103. kaiokendev/SuperCOT-dataset
  104. tiedong/goat
  105. nomic-ai/gpt4all_prompt_generations_with_p3
  106. P1ayer-1/chatgpt-conversations-chatlogs.net
  107. achang/plot_qa
  108. yankscally/midiset
  109. sileod/mindgames
  110. spdenisov/wsd_semcor
  111. code_x_glue_ct_code_to_text
  112. GaussianMixture/oasst_alpaca_sharegpt_dataset
  113. shibing624/medical
  114. BelleGroup/train_3.5M_CN
  115. shibing624/alpaca-zh
  116. Chinese-Vicuna/guanaco_belle_merge_v1.0
  117. FreedomIntelligence/HuatuoGPT-sft-data-v1
  118. philschmid/sharegpt-raw
  119. Hello-SimpleAI/HC3
  120. teknium/GPTeacher-General-Instruct
  121. metaeval/ScienceQA_text_only
  122. hellaswag
  123. riddle_sense
  124. camel-ai/math
  125. camel-ai/biology
  126. camel-ai/physics
  127. camel-ai/chemistry
  128. winglian/evals
  129. ewof/code-alpaca-instruct-unfiltered
  130. ewof/code-alpaca-instruct-unfiltered
  131. 64bits/lex_fridman_podcast_for_llm_vicuna
  132. https://github.com/GJBroughton/Star_Trek_Scripts
  133. https://github.com/shibing624/MedicalGPT
  134. tasksource/oasst1_pairwise_rlhf_reward
  135. Dahoas/full-hh-rlhf
  136. Dahoas/static-hh
  137. Dahoas/rm-static
  138. liswei/rm-static-zhTW
  139. yitingxie/rlhf-reward-datasets
  140. # flan/v2/*data
  141. https://github.com/google-research/FLAN
  142. conceptofmind/flan2021_submix_original
  143. conceptofmind/t0_submix_original
  144. conceptofmind/niv2_submix_original
  145. conceptofmind/cot_submix_original
  146. conceptofmind/dialog_submix_original
  147. https://github.com/google-research-datasets/Taskmaster
  148. https://github.com/OFA-Sys/ExpertLLaMA
  149. https://github.com/ziliwangnlp/RefGPT
  150. Mutonix/RefGPT-Fact
  151. Mutonix/RefGPT-Code-ds
  152. Mutonix/RefGPT-Code-cr
  153. Mutonix/RefGPT-Code-bg
  154. PocketDoc/Alpaca_Evol_Instruct_Cleaned
  155. GAIR/lima
  156. PocketDoc/DansPileOfSets
  157. WizardLM/WizardLM_evol_instruct_V2_196k
  158. Alignment-Lab-AI/AILabAssistant
  159. tasksource/tasksource-instruct-v0
  160. tasksource/zero-shot-label-nli
  161. tasksource/icl-symbol-tuning-instruct
  162. vietgpt/OIG_mathqa_flanv2_en
  163. neural_code_search
  164. neulab/conala
  165. reshinthadith/synthetic_program_synthesis_python_1M
  166. fiveflow/cot_ranking
  167. squad_adversarial
  168. bigscience-data/roots_zh-cn_wikipedia
  169. knkarthick/dialogsum
  170. jondurbin/rosettacode-raw
  171. winddude/IHOPv01
  172. jondurbin/rosettacode-10
  173. Norquinal/claude_multi_instruct_1k
  174. Open-Orca/OpenOrca
  175. ehartford/dolphin
  176. CheshireAI/guanaco-unchained
  177. Salesforce/dialogstudio
  178. theblackcat102/evol-codealpaca-v1
  179. nickrosh/Evol-Instruct-Code-80k-v1
  180. P1ayer-1/books-3
  181. P1ayer-1/college_textbooks
  182. P1ayer-1/books-3-textbooks
  183. P1ayer-1/crash_course_subs
  184. P1ayer-1/stack-exchange-preferences-code-v2
  185. goendalf666/sql-chat-instructions
  186. wenhu/TheoremQA
  187. lukaemon/bbh
  188. dmayhem93/agieval-lsat-ar
  189. lmsys/chatbot_arena_conversations
  190. shahules786/orca-chat
  191. CarperAI/openai_summarize_comparisons
  192. declare-lab/InstructEvalImpact
  193. nampdn-ai/tiny-codes
  194. declare-lab/flan-mini
  195. allenai/peS2o
  196. LinkSoul/instruction_merge_set
  197. jondurbin/airoboros-gpt4-m2.0
  198. rombodawg/MegaCodeTraining200k
  199. causalnlp/corr2cause
  200. ehartford/WizardLM_evol_instruct_V2_196k_unfiltered_merged_split
  201. ehartford/open-instruct-uncensored
  202. stingning/ultrachat
  203. # 14 GB
  204. ArmelR/stack-exchange-instruction
  205. https://github.com/HKUNLP/DS-1000
  206. https://github.com/project-baize/baize-chatbot
  207. https://github.com/leanprover/lean4-samples
  208. https://github.com/microsoft/promptbench
  209. https://github.com/llm-attacks/llm-attacks
  210. https://github.com/lz1oceani/verify_cot
  211. https://github.com/Lichang-Chen/InstructZero
  212. https://github.com/salesforce/factualNLG
  213. # huge extra downloads, TODO subreddit
  214. https://github.com/CornellNLP/ConvoKit
  215. # huge extra downloads and mostly dup
  216. https://github.com/allenai/open-instruct
  217. # has extra download
  218. https://github.com/wellecks/naturalproofs
  219. https://github.com/brightmart/nlp_chinese_corpus
  220. https://github.com/FranxYao/chain-of-thought-hub
  221. https://github.com/Troyanovsky/Local-LLM-comparison
  222. # https://people.eecs.berkeley.edu/~hendrycks/MATH.tar
  223. # https://drive.google.com/open?id=1hQsua3TkpEmcJD_UWQx8dmNdEZPyxw23&authuser=0
  224. https://github.com/hendrycks/math/
  225. # https://s3.amazonaws.com/datasets.huggingface.co/scientific_papers/1.1.1/arxiv-dataset.zip
  226. scientific_papers
  227. # for now:
  228. # https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/en.zip
  229. open_subtitles
  230. # also https://cloud.tsinghua.edu.cn/d/a5a16f2381e7439eb475/
  231. https://github.com/thu-coai/LOT-LongLM
  232. https://github.com/StevenGrove/GPT4Tools
  233. https://github.com/HKUNLP/UnifiedSKG
  234. https://github.com/RUCAIBox/StructGPT
  235. # download_raw_data.sh + download_preprocessed_data.sh
  236. https://github.com/michiyasunaga/DrRepair
  237. # https://cloud.google.com/sdk/docs/install-sdk#linux
  238. # gsutil -m cp -r gs://dm-code_contests local_location
  239. https://github.com/deepmind/code_contests
  240. # use gdown to download the dataset (~4G) and notebooks (~8G)
  241. https://github.com/rajasagashe/JuICe
  242. # https://github.com/InternLM/opencompass/releases/download/0.1.1/OpenCompassData.zip
  243. https://github.com/InternLM/opencompass
  244. https://github.com/lean-dojo/ReProver
  245. # Lean 3: https://zenodo.org/record/8016386 by https://github.com/lean-dojo/LeanDojo/blob/main/scripts/generate-benchmark-lean3.ipynb
  246. # Lean 4: https://zenodo.org/record/8040110 by https://github.com/lean-dojo/LeanDojo/blob/main/scripts/generate-benchmark-lean4.ipynb
  247. https://github.com/lean-dojo/LeanDojo
  248. # should have filtered
  249. Helsinki-NLP/tatoeba_mt
  250. # partial download
  251. allenai/objaverse
  252. # huge 'data/*'
  253. tiiuae/falcon-refinedweb
  254. # TODO: partial download
  255. NTU-NLP-sg/xCodeEval
  256. # huge download
  257. # '*'
  258. RyokoAI/CNNovel125K
  259. RyokoAI/Fandom23K
  260. RyokoAI/ScribbleHub17K
  261. RyokoAI/Honeyfeed3600
  262. # '*'
  263. roneneldan/TinyStories
  264. # selective download
  265. # 'data/agda/*' 'data/coq/*' 'data/c2hs-haskell/*' 'data/f-sharp/*' 'data/idris/*' 'data/isabelle/*' 'data/julia/*' 'data/kotlin/*' 'data/lean/*' 'data/literate-agda/*' 'data/literate-haskell/*' 'data/markdown/*' 'data/mathematica/*' 'data/prolog/*' 'data/restructuredtext/*' 'data/rust/*' 'data/sage/*' 'data/tex/*'
  266. bigcode/the-stack
  267. # wiki.jsonl book.jsonl filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl arxiv_*.jsonl
  268. togethercomputer/RedPajama-Data-1T
  269. # 'preprocessed/adult/*' 'preprocessed/chain_of_thought/*' 'preprocessed/conversation/*' 'preprocessed/instruct/*' 'preprocessed/knowledge/*' 'preprocessed/rlhf/*' 'preprocessed/summarisation/*' 'preprocessed/system/*'
  270. m8than/normalised_chatml_rwkvready
  271. #
  272. tiiuae/falcon-refinedweb
  273. # TODO: selective download
  274. # 'en/*' 'zh/*'
  275. bigscience/xP3
  276. # TODO: need extra download
  277. allenai/lila
  278. https://github.com/fighting41love/funNLP
  279. https://github.com/X-PLUG/mPLUG-Owl
  280. # TODO: too big, too foundational, need to filter and dedup
  281. # ~3G
  282. # SirNeural/flan_v2
  283. # ~85G
  284. # conceptofmind/flan_dialog_submix
  285. # 7GB
  286. # ccdv/arxiv-summarization
  287. # 311GB
  288. # bigcode/starcoderdata
  289. # ~90GB
  290. # MMInstruction/M3IT
  291. # 204GB
  292. # MMInstruction/M3IT-80
  293. # https://huggingface.co/datasets/allenai/c4/tree/mC4_3.1.0/multilingual
  294. # allenai/c4
  295. # allenai/nllb
  296. # oscar-corpus/OSCAR-2201
  297. # sil-ai/bloom-lm
  298. # EleutherAI/the_pile
  299. # EleutherAI/the_pile_deduplicated
  300. # bigscience/bloomz
  301. # bigscience/evaluation-results
  302. # c4
  303. # wikipedia
  304. # the_pile_books3
  305. # pg19
  306. # TODO
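
The list above mixes three kinds of lines: bare dataset IDs (for example laion/OIG or gsm8k), plain URLs, and '#' lines that are either commented-out entries or free-form notes such as sizes and glob patterns. A minimal sketch of how such a file could be read is shown below. The names Entry and parse_line are hypothetical and not part of the original file, and the rule "a commented line counts as a disabled entry only if it is a URL or contains a slash" is an assumption made to keep notes like "# 7GB" out of the result. That rule will miss bare commented names such as "# c4", and it will treat any commented URL as an entry even when it is really a setup note.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    kind: str      # "hub" for a bare dataset ID, "url" for a link
    target: str    # the dataset ID or URL as written in the file
    enabled: bool  # False when the line is commented out with '#'


# Optional "12. " style prefix left over from the file viewer.
_NUMBER_PREFIX = re.compile(r"^\s*\d+\.\s*")
# "owner/name" or a bare "name"; letters, digits, '_', '.', '-' only.
_HUB_ID = re.compile(r"^[\w.\-]+(/[\w.\-]+)?$")


def parse_line(line: str) -> Optional[Entry]:
    """Classify one line of the list; return None for notes and blanks."""
    text = _NUMBER_PREFIX.sub("", line).strip()
    enabled = not text.startswith("#")
    if not enabled:
        text = text.lstrip("#").strip()
    if text.startswith(("http://", "https://")):
        return Entry("url", text, enabled)
    if _HUB_ID.match(text):
        # Assumption: a commented bare word ("TODO", "7GB") is a note,
        # so disabled entries must look like "owner/name".
        if enabled or "/" in text:
            return Entry("hub", text, enabled)
    return None


if __name__ == "__main__":
    sample = [
        "  1. laion/OIG",
        "  9. # openai/summarize_from_feedback",
        "  16. https://github.com/Nan-Do/LeetCodeContestsDataset",
        "  285. # 7GB",
    ]
    for raw in sample:
        print(parse_line(raw))
```

Glob notes such as 'data/rust/*' are returned as None here; pairing each pattern line with the entry that follows it would need an extra pass and is left out of this sketch.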