Data for LLMs

Training an LLM requires a large amount of high-quality data. Even though many tech giants have opened up their high-performance LLMs (e.g., LLaMA, Mistral), high-quality data still remains private. Covering both Chinese and English datasets: RefinedWeb: 600B tokens. Dolma: open-sourced by AllenAI; contains 3T tokens and a toolkit with some key features: high performance, portability, built-in tagging, fast deduplication, extensibility, and cloud support. FineWeb: 15 trillion tokens of high-quality web data....
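As a quick illustration, here is a minimal sketch of streaming one of these corpora with the Hugging Face `datasets` library, so the multi-terabyte corpus is never downloaded in full. The repo ID `HuggingFaceFW/fineweb` and the `text` field are assumptions based on the public Hub listing; verify them before use.

```python
from datasets import load_dataset

# Stream FineWeb rather than downloading all ~15T tokens at once.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Peek at the first few documents; each record carries raw web text.
for i, doc in enumerate(fineweb):
    print(doc["text"][:200])
    if i == 2:
        break
```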

January 4, 2024 · 5 min · Loong

Continual Pretraining

Large language models (LLMs) have already demonstrated significant achievements, and many startups plan to train their own. However, training an LLM from scratch remains a major challenge, both in terms of machine cost and the difficulty of data collection. Against this background, continual pretraining on top of an open-source LLM is a compelling alternative. First, determine the purpose of your continually pretrained LLM: in general, standard LLMs may not excel in specific domains such as finance, law, or trade....
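For intuition, a minimal sketch of continual pretraining on top of an open checkpoint, assuming a Hugging Face-style causal-LM setup. The model name, corpus path, and hyperparameters below are placeholders for illustration, not the post's actual recipe.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from an open base model instead of training from scratch.
base = "mistralai/Mistral-7B-v0.1"  # placeholder; any open base model works
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation
model = AutoModelForCausalLM.from_pretrained(base)

# Domain corpus (e.g., finance or law text) as plain-text files.
corpus = load_dataset("text", data_files={"train": "domain_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continual-pt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,  # lower than from-scratch LRs to limit forgetting
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The key design choice is a small learning rate and a modest number of epochs: the goal is to adapt the base model to the new domain without overwriting its general capabilities.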

December 29, 2023 · 3 min · Loong