
Commit 84a3d7e

wuchengwei committed: update v3.0.0
1 parent ccac94b commit 84a3d7e

File tree

175 files changed: +10597 −24832 lines


.gitignore (+21)

@@ -0,0 +1,21 @@
+# Ignore the .idea directory
+.idea/
+
+# Build and Release Folders
+bin-debug/
+bin-release/
+[Oo]bj/
+[Bb]in/
+
+# Other files and folders
+.settings/
+
+# Executables
+*.swf
+*.air
+*.ipa
+*.apk
+
+# Project files, i.e. `.project`, `.actionScriptProperties` and `.flexProperties`
+# should NOT be excluded as they contain compiler settings and other important
+# information for Eclipse / Flash Builder.

FlagOpen.png

File mode changed: 100644 → 100755

LICENSE

File mode changed: 100644 → 100755

README.md

File mode changed: 100644 → 100755, +78 −62
@@ -2,7 +2,6 @@

![FlagData](flagdata_logo.png)
[![Pypi Package](https://img.shields.io/pypi/v/flagdata?label=pypi%20package)](https://pypi.org/project/flagdata/)
-[![Python Application](https://github.com/FlagOpen/FlagData/actions/workflows/python-app.yml/badge.svg)](https://github.com/FlagOpen/FlagData/actions/workflows/python-app.yml)
[![License](https://img.shields.io/github/license/FlagOpen/FlagData.svg?color=blue)](https://github.com/FlagOpen/FlagData/blob/main/LICENSE)
![GitHub release (release name instead of tag name)](https://img.shields.io/github/v/release/FlagOpen/FlagData?include_prereleases&style=social)

@@ -30,7 +29,7 @@ The complete pipeline process and features such as
![pipeline](pipeline.png)

## News
+- [June 13th, 2024] FlagData v3.0.0 update: supports multiple data types, dozens of operator pools for DIY, and one-click generation of high-quality data
- [Dec 31st, 2023] FlagData v2.0.0 has been upgraded
- [Jan 31st, 2023] FlagData v1.0.0 is online!

@@ -49,10 +48,29 @@ The complete pipeline process and features such as
- [Configuration](#Configuration)
- [Data cleaning](#Data-cleaning)
- [Data Quality assessment](#Data-Quality-assessment)
-- [Contact us](#Contact-us)
+- [Operator Pool](#Operator-Pool)
+- [Strong community support](#Strong-community-support)
+- [Users](#Users)
- [Reference project](#Reference-project)
- [License](#License)

+# V3.0.0 UPDATE
+With feedback from the community, FlagData has been upgraded. This update provides a set of foolproof language pre-training data construction tools. For each data type we provide one-click data quality improvement tasks, covering Html, Text, Book, Arxiv, Qa, etc., so both novice and advanced users can easily generate high-quality data.
+- Novice users: just confirm the data type to generate high-quality data.
+- Advanced users: we provide dozens of operator pools so users can DIY their own LLM pre-training data construction process.
+
+**Project Features:**
+
+- Ease of use: foolproof operation; simple configuration is all that is needed to generate high-quality data.
+- Flexibility: advanced users can customize the data construction process through various operator pools.
+- Diversity: supports multiple data types (HTML, Web, Wiki, Book, Paper, QA, Redpajama, Code).
+
+**Key highlights**
+
+- 🚀 Generate high-quality data with one click
+- 🔧 Dozens of operator pools for DIY
+- 🌐 Support for multiple data types
+
## Installation

- All the dependency packages of the FlagData project are listed in requirements.txt:
@@ -61,29 +79,13 @@ The complete pipeline process and features such as
pip install -r requirements.txt
```

-Optionally install the `cleaner` module required in FlagData. This installs only the dependency packages of the
-corresponding module, which is suitable for users who only want to use the `cleaner` module and do not want to install
-the dependency packages of other modules.
-
-```bash
-pip install flagdata[cleaner]
-```
-
**Install the latest version of the main branch**

The main branch is officially released by FlagData. If you want to install / update to the latest version of the main
branch, use the following command:

```
git clone https://github.com/FlagOpen/FlagData.git
-pip install .[all]
-```
-
-**Secondary development based on source code**
-
-```bash
-git clone https://github.com/FlagOpen/FlagData.git
-pip install -r requirements.txt
```

## Quick Start
@@ -102,7 +104,7 @@ different strategies. The strategies include:
answers. In order to increase the diversity of generated samples, excluding already generated
samples is supported.

-See [ReadMe under data_gen Module](flagdata/data_gen/README.md) for an example.
+See [Instructions for using the Data Enhancement Module](flagdata/data_gen/README.md) for an example.

### Data preparation phase

@@ -115,7 +117,7 @@ Title [Chapter Title]", "Address [E-mail]", "PageBreak", "Header [Header]", "Foot
UncategorizedText [arxiv vertical number]", "
Image, Formula, etc. Tool scripts provide two forms: keeping the full text and saving by category resolution.

-See [ReadMe under all2txt Module](flagdata/all2txt/README.md) for an example.
+See [Instructions for using the all2txt module](flagdata/all2txt/README.md) for an example.
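The two output forms described above (keeping the full text vs. saving by category resolution) can be sketched as follows. This is an illustration only: the `(category, text)` element format and the function names are hypothetical, not the actual all2txt API, though the category labels come from the text above.

```python
# Hypothetical sketch: elements are (category, text) pairs as produced by a
# document parser; categories like "Title", "NarrativeText", "Footer" mirror
# those listed in the README above.
from collections import defaultdict

def to_full_text(elements):
    """Form 1 - keep the full text: concatenate element texts in document order."""
    return "\n".join(text for _category, text in elements)

def by_category(elements):
    """Form 2 - save by category resolution: group element texts by category."""
    grouped = defaultdict(list)
    for category, text in elements:
        grouped[category].append(text)
    return dict(grouped)

elements = [
    ("Title", "A Sample Paper"),
    ("NarrativeText", "First paragraph."),
    ("Footer", "page 1"),
]
print(to_full_text(elements))
print(by_category(elements))
```

Saving by category makes it easy to drop boilerplate categories (headers, footers, page breaks) before the cleaning stage.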

### Data preprocessing phase

@@ -131,43 +133,33 @@ finally outputs a score of 0:1.
+ For general cleaning rules, if the score is greater than 0.5, the page is classified as a specific language; otherwise
the language of the page cannot be determined and the page is discarded.

-See [ReadMe under language_identification Module](flagdata/language_identification/README.md) for an example.
+See [Instructions for using the language identification module](flagdata/language_identification/README.md) for an example.
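The thresholding rule described above can be sketched in a few lines. This assumes a language identifier (such as a fastText lid model) that yields a language label and a confidence score in the range 0..1; the function name is illustrative, not the FlagData implementation.

```python
# Minimal sketch of the keep/discard decision: classify the page as `language`
# only when the identifier's score clears the threshold (0.5 per the text above);
# otherwise the page's language is undetermined and the page is discarded.
def classify_page(language: str, score: float, threshold: float = 0.5):
    """Return the language label if confident, or None to signal discard."""
    if score > threshold:
        return language
    return None  # language undetermined: discard the page

print(classify_page("en", 0.93))  # confident: page is kept as English
print(classify_page("en", 0.31))  # unconfident: page is discarded
```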
#### Data cleaning

-The cleaner module uses the multi-process pool mp.Pool to process data in parallel. It uses
-SharedMemoryManager to create shareable data structures so that multiple processes share data during processing.
-
-Efficient data cleaning is achieved through multi-processing and shared memory:
+We provide one-click data quality improvement tasks such as Html, Text, Book, Arxiv, Qa, etc. For more customized functions, users can refer to the "data_operator" section.
+
+##### TextCleaner
+TextCleaner provides a fast and extensible text data cleaning tool with commonly used text cleaning modules.
+Users only need to select the text_clean.yaml file in cleaner_builder.py to process text data.
+For details, see [Instructions for using TextCleaner](flagdata/cleaner/docs/Text_Cleaner.md)

-Currently, the following cleaning rules are included:
+##### ArxivCleaner
+ArxivCleaner provides a commonly used arxiv text data cleaning tool.
+Users only need to select the arxiv_clean.yaml file in cleaner_builder.py to process arxiv data.

-+ Emoticons and meaningless characters (regular expressions)
-+ Clean reprint and copyright notice information (Zhihu, CSDN, Jianshu, Cnblogs)
-+ Remove unreasonable consecutive punctuation marks; newline characters are unified as \n
-+ Remove personal privacy information such as mobile phone numbers and ID numbers, URLs, and extra spaces
-+ Remove irrelevant content at the beginning and end, and remove text whose length is less than n (currently n = 100)
-+ Convert simplified Chinese to traditional Chinese (opencc library)
+##### HtmlCleaner
+HtmlCleaner provides commonly used Html format text extraction and data cleaning tools.
+Users only need to run the main method to process Html data.

-It takes only two steps to use the data cleaning feature of FlagData:
-
-1. Modify the data path and format in the YAML configuration file. We give detailed comments on each parameter in the
-configuration file template to explain its meaning. At the same time, you can refer
-to the [Configuration](#Configuration) chapter.
-
-2. Specify the configuration file path in the following code and run it:
-```python
-from flagdata.cleaner.text_cleaner import DataCleaner
-if __name__ == "__main__":  # safe import of the main module under multi-processing
-    cleaner = DataCleaner("config.yaml")
-    cleaner.clean()
-```
+##### QaCleaner
+QaCleaner provides commonly used Qa format text extraction and data cleaning tools.
+Users only need to run the main method to process Qa data.
+For details, see [Instructions for using Qa](flagdata/cleaner/docs/Qa_Cleaner.md)

-The cleaned file will be saved in `jsonl` format to the path given by the `output` parameter in
-the configuration file.
-
-See [Tutorial 1: Clean the original text obtained from the Internet](/flagdata/cleaner/tutorial_01_cleaner.md) for an
-example.
+##### BookCleaner
+BookCleaner provides a common book format text extraction and data cleaning tool.
+Users only need to run the main method to process book data.
+For details, see [Instructions for using Book](flagdata/cleaner/docs/Book_Cleaner.md)

#### Quality assessment

@@ -182,7 +174,7 @@ This paper compares different text classification models, including logical regr
their performance. In the experiment, the BERTEval and FastText models perform well in text classification tasks, and the
FastText model performs best in terms of accuracy and recall. [Experimental results are from ChineseWebText]

-See [ReadMe under quality_assessment Module](flagdata/quality_assessment/README.md) for an example.
+See [Instructions for using the quality assessment module](flagdata/quality_assessment/README.md) for an example.

#### Data deduplication

@@ -196,6 +188,7 @@ to retain only those texts that are very similar, while discard those texts with
default value is 0.87. At the same time, we use the distributed computing power of Spark to handle large-scale data:
the MapReduce idea is used to remove duplicates, and Spark is tuned to process large-scale text data sets
efficiently.
+
The following is similar text iterated over during data deduplication; it has slight differences in line
wrapping and name editing, but the deduplication algorithm can identify the two paragraphs as highly similar.
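The near-duplicate test above can be illustrated with a self-contained MinHash estimate of Jaccard similarity. This is an illustration of the technique only, not FlagData's Spark-based implementation; the character-shingle size and hash scheme are assumptions made for the example.

```python
# MinHash sketch: near-duplicate texts share most shingles, so their signatures
# agree in most positions and the estimated Jaccard similarity is close to 1.0.
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _hash(seed: int, item: str) -> int:
    digest = hashlib.blake2b(f"{seed}|{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """One minimum per seeded hash function approximates a random permutation."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature positions estimates the Jaccard similarity."""
    sa, sb = minhash_signature(shingles(a)), minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

text1 = "The quick brown fox jumps over the lazy dog."
text2 = "The quick brown fox jumps over the lazy dog!"
# Near-duplicates score close to 1.0; a 0.87 threshold would flag this pair.
print(estimated_jaccard(text1, text2))
```

In the real pipeline the signature comparison is distributed with Spark, but the threshold logic is the same: pairs whose estimated similarity exceeds 0.87 are treated as duplicates.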
@@ -253,13 +246,13 @@ The analysis module provides the following functions:

+ length analysis of the text.

-See [ReadMe under analysis Module](flagdata/analysis/README.md) for an example.
+See [Instructions for using the analysis module](flagdata/analysis/README.md) for an example.

## Configuration

For the `data cleaning` and `data quality assessment` modules,
we provide configuration file
-templates: [cleaner_config.yaml](https://dorc.baai.ac.cn/resources/projects/FlagData/cleaner_config.yaml), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml)
+templates: [text_clean.yaml, arxiv_clean.yaml](flagData/cleaner/configs), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml)
The configuration files are in readable [YAML](https://yaml.org) format and provide detailed comments. Please make sure that
the parameters have been modified in the configuration file before using these modules.

@@ -268,10 +261,16 @@ Here are some important parameters you need to pay attention to:
### Data cleaning

```yaml
# Raw data to be cleaned
input: ./demo/demo_input.jsonl
# Save path of the cleaned data
output: ./demo/output.jsonl
+# Field to be processed
+source_key: text
+# Key in the output file under which results are saved
+result_key: cleanedContent
+# Pipeline class to select
+cleaner_class: ArxivCleaner
```
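A minimal sketch of how parameters like these drive a cleaning pass over a jsonl file: each record's `source_key` field is cleaned and the result is stored under `result_key`. This is an assumption-level illustration only; `demo_clean` is a hypothetical stand-in, and the real pipeline loads the YAML file and dispatches to the configured `cleaner_class`.

```python
# Sketch: apply a (hypothetical) cleaner to one jsonl record, driven by the
# same parameter names as the YAML config above. The config is inlined as a
# dict here; the real pipeline would load it from the YAML file.
import json

config = {
    "input": "./demo/demo_input.jsonl",
    "output": "./demo/output.jsonl",
    "source_key": "text",
    "result_key": "cleanedContent",
    "cleaner_class": "ArxivCleaner",
}

def demo_clean(text: str) -> str:
    """Hypothetical cleaner: collapse runs of whitespace and strip the ends."""
    return " ".join(text.split())

def clean_record(record: dict, cfg: dict) -> dict:
    """Clean cfg['source_key'] and store the result under cfg['result_key']."""
    record[cfg["result_key"]] = demo_clean(record[cfg["source_key"]])
    return record

line = '{"text": "  hello \\n world  "}'
print(json.dumps(clean_record(json.loads(line), config)))
```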
### Data Quality assessment
@@ -283,20 +282,37 @@ Here are some important parameters you need to pay attention to:
# The text_key field is the field being evaluated
text_key: "raw_content"
```
+## Operator Pool
+We provide some basic operators for data cleaning, filtering, format conversion, etc. to help users build their own data construction process.
+
+The operators provided are divided into three types: Formatter, Pruner, and Filter. Formatter is used to process structured data and can convert between different data formats; Pruner is used to clean text data; Filter is used for sample filtering.
+The figure below shows these operators at their different processing locations, together with a list of some of the operators.

-## Contact us
+<img src="pic/data_operator.png" width="50%" height="auto">

-If you have any questions about the use and code of this project, you can submit an issue. At the same time, you can
-contact us directly through [email protected].
+<img src="pic/some_operator.png" width="50%" height="auto">

-An active community is inseparable from your contribution. If you have a new idea, welcome to join our community; let us
-become part of open source and contribute our own efforts to open source together!
+For a detailed description, see [Instructions for using the data operator](flagdata/data_operator/Operator_ZH.md)
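The three operator types named above compose naturally into a pipeline: Formatter normalizes the record shape, Pruner cleans the text, and Filter decides whether the sample is kept. The sketch below illustrates that composition; the function names and record shapes are hypothetical, not the flagdata.data_operator API.

```python
# Illustrative operator pipeline: Formatter -> Pruner -> Filter.
def formatter(record: dict) -> dict:
    """Formatter: convert a raw record into a common {"text": ...} shape."""
    return {"text": record.get("content", "")}

def pruner(record: dict) -> dict:
    """Pruner: clean the text; here, just strip surrounding whitespace."""
    record["text"] = record["text"].strip()
    return record

def length_filter(record: dict, min_len: int = 5) -> bool:
    """Filter: keep only samples whose cleaned text is long enough."""
    return len(record["text"]) >= min_len

def run_pipeline(records):
    """Format and prune every record, then keep those passing the filter."""
    cleaned = (pruner(formatter(r)) for r in records)
    return [r for r in cleaned if length_filter(r)]

samples = [{"content": "  a short but valid sample  "}, {"content": " no "}]
print(run_pipeline(samples))
```

Because each stage has a single, narrow contract, operators can be swapped or reordered to DIY a custom data construction process.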

+## Strong community support
+### Community Support
+If you have any questions about the use and code of this project, you can submit an issue. You can also contact us directly via email at [email protected].
+
+An active community cannot be separated from your contribution. If you have a new idea, welcome to join our community; become part of open source with us and contribute to open source together!
<img src="contact_me.png" width="50%" height="auto">

-Or follow the Zhiyuan FlagOpen open source system, FlagOpen official website https://flagopen.baai.ac.cn/
+Or follow the FlagOpen open source system, FlagOpen official website https://flagopen.baai.ac.cn/
![contact_me](FlagOpen.png)

+### Questions and Feedback
+- Please report issues and make suggestions through GitHub Issues; we will respond quickly, within 24 hours.
+- You are also welcome to discuss actively in GitHub Discussions.
+- If it is inconvenient to use GitHub, everyone in the FlagData open source community can of course still speak freely; we will iterate on reasonable suggestions in the next version.
+
+We will invite experts in the field to hold regular online and offline exchanges to share the latest LLM research results.
+
+## Users
+
+<img src="pic/users.png" width="50%" height="auto">

## Reference project
Part of this project is referenced from the following code:
