generated from cloudwego/.github
feat(parser): add csv document parser #532
Open: CXeon wants to merge 35 commits into cloudwego:main from CXeon:feat/document-csv-parser
Changes from 16 commits
c30a72a
feat(parser): add CSV file parser implementation of the document parser interface
CXeon 3326fe1
Merge branch 'main' into feat/document-csv-parser
CXeon 6a52e2b
Merge branch 'main' into feat/document-csv-parser
CXeon d0fb93d
Merge branch 'main' into feat/document-csv-parser
CXeon 4f45815
Merge branch 'main' into feat/document-csv-parser
CXeon b7fd466
Merge branch 'main' into feat/document-csv-parser
hi-pender 504a748
Merge branch 'main' into feat/document-csv-parser
hi-pender 98eafd8
fix(csv_parser):resolved an incorrect package import for context and …
CXeon 2c978a1
Merge branch 'cloudwego:main' into feat/document-csv-parser
CXeon 163e5b5
Merge branch 'feat/document-csv-parser' of github.com:CXeon/eino-ext …
CXeon fb1f058
Merge branch 'main' into feat/document-csv-parser
CXeon ff9fc9d
Merge branch 'main' into feat/document-csv-parser
CXeon ebc9940
Merge branch 'main' into feat/document-csv-parser
CXeon 1f0a7da
Merge branch 'main' into feat/document-csv-parser
CXeon 31f9be3
Merge branch 'main' into feat/document-csv-parser
hi-pender 6b637ff
Merge branch 'main' into feat/document-csv-parser
hi-pender 14584e2
feat(parser): ddd copyright and license information
hi-pender b032f30
feat(parser): add copyright and license to csv_parser_test.go
hi-pender 7600202
Merge branch 'main' of github.com:CXeon/eino-ext into feat/document-c…
CXeon 20757f1
feat(csv_parser): add README.md file and examples directory.
CXeon 5e5da67
Merge branch 'feat/document-csv-parser' of github.com:CXeon/eino-ext …
CXeon eb0f369
style(readme): Removed the description of the certificate.
CXeon 00d0c71
style(readme): Add LICENSE.
CXeon 34d5ae4
style(readme): Add license header in main.
CXeon 13c697d
Merge branch 'main' into feat/document-csv-parser
CXeon fac519c
Merge branch 'main' into feat/document-csv-parser
CXeon ef1824b
Merge branch 'main' into feat/document-csv-parser
CXeon 13f863c
Merge branch 'main' into feat/document-csv-parser
CXeon 31c3008
Merge branch 'main' into feat/document-csv-parser
CXeon bf7253c
Merge branch 'main' into feat/document-csv-parser
CXeon 5bc4b76
Merge branch 'main' into feat/document-csv-parser
CXeon de4cd51
Merge branch 'main' into feat/document-csv-parser
CXeon ffde69f
Merge branch 'main' into feat/document-csv-parser
CXeon 9c285ba
Merge branch 'main' into feat/document-csv-parser
CXeon 4b716c7
Merge branch 'main' into feat/document-csv-parser
CXeon
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| package csv | ||
|
|
||
| import ( | ||
| "encoding/csv" | ||
| "fmt" | ||
| "io" | ||
| "strings" | ||
|
|
||
| "context" | ||
|
|
||
| "github.com/cloudwego/eino/components/document/parser" | ||
| "github.com/cloudwego/eino/schema" | ||
| ) | ||
|
|
||
| const ( | ||
| MetaDataRow = "_row" | ||
| MetaDataExt = "_ext" | ||
| ) | ||
|
|
||
| // CSVParser parses CSV content from io.Reader. | ||
| type CSVParser struct { | ||
| Config *Config | ||
| } | ||
|
|
||
| // Config configures CSVParser. | ||
| type Config struct { | ||
| // NoHeader defaults to false, meaning the first row is treated as the table header. | ||
| NoHeader bool | ||
| // IDPrefix is prepended to each document ID. When empty, the ID is the bare row index (1, 2, 3, ... when a header row is present). | ||
| IDPrefix string | ||
| // Comma is the field delimiter. Defaults to ','. | ||
| Comma rune | ||
| // Comment is the comment character; lines starting with it are meant to be skipped. Defaults to '#'. | ||
| Comment rune | ||
| } | ||
|
|
||
| // NewCSVParser creates a new CSVParser | ||
| func NewCSVParser(ctx context.Context, config *Config) (cp *CSVParser, err error) { | ||
| if config == nil { | ||
| config = &Config{} | ||
| } | ||
| if config.Comma == 0 { | ||
| config.Comma = rune(',') | ||
| } | ||
| if config.Comment == 0 { | ||
| config.Comment = rune('#') | ||
| } | ||
|
|
||
| cp = &CSVParser{Config: config} | ||
| return cp, nil | ||
| } | ||
|
|
||
| // generateID generates document ID based on configuration | ||
| func (cp *CSVParser) generateID(i int) string { | ||
| if cp.Config.IDPrefix == "" { | ||
| return fmt.Sprintf("%d", i) | ||
| } | ||
| return fmt.Sprintf("%s%d", cp.Config.IDPrefix, i) | ||
| } | ||
|
|
||
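| // buildRowMetaData maps each header to the cell in the same column. It returns an empty map when NoHeader is set. | ||
| func (cp *CSVParser) buildRowMetaData(row []string, headers []string) map[string]any { | ||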
| metaData := make(map[string]any) | ||
| if !cp.Config.NoHeader { | ||
| for j, header := range headers { | ||
| if j < len(row) { | ||
| metaData[header] = row[j] | ||
| } | ||
| } | ||
| } | ||
| return metaData | ||
| } | ||
|
|
||
| // Parse reads all CSV rows from reader and returns one schema.Document per data row. | ||
| func (cp *CSVParser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) ([]*schema.Document, error) { | ||
| option := parser.GetCommonOptions(&parser.Options{}, opts...) | ||
|
|
||
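| csvFile := csv.NewReader(reader) | ||
| // Apply the configured delimiter and comment character; without this the reader always uses ',' and no comment character. | ||
| csvFile.Comma = cp.Config.Comma | ||
| csvFile.Comment = cp.Config.Comment | ||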
|
|
||
| // get all rows | ||
| rows, err := csvFile.ReadAll() | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| if len(rows) == 0 { | ||
| return nil, nil | ||
| } | ||
|
|
||
| var ret []*schema.Document | ||
|
|
||
| // Process the header | ||
| startIdx := 0 | ||
| var headers []string | ||
| if !cp.Config.NoHeader && len(rows) > 0 { | ||
| headers = rows[0] | ||
| startIdx = 1 | ||
| } | ||
|
|
||
| // Process rows of data | ||
| for i := startIdx; i < len(rows); i++ { | ||
| row := rows[i] | ||
| if len(row) == 0 { | ||
| continue | ||
| } | ||
| // Trim whitespace from each cell, then join the cells with the configured delimiter | ||
| contentParts := make([]string, len(row)) | ||
| for j, cell := range row { | ||
| contentParts[j] = strings.TrimSpace(cell) | ||
| } | ||
| content := strings.Join(contentParts, string(cp.Config.Comma)) | ||
|
|
||
| meta := make(map[string]any) | ||
|
|
||
| // Build the row's Meta | ||
| rowMeta := cp.buildRowMetaData(row, headers) | ||
| meta[MetaDataRow] = rowMeta | ||
|
|
||
| // Attach the caller-supplied ExtraMeta, if any | ||
| if option.ExtraMeta != nil { | ||
| meta[MetaDataExt] = option.ExtraMeta | ||
| } | ||
|
|
||
| // Create New Document | ||
| nDoc := &schema.Document{ | ||
| ID: cp.generateID(i), | ||
| Content: content, | ||
| MetaData: meta, | ||
| } | ||
|
|
||
| ret = append(ret, nDoc) | ||
| } | ||
|
|
||
| return ret, nil | ||
| } | ||
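
The header-to-metadata mapping and the ID numbering in the parser above can be sketched in isolation. This is a minimal stand-alone illustration using only the standard library, not the PR's code: `rowMeta` and `docIDs` are hypothetical helper names that mirror `buildRowMetaData` and `generateID`. It shows that IDs are taken from the row index in the input, so they start at 1 when a header row is present and at 0 when `NoHeader` is set.

```go
package main

import (
	"fmt"
)

// rowMeta pairs each header with the cell in the same column.
// Headers without a matching cell are skipped; extra cells are ignored.
// (Hypothetical helper, modeled on buildRowMetaData.)
func rowMeta(headers, row []string) map[string]any {
	meta := make(map[string]any, len(headers))
	for j, h := range headers {
		if j < len(row) {
			meta[h] = row[j]
		}
	}
	return meta
}

// docIDs returns the IDs that would be assigned to the data rows.
// The numeric part is the row's index in the input, so the first
// data row is 1 with a header row and 0 without one.
// (Hypothetical helper, modeled on generateID and the Parse loop.)
func docIDs(rows [][]string, noHeader bool, prefix string) []string {
	start := 1
	if noHeader {
		start = 0
	}
	ids := make([]string, 0, len(rows))
	for i := start; i < len(rows); i++ {
		ids = append(ids, fmt.Sprintf("%s%d", prefix, i))
	}
	return ids
}

func main() {
	rows := [][]string{{"name", "age"}, {"alice", "30"}, {"bob", "25"}}
	fmt.Println(rowMeta(rows[0], rows[1]))   // map[age:30 name:alice]
	fmt.Println(docIDs(rows, false, "doc-")) // [doc-1 doc-2]
	fmt.Println(docIDs(rows, true, ""))      // [0 1 2]
}
```

Callers that need IDs to start at 1 regardless of the header setting would have to offset the index themselves; whether that is desirable is a design question for the PR.
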
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| package csv | ||
|
|
||
| import ( | ||
| "context" | ||
| "os" | ||
| "testing" | ||
|
|
||
| "github.com/cloudwego/eino/components/document/parser" | ||
| ) | ||
|
|
||
| func TestCSVParser(t *testing.T) { | ||
| f, err := os.Open("./test.csv") | ||
| if err != nil { | ||
| t.Error(err) | ||
| return | ||
| } | ||
| defer f.Close() | ||
|
|
||
| ctx := context.Background() | ||
| cp, err := NewCSVParser(ctx, &Config{}) | ||
| if err != nil { | ||
| t.Error(err) | ||
| return | ||
| } | ||
|
|
||
| docs, err := cp.Parse(ctx, f, parser.WithURI("local"), parser.WithExtraMeta(map[string]any{ | ||
| "_extension": ".csv", | ||
| "_file_name": "test.csv", | ||
| "_source": "local", | ||
| })) | ||
|
|
||
| if err != nil { | ||
| t.Error(err) | ||
| return | ||
| } | ||
| t.Log(docs) | ||
| } | ||
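
The test above depends on a `./test.csv` fixture that is not shown here. As a rough sketch, the delimiter and comment settings exposed by `Config` can be exercised against in-memory input using only `encoding/csv`. `readRecords` is a hypothetical helper introduced for illustration; the zero-value fallbacks mirror the defaults set in `NewCSVParser`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readRecords reads all records from input using the given delimiter and
// comment rune. A zero rune falls back to ',' or '#' respectively.
// (Hypothetical helper; not part of the PR.)
func readRecords(input string, comma, comment rune) ([][]string, error) {
	if comma == 0 {
		comma = ','
	}
	if comment == 0 {
		comment = '#'
	}
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = comma     // field delimiter
	r.Comment = comment // lines starting with this rune are skipped
	return r.ReadAll()
}

func main() {
	input := "name;age\n# skipped comment line\nalice;30\nbob;25\n"
	records, err := readRecords(input, ';', 0)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(len(records), records[1]) // 3 [alice 30]
}
```

Feeding the parser a `strings.NewReader` in the same way would let the test assert on document count, IDs, and metadata without a file on disk.
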
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| module github.com/cloudwego/eino-ext/components/document/parser/csv | ||
|
|
||
| go 1.24.0 | ||
|
|
||
| toolchain go1.24.4 | ||
|
|
||
| require ( | ||
| github.com/cloudwego/eino v0.5.13 | ||
| golang.org/x/net v0.47.0 | ||
| ) | ||
|
|
||
| require ( | ||
| github.com/bahlo/generic-list-go v0.2.0 // indirect | ||
| github.com/buger/jsonparser v1.1.1 // indirect | ||
| github.com/bytedance/gopkg v0.1.3 // indirect | ||
| github.com/bytedance/sonic v1.14.1 // indirect | ||
| github.com/bytedance/sonic/loader v0.3.0 // indirect | ||
| github.com/cloudwego/base64x v0.1.6 // indirect | ||
| github.com/dustin/go-humanize v1.0.1 // indirect | ||
| github.com/eino-contrib/jsonschema v1.0.2 // indirect | ||
| github.com/getkin/kin-openapi v0.118.0 // indirect | ||
| github.com/go-openapi/jsonpointer v0.19.5 // indirect | ||
| github.com/go-openapi/swag v0.19.5 // indirect | ||
| github.com/goph/emperror v0.17.2 // indirect | ||
| github.com/invopop/yaml v0.1.0 // indirect | ||
| github.com/josharian/intern v1.0.0 // indirect | ||
| github.com/json-iterator/go v1.1.12 // indirect | ||
| github.com/klauspost/cpuid/v2 v2.2.9 // indirect | ||
| github.com/mailru/easyjson v0.7.7 // indirect | ||
| github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect | ||
| github.com/modern-go/reflect2 v1.0.2 // indirect | ||
| github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826 // indirect | ||
| github.com/nikolalohinski/gonja v1.5.3 // indirect | ||
| github.com/pelletier/go-toml/v2 v2.0.9 // indirect | ||
| github.com/perimeterx/marshmallow v1.1.4 // indirect | ||
| github.com/pkg/errors v0.9.1 // indirect | ||
| github.com/sirupsen/logrus v1.9.3 // indirect | ||
| github.com/slongfield/pyfmt v0.0.0-20220222012616-ea85ff4c361f // indirect | ||
| github.com/twitchyliquid64/golang-asm v0.15.1 // indirect | ||
| github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect | ||
| github.com/yargevad/filepathx v1.0.0 // indirect | ||
| golang.org/x/arch v0.11.0 // indirect | ||
| golang.org/x/exp v0.0.0-20230713183714-613f0c0eb8a1 // indirect | ||
| golang.org/x/sys v0.38.0 // indirect | ||
| gopkg.in/yaml.v2 v2.4.0 // indirect | ||
| gopkg.in/yaml.v3 v3.0.1 // indirect | ||
| ) | ||