Commit 6d22b60

Adding Programmability doc. (#130)
And expose/export fileformat specific readers for non-omniparser usage.
1 parent f95defc commit 6d22b60

29 files changed

+452
-175
lines changed

README.md

+1-2
Original file line number | Diff line number | Diff line change
@@ -22,8 +22,7 @@ is used, specially the all mighty `javascript` (and `javascript_with_context`).
2222
- [CSV Schema in Depth](./doc/csv_in_depth.md): everything about schemas for CSV input.
2323
- [Fixed-Length Schema in Depth](./doc/fixedlength_in_depth.md): everything about schemas for fixed-length (e.g. TXT)
2424
input
25-
- [JSON Schema in Depth](./doc/json_in_depth.md): everything about schemas for JSON input.
26-
- [XML Schema in Depth](./doc/xml_in_depth.md): everything about schemas for XML input.
25+
- [JSON/XML Schema in Depth](./doc/json_xml_in_depth.md): everything about schemas for JSON or XML input.
2726
- [EDI Schema in Depth](./doc/edi_in_depth.md): everything about schemas for EDI input.
2827
- [Programmability](./doc/programmability.md): Advanced techniques for using omniparser (or some of its components) in
2928
your code.

doc/gettingstarted.md

+2
@@ -714,6 +714,7 @@ for {
714714
if err == io.EOF {
715715
break
716716
}
717+
if err != nil { ... }
717718
// output contains a []byte of the ingested and transformed record.
718719
}
719720
```
@@ -800,6 +801,7 @@ for {
800801
if err == io.EOF {
801802
break
802803
}
804+
if err != nil { ... }
803805
// output contains a []byte of the ingested and transformed record.
804806
}
805807
```

doc/json_in_depth.md

Whitespace-only changes.

doc/json_xml_in_depth.md

+5
@@ -0,0 +1,5 @@
1+
# JSON/XML Schema in "Depth" :blush:
2+
3+
Omniparser schemas for JSON and XML inputs contain only two parts, `parser_settings` and
4+
`transform_declarations`, both of which we have covered in depth [here](./gettingstarted.md) and
5+
[here](./transforms.md).

doc/programmability.md

+254
@@ -0,0 +1,254 @@
1+
* [Programmability of Omniparser](#programmability-of-omniparser)
2+
* [Out\-of\-Box Basic Use Case](#out-of-box-basic-use-case)
3+
* [Add A New custom\_func](#add-a-new-custom_func)
4+
* [Add A New custom\_parse](#add-a-new-custom_parse)
5+
* [Add A New File Format](#add-a-new-file-format)
6+
* [Add A New Schema Handler](#add-a-new-schema-handler)
7+
* [Put All Together](#put-all-together)
8+
* [In Non\-Golang Environment](#in-non-golang-environment)
9+
* [Programmability of Some Components without Omniparser](#programmability-of-some-components-without-omniparser)
10+
* [Functions](#functions)
11+
* [IDR](#idr)
12+
* [CSV Reader](#csv-reader)
13+
* [Fixed\-Length Reader](#fixed-length-reader)
14+
* [EDI Reader](#edi-reader)
15+
* [JSON Reader](#json-reader)
16+
* [XML Reader](#xml-reader)
17+
18+
# Programmability of Omniparser
19+
20+
There are many ways to use omniparser in your code/service/app programmatically.
21+
22+
## Out-of-Box Basic Use Case
23+
24+
This is covered in [Getting Started](./gettingstarted.md#using-omniparser-programmatically) and is repeated
25+
here for completeness.
26+
```
27+
schema, err := omniparser.NewSchema("your schema name", strings.NewReader("your schema content"))
28+
if err != nil { ... }
29+
transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
30+
if err != nil { ... }
31+
for {
32+
output, err := transform.Read()
33+
if err == io.EOF {
34+
break
35+
}
36+
if err != nil { ... }
37+
// output contains a []byte of the ingested and transformed record.
38+
}
39+
```
40+
Note this out-of-box omniparser setup contains only the `omni.2.1` schema handler, meaning only schemas
41+
whose `parser_settings.version` is `omni.2.1` are supported. The `omni.2.1` schema handler's supported file
42+
formats include: delimited (CSV, TSV, etc.), EDI, XML, JSON, and fixed-length. The `omni.2.1` schema handler's
43+
supported built-in `custom_func`s are listed [here](./customfuncs.md).
44+
45+
## Add A New `custom_func`
46+
47+
If the built-in `custom_func`s aren't enough, you can add your own custom functions by
48+
[doing this](../extensions/omniv21/samples/customfileformats/jsonlog/sample_test.go) (note the linked
49+
sample does more than just adding a new `custom_func`):
50+
```
51+
schema, err := omniparser.NewSchema(
52+
"your schema name",
53+
strings.NewReader("your schema content"),
54+
omniparser.Extension{
55+
CreateSchemaHandler: omniv21.CreateSchemaHandler,
56+
CustomFuncs: customfuncs.Merge(
57+
customfuncs.CommonCustomFuncs, // global custom_funcs
58+
v21.OmniV21CustomFuncs, // omni.2.1 custom_funcs
59+
customfuncs.CustomFuncs{
60+
"normalize_severity": normalizeSeverity, // <====== your own custom_funcs
61+
})})
62+
if err != nil { ... }
63+
transform, err := schema.NewTransform("your input name", strings.NewReader("your input content"), &transformctx.Ctx{})
64+
if err != nil { ... }
65+
for {
66+
output, err := transform.Read()
67+
if err == io.EOF {
68+
break
69+
}
70+
if err != nil { ... }
71+
// output contains a []byte of the ingested and transformed record.
72+
}
73+
```
74+
75+
Each `custom_func` must be a Golang function with the first param being `*transformctx.Ctx`. The remaining
76+
params can be of any type, as long as they will match the types of data that are fed into the function
77+
in `transform_declarations`.
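
As a rough illustration of the signature rule above, here is a minimal, self-contained sketch of what a function like `normalizeSeverity` could look like. The `Ctx` type is a local stand-in for `transformctx.Ctx` so the snippet compiles without the library, and both the `(string, error)` return shape and the severity mapping are assumptions made for illustration, not taken from the linked sample.

```go
package main

import (
	"fmt"
	"strings"
)

// Ctx is a local stand-in for omniparser's transformctx.Ctx (assumption:
// real code would take *transformctx.Ctx as the first parameter instead).
type Ctx struct{}

// normalizeSeverity is a hypothetical custom_func body. The first param is
// the context; the second is a string because the schema is assumed to feed
// it a string value in transform_declarations.
func normalizeSeverity(_ *Ctx, sev string) (string, error) {
	switch s := strings.ToUpper(strings.TrimSpace(sev)); s {
	case "DEBUG", "INFO", "ERROR":
		return s, nil
	case "WARN", "WARNING":
		return "WARNING", nil
	case "CRIT", "CRITICAL", "FATAL":
		return "CRITICAL", nil
	default:
		return "", fmt.Errorf("unknown severity %q", sev)
	}
}

func main() {
	for _, in := range []string{"warn", " Info ", "bogus"} {
		out, err := normalizeSeverity(nil, in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```

Returning an error for unrecognized input, rather than passing it through, lets a bad record be caught during the transform instead of further downstream.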
78+
79+
## Add A New `custom_parse`
80+
81+
There are several ways to customize transform logic, one of which is using the almighty `custom_func`
82+
`javascript` (or its sibling `javascript_with_context`), see details
83+
[here](./use_of_custom_funcs.md#javascript-and-javascript_with_context).
84+
85+
However, we currently don't support multi-line javascript (yet), which makes complex transform
86+
logic in a single line javascript difficult to read and debug. Also there are situations where schema
87+
writers want the following:
88+
- native Golang code transform logic
89+
- logging/stats
90+
- better/thorough test coverage
91+
- more complex operations like RPC calls, encryption, etc., which aren't really suited/possible for
92+
javascript to handle.
93+
94+
`custom_parse` provides an in-code transform plugin mechanism. In addition to a number of built-in
95+
transforms, such as field, `const`, `external`, `object`, `template`, `array`, and `custom_func`,
96+
`custom_parse` allows a schema writer to provide a Golang function that takes in the
97+
`*idr.Node` at the current IDR cursor (see more about IDR cursoring
98+
[here](./xpath.md#data-context-and-anchoring)), does whatever processing and transforms as it sees
99+
fit, and returns the desired result to be embedded in place of the `custom_parse`.
100+
101+
[This sample](../extensions/omniv21/samples/customparse/sample_test.go) gives a very detailed demo
102+
of how `custom_parse` works.
103+
104+
## Add A New File Format
105+
106+
While the built-in `omni.2.1` schema handler already supports most popular file formats in a typical
107+
ETL pipeline, new file format(s) can be added into the schema handler, so it can ingest new formats
108+
while using the same extensible/capable transform (`transform_declarations`) logic.
109+
110+
On a high level, a [`FileFormat`](../extensions/omniv21/fileformat/fileformat.go) is a component
111+
that knows how to ingest a data record, in streaming fashion, from a certain file format, and
112+
convert it into an `idr.Node` tree, for later processing and transform.
113+
114+
Typically, a new [`FileFormat`](../extensions/omniv21/fileformat/fileformat.go) may require some
115+
additional information in a schema (usually in a `file_declaration` section), thus `omni.2.1` schema
116+
handler will give a new custom [`FileFormat`](../extensions/omniv21/fileformat/fileformat.go) a
117+
chance to validate a schema. Then the schema handler will ask
118+
the new [`FileFormat`](../extensions/omniv21/fileformat/fileformat.go) to create a format specific
119+
reader, whose job is to consume input stream, and convert each record into the IDR format.
120+
121+
See [this example](../extensions/omniv21/samples/customfileformats) for how to add a new
122+
[`FileFormat`](../extensions/omniv21/fileformat/fileformat.go).
123+
124+
## Add A New Schema Handler
125+
126+
To complete omniparser's full extensibility picture, we allow adding completely new schema handlers,
127+
whether they're for major schema version upgrades that break backward-compatibility, or for brand-new
128+
parsing/transform paradigms. In fact, we utilize this customizability capability ourselves for
129+
integrating those legacy omniparser schema supports (schema versions that are older than `omni.2.1`
130+
and are not compatible with `omni.2.1`): take a glimpse at: https://github.com/jf-tech/omniparserlegacy.
131+
132+
## Put All Together
133+
134+
The most canonical use case of omniparser would be a (micro)service that is part of a larger ETL
135+
pipeline that gets different input files/streams from different external integration influx points,
136+
performs schema driven (thus codeless) parsing and transform to process and standardize the inputs
137+
into internal formats for the later-stage loading (L) part of ETL.
138+
139+
Because omniparser's parsing and transform is schema driven and involves little/no coding, it enables
140+
faster and at-scale ETL integration possibly done by non-coding engineers or support staff:
141+
142+
![](./resources/typical_omnipasrser_service.png)
143+
144+
First in your service, there needs to be a schema cache component that loads and refreshes all the
145+
schemas from a schema repository (could be a REST API, or a database, or some storage). These schemas
146+
are parsed, validated (by [`omniparser.NewSchema`](../schema.go) calls) and cached.
147+
148+
As different integration partners' input streams are coming in, the service will, based on some
149+
criteria, such as partner IDs, select which schema to use for a particular input. Once schema
150+
selection is completed, the service calls [`schema.NewTransform`](../schema.go) to create an
151+
instance of a transform operation for this particular input, performs the parsing and transform, and
152+
sends the standardized output into a later stage in the ETL pipeline.
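
The schema cache described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Schema`, `CompileFunc`, `SchemaCache`, and `fakeCompile` are hypothetical names, and `CompileFunc` stands in for a call to `omniparser.NewSchema` so that the snippet runs without the library.

```go
package main

import (
	"fmt"
	"sync"
)

// Schema is a placeholder for a parsed and validated schema
// (assumption: in a real service, the value returned by omniparser.NewSchema).
type Schema interface{}

// CompileFunc parses and validates one schema; it stands in for omniparser.NewSchema.
type CompileFunc func(name, content string) (Schema, error)

// SchemaCache keeps compiled schemas keyed by a selection criterion such as partner ID.
type SchemaCache struct {
	mu      sync.RWMutex
	compile CompileFunc
	schemas map[string]Schema
}

func NewSchemaCache(compile CompileFunc) *SchemaCache {
	return &SchemaCache{compile: compile, schemas: map[string]Schema{}}
}

// Refresh compiles every raw schema (partner ID -> schema content) and swaps
// the whole set in at once. If any schema fails, the previous set is kept.
func (c *SchemaCache) Refresh(raw map[string]string) error {
	fresh := make(map[string]Schema, len(raw))
	for id, content := range raw {
		s, err := c.compile(id, content)
		if err != nil {
			return fmt.Errorf("schema %q: %w", id, err)
		}
		fresh[id] = s
	}
	c.mu.Lock()
	c.schemas = fresh
	c.mu.Unlock()
	return nil
}

// Get selects the cached schema for a partner, if any.
func (c *SchemaCache) Get(partnerID string) (Schema, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.schemas[partnerID]
	return s, ok
}

// fakeCompile is a toy compiler used only to exercise the cache.
func fakeCompile(name, content string) (Schema, error) {
	if content == "" {
		return nil, fmt.Errorf("empty schema content")
	}
	return content, nil
}

func main() {
	cache := NewSchemaCache(fakeCompile)
	if err := cache.Refresh(map[string]string{"partner-a": "schema A"}); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	s, ok := cache.Get("partner-a")
	fmt.Println(s, ok)
}
```

Swapping the whole map on refresh keeps readers from ever seeing a half-updated set of schemas; a service would then call `schema.NewTransform` on whatever `Get` returns for each incoming input.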
153+
154+
## In Non-Golang Environment
155+
156+
Omniparser is currently only implemented in Golang (we do want to port it to other languages, at least
157+
Java, in the near future), so the only way to utilize it, if your service or environment is not in Golang,
158+
is to sidecar it, by either making it a standalone service or shell-exec'ing omniparser, both of which
159+
involve omniparser's CLI.
160+
161+
Recall in [Getting Started](./gettingstarted.md#cli-command-line-interface) we demonstrated omniparser
162+
CLI's `transform` command. You can shell-exec it from your service. Keep in mind the following if you
163+
want to go down this path:
164+
- you will have to pre-compile omniparser CLI binary (which needs to be platform/OS specific) and ship with
165+
your service, and
166+
- you will need to copy down the input file locally in your service before invoking the CLI, and then
167+
intercept `stdout`/`stderr` from the CLI and its exit code in order to get the results.
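
The shell-exec steps above (run the binary, capture `stdout`/`stderr`, inspect the exit code) could look roughly like the sketch below. It is written in Go only to stay consistent with the rest of this doc; the same pattern applies in any host language. `runCLI` and `cliResult` are hypothetical names, and the command shown in `main` is a plain `sh` one-liner rather than the real omniparser CLI, whose binary path and flags are deliberately not assumed here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// cliResult holds everything a caller needs from one CLI invocation.
type cliResult struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runCLI runs a command and captures stdout, stderr and the exit code.
// A non-zero exit is reported through ExitCode, not as an error; an error
// is returned only if the process could not be started or waited on.
func runCLI(name string, args ...string) (cliResult, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	res := cliResult{Stdout: stdout.String(), Stderr: stderr.String()}

	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		res.ExitCode = exitErr.ExitCode()
		return res, nil
	}
	return res, err
}

func main() {
	// Stand-in command; a real service would invoke its shipped CLI binary
	// with the locally saved input file and schema file as arguments.
	res, err := runCLI("sh", "-c", "echo out; echo err 1>&2; exit 3")
	if err != nil {
		fmt.Println("could not run command:", err)
		return
	}
	fmt.Printf("exit=%d stdout=%q stderr=%q\n", res.ExitCode, res.Stdout, res.Stderr)
}
```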
168+
169+
Omniparser CLI has another command `server`, which simply launches the CLI into an HTTP listening service
170+
that exposes a REST API:
171+
- `POST`
172+
- request `Content-Type`: `application/json`
173+
- request JSON:
174+
```
175+
{
176+
"schema": "... the schema content, required ...",
177+
"input": "... the input to be parsed and transformed, required ...",
178+
"properties": { ... JSON string map used for `external` transforms, optional ...}
179+
}
180+
```
181+
Keep in mind the following if you want to go down this path:
182+
- you will need to host this CLI-turned omniparser service somewhere accessible to your service,
183+
- you lose the benefit of omniparser stream processing, which enables parsing infinitely large input,
184+
because now you need to send the input as a single string in the `input` field of the HTTP POST request.
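
For illustration, the request described above could be built as in the sketch below (shown in Go for consistency with this doc; the JSON payload is the same from any language). The field names come from the request JSON above. The URL is a placeholder, since the server's host, port, and path are not stated here, and `transformRequest`, `marshalTransformBody`, and `newTransformRequest` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// transformRequest mirrors the request JSON: schema and input are required,
// properties is an optional string map used for `external` transforms.
type transformRequest struct {
	Schema     string            `json:"schema"`
	Input      string            `json:"input"`
	Properties map[string]string `json:"properties,omitempty"`
}

// marshalTransformBody encodes the request payload.
func marshalTransformBody(schema, input string, props map[string]string) ([]byte, error) {
	return json.Marshal(transformRequest{Schema: schema, Input: input, Properties: props})
}

// newTransformRequest builds the POST request; url is supplied by the caller
// because the actual endpoint is deployment specific.
func newTransformRequest(url, schema, input string, props map[string]string) (*http.Request, error) {
	body, err := marshalTransformBody(schema, input, props)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder URL (assumption); replace with wherever the service is hosted.
	req, err := newTransformRequest("http://omniparser.example.internal/",
		"your schema content", "your input content", map[string]string{"k": "v"})
	if err != nil {
		fmt.Println("build request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Content-Type"))
	// Sending it would be: resp, err := http.DefaultClient.Do(req)
}
```

Note that the whole input travels as one JSON string, which is exactly why the streaming benefit mentioned above is lost on this path.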
185+
186+
# Programmability of Some Components without Omniparser
187+
188+
Many components inside omniparser can be useful in your code, even if you don't want to
189+
use omniparser as a whole for parsing and transforming input file/data. Here is a selected list of
190+
these components:
191+
192+
## Functions
193+
194+
- [`DateTimeToRFC3339()`, `DateTimeLayoutToRFC3339()`, `DateTimeToEpoch()`, `EpochToDateTimeRFC3339()`](../customfuncs/datetime.go)
195+
196+
Parsing and formatting date/time stamps isn't trivial at all, especially when time zones are
197+
involved. These functions can be used independent of omniparser and are very useful when your
198+
Golang code deals with date/time a lot.
199+
200+
- [`JavaScript()`](../extensions/omniv21/customfuncs/javascript.go):
201+
202+
Omniparser uses github.com/dop251/goja as the native Golang javascript engine. Yes you can directly
203+
use `goja`, but you'll have to deal with performance related vm caching, and error handling. Instead
204+
you can use the `JavaScript` function directly.
205+
206+
## IDR
207+
208+
We have an in-depth [doc](./idr.md) talking about IDR, which proves to be really useful in many document
209+
parsing situations, even outside of omniparser realm. This `idr` package contains the IDR node/tree
210+
definitions, creation, caching, recycling and releasing mechanisms, serialization helpers, XPath
211+
assisted navigation and querying, and two powerful stream readers for JSON and XML inputs.
212+
213+
Particularly, the [JSON](../idr/jsonreader.go)/[XML](../idr/xmlreader.go) readers are two powerful
214+
parsers, capable of ingesting JSON/XML data in streaming fashion assisted by XPath style target
215+
filtering, thus enabling processing arbitrarily large inputs.
216+
217+
## CSV Reader
218+
219+
Use [`NewReader()`](../extensions/omniv21/fileformat/csv/reader.go) to create a CSV reader that does
220+
- header column validation
221+
- header/data row jumping
222+
- XPath based data row filtering
223+
- Mis-escaped quote replacement
224+
- Context-aware error message
225+
226+
For more reader specific settings/configurations, check
227+
[CSV in Depth](./csv_in_depth.md#csv-file_declaration) page.
228+
229+
## Fixed-Length Reader
230+
231+
Use [`NewReader()`](../extensions/omniv21/fileformat/fixedlength/reader.go) to create a fixed-length
232+
reader that does
233+
- row based or header/footer based envelope parsing
234+
- XPath based data row filtering
235+
- Context-aware error message
236+
237+
For more reader specific settings/configurations, check
238+
[Fixed-Length in Depth](./fixedlength_in_depth.md) page.
239+
240+
## EDI Reader
241+
242+
Use [`NewReader()`](../extensions/omniv21/fileformat/edi/reader.go) to create an EDI reader that does
243+
- segment min/max validation
244+
- XPath based data row filtering
245+
- Context-aware error message
246+
247+
Future TO-DO: create a non-validating version of the EDI reader for users who are only interested in
248+
getting the raw segment data, without any validation.
249+
250+
## JSON Reader
251+
See [IDR](#idr) notes about the JSON/XML readers above.
252+
253+
## XML Reader
254+
See [IDR](#idr) notes about the JSON/XML readers above.

doc/xml_in_depth.md

Whitespace-only changes.

extensions/omniv21/fileformat/csv/decl.go

+6-4
@@ -4,22 +4,24 @@ import (
44
"github.com/jf-tech/go-corelib/strs"
55
)
66

7-
type column struct {
7+
// Column is a CSV column.
8+
type Column struct {
89
Name string `json:"name"`
910
// If the CSV column 'name' contains characters (such as space, or special letters) that are
1011
// not suitable for *idr.Node construction and xpath query, this gives schema writer an
1112
// alternate way to name/label the column. Optional.
1213
Alias *string `json:"alias"`
1314
}
1415

15-
func (c column) name() string {
16+
func (c Column) name() string {
1617
return strs.StrPtrOrElse(c.Alias, c.Name)
1718
}
1819

19-
type fileDecl struct {
20+
// FileDecl describes CSV specific schema settings for omniparser reader.
21+
type FileDecl struct {
2022
Delimiter string `json:"delimiter"`
2123
ReplaceDoubleQuotes bool `json:"replace_double_quotes"`
2224
HeaderRowIndex *int `json:"header_row_index"`
2325
DataRowIndex int `json:"data_row_index"`
24-
Columns []column `json:"columns"`
26+
Columns []Column `json:"columns"`
2527
}
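
To show how the now-exported `FileDecl` maps onto a schema's `file_declaration` JSON, here is a minimal sketch. `Column` and `FileDecl` below are local copies of the structs in the diff above so the snippet runs without the library; `parseFileDecl` is a hypothetical helper, and its single check mirrors the rule stated in `validateFileDecl` (a specified `header_row_index` must be less than `data_row_index`).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Column is a local copy of the exported csv.Column from the diff.
type Column struct {
	Name  string  `json:"name"`
	Alias *string `json:"alias"`
}

// name returns the alias if present, otherwise the column name.
func (c Column) name() string {
	if c.Alias != nil {
		return *c.Alias
	}
	return c.Name
}

// FileDecl is a local copy of the exported csv.FileDecl from the diff.
type FileDecl struct {
	Delimiter           string   `json:"delimiter"`
	ReplaceDoubleQuotes bool     `json:"replace_double_quotes"`
	HeaderRowIndex      *int     `json:"header_row_index"`
	DataRowIndex        int      `json:"data_row_index"`
	Columns             []Column `json:"columns"`
}

// parseFileDecl decodes a file_declaration JSON object and applies the
// header/data row index rule.
func parseFileDecl(s string) (*FileDecl, error) {
	var decl FileDecl
	if err := json.Unmarshal([]byte(s), &decl); err != nil {
		return nil, err
	}
	if decl.HeaderRowIndex != nil && *decl.HeaderRowIndex >= decl.DataRowIndex {
		return nil, fmt.Errorf("header_row_index (%d) must be smaller than data_row_index (%d)",
			*decl.HeaderRowIndex, decl.DataRowIndex)
	}
	return &decl, nil
}

func main() {
	decl, err := parseFileDecl(`{
		"delimiter": "|",
		"header_row_index": 1,
		"data_row_index": 2,
		"columns": [{"name": "A"}, {"name": "B col", "alias": "B"}]
	}`)
	if err != nil {
		fmt.Println("invalid file_declaration:", err)
		return
	}
	for _, c := range decl.Columns {
		fmt.Println(c.name())
	}
}
```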

extensions/omniv21/fileformat/csv/decl_test.go

+2-2
@@ -8,6 +8,6 @@ import (
88
)
99

1010
func TestColumnName(t *testing.T) {
11-
assert.Equal(t, "name", column{Name: "name"}.name())
12-
assert.Equal(t, "alias", column{Name: "name", Alias: strs.StrPtr("alias")}.name())
11+
assert.Equal(t, "name", Column{Name: "name"}.name())
12+
assert.Equal(t, "alias", Column{Name: "name", Alias: strs.StrPtr("alias")}.name())
1313
}

extensions/omniv21/fileformat/csv/format.go

+3-3
@@ -30,7 +30,7 @@ func NewCSVFileFormat(schemaName string) fileformat.FileFormat {
3030
}
3131

3232
type csvFormatRuntime struct {
33-
Decl *fileDecl `json:"file_declaration"`
33+
Decl *FileDecl `json:"file_declaration"`
3434
XPath string
3535
}
3636

@@ -65,7 +65,7 @@ func (f *csvFileFormat) ValidateSchema(
6565
return &runtime, nil
6666
}
6767

68-
func (f *csvFileFormat) validateFileDecl(decl *fileDecl) error {
68+
func (f *csvFileFormat) validateFileDecl(decl *FileDecl) error {
6969
// If header_row_index is specified, then it must be < data_row_index
7070
if decl.HeaderRowIndex != nil && *decl.HeaderRowIndex >= decl.DataRowIndex {
7171
return f.FmtErr(
@@ -78,7 +78,7 @@ func (f *csvFileFormat) validateFileDecl(decl *fileDecl) error {
7878
return nil
7979
}
8080

81-
func (f *csvFileFormat) validateColumns(columns []column) error {
81+
func (f *csvFileFormat) validateColumns(columns []Column) error {
8282
namesSeen := map[string]bool{}
8383
aliasesSeen := map[string]bool{}
8484
for _, column := range columns {

extensions/omniv21/fileformat/csv/format_test.go

+2-2
@@ -164,11 +164,11 @@ func TestCreateFormatReader(t *testing.T) {
164164
lf("x|y")+
165165
lf("4|5|6")),
166166
&csvFormatRuntime{
167-
Decl: &fileDecl{
167+
Decl: &FileDecl{
168168
Delimiter: "|",
169169
HeaderRowIndex: testlib.IntPtr(1),
170170
DataRowIndex: 2,
171-
Columns: []column{{Name: "A"}, {Name: "B"}, {Name: "C"}},
171+
Columns: []Column{{Name: "A"}, {Name: "B"}, {Name: "C"}},
172172
},
173173
XPath: ".[A != 'x']",
174174
})

extensions/omniv21/fileformat/csv/reader.go

+2-2
@@ -32,7 +32,7 @@ func IsErrInvalidHeader(err error) bool {
3232

3333
type reader struct {
3434
inputName string
35-
decl *fileDecl
35+
decl *FileDecl
3636
xpath *xpath.Expr
3737
r *ios.LineNumReportingCsvReader
3838
headerChecked bool
@@ -144,7 +144,7 @@ func (r *reader) fmtErrStr(format string, args ...interface{}) string {
144144
}
145145

146146
// NewReader creates an FormatReader for CSV file format.
147-
func NewReader(inputName string, r io.Reader, decl *fileDecl, xpathStr string) (*reader, error) {
147+
func NewReader(inputName string, r io.Reader, decl *FileDecl, xpathStr string) (*reader, error) {
148148
var expr *xpath.Expr
149149
var err error
150150
xpathStr = strings.TrimSpace(xpathStr)
