|
1 | 1 | # file-utils-processors-ts
|
2 | 2 |
|
3 |
| -[](https://github.com/rdf-connect/file-utils-processors-ts/actions/workflows/build-test.yml) [](https://npmjs.com/package/@rdfc/file-utils-processors-ts) |
| 3 | +[](https://github.com/rdf-connect/file-utils-processors-ts/actions/workflows/build-test.yml) |
4 | 4 |
|
5 |
| -[RDF-Connect](https://rdf-connect.github.io/rdfc.github.io/) Typescript processors for handling file operations. It currently exposes 6 functions: |
| 5 | +This repository provides a set of processors for reading, transforming, and extracting files in RDF-Connect pipelines. |
| 6 | +It includes utilities for reading files from folders or glob patterns, substituting strings or environment variables, reading files on demand, and handling compressed files (zip/gzip). |
6 | 7 |
|
7 |
| -### [`js:GlobRead`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L10) |
| 8 | +These processors are designed to integrate seamlessly into RDF-Connect pipelines using the [rdfc:NodeRunner](https://github.com/rdf-connect/js-runner). |
8 | 9 |
|
9 |
| -This function relies on the [`glob`](https://www.npmjs.com/package/glob) library to select a set of files according to a shell expression and stream them out in a sequential fashion. A `wait` parameter can be defined to wait x milliseconds between file streaming operations. |
| 10 | +--- |
10 | 11 |
|
11 |
| -### [`js:FolderRead`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L70) |
| 12 | +## Usage |
12 | 13 |
|
13 |
| -This function reads all the files present in a given folder and streams out their content in a sequential fashion. A `maxMemory` parameter can be given (in GB) to defined threshold of maximum used memory by the streaming process. When the threshold is exceeded, the streaming process will pause for as many milliseconds as defined by the `pause` parameter. |
| 14 | +To use these processors, import the package into your RDF-Connect pipeline configuration and reference the required processors. |
14 | 15 |
|
15 |
| -### [`js:Substitute`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L121) |
| 16 | +### Installation |
16 | 17 |
|
17 |
| -This function transform a stream by applying a given string substitution on each of the messages. The matching string can be a regex defined by the `source` property and setting the `regexp` property to `true`. |
| 18 | +```bash |
| 19 | +npm install |
| 20 | +npm run build |
| 21 | +``` |
18 | 22 |
|
19 |
| -### [`js:Envsub`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L185) |
| 23 | +Or install from NPM: |
20 | 24 |
|
21 |
| -This function substitute all the defined environment variables on each of the elements of an input stream that have been labeled with a `${VAR_NAME}` pattern. |
| 25 | +```bash |
| 26 | +npm install @rdfc/file-utils-processors-ts |
| 27 | +``` |
22 | 28 |
|
23 |
| -### [`js:ReadFile`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L220) |
| 29 | +Next, you can add the processors to your pipeline configuration as follows: |
24 | 30 |
|
25 |
| -This function can read on demand and push downstream the contents of a file located in a predefined folder. This processor is used mostly for testing and demonstrating pipeline implementations. |
| 31 | +```turtle |
| 32 | +@prefix rdfc: <https://w3id.org/rdf-connect#>. |
| 33 | +@prefix owl: <http://www.w3.org/2002/07/owl#>. |
26 | 34 |
|
27 |
| -### [`js:UnzipFile`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L265) |
| 35 | +### Import the processor definitions |
| 36 | +<> owl:imports <./node_modules/@rdfc/file-utils-processors-ts/processors.ttl>. |
28 | 37 |
|
29 |
| -This function can receive a zipped file in the form of a Buffer and stream out its decompressed contents. |
| 38 | +### Define the channels your processor needs |
| 39 | +<in> a rdfc:Reader, rdfc:Writer. |
| 40 | +<out> a rdfc:Reader, rdfc:Writer. |
30 | 41 |
|
31 |
| -### [`js:GunzipFile`](https://github.com/rdf-connect/file-utils-processors-ts/blob/main/processors.ttl#L310) |
| 42 | +### Attach the processor to the pipeline under the NodeRunner |
| 43 | +# Add the `rdfc:processor <folderReader>` statement under the `rdfc:consistsOf` statement of the `rdfc:NodeRunner` |
32 | 44 |
|
33 |
| -This function can receive a gzipped file in the form of a Buffer and stream out its decompressed contents. |
| 45 | +### Define and configure the processors |
| 46 | +<folderReader> a rdfc:FolderRead; |
| 47 | + rdfc:folder_location "./data"; |
| 48 | + rdfc:file_stream <out>. |
| 49 | +``` |
| 50 | + |
| 51 | +--- |
| 52 | + |
| 53 | +## Processors and Configuration |
| 54 | + |
| 55 | +### 📂 `rdfc:GlobRead` – Glob-based File Reader |
| 56 | +Reads all files matching a given glob pattern. |
| 57 | + |
| 58 | +**Parameters:** |
| 59 | +- `rdfc:glob` (`string`, required): Glob pattern to select files. |
| 60 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel to stream file contents. |
| 61 | +- `rdfc:wait` (`integer`, optional): Delay (ms) before reading files. |
| 62 | +- `rdfc:closeOnEnd` (`boolean`, optional): Whether to close the stream after finishing. |
| 63 | +- `rdfc:binary` (`boolean`, optional): If true, streams binary data instead of text. |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +### 📁 `rdfc:FolderRead` – Folder File Reader |
| 68 | +Reads all files inside a folder. |
| 69 | + |
| 70 | +**Parameters:** |
| 71 | +- `rdfc:folder_location` (`string`, required): Path to the folder. |
| 72 | +- `rdfc:file_stream` (`rdfc:Writer`, required): Output channel to stream file contents. |
| 73 | +- `rdfc:max_memory` (`double`, optional): Max memory usage allowed (in MB). |
| 74 | +- `rdfc:pause` (`integer`, optional): Pause duration (ms) between file reads. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +### 🔄 `rdfc:Substitute` – String Substitution Processor |
| 79 | +Performs string substitution (supports regex) on messages in the stream. |
| 80 | + |
| 81 | +**Parameters:** |
| 82 | +- `rdfc:input` (`rdfc:Reader`, required): Input channel. |
| 83 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel. |
| 84 | +- `rdfc:source` (`string`, required): Source string or regex to match. |
| 85 | +- `rdfc:replace` (`string`, required): Replacement string. |
| 86 | +- `rdfc:regexp` (`boolean`, optional): If true, treat `source` as a regex. |
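To make the effect of `rdfc:regexp` concrete, the following TypeScript sketch contrasts literal and regex matching on a single message. It is an illustration only, not the processor's actual implementation: the `substitute` helper is hypothetical, and replacing every occurrence (rather than only the first) is an assumption.

```typescript
// Hypothetical helper illustrating the two matching modes; not part of the package API.
function substitute(
  message: string,
  source: string,
  replace: string,
  regexp = false,
): string {
  if (regexp) {
    // Interpret `source` as a regular expression; the "g" flag replaces every match.
    return message.replace(new RegExp(source, "g"), replace);
  }
  // Literal mode: split/join replaces every occurrence without interpreting
  // regex metacharacters in `source`.
  return message.split(source).join(replace);
}

console.log(substitute("Hello World", "World", "RDF-Connect")); // Hello RDF-Connect
console.log(substitute("a1b22c", "[0-9]+", "#", true)); // a#b#c
```

In literal mode a `source` such as `"."` matches only an actual dot, whereas with `regexp` set to true it would match any character.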
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +### 🌍 `rdfc:Envsub` – Environment Variable Substitution |
| 91 | +Substitutes environment variables in the stream with their values. |
| 92 | + |
| 93 | +**Parameters:** |
| 94 | +- `rdfc:input` (`rdfc:Reader`, required): Input channel. |
| 95 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel. |
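As a rough illustration of what environment variable substitution means for a single message, the TypeScript sketch below replaces `${VAR_NAME}` placeholders with values from a lookup table. The `envsub` helper and its `env` argument are hypothetical, and leaving placeholders for undefined variables untouched is an assumption; the processor's real behavior may differ.

```typescript
// Hypothetical helper; the variable table is passed in explicitly so the
// sketch stays self-contained.
function envsub(
  message: string,
  env: Record<string, string | undefined>,
): string {
  // Match ${NAME}, where NAME is a conventional environment variable identifier.
  return message.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (placeholder: string, name: string) => {
      const value = env[name];
      // Assumption: variables that are not defined are left as-is.
      return value === undefined ? placeholder : value;
    },
  );
}

console.log(envsub("endpoint: ${HOST}:${PORT}", { HOST: "localhost" }));
// endpoint: localhost:${PORT}
```

In a Node.js setting, the lookup table would typically be `process.env`.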
| 96 | + |
| 97 | +--- |
| 98 | + |
| 99 | +### 📄 `rdfc:ReadFile` – On-Demand File Reader |
| 100 | +Reads a requested file from a given folder. |
| 101 | + |
| 102 | +**Parameters:** |
| 103 | +- `rdfc:input` (`rdfc:Reader`, required): Input channel (file requests). |
| 104 | +- `rdfc:folderPath` (`string`, required): Path to the folder containing files. |
| 105 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel for file contents. |
| 106 | + |
| 107 | +--- |
| 108 | + |
| 109 | +### 📦 `rdfc:UnzipFile` – Zip File Extractor |
| 110 | +Unzips a compressed file and streams its content. |
| 111 | + |
| 112 | +**Parameters:** |
| 113 | +- `rdfc:input` (`rdfc:Reader`, required): Input channel (zip file). |
| 114 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel (extracted contents). |
| 115 | +- `rdfc:outputAsBuffer` (`boolean`, optional): If true, outputs raw buffers instead of strings. |
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +### 🗜️ `rdfc:GunzipFile` – Gzip File Extractor |
| 120 | +Gunzips a gzip-compressed file and streams out its content. |
| 121 | + |
| 122 | +**Parameters:** |
| 123 | +- `rdfc:input` (`rdfc:Reader`, required): Input channel (gzip file). |
| 124 | +- `rdfc:output` (`rdfc:Writer`, required): Output channel (extracted contents). |
| 125 | +- `rdfc:outputAsBuffer` (`boolean`, optional): If true, outputs raw buffers instead of strings. |
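The difference between the two `rdfc:outputAsBuffer` settings can be sketched in TypeScript with Node.js's built-in `zlib` module. The `gunzipMessage` helper is hypothetical, and decoding the text output as UTF-8 is an assumption; this is not the processor's own code.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical helper mirroring the `outputAsBuffer` option.
function gunzipMessage(
  data: Uint8Array,
  outputAsBuffer = false,
): Buffer | string {
  const decompressed = gunzipSync(data);
  // Assumption: text output is decoded as UTF-8.
  return outputAsBuffer ? decompressed : decompressed.toString("utf8");
}

// Compress a small payload, then decompress it in both modes.
const compressed = gzipSync("hello rdf-connect");
console.log(gunzipMessage(compressed)); // hello rdf-connect
console.log(Buffer.isBuffer(gunzipMessage(compressed, true))); // true
```

Raw buffers are the safer choice when the compressed payload is binary, since decoding arbitrary bytes as text can corrupt them.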
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## Example Pipelines |
| 130 | + |
| 131 | +### Example 1: Reading all `.txt` files in a folder and logging them |
| 132 | +```turtle |
| 133 | +<reader> a rdfc:GlobRead; |
| 134 | + rdfc:glob "./data/*.txt"; |
| 135 | + rdfc:output <out>. |
| 136 | +
|
| 137 | +<logger> a rdfc:LogProcessorJs; |
| 138 | + rdfc:reader <out>; |
| 139 | + rdfc:level "info"; |
| 140 | + rdfc:label "glob-reader". |
| 141 | +``` |
| 142 | + |
| 143 | +### Example 2: Substituting strings in a stream |
| 144 | +```turtle |
| 145 | +<substitute> a rdfc:Substitute; |
| 146 | + rdfc:input <in>; |
| 147 | + rdfc:output <out>; |
| 148 | + rdfc:source "World"; |
| 149 | + rdfc:replace "RDF-Connect"; |
| 150 | + rdfc:regexp false. |
| 151 | +``` |
| 152 | + |
| 153 | +### Example 3: Unzipping a file received on an input channel |
| 154 | +```turtle |
| 155 | +<unzipper> a rdfc:UnzipFile; |
| 156 | + rdfc:input <in>; |
| 157 | + rdfc:output <out>; |
| 158 | + rdfc:outputAsBuffer true. |
| 159 | +``` |