15 changes: 15 additions & 0 deletions chat_content.html


Binary file added chat_head.txt
130 changes: 130 additions & 0 deletions docs/filestore.md
@@ -0,0 +1,130 @@
# Filestore configuration

Greenmask can optionally dump and restore a filestore alongside database data. Instead of maintaining a separate script to archive binary files, you can reuse the same database connection and storage configuration (for example, S3 settings) to back up and restore binaries together with the database, avoiding duplicate credential management. You can also limit the filestore to an explicit file list to apply "anonymization by reduction": after restore, a post-restore script can replace selected binary references with a placeholder.

## `dump.filestore` section

In the `dump` section, `filestore` controls how a filesystem directory is packaged and uploaded to storage.

### Parameters

* `enabled` — enables or disables filestore dumping.
* `root_path` — **required**. Root directory of the filestore on the source filesystem.
* `include_list_file` — path to a file containing relative paths to include. Each line is a path relative to `root_path`.
* `include_list_query` — SQL query (inline) that returns a list of relative paths to include.
* `include_list_query_file` — path to a file with the SQL query to execute.
* `subdir` — storage subdirectory for filestore artifacts. Default: `filestore`.
* `archive_name` — name of the tar archive produced. Default: `filestore.tar.gz`.
* `metadata_name` — name of the metadata JSON file. Default: `filestore.json`.
* `use_pgzip` — overrides the default compression behavior. If unset, it inherits the dump-level `--pgzip` option, which is `false` by default.
* `fail_on_missing` — if true, missing files cause the dump to fail. Default: `false`.
* `split.max_size_bytes` — enables archive splitting by maximum size. Default: `0` (disabled).
* `split.max_files` — enables archive splitting by maximum number of files. Default: `0` (disabled).

!!! note
    Filestore archives and metadata are uploaded into the configured storage under the `subdir` path.

!!! warning
    Only one include-list source can be configured at a time:
    `include_list_file` **or** `include_list_query` **or** `include_list_query_file`.
    If no include list is configured, all files under `root_path` are included recursively.

### Example

```yaml title="filestore dump config example"
dump:
  filestore:
    enabled: true
    root_path: "/var/lib/odoo/filestore"
    subdir: "filestore"
    archive_name: "filestore.tar.gz"
    metadata_name: "filestore.json"

    # choose exactly one source of paths:
    include_list_file: "/etc/greenmask/filestore-files.txt"
    # include_list_query: "SELECT DISTINCT store_fname FROM ir_attachment WHERE mimetype != 'application/pdf'"
    # include_list_query_file: "/etc/greenmask/filestore_query.sql"

    fail_on_missing: true
    use_pgzip: true

    split:
      max_size_bytes: 1073741824 # 1 GiB
      max_files: 100000
```

## `restore.filestore` section

In the `restore` section, `filestore` controls how stored filestore archives are fetched and unpacked.

### Parameters

* `enabled` — enables or disables filestore restoration.
* `target_path` — **required**. Destination directory on the target filesystem.
* `subdir` — storage subdirectory where filestore artifacts are stored. Default: `filestore`.
* `metadata_name` — metadata file name. Default: `filestore.json`.
* `use_pgzip` — optionally overrides compression behavior from metadata.
* `clean_target` — if true, removes the target directory before extraction. Default: `false`.
* `skip_existing` — if true, existing files are left untouched. Default: `false`.

!!! note
    If `use_pgzip` is set, it overrides the `use_pgzip` value stored in the filestore metadata.

!!! warning
    If `clean_target` is enabled, the entire `target_path` directory will be removed before restore.

### Example

```yaml title="filestore restore config example"
restore:
  filestore:
    enabled: true
    target_path: "/var/lib/odoo/filestore"
    subdir: "filestore"
    metadata_name: "filestore.json"
    clean_target: false
    skip_existing: true
    use_pgzip: true
```

## Include list sources

When dumping a filestore, you can limit which files are packed using one of the include-list mechanisms:

* **File list** (`include_list_file`) — a text file with one relative path per line.
* **SQL query** (`include_list_query` / `include_list_query_file`) — a query that returns relative paths.

All paths are resolved relative to `root_path`.
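To make the resolution rule concrete, here is a minimal Go sketch that joins include-list entries onto `root_path`. The function name and shape are illustrative assumptions, not Greenmask's internal API:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveIncludeList joins each relative include-list entry onto rootPath,
// skipping blank lines. Illustrative only; not Greenmask's internal API.
func resolveIncludeList(rootPath string, entries []string) []string {
	var resolved []string
	for _, e := range entries {
		e = strings.TrimSpace(e)
		if e == "" {
			continue
		}
		resolved = append(resolved, filepath.Join(rootPath, e))
	}
	return resolved
}

func main() {
	paths := resolveIncludeList("/var/lib/odoo/filestore", []string{
		"a1/a1b2c3",
		"",
		"d4/d4e5f6",
	})
	fmt.Println(paths) // [/var/lib/odoo/filestore/a1/a1b2c3 /var/lib/odoo/filestore/d4/d4e5f6]
}
```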

!!! tip
    Use an SQL query when paths are stored in the database and you want the filestore selection to follow the dataset.

### Why include lists are useful

Restricting the filestore to an explicit list lets you implement "anonymization by reduction". Instead of copying
all binaries, you can keep only the necessary files and then use a post-restore script to replace references to
missing binaries with a placeholder or a generic asset (for example, `invoice_placeholder.pdf`). This approach
reduces storage, shortens transfer time, and keeps access credentials and storage handling centralized in Greenmask.
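As a sketch of such a post-restore step, assuming an Odoo-like schema (the `ir_attachment` table matches the example below, but the placeholder path is an illustrative assumption and must point at a file kept in the reduced filestore):

```sql
-- Hypothetical post-restore script: repoint excluded PDF attachments at a
-- placeholder that was kept in the reduced filestore. Odoo schema assumed;
-- the placeholder path is illustrative.
UPDATE ir_attachment
SET store_fname = 'placeholders/invoice_placeholder.pdf'
WHERE store_fname IS NOT NULL
  AND mimetype = 'application/pdf';
```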

### Odoo example query

The following Odoo query excludes all PDF and ZIP attachments from the filestore selection:

```sql
SELECT DISTINCT store_fname
FROM ir_attachment
WHERE store_fname IS NOT NULL
  AND COALESCE(mimetype, '') NOT IN ('application/pdf', 'application/zip')
ORDER BY store_fname
```

## Archive splitting

If `split.max_size_bytes` or `split.max_files` is set, the filestore is split into multiple archives. Each archive is
stored separately, and metadata contains the archive list and statistics.

Splitting is useful when:

* individual archives must stay below storage limits,
* large filestores should be processed in smaller parts.

Splitting is not tied to `jobs`: filestore dump/restore does not use multi-threaded workers and processes archives sequentially.
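The splitting rule described above can be sketched as a greedy partition: each file is appended to the current archive until either limit would be exceeded, at which point a new archive is started. This is illustrative Go, not Greenmask's actual implementation:

```go
package main

import "fmt"

type fileInfo struct {
	Path string
	Size int64
}

// splitFiles greedily partitions files into archive groups so that each
// group stays within maxSizeBytes and maxFiles; a value of 0 disables
// that limit. Illustrative only; not Greenmask's actual implementation.
func splitFiles(files []fileInfo, maxSizeBytes int64, maxFiles int) [][]fileInfo {
	var groups [][]fileInfo
	var cur []fileInfo
	var curSize int64
	for _, f := range files {
		overSize := maxSizeBytes > 0 && len(cur) > 0 && curSize+f.Size > maxSizeBytes
		overCount := maxFiles > 0 && len(cur) >= maxFiles
		if overSize || overCount {
			groups = append(groups, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, f)
		curSize += f.Size
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}

func main() {
	files := []fileInfo{{"a", 600}, {"b", 500}, {"c", 300}, {"d", 100}}
	groups := splitFiles(files, 1000, 0)
	fmt.Println(len(groups)) // prints 2: [a] and [b, c, d]
}
```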
53 changes: 53 additions & 0 deletions internal/db/postgres/cmd/dump.go
@@ -39,6 +39,7 @@ import (
"github.com/greenmaskio/greenmask/internal/db/postgres/transformers/custom"
"github.com/greenmaskio/greenmask/internal/db/postgres/transformers/utils"
"github.com/greenmaskio/greenmask/internal/domains"
"github.com/greenmaskio/greenmask/internal/filestore"
"github.com/greenmaskio/greenmask/internal/storages"
"github.com/greenmaskio/greenmask/pkg/toolkit"
)
@@ -543,6 +544,61 @@ func (d *Dump) Run(ctx context.Context) (err error) {
        return fmt.Errorf("writeMetaData stage dumping error: %w", err)
    }

    var includeListExecutor filestore.IncludeListQueryExecutor
    if d.config.Dump.Filestore != nil {
        includeListExecutor = &filestoreQueryExecutor{dump: d}
    }
    if err := filestore.Dump(ctx, d.config.Dump.Filestore, d.st, d.pgDumpOptions.Pgzip, includeListExecutor); err != nil {
        return fmt.Errorf("filestore dumping error: %w", err)
    }

    return nil
}

type filestoreQueryExecutor struct {
    dump *Dump
}

func (e *filestoreQueryExecutor) RunIncludeListQuery(ctx context.Context, query string) ([]string, error) {
    conn, tx, err := e.dump.getWorkerTransaction(ctx)
    if err != nil {
        return nil, err
    }
    defer func() {
        if err := conn.Close(ctx); err != nil {
            log.Debug().Err(err).Msg("error closing include list query connection")
        }
    }()
    defer func() {
        if err := tx.Rollback(ctx); err != nil {
            log.Debug().Err(err).Msg("unable to rollback include list query transaction")
        }
    }()

    rows, err := tx.Query(ctx, query)
    if err != nil {
        return nil, fmt.Errorf("run include_list_query: %w", err)
    }
    defer rows.Close()

    if len(rows.FieldDescriptions()) != 1 {
        return nil, fmt.Errorf("include_list_query must return exactly one column, got %d", len(rows.FieldDescriptions()))
    }

    var values []string
    for rows.Next() {
        var rel string
        if scanErr := rows.Scan(&rel); scanErr != nil {
            return nil, fmt.Errorf("scan include_list_query result: %w", scanErr)
        }
        values = append(values, rel)
    }
    if err := rows.Err(); err != nil {
        return nil, fmt.Errorf("iterate include_list_query result: %w", err)
    }
    return values, nil
}

func (d *Dump) MergeTocEntries(schemaEntries []*toc.Entry, dataEntries []*toc.Entry) (
    []*toc.Entry, error,
) {
5 changes: 5 additions & 0 deletions internal/db/postgres/cmd/restore.go
@@ -42,6 +42,7 @@ import (
"github.com/greenmaskio/greenmask/internal/db/postgres/toc"
"github.com/greenmaskio/greenmask/internal/db/postgres/utils"
"github.com/greenmaskio/greenmask/internal/domains"
"github.com/greenmaskio/greenmask/internal/filestore"
"github.com/greenmaskio/greenmask/internal/storages"
"github.com/greenmaskio/greenmask/pkg/toolkit"
)
@@ -142,6 +143,10 @@ func (r *Restore) Run(ctx context.Context) error {
        return fmt.Errorf("post-data stage restoration error: %w", err)
    }

    if err := filestore.Restore(ctx, r.cfg.Filestore, r.st); err != nil {
        return fmt.Errorf("filestore restoration error: %w", err)
    }

    return nil
}

32 changes: 32 additions & 0 deletions internal/domains/config.go
Expand Up @@ -97,12 +97,14 @@ type Dump struct {
    PgDumpOptions     pgdump.Options      `mapstructure:"pg_dump_options" yaml:"pg_dump_options" json:"pg_dump_options"`
    Transformation    []*Table            `mapstructure:"transformation" yaml:"transformation" json:"transformation,omitempty"`
    VirtualReferences []*VirtualReference `mapstructure:"virtual_references" yaml:"virtual_references" json:"virtual_references,omitempty"`
    Filestore         *FilestoreDump      `mapstructure:"filestore" yaml:"filestore" json:"filestore,omitempty"`
}

type Restore struct {
    PgRestoreOptions pgrestore.Options               `mapstructure:"pg_restore_options" yaml:"pg_restore_options" json:"pg_restore_options"`
    Scripts          map[string][]pgrestore.Script   `mapstructure:"scripts" yaml:"scripts" json:"scripts,omitempty"`
    ErrorExclusions  *DataRestorationErrorExclusions `mapstructure:"insert_error_exclusions" yaml:"insert_error_exclusions" json:"insert_error_exclusions,omitempty"`
    Filestore        *FilestoreRestore               `mapstructure:"filestore" yaml:"filestore" json:"filestore,omitempty"`
}

type TablesDataRestorationErrorExclusions struct {
@@ -122,6 +124,36 @@
    Global *GlobalDataRestorationErrorExclusions `mapstructure:"global" yaml:"global" json:"global,omitempty"`
}

type FilestoreDump struct {
    Enabled              bool               `mapstructure:"enabled" yaml:"enabled" json:"enabled,omitempty"`
    RootPath             string             `mapstructure:"root_path" yaml:"root_path" json:"root_path,omitempty"`
    FileList             string             `mapstructure:"file_list" yaml:"file_list" json:"file_list,omitempty"`
    IncludeListFile      string             `mapstructure:"include_list_file" yaml:"include_list_file" json:"include_list_file,omitempty"`
    IncludeListQuery     string             `mapstructure:"include_list_query" yaml:"include_list_query" json:"include_list_query,omitempty"`
    IncludeListQueryFile string             `mapstructure:"include_list_query_file" yaml:"include_list_query_file" json:"include_list_query_file,omitempty"`
    Subdir               string             `mapstructure:"subdir" yaml:"subdir" json:"subdir,omitempty"`
    ArchiveName          string             `mapstructure:"archive_name" yaml:"archive_name" json:"archive_name,omitempty"`
    MetadataName         string             `mapstructure:"metadata_name" yaml:"metadata_name" json:"metadata_name,omitempty"`
    UsePgzip             *bool              `mapstructure:"use_pgzip" yaml:"use_pgzip" json:"use_pgzip,omitempty"`
    FailOnMissing        bool               `mapstructure:"fail_on_missing" yaml:"fail_on_missing" json:"fail_on_missing,omitempty"`
    Split                FilestoreDumpSplit `mapstructure:"split" yaml:"split" json:"split,omitempty"`
}

type FilestoreDumpSplit struct {
    MaxSizeBytes int64 `mapstructure:"max_size_bytes" yaml:"max_size_bytes" json:"max_size_bytes,omitempty"`
    MaxFiles     int   `mapstructure:"max_files" yaml:"max_files" json:"max_files,omitempty"`
}

type FilestoreRestore struct {
    Enabled      bool   `mapstructure:"enabled" yaml:"enabled" json:"enabled,omitempty"`
    TargetPath   string `mapstructure:"target_path" yaml:"target_path" json:"target_path,omitempty"`
    Subdir       string `mapstructure:"subdir" yaml:"subdir" json:"subdir,omitempty"`
    MetadataName string `mapstructure:"metadata_name" yaml:"metadata_name" json:"metadata_name,omitempty"`
    UsePgzip     *bool  `mapstructure:"use_pgzip" yaml:"use_pgzip" json:"use_pgzip,omitempty"`
    CleanTarget  bool   `mapstructure:"clean_target" yaml:"clean_target" json:"clean_target,omitempty"`
    SkipExisting bool   `mapstructure:"skip_existing" yaml:"skip_existing" json:"skip_existing,omitempty"`
}

type TransformerConfig struct {
    Name               string `mapstructure:"name" yaml:"name" json:"name,omitempty"`
    ApplyForReferences bool   `mapstructure:"apply_for_references" yaml:"apply_for_references" json:"apply_for_references,omitempty"`