Merged
26 changes: 24 additions & 2 deletions README.md
@@ -1,8 +1,13 @@
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/65a67b0ee65b8ffd3b91572f/nMqy0cpsj15EIliGzfOo_.png" alt="LeMaterial" width="700"/>
</p>


# LeMaterial-Fetcher

`lematerial-fetcher` is designed to fetch data from a specified OPTIMADE's compatible JSON-API, process it, and store it in a PostgreSQL database. It is highly concurrent, to handle data fetching and processing efficiently.
LeMaterial-Fetcher is designed to fetch data from any external source, process it, and store it in a PostgreSQL database in a pre-defined format with structure-level validation and database-level validators. It is highly concurrent, to handle data fetching and processing efficiently.

The objective is to retrieve information from various OPTIMADE sources and establish a local database. This database will enable us to process and utilize the data according to our specific requirements, which can then be uploaded to an online and easily accessible place like Hugging Face.
The objective is to retrieve information from various sources and establish a local database that can be unified. This database will enable us to process and utilize the data according to our specific requirements, which can then be uploaded to an online and easily accessible place like Hugging Face.

**Explore the datasets built with this tool on [Hugging Face](https://huggingface.co/LeMaterial)** 🤗:

@@ -29,18 +34,21 @@ We gratefully acknowledge these projects and their dedication to open materials
## Installation

1. Clone the repository:

```bash
git clone git@github.com:LeMaterial/lematerial-fetcher.git
cd lematerial-fetcher
```

2. Set up your environment variables. Copy the provided template and customize it:

```bash
cp .env.example .env
vim .env
```

3. Install the package:

```bash
# Using uv (recommended)
uv add git+https://github.com/LeMaterial/lematerial-fetcher.git
@@ -91,6 +99,7 @@ lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
### Available Commands

1. **Materials Project (MP)**

```bash
# Fetch structures
lematerial-fetcher mp fetch --table-name mp_structures --num-workers 4
@@ -103,6 +112,7 @@ lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
```

2. **Alexandria**

```bash
# Fetch structures
lematerial-fetcher alexandria fetch --table-name alex_structures --functional pbe
@@ -115,6 +125,7 @@ lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
```

3. **OQMD**

```bash
# Fetch data
lematerial-fetcher oqmd fetch --table-name oqmd_structures
@@ -124,6 +135,7 @@ lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
```

4. **Push to Hugging Face**

```bash
lematerial-fetcher push --table-name my_table --hf-repo-id my-repo
```
@@ -133,31 +145,36 @@ lematerial-fetcher [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
These options are available across most commands:

#### Database Options

- `--db-conn-str STR`: Complete database connection string
- `--db-user USER`: Database username
- `--db-host HOST`: Database host (default: localhost)
- `--db-name NAME`: Database name (default: lematerial)

#### Processing Options

- `--num-workers N`: Number of parallel workers
- `--log-dir DIR`: Directory for logs (default: ./logs)
- `--max-retries N`: Maximum retry attempts (default: 3)
- `--retry-delay N`: Delay between retries in seconds (default: 2)
- `--log-every N`: Log frequency (default: 1000)
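
The retry options can be pictured with a small sketch. This is not the tool's implementation: `with_retries` and its `sleep` parameter are hypothetical names, the defaults mirror `--max-retries` (3) and `--retry-delay` (2 seconds), and it assumes that "retries" counts the attempts made after the first failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to max_retries times with a fixed delay.

    Hypothetical helper: the real fetcher may count attempts or back off
    differently.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            # Give up once the retry budget is spent and re-raise the error.
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(retry_delay)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without waiting for real delays.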

#### Fetch Options

- `--offset N`: Starting offset (default: 0)
- `--table-name NAME`: Target table name
- `--limit N`: Items per API request (default: 500)
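
A minimal sketch of how `--offset` and `--limit` could translate into successive API requests, assuming plain offset-based pagination. `page_offsets` and its `total` argument are hypothetical; the actual fetchers may page differently for each source.

```python
from typing import Iterator


def page_offsets(
    total: int, offset: int = 0, limit: int = 500
) -> Iterator[tuple[int, int]]:
    """Yield (offset, page_size) pairs covering items offset..total-1.

    Hypothetical helper; `total` stands for the number of items the
    remote source reports.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    while offset < total:
        # The last page may be shorter than `limit`.
        yield offset, min(limit, total - offset)
        offset += limit
```

Under these assumptions, restarting an interrupted fetch amounts to passing the last completed offset back in, for example `page_offsets(1200, offset=1000)`.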

#### Transformer Options

- `--batch-size N`: Batch processing size (default: 500)
- `--dest-table-name NAME`: Destination table name
- `--traj`: Transform trajectory data

### Examples

1. **Fetch from Materials Project with custom configuration**:

```bash
lematerial-fetcher mp fetch \
--table-name mp_structures \
@@ -168,6 +185,7 @@ These options are available across most commands:
```

2. **Transform Alexandria data with source and destination databases**:

```bash
lematerial-fetcher alexandria transform \
--table-name source_table \
@@ -178,6 +196,7 @@ These options are available across most commands:
```

3. **Push to Hugging Face with custom chunk size**:

```bash
lematerial-fetcher push \
--table-name my_table \
@@ -193,6 +212,7 @@ These options are available across most commands:
You can configure the database connection in two ways:

1. **Using individual parameters**:

```bash
# Set password in environment
export LEMATERIALFETCHER_DB_PASSWORD=your_password
@@ -202,13 +222,15 @@ You can configure the database connection in two ways:
```

2. **Using a connection string**:

```bash
lematerial-fetcher mp fetch --db-conn-str="host=localhost user=username password=password dbname=database_name sslmode=disable"
```

### MySQL Configuration (for OQMD)

MySQL-specific options:

- `--mysql-host HOST`: MySQL host (default: localhost)
- `--mysql-user USER`: MySQL username
- `--mysql-database NAME`: MySQL database name (default: lematerial)
6 changes: 6 additions & 0 deletions pyproject.toml
@@ -21,6 +21,9 @@ dependencies = [
"beautifulsoup4>=4.13.3",
"datasets>=3.4.1",
"ijson>=3.3.0",
"moyopy>=0.4.2",
"ase>=3.24.0",
"material-hasher",
]

[project.scripts]
@@ -51,6 +54,9 @@ dev-dependencies = [
"botocore>=1.36.20",
]

[tool.uv.sources]
material-hasher = { git = "https://github.com/LeMaterial/lematerial-hasher.git" }


[tool.ruff.lint]
extend-select = ["I"]
30 changes: 25 additions & 5 deletions src/lematerial_fetcher/database/postgres.py
@@ -483,13 +483,17 @@ def columns(cls) -> dict[str, str]:
"last_modified": "TIMESTAMP",
"stress_tensor": "FLOAT[][]",
"energy": "FLOAT",
"energy_corrected": "FLOAT",
"magnetic_moments": "FLOAT[]",
"forces": "FLOAT[][]",
"total_magnetization": "FLOAT",
"dos_ef": "FLOAT",
"charges": "FLOAT[]",
"band_gap_indirect": "FLOAT",
"functional": "TEXT",
"space_group_it_number": "INTEGER",
"cross_compatibility": "BOOLEAN",
"entalpic_fingerprint": "FLOAT[]",
"bawl_fingerprint": "TEXT",
}

def _prepare_species_data(self, species: list[dict[str, Any]]) -> list[Json]:
@@ -557,13 +561,17 @@ def insert_data(self, structure: OptimadeStructure) -> None:
structure.last_modified,
structure.stress_tensor,
structure.energy,
structure.energy_corrected,
structure.magnetic_moments,
structure.forces,
structure.total_magnetization,
structure.dos_ef,
structure.charges,
structure.band_gap_indirect,
structure.functional,
structure.space_group_it_number,
structure.cross_compatibility,
structure.entalpic_fingerprint,
structure.bawl_fingerprint,
)
cur.execute(query, input_data)
self.conn.commit()
@@ -620,13 +628,17 @@ def batch_insert_data(
structure.last_modified,
structure.stress_tensor,
structure.energy,
structure.energy_corrected,
structure.magnetic_moments,
structure.forces,
structure.total_magnetization,
structure.dos_ef,
structure.charges,
structure.band_gap_indirect,
structure.functional,
structure.space_group_it_number,
structure.cross_compatibility,
structure.entalpic_fingerprint,
structure.bawl_fingerprint,
)
)

@@ -717,13 +729,17 @@ def insert_data(self, structure: Trajectory) -> None:
structure.last_modified,
structure.stress_tensor,
structure.energy,
structure.energy_corrected,
structure.magnetic_moments,
structure.forces,
structure.total_magnetization,
structure.dos_ef,
structure.charges,
structure.band_gap_indirect,
structure.functional,
structure.space_group_it_number,
structure.cross_compatibility,
structure.entalpic_fingerprint,
structure.bawl_fingerprint,
# trajectory-specific fields
structure.relaxation_step,
structure.relaxation_number,
@@ -783,13 +799,17 @@ def batch_insert_data(
structure.last_modified,
structure.stress_tensor,
structure.energy,
structure.energy_corrected,
structure.magnetic_moments,
structure.forces,
structure.total_magnetization,
structure.dos_ef,
structure.charges,
structure.band_gap_indirect,
structure.functional,
structure.space_group_it_number,
structure.cross_compatibility,
structure.entalpic_fingerprint,
structure.bawl_fingerprint,
# trajectory-specific fields
structure.relaxation_step,
structure.relaxation_number,
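
The `postgres.py` hunks list the same fields once in `columns()` and again in each of the four value tuples, so the tuples have to stay in step with the column order by hand. A minimal sketch of one way to derive both from a single mapping follows. `COLUMNS`, `build_insert_query`, and `row_values` are hypothetical names, only a subset of the columns is shown, and the `%s` placeholder style is assumed from the `cur.execute(query, input_data)` call in the diff.

```python
from types import SimpleNamespace

# Subset of the column mapping shown in the diff; the real mapping is longer.
COLUMNS: dict[str, str] = {
    "energy": "FLOAT",
    "energy_corrected": "FLOAT",
    "functional": "TEXT",
    "space_group_it_number": "INTEGER",
    "cross_compatibility": "BOOLEAN",
    "entalpic_fingerprint": "FLOAT[]",
    "bawl_fingerprint": "TEXT",
}


def build_insert_query(table: str, columns: dict[str, str]) -> str:
    """Build a parameterized INSERT whose column order follows `columns`.

    `table` is assumed to be a trusted identifier, not user input.
    """
    names = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({names}) VALUES ({placeholders})"


def row_values(structure: object, columns: dict[str, str]) -> tuple:
    """Read one attribute per column, in the same order as the query."""
    return tuple(getattr(structure, name) for name in columns)


# Stand-in for a structure object, for illustration only.
example = SimpleNamespace(
    energy=-1.5,
    energy_corrected=-1.4,
    functional="pbe",
    space_group_it_number=225,
    cross_compatibility=True,
    entalpic_fingerprint=[0.1, 0.2],
    bawl_fingerprint="abc",
)
query = build_insert_query("structures", COLUMNS)
values = row_values(example, COLUMNS)
```

With this shape, adding a field means touching only the mapping. The trajectory-specific fields would still need to be appended separately.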