Commit e5c7a3b: 18 changed files with 4,196 additions and 0 deletions.
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Lui"
    given-names: "Lok Hei"
    orcid: "https://orcid.org/0000-0001-5077-1530"
title: "Dataverse metadata Crawler"
version: 0.1.0
date-released: 2025-01-16
url: "https://github.com/kenlhlui/dataverse-metadata-crawler-p"
MIT License

Copyright (c) 2025 Lok Hei Lui

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
[](https://www.repostatus.org/#active)
[](https://opensource.org/license/mit)
[](https://dataverse.org/)
[](https://github.com/psf/black)

# Dataverse Metadata Crawler


## 📜Description
A Python CLI tool for extracting and exporting metadata from [Dataverse](https://dataverse.org/) repositories. It supports bulk extraction of dataverse, dataset, and data file metadata from any chosen level of a Dataverse collection (the whole repository or a sub-Dataverse), with flexible export options to JSON and CSV formats.

## ✨Features
1. Bulk metadata extraction from Dataverse repositories at any chosen level of collection (top level or a selected collection)
2. JSON & CSV export options

## 📦Prerequisites
1. Git
2. Python 3.10+

## ⚙️Installation

1. Clone the repository
```sh
git clone https://github.com/scholarsportal/dataverse-metadata-crawler.git
```

2. Change to the project directory
```sh
cd dataverse-metadata-crawler
```

3. Create an environment file (.env)
```sh
touch .env   # For Unix/macOS
nano .env    # or vim .env, or your preferred editor
# OR
New-Item .env -Type File   # For Windows (PowerShell)
notepad .env
```

4. Configure the environment file using your preferred text editor
```sh
# .env file
BASE_URL = "TARGET_REPO_URL" # e.g., "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY"     # Found in your Dataverse account settings. You may also supply it on the CLI (with the -a flag)
```
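The crawler reads these values from the `.env` file at startup. As a rough sketch of what that parsing amounts to (the actual tool may use a library such as python-dotenv; the `load_env` helper below is a hypothetical stand-in, not the tool's real code):

```python
import os
import tempfile
from pathlib import Path

def load_env(path: Path) -> None:
    """Minimal .env loader: put KEY = "value" lines into os.environ."""
    for line in path.read_text().splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if '=' in line:
            key, _, value = line.partition('=')
            os.environ[key.strip()] = value.strip().strip('"')

# Demonstrate with a throwaway file rather than a real .env
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / '.env'
    env_file.write_text('BASE_URL = "https://demo.borealisdata.ca/"  # target repo\n')
    load_env(env_file)

print(os.environ['BASE_URL'])  # https://demo.borealisdata.ca/
```

Keeping the token in `.env` (rather than on the command line) also keeps it out of your shell history.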

5. Set up a virtual environment (recommended)
```sh
python3 -m venv .venv
source .venv/bin/activate  # For Unix/macOS
# OR
.venv\Scripts\activate     # For Windows
```

6. Install dependencies
```sh
pip install -r requirements.txt
```

## 🛠️Usage

### Basic Command
```sh
python3 dvmeta/main.py [-a AUTH] [-l] [-d] [-p] [-f] [-e] [-s] -c COLLECTION_ALIAS -v VERSION
```

**Required arguments:**

| **Option** | **Short** | **Type** | **Description** | **Default** |
|---|---|---|---|---|
| --collection_alias | -c | TEXT | Name of the collection to crawl. **[required]** | None |
| --version | -v | TEXT | The dataset version to crawl. Options: <br/> • `draft` - the draft version, if any <br/> • `latest` - either a draft (if one exists) or the latest published version <br/> • `latest-published` - the latest published version <br/> • `x.y` - a specific version <br/> **[required]** | None |

**Optional arguments:**

| **Option** | **Short** | **Type** | **Description** | **Default** |
|---|---|---|---|---|
| --auth | -a | TEXT | Authentication token to access the Dataverse repository. | None |
| --log <br/> --no-log | -l | | Output a log file. Use `--no-log` to disable logging. | `log` (unless `--no-log`) |
| --dvdfds_metadata | -d | | Output a JSON file containing metadata of dataverses, datasets, and data files. | |
| --permission | -p | | Output a JSON file storing permission metadata for all datasets in the repository. | |
| --emptydv | -e | | Output a JSON file storing all dataverses that contain **no** datasets (they may still have child dataverses that do). | |
| --failed | -f | | Output a JSON file of dataverses/datasets that failed to be crawled. | |
| --spreadsheet | -s | | Output a CSV file of dataset metadata. | |
| --help | | | Show the help message. | |

### Examples
```sh
# Export the metadata of the latest version of all datasets under the collection 'demo' to JSON
python3 dvmeta/main.py -c demo -v latest -d

# Export the metadata of version 1.0 of all datasets under the collection 'demo' to JSON and CSV
python3 dvmeta/main.py -c demo -v 1.0 -d -s

# Export the metadata and permission metadata of version 1.0 of all datasets under the collection 'demo' to JSON and CSV, passing the API token on the CLI
python3 dvmeta/main.py -c demo -v 1.0 -d -s -p -a xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
```

## 📂Output Structure

| File | Description |
|---|---|
| ds_metadata_yyyymmdd-HHMMSS.json | Metadata of datasets and their data files, in JSON format. |
| empty_dv_yyyymmdd-HHMMSS.json | The IDs of empty dataverse(s), in list format. |
| failed_metadata_uris_yyyymmdd-HHMMSS.json | The URIs (URLs) of datasets that failed to be downloaded. |
| permission_dict_yyyymmdd-HHMMSS.json | The permission metadata of datasets, keyed by dataset ID. |
| pid_dict_yyyymmdd-HHMMSS.json | Basic dataset info with hierarchical information. Only exported if the -p (permission) flag is used without the -d (metadata) flag. |
| pid_dict_dd_yyyymmdd-HHMMSS.json | The hierarchical information of deaccessioned/draft datasets. |
| ds_metadata_yyyymmdd-HHMMSS.csv | Metadata of datasets and their data files, in CSV format. |
| log_yyyymmdd-HHMMSS.txt | Summary of the crawl. |

```sh
exported_files/
├── json_files/
│   ├── ds_metadata_yyyymmdd-HHMMSS.json          # With -d flag enabled
│   ├── empty_dv_yyyymmdd-HHMMSS.json             # With -e flag enabled
│   ├── failed_metadata_uris_yyyymmdd-HHMMSS.json # With -f flag enabled
│   ├── permission_dict_yyyymmdd-HHMMSS.json      # With -p flag enabled
│   ├── pid_dict_yyyymmdd-HHMMSS.json             # Only exported if -p is used without -d
│   └── pid_dict_dd_yyyymmdd-HHMMSS.json          # Hierarchical information of deaccessioned/draft datasets
├── csv_files/
│   └── ds_metadata_yyyymmdd-HHMMSS.csv           # With -s flag enabled
└── log_files/
    └── log_yyyymmdd-HHMMSS.txt                   # Exported by default, unless --no-log is specified
```
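The yyyymmdd-HHMMSS suffix in these filenames is an `strftime`-style timestamp. A sketch of how such a name can be generated (the tool's actual code may differ):

```python
from datetime import datetime

# Build a timestamped filename like ds_metadata_20250116-134502.json
stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
filename = f'ds_metadata_{stamp}.json'
print(filename)
```

Because the timestamp is fixed-width and sorts lexicographically, listing the directory shows exports in chronological order.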

## ✅Tests
No tests have been written yet. Contributions are welcome!

## 💻Development
1. Dependency management: [poetry](https://python-poetry.org/) - update dependency changes in pyproject.toml
2. Linter: [ruff](https://docs.astral.sh/ruff/) - linting rules are defined in pyproject.toml

## 🙌Contributing
1. Fork the repository
2. Create a feature branch
3. Submit a pull request

## 📄License
[MIT](https://choosealicense.com/licenses/mit/)

## 🆘Support
- Create an issue in the GitHub repository

## 📚Citation
If you use this software in your work, please cite it using the following metadata.

APA:
```
Lui, L. H. (2025). Dataverse metadata Crawler (Version 0.1.0) [Computer software]. https://github.com/kenlhlui/dataverse-metadata-crawler-p
```

BibTeX:
```
@software{Lui_Dataverse_metadata_Crawler_2025,
  author = {Lui, Lok Hei},
  month = jan,
  title = {{Dataverse metadata Crawler}},
  url = {https://github.com/kenlhlui/dataverse-metadata-crawler-p},
  version = {0.1.0},
  year = {2025}
}
```

## ✍️Authors
Ken Lui - Data Curation Specialist, Map and Data Library, University of Toronto - [email protected]
"""Module to manage the directories for exported files.""" | ||
|
||
from pathlib import Path | ||
|
||
|
||
class DirManager: | ||
"""Class to manage directories and files in the data vault.""" | ||
|
||
def __init__(self) -> None: | ||
"""Initialize the class with the base directory for exported files.""" | ||
self.export_base_dir = r'./exported_files' | ||
self.res_dir = r'./res' | ||
|
||
@staticmethod | ||
def _create_dir(path: Path) -> Path: | ||
"""Helper method to create a directory if it doesn't exist. | ||
Args: | ||
path (Path): The path to the directory. | ||
Returns: | ||
str: The path to the directory. | ||
""" | ||
if not Path.exists(path): | ||
Path(path).mkdir(parents=True, exist_ok=True) | ||
return path | ||
|
||
def json_files_dir(self) -> Path: | ||
"""Create a new directory to store json files. | ||
Returns: | ||
Path: The path to the new directory. | ||
""" | ||
return self._create_dir(Path(self.export_base_dir) / 'json_files') | ||
|
||
def log_files_dir(self) -> Path: | ||
"""Create a new directory to store log files. | ||
Returns: | ||
Path: The path to the new directory. | ||
""" | ||
return self._create_dir(Path(self.export_base_dir) / 'log_files') | ||
|
||
def csv_files_dir(self) -> Path: | ||
"""Create a new directory to store csv files. | ||
Returns: | ||
Path: The path to the new directory. | ||
""" | ||
return self._create_dir(Path(self.export_base_dir) / 'csv_files') |
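The `_create_dir` pattern can be exercised in isolation. A self-contained sketch that reimplements the same helper against a temporary base directory, so nothing is written to `./exported_files`:

```python
import tempfile
from pathlib import Path

def create_dir(path: Path) -> Path:
    """Create the directory (and any parents) if missing, then return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as base:
    json_dir = create_dir(Path(base) / 'json_files')
    created = json_dir.is_dir()  # directory exists while the tempdir lives
    name = json_dir.name         # 'json_files'

print(created, name)
```

Calling the helper twice is safe: `exist_ok=True` makes the second call a no-op rather than raising `FileExistsError`.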