Releases: scholarsportal/dataverse-metadata-crawler
v0.1.4
1. Feature updates
- Added a count of deaccessioned/draft datasets crawled to the log.
- Added an end-of-crawl message (✅ Crawling process completed successfully.)
2. Bug fixes
- Removed deaccessioned/draft dataset metadata from `failed_metadata_uris_yyyymmdd-HHMMSS.json`. These metadata records now appear only in `pid_dict_dd_yyyymmdd-HHMMSS.json`.
- The log no longer lists JSON output files that were not actually created.
Full Changelog: v0.1.3...v0.1.4
v0.1.3
1. Feature updates
- Renamed `example.ipynb` to `colud_cli.ipynb` to better represent the use of the notebook.
- Updated `colud_cli.ipynb` to support interactive `BASE_URL` and `API_KEY` input for creating the `.env` file.
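Writing the two interactively collected values to a `.env` file can be sketched as follows (`BASE_URL` and `API_KEY` are the settings named above; the helper itself is hypothetical, not the notebook's actual code):

```python
# Hypothetical sketch: persist BASE_URL and API_KEY to a .env file
# after prompting the user, as the notebook's interactive input does.

def write_env(base_url: str, api_key: str, path: str = ".env") -> str:
    """Write the two settings in KEY=VALUE form and return the file contents."""
    content = f"BASE_URL={base_url}\nAPI_KEY={api_key}\n"
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return content
```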
2. Others
- Updated `poetry-export_dependencies.yml` (GitHub workflow file) to update the `requirements.txt` and `poetry.lock` files in a CI/CD manner.
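A workflow of this shape might look like the following (a sketch only; the triggers and steps in the repository's actual `poetry-export_dependencies.yml` may differ):

```yaml
# Hypothetical sketch of a dependency-export workflow, not the real file.
name: Export dependencies
on:
  push:
    paths: ["pyproject.toml"]
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx install poetry
      - run: poetry lock
      - run: poetry export -f requirements.txt --output requirements.txt --without-hashes
```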
Full Changelog: v0.1.2...v0.1.3
v0.1.2
1. Feature updates
- Added `example.ipynb` for launching the tool with no Git or Python install required.
- Updated the connection check: if the `API_KEY` input by the user is invalid, the tool now falls back to an unauthenticated connection for crawling.
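The fallback behaviour can be sketched as follows (`X-Dataverse-key` is the standard Dataverse authentication header; the helper and its injected `probe` callable are hypothetical, not the crawler's actual implementation):

```python
# Sketch of falling back to an unauthenticated connection when the
# supplied API key is rejected (hypothetical helper, not the crawler's code).

def resolve_headers(api_key, probe):
    """Return request headers for crawling.

    `probe` is an injected callable that takes a headers dict and returns
    True when the Dataverse server accepts the request.
    """
    if api_key:
        headers = {"X-Dataverse-key": api_key}
        if probe(headers):
            return headers  # authenticated crawling
    return {}  # fall back to unauthenticated crawling
```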
2. Others
- Moved the definition of headers for GET requests into `MetaDataCrawler`.
Full Changelog: v0.1.1...v0.1.2
v0.1.1
1. Schema changes
- The keys in `ds_metadata` now use dataset IDs (the unique numeric identifier for each dataset in the Dataverse system) instead of dataset persistent IDs. Example:
```
# Old version
{
    "doi:10.5072/FK2/DUGFC4": {  # datasetPersistentId
        "status": "OK",
        "data": {
            "id": 850,
            "datasetId": 2663,
            "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
            ...

# New version
{
    "2663": {  # datasetId
        "status": "OK",
        "data": {
            "id": 850,
            "datasetId": 2663,
            "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
            ...
```
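Downstream code that consumed the old export can be re-keyed with a small transform like this (an illustrative sketch, not part of the tool):

```python
# Sketch: convert an old export keyed by datasetPersistentId into the
# new shape keyed by datasetId (illustrative only).

def rekey_by_dataset_id(old):
    return {str(entry["data"]["datasetId"]): entry for entry in old.values()}

old = {
    "doi:10.5072/FK2/DUGFC4": {
        "status": "OK",
        "data": {"id": 850, "datasetId": 2663,
                 "datasetPersistentId": "doi:10.5072/FK2/DUGFC4"},
    }
}
new = rekey_by_dataset_id(old)
```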
- `ds_metadata_yyyymmdd-HHMMSS.json` now contains `data`, `path_info` and `permission_info` at the second level:
```
{
    ...
    "status": "OK",
    "data": {
        ...
    },
    "path_info": {
        ...
    },
    "permission_info": {
        ...
    },
```
- Renamed the following fields in `path_info` for consistency with the new schema:
  - `collection_alias` -> `CollectionAlias`
  - `collection_id` -> `CollectionID`
  - `pid` -> `datasetPersistentId`
  - `ds_id` -> `datasetId`
  - `path_ids` -> `pathIds`
```
# Old version
...
"path_info": {
    "collection_alias": "toronto",
    "collection_id": 22,
    "pid": "doi:10.5072/FK2/DUGFC4",
    "ds_id": 2663,
    "path": "/Nick Field Dataverse",
    "path_ids": [
        2641
    ]
}

# New version
...
"path_info": {
    "CollectionAlias": "toronto",
    "CollectionID": 22,
    "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
    "datasetId": 2663,
    "path": "/Nick Field Dataverse",
    "pathIds": [
        2641
    ]
}
```
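For consumers of `path_info`, the rename can be applied mechanically (an illustrative sketch, not part of the tool):

```python
# Sketch: map old path_info field names to the new schema (illustrative).

RENAMES = {
    "collection_alias": "CollectionAlias",
    "collection_id": "CollectionID",
    "pid": "datasetPersistentId",
    "ds_id": "datasetId",
    "path_ids": "pathIds",
}

def upgrade_path_info(old):
    # Fields not listed in RENAMES (e.g. "path") pass through unchanged.
    return {RENAMES.get(k, k): v for k, v in old.items()}
```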
2. Feature updates
- Combined the representation (`-d`) and permission (`-p`) metadata into `ds_metadata_yyyymmdd-HHMMSS.json` as a single JSON file.
- Added permission role counts per dataset (`DS_Collab`, `DS_Admin`, `DS_Contrib`, `DS_ContribPlus`, `DS_Curator`, `DS_FileDown`, `DS_Member`) to the spreadsheet output. Only available if `-p` is enabled.
3. Bug Fixes
- Corrected spelling mistakes in the README file.
- Restored missing fields for representation metadata in the spreadsheet: `TermsOfUse`, `CM_AuthorAff`, `CM_TimeEnd`, `CM_CollectionStart`, `CM_CollectionEnd`.
- Fixed handling of `-f` responses containing `None` objects.
v0.1.0
- Initial release
Full Changelog: https://github.com/scholarsportal/dataverse-metadata-crawler/commits/v0.1.0