
Releases: scholarsportal/dataverse-metadata-crawler

v0.1.4

25 Feb 21:53
fe51828

1. Feature updates

  1. Added a count of deaccessioned and draft datasets encountered during crawling to the log.
  2. Added an end-of-crawl message (✅ Crawling process completed successfully.)

2. Bug fixes

  1. Removed metadata for deaccessioned and draft datasets from failed_metadata_uris_yyyymmdd-HHMMSS.json. These metadata records now appear only in pid_dict_dd_yyyymmdd-HHMMSS.json.
  2. JSON files that were not created are no longer listed in the log output.
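
The yyyymmdd-HHMMSS part of the file names above is a timestamp pattern. A minimal Python sketch of how such a name could be built, assuming the standard strftime format; the helper name timestamped_name is hypothetical and not taken from the tool:

```python
from datetime import datetime


def timestamped_name(prefix: str, when: datetime) -> str:
    """Return '<prefix>_yyyymmdd-HHMMSS.json' for the given moment.

    Hypothetical helper for illustration; the crawler's own naming code may differ.
    """
    return f"{prefix}_{when.strftime('%Y%m%d-%H%M%S')}.json"


# Arbitrary example moment, chosen only to show the layout of the name.
example_name = timestamped_name("pid_dict_dd", datetime(2024, 1, 2, 3, 4, 5))
print(example_name)
```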

Full Changelog: v0.1.3...v0.1.4

v0.1.3

04 Feb 22:10

1. Feature updates

  1. Renamed example.ipynb to cloud_cli.ipynb to better reflect how the notebook is used.
  2. Updated cloud_cli.ipynb to accept BASE_URL and API_KEY interactively when creating the .env file.

2. Others

  1. Updated poetry-export_dependencies.yml (GitHub workflow file) so that requirements.txt and poetry.lock are updated automatically through CI/CD.
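
The interactive .env creation mentioned under the feature updates for this release could look roughly like the sketch below. Only the BASE_URL and API_KEY names come from the release notes; the write_env helper, the quoting style, and the example values are assumptions for illustration.

```python
from pathlib import Path


def write_env(path: Path, base_url: str, api_key: str = "") -> str:
    """Write BASE_URL and API_KEY to a .env file and return the written text.

    Hypothetical helper; the notebook's real implementation may differ.
    """
    lines = [
        f'BASE_URL="{base_url.strip()}"',
        f'API_KEY="{api_key.strip()}"',  # left empty for unauthenticated use
    ]
    content = "\n".join(lines) + "\n"
    path.write_text(content, encoding="utf-8")
    return content
```

In a notebook, the two values could be collected with input() and getpass.getpass() before calling write_env(Path(".env"), base_url, api_key), so that the key is not echoed to the cell output.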

Full Changelog: v0.1.2...v0.1.3

v0.1.2

03 Feb 16:34

1. Feature updates

  1. Added example.ipynb for launching the tool in Binder; no Git or Python installation is required.
  2. Updated connection checking: if the API_KEY supplied by the user is invalid, the tool now falls back to an unauthenticated connection for crawling.

2. Others

  1. Changed how the headers for GET requests are defined in MetaDataCrawler.
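
The fallback to an unauthenticated connection described in this release can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: choose_headers and the key_is_valid callback are hypothetical names, and the check itself (normally a request to the Dataverse API) is passed in as a function so the sketch runs without a network. X-Dataverse-key is the header Dataverse uses for API tokens.

```python
from typing import Callable, Dict, Optional


def choose_headers(
    api_key: Optional[str],
    key_is_valid: Callable[[str], bool],
) -> Dict[str, str]:
    """Return request headers, dropping the API key when it is missing or invalid.

    key_is_valid is a hypothetical callback that reports whether the server
    accepts the key.
    """
    headers = {"Accept": "application/json"}
    if api_key and key_is_valid(api_key):
        # Authenticated crawl: send the token with every request.
        headers["X-Dataverse-key"] = api_key
    # Otherwise fall back to unauthenticated headers.
    return headers
```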

Full Changelog: v0.1.1...v0.1.2

v0.1.1

28 Jan 22:17
7c60b04

1. Schema changes

  1. Each entry in ds_metadata is now keyed by its dataset ID (datasetId, the unique identifier for each dataset version in the Dataverse system) instead of its persistent identifier (datasetPersistentId). Example:
# Old version
{
  "doi:10.5072/FK2/DUGFC4": {  # datasetPersistentId
    "status": "OK",
    "data": {
      "id": 850,
      "datasetId": 2663,
      "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
...

# New version
{
  "2663": {  # datasetId
    "status": "OK",
    "data": {
      "id": 850,
      "datasetId": 2663,
      "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
...
  2. ds_metadata_yyyymmdd-HHMMSS.json now contains data, path_info and permission_info at the second level.
{
  ...
    "status": "OK",
    "data": {
    ...
    },
    "path_info": {
    ...
    },
    "permission_info": {
    ...
    },

  3. Renamed the following fields in path_info for consistency with the new schema:
collection_alias -> CollectionAlias
collection_id -> CollectionID
pid -> datasetPersistentId
ds_id -> datasetId
path_ids -> pathIds

# Old version
...
    "path_info": {
      "collection_alias": "toronto",
      "collection_id": 22,
      "pid": "doi:10.5072/FK2/DUGFC4",
      "ds_id": 2663,
      "path": "/Nick Field Dataverse",
      "path_ids": [
        2641
      ]
    }

# New version
...
    "path_info": {
      "CollectionAlias": "toronto",
      "CollectionID": 22,
      "datasetPersistentId": "doi:10.5072/FK2/DUGFC4",
      "datasetId": 2663,
      "path": "/Nick Field Dataverse",
      "pathIds": [
        2641
      ]
    }
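
For anyone holding output from an earlier release, the schema changes above could be applied with a small conversion like the sketch below. It assumes an old-style record that already carries a path_info block next to data; the function names are hypothetical, and the field names and sample values are taken from the examples above.

```python
from typing import Any, Dict

# Old path_info field name -> new field name, following the new-version example.
PATH_INFO_RENAMES = {
    "collection_alias": "CollectionAlias",
    "collection_id": "CollectionID",
    "pid": "datasetPersistentId",
    "ds_id": "datasetId",
    "path_ids": "pathIds",
}


def migrate_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one record with its path_info fields renamed."""
    migrated = dict(record)
    if "path_info" in record:
        migrated["path_info"] = {
            PATH_INFO_RENAMES.get(key, key): value  # unknown keys pass through
            for key, value in record["path_info"].items()
        }
    return migrated


def migrate_ds_metadata(old: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Re-key records by datasetId (as a string) instead of the persistent ID."""
    return {
        str(record["data"]["datasetId"]): migrate_record(record)
        for record in old.values()
    }


old_metadata = {
    "doi:10.5072/FK2/DUGFC4": {
        "status": "OK",
        "data": {"id": 850, "datasetId": 2663,
                 "datasetPersistentId": "doi:10.5072/FK2/DUGFC4"},
        "path_info": {"collection_alias": "toronto", "collection_id": 22,
                      "pid": "doi:10.5072/FK2/DUGFC4", "ds_id": 2663,
                      "path": "/Nick Field Dataverse", "path_ids": [2641]},
    }
}
new_metadata = migrate_ds_metadata(old_metadata)
print(list(new_metadata))
```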

2. Feature updates

  1. Combined the representation (-d) and permission (-p) metadata into a single JSON file, ds_metadata_yyyymmdd-HHMMSS.json.
  2. Added counts of the following dataset permission roles (DS_Collab, DS_Admin, DS_Contrib, DS_ContribPlus, DS_Curator, DS_FileDown, DS_Member) to the spreadsheet output. Only available when -p is enabled.

3. Bug fixes

  1. Corrected spelling mistakes in the README file.
  2. Restored missing fields for representation metadata in the spreadsheet:
  • TermsOfUse
  • CM_AuthorAff
  • CM_TimeEnd
  • CM_CollectionStart
  • CM_CollectionEnd
  3. Fixed handling of -f responses containing None objects.

v0.1.0

28 Jan 21:52