Description
The extracted address is truncated when it contains special characters such as accented letters (é, ü, etc.). The accents themselves render correctly in HTML; the problem is that the rest of the address after the special character is missing entirely. I wonder whether this comes from the PubMed data itself (returned as is) or from easyPubMed. Reprex:
library(easyPubMed)
dami_query <- "23978301 [pmid] OR 27337513 [pmid] OR 29697998 [pmid] OR 32833485 [pmid]"
dami_on_pubmed <- get_pubmed_ids(dami_query)
dami_abstracts_xml <- fetch_pubmed_data(dami_on_pubmed, encoding = "ASCII")
xx <- table_articles_byAuth(pubmed_data = dami_abstracts_xml,
                            included_authors = "first",
                            max_chars = 100,
                            autofill = TRUE)
#> Processing PubMed data .... done!
xx$address
#> [1] "Center for Wireless &"
#> [2] "Berlin School of Mind and Brain, Humboldt- Universitä"
#> [3] "Centre de Recherches sur la Cognition et l'Apprentissage, Department of Psychology, Université"
#> [4] "Center for Educational Science and Psychology, University of Tü"
Created on 2023-07-17 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 (2023-04-21 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Canada.utf8
#> ctype English_Canada.utf8
#> tz America/Toronto
#> date 2023-07-17
#> pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
#> easyPubMed * 2.13 2019-03-29 [1] CRAN (R 4.3.0)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
#> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
#> rmarkdown 2.22 2023-06-01 [1] CRAN (R 4.3.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> styler 1.9.1 2023-03-04 [1] CRAN (R 4.3.0)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] C:/Users/there/AppData/Local/R/win-library/4.3
#> [2] C:/Program Files/R/R-4.3.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Correct universities should have been (as displayed on the PubMed website):
- Center for Wireless & Population Health Systems, Qualcomm Institute
- Universität zu Berlin
- Université de Poitiers
- University of Tübingen
Seems related to #7. However, just to be clear, the problem is not the conversion between &amp; and &, but that the rest of the address is missing after any special character. This biases results against affiliations in non-English-speaking countries, where, for example, the correct university and country cannot be identified.
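To narrow down whether the truncation is already present in the PubMed data or is introduced by easyPubMed, one option is to pull the Affiliation elements straight out of the raw XML and compare them with xx$address. Below is a minimal offline sketch in base R. The names sample_xml, extract_affiliations() and cut_at_semicolon() are hypothetical, and the sample records are made up for illustration. The semicolon rule is only a guess at a mechanism that would produce this pattern, assuming the accents arrive as character references such as &#xe4; and I have not verified that easyPubMed works this way.

```r
# Hypothetical stand-in for the XML returned by fetch_pubmed_data();
# the records and the entity spellings below are made up for illustration.
sample_xml <- paste0(
  "<PubmedArticle><Affiliation>Berlin School of Mind and Brain, ",
  "Humboldt-Universit&#xe4;t zu Berlin</Affiliation></PubmedArticle>",
  "<PubmedArticle><Affiliation>Center for Wireless &amp; Population Health ",
  "Systems, Qualcomm Institute</Affiliation></PubmedArticle>"
)

# Pull the text of every <Affiliation> element using base R regular expressions.
extract_affiliations <- function(xml_string) {
  hits <- regmatches(
    xml_string,
    gregexpr("<Affiliation>.*?</Affiliation>", xml_string, perl = TRUE)
  )[[1]]
  gsub("</?Affiliation>", "", hits)
}

raw_aff <- extract_affiliations(sample_xml)

# Hypothetical trimming rule (an assumption, not confirmed easyPubMed behaviour):
# keep only the text before the first semicolon.
cut_at_semicolon <- function(x) sub(";.*$", "", x)

truncated <- cut_at_semicolon(raw_aff)

print(raw_aff)    # full affiliations as stored in the XML
print(truncated)  # what a semicolon-based cut would leave behind
```

With the real data, the same helper could be tried as extract_affiliations(dami_abstracts_xml), assuming fetch_pubmed_data() returns the XML as a single character string. If the full addresses show up there but not in xx$address, the truncation would be happening during table extraction rather than in the PubMed records.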