Extracted address is truncated when it contains special characters like accents

The extracted address is truncated when it contains special characters such as accents (é, ü, etc.). The accented characters themselves do come through, encoded as HTML entities; the problem is that the rest of the address after them is missing entirely. I wonder whether this comes from the PubMed data itself (returned as is) or from `easyPubMed`. Reprex:

``` r
library(easyPubMed)

dami_query <- "23978301 [pmid] OR 27337513 [pmid] OR 29697998 [pmid] OR 32833485 [pmid]"
dami_on_pubmed <- get_pubmed_ids(dami_query)
dami_abstracts_xml <- fetch_pubmed_data(dami_on_pubmed, encoding = "ASCII")
xx <- table_articles_byAuth(pubmed_data = dami_abstracts_xml, 
                            included_authors = "first", 
                            max_chars = 100, 
                            autofill = TRUE)
#> Processing PubMed data .... done!

xx$address
#> [1] "Center for Wireless &amp"                                                                          
#> [2] "Berlin School of Mind and Brain, Humboldt- Universit&#xe4"                                         
#> [3] "Centre de Recherches sur la Cognition et l'Apprentissage, Department of Psychology, Universit&#xe9"
#> [4] "Center for Educational Science and Psychology, University of T&#xfc"
```

<sup>Created on 2023-07-17 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

<details style="margin-bottom:10px;">
<summary>
Session info
</summary>

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Canada.utf8
#>  ctype    English_Canada.utf8
#>  tz       America/Toronto
#>  date     2023-07-17
#>  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
#>  easyPubMed  * 2.13    2019-03-29 [1] CRAN (R 4.3.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.2   2023-04-25 [1] CRAN (R 4.3.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.22    2023-06-01 [1] CRAN (R 4.3.0)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.3.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.9.1   2023-03-04 [1] CRAN (R 4.3.0)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/there/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
```

</details>

The correct affiliations, as displayed on the PubMed website, should have been:
1. Center for Wireless & Population Health Systems, Qualcomm Institute
2. Universität zu Berlin
3. Université de Poitiers
4. University of Tübingen

---
Seems related to #7. However, just to be clear, the problem is not the conversion from `&` to `&amp`; it is that the rest of the address is missing after any special character. This biases results against affiliations from non-English-speaking countries, for which the correct university and country cannot be identified.
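
One possible explanation, which I have not verified against the `easyPubMed` source: every truncated address above ends right before the `;` that would close an HTML entity (`&amp;`, `&#xe4;`, `&#xe9;`, `&#xfc;`), which is what you would get if the address were cut at the first semicolon. The sketch below reproduces that pattern in base R and shows a conceivable workaround, decoding numeric character references before any such cut. `decode_numeric_entities()` and `cut_at_semicolon()` are hypothetical helpers written for this illustration, not `easyPubMed` functions, and the `raw` string is an assumed reconstruction built from the fragments shown above.

``` r
# Hypothetical helper (not part of easyPubMed): replace numeric character
# references such as "&#xe9;" (hex) or "&#233;" (decimal) with the
# characters they encode. Named entities like "&amp;" are left untouched.
decode_numeric_entities <- function(x) {
  vapply(x, function(s) {
    m <- gregexpr("&#([xX][0-9A-Fa-f]+|[0-9]+);", s)
    refs <- regmatches(s, m)[[1]]
    if (length(refs) == 0) return(s)
    decoded <- vapply(refs, function(ref) {
      body <- substr(ref, 3, nchar(ref) - 1)  # drop leading "&#" and trailing ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), 16L)       # hexadecimal reference
      } else {
        strtoi(body, 10L)                     # decimal reference
      }
      intToUtf8(code)
    }, character(1), USE.NAMES = FALSE)
    regmatches(s, m) <- list(decoded)
    s
  }, character(1), USE.NAMES = FALSE)
}

# Assumed stand-in for the suspected behaviour: keep only the text
# before the first semicolon.
cut_at_semicolon <- function(x) sub(";.*$", "", x)

# Assumed raw affiliation text, reconstructed for illustration only.
raw <- "Department of Psychology, Universit&#xe9; de Poitiers"

truncated <- cut_at_semicolon(raw)                          # ends at "Universit&#xe9"
repaired  <- cut_at_semicolon(decode_numeric_entities(raw)) # keeps " de Poitiers"
```

If this is indeed the cause, named entities such as `&amp;` would need the same treatment, which this sketch does not attempt.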
