2 changes: 1 addition & 1 deletion NEWS.md
@@ -3,7 +3,7 @@
# version 1.0.1

+ Fixed CRAN errors
+ [Awesome Official Statistics](http://www.awesomeofficialstatistics.org) badge added
+ [Awesome Official Statistics](https://github.com/SNStatComp/awesome-official-statistics-software) badge added
+ Removed unnecessary dependency on the `RcppAlgos` package

# version 1.0.0
41 changes: 22 additions & 19 deletions README.Rmd
@@ -8,7 +8,7 @@ output: github_document
[![CRAN status](https://www.r-pkg.org/badges/version/blocking)](https://CRAN.R-project.org/package=blocking)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/blocking)](https://cran.r-project.org/package=blocking)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/blocking)](https://cran.r-project.org/package=blocking)
[![Mentioned in Awesome Official Statistics](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)
[![Mentioned in Awesome Official Statistics](https://awesome.re/mentioned-badge.svg)](https://github.com/SNStatComp/awesome-official-statistics-software)


<!-- badges: end -->
@@ -26,7 +26,7 @@ knitr::opts_chunk$set(

## Description

This R package is designed to block records for data deduplication and record linkage (also known as entity resolution) using [approximate nearest neighbours algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via the `igraph` package).
This R package is designed to block records for data deduplication and record linkage (also known as entity resolution) using [approximate nearest neighbor algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via the `igraph` package).

It supports the following R packages that bind to specific ANN algorithms:

@@ -39,23 +39,29 @@ The package can be used with the [reclin2](https://cran.r-project.org/package=re

## Installation

Install the GitHub blocking package with:
Install the stable version from CRAN:

```{r, eval=FALSE}
# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")
install.packages("blocking")
```

You can also install the development version from GitHub:

```{r, eval=FALSE}
# install.packages("pak") # uncomment if needed
pak::pkg_install("ncn-foreigners/blocking")
```

## Basic usage

Load packages for the examples
Load packages for the examples:

```{r}
library(blocking)
library(reclin2)
```

Generate simple data with three groups (`df_example`) and reference data (`df_base`).
Generate simple data with three groups (`df_example`) and reference data (`df_base`):

```{r}
df_example <- data.frame(txt = c(
@@ -69,18 +75,16 @@ df_example <- data.frame(txt = c(
"monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))

df_example

df_base
```

Deduplication using the `blocking` function. Output contains information:

+ the method used (where `nnd` which refers to the NN descent algorithm),
+ number of blocks created (here 2 blocks),
+ number of columns used for blocking, i.e. how many shingles were created by `text2vec` package (here 28),
+ reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.5714 which means blocking reduces comparison by over 57%).
+ the method used (`nnd` refers to the NN descent algorithm),
+ number of blocks created (here 2 blocks),
+ number of columns used for blocking, i.e., how many shingles were created by the `text2vec` package (here 28),
+ reduction ratio, i.e., how large the reduction of comparison pairs is (here 0.5714, which means blocking reduces comparisons by over 57%).

```{r}
blocking_result <- blocking(x = df_example$txt)
@@ -97,7 +101,7 @@ Table with blocking results contains:
blocking_result$result
```

Deduplication using the `pair_ann` function for integration with the `reclin2` package. Use the pipeline with the `reclin2` package.
Deduplication using the `pair_ann` function for integration with the `reclin2` package. Use the pipeline with the `reclin2` package:

```{r}
pair_ann(x = df_example, on = "txt") |>
@@ -107,7 +111,7 @@ pair_ann(x = df_example, on = "txt") |>
link(selection = "threshold")
```

Linking records using the same function where `df_base` is the "register" and `df_example` is the reference (data).
Linking records using the same function where `df_base` is the "register" and `df_example` is the reference data:

```{r}
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
@@ -124,15 +128,14 @@ See section `Data Integration (Statistical Matching and Record Linkage)` in [the
Packages that allow blocking:

+ [klsh](https://CRAN.R-project.org/package=klsh) -- k-means locality sensitive hashing,
+ [reclin2](https://CRAN.R-project.org/package=reclin2) -- `pair_blocking`, `pari_minsim` functions,
+ [reclin2](https://CRAN.R-project.org/package=reclin2) -- `pair_blocking`, `pair_minsim` functions,
+ [fastLink](https://CRAN.R-project.org/package=fastLink) -- `blockData` function.

Other:

+ [clevr](https://CRAN.R-project.org/package=clevr) -- evaluation of clustering, helper functions.
+ [exchanger](https://github.com/cleanzr/exchanger) -- bayesian Entity Resolution with Exchangeable Random Partition Priors
+ [clevr](https://CRAN.R-project.org/package=clevr) -- evaluation of clustering, helper functions,
+ [exchanger](https://github.com/cleanzr/exchanger) -- Bayesian Entity Resolution with Exchangeable Random Partition Priors.

## Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.

45 changes: 24 additions & 21 deletions README.md
@@ -12,7 +12,7 @@ downloads](https://cranlogs.r-pkg.org/badges/grand-total/blocking)](https://cran
[![CRAN
downloads](https://cranlogs.r-pkg.org/badges/blocking)](https://cran.r-project.org/package=blocking)
[![Mentioned in Awesome Official
Statistics](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)
Statistics](https://awesome.re/mentioned-badge.svg)](https://github.com/SNStatComp/awesome-official-statistics-software)

<!-- badges: end -->

@@ -22,7 +22,7 @@ Statistics](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficials

This R package is designed to block records for data deduplication and
record linkage (also known as entity resolution) using [approximate
nearest neighbours algorithms
nearest neighbor algorithms
(ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs
(via the `igraph` package).

@@ -43,16 +43,22 @@ The package can be used with the

## Installation

Install the GitHub blocking package with:
Install the stable version from CRAN:

``` r
# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")
install.packages("blocking")
```

You can also install the development version from GitHub:

``` r
# install.packages("pak") # uncomment if needed
pak::pkg_install("ncn-foreigners/blocking")
```

## Basic usage

Load packages for the examples
Load packages for the examples:

``` r
library(blocking)
@@ -61,7 +67,7 @@ library(reclin2)
```

Generate simple data with three groups (`df_example`) and reference data
(`df_base`).
(`df_base`):

``` r
df_example <- data.frame(txt = c(
@@ -75,7 +81,6 @@ df_example <- data.frame(txt = c(
"monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))

df_example
#> txt
#> 1 jankowalski
@@ -86,7 +91,6 @@ df_example
#> 6 pythonmonty
#> 7 cyrkmontypython
#> 8 monty

df_base
#> txt
#> 1 montypython
@@ -97,13 +101,12 @@ df_base
Deduplication using the `blocking` function. Output contains
information:

- the method used (where `nnd` which refers to the NN descent
algorithm),
- the method used (`nnd` refers to the NN descent algorithm),
- number of blocks created (here 2 blocks),
- number of columns used for blocking, i.e. how many shingles were
created by `text2vec` package (here 28),
- reduction ratio, i.e. how large is the reduction of comparison pairs
(here 0.5714 which means blocking reduces comparison by over 57%).
- number of columns used for blocking, i.e., how many shingles were
created by the `text2vec` package (here 28),
- reduction ratio, i.e., how large the reduction of comparison pairs is
(here 0.5714, which means blocking reduces comparisons by over 57%).
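
As a rough check on the reduction ratio quoted above, here is a minimal sketch in base R. It assumes the common definition (one minus the share of candidate pairs that remain after blocking) and assumes the two blocks hold 4 records each, which is the only two-block split of 8 records that reproduces 0.5714. It is not taken from the package's internals, and the variable names are illustrative.

```r
# Assumption: the 8 records in df_example are split into two blocks of 4.
n_records   <- 8
block_sizes <- c(4, 4)

# All unordered pairs that would be compared without blocking.
pairs_without_blocking <- choose(n_records, 2)

# Only pairs inside the same block are compared after blocking.
pairs_with_blocking <- sum(choose(block_sizes, 2))

# Share of comparisons that blocking avoids.
reduction_ratio <- 1 - pairs_with_blocking / pairs_without_blocking
round(reduction_ratio, 4)
#> [1] 0.5714
```

Under these assumptions, 28 possible pairs shrink to 12, so blocking avoids 16 of the 28 comparisons.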

``` r
blocking_result <- blocking(x = df_example$txt)
@@ -138,7 +141,7 @@ blocking_result$result
```

Deduplication using the `pair_ann` function for integration with the
`reclin2` package. Use the pipeline with the `reclin2` package.
`reclin2` package. Use the pipeline with the `reclin2` package:

``` r
pair_ann(x = df_example, on = "txt") |>
@@ -162,7 +165,7 @@ pair_ann(x = df_example, on = "txt") |>
```

Linking records using the same function where `df_base` is the
“register” and `df_example` is the reference (data).
“register” and `df_example` is the reference data:

``` r
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
@@ -196,16 +199,16 @@ Packages that allow blocking:
- [klsh](https://CRAN.R-project.org/package=klsh) – k-means locality
sensitive hashing,
- [reclin2](https://CRAN.R-project.org/package=reclin2) –
`pair_blocking`, `pari_minsim` functions,
`pair_blocking`, `pair_minsim` functions,
- [fastLink](https://CRAN.R-project.org/package=fastLink) – `blockData`
function.

Other:

- [clevr](https://CRAN.R-project.org/package=clevr) – evaluation of
clustering, helper functions.
- [exchanger](https://github.com/cleanzr/exchanger) – bayesian Entity
Resolution with Exchangeable Random Partition Priors
clustering, helper functions,
- [exchanger](https://github.com/cleanzr/exchanger) – Bayesian Entity
Resolution with Exchangeable Random Partition Priors.

## Funding
