Commit 7a53622

including method = :hash in joins
1 parent 0db1c1c commit 7a53622

1 file changed: +14 -1 lines changed
docs/src/man/joins.md

@@ -18,7 +18,7 @@ The main functions for combining two data sets are `leftjoin`, `innerjoin`, `out
 
 See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information.
 
-In general (for some special cases InMemoryDatasets may use "hash-join" techniques), to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. The matching is done based on the formatted values of the key column(s), however, using the `mapformats` keyword argument, one may set it to `false` for one or both data sets.
+By default, to match observations, InMemoryDatasets sorts the right data set and uses a binary search algorithm for finding the matches of each observation in the left data set in the right data set based on the passed key column(s), thus, it has better performance when the left data set is larger than the right data set. However, passing `method = :hash` changes the default. The matching is done based on the formatted values of the key column(s), however, using the `mapformats` keyword argument, one may set it to `false` for one or both data sets.
 
 For `leftjoin` and `innerjoin` the order of observations of the output data set is the same as their order in the left data set. However, the order of observations from the right table depends on the stability of the sort algorithm. User can set the `stable` keyword argument to `true` to guarantee a stable sort. For `outerjoin` the order of observations from the left data set in the output data set is also the same as their order in the original data set, however, for those observations which are from the right table, there is no specific order.
 
@@ -141,6 +141,13 @@ julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2], accelerate = true);
 155.306 ms (2160 allocations: 45.92 MiB)
 ```
 
+And of course for this example we can simply use the hash techniques for matching observations:
+
+```jldoctest
+julia> @btime innerjoin(dsl, dsr, on = [:x1=>:y1, :x2=>:y2], method = :hash);
+86.323 ms (1095 allocations: 96.95 MiB)
+```
+
 As it can be observed, using `accelerate = true` significantly reduces the joining time. The reason for this reduction is because currently sorting `String` type columns in InMemoryDatasets is relatively expensive, and using `accelerate = true` helps to reduce this by splitting the observations into multiple parts.
 
 ## `contains`
## `contains`
@@ -157,6 +164,8 @@ The `closejoin!` function does a close join in-place.
 
 A tolerance for finding close matches can be passed via the `tol` keyword argument, and for the situations where the exact match is not allowed, user can pass `allow_exact_match = false`.
 
+`closejoin/!` support `method = :hash` however, for the last key column it uses the sorting method to find the closest match.
+
 ### Examples
 
 ```jldoctest
@@ -320,6 +329,8 @@ For this kind of inner join, the key columns for both data sets which are define
 
 To change inequalities to strict inequality the `strict_inequality` keyword argument must be set to `true` for one or both sides, e.g. `strict_inequality = true`(both side), `strict_inequality = [false, true]`(only one side).
 
+`innerjoin` supports `method = :hash` for all key columns which are not used for inequality like join.
+
 ### Examples
 
 ```jldoctest
@@ -412,6 +423,8 @@ julia> innerjoin(store, roster, on = [:store => :store, :date => (:start_date, :
 
 The `update!` functions replace the main data set with the updated version, however, if a copy of the updated data set is required, the `update` function can be used instead.
 
+Like other join functions, one may pass `method = :hash` for using hash techniques to match observations.
+
 ### Examples
 
 ```jldoctest

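Reviewer note: the changed paragraph contrasts the default sort-plus-binary-search matching with `method = :hash`. The sketch below illustrates the two general strategies on plain vectors of keys, using only Base Julia. It is not InMemoryDatasets' implementation; the function names `sort_match` and `hash_match` and the sample keys are hypothetical and exist only for illustration.

```julia
# Illustrative only: plain Base Julia, not InMemoryDatasets internals.
# `sort_match` and `hash_match` are hypothetical helper names.

# Sort-based matching: sort the right keys once, then binary-search each left key.
function sort_match(left::Vector{T}, right::Vector{T}) where {T}
    perm = sortperm(right)        # permutation that sorts the right keys
    sorted = right[perm]
    matches = Tuple{Int,Int}[]    # (left row, right row) index pairs
    for (i, key) in enumerate(left)
        # searchsorted returns the range of positions in `sorted` equal to `key`
        for j in searchsorted(sorted, key)
            push!(matches, (i, perm[j]))
        end
    end
    return matches
end

# Hash-based matching: index the right keys in a Dict, then look up each left key.
function hash_match(left::Vector{T}, right::Vector{T}) where {T}
    index = Dict{T,Vector{Int}}()
    for (j, key) in enumerate(right)
        push!(get!(() -> Int[], index, key), j)
    end
    matches = Tuple{Int,Int}[]
    for (i, key) in enumerate(left)
        for j in get(index, key, Int[])
            push!(matches, (i, j))
        end
    end
    return matches
end

left = ["b", "a", "c", "a"]
right = ["a", "d", "b", "a"]

# Both strategies find the same set of (left, right) matches.
@assert sort(sort_match(left, right)) == sort(hash_match(left, right))
```

In this sketch the sort-based variant pays for sorting the right keys up front, while the hash-based variant avoids the sort but has to build and hold a lookup table. Which one is faster depends on the key types and data sizes.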