Overloads the dplyr *_join
to accept an na_by
argument.
By default, joining using SQL does not match on NA
/ NULL
.
dbplyr *_join
s has the option "na_matches = na" to match on NA
/ NULL
but this is very inefficient in some
cases.
This function does the matching more efficiently:
If a column contains NA
/ NULL
, the names of these columns can be passed via the na_by
argument and
efficiently match as if "na_matches = na".
If no na_by
argument is given is given, the function defaults to using dplyr::*_join
.
Usage
# S3 method for class 'tbl_sql'
inner_join(x, y, by = NULL, ...)
# S3 method for class 'tbl_sql'
left_join(x, y, by = NULL, ...)
# S3 method for class 'tbl_sql'
right_join(x, y, by = NULL, ...)
# S3 method for class 'tbl_sql'
full_join(x, y, by = NULL, ...)
# S3 method for class 'tbl_sql'
semi_join(x, y, by = NULL, ...)
# S3 method for class 'tbl_sql'
anti_join(x, y, by = NULL, ...)
Arguments
- x, y
A pair of lazy data frames backed by database queries.
- by
A join specification created with
join_by()
, or a character vector of variables to join by.If
NULL
, the default,*_join()
will perform a natural join, using all variables in common acrossx
andy
. A message lists the variables so that you can check they're correct; suppress the message by supplyingby
explicitly.To join on different variables between
x
andy
, use ajoin_by()
specification. For example,join_by(a == b)
will matchx$a
toy$b
.To join by multiple variables, use a
join_by()
specification with multiple expressions. For example,join_by(a == b, c == d)
will matchx$a
toy$b
andx$c
toy$d
. If the column names are the same betweenx
andy
, you can shorten this by listing only the variable names, likejoin_by(a, c)
.join_by()
can also be used to perform inequality, rolling, and overlap joins. See the documentation at ?join_by for details on these types of joins.For simple equality joins, you can alternatively specify a character vector of variable names to join by. For example,
by = c("a", "b")
joinsx$a
toy$a
andx$b
toy$b
. If variable names differ betweenx
andy
, use a named character vector likeby = c("x_a" = "y_a", "x_b" = "y_b")
.To perform a cross-join, generating all combinations of
x
andy
, seecross_join()
.- ...
Other parameters passed onto methods.
Value
Another tbl_lazy
. Use show_query()
to see the generated
query, and use collect()
to execute the query
and return data to R.
See also
dplyr::mutate-joins which this function wraps.
dbplyr::join.tbl_sql which this function wraps.
Examples
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
band_db <- tbl_memdb(dplyr::band_members)
instrument_db <- tbl_memdb(dplyr::band_instruments)
left_join(band_db, instrument_db) |>
show_query()
#> Joining with `by = join_by(name)`
#> <SQL>
#> SELECT `dplyr::band_members`.*, `plays`
#> FROM `dplyr::band_members`
#> LEFT JOIN `dplyr::band_instruments`
#> ON (`dplyr::band_members`.`name` = `dplyr::band_instruments`.`name`)
# Can join with local data frames by setting copy = TRUE
left_join(band_db, dplyr::band_instruments, copy = TRUE)
#> Joining with `by = join_by(name)`
#> # Source: SQL [3 x 3]
#> # Database: sqlite 3.46.0 [:memory:]
#> name band plays
#> <chr> <chr> <chr>
#> 1 Mick Stones NA
#> 2 John Beatles guitar
#> 3 Paul Beatles bass
# Unlike R, joins in SQL don't usually match NAs (NULLs)
db <- memdb_frame(x = c(1, 2, NA))
label <- memdb_frame(x = c(1, NA), label = c("one", "missing"))
left_join(db, label, by = "x")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> x label
#> <dbl> <chr>
#> 1 1 one
#> 2 2 NA
#> 3 NA NA
# But you can activate R's usual behaviour with the na_matches argument
left_join(db, label, by = "x", na_matches = "na")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> x label
#> <dbl> <chr>
#> 1 1 one
#> 2 2 NA
#> 3 NA missing
# By default, joins are equijoins, but you can use `sql_on` to
# express richer relationships
db1 <- memdb_frame(x = 1:5)
db2 <- memdb_frame(x = 1:3, y = letters[1:3])
left_join(db1, db2) |> show_query()
#> Joining with `by = join_by(x)`
#> <SQL>
#> SELECT `dbplyr_1qeoeBlLyC`.`x` AS `x`, `y`
#> FROM `dbplyr_1qeoeBlLyC`
#> LEFT JOIN `dbplyr_9Iljj7AHPg`
#> ON (`dbplyr_1qeoeBlLyC`.`x` = `dbplyr_9Iljj7AHPg`.`x`)
left_join(db1, db2, sql_on = "LHS.x < RHS.x") |> show_query()
#> <SQL>
#> SELECT `LHS`.`x` AS `x.x`, `RHS`.`x` AS `x.y`, `y`
#> FROM `dbplyr_1qeoeBlLyC` AS `LHS`
#> LEFT JOIN `dbplyr_9Iljj7AHPg` AS `RHS`
#> ON (LHS.x < RHS.x)