The mission of hablar is for you to get non-astonishing results! That
means that functions return what you expect. R has some unintuitive
quirks that both beginners and experienced programmers fail to identify.
Some of the first weird features of R that hablar solves:
- Missing values NA and irrational values Inf, NaN are dominant. For
  example, in R sum(c(1, 2, NA)) is NA and not 3. In hablar the addition
  of an underscore, sum_(c(1, 2, NA)), returns 3, as is often expected.
- Factors (categorical variables) that are converted to numeric return
  the number of the category rather than the value. In hablar the
  convert() function always changes the type of the values.
- Finding duplicates and rows with NA can be cumbersome. The functions
  find_duplicates() and find_na() make it easy to find where the data
  frame needs to be fixed. Once the issues are found, the utility
  replacement functions, e.g. if_else_(), if_na() and zero_if(), easily
  fix many of the most common problems you face.
hablar follows the syntax API of tidyverse and works seamlessly with
dplyr and tidyselect.
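The first quirk can be reproduced with base R alone. The following is a minimal sketch that does not use hablar; the variable names are illustrative, and filtering with is.finite() is only one way to approximate the behaviour described above, not a description of how hablar implements it.

```r
x <- c(1, 2, NA)

# A single NA makes the whole sum NA
naive_sum <- sum(x)

# Base R needs an explicit opt-out for missing values
clean_sum <- sum(x, na.rm = TRUE)

# Inf and NaN dominate in the same way; na.rm = TRUE does not drop Inf,
# so here only finite values are kept before summing
y <- c(1, 2, Inf, NaN)
finite_sum <- sum(y[is.finite(y)])

print(naive_sum)
print(clean_sum)
print(finite_sum)
```

Having to remember these opt-outs in every call is the kind of friction the underscore functions are meant to remove.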
A common issue in R is how R treats missing values (i.e. NA). Sometimes
NA in your data frame means that there are missing values in the sense
that you need to estimate or replace them with values. But often it is
not a problem! Often NA means that there is no value, and there should
not be one. hablar provides useful functions that handle NA intuitively.
Let’s take a simple example:
#> # A tibble: 3 × 3
#> name graduation_date age
#> <chr> <date> <int>
#> 1 Fredrik 2016-06-15 21
#> 2 Maria NA 16
#> 3 Astrid 2014-06-15 23
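The code that builds this example is not shown here. One way a tibble like the one printed above could be constructed is sketched below; the construction itself is an assumption, and only the column names and values are taken from the printed output.

```r
library(tibble)

# Hypothetical reconstruction of the example data
df <- tibble(
  name = c("Fredrik", "Maria", "Astrid"),
  # Maria has not graduated yet, so her date is NA on purpose
  graduation_date = as.Date(c("2016-06-15", NA, "2014-06-15")),
  age = c(21L, 16L, 23L)
)

print(df)
```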
min() to min_()
The graduation_date is missing for Maria. In this case it is not because
we do not know. It is because she has not graduated yet; she is younger
than Fredrik and Astrid. If we would like to know the first graduation
date of the three observations, a naive min() in R gives NA. But with
min_() from hablar we get the minimum value that is not missing. See:
df %>%
  mutate(min_baseR = min(graduation_date),
         min_hablar = min_(graduation_date))
#> # A tibble: 3 × 5
#> name graduation_date age min_baseR min_hablar
#> <chr> <date> <int> <date> <date>
#> 1 Fredrik 2016-06-15 21 NA 2014-06-15
#> 2 Maria NA 16 NA 2014-06-15
#> 3 Astrid 2014-06-15 23 NA 2014-06-15
The hablar package provides the same functionality for
- max_()
- mean_()
- median_()
- sd_()
- first_()
… and more. For more documentation, type help(min_) or vignette("s")
for an in-depth description.
In hablar the function convert() provides a robust, readable and dynamic
way to change the type of a column.
mtcars %>%
  convert(int(cyl, am),
          num(disp:drat))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The above chunk converts the columns cyl and am to integers, and the
columns disp through drat to numeric. If a column is of type factor,
convert() always converts it to character before further conversion.
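Why the detour through character matters can be seen in base R. This is a small sketch with made-up values, not hablar code: converting a factor directly to numeric yields the internal level codes, while going through character first recovers the values.

```r
f <- factor(c("10", "20", "10"))

# Direct conversion returns the category numbers (level codes)
codes <- as.numeric(f)

# Converting to character first returns the values themselves
values <- as.numeric(as.character(f))

print(codes)
print(values)
```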
With convert() and tidyselect you can easily change the type of a wide
range of columns.
mtcars %>%
  convert(
    chr(last_col()),      # Last column to character
    int(1:2),             # First two columns to integer
    fct(hp, wt),          # hp and wt to factors
    dte(vs),              # vs to date (if you really want)
    num(contains("car"))  # car as in carb to numeric
  )
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21 6 160.0 110 3.90 2.62 16.46 1970-01-01 1 4 4
#> Mazda RX4 Wag 21 6 160.0 110 3.90 2.875 17.02 1970-01-01 1 4 4
#> Datsun 710 22 4 108.0 93 3.85 2.32 18.61 1970-01-02 1 4 1
#> Hornet 4 Drive 21 6 258.0 110 3.08 3.215 19.44 1970-01-02 0 3 1
#> Hornet Sportabout 18 8 360.0 175 3.15 3.44 17.02 1970-01-01 0 3 2
#> Valiant 18 6 225.0 105 2.76 3.46 20.22 1970-01-02 0 3 1
#> Duster 360 14 8 360.0 245 3.21 3.57 15.84 1970-01-01 0 3 4
#> Merc 240D 24 4 146.7 62 3.69 3.19 20.00 1970-01-02 0 4 2
#> Merc 230 22 4 140.8 95 3.92 3.15 22.90 1970-01-02 0 4 2
#> Merc 280 19 6 167.6 123 3.92 3.44 18.30 1970-01-02 0 4 4
#> Merc 280C 17 6 167.6 123 3.92 3.44 18.90 1970-01-02 0 4 4
#> Merc 450SE 16 8 275.8 180 3.07 4.07 17.40 1970-01-01 0 3 3
#> Merc 450SL 17 8 275.8 180 3.07 3.73 17.60 1970-01-01 0 3 3
#> Merc 450SLC 15 8 275.8 180 3.07 3.78 18.00 1970-01-01 0 3 3
#> Cadillac Fleetwood 10 8 472.0 205 2.93 5.25 17.98 1970-01-01 0 3 4
#> Lincoln Continental 10 8 460.0 215 3.00 5.424 17.82 1970-01-01 0 3 4
#> Chrysler Imperial 14 8 440.0 230 3.23 5.345 17.42 1970-01-01 0 3 4
#> Fiat 128 32 4 78.7 66 4.08 2.2 19.47 1970-01-02 1 4 1
#> Honda Civic 30 4 75.7 52 4.93 1.615 18.52 1970-01-02 1 4 2
#> Toyota Corolla 33 4 71.1 65 4.22 1.835 19.90 1970-01-02 1 4 1
#> Toyota Corona 21 4 120.1 97 3.70 2.465 20.01 1970-01-02 0 3 1
#> Dodge Challenger 15 8 318.0 150 2.76 3.52 16.87 1970-01-01 0 3 2
#> AMC Javelin 15 8 304.0 150 3.15 3.435 17.30 1970-01-01 0 3 2
#> Camaro Z28 13 8 350.0 245 3.73 3.84 15.41 1970-01-01 0 3 4
#> Pontiac Firebird 19 8 400.0 175 3.08 3.845 17.05 1970-01-01 0 3 2
#> Fiat X1-9 27 4 79.0 66 4.08 1.935 18.90 1970-01-02 1 4 1
#> Porsche 914-2 26 4 120.3 91 4.43 2.14 16.70 1970-01-01 1 5 2
#> Lotus Europa 30 4 95.1 113 3.77 1.513 16.90 1970-01-02 1 5 2
#> Ford Pantera L 15 8 351.0 264 4.22 3.17 14.50 1970-01-01 1 5 4
#> Ferrari Dino 19 6 145.0 175 3.62 2.77 15.50 1970-01-01 1 5 6
#> Maserati Bora 15 8 301.0 335 3.54 3.57 14.60 1970-01-01 1 5 8
#> Volvo 142E 21 4 121.0 109 4.11 2.78 18.60 1970-01-02 1 4 2
For more information, see help(hablar) or vignette("convert").
When cleaning data you spend a lot of time understanding your data.
Sometimes you get more rows than you expected when doing a left_join().
Or you did not know that a certain column contained missing values NA or
irrational values like Inf or NaN.
In hablar the find_* functions speed up your search for the problem. To
find duplicated rows you simply run df %>% find_duplicates(). You can
also find duplicates in specific columns, which can be useful before
joins.
# Create df with duplicates
df <- mtcars %>%
  bind_rows(mtcars %>% slice(1, 5, 9))

# Return rows with duplicates in cyl and am
df %>%
  find_duplicates(cyl, am)
#> # A tibble: 35 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> # … with 31 more rows
#> # ℹ Use `print(n = ...)` to see more rows
There are also find functions for other cases. For example, find_na()
returns rows with missing values.
starwars %>%
  find_na(height)
#> # A tibble: 6 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA>
#> 2 Finn NA NA black dark dark NA male mascu… <NA>
#> 3 Rey NA NA brown light hazel NA fema… femin… <NA>
#> 4 Poe Dameron NA NA brown light brown NA male mascu… <NA>
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
If you would rather have a Boolean value, then e.g. check_duplicates()
returns TRUE if the data frame contains duplicates; otherwise it returns
FALSE.
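For comparison, a similar yes/no check can be written in base R with anyDuplicated(). This is only a rough analogue under the assumption that "duplicates" means repeated rows in the chosen columns; it is not how check_duplicates() is implemented.

```r
# Are there repeated combinations of cyl and am in mtcars?
key_cols <- mtcars[, c("cyl", "am")]
has_duplicates <- anyDuplicated(key_cols) > 0

# Row names in mtcars are unique, so this check is FALSE
has_duplicate_names <- anyDuplicated(rownames(mtcars)) > 0

print(has_duplicates)
print(has_duplicate_names)
```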
Let’s say that we have found that a problem is caused by missing values
in the column height and we want to replace all missing values with the
integer 100. hablar comes with additional ways of doing if-or-else.
starwars %>%
  find_na(height) %>%
  mutate(height = if_na(height, 100L))
#> # A tibble: 6 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Arvel Crynyd 100 NA brown fair brown NA male mascu… <NA>
#> 2 Finn 100 NA black dark dark NA male mascu… <NA>
#> 3 Rey 100 NA brown light hazel NA fema… femin… <NA>
#> 4 Poe Dameron 100 NA brown light brown NA male mascu… <NA>
#> # … with 2 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
In the chunk above we successfully replaced all missing heights with the
integer 100. hablar also contains the self-explanatory:
- if_zero() and zero_if()
- if_inf() and inf_if()
- if_nan() and nan_if()
which work in the same way as the examples above.
The generic function if_else_() provides the same rigidity as if_else()
in dplyr but adds some flexibility. In dplyr you need to specify which
type NA should have. In if_else_() you can write:
starwars %>%
  mutate(skin_color = if_else_(hair_color == "brown", NA, hair_color))
#> # A tibble: 87 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywal… 172 77 blond blond blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> <NA> yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> <NA> red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none none yellow 41.9 male mascu… Tatooi…
#> # … with 83 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
In if_else() from dplyr you would have had to specify NA_character_.
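As a point of comparison, here is a minimal sketch of the dplyr form with an explicitly typed missing value. The vector hair is made up for illustration. Recent dplyr releases may be more lenient about a plain NA; the typed NA_character_ works in either case.

```r
library(dplyr)

# Illustrative character vector, including one missing value
hair <- c("brown", "blond", NA, "brown")

# The replacement NA is given the same type as the other branch
result <- if_else(hair == "brown", NA_character_, hair)

print(result)
```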