dbExistsTable
to catch update AWS error
message.dbplyr 2.3.3.9000+
AWS_ROLE_ARN
. This caused
confusing when connecting through web identity (RAthena #
177)dbplyr::in_catalog
when working with
dplyr::tbl
(RAthena #
178)dbClearResult
(RAthena #
168). Thanks to @juhoautio for the request.paws
parameters (RAthena #
169)endpoint_override
parameter allow default
endpoints for each service to be overridden accordingly (RAthena #
169). Thanks to @aoyh for the request and checking the package
in development.test_data
to use
size
parameter explicitly.noctua_options
to change 1 parameter at a time
without affecting other pre-configured settingsretry_quiet
parameter in noctua_options
function.dbplyr
2.0.0 backend API.dplyr
to benefit from AWS Athena unload
methods (#174).dbGetQuery
, dbExecute
,
dbSendQuery
, dbSendStatement
work on older
versions of R
(#170). Thanks to @tyner for identifying issue.AWS Athena UNLOAD
(#160). This is to take advantage of read/write speed
parquet
has to offer.import awswrangler as wr
import getpass
= getpass.getpass()
bucket = f"s3://{bucket}/data/"
path
if "awswrangler_test" not in wr.catalog.databases().values:
"awswrangler_test")
wr.catalog.create_database(
= ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
cols
= wr.s3.read_csv(
df ="s3://noaa-ghcn-pds/csv/189",
path=cols,
names=["dt", "obs_time"]) # Read 10 files from the 1890 decade (~1GB)
parse_dates
wr.s3.to_parquet(=df,
df=path,
path=True,
dataset="overwrite",
mode="awswrangler_test",
database="noaa"
table;
)
="awswrangler_test", table="noaa") wr.catalog.table(database
library(DBI)
<- dbConnect(noctua::athena())
con
# Query ran using CSV output
system.time({
= dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
df
})# Info: (Data scanned: 80.88 MB)
# user system elapsed
# 57.004 8.430 160.567
::noctua_options(cache_size = 1)
noctua
# Query ran using UNLOAD Parquet output
system.time({
= dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
df
})# Info: (Data scanned: 80.88 MB)
# user system elapsed
# 21.622 2.350 39.232
# Query ran using cache
system.time({
= dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
df
})# Info: (Data scanned: 80.88 MB)
# user system elapsed
# 13.738 1.886 11.029
sql_translate_env
correctly translates R functions
quantile
and median
to AWS Athena
equivalents (#153). Thanks to @ellmanj for spotting issue.AWS Athena
timestamp with time zone
data type.list
when converting data to
AWS Athena
SQL
format.library(data.table)
library(DBI)
= 5
x
= data.table(
dt var1 = sample(LETTERS, size = x, T),
var2 = rep(list(list("var3"= 1:3, "var4" = list("var5"= letters[1:5]))), x)
)
<- dbConnect(noctua::athena())
con
#> Version: 2.2.0
sqlData(con, dt)
# Registered S3 method overwritten by 'jsonify':
# method from
# print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
# var1 var2
# 1: 1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2: 2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3: 3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4: 4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5: 5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
#> Version: 2.1.0
sqlData(con, dt)
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
# var1 var2
# 1: 1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2: 2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3: 3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4: 4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5: 5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
v-2.2.0 now converts lists into json lines format so that AWS Athena
can parse with sql
array
/mapping
/json
functions.
Small down side a s3 method conflict occurs when jsonify is called to
convert lists into json lines. jsonify
was choose in favor
to jsonlite
due to the performance improvements (#156).
dbIsValid
wrongly stated connection is valid for result
class when connection class was disconnected.sql_translate_env.paste
broke with latest version of
dbplyr
. New method is compatible with
dbplyr>=1.4.3
(#149).sql_translate_env
: add support for
stringr
/lubridate
style functions, similar to
Postgres
backend.write_bin
now doesn’t chunk writeBin if R version is
greater than 4.0.0
https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97 (#149)dbConnect
add timezone
parameter so that
time zone between R
and AWS Athena
is
consistent.AthenaConnection
class: ptr
and
info
slots changed from list
to
environment
with in AthenaConnect
class.
Allows class to be updated by reference. Simplifies notation when
viewing class from RStudio environment tab.AthenaResult
class: info
slot changed from
list
to environment
. Allows class to be
updated by reference.By utilising environments for AthenaConnection
and
AthenaResult
, all AthenaResult
classes created
from AthenaConnection
will point to the same
ptr
and info
environments for it’s connection.
Previously ptr
and info
would make a copy.
This means if it was modified it would not affect the child or parent
class for example:
# Old Method
library(DBI)
<- dbConnect(noctua::athena(),
con rstudio_conn_tab = F)
<- dbExecute(con, "select 'helloworld'")
res
# modifying parent class to influence child
@info$made_up <- "helloworld"
con
# nothing happened
@connection@info$made_up
res# > NULL
# modifying child class to influence parent
@connection@info$made_up <- "oh no!"
res
# nothing happened
@info$made_up
con# > "helloworld"
# New Method
library(DBI)
<- dbConnect(noctua::athena(),
con rstudio_conn_tab = F)
<- dbExecute(con, "select 'helloworld'")
res
# modifying parent class to influence child
@info$made_up <- "helloworld"
con
# picked up change
@connection@info$made_up
res# > "helloworld"
# modifying child class to influence parent
@connection@info$made_up <- "oh no!"
res
# picked up change
@info$made_up
con# > "oh no!"
AWS Athena
data types
[array, row, map, json, binary, ipaddress]
(#135).
Conversion types can be changed through dbConnect
and
noctua_options
.library(DBI)
library(noctua)
# default conversion methods
<- dbConnect(noctua::athena())
con
# change json conversion method
noctua_options(json = "character")
:::athena_option_env$json
noctua# [1] "character"
# change json conversion to custom method
noctua_options(json = jsonify::from_json)
:::athena_option_env$json
noctua# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
# change bigint conversion without affecting custom json conversion methods
noctua_options(bigint = "numeric")
:::athena_option_env$json
noctua# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
:::athena_option_env$bigint
noctua# [1] "numeric"
# change binary conversion without affect, bigint or json methods
noctua_options(binary = "character")
:::athena_option_env$json
noctua# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
:::athena_option_env$bigint
noctua# [1] "numeric"
:::athena_option_env$binary
noctua# [1] "character"
# no conversion for json objects
<- dbConnect(noctua::athena(), json = "character")
con2
# use custom json parser
<- dbConnect(noctua::athena(), json = jsonify::from_json) con
rstudio_conn_tab
within dbConnect
.AWS Athena
uses float
data type for the
DDL only, noctua
was wrongly parsing float
data type back to R. Instead AWS Athena
uses data type
real
in SQL functions like select cast
https://docs.aws.amazon.com/athena/latest/ug/data-types.html.
noctua
now correctly parses real
to R’s data
type double
(#133)AWS
returns to get all
results from AWS Glue
catalogue (#137)dbGetPartition
. This
simply tidies up the default AWS Athena partition format.library(DBI)
library(noctua)
<- dbConnect(athena())
con dbGetPartition(con, "test_df2", .format = T)
# Info: (Data scanned: 0 Bytes)
# year month day
# 1: 2020 11 17
dbGetPartition(con, "test_df2")
# Info: (Data scanned: 0 Bytes)
# partition
# 1: year=2020/month=11/day=17
bigint
, this is
to align with other DBI interfaces i.e. RPostgres
. Now
bigint
can be return in the possible formats: [“integer64”,
“integer”, “numeric”, “character”]library(DBI)
con <- dbConnect(noctua::athena(), bigint = "numeric")
When switching between the different file parsers the
bigint
to be represented according to the file parser
i.e. data.table
: “integer64” -> vroom
:
“I”.
dbRemoveTable
: Check if key has “.” or ends with “/”
before adding “/” to the end (#125)uuid
minimum version to fix issue (#128)Error: write_parquet requires the arrow package, please install it first and try again
dbRemoveTable
would error if AWS S3 files for Athena
table have been removed:Error in seq.default(1, length(l), 1000) : wrong sign in 'by' argument
Now a warning message will be returned:
Warning message:
Failed to remove AWS S3 files from: "s3://{bucket}/{prefix}/". Please check if AWS S3 files exist.
dbRemoveTable
now removes AWS S3 objects using
delete_objects
instead of delete_object
. This
allows noctua
to delete AWS S3 files in batches. This will
reduce the number of api calls to AWS and comes with a performance
improvement.library(DBI)
library(data.table)
<- 1010
X <- data.table(x = 1:X,
value y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
<- dbConnect(noctua::athena())
con
# create a removable table with 1010 parquet files in AWS S3.
dbWriteTable(con, "rm_tbl", value, file.type = "parquet", overwrite = T, max.batch = 1)
# old method: delete_object
system.time({dbRemoveTable(con, "rm_tbl", confirm = T)})
# user system elapsed
# 31.004 8.152 115.906
# new method: delete_objects
system.time({dbRemoveTable(con, "rm_tbl", confirm = T)})
# user system elapsed
# 17.319 0.370 22.709
sql_escape_date
into
dplyr_integration.R
backend (RAthena: #
121).use_deprecated_int96_timestamps
set to TRUE
.
This puts POSIXct data type in to java.sql.Timestamp
compatible format, such as yyyy-MM-dd HH:mm:ss[.f...]
.
Thanks to Christian N Wolz for highlight this issue.dbRemoveTable
will ask the user twice to confirm if they
wish to remove the backend files:Info: The S3 objects in prefix will be deleted:
s3://bucket/path/schema/table
Info: The S3 objects in prefix will be deleted:
s3://bucket/path/schema/table
To overcome this dbRemoveTable
will opt for
paws::s3()$list_objects_v2
instead of
paws::s3()$list_objects
when listing s3 objects to be
deleted. This allows noctua
to iterate over AWS s3 prefix
using tokens, instead of deleting objects in chunks. *
s3_upload_location
simplified how s3 location is built. Now
s3.location parameter isn’t affected and instead only additional
components e.g. name, schema and partition. *
dbplyr v-2.0.0
function in_schema
now wraps
strings in quotes, this breaks
db_query_fields.AthenaConnection
. Now
db_query_fields.AthenaConnection
removes any quotation from
the string so that it can search AWS GLUE
for table
metadata. (#117)
start_query_execution
parameter
ClientRequestToken
. This so that the
ClientRequestToken
is “A unique case-sensitive string used
to ensure the request to create the query is idempotent (executes only
once).” (#104)R
has been
interrupt a new parameter has been added to dbConnect
,
keyboard_interrupt
. Example:# Stop AWS Athena when R has been interrupted:
<- dbConnect(noctua::athena())
con
# Let AWS Athena keep running when R has been interrupted:
<- dbConnect(noctua::athena(),
con keyboard_interrupt = F)
noctua
would return a
data.frame
for utility SQL
queries regardless
of backend file parser. This is due to AWS Athena
outputting SQL UTILITY
queries as a text file that required
to be read in line by line. Now noctua
will return the
correct data format based on file parser set in
noctua_options
for example:
noctua_options("vroom")
will return
tibbles
.dbClearResult
when user doesn’t have permission to delete
AWS S3 objects (#96)noctua_options
contains 2 new parameters to control how
noctua
handles retries.dbFetch
is able to return data from AWS Athena in
chunk. This has been achieved by passing NextToken
to
AthenaResult
s4 class. This method won’t be as fast
n = -1
as each chunk will have to be process into data
frame format.library(DBI)
<- dbConnect(noctua::athena())
con <- dbExecute(con, "select * from some_big_table limit 10000")
res dbFetch(res, 5000)
dbWriteTable
opts to use alter table
instead
of standard msck repair table
. This is to improve
performance when appending to tables with high number of existing
partitions.dbWriteTable
now allows json to be appended to json
ddls created with the Openx-JsonSerDe library.dbConvertTable
brings dplyr::compute
functionality to base package, allowing noctua
to use the
power of AWS Athena to convert tables and queries to more efficient file
formats in AWS S3 (RAthena: #
37).dplyr::compute
to give same functionality of
dbConvertTable
region_name
check before making a connection to
AWS Athena (RAthena: #
110)dbWriteTable
would throw throttling error
every now and again, retry_api_call
as been built to handle
the parsing of data between R and AWS S3.dbWriteTable
did not clear down all metadata when
uploading to AWS Athena
dbWriteTable
added support ddl structures for user who
have created ddl’s outside of noctua
noctua
retry
functionality\dontrun
(#91)pyathena
, noctua_options
now
has a new parameter cache_size
. This implements local
caching in R environments instead of using AWS
list_query_executions
. This is down to
dbClearResult
clearing S3’s Athena output when caching
isn’t disablednoctua_options
now has clear_cache
parameter to clear down all cached data.dbRemoveTable
now utilise AWS Glue
to
remove tables from AWS Glue
catalog. This has a performance
enhancement:library(DBI)
= dbConnect(noctua::athena())
con
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
# 0.247 0.091 2.243
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
# 0.110 0.045 1.094
dbWriteTable
now supports uploading json lines
(http://jsonlines.org/) format up to AWS Athena
(#88).library(DBI)
= dbConnect(noctua::athena())
con dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
dbConnect
didn’t correct pass .internal
metadata for paws objects.computeHostName
&
computeDisplayName
now get region name from
info
object from dbConnect
S4 class.dbWriteTable
appending to existing table compress file
type was incorrectly return.Rstudio connection tab
comes into an issue when Glue
Table isn’t stored correctly (RAthena: #
92)AWS_REGION
into
dbConnect
writeBin
: Only 2^31 - 1 bytes can be written in a
single call (and that is the maximum capacity of a raw vector on 32-bit
platforms). This means that it will error out with large raw
connections. To over come this writeBin
can be called in
chunks. If readr
is available on system then
readr::write_file
is used for extra speed.library(readr)
library(microbenchmark)
# creating some dummy data for testing
<- 1e8
X <-
df data.frame(
w = runif(X),
x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
write_csv(df, "test.csv")
# read in text file into raw format
<- readBin("test.csv", what = "raw", n = file.size("test.csv"))
obj
format(object.size(obj), units = "auto")
# 3.3 Gb
# writeBin in a loop
<- function(
write_bin
value,
filename,chunk_size = 2L ^ 20L) {
<- length(value)
total_size <- seq(1, total_size, chunk_size)
split_vec
<- file(filename, "a+b")
con on.exit(close(con))
sapply(split_vec, function(x){writeBin(value[x:min(total_size,(x+chunk_size-1))],con)})
invisible(TRUE)
}
microbenchmark(writeBin_loop = write_bin(obj, tempfile()),
readr = write_file(obj, tempfile()),
times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# R_loop 41.463273 41.62077 42.265778 41.908908 42.022042 44.313893 5
# readr 2.291571 2.40495 2.496871 2.542544 2.558367 2.686921 5
sql_translate_env
(RAthena: #
44)# Before
translate_sql("2019-01-01", con = con)
# '2019-01-01'
# Now
translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
fwrite
(>=1.12.4)
https://github.com/Rdatatable/data.table/blob/master/NEWS.mdpaste
/paste0
would use default
dplyr:sql-translate-env
(concat_ws
).
paste0
now uses Presto’s concat
function and
paste
now uses pipes to get extra flexibility for custom
separating values.# R code:
paste("hi", "bye", sep = "-")
# SQL translation:
'hi'||'-'||'bye') (
append
set to
TRUE
then existing s3.location will be utilised (RAthena: #
73)db_compute
returned table name, however when a user
wished to write table to another location (RAthena: #
74). An error would be raised:
Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist
This has now been fixed with db_compute returning
dbplyr::in_schema
.library(DBI)
library(dplyr)
<- dbConnect(noctua::athena())
con
tbl(con, "iris") %>%
compute(name = "temp.iris")
dbListFields
didn’t display partitioned columns. This
has now been fixed with the call to AWS Glue being altered to include
more metadata allowing for column names and partitions to be
returned.dbListFields
dbStatistics
is a wrapper around paws
get_query_execution
to return statistics for
noctua::dbSendQuery
resultsdbGetQuery
has new parameter statistics
to
print out dbStatistics
before returning Athena
results.noctua_options
vroom
has been restricted to >= 1.2.0
due to integer64 support and changes to vroom
apidplyr::tbl
when calling Athena when using the ident
method (#64):library(DBI)
library(dplyr)
<- dbConnect(noctua::athena())
con
# ident method:
<- system.time(tbl(con, "iris"))
t1
# sub query method:
<- system.time(tbl(con, sql("select * from iris")))
t2
# ident method
# user system elapsed
# 0.082 0.012 0.288
# sub query method
# user system elapsed
# 0.993 0.138 3.660
dplyr
sql_translate_env: expected results have now been
updated to take into account bug fix with date fieldsdata.table
to vroom
. From now on
it is possible to change file parser using noctua_options
for example:library(noctua)
noctua_options("vroom")
dbGetTables
that returns Athena hierarchy
as a data.framevroom
roxygen2
7.0.2dbWriteTable
append
parameter checks and uses
existing AWS Athena DDL file type. If file.type
doesn’t
match Athena DDL file type then user will receive a warning
message:warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type,'".', call. = FALSE)
INTEGER
being incorrectly translated
in sql_translate_env.R
as.character
was getting wrongly
translated (RAthena: #
45)data-transfer
dbRemoveTable
new parameters are added in unit
testsql_translate_env
until test to cater bug
fixdbRemoveTable
can now remove S3 files for AWS Athena table
being removed.tolower
conversion due to request (RAthena: #
41)dbWriteTable
now will split gzip
compressed files to improve AWS Athena performance. By default
gzip
compressed files will be split into 20.Performance results
library(DBI)
<- 1e8
X <- data.frame(w =runif(X),
df x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
<- dbConnect(noctua::athena())
con # upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = T, max.batch = nrow(df), overwrite = T) # no splits
dbWriteTable(con, "test_split2", df, compress = T, max.batch = 0.05 * nrow(df), overwrite = T) # 20 splits
dbWriteTable(con, "test_split3", df, compress = T, max.batch = 0.1 * nrow(df), overwrite = T) # 10 splits
AWS Athena performance results from AWS console (query executed:
select count(*) from ....
):
library(DBI)
<- 1e8
X <- data.frame(w =runif(X),
df x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
<- dbConnect(noctua::athena())
con dbWriteTable(con, "test_split1", df, compress = T, overwrite = T) # default will now split compressed file into 20 equal size files.
Added information message to inform user about what files have been added to S3 location if user is overwriting an Athena table.
copy_to
method now supports compress and max_batch, to
align with dbWriteTable
dbWriteTable
POSIXct
to Athena. This class was
convert incorrectly and AWS Athena would return NA instead.
noctua
will now correctly convert POSIXct
to
timestamp but will also correct read in timestamp into
POSIXct
NA
in string format. Before noctua
would
return NA
in string class as ""
this has now
been fixed.noctua
would translate output into a vector with current
the method dbFetch
n = 0.sql_translate_env
. Previously noctua
would take the default dplyr::sql_translate_env
, now
noctua
has a custom method that uses Data types from:
https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window
functions from:
https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.htmlPOSIXct
class has now been added to data transfer unit
testdplyr sql_translate_env
tests if R functions are
correct translated in to Athena sql
syntax.dbWriteTable
is called. The bug is due to function
sqlCreateTable
which dbWriteTable
calls.
Parameters table
and fields
were set to
NULL
. This has now been fixed.s3.location
parameter is dbWriteTable
can
now be made nullableupload_data
has been rebuilt and
removed the old “horrible” if statement with paste
now the
function relies on sprintf
to construct the s3 location
path. This method now is a lot clearer in how the s3 location is created
plus it enables a dbWriteTable
to be simplified.
dbWriteTable
can now upload data to the default s3_staging
directory created in dbConnect
this simplifies
dbWriteTable
to :library(DBI)
<- dbConnect(noctua::athena())
con
dbWriteTable(con, "iris", iris)
data transfer
test now tests compress, and default
s3.location when transferring datadbWriteTable
sqlCreateTable
info message will now only inform user
if colnames have changed and display the column name that have
changedconfig = list()
parameter is paws
objectsBigInt
are now passed correctly into
integer64
AthenaResult
returned:
Error in call[[2]] : object of type 'closure' is not subsettable
.
The function do.call
was causing the issue, to address this
do.call
has been removed and the helper function
request
has been broken down into
ResultConfiguration
to return a single component of
start_query_execution
do.call
have been broken
down due to error:
Error in call[[2]] : object of type 'closure' is not subsettable
bigint
to integer64
in data.transfer
unit testpaws
version has been set to a minimum of
0.1.5
due to latest change.data.table
is now used as the default file parser
data.table::fread
/ data.table::fwrite
. This
isn’t a breaking change as data.table
was used before
however this change makes data.table
to default file
parser.dbConnect
method can use the following methods:
arrow
packageassume_role
developed method for user to assume role
when connecting to AWS Athena