Package starschemar
offers operations to transform flat
tables into star schemas and also to export or exploit them through
queries.
From a flat table, we define a dimensional model classifying its attributes as facts or dimensions (dimensional modelling).
From a flat table and a dimensional model we obtain a star schema that we can transform; from various star schemas we can define a constellation (star schema and constellation definition).
Dimensions contain rows without duplicates, we can apply operations to perform data cleaning and to conform them (cleaning and conforming data).
When new data is obtained, it is necessary to refresh the existing data with them by means of incremental refresh operations (incremental refresh).
The results obtained can be exported to be consulted with other tools (exporting results).
Finally, basic queries can be performed from R (query functions), especially to select the data to export.
In this document, these operations available in the package are briefly presented.
Starting from a flat table, a dimensional model is defined specifying
the attributes that make up each of the dimensions and the measurements
in the facts. The result is a dimensional_model
object. It
is carried out through the following functions:
dimensional_model()
: An empty
dimensional_model
object is created in which definition of
facts and dimensions can be added. Example:define_dimension()
: To define a dimension in a
dimensional_model
object, we have to define its name and
the set of attributes that make it up. Example:dm <- dimensional_model() |>
define_dimension(name = "When",
attributes = c("Week Ending Date",
"WEEK",
"Year"))
define_fact()
: To define facts in a
dimensional_model
object, the essential data is a name and
a set of measurements that can be empty (does not have explicit
measurements). Associated with each measurement, an aggregation function
is required, which by default is SUM. Examples:A dimensional model is implemented using a star schema. We can have several related star schemas through common dimensions that together form a fact constellation.
A star schema is defined from a flat table and a dimensional model definition. Once defined, a star schema can be transformed by defining role playing dimensions, changing the writing style of element names or the type of dimension attributes. These operations are carried out through the following functions:
star_schema()
: Creates a star_schema
object from a flat table (implemented by a tibble
) and a
dimensional_model
object. Example:role_playing_dimension()
: Given a list of
star_schema
dimension names, all with the same structure, a
role playing dimension with the indicated name and attributes is
generated. The original dimensions become role dimensions defined from
the new role playing dimension. Example:st <- star_schema(mrs_age, dm_mrs_age) |>
role_playing_dimension(
dim_names = c("when", "when_available"),
name = "When Common",
attributes = c("Date", "Week", "Year")
)
snake_case()
: Transform fact, dimension, measurement,
and attribute names according to the snake case style. Example:character_dimensions()
: Transforms numeric type
attributes of dimensions into character type. In a
star_schema
numerical data are measurements that are
situated in the facts. Numerical data in dimensions are usually codes,
day, week, month or year numbers. There are tools that consider any
numerical data to be a measurement, for this reason it is appropriate to
transform the numerical data of dimensions into character data. It also
allows indicating the literal to be used in case the numerical value is
not defined. Example:Once a star schema is defined, we can rename its elements. It is necessary to be able to rename attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table. For completeness also dimensions and facts can be renamed. To carry out these operations, the following functions are available:
rename_dimension()
: Set new name for a dimension.
Example:get_dimension_attribute_names()
: Get the name of
attributes in a dimension, so that it is easier to modify them if
necessary. Example:rename_dimension_attributes()
: Set new names of some
attributes in a dimension. Example:st <-
st_mrs_age |> rename_dimension_attributes(
name = "when",
attributes = c("when_happened_week", "when_happened_year"),
new_names = c("week", "year")
)
rename_fact()
: Set new name for facts. Example:get_measure_names()
: Get the name of the measures in
fact, so that it is easier to modify them if necessary. Example:rename_measures()
: Set new names of some measures in
facts. Example:Based on various star schemas, a constellation can be defined in which star schemas share common dimensions. Dimensions with the same name must be shared. It is defined by the following function:
constellation()
: Creates a constellation
object from a list of star_schema
objects. All dimensions
with the same name in the star schemas have to be conformable.
Example:Once star schemas and fact constellations are defined, data cleaning operations can be carried out on dimensions. There are three groups of functions:
One to obtain components of star schemas and constellations.
Another to define data cleaning operations over dimensions.
One more to apply operations to star schemas or constellations.
We can obtain dimensions from a star schema or conformed dimensions from a fact constellation. Available functions in both cases are similar.
get_dimension_names()
: Get the names of the dimensions
of a star schema. Role playing dimensions are not considered.
Example:get_dimension()
: Get a dimension of a star schema given
its name. Role dimensions can be obtained but not role playing
dimensions. Example:get_conformed_dimension_names()
: Get the names of the
conformed dimensions of a constellation. Example:get_conformed_dimension()
: Get a conformed dimension of
a constellation given its name. Example:get_star_schema_names()
: Get the names of the star
schemas of a constellation. Example:get_star_schema()
: Get a star schema of a constellation
given its name. Example:Modifications are defined on dimension rows in various ways based exclusively on the values of the dimension fields. Although the surrogate key intervenes in the definition, the result, internally, does not depend on it so that it can be applied more generally in other star schemas.
record_update_set()
: A record_update_set
object is created. Stores updates on dimension records. Each update is
made up of a dimension name, an old value set, and a new value set.
Example:match_records()
: For a dimension, given the primary key
of two records, it adds an update to the set of updates that modifies
the combination of values of the rest of attributes of the first record
so that they become the same as those of the second. Example:update_record()
: For a dimension, given the primary key
of one record, it adds an update to the set of updates that modifies the
combination of values of the rest of attributes of the selected record
so that they become those given. Example:updates <- record_update_set() |>
update_record(
dimension = who,
old = 1,
values = c("1: <1 year")
)
update_selection()
: For a dimension, given a vector of
column names, a vector of old values and a vector of new values, it adds
an update to the set of updates that modifies all the records that have
the combination of old values in the columns with the new values in
those same columns. Example:updates <- record_update_set() |>
update_selection(
dimension = where,
columns = c("city"),
old_values = c("Bridgepor"),
new_values = c("Bridgeport")
)
update_selection_general()
: For a dimension, given a
vector of column names, a vector of old values for those columns,
another vector column names, and a vector of new values for those
columns, it adds an update to the set of updates that modifies all the
records that have the combination of old values in the first column
vector with the new values in the second column vector. Example:Defined updates can be applied on a star schema or on the conformed dimension of a fact constellation.
modify_dimension_records()
: Given a list of dimension
record update operations, they are applied on the dimensions of the
star_schema
object. Update operations must be defined with
the set of functions available for that purpose. Example:modify_conformed_dimension_records()
: Given a list of
dimension record update operations, they are applied on the conformed
dimensions of the constellation
object. Update operations
must be defined with the set of functions available for that purpose.
Example:To enrich a dimension with new attributes related to others already included in it, first, we export the attributes on which the new ones depend, then we define the new attributes, and import the table with all the attributes to be added to the dimension.
enrich_dimension_export()
: Export the selected
attributes of a dimension, without repeated combinations, to enrich the
dimension. If it is a role dimension they cannot be exported, you have
to work with the associated role playing dimension. Example:enrich_dimension_import()
: For a dimension of a star
schema a tibble
is attached. This contains dimension
attributes and new attributes. If values associated with all rows in the
dimension are included in the tibble
, the dimension is
enriched with the new attributes. Role dimensions cannot be directly
enriched. If a role playing dimension is enriched, the new attributes
are also added to the associated role dimensions. Example:tb <-
enrich_dimension_export(st_mrs_age,
name = "when_common",
attributes = c("week", "year"))
# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")
st <- enrich_dimension_import(st_mrs_age, name = "when_common", tb)
enrich_dimension_import_test()
: Previous to
enrich_dimension_import
, it checks if the
tibble
has values for all dimension instances. Returns the
dimension instances that do not match the imported data. Example:tb <-
enrich_dimension_export(st_mrs_age,
name = "when_common",
attributes = c("week", "year"))
# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")[-1, ]
tb2 <- enrich_dimension_import_test(st_mrs_age, name = "when_common", tb)
When new data is obtained, an incremental refresh of the data can be carried out, both of the dimensions and of the facts. Incremental refresh can be applied to both star schema and fact constellation, using the following functions.
incremental_refresh_star_schema()
: Incrementally
refresh a star schema with the content of a new one that is integrated
into the first. Once the dimensions are integrated, if there are records
in the fact table whose keys match the new ones, new ones can be
ignored, they can be replaced by new ones, all of them can be grouped
using the aggregation functions, or they can be deleted. Therefore, the
possible values of the existing
parameter are: “ignore”,
“replace”, “group” or “delete”. Example:Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. This data can be selected using the following function:
filter_fact_rows()
: Filter fact rows based on dimension
conditions in a star schema. Dimensions remain unchanged. Filtered rows
can be deleted using the incremental_refresh_star_schema
function. Example:st <- st_mrs_age |>
filter_fact_rows(name = "when", when_happened_week <= "03") |>
filter_fact_rows(name = "where", city == "Bridgeport")
st2 <- st_mrs_age |>
incremental_refresh_star_schema(st, existing = "delete")
Once the fact data is removed (using the other incremental refresh functions), we can remove the data for the dimensions that are no longer needed using the following function:
purge_dimensions_star_schema()
: Delete instances of
dimensions not related to facts in a star schema. Example:incremental_refresh_constellation()
: Incrementally
refresh a star schema in a constellation with the content of a new star
schema that is integrated into the first. Example:purge_dimensions_constellation()
: Delete instances of
dimensions not related to facts in a star schema. It performs the
operation for each of the component star_schemas
and also
for the conformed dimensions. Example:Once the data has been properly structured and transformed, it can be exported to be consulted with other tools or with R. Various export formats have been defined, both for star schemas and for constellations, using the following functions.
star_schema_as_flat_table()
: We can again obtain a flat
table, implemented using a tibble
, from a star schema.
Example:star_schema_as_multistar()
: We can obtain a
multistar
. A multistar
only distinguishes
between general and conformed dimensions, each dimension has its own
data. It can contain multiple fact tables. Example:star_schema_as_tibble_list()
: We can obtain a
tibble
list with them. Role playing dimensions can be
optionally included. Example:constellation_as_multistar()
: We can obtain a
multistar
. A multistar
only distinguishes
between general and conformed dimensions, each dimension has its own
data. It can contain multiple fact tables. Example:constellation_as_tibble_list()
: We can obtain a
tibble
list with them. Role playing dimensions can be
optionally included. Example:There are many multidimensional query tools available. The exported
data, once stored in files, can be used directly from them. You can also
perform basic queries from R on data in the multistar
format, mainly for selecting the data to export, using the following
functions:
dimensional_query()
: An empty
dimensional_query
object is created where you can select
fact measures, dimension attributes and filter dimension rows.
Example:select_fact()
: To define the fact to be consulted, its
name is indicated, optionally, a vector of names of selected measures
and another of aggregation functions are also indicated. If the name of
any measure is not indicated, only the one corresponding to the number
of aggregated rows is included, which is always included. If no
aggregation function is included, those defined for the measures are
considered. Examples:dq <- dimensional_query(ms_mrs) |>
select_fact(
name = "mrs_age",
measures = c("n_deaths"),
agg_functions = c("MAX")
)
dq <- dimensional_query(ms_mrs) |>
select_fact(name = "mrs_age",
measures = c("n_deaths"))
dq <- dimensional_query(ms_mrs) |>
select_fact(name = "mrs_age")
select_dimension()
: To add a dimension in a
dimensional_query
object, we have to define its name and a
subset of the dimension attributes. If only the name of the dimension is
indicated, it is considered that all its attributes should be added.
Example:dq <- dimensional_query(ms_mrs) |>
select_dimension(name = "where",
attributes = c("city", "state")) |>
select_dimension(name = "when")
filter_dimension()
: Allows you to define selection
conditions for dimension rows. Conditions can be defined on any
attribute of the dimension (not only on attributes selected in the query
for the dimension). The selection is made based on the function
dplyr::filter
. Conditions are defined in exactly the same
way as in that function. Example:dq <- dimensional_query(ms_mrs) |>
filter_dimension(name = "when", when_happened_week <= "03") |>
filter_dimension(name = "where", city == "Boston")
run_query()
: Once we have selected the facts,
dimensions and defined the conditions on the instances, we can execute
the query to obtain the result. As an option, we can indicate if we do
not want to unify the facts in the case of having the same grain.
Example:ms <- dimensional_query(ms_mrs) |>
select_dimension(name = "where",
attributes = c("city", "state")) |>
select_dimension(name = "when",
attributes = c("when_happened_year")) |>
select_fact(
name = "mrs_age",
measures = c("n_deaths")
) |>
select_fact(
name = "mrs_cause",
measures = c("pneumonia_and_influenza_deaths", "other_deaths")
) |>
filter_dimension(name = "when", when_happened_week <= "03") |>
filter_dimension(name = "where", city == "Boston") |>
run_query()
starschemar
package offers a set of operations that
allow us to transform flat tables into star schemas. Star schemas
support the definition of role playing and role dimensions. Additional
transformation operations can be applied to each star schema to adapt
the format of the data. From several star schemas you can define fact
constellation with conformed dimensions.
Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation. In this document the available operations have been briefly presented.