Getting started with DescriptiveStats.OBeu

Aikaterini Chatzopoulou, Kleanthis Koupidis, Charalampos Bratsas

DescriptiveStats.OBeu estimates the descriptive statistical measures needed at OpenBudgets.eu. For a given dataset on the OpenBudgets.eu data mining tool platform, you can measure the central tendency and dispersion of numeric variables, along with their distributions and correlations, and the frequencies of categorical variables.

This vignette shows how to use the functions of DescriptiveStats.OBeu with different datasets, including datasets from OpenBudgets.eu.

The tojson parameter is available in the ds.analysis, ds.statistics, ds.hist, ds.boxplot, ds.correlation, ds.frequency, ds.kurtosis and ds.skewness functions. It specifies whether the returned object should be in JSON format.

First, load the package:

# load DescriptiveStats.OBeu
library(DescriptiveStats.OBeu)

Data in the package

The data in the package include the budget of Wuppertal for 2009 to 2020, as the data frame Wuppertal_df and as the JSON link Wuppertal_openspending, as well as a sample JSON link, sample_json_link_openspending. You can access the links with fromJSON from the jsonlite package or by pasting them into a browser.

Wuppertal internal structure

## 'data.frame':    6225 obs. of  10 variables:
##  $ ProduktNR       : chr  "1109020" "1109020" "3103040" "3103040" ...
##  $ Kontotyp        : Factor w/ 2 levels "Aufwendung","Ertrag": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Art             : Factor w/ 2 levels "Ergebnis","Plan": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year            : Factor w/ 12 levels "2009","2010",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount          : num  203228 219134 1926839 11433219 18658 ...
##  $ ProduktbereichNR: chr  "11" "11" "31" "31" ...
##  $ ProduktgruppeNR : chr  "1109" "1109" "3103" "3103" ...
##  $ Produkt         : chr  "(entfallen in 2012) E-Government / Internet" "(entfallen in 2012) E-Government / Internet" "Nicht definiert" "Nicht definiert" ...
##  $ Produktbereich  : chr  "Innere Verwaltung" "Innere Verwaltung" "Soziale Leistungen" "Soziale Leistungen" ...
##  $ Produktgruppe   : chr  "Geschäftsbereichsleitung GB 4" "Geschäftsbereichsleitung GB 4" "Grundsicherung SGB II" "Grundsicherung SGB II" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ ProduktNR       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Kontotyp        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Art             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Year            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Amount          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ ProduktbereichNR: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ ProduktgruppeNR : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Produkt         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Produktbereich  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Produktgruppe   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Descriptive Statistics in a call

ds.analysis estimates the minimum, maximum, range, mean, median, first and third quartiles, variance, standard deviation, skewness and kurtosis of the numeric variables, the boxplot and histogram parameters needed to visualize them, and the frequencies of the factor variables of a given vector, matrix or data frame.

ds.analysis components

Component     Output                 Description
statistics    Min                    The minimum observed value of the input data
              Max                    The maximum observed value of the input data
              Range                  The difference between maximum and minimum
              Mean                   The average value of the input data
              Median                 The median value of the input data
              Quantiles              The 25%, 75% percentiles
              Variance               The variance of the input data
              StandardDeviation      The standard deviation of the input data
              Skewness               The Skewness of the input data
              Kurtosis               The Kurtosis of the input data
boxplot       lo.whisker             Lower horizontal line out of the box
              lo.hinge               Lower horizontal line of the box
              median                 Horizontal line in the box
              up.hinge               Upper horizontal line of the box
              up.whisker             Upper horizontal line out of the box
              box.width              The box width of each variable
              lo.out                 Lower outliers
              up.out                 Upper outliers
              n                      The number of non-NA observations
histogram     cuts                   The boundaries of the histogram classes
              counts                 The frequency of each histogram class
              mean                   The average value of the input vector
              median                 The median value of the input data
frequencies   Variable name          The name of the calculated variable
              frequencies            The frequency value
              "_row"                 Name of the categories of the variable
              relative.frequencies   Relative frequency values
correlation   Variable name          The name of the calculated variable
              Correlation value      The correlation value
              "_row"                 The corresponding correlation variable

ds.analysis returns a list object by default. Here we set the tojson parameter to TRUE, the outliers parameter to FALSE and fr.select = "Produktbereich". The correlation component is empty because the data contain only one numeric variable.

wuppertalanalysis = ds.analysis(Wuppertal_df,outliers=FALSE, fr.select = "Produktbereich", tojson=TRUE) # json string format
jsonlite::prettify(wuppertalanalysis) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
##     "descriptives": {
##         "Min": {
##             "Amount": [
##                 -2040680.54
##             ]
##         },
##         "Max": {
##             "Amount": [
##                 507995000
##             ]
##         },
##         "Range": {
##             "Amount": [
##                 510035680.54
##             ]
##         },
##         "Mean": {
##             "Amount": [
##                 6171229.3658
##             ]
##         },
##         "Median": {
##             "Amount": [
##                 736038.09
##             ]
##         },
##         "Quantiles": {
##             "Amount": [
##                 243696.13,
##                 2653000
##             ]
##         },
##         "Variance": {
##             "Amount": [
##                 777106882358169
##             ]
##         },
##         "StandardDeviation": {
##             "Amount": [
##                 27876636.8552
##             ]
##         },
##         "Kurtosis": [
##             160.1519
##         ],
##         "Skewness": [
##             11.4762
##         ]
##     },
##     "boxplot": {
##         "Amount": {
##             "lo.whisker": [
##                 -2040680.54
##             ],
##             "lo.hinge": [
##                 243696.13
##             ],
##             "median": [
##                 736038.09
##             ],
##             "up.hinge": [
##                 2653000
##             ],
##             "up.whisker": [
##                 6243113.59
##             ],
##             "box.width": [
##                 11.83
##             ],
##             "lo.out": {
## 
##             },
##             "up.out": {
## 
##             },
##             "n": [
##                 6225
##             ]
##         }
##     },
##     "histogram": {
##         "Amount": {
##             "cuts": [
##                 -50000000,
##                 0,
##                 50000000,
##                 100000000,
##                 150000000,
##                 200000000,
##                 250000000,
##                 300000000,
##                 350000000,
##                 400000000,
##                 450000000,
##                 500000000,
##                 550000000
##             ],
##             "counts": [
##                 46,
##                 6032,
##                 83,
##                 30,
##                 10,
##                 0,
##                 1,
##                 11,
##                 2,
##                 4,
##                 4,
##                 2
##             ],
##             "mean": [
##                 6171229.3658
##             ],
##             "median": [
##                 736038.09
##             ]
##         }
##     },
##     "frequencies": {
##         "frequencies": {
##             "Produktbereich": [
##                 {
##                     "Var1": "Allgemeine Finanzwirtschaft",
##                     "Freq": 101
##                 },
##                 {
##                     "Var1": "Bauen und Wohnen",
##                     "Freq": 193
##                 },
##                 {
##                     "Var1": "Gesundheitsdienste",
##                     "Freq": 207
##                 },
##                 {
##                     "Var1": "Innere Verwaltung",
##                     "Freq": 1737
##                 },
##                 {
##                     "Var1": "Kinder-, Jugend- u. Familienhilfe",
##                     "Freq": 373
##                 },
##                 {
##                     "Var1": "Kultur und Wissenschaft",
##                     "Freq": 346
##                 },
##                 {
##                     "Var1": "Natur- und Landschaftspflege",
##                     "Freq": 256
##                 },
##                 {
##                     "Var1": "Räuml.Planung, Entw., Geoinfo.",
##                     "Freq": 463
##                 },
##                 {
##                     "Var1": "Schulträgeraufgaben",
##                     "Freq": 364
##                 },
##                 {
##                     "Var1": "Sicherheit und Ordnung",
##                     "Freq": 591
##                 },
##                 {
##                     "Var1": "Soziale Leistungen",
##                     "Freq": 663
##                 },
##                 {
##                     "Var1": "Sportförderung",
##                     "Freq": 224
##                 },
##                 {
##                     "Var1": "Stiftungen",
##                     "Freq": 31
##                 },
##                 {
##                     "Var1": "Umweltschutz",
##                     "Freq": 128
##                 },
##                 {
##                     "Var1": "Ver- und Entsorgung",
##                     "Freq": 155
##                 },
##                 {
##                     "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
##                     "Freq": 261
##                 },
##                 {
##                     "Var1": "Wirtschaft und Tourismus",
##                     "Freq": 132
##                 }
##             ]
##         },
##         "relative.frequencies": {
##             "Produktbereich": [
##                 {
##                     "Var1": "Allgemeine Finanzwirtschaft",
##                     "Freq": 0.0162
##                 },
##                 {
##                     "Var1": "Bauen und Wohnen",
##                     "Freq": 0.031
##                 },
##                 {
##                     "Var1": "Gesundheitsdienste",
##                     "Freq": 0.0333
##                 },
##                 {
##                     "Var1": "Innere Verwaltung",
##                     "Freq": 0.279
##                 },
##                 {
##                     "Var1": "Kinder-, Jugend- u. Familienhilfe",
##                     "Freq": 0.0599
##                 },
##                 {
##                     "Var1": "Kultur und Wissenschaft",
##                     "Freq": 0.0556
##                 },
##                 {
##                     "Var1": "Natur- und Landschaftspflege",
##                     "Freq": 0.0411
##                 },
##                 {
##                     "Var1": "Räuml.Planung, Entw., Geoinfo.",
##                     "Freq": 0.0744
##                 },
##                 {
##                     "Var1": "Schulträgeraufgaben",
##                     "Freq": 0.0585
##                 },
##                 {
##                     "Var1": "Sicherheit und Ordnung",
##                     "Freq": 0.0949
##                 },
##                 {
##                     "Var1": "Soziale Leistungen",
##                     "Freq": 0.1065
##                 },
##                 {
##                     "Var1": "Sportförderung",
##                     "Freq": 0.036
##                 },
##                 {
##                     "Var1": "Stiftungen",
##                     "Freq": 0.005
##                 },
##                 {
##                     "Var1": "Umweltschutz",
##                     "Freq": 0.0206
##                 },
##                 {
##                     "Var1": "Ver- und Entsorgung",
##                     "Freq": 0.0249
##                 },
##                 {
##                     "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
##                     "Freq": 0.0419
##                 },
##                 {
##                     "Var1": "Wirtschaft und Tourismus",
##                     "Freq": 0.0212
##                 }
##             ]
##         }
##     },
##     "correlation": {
## 
##     }
## }
## 

ds.analysis internally uses the functions ds.statistics, ds.hist, ds.boxplot, ds.correlation and ds.frequency. These functions can also be used independently, depending on the user's requirements.

Statistical measures

ds.statistics estimates the minimum, maximum, range, mean, median, first and third quartiles, variance, standard deviation, skewness and kurtosis of a given vector, matrix or data frame.
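
To make the basic measures concrete, here is a minimal base R sketch that computes them for a single numeric vector. It is not the package's implementation: the built-in iris data stands in for a budget column such as Wuppertal_df$Amount, and the object name basic_stats is illustrative. Skewness and kurtosis are left out here.

```r
# Base R sketch of the basic descriptive measures for one numeric vector.
# Assumption: `x` stands in for any numeric column (e.g. a budget amount).
x <- iris$Sepal.Length

basic_stats <- list(
  Min               = min(x, na.rm = TRUE),
  Max               = max(x, na.rm = TRUE),
  Range             = max(x, na.rm = TRUE) - min(x, na.rm = TRUE),
  Mean              = mean(x, na.rm = TRUE),
  Median            = median(x, na.rm = TRUE),
  Quantiles         = quantile(x, probs = c(0.25, 0.75), na.rm = TRUE),
  Variance          = var(x, na.rm = TRUE),
  StandardDeviation = sd(x, na.rm = TRUE)
)

# Show the structure of the resulting list
str(basic_stats)
```

The element names mirror the names in the list returned by ds.statistics, which makes it easy to compare the two side by side on your own data.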

ds.statistics returns by default a list object:

ds.statistics(Wuppertal_df) # list format
## $Min
## $Min$Amount
## [1] -2040681
## 
## 
## $Max
## $Max$Amount
## [1] 507995000
## 
## 
## $Range
## $Range$Amount
## [1] 510035681
## 
## 
## $Mean
## $Mean$Amount
## [1] 6171229
## 
## 
## $Median
## $Median$Amount
## [1] 736038.1
## 
## 
## $Quantiles
## $Quantiles$Amount
##       25%       75% 
##  243696.1 2653000.0 
## 
## 
## $Variance
## $Variance$Amount
## [1] 7.771069e+14
## 
## 
## $StandardDeviation
## $StandardDeviation$Amount
## [1] 27876637
## 
## 
## $Kurtosis
##   Amount 
## 160.1519 
## 
## $Skewness
##   Amount 
## 11.47621

To extract the results in JSON format for further use, set the tojson parameter to TRUE:

wuppertalstats = ds.statistics(Wuppertal_df, tojson = TRUE) # json  format
jsonlite::prettify(wuppertalstats) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
##     "Min": {
##         "Amount": [
##             -2040680.54
##         ]
##     },
##     "Max": {
##         "Amount": [
##             507995000
##         ]
##     },
##     "Range": {
##         "Amount": [
##             510035680.54
##         ]
##     },
##     "Mean": {
##         "Amount": [
##             6171229.3658
##         ]
##     },
##     "Median": {
##         "Amount": [
##             736038.09
##         ]
##     },
##     "Quantiles": {
##         "Amount": [
##             243696.13,
##             2653000
##         ]
##     },
##     "Variance": {
##         "Amount": [
##             777106882358169
##         ]
##     },
##     "StandardDeviation": {
##         "Amount": [
##             27876636.8552
##         ]
##     },
##     "Kurtosis": [
##         160.1519
##     ],
##     "Skewness": [
##         11.4762
##     ]
## }
## 

Histogram

ds.hist computes the parameters needed to visualize a histogram of a numeric input vector. The breaks are specified as in the base hist function.
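
As a rough illustration of what such histogram parameters look like, the base hist function can be called with plot = FALSE to obtain class boundaries and counts without drawing anything. This is a sketch under the assumption that the built-in iris data stands in for the budget amounts; the list name hist_params is illustrative and the code is not taken from the package.

```r
# Base R sketch: histogram parameters without plotting.
# Assumption: `x` stands in for a numeric column such as a budget amount.
x <- iris$Sepal.Length

h <- hist(x, breaks = "Sturges", plot = FALSE)

hist_params <- list(
  cuts   = h$breaks,   # boundaries of the histogram classes
  counts = h$counts,   # number of observations in each class
  mean   = mean(x),
  median = median(x)
)

hist_params
```

There is always one more boundary than there are classes, which is why cuts has 13 values and counts has 12 in the Wuppertal output.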

ds.hist(Wuppertal_df$Amount, breaks= "Sturges") # list format
## $cuts
##  [1] -5.0e+07  0.0e+00  5.0e+07  1.0e+08  1.5e+08  2.0e+08  2.5e+08  3.0e+08
##  [9]  3.5e+08  4.0e+08  4.5e+08  5.0e+08  5.5e+08
## 
## $counts
##  [1]   46 6032   83   30   10    0    1   11    2    4    4    2
## 
## $mean
## [1] 6171229
## 
## $median
## [1] 736038.1

To return the results as a JSON string:

wuppertalhist = ds.hist(Wuppertal_df$Amount, breaks= "Sturges", tojson=TRUE) # json  format
jsonlite::prettify(wuppertalhist) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
##     "cuts": [
##         -50000000,
##         0,
##         50000000,
##         100000000,
##         150000000,
##         200000000,
##         250000000,
##         300000000,
##         350000000,
##         400000000,
##         450000000,
##         500000000,
##         550000000
##     ],
##     "counts": [
##         46,
##         6032,
##         83,
##         30,
##         10,
##         0,
##         1,
##         11,
##         2,
##         4,
##         4,
##         2
##     ],
##     "mean": [
##         6171229.3658
##     ],
##     "median": [
##         736038.09
##     ]
## }
## 

Boxplot

ds.boxplot returns the parameters needed for a boxplot visualization of an input vector, matrix or data frame.

If outl is TRUE, the outliers are computed at the selected out.level (the default is 1.5 times the interquartile range). The box width is the width parameter (0.15 here) times the square root of the size of the input data; for the 6225 observations below this gives 0.15 × sqrt(6225) ≈ 11.83. ds.boxplot uses only the numeric variables of the input data, so you do not have to exclude factor or character variables.
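
The same kind of parameters can be sketched in base R with boxplot.stats, whose coef argument plays the role of the 1.5 × interquartile range rule. This is an illustration rather than the package's code: the iris data is a stand-in, the list name box_params is hypothetical, and the whisker values that ds.boxplot reports may follow different rules, in particular when outl is FALSE.

```r
# Base R sketch: boxplot parameters for one numeric vector.
# Assumption: `x` stands in for a numeric column of the input data.
x <- iris$Sepal.Width

# coef = 1.5 flags points beyond 1.5 * IQR from the hinges as outliers
b <- boxplot.stats(x, coef = 1.5)

box_params <- list(
  lo.whisker = b$stats[1],
  lo.hinge   = b$stats[2],
  median     = b$stats[3],
  up.hinge   = b$stats[4],
  up.whisker = b$stats[5],
  box.width  = round(0.15 * sqrt(b$n), 2),  # width times sqrt of sample size
  lo.out     = b$out[b$out < b$stats[1]],   # outliers below the lower whisker
  up.out     = b$out[b$out > b$stats[5]],   # outliers above the upper whisker
  n          = b$n                          # number of non-NA observations
)

box_params
```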

wuppertalbox = ds.boxplot(Wuppertal_df, width = 0.15 , outl = FALSE, tojson=TRUE) # json  format
jsonlite::prettify(wuppertalbox) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
##     "Amount": {
##         "lo.whisker": [
##             -2040680.54
##         ],
##         "lo.hinge": [
##             243696.13
##         ],
##         "median": [
##             736038.09
##         ],
##         "up.hinge": [
##             2653000
##         ],
##         "up.whisker": [
##             6243113.59
##         ],
##         "box.width": [
##             11.83
##         ],
##         "lo.out": {
## 
##         },
##         "up.out": {
## 
##         },
##         "n": [
##             6225
##         ]
##     }
## }
## 

Correlation

ds.correlation estimates the correlation coefficient (the default method is "pearson") of the input vectors, matrix or data frame. This example uses the iris dataset. Factor or character variables in the input matrix or data frame are filtered out by default. In the output below each pair of variables is reported once, in the upper triangle; the remaining entries are 0.
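
For comparison, the filtering and the Pearson coefficients can be sketched in base R as follows. This is a minimal sketch and not the package's code; the names num_cols and corr are illustrative, and base cor returns the full symmetric matrix.

```r
# Base R sketch: Pearson correlations of the numeric columns of a data frame.
# Keep only numeric columns (this drops the factor column Species)
num_cols <- vapply(iris, is.numeric, logical(1))

# Full symmetric correlation matrix, rounded to two decimals
corr <- round(cor(iris[, num_cols], method = "pearson"), 2)

corr
```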

iriscorr = ds.correlation(iris, cor.method="pearson", tojson=TRUE) # json format
jsonlite::prettify(iriscorr) # use prettify of jsonlite library to add indentation to the returned JSON string
## [
##     {
##         "Sepal.Length": 1,
##         "Sepal.Width": -0.12,
##         "Petal.Length": 0.87,
##         "Petal.Width": 0.82,
##         "_row": "Sepal.Length"
##     },
##     {
##         "Sepal.Length": 0,
##         "Sepal.Width": 1,
##         "Petal.Length": -0.43,
##         "Petal.Width": -0.37,
##         "_row": "Sepal.Width"
##     },
##     {
##         "Sepal.Length": 0,
##         "Sepal.Width": 0,
##         "Petal.Length": 1,
##         "Petal.Width": 0.96,
##         "_row": "Petal.Length"
##     },
##     {
##         "Sepal.Length": 0,
##         "Sepal.Width": 0,
##         "Petal.Length": 0,
##         "Petal.Width": 1,
##         "_row": "Petal.Width"
##     }
## ]
## 

Frequency

ds.frequency calculates the frequencies and relative frequencies of the factor or character variables of the input dataset. Here it is applied to Produktbereich from the Wuppertal_df dataset and returns a JSON string.
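
The two quantities can be sketched in base R with table and prop.table. This is an illustration only: the factor iris$Species stands in for Produktbereich, the object names are hypothetical, and rounding to four decimals is an assumption based on the output shown in this vignette.

```r
# Base R sketch: frequencies and relative frequencies of a factor.
# Assumption: iris$Species stands in for a categorical budget variable.
counts <- table(iris$Species)
shares <- prop.table(counts)   # counts divided by the total

freq <- data.frame(Var1 = names(counts), Freq = as.vector(counts))
rel  <- data.frame(Var1 = names(shares), Freq = round(as.vector(shares), 4))

freq
rel
```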

wuppertalfreq = ds.frequency(Wuppertal_df$Produktbereich, tojson = TRUE)
jsonlite::prettify(wuppertalfreq) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
##     "frequencies": {
##         "data": [
##             {
##                 "Var1": "Allgemeine Finanzwirtschaft",
##                 "Freq": 101
##             },
##             {
##                 "Var1": "Bauen und Wohnen",
##                 "Freq": 193
##             },
##             {
##                 "Var1": "Gesundheitsdienste",
##                 "Freq": 207
##             },
##             {
##                 "Var1": "Innere Verwaltung",
##                 "Freq": 1737
##             },
##             {
##                 "Var1": "Kinder-, Jugend- u. Familienhilfe",
##                 "Freq": 373
##             },
##             {
##                 "Var1": "Kultur und Wissenschaft",
##                 "Freq": 346
##             },
##             {
##                 "Var1": "Natur- und Landschaftspflege",
##                 "Freq": 256
##             },
##             {
##                 "Var1": "Räuml.Planung, Entw., Geoinfo.",
##                 "Freq": 463
##             },
##             {
##                 "Var1": "Schulträgeraufgaben",
##                 "Freq": 364
##             },
##             {
##                 "Var1": "Sicherheit und Ordnung",
##                 "Freq": 591
##             },
##             {
##                 "Var1": "Soziale Leistungen",
##                 "Freq": 663
##             },
##             {
##                 "Var1": "Sportförderung",
##                 "Freq": 224
##             },
##             {
##                 "Var1": "Stiftungen",
##                 "Freq": 31
##             },
##             {
##                 "Var1": "Umweltschutz",
##                 "Freq": 128
##             },
##             {
##                 "Var1": "Ver- und Entsorgung",
##                 "Freq": 155
##             },
##             {
##                 "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
##                 "Freq": 261
##             },
##             {
##                 "Var1": "Wirtschaft und Tourismus",
##                 "Freq": 132
##             }
##         ]
##     },
##     "relative.frequencies": {
##         "data": [
##             {
##                 "Var1": "Allgemeine Finanzwirtschaft",
##                 "Freq": 0.0162
##             },
##             {
##                 "Var1": "Bauen und Wohnen",
##                 "Freq": 0.031
##             },
##             {
##                 "Var1": "Gesundheitsdienste",
##                 "Freq": 0.0333
##             },
##             {
##                 "Var1": "Innere Verwaltung",
##                 "Freq": 0.279
##             },
##             {
##                 "Var1": "Kinder-, Jugend- u. Familienhilfe",
##                 "Freq": 0.0599
##             },
##             {
##                 "Var1": "Kultur und Wissenschaft",
##                 "Freq": 0.0556
##             },
##             {
##                 "Var1": "Natur- und Landschaftspflege",
##                 "Freq": 0.0411
##             },
##             {
##                 "Var1": "Räuml.Planung, Entw., Geoinfo.",
##                 "Freq": 0.0744
##             },
##             {
##                 "Var1": "Schulträgeraufgaben",
##                 "Freq": 0.0585
##             },
##             {
##                 "Var1": "Sicherheit und Ordnung",
##                 "Freq": 0.0949
##             },
##             {
##                 "Var1": "Soziale Leistungen",
##                 "Freq": 0.1065
##             },
##             {
##                 "Var1": "Sportförderung",
##                 "Freq": 0.036
##             },
##             {
##                 "Var1": "Stiftungen",
##                 "Freq": 0.005
##             },
##             {
##                 "Var1": "Umweltschutz",
##                 "Freq": 0.0206
##             },
##             {
##                 "Var1": "Ver- und Entsorgung",
##                 "Freq": 0.0249
##             },
##             {
##                 "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
##                 "Freq": 0.0419
##             },
##             {
##                 "Var1": "Wirtschaft und Tourismus",
##                 "Freq": 0.0212
##             }
##         ]
##     }
## }
## 

If the input is a data frame and the select parameter is not specified, the frequencies of all the factor variables are returned.

Numeric variables in the input data are filtered out internally before the frequencies are estimated.

Kurtosis

ds.kurtosis calculates the kurtosis of the input vector, matrix or data frame. Factor or character variables included in the input matrix or data frame are omitted from the estimation.
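
One common definition of kurtosis is the fourth central moment divided by the squared second central moment. The sketch below implements that definition in base R. The function name moment_kurtosis is hypothetical, and the package may use a different estimator (for example excess kurtosis or a bias-corrected version), so the values need not match ds.kurtosis exactly.

```r
# Base R sketch: moment-based kurtosis, m4 / m2^2.
# Assumption: this definition is illustrative, not necessarily the package's.
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  d <- x - mean(x)           # deviations from the mean
  mean(d^4) / mean(d^2)^2    # fourth moment over squared second moment
}

moment_kurtosis(c(1, 2, 3, 4, 5))   # 1.7
```

Under this definition a normal distribution has kurtosis 3, so much larger values point to heavy tails.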

ds.kurtosis(Wuppertal_df$Amount, tojson=TRUE)
## [160.1519]

Skewness

ds.skewness calculates the skewness of the input vector, matrix or data frame. Factor or character variables included in the input matrix or data frame are omitted from the estimation.
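
A common definition of skewness is the third central moment divided by the second central moment raised to the power 1.5. The sketch below implements it in base R. The function name moment_skewness is hypothetical, and the estimator used by the package may differ, so treat this as an illustration of the concept rather than a reimplementation of ds.skewness.

```r
# Base R sketch: moment-based skewness, m3 / m2^(3/2).
# Assumption: this definition is illustrative, not necessarily the package's.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]           # drop missing values
  d <- x - mean(x)            # deviations from the mean
  mean(d^3) / mean(d^2)^1.5   # third moment over second moment^(3/2)
}

moment_skewness(c(1, 2, 3, 4, 5))   # symmetric data, skewness 0
moment_skewness(c(1, 1, 1, 5))      # long right tail, positive skewness
```

A positive value, as for the Wuppertal amounts, indicates a long right tail: a few very large amounts pull the mean well above the median.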

ds.skewness(Wuppertal_df$Amount, tojson=TRUE)
## [11.4762]