Integration of Geospatial and Demographic data

As you know, ColOpenData can be used to access both geospatial and demographic data from Colombia, in independent modules. However, we thought it would be helpful to present a module that incorporates a way to merge information between geospatial and demographic data. In this vignette you will learn how to use the function merge_geo_demographic().

library(ColOpenData)
library(dplyr)
library(ggplot2)

Disclaimer: all data is loaded to the environment in the user’s R session, but is not downloaded to user’s computer.

How to merge geospatial and demographic data

Documentation access

Geospatial and demographic data can be merged based on the spatial aggregation level (SAL). While geospatial data can be aggregated down to the block level, demographic data is typically available only at the department and municipality levels. Therefore, these are the only SAL that can be accessed in both types of data for merging.

Now, the merge_geo_demographic() function takes as a parameter the demographic dataset of interest. Therefore, we should first access the demographic documentation to know which dataset we want to work with. Let’s suppose we want to select a dataset at the department level. We can load all demographic available datasets and then filter the level by the desired SAL.

datasets_dem <- list_datasets("demographic", "EN")

department_datasets <- datasets_dem[datasets_dem["level"] == "department", ]

head(department_datasets)

After reviewing the available datasets, we can select the one we wish to work with and take a closer look. For instance, let’s suppose we choose the dataset “DANE_CNPVPD_2018_14BPD”.

chosen_dataset <- download_demographic("DANE_CNPVPD_2018_14BPD")

head(chosen_dataset)

chosen_data presents information regarding health service attended by people that in the last thirty days had an illness, accident, dental problem or other health problem. Now, we can use the merge_geo_demographic() function.

The simplified argument downloads a simplified version of the geometries. This is not recommended for very accurate applications, but for a simple plot the approximation is enough. Also, it makes the download process much faster. To override this, you could use simplified = FALSE.

merged_data <- merge_geo_demographic(
  demographic_dataset =
    "DANE_CNPVPD_2018_14BPD"
)

head(merged_data)

merged_data presents geospatial information related to departments, as well as the information related to the health service attended by the population. We can use this dataset to visualize the proportion of people in each department who used home remedies for health issues. To achieve this, we will calculate the proportion by dividing the count of people who reported using home remedies (“uso_remedios_caseros”) by the total count of people who reported experiencing a health problem in each department.

merged_data <- merged_data %>%
  mutate(proportion_home_remedies = uso_remedios_caseros /
    total_personas_que_tuvieron_alguna_enfermedad)

We can now plot the results

ggplot(data = merged_data) +
  geom_sf(mapping = aes(fill = proportion_home_remedies), color = "white") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white", colour = "white"),
    panel.background = element_rect(fill = "white", colour = "white"),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_fill_gradient("Count", low = "#10bed2", high = "#deff00") +
  ggtitle(
    label = "Proportion of people who reported using home remedies to treat
    a health problem",
    subtitle = "Colombia"
  )