Joining Weather Data to Event Tables with {weatherjoin}

Overview

The weatherjoin package attaches gridded weather data to event-based datasets in a reliable, efficient, and reproducible way.

Typical use cases include:

  - adding air temperature or precipitation to experimental observations,
  - linking weather data to monitoring events,
  - enriching spatial point data with meteorological context.

The package is designed around four core principles:

  1. Explicit time handling: timestamps are validated and standardised before any data are requested.
  2. Efficient API usage: weather data are requested only when needed, and only for the required spatial and temporal extent.
  3. Local caching: downloaded weather segments are stored locally and reused across sessions.
  4. Safe joining: weather values are joined back to the user’s data using exact or controlled rolling joins.

Currently, weatherjoin supports the NASA POWER data service via the {nasapower} package. This package is not affiliated with or endorsed by NASA.

Basic usage

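At minimum, you need an event table, the weather parameters to fetch, and the names of the time, latitude, and longitude columns: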

library(weatherjoin)

out <- join_weather(
  x = events,
  params = c("T2M", "PRECTOTCORR"),
  time = "event_time",
  lat_col = "lat",
  lon_col = "lon"
)

The result is the original table with weather variables appended.
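For orientation, here is a minimal sketch of an input table that the call above could accept. The column names match the arguments shown (event_time, lat, lon); the coordinates and times are invented purely for illustration.

```r
# Hypothetical event table; values are made up for illustration only.
events <- data.frame(
  event_time = as.POSIXct(
    c("2024-05-01 09:30:00", "2024-05-01 15:00:00", "2024-05-02 11:15:00"),
    tz = "UTC"
  ),
  lat = c(51.81, 51.81, 52.20),
  lon = c(-0.36, -0.36, 0.12)
)

# One row per event: a timestamp plus a point location.
str(events)
```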

Time handling in detail

weatherjoin always forms requests to NASA POWER using UTC timestamps. Your input time is interpreted using the tz argument, then standardised internally to UTC for planning, caching, and joining.

What tz means

tz is the timezone used to interpret your event time input.

If your event timestamps are recorded in local clock time (for example UK time), set:

join_weather(..., tz = "Europe/London")

weatherjoin will interpret them as Europe/London and convert internally to UTC before matching with POWER data.
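The effect of interpreting clock times in a named timezone can be illustrated in base R. This is only a sketch of the general conversion, not weatherjoin's internal code; the example timestamps are invented.

```r
# Clock times as recorded locally (no timezone information attached yet).
local_clock <- c("2024-01-15 14:00:00", "2024-07-15 14:00:00")

# Interpret the strings as Europe/London wall-clock time.
local_time <- as.POSIXct(local_clock, tz = "Europe/London")

# Express the same instants in UTC.
utc_text <- format(local_time, "%Y-%m-%d %H:%M", tz = "UTC")
print(utc_text)
```

The January time is unchanged because the UK is on GMT in winter, while the July time shifts back one hour because of British Summer Time. Omitting or mis-stating tz would therefore displace summer events by an hour.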

Single-column time input

The time argument may refer to a single column containing a POSIXct or character timestamp, a Date, or a numeric YYYYMMDD value.

Examples:

join_weather(x, params = "T2M", time = "event_time")      # POSIXct or character
join_weather(x, params = "T2M", time = "event_date")      # Date
join_weather(x, params = "T2M", time = "event_yyyymmdd")  # numeric YYYYMMDD

Hourly weather requires hour-level information: if you request hourly data but provide only a date, weatherjoin raises an error.
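As a rough illustration of what validating a numeric YYYYMMDD column involves, the base R sketch below parses such values into Date objects and flags entries that are not real calendar dates. It is an assumption-laden stand-in, not the package's own parser, and the values are invented.

```r
# Numeric dates in YYYYMMDD form; the last one (30 February) is invalid.
event_yyyymmdd <- c(20240315, 20241201, 20240230)

# Convert to text without scientific notation, then parse strictly.
as_text <- sprintf("%.0f", event_yyyymmdd)
event_date <- as.Date(as_text, format = "%Y%m%d")

# Entries that failed to parse come back as NA and can be reported.
bad_rows <- which(is.na(event_date))
print(event_date)
print(bad_rows)
```

Note that the parsed values are dates only: they carry no hour, which is why a column like this cannot support an hourly request.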

Multi-column time input

You can also provide multiple columns, which weatherjoin will assemble into a timestamp. Supported schemas include:

Example:

join_weather(x, params = "T2M", time = c("YEAR", "MO", "DY", "HR"))

Column roles are inferred from names (e.g. YEAR, MO, DY, HR, DOY) and validated. Invalid inputs always produce informative errors.
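To make the idea of assembling a timestamp from several columns concrete, here is a base R sketch. It assumes the columns hold calendar components in UTC and shows two plausible layouts (year, month, day, hour; and year plus day-of-year). How weatherjoin itself assembles and validates these columns is not shown here.

```r
# Layout 1: year, month, day, hour.
x <- data.frame(YEAR = c(2024, 2024), MO = c(3, 12), DY = c(15, 1), HR = c(9, 18))
x$timestamp <- ISOdatetime(x$YEAR, x$MO, x$DY, x$HR, 0, 0, tz = "UTC")

# Layout 2: year plus day-of-year (DOY 1 is 1 January).
doy_tab <- data.frame(YEAR = c(2024, 2023), DOY = c(60, 60))
doy_tab$date <- as.Date(paste0(doy_tab$YEAR, "-01-01")) + (doy_tab$DOY - 1)

print(x)
print(doy_tab)
```

Day 60 falls on 29 February in the leap year 2024 but on 1 March in 2023, which is one reason day-of-year inputs need the year alongside them.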

Daily vs hourly data (time_api)

The time_api argument controls whether daily or hourly POWER data are used:

Rules are explicit:

This avoids silent misinterpretation of temporal resolution.

Daily timestamps and the dummy hour

Daily POWER data have no time-of-day. When constructing timestamps for daily data, weatherjoin assigns a configurable “dummy hour” (default: 12:00) to ensure consistent internal handling.

Advanced users can change this via:

options(weatherjoin.dummy_hour = 12)

This does not change the meaning of daily weather values; it only affects the internally constructed timestamp used for planning and joining.
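The following sketch shows one way a date-only value can be turned into a full timestamp using a dummy hour. It reads the weatherjoin.dummy_hour option named above and falls back to 12 when the option is unset; the construction itself is an illustration in base R, not the package's internal routine.

```r
# Read the option, defaulting to 12 when it has not been set.
dummy_hour <- getOption("weatherjoin.dummy_hour", default = 12)

# Daily events carry a date but no time of day.
event_date <- as.Date(c("2024-03-15", "2024-03-16"))

# Build a UTC timestamp at the dummy hour on each date.
daily_stamp <- as.POSIXct(
  sprintf("%s %02d:00:00", format(event_date), as.integer(dummy_hour)),
  tz = "UTC"
)
print(daily_stamp)
```

The calendar date is unchanged by this step, so the daily value attached to each event is the same whichever dummy hour is chosen.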

Spatial handling and representative locations

Weather data are provided on a coarse spatial grid. When many nearby points are present, requesting data separately for each location would be redundant and inefficient, because neighbouring points fall within the same coarse NASA POWER grid cells.

weatherjoin therefore uses spatial reduction by default before calling the provider. Each group is reduced to a representative location (centroid; can be changed to median via options), and weather data are fetched once per group.

This behaviour is controlled by the spatial_mode argument:

Example using grouping:

join_weather(
  x = events,
  params = "T2M",
  time = "event_time",
  spatial_mode = "by_group",
  group_col = "site_id"
)
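The reduction of each group to a representative location can be pictured with a short base R sketch. It computes a simple centroid (mean latitude and longitude) per site_id; the data are invented, and the package's own reduction may differ in detail.

```r
# Hypothetical events clustered around two sites.
events <- data.frame(
  site_id = c("A", "A", "A", "B", "B"),
  lat = c(51.80, 51.82, 51.81, 52.20, 52.22),
  lon = c(-0.36, -0.34, -0.35, 0.12, 0.14)
)

# One representative point (centroid) per group.
rep_loc <- aggregate(cbind(lat, lon) ~ site_id, data = events, FUN = mean)
print(rep_loc)
```

Five event rows collapse to two representative locations, so weather would need to be fetched twice rather than five times.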

Efficient time-range planning (splitting sparse ranges)

Event data can contain large time gaps (e.g. a few observations in 2010 and a few in 2024). Downloading continuous weather data for the entire span would be wasteful.

weatherjoin detects such gaps and splits requests into multiple time windows:

This dramatically reduces:

Advanced users can tune this behaviour via options:

options(weatherjoin.split_penalty_hours = 72)  # larger = fewer, wider calls
options(weatherjoin.pad_hours = 0)             # padding added around each planned window
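The idea of splitting a sparse time range can be sketched with a simple rule: start a new window whenever the gap between consecutive events exceeds a threshold. The threshold variable below is hypothetical, and this rule is a simplification; the package's planner, including how split_penalty_hours is weighed, may work differently.

```r
# Events clustered in 2010 and 2024 with a long gap in between.
event_time <- as.POSIXct(
  c("2010-06-01 10:00", "2010-06-02 14:00", "2024-03-10 08:00", "2024-03-12 16:00"),
  tz = "UTC"
)
gap_threshold_hours <- 72  # hypothetical threshold for this sketch

# Sort, measure gaps in hours, and open a new window after each large gap.
sorted <- sort(event_time)
gap_hours <- diff(as.numeric(sorted)) / 3600
window_id <- cumsum(c(TRUE, gap_hours > gap_threshold_hours))

# Summarise each window by its first and last event.
start_num <- as.numeric(tapply(as.numeric(sorted), window_id, min))
end_num <- as.numeric(tapply(as.numeric(sorted), window_id, max))
windows <- data.frame(
  window = seq_along(start_num),
  start = as.POSIXct(start_num, origin = "1970-01-01", tz = "UTC"),
  end = as.POSIXct(end_num, origin = "1970-01-01", tz = "UTC")
)
print(windows)
```

Here two short windows replace a single request spanning roughly fourteen years.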

Local caching

weatherjoin caches downloads automatically and transparently to avoid repeated API calls. Downloaded data segments are indexed by:

Segments are reused whenever they cover a new request.

Cache locations

Two scopes are supported:

User-level cache (default): persists across projects and sessions.

Project-level cache: stored in a .weatherjoin/ directory inside the project. This is useful for reproducible analyses and shared projects.

You can control this via:

cache_scope = "user"    # default
cache_scope = "project"

or provide an explicit directory via cache_dir.

Cache maintenance

Cache utilities are provided:

wj_cache_list()
wj_cache_clear()

Advanced cache policy

Most users can ignore cache policy settings. For advanced control, weatherjoin reads:

options(weatherjoin.cache_max_age_days = 60)
options(weatherjoin.cache_refresh = "if_missing")   # or "if_stale", "always"
options(weatherjoin.cache_match_mode = "cover")     # or "exact"
options(weatherjoin.cache_param_match = "superset") # or "exact"
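One plausible reading of the "cover" and "superset" settings is sketched below: a cached segment can serve a request if its time range contains the requested range and its parameters include all requested ones. This interpretation is an assumption based on the option names; segment_covers() is a hypothetical helper, and location and elevation matching are left out.

```r
# Hypothetical check: does a cached segment fully serve a request?
segment_covers <- function(segment, request) {
  time_ok <- segment$start <= request$start && segment$end >= request$end
  params_ok <- all(request$params %in% segment$params)
  time_ok && params_ok
}

cached <- list(
  start = as.Date("2024-01-01"),
  end = as.Date("2024-03-31"),
  params = c("T2M", "PRECTOTCORR")
)

# Falls inside the cached range and asks for a cached parameter.
req_inside <- list(start = as.Date("2024-02-01"), end = as.Date("2024-02-10"), params = "T2M")

# Runs past the end of the cached range.
req_overrun <- list(start = as.Date("2024-03-25"), end = as.Date("2024-04-05"), params = "T2M")

hit <- segment_covers(cached, req_inside)
miss <- segment_covers(cached, req_overrun)
print(c(hit = hit, miss = miss))
```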

Elevation handling (site_elevation)

Elevation is resolved per representative location, not per event row, and becomes part of the cache identity.

Supported modes:

If elev_fun is not supplied, weatherjoin falls back to elev_constant and issues a warning.

Example:

my_elev <- function(lon, lat, ...) rep(120, length(lon))

join_weather(
  x,
  params = "T2M",
  time = "event_time",
  site_elevation = "auto",
  elev_fun = my_elev
)

Joining weather data back to events

Weather values are joined to events using:

Rolling joins are controlled by:

roll = "nearest"       # default
roll_max_hours = 1     # safety limit

This ensures that weather values are not attached from implausibly distant timestamps.
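A nearest-timestamp join with a distance limit can be sketched in base R as follows. The max_hours variable plays the role of roll_max_hours; the data are invented, and the package's actual join implementation is not shown here.

```r
# Hypothetical hourly weather and three events.
weather <- data.frame(
  time = as.POSIXct(c("2024-05-01 09:00", "2024-05-01 10:00", "2024-05-01 11:00"), tz = "UTC"),
  T2M = c(11.2, 12.5, 13.9)
)
events <- data.frame(
  event_time = as.POSIXct(c("2024-05-01 09:20", "2024-05-01 10:45", "2024-05-01 14:30"), tz = "UTC")
)
max_hours <- 1  # stands in for roll_max_hours

# Index of the closest weather timestamp for each event.
w_num <- as.numeric(weather$time)
e_num <- as.numeric(events$event_time)
nearest_idx <- vapply(e_num, function(t) which.min(abs(w_num - t)), integer(1))

# Keep the match only when it lies within the allowed distance.
dist_hours <- abs(w_num[nearest_idx] - e_num) / 3600
events$T2M <- ifelse(dist_hours <= max_hours, weather$T2M[nearest_idx], NA_real_)
print(events)
```

The third event is 3.5 hours from the closest weather record, so it receives NA rather than a value from a distant timestamp.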

Handling missing inputs

Rows with missing latitude, longitude, or time are retained in the output:

This design avoids accidental row loss and keeps joins explicit.
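The row-preserving behaviour can be illustrated with a base R sketch. It assumes, for illustration only, that rows with missing inputs simply receive NA weather values; the exact treatment in weatherjoin is not reproduced here, and the data are invented.

```r
# Hypothetical events: row 2 lacks latitude, row 3 lacks a date.
events <- data.frame(
  id = 1:3,
  lat = c(51.8, NA, 52.2),
  lon = c(-0.36, 0.10, 0.12),
  day = as.Date(c("2024-05-01", "2024-05-01", NA))
)
weather <- data.frame(day = as.Date("2024-05-01"), T2M = 12.4)

# Only rows with all join inputs present are eligible for matching.
complete <- complete.cases(events[, c("lat", "lon", "day")])

# Start from the full table so that no row is dropped.
out <- events
out$T2M <- NA_real_
out$T2M[complete] <- weather$T2M[match(events$day[complete], weather$day)]
print(out)
```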

Summary

weatherjoin aims to make weather data attachment reliable, efficient, and reproducible.

Most users need only a single function call, while advanced configuration remains available via options. Use withr::local_options() for temporary changes inside scripts or reports.
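The pattern behind temporary option changes can be shown in base R without extra dependencies. The helper below, with_temp_options(), is a hypothetical stand-in written for this sketch; withr::local_options() offers the same idea in a more convenient form.

```r
# Hypothetical helper: set options, run code, then restore the old values.
with_temp_options <- function(new_opts, code) {
  old <- options(new_opts)
  on.exit(options(old), add = TRUE)
  force(code)
}

before <- getOption("weatherjoin.dummy_hour")

# The option is changed only while the expression is evaluated.
inside <- with_temp_options(
  list(weatherjoin.dummy_hour = 6),
  getOption("weatherjoin.dummy_hour")
)

after <- getOption("weatherjoin.dummy_hour")
print(inside)
```

Restoring options on exit keeps a script or report from leaking settings into later code in the same session.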