read_many_patterns.Rd
This accepts a directory. It will use read_patterns to load every .csv.gz file in that folder, assuming they are all patterns files, and will then row-bind together each of the processed files it produces. Finally, if post_by is specified, it will re-perform the aggregation; this is handy for new-format patterns files that split the same week's data across multiple files.
read_many_patterns(
  dir = ".",
  recursive = TRUE,
  filelist = NULL,
  start_date = NULL,
  post_by = !is.null(by),
  by = NULL,
  fun = sum,
  na.rm = TRUE,
  filter = NULL,
  expand_int = NULL,
  expand_cat = NULL,
  expand_name = NULL,
  multi = NULL,
  naics_link = NULL,
  select = NULL,
  gen_fips = TRUE,
  silent = FALSE,
  ...
)
dir: Name of the directory the files are in.
recursive: Search in all subdirectories as well, as with the since-June-24-2020 format of the AWS downloads. There is currently no way to include only a subset of these subdirectory files; instead, run list.files(recursive = TRUE) yourself and pass a subset of the results to the filelist option.
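For instance, a minimal sketch of that manual subsetting (the weekly folder path shown is hypothetical):

```r
# List every csv.gz under the working directory, including subfolders
all_files <- list.files(pattern = '\\.csv\\.gz$', recursive = TRUE)

# Keep only the files from one (hypothetical) weekly subfolder
week_files <- all_files[grepl('patterns/2020/12/07/', all_files, fixed = TRUE)]

# Then pass the subset along:
# read_many_patterns(filelist = week_files)
```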
filelist: A vector of filenames to read in, OR a named list of options to send to patterns_lookup(). This list must include dates for the dates of data you want, and list_files will be set to TRUE. If you like, add key and secret to this list to also download the files you need.
start_date: A vector of Date objects giving the first date present in each zip file, to be passed to read_patterns. Unlike in read_patterns, this value will also be added to the data as a variable called start_date so you can use it in post_by.
post_by: After reading in all the files, re-perform aggregation to this level. Use a character vector of variable names (or a list of vectors if using multi). Or just set to TRUE to have post_by = by plus, if present, expand_name or 'date'. Set to FALSE to skip re-aggregation. Including 'start_date' in both by and post_by is a good idea if you aren't using an approach that creates a date variable. By default this is TRUE unless by = NULL (if by = NULL in a multi option, it will still be TRUE by default for that).
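A minimal sketch of that 'start_date' advice, guarded like the package's own examples (the file layout and column choices here are assumptions, not output from a real run):

```r
if (FALSE) {
# visitor_home_cbgs carries no date of its own, so keep start_date in
# both by and post_by to preserve one row per week after re-aggregation
dt <- read_many_patterns(dir = '.',
                         select = 'visitor_home_cbgs',
                         by = c('start_date', 'state_fips'),
                         expand_cat = 'visitor_home_cbgs',
                         # post_by defaults to TRUE here since by is non-NULL;
                         # spelling it out makes the grouping explicit
                         post_by = c('start_date', 'state_fips'))
}
```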
...: Arguments to be passed to read_patterns, specified as in help(read_patterns).
Note that after reading in data, if gen_fips = TRUE, state and county names can be merged in using data(fips_to_names).
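A hedged sketch of that merge, assuming dt came from read_many_patterns with gen_fips = TRUE (so it carries state_fips and county_fips columns matching fips_to_names):

```r
if (FALSE) {
library(SafeGraphR)
data(fips_to_names)
# Left-join so rows without a name match are kept
dt <- merge(dt, fips_to_names,
            by = c('state_fips', 'county_fips'), all.x = TRUE)
}
```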
if (FALSE) {
# Our current working directory is full of .csv.gz files!
# Too many... we will probably run out of memory if we try to read them all in at once, so let's chunk it
files <- list.files(pattern = '\\.csv\\.gz$', recursive = TRUE)
patterns <- read_many_patterns(filelist = files[1:10],
# We only need these variables (and poi_cbg which is auto-added with gen_fips = TRUE)
select = c('brands','visits_by_day'),
# We want two formatted files to come out. The first aggregates to the state-brand-day level, getting visits by day
multi = list(list(name = 'by_brands', by = c('state_fips','brands'), expand_int = 'visits_by_day'),
# The second aggregates to the state-county-day level but only for Colorado and Connecticut (see the filter)
list(name = 'co_and_ct', by = c('state_fips','county_fips'), filter = 'state_fips %in% 8:9', expand_int = 'visits_by_day')))
patterns_brands <- patterns[[1]]
patterns_co_and_ct <- patterns[[2]]
# Alternately, find the files we need for the seven days starting December 7, 2020,
# read them all in (and if we'd given key and secret too, download them first),
# and then aggregate to the state-date level
dt <- read_many_patterns(filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
by = "state_fips", expand_int = 'visits_by_day',
select = 'visits_by_day')
# don't forget that if you want weekly data but AREN'T using visits_by_day
# (for example if you're using visitors_home_cbg)
# you want start_date in your by option, as in the second list in multi here
dt <- read_many_patterns(filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
select = c('visits_by_day','visitor_home_cbgs'),
multi = list(list(name = 'visits', by = 'state_fips',
expand_int = 'visits_by_day', filter = 'state_fips == 6'),
list(name = 'cbg', by = c('start_date', 'state_fips'),
expand_cat = 'visitor_home_cbgs', filter = 'state_fips == 6')))
}