read_many_patterns.Rd
This accepts a directory. It will use read_patterns to load every .csv.gz file in that folder, assuming they are all patterns files, and will then row-bind together each of the processed files it produces. Finally, if post_by is specified, it will re-perform the aggregation; this is handy for new-format patterns files that split the same week's data across multiple files.
read_many_patterns(
  dir = ".",
  recursive = TRUE,
  filelist = NULL,
  start_date = NULL,
  post_by = !is.null(by),
  by = NULL,
  fun = sum,
  na.rm = TRUE,
  filter = NULL,
  expand_int = NULL,
  expand_cat = NULL,
  expand_name = NULL,
  multi = NULL,
  naics_link = NULL,
  select = NULL,
  gen_fips = TRUE,
  silent = FALSE,
  ...
)
dir: Name of the directory the files are in.
recursive: Search in all subdirectories as well, as with the since-June-24-2020 format of the AWS downloads. There is currently no way to include only a subset of these subdirectory files; instead, run list.files(recursive = TRUE) yourself and pass a subset of the results to the filelist option.
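For instance, a minimal sketch of that manual subsetting (the weekly folder path shown is hypothetical):

```r
# List every csv.gz under the working directory, including subfolders
all_files <- list.files(pattern = '\\.csv\\.gz$', recursive = TRUE)

# Keep only the files from one (hypothetical) weekly subfolder
week_files <- all_files[grepl('patterns/2020/12/07/', all_files, fixed = TRUE)]

# Then pass the subset along:
# read_many_patterns(filelist = week_files)
```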
filelist: A vector of filenames to read in, OR a named list of options to send to patterns_lookup(). This list must include dates for the dates of data you want, and list_files will be set to TRUE. If you like, add key and secret to this list to also download the files you need.
start_date: A vector of Date objects giving the first date present in each zip file, to be passed to read_patterns. Unlike in read_patterns, this value will also be added to the data as a variable called start_date so you can use it in post_by.
post_by: After reading in all the files, re-perform aggregation to this level. Use a character vector of variable names (or a list of vectors if using multi). Or just set to TRUE to have post_by = by plus, if present, expand_name or 'date'. Set to FALSE to skip re-aggregation. Including 'start_date' in both by and post_by is a good idea if you aren't using an approach that creates a date variable. By default this is TRUE unless by = NULL (if by = NULL in a multi option, it will still be TRUE by default for that).
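A minimal sketch of that 'start_date' advice, guarded like the package's own examples (the file layout and column choices here are assumptions, not output from a real run):

```r
if (FALSE) {
# visitor_home_cbgs carries no date of its own, so keep start_date in
# both by and post_by to preserve one row per week after re-aggregation
dt <- read_many_patterns(dir = '.',
                         select = 'visitor_home_cbgs',
                         by = c('start_date', 'state_fips'),
                         expand_cat = 'visitor_home_cbgs',
                         # post_by defaults to TRUE here since by is non-NULL;
                         # spelling it out makes the grouping explicit
                         post_by = c('start_date', 'state_fips'))
}
```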
...: Arguments to be passed to read_patterns, specified as in help(read_patterns).
Note that after reading in data, if gen_fips = TRUE, state and county names can be merged in using data(fips_to_names).
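A hedged sketch of that merge, assuming dt came from read_many_patterns with gen_fips = TRUE (so it carries state_fips and county_fips columns matching fips_to_names):

```r
if (FALSE) {
library(SafeGraphR)
data(fips_to_names)
# Left-join so rows without a name match are kept
dt <- merge(dt, fips_to_names,
            by = c('state_fips', 'county_fips'), all.x = TRUE)
}
```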
if (FALSE) {
# Our current working directory is full of .csv.gz files!
# Too many... we will probably run out of memory if we try to read them all in at once, so let's chunk it
files <- list.files(pattern = '\\.csv\\.gz$', recursive = TRUE)
patterns <- read_many_patterns(filelist = files[1:10],
# We only need these variables (and poi_cbg which is auto-added with gen_fips = TRUE)
select = c('brands','visits_by_day'),
# We want two formatted files to come out. The first aggregates to the state-brand-day level, getting visits by day
multi = list(list(name = 'by_brands', by = c('state_fips','brands'), expand_int = 'visits_by_day'),
# The second aggregates to the state-county-day level but only for Colorado and Connecticut (see the filter)
list(name = 'co_and_ct', by = c('state_fips','county_fips'), filter = 'state_fips %in% 8:9', expand_int = 'visits_by_day')))
patterns_brands <- patterns[[1]]
patterns_co_and_ct <- patterns[[2]]
# Alternately, find the files we need for the seven days starting December 7, 2020,
# read them all in (and if we'd given key and secret too, download them first),
# and then aggregate to the state-date level
dt <- read_many_patterns(filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
by = "state_fips", expand_int = 'visits_by_day',
select = 'visits_by_day')
# don't forget that if you want weekly data but AREN'T using visits_by_day
# (for example if you're using visitors_home_cbg)
# you want start_date in your by option, as in the second list in multi here
dt <- read_many_patterns(filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
select = c('visits_by_day','visitor_home_cbgs'),
multi = list(list(name = 'visits', by = 'state_fips',
expand_int = 'visits_by_day', filter = 'state_fips == 6'),
list(name = 'cbg', by = c('start_date', 'state_fips'),
expand_cat = 'visitor_home_cbgs', filter = 'state_fips == 6')))
}