read_patterns.Rd
This function takes a single .csv.gz
SafeGraph patterns file and reads it in. The output is a data.table
(or a list of them if multiple are specified) containing the file given by filename
, collapsed and expanded in different ways. Be aware that the files this function is designed to work with are large, so it may take a while to execute.
read_patterns(
filename,
dir = ".",
by = NULL,
fun = function(x) sum(x, na.rm = TRUE),
na.rm = TRUE,
filter = NULL,
expand_int = NULL,
expand_cat = NULL,
expand_name = NULL,
multi = NULL,
naics_link = NULL,
select = NULL,
gen_fips = TRUE,
start_date = NULL,
silent = FALSE,
...
)
filename: The filename of the .csv.gz
file or the path to the file. Note that if start_date
is not specified, read_patterns
will attempt to get the start date from the first ten characters of the path. In "new format" filepaths ("2020/01/09/core-patterns-part-1.csv.gz"), nine days will be subtracted from the date found.
dir: The directory in which the file sits.
by: A character vector giving the variable names of the level to be collapsed to using fun
. The resulting data will have X rows per unique combination of by
, where X is 1 if no expand variables are specified, or the length of the expand variable if specified. Set to NULL
to aggregate across all initial rows, or set to FALSE
to not aggregate at all (this will also add an initial_rowno
column showing the original row number). You can also avoid aggregating by setting by = 'placekey'
, which might play more nicely with some of the other features.
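For by, a minimal sketch of the three modes; the filename and the raw_visit_counts column here are placeholders, so substitute whatever your own patterns file contains:

# Collapse to one row per state-brand combination
state_brand <- read_patterns('patterns-part-1.csv.gz',
                             by = c('state_fips', 'brands'),
                             select = c('brands', 'raw_visit_counts'))
# Keep one row per POI instead (effectively no aggregation)
by_poi <- read_patterns('patterns-part-1.csv.gz',
                        by = 'placekey',
                        select = c('placekey', 'raw_visit_counts'))
# Keep every original row and add an initial_rowno column
no_agg <- read_patterns('patterns-part-1.csv.gz',
                        by = FALSE,
                        select = c('placekey', 'raw_visit_counts'))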
fun: Function to use to aggregate the expanded variable to the by
level.
na.rm: Whether to remove any missing values of the expanded variable before aggregating. Does not remove missing values of the by
variables. May not be necessary if fun
handles NA
s on its own.
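For example, a sketch swapping the default sum for a mean that handles its own NAs; visits_by_day is a standard weekly-patterns column, but adjust to your file:

avg_visits <- read_patterns('patterns-part-1.csv.gz',
                            by = c('state_fips', 'brands'),
                            expand_int = 'visits_by_day',
                            # fun deals with NAs itself, so na.rm is not needed here
                            fun = function(x) mean(x, na.rm = TRUE),
                            select = c('brands', 'visits_by_day'))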
filter: A character string describing a logical statement for filtering the data, for example filter = 'state_fips == 6'
would give you only data from California. Will be used as an i
argument in a data.table
, see help(data.table)
. Filtering here instead of afterwards can cut down on time and memory demands.
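A sketch of that California example; state_fips is available for filtering because gen_fips = TRUE by default:

ca_only <- read_patterns('patterns-part-1.csv.gz',
                         by = c('state_fips', 'county_fips'),
                         filter = 'state_fips == 6',
                         expand_int = 'visits_by_day',
                         select = 'visits_by_day')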
expand_int: A character variable with the name of the JSON variable in integer format ([1,2,3,...]) to be expanded into rows. Cannot be specified along with expand_cat
.
expand_cat: A JSON variable in categorical format (A: 2, B: 3, etc.) to be expanded into rows. Ignored if expand_int
is specified.
expand_name: The name of the new variable to be created with the category index for the expanded variable.
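As a sketch, one call for each expansion mode; visits_by_day (integer JSON) and bucketed_dwell_times (categorical JSON) are standard patterns columns, but substitute whatever your file actually has:

# Integer-format JSON: one row per day in the week
daily <- read_patterns('patterns-part-1.csv.gz',
                       by = c('state_fips', 'brands'),
                       expand_int = 'visits_by_day',
                       select = c('brands', 'visits_by_day'))
# Categorical-format JSON: one row per category, with the category name
# stored in a new 'dwell_bucket' column via expand_name
dwell <- read_patterns('patterns-part-1.csv.gz',
                       by = c('state_fips', 'brands'),
                       expand_cat = 'bucketed_dwell_times',
                       expand_name = 'dwell_bucket',
                       select = c('brands', 'bucketed_dwell_times'))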
multi: A list of lists, for the purposes of creating a list of multiple processed files. This will vastly speed up processing over doing each of them one at a time. Each named list has the entry name
as well as any of the options by, fun, filter, expand_int, expand_cat, expand_name
as specified above. If specified, will override other entries of by
, etc.
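The basic shape is a list of named lists, one per processed data set, as in this short sketch (a fuller version appears in the example at the bottom of this page):

out <- read_patterns('patterns-part-1.csv.gz',
                     select = c('brands', 'visits_by_day'),
                     multi = list(
                       list(name = 'by_brand', by = 'brands',
                            expand_int = 'visits_by_day'),
                       list(name = 'by_state', by = 'state_fips',
                            expand_int = 'visits_by_day')))
by_brand <- out[[1]]
by_state <- out[[2]]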
naics_link: A data.table
, possibly produced by link_poi_naics
, that links placekey
and naics_code
. This will allow you to include 'naics_code'
in the by
argument. Technically you could have stuff other than naics_code
in here and use that in by
too, I won't stop ya.
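A sketch of the naics_link mechanics; in practice the crosswalk would come from link_poi_naics run on a SafeGraph Core file, but a toy data.table with made-up placekeys shows the shape it needs:

library(data.table)
# Toy crosswalk: placekey -> naics_code (yours would be far larger)
crosswalk <- data.table(placekey = c('222-222@abc-def', '223-223@abc-def'),
                        naics_code = c(722511, 445110))
by_naics <- read_patterns('patterns-part-1.csv.gz',
                          by = 'naics_code',
                          naics_link = crosswalk,
                          select = c('placekey', 'raw_visit_counts'))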
select: Character vector of variables to get from the file. Set to NULL
to get all variables. **Specifying select is very much recommended, and will speed up the function a lot.**
gen_fips: Set to TRUE
to use the poi_cbg
variable to generate state_fips
and county_fips
variables. This will also result in poi_cbg
being converted to character.
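A sketch of both settings; with gen_fips = TRUE (the default) poi_cbg is read automatically and state_fips and county_fips become available for by and filter:

# Default: FIPS columns generated from poi_cbg
with_fips <- read_patterns('patterns-part-1.csv.gz',
                           by = c('state_fips', 'county_fips'),
                           select = 'raw_visit_counts')
# Turned off: poi_cbg stays as-is and no FIPS columns are created
no_fips <- read_patterns('patterns-part-1.csv.gz',
                         gen_fips = FALSE,
                         by = 'poi_cbg',
                         select = c('poi_cbg', 'raw_visit_counts'))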
start_date: The first date in the file, as a date object. If omitted, will assume that the filename begins with YYYY-MM-DD.
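If the filename does not start with the date, a sketch of passing it explicitly (the filename and date here are made up):

wk <- read_patterns('oddly-named-file.csv.gz',
                    start_date = as.Date('2020-06-15'),
                    by = c('state_fips', 'brands'),
                    expand_int = 'visits_by_day',
                    select = c('brands', 'visits_by_day'))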
silent: Set to TRUE to suppress the timecode message.
...: Other arguments to be passed to data.table::fread
when reading in the file. For example, nrows
to only read in a certain number of rows.
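For instance, a sketch of a quick test run that passes nrows through to data.table::fread:

test_run <- read_patterns('patterns-part-1.csv.gz',
                          by = 'brands',
                          select = c('brands', 'raw_visit_counts'),
                          nrows = 1000)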
Note that after reading in data, if gen_fips = TRUE
, state and county names can be merged in using data(fips_to_names)
.
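A sketch of that merge, assuming fips_to_names carries state_fips and county_fips as its keys:

library(data.table)
patterns <- read_patterns('patterns-part-1.csv.gz',
                          by = c('state_fips', 'county_fips'),
                          select = 'raw_visit_counts')
data(fips_to_names)
# Bring in state and county names by FIPS code
patterns <- merge(patterns, fips_to_names,
                  by = c('state_fips', 'county_fips'), all.x = TRUE)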
if (FALSE) {
# 'patterns-part-1.csv.gz' is a weekly patterns file in the main-file folder, which is the working directory
patterns <- read_patterns('patterns-part-1.csv.gz',
  # We only need these variables (and poi_cbg, which is auto-added with gen_fips = TRUE)
  select = c('brands', 'visits_by_day'),
  # We want two processed data sets to come out.
  # The first aggregates to the state-brand-day level, getting visits by day
  multi = list(
    list(name = 'by_brands',
         by = c('state_fips', 'brands'),
         expand_int = 'visits_by_day'),
    # The second aggregates to the state-county-day level,
    # but only for Colorado and Connecticut (see the filter)
    list(name = 'co_and_ct',
         by = c('state_fips', 'county_fips'),
         filter = 'state_fips %in% 8:9',
         expand_int = 'visits_by_day')))
patterns_brands <- patterns[[1]]
patterns_co_and_ct <- patterns[[2]]
}