read_patterns.Rd
This function takes a single .csv.gz
SafeGraph patterns file and reads it in. The output is a data.table
(or a list of them if multiple are specified) containing the file given by filename
, collapsed and expanded in different ways. Be aware that the files this function is designed to work with are large, so it may take a while to execute.
read_patterns(
filename,
dir = ".",
by = NULL,
fun = function(x) sum(x, na.rm = TRUE),
na.rm = TRUE,
filter = NULL,
expand_int = NULL,
expand_cat = NULL,
expand_name = NULL,
multi = NULL,
naics_link = NULL,
select = NULL,
gen_fips = TRUE,
start_date = NULL,
silent = FALSE,
...
)
filename: The filename of the .csv.gz
file or the path to the file. Note that if start_date
is not specified, read_patterns
will attempt to get the start date from the first ten characters of the path. In "new format" filepaths ("2020/01/09/core-patterns-part-1.csv.gz"), nine days will be subtracted from the date found.
dir: The directory in which the file sits.
by: A character vector giving the variable names of the level to be collapsed to using fun
. The resulting data will have X rows per unique combination of by
, where X is 1 if no expand variables are specified, or the length of the expand variable if specified. Set to NULL
to aggregate across all initial rows, or set to FALSE
to not aggregate at all (this will also add an initial_rowno
column showing the original row number). You can also avoid aggregating by setting by = 'placekey'
, which might play more nicely with some of the other features.
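For by, a minimal sketch of the three modes; the filename and the raw_visit_counts column here are placeholders, so substitute whatever your own patterns file contains:

# Collapse to one row per state-brand combination
state_brand <- read_patterns('patterns-part-1.csv.gz',
                             by = c('state_fips', 'brands'),
                             select = c('brands', 'raw_visit_counts'))
# Keep one row per POI instead (effectively no aggregation)
by_poi <- read_patterns('patterns-part-1.csv.gz',
                        by = 'placekey',
                        select = c('placekey', 'raw_visit_counts'))
# Keep every original row and add an initial_rowno column
no_agg <- read_patterns('patterns-part-1.csv.gz',
                        by = FALSE,
                        select = c('placekey', 'raw_visit_counts'))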
fun: Function to use to aggregate the expanded variable to the by
level.
na.rm: Whether to remove any missing values of the expanded variable before aggregating. Does not remove missing values of the by
variables. May not be necessary if fun
handles NA
s on its own.
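For example, a sketch swapping the default sum for a mean that handles its own NAs; visits_by_day is a standard weekly-patterns column, but adjust to your file:

avg_visits <- read_patterns('patterns-part-1.csv.gz',
                            by = c('state_fips', 'brands'),
                            expand_int = 'visits_by_day',
                            # fun deals with NAs itself, so na.rm is not needed here
                            fun = function(x) mean(x, na.rm = TRUE),
                            select = c('brands', 'visits_by_day'))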
filter: A character string describing a logical statement for filtering the data, for example filter = 'state_fips == 6'
would give you only data from California. Will be used as an i
argument in a data.table
, see help(data.table)
. Filtering here instead of afterwards can cut down on time and memory demands.
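A sketch of that California example; state_fips is available for filtering because gen_fips = TRUE by default:

ca_only <- read_patterns('patterns-part-1.csv.gz',
                         by = c('state_fips', 'county_fips'),
                         filter = 'state_fips == 6',
                         expand_int = 'visits_by_day',
                         select = 'visits_by_day')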
expand_int: A character variable with the name of the JSON variable in integer format ([1,2,3,...]) to be expanded into rows. Cannot be specified along with expand_cat
.
expand_cat: A JSON variable in categorical format (A: 2, B: 3, etc.) to be expanded into rows. Ignored if expand_int
is specified.
expand_name: The name of the new variable to be created with the category index for the expanded variable.
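As a sketch, one call for each expansion mode; visits_by_day (integer JSON) and bucketed_dwell_times (categorical JSON) are standard patterns columns, but substitute whatever your file actually has:

# Integer-format JSON: one row per day in the week
daily <- read_patterns('patterns-part-1.csv.gz',
                       by = c('state_fips', 'brands'),
                       expand_int = 'visits_by_day',
                       select = c('brands', 'visits_by_day'))
# Categorical-format JSON: one row per category, with the category name
# stored in a new 'dwell_bucket' column via expand_name
dwell <- read_patterns('patterns-part-1.csv.gz',
                       by = c('state_fips', 'brands'),
                       expand_cat = 'bucketed_dwell_times',
                       expand_name = 'dwell_bucket',
                       select = c('brands', 'bucketed_dwell_times'))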
multi: A list of lists, for the purposes of creating a list of multiple processed files. This will vastly speed up processing over doing each of them one at a time. Each named list has the entry name
as well as any of the options by, fun, filter, expand_int, expand_cat, expand_name
as specified above. If specified, will override other entries of by
, etc.
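The basic shape is a list of named lists, one per processed data set, as in this short sketch (a fuller version appears in the example at the bottom of this page):

out <- read_patterns('patterns-part-1.csv.gz',
                     select = c('brands', 'visits_by_day'),
                     multi = list(
                       list(name = 'by_brand', by = 'brands',
                            expand_int = 'visits_by_day'),
                       list(name = 'by_state', by = 'state_fips',
                            expand_int = 'visits_by_day')))
by_brand <- out[[1]]
by_state <- out[[2]]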
naics_link: A data.table
, possibly produced by link_poi_naics
, that links placekey
and naics_code
. This will allow you to include 'naics_code'
in the by
argument. Technically you could have stuff other than naics_code
in here and use that in by
too, I won't stop ya.
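A sketch of the naics_link mechanics; in practice the crosswalk would come from link_poi_naics run on a SafeGraph Core file, but a toy data.table with made-up placekeys shows the shape it needs:

library(data.table)
# Toy crosswalk: placekey -> naics_code (yours would be far larger)
crosswalk <- data.table(placekey = c('222-222@abc-def', '223-223@abc-def'),
                        naics_code = c(722511, 445110))
by_naics <- read_patterns('patterns-part-1.csv.gz',
                          by = 'naics_code',
                          naics_link = crosswalk,
                          select = c('placekey', 'raw_visit_counts'))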
select: Character vector of variables to get from the file. Set to NULL
to get all variables. **Specifying select is very much recommended, and will speed up the function a lot.**
gen_fips: Set to TRUE
to use the poi_cbg
variable to generate state_fips
and county_fips
variables. This will also result in poi_cbg
being converted to character.
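A sketch of both settings; with gen_fips = TRUE (the default) poi_cbg is read automatically and state_fips and county_fips become available for by and filter:

# Default: FIPS columns generated from poi_cbg
with_fips <- read_patterns('patterns-part-1.csv.gz',
                           by = c('state_fips', 'county_fips'),
                           select = 'raw_visit_counts')
# Turned off: poi_cbg stays as-is and no FIPS columns are created
no_fips <- read_patterns('patterns-part-1.csv.gz',
                         gen_fips = FALSE,
                         by = 'poi_cbg',
                         select = c('poi_cbg', 'raw_visit_counts'))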
start_date: The first date in the file, as a date object. If omitted, will assume that the filename begins with YYYY-MM-DD.
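If the filename does not start with the date, a sketch of passing it explicitly (the filename and date here are made up):

wk <- read_patterns('oddly-named-file.csv.gz',
                    start_date = as.Date('2020-06-15'),
                    by = c('state_fips', 'brands'),
                    expand_int = 'visits_by_day',
                    select = c('brands', 'visits_by_day'))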
silent: Set to TRUE to suppress the timecode message.
...: Other arguments to be passed to data.table::fread
when reading in the file. For example, nrows
to only read in a certain number of rows.
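For instance, a sketch of a quick test run that passes nrows through to data.table::fread:

test_run <- read_patterns('patterns-part-1.csv.gz',
                          by = 'brands',
                          select = c('brands', 'raw_visit_counts'),
                          nrows = 1000)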
Note that after reading in data, if gen_fips = TRUE
, state and county names can be merged in using data(fips_to_names)
.
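A sketch of that merge, assuming fips_to_names carries state_fips and county_fips as its keys:

library(data.table)
patterns <- read_patterns('patterns-part-1.csv.gz',
                          by = c('state_fips', 'county_fips'),
                          select = 'raw_visit_counts')
data(fips_to_names)
# Bring in state and county names by FIPS code
patterns <- merge(patterns, fips_to_names,
                  by = c('state_fips', 'county_fips'), all.x = TRUE)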
if (FALSE) {
# 'patterns-part-1.csv.gz' is a weekly patterns file in the main-file folder, which is the working directory
patterns <- read_patterns('patterns-part-1.csv.gz',
  # We only need these variables (and poi_cbg, which is auto-added with gen_fips = TRUE)
  select = c('brands', 'visits_by_day'),
  # We want two processed data sets to come out.
  # The first aggregates to the state-brand-day level, getting visits by day
  multi = list(
    list(name = 'by_brands',
         by = c('state_fips', 'brands'),
         expand_int = 'visits_by_day'),
    # The second aggregates to the state-county-day level,
    # but only for Colorado and Connecticut (see the filter)
    list(name = 'co_and_ct',
         by = c('state_fips', 'county_fips'),
         filter = 'state_fips %in% 8:9',
         expand_int = 'visits_by_day')))
patterns_brands <- patterns[[1]]
patterns_co_and_ct <- patterns[[2]]
}