growth_over_time.Rd
A start-to-finish download and analysis! Given a range of dates, a subset of data, and a set of grouping variables, this function will produce an estimate of how foot traffic to those groups has changed over that date range within that subset.
growth_over_time(
  dates,
  by,
  ma = 7,
  dir = ".",
  old_dir = NULL,
  new_dir = NULL,
  filelist = NULL,
  filelist_norm = NULL,
  start_dates = NULL,
  filter = NULL,
  naics_link = NULL,
  origin = 0,
  key = NULL,
  secret = NULL,
  make_graph = FALSE,
  graph_by = NULL,
  line_labels = NULL,
  graph_by_titles = NULL,
  test_run = TRUE,
  read_opts = NULL,
  processing_opts = NULL,
  graph_opts = list(title = data.table::fcase(
    is.null(graph_by) & is.null(by), "SafeGraph Foot Traffic Growth",
    is.null(graph_by), paste("SafeGraph Foot Traffic Growth by", paste(by, collapse = ", ")),
    min(by %in% graph_by) == 1, "SafeGraph Foot Traffic Growth",
    default = paste("SafeGraph Foot Traffic Growth by", paste(by[!(by %in% graph_by)], collapse = ", "))
  )),
  patterns_backfill_date = "2020/12/14/21/",
  norm_backfill_date = "2020/12/14/21/",
  ...
)
dates: The range of dates to cover in the analysis. Note that (1) analysis will track growth relative to the first date listed here, and (2) if additional, earlier dates are necessary for the ma moving average, they will be added automatically; don't add them yourself.
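For example, a two-week window could be built with a date sequence like this (the particular start date is only an illustration):

dates <- lubridate::ymd('2020-12-07') + lubridate::days(0:13)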
by: A character vector of variable names to calculate growth separately by. You will get back a data set with one observation per date in dates per combination of variables in by. Set to NULL to aggregate all traffic by date (within the filter). See the variable names in the [patterns documentation](http://docs.safegraph.com), and in addition you may use state_fips and/or county_fips for state and county FIPS codes.
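For instance (an illustrative grouping, not a requirement), growth could be computed separately for each brand within each state:

growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:6),
                 by = c('state_fips', 'brands'))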
ma: Number of days over which to take the moving average.
dir: The folder where the patterns_backfill/patterns folders of patterns data, as well as normalization_stats/normalization_stats_backfill, are stored. This is also where any files that need to be downloaded from AWS will be stored.
Where "old" (pre-December 7, 2020) files go, if not the same as dir
. This should be the folder that contains the patterns_backfill
and the normalization_stats_backfill
folder.
Where "new" (post-December 7, 2020) files go, if not the same as dir
. This should be the folder that contains the patterns
and the normalization_stats
folder.
filelist: If your data is not structured as downloaded from AWS, use this option to pass a vector of (full) filenames for patterns CSV.GZ data instead of looking in dir or on AWS. These will not be checked for date ranges until after opening them all, so be extra sure you have everything you need!
filelist_norm: If your data is not structured as downloaded from AWS, use this option to pass a vector of (full) filenames for normalization CSV data instead of looking in dir or on AWS. These will not be checked for date ranges until after opening them all, so be extra sure you have everything you need!
start_dates: If using the filelist argument, provide a vector of the first date present in each file. This should be the same length as filelist.
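A minimal sketch of passing pre-downloaded files directly; the filenames and dates here are purely hypothetical:

my_files <- c('my-patterns-part1.csv.gz', 'my-patterns-part2.csv.gz')
my_starts <- lubridate::ymd(c('2020-12-07', '2020-12-14'))
growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:13),
                 by = 'brands',
                 filelist = my_files,
                 start_dates = my_starts)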
filter: A character variable describing a subset of the data to include, for example filter = 'state_fips == 6' to only include California, or brands == 'McDonald\'s' to only include McDonald's. See the variable names in the [patterns documentation](http://docs.safegraph.com), and in addition you may use state_fips and/or county_fips for state and county FIPS codes.
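For instance, conditions can be combined in a single filter string (this particular combination is just an illustration):

growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:6),
                 by = 'brands',
                 filter = 'state_fips == 6 & brands == "Target"')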
naics_link: Necessary only to filter or by on a NAICS code. A data.table, possibly produced by link_poi_naics, that links placekey and naics_code. This will allow you to include 'naics_code' in the by argument. Technically you could have stuff other than naics_code in here and use that in by too, I won't stop ya.
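As a sketch, a naics_link table only needs the two linking columns; the placekeys below are made up for illustration:

# In practice this table could come from link_poi_naics() rather than being built by hand
naics_link <- data.table::data.table(
  placekey = c('222-222@abc-def-ghi', '223-223@abc-def-jkl'),
  naics_code = c(722513, 445110)
)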
origin: The value indicating no growth/initial value. The first date for each group will have this value. Usually 0 (for "0 percent growth") or 1 ("100 percent of initial value").
key: A character string containing an AWS Access Key ID, necessary if your range of dates extends beyond the files in dir.
secret: A character string containing an AWS Secret Access Key, necessary if your range of dates extends beyond the files in dir.
make_graph: Set to TRUE to produce (and return) a nicely formatted graph showing growth over time, with separate lines for each by group. If by produces more than roughly six combinations, the graph won't look very good and you should also specify graph_by. Requires that **ggplot2** and **ggrepel** be installed. If this is TRUE, then instead of returning a data.table, the function will return a list where the first element is the normal data.table and the second is the ggplot object.
graph_by: A character vector, which must be a subset of by. Will produce a separate graph for each combination of graph_by, graphing separate lines on each for the remaining elements of by that aren't in graph_by. In this case, the second element of the returned list will itself be a list containing each of the different graphs as elements, and no graph will be automatically printed. Only relevant if make_graph = TRUE.
line_labels: A data.table (or an object like a data.frame that can be coerced to a data.table). Contains columns for all the variables that are in by but not graph_by. Those columns should uniquely identify the rows. Contains exactly one other column, which is the label that will be put on the graph lines. For example, data(naics_codes) would work as an argument if by = 'naics_code'. If line_labels is specified, any combination of by-but-not-graph_by values that is not present in line_labels will be dropped.
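A minimal hand-built sketch (the brands and labels here are assumptions for illustration):

# One column matching a by variable, plus exactly one label column
line_labels <- data.table::data.table(
  brands = c('Target', "Macy's"),
  label = c('Target', "Macy's Department Stores")
)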
graph_by_titles: A data.table (or an object like a data.frame that can be coerced to a data.table). Contains columns for all the variables that are in graph_by. Those columns should uniquely identify the rows. Contains exactly one other column, which is the label that will be put in each graph's subtitle. If graph_by_titles is specified, any combination of graph_by values that is not in graph_by_titles will be dropped.
test_run: Runs your analysis for only the first week of data, just to make sure it looks like you want. TRUE by default, because this is a slow, data-hungry, and (if you haven't already downloaded the files) bandwidth-hungry command, and you should only run the full thing after being sure it's right!
read_opts: A named list of options to be sent to read_many_patterns. Be careful using this, as there may be conflicts with options implied by other parameters. Including a select option here will likely speed up the function considerably.
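For example, the columns selected below are an assumption; pick whatever your analysis actually needs:

growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:6),
                 by = 'brands',
                 read_opts = list(select = c('placekey', 'brands', 'visits_by_day',
                                             'date_range_start', 'region')))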
processing_opts: A named list of options to be sent to processing_template. Be careful using this, as there may be conflicts with options implied by other parameters.
graph_opts: A named list of options to be sent to graph_template.
patterns_backfill_date: Character variable with the folder structure for the most recent patterns_backfill pull; i.e., the 2018, 2019, and 2020 folders containing backfill data in their subfolders should sit in the paste0(old_dir, '/patterns_backfill/', patterns_backfill_date) folder.
norm_backfill_date: A character string containing the series of dates that fills the X in normalization_stats_backfill/X/, and in which the 2018, 2019, and 2020 folders sit.
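Assuming the default backfill dates, the expected folder layout would look roughly like this (a sketch based on the descriptions above, not a verbatim listing):

# old_dir/
#   patterns_backfill/2020/12/14/21/2018/...
#   patterns_backfill/2020/12/14/21/2019/...
#   patterns_backfill/2020/12/14/21/2020/...
#   normalization_stats_backfill/2020/12/14/21/2018/...  (and 2019, 2020)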
...: Parameters to be passed on to patterns_lookup() (and, often, from there on to safegraph_aws()).
This function goes from start to finish: downloading any necessary files from AWS, reading them in and processing them, normalizing the data by sample size, calculating a moving average, and returning the processed data by group and date. It will even make you a nice graph, if you want, using graph_template.
Returns a data.table with all the variables in by, the date, the raw visits_by_day, the total_devices_seen normalization variable, the adj_visits variable adjusted for sample size, and growth_visits, which calculates growth from the start of the dates range. If make_graph is TRUE, will instead return a list where the first element is that data.table and the second is a ggplot graph object.
Be aware:
1. This will only work with the visits_by_day variable. Or at least it's only designed to. Maybe you can get it to work with something else.
2. This uses processing_template, so all the caveats of that function apply here. No attempt will be made to handle outliers, oddities in the data, etc. You get what you get. If you want anything more complex, you'll have to do it by hand! You might try mining this function's source code (just do foot_traffic_growth in the console) to get started.
3. Each week of included data means roughly a 1GB AWS download unless it's already on your system. Please don't ask for more than you need, and if you have already downloaded the data, please input the directory properly to avoid re-downloading.
4. This requires data to be downloaded from AWS, and will not work on Shop data. See read_many_shop followed by processing_template for that.
5. Be aware that very long time frames, for example those crossing multiple years, will always be a little suspect for this function. The sample changed structure considerably from 2019 to 2020. Usually this is handled by normalizing within each year and then calculating year-over-year change on top of that. This function doesn't do that, but you could take its output and do that yourself if you wanted (see the sketch after this list).
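As a rough sketch of that kind of post-processing (the column names follow the Value section above; the within-year indexing approach itself is an assumption, not something this function does for you):

res <- growth_over_time(lubridate::ymd('2019-12-02') + lubridate::days(0:370),
                        by = 'brands', test_run = FALSE)
# Tag each row with its year, then index each brand-year to its own first date
res[, year := data.table::year(date)]
res[, within_year_index := adj_visits / adj_visits[which.min(date)],
    by = .(brands, year)]
# Year-over-year change could then be computed by comparing the same calendar
# day (or week) across years within each brand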
TO BE ADDED SOON: Sample size adjustments to equalize sampling rates, and labeling.
if (FALSE) {
data(state_info)
p <- growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:6),
                      by = c('region', 'brands'),
                      filter = 'brands %in% c("Macy\'s", "Target")',
                      make_graph = TRUE,
                      graph_by = 'region',
                      graph_by_titles = state_info[, .(region, statename)],
                      test_run = FALSE)
# The overall growth data for Target and Macy's in this week
p[[1]]
# Look at the graph for the growth of Target and Macy's in this week in Colorado
p[[2]][[6]]
}