This vignette will walk through the process of using the Traffic-Over-Time functions, introduced in SafeGraphR version 0.3.0, which completely automate the process of downloading, reading, processing, and outputting SafeGraph data in the
visits_by_day column. This process is optimized for use with
visits_by_day, although you might be able to get at least some of it to work for other variables.
The idea is that you can do basically everything you need with a single
growth_over_time() call. This vignette will discuss how to make that call, and also talk about some of the internal functions it uses, so you can know how to use them yourself if you’d like to be a little more hands-on.
Note that the original AWS bucket no longer exists, so in any case where this function downloads data from AWS (wherever you need a key or secret argument), you’ll need to pass along the appropriate
safegraph_aws() arguments as well if you have your own AWS setup.
The growth_over_time() function is intended to produce a data set and (optionally) a graph showing how a certain subset of traffic data from the weekly patterns files changes over a set of dates, possibly doing different groups separately.
The first thing we need to do is determine the set of dates that we want to track traffic over, for the
dates argument. This is a vector of dates. If you want a range of dates, one good way to do it is
lubridate::ymd('year-month-day') + lubridate::days(0:numberofdaystotrack). The date range you give will be expanded based on
ma to ensure that when you take a moving average with
ma(), your entire date range survives.
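A minimal sketch of building the dates vector this way. The padding step at the end is only an illustration of why the range has to be expanded: it assumes a trailing moving average over `ma` days, and it is not necessarily how growth_over_time() does the expansion internally.

```r
library(lubridate)

# One week of dates starting 2020-07-01, built as described above
dates <- ymd('2020-07-01') + days(0:6)

# Illustration only (assumption): a trailing moving average over `ma` days
# needs ma - 1 earlier days so the first requested date still gets a value
ma <- 7
padded_dates <- c(min(dates) - days((ma - 1):1), dates)

range(dates)
range(padded_dates)
```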
This expanded date range will be passed to
patterns_lookup() to determine the set of folders that the data you need are in. If you also give an AWS
key and secret, it will download all those files for you, along with the appropriate normalization data.
If you want to get more precise, you can instead use the
filelist and filelist_norm arguments to specify exactly which files are read in for patterns and normalization data. Note that
growth_over_time() uses the national-level
total_devices_seen column for normalization.
# Not yet working code
gr <- growth_over_time(lubridate::ymd('2020-07-01') + lubridate::days(0:6),
                       key = 'mykey', secret = 'mysecret')
We also must specify a
by argument, which is a character vector giving the variables to do calculations by. Visits data will be aggregated to this level before growth calculations are made. So for example,
by = c('brands','state_fips') would calculate visits in each state for each brand.
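To make "aggregated to this level" concrete, here is a small sketch using data.table on made-up data. The toy table and the name `by_vars` are hypothetical; this shows the kind of aggregation described, not the package's internal code.

```r
library(data.table)

# Hypothetical toy data: two brands in two states over two days
visits <- data.table(
  date          = as.Date(c('2020-07-01', '2020-07-01', '2020-07-01',
                            '2020-07-02', '2020-07-02', '2020-07-02')),
  brands        = c('A', 'A', 'B', 'A', 'B', 'B'),
  state_fips    = c(34, 34, 36, 34, 36, 36),
  visits_by_day = c(10, 5, 7, 12, 3, 4)
)

by_vars <- c('brands', 'state_fips')

# Sum visits within each date and each combination of the by variables
agg <- visits[, .(visits_by_day = sum(visits_by_day)), by = c('date', by_vars)]
agg
```

Each row of `agg` is now one brand in one state on one day, which is the level growth would then be calculated at.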
Any variable in
by other than the actual variables in the patterns files, or
county fips, must be brought in by merging the patterns files with the
data.table provided in the
naics_link argument. Variables other than
naics_code may be brought in this way and used.
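As a rough illustration of what that merge amounts to, the sketch below joins a made-up link table onto made-up patterns data by naics_code and then groups by the variable that was brought in. The tables and the `sector` column are hypothetical stand-ins; the real naics_link table may be laid out differently.

```r
library(data.table)

# Hypothetical patterns-like data keyed by NAICS code
patterns <- data.table(
  naics_code    = c(722511, 722513, 445110),
  visits_by_day = c(100, 250, 400)
)

# Hypothetical link table, standing in for what naics_link would hold:
# one row per naics_code plus the extra variable we want to group by
link <- data.table(
  naics_code = c(722511, 722513, 445110),
  sector     = c('Restaurants', 'Restaurants', 'Grocery')
)

# Bring `sector` in by merging on naics_code, then aggregate by it
merged <- merge(patterns, link, by = 'naics_code', all.x = TRUE)
by_sector <- merged[, .(visits_by_day = sum(visits_by_day)), by = sector]
by_sector
```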
Many of the other options here, like
dir, are fairly straightforward and are shared with
read_patterns(). Some others will generally not be touched, for example the
backfill_date arguments, which will only need to be updated if you are using a version of the patterns backfill other than the one available the last time this package was updated. It’s also unlikely that you’ll want to do anything with
read_opts(), which just passes options along to
, which passes options to processing_template().
One argument to be aware of is
test_run. Set this to
FALSE to actually run your command. Because this function takes so long, and downloads so much data, you want to be real sure it runs right before doing it!
test_run = TRUE, which is the default, will use only one week of data, limiting the impact it has while you check if things work.
Setting your options appropriately will use
patterns_lookup() to figure out which patterns and normalization files to read in,
read_many_patterns() and read_many_csvs() to read them, respectively, and then
processing_template() to prepare them.
The processing_template() function will take patterns data fresh off of
read_many_patterns() and normalization data fresh off of
read_many_csvs() and process them appropriately, resulting in a sample-size adjusted visits value and a measurement of growth in traffic since the first day in the sample.
# Example data, imagine these came from read_many_patterns(expand_int = 'visits_by_day')
# and read_many_csvs(), respectively
data("pat_NY_NJ")
data("norm")
pat_NY_NJ[, .(visits_by_day = sum(visits_by_day)), by = .(date, state_fips, county_fips)]
#>            date state_fips county_fips visits_by_day
#>   1: 2020-06-22         34          01         23878
#>   2: 2020-06-23         34          01         23898
#>   3: 2020-06-24         34          01         24463
#>   4: 2020-06-25         34          01         24788
#>   5: 2020-06-26         34          01         28190
#>  ---
#> 577: 2020-06-24         36         123          1001
#> 578: 2020-06-25         36         123           961
#> 579: 2020-06-26         36         123          1212
#> 580: 2020-06-27         36         123          1328
#> 581: 2020-06-28         36         123           823

norm[, .(date, total_devices_seen)]
#>          date total_devices_seen
#> 1: 2020-06-22           17312070
#> 2: 2020-06-23           16870682
#> 3: 2020-06-24           17103600
#> 4: 2020-06-25           17014972
#> 5: 2020-06-26           16964900
#> 6: 2020-06-27           16304498
#> 7: 2020-06-28           16341487
We can pass these two to
processing_template() and it will do the sample normalization, the moving average, and the growth-since-first-day. Note that the normalization variable we want should be the last column of the normalization data.
growth_data <- processing_template(pat_NY_NJ[, .(visits_by_day = sum(visits_by_day)), by = .(date, state_fips, county_fips)],
                                   norm = norm[, .(date, total_devices_seen)],
                                   by = c('state_fips','county_fips'),
                                   ma = 2) # since we only have one week of data, use a very short moving average, just to demonstrate
growth_data
#>            date state_fips county_fips visits_by_day total_devices_seen
#>   1: 2020-06-23         34          01         23898           16870682
#>   2: 2020-06-24         34          01         24463           17103600
#>   3: 2020-06-25         34          01         24788           17014972
#>   4: 2020-06-26         34          01         28190           16964900
#>   5: 2020-06-27         34          01         28943           16304498
#>  ---
#> 494: 2020-06-24         36          99          2292           17103600
#> 495: 2020-06-25         36          99          2145           17014972
#> 496: 2020-06-26         36          99          2335           16964900
#> 497: 2020-06-27         36          99          2751           16304498
#> 498: 2020-06-28         36          99          2026           16341487
#>        adj_visits     growth
#>   1: 0.0013979046 0.00000000
#>   2: 0.0014234120 0.01824687
#>   3: 0.0014435591 0.03265929
#>   4: 0.0015592504 0.11541975
#>   5: 0.0017184102 0.22927577
#>  ---
#> 494: 0.0001255962 0.09600879
#> 495: 0.0001300362 0.13475380
#> 496: 0.0001318513 0.15059337
#> 497: 0.0001531818 0.33673288
#> 498: 0.0001463527 0.27713911
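The three steps (normalization, moving average, growth since the first day) can be sketched by hand with data.table. This is an illustration under assumptions, not the package's actual implementation: it assumes a trailing moving average, that days without a full window are dropped, and it uses hypothetical names (`pat`, `norm_small`, `sketch`, `by_vars`, `ma_days`). The input numbers are the first three days for state 34, county 01, copied from the example data shown above.

```r
library(data.table)

# First three days for state 34, county 01, copied from the example data
pat <- data.table(
  date          = as.Date(c('2020-06-22', '2020-06-23', '2020-06-24')),
  state_fips    = 34,
  county_fips   = '01',
  visits_by_day = c(23878, 23898, 24463)
)
norm_small <- data.table(
  date               = as.Date(c('2020-06-22', '2020-06-23', '2020-06-24')),
  total_devices_seen = c(17312070, 16870682, 17103600)
)

by_vars <- c('state_fips', 'county_fips')
ma_days <- 2

# 1. Sample normalization: visits per device seen that day
sketch <- merge(pat, norm_small, by = 'date')
setorderv(sketch, c(by_vars, 'date'))
sketch[, adj_visits := visits_by_day / total_devices_seen]

# 2. Trailing moving average within each group (assumed alignment),
#    dropping days that do not have a full window
sketch[, adj_visits := frollmean(adj_visits, ma_days), by = by_vars]
sketch <- sketch[!is.na(adj_visits)]

# 3. Growth relative to the first remaining day in each group
sketch[, growth := adj_visits / first(adj_visits) - 1, by = by_vars]
sketch[, .(date, adj_visits, growth)]
```

Under these assumptions the two remaining rows line up with the first two rows of growth_data: adj_visits of about 0.0013979 and 0.0014234, and growth of 0 and about 0.0182.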
If you set make_graph = TRUE, you instead get a list where the
processing_template() output is the first element of that list! The second would then be a graph or list of graphs. The other arguments in
growth_over_time() have to do with the creation of graphs.
make_graph = TRUE will get you a pretty nice-looking graph without any further work! It is probably not completely publication-ready (it does look nice, but you’d have to fiddle with titles, etc.), but it is definitely good enough for your own information.
make_graph = TRUE with nothing else will simply take the output discussed in the last section and send it to
graph_template(), which produces a ggplot2 line graph with labels at the end of the lines, and a separate line for each combination of
by. It will also combine the elements of
by into a single variable, if there’s more than one element.
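Combining several by variables into a single line-identifying variable could look something like the sketch below. The separator, the `line_group` column name, and the toy data are assumptions made for illustration; graph_template() may combine them differently.

```r
library(data.table)

# Hypothetical growth data with two grouping variables
growth_dt <- data.table(
  state_fips = c(34, 34, 36, 36),
  brands     = c('A', 'B', 'A', 'B'),
  growth     = c(0.10, 0.05, -0.02, 0.08)
)
by_vars <- c('state_fips', 'brands')

# Paste the by columns together into one label per row (assumed separator)
growth_dt[, line_group := do.call(paste, c(.SD, sep = ', ')), .SDcols = by_vars]
growth_dt$line_group
```

A single column like this can then be mapped to the line color or group in a ggplot2 line graph.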
# (this won't work since I'm going straight to graph_template, but I could have
# loaded data(fips_to_names) and used that as a line_labels argument in growth_over_time
# to label the lines with their county names rather than just the numbers)
# This example uses only a few counties
graph_template(growth_data[state_fips == 34 & county_fips %in% c(1,3,5)], by = c('county_fips')) +
  # This is a regular ggplot2 object, so we can further modify it as one
  ggplot2::theme(text = ggplot2::element_text(family = 'serif'))
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE): no
#> non-missing arguments to max; returning -Inf
#> Warning in min.default(structure(numeric(0), class = "Date"), na.rm = FALSE): no
#> non-missing arguments to min; returning Inf
#> Warning in max.default(structure(numeric(0), class = "Date"), na.rm = FALSE): no
#> non-missing arguments to max; returning -Inf
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.
And that’s the basic idea! You can change the options of
graph_template() by passing them to the
graph_opts argument of
growth_over_time() as a list, but dang that sounds like a lot of work to me!
growth_over_time() gives you back a list. The first element of that list is the
processing_template() output, and the second is the graph.
One thing that is likely to be common in
growth_over_time() is that there will be a lot of combinations of the
by variables. Too many to track if you put each of their lines on the same graph. So the
graph_by option will let you specify a character vector of variable names, which must be a subset of
by. Then, it will produce separate graphs for each combination of the graph_by variables.
So, want to track two brands separately in each state? Use
filter to pick those two brands, then use
by = c('state_fips','brands'), and then finally
graph_by = 'state_fips'. Now,
growth_over_time() will still return a two-element list, the first of which is the processing_template() output.
But the second element will now be its own list, with one element for each combination of
graph_by. Each element of that list will be a separate
graph_template() call, using only data from that combination of
graph_by variables, and graphing lines separately by the remaining non-
graph_by elements of
by. So with
by = c('state_fips','brands') and
graph_by = 'state_fips' we’d get one graph per state, and one line per brand on each of those graphs.
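The splitting logic described here can be sketched with data.table and ggplot2 directly. This is a hand-rolled illustration on made-up data, not a call to graph_template(), and the names `growth_dt`, `by_vars`, `line_by`, `pieces`, and `graphs` are hypothetical.

```r
library(data.table)
library(ggplot2)

# Hypothetical growth data: two brands in two states over three days
growth_dt <- data.table(
  date       = rep(as.Date('2020-07-01') + 0:2, each = 4),
  state_fips = rep(c(34, 34, 36, 36), times = 3),
  brands     = rep(c('A', 'B'), times = 6)
)
growth_dt[, growth := (seq_len(.N) - 1) / 100]

by_vars  <- c('state_fips', 'brands')
graph_by <- 'state_fips'
line_by  <- setdiff(by_vars, graph_by)  # what is left over becomes the lines

# One data set per combination of the graph_by variables...
pieces <- split(growth_dt, by = graph_by)

# ...and one line graph per piece, with one line per remaining by variable
graphs <- lapply(pieces, function(d) {
  ggplot(d, aes(x = date, y = growth, color = .data[[line_by]])) +
    geom_line() +
    labs(title = paste('state_fips:', d$state_fips[1]))
})

names(graphs)
```

Here `graphs` plays the role of the second list element: one graph per state, each with one line per brand.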