Adjust SafeGraph Data for Sampling Size Differences — sample_size_adjust

This function uses 2016 American Community Survey data to adjust SafeGraph counts for the share of the population that is sampled. It returns a data.table with columns for a geographic ID and the variable adjust_factor. Merge this into your data, then multiply any count variables by adjust_factor to adjust them for sampling differences.

sample_size_adjust(
  data,
  from_id = "census_block_group",
  sample_id = "number_devices_residing",
  from_level = "cbg",
  to_level = "county",
  by = NULL,
  pop_data = NULL
)

Arguments

data: A data.frame (or tibble or data.table) containing, possibly among other things, geographic ID variables and a variable for the number of SafeGraph devices observed in that area. Often this is from a home-panel-summary file.
from_id: A character vector either giving the variable name of the census block group ID, or both the state FIPS and county FIPS variables (which must be numeric, and in state, then county order). Census block group must be specified if from_level='cbg'.
sample_id: A character string giving the name of the variable in data that has the number of SafeGraph observations.
from_level: Either 'cbg' or 'county', indicating the geographic level that is to be adjusted.
to_level: Either 'county' or 'state', indicating the geographic level that the from_level components are to be adjusted to. For example, from_level='county' and to_level='state' would give an adjustment factor for each county as though each county in the state was sampled at the same rate.
by: The data returned will be on the from_level level. Specify other variables here to have it instead be on the from_level-by level, perhaps a timecode. by should not split the from_level counts. If, for example, by is used to split a county into two geographic subcounties, then the population adjustment will not be correct.
pop_data: If a population data file other than data(cbg_pop) or data(county_pop) should be used, enter it here. It should be in the same format, and with the same variable names, as cbg_pop if from_level='cbg', or the same as county_pop if from_level='county'.
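
To make the idea of the adjustment concrete, here is a minimal base R sketch of one way such a factor could be computed for from_level='cbg' and to_level='county'. The formula (a CBG's share of county population divided by its share of county devices) and the toy columns county and pop are assumptions for illustration only. They are not taken from the package source, and sample_size_adjust may compute adjust_factor differently.

```r
# Toy data: three CBGs in two counties (all values are made up).
toy <- data.frame(
  census_block_group = c("A1", "A2", "B1"),
  county = c("A", "A", "B"),                # hypothetical county ID
  number_devices_residing = c(10, 30, 20),
  pop = c(500, 500, 800),                   # hypothetical population
  stringsAsFactors = FALSE
)

# County-level totals, repeated on each CBG row of that county.
toy$county_devices <- ave(toy$number_devices_residing, toy$county, FUN = sum)
toy$county_pop <- ave(toy$pop, toy$county, FUN = sum)

# Assumed formula: population share divided by device share.
toy$adjust_factor <- (toy$pop / toy$county_pop) /
  (toy$number_devices_residing / toy$county_devices)

toy[, c("census_block_group", "adjust_factor")]
```

Under this assumed formula, the two CBGs in county A have equal populations but unequal device counts (10 and 30). Multiplying by adjust_factor gives 20 and 20, which still sums to the county's 40 devices but now reflects the equal population split.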

Examples

if (FALSE) {
# The current working directory has many home_panel_summary files.
# Read census_block_group as character, then strip any leading zeros
# so it is in the same format as in cbg_pop
home_panel <- read_many_csvs(colClasses = c(census_block_group = 'character'))
home_panel[, census_block_group := as.character(as.numeric(census_block_group))]

# Create the data set with the adjust_factor variable
# This will adjust CBG populations to county ones, by default
adj_factor <- sample_size_adjust(home_panel, by = 'date_range_start')

# Now take some distancing data I have
# (where census_block_group is stored as origin_census_block_group)
data.table::setnames(adj_factor, 'census_block_group', 'origin_census_block_group')
# and merge in the adjustment factor
distancing <- merge(distancing, adj_factor, all.x = TRUE, by = 'origin_census_block_group')
# And use that adjustment factor to adjust!
distancing[, adj_device_count := device_count * adjust_factor]

}
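
Because the example above is wrapped in if (FALSE) and depends on local files, the following sketch replays the merge-and-multiply step on made-up data. Both tables and all values are hypothetical, and base R merge() is used in place of data.table so that the block runs without extra packages.

```r
# Hypothetical adjustment factors, as if returned by sample_size_adjust()
# and already renamed to match the distancing data.
adj_factor <- data.frame(
  origin_census_block_group = c("10010201001", "10010201002"),
  adjust_factor = c(2, 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical distancing data; the third CBG has no adjustment factor.
distancing <- data.frame(
  origin_census_block_group = c("10010201001", "10010201002", "10010202001"),
  device_count = c(10, 40, 25),
  stringsAsFactors = FALSE
)

# Left join keeps every distancing row, then scale the count.
distancing <- merge(distancing, adj_factor, all.x = TRUE,
                    by = "origin_census_block_group")
distancing$adj_device_count <- distancing$device_count * distancing$adjust_factor
distancing
```

Note that all.x = TRUE keeps rows with no matching adjustment factor, so their adjusted count is NA. It may be worth checking for such rows, since they often point to an ID format mismatch such as leading zeros.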