Objective
The objective of this analysis is to estimate whether damage counts are changing significantly over time in the United States after accounting for several potential driving factors (e.g., economic growth and dig activity).
Method
A regression analysis was performed to relate damage counts by month and state to a set of explanatory variables including factors related to the economy, demographics, dig activity and others as noted below. The analysis focused on trends over time using a set of year variables to account for changes over the three years included in the analysis (2019 to 2021).
[1]
Because the damage data are counts, Poisson and negative binomial count models were used for the analysis. Both models are designed for count data, but the negative binomial model relaxes the Poisson model's assumption that the variance equals the mean (via an overdispersion parameter). Models were run with and without clustering the standard errors on the geographic unit (state), which accounts for the panel nature of the data. The coefficients of interest for the trend analysis are those on the year variables in Equation 1. Equation 1 allows the year coefficients to be assessed while holding all other measures (e.g., economic and dig activity) equal. If the year coefficients are statistically significant, damages are trending up (or down) for reasons not related to economic or dig activity; if they are not, there is no evidence of such a trend.
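Equation 1 is referenced in the text, but its body does not appear here. As a hedged sketch only, count models of the kind described are commonly written with a log link, as below. The notation (y_{st} for the damage count in state s and month t, x_{st} for the remaining explanatory variables, the delta and beta coefficients, and the NB2 form of the negative binomial variance) is assumed for illustration and is not taken from the original equation.

```latex
\begin{align}
\ln \mu_{st} &= \beta_0
  + \delta_{2020}\,\mathrm{year\_2020}_{t}
  + \delta_{2021}\,\mathrm{year\_2021}_{t}
  + \mathbf{x}_{st}'\boldsymbol{\beta},
  \qquad \mu_{st} = \mathbb{E}\left[y_{st} \mid \mathrm{year}_t, \mathbf{x}_{st}\right] \\
\text{Poisson:}\quad \operatorname{Var}(y_{st}) &= \mu_{st} \\
\text{Negative binomial (NB2, assumed):}\quad \operatorname{Var}(y_{st}) &= \mu_{st} + \alpha\,\mu_{st}^{2}
\end{align}
```

Under this form, 2019 is the reference year, so each year coefficient measures the log difference in expected counts relative to 2019, and setting alpha to zero recovers the Poisson variance.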
Before running the regression models, standard data assessments were completed to ensure the regression results are not impacted by known data issues. For instance, multicollinearity between variables was assessed using variance inflation factors (VIF). Since multicollinearity can influence how regression coefficients for certain variables are interpreted, highly collinear variables were iteratively removed if their VIF was above 5. This resulted in the primary models having a reduced set of variables with limited collinearity, and regression analyses were conducted on this reduced set as well as the total set of variables. Variables that were dropped due to multicollinearity were added back into the model one-by-one after the primary models were estimated to assess whether results differed (e.g., if variable ‘A’ was dropped initially then it was added back into the primary model by itself, if variable ‘B’ was dropped initially then it was added back into the primary model by itself, and so on).
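The VIF screening described above can be sketched as follows. This is a minimal illustration, not the procedure actually run for the analysis: it assumes that the variable with the largest VIF is dropped at each step (the removal order is not stated in the text), it computes each VIF as the corresponding diagonal entry of the inverse correlation matrix, and the `demo` columns are made-up numbers rather than values from the dataset.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a dict of equal-length, non-constant columns."""
    names = list(columns)
    n = len(columns[names[0]])
    z = {}
    for name in names:
        m, s = mean(columns[name]), pstdev(columns[name])
        z[name] = [(v - m) / s for v in columns[name]]  # standardize each column
    return [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in names] for i in names]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    k = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vifs(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return {name: inv[i][i] for i, name in enumerate(columns)}


def drop_high_vif(columns, threshold=5.0):
    """Drop the highest-VIF variable repeatedly until every VIF is <= threshold."""
    kept, dropped = dict(columns), []
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        dropped.append(worst)
        del kept[worst]
    return kept, dropped


# Hypothetical toy data: 'total_emp' is nearly proportional to 'pop'.
demo = {
    "pop": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "total_emp": [0.52, 0.98, 1.51, 2.03, 2.49, 3.02, 3.48, 4.01],
    "tavg": [30.0, 70.0, 45.0, 60.0, 35.0, 65.0, 50.0, 55.0],
}
kept, dropped = drop_high_vif(demo)
print("kept:", sorted(kept), "dropped:", dropped)
```

In the toy data, one of the two nearly proportional columns is removed and the unrelated temperature column is retained, which mirrors how the reduced variable set for the primary models was obtained.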
Data
A subset of the U.S. damage data was assembled for the 2019 to 2021 period. To reduce the influence of year-to-year variation in company reporting behavior, as opposed to actual changes in damages, the damage count dataset was assembled from companies that reported consistently during the three years. The team also reviewed the makeup of companies to ensure the comparable dataset included representatives from facility owner/operators and 811 center stakeholders as well as locator, electric, telecom/CATV, excavator, and water stakeholders.
Damages in the final set of data were distributed across the 50 states and the District of Columbia as well as the 36 months over the 2019 to 2021 period. Damage counts reported for certain states (i.e., Alaska, Hawaii, Maine, and Vermont) are zero or very low, so these states are not well represented in the analysis. In addition, other states appear poorly represented relative to past years, and these states were flagged in consultation with CGA staff (i.e., Arizona, Connecticut, Idaho, Kansas, Montana, North Dakota, New Hampshire, New Mexico, Nevada, Wisconsin, and West Virginia). Data on other variables, including weather, demographics, economics (e.g., GDP or employment), construction, dig activity (e.g., transmissions), as well as PHMSA data on damages and tickets, were also collected (Table B1).
Results and Conclusion
The initial multicollinearity check revealed that many of the variables in Table B1 were highly collinear, and these variables were removed from the regression analysis. However, the variables of interest, year_2020 and year_2021, were not substantially correlated with the other explanatory variables in the model.
Results of the primary Poisson model with the reduced set of variables suggest that damage counts are not changing over time (Table B2). The results of the negative binomial model, however, suggest that damages in 2021 differ from those in 2019, although this relationship is only weakly significant[1] (10% level of significance).[2] Additional testing suggests that damages in 2021 do not differ significantly from those in 2020. For the Poisson model, these general findings are the same regardless of model specification (e.g., the primary model with the reduced set of variables, the models adding collinear variables back in one by one, or the models with the full set of variables). This is not the case for the negative binomial model: in the specification with the full set of variables, the coefficients on the year variables do not differ significantly from zero, although the models that added the collinear variables back in one by one largely confirm the results of the primary model. Finally, these general findings are the same for models that cluster the standard errors and those that do not.
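The significance statements above can be checked by hand from the year_2021 coefficients and clustered standard errors reported in Table B2. The sketch below assumes a two-sided Wald test based on the standard normal distribution, which is the usual default for these models but is not stated explicitly in the text; the function name is illustrative.

```python
import math


def wald_summary(coef, se):
    """Return the z statistic, two-sided normal p-value, and exp(coef)."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p, math.exp(coef)


# year_2021 coefficient and clustered standard error from Table B2.
poisson_2021 = wald_summary(0.016635, 0.2398704)
negbin_2021 = wald_summary(0.3181472, 0.1841542)

for label, (z, p, ratio) in [("Poisson", poisson_2021),
                             ("Negative binomial", negbin_2021)]:
    print(f"{label}: z = {z:.2f}, p = {p:.3f}, exp(coef) = {ratio:.3f}")
```

Because both models use a log link, exp(coef) is the ratio of expected damage counts in 2021 to those in 2019 with the other variables held fixed. The negative binomial p-value falls between 0.05 and 0.10, which is why the coefficient carries a single asterisk.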
Assuming that the assembled data are representative of trends in all sectors and parts of the United States, and given the time period considered, the models indicate that damages are remaining level at best, and there is some weak evidence that counts in 2021 are significantly higher than those in 2019 after accounting for key driving factors. Again, “significant” is used in its statistical sense, which differs from casual conversation, where it may mean large or very important.
Table B1: Variables Initially Used in the Regression Analysis
| Variable Name | Description | Variable is in Primary Model? | Notes |
| --- | --- | --- | --- |
| year_2020 | Indicator (dummy) variable accounting for the year 2020 | Yes | Variable of interest. If the variable’s coefficient is significant then counts in 2020 differ significantly from 2019. |
| year_2021 | Indicator (dummy) variable accounting for the year 2021 | Yes | Variable of interest. If the variable’s coefficient is significant then counts in 2021 differ significantly from 2019. |
| pop | Annual estimate of state population | No (dropped due to high VIF) | |
| popchangeP | Percent change in state population from previous year | Yes | |
| AreaKm2 | State area in kilometers squared | Yes | |
| density | Population density | Yes | |
| tavg_Value | Average monthly temperature in a state in Fahrenheit | Yes | |
| pcp_Value | Monthly precipitation in a state in inches | Yes | |
| Real | Monthly estimate of real GDP per state (all sectors) | Yes | |
| Construction_Real | Monthly estimate of real GDP per state (construction sector only) | No (dropped due to high VIF) | |
| Permits | Monthly estimate of building permits issued per state | Yes | |
| emp_remodel_NA | Monthly estimate of employment in renovation and remodeling sector at the national level | Yes | Not seasonally adjusted |
| csU_total | Monthly estimate of total construction spending at the national level | No (dropped due to high VIF) | Not seasonally adjusted |
| TotalStarts_NSA | Monthly estimate of total housing starts at the regional level | Yes | Regions include Northeast, Midwest, South, and West |
| Unemp_NSA | Monthly estimate of the unemployment rate per state | Yes | Not seasonally adjusted |
| ConstGeneral_NSA | Monthly employment in the construction sector per state | No (dropped due to high VIF) | Not seasonally adjusted |
| TotalEmp_NSA | Monthly employment for all sectors per state | No (dropped due to high VIF) | Not seasonally adjusted |
| PHMSA_Damages | Annual PHMSA damage counts per state | Yes | |
| PHMSA_Tickets | Annual PHMSA ticket counts per state | No (dropped due to high VIF) | |
| OneCall_Trans | Annual OneCall center transmissions per state | No (dropped due to high VIF) | |
| month_jfm | Indicator (dummy) variable for the months January, February, or March (roughly winter) | Yes | |
| month_amj | Indicator (dummy) variable for the months April, May, or June (roughly spring) | Yes | |
| month_jas | Indicator (dummy) variable for the months July, August, or September (roughly summer) | Yes | |
| month_ond | Indicator (dummy) variable for the months October, November, or December (roughly fall) | Yes (but see note) | Does not appear in model output as this is the reference season. |
| month_amjjas | Indicator (dummy) variable for the months April through September (roughly spring and summer) | No (used individual season indicators instead) | |
| low_counts | Indicator (dummy) variable accounting for states that are not well-covered by the data | Yes | They have lower counts than one might expect |
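The year and season indicators in Table B1 can be built as in the sketch below. It is a minimal illustration that assumes 2019 and October through December are the reference categories, as the table notes imply; the unit labels are placeholders rather than actual state codes, and the function name is illustrative.

```python
def time_indicators(year, month):
    """Year and season dummies for one state-month observation.

    Reference categories (all indicators zero): 2019 and October-December.
    """
    if year not in (2019, 2020, 2021) or not 1 <= month <= 12:
        raise ValueError("expected a year in 2019-2021 and a month in 1-12")
    return {
        "year_2020": int(year == 2020),
        "year_2021": int(year == 2021),
        "month_jfm": int(month in (1, 2, 3)),
        "month_amj": int(month in (4, 5, 6)),
        "month_jas": int(month in (7, 8, 9)),
    }


# 51 geographic units (50 states plus DC) observed over 36 months.
units = [f"unit_{i:02d}" for i in range(51)]
panel = [
    {"unit": u, "year": y, "month": m, **time_indicators(y, m)}
    for u in units
    for y in (2019, 2020, 2021)
    for m in range(1, 13)
]
print(len(panel))
```

The panel size equals 51 units times 36 months, which matches the number of observations reported for the primary models. Omitting month_ond from the indicator set is what makes it the reference season.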
Table B2: Primary Count Models Relating Damages to Explanatory Variables with Standard Errors Clustered on Geography [a]
| Variable | Poisson Model | Negative Binomial |
| --- | --- | --- |
| year_2020 | 0.0066264 (0.1384777) | 0.1396216 (0.1333183) |
| year_2021 | 0.016635 (0.2398704) | 0.3181472* (0.1841542) |
| AreaKm2 | -0.00000706 (0.00000191) | -0.00000068 (0.00000177) |
| tavg_Value | 0.0183398** (0.0085293) | 0.0278379*** (0.0106641) |
| pcp_Value | -0.0265536 (0.0193185) | -0.0872928* (0.046825) |
| popchangeP | 0.3531989 (0.2483274) | 0.3966561 (0.2584109) |
| Real | 0.000000409 (0.000000433) | 0.0000000189 (0.000000557) |
| Permits | 0.0000768* (0.0000466) | 0.0000893 (0.0000936) |
| emp_remodel_NA | -0.0035606 (0.0044479) | -0.006691 (0.0066367) |
| TotalStarts_NSA | 0.0134491 (0.0137669) | -0.0063448 (0.0127965) |
| Unemp_NSA | 0.0046945 (0.032807) | -0.0454165 (0.0487075) |
| PHMSA_Damages | 0.000332*** (0.0001097) | 0.0004386* (0.0002264) |
| density | -0.0041672** (0.0019499) | -0.0003985*** (0.0001518) |
| month_jfm | 0.0057922 (0.0606405) | -0.1060727 (0.0851081) |
| month_amj | -0.0875143 (0.1452539) | -0.1250074 (0.2276354) |
| month_jas | -0.1719384 (0.1698518) | -0.2810573 (0.2819143) |
| low_count | -2.590025*** (0.6206466) | -2.414872*** (0.5517371) |
| Constant | 5.96513*** (1.50125) | 5.948148** (2.419352) |
| lnalpha (dispersion parameter) | N/A | 0.201455 (0.2182386) |
| alpha | N/A | 1.223181 (0.2669453) |
| Log-likelihood (pseudo) | -137,629.62 | -10,029.549 |
| R² (pseudo) | 0.6940 | N/A |
| Observations | 1836 | 1836 |
[a] Cells contain model coefficients with the associated standard errors in parentheses.
***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% levels of significance, respectively.
[1] In statistics, “significant” means you can feel confident the effect is real rather than random, i.e., that you didn’t just get lucky (or unlucky) in choosing the sample.
[2] The results of the negative binomial model are preferred given that the dispersion parameter (alpha) differs from zero.
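To illustrate why a nonzero alpha matters, the sketch below compares the variance implied by a Poisson model with that implied by a negative binomial model at the alpha reported in Table B2. It assumes the common NB2 parameterization, in which the variance is mu + alpha * mu^2; the text does not state which parameterization was used, and the means chosen are illustrative rather than taken from the data.

```python
import math

ALPHA = 1.223181     # alpha reported in Table B2
LN_ALPHA = 0.201455  # lnalpha reported in Table B2


def nb2_variance(mu, alpha):
    """Variance of an NB2 count with mean mu; alpha = 0 gives the Poisson case."""
    return mu + alpha * mu ** 2


# The two reported dispersion values agree: exp(lnalpha) is approximately alpha.
print(round(math.exp(LN_ALPHA), 4))

for mu in (1, 10, 100):  # illustrative means, not values from the dataset
    print(mu, nb2_variance(mu, 0.0), round(nb2_variance(mu, ALPHA), 2))
```

At this alpha the implied variance grows much faster than the mean, which is the overdispersion that the Poisson model cannot accommodate.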