Appendix B

Green Analytics Trending Regression Analysis

Objective

The objective of this analysis is to estimate whether damage counts are changing significantly over time in the United States after accounting for several potential driving factors (e.g., economic growth, dig activity, etc.).

Method

A regression analysis was performed to relate damage counts by month and state to a set of explanatory variables including factors related to the economy, demographics, dig activity and others as noted below. The analysis focused on trends over time using a set of year variables to account for changes over the three years included in the analysis (2019 to 2021).

[Equation 1]

Given that the damage data are structured as count data, Poisson and negative binomial count models were used for the analysis. Both models are designed to deal with the unique characteristics of count data, but the negative binomial model relaxes a key assumption of the Poisson model, namely that the variance equals the mean, via an overdispersion parameter. Models were run with and without clustering the standard errors on the geographic unit (state), which accounts for the panel nature of the data. The coefficients of interest for the trend analysis are those corresponding to the year variables in Equation 1. By using Equation 1, the year coefficients can be assessed while holding all other measures (e.g., economic and dig activity) equal. This makes it possible to determine whether damages are flat or trending up (or down) for reasons unrelated to economic or dig activity: a statistically significant year coefficient indicates such a trend.
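Equation 1 is not legible in this copy. As a hedged sketch only, a count regression matching the description above is commonly written with a log link as shown below. The symbols (y_st for the damage count in state s and month t, mu_st for its conditional mean, x_st for the remaining explanatory variables, and the beta, gamma, and alpha parameters) are assumptions introduced here for illustration, and the quadratic negative binomial variance is the common default form rather than something the source confirms.

```latex
\begin{align}
\log \mu_{st} &= \beta_0 + \beta_1\,\mathrm{year\_2020}_{t}
                + \beta_2\,\mathrm{year\_2021}_{t}
                + \mathbf{x}_{st}^{\top}\boldsymbol{\gamma}, \\
\text{Poisson:}\quad \operatorname{Var}(y_{st} \mid \mathbf{x}_{st}) &= \mu_{st}, \\
\text{Negative binomial:}\quad \operatorname{Var}(y_{st} \mid \mathbf{x}_{st}) &= \mu_{st} + \alpha\,\mu_{st}^{2}.
\end{align}
```

Under this form, 2019 is the reference year, so beta_1 and beta_2 measure differences from 2019, and the negative binomial model collapses to the Poisson model when alpha equals zero.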

Before running the regression models, standard data assessments were completed to ensure the regression results are not impacted by known data issues. For instance, multicollinearity between variables was assessed using variance inflation factors (VIF). Since multicollinearity can influence how regression coefficients for certain variables are interpreted, highly collinear variables were iteratively removed if their VIF was above 5. This resulted in the primary models having a reduced set of variables with limited collinearity, and regression analyses were conducted on this reduced set as well as the total set of variables. Variables that were dropped due to multicollinearity were added back into the model one-by-one after the primary models were estimated to assess whether results differed (e.g., if variable ‘A’ was dropped initially then it was added back into the primary model by itself, if variable ‘B’ was dropped initially then it was added back into the primary model by itself, and so on).
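The iterative VIF screen described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the analysts' actual procedure: the source does not say which variable was removed first at each step, so dropping the variable with the largest VIF is an assumption, the function names are hypothetical, and the toy values are made up (only the variable names are borrowed from Table B1).

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a dict of name -> list of floats."""
    names = list(columns)
    n = len(columns[names[0]])
    centered = {}
    for name in names:
        mean = sum(columns[name]) / n
        centered[name] = [v - mean for v in columns[name]]
    norms = {k: math.sqrt(sum(v * v for v in vals)) for k, vals in centered.items()}
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in names] for i in names]

def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return {name: inv[i][i] for i, name in enumerate(columns)}

def drop_high_vif(columns, threshold=5.0):
    """Repeatedly drop the variable with the largest VIF until all are <= threshold."""
    kept, dropped = dict(columns), []
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        dropped.append(worst)
        del kept[worst]
    return list(kept), dropped

# Toy values for illustration only; they are NOT taken from the report.
toy = {
    "pop":          [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "TotalEmp_NSA": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],  # nearly collinear with pop
    "tavg_Value":   [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
kept, dropped = drop_high_vif(toy)
print("kept:", kept, "dropped:", dropped)
```

Recomputing the VIFs after each removal matters because dropping one member of a collinear pair usually lowers the VIF of the variables that remain.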

Data

A subset of the U.S. damage data was assembled for the 2019 to 2021 period. To reduce the influence of year-to-year variations in company reporting behavior, as opposed to actual changes in damages, the damage count dataset was assembled from companies that reported consistently during the past three years. The team also reviewed the makeup of companies to ensure the comparable dataset included representatives from facility owner/operators and 811 center stakeholders as well as locator, electric, telecom/CATV, excavator, and water stakeholders.

Damages in the final set of data were distributed across the 50 states and the District of Columbia and across the 36 months of the 2019 to 2021 period. Damage counts reported for certain states (Alaska, Hawaii, Maine, and Vermont) are zero or very low, so these states are not well represented in the analysis. Other states appear poorly represented relative to past years and were flagged in consultation with CGA staff (Arizona, Connecticut, Idaho, Kansas, Montana, North Dakota, New Hampshire, New Mexico, Nevada, Wisconsin, and West Virginia). Data on other variables, including weather, demographics, economics (e.g., GDP or employment), construction, and dig activity (e.g., transmissions), as well as PHMSA data on damages and tickets, were also collected (Table B1).
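The panel layout described above (51 geographic units by 36 months) can be sketched as follows, together with the year and season indicator variables listed in Table B1. The unit labels are hypothetical placeholders and the helper names are assumptions; only the indicator names come from the source.

```python
from itertools import product

# 50 states plus the District of Columbia; placeholder labels, not real state codes.
UNITS = [f"unit_{i:02d}" for i in range(1, 52)]
YEARS = (2019, 2020, 2021)
MONTHS = range(1, 13)

def year_dummies(year):
    """Year indicators; 2019 is the reference year and gets no indicator."""
    return {"year_2020": int(year == 2020), "year_2021": int(year == 2021)}

def season_dummies(month):
    """Season indicators; October to December (month_ond) is the reference season."""
    return {
        "month_jfm": int(month in (1, 2, 3)),
        "month_amj": int(month in (4, 5, 6)),
        "month_jas": int(month in (7, 8, 9)),
    }

# One row per unit, year, and month.
panel = [
    {"unit": u, "year": y, "month": m, **year_dummies(y), **season_dummies(m)}
    for u, y, m in product(UNITS, YEARS, MONTHS)
]
print(len(panel))  # 51 units * 36 months = 1836
```

The row count of 1836 agrees with the number of observations reported in Table B2.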

Results and Conclusion

The initial multicollinearity check revealed that many of the variables in Table B1 were highly collinear, and these variables were removed from the regression analysis. However, the variables of interest, year_2020 and year_2021, were not substantially correlated with the other explanatory variables in the model.

Results of the primary Poisson model with the reduced set of variables suggest that damage counts are not changing over time (Table B2). The negative binomial model, in contrast, suggests that damages in 2021 differ from those in 2019, although this relationship is only weakly significant[1] (10% level of significance).[2] Additional testing suggests that damages in 2021 do not differ significantly from those in 2020. For the Poisson model, these general findings hold regardless of model specification (e.g., the primary model with the reduced set of variables, the models adding collinear variables back in one at a time, or the models with the full set of variables). This is not the case for the negative binomial model: in the specification with the full set of variables, the coefficients on the year variables do not differ significantly from zero, although the negative binomial models that added the collinear variables back in one at a time largely confirm the results of the primary model. Finally, these general findings are the same for models that cluster the standard errors and those that do not.
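The significance levels quoted for the year variables can be checked from the coefficients and clustered standard errors in Table B2. The sketch below assumes two-sided Wald tests based on a standard normal approximation, which the source does not spell out; the function names are hypothetical.

```python
import math

def wald_test(coef, se):
    """Return (z, two-sided p-value) under a standard normal approximation."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

def stars(p):
    """Significance markers using the 1%, 5%, and 10% levels from Table B2."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

# (coefficient, standard error) pairs copied from Table B2.
estimates = {
    ("year_2020", "poisson"): (0.0066264, 0.1384777),
    ("year_2021", "poisson"): (0.016635, 0.2398704),
    ("year_2020", "negbin"): (0.1396216, 0.1333183),
    ("year_2021", "negbin"): (0.3181472, 0.1841542),
}

results = {key: wald_test(coef, se) for key, (coef, se) in estimates.items()}
for (name, model), (z, p) in results.items():
    print(f"{name:10s} {model:8s} z={z:6.3f} p={p:.3f} {stars(p)}")
```

Under these assumptions, only the negative binomial coefficient on year_2021 falls below the 10% threshold, which is consistent with the single asterisk shown for it in Table B2.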

Assuming that the assembled data are representative of trends in all sectors and parts of the United States, and given the time period considered, the models indicate that damages are remaining level at best, and there is some weak evidence that counts in 2021 are significantly higher than those in 2019 after accounting for key driving factors. Again, “significant” is used in its statistical sense, which differs from casual conversation where it may mean large or very important.

Table B1: Variables Initially Used in the Regression Analysis

Variable Name	Description	Variable is in Primary Model?	Notes
year_2020	Indicator (dummy) variable accounting for the year 2020	Yes	Variable of interest. If the variable’s coefficient is significant then counts in 2020 differ significantly from 2019.
year_2021	Indicator (dummy) variable accounting for the year 2021	Yes	Variable of interest. If the variable’s coefficient is significant then counts in 2021 differ significantly from 2019.
pop	Annual estimate of state population	No (dropped due to high VIF)
popchangeP	Percent change in state population from previous year	Yes
AreaKm2	State area in kilometers squared	Yes
density	Population density	Yes
tavg_Value	Average monthly temperature in a state in Fahrenheit	Yes
pcp_Value	Monthly precipitation in a state in inches	Yes
Real	Monthly estimate of real GDP per state (all sectors)	Yes
Construction_Real	Monthly estimate of real GDP per state (construction sector only)	No (dropped due to high VIF)
Permits	Monthly estimate of building permits issued per state	Yes
emp_remodel_NA	Monthly estimate of employment in renovation and remodeling sector at the national level	Yes	Not seasonally adjusted
csU_total	Monthly estimate of total construction spending at the national level	No (dropped due to high VIF)	Not seasonally adjusted
TotalStarts_NSA	Monthly estimate of total housing starts at the regional level	Yes	Regions include Northeast, Midwest, South, and West
Unemp_NSA	Monthly estimate of the unemployment rate per state	Yes	Not seasonally adjusted
ConstGeneral_NSA	Monthly employment in the construction sector per state	No (dropped due to high VIF)	Not seasonally adjusted
TotalEmp_NSA	Monthly employment for all sectors per state	No (dropped due to high VIF)	Not seasonally adjusted
PHMSA_Damages	Annual PHMSA damage counts per state	Yes
PHMSA_Tickets	Annual PHMSA ticket counts per state	No (dropped due to high VIF)
OneCall_Trans	Annual OneCall center transmissions per state	No (dropped due to high VIF)
month_jfm	Indicator (dummy) variable for the months January, February, or March (roughly winter)	Yes
month_amj	Indicator (dummy) variable for the months April, May, or June (roughly spring)	Yes
month_jas	Indicator (dummy) variable for the months July, August, or September (roughly summer)	Yes
month_ond	Indicator (dummy) variable for the months October, November, or December (roughly fall)	Yes (but see note)	Does not appear in model output as this is the reference season.
month_amjjas	Indicator (dummy) variable for the months April through September (roughly spring and summer)	No (used individual season indicators instead)
low_counts	Indicator (dummy) variable accounting for states that are not well-covered by the data	Yes	They have lower counts than one might expect

Table B2: Primary Count Models Relating Damages to Explanatory Variables with Standard Errors Clustered on Geography^a

Variable	Poisson Model	Negative Binomial
year_2020	0.0066264 (0.1384777)	0.1396216 (0.1333183)
year_2021	0.016635 (0.2398704)	0.3181472* (0.1841542)
AreaKm2	-0.00000706 (0.00000191)	-0.00000068 (0.00000177)
tavg_Value	0.0183398** (0.0085293)	0.0278379*** (0.0106641)
pcp_Value	-0.0265536 (0.0193185)	-0.0872928* (0.046825)
popchangeP	0.3531989 (0.2483274)	0.3966561 (0.2584109)
Real	0.000000409 (0.000000433)	0.0000000189 (0.000000557)
Permits	0.0000768* (0.0000466)	0.0000893 (0.0000936)
emp_remodel_NA	-0.0035606 (0.0044479)	-0.006691 (0.0066367)
TotalStarts_NSA	0.0134491 (0.0137669)	-0.0063448 (0.0127965)
Unemp_NSA	0.0046945 (0.032807)	-0.0454165 (0.0487075)
PHMSA_Damages	0.000332*** (0.0001097)	0.0004386* (0.0002264)
density	-0.0041672** (0.0019499)	-0.0003985*** (0.0001518)
month_jfm	0.0057922 (0.0606405)	-0.1060727 (0.0851081)
month_amj	-0.0875143 (0.1452539)	-0.1250074 (0.2276354)
month_jas	-0.1719384 (0.1698518)	-0.2810573 (0.2819143)
low_count	-2.590025*** (0.6206466)	-2.414872*** (0.5517371)
Constant	5.96513*** (1.50125)	5.948148** (2.419352)
lnalpha (dispersion parameter)	N/A	0.201455 (0.2182386)
alpha	N/A	1.223181 (0.2669453)
Log-likelihood (pseudo)	-137,629.62	-10,029.549
R² (pseudo)	0.6940	N/A
Observations	1836	1836

^a Cells contain model coefficients and associated standard errors in round brackets

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance

[1] In statistics, “significant” means you can feel confident the effect is real rather than random, i.e., that you didn’t just get lucky (or unlucky) in choosing the sample.

[2] The results of the negative binomial model are preferred given that the dispersion parameter (alpha) differs from zero.
