DIRT Report

Appendix B

Green Analytics Trending Regression Analysis

Objective

The objective of this analysis is to estimate whether damage counts in the United States are changing significantly over time after accounting for several potential driving factors (e.g., economic growth and dig activity).

Method

A regression analysis was performed to relate damage counts by month and state to a set of explanatory variables including factors related to the economy, demographics, dig activity and others as noted below. The analysis focused on trends over time using a set of year variables to account for changes over the three years included in the analysis (2019 to 2021).

[Equation 1]

Because the damage data are counts, Poisson and negative binomial count models were used for the analysis. Both models are designed for the particular characteristics of count data, but the negative binomial model relaxes a key assumption of the Poisson model (that the variance equals the mean) through an overdispersion parameter. Models were run with and without clustering the standard errors on the geographic unit (state), which accounts for the panel nature of the data. The coefficients of interest for the trend analysis are those on the year variables in Equation 1. Using Equation 1, the year coefficients can be assessed while holding all other measures (i.e., economic and dig activity) equal. If the year coefficients are found to be statistically significant, this indicates that damages are trending up (or down) for reasons not related to economic or dig activity; if they are not, the analysis provides no evidence of such a trend.
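Equation 1 itself is not reproduced in this text. As a rough sketch only, count models of the kind described above are commonly written with a log link, as below. The symbols ($y_{st}$ for the damage count in state $s$ and month $t$, $\mu_{st}$, $\mathbf{x}_{st}$, $\beta$, $\delta$, $\alpha$) and the mean-dispersion form of the negative binomial variance are assumptions made for illustration, not the report's own notation.

```latex
\begin{aligned}
\ln \mu_{st} &= \beta_0
  + \delta_{2020}\,\mathrm{year\_2020}_{t}
  + \delta_{2021}\,\mathrm{year\_2021}_{t}
  + \mathbf{x}_{st}'\boldsymbol{\beta},
  \qquad \mu_{st} = \mathrm{E}\left[y_{st} \mid \mathbf{x}_{st}\right] \\
\text{Poisson:}\quad \mathrm{Var}(y_{st}) &= \mu_{st} \\
\text{Negative binomial:}\quad \mathrm{Var}(y_{st}) &= \mu_{st} + \alpha\,\mu_{st}^{2}
\end{aligned}
```

Under this form, 2019 is the omitted reference year, so each year coefficient measures the difference in log expected counts relative to 2019, and the negative binomial model collapses to the Poisson model when $\alpha = 0$.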

Before running the regression models, standard data assessments were completed to ensure the regression results are not impacted by known data issues. For instance, multicollinearity between variables was assessed using variance inflation factors (VIF). Since multicollinearity can influence how regression coefficients for certain variables are interpreted, highly collinear variables were iteratively removed if their VIF was above 5. This resulted in the primary models having a reduced set of variables with limited collinearity, and regression analyses were conducted on this reduced set as well as the total set of variables. Variables that were dropped due to multicollinearity were added back into the model one-by-one after the primary models were estimated to assess whether results differed (e.g., if variable ‘A’ was dropped initially then it was added back into the primary model by itself, if variable ‘B’ was dropped initially then it was added back into the primary model by itself, and so on).
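The iterative VIF screen described above can be sketched as follows. This is a minimal illustration on synthetic data, not the report's code: the function names (`vif`, `drop_high_vif`), the synthetic columns, and the rule of dropping the single highest-VIF variable at each step are assumptions; only the threshold of 5 comes from the text.

```python
import numpy as np

def vif(X):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    names = list(names)
    dropped = []
    while X.shape[1] > 1:
        values = vif(X)
        worst = int(np.argmax(values))
        if values[worst] <= threshold:
            break
        dropped.append(names.pop(worst))
        X = np.delete(X, worst, axis=1)
    return names, dropped

# Synthetic data (hypothetical): total_emp is nearly a copy of pop,
# while temp is unrelated to both.
rng = np.random.default_rng(0)
n = 500
pop = rng.normal(size=n)
total_emp = pop + 0.05 * rng.normal(size=n)
temp = rng.normal(size=n)
X = np.column_stack([pop, total_emp, temp])

kept, dropped = drop_high_vif(X, ["pop", "total_emp", "temp"])
print("kept:", kept)
print("dropped:", dropped)
```

Only one of the two near-duplicate columns needs to be removed; once it is gone, the VIF of its partner falls back toward 1, which is why the screen is applied iteratively rather than in a single pass.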

Data

A subset of the U.S. damage data was assembled for the 2019 to 2021 period. To reduce the influence of year-to-year variations in company reporting behavior, as opposed to actual changes in damages, the damage count dataset was assembled from companies that reported consistently during those three years. The team also reviewed the makeup of companies to ensure the comparable dataset included representatives from facility owner/operators and 811 center stakeholders as well as locator, electric, telecom/CATV, excavator, and water stakeholders.

Damages in the final set of data were distributed across the 50 states and the District of Columbia as well as the 36 months of the 2019 to 2021 period. Damage counts reported for certain states (Alaska, Hawaii, Maine, and Vermont) are zero or very low, so these states are not well represented in the analysis. In addition, other states appear poorly represented relative to past years, and these states were flagged in consultation with CGA staff (Arizona, Connecticut, Idaho, Kansas, Montana, North Dakota, New Hampshire, New Mexico, Nevada, Wisconsin, and West Virginia). Data on other variables, including weather, demographics, economics (e.g., GDP or employment), construction, dig activity (e.g., transmissions), as well as PHMSA data on damages and tickets were also collected (Table B1).
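As a quick consistency check, a balanced state-by-month panel of this shape has one row per geography per month. The short sketch below assumes every geography contributes all 36 months, including months with zero reported damages; the variable names are illustrative.

```python
# 50 states plus the District of Columbia, observed monthly for 2019-2021.
n_geographies = 50 + 1
n_months = 3 * 12

n_rows = n_geographies * n_months
print(n_rows)  # 1836
```

The result agrees with the 1836 observations reported for both models in Table B2.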

Results and Conclusion

The initial multicollinearity check revealed that many of the variables in Table B1 were highly collinear, and these variables were removed from the regression analysis. However, the variables of interest, year_2020 and year_2021, were not substantially correlated with the other explanatory variables in the model.

Results of the primary Poisson model with the reduced set of variables suggest that damage counts are not changing over time (Table B2). The results of the negative binomial model, however, suggest that damages in 2021 differ from those in 2019, although this relationship is only weakly significant[1] (10% level of significance).[2] Additional testing suggests that damages in 2021 do not differ significantly from those in 2020. For the Poisson model, these general findings are the same regardless of model specification (e.g., the primary model with the reduced set of variables, the models adding collinear variables back in one by one, or the models with the full set of variables). This is not the case for the negative binomial model: in the specification with the full set of variables, the coefficients on the year variables do not differ significantly from zero, although the negative binomial models that added the collinear variables back in one by one largely confirm the results of the primary model. Finally, these general findings are the same for models that cluster the standard errors and those that do not.
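The significance levels discussed above can be checked against the coefficients and standard errors reported in Table B2. The sketch below assumes the stars in that table come from two-sided Wald z-tests against a standard normal distribution, which the text does not state; the function name `wald_z_test` is hypothetical.

```python
import math

def wald_z_test(coef, se):
    """Return (z, two-sided p-value) using a standard normal reference."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# (coefficient, clustered standard error) for year_2021, from Table B2
year_2021 = {
    "poisson": (0.016635, 0.2398704),
    "negbin": (0.3181472, 0.1841542),
}

results = {name: wald_z_test(b, se) for name, (b, se) in year_2021.items()}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.3f}, p = {p:.3f}")
```

Under that assumption, the negative binomial p-value for year_2021 falls between 0.05 and 0.10, consistent with the single star in Table B2, while the Poisson p-value is far above 0.10. The comparison of 2021 with 2020 cannot be reproduced this way, because it also requires the covariance between the two year coefficients, which the table does not report.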

Assuming that the assembled data are representative of trends in all sectors and parts of the United States, and given the time period considered, the models indicate that damages are remaining level at best, and there is some weak evidence that counts in 2021 are significantly higher than those in 2019 after accounting for key driving factors. Again, “significant” is used in its statistical sense, which differs from casual conversation, where it may mean large or very important.
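To give a sense of the size and uncertainty of the 2021 effect, the negative binomial coefficient in Table B2 can be exponentiated into a rate ratio relative to 2019. This is a sketch that assumes a log-link model and normal-approximation (Wald) intervals, neither of which is stated explicitly in the text; the function name `rate_ratio` is hypothetical.

```python
import math

def rate_ratio(coef, se, z_crit):
    """Exponentiate a log-link coefficient and its Wald interval."""
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return math.exp(coef), lower, upper

# year_2021, negative binomial model, Table B2
coef, se = 0.3181472, 0.1841542

irr, lo90, hi90 = rate_ratio(coef, se, 1.645)  # 90% interval
_, lo95, hi95 = rate_ratio(coef, se, 1.960)    # 95% interval

print(f"rate ratio vs 2019: {irr:.2f}")
print(f"90% interval: {lo90:.2f} to {hi90:.2f}")
print(f"95% interval: {lo95:.2f} to {hi95:.2f}")
```

Under these assumptions the point estimate is a rate ratio of roughly 1.37, but the 90% interval only narrowly excludes 1 and the 95% interval includes it, which is what "weak evidence" amounts to here.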

Table B1: Variables Initially Used in the Regression Analysis

| Variable Name | Description | Variable is in Primary Model? | Notes |
| --- | --- | --- | --- |
| year_2020 | Indicator (dummy) variable accounting for the year 2020 | Yes | Variable of interest. If the variable’s coefficient is significant then counts in 2020 differ significantly from 2019. |
| year_2021 | Indicator (dummy) variable accounting for the year 2021 | Yes | Variable of interest. If the variable’s coefficient is significant then counts in 2021 differ significantly from 2019. |
| pop | Annual estimate of state population | No (dropped due to high VIF) | |
| popchangeP | Percent change in state population from previous year | Yes | |
| AreaKm2 | State area in kilometers squared | Yes | |
| density | Population density | Yes | |
| tavg_Value | Average monthly temperature in a state in Fahrenheit | Yes | |
| pcp_Value | Monthly precipitation in a state in inches | Yes | |
| Real | Monthly estimate of real GDP per state (all sectors) | Yes | |
| Construction_Real | Monthly estimate of real GDP per state (construction sector only) | No (dropped due to high VIF) | |
| Permits | Monthly estimate of building permits issued per state | Yes | |
| emp_remodel_NA | Monthly estimate of employment in renovation and remodeling sector at the national level | Yes | Not seasonally adjusted |
| csU_total | Monthly estimate of total construction spending at the national level | No (dropped due to high VIF) | Not seasonally adjusted |
| TotalStarts_NSA | Monthly estimate of total housing starts at the regional level | Yes | Regions include Northeast, Midwest, South, and West |
| Unemp_NSA | Monthly estimate of the unemployment rate per state | Yes | Not seasonally adjusted |
| ConstGeneral_NSA | Monthly employment in the construction sector per state | No (dropped due to high VIF) | Not seasonally adjusted |
| TotalEmp_NSA | Monthly employment for all sectors per state | No (dropped due to high VIF) | Not seasonally adjusted |
| PHMSA_Damages | Annual PHMSA damage counts per state | Yes | |
| PHMSA_Tickets | Annual PHMSA ticket counts per state | No (dropped due to high VIF) | |
| OneCall_Trans | Annual OneCall center transmissions per state | No (dropped due to high VIF) | |
| month_jfm | Indicator (dummy) variable for the months January, February, or March (roughly winter) | Yes | |
| month_amj | Indicator (dummy) variable for the months April, May, or June (roughly spring) | Yes | |
| month_jas | Indicator (dummy) variable for the months July, August, or September (roughly summer) | Yes | |
| month_ond | Indicator (dummy) variable for the months October, November, or December (roughly fall) | Yes (but see note) | Does not appear in model output as this is the reference season. |
| month_amjjas | Indicator (dummy) variable for the months April through September (roughly spring and summer) | No (used individual season indicators instead) | |
| low_counts | Indicator (dummy) variable accounting for states that are not well-covered by the data | Yes | They have lower counts than one might expect |

Table B2: Primary Count Models Relating Damages to Explanatory Variables with Standard Errors Clustered on Geography [a]

| Variable | Poisson Model | Negative Binomial |
| --- | --- | --- |
| year_2020 | 0.0066264 (0.1384777) | 0.1396216 (0.1333183) |
| year_2021 | 0.016635 (0.2398704) | 0.3181472* (0.1841542) |
| AreaKm2 | -0.00000706 (0.00000191) | -0.00000068 (0.00000177) |
| tavg_Value | 0.0183398** (0.0085293) | 0.0278379*** (0.0106641) |
| pcp_Value | -0.0265536 (0.0193185) | -0.0872928* (0.046825) |
| popchangeP | 0.3531989 (0.2483274) | 0.3966561 (0.2584109) |
| Real | 0.000000409 (0.000000433) | 0.0000000189 (0.000000557) |
| Permits | 0.0000768* (0.0000466) | 0.0000893 (0.0000936) |
| emp_remodel_NA | -0.0035606 (0.0044479) | -0.006691 (0.0066367) |
| TotalStarts_NSA | 0.0134491 (0.0137669) | -0.0063448 (0.0127965) |
| Unemp_NSA | 0.0046945 (0.032807) | -0.0454165 (0.0487075) |
| PHMSA_Damages | 0.000332*** (0.0001097) | 0.0004386* (0.0002264) |
| density | -0.0041672** (0.0019499) | -0.0003985*** (0.0001518) |
| month_jfm | 0.0057922 (0.0606405) | -0.1060727 (0.0851081) |
| month_amj | -0.0875143 (0.1452539) | -0.1250074 (0.2276354) |
| month_jas | -0.1719384 (0.1698518) | -0.2810573 (0.2819143) |
| low_count | -2.590025*** (0.6206466) | -2.414872*** (0.5517371) |
| Constant | 5.96513*** (1.50125) | 5.948148** (2.419352) |
| lnalpha (dispersion parameter) | N/A | 0.201455 (0.2182386) |
| alpha | N/A | 1.223181 (0.2669453) |
| Log-likelihood (pseudo) | -137,629.62 | -10,029.549 |
| R2 (pseudo) | 0.6940 | N/A |
| Observations | 1836 | 1836 |

[a] Cells contain model coefficients and associated standard errors in round brackets

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance

 

[1] In statistics, “significant” means you can feel confident the effect is real rather than random, i.e., that you didn’t just get lucky (or unlucky) in choosing the sample.

[2] The results of the negative binomial model are preferred given that the dispersion parameter (alpha) differs from zero.
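The two dispersion rows in Table B2 are related by alpha = exp(lnalpha), which can be verified directly. The sketch below also illustrates what a nonzero alpha implies, assuming the mean-dispersion form of the negative binomial variance (mu + alpha * mu^2); that form, the function name `nb_variance`, and the example mean of 10 are assumptions for illustration.

```python
import math

ln_alpha = 0.201455         # lnalpha, Table B2
alpha_reported = 1.223181   # alpha, Table B2

alpha = math.exp(ln_alpha)

def nb_variance(mu, a):
    """Variance under the mean-dispersion form: mu + a * mu**2."""
    return mu + a * mu ** 2

mu = 10.0  # hypothetical expected monthly count, for illustration only
print(f"alpha from lnalpha: {alpha:.6f} (reported {alpha_reported})")
print(f"Poisson variance at mu={mu}: {nb_variance(mu, 0.0):.1f}")
print(f"Negative binomial variance at mu={mu}: {nb_variance(mu, alpha):.1f}")
```

With alpha set to zero the variance equals the mean, as the Poisson model requires; with the estimated alpha the implied variance is many times the mean, which is the overdispersion that motivates preferring the negative binomial results.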
