Appendix B: Trending Regression Analysis

Objective

The objective of this analysis is to assess whether damage counts are changing significantly[1] over time in the United States after accounting for several potential driving factors (e.g., economic growth).

Method

As in past years, regression analysis was used to relate damage counts by month and state to a set of explanatory variables, including factors related to the economy, demographics, and dig activity. However, because the current analysis focuses on trends over time, year indicator variables were added to capture differences among the years included in the analysis (2020 to 2022).

Given that the damage data are counts, Poisson and negative binomial count models were used for the analysis. Both models are commonly used for count data, but the negative binomial model relaxes a key assumption of the Poisson model, namely that the variance equals the mean, by adding an overdispersion parameter. Non-panel models were run with robust standard errors clustered on the geographic unit (state), which accounts for the panel nature of the data.
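To illustrate the assumption being relaxed, the following is a minimal sketch of the two variance functions and of a Pearson dispersion statistic. It assumes the common NB2 parameterization (variance = mu + alpha * mu^2), which this appendix does not state explicitly. The function names, counts, and fitted means are hypothetical and are not taken from the DIRT data.

```python
def poisson_variance(mu):
    """Poisson assumes the variance equals the mean."""
    return mu


def negbin_variance(mu, alpha):
    """NB2 variance (an assumed parameterization): mu + alpha * mu**2.

    With alpha = 0 this collapses to the Poisson variance.
    """
    return mu + alpha * mu ** 2


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square under Poisson variance, divided by residual
    degrees of freedom. Values well above 1 suggest overdispersion."""
    chi2 = sum((y - m) ** 2 / m for y, m in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)


# Hypothetical monthly counts and fitted means (illustration only).
observed = [12, 40, 7, 55, 21, 3]
fitted = [20.0, 25.0, 15.0, 30.0, 22.0, 10.0]

dispersion = pearson_dispersion(observed, fitted, n_params=1)
print(f"Pearson dispersion: {dispersion:.2f}")
```

A dispersion statistic far above 1, as in this made-up example, is the kind of signal that favors the negative binomial model over the Poisson model.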

For this year’s analysis, we also took a step back and ran count models at the national level (Equation 2). This overcomes certain limitations of the state-level models, such as noise introduced by variation in damage counts resulting from how the data were assembled (e.g., the subset of consistent companies may not properly represent all states, leading to variation across states that is not representative of actual damage activity). However, other limitations arise at this scale: the small number of observations may increase multicollinearity, and results may still be skewed toward states that are better represented in the assembled dataset. Furthermore, data on certain variables are collected only at annual timescales, meaning that these variables cannot be included in the regression analysis due to multicollinearity (e.g., One Call transmissions or PHMSA data).
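As a rough illustration of the change in scale, the sketch below sums state-level monthly counts into one national count per month. The function name, the record layout, and the counts are hypothetical. The actual assembly procedure is not described in this appendix.

```python
from collections import defaultdict


def aggregate_to_national(state_month_counts):
    """Sum state-level counts into one national count per (year, month)."""
    national = defaultdict(int)
    for (_state, year, month), count in state_month_counts.items():
        national[(year, month)] += count
    return dict(national)


# Hypothetical counts keyed by (state, year, month); illustration only.
state_month_counts = {
    ("TX", 2020, 1): 120,
    ("CA", 2020, 1): 95,
    ("TX", 2020, 2): 130,
    ("CA", 2020, 2): 88,
}

national = aggregate_to_national(state_month_counts)
print(national)
```

Applied to the full 2020 to 2022 period, this kind of aggregation would leave 36 monthly observations, which matches the observation count reported for the national models.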

This relationship was modeled following procedures similar to those used for the state-level analysis. Poisson and negative binomial count models were used, both estimated with robust standard errors.

The coefficients of interest for the trend analyses are those on the variables representing the years in Equations 1 and 2. If these coefficients are significantly different from zero after accounting for the other factors in the equation, then damage counts are changing over time for reasons other than the factors incorporated into the model.

Before running the regression models, multicollinearity between variables was assessed using variance inflation factors (VIF). Since multicollinearity can influence how regression coefficients for certain variables are interpreted, highly collinear variables were iteratively removed according to their VIF, using a VIF threshold of 5. As a result, the primary models use a reduced set of variables with limited collinearity. Regression analyses were conducted on this reduced set as well as on the full set of variables.
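The iterative VIF screening can be sketched as follows. This is one plausible reading of the procedure, not necessarily the exact steps used for the report: at each pass the variable with the highest VIF is dropped until every remaining VIF is at or below the threshold. VIFs are computed here as the diagonal of the inverse correlation matrix. All function names, variable names, and data are hypothetical.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a dict of equally long numeric columns."""
    names = list(columns)
    n = len(columns[names[0]])
    centered = {}
    for name in names:
        mean = sum(columns[name]) / n
        centered[name] = [v - mean for v in columns[name]]

    def corr(a, b):
        num = sum(x * y for x, y in zip(centered[a], centered[b]))
        den = (sum(x * x for x in centered[a])
               * sum(y * y for y in centered[b])) ** 0.5
        return num / den

    return names, [[corr(a, b) for b in names] for a in names]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fails if singular)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    names, corr = correlation_matrix(columns)
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def prune_by_vif(columns, threshold=5.0):
    """Drop the highest-VIF variable until all VIFs are <= threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)


# Hypothetical data: 'emp' nearly duplicates 'pop'; 'precip' does not.
data = {
    "pop": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "emp": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
    "precip": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}

kept = prune_by_vif(data, threshold=5.0)
print(kept)
```

In this made-up example, one of the two nearly identical variables is removed and the weakly related variable is retained.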

[1] In statistics, “significant” means you can feel confident the effect is real rather than random, i.e., that you didn’t just get lucky (or unlucky) in choosing the sample.

Data

In consultation with CGA staff, a subset of the United States damage data was assembled for the 2020 to 2022 period. To help reduce noise in the data stemming from year-to-year variations in company reporting behavior rather than actual changes in damages, the damage count dataset was initially assembled from companies that reported more than 1,000 damages at least once annually from 2020 to 2022. Since this dataset was skewed toward natural gas and One Call center stakeholders, it was enhanced with damage counts from the electric, telecom/CATV, excavator, and water stakeholders.

For the state-level analyses, damages in the final dataset were distributed across the 50 states and the District of Columbia as well as the 36 months of the 2020 to 2022 period. Certain states have zero or very low reported damage counts and are thus not well represented in the analysis (i.e., Alaska, Hawaii, Maine, Rhode Island, Vermont, and West Virginia). These states were omitted from the analyses reported below (although the results are very similar when they are included). However, reported damage counts in other states may be lower than expected due to limitations in how the dataset was assembled (e.g., a state could have lower counts if relatively few consistently reporting companies operate in the state). As such, a separate variable, the ratio of total annual damage counts to total annual 811 Center transmissions for a state, was incorporated in an attempt to account for count data being poorly represented for certain states. Data on other variables, including weather, demographics, economics (e.g., GDP or employment), construction, and dig activity (e.g., transmissions), as well as PHMSA data on damages and tickets, were also collected (Table 1). Similar data were used for the analyses at the national level, although at a different scale (e.g., damages were distributed only across the 36-month period).

Results and Conclusion

The initial multicollinearity check revealed that many of the variables in Table 1 were highly collinear, especially for the analysis at the national scale, and these variables were removed from the regression analysis. However, the variables of interest, year_2021 and year_2022, were not substantially correlated with the other explanatory variables in the model.

For the models at the state level, the results differ from last year’s analysis for the period 2019 to 2020. The results of both the Poisson and negative binomial models indicate that damage counts in 2021 and 2022 do not differ significantly from those in 2020 (Table 2). Further testing of the Poisson model, specifically testing whether the coefficients on the year_2021 and year_2022 variables differ, suggests that damage counts in 2022 also do not differ significantly from those in 2021 (at the 1% level of significance). However, similar testing of the negative binomial model indicates that damage counts in 2022 do differ significantly from counts in 2021, although only at the 10% level of significance. Note that the negative binomial model is preferred given that the dispersion parameter (alpha) is not zero. In addition, follow-up testing of the Poisson model (deviance and Pearson statistics) suggests that this model is not appropriate for the data.
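A test of whether two coefficients differ is commonly carried out as a Wald test, which requires the covariance between the two estimates as well as their variances. The sketch below assumes that approach. The report does not specify which test was used, and the coefficient covariance is not reported in Table 2, so all input values here are hypothetical.

```python
import math


def wald_equal_coefficients(b1, b2, var1, var2, cov12):
    """Two-sided z-test of H0: b1 == b2.

    var1 and var2 are the coefficient variances (squared standard errors)
    and cov12 is their covariance from the fitted model.
    """
    se_diff = math.sqrt(var1 + var2 - 2.0 * cov12)
    z = (b1 - b2) / se_diff
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical estimates, variances, and covariance (illustration only).
z, p_value = wald_equal_coefficients(
    b1=0.30, b2=0.10, var1=0.01, var2=0.01, cov12=0.005
)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Because the covariance term enters the standard error of the difference, the significance of such a test cannot be reproduced from the standard errors in the tables alone.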

Results were similar at the national scale, as none of the coefficients on the year variables were significant (Table 3). Further testing of whether the coefficients on the year_2021 and year_2022 variables differ indicates that they differ significantly at the 5% level of significance. These results hold for both the Poisson and negative binomial models, although the deviance and Pearson statistics suggest that the negative binomial model is preferred.

Assuming that the assembled data are representative of trends in all sectors and parts of the United States, the models provide evidence that damage counts in 2021 and 2022 do not differ significantly from those in 2020 after accounting for key driving factors. However, there is some evidence that damage counts in 2022 may differ significantly from those in 2021 (counts in 2022 may be higher). Findings are similar at the national level. Caution is advised when interpreting these conclusions, since it is not possible to verify the assumption that the dataset is representative.

Tables

Table 1: Variables Initially Used in the Regression Analysis

Variable Name	Description	Notes
year_2021	Indicator (dummy) variable accounting for the year 2021	Variable of interest. If the variable’s coefficient is significant then counts in 2021 differ significantly from 2020.
year_2022	Indicator (dummy) variable accounting for the year 2022	Variable of interest. If the variable’s coefficient is significant then counts in 2022 differ significantly from 2020.
Pop	Annual estimate of state population
popchangeP	Percent change in state population from previous year
AreaKm2	State area in kilometers squared	Not in national level analysis
Density	Population density
tavg_Value	Average monthly temperature in a state in Fahrenheit
pcp_Value	Monthly precipitation in a state in inches
Real GDP	Monthly estimate of real GDP per state (all sectors)
Construction_Real	Monthly estimate of real GDP per state (construction sector only)
Permits	Monthly estimate of building permits issued per state
emp_remodel_NA	Monthly estimate of employment in renovation and remodeling sector at the national level	Not seasonally adjusted
csU_total	Monthly estimate of total construction spending at the national level	Not seasonally adjusted
TotalStarts_NSA	Monthly estimate of total housing starts at the regional level	Regions include Northeast, Midwest, South, and West
Unemp_NSA	Monthly estimate of the unemployment rate per state	Not seasonally adjusted
ConstGeneral_NSA	Monthly employment in the construction sector per state	Not seasonally adjusted
TotalEmp_NSA	Monthly employment for all sectors per state	Not seasonally adjusted
PHMSA_Damages	Annual PHMSA damage counts per state	Not in national level analysis
PHMSA_Tickets	Annual PHMSA ticket counts per state	Not in national level analysis
OneCall_Trans	Annual OneCall center transmissions per state	Not in national level analysis
month_jfm	Indicator (dummy) variable for the months January, February, or March (roughly winter)
month_amj	Indicator (dummy) variable for the months April, May, or June (roughly spring)
month_jas	Indicator (dummy) variable for the months July, August, or September (roughly summer)
month_ond	Indicator (dummy) variable for the months October, November, or December (roughly fall)
month_amjjas	Indicator (dummy) variable for the months April through September (roughly spring and summer)
damage_onecall_ratio	Ratio of total annual damage counts to total annual OneCall transmissions	Not in national level analysis
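The year and season indicators listed in Table 1 can be built from an observation's year and month as in the sketch below. The function name is hypothetical. The variable names follow Table 1.

```python
def time_indicators(year, month):
    """Return the 0/1 year and season indicators for one observation."""
    return {
        "year_2021": int(year == 2021),
        "year_2022": int(year == 2022),
        "month_jfm": int(month in (1, 2, 3)),
        "month_amj": int(month in (4, 5, 6)),
        "month_jas": int(month in (7, 8, 9)),
        "month_ond": int(month in (10, 11, 12)),
        "month_amjjas": int(4 <= month <= 9),
    }


# An observation from November 2020: both year indicators are 0,
# so 2020 acts as the reference year.
print(time_indicators(2020, 11))
```

Since the four quarterly indicators always sum to one, only three of them can enter a model that includes a constant. Tables 2 and 3 report coefficients for month_jfm, month_amj, and month_jas, which suggests October through December serves as the reference season.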

Table 2: Count Models Relating Damages to Explanatory Variables with Standard Errors Clustered on Geography as Well as the Fixed Effects Poisson Model^a

Variable	Poisson Model (Non-Panel With Clustered SE)	Negative Binomial (Non-Panel With Clustered SE)
year_2021	-0.1523586 (0.103617)	-0.0807347 (0.0974079)
year_2022	0.1670186 (0.3163175)	0.4578683 (0.3302787)
AreaKm2	-0.00000182** (0.000000741)	-0.00000247** (0.0000011)
tavg_Value	0.0102598* (0.0056541)	0.0154227*** (0.0058048)
pcp_Value	-0.0156518 (0.0140943)	-0.0205085 (0.0201191)
popchangeP	0.0192184 (0.0877684)	-0.1006529 (0.0665743)
Real GDP	0.000000538** (0.000000275)	0.000000177 (0.000000244)
Permits	0.0000307 (0.0000231)	0.0000635* (0.0000351)
emp_remodel_NA	0.0035769 (0.0057629)	-0.0007927 (0.0051689)
TotalStarts_NSA	-0.0170241** (0.0075732)	-0.0006939 (0.0051008)
Unemp_NSA	0.0171364 (0.0442866)	-0.0066856 (0.0355101)
PHMSA_Tickets	0.000000501* (0.000000262)	0.000000865*** (0.000000313)
PHMSA_Damages	0.0000342*** (0.0000132)	0.0001131*** (0.0000391)
Density (population)	-0.0005764*** (0.0001256)	-0.0004551*** (0.0000605)
month_jfm	-0.034512 (0.0845788)	-0.1665809* (0.0860977)
month_amj	0.1708286 (0.1226405)	0.0144001 (0.1415831)
month_jas	0.0797209 (0.1175068)	-0.0302641 (0.1440597)
low_count	1233.152*** (188.2758)	1380.3*** (239.4022)
Constant	2.456758 (2.083993)	3.258361* (1.981628)
lnalpha (dispersion parameter)	N/A	-0.9092629 (0.2028131)
alpha	N/A	0.402821 (0.0816974)
Log-likelihood (pseudo)	-65,603.859	-8,955.8871
R² (pseudo)	0.8198	N/A
Observations	1560	1560

^a Cells contain model coefficients and associated standard errors in round brackets.

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance, respectively.
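Poisson and negative binomial models are normally estimated with a log link, so a coefficient can be read as a multiplicative effect on the expected count by exponentiating it. The sketch below assumes a log link (which the appendix does not state explicitly) and applies it to the negative binomial year coefficients from Table 2. The function names are hypothetical.

```python
import math


def rate_ratio(coefficient):
    """Multiplicative change in the expected count for a one-unit change,
    holding the other variables constant (assumes a log link)."""
    return math.exp(coefficient)


def percent_change(coefficient):
    """The same effect expressed as a percentage change."""
    return 100.0 * (math.exp(coefficient) - 1.0)


# Negative binomial coefficients on the year indicators from Table 2.
year_2021 = -0.0807347
year_2022 = 0.4578683

print(f"2021 vs 2020: {percent_change(year_2021):+.1f}%")
print(f"2022 vs 2020: {percent_change(year_2022):+.1f}%")
```

These point estimates correspond to roughly 8% fewer damages in 2021 and 58% more in 2022 relative to 2020, but the standard errors in Table 2 are large and neither coefficient is significantly different from zero.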

Table 3: National Scale Count Models Relating Damages to Explanatory Variables with Robust Standard Errors^a

Variable	Poisson	Negative Binomial
year_2021	-0.0284786 (0.0598719)	-0.0356251 (0.0623831)
year_2022	0.0652038 (0.0671562)	0.0549278 (0.0702493)
pcp_Value	0.0103141 (0.0444743)	0.0141324 (0.047815)
Permits	0.00000363** (0.00000148)	0.00000398*** (0.00000155)
TotalEmp_NSA	-0.000000000506 (0.00000000335)	-0.000000000295 (0.00000000327)
month_jfm	-0.1736543** (0.0716051)	-0.1786293** (0.0707448)
month_amj	0.1732135** (0.0875006)	0.1702865* (0.0888117)
month_jas	0.2387971*** (0.0773031)	0.2325049*** (0.078617)
Constant	8.94596*** (0.4847399)	8.866267*** (0.4714805)
lnalpha (dispersion parameter)	N/A	-4.243615 (0.2616687)
alpha	N/A	0.0143556 (0.0037564)
Log-likelihood (pseudo)	-3203.2467	-313.83887
R² (pseudo)	0.7143	N/A
Observations	36	36

^a Cells contain model coefficients and associated standard errors in round brackets.

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance, respectively.
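The lnalpha and alpha rows in Tables 2 and 3 are related by alpha = exp(lnalpha). The short check below reproduces the reported alpha values from the reported lnalpha values. The function name is hypothetical.

```python
import math


def alpha_from_lnalpha(lnalpha):
    """Recover the dispersion parameter from its natural logarithm."""
    return math.exp(lnalpha)


state_alpha = alpha_from_lnalpha(-0.9092629)     # lnalpha from Table 2
national_alpha = alpha_from_lnalpha(-4.243615)   # lnalpha from Table 3

print(f"state-level alpha: {state_alpha:.6f}")
print(f"national alpha:    {national_alpha:.7f}")
```

The national alpha is much closer to zero than the state-level alpha. The negative binomial model approaches the Poisson model as alpha approaches zero, which is consistent with the two national models in Table 3 producing similar coefficients.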
