DIRT Report


Appendix B: Trending Regression Analysis

Objective

The objective of this analysis is to assess whether damage counts are changing significantly[1] over time in the United States after accounting for several potential driving factors (e.g., economic growth).

Method

As in past years, regression analysis was used to relate damage counts by month and state to a set of explanatory variables, including factors related to the economy, demographics, and dig activity. However, because the current analysis focuses on trends over time, year variables were added to account for changes across the years included in the analysis (2020 to 2022).
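Count regressions of this kind are usually written with a log link. As a sketch only (the notation below is chosen here for illustration and is not taken from the report's own equations), a state-level specification matching the description above has the general form:

```latex
\[
\ln \mathrm{E}\left[ D_{s,m} \right]
  = \beta_0
  + \beta_1 \, \mathrm{year\_2021}_{m}
  + \beta_2 \, \mathrm{year\_2022}_{m}
  + \mathbf{x}_{s,m}^{\top} \boldsymbol{\gamma}
\]
```

Here D_{s,m} is the damage count for state s in month m, the two year terms are indicator variables with 2020 as the baseline, and x_{s,m} collects the remaining explanatory variables. A national-level version would drop the state index.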

                             

Given that the damage data are structured as count data, Poisson and negative binomial count models were used for the analysis. Both models are commonly used for count data, but the negative binomial model relaxes a key assumption of the Poisson model, namely that the variance of the counts equals their mean, by adding an overdispersion parameter. Non-panel models were run with robust standard errors clustered on the geographic unit (state), which account for the panel nature of the data.
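To make the relaxed assumption concrete, the sketch below compares the variance implied by a Poisson model with that implied by a negative binomial model. It assumes the common NB2 parameterization, Var(y) = mu + alpha * mu^2 (the report does not state which parameterization was used), and takes the dispersion estimates from Tables 2 and 3. The mean of 100 damages is a made-up value for illustration only.

```python
import math

def nb2_variance(mu, alpha):
    """Variance of a count under the NB2 form: mu + alpha * mu**2.

    With alpha = 0 this collapses to the Poisson case (variance == mean).
    """
    return mu + alpha * mu ** 2

# Dispersion estimates as reported in Table 2 (state) and Table 3 (national).
lnalpha_state, alpha_state = -0.9092629, 0.402821
lnalpha_national, alpha_national = -4.243615, 0.0143556

# alpha is exp(lnalpha); confirm the reported pairs agree with each other.
assert math.isclose(math.exp(lnalpha_state), alpha_state, rel_tol=1e-4)
assert math.isclose(math.exp(lnalpha_national), alpha_national, rel_tol=1e-4)

mu = 100.0  # hypothetical expected monthly damage count (assumption)
poisson_var = nb2_variance(mu, 0.0)
state_var = nb2_variance(mu, alpha_state)
national_var = nb2_variance(mu, alpha_national)

print(f"Poisson variance:               {poisson_var:.2f}")
print(f"NB2 variance (state alpha):     {state_var:.2f}")
print(f"NB2 variance (national alpha):  {national_var:.2f}")
```

Under these assumptions the state-level alpha implies far more spread than a Poisson model allows, which is one way to read the preference for the negative binomial model.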

For this year’s analysis, we also took a step back and ran count models at the level of the country (Equation 2). This overcomes certain limitations of the state-level models, such as noise introduced by variation in damage counts that results from how the data was assembled (e.g., the subset of consistent companies may not properly represent all states, leading to variation across states that is not representative of actual damage activity). However, other limitations arise at this scale: having so few observations may increase multicollinearity, and results may still be skewed toward states that are better represented in the assembled dataset. Furthermore, data on certain variables is only collected at annual timescales, meaning that these variables cannot be included in the regression analysis due to multicollinearity (e.g., One Call transmissions or PHMSA data).

                               

This relationship was modeled following procedures similar to those used for the state-level analysis. Poisson and negative binomial count models were used, both estimated with robust standard errors.
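Moving from the state level to the national level amounts to summing the state-by-month counts into a single series of monthly totals. The sketch below shows that step on made-up rows; the function name, the tuple layout, and the counts are all assumptions for illustration, not the report's actual data pipeline.

```python
from collections import defaultdict

def aggregate_to_national(state_rows):
    """Sum state-level monthly damage counts into one national count per month.

    state_rows: iterable of (state, year, month, damages) tuples.
    Returns a dict keyed by (year, month), in chronological order.
    """
    totals = defaultdict(int)
    for _state, year, month, damages in state_rows:
        totals[(year, month)] += damages
    return dict(sorted(totals.items()))

# Hypothetical rows for two states; the counts are invented for illustration.
rows = [
    (state, year, month, 10 if state == "TX" else 4)
    for state in ("TX", "OH")
    for year in (2020, 2021, 2022)
    for month in range(1, 13)
]

national = aggregate_to_national(rows)
print(len(national))  # one observation per month over three years
```

Three years of monthly data yield 36 national observations, which matches the observation count reported in Table 3.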

The coefficients of interest for the trend analyses are those on the variables representing the years in Equations 1 or 2. If these coefficients are significantly different from zero after accounting for the other factors in the equation, then damage counts are changing over time for reasons other than the factors incorporated into the model.

Before running the regression models, multicollinearity between variables was assessed using variance inflation factors (VIF). Since multicollinearity can influence how regression coefficients for certain variables are interpreted, highly collinear variables were iteratively removed according to their VIF, using a VIF threshold of 5. This resulted in the primary models having a reduced set of variables with limited collinearity. Regression analyses were conducted on this reduced set as well as on the total set of variables.
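The iterative VIF screening described above can be sketched as follows. This is a minimal illustration, not the report's actual procedure: it assumes the variable with the highest VIF is dropped one at a time until every remaining VIF is at or below the threshold, and it computes each VIF as a diagonal element of the inverse correlation matrix. The column names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fails on exact collinearity)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """VIF per variable: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

def prune_by_vif(columns, threshold=5.0):
    """Drop the highest-VIF variable repeatedly until all VIFs <= threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vifs(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept

# Hypothetical data: x2 is almost exactly 2 * x1, x3 is only loosely related.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.1],
    "x3": [5, 1, 4, 2, 8, 3, 7, 6],
}
kept = prune_by_vif(data, threshold=5.0)
print(sorted(kept))
```

In this toy example one of the two nearly collinear columns is removed and the loosely related column is retained.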

 

[1]  In statistics, “significant” means you can feel confident the effect is real rather than random, i.e., that you didn’t just get lucky (or unlucky) in choosing the sample.

Data

In consultation with CGA staff, a subset of the United States damage data was assembled for the 2020 to 2022 period. To help reduce noise in the data stemming from year-to-year variations in company reporting behavior rather than actual changes in damages, the damage count dataset was initially assembled from companies that reported more than 1,000 damages at least once annually from 2020 to 2022. Since this dataset was skewed toward natural gas and One Call center stakeholders, it was enhanced with damage counts from the electric, telecom/CATV, excavator, and water stakeholders.

For the state-level analyses, damages in the final set of data were distributed across the 50 states and the District of Columbia as well as the 36 months of the 2020 to 2022 period. Damage counts reported for certain states are zero or very low, so these states are not well represented in the analysis (i.e., Alaska, Hawaii, Maine, Rhode Island, Vermont, and West Virginia). These states were omitted from the analyses reported below (although the results are very similar when they are included). However, reported damage counts in other states may be lower than expected due to limitations in how the dataset was assembled (e.g., a state could have lower counts if relatively few consistently reporting companies operate in the state). As such, a separate variable, the ratio of total annual damage counts to total annual 811 Center transmissions for a state, was incorporated in an attempt to account for count data being poorly represented for certain states. Data on other variables, including weather, demographics, economics (e.g., GDP or employment), construction, and dig activity (e.g., transmissions), as well as PHMSA data on damages and tickets, were also collected (Table 1). Similar data were used for the analyses at the national level, although at a different scale (e.g., damages were only distributed across the 36-month period).

Results and Conclusion

The initial multicollinearity check revealed that many of the variables in Table 1 were highly collinear, especially for the analysis at the national scale, and these variables were removed from the regression analysis. However, the variables of interest, year_2021 and year_2022, were not substantially correlated with the other explanatory variables in the model.

For the models at the state level, the results are different from last year’s analysis for the period 2019 to 2020. The results of both the Poisson and negative binomial models indicate that damage counts in 2021 and 2022 do not differ significantly from those in 2020 (Table 2). Further testing on the Poisson model results, specifically testing whether the coefficients on the year_2021 and year_2022 variables differ, suggests that damage counts in 2022 also do not differ significantly from those in 2021 (at the 1% level of significance). However, similar testing on the results of the negative binomial model indicates that damage counts in 2022 do differ significantly from counts in 2021, although only at the 10% level of significance. Note that the negative binomial model is preferred given that the dispersion parameter (alpha) is not zero. In addition, follow-up testing of the Poisson model (deviance and Pearson statistics) suggests that this model is not appropriate for the data.
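The significance statements for the year coefficients can be checked against Table 2 with a simple Wald z-test, dividing each coefficient by its standard error. The sketch below assumes a standard normal reference distribution and a log link (so that exp(coefficient) is a rate ratio relative to 2020); neither detail is stated explicitly in the text. The coefficients and standard errors are copied from Table 2.

```python
import math

def wald_z_test(coef, se):
    """Two-sided z-test of H0: coefficient == 0 (normal reference distribution)."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# (coefficient, clustered standard error) copied from Table 2.
table2 = {
    ("poisson", "year_2021"): (-0.1523586, 0.103617),
    ("poisson", "year_2022"): (0.1670186, 0.3163175),
    ("negbin", "year_2021"): (-0.0807347, 0.0974079),
    ("negbin", "year_2022"): (0.4578683, 0.3302787),
    ("negbin", "tavg_Value"): (0.0154227, 0.0058048),  # marked *** in Table 2
}

results = {key: wald_z_test(*vals) for key, vals in table2.items()}
for (model, var), (z, p) in results.items():
    print(f"{model:8s} {var:11s} z = {z:6.2f}  p = {p:.3f}")

# Assuming a log link, exp(coef) is the multiplicative change relative to 2020.
ratio_2022 = math.exp(table2[("negbin", "year_2022")][0])
print(f"negbin year_2022 rate ratio: {ratio_2022:.2f}")
```

Under these assumptions none of the four year coefficients reaches the 10% level, consistent with the absence of significance stars in Table 2, even though the negative binomial point estimate for 2022 is large. The temperature coefficient is included only as a sanity check against a starred entry.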

Results were similar at the national scale, as none of the coefficients on the year variables were significant (Table 3). However, further testing on whether the coefficients on the year_2021 and year_2022 variables differ indicates that they differ significantly at the 5% level of significance. These results hold for both the Poisson and negative binomial models, although the deviance and Pearson statistics suggest that the negative binomial model is preferred.
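The test of whether the two year coefficients differ is not spelled out in the text. One common approach, shown here as an assumption rather than as the procedure actually used, is a Wald test on the difference between the coefficients. It requires the covariance between the two estimates, which is not reported in Tables 2 or 3:

```latex
\[
z = \frac{\hat{\beta}_{2022} - \hat{\beta}_{2021}}
         {\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_{2022})
              + \widehat{\mathrm{Var}}(\hat{\beta}_{2021})
              - 2\,\widehat{\mathrm{Cov}}(\hat{\beta}_{2022}, \hat{\beta}_{2021})}}
\]
```

Because both year indicators are measured against the same 2020 baseline, their estimates are typically positively correlated, which shrinks the denominator. This is how the difference between 2021 and 2022 can be significant even when neither coefficient is individually distinguishable from zero.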

Assuming that the assembled data is representative of trends in all sectors and parts of the United States, the models provide evidence that damage counts in 2021 and 2022 do not differ significantly from those in 2020 after accounting for key driving factors. However, there is some evidence that damage counts in 2022 may differ significantly from those in 2021 (counts in 2022 may be higher). Findings are similar at the national level. Caution is advised when interpreting these conclusions, since it is not possible to verify the assumption that the dataset is representative.

Tables

Table 1: Variables Initially Used in the Regression Analysis

| Variable Name | Description | Notes |
|---|---|---|
| year_2021 | Indicator (dummy) variable accounting for the year 2021 | Variable of interest. If the variable’s coefficient is significant then counts in 2021 differ significantly from 2020. |
| year_2022 | Indicator (dummy) variable accounting for the year 2022 | Variable of interest. If the variable’s coefficient is significant then counts in 2022 differ significantly from 2020. |
| Pop | Annual estimate of state population | |
| popchangeP | Percent change in state population from previous year | |
| AreaKm2 | State area in kilometers squared | Not in national level analysis |
| Density | Population density | |
| tavg_Value | Average monthly temperature in a state in Fahrenheit | |
| pcp_Value | Monthly precipitation in a state in inches | |
| Real GDP | Monthly estimate of real GDP per state (all sectors) | |
| Construction_Real | Monthly estimate of real GDP per state (construction sector only) | |
| Permits | Monthly estimate of building permits issued per state | |
| emp_remodel_NA | Monthly estimate of employment in renovation and remodeling sector at the national level | Not seasonally adjusted |
| csU_total | Monthly estimate of total construction spending at the national level | Not seasonally adjusted |
| TotalStarts_NSA | Monthly estimate of total housing starts at the regional level | Regions include Northeast, Midwest, South, and West |
| Unemp_NSA | Monthly estimate of the unemployment rate per state | Not seasonally adjusted |
| ConstGeneral_NSA | Monthly employment in the construction sector per state | Not seasonally adjusted |
| TotalEmp_NSA | Monthly employment for all sectors per state | Not seasonally adjusted |
| PHMSA_Damages | Annual PHMSA damage counts per state | Not in national level analysis |
| PHMSA_Tickets | Annual PHMSA ticket counts per state | Not in national level analysis |
| OneCall_Trans | Annual OneCall center transmissions per state | Not in national level analysis |
| month_jfm | Indicator (dummy) variable for the months January, February, or March (roughly winter) | |
| month_amj | Indicator (dummy) variable for the months April, May, or June (roughly spring) | |
| month_jas | Indicator (dummy) variable for the months July, August, or September (roughly summer) | |
| month_ond | Indicator (dummy) variable for the months October, November, or December (roughly fall) | |
| month_amjjas | Indicator (dummy) variable for the months April through September (roughly spring and summer) | |
| damage_onecall_ratio | Ratio of total annual damage counts to total annual OneCall transmissions | Not in national level analysis |
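The year and season indicator variables in Table 1 are simple functions of an observation's year and month. The sketch below shows one way to build them; the function name and dictionary layout are assumptions for illustration, while the variable names follow Table 1.

```python
def time_indicators(year, month):
    """Build the year and season indicator (dummy) variables for one observation.

    2020 is the baseline year, so it gets no indicator of its own.
    """
    return {
        "year_2021": int(year == 2021),
        "year_2022": int(year == 2022),
        "month_jfm": int(month in (1, 2, 3)),
        "month_amj": int(month in (4, 5, 6)),
        "month_jas": int(month in (7, 8, 9)),
        "month_ond": int(month in (10, 11, 12)),
        "month_amjjas": int(4 <= month <= 9),
    }

august_2022 = time_indicators(2022, 8)
january_2020 = time_indicators(2020, 1)
print(august_2022)
print(january_2020)
```

The four quarterly indicators always sum to one, so at most three of them can enter a model that also has a constant. This is consistent with month_ond being absent from Tables 2 and 3, where it acts as the baseline season.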

 

 

 

Table 2: Count Models Relating Damages to Explanatory Variables with Standard Errors Clustered on Geography as Well as the Fixed Effects Poisson Model [a]

| Variable | Poisson Model (non-panel, clustered SE) | Negative Binomial (non-panel, clustered SE) |
|---|---|---|
| year_2021 | -0.1523586 (0.103617) | -0.0807347 (0.0974079) |
| year_2022 | 0.1670186 (0.3163175) | 0.4578683 (0.3302787) |
| AreaKm2 | -0.00000182** (0.000000741) | -0.00000247** (0.0000011) |
| tavg_Value | 0.0102598* (0.0056541) | 0.0154227*** (0.0058048) |
| pcp_Value | -0.0156518 (0.0140943) | -0.0205085 (0.0201191) |
| popchangeP | 0.0192184 (0.0877684) | -0.1006529 (0.0665743) |
| Real GDP | 0.000000538** (0.000000275) | 0.000000177 (0.000000244) |
| Permits | 0.0000307 (0.0000231) | 0.0000635* (0.0000351) |
| emp_remodel_NA | 0.0035769 (0.0057629) | -0.0007927 (0.0051689) |
| TotalStarts_NSA | -0.0170241** (0.0075732) | -0.0006939 (0.0051008) |
| Unemp_NSA | 0.0171364 (0.0442866) | -0.0066856 (0.0355101) |
| PHMSA_Tickets | 0.000000501* (0.000000262) | 0.000000865*** (0.000000313) |
| PHMSA_Damages | 0.0000342*** (0.0000132) | 0.0001131*** (0.0000391) |
| Density (population) | -0.0005764*** (0.0001256) | -0.0004551*** (0.0000605) |
| month_jfm | -0.034512 (0.0845788) | -0.1665809* (0.0860977) |
| month_amj | 0.1708286 (0.1226405) | 0.0144001 (0.1415831) |
| month_jas | 0.0797209 (0.1175068) | -0.0302641 (0.1440597) |
| low_count | 1233.152*** (188.2758) | 1380.3*** (239.4022) |
| Constant | 2.456758 (2.083993) | 3.258361* (1.981628) |
| lnalpha (dispersion parameter) | N/A | -0.9092629 (0.2028131) |
| alpha | N/A | 0.402821 (0.0816974) |
| Log-likelihood (pseudo) | -65,603.859 | -8,955.8871 |
| R² (pseudo) | 0.8198 | N/A |
| Observations | 1560 | 1560 |

[a] Cells contain model coefficients and associated standard errors in round brackets.

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance, respectively.

 

Table 3: National Scale Count Models Relating Damages to Explanatory Variables with Robust Standard Errors [a]

| Variable | Poisson | Negative Binomial |
|---|---|---|
| year_2021 | -0.0284786 (0.0598719) | -0.0356251 (0.0623831) |
| year_2022 | 0.0652038 (0.0671562) | 0.0549278 (0.0702493) |
| pcp_Value | 0.0103141 (0.0444743) | 0.0141324 (0.047815) |
| Permits | 0.00000363** (0.00000148) | 0.00000398*** (0.00000155) |
| TotalEmp_NSA | -0.000000000506 (0.00000000335) | -0.000000000295 (0.00000000327) |
| month_jfm | -0.1736543** (0.0716051) | -0.1786293** (0.0707448) |
| month_amj | 0.1732135** (0.0875006) | 0.1702865* (0.0888117) |
| month_jas | 0.2387971*** (0.0773031) | 0.2325049*** (0.078617) |
| Constant | 8.94596*** (0.4847399) | 8.866267*** (0.4714805) |
| lnalpha (dispersion parameter) | N/A | -4.243615 (0.2616687) |
| alpha | N/A | 0.0143556 (0.0037564) |
| Log-likelihood (pseudo) | -3203.2467 | -313.83887 |
| R² (pseudo) | 0.7143 | N/A |
| Observations | 36 | 36 |

[a] Cells contain model coefficients and associated standard errors in round brackets.

***, **, and * indicate that the coefficient is significantly different from zero at the 1%, 5%, and 10% level of significance, respectively.
