Item Response Theory is Superior to CSA in Crash Correlation

For almost a year, SambaSafety has been sharing its IRT Model with the industry and reporting that we think IRT remedies many of the defects of CSA. Now we have completed our analysis of whether it measures up to the FMCSA’s 2012 Effectiveness Test. Details follow, but spoiler alert: IRT is superior to CSA in identifying high crash rate motor carriers. Read on for how SambaSafety updated the 2012 Effectiveness Test into a 2018 Effectiveness Test that now includes a look at IRT.

In 2004, FMCSA began work on a new data-driven program that they believed would give them a more modern, statistically based way to prioritize motor carriers for interventions and compliance reviews. The Agency, quite rightly, was seeking to enter the Big Data age and use data analytics to prioritize the more than 500,000 motor carriers they were tasked with overseeing, in service of their stated mission:


“The primary mission of the Federal Motor Carrier Safety Administration (FMCSA) is to reduce crashes, injuries and fatalities involving large trucks and buses.”

In December of 2010, the FMCSA released their new safety scoring model, then called CSA 2010, referring to Compliance, Safety, and Accountability and its release date. FMCSA promised that this new data-driven model would predict which motor carriers were more likely to have future crashes and give the Agency the tools they required to prioritize motor carriers for review.

In 2012, an Effectiveness Test was conducted by the Volpe National Transportation Systems Center to determine the strength of CSA’s correlation to high crash rate motor carriers. This peer-reviewed report, based on data through June 2012 and released in January 2014, found that CSA was an effective tool for identifying carriers with higher-than-the-national-average crash rates under two different models.

The 2012 Effectiveness Test looked back two years at violation data, then forward 18 months at crashes to determine correlation. The test then presented its results under two different models: 1) the Number of BASICs in Alert Status; and 2) the BASIC in Alert Status.


Crash Rates by Number of BASICs in Alert


Crash Rates by What BASIC is in Alert


Each model shows the strength of crash correlation as the height of a bar, compared against the national average crash rate. The Number of BASICs in Alert Model shows clearly that any number of BASIC Alerts identifies higher-than-average crash rate carriers, and the correlation gets stronger the more Alerts a carrier has. The BASIC in Alert Model shows that every BASIC except Driver Fitness also identifies higher-than-average crash rate motor carriers.
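As a rough illustration of how a chart like the first one could be tabulated, the sketch below groups carriers by their count of BASICs in Alert and computes crashes per 100 power units for each group. The carrier records are made-up sample data, the field names and the "0" bucket are assumptions, and pooling crashes and power units within each group is one plausible way to compute the rate, not the documented Volpe procedure.

```python
from collections import defaultdict


def alert_bucket(n_alerts):
    """Map a count of BASICs in Alert to a chart bucket (the "0" bucket is assumed)."""
    if n_alerts <= 2:
        return str(n_alerts)
    if n_alerts <= 4:
        return "3-4"
    return "5+"


def crash_rate_per_100_pu(crashes, power_units):
    """Crashes per 100 power units."""
    return 100.0 * crashes / power_units


def rates_by_bucket(carriers):
    """Pool crashes and power units per bucket, then compute each bucket's rate."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [crashes, power_units]
    for carrier in carriers:
        bucket = alert_bucket(carrier["alerts"])
        totals[bucket][0] += carrier["crashes"]
        totals[bucket][1] += carrier["power_units"]
    return {b: crash_rate_per_100_pu(cr, pu) for b, (cr, pu) in totals.items()}


# Illustrative records only; these are not real carrier data.
sample_carriers = [
    {"alerts": 0, "crashes": 2, "power_units": 100},
    {"alerts": 3, "crashes": 6, "power_units": 50},
    {"alerts": 4, "crashes": 4, "power_units": 50},
    {"alerts": 5, "crashes": 3, "power_units": 20},
]

sample_rates = rates_by_bucket(sample_carriers)
print(sample_rates)
```

Each bucket's rate can then be set against the national average to see whether that group of carriers crashes more often than the industry as a whole.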

Since 2014, this Effectiveness Study has been an important tool for FMCSA when identifying motor carriers for interventions and compliance reviews, and it is still published on the FMCSA’s SMS website today.

The CSA Program, specifically its computational methodology, came under scrutiny and criticism by the Industry for several reasons. CSA Reform began in earnest in December 2015. The reforms mandated by Congress in the FAST Act of 2015 led to a report released in June 2017 by The National Academies of Sciences, Engineering and Medicine (NAS) titled Improving Motor Carrier Safety Measurement.

The NAS report suggested a more robust, scientific statistical model be implemented in CSA to improve its ability to identify risky carriers and avoid crashes. Item Response Theory (IRT) was proposed to offer a better foundation to the CSA Program. In August 2018, FMCSA, in a report to Congress, stated that they intended to move ahead with testing of IRT.

SambaSafety, and Vigillo, which SambaSafety acquired in 2017, have been deeply involved in the reform efforts underway on the CSA Program. SambaSafety released a functioning IRT Model in November 2018 and has been running this methodology on the entire U.S. Trucking Industry since then. We have applied IRT Scores retroactively and have just completed our own Effectiveness Test, updated for 2018 data and applying the IRT Methodology. The fundamental question we posed to ourselves is simply:

Does IRT correlate more strongly to high crash rate motor carriers than legacy CSA?

The answer is, yes, it does, on both models described above.

We applied the same methodology to SMS data but updated the time period to more current activity. It would not be useful to produce current IRT Scores and compare them to the 2012 Effectiveness Test, so we started by bringing the Effectiveness Test current to 2018 data.

The 2018 updated Effectiveness Test results, still looking at existing CSA methodology, present some interesting findings before we look at IRT.

Updated Number of BASICs in Alert Model

A few interesting points emerge from the updated Count of Alerts Test. First, the national average crash rate has increased from 3.43 crashes per 100 Power Units (PU) to 4.426 crashes per 100 PU, a rise of roughly 29%. We know crashes have been increasing – disappointing but true – and this data confirms that rise. Second, Alert Count continues to show a strong correlation. And a final interesting note: carriers with five or more BASIC Alerts show lower correlation than those with three to four. This could be read as evidence that the CSA system, with all its flaws, may be working. Very high scrutiny is placed on 5+ Alert carriers by not only FMCSA but also shippers, brokers and lawyers; even the FMCSA’s “High Risk Motor Carrier” Methodology focuses in this area. Any carrier with five or more Alerts would, out of necessity, be hyper-focused on bringing their CSA Scores down. The result of that focus could well be reflected in lower crash rates in that category.

Updated BASIC in Alert Model

The second model, BASIC in Alert, also holds some interest. Driver Fitness remains the least useful in identifying crash rates, but carriers in Alert in every other BASIC show crash rates above the national average. In this view of data six years more recent, Controlled Substance and Hazmat swap positions, as do Crash and Unsafe Driving. I don’t read too much into this: each pair was very close in the original Effectiveness Test, so switching positions is probably not significant. Interestingly, we see continuing strong correlations in every BASIC, similar to six years ago.

With the Test updated to current data, we then applied the same two models to IRT Scores.

It is important to start with a decision we had to make, on a question FMCSA has not publicly answered as of this writing: Where will IRT thresholds be set to determine Alert status?

We took a conservative approach (fewer carriers in Alert) and set the Alert Thresholds at an IRT Score of 130. That is two standard deviations above the mean and identifies more than 20,000 motor carriers that would be in Alert Status in the BASICs and/or the Safety Culture Score.
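To make the threshold concrete, here is a minimal sketch of the Alert check. The article states only that 130 sits at the second standard deviation from the mean; the mean of 100 and standard deviation of 15 used below are assumptions chosen to be consistent with that statement, and the function names are hypothetical.

```python
ALERT_THRESHOLD = 130.0  # from the article
ASSUMED_MEAN = 100.0     # assumption: not stated in the article
ASSUMED_SD = 15.0        # assumption: not stated in the article


def z_score(score, mean=ASSUMED_MEAN, sd=ASSUMED_SD):
    """Number of standard deviations a score lies above the mean."""
    return (score - mean) / sd


def in_alert(score, threshold=ALERT_THRESHOLD):
    """A carrier is in Alert when its IRT Score is at or above the threshold."""
    return score >= threshold


print(z_score(130.0))   # 2.0 under the assumed scale
print(in_alert(130.0))  # True
print(in_alert(129.9))  # False
```

Lowering the threshold would put more carriers in Alert, which is why a cutoff of 130 is the more conservative choice.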

IRT Alert Threshold set at 130 (lacking FMCSA guidance)

With our new CSA/IRT Alert Thresholds set, we performed exactly the same Effectiveness Test, this time comparing IRT-based scores to the now-updated Effectiveness Test for CSA.

Count of BASICs in Alert w/IRT

Once a carrier has more than one BASIC in Alert, IRT is a significantly stronger indicator of risk. It is more strongly correlated to high crash rate carriers at two Alerts, and significantly stronger still for carriers with 3-4 or 5+ Alerts. In most cases, IRT finds carriers with higher-than-national-average crash rates.

BASICs in Alert w/IRT

When we look at Model 2, BASICs in Alert, IRT performs even better. In four of the seven BASICs, IRT has significantly stronger correlation to high crash rate carriers. Driver Fitness, a low performer in every other model, now far surpasses CSA and surpasses the national average. Vehicle Maintenance, HOS, and Unsafe Driving also show IRT surging ahead of CSA, both in its strength of correlation and in the much higher crash rates it is finding in each of those four BASICs. This is really good news; these are also the four BASICs that capture the driver behavior and indicators of safety culture that make IRT work.

So why does IRT not appear to perform as well in Crash, Controlled Substance, and Hazmat? Let’s talk about the Crash BASIC. Crashes are the outcome we want to avoid. There are some of us who have never seen the logic of saying crashes predict crashes. Of course they do. CSA only scores you in Crash if you have a crash (actually two). If you tried that in Excel, you’d get a “circular reference” error. CSA is supposed to be looking at behavior that can be remedied to avoid crashes. A crash is not a behavior. So, in all of this, I discount the validity of the Crash BASIC as a Crash Correlation.

The Hazmat BASIC is explained by the FMCSA’s Effectiveness Test as follows:

“The HM Compliance BASIC does not show a strong association with future crash rate; however, it is not intended to identify such an association as the regulations used in this BASIC focus on the reduction of crash severity/consequences, not crash frequency.”

In other words, the HM BASIC is not designed to correlate to Crash Rates, and IRT agrees. It’s a low correlation, because it’s a low correlation.

The Controlled Substance BASIC suffers from a similar defect in terms of Crash Correlation. The same Effectiveness Study states: “The vast majority of Controlled Substances/Alcohol regulations are related to the administration of testing at the carrier level and are not observable or confirmed at the roadside.” The things that largely put Carriers in Alert Status in the Controlled Substance BASIC are not observed in roadside inspection data and are not computed into CSA or IRT Scores. It is unlikely that “testing administration” will ever produce strong correlations to high carrier crash rates.

We are very encouraged by what we’re seeing in this Effectiveness Test on the two BASIC models. But remember, IRT also produces the Safety Culture Score, the proposed new single score that evaluates the safety culture at a motor carrier. Wouldn’t it be cool if that Safety Culture Score showed strong Crash Correlation?

Turns out it does, and it does it quite well.

When we look at the Carriers in Alert Status in the Safety Culture Score under IRT (Score of 130+), we see very strong crash correlation. The average crash rate of Carriers in Alert status in the Safety Culture Score is 10.221 crashes per 100 power units. That is 231% of the industry average of 4.426, and it identifies 3,166 motor carriers who should be receiving the attention of FMCSA. It is also important to understand that many of those carriers were not scored under CSA at all.
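The 231% figure follows directly from the two crash rates quoted in this article, as the small check below shows. The function and constant names are hypothetical.

```python
def percent_of_average(group_rate, national_rate):
    """Express a group's crash rate as a percentage of the national average."""
    return 100.0 * group_rate / national_rate


NATIONAL_RATE_2018 = 4.426         # crashes per 100 power units, from the article
SAFETY_CULTURE_ALERT_RATE = 10.221  # crashes per 100 power units, from the article

pct = percent_of_average(SAFETY_CULTURE_ALERT_RATE, NATIONAL_RATE_2018)
print(round(pct))  # 231
```

Put another way, carriers in Alert on the Safety Culture Score crash at a bit more than twice the national rate.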

Our conclusion is that IRT is definitely a more scientific and accurate tool for identifying high risk carriers when measured by crash rates. We encourage FMCSA to continue their work on IRT in a transparent and collaborative manner so we, and other organizations with interest and expertise in this very complex model, can continue to provide meaningful feedback in our mutual mission to save lives.