Before officially starting the data fusion process, there are seven key items to consider or address:
Some forms of data fusion will be low-cost, while others will be complex and potentially expensive. For some use cases, the results may be well worth the added cost. For other use cases, the incremental improvements (or even risk of no operational improvement) may not be worth the expense. Executives need to keep in mind that the expense of data fusion is not “one-and-done.” There are ongoing costs associated with maintaining the system and/or modifying the system when changes to data or other systems occur.
Agencies are likely using both point-sensor data and probe data for other purposes already. They may be integrated into ATMS or 511 platforms, used to derive travel times for dynamic message sign (DMS) messages, used to detect the back of a queue around a dangerous corner, or even used for real-time enforcement activities. Agencies will want to ensure that the way in which the fusion team interacts with these data does not disrupt these or other existing applications. For example, an old database server may have been just barely sized for a particular application, and any additional load on that server may introduce latency or even prevent other applications from working at all. Finally, it is critical that the original data are preserved and that the fusion process does not accidentally modify the source data in their original location. Doing so would impact other systems currently relying on the data as is.
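As a rough illustration of this read-only, copy-first pattern, the sketch below opens the operational store in read-only mode and stages a copy for the fusion work. The file names, table, and columns are placeholders, not any agency's actual schema, and a production job would also be scheduled off-peak to limit load.

```python
import sqlite3

# Open the operational source in READ-ONLY mode so the fusion job
# cannot modify data that other applications (ATMS, 511, DMS) rely on.
# "source.db" and the table/column names are hypothetical placeholders.
src = sqlite3.connect("file:source.db?mode=ro", uri=True)

# Write the working copy to a separate staging database; all fusion
# steps operate on this copy, never on the original.
staging = sqlite3.connect("staging.db")
staging.execute(
    "CREATE TABLE IF NOT EXISTS probe_speeds "
    "(segment_id TEXT, ts TEXT, speed_mph REAL)"
)

rows = src.execute("SELECT segment_id, ts, speed_mph FROM probe_speeds")
staging.executemany("INSERT INTO probe_speeds VALUES (?, ?, ?)", rows)
staging.commit()

src.close()
staging.close()
```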
The framework assumes that the entity responsible for implementing the fusion has already procured (or has access to) enough data to make fusion worthwhile. This document does not provide specific guidance on how to procure data. Even if an agency already owns a point-sensor or probe-based dataset, it still needs to confirm that the terms and conditions of any third-party private-sector data agreements allow the creation of derivative products and/or the use of those products in the desired ways. Some data providers strictly prohibit certain use cases, such as speed enforcement. Others may limit the publication of derivative products that could compete with the provider's other markets. Agencies should strive to negotiate open and broad data rights at the onset of procurement. TETC has developed a model data use agreement that can be downloaded and used by other agencies. It can be found here: https://tetcoalition.org/projects/transportation-data-marketplace/.
While the most basic datasets (speed and volume) have little chance of revealing personally identifiable information (PII), the same is not true for O-D, waypoint, and raw LBS data. These datasets are typically anonymized long before a DOT receives them; however, it is possible that fusing them with other datasets could reveal, or even create, PII. DOTs should avoid creating derivative products that purposely or accidentally divulge the movements and names of individuals.
Additionally, if exact starting and ending points of trips are provided (including exact time stamps), then it could be easy to identify individuals, their home and work addresses, or otherwise infringe upon the privacy of individuals and their movements. Ways to obscure these details include the following:

- Truncating or snapping trip endpoints to a coarser geography (e.g., the nearest major intersection, grid cell, or census block) rather than an exact address.
- Rounding or fuzzing time stamps so that exact departure and arrival times cannot be recovered.
- Suppressing O-D pairs with very few observed trips so that no record can be traced back to a single traveler.
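As a simple illustration of the first two techniques, the sketch below snaps coordinates to a coarse grid and rounds time stamps into bins. The grid size and 15-minute interval are illustrative assumptions, not any vendor's actual parameters; appropriate values are a policy decision.

```python
from datetime import datetime

GRID_DEG = 0.01   # ~1 km grid cell; coarseness is a policy choice
BIN_MIN = 15      # time-stamp rounding interval (minutes)

def coarsen_point(lat: float, lon: float) -> tuple[float, float]:
    """Snap an exact coordinate to a coarse grid so trip endpoints
    no longer resolve to a specific address."""
    def snap(v):
        # 2 decimal places matches the 0.01-degree grid
        return round(round(v / GRID_DEG) * GRID_DEG, 2)
    return snap(lat), snap(lon)

def coarsen_time(ts: datetime) -> datetime:
    """Round a time stamp down to the nearest BIN_MIN-minute bin."""
    minutes = (ts.hour * 60 + ts.minute) // BIN_MIN * BIN_MIN
    return ts.replace(hour=minutes // 60, minute=minutes % 60,
                      second=0, microsecond=0)

# An exact origin and departure time become a grid cell and a time bin.
print(coarsen_point(39.28341, -76.61219))        # (39.28, -76.61)
print(coarsen_time(datetime(2023, 5, 1, 8, 7)))  # 2023-05-01 08:00:00
```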
Most of the time, this obfuscation is handled by the data provider before the data are delivered to the agency; however, agencies should be aware of these potential issues and verify that they have been addressed before beginning a fusion exercise.
Agencies need to be extremely careful about how they leverage these types of data to ensure there is no misuse, real or perceived. Planning and policies need to be put in place long before data acquisition occurs, both to protect the agency and the public it serves and to ensure the data remain protected and available for years to come. Agencies need to be cautious, as a single misuse could not only expose user PII but also cause public mistrust of the technology and of the people and organizations that use it.
The transportation data market is highly volatile. New companies form frequently, and shiny new datasets disrupt the market with promises of improved capabilities. While progress is exciting, agencies need to perform due diligence on the companies they buy data from or partner with to gauge whether those companies will be around for the long term. If a DOT is going to invest hundreds of thousands of dollars each year in data and fusion implementation, the agency needs to protect its investment by making sure the built-out capabilities can be supported for years to come.
The seemingly inevitable march toward more and more location data may not be inevitable at all, and data may not continue to grow along the predicted trajectory. Data providers disappear frequently, often at the mercy of upstream suppliers and constantly changing business relationships. The landscape is continually shifting: what is reality today may not be reality tomorrow. As a result, validations of data provider products that are sensitive to losses or additions of underlying base data (e.g., volumes, O-Ds) can become invalid, and potentially worthless, as soon as the underlying data suppliers change. For this reason, agencies should attempt to detect such changes and ask their suppliers to disclose them openly. TETC is trying to tackle some of these validation challenges and is working on regular validations of newer, more sensitive data products. More information on this topic may be found in Section 4.6.
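One practical way to watch for such shifts is to monitor the daily record counts arriving from each provider and flag abrupt deviations for follow-up with the vendor. The sketch below is a minimal version of that idea; the 14-day baseline window and 20 percent threshold are illustrative assumptions, not an established standard.

```python
from statistics import mean

def flag_supply_shifts(daily_counts: list[int], window: int = 14,
                       threshold: float = 0.20) -> list[int]:
    """Return indices of days whose probe-record count deviates from the
    trailing `window`-day average by more than `threshold` (fractional).
    A flagged day is a prompt to ask the vendor whether an upstream
    data supplier was added or dropped."""
    flags = []
    for i in range(window, len(daily_counts)):
        baseline = mean(daily_counts[i - window:i])
        if baseline and abs(daily_counts[i] - baseline) / baseline > threshold:
            flags.append(i)
    return flags

# Example: a sudden one-third drop in daily records gets flagged.
counts = [90_000 + (i % 3) * 1_000 for i in range(14)] + [60_000]
print(flag_supply_shifts(counts))  # [14]
```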
Unfortunately, there have been many disruptions to the marketplace. For example, one company went into bankruptcy and other data providers have dropped data products after brief experimentation phases. Those who had invested heavily in the company that went bankrupt (both agencies and other private-sector companies) were left with products and services that no longer worked and may never work again—at least not as they once did.
Agencies should strive to work with companies that (1) have longevity in their products and their staff, which is a sign of a strong company; (2) rely on more than one data source so that they can survive minor disruptions; and (3) show no signs of potential financial or other trouble.
The quality and coverage of speed data have been well documented in both urban and rural environments. TETC conducts regular probe speed data quality checks on a wide variety of road classes and environments. Past data validation reports can be found here: https://tetcoalition.org/projects/transportation-data-marketplace/.
TETC is now working with a team of academics at the CATT Laboratory and a panel of state transportation experts to validate other probe-based datasets such as O-Ds, waypoints, volumes, and events. No validation reports had been published at the time this report was written, so readers should check the TETC website for updates on these efforts.
It is currently assumed that, as time progresses, more CVs will be on the road, which should ultimately mean that vendor data penetration rates also rise. While this is the desired outcome for agencies, the reality is more complicated: data brokers often bear the cost of pulling data directly from vehicles themselves, so as more vehicles become available for data access, it is not always in a company's interest to download those data, especially if price lists are fixed. OEMs may also limit the number of vehicles available in their pools at any given time. For example, even if there are 10 million CVs on the road, the OEMs or data brokers may only provide access to a rotating subset of those 10 million vehicles so that no single vehicle is tracked for too long, better ensuring privacy protection.
When coverage (or penetration rates) is provided for O-D, waypoint, or event-style probe data, vendors will make claims about their ability to cover the population of a region. A typical claim may state, "our data cover 35 percent of the population" or "our data cover 5 percent of daily trips." It is important to ask each company to define how it calculates its coverage percentages and how it defines a trip, as a vendor reporting a lower percentage may actually be providing more data. Examples of how these numbers can be computed are shown in Table 3.
This is not to say that one method is better (or more accurate) than another. What matters is that the vendor be transparent in its claims and that the purchaser be aware of the implications; the worked example after Table 3 shows how the same dataset can support both styles of claim.
Table 3. Two examples of how vendors may classify their coverage.
| Claim | Method of Computing |
|---|---|
| Our data cover 5% of daily trips | If there were an estimated 100,000 daily trips taken by the entire population of a region, then the company believes it is capturing approximately 5,000 trips per day. This company considers a trip to be an O-D pair delimited by a dwell time of 5 minutes, so multiple "trips" within a single tour count toward this 5,000. |
| Our data cover 35% of the population | If there were 100,000 people living in a region, there is at least one trip record for 35,000 people over the course of a month (or quarter or year). This does not mean that the company is observing 35,000 trips every day. Instead, it means the company might be observing as few as 1,166 daily trips (35,000 trips spread across a 30-day month). |
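To make the arithmetic in Table 3 concrete, the short worked example below uses the table's hypothetical numbers (not real vendor figures) to show how the same region can yield both a "5% of daily trips" and a "35% of the population" claim:

```python
# Hypothetical region from Table 3.
population = 100_000
estimated_daily_trips = 100_000
days_in_month = 30

# Claim 1: "5% of daily trips" -- roughly 5,000 observed trips per day.
observed_daily_trips = 5_000
print(f"{observed_daily_trips / estimated_daily_trips:.0%} of daily trips")

# Claim 2: "35% of the population" -- 35,000 unique devices seen at least
# once in the month. If each device logged only one trip all month, the
# daily trip count could be as low as 35,000 / 30.
unique_devices = 35_000
print(f"{unique_devices / population:.0%} of the population")
print(f"as few as {unique_devices // days_in_month} daily trips")  # 1166
```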
Some companies are pivoting away from coverage definitions based on percentages and penetration rates for some use cases, arguing that those figures distract from the value of the available data. For a signalized intersection, for example, it may matter less that a certain percentage of the population is represented and more that a minimum number of samples is available to extrapolate the necessary metrics. These companies now describe the statistical significance of their coverage rather than exact percentages (which can be difficult to quantify). Whether this clarifies the situation or is another strategy to muddy the waters is yet to be determined.
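For intuition on the minimum-sample framing, the sketch below applies a standard sample-size formula for estimating a proportion, such as the share of traffic turning left at an intersection. The 95 percent confidence level, ±5 percent margin, and turning-movement example are illustrative assumptions, not any vendor's actual method.

```python
from math import ceil

def min_samples(p: float = 0.5, z: float = 1.96, margin: float = 0.05) -> int:
    """Samples needed to estimate a proportion (e.g., the share of
    traffic turning left) within +/- `margin` at the confidence level
    implied by `z` (1.96 -> 95%). p = 0.5 is the worst case."""
    return ceil((z ** 2) * p * (1 - p) / margin ** 2)

print(min_samples())             # 385 samples for a +/-5% margin
print(min_samples(margin=0.02))  # 2401 samples for a +/-2% margin
```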
While agencies are encouraged to share fusion algorithms with one another to avoid duplicating effort and to build on each other's knowledge, it is important to understand that an algorithm tuned to one agency's unique geography, data availability, and other local characteristics may not be directly transferable to another agency with different characteristics.
For example, the National Renewable Energy Laboratory (NREL) presented a study at the 2021 TRB Annual Meeting that tested the transferability of machine learning algorithms between states (Sakhare, Desai, et al. 2022). The goal was to see whether machine learning models created and trained on data from one state (Colorado, North Carolina, or Pennsylvania) would transfer to the others, essentially allowing a model to be copied and pasted between states while still producing quality results. The inputs to the models included count data from point sensors, probe-based data, and weather data. The results, shown in Table 4, indicate that some models transferred between states better than others; states with similar geographic and weather conditions were more likely to produce good results when using a model developed in another state. A minimal sketch of this style of cross-state evaluation follows Table 4.
Table 4. Results from spatial transferability exercises conducted by Kasundra, et al. (2021).
| Type of Model | Train → Test | R² | Mean Absolute Error (MAE) (veh/hr) | Error to Maximum Flow Ratio (EMFR) (%) |
|---|---|---|---|---|
| CO ⇔ NC Spatial Transferability | CO → NC | 0.71 | 577 | 15.6 |
| | NC → CO | 0.67 | 704 | 13.6 |
| NC ⇔ PA Spatial Transferability | NC → PA | 0.89 | 217 | 6.1 |
| | PA → NC | 0.79 | 403 | 10.3 |
| NC + PA Meta Model | NC + PA | 0.91 | 266 | 5.5 |
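For readers who want to run this style of evaluation on their own data, the sketch below trains a volume model on one "state" and scores it against another. The synthetic stand-in features (probe speed, sensor occupancy, precipitation), the gradient-boosted model, and the reading of EMFR as MAE divided by the maximum observed flow are our assumptions for illustration, not NREL's actual implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)

def synthetic_state(n: int, noise: float):
    """Stand-in for one state's data: probe speed (mph), sensor
    occupancy (fraction), precipitation (in/hr) -> volume (veh/hr)."""
    X = rng.uniform([20, 0.0, 0.0], [70, 0.5, 1.0], size=(n, 3))
    y = 4000 * X[:, 1] + 10 * X[:, 0] + rng.normal(0, noise, n)
    return X, y

# Train on "State A," test on "State B"; the differing noise levels
# loosely stand in for differing geography and weather.
X_a, y_a = synthetic_state(2000, noise=150)
X_b, y_b = synthetic_state(2000, noise=250)

model = GradientBoostingRegressor().fit(X_a, y_a)
pred = model.predict(X_b)

mae = mean_absolute_error(y_b, pred)
emfr = 100 * mae / y_b.max()  # error-to-maximum-flow ratio (%)
print(f"R2={r2_score(y_b, pred):.2f}  MAE={mae:.0f} veh/hr  EMFR={emfr:.1f}%")
```

With real data, `synthetic_state` would be replaced by each state's actual sensor, probe, and weather feeds, and the train/test pairing would be swapped in both directions as in Table 4.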