The Methodological Pitfall of Dataset-Driven Research on Deep Learning in the IoT Space

We highlight a dangerous pitfall in the state-of-the-art evaluation methodology of deep learning algorithms, as applied in several CPS and IoT application spaces where collecting data from physical experiments is difficult. The article is inspired by the real experiences of the authors. An extended version appears in the proceedings of the 2nd Workshop on the Internet of Things for Adversarial Environments (IoT-AE), held in conjunction with MILCOM 2022 [1].

Few would disagree today that the advent of deep learning has the potential to revolutionize many fields, including embedded computing, if it has not done so in many ways already. Yet, applying deep-learning-based solutions to real-world problems is challenging because it requires significant amounts of training data. Short of having access to the underlying physical system (which may or may not be easy depending on the application), the two other go-to techniques in inference algorithm development and testing are simulation and the use of real datasets. Simulation is often criticized as potentially unrealistic, especially in conjunction with deep learning, because deep models can internalize subtle nuances in the input data to an extent that makes reliance on simulated data misleading.

Consequently, applications that do not need to close a feedback loop with the environment (and hence do not need to simulate a response to counter-factual inputs) often rely on previously collected real application traces (or datasets) for evaluation. Such datasets preserve the valuable nuances of the behavior of the underlying real system, making a subsequent evaluation, in principle, more trustworthy. This wisdom has been applied extensively to many intelligent sensing, detection, classification, and tracking applications, from recognizing human activities of daily living to identifying the onset of adverse health conditions from sensor data, and from detecting home appliances from their energy consumption to tracking military targets from their non-line-of-sight signatures. Many datasets, collected from physical experiments with the above systems, have become popular in their respective domains as de facto benchmarks that help researchers evaluate their algorithms on real data. It is in this context that the pitfall referred to in this article manifests. The pitfall results in deceptively good evaluation outcomes on test datasets, whereas the underlying algorithms remain prone to catastrophic failure in practice. The problem occurs despite the use of cross-validation practices that partition the data into separate training, validation, and testing sets.

In recent experiments [1], we illustrated this pitfall by designing two target detection and classification algorithms. One was based on a recently proposed neural network architecture for embedded AI [2], and the other was based on a traditional machine learning approach with domain-inspired input feature engineering [3]. The neural network approach outperformed the traditional one on the test dataset, yet it failed catastrophically in later deployment. The application was very simple: detect a particular type of vehicle when it passes by (while correctly ignoring other vehicle types) using data from nearby microphones and seismic sensors. This functionality is useful for “intelligent tripwire” scenarios, where the sensors must generate an alert only when a specific type of target is present while ignoring other passing traffic. We could reproducibly show superior neural-network-based target recognition results on the test datasets, greatly outperforming our baseline. The simpler baseline, however, significantly outperformed the neural network in a subsequent deployment. We recall this incident because it illustrates a fundamental trade-off between robustness and performance that remains poorly addressed in today’s dataset-driven evaluation methodology for deep learning algorithms. We posit that the application of this methodology in practical research environments often results in optimistic evaluation results that are not representative of the true brittleness of the models developed… but let us tell the story of this pitfall from the beginning.

When training data are scarce, it is well-known that neural networks are prone to overfitting. Overfitting typically occurs when the number of input data samples available for training an estimator is not significantly larger than the number of estimator parameters (e.g., neural network weights) being trained. As a result, the trained model may simply memorize the individual samples without the ability to generalize well to new ones. To guard against overfitting (and thus show some level of robustness) in neural network testing contexts, the prevailing evaluation methodology calls for a separation between training and testing data; the neural network is trained on one dataset but tested on another. The underlying expectation is that if the network overfits the training data, it will have problems generalizing, and will thus perform poorly on the testing data. Moreover, since neural networks have several hyper-parameters that need to be tuned as well, the above basic methodology is usually extended to one featuring a three-way separation, where data are partitioned into a training dataset, a validation dataset, and a testing dataset. The neural network is trained using the training dataset, then evaluated using the validation dataset. Insights from validation are used to tune various hyper-parameters, after which the training and validation steps are repeated. Once the designer is satisfied with the neural network performance on the validation dataset, the final stage of the evaluation occurs. In this stage, the network is evaluated on the hitherto withheld testing data. In theory, since the final test uses a previously unseen dataset (during training and validation), the approach should ensure that a model that does well on that test data must have learned to generalize well and is not overfitting (i.e., is not only performant but also sufficiently robust).
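To make the three-way separation concrete, here is a minimal sketch of the split itself in plain Python. The function name and the 60/20/20 fractions are illustrative choices, not taken from any particular study:

```python
import random

def three_way_split(samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then partition samples into train/validation/test subsets."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(len(samples) * test_frac)
    n_val = int(len(samples) * val_frac)
    test = [samples[i] for i in idx[:n_test]]
    val = [samples[i] for i in idx[n_test:n_test + n_val]]
    train = [samples[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = three_way_split(list(range(1000)))
print(len(train), len(val), len(test))  # 600 200 200
```

The three subsets are disjoint by construction; training touches only the first, hyper-parameter tuning only the second, and the final evaluation only the third.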

The problem with this methodology lies in the way it is implemented in a typical research environment. Specifically, due to difficulties accessing real physical systems (e.g., lack of easy access to representative military vehicles in representative environments in a military target tracking application), researchers often acquire training, validation, and testing data ahead of time. The algorithms are then developed iteratively and, in each iteration, evaluated on the acquired data by following the aforementioned three-way methodology. The process creates an unintended feedback loop from testing to algorithm development. This loop ultimately results in overfitting the developed algorithm to the testing data.

To explain the aforementioned effect, consider a researcher who gets to the final stage of the evaluation, where they test their trained and tuned neural network using the thus-far withheld testing dataset. Often, the initial results are not quite satisfactory, and the researcher goes back to the design table. With insights derived from the failure on the test dataset, they update their algorithm and repeat the three stages of design and evaluation: (i) train the neural network with the training dataset, (ii) tune hyper-parameters with the help of validation data, and finally (iii) test the resulting network using the test data. This loop is repeated until the researchers are satisfied with the testing results.

The problem with the above practice is that the insights carried in every iteration (when going back from unsatisfactory algorithm testing to algorithm re-design), in essence, cause the new design to overfit the specific test data used. When the datasets are relatively small, this overfitting can be non-trivial, potentially causing catastrophic failures when the trained, tuned, and (ostensibly) independently tested network is eventually used in the field.
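The effect is easy to demonstrate numerically. In the hypothetical simulation below, every “design iteration” is a model with no real predictive power (it makes fixed random guesses), yet selecting the iteration that scores best on a small, reused test set yields accuracy well above chance, while the very same model falls back to roughly chance-level on fresh “field” data:

```python
import random

random.seed(0)
N_TEST = 100        # size of the small, fixed test set
N_ITERATIONS = 200  # design iterations, each one "peeking" at the test set

# Binary ground-truth labels for the reused test set and for fresh field data.
test_labels = [random.randint(0, 1) for _ in range(N_TEST)]
field_labels = [random.randint(0, 1) for _ in range(N_TEST)]

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Each "design iteration" produces a model with no real signal: a fixed set of
# random guesses. We keep whichever iteration scores best on the reused test set.
best_model, best_test_acc = None, -1.0
for _ in range(N_ITERATIONS):
    candidate = [random.randint(0, 1) for _ in range(N_TEST)]
    acc = accuracy(candidate, test_labels)
    if acc > best_test_acc:
        best_model, best_test_acc = candidate, acc

print(f"accuracy on the reused test set: {best_test_acc:.2f}")  # well above 0.5
print(f"same model on fresh field data:  {accuracy(best_model, field_labels):.2f}")  # near 0.5
```

No individual iteration cheats; the inflation comes entirely from the selection pressure that the reused test set exerts across iterations, which is exactly the feedback loop described above.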

Unfortunately, solving the above challenge is non-trivial and is worthy of a community effort to improve the evaluation methodology and potentially the incentive structure (to better balance performance and robustness concerns). Ideally, to prevent overfitting of algorithms to test data, the final test dataset should be available only to testers, not to the developer team, and only binary success/failure feedback would be returned. This practice is often applied when research is transitioned to a higher level of classification, where it is tested in an environment not known to the original authors. However, such tests often fail without follow-up, and the prospective transition is simply put aside. Alternatively, once a dataset is used to test some iteration of the algorithm under development, it should be viewed as expired and not reused for testing future iterations. This approach emulates the reality of deployment; one cannot rewind time. Any sensor measurement time-series is experienced exactly once in the field and never repeated. Detection lessons learned from it can thus never be applied to the same time-series again (after the algorithm has been updated). Such an approach, however, would require significantly more testing data than what is commonly available in a research environment, away from the field.
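The expired-dataset policy could, in principle, be enforced mechanically by the evaluation harness itself. The sketch below (hypothetical class and method names, not drawn from the cited study) wraps a held-out test set so that it yields a score exactly once and refuses any later reuse:

```python
class ExpiringTestSet:
    """Wraps a held-out test set so it can be evaluated against at most once.

    After the first call to evaluate(), the data are considered expired and
    any further use raises an error, emulating the fact that a field
    deployment never replays the same sensor time-series.
    """

    def __init__(self, examples, labels):
        self._examples = examples
        self._labels = labels
        self._expired = False

    def evaluate(self, model_fn):
        if self._expired:
            raise RuntimeError("test set expired: already used for one evaluation")
        self._expired = True
        predictions = [model_fn(x) for x in self._examples]
        correct = sum(p == y for p, y in zip(predictions, self._labels))
        return correct / len(self._labels)


# Usage: the first evaluation succeeds; a second attempt is refused.
tests = ExpiringTestSet([0, 1, 2, 3], [0, 1, 0, 1])
acc = tests.evaluate(lambda x: x % 2)
try:
    tests.evaluate(lambda x: 0)
except RuntimeError as e:
    print("refused:", e)
```

Such a wrapper does not create more data, of course; it only makes the cost of each test-set consultation explicit, which is precisely why the policy demands far more testing data than research environments usually have.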

To force generalizability, the number of parameters of the learning algorithm should be sufficiently lower than the number of input data samples used for training. These samples must also offer representative coverage of the environmental and physical conditions encountered in practice. Traditional machine learning solutions, such as Random Forest and XGBoost, can get away with a reduced number of trainable parameters because they are designed to ingest input features that have already been carefully engineered based on a good understanding of the underlying domain.
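A quick back-of-the-envelope count makes the imbalance concrete. Even a modest fully connected network over, say, 1024 spectral features (a hypothetical architecture, not the one used in the cited study) carries far more parameters than a typical research dataset has labeled samples:

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical small classifier: 1024 input features -> two output classes.
params = mlp_param_count([1024, 256, 64, 2])
print(params)  # 278978 parameters
```

A research dataset with a few thousand labeled vehicle passes is one to two orders of magnitude smaller than this parameter count, which is why raw-input neural networks are so exposed to overfitting in this setting.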

Neural networks, in contrast, are meant to ingest raw sensor data and learn appropriate nonlinear features (most useful for the underlying inference task) on their own. Thus, they need more parameters to approximate the right nonlinear functions from layers of approximately piecewise linear kernels. Accordingly, they fundamentally need more training data. A mitigation might be to consider variants of neural networks that utilize nonlinear kernels (e.g., the recently proposed kernels inspired by Taylor series expansion) [4] in order to reduce the number of parameters that need to be learned to approximate common nonlinear functions found in nature and physics. In general, some promise might lie in exploiting recent advances that combine neural networks and symbolic knowledge from the underlying domain. Exploitation of symbolic knowledge may significantly reduce the size of the needed parameter space thus matching more closely the limited availability of training data.

Taking lessons from history, it may be instructive to remember that higher robustness (not higher performance) is why PID controllers have been the staple of the process control industry for many decades, despite their relative lack of sophistication. In practice, the inherent robustness of simple PID control to unexpected conditions in real-life deployment outweighed the performance advantages of other, more advanced techniques for several decades. We posit that in the emerging world of neural-network-based intelligent IoT/CPS applications, a similar cautionary tale applies. Yet, the prevalent dataset-based evaluation tilts outcomes towards higher-performing but more brittle models.

To be clear, we do not offer in this article a solution to the challenge we identify. Rather, we invite the research community to rethink the principles of dataset-driven evaluation of modern AI systems in IoT/CPS contexts, while remaining mindful of the realistic feasibility and data access constraints of typical research environments. If successful, the outcome may have a significant impact on the practical viability of much of the data-driven research that exploits AI/ML advances to further the IoT and CPS application space.


The author would like to thank Tianshi Wang, Denizhan Kara, Jinyang Li, and Shengzhong Liu for developing an experimental case study (complete with code and datasets) that illustrates the above pitfall [1]. Great thanks also go to Dr. Brian Jalaian for his insights on techniques for testing AI systems and his suggestions for pitfall mitigation. The experimentation reported in this article was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-20196, NSF CNS 20-38817, the IBM-Illinois Discovery Acceleration Institute, and the Boeing Company.


[1] Tianshi Wang, Denizhan Kara, Jinyang Li, Shengzhong Liu, Tarek Abdelzaher, and Brian Jalaian. “The Methodological Pitfall of Dataset-Driven Research on Deep Learning: An IoT Example.” In Proc. 2nd Workshop on the Internet of Things for Adversarial Environments (at MILCOM 2022), Rockville, MD, December 2022.

[2] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. “DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing.” In Proceedings of the 26th International Conference on World Wide Web, pp. 351-360. 2017.

[3] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794. 2016.

[4] Yanbing Mao, Lui Sha, Huajie Shao, Yuliang Gu, Qixin Wang, and Tarek Abdelzaher. “Phy-Taylor: Physics-Model-Based Deep Neural Networks.” arXiv preprint arXiv:2209.13511 (2022).

Authors

Prof. Tarek Abdelzaher (Ph.D., UMich, 1999) is a Sohaib and Sara Abbasi Professor of CS and Willett Faculty Scholar (UIUC), with over 300 refereed publications in Real-time Computing, Distributed Systems, Sensor Networks, and IoT. He served as Editor-in-Chief of J. Real-Time Systems for 20 years, an AE of IEEE TMC, IEEE TPDS, ACM ToSN, ACM TIoT, and ACM ToIT, among others, and chair of multiple top conferences in his field. Abdelzaher received the IEEE Outstanding Technical Achievement and Leadership Award in Real-time Systems (2012), a Xerox Research Award (2011), and several best paper awards. He is a fellow of IEEE and ACM.

Disclaimer

Any views or opinions represented in this blog are personal, belong solely to the blog post authors and do not represent those of ACM SIGBED or its parent organization, ACM.