Professor Uichin Lee’s research team from the School of Computing at KAIST has developed a reproducible machine learning pipeline for mobile affective computing, in which data from mobile and wearable devices are used to predict affect, such as self-reported stress levels. This work has been published in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, one of the most renowned journals in ubiquitous computing, and was also invited for presentation at the ACM UbiComp Conference 2024 in Melbourne, Australia.
Reproducibility has long been an issue in machine learning, and replicating the results of existing work is especially difficult in mobile affective computing for several reasons. First, open datasets for in-the-wild mobile affective computing studies are scarce. Second, ground-truth labels are usually collected via the Experience Sampling Method (ESM), in which users self-report their current affective state on a mobile phone, so the labels carry inherent uncertainty. Third, different studies may use different scales and survey questions to acquire ground-truth labels, which makes it difficult to replicate the results of other studies.
Overall, there are four main types of reproducibility issues in this field. The categorization follows Albertoni et al. [1]. As shown in Figure 1, the types are distinguished by whether the same data and the same code/analysis are used. When the same data, code, and analysis are used, it is referred to as reproducibility. When different data are used with the same code and analysis, this is known as generalizability. If the same data are used but with different code and analysis, it is termed independent reproducibility. Finally, when both different data and different code and analysis are used, it is called replicability.
Previous studies [3, 4, 5] mainly focus on generalizability or replicability issues in this field. However, even on the same dataset, many factors in the data analysis pipeline may influence reproducibility. Consider one such factor, label binarization: how the user survey responses are converted into binary ground-truth labels. Given Likert-scale responses ranging from 1 to 5, the mean value of all labels across all users can serve as the binarization threshold. Alternatively, the theoretical mid-value of 3 (the mid-point or neutral mark), or each user's own mean value, can be used to binarize that user's labels. Each of these approaches is likely to affect the performance of a classification model differently.
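As an illustration, the short sketch below applies the three binarization alternatives to hypothetical Likert-scale responses; the column names and data are made up for illustration and do not come from the paper.

```python
import pandas as pd

# Hypothetical ESM self-reports on a 1-5 Likert stress scale.
df = pd.DataFrame({
    "user_id": ["A", "A", "A", "B", "B", "B"],
    "stress":  [2, 4, 5, 1, 2, 3],
})

# (a) Global mean: one threshold shared by all users.
global_thr = df["stress"].mean()
df["label_global_mean"] = (df["stress"] > global_thr).astype(int)

# (b) Theoretical mid-point of the 1-5 scale (neutral mark = 3).
df["label_midpoint"] = (df["stress"] > 3).astype(int)

# (c) Per-user mean: each user gets their own threshold.
user_thr = df.groupby("user_id")["stress"].transform("mean")
df["label_user_mean"] = (df["stress"] > user_thr).astype(int)

print(df)
```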
As a first attempt to address independent reproducibility, our work focuses on stress prediction with in-the-wild mobile datasets. Testing the impact of each pipeline factor on model performance while using the same dataset (independent reproducibility) requires a common data analysis pipeline. In this work, we therefore synthesized the existing literature to derive the steps and factors of a common pipeline for mobile stress prediction.
As illustrated in Figure 2, the common pipeline comprises eight primary steps: preprocessing, feature extraction, feature preparation, feature selection, data splitting, oversampling/undersampling, model training, and model evaluation. Each step involves various factors and their respective alternatives. For instance, in Step 1 (preprocessing), one factor is how invalid ESM samples are removed, with two alternatives: (a.1) removing expired ESM samples and (a.2) removing neutral ESM samples.
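One way to keep such choices explicit and reproducible is to record every factor and its selected alternative in a configuration object, as in the minimal sketch below; the step and factor names follow Figure 2, while the specific option strings (e.g. `remove_expired`) are hypothetical labels rather than the identifiers used in the paper.

```python
from dataclasses import dataclass

# A minimal sketch: each field is one pipeline factor and its chosen
# alternative. The option strings are illustrative placeholders.
@dataclass
class PipelineConfig:
    invalid_esm_handling: str = "remove_expired"    # Step 1: preprocessing
    label_binarization: str = "per_user_mean"       # Step 3: feature preparation
    split_strategy: str = "leave_one_subject_out"   # Step 5: data splitting
    resampling: str = "oversample_minority"         # Step 6: over/undersampling
    model: str = "random_forest"                    # Step 7: model training

# Varying a single factor while holding the others fixed isolates its effect.
config = PipelineConfig(label_binarization="midpoint")
print(config)
```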
Two public datasets [6, 7] were selected to enhance the robustness of our findings. For each dataset, we evaluated the impact of various pipeline factors on model performance. Through these experiments, our goal is to gain insights into the role of each pipeline factor and provide recommendations for their selection. The results can be found in Table 1.
Among the experimental results, one particularly interesting finding concerns the efficacy of labeled data from target users. Our results showed that partial personalization, which includes part of the target user's data in the training set, improved model performance. This indicates that labeled data from the target user is important for model performance.
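Below is a minimal sketch of what a partial-personalization split can look like, assuming time-ordered data and an assumed fraction of the target user's earliest reports moved into the training set; it is an illustration, not the exact protocol of the paper.

```python
import pandas as pd

def partial_personalization_split(df, target_user, frac=0.2):
    """Train on all other users plus the first `frac` of the target user's
    time-ordered samples; test on the target user's remaining samples."""
    df = df.sort_values("timestamp")
    target = df[df["user_id"] == target_user]
    others = df[df["user_id"] != target_user]

    n_personal = int(len(target) * frac)
    train = pd.concat([others, target.iloc[:n_personal]])
    test = target.iloc[n_personal:]
    return train, test
```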
Given that the performance of mobile stress prediction in an in-the-wild setting remains quite limited, we also tried to improve model performance using combinations of pipeline factors. This demonstrates that the reproducible pipeline can help researchers not only reproduce the results of other studies but also improve model performance.
Based on the derived reproducible pipeline, we tested possible combinations of different pipeline factors. The best model performance was achieved by combining mobile sensor data and the last self-reported label value as input features. This is consistent with the partial personalization result in the independent reproducibility experiments. Together, the efficacy of including the last label and of partial personalization demonstrates that labeled data from target users are critical for improving model performance.
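A simple way to realize the "last self-reported label" feature is a per-user lag of the ground-truth label, sketched below with hypothetical column names; how the first report of each user is handled (here filled with a neutral value) is an assumption for illustration.

```python
import pandas as pd

def add_last_label_feature(df):
    """Append each user's previous self-reported stress label as a feature.
    Assumes one row per ESM report with `user_id`, `timestamp`, `stress_label`."""
    df = df.sort_values(["user_id", "timestamp"]).copy()
    df["last_label"] = df.groupby("user_id")["stress_label"].shift(1)
    # The first report of a user has no previous label; filling with a
    # neutral value (or dropping the row) is one possible choice.
    df["last_label"] = df["last_label"].fillna(0)
    return df
```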
Another interesting observation is that an overfitting issue exists in generalized mobile stress prediction models, indicating that personal differences could be one of the main reasons for the low model performance.
As shown in Figure 3 below, we analyzed AUC-ROC results for all splits in the Leave-One-Subject-Out (LOSO) approach for both training and test sets. The figure reveals a marked performance discrepancy between training and test sets, coupled with high variance across folds, both classic indicators of overfitting.
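This per-fold comparison can be produced, for example, with scikit-learn's LeaveOneGroupOut; the sketch below uses a random forest as a placeholder model and assumes binary labels, which may differ from the exact setup in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_auc_report(X, y, groups):
    """Return per-fold (train AUC, test AUC) pairs under Leave-One-Subject-Out."""
    results = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        if len(np.unique(y[test_idx])) < 2:
            continue  # AUC is undefined if the held-out user has only one class
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        train_auc = roc_auc_score(y[train_idx], clf.predict_proba(X[train_idx])[:, 1])
        test_auc = roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1])
        results.append((train_auc, test_auc))
    # A large train-test gap together with high variance across folds is the
    # overfitting pattern discussed above.
    return np.array(results)
```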
As mentioned earlier, both partial personalization (where part of the target user's data is included in the training set) and using the last label as a feature contribute to improved model performance, underscoring the importance of labeled data from target users. Additionally, the overfitting issue arises only in user-independent cross-validation settings (training on user A and testing on user B), suggesting that personal differences are likely the main cause of overfitting. These findings indicate that both the need for labeled data from target users and the overfitting issues are closely tied to personal differences and a lack of model generalizability.
Some researchers have attempted to address the personal difference issue through model personalization. Recent studies have explored approaches such as domain generalization and domain adaptation to enhance model generalizability. Domain generalization focuses on improving model generalizability without using data from target users, while domain adaptation involves tailoring the model using data from target users to enhance its performance. However, there is still a lack of systematic and in-depth analysis of personal differences in this domain. Greater efforts are needed to uncover the underlying causes of personal differences in mobile affective computing and to quantify these differences. Without a thorough understanding of these personal differences, it will be challenging to develop effective methods to address them.
Panyu Zhang is currently a doctoral student at KAIST, working under the mentorship of Prof. Uichin Lee in the Interactive Computing Lab. His research primarily focuses on the reproducibility and generalizability of mobile affective computing.
[1] R. Albertoni, S. Colantonio, P. Skrzypczyński, and J. Stefanowski. 2023. Reproducibility of Machine Learning: Terminology, Recommendations and Open Issues. arXiv preprint arXiv:2302.12691 (2023). https://arxiv.org/abs/2302.12691
[2] Panyu Zhang, Gyuwon Jung, Jumabek Alikhanov, Uzair Ahmed, and Uichin Lee. 2024. A Reproducible Stress Prediction Pipeline with Mobile Sensor Data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8, 3, Article 143 (September 2024), 35 pages. https://doi.org/10.1145/3678578
[3] Karim Assi, Lakmal Meegahapola, William Droz, Peter Kun, Amalia De Götzen, Miriam Bidoglia, Sally Stares, George Gaskell, Altangerel Chagnaa, Amarsanaa Ganbold, Tsolmon Zundui, Carlo Caprini, Daniele Miorandi, José Luis Zarza, Alethia Hume, Luca Cernuzzi, Ivano Bison, Marcelo Dario Rodas Britez, Matteo Busso, Ronald Chenu-Abente, Fausto Giunchiglia, and Daniel Gatica-Perez. 2023. Complex Daily Activities, Country-Level Diversity, and Smartphone Sensing: A Study in Denmark, Italy, Mongolia, Paraguay, and UK. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 506, 1–23. https://doi.org/10.1145/3544548.3581190
[4] Mohammed Khwaja, Sumer S. Vaid, Sara Zannone, Gabriella M. Harari, A. Aldo Faisal, and Aleksandar Matic. 2019. Modeling Personality vs. Modeling Personalidad: In-the-wild Mobile Data Analysis in Five Countries Suggests Cultural Impact on Personality Models. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 3, 3, Article 88 (September 2019), 24 pages. https://doi.org/10.1145/3351246
[5] Lakmal Meegahapola et al. 2023. Generalization and Personalization of Mobile Sensing-Based Mood Inference Models: An Analysis of College Students in Eight Countries. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. https://dl.acm.org/doi/abs/10.1145/3569483
[6] S. Kang, W. Choi, C. Y. Park, N. Cha, A. Kim, A. H. Khandoker, L. Hadjileontiadis, H. Kim, Y. Jeong, and U. Lee. 2023. K-EmoPhone: A Mobile and Wearable Dataset with In-Situ Emotion, Stress, and Attention Labels. Scientific Data 10 (2023). https://doi.org/10.1038/s41597-023-01234-6
[7] Gyuwon Jung, Sangjun Park, and Uichin Lee. 2024. DeepStress: Supporting Stressful Context Sensemaking in Personal Informatics Systems Using a Quasi-experimental Approach. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 1000, 1–18. https://doi.org/10.1145/3613904.3642766