Prediction Task: Utilizing CPU and Application Statistics to Predict a User’s Persona
GitHub:
Homepage:
Vince Wong | Jonathan Zhang | Keshan Chen
vlw003@ucsd.edu | jsz002@ucsd.edu | kec180@ucsd.edu
Abstract
- During the first half of this project, we learned about Intel’s telemetry framework. The framework allows remote data collection from devices running Windows operating systems. Two important components of the telemetry framework are the Input Library (IL) and the Analyzer Task Library (ATL). The IL exposes metrics from a device, and the ATL generates on-device statistics from the data collected by the IL. In the second half of the project, we used data that Intel had already collected with this telemetry framework to build a classification model.
Project Statement
- Our goal with the model is to predict the “persona” of a user using their computer’s specifications, CPU utilization, CPU temperature, and time spent on certain types of applications.
- User personas are provided by Intel, which classified users as casual web users, gamers, communication users, and so on.
CCS Concepts
- Computing → Machine learning
- Mathematics of Computing → Probability and statistics; Mathematical analysis
Keywords and Packages
- Telemetry, gradient boosting classification, extra trees classification, AdaBoost classification
- Python, Numpy, Pandas, Scikit-Learn
Introduction
- We used four datasets provided by the Intel Corporation team to answer this question: hw_metric_histo.csv, system_sysinfo_unique_normalized.csv, frgnd_backgrnd_apps.csv, and ucsd_apps_exe_class.csv. All four datasets were pre-collected by Intel with its System Usage Report (SUR) collector, which uses their telemetry framework. hw_metric_histo.csv contains information about a laptop’s average CPU utilization and temperature.
- system_sysinfo_unique_normalized.csv contains data on each device’s specifications (CPU, GPU, number of cores, etc.) and its predetermined persona provided by Intel (gamer, casual user, office, entertainment, etc.). frgnd_backgrnd_apps.csv provides information on the time each device spent on certain applications, and ucsd_apps_exe_class.csv contains the application type classification of each .exe file.
- To make our predictions, we used multiple scikit-learn classification models. We trained a total of seven different classification models, but ultimately chose to analyze and delve deeper into our radial basis function SVM, AdaBoost, and gradient boosting classification models based on their performance and some interesting shortcomings.
Methodologies — Initial Model Testing
- To begin model selection, we ran a for loop that trained and tested multiple scikit-learn classification models: decision trees, extra trees, random forest, AdaBoost, three nearest neighbors, radial basis function SVM, and gradient boosting classifiers. The data was split into a training set (80%) and a test set (20%) using scikit-learn’s train_test_split() function. Inside the for loop, each model was trained on the training set, evaluated on the test set, and scored using the model’s .score() method.
- The top performing classification models were decision tree, random forest, radial basis function SVM, AdaBoost, and gradient boosting classification with accuracy scores of 67.67%, 66.87%, 66.83%, 65.29% and 64.10%, respectively.
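The loop described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the Intel data is replaced by a synthetic table from make_classification, and the variable names, random seeds, and default hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged feature table (assumption: the real
# Intel datasets are not available here).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The seven model families named in the text, with default settings.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "extra trees": ExtraTreesClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "3 nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "RBF SVM": SVC(kernel="rbf"),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model, then score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy

# Print the models from best to worst test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```

For classifiers, .score() returns plain accuracy, so a model that mostly predicts the majority class can still rank highly in this comparison.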
Methodologies — Model Selection & Class Imbalance Mitigation
- To fix class imbalance, we added scikit-learn’s class_weight = ‘balanced’ parameter to the decision tree, extra trees, random forest, and radial basis function SVM. Adding this parameter had large effects on some models. Accounting for class imbalance in the decision tree and random forest model dropped the models’ accuracy from 65% to 40% and 64% to 36%, respectively.
- The accuracies of the extra trees and radial basis function SVM models did not change much after adding the class_weight = ‘balanced’ parameter. Initially, the accuracies of the extra trees and radial basis function SVM models were 65% and 64%, respectively; after accounting for class imbalance, they were 67% and 65%, respectively. The updated accuracy scores are shown below:
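As a side illustration of the class_weight setting discussed in this section (not the project's results), the sketch below shows what 'balanced' does and how it is passed to two of the models. The data is synthetic and deliberately imbalanced; the class proportions, seeds, and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced three-class data (assumption: roughly 70/20/10).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so rarer classes receive larger weights.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Fit each model with and without class weighting and compare accuracy.
results = {}
for cw in (None, "balanced"):
    models = {
        "decision tree": DecisionTreeClassifier(class_weight=cw, random_state=0),
        "RBF SVM": SVC(kernel="rbf", class_weight=cw),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[(name, cw)] = model.score(X_test, y_test)

for (name, cw), acc in results.items():
    print(f"{name} (class_weight={cw}): {acc:.2%}")
```

Reweighting often lowers overall accuracy because errors on rare classes are penalized more heavily, pushing the model away from always predicting the majority class. In scikit-learn, AdaBoostClassifier, GradientBoostingClassifier, and KNeighborsClassifier do not take a class_weight argument, which is consistent with the four models listed above.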
Results — AdaBoost Classifier
- The AdaBoost classifier received an overall accuracy of 66%. The classifier predicted casual web users, gamers, office and productivity users, and unknowns with 91%, 24%, 0%, and 78% accuracy, respectively.
- The AdaBoost classifier is still affected by class imbalance: its relatively high overall accuracy comes largely from its strong bias towards classifying most users as casual web users. The confusion matrix is provided for a more in-depth visualization of its performance:
Results — Radial Basis Function SVM Classifier
- The SVM classifier received an overall accuracy of 64%. The classifier predicted casual web users, gamers, office and productivity users, and unknowns with 99%, 0%, 0%, and 0% accuracy, respectively.
- Even though the class_weight parameter was set to ‘balanced’, the model had a very strong bias towards casual web users. The confusion matrix is provided for a more in-depth visualization of its performance:
Results — Gradient Boosting Classifier
- The gradient boosting classifier received an overall accuracy of 63%. The classifier predicted casual web users, gamers, office and productivity users, and unknowns with 87%, 26%, 3%, and 78% accuracy, respectively.
- The gradient boosting classifier gave the most balanced performance across classes, despite its slightly lower overall accuracy. The model lost some accuracy when predicting casual web users, but it matched or outperformed the other two models when predicting gamers, office and productivity users, and unknowns.
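The per-class figures in these results sections can be read off a confusion matrix. The sketch below assumes that "accuracy" for a class means the fraction of users truly in that class who were predicted correctly (per-class recall); the source does not state this explicitly. The labels and predictions are made up for illustration and are not the project's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["casual web user", "gamer", "office/productivity", "unknown"]

# Hypothetical true and predicted personas for 12 users.
y_true = (
    ["casual web user"] * 5
    + ["gamer"] * 3
    + ["office/productivity"] * 2
    + ["unknown"] * 2
)
y_pred = [
    "casual web user", "casual web user", "casual web user",
    "casual web user", "unknown",                            # true: casual
    "gamer", "casual web user", "casual web user",           # true: gamer
    "casual web user", "casual web user",                    # true: office
    "unknown", "casual web user",                            # true: unknown
]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Correct predictions per class divided by the number of true members.
per_class = cm.diagonal() / cm.sum(axis=1)
overall = cm.trace() / cm.sum()

for label, acc in zip(labels, per_class):
    print(f"{label}: {acc:.0%}")
print(f"overall: {overall:.0%}")
```

In this toy example the overall accuracy is driven by the largest class, which mirrors how a model biased towards casual web users can post a competitive overall score while missing the smaller classes.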
Results — Which features are relevant/important?
- Because our gradient boosting classifier had the best overall performance, we decided to examine the most important features for the model.
- Using the fitted model’s feature_importances_ attribute from scikit-learn, we looked at the model’s five most important features.
- NVIDIA’s GeForce GTX 1050 graphics cards, average CPU utilization, average CPU temperature, Iris 540 graphics cards, and AMD Radeon R7 450 graphics were our most important features:
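For readers who want to produce this kind of ranking themselves, a minimal sketch follows. It runs on synthetic data, so the feature names are placeholders rather than the features listed above, and the model settings are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder feature names (hypothetical; the real columns are not shown).
feature_names = [f"feature_{i}" for i in range(8)]

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is an attribute of the fitted model: one
# impurity-based importance per input column, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)
```

These impurity-based importances describe how much the fitted trees relied on each feature, not whether a feature causes a given persona.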
Future Discussion and Conclusion
- Our goal was to create a classification model to predict a user’s persona based on their device specifications, CPU utilization, CPU temperature, and their time spent on different types of applications.
- We trained a total of seven models through the scikit-learn package: decision trees, extra trees, random forest, AdaBoost, three nearest neighbors, radial basis function SVM, and gradient boosting classifiers.
- After training and testing all of the models during our initial run, we realized that our gradient boosting classification model performed the best as it was able to predict casual web users and unknown users with 87% and 78% accuracy, respectively.
- It predicted gamers and office and productivity users with 26% and 3% accuracy, respectively. Although the accuracy for these two categories was low, the model matched or outperformed our AdaBoost and radial basis function SVM models on them.
- We found that our best model was the gradient boosting classifier, which can predict a user’s persona with 64% accuracy.
Acknowledgements
- We would like to thank Professor Aaron Fraenkel, Balaji Shankar Balachandran, and Farouk Mokhtar from the UC San Diego Halicioğlu Data Science Institute for their help throughout the quarter. We would also like to thank our mentors Jamel Tayeb, Sruti Sahani, Chansik Im, Praveen Polasam, and Julien Sebot from the Intel Corporation for their guidance and feedback during our project.