Data
- 172 features
- 84 of Top 100 rated game on Steam
- Two years historical data on SteamDB
- About 40000 rows
- Spilt ratio of training data and testing data is 4 : 1
- We used 20 months data for training and 4 months data to evaluate our predictor.
Some Baselines
To demonstrate our model’s performance, here are some common baselines:
- Random Guess :
- Blind guessing
- Naive method :
- Use the average discount gap of a game as a threshold. If a instance have cross this threshold predict it as “Don’t buy”, otherwise “Buy it”.
Reasonable Metric
Treat “Caution” as “Don’t buy”, which means if a instance’s ground truth is “Don’t buy”, we predict “Caution” is a correct prediction.
Experiment Result
We use two popular classifiers : 1) SVM, 2) Random Forest in our experiment. We can see Random Forest outperformed others and achieved over 80% accuracy and F-score. It’s interesting to point out that one of our baselines “naive method” actually perform surprisedly well and result is just a little behind the SVM. Here, we concluded some comparison between SVM and Random Forest in this Steam dataset.
- Random Forest is much more robust
- Grid Search in Random Forest is much faster : About 20 times
- Random forest interpretation : Give us further information on feature selection