Data

  • 172 features
  • 84 of the Top 100 rated games on Steam
  • Two years of historical data from SteamDB
    • About 40,000 rows
  • Split ratio of training data to testing data is 4 : 1
    • We used 20 months of data for training and 4 months of data to evaluate our predictor.
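The chronological 4 : 1 split above can be sketched as follows. This is an illustrative sketch on synthetic month indices, not the actual dataset; the row count and month layout are assumptions taken from the bullets.

```python
import numpy as np

# Illustrative sketch of the chronological 4:1 split: 24 months of rows,
# the first 20 months for training, the last 4 for testing.
rng = np.random.default_rng(0)
n_rows = 40000
months = np.sort(rng.integers(0, 24, size=n_rows))  # month index per row

train_mask = months < 20   # first 20 months -> training set
test_mask = ~train_mask    # last 4 months  -> testing set
print(train_mask.sum(), test_mask.sum())
```

Splitting by time rather than at random avoids leaking future price history into the training set.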

Some Baselines

To demonstrate our model’s performance, here are some common baselines:

  1. Random guess:
    • Blind guessing
  2. Naive method:
    • Use the average discount gap of a game as a threshold. If an instance has crossed this threshold, predict “Don’t buy”; otherwise, “Buy it”.
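The naive baseline can be sketched as a one-line threshold rule. The function name and inputs here are hypothetical; the real features and per-game gap computation are assumptions based on the description above.

```python
# Hedged sketch of the naive baseline: a game's average gap (in days) between
# discounts serves as a threshold. Names and values are illustrative.
def naive_predict(days_since_discount, avg_gap):
    # If the time since the last discount has crossed the game's average
    # discount gap, a new sale is likely soon, so predict "Don't buy".
    return "Don't buy" if days_since_discount >= avg_gap else "Buy it"

preds = [naive_predict(d, 60) for d in (10, 59, 60, 90)]
print(preds)  # ['Buy it', 'Buy it', "Don't buy", "Don't buy"]
```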

Reasonable Metric

We treat “Caution” as “Don’t buy”: if an instance’s ground truth is “Don’t buy”, a prediction of “Caution” counts as correct.
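This relaxed scoring rule can be written out directly. The label strings are taken from the text; the example predictions are illustrative.

```python
# Relaxed metric: "Caution" counts as correct when the ground truth
# is "Don't buy"; all other cases require an exact match.
def is_correct(pred, truth):
    if truth == "Don't buy" and pred == "Caution":
        return True
    return pred == truth

preds  = ["Buy it", "Caution", "Don't buy", "Caution"]
truths = ["Buy it", "Don't buy", "Don't buy", "Buy it"]
acc = sum(is_correct(p, t) for p, t in zip(preds, truths)) / len(truths)
print(acc)  # 0.75 -- the last "Caution" vs "Buy it" is the only error
```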

Experiment Result

We use two popular classifiers in our experiment: 1) SVM and 2) Random Forest. Random Forest outperformed the others, achieving over 80% accuracy and F-score. It is interesting to point out that one of our baselines, the naive method, actually performed surprisingly well, with results only slightly behind SVM. Here we summarize the comparison between SVM and Random Forest on this Steam dataset:

  • Random Forest is much more robust
  • Grid search for Random Forest is much faster: about 20 times
  • Random Forest is interpretable: its feature importances give us further information for feature selection
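The comparison above can be sketched with scikit-learn. The hyperparameter grids, data, and fold count here are assumptions for illustration, not the settings used in the experiment.

```python
# Illustrative SVM vs. Random Forest grid-search comparison on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hypothetical grid for the forest; real grids may differ.
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [50, 100], "max_depth": [5, None]}, cv=3)
rf.fit(X, y)

# Hypothetical grid for the SVM; kernel SVM grid search tends to be slower.
svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3)
svm.fit(X, y)

# The fitted forest exposes feature importances, which support the
# feature-selection point in the last bullet.
top = rf.best_estimator_.feature_importances_.argsort()[::-1][:5]
print(rf.best_score_, svm.best_score_, top)
```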