Introduction

Review bombing is a phenomenon in which a large influx of negative reviews is given to a product after a negative public-relations event. It occurs most often in the video game industry and is characterized by reviews that do not focus on the game proper, but on secondary aspects such as price, DRM (copy protection), a developer's other games, or public-relations statements. A demonstrative example is the mass influx of negative reviews given to Doom Eternal after its creators updated the game to include new DRM software. Previous case studies of specific review bombing events have shown little impact on sales; however, since many games rely on micro-transactions and in-game stores to generate revenue, it may be more important to ask whether review bombing events affect the number of players actively playing a game. Perhaps review bombing is a way of voicing concern more than dissatisfaction. This is what the present study investigates.

Using data from Steam itself and SteamDB, I gather time series data on 200 popular Steam games. Fitting ARIMA models to player counts from before each review bombing event, I investigate whether there is evidence of a decrease in average weekly concurrent players after review bombing events in the short, medium, and long term. I find strong evidence of a short-term effect, but weaker evidence for medium- and long-term effects.

Sampling Reasoning

The main reason I only consider very popular Steam games is that most Steam games are not continuously played; many fail to maintain a strong concurrent player count three months after launch. Because such games receive so few reviews, even a small influx of negative reviews in one week can register as a "review bombing event." A large player base prevents this.
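A toy simulation illustrates the point (this is my own sketch, not the study's data; the review rates and the 90% approval figure are made up). With only a handful of reviews per week, the weekly percent-recommended is far noisier than for a popular game, so a fixed standard-deviation threshold gets crossed by ordinary sampling noise alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def weekly_percent_recommended(reviews_per_week, weeks=52, p_positive=0.9):
    # simulate a year of weekly review scores for a game whose "true"
    # approval rate is p_positive
    positives = rng.binomial(reviews_per_week, p_positive, size=weeks)
    return positives / reviews_per_week

niche = weekly_percent_recommended(5)     # 5 reviews per week
popular = weekly_percent_recommended(500) # 500 reviews per week

# the niche game's weekly score fluctuates roughly an order of magnitude more
print(niche.std(), popular.std())
```

The niche game's weekly score swings wildly while the popular game's barely moves, which is why a "2 standard deviations below the mean" rule is only meaningful for high-volume games.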

Data Collection Process

Reviews were gathered using woctezuma's steamreviews package to mass-download .json files containing review data. Data on concurrent player counts was obtained from SteamDB, for which I had to obtain special permission from the site owner to scrape their website. As such, SteamDB have asked me not to share instructions on how to scrape data from their site. Given this, and the fact that the steamreviews package is rather self-explanatory from its documentation, I opt not to include my raw data-gathering code. Data collection took about three days.

Errors in Data Collection

The steamreviews package doesn't always work perfectly: it can fail to retrieve an acceptable number of reviews (sometimes only about 10% of all reviews). I am not entirely sure how to fix this, so in the end some games had to be discarded due to repeated issues of this kind. After filtering out games that exhibit this error and games that have not been review bombed, about 70 games remained in my data set.

I created a Game object class to streamline data processing; the code for the class is below:

import datetime as dt
import matplotlib.pyplot as plt
import json
import pandas as pd
import numpy as np
import seaborn as sns
size=(30,10)
sns.set_theme()

# I realized too late that it's Concurrent, not Co-current, oops
def make_steamchart_df(appid):
    # yeah this variable probably shouldn't be called df, but whatever
    df = json.load(open('steamcharts/{0}_players.json'.format(appid),'r'))
    start = dt.datetime.fromtimestamp(df['data']['start'])
    d = {start: int(df['data']['values'][0])}
    step = dt.timedelta(seconds=df['data']['step'])
    for i in range(1,len(df['data']['values'])):
        # go one step
        start += step
        if df['data']['values'][i] is not None:
            d[start] = int(df['data']['values'][i])
        else:
            d[start] = np.nan
    return pd.DataFrame(d,index=['Co-current Players']).T

def make_review_df(appid,use_ijson=False):
    def make_review_dict(appid):
        def timestamp_to_dt(ts):
            return dt.datetime.fromtimestamp(ts)
        j = json.load(open('reviews/review_{0}.json'.format(appid),'r'))
        d = {}
        # timestamp in tuple's first slot, recommended in second
        review_ids = list(j['reviews'].keys())
        review_ids.reverse() # earliest reviews should appear first
        for review_id in review_ids:
            d[review_id] = (timestamp_to_dt(j['reviews'][review_id]['timestamp_created']),
                           int(j['reviews'][review_id]['voted_up']))
        return d
    df = pd.DataFrame(make_review_dict(appid),index=['Review Timestamp','Recommended?']).T
    # booleans will appear as objects, convert to int
    df['Recommended?'] = df['Recommended?'].astype(str).astype(int)
    # every row represents exactly one review
    df['# of Reviews'] = 1
    df = df.set_index('Review Timestamp')
    return df

def make_full_df(appid,resample_format='W-MON'):
    aggfuns = {'Recommended?': np.sum,'# of Reviews': np.sum,
               'Co-current Players': np.mean}
    df = pd.concat([make_review_df(appid),make_steamchart_df(appid)],
                   axis=0,join='outer').resample(resample_format).agg(aggfuns)
    # only get rows after reviews are recorded
    for i in range(len(df)):
        if df.iloc[i,0] > 0:
            first_nz = i
            break
    df = df.iloc[first_nz:,:]
    df['Percent Recommended'] = df['Recommended?']/df['# of Reviews']
    return df
class Game():
    games = []
    def __init__(self, appid):
        self.appid = appid #int
        self.name = id_to_game_dict[appid] #str
        self.data = make_full_df(appid) #DataFrame
        self.__class__.games.append(self)
    def has_review_bomb(self,stds_away=2):
        series = self.data['Percent Recommended']
        threshold = series.mean()-series.std()*stds_away
        for value in series:
            if value < threshold:
                return True
        return False
    def get_review_bomb_index(self,stds_away=2):
        if not self.has_review_bomb(stds_away=stds_away):
            return None
        else:
            series = self.data['Percent Recommended']
            threshold = series.mean()-series.std()*stds_away
            for value, n in zip(series,range(len(series))):
                if value < threshold:
                    yield n
    def plot_autocorr(self,col):
        pd.plotting.autocorrelation_plot(self.data[col])
        plt.show()

I have already created Game objects for the top 200 Steam games; they can be loaded from the obj folder:

import pickle
import os

game_objects = []
spellcheck = {'Co-current Players': 'Concurrent Players'}

for appid in os.listdir('obj/'):
    exec('{0} = pickle.load(open(\'obj/{0}\',\'rb\'))'.format(appid))
    exec('{0}.data = {0}.data.rename(mapper=spellcheck,axis=1)'.format(appid))
    exec('game_objects.append({0})'.format(appid)) 
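For what it's worth, the same loading can be done without exec; a minimal sketch, assuming the same obj/ folder of pickled Game objects (the function name is my own):

```python
import os
import pickle

SPELLCHECK = {'Co-current Players': 'Concurrent Players'}

def load_game_objects(folder='obj/'):
    # load every pickled Game object in the folder and fix the
    # column-name typo, returning a dict keyed by filename (the appid)
    games = {}
    for fname in sorted(os.listdir(folder)):
        with open(os.path.join(folder, fname), 'rb') as f:
            game = pickle.load(f)
        game.data = game.data.rename(mapper=SPELLCHECK, axis=1)
        games[fname] = game
    return games
```

A dict keyed by appid avoids polluting the global namespace with one variable per game, at the cost of writing `games['game440']` instead of `game440`.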

We can see a review bomb event for Team Fortress 2 in mid-2016 in the following chart, corresponding to the release of the controversial "Meet Your Match" update, in which many community-requested features were ignored by the development team.

game440.data.loc[:,['Concurrent Players',
                    'Percent Recommended']].plot(figsize=(30,20),
                                                 subplots=True,
                                                 title='Review and Player Data for '+game440.name)
plt.show()

[figure: Review and Player Data for Team Fortress 2]

We can also see this in the cumulative percentage recommended for the game:

def show_cum_recommended(game,plot=True,retur=True):
    cum_recommended = game.data['Recommended?'].cumsum()/game.data['# of Reviews'].cumsum()
    if plot:
        cum_recommended.plot(title='Overall Percentage Recommended for '+game.name,figsize=size)
        plt.show()
    if retur:
        return cum_recommended

show_cum_recommended(game440,retur=False)

[figure: Overall Percentage Recommended for Team Fortress 2]

Mining Games with Review Bomb Events

To filter out games that do not have review bomb events, I look for games whose weekly review score drops at least 3 standard deviations below its mean at some point. The following code automatically filters out games without review bomb events.

def last_nan(series,look=4): #finds last time "look"-many nan values appear in a row in a series
    l = list(np.isnan(list(series))) #converts array to booleans showing if value is nan
    l.reverse()
    i = 0
    for n in range(len(l)):
        try:
            if all([l[n+m] for m in range(look)]): # if they are all nan
                i += 1
                return i
            i += 1
        except IndexError:
            i += 1
    return None


def diff_list(l): # differences between list indexes
    if len(l) < 2:
        return None
    diffs = []
    for i in range(len(l)-1):
        diffs.append(l[i+1]-l[i])
    return diffs

def get_review_bomb_index(series,stds_away=3):
    threshold = series.mean()-series.std()*stds_away
    for value, n in zip(series,range(len(series))):
        if value < threshold:
            yield n

def find_review_bomb_start(game,look=4):   
    series = game.data['Concurrent Players']
    valid_passed = last_nan(series,look=look)
    if valid_passed is not None:
        series = series[valid_passed:].fillna(series.mean())
    reviews = game.data['Percent Recommended'][valid_passed:]
    indexes = list(get_review_bomb_index(reviews))
    # ARIMA models require at least 50 observations to be valid
    if indexes is None or len(series) < 50 or len(indexes) == 0:
        return None
    # keep only the first index of each consecutive run of flagged weeks,
    # and drop events occurring before 50 observations are available
    # (the original remove-while-iterating loop mutated the list it was
    # iterating over, since starts aliased indexes)
    starts = []
    prev = None
    for idx in indexes:
        if idx >= 50 and (prev is None or idx != prev + 1):
            starts.append(idx)
        prev = idx
    # append final numeric index to list to help with splitting
    starts.append(len(game.data)-1)
    if len(starts) > 1:
        diff = diff_list(starts)
        # merge bomb events until there are at least 50 obs between the events
        while any([delta < 50 for delta in diff]):
            for i in range(len(diff)):
                # remove one at a time
                if diff[i] < 50:
                    starts.remove(starts[i+1])
                    break
            # recalculate differences
            diff = diff_list(starts)
            if diff is None:
                return starts
        return starts
    else:
        return None

bombed_games = [game for game in game_objects if find_review_bomb_start(game,look=2) is not None]
len(bombed_games)
81

Statistical Testing

ARIMA Models

Autoregressive Integrated Moving Average (ARIMA) models are a class of statistical models that can be used to forecast time series data. In layman's terms, ARIMA models state that time-labeled data can be represented as a weighted combination of previous data points plus random noise. Specifically, the second observation of a time-labeled series can be expressed as the first observation plus some white noise; every observation after that is then a combination of the previous several values and the white-noise components of the previous several values.
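As a toy illustration of the autoregressive part (my own sketch, not the study's code), an AR(1) process, the simplest autoregressive case, builds each observation from a fraction φ of the previous one plus white noise, and that fraction reappears as the series' lag-1 autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_ar1(phi, n=2000, sigma=1.0):
    # x_t = phi * x_{t-1} + eps_t, with eps_t ~ N(0, sigma^2)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

series = simulate_ar1(0.8)
# correlation between the series and itself shifted by one step
lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
print(lag1)  # close to phi = 0.8 for a long series
```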

In technical terms, ARIMA models treat time series data as a sequence of random variables $X_1, X_2, \dots, X_t$. Let $L$ denote the lag operator, so that $L X_t = X_{t-1}$, $L^2 X_t = X_{t-2}$, and so on. All ARIMA models can be expressed in the form

$$\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$$

where

$$1 - \sum_{i=1}^{p'} \alpha_i L^i = \left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d$$

for some $d \geq 0$. The "integrated" factor refers to differencing between time steps. This sometimes must be done to maintain model validity, as ARIMA models require a constant (technically, "stationary") mean. Luckily, such cases are still expressible in this general form; one need only shift $X_t$ by some factor of $t$. Here $d$ is explicitly given as the multiplicity of the unit root of the polynomial $1 - \sum_{i=1}^{p'} \alpha_i L^i$.
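A quick toy check of the "integrated" idea: a linearly trending series has no stationary mean, but one round of differencing (d = 1) removes the trend entirely:

```python
import numpy as np

t = np.arange(100, dtype=float)
series = 5.0 + 3.0 * t      # deterministic linear trend: the mean drifts upward
diffed = np.diff(series)    # first difference, i.e. (1 - L) applied once

print(diffed[:5])  # every entry is exactly 3.0: the differenced mean is constant
```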

ARIMA models can be generated via the AutoARIMA function in the pmdarima package, which fits the best possible ARIMA model to the data based on log-likelihood. The package can then make predictions based on the expected value of the ARIMA model, $E[X_t]$, trained on data from the pre-review-bomb period, to generate a time series estimate of the later concurrent player counts had the review bomb not occurred. The mean percentage error between this predicted time series and the true time series serves as the metric for the strength of review bombing events. Mean percentage error is used to account for differences in popularity among the games in the data set.
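The signed mean percentage error can be sketched as follows (the player counts here are hypothetical, not values from the data set):

```python
import numpy as np

actual = np.array([1000.0, 800.0, 900.0, 850.0, 700.0])  # observed weekly players
predicted = np.full(5, 1000.0)                           # ARIMA "no bomb" forecast

signed_pct_error = (actual - predicted) / actual
mpe = signed_pct_error.mean() * 100
print(round(mpe, 2))  # -19.32: players average ~19% below the counterfactual
```

Because the error is signed and divided by the actual count, a 100-player loss counts for much more in a small game than in a huge one, which is exactly the popularity normalization described above.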

# declaring methods to generate ARIMA models
from pmdarima.arima import AutoARIMA
from math import sqrt, floor

def split_players_on_bomb(game):
    #split the players time series into series before and after the review bomb effect
    starts = find_review_bomb_start(game)
    ts = game.data['Concurrent Players']
    if starts is None or len(starts) == 0:
        raise ValueError('Cannot find valid review bomb event for {0} (appid: {1})'.format(game.name,game.appid))
    else:
        if len(starts) == 1:
            s = starts[0]
            yield ts[:s-1]
            yield ts[s+1:]
        else:
            for s in starts:
                i = starts.index(s)
                if i == 0:
                    yield ts[:s-1]
                    yield ts[s+1:starts[i+1]-1]
                elif i != len(starts) - 1:
                    yield ts[s+1:starts[i+1]-1]
                else:
                    if s != len(game.data) - 1:
                        yield ts[s+1:]


def est_bomb_effect(game,term=0):
    # term is the number of weeks the estimator should look ahead with
    l_of_ts = list(split_players_on_bomb(game))
    if len(l_of_ts) == 2:
        before = l_of_ts[0]
        after = l_of_ts[1]
        if term == 0:
            term = len(after)
        model = AutoARIMA()
        try:
            prediction = model.fit_predict(before.fillna(before.mean()),n_periods=term)
            error = (after[:term] - prediction)/after[:term] # signed percentage error
            yield error
        except ValueError: #sometimes returns empty series
            pass
    else:
        for i in range(len(l_of_ts)-1):
            before = l_of_ts[i]
            after = l_of_ts[i+1]
            # use a per-event horizon so term == 0 isn't overwritten by the
            # first event's length and leaked into later events
            horizon = term if term != 0 else len(after)
            try:
                model = AutoARIMA()
                prediction = model.fit_predict(before.fillna(before.mean()),n_periods=horizon)
                error = (after[:horizon] - prediction)/after[:horizon]
                yield error
            except ValueError: # sometimes returns an empty series
                pass

def plot_review_bomb_eff(game,term=0):
    title = 'Percentage change in Concurrent players for '
    if term == 0:
        title += game.name
    else:
        title += game.name+' {0} weeks after review bomb event'.format(term)
    mes = list(est_bomb_effect(game,term=term))
    for me in list(mes):
        me.plot(figsize=size,title=title)
    plt.show()
    
def get_review_bomb_eff(game,term=0):
    mean_signed_errors = []
    mes = list(est_bomb_effect(game,term=term))
    for me in list(mes):
        if len(me) > 0: #fail-safe
            mean_signed_errors.append(me.mean())
    return np.mean(mean_signed_errors)

ARIMA estimation is quite computationally expensive, so I opt to use multithreading:

import warnings
import threading
from math import ceil

mean_signed_errors_main = {}

class RB_eff_getter(threading.Thread):
    def __init__(self,threadID,name,begin,stop,ls):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.begin = begin # what index to start at
        self.stop = stop # What index to stop at
        self.ls = ls # iterable to parse
    def run(self):
        for game in self.ls[self.begin:self.stop]:
            global mean_signed_errors_main
            mean_signed_errors_main[game.appid] = (get_review_bomb_eff(game,term=25),get_review_bomb_eff(game,term=15),get_review_bomb_eff(game,term=5))
            print('.',end='')

warnings.filterwarnings("ignore")

threads = []

print('Getting Review Bomb Effects')
split_by_threads = 10
for i in range(ceil(len(bombed_games)/split_by_threads)):
    # slice stops are exclusive, so the chunk must end at (i+1)*split_by_threads;
    # the original stop of (i+1)*split_by_threads - 1 silently skipped the last
    # game of every chunk
    t = RB_eff_getter(i+1,'Thread {0}'.format(i+1),
                      i*split_by_threads,
                      min((i+1)*split_by_threads,len(bombed_games)),
                      bombed_games)
    t.start()
    threads.append(t)

for t in threads:
    t.join()
print()
print('Done!')
Getting Review Bomb Effects
.........................................................................
Done!
# making dict to translate appids
import csv

id_to_game_dict = {}
games = csv.reader(open('top1000steam.csv','r',encoding='utf-8'))
for row in games:
    id_to_game_dict[int(row[2])] = row[0]
# for some reason reading wouldn't work with this game in the file so I added it manually
id_to_game_dict[35700] = 'Trine Enchanted Edition'
#inverse dictionary
game_to_id_dict = {}
for k, v in id_to_game_dict.items():
    game_to_id_dict[v] = k
# making error df
short_term = 'Short-term Review Bomb Effect (5 weeks)'
mid_term = 'Medium-term Review Bomb Effect (15 weeks)'
long_term = 'Long-term Review Bomb Effect (25 weeks)'
review_bomb_eff_df = pd.DataFrame(mean_signed_errors_main,index=[long_term,mid_term,short_term]).T*100
review_bomb_eff_df['Game Name'] = [id_to_game_dict[i] for i in review_bomb_eff_df.index]

We may now generate descriptive statistics for the review bomb effect at each term length:

review_bomb_eff_df.dropna().describe()
       Long-term Review Bomb Effect (25 weeks)  Medium-term Review Bomb Effect (15 weeks)  Short-term Review Bomb Effect (5 weeks)
count                                72.000000                                  72.000000                                72.000000
mean                                -83.005382                                 -69.180567                               -66.781793
std                                 268.913492                                 257.057211                               234.835334
min                               -1552.735071                               -1996.633948                             -1797.020743
25%                                 -50.482062                                 -38.689960                               -44.340836
50%                                  -8.992907                                  -6.763253                                -0.515513
75%                                  16.295495                                  16.048220                                13.507898
max                                 132.797478                                  82.478413                                47.763262

It seems that some games suffered very large losses after review bombing events; I check this via histogram.

review_bomb_eff_df.dropna().plot.hist(figsize=size,bins=20,title='Distribution of Percentage Change in Concurrent Player Count after Review Bombing Event')
plt.show()

[figure: Distribution of Percentage Change in Concurrent Player Count after Review Bombing Event]

Given the existence of extreme values from the histogram, I opt to remove outliers from my data set.

from scipy.stats import zscore

def remove_outliers(df,rounds=2):
    i = 0
    clean = df.dropna()
    # clean the data in rounds, outliers may hide the existence of even more outliers
    while i < rounds:
        # calculate Z score for all data points
        Z = np.abs(zscore(clean.loc[:,[long_term,mid_term,short_term]]))
        # get indexes of outliers
        to_remove = np.where(Z > 3)
        to_remove_labels = [clean.index[i] for i in list(set(to_remove[0]))]
        clean = clean.drop(to_remove_labels)
        i += 1
    return clean

review_bomb_eff_df_clean = remove_outliers(review_bomb_eff_df)
review_bomb_eff_df_clean.describe()
       Long-term Review Bomb Effect (25 weeks)  Medium-term Review Bomb Effect (15 weeks)  Short-term Review Bomb Effect (5 weeks)
count                                65.000000                                  65.000000                                65.000000
mean                                -16.899673                                 -16.096100                               -15.423576
std                                  77.150167                                  65.616488                                51.989141
min                                -345.827089                                -259.424691                              -199.869738
25%                                 -34.079372                                 -24.870464                               -25.808589
50%                                  -5.618843                                  -5.398261                                 0.949599
75%                                  16.873663                                  17.590508                                14.506469
max                                 132.797478                                  82.478413                                47.763262
review_bomb_eff_df_clean.plot.hist(figsize=size,bins=20,title='Distribution of Percentage Change in Concurrent Player Count after Review Bombing Event, Outliers Removed')
plt.show()

[figure: Distribution of Percentage Change in Concurrent Player Count after Review Bombing Event, Outliers Removed]

As we can see, the histogram and descriptive statistics give strong evidence that review bombing has a negative impact on player bases in the short term. The medium- and long-term effects are more difficult to see, so I use a Z-test on the mean for each of the term periods:

from statsmodels.stats.weightstats import ztest
for term in [short_term,mid_term,long_term]:
    resultZ = ztest(review_bomb_eff_df_clean[term].dropna(),
                    alternative='smaller')
    print('Result of Z test against H0: μ ≥ 0 for',term.split()[0]+':')
    print('Z =',round(resultZ[0],3),'(p = {0})'.format(round(resultZ[1],3)))
Result of Z test against H0: μ ≥ 0 for Short-term:
Z = -2.392 (p = 0.008)
Result of Z test against H0: μ ≥ 0 for Medium-term:
Z = -1.978 (p = 0.024)
Result of Z test against H0: μ ≥ 0 for Long-term:
Z = -1.766 (p = 0.039)

As we can see, the Z-values are significantly negative at a significance level of α = 0.05; in fact, the short-term effect is found to be negative even at a significance level of α = 0.01. This essentially means that if we were to assume review bombing does not have a negative impact on the player base of games, we would have less than a 5% chance of collecting the data we did. Thus, this assumption is likely false.
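As a sanity check on the reported one-sided p-values (using scipy's normal CDF rather than the statsmodels call above), the p-value for the 'smaller' alternative is simply Φ(Z):

```python
from scipy.stats import norm

# Z statistics reported above for the short-, medium-, and long-term tests
for z in (-2.392, -1.978, -1.766):
    # one-sided p-value for H1: mu < 0
    print(round(norm.cdf(z), 3))
```

These reproduce the printed p-values of 0.008, 0.024, and 0.039.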

One may question how often short-term player loss from review bombing events is prolonged into the medium and long term. To this end, I display the correlation matrix of my data:

review_bomb_eff_df_clean.corr()
                                           Long-term (25 wk)  Medium-term (15 wk)  Short-term (5 wk)
Long-term Review Bomb Effect (25 weeks)             1.000000             0.898958           0.649668
Medium-term Review Bomb Effect (15 weeks)           0.898958             1.000000           0.779988
Short-term Review Bomb Effect (5 weeks)             0.649668             0.779988           1.000000

As we can see, while short-term loss often translates to medium-term loss, and medium-term loss is quite indicative of long-term loss, the relationship between short- and long-term player loss is weaker, meaning we shouldn't always expect review bombing to cause long-term damage to a game's player base.

Conclusion

Research Suggestions

It may be interesting to see whether the same relationship holds for sales rather than player count. Sadly, information about sales of Steam games is kept only by internal analytics departments or the paid website SteamSpy. If one were to take on this research, I would recommend trying to isolate the effects of price drops, which would require much more econometric analysis than I have done in this report.