
Player evaluation in Dota 2:
An unusual look at information asymmetry
Mojmír Krajčovič
51122478
JEL D01, D79, D84, Z29

This dissertation is submitted in part requirement for the Degree of M.A. with Honours in Economics at the
University of Aberdeen, Scotland, and is solely the work of the above named candidate
word count: 9,997

Table of Contents
1. Introduction
2. Overview
2.1 Aspirations shift
2.2 The academic field of e-sports
2.3 The Rise of the Computer
2.4 The e-sports leader
2.5 Matchmaking potential
3. Current evaluation tools
3.1 The Dota matchmaker
3.1.1 MMR Specifications
3.1.2 Matchmaking inefficiencies
3.2 In-game statistics
3.2.1 In game statistics inefficiency
3.2.2 In game statistics inefficiencies
3.3.1 Complexity
3.3.2 One game is not enough: the competitive aspect
3.4 Postgame statistics
4. The prospect of improvement
5. Ambiguous endogenous player skill
5.1 Organisation
5.2 Short-term failure
6. Information asymmetry
6.1 Collective salience
6.2 The assumptions
6.3 The model
7. The survey setup
7.1 Findings and interpretation methods
7.2 Limitations
7.3 Results
7.4 Hypothesis testing
8. Conclusion and recommendations
9. Appendix A
10. References

The dissertation focuses on how the principles of information asymmetry affect perceptions
of competence and performance of every Dota player. Combined with the insufficient
qualitative tools to evaluate one's skill, the players are left to interpret misleading
quantitative statistics. Therefore a new system that takes information inefficiencies into
account has to be created to rapidly increase the benefit the community derives from pursuing a
common passion. The creation of such a tool is not achieved within the scope of the dissertation, but a
theoretical model capable of achieving it is codified and valuable data is collected for further
testing.

Acknowledgement
I would like to thank Dr Juergen Bracht for advising and guiding my work, to
Alistair Spragg for having the courage to code the uncodable, to Ing Milan Krajcovic for
immense help with interpreting the data, to Dr Alan Bester for endorsing the research
within the community and to Michal Simko, Michaela Debreceniova, Martin Certicky and
Chris Koch for proofreading every draft and supporting my unorthodox thesis.

1. Introduction
This paper's aim is to introduce the reader to the intricate world of Dota 2 at a
leisurely pace, gradually explaining concepts as they appear. It is structured in a way to
ease the transition from general concepts and simple truths into more complex Dota-specific concepts many academics might not know much about.
Section 2 serves as a general overview for the reasoning behind choosing this topic,
particularly the physical vs. digital factors that postulate the idea that e-sports should be
studied as a mixture of sports and game theory to provide academic guidance in fixing
issues experienced by the author personally. A quick overview of the game is given, with
additional information snippets scattered further in the text to prevent overwhelming the
reader with too much game-specific jargon at any time.
In section 3, the existing tools to evaluate performance are explained and critically
presented. This section reasons why these quantitative measures are insufficient in tracking
individual player competence. The section is concluded with the need to devise an
improved metric that can constantly capture player performance and offer qualitative
inference.
Section 4 specifies the potential benefits and the groups affected by such a metric,
while sections 5 and 6 introduce the economic concepts of information asymmetry and the
endogenous competence variable. These are briefly described with respect to player
behaviour within the game environment. General theoretical assumptions are later created
to govern the subsequent model used to conduct a simple short-term experiment that was
supposed to probe the current situation and interpret it. Its setup, limitations and findings can
be found in section 7. The experiment proves to offer some significance but can be heavily
improved. The paper concludes that in order to create a functioning tool, the right research,
the right promotion and the right statistical models need to be in place. The author
expects to continue working on the topic in his own time and hopes to offer the community
a semi-functioning collaborative rating system by The International 5 in August.

2. Overview
2.1 Aspirations shift

In the industrial age, 'physical fitness' became a dominant value in society.
Therefore, most traditional sports evolved to measure such attributes, covering what is
known as traditional sport science (Wagner, 2006). While the pursuit of self-set
goals described under self-determination theory has stayed similar (a drive for mastery
through competence, relatedness and autonomy) (Edge, 2013), the means of
achieving them have changed with the advancements in technology (Rioult et al., 2014).
Achieving expertise in any activity involves what Ericsson et al (1993) define as
'deliberate practice' - a structured task rehearsal for the sake of improving performance,
contrasted with 'play' which is defined as task immersion for the sole purpose of enjoyment.
Taking part in any sport involves performing it under an agreed set of rules and
specifications. While many people engage in sports, relatively few deliberately
practice them. Under the intrinsic motivation theory, ambitious individuals prefer
competitive activities. Furthermore, competition-enabled competence valuation can have
favourable effects on an individual's motivation, provided that feedback is given properly
(Wagner, 2006).

2.2 The academic field of e-sports1

This paper's focus makes the discussion whether e-sports has to be defined as a
sport irrelevant for its purpose (Gestalt, 1999). There is an apparent natural connection
between the two. Analogous to traditional sports, the gaming player base follows a
traditional Gaussian distribution2 (Reddit, 2014), with only a small percentage of the
players practicing deliberately (Rioult et al., 2014); each player expresses a self-set level of
intrinsic motivation to reach her goals through the structured game environment with
extremely well-defined and consistent rules identical for every player3, which in turn allow
a player to win a game by finding and executing strategies that outperform the opponent's
strategy4. One significant difference between the two is the form of the activity and the
subsequent evaluation feedback available. While physical sport can be physically measured
and tracked, virtual activity within a digital environment requires the exploration of new
techniques to evaluate performance as it does not involve such easily measurable metrics.
That is also why an online game is often compared to a game of chess and the Elo system
developed for tracking chess players has been adopted in e-sports.
2.3 The Rise of the Computer

E-sports is a contemporary phenomenon, slowly finding its place within the
entertainment industry. LAN5 tournaments sell out spectator venues, and some attract at-home
audiences bigger than those of top traditional sporting events (Wingfield, 2014).
Globally recognized brands queue up to sponsor the best professional players in the
industry, who themselves earn up to seven-figure incomes, luring the new digital
generation to become gamers.

1 Also known as electronic sports or eSports
2 Illustrated by figure 1 from the Dota 2 2014 rating Survey
3 The most important rules will be introduced throughout the paper in the form of infographics made by dota2wiki.com
4 Analogous to rugby
5 Abbreviation for 'Local area network', also known as 'offline'

The US State Department now allows visas for professional gamers on the same grounds as
traditional athletes. Amazon bought Twitch, a gaming streaming website, for $1 billion last
year. E-sports viewership has been estimated at over 70 million people in 2013 and is
expected to have risen rapidly since (SuperData Research, 2013).
The potential of approaching e-sports through 'sports' science is that it would look
further than considering it a sociocultural phenomenon. E-sports is rooted deeply in the
digital youth culture. Children who are already very competent using the modern
information technology further train their competencies through playing computer games,
rapidly widening the social-technological gap between them and older generations. The
mastery of multimodal communication has become one of the fundamental capabilities to
acquire high status within a group or a society, particularly in youth culture (Wagner,
2006).
Looking at e-sports through game theory would allow modern approaches and
methodologies to be derived that actively improve the industry's progress and add elements
it is missing. However, despite massive growth in popularity (Ibid), academic focus on e-sports so far has revolved mainly around its sociological, ethical (Rioult et al., 2014)
and cognitive aspects (Latham, Patston and Tippett, 2013). Little weight has been given to
the lack of tools to measure digital activity and competence. Rioult et al (2014) researched
the possibility of analyzing real time sport based on data mining of an online game, Latham
et al (2013) postulated the importance of distinguishing video-game expertise and
experience and Yang et al (2014) focused on identifying success patterns, but the literature
on evaluating digital performance and competence level has much room to grow.

2.4 The e-sports leader

Approaching Dota 2 academically would allow the development of tools and
techniques for improving the analysis of player performance. Tracking progress is part of
any competitive sport culture and human activity (The Economist, 2015), allowing
participants to improve their qualification and enhance their assessment skills (Sadler and
Good, 2006 and Wagner, 2006). Some proxies already exist in Dota6 but they do not offer
any rigid evaluation framework, only the means for users to practice and improve (any
game replay can be downloaded and reviewed and the users can create a practice game to
test any mechanics and interactions governed by the game's rules). So far, the introduction
of objective player evaluation has not been very fruitful.
Dota 2 is a free-to-play MOBA7 computer game. The title has gained over eleven
million players8 since its release in 2013 and has become the most played game9 on the
Steam platform10. Due to constant balancing updates to the game rules, Dota has become one
of the biggest e-sports, fostering an organised network of professional competition
(Toft-Andersen, 2014). The highest-skill players in the world now make their living as
professionals11, playing in team rosters under sponsor banners and competing against one
another in dozens of annual online and offline tournaments.

6 'Dota 2' and Dota will be interchangeably used to refer to the same game
7 Abbreviation for Multiplayer Online Battle Arena
8 It is now installed on 42 million computers worldwide
9 http://steamcharts.com
10 Developed by Valve, Steam provides an internet-based distribution of computer games and additional community features
11 (e-Sports Earnings, 2015)

The most prominent of them is The International (TI), famous for its prize pool (Wallace,
2014). Every professional player's goal is to win TI, which grants a team of five players
lifetime reputation and prestige. During the rest of the year, fans and players struggle to
evaluate who the overall best players are. Last year's TI4 winners, Team Newbee, dominated
during the tournament, but their performance sank so much afterwards that the Dota community
objects to the team being automatically entered into the next TI (which has been a tradition so far).
2.5 Matchmaking potential

An interesting concept that differentiates online gaming from traditional sports is
the promotion of a built-in system that aims to maximise each casual player's enjoyment of
the game. This of course does not apply to competitive play, which transcends such a
tendency and involves the most ambitious players trying to be the best. Historically, no
physical sport has had the possibility to create specific matching rules implemented to
directly tackle skill differences of the players. Mutually unknown players either never
played against each other or encountered their skill differences during a match (Lutz,
2013). Due to technological advancement, online gaming holds the potential to match
mutually unknown players with similar skill to improve their match enjoyment. Doing so
requires certain criteria to be met. The deciding factor for maximised game utility
is balanced opponents: both matched teams' expected probability of winning has to be
50% before the match starts. In theory, this can be achieved if the discrepancy in skill
between the most and the least skilled player in a match is zero. An additional underlying
assumption is the positive relationship between game experience and attained expertise.
Therefore, the matchmaking system also has to try to hold the difference between the least
and the most experienced player at a minimum (Blog.dota2.com, 2015).


3. Current evaluation tools


3.1 The Dota matchmaker

The Dota matchmaking system tries to judge each player's competence level using a
matchmaking rating (MMR), a metric that places each player onto a percentile
point of the total player base Gaussian curve (similar to Elo12 ratings in chess). The exact
formula used to calculate a player's MMR and the underlying matchmaking algorithm have
never been published by Valve13, therefore the following information is only descriptive.
Every player who has played at least one game is assigned an uncalibrated numerical value
based on specific data-driven rules (What-a-Baller, 2014). For any potential match, similar
MMR players are expected to meet the ideal criteria described in 2.5. The algorithm assigns
a score for each criterion and creates a weighted average. When the generated score
exceeds a set threshold, the match is considered 'good enough' and is formed. To achieve
continuous maximum match quality, the MMR recalibrates after every match.
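
Because Valve has never published the algorithm, the scoring step described above can only be illustrated with a minimal sketch. The criterion names, the weights and the threshold below are hypothetical and chosen purely for illustration; only the overall shape (score each criterion, take a weighted average, compare it with a threshold) comes from the description.

```python
# Illustrative sketch only: criteria, weights and threshold are assumptions,
# not Valve's actual (unpublished) matchmaking parameters.

def match_quality(scores, weights):
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def is_good_enough(scores, weights, threshold):
    """A match is formed once its quality score exceeds the threshold."""
    return match_quality(scores, weights) > threshold


# Hypothetical criteria loosely based on the ideal criteria of section 2.5.
WEIGHTS = {"win_chance_balance": 0.5, "skill_spread": 0.3, "experience_spread": 0.2}
candidate = {"win_chance_balance": 0.9, "skill_spread": 0.8, "experience_spread": 0.6}

quality = match_quality(candidate, WEIGHTS)
formed = is_good_enough(candidate, WEIGHTS, threshold=0.75)
```

With these made-up numbers the candidate match clears the threshold; lowering any single criterion score far enough would keep the players in the queue.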
3.1.1 MMR Specifications

The ideal expected pre-match win rate criterion is the primary aspect of MMR
calibration, although the algorithm does not force particular win rates for players14. Instead,
its built-in Elo-type function places a player on a winning streak into continuously
higher-level games, while a player on a losing streak becomes matched with progressively
lower-skilled opponents and teammates. Therefore players' win rates stabilising around 50% is an
indirect result of the system. An important characteristic to note is that an Elo system is not
meant to give players a sense of progress. The algorithm is targeted at creating closely
balanced games by copying an assumed normal distribution of competence; therefore it is
mathematically impossible for all players' MMR to keep rising indefinitely as more
games are played.

12 Named after its creator, Arpad Elo
13 The game's developer
14 Fletcher (2013) categorically rejected the allegations that the system looks at players' above par (>50%) historical win rates and matches them into games they were expected to lose.

Figure 1: Illustrative distribution of MMR, source: Reddit.com

A player falsely looking at their MMR as some progress indicator would
be constantly told she makes none (Fletcher, 2013). The MMR works as the mean of a
probability distribution of performance in the next match. Uncertainty, serving as the
standard deviation of such a distribution, adjusts with respect to the relationship between
actual and predicted match outcomes. The algorithm's repeatedly correct prediction of a
match result reduces player uncertainty, while a surprise win or loss tends to increase it
(Blog.dota2.com, 2015).
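
The Elo-type behaviour described above can be sketched with the standard chess Elo update. This is a stand-in rather than Valve's formula, which is unpublished: the step size `K`, the 400-point scale and the toy `adjust_uncertainty` rule are all assumptions made for illustration.

```python
# Standard chess-style Elo update, used here only as a stand-in: Valve has not
# published the MMR formula, so K and the 400-point scale are assumptions.

K = 32.0       # update step size (a chess convention, not a Dota value)
SCALE = 400.0  # rating gap at which the favourite is expected to win about 91%


def expected_score(rating_a, rating_b):
    """Predicted probability that side A beats side B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / SCALE))


def update(rating_a, rating_b, a_won):
    """Return both ratings after one match; the two changes sum to zero."""
    delta = K * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def adjust_uncertainty(uncertainty, predicted_a, a_won, step=0.1):
    """Toy rule: shrink uncertainty after a correct prediction, grow it after a surprise."""
    predicted_correctly = (predicted_a >= 0.5) == a_won
    factor = 1.0 - step if predicted_correctly else 1.0 + step
    return uncertainty * factor


before = (3000.0, 3000.0)
after = update(*before, a_won=True)
```

Two evenly rated sides each have an expected score of 0.5, and whatever the winner gains the loser gives up. The total rating in the system is therefore conserved in this sketch, which is one way to see why not every player's rating can keep rising.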
Stemming from Valve's data that similar skill players with dissimilar experience
tend to have different game play expectations (Ibid), the matchmaking system also tries to
match players with similar total matches played. The system measures experience as an
approximate logarithmic function of the total games played for a player (figure 2 below).

Figure 2: Approximate logarithmic function of player experience in Dota 2, source: blog.dota2.com15

If two players are close enough in the diagram, they are placed in the same match.
Valve states that it relies on MMR almost exclusively after around 150 Dota matches,
known as the calibration phase (Dev.dota2.com, 2013).
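
As a rough illustration of the experience criterion, the sketch below treats experience as the natural logarithm of games played and calls two players "close enough" when both their MMR gap and their experience gap are small. The choice of `log1p` and both tolerance values are hypothetical; the source only says that the measure is approximately logarithmic.

```python
import math

# Hypothetical illustration: the log base, scaling and tolerances are assumptions;
# Valve only describes the experience measure as approximately logarithmic.

def experience(games_played):
    """Approximately logarithmic experience measure; log1p keeps 0 games at 0."""
    return math.log1p(games_played)


def close_enough(player_a, player_b, max_mmr_gap=200.0, max_experience_gap=0.5):
    """Players are (mmr, games_played) tuples; both gaps must be small."""
    mmr_gap = abs(player_a[0] - player_b[0])
    experience_gap = abs(experience(player_a[1]) - experience(player_b[1]))
    return mmr_gap <= max_mmr_gap and experience_gap <= max_experience_gap


newcomer = (2500.0, 20)
peer = (2550.0, 30)
veteran = (2550.0, 2000)
```

The logarithm captures the idea that ten extra games matter a great deal to a newcomer and hardly at all to a veteran, so the newcomer and the veteran above are kept apart despite their similar MMR.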
The global aspect of online gaming has to be taken into consideration, as many
players speak different languages. The lack of a common language among teammates should
be strongly avoided to allow players the strategic coordination crucial to win a match.
This is an issue Valve has been criticised for ever since they offered a language preference
matchmaking option.

15 The typical new player (A) progress describes how she gains experience and gradually moves upwards as her skill increases over time. Player (B) with prior experience in the genre follows a steeper rise and arrives at similar MMR sooner but becomes matched with similarly skilled players

The problem is that the game has no way of proving a player speaks a particular language, therefore anyone can
select any language to reduce their matchmaking queue time (English queue times are
comparatively shorter than any others). The reasoning of a player who deliberately lies about
their language proficiency is that the opportunity cost of not being able to coordinate with
teammates is smaller than that of waiting longer to play.
Citing Festinger, Chen et al (2010) conjecture that online community members tend
to compare themselves to others if comparative information is present. On top of that,
social comparison theory proposes that people follow social comparisons in situations that
are ambiguous (Ibid). MMR classification serves to increase match enjoyment for the
majority (which functions fairly well16) and, while not a direct evaluation of performance17,
it is often the only consistent unambiguous player label available (Blog.dotacoach.org,
2015). The community often reduces players to mere numbers, holding little respect for
players at the lower end of the MMR spectrum. What the continuous design of a fluctuating MMR
misses is the ability to recognise exceptional players straight from the start (unlike in
traditional sports) and place them into the brackets where they belong. Instead, such players have to
progressively climb towards a higher MMR with each easily won game.18

16 As the number of concurrent users grows progressively, depicted in figure 3
17 "Trench is a function of attitude, not ability." (BobRawrley, 2015)
18 As proven by Juice (2013) in his self-administered MMR experiment

Figure 3: Maximum players online on a log scale, source: Reddit

3.1.2 Matchmaking inefficiencies

Matching inefficiencies exist on both tails of the MMR distribution, due to the
game's learning curve and highly competitive aspect. The relation between MMR and skill
competence is visualised in figure 4 (note that the skill curve is only representative and
actually unknown). Sub-2000 MMR players often refer to their skill bracket as 'Elo Hell' or
'The Trench', claiming that, because winning is the major factor in improving their MMR, having
already attained a higher skill level will not necessarily lead them to meet higher-skilled
opponents, as they get stuck in an undervalued MMR bracket. As Dota is a
team game, it is very rare for an individual to singlehandedly beat the opposing five players;
therefore a player who perceives herself to be better has to cooperate with four teammates
who at that skill level percentile might not even have a general grasp of all the game mechanics.

Figure 4: An illustration of the relation between MMR and competence

On the other tail of the spectrum, above-6000 MMR players suffer from the lack of similarly
skilled players to play non-competitive games with. Players in this bracket often focus on
achieving higher order competitive ambitions (like professional gaming or private leagues19),
resulting in highly-skilled players often finding themselves stuck in the matchmaking queue,
waiting for another nine similarly skilled players. As one game of Dota on average requires a
20 to 60 minute commitment, the player-side assumption is that the payoff of waiting in a queue
for a good match declines the longer they wait. Server-side optimisation follows the assumption
that the expected utility of looking for a game increases with time spent in the queue (as
more comparable players would be found). This problem is represented in figure 5. Long queues
for high-skilled players gradually minimise their utility of looking for a game.

Figure 5: The matchmaking queue payoff

On the other hand, the algorithm oftentimes decides
to overrule long queue times by matching two 6000+ MMR players against each other with
each having four 4000 MMR teammates, resulting in dissatisfaction with too large a skill gap
(similar to the lower distribution tail).
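
The tension between the two sides of the queue can be made concrete with a toy model. Everything in it is an assumption made for illustration: the exponential decay of the player's payoff, the linear widening of the MMR gap the server tolerates, and all constants. It is not a reconstruction of figure 5 or of Valve's matchmaker.

```python
import math

# Toy model of the queue trade-off; the functional forms and constants are
# assumptions made for illustration, not taken from Valve or from figure 5.

def player_payoff(match_quality, minutes_waited, impatience=0.05):
    """Player side: the value of a match decays the longer the player waits."""
    return match_quality * math.exp(-impatience * minutes_waited)


def allowed_mmr_gap(minutes_waited, base_gap=100.0, widening_per_minute=150.0):
    """Server side: the tolerated skill gap grows with time spent in the queue."""
    return base_gap + widening_per_minute * minutes_waited


def can_match(mmr_a, mmr_b, minutes_waited):
    """True once the queue has relaxed enough to pair the two players."""
    return abs(mmr_a - mmr_b) <= allowed_mmr_gap(minutes_waited)
```

Under these made-up constants a 6000 MMR player and a 4000 MMR player cannot be paired at the start of the queue but can be after fifteen minutes, by which point the waiting player's payoff has already fallen to less than half of its initial value.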
Altogether, MMR lacks any meaning for competitive gameplay as the high
achievement professional players (who are missing in the casual queues) get more
enjoyment from winning games that secure them a tournament trophy than from having a
comparable opponent in every game. Competitive gameplay is also considered a higher-order
bracket transcending the MMR ceiling, which nullifies MMR's meaning for
professional gaming. Winning a tournament is a team effort and requires additional
deliberate practice to get accustomed to an aggregate team strategy due to an extremely
high level of professional competition. A team of the five highest MMR players with no
prior mutual team coordination experience would have trouble beating a well-coordinated
professional team consisting of players with a relatively lower average MMR but prior
experience in competitive gaming.

19 pro.faceit.com
3.2 In-game statistics

As the game is numerical by nature, it provides many statistics that can be
tracked and analysed (Rioult et al., 2014), but they do not offer an objective evaluation
either. The evaluation problem stems mainly from the game's victory rule. While scoring
more goals than the opposing team in football means winning every time, leading
in any metric during a game of Dota does not necessarily mean winning it. The only victory
in Dota is achieved through destroying the enemy's ancient, which ends the match. Such a
victory rule is analogous to victory in chess, where a player wins by checkmating the
opposing king. All the in-game metrics only provide a holistic report of which team has the
upper hand at that particular moment and how, but none of them is a definite way to win a
match (Herren, 2015).

3.2.1 In game statistics inefficiency

Unlike in traditional sports, where a well-functioning system of ratings, both during
a game (e.g. goals scored) and outside of the game (e.g. lifetime KOs), helps the players and
viewers rate and understand how well individuals and teams play, in Dota there is no single
metric that inherently correlates with winning a game. Therefore any current in-game and
post-game quantitative data only describe a match in the provided categories and offer a
cumulative snapshot for individual players or teams. No weight is given to decisions made
within the game, only to the subsequent payoffs, even though such decisions closely
correlate with accumulating an advantage and winning the game. Such statistics are
nevertheless still valuable as they provide a macro overview of performance20, as shown by
the recent rise in popularity of professional commentators and analysts (Beyond The
Summit, 2014)21 who describe the game based on the combination of past experience,
expertise and available game statistics (Åkerblom, 2015)22. Such an approach remains the
most used to interpret player decision functions and potential game outcomes (Geere,
2015).

The available individual player statistics are the same for every player, with no weight
on any lane position or team role. A professional team traditionally includes two support
players, who create early-game opportunities for the team momentum to build up but gradually
offer comparatively less than the other team roles as the game progresses.

20 dotabuff.com, datdota.com, yasp.co, dotagem.com, dotamax.com
21 https://www.youtube.com/watch?v=K-AFbouKJAQ
22 http://fragbite.se/fragtv/video/2298/synderen-om-kina-vs-varlden-i-need-a-bigger-sample-size

Support players oftentimes end up with poor kill/death/assist (KDA)23 scores, minimal last hits (LH)24
and so on, sacrificing their resources for the collective team strategy goals.
The other three roles, so-called cores, consist of an offlaner role, a mid role and a carry; all
three offer little early on but gradually get stronger and often decide the result of a game.
The offlaner role is expected to suffer early and start slowly but develops to provide a utility
function for the team, creating space for the other two cores. The mid role has been
historically a tempo-controlling role, dictating the strategy mid-to-late game, and only
recently has it shifted into a more farm-oriented role type, similar to the carry role, which is
a late bloomer, often required to be babysat by the support duo early, in order to eat up most
of the team's gold net worth as the game progresses. The carry excels later in the game thanks
to itemization and levels gained from having space to farm25.
3.2.2 In game statistics inefficiencies

Just like team metrics, individual player metrics are insightful but often misleading. For
example, the player with the highest career KDA is Jia Jun Liu, considered one of the world's
best carries. Right behind him is Corey Wright, an exceptionally talented player whose career
has been much less fruitful compared to Liu's. While the former has played in a 'four protect
one' strategy line-up where every other team role sacrificed for him, the latter has
spent several years in Korea, where the competition is weaker than in more developed
competitive regions such as China or Europe.

23 Calculated simply as
24 Final blow that kills a creep and rewards gold
25 Active gathering of in-game gold and experience

Without context, the metric is misleading as
multiple non-game factors affect a player's KDA, such as the team's play style or the
standard of competition. Additionally, when it comes to tracking performance on specific
team roles, simple metrics provide even less useful information. Gold per minute (GPM)26
or experience per minute (XPM)27 only work as comparisons of team cores. The KDA
ratio, hero damage (HD)28 and hero healing (HH)29 only suggest team fighting30 capabilities,
while a team does not necessarily have to participate in team fights in order to win the
game, an approach known as the split-push strategy31 (Herrer, 2015).
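
To show how role alone drives these metrics, the sketch below computes KDA and GPM for two invented stat lines from the same match. It assumes the common community convention that KDA is (kills + assists) / deaths; the exact formula used in the source is not recoverable, and the stat lines themselves are hypothetical.

```python
# Common community conventions, stated here as assumptions: KDA is taken as
# (kills + assists) / deaths, and per-minute metrics divide by match length.

def kda(kills, deaths, assists):
    """Kill/death/assist ratio; deaths are floored at 1 to avoid division by zero."""
    return (kills + assists) / max(deaths, 1)


def per_minute(total, duration_minutes):
    """Gold per minute (GPM) or experience per minute (XPM)."""
    return total / duration_minutes


# Hypothetical stat lines from one 40-minute match won by this team.
carry = {"kills": 12, "deaths": 2, "assists": 6, "gold": 28000}
support = {"kills": 1, "deaths": 9, "assists": 14, "gold": 9000}

carry_kda = kda(carry["kills"], carry["deaths"], carry["assists"])
support_kda = kda(support["kills"], support["deaths"], support["assists"])
carry_gpm = per_minute(carry["gold"], 40)
support_gpm = per_minute(support["gold"], 40)
```

Both players won the same game, yet every number here ranks the carry far above the support, which says more about how resources were distributed within the team than about how well either of them played.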
The quantitative scope of the currently tracked data favours roles specifically
focused on gathering as many resources as possible to win and undervalues a support player's
comparably important contribution to winning the game. Supports are often referred to as
the unsung heroes of a victory as the game simply has not found a quantitative way to
evaluate non-resource based play so far. The community can still saliently agree on which
support players excel in their role (Medium, 2015).
3.3.1 Complexity

In addition to the victory rule, it is the pure complexity of the game that denies
further evaluation inferences. Just under 1.5 billion games have been played and recorded32
so far and no two games were the same. This is due to the combination of specific game
mechanics, 110 heroes, each with four unique abilities and six item slots for 131 unique in-game items, individual player form, play style, strengths and weaknesses and all the

26
Gold earned on average with respect to game duration
27
Experience attained per one minute of the game
28
Total damage output of a hero inflicted on opposing heroes
29
Total healing done to allied heroes
30
A situation in which a number of heroes in both teams fight the opponents in the same location
31
"Kills mean nothing, throne means everything" -TobiWan
https://www.youtube.com/watch?v=ezTtb1sY_Hw
32
http://www.dotabuff.com/matches

potential 10-player position matrices on the game
map. Intuitively, the Nash equilibrium of how to
coordinate to win is not clearly apparent. Players
have to continuously adjust their strategies as
they play and instantly react to the opposing
team's actions. Just the way team members
coordinate their positions early in the game
(Datdota.com, 2015) to maximise their perceived
payoffs with respect to the opposing team's positioning for the first couple of minutes
creates a complex strategic-form game framework. Yet even a team that coordinates perfectly in the
early game is far from guaranteed a victory. Complexity only allows general
qualitative rubrics (such as map awareness or team cohesion) and overall comparative
quantitative data (such as player or team win rate) to describe future outcomes based on
past data. A vast wiki on the game rules (Liquipedia, 2015) and a weekly thread on
Reddit33, filled with questions about peculiarities of the game mechanics, testify to Dota's
complexity.
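To give a rough sense of scale for the claim that no two games were the same, the following minimal sketch counts only the possible hero line-ups. It assumes that each hero can appear once per game and that the two sides are distinguishable; bans, items, abilities and map positions are ignored, so the true state space is far larger. The function name is illustrative and not part of any Dota tool.

```python
from math import comb


def count_hero_lineups(pool_size: int = 110, team_size: int = 5) -> int:
    """Count the ways two sides can field distinct heroes from one shared pool.

    Assumption: a hero can be picked only once per game and the two sides
    are distinguishable; bans, items and roles are ignored.
    """
    first_team = comb(pool_size, team_size)
    second_team = comb(pool_size - team_size, team_size)
    return first_team * second_team


lineups = count_hero_lineups()
recorded_games = 1_500_000_000  # "just under 1.5 billion" games, as quoted in the text
print(f"{lineups:.3e} possible line-ups, about {lineups / recorded_games:.1e} per recorded game")
```

Even under these simplifying assumptions, the number of line-ups exceeds the number of recorded games by several orders of magnitude.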
3.3.2 One game is not enough: the competitive aspect

In order to win a match, professional players try to bypass the game's complexity by
a strategy that chooses the set of variables (heroes, items, timeframes for specific actions
and so on) that is the most probable to win a game, with respect to the opposite team's
composition, basing the strategy choice on either statistical or salient principles. Such
behaviour in competitive Dota gives rise to what is called the 'meta-game'. The most

33
A social networking website

prominent feature of the 'meta' is the pool of heroes picked or banned34 (joinDOTA.com,
2015). A particular set of hero choices, combined with a particular play style and
itemisation, leads to a higher statistical probability of winning than the opponents' hero set,
creating a trend of dominant strategy saliency. Going back to the complexity of the game,
such trends are very unstable, because each hero has its own strengths and
weaknesses which can be countered by picking other specific heroes against it, creating a
never-ending cycle of strategies and counterstrategies35 (Kelley, 2015). The 'meta' is closely
watched and discussed as it is a simplified way of understanding the game. Each hero's win
rate in professional games is tracked36 and qualitative inferences are made.
3.4 Postgame statistics

A similar approach is used for professional teams, but it has its own limitations.
Steenhuisen (2015) uses the example of two teams: Cloud9 and MVP Phoenix. The first has a
217-167 record, translating into a 56.51% win rate, and the other has a misleadingly better
record of 150-115 (56.60%). An average Cloud9 opponent is much more skilled than that of
MVP Phoenix due to regional differences. To circumvent this, a professional-tier Elo-based
ladder ranking is commonly used to adjust the estimates of team performance (Noxville,
2015 and Gosugamers.net, 2015). These are calculated as a combination of team win rate
and subsequent tournament victories. While sufficient to rank teams against each other, the
comparisons of player performances within these matches become much more complicated
(going back to insufficient tracking of individual performance). While a team's win rate is a
relatively better indication of overall long-term performance compared to winning a
particular tournament or in-game statistics, it is only a hint for winning a particular game.
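The win rates quoted above, and the idea behind an Elo-style correction, can be sketched as follows. This is the textbook Elo update, not the formula used by the cited rankings, whose details are not given here; the K-factor of 32 and the 400-point scale are conventional assumptions.

```python
def win_rate(wins: int, losses: int) -> float:
    """Share of games won, ignoring opponent strength."""
    return wins / (wins + losses)


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of team A against team B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one game; k = 32 is an assumed K-factor."""
    delta = k * ((1.0 if a_won else 0.0) - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(f"Cloud9: {win_rate(217, 167):.2%}, MVP Phoenix: {win_rate(150, 115):.2%}")
print(elo_update(1600.0, 1400.0, a_won=True))   # favourite wins: small gain
print(elo_update(1400.0, 1600.0, a_won=True))   # underdog wins: large gain
```

Because the rating change depends on the opponent's rating, a win against a strong team counts for more than a win against a weak one, which a raw win rate cannot express.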

34
At the beginning of the match, both teams take turns in choosing (pick) or removing (ban) 5 heroes from the
pool each
35
http://www.joefkelley.com/dota2chart.html
36
http://www.datdota.com/heroes.php

As the most successful professional teams fluctuate under a 70% win rate, the metric does
not take influential variables such as the opposing team's standing, current form or the importance
of a match into account (Haigh, 1999:261)37. Curiously, Savinoxo administers a betting
model that tries to predict future match outcomes by tracking previous match results and
selecting profitable bets using the Kelly criterion (Smartdotabetting.com, 2015).
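For readers unfamiliar with the Kelly criterion, a minimal sketch of the standard formula for a single binary bet follows. It is not the model run by Smartdotabetting.com; the function name, the use of decimal odds and the example numbers are assumptions made for illustration.

```python
def kelly_fraction(win_probability: float, decimal_odds: float) -> float:
    """Fraction of the bankroll to stake on one binary bet (standard Kelly formula).

    Returns 0.0 when the bet has no positive expected value.
    """
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must lie in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1")
    net_odds = decimal_odds - 1.0  # profit per unit staked if the bet wins
    edge = net_odds * win_probability - (1.0 - win_probability)
    return max(0.0, edge / net_odds)


# Hypothetical example: a 60% estimated win chance at even money.
print(round(kelly_fraction(0.60, 2.0), 3))
```

The stake grows with the estimated edge, so the quality of the win-probability estimate matters more than the staking rule itself.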
A player's performance will naturally fluctuate based on form. The game currently
cannot describe such fluctuations. Winning a tournament should inherently suggest that the
winning team consists of the five best players at the time of the tournament, but with data
being biased towards core team roles there is no way of confirming such a notion. While
the winning team has the highest probability of ranking first in collaborative rubrics (team
execution, cohesion etc.), individual contributions to such estimates cannot be inferred.

4. The prospect of improvement


Neither individual MMR, winning a tournament, nor in-game and post-game statistics offer
a viable way to evaluate all the players fairly. The virtual aspect of Dota hinders our current
ability to measure performance. New ways of player evaluation therefore need to be
applied, building on already existing concepts. Such tools would greatly increase
the benefit that three distinct groups derive from the game.
For the players, both professional and amateur, being able to understand their
performance clearly offers a way of correcting and improving on their own and others' mistakes.
This in turn provides them with the ability to enhance their mastery of the game and
increase their competence autonomously, while being able to relate to their own progress
(Rioult et al., 2014 and Wagner, 2006).

37

Using Morris' relative match importance theory has not been acknowledged in e-sports so far


For the viewers of professional matches, viable evaluation of players' performance
improves the spectator experience and
deepens their interest and relatedness to the game as a social medium (Chen et al., 2010).
For Valve, the tools might be necessary to provide sustainable growth of the game
as a business. An online game's monetary success is a direct function of the number of
players playing it. While still offering ad hoc experience for the users, without objective
tracking of competence progress, the users might become indifferent and stop playing the
game altogether, resulting in potentially negative effects on the company's business
(DiChristopher, 2014).
Dota 2's professional players are still primarily concerned about the primal need to
self-improve and be the best, whereas in professionalised traditional sports such primary
individual motivation to succeed is overshadowed by the secondary business motivation to
generate maximum profits38. The step towards such professionalisation effectively requires
the three groups to be catered for (Rioult et al., 2014).

5. Ambiguous endogenous player skill


The variable that affects winning a match the most is the collective sum of
individual team members' skill, which is endogenously 'chosen' by each player. Each player
decides how high a skill level they want to attain, although whether such competence is
achieved remains unknown to the player. The higher the skill level, the exponentially more
costly it becomes to attain. Typically, any skill mastery requires a person to obtain
knowledge through practice, which can be difficult to concisely describe (Yang, Harrison
and Roberts, 2014). Dota 2 is an unforgiving game that takes hundreds of play hours to

38

While professional football players still aim to be the best they can, the club pays their
players to win trophies (as they bring in the most prize money) and buys popular players to
increase non-game revenue (such as TV rights, merchandise).


grasp (Thurnsten, 2015) and thousands more of directed practice to 'master' (Yang and
Roberts, 2013). Given the endogenous quality of the attribute, the word 'master' should be
taken lightly, as the skill ceiling is an ambiguous concept. Dota 2 is considered one of the
steepest-learning-curve games in the history of gaming (Cocilova, 2015). Notice the 'maybe
20%' indicator on the y-axis in the illustrative figure 6 below.39

Figure 6: Illustration of the learning curve, original from: playdota.com40
Axes: skill (y-axis, carrying a 'maybe 20%' marker) against time/task (x-axis). Labels on the curve: 'start here with prior experience', 'learning how MOBA works', 'learning heroes and items', 'actually start practicing the game', 'getting good at it', 'when people actually start considering themselves pro'.

5.1 Organisation

While a high skill level is much more costly for the player to achieve, it is also much
more valuable to the competitive scene. These costs and benefits raise the issue of whether
or not there is an optimal skill level. Most markets have the property that the buyers are
diverse in their willingness to pay for quality increases, so there will be a variety of
different quality levels being sold at different prices (Holt, 2007). A similar concept can be
applied to the player market in Dota, where teams and players are looking for diverse
competence in players due to their own skill levels being distributed over a spectrum.

39

"In fact, once you have been playing for over a year you begin to hit the glass wall of mediocrity. Where
only those who are talented with either natural skill or hard work can surpass." (Cutsrock, 2014)
40
http://www.playdota.com/forums/attachment.php?attachmentid=37770&d=1299085266


Similar to what Rapaport (cited in Roth, Sonmez and Unver, 2003) suggested to minimise
the elimination of immunologically incompatible volunteer kidney donors, the Dota
community independently creates sub-top tournaments in which teams with sub-top collective
competence and results compete against like-minded teams to improve, providing
evidence for the motivations to play discussed in 2.1. While not as popular to spectate, such
organisation promotes the enjoyment of playing the game along with increased personal
competition, which in turn sustains a fresh pool of new talent for the
highest leagues, similar to NHL farm teams (Latham, Patston and Tippett, 2013).
A player's willingness to join a team is intuitively affected by the perceived
difference between her current competence and the potential team's average
competence (shown with respect to MMR in the illustrative figure 7 below). A 50%
willingness to join illustrates indifference.

Figure 7: Illustrative willingness to join


When offered a spot on a team, players with relatively lower skill face the
temptation to exaggerate their quality, especially if others cannot perceive their endogenous
skill level. Any skill information can only be acquired and maintained through signalling
by previous achievements, community reputation or previous team's opinions of the player.
The competitive organisation of Dota 2 is analogous to an experiment by DeJong, Forsythe
and Lundholm (1985) which showed that even with unknown prices a market will not
collapse thanks to a process of reputation building. As a player claims a certain competence
and others only hold imperfect information about the proclaimed competence, a set of
acquired reputations prevents the scene from collapsing. Instead, an ascending vertical skill-based system exists.
5.2 Short-term failure

Nonetheless, some short-term market failures have been historically observed at the
top competence level where the differences in skill levels are already minimised given their
extreme form. These notably concerned Jacky Mao, Tal Aizik and Johan Åström,
three players who were considered top tier at the time they were, for one reason or another,
removed from their teams (No Tidehunter, Team Secret and Cloud9 respectively) and
afterwards struggled to find a high-tier team to play for. Without an objective measure of
their competence, potential teams assumed the players carried some undesirable attributes
and refrained from signing them, analogous to Akerlof's (1970) market for
lemons. Mao was even forced to create a new team.

6. Information asymmetry
Even though information asymmetry is traditionally understood in the economic
context of buyers, sellers, price and quality (Holt, 2007) it can be reformulated using simple
proxies to apply to the context of Dota. The ubiquitous analogical assumption of the theory


is that a player's skill level can be observed neither by others nor by the player herself and its
evaluation cannot be quantified. Skill is described as a unique but ambiguous
combination of expertise, personality, aptitude and experience, which can be considered on
three levels:

Player vs. self: No objective metric of the qualitative individual skill set exists for
the player to rate herself on. Along with the ambiguous competence ceiling, the
only valid way of assessing her own expertise is to compare her own perceptions of
personal competence to what she perceives others' to be. Valve deals with the
problem by introducing the MMR metric providing players with a comparison
option, while professional players take the matter into their own hands by
competing professionally with the ultimate goal of winning The International
(analogous to being the best).

Player vs. other players: Following up on the inability to evaluate own skill, players
can only guess what competence the others have. But, with the addition of a
community, the notion of a player comparing perceived competence of another
player against the collective perception of such competence can exist. If finitely
replicated by the community onto the player herself, some form of collective
saliency takes place. The resulting cross-referenced meta perception becomes the
focal point of her own competence, held by the community. Such reasoning would
explain why only having some limited knowledge of a game is sufficient to
judge a player's performance (however poorly). After watching a five-minute
section of a football match, any fan familiar with the basic rules can, to some extent,
tell which players are performing better than others. If all general assumptions are


held equal, the same should apply for Dota players. The information asymmetry
takes place in the form of players not being able to objectively judge others without
high enough personal competence41.

Professionals vs. casuals: The ability to use the community's salient perception to
judge others 'fairly' with respect to the level of one's own (unknown) competence
suggests that the closer a player gets to achieving the notional competence ceiling,
the more capable she becomes at judging other players objectively. An unintended
information asymmetry therefore develops between the player and all other players
who improve at a slower pace. Professional players who signal their competence via
winning competitive matches should therefore be assumed to possess the best
means to evaluate other players' performance.

6.1 Collective salience

Hints to tracking collective saliency and validating collaborative evaluations can be
taken from the research done in aggregating information from noisy human labellers,
mainly in education. The most applicable works include the peer-grading optimisation of
Piech et al. (2013), 'grading without a key' systems by Johnson (1996) and Mislevy et al.
(1999) or Whitehill et al. (2009) with the discrete 'true image labels' model. Schelling
(1960) also showed that individuals can coordinate their behaviour by drawing on shared
perceptions, although game-theoretic analysis often starts by mathematically describing the
game, excluding any information about how the players describe their strategies themselves
(Sugden, 1995). Projected onto a community of peer-players, each has some common
points of reference due to personal experience of playing the same game. Bacharach's
(1993) theory suggests coordinating at an evaluation label is possible because 'normal'
41

Analogous to the driver ability experiment (McCormick, Walkey and Green, 1986)


players use (and can expect other players to use) a similar conceptual scheme of what is
meant by performance. Research by Sadler and Good (2006) remains optimistic about the
evaluation validity and positive effects of introducing peer-grading into a community.
Piech et al. (2013) conclude that peer assessment offers a promising solution to scale the
grading of complex assignments in massive open online courses.
6.2 The assumptions

If online gamers are assumed to be peers, a subsequent inference suggests that
applying similar scientific tools would reduce information asymmetry of player competence
in Dota. Trying to capture the whole player base would however create many idiosyncrasies
and filtering such noise might not result in valid output. Post-game peer evaluation is
something Valve have started collecting inconspicuously, keeping true to the data-driven
approach. Any changes to matchmaking as a result of such information are yet to be seen.
Based on the three levels of information asymmetry, tracking every player's competence
perception might not provide much value. Players have no incentive to spend their time on
trying to objectively evaluate each other. Instead, if we apply an analogy to the prediction
offered by Chetty et al. (2009)42, players pay closer attention to evaluation when observing
extreme competence (competitive play) than when rating regular peers (Luca and Smith,
2013).
The eventual assumptions represent the notion that players are, to some extent, able
to evaluate others' performance; that a player's evaluation capability is highest when
watching professionals play; and that it is always clouded by the first two levels of
information asymmetry. Only the highest competence players (professionals) are excluded
from the final assumption as they can perceive they stand on the extreme tail of the skill
42

that consumers are more likely to think closely about opaque details when making "large, one time choices"
than when making small repeated purchases


curve (figure 4), with all other players distributed to the left. We can therefore suggest
using collective saliency as the mean and professional player evaluation as the standard
deviation of the probability distribution of endogenous player competence, potentially
providing a valid evaluation of performance. Combined with the existing tools to quantify
performance, such an approach could yield the desired results described in section 4. It would ideally
provide a highly reliable assessment, engage the community and become applicable to a
diverse collection of problem settings (Piech et al., 2013).
6.3 The model

To root these theoretical assumptions in a model, we used an adaptation of the
assumptions for the first-order model design (PG1) constructed by Piech et al. (ibid.) in
their work on tuned models of peer assessment, which resulted in increased results
optimisation. Our model analogously assumes the existence of the following latent and
observed properties:

True rating: every player's performance p is associated with an unobserved and to
be estimated true underlying rating $r_p$.

Rater bias: every user u is associated with a bias, $b_u \in \mathbb{R}$. This variable reflects the
user's tendency to under- or overvalue her rating of a player.

Rater reliability: $\tau_u \in \mathbb{R}^+$ reflects how close on average a user's rating tends to be to
the performance's true rating after having corrected for bias. $\tau_u$ can be described as
the inverse variance of a normal distribution.

User rating: given by user u to performance p, denoted as $z_{up} \in \mathbb{R}$. The collection of
all observed ratings is denoted as $Z = \{z_{up}\}$. This is the only observed variable.


Using Bayesian statistics the model puts prior distributions over the latent variables and
assumes that while an individual user's bias may exist, the average bias of many users is
zero:
$\tau_u \sim \mathcal{G}(\alpha_0, \beta_0)$ for every user u (reliability),

$b_u \sim \mathcal{N}(0, 1/\eta_0)$ for every user u (bias),

$r_p \sim \mathcal{N}(\mu_0, 1/\gamma_0)$ for every performance p (true rating),

$z_{up} \sim \mathcal{N}(r_p + b_u, 1/\tau_u)$ for every observed rating (user rating).

$\mathcal{G}$ refers to a gamma distribution with fixed hyperparameters $\alpha_0$ and $\beta_0$, while $\eta_0$ and $\gamma_0$ are
hyperparameters for the priors over biases and true ratings.
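The generative side of the model can be illustrated with a short simulation. This is a sketch under stated assumptions, not the inference procedure of Piech et al.: the hyperparameter values are made up, the gamma prior is read as shape/rate, the true ratings are supplied by hand rather than drawn from their prior, and the estimator is a plain per-performance average that relies only on the zero-mean bias assumption.

```python
import random
from statistics import mean


def simulate_ratings(true_ratings, n_users, alpha0=5.0, beta0=5.0, eta0=1.0, seed=0):
    """Draw one rating z_up for every (user, performance) pair.

    All hyperparameter values are hypothetical. Assumption: G(alpha0, beta0)
    is parameterised as shape/rate, so the scale passed to gammavariate is 1/beta0.
    """
    rng = random.Random(seed)
    users = []
    for _ in range(n_users):
        tau_u = rng.gammavariate(alpha0, 1.0 / beta0)   # rater reliability
        b_u = rng.gauss(0.0, (1.0 / eta0) ** 0.5)       # rater bias
        users.append((tau_u, b_u))
    ratings = {}
    for u, (tau_u, b_u) in enumerate(users):
        for p, r_p in enumerate(true_ratings):
            # Observed rating: true rating plus bias, with variance 1 / tau_u.
            ratings[(u, p)] = rng.gauss(r_p + b_u, (1.0 / tau_u) ** 0.5)
    return users, ratings


def naive_estimates(ratings, n_performances):
    """Average the ratings per performance, leaning on the zero-mean bias assumption."""
    return [mean(z for (_, p), z in ratings.items() if p == i)
            for i in range(n_performances)]


true_ratings = [3.0, 6.0, 9.0]
users, ratings = simulate_ratings(true_ratings, n_users=2000)
print([round(r, 2) for r in naive_estimates(ratings, len(true_ratings))])
```

With many raters the averages settle near the true ratings, which is the intuition behind assuming that individual biases cancel out. Recovering each rater's bias and reliability would require proper posterior inference, which this sketch does not attempt.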
With this model in mind, we set up a survey intended to outline the
findings and assumptions gathered throughout the paper.

7. The survey setup


The data was gathered on selected matches (highlighted by a red outline in figure 8)
during the Starladder Season 12 LAN finals held in Bucharest from April 24th to April 26th
2015. This tournament was selected as it was the only major LAN tournament held during
the writing of this paper. While it would be beneficial to track all the matches within the
tournament, time constraints led us to focus on only the selected matches. All the chosen
matches were played in a best of three format, meaning a team had to win two games in the
matchup in order to proceed to the winner's bracket, while the loser dropped to the loser
bracket where they would meet other loser bracket teams. The final match was played
between the winner bracket finalist and the winner of the loser bracket in a best of five
format (with the winner bracket team possessing a one game advantage).

Figure 8: The Starladder tournament format, source: Dota2 Liquipedia43

Data gathering was implemented online via the RankDota website created for this
purpose.44 The page parsed selected information on the specified matches from the Dota 2
API (Application Programming Interface) after they finished, namely the time of the match,
the team names, their respective sides within the game, the players' names and the hero
assigned to each individual player. The winner of a match was not shown, to reduce the
experimenter effect. The site and a plea to participate were advertised for the first two days
of the tournament (24th and 25th of April) through Twitter45 and a Dota-specific subreddit
on Reddit46. Further matches were not promoted. Users were kept anonymous to avoid the
"culture shock" of assuming responsibility for rating (Sadler and Good, 2006).
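The parsing step described above might look roughly like the sketch below. It is not the RankDota code. The field names (`start_time`, `radiant_name`, `dire_name`, `radiant_win`, `players`, `player_slot`, `hero_id`, `account_id`) are assumptions about the shape of a match-details response, the sample payload is entirely made up, and resolving account and hero IDs to display names is left out.

```python
from datetime import datetime, timezone


def survey_view(match_details: dict) -> dict:
    """Keep only the fields shown to raters; the match result is deliberately left out."""
    players = []
    for player in match_details["players"]:
        # Assumption: the highest bit of player_slot (128) marks the Dire side.
        side = "Dire" if player["player_slot"] & 0x80 else "Radiant"
        players.append({
            "account_id": player["account_id"],
            "hero_id": player["hero_id"],
            "side": side,
        })
    return {
        "start_time": datetime.fromtimestamp(match_details["start_time"], tz=timezone.utc),
        "radiant_team": match_details.get("radiant_name"),
        "dire_team": match_details.get("dire_name"),
        "players": players,
    }


# Hypothetical payload with made-up values, trimmed to two players.
sample = {
    "match_id": 1,
    "start_time": 1429833600,
    "radiant_win": True,
    "radiant_name": "Team A",
    "dire_name": "Team B",
    "players": [
        {"account_id": 101, "player_slot": 0, "hero_id": 7},
        {"account_id": 202, "player_slot": 128, "hero_id": 21},
    ],
}
view = survey_view(sample)
print(view["radiant_team"], "vs", view["dire_team"], [p["side"] for p in view["players"]])
```

Building the rater-facing view from an explicit list of fields, rather than deleting the result from the full response, makes it harder to leak the winner by accident.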
After accessing the homepage, users were shown a list of finished Starladder games,

43
http://wiki.teamliquid.net/dota2/Star_Ladder_Star_Series/Season_12
44
http://rankdota.co.uk
45
https://twitter.com/squartefaghoui/status/591599291065638912
46
http://www.reddit.com/r/DotA2/comments/33phzi/launching_rankdota_please_give_it_a_go/ and
http://www.reddit.com/r/DotA2/comments/33t330/rankdota_beta_test_day_2/

with the ability to rank the players within every game individually (therefore a match
consisting of two games had two separate ranking pages). An important thing to note is that
two different matches were constantly played at the same time, with live coverage through
Twitch47 only offered for the more popular of the two. Users had the option to select their
current MMR and a Likert scale ranging from 1 to 10 was presented.

Users were asked to 'Please rank the performance of the players in this match (10
being the best)'. Performance was presented as an ambiguous concept, without any links to
team roles and no additional description of the scale was given. This let users define
performance based on their own belief of what constitutes performance and to base the
rating labels on their own private descriptions assigned to such belief. Each numerical rank
was a member of a set of finite possible labels for performance. Each user faced 10 choice
problems per player in which there was no incentive to make any particular choice apart
from personal bias (Mehta, Starmer and Sugden, 1994). There was no risk associated with
choosing a rating. The users' goal was not to coordinate on a similar rank, as no payoff
was offered for doing so; it was to evaluate perceived performance as best they could. Rating
coordination would be either coincidental or based on a shared conceptual scheme,
eliminating the possibility that the user chooses a rating that is different from what she
believes to be the right label (Sugden, 1995). Apart from performance, no additional rubrics
were codified for the users to rank due to the experimenter's acceptance of own secondary
level information deficiencies (Sadler and Good, 2006). The experiment didn't tell users
which team role players held, only the heroes they played. General game knowledge and
having watched the game are assumed to be sufficient for users to recognize roles. Due to
personal coding limitations, duplicate entries could not be prevented. There is

47

A live streaming website


also no way of telling whether users answered truthfully but no explicit nor implicit
incentive to sabotage the project has been discovered. Professional Dota matches are
viewed by players and fans spread across the whole skill distribution spectrum.
Intuitively, the more competent a player becomes the more interest she shows in spectating
competitive matches. The number of Starladder viewers fluctuated from 20,000 for early
morning matches to over 250,000 for the final match.
The reasoning behind this experiment was to draw some inferences about user
behaviour from the findings and assumptions gathered so far, with the
ultimate goal of laying foundations for additional research into finding an optimal
performance evaluation tool. The following hypotheses were defined48:
H1: A user's MMR affects the rating she assigns to professional players.
H2: A difference exists between the ratings of lower and higher MMR users.
H3: The ratings of winners and losers are different.
H4: True rating ($r_p$), rater bias ($b_u$) and rater reliability ($\tau_u$) can be observed.
7.1 Findings and interpretation methods

Data was gathered for 7 matches involving a total of 15 games, all taking place on
the first two days of the tournament. We managed to collect a total of 7961 responses of
which 6029 were valid (meaning the response had a rank > 0). 1986 responses were
matched with the optional MMR choice and were used for the inference analysis of the
effect of a respondent's MMR on their ranking.
We used univariate and multi-way statistics, frequency tables and frequency
distribution descriptive methods to evaluate the data. A non-parametric Wilcoxon test was
48

while not an indicator of competence level, MMR correlates with it enough and is the only useful label
generally recognised


used to infer sample differences and an analysis of independence was used to describe the
enumerative data presented in contingency tables, while graphic methods (box plot, bar
graph and contingency graph) helped to better visualise the data. We utilised the Statit
Custom QC statistical software49 and Microsoft Excel to read and process the data.
To avoid confusion in the following interpretations, note that
the game client does not hold a name for what this paper calls a match but instead names
every game (a subset of a match) with a match ID.
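As an illustration of an analysis of independence on a contingency table, the sketch below computes Pearson's chi-square statistic. It assumes that this is the test meant above; the counts are hypothetical and are not the survey data, and the p-value lookup is left to a statistics package.

```python
def chi_square_independence(table):
    """Return Pearson's chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are winners and losers, columns are low and high ratings.
statistic, dof = chi_square_independence([[10, 20], [20, 10]])
print(round(statistic, 3), dof)
```

For one degree of freedom the 5% critical value is 3.841, so a statistic above that would lead to rejecting independence between game result and rating in this made-up table.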
7.2 Limitations

Using a 10-point Likert scale limited our options to interpret the data. We
therefore did not consider the experiment results as categorical values but as
discrete variables, which entitled us to use statistical functions for composite data. We are
aware of the deficiency of processing a discrete scale through continuous statistical
methods, but doing so simplifies the interpretation while having no negative
implications on the results. We consider the scale to be sufficiently defined in the sense of
ordinality (1 minimum, 10 maximum). Community responses on the survey promotion
page reassured us that the respondents understood the scale, even though they considered it
too soft. Due to the ambiguity of the scale, different users might have used different weighting
(Sugden, 1995) and some issues of being unable to justify one rank over another have been
reported. We imposed the assumption that player performance remains the same over the
course of the game. This assumption is often incorrect but necessary without access to a
live match's combat log50 (Edge, 2013), which is a limitation in the Dota API.
Measuring performance through a simple 1 to 10 scale also does not allow for more
sophisticated responses. Expanding the scale range to 100 would reduce agreement by

49
Evaluation version
50
A chronological in-game list of every interaction involving the player

chance but also dissatisfy the users who would encounter further difficulties with an even
softer scale. If we were able to track specific users, reducing the scale maximum by three
points and implementing a weighted Kappa to account for chance agreement could have
provided sharper results (Abedi, 1996). Offering performance sub-rubrics and allowing
users to input free-form text would have given us further insights into users' ratings (Sadler
and Good, 2006) although many users might not have been prepared to give solid feedback
in English so a language barrier would have to be accounted for. We were also unable to
probe users for post-ranking clarifications ourselves which led some to contact us directly,
relating their concerns (Watters, 2012). Ling et al. (2005) found that individuals contributed
more when given specific challenging goals and were reminded of their uniqueness. We
replied to every user's comment but, due to the code issues, we could not identify
users. We were worried the survey might fail due to under-contribution or non-participation
from the community. Success of the experiment lay in a proper execution of the
promotional campaign (Butler cited in Chen et al., 2010). No monetary incentives were
offered to potential participants as the Dota community relies on voluntary contribution of
time and effort rather than monetary encouragement. Such a move could have resulted in
adverse reputation effects and a skewed sample. Users were instead prompted to rank
more than one match and reminded that doing so generates a higher public benefit (Chen et
al., 2010). If tracked, rating would become an impure public good as users could signal their
competence to the community and we could create a database of rating history and possibly
generate future match predictions for betting on matches for the users.
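The weighted Kappa mentioned above can be sketched for two identifiable raters as follows. This is the standard linearly weighted Cohen's kappa, offered as an illustration only: the survey could not track users, so the ratings below are hypothetical, and the seven-point category list simply reflects the suggested reduction of the scale maximum by three points.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Linearly weighted Cohen's kappa for two raters over ordered categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    observed_disagreement = expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # larger gaps count as stronger disagreement
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


scale = list(range(1, 8))  # hypothetical seven-point scale
print(weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], scale))
print(round(weighted_kappa([7, 6, 3, 5, 2], [6, 6, 4, 5, 1], scale), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, with near misses penalised less than distant ratings.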

The more popular matches streamed through Twitch were expected to be rated more often
than the matches played simultaneously, endangering interpretation accuracy for the
unbroadcast match (Ludford et al., 2004). A key challenge to prevent this from


happening was to motivate the community members who hold a fan allegiance to such
teams. The only measure of success was the sample size for each match, which was also
affected by other factors. Frequent users were more likely to encounter the survey
promotion and participate in studies on Dota due to community affinity, creating a
volunteer effect and changing their ranking behaviour because of this. This was to some
extent controlled by using an MMR proxy (Rosenthal and Rosnow, 2009). There is no way
of knowing if similarly skilled users averaged on a performance rank at most one standard
deviation away from the true rating expected by the information asymmetry assumptions.
Although directly approached, no professional players have provided their peer-rating to
estimate rp. Further optimisations of the model were therefore not possible.
The bias that similar MMR users would not coordinate on a similar rating due to
different fandom preferences would result in a skewed ratings sample for particular fan
favourite players (Sugden, 1995). No such skewness has been recorded with any player.
Overall, we wanted to follow the best practice set by Piech et al. (2013), whereby each user's
ranking pattern is pre-calibrated and, instead of ranking any number of games, she is
assigned a randomly selected list of players to rank (one of which has been previously
ranked by a professional player). The voluntary spirit of the survey and the lack of
responses from professional players did not allow us to replicate the concept.

7.3 Results

All the processed results can be found in Appendix A. Only a selection of results is
interpreted in the main text. The distribution of ranking scale frequencies in the gathered
sample is shown in figure 9. A notable feature is the overuse of the maximum
rank compared to other values, which suggests either that respondents could identify the
comparatively higher competence level of players or that a popularity bias was present. In
contrast, the usage of ranks 2 to 4 was rare. The distribution suggests respondents had difficulty
differentiating between similar rankings and selected the end-point ranks most often.

Figure 9

Next, we examined the frequency of ratings given by game result for all
sampled games, shown in figure 10. A test of independence showed that a
player's rating is strongly associated with winning or losing a match, with a significant difference
(P < 0.001) between the ratings of winners and losers.

Figure 10

The ranking frequency distribution resulted in the univariate statistics depicted in figure
11. We consider the data approximately symmetrical due to comparable values of mean and median,
skewness and kurtosis and the Q1 to Q3 range. The values are visualised on a box plot in
figure 12.
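
The univariate statistics can be re-derived from the rating frequency distribution listed in Appendix A. The following is a minimal sketch, not the statistical package used for the dissertation. It assumes that the interval "0 1" in the frequency table holds rating 1, "1 2" holds rating 2, and so on; the function names are illustrative only.

```python
import math

# Rating frequencies for ratings 1..10, copied from the frequency
# distribution in Appendix A. Assumption: interval "0 1" is rating 1,
# interval "1 2" is rating 2, and so on up to rating 10.
FREQ = {1: 780, 2: 238, 3: 300, 4: 379, 5: 555,
        6: 652, 7: 736, 8: 731, 9: 615, 10: 1042}


def summarise(freq):
    """Return (n, mean, sample standard deviation, skewness)."""
    n = sum(freq.values())
    mean = sum(r * f for r, f in freq.items()) / n
    ss = sum(f * (r - mean) ** 2 for r, f in freq.items())
    sd = math.sqrt(ss / (n - 1))
    # Population-form skewness; packages may apply a small-sample
    # correction, so this value is only approximate.
    m2 = ss / n
    m3 = sum(f * (r - mean) ** 3 for r, f in freq.items()) / n
    return n, mean, sd, m3 / m2 ** 1.5


def quantile(freq, p):
    """Smallest rating whose cumulative count reaches the share p."""
    n = sum(freq.values())
    cum = 0
    for rating in sorted(freq):
        cum += freq[rating]
        if cum >= p * n:
            return rating


n, mean, sd, skew = summarise(FREQ)
q1, median, q3 = (quantile(FREQ, p) for p in (0.25, 0.5, 0.75))
print(n, round(mean, 3), round(sd, 3), round(skew, 2), q1, median, q3)
```

Under these assumptions the sketch reproduces the mean, standard deviation, median and quartiles reported in Appendix A. The mean lies below the median and the skewness is negative, which is worth keeping in mind when reading the symmetry claim.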

Figure 11

Figure 12

Continuous analysis of central fit, shown in figure 13, yields an overall rating mean
of 6.2 with a standard deviation of 3.0 across 6028 valid cases. The games are listed in
ascending order by time. The spread of responses closely follows which games were
broadcast through Twitch, which in turn correlates with team popularity. Generally, the
top teams have the most fans and what are perceived as the best-quality games, so the popularity
bias could not be eliminated at this stage. Game 1420622700 had the most responses: a
relatively weaker Team Malaysia [51] knocked the tournament favourite (Team Secret) into
the loser bracket. The game also took place after the promotional post reached the
front page of the Dota subreddit, which is consistent with the survey participation assumption. The
inverse is apparent from the small number of responses to the last four matches recorded,
when the survey was no longer being promoted.

Figure 13

[51] Rank 8 on Gosugamers

The contingency graph below (figure 14) shows how the unexpected result of this game affected
rankings with respect to the selected MMR. Team Secret curiously received 232 more ratings
than Team Malaysia. Respondents who selected their current MMR as 6001+ were far
more critical of all players than the other MMR groups. Although Secret was the most popular
team, its players received low scores from every MMR group.

Figure 14: Rating by MMR and result

Additional central fit analysis explored the effect of match result on rankings. We
report a statistically significant difference (P < 0.001) between the two result states, shown in
figures 15 and 16 respectively. This is an unsurprising result, given that winning is the
ultimate goal of any competitive game and succeeding usually means performing better. As
exceptions, some winning teams did not receive scores as high as others,
suggesting that if no outstanding performance can be seen, viewers coordinate on a
lower rank. In contrast, game 1422859350 between Vici Gaming and Cloud9 recorded an
unusually high mean rating for the losing team. This exception is probably due to a
small sample size. More details on the relations in this section can be found in the
Wilcoxon tests in Appendix A.
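
The rank sum statistics reported in Appendix A can be checked with the usual normal approximation. The sketch below is an illustration rather than the procedure of the original software. It takes the sum of ranks T and the standard deviation S as printed in the appendix, and recomputes only the null mean E = n_a(n_a + n_b + 1)/2. The function name and argument names are hypothetical.

```python
import math


def rank_sum_z(n_a, n_b, t_a, s_reported=None):
    """Normal approximation for the Wilcoxon rank sum of sample A.

    t_a is the sum of ranks of sample A. If s_reported is given it is
    used as the null standard deviation (for example a tie-corrected
    value printed by a statistics package); otherwise the uncorrected
    formula is used.
    """
    total = n_a + n_b
    expected = n_a * (total + 1) / 2
    s_no_ties = math.sqrt(n_a * n_b * (total + 1) / 12)
    s = s_reported if s_reported is not None else s_no_ties
    z = (t_a - expected) / s
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return expected, s_no_ties, z, p_two_sided


# Win (n = 2852) against loss (n = 3176), values as printed in Appendix A.
e_win, s_plain, z_win, p_win = rank_sum_z(2852, 3176, 10766816.5,
                                          s_reported=66970.0395)

# 2001 against 3001 MMR group, values as printed in Appendix A.
e_mmr, _, z_mmr, p_mmr = rank_sum_z(154, 553, 51605.5,
                                    s_reported=2221.6147)
print(e_win, round(z_win, 1), e_mmr, round(z_mmr, 2), round(p_mmr, 3))
```

The uncorrected standard deviation comes out slightly larger than the printed S, which is what a correction for tied ratings would produce on a ten-point scale.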

Figure 15: Win result

Figure 16: Loss result

Figure 17 shows the average rating each team received. No emphasis was given to
ranking teams in the survey, so the figure only shows the collective rankings of the players
included in each team. The day after the survey ended, Vici Gaming won the tournament.
While they received few ratings, their performance was rated as one of the best
during the first two days of the tournament. Many fans expected Team Secret to be close
contenders, but they were eliminated with a 1-4 record, and their collective rating
reflects the frustration of fans.
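
Because each team average pools ratings across matches, matches with many responses dominate the result. A small sketch using the Cloud9 rows from Appendix A illustrates this. The per-match means are the rounded values printed in the appendix, so the pooled figure is approximate, and the helper name is hypothetical.

```python
# (number of ratings, mean rating) for each Cloud9 match in Appendix A:
# two wins (447 and 302 ratings) and two losses (20 and 12 ratings).
C9_MATCHES = [(447, 7.6), (302, 7.3), (20, 7.4), (12, 7.0)]


def pooled_mean(groups):
    """Mean over all ratings, weighting each match by its rating count."""
    total = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total


weighted = pooled_mean(C9_MATCHES)
unweighted = sum(m for _, m in C9_MATCHES) / len(C9_MATCHES)
print(round(weighted, 1), round(unweighted, 3))
```

The weighted value rounds to the 7.5 reported for C9 in the team average table, while the plain average of the four match means is lower because the two sparsely rated losses count as much as the two heavily rated wins.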

Figure 17

We observed statistically significant differences (P < 0.001) in the way the highest MMR
group ranked players compared to all the other MMR groups. The finding suggests the third
level of information asymmetry threshold in 4 exists somewhere between 5000 and 6000
MMR, above which highly competent players start to evaluate game performance in a
significantly different way. The only other significant difference was flagged between the
2001 and 3001 MMR groups, although the two-sided P-value reported for that pair in
Appendix A (0.1902) is well above conventional significance levels, so we consider it a
false positive. We were unable to determine its cause but it is a Popperian duty to report it.
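
One way to guard against false positives when 15 pairwise tests are run is a Bonferroni correction. This was not part of the original analysis; the sketch below simply applies it to the two-sided P-values printed in Appendix A, with an assumed family-wise level of 0.05.

```python
# Two-sided P-values of the pairwise Wilcoxon tests in Appendix A,
# keyed by the two MMR groups compared.
P_VALUES = {
    (1001, 2001): 0.9025, (1001, 3001): 0.3289, (1001, 4001): 0.6753,
    (1001, 5001): 0.8645, (1001, 6001): 0.0001, (2001, 3001): 0.1902,
    (2001, 4001): 0.6417, (2001, 5001): 0.8836, (2001, 6001): 0.0000,
    (3001, 4001): 0.2220, (3001, 5001): 0.2040, (3001, 6001): 0.0000,
    (4001, 5001): 0.6576, (4001, 6001): 0.0000, (5001, 6001): 0.0000,
}


def bonferroni_significant(p_values, alpha=0.05):
    """Pairs whose P-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return sorted(pair for pair, p in p_values.items() if p < threshold)


significant = bonferroni_significant(P_VALUES)
print(significant)
```

With this correction only the five comparisons involving the 6001 group remain significant, which is in line with the interpretation given for H2.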
7.4 Hypothesis testing

H1: Based on the Wilcoxon tests in Appendix A, we have sufficient grounds
to accept the H1 claim.
H2: As shown by the Wilcoxon tests, the ratings of 6000+ MMR players differ significantly
from those of lower MMR groups, therefore we can accept the H2 claim.
H3: The findings visualised in figure 14 and the χ² independence results in Appendix A
indicate that winners are consistently given higher ratings than losers (P < 0.001), therefore we
can accept H3. The size of the sample is enough to minimise the assumed popularity bias (bu)
as theorised in 6.3 and 7.2.
H4: Due to the experiment design, the circumstances and the types of data
gathered, we are unable to assess this hypothesis in sufficient depth. The site
lacks a sign-in option for users, so we were unable to track the additional factors
(apart from MMR and rating given) required to define the features of H4.

8. Conclusion and recommendations


We struggled to collect the right data because of the survey design, and specifically
because writing the correct code was not prioritised. Offering a user-friendly and functional
webpage took priority over optimising backend features to explore more than two
factors drawn from theory, as receiving no community responses would have been devastating.
However, we were pleasantly surprised by the overwhelming response received
from the community, which was eager not only to participate but also to help and
collaboratively improve the survey. Judging from the comments and messages received, the
author is not the only Dota player concerned with the lack of evaluation tools. As we have not
succeeded in replicating even the simplest version of the model used by Piech et al. (2013),
further work is required. On the
non-professional level, Dota lacks evaluation of performance based on anything other
than numerical data. Nevertheless, the argument that a team that performs badly but still
manages to win has still performed better than the losing team, because it won, is confirmed
by H3. However, the currently tracked statistics do not give any weight to individual
contribution to a match result. The personality of a player is another factor affecting a
match, as different people approach the game differently. If measurable, improvements
that address play style differences and stop players wrongly signalling their competence level
would further improve the matchmaking system. Subjective evaluation of skill has been
shown to be affected by factors other than MMR and should be studied. Additional
evaluation options will have to be attempted. The paper indirectly hints at what they might
be, but the scope of the study does not allow for further testing. There is definitely room for
improvement.
The professional level suffers less from the information inefficiency factors, as
growth in competence is shown to correlate with a better informed player (H2).
Competitive gameplay would still benefit from applied study of possible measurement
tools, in order to further improve and provide higher provision for fans, subsequently
minimising the third tier information asymmetry.
Even though the Dota community is anonymous compared to the students involved
in the original PG1 experiments, it seems the Dota community is drawn together by
a common passion. The results suggest that the community's primary concern, that the
survey would only act as a popularity contest, was minimised with the relatively small
sample size (as stated in 6.3).
There is a long way to go to achieve higher provision via peer-evaluation, but probing
collective knowledge is a good place to start. We are now developing a second version of
the website to optimise data gathering over a longer timeframe and across more tournaments,
so that we can stratify the data and capture additional parameters that affect ambiguous
endogenous competence, aiming to achieve PG1 optimisation.

9. Appendix A
(Due to the margin standards set by the coordination, the graphs section cannot be
perfectly centred on the page, for which we would like to apologise. It annoys
the author as much as the reader.)
Statistics for raterRating

Mean                   6.190113
Std Error Mean         0.038456
Std Deviation          2.985728
C.O.V.                 0.482338
Skewness              -0.406538
Kurtosis              -1.024626
Minimum                1
Maximum                10
Geometric Mean         5.068968
Std Error G.Mean       0.04828
Harmonic Mean          3.582832
Std Error H.Mean       0.048175
Median                 7
Approx SE of Median    0.0644
IQR                    5
IQR/Median             0.714286
Q1                     4
Q3                     9
Range                  9
Midrange               5.5
Valid cases            6028
Missing cases          0

Rating range distribution

Frequency Distribution, variable raterRating

Interval   Absolute Freq   Relative Freq (Pct)   Cum Freq (Pct)
0  1            780             12.940               12.940
1  2            238              3.948               16.888
2  3            300              4.977               21.865
3  4            379              6.287               28.152
4  5            555              9.207               37.359
5  6            652             10.816               48.175
6  7            736             12.210               60.385
7  8            731             12.127               72.512
8  9            615             10.202               82.714
9 10           1042             17.286              100.000
Total          6028            100.000

Valid cases = 6028

Missing cases =

Team rating frequency

Table of TeamID by Result

TeamID     Result   Freq   Percent   Row Pct   Col Pct
Alliance   loss      294     4.88     88.02      9.26
Alliance   win        40     0.66     11.98      1.40
Alliance   total     334     5.54
C9         loss       32     0.53      4.10      1.01
C9         win       749    12.43     95.90     26.26
C9         total     781    12.96
IG         loss        0     0.00      0.00      0.00
IG         win       280     4.64    100.00      9.82
IG         total     280     4.64
LC         loss      290     4.81    100.00      9.13
LC         win         0     0.00      0.00      0.00
LC         total     290     4.81
MY         loss      322     5.34     21.35     10.14
MY         win      1186    19.67     78.65     41.58
MY         total    1508    25.02
Secret     loss     1535    25.46     84.76     48.33
Secret     win       276     4.58     15.24      9.68
Secret     total    1811    30.04
TT         loss      703    11.66     75.03     22.13
TT         win       234     3.88     24.97      8.20
TT         total     937    15.54
VG         loss        0     0.00      0.00      0.00
VG         win        87     1.44    100.00      3.05
VG         total      87     1.44
Total      loss     3176    52.69
Total      win      2852    47.31
Total      total    6028   100.00

Is the match result independent of team?
(is team performance equal?)

Statistic                      DF   Value      Prob
Chi-Square                     7    3104.988   0.000
Likelihood Ratio Chi-Square    7    3663.826   0.000
Mantel-Haenszel Chi-Square     1     568.236   0.000
Phi Coefficient                        0.718
Contingency Coefficient                0.583
Cramer's V                             0.718

Missing cases = 0
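
The Pearson chi-square and Cramer's V above can be recomputed from the loss and win counts in the Table of TeamID by Result. The following is a plain-Python sketch of the standard formulas, not the output of the original package; the function name is hypothetical.

```python
import math

# (loss, win) rating counts per team, from the Table of TeamID by Result.
COUNTS = {
    "Alliance": (294, 40), "C9": (32, 749), "IG": (0, 280),
    "LC": (290, 0), "MY": (322, 1186), "Secret": (1535, 276),
    "TT": (703, 234), "VG": (0, 87),
}


def chi_square(table):
    """Pearson chi-square, degrees of freedom and Cramer's V."""
    rows = list(table.values())
    n_cols = len(rows[0])
    n = sum(sum(r) for r in rows)
    col_totals = [sum(r[j] for r in rows) for j in range(n_cols)]
    chi2 = 0.0
    for r in rows:
        row_total = sum(r)
        for j, observed in enumerate(r):
            expected = row_total * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (n_cols - 1)
    v = math.sqrt(chi2 / (n * min(len(rows) - 1, n_cols - 1)))
    return chi2, dof, v


chi2, dof, cramers_v = chi_square(COUNTS)
print(round(chi2, 1), dof, round(cramers_v, 3))
```

Note that the units counted here are individual ratings rather than matches, so the statistic describes how ratings are distributed over teams and results, not the win rate of each team.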

Overall average of rating

Summary of Analysis Variables

Variable      Mean   Std Deviation   Valid cases
raterRating   6.2    3.0             6028

Match average ratings

Analysis variable: raterRating

MatchID      Freq   Mean   Std Deviation   Valid cases
1420140358    799   6.8    2.6              799
1420142374     76   5.8    3.5               76
1420243583     36   5.7    4.0               36
1420367758    653   5.4    2.9              653
1420622700   2148   6.2    3.0             2148
1420633610    379   5.6    3.5              379
1420816727    598   6.5    2.7              598
1420832492     89   4.3    3.1               89
1421007628    518   6.4    2.6              518
1421280530    420   6.2    3.1              420
1421421139    154   6.0    3.2              154
1422406844     32   5.4    2.9               32
1422527977     63   6.3    3.2               63
1422859350     36   8.0    2.3               36
1423036681     27   7.2    1.5               27

Match rating in relation to the result

Analysis variable: raterRating

MatchID      Result   Freq   Mean   Std Deviation   Valid cases
1420140358   loss      352   5.9    2.7              352
1420140358   win       447   7.6    2.3              447
1420142374   loss       40   4.4    3.2               40
1420142374   win        36   7.4    3.2               36
1420243583   loss       16   4.2    3.9               16
1420243583   win        20   6.9    3.8               20
1420367758   loss      351   3.8    2.1              351
1420367758   win       302   7.3    2.6              302
1420622700   loss     1190   4.9    2.7             1190
1420622700   win       958   7.8    2.6              958
1420633610   loss      195   5.2    3.5              195
1420633610   win       184   5.9    3.4              184
1420816727   loss      322   5.9    2.6              322
1420816727   win       276   7.3    2.6              276
1420832492   loss       39   4.7    3.1               39
1420832492   win        50   4.0    3.1               50
1421007628   loss      290   5.1    2.3              290
1421007628   win       228   8.1    2.0              228
1421280530   loss      230   5.4    3.0              230
1421280530   win       190   7.3    2.8              190
1421421139   loss       64   4.6    3.0               64
1421421139   win        90   7.0    2.9               90
1422406844   loss       20   5.3    3.0               20
1422406844   win        12   5.5    2.8               12
1422527977   loss       35   5.6    3.2               35
1422527977   win        28   7.1    3.2               28
1422859350   loss       20   7.4    2.8               20
1422859350   win        16   8.8    1.4               16
1423036681   loss       12   7.0    1.7               12
1423036681   win        15   7.4    1.4               15

Is there a difference between win/loss rating?

Rank Sum (Wilcoxon) Test for Variables:
raterRating (loss), N = 3176
raterRating (win), N = 2852

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (loss): 424, 194, 358, 439, 461, 339, 270, 148, 543
raterRating (win): 1233, 537, 378, 213, 94, 40, 30, 90, 237

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (win) than for variable
raterRating (loss) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0000.
Sum of ranks for Variable raterRating (win): T = 10766816.5
Mean and Standard Deviation under null hypothesis: E = 8597354, S = 66970.0395

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (loss)   5               0.071                        3176
raterRating (win)    8               0.0562                       2852

Teams in match identification (same as win/loss, but with teams)

Analysis variable: raterRating

MatchID      Result   TeamID     Freq   Mean   Std Deviation   Valid cases
1420140358   loss     TT          352   5.9    2.7              352
1420140358   win      C9          447   7.6    2.3              447
1420142374   loss     LC           40   4.4    3.2               40
1420142374   win      VG           36   7.4    3.2               36
1420243583   loss     LC           16   4.2    3.9               16
1420243583   win      VG           20   6.9    3.8               20
1420367758   loss     TT          351   3.8    2.1              351
1420367758   win      C9          302   7.3    2.6              302
1420622700   loss     Secret     1190   4.9    2.7             1190
1420622700   win      MY          958   7.8    2.6              958
1420633610   loss     LC          195   5.2    3.5              195
1420633610   win      TT          184   5.9    3.4              184
1420816727   loss     MY          322   5.9    2.6              322
1420816727   win      Secret      276   7.3    2.6              276
1420832492   loss     LC           39   4.7    3.1               39
1420832492   win      TT           50   4.0    3.1               50
1421007628   loss     Secret      290   5.1    2.3              290
1421007628   win      MY          228   8.1    2.0              228
1421280530   loss     Alliance    230   5.4    3.0              230
1421280530   win      IG          190   7.3    2.8              190
1421421139   loss     Alliance     64   4.6    3.0               64
1421421139   win      IG           90   7.0    2.9               90
1422406844   loss     Secret       20   5.3    3.0               20
1422406844   win      Alliance     12   5.5    2.8               12
1422527977   loss     Secret       35   5.6    3.2               35
1422527977   win      Alliance     28   7.1    3.2               28
1422859350   loss     C9           20   7.4    2.8               20
1422859350   win      VG           16   8.8    1.4               16
1423036681   loss     C9           12   7.0    1.7               12
1423036681   win      VG           15   7.4    1.4               15

Team average rating

Analysis variable: raterRating

TeamID     Freq   Mean   Std Deviation   Valid cases
Alliance    334   5.4    3.1              334
C9          781   7.5    2.4              781
IG          280   7.2    2.8              280
LC          290   5.0    3.4              290
MY         1508   7.5    2.6             1508
Secret     1811   5.3    2.8             1811
TT          937   5.0    2.9              937
VG           87   7.5    2.9               87

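How raterMMR affects the rating

Analysis variable: raterRating
(a dash marks a value that is not legible in the source output)

MatchID      Result   TeamID     raterMMR   Freq   Mean   Std Deviation   Valid cases
1420140358   loss     TT         3001         36    7.0   1.8               36
1420140358   loss     TT         4001         40    5.5   2.7               40
1420140358   loss     TT         5001         32    6.2   2.4               32
1420140358   loss     TT         6001          4    7.8   1.3                4
1420140358   win      C9         3001         45    7.6   1.4               45
1420140358   win      C9         4001         50    8.4   1.4               50
1420140358   win      C9         5001         41    7.3   2.6               41
1420140358   win      C9         6001          5    7.6   1.5                5
1420142374   loss     LC         3001          5    3.4   2.5                5
1420142374   loss     LC         4001         10    5.7   1.9               10
1420142374   loss     LC         5001          5    1.0   0.0                5
1420142374   win      VG         3001          4    6.8   1.5                4
1420142374   win      VG         4001          8    6.1   2.9                8
1420142374   win      VG         5001          4    1.0   0.0                4
1420367758   loss     TT         1001         15    5.6   1.6               15
1420367758   loss     TT         2001         11    4.9   1.8               11
1420367758   loss     TT         3001         15    5.1   1.0               15
1420367758   loss     TT         4001         40    4.1   2.4               40
1420367758   loss     TT         5001         25    4.7   1.8               25
1420367758   loss     TT         6001          5    4.2   0.4                5
1420367758   win      C9         1001         12    7.8   1.2               12
1420367758   win      C9         2001          8    7.5   1.4                8
1420367758   win      C9         3001         16    7.9   1.1               16
1420367758   win      C9         4001         31    7.9   2.7               31
1420367758   win      C9         5001         21    8.1   1.0               21
1420367758   win      C9         6001          4    8.0   0.8                4
1420622700   loss     Secret     1001          9    1.4   0.5                9
1420622700   loss     Secret     2001         25    5.4   1.7               25
1420622700   loss     Secret     3001        132    5.6   2.5              132
1420622700   loss     Secret     4001        135    5.1   2.7              135
1420622700   loss     Secret     5001         67    5.2   2.7               67
1420622700   loss     Secret     6001         50    3.1   3.5               50
1420622700   win      MY         1001          7    6.6   4.3                7
1420622700   win      MY         2001         20    9.0   1.1               20
1420622700   win      MY         3001        113    8.3   2.0              113
1420622700   win      MY         4001        108    8.3   1.8              108
1420622700   win      MY         5001         49    6.6   2.6               49
1420622700   win      MY         6001         40    5.8   4.1               40
1420633610   loss     LC         2001         15    3.9   2.6               15
1420633610   loss     LC         3001         18    4.3   2.2               18
1420633610   loss     LC         4001         10    5.5   4.7               10
1420633610   loss     LC         5001         10    4.1   1.5               10
1420633610   loss     LC         6001         20    3.1   3.6               20
1420633610   win      TT         2001         12    4.8   3.0               12
1420633610   win      TT         3001         16    7.4   1.8               16
1420633610   win      TT         4001         12    5.2   3.9               12
1420633610   win      TT         5001          8    8.1   1.5                8
1420633610   win      TT         6001         16    1.0   0.0               16
1420816727   loss     MY         2001         15    5.9   1.6               15
1420816727   loss     MY         3001         15    6.9   2.4               15
1420816727   loss     MY         4001         30    5.7   1.7               30
1420816727   loss     MY         5001         10    6.7   3.6               10
1420816727   loss     MY         6001          5    4.4   1.5                5
1420816727   win      Secret     2001         12    7.8   1.3               12
1420816727   win      Secret     3001         12    8.1   1.3               12
1420816727   win      Secret     4001         32    7.9   1.5               32
1420816727   win      Secret     5001          8    8.6   1.8                8
1420816727   win      Secret     6001          4    8.8   0.5                4
1420832492   loss     LC         1001          4    5.5   1.3                4
1420832492   loss     LC         6001          4    2.2   0.5                4
1420832492   win      TT         1001          5    6.2   1.3                5
1420832492   win      TT         6001          5    1.4   0.5                5
1421007628   loss     Secret     1001          5    9.2   0.8                5
1421007628   loss     Secret     2001          5    6.2   2.2                5
1421007628   loss     Secret     3001         35    4.7   1.8               35
1421007628   loss     Secret     4001         40    5.1   1.6               40
1421007628   loss     Secret     6001          5    4.0   2.4                5
1421007628   win      MY         1001          4   10.0   0.0                4
1421007628   win      MY         2001          4    7.5   2.9                4
1421007628   win      MY         3001         28    7.5   1.6               28
1421007628   win      MY         4001         32    7.6   1.3               32
1421007628   win      MY         6001          4    7.2   2.1                4
1421280530   loss     Alliance   2001         10    8.5   1.5               10
1421280530   loss     Alliance   3001         15    3.7   2.0               15
1421280530   loss     Alliance   4001         25    5.0   1.7               25
1421280530   loss     Alliance   5001         25    4.6   3.3               25
1421280530   loss     Alliance   6001          5    1.0   0.0                5
1421280530   win      IG         2001          8    4.8   4.1                8
1421280530   win      IG         3001         12    6.8   2.2               12
1421280530   win      IG         4001         20    8.1   1.3               20
1421280530   win      IG         5001         20    8.2   1.9               20
1421280530   win      IG         6001          4    5.5   5.2                4
1421421139   loss     Alliance   3001          8    3.7   2.0                8
1421421139   loss     Alliance   5001         12    2.7   2.5               12
1421421139   win      IG         3001         10    7.5   1.2               10
1421421139   win      IG         5001         15    6.5   4.1               15
1422406844   loss     Secret     6001          -    9.0   0.0                -
1422406844   win      Alliance   6001          -    2.0   0.0                -
1422527977   loss     Secret     2001          5    4.8   2.0                5
1422527977   loss     Secret     4001         10    2.2   1.7               10
1422527977   loss     Secret     5001         15    7.5   2.7               15
1422527977   win      Alliance   2001          4    8.2   1.3                4
1422527977   win      Alliance   4001          8    6.2   2.1                8
1422527977   win      Alliance   5001         12    6.8   4.3               12
1422859350   loss     C9         3001          5    6.4   0.5                5
1422859350   loss     C9         5001          5   10.0   0.0                5
1422859350   win      VG         3001          4    7.5   1.0                4
1422859350   win      VG         5001          4   10.0   0.0                4
1423036681   loss     C9         3001          -    7.2   2.2                -
1423036681   win      VG         3001          5    6.8   1.3                5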

Note: Local selection: Cases tested = 6028, selected = 1986
Criterion: raterMMR>0

Analysis variable: raterRating

raterMMR   Freq   Mean   Std Deviation   Valid cases
1001         61   6.1    2.9               61
2001        154   6.3    2.6              154
3001        553   6.6    2.4              553
4001        641   6.4    2.7              641
5001        388   6.2    3.0              388
6001        189   4.2    3.7              189

Rank Sum (Wilcoxon) Test for Variables:
raterRating (1001), N = 61
raterRating (2001), N = 154

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (1001): 15, 6, 10, 7, 7, 2, 2, 7, 5
raterRating (2001): 35, 21, 21, 21, 25, 6, 11, 0, 14

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (2001) than for variable
raterRating (1001) would be approximately:
One-sided P-value = 0.4512.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.9025.
Sum of ranks for Variable raterRating (1001): T = 6537.5
Mean and Standard Deviation under null hypothesis: E = 6588, S = 408.116066

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (1001)   7               0.5762                       61
raterRating (2001)   6.5             0.2417                       154

Rank Sum (Wilcoxon) Test for Variables:
raterRating (1001), N = 61
raterRating (3001), N = 553

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (1001): 15, 6, 10, 7, 7, 2, 2, 7, 5
raterRating (3001): 123, 97, 86, 91, 63, 29, 26, 5, 33

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (3001) than for variable
raterRating (1001) would be approximately:
One-sided P-value = 0.1645.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.3289.
Sum of ranks for Variable raterRating (1001): T = 17485
Mean and Standard Deviation under null hypothesis: E = 18757.5, S = 1302.88428

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (1001)   7               0.5762                       61
raterRating (3001)   7               0.1276                       553

Rank Sum (Wilcoxon) Test for Variables:
raterRating (1001), N = 61
raterRating (4001), N = 641

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (1001): 15, 6, 10, 7, 7, 2, 2, 7, 5
raterRating (4001): 168, 82, 81, 82, 74, 56, 25, 21, 52

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (4001) than for variable
raterRating (1001) would be approximately:
One-sided P-value = 0.3377.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.6753.
Sum of ranks for Variable raterRating (1001): T = 20811.5
Mean and Standard Deviation under null hypothesis: E = 21441.5, S = 1502.83611

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (1001)   7               0.5762                       61
raterRating (4001)   7               0.158                        641

Rank Sum (Wilcoxon) Test for Variables:
raterRating (1001), N = 61
raterRating (5001), N = 388

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (1001): 15, 6, 10, 7, 7, 2, 2, 7, 5
raterRating (5001): 108, 59, 37, 30, 39, 32, 20, 9, 54

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (5001) than for variable
raterRating (1001) would be approximately:
One-sided P-value = 0.4323.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.8645.
Sum of ranks for Variable raterRating (1001): T = 13565
Mean and Standard Deviation under null hypothesis: E = 13725, S = 934.754945

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (1001)   7               0.5762                       61
raterRating (5001)   7               0.2538                       388

Rank Sum (Wilcoxon) Test for Variables:
raterRating (1001), N = 61
raterRating (6001), N = 189

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (1001): 15, 6, 10, 7, 7, 2, 2, 7, 5
raterRating (6001): 47, 9, 8, 4, 5, 6, 6, 16, 88

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (1001) than for variable
raterRating (6001) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0001.
Sum of ranks for Variable raterRating (1001): T = 9583
Mean and Standard Deviation under null hypothesis: E = 7655.5, S = 476.647544

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (1001)   7               0.5762                       61
raterRating (6001)   2               0.5455                       189

Rank Sum (Wilcoxon) Test for Variables:
raterRating (2001), N = 154
raterRating (3001), N = 553

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (2001): 35, 21, 21, 21, 25, 6, 11, 0, 14
raterRating (3001): 123, 97, 86, 91, 63, 29, 26, 5, 33

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (3001) than for variable
raterRating (2001) would be approximately:
One-sided P-value = 0.0951.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.1902.
Sum of ranks for Variable raterRating (2001): T = 51605.5
Mean and Standard Deviation under null hypothesis: E = 54516, S = 2221.6147

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (2001)   6.5             0.2417                       154
raterRating (3001)   7               0.1276                       553

Rank Sum (Wilcoxon) Test for Variables:
raterRating (2001), N = 154
raterRating (4001), N = 641

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (2001): 35, 21, 21, 21, 25, 6, 11, 0, 14
raterRating (4001): 168, 82, 81, 82, 74, 56, 25, 21, 52

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (4001) than for variable
raterRating (2001) would be approximately:
One-sided P-value = 0.3208.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.6417.
Sum of ranks for Variable raterRating (2001): T = 60109
Mean and Standard Deviation under null hypothesis: E = 61292, S = 2540.88318

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (2001)   6.5             0.2417                       154
raterRating (4001)   7               0.158                        641

Rank Sum (Wilcoxon) Test for Variables:
raterRating (2001), N = 154
raterRating (5001), N = 388

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (2001): 35, 21, 21, 21, 25, 6, 11, 0, 14
raterRating (5001): 108, 59, 37, 30, 39, 32, 20, 9, 54

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (5001) than for variable
raterRating (2001) would be approximately:
One-sided P-value = 0.4418.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.8836.
Sum of ranks for Variable raterRating (2001): T = 41571.5
Mean and Standard Deviation under null hypothesis: E = 41811, S = 1632.14214

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (2001)   6.5             0.2417                       154
raterRating (5001)   7               0.2538                       388

Rank Sum (Wilcoxon) Test for Variables:
raterRating (2001), N = 154
raterRating (6001), N = 189

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (2001): 35, 21, 21, 21, 25, 6, 11, 0, 14
raterRating (6001): 47, 9, 8, 4, 5, 6, 6, 16, 88

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (2001) than for variable
raterRating (6001) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0000.
Sum of ranks for Variable raterRating (2001): T = 31273.5
Mean and Standard Deviation under null hypothesis: E = 26488, S = 898.495208

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (2001)   6.5             0.2417                       154
raterRating (6001)   2               0.5455                       189

Rank Sum (Wilcoxon) Test for Variables:
raterRating (3001), N = 553
raterRating (4001), N = 641

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (3001): 123, 97, 86, 91, 63, 29, 26, 5, 33
raterRating (4001): 168, 82, 81, 82, 74, 56, 25, 21, 52

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (3001) than for variable
raterRating (4001) would be approximately:
One-sided P-value = 0.1110.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.2220.
Sum of ranks for Variable raterRating (3001): T = 337616.5
Mean and Standard Deviation under null hypothesis: E = 330417.5, S = 5894.83721

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (3001)   7               0.1276                       553
raterRating (4001)   7               0.158                        641

Rank Sum (Wilcoxon) Test for Variables:
raterRating (3001), N = 553
raterRating (5001), N = 388

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (3001): 123, 97, 86, 91, 63, 29, 26, 5, 33
raterRating (5001): 108, 59, 37, 30, 39, 32, 20, 9, 54

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (3001) than for variable
raterRating (5001) would be approximately:
One-sided P-value = 0.1020.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.2040.
Sum of ranks for Variable raterRating (5001): T = 177576.5
Mean and Standard Deviation under null hypothesis: E = 182748, S = 4071.30914

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (3001)   7               0.1276                       553
raterRating (5001)   7               0.2538                       388

Rank Sum (Wilcoxon) Test for Variables:
raterRating (3001), N = 553
raterRating (6001), N = 189

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (3001): 123, 97, 86, 91, 63, 29, 26, 5, 33
raterRating (6001): 47, 9, 8, 4, 5, 6, 6, 16, 88

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (3001) than for variable
raterRating (6001) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0000.
Sum of ranks for Variable raterRating (6001): T = 50947
Mean and Standard Deviation under null hypothesis: E = 70213.5, S = 2523.62289

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (3001)   7               0.1276                       553
raterRating (6001)   2               0.5455                       189

Rank Sum (Wilcoxon) Test for Variables:
raterRating (4001), N = 641
raterRating (5001), N = 388

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (4001): 168, 82, 81, 82, 74, 56, 25, 21, 52
raterRating (5001): 108, 59, 37, 30, 39, 32, 20, 9, 54

There is no significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (4001) than for variable
raterRating (5001) would be approximately:
One-sided P-value = 0.3288.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.6576.
Sum of ranks for Variable raterRating (5001): T = 197786
Mean and Standard Deviation under null hypothesis: E = 199820, S = 4587.58613

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (4001)   7               0.158                        641
raterRating (5001)   7               0.2538                       388

Rank Sum (Wilcoxon) Test for Variables:
raterRating (4001), N = 641
raterRating (6001), N = 189

Histogram of ratings (axis from 10 at the top to 0 at the bottom), counts per bar from top to bottom:
raterRating (4001): 168, 82, 81, 82, 74, 56, 25, 21, 52
raterRating (6001): 47, 9, 8, 4, 5, 6, 6, 16, 88

There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (4001) than for variable
raterRating (6001) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0000.
Sum of ranks for Variable raterRating (6001): T = 57416
Mean and Standard Deviation under null hypothesis: E = 78529.5, S = 2875.1073

Variable             Sample Median   Approximate Standard Error   Sample Size
raterRating (4001)   7               0.158                        641
raterRating (6001)   2               0.5455                       189

Rank Sum (Wilcoxon) Test for Variables:


raterRating (5001)
raterRating (6001)

N = 388
N = 189

raterRating (5001)
10
raterRating (6001)
********************|108 47|********
**********|59
9|*
******|37
8|*
*****|30
4|*
*******|39
5|*
*****|32
6|*
***|20
6|*
*|9
16|**
**********|54
88|****************
0
There is a significant difference between the samples.
If in fact the populations were THE SAME, the chance of this much evidence
of greater values for variable raterRating (5001) than for variable
raterRating (6001) would be approximately:
One-sided P-value = 0.0000.
The chance of this much evidence in EITHER direction would be twice that
value, that is:
Two-sided P-value = 0.0000.
Sum of ranks for Variable raterRating (6001)
T = 43087

Mean and Standard Deviation under null hypothesis
E = 54621
S = 1856.72953

                              raterRating (5001)    raterRating (6001)
Sample Medians                                 7                     2
Approximate Standard Error                0.2538                0.5455
Sample Size                                  388                   189


Description of MMR distribution and effect to rating


raterMMR     Freq   Percent   Cum Freq   Cum Percent
       0     4042     67.05       4042         67.05
    1001       61      1.01       4103         68.07
    2001      154      2.55       4257         70.62
    3001      553      9.17       4810         79.79
    4001      641     10.63       5451         90.43
    5001      388      6.44       5839         96.86
    6001      189      3.14       6028        100.00

Missing cases = 0

Score distribution related to MMR and effect to rating

Table of raterRating by raterMMR

raterRating                        raterMMR
               1001    2001    3001    4001    5001    6001   Total

 1  Freq          5      14      33      52      54      88     246
    Percent    0.25    0.70    1.66    2.62    2.72    4.43   12.39
    Row Pct    2.03    5.69   13.41   21.14   21.95   35.77
    Col Pct    8.20    9.09    5.97    8.11   13.92   46.56

 2  Freq          7       0       5      21       9      16      58
    Percent    0.35    0.00    0.25    1.06    0.45    0.81    2.92
    Row Pct   12.07    0.00    8.62   36.21   15.52   27.59
    Col Pct   11.48    0.00    0.90    3.28    2.32    8.47

 3  Freq          2      11      26      25      20       6      90
    Percent    0.10    0.55    1.31    1.26    1.01    0.30    4.53
    Row Pct    2.22   12.22   28.89   27.78   22.22    6.67
    Col Pct    3.28    7.14    4.70    3.90    5.15    3.17

 4  Freq          2       6      29      56      32       6     131
    Percent    0.10    0.30    1.46    2.82    1.61    0.30    6.60
    Row Pct    1.53    4.58   22.14   42.75   24.43    4.58
    Col Pct    3.28    3.90    5.24    8.74    8.25    3.17

 5  Freq          7      25      63      74      39       5     213
    Percent    0.35    1.26    3.17    3.73    1.96    0.25   10.73
    Row Pct    3.29   11.74   29.58   34.74   18.31    2.35
    Col Pct   11.48   16.23   11.39   11.54   10.05    2.65

 6  Freq          7      21      91      82      30       4     235
    Percent    0.35    1.06    4.58    4.13    1.51    0.20   11.83
    Row Pct    2.98    8.94   38.72   34.89   12.77    1.70
    Col Pct   11.48   13.64   16.46   12.79    7.73    2.12

 7  Freq         10      21      86      81      37       8     243
    Percent    0.50    1.06    4.33    4.08    1.86    0.40   12.24
    Row Pct    4.12    8.64   35.39   33.33   15.23    3.29
    Col Pct   16.39   13.64   15.55   12.64    9.54    4.23

 8  Freq          6      21      97      82      59       9     274
    Percent    0.30    1.06    4.88    4.13    2.97    0.45   13.80
    Row Pct    2.19    7.66   35.40   29.93   21.53    3.28
    Col Pct    9.84   13.64   17.54   12.79   15.21    4.76

 9  Freq          3      17      46      68      38      19     191
    Percent    0.15    0.86    2.32    3.42    1.91    0.96    9.62
    Row Pct    1.57    8.90   24.08   35.60   19.90    9.95
    Col Pct    4.92   11.04    8.32   10.61    9.79   10.05

10  Freq         12      18      77     100      70      28     305
    Percent    0.60    0.91    3.88    5.04    3.52    1.41   15.36
    Row Pct    3.93    5.90   25.25   32.79   22.95    9.18
    Col Pct   19.67   11.69   13.92   15.60   18.04   14.81

Total Freq       61     154     553     641     388     189    1986
    Percent    3.07    7.75   27.84   32.28   19.54    9.52  100.00

Statistic                       DF      Value     Prob
Chi-Square                      45    368.225    0.000
Likelihood Ratio Chi-Square     45    313.363    0.000
Mantel-Haenszel Chi-Square       1     44.031    0.000
Phi Coefficient                         0.431
Contingency Coefficient                 0.395
Cramer's V                              0.193

Warning: 6% of the cells have expected counts less than 5. Chi-square tests may not be valid.
Missing cases = 0

Note: Local selection: Cases tested = 6028, selected = 1986
      Criterion: raterMMR>0
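The Pearson chi-square statistic and the warning about small expected counts can be recomputed from the cell frequencies of the table above. The following is a minimal sketch in plain Python, assuming the frequencies as reported for raters with raterMMR>0; the function name is hypothetical and the package's own routine may differ in detail.

```python
# Observed frequencies: rows are raterRating 1..10, columns are the
# raterMMR brackets 1001, 2001, 3001, 4001, 5001 and 6001.
observed = [
    [5, 14, 33, 52, 54, 88],
    [7, 0, 5, 21, 9, 16],
    [2, 11, 26, 25, 20, 6],
    [2, 6, 29, 56, 32, 6],
    [7, 25, 63, 74, 39, 5],
    [7, 21, 91, 82, 30, 4],
    [10, 21, 86, 81, 37, 8],
    [6, 21, 97, 82, 59, 9],
    [3, 17, 46, 68, 38, 19],
    [12, 18, 77, 100, 70, 28],
]


def pearson_chi_square(table):
    """Return (chi-square, degrees of freedom, cells with expected count < 5, N)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    low_expected = 0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rating and MMR bracket.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
            if expected < 5:
                low_expected += 1
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, low_expected, n


chi2, dof, low_expected, n = pearson_chi_square(observed)
cells = len(observed) * len(observed[0])
print(f"chi-square = {chi2:.3f}, DF = {dof}, N = {n}")
print(f"cells with expected count < 5: {low_expected} of {cells}")
```

The share of cells with small expected counts is what triggers the warning printed under the statistics, so the chi-square result is best read alongside the rank sum tests rather than on its own.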

Table of raterRating by raterMMR

Controlling for Result
Result = loss

raterRating                        raterMMR
               1001    2001    3001    4001    5001    6001   Total

 1  Freq          5       6      29      45      35      51     171
    Percent    0.47    0.57    2.75    4.26    3.31    4.83   16.19
    Row Pct    2.92    3.51   16.96   26.32   20.47   29.82
    Col Pct   15.15    6.98   10.07   13.24   16.99   49.51

 2  Freq          4       0       5      19       7      10      45
    Percent    0.38    0.00    0.47    1.80    0.66    0.95    4.26
    Row Pct    8.89    0.00   11.11   42.22   15.56   22.22
    Col Pct   12.12    0.00    1.74    5.59    3.40    9.71

 3  Freq          2      11      26      21      18       6      84
    Percent    0.19    1.04    2.46    1.99    1.70    0.57    7.95
    Row Pct    2.38   13.10   30.95   25.00   21.43    7.14
    Col Pct    6.06   12.79    9.03    6.18    8.74    5.83

 4  Freq          2       6      23      52      25       6     114
    Percent    0.19    0.57    2.18    4.92    2.37    0.57   10.80
    Row Pct    1.75    5.26   20.18   45.61   21.93    5.26
    Col Pct    6.06    6.98    7.99   15.29   12.14    5.83

 5  Freq          5      21      54      61      28       3     172
    Percent    0.47    1.99    5.11    5.78    2.65    0.28   16.29
    Row Pct    2.91   12.21   31.40   35.47   16.28    1.74
    Col Pct   15.15   24.42   18.75   17.94   13.59    2.91

 6  Freq          5      14      58      52      22       3     154
    Percent    0.47    1.33    5.49    4.92    2.08    0.28   14.58
    Row Pct    3.25    9.09   37.66   33.77   14.29    1.95
    Col Pct   15.15   16.28   20.14   15.29   10.68    2.91

 7  Freq          4      13      42      34      14       1     108
    Percent    0.38    1.23    3.98    3.22    1.33    0.09   10.23
    Row Pct    3.70   12.04   38.89   31.48   12.96    0.93
    Col Pct   12.12   15.12   14.58   10.00    6.80    0.97

 8  Freq          1       8      25      26      21       2      83
    Percent    0.09    0.76    2.37    2.46    1.99    0.19    7.86
    Row Pct    1.20    9.64   30.12   31.33   25.30    2.41
    Col Pct    3.03    9.30    8.68    7.65   10.19    1.94

 9  Freq          3       1       7       9      13      10      43
    Percent    0.28    0.09    0.66    0.85    1.23    0.95    4.07
    Row Pct    6.98    2.33   16.28   20.93   30.23   23.26
    Col Pct    9.09    1.16    2.43    2.65    6.31    9.71

10  Freq          2       6      19      21      23      11      82
    Percent    0.19    0.57    1.80    1.99    2.18    1.04    7.77
    Row Pct    2.44    7.32   23.17   25.61   28.05   13.41
    Col Pct    6.06    6.98    6.60    6.18   11.17   10.68

Total Freq       33      86     288     340     206     103    1056
    Percent    3.12    8.14   27.27   32.20   19.51    9.75  100.00

Statistic                       DF      Value     Prob
Chi-Square                      45    209.189    0.000
Likelihood Ratio Chi-Square     45    202.435    0.000
Mantel-Haenszel Chi-Square       1     17.971    0.000
Phi Coefficient                         0.445
Contingency Coefficient                 0.407
Cramer's V                              0.199

Warning: 20% of the cells have expected counts less than 5. Chi-square tests may not be valid.

Table of raterRating by raterMMR

Controlling for Result
Result = win

raterRating                        raterMMR
               1001    2001    3001    4001    5001    6001   Total

 1  Freq          0       8       4       7      19      37      75
    Percent    0.00    0.86    0.43    0.75    2.04    3.98    8.06
    Row Pct    0.00   10.67    5.33    9.33   25.33   49.33
    Col Pct    0.00   11.76    1.51    2.33   10.44   43.02

 2  Freq          3       0       0       2       2       6      13
    Percent    0.32    0.00    0.00    0.22    0.22    0.65    1.40
    Row Pct   23.08    0.00    0.00   15.38   15.38   46.15
    Col Pct   10.71    0.00    0.00    0.66    1.10    6.98

 3  Freq          0       0       0       4       2       0       6
    Percent    0.00    0.00    0.00    0.43    0.22    0.00    0.65
    Row Pct    0.00    0.00    0.00   66.67   33.33    0.00
    Col Pct    0.00    0.00    0.00    1.33    1.10    0.00

 4  Freq          0       0       6       4       7       0      17
    Percent    0.00    0.00    0.65    0.43    0.75    0.00    1.83
    Row Pct    0.00    0.00   35.29   23.53   41.18    0.00
    Col Pct    0.00    0.00    2.26    1.33    3.85    0.00

 5  Freq          2       4       9      13      11       2      41
    Percent    0.22    0.43    0.97    1.40    1.18    0.22    4.41
    Row Pct    4.88    9.76   21.95   31.71   26.83    4.88
    Col Pct    7.14    5.88    3.40    4.32    6.04    2.33

 6  Freq          2       7      33      30       8       1      81
    Percent    0.22    0.75    3.55    3.23    0.86    0.11    8.71
    Row Pct    2.47    8.64   40.74   37.04    9.88    1.23
    Col Pct    7.14   10.29   12.45    9.97    4.40    1.16

 7  Freq          6       8      44      47      23       7     135
    Percent    0.65    0.86    4.73    5.05    2.47    0.75   14.52
    Row Pct    4.44    5.93   32.59   34.81   17.04    5.19
    Col Pct   21.43   11.76   16.60   15.61   12.64    8.14

 8  Freq          5      13      72      56      38       7     191
    Percent    0.54    1.40    7.74    6.02    4.09    0.75   20.54
    Row Pct    2.62    6.81   37.70   29.32   19.90    3.66
    Col Pct   17.86   19.12   27.17   18.60   20.88    8.14

 9  Freq          0      16      39      59      25       9     148
    Percent    0.00    1.72    4.19    6.34    2.69    0.97   15.91
    Row Pct    0.00   10.81   26.35   39.86   16.89    6.08
    Col Pct    0.00   23.53   14.72   19.60   13.74   10.47

10  Freq         10      12      58      79      47      17     223
    Percent    1.08    1.29    6.24    8.49    5.05    1.83   23.98
    Row Pct    4.48    5.38   26.01   35.43   21.08    7.62
    Col Pct   35.71   17.65   21.89   26.25   25.82   19.77

Total Freq       28      68     265     301     182      86     930
    Percent    3.01    7.31   28.49   32.37   19.57    9.25  100.00

Statistic                       DF      Value     Prob
Chi-Square                      45    270.249    0.000
Likelihood Ratio Chi-Square     45    217.086    0.000
Mantel-Haenszel Chi-Square       1     37.715    0.000
Phi Coefficient                         0.539
Contingency Coefficient                 0.475
Cramer's V                              0.241

Warning: 40% of the cells have expected counts less than 5. Chi-square tests may not be valid.
Missing cases = 0

Note: Local selection: Cases tested = 6028, selected = 1986
      Criterion: raterMMR>0
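The three association measures reported under each table are simple functions of the Pearson chi-square and the number of cases. A short sketch using the textbook definitions follows; it assumes the reported chi-square values and case counts for the 10 x 6 tables, and the function name is hypothetical.

```python
import math


def association_measures(chi2, n, n_rows, n_cols):
    """Phi, contingency coefficient and Cramer's V from a Pearson chi-square."""
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v


# Chi-square values and case counts as reported for the three tables.
reported = {
    "all results": (368.225, 1986),
    "Result = loss": (209.189, 1056),
    "Result = win": (270.249, 930),
}

measures = {}
for label, (chi2, n) in reported.items():
    measures[label] = association_measures(chi2, n, n_rows=10, n_cols=6)
    phi, contingency, cramers_v = measures[label]
    print(f"{label}: phi = {phi:.3f}, C = {contingency:.3f}, V = {cramers_v:.3f}")
```

Because Cramer's V rescales phi by the smaller table dimension, it is the measure that can be compared across the three tables. On that scale the association between rater MMR and rating is somewhat stronger among winners than among losers.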


An asterisk denotes an outlier.


10. References
Abedi, J. (1996). Interrater/Test Reliability System (ITRS). Multivariate Behavioral
Research, 31(4), pp.409-417.
kerblom, A. (2015). syndereN om Kina vs. Vrlden: "I need a bigger sample size" |
Fragbite.se. [online] Fragbite.se. Available at:
http://fragbite.se/fragtv/video/2298/synderen-om-kina-vs-varlden-i-need-a-biggersample-size [Accessed 29 Apr. 2015].
Akerlof, G. (1970). The Market for "Lemons": Quality Uncertainty and the Market
Mechanism. The Quarterly Journal of Economics, 84(3), p.488.
Bacharach, M. (1993). Variable universe games. [S.l.]: [s.n.].
Beyond The Summit, (2014). TI4 Interview with Nahaz (statsman). [online] YouTube.
Available at: https://www.youtube.com/watch?v=K-AFbouKJAQ [Accessed 29 Apr.
2015].
Blog.dota2.com, (2015). Matchmaking | Dota 2. [online] Available at:
http://blog.dota2.com/2013/12/matchmaking/ [Accessed 29 Apr. 2015].
Blog.dotacoach.org, (2015). DotaCoach Blog: Does coaching work? We spent $1,300 to
find out.. [online] Available at: http://blog.dotacoach.org/2015/04/does-coachingwork-we-spent-1300-to.html [Accessed 29 Apr. 2015].
BobRawrley, (2015). 72.5% of all games are in normal bracket, 15.5% in high and 11.9%
in very high - the initial MMR of 2250 seems to have hardly moved /r/DotA2.
[online] reddit. Available at:
http://www.reddit.com/r/DotA2/comments/2wjo81/725_of_all_games_are_in_normal
_bracket_155_in/corh17g [Accessed 29 Apr. 2015].
Chen, Y., Harper, F., Konstan, J. and Li, S. (2010). Social Comparisons and Contributions
to Online Communities: A Field Experiment on MovieLens. American Economic
Review, 100(4), pp.1358-1398.
Cocilova, A. (2015). 10 great PC games with incredibly steep learning curves. [online]
PCWorld. Available at: http://www.pcworld.com/article/2061971/10-great-pc-games-

80

with-incredibly-steep-learning-curves.html#slide2 [Accessed 29 Apr. 2015].


Cutsrock, D. (2014). Ranked Matchmaking - How it is has changed Dota 2 for me.
/r/DotA2. [online] reddit. Available at:
http://www.reddit.com/r/DotA2/comments/1uqg5s/ranked_matchmaking_how_it_is_h
as_changed_dota_2/ [Accessed 29 Apr. 2015].
Datdota.com, (2015). New Stat Preview: Laning (Part 3 in the ****ing Finally Series) datdota.com. [online] Available at: http://www.datdota.com/blog/?p=616 [Accessed
29 Apr. 2015].
Dejong, D., Forsythe, R. and Lundholm, R. (1985). Ripoffs, Lemons, and Reputation
Formation in Agency Relationships: A Laboratory Market Study. The Journal of
Finance, 40(3), p.809.
Dev.dota2.com, (2013). feedback on matchmaking after the lastest patch - Page 2. [online]
Available at:
http://dev.dota2.com/showthread.php?t=98311&page=2&p=663966#post663966
[Accessed 29 Apr. 2015].
DiChristopher, T. (2014). World of Warcraft faces decline as gamers shift. [online] CNBC.
Available at: http://www.cnbc.com/id/102172664 [Accessed 29 Apr. 2015].
Edge, R. (2013). Predicting Player Churn in Multiplayer Games using Goal-Weighted
Empowerment. 13-024. Minnesota: University of Minnesota.
Ericsson, K., Krampe, R. and Tesch-Rmer, C. (1993). The role of deliberate practice in
the acquisition of expert performance. Psychological Review, 100(3), pp.363-406.
e-Sports Earnings, (2015). Top Players For Dota 2 - Competitive Player Rankings :: eSports Earnings. [online] Available at: http://www.esportsearnings.com/games/231dota-2 [Accessed 29 Apr. 2015].
Fletcher, (2013). feedback on matchmaking after the lastest patch - Page 3. [online]
Dev.dota2.com. Available at:
http://dev.dota2.com/showthread.php?t=98311&page=3&p=664114#post664114
[Accessed 29 Apr. 2015].

81

Fletcher, (2013). New Matchmaking. [online] Dev.dota2.com. Available at:


http://dev.dota2.com/showthread.php?t=98317&p=663920#post663920 [Accessed 29
Apr. 2015].
Geere, D. (2015). Dota 2 data yields ideal team composition. [online] PC Gamer. Available
at: http://www.pcgamer.com/dota-2-data-yields-ideal-team-composition/ [Accessed 29
Apr. 2015].
Gestalt, (1999). The OGA. [online] Eurogamer.net. Available at:
http://www.eurogamer.net/articles/oga [Accessed 29 Apr. 2015].
Gosugamers.net, (2015). DotA 2 Rankings Database | GosuGamers. [online] Available at:
http://www.gosugamers.net/dota2/rankings#team [Accessed 29 Apr. 2015].
Haigh, J. (1999). Taking chances. Oxford: Oxford University Press, p.261.
Herren, B. (2015). theScore eSports. [online] Thescoreesports.com. Available at:
http://www.thescoreesports.com/news/152 [Accessed 29 Apr. 2015].
Holt, C. (2007). Markets, games, & strategic behavior. Boston: Pearson Addison Wesley,
pp.123 - 132.
Johnson, V. (1996). On Bayesian Analysis of Multirater Ordinal Data: An Application to
Automated Essay Grading. Journal of the American Statistical Association, 91(433),
p.42.
joinDOTA.com, (2015). DAC Groupstage - ALL the stats right here! News. [online]
Available at: http://www.joindota.com/en/news/25229-dac-groupstage-all-the-statsright-here [Accessed 29 Apr. 2015].
Kelley, J. (2015). [online] Joefkelley.com. Available at:
http://www.joefkelley.com/dota2chart.html [Accessed 29 Apr. 2015].
Latham, A., Patston, L. and Tippett, L. (2013). Just how expert are expert videogame players? Assessing the experience and expertise of video-game players across
action video-game genres. Frontiers in Psychology, 4.
Ling, K., Beenen, G., Ludford, P., Wang, X., Chang, K., Li, X., Cosley, D., Frankowski,

82

D., Terveen, L., Rashid, A., Resnick, P. and Kraut, R. (2005). Using Social
Psychology to Motivate Contributions to Online Communities. Journal of ComputerMediated Communication, 10(4), pp.00-00.
Liquipedia, (2015). Liquipedia Dota 2 Wiki. [online] Wiki.teamliquid.net. Available at:
http://wiki.teamliquid.net/dota2/Main_Page [Accessed 29 Apr. 2015].
Luca, M. and Smith, J. (2013). Salience is quality disclosure: Evidence from the U.S. News
college ranking. Journal of Economics & Management Strategy, 22(1), pp.58-77.
Ludford, P., Cosley, D., Frankowski, D. and Terveen, L. (2004). Think different.
Proceedings of the 2004 conference on Human factors in computing systems - CHI
'04.
Lutz, T. (2013). Nigerian football team's 79-0 defeat: other infamous losses. [online] the
Guardian. Available at:
http://www.theguardian.com/football/blog/2013/jul/10/nigeria-football-team-79-0heavy-defeats [Accessed 29 Apr. 2015].
McCormick, I., Walkey, F. and Green, D. (1986). Comparative perceptions of driver
ability A confirmation and expansion. Accident Analysis & Prevention, 18(3),
pp.205-208.
Medium, (2015). The Fourth Core: AUI_2000's Enigma. [online] Available at:
https://medium.com/@theodore.yan/the-fourth-core-aui_2000s-enigma-a53c3b0d4b47
[Accessed 29 Apr. 2015].
Mehta, J., Starmer, C. and Sugden, R. (1994). Focal points in pure coordination games: An
experimental investigation. Theory and Decision, 36(2), pp.163-185.
Mislevy, R., Almond, R., Yan, D. and Steinberg, L. (1999). Bayes nets in educational
assessment: Where the numbers come from. In: Proceedings of the fifteenth
conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc,
pp.437-446.
Noxville, (2015). How the Ratings of the DAC Teams have changed (so far) - datdota.com.
[online] Datdota.com. Available at: http://www.datdota.com/blog/?p=1095 [Accessed

83

29 Apr. 2015].
Piech, C., Huang, J., Do, C., Ng, A., Chen, Z. and Koller, D. (2013). Tuned Models of Peer
Assessment in MOOCs. Cornell University Library.
Playdota.com, (2013). I'm gonna need a 3000-3500 account - DotA Forums. [online]
Available at: http://www.playdota.com/forums/showthread.php?t=1398477 [Accessed
29 Apr. 2015].
reddit, (2014). Ranked MMR survey - results update /r/DotA2. [online] Available at:
http://www.reddit.com/r/DotA2/comments/2124az/ranked_mmr_survey_results_updat
e/ [Accessed 29 Apr. 2015].
Rioult, F., Mtivier, J., Helleu, B., Scelles, N. and Durand, C. (2014). SECS 2014.
AASRI Procedia, pp.82-87.
Rosenthal, R. and Rosnow, R. (2009). Artifacts in behavioral research. New York: Oxford
University Press.
Roth, A., Sonmez, T. and Unver, M. (2003). Kidney Exchange.
Sadler, P. and Good, E. (2006). The Impact of Self- and Peer-Grading on Student Learning.
Educ. Assessment, 11(1), pp.1-31.
Schelling, T. (1960). The strategy of conflict. Cambridge: Harvard University Press.
Smartdotabetting.com, (2015). About | SmartDotaBetting. [online] Available at:
http://smartdotabetting.com/about/ [Accessed 29 Apr. 2015].
Steenhuisen, B. (2015). Refining our Notion of Hero Performance - datdota.com. [online]
Datdota.com. Available at: http://www.datdota.com/blog/?p=1110 [Accessed 29 Apr.
2015].
Sugden, R. (1995). A Theory of Focal Points. The Economic Journal, 105(430), p.533.
SuperData Research, (2013). eSports market brief: US accounts for almost half of total
viewership.. [online] Available at: http://www.superdataresearch.com/blog/esportsbrief/ [Accessed 29 Apr. 2015].

84

The Economist, (2015). The once and future king. [online] Available at:
http://www.economist.com/blogs/gametheory/2015/03/statistical-analysis-football
[Accessed 29 Apr. 2015].
Thurnsten, C. (2015). Game is hard: how Dota 2 changed my view of the 'average' gamer.
[online] PC Gamer. Available at: http://www.pcgamer.com/game-is-hard-how-threeyears-of-dota-2-changed-my-view-of-the-average-gamer/ [Accessed 29 Apr. 2015].
Toft-Andersen, J. (2014). Jacob Toft-Andersen on Twitter. [online] Twitter. Available at:
https://twitter.com/themaelk/status/544946864110731264 [Accessed 29 Apr. 2015].
Wagner, M. (2006). On the Scientific Relevance of eSports. In: Proceedings of the 2006
International Conference on Internet Computing and Conference on Computer Game
Development. pp.437-440.
Wallace, A. (2014). [Infographic] Dota 2 International Prize Neck-And-Neck with
'Traditional' Sports Winnings. [online] Gameskinny.com. Available at:
http://www.gameskinny.com/9u6c1/infographic-dota-2-international-prize-neck-andneck-with-traditional-sports-winnings [Accessed 29 Apr. 2015].
Watters, A. (2012). The Problems with Coursera's Peer Assessments. [online]
Hackeducation.com. Available at: http://hackeducation.com/2012/08/27/peerassessment-coursera/ [Accessed 29 Apr. 2015].
What-A-Baller, (2014). Matchmaking and ratings /r/DotA2. [online] reddit. Available
at: http://www.reddit.com/r/DotA2/comments/1y44o4/matchmaking_and_ratings/
[Accessed 29 Apr. 2015].
Whitehill, J., Ruvolo, P., fan Wu, T., Bergsma, J. and Movellan, J. (2009). Whose vote
should count more: Optimal integration of labels from labellers of unknown expertise.
Advances in Neural Information Processing Systems, (22), pp.2035-2043.
Wingfield, N. (2014). In E-Sports, Video Gamers Draw Real Crowds and Big Money.
[online] Nytimes.com. Available at:
http://www.nytimes.com/2014/08/31/technology/esports-explosion-bringsopportunity-riches-for-video-gamers.html [Accessed 29 Apr. 2015].

85

Yang, P. and Roberts, D. (2013). Extracting human-readable knowledge rules in complex


time-evolving environments. In: Proceedings of The 2013 International Conference
on Information and Knowledge Engineering. Las Vegas, Nevada.
Yang, P., Harrison, B. and Roberts, D. (2014). Identifying Patterns in Combat that are
Predictive of Success in MOBA Games. In: Proceedings of the Foundations of Digital
Games 2014 Conference. Raleigh: North Carolina State University.