20 January 2012

Reviews: To score or not to score?

For a long time, I was of the opinion that reviews should not have scores. After reading Alex Kierkegaard's writeup on the topic, I changed my mind.

Background

Alex's post is long. It's well written, and you should read it sometime, but I think I'll still be an enormous hypocrite and provide you with a summary of what he says anyway:
  1. It should be possible to read a review, and then answer the question, "Did this guy like the game? Was it a waste of his time? Does he regret playing it? Would he recommend it to others?" If you cannot answer this, the review is worthless, rambling drivel which literally does not make any sense.
  2. There is no such thing as a score-less review. To demonstrate, take any given group of reviews (ostensibly) without scores. Now label each review as "positive" or "negative". You should be able to do this easily, thanks to point 1. When done, go back and replace each "positive" with 1 and each "negative" with 0. Even though the reviewer did not give a score, you have correctly approximated the score he would have given, on a scale of 0 to 1. Now repeat the procedure with 5 tags: strongly positive/negative, mildly positive/negative, and neutral. Replace them with 1 through 5 (strongly negative = 1, neutral = 3, strongly positive = 5). You have now approximated a score on the more familiar 1-5 scale.
  3. It follows that a review, any review, even one that claims not to assign a score, must describe a score implicitly. That is, even if there isn't a score, you can read it and say, oh, this looks like a 6/10 (from point 2). If you can't say this, then the review is nonsense (from point 1).
With these, it is ridiculous to continue refusing to include a score. You are already giving a score by the act of writing a review. If you don't state the score, you are hiding it. Why hide it?
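
To make point 2 concrete, here's a toy Python sketch of the tag-to-score replacement. The tag labels and the function are mine, just for illustration; the 1-to-5 mapping is the one from the summary above:

    # Map sentiment tags to the 1-5 scores they approximate (point 2).
    TAG_TO_SCORE = {
        "strongly negative": 1,
        "mildly negative": 2,
        "neutral": 3,
        "mildly positive": 4,
        "strongly positive": 5,
    }

    def implied_score(tag):
        """Approximate the score a 'score-less' review implies."""
        return TAG_TO_SCORE[tag]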

Alex also talks about perfect scores. He's wrong there: if you think 100/100 is a perfect score, I don't see why you can't or won't think 5/5 is one too. It doesn't really matter to me; I don't have a problem with scores being out of 5 rather than out of 100.

Another thing he complains about is close scores, like 76/100 versus 77/100. His position boils down to "I cannot imagine myself making very precise judgements about games, therefore it is impossible." That is a laughable position. Alex seems to have a habit of confusing the negligible with the actually non-existent: just because it's hard to see the difference between a 76 and a 77 doesn't mean it doesn't exist. I agree that if you are really using a score system with 100 values (or worse, decimals!), you should probably think again. But that doesn't mean such a system is inherently bad, and it doesn't mean there can't be some guy out there who really can review games so finely that he discerns a 1/100 difference in quality (although most who score out of 100 probably can't). This part is also tangential to my purposes.

Motivation

So what are my purposes, then? Well, as I said, it's clear that I can't just not score my reviews. That would be sticking my head in the sand. However, I have one problem with scores: Suppose you have a scoring system out of 100. You have 4 categories: Graphics, Gameplay, Story, Replay value. Each one gets a score out of 25, then you sum them all for the final score. Reasonable enough, and many mainstream reviewers actually do this. (To make it Alex-friendly, you can make each category 0-1 and then sum them to 0-4.)

Anyhow, the problem: with this scheme, Dwarf Fortress gets 0+25+0+25=50. But Dwarf Fortress isn't a mediocre game! To fix it, you can make gameplay and replay value each out of 45, and the others out of 5. Then DF gets 90. Cool, right? Yes, but now Limbo gets, oh, 5+30+5+0=40 if you are really generous. I mean, I didn't think Limbo was perfect[1]. But I certainly didn't think it was below-average crap that deserves a 40/100.
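
Here's a quick Python sketch of the arithmetic, if you want to play with it. The per-category fractions are back-solved from the sums above, so treat them as illustrative, not as my actual judgments:

    # Weighted category scoring: each score is a 0-1 fraction per
    # category, each weight is that category's maximum; weights sum to 100.
    def total(scores, weights):
        return sum(s * w for s, w in zip(scores, weights))

    # Categories: graphics, gameplay, story, replay value.
    dwarf_fortress = [0.0, 1.0, 0.0, 1.0]
    limbo = [1.0, 2.0 / 3.0, 1.0, 0.0]

    flat = [25, 25, 25, 25]  # every category out of 25
    skewed = [5, 45, 5, 45]  # gameplay and replay value dominate

    print(round(total(dwarf_fortress, flat), 1))    # 50.0 -- DF looks mediocre
    print(round(total(dwarf_fortress, skewed), 1))  # 90.0 -- DF fixed...
    print(round(total(limbo, skewed), 1))           # 40.0 -- ...and Limbo broken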

So, for some games, graphics matter and replay value doesn't. For others, the opposite. Rather than come up with a complicated weighting scheme to solve this, I tried to find a lazy shortcut. I think I succeeded.

Solution

If you give a game a 10/10, what does that mean? Essentially, it's the same as saying, "dude, this game is awesome, you'll love it". 0/10 would be saying "piece of shit, don't bother". Reviews are, at their basest, for answering the question, "should I play this game?" Yes, they serve as commentary and can be very valuable in that respect as well, but that question is what gave rise to "reviews" in the first place.

So how would I deal with, say, DF, if I were to give scores? Probably I'd give it a 9/10, and say something to the effect of "if you like roguelikes with ASCII graphics, it's really a 10/10, and if you really care about graphics it's a 6/10 with tilesets and a 3/10 without". Tastes vary. Review audiences are heterogeneous[2].

However, the review isn't necessarily going to be an absolute endorsement (or disapproval), either. It will probably say something like "people of this sort will like this, people of that sort will not". Now, suppose you see a 5/10 game: what if you can't tell whether you're the guy who will like it despite its flaws, or the guy who will definitely hate it?

Sometimes it's obvious from reading the review. Oftentimes it's not. In that case, you'll guess. And with a 5/10 score, you will probably guess that you're equally likely to be in either camp... Wait, hold on. Isn't 1/2 the chance of success for an unbiased binary trial? Hmm, what if... What if review scores are probabilities? What if, when I give a game a score of X out of Y, that means I'm estimating that X/Y of my audience will like it, and consequently[3], that there's an X/Y probability that you will like it?

Results

Yeah, I'm kinda proud of myself for this. I think it's a great idea: I'm perfectly happy with a score system like this, both as reviewer and as review reader[4]. So how would it look in practice?

Now, I don't want to make 1%-resolution estimates; there aren't even 100 people reading my reviews. So I will use this scale:
  • 1: a game only an indie dev could love - 10% chance you'll like it; 10% of my audience will like it.
  • 2: mostly shit, but has noteworthy positive qualities - 30% chance you'll like it; 30% of my audience will like it.
  • 3: absolutely mediocre - 50% chance you'll like it; 50% of my audience will like it.
  • 4: recommended, but not for everyone - 70% chance you'll like it; 70% of my audience will like it.
  • 5: if you don't like this, you don't have a soul - 90% chance you'll like it; 90% of my audience will like it.
I think that looks pretty good[5]!
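
Incidentally, the scale above has a closed form (my observation, not part of the scale's design): a score of k means a probability of (2k - 1)/10. A minimal Python sketch, just to spell the reading out:

    # The score-to-probability reading of the five-point scale above.
    def p_you_like_it(score):
        """Estimated chance that a random reader of mine enjoys the game."""
        assert 1 <= score <= 5
        return (2 * score - 1) / 10.0

    for k in range(1, 6):
        print(k, p_you_like_it(k))  # 1 0.1, 2 0.3, 3 0.5, 4 0.7, 5 0.9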

In fact, if I happen to decide that "indie game bias" is relevant for a game, I can just bump it up one level. That seems reasonable. If the devs are, say, literally curing cancer and disease, I can totally see bumping a game 2 levels. I like that: I'm okay with foldit being a 3/5 game, and I'm okay with treating it like a 5/5 game because of its mission.

Furthermore, the above may have been written in the context of video games, but nothing about the system is specific to them. There's no reason not to use it for movies, books, what have you.

Lastly, the nice thing is that, while I've never heard of a reviewer using this system explicitly, all the review scores out there are very compatible with it. Good games are likely to get high scores, and you are likely to enjoy good games. Ergo, high score means more likely to enjoy. You can assume these are just traditional scores, too, if the "math" is confusing, but if basic probability confuses you, what on earth are you doing on my blog?




Footnotes:
[1]: If you look now, you will see that my Limbo review does include a score. It was added after the fact, once this post was written.
[2]: I don't know if you can even target a homogeneous audience of non-trivial size, but I know I wouldn't want to even if I could.
[3]: It's just basic probability. If a people in a room like a game, and b people don't, then when you pick one of them at random, the chance that you get someone who likes it is p = a/(a+b). Since you are only thinking about this because you have no idea which group you belong to, we can assume you are equally likely to be any one of those people. So the chance of you liking the game is also p, which is equal to the fraction of people who like it.
[4]: It also solves all sorts of problems we weren't even trying to solve: among other things, it means that even if you buy a 9/10 game and hate it (or buy a 1/10 game and love it), that's fine, because it's a probabilistic prediction, and you are still better off trusting it (assuming the reviewer is trustworthy and reliable).
[5]: Two things you may notice: first, I'll never have to say you will definitely like a game, or definitely dislike it. Second, no matter how many times I'm wrong, I can always blame it on probability. Man, I'm so clever! Seriously, though: sorry about this, but them's the breaks. I don't think a system that allows 0% or 100% probabilities would be productive, and I'm not sure it would even be mathematically sensible. Nor do I intend to find out.

3 comments:

  1. This must be one of the stupidest things I have ever read

  2. Talk about missing the point.

    "Another thing he complains about is close scores like, for example, 76/100 and 77/100. His position boils down to "I cannot imagine myself making very precise judgements about games, therefore it is impossible.""

    Care to explain the difference between the 8.8 GameSpot gave to Zelda: Twilight Princess, the 8.9 they gave to Smash Bros. Melee, and the 9.0 they gave to The World Ends With You? In other words, care to explain why those games deserve such overly-specific scores?

  3. Hello, culture.vg!

    "care to explain why those deserve those rather overly-specific scores?"

    1. I am not the person who wrote those reviews. You should ask the reviewer for the explanation, not me. My reviews use a 5-point scale, and I'll gladly explain exactly what the difference is between my 2 and my 3. As a matter of fact, I have, in the reviews.

    2. Half of the article convinced me that scores make sense (the implicit-score argument is brilliant). The other half failed to convince me that there is something intrinsically wrong with using a highly granular scale. Honestly, as you can see, I wouldn't really use a precise scale either, much like Alex. So I agree with him there.

    I disagree that no one should ever use a granular scale. Yes, it's probably true that most people who do use one shouldn't, because they are overestimating their own ability to appraise quality. The disagreement really is that he can't see there being a guy out there who really can discern the 0.1 difference between two games; I don't see why not. (Although it would probably take a very exceptional person.) However, I didn't write this post to argue with Alex, I wrote it to explain my scoring system. Hence my lack of explanation regarding that point (honestly, it seems like quite a boring thing to argue over).

    Anyway, you asked for an explanation of the scores given to those games; I will not give it. (For one, who gives a fuck what GameSpot says, anyway? Their scores are noise.)

    Instead, I'll give you a contrived example to the contrary: suppose I rank all the games I play from good to bad. For some strange reason, I always find it easy to say which games are "better" and which are "worse" than any given game. I then space these ranked games equally on a number line between 0 and 10. Suppose Smash Bros., the 100th and last game I played, was better than Zelda, I thought, but worse than TWEWY. It ends up placed at 8.9.
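
    In Python, assuming one particular spacing convention (the convention is mine, picked for illustration):

        # Rank every game played (worst to best), then space the ranks
        # evenly on a 0-10 line. Spacing convention is for illustration.
        def spaced_scores(ranked_titles):
            n = len(ranked_titles)
            return {title: round(10.0 * i / (n - 1), 1)
                    for i, title in enumerate(ranked_titles)}

    With 100 games the spacing is about 0.1, so a game slotted just above Zelda and just below TWEWY lands at roughly 8.9, without anyone ever directly judging a "0.1 difference in quality".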

    Moreover, if you are willing to say that a 0 to 100 scale is too fine, but a 1 to 5 one isn't, you should be able to draw a line. Is out of 6 okay? What about out of 10? And then when you draw that line, you should be able to explain exactly why and how you drew it. I don't think you can, because I think it's impossible.
