Sunday, September 5, 2010

Scoring, you're doing it wrong



Business schools all over the world seem to teach scoring methodologies.[i] Douglas W. Hubbard[ii] and L. Anthony Cox[iii] have shown that there are fundamental flaws in their application; both authors regard them as counterproductive: “All of them, without exception, are borderline or worthless. In practice, they may make many decisions far worse than they would have been using merely unaided judgments.”[iv] First I will present the case study of a seminar I attended, where their arguments apply and are proven right; second (in a later article) I will present the case of a consumer test, where I propose that the design of the scoring did not lead to “borderline or worthless” results.


[i] This is anecdotal evidence: I encountered the practice frequently when attending seminars, trainings, and workshops in the field.








The Social Entrepreneurship Seminar Exercise Debacle

The following exercise, aimed at assessing the adequacy of one's country for Social Entrepreneurship, was proposed in this seminar of over 40 participants:





  1. Form groups consisting of people from the same country
  2. Each group should have a few people, not too few, not too many (I can't remember the exact range; it was about 2 to 5 people)
  3. Write down arguments against the adequacy of Social Entrepreneurship on the left
  4. Write down arguments for the adequacy on the right
  5. Give the arguments against a score between 1 and 5
  6. Give the arguments for the adequacy a score between 6 and 10
  7. Sum up all individual scores and divide them by the total number of arguments
  8. Present the arguments, the scores, and the total score
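As a sketch, the averaging procedure of steps 5 to 7 could be written out as follows (the argument scores are invented for illustration, not data from the seminar):

```python
# Hypothetical sketch of the seminar's scoring procedure (steps 5-7 above).
# The example scores are made up for illustration.

def total_score(against_scores, for_scores):
    """Sum all individual scores and divide by the total number of arguments."""
    all_scores = against_scores + for_scores
    return sum(all_scores) / len(all_scores)

# Arguments against adequacy, scored 1-5; arguments for adequacy, scored 6-10.
against = [2, 4, 1]
favor = [7, 9]

print(total_score(against, favor))  # 23 / 5 = 4.6
```

Note that the single number produced this way already hides how many arguments there were on each side, which the following points exploit.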
Hubbard and Cox adduce many reasons why such an exercise leads to "borderline or worthless" results:





  1. Scoring attributes a single value to an argument that is subject to uncertainty. Since we do not know the true value of the argument, labeling it with a single value makes the chance of being wrong close to a certainty. The arithmetic average of wrong scores does not eliminate these errors, as one might hope via the Central Limit Theorem, as we will see below.
  2. Scoring presumes a linear utility curve over the arguments, namely that an argument scored '4' is twice as good as an argument scored '2'. Awareness of this arithmetic property was never raised. Given the following biases (see 3. and 4.), we can expect neither comprehensiveness of the arguments nor trust in any single score (because of 1.). In fact, a single outlier of value '10' or '1' would change the result significantly due to the averaging in step #7 (# refers to the design shown above). The total score must therefore be seen as an arbitrary number without information value.
  3. Range compression: a discrete scale of 1-5 for inadequacy lumps together very dissimilar threats; two arguments both scored '1' may be more dissimilar from each other than an argument scored '4' is from one scored '5'. Range compression would only be a non-issue if the individual arguments were distributed close to evenly across the scale.
  4. Biased distribution: Hubbard observed that subjective scoring tends toward scores close to 75% of the range. In randomized experiments with arbitrary questions, this bias shows up as a triangular distribution (a mountain) with a flat left-hand slope. I have anecdotal evidence confirming this in the consumer test experiment shown later. Presumably, in the case of step #5 the preference is mirrored to 25% of the range because the arguments are negative, and sits at 75% for step #6. With arbitrary questions, however, we should expect a uniform distribution (a flatland). Scoring thus introduces a systematic bias unrelated to the nature of the argument.

The Results of the Scoring Debacle

I consider the results of the exercise worthless but harmless in themselves; what I see as critical is the training in dysfunctional methods.

The result of the exercise was harmless because, when the students presented their lists, all results merely reinforced well-known stereotypes about their countries: Russia is corrupt and has weak institutions, Germans have a strong work ethic and strong institutions, South Koreans and Japanese are diligent and have strong institutions. A weak benefit lies in the fact that these widespread prejudices were confirmed by representatives of the countries themselves, which lends them more credibility. Interestingly, the results were focused not on social entrepreneurship opportunities but on general entrepreneurial opportunities in the countries. I feel this misses the point.

The harmfulness of the exercise lies in the adoption of the flawed scoring methodology, repeated and reinforced as a central tool in common business practice. Decisions based on this discredited technique can be expected to have dramatic consequences.

Is all scoring flawed? I will try to deny this in the second part (not yet written, and maybe never to be written), using the consumer test case study mentioned earlier.
