Probability and Decision
Reasoning under conditions of uncertainty is usually reconstructed as probabilistic reasoning; e.g., in determining whether circumstantial evidence establishes a defendant's guilt beyond reasonable doubt, one might reason probabilistically.
Example: The Monty Hall Problem
The following example will make the point that probabilistic reasoning is hard. It explains why there is a need to study it carefully.
Question: "Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the other doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to take the switch?"
Answer 1: Robert Sachs, a professor of mathematics at George Mason University in Fairfax, Va., expressed the prevailing view that there was no reason to switch doors. "You blew it!" he wrote. "Let me explain: If one door is shown to be a loser, that information changes the probability of either remaining choice --neither of which has any reason to be more likely -- to 1/2. As a professional mathematician, I'm very concerned with the general public's lack of mathematical skills. Please help by confessing your error and, in the future, being more careful."
Answer 2: (Marilyn vos Savant) "Yes, you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance. Here's a good way to visualize what happened. Suppose there are a million doors, and you pick door No. 1. Then the host, who knows what's behind the doors and will always avoid the one with the prize, opens them all except door No. 777,777. You'd switch to that door pretty fast, wouldn't you?"
Answer 3: Sorry, Marilyn. There's nothing wrong with your math. As you noted, math answers aren't determined by votes. But TV ratings are! What could possibly have justified your assumption that the game show host offers every contestant the same choice? The initial question described only a single incident.
If I were the game show host, and you were the contestant, I'd offer you the option to switch only if you initially chose the correct door. In this case, the first door has a 100% chance of winning, the second door has a 0% chance, and switching would be a sure loser.
Unless you understand the motives and behavior of the game show host, all the mathematics in the world won't help you answer this question.
Reformulated question: If the host is required to open a door all the time and offer you a switch, then you should take the switch. But if he has the choice whether to allow a switch or not, beware. Caveat emptor. It all depends on his mood. (Monty Hall) "My only advice is, if you can, get me to offer you $5,000 not to open the door, take the money and go home."
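The dependence on the host's behavior can be checked with a small simulation. The sketch below is only an illustration: the function name switch_win_rate and the two host policies are hypothetical models, not part of the original problem statement. One host always offers the switch; the other offers it only when the contestant's first pick is correct, as in Answer 3.

```python
import random

def switch_win_rate(host_offers_switch, trials=100_000, seed=0):
    """Fraction of offered switches that win the car.

    host_offers_switch(first_pick_correct) -> bool is a hypothetical
    model of the host's policy.
    """
    rng = random.Random(seed)
    offered = 0
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        first_pick_correct = (pick == car)
        if not host_offers_switch(first_pick_correct):
            continue
        offered += 1
        # The host opens a goat door, so the remaining closed door hides
        # the car exactly when the first pick was wrong.
        if not first_pick_correct:
            wins += 1
    return wins / offered

# Host who must always offer the switch (the reformulated question).
always_offers = switch_win_rate(lambda first_pick_correct: True)
# Host who offers the switch only to contestants who picked the car.
only_when_correct = switch_win_rate(lambda first_pick_correct: first_pick_correct)
print(always_offers, only_when_correct)
```

Under the first policy the switching win rate comes out close to 2/3; under the second it is exactly 0, because every contestant who is offered the switch already holds the car.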
Probability
There are many ways to think about probability. Often we face a situation in which there are a number of equally probable outcomes.
Example: I toss a die, which is a six-sided figure (cube). The possible outcomes are that it lands 1, 2, 3, 4, 5, or 6. Assume that each of these is equally probable.
Rule: If the possible outcomes are equally probable, then the probability of H is the number of outcomes in which H would be true divided by the number of possible outcomes.
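As a minimal sketch of this counting rule, applied to the die example (the helper name probability and the two hypotheses chosen here are illustrative assumptions):

```python
from fractions import Fraction

def probability(outcomes, hypothesis):
    """P(H) for equally probable outcomes: favorable count / total count."""
    favorable = [o for o in outcomes if hypothesis(o)]
    return Fraction(len(favorable), len(outcomes))

die = [1, 2, 3, 4, 5, 6]
p_even = probability(die, lambda o: o % 2 == 0)    # outcomes 2, 4, 6
p_above_four = probability(die, lambda o: o > 4)   # outcomes 5, 6
print(p_even, p_above_four)  # 1/2 1/3
```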
Questions:
The Monty Hall Problem Again
Here we are interested in the truth of three hypotheses: Door 1 = The car is behind door 1, Door 2 = The car is behind door 2, Door 3 = The car is behind door 3.
We add a sequence of facts that may bear on the probabilities that we assign to these hypotheses:
The Essential Point
Initially, all hypotheses are equally probable. So, what breaks the tie? The tie is broken because the Door 3 hypothesis fits with, or explains, the new information (we pick door 1 and Monty then opens door 2) better than the Door 1 hypothesis does. If we suppose that the Door 3 hypothesis is true, then Monty has to open door 2, so this hypothesis gives the evidence a probability of 1 and explains it very well. On the other hand, the Door 1 hypothesis only makes the probability of the evidence equal to ½, which is considerably less.
Therefore, the likelihood principle gives the correct solution to the Monty Hall problem, although only because the hypotheses start out with the same initial probability. Here is a more complete analysis of the problem.
Bayes Rule for Updating Probabilities
For any hypothesis H, the posterior probability of H is equal to the prior probability of H given the new evidence. In symbols, Pnew(H) = Pold(H/E), where E is the new information obtained between times 1 and 2. Pold(H/E) is the probability of H given E.
Note: If we were to assume that all possible outcomes are equally probable in the Monty Hall problem after E is known, then we would reach the mistaken conclusion that Pnew(Door 1) = Pnew(Door 3) = ½. However, the correct solution is that Pnew(Door 3) is greater than Pnew(Door 1). How do we get the correct answer from Bayes Rule?
Definition: The probability of H given E is the probability of H and E in proportion to the probability of E. That is, P(H/E) = P(H and E)/P(E). This definition does not assume that all outcomes are equally probable.
Illustration: If all outcomes are equally probable given E, then P(H/E) is the number of outcomes that make H and E true divided by the number of possible outcomes that make E true. E.g., let H = a die lands a six, and E = the die lands with an even number up. Three outcomes (2, 4, 6) make E true, and one of them makes H true as well, so P(H/E) = 1/3.
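The illustration can be reproduced by restricting the count to the outcomes that make E true. This is a sketch for the equally probable case only; the helper name conditional_probability is an assumption made for the example.

```python
from fractions import Fraction

def conditional_probability(outcomes, h, e):
    """P(H/E) by counting, assuming equally probable outcomes."""
    e_outcomes = [o for o in outcomes if e(o)]       # outcomes that make E true
    h_and_e = [o for o in e_outcomes if h(o)]        # outcomes that make H and E true
    return Fraction(len(h_and_e), len(e_outcomes))

die = [1, 2, 3, 4, 5, 6]
p_six_given_even = conditional_probability(
    die,
    h=lambda o: o == 6,
    e=lambda o: o % 2 == 0,
)
print(p_six_given_even)  # 1/3
```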
If we do not assume that the outcomes are equally probable, then this formula is not of much practical use. However, it does lead to a more useful formula:
Bayes Theorem
The probability of H given E is the probability of H times the likelihood of H relative to E divided by the probability of E. That is, P(H/E) = P(H) × [P(E/H)/P(E)].
Proof: P(H/E) = P(H and E)/P(E). Now, by the same definition, P(E/H) = P(H and E)/P(H). Or, equivalently, P(H and E) = P(H) × P(E/H). Bayes theorem follows by substituting this result into the first equation, P(H/E) = P(H and E)/P(E).
Definition: Let us say that H raises our expectation of E if and only if P(E/H) > P(E).
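A quick check of the theorem on the die illustration, with H = the die lands a six and E = the die lands with an even number up. The function name bayes is an illustrative choice; the numbers are the ones implied by a fair die.

```python
from fractions import Fraction

def bayes(prior, likelihood, p_evidence):
    """P(H/E) = P(H) x [P(E/H) / P(E)]."""
    return prior * likelihood / p_evidence

p_h = Fraction(1, 6)          # P(H): one face in six
p_e_given_h = Fraction(1)     # P(E/H): a six is always even
p_e = Fraction(1, 2)          # P(E): three even faces in six

p_h_given_e = bayes(p_h, p_e_given_h, p_e)
raises_expectation = p_e_given_h > p_e   # H raises our expectation of E
print(p_h_given_e, raises_expectation)   # 1/3 True
```

Since P(E/H) = 1 is greater than P(E) = 1/2, H raises our expectation of E, and correspondingly E raises the probability of H from 1/6 to 1/3.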
Solution to the Monty Hall Problem
We now use Bayes theorem to calculate Pold(Door 1/E) in the Monty Hall Problem, where E = Monty opens door 2 after we choose door 1. Pold(Door 1/E) = Pold(Door 1) × [Pold(E/Door 1)/Pold(E)]. What is Pold(E)? It is ½ because we could repeat the game many times, and in the situation in which we choose door 1 initially, Monty will open door 2 half of the time. But what is the likelihood of Door 1? What is Pold(E/Door 1)? This is also ½ because, given that the car is behind door 1, Monty will open door 2 half of the time. Therefore, the hypothesis that the car is behind door 1 does not raise the expectation of the new information, and so the probability of the hypothesis is unchanged: Pnew(Door 1) = Pold(Door 1) = 1/3.
Conclusion: Door 2 has been opened, so Pnew(Door 2) = 0. From the fact that the new probabilities must add to 1, we now know that Pnew(Door 3) = 2/3. (Exercise: Verify this answer directly from Bayes Rule.) Therefore, we can expect a greater payoff by switching doors!
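The whole calculation can be laid out in a few lines. This sketch assumes the setup used above: we pick door 1, Monty must open a goat door other than ours, and he chooses at random when he has two goat doors available. The dictionary names are illustrative.

```python
from fractions import Fraction

# Prior probabilities of the three hypotheses.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihoods P(E/Door k), where E = Monty opens door 2 after we pick door 1.
likelihood = {
    1: Fraction(1, 2),  # car behind door 1: Monty picks door 2 or 3 at random
    2: Fraction(0),     # car behind door 2: Monty never reveals the car
    3: Fraction(1),     # car behind door 3: Monty has to open door 2
}

# P(E) as the sum over hypotheses of P(Door k) x P(E/Door k).
p_e = sum(prior[k] * likelihood[k] for k in prior)

# Bayes theorem for each hypothesis.
posterior = {k: prior[k] * likelihood[k] / p_e for k in prior}
print(p_e, posterior[1], posterior[2], posterior[3])  # 1/2 1/3 0 2/3
```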
Base-Rate Neglect
The Monty Hall problem is a case in which the prior probabilities are the same, but the posterior probabilities are not because the likelihoods are different. Bayes theorem correctly states the way in which likelihoods break a symmetry. Base-rate neglect occurs when the likelihoods are the same, but a difference in prior probabilities is not properly taken into account.
Example: There is an accident late at night between a car and a taxi in a part of town where 90% of the taxis are blue and 10% are green. The taxi speeds off without stopping. A witness, Mr. Brown, from out of town, reports that the light was dim, and he could not tell for sure whether the taxi was green or blue. But he would nevertheless guess, if pressed, that it was green. The evidence was important, so the police conducted a test. They showed Mr. Brown 50 green taxis and 50 blue taxis in a random order, all in similar lighting. Mr. Brown correctly guessed the color of 80% of the green taxis, and correctly guessed 80% of the blue taxis as blue. In light of these results, what would you estimate to be the probability that the taxi involved in the accident was a green taxi?
Answer: Anyone who estimates something close to 80% is wrong. The fact is that the information that 90% of the taxis in that part of town are blue (called the base rate) is a relevant piece of background information, and most people underestimate its relevance (called base-rate neglect). Bayes rule for updating probabilities tells us what the probability should be. Let Green = the taxi involved in the accident was green. We know that Pold(Green) = 10%. But what is Pnew(Green)? Bayes rule for updating probabilities tells us that this is Pold(Green/E), where E = Mr. Brown guessed that the taxi was green. By Bayes theorem:
Pold(Green/E) = Pold(Green) × [Pold(E/Green)/Pold(E)].
We know the prior probability to be 10%, and we estimate the likelihood to be 80%. What is Pold(E)? It turns out we don't need to know this because if we work out the ratio of Pold(Green/E) to Pold(Blue/E), then that term drops out. Then we can work out the probabilities from the fact that they must sum to 1. Here is the calculation: Pold(Green/E)/Pold(Blue/E) = [10% / 90%] × [80% / 20%] = 4/9. That is, Pnew(Green) = 4/13, or about 31%, and Pnew(Blue) = 9/13, or about 69%. It is more than 2/3 probable that the taxi was blue despite the fact that Mr. Brown said it was green, and that he is quite a reliable observer!
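The ratio calculation can be written out step by step. The variable names are illustrative; the numbers are the ones given in the example (10% green taxis, 80% correct guesses for each color).

```python
from fractions import Fraction

prior_green = Fraction(10, 100)
prior_blue = Fraction(90, 100)

# E = Mr. Brown guesses "green".
p_e_given_green = Fraction(80, 100)  # he calls a green taxi green 80% of the time
p_e_given_blue = Fraction(20, 100)   # he calls a blue taxi green 20% of the time

# Ratio of posteriors: the common factor 1/P(E) drops out.
odds_green_to_blue = (prior_green / prior_blue) * (p_e_given_green / p_e_given_blue)

# Convert the ratio to probabilities that sum to 1.
p_green = odds_green_to_blue / (1 + odds_green_to_blue)
p_blue = 1 - p_green
print(odds_green_to_blue, p_green, p_blue)  # 4/9 4/13 9/13
```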
Remark: Suppose we change the example. Suppose that Mr. Brown said the taxi was blue. Then the probability that the taxi really was blue is very close to 1 (about 97%). So, we can make accurate predictions of blue taxis but we cannot make accurate predictions of green taxis. This seems odd, but it is true because without any new information at all the prediction that the taxi was blue will be reasonably accurate in the sense of being correct 90% of the time. If that prediction is confirmed by Mr. Brown's testimony, that probability goes up. If the prediction is not confirmed by Mr. Brown's testimony, then the probability goes down, so that no accurate prediction can be made either way.
Application: This fact has important applications to science. Whenever we are interested in predicting the presence of something rare, we are in the same position as predicting green taxis. It happens whenever the base rate is low. So, for example, if we want to predict the presence of a rare disease, or the presence of a rare particle in nuclear physics, then we face this problem. The test has to be extremely good in order to make accurate predictions.
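A small sketch makes the asymmetry concrete. The taxi numbers come from the example above; the rare-disease numbers (a base rate of 1 in 1000 and a test that is right 99% of the time) are hypothetical and chosen only to illustrate the point. The function name posterior is likewise an assumption.

```python
from fractions import Fraction

def posterior(prior, p_report_if_true, p_report_if_false):
    """P(H/report) from the base rate of H and the two report rates."""
    joint_true = prior * p_report_if_true
    joint_false = (1 - prior) * p_report_if_false
    return joint_true / (joint_true + joint_false)

# Taxi example: Mr. Brown is right 80% of the time for either color.
p_blue_if_says_blue = posterior(Fraction(9, 10), Fraction(8, 10), Fraction(2, 10))
p_green_if_says_green = posterior(Fraction(1, 10), Fraction(8, 10), Fraction(2, 10))

# Hypothetical rare disease: base rate 1/1000, test right 99% of the time.
p_disease_if_positive = posterior(Fraction(1, 1000), Fraction(99, 100), Fraction(1, 100))

print(p_blue_if_says_blue, p_green_if_says_green, p_disease_if_positive)
# 36/37 4/13 11/122
```

Even with a test that is right 99% of the time, a positive result in this hypothetical case leaves the probability of disease at 11/122, about 9%.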
The Psychic Fallacy
Jeane Dixon correctly predicted that JFK would be assassinated in 1963. Is she psychic or what? Let Psychic = Jeane Dixon is psychic, and let E = Jeane Dixon's prediction that JFK would be assassinated is true. Then P(E/Psychic) is close to one. That is, the hypothesis has a high likelihood relative to E. But it has a low posterior probability because the prior probability is very low in light of the fact that she got 98% of the many predictions she published wrong. E does raise the probability of Psychic, but not to a very high value. The prior probability of Psychic is not thought of as a base rate in this example, but it is playing the same role in Bayes theorem.
Note: When we say that a probability is "prior" we do not mean that it is "a priori" (which means "independent of experience"). The prior probability of the Psychic hypothesis is molded by much experience.
The Gambler's Fallacy
If I toss a coin 6 times and get 5 heads, then what do I predict for the next 6 tosses? If you were gambling, then this string of heads may have lost you money. The gambler's fallacy is a name for the idea that over the total of 12 tosses, you can expect something close to a total of 6 heads, so in the next 6 tosses we should expect a less than average number of heads (one or two). This idea is wrong. The coin does not know that it landed an unlucky string of heads on the last 6 tosses, and it would not care even if it did know. So, the best prediction is that there will be 3 heads in the next 6 tosses, making a total of 8 heads out of 12. That is, given what we already know about the first 6 tosses, we should predict that the coin will land heads an above average number of times in the 12 tosses.
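The prediction of 3 heads can be checked by listing every equally probable sequence for the next 6 tosses of a fair coin. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction
from itertools import product

heads_so_far = 5  # observed in the first 6 tosses

# All 2**6 equally probable sequences for the next 6 tosses (1 = heads).
next_six = list(product([0, 1], repeat=6))

# Expected heads in the next 6 tosses: the past tosses play no role.
expected_next = Fraction(sum(sum(seq) for seq in next_six), len(next_six))
expected_total = heads_so_far + expected_next
print(expected_next, expected_total)  # 3 8
```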
Regression towards the Mean
The example used in the gambler's fallacy takes for granted that the coin is fair (i.e., that the probability of landing heads on any toss is ½). But what if the coin was picked from a box that has a variety of bent coins, some biased towards landing heads and some towards landing tails? The amount of bending that produces these differences is not discernible to the naked eye.
Example: Suppose we pick a coin from the box and toss this coin 6 times, and get 5 heads. What do we now predict for the next 60 tosses? The difference in the background information we have about the example makes a difference to the prediction. We have reason to expect more than 30 heads in the next 60 tosses, because the information we have suggests that the coin we chose is biased towards heads to some degree. But should we predict 50 heads in the next 60 tosses? No, because we also expect that such a high frequency of heads was in part due to chance. Our best bet is to suppose that there will be greater than 30 but less than 50 heads in the next 60 tosses.
Definition: This phenomenon is called regression towards the mean.
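To see the effect numerically, we need some assumption about what is in the box. The sketch below assumes, purely for illustration, that the box holds equal numbers of three kinds of coin, with probabilities of heads of 0.3, 0.5 and 0.7. Bayes theorem then gives the probability of each kind after seeing 5 heads in 6 tosses, and from that the expected number of heads in the next 60 tosses.

```python
from fractions import Fraction

# Hypothetical contents of the box: three biases, equally common.
biases = [Fraction(3, 10), Fraction(1, 2), Fraction(7, 10)]
prior = {p: Fraction(1, 3) for p in biases}

heads, tails = 5, 1  # observed in the first 6 tosses

# Likelihood of the observation under each bias. The binomial coefficient
# is the same for every bias, so it cancels and is left out.
likelihood = {p: p**heads * (1 - p)**tails for p in biases}

p_e = sum(prior[p] * likelihood[p] for p in biases)
posterior = {p: prior[p] * likelihood[p] / p_e for p in biases}

# Probability of heads on a future toss, averaged over the posterior.
p_heads_next = sum(p * posterior[p] for p in biases)
expected_heads_in_60 = 60 * p_heads_next
print(float(expected_heads_in_60))
```

Under these assumptions the expected number of heads comes out between 38 and 39: above the 30 expected of a fair coin, but well below the 50 that the observed frequency of 5/6 would suggest.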
Applications:
Hypothesis Testing
In a famous paper called "Belief in the Law of Small Numbers", Kahneman and Tversky examine the understanding, or lack of understanding, of mathematical psychologists concerning hypothesis testing. In hypothesis testing, one applies something like the likelihood principle to two competing hypotheses.
Example: Suppose that you want to know whether a certain way of teaching mathematics is effective. You randomly assign a group of students to two groups, with 15 in each group: The control group is subject to standard teaching methods, while the experimental group is taught using the new methods. Then you test the two groups at the end of the semester, and find that the average test scores for the experimental group were higher. What do you conclude from this?
There are two competing explanations. One, called the null hypothesis, says that the new teaching method was no more effective than the old one, and the observed difference arose by chance. If the null hypothesis is true, then we do not expect the difference to persist in the future. The second hypothesis is that the effect is 'real'; that is, that it is likely to persist in the future because the new method is more effective.
Hypothesis testing is similar to the likelihood principle. If the observed difference in the test scores is large enough that it would be sufficiently improbable to observe such a difference if the null hypothesis were true, then we say that the difference is statistically significant. Suppose the observed difference is statistically significant (and so we reject the null hypothesis in favor of the alternative).
Kahneman and Tversky asked psychologists what they expected would happen in this situation if they were to repeat the experiment. They found that psychologists had unrealistically high expectations about getting a statistically significant result the second time. They underestimated the size of the regression toward the mean.
This has serious consequences. If psychologists repeat the experiment a second time and fail to observe a statistically significant difference in scores, their expectations have been dashed, and they have a tendency to reject the notion that the new teaching method is educationally effective, despite the fact that the test score difference in the pooled data is statistically significant!
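A rough calculation shows why a failed replication should not be surprising with groups of 15. The sketch below rests on several simplifying assumptions that are not in the example: test scores are normally distributed with a known standard deviation, the test is a one-sided z-test at the 5% level, and the new method truly raises scores by half a standard deviation (a hypothetical effect size). Under those assumptions, the chance that any single experiment comes out statistically significant is its power.

```python
from math import sqrt
from statistics import NormalDist

def power(effect_size, n_per_group, alpha=0.05):
    """Chance of a significant result in a one-sided two-group z-test.

    effect_size is the true difference in means, in standard deviations.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)              # cutoff for significance
    shift = effect_size * sqrt(n_per_group / 2)  # expected z-score of the difference
    return 1 - z.cdf(critical - shift)

replication_chance = power(effect_size=0.5, n_per_group=15)
false_alarm_chance = power(effect_size=0.0, n_per_group=15)
print(round(replication_chance, 2), round(false_alarm_chance, 2))
```

With these assumed numbers the replication succeeds only about 39% of the time, even though the effect is real, so a non-significant second result is weak evidence against the new method.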
Bayesian Decision Theory
Choose the action, or policy, that has the highest expected payoff, or expected utility.
Example 1: If there is one action that has the highest probability of obtaining the outcome with the highest payoff, then that is the action recommended by the Bayesian decision rule. This situation applies to the reformulated Monty Hall problem.
Example 2: There are two lotteries, each with a million tickets costing $1 each. The first gives out 80,000 prizes of $10, while the second gives one "lucky" winner a prize of $800,000. It is a well-known phenomenon that the second kind of lottery is more popular. Is there a rational basis for this choice according to Bayesian decision theory?
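As a starting point for Example 2, the expected monetary payoff of a ticket in each lottery can be computed directly. The helper name expected_prize is illustrative, and the sketch assumes all tickets are sold and each is equally likely to win.

```python
from fractions import Fraction

TICKETS = 1_000_000
PRICE = 1  # dollars per ticket

def expected_prize(number_of_prizes, prize_value):
    """Expected winnings per ticket, in dollars."""
    return Fraction(number_of_prizes * prize_value, TICKETS)

many_small = expected_prize(80_000, 10)      # first lottery
one_large = expected_prize(1, 800_000)       # second lottery

print(many_small - PRICE, one_large - PRICE)  # -1/5 -1/5
```

Both tickets have the same expected monetary payoff, a loss of 20 cents, so a preference for the second lottery cannot be based on expected dollars alone; it would have to come from how the buyer values the different prizes.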
Payoff table for Example 3 (rows are actions, columns are outcomes):
|    | E   | -E  |
| A  | 190 | -10 |
| -A | 0   | 0   |
Example 3: Assume that you have the choice of implementing an affirmative action policy A, or doing nothing. You judge the probability of the policy having a positive effect E to be a low 10%. You measure the cost of the policy to be 10, but the value of a positive outcome is 200. If you do nothing, then there is no positive effect and no cost. Which choice has the highest expected payoff?
Answer: Expected payoff for -A is 0. Expected payoff for A is 0.1 × 190 - 0.9 × 10 = 19 - 9 = 10. Affirmative action has the higher expected payoff in this scenario, even though it is believed to have a low probability of success!
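The same calculation can be written as a small decision procedure. The dictionary layout and names are illustrative; the payoffs and probabilities are those of Example 3.

```python
from fractions import Fraction

# Payoff of each (action, outcome) pair, as in Example 3.
payoff = {
    ("A", "E"): 190,   # value 200 minus cost 10
    ("A", "-E"): -10,  # cost 10, no benefit
    ("-A", "E"): 0,
    ("-A", "-E"): 0,
}

# Probability of the positive effect E under each action.
p_effect = {"A": Fraction(1, 10), "-A": Fraction(0)}

def expected_payoff(action):
    p = p_effect[action]
    return p * payoff[(action, "E")] + (1 - p) * payoff[(action, "-E")]

# Bayesian decision rule: choose the action with the highest expected payoff.
best_action = max(["A", "-A"], key=expected_payoff)
print(expected_payoff("A"), expected_payoff("-A"), best_action)  # 10 0 A
```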