Dec. 4, 2012, 7:07 a.m. by tvinar
Topics: Probability
A Shot in the Dark
In “Locating Restriction Sites”, we first familiarized ourselves with restriction enzymes. Recall that these enzymes are used by bacteria to cut through both strands of viral DNA, thus disarming the virus: the viral DNA locations where these cuts are made are known as restriction sites. Recall also that every restriction enzyme is preprogrammed with a reverse palindromic interval of DNA to which it will bind and cut, called a recognition sequence. These even-length intervals are usually either 4 or 6 base pairs long, although longer ones do exist; rare-cutter enzymes have recognition sequences of 8 or more base pairs.
In this problem, we will ask a simple question: how does the bacterium "know" that it will probably succeed in finding a restriction site within the virus's DNA? The answer is that the short length of recognition sequences ensures that a large number of matches will occur simply by chance.
Intuitively, we would expect a recognition sequence of length 6 to occur on average once every $4^6 = 4,096$ base pairs. Note that this fact does not imply that the associated restriction enzyme will cut the viral DNA every 4,096 bp; it may find two restriction sites close together, then not find a restriction site for many thousands of nucleotides.
In this problem, we will generalize the problem of finding an average number of restriction sites to take into account the GC-content of the underlying string being analyzed.
Problem
Say that you place a number of bets on your favorite sports teams. If their chances of winning are 0.3, 0.8, and 0.6, then you should expect on average to win 0.3 + 0.8 + 0.6 = 1.7 of your bets (of course, you can never win exactly 1.7!).
More generally, if we have a collection of events $A_1, A_2, \ldots, A_n$, then the expected number of events occurring is $\mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \cdots + \mathrm{Pr}(A_n)$; see “The Mathematical Details” below for a precise explanation of this fact. In this problem, we extend the idea of finding an expected number of events to finding the expected number of times that a given string occurs as a substring of a random string.
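As a minimal illustration of this rule (the function name below is my own, purely for this sketch), the expected number of events is simply the sum of their individual probabilities:

```python
def expected_number_of_events(probabilities):
    # By linearity of expectation, E(I_A1 + ... + I_An) = Pr(A1) + ... + Pr(An).
    return sum(probabilities)

# The betting example: 0.3 + 0.8 + 0.6 = 1.7 (up to floating-point rounding).
print(expected_number_of_events([0.3, 0.8, 0.6]))
```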
Given: A positive integer $n$, a DNA string $s$ (a recognition sequence), and an array $A$ of numbers between 0 and 1, each representing a GC-content.
Return: An array $B$ having the same length as $A$, in which $B[k]$ is the expected number of times that $s$ will appear as a substring of a random DNA string $t$ of length $n$, where $t$ is formed with GC-content $A[k]$ (see “Introduction to Random Strings”).
Sample Dataset
10 AG 0.25 0.5 0.75
Sample Output
0.422 0.563 0.422
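One way to compute this (a minimal Python sketch; the function and variable names are my own, not part of the problem): under GC-content $x$, each G or C in $s$ matches with probability $x/2$ and each A or T with probability $(1-x)/2$, so the product of these per-symbol probabilities is the probability that $s$ occurs at any one fixed starting position; by linearity of expectation, summing over the $n - |s| + 1$ starting positions gives the expected count.

```python
def expected_occurrences(n, s, gc_contents):
    """Expected number of occurrences of s in a random DNA string of length n,
    one value per GC-content in gc_contents."""
    results = []
    for x in gc_contents:
        # Per-symbol probabilities under GC-content x.
        prob = {'G': x / 2, 'C': x / 2, 'A': (1 - x) / 2, 'T': (1 - x) / 2}
        # Probability that s occurs at one fixed starting position.
        p = 1.0
        for symbol in s:
            p *= prob[symbol]
        # n - len(s) + 1 starting positions, each contributing probability p.
        results.append((n - len(s) + 1) * p)
    return results

# Sample dataset: 10 AG 0.25 0.5 0.75
print(expected_occurrences(10, 'AG', [0.25, 0.5, 0.75]))
# [0.421875, 0.5625, 0.421875] -- within 0.001 of the sample output above
```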
The Mathematical Details
In this problem, we are speaking of an expected number of events; how can we tie this into the definition of expected value that we already have from “Calculating Expected Offspring”?
The answer relies on a slick mathematical trick. For any event $A$, we can form a random variable for $A$, called an indicator random variable $I_{A}$. For an outcome $x$, $I_{A}(x) = 1$ when $x$ belongs to $A$ and $I_{A}(x) = 0$ when $x$ belongs to $A^{\textrm{c}}$.
For an indicator random variable $I_{A}$, verify that $\mathrm{E}(I_{A}) = \mathrm{Pr}(A)$.
You should also verify from our original formula for expected value that for any two random variables $X$ and $Y$, $\mathrm{E}(X + Y)$ is equal to $\mathrm{E}(X) + \mathrm{E}(Y)$. As a result, the expected number of events $A_1, A_2, \ldots, A_m$ occurring, or $\mathrm{E}(I_{A_1} + I_{A_2} + \cdots + I_{A_m})$, reduces to $\mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \cdots + \mathrm{Pr}(A_m)$.
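As a purely illustrative sanity check (nothing here is part of the problem; the function name and parameter choices are my own), one can estimate the same expectation empirically by writing the number of occurrences of a string as a sum of indicator variables, one per starting position, and averaging over many random strings:

```python
import random

def simulate_expected_count(n, s, gc_content, trials=100000, seed=0):
    """Monte Carlo estimate of the expected number of occurrences of s
    in a random length-n DNA string generated with the given GC-content."""
    rng = random.Random(seed)
    bases = 'ACGT'
    # G and C are each chosen with probability gc_content/2, A and T with (1 - gc_content)/2.
    weights = [(1 - gc_content) / 2, gc_content / 2, gc_content / 2, (1 - gc_content) / 2]
    total = 0
    for _ in range(trials):
        t = ''.join(rng.choices(bases, weights=weights, k=n))
        # Sum of indicator variables I_{A_k}, one per starting position k.
        total += sum(1 for k in range(n - len(s) + 1) if t[k:k + len(s)] == s)
    return total / trials

# Should be close to 9 * 0.25 * 0.25 = 0.5625, the middle value of the sample output.
print(simulate_expected_count(10, 'AG', 0.5))
```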