|
1
|
|
|
2
|
- THE VERY BIG PICTURE:
- We would like to know the entire genetic sequence, identifying and
locating every gene coded therein.
|
|
3
|
- There are 3 billion base pairs.
We can only sequence 500 base pairs consecutively before the
sequencing becomes so noisy that it falls apart.
- Even given a faithful sequence, we frequently do not know where genes
start and stop.
|
|
4
|
- Knowing the genes: Apply strategies (usually code recognition) to
identify the start, stop, and middle of genes (GENEFINDING)
|
|
5
|
|
|
6
|
|
|
7
|
- Linkage Maps: Order and relative position only on a macroscopic scale
- Not actual bp distance
- Doesn’t work if genes are too close
- Physical Maps: Locate markers within 104 bp.
|
|
8
|
- Depends on Meiosis
- Distances are really statistics on traits
- Diversity in traits generated by crossing-over
- Distance computed by frequency of expression
- Map is expressed
- By locating the chromosome
- By relative distances between
genes (traits) on the chromosome
|
|
9
|
- To sequence the human genome we need to sequence 3 billion consecutive
base-pairs
- BUT
- Current technology only allows sequencing 500 base pairs
|
|
10
|
- Break the 3 billion base pair strand* into small strands
- Clone the inserts
- Sequence the small fragments
- Reassemble the labeled fragments
|
|
11
|
- Break the genomic DNA into very small fragments (inserts of 500-2000bp)
- Sequence the inserts
- Reassemble the labeled inserts into a labeled genome
|
|
12
|
- Break the genomic DNA into small fragments (30,000 -1M bp) containing
known markers
- Clone the fragments in BACs
- For each BAC clone, break into inserts (500-2000bp)
- Sequence the inserts
- Reassemble the labeled inserts into a labeled genome, using markers as
reference points to known gene loci
|
|
13
|
- We simply need to solve a specific instance the shortest common
superstring problem
|
|
14
|
|
|
15
|
|
|
16
|
- If the problem is solved on a Hamiltonian graph, it is intractable
- If the instance is solved on a Eulerian graph, the problem is tractable
|
|
17
|
- Computational Challenge
- Would need each segment pair to have a unique overlap
|
|
18
|
- Quality issues
- Contamination from viral vector DNA
- Erroneous reads of small segments,
- There are many, many repeats in Eukaryotic DNA
- Conceptual issues
- Reads in two different directions
- Oceans-Islands – a strategy for double tagging to span repeats
- Coverage issues
- Multiple copies to ensure coverage of the genome
- Consensus
|
|
19
|
- RESTRICTION MAPPING
- Data come from RESTRICTION ENZYME analysis
- Cut lengths of the target DNA fragments are MARKERS
- HYBRIDIZATION MAPPING
- Data come from DNA PROBES which bind to target DNA
- Probes have short, but unique, sequences ( MARKERS) and are
identifiable (florescent, commonly)
- Probe finds its exact complement in the target DNA
|
|
20
|
|
|
21
|
- Restriction enzymes cut double stranded DNA at specific spots
characterized by a specific short (~7 BP) sequence
- Usually the sequences on each strand are short reverse repeats
|
|
22
|
- On average, restriction enzymes with
4-base recognition sites will yield pieces 256 bases long
6-base recognition sites will yield pieces 4000 bases long, and
8-base recognition sites will yield pieces 64,000 bases long.
Since hundreds of different restriction enzymes have been
characterized, DNA can be cut into many different small fragments
|
|
23
|
- The cut between the strands may be staggered, leaving ‘sticky’
ends. Restated, there is a
built-in template for annealing with a complementary sequence, possibly
a product of the action of the same restriction enzyme on a different
fragment
|
|
24
|
- Likewise, the action of some restriction enzymes leave fully paired
remnants
|
|
25
|
|
|
26
|
- For restriction site analysis:
- Locate the restriction site on a DNA fragment: the restriction site
itself is thus a marker
- Can thus see if fragments overlap (if not,the fragments are called contigs)
|
|
27
|
- The idea is to use the length of a restriction fragment as part of a
fingerprint
- The ‘list’ of fragment lengths (ordered) is a fingerprint
|
|
28
|
|
|
29
|
- Use BAC clones 120-130k=Kb (usually no chimeras in BACs)
- Use restriction enzyme
- Different sized pieces from different clones
|
|
30
|
|
|
31
|
- We are given lengths of fragments formed from the action of restriction
enzymes on a clone or other large sequence
- We are NOT given the positions of the cleavage sites
- RESTATED: We have unlabeled distance data
- We must reconstruct the positions from the ordering of the fragments
|
|
32
|
|
|
33
|
- We may recover more than one representative of a fragment length without
any other labeling
|
|
34
|
- There are usually multiple ways to reconstruct the sequence from the
fragments
- Below are 3 of the 6 ways for fragment lengths {2,3,5}…
|
|
35
|
- Reconstruction, given only one dimensional distance data -single digest
- Reconstruction, given double digest distance data
- Alignment of clones, given distance data
|
|
36
|
|
|
37
|
|
|
38
|
|
|
39
|
- The one-dimensional restriction enzyme fragment length assembly problem
is a biological instance of the “Turnpike Problem”. Given the distances between exits,
create a map of the turnpike.
- The Turnpike problem is classically solved by a backtrack algorithm.
|
|
40
|
- 3 approaches
- all exponential
- Some only slightly less hideous than others
- A. Polynomial representation of the multisets, with factorization as a
solution
- B. Adjacency Matrix labeling problem
- C. Skiena- Smith -Lemke tree traverse and backtrack algorithm
|
|
41
|
- Fix the endpoints to reflect the largest distance
- For {0,1,3,4,6,9,10} we fix 0 and 10
- Choose the largest distance in the remaining data
- This distance could be measured from either 0 or from the 10, placing it
at location 0+9 or 10-9
- Choose one- if it does not violate the remaining distance data, keep it
- If it does, backtrack and choose the other one
|
|
42
|
- The algorithm is a binary tree with 2n branches
- At each branch, the consistency
with the remaining data must be checked
- Must then look at remaining points O(n log n)
- Worst case –evaluate each branch giving
- O(2nn log n)
|
|
43
|
- Still exponential, but better than the others
- For n=8,
- Adjacency matrix: 3x1029
- S-S-M: 6x103
|
|
44
|
|
|
45
|
|
|
46
|
- The experimental strategy is to perform a double digest
- One (multi) set of fragment lengths from the action of enzyme A
- One set from enzyme B
- One set when both enzymes are applied (AÙB)
|
|
47
|
- A the ordered multiset of lengths of fragments cut by enzyme A ai ;
i=1..n,
- B the ordered multiset of lengths of fragments cut by enzyme B bj ;
j=1…m
- C the set of lengths of fragments cut by both enzymes simultaneously ck; k=1,|AÚB|
- Remember, A, B and C are experimentally determined
|
|
48
|
- Pick an ordering (permutation) of A and an ordering of B
- Determine where the cut-sites of the DNA digested by enzyme A must have
been, given the ordering
- Likewise find the cut-sites from enzyme B digestion, given the ordering
of B
- Create a hypothetical set of cut-sites based on DNA digested by BOTH
enzymes
- Induce an hypothetical ordered multiset of double digest fragment
lengths
- Ascertain whether the experimentally obtained double digest fragment
length C can be arranged to match the hypothetical ordered multiset of
double digest fragment lengths
- If so, one solution (of possibly many) is found. If not, permute A and B and try again
|
|
49
|
- Once we have an ordering of A and B, we can figure out the cut sites S
for that ordering
- Add the lengths ai, and bj for the permutations s
and m, merging the ‘next’ sum in order into S
- s (A)=3,3,4 ai sum
like 3,6,10 into SA
- m (B)=1,2,5,2 bj sum
like 1,3,8,10 into SB
- S= SA Ú SB = 1,3,6,8,10 no repeats ie NOT a multiset-records
only the locations of the cut sites(ordered)
- Construct C(s,m)=sj-sj-1=1,2,3,2,2
- Can C be rearranged to give C(s,m)?
|
|
50
|
- That’s easy– this is directly equivalent to the partition problem
- Given a set of integers, can we partition the set into two subsets s.t.
the sum of integers in each is equal?
- Unfortunately the partition problem is NP complete
- Therefore the DDP problem is NP complete
|
|
51
|
- A={1,3,3,12}
- B={1,2,3,3,4,6}
- C=AÙB={1,1,1,1,2,2,2,3,6}
|
|
52
|
|
|
53
|
- There would be n!m! configurations
- BUT
- There are repeated lengths in the fragments
- So
- There are solutions,
where p
- and r are the number of repeats in A and B
|
|
54
|
- There are lots of solutions
- In fact, a result derived by Waterman from probability theory indicates
the number of solutions increases exponentially as the string length.
|
|
55
|
- Waterman et al have sought to reduce the enormity of the problem:
- Among the multiple solutions are equivalence classes
- Each solution can be mapped into an equivalence class
- The number of equivalence classes is much smaller than the total number
of solutions
|
|
56
|
- If two solutions have the same overlap data, they are overlap equivalent
- For t components from t-1 cuts, the components are permuted t! ways
- But there are 2t reflections
- So there are t!2t overlap equivalent solutions
|
|
57
|
- Solution configurations can also be mapped into equivalence classes by
defining ‘cassettes’ and applying
- Cassette transformations
- Cassette reflections
- The mapping is performed by a sophisticated application of Border Block Graph theory
|
|
58
|
|
|
59
|
|
|
60
|
- For the Waterman example problem
- A={1,3,3,12}
- B={1,2,3,3,4,6}
- C={1,1,1,1,2,2,2,3,6}
- 4320 configurations with 208
distinct solutions
- 26 different overlap equivalence classes
- 25 overlap size equivalence classes
- 18 Cassette equivalence classes
|
|
61
|
- Map the Double Digest Problem
- to the
- Traveling Salesman Problem
|
|
62
|
- Recall the TSP
- Salesman must visit N cities
- What is the shortest itinerary, visiting each city precisely once
(except return to the originating city)
|
|
63
|
- TSP maps to a Hamiltonian graph
- It is NP-Complete
|
|
64
|
- Recall that a useful probability-based approximation algorithm for TSP
is Simulated Annealing
- Concept of SA is to find the lowest energy state (global minimum) by
allowing the solution to ‘jump out’ of local minima
- This is possible because there is a certain probability associated with
each state that it might jump to a higher energy level.
|
|
65
|
- The probability of leaving an energy state is expressed by a
relationship to the current energy state. The probability distribution
is the Boltzmann distribution
- Where E is the energy state, T is the ‘Temperature’ of the system, and k
is Boltzmann’s constant
|
|
66
|
- The physical analogy is in the annealing of metal:
- The strongest metal is purely crystalline (lowest energy state)
- The annealing process is heating to high temperature, then waiting at
decreasing temperatures so that the molecules can rearrange into optimal
energy states
- The time at a temperature, and the number of temperature ‘waits’ is the annealing
schedule
|
|
67
|
- The SA algorithm has three elements
- Definition of a cost function
- Definition of configurations
- Path segment reversals and path segment exchanges in TSP applications
- Shifting gaps in MSA applications
- Guidance from a Stationary Transition Matrix in Double Digest
applications
- Definition of an annealing schedule
|
|
68
|
- The algorithm:
- At each temperature, for sufficient iterations:
- Select a configuration (choose a neighborhood)
- Compute the cost function
- If the cost is lowered, keep the configuration
- If it is higher, keep it only with a certain (Boltzmann) probability
(the Metropolis step)
- Reduce the temperature
|
|
69
|
|
|
70
|
- Always probabilistic
- Waterman suggests uniformly distributed random selection of start and
end points for reversals and transpositions from the transition matrix*
|
|
71
|
|
|
72
|
- Cooling
- Iterations
- Not specified –usually enough so that cost functions aren’t changing
much
|
|
73
|
- For 3x1021 solutions, it found one (of possibly many) in 1635
iterations
|
|
74
|
- Alignment of Overlapping Fragments
|
|
75
|
|
|
76
|
- Recall that a fingerprint is a list of the sizes of the restriction fragments
- Use the Maynard Olson greedy algorithm to reconstruct the order of the
clones. The complete solution is NP-complete.
- 2 clones overlap to give 3 regions:
- unique to left
- common
- unique to right
|
|
77
|
|
|
78
|
|
|
79
|
|
|
80
|
- Move the new one (C2) to the right
- Put matched “loose ends” to the left
- e.g. the 2’s in C1 and C3, and
the 6’s in C1 and C3 are ‘loose ends’ – move them left
|
|
81
|
|
|
82
|
|
|
83
|
|
|
84
|
- STSs are unique substrings ~200bp long
- Pick a 20bp PCR primer on each end
- Amplify only this 200bp STS region within the whole genome by PCR
|
|
85
|
- Hybridization mapping
- Fragments are called clones
- Collection of clones is a library
- Identify the clone by the binding (hybridization) of a small sequence (probe)
- The fingerprint is the set of probes that successfully hybridized
- Probes shared between 2 clones implies overlap of the clones
- Can determine order, but not location, of the probes.
|
|
86
|
|
|
87
|
|
|
88
|
- More than one site may have the same sequence; the probe can bind to
both.
- Solution: generate probes that hybridize
only to unique (Sequence Tagged) sites.
|
|
89
|
- Two pieces of DNA may join and be replicated as if a single clone. This is a chimera.
- In some clone libraries, half the clones are chimeras.
|
|
90
|
- Have at least one marker unique to every clone
- Have at least two markers/clone
|
|
91
|
|
|
92
|
- Assume all clones for all hybrids have been established
- If the clones´hybrids matrix
can be permuted so that each row has consecutive ones, then the matrix
has the C1P.
- If so, the problem is P complex,
not NP.
- Observation errors will vitiate the solution
|
|
93
|
|
|
94
|
- The algorithm to determine if the data have the C1P:
- Separate rows into components
- Permute the components
- Join components back together
- If the data have the C1P, then there is a polynomial algorithm to find
the alignment
- Refer to Setubal/Meidanis 5.3 for the specifics of both algorithms.
- The key point is that the permutations are found in P time.
|
|
95
|
- If Rowsets are distinvt, each goes to a component which can be permuted
separately
- If there is a non-empty intersection, a variant of the Maynard-Olsen
algorithm re-arranges the sub-components
- A very complicated algorithm re-assembles the components using a tree
traversal and backtrack strategy
|
|
96
|
- Chimeric clones impose blocks of 1’s separated by 0’s (gap)
- A false negative will impose a 0 (gap) in an otherwise unbroken run of
1’s.
- A false positive can split a run of 0’s, yielding 2 gaps
|
|
97
|
- Minimize gaps in the permutation!
- What do we have??
- The TSP, an NP-Complete problem!
- The cost function is the Hamming distance
- (the number of rows where 2 corresponding columns differ)
|