Notes
Slide Show
Outline
1
Physical Mapping
2
WHAT ARE WE TRYING TO DO?
  • THE VERY BIG PICTURE:


    • We would like to know the entire genetic sequence, identifying and locating every gene coded therein.


3
WHY NOT??
  • There are 3 billion base pairs.  We can only sequence 500 base pairs consecutively before the sequencing becomes so noisy that it falls apart.
  • Even given a faithful sequence, we frequently do not know where genes start and stop.
4
The Approach

  • Knowing the genes: Apply strategies (usually code recognition) to identify the start, stop, and middle of genes (GENEFINDING)
5
Scales Involved In Mapping
6
 
7
Gene Mapping
  • Linkage Maps: Order and relative position only on a macroscopic scale
    • Not actual bp distance
    • Doesn’t work if genes are too close
  • Physical Maps: Locate markers within 104 bp.


8
Linkage Mapping
  • Depends on Meiosis
    • Distances are really statistics on traits
      • Diversity in traits generated by crossing-over
      • Distance computed by frequency of expression
  • Map is expressed
    • By locating the chromosome
    • By relative  distances between genes (traits) on the chromosome


9
The Overwhelming Problem
  • To sequence the human genome we need to sequence 3 billion consecutive base-pairs


  • BUT


  • Current technology only allows sequencing 500 base pairs
10
Two oversimplified solutions..
  • Break the 3 billion base pair strand* into small strands
  • Clone the inserts
  • Sequence the small fragments
  • Reassemble the labeled fragments
11
Strategy #1
 Whole Genome Shotgun Sequencing
    • Break the genomic DNA into very small fragments (inserts of 500-2000bp)
    • Sequence the inserts
    • Reassemble the labeled inserts into a labeled genome
12
Strategy #2
 Hierarchical Shotgun Sequencing
    • Break the genomic DNA into small fragments (30,000 -1M bp) containing known markers
    • Clone the fragments in BACs
    • For each BAC clone, break into inserts (500-2000bp)
    • Sequence the inserts
    • Reassemble the labeled inserts into a labeled genome, using markers as reference points to known gene loci

13
No worries.., right?
  • We simply need to solve a specific instance the shortest common superstring problem
14
Consider these small inserts
15
 
16
The solution
  • If the problem is solved on a Hamiltonian graph, it is intractable


  • If the instance is solved on a Eulerian graph, the problem is tractable


17
Well, why not just solve the Eulerian problem, then?
  • Computational Challenge
    • Would need each segment pair to have a unique overlap
18
Some practicalities
  • Quality issues
    • Contamination from viral vector DNA
    • Erroneous reads of small segments,
    • There are many, many repeats in Eukaryotic DNA
  • Conceptual issues
    • Reads in two different directions
    • Oceans-Islands – a strategy for double tagging to span repeats
  • Coverage issues
    • Multiple copies to ensure coverage of the genome
    • Consensus
19
APPROACHES TO PHYSICAL MAPPING
  • RESTRICTION MAPPING
    • Data come from RESTRICTION ENZYME analysis
    • Cut lengths of the target DNA fragments are MARKERS



  • HYBRIDIZATION MAPPING
    • Data come from DNA PROBES which bind to target DNA
    • Probes have short, but unique, sequences ( MARKERS) and are identifiable (florescent, commonly)
    • Probe finds its exact complement in the target DNA


20
Restriction Mapping
21
Restriction Enzymes
  • Restriction enzymes cut double stranded DNA at specific spots characterized by a specific short (~7 BP) sequence


  • Usually the sequences on each strand are short reverse repeats



22
Restriction Enzymes
  • On average, restriction enzymes with
     
    4-base recognition sites will yield pieces 256 bases long

    6-base recognition sites will yield pieces 4000 bases long, and

    8-base recognition sites will yield pieces 64,000 bases long.

    Since hundreds of different restriction enzymes have been characterized, DNA can be cut into many different small fragments
23
Restriction Enzymes-Sticky
  • The cut between the strands may be staggered, leaving ‘sticky’ ends.  Restated, there is a built-in template for annealing with a complementary sequence, possibly a product of the action of the same restriction enzyme on a different fragment
24
Restriction Enzymes-Blunt
  • Likewise, the action of some restriction enzymes leave fully paired remnants
25
 
26
Fingerprinting
in Restriction Analysis

  • For restriction site analysis:
    • Locate the restriction site on a DNA fragment: the restriction site itself is thus a marker
    • Can thus see if fragments overlap (if not,the fragments are called contigs)

27
Restriction Mapping
  • The idea is to use the length of a restriction fragment as part of a fingerprint


  • The ‘list’ of fragment lengths (ordered) is a fingerprint
28
Inserting a gene region into a vector for cloning
29
Restriction Mapping
  • Use BAC clones 120-130k=Kb (usually no chimeras in BACs)
  • Use restriction enzyme
  • Different sized pieces from different clones
30
 
31
The General Problem
  • We are given lengths of fragments formed from the action of restriction enzymes on a clone or other large sequence
  • We are NOT given the positions of the cleavage sites
    • RESTATED: We have unlabeled distance data

  • We must reconstruct the positions from the ordering of the fragments
32
 
33
General Difficulties
  • We may recover more than one representative of a fragment length without any other labeling




34
General Difficulties
  • There are usually multiple ways to reconstruct the sequence from the fragments
    • Below are 3 of the 6 ways for fragment lengths {2,3,5}…
35
Types of restriction mapping problems:
  • Reconstruction, given only one dimensional distance data -single digest


  • Reconstruction, given double digest distance data


  • Alignment of clones, given distance data



36
Reconstruction, given only one dimensional distance data (single digest)
37
Reconstruction, given double digest distance data
38
 
39
The Turnpike Problem
  • The one-dimensional restriction enzyme fragment length assembly problem is a biological instance of the “Turnpike Problem”.  Given the distances between exits, create a map of the turnpike.
  • The Turnpike problem is classically solved by a backtrack algorithm.
40
The Single Digest
One dimensional distance data
  • 3 approaches
  • all exponential
  • Some only slightly less hideous than others
    • A. Polynomial representation of the multisets, with factorization as a solution
    • B. Adjacency Matrix labeling problem
    • C. Skiena- Smith -Lemke tree traverse and backtrack algorithm
41
Skiena, Smith, Lemke Algorithm-Backtracking
  • Fix the endpoints to reflect the largest distance
    • For {0,1,3,4,6,9,10} we fix 0 and 10
  • Choose the largest distance in the remaining data
    • For {1,3,4,6,9} it is 9
  • This distance could be measured from either 0 or from the 10, placing it at location 0+9 or 10-9
    • Choose one- if it does not violate the remaining distance data, keep it
    • If it does, backtrack and choose the other one


42
Skiena, Smith, Lemke Algorithm
  • The algorithm is a binary tree with 2n branches
  •  At each branch, the consistency with the remaining data must be checked
    • Must then look at remaining points O(n log n)
  • Worst case –evaluate each branch giving
  •     O(2nn log n)
43
Skiena, Smith, Lemke Algorithm
  • Still exponential, but better than the others
  • For n=8,
  • Adjacency matrix: 3x1029
  • S-S-M: 6x103


44
2. The Double Digest Problem
45
 
46
Double-Digest : Establish 3 Multisets
  • The experimental strategy is to perform a double digest
    • One (multi) set of fragment lengths from the action of enzyme A
    • One set from enzyme B
    • One set when both enzymes are applied (AÙB)
47
Organization of Experimental Data
  • A the ordered multiset of lengths of fragments cut by enzyme A ai ; i=1..n,


  • B the ordered multiset of lengths of fragments cut by enzyme B bj ; j=1…m


  • C the set of lengths of fragments cut by both  enzymes simultaneously  ck; k=1,|AÚB|



    • Remember, A, B and C are experimentally determined

48
The Strategy
  • Pick an ordering (permutation) of A and an ordering of B
  • Determine where the cut-sites of the DNA digested by enzyme A must have been, given the ordering
  • Likewise find the cut-sites from enzyme B digestion, given the ordering of B
  • Create a hypothetical set of cut-sites based on DNA digested by BOTH enzymes
  • Induce an hypothetical ordered multiset of double digest fragment lengths
  • Ascertain whether the experimentally obtained double digest fragment length C can be arranged to match the hypothetical ordered multiset of double digest fragment lengths
  • If so, one solution (of possibly many) is found.  If not, permute A and B and try again
49
Cut-sites
  • Once we have an ordering of A and B, we can figure out the cut sites S for that ordering
  • Add the lengths ai, and bj for the permutations s and m, merging the ‘next’ sum in order into S
    • s (A)=3,3,4     ai sum like 3,6,10 into SA
    • m (B)=1,2,5,2  bj sum like 1,3,8,10 into SB
    • S= SA Ú SB  = 1,3,6,8,10  no repeats ie NOT a multiset-records only the locations of the cut sites(ordered)
  • Construct C(s,m)=sj-sj-1=1,2,3,2,2
  • Can C be rearranged to give C(s,m)?


50
Piece of Cake!!
  • That’s easy– this is directly equivalent to the partition problem
    • Given a set of integers, can we partition the set into two subsets s.t. the sum of integers in each is equal?
  • Unfortunately the partition problem is NP complete
  • Therefore the DDP problem is NP complete
51
An example
(Waterman, 1995)

  • A={1,3,3,12}
  • B={1,2,3,3,4,6}
  • C=AÙB={1,1,1,1,2,2,2,3,6}


52
Two of the many solutions to the example
53
A break…
  • There would be n!m! configurations
  • BUT
  • There are repeated lengths in the fragments
  • So
  • There are           solutions, where p


  • and r are the number of repeats in A and B
54
Still,…
  • There are lots of solutions


  • In fact, a result derived by Waterman from probability theory indicates the number of solutions increases exponentially as the string length.
55
Reducing the Catastrophe
  • Waterman et al have sought to reduce the enormity of the problem:
  • Among the multiple solutions are equivalence classes
  • Each solution can be mapped into an equivalence class
  • The number of equivalence classes is much smaller than the total number of solutions


56
Waterman
Overlap Equivalence Classes
  • If two solutions have the same overlap data, they are overlap equivalent
  • For t components from t-1 cuts, the components are permuted t! ways
  • But there are 2t reflections
  • So there are t!2t overlap equivalent solutions
57
The Schmitt-Waterman Approach
  • Solution configurations can also be mapped into equivalence classes by defining ‘cassettes’ and applying
    • Cassette transformations
    • Cassette reflections
  • The mapping is performed by a sophisticated application of  Border Block Graph theory
58
Cassette Exchange
59
 
60
Summary  Equivalence Classes
  • For the Waterman example problem
  • A={1,3,3,12}
  • B={1,2,3,3,4,6}
  • C={1,1,1,1,2,2,2,3,6}


  • 4320 configurations with  208 distinct solutions


  • 26 different overlap equivalence classes


  • 25 overlap size equivalence classes


  • 18 Cassette equivalence classes


61
Heuristic
  • Map the Double Digest Problem
  •  to the
  • Traveling Salesman Problem
62
TSP
  • Recall the TSP
    • Salesman must visit N cities
    • What is the shortest itinerary, visiting each city precisely once (except return to the originating city)
63
TSP
  • TSP maps to a Hamiltonian graph
  • It is NP-Complete
64
Simulated Annealing
A heuristic for the TSP
  • Recall that a useful probability-based approximation algorithm for TSP is Simulated Annealing
  • Concept of SA is to find the lowest energy state (global minimum) by allowing the solution to ‘jump out’ of local minima
  • This is possible because there is a certain probability associated with each state that it might jump to a higher energy level.
65
Simulated Annealing
  • The probability of leaving an energy state is expressed by a relationship to the current energy state. The probability distribution is the Boltzmann distribution




  • Where E is the energy state, T is the ‘Temperature’ of the system, and k is Boltzmann’s constant


66
Simulated Annealing
  • The physical analogy is in the annealing of metal:
  • The strongest metal is purely crystalline (lowest energy state)
  • The annealing process is heating to high temperature, then waiting at decreasing temperatures so that the molecules can rearrange into optimal energy states
  • The time at a temperature, and the number of temperature ‘waits’ is the annealing schedule
67
Simulated Annealing
  • The SA algorithm has three elements
  • Definition of a cost function
    • distance in TSP
  • Definition of configurations
    • Path segment reversals and path segment exchanges in TSP applications
    • Shifting gaps in MSA applications
    • Guidance from a Stationary Transition Matrix in Double Digest applications
  • Definition of an annealing schedule
68
Implementation: Simulated Annealing
  • The algorithm:
  • At each temperature, for sufficient iterations:
  • Select a configuration (choose a neighborhood)
  • Compute the cost function
    • If the cost is lowered, keep the configuration
    • If it is higher, keep it only with a certain (Boltzmann) probability (the Metropolis step)
  • Reduce the temperature
69
Implementation: Metropolis Step
70
Implementation: Choice of Configurations
  • Always probabilistic
  • Waterman suggests uniformly distributed random selection of start and end points for reversals and transpositions from the transition matrix*
71
SA-Applying to the DDP
  • The cost function:






72
Implementation: Annealing schedule
  • Cooling
    • Rate~1/log(n)
  • Iterations
    • Not specified –usually enough so that cost functions aren’t changing much
73
Efficiency of SA-Waterman
  • For 3x1021 solutions, it found one (of possibly many) in 1635 iterations
74
3. Multiple Sets of Distance Data
  • Alignment of Overlapping Fragments
75
 
76
Maynard-Olson Algorithm
  • Recall that a fingerprint is a list of the  sizes of the restriction fragments
  • Use the Maynard Olson greedy algorithm to reconstruct the order of the clones. The complete solution is NP-complete.
  • 2 clones overlap to give 3 regions:
    • unique to left
    • common
    • unique to right

77
Example: Restriction Fragments Sorted by Length
78
Find Maximum Number of Overlapping Fragments
79
Now Combine with C2
80
Rule of Thumb
  • Move the new one (C2) to the right
  • Put matched “loose ends” to the left


  • e.g.  the 2’s in C1 and C3, and the 6’s in C1 and C3 are ‘loose ends’ – move them left
81
Combining with C2
82
Hybridization Mapping
83
 
84
Hybridization Mapping
  • STSs are unique substrings ~200bp long
  • Pick a 20bp PCR primer on each end
  • Amplify only this 200bp STS region within the whole genome by PCR
85
Fingerprinting in Hybridization
  • Hybridization mapping
    • Fragments are called clones
    • Collection of clones is a library
    • Identify the clone by the binding (hybridization) of a small sequence (probe)
    • The fingerprint is the set of probes that successfully hybridized
    • Probes shared between 2 clones implies overlap of the clones
    • Can determine order, but not location, of the probes.


86
Very Easy
87
Really Difficult
88
Fingerprinting
Repeats
  • More than one site may have the same sequence; the probe can bind to both.
  • Solution: generate probes that hybridize  only to unique (Sequence Tagged) sites.
89
Fingerprinting
Chimeras
  • Two pieces of DNA may join and be replicated as if a single clone.  This is a chimera.
  • In some clone libraries, half the clones are chimeras.
90
Ideal Data
  • Have at least one marker unique to every clone
  • Have at least two markers/clone
91
Hybridize STS Markers to Clones
92
Consecutive 1’s Property (C1P)
  • Assume all clones for all hybrids have been established
  • If the clones´hybrids matrix can be permuted so that each row has consecutive ones, then the matrix has the C1P.
  • If so, the  problem is P complex, not NP.
  • Observation errors will vitiate the solution
93
Permutation of Original Matrix
Rows have Consecutive 1’s
94
C1P Problem
  • The algorithm to determine if the data have the C1P:
    • Separate rows into components
    • Permute the components
    • Join components back together


  • If the data have the C1P, then there is a polynomial algorithm to find the alignment
    • Refer to Setubal/Meidanis 5.3 for the specifics of both algorithms.

  • The key point is that the permutations are found in P time.
95
The algorithm
  • If Rowsets are distinvt, each goes to a component which can be permuted separately
  • If there is a non-empty intersection, a variant of the Maynard-Olsen algorithm re-arranges the sub-components
  • A very complicated algorithm re-assembles the components using a tree traversal and backtrack strategy
96
Mapping With Errors
  • Chimeric clones impose blocks of 1’s separated by 0’s (gap)
  • A false negative will impose a 0 (gap) in an otherwise unbroken run of 1’s.
  • A false positive can split a run of 0’s, yielding 2 gaps
97
The Strategy
  • Minimize gaps in the permutation!


  • What do we have??
    • The TSP, an NP-Complete problem!


    • The cost function is the Hamming distance
    • (the number of rows where 2 corresponding columns differ)