Notes
Slide Show
Outline
1
Gene Finding - Locating CpG islands and GC rich regions
  • Jin Jun
  • 12/09/2004
  • Dept. of CS&E
2
Overview
  • Biological background
  • Previous works and Idea
  • Problem definition
  • Length constrained algorithms
    • GC rich regions
    • CpG islands
  • Conclusions


3
GC rich regions
  • Regions of a DNA sequence that are rich in nucleotides C and G


  • significant in gene recognition and comparative genomics
4
GC rich regions cont.
  • Isochore model
    • The genome of warm-blooded vertebrates is composed of isochores
      • long (>300-kb) DNA regions homogeneous in nucleotide composition.
    • Five distinct isochore classes were described for the human genome
      • GC-poor classes L1 and L2 (~63% of the genome) and increasingly GC-rich classes H1, H2, and H3 (~24%, 7.5%, and 4.7%, respectively).
      • These five classes form the so called typical “genome phenotype”.
5
GC rich regions cont.
  • Birds (chicken) and rodents (mouse and rats) have deviant genome phenotypes
    • The chicken possesses an additional extremely GC-rich isochore class H4, whereas mouse and rat lack class H3. è Comparative genomics


  • Genes are found predominantly in the GC-richest isochore classes H2 and H3, which comprise only 12% of the human genome.
     è Gene recognition
6
CpG Islands
  • Regions of DNA of at least 200bp in length
    • with G+C content above 50%, and
    • a ratio of observed vs. expected CpGs at least 0.6
    • Most of the CpG islands are between 200 and 1400bp with a majority of them being 200–400bp.
  • play an important role in the prediction of promoter.
7
 
8
How to find GC rich regions?
  • A fixed window is moved down the sequence/alignment and calculating statistics of GC richness or compositional heterogeneity at each position that the window is moved to.


  • Since an optimal region could span several windows, the window-based approach might fail in finding the exact locations of some interesting regions.
9
Measure of the GC richness
  • x-pl : where x is the C+G count of the region, p is a positive constant ratio, and l is the length of the region.


  • In other words, each of nucleotides C and G is given a reward of 1-p, and each of nucleotides A and T is penalized by p
10
Length-constrained “heaviest” segments locating algorithms
  • (i) given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum in O(n)-time.


  • (ii) given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average in O(n log L)-time.
11
Maximum Sum Consecutive Subsequence
with Length Constraints
12
Problem Definition

  • Given A = <a1, a2, …, an> and U ≤ n


  • Find a consecutive subsequence of A of length at most U such that the sum of the numbers in the subsequence is maximized.


13
Notations

  • Let A1, A2, …, Ak  be disjoint (consecutive) subsequences of A forming a partition of A, i.e. A = A1A2 … Ak. Ai is called the ith segment of the partition.


  • Denote w(A) as the sum of the sequence
14
Definitions: left-negatives
  • A real sequence A = <a1, a2, …, an> is left-negative iff the sum of each proper prefix <a1, a2, …, an> (w(<a1, a2, …, an>)) is negative or zero for all 1 ≤ i ≤ n-1.
    • <-4, 1, -2, 3> is left-negative, while <5, -3, 4, -1, 2, -6> is not.


  • A partition of the sequence A = A1A2 … Ak is minimal left-negative if each Ai, 1 ≤ i ≤ k, is left-negative, and, for each 1 ≤ i ≤ k-1, the sum of Ai is positive, i.e. w(Ai) > 0.
    • the partition <5> <-3, 4> <-1, 2> <-6> of <5, -3, 4, -1, 2, -6> is minimal left-negative.
15
Algorithm 1
  • Set up the left-negative pointers.
16
Example for Alg. 1
  • The left-negative pointers of a 15-element sequence.
  •         1  2  3  4   5  6  7  8  9 10 11 12 13 14 15


  • ai   9 -3  1  7 -15  2  3 -4  2 -7  6 -2  8  4 -9
  • p[i] 1  4  3  4  15  6  7 13  9 13 11 13 13 14 15
  • w[i] 9  5  1  7 -12  2  3  3  2  5  6  6  8  4 -9




  •    9 -3  1  7 -15  2  3 -4  2 -7  6 -2  8  4 -9
17
Time Complexity of Alg. 1

  • Lemma 2 The algorithm MLN-Point given in Algorithm 1 finds all left-negative pointers for a length n sequence in O(n) time
18
Algorithm 2
  • Finding the maximum sum consecutive subsequence with length constraint.
19
Example for Alg. 2
  • The left-negative pointers of a 15-element sequence with U=5.
  •          1  2  3  4   5  6  7  8  9 10 11 12 13 14 15


  •  ai      9 -3  1  7 -15  2  3 -4  2 -7  6 -2  8  4 -9
  •  p[i]    1  4  3  4  15  6  7 13  9 13 11 13 13 14 15
  •  w[i]    9  5  1  7 -12  2  3  3  2  5  6  6  8  4 -9


  •   i      1  2  3  4   5  6  7  8  9 10 11 12 13 14 15
  •   j      4  4  4  4   7  7  7  9 13 14 14 14 14 14 15
  • S(i,j)  14  5  8  4 -10  5  3 -2  7  9 16 10 12  4 -9
  •   mi     1                             11
  •   mj     4                             14
  •   ms    14                             16
  • result:<6, -2, 8, 4> and its sum = 16.
20
Time Complexity of Alg. 2

  • Theorem 1 Given a length n real sequence, finding the consecutive subsequence of length at most U with the maximum sum can be done in O(n) time.
21
Lower and Upper bounds Alg.

  • Corollary 2 Given a length n real sequence and positive integers L ≤ U, finding the consecutive subsequence of length between L and U with the maximum sum can be done in O(n) time
22
Maximum Average
Consecutive Subsequence with Length Constraints
23
Definitions : right-skew
  • u(A) = w(A)/|A| : the average of A


  • A sequence A = <a1, a2, …, an> is right-skew iff the average of any prefix <a1, a2, …, ai> is always less than or equal to the average of the remaining suffix subsequence <ai+1, ai+2, …, an>.
    • <-4, 1, -2, 3> is right-skew, while <5, -3, 4, -1, 2, -6> is not.


  • A partition A = A1A2 … Ak is decreasingly right-skew if each segment Ai of the partition is right-skew and u(Ai) > u(Aj) for any i < j .
    • the partition <5> <-3, 4> <-1, 2, -6> of <5, -3, 4, -1, 2, -6> is decreasingly right-skew.
24
Example for right-skew alg.
  • The left-negative pointers of a 15-element sequence.
  •         1  2  3  4   5  6  7  8  9 10 11 12 13 14 15


  • ai   9 -3  1  7 -15  2  3 -4  2 -7  6 -2  8  4 -9
  • p[i] 1  4  4  4  14  7  7 14  9 14 11 14 13 14 15
  • w[i] 9  5  8  7  -3  5  3  7  2  9  6 10  8  4 -9
  • d[i] 1  3  2  1  10  2  1  7  1  5  1  3  1  1  1




  •    9 -3  1  7 -15  2  3 -4  2 -7  6 -2  8  4 -9
25
Algorithm 3
  • Finding the maximum average consecutive subsequence with length constraint
26
Time Complexity of Alg. 3

  • Lemma 6 The algorithm can computes all right-skew pointers for a length n sequence in O(n) time.


  • Theorem 3 Given a length n real sequence, finding the consecutive subsequence of length at least L with the maximum average can be done in O(n log L) time.
27
Results
  • Experiments
    • mslc: A proposed algorithm 2 to locates the maximum-sum subsequence of length at most U in O(n)-time.
    • mslc slow: A brute-force O(nU)-time version.
    • mavs: A proposed algorithm 3 to locates the maximum-average subsequence of length at least L in O(n log L).
    • mavs slow: A brute-force O(nL)-time version.
    • mavs linear: this program linearly scan segments for the good partnership. In the worst case, the time complexity is O(nL).
28
homo sapiens 4q sequence contig of size 459kb from position 114130kb to 114589kb
29
rabbit a-like globin gene cluster sequence of 10621bp
30
Conclusion
  • Algorithm for finding a consecutive subsequence of length at most U with the maximum sum in O(n)-time.
    è GC rich regions


  • Algorithm for finding a consecutive subsequence of length at least L with the maximum average in O(n log L)-time.
    è CpG islands
31
References
  • Anton Nekrutenko and Wen-Hsiung Li, “Assessment of compositional heterogeneity within and between eukaryotic genomes”, Genome Research, 10:1986–1995, 2000.


  • Yaw-Ling Lin, Tao Jiang, and Kun-Mao Chao, “Efficient algorithms for locating the length-constrained heaviest segments, with applications to biomolecular sequence analysis”, Journal of Computer and System Science, 65:570-586, 2002.


  • X. Huang, “An algorithm for identifying regions of a DNA sequence that satisfy a content requirement”, CABIOS, 10:219–225, 1994.