Understanding RNA Secondary Structure Through Dynamic Programming
Exploring the concept of RNA secondary structure using dynamic programming, focusing on base pair formations and the rules governing them. The content discusses Watson-Crick pairing, non-crossing constraints, and the importance of maximizing free energy in RNA molecules. It also delves into recursive code and its termination, emphasizing the key questions and inductive reasoning involved in the process.
- RNA Secondary Structure
- Dynamic Programming
- Watson-Crick Pairing
- Free Energy Maximization
- Recursive Code
Uploaded on Dec 09, 2024 | 0 Views
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
CSE 421 Dynamic Programming RNA, Sequence Alignment Yin Tat Lee 1
RNA Secondary Structure RNA: A String B = b1b2 bn over alphabet { A, C, G, U }. Secondary structure. RNA is single-stranded so it tends to loop back and form base pairs with itself. This structure is essential for understanding behavior of molecule. C A Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA A A A U G C C G U A A G G U A U U A G A C G C U G C G C G A G C G A U G complementary base pairs: A-U, C-G 3
RNA Secondary Structure (Formal) Secondary structure. A set of pairs ? = { ??,??} that satisfy: [Watson-Crick.] ? is a matching and each pair in ? is a Watson-Crick pair: ? ?, ? ?, ? ?, or ? ?. [No sharp turns.]: The ends of each pair are separated by at least 4 intervening bases. If ??,?? ?, then ? < ? 4. [Non-crossing.] If (??,??) and (??,??) are two pairs in ?, then we cannot have ? < ? < ? < ?. Free energy: Usual hypothesis is that an RNA molecule will maximize total free energy. approximate by number of base pairs Goal: Given an RNA molecule B = b1b2 bn, find a secondary structure S that maximizes the number of base pairs. 4
Secondary Structure (Examples) G G G G G G G C U C U G C G C U C A U A U A G U A U A U A base pair A U G U G G C C A U A U G G G C A U A G U U G G C C A U G 4 ok sharp turn crossing 5
DP: First Attempt First attempt. Let ???(?) = maximum number of base pairs in a secondary structure of the substring b1b2 bn. Suppose ?? is matched with ?? in ??? ? . What IH should we use? match bt and bn t n 1 Difficulty: This naturally reduces to two subproblems Finding secondary structure in ?1, ,?? 1, i.e., OPT(t-1) Finding secondary structure in ??+1, ,?? 1, ??? 6
DP: Second Attempt Definition: ??? ?,? = maximum number of base pairs in a secondary structure of the substring ??,??+1, ,?? The most important part of a correct DP; It fixes IH Case 1: If ? ? 4. ??? ?,? = 0 by no-sharp turns condition. Case 2: Base ?? is not involved in a pair. ??? ?,? = ???(?,? 1) Case 3: Base ?? pairs with ?? for some ? ? < ? 4 non-crossing constraint decouples resulting sub-problems ? ?<? 4{ 1 + ???(?,? 1) + ???(? + 1,? 1) } ??? ?,? = max 7
Recursive Code Let M[i,j]=empty for all i,j. Compute-OPT(i,j){ if (j-i <= 4) return 0; if (M[i,j] is empty) M[i,j]=Compute-OPT(i,j-1) for t=i to j-5 do if (??,?? is in {A-U, U-A, C-G, G-C}) M[i,j]=max(M[i,j], 1+Compute-OPT(i,t-1) + Compute-OPT(t+1,j-1)) return M[j] } Does this code terminate? What are we inducting on? Key question: is there any loop in the recursion? 8
Formal Induction Let ???(?,?) = maximum number of base pairs in a secondary structure of the substring ??,??+1, ,?? Base Case: ???(?,?) = 0 for all ?,? where ? ? 4. IH: For some 4, Suppose we have computed ???(?,?) for all ?,? where ? ? . IS: Goal: We find ???(?,?) for all ?,? where ? ? = + 1. Fix ?,? such that ? ? = + 1. Case 1: Base ?? is not involved in a pair. ??? ?,? = ???(?,? 1) [this we know by IH since ? ? 1 = ] Case 2: Base ?? pairs with ?? for some ? ? < ? 4 ??? ?,? = max ? ?<? 4{ 1 + ???(?,? 1) + ???(? + 1,? 1) } We know by IH since difference 9
Bottom-up DP 4 0 0 0 for = 1, 2, , n-1 for i = 1, 2, , n-1 j = i + if ( <= 4) M[i,j]=0; else M[i,j]=M[i,j-1] for t=i to j-5 do if (??,?? is in {A-U, U-A, C-G, G-C}) M[i,j]=max(M[i,j], 1+ M[i,t-1] + M[t+1,j-1]) 3 0 0 i 2 0 1 6 7 8 9 j return M[1, n] } Running Time: ?(?3) 10
Lesson We may not always induct on ? or ? to get to smaller subproblems. We may have to induct on |? ?| or ? + ? when we are dealing with more complex problems. 11
Sequence Alignment (Edit distance)
Word Alignment How similar are two strings? ocurrance occurrence o c u r r a n c e - o c c u r r e n c e 5 mismatches, 1 gap o c - u r r a n c e o c c u r r e n c e 1 mismatch, 1 gap o c - u r r - a n c e o c c u r r e - n c e 0 mismatches, 3 gaps 13
Edit Distance Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] Cost = # of gaps + #mismatches. How to formalize the question. Applications. Basis for Unix diff and Word correct in editors. Speech recognition. Computational biology. C T G A C C T A C C T - C T G A C C T A C C T C C T G A C T A C A T C C T G A C - T A C A T Cost: 5 Cost: 3 14
Sequence Alignment Given two strings ?1, ,?? and ?1, ,?? find an alignment with minimum number of mismatch and gaps. An alignment is a set of ordered pairs (??1,??1), ??2,??2, such that ?1< ?2< and ?1< ?2< Example: CTACCG vs. TACATG. Sol: We aligned x2-y1, x3-y2, x4-y3, x5-y4, x6-y6. x1 x2 x3 x4 x5 x6 C T A C C - G - T A C A T G So, the cost is 3. y1 y2 y3 y4 y5 y6 15
DP for Sequence Alignment Let ???(?,?) be min cost of aligning ?1, ,?? and ?1, ,?? Case 1: OPT matches ??,?? Then, pay mis-match cost if ?? ?? + min cost of aligning ?1, ,?? 1 and ?1, ,?? 1 i.e., ???(? 1,? 1) Case 2: OPT leaves ?? unmatched Then, pay gap cost for ?? + ??? ? 1,? Case 3: OPT leaves ?? unmatched Then, pay gap cost for ?? + ???(?,? 1) 16
M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) Induction What is the order of induction? (i.e. why there is no loop?) We can do induction on ? + ?. 17
Bottom-up DP Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m M[0, i] = i for j = 0 to n M[j, 0] = j for i = 1 to m for j = 1 to n M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) return M[m, n] } Analysis: (??) time and space. Computational biology: m = n = 100,000. 10 billions ops OK, but 10GB array? 18
Optimizing Memory If we are not using strong induction in the DP, we just need to use the last (row) of computed values. Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m M[0, i] = i for j = 0 to n M[j, 0] = j for i = 1 to m for j = 1 to n M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) return M[m, n] } Just need ? 1,? rows to compute M[i,j] 19
DP with ?(? + ?) memory Keep an Old array containing values of the last row Fill out the new values in a New array Copy new to old at the end of the loop Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m O[i] = i for i = 1 to m N[0]=i for j = 1 to n N[j] = min( (xi=yj ? 0:1) + O[j-1], 1 + O[j], 1 + N[j-1]) for j = 1 to n O[j]=N[j] return N[n] } M[i-1, j-1] M[i-1, j] M[i, j-1] 20
M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) Shortest Path 21
How to recover the alignment? Hint: bidirectional search. 22
Lesson Advantage of a bottom-up DP: It is much easier to optimize the space. By the way, edit distance ?2 can be computed in ?( log2?) (1980). can be approximated by log factor in ?(?1+?) (~2010). cannot be solved in ?(?2 ?) exactly (2015). can be approximated by O(1) factor in ?(?1+?) (~2018). 23
Longest Path in a DAG Goal: Given a DAG G, find the longest path. Recall: A directed graph G is a DAG if it has no cycle. This problem is NP-hard for general directed graphs: - It has the Hamiltonian Path as a special case 2 3 6 5 4 7 1 25
DP for Longest Path in a DAG Q: What is the right ordering? Remember, we have to use that G is a DAG, ideally in defining the ordering We saw that every DAG has a topological sorting So, let s use that as an ordering. 2 3 1 2 3 4 5 6 7 6 5 4 7 1 26
DP for Longest Path in a DAG Suppose we have labelled the vertices such that (?,?) is a directed edge only if ? < ?. 1 2 3 4 5 6 7 Let ???(?) = length of the longest path ending at ? Suppose OPT(j) is ?1,?2, ?2,?3, , ?? 1,??, ??,? , then Obs 1: ?1 ?2 ?? ?. Obs 2: ?1,?2, ?2,?3, , ?? 1,?? is the longest path ending at ??. ??? ? = 1 + ??? ??. 27
DP for Longest Path in a DAG Suppose we have labelled the vertices such that (?,?) is a directed edge only if ? < ?. Let ???(?) = length of the longest path ending at ? 0 1 + If ? is a source o.w. ??? ? = ?: ?,? an edge???(?) max 28
Outputting the Longest Path Let G be a DAG given with a topological sorting: For all edges (?,?) we have i<j. Initialize Parent[j]=-1 for all j. Compute-OPT(j){ if (in-degree(j)==0) return 0 if (M[j]==empty) M[j]=0; for all edges (i,j) if (M[j] < 1+Compute-OPT(i)) M[j]=1+Compute-OPT(i) Parent[j]=i return M[j] } Let M[k] be the maximum of M[1], ,M[n] While (Parent[k]!=-1) Print k k=Parent[k] Record the entry that we used to compute OPT(j) 29