Dynamic Programming in Various Algorithms

Quiz

CSE 421

RNA, Sequence Alignment

Yin Tat Lee

Announcement

•

HW5 will be posted tonight. Sorry for the late.

•

Swati OH is moved to Sunday (virtual). See website.

•

Midterm is graded. Check your score on Canvas.

•

Come to any OH for any grading mistakes.

•

Come to my OH to get back the midterm (for in-person midterm)

•

I will post the midterm solution tonight.

Midterm

Here is the statistics

Q4 is the hardest one.

It is modified from some problem in a programming contest.

If you get >= 50/90 in this midterm, you are on track for a 3.4

(depending on your homework)

Midterm

Knapsack Problem

W = 11

Stronger DP (Strengthening Hypothesis)

Take best of the two

RNA Secondary Structure

RNA Secondary Structure

RNA:

 A String B = b



 over alphabet { A, C, G, U }.

Secondary structure

.  RNA is single-stranded so it tends to loop

back and form

base pairs

with itself. This structure is essential

for understanding behavior of molecule.

Ex:

GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA

complementary base pairs:  A-U, C-G

RNA Secondary Structure (Formal)

approximate by number of base pairs

Secondary Structure (Examples)

sharp turn

crossing

ok

4

base pair

DP: First Attempt

DP: Second Attempt

Recursive Code

Does this code terminate?

Formal Induction

Bottom-up DP

Quiz Solution

Sequence Alignment

(Edit distance)

Word Alignment

How similar are two strings?

5 mismatches, 1 gap

1 mismatch, 1 gap

0 mismatches, 3 gaps

Edit Distance

Edit distance

[Levenshtein 1966, Needleman-Wunsch 1970]

   Cost = # of gaps + #mismatches.

Applications.

•

Basis for Unix diff and Word correct in editors.

•

Speech recognition.

•

Computational biology.

Cost: 3

Cost: 5

How to formalize the question.

Sequence Alignment

DP for Sequence Alignment

Bottom-up DP

Sequence-Alignment(

m, n, x

...x

, y

...y

) {

for

 i = 0 to m

      M[0, i] = i

for

 j = 0 to n

      M[j, 0] = j

for

 i = 1 to m

for

 j = 1 to n

         M[i, j] = min(

(x

=y

? 0:1) + M[i-1, j-1],

                       1 + M[i-1, j],

                       1 + M[i, j-1])

   return

 M[m, n]

Induction

 M[i, j] = min(

(x

=y

? 0:1) + M[i-1, j-1],

                       1 + M[i-1, j],

                       1 + M[i, j-1])

Optimizing Memory

We just need to use the last (row) of computed values.

Sequence-Alignment(

m, n, x

...x

, y

...y

) {

for

 i = 0 to m

      M[0, i] = i

for

 j = 0 to n

      M[j, 0] = j

for

 i = 1 to m

for

 j = 1 to n

         M[i, j] = min(

(x

=y

? 0:1) + M[i-1, j-1],

                       1 + M[i-1, j],

                       1 + M[i, j-1])

   return

 M[m, n]

•

Keep an Old array containing values of the last row

•

Fill out the new values in a New array

•

Copy new to old at the end of the loop

Sequence-Alignment(

m, n, x

...x

, y

...y

) {

for

 i = 0 to m

      O[i] = i

for

 i = 1 to m

      N[0]=i

for

 j = 1 to n

         N[j] = min(

(x

=y

? 0:1) + O[j-1],

                       1 + O[j],

                       1 + N[j-1])

      for j = 1 to n

         O[j]=N[j]

   return

 N[n]

M[i-1, j]

M[i, j-1]

M[i-1, j-1]

Shortest Path

 M[i, j] = min(

(x

=y

? 0:1) + M[i-1, j-1],

                       1 + M[i-1, j],

                       1 + M[i, j-1])

How to recover the alignment?

(0,0)

(m,n)

m/2

(m/2,j)

Lesson

Longest Path in a DAG

Longest Path in a DAG

Goal:

 Given a DAG G, find the longest path.

Recall

: A directed graph G is a DAG if it has no cycle.

This problem is NP-hard for general

directed graphs:

It has the Hamiltonian Path as a

     special case

DP for Longest Path in a DAG

: What is the right

ordering

Remember, we have to use that G is a DAG, ideally in

defining the ordering

We saw that every DAG has a

topological sorting

So, let’s use that as an ordering

DP for Longest Path in a DAG

DP for Longest Path in a DAG

Outputting the Longest Path

Record the entry that

we used to compute OPT(j)

Slide Note

Embed Share

Download

Jeremy Lin has invented a time machine that predicts $GME stock prices for upcoming days. The challenge is to determine the best trading strategy given the price predictions and constraints on the number of trades allowed. Alongside, there are announcements related to academic activities like homework posts, midterms grading, and upcoming sessions. The content also covers algorithmic problems like the Knapsack Problem and Stronger Dynamic Programming. It emphasizes efficient coding practices, use of pseudo-code, and specific guidelines for algorithm descriptions. Additionally, it touches upon topics like RNA secondary structure and sequence alignment in dynamic programming.

jahi_162 Follow

Uploaded on Dec 09, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Quiz Jeremy Lin has created a time machine. Now, he knows exactly the price of $GME for the next ? days, which are ?1,?2, ,??. Suppose Jeremy can only trade $GME for at most ? times. So, what is the best trading? Let ??,?be the network at ?-th day using ? trades. ?? ?? ??,?= max ?? 1,?,max ??,? 1 . ?<? This solution technically is wrong. What is the mistake? I omitted initial cases. 1

CSE 421 Dynamic Programming RNA, Sequence Alignment Yin Tat Lee 2

Announcement HW5 will be posted tonight. Sorry for the late. Swati OH is moved to Sunday (virtual). See website. Midterm is graded. Check your score on Canvas. Come to any OH for any grading mistakes. Come to my OH to get back the midterm (for in-person midterm) I will post the midterm solution tonight. 3

Midterm Here is the statistics percentile Q1 Q2 Q3 Q4 Q5 Total 25% 12 (75%) 17.5 (72%) 6.5 (65%) 3 (20%) 18 (72%) 66 (73%) 50% 14 (88%) 21 (87.5%) 9 (90%) 8 (53%) 24 (96%) 72.8 (81%) 75% 16 (100%) 21 (87.5%) 10 (100%) 14.5 (97%) 25 (100%) 80.8 (90%) Q4 is the hardest one. It is modified from some problem in a programming contest. If you get >= 50/90 in this midterm, you are on track for a 3.4 (depending on your homework) 4

Midterm ? 2log2? is not ?(?3). Note that ? 2log2?= ?1+log ? which is not even polynomial. Some write ?log24. Please write ?2 instead. Don t write a large ambiguous paragraph to describe your algo. Use pseudo code instead. We will deduct point in final. MST takes ? ?log? or ?(? + ?log?), but not ?(?log?). Q5, you don t need to do induction. 5

Knapsack Problem Given ?objects and a "knapsack. Item ? weighs ??> 0 kilograms and has value ??> 0. Knapsack has capacity of ? kilograms. Goal: fill knapsack so as to maximize total value. Item 1 2 Value Weight Ex: OPT is { 3, 4 } with value 40. 1 6 1 2 W = 11 3 4 5 18 22 28 5 6 7 Greedy: repeatedly add item with maximum ratio ??/??. Ex: { 5, 2, 1 } achieves only value = 35 greedy not optimal. 6

Stronger DP (Strengthening Hypothesis) Let ???(?,?) = Max value of subsets of items 1, ,? of weight ? Case 1: ???(?,?) selects item ? In this case, ??? ?,? = ??+ ???(? 1,? ??) Take best of the two Case 2: ??? ?,? does not select item ? In this case, ??? ?,? = ???(? 1,?). Therefore, If ? = 0 If ??> ? o.w., 0 ??? ? 1,? max(??? ? 1,? ,??+ ??? ? 1,? ??) ??? ?,? = 7

RNA Secondary Structure

RNA Secondary Structure RNA: A String B = b1b2 bn over alphabet { A, C, G, U }. Secondary structure. RNA is single-stranded so it tends to loop back and form base pairs with itself. This structure is essential for understanding behavior of molecule. C A Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA A A A U G C C G U A A G G U A U U A G A C G C U G C G C G A G C G A U G complementary base pairs: A-U, C-G 9

RNA Secondary Structure (Formal) Secondary structure. A set of pairs ? = { ??,??} that satisfy: [Matching]: ? is a matching. [Valid]: each pair in ? is: ? ?, ? ?, ? ?, or ? ?. [No sharp turns]: The ends of each pair are separated by at least 4 intervening bases. If ??,?? ?, then ? < ? 4. [Non-crossing] If (??,??) and (??,??) are two pairs in ?, then we cannot have ? < ? < ? < ?. Free energy: Usual hypothesis is that an RNA molecule will maximize total free energy. approximate by number of base pairs Goal: Given an RNA molecule B = b1b2 bn, find a secondary structure S that maximizes the number of base pairs. 10

Secondary Structure (Examples) G G G G G G G C U C U G C G C U C A U A U A G U A U A U A base pair A U G U G G C C A U A U G G G C A U A G U U G G C C A U G 4 ok sharp turn crossing 11

DP: First Attempt First attempt. Let ???(?) = maximum number of base pairs in a secondary structure of the substring b1b2 bn. Suppose ?? is matched with ?? in ??? ? . What IH should we use? match bt and bn t n 1 Difficulty: This naturally reduces to two subproblems Finding secondary structure in ?1, ,?? 1, i.e., OPT(t-1) Finding secondary structure in ??+1, ,?? 1, ??? 12

DP: Second Attempt Definition: ??? ?,? = maximum number of base pairs in a secondary structure of the substring ??,??+1, ,?? The most important part of a correct DP; It fixes IH Case 1: If ? ? 4. ??? ?,? = 0 by no-sharp turns condition. Case 2: Base ?? is not involved in a pair. ??? ?,? = ???(?,? 1) Case 3: Base ?? pairs with ?? for some ? ? < ? 4 non-crossing constraint decouples resulting sub-problems ? ?<? 4{ 1 + ???(?,? 1) + ???(? + 1,? 1) } ??? ?,? = max 13

Recursive Code Let M[i,j]=empty for all i,j. Compute-OPT(i,j){ if (j-i <= 4) return 0; if (M[i,j] is empty) M[i,j]=Compute-OPT(i,j-1) for t=i to j-5 do if (??,?? is in {A-U, U-A, C-G, G-C}) M[i,j]=max(M[i,j], 1+Compute-OPT(i,t-1) + Compute-OPT(t+1,j-1)) return M[j] } Does this code terminate? 14

Formal Induction Let ???(?,?) = maximum number of base pairs in a secondary structure of the substring ??,??+1, ,?? Base Case: ???(?,?) = 0 for all ?,? where ? ? 4. IH: For some 4, Suppose we have computed ???(?,?) for all ?,? where ? ? . IS: Goal: We find ???(?,?) for all ?,? where ? ? = + 1. Fix ?,? such that ? ? = + 1. Case 1: Base ?? is not involved in a pair. ??? ?,? = ???(?,? 1) [this we know by IH since ? ? 1 = ] Case 2: Base ?? pairs with ?? for some ? ? < ? 4 ??? ?,? = max ? ?<? 4{ 1 + ???(?,? 1) + ???(? + 1,? 1) } We know by IH since difference 15

Bottom-up DP 4 0 0 0 for = 1, 2, , n-1 for i = 1, 2, , n-1 j = i + if ( <= 4) M[i,j]=0; else M[i,j]=M[i,j-1] for t=i to j-5 do if (??,?? is in {A-U, U-A, C-G, G-C}) M[i,j]=max(M[i,j], 1+ M[i,t-1] + M[t+1,j-1]) 3 0 0 i 2 0 1 6 7 8 9 j return M[1, n] } Running Time: ?(?3) (It is also okay to loop over ?,? or ?,?) 16

Quiz Solution Let ?(?,?,?,?) be true if and only if the numbers ?1,?2, ,?? can be partitioned into three sets whose sums are ?,?,?. If ? > 0, we have ? ?,?,?,? = ? ? 1,? ??,?,? ?? ? ? 1,?,? ??,? ?? ? ? 1,?,?,? ?? If ? = 0, we have ? 0,?,?,? = True if ? = ? = ? = 0, False otherwise. 18

Sequence Alignment (Edit distance)

Word Alignment How similar are two strings? ocurrance occurrence o c u r r a n c e - o c c u r r e n c e 5 mismatches, 1 gap o c - u r r a n c e o c c u r r e n c e 1 mismatch, 1 gap o c - u r r - a n c e o c c u r r e - n c e 0 mismatches, 3 gaps 20

Edit Distance Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] Cost = # of gaps + #mismatches. How to formalize the question. Applications. Basis for Unix diff and Word correct in editors. Speech recognition. Computational biology. C T G A C C T A C C T - C T G A C C T A C C T C C T G A C T A C A T C C T G A C - T A C A T Cost: 5 Cost: 3 21

Sequence Alignment Given two strings ?1, ,?? and ?1, ,?? find an alignment with minimum number of mismatch and gaps. An alignment is a set of ordered pairs (??1,??1), ??2,??2, such that ?1< ?2< and ?1< ?2< Example: CTACCG vs. TACATG. Sol: We aligned x2-y1, x3-y2, x4-y3, x5-y4, x6-y6. x1 x2 x3 x4 x5 x6 C T A C C - G - T A C A T G So, the cost is 3. y1 y2 y3 y4 y5 y6 22

DP for Sequence Alignment Let ???(?,?) be min cost of aligning ?1, ,?? and ?1, ,?? Case 1: OPT matches ??,?? Then, pay mis-match cost if ?? ?? + min cost of aligning ?1, ,?? 1 and ?1, ,?? 1 i.e., ???(? 1,? 1) Case 2: OPT leaves ?? unmatched Then, pay gap cost for ?? + ??? ? 1,? Case 3: OPT leaves ?? unmatched Then, pay gap cost for ?? + ???(?,? 1) 23

Bottom-up DP Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m M[0, i] = i for j = 0 to n M[j, 0] = j for i = 1 to m for j = 1 to n M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) return M[m, n] } Analysis: (??) time and space. Computational biology: m = n = 1,000,000. 1000 billions ops OK, but 1TB array? 24

M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) Induction What is the order of induction? (i.e. why there is no loop?) We can do induction on ? + ?. (Alternatively, we can induct on the step of the algorithm) 25

Optimizing Memory We just need to use the last (row) of computed values. Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m M[0, i] = i for j = 0 to n M[j, 0] = j for i = 1 to m for j = 1 to n M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) return M[m, n] } Just need ? 1,? rows to compute M[i,j] 26

DP with ?(? + ?) memory Keep an Old array containing values of the last row Fill out the new values in a New array Copy new to old at the end of the loop Sequence-Alignment(m, n, x1x2...xm, y1y2...yn) { for i = 0 to m O[i] = i for i = 1 to m N[0]=i for j = 1 to n N[j] = min( (xi=yj ? 0:1) + O[j-1], 1 + O[j], 1 + N[j-1]) for j = 1 to n O[j]=N[j] return N[n] } M[i-1, j-1] M[i-1, j] M[i, j-1] 27

M[i, j] = min( (xi=yj ? 0:1) + M[i-1, j-1], 1 + M[i-1, j], 1 + M[i, j-1]) Shortest Path Edit distance is the distance between (0,0) and (?,?) of the following graph. All horizontal edges has cost 1. All vertical edges has cost 1. The cost of edges from (? 1,? 1) to (?,?) is 1?? ?? The graph is a DAG. Question: How to recover the alignment (or how to find the shortest path) without using ?? space? 28

How to recover the alignment? Idea 1: Suffices to find the point a shortest path pass on the middle row. (m,n) (m/2,j) m/2 Why? Divide and Conquer! (0,0) Idea 2: ?0,0 (?,?)= min ?0,0 (?/2,?)+ ??/2,? (?,?) ? Find(i1,j1,i2,j2) { // Due to spacing, ignored boundary cases Let ? = (??+ ??)/? Compute ???,?? (?,??) for all ? via Sequence-Alignment. Compute ??,? (??,??) for all ? via similar algo run backward. Let ? = argmin????,?? (?,??)+ ??,?? (??,??) ??= Find(i1,j1,k,j) ??= Find(k,j,i2,j2) return ??,?? } 29

Lesson Advantage of a bottom-up DP: It is much easier to optimize the space. By the way, edit distance can be computed in ?(? min ?,? ) if edit distance ? ?2 can be computed in ?( log2?) (1980). can be approximated by log factor in ?(?1+?) (~2010). cannot be solved in ?(?2 ?) exactly (2015). can be approximated by O(1) factor in ?(?2 2/7) (~2018). can be approximated by O(1) factor in ?(?1+?) (~2020). 30

Longest Path in a DAG

Longest Path in a DAG Goal: Given a DAG G, find the longest path. Recall: A directed graph G is a DAG if it has no cycle. This problem is NP-hard for general directed graphs: - It has the Hamiltonian Path as a special case 2 3 6 5 4 7 1 32

DP for Longest Path in a DAG Q: What is the right ordering? Remember, we have to use that G is a DAG, ideally in defining the ordering We saw that every DAG has a topological sorting So, let s use that as an ordering. 2 3 1 2 3 4 5 6 7 6 5 4 7 1 33

DP for Longest Path in a DAG Suppose we have labelled the vertices such that (?,?) is a directed edge only if ? < ?. 1 2 3 4 5 6 7 Let ???(?) = length of the longest path ending at ? Suppose OPT(j) is ?1,?2, ?2,?3, , ?? 1,??, ??,? , then Obs 1: ?1 ?2 ?? ?. Obs 2: ?1,?2, ?2,?3, , ?? 1,?? is the longest path ending at ??. ??? ? = 1 + ??? ??. 34

DP for Longest Path in a DAG Suppose we have labelled the vertices such that (?,?) is a directed edge only if ? < ?. Let ???(?) = length of the longest path ending at ? 0 1 + If ? is a source o.w. ??? ? = ?: ?,? an edge???(?) max 35

Outputting the Longest Path Let G be a DAG given with a topological sorting: For all edges (?,?) we have i<j. Initialize Parent[j]=-1 for all j. Compute-OPT(j){ if (in-degree(j)==0) return 0 if (M[j]==empty) M[j]=0; for all edges (i,j) if (M[j] < 1+Compute-OPT(i)) M[j]=1+Compute-OPT(i) Parent[j]=i return M[j] } Let M[k] be the maximum of M[1], ,M[n] While (Parent[k]!=-1) Print k k=Parent[k] Record the entry that we used to compute OPT(j) 36

Dynamic Programming in Various Algorithms

Download Presentation

Presentation Transcript

Related

More Related Content