
Random Walks and Markov Chains in Data Science Lecture
Explore random walks, Markov chains, gradient descent, and stationary distributions in data science, and their applications in data analysis and machine learning. The lecture covers strongly connected and terminal components of graphs, the matrix form of a random walk, and the significance of stationary distributions, with fundamental theorems and practical examples presented by Grigory Yaroslavtsev.
Presentation Transcript
CSCI B609: Foundations of Data Science
Lecture 10/11: Random Walks and Markov Chains + ML Intro
Slides at http://grigory.us/data-science-class.html
Grigory Yaroslavtsev, http://grigory.us
Project Example: Gradient Descent in TensorFlow
Gradient descent (will be covered in class) and its variants:
- Adagrad: http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
- Momentum (stochastic gradient descent + tweaks): http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf
- Adam (adaptive + momentum): http://arxiv.org/pdf/1412.6980.pdf
- FTRL: http://jmlr.org/proceedings/papers/v15/mcmahan11b/mcmahan11b.pdf
- RMSProp: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
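As a taste of how these optimizers differ, here is a minimal sketch contrasting plain gradient descent with the momentum variant on the toy objective f(x) = x²; the learning rate, momentum coefficient, and step count are illustrative choices, not values from the linked papers:

```python
# Minimal sketch: plain gradient descent vs. the momentum variant on
# f(x) = x^2. All hyperparameters below are illustrative.

def gd(grad, x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)          # step against the gradient
    return x

def momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)     # accumulate a decaying velocity term
        x -= lr * v
    return x

grad = lambda x: 2 * x             # gradient of f(x) = x^2
print(gd(grad, 5.0), momentum(grad, 5.0))  # both end up near the minimum at 0
```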
Random Walks and Markov Chains
Random walk: a directed graph $G(V, E)$ with a starting vertex $x_0 \in V$; each edge $(x, y)$ carries a transition probability $p_{xy}$ of moving $x \to y$, with $\sum_y p_{xy} = 1$ for every $x$.
[Figure: example directed graph with transition probabilities on the edges.]
Strongly Connected Components
Def (Strongly Connected Component): a set $S \subseteq V$ such that for all $u, v \in S$ there exist paths $u \to v$ and $v \to u$.
SCCs form a partition of the vertex set.
Terminal SCC: an SCC with no outgoing edges.
A long enough random walk ends up in a terminal SCC.
[Figure: example graph with its SCCs and a terminal SCC highlighted.]
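The SCC and terminal-SCC definitions above can be sketched in code; the example graph below and the choice of Kosaraju's two-pass algorithm (not named on the slide) are illustrative:

```python
# Sketch: Kosaraju's two-pass DFS finds strongly connected components;
# the terminal SCCs (no outgoing edges) are where a long random walk
# eventually gets trapped. The example graph is illustrative.
from collections import defaultdict

def strongly_connected_components(graph):
    """graph: dict vertex -> list of out-neighbours (every vertex is a key)."""
    order, seen = [], set()

    def dfs(u):                        # first pass: record finish order
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = defaultdict(list)            # reversed graph for the second pass
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)

    comp = {}                          # vertex -> component id
    for u in reversed(order):
        if u in comp:
            continue
        stack = [u]
        while stack:
            w = stack.pop()
            if w not in comp:
                comp[w] = u
                stack.extend(rev[w])
    return comp

def terminal_components(graph, comp):
    terminal = set(comp.values())
    for u in graph:
        for v in graph[u]:
            if comp[u] != comp[v]:     # an edge leaving the component
                terminal.discard(comp[u])
    return terminal

g = {0: [1], 1: [0, 2], 2: [3], 3: [4], 4: [3]}
comp = strongly_connected_components(g)
print(terminal_components(g, comp))    # only the SCC {3, 4} is terminal
```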
Matrix Form and Stationary Distribution
$p_t$ = probability distribution over vertices at time $t$, with $p_0 = (1, 0, \dots, 0)$.
$p_t P = p_{t+1}$, where $P$ is the transition matrix with entries $p_{xy}$.
As $t \to \infty$, the average $a_t = \frac{1}{t} \sum_{i=0}^{t-1} p_i$ converges to $\pi$, the stationary distribution of $P$.
$\pi$ is unique and doesn't depend on $p_0$ if $G$ is strongly connected.
Note: $p_t$ itself doesn't always converge (e.g., on a 2-cycle it oscillates forever), but $a_t$ always converges.
Stationary Distribution
Long-term average: $a_t = \frac{1}{t} \sum_{i=0}^{t-1} p_i$.
Thm. If $G$ is strongly connected then there exists $\pi$ with $\pi P = \pi$ and $\sum_x \pi_x = 1$, i.e. $\pi \, [P - I, \mathbf{1}] = [\mathbf{0}, 1]$.
We will show that $[P - I, \mathbf{1}]$ has rank $n$, so there is a unique solution to $\pi \, [P - I, \mathbf{1}] = [\mathbf{0}, 1]$.
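The claim that $a_t$ converges even when $p_t$ oscillates can be checked on the smallest example, a deterministic 2-cycle; the chain and the horizon $T$ below are illustrative:

```python
# Sketch: the running average a_t = (1/t) * sum_{i<t} p_i converges to the
# stationary distribution even when p_t itself oscillates.
# Example chain: a deterministic 2-cycle, whose stationary pi = (1/2, 1/2).
P = [[0.0, 1.0],
     [1.0, 0.0]]

def step(p, P):
    """One step of the walk: p_{t+1} = p_t P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0]                 # p_0 concentrated on vertex 0
total = [0.0, 0.0]
T = 1000
for _ in range(T):
    total = [total[j] + p[j] for j in range(2)]
    p = step(p, P)             # p alternates (1,0), (0,1), ... forever

a = [x / T for x in total]
print(a)                       # the average is (0.5, 0.5) although p_t never converges
```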
Stationary Distribution Theorem
Thm. The $n \times (n + 1)$ matrix $A = [P - I, \mathbf{1}]$ has rank $n$.
If $\mathrm{rank}(A) < n$, there would be two linearly independent solutions to $Ax = 0$.
Since $\sum_y p_{xy} = 1$, the row sums of $P - I$ are 0, so $(\mathbf{1}, 0)$ (ones in the first $n$ coordinates, 0 in the last) is a solution to $Ax = 0$.
Assume there is another solution $(x, \alpha)$ not proportional to $(\mathbf{1}, 0)$: then $(P - I)x + \alpha \mathbf{1} = 0$, i.e. for every $i$: $x_i = \sum_j p_{ij} x_j + \alpha$, and not all $x_i$ are equal.
Stationary Distribution Theorem Cont.
For every $i$: $x_i = \sum_j p_{ij} x_j + \alpha$, and not all $x_i$ are equal.
Let $S = \{i : x_i = \max_j x_j\}$, the set of maximum-value coordinates; $S$ is non-empty, and $S \neq V$ since not all $x_i$ are equal.
Because $G$ is strongly connected, there is an edge $(i, j)$ with $i \in S$, $j \notin S$; then $x_i > \sum_j p_{ij} x_j$, which forces $\alpha > 0$.
The symmetric argument with $S' = \{i : x_i = \min_j x_j\}$ gives $x_i < \sum_j p_{ij} x_j$ and hence $\alpha < 0$.
Contradiction; so $(\mathbf{1}, 0)$ spans the null space of $A$ and $\mathrm{rank}(A) = n$.
Fundamental Theorem of Markov Chains
Thm. If $P$ is the transition matrix of a strongly connected Markov chain, then:
- there exists a unique $\pi$ with $\pi P = \pi$ and $\sum_x \pi_x = 1$;
- for any starting distribution, $\lim_{t \to \infty} a_t = \pi$ (a probability vector), where $a_t = \frac{1}{t} \sum_{i=0}^{t-1} p_i$.
After one step, $b_t := a_t P - a_t = \frac{1}{t} \sum_{i=0}^{t-1} (p_{i+1} - p_i) = \frac{1}{t}(p_t - p_0)$, so $b_t$ satisfies $|b_t|_1 \leq 2/t \to 0$.
Fundamental Theorem of Markov Chains
The $n \times (n + 1)$ matrix $A = [P - I, \mathbf{1}]$ has rank $n$.
The first $n$ columns of $A$ sum to zero (the row sums of $P - I$ vanish), so the $n \times n$ matrix $B$ formed by the last $n$ columns of $A$ is invertible.
From $a_t A = [b_t, 1]$, drop the first entry: $a_t B = [c_t, 1]$, where $c_t$ is $b_t$ with its first entry dropped.
Hence $a_t = [c_t, 1] B^{-1} \to [\mathbf{0}, 1] B^{-1}$ as $t \to \infty$. Let $\pi = [\mathbf{0}, 1] B^{-1}$.
Since each $a_t$ is a probability distribution, the limit $\pi$ is a probability distribution, and $b_t \to 0$ gives $\pi (P - I) = 0$.
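The rank argument suggests a direct way to compute $\pi$: drop one redundant column of $P - I$, append the all-ones column, and solve the resulting invertible system. A minimal sketch with an illustrative 3-state chain:

```python
# Sketch: compute the stationary distribution by solving pi (P - I) = 0
# together with sum(pi) = 1. We drop the first (redundant) column of
# P - I and add the normalization equation, then run Gaussian elimination.
P = [[0.5,  0.5,  0.0],
     [0.25, 0.5,  0.25],
     [0.0,  0.5,  0.5]]
n = len(P)

# Equation for column j of P - I:  sum_i pi_i (P - I)_{ij} = 0, for j = 1..n-1,
# plus the normalization row sum_i pi_i = 1.  A @ pi = b.
A = [[P[i][j] - (1.0 if i == j else 0.0) for i in range(n)] for j in range(1, n)]
A.append([1.0] * n)
b = [0.0] * (n - 1) + [1.0]

# Gaussian elimination with partial pivoting on the n x n system.
for c in range(n):
    piv = max(range(c, n), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    b[c], b[piv] = b[piv], b[c]
    for r in range(c + 1, n):
        f = A[r][c] / A[c][c]
        A[r] = [A[r][k] - f * A[c][k] for k in range(n)]
        b[r] -= f * b[c]

pi = [0.0] * n                 # back substitution
for r in range(n - 1, -1, -1):
    pi[r] = (b[r] - sum(A[r][k] * pi[k] for k in range(r + 1, n))) / A[r][r]

print(pi)                      # satisfies pi P = pi and sums to 1
```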
Intro to ML
Classification problem:
- Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors).
- Classification: come up with a mapping $X \to \{0, 1\}$.
Formalization:
- Assume there is a probability distribution $D$ over $X$.
- $c^*$ = target concept (the set $c^* \subseteq X$ of positive instances).
- Given labeled i.i.d. samples from $D$, produce $h \subseteq X$.
- Goal: have $h$ agree with $c^*$ over the distribution $D$.
- Minimize $\mathrm{err}_D(h) = \Pr_D[h \,\triangle\, c^*]$, called the true (or generalization) error.
Intro to ML
Training error:
- $S$ = labeled samples: pairs $(x, l)$ with $x \in X$, $l \in \{0, 1\}$.
- Training error: $\mathrm{err}_S(h) = |\{(x, l) \in S : h(x) \neq l\}| \,/\, |S|$.
- Overfitting: low training error but high true error.
Hypothesis classes:
- $H$: a collection of subsets of $X$ called hypotheses.
- If $X = \mathbb{R}$, $H$ could be all intervals $[a, b]$ with $a, b \in \mathbb{R}$.
- If $X = \mathbb{R}^d$, $H$ could be linear separators: $\{x : \sum_i w_i x_i \geq w_0\}$ for $w \in \mathbb{R}^d$, $w_0 \in \mathbb{R}$.
- If $|S|$ is large enough (compared to some property of $H$), overfitting doesn't occur.
Overfitting and Uniform Convergence
PAC learning (agnostic): for $\epsilon, \delta > 0$, if $|S| \geq \frac{1}{2\epsilon^2}\left(\ln |H| + \ln \frac{2}{\delta}\right)$ then with probability $1 - \delta$, for all $h \in H$: $|\mathrm{err}_S(h) - \mathrm{err}_D(h)| \leq \epsilon$.
Proof: let $x_i$ be the indicator r.v. that $h$ errs on the $i$-th sample in $S$; then $E[x_i] = \mathrm{err}_D(h)$ and $\mathrm{err}_S(h) = \frac{1}{|S|} \sum_{i=1}^{|S|} x_i$.
Chernoff bound: $\Pr[|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon] \leq 2 e^{-2 \epsilon^2 |S|}$.
Union bound: $\Pr[\exists h \in H : |\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon] \leq 2 |H| e^{-2 \epsilon^2 |S|} \leq \delta$.
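The sample bound is easy to evaluate numerically; the values $d = 50$, $\epsilon = 0.1$, $\delta = 0.05$ below are illustrative:

```python
# Sketch: evaluate the agnostic-PAC sample bound
#   |S| >= (1 / (2 eps^2)) * (ln|H| + ln(2/delta))
# for a finite hypothesis class, passing ln|H| directly.
import math

def sample_bound(log_H, eps, delta):
    """Samples sufficient so every h in H has |err_S - err_D| <= eps w.p. 1 - delta."""
    return math.ceil((log_H + math.log(2.0 / delta)) / (2.0 * eps * eps))

# Disjunctions over d = 50 boolean features: |H| = 2^d, so ln|H| = d ln 2.
print(sample_bound(50 * math.log(2), eps=0.1, delta=0.05))
```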
Examples
Learning disjunctions: $X = \{0,1\}^d$, the target concept is an OR $\vee_{i \in T} x_i$; $|H| = 2^d$, so $|S| = \frac{1}{2\epsilon^2}(d \ln 2 + \ln \frac{2}{\delta})$ samples suffice.
Occam's razor: if the target concept can be described by $b$ bits, then $|H| \leq 2^b$, so $|S| = \frac{1}{2\epsilon^2}(b \ln 2 + \ln \frac{2}{\delta})$ samples suffice.
Learning decision trees: $X = \{0,1\}^d$, $|H|$ = number of trees with $k$ nodes; such a tree can be described with $b = O(k \log d)$ bits.
Online Learning + Perceptron Algorithm
For $t = 1, 2, \dots$: the algorithm is given $x_t \in X$ and asked to predict $l_t$; it is then told $l(x_t)$ and charged if the prediction was wrong.
Linear separator given by $w^*$:
- $w^* \cdot x \geq 1$ on positive examples, $w^* \cdot x \leq -1$ on negative examples.
- $|w^* \cdot x| / |w^*|$ = distance of $x$ to the hyperplane $w^* \cdot x = 0$.
- $\gamma = 1 / |w^*|$ = margin of the separator.
Perceptron Algorithm
Set $w = 0$; then for $t = 1, 2, \dots$: given example $x_t$, predict $\mathrm{sgn}(w \cdot x_t)$.
If a mistake was made, update: if $x_t$ was positive, $w \leftarrow w + x_t$; if $x_t$ was negative, $w \leftarrow w - x_t$.
Thm. The Perceptron makes at most $R^2 |w^*|^2$ mistakes, where $R = \max_t |x_t|$.
Proof: track the invariants $w \cdot w^*$ and $|w|^2$. Each mistake increases $w \cdot w^*$ by at least 1:
- on a positive, $(w + x_t) \cdot w^* = w \cdot w^* + x_t \cdot w^* \geq w \cdot w^* + 1$;
- on a negative, $(w - x_t) \cdot w^* = w \cdot w^* - x_t \cdot w^* \geq w \cdot w^* + 1$.
Perceptron Analysis cont.
On each mistake $|w|^2$ increases by at most $R^2$:
- on a positive, $|w + x_t|^2 = |w|^2 + 2\, w \cdot x_t + |x_t|^2 \leq |w|^2 + R^2$ (a mistake on a positive means $w \cdot x_t \leq 0$);
- on a negative, $|w - x_t|^2 = |w|^2 - 2\, w \cdot x_t + |x_t|^2 \leq |w|^2 + R^2$ (here $w \cdot x_t \geq 0$).
After $M$ mistakes: $w \cdot w^* \geq M$ and $|w|^2 \leq M R^2$.
Since $w \cdot w^* \leq |w| |w^*|$, we have $M \leq |w| |w^*| \leq \sqrt{M R^2}\, |w^*|$, so $\sqrt{M} \leq R |w^*|$ and $M \leq R^2 |w^*|^2$.
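The algorithm and its mistake bound can be sketched as follows; the data set and the separator $w^* = (1, -1)$ are illustrative:

```python
# Sketch: the Perceptron update on linearly separable data; the observed
# mistake count should respect the R^2 |w*|^2 bound. Data are illustrative.

def perceptron(stream):
    """stream: list of (x, label) with label in {+1, -1}; returns (w, mistakes)."""
    w = [0.0] * len(stream[0][0])
    mistakes = 0
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        if y * pred <= 0:                    # mistake (ties count as mistakes)
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes

# Separable with margin 1 by w* = (1, -1): label = sign(x1 - x2).
data = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 1.0), 1), ((1.0, 3.0), -1)]
w, m = perceptron(data * 5)                  # several passes over the data
R2 = max(x[0] ** 2 + x[1] ** 2 for x, _ in data)
print(m, "<=", R2 * 2.0)                     # bound R^2 |w*|^2 with |w*|^2 = 2
```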
Perceptron with noisy data
What if there is no perfect separator?
Hinge loss of $w^*$ on an example $x_t$: $\max(0,\, 1 - w^* \cdot x_t)$ if $x_t$ is positive, $\max(0,\, 1 + w^* \cdot x_t)$ if $x_t$ is negative.
Sample hinge loss $L_{hinge}(w^*, S)$ = sum of the hinge losses over all samples in $S$.
Thm. The number of mistakes of the Perceptron is at most $\min_{w^*} \left( R^2 |w^*|^2 + 2\, L_{hinge}(w^*, S) \right)$.
Proof of noisy perceptron
As before, each mistake increases $|w|^2$ by at most $R^2$.
On a mistaken positive: $(w + x_t) \cdot w^* \geq w \cdot w^* + 1 - L_{hinge}(w^*, x_t)$; symmetrically on a mistaken negative.
In the end: $w \cdot w^* \geq M - L_{hinge}(w^*, S)$, where $M$ is the number of mistakes.
The same argument as before gives $M - L_{hinge}(w^*, S) \leq w \cdot w^* \leq |w| |w^*| \leq \sqrt{M R^2}\, |w^*|$, and solving this inequality for $M$ yields $M \leq R^2 |w^*|^2 + 2\, L_{hinge}(w^*, S)$.
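The hinge-loss quantities in the bound can be sketched as follows; the stream $S$ with a single mislabeled point and the choice $w^* = (1, -1)$ are illustrative:

```python
# Sketch: compute the sample hinge loss of a fixed w* on a non-separable
# stream (one mislabeled point) and evaluate the noisy mistake bound
#   R^2 |w*|^2 + 2 L_hinge(w*, S).

def hinge_loss(w, x, y):
    """Hinge loss of w on example (x, y), with y in {+1, -1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

w_star = (1.0, -1.0)
# The last example is mislabeled: x = (2, 0) should be positive under w*.
S = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((2.0, 0.0), -1)]

L = sum(hinge_loss(w_star, x, y) for x, y in S)
R2 = max(x[0] ** 2 + x[1] ** 2 for x, _ in S)
bound = R2 * 2.0 + 2 * L       # |w*|^2 = 2 for w* = (1, -1)
print(L, bound)                # only the noisy point contributes to L
```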