# String Matching algorithm

![](https://upload-images.jianshu.io/upload_images/7130568-e10dc137e9083a0e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

## Rabin-Karp

We can view a string of k characters (digits) as a length-k decimal number. E.g., the string “31425” corresponds to the decimal number 31,425.

- Given a pattern P [1..m], let p denote the corresponding decimal value.
- Given a text T [1..n], let $t_s$ denote the decimal value of the length-m substring T [(s+1)..(s+m)] for s=0,1,…,(n-m).
- Let `d` be the radix of the number, i.e. the number of distinct characters: `d = len(set(s))` for a string `s` (d = 10 for decimal digits).
- $t_s = p$ iff T [(s+1)..(s+m)] = P [1..m].
- p can be computed in O(m) time: p = P[m] + d\*(P[m-1] + d\*(P[m-2]+…)).
- $t_0$ can similarly be computed in O(m) time.
- The other values $t_1,\ldots,t_{n-m}$ can be computed in O(n-m) time, since $t_{s+1}$ can be computed from $t_s$ in constant time. Namely,

$$t_{s+1} = d \cdot (t_s - d^{m-1} \cdot T[s+1]) + T[s+m+1]$$

For example, with d = 10, m = 3 and T = “31425”, $t_0 = 314$ and $t_1 = 10 \cdot (314 - 100 \cdot 3) + 2 = 142$.

**Problem: this assumes p and $t_s$ are small numbers. They may be too large to work with easily.**

Fortunately, there is no need to calculate $t_{s+1}$ exactly. We can use the modulus operation to reduce the work of calculation. We choose a small prime number q, e.g. q = 13 for radix d = 10. Generally, d\*q should fit within one computer word. We first calculate $t_0 \bmod q$. Then, for every $t_i$ ($i \ge 1$), assume

$$t_{i-1} = T[i+m-1] + d \cdot T[i+m-2] + \ldots + d^{m-1} \cdot T[i]$$

and denote $d' = d^{m-1} \bmod q$; thus,

$$
\begin{aligned}
t_i &= (t_{i-1} - d^{m-1} \cdot T[i]) \cdot d + T[i+m]\\
&\equiv (t_{i-1} - d^{m-1} \cdot T[i]) \cdot d + T[i+m] \pmod q\\
&\equiv (t_{i-1} - (d^{m-1} \bmod q) \cdot T[i]) \cdot d + T[i+m] \pmod q\\
&\equiv (t_{i-1} - d' \cdot T[i]) \cdot d + T[i+m] \pmod q
\end{aligned}
$$

So we can compare the modular value of each $t_i$ with that of p. Only if they are the same do we compare the original characters, namely $T[i+1],T[i+2],\ldots,T[i+m]$, with the pattern, because different strings can have the same value modulo q.

Generally, the expected running time of this algorithm is O(n+m), and the worst case is O((n-m+1)\*m).

## FSM

An FSM (finite state machine) can be represented as (Q,q0,A,S,C), where

- Q is the set of all states
- q0 is the start state
- $A \subseteq Q$ is a set of accepting states.
- S is a finite input alphabet.
- C is the set of transitions, given by the transition function $q_j = c(s,q_i)$: reading symbol s in state $q_i$ moves the machine to state $q_j$.

Given a pattern string S, we can build an FSM for string matching. If S has m characters, the FSM has m+1 states: one is the start state, and the others are the matching states for each position of S.
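
As a rough illustration of the construction just described, the following Python sketch builds the m+1 states for a pattern and then scans a text with them. The names `build_transitions` and `fsm_search`, the dictionary-based table, and the 0-based positions are assumptions of this sketch rather than part of the original notes, and the table is built naively rather than efficiently.

```python
def build_transitions(pattern, alphabet):
    """Return one dict per state 0..m mapping a character to the next state.

    State q means "the last q characters read equal pattern[:q]".
    """
    m = len(pattern)
    table = []
    for q in range(m + 1):
        row = {}
        for ch in alphabet:
            # Next state: the longest pattern prefix that is a suffix
            # of (pattern[:q] + ch). Try the longest candidate first.
            candidate = pattern[:q] + ch
            k = min(m, q + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table


def fsm_search(text, pattern):
    """Return the 0-based start positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Characters outside the pattern can never extend a match,
    # so they are treated as a transition back to state 0.
    table = build_transitions(pattern, set(pattern))
    state = 0
    matches = []
    for i, ch in enumerate(text):
        state = table[state].get(ch, 0)
        if state == m:  # accepting state: the whole pattern was just read
            matches.append(i - m + 1)
    return matches


print(fsm_search("abababa", "aba"))  # [0, 2, 4]
```

Building the table this way costs roughly O(m^3) per alphabet character, which is fine for a demonstration. The scan itself reads each text character exactly once.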
Once we have built the FSM, we can run it on any input string.

## KMP

>Knuth-Morris-Pratt method

The idea is inspired by FSM. We can avoid computing the transition functions. Instead, we compute a prefix function `Next` on P in O(m) time, and Next has only m entries.

> The prefix function stores info about how the pattern matches against shifts of itself.

- String w is a prefix of string x, if x=wy for some string y
- String w is a suffix of string x, if x=yw for some string y
- The k-character prefix of the pattern P [1..m] is denoted by $P_k$.
- Given that pattern prefix P [1..q] matches text characters T [(s+1)..(s+q)], what is the least shift s' > s such that P [1..k] = T [(s'+1)..(s'+k)] where s'+k=s+q?
- At the new shift s', there is no need to compare the first k characters of P with the corresponding characters of T.

Method: For prefix $P_i$, find the longest proper prefix of $P_i$ that is also a suffix of $P_i$.

next[q] = max{k|k\