Proceedings of the Third South American Workshop on String Processing.

Proceedings of the Third South American Workshop on String Processing.
2019/1/11 Lempel-Ziv Parsing and Sublinear-Size Index Structures for String Matching Juha Karkkainen Esko Ukkonen Proceedings of the Third South American Workshop on String Processing. WSP 1996, , 1996 竹田研究室修士課程1年喜田拓也

研究背景 9 文字列照合問題（パターンマッチング） existence problem all-occurrences problem
きだくんのおかあさんがりあかーをひいているさんがりあ

研究背景 KMP O(m+n) m n BM O(m+n) 文字列照合問題（パターンマッチング） existence problem
all-occurrences problem existence problem all-occurrences problem existence problem all-occurrences problem KMP O(m+n) m n BM O(m+n)

研究背景巨大なテキストに対する文字列照合問題 Index を作ることで劇的に速くなる.

研究背景 O（n）現在知られている index 構造 suffix tree suffix array Over 10n bytes
level-compressed trie suffix cactus suffix binary search tree O（n） Over 10n bytes 5n bytes About 11n bytes 9n bytes About 10n bytes

O(n / log n) 研究背景新しい index 構造が必要初の sublinear-size index より小さい
前処理をしない場合よりも検索が高速初の sublinear-size index LZ parsing を用いるこれがミソ O(n / log n)

LZ parsing Lempel-Ziv 圧縮 LZ parsing = LZ76 ? LZ77 （移動辞書式圧縮）
LZSS, LZH,LZR,LZB LZ78 （登録辞書式圧縮） LZW,LZC,LZT,LZFG LZ parsing = LZ76 ?

LZ parsing ... LZ parse Z of string T の定義 Z （P1,L1,C1）（Pi,Li,Ci）
= （P1,L1,C1）（Pi,Li,Ci）（PN,LN,CN） ... Pi, Li 負でない整数 Ci Σ上の記号

Pi < Ui , whenever Li > 0 T［Ui…Ui+Li-1］ = T［Pi…Pi+Li-1］,
LZ parsing LZ parse Z of string T の定義 Z = （P1,L1,C1）（Pi,Li,Ci）（PN,LN,CN） ... = 1, U1 Ui = Ui -1 + Li for i > 1. Pi < Ui , whenever Li > 0 For 1 ≦ i ≦ N T［Ui…Ui+Li-1］ = T［Pi…Pi+Li-1］, T［Ui+Li］ = Ci . T［Pi...］と T［Ui...］の common prefix が最大となる Pi < Ui を選ぶ.

LZ parsing aaaabbaaababaaabb a b a a b $ LZ parse の例 T 注意 Z （0,0,a）
2019/1/11 LZ parsing LZ parse の例 T = aaaabbaaababaaabb a b a a b $ 注意 Z = （0,0,a）（1,3,b）（5,1,a）（3,3,a）（6,5,b）

LZ parsing aaaabbaaababaaabb a b a a b LZ parse の例 block T definition
2019/1/11 LZ parsing LZ parse の例 block T = aaaabbaaababaaabb a b a a b definition phrase boundary symbol Z = （0,0,a）（1,3,b）（5,1,a）（3,3,a）（6,5,b）

LZ parsing aaaabbaaababaaabb a b a a b LZ parse の例 T U1 U5 U4 U3 U2 =
2019/1/11 LZ parsing LZ parse の例 T = aaaabbaaababaaabb a b a a b U1 U5 U4 U3 U2 = 1 2 6 8 12 Z = （0,0,a）（1,3,b）（5,1,a）（3,3,a）（6,5,b）

LZ parsing parse Z の特性どの二つの block も同一ではない

LZ parsing parse Z の特性どの二つの block も同一ではない
パターンP の最初の occurrence は, 必ず Z の boundary symbol を含む

LZ parsing n O（n / log n） parse Z のサイズのオーダー
N を長さ n のテキストの LZ parse における block の個数とする. また c = |Σ| とする. N < n logc n - logc logc n - 1 c-1 O（n / log n）

|Σ| LZ parsing for text length n = 300000. English DNA random 77 4 2 8
64 40964 31633 16734 14755 88748 type N |Σ|

Oh! Let’s go

T における Pの occurrence が boundary symbol を含む.
検索アルゴリズムサーチを2種類に分類する. これがミソ T における Pの occurrence が boundary symbol を含む. Primary occurrence それ以外 Secondary occurrence

検索アルゴリズム Occurrence を2種類に分類する. Primary search Existence problem
これがミソ Primary search Existence problem Primary occurrence Secondary search Secondary occurrence

P の primary occurrence を見つけ出す.
Primary Search 問題の定義 P の primary occurrence を見つけ出す. T ［Ui + Li］ = Ci 1 ･･･ u ･･････ m Reference pair （i , u）

P の primary occurrence を見つけ出す.
Primary Search 問題の定義 P の primary occurrence を見つけ出す. 1 ･･･ u ･････････ u′･･ m Canonical reference pair （i , u）

Primary Search 問題の定義 P の primary occurrence を見つけ出す. O（mN）
（i,u）が primary occurrence の canonical reference pair であるような i を見つけ出す.

Primary Search Primary Search を解くための工夫 1 ･･･ u ･･････ m
T ［Ui + Li - u + 1 … Ui + Li ］ = P［1…u］ T ［Ui + Li … Ui + Li - u + m］ = P［u…m］

Primary Search Primary Search を解くための工夫 1 ･･･ u ･･････ m

i Primary Search Two-Dimensional Prefix Matching(2DPM)
｛（Xi , Yi）｝：文字列の組の集合（X , Y）：質問文字列 i X が Xi の prefix, Y が Yi の prefix. （Xi , Yi）愛?

Xi = T［Ui…Ui+Li］R = T［Ui…Ui+1 -1］R
2019/1/11 Primary Search Two-Dimensional Prefix Matching(2DPM) ｛（Xi , Yi）｝：文字列の組の集合（X , Y）：質問文字列 Xi = T［Ui…Ui+Li］R = T［Ui…Ui+1 -1］R Yi = T［Ui+Li...］ = T［Ui ］ X = P［1…u］R Y = P［u…m］

Primary Search 2DPMの例 1 ･･･ u ･･････ m 1 block

Primary Search 問題の定義 P の primary occurrence を見つけ出す.
2-dimensional prefix matching を解く. 2-dimensional range query を解く.

i Primary Search 質問区間に含まれる点（xi , yi）を見つける. 2-dimensional range query
2019/1/11 Primary Search 2-dimensional range query ｛（xi , yi）｝：点の集合（［xl , xr］ , ［yl , yr］）：質問区間. 質問区間に含まれる点（xi , yi）を見つける. i 愛?

Primary Search 2-dimensional range query ｛（xi , yi）｝：点の集合
（［xl , xr］ , ［yl , yr］）：質問区間. xl xr yl yr

Primary Search 2-dimensional range query ｛（xi , yi）｝：点の集合
（［xl , xr］ , ［yl , yr］）：質問区間. これがミソ辞書式順序を用いる. xl = X xr = Xzzzzz... yl = Y yr = Yzzzzz... xi = Xi , yi = Yi

Primary Search Y 2-dimensional range query X （a,aaaabb...）
aba abazzzz... ab abzzzz... X Y （a,aaaabb...）（baaa,bbaaa...）（ab,aaabab...）（abaa,abaaa...）

N 点の balanced 2-d tree. （Lee, Wong 1977）
Primary Search 2-dimensional range query 2DPM N 点の balanced 2-d tree. （Lee, Wong 1977） O（ N + l） 2-d tree for string. O（m N + l） O（m + N + l）

Primary Search Y 2-d tree の例 X a X 値で判別 b c Y 値で判別 d g e f g e c b f d
(x1 , y1) (x2 , y2) (x3 , y3) (x4 , y4) (x5 , y5) (x7 , y7) (x6 , y6) X 値で判別 Y 値で判別 X Y g e c b f d a

T ［Ui + Li - u + 1 … Ui + Li - u + m］ = P となるような canonical な i を見つける.
Primary Search T ［Ui + Li - u + 1 … Ui + Li - u + m］ = P となるような canonical な i を見つける. 1 ･･･ u ･･････ m O（m + N + l） O（m2 + m N + L）

T ［Ui + Li - u + 1 … Ui + Li - u + m］ = P となるような canonical な i を見つける.
Primary Search T ［Ui + Li - u + 1 … Ui + Li - u + m］ = P となるような canonical な i を見つける. 1 ･･･ u ･･････ m O（m2 + m N + L） O（m2 +m log N + Nm log m + L）

Ok? No!

Secondary Search 問題の定義 Secondary occurrence Known occurrence
これがミソ Secondary occurrence

Secondary Search 問題の定義 Known occurrence Secondary occurrence
（Pi , Li , Ci） Pi Pi+Li-1

Known occurrence O = T［v…v + m - 1］ T［Ui - Pi + v…Ui -Pi + v + m - 1］
Secondary Search 問題の定義 T［v…v + m - 1］ Secondary occurrence Known occurrence O = T［v…v + m - 1］ Pi ≦ v, Pi + Li-1≧ v + m - 1 となるすべての i をみつける. O（N） Secondary occurrence T［Ui - Pi + v…Ui -Pi + v + m - 1］

i Secondary Search xi ≦ x, yi ≧ y となるようなすべての区間［xi , yi］を見つける.
Interval containment problem ｛［xi , yi］｝i=1,…,N : 区間の集合［x , y］ : 質問区間 xi ≦ x, yi ≧ y となるようなすべての区間［xi , yi］を見つける. i 愛?

Secondary Search xi = Pi , yi = Pi + Li - 1 x = v , y = v + m - 1
問題の定義 Secondary occurrence O = T［v…v + m - 1］ T［Pi］ T［Pi+Li-1］ xi = Pi , yi = Pi + Li - 1 x = v , y = v + m - 1 offset oi = Ui - Pi T［ x + oi … y + oi ］

トータルでは, すべての occurrenceに対して interval containment problem が行われるので
Secondary Search Interval containment problem Priority search tree （McCreight 1985） O（log N + l） 2-d heap O（log N + l log N）トータルでは, すべての occurrenceに対して interval containment problem が行われるので O（L log N）

Secondary Search binary tree の構造を持つ. 2-d heap 各ノードは区間のうちの1つを中に持つ.
区間［xi , yi］を持つノードの… 左の subtree のすべてのノードが持つ区間の y の値は yi より小さい. 右の subtree のすべてのノードが持つ区間の x の値は xi より大きい.

Secondary Search 2-d heap の例［2,5］［3,6］［4,5］［5,7］［1,7］［4,8］［6,9］
10 11 2 4 5 6 7 3 ［1,7］［4,8］［6,9］［9,12］［1,9］［7,12］［6,11］ 11 9 7 8 5 6 6 7 9

Secondary Search 2-d heap の例 O（log N + l log N）［2,5］［3,6］［4,5］
1 ［2,5］［3,6］［4,5］［5,7］ 8 9 10 11 2 4 5 6 7 3 ［1,7］［4,8］［6,9］［9,12］［1,9］［7,12］［6,11］［2,3］［2,3］［2,3］［2,3］ O（log N + l log N）［2,3］［2,3］［2,3］ 1 2 3 4 5 6 7 8 9 10 11 12

I saw every thing Mmmmm...

検索アルゴリズムのまとめ parse Z により,サーチを 2 part に分割
primary search -> 2DPM -> 2-dimensional range query 2-d tree for string. O（N）領域 search time O(m2 + m log N + Nm log m ) secondary search > interval containment problem 2-d heap. O（N）領域 search time O(L log N)

さいごに 18 N 新しい Index 構造の構成 bytes LZ parse Z O（N）
本当に小さいのか？ 18 N bytes 1 2 新しい Index 構造の構成 LZ parse Z O（N） 2-d tree for string O（N） 2-d heap と offset O（N） Suffix array = 5n bytes n > 103 小さい

さいごに本当に小さいのか？他の index 構造よりも一般的に小さくなるだろう. 検索時間の practical な評価は？謎

さいごに一番の謎本当に小さいのか？検索時間の practical な評価は？実現可能か？
他の index 構造よりも一般的に小さくなるだろう. 検索時間の practical な評価は？実現可能か？謎一番の謎これがミソ

Proceedings of the Third South American Workshop on String Processing.

Similar presentations

Presentation on theme: "Proceedings of the Third South American Workshop on String Processing."— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

Proceedings of the Third South American Workshop on String Processing.

Similar presentations

Presentation on theme: "Proceedings of the Third South American Workshop on String Processing."— Presentation transcript:

Similar presentations

About project

フィードバック