全文検索のためのデータ構造と構成の効率について

全文検索のためのデータ構造と構成の効率について
定兼邦彦東京大学理学系研究科情報科学専攻

内容全文検索のためのデータ構造の比較検索時間ディスク容量更新時間検索精度

背景電子化された文書の普及大量のテキストから高速に検索したい WWW, メール新聞, 辞書, 書籍ゲノムデータベース
もれがないようにしたい必要なもののみ欲しい

全文検索のアルゴリズム sequential search signature file [Moders 49]
各文書がどのキーワードを含むか inverted file [Bleir 67] 各キーワードがどこにあるか digital tree (trie) 任意のキーワード

Inverted fileのデータ構造キーワードごとに出現文書，位置を記憶 sorted array prefix B-tree trie
キーワードの出現位置のリスト prefix B-tree 更新が簡単 trie prefixをコンパクトに表現

Word indexes vs. Full-text indexes
決まったキーワードのみサイズが小さい構成が早いデータ構造 sorted array prefix B-tree trie 任意のキーワードサイズが大きい構成が遅いデータ構造 suffix array String B-tree suffix tree

Full-text indexのデータ構造
suffix tree [Weiner 73] suffix array [Manber, Myers 93] String B-tree[Ferragina, Grossi 95]

Suffix tree a b $ abab$ 文字列の全てのsuffix(接尾辞)を表すcompacted trie
メモリ上では線形時間で構成可能サイズが大きい unbalanced abab$ a b $

Suffix array 文字列の全てのsuffixのポインタを辞書順にソートした配列省スペース(5N) 更新が遅い abab$ 1
2 ab$ 3 b$ 4

String B-tree abab$ suffixのポインタをB-treeで表したもの
検索時のdiskアクセスが少ない(blind tree) 最悪時の性能が良い挿入、削除が容易サイズ: 13N １から作るのは遅い abab$

I/O complexity 検索のI/O complexity 更新のI/O complexity 構成のI/O complexity

検索のI/O complexity Suffix tree String B-tree Suffix array Nに依存しない
p : キーワード長 occ : 答えの数 N : 文字列長

更新のI/O complexity Suffix tree String B-tree Suffix array Nに依存しない
p : キーワード長 N : 文字列長 B : ディスクページサイズ

構成のI/O complexity suffix tree (optimal) suffix array String B-tree
N : 文字列長 M : メモリサイズ B : ディスクページサイズ

構成アルゴリズム Suffix treeの構成メモリ上ディスク上 Suffix arrayの構成

Suffix treeの構成メモリ上 (線形時間) ディスク上 Weiner 73 McCreight 76 Ukkonen 95
Farach 97 divide and conquer, batch処理ディスク上 Farach, Ferragina, Muthukrishnan 98

Disk上でのsuffix tree構成 I/O アルゴリズムをsortingとscanで表現
数のsortingと同じI/O complexity (optimal) I/O

Sorting I/O complexity
treeのノードのlcaをK個 (K個のrange minima) tree T のEuler Tour ET(T)とノードの深さ文字列中の任意の位置のK文字 treeの各ノードの子孫でmarkされているもの uncompacted trieのmerge suffix treeの全てのsuffix link suffix treeの構成

Block vs. Random I/O 2-way merge M/B-way merge block I/O random I/O
回のsorting 回のI/Oを必要とする。アルゴリズムは

Disk上のsuffix treeの問題点
treeをたどる際にrandom accessが生じる古典的なアルゴリズムは適さない　(divide and conquerを用いる)

Algorithm outline b a b a $ $ Even tree Odd tree $ a b $ Odd treeを作る
mergeする a b $ $

Building the odd tree $ A a b $ abab$ AA$ $
新しい文字列のsuffix treeを再帰的に作る文字を元に戻す $ A a b $ abab$ AA$ $

Building the even tree b a b a abab$ $ $ Even tree
偶数番目のsuffixを辞書順にradix sortする (先頭の文字, 奇数番目のsuffixの辞書順) 隣り合うsuffix間のlcpを求める compacted trieを作る b $ a Even tree a b $ 1 2 3 2: (b,ab$) = (b,2) 4: (b,$) = (b,1) abab$ 2 4

Merging the odd and even trees
anchor pairを見つける side tree pairに分割する pull nodeを見つける merge nodeを見つける TeとToをmergeする

Suffix arrayのメモリ上での構成
quick sort 文字列の比較なので非常に遅い ternary partitioning[Bentley, Sedgewick 97] 無駄な文字列比較が少ない極端に遅くなることがある doubling algorithm Manber, Myers 93 Sadakane, Imai 98 多くの場合最速

Doubling algorithm Karp, Miller, Rosenberg 72
ディスク上の文字列ソート[Arge et al. 97] 長さ 1, 2, 4, … の部分文字列を数値に変換 log n 回の比較で文字列を区別できる

Suffix sorting by doubling (1/5)
グループに番号をつける各グループの中をsuffixの2文字目で分ける番号を更新 (番号が異なる　先頭の2文字が異なる) 各グループの中をsuffixの3,4文字目で分けるグループの番号でソート全てのsuffixの順序が決まるまで繰り返す順序の決まっているグループはskipする

3 3 6 1 10 11 1 6 11 6 V[I[i]+1] V[I[i]] 1 3 5 6 10 11 I[i] 13 2 11 3 12 6 1 4 7 10 5 8 9 ＄ｂｅｏｒｂｅ＄ｅｏｒｎｅ＄ｎｏｔｔｏｂｅｏｏｒｎｏｏｔｔｏｏｂｅ＄ｒｎｏｔｔｏｂｅｔｔｏｂｔｏｂｅｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の２文字でソート

V[I[i]+2] 8 4 3 1 1 V[I[i]] 1 3 4 6 8 9 11 13 5 10 I[i] 13 2 11 12 3 6 1 10 4 7 5 9 8 ＄ｂｅｏｒｂｅ＄ｅ＄ｅｏｒｎｎｏｔｔｏｂｅｏｏｂｅ＄ｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｔｏｂｅｔｔｏｂｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の４文字でソート

V[I[i]+4] 8 V[I[i]] 1 2 3 4 6 7 8 9 11 13 5 10 I[i] 13 11 2 12 3 6 10 1 4 7 5 9 8 ＄ｂｅ＄ｂｅｏｒｅ＄ｅｏｒｎｎｏｔｔｏｂｅ＄ｏｂｅｏｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｏｒｔｏｂｅ＄ｔｔｏｂｔｏｂｅｏｒｎｏｔｔｏｂｅ$ 先頭の８文字でソート

V[I[i]] 1 2 3 4 6 7 8 9 11 12 13 5 10 I[i] 13 11 2 12 3 6 10 1 4 7 5 9 8 ＄ｂｅ＄ｂｅｏｒｅ＄ｅｏｒｎｎｏｔｔｏｂｅ＄ｏｂｅｏｏｒｎｏｏｔｔｏｒｎｏｔｔｏｂｅｏｒｔｏｂｅ＄ｔｔｏｂｔｏｂｅｏｒｎｏｔｔｏｂｅ$ ソート終了

Suffix arrayのディスク上での構成
Gonnet, Baeza-Yates, Snider 92 diskはsequential accessのみ Crauser, Ferragina 98 doubling algorithm + discarding I/O I/O

Doubling algorithm + discarding
回の反復 M/B-way マージソートを用いるメモリ内と異なる点すでにソートされている部分はスキップ

Word indexes vs. Full-text indexes

網羅性 word indexes full-text indexes 単語の先頭のみ検索もれの可能性全ての部分文字列
データ量は約1/ (日本語, 英語とも) 検索もれの可能性形態素解析が必要 DNA配列には使えない全ての部分文字列長いものを見つけるのが得意検索結果にごみが入る京都のつもりが東京都ルパンのつもりがダブルパンチはらだのつもりがはらだたしい AND検索で回避?

Full-text indexの利点・欠点
検索結果は文字列へのポインタポインタから文書番号への変換が必要超高速grepとして利用できるサイズが大きい Full-text indexからword indexは構成可能テキストを走査する必要の無いindexに印をつける indexを走査し、印のついているものを削除

課題検索結果のごみをなくすシソーラスの利用構造化された文書からの検索データの収集速度 AND検索? OR検索見出しのみから検索など
元の文書を圧縮して送る word indexだけ送る

圧縮と検索の統合 Block sorting圧縮法[Burrows, Wheeler 94]
suffix arrayに従い文字列を並べ替えてから圧縮伸張時に文字列とsuffix arrayが復元されるテキストを転送する際はBlock sortingで圧縮しておけば良い [Sadakane, Imai 98a]

謝辞貴重なコメントをくださったNTTの原田昌紀氏、中村隆幸氏に感謝いたします。

参考文献(1/3) [1] L.Arge, P.Ferragina, R.Grossi, and J.S. Vitter. On sorting strings in external memory. In ACM Symposium on Theory of Computing, pp , 1997. [2] J.L. Bentley and R.Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp , 1997. [3] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithms. Technical Report 124, Digital SRC Research Report, 1994. [4] A.Crauser and P.Ferragina. External memory construction of full-text indexes. In DIMACS Workshop on External Memory Algorithms and/or Visualization, 1998. [5] M.Farach. Optimal Suffix Tree Construction with Large Alphabets. In 38th Symp. on Foundations of Computer Science, pp , 1997. URL URL URL URL URL

参考文献(2/3) [6] P.Ferragina and R.Grossi. The String B-Tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 1998. [7] G.H. Gonnet, R.Baeza-Yates, and T.Snider. New Indices for Text: PAT trees and PAT arrays. In W.Frakes and R.Baeza-Yates, editors, Information Retrieval: Algorithms and Data Structures, chapter5, pp Prentice-Hall, 1992. [8] R.M. Karp, R.E. Miller, and A.L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pp , 1972. [9] U.Manber and G.Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, Vol.22, No.5, pp , October 1993. URL URL

参考文献(3/3) [10] E.M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, Vol.23, No.12, pp , 1976. [11] K.Sadakane and H.Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Proceedings of NewDB’98, 1998. [12] K.Sadakane and H.Imai. Constructing Suffix Arrays of Large Texts. In Proceedings of DEWS'98, 1998. [13] E.Ukkonen. On-line construction of suffix trees. Algorithmica, Vol.14, No.3, pp , September 1995. [14] P.Weiner. Linear Pattern Matching Algorihms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp , 1973. URL URL

全文検索のためのデータ構造と構成の効率について

Similar presentations

Presentation on theme: "全文検索のためのデータ構造と構成の効率について"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

全文検索のためのデータ構造と 構成の効率について

Similar presentations

Presentation on theme: "全文検索のためのデータ構造と 構成の効率について"— Presentation transcript:

Similar presentations

About project

フィードバック

全文検索のためのデータ構造と構成の効率について

Presentation on theme: "全文検索のためのデータ構造と構成の効率について"— Presentation transcript: