IIR輪講復習 #5 Index compression

IIR輪講復習 #5 Index compression

お知らせたつをさんによる補足情報復習資料おきば http://chalow.net/clsearch.cgi?cat=IIR

参考本資料は書籍の輪読会に向けたサマリ資料内で一部上記ドキュメントからの引用あり

5章のテーマインデックスの圧縮辞書 (dictionary) の圧縮 posting files の圧縮

インデックスの圧縮利点留意点キャッシュヒット率上昇ディスクからの転送時間を短縮
あくまで速度向上のため。圧縮率が高くても速度が遅いアルゴリズムは考慮しない。

前処理による圧縮率 "lossy" な圧縮 (5章で扱うのは lossless) 各処理(行)でどこ(列)が小さくなるかに着目
stop words の効果は圧縮と被る言語によっては効果の高い操作が異なる

Heap's Law 単語数 T に対する vocabulary (distinct terms) M の数 ( M = uniq(T) ) の統計

Heap's Law から分かること対数対数で線形辞書 (dictionary) の圧縮は重要
辞書サイズは term が増えると増加し続けるドキュメントが増えると辞書が大きくなる大きな collection にはそれに見合うだけの大きな辞書が絶対に必要辞書 (dictionary) の圧縮は重要

Zipf's law i 番目に大きい要素が全体に占める割合が 1 / i に比例する経験則 (ja.wikipedia.org) 冪乗法則

辞書 (dictionary) の圧縮

dictionary storage を圧縮する
メモリに載せたい postings storage に比べると小さい、載る大規模環境では分散して保持ディスクの seek を減らしたい携帯端末など

代表的な dictionary storage の構造
dictionary, fixed-width storage dictionary-as-a-string storage blocked storage blocked storage with front-coding minimal perfect hash

fixed-width array な辞書 M * (20 + 4 + 4) = 400,000 * 28 = 11.2mb
平均的な英単語 8bytes bytes 無駄 20bytes 以上の単語を保持できない

Dictionary-as-a-string
単語平均 8 bytes 400,000 * 8 = 3.2 * 10^6箇所 = 22bits or 3bytes long. 単語を全部つなげて境界のポインタを保持バイナリサーチ可能単語文字数分のみ = 無駄な 12bytes を削減 400,000 * ( ) = 7.6mb 11.2mb → 7.6mb

Blocked storage ポインタを k 語ごとにして間を省略 string側に文字列の長さを保持
k = 4 の場合 0.5mb の節約ポインタを省略した分で (k - 1) * 3 bytes 節約 k 毎に長さ(int 4bytes)を保存 400,000 * ¼ * 5 = 0.5 mb

ただし k を増やしすぎると... 間隔が長すぎると二分探索に悪影響

front coding "Extreme Compression" 共通した prefix をまとめる更に 1.2mb 節約

dictionary 圧縮の結果

posting files の圧縮

Reuters-RCV 1 800,000 documents 200 tokens / document
6 characters / token 100,000,000 postings

圧縮しない場合 docID に必要なビット数 ... 32bits postings files の合計サイズ
100,000,000 * 32/8 = 400mb

32bits → 20bits docID に必要なビット数 ... 20bits postings files の合計サイズ
ドキュメント総数 800,000 → log2 800,000 postings files の合計サイズ 100,000,000 * 20/8 = 250mb

"gap" で表現して短くする docID は 20 bit、これをもっと小さく実際に必要なのは "gap"
Brutus: 33, 47, 154, 159, gap は 20bit より小さい bitで表現できる 17

Variable byte codes (byteレベル)
768 + 56 5 214577 8bit → continuation bit (1) + payload (7) 数値が可変長なので最小で済む 250 mb → 116mb (50% 減!) VB encoding / decoding の方法 → Figure 5.8

γ codes (bit レベル) gap G を <length, offset> ペアで表現
length は unary codes log2G +1 offset = G - 2log2G in binary.

γ codes の例: 13 00001101 1101 → 1101 (offset length = 3)
unary(3) <1110, 101> デコード 0発見 → unary で 111 = 3bit 読む且つ, 暗黙の4bit目1 を考慮 = 1101 = 13

γ codes の例 13 = <1110, 101> 9 = <1110, 001>
13 = <1110, 101> 9 = <1110, 001> 2 = <10, 1> postings [ 2, 9, 13] →

γ codes が何故良いか zipf's law

postings 圧縮の効果

まとめインデックスの圧縮、と一言に行ってもインデックスの何を圧縮するかで手法が異なる dictionary の圧縮はデータ構造の工夫
postings file の圧縮は数値表現の工夫汎用的な手法から、ドメイン特化したものまで圧縮重要

IIR輪講復習 #5 Index compression

Similar presentations

Presentation on theme: "IIR輪講復習 #5 Index compression"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

IIR輪講復習 #5 Index compression

Similar presentations

Presentation on theme: "IIR輪講復習 #5 Index compression"— Presentation transcript:

Similar presentations

About project

フィードバック