IIR輪講復習 #10 XML retrieval

IIR輪講復習 #10 XML retrieval

お知らせたつをさんによる補足情報復習資料おきば http://chalow.net/clsearch.cgi?cat=IIR

参考本資料は書籍の輪読会に向けたサマリ本資料内で一部上記ドキュメント, スライドからの引用あり

10章のテーマ構造化されたドキュメント(XML) に対する検索構造を意識する場合の問題
Vector space model を XML 検索に拡張 XML 検索の定量的評価方法

structured retrieval XML retrieval = structured retrieval
Parametric and zone search (6.1) はフラット XML retrieval はネストする

unstructured と structured の違い
unstructured retrieval 検索対象の単位がドキュメント structured retrieval 検索対象の単位がドキュメント全体より小さい XML ツリーのサブツリー

DOM The XML document

XPath XML Document を path でアクセス no element! act/scene play//scene
/play/title /play//title /scine/title title#Macbeth

XML context XPath の path = "XML context"

NEXI "Narrowed Extended XPath I"

Challanges in XML retrieval

1. ユーザーが返して欲しい部分木はどれ? "Shakespeare's plays for Macbeth's castle" play?
act? scene? どの木のどのレベルがクエリに対して答えるのに適切か、というのを決めるのは困難

2. ドキュメントのどこをインデックス? 異なる複数のアプローチそれぞれに欠点ありノードを重複しない擬似的なドキュメントへとまとめる
最も大きな要素を一つだけ扱い、部分木を検索結果への後処理で見つける (top down) bottom up 全要素をインデックスそれぞれに欠点あり

3. 語の統計を計算する時の問題 idf 計算 "author/Gates" と "gate の複数系 gates" を併せて df を計算しても意味がない別々に扱うと data sparseness 問題

4. Schema の異種性、多種性 q3 に対して d2, d3 は relevant だがマッチせず

Vector space model for XML retrieval

方針制約に対して完璧にマッチしない要素は低くランクさせる。検索結果から除外はしないインデックス単位は structured term
すべてのクエリを extended query として解釈するベクトル空間モデルに context resemblance 関数 CR を導入する

structured term XML context / term pair ... <c,t>
※ これだけ lexicalized subtree だが structured term ではない

クエリを extended query として解釈

context resemblance function
Cq クエリにおける path Cd ドキュメントにおける path |Cq| クエリ path 上のノード数 |Cd| ドキュメント path 上のノード数 Cq が Cd にマッチするのは別なノードを挿入することで Cq を Cd へと変形することができるとき

CR の計算例 CR(Cq4, Cd2) = 3 / 4 = 0.75 CR(Cq4, Cd3) = 3 / 5 = 0.6

Cosine Similarity に CR を加味
V ... Vocabulary of nonstructural terms B ... set of all XML contexts weight() ... Chapter 6 での idft・wft,d のような関数

SimNoMerge の実際 d9 の類似度 = 1.0 * 0.2 + 0.63 * 0.6 = 0.578
クエリに <c2,t> 以降もあるならそれらを加算 (Σ)

SimNoMerge の実装クエリを structured terms に分割したものそれぞれに対しループ
全XMLコンテキストそれぞれに対しループ CR が 0 より大きいすなわち Cq を Cd へと変換できるとき内積計算 (score[] が accumurator)

SimMerge SimMerge SimNoMerge の改良版 SimNoMerge のマッチ条件を緩くしたもの
詳しくは書籍内の参考文献

Evaluation of XML retrieval

INEX "INitiative for the Evaluation for XML retrieval"
リファレンスコレクション、クエリ集合、関連性についての判定などを作成する協同的な取り組み

XML retrieval の評価 relevant / non-relevant の二値判定では検索結果の構造を評価できない
1. 構造 2. 内容の2軸で評価する CAS (content-and-structure)

component coverage

topical relevance highly relevant (3) fairly relevant (2)
marginally relevant (1) nonrelevant (0)

relevance - coverage combinations

XML retrieval の評価 Q の値を recall, precision, F 値の算出に利用する
relevant - non-relevan で 0 or 1 の binary だった箇所に Q を使う

おおまかな傾向 unstructured query → structured query では recall が下がり precision が上がる

IIR輪講復習 #10 XML retrieval

Similar presentations

Presentation on theme: "IIR輪講復習 #10 XML retrieval"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

IIR輪講復習 #10 XML retrieval

Similar presentations

Presentation on theme: "IIR輪講復習 #10 XML retrieval"— Presentation transcript:

Similar presentations

About project

フィードバック