文字列照合を用いた XMLデータアクセス機構の提案

文字列照合を用いた XMLデータアクセス機構の提案
喜田拓也*　宮本哲†　竹田正幸† *九州大学附属図書館研究開発室 †九州大学システム情報科学府情報理学専攻発表者：　喜田拓也

発表内容研究の目的提案アルゴリズム XPathのサブセットまとめ既存の手法我々のアプローチ誤検出を回避する方法
文字列照合による処理の利点と問題点提案アルゴリズム誤検出を回避する方法パスを考慮した照合処理実験結果 XPathのサブセットまとめ情報処理学会九州支部若手の会セミナー 2002

既存の手法 XML文書メモリＸＭＬパーサープログラム DOM API XML文書 person name first last
Makiko Tanaka person “” person/ name person/ name/first Makiko person/ name/last Tanaka … DOM API プログラム XML文書情報処理学会九州支部若手の会セミナー 2002

我々のアプローチ XML文書メモリ文字列照合アルゴリズムプログラム XML文書 <person> <name>
<first> Makiko </first> <last> Tanaka </last> </name> </person> プログラム XML文書情報処理学会九州支部若手の会セミナー 2002

利点処理が高速少ないメモリで可様々な応用 XML文書木構造巨大なXML文書や大量の文書群を一括に処理複数の質問を同時に処理
情報処理学会九州支部若手の会セミナー 2002

Knuth-Morris-Pratt (1974)
文字列照合問題パタン matching テキスト Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: Speeding up pattern matching by text compression. From the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up the pattern matching since an extra work is needed to keep track of compression mechanism. Knuth-Morris-Pratt (1974) Boyer-Moore (1977) Aho-Corasick (1975) Shift-Or (1992) 情報処理学会九州支部若手の会セミナー 2002

Aho-Corasick(AC)照合機械
パタン集合：={other, <mother>} o t h e r other 1 2 3 4 5 任意の文字 < m o t h e r > 14 6 7 8 9 10 11 12 13 other <mother> goto遷移 failure遷移情報処理学会九州支部若手の会セミナー 2002

問題点タグ名の一部分とマッチする誤った検出 <body> <h1>あのTVCM</h1>
<p> <mother> mother </mother> mを取ったらother、 <other> 他人 </other> です． </p> </body> 誤った検出情報処理学会九州支部若手の会セミナー 2002

解決策 < > r o t h e m < > 13 6 7 8 9 10 11 12 1 2 3 4 5
> 以外の文字 < 以外の文字 r o t h e m < > 13 6 7 8 9 10 11 12 1 2 3 4 5 other <mother> > 以外の文字 < 以外の文字 15 14 情報処理学会九州支部若手の会セミナー 2002

PMM構築方法 < > > < r o t h e m < > 14 15 15 14 13 6 7 8
> 以外の文字 < 以外の文字 14 15 > 以外の文字 > < < 以外の文字 15 14 r o t h e m < > 13 6 7 8 9 10 11 12 1 2 3 4 5 other <mother> 情報処理学会九州支部若手の会セミナー 2002

パスを考慮した照合苗字が「Tanaka」の人を探したい！
<person><name><last>がTanaka （Xpathだと，要素 //person/name/last/ が「Tanaka」）情報処理学会九州支部若手の会セミナー 2002

アイデアスタック (<last>,2) <person> (<name>,1) 1
(<xml>,0) スタック (<last>,2) 1 3 <person> <name> 2 <last> 以外のタグ r o t h e m < > 13 6 7 8 9 10 11 12 1 2 3 4 5 other <mother> > 以外の文字 < 以外の文字 15 14 ={<person>,</person>,<name>,</name>,<last>,</last>,…} ={Tanaka} 情報処理学会九州支部若手の会セミナー 2002

実験結果 Sgrep(J. Jaakkola and P. Kilpeläinen)との比較パタン Sgrep 38.44 37.02
//text/"summers" //test//"summers" /site/regions/africa/item/location/ "United_States" Sgrep 38.44 37.02 51.85 提案アルゴリズム 12.40 12.30 12.23 CPU時間（秒）テキスト：110MB(英文テキスト) CPU : Celelon 366MHz メモリ : 128MB OS : Kondara/MNU Linux 2.1 RC2 情報処理学会九州支部若手の会セミナー 2002

/descendant::cars/child::car/attribute::node()
処理可能なXPathのサブセット文字列照合による手法の限界先行ノードの指定はできない！複雑なフィルタの指定は照合速度を著しく低下させる。 LocationPath ::= '/' RelativeLocationPath RelativeLocationPath ::= Step | RelativeLocationPath '/' Step Step ::= AxisSpecifier NodeTest AxisSpecifier ::= AxisName '::' AxisName ::= 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace' NodeTest ::= QName | NodeType '(' ')' NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction' /descendant::cars/child::car/attribute::node() 情報処理学会九州支部若手の会セミナー 2002

まとめ XML文書に対する文字列照合処理 Sgrepに比べ３倍以上高速処理可能なXPathのサブセットを定義今後の課題
誤検出しない効率的な照合機械の構築パスを考慮したアルゴリズム Sgrepに比べ３倍以上高速処理可能なXPathのサブセットを定義今後の課題 XPathのサブセットに対する実装 XML文書を圧縮して処理を高速化情報処理学会九州支部若手の会セミナー 2002

文字列照合を用いた XMLデータアクセス機構の提案

Similar presentations

Presentation on theme: "文字列照合を用いた XMLデータアクセス機構の提案"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

文字列照合を用いた XMLデータアクセス機構の提案

Similar presentations

Presentation on theme: "文字列照合を用いた XMLデータアクセス機構の提案"— Presentation transcript:

Similar presentations

About project

フィードバック