知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン

知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン
知能システム論　森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジンクラス分類問題アソシエーションルールクラスタリング

検索エンジンとページランキング Spam: 上位にページランキングされるためのトリック検索頻度の大きい語句の使用
　検索頻度の大きい語句の使用　小さなフォント、文字の無色化、METAタグの悪用　ハイパーリンク情報を利用した検索エンジンの登場

Web Graph WWW全体を web page をノード、hyper-link を有向辺とする有向グラフとみなす各ページの重要度
(importance) を実数で表現重要度順にページランキング初期値を１ Ne MS Am

Page Ranking – google.com
重要度が収束するまで繰返すリンク先に重要度を均等に分配分配された和を重要度として更新 Ne MS Ne MS Am Am Am はその重要度を Ne と MS に均等に分配

Ne MS Am

Ne Dead End MS Am

Spider Trap Ne MS Am

Sprider Trap の回避重要度の一部を税金として徴収 Dead End 効果の回避税金の再配分例

Page Ranking – Hubs & Authorities

１２３４１ 1 3 ２ 2 4 ３４隣接(adjacency)行列

１ 1 3 ２３ 2 4 ４ Hub weight authority weight

1 3 １ 2 4 ２３転置隣接行列１２３４１４２ authority weight Hub weight ３４

1 3 2 4 authority　weight

Page Ranking – Hubs & Authorities Authority Weight の正規化
Authority　Weight が無意味に増大しないように毎回の正規化　（Hub Weight についても同様） r Authority weight a i r Initialize a = ( 1 , K , 1 ) T For i = , 1 , 2 , K r r a = A T A a i + 1 i r r Normalize a s.t. a = 1 i + 1 i + 1

Page Ranking – 参考文献 google.com Sergey Brin and Lawrence Page.
The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th International World Wide Web Conference. April Hubs and Authorities Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. 　JACM 46(5): (1999) S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Hypersearching the Web. Scientific American, June 1999.

Rank Aggregation データ収集法、ランキング法が異なる検索エンジンのランキング結果をまとめたランキングを生成する技術
　データ収集法、ランキング法が異なる検索エンジンのランキング結果をまとめたランキングを生成する技術　特定の検索エンジンだけで不当なほど高順位を得ている spam の除去に役立つ　検索エンジン間の比較参考文献 Rank aggregation methods for the Web; Cynthia Dwork, Ravi Kumar, Moni Naor and D. Sivakumar; Proceedings of the tenth international World Wide Web conference on World Wide Web, 2001, Pages

Rank Aggregation – ランキング間の距離

Rank Aggregation – ランキング間の距離
σ τ 1 A C 2 C A 3 E B 4 D D 5 B E Spearman footrule = = 6 Kendall tau =|{(A,C), (B,D), (B,E), (D,E)}| = 4

Optimal Rank Aggregation

Optimal Rank Aggregationの計算
重み付き完全二部グラフ 1 2 n i x ランキングσは二部グラフの完全マッチングとみなせる重み最小の完全マッチングσが Optimal rank aggregation O(n2.5)の時間計算量で計算可能

Cyber Communities 共通の内容に関するサイトの集合(cyber community) を発見したい
　ポータルサイトの容易な構築（Yahoo の自動化）参考文献　Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins. 　Trawling the web for emerging cyber-communities. 　The Eighth International World Wide Web Conference. May 　

Cyber Communities Cyber Community の核となる「コア」を探す
　fan と center (Hub と authority)　　からなる二部グラフをコアとして　内部にもっていると「仮定」　fan が i 個、center が j 個以上の　完全二部グラフを探す　(i,j)-コア　center は類似した内容でも互いに　リンクしないページ　(Microsoft と Sun のような関係) 　fan はポータルサイト fan center (i,j)-コア i j

Cyber Communities i j fan center (i,j)-コア
　少なくとも６個以上別サイトにリンクを　張るページを対象（全体の１２％）　replicated page の削除　ある閾値以上リンクされている　ページは削除　（あまりに有名で　小さな community の元と考えにくいので）　明らかに無駄な arc の排除 out-going arc の数が j 個未満の fan を削除 in-coming arc の数が i 個未満の center を削除　主記憶が小さい環境なので二次記憶を使い source node および destination node　でソートする二次記憶型システム．　　 i j (i,j)-コア

Cyber Communities i j fan center (i,j)-コア
　inclusion-exclusion pruning 　 out-going arc の数が j 個の fan X を選択　　(i,j)-コアを構成するか否かチェック　 j 個の　center を指す fan が X 以外に　(i-1) 個以上存在するか調べる　存在しない場合、fan X を削除　以上を繰り返す　(i,j)-コアを出力 i j (i,j)-コア

Number of non-nepotistic cores.
Cyber Communities 実験結果　i = 3～6, j=3 で 13 万 5 千の cyber communities non-nepotistic core fan がすべて異なる web site にある inclusion-exclusion pruning のおかげで core は殆ど disjoint i j Number of cores. Number of non-nepotistic cores. 3 89565 38887 5 70168 30299 7 60614 26800 9 53567 24595 4 29769 11410 21598 12626 17754 10703 15258 9566 11438 7015 8062 4927 6626 4071 5684 3547 6 4854 2757 3196 1795 2549 1425 2141 1206 Copyright

Cyber Communities (i,j)=(3,5) で 400 サンプルの信憑性を調べる 16 (4%) の中身は無関係
　16 (4%) の中身は無関係　122(30%) は現在は消滅　　→　cyber community の一過性を示唆　56% は当時（1年半前）の Yahoo! に載っていない　29% は現在でも載っていない　→　人手で cyber community を見つけることの限界

Web Graph Structure “Bow-tie” Theory
Copyright

Web Graph Structure 　使用データ Alta Vista　 203Mページ, 1466Mリンク, 9.5GB (May 1999) URL 10byte, link 3.4 byte 271Mページ, 2130Mリンク　(October 1999) 　使用計算機　Compaq AlphaServer (465MHz, 12GB MM)

Web Graph Structure Power Law の検証「あるページを指しているリンクの数が
i である確率は 1/xi に比例する」

Web Graph Structure 1 2.1i in-degree ＝ in-coming arc の数
Copyright

Web Graph Structure 1 2.72i out-degree = out-going arc の数
Copyright

Web Graph Structure 参考文献
　Andrei Broder(1), Ravi Kumar(2), Farzin Maghoul(1), Prabhakar Raghavan(2), Sridhar Rajagopalan(2), Raymie Stata(3), Andrew Tomkins(2), Janet Wiener(3): Graph structure in the web. WWW9. May Alta Vista, 2 IBM Almaden, 3 Compaq 　Nature 誌　2000年5月号　Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins (IBM Almaden), and Eli Upfal (Brown). The Web as a Graph. PODS2000, May 2000

Wayback Machine http://web.archive.org/ 過去5年間の100億ページものWebページを

Wayback Machine

Wayback Machine カリフォルニア大学バークリー校のバンクロフト図書館で 2001年10月24日に開かれた式典で披露
過去のWebページのライブラリは100テラバイトにも達するさらに毎月10テラバイト増加保管するデータ容量は米国議会図書館を初めとするあらゆる図書館を上回り、現存する世界最大のデータベース

知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン

Similar presentations

Presentation on theme: "知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

知能システム論 森下真一 講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン

Similar presentations

Presentation on theme: "知能システム論 森下真一 講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン"— Presentation transcript:

Similar presentations

About project

フィードバック

知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン

Presentation on theme: "知能システム論森下真一講義日程 10/3, 10/10, 10/17, 10/24, 10/31 内容 Web グラフと検索エンジン"— Presentation transcript: