列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム

列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム
2011/09/30 列挙学校（再放送）有村博紀，北大改訂版ver12: 2011/09/30 列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム有村博紀北海道大学大学院情報科学研究科教材PPT: rekkyo/rekkyo2011.ppt

自己紹介氏名：有村博紀（ありむらひろき）主な研究経歴知識発見九州大学理学部物理学科を卒業(1988年卒)
第１回2008年9月12日 GCOE公開講座氏名：　有村博紀（ありむらひろき）九州大学理学部物理学科を卒業(1988年卒) 北海道大学大学院情報科学研究科 (2004～) 妻あり子供二人ありの46才 2004年から北海道在住主な研究経歴機械学習データマイニングアルゴリズム情報検索アルゴリズム大規模離散構造データ（系列，木，グラフ，．．）を効率良く扱いたい知識発見複雑・小規模単純・大規模

今回の講義：目次４コマ目の目的：データマイニングを題材に列挙はどんなことに使えそうか？
2011/09/30 列挙学校（再放送）有村博紀，北大４コマ目の目的：データマイニングを題材に列挙はどんなことに使えそうか？応用分野では，列挙はどのように使われているか？データマイニング(DM)と機械学習における列挙に関連した話題を提供 DMの歴史も少し紹介

目次パート1: データマイニングと頻出集合発見パート２: 構造データのマイニングデータマイニング頻出集合マイニング
2011/09/30 列挙学校（再放送）有村博紀，北大パート1: データマイニングと頻出集合発見データマイニング頻出集合マイニングパート２: 構造データのマイニング木とグラフのマイニング列挙によるパターンのマイニング関連した話題

列挙学校：第４コマパート１:データマイニングと頻出集合発見
2011/09/30 列挙学校（再放送）有村博紀，北大列挙学校：第４コマパート１:データマイニングと頻出集合発見有村博紀北海道大学大学院情報科学研究科コンピュータサイエンス専攻． 5

半構造データ Semi-structured Data
2011/09/05, 富士通研勉強会，北海道大学, 有村博紀データマイニングとは大量のデータから人間にとって有用なパターンや規則を半自動的にとりだす方法の研究 1990年代半ばから研究が盛んになった Apriori algorithm [Agrawal, Srikant, VLDB1994] 潜在的には古くからある研究の集大成．ただし，大量データに対する効率的計算に重点機械学習・数理統計学・データベース技術の境界分野知識 Knowledge H X C 半構造データ Semi-structured Data 半構造パターン発見 Pattern Discovery 2. 知識をパターンや規則として発見 1. 対象領域の理解・データの前処理 3. 発見した知識の利用

Backgrounds データマイニング大量のデータから人間にとって有用なパターンや規則を半自動的にとりだす方法の研究
2011/09/30 列挙学校（再放送）有村博紀，北大データマイニング大量のデータから人間にとって有用なパターンや規則を半自動的にとりだす方法の研究 1990年代半ばから研究が盛んになった Apriori algorithm [Agrawal, Srikant, VLDB1994] 潜在的には古くからある研究の集大成．ただし，大量データに対する効率的計算に重点機械学習・数理統計学・データベース技術の境界分野

Backgrounds データマイニングのプロセスの全体 1.対象領域の理解 2.データ集合の前処理
2011/09/30 列挙学校（再放送）有村博紀，北大データマイニングのプロセスの全体 1.対象領域の理解 2.データ集合の前処理 3.パターンの発見（狭義のデータマイニング） 4.得られたパターンの解析 5.解析結果の利用

私のデータマイニングのイメージデータマイニング情報検索キーワード検索
2011/09/30 列挙学校（再放送）有村博紀，北大 <ships> <gulf > 情報検索キーワード検索 Finding/Discovering hidden knowledge from massive data 直接目で見る <dallers> <wheat> <shipping > データマイニング <u.s.> <sea men> <strike > <port > <the gulf > <vessels > <iranian > <attack > <silk worm missile> <iran > <strike > 9

データマイニングの研究動向予測学習・自動分類パターン発見構造マイニング新しいタイプのデータマイニング確率モデリング
2011/09/30 列挙学校（再放送）有村博紀，北大トランザクションデータから共通して出現する規則性を発見する頻出パターン発見 [Agrawal et a. ‘94] 最適化マイニング [森下 ’96, ’98, ‘00] パターン発見不完全なデータから，未知の規則を学習する SVM [Vapnik ‘96], Boosting [Shapire & Kearns ‘96] C4.5 [Quinlan ‘96] 予測学習・自動分類構造マイニング非定型構造データから特徴的な部分構造を規則性を発見するグラフマイニング[Washio & Motoda ‘00], [Zaki ’02], [Uno, Asai, Arimura, ’02, ‘03] 新しいタイプのデータマイニングデータを類似したものどうしグルーピングする．大規模・不完全なデータからの高速クラスタリング K-means, CLARANS, DBSCAN クラスタリング確率モデリング有用な規則・パターン・知識 H X C 大規模データデータマイニング高次元大規模データから不確実な現象を予測・モデル化するベイジアンネットワーク [Pearl ’90s] HMM [Asai], MCMC, ベイズ推定・MDL・AIC テキストマイニング自然言語テキスト情報抽出意味マイニングストリームマイニングセンサー監視近似統計処理 10

10 Algorithms in Data Mining
情報知識ネットワーク特論 DM1 In an effort to identify some of the most influential algorithms that have been widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM) identified the top 10 algorithms in data mining for presentation at ICDM '06 in Hong Kong. C 決定木 k-Means - クラスタリング SVM - 計算学習 Apriori - 頻出集合 EM - 統計学習 PageRank - スペクトル AdaBoost - 計算学習 kNN - 統計学習 Naive Bayes - 統計学習 CART - 計算学習 X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand and D. Steinberg: Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37.

「集中」「分散」「ビッグデータ」 (IBMTM :-) これからのデータマイニング「ポストペタスケール時代」人間クラウド
超大規模で不均質かつ不完全な非構造データどうやってあつかうか？「ポストペタスケール時代」人間「分散」さまざまなデバイス多様な人間活動と応用多様で非均一な時空間不完全で複雑なデータと情報「集中」大量のデータ多数のＣＰＵ高速なネットワーク膨大な計算クラウド 2011/09/30 列挙学校（再放送）

情報に関する最近の話題ワトソン君クラウド計算 IBMリサーチ (2011/02/16) クイズ番組で人間に勝利！世界中の情報を計算！
2011/09/30 列挙学校（再放送）有村博紀，北大情報に関する最近の話題ワトソン君 IBMリサーチ (2011/02/16) クイズ番組で人間に勝利！１００万冊の本を読んで回答人工知能と自然言語，アルゴリズム，検索の技術でクラウド計算世界中の情報を計算！米国の人気クイズ番組「ジョパディ！」のワトソン君クイズ王に挑戦中 From geek.com: Google server firm From

データマイニングデータマイニング大量のデータから有用なパターンや規則を半自動的にとりだす方法の研究 1990年代半ばに顕在化．定型データを対象とする非構造／半構造データ多様な構造をもつ膨大なデータ新しいデータマイニング技術が必要! 3. 新たな知識の構築知識 Knowledge H X C 半構造データ Semi-structured Data 半構造パターン発見 Pattern Discovery 2. 知識をパターンや規則として発見知識を有機的につなげる文科省科研費特定「発見科学」(H10-H12), 「情報学」(H13-H17), 基盤研究B(H12-H14, H15-H17)等文科省科研費特別推進研究（平成１７年～１９年）「知識基盤形成のための大規模半構造データからの超高速パターン発見」文科省GCOEプログラム「知の創出のための次世代情報基盤拠点」（H19-H23）

列挙と最適化データマイニングの二つの柱相補的な関係データに含まれるパターンを網羅的に見つける
「列挙と最適化は車の両輪」 by S. Minato 列挙相補的な関係と最適化目的関数を最大化するパターンを求める機械学習手法湊真一先生（北大＆ERATO）データに含まれるパターンを網羅的に見つけるパターン発見手法

パート1の１：頻出集合マイニングの枠組み～ (Agrawal & Srikant,1994)
2011/09/30 列挙学校（再放送）有村博紀，北大パート1の１：頻出集合マイニングの枠組み～ (Agrawal & Srikant,1994)

Backgrounds Frequent Itemset Mining
Finding all "frequent" sets of elements (items) appearing no more than σ times in a given transaction data base. Introduced by Agrawal and Srikant [VLDB'94] One of the most popular data mining problem Basis for more complicated / sophisticated data mining problems

動機：結合ルールマイニング Association Rule Mining [Agrawal 1993/1994]
Finding combination of “items” frequently appearing in a given database トランザクションデータバスケットデータ二値データベースカラム/属性アイテムレコード/タプルバスケットトランザクション／レコードの意味「レコード003の顧客は，ポテトチップとソーセージを一緒に買った」

動機：結合ルールマイニング Frequent Itemset X = { Mustard, Sausage, Beer }
with support/frequency 40% ID Chips Mustard Sausage Softdrink Beer 001 1 1 アイテム集合 X 002 1 1 1 1 1 003 1 1 Xの出現リスト Occ(X) = {002, 005, , 010} 004 1 1 005 1 1 1 1 006 1 1 1 1 007 1 1 1 1 Xの頻度（サポート） freq(X) = |Occ(X)| = 4/10 = 40% 008 1 1 1 009 1 1 010 1 1 1 アイテム集合 X の意味 X = 「マスタードとソーセージ，ビールを一緒に買う人」が全体の40％いた

頻出アイテム集合マイニング Frequent Itemset Mining Problem
Given: A database DB over a set Σ of items, and a number σ≧0 called “minsup” (minimum support) Problem: Find all itemsets X⊆Σ appearing in no less than σ records of DB 20 20

Definitions: Database
2011/09/30 列挙学校（再放送）有村博紀，北大 A set Σ = { 1, ..., n } of items (elements) Transaction database A set T = { t1, ..., tm } of subsets of Σ Each subset t ⊆Σ is called a tuple (record) Transaction database id tuples 1 1, 3 2 2, 4 3 1, 2, 3, 4 4 1, 2, 4 Alphabet of items I = {1,2,3,4} 21 21

Definitions: Itemset lattice
2011/09/30 列挙学校（再放送）有村博紀，北大 Item set: any subset X ⊆ Σ = { 1, ..., n } (Item) set lattice L = (2Σ, ⊆) The power set 2Σ = { X : X ⊆ Σ } The subset relation⊆over 2Σ Example: The set lattice forΣ = {1,2,3,4} 22

Definitions: Frequent sets
2011/09/30 列挙学校（再放送）有村博紀，北大 An itemset X appears in a tuple t: X ⊆ t The occurrence of X in a database T: Occ(X, T) = { t ∈ T : X ⊆ t } The frequency of X: Fr(X, T) = | Occ(X, T) | Minimum support (minsup): an integer 0≦ σ ≦|T| X is σ-frequent (frequent) in T if Fr(X, T) ≧ σ. Alphabet of items I = {A,B,C,D} Transaction database Occurrences and frequences of itemsets id tuples 1 1, 3 2 2, 4 3 1, 2, 3, 4 4 1, 2, 4 Occ(3, T) = {1, 3} Fr(3, T) = 2 Occ(24, T) = {2, 3, 4}, Fr(24, T) = 3 23 23

Definitions: Frequent sets
2011/09/30 列挙学校（再放送）有村博紀，北大 The occurrence of X in a database T: Occ(X, T) = { t ∈ T : X ⊆ t } X is σ-frequent (frequent) in T if Fr(X, T) = | Occ(X, T) | ≧ σ. minsupσ= 2 Frequent sets Frequent sets ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124 The itemset lattice (2Σ, ⊆) 24

Definitions: Problem Frequent Itemset Mining Problem
2011/09/30 列挙学校（再放送）有村博紀，北大 Frequent Itemset Mining Problem Given: A transaction database T and a non-negative integer 0≦ σ ≦|T| Task: Enumerate all "frequent" itemsets X in T that have frequency at least σ (Fr(X)≧σ) F: the class of all σ-frequent itemsets The number |F| of solutions is exponential in the number n of items. a typical enumeration problem. 25

頻出アイテム集合マイニング Frequent Itemset Mining Problem
2011/09/30 列挙学校（再放送）有村博紀，北大 Frequent Itemset Mining Problem Given: A database DB over a set Σ of items, and a number σ≧0 called “minsup” (minimum support) Problem: Find all itemsets X⊆Σ appearing in no more than σ records of DB Applications to more sophisticated data mining problems Assocition rule [Agrawal, Srikant ‘94] {Mustard, Sausage, Beer => PotateChips } with frequency 40% Optimized classification rule [Sese & Morishita PODS’90] If gene0001 & gene0012 then diabetes with classification error 8.5% Itemset boosting/SVM [Saigo, Uno, Tsuda Bioinformatics 23(18) 2007] Learning a linear classifier over itemsets as composite features Weighted substructure mining [Nowozin, Tsuda, Uno, Kudo, Bakir, CVPR’07] Application to image processing 26 26

Computational Complexity of Data Mining Algorithms
How to model efficient data mining algorithms? Light-weight High-throughput

2011/09/30 列挙学校（再放送）有村博紀，北大 Modeling data mining as enumeration Idea: Mesure the computation time per solution Input size M Output-polynomial (OUT-POLY) Total time = poly(N, M) Input polynomial-time enumeration, or amortized polynomial-delay (POLY-ENUM) Amotized delay is poly(Input), or Total time = M·poly(N) Algorithm Output size M Outputs Time Delay D polynomial-delay (POLY-DELAY) Maximum of delay is poly(Input) Total Time T + polynomial-space (POLY-SPACE) 28 28

2011/09/30 列挙学校（再放送）有村博紀，北大 Modeling data mining as enumeration Idea: Mesure the computation time per solution Ultimate Goal: To design polynomial delay and polynomial space algorithm for a given data mining problem Input size M Output-polynomial (OUT-POLY) Total time is poly(Input, Output) Input Algorithm polynomial-time enumeration (POLY-ENUM) Amotized delay is poly(Input), or Total time is Output·poly(Input) Output size M Outputs Time Delay D polynomial-delay (POLY-DELAY) Maximum of delay is poly(Input) Total Time T + polynomial-space (POLY-SPACE) 29 29

2011/09/30 列挙学校（再放送）有村博紀，北大パート1の2：頻出集合マイニングの幅優先(BFS)型アルゴリズム～ Aprioriアルゴリズム (Agrawal & Srikant,1994) ～頻出集合マイニングの最初のアルゴリズムちなみに：引用元 5768, R Schapire - A desicion-theoretic generalization of on-line learning and an application to boosting, Computational learning theory, Springer (AdaBoost) 引用元 : VN Vapnik, The nature of statistical learning theory, Book, 2000 (SVM) Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: ～データマイニングで最も有名な論文の一つ (Google Scholarでの引用数: 11215件, Sep .2011）

パターン列挙頻度計算頻出集合マイニングデータベースの与えられた個数以上の組に含まれる集合（アイテム集合）をすべて見つける問題
2011/09/30 列挙学校（再放送）有村博紀，北大データベースの与えられた個数以上の組に含まれる集合（アイテム集合）をすべて見つける問題二つの部分からなるパターン列挙頻度計算定められたクラスに属するパターンをすべて生成する組み合わせ列挙データベース中での各パターンの出現回数や場所を計算パターンマッチング埋め込み計算

頻出集合マイニングの素朴なアルゴリズム有村博紀，北大 2011/09/30 列挙学校（再放送）

頻出集合マイニング素朴なアルゴリズム (Piatetsky-Shapiro?, KDD'88) 欠点すべての部分集合を生成し，
2011/09/30 列挙学校（再放送）有村博紀，北大素朴なアルゴリズム (Piatetsky-Shapiro?, KDD'88) すべての部分集合を生成し，データベースからその頻度を計算する欠点解となる頻出集合の数に関係なく，指数個の集合を生成してしまう．頻度計算に（すごく）時間がかかる．百数十アイテム数百レコードに対して，数時間～数日を要する！

BFS (Breadth-first search) algorithm
? ? APRIORI Algorithm Most popular, the 1st generation frequent itemset miner [Agrawal & Srikant, VLDB’94; Mannila, Toivonen, Verkamo, KDD’94] Cadidate Generation: BFS (Breadth-first search) Starting from the smallest element, search the itemset lattice from smaller to larger Generate itemsets level by level (level-wise algorithm) Frequency Counting: Horizontal Data Layout Sequentially scanning a large database placed on a hard disk from left to right to compute the frequencies All the candidate itemsets is stored in a dictionary (a hash tree) in main memory. Dr. Agrawal IBM Almaden Lab. Prof. Mannila Helsinki Inst. Tech. Univ. Helsinki 34

Basic idea: “Anti-monotonicity” lemma
Lemma (Agrawal, Srikant): For every minsup σ, if Y is σ-frequent then any subset X of Y is also σ-frequent in DB T. Corollary: Every non-empty σ-frequent set Y is obtained from some σ-frequent set X by adding some new element. The class Fσ of all frequent sets is monotone subclass of (2Σ, ⊆) Anti-monotone property The itemset lattice (2Σ, ⊆) minsupσ= 2 解の境界 parent X Frequent sets ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124 parent X child Y child Y 35

Cadidate Generation by BFS
Apriori algorithm [Agrawal, Srikant, 1994] (Also called Level-wise [Mannila et al., 1994]) Slice the item set-lattice 2Σ into levels L0, L1, …., Ln () Compute Fk = { X∈ Lk : X is a frequent set of size k } for k = 0,1,2, … in a bottom-up manner Generation of the k-th candidate set Ck+1 from Fk: For each a1…ak and b1…bk ∈ Fi with a1=b1…ak-1=bk-1 and ak < bk, add a1…akbk into Ci level 0 Φ 1 2 3 4 5 t1 ○ t2 t3 t4 t5 level 1 1 2 3 4 level 2 12 13 14 23 24 34 level 3 123 124 134 234 level 4 1234 database

Basic Idea: Cadidate Generation by BFS
2011/09/30 列挙学校（再放送）有村博紀，北大 Apriori algorithm [Agrawal, Srikant, 1994] (Also called Level-wise [Mannila et al., 1994]) Slice the item set-lattice 2Σ into levels L0, L1, …., Ln () Compute Fk = { X∈ Lk : X is a frequent set of size k } for k = 0,1,2, … in a bottom-up manner Generation of the k-th candidate set Ck+1 from Fk: For each a1…ak and b1…bk ∈ Fi with a1=b1…ak-1=bk-1 and ak < bk, add a1…akbk into Ci level 0 Φ Frequent sets F 1 2 3 4 5 t1 ○ t2 t3 t4 t5 level 1 1 2 3 4 level 2 12 13 14 23 24 34 level 3 123 124 134 234 level 4 1234 database 37

Basic Idea: Frequency Counting
2011/09/30 列挙学校（再放送）有村博紀，北大 Horizontal layout Scanning the database tuple by tuple For each tuple t, increment the count of all candidate itemset that are subsets of t. 0) Given a candidate set Ci 123 124 134 234 245 1) Scan the database sequentially tuple by tuple items t1 1 3 t2 2 4 t3 5 … t4 t5 tuples x1 x2 x3 x4 x5 t1 ○ t2 t3 t4 t5 38

Basic Idea: Frequency Counting
2011/09/30 列挙学校（再放送）有村博紀，北大 Horizontal layout Compute the frequencies of all candidate sets in Ci by scanning the database tuple by tuple For each tuple t, increment the count of all candidate itemset that are subsets of t. candidate set Ci set freq 123 1 124 2 134 1 234 1 245 2 3) Lookup the subset s in the hash tree and then increment its count 1) Scan the database sequentially tuple by tuple 2) For each tuple t, enumerate all subsets s of t with size k Hash Tree for Ci ≡ Trie for Ci items database D 1 2 t1 1 2 3 4 t2 5 t3 … t4 t5 1 2 4 1 2 5 2 4 5 2 4… 2 5… 2 3 3 4 3 4 4 4 5 tuples 123/1 124/2 134/1 234/1 245/2 39

Breadth-first search algorithm
アルゴリズム Apriori(T:データベース, 0 ≤σ≤|T|: minsup): 出力：全頻出集合の集合F. サイズ1の頻出集合の全体F1を計算する． i = 2. while (Fi-1 /= ∅) do //ステージi: STEP1（候補生成 AprioriGen）: Fi-1のサイズ(i-1)の頻出集合同士を組合わせ，サイズiの候補集合の集まりCiを計算する． STEP2（頻度計算）: データベースを一度だけ走査し，各候補集合X ∈ Ciの頻度Freq(X)を一度に計算する． Fi = ∅. STEP3: Ciから頻度σ以上のすべての候補集合を取り出し， Fi に入れる． STEP4: i = i + 1. return F = F1∪…∪Fi を出力する． 40

Large collection of data
Apriori アルゴリズム [Agrawal 1994, Mannila 1995] The state-of-the-art algorithm for association rules Clever use of the present computer technology A pool of patterns Fast Processor Levelwise search coke & hamburger & chips: 24 d = 3 + d = 2 sausage & beer coke & chips: 48 coke & hamburger: 52 Medium sized Main memory Fast counting by Hash tree + mustard: 24 chips: 124 coke: 254 sausage: 62 beer: 128 d = 1 Large collection of data Huge Hard disk Sequential Disk Scan

議論 Aprioriアルゴリズム(Agrawal '94)の意義
当時（１９９０年代半ば）の計算機の技術トレンドを最大限に活用：　それ以前に比べて，格段に大きくなったメモリ（主記憶）と，安価で高速なＣＰＵ，低速だが極めて大容量のハードディスク Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997: ; ("prefix equivalence class" method) Shohei Hido, Hiroyuki Kawano: AMIOT: Induced Ordered Tree Mining in Tree-Structured Databases. ICDM 2005: 2011/09/30 列挙学校（再放送）有村博紀，北大

議論 Aprioriアルゴリズム(Agrawal '94)の「候補生成法」（AprioriGen手続き）は，本当に役立っているか？
集合の族の場合，列挙自体はもっと簡単なアルゴリズムでできる(例：家系木法/DFS)．ただし，木やグラフの族の場合，頻度制約を考えると，効率良い場合がある． Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997: ; ("prefix equivalence class" method) Shohei Hido, Hiroyuki Kawano: AMIOT: Induced Ordered Tree Mining in Tree-Structured Databases. ICDM 2005: 2011/09/30 列挙学校（再放送）有村博紀，北大

2011/09/30 列挙学校（再放送）有村博紀，北大パート1の3：頻出集合マイニングの深さ優先(DFS)型アルゴリズム～ Eclatアルゴリズム (Zaki, SDM 2000) ～ (Sese & Morishita, PODS 2000) ～ (Bayardo SIGMOD’98) その他 Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997: Shinichi Morishita, Jun Sese: Traversing Itemset Lattice with Statistical Metric Pruning. PODS 2000:

Aprioriアルゴリズム(Agrawal et al., '94)：重複解の回避のため，解をすべて保持する
ＢＦＳ型アルゴリズムの欠点 Aprioriアルゴリズム(Agrawal et al., '94)：　重複解の回避のため，解をすべて保持するメモリを使いすぎる！ 2011/09/30 列挙学校（再放送）有村博紀，北大

深さ優先型(DFS)アルゴリズム「バックトラック」アルゴリズムアルゴリズムのキモ第2世代の頻出集合発見アルゴリズム
2011/09/30 列挙学校（再放送）有村博紀，北大「バックトラック」アルゴリズム第2世代の頻出集合発見アルゴリズム Eclat [Zaki et al., KDD’97] [Morishita DS’98], [Bayardo SIGMOD’98] LCM [Unoら, FIMI'03,'04,DS'04] アルゴリズムのキモ主記憶型の単純な再帰アルゴリズム集合の家系木の上で「深さ優先探索」(DFS)をする．「垂直データレイアウト」を採用．メモリ効率が良くて，実際に高速．いわゆる「パターン成長アプローチ」 "Pattern growth" 46

DFS(Depth-first search) algorithm
2011/09/30 列挙学校（再放送）有村博紀，北大 Backtracking Algorithm The 2nd generation frequent itemset miners Eclat [Zaki et al., KDD’97], [Morishita DS’98], [Bayardo SIGMOD’98] DFS (Depth-first search) The set enumeration tree T = (F, Epa, f) for the class of frequent sets Starting from f, search F from smaller to larger Pattern growth approach: Grow each itemset by attaching a new item, the tail element. Implementation: The search structure is implicitly implemented by a recursive procedure. Vertical Data Layout For each enumerated itemset X, maintain its occurrence list Occ(X) to compute the frequencies. Incremental update of Occ(X) is possible: “Downward closure” 47

Two approaches to Frequent sets mining
2011/09/30 列挙学校（再放送）有村博紀，北大 Apriori algorithm [1994] Breadth-first search (BFS) Horizontal data layout Backtrack algorithm [ ] Depth-first search (DFS) Vertical data layout 1 2 3 4 5 t1 ○ t2 t3 t4 t5 2nd generation In-core algorithm Space efficient 1st generation External memory algorithm database

DFS(Depth-first search) algorithm
DFS algorithm How to generate all subsets X⊆Σin depth-first search? Use enumeration of subsets! (岡本先生の講義の「部分集合列挙」) The set lattice L = (2Σ, ⊆) The set lattice for Σ = {1,2,3,4} σ = 2 F = { ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124 } 49

The set enumeration tree (the family tree)
A spanning tree for all frequent itemsets Assign the unique parent Pa(Y) = Y- { max(Y) } =: X to each non-empty frequent set Y. Given a parent X, we can generate its children Y = X∪{ i } by attaching every item i satisfying i > max(X). The set enumeration tree for F The set lattice for Σ = {1,2,3,4} σ = 2 F = { ∅, 1, 2, 3, 4, 12, 13, 14, 23, 24, 124 } 50

DFS algorithm for Frequent Itemset Mining
2011/09/30 列挙学校（再放送）有村博紀，北大 Σ = {1, ..., n}: アイテムの全体集合 Algorithm Backtrack　//主プログラム BacktrackExpand(∅, Occ(∅), 0, n); procedure BacktrackExpand(X, Occ(X), k, n) Input: X:itemset, Occ(X), k: tail, n: maximum item; if Freq(X) = |Occ(X)| < σ then return; //Backtrack! Print a frequent set X; for i = k+1,...,n do：新しい出現リストOcc(X∪{i})を計算する; BacktrackExpand(X∪{i}, Occ(X∪{i}), i, n) ; 51

効率良い頻度計算 Vertical layout: アイテムを足すごとに各集合の出現リストを更新する
2011/09/30 列挙学校（再放送）有村博紀，北大アイテムを足すごとに各集合の出現リストを更新する Lemma: For any X1, X2, Occ(X ∪ Y) = Occ(X) ∩ Occ(Y). Lemma: For any set X, item a, Occ(X∪{a}) = { t∈Occ(X) : a ∈t } database D Downward closure [Zaki et al ‘97] items t1 1 2 3 4 t2 5 t3 … t4 t5 Occ(1) tuples 1 1 t1 t3 t5 Occ(2 3) f t1 t2 t3 t4 t5 3 2 3 t3 t4 2 t2 t3 t4 t5 2 items 4 Vertical layout: The occurrence list representation 1 2 3 4 5 t1 t2 t4 t3 t5 2 4 t2 t3 t5 UpdateOcc Occ(f) Occ(2) Occ(2 4) 52 52 DFS over the set enumeration tree

Efficient update of Occ(X)
2011/09/30 列挙学校（再放送）有村博紀，北大 Properties: For any X1, X2, Occ(X ∪ Y) = Occ(X) ∩ Occ(Y). (basic) For any set X, item a, Occ(X∪{a}) = { t∈Occ(X) : a ∈t } 手続き UpdateOcc(X, a, O(X)) Input：X⊆Σと, a ∈Σ, O(X). Output: O(X∪{a}). Y = X∪{a}; Occ(Y) = ∅; foreach t ∈O(X) do if a ∈t then Occ(Y) = Occ(Y)∪{t}; return Occ(Y); 古い出現リストから，一個ずつ位置を取り出して，まだ大丈夫か調べる Downward closure [Zaki et al ‘97] 53

DFS: Main result Theorem
The algorithm Backtrack can be implemented to enumerates all σ-frequent itemsets X in a given transaction database T in O(l・|Occ(X)|) time per frequent itemset using O(l・|Occ(X)|) space where l is the maximum length of tuples in T and |Occ(X)| is the number of occurrences of X. Space and time efficient (poly-delay & poly-space) On the other hand, the algorithm Apriori requires O(l・|Occ(X)|) time per frequent itemset and O(maxi ||Fi||) space. 54

Backtrack algorithm [1997-1998]
Summary: BFS vs. DFS 2011/09/30 列挙学校（再放送）有村博紀，北大 Apriori algorithm [1994] Breadth-first search (BFS) Horizontal data layout Backtrack algorithm [ ] Depth-first search (DFS) Vertical data layout x1 x2 x3 x4 x5 t1 ○ t2 t3 t4 t5 2nd generation Space efficient In-core algorithm 1st generation External memory algorithm database

Summary: Horizontal & Vertical Layout
2011/09/30 列挙学校（再放送）有村博紀，北大データベースD 水平/ Horizontal layout (Aprioriアルゴリズム) 垂直 / Vertical layout (Backtrackアルゴリズム) 1 2 3 4 5 t1 ○ t2 t3 t4 t5 Scan t1 1 3 t2 2 4 t3 t4 t5 1 2 3 4 5 t1 t2 t4 t3 t5 tuples Scan 外部記憶向き候補アイテム集合の出現リストを，追加ごとに更新する主記憶向き set freq 123 1 124 2 134 1 234 1 245 2 1 t2 t3 t5 12 t3 t5 124 t3 t5 候補アイテム集合の出現数をまとめて計数する 56

FP-growthアルゴリズム集合27の出現リスト集合 27 頻出パターン木 (FP-Tree) データベース木 * 1 2 5 6 7
2011/09/30 列挙学校（再放送）有村博紀，北大最もよく知られたアルゴリズムの一つ（Hanら，SIGMOD2000）パターンとデータを木構造(FP-Tree ="Frequent Pattern Tree") に格納する．当初は「候補生成なしマイニング」と呼ばれていた．現在では，DFS法の変種と理解（同じパターン列挙と出現計算）実際のデータに対して高速といわれているデータベース A: 1,2,5,6,7,9 B: 2,3,4,5 C: 1,2,7,8,9 D: 1,7,9 E: 2,3,7,9 F: 2,7,9 頻出パターン木 (FP-Tree) データベース木 2 7 集合27の出現リスト 9 集合279の出現リスト 27あ頻出集合 Φ {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9} φ 1 7 9 2 179 19 29 79 27 * 1 2 5 6 7 9 8 3 4 A C B E F D 9 279 集合 27 J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. SIGMOD2000, 1-12 (2000)

ＬＣＭアルゴリズム（Linear-time Closed Itemset Miner）
飽和集合／極大集合マイニングアルゴリズム理論的性能保証：出力線形時間アルゴリズム "The world-fastest closed set mining algorithm" (according to IEEE ICDM "the top-10 data mining algorithms" article) くわしくは宇野毅明の講義で．．． LCM, 宇野先生HP, program codes： FIMI Frequent Itemset Mining Implementations Repository: 58

Summary Frequent Itemset Mining Algorithms Applications
Finding all "frequent" sets of elements appearing in a given transaction data base. Introduced by Agrawal and Srikant [VLDB'94] One of the most basic data mining problem Algorithms BFS algorithm　－　Apriori [Agrawal et al, ‘94] DFS algorithm　－　Backtrack [Zaki et al. ’97; Morishita’98] Applications Feature extraction for SVM & Boosting Closed and Maximal Frequent Itemset Mining Sequences and Graphs

Bibliography パート1：データマイニングと頻出集合発見
2011/09/30 列挙学校（再放送）有村博紀，北大パート1：データマイニングと頻出集合発見 [AIS 93] R. Agrawal, T. Imielinski, A. N. Swami: Mining Association Rules between Sets of Items in Large Databases, Proc. SIGMOD Conference 1993, pp , 1993. [AMST 96] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo: Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, pp , 1996. [Agrawal & Srikant, VLDB’94] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. Proc. VLDB 1994, pp , (Apriori) [Bayardo, SIGMOD’98] R. J. Bayardo Jr.: Efficiently Mining Long Patterns from Databases, Proc. SIGMOD Conference 1998: pp , 1998. [Han et al. SIGMOD’00] J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. SIGMOD Conference 2000, pp. 1-12, (FP-growth) [Morishita & Sese PODS’00] S. Morishita, J. Sese: Traversing Itemset Lattice with Statistical Metric Pruning, Proc. PODS 2000, pp , 2000. [Zaki et al., KDD’97] M. J. Zaki, S. Parthasarathy, M. Ogihara, W. Li, New algorithms for fast discovery of association rules, Proc. KDD 1997, pp , (Eclat) [宇野，有村] 宇野毅明，有村博紀,「データインテンシブコンピューティングその２－頻出アイテム集合発見アルゴリズム－」,人工知能学会誌，レクチャーシリーズ「知能コンピューティングとその周辺」, (編)西田豊明, 第2回, Vol.22, No.3, 2007年5月. (PDF:

列挙学校：第４コマパート２: 系列とグラフのマイニング
2011/09/30 列挙学校（再放送）有村博紀，北大列挙学校：第４コマパート２: 系列とグラフのマイニング有村博紀北海道大学大学院情報科学研究科コンピュータサイエンス専攻． 61

2011/09/30 列挙学校（再放送）有村博紀，北大パート２の１：半構造マイニング～頻出グラフマイニングの定式化（Inokuchi, Washio, Motoda, PKDD2000) その他

Semi-structured data mining
2011/09/30 列挙学校（再放送），有村博紀，北大 Semi-structured data mining 大規模非定形データからのマイニング Minining semi-structured data

2011/09/30 列挙学校（再放送），有村博紀，北大半構造データマイニング半構造データ多様な構造をもつ膨大なデータ従来のデータマイニング手法は，半構造データには直接適用不可能新しい半構造データマイニング技術が必要高速かつ頑健なマイニング技術が鍵 3. 新たな知識の構築知識 Knowledge H X C 半構造データ Semi-structured Data 半構造パターン発見 Pattern Discovery 2. 知識をパターンや規則として発見文科省科研費特別推進研究（平成１７年～１９年）「知識基盤形成のための大規模半構造データからの超高速パターン発見」知識を有機的につなげる

Applications of Semi-structured Data Mining
2011/09/30 列挙学校（再放送），有村博紀，北大 Applications of Semi-structured Data Mining H X C Graph / Network Analysis Text/HTML page classification Information Extraction Natural Language Processing Query Optimization Data Compression Knowledge Disocvery from Complex Data Chemical Compounds/Molecules Network Log Link structures in Web Gene Networks...

FSG [Kuramochi et al. (ICDM’01)
2011/09/30 列挙学校（再放送），有村博紀，北大半構造データマイニングの歴史～1995 1996 1997 1998 1999 2000 2001 2002 2003 Algorithm for finding subgraphs by MDL principle Subdue [Holder et al. (KDD’94)] Finding frequent paths [Wang and Liu (KDD’97)] Finding Semi-structured Schema [Nestrov, Abiteboul et al. (SIGMOD’98)] Finding frequent subgraphs AGM [Inokuchi, Wahio, Motoda (PKDD’00, MLJ. 2003)] FSG [Kuramochi et al. (ICDM’01) Finding frequent / optimal ordered trees FREQT [Ours (SDM’02)]，Treeminer [Zaki (KDD’02)] Finding frequent subgraphs gSpan [Yan and Han (ICDM’02)], VG [Venetik, et al.(ICDM’02)] There are studies on tree mining and graph mining. Finding frequent un-ordered trees UNOT [Ours (SDM’03)]，NK [Nijssen, Kok (MGTS’03)] Frequent Maximal/Closed Trees & Graphs mining [Yan&Han '03; Termier et al.'04] and many algorithms in 2000s

History of Tree & Graph Mining
2011/09/30 列挙学校（再放送）～1995 1996 1997 1998 1999 2000 2001 2002 2003 Algorithm for finding subgraphs by MDL principle Subdue [Holder et al. (KDD’94)] Finding frequent paths [Wang and Liu (KDD’97)] Finding Semi-structured Schema [Nestrov, Abiteboul et al. (SIGMOD’98)] Finding frequent subgraphs AGM [Inokuchi, Wahio, Motoda (PKDD’00, MLJ. 2003)] FSG [Kuramochi et al. (ICDM’01) Finding frequent / optimal ordered trees FREQT [Asai et al. (SDM’02)]，Treeminer [Zaki (KDD’02)] Finding frequent subgraphs gSpan [Yan and Han (ICDM’02)], VG [Venetik, et al.(ICDM’02)] There are studies on tree mining and graph mining. Finding frequent unordered trees UNOT [Asai, Uno, Nakano, Arimura (SDM’03)]，NK [Nijssen, Kok (MGTS’03)] Frequent Maximal/Closed Trees & Graphs mining [Yan&Han '03; Termier et al.'04] and many algorithms in 2000s 有村博紀，北大 67 67

2011/09/30 列挙学校（再放送）パート２の２：頻出順序木のマイニング～ TreeMiner (Zaki, KDD2002) ～ Freqt (Asai et al., SDM2002) ～ Nakano (IPL, 2002) Mohammed Javeed Zaki: Efficiently mining frequent trees in a forest. KDD 2002: Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa: Efficient Substructure Discovery from Large Semi-structured Data, SDM 2002. Shin-Ichi Nakano: Efficient generation of plane trees. Inf. Process. Lett. 84(3): (2002) 有村博紀，北大

Frequent Pattern Mining for Ordered Trees
Minining ordered and unordered trees

Trees Tree Mining Development of Efficient tree mining algorithms
2011/09/30 列挙学校（再放送），有村博紀，北大 Tree Mining Association Rule, Frequent Itemsets Simple Efficient algorithms Expressive patterns & Efficient algorithms Development of Efficient tree mining algorithms Trees Ordered trees. .. General Graphs… Unordered trees are more expressive. But, no efficeint algorithm was known before. Thus, the goal of this study is to develop an efficient tree mining algorithm for this class. General graphs (AGM, FSG, gSpan) Expressive Computationally Hard

研究成果：高速半構造マイニング高速な木パターン発見エンジン半構造データの特徴的部分構造の発見
理論と実装： FREQT, StreamT, Unot SIAM DM'02, PKDD'02, IEEE ICDM'02), DS'03 公開＆応用 FREQT: DEWS'02優秀論文賞（H14年6月） Unot: DEWS’04優秀論文賞（H16年6月） MaxMotif: Awarded DEWS2004（H17.6） StreamT: ’０３AI学会大会優秀賞（H15.6） 71

パターンとデータ：ラベル付き順序木 Labeled Ordered Tree 例根付き (rooted) 順序木(兄弟の順序がある)
2011/09/30 列挙学校（再放送），有村博紀，北大パターンとデータ：ラベル付き順序木 Labeled Ordered Tree 根付き (rooted) 順序木(兄弟の順序がある) ラベル付き(各ノードはラベルをもつ) 例 HTML/XML文書階層レコード自然言語の「かがり受け木」(dependency tree). people @age @id 608 name #text person 609 tel 25 John Mary

順序木のマッチング（パターン照合）パターン木 Tがデータ木 Dにマッチする (出現する)
2011/09/30 列挙学校（再放送），有村博紀，北大順序木のマッチング（パターン照合）パターン木 Tがデータ木 Dにマッチする (出現する) There is a matching function f from T into D A C B matching function f pattern tree T A C B A C B data tree D r 1 A C B C 2 7 f は1-to-1. f は親子関係を保存. f は (間接) 兄弟関係を保存. f はラベルを保存. We give the definition of tree matching. Suppose we are given a data tree D and a pattern tree T; both of them are labeled ordered trees. Then, the pattern tree T matches the data tree D if and only if … A B 3 4 5 8 11 B 6 9 10 73

パターンの頻度 T D P1 P2 A root occurrence of pattern T:
2011/09/30 列挙学校（再放送），有村博紀，北大パターンの頻度 A root occurrence of pattern T: The node to which the root of T maps by a matching function The frequency fr(T) = #root occurrences of T in D T A r D 1 B C A C P1 2 7 The occurrences of a pattern is defined by the root occurrences. A root occurrence of a pattern is the node to which the root of T maps by a matching function. For example, [this node] is a root occurrence of the pattern because the root node maps to [this]. B A C A B 3 4 5 8 11 Root occurrence list OccD(T) = {2, 8} P2 B B C 6 9 10 74

2011/09/30 列挙学校（再放送），有村博紀，北大頻出木パターン発見問題 Given: a colection S of labeled ordered trees and a minimum frequency threshold s Task: Discover all frequent ordered trees in S (with frequency no less than s) without duplicates 頻度しきい値50%での頻出木パターン頻度しきい値 (minimum support) s = 50%

2011/09/30 列挙学校（再放送），有村博紀，北大 Efficent Algorithms Depth-first mining algorithm for frequent ordered trees Freqt [Asai, Arimura et al., SIAM DM'02] TreeMiner [Zaki, KDD'02] Performance guarantee Polynomial time enumeration per pattern Small memory footprint Key technique: Rightmost expansion

半構造パターン発見アルゴリズムの設計のキーポイント
2011/09/30 列挙学校（再放送），有村博紀，北大半構造パターン発見アルゴリズムの設計のキーポイントパターンをどのように表現するか？候補パターン（部分構造）を，どうやって重複なしに列挙するか？各パターンの頻度を，どのようにして高速に計算するか？

超高速半構造パターン発見技術最右拡張技法 (Rightmost expansion) l Tree
2011/09/30 列挙学校（再放送），有村博紀，北大超高速半構造パターン発見技術最右拡張技法 (Rightmost expansion) 申請者等が2002年に開発 [SIAM DM'02] 世界初の多項式時間アルゴリズムを提案現在，さまざまな組合せ構造の効率よい列挙に用いられる[当該分野の国際会議論文の大半に引用] Tree l 1 k-1 k p Freqt [Asai, Arimura et al., SIAM DM'02] TreeMiner [Zaki, KDD'02] 78

２．頻度制約の導入（頻出木マイニング）頻度に関する単調性探索の枝刈りとバックトラック
2011/09/30 列挙学校（再放送），有村博紀，北大２．頻度制約の導入（頻出木マイニング）頻度に関する単調性「もしある順序木 T が頻出なら，その親S も頻出」探索の枝刈りとバックトラック頻出でなくなったら，その子孫の探索は枝刈りする．

Itemset mining revisited: BFS vs. DFS
2011/09/30 列挙学校（再放送），有村博紀，北大 Itemset mining revisited: BFS vs. DFS アプリオリ型アルゴリズム [1994] 幅優先探索 (BFS) 水平データレイアウト Apriori algorithm [1994] 「第一世代」外部記憶型アルゴリズム「第二世代」メモリ型省スペースアルゴリズムバックトラック型アルゴリズム[ ] 深さ優先探索 (DFS) 垂直データレイアウト MaxMiner, Eclat, FP-growth etc [ ]. How to implement DFS for ordered tree patterns?

研究成果：高速半構造マイニング高速な木パターン発見エンジン半構造データの特徴的部分構造の発見
理論と実装： FREQT, StreamT, Unot SIAM DM'02, PKDD'02, IEEE ICDM'02), DS'03 公開＆応用 FREQT: DEWS'02優秀論文賞（H14年6月） Unot: DEWS’04優秀論文賞（H16年6月） MaxMotif: Awarded DEWS2004（H17.6） StreamT: ’０３AI学会大会優秀賞（H15.6） 81

Theoretical performance guanrantee
2011/09/30 列挙学校（再放送），有村博紀，北大 How to measure the time complexity of mining algorithms? Exponentially many answers! As enumeration algorithms Output-polynomial (OUT-POLY) Total time = poly(N, M) Input size M Input polynomial-time enumeration, or amortized polynomial-delay (POLY-ENUM) Amotized delay is poly(Input), or Total time = M·poly(N) Algorithm Output size M Outputs Time Delay D polynomial-delay (POLY-DELAY) Maximum of delay is poly(Input) Total Time T + polynomial-space (POLY-SPACE) 82 82

This algorithm is scalable.
2011/09/30 列挙学校（再放送），有村博紀，北大 EXP: Scalability Dataset: citeseers Minimum support: s=3.0(%) fixed Increasing the data size from 0.3MB to 5.6MB. (178285, 1.39) Runtime (sec) This algorithm is scalable. First, we studied the scalability of the algorithm. This figure shows the runtime of the algorithm with the minimum support 3.0(%) on the dataset citeseers as the size is increased from 0.3 MB to 5.6MB. As the result, the running time scales almost linearly and our algorithm is scalable. page=150, filesize=5.12(MB), CPUtime=1550(sec) # of nodes 83

2011/09/30 列挙学校（再放送）パート２の３：頻出無順序木のマイニング～ Unot (Asai, Arimura, Uno, Nakano, DS2003) ～ (Nijssen & Kok, MGTS2003) Tatsuya Asai, Hiroki Arimura, Takeaki Uno, Shin-Ichi Nakano: Discovering Frequent Substructures in Large Unordered Trees. Discovery Science 2003: 47-61 S. Nijssen and J.N. Kok. Efficient discovery of frequent unordered trees. In: Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences (MGTS), 有村博紀，北大

Trees グラフマイニングの究極の目標 Development of Efficient tree mining algorithms
2011/09/30 列挙学校（再放送），有村博紀，北大グラフマイニングの究極の目標 Sets (itemsets) Simple Efficient algorithms Expressive patterns & Efficient algorithms Development of Efficient tree mining algorithms Trees Ordered trees. .. General Graphs… Unordered trees are more expressive. But, no efficeint algorithm was known before. Thus, the goal of this study is to develop an efficient tree mining algorithm for this class. 一般のグラフ Expressive Computationally Hard

高速な無順序木パターン発見エンジン UNOT (DS2004)
無順序木：　グラフの自明でない部分クラス無順序木マイニングアルゴリズム Unot [Asai, Arimura, DS'03] NK [Nijssen, Kok, MGTS’03] FREQT: DEWS'02優秀論文賞（H14年6月） Unot: DEWS’04優秀論文賞（H16年6月） 2011/09/30 列挙学校（再放送），有村博紀，北大 86

これらの木はすべて無順序木としては互いに同値！
2011/09/30 列挙学校（再放送），有村博紀，北大 0．無順序木マイニングの問題点問題点：兄弟の入れ替えによって同値な木が，指数的に多く存在する A T1 A T2 A T3 A T4 B B B B B B B B A B C C C C B B A A B A C C C C For example, the tree T1 is a leftmost-biased tree because its code is lexicographically maximum. これらの木はすべて無順序木としては互いに同値！

2011/09/30 列挙学校（再放送），有村博紀，北大１．無順序木の一意な表現アイディア：無順序木の正規系として，「深さラベル列」のさまざまな回転に関して，辞書式順序で最大の表現を採用する A T1 A T2 A T3 A T4 B B B B B B ＞ B B C B A B C C A B A C A B C C C C For example, the tree T1 is a leftmost-biased tree because its code is lexicographically maximum. (0A,1B,2A,3C,2B,1B,2C) (0A,1B,2B,2A,3C,1B,2C) (0A,1B,2C,1B,2A,3C,2B) (0A,1B,2C,1B,2B,2A,3C)

２．候補となる無順序木パターンの列挙「逆探索性」列挙最右拡張に関して，正規形無順序木の親は，やはり正規形無順序木
2011/09/30 列挙学校（再放送），有村博紀，北大２．候補となる無順序木パターンの列挙「逆探索性」最右拡張に関して，正規形無順序木の親は，やはり正規形無順序木列挙順序木の場合と同じように，最右拡張が使えるただし，最右でない子をチェックして生成しない最右拡張 SAME However, rightmost expansion possibly generate non-leftmost-biased tree. To generate leftmost-biased trees only, we use copy depth.

2011/09/30 列挙学校（再放送）パート２の３：一般のグラフの頻出マイニング～ AGM (Inokuchi, Washio, Motoda, PKDD2000; Machine Learning 50(3), 2003) ～ gSpan (Yan, Han, ICDM2002) Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda: An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. PKDD 2000: 13-23 Xifeng Yan, Jiawei Han: gSpan: Graph-Based Substructure Pattern Mining. ICDM 2002: 有村博紀，北大

どうやって一般のグラフをマイニングするか？
⊥ Jump! 基本戦略はパターン成長法小さなグラフに，頂点と辺を一つずつ追加する問題点グラフ同形性（graph isomorphism） 2011/09/30 列挙学校（再放送），有村博紀，北大

グラフマイニング Inokuchi, Washio, Motoda [PKDD’00]
AGM: “An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data” Kuramochi & Karypis [ICDM’01, ICDM’02] FSG: “Frequent Subgraph Discovery” Yan & Han [ICDM’02] gSpan: “gSpan: Graph-based substructure pattern mining” 深さ優先探索 DFS木表現 = rightmost expansion プラス同形性判定 H X C 2011/09/30 列挙学校（再放送），有村博紀，北大

AGM [Inokuchi, Washio, Motoda (PKDD’00)]
2011/09/30 列挙学校（再放送），有村博紀，北大 AGM [Inokuchi, Washio, Motoda (PKDD’00)] AGM = Apriori-based Graph Mining 隣接行列表現を用いて，Apriori風に頻出部分グラフを発見するアルゴリズムパターンの生成法隣接行列の正準形を効率よく計算 Cl C Cl C Cl H Cl C Cl C Cl H 隣接行列の正準形行と列を入れ替えて Codeが最小となるような隣接行列 Cl C H トリクロロエチレン Code =

グラフマイニングへの逆探索手法 gSpanアルゴリズム（Yan & Han, 2002）最速のグラフマイニング手法の一つ
2011/09/30 列挙学校（再放送），有村博紀，北大グラフマイニングへの逆探索手法 gSpanアルゴリズム（Yan & Han, 2002）最速のグラフマイニング手法の一つ最右拡張手法を使い，木を通して，グラフを列挙

さまざまな半構造データへの拡張極大パターン発見 LCMアルゴリズムを系列，木，グラフへ拡張 Sequence Motifs
2011/09/30 列挙学校（再放送），有村博紀，北大さまざまな半構造データへの拡張 H X C 極大パターン発見 LCMアルゴリズムを系列，木，グラフへ拡張 Sequence Motifs Item Sets Attribute trees

History of Sequence Mining
2011/09/30 列挙学校（再放送）有村博紀，北大 Statistical Sequence Mining Data Mining Area ～1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 HMM (Hidden Markov Model ]*1) [Haussler, Krogh et al., 26th HICSS 1993] [Durvin, Eddy, Krogh, Mitchison, Cambridge, 1998] MOTIF program [Smith, Annau et al., PNAS, 87,1990] PRATT program [Jonassen et al., Protein Sci. 4, 1995] TEIRESIAS program [Rigoutsos, Floratos, Bioinformatics 14, 1998] eMOTIF program [Nevill-Manning et al., PNAS 95, 1998] a PTAS algorithm for concensus motif problem in Hamming distance [Li, Ma, Wang, STOC 1999] PROJECTION (Random projection) [Buhler, Tompa, J. Comp. Bio. 9(2), 2002] Frequent Sequence Mining Mining frequent episodes in an event sequence (BFS) [Mannila, Toivonen, Verkamo, KDD1995] Mining frequent itemset sequence (BFS) Sequential Patterns [Agrawal, Srikant VLDB1995] Mining frequent itemset sequence (DFS) Spade [Zaki, 1998; Mach. Learn. 2000] Mining frequent sequences (DFS + Projected DB) PrefixSpan [Pei, Han, et al., ICDE 2001] Closed Sequence Mining Mining frequent closed frequent sequences (DFS) CloSpan [Yan, Han, et al., SDM 2003] Mining frequent closed sequences (DFS) BIDE [Wang & Han, ICDE 2004] Mining frequent closed sequeces with wildcards (DFS) MaxMotif [Arimura, Uno, LNCS 3827, ISAAC2005] Mining frequent closed sequeces with gaps (DFS) AU [Arimura, Uno, LNAI 4914, LLLL2007] There are studies on tree mining and graph mining. Bioinfomatics Area 飽和系列の多項式時間列挙アルゴリズム 96

Sequences: Maximal Motif Discovery
The problem of enumerating all maximal motifs in an input sequence without duplicates for the class of repeated motifs with wildcards [Parida et al. 2000, SIAM SODA] An input string s in S* An integer q  0 (quorum) q = 3 ABCABRRABRABCABABRABBC Task: Enumerate all maximal motifs without duplicates ABRAB Solutions AB BoABoAB e ABoAB B BoABoooooB BC BoAB BoooooB

ギャップ付きモチーフに対する極大パターン発見 [ISAAC 2005]
In real datasets, the number of maximal motifs is much smaller than that of motifs containing the complete information Succinct representation for all (frequent) motifs Arimura and Uno [ISAAC 2005; J. Comb. Optim., 2006]

エピソードマイニング [Katou, Arimura, Hirata, PAKDD'08,09, DS'09]
2011/09/30 列挙学校（再放送），有村博紀，北大エピソードマイニング [Katou, Arimura, Hirata, PAKDD'08,09, DS'09] 時系列データのマイニングデータに隠れたイベント間の(半)順序関係を発見する Mannila et al. (KDD'95) 高速なアルゴリズム医療データへの応用院内感染症の要因解析（Katou, Arimura, Hirata, CME2009）「症状 e が起きた後で，治療 a, b, c を行ったら，症状の改善 d が見られた」 a e b d c episode input sequence d c b 13 a 11 18 16 15 14 e 12 sliding window

2011/09/30 列挙学校（再放送）有村博紀，北大演習：「図形のようなグラフ」の列挙幾何グラフの列挙

2011/09/30 列挙学校（再放送），有村博紀，北大極大2次元幾何グラフのマイニング幾何変換（並進・回転・拡大）のもとで頻出で極大な点配置レイアウトを発見 (Arimura, Uno, Shimozono, DS'07) y A A 2.0 A g g g g g g A Graph Pattern g g 1.0 A A 1.0 2.0 3.0 x Goemetric Graph Database Hiroki Arimura, Takeaki Uno, Shinichi Shimozono: Time and Space Efficient Discovery of Maximal Geometric Graphs. Discovery Science 2007: 42-55 Sebastian Nowozin, Koji Tsuda: Frequent Subgraph Retrieval in Geometric Graph Databases. ICDM 2008:

Geometric Graph Pattern
2011/09/30 列挙学校（再放送），有村博紀，北大列挙アルゴリズム n点の幾何グラフGの表現開始点1を任意に選び，重心mの反時計回りに，すべての点を1, 2, ..., nと番号づける．点列，辺列，ラベル列の順に要素を並べて符号化．正規形：全ローテーションで辞書順最小の符号 Geometric Graph Pattern 1 単位長さ（最初の辺） 3 2 1 A m B 頂点を反時計回りに順序付け

Geometric Graph Pattern
2011/09/30 列挙学校（再放送），有村博紀，北大列挙アルゴリズム 1 単位長さ（最初の辺） n点の幾何グラフGの表現開始点1を任意に選び，重心mの反時計回りに，すべての点を1, 2, ..., nと番号づける．点列，辺列，ラベル列の順に要素を並べて符号化．正規形：全ローテーションで辞書順最小の符号 3 2 1 A m B 頂点を反時計回りに順序付け Geometric Graph Pattern 正規形はC1

列挙アルゴリズム「逆探索」を用いて，グラフと照合写像を同時に列挙．家系木の根は，単位長さの線分．
2011/09/30 列挙学校（再放送），有村博紀，北大列挙アルゴリズム「逆探索」を用いて，グラフと照合写像を同時に列挙．家系木の根は，単位長さの線分．子グラフＧの「親」は，正規形の符号で最後の点（最大の点）を除いたもの，のさらに正規形をとったもの．親から子をつくるには，ためしに新しい頂点をつけてみて，回転して正規形符号を求める．つけたものが符号の最後ならＯＫ．追加する頂点の候補は，照合写像の逆写像でデータ点から作る． A 2 2 1 A m 3 2 1 A m B ３ 1 A A m B ４親の計算

2011/09/30 列挙学校（再放送），有村博紀，北大極大2次元幾何グラフのマイニング幾何変換（並進・回転・拡大）のもとで頻出で極大な点配置レイアウトを発見 (Arimura, Uno, Shimozono, DS'07) 応用：空間データ，物体の配置図・設計図データタンパク質分子に拡張（3次元&誤差）(Nowozin,Tsuda, ICDM'08） y A A 2.0 A g g g g g g A Graph Pattern g g 1.0 A A 1.0 2.0 3.0 x Goemetric Graph Database Hiroki Arimura, Takeaki Uno, Shinichi Shimozono: Time and Space Efficient Discovery of Maximal Geometric Graphs. Discovery Science 2007: 42-55 Sebastian Nowozin, Koji Tsuda: Frequent Subgraph Retrieval in Geometric Graph Databases. ICDM 2008:

おわりに Applications and future directions

半構造データマイニングのシナリオ 107

応用：最適パターン発見すべての可能な部分構造の中で，統計的評価関数を最小化するパターンを発見するノイズに強い．性能保障が可能
さまざまな知識獲得問題に対応可能最近の機械学習アルゴリズムと組み合わせて，分類，予測，クラスタリングに応用 f(p) : 不均衡関数 p 50% 100% パターンが出現する正例の割合統計的評価関数分類誤差情報エントロピー Gini 指標ｓ 0% 良いパターン 8 positives 2 negatives 悪いパターン 9 positives negatives 108

応用：半構造データの自動分類機械学習の素性抽出に使う unknown function f : Graphs → {+1, -1}
１．入力グラフから素性として，多数の特徴的な部分構造を抽出２．機械学習手法を用いて，未知の分類関数を学習 f : Graphs → {+1, -1} H Fe N X C TCGCGAGGCTAGCT GCAGAGTAT TCGCGAGGCTAT TCGCGAGGT +1 -1

応用：半構造データの自動分類 1990年代に機械学習分野で，高性能予測の新しい理論が急速に進展決定木ブースティング SVM
最初の実用的予測アルゴリズムブースティング多数の機械学習アルゴリズムを統合して高精度予測オンライン予測の理論 SVM 現在の state-of-the-art methods 高次元空間と多様なデータへマージン最適化＋カーネル法鍵となる技術ブースティング学習（Boosting） SVMとカーネル法（サポートベクトル機械）ニューラルネットワークオンライン予測とゲーム理論計算学習理論研究者ブースティング, SVM：渡辺治・金森（東工大） SVM：津田宏治（AIST），鹿島久嗣（IBM TRL）オンライン予測：瀧本英二（東北大） 110

機械学習アルゴリズムへの応用ブースティング (Boosting) [Freund, Shapire 1996]
2011/09/30 列挙学校（再放送）有村博紀，北大ブースティング (Boosting) [Freund, Shapire 1996] 多数の機械学習アルゴリズムを統合して高精度予測オンライン予測の理論と深い関連 SVM (Support Vector Machines) [Vapnik 1996] マージン最大化による現在の state-of-the-art methods カーネル法を用いた高次元空間と多様なデータへの拡張これらの学習アルゴリズムと，重みつきアイテム集合マイニングを組み合わせて利用できるキーワード: Boosting, SVM, オンライン予測, ニューラルネット, 計算学習理論国際会議: NIPS, ICML, COLT, ALT V. Vapnik, Statistical Learning Theory, Wiley, (textbook) N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge, (textbook) Y. Freund and R. E. Schapire, A decisiontheoretic generalization of on-line learning and an application to boosting, JCSS, 55, , (AdaBoost) 金森, 畑埜, 渡辺,ブースティング: 学習アルゴリズムの設計技法, 森北出版 (text book) 111 111

「集中」「分散」これからのデータマイニングさまざまなデバイス多様な人間活動と応用多様で非均一な時空間不完全で複雑なデータと情報
膨大なデータ集積多数の独立したＣＰＵネットワークによる結合多くの計算リクエスト 2011/09/30 列挙学校（再放送）

まとめ：半構造データマイニングデータマイニング大量のデータから有用なパターンや規則を半自動的にとりだす方法の研究 1990年代半ばに顕在化．定型データを対象とする半構造データ多様な構造をもつ膨大なデータ従来のデータマイニング手法は，半構造データには直接適用不可能新しい半構造データマイニング技術が必要高速かつ頑健なマイニング技術が鍵 3. 新たな知識の構築知識 Knowledge H X C 半構造データ Semi-structured Data 半構造パターン発見 Pattern Discovery 2. 知識をパターンや規則として発見 JSTさきがけ研究「情報と知」(H11-H13) 文科省科研費特定「発見科学」(H10-H12), 「情報学」(H13-H17), 基盤研究B(H12-H14, H15-H17)等文科省科研費特別推進研究（平成１７年～１９年）「知識基盤形成のための大規模半構造データからの超高速パターン発見」知識を有機的につなげる

今回の講義（まとめ）４コマ目の目的：データマイニングを題材に列挙はどんなことに使えそうか？
2011/09/30 列挙学校（再放送）有村博紀，北大４コマ目の目的：データマイニングを題材に列挙はどんなことに使えそうか？データマイニング(DM)と機械学習における列挙に関連した話題を提供 DMの歴史をすこしパート1: データマイニングと頻出集合発見パート２: 構造データのマイニング（木とグラフ）

北海道大学大学院情報科学研究科有村博紀（拠点リーダー）渡邉日出海（バイオ）末岡和久（ナノ）宮永喜一（メディア）
平成19年度グローバルCOEプログラム知の創出を支える次世代IT基盤拠点北海道大学大学院情報科学研究科有村博紀（拠点リーダー）渡邉日出海（バイオ）末岡和久（ナノ）宮永喜一（メディア）知識発見バイオ知識メディア量子ナノエレクトロニクス知的通信

Interdesciplinary Collaboration: Biology and IT
ROV Project (Bio + Media + Info + Nano) ROV (Remote Operating Marine Vehicle) that can operate under 2000m deep sea New Spiecies Discovery Project Knowledge Discovery Technology Technology for Large-scale discovery of new spiecies and construction of databases Image processing and machine learning 2000mの深海を探索する知的リモート操作海中ロボット(ROV)の開発大規模新種発見とデータベース構築のための知的データ処理技術 Goal ■ 生物多様性と新種探索プロジェクト 116

列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム

Similar presentations

Presentation on theme: "列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

列挙学校：第４コマ パターンマイニングにおける 列挙アルゴリズム

Similar presentations

Presentation on theme: "列挙学校：第４コマ パターンマイニングにおける 列挙アルゴリズム"— Presentation transcript:

Similar presentations

About project

フィードバック

列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム

Presentation on theme: "列挙学校：第４コマパターンマイニングにおける列挙アルゴリズム"— Presentation transcript: