開発保守支援を目指した コードクローンの抽出手法 - 支援環境Geminiへの実装 - 植田 泰士 井上研究室
背景 (1/2) ソースコード中に類似したコード片があるとき、それらをコードクローンという
背景 (2/2) コードクローンはソフトウェア保守を困難にする コードクローンにバグが存在する場合,すべてのコードクローンについて修正を行わなければならない 機能追加も同様 [1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7):654-670, 2002. [2] Y. Ueda, T. Kamiya, S. Kusumoto and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”, Proc. Of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002. コードクローン検出ツールCCFinder[1]と,そのコードクローン分析支援環境Gemini[2]を開発 CCFinder トークン単位でのコードクローンを検出 入力はソースファイル, 出力(テキストベース)はクローンペアの位置情報 最小一致トークン数 Gemini GUIベースの分析環境 コードクローン検出ツールとしてCCFinderを使用
CCFinder/Gemini (1/5) CCFinderが検出するコードクローン Exact クローン If (a > b) { b++; a=1; } ‘コピーとペースト’による再利用 Exactクローン 識別子名の変更 If (i > j) { j++; i=0; } Parameterized クローン CCFinder/Gemini (1/5) CCFinderが検出するコードクローン Exact クローン Parameterized クローン
CCFinder/Gemini (2/5) 最小一致トークン数 CCFinderは最小一致トークン数より 短いコードクローンは検出しない 短いコードクローンは,非常に多く存在する (例)14万行: 30トークン以上 約20,000クローンペア 10トークン以上 約4,000,000 クローンペア 1. static void foo() throws RESyntaxException 2. { 3. String a[] = new String [] {"123,400", "abc"}; 4. org.apache.regexp.RE pat = 5. new org.apache.regexp.RE("[0-9,]+"); 6. int sum = 0; 7. for (int i = 0; i < a.length; ++i) 8. { 9. if (pat.match(a[i])){ 10. sum += Sample.parseNumber(pat.getParen(0));} 11. } 12. System.out.println("sum = " + sum); 13. } 1. static void goo(String [] a) throws RESyntaxException 2. { 3. RE exp = new RE(“[0-9,]+”); 4. int sum = 0; 5. int i = 0; 6. while (i < a.length) 7. { 8. if (exp.match(a[i])) 9. sum += parseNumber(exp.getParen(0)); 10. i++; 11. } 12. System.out.println("sum = " + sum); 13. } 14 トークン 最小一致トークン数を20トークンに定めると... 最小一致トークン数を10トークンに定めると... 27 トークン 13 トークン
CCFinder/Gemini (3/5) Gemini概要 a b c a b c a d e c a b c a b c a d e c : 一致箇所 Gemini概要 GUIベースの分析環境 コードクローン検出ツールとしてCCFinderを使用 インタラクティブな分析環境を提供 クローン散布図ビュー マウス操作によるクローン選択 メトリクスグラフビュー メトリクス値によるクローン選択 ソースコードビュー
CCFinder/Gemini (4/5) これまでの適用 フリーソフトウェア 商用ソフトウェア 学生演習プログラム ソフトウェア著作権訴訟 JDK libraries (Java, 570 KLOC) Linux, FreeBSD (C, 1.6 + 1.3 MLOC) FreeBSD, OpenBSD,NetBSD (C) Qt (C++,240KLOC) 商用ソフトウェア NTTデータ, 日立, 日立GP, NASDA, NECソフト, ASTEC, SRA, 大和コンピュータ, … 学生演習プログラム ソフトウェア著作権訴訟 21世紀COEプロジェクト ソフトウェア工学工房
Ngクローン (Non-gapped clone) If (a > b) { b++; a=1; } ‘コピーとペースト’による再利用 Exactクローン 識別子名の変更 If (i > j) { j++; i=0; } Parameterized クローン If (i > j) { i = i / 2; j++; i=0; } 挿入 If (i > j) { i=0; } 削除 If (i > j) { j = j + 1; i=0; } 修正 CCFinder/Gemini (5/5) 実際のコピーとペーストによる再利用が行われる場合,部分的な変更が加えられることが多い If (a > b) { b++; a=1; } If (i > j) { i = i / 2; j++; i=0; } If (i > j) { i=0; } If (i > j) { j = j + 1; i=0; } ギャップ Gappedクローン Ngクローン (Non-gapped clone)
研究の目的 Gappedクローン検出手法を提案する 適用実験を行い本手法の有効性を確認する
Gappedクローン検出 - 概要 (1/2) 前提 Ngクローンの組合せ爆発 個々のGappedクローン発見問題をNgクローン(Exactクローン,Parameterizedクローン)の組合せ決定問題としてとらえる 15 Ngクローンの組合せ爆発 複数のNgクローンが互いに重なり合い密集していた場合, あるひとつのNgクローンから次のNgクローンへの連結先が多数存在するため,組合せ爆発が起こる 高い計算コスト 105
Gappedクローン検出 - 概要 (2/2) アプローチ マン・マシン共同作業 全NgクローンをNgクローン連結集合に分割 クローン散布図を用いて連結集合を視覚化 → ユーザはどのあたりにGappedクローンが 存在するのか識別できる 連結集合内に含まれるGappedクローンをインタラクティブに参照
Gappedクローン検出 - 検出手順 入力例 ファイルXのコード列: “ABCDCDEFBCDG” A B C E F B C D E B C D ファイル Y ファイル X 入力例 ファイルXのコード列: “ABCDCDEFBCDG” ファイルYのコード列: “ABCEFBCDEBCD” “A”, “B”, “C” … はソースコード中の任意粒度の要素 Ngクローン Ngクローン位置検出 ギャップ位置検出 視覚化 ギャップ Gap-and-clone 散布図 ソースコード分析 ソースファイル Ngクローンと ギャップの結合情報
Gappedクローン検出 - 検出手順 ギャップ 上限ギャップ長 ソースファイル Ngクローン位置検出 Ngクローン ギャップ位置検出 視覚化 ギャップ Gap-and-clone 散布図 ソースコード分析 ソースファイル Ngクローンと ギャップの結合情報 Source files 上限ギャップ長 ギャップ
Gappedクローン検出 - 検出手順 ファイル Y ファイル Y ファイル X ファイル X B C D E F G A B C E F B C D E B C D ファイル Y ファイル X A B C D E F G A B C E F B C D E B C D ファイル Y ファイル X Ngクローン Ngクローン位置検出 ギャップ位置検出 視覚化 ギャップ Gap-and-clone 散布図 ソースコード分析 ソースファイル Ngクローンと ギャップの結合情報
Gappedクローン検出 - 検出手順 ファイル Y ファイル X A B C E F B C D E B C D A B C D E F Ngクローン Ngクローン位置検出 ギャップ位置検出 視覚化 ギャップ Gap-and-clone 散布図 ソースコード分析 ソースファイル Ngクローンと ギャップの結合情報
Gappedクローン検出 - 実装 支援環境Geminiへ実装 Ngクローン 連結集合
適用実験概要 適用対象 分析内容 大阪大学のプログラミング演習で作成されたプログラム どのようなGappedクローンがみつかるのか C言語で記述されたコンパイラ 3つの課題で構成されている 課題1: 構文解析器の作成, 課題2: 意味解析器の作成, 課題3: コンパイラの作成 分析内容 どのようなGappedクローンがみつかるのか
分析 課題1内 課題2内 課題3内 課題3内 課題2内 課題1内 ある学生の3課題における関数“sentence()”の間で比較 void sentence() { if ((tok_name == SIDENTIFIER)|| (tok_name == SREADLN) || (tok_name == SWRITELN) || (tok_name == SBEGIN)) basic_sen(); else if (tok_name == SIF) scan(); if (expression() != TBOOLEAN) error(4); if (tok_name != STHEN) syntax_error(); multi_sentence(); if (tok_name == SELSE) } else if (tok_name == SWHILE) if (tok_name != SDO) syntax_error(); sentence(); else syntax_error(); 課題2内 int llt,llf,lp,lpf; llt=lt; llf=lf; lp=p; lpf=pf; if ((tok_name == SIDENTIFIER) || (tok_name == SREADLN) || (tok_name == SWRITELN) || fprintf(outfile,"\tPOP\tGR2\t;%d\n",tok_line); fprintf(outfile,"\tCPA\tGR2,TRUE\n",sub); fprintf(outfile,"\tJNZ\tLF%d\n\n",llf); lf++;lt++; fprintf(outfile,"\tJMP\tLT%d\n",llt); fprintf(outfile,"LF%d\n\n",llf); fprintf(outfile,"LT%d\n",llt); fprintf(outfile,"LOOP%d\n",lp); p++; fprintf(outfile,"\tJNZ\tLOOF%d\n\n",lpf); pf++; fprintf(outfile,"\tJMP\tLOOP%d\n",lp); fprintf(outfile,"LOOF%d\n\n",lpf); 課題3内 A in Ex.3 課題1内 課題2内 課題3内 課題3内 課題2内 課題1内 40 トークン 45 トークン 27 トークン 50 トークン A Ngクローン最小一致トークン数:20 トークン 分析 B1 B2 B3 B4 Ngクローン最小一致トークン数:10 トークン ギャップ上限長: 10 トークン 連結集合最小長: 20 トークン ある学生の3課題における関数“sentence()”の間で比較 18 トークン 14 トークン 12 トークン B
まとめと今後の課題 開発保守支援を目指したコードクローンの抽出手法の提案と実装を行った 適用実験により本手法の有効性を確認した 不一致部分を含んだコードクローンの検出 適用実験により本手法の有効性を確認した 個々では短くて検出されないような短い いくつかのNgクローンで構成された Gappedクローンを見つけることができた 今後の課題として,実際のソフトウェア開発保守作業への適用,大規模なソフトウェアへの適用が考えられる
Web page of CCFinder/Gemini is available at http://sel.ist.osaka-u.ac.jp/cdtools/index.html.en
既存のコードクローン検出手法(1/2) Bakerの手法 Baxterらの手法 行単位でソースコードを比較してコードクローンを検出する C言語のソースファイルを入力,構文解析して,解析木(の部分木)を比較する Baker プログラムテキストを行の並びとして表現した上でマッチングをとる. SuffixTreeを用いているのでO(n). C言語. [Baker1995] B. S. Baker: “On finding Duplication and Near-Duplication in Large Software System,” Proc. Second IEEE Working Conf. Reverse Eng., pp. 86-95, Tronto, Canada (Jul., 1995). [Baxter1998] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier: “Clone Detection Using Abstract Syntax Trees,” Proc. of ICSM ’98, pp. 368-377, Bethesda, Maryland (Nov., 1998).
既存のコードクローン検出手法(2/2) Merloらの手法 手続き(メソッドや関数)の特徴メトリクスを比較して,計測値が似ていればコードクローンであると判定する [Balazinska1999] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K.A. Kontogiannis, "Measuring Clone Based Reengineering Opportunities", Proc. 6th IEEE Int'l Symposium on Software Metrics (METRICS '99), pp. 292-303, Boca Raton, Florida, Nov. 1999.
Differences between our method and homology analysis in genome informatics Alignment analysis Dynamic programming O(mn) (m, n : length of sequences) The optimal alignment is not our interest. Homology search BLAST, FASTA We have no query sequence for search and want to detect all gapped clones.
The difference between ‘diff’ and clone detection tools Diff finds the longest common sub-string. Given a code portion, diff does not report two or more same code portions (clones). Clone detection tool finds all the same or similar code portions.
Computation cost of our method Non-gapped clone detection (in CCFinder): O(n + m) n: length of source code m: number of non-gapped clones Gap identification: O(m) Identification of gaps combined with each non-gapped clones : O(1) Total: O(n+m) On the other hand, Our method needs only lexical analysis in non-gapped clone detection of CCFinder. and doesn’t need syntax analysis.
クローンが生じる理由 コピーとペースト 定型処理 コーディングスタイル 意図的な繰り返し プログラミング言語に適切な機能がないため コード生成ツール 偶然
Related work Baxter et al.[3] Extract clone pairs of statements, declarations, or sequences of them from C source files. Parse source code to build an abstract syntax tree (AST) and compare its sub-trees by characterization metrics (hash functions). Its computation complexity is O(n), where n is the number of the sub-tree of the source files. The hash function enables one to do parameterized matching, to detect gapped clones, and to identify clones of code portions in which some statements are reordered. As a related work, there is the method proposed by Dr. Baxter and others. They parse source code to build an AST and compare its sub-trees by hash functions. And, the hash function enables the method to do parameterized matching, to detect gapped clones and so on. However, it is generally expensive method since it requires full syntax analysis. [3] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone Detection Using Abstract Syntax Trees,” Proc. of ICSM ’98, pp. 368-377, Bethesda, Maryland, 1998.
Snapshots of clone class metric graph RAD LEN POP DFL I show the snapshots of Gemini again. The right side of this figure is metric graph. Each vertical axis means a single metric value, and one polygonal line is drawn per each clone class. By mouse dragging on this pale purple area, a user can set the warning range about each metric like this. Only the clone classes whose metric values are in the warning ranges are selected. Then he or she can refer to the actual code fragments with highlight corresponding to the selected clone class. Filtering mode : ON
Clone class metrics LEN (C ): Length of token sequence of each element in clone class C POP (C ): Number of elements in clone class C DFL (C ): Estimation of how many tokens would be removed from source files when all code fragments of clone class C are replaced with caller statements of a new identical routine RAD (C ): Distribution in the file system of elements in clone class C new sub routine caller statements In turn, I explain four metrics with clone class which are used in metric graph of Gemini. First one is LEN(C). This is a length of token sequence of each element in a clone class. Second one is POP(C). This is a number of elements in a clone class. Third one is DFL(C). This means an estimation of how many tokens would be removed from source files when all code fragments of clone class C are replaced with caller statements of a new identical routine as this figure. This value can be calculated from the values of LEN and POP. Fourth metric is RAD. This means the distribution in the file system of elements in clone class.
Definitions of DFL and RAD DFL(C ) DFL(C) = LEN(C) ×POP(C) - 5×POP(C) + LEN(C) LEN(C) ×POP(C) : the target code size for restructuring 5×POP(C) : the code size of new caller statements LEN(C) : the code size of new identical routine RAD (C ) Distribution in the file system of elements in clone class C RAD(C) = 0 : C is enclosed within a single file. RAD(C) = 1 : C is enclosed within a single directory. RAD(C) = n : C is enclosed within a directory tree of n layers.
Analysis using clone class metrics Example of analysis issue Finding clones that are appropriate for refactoring. Clones having high DFL Clones having high POP and low RAD It may be easy and meaningful to merge clones into one routine because of their density. Finding portions that are not reliable. Clones having high LEN Modules having larger code clones are less maintainable than modules having smaller code clones [4]. [4] Akito Monden, Daikai Nakae, Toshihiro Kamiya, Shin-ichi Sato, Ken-ichi Matsumoto, “Software Quality Analysis by Code Clones in Industrial Legacy Software”, Proc. Of the 8th IEEE International Symposium on Software Metrics, 87-96, 2002.
Suffix-tree Suffix tree is a tree that satisfies the following conditions. A leaf node represents the starting position of sub-string. A path from root node to a leaf node represents a sub-string. First characters of labels of all the edges from one node are different from each other. → A common path means a clone Suffix tree is a tree that satisfies three conditions. Condition 1 is that a leaf node represents the starting position of sub-string. Condition 2 is that a path from root node to a leaf node represents a sub-string. Condition 3 is that first characters of labels of all the edges from one node are different from each other. Therefore, a common path means a clone pair. For example, This figure is a suffix-tree for this string “xxyxyz “. Next this path from root node to this leaf node represents the string “xyxyz “, and this sub-string starting from number 2 is also “xyxyz “. So, this leaf has the number 2. The path from root node to this leaf node represents the string “xyz “, and this sub-string starting from number 4 is also “xyz “. The common path between them is “xy “. Then this sub-string “xy “ is detected as a code clone. Also in this scatter plot, that “xy “ is shown here.
Example of transformation rules in Java All identifiers defined by user are transformed to same tokens. Unique identifier is inserted at each end of the top-level definitions and declarations. Prevents detecting clones that begin at the middle of class definition and end at the middle of another one. ”java. lang. Math. PI” is transformed to ”Math. PI”. By using import sentence, a class is referred to with either full package name or a shorter name ” new int[] {1, 2, 3} ” is transformed to ” new int[] {$} ” Eliminates table initialization code. For example, all identifiers defined by user are translated to same tokens. This transformation absorbs differences in names of variables. In another rule, unique identifiers is inserted at each end of the top-level definitions and declarations. This rule prevents detecting clones that begin at the middle of class definition and end at the middle of another one. And, these practical devices are also given to the rules.
The output of CCFinder Output of CCFinder #version: ccfinder 3.1 #langspec: JAVA #option: -b 30,1 #option: -k + #option: -r abcdfikmnprsv #option: -c wfg #begin{file description} 0.0 52 C:\Gemini.java 0.1 94 C:\GeneralManager.java : #end{file description} #begin{clone} 0.1 53,9 63,13 1.10 542,9 553,13 35 0.1 53,9 63,13 1.10 624,9 633,13 35 0.2 124,9 152,31 0.2 154,9 216,51 42 : #end{clone} Output of CCFinder Object file ID ( file 0 in Group 0 ) Location of a clone pair ( Lines 53 - 63 in file 0.1 and Lines 542 - 553 in file 1.10 are identical or similar to each other) However, the output of CCFinder is difficult to understand intuitively. This figure shows an example of the actual output. In this part, a file ID number is described, and in this part, a location of a clone pair is described. As you can see, it is difficult to analyze source code by only this text-based information of the location of clone pairs. The analysis of a large number of clone pairs is especially difficult. It is difficult to analyze source code by only this text-based information of the location of clone pairs.
Gapped code clone detection - Algorithm (1/5) Source files Step1: Non-gapped clone detection Detect non-gapped clones from input source files. Set the minimum length of clone (threshold1). Sort the list of the detected non-gapped clones for effective identification of gap locations in Step2. Non-gapped clone detection Non-gapped clones Gap identification Make a clone pair which appears previously in the file appear previously also in the sorted list. When the detected result is one of comparison among three or more files, a set of non-gapped clones can be divided into subsets defined by the combination of two files. Non-gapped clone ID Pos. in file X (ABCDCDEFBCDG) Pos. in file Y (ABCEFBCDEBCD) Matched subsequence c1 1 – 3 “ABC” c2 2 – 4 6 – 8 “BCD” c3 10 – 12 c4 5 – 5 3 – 3 “C” c5 5 – 6 11 – 12 “CD” c6 7 – 9 “CDE” c7 7 – 11 4 – 8 “EFBCD” c8 9 – 10 2 - 3 “BC” c9 9 – 11 10 - 12 Gaps Correspondences In first step, non-gapped clones are detected from input source files using a presupposed non-gapped clone detection tool. In this example, we consider the comparison only between file X and Y. And this table shows the detected result from the sample input. This table contains all non-gapped clones regardless of their length. But practical clone detection tools often do not detect short clones whose length is shorter than a threshold because almost of too short clones are coincidental ones. We call this threshold minimum length of clone or threshold1. If we set it 2, this clone c4 is not detected. Next, for effective identification of gap locations in Step2, we sort the list of the detected non-gapped clones. When sorting, we make a clone pair which appears previously in the file appear previously also in the sorted list. This table has been already sorted. Although this table shows only the result of comparison between file X and Y, when the detected result is one of comparison among three or more files, a set of non-gapped clones can be divided into subsets defined by the combination of two files because each code portion of clones is enclosed within a single file. Visualization Gap-and-clone scatter plot Source code investigation
Gapped code clone detection - Algorithm (2/5) Step2: Gap identification Generate gap locations from sorted list of non-gapped clones. Gap location is a kind of the combination of the two non-gapped clones. (c1, c6) = ((1-3, 1-3), (5-6, 7-9)) g1= (4-4, 4-6) The length of each gap is the length of longer unmatched subsequence. Set the upper limit of the length of each gap (threshold2). Source files Non-gapped clone detection Non-gapped clones Gap identification Gaps Correspondences Use the facts for optimizations non-gapped clones are stored as the sorted result. The number of gaps connected from each non-gapped clone can be considered up to a certain constant. The overall time complexity of Step2 is O(n) (n:number of non-gapped clones) Gap ID Pos. in file X (ABCDCDEFBCDG) Pos. in file Y (ABCEFBCDEBCD) Length in longer g1 4 – 4 4 – 6 3 g2 4 – 10 7 g3 – g4 4 – 8 4 – 9 6 g5 9 – 10 2 g6 5 – 8 9 4 g7 8 – 8 1 Next in second step, gap locations are identified from sorted list of non-gapped clones. Gap location is a kind of the combination of the two non-gapped clones. For example, a gap location defined by a combination of c1 and c6 is generated by the tail positions of c1 and head positions if c6 like this. This table shows all of the gaps from the sample input. And we define the length of each gap as the length of longer unmatched subsequence. Here we introduce the upper limit of the length of each gap. We call it threshold2. (If we set it to infinity, whole of each source file may become gapped clone.) In this example, we set the threshold2 3. Then these three gaps are not identified. In this identification, by using the fact that non-gapped clones are stored as the sorted result and the number of gaps connected from each non-gapped clone can be considered up to a certain constant, the overall time complexity of Step2 is linear of the number of non-gapped clones. Visualization Gap-and-clone scatter plot Source code investigation
Gapped code clone detection - Algorithm (3/5) Source files Step3-1: Visualization – gap-and-clone scatter plot Draw gaps on the scatter plot of non-gapped clone to visualize gapped clones in a pseudo way. Non-gapped clone detection 1 2 3 4 5 6 7 8 9 10 11 12 A B C D E F G A B C E F B C D E B C D File Y File X 1 2 3 4 5 6 7 8 9 10 11 12 Non-gapped clones c1 c2 c3 c7 c6 c9 c5 c8 Gap identification Gaps Correspondences g1 g3 g7 g5 Next in third step, we draw gaps on the scatter plot of non-gapped clone. to visualize gapped clones in a pseudo way. In this figure, file X is arranged on the vertical axis and file Y is also arranged on the horizontal axis. Then these diagonal line segments represents non-gapped clones between file X and Y. If we draw gap locations on this plot, It becomes like this. We call this scatter plot “gap-and-clone scatter plot”. As you can see, just looking this plot. these three gapped clones can be identified by a user. Visualization Gap-and-clone scatter plot Gapped clone ID Path Subsequence in file X (ABCDCDEFBCDG) Subsequence in file Y (ABCEFBCDEBCD) gc1 c1 g1 c5 g7 c7 “ABC-CDE--CD” “ABC---CDEBCD” gc2 c1 g3 c6 “ABC---EFBCD” “ABCEFBCD” gc3 c2 g5 c4 “BCDCD” Source code investigation
Gapped code clone detection - Algorithm (4/5) 1 2 3 4 5 6 7 8 9 10 11 12 A B C D E F G A B C E F B C D E B C D File Y File X 1 2 3 4 5 6 7 8 9 10 11 12 g1 g3 g7 c1 c7 c6 c9 g5 c2 c3 c5 c8 Non-gapped clone detection Gap identification Non-gapped clones Visualization Gaps Gap-and-clone scatter plot Source files Correspondences Source code investigation Step3-2: Visualization – filtering Remove non-gapped clones and gaps that do not contribute to make a long gapped clone. Introduce the length of each entanglement (“eSize”) of non-gapped clones and gaps. eSize = max (eSizeX, eSizeY) eSizeX = eEndX – eStartX eSizeY = eEndY – eStartY “eSize” means the maximum length of gapped clone included in the entanglement. Set the minimum “eSize” for display (threshold3). Although this step is optional, the visual effects of gap-and-clone scatter plot can be improved sharply. Here we introduce the length of each entanglement of non-gapped clones and gaps. The length is defined like these expressions. That is, it is the maximum length of gapped clone included in the entanglement. Using this length, we remove non-gapped clones and gaps that do not contribute to make a long gapped clone. So, we define its minimum length for display. We call it threshold3. If we set it to 8 or so, this gap-and-clone scatter plot becomes like this. As you can see, user can easily identify the locations of these long gapped clones.
Gapped code clone detection - Algorithm (5/5) Source files Step4: Source code investigation Investigate source files with gap-and-clone scatter plot. Change parameters. Threshold1: Minimum size of non-gapped clones in non-gapped clone detection Threshold2: Maximum size of gaps in identification of gap locations. Threshold3: Minimum size of entanglement of non-gapped clones and gaps in gap-and-clone scatter plot. Theshold1 and threshold2 greatly affect computation time. Small threshold1 makes O(m2) non-gapped clone pairs detected from size-m source code. Large threshold2 makes O(n2) gaps detected from n clone pairs. Non-gapped clone detection Non-gapped clones Gap identification Gaps Correspondences In final step, user investigates source code using the visualized result. However in previous steps it might be difficult to decide the values of thresholds. In such case, three thresholds should be tuned for the use repeatedly. Threshold1 and threshold2 greatly affect computation time. For example, small threshold1 makes the number of non-gapped clones squares of the size of source code. Also, large threshold2 makes the number of gaps squares of the number of non-gapped clones. Visualization Gap-and-clone scatter plot Source code investigation
Analysis - Usefulness of gap-and-clone scatter plot Ex.1 Ex.2 Ex.3 Threshold1 = 10 Threshold2 = 10 Threshold3 = 30 0 10 20 30 40 50 500 1000 1500 (Frequency of non-gapped clones) (Tokens) 0 10 20 30 40 50 500 1000 1500 (Frequency of non-gapped clones) (Tokens) Compared the scatter plots of non-gapped clones to the gap-and-clone scatter plot Three programs (Ex.1: 2267 tokens, Ex.2: 4394 tokens and Ex.3: 5738 tokens) of a student S are arranged on both of the vertical and horizontal axes. The grid represents boundary lines between sub-exercises. Shown up as long gapped clones Ex.1 Ex.2 Ex.3 Ex.1 Ex.2 Ex.3 Here, we compare the scatter plots of non-gapped clones to the gap-and-clone scatter plot. In these two figures, three programs of a student S are arranged on both of the vertical and horizontal axes. These grid lines represent boundary lines between sub-exercises. In the left, we plotted clones longer than 10 tokens. In the right, clones longer than 30 tokens. That is, fine-grained analysis may be possible in the left. But these enormous black dots means that there are so enormous non-gapped clones. This figure shows the frequencies of non-gapped clones per each size. You can see again how enormous the number of short non-gapped clones is. So it is difficult to perform fine-grained analysis using the left. On the other hand, the right is simpler than the left. So user can perform rapid analysis. But, it is naturally just rough-grained analysis. Next, these figures show the gap-and-clone scatter plot and the frequencies of non-gapped clones plotted in it. In this plot, only entanglements that are longer than 30 tokens are appeared. That is, non-gapped clones around here are shown up as long gapped clones and the frequencies are not so enormous like in the left. And they are not appeared in this scatter plot. Therefore gap-and-clone scatter plot can provide both of fine-grained analysis and rapid analysis. These observations are not accidental ones, similar results are observed in the programs of other students. Threshold1 = 10 Threshold1 = 30
The analysis of comparison among students (non-gapped clones only) The corresponding code A (2 students) Similar code fragments were from source code of sample compiler described in textbook. B (4 students) Many code fragments were similar even with respect to name of variables or comments. B A So we analyzed the actual source code corresponding to these crowded clone pairs through source code viewer. As for area A, there are 2 students, and their similar code fragments were not plagiarisms but from source code of sample compiler described in textbook. However, as for area B, there are 4 students, and many code fragments were similar even with respect to name of variables or comments. So, it can be said that the possibility that plagiarisms among them happened is high.
CCFinder/Gemini (1/4) コードクローン検出手順 ソースファイル 1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { 11. RE exp = new RE("[0-9,]+"); 12. int sum = 0; 13. for (int i = 0; i < a.length; ++i) 14. if (exp.match(a[i])) 15. sum += parseNumber(exp.getParen(0)); 16. System.out.println("sum = " + sum); 17. } 1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { 11. RE exp = new RE("[0-9,]+"); 12. int sum = 0; 13. for (int i = 0; i < a.length; ++i) 14. if (exp.match(a[i])) 15. sum += parseNumber(exp.getParen(0)); 16. System.out.println("sum = " + sum); 17. } 字句解析 トークン列 変換処理 変換後トークン列 検出処理 I explain the clone detection process using an example. This source code is a part of program written in Java language. There are two methods. As you can see, these two methods are very similar to each other. In this source code, these highlighted strings represent differences between the two. In first step, CCFinder performs lexical analysis. In this step, each line of source files is divided into tokens corresponding to a lexical rule of the programming language. In next step, a transformation of the tokens is performed. Here, tokens are added, removed or changed based on some transformation rules. The rules is aimed at regularization of identifiers and identification of structures. For example, these all identifiers defined by user are transformed to same tokens. In the next step, CCFinder detects clone pairs from all the sub-strings on this transformed token sequence. In this example, these two underlined portions are the longest identical sub-strings. So, CCFinder detects them as a clone pair. Finally these positions are mapped on the original source code and the mapped positions are output. If we refer to the positions on the actual source code, these portions are the clone pair. However, since the output of CCFinder is text-based one, it is difficult to understand it intuitively in maintenance. The analysis of a large number of clone pairs is especially difficult. 変換後トークン列上での コードクローン 整形処理 クローンペア位置情報
最小一致トークン数を20トークンに定めると... Gappedクローン検出の背景 (2/2) Gappedクローンを識別することが困難な理由 CCFinderが検出できるのはNgクローンのみ Gappedクローンは複数の相対的に短いNgクローンとして検出される 30 トークン以上のクローンペア (クローンペア数:1208) 10 トークン以上のクローンペア (クローンペア数:26984) CCFinderは最小一致トークン数より短いコードクローンは検出しない → 個々の一致箇所が短ければ検出されない 一般に, 最小一致トークン数が短く設定されると,膨大な数のコードクローンが検出される 1. static void foo() throws RESyntaxException 2. { 3. String a[] = new String [] {"123,400", "abc"}; 4. org.apache.regexp.RE pat = 5. new org.apache.regexp.RE("[0-9,]+"); 6. int sum = 0; 7. for (int i = 0; i < a.length; ++i) 8. { 9. if (pat.match(a[i])){ 10. sum += Sample.parseNumber(pat.getParen(0));} 11. } 12. System.out.println("sum = " + sum); 13. } 1. static void goo(String [] a) throws RESyntaxException 2. { 3. RE exp = new RE(“[0-9,]+”); 4. int sum = 0; 5. int i = 0; 6. while (i < a.length) 7. { 8. if (exp.match(a[i])) 9. sum += parseNumber(exp.getParen(0)); 10. i++; 11. } 12. System.out.println("sum = " + sum); 13. } 14 トークン 最小一致トークン数を20トークンに定めると... Now I explain why we want to detect such gapped clones. At first, almost of clones created by reusing the code with copying and pasting may become gapped clones because usually some modifications are needed to adapt the copied code to the pasting target portion. However, CCFinder can detect non-gapped clones. So, like in these two methods that are partly different to each other, a gapped clone is separately detected as several short non-gapped clones like these three. In that case, If each matched portion is too short, CCFinder does not identify it as a clone because the minimum length of clone to be detected must be set in CCFinder beforehand. If we set the minimum length to 20 tokens, these two clones are not identified because they are shorter than 20 tokens. On the other hand, if the minimum length is set to short one, too many clones would be detected because of the existence of coincidental simple statements such as substitution. These scatter plots are the result of comparison within a certain program. In the left, we plotted clones longer than 30 tokens. In the right, clones longer than 10 tokens. As you can see, too many clones are detected in the right. It is difficult to find gapped clones from such many clones. 27 トークン 13 トークン