開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -

開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -
植田泰士井上研究室

背景 (1/2) ソースコード中に類似したコード片があるとき、それらをコードクローンという

背景 (2/2) コードクローンはソフトウェア保守を困難にする
コードクローンにバグが存在する場合，すべてのコードクローンについて修正を行わなければならない機能追加も同様 [1] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multi-linguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7): , 2002. [2] Y. Ueda, T. Kamiya, S. Kusumoto and K. Inoue, “Gemini: Maintenance Support Environment Based on Code Clone Analysis”, Proc. Of the 8th IEEE International Symposium on Software Metrics, 67-76, 2002. コードクローン検出ツールCCFinder[1]と，そのコードクローン分析支援環境Gemini[2]を開発 CCFinder トークン単位でのコードクローンを検出入力はソースファイル，出力(テキストベース)はクローンペアの位置情報最小一致トークン数 Gemini GUIベースの分析環境コードクローン検出ツールとしてCCFinderを使用

CCFinder/Gemini (1/5) CCFinderが検出するコードクローン Exact クローン
If (a > b) { b++; a=1; } ‘コピーとペースト’による再利用 Exactクローン識別子名の変更 If (i > j) { j++; i=0; } Parameterized クローン CCFinder/Gemini (1/5) CCFinderが検出するコードクローン Exact クローン Parameterized クローン

CCFinder/Gemini (2/5) 最小一致トークン数 CCFinderは最小一致トークン数より短いコードクローンは検出しない
短いコードクローンは，非常に多く存在する (例)14万行: 30トークン以上約20,000クローンペア　 10トークン以上約4,000,000 クローンペア 1. static void foo() throws RESyntaxException 2. { 3. String a[] = new String [] {"123,400", "abc"}; 4. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 6. int sum = 0; 7. for (int i = 0; i < a.length; ++i) 8. { if (pat.match(a[i])){ sum += Sample.parseNumber(pat.getParen(0));} } System.out.println("sum = " + sum); 13. } 1. static void goo(String [] a) throws RESyntaxException 2. { 3. RE exp = new RE(“[0-9,]+”); int sum = 0; 5. int i = 0; 6. while (i < a.length) { if (exp.match(a[i])) sum += parseNumber(exp.getParen(0)); i++; } System.out.println("sum = " + sum); 13. } 14 トークン最小一致トークン数を20トークンに定めると．．．最小一致トークン数を10トークンに定めると．．． 27 トークン 13 トークン

CCFinder/Gemini (3/5) Gemini概要 a b c a b c a d e c a b c a b c a d e c
: 一致箇所 Gemini概要 GUIベースの分析環境コードクローン検出ツールとしてCCFinderを使用インタラクティブな分析環境を提供クローン散布図ビューマウス操作によるクローン選択メトリクスグラフビューメトリクス値によるクローン選択ソースコードビュー

CCFinder/Gemini (4/5) これまでの適用フリーソフトウェア商用ソフトウェア学生演習プログラムソフトウェア著作権訴訟
JDK libraries (Java, 570 KLOC) Linux, FreeBSD (C, MLOC) FreeBSD, OpenBSD，NetBSD (C) Qt (C++，240KLOC) 商用ソフトウェア NTTデータ，日立, 日立GP, NASDA, NECソフト, ASTEC, SRA, 大和コンピュータ, … 学生演習プログラムソフトウェア著作権訴訟 21世紀COEプロジェクトソフトウェア工学工房

Ngクローン (Non-gapped clone)
If (a > b) { b++; a=1; } ‘コピーとペースト’による再利用 Exactクローン識別子名の変更 If (i > j) { j++; i=0; } Parameterized クローン If (i > j) { i = i / 2; j++; i=0; } 挿入 If (i > j) { i=0; } 削除 If (i > j) { j = j + 1; i=0; } 修正 CCFinder/Gemini (5/5) 実際のコピーとペーストによる再利用が行われる場合，部分的な変更が加えられることが多い If (a > b) { b++; a=1; } If (i > j) { i = i / 2; j++; i=0; } If (i > j) { i=0; } If (i > j) { j = j + 1; i=0; } ギャップ Gappedクローン Ngクローン (Non-gapped clone)

研究の目的 Gappedクローン検出手法を提案する適用実験を行い本手法の有効性を確認する

Gappedクローン検出 - 概要 (1/2) 前提 Ngクローンの組合せ爆発
個々のGappedクローン発見問題をNgクローン(Exactクローン，Parameterizedクローン)の組合せ決定問題としてとらえる 15 Ngクローンの組合せ爆発複数のNgクローンが互いに重なり合い密集していた場合，あるひとつのNgクローンから次のNgクローンへの連結先が多数存在するため，組合せ爆発が起こる高い計算コスト 105

Gappedクローン検出 - 概要 (2/2) アプローチマン・マシン共同作業全NgクローンをNgクローン連結集合に分割
クローン散布図を用いて連結集合を視覚化 → ユーザはどのあたりにGappedクローンが存在するのか識別できる連結集合内に含まれるGappedクローンをインタラクティブに参照

Gappedクローン検出 - 検出手順入力例ファイルXのコード列: “ABCDCDEFBCDG”
A B C E F B C D E B C D ファイル Y ファイル X 入力例ファイルXのコード列: “ABCDCDEFBCDG” ファイルYのコード列: “ABCEFBCDEBCD” “A”, “B”, “C” … はソースコード中の任意粒度の要素 Ngクローン Ngクローン位置検出ギャップ位置検出視覚化ギャップ Gap-and-clone 散布図ソースコード分析ソースファイル Ngクローンとギャップの結合情報

Gappedクローン検出 - 検出手順ギャップ上限ギャップ長ソースファイル Ngクローン位置検出 Ngクローンギャップ位置検出
視覚化ギャップ Gap-and-clone 散布図ソースコード分析ソースファイル Ngクローンとギャップの結合情報 Source files 上限ギャップ長　ギャップ

Gappedクローン検出 - 検出手順ファイル Y ファイル Y ファイル X ファイル X
B C D E F G A B C E F B C D E B C D ファイル Y ファイル X A B C D E F G A B C E F B C D E B C D ファイル Y ファイル X Ngクローン Ngクローン位置検出ギャップ位置検出視覚化ギャップ Gap-and-clone 散布図ソースコード分析ソースファイル Ngクローンとギャップの結合情報

Gappedクローン検出 - 検出手順ファイル Y ファイル X A B C E F B C D E B C D A B C D E F
Ngクローン Ngクローン位置検出ギャップ位置検出視覚化ギャップ Gap-and-clone 散布図ソースコード分析ソースファイル Ngクローンとギャップの結合情報

Gappedクローン検出 - 実装支援環境Geminiへ実装
Ngクローン連結集合

適用実験概要適用対象分析内容大阪大学のプログラミング演習で作成されたプログラムどのようなGappedクローンがみつかるのか
C言語で記述されたコンパイラ 3つの課題で構成されている課題1: 構文解析器の作成，課題2: 意味解析器の作成，課題3: コンパイラの作成分析内容どのようなGappedクローンがみつかるのか

分析課題1内課題2内課題3内課題3内課題2内課題1内ある学生の3課題における関数“sentence()”の間で比較
void sentence() { if ((tok_name == SIDENTIFIER)|| (tok_name == SREADLN) || (tok_name == SWRITELN) || (tok_name == SBEGIN)) basic_sen(); else if (tok_name == SIF) scan(); if (expression() != TBOOLEAN) error(4); if (tok_name != STHEN) syntax_error(); multi_sentence(); if (tok_name == SELSE) } else if (tok_name == SWHILE) if (tok_name != SDO) syntax_error(); sentence(); else syntax_error(); 課題2内 int llt,llf,lp,lpf; llt=lt; llf=lf; lp=p; lpf=pf; if ((tok_name == SIDENTIFIER) || (tok_name == SREADLN) || (tok_name == SWRITELN) || fprintf(outfile,"\tPOP\tGR2\t;%d\n",tok_line); fprintf(outfile,"\tCPA\tGR2,TRUE\n",sub); fprintf(outfile,"\tJNZ\tLF%d\n\n",llf); lf++;lt++; fprintf(outfile,"\tJMP\tLT%d\n",llt); fprintf(outfile,"LF%d\n\n",llf); fprintf(outfile,"LT%d\n",llt); fprintf(outfile,"LOOP%d\n",lp); p++; fprintf(outfile,"\tJNZ\tLOOF%d\n\n",lpf); pf++; fprintf(outfile,"\tJMP\tLOOP%d\n",lp); fprintf(outfile,"LOOF%d\n\n",lpf); 課題3内 A in Ex.3 課題1内課題2内課題3内課題3内課題2内課題1内 40 トークン 45 トークン 27 トークン 50 トークン A Ngクローン最小一致トークン数：20 トークン分析 B1 B2 B3 B4  Ngクローン最小一致トークン数:10 トークンギャップ上限長: 10 トークン連結集合最小長: 20 トークンある学生の3課題における関数“sentence()”の間で比較 18 トークン 14 トークン 12 トークン B

まとめと今後の課題開発保守支援を目指したコードクローンの抽出手法の提案と実装を行った適用実験により本手法の有効性を確認した
不一致部分を含んだコードクローンの検出適用実験により本手法の有効性を確認した個々では短くて検出されないような短いいくつかのNgクローンで構成された Gappedクローンを見つけることができた今後の課題として，実際のソフトウェア開発保守作業への適用，大規模なソフトウェアへの適用が考えられる

Web page of CCFinder/Gemini is available at

既存のコードクローン検出手法(1/2) Bakerの手法 Baxterらの手法行単位でソースコードを比較してコードクローンを検出する
C言語のソースファイルを入力，構文解析して，解析木(の部分木)を比較する Baker プログラムテキストを行の並びとして表現した上でマッチングをとる． SuffixTreeを用いているのでO(n)． C言語． [Baker1995] B. S. Baker: “On finding Duplication and Near-Duplication in Large Software System,” Proc. Second IEEE Working Conf. Reverse Eng., pp , Tronto, Canada (Jul., 1995). [Baxter1998] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier: “Clone Detection Using Abstract Syntax Trees,” Proc. of ICSM ’98, pp , Bethesda, Maryland (Nov., 1998).

既存のコードクローン検出手法(2/2) Merloらの手法
手続き（メソッドや関数）の特徴メトリクスを比較して，計測値が似ていればコードクローンであると判定する [Balazinska1999] M. Balazinska, E. Merlo, M. Dagenais, B. Lague, and K.A. Kontogiannis, "Measuring Clone Based Reengineering Opportunities", Proc. 6th IEEE Int'l Symposium on Software Metrics (METRICS '99), pp , Boca Raton, Florida, Nov

Differences between our method and homology analysis in genome informatics
Alignment analysis Dynamic programming O(mn) (m, n : length of sequences) The optimal alignment is not our interest. Homology search BLAST, FASTA We have no query sequence for search and want to detect all gapped clones.

The difference between ‘diff’ and clone detection tools
Diff finds the longest common sub-string. Given a code portion, diff does not report two or more same code portions (clones). Clone detection tool finds all the same or similar code portions.

Computation cost of our method
Non-gapped clone detection (in CCFinder): O(n + m) n: length of source code m: number of non-gapped clones Gap identification: O(m) Identification of gaps combined with each non-gapped clones : O(1) Total: O(n+m) On the other hand, Our method needs only lexical analysis in non-gapped clone detection of CCFinder. and doesn’t need syntax analysis.

クローンが生じる理由コピーとペースト定型処理コーディングスタイル意図的な繰り返しプログラミング言語に適切な機能がないため
コード生成ツール偶然

Related work Baxter et al.[3]
Extract clone pairs of statements, declarations, or sequences of them from C source files. Parse source code to build an abstract syntax tree (AST) and compare its sub-trees by characterization metrics (hash functions). Its computation complexity is O(n), where n is the number of the sub-tree of the source files. The hash function enables one to do parameterized matching, to detect gapped clones, and to identify clones of code portions in which some statements are reordered. As a related work, there is the method proposed by Dr. Baxter and others. They parse source code to build an AST and compare its sub-trees by hash functions. And, the hash function enables the method to do parameterized matching, to detect gapped clones and so on. However, it is generally expensive method since it requires full syntax analysis. [3] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone Detection Using Abstract Syntax Trees,” Proc. of ICSM ’98, pp , Bethesda, Maryland, 1998.

Snapshots of clone class metric graph
RAD LEN POP DFL I show the snapshots of Gemini again. The right side of this figure is metric graph. Each vertical axis means a single metric value, and one polygonal line is drawn per each clone class. By mouse dragging on this pale purple area, a user can set the warning range about each metric like this. Only the clone classes whose metric values are in the warning ranges are selected. Then he or she can refer to the actual code fragments with highlight corresponding to the selected clone class. Filtering mode : ON

Clone class metrics LEN (C ): Length of token sequence of each element in clone class C POP (C ): Number of elements in clone class C DFL (C ): Estimation of how many tokens would be removed from source files when all code fragments of clone class C are replaced with caller statements of a new identical routine RAD (C ): Distribution in the file system of elements in clone class C new sub routine caller statements In turn, I explain four metrics with clone class which are used in metric graph of Gemini. First one is LEN(C). This is a length of token sequence of each element in a clone class. Second one is POP(C). This is a number of elements in a clone class. Third one is DFL(C). This means an estimation of how many tokens would be removed from source files when all code fragments of clone class C are replaced with caller statements of a new identical routine as this figure. This value can be calculated from the values of LEN and POP. Fourth metric is RAD. This means the distribution in the file system of elements in clone class.

Definitions of DFL and RAD
DFL(C ) DFL(C) = LEN(C) ×POP(C) - 5×POP(C) + LEN(C) LEN(C) ×POP(C) : the target code size for restructuring 5×POP(C) : the code size of new caller statements LEN(C) : the code size of new identical routine RAD (C ) Distribution in the file system of elements in clone class C RAD(C) = 0 : C is enclosed within a single file. RAD(C) = 1 : C is enclosed within a single directory. RAD(C) = n : C is enclosed within a directory tree of n layers.

Analysis using clone class metrics
Example of analysis issue Finding clones that are appropriate for refactoring. Clones having high DFL Clones having high POP and low RAD It may be easy and meaningful to merge clones into one routine because of their density. Finding portions that are not reliable. Clones having high LEN Modules having larger code clones are less maintainable than modules having smaller code clones [4]. [4] Akito Monden, Daikai Nakae, Toshihiro Kamiya, Shin-ichi Sato, Ken-ichi Matsumoto, “Software Quality Analysis by Code Clones in Industrial Legacy Software”, Proc. Of the 8th IEEE International Symposium on Software Metrics, 87-96, 2002.

Suffix-tree Suffix tree is a tree that satisfies the following conditions. A leaf node represents the starting position of sub-string. A path from root node to a leaf node represents a sub-string. First characters of labels of all the edges from one node are different from each other. → A common path means a clone Suffix tree is a tree that satisfies three conditions. Condition 1 is that a leaf node represents the starting position of sub-string. Condition 2 is that a path from root node to a leaf node represents a sub-string. Condition 3 is that first characters of labels of all the edges from one node are different from each other. Therefore, a common path means a clone pair. For example, This figure is a suffix-tree for this string “xxyxyz “. Next this path from root node to this leaf node represents the string “xyxyz “, and this sub-string starting from number 2 is also “xyxyz “. So, this leaf has the number 2. The path from root node to this leaf node represents the string “xyz “, and this sub-string starting from number 4 is also “xyz “. The common path between them is “xy “. Then this sub-string “xy “ is detected as a code clone. Also in this scatter plot, that “xy “ is shown here.

Example of transformation rules in Java
All identifiers defined by user are transformed to same tokens. Unique identifier is inserted at each end of the top-level definitions and declarations. Prevents detecting clones that begin at the middle of class definition and end at the middle of another one. ”java. lang. Math. PI” is transformed to ”Math. PI”. By using import sentence, a class is referred to with either full package name or a shorter name ” new int[] {1, 2, 3} ” is transformed to ” new int[] {$} ” Eliminates table initialization code. For example, all identifiers defined by user are translated to same tokens. This transformation absorbs differences in names of variables. In another rule, unique identifiers is inserted at each end of the top-level definitions and declarations. This rule prevents detecting clones that begin at the middle of class definition and end at the middle of another one. And, these practical devices are also given to the rules.

The output of CCFinder Output of CCFinder
#version: ccfinder 3.1 #langspec: JAVA #option: -b 30,1 #option: -k + #option: -r abcdfikmnprsv #option: -c wfg #begin{file description} C:\Gemini.java C:\GeneralManager.java : #end{file description} #begin{clone} 0.1 53,9 63, ,9 553,13 35 0.1 53,9 63, ,9 633,13 35 ,9 152, ,9 216,51 42 　　　　　　: #end{clone} Output of CCFinder Object file ID ( file 0 in Group 0 ) Location of a clone pair ( Lines in file 0.1 and Lines in file are identical or similar to each other) However, the output of CCFinder is difficult to understand intuitively. This figure shows an example of the actual output. In this part, a file ID number is described, and in this part, a location of a clone pair is described. As you can see, it is difficult to analyze source code by only this text-based information of the location of clone pairs. The analysis of a large number of clone pairs is especially difficult. It is difficult to analyze source code by only this text-based information of the location of clone pairs.

Gapped code clone detection - Algorithm (1/5)
Source files Step1: Non-gapped clone detection Detect non-gapped clones from input source files. Set the minimum length of clone (threshold1). Sort the list of the detected non-gapped clones for effective identification of gap locations in Step2. Non-gapped clone detection Non-gapped clones Gap identification Make a clone pair which appears previously in the file appear previously also in the sorted list. When the detected result is one of comparison among three or more files, a set of non-gapped clones can be divided into subsets defined by the combination of two files. Non-gapped clone ID Pos. in file X (ABCDCDEFBCDG) Pos. in file Y (ABCEFBCDEBCD) Matched subsequence c1 1 – 3 “ABC” c2 2 – 4 6 – 8 “BCD” c3 10 – 12 c4 5 – 5 3 – 3 “C” c5 5 – 6 11 – 12 “CD” c6 7 – 9 “CDE” c7 7 – 11 4 – 8 “EFBCD” c8 9 – 10 2 - 3 “BC” c9 9 – 11 Gaps Correspondences In first step, non-gapped clones are detected from input source files using a presupposed non-gapped clone detection tool. In this example, we consider the comparison only between file X and Y. And this table shows the detected result from the sample input. This table contains all non-gapped clones regardless of their length. But practical clone detection tools often do not detect short clones whose length is shorter than a threshold because almost of too short clones are coincidental ones. We call this threshold minimum length of clone or threshold1. If we set it 2, this clone c4 is not detected. Next, for effective identification of gap locations in Step2, we sort the list of the detected non-gapped clones. When sorting, we make a clone pair which appears previously in the file appear previously also in the sorted list. This table has been already sorted. Although this table shows only the result of comparison between file X and Y, when the detected result is one of comparison among three or more files, a set of non-gapped clones can be divided into subsets defined by the combination of two files because each code portion of clones is enclosed within a single file. Visualization Gap-and-clone scatter plot Source code investigation

Step2: Gap identification Generate gap locations from sorted list of non-gapped clones. Gap location is a kind of the combination of the two non-gapped clones. (c1, c6) = ((1-3, 1-3), (5-6, 7-9))  g1= (4-4, 4-6) The length of each gap is the length of longer unmatched subsequence. Set the upper limit of the length of each gap (threshold2). Source files Non-gapped clone detection Non-gapped clones Gap identification Gaps Correspondences Use the facts for optimizations non-gapped clones are stored as the sorted result. The number of gaps connected from each non-gapped clone can be considered up to a certain constant.  The overall time complexity of Step2 is O(n) (n:number of non-gapped clones) Gap ID Pos. in file X (ABCDCDEFBCDG) Pos. in file Y (ABCEFBCDEBCD) Length in longer g1 4 – 4 4 – 6 3 g2 4 – 10 7 g3 – g4 4 – 8 4 – 9 6 g5 9 – 10 2 g6 5 – 8 9 4 g7 8 – 8 1 Next in second step, gap locations are identified from sorted list of non-gapped clones. Gap location is a kind of the combination of the two non-gapped clones. For example, a gap location defined by a combination of c1 and c6 is generated by the tail positions of c1 and head positions if c6 like this. This table shows all of the gaps from the sample input. And we define the length of each gap as the length of longer unmatched subsequence. Here we introduce the upper limit of the length of each gap. We call it threshold2. (If we set it to infinity, whole of each source file may become gapped clone.) In this example, we set the threshold2 3. Then these three gaps are not identified. In this identification, by using the fact that non-gapped clones are stored as the sorted result and the number of gaps connected from each non-gapped clone can be considered up to a certain constant, the overall time complexity of Step2 is linear of the number of non-gapped clones. Visualization Gap-and-clone scatter plot Source code investigation

Source files Step3-1: Visualization – gap-and-clone scatter plot Draw gaps on the scatter plot of non-gapped clone to visualize gapped clones in a pseudo way. Non-gapped clone detection 1 2 3 4 5 6 7 8 9 10 11 12 A B C D E F G A B C E F B C D E B C D File Y File X Non-gapped clones c1 c2 c3 c7 c6 c9 c5 c8 Gap identification Gaps Correspondences g1 g3 g7 g5 Next in third step, we draw gaps on the scatter plot of non-gapped clone. to visualize gapped clones in a pseudo way. In this figure, file X is arranged on the vertical axis and file Y is also arranged on the horizontal axis. Then these diagonal line segments represents non-gapped clones between file X and Y. If we draw gap locations on this plot, It becomes like this. We call this scatter plot “gap-and-clone scatter plot”. As you can see, just looking this plot. these three gapped clones can be identified by a user. Visualization Gap-and-clone scatter plot Gapped clone ID Path Subsequence in file X (ABCDCDEFBCDG) Subsequence in file Y (ABCEFBCDEBCD) gc1 c1 g1 c5 g7 c7 “ABC-CDE--CD” “ABC---CDEBCD” gc2 c1 g3 c6 “ABC---EFBCD” “ABCEFBCD” gc3 c2 g5 c4 “BCDCD” Source code investigation

1 2 3 4 5 6 7 8 9 10 11 12 A B C D E F G A B C E F B C D E B C D File Y File X g1 g3 g7 c1 c7 c6 c9 g5 c2 c3 c5 c8 Non-gapped clone detection Gap identification Non-gapped clones Visualization Gaps Gap-and-clone scatter plot Source files Correspondences Source code investigation Step3-2: Visualization – filtering Remove non-gapped clones and gaps that do not contribute to make a long gapped clone. Introduce the length of each entanglement (“eSize”) of non-gapped clones and gaps. eSize = max (eSizeX, eSizeY) eSizeX = eEndX – eStartX eSizeY = eEndY – eStartY “eSize” means the maximum length of gapped clone included in the entanglement. Set the minimum “eSize” for display (threshold3). Although this step is optional, the visual effects of gap-and-clone scatter plot can be improved sharply. Here we introduce the length of each entanglement of non-gapped clones and gaps. The length is defined like these expressions. That is, it is the maximum length of gapped clone included in the entanglement. Using this length, we remove non-gapped clones and gaps that do not contribute to make a long gapped clone. So, we define its minimum length for display. We call it threshold3. If we set it to 8 or so, this gap-and-clone scatter plot becomes like this. As you can see, user can easily identify the locations of these long gapped clones.

Source files Step4: Source code investigation Investigate source files with gap-and-clone scatter plot. Change parameters. Threshold1: Minimum size of non-gapped clones in non-gapped clone detection Threshold2: Maximum size of gaps in identification of gap locations. Threshold3: Minimum size of entanglement of non-gapped clones and gaps in gap-and-clone scatter plot. Theshold1 and threshold2 greatly affect computation time. Small threshold1 makes O(m2) non-gapped clone pairs detected from size-m source code. Large threshold2 makes O(n2) gaps detected from n clone pairs. Non-gapped clone detection Non-gapped clones Gap identification Gaps Correspondences In final step, user investigates source code using the visualized result. However in previous steps it might be difficult to decide the values of thresholds. In such case, three thresholds should be tuned for the use repeatedly. Threshold1 and threshold2 greatly affect computation time. For example, small threshold1 makes the number of non-gapped clones squares of the size of source code. Also, large threshold2 makes the number of gaps squares of the number of non-gapped clones. Visualization Gap-and-clone scatter plot Source code investigation

Analysis - Usefulness of gap-and-clone scatter plot
Ex.1 Ex.2 Ex.3 Threshold1 = 10 Threshold2 = 10 Threshold3 = 30 　　　　　 30 　　　　　　　　　 500 1000 1500 (Frequency of non-gapped clones) (Tokens) 　　　　　 30 　　　　　　　　　 500 1000 1500 (Frequency of non-gapped clones) (Tokens) Compared the scatter plots of non-gapped clones to the gap-and-clone scatter plot Three programs (Ex.1: 2267 tokens, Ex.2: 4394 tokens and Ex.3: 5738 tokens) of a student S are arranged on both of the vertical and horizontal axes. The grid represents boundary lines between sub-exercises. Shown up as long gapped clones Ex.1 Ex.2 Ex.3 Ex.1 Ex.2 Ex.3 Here, we compare the scatter plots of non-gapped clones to the gap-and-clone scatter plot. In these two figures, three programs of a student S are arranged on both of the vertical and horizontal axes. These grid lines represent boundary lines between sub-exercises. In the left, we plotted clones longer than 10 tokens. In the right, clones longer than 30 tokens. That is, fine-grained analysis may be possible in the left. But these enormous black dots means that there are so enormous non-gapped clones. This figure shows the frequencies of non-gapped clones per each size. You can see again how enormous the number of short non-gapped clones is. So it is difficult to perform fine-grained analysis using the left. On the other hand, the right is simpler than the left. So user can perform rapid analysis. But, it is naturally just rough-grained analysis. Next, these figures show the gap-and-clone scatter plot and the frequencies of non-gapped clones plotted in it. In this plot, only entanglements that are longer than 30 tokens are appeared. That is, non-gapped clones around here are shown up as long gapped clones and the frequencies are not so enormous like in the left. And they are not appeared in this scatter plot. Therefore gap-and-clone scatter plot can provide both of fine-grained analysis and rapid analysis. These observations are not accidental ones, similar results are observed in the programs of other students. Threshold1 = 10 Threshold1 = 30

The analysis of comparison among students (non-gapped clones only)
The corresponding code A (2 students) Similar code fragments were from source code of sample compiler described in textbook. B (4 students) Many code fragments were similar even with respect to name of variables or comments. B A So we analyzed the actual source code corresponding to these crowded clone pairs through source code viewer. As for area A, there are 2 students, and their similar code fragments were not plagiarisms but from source code of sample compiler described in textbook. However, as for area B, there are 4 students, and many code fragments were similar even with respect to name of variables or comments. So, it can be said that the possibility that plagiarisms among them happened is high.

CCFinder/Gemini (1/4) コードクローン検出手順
ソースファイル 1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) if (pat.match(a[i])) sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { RE exp = new RE("[0-9,]+"); int sum = 0; for (int i = 0; i < a.length; ++i) if (exp.match(a[i])) sum += parseNumber(exp.getParen(0)); System.out.println("sum = " + sum); 17. } 1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) if (pat.match(a[i])) sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. } 10. static void goo(String [] a) throws RESyntaxException { RE exp = new RE("[0-9,]+"); int sum = 0; for (int i = 0; i < a.length; ++i) if (exp.match(a[i])) sum += parseNumber(exp.getParen(0)); System.out.println("sum = " + sum); 17. } 字句解析トークン列変換処理変換後トークン列検出処理 I explain the clone detection process using an example. This source code is a part of program written in Java language. There are two methods. As you can see, these two methods are very similar to each other. In this source code, these highlighted strings represent differences between the two. In first step, CCFinder performs lexical analysis. In this step, each line of source files is divided into tokens corresponding to a lexical rule of the programming language. In next step, a transformation of the tokens is performed. Here, tokens are added, removed or changed based on some transformation rules. The rules is aimed at regularization of identifiers and identification of structures. For example, these all identifiers defined by user are transformed to same tokens. In the next step, CCFinder detects clone pairs from all the sub-strings on this transformed token sequence. In this example, these two underlined portions are the longest identical sub-strings. So, CCFinder detects them as a clone pair. Finally these positions are mapped on the original source code and the mapped positions are output. If we refer to the positions on the actual source code, these portions are the clone pair. However, since the output of CCFinder is text-based one, it is difficult to understand it intuitively in maintenance. The analysis of a large number of clone pairs is especially difficult. 変換後トークン列上でのコードクローン整形処理クローンペア位置情報

最小一致トークン数を20トークンに定めると．．．
Gappedクローン検出の背景 (2/2) Gappedクローンを識別することが困難な理由 CCFinderが検出できるのはNgクローンのみ Gappedクローンは複数の相対的に短いNgクローンとして検出される 30 トークン以上のクローンペア (クローンペア数：1208) 10 トークン以上のクローンペア (クローンペア数：26984) CCFinderは最小一致トークン数より短いコードクローンは検出しない　　→ 個々の一致箇所が短ければ検出されない一般に, 最小一致トークン数が短く設定されると，膨大な数のコードクローンが検出される 1. static void foo() throws RESyntaxException 2. { 3. String a[] = new String [] {"123,400", "abc"}; 4. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 6. int sum = 0; 7. for (int i = 0; i < a.length; ++i) 8. { if (pat.match(a[i])){ sum += Sample.parseNumber(pat.getParen(0));} } System.out.println("sum = " + sum); 13. } 1. static void goo(String [] a) throws RESyntaxException 2. { 3. RE exp = new RE(“[0-9,]+”); int sum = 0; 5. int i = 0; 6. while (i < a.length) { if (exp.match(a[i])) sum += parseNumber(exp.getParen(0)); i++; } System.out.println("sum = " + sum); 13. } 14 トークン最小一致トークン数を20トークンに定めると．．． Now I explain why we want to detect such gapped clones. At first, almost of clones created by reusing the code with copying and pasting may become gapped clones because usually some modifications are needed to adapt the copied code to the pasting target portion. However, CCFinder can detect non-gapped clones. So, like in these two methods that are partly different to each other, a gapped clone is separately detected as several short non-gapped clones like these three. In that case, If each matched portion is too short, CCFinder does not identify it as a clone because the minimum length of clone to be detected must be set in CCFinder beforehand. If we set the minimum length to 20 tokens, these two clones are not identified because they are shorter than 20 tokens. On the other hand, if the minimum length is set to short one, too many clones would be detected because of the existence of coincidental simple statements such as substitution. These scatter plots are the result of comparison within a certain program. In the left, we plotted clones longer than 30 tokens. In the right, clones longer than 10 tokens. As you can see, too many clones are detected in the right. It is difficult to find gapped clones from such many clones. 27 トークン 13 トークン

開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -

Similar presentations

Presentation on theme: "開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -"— Presentation transcript:

Similar presentations

About project

フィードバック

ログインする

Auth with social network:

開発保守支援を目指した コードクローンの抽出手法 - 支援環境Geminiへの実装 -

Similar presentations

Presentation on theme: "開発保守支援を目指した コードクローンの抽出手法 - 支援環境Geminiへの実装 -"— Presentation transcript:

Similar presentations

About project

フィードバック

開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -

Presentation on theme: "開発保守支援を目指したコードクローンの抽出手法 - 支援環境Geminiへの実装 -"— Presentation transcript: