Parallel Programming in MPI, Part 2 (January 18, 2011)
Sample answer to last week's report (1)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int i, r, myid, procs, *result;
        struct timeval tv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);

        if (myid == 0) {
            /* Rank 0 prepares a buffer to receive one value from each process */
            result = (int *)malloc(sizeof(int) * procs);
            if (result == NULL) {
                printf("Not enough memory\n");
                exit(EXIT_FAILURE);
            }
        }

        /* every process generates its own random number */
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        r = rand();

        /* gather the numbers from all processes onto Rank 0 */
        MPI_Gather(&r, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (myid == 0) {
            for (i = 0; i < procs; i++)
                printf("%d: %d\n", i, result[i]);
            free(result);
        }

        MPI_Finalize();
        return 0;
    }
Sample answer to last week's report (2)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        MPI_Status status;
        int val = 0;
        int r, myid, procs;
        struct timeval tv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);

        /* wait for the previous rank, so the output appears in rank order */
        if (myid != 0)
            MPI_Recv(&val, 1, MPI_INT, myid-1, 0, MPI_COMM_WORLD, &status);

        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        r = rand();
        printf("%d: %d\n", myid, r);

        /* tell the next rank that it may print now */
        if (myid != procs-1)
            MPI_Send(&val, 1, MPI_INT, myid+1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }
Today's Topic
並列処理 Parallel Computing: 複数のCPUコアに仕事を分担 Distribute the work among processes
(Figure: without parallel computing, task1, task2 and task3 run one after another; with parallel computing they run at the same time.)
プログラム例 (並列化前) Sample Program (Before Parallelization)
2個のベクトルの和 Add two vectors (A = B + C)
一つずつ順に処理 Sequential computation, one element after another
(Figure: elements 0 to 99 of A are computed one at a time as B[i] + C[i].)

Sequential program:
    double A[100], B[100], C[100];
    ...
    for (i = 0; i < 100; i++)
        A[i] = B[i] + C[i];
プログラム例 (並列化後) Sample Program (After Parallelization)
処理とデータを各プロセスに分割 Distribute the computation and the data among the processes
(Figure: each of Rank 0 to Rank 3 holds 25 elements of A, B and C and computes its own part of A = B + C.)

Parallel program:
    double A[25], B[25], C[25];
    ...
    MPI_Init(...);
    ...
    for (i = 0; i < 25; i++)
        A[i] = B[i] + C[i];
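As a concrete illustration of the two fragments above, the following is a minimal sketch of a complete parallel vector addition. It assumes 100 elements divided over 4 processes and invents example values for B and C; none of these details are specified on the slides.

    #include <stdio.h>
    #include "mpi.h"

    #define LOCAL_N 25   /* 100 elements divided among 4 processes (assumption) */

    int main(int argc, char *argv[])
    {
        double A[LOCAL_N], B[LOCAL_N], C[LOCAL_N];
        int i, myid, procs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);

        /* example initialization: in a real program B and C would come
           from input data or from an earlier part of the computation */
        for (i = 0; i < LOCAL_N; i++) {
            B[i] = 1.0;
            C[i] = (double)(myid * LOCAL_N + i);
        }

        /* the parallelized loop: each process adds only its own 25 elements */
        for (i = 0; i < LOCAL_N; i++)
            A[i] = B[i] + C[i];

        printf("Rank %d of %d finished its %d elements\n", myid, procs, LOCAL_N);

        MPI_Finalize();
        return 0;
    }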
並列化の効果 Effect of Parallelization
速度向上率 = (1プロセスによる実行時間) / (複数プロセスによる実行時間)
Speed-Up Ratio = (execution time with one process) / (execution time with multiple processes)
並列化によって何倍速くなったか How many times faster the program became through parallelization
並列処理に対する期待と現実 Hopes and Reality of Parallelization
プログラマ Programmer: 「CPUを 4台使うんだから,並列化で 4倍速くなって欲しい」 "Since I am using 4 CPUs, I want the program to become 4 times faster through parallelization."
計算機製作者 Maker of Computers: 「CPU 4台で 3倍くらい速くなれば十分だろう」 "About 3 times faster with 4 CPUs should be good enough."
Why?
アムダールの法則 Amdahl's Law
負荷のバランス Load Balance
通信のコスト Cost of Communication
アムダールの法則 Amdahl's Law
並列化にあてはめて考えると Applying the law to parallelization:
並列化による性能向上率の理論的な限界 The theoretical limit of the speed-up ratio achievable by parallelization
    = 1 / ((1 - P) + P/N)
P: プログラム中の並列化対象部分が全処理時間に占める割合 The fraction of the total execution time spent in the part that will be parallelized
N: プロセス数 The number of processes
Example) N=4 で 3.5倍以上高速化するためには 95%以上の部分を並列化 To achieve a speed-up of 3.5 or more with N=4, more than 95% of the execution time must be parallelized.
(Figure: plot of the limit of the speed-up ratio for N=4 as a function of P.)
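A quick numerical check of the example above; the program below is only an illustration, and the values of P it tries are chosen here, not taken from the slide.

    #include <stdio.h>

    /* Amdahl's law: limit of the speed-up ratio when a fraction P of the
       execution time is parallelized over N processes */
    double speedup_limit(double P, int N)
    {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main(void)
    {
        double P;
        /* With N = 4, P = 0.95 gives a limit of about 3.48; P must exceed
           roughly 0.952 before the limit reaches 3.5. */
        for (P = 0.80; P <= 1.001; P += 0.05)
            printf("P = %.2f  ->  limit = %.2f\n", P, speedup_limit(P, 4));
        return 0;
    }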
負荷のバランス Load Balance
並列プログラムの処理時間は 「最も遅いプロセスの処理時間」 である The execution time of a parallel program is the execution time of its slowest process.
(Figure: two timelines of Ranks 0 to 3; in each case the execution time of the whole program is determined by the rank that finishes last.)
通信時間 Time for Communication
並列化前は不要だった時間 = 並列化によるオーバーヘッド Time that was not needed before parallelization = overhead caused by the parallelization
(Figure: timelines of Ranks 0 to 3 with communication phases inserted between the computations.)
By the way
並列化だけが高速化の手段ではない Parallelization is not the only way to speed a program up. For example:
Add optimization options when compiling the program.
Modify the program to fit the sizes of the cache memories.
並列化は簡単な作業ではない Parallelization is not a simple task. Careful analysis of the cost and the effect of parallelization is important.
では,プログラムを並列化してみましょう OK, let's parallelize a program.
プログラムのどこを並列化するか? Which part of the program should be parallelized?
そこは並列化できるか? Can that part be parallelized?
どうやって並列化するか? How should it be parallelized?
どこを並列化するか? Which part should be parallelized?
アムダールの法則を考慮すると, 「可能であれば、最も時間のかかっている場所を並列化した方がよい」 Considering Amdahl's Law, choose the part that takes the longest computing time, if possible.
Usually:
最大の多重ループ Try the largest loop nest first.
さらに、その最外ループ And within it, try the outermost loop of the nest first.
最大の多重ループが、最も所要時間が長い可能性が高い The largest loop nest is likely to consume the most computing time.
内側のループを並列化すると,並列化のオーバーヘッドが大きくなる可能性が高い Parallelizing an inner loop is likely to cause a larger parallelization overhead.
並列化可能か? Can it be parallelized?
並列処理 = 複数の処理を同時進行 Parallel computing = multiple computations proceed at the same time.
仕事の順番が不確定 The order of the computations is not deterministic.
並列化できるプログラム = 実行の順番が変わっても結果が変わらない Parallelizable program = changing the order of the computations does not change the result.

    for (i = 0; i < 12; i++)
        a[i] = b[i] + c[i];

(Figure: with b[i] = 1 and c[i] = i+1, the results a[i] = 2, 3, ..., 13 can be computed independently, in any order.)
並列化が難しいプログラム Programs that are hard to parallelize
実行の順序によって結果が変わる Changing the order of the computations changes the result.
Example: 前に計算した値を参照して計算 A computation that uses the results of previous computations:

    for (i = 1; i < 12; i++)
        a[i] = a[i] + a[i-1];

(Figure: with a = 1, 2, ..., 12, each a[i] depends on the newly computed a[i-1], so Ranks 1 to 3 cannot compute their parts until the preceding parts are finished.)
ループの並列化 Parallelization of loops
どのようにプロセスに割り当てるか? How to distribute the iterations of a loop among the processes
どのようにプロセスにデータを配置するか? How to allocate the data on the processes
計算結果のとりまとめ How to collect the final results of the distributed loops
ループの分割 How to distribute loops
Block: 連続した繰り返し毎に分割 Distribute contiguous blocks of iterations.
通常、このBlock分割で十分な高速化が得られる場合が多い。 Usually this block distribution already gives sufficient speed-up.
繰り返し数がプロセス数で割り切れない場合があるので注意。 Note that the number of iterations may not be divisible by the number of processes; a sketch of one way to handle that case follows below.

    for (i = (100/procs)*myid; i < (100/procs)*(myid+1); i++)
        a[i] = work(i);

    Rank 0: i = 0, 1, 2, ..., 24
    Rank 1: i = 25, 26, 27, ..., 49
    Rank 2: i = 50, 51, 52, ..., 74
    Rank 3: i = 75, 76, 77, ..., 99

    procs: number of processes   myid: rank
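The iteration count on the slide (100) happens to be divisible by 4 processes. A common way to handle the general case is to give the first few ranks one extra iteration each; the following is a sketch under that assumption, reusing procs, myid, a and work() from the fragment above, with an example iteration count n chosen here.

    /* block distribution of n iterations over procs processes, where n is
       not necessarily divisible by procs: the first (n % procs) ranks
       each get one extra iteration */
    int n     = 103;                  /* example iteration count    */
    int base  = n / procs;            /* iterations every rank gets */
    int rest  = n % procs;            /* leftover iterations        */
    int start = base * myid + (myid < rest ? myid : rest);
    int count = base + (myid < rest ? 1 : 0);
    int i;

    for (i = start; i < start + count; i++)
        a[i] = work(i);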
ループの分割 (つづき) How to distribute loops (cont.)
Cyclic: とびとびに分割 Distribute the iterations with a stride.

    for (i = myid; i < 100; i += procs)
        a[i] = work(i);

    Rank 0: i = 0, 4, 8, 12, ..., 96
    Rank 1: i = 1, 5, 9, 13, ..., 97
    Rank 2: i = 2, 6, 10, 14, ..., 98
    Rank 3: i = 3, 7, 11, 15, ..., 99

Block-Cyclic: 二つの組み合わせ A combination of the two.

    for (i = myid*4; i < 100; i += procs*4)
        for (ii = i; ii < i+4; ii++)
            a[ii] = work(ii);

    Rank 0: i = 0, 1, 2, 3, 16, 17, 18, 19, ...
    Rank 1: i = 4, 5, 6, 7, 20, 21, 22, 23, ...
    Rank 2: i = 8, 9, 10, 11, 24, 25, 26, 27, ...
    Rank 3: i = 12, 13, 14, 15, 28, 29, 30, 31, ...

繰り返し毎に計算の量が異なる場合に利用 Use these distributions when the amount of computation differs from one iteration to another.
データの配置 How to allocate data
通常,それぞれのプロセスで使用するデータのみを宣言 Usually, each process allocates only the data that it actually uses.
Good: efficient use of memory.
Bad: the program must be rewritten wherever it depends on the sizes of the arrays.

    double *a, *b, *c;
    int local_size;
    ...
    MPI_Init(...);
    ...
    local_size = N / procs;
    a = (double *)malloc(local_size * sizeof(double));
    /* b and c are allocated in the same way */
    ...
    for (i = 0; i < local_size; i++)
        c[i] = a[i] * b[i];

(Figure: each of Rank 0 to Rank 3 holds only its own part of a, b and c, and computes its own part of the result.)
データの配置(つづき) How to allocate data (cont.)
全プロセスで全配列を宣言 Redundant allocation: every process declares the entire arrays.
Good: less modification is needed for parallelization.
Bad: the problem size cannot be increased by adding processes, and regions that are never used during the execution are still allocated.

    double a[N], b[N], c[N];
    int local_size;
    ...
    MPI_Init(...);
    ...
    for (i = N/procs*myid; i < N/procs*(myid+1); i++)
        c[i] = a[i] * b[i];

小さいデータの配置に利用 Use this style of allocation for small data.
(Figure: each of Rank 0 to Rank 3 holds the whole arrays but uses only its own part.)
結果のとりまとめ How to collect the final results (1)
Gather the data distributed over the processes onto one process.
通常,MPI_Gather を使用 MPI_Gather can generally be used.
ループの分割やデータの配置によって,調整が必要 Some adjustment is needed depending on the loop distribution and the data allocation. A fuller sketch follows below.

Example with locally allocated arrays (each rank holds only its own part of c):
    MPI_Gather(c, local_size, MPI_DOUBLE,
               c_total, local_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

Example with redundantly allocated arrays (each rank filled only its own part of the full c):
    MPI_Gather(&(c[myid*local_size]), local_size, MPI_DOUBLE,
               c_total, local_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

(Figure: the parts of c held by Rank 0 to Rank 3 are gathered into c_total on Rank 0.)
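A sketch that puts the pieces together; the variable names local_size, c and c_total follow the slides, while the value of N and the example initialization of c are assumptions made here.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    #define N 100

    int main(int argc, char *argv[])
    {
        double *c, *c_total = NULL;
        int i, myid, procs, local_size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);

        local_size = N / procs;              /* assumes N is divisible by procs */
        c = (double *)malloc(local_size * sizeof(double));

        /* each process fills its own part of the result (example values) */
        for (i = 0; i < local_size; i++)
            c[i] = (double)(myid * local_size + i);

        /* only the root needs the buffer for the gathered result */
        if (myid == 0)
            c_total = (double *)malloc(N * sizeof(double));

        MPI_Gather(c, local_size, MPI_DOUBLE,
                   c_total, local_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (myid == 0) {
            for (i = 0; i < N; i++)
                printf("c_total[%d] = %f\n", i, c_total[i]);
            free(c_total);
        }

        free(c);
        MPI_Finalize();
        return 0;
    }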
結果のとりまとめ How to collect the final results (2)
とりまとめ結果を全プロセスにコピー Copy the gathered data to all processes.

    MPI_Allgather(c, local_size, MPI_DOUBLE,
                  c_total, local_size, MPI_DOUBLE, MPI_COMM_WORLD);

(Figure: every one of Rank 0 to Rank 3 ends up with the complete c_total.)
結果のとりまとめ How to collect the final results (3)
全プロセスの結果の総計 Combine the results of all processes into totals, for example a total sum.
Possible operations: MPI_SUM, MPI_PROD, etc.
Use MPI_Allreduce to copy the combined result to all processes.

    MPI_Reduce(c, c_total, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

(Figure: the arrays c on Rank 0 to Rank 3 are summed element-wise into c_total on Rank 0.)
MPIプログラムの時間計測 Measuring the time of MPI programs
MPI_Wtime: 現在時間(秒)を実数で返す関数 A function that returns the current time in seconds as a floating-point number.

Example)
    double t1, t2;
    ...
    t1 = MPI_Wtime();
    (the part to be measured)
    t2 = MPI_Wtime();
    printf("Elapsed time: %e sec.\n", t2 - t1);
並列プログラムにおける時間計測の問題 Problems of measuring time in parallel programs
プロセス毎に違う時間を測定: どの時間が本当の所要時間か? Each process measures a different time. Which one is the time we really want?
(Figure: Ranks 0 to 2 call MPI_Wtime at different points around their Read, Send and Receive operations, so each rank measures a different interval.)
集団通信 MPI_Barrierを使った解決策 A solution using the collective operation MPI_Barrier
時間計測前にMPI_Barrierで同期 Synchronize the processes with MPI_Barrier before each measurement.
Suitable for measuring the total execution time. A short sketch of the pattern follows below.
(Figure: all ranks reach MPI_Barrier, call MPI_Wtime together, do their Read/Send/Receive work, reach another MPI_Barrier, and call MPI_Wtime again.)
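A minimal sketch of this measurement pattern; the work between the two measurements is only a placeholder, and myid is the rank variable used on the earlier slides.

    double t1, t2;

    /* make sure every process has arrived before starting the clock */
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* ... the part of the program to be measured ... */

    /* wait until every process has finished before stopping the clock */
    MPI_Barrier(MPI_COMM_WORLD);
    t2 = MPI_Wtime();

    if (myid == 0)
        printf("Total elapsed time: %e sec.\n", t2 - t1);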
より細かい解析 More detailed analysis
Average: MPI_Reduce can be used to compute the average.
MAX and MIN: use MPI_Gather to collect all of the measured times on Rank 0, and let Rank 0 find the MAX and MIN (a sketch follows below).

    double t1, t2, t, total;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    t = t2 - t1;
    MPI_Reduce(&t, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myrank == 0)
        printf("Ave. elapsed: %e sec.\n", total/procs);
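A sketch of the MAX/MIN part described above; t, myrank and procs are the variables from the fragment above, while the buffer name times and the use of malloc are choices made here.

    double *times = NULL;
    double tmax, tmin;
    int i;

    /* only the root rank needs the receive buffer */
    if (myrank == 0)
        times = (double *)malloc(procs * sizeof(double));

    /* collect every process's elapsed time t on Rank 0 */
    MPI_Gather(&t, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (myrank == 0) {
        tmax = tmin = times[0];
        for (i = 1; i < procs; i++) {
            if (times[i] > tmax) tmax = times[i];
            if (times[i] < tmin) tmin = times[i];
        }
        printf("Max: %e sec.  Min: %e sec.\n", tmax, tmin);
        free(times);
    }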
最大(Max)、平均(Ave)、最小(Min)の関係 Relationships among Max, Ave and Min
プロセス毎の負荷(仕事量)のばらつき検証に利用 Can be used to check the load balance among the processes.

                          Max - Ave is large   Max - Ave is small
    Ave - Min is large    NG                   Mostly OK
    Ave - Min is small    NG                   OK

(The measured times here include both computation time and communication time.)
通信時間の計測 Measuring the time spent in communication

    double t1, t2, t3, t4, comm = 0;

    t3 = MPI_Wtime();
    for (i = 0; i < N; i++){
        (computation)
        t1 = MPI_Wtime();
        (communication)
        t2 = MPI_Wtime();
        comm += t2 - t1;
        (computation)
        t1 = MPI_Wtime();
        (communication)
        t2 = MPI_Wtime();
        comm += t2 - t1;
    }
    t4 = MPI_Wtime();
Analyzing the computation time
Computation time = total time - communication time. Alternatively, measure the computation time directly.
計算時間のばらつき = 負荷の不均衡の度合い The variation of the computation time among the processes shows the imbalance of the amount of computation; a sketch of one way to summarize it follows below.
注意: 通信時間には、負荷の不均衡によって生じた待ち時間が含まれるので、単純な評価は難しい Note: the communication time is difficult to evaluate by itself, because it includes waiting time caused by load imbalance. ==> Balance the computation first.
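Continuing the fragment from the previous slide, one way to summarize the per-process times; the names total and comp are chosen here, and the MPI standard defines MPI_MAX and MPI_MIN for floating-point types, so they can be applied directly to the measured values.

    /* derive the computation time from the measurements above */
    double total = t4 - t3;          /* whole loop, on this process       */
    double comp  = total - comm;     /* computation only, on this process */
    double comp_max, comp_min;

    /* the spread between the largest and smallest computation time
       indicates how well the load is balanced */
    MPI_Reduce(&comp, &comp_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&comp, &comp_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("Computation time  max: %e sec.  min: %e sec.\n",
               comp_max, comp_min);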
Today's MPI functions MPI_Allgather MPI_Reduce MPI_Allreduce MPI_Barrier MPI_Wtime
MPI_Allgather
Usage: int MPI_Allgather(void *sb, int sc, MPI_Datatype st, void *rb, int rc, MPI_Datatype rt, MPI_Comm comm);
MPI_Gatherの結果を全プロセスにコピー Like MPI_Gather, but the gathered result is copied to all processes.
Parameters: send data (start address, number of elements, data type), receive data (start address, number of elements, data type), communicator
Example:
    MPI_Allgather(a, 3, MPI_DOUBLE, b, 3, MPI_DOUBLE, MPI_COMM_WORLD);
(Figure: the array a of each of Rank 0 to Rank 3 is concatenated into b on every rank.)
MPI_Reduce
Usage: int MPI_Reduce(void *sb, void *rb, int c, MPI_Datatype t, MPI_Op op, int root, MPI_Comm comm);
全プロセスからのデータを集めて計算(総和等)をする Gathers data from all processes and applies a reduction operation (such as a total sum).
Parameters: start address of the original data, start address to store the result, number of elements, data type, operation, root rank, communicator
Possible operations: MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, etc. (MPI_MAX and MPI_MIN are defined for integer and floating-point types, but not for complex types.)
Example:
    MPI_Reduce(a, b, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
(Figure: the arrays a of Rank 0 to Rank 3 are summed element-wise into b on Rank 0.)
MPI_Allreduce
Usage: int MPI_Allreduce(void *sb, void *rb, int c, MPI_Datatype t, MPI_Op op, MPI_Comm comm);
MPI_Reduceの結果を全プロセスにコピー Like MPI_Reduce, but the result is copied to all processes.
Parameters: start address of the original data, start address to store the result, number of elements, data type, operation, communicator
Example:
    MPI_Allreduce(a, b, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
(Figure: the arrays a of Rank 0 to Rank 3 are summed element-wise into b on every rank.)
MPI_Barrier
Usage: int MPI_Barrier(MPI_Comm comm);
同期。全プロセスがMPI_Barrierを実行するまで待つ。 Synchronization: each process waits here until all processes have called MPI_Barrier.
Parameter: communicator
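An example in the same style as the other function slides (this call is not shown on the original slide; t1 is assumed to be a double, as on the timing slides):

    Example:
        MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();   /* all processes start the measurement together */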
MPI_Wtime
Usage: double MPI_Wtime();
現在の時刻を秒で取得。秒は倍精度実数で返す。 Gets the current time in seconds, returned as a double-precision floating-point number.
Parameters: none
レポート課題: ベクトルの内積の並列化 Report: Parallelization of a Dot Product
以下のプログラムの計算部分を並列化しなさい Parallelize the computation part of the following program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #define N 1000

    int main(int argc, char *argv[])
    {
        double *a, *b, c, r;
        int i;
        struct timeval tv;

        a = (double *)malloc(N * sizeof(double));
        b = (double *)malloc(N * sizeof(double));

(continued on the next page)
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        for (i = 0; i < N; i++){
            r = rand();
            a[i] = r/RAND_MAX;
            r = rand();
            b[i] = r/RAND_MAX;
        }

        c = 0.0;
        for (i = 0; i < N; i++)
            c = c + a[i]*b[i];

        printf("Result: %e\n", c);

        free(a);
        free(b);
        return 0;
    }
並列化にあたって Notes on the parallelization
ベクトル a, b の初期値はランク0で生成し, 他のランクへ適切にコピーする Let Rank 0 generate the initial values of the elements of a and b, and copy the data to the other processes appropriately.
結果はランク0で表示する Let Rank 0 display the result.
時間と興味があったら,N やプロセス数を変えながら MPI_Wtimeで所要時間や計算部分の時間を計測してみる If you have time and interest, use MPI_Wtime to measure the total execution time and the time of the computation part while varying N and the number of processes.