MONARC Project: Internationally Distributed Regional Data Analysis Centres for the LHC (I)
Y. Morita, H. Sato (KEK Computing Research Center), I. Legrand (CERN IT Division)
for the MONARC Collaboration
2000/3/30
[Figures: detectors of the LHC experiments, including ALICE and LHCb]
MONARC: Models Of Networked Analysis at Regional Centres
http://www.cern.ch/MONARC/
Modeling and performance evaluation of the computing and network systems for the analysis of the large-scale LHC experimental data:
- Huge data volume: 1 PByte / year / experiment (raw data)
- Huge computing resources: 10^9 events/year at 250 SPECint95 per event (ATLAS CTP)
- A worldwide distributed research community (ATLAS: 33 countries, 1800 people)
Started in September 1998 as an R&D project of the LCB (LHC Computing Board).
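As a rough scale check (our own arithmetic, reading the ATLAS CTP figure as 250 SPECint95-seconds of processing per event):

```latex
10^{9}\ \mathrm{events/yr} \times 250\ \mathrm{SI95\,s/event}
  = 2.5\times10^{11}\ \mathrm{SI95\,s/yr}
  \approx \frac{2.5\times10^{11}}{3.15\times10^{7}\ \mathrm{s/yr}}
  \approx 8\times10^{3}\ \mathrm{SI95\ sustained}
```

i.e. the equivalent of several hundred 1999-era PCs (~15 SPECint95 each, see the resource hierarchy slide) running continuously, for this processing step alone.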
MONARC Collaboration
30 institutes, 75 participants: the software and computing representatives of the four LHC experiments and related members, the CERN IT Division, and staff of computing centres in Europe, the US, and Asia.
Four Working Groups: Architecture WG, Analysis Model WG, Simulation WG, Testbed WG.
M. Aderholz (MPI), K. Amako (KEK), E. Auge (L.A.L/Orsay), G. Bagliesi (Pisa/INFN), L. Barone (Roma1/INFN), G. Battistoni (Milano/INFN), M. Bernardi (CINECA), M. Boschini (CILEA), A. Brunengo (Genova/INFN), J.J. Bunn (Caltech/CERN), J. Butler (FNAL), M. Campanella (Milano/INFN), P. Capiluppi (Bologna/INFN), F. Carminati (CERN), M. D'Amato (Bari/INFN), M. Dameri (Genova/INFN), A. di Mattia (Roma1/INFN), A. Dorokhov (CERN), G. Erbacci (CINECA), U. Gasparini (Padova/INFN), F. Gagliardi (CERN), I. Gaines (FNAL), P. Galvez (Caltech), A. Ghiselli (CNAF/INFN), J. Gordon (RAL), C. Grandi (Bologna/INFN), F. Harris (Oxford), K. Holtman (CERN), V. Karimäki (Helsinki), Y. Karita (KEK), J. Klem (Helsinki), I. Legrand (Caltech/CERN), M. Leltchouk (Columbia), D. Linglin (IN2P3/Lyon Computing Centre), P. Lubrano (Perugia/INFN), L. Luminari (Roma1/INFN), A. Maslennicov (CASPUR), A. Mattasoglio (CILEA), M. Michelotto (Padova/INFN), I. McArthur (Oxford), Y. Morita (KEK), A. Nazarenko (Tufts), H. Newman (Caltech), V. O'Dell (FNAL), S.W. O'Neale (Birmingham/CERN), B. Osculati (Genova/INFN), M. Pepe (Perugia/INFN), L. Perini (Milano/INFN), J. Pinfold (Alberta), R. Pordes (FNAL), F. Prelz (Milano/INFN), A. Putzer (Heidelberg), S. Resconi (Milano/INFN and CILEA), L. Robertson (CERN), S. Rolli (Tufts), T. Sasaki (KEK), H. Sato (KEK), L. Servoli (Perugia/INFN), R.D. Schaffer (Orsay), T. Schalk (BaBar), M. Sgaravatto (Padova/INFN), J. Shiers (CERN), L. Silvestris (Bari/INFN), G.P. Siroli (Bologna/INFN), K. Sliwa (Tufts), T. Smith (CERN), R. Somigliana (Tufts), C. Stanescu (Roma3), H. Stockinger (CERN), D. Ugolotti (Bologna/INFN), E. Valente (INFN), C. Vistoli (CNAF/INFN), I. Willers (CERN), R. Wilkinson (Caltech), D.O. Williams (CERN).
Analysis Model Schema
Hierarchy of datasets: RAW, ESD, AOD, TAG (real and simulated); sizes are old MONARC values.

| Dataset | Content | Size/event |
|---|---|---|
| RAW | Recorded by DAQ: triggered events, detector digitisation | ~1 MB |
| ESD | Reconstructed information; pseudo-physical information: clusters, track candidates (electrons, muons), jets, etc. | ~100 kB |
| AOD | Physical information: transverse momentum, association of particles, (best) id of particles, physical info for relevant "objects" | ~10 kB |
| TAG | Selected collection of only the relevant info (for the analysis); relevant information for fast event selection | ~1 kB |

24 March 2000, WW A/C Panel, P. Capiluppi
Analysis Model Schema
Hierarchy of resources (Tier 0, Tier 1, Tier 2, ...):
- Online System: bunch crossing every 25 nsec, 100 triggers per second, event size ~1 MByte; ~PBytes/sec off the detector, ~100 MBytes/sec into the offline farm.
- Tier 0 (CERN): offline farm ~20 TIPS; CERN computer centre >20 TIPS; HPSS mass storage.
- Tier 1 (regional centres): Fermilab ~4 TIPS; France, Italy, Germany regional centres; HPSS; served from CERN at ~622 Mbits/sec or by air freight.
- Tier 2: centres of ~1 TIPS, connected to Tier 1 at ~622 Mbits/sec.
- Tier 3: institutes of ~0.25 TIPS with links of 100-1000 Mbits/sec; physics data cache on the institute server.
- Tier 4: physicists' workstations.
Other quoted bandwidth: ~2.4 Gbits/sec.
Physicists work on analysis "channels". Each institute has ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
Scale: 1 TIPS = 25,000 SpecInt95; PC (1999) = ~15 SpecInt95.
24 March 2000, WW A/C Panel, P. Capiluppi
Regional Centre Workflow
- Data import/export: network from CERN, from Tier 2 and simulation centers; tapes.
- Mass storage & disk servers; database servers.
- Production reconstruction (Raw/Sim --> ESD): scheduled and predictable; run by experiment/physics groups.
- Production analysis (ESD --> AOD, AOD --> DPD): scheduled; run by physics groups.
- Individual analysis (AOD --> DPD and plots): chaotic; run by physicists on desktops.
- Support services: info servers, code servers, web servers; telepresence; training, consulting, help desk.
- Physics software development; R&D systems and testbeds.
- Connected to CERN, Tier 2 centres, and local institutes.
Methodology of System Performance Evaluation
- The accuracy of a system performance evaluation improves by iterating the cycle of: demonstrating the system in operation and measuring its performance; modeling its behavior with appropriate tools; and establishing methods for evaluating the observed behavior.
- Decide the policy by which system performance is to be improved (shorter turnaround time, utilization, etc.).
- Raise the measurement accuracy of the demonstration tests and the accuracy of the modeling in accordance with the precision required of the performance evaluation.
MONARC Simulation
Process Oriented Discrete Event Simulation: jobs in the system under study are instantiated as multithreaded tasks of a Java program, and the time taken by each elementary operation (event), such as CPU processing, disk I/O, or network transfer, is computed. (Developer: I. Legrand)
[Diagram: tasks TASK1, TASK2 with start times T1, T2, finish times TF1, TF2 and interrupts I1, I2, scheduled over CPU, memory, I/O link, and LAN resources]
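As a concrete (if much simplified) sketch of the idea, the toy simulator below, written by us and not taken from the MONARC code, shares a CPU of fixed SPECint95 power equally among the active jobs and re-evaluates every job's completion time at each arrival or completion event; all class names and job parameters are hypothetical.

```java
import java.util.*;

// Toy process-oriented discrete event simulation: one shared CPU,
// equal sharing among active jobs, event-by-event re-evaluation.
public class SharedCpuSim {
    static class Job {
        final String name;
        final double arrival;   // arrival time [s]
        double remaining;       // remaining work [SI95*s]
        Job(String n, double a, double w) { name = n; arrival = a; remaining = w; }
    }

    public static void main(String[] args) {
        final double power = 17.4;   // CPU power [SI95], cf. Machine A
        Deque<Job> pending = new ArrayDeque<>(List.of(
            new Job("job1", 0.0, 870.0),    // runs 50 s when alone
            new Job("job2", 0.0, 870.0),
            new Job("job3", 10.0, 435.0))); // runs 25 s when alone
        List<Job> active = new ArrayList<>();
        double now = 0.0;

        while (!pending.isEmpty() || !active.isEmpty()) {
            // Next event: earliest completion under equal CPU sharing,
            // or the next job arrival, whichever comes first.
            double share = active.isEmpty() ? 0.0 : power / active.size();
            double tDone = Double.POSITIVE_INFINITY;
            for (Job j : active) tDone = Math.min(tDone, now + j.remaining / share);
            double tArr = pending.isEmpty() ? Double.POSITIVE_INFINITY
                                            : pending.peek().arrival;
            double tNext = Math.min(tDone, tArr);

            // Advance the clock, charging each active job its share of work.
            for (Job j : active) j.remaining -= (tNext - now) * share;
            now = tNext;

            // Completions first, then arrivals; the next loop iteration
            // re-evaluates all finish times (the "interrupt" mechanism).
            for (Iterator<Job> it = active.iterator(); it.hasNext(); ) {
                Job j = it.next();
                if (j.remaining <= 1e-9) {
                    System.out.printf("%6.1f s: %s finished%n", now, j.name);
                    it.remove();
                }
            }
            while (!pending.isEmpty() && pending.peek().arrival <= now)
                active.add(pending.poll());
        }
    }
}
```

The MONARC engine generalizes this scheme to many resource types (CPU, disk, LAN, WAN) and runs each job as a Java thread rather than as a loop iteration.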
Data Server Model
- The real system is divided into a small number of components whose behavior is abstracted.
- The Objectivity/DB implementation is modeled as the basic database structure (Federation, Database, Container, etc.).
- The time consumed by each of the mutually interacting components is expressed as a function of its parameters.
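For illustration, a response-time function of this kind might look as follows; the decomposition (disk read, server link, one handshake per request) and all parameter names and values are our own assumptions, not the actual MONARC functions.

```java
// Hypothetical data server time model: deliver sizeMB of container data
// to one client via a disk read, a network transfer, and one handshake.
public class DataServerModel {
    double dbReadSpeed = 31.0;   // server disk read [MB/s], cf. Machine C
    double dbLinkSpeed = 12.5;   // server network link [MB/s] (100 Mbit/s)
    double rtt         = 0.001;  // handshake round trip time [s]

    double readTime(double sizeMB) {
        return sizeMB / dbReadSpeed + sizeMB / dbLinkSpeed + rtt;
    }

    public static void main(String[] args) {
        DataServerModel m = new DataServerModel();
        System.out.printf("54 MB container: %.1f s%n", m.readTime(54.0));
    }
}
```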
Testbed DB Application
Object Model in ATLAS Simulated Raw Events (ATLAS 0.0.24 in the 1 TB milestone):
- System DB --> Raw Data DB1, Raw Data DB2, ...
- Each raw data DB contains an Event Container and a Raw Data Container.
- Event Container: PEvent #1, PEvent #2, ..., each holding a PEventObjVector of PEventObj references into the raw data.
- Raw Data Container: PSiDetector/PSiDigit, PTRT_Detector/PTRTDigit, PMDT_Detector/PMDT_Digit, PCaloRegion/PCaloDigit, PTruthVertex/PTruthTrack.
LAN Measurements
Machines:
- Machine A: Sun Enterprise 450 (400 MHz, 4 CPUs)
- Machine B: Sun Ultra 5 (270 MHz) -- lock server
- Machine C: Sun Enterprise 450 (300 MHz, 2 CPUs)
Tests (raw data jobs; number of client processes: 1, 2, 4, ..., 32):
(1) Machine A local (4 CPUs)
(2) Machine C local (2 CPUs)
(3) Machine A (client) and Machine C (server)
Measurements
[Plots: job execution time and aggregated CPU% versus the number of client processes, for Machine A local, Machine C local, and Machine A as client with Machine C as server]
Modeling of the Client Job
- Job on Machine A: CPU 17.4 SI95; I/O 207 MB/s for the 54 MB file
- Job on Machine C: CPU 14.0 SI95; I/O 31 MB/s for the 54 MB file
- T_CPU and T_I/O can be extracted from the single-client tests on two different machines, given their SPECint95 ratings and the measured disk I/O speeds (see the worked equations below).
- The same job parameters are then put into the simulation of multiple clients.
[Diagram: per-event timeline of alternating CPU and I/O phases on each machine]
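One way to read this extraction (our own notation): with CPU powers P_A, P_C in SI95 and I/O speeds v_A, v_C measured independently, the two single-client wall times give two linear equations in the job's unknown CPU work W [SI95*s] and I/O volume D [MB]:

```latex
t_A = \frac{W}{P_A} + \frac{D}{v_A}, \qquad
t_C = \frac{W}{P_C} + \frac{D}{v_C}
```

Solving for W and D fixes the per-job parameters that are then reused, unchanged, in the multi-client simulation.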
Disk I/O
Measured with write() and read() system calls for various block sizes and file sizes (speeds in MB/s):

| | File 54 MB, block 54 | File 54 MB, block 1024 | File 864 MB, block 54 | File 864 MB, block 1024 |
|---|---|---|---|---|
| Machine A write | 14.8 | 16.2 | 13.5-9.2 | 15.9 |
| Machine A read | 207 | 207 | 18.5 | 22 |
| Machine C write | 18.3 | 25.9 | 20.7-8.0 | 27 |
| Machine C read | 31 | 44 | 22.8 | 28.3 |

(Block sizes as quoted on the original slide, units [MB].) The 54 MB file fits in the disk cache on Machine A, hence the 207 MB/s read speed.
Behavior of the AMS Protocol
The AMS page transfer latency is modeled in the simulation. With physical bandwidth B and effective bandwidth B_eff:

T = t(transfer) + t(handshake) = unit_size / B + RTT

B_eff / B = unit_size / (unit_size + B * RTT)

[Plot: AMS packet sequence, sequence number versus time since the page is sent, for write and read. Ref. H. Sato (30aSG10)]
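For scale, with illustrative numbers of our own choosing (unit_size = 8 kB pages, B = 12.5 MB/s, i.e. 100 Mbit/s, RTT = 1 ms):

```latex
\frac{B_\mathrm{eff}}{B}
  = \frac{8192}{8192 + 12.5\times10^{6} \times 10^{-3}}
  = \frac{8192}{20692} \approx 0.40
```

so the per-page handshake alone can cost more than half of the physical bandwidth even at LAN-class round trip times.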
Other Modeling Details
A client with an SMP architecture is modeled as a farm of the same number of single-CPU nodes connected by a high-speed network.
[Diagram: AMS client (4 CPUs) modeled as Node 1-4 with Node_link_speed, connected over the network to the AMS server with DB_link_speed, DB_read_speed, and the DB server CPU]
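A hypothetical sketch of the resulting bottleneck structure (parameter names follow the diagram labels; the values and the min() decomposition are our own illustration): the per-client data rate is set by the first stage to saturate, where the server link and disk are shared by all clients but each node's own link is not.

```java
// Illustrative bottleneck model for n clients reading from one AMS server.
public class SmpAsFarm {
    static double perClientRate(int nClients, double nodeLinkSpeed,
                                double dbLinkSpeed, double dbReadSpeed) {
        return Math.min(nodeLinkSpeed,
               Math.min(dbLinkSpeed / nClients, dbReadSpeed / nClients));
    }

    public static void main(String[] args) {
        for (int n : new int[]{1, 2, 4, 8, 16, 32})
            System.out.printf("%2d clients: %5.2f MB/s each%n",
                n, perClientRate(n, 100.0, 12.5, 31.0));
    }
}
```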
Simulated Results
[Plots: simulated results for Machine A and Machine C. Y. Morita et al., MONARC-99/7]
Job Execution Time
Job execution time of jobs competing for the same resource.
[Plots: job execution time (sec) versus number of jobs, measurement and simulation]
RC Simulation Example
- The number of jobs and the execution pattern at each analysis stage are assumed statically.
- Resource requirement criterion: a day's jobs must not carry over into the next day.
- Estimates of the CPU, network, and other resources required at Tier-0 and Tier-1 (a toy version of the sizing rule follows).
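A toy version of that "no carry-over" sizing rule, with entirely hypothetical job counts and per-job CPU costs:

```java
// Toy Tier-N CPU sizing: the installed capacity must absorb one day's
// statically assumed workload within 86,400 seconds. Numbers are made up.
public class DailyCapacity {
    public static void main(String[] args) {
        // {jobs per day, CPU work per job [SI95*s]} for two analysis stages
        double[][] dailyLoad = { {1000, 2.5e5},    // e.g. ESD --> AOD passes
                                 {100,  5.0e6} };  // e.g. reconstruction jobs
        double work = 0.0;
        for (double[] stage : dailyLoad) work += stage[0] * stage[1];
        System.out.printf("required sustained capacity: %.0f SI95%n",
                          work / 86400.0);
    }
}
```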
Outlook
- MONARC published its Phase 2 Report this March: http://monarc.web.cern.ch/MONARC/docs/phase2report/Phase2Report.doc
- Phase 3 will run for one more year, carrying out model optimization and related work.
- Toward the conclusion of the LHC Computing MoU, the Computing Review Committee is expected to deliver its recommendations on the computing and network models of the LHC experiments in June: http://lhc-computing-review-public.web.cern.ch/lhc-computing-review-public/Public/
- CERN, jointly with other institutes, plans to submit to the EU an R&D proposal, DataGrid, on the enabling technologies for realizing regional analysis centres (about three years starting in 2001): http://www.cern.ch/grid/
Summary
- MONARC proposes a multi-tier regional analysis centre model that lets researchers in each LHC experiment analyze the experimental data at institutes in their own countries.
- A system performance evaluation method based on Process Oriented Discrete Event Simulation has been established, and it reproduces the testbed measurements of Objectivity/DB performance (see part II for details).
- By assuming job execution patterns for each analysis stage of the LHC experiments, estimation of the required CPU and network resources is in progress.
- Resource evaluation of hierarchical mass storage systems and the like is left for Phase 3.
- "GRID" projects, centered in the US, are coming to be seen as a promising base technology for building regional analysis centres.