Security Aware System Design ~Architectural Support for Secure Computing~
Koji Inoue, Kyushu University
IoT Processors MUST be Secure! Architectural Support for Security?
High performance: pipelining, superscalar, branch prediction, value prediction, on-chip cache, OoO execution (exploiting ILP/TLP/MLP)
Low power/energy: clock gating, signal gating, DVS, resizing, drowsy operation, selective activation
Our Approach (#1): Secure Cache
Security joins high performance and low power/energy as a third axis in the architectural-support design space.
Buffer-Overflow Attack
R. B. Lee, D. K. Karig, J. P. McGregor, and Z. Shi, “Enlisting Hardware Architecture to Thwart Malicious Code Injection,” Proc. of the Int. Conf. on Security in Pervasive Computing, Mar. 2003.
A well-known vulnerability, exploited by Blaster in 2003
Caused by unexpected operations that write an inordinately large amount of data into a buffer
The vulnerability exists in the C standard library (e.g., strcpy)
Leads to stack smashing: an attack code is inserted, the return address is corrupted, and the attacker hijacks program-execution control
Function Call/Return
Program code:
int f() { ... g(s1); }
int g(char *s1) { char buf[10]; strcpy(buf, s1); }
Execution sequence: Start f() → Call g() → Execute strcpy() → Return to f()
Function Call/Return: Start f(), Call g()
Stack frame of g() after the call (the stack grows toward lower addresses):
  s1  (higher addr.)
  Return Address = the next PC of Call g()
  Saved FP  ← FP
  Local Variable buf  ← SP  (lower addr.)
Function Call/Return: Execute strcpy()
strcpy(buf, s1) copies the string into buf, writing from lower toward higher addresses:
  s1  (higher addr.)
  Return Address = the next PC of Call g()
  Saved FP  ← FP
  Local Variable buf (now holds the string)  ← SP  (lower addr.)
As long as the string fits in buf, Return to f() works as intended.
Stack Smashing
Before: buf, the Saved FP, and the Return Address (the next PC of Call g()) are intact.
During strcpy(): a string longer than buf keeps writing past buf toward higher addresses.
Stack Smashing
Insert the attack code! The oversized string places attack code in the stack frame.
Corrupt the return address! The overrun overwrites the Return Address (the next PC of Call g()) with a pointer to the attack code.
Hijack the program execution! Return to f() jumps to the attack code instead.
Secure Cache Architecture
Protect return-address (RA) values in the cache!
Generate one or more “replicas” on each RA store
Compare the original RA with a replica on each RA load
If they differ, we know that the popped RA has been corrupted!
Security Issue
Replica lines are also evicted from the cache, so we miss the opportunity to check an RA value if no replica line resides in the cache.
So: good for many applications (with high cache-hit rates), bad for memory-intensive applications (with high cache-miss rates).
Lock & Unlock Approach
Lock: prohibit the eviction of replica lines until the corresponding RA load checks them.
Unlock: after the check, replica lines follow the normal replacement policy until evicted.
Pros: avoids premature replica-line invalidation.
Cons: wastes cache resources while lines are locked.
Experimental Set-Up
Processor simulator: SimpleScalar 3.0, 4-way OoO superscalar, 4-way 16KB L1 D-cache
SCache model: LRU1-L&U: Lock & Unlock (with LRU replica placement)
Benchmark programs: SPEC2000, 7 integer and 4 FP programs, small input
Vulnerability of LRU1-L&U
Vulnerability = (Nv-rald / Nrald) × 100, where Nrald is the total number of issued RA loads and Nv-rald is the number issued insecurely (without a replica check).
[Chart: vulnerability per benchmark program]
Performance Overhead of LRU1-L&U
[Chart: performance overhead per benchmark program]
Our Approach (#2): Dynamic Program Authentication, in the same architectural-support design space.
Prohibit Malicious Code from Executing
Black list: search for malicious code, e.g., virus scan (pattern matching, rule matching); static and dynamic variants.
White list: allow only authenticated code to execute, e.g., certificates (with a key); static schemes exist, but dynamic?
Concept of Dynamic Program Signature
Determine an execution behavior for program authentication from a secret key (e.g., a memory-access pattern).
Control the execution behavior at compile time.
Monitor the execution behavior at run time (e.g., with a hardware profiler).
[Figure: the program is compiled with the secret key into object code that follows the behavior model; the CPU checks the observed behavior against the model]
Dynamic Execution Behavior for Program Authentication
Pre-defined behavior: address X is accessed every N instructions.
The HW profiler monitors this property from the start to the end of execution; if attack code hijacks execution control, the N-instruction access pattern is broken and the attack is detected.
How Can We Control the Execution Behavior at Compile Time?
Unify the size of basic blocks (reshaping the original branch structure).
Insert a special instruction to generate a dynamic signature in each basic block.
For speculative/OoO execution, monitor the sequence of committed instructions.
Security Strength
We can NOT detect malicious attacks if:
The attacker knows the secret-key information and the algorithm that determines the pre-defined behavior, and the attack code is inserted at compile time.
The attacker can predict the pre-defined behavior; probability: 1/(2^(ADDR bit-width) + 2N), where N is the number of candidate unified basic-block sizes.
The malicious code is smaller than the unified basic block (e.g., fewer than 5 or 30 instructions).
Performance/Cost Overhead (1/2)
Evaluation: code size (cost) and execution time (performance)
Experimental environment: ARM SimpleScalar, StrongARM model, extended to support the dynamic signature
Benchmark program: SPECint95 (129.compress)
Dynamic signature: unified basic-block size from 5 to 25 instructions; one key-load instruction in each basic block
Performance/Cost Overhead (2/2)
[Chart: code composed of the original code, inserted NOPs, and key loads]
Code size: ×5 in maximum, ×1.7 in minimum
Execution-time overhead: 50% in maximum, 10% in minimum
Apply Machine-Learning (AI) Techniques
Training: run the program under the HW profiler (L1 D/I cache misses, instruction counts, #loads/stores, etc.) on a set of inputs to build a trusted execution-behavior model.
Execution: profile the running program over a time window (e.g., 0.1 s) and check it against the model.
Preliminary Experiments
Training: an SVM (support vector machine) builds a trusted execution-behavior model for lbm; positive example: lbm; negative examples: bzip2, mcf, gcc, wrf (all from SPEC2006).
Execution results:
Positive: lbm, accuracy 100.00%
Negative: bzip2, gcc, mcf, wrf
Unknown: hmmer, libquantum, perlbench, sphinx3, accuracy 100.00%; bwaves, accuracy 95.09%
What’s Next? ~Energy-Security Tradeoff~
Don’t you have such an experience? “Did I lock the door?” When you care, you go back to the office and check. When you are very tired: “Did I lock the door? ... I don’t care!” The effort spent on checking is traded against security.
Example of Energy-Security Tradeoff ~Case for White-List Execution~
Long profiling window (e.g., 1.0 s): low energy, but insecure.
Short profiling window (e.g., 0.1 s): high energy, but secure.
Conclusions
IoT processors MUST be low cost, low power/energy, high performance, and SECURE!
Architectural support is a key technology, enabling adaptive tradeoff optimization.
Our collaboration with IITD focuses on the energy-security tradeoff from the architecture perspective (including CPU/memory/VMs).
Backup Slides
Example of Energy-Security Tradeoff ~Case for Secure-Cache~
[Chart: configurations LRU1L, LRU1, LRU2, MRU1, MRU2, ALL; annotated values 6.0% and 5.4%]
MRU1: more than 88.5% of RA loads
ALL: more than 99.3% of RA loads
Example of Energy-Security Tradeoff ~Case for Secure-Cache~
[Chart: energy overhead for LRU1L, LRU1, LRU2, MRU1, MRU2, ALL]
ALL: 23% energy overhead
MRU1: 9.9% energy overhead
Example of Energy-Security Tradeoff ~Case for Secure-Cache~
[Charts: E²VP, EV²P, and EVP metrics for 164.gzip, 175.vpr, 176.gcc, 188.ammp, and 197.parser, normalized to LRU1; configurations LRU2, MRU1, MRU2, ALL; annotated values 2.4% and 2.3%]
MRU1: good for energy-oriented applications
MRU2: good for security-oriented applications
Example of Energy-Security Tradeoff: Security/Energy/Performance Evaluation
Performance: SimpleScalar 3.0; 16KB 4-way D-cache, 32B line size; OoO execution; SPEC2000 (7 integer, 4 FP programs), small input.
Energy: 4KB SRAM design (one way of the 16KB cache) in 0.18μm CMOS technology; HSPICE simulation with extracted load capacitances; measures the energy consumed per 1-bit access.
Apply AI Techniques! Feature Visualization and Its Evaluation Environment
10 programs from SPEC2006: INT: bzip2, gcc, perlbench, hmmer, libquantum; FP: sphinx3, bwaves, lbm, wrf, mcf.
Performance counters are used as features, sampled every 0.1 s over the whole execution.
Eight events are captured: instruction (L1) and data cache misses, executed instruction count, load/store instruction counts, branch instruction count, FP-divider busy cycles, and branch mispredictions.
Feature visualization by principal component analysis (PCA): the feature distribution matters for identification, so the 8-D features are reduced to 2-D.
[Speaker notes] For the experiments we chose 10 SPEC2006 programs: 5 integer (INT) and 5 floating-point (FP). As features we used the performance counters employed in prior work. The CPU used in this experiment has eight performance-counter registers, so eight events were captured simultaneously; the eight events listed above were selected from several hundred candidates, guided by prior work. From 1 second after the start of execution until the end, the counts of the eight events were sampled every 0.1 s and used as the program's features. For white-list program identification, the distribution of the features is crucial, so we first visualize how each program's features spread using PCA, mapping each 8-D performance-counter sample down to 2-D.
Visualization of Feature Vector
[Speaker notes] The visualization result is shown here. Each program's features are mapped in the color given in the legend. The horizontal axis is the first principal component after PCA and the vertical axis is the second; each component is normalized to the range 0 to 1.
Visualization of Feature Vector: lbm
[Speaker notes] The features of most programs overlap, but the features of lbm, circled in blue, stand apart. A classifier that identifies lbm therefore appears feasible.
Classification by Machine Learning and Its Evaluation
Features: the performance counters used in the visualization experiment.
Machine-learning algorithm: support vector machine (SVM).
lbm classification experiment: positive example: lbm; negative examples: bzip2, gcc, mcf, wrf.
Results: positive: lbm, 100.00% accuracy; negative: bzip2, gcc, mcf, wrf; unknown: hmmer, libquantum, perlbench, sphinx3, 100.00%; bwaves, 95.09%.
[Speaker notes] From the feature visualization we judged that a classifier for lbm is feasible while one for bzip2 would be difficult. We therefore adopted the same performance counters as features, built classifiers by machine learning with an SVM, and evaluated them. The positive and negative training programs for the lbm and bzip2 classifiers are those listed above; the remaining programs were used for evaluation. Training labeled each 0.1-s sample of the 8-D performance-counter values over a program's whole execution as positive or negative.
Q: How were the negative examples chosen? A: For lbm, we chose programs whose features are far from lbm's; for bzip2, programs with similar features. How negatives should be chosen is a question for future work.