Similarity Metric Method for Binary Basic Blocks of Cross-Instruction Set Architecture
 
Xiaochuan Zhang
 
zhangxiaochuan@outlook.com
 
Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Beijing, China
 
Content
01  Background
02  Methodology & Implementation
03  Experiment & Result
Background
Binary program similarity metrics can be used for:
- malware classification
- vulnerability detection
- authorship analysis
The similarity between basic blocks is the basis of these program-level metrics.
Background
Two steps of the basic block similarity metric:
1. Basic Block Embedding: each basic block is mapped to a vector.
   ARM block                      x86 block
   sub sp, sp, #72                movq    %rdx, %r14
   ldr r7, [r11, #12]             movq    %rsi, %r15
   ldr r8, [r11, #8]              movq    %rdi, %rbx
   ldr r0, .LCPI0_0               movabsq $.L0, %rdi
   -> [0.24, 0.37, …, 0.93]       -> [0.56, 0.74, …, 0.31]
2. Similarity Calculation: the two embeddings are compared to give a similarity score in [0, 1].
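A minimal Python sketch of the two steps, assuming the embeddings have already been produced by some encoder. The three-component vectors simply reuse the truncated example values from the slide, and the 1 / (1 + distance) mapping is only an illustrative choice, not the method's actual formula.

    import numpy as np

    # Step 1 (basic block embedding) is done by a learned encoder; here we simply
    # reuse the truncated example vectors shown on the slide.
    e_arm = np.array([0.24, 0.37, 0.93])   # embedding of the ARM block (truncated)
    e_x86 = np.array([0.56, 0.74, 0.31])   # embedding of the x86 block (truncated)

    # Step 2 (similarity calculation): compare the two embeddings and map the
    # result into [0, 1]; 1 / (1 + Euclidean distance) is one simple illustration.
    score = 1.0 / (1.0 + np.linalg.norm(e_arm - e_x86))
    print(round(float(score), 3))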
 
Background
Types of basic block embedding methods:
- manually designed: each dimension corresponds to a manually selected static feature [1-3]
- automatically learned:
  - static word representation based methods [4-7]
  - INNEREYE-BB, an RNN based method [8]
[1] Qian Feng, et al. Scalable Graph-based Bug Search for Firmware Images. CCS 2016
[2] Xiaojun Xu, et al. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. CCS 2017
[3] Gang Zhao, Jeff Huang. DeepSim: Deep Learning Code Functional Similarity. ESEC/SIGSOFT FSE 2018
[4] Yujia Li, et al. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. ICML 2019
[5] Luca Massarelli, et al. SAFE: Self-Attentive Function Embeddings for Binary Similarity. DIMVA 2019
[6] Uri Alon, et al. code2vec: Learning Distributed Representations of Code. PACMPL 3(POPL) 2019
[7] Steven H. H. Ding, et al. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. S&P 2019
[8] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019
Background
INNEREYE-BB [1]
[Figure: INNEREYE-BB embeds each instruction token (e.g. ldr, r0, .LCPI0_115, bl) with a word embedding and feeds the sequence to an RNN; function names such as printf, scanf, memcpy, … are mapped to a FUNC token.]
[1] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019
Methodology & Implementation
Idealized Solution (based on the perfect translation assumption)
[Figure: a neural machine translation model (encoding + decoding) maps an x86 BB to an ARM BB; the encoder's output matrix is aggregated into the BB embedding, and the ARM BB is embedded the same way (encoding + aggregation).]
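A minimal PyTorch sketch of the encoder-plus-aggregation idea for one architecture. The layer sizes are illustrative, positional encodings are omitted, and mean pooling is assumed as the aggregation function (the slides do not specify it), so this is not MIRROR's exact model.

    import torch
    import torch.nn as nn

    class BlockEncoder(nn.Module):
        """Token embedding + Transformer encoder; positional encodings omitted for brevity."""
        def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.enc = nn.TransformerEncoder(layer, num_layers)

        def forward(self, token_ids):              # token_ids: (batch, seq_len)
            h = self.enc(self.tok(token_ids))      # (batch, seq_len, d_model): the encoder output matrix
            return h.mean(dim=1)                   # aggregation (assumed: mean pooling) -> BB embedding

    # Usage: embed a batch of two tokenized basic blocks of length 10.
    enc = BlockEncoder(vocab_size=5000)
    emb = enc(torch.randint(0, 5000, (2, 10)))     # shape (2, 128)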
 
Methodology & Implementation
Practical Solution (two stages, detailed on the following slides: x86-encoder pre-training, then ARM-encoder training & x86-encoder fine-tuning)
Methodology & Implementation
x86-encoder pre-training
- Data: x86-ARM basic block pairs
- NMT model: Transformer [1]; other NMT models also work
- Optimization goal: minimize the translation loss (see the sketch below)
[1] Ashish Vaswani, et al. Attention is All you Need. NIPS 2017
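A hedged PyTorch sketch of this pre-training stage: a Transformer is trained to translate x86 token sequences into ARM token sequences with a cross-entropy translation loss, and only the encoder is kept afterwards. Vocabulary sizes, model dimensions, and the random toy batch are made up, and positional encodings are omitted; this illustrates the idea rather than reproducing the MIRROR training code.

    import torch
    import torch.nn as nn

    X86_VOCAB, ARM_VOCAB, D = 5000, 5000, 128

    src_embed = nn.Embedding(X86_VOCAB, D)          # x86 token embeddings
    tgt_embed = nn.Embedding(ARM_VOCAB, D)          # ARM token embeddings
    model = nn.Transformer(d_model=D, nhead=4, num_encoder_layers=2,
                           num_decoder_layers=2, batch_first=True)
    generator = nn.Linear(D, ARM_VOCAB)             # projects decoder states to ARM vocabulary
    params = (list(src_embed.parameters()) + list(tgt_embed.parameters())
              + list(model.parameters()) + list(generator.parameters()))
    opt = torch.optim.Adam(params, lr=1e-4)

    # One toy batch of paired blocks: x86 source tokens and ARM target tokens.
    src = torch.randint(0, X86_VOCAB, (8, 20))      # (batch, src_len)
    tgt = torch.randint(0, ARM_VOCAB, (8, 22))      # (batch, tgt_len)

    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]       # teacher forcing: shift targets by one
    mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
    opt.zero_grad()
    hidden = model(src_embed(src), tgt_embed(tgt_in), tgt_mask=mask)
    loss = nn.functional.cross_entropy(
        generator(hidden).reshape(-1, ARM_VOCAB), tgt_out.reshape(-1))
    loss.backward()                                 # translation loss drives pre-training
    opt.step()
    # After convergence, model.encoder (together with src_embed) is the pre-trained x86-encoder.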
Methodology & Implementation
ARM-encoder training & x86-encoder fine-tuning
- Data: basic block triplets {anchor, positive, negative}, where the anchor and the positive form a semantically equivalent basic block pair
- Optimization goal: minimize the margin-based triplet loss, i.e. keep the negative farther from the anchor than the positive by at least the margin (see the sketch below)
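A short PyTorch sketch of the margin-based triplet objective on already-computed embeddings; the batch of random vectors and the margin value are placeholders for illustration.

    import torch
    import torch.nn.functional as F

    anchor   = torch.randn(32, 128)   # e.g. x86 embeddings from the (fine-tuned) x86-encoder
    positive = torch.randn(32, 128)   # embeddings of semantically equivalent ARM blocks
    negative = torch.randn(32, 128)   # embeddings of non-equivalent blocks

    # L = mean( max(0, d(anchor, positive) - d(anchor, negative) + margin) ),
    # with d the Euclidean distance; margin=0.5 is an arbitrary placeholder.
    loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.5, p=2)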
Methodology & Implementation
Mixed negative sampling
- Hard negatives: similar but not equivalent to the anchor
- Negatives are a mix of random negatives and hard negatives (a 33% / 67% split); a small sketch follows
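A tiny sketch of the mixing step when building triplets. The helper functions and the split constant are hypothetical placeholders; the slide's pie chart shows a 33% / 67% split, but which share belongs to the hard negatives is not fixed here.

    import random

    HARD_FRACTION = 0.33   # placeholder split between hard and random negatives

    def pick_negative(anchor, corpus, hard_negative, random_negative):
        """Return a hard negative for a fraction of triplets, a random one otherwise."""
        if random.random() < HARD_FRACTION:
            return hard_negative(anchor, corpus)   # similar but not equivalent to the anchor
        return random_negative(corpus)             # uniformly sampled basic block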
Methodology & Implementation
Hard negative sampling (when the anchor is an x86 basic block)
[Figure: the anchor and n random x86 basic blocks rand_x86_1 … rand_x86_n are embedded with the pretrained x86-encoder; the candidate closest to the anchor, rand_x86_t, is selected, and its paired ARM basic block rand_ARM_t serves as the hard negative.]
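A hedged sketch of the procedure in the figure: the pretrained x86-encoder embeds the anchor and the random x86 candidates, the closest candidate is chosen, and the ARM block paired with it is returned as the hard negative. The encode_x86 callable and the (x86, ARM) pair structure are assumptions about how the data is organized.

    import torch

    def sample_hard_negative(anchor_tokens, random_pairs, encode_x86):
        """random_pairs: list of (x86_tokens, arm_tokens) sampled at random;
        encode_x86: the pretrained x86-encoder mapping a token tensor to a 1-D embedding."""
        with torch.no_grad():
            a = encode_x86(anchor_tokens)                            # anchor embedding
            cands = torch.stack([encode_x86(x) for x, _ in random_pairs])
            t = torch.cdist(a.unsqueeze(0), cands).argmin().item()   # index of rand_x86_t
        return random_pairs[t][1]                                    # rand_ARM_t as the hard negative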
 
Methodology & Implementation
Similarity Metric
[Formula: the similarity score is computed from the Euclidean distance between the two embeddings, normalized using the embedding dimension.]
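The slide does not spell the formula out, so the sketch below is only an assumed mapping that uses the two quantities it labels, Euclidean distance and embedding dimension; it is not necessarily MIRROR's exact similarity function.

    import math
    import torch

    def similarity(e1: torch.Tensor, e2: torch.Tensor) -> float:
        dim = e1.numel()                              # embedding dimension
        dist = torch.dist(e1, e2, p=2).item()         # Euclidean distance
        return 1.0 / (1.0 + dist / math.sqrt(dim))    # assumed mapping into [0, 1]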
 
Experiment & Result
Setup
- Prototype: MIRROR, https://github.com/zhangxiaochuan/MIRROR
- Dataset: MISA, 1,122,171 semantically equivalent x86-ARM basic block pairs, https://drive.google.com/file/d/1krJbsfu6EsLhF86QAUVxVRQjbkfWx7ZF/view
Experiment & Result
Comparison with Baseline
[Chart: comparison with the baseline; higher is better.]
 
Experiment & Result
Evaluation of negative sampling methods
[Chart: comparison of negative sampling methods; higher is better.]
 
Experiment & Result
Effectiveness of pre-training
The pre-training phase seems redundant?
Experiment & Result
Effectiveness of pre-training
[Chart: effect of pre-training; higher is better.]
 
Experiment & Result
 
Visualization
 
Thanks!
 
zhangxiaochuan@outlook.com
Slide Note

The similarity metric method for binary basic blocks is crucial in applications such as malware classification, vulnerability detection, and authorship analysis. The metric involves two steps: basic block embedding and similarity calculation. Both manually designed and automatically learned embedding methods have been proposed; INNEREYE-BB, for example, takes inspiration from neural machine translation to compare binary code beyond function pairs. The presented method pre-trains an x86 encoder on x86-to-ARM neural machine translation, then trains an ARM encoder and fine-tunes the x86 encoder with a margin-based triplet loss, aggregating the encoder outputs to obtain basic block embeddings.

  • Binary Basic Blocks
  • Similarity Metric
  • Cross-Instruction Set Architecture
  • Malware Classification
  • Neural Machine Translation
