Binary Basic Block Similarity Metric Method in Cross-Instruction Set Architecture
The similarity metric for binary basic blocks is crucial in applications such as malware classification, vulnerability detection, and authorship analysis. The metric involves two steps: basic block embedding and similarity score calculation. Both manual and automatic methods have been proposed for basic block embedding. The INNEREYE-BB method applies neural machine translation to binary code similarity comparison beyond function pairs. The methodology presented here trains a neural machine translation model on x86 and ARM basic blocks; the encoder outputs are aggregated to obtain the basic block embeddings.
- Binary Basic Blocks
- Similarity Metric
- Cross-Instruction Set Architecture
- Malware Classification
- Neural Machine Translation
Presentation Transcript
Similarity Metric Method for Binary Basic Blocks of Cross-Instruction Set Architecture
Xiaochuan Zhang (zhangxiaochuan@outlook.com)
Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Beijing, China
Content
01 Background
02 Methodology & Implementation
03 Experiment & Result
Background
Binary program similarity metrics can be used in:
- malware classification
- vulnerability detection
- authorship analysis
The similarity between basic blocks is the basis of these applications.
Background
Two steps of the basic block similarity metric:
1. Basic Block Embedding: each basic block is mapped to a vector. For example, the ARM block (sub sp, sp, #72; ldr r7, [r11, #12]; ldr r8, [r11, #8]; ldr r0, .LCPI0_0) maps to an embedding such as [0.24, 0.37, ..., 0.93], and the x86 block (movabsq $.L0, %rdi; movq %rdx, %r14; movq %rsi, %r15; movq %rdi, %rbx) maps to [0.56, 0.74, ..., 0.31].
2. Similarity Calculation: the two embeddings are compared to produce a Similarity Score in [0, 1].
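The two steps above can be sketched end to end. The averaging embedder, the toy token vectors, and the 1/(1 + distance) score below are illustrative stand-ins, not the trained model the rest of the deck describes:

```python
import math

# Hypothetical token vectors; a real system learns these from code.
VOCAB = {
    "sub": [0.1, 0.2], "ldr": [0.3, 0.1],
    "movq": [0.2, 0.3], "movabsq": [0.4, 0.2],
}

def embed(tokens, dim=2):
    # Step 1 (stand-in): average the vectors of known mnemonic tokens.
    vecs = [VOCAB.get(t, [0.0] * dim) for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def similarity(e1, e2):
    # Step 2: map Euclidean distance to a score in [0, 1] (1 = identical).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    return 1.0 / (1.0 + dist)

arm = embed(["sub", "ldr", "ldr", "ldr"])
x86 = embed(["movabsq", "movq", "movq", "movq"])
score = similarity(arm, x86)
```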
Background
Types of basic block embedding methods:
- Manual: each dimension corresponds to a manually selected static feature [1-3]
- Automatic:
  - static word representation based methods [4-7]
  - INNEREYE-BB, an RNN based method [8]

[1] Qian Feng, et al. Scalable Graph-based Bug Search for Firmware Images. CCS 2016
[2] Xiaojun Xu, et al. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. CCS 2017
[3] Gang Zhao, Jeff Huang. DeepSim: Deep Learning Code Functional Similarity. ESEC/FSE 2018
[4] Yujia Li, et al. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. ICML 2019
[5] Luca Massarelli, et al. SAFE: Self-Attentive Function Embeddings for Binary Similarity. DIMVA 2019
[6] Uri Alon, et al. code2vec: Learning Distributed Representations of Code. PACMPL 3(POPL) 2019
[7] Steven H. H. Ding, et al. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. S&P 2019
[8] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019
Background
INNEREYE-BB [1] treats each instruction as a word: the instructions of a basic block (e.g., ldr r0, .LCPI0_1; bl printf; out-of-vocabulary names such as printf, scanf, and memcpy are normalized to a FUNC token) are mapped to word embeddings x_1, ..., x_5 and fed to an RNN whose hidden state is updated as h_t = f(x_t, h_{t-1}); the final hidden state serves as the basic block embedding.
[1] Fei Zuo, et al. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. NDSS 2019
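The recurrence h_t = f(x_t, h_{t-1}) can be sketched as a plain tanh RNN step. The demo weights and two-dimensional embeddings are arbitrary, and the simple tanh cell stands in for whatever RNN cell the method actually uses:

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h):
    # One recurrence step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1}).
    return [math.tanh(
                sum(W_x[i][j] * x_t[j] for j in range(len(x_t))) +
                sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev))))
            for i in range(len(h_prev))]

def encode_block(instr_embeddings, W_x, W_h, dim):
    # Run the RNN over the instruction embeddings x_1 ... x_T;
    # the final hidden state serves as the block embedding.
    h = [0.0] * dim
    for x in instr_embeddings:
        h = rnn_step(x, h, W_x, W_h)
    return h

W_x = [[0.5, -0.2], [0.1, 0.3]]               # arbitrary demo weights
W_h = [[0.2, 0.0], [0.0, 0.2]]
block = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three instruction embeddings
emb = encode_block(block, W_x, W_h, dim=2)
```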
Methodology & Implementation
Idealized Solution (based on a PERFECT TRANSLATION assumption): a neural machine translation model is trained with an encoder over x86 basic blocks and a decoder producing the corresponding ARM basic blocks. Encoding a basic block yields a matrix of per-token states, which an aggregation step reduces to the basic block (BB) embedding; under perfect translation, an x86 block and its ARM counterpart yield comparable embeddings.
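If the Aggregation step is taken to be mean-pooling over the encoder's per-token states (the slides do not pin down the exact pooling, so this is an assumption), it could look like:

```python
def aggregate(encoder_states):
    # Mean-pool a T x d matrix of encoder states into one d-dim embedding.
    dim = len(encoder_states[0])
    n = len(encoder_states)
    return [sum(state[i] for state in encoder_states) / n for i in range(dim)]

states = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.8]]  # toy 3 x 2 encoder output
bb_embedding = aggregate(states)
```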
Methodology & Implementation Practical Solution
Methodology & Implementation
x86-encoder pre-training
- Data: x86-ARM basic block pairs
- NMT model: Transformer [1] (other NMT models also work)
- Optimization goal: minimize the translation loss
[1] Ashish Vaswani, et al. Attention is All you Need. NIPS 2017
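The translation loss being minimized is, in essence, the average negative log-likelihood of the correct target token at each decoding step. A toy version, with made-up decoder probabilities standing in for real model outputs:

```python
import math

def translation_loss(step_probs, target_ids):
    # Average negative log-likelihood of the correct target token per step.
    return -sum(math.log(probs[t])
                for probs, t in zip(step_probs, target_ids)) / len(target_ids)

# Two decoding steps over a 3-token vocabulary (illustrative numbers).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
loss = translation_loss(probs, [0, 1])  # correct tokens: 0, then 1
```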
Methodology & Implementation
ARM-encoder training & x86-encoder fine-tuning
- Data: basic block triplets {anchor, positive, negative}, where the anchor and the positive form a semantically equivalent basic block pair
- Optimization goal: minimize the margin-based triplet loss, i.e., keep the negative at least a margin farther from the anchor than the positive
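The margin-based triplet loss can be sketched directly; the embeddings and margin value below are illustrative:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Zero once d(anchor, negative) exceeds d(anchor, positive) + margin.
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Negative far from the anchor: the constraint is satisfied, loss is 0.
loss_easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [5.0, 0.0])
# Negative closer than the positive: loss is positive.
loss_hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])
```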
Methodology & Implementation
Mixed negative sampling
- Hard Negatives: similar but not equivalent to the anchor
- Negatives are drawn as a mix of Random Negatives and Hard Negatives (a 33% / 67% split)
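A sketch of the sampling mix. The `hard_frac=0.67` value assumes the larger share of the slide's 33%/67% split goes to hard negatives, which the slide layout does not make explicit, and the pool contents are illustrative:

```python
import random

def sample_negatives(random_pool, hard_pool, n, hard_frac=0.67, seed=0):
    # Mix hard and random negatives according to the given fraction.
    rng = random.Random(seed)
    n_hard = round(n * hard_frac)
    negs = rng.sample(hard_pool, n_hard) + rng.sample(random_pool, n - n_hard)
    rng.shuffle(negs)
    return negs

hard = [f"hard_{i}" for i in range(10)]
rand = [f"rand_{i}" for i in range(10)]
batch = sample_negatives(rand, hard, n=6)
```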
Methodology & Implementation
Hard negative sampling: if the anchor is an x86 basic block, candidate blocks rand_x86_1, ..., rand_x86_n are embedded with the pretrained x86-encoder; the candidate rand_x86_t closest to the anchor in that embedding space (yet not equivalent to it) is selected, and its paired ARM block rand_ARM_t serves as the hard negative.
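Hard-negative selection then amounts to a nearest-neighbor query in the pretrained encoder's embedding space. The pair list, names, and vectors below are illustrative assumptions:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pick_hard_negative(anchor_emb, pairs, exclude=()):
    # pairs: list of (x86_embedding, arm_block). Return the ARM block
    # paired with the x86 candidate closest to the anchor, skipping
    # candidates known to be equivalent to the anchor.
    best = min((p for p in pairs if p[1] not in exclude),
               key=lambda p: euclidean(anchor_emb, p[0]))
    return best[1]

pairs = [([0.9, 0.9], "arm_far"),
         ([0.1, 0.0], "arm_near"),
         ([0.5, 0.5], "arm_mid")]
neg = pick_hard_negative([0.0, 0.0], pairs)
```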
Methodology & Implementation
Similarity Metric: the similarity score between two basic blocks is computed from the Euclidean distance between their embeddings, normalized with respect to the embedding dimension.
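A dimension-normalized score consistent with the slide's wording; since the exact formula is not legible on the slide, the 1/(1 + dist/sqrt(dim)) mapping is an assumption:

```python
import math

def similarity_score(e1, e2):
    dim = len(e1)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    # Dividing by sqrt(dim) keeps scores comparable across embedding sizes.
    return 1.0 / (1.0 + dist / math.sqrt(dim))

s_same = similarity_score([0.2, 0.4, 0.6], [0.2, 0.4, 0.6])
s_diff = similarity_score([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```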
Experiment & Result
Setup
- Prototype: MIRROR, https://github.com/zhangxiaochuan/MIRROR
- Dataset: MISA, 1,122,171 semantically equivalent x86-ARM basic block pairs, https://drive.google.com/file/d/1krJbsfu6EsLhF86QAUVxVRQjbkfWx7ZF/view
Experiment & Result Comparison with Baseline * Higher is better
Experiment & Result Evaluation of negative sampling methods * Higher is better
Experiment & Result
Effectiveness of pre-training: is the pre-training phase redundant?
Experiment & Result Effectiveness of pre-training * Higher is better
Experiment & Result Visualization
Thanks! zhangxiaochuan@outlook.com