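This post briefly walks through the capstone2llvmir source code and how to build and run it locally. It is meant for readers taking a first look at how assembly is lifted to IR who want to make small changes, build them, and end up with a simple asm2llvmir tool of their own. Once you have LLVM IR you can optimize it, deobfuscate it, and get up to all sorts of mischief; see my earlier posts:
https://bbs.pediy.com/thread-265335.htm Removing bogus control flow with compiler optimizations
https://bbs.pediy.com/thread-266323.htm Removing control-flow flattening with compiler optimizations
PS: some of the more complex asm-to-IR translations are missing from the source. You have to try writing them yourself, refine them gradually, and then build the result into your own tool.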
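capstone2llvmir and bin2llvmir are two very important parts of the RetDec source code; their job is to turn assembly into LLVM IR. I studied this tool carefully and kept notes.
Source code: https://retdec-tc.avast.com/repository/download/Retdec_DoxygenBuild/.lastSuccessful/build/doc/doxygen/html/files.html
The RetDec wiki has an introduction to capstone2llvmirtool at https://github.com/avast/retdec/wiki/Capstone2LlvmIr. I have summarized its main points below.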
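1. Full semantic translation: the semantics of the instruction are translated completely into IR. This is done only for sufficiently simple instructions. PS: many rarely used instructions have no translation in the source; if you run into one, write it yourself by imitating the existing code.
2. Translation to an intrinsic function call: some instructions are translated to intrinsic functions that most compilers understand, for example some jumps.
3. Translation to a pseudo assembly call: the call is created from Capstone's disassembly information. PS: if you see one of these calls, the underlying instruction was not really translated. If it matters for optimization, write a translation function for it yourself; if it does not, just ignore it.
4. No translation: instructions that are hard to translate are ignored.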
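1. Create an empty LLVM IR module
2. Initialize the Capstone engine and other data structures
3. Create the architecture environment, i.e. the register-related data structures and so on
3.1 Map assembly addresses to an IR global variable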
@_asm_program_counter = internal global i64 0
; ...
; add eax, 0x1234 @ 0x1000
store volatile i64 4096, i64* @_asm_program_counter
; ... LLVM IR sequence for the add instruction
; sub ebx, 0x1234 @ 0x1005
store volatile i64 4101, i64* @_asm_program_counter
; ... LLVM IR sequence for the sub instruction
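The address mapping in 3.1 can be sketched in a few lines of Python that print the same kind of IR text. This is only an illustration of the idea, not RetDec's code: the global name @_asm_program_counter comes from the listing above, while the function name emit_pc_stores and its parameters are made up for this sketch.

```python
def emit_pc_stores(instructions, pc_bits=64):
    """Return LLVM IR text that records each instruction's address.

    `instructions` is a list of (address, asm_text) pairs. The helper name
    and parameters are hypothetical; only @_asm_program_counter is taken
    from the listing in the post.
    """
    lines = [f"@_asm_program_counter = internal global i{pc_bits} 0"]
    for address, asm_text in instructions:
        # Keep the original assembly as a comment so the IR stays readable.
        lines.append(f"; {asm_text} @ {address:#x}")
        # A volatile store marks where the IR of this instruction begins.
        lines.append(
            f"store volatile i{pc_bits} {address}, "
            f"i{pc_bits}* @_asm_program_counter"
        )
        lines.append("; ... LLVM IR sequence for this instruction")
    return "\n".join(lines)


print(emit_pc_stores([(0x1000, "add eax, 0x1234"), (0x1005, "sub ebx, 0x1234")]))
```

Because the stores are volatile, ordinary optimization passes keep them, so later analyses can still tell which IR belongs to which address.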
3.2 Generate the control-flow-related pseudo functions. Native IR branches are not used because IR jumps to basic-block labels, not to addresses the way assembly does.
; void (i<architecture_size> target_address)
declare void @__pseudo_call(i32)
; void (i<architecture_size> target_address)
declare void @__pseudo_return(i32)
; void (i<architecture_size> target_address)
declare void @__pseudo_branch(i32)
; void (i1 condition, i<architecture_size> target_address)
declare void @__pseudo_cond_branch(i1, i32)
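To make the role of these declarations concrete, here is a small Python sketch that produces them for a given address width and models a direct jump as a call carrying the target address. The four function names are from the listing above; declare_pseudo_functions and lift_direct_jump are hypothetical helpers, and how RetDec later replaces such calls is not shown.

```python
def declare_pseudo_functions(arch_bits):
    """Return the declarations of the four control-flow pseudo functions.

    `arch_bits` is the width of a target address (32 for x86, for example).
    """
    addr = f"i{arch_bits}"
    return [
        f"declare void @__pseudo_call({addr})",
        f"declare void @__pseudo_return({addr})",
        f"declare void @__pseudo_branch({addr})",
        f"declare void @__pseudo_cond_branch(i1, {addr})",
    ]


def lift_direct_jump(target_address, arch_bits):
    """Model a jump to a raw address as a call to @__pseudo_branch."""
    return f"call void @__pseudo_branch(i{arch_bits} {target_address})"


for line in declare_pseudo_functions(32):
    print(line)
print(lift_direct_jump(0x1005, 32))
```

The point of the design is that the target stays an address value inside the IR, so no basic-block label has to exist at the moment the instruction is translated.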
3.3 Initialize the architecture-specific register global variables
@eax = internal global i32 0
@ecx = internal global i32 0
; ...
@st0 = internal global x86_fp80 0xK00000000000000000000
@st1 = internal global x86_fp80 0xK00000000000000000000
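A register environment like the one above is essentially a table from register name to LLVM type. The following Python sketch generates the globals from such a table; the table REGISTERS and the helper emit_register_globals are assumptions for illustration and list only the four registers shown in the post.

```python
# Hypothetical register table: name -> (LLVM type, zero initializer).
REGISTERS = {
    "eax": ("i32", "0"),
    "ecx": ("i32", "0"),
    "st0": ("x86_fp80", "0xK00000000000000000000"),
    "st1": ("x86_fp80", "0xK00000000000000000000"),
}


def emit_register_globals(registers):
    """Return one `internal global` definition per register."""
    return [
        f"@{name} = internal global {ty} {init}"
        for name, (ty, init) in registers.items()
    ]


for line in emit_register_globals(REGISTERS):
    print(line)
```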
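1. Disassemble the binary with the Capstone engine. For a single instruction, Capstone reports roughly the following information; the example used is add eax, 0x1234, and its full dump is the General info / Detail info listing later in this post.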
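2. Find the translation method and translate the instruction to IR. The id field holds the opcode.
2.1 The Capstone ID is mapped to an ID-specific routine: each id (opcode) has its own translation routine.
2.2 The Capstone ID is mapped to a specific pseudo assembly generation method.
2.3 The Capstone ID is not mapped to any value: nothing matched. A call is created automatically from the Capstone-provided instruction info, so the result depends on the quality of the information Capstone provides.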
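The three cases in step 2 amount to a table lookup with a fallback. A minimal Python sketch, assuming made-up IDs and handler names (real Capstone IDs are architecture-specific enum values, and the real handlers emit IR through an IRBuilder, not strings):

```python
def translate_add(insn):
    # Stand-in for a dedicated routine that emits the full IR semantics (case 2.1).
    return f"full IR for {insn['mnemonic']}"


def pseudo_two_operands(insn):
    # Stand-in for a specific pseudo assembly generation method (case 2.2).
    return f"op0 = __asm_{insn['mnemonic']}(op0, op1)"


def pseudo_generic(insn):
    # Fallback (case 2.3): build a call from whatever operand info is available.
    args = ", ".join(f"op{i}" for i in range(insn["op_count"]))
    return f"__asm_{insn['mnemonic']}({args})"


# Hypothetical IDs; 8 matches the `add` id shown in the dump, the others are invented.
ID_ADD, ID_FOO = 8, 100
ID_TO_HANDLER = {ID_ADD: translate_add, ID_FOO: pseudo_two_operands}


def translate_instruction(insn):
    """Look the ID up in the table and fall back to the generic pseudo call."""
    handler = ID_TO_HANDLER.get(insn["id"], pseudo_generic)
    return handler(insn)


print(translate_instruction({"id": 8, "mnemonic": "add", "op_count": 2}))
print(translate_instruction({"id": 200, "mnemonic": "bar", "op_count": 3}))
```

Adding support for a new instruction then means writing one more handler and registering it in the table, which is what the post suggests doing when a translation is missing.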
Source layout:
Public interface: include/retdec/capstone2llvmir
Private interface: src/capstone2llvmir
Interface class: Capstone2LlvmIrTranslator
Implementation: Capstone2LlvmIrTranslator_impl
Per-architecture implementation, e.g. Capstone2LlvmIrTranslatorArm
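Start from the entry point, the main function in capstone2llvmirtool/capstone2llvmir.cpp. (There is another caller in retdec\src\bin2llvmir\optimizations\decoder; studying these two functions teaches you how to use translate to turn asm into IR.) main first creates an llvm::Function and fills it with one basic block and a return, then creates the translator for the CPU architecture with Capstone2LlvmIrTranslator::createArch, and finally calls the translate function declared in capstone2llvmir/capstone2llvmir.h, passing data, size, base and the IR builder irb through which the IR is emitted.
Inside translate, cs_malloc allocates a cs_insn buffer for the Capstone handle, and cs_disasm_iter uses the handle to disassemble the binary one instruction at a time into that insn.
generateSpecialAsm2LlvmInstr is a key function: it turns the address of insn into an LLVM global variable. Every architecture has a program counter that records which address is being executed (pc on ARM), and it changes with every instruction. The pc value here comes from the global value produced by generateSpecialAsm2LlvmInstr.
translateInstruction is where insn is actually translated to IR. The paths taken here correspond to the translation strategies listed earlier; a rough skeleton is given later in this post.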
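The key function translateInstruction converts the assembly insn into LLVM IR. It is a virtual function declared in capstone2llvmir_impl.h.
Each architecture has its own translateInstruction implementation; the ARM one is in src\capstone2llvmir\arm\arm.cpp.
An important structure here is _cs_insn. With capstone installed for python3 on my machine, we can read its layout in the Python bindings.
A rough look at the translateInstruction code:
It uses an important hash table, _i2fm (Instruction translation map), which maps each instruction to the function pointer that translates it to IR. For example, ARM_INS_ADC (add with carry) maps to &Capstone2LlvmIrTranslatorArm_impl::translateAdc.
Two more important structures are defined in arm_init.cpp: the hash table r2n from register symbol to name, and the hash table r2t from register symbol to type. Together they model the ARM registers completely as C++ data structures.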
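Translating asm into a generic pseudo function is how instructions with no translation function in the _i2fm table are handled:
1. From the instruction info provided by Capstone, work out how many register and non-register reads and writes the generated IR has, how many LLVM types and values must be created, whether the function returns a value, and so on.
2. Create the parameters and return value from those LLVM types and values, name the function by concatenating _asm with the mnemonic insn->mnemonic, and generate an empty shell pseudo function.
3. When we generate IR and notice IR functions named after assembly mnemonics, we know the instruction has no translation function and can write one ourselves to translate it fully. That said, the translation functions in the _i2fm table are enough in 99% of cases.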
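Steps 1 and 2 can be illustrated with a short Python sketch that derives the shape of a pseudo call from operand access flags. The rule used here (read operands become arguments, a single written operand receives the return value) is an assumption of this sketch that reproduces forms such as op0 = __asm_<mnem>(op0, op1); it is not taken from RetDec's implementation. READ and WRITE are modeled on Capstone's CS_AC_READ and CS_AC_WRITE, and the `__asm_` prefix follows the pseudo call forms listed in this post.

```python
READ, WRITE = 1, 2  # illustrative access flags, modeled on CS_AC_READ / CS_AC_WRITE


def pseudo_call(mnemonic, operand_access):
    """Build the text of a pseudo call for an instruction without a translator.

    `operand_access` holds one access bitmask per operand. Operands that are
    read become arguments; if exactly one operand is written, it receives
    the return value (an assumption of this sketch).
    """
    args = [f"op{i}" for i, acc in enumerate(operand_access) if acc & READ]
    written = [f"op{i}" for i, acc in enumerate(operand_access) if acc & WRITE]
    call = f"__asm_{mnemonic}({', '.join(args)})"
    if len(written) == 1:
        return f"{written[0]} = {call}"
    return call


print(pseudo_call("foo", [READ | WRITE, READ]))
print(pseudo_call("foo", [WRITE, READ]))
print(pseudo_call("foo", [READ, READ, READ]))
```

Seeing a call like this in the output IR is the signal described in step 3: the mnemonic in the function name tells you which instruction still needs a hand-written translation.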
git clone https://github.com/avast/retdec.git
cd retdec
mkdir build && cd build
Syntax: cmake .. -DCMAKE_INSTALL_PREFIX=<path> -DRETDEC_ENABLE_<component>=ON
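cmake ../ -DRETDEC_ENABLE_CAPSTONE2LLVMIRTOOL=ON builds only the capstone2llvmir front end. This is the unmodified logic that translates asm to IR one instruction at a time, and it is the subject of this post.
//cmake ../ -DRETDEC_ENABLE_BIN2LLVMIRTOOL=ON Note: this option is from earlier versions. There is no BIN2LLVMIRTOOL any more, only a library.
cmake ../ -DRETDEC_ENABLE_RETDECTOOL=ON builds only the retdectool front end, i.e. the bin2llvmir front end of earlier versions. It first obtains IR through capstone2llvmir and then runs many passes that analyze and optimize that initial IR. The reaching-definitions analysis and the constructor/destructor analysis, among others, are very cleverly done and worth studying. The key interface function is retdec::disassemble(po.inputFile, &fs).
PS: this step downloads the capstone, keystone and LLVM libraries from git, which was slow for me; you can switch to a mirror inside China.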
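make -jN (N is usually set to the number of cores + 1). The executables can then be found under retdec\build\src\, as follows:
retdec-decompiler is bin2llvmir2cpp
retdectool is bin2llvmir (capstone2llvmir followed by several optimization passes)
capstone2llvmirtool is plain, unmodified capstone2llvmir
retdectool is the former bin2llvmir executable. Study it from its entry point to understand how the libraries turn assembly into IR and how the analysis and optimization passes then produce very readable IR. The main function is in retdec-master\src\retdectool\retdec.cpp, and the key call is disassemble: its first, string-typed parameter is the path of the file to process, inputPath; its second is the generated IR result, stored through the pointer fs of type FunctionSet. It is worth looking at the retdec::common::Function data structure, which stores the function type, the IR and other useful information.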
General info:
    id     : 8 (add)
    addr   : 1000
    size   : 5
    bytes  : 05 34 12 00 00
    mnem   : add
    op str : eax, 0x1234
Detail info:
    R regs : 0
    W regs : 1
             25 (eflags)
    groups : 0
Architecture-dependent info:
    prefix : 00 00 00 00 (-, -, -, -)
    opcode : 05 00 00 00
    rex    : 0
    addr sz: 4
    modrm  : 0
    sib    : 0
    disp   : 0
    sib idx: 0 (-)
    sib sc : 0
    sib bs : 0 (-)
    sse cc : X86_SSE_CC_INVALID
    avx cc : X86_AVX_CC_INVALID
    avx sae: false
    avx rm : X86_AVX_RM_INVALID
    op cnt : 2
        type   : X86_OP_REG
        reg    : 19 (eax)
        size   : 4
        access : CS_AC_READ + CS_AC_WRITE
        avx bct: X86_AVX_BCAST_INVALID
        avx 0 m: false
        type   : X86_OP_IMM
        imm    : 1234
        size   : 4
        access : CS_AC_INVALID
        avx bct: X86_AVX_BCAST_INVALID
        avx 0 m: false
__asm_<mnem>(op0)
op0 = __asm_<mnem>(op0)
__asm_<mnem>(op0, op1)
op0 = __asm_<mnem>(op1)
op0 = __asm_<mnem>(op0, op1)
__asm_<mnem>(op0, op1, op2)
main
{
    llvm::Function::Create
    llvm::BasicBlock::Create
    Capstone2LlvmIrTranslator::createArch
    translate(po.code.data(), po.code.size(), po.base, irb)
}
Capstone2LlvmIrTranslator_impl<CInsn, CInsnOp>::translate
{
    cs_malloc
    cs_disasm_iter
    generateSpecialAsm2LlvmInstr
    translateInstruction // virtual function declared in capstone2llvmir_impl.h; each architecture has its own translateInstruction implementation
    {
        *f = *(_i2fm.find(i->id)) // if a translation function is found in the Instruction translation map _i2fm, call it directly through the pointer; corresponds to 1
        {
            translateAdd
            translateB
            ...
        }
        or
        translatePseudoAsmGeneric // if none is found, fall back to translatePseudoAsmGeneric; corresponds to 2
        {
            loadOp
            loadRegister
            getPseudoAsmFunction
            CreateCall // corresponds to 3
            storeOp
            storeRegister
        }
    }
}
My earlier notes on reading Capstone: https://bbs.pediy.com/thread-258473.htm
class _cs_insn(ctypes.Structure):
    _fields_ = (
        ('id', ctypes.c_uint),
        ('address', ctypes.c_uint64),
        ('size', ctypes.c_uint16),
        ('bytes', ctypes.c_ubyte * 16),
        ('mnemonic', ctypes.c_char * 32),
        ('op_str', ctypes.c_char * 160),
        ('detail', ctypes.POINTER(_cs_detail)),
    )
class _cs_detail(ctypes.Structure):
    _fields_ = (
        ('regs_read', ctypes.c_uint16 * 12),
        ('regs_read_count', ctypes.c_ubyte),
        ('regs_write', ctypes.c_uint16 * 20),
        ('regs_write_count', ctypes.c_ubyte),
        ('groups', ctypes.c_ubyte * 8),
        ('groups_count', ctypes.c_ubyte),
        ('arch', _cs_arch),
    )
class _cs_arch(ctypes.Union):
    _fields_ = (
        ('arm64', arm64.CsArm64),
        ('arm', arm.CsArm),
        ('m68k', m68k.CsM68K),
        ('mips', mips.CsMips),
        ('x86', x86.CsX86),
        ('ppc', ppc.CsPpc),
        ('sparc', sparc.CsSparc),
        ('sysz', systemz.CsSysz),
        ('xcore', xcore.CsXcore),
        ('tms320c64x', tms320c64x.CsTMS320C64x),
        ('m680x', m680x.CsM680x),
        ('evm', evm.CsEvm),
    )
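To get a feel for how the fields of _cs_insn are read, here is a self-contained ctypes sketch that uses only the standard library. CsInsnSketch is a simplified stand-in, not the real binding class: the detail pointer is replaced by a plain c_void_p so the example runs without capstone installed, and make_insn and describe are hypothetical helpers.

```python
import ctypes


class CsInsnSketch(ctypes.Structure):
    # Same general fields as _cs_insn; `detail` is simplified to a raw pointer.
    _fields_ = (
        ('id', ctypes.c_uint),
        ('address', ctypes.c_uint64),
        ('size', ctypes.c_uint16),
        ('bytes', ctypes.c_ubyte * 16),
        ('mnemonic', ctypes.c_char * 32),
        ('op_str', ctypes.c_char * 160),
        ('detail', ctypes.c_void_p),
    )


def make_insn(insn_id, address, raw, mnemonic, op_str):
    """Fill the structure by hand, as a disassembler would after decoding."""
    insn = CsInsnSketch()
    insn.id = insn_id
    insn.address = address
    insn.size = len(raw)
    insn.bytes[:len(raw)] = list(raw)  # only the first `size` bytes are meaningful
    insn.mnemonic = mnemonic.encode()
    insn.op_str = op_str.encode()
    return insn


def describe(insn):
    """Format the general info of one instruction."""
    raw = " ".join(f"{b:02x}" for b in insn.bytes[:insn.size])
    return f"{insn.address:#x}: {insn.mnemonic.decode()} {insn.op_str.decode()} [{raw}]"


insn = make_insn(8, 0x1000, bytes.fromhex("0534120000"), "add", "eax, 0x1234")
print(describe(insn))
```

The fixed-size arrays explain why a translator always has to consult size: bytes is 16 entries long regardless of how long the instruction really is.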
/// Instruction structure
typedef struct cs_arm {
    bool usermode;  ///< User-mode registers to be loaded (for LDM/STM instructions)
    int vector_size;  ///< Scalar size for vector instructions
    arm_vectordata_type vector_data;  ///< Data type for elements of vector instructions
    arm_cpsmode_type cps_mode;  ///< CPS mode for CPS instruction
    arm_cpsflag_type cps_flag;  ///< CPS mode for CPS instruction
    arm_cc cc;  ///< conditional code for this insn
    bool update_flags;  ///< does this insn update flags?
    bool writeback;  ///< does this insn write-back?
    arm_mem_barrier mem_barrier;  ///< Option for some memory barrier instructions
    /// Number of operands of this instruction,
    /// or 0 when instruction has no operand.
    uint8_t op_count;
    cs_arm_op operands[36];  ///< operands for this instruction.
} cs_arm;
typedef enum arm_cc {
    ARM_CC_INVALID = 0,
    ARM_CC_EQ,  ///< Equal                      Equal
    ARM_CC_NE,  ///< Not equal                  Not equal, or unordered
    ARM_CC_HS,  ///< Carry set                  >, ==, or unordered
    ARM_CC_LO,  ///< Carry clear                Less than
    ARM_CC_MI,  ///< Minus, negative            Less than
    ARM_CC_PL,  ///< Plus, positive or zero     >, ==, or unordered
    ARM_CC_VS,  ///< Overflow                   Unordered
    ARM_CC_VC,  ///< No overflow                Not unordered
    ARM_CC_HI,  ///< Unsigned higher            Greater than, or unordered
    ARM_CC_LS,  ///< Unsigned lower or same     Less than or equal
    ARM_CC_GE,  ///< Greater than or equal      Greater than or equal
    ARM_CC_LT,  ///< Less than                  Less than, or unordered
    ARM_CC_GT,  ///< Greater than               Greater than
    ARM_CC_LE,  ///< Less than or equal         <, ==, or unordered
    ARM_CC_AL   ///< Always (unconditional)     Always (unconditional)
} arm_cc;
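When a conditional ARM instruction is lifted, its cc value has to become a boolean condition over the N, Z, C and V flags. The Python sketch below spells out the standard ARM flag predicates for the condition codes in the enum above. It illustrates the semantics only; the function name condition_holds is made up, and it is not how RetDec structures this logic.

```python
def condition_holds(cc, n, z, c, v):
    """Evaluate an ARM condition code (by short name) against the N, Z, C, V flags."""
    table = {
        "EQ": z,                    # equal
        "NE": not z,                # not equal
        "HS": c,                    # carry set / unsigned higher or same
        "LO": not c,                # carry clear / unsigned lower
        "MI": n,                    # negative
        "PL": not n,                # positive or zero
        "VS": v,                    # overflow
        "VC": not v,                # no overflow
        "HI": c and not z,          # unsigned higher
        "LS": (not c) or z,         # unsigned lower or same
        "GE": n == v,               # signed greater than or equal
        "LT": n != v,               # signed less than
        "GT": (not z) and n == v,   # signed greater than
        "LE": z or n != v,          # signed less than or equal
        "AL": True,                 # always
    }
    return bool(table[cc])


# Flags after comparing 1 with 2 (1 - 2): negative, not zero, borrow, no overflow.
flags = dict(n=True, z=False, c=False, v=False)
print(condition_holds("LT", **flags), condition_holds("GE", **flags))
```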
/// Instruction operand
typedef struct cs_arm_op {
    int vector_index;  ///< Vector Index for some vector operands (or -1 if irrelevant)
    struct {
        arm_shifter type;
        unsigned int value;
    } shift;
    arm_op_type type;  ///< operand type
    union {
        int reg;  ///< register value for REG/SYSREG operand
        int32_t imm;  ///< immediate value for C-IMM, P-IMM or IMM operand
        double fp;  ///< floating point value for FP operand
        arm_op_mem mem;  ///< base/index/scale/disp value for MEM operand
        arm_setend_type setend;  ///< SETEND instruction's operand type
    };
    /// in some instructions, an operand can be subtracted or added to
    /// the base register,
    /// if TRUE, this operand is subtracted. otherwise, it is added.
    bool subtracted;
    /// How is this operand accessed? (READ, WRITE or READ|WRITE)
    /// This field is combined of cs_ac_type.
    /// NOTE: this field is irrelevant if engine is compiled in DIET mode.
    uint8_t access;
    /// Neon lane index for NEON instructions (or -1 if irrelevant)
    int8_t neon_lane;
} cs_arm_op;