[Original] Getting started with capstone2llvmir: how to translate assembly into LLVM IR
2021-5-8 18:28 19342

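Preface

This post briefly walks through the capstone2llvmir source code and how to build and run it locally. It is meant for readers who want to learn the basics of lifting assembly to IR, make small changes of their own, build them, and end up with a simple asm2llvmir tool. Once you have LLVM IR you can optimize it, deobfuscate it, and get up to all sorts of mischief; see my earlier posts:

https://bbs.pediy.com/thread-265335.htm Removing bogus control flow with compiler optimizations

https://bbs.pediy.com/thread-266323.htm Removing control-flow flattening with compiler optimizations

PS: the source has no asm-to-IR translation for some of the more complex instructions. You have to try writing those yourself, refine them over time, and build the result into your own tool.

capstone2llvmir and bin2llvmir are two very important parts of the RetDec source code; their job is to translate assembly into LLVM IR. I studied this great tool carefully and took the notes below.

Source code: https://retdec-tc.avast.com/repository/download/Retdec_DoxygenBuild/.lastSuccessful/build/doc/doxygen/html/files.html

The RetDec wiki has an introduction to capstone2llvmirtool at https://github.com/avast/retdec/wiki/Capstone2LlvmIr; what follows is my rough summary of it.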

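Getting started with capstone2llvmirtool

Four translation strategies for different instructions

1. Full semantic translation: the instruction's semantics are translated completely into IR. This is only done for instructions that are simple enough. PS: the source has no translation for many rarely used instructions; if you run into one, write it yourself, modeled on the existing code.

2. Translation to an intrinsic function call: some instructions are translated into intrinsic functions that most compilers understand, for example some jumps.

3. Translation to a pseudo assembly call: a pseudo call is created from the disassembly information Capstone provides. PS: when you see one of these calls, the instruction behind it has not really been translated. If it matters for optimization, write a translation function for it yourself; if not, just ignore it.

4. No translation: some instructions that are hard to translate are ignored.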

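How it works, in outline

First, create the translator module

1. Create an empty LLVM IR module.

2. Initialize the Capstone engine and the other data structures.

3. Create the architecture environment, i.e. the register-related data structures and so on.

3.1 Assembly addresses are mapped to an IR global variable: a store of the instruction's address to the program-counter global precedes the IR of each instruction.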

 

@_asm_program_counter = internal global i64 0
; ...
; add eax, 0x1234 @ 0x1000
store volatile i64 4096, i64* @_asm_program_counter
; ... LLVM IR sequence for the add instruction
; sub ebx, 0x1234 @ 0x1005
store volatile i64 4101, i64* @_asm_program_counter
; ... LLVM IR sequence for the sub instruction
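The program-counter bookkeeping above can be sketched in a few lines of Python. This only produces IR text for illustration; the function name `emit_pc_stores` and the `(address, asm_text)` input list are assumptions made for this example. RetDec itself builds these stores through `llvm::IRBuilder`, not through strings.

```python
def emit_pc_stores(instructions, pc_bits=64):
    """Return LLVM IR text lines that record each instruction address.

    `instructions` is a list of (address, asm_text) pairs. The helper name
    and the input shape are hypothetical; only the emitted IR mirrors the
    snippet above.
    """
    ty = f"i{pc_bits}"
    lines = [f"@_asm_program_counter = internal global {ty} 0"]
    for address, asm_text in instructions:
        # A comment naming the instruction, then the volatile store of its address.
        lines.append(f"; {asm_text} @ {address:#x}")
        lines.append(
            f"store volatile {ty} {address}, {ty}* @_asm_program_counter"
        )
    return lines
```

Calling `emit_pc_stores([(0x1000, "add eax, 0x1234"), (0x1005, "sub ebx, 0x1234")])` gives the two stores of 4096 and 4101 shown above. The store is volatile so that later optimization passes do not delete it, which keeps the address-to-IR mapping recoverable.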

 

3.2 Control-flow-related pseudo functions are generated. Ordinary IR branch instructions are not used at this stage because IR branches target basic-block labels, whereas assembly jumps target addresses.

; void (i<architecture_size> target_address)
declare void @__pseudo_call(i32)
; void (i<architecture_size> target_address)
declare void @__pseudo_return(i32)
; void (i<architecture_size> target_address)
declare void @__pseudo_branch(i32)
; void (i1 condition, i<architecture_size> target_address)
declare void @__pseudo_cond_branch(i1, i32)
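A small sketch of how these declarations depend on the architecture size, and of how a conditional jump ends up as a call that carries a raw address instead of a block label. Both helper names are invented for this example, and the output is plain text, not real LLVM objects.

```python
def pseudo_function_declarations(arch_bits):
    """Return declarations of the four control-flow pseudo functions.

    `arch_bits` is the architecture size (32 for x86, 64 for x86-64).
    The helper itself is hypothetical.
    """
    addr = f"i{arch_bits}"
    return [
        f"declare void @__pseudo_call({addr})",
        f"declare void @__pseudo_return({addr})",
        f"declare void @__pseudo_branch({addr})",
        f"declare void @__pseudo_cond_branch(i1, {addr})",
    ]


def lower_conditional_jump(cond_value, target_address, arch_bits):
    """Render a conditional jump as a call that carries the raw address."""
    return (f"call void @__pseudo_cond_branch(i1 {cond_value}, "
            f"i{arch_bits} {target_address})")
```

For example, `lower_conditional_jump("%zf", 0x1010, 32)` produces a call whose second argument is the integer 4112. A later stage that knows the function boundaries can then replace such calls with real branches.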

 

3.3 Global variables for the architecture's registers are initialized.

@eax = internal global i32 0
@ecx = internal global i32 0
; ...
@st0 = internal global x86_fp80 0xK00000000000000000000
@st1 = internal global x86_fp80 0xK00000000000000000000

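Then, run the translation through the translator

1. Disassemble the binary with the Capstone engine. For a single instruction, Capstone provides roughly the following information: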

 

add eax, 0x1234:

General info:
         id     8 (add)
         addr   :  1000
         size   :  5
         bytes  :  05 34 12 00 00
         mnem   :  add
         op str :  eax, 0x1234
 Detail info:
         R regs :  0
         W regs :  1
                 25 (eflags)
         groups :  0
 Architecture-dependent info:
         prefix :  00 00 00 00  (-, -, -, -)
         opcode :  05 00 00 00
         rex    :  0
         addr sz:  4
         modrm  :  0
         sib    :  0
         disp   :  0
         sib idx:  0 (-)
         sib sc :  0
         sib bs :  0 (-)
         sse cc :  X86_SSE_CC_INVALID
         avx cc :  X86_AVX_CC_INVALID
         avx sae:  false
         avx rm :  X86_AVX_RM_INVALID
         op cnt :  2
 
                 type   :  X86_OP_REG
                 reg    :  19 (eax)
                 size   :  4
                 access :  CS_AC_READ + CS_AC_WRITE
                 avx bct:  X86_AVX_BCAST_INVALID
                 avx 0 m:  false
 
                 type   :  X86_OP_IMM
                 imm    :  1234
                 size   :  4
                 access :  CS_AC_INVALID
                 avx bct:  X86_AVX_BCAST_INVALID
                 avx 0 m:  false
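As a sanity check of the dump above, the five bytes can be decoded by hand. The sketch below handles only this one encoding (opcode 0x05, ADD EAX, imm32) and says nothing about how Capstone works internally; `decode_add_eax_imm32` is a name invented for this example.

```python
def decode_add_eax_imm32(code, address):
    """Decode `05 id` (ADD EAX, imm32) and return a small dict.

    Hypothetical helper for illustration; it rejects anything else.
    """
    if len(code) < 5 or code[0] != 0x05:
        raise ValueError("not an ADD EAX, imm32 encoding")
    # x86 immediates are stored little-endian: 34 12 00 00 -> 0x1234.
    imm = int.from_bytes(code[1:5], "little")
    return {
        "addr": address,
        "size": 5,
        "mnem": "add",
        "op_str": f"eax, {imm:#x}",
        "next_addr": address + 5,
    }
```

With `bytes.fromhex("0534120000")` at address 0x1000 this yields the operand string `eax, 0x1234` and a next address of 0x1005, which is the address of the `sub` instruction in the earlier program-counter example.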

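2. Find a translation method and translate the instruction to IR. The id field identifies the instruction.

2.1 Capstone ID is mapped to an ID-specific routine: each id, i.e. each kind of instruction, has its own translation routine.

2.2 Capstone ID is mapped to a specific pseudo assembly generation method: the id selects one of the following pseudo call forms.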

__asm_<mnem>(op0)
op0 = __asm_<mnem>(op0)
__asm_<mnem>(op0, op1)
op0 = __asm_<mnem>(op1)
op0 = __asm_<mnem>(op0, op1)
__asm_<mnem>(op0, op1, op2)

2.3 Capstone ID is not mapped to any value: nothing matched. A call is created automatically from the Capstone-provided instruction info, so the result depends on the quality of the information Capstone provides.
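The pseudo call forms listed under 2.2 differ only in which operands are read and which are written. A rough sketch of that rule, written as text generation: operands that are read become arguments and a written operand receives the result. The function name and the `op0`, `op1` naming are assumptions for this example; the rule for unknown access follows the generic translator shown later in this post.

```python
READ, WRITE = 1, 2  # same bit values as CS_AC_READ and CS_AC_WRITE


def pseudo_call_form(mnemonic, accesses):
    """Render the pseudo call for operands op0, op1, ... as text.

    `accesses` holds one access mask per operand. An operand with an
    unknown mask (0) is treated as read. The function is hypothetical.
    """
    args = [f"op{i}" for i, acc in enumerate(accesses)
            if acc == 0 or acc & READ]
    written = [f"op{i}" for i, acc in enumerate(accesses) if acc & WRITE]
    call = f"__asm_{mnemonic}({', '.join(args)})"
    # Each written operand is assigned the value returned by the call.
    return " = ".join(written + [call])
```

For instance, an `add` whose first operand is read and written and whose second is read only renders as `op0 = __asm_add(op0, op1)`, while an instruction with a single read operand renders as a bare call.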

 

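Source layout:

Public interface: include/retdec/capstone2llvmir
Private (hidden) interface: src/capstone2llvmir

Interface: Capstone2LlvmIrTranslator
Implementation: Capstone2LlvmIrTranslator_impl
Architecture-specific implementation, e.g.: Capstone2LlvmIrTranslatorArm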

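Entry point of capstone2llvmir

Start at the entry point: the main function in capstone2llvmirtool/capstone2llvmir.cpp. (There is a second caller in retdec\src\bin2llvmir\optimizations\decoder; study these two and you will know how to use the translate function to turn asm into IR.) main first creates an llvm::Function and fills it with one basic block and a return, then creates the translator for the CPU architecture with Capstone2LlvmIrTranslator::createArch, and finally calls the translate function declared in capstone2llvmir/capstone2llvmir.h, passing data, size and base together with the IRBuilder irb through which the IR is emitted.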

main
{
 llvm::Function::Create
 llvm::BasicBlock::Create
 Capstone2LlvmIrTranslator::createArch
 translate(po.code.data(), po.code.size(), po.base, irb)
}

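The translate function

cs_malloc allocates the cs_insn buffer that goes with the Capstone handle; cs_disasm_iter then uses the handle to disassemble the binary one instruction at a time into insn.

generateSpecialAsm2LlvmInstr is a key function: it turns the address of insn into a store to the LLVM global variable. Every architecture has a program counter that records which address execution has reached (pc on ARM), and it changes with every executed instruction. Here the pc value comes from the global value that generateSpecialAsm2LlvmInstr writes.

translateInstruction is where insn is actually translated to IR. Its paths correspond to the translation strategies described earlier; here is the skeleton: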

Capstone2LlvmIrTranslator_impl<CInsn, CInsnOp>::translate
{
 cs_malloc
 cs_disasm_iter 
 generateSpecialAsm2LlvmInstr
 translateInstruction // virtual function declared in capstone2llvmir_impl.h; each architecture has its own implementation
 {
  *f=*(_i2fm.find(i->id)) // if the instruction translation map _i2fm has a translation function, call it through the pointer (strategy 1)
  {
   translateAdd
   translateB
   ...
  }
  or translatePseudoAsmGeneric // if none is found, fall back to translatePseudoAsmGeneric (strategy 2)
  {
   loadOp
   loadRegister
   getPseudoAsmFunction
   CreateCall                  // strategy 3
   storeOp
   storeRegister
  }
 }
}

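The translateInstruction function

translateInstruction is the key function that turns an assembly insn into LLVM IR. It is a virtual function declared in capstone2llvmir_impl.h.

Each architecture has its own translateInstruction implementation; the ARM one is in src\capstone2llvmir\arm\arm.cpp.

An important structure here is _cs_insn. Capstone's Python 3 binding is installed on my machine, so let's look at the structure through the Python binding, followed by the matching C definitions for ARM: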

My earlier notes on reading Capstone: https://bbs.pediy.com/thread-258473.htm
class _cs_insn(ctypes.Structure):
    _fields_ = (
        ('id', ctypes.c_uint),
        ('address', ctypes.c_uint64),
        ('size', ctypes.c_uint16),
        ('bytes', ctypes.c_ubyte * 16),
        ('mnemonic', ctypes.c_char * 32),
        ('op_str', ctypes.c_char * 160),
        ('detail', ctypes.POINTER(_cs_detail)),
    )
class _cs_detail(ctypes.Structure):
    _fields_ = (
        ('regs_read', ctypes.c_uint16 * 12),
        ('regs_read_count', ctypes.c_ubyte),
        ('regs_write', ctypes.c_uint16 * 20),
        ('regs_write_count', ctypes.c_ubyte),
        ('groups', ctypes.c_ubyte * 8),
        ('groups_count', ctypes.c_ubyte),
        ('arch', _cs_arch),
    )
class _cs_arch(ctypes.Union):
    _fields_ = (
        ('arm64', arm64.CsArm64),
        ('arm', arm.CsArm),
        ('m68k', m68k.CsM68K),
        ('mips', mips.CsMips),
        ('x86', x86.CsX86),
        ('ppc', ppc.CsPpc),
        ('sparc', sparc.CsSparc),
        ('sysz', systemz.CsSysz),
        ('xcore', xcore.CsXcore),
        ('tms320c64x', tms320c64x.CsTMS320C64x),
        ('m680x', m680x.CsM680x),
        ('evm', evm.CsEvm),
    )   
/// Instruction structure
typedef struct cs_arm {
    bool usermode;    ///< User-mode registers to be loaded (for LDM/STM instructions)
    int vector_size;     ///< Scalar size for vector instructions
    arm_vectordata_type vector_data; ///< Data type for elements of vector instructions
    arm_cpsmode_type cps_mode;    ///< CPS mode for CPS instruction
    arm_cpsflag_type cps_flag;    ///< CPS mode for CPS instruction
    arm_cc cc;            ///< conditional code for this insn
    bool update_flags;    ///< does this insn update flags?
    bool writeback;        ///< does this insn write-back?
    arm_mem_barrier mem_barrier;    ///< Option for some memory barrier instructions
 
    /// Number of operands of this instruction,
    /// or 0 when instruction has no operand.
    uint8_t op_count;
 
    cs_arm_op operands[36];    ///< operands for this instruction.
} cs_arm; 
typedef enum arm_cc {
    ARM_CC_INVALID = 0,
    ARM_CC_EQ,            ///< Equal                      Equal
    ARM_CC_NE,            ///< Not equal                  Not equal, or unordered
    ARM_CC_HS,            ///< Carry set                  >, ==, or unordered
    ARM_CC_LO,            ///< Carry clear                Less than
    ARM_CC_MI,            ///< Minus, negative            Less than
    ARM_CC_PL,            ///< Plus, positive or zero     >, ==, or unordered
    ARM_CC_VS,            ///< Overflow                   Unordered
    ARM_CC_VC,            ///< No overflow                Not unordered
    ARM_CC_HI,            ///< Unsigned higher            Greater than, or unordered
    ARM_CC_LS,            ///< Unsigned lower or same     Less than or equal
    ARM_CC_GE,            ///< Greater than or equal      Greater than or equal
    ARM_CC_LT,            ///< Less than                  Less than, or unordered
    ARM_CC_GT,            ///< Greater than               Greater than
    ARM_CC_LE,            ///< Less than or equal         <, ==, or unordered
    ARM_CC_AL             ///< Always (unconditional)     Always (unconditional)
} arm_cc;
/// Instruction operand
typedef struct cs_arm_op {
    int vector_index;    ///< Vector Index for some vector operands (or -1 if irrelevant)
 
    struct {
        arm_shifter type;
        unsigned int value;
    } shift;
 
    arm_op_type type;    ///< operand type
 
    union {
        int reg;    ///< register value for REG/SYSREG operand
        int32_t imm;            ///< immediate value for C-IMM, P-IMM or IMM operand
        double fp;            ///< floating point value for FP operand
        arm_op_mem mem;        ///< base/index/scale/disp value for MEM operand
        arm_setend_type setend; ///< SETEND instruction's operand type
    };
 
    /// in some instructions, an operand can be subtracted or added to
    /// the base register,
    /// if TRUE, this operand is subtracted. otherwise, it is added.
    bool subtracted;
 
    /// How is this operand accessed? (READ, WRITE or READ|WRITE)
    /// This field is combined of cs_ac_type.
    /// NOTE: this field is irrelevant if engine is compiled in DIET mode.
    uint8_t access;
 
    /// Neon lane index for NEON instructions (or -1 if irrelevant)
    int8_t neon_lane;
} cs_arm_op;
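The `arm_cc` values above decide whether a conditionally executed ARM instruction takes effect. As a reference while reading the translator, here is a sketch that evaluates each condition against the N, Z, C and V flags using the standard ARM definitions. It is not RetDec code: the translator computes an equivalent i1 value in IR from its flag registers, and `condition_holds` is a name made up for this example.

```python
def condition_holds(cc, n, z, c, v):
    """Evaluate an ARM condition code against the N, Z, C, V flags.

    `cc` is the suffix of the enum name (e.g. "EQ"); flags are booleans.
    """
    table = {
        "EQ": z,                    # equal
        "NE": not z,                # not equal
        "HS": c,                    # carry set / unsigned higher or same
        "LO": not c,                # carry clear / unsigned lower
        "MI": n,                    # negative
        "PL": not n,                # positive or zero
        "VS": v,                    # overflow
        "VC": not v,                # no overflow
        "HI": c and not z,          # unsigned higher
        "LS": (not c) or z,         # unsigned lower or same
        "GE": n == v,               # signed greater than or equal
        "LT": n != v,               # signed less than
        "GT": (not z) and n == v,   # signed greater than
        "LE": z or n != v,          # signed less than or equal
        "AL": True,                 # always
    }
    return bool(table[cc])
```

For example, with all four flags clear, `GT` holds and `LE` does not.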

A quick look at the translateInstruction code

void Capstone2LlvmIrTranslatorArm_impl::translateInstruction(
        cs_insn* i,
        llvm::IRBuilder<>& irb)
{
    _insn = i;
 
    cs_detail* d = i->detail;
    cs_arm* ai = &d->arm; // the ARM-specific instruction details live here
 
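    auto fIt = _i2fm.find(i->id); // id identifies the kind of instruction
    // _i2fm is a lookup table initialized in arm_init.cpp. It maps some ARM instructions to their translation methods, e.g. ARM_INS_ADC to Capstone2LlvmIrTranslatorArm_impl::translateAdc; many instructions still have no translation function.
    if (fIt != _i2fm.end() && fIt->second != nullptr) // found in the table
    {
        auto f = fIt->second; // the translation method f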
 
        bool branchInsn = i->id == ARM_INS_B || i->id == ARM_INS_BX
                || i->id == ARM_INS_BL || i->id == ARM_INS_BLX
                || i->id == ARM_INS_CBZ || i->id == ARM_INS_CBNZ;
        if (ai->cc == ARM_CC_AL || ai->cc == ARM_CC_INVALID || branchInsn)
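        // Separates unconditional instructions from conditionally executed ones; cc stands for condition code. Branch instructions also take this path, presumably because their translation routines deal with the condition themselves.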
        {
            _inCondition = false;
            (this->*f)(i, ai, irb); // call f through the member-function pointer, emitting into irb
        }
        else
        {
            _inCondition = true;
 
            auto* cond = generateInsnConditionCode(irb, ai); // conditional instruction: compute the condition first, then generateIfThen builds the if-then and returns bodyIrb for its body
            auto bodyIrb = generateIfThen(cond, irb);
 
            (this->*f)(i, ai, bodyIrb);
        }
    }
    else
    {
        throwUnhandledInstructions(i);
 
        if (ai->cc == ARM_CC_AL || ai->cc == ARM_CC_INVALID)
        {
            _inCondition = false;
            translatePseudoAsmGeneric(i, ai, irb); // no entry in _i2fm: fall back to translatePseudoAsmGeneric, defined in capstone2llvmir_impl.cpp
        }
        else
        {
            _inCondition = true;
 
            auto* cond = generateInsnConditionCode(irb, ai);
            auto bodyIrb = generateIfThen(cond, irb);
 
            translatePseudoAsmGeneric(i, ai, bodyIrb);
        }
    }
}

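An important lookup table here is _i2fm, short for instruction translation map. It maps each assembly instruction to a pointer to its IR translation function; for example, ARM_INS_ADC (add with carry) maps to &Capstone2LlvmIrTranslatorArm_impl::translateAdc.

arm_init.cpp also defines two other important tables: r2n, which maps registers to their symbolic names, and r2t, which maps registers to their types. Together they model the ARM register set as C++ data structures.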
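The dispatch described above can be modeled in a few lines. This is a toy, not RetDec's implementation: instructions are plain dicts, the "IR" is a list of strings, and the class and method names are invented for illustration. It only shows the table lookup, the fallback to the generic pseudo translation, and the if-then wrapping of conditional instructions (the special case for branch instructions is left out).

```python
class TinyTranslator:
    """Toy model of the dispatch in translateInstruction (hypothetical)."""

    def __init__(self):
        self.out = []
        # Plays the role of _i2fm: instruction id -> translation handler.
        self.i2fm = {"add": self.translate_add}

    def translate_add(self, insn, out):
        dst, src = insn["ops"]
        out.append(f"{dst} = add {dst}, {src}")

    def translate_pseudo_asm_generic(self, insn, out):
        out.append(f"call __asm_{insn['id']}({', '.join(insn['ops'])})")

    def translate_instruction(self, insn):
        # Look up a dedicated handler, else use the generic fallback.
        handler = self.i2fm.get(insn["id"], self.translate_pseudo_asm_generic)
        if insn.get("cc", "AL") in ("AL", "INVALID"):
            handler(insn, self.out)
        else:
            # Conditional execution: the body is emitted inside an if-then.
            body = []
            handler(insn, body)
            self.out.append(f"if {insn['cc']}:")
            self.out.extend("  " + line for line in body)
        return self.out
```

Feeding it `{"id": "add", "ops": ["r0", "r1"]}` uses the dedicated handler, while `{"id": "clz", "ops": ["r0", "r1"], "cc": "EQ"}` has no table entry and comes out as a conditional `__asm_clz` pseudo call.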

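The translatePseudoAsmGeneric function

This function translates asm into a generic pseudo function call; it handles the instructions that have no translation function in the _i2fm table.

1. From the instruction information Capstone provides, work out how many register and non-register reads and writes the generated IR involves, which LLVM types and values need to be created, whether the function returns a value, and so on.

2. Build the parameters and the return value from those LLVM types and values, form the function name by concatenating __asm_ with the mnemonic insn->mnemonic, and generate an empty-shell pseudo function.

3. So when the generated IR contains functions named after assembly mnemonics, you know those instructions have no translation function, and you can write one yourself that translates them fully. That said, the translation functions already in the _i2fm table are enough in about 99% of cases.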

void Capstone2LlvmIrTranslator_impl<CInsn, CInsnOp>::translatePseudoAsmGeneric(
        cs_insn* i,
        CInsn* ci,
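        llvm::IRBuilder<>& irb) // cs_insn is the full structure with address, mnemonic, op_str and detail; CInsn is the architecture-specific detail structure (cs_arm on ARM, i.e. the ai pointer that translateInstruction passes in)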
{
    std::vector<llvm::Value*> vals;
    std::vector<llvm::Type*> types;
 
    unsigned writeCnt = 0;
    llvm::Type* writeType = getDefaultType();
    bool writesOp = false;
    for (std::size_t j = 0; j < ci->op_count; ++j) // first walk the operands in CInsn to decide which values and types the IR call needs
    {
        auto& op = ci->operands[j];
        auto access = getOperandAccess(op); // is the operand read, written, or neither?
// regs_read: per its description, the list of implicitly read registers; in my tests only pc, lr, sp and the status register ever appear in it
// regs_write: per its description, the list of implicitly written registers; in my tests only pc, lr, sp and the status register ever appear in it
// regs_access: the two results above combined
// # Access types for instruction operands.
// CS_AC_INVALID  = 0        # Invalid/uninitialized access type.
// CS_AC_READ     = (1 << 0) # Operand that is read from.
// CS_AC_WRITE    = (1 << 1) # Operand that is written to.
        if (access == CS_AC_INVALID || (access & CS_AC_READ)) // operand is read (or its access type is unknown): loadOp yields the LLVM value; record the value in vals and its type in types
        {
            auto* o = loadOp(op, irb);
            vals.push_back(o);
            types.push_back(o->getType());
        }
 
        if (access & CS_AC_WRITE) // operand is written: set writesOp; a register operand contributes its type (getRegisterType) as the candidate return type,
                                // anything else (e.g. a write to memory) falls back to the default type
        {
            writesOp = true;
            ++writeCnt;
 
            if (isOperandRegister(op)) // the written operand is a register
            {
                auto* t = getRegisterType(op.reg);
                if (writeCnt == 1 || writeType == t)
                {
                    writeType = t;
                }
                else
                {
                    writeType = getDefaultType();
                }
            }
            else
            {
                writeType = getDefaultType();
            }
        }
    }
 
    if (vals.empty()) // the operand loop produced no arguments: fall back to the implicitly read registers in detail->regs_read
    {
        // All registers must be ok, or don't use them at all.
        std::vector<uint32_t> readRegs;
        readRegs.reserve(i->detail->regs_read_count);
        for (std::size_t j = 0; j < i->detail->regs_read_count; ++j)
        {
            auto r = i->detail->regs_read[j];
            if (getRegister(r))
            {
                readRegs.push_back(r);
            }
            else
            {
                readRegs.clear();
                break;
            }
        }
 
        for (auto r : readRegs)
        {
            auto* op = loadRegister(r, irb); // load each of those registers with loadRegister
            vals.push_back(op);
            types.push_back(op->getType());
        }
    }
 
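    auto* retType = writesOp ? writeType : irb.getVoidTy(); // if some operand is written, the return type is writeType; otherwise it is void
    llvm::Function* fnc = getPseudoAsmFunction( // creates the LLVM function prototype for cs_insn* i, with parameter types `types` and return type retType
            i,                                 // the function is an empty shell without a body; getPseudoAsmFunctionName names it by concatenating __asm_ with the mnemonic insn->mnemonic
            retType,                           // so when such a name shows up in the generated IR, you know the instruction has no translation function and can write one modeled on Capstone2LlvmIrTranslatorArm_impl::translateAdc
            types);
 
    auto* c = irb.CreateCall(fnc, vals); // emit the call c to the pseudo function fnc with arguments vals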
 
    std::set<uint32_t> writtenRegs;
    if (retType)
    {
        for (std::size_t j = 0; j < ci->op_count; ++j) // walk the operands again, this time looking for writes
        {
            auto& op = ci->operands[j];
            if (getOperandAccess(op) & CS_AC_WRITE) // this operand is written
            {
                storeOp(op, c, irb); // storeOp emits the store of the call result into the operand
 
                if (isOperandRegister(op))
                {
                    writtenRegs.insert(op.reg); // remember the register in the writtenRegs set
                }
            }
        }
    }
 
    // All registers must be ok, or don't use them at all.
    std::vector<uint32_t> writeRegs;
    writeRegs.reserve(i->detail->regs_write_count);
    for (std::size_t j = 0; j < i->detail->regs_write_count; ++j) // then collect the implicitly written registers from detail (skipping those already in writtenRegs) into writeRegs
    {
        auto r = i->detail->regs_write[j];
        if (writtenRegs.count(r))
        {
            // silently ignore
        }
        else if (getRegister(r))
        {
            writeRegs.push_back(r);
        }
        else
        {
            writeRegs.clear();
            break;
        }
    }
 
    for (auto r : writeRegs)
    {
        llvm::Value* val = retType->isVoidTy()
                ? llvm::cast<llvm::Value>(
                        llvm::UndefValue::get(getRegisterType(r)))
                : llvm::cast<llvm::Value>(c);
        storeRegister(r, val, irb); // store to every register in writeRegs; registers already handled by storeOp were excluded above to avoid duplicate stores
    }
}
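The least obvious part of the function above is how `writeType` ends up as the return type. The sketch below replays just that rule with type names as plain strings. `pseudo_return_type` and its input format are assumptions made for this example; the real code works with `llvm::Type` pointers and Capstone operands.

```python
def pseudo_return_type(written_operands, default_type="i32"):
    """Pick the pseudo function's return type from the written operands.

    `written_operands` lists, in operand order, the register type of each
    written operand, or None when the written operand is not a register.
    Returns "void" when nothing is written.
    """
    if not written_operands:
        return "void"
    write_type = default_type
    for count, reg_type in enumerate(written_operands, start=1):
        # Keep a register type only for the first write, or when it
        # agrees with the type chosen so far; otherwise use the default.
        if reg_type is not None and (count == 1 or write_type == reg_type):
            write_type = reg_type
        else:
            write_type = default_type
    return write_type
```

So one written 64-bit register gives `i64`, two written registers of different types fall back to the default type, and a write to memory also gives the default type.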

Building it yourself

  • git clone https://github.com/avast/retdec.git

  • cd retdec

  • mkdir build && cd build

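  • Syntax: cmake .. -DCMAKE_INSTALL_PREFIX=<path> -DRETDEC_ENABLE_<component>=ON

    cmake ../ -DRETDEC_ENABLE_CAPSTONE2LLVMIRTOOL=ON builds only the capstone2llvmir front end. This is the plain instruction-by-instruction asm-to-IR translation logic that this post covers.

    //cmake ../ -DRETDEC_ENABLE_BIN2LLVMIRTOOL=ON Note: this is from an earlier version. There is no BIN2LLVMIRTOOL any more, only a library.

    cmake ../ -DRETDEC_ENABLE_RETDECTOOL=ON builds only the retdectool front end, the successor of the earlier bin2llvmir front end. It takes the IR produced by capstone2llvmir and then runs analysis and optimization passes over that initial IR. The bin2llvmir passes include a reaching-definitions analysis and a constructor/destructor analysis, among others, which are very clever and worth studying. The key interface function is retdec::disassemble(po.inputFile, &fs).

    PS: this step downloads the capstone, keystone and llvm dependencies from git, which was slow for me; you can switch to a mirror inside China.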

    git remote set-url --push origin  https://github.com/Hackergeek/architectur
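  • make -jN (N is usually the number of cores + 1), then find the executables under retdec\build\src\, as follows:

    retdec-decompiler: bin2llvmir2cpp, i.e. the full pipeline from binary through LLVM IR to C-like source

    retdectool: bin2llvmir (capstone2llvmir followed by several optimization passes)

    capstone2llvmirtool: plain capstone2llvmir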

./retdec-decompiler --help
./retdec-decompiler:
Mandatory arguments:
        INPUT_FILE File to decompile.
General arguments:
        [-o|--output FILE] Output file (default: INPUT_FILE.c if OUTPUT_FORMAT is plain, INPUT_FILE.c.json if OUTPUT_FORMAT is json|json-human).
        [-s|--silent] Turns off informative output of the decompilation.
        [-f|--output-format OUTPUT_FORMAT] Output format [plain|json|json-human] (default: plain).
        [-m|--mode MODE] Force the type of decompilation mode [bin|raw] (default: bin).
        [-p|--pdb FILE] File with PDB debug information.
        [-k|--keep-unreachable-funcs] Keep functions that are unreachable from the main function.
        [--cleanup] Removes temporary files created during the decompilation.
        [--config] Specify JSON decompilation configuration file.
        [--disable-static-code-detection] Prevents detection of statically linked code.
Selective decompilation arguments:
        [--select-ranges RANGES] Specify a comma separated list of ranges to decompile (example: 0x100-0x200,0x300-0x400,0x500-0x600).
        [--select-functions FUNCS] Specify a comma separated list of functions to decompile (example: fnc1,fnc2,fnc3).
        [--select-decode-only] Decode only selected parts (functions/ranges). Faster decompilation, but worse results.
Raw or Intel HEX decompilation arguments:
        [-a|--arch ARCH] Specify target architecture [mips|pic32|arm|thumb|arm64|powerpc|x86|x86-64].
                         Required if it cannot be autodetected from the input (e.g. raw mode, Intel HEX).
        [-e|--endian ENDIAN] Specify target endianness [little|big].
                             Required if it cannot be autodetected from the input (e.g. raw mode, Intel HEX).
        [-b|--bit-size SIZE] Specify target bit size [16|32|64] (default: 32).
                             Required if it cannot be autodetected from the input (e.g. raw mode).
        [--raw-section-vma ADDRESS] Virtual address where section created from the raw binary will be placed.

retdectool

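retdectool is what used to be the bin2llvmir executable. Starting from its entry point is a good way to learn how the various libraries turn assembly into IR, and how analysis and optimization passes then make that IR readable. The main function is in retdec-master\src\retdectool\retdec.cpp, and the key call is disassemble. Its first parameter, a string reference, is the path of the file to process (inputPath); the second, fs, is a pointer to a FunctionSet that receives the results. It is worth looking at the retdec::common::Function data structure, which stores the function type, IR and other useful information.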

main
{
LlvmModuleContextPair disassemble(
        const std::string& inputPath,
        retdec::common::FunctionSet* fs)
 {
    auto context = std::make_unique<llvm::LLVMContext>();
    auto module = createLlvmModule(*context);
 
    config::Config c;
    c.parameters.setInputFile(inputPath);
 
    // Create a PassManager to hold and optimize the collection of passes we
    // are about to build.
    llvm::legacy::PassManager pm; // the PassManager manages passes: add any number of passes to it, then run() executes them in order
 
    pm.add(new bin2llvmir::ProviderInitialization(&c));
    // ProviderInitialization derives from ModulePass (src\bin2llvmir\optimizations\provider_init); its runOnModule is what gets executed
    pm.add(new bin2llvmir::Decoder());
    // the Decoder pass is a further wrapper around capstone2llvmir
    // Now that we have all of the passes ready, run them.
    pm.run(*module);
 
    fillFunctions(*module, fs);
 
    return LlvmModuleContextPair{std::move(module), std::move(context)};
 }
}
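The add-then-run pattern in disassemble can be illustrated with a minimal stand-in for the pass manager. Everything here is a simplification for illustration: passes are plain functions over a dict, the pass bodies are made up, the function name `function_1000` is invented, and the assumption that the decoder depends on state prepared by the earlier pass is mine, not something the source above states.

```python
class PassManager:
    """Minimal stand-in for llvm::legacy::PassManager (illustration only)."""

    def __init__(self):
        self._passes = []

    def add(self, module_pass):
        self._passes.append(module_pass)

    def run(self, module):
        # Passes run in the order they were added; the result reports
        # whether any pass modified the module.
        changed = False
        for module_pass in self._passes:
            changed = module_pass(module) or changed
        return changed


def provider_initialization(module):
    # Hypothetical body: prepare shared state, leave the IR untouched.
    module["providers_ready"] = True
    return False


def decoder(module):
    # Hypothetical body: refuse to run without the prepared state.
    if not module.get("providers_ready"):
        raise RuntimeError("decoder expects provider_initialization first")
    module["functions"] = ["function_1000"]
    return True
```

Adding `provider_initialization` and then `decoder` and calling `run` on an empty dict fills in the function list; adding them in the opposite order raises, which is why the order of the `pm.add` calls matters.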

Appendix: capstone2llvmir directory layout and source files

capstone2llvmir (main directory)
  arm (subdirectory)
    arm.cpp: ARM implementation of Capstone2LlvmIrTranslator (ARM translation, implementation)
    arm_impl.h: ARM implementation of Capstone2LlvmIrTranslator (ARM translation, declarations)
    arm_init.cpp: Initializations for ARM implementation of Capstone2LlvmIrTranslator (initialization)
  capstone2llvmir.cpp: Converts bytes to Capstone representation, and Capstone representation to LLVM IR (key interface)
  capstone2llvmir_impl.cpp: Common public interface for translators converting bytes to LLVM IR (implementation of the key interface)
  capstone2llvmir_impl.h: Common private implementation for translators converting bytes to LLVM IR
  capstone_utils.h: Utility functions for types, enums, etc. defined in Capstone
  exceptions.cpp: Definitions of exceptions used in capstone2llmvir library
  llvmir_utils.cpp: LLVM IR utilities
  llvmir_utils.h: LLVM IR utilities

Latest replies (10)

#2 挤蹭菌衣 2021-5-8 18:33
Sincere thanks to 指纹 and all the other experts in the group who helped me along.

#3 lookzo 2021-5-10 08:37
Thanks for sharing.

#4 xiaohang 2021-5-10 16:41
Thanks for sharing.
Last edited by xiaohang on 2021-5-10 16:44.

#5 挤蹭菌衣 2021-5-10 19:03
I looked at the source again. retdectool seems to do far less than the old bin2llvmir and only uses a few passes; I had assumed too much. Studying the decompile function gives a more complete picture.

#6 Thehepta 2021-5-10 19:29
Are you planning to analyze all of RetDec?

#7 挤蹭菌衣 2021-5-11 07:54
Quoting ChicWalk: Are you planning to analyze all of RetDec?
I plan to work through the important parts.

#8 Thehepta 2021-5-11 09:34
Keep it up! I'm already riding your coattails.

#9 bxc 2021-5-11 09:41
Impressive!

#10 挤蹭菌衣 2021-5-11 10:36
Quoting ChicWalk: Keep it up! I'm already riding your coattails.
After all, F5 (IDA's decompiler) has no source available, so if you want to learn how decompilers work, RetDec is the easier way in. After IDA and Ghidra, I'd say RetDec makes a passable third.

#11 挤蹭菌衣 2021-5-11 10:38
Quoting bxc: Impressive!
You flatter me.