[Original] A post about Devil.exe from the desctf held by 天命战队 (the Tianming team) (not a negative review)
Posted: 7 hours ago 181


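Since registering I have never posted on Kanxue (看雪), apart from the challenge-submission posts for the contests of the previous two years. Last year the big names all retired; I won a prize, so this year I am not playing on the attacking side.

Sigh, CTF is in decline.

I am posting this challenge with the original author's permission. It is a very good challenge. More precisely, it has an artistic quality and provokes thought. (Otherwise I would not have hacked together this much useless code in the time left over from my boss's tasks and my studies, let alone written an article and submitted it.)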

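First, a few words about the contest.

I had not played a CTF in a long time. This contest was hosted by my mentor's team, so with my boss's approval I played it for the purpose of learning. I registered the account n00bzx, a throwaway that was not competing for points. After finishing the sign-in challenge and submitting its flag, I saved the challenges, went to sleep, and worked through them slowly. The challenges are not going to disappear, right? This was their first time hosting a contest; the challenges were not too hard and the atmosphere in the discussion group was good. So it was a very successful first event. This is only the opinion of an RE noob. I do not know the other categories, so I am in no position to judge them. I hope their contests keep getting better. Thanks to the challenge author for permission to post this shallow analysis.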

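The challenge attachment appears later in the post. If the author would rather I not share the attachment, please tell me as soon as you see this; I will delete the whole post immediately, revise it, and repost. Scold me however you like!

This post only covers the approach; it does not provide a concrete writeup.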

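Opening it in IDA goes without saying. It is written in C++ and has no symbols (which does not matter). At some point (which includes anti-debugging) it computes a hash of some section with some algorithm and stores it in a global variable (what is it used for later?). It inline-hooks the CRT (this hook is fairly important; what is it used for later?). The hook uses VEH (somewhat unusual) and puts a few things in the handler, which also works as anti-debugging to a degree; bypassing it takes a bit of driver-style thinking (only the thinking, mind you), in my opinion. After some further xxxx comes the algorithm, which is what I want to talk about. I suggest getting through everything before it within 15 minutes (that is a long time, definitely enough). After those 15 minutes, go straight to the algorithm. After a round of IDA xxxx (creating structures, renaming variables, xxxx static-analysis techniques; this is for readers who have never done any of it, since surely some are even more of a noob than I am!), you get the preliminary logic: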

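As Uncle (老爹) would say: one more thing! All the code here has a few small bugs (whether or not I added them on purpose, you guess), so do not compile and run it as is (you may be able to, but the result will definitely be wrong!). Solving challenges is also learning, and you should have your own line of thought. (A noob's humble opinion.)

Now, here is the actual code (smirk)

I hate the CRT! So I disabled all the default libraries, which shrinks the program (pointless; I went a bit crazy playing with minimal PE files a while back, just treat me as an idiot)... To be honest, this code compiles and runs correctly, but it needs the program file on the desktop... It needs no libraries (including the default ones (smirk)), is compiled with VS2022 (not required, anything works; everyone likes to say this, so I will too), in C mode (also not really required, obviously). In short, no special settings, just compile (obvious again).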

#include<stdio.h>
#include<windows.h>
BYTE* unk_51E000 = 0;
BYTE* unk_866000 = 0;
BYTE* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
    DWORD i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
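    /* _vsnprintf does not null-terminate on truncation, so leave the last zeroed byte untouched. */
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer) - 1, format, args);
    va_end(args);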
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void sub_4011A0(BYTE* a1)
{
    int v2[16];
    int i;
    char v4[16];
    v2[0] = 0;
    v2[1] = 5;
    v2[2] = 10;
    v2[3] = 15;
    v2[4] = 4;
    v2[5] = 9;
    v2[6] = 14;
    v2[7] = 3;
    v2[8] = 8;
    v2[9] = 13;
    v2[10] = 2;
    v2[11] = 7;
    v2[12] = 12;
    v2[13] = 1;
    v2[14] = 6;
    v2[15] = 11;
    for (i = 0; i < 16; ++i)
    {
        v4[i] = a1[v2[i]];
    }
    memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
    int v2[16];
    BYTE temp[16];
    int i;
    v2[0] = 0;
    v2[1] = 5;
    v2[2] = 10;
    v2[3] = 15;
    v2[4] = 4;
    v2[5] = 9;
    v2[6] = 14;
    v2[7] = 3;
    v2[8] = 8;
    v2[9] = 13;
    v2[10] = 2;
    v2[11] = 7;
    v2[12] = 12;
    v2[13] = 1;
    v2[14] = 6;
    v2[15] = 11;
    memcpy(temp, a1, 0x10);
    for (i = 0; i < 16; ++i)
    {
        a1[v2[i]] = temp[i];
    }
}
void sub_401270(BYTE *input_pass, BYTE *out)
{
    int n;
    int m;
    int i;
    BYTE aa, bb, cc, dd;
    BYTE low, high;
    int j;
    int k;
    BYTE const1[] = { 0xB8,0xA1,0xD9,0xB9,0xD8,0x3B,0x17,0x91,0x75,0x12,0x1B,0x74,0x18,0x5B,0x16,0x39,0x76,0xA2,0x0C,0xFA,0x90,0x94,0x36,0x41,0x58,0x59,0x43,0xD4,0x47,0x92,0x2D,0xEA };
    BYTE const2[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
    for (i = 0; i < 16; ++i)
    {
        input_pass[i] ^= const2[i];
    }
    for (j = 0; j < 13; ++j)
    {
        sub_4011A0(input_pass);
        for (k = 0; k < 4; ++k)
        {
            BYTE v14_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v12_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v10_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v8_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v14_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v12_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v10_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v8_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v14_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v12_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v10_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v8_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v14_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v12_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v10_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v8_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 1280 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 512 + 24576 * j + 6144 * k + 16 * (v14_a & 0xF) + (v12_a & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 768 + 24576 * j + 6144 * k + 16 * (v10_a & 0xF) + (v8_a & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 1024 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 24576 * j + 6144 * k + 16 * (v14_a >> 4) + (v12_a >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 256 + 24576 * j + 6144 * k + 16 * (v10_a >> 4) + (v8_a >> 4)]];
            aa = high;
            aa <<= 4;
            aa |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 2816 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 2048 + 24576 * j + 6144 * k + 16 * (v14_b & 0xF) + (v12_b & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 2304 + 24576 * j + 6144 * k + 16 * (v10_b & 0xF) + (v8_b & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 2560 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 1536 + 24576 * j + 6144 * k + 16 * (v14_b >> 4) + (v12_b >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 1792 + 24576 * j + 6144 * k + 16 * (v10_b >> 4) + (v8_b >> 4)]];
            bb = high;
            bb <<= 4;
            bb |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 4352 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 3584 + 24576 * j + 6144 * k + 16 * (v14_c & 0xF) + (v12_c & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 3840 + 24576 * j + 6144 * k + 16 * (v10_c & 0xF) + (v8_c & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 4096 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 3072 + 24576 * j + 6144 * k + 16 * (v14_c >> 4) + (v12_c >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 3328 + 24576 * j + 6144 * k + 16 * (v10_c >> 4) + (v8_c >> 4)]];
            cc = high;
            cc <<= 4;
            cc |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 5888 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 5120 + 24576 * j + 6144 * k + 16 * (v14_d & 0xF) + (v12_d & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 5376 + 24576 * j + 6144 * k + 16 * (v10_d & 0xF) + (v8_d & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 5632 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 4608 + 24576 * j + 6144 * k + 16 * (v14_d >> 4) + (v12_d >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 4864 + 24576 * j + 6144 * k + 16 * (v10_d >> 4) + (v8_d >> 4)]];
            dd = high;
            dd <<= 4;
            dd |= low;
            
            input_pass[4 * k] = aa;
            input_pass[4 * k + 1] = bb;
            input_pass[4 * k + 2] = cc;
            input_pass[4 * k + 3] = dd;
            
            BYTE v15_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v13_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v11_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v9_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v15_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v13_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v11_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v9_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v15_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v13_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v11_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v9_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            BYTE v15_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
            BYTE v13_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
            BYTE v11_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
            BYTE v9_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 1280 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 512 + 24576 * j + 6144 * k + 16 * (v15_a & 0xF) + (v13_a & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 768 + 24576 * j + 6144 * k + 16 * (v11_a & 0xF) + (v9_a & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 1024 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 24576 * j + 6144 * k + 16 * (v15_a >> 4) + (v13_a >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 256 + 24576 * j + 6144 * k + 16 * (v11_a >> 4) + (v9_a >> 4)]];
            aa = high;
            aa <<= 4;
            aa |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 2816 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 2048 + 24576 * j + 6144 * k + 16 * (v15_b & 0xF) + (v13_b & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 2304 + 24576 * j + 6144 * k + 16 * (v11_b & 0xF) + (v9_b & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 2560 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 1536 + 24576 * j + 6144 * k + 16 * (v15_b >> 4) + (v13_b >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 1792 + 24576 * j + 6144 * k + 16 * (v11_b >> 4) + (v9_b >> 4)]];
            bb = high;
            bb <<= 4;
            bb |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 4352 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 3584 + 24576 * j + 6144 * k + 16 * (v15_c & 0xF) + (v13_c & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 3840 + 24576 * j + 6144 * k + 16 * (v11_c & 0xF) + (v9_c & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 4096 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 3072 + 24576 * j + 6144 * k + 16 * (v15_c >> 4) + (v13_c >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 3328 + 24576 * j + 6144 * k + 16 * (v11_c >> 4) + (v9_c >> 4)]];
            cc = high;
            cc <<= 4;
            cc |= low;
            
            low = unk_866000[319488 * ((int)const1[5] >> 4) + 5888 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 5120 + 24576 * j + 6144 * k + 16 * (v15_d & 0xF) + (v13_d & 0xF)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 5376 + 24576 * j + 6144 * k + 16 * (v11_d & 0xF) + (v9_d & 0xF)]];
            high = unk_866000[319488 * ((int)const1[5] >> 4) + 5632 + 24576 * j + 6144 * k + 16
                * unk_866000[319488 * ((int)const1[5] >> 4) + 4608 + 24576 * j + 6144 * k + 16 * (v15_d >> 4) + (v13_d >> 4)]
                + unk_866000[319488 * ((int)const1[5] >> 4) + 4864 + 24576 * j + 6144 * k + 16 * (v11_d >> 4) + (v9_d >> 4)]];
            dd = high;
            dd <<= 4;
            dd |= low;
            
            input_pass[4 * k] = aa;
            input_pass[4 * k + 1] = bb;
            input_pass[4 * k + 2] = cc;
            input_pass[4 * k + 3] = dd;
        }
    }
    sub_4011A0(input_pass);
    for (m = 0; m < 16; ++m)
    {
        input_pass[m] = unk_43D000[57344 * ((int)const1[2] >> 4) + 53248 + 256 * m + input_pass[m]];
    }
    for (n = 0; n < 16; ++n)
    {
        out[n] = input_pass[n];
    }
}
int main()
{
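    /* Read the whole challenge binary into memory; the lookup tables are taken straight from the file image. */
    void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (buff == NULL || file == INVALID_HANDLE_VALUE)
    {
        my_printf("cannot load Devil.exe\n");
        return 1;
    }
    ReadFile(file, buff, 11534336, &hhh, NULL);
    CloseHandle(file);
    /* File offsets of the tables; the variable names keep the IDA addresses. */
    unk_51E000 = (BYTE*)buff + 0x11ba00;
    unk_866000 = (BYTE*)buff + 0x463a00;
    unk_7F6000 = (BYTE*)buff + 0x3f3a00;
    unk_43D000 = (BYTE*)buff + 0x3aa00;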
    BYTE input_pass[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
    BYTE out[16] = { 0 };
    sub_401270(input_pass, out);
    int i = 0;
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out[i]);
    }
    VirtualFree(buff, 0, MEM_RELEASE);
    return 0xb19b00b5;
}
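
A side note on sub_4011A0 in the C listing above: its index table is v2[i] = 5*i mod 16, which is the same byte order as the AES ShiftRows step on a column-major state. Nothing here claims that the rest of the routine is AES. The sketch below is only an illustration in Python; the names PERM, shift_rows and shift_rows_inv are hypothetical and do not come from the binary. It rebuilds the permutation and checks that the inverse written in sub_4011A0_inv really undoes it.

```python
# Hypothetical names; only the index table itself comes from the listing.
PERM = [(5 * i) % 16 for i in range(16)]  # same values as v2[] in sub_4011A0


def shift_rows(state: bytes) -> bytes:
    """Forward permutation, as in sub_4011A0: new[i] = old[PERM[i]]."""
    return bytes(state[PERM[i]] for i in range(16))


def shift_rows_inv(state: bytes) -> bytes:
    """Inverse permutation, as in sub_4011A0_inv: old[PERM[i]] = new[i]."""
    out = bytearray(16)
    for i in range(16):
        out[PERM[i]] = state[i]
    return bytes(out)


if __name__ == "__main__":
    block = bytes(range(16))
    print(PERM)
    print(shift_rows_inv(shift_rows(block)) == block)
```

Because the permutation is a fixed shuffle of byte positions, it costs nothing to invert; the hard part of walking the routine backwards lies in the table lookups, not here.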

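The C listing above is the main algorithm logic. Of course you cannot use it to brute-force directly (that takes more than 4 hours), so it has to be optimized. The first piece of code below is an idea you might have [only might; it is rather laughable (perhaps)]: the textbook approach, solving it with z3.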

from z3 import *
bv16_sort = BitVecSort(16)
bv8_sort = BitVecSort(8)
table_ptr = Array('table_ptr', bv16_sort, bv8_sort)
def func_a(a,b,c,d):
    aa=ZeroExt(8,a)
    bb=ZeroExt(8,b)
    cc=ZeroExt(8,c)
    dd=ZeroExt(8,d)
    low = ((aa & 0xF) << 12) | ((bb & 0xF) << 8) | ((cc & 0xF) << 4) | (dd & 0xF)
    high = ((aa >> 4) << 12) | ((bb >> 4) << 8) | ((cc >> 4) << 4) | (dd >> 4)
    return (Select(table_ptr,high) << 4) | Select(table_ptr,low)
buff=open('C:\\Users\\n00bzx\\Desktop\\Devil.exe','rb').read()
unk_51E000 = 0x11ba00
unk_866000 = 0x463a00
unk_7F6000 = 0x3f3a00
unk_43D000 = 0x3aa00
for low_high in range(256):
    for low_low in range(256):
        low_nib = buff[unk_866000+0xea000 + 768 + low_low]
        high_nib = buff[unk_866000+0xea000 + 512 + low_high]
        low = (high_nib << 4) | low_nib
        table_ptr=Store(table_ptr,low_high * 256 + low_low,buff[unk_866000+0xea000 + 1280 + low])
sizes=13 * 4 * 16
first_tables=[0]*sizes
second_tables=[0]*sizes
for j in range(13):
    for k in range(4):
        idx = (j << 12) | (k << 10)
        table_base = 16 * (idx >> 10)
        for i in range(16):
            cur_idx=table_base+i
            first_tables[cur_idx]=Array('tables_first_%d'%cur_idx, bv8_sort, bv8_sort)
            second_tables[cur_idx]=Array('tables_second_%d'%cur_idx, bv8_sort, bv8_sort)
        for i in range(256):
            first_tables[table_base+0]=Store(first_tables[table_base+0],i,buff[unk_51E000+3 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+1]=Store(first_tables[table_base+1],i,buff[unk_51E000+3 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+2]= Store(first_tables[table_base+2],i,buff[unk_51E000+3 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+3] = Store(first_tables[table_base+3],i,buff[unk_51E000+3 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+4] = Store(first_tables[table_base+4],i,buff[unk_51E000+2 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+5] = Store(first_tables[table_base+5],i,buff[unk_51E000+2 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+6] = Store(first_tables[table_base+6],i,buff[unk_51E000+2 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+7] = Store(first_tables[table_base+7],i,buff[unk_51E000+2 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+8] = Store(first_tables[table_base+8],i,buff[unk_51E000+1 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+9] = Store(first_tables[table_base+9],i,buff[unk_51E000+1 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+10] = Store(first_tables[table_base+10],i,buff[unk_51E000+1 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+11] = Store(first_tables[table_base+11],i,buff[unk_51E000+1 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+12] = Store(first_tables[table_base+12],i,buff[unk_51E000+4 * (0x8f000 + idx + i)])
            first_tables[table_base+13] = Store(first_tables[table_base+13],i,buff[unk_51E000+4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+14] = Store(first_tables[table_base+14],i,buff[unk_51E000+4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+15] = Store(first_tables[table_base+15],i,buff[unk_51E000+4 * (0x8f000 + 768 + idx + i)])

            second_tables[table_base+0] = Store(second_tables[table_base+0],i,buff[unk_7F6000+3 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+1] = Store(second_tables[table_base+1],i,buff[unk_7F6000+3 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+2] = Store(second_tables[table_base+2],i,buff[unk_7F6000+3 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+3] = Store(second_tables[table_base+3],i,buff[unk_7F6000+3 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+4] = Store(second_tables[table_base+4],i,buff[unk_7F6000+2 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+5] = Store(second_tables[table_base+5],i,buff[unk_7F6000+2 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+6] = Store(second_tables[table_base+6],i,buff[unk_7F6000+2 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+7] = Store(second_tables[table_base+7],i,buff[unk_7F6000+2 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+8] = Store(second_tables[table_base+8],i,buff[unk_7F6000+1 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+9] = Store(second_tables[table_base+9],i,buff[unk_7F6000+1 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+10]= Store(second_tables[table_base+10],i,buff[unk_7F6000+1 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+11] = Store(second_tables[table_base+11],i,buff[unk_7F6000+1 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+12] = Store(second_tables[table_base+12],i,buff[unk_7F6000+4 * (0xe000 + idx + i)])
            second_tables[table_base+13] = Store(second_tables[table_base+13],i,buff[unk_7F6000+4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+14]= Store(second_tables[table_base+14],i,buff[unk_7F6000+4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+15]= Store(second_tables[table_base+15],i,buff[unk_7F6000+4 * (0xe000 + 768 + idx + i)])
def trans(j, k, a, b, c, d):
    table_base = (j << 6) | (k << 4)
    a_a = Select(first_tables[table_base+0], a)
    a_b = Select(first_tables[table_base+4], a)
    a_c = Select(first_tables[table_base+8], a)
    a_d = Select(first_tables[table_base+12], a)

    b_a = Select(first_tables[table_base+1], b)
    b_b = Select(first_tables[table_base+5], b)
    b_c = Select(first_tables[table_base+9], b)
    b_d = Select(first_tables[table_base+13], b)

    c_a = Select(first_tables[table_base+2], c)
    c_b = Select(first_tables[table_base+6], c)
    c_c = Select(first_tables[table_base+10], c)
    c_d = Select(first_tables[table_base+14], c)

    d_a = Select(first_tables[table_base+3], d)
    d_b = Select(first_tables[table_base+7], d)
    d_c = Select(first_tables[table_base+11], d)
    d_d = Select(first_tables[table_base+15], d)

    a = func_a(a_a, b_a, c_a, d_a)
    b = func_a(a_b, b_b, c_b, d_b)
    c = func_a(a_c, b_c, c_c, d_c)
    d = func_a(a_d, b_d, c_d, d_d)

    a_a = Select(second_tables[table_base+0], a)
    a_b = Select(second_tables[table_base+4], a)
    a_c = Select(second_tables[table_base+8], a)
    a_d = Select(second_tables[table_base+12], a)

    b_a = Select(second_tables[table_base+1], b)
    b_b = Select(second_tables[table_base+5], b)
    b_c = Select(second_tables[table_base+9], b)
    b_d = Select(second_tables[table_base+13], b)

    c_a = Select(second_tables[table_base+2], c)
    c_b = Select(second_tables[table_base+6], c)
    c_c = Select(second_tables[table_base+10], c)
    c_d = Select(second_tables[table_base+14], c)

    d_a = Select(second_tables[table_base+3], d)
    d_b = Select(second_tables[table_base+7], d)
    d_c = Select(second_tables[table_base+11], d)
    d_d = Select(second_tables[table_base+15], d)
    
    a = func_a(a_a, b_a, c_a, d_a)
    b = func_a(a_b, b_b, c_b, d_b)
    c = func_a(a_c, b_c, c_c, d_c)
    d = func_a(a_d, b_d, c_d, d_d)
    return [a,b,c,d]
dst=[0xa7,0xe4,0x8f,0x9c]
a=BitVec('a',8)
b=BitVec('b',8)
c=BitVec('c',8)
d=BitVec('d',8)
ret=trans(0,0,a,b,c,d)
solver=Solver()
for i in range(4):
    solver.add(dst[i]==ret[i])
if solver.check()==sat:
    print(solver.model())
else:
    print("nope")
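
The magic numbers in the z3 script (0x8f000, 0xea000, 0xe000) and the file offsets next to the unk_ names can be re-derived from the C listing. The sketch below is a plain arithmetic check, not part of the original solution. It assumes an image base of 0x400000 (a guess based on the sub_4011A0-style names; the post does not state it), and the names raw_delta, table_bases and round_index are hypothetical.

```python
IMAGE_BASE = 0x400000  # assumption: typical base for a 32-bit EXE, not stated in the post

# IDA address -> file offset, as used in main() and in the z3 script
VA_TO_FILE = {
    0x51E000: 0x11BA00,
    0x866000: 0x463A00,
    0x7F6000: 0x3F3A00,
    0x43D000: 0x3AA00,
}

# First 11 bytes of const1 from sub_401270 (only indices 0, 2, 5, 10 are used)
CONST1 = bytes([0xB8, 0xA1, 0xD9, 0xB9, 0xD8, 0x3B, 0x17, 0x91, 0x75, 0x12, 0x1B])


def raw_delta(va: int, file_off: int, image_base: int = IMAGE_BASE) -> int:
    """Difference between the RVA and the file offset of one address."""
    return (va - image_base) - file_off


def table_bases(const1: bytes) -> dict:
    """Constant parts of the table indices, selected by the high nibble of const1 bytes."""
    return {
        "unk_51E000": 53248 * (const1[0] >> 4),
        "unk_866000": 319488 * (const1[5] >> 4),
        "unk_7F6000": 57344 * (const1[10] >> 4),
        "unk_43D000": 57344 * (const1[2] >> 4) + 53248,
    }


def round_index(j: int, k: int) -> int:
    """Per-round, per-column part of the index: 4096*j + 1024*k."""
    return 4096 * j + 1024 * k


if __name__ == "__main__":
    for va, off in VA_TO_FILE.items():
        print(hex(va), hex(raw_delta(va, off)))
    for name, base in table_bases(CONST1).items():
        print(name, hex(base))
```

Under that assumed image base, all four addresses differ from their file offsets by the same amount, which is what one would expect if they sit in sections sharing the same RVA-to-raw delta. The value 4096*j + 1024*k equals (j << 12) | (k << 10) for k below 4, which is how the script writes it.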

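Of course, I wrote the z3 script after the fact, because while solving I ruled this idea out immediately; I only want to tell those who are even more of a noob than I am that it does not work. As you have seen, the logic is all table lookups (and it stays that way however much you optimize it). z3 is good at solving linear constraints (linear programming and the like; I only finished high school and that is all I learned, so I should not talk nonsense about things I do not understand), and table lookups are obviously not that... so this approach blows up memory, you know how it goes... What? C is more efficient than Python, you say? Then I wrote it in C as well; take a look at the C version below.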
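
The remark that every step is a table lookup can be made concrete without any solver. In the C listing of sub_401270, each output nibble has the shape T3[16 * T1[16*x + y] + T2[16*z + w]], a two-level lookup over four input nibbles. The sketch below reproduces only that shape in Python with synthetic tables. Using plain XOR tables is an assumption made for illustration; the tables in the binary may well carry extra encodings, and the names make_xor_table, combine_nibbles and combine_bytes are hypothetical.

```python
def make_xor_table() -> list:
    """256-entry table indexed by 16*x + y that returns x ^ y for 4-bit x, y (synthetic)."""
    return [(i >> 4) ^ (i & 0xF) for i in range(256)]


def combine_nibbles(t1, t2, t3, x, y, z, w) -> int:
    """Two first-level lookups feed one second-level lookup."""
    return t3[16 * t1[16 * x + y] + t2[16 * z + w]]


def combine_bytes(tables_low, tables_high, a, b, c, d) -> int:
    """Process the low and high nibbles of four bytes separately, then reassemble."""
    low = combine_nibbles(*tables_low, a & 0xF, b & 0xF, c & 0xF, d & 0xF)
    high = combine_nibbles(*tables_high, a >> 4, b >> 4, c >> 4, d >> 4)
    return (high << 4) | low


XOR = make_xor_table()
TABLES = (XOR, XOR, XOR)  # synthetic stand-ins for the three sub-tables

if __name__ == "__main__":
    print(hex(combine_bytes(TABLES, TABLES, 0x12, 0x34, 0x56, 0x78)))
```

With these synthetic tables the function is just a four-way byte XOR. Evaluated concretely, each lookup is one array access; handing a solver the same lookups as long chains of Store terms is a far heavier encoding.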

#include<windows.h>
#include<stdio.h>
#include"z3.h"
Z3_ast table_ptr_z3 = { 0 };
Z3_ast first_tables_z3[13 * 4 * 16];
Z3_ast second_tables_z3[13 * 4 * 16];
Z3_sort bv8_sort, bv16_sort, array_sort_8, array_sort_16;
static void gen_z3_table_ptr(Z3_context* ctx)
{
    char tmp[32] = { 0 };
    DWORD i, j, k, m;
    BYTE table_ptr[65536];
    BYTE* first_tables = 0;
    BYTE* second_tables = 0;
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    BYTE* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
    ReadFile(file, buff, 0xb00000, &hhh, NULL);
    CloseHandle(file);
    BYTE* unk_51E000 = buff + 0x11ba00;
    BYTE* unk_866000 = buff + 0x463a00;
    BYTE* unk_7F6000 = buff + 0x3f3a00;
    BYTE* unk_43D000 = buff + 0x3aa00;
    DWORD low_high, low_low;
    for (low_high = 0; low_high < 256; low_high++)
    {
        for (low_low = 0; low_low < 256; low_low++)
        {
            BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
            BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
            BYTE low = (high_nib << 4) | low_nib;
            table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
        }
    }
    first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    for (j = 0; j < 13; j++)
    {
        for (k = 0; k < 4; k++)
        {
            DWORD idx = (j << 12) | (k << 10);
            DWORD table_base = 256 * 16 * (idx >> 10);
            for (i = 0; i < 256; i++)
            {
                BYTE* tables = first_tables + table_base;
                tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
                tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
                tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
                tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
                tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];

                tables = second_tables + table_base;
                tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
                tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
                tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
                tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
                tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
            }
        }
    }
    VirtualFree(buff, 0, MEM_RELEASE);
    table_ptr_z3 = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, "table_ptr"), array_sort_16);
    for (i = 0; i < 65536; i++)
    {
        Z3_ast index = Z3_mk_int(*ctx, i, bv16_sort);
        Z3_ast elem = Z3_mk_int(*ctx, table_ptr[i], bv8_sort);
        table_ptr_z3 = Z3_mk_store(*ctx, table_ptr_z3, index, elem);
    }
    for (j = 0; j < 13; j++)
    {
        for (k = 0; k < 4; k++)
        {
            DWORD idx = (j << 12) | (k << 10);
            DWORD table_base_z3 = 16 * (idx >> 10);
            DWORD table_base_orig = 256 * 16 * (idx >> 10);
            Z3_ast* tables_z3_first = first_tables_z3 + table_base_z3;
            BYTE* tables_orig_first = first_tables + table_base_orig;
            Z3_ast* tables_z3_second = second_tables_z3 + table_base_z3;
            BYTE* tables_orig_second = second_tables + table_base_orig;
            for (i = 0; i < 16; i++)
            {
                DWORD idx_in = table_base_z3 + i;
                snprintf(tmp, sizeof(tmp), "table_first_%d", idx_in);
                tables_z3_first[i] = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, tmp), array_sort_8);
                snprintf(tmp, sizeof(tmp), "table_second_%d", idx_in);
                tables_z3_second[i] = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, tmp), array_sort_8);
            }
            for (i = 0; i < 256; i++)
            {
                Z3_ast index = Z3_mk_int(*ctx, i, bv8_sort);
                for (m = 0; m < 16; m++)
                {
                    tables_z3_first[m] = Z3_mk_store(*ctx, tables_z3_first[m], index, Z3_mk_int(*ctx, tables_orig_first[256 * m + i], bv8_sort));
                    tables_z3_second[m] = Z3_mk_store(*ctx, tables_z3_second[m], index, Z3_mk_int(*ctx, tables_orig_second[256 * m + i], bv8_sort));
                }
            }
        }
    }
    VirtualFree(second_tables, 0, MEM_RELEASE);
    VirtualFree(first_tables, 0, MEM_RELEASE);
}
static Z3_ast func_a(Z3_context* ctx, Z3_ast a, Z3_ast b, Z3_ast c, Z3_ast d)
{
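    /* Zero-extend, as the Python version does with ZeroExt: sign extension would leak 1-bits into the neighbouring nibbles after the logical shift. */
    a = Z3_mk_zero_ext(*ctx, 8, a);
    b = Z3_mk_zero_ext(*ctx, 8, b);
    c = Z3_mk_zero_ext(*ctx, 8, c);
    d = Z3_mk_zero_ext(*ctx, 8, d);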

    Z3_ast mask = Z3_mk_int(*ctx, 0xF, bv16_sort);
    Z3_ast a_low = Z3_mk_bvand(*ctx, a, mask);
    Z3_ast b_low = Z3_mk_bvand(*ctx, b, mask);
    Z3_ast c_low = Z3_mk_bvand(*ctx, c, mask);
    Z3_ast d_low = Z3_mk_bvand(*ctx, d, mask);

    Z3_ast a_low_shifted = Z3_mk_bvshl(*ctx, a_low, Z3_mk_int(*ctx, 12, bv16_sort));
    Z3_ast b_low_shifted = Z3_mk_bvshl(*ctx, b_low, Z3_mk_int(*ctx, 8, bv16_sort));
    Z3_ast c_low_shifted = Z3_mk_bvshl(*ctx, c_low, Z3_mk_int(*ctx, 4, bv16_sort));

    Z3_ast low = Z3_mk_bvor(*ctx, Z3_mk_bvor(*ctx, a_low_shifted, b_low_shifted), Z3_mk_bvor(*ctx, c_low_shifted, d_low));

    Z3_ast a_high = Z3_mk_bvlshr(*ctx, a, Z3_mk_int(*ctx, 4, bv16_sort));
    Z3_ast b_high = Z3_mk_bvlshr(*ctx, b, Z3_mk_int(*ctx, 4, bv16_sort));
    Z3_ast c_high = Z3_mk_bvlshr(*ctx, c, Z3_mk_int(*ctx, 4, bv16_sort));
    Z3_ast d_high = Z3_mk_bvlshr(*ctx, d, Z3_mk_int(*ctx, 4, bv16_sort));

    Z3_ast a_high_shifted = Z3_mk_bvshl(*ctx, a_high, Z3_mk_int(*ctx, 12, bv16_sort));
    Z3_ast b_high_shifted = Z3_mk_bvshl(*ctx, b_high, Z3_mk_int(*ctx, 8, bv16_sort));
    Z3_ast c_high_shifted = Z3_mk_bvshl(*ctx, c_high, Z3_mk_int(*ctx, 4, bv16_sort));

    Z3_ast high = Z3_mk_bvor(*ctx, Z3_mk_bvor(*ctx, a_high_shifted, b_high_shifted), Z3_mk_bvor(*ctx, c_high_shifted, d_high));

    Z3_ast high_value = Z3_mk_select(*ctx, table_ptr_z3, high);
    Z3_ast low_value = Z3_mk_select(*ctx, table_ptr_z3, low);

    high_value = Z3_mk_sign_ext(*ctx, 8, high_value);
    low_value = Z3_mk_sign_ext(*ctx, 8, low_value);

    Z3_ast high_shifted = Z3_mk_bvshl(*ctx, high_value, Z3_mk_int(*ctx, 4, bv16_sort));
    Z3_ast result = Z3_mk_bvor(*ctx, high_shifted, low_value);
    return Z3_mk_extract(*ctx, 7, 0, result);
}
static void trans(Z3_context* ctx,DWORD j, DWORD k, Z3_ast* a, Z3_ast* b, Z3_ast* c, Z3_ast* d)
{
    DWORD table_base = (j << 6) | (k << 4);
    Z3_ast* tables = first_tables_z3 + table_base;

    Z3_ast a_a = Z3_mk_select(*ctx, tables[0], *a);
    Z3_ast a_b = Z3_mk_select(*ctx, tables[4], *a);
    Z3_ast a_c = Z3_mk_select(*ctx, tables[8], *a);
    Z3_ast a_d = Z3_mk_select(*ctx, tables[12], *a);

    Z3_ast b_a = Z3_mk_select(*ctx, tables[1], *b);
    Z3_ast b_b = Z3_mk_select(*ctx, tables[5], *b);
    Z3_ast b_c = Z3_mk_select(*ctx, tables[9], *b);
    Z3_ast b_d = Z3_mk_select(*ctx, tables[13], *b);

    Z3_ast c_a = Z3_mk_select(*ctx, tables[2], *c);
    Z3_ast c_b = Z3_mk_select(*ctx, tables[6], *c);
    Z3_ast c_c = Z3_mk_select(*ctx, tables[10], *c);
    Z3_ast c_d = Z3_mk_select(*ctx, tables[14], *c);

    Z3_ast d_a = Z3_mk_select(*ctx, tables[3], *d);
    Z3_ast d_b = Z3_mk_select(*ctx, tables[7], *d);
    Z3_ast d_c = Z3_mk_select(*ctx, tables[11], *d);
    Z3_ast d_d = Z3_mk_select(*ctx, tables[15], *d);

    *a = func_a(ctx, a_a, b_a, c_a, d_a);
    *b = func_a(ctx, a_b, b_b, c_b, d_b);
    *c = func_a(ctx, a_c, b_c, c_c, d_c);
    *d = func_a(ctx, a_d, b_d, c_d, d_d);

    tables = second_tables_z3 + table_base;

    a_a = Z3_mk_select(*ctx, tables[0], *a);
    a_b = Z3_mk_select(*ctx, tables[4], *a);
    a_c = Z3_mk_select(*ctx, tables[8], *a);
    a_d = Z3_mk_select(*ctx, tables[12], *a);

    b_a = Z3_mk_select(*ctx, tables[1], *b);
    b_b = Z3_mk_select(*ctx, tables[5], *b);
    b_c = Z3_mk_select(*ctx, tables[9], *b);
    b_d = Z3_mk_select(*ctx, tables[13], *b);

    c_a = Z3_mk_select(*ctx, tables[2], *c);
    c_b = Z3_mk_select(*ctx, tables[6], *c);
    c_c = Z3_mk_select(*ctx, tables[10], *c);
    c_d = Z3_mk_select(*ctx, tables[14], *c);

    d_a = Z3_mk_select(*ctx, tables[3], *d);
    d_b = Z3_mk_select(*ctx, tables[7], *d);
    d_c = Z3_mk_select(*ctx, tables[11], *d);
    d_d = Z3_mk_select(*ctx, tables[15], *d);

    *a = func_a(ctx, a_a, b_a, c_a, d_a);
    *b = func_a(ctx, a_b, b_b, c_b, d_b);
    *c = func_a(ctx, a_c, b_c, c_c, d_c);
    *d = func_a(ctx, a_d, b_d, c_d, d_d);
}
int main()
{
    Z3_config cfg = Z3_mk_config();
    Z3_context ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);

    bv16_sort = Z3_mk_bv_sort(ctx, 16);
    bv8_sort = Z3_mk_bv_sort(ctx, 8);
    array_sort_16 = Z3_mk_array_sort(ctx, bv16_sort, bv8_sort);
    array_sort_8 = Z3_mk_array_sort(ctx, bv8_sort, bv8_sort);
    gen_z3_table_ptr(&ctx);

    Z3_ast a = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "a"), bv8_sort);
    Z3_ast b = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "b"), bv8_sort);
    Z3_ast c = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "c"), bv8_sort);
    Z3_ast d = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "d"), bv8_sort);
    
    trans(&ctx, 0, 0, &a, &b, &c, &d);

    Z3_ast eqa = Z3_mk_eq(ctx, a, Z3_mk_int(ctx, 0xa7, bv8_sort));
    Z3_ast eqb = Z3_mk_eq(ctx, b, Z3_mk_int(ctx, 0xe4, bv8_sort));
    Z3_ast eqc = Z3_mk_eq(ctx, c, Z3_mk_int(ctx, 0x8f, bv8_sort));
    Z3_ast eqd = Z3_mk_eq(ctx, d, Z3_mk_int(ctx, 0x9c, bv8_sort));
    Z3_ast eqs[] = { eqa,eqb,eqc,eqd };
    Z3_ast eq = Z3_mk_and(ctx, 4, eqs);

    Z3_solver solver = Z3_mk_solver(ctx);
    Z3_solver_assert(ctx, solver, eq);
    Z3_lbool result = Z3_solver_check(ctx, solver);
    if (result == Z3_L_TRUE)
    {
        DWORD64 a_val_u64, b_val_u64, c_val_u64, d_val_u64;
        BYTE a_val, b_val, c_val, d_val;
        Z3_model model = Z3_solver_get_model(ctx, solver);

        Z3_func_decl a_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, a));
        Z3_ast a_result = Z3_model_get_const_interp(ctx, model, a_decl);
        Z3_get_numeral_uint64(ctx, a_result, &a_val_u64);

        Z3_func_decl b_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, b));
        Z3_ast b_result = Z3_model_get_const_interp(ctx, model, b_decl);
        Z3_get_numeral_uint64(ctx, b_result, &b_val_u64);

        Z3_func_decl c_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, c));
        Z3_ast c_result = Z3_model_get_const_interp(ctx, model, c_decl);
        Z3_get_numeral_uint64(ctx, c_result, &c_val_u64);

        Z3_func_decl d_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, d));
        Z3_ast d_result = Z3_model_get_const_interp(ctx, model, d_decl);
        Z3_get_numeral_uint64(ctx, d_result, &d_val_u64);

        a_val = (BYTE)a_val_u64;
        b_val = (BYTE)b_val_u64;
        c_val = (BYTE)c_val_u64;
        d_val = (BYTE)d_val_u64;

        printf("result: %02X%02X%02X%02X\n", d_val, c_val, b_val, a_val);
    }
    else if (result == Z3_L_FALSE)
    {
        printf("no result!\n");
    }
    else
    {
        printf("error!\n");
    }
    Z3_del_context(ctx);
    return 0;
}

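A quick gripe: with certain build options z3 has some safety problems. For example, a counter kept in TLS gets freed when the main thread exits, yet it is still accessed during global destruction... Maybe I just switched on some option I shouldn't have #(smirk). As for the result: memory blows up just like before, and the growth curve even looks the same #(smirk). Of course, the underlying API is the same anyway (isn't that obvious?). All of the above is only meant for those who are even more of a noob than I am; in practice it is completely useless...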

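Below is the first optimized version (a few tables and the like were optimized; it is still basically the same brute force, and it was also written after the fact)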

#include<windows.h>
#include<immintrin.h>
BYTE* unk_51E000 = 0;
BYTE* unk_866000 = 0;
BYTE* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
BYTE unk_43D000_inv[256 * 16] = { 0 };
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16] = { 0 };
BYTE* table_ptr = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
    int i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void sub_4011A0(BYTE* a1)
{
    int i;
    char v4[16];
    for (i = 0; i < 16; ++i)
    {
        v4[i] = a1[v2[i]];
    }
    memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
    int i;
    BYTE temp[16];
    memcpy(temp, a1, 0x10);
    for (i = 0; i < 16; ++i)
    {
        a1[i] = temp[v2_inv[i]];// inverse of v4[i] = a1[v2[i]] in sub_4011A0
    }
}
void translation(DWORD idx, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
    DWORD aa = (DWORD)*a;
    DWORD bb = (DWORD)*b;
    DWORD cc = (DWORD)*c;
    DWORD dd = (DWORD)*d;

    BYTE v14_a = unk_51E000[3 + 4 * (0x8f000 + idx + aa)];
    BYTE v12_a = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + bb)];
    BYTE v10_a = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + cc)];
    BYTE v8_a = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + dd)];
    BYTE v14_b = unk_51E000[2 + 4 * (0x8f000 + idx + aa)];
    BYTE v12_b = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + bb)];
    BYTE v10_b = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + cc)];
    BYTE v8_b = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + dd)];
    BYTE v14_c = unk_51E000[1 + 4 * (0x8f000 + idx + aa)];
    BYTE v12_c = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + bb)];
    BYTE v10_c = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + cc)];
    BYTE v8_c = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + dd)];
    BYTE v14_d = unk_51E000[4 * (0x8f000 + idx + aa)];
    BYTE v12_d = unk_51E000[4 * (0x8f000 + 256 + idx + bb)];
    BYTE v10_d = unk_51E000[4 * (0x8f000 + 512 + idx + cc)];
    BYTE v8_d = unk_51E000[4 * (0x8f000 + 768 + idx + dd)];

    WORD low_a = ((v14_a & 0xF) << 12) | ((v12_a & 0xF) << 8) | ((v10_a & 0xF) << 4) | (v8_a & 0xF);
    WORD high_a = ((v14_a >> 4) << 12) | ((v12_a >> 4) << 8) | ((v10_a >> 4) << 4) | (v8_a >> 4);
    WORD low_b = ((v14_b & 0xF) << 12) | ((v12_b & 0xF) << 8) | ((v10_b & 0xF) << 4) | (v8_b & 0xF);
    WORD high_b = ((v14_b >> 4) << 12) | ((v12_b >> 4) << 8) | ((v10_b >> 4) << 4) | (v8_b >> 4);
    WORD low_c = ((v14_c & 0xF) << 12) | ((v12_c & 0xF) << 8) | ((v10_c & 0xF) << 4) | (v8_c & 0xF);
    WORD high_c = ((v14_c >> 4) << 12) | ((v12_c >> 4) << 8) | ((v10_c >> 4) << 4) | (v8_c >> 4);
    WORD low_d = ((v14_d & 0xF) << 12) | ((v12_d & 0xF) << 8) | ((v10_d & 0xF) << 4) | (v8_d & 0xF);
    WORD high_d = ((v14_d >> 4) << 12) | ((v12_d >> 4) << 8) | ((v10_d >> 4) << 4) | (v8_d >> 4);

    aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
    bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
    cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
    dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);

    BYTE v15_a = unk_7F6000[3 + 4 * (0xe000 + idx + aa)];
    BYTE v13_a = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + bb)];
    BYTE v11_a = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + cc)];
    BYTE v9_a = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + dd)];
    BYTE v15_b = unk_7F6000[2 + 4 * (0xe000 + idx + aa)];
    BYTE v13_b = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + bb)];
    BYTE v11_b = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + cc)];
    BYTE v9_b = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + dd)];
    BYTE v15_c = unk_7F6000[1 + 4 * (0xe000 + idx + aa)];
    BYTE v13_c = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + bb)];
    BYTE v11_c = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + cc)];
    BYTE v9_c = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + dd)];
    BYTE v15_d = unk_7F6000[4 * (0xe000 + idx + aa)];
    BYTE v13_d = unk_7F6000[4 * (0xe000 + 256 + idx + bb)];
    BYTE v11_d = unk_7F6000[4 * (0xe000 + 512 + idx + cc)];
    BYTE v9_d = unk_7F6000[4 * (0xe000 + 768 + idx + dd)];

    low_a = ((v15_a & 0xF) << 12) | ((v13_a & 0xF) << 8) | ((v11_a & 0xF) << 4) | (v9_a & 0xF);
    high_a = ((v15_a >> 4) << 12) | ((v13_a >> 4) << 8) | ((v11_a >> 4) << 4) | (v9_a >> 4);
    low_b = ((v15_b & 0xF) << 12) | ((v13_b & 0xF) << 8) | ((v11_b & 0xF) << 4) | (v9_b & 0xF);
    high_b = ((v15_b >> 4) << 12) | ((v13_b >> 4) << 8) | ((v11_b >> 4) << 4) | (v9_b >> 4);
    low_c = ((v15_c & 0xF) << 12) | ((v13_c & 0xF) << 8) | ((v11_c & 0xF) << 4) | (v9_c & 0xF);
    high_c = ((v15_c >> 4) << 12) | ((v13_c >> 4) << 8) | ((v11_c >> 4) << 4) | (v9_c >> 4);
    low_d = ((v15_d & 0xF) << 12) | ((v13_d & 0xF) << 8) | ((v11_d & 0xF) << 4) | (v9_d & 0xF);
    high_d = ((v15_d >> 4) << 12) | ((v13_d >> 4) << 8) | ((v11_d >> 4) << 4) | (v9_d >> 4);

    *a = (table_ptr[high_a] << 4) | table_ptr[low_a];
    *b = (table_ptr[high_b] << 4) | table_ptr[low_b];
    *c = (table_ptr[high_c] << 4) | table_ptr[low_c];
    *d = (table_ptr[high_d] << 4) | table_ptr[low_d];
}
void brute_force(DWORD idx,BYTE* va, BYTE* vb, BYTE* vc, BYTE* vd)
{
    DWORD a, b, c, d;
    __m128i indices;
    WORD* indices_ptr = (WORD*)&indices;
    for (a = 0; a < 256; a++)
    {
        for (b = 0; b < 256; b++)
        {
            for (c = 0; c < 256; c++)
            {
                for (d = 0; d < 256; d++)
                {
                    DWORD aa = a;
                    DWORD bb = b;
                    DWORD cc = c;
                    DWORD dd = d;

                    BYTE v14_a = unk_51E000[3 + 4 * (0x8f000 + idx + aa)];
                    BYTE v12_a = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + bb)];
                    BYTE v10_a = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + cc)];
                    BYTE v8_a = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + dd)];
                    BYTE v14_b = unk_51E000[2 + 4 * (0x8f000 + idx + aa)];
                    BYTE v12_b = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + bb)];
                    BYTE v10_b = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + cc)];
                    BYTE v8_b = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + dd)];
                    BYTE v14_c = unk_51E000[1 + 4 * (0x8f000 + idx + aa)];
                    BYTE v12_c = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + bb)];
                    BYTE v10_c = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + cc)];
                    BYTE v8_c = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + dd)];
                    BYTE v14_d = unk_51E000[4 * (0x8f000 + idx + aa)];
                    BYTE v12_d = unk_51E000[4 * (0x8f000 + 256 + idx + bb)];
                    BYTE v10_d = unk_51E000[4 * (0x8f000 + 512 + idx + cc)];
                    BYTE v8_d = unk_51E000[4 * (0x8f000 + 768 + idx + dd)];

                    WORD low_a = ((v14_a & 0xF) << 12) | ((v12_a & 0xF) << 8) | ((v10_a & 0xF) << 4) | (v8_a & 0xF);
                    WORD high_a = ((v14_a >> 4) << 12) | ((v12_a >> 4) << 8) | ((v10_a >> 4) << 4) | (v8_a >> 4);
                    WORD low_b = ((v14_b & 0xF) << 12) | ((v12_b & 0xF) << 8) | ((v10_b & 0xF) << 4) | (v8_b & 0xF);
                    WORD high_b = ((v14_b >> 4) << 12) | ((v12_b >> 4) << 8) | ((v10_b >> 4) << 4) | (v8_b >> 4);
                    WORD low_c = ((v14_c & 0xF) << 12) | ((v12_c & 0xF) << 8) | ((v10_c & 0xF) << 4) | (v8_c & 0xF);
                    WORD high_c = ((v14_c >> 4) << 12) | ((v12_c >> 4) << 8) | ((v10_c >> 4) << 4) | (v8_c >> 4);
                    WORD low_d = ((v14_d & 0xF) << 12) | ((v12_d & 0xF) << 8) | ((v10_d & 0xF) << 4) | (v8_d & 0xF);
                    WORD high_d = ((v14_d >> 4) << 12) | ((v12_d >> 4) << 8) | ((v10_d >> 4) << 4) | (v8_d >> 4);

                    aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
                    bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
                    cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
                    dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);

                    BYTE v15_a = unk_7F6000[3 + 4 * (0xe000 + idx + aa)];
                    BYTE v13_a = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + bb)];
                    BYTE v11_a = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + cc)];
                    BYTE v9_a = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + dd)];
                    BYTE v15_b = unk_7F6000[2 + 4 * (0xe000 + idx + aa)];
                    BYTE v13_b = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + bb)];
                    BYTE v11_b = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + cc)];
                    BYTE v9_b = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + dd)];
                    BYTE v15_c = unk_7F6000[1 + 4 * (0xe000 + idx + aa)];
                    BYTE v13_c = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + bb)];
                    BYTE v11_c = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + cc)];
                    BYTE v9_c = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + dd)];
                    BYTE v15_d = unk_7F6000[4 * (0xe000 + idx + aa)];
                    BYTE v13_d = unk_7F6000[4 * (0xe000 + 256 + idx + bb)];
                    BYTE v11_d = unk_7F6000[4 * (0xe000 + 512 + idx + cc)];
                    BYTE v9_d = unk_7F6000[4 * (0xe000 + 768 + idx + dd)];

                    low_a = ((v15_a & 0xF) << 12) | ((v13_a & 0xF) << 8) | ((v11_a & 0xF) << 4) | (v9_a & 0xF);
                    high_a = ((v15_a >> 4) << 12) | ((v13_a >> 4) << 8) | ((v11_a >> 4) << 4) | (v9_a >> 4);
                    low_b = ((v15_b & 0xF) << 12) | ((v13_b & 0xF) << 8) | ((v11_b & 0xF) << 4) | (v9_b & 0xF);
                    high_b = ((v15_b >> 4) << 12) | ((v13_b >> 4) << 8) | ((v11_b >> 4) << 4) | (v9_b >> 4);
                    low_c = ((v15_c & 0xF) << 12) | ((v13_c & 0xF) << 8) | ((v11_c & 0xF) << 4) | (v9_c & 0xF);
                    high_c = ((v15_c >> 4) << 12) | ((v13_c >> 4) << 8) | ((v11_c >> 4) << 4) | (v9_c >> 4);
                    low_d = ((v15_d & 0xF) << 12) | ((v13_d & 0xF) << 8) | ((v11_d & 0xF) << 4) | (v9_d & 0xF);
                    high_d = ((v15_d >> 4) << 12) | ((v13_d >> 4) << 8) | ((v11_d >> 4) << 4) | (v9_d >> 4);

                    aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
                    bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
                    cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
                    dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);
                    if ((BYTE)aa == *va && (BYTE)bb == *vb && (BYTE)cc == *vc && (BYTE)dd == *vd)
                    {
                        *va = (BYTE)a;
                        *vb = (BYTE)b;
                        *vc = (BYTE)c;
                        *vd = (BYTE)d;
                        my_printf("%02X%02X%02X%02X\n", d, c, b, a);
                        return;// *va..*vd were just overwritten, so they can no longer serve as the target; stop at the first match
                    }
                }
            }
        }
    }
}
void sub_401270(BYTE* input, BYTE* out)
{
    int m, n, i, j, k;
    for (i = 0; i < 16; ++i)
    {
        input[i] ^= consts[i];
    }
    for (j = 0; j < 13; ++j)
    {
        sub_4011A0(input);
        for (k = 0; k < 4; ++k)
        {
            translation((j << 12) | (k << 10), input + k * 4, input + k * 4 + 1, input + k * 4 + 2, input + k * 4 + 3);
        }
    }
    sub_4011A0(input);
    for (m = 0; m < 16; ++m)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        input[m] = table[input[m]];
    }
    for (n = 0; n < 16; ++n)
    {
        out[n] = input[n];
    }
}
void gen_tables()
{
    int i, m;
    int low_high, low_low;
    for (low_high = 0; low_high < 256; ++low_high)
    {
        for (low_low = 0; low_low < 256; ++low_low)
        {
            if (low_low % 16 == 0)
            {
                my_printf("\n");
            }
            BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
            BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
            BYTE low = (high_nib << 4) | low_nib;
            table_ptr[low_high * 256 + low_low] = (DWORD)unk_866000[0xea000 + 1280 + (DWORD)low];
            my_printf("%02X", table_ptr[low_high * 256 + low_low]);
        }
        my_printf("\n\n");
    }
    for (i = 0; i < 16; i++)
    {
        v2_inv[v2[i]] = i;
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        for (i = 0; i < 256; i++)
        {
            table_inv[table[i]] = (BYTE)i;
        }
    }
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
    int n, m, i, j, k;
    for (n = 0; n < 16; ++n)
    {
        out[n] = input[n];
    }
    for (m = 0; m < 16; ++m)
    {
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        out[m] = table_inv[out[m]];
    }
    sub_4011A0_inv(out);
    for (j = 12; j >= 0; --j)// undo the rounds in reverse order
    {
        for (k = 0; k < 4; ++k)
        {
            brute_force((j << 12) | (k << 10), out + k * 4, out + k * 4 + 1, out + k * 4 + 2, out + k * 4 + 3);// same 4-byte stride as the forward pass
        }
        sub_4011A0_inv(out);
    }
    for (i = 0; i < 16; ++i)
    {
        out[i] ^= consts[i];
    }
}
int main()
{
    //int i;
    void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
    table_ptr = (BYTE*)VirtualAlloc(NULL, 65536 * 4, MEM_COMMIT, PAGE_READWRITE);
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    ReadFile(file, buff, 0xb00000, &hhh, NULL);
    CloseHandle(file);
    unk_51E000 = (BYTE*)buff + 0x11ba00;
    unk_866000 = (BYTE*)buff + 0x463a00;
    unk_7F6000 = (BYTE*)buff + 0x3f3a00;
    unk_43D000 = (BYTE*)buff + 0x3aa00;
    gen_tables();
    BYTE input[] = { 0,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff };
    BYTE out[16] = { 0 };
    BYTE out2[16] = { 0 };
    translation(0, input, input + 1, input + 2, input + 3);
    //brute_force(0, input, input + 1, input + 2, input + 3);
    /*sub_401270(input, out);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
    }
    my_printf("\n\n");
    sub_401270_inv(out, out2);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out2[i]);
    }
    my_printf("\n\n");*/
    VirtualFree(table_ptr, 0, MEM_RELEASE);
    VirtualFree(buff, 0, MEM_RELEASE);
    return 0xb19b00b5;
}

I ran everything single-threaded; this way it takes 4 hours.
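A rough cost estimate puts those four hours in perspective. The sketch below assumes the figure covers one complete inversion, that `sub_401270_inv` calls `brute_force` once per (round, column) pair (13 × 4 calls), and that every call visits all 2^32 candidates, which is the worst case. The helper names are made up for illustration.

```c
#include <stdint.h>

/* Worst-case number of candidate evaluations for a full inversion:
   every (round, column) pair scans the whole 32-bit input space. */
uint64_t total_candidates(uint32_t rounds, uint32_t columns)
{
    return (uint64_t)rounds * columns * (1ULL << 32);
}

/* Average evaluations per second needed to finish in the given time
   (integer division, so the result is rounded down). */
uint64_t candidates_per_second(uint64_t total, uint64_t seconds)
{
    return total / seconds;
}
```

Under those assumptions, `candidates_per_second(total_candidates(13, 4), 4 * 3600)` comes out at roughly 15.5 million evaluations per second. Each evaluation reads eight 32-bit entries from the two big tables and does sixteen lookups in `table_ptr`.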

It still needs the exe file on the local desktop. No libraries are required.
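Two layers of the algorithm can be undone directly, without any search: the byte shuffle done by `sub_4011A0` and the final 256-entry substitution tables. Here is a minimal sketch of both inversions. It assumes every substitution table is a bijection (the `table_inv[table[i]] = i` loop in `gen_tables` relies on the same thing). The function names are illustrative and not taken from the binary.

```c
#include <stdint.h>
#include <string.h>

/* Same index pattern as v2 in the listings. */
static const uint8_t perm[16] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };

/* Build perm_inv so that perm_inv[perm[i]] == i for every i. */
void build_perm_inv(uint8_t perm_inv[16])
{
    for (int i = 0; i < 16; ++i)
        perm_inv[perm[i]] = (uint8_t)i;
}

/* Forward shuffle: new[i] = old[perm[i]]. */
void shuffle(uint8_t block[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = block[perm[i]];
    memcpy(block, tmp, 16);
}

/* Inverse shuffle: new[i] = old[perm_inv[i]], which undoes shuffle(). */
void unshuffle(uint8_t block[16], const uint8_t perm_inv[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = block[perm_inv[i]];
    memcpy(block, tmp, 16);
}

/* Invert a bijective byte table: table_inv[table[i]] = i. */
void invert_byte_table(const uint8_t table[256], uint8_t table_inv[256])
{
    for (int i = 0; i < 256; ++i)
        table_inv[table[i]] = (uint8_t)i;
}
```

The shuffle is not its own inverse, so applying `shuffle` a second time does not restore the block; the separate `perm_inv` table is needed.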

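So the noob started looking for a way out...

I thought: isn't SIMD supposed to be fast (that really was what I thought at the time), and parallel too, so why not write it with SIMD?

I didn't want to use OpenCL, CUDA or the like; that would be too much like cheating. No need to reach for the GPU at every turn: if it can be written for the CPU, write it for the CPU. As I see it, until I have figured out the mysteries of the CPU I won't spend my days thinking about playing with the GPU. Even when writing a ray tracer I want to squeeze the CPU to the limit; that is what helps learning!!! As long as learning is fun, that's all that matters!!!

So here comes the code (this really is what I wrote at the time, so don't expect to compile it as-is~~~):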

#include<windows.h>
#include<immintrin.h>
DWORD* unk_51E000 = 0;
BYTE* unk_866000 = 0;
DWORD* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
BYTE unk_43D000_inv[256 * 16] = { 0 };
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16] = { 0 };
DWORD* table_ptr = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
    int i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void sub_4011A0(BYTE* a1)
{
    int i;
    char v4[16];
    for (i = 0; i < 16; ++i)
    {
        v4[i] = a1[v2[i]];
    }
    memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
    int i;
    BYTE temp[16];
    memcpy(temp, a1, 0x10);
    for (i = 0; i < 16; ++i)
    {
        a1[i] = temp[v2_inv[i]];// inverse of v4[i] = a1[v2[i]] in sub_4011A0
    }
}
void trans_main(DWORD idx, DWORD* in_dw)// in effect: split -> transpose the 4*8 matrix into 8*4 -> combine
{
    __m128i arg_arr = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*in_dw));
    __m128i init_indices_or = _mm_cvtepu16_epi32(_mm_cvtsi64_si128(0x0300020001000000LL));
    __m128i init_indices = _mm_or_si128(arg_arr, init_indices_or);
    __m128i input = _mm_i32gather_epi32(unk_51E000 + 0x8f000 + idx, init_indices, 4);// this is also a memory bottleneck, but the table lookup is unavoidable
    __m128i mask = _mm_broadcastq_epi64(_mm_cvtsi64_si128(0x0f0f0f0f0f0f0f0fLL));
    __m128i low = _mm_and_si128(input, mask);
    __m128i high = _mm_and_si128(_mm_srli_epi16(input, 4), mask);// shift right by 4 and mask to isolate the high nibble of every byte
    __m256i total = _mm256_or_si256(_mm256_cvtepu8_epi16(low), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high), 8));
    __m128i indices_byte_hl_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0f070e060d050c04LL));
    __m128i indices_byte_hl_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0b030a0209010800LL));
    __m256i indices_byte_hl = _mm256_castsi128_si256(_mm_or_si128(_mm_slli_epi16(indices_byte_hl_odd, 8), indices_byte_hl_even));
    __m256i indices_byte = _mm256_permute2x128_si256(indices_byte_hl, indices_byte_hl, 0x20);
    __m256i indices_dword = _mm256_cvtepi8_epi32(_mm_cvtsi64_si128(0x0705030106040200LL));
    __m256i shuffled_dwords = _mm256_permutevar8x32_epi32(total, indices_dword);// first turn the transposed 4*8 into two 4*4 blocks for further shuffling (AVX2 byte shuffles only work within each 128-bit lane)
    __m256i shuffled_bytes = _mm256_shuffle_epi8(shuffled_dwords, indices_byte);// put the 32 nibble-holding bytes in order so they pair up into 16 high/low bytes (the matrix transpose)
    __m256i shuffled_moved = _mm256_srli_epi16(shuffled_bytes, 4);// move the previous low nibble next to the following high nibble to form a byte
    __m256i combined_8_on_high = _mm256_or_si256(shuffled_bytes, shuffled_moved);// merge; each 16-bit lane now contains one byte (in the high part)
    __m128i compressed = _mm256_cvtepi16_epi8(combined_8_on_high);// compress
    __m256i final_indices = _mm256_cvtepu16_epi32(compressed);
    __m256i output_dwords = _mm256_i32gather_epi32(table_ptr, final_indices, 4);// the lowest 4 bits of each dword hold the result
    __m256i output_high = _mm256_srli_epi64(output_dwords, 28);// shift the high part right by 28 bits to bring it down to the low part
    __m256i combined_8_on_low = _mm256_or_si256(output_dwords, output_high);// combine with the original into the low byte
    __m128i reverse_dwords_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x030107050b090f0dLL));
    __m128i reverse_dwords_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x020006040a080e0cLL));
    __m128i reverse_dwords = _mm_or_si128(_mm_slli_epi16(reverse_dwords_odd, 8), reverse_dwords_even);// second round; most of the code repeats, so it is not commented
    __m128i init_indices_2 = _mm_or_si128(_mm_shuffle_epi8(_mm256_cvtepi64_epi32(combined_8_on_low), reverse_dwords), init_indices_or);
    __m128i input_2 = _mm_i32gather_epi32(unk_7F6000 + 0xe000 + idx, init_indices_2, 4);
    __m128i low_2 = _mm_and_si128(input_2, mask);
    __m128i high_2 = _mm_and_si128(_mm_srli_epi16(input_2, 4), mask);
    __m256i total_2 = _mm256_or_si256(_mm256_cvtepu8_epi16(low_2), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high_2), 8));
    __m256i shuffled_dwords_2 = _mm256_permutevar8x32_epi32(total_2, indices_dword);
    __m256i shuffled_bytes_2 = _mm256_shuffle_epi8(shuffled_dwords_2, indices_byte);
    __m256i shuffled_moved_2 = _mm256_srli_epi16(shuffled_bytes_2, 4);
    __m256i combined_8_on_high_2 = _mm256_or_si256(shuffled_bytes_2, shuffled_moved_2);
    __m128i compressed_2 = _mm256_cvtepi16_epi8(combined_8_on_high_2);
    __m256i final_indices_2 = _mm256_cvtepu16_epi32(compressed_2);
    __m256i output_dwords_2 = _mm256_i32gather_epi32(table_ptr, final_indices_2, 4);
    __m256i output_high_2 = _mm256_srli_epi64(output_dwords_2, 28);
    __m256i combined_8_on_low_2 = _mm256_or_si256(output_dwords_2, output_high_2);
    *in_dw = _byteswap_ulong(_mm_cvtsi128_si32(_mm256_cvtepi64_epi8(combined_8_on_low_2)));
}
void brute_force(DWORD idx, DWORD* in_dw)
{
    __m128i mask = _mm_broadcastq_epi64(_mm_cvtsi64_si128(0x0f0f0f0f0f0f0f0fLL));
    __m128i init_indices_or = _mm_cvtepu16_epi32(_mm_cvtsi64_si128(0x0300020001000000LL));
    __m128i indices_byte_hl_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0f070e060d050c04LL));
    __m128i indices_byte_hl_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0b030a0209010800LL));
    __m256i indices_byte_hl = _mm256_castsi128_si256(_mm_or_si128(_mm_slli_epi16(indices_byte_hl_odd, 8), indices_byte_hl_even));
    __m256i indices_byte = _mm256_permute2x128_si256(indices_byte_hl, indices_byte_hl, 0x20);
    __m256i indices_dword = _mm256_cvtepi8_epi32(_mm_cvtsi64_si128(0x0705030106040200LL));
    __m128i reverse_dwords_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x030107050b090f0dLL));
    __m128i reverse_dwords_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x020006040a080e0cLL));
    __m128i reverse_dwords = _mm_or_si128(_mm_slli_epi16(reverse_dwords_odd, 8), reverse_dwords_even);
    DWORD64 val;
    DWORD* curr_tab_a_1 = unk_51E000 + 0x8f000 + idx;
    DWORD* curr_tab_a_2 = unk_7F6000 + 0xe000 + idx;
    for (val = 0; val < 0x100000000; val++)
    {
        DWORD tmp = (DWORD)val;
        __m128i arg_arr = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp));
        __m128i init_indices = _mm_or_si128(arg_arr, init_indices_or);
        __m128i input = _mm_i32gather_epi32(curr_tab_a_1, init_indices, 4);// this is also a memory bottleneck, but the table lookup is unavoidable
        __m128i low = _mm_and_si128(input, mask);
        __m128i high = _mm_and_si128(_mm_srli_epi16(input, 4), mask);// shift right by 4 and mask to isolate the high nibble of every byte
        __m256i total = _mm256_or_si256(_mm256_cvtepu8_epi16(low), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high), 8));
        __m256i shuffled_dwords = _mm256_permutevar8x32_epi32(total, indices_dword);// first turn the transposed 4*8 into two 4*4 blocks for further shuffling (AVX2 byte shuffles only work within each 128-bit lane)
        __m256i shuffled_bytes = _mm256_shuffle_epi8(shuffled_dwords, indices_byte);// put the 32 nibble-holding bytes in order so they pair up into 16 high/low bytes (the matrix transpose)
        __m256i shuffled_moved = _mm256_srli_epi16(shuffled_bytes, 4);// move the previous low nibble next to the following high nibble to form a byte
        __m256i combined_8_on_high = _mm256_or_si256(shuffled_bytes, shuffled_moved);// merge; each 16-bit lane now contains one byte (in the high part)
        __m128i compressed = _mm256_cvtepi16_epi8(combined_8_on_high);// compress
        __m256i final_indices = _mm256_cvtepu16_epi32(compressed);
        __m256i output_dwords = _mm256_i32gather_epi32(table_ptr, final_indices, 4);// the lowest 4 bits of each dword hold the result
        __m256i output_high = _mm256_srli_epi64(output_dwords, 28);// shift the high part right by 28 bits to bring it down to the low part
        __m256i combined_8_on_low = _mm256_or_si256(output_dwords, output_high);// combine with the original into the low byte
        __m128i init_indices_2 = _mm_or_si128(_mm_shuffle_epi8(_mm256_cvtepi64_epi32(combined_8_on_low), reverse_dwords), init_indices_or);// second round; most of the code repeats, so it is not commented
        __m128i input_2 = _mm_i32gather_epi32(curr_tab_a_2, init_indices_2, 4);
        __m128i low_2 = _mm_and_si128(input_2, mask);
        __m128i high_2 = _mm_and_si128(_mm_srli_epi16(input_2, 4), mask);
        __m256i total_2 = _mm256_or_si256(_mm256_cvtepu8_epi16(low_2), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high_2), 8));
        __m256i shuffled_dwords_2 = _mm256_permutevar8x32_epi32(total_2, indices_dword);
        __m256i shuffled_bytes_2 = _mm256_shuffle_epi8(shuffled_dwords_2, indices_byte);
        __m256i shuffled_moved_2 = _mm256_srli_epi16(shuffled_bytes_2, 4);
        __m256i combined_8_on_high_2 = _mm256_or_si256(shuffled_bytes_2, shuffled_moved_2);
        __m128i compressed_2 = _mm256_cvtepi16_epi8(combined_8_on_high_2);
        __m256i final_indices_2 = _mm256_cvtepu16_epi32(compressed_2);
        __m256i output_dwords_2 = _mm256_i32gather_epi32(table_ptr, final_indices_2, 4);
        __m256i output_high_2 = _mm256_srli_epi64(output_dwords_2, 28);
        __m256i combined_8_on_low_2 = _mm256_or_si256(output_dwords_2, output_high_2);
        tmp = _byteswap_ulong(_mm_cvtsi128_si32(_mm256_cvtepi64_epi8(combined_8_on_low_2)));
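        if (tmp == *in_dw)
        {
            my_printf("%08X\n", (DWORD)val);
            *in_dw = (DWORD)val;
            return;// *in_dw was just overwritten, so it can no longer serve as the target; stop at the first match
        }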
    }
}
void sub_401270(BYTE* input, BYTE* out)// not written in assembly; that wouldn't look nice #(smirk)
{
    int n, m, i, j, k;
    for (i = 0; i < 16; ++i)
    {
        input[i] ^= consts[i];
    }
    for (j = 0; j < 13; ++j)
    {
        sub_4011A0(input);
        for (k = 0; k < 4; ++k)
        {
            trans_main((j << 12) | (k << 10), (DWORD*)input + k);
        }
    }
    sub_4011A0(input);
    for (m = 0; m < 16; ++m)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        input[m] = table[input[m]];
    }
    for (n = 0; n < 16; ++n)
    {
        out[n] = input[n];
    }
}
void gen_tables()
{
    int i, m;
    int low_high, low_low;
    for (low_high = 0; low_high < 256; ++low_high)
    {
        for (low_low = 0; low_low < 256; ++low_low)
        {
            BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
            BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
            BYTE low = (high_nib << 4) | low_nib;
            table_ptr[low_high * 256 + low_low] = (DWORD)unk_866000[0xea000 + 1280 + (DWORD)low];
        }
    }
    for (i = 0; i < 16; i++)
    {
        v2_inv[v2[i]] = i;
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        for (i = 0; i < 256; i++)
        {
            table_inv[table[i]] = (BYTE)i;
        }
    }
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
    int n, m, i, j, k;
    for (n = 0; n < 16; ++n)
    {
        out[n] = input[n];
    }
    for (m = 0; m < 16; ++m)
    {
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        out[m] = table_inv[out[m]];
    }
    sub_4011A0_inv(out);
    for (j = 12; j >= 0; --j)// undo the rounds in reverse order
    {
        for (k = 0; k < 4; ++k)
        {
            brute_force((j << 12) | (k << 10), (DWORD*)out + k);
        }
        sub_4011A0_inv(out);
    }
    for (i = 0; i < 16; ++i)
    {
        out[i] ^= consts[i];
    }
}
int main()
{
    //int i;
    void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
    table_ptr = (DWORD*)VirtualAlloc(NULL, 65536 * 4, MEM_COMMIT, PAGE_READWRITE);
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    ReadFile(file, buff, 0xb00000, &hhh, NULL);
    CloseHandle(file);
    unk_51E000 = (DWORD*)((BYTE*)buff + 0x11ba00);
    unk_866000 = (BYTE*)buff + 0x463a00;
    unk_7F6000 = (DWORD*)((BYTE*)buff + 0x3f3a00);
    unk_43D000 = (BYTE*)buff + 0x3aa00;
    gen_tables();
    BYTE input[] = { 0,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff };
    BYTE out[16] = { 0 };
    BYTE out2[16] = { 0 };
    trans_main(0, (DWORD*)input);
    brute_force(0, (DWORD*)input);
    /*sub_401270(input, out);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
    }
    my_printf("\n\n");
    sub_401270_inv(out, out2);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out2[i]);
    }
    my_printf("\n\n");*/
    VirtualFree(table_ptr, 0, MEM_RELEASE);
    VirtualFree(buff, 0, MEM_RELEASE);
    return 0xb19b00b5;
}

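I won't go over the optimization approach; the comments are enough. Some of it is just messy ideas I had on the spot that I can't put into words... My writing skills need work. While I'm at it, thanks to my boss at work for teaching me how to write articles... I bow down here (you can't stop me from worshipping the masters here!!!)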
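For readers who find the intrinsics hard to follow, the "split, transpose, combine" step can be modelled in plain C. The sketch below is a scalar reading of what `translation` computes for one output byte, not a transcription of the vector code. `rows[0..3]` stand for the four 32-bit table entries selected by the input bytes a, b, c, d, and the function names are made up.

```c
#include <stdint.h>

/* Split a 32-bit table entry into its low and high nibbles, one per byte
   lane, with plain integer masks (the scalar analogue of the
   _mm_and_si128 / _mm_srli_epi16 pair in trans_main). */
static void split_nibbles(uint32_t w, uint32_t *lo, uint32_t *hi)
{
    *lo = w & 0x0f0f0f0fu;
    *hi = (w >> 4) & 0x0f0f0f0fu;
}

/* For byte lane `lane` (0 = least significant byte, 3 = most significant),
   take that lane's nibble from each of the four rows and pack the four
   nibbles into one 16-bit index, row 0 ending up in the top nibble.
   This is the "transpose": a column of the 4-row matrix becomes one index. */
void pack_lane(const uint32_t rows[4], unsigned lane,
               uint16_t *low_index, uint16_t *high_index)
{
    uint16_t lo_idx = 0;
    uint16_t hi_idx = 0;
    for (int r = 0; r < 4; ++r) {
        uint32_t lo, hi;
        split_nibbles(rows[r], &lo, &hi);
        lo_idx = (uint16_t)((lo_idx << 4) | ((lo >> (8 * lane)) & 0xfu));
        hi_idx = (uint16_t)((hi_idx << 4) | ((hi >> (8 * lane)) & 0xfu));
    }
    *low_index = lo_idx;
    *high_index = hi_idx;
}
```

In terms of the scalar listing, lane 3 corresponds to `low_a` / `high_a` and lane 0 to `low_d` / `high_d`. The output byte is then `(table_ptr[high_index] << 4) | table_ptr[low_index]`.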

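But this brought almost no speedup. Evidently I have not yet grasped the mysteries of the CPU. In front of Intel and Microsoft I am still a clown. No wait, not even a clown, just a little shrimp #(smirk)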

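My first wild guess is that the bottleneck is memory access. After all, a memory access that misses the caches can be hundreds of times slower than a register access (roughly).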
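One way to sanity-check that guess is to add up how much table data a single `brute_force` call can touch. The sketch below only does the arithmetic. The sizes are read off the listings: a 65536-entry `table_ptr`, and per (round, column) pair a block of 4 × 256 32-bit entries in each of the two big tables. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of one big-table block used by a single (round, column) pair:
   4 sub-tables (one per input byte) of 256 32-bit entries each. */
size_t column_block_bytes(void)
{
    return 4u * 256u * sizeof(uint32_t);
}

/* Bytes of the 65536-entry combining table for a given entry size
   (1 for the BYTE version, 4 for the DWORD version used with gathers). */
size_t combine_table_bytes(size_t entry_size)
{
    return 65536u * entry_size;
}

/* Data reachable while brute-forcing one (round, column) pair: one block
   from each of the two big tables plus the whole combining table. */
size_t working_set_bytes(size_t entry_size)
{
    return 2u * column_block_bytes() + combine_table_bytes(entry_size);
}
```

That gives 72 KiB with the byte-wide `table_ptr` and 264 KiB with the DWORD-wide one, so widening the table for the gather instructions quadruples the footprint of the randomly indexed table. Whether either fits in a particular cache level depends on the CPU. L1 data caches are commonly 32 to 48 KiB, but that is an assumption about the hardware, not something measured here.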

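After more hard thinking, going through Intel's official documentation countless times, reading the instruction pseudocode carefully and comparing latencies, I decided to give up on the SIMD optimization (a crude idea; the experts out there are surely laughing at my stupid code...)

So my thinking went back to optimizing the table lookups. At that point my boss started pressing me for a report, so my IRQL was immediately raised to DISPATCH_LEVEL and I was gone in a flash......

Two days later...

The optimized code was ready...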

#include<windows.h>
#include<immintrin.h>
BYTE unk_43D000_inv[256 * 16];
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16];
BYTE table_ptr[65536];
BYTE* first_tables = 0;
BYTE* second_tables = 0;
BYTE* first_tables_map = 0;
BYTE* second_tables_map = 0;
BYTE* first_tables_sorted = 0;
BYTE* second_tables_sorted = 0;
DWORD64* first_tables_indices = 0;
DWORD64* second_tables_indices = 0;
DWORD first_tables_sorted_len[13 * 4 * 16];
DWORD second_tables_sorted_len[13 * 4 * 16];
BYTE third_tables[16 * 256];
DWORD* map_keys = 0;
BYTE* map_vals = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
    DWORD i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void junk_memcpy(void* dst, void* src, size_t sz)//only usable when dst is 16-byte aligned and sz is a multiple of 16
{
    while (sz)
    {
        _mm_store_si128((__m128i*)((DWORD64)dst + sz - 16), _mm_loadu_si128((__m128i*)((DWORD64)src + sz - 16)));
        sz -= 16;
    }
}
void junk_memset(void* mem, BYTE val, size_t sz)//only usable when mem is 16-byte aligned and sz is a multiple of 16
{
    while (sz)
    {
        _mm_store_si128((__m128i*)((DWORD64)mem + sz - 16), _mm_set1_epi8(val));
        sz -= 16;
    }
}
int partition(DWORD* arr, int low, int high)
{
    int i = low, j = high;
    DWORD p = arr[low];
    while (i < j)
    {
        while (i < j && arr[j] > p)
        {
            j--;
        }
        if (i < j)
        {
            arr[i] ^= arr[j];
            arr[j] ^= arr[i];
            arr[i] ^= arr[j];
            i++;
        }
        while (i < j && arr[i] <= p)
        {
            i++;
        }
        if (i < j)
        {
            arr[i] ^= arr[j];
            arr[j] ^= arr[i];
            arr[i] ^= arr[j];
            j--;
        }
    }
    arr[i] = p;
    return i;
}
void _qsort(DWORD* arr, int low, int high)
{
    if (low >= 0 && high >= 0 && low < high)
    {
        int mid = partition(arr, low, high);
        _qsort(arr, low, mid - 1);
        _qsort(arr, mid + 1, high);
    }
}
#define qsort(arr,len) _qsort((arr),0,(len)-1)
void idsort(DWORD* arr, DWORD* ids, DWORD n)
{
    int i, j;
    for (i = 0; i < (int)n - 1; i++)
    {
        for (j = 0; j < (int)n - i - 1; j++)
        {
            if (arr[j] > arr[j + 1])
            {
                arr[j] ^= arr[j + 1];
                arr[j + 1] ^= arr[j];
                arr[j] ^= arr[j + 1];
                ids[j] ^= ids[j + 1];
                ids[j + 1] ^= ids[j];
                ids[j] ^= ids[j + 1];
            }
        }
    }
}
DWORD my_bsearch(DWORD* arr, DWORD len, DWORD target)
{
    int l = 0, r = (int)len - 1;
    DWORD ret = 0xffffffff;
    while (l <= r)
    {
        int mid = l + ((r - l) >> 1);
        if (arr[mid] == target)
        {
            ret = mid;
            break;
        }
        else if (arr[mid] < target)
        {
            l = mid + 1;
        }
        else
        {
            r = mid - 1;
        }
    }
    return ret;
}
void sub_4011A0(BYTE* a1)
{
    DWORD i;
    char v4[16];
    for (i = 0; i < 16; i++)
    {
        v4[i] = a1[v2[i]];
    }
    junk_memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
    DWORD i;
    BYTE temp[16];
    junk_memcpy(temp, a1, 0x10);
    for (i = 0; i < 16; i++)
    {
        a1[v2_inv[i]] = temp[i];
    }
}
DWORD de_trans_quad(BYTE* table0, DWORD* table1, DWORD64* table2, DWORD j, DWORD k, DWORD idx, BYTE in, BOOL inited)
{
    DWORD s, t, u, v;
    DWORD w, x, y, z;
    DWORD total_len_low = 0, total_len_high = 0;
    DWORD ptrs_2[4] = { 0 };
    BYTE low = in & 0xf;
    BYTE high = in >> 4;
    DWORD mat_idx = 16 * ((j << 2) | k);
    DWORD table_base = 256 * mat_idx;
    DWORD table_mat_base = idx * 4 * 256;
    BYTE* tables2 = table0 + table_base;
    DWORD* tables3 = table1 + mat_idx;
    DWORD64* tables4 = table2 + table_base;
    BYTE* t_a = tables2 + (idx << 10);
    BYTE* t_b = tables2 + ((idx << 10) | 0x100);
    BYTE* t_c = tables2 + ((idx << 10) | 0x200);
    BYTE* t_d = tables2 + ((idx << 10) | 0x300);
    DWORD64* tt_a = tables4 + (idx << 10);
    DWORD64* tt_b = tables4 + ((idx << 10) | 0x100);
    DWORD64* tt_c = tables4 + ((idx << 10) | 0x200);
    DWORD64* tt_d = tables4 + ((idx << 10) | 0x300);
    DWORD sz_a = tables3[idx << 2];
    DWORD sz_b = tables3[(idx << 2) | 1];
    DWORD sz_c = tables3[(idx << 2) | 2];
    DWORD sz_d = tables3[(idx << 2) | 3];
    DWORD cnt_a = 256 / sz_a;
    DWORD cnt_b = 256 / sz_b;
    DWORD cnt_c = 256 / sz_c;
    DWORD cnt_d = 256 / sz_d;
    DWORD indices = 0;
    for (w = 0; w < sz_a; w++)
    {
        for (x = 0; x < sz_b; x++)
        {
            for (y = 0; y < sz_c; y++)
            {
                for (z = 0; z < sz_d; z++)
                {
                    BYTE a_a = t_a[w];
                    BYTE a_b = t_b[x];
                    BYTE a_c = t_c[y];
                    BYTE a_d = t_d[z];
                    WORD low_word = ((a_a & 0xF) << 12) | ((a_b & 0xF) << 8) | ((a_c & 0xF) << 4) | (a_d & 0xF);
                    WORD high_word = ((a_a >> 4) << 12) | ((a_b >> 4) << 8) | ((a_c >> 4) << 4) | (a_d >> 4);
                    if (low == table_ptr[low_word] && high == table_ptr[high_word])
                    {
                        DWORD64 arr_a_contains = tt_a[a_a];
                        DWORD64 arr_b_contains = tt_b[a_b];
                        DWORD64 arr_c_contains = tt_c[a_c];
                        DWORD64 arr_d_contains = tt_d[a_d];
                        BYTE* arr_a = (BYTE*)&arr_a_contains;
                        BYTE* arr_b = (BYTE*)&arr_b_contains;
                        BYTE* arr_c = (BYTE*)&arr_c_contains; 
                        BYTE* arr_d = (BYTE*)&arr_d_contains;
                        for (s = 0; s < cnt_a; s++)
                        {
                            for (t = 0; t < cnt_b; t++)
                            {
                                for (u = 0; u < cnt_c; u++)
                                {
                                    for (v = 0; v < cnt_d; v++)
                                    {
                                        BYTE aa = arr_a[s];
                                        BYTE bb = arr_b[t];
                                        BYTE cc = arr_c[u];
                                        BYTE dd = arr_d[v];
                                        DWORD hash_val = aa | (bb << 8) | (cc << 16) | (dd << 24);
                                        if (!inited)
                                        {
                                            map_keys[indices++] = hash_val;
                                        }
                                        else
                                        {
                                            DWORD pos = my_bsearch(map_keys, 0x1000000, hash_val);
                                            if (pos != 0xffffffff)
                                            {
                                                if (++map_vals[pos] == 4)
                                                {
                                                    return pos;
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    if (!inited)
    {
        qsort(map_keys, 0x1000000);
        junk_memset(map_vals, 1, 0x1000000);
    }
    return 0xffffffff;
}
DWORD de_trans(DWORD j, DWORD k,DWORD* in)
{
    DWORD ret = 0xffffffff;
    DWORD i, m;
    DWORD mat_idx = 16 * ((j << 2) | k);
    DWORD table_base = 256 * mat_idx;
    BYTE* table_sorted = second_tables_sorted + table_base;
    DWORD* tables_lens = second_tables_sorted_len + mat_idx;
    DWORD idxs[4] = { 0,1,2,3 };
    DWORD sizes[4] = { 0 };
    BOOL inited = FALSE;
    for (i = 0; i < 4; i++)
    {
        DWORD sum = 1;
        for (m = 0; m < 4; m++)
        {
            sum *= tables_lens[(i << 2) | m];
        }
        sizes[i] = sum;
    }
    idsort(sizes, idxs, 4);
    for (i = 0; i < 4; i++)
    {
        int cur_idx = idxs[i];
        BYTE input = (BYTE)(*in >> (cur_idx << 3));
        ret = de_trans_quad(second_tables_sorted, second_tables_sorted_len, second_tables_indices, j, k, cur_idx, input, inited);
        inited = TRUE;
    }
    return map_keys[ret];
}
void trans(DWORD j, DWORD k, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
    DWORD idx = 16 * ((j << 2) | k);
    DWORD table_base = 256 * idx;

    WORD low, high;
    DWORD aa = (DWORD)*a;
    DWORD bb = (DWORD)*b;
    DWORD cc = (DWORD)*c;
    DWORD dd = (DWORD)*d;

    BYTE* tables = first_tables + table_base;

    BYTE a_a = tables[256 * 0 + aa];
    BYTE a_b = tables[256 * 4 + aa];
    BYTE a_c = tables[256 * 8 + aa];
    BYTE a_d = tables[256 * 12 + aa];

    BYTE b_a = tables[256 * 1 + bb];
    BYTE b_b = tables[256 * 5 + bb];
    BYTE b_c = tables[256 * 9 + bb];
    BYTE b_d = tables[256 * 13 + bb];

    BYTE c_a = tables[256 * 2 + cc];
    BYTE c_b = tables[256 * 6 + cc];
    BYTE c_c = tables[256 * 10 + cc];
    BYTE c_d = tables[256 * 14 + cc];

    BYTE d_a = tables[256 * 3 + dd];
    BYTE d_b = tables[256 * 7 + dd];
    BYTE d_c = tables[256 * 11 + dd];
    BYTE d_d = tables[256 * 15 + dd];

    low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
    high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
    aa = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
    high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
    bb = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
    high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
    cc = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
    high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
    dd = (table_ptr[high] << 4) | table_ptr[low];
    my_printf("%02X %02X %02X %02X\n", aa, bb, cc, dd);
    tables = second_tables + table_base;
    a_a = tables[256 * 0 + aa];
    a_b = tables[256 * 4 + aa];
    a_c = tables[256 * 8 + aa];
    a_d = tables[256 * 12 + aa];

    b_a = tables[256 * 1 + bb];
    b_b = tables[256 * 5 + bb];
    b_c = tables[256 * 9 + bb];
    b_d = tables[256 * 13 + bb];

    c_a = tables[256 * 2 + cc];
    c_b = tables[256 * 6 + cc];
    c_c = tables[256 * 10 + cc];
    c_d = tables[256 * 14 + cc];

    d_a = tables[256 * 3 + dd];
    d_b = tables[256 * 7 + dd];
    d_c = tables[256 * 11 + dd];
    d_d = tables[256 * 15 + dd];

    low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
    high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
    *a = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
    high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
    *b = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
    high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
    *c = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
    high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
    *d = (table_ptr[high] << 4) | table_ptr[low];
}
void sub_401270(BYTE* input, BYTE* out)
{
    DWORD m, n, i, j, k;
    for (i = 0; i < 16; i++)
    {
        input[i] ^= consts[i];
    }
    for (j = 0; j < 13; j++)
    {
        sub_4011A0(input);
        for (k = 0; k < 4; k++)
        {
            trans(j, k, input + k * 4, input + k * 4 + 1, input + k * 4 + 2, input + k * 4 + 3);
        }
    }
    sub_4011A0(input);
    for (m = 0; m < 16; m++)
    {
        input[m] = third_tables[(m << 8) | input[m]];
    }
    for (n = 0; n < 16; n++)
    {
        out[n] = input[n];
    }
}
void gen_tables(BYTE* buff)
{
    BYTE* unk_51E000 = buff + 0x11ba00;
    BYTE* unk_866000 = buff + 0x463a00;
    BYTE* unk_7F6000 = buff + 0x3f3a00;
    BYTE* unk_43D000 = buff + 0x3aa00;
    DWORD i, j, k, m;
    DWORD low_high, low_low;
    for (low_high = 0; low_high < 256; low_high++)
    {
        for (low_low = 0; low_low < 256; low_low++)
        {
            BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
            BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
            BYTE low = (high_nib << 4) | low_nib;
            table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
        }
    }
    for (i = 0; i < 16; i++)
    {
        v2_inv[v2[i]] = (BYTE)i;
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        for (i = 0; i < 256; i++)
        {
            table_inv[table[i]] = (BYTE)i;
        }
    }
    for (j = 0; j < 13; j++)
    {
        for (k = 0; k < 4; k++)
        {	
            DWORD idx = (j << 12) | (k << 10);
            DWORD table_base = 256 * 16 * (idx >> 10);
            for (i = 0; i < 256; i++)
            {
                BYTE* tables = first_tables + table_base;
                tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
                tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
                tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
                tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
                tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];

                tables = second_tables + table_base;
                tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
                tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
                tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
                tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
                tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
            }
        }
    }
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
    {
        BYTE* ptr_first_orig = first_tables + m;
        BYTE* ptr_second_orig = second_tables + m;
        BYTE* ptr_first_dst = first_tables_map + m;
        BYTE* ptr_second_dst = second_tables_map + m;
        for (i = 0; i < 256; i++)
        {
            ptr_first_dst[ptr_first_orig[i]] = 1;
            ptr_second_dst[ptr_second_orig[i]] = 1;
        }
    }
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
    {
        BYTE* ptr_first_orig = first_tables_map + m;
        BYTE* ptr_second_orig = second_tables_map + m;
        BYTE* ptr_first_dst = first_tables_sorted + m;
        BYTE* ptr_second_dst = second_tables_sorted + m;
        int len_indice = m >> 8;
        first_tables_sorted_len[len_indice] = 0;
        second_tables_sorted_len[len_indice] = 0;
        for (i = 0; i < 256; i++)
        {
            if (ptr_first_orig[i])
            {
                ptr_first_dst[first_tables_sorted_len[len_indice]++] = (BYTE)i;
            }
            if (ptr_second_orig[i])
            {
                ptr_second_dst[second_tables_sorted_len[len_indice]++] = (BYTE)i;
            }
        }
    }
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)//spread out according to the number of duplicates; the entries stored are the indices
    {
        BYTE tmp_indices_first[256];
        BYTE tmp_indices_second[256];
        junk_memset(tmp_indices_first, 0, 256);
        junk_memset(tmp_indices_second, 0, 256);
        BYTE* ptr_first_orig = first_tables + m;
        BYTE* ptr_second_orig = second_tables + m;
        DWORD64* ptr_first_dst = first_tables_indices + m;
        DWORD64* ptr_second_dst = second_tables_indices + m;
        for (i = 0; i < 256; i++)
        {
            DWORD first = (DWORD)ptr_first_orig[i];
            DWORD second = (DWORD)ptr_second_orig[i];
            BYTE* arr_first = (BYTE*)(ptr_first_dst + first);
            BYTE* arr_second = (BYTE*)(ptr_second_dst + second);
            arr_first[(DWORD)(tmp_indices_first[first]++)] = (BYTE)i;
            arr_second[(DWORD)(tmp_indices_second[second]++)] = (BYTE)i;
        }
    }
    for (m = 0; m < 16; m++)
    {
        for (i = 0; i < 256; i++)
        {
            BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
            third_tables[256 * m + i] = table[i];
        }
    }
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
    DWORD n, m, i, j, k;
    for (n = 0; n < 16; n++)
    {
        out[n] = input[n];
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        out[m] = table_inv[out[m]];
    }
    sub_4011A0_inv(out);
    for (j = 0; j < 13; j++)
    {
        for (k = 0; k < 4; k++)
        {
            de_trans(j, k, (DWORD*)(input + k * 4));
        }
        sub_4011A0_inv(out);
    }
    for (i = 0; i < 16; i++)
    {
        out[i] ^= consts[i];
    }
}
int main()
{
    DWORD i;
    first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    first_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    first_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    first_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
    second_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
    map_keys = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
    map_vals = (BYTE*)VirtualAlloc(NULL, 0x1000000, MEM_COMMIT, PAGE_READWRITE);
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
    ReadFile(file, buff, 0xb00000, &hhh, NULL);
    CloseHandle(file);
    gen_tables(buff);
    BYTE input[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
    BYTE out[16] = { 0 };
    BYTE out2[16] = { 0 };
    for (i = 0; i < 4; i++)
    {
        trans(0, 0, input + i * 4, input + i * 4 + 1, input + i * 4 + 2, input + i * 4 + 3);
        my_printf("%08X\n", de_trans(0, 0, (DWORD*)(input + i * 4)));
    }
    /*sub_401270(input, out);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
    }
    my_printf("\n\n");
    sub_401270_inv(out, out2);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out2[i]);
    }
    my_printf("\n\n");*/
    VirtualFree(buff, 0, MEM_RELEASE);
    VirtualFree(map_vals, 0, MEM_RELEASE);
    VirtualFree(map_keys, 0, MEM_RELEASE);
    VirtualFree(second_tables_indices, 0, MEM_RELEASE);
    VirtualFree(first_tables_indices, 0, MEM_RELEASE);
    VirtualFree(second_tables_sorted, 0, MEM_RELEASE);
    VirtualFree(first_tables_sorted, 0, MEM_RELEASE);
    VirtualFree(second_tables_map, 0, MEM_RELEASE);
    VirtualFree(first_tables_map, 0, MEM_RELEASE);
    VirtualFree(second_tables, 0, MEM_RELEASE);
    VirtualFree(first_tables, 0, MEM_RELEASE);
    return 0xb19b00b5;
}

That is basically the prototype. But it can still be faster.

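The optimization algorithms are not hard. I only used some plain (NOIP junior division level) algorithms such as binary search, quicksort, and two-pointer intersection... Harder ones don't fit here, and aren't needed here either...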
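For readers who have not met these building blocks, here is a minimal sketch of how a two-pointer intersection and a binary search can shrink a candidate set. It is only an illustration under my own assumptions: the names `find_sorted`, `intersect_sorted` and `candidates_demo` are hypothetical, the data is made up, and it uses `stdint.h` types instead of the Windows ones so it builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary search in a sorted array; returns the index of target, or -1. */
static long find_sorted(const uint32_t *arr, size_t len, uint32_t target)
{
    size_t lo = 0, hi = len;                 /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] == target) return (long)mid;
        if (arr[mid] < target) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}

/* Two-pointer intersection of two sorted arrays.
 * The common values are written to the front of a; the count is returned. */
static size_t intersect_sorted(uint32_t *a, size_t na,
                               const uint32_t *b, size_t nb)
{
    size_t i = 0, j = 0, out = 0;
    assert(a != NULL && b != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { a[out++] = a[i]; i++; j++; }
    }
    return out;
}

/* Made-up data: intersect two sorted candidate lists, then filter a third
 * stream of candidates by membership in the (still sorted) intersection. */
static uint32_t candidates_demo(void)
{
    uint32_t first[] = { 3, 8, 15, 21, 42, 77 };
    const uint32_t second[] = { 1, 8, 21, 40, 42, 90 };
    const uint32_t third[] = { 5, 21, 99 };
    uint32_t survivor = 0;
    size_t n = intersect_sorted(first, 6, second, 6);  /* first[0..n) = 8, 21, 42 */
    for (size_t i = 0; i < 3; i++) {
        if (find_sorted(first, n, third[i]) >= 0)
            survivor = third[i];
    }
    return (n == 3) ? survivor : 0;
}
```

The point of the design is that each extra constraint only costs a logarithmic lookup per candidate once the previous survivors are sorted, so nothing has to be stored for candidates that are already ruled out.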

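Optimizing the tables is also simple: merge tables of the same kind (a term I made up while writing this, e.g. BYTE becomes WORD, WORD becomes DWORD, and so on...). That's all there is to it. The code is simple too and you'll get it by reading it, and besides, I'm only here to share the approach...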
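One possible reading of "merging tables of the same kind" is packing several byte tables that are indexed by the same input into a single wider table, so one lookup fetches all of them. The sketch below shows that idea only; `pack_tables`, `packed_lane` and `pack_demo` are hypothetical names, the table contents are made up, and this is not claimed to be the exact layout used in the challenge.

```c
#include <assert.h>
#include <stdint.h>

enum { TABLE_SIZE = 256 };

/* Pack four 256-entry byte tables that share the same index into one
 * 256-entry 32-bit table: lane 0 = t0, lane 1 = t1, and so on. */
static void pack_tables(const uint8_t *t0, const uint8_t *t1,
                        const uint8_t *t2, const uint8_t *t3,
                        uint32_t *packed)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        packed[i] = (uint32_t)t0[i]
                  | ((uint32_t)t1[i] << 8)
                  | ((uint32_t)t2[i] << 16)
                  | ((uint32_t)t3[i] << 24);
    }
}

/* Extract one byte lane from a packed entry. */
static uint8_t packed_lane(uint32_t value, unsigned lane)
{
    assert(lane < 4);
    return (uint8_t)(value >> (8 * lane));
}

/* Fill four made-up tables, pack them, and check that every lane of every
 * packed entry equals the corresponding separate table. Returns 1 on success. */
static int pack_demo(void)
{
    uint8_t t0[TABLE_SIZE], t1[TABLE_SIZE], t2[TABLE_SIZE], t3[TABLE_SIZE];
    uint32_t packed[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++) {
        t0[i] = (uint8_t)(i * 7 + 1);
        t1[i] = (uint8_t)(i ^ 0x5a);
        t2[i] = (uint8_t)(255 - i);
        t3[i] = (uint8_t)(i * 13);
    }
    pack_tables(t0, t1, t2, t3, packed);
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (packed_lane(packed[i], 0) != t0[i]) return 0;
        if (packed_lane(packed[i], 1) != t1[i]) return 0;
        if (packed_lane(packed[i], 2) != t2[i]) return 0;
        if (packed_lane(packed[i], 3) != t3[i]) return 0;
    }
    return 1;
}
```

With this layout, four dependent byte loads from four separate cache lines become a single 32-bit load, which is where any gain would come from if memory access really is the bottleneck.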

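After further optimization and a few small changes, here is the final version: single-threaded, pure CPU, a little SIMD (basically useless; how much can a single shuffle buy you...)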
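To make the single shuffle less mysterious, here is a scalar sketch of what a byte shuffle such as `_mm_shuffle_epi8` computes (`out[i] = in[mask[i]]`), using the same 16-byte permutation as `v2` in the earlier listing. The helper names `permute16`, `invert_perm` and `shuffle_demo` are hypothetical. The expected inverse in the check is the argument list of the `_mm_set_epi8` call in `sub_4011A0_inv` read in reverse, since `_mm_set_epi8` takes the highest byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same permutation as the v2 array in the listing above. */
static const uint8_t PERM[16] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };

/* Scalar meaning of a byte shuffle: new[i] = old[perm[i]]. */
static void permute16(uint8_t block[16], const uint8_t perm[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++) {
        assert(perm[i] < 16);
        tmp[i] = block[perm[i]];
    }
    memcpy(block, tmp, 16);
}

/* Build the inverse permutation: inv[perm[i]] = i. */
static void invert_perm(const uint8_t perm[16], uint8_t inv[16])
{
    for (int i = 0; i < 16; i++)
        inv[perm[i]] = (uint8_t)i;
}

/* Checks that the computed inverse matches the constants in memory order
 * and that forward followed by inverse restores the block. Returns 1 if so. */
static int shuffle_demo(void)
{
    static const uint8_t expected_inv[16] =
        { 0,13,10,7,4,1,14,11,8,5,2,15,12,9,6,3 };
    uint8_t inv[16], block[16], orig[16];

    invert_perm(PERM, inv);
    if (memcmp(inv, expected_inv, 16) != 0)
        return 0;

    for (int i = 0; i < 16; i++)
        block[i] = orig[i] = (uint8_t)(0x11 * i);
    permute16(block, PERM);   /* forward permutation */
    permute16(block, inv);    /* applying the inverse mask undoes it */
    return memcmp(block, orig, 16) == 0;
}
```

Note that the inverse is applied in the same "gather" form as the forward step, just with the inverted mask; that is the form a shuffle instruction expects.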

#include<windows.h>
#include<immintrin.h>
BYTE unk_43D000_inv[256 * 16];
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
BYTE table_ptr[65536];
BYTE* first_tables = 0;
BYTE* second_tables = 0;
BYTE* first_tables_map = 0;
BYTE* second_tables_map = 0;
BYTE* first_tables_sorted = 0;
BYTE* second_tables_sorted = 0;
DWORD64* first_tables_indices = 0;
DWORD64* second_tables_indices = 0;
DWORD first_tables_sorted_len[13 * 4 * 16];
DWORD second_tables_sorted_len[13 * 4 * 16];
BYTE third_tables[16 * 256];
DWORD* probs_0 = 0;
DWORD* probs_1 = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
    DWORD i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void junk_memcpy(void* dst, void* src, size_t sz)//only usable when dst is 16-byte aligned and sz is a multiple of 16
{
    while (sz)
    {
        _mm_store_si128((__m128i*)((DWORD64)dst + sz - 16), _mm_loadu_si128((__m128i*)((DWORD64)src + sz - 16)));
        sz -= 16;
    }
}
void junk_memset(void* mem, BYTE val, size_t sz)//only usable when mem is 16-byte aligned and sz is a multiple of 16
{
    while (sz)
    {
        _mm_store_si128((__m128i*)((DWORD64)mem + sz - 16), _mm_set1_epi8(val));
        sz -= 16;
    }
}
DWORD binary_search(DWORD* arr, int start, int end, DWORD target)
{
    int l = start, r = end;
    DWORD ret = 0xffffffff;
    while (l <= r)
    {
        int mid = l + ((r - l) >> 1);
        if (arr[mid] == target)
        {
            ret = mid;
            break;
        }
        else if (arr[mid] < target)
        {
            l = mid + 1;
        }
        else
        {
            r = mid - 1;
        }
    }
    return ret;
}
DWORD my_bsearch(DWORD* arr, DWORD len, DWORD target)
{
    return binary_search(arr, 0, (int)len - 1, target);//"optimizations" with high jitter, I'm not doing those #(funny)
}
DWORD intersec(DWORD* arr1, DWORD len1, DWORD* arr2, DWORD len2)
{
    int i = 0, j = 0;
    DWORD outlen = 0;
    while (i < (int)len1 && j < (int)len2)
    {
        if (arr1[i] < arr2[j])
        {
            i++;
        }
        else if (arr1[i] > arr2[j])
        {
            j++;
        }
        else
        {
            arr1[outlen++] = arr1[i];
            i++;
            j++;
        }
    }
    return outlen;
}
int partition(DWORD* arr, int low, int high)
{
    int i = low, j = high;
    DWORD p = arr[low];
    while (i < j)
    {
        while (i < j && arr[j] > p)
        {
            j--;
        }
        if (i < j)
        {
            arr[i] ^= arr[j];
            arr[j] ^= arr[i];
            arr[i] ^= arr[j];
            i++;
        }
        while (i < j && arr[i] <= p)
        {
            i++;
        }
        if (i < j)
        {
            arr[i] ^= arr[j];
            arr[j] ^= arr[i];
            arr[i] ^= arr[j];
            j--;
        }
    }
    arr[i] = p;
    return i;
}
void _qsort(DWORD* arr, int low, int high)
{
    if (low >= 0 && high >= 0 && low < high)
    {
        int mid = partition(arr, low, high);
        _qsort(arr, low, mid - 1);
        _qsort(arr, mid + 1, high);
    }
}
#define qsort(arr,len) _qsort((arr),0,(len)-1)
void idsort(DWORD* arr, DWORD* ids, DWORD n)
{
    int i, j;
    for (i = 0; i < (int)n - 1; i++)
    {
        for (j = 0; j < (int)n - i - 1; j++)
        {
            if (arr[j] > arr[j + 1])
            {
                arr[j] ^= arr[j + 1];
                arr[j + 1] ^= arr[j];
                arr[j] ^= arr[j + 1];
                ids[j] ^= ids[j + 1];
                ids[j + 1] ^= ids[j];
                ids[j] ^= ids[j + 1];
            }
        }
    }
}
void sub_4011A0(BYTE* a1)
{
    *(__m128i*)a1 = _mm_shuffle_epi8(*(__m128i*)a1, _mm_set_epi8(11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 0));
}
void sub_4011A0_inv(BYTE* a1)
{
    *(__m128i*)a1 = _mm_shuffle_epi8(*(__m128i*)a1, _mm_set_epi8(3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13, 0));
}
DWORD de_trans_quad(DWORD* probs, BYTE* table0, DWORD* table1, DWORD64* table2,
    DWORD j, DWORD k, DWORD idx, BYTE in, DWORD* priv_tab, DWORD priv_len, BOOL search)
{
    DWORD s, t, u, v;
    DWORD w, x, y, z;
    BYTE low = in & 0xf;
    BYTE high = in >> 4;
    DWORD mat_idx = (j << 6) | (k << 4);
    DWORD table_base = mat_idx << 8;
    BYTE* tables0 = table0 + table_base;
    DWORD* tables1 = table1 + mat_idx;
    DWORD64* tables2 = table2 + table_base;
    BYTE* t_a = tables0 + (idx << 10);
    BYTE* t_b = tables0 + ((idx << 10) | 0x100);
    BYTE* t_c = tables0 + ((idx << 10) | 0x200);
    BYTE* t_d = tables0 + ((idx << 10) | 0x300);
    DWORD64* tt_a = tables2 + (idx << 10);
    DWORD64* tt_b = tables2 + ((idx << 10) | 0x100);
    DWORD64* tt_c = tables2 + ((idx << 10) | 0x200);
    DWORD64* tt_d = tables2 + ((idx << 10) | 0x300);
    DWORD sz_a = tables1[idx << 2];
    DWORD sz_b = tables1[(idx << 2) | 1];
    DWORD sz_c = tables1[(idx << 2) | 2];
    DWORD sz_d = tables1[(idx << 2) | 3];
    DWORD cnt_a = 256 / sz_a;
    DWORD cnt_b = 256 / sz_b;
    DWORD cnt_c = 256 / sz_c;
    DWORD cnt_d = 256 / sz_d;
    DWORD indices = 0;
    for (w = 0; w < sz_a; w++)
    {
        for (x = 0; x < sz_b; x++)
        {
            for (y = 0; y < sz_c; y++)
            {
                for (z = 0; z < sz_d; z++)
                {
                    WORD a_a = (WORD)t_a[w];//forward computation
                    WORD a_b = (WORD)t_b[x];
                    WORD a_c = (WORD)t_c[y];
                    WORD a_d = (WORD)t_d[z];
                    WORD low_word = ((a_a & 0xF) << 12) | ((a_b & 0xF) << 8) | ((a_c & 0xF) << 4) | (a_d & 0xF);
                    WORD high_word = ((a_a >> 4) << 12) | ((a_b >> 4) << 8) | ((a_c >> 4) << 4) | (a_d >> 4);
                    if (low == table_ptr[low_word] && high == table_ptr[high_word])
                    {
                        BYTE* arr_a = (BYTE*)&tt_a[a_a];//walk every index in the original table that maps to this recorded value; the combinations are all possible original values
                        BYTE* arr_b = (BYTE*)&tt_b[a_b];
                        BYTE* arr_c = (BYTE*)&tt_c[a_c];
                        BYTE* arr_d = (BYTE*)&tt_d[a_d];
                        for (s = 0; s < cnt_a; s++)
                        {
                            for (t = 0; t < cnt_b; t++)
                            {
                                for (u = 0; u < cnt_c; u++)
                                {
                                    for (v = 0; v < cnt_d; v++)
                                    {
                                        DWORD val = arr_a[s] | (arr_b[t] << 8) | (arr_c[u] << 16) | (arr_d[v] << 24);
                                        if (!search || my_bsearch(priv_tab, priv_len, val) != 0xffffffff)//binary-search the sorted list, computing the intersection on the fly
                                        {
                                            probs[indices++] = val;
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return indices;
}
void de_trans_half(BYTE* table0, DWORD* table1, DWORD64* table2, DWORD j, DWORD k, DWORD* in)
{
    DWORD i, m;
    DWORD mat_idx = (j << 6) | (k << 4);
    DWORD table_base = mat_idx << 8;
    DWORD* tables_lens = table1 + mat_idx;
    DWORD idxs[4] = { 0,1,2,3 };
    DWORD sizes[4] = { 0 };
    for (i = 0; i < 4; i++)
    {
        DWORD sum = 1;
        for (m = 0; m < 4; m++)
        {
            sum *= tables_lens[(i << 2) | m];
        }
        sizes[i] = sum;
    }
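    idsort(sizes, idxs, 4);//small data set, so bubble sort; start from the row with the least data. Since the total amount of data is fixed, starting from the smallest gives the smallest range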

    DWORD cur_idx = idxs[0];
    DWORD outlen_1 = de_trans_quad(probs_0, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), NULL, 0, FALSE);
    qsort(probs_0, outlen_1);//first group, 0x1000000 entries; large, so quicksort, used for the intersection that follows

    cur_idx = idxs[1];
    DWORD outlen_2 = de_trans_quad(probs_1, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), NULL, 0, FALSE);
    qsort(probs_1, outlen_2);//quicksort the second group, 0x1000000 entries

    DWORD outlen = intersec(probs_0, outlen_1, probs_1, outlen_2);//intersect the two groups, narrowing the data to 0x10000 entries
    
    cur_idx = idxs[2];//from here on, use the binary-search optimization
    outlen = de_trans_quad(probs_1, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), probs_0, outlen, TRUE);
    qsort(probs_1, outlen);//narrows the data to 0x100 entries
    
    cur_idx = idxs[3];
    de_trans_quad(probs_0, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), probs_1, outlen, TRUE);//narrows down to the result
    *in = probs_0[0];
}
void trans(DWORD j, DWORD k, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
    DWORD aa = (DWORD)*a;
    DWORD bb = (DWORD)*b;
    DWORD cc = (DWORD)*c;
    DWORD dd = (DWORD)*d;
    my_printf("%02X%02X%02X%02X\n", dd, cc, bb, aa);
    DWORD mat_idx = (j << 6) | (k << 4);
    DWORD table_base = mat_idx << 8;

    WORD low, high;
    BYTE* tables = first_tables + table_base;

    BYTE a_a = tables[256 * 0 + aa];
    BYTE a_b = tables[256 * 4 + aa];
    BYTE a_c = tables[256 * 8 + aa];
    BYTE a_d = tables[256 * 12 + aa];

    BYTE b_a = tables[256 * 1 + bb];
    BYTE b_b = tables[256 * 5 + bb];
    BYTE b_c = tables[256 * 9 + bb];
    BYTE b_d = tables[256 * 13 + bb];

    BYTE c_a = tables[256 * 2 + cc];
    BYTE c_b = tables[256 * 6 + cc];
    BYTE c_c = tables[256 * 10 + cc];
    BYTE c_d = tables[256 * 14 + cc];

    BYTE d_a = tables[256 * 3 + dd];
    BYTE d_b = tables[256 * 7 + dd];
    BYTE d_c = tables[256 * 11 + dd];
    BYTE d_d = tables[256 * 15 + dd];

    low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
    high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
    aa = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
    high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
    bb = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
    high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
    cc = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
    high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
    dd = (table_ptr[high] << 4) | table_ptr[low];
    my_printf("%02X%02X%02X%02X\n", dd, cc, bb, aa);
    tables = second_tables + table_base;
    a_a = tables[256 * 0 + aa];
    a_b = tables[256 * 4 + aa];
    a_c = tables[256 * 8 + aa];
    a_d = tables[256 * 12 + aa];

    b_a = tables[256 * 1 + bb];
    b_b = tables[256 * 5 + bb];
    b_c = tables[256 * 9 + bb];
    b_d = tables[256 * 13 + bb];

    c_a = tables[256 * 2 + cc];
    c_b = tables[256 * 6 + cc];
    c_c = tables[256 * 10 + cc];
    c_d = tables[256 * 14 + cc];

    d_a = tables[256 * 3 + dd];
    d_b = tables[256 * 7 + dd];
    d_c = tables[256 * 11 + dd];
    d_d = tables[256 * 15 + dd];

    low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
    high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
    *a = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
    high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
    *b = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
    high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
    *c = (table_ptr[high] << 4) | table_ptr[low];

    low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
    high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
    *d = (table_ptr[high] << 4) | table_ptr[low];
}
void sub_401270(BYTE* input, BYTE* out)
{
    DWORD m, n, i, j, k;
    for (n = 0; n < 16; n++)
    {
        out[n] = input[n];
    }
    for (i = 0; i < 16; i++)
    {
        out[i] ^= consts[i];
    }
    for (j = 0; j < 13; j++)
    {
        sub_4011A0(out);
        for (k = 0; k < 4; k++)
        {
            trans(j, k, out + k * 4, out + k * 4 + 1, out + k * 4 + 2, out + k * 4 + 3);
        }
    }
    sub_4011A0(out);
    for (m = 0; m < 16; m++)
    {
        out[m] = third_tables[(m << 8) | out[m]];
    }
}
void gen_tables()
{
    DWORD hhh = 0;
    HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    BYTE* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
    ReadFile(file, buff, 0xb00000, &hhh, NULL);
    CloseHandle(file);
    BYTE* unk_51E000 = buff + 0x11ba00;
    BYTE* unk_866000 = buff + 0x463a00;
    BYTE* unk_7F6000 = buff + 0x3f3a00;
    BYTE* unk_43D000 = buff + 0x3aa00;
    DWORD i, j, k, m;
    DWORD low_high, low_low;
    for (low_high = 0; low_high < 256; low_high++)
    {
        for (low_low = 0; low_low < 256; low_low++)
        {
            BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
            BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
            BYTE low = (high_nib << 4) | low_nib;
            table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
        }
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        for (i = 0; i < 256; i++)
        {
            table_inv[table[i]] = (BYTE)i;
        }
    }
    first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    for (j = 0; j < 13; j++)
    {
        for (k = 0; k < 4; k++)
        {	
            DWORD idx = (j << 12) | (k << 10);
            DWORD table_base = 256 * 16 * (idx >> 10);
            for (i = 0; i < 256; i++)
            {
                BYTE* tables = first_tables + table_base;
                tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
                tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
                tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
                tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
                tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];

                tables = second_tables + table_base;
                tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
                tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
                tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
                tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
                tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
                tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
                tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
                tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
                tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
                tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
            }
        }
    }
    for (m = 0; m < 16; m++)
    {
        for (i = 0; i < 256; i++)
        {
            BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
            third_tables[256 * m + i] = table[i];
        }
    }
    VirtualFree(buff, 0, MEM_RELEASE);
    first_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
    {
        BYTE* ptr_first_orig = first_tables + m;
        BYTE* ptr_second_orig = second_tables + m;
        BYTE* ptr_first_dst = first_tables_map + m;
        BYTE* ptr_second_dst = second_tables_map + m;
        for (i = 0; i < 256; i++)
        {
            ptr_first_dst[ptr_first_orig[i]] = 1;
            ptr_second_dst[ptr_second_orig[i]] = 1;
        }
    }
    first_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    second_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
    {
        BYTE* ptr_first_orig = first_tables_map + m;
        BYTE* ptr_second_orig = second_tables_map + m;
        BYTE* ptr_first_dst = first_tables_sorted + m;
        BYTE* ptr_second_dst = second_tables_sorted + m;
        int len_indice = m >> 8;
        first_tables_sorted_len[len_indice] = 0;
        second_tables_sorted_len[len_indice] = 0;
        for (i = 0; i < 256; i++)
        {
            if (ptr_first_orig[i])
            {
                ptr_first_dst[first_tables_sorted_len[len_indice]++] = (BYTE)i;
            }
            if (ptr_second_orig[i])
            {
                ptr_second_dst[second_tables_sorted_len[len_indice]++] = (BYTE)i;
            }
        }
    }
    VirtualFree(second_tables_map, 0, MEM_RELEASE);
    VirtualFree(first_tables_map, 0, MEM_RELEASE);
    first_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
    second_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
    for (m = 0; m < 13 * 4 * 16 * 256; m += 256)// spread entries out by how many duplicates each value has; what gets stored are the input indices
    {
        BYTE tmp_indices_first[256];
        BYTE tmp_indices_second[256];
        junk_memset(tmp_indices_first, 0, 256);
        junk_memset(tmp_indices_second, 0, 256);
        BYTE* ptr_first_orig = first_tables + m;
        BYTE* ptr_second_orig = second_tables + m;
        DWORD64* ptr_first_dst = first_tables_indices + m;
        DWORD64* ptr_second_dst = second_tables_indices + m;
        for (i = 0; i < 256; i++)
        {
            DWORD first = (DWORD)ptr_first_orig[i];
            DWORD second = (DWORD)ptr_second_orig[i];
            BYTE* arr_first = (BYTE*)(ptr_first_dst + first);
            BYTE* arr_second = (BYTE*)(ptr_second_dst + second);
            arr_first[(DWORD)(tmp_indices_first[first]++)] = (BYTE)i;
            arr_second[(DWORD)(tmp_indices_second[second]++)] = (BYTE)i;
        }
    }
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
    DWORD n, m, i, k;
    int j;
    for (n = 0; n < 16; n++)
    {
        out[n] = input[n];
    }
    for (m = 0; m < 16; m++)
    {
        BYTE* table_inv = unk_43D000_inv + 256 * m;
        out[m] = table_inv[out[m]];
    }
    sub_4011A0_inv(out);
    for (j = 12; j >= 0; j--)
    {
        for (k = 0; k < 4; k++)
        {
            de_trans_half(second_tables_sorted, second_tables_sorted_len, second_tables_indices, (DWORD)j, k, (DWORD*)(out + k * 4));
            my_printf("%08X\n", *(DWORD*)(out + k * 4));
            de_trans_half(first_tables_sorted, first_tables_sorted_len, first_tables_indices, (DWORD)j, k, (DWORD*)(out + k * 4));
            my_printf("%08X\n", *(DWORD*)(out + k * 4));
        }
        sub_4011A0_inv(out);
    }
    for (i = 0; i < 16; i++)
    {
        out[i] ^= consts[i];
    }
}
int main()
{
    DWORD i;
    gen_tables();
    BYTE input[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
    BYTE out[16] = { 0 };
    BYTE out2[16] = { 0 };
    probs_0 = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
    probs_1 = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
    trans(0, 0, input + 2 * 4, input + 2 * 4 + 1, input + 2 * 4 + 2, input + 2 * 4 + 3);
    de_trans_half(second_tables_sorted, second_tables_sorted_len, second_tables_indices, 0, 0, (DWORD*)(input + 2 * 4));
    my_printf("%08X\n", *(DWORD*)(input + 2 * 4));
    de_trans_half(first_tables_sorted, first_tables_sorted_len, first_tables_indices, 0, 0, (DWORD*)(input + 2 * 4));
    my_printf("%08X\n", *(DWORD*)(input + 2 * 4));
    /*sub_401270(input, out);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X ", out[i]);
    }
    my_printf("\n\n");
    sub_401270_inv(out, out2);
    for (i = 0; i < 16; i++)
    {
        my_printf("%02X", out2[i]);
    }
    my_printf("\n\n");*/
    VirtualFree(probs_1, 0, MEM_RELEASE);
    VirtualFree(probs_0, 0, MEM_RELEASE);
    VirtualFree(second_tables_indices, 0, MEM_RELEASE);
    VirtualFree(first_tables_indices, 0, MEM_RELEASE);
    VirtualFree(second_tables_sorted, 0, MEM_RELEASE);
    VirtualFree(first_tables_sorted, 0, MEM_RELEASE);
    VirtualFree(second_tables, 0, MEM_RELEASE);
    VirtualFree(first_tables, 0, MEM_RELEASE);
    return 0xb19b00b5;
}
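
The last loop of gen_tables groups, for every table, the inputs that produce each output byte, because these tables are not necessarily one-to-one. A minimal standalone sketch of that idea follows. It is not the layout used in the listing (which stores up to 8 inputs per value in a DWORD64 slot); it uses a counting-sort layout instead, and the names preimage_index, build_preimages and preimage_count are made up for illustration. It also uses stdint types so it builds without windows.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Preimage index for one 256-entry byte table that need not be a bijection.
   After build_preimages, the inputs x with table[x] == v are
   inputs[start[v]] .. inputs[start[v + 1] - 1], in ascending order. */
typedef struct {
    uint16_t start[257];
    uint8_t inputs[256];
} preimage_index;

static void build_preimages(const uint8_t table[256], preimage_index *idx)
{
    uint16_t count[256] = {0};
    uint16_t next[256];
    int v, x;

    /* Pass 1: count how many inputs map to each output value. */
    for (x = 0; x < 256; x++)
        count[table[x]]++;

    /* Prefix sums give the start offset of every bucket. */
    idx->start[0] = 0;
    for (v = 0; v < 256; v++) {
        idx->start[v + 1] = (uint16_t)(idx->start[v] + count[v]);
        next[v] = idx->start[v];
    }

    /* Pass 2: drop every input into the bucket of its output value. */
    for (x = 0; x < 256; x++)
        idx->inputs[next[table[x]]++] = (uint8_t)x;
}

/* Number of inputs that map to value v (0 means v is unreachable). */
static size_t preimage_count(const preimage_index *idx, uint8_t v)
{
    return (size_t)(idx->start[v + 1] - idx->start[v]);
}
```

With such an index, inverting one table lookup means enumerating a short list of candidate inputs instead of scanning all 256 values. Unlike fixed 8-byte slots, this layout has no upper limit on how many inputs may share an output.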

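As before, the long listing above will not compile as posted, but it represents my approach. In my tests it finishes within half an hour at the slowest (little jitter, fairly stable), which brings the running time down to roughly 1/8 of the original.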
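
The comments in the listing describe how the candidates are narrowed: sort two large candidate sets, intersect them, then filter the remaining candidates with binary search. Below is a minimal sketch of those two building blocks, under some assumptions: it uses the standard library qsort rather than the CRT-free two-argument qsort and intersec called in the listing, and the names cmp_u32, intersect_sorted and contains_sorted are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over 32-bit candidates (ascending). */
static int cmp_u32(const void *x, const void *y)
{
    uint32_t a = *(const uint32_t *)x;
    uint32_t b = *(const uint32_t *)y;
    return (a > b) - (a < b);
}

/* Intersect two ascending arrays with a linear merge.
   The common elements overwrite the front of a; the count is returned. */
static size_t intersect_sorted(uint32_t *a, size_t na,
                               const uint32_t *b, size_t nb)
{
    size_t i = 0, j = 0, out = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            a[out++] = a[i];
            i++;
            j++;
        }
    }
    return out;
}

/* Binary search: returns 1 if key is in the ascending array a[0..n), else 0. */
static int contains_sorted(const uint32_t *a, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && a[lo] == key;
}
```

A typical use is to qsort both candidate arrays with cmp_u32, call intersect_sorted once, and then test each later candidate with contains_sorted. The merge costs one pass over both arrays, and every later membership test is logarithmic in the size of the surviving set.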

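This challenge took me about 24 hours in total. I actually wrote four versions of the script: a direct brute-force version, a SIMD version, a first optimized version and a second optimized version. The 4-hour figure mentioned earlier is an estimate (a single trans takes 2 minutes, so 2 * 2 * 13 * 4 = 208 minutes, roughly 4 hours, single-threaded); in practice I only ran one trans.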
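
The estimate can be written out as a tiny calculation. This is only a sketch of the arithmetic: the reading of the second factor 2 as "the first_tables half and the second_tables half" is an assumption based on the two de_trans_half calls per column in sub_401270_inv, and the constant and function names are made up.

```c
/* Back-of-envelope estimate for the single-threaded brute force.
   All figures are the ones quoted in the post; nothing here is measured. */
enum {
    MINUTES_PER_TRANS = 2, /* one trans inversion, as quoted */
    HALVES = 2,            /* assumed: first_tables half and second_tables half */
    ROUNDS = 13,           /* j = 0..12 */
    COLUMNS = 4            /* k = 0..3 */
};

/* Number of trans inversions needed to undo the whole cipher once. */
static unsigned total_trans_calls(void)
{
    return HALVES * ROUNDS * COLUMNS;
}

/* Estimated wall time in minutes when the inversions run one after another. */
static unsigned estimated_minutes(void)
{
    return MINUTES_PER_TRANS * total_trans_calls();
}
```

That gives 104 inversions and 208 minutes, about three and a half hours, which is the number rounded to 4 hours above.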

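Had I solved it during the contest, it would have taken me about 3 hours: ten minutes of reversing to reach the algorithm, half an hour to write the direct brute force, 10 minutes to add multithreading, and 2 hours of brute forcing to get the result (multithreaded: the 4*2 trans calls in each group can run in parallel, I have a 4-core CPU, and this ignores interference).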

In the contest itself, first blood went to 数码暴龙 at the 2-hour mark, and by 4 hours there were already 10 solves.

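So what does that make me? From the day I started learning until now, what have I actually accomplished? My skills still have a lot of room to grow, and I will need to keep learning for a lifetime.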

My boss taught me that it is better to think about practical things.

Fuzhou is so hot... I really want to go to Jinan... maybe in winter... boss, wait for me... #(滑稽)



Last edited by n00bzx 6 hours ago.
Latest replies (1)

Looking at it again, it is long and tedious. I accept my fate. Writing articles is not for me.

Last edited by n00bzx 1 hour ago.