[Original] A post about devil.exe from the desctf hosted by 天命战队 (Tianming team) (not a negative review)
Posted: 8 hours ago 199
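Since registering I have never posted on Kanxue (看雪), apart from the challenge-submission posts for the contests of the previous two years. Last year the big names all retired and I picked up a prize, so this year I am not playing on the attacking side.
Sigh, CTF is in decline.
I am posting this challenge with the original author's consent. It is a well-made challenge; more precisely, it has a real sense of artistry and provokes thought. (Otherwise I would not have hacked together this much junk code in the time left over from my boss's tasks and my studies, let alone written an article about it.)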
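First, a few words about the contest.
I had not played a CTF in a long time. This contest was hosted by my mentor's team, so with my boss's permission I played it purely to learn. I registered the account n00bzx, a throwaway that was not competing for points, solved the sign-in challenge, submitted its flag, then saved the challenges and went to bed, planning to work through them slowly. The challenges are not going anywhere, right? It was their first time hosting a contest; the challenges were not too hard and the atmosphere in the discussion group was great. So it was a very successful first run. That is only the opinion of a reverse-engineering newbie; I know nothing about the other categories and am in no position to judge them. I hope the team's contests keep getting better. Thanks to the challenge author for permission to post this superficial analysis.
The challenge attachment is given later in the post. If the author would rather I not share the attachment, please tell me as soon as you see this; I will immediately delete the whole post and republish it after revising. Scold me however you like!
This post only covers the approach; it does not provide a full write-up.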
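Opening it in IDA goes without saying. It is written in C++ and has no symbols (which does not matter). At some point (which includes anti-debugging) it computes a hash of some section with some algorithm and stores it in a global variable (what is that used for later?). It inline-hooks the CRT (this hook is fairly important; what is it used for later?). The hook uses VEH (vectored exception handling, which is somewhat unusual), and the handler contains some logic that also acts as a degree of anti-debugging; I think bypassing it takes a bit of driver-style thinking (only the thinking, mind you). After some further xxxx you reach the algorithm, which is what I want to talk about. I suggest getting through everything before it within 15 minutes (that is a long time, definitely enough). After those 15 minutes, go straight to the algorithm. After some IDA xxxx (creating structs, renaming variables, xxxx static-analysis techniques; I spell this out for readers who have never done it, since there must be people even greener than me!), you get this first-pass logic: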
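As Uncle says: one more thing! All the code here has a few small bugs, which I may or may not have added on purpose (take a guess), so do not just compile and run it (it might build, but the result will definitely be wrong!). Solving a challenge is also learning, so bring your own thinking. (A newbie's shallow opinion.)
Now for the actual code #(smirk)
I hate the CRT! So I disabled all the default libraries, which shrinks the binary (pointless; I got obsessed with building the smallest possible PE a while back, just treat me as an idiot)... To be honest, this code does compile and run correctly, but it needs the challenge binary on the desktop... It needs no libraries (not even the default ones #(smirk)). Build it with VS2022 (not required, anything works; everyone likes to say this, so I will too), compiled as C (also not really required; filler). In short, no special settings, just compile it (more filler).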
#include<stdio.h>
#include<windows.h>
BYTE* unk_51E000 = 0;
BYTE* unk_866000 = 0;
BYTE* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
/* printf replacement without the CRT: formats via ntdll's _vsnprintf, then writes to the console. */
void my_printf(const char* format, ...)
{
    DWORD i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    va_end(args); /* pairs with va_start above */
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
/* Byte shuffle of a 16-byte block: out[i] = in[v2[i]]. */
void sub_4011A0(BYTE* a1)
{
    int v2[16] = { 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11 };
    int i;
    char v4[16];
    for (i = 0; i < 16; ++i)
    {
        v4[i] = a1[v2[i]];
    }
    memcpy(a1, v4, 0x10);
}
/* Undo sub_4011A0 by scattering: out[v2[i]] = in[i]. */
void sub_4011A0_inv(BYTE* a1)
{
    int v2[16] = { 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11 };
    BYTE temp[16];
    int i;
    memcpy(temp, a1, 0x10);
    for (i = 0; i < 16; ++i)
    {
        a1[v2[i]] = temp[i];
    }
}
void sub_401270(BYTE *input_pass, BYTE *out)
{
int n;
int m;
int i;
BYTE aa, bb, cc, dd;
BYTE low, high;
int j;
int k;
BYTE const1[] = { 0xB8,0xA1,0xD9,0xB9,0xD8,0x3B,0x17,0x91,0x75,0x12,0x1B,0x74,0x18,0x5B,0x16,0x39,0x76,0xA2,0x0C,0xFA,0x90,0x94,0x36,0x41,0x58,0x59,0x43,0xD4,0x47,0x92,0x2D,0xEA };
BYTE const2[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
for (i = 0; i < 16; ++i)
{
input_pass[i] ^= const2[i];
}
for (j = 0; j < 13; ++j)
{
sub_4011A0(input_pass);
for (k = 0; k < 4; ++k)
{
BYTE v14_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v12_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v10_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v8_a = unk_51E000[3 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v14_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v12_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v10_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v8_b = unk_51E000[2 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v14_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v12_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v10_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v8_c = unk_51E000[1 + 4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v14_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v12_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v10_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v8_d = unk_51E000[4 * (53248 * ((int)*const1 >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
low = unk_866000[319488 * ((int)const1[5] >> 4) + 1280 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 512 + 24576 * j + 6144 * k + 16 * (v14_a & 0xF) + (v12_a & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 768 + 24576 * j + 6144 * k + 16 * (v10_a & 0xF) + (v8_a & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 1024 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 24576 * j + 6144 * k + 16 * (v14_a >> 4) + (v12_a >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 256 + 24576 * j + 6144 * k + 16 * (v10_a >> 4) + (v8_a >> 4)]];
aa = high;
aa <<= 4;
aa |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 2816 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 2048 + 24576 * j + 6144 * k + 16 * (v14_b & 0xF) + (v12_b & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 2304 + 24576 * j + 6144 * k + 16 * (v10_b & 0xF) + (v8_b & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 2560 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 1536 + 24576 * j + 6144 * k + 16 * (v14_b >> 4) + (v12_b >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 1792 + 24576 * j + 6144 * k + 16 * (v10_b >> 4) + (v8_b >> 4)]];
bb = high;
bb <<= 4;
bb |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 4352 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 3584 + 24576 * j + 6144 * k + 16 * (v14_c & 0xF) + (v12_c & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 3840 + 24576 * j + 6144 * k + 16 * (v10_c & 0xF) + (v8_c & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 4096 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 3072 + 24576 * j + 6144 * k + 16 * (v14_c >> 4) + (v12_c >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 3328 + 24576 * j + 6144 * k + 16 * (v10_c >> 4) + (v8_c >> 4)]];
cc = high;
cc <<= 4;
cc |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 5888 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 5120 + 24576 * j + 6144 * k + 16 * (v14_d & 0xF) + (v12_d & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 5376 + 24576 * j + 6144 * k + 16 * (v10_d & 0xF) + (v8_d & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 5632 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 4608 + 24576 * j + 6144 * k + 16 * (v14_d >> 4) + (v12_d >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 4864 + 24576 * j + 6144 * k + 16 * (v10_d >> 4) + (v8_d >> 4)]];
dd = high;
dd <<= 4;
dd |= low;
input_pass[4 * k] = aa;
input_pass[4 * k + 1] = bb;
input_pass[4 * k + 2] = cc;
input_pass[4 * k + 3] = dd;
BYTE v15_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v13_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v11_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v9_a = unk_7F6000[3 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v15_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v13_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v11_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v9_b = unk_7F6000[2 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v15_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v13_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v11_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v9_c = unk_7F6000[1 + 4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
BYTE v15_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 4096 * j + 1024 * k + input_pass[4 * k])];
BYTE v13_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 256 + 4096 * j + 1024 * k + input_pass[4 * k + 1])];
BYTE v11_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 512 + 4096 * j + 1024 * k + input_pass[4 * k + 2])];
BYTE v9_d = unk_7F6000[4 * (57344 * ((int)const1[10] >> 4) + 768 + 4096 * j + 1024 * k + input_pass[4 * k + 3])];
low = unk_866000[319488 * ((int)const1[5] >> 4) + 1280 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 512 + 24576 * j + 6144 * k + 16 * (v15_a & 0xF) + (v13_a & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 768 + 24576 * j + 6144 * k + 16 * (v11_a & 0xF) + (v9_a & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 1024 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 24576 * j + 6144 * k + 16 * (v15_a >> 4) + (v13_a >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 256 + 24576 * j + 6144 * k + 16 * (v11_a >> 4) + (v9_a >> 4)]];
aa = high;
aa <<= 4;
aa |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 2816 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 2048 + 24576 * j + 6144 * k + 16 * (v15_b & 0xF) + (v13_b & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 2304 + 24576 * j + 6144 * k + 16 * (v11_b & 0xF) + (v9_b & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 2560 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 1536 + 24576 * j + 6144 * k + 16 * (v15_b >> 4) + (v13_b >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 1792 + 24576 * j + 6144 * k + 16 * (v11_b >> 4) + (v9_b >> 4)]];
bb = high;
bb <<= 4;
bb |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 4352 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 3584 + 24576 * j + 6144 * k + 16 * (v15_c & 0xF) + (v13_c & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 3840 + 24576 * j + 6144 * k + 16 * (v11_c & 0xF) + (v9_c & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 4096 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 3072 + 24576 * j + 6144 * k + 16 * (v15_c >> 4) + (v13_c >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 3328 + 24576 * j + 6144 * k + 16 * (v11_c >> 4) + (v9_c >> 4)]];
cc = high;
cc <<= 4;
cc |= low;
low = unk_866000[319488 * ((int)const1[5] >> 4) + 5888 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 5120 + 24576 * j + 6144 * k + 16 * (v15_d & 0xF) + (v13_d & 0xF)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 5376 + 24576 * j + 6144 * k + 16 * (v11_d & 0xF) + (v9_d & 0xF)]];
high = unk_866000[319488 * ((int)const1[5] >> 4) + 5632 + 24576 * j + 6144 * k + 16
* unk_866000[319488 * ((int)const1[5] >> 4) + 4608 + 24576 * j + 6144 * k + 16 * (v15_d >> 4) + (v13_d >> 4)]
+ unk_866000[319488 * ((int)const1[5] >> 4) + 4864 + 24576 * j + 6144 * k + 16 * (v11_d >> 4) + (v9_d >> 4)]];
dd = high;
dd <<= 4;
dd |= low;
input_pass[4 * k] = aa;
input_pass[4 * k + 1] = bb;
input_pass[4 * k + 2] = cc;
input_pass[4 * k + 3] = dd;
}
}
sub_4011A0(input_pass);
for (m = 0; m < 16; ++m)
{
input_pass[m] = unk_43D000[57344 * ((int)const1[2] >> 4) + 53248 + 256 * m + input_pass[m]];
}
for (n = 0; n < 16; ++n)
{
out[n] = input_pass[n];
}
}
int main()
{
void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
ReadFile(file, buff, 11534336, &hhh, NULL);
CloseHandle(file);
unk_51E000 = (BYTE*)buff + 0x11ba00;
unk_866000 = (BYTE*)buff + 0x463a00;
unk_7F6000 = (BYTE*)buff + 0x3f3a00;
unk_43D000 = (BYTE*)buff + 0x3aa00;
BYTE input_pass[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
BYTE out[16] = { 0 };
sub_401270(input_pass, out);
int i = 0;
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out[i]);
}
VirtualFree(buff, 0, MEM_RELEASE);
return 0xb19b00b5;
}
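The repeated index expressions in sub_401270 reduce to a handful of fixed numbers, which is where constants such as 0x8f000, 0xea000 and 0xe000 in the later listings come from. The sketch below only redoes that arithmetic in Python. The names base_51e, base_866, base_7f6, base_43d and file_offsets are made up for illustration, and the constant difference between the addresses in the unk_ names and the file offsets used in main() is an observation about the numbers, not something the challenge author states.

```python
# const1 copied from sub_401270 in the listing above.
const1 = bytes([
    0xB8, 0xA1, 0xD9, 0xB9, 0xD8, 0x3B, 0x17, 0x91,
    0x75, 0x12, 0x1B, 0x74, 0x18, 0x5B, 0x16, 0x39,
    0x76, 0xA2, 0x0C, 0xFA, 0x90, 0x94, 0x36, 0x41,
    0x58, 0x59, 0x43, 0xD4, 0x47, 0x92, 0x2D, 0xEA,
])

# Each table base is "stride * (high nibble of one const1 byte)".
base_51e = 53248 * (const1[0] >> 4)            # (int)*const1 >> 4 uses const1[0]
base_866 = 319488 * (const1[5] >> 4)
base_7f6 = 57344 * (const1[10] >> 4)
base_43d = 57344 * (const1[2] >> 4) + 53248    # final lookup loop

print(hex(base_51e), hex(base_866), hex(base_7f6), hex(base_43d))

# The per-round, per-column offset can be written with shifts,
# because k stays below 4 and never carries into the j bits.
for j in range(13):
    for k in range(4):
        assert 4096 * j + 1024 * k == (j << 12) | (k << 10)

# Address embedded in each unk_ name -> file offset used in main().
file_offsets = {
    0x51E000: 0x11BA00,
    0x866000: 0x463A00,
    0x7F6000: 0x3F3A00,
    0x43D000: 0x3AA00,
}
deltas = {va - off for va, off in file_offsets.items()}
print([hex(d) for d in deltas])
```

The `3 +`, `2 +`, `1 +` and bare forms of `4 * (...)` in the listing then read the four bytes of one 4-byte table entry, one byte at a time.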
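That is the main algorithm logic. Of course you cannot brute-force with it directly (over 4 hours), so it needs optimizing. The first piece of code below is an idea you might have [only might; it is (perhaps) quite laughable]: the textbook approach of solving it with z3.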
from z3 import *
bv16_sort = BitVecSort(16)
bv8_sort = BitVecSort(8)
table_ptr = Array('table_ptr', bv16_sort, bv8_sort)
def func_a(a, b, c, d):
    aa = ZeroExt(8, a)
    bb = ZeroExt(8, b)
    cc = ZeroExt(8, c)
    dd = ZeroExt(8, d)
    low = ((aa & 0xF) << 12) | ((bb & 0xF) << 8) | ((cc & 0xF) << 4) | (dd & 0xF)
    high = ((aa >> 4) << 12) | ((bb >> 4) << 8) | ((cc >> 4) << 4) | (dd >> 4)
    return (Select(table_ptr, high) << 4) | Select(table_ptr, low)
buff=open('C:\\Users\\n00bzx\\Desktop\\Devil.exe','rb').read()
unk_51E000 = 0x11ba00
unk_866000 = 0x463a00
unk_7F6000 = 0x3f3a00
unk_43D000 = 0x3aa00
for low_high in range(256):
    for low_low in range(256):
        low_nib = buff[unk_866000+0xea000 + 768 + low_low]
        high_nib = buff[unk_866000+0xea000 + 512 + low_high]
        low = (high_nib << 4) | low_nib
        table_ptr = Store(table_ptr, low_high * 256 + low_low, buff[unk_866000+0xea000 + 1280 + low])
sizes=13 * 4 * 16
first_tables=[0]*sizes
second_tables=[0]*sizes
for j in range(13):
    for k in range(4):
        idx = (j << 12) | (k << 10)
        table_base = 16 * (idx >> 10)
        for i in range(16):
            cur_idx = table_base + i
            first_tables[cur_idx] = Array('tables_first_%d' % cur_idx, bv8_sort, bv8_sort)
            second_tables[cur_idx] = Array('tables_second_%d' % cur_idx, bv8_sort, bv8_sort)
        for i in range(256):
            first_tables[table_base+0] = Store(first_tables[table_base+0], i, buff[unk_51E000+3 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+1] = Store(first_tables[table_base+1], i, buff[unk_51E000+3 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+2] = Store(first_tables[table_base+2], i, buff[unk_51E000+3 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+3] = Store(first_tables[table_base+3], i, buff[unk_51E000+3 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+4] = Store(first_tables[table_base+4], i, buff[unk_51E000+2 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+5] = Store(first_tables[table_base+5], i, buff[unk_51E000+2 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+6] = Store(first_tables[table_base+6], i, buff[unk_51E000+2 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+7] = Store(first_tables[table_base+7], i, buff[unk_51E000+2 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+8] = Store(first_tables[table_base+8], i, buff[unk_51E000+1 + 4 * (0x8f000 + idx + i)])
            first_tables[table_base+9] = Store(first_tables[table_base+9], i, buff[unk_51E000+1 + 4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+10] = Store(first_tables[table_base+10], i, buff[unk_51E000+1 + 4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+11] = Store(first_tables[table_base+11], i, buff[unk_51E000+1 + 4 * (0x8f000 + 768 + idx + i)])
            first_tables[table_base+12] = Store(first_tables[table_base+12], i, buff[unk_51E000+4 * (0x8f000 + idx + i)])
            first_tables[table_base+13] = Store(first_tables[table_base+13], i, buff[unk_51E000+4 * (0x8f000 + 256 + idx + i)])
            first_tables[table_base+14] = Store(first_tables[table_base+14], i, buff[unk_51E000+4 * (0x8f000 + 512 + idx + i)])
            first_tables[table_base+15] = Store(first_tables[table_base+15], i, buff[unk_51E000+4 * (0x8f000 + 768 + idx + i)])
            second_tables[table_base+0] = Store(second_tables[table_base+0], i, buff[unk_7F6000+3 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+1] = Store(second_tables[table_base+1], i, buff[unk_7F6000+3 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+2] = Store(second_tables[table_base+2], i, buff[unk_7F6000+3 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+3] = Store(second_tables[table_base+3], i, buff[unk_7F6000+3 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+4] = Store(second_tables[table_base+4], i, buff[unk_7F6000+2 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+5] = Store(second_tables[table_base+5], i, buff[unk_7F6000+2 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+6] = Store(second_tables[table_base+6], i, buff[unk_7F6000+2 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+7] = Store(second_tables[table_base+7], i, buff[unk_7F6000+2 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+8] = Store(second_tables[table_base+8], i, buff[unk_7F6000+1 + 4 * (0xe000 + idx + i)])
            second_tables[table_base+9] = Store(second_tables[table_base+9], i, buff[unk_7F6000+1 + 4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+10] = Store(second_tables[table_base+10], i, buff[unk_7F6000+1 + 4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+11] = Store(second_tables[table_base+11], i, buff[unk_7F6000+1 + 4 * (0xe000 + 768 + idx + i)])
            second_tables[table_base+12] = Store(second_tables[table_base+12], i, buff[unk_7F6000+4 * (0xe000 + idx + i)])
            second_tables[table_base+13] = Store(second_tables[table_base+13], i, buff[unk_7F6000+4 * (0xe000 + 256 + idx + i)])
            second_tables[table_base+14] = Store(second_tables[table_base+14], i, buff[unk_7F6000+4 * (0xe000 + 512 + idx + i)])
            second_tables[table_base+15] = Store(second_tables[table_base+15], i, buff[unk_7F6000+4 * (0xe000 + 768 + idx + i)])
def trans(j, k, a, b, c, d):
    table_base = (j << 6) | (k << 4)
    a_a = Select(first_tables[table_base+0], a)
    a_b = Select(first_tables[table_base+4], a)
    a_c = Select(first_tables[table_base+8], a)
    a_d = Select(first_tables[table_base+12], a)
    b_a = Select(first_tables[table_base+1], b)
    b_b = Select(first_tables[table_base+5], b)
    b_c = Select(first_tables[table_base+9], b)
    b_d = Select(first_tables[table_base+13], b)
    c_a = Select(first_tables[table_base+2], c)
    c_b = Select(first_tables[table_base+6], c)
    c_c = Select(first_tables[table_base+10], c)
    c_d = Select(first_tables[table_base+14], c)
    d_a = Select(first_tables[table_base+3], d)
    d_b = Select(first_tables[table_base+7], d)
    d_c = Select(first_tables[table_base+11], d)
    d_d = Select(first_tables[table_base+15], d)
    a = func_a(a_a, b_a, c_a, d_a)
    b = func_a(a_b, b_b, c_b, d_b)
    c = func_a(a_c, b_c, c_c, d_c)
    d = func_a(a_d, b_d, c_d, d_d)
    a_a = Select(second_tables[table_base+0], a)
    a_b = Select(second_tables[table_base+4], a)
    a_c = Select(second_tables[table_base+8], a)
    a_d = Select(second_tables[table_base+12], a)
    b_a = Select(second_tables[table_base+1], b)
    b_b = Select(second_tables[table_base+5], b)
    b_c = Select(second_tables[table_base+9], b)
    b_d = Select(second_tables[table_base+13], b)
    c_a = Select(second_tables[table_base+2], c)
    c_b = Select(second_tables[table_base+6], c)
    c_c = Select(second_tables[table_base+10], c)
    c_d = Select(second_tables[table_base+14], c)
    d_a = Select(second_tables[table_base+3], d)
    d_b = Select(second_tables[table_base+7], d)
    d_c = Select(second_tables[table_base+11], d)
    d_d = Select(second_tables[table_base+15], d)
    a = func_a(a_a, b_a, c_a, d_a)
    b = func_a(a_b, b_b, c_b, d_b)
    c = func_a(a_c, b_c, c_c, d_c)
    d = func_a(a_d, b_d, c_d, d_d)
    return [a, b, c, d]
dst=[0xa7,0xe4,0x8f,0x9c]
a=BitVec('a',8)
b=BitVec('b',8)
c=BitVec('c',8)
d=BitVec('d',8)
ret=trans(0,0,a,b,c,d)
solver=Solver()
for i in range(4):
    solver.add(dst[i] == ret[i])
if solver.check() == sat:
    print(solver.model())
else:
    print("nope")
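The table_ptr loop in the z3 script above folds the nested nibble lookups of sub_401270 into a single 65536-entry table indexed by four packed nibbles. A minimal plain-Python sketch of that folding follows, with no z3 involved. The helper names build_flat_table and pack_nibbles are hypothetical, and xor_table is a made-up stand-in (each entry is the XOR of its two index nibbles); it is not the data in Devil.exe. The sketch also assumes every table entry is itself a nibble, so that `16 * x + y` and `(x << 4) | y` agree.

```python
def build_flat_table(t_left, t_right, t_out):
    """Fold three 256-entry nibble tables into one 65536-entry table.

    flat[(a << 12) | (b << 8) | (c << 4) | d]
        == t_out[16 * t_left[16 * a + b] + t_right[16 * c + d]]
    """
    flat = bytearray(65536)
    for left in range(256):          # left  = 16 * a + b
        for right in range(256):     # right = 16 * c + d
            flat[left * 256 + right] = t_out[16 * t_left[left] + t_right[right]]
    return flat


def pack_nibbles(a, b, c, d):
    """Build the 16-bit index the same way func_a builds low/high."""
    return (a << 12) | (b << 8) | (c << 4) | d


# Stand-in data (assumption): entry 16*x + y holds x ^ y.
xor_table = bytes((i >> 4) ^ (i & 0xF) for i in range(256))
flat = build_flat_table(xor_table, xor_table, xor_table)

# With XOR stand-ins the folded table returns the XOR of all four nibbles.
print(flat[pack_nibbles(0x1, 0x2, 0x3, 0x4)])
```

With concrete tables like this a lookup is a single array access; the z3 version has to carry the same 65536 entries as a chain of Store terms.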
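Of course, I wrote this code after the fact, because while solving I ruled this idea out immediately; I only want to tell anyone greener than me that it does not work. As you can see, the logic is all table lookups (and still is, however much you optimize). z3 is good at solving linear constraints (linear programming and the like; that is all I had learned by the end of high school, and I should not make claims about things I do not understand), and table lookups are clearly not that... so this approach blows up memory, you know how it goes... What? C is more efficient than Python, you say? Fine, I wrote it in C as well; take a look.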
#include<windows.h>
#include<stdio.h>
#include"z3.h"
Z3_ast table_ptr_z3 = { 0 };
Z3_ast first_tables_z3[13 * 4 * 16];
Z3_ast second_tables_z3[13 * 4 * 16];
Z3_sort bv8_sort, bv16_sort, array_sort_8, array_sort_16;
static void gen_z3_table_ptr(Z3_context* ctx)
{
char tmp[32] = { 0 };
DWORD i, j, k, m;
BYTE table_ptr[65536];
BYTE* first_tables = 0;
BYTE* second_tables = 0;
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
BYTE* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
ReadFile(file, buff, 0xb00000, &hhh, NULL);
CloseHandle(file);
BYTE* unk_51E000 = buff + 0x11ba00;
BYTE* unk_866000 = buff + 0x463a00;
BYTE* unk_7F6000 = buff + 0x3f3a00;
BYTE* unk_43D000 = buff + 0x3aa00;
DWORD low_high, low_low;
for (low_high = 0; low_high < 256; low_high++)
{
for (low_low = 0; low_low < 256; low_low++)
{
BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
BYTE low = (high_nib << 4) | low_nib;
table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
}
}
first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
for (j = 0; j < 13; j++)
{
for (k = 0; k < 4; k++)
{
DWORD idx = (j << 12) | (k << 10);
DWORD table_base = 256 * 16 * (idx >> 10);
for (i = 0; i < 256; i++)
{
BYTE* tables = first_tables + table_base;
tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];
tables = second_tables + table_base;
tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
}
}
}
VirtualFree(buff, 0, MEM_RELEASE);
table_ptr_z3 = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, "table_ptr"), array_sort_16);
for (i = 0; i < 65536; i++)
{
Z3_ast index = Z3_mk_int(*ctx, i, bv16_sort);
Z3_ast elem = Z3_mk_int(*ctx, table_ptr[i], bv8_sort);
table_ptr_z3 = Z3_mk_store(*ctx, table_ptr_z3, index, elem);
}
for (j = 0; j < 13; j++)
{
for (k = 0; k < 4; k++)
{
DWORD idx = (j << 12) | (k << 10);
DWORD table_base_z3 = 16 * (idx >> 10);
DWORD table_base_orig = 256 * 16 * (idx >> 10);
Z3_ast* tables_z3_first = first_tables_z3 + table_base_z3;
BYTE* tables_orig_first = first_tables + table_base_orig;
Z3_ast* tables_z3_second = second_tables_z3 + table_base_z3;
BYTE* tables_orig_second = second_tables + table_base_orig;
for (i = 0; i < 16; i++)
{
DWORD idx_in = table_base_z3 + i;
snprintf(tmp, sizeof(tmp), "table_first_%d", idx_in);
tables_z3_first[i] = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, tmp), array_sort_8);
snprintf(tmp, sizeof(tmp), "table_second_%d", idx_in);
tables_z3_second[i] = Z3_mk_const(*ctx, Z3_mk_string_symbol(*ctx, tmp), array_sort_8);
}
for (i = 0; i < 256; i++)
{
Z3_ast index = Z3_mk_int(*ctx, i, bv8_sort);
for (m = 0; m < 16; m++)
{
tables_z3_first[m] = Z3_mk_store(*ctx, tables_z3_first[m], index, Z3_mk_int(*ctx, tables_orig_first[256 * m + i], bv8_sort));
tables_z3_second[m] = Z3_mk_store(*ctx, tables_z3_second[m], index, Z3_mk_int(*ctx, tables_orig_second[256 * m + i], bv8_sort));
}
}
}
}
VirtualFree(second_tables, 0, MEM_RELEASE);
VirtualFree(first_tables, 0, MEM_RELEASE);
}
static Z3_ast func_a(Z3_context* ctx, Z3_ast a, Z3_ast b, Z3_ast c, Z3_ast d)
{
a = Z3_mk_sign_ext(*ctx, 8, a);
b = Z3_mk_sign_ext(*ctx, 8, b);
c = Z3_mk_sign_ext(*ctx, 8, c);
d = Z3_mk_sign_ext(*ctx, 8, d);
Z3_ast mask = Z3_mk_int(*ctx, 0xF, bv16_sort);
Z3_ast a_low = Z3_mk_bvand(*ctx, a, mask);
Z3_ast b_low = Z3_mk_bvand(*ctx, b, mask);
Z3_ast c_low = Z3_mk_bvand(*ctx, c, mask);
Z3_ast d_low = Z3_mk_bvand(*ctx, d, mask);
Z3_ast a_low_shifted = Z3_mk_bvshl(*ctx, a_low, Z3_mk_int(*ctx, 12, bv16_sort));
Z3_ast b_low_shifted = Z3_mk_bvshl(*ctx, b_low, Z3_mk_int(*ctx, 8, bv16_sort));
Z3_ast c_low_shifted = Z3_mk_bvshl(*ctx, c_low, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast low = Z3_mk_bvor(*ctx, Z3_mk_bvor(*ctx, a_low_shifted, b_low_shifted), Z3_mk_bvor(*ctx, c_low_shifted, d_low));
Z3_ast a_high = Z3_mk_bvlshr(*ctx, a, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast b_high = Z3_mk_bvlshr(*ctx, b, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast c_high = Z3_mk_bvlshr(*ctx, c, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast d_high = Z3_mk_bvlshr(*ctx, d, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast a_high_shifted = Z3_mk_bvshl(*ctx, a_high, Z3_mk_int(*ctx, 12, bv16_sort));
Z3_ast b_high_shifted = Z3_mk_bvshl(*ctx, b_high, Z3_mk_int(*ctx, 8, bv16_sort));
Z3_ast c_high_shifted = Z3_mk_bvshl(*ctx, c_high, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast high = Z3_mk_bvor(*ctx, Z3_mk_bvor(*ctx, a_high_shifted, b_high_shifted), Z3_mk_bvor(*ctx, c_high_shifted, d_high));
Z3_ast high_value = Z3_mk_select(*ctx, table_ptr_z3, high);
Z3_ast low_value = Z3_mk_select(*ctx, table_ptr_z3, low);
high_value = Z3_mk_sign_ext(*ctx, 8, high_value);
low_value = Z3_mk_sign_ext(*ctx, 8, low_value);
Z3_ast high_shifted = Z3_mk_bvshl(*ctx, high_value, Z3_mk_int(*ctx, 4, bv16_sort));
Z3_ast result = Z3_mk_bvor(*ctx, high_shifted, low_value);
return Z3_mk_extract(*ctx, 7, 0, result);
}
static void trans(Z3_context* ctx,DWORD j, DWORD k, Z3_ast* a, Z3_ast* b, Z3_ast* c, Z3_ast* d)
{
DWORD table_base = (j << 6) | (k << 4);
Z3_ast* tables = first_tables_z3 + table_base;
Z3_ast a_a = Z3_mk_select(*ctx, tables[0], *a);
Z3_ast a_b = Z3_mk_select(*ctx, tables[4], *a);
Z3_ast a_c = Z3_mk_select(*ctx, tables[8], *a);
Z3_ast a_d = Z3_mk_select(*ctx, tables[12], *a);
Z3_ast b_a = Z3_mk_select(*ctx, tables[1], *b);
Z3_ast b_b = Z3_mk_select(*ctx, tables[5], *b);
Z3_ast b_c = Z3_mk_select(*ctx, tables[9], *b);
Z3_ast b_d = Z3_mk_select(*ctx, tables[13], *b);
Z3_ast c_a = Z3_mk_select(*ctx, tables[2], *c);
Z3_ast c_b = Z3_mk_select(*ctx, tables[6], *c);
Z3_ast c_c = Z3_mk_select(*ctx, tables[10], *c);
Z3_ast c_d = Z3_mk_select(*ctx, tables[14], *c);
Z3_ast d_a = Z3_mk_select(*ctx, tables[3], *d);
Z3_ast d_b = Z3_mk_select(*ctx, tables[7], *d);
Z3_ast d_c = Z3_mk_select(*ctx, tables[11], *d);
Z3_ast d_d = Z3_mk_select(*ctx, tables[15], *d);
*a = func_a(ctx, a_a, b_a, c_a, d_a);
*b = func_a(ctx, a_b, b_b, c_b, d_b);
*c = func_a(ctx, a_c, b_c, c_c, d_c);
*d = func_a(ctx, a_d, b_d, c_d, d_d);
tables = second_tables_z3 + table_base;
a_a = Z3_mk_select(*ctx, tables[0], *a);
a_b = Z3_mk_select(*ctx, tables[4], *a);
a_c = Z3_mk_select(*ctx, tables[8], *a);
a_d = Z3_mk_select(*ctx, tables[12], *a);
b_a = Z3_mk_select(*ctx, tables[1], *b);
b_b = Z3_mk_select(*ctx, tables[5], *b);
b_c = Z3_mk_select(*ctx, tables[9], *b);
b_d = Z3_mk_select(*ctx, tables[13], *b);
c_a = Z3_mk_select(*ctx, tables[2], *c);
c_b = Z3_mk_select(*ctx, tables[6], *c);
c_c = Z3_mk_select(*ctx, tables[10], *c);
c_d = Z3_mk_select(*ctx, tables[14], *c);
d_a = Z3_mk_select(*ctx, tables[3], *d);
d_b = Z3_mk_select(*ctx, tables[7], *d);
d_c = Z3_mk_select(*ctx, tables[11], *d);
d_d = Z3_mk_select(*ctx, tables[15], *d);
*a = func_a(ctx, a_a, b_a, c_a, d_a);
*b = func_a(ctx, a_b, b_b, c_b, d_b);
*c = func_a(ctx, a_c, b_c, c_c, d_c);
*d = func_a(ctx, a_d, b_d, c_d, d_d);
}
int main()
{
Z3_config cfg = Z3_mk_config();
Z3_context ctx = Z3_mk_context(cfg);
Z3_del_config(cfg);
bv16_sort = Z3_mk_bv_sort(ctx, 16);
bv8_sort = Z3_mk_bv_sort(ctx, 8);
array_sort_16 = Z3_mk_array_sort(ctx, bv16_sort, bv8_sort);
array_sort_8 = Z3_mk_array_sort(ctx, bv8_sort, bv8_sort);
gen_z3_table_ptr(&ctx);
Z3_ast a = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "a"), bv8_sort);
Z3_ast b = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "b"), bv8_sort);
Z3_ast c = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "c"), bv8_sort);
Z3_ast d = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "d"), bv8_sort);
trans(&ctx, 0, 0, &a, &b, &c, &d);
Z3_ast eqa = Z3_mk_eq(ctx, a, Z3_mk_int(ctx, 0xa7, bv8_sort));
Z3_ast eqb = Z3_mk_eq(ctx, b, Z3_mk_int(ctx, 0xe4, bv8_sort));
Z3_ast eqc = Z3_mk_eq(ctx, c, Z3_mk_int(ctx, 0x8f, bv8_sort));
Z3_ast eqd = Z3_mk_eq(ctx, d, Z3_mk_int(ctx, 0x9c, bv8_sort));
Z3_ast eqs[] = { eqa,eqb,eqc,eqd };
Z3_ast eq = Z3_mk_and(ctx, 4, eqs);
Z3_solver solver = Z3_mk_solver(ctx);
Z3_solver_assert(ctx, solver, eq);
Z3_lbool result = Z3_solver_check(ctx, solver);
if (result == Z3_L_TRUE)
{
DWORD64 a_val_u64, b_val_u64, c_val_u64, d_val_u64;
BYTE a_val, b_val, c_val, d_val;
Z3_model model = Z3_solver_get_model(ctx, solver);
Z3_func_decl a_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, a));
Z3_ast a_result = Z3_model_get_const_interp(ctx, model, a_decl);
Z3_get_numeral_uint64(ctx, a_result, &a_val_u64);
Z3_func_decl b_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, b));
Z3_ast b_result = Z3_model_get_const_interp(ctx, model, b_decl);
Z3_get_numeral_uint64(ctx, b_result, &b_val_u64);
Z3_func_decl c_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, c));
Z3_ast c_result = Z3_model_get_const_interp(ctx, model, c_decl);
Z3_get_numeral_uint64(ctx, c_result, &c_val_u64);
Z3_func_decl d_decl = Z3_get_app_decl(ctx, Z3_to_app(ctx, d));
Z3_ast d_result = Z3_model_get_const_interp(ctx, model, d_decl);
Z3_get_numeral_uint64(ctx, d_result, &d_val_u64);
a_val = (BYTE)a_val_u64;
b_val = (BYTE)b_val_u64;
c_val = (BYTE)c_val_u64;
d_val = (BYTE)d_val_u64;
printf("result: %02X%02X%02X%02X\n", d_val, c_val, b_val, a_val);
}
else if (result == Z3_L_FALSE)
{
printf("no result!\n");
}
else
{
printf("error!\n");
}
Z3_del_context(ctx);
return 0;
}
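A small rant: with certain build options z3 has some safety issues. For example, a counter kept in TLS gets deleted when the main thread exits, yet it is still accessed during global destruction... maybe I just enabled some odd option #(smirk). As for the result: memory blows up just the same, and the growth curve is identical #(smirk). The underlying API is the same, after all (isn't that obvious?). All of the above is only for readers greener than me; it is in fact completely useless...
Below is the first optimized version (some tables are precomputed and so on; it is basically still the same brute force, and it was also written after the fact).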
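Setting z3 aside, the byte shuffle in sub_4011A0 and its inverse are easy to sanity-check outside the binary. The index list is the same pattern as AES ShiftRows on a column-major 16-byte state. The sketch below derives the inverse index table and checks the round trip; the helper names invert_permutation, shuffle and unshuffle are hypothetical.

```python
# Index table copied from sub_4011A0.
V2 = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]


def invert_permutation(perm):
    """Return inv such that inv[perm[i]] == i for every i."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv


def shuffle(block, perm):
    """Gather form, as in sub_4011A0: out[i] = block[perm[i]]."""
    return bytes(block[p] for p in perm)


def unshuffle(block, perm):
    """Scatter form, as in the first sub_4011A0_inv: out[perm[i]] = block[i]."""
    out = bytearray(len(block))
    for i, p in enumerate(perm):
        out[p] = block[i]
    return bytes(out)


V2_INV = invert_permutation(V2)
block = bytes(range(16))

# Two equivalent ways to undo the shuffle.
assert unshuffle(shuffle(block, V2), V2) == block
assert shuffle(shuffle(block, V2), V2_INV) == block

# Scattering with the inverse table is the forward shuffle again.
assert unshuffle(block, V2_INV) == shuffle(block, V2)
print(V2_INV)
```

Either scattering with V2 or gathering with V2_INV undoes the shuffle. Doing both at once (scattering with the inverse table) applies the forward shuffle a second time, which is worth keeping in mind when reading any listing that carries a separate inverse table.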
#include<windows.h>
#include<immintrin.h>
BYTE* unk_51E000 = 0;
BYTE* unk_866000 = 0;
BYTE* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
BYTE unk_43D000_inv[256 * 16] = { 0 };
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16] = { 0 };
BYTE* table_ptr = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
/* printf replacement without the CRT: formats via ntdll's _vsnprintf, then writes to the console. */
void my_printf(const char* format, ...)
{
    int i;
    char buffer[1024];
    for (i = 0; i < 1024; i++)
    {
        buffer[i] = 0;
    }
    va_list args;
    va_start(args, format);
    PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
    ((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
    va_end(args); /* pairs with va_start above */
    DWORD bytes_written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void sub_4011A0(BYTE* a1)
{
int i;
char v4[16];
for (i = 0; i < 16; ++i)
{
v4[i] = a1[v2[i]];
}
memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
int i;
BYTE temp[16];
memcpy(temp, a1, 0x10);
for (i = 0; i < 16; ++i)
{
a1[i] = temp[v2_inv[i]]; // forward is new[i] = old[v2[i]], so old[p] = new[v2_inv[p]]
}
}
void translation(DWORD idx, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
DWORD aa = (DWORD)*a;
DWORD bb = (DWORD)*b;
DWORD cc = (DWORD)*c;
DWORD dd = (DWORD)*d;
BYTE v14_a = unk_51E000[3 + 4 * (0x8f000 + idx + aa)];
BYTE v12_a = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_a = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_a = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_b = unk_51E000[2 + 4 * (0x8f000 + idx + aa)];
BYTE v12_b = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_b = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_b = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_c = unk_51E000[1 + 4 * (0x8f000 + idx + aa)];
BYTE v12_c = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_c = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_c = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_d = unk_51E000[4 * (0x8f000 + idx + aa)];
BYTE v12_d = unk_51E000[4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_d = unk_51E000[4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_d = unk_51E000[4 * (0x8f000 + 768 + idx + dd)];
WORD low_a = ((v14_a & 0xF) << 12) | ((v12_a & 0xF) << 8) | ((v10_a & 0xF) << 4) | (v8_a & 0xF);
WORD high_a = ((v14_a >> 4) << 12) | ((v12_a >> 4) << 8) | ((v10_a >> 4) << 4) | (v8_a >> 4);
WORD low_b = ((v14_b & 0xF) << 12) | ((v12_b & 0xF) << 8) | ((v10_b & 0xF) << 4) | (v8_b & 0xF);
WORD high_b = ((v14_b >> 4) << 12) | ((v12_b >> 4) << 8) | ((v10_b >> 4) << 4) | (v8_b >> 4);
WORD low_c = ((v14_c & 0xF) << 12) | ((v12_c & 0xF) << 8) | ((v10_c & 0xF) << 4) | (v8_c & 0xF);
WORD high_c = ((v14_c >> 4) << 12) | ((v12_c >> 4) << 8) | ((v10_c >> 4) << 4) | (v8_c >> 4);
WORD low_d = ((v14_d & 0xF) << 12) | ((v12_d & 0xF) << 8) | ((v10_d & 0xF) << 4) | (v8_d & 0xF);
WORD high_d = ((v14_d >> 4) << 12) | ((v12_d >> 4) << 8) | ((v10_d >> 4) << 4) | (v8_d >> 4);
aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);
BYTE v15_a = unk_7F6000[3 + 4 * (0xe000 + idx + aa)];
BYTE v13_a = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_a = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_a = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_b = unk_7F6000[2 + 4 * (0xe000 + idx + aa)];
BYTE v13_b = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_b = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_b = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_c = unk_7F6000[1 + 4 * (0xe000 + idx + aa)];
BYTE v13_c = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_c = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_c = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_d = unk_7F6000[4 * (0xe000 + idx + aa)];
BYTE v13_d = unk_7F6000[4 * (0xe000 + 256 + idx + bb)];
BYTE v11_d = unk_7F6000[4 * (0xe000 + 512 + idx + cc)];
BYTE v9_d = unk_7F6000[4 * (0xe000 + 768 + idx + dd)];
low_a = ((v15_a & 0xF) << 12) | ((v13_a & 0xF) << 8) | ((v11_a & 0xF) << 4) | (v9_a & 0xF);
high_a = ((v15_a >> 4) << 12) | ((v13_a >> 4) << 8) | ((v11_a >> 4) << 4) | (v9_a >> 4);
low_b = ((v15_b & 0xF) << 12) | ((v13_b & 0xF) << 8) | ((v11_b & 0xF) << 4) | (v9_b & 0xF);
high_b = ((v15_b >> 4) << 12) | ((v13_b >> 4) << 8) | ((v11_b >> 4) << 4) | (v9_b >> 4);
low_c = ((v15_c & 0xF) << 12) | ((v13_c & 0xF) << 8) | ((v11_c & 0xF) << 4) | (v9_c & 0xF);
high_c = ((v15_c >> 4) << 12) | ((v13_c >> 4) << 8) | ((v11_c >> 4) << 4) | (v9_c >> 4);
low_d = ((v15_d & 0xF) << 12) | ((v13_d & 0xF) << 8) | ((v11_d & 0xF) << 4) | (v9_d & 0xF);
high_d = ((v15_d >> 4) << 12) | ((v13_d >> 4) << 8) | ((v11_d >> 4) << 4) | (v9_d >> 4);
*a = (table_ptr[high_a] << 4) | table_ptr[low_a];
*b = (table_ptr[high_b] << 4) | table_ptr[low_b];
*c = (table_ptr[high_c] << 4) | table_ptr[low_c];
*d = (table_ptr[high_d] << 4) | table_ptr[low_d];
}
void brute_force(DWORD idx,BYTE* va, BYTE* vb, BYTE* vc, BYTE* vd)
{
DWORD a, b, c, d;
__m128i indices;
WORD* indices_ptr = (WORD*)&indices;
for (a = 0; a < 256; a++)
{
for (b = 0; b < 256; b++)
{
for (c = 0; c < 256; c++)
{
for (d = 0; d < 256; d++)
{
DWORD aa = a;
DWORD bb = b;
DWORD cc = c;
DWORD dd = d;
BYTE v14_a = unk_51E000[3 + 4 * (0x8f000 + idx + aa)];
BYTE v12_a = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_a = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_a = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_b = unk_51E000[2 + 4 * (0x8f000 + idx + aa)];
BYTE v12_b = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_b = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_b = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_c = unk_51E000[1 + 4 * (0x8f000 + idx + aa)];
BYTE v12_c = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_c = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_c = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + dd)];
BYTE v14_d = unk_51E000[4 * (0x8f000 + idx + aa)];
BYTE v12_d = unk_51E000[4 * (0x8f000 + 256 + idx + bb)];
BYTE v10_d = unk_51E000[4 * (0x8f000 + 512 + idx + cc)];
BYTE v8_d = unk_51E000[4 * (0x8f000 + 768 + idx + dd)];
WORD low_a = ((v14_a & 0xF) << 12) | ((v12_a & 0xF) << 8) | ((v10_a & 0xF) << 4) | (v8_a & 0xF);
WORD high_a = ((v14_a >> 4) << 12) | ((v12_a >> 4) << 8) | ((v10_a >> 4) << 4) | (v8_a >> 4);
WORD low_b = ((v14_b & 0xF) << 12) | ((v12_b & 0xF) << 8) | ((v10_b & 0xF) << 4) | (v8_b & 0xF);
WORD high_b = ((v14_b >> 4) << 12) | ((v12_b >> 4) << 8) | ((v10_b >> 4) << 4) | (v8_b >> 4);
WORD low_c = ((v14_c & 0xF) << 12) | ((v12_c & 0xF) << 8) | ((v10_c & 0xF) << 4) | (v8_c & 0xF);
WORD high_c = ((v14_c >> 4) << 12) | ((v12_c >> 4) << 8) | ((v10_c >> 4) << 4) | (v8_c >> 4);
WORD low_d = ((v14_d & 0xF) << 12) | ((v12_d & 0xF) << 8) | ((v10_d & 0xF) << 4) | (v8_d & 0xF);
WORD high_d = ((v14_d >> 4) << 12) | ((v12_d >> 4) << 8) | ((v10_d >> 4) << 4) | (v8_d >> 4);
aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);
BYTE v15_a = unk_7F6000[3 + 4 * (0xe000 + idx + aa)];
BYTE v13_a = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_a = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_a = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_b = unk_7F6000[2 + 4 * (0xe000 + idx + aa)];
BYTE v13_b = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_b = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_b = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_c = unk_7F6000[1 + 4 * (0xe000 + idx + aa)];
BYTE v13_c = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + bb)];
BYTE v11_c = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + cc)];
BYTE v9_c = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + dd)];
BYTE v15_d = unk_7F6000[4 * (0xe000 + idx + aa)];
BYTE v13_d = unk_7F6000[4 * (0xe000 + 256 + idx + bb)];
BYTE v11_d = unk_7F6000[4 * (0xe000 + 512 + idx + cc)];
BYTE v9_d = unk_7F6000[4 * (0xe000 + 768 + idx + dd)];
low_a = ((v15_a & 0xF) << 12) | ((v13_a & 0xF) << 8) | ((v11_a & 0xF) << 4) | (v9_a & 0xF);
high_a = ((v15_a >> 4) << 12) | ((v13_a >> 4) << 8) | ((v11_a >> 4) << 4) | (v9_a >> 4);
low_b = ((v15_b & 0xF) << 12) | ((v13_b & 0xF) << 8) | ((v11_b & 0xF) << 4) | (v9_b & 0xF);
high_b = ((v15_b >> 4) << 12) | ((v13_b >> 4) << 8) | ((v11_b >> 4) << 4) | (v9_b >> 4);
low_c = ((v15_c & 0xF) << 12) | ((v13_c & 0xF) << 8) | ((v11_c & 0xF) << 4) | (v9_c & 0xF);
high_c = ((v15_c >> 4) << 12) | ((v13_c >> 4) << 8) | ((v11_c >> 4) << 4) | (v9_c >> 4);
low_d = ((v15_d & 0xF) << 12) | ((v13_d & 0xF) << 8) | ((v11_d & 0xF) << 4) | (v9_d & 0xF);
high_d = ((v15_d >> 4) << 12) | ((v13_d >> 4) << 8) | ((v11_d >> 4) << 4) | (v9_d >> 4);
aa = (DWORD)((table_ptr[high_a] << 4) | table_ptr[low_a]);
bb = (DWORD)((table_ptr[high_b] << 4) | table_ptr[low_b]);
cc = (DWORD)((table_ptr[high_c] << 4) | table_ptr[low_c]);
dd = (DWORD)((table_ptr[high_d] << 4) | table_ptr[low_d]);
if ((BYTE)aa == *va && (BYTE)bb == *vb && (BYTE)cc == *vc && (BYTE)dd == *vd)
{
*va = (BYTE)a;
*vb = (BYTE)b;
*vc = (BYTE)c;
*vd = (BYTE)d;
my_printf("%02X%02X%02X%02X\n", d, c, b, a);
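return; // stop at the first match: *va..*vd now hold the preimage, so continuing would compare against the wrong target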
}
}
}
}
}
}
void sub_401270(BYTE* input, BYTE* out)
{
int m, n, i, j, k;
for (i = 0; i < 16; ++i)
{
input[i] ^= consts[i];
}
for (j = 0; j < 13; ++j)
{
sub_4011A0(input);
for (k = 0; k < 4; ++k)
{
translation((j << 12) | (k << 10), input + k * 4, input + k * 4 + 1, input + k * 4 + 2, input + k * 4 + 3);
}
}
sub_4011A0(input);
for (m = 0; m < 16; ++m)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
input[m] = table[input[m]];
}
for (n = 0; n < 16; ++n)
{
out[n] = input[n];
}
}
void gen_tables()
{
int i, m;
int low_high, low_low;
for (low_high = 0; low_high < 256; ++low_high)
{
for (low_low = 0; low_low < 256; ++low_low)
{
if (low_low % 16 == 0)
{
my_printf("\n");
}
BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
BYTE low = (high_nib << 4) | low_nib;
table_ptr[low_high * 256 + low_low] = (DWORD)unk_866000[0xea000 + 1280 + (DWORD)low];
my_printf("%02X", table_ptr[low_high * 256 + low_low]);
}
my_printf("\n\n");
}
for (i = 0; i < 16; i++)
{
v2_inv[v2[i]] = i;
}
for (m = 0; m < 16; m++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
BYTE* table_inv = unk_43D000_inv + 256 * m;
for (i = 0; i < 256; i++)
{
table_inv[table[i]] = (BYTE)i;
}
}
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
int n, m, i, j, k;
for (n = 0; n < 16; ++n)
{
out[n] = input[n];
}
for (m = 0; m < 16; ++m)
{
BYTE* table_inv = unk_43D000_inv + 256 * m;
out[m] = table_inv[out[m]];
}
sub_4011A0_inv(out);
for (j = 12; j >= 0; --j) // undo the rounds in reverse order
{
for (k = 0; k < 4; ++k)
{
brute_force((j << 12) | (k << 10), out + k * 4, out + k * 4 + 1, out + k * 4 + 2, out + k * 4 + 3);
}
sub_4011A0_inv(out);
}
for (i = 0; i < 16; ++i)
{
out[i] ^= consts[i];
}
}
int main()
{
//int i;
void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
table_ptr = (BYTE*)VirtualAlloc(NULL, 65536 * 4, MEM_COMMIT, PAGE_READWRITE);
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
ReadFile(file, buff, 0xb00000, &hhh, NULL);
CloseHandle(file);
unk_51E000 = (BYTE*)buff + 0x11ba00;
unk_866000 = (BYTE*)buff + 0x463a00;
unk_7F6000 = (BYTE*)buff + 0x3f3a00;
unk_43D000 = (BYTE*)buff + 0x3aa00;
gen_tables();
BYTE input[] = { 0,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff };
BYTE out[16] = { 0 };
BYTE out2[16] = { 0 };
translation(0, input, input + 1, input + 2, input + 3);
//brute_force(0, input, input + 1, input + 2, input + 3);
/*sub_401270(input, out);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
}
my_printf("\n\n");
sub_401270_inv(out, out2);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out2[i]);
}
my_printf("\n\n");*/
VirtualFree(table_ptr, 0, MEM_RELEASE);
VirtualFree(buff, 0, MEM_RELEASE);
return 0xb19b00b5;
}
I run everything single-threaded, and that way it takes 4 hours.
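To give a sense of scale, here is a small counting sketch. It only restates the loop bounds from the code above (four nested 0..255 loops per `brute_force` call, and 13 rounds times 4 groups in `sub_401270_inv`). The function names are made up for illustration, and it says nothing about which part of the work the 4 hours covers.

```c
#include <stdint.h>

/* Candidates enumerated by one brute_force call: four nested 0..255 loops. */
uint64_t candidates_per_call(void)
{
    return 256ULL * 256ULL * 256ULL * 256ULL; /* 2^32 */
}

/* brute_force calls made while inverting one 16-byte block:
 * 13 rounds, 4 four-byte groups per round. */
uint64_t calls_per_block(void)
{
    return 13ULL * 4ULL;
}

/* Worst-case inner-loop evaluations for one 16-byte block. */
uint64_t total_evaluations(void)
{
    return candidates_per_call() * calls_per_block();
}
```

So a full inversion by plain enumeration is on the order of 2^32 evaluations of the two-stage lookup, 52 times over, which is why shaving the per-candidate cost matters so much here.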
It still needs the exe file on the local desktop. It does not need any libraries.
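The cheap parts of the inverse (the byte shuffle and the final per-byte tables) can be undone directly, with no search. A minimal sketch in portable C, assuming the final tables are bijections on 0..255 (the post's `gen_tables` relies on the same assumption). The names `shuffle`, `unshuffle` and `invert_table` are mine, not from the binary.

```c
#include <stdint.h>
#include <string.h>

/* Byte order used by sub_4011A0 in the post (a ShiftRows-style shuffle). */
static const uint8_t perm[16] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };

/* Forward shuffle: new[i] = old[perm[i]]. */
void shuffle(uint8_t *block)
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = block[perm[i]];
    memcpy(block, tmp, 16);
}

/* Inverse shuffle: put every byte back where the forward pass read it from. */
void unshuffle(uint8_t *block)
{
    uint8_t tmp[16];
    memcpy(tmp, block, 16);
    for (int i = 0; i < 16; i++)
        block[perm[i]] = tmp[i];
}

/* Invert a 256-entry byte table. Only valid when the table is a bijection;
 * otherwise some inv[] entries are never written. */
void invert_table(const uint8_t table[256], uint8_t inv[256])
{
    for (int i = 0; i < 256; i++)
        inv[table[i]] = (uint8_t)i;
}
```

Writing `block[perm[i]] = tmp[i]` avoids needing a separate inverse index array; a precomputed `perm_inv` with `block[i] = tmp[perm_inv[i]]` is equivalent.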
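So the noob started looking for a way out...
I thought: isn't SIMD supposed to be fast (that really was my thinking at the time)? It's parallel too, so why not write it with SIMD?
I didn't want to use cl, CUDA and the like; that would be too much like cheating. Don't reach for the GPU at every turn: if it can be done on the CPU, do it on the CPU. The way I see it, until I have figured out the mysteries of the CPU I am not going to spend my days playing with the GPU. Even when writing a ray tracer I want to squeeze the CPU to its limit, because that is what helps you learn!!! As long as learning is fun, that's all that matters!!!
So here comes the code (really written at that time, so don't expect it to compile as-is~~~):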
#include<windows.h>
#include<immintrin.h>
DWORD* unk_51E000 = 0;
BYTE* unk_866000 = 0;
DWORD* unk_7F6000 = 0;
BYTE* unk_43D000 = 0;
BYTE unk_43D000_inv[256 * 16] = { 0 };
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16] = { 0 };
DWORD* table_ptr = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
int i;
char buffer[1024];
for (i = 0; i < 1024; i++)
{
buffer[i] = 0;
}
va_list args;
va_start(args, format);
PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
DWORD bytes_written;
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void sub_4011A0(BYTE* a1)
{
int i;
char v4[16];
for (i = 0; i < 16; ++i)
{
v4[i] = a1[v2[i]];
}
memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
int i;
BYTE temp[16];
memcpy(temp, a1, 0x10);
for (i = 0; i < 16; ++i)
{
a1[i] = temp[v2_inv[i]]; // forward is new[i] = old[v2[i]], so old[p] = new[v2_inv[p]]
}
}
void trans_main(DWORD idx, DWORD* in_dw)//in effect: split -> transpose the 4*8 matrix into 8*4 -> recombine
{
__m128i arg_arr = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*in_dw));
__m128i init_indices_or = _mm_cvtepu16_epi32(_mm_cvtsi64_si128(0x0300020001000000LL));
__m128i init_indices = _mm_or_si128(arg_arr, init_indices_or);
__m128i input = _mm_i32gather_epi32(unk_51E000 + 0x8f000 + idx, init_indices, 4);//this is also a memory bottleneck, but the table lookup itself is unavoidable
__m128i mask = _mm_broadcastq_epi64(_mm_cvtsi64_si128(0x0f0f0f0f0f0f0f0fLL));
__m128i low = _mm_and_si128(input, mask);
__m128i high = _mm_and_si128(_mm_srli_epi16(input, 4), mask);//shift each high nibble down by 4; the mask drops the bits that crossed over from the neighboring byte
__m256i total = _mm256_or_si256(_mm256_cvtepu8_epi16(low), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high), 8));
__m128i indices_byte_hl_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0f070e060d050c04LL));
__m128i indices_byte_hl_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0b030a0209010800LL));
__m256i indices_byte_hl = _mm256_castsi128_si256(_mm_or_si128(_mm_slli_epi16(indices_byte_hl_odd, 8), indices_byte_hl_even));
__m256i indices_byte = _mm256_permute2x128_si256(indices_byte_hl, indices_byte_hl, 0x20);
__m256i indices_dword = _mm256_cvtepi8_epi32(_mm_cvtsi64_si128(0x0705030106040200LL));
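__m256i shuffled_dwords = _mm256_permutevar8x32_epi32(total, indices_dword);//first turn the transposed 4*8 into two 4*4 blocks for further reordering (AVX2 byte shuffles only work within each 128-bit lane)
__m256i shuffled_bytes = _mm256_shuffle_epi8(shuffled_dwords, indices_byte);//gather the 32 low-4-bit bytes, in order, into 16 high/low byte pairs (the matrix transpose)
__m256i shuffled_moved = _mm256_srli_epi16(shuffled_bytes, 4);//move the previous low nibble into the next high nibble to form a byte
__m256i combined_8_on_high = _mm256_or_si256(shuffled_bytes, shuffled_moved);//merge; each 16-bit lane now holds one byte (in its high part)
__m128i compressed = _mm256_cvtepi16_epi8(combined_8_on_high);//compress (note: this intrinsic needs AVX-512BW/VL, not just AVX2)
__m256i final_indices = _mm256_cvtepu16_epi32(compressed);
__m256i output_dwords = _mm256_i32gather_epi32(table_ptr, final_indices, 4);//the lowest 4 bits of each dword hold the result
__m256i output_high = _mm256_srli_epi64(output_dwords, 28);//shift the high part right by 28 bits so it lines up with the low part
__m256i combined_8_on_low = _mm256_or_si256(output_dwords, output_high);//merge with the original so the byte sits in the low bits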
__m128i reverse_dwords_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x030107050b090f0dLL));
__m128i reverse_dwords_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x020006040a080e0cLL));
__m128i reverse_dwords = _mm_or_si128(_mm_slli_epi16(reverse_dwords_odd, 8), reverse_dwords_even);//second round; most of the code repeats, so it is left uncommented
__m128i init_indices_2 = _mm_or_si128(_mm_shuffle_epi8(_mm256_cvtepi64_epi32(combined_8_on_low), reverse_dwords), init_indices_or);
__m128i input_2 = _mm_i32gather_epi32(unk_7F6000 + 0xe000 + idx, init_indices_2, 4);
__m128i low_2 = _mm_and_si128(input_2, mask);
__m128i high_2 = _mm_and_si128(_mm_srli_epi16(input_2, 4), mask);
__m256i total_2 = _mm256_or_si256(_mm256_cvtepu8_epi16(low_2), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high_2), 8));
__m256i shuffled_dwords_2 = _mm256_permutevar8x32_epi32(total_2, indices_dword);
__m256i shuffled_bytes_2 = _mm256_shuffle_epi8(shuffled_dwords_2, indices_byte);
__m256i shuffled_moved_2 = _mm256_srli_epi16(shuffled_bytes_2, 4);
__m256i combined_8_on_high_2 = _mm256_or_si256(shuffled_bytes_2, shuffled_moved_2);
__m128i compressed_2 = _mm256_cvtepi16_epi8(combined_8_on_high_2);
__m256i final_indices_2 = _mm256_cvtepu16_epi32(compressed_2);
__m256i output_dwords_2 = _mm256_i32gather_epi32(table_ptr, final_indices_2, 4);
__m256i output_high_2 = _mm256_srli_epi64(output_dwords_2, 28);
__m256i combined_8_on_low_2 = _mm256_or_si256(output_dwords_2, output_high_2);
*in_dw = _byteswap_ulong(_mm_cvtsi128_si32(_mm256_cvtepi64_epi8(combined_8_on_low_2)));
}
void brute_force(DWORD idx, DWORD* in_dw)
{
__m128i mask = _mm_broadcastq_epi64(_mm_cvtsi64_si128(0x0f0f0f0f0f0f0f0fLL));
__m128i init_indices_or = _mm_cvtepu16_epi32(_mm_cvtsi64_si128(0x0300020001000000LL));
__m128i indices_byte_hl_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0f070e060d050c04LL));
__m128i indices_byte_hl_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x0b030a0209010800LL));
__m256i indices_byte_hl = _mm256_castsi128_si256(_mm_or_si128(_mm_slli_epi16(indices_byte_hl_odd, 8), indices_byte_hl_even));
__m256i indices_byte = _mm256_permute2x128_si256(indices_byte_hl, indices_byte_hl, 0x20);
__m256i indices_dword = _mm256_cvtepi8_epi32(_mm_cvtsi64_si128(0x0705030106040200LL));
__m128i reverse_dwords_odd = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x030107050b090f0dLL));
__m128i reverse_dwords_even = _mm_cvtepu8_epi16(_mm_cvtsi64_si128(0x020006040a080e0cLL));
__m128i reverse_dwords = _mm_or_si128(_mm_slli_epi16(reverse_dwords_odd, 8), reverse_dwords_even);
DWORD64 val;
DWORD* curr_tab_a_1 = unk_51E000 + 0x8f000 + idx;
DWORD* curr_tab_a_2 = unk_7F6000 + 0xe000 + idx;
for (val = 0; val < 0x100000000; val++)
{
DWORD tmp = (DWORD)val;
__m128i arg_arr = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp));
__m128i init_indices = _mm_or_si128(arg_arr, init_indices_or);
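__m128i input = _mm_i32gather_epi32(curr_tab_a_1, init_indices, 4);//this is also a memory bottleneck, but the table lookup itself is unavoidable
__m128i low = _mm_and_si128(input, mask);
__m128i high = _mm_and_si128(_mm_srli_epi16(input, 4), mask);//shift each high nibble down by 4; the mask drops the bits that crossed over from the neighboring byte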
__m256i total = _mm256_or_si256(_mm256_cvtepu8_epi16(low), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high), 8));
__m256i shuffled_dwords = _mm256_permutevar8x32_epi32(total, indices_dword);//first turn the transposed 4*8 into two 4*4 blocks for further reordering (AVX2 byte shuffles only work within each 128-bit lane)
__m256i shuffled_bytes = _mm256_shuffle_epi8(shuffled_dwords, indices_byte);//gather the 32 low-4-bit bytes, in order, into 16 high/low byte pairs (the matrix transpose)
__m256i shuffled_moved = _mm256_srli_epi16(shuffled_bytes, 4);//move the previous low nibble into the next high nibble to form a byte
__m256i combined_8_on_high = _mm256_or_si256(shuffled_bytes, shuffled_moved);//merge; each 16-bit lane now holds one byte (in its high part)
__m128i compressed = _mm256_cvtepi16_epi8(combined_8_on_high);//compress (note: this intrinsic needs AVX-512BW/VL, not just AVX2)
__m256i final_indices = _mm256_cvtepu16_epi32(compressed);
__m256i output_dwords = _mm256_i32gather_epi32(table_ptr, final_indices, 4);//the lowest 4 bits of each dword hold the result
__m256i output_high = _mm256_srli_epi64(output_dwords, 28);//shift the high part right by 28 bits so it lines up with the low part
__m256i combined_8_on_low = _mm256_or_si256(output_dwords, output_high);//merge with the original so the byte sits in the low bits
__m128i init_indices_2 = _mm_or_si128(_mm_shuffle_epi8(_mm256_cvtepi64_epi32(combined_8_on_low), reverse_dwords), init_indices_or);//second round; most of the code repeats, so it is left uncommented
__m128i input_2 = _mm_i32gather_epi32(curr_tab_a_2, init_indices_2, 4);
__m128i low_2 = _mm_and_si128(input_2, mask);
__m128i high_2 = _mm_and_si128(_mm_srli_epi16(input_2, 4), mask);
__m256i total_2 = _mm256_or_si256(_mm256_cvtepu8_epi16(low_2), _mm256_slli_epi16(_mm256_cvtepu8_epi16(high_2), 8));
__m256i shuffled_dwords_2 = _mm256_permutevar8x32_epi32(total_2, indices_dword);
__m256i shuffled_bytes_2 = _mm256_shuffle_epi8(shuffled_dwords_2, indices_byte);
__m256i shuffled_moved_2 = _mm256_srli_epi16(shuffled_bytes_2, 4);
__m256i combined_8_on_high_2 = _mm256_or_si256(shuffled_bytes_2, shuffled_moved_2);
__m128i compressed_2 = _mm256_cvtepi16_epi8(combined_8_on_high_2);
__m256i final_indices_2 = _mm256_cvtepu16_epi32(compressed_2);
__m256i output_dwords_2 = _mm256_i32gather_epi32(table_ptr, final_indices_2, 4);
__m256i output_high_2 = _mm256_srli_epi64(output_dwords_2, 28);
__m256i combined_8_on_low_2 = _mm256_or_si256(output_dwords_2, output_high_2);
tmp = _byteswap_ulong(_mm_cvtsi128_si32(_mm256_cvtepi64_epi8(combined_8_on_low_2)));
if (tmp == *in_dw)
{
my_printf("%08X\n", (DWORD)val); // %X expects a 32-bit argument; val is a DWORD64
*in_dw = (DWORD)val;
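return; // stop at the first match: *in_dw now holds the preimage, so continuing would compare against the wrong target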
}
}
}
void sub_401270(BYTE* input, BYTE* out)//not writing this one in assembly, that wouldn't look nice #(funny)
{
int n, m, i, j, k;
for (i = 0; i < 16; ++i)
{
input[i] ^= consts[i];
}
for (j = 0; j < 13; ++j)
{
sub_4011A0(input);
for (k = 0; k < 4; ++k)
{
trans_main((j << 12) | (k << 10), (DWORD*)input + k);
}
}
sub_4011A0(input);
for (m = 0; m < 16; ++m)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
input[m] = table[input[m]];
}
for (n = 0; n < 16; ++n)
{
out[n] = input[n];
}
}
void gen_tables()
{
int i, m;
int low_high, low_low;
for (low_high = 0; low_high < 256; ++low_high)
{
for (low_low = 0; low_low < 256; ++low_low)
{
BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
BYTE low = (high_nib << 4) | low_nib;
table_ptr[low_high * 256 + low_low] = (DWORD)unk_866000[0xea000 + 1280 + (DWORD)low];
}
}
for (i = 0; i < 16; i++)
{
v2_inv[v2[i]] = i;
}
for (m = 0; m < 16; m++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
BYTE* table_inv = unk_43D000_inv + 256 * m;
for (i = 0; i < 256; i++)
{
table_inv[table[i]] = (BYTE)i;
}
}
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
int n, m, i, j, k;
for (n = 0; n < 16; ++n)
{
out[n] = input[n];
}
for (m = 0; m < 16; ++m)
{
BYTE* table_inv = unk_43D000_inv + 256 * m;
out[m] = table_inv[out[m]];
}
sub_4011A0_inv(out);
for (j = 12; j >= 0; --j) // undo the rounds in reverse order
{
for (k = 0; k < 4; ++k)
{
brute_force((j << 12) | (k << 10), (DWORD*)out + k);
}
sub_4011A0_inv(out);
}
for (i = 0; i < 16; ++i)
{
out[i] ^= consts[i];
}
}
int main()
{
//int i;
void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
table_ptr = (DWORD*)VirtualAlloc(NULL, 65536 * 4, MEM_COMMIT, PAGE_READWRITE);
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
ReadFile(file, buff, 0xb00000, &hhh, NULL);
CloseHandle(file);
unk_51E000 = (DWORD*)((BYTE*)buff + 0x11ba00);
unk_866000 = (BYTE*)buff + 0x463a00;
unk_7F6000 = (DWORD*)((BYTE*)buff + 0x3f3a00);
unk_43D000 = (BYTE*)buff + 0x3aa00;
gen_tables();
BYTE input[] = { 0,0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff };
BYTE out[16] = { 0 };
BYTE out2[16] = { 0 };
trans_main(0, (DWORD*)input);
brute_force(0, (DWORD*)input);
/*sub_401270(input, out);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
}
my_printf("\n\n");
sub_401270_inv(out, out2);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out2[i]);
}
my_printf("\n\n");*/
VirtualFree(table_ptr, 0, MEM_RELEASE);
VirtualFree(buff, 0, MEM_RELEASE);
return 0xb19b00b5;
}
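I won't walk through the optimization approach; the comments are enough. Parts of it are just ad hoc, messy ideas that I can't organize into words... My writing still needs work. While I'm at it, thanks to my boss at the company for teaching me how to write articles... I bow down here (you can't stop me from worshipping the big shots in this spot!!!)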
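For readers who find the intrinsics hard to follow, here is a scalar sketch of the "split -> transpose -> recombine" step for a single output byte. It mirrors the `low_*` / `high_*` expressions in the earlier scalar `translation` routine. The function names are invented for this sketch, and `nib_table` stands in for the post's `table_ptr` (a 65536-entry table returning a 4-bit value).

```c
#include <stdint.h>

/* Collect the low nibbles of four bytes into one 16-bit index,
 * with the first byte's nibble in the top position. */
uint16_t pack_low_nibbles(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint16_t)(((a & 0xF) << 12) | ((b & 0xF) << 8) |
                      ((c & 0xF) << 4) | (d & 0xF));
}

/* Same thing for the high nibbles. */
uint16_t pack_high_nibbles(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint16_t)(((a >> 4) << 12) | ((b >> 4) << 8) |
                      ((c >> 4) << 4) | (d >> 4));
}

/* Two nibble lookups recombined into one output byte. */
uint8_t combine(const uint8_t *nib_table,
                uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint8_t hi = nib_table[pack_high_nibbles(a, b, c, d)];
    uint8_t lo = nib_table[pack_low_nibbles(a, b, c, d)];
    return (uint8_t)((hi << 4) | lo);
}
```

For example, the bytes 0x12, 0x34, 0x56, 0x78 give a low index of 0x2468 and a high index of 0x1357. The SIMD version does this for all four output bytes at once, which is where the transpose comes from.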
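But the SIMD version gave almost no speedup. It seems I still haven't grasped the mysteries of the CPU. In front of Intel and Microsoft I am still a clown. No wait, not even a clown, just a tiny shrimp #(funny)
My first rough guess is that the bottleneck is memory access. After all, a memory access is something like a few hundred times slower than a register access (roughly).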
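One way to act on that guess, sketched under assumptions: the scalar code reads a single byte out of each 32-bit table entry (`unk_51E000[3 + 4 * (...)]`), so 256 useful bytes are spread over 1024 bytes of memory. Copying each byte lane into its own packed 256-byte table shrinks the hot data by 4x (4 cache lines instead of 16, assuming 64-byte lines). The later `gen_tables` in this post fills `first_tables` in a similar shape. `repack_lane` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy byte `lane` (0..3) of each of 256 consecutive 32-bit entries into a
 * packed 256-byte table. `src` points at the first byte of the first entry.
 */
void repack_lane(const uint8_t *src, unsigned lane, uint8_t dst[256])
{
    for (size_t i = 0; i < 256; i++)
        dst[i] = src[4 * i + lane];
}
```

After repacking, a lookup becomes `dst[x]` instead of `src[4 * x + lane]`, and the same cache line serves 64 neighboring inputs instead of 16. Whether that is actually the limiting factor here is something only profiling can tell.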
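After more hard thinking, reading Intel's official documentation over and over, and carefully going through the instruction pseudocode and comparing latencies, I decided to give up on the SIMD optimization (a crude idea; the experts are surely laughing at my dumb code...)
My thinking went back to optimizing the table lookups. Right then my boss pressed me for a report. So my IRQL was immediately raised to DISPATCH_LEVEL and I ran off in a flash......
Two days later...
The optimized code came out...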
#include<windows.h>
#include<immintrin.h>
BYTE unk_43D000_inv[256 * 16];
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
const BYTE v2[] = { 0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11 };
BYTE v2_inv[16];
BYTE table_ptr[65536];
BYTE* first_tables = 0;
BYTE* second_tables = 0;
BYTE* first_tables_map = 0;
BYTE* second_tables_map = 0;
BYTE* first_tables_sorted = 0;
BYTE* second_tables_sorted = 0;
DWORD64* first_tables_indices = 0;
DWORD64* second_tables_indices = 0;
DWORD first_tables_sorted_len[13 * 4 * 16];
DWORD second_tables_sorted_len[13 * 4 * 16];
BYTE third_tables[16 * 256];
DWORD* map_keys = 0;
BYTE* map_vals = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
DWORD i;
char buffer[1024];
for (i = 0; i < 1024; i++)
{
buffer[i] = 0;
}
va_list args;
va_start(args, format);
PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
DWORD bytes_written;
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void junk_memcpy(void* dst, void* src, size_t sz)//only usable when dst is 16-byte aligned and sz is a multiple of 16
{
while (sz)
{
_mm_store_si128((__m128i*)((DWORD64)dst + sz - 16), _mm_loadu_si128((__m128i*)((DWORD64)src + sz - 16)));
sz -= 16;
}
}
void junk_memset(void* mem, BYTE val, size_t sz)//only usable when mem is 16-byte aligned and sz is a multiple of 16
{
while (sz)
{
_mm_store_si128((__m128i*)((DWORD64)mem + sz - 16), _mm_set1_epi8(val));
sz -= 16;
}
}
int partition(DWORD* arr, int low, int high)
{
int i = low, j = high;
DWORD p = arr[low];
while (i < j)
{
while (i<j && arr[j]>p)
{
j--;
}
if (i < j)
{
arr[i] ^= arr[j];
arr[j] ^= arr[i];
arr[i] ^= arr[j];
i++;
}
while (i < j && arr[i] <= p)
{
i++;
}
if (i < j)
{
arr[i] ^= arr[j];
arr[j] ^= arr[i];
arr[i] ^= arr[j];
j--;
}
}
arr[i] = p;
return i;
}
void _qsort(DWORD* arr, int low, int high)
{
if (low >= 0 && high >= 0 && low < high)
{
int mid = partition(arr, low, high);
_qsort(arr, low, mid - 1);
_qsort(arr, mid + 1, high);
}
}
#define qsort(arr,len) _qsort((arr),0,(len)-1)
void idsort(DWORD* arr, DWORD* ids, DWORD n)
{
int i, j;
for (i = 0; i < (int)n - 1; i++)
{
for (j = 0; j < (int)n - i - 1; j++)
{
if (arr[j] > arr[j + 1])
{
arr[j] ^= arr[j + 1];
arr[j + 1] ^= arr[j];
arr[j] ^= arr[j + 1];
ids[j] ^= ids[j + 1];
ids[j + 1] ^= ids[j];
ids[j] ^= ids[j + 1];
}
}
}
}
DWORD my_bsearch(DWORD* arr, DWORD len, DWORD target)
{
int l = 0, r = (int)len - 1;
DWORD ret = 0xffffffff;
while (l <= r)
{
int mid = l + ((r - l) >> 1);
if (arr[mid] == target)
{
ret = mid;
break;
}
else if (arr[mid] < target)
{
l = mid + 1;
}
else
{
r = mid - 1;
}
}
return ret;
}
void sub_4011A0(BYTE* a1)
{
DWORD i;
char v4[16];
for (i = 0; i < 16; i++)
{
v4[i] = a1[v2[i]];
}
junk_memcpy(a1, v4, 0x10);
}
void sub_4011A0_inv(BYTE* a1)
{
DWORD i;
BYTE temp[16];
junk_memcpy(temp, a1, 0x10);
for (i = 0; i < 16; i++)
{
a1[i] = temp[v2_inv[i]]; // forward is new[i] = old[v2[i]], so old[p] = new[v2_inv[p]]
}
}
DWORD de_trans_quad(BYTE* table0, DWORD* table1, DWORD64* table2, DWORD j, DWORD k, DWORD idx, BYTE in, BOOL inited)
{
DWORD s, t, u, v;
DWORD w, x, y, z;
DWORD total_len_low = 0, total_len_high = 0;
DWORD ptrs_2[4] = { 0 };
BYTE low = in & 0xf;
BYTE high = in >> 4;
DWORD mat_idx = 16 * ((j << 2) | k);
DWORD table_base = 256 * mat_idx;
DWORD table_mat_base = idx * 4 * 256;
BYTE* tables2 = table0 + table_base;
DWORD* tables3 = table1 + mat_idx;
DWORD64* tables4 = table2 + table_base;
BYTE* t_a = tables2 + (idx << 10);
BYTE* t_b = tables2 + ((idx << 10) | 0x100);
BYTE* t_c = tables2 + ((idx << 10) | 0x200);
BYTE* t_d = tables2 + ((idx << 10) | 0x300);
DWORD64* tt_a = tables4 + (idx << 10);
DWORD64* tt_b = tables4 + ((idx << 10) | 0x100);
DWORD64* tt_c = tables4 + ((idx << 10) | 0x200);
DWORD64* tt_d = tables4 + ((idx << 10) | 0x300);
DWORD sz_a = tables3[idx << 2];
DWORD sz_b = tables3[(idx << 2) | 1];
DWORD sz_c = tables3[(idx << 2) | 2];
DWORD sz_d = tables3[(idx << 2) | 3];
DWORD cnt_a = 256 / sz_a;
DWORD cnt_b = 256 / sz_b;
DWORD cnt_c = 256 / sz_c;
DWORD cnt_d = 256 / sz_d;
DWORD indices = 0;
for (w = 0; w < sz_a; w++)
{
for (x = 0; x < sz_b; x++)
{
for (y = 0; y < sz_c; y++)
{
for (z = 0; z < sz_d; z++)
{
BYTE a_a = t_a[w];
BYTE a_b = t_b[x];
BYTE a_c = t_c[y];
BYTE a_d = t_d[z];
WORD low_word = ((a_a & 0xF) << 12) | ((a_b & 0xF) << 8) | ((a_c & 0xF) << 4) | (a_d & 0xF);
WORD high_word = ((a_a >> 4) << 12) | ((a_b >> 4) << 8) | ((a_c >> 4) << 4) | (a_d >> 4);
if (low == table_ptr[low_word] && high == table_ptr[high_word])
{
DWORD64 arr_a_contains = tt_a[a_a];
DWORD64 arr_b_contains = tt_b[a_b];
DWORD64 arr_c_contains = tt_c[a_c];
DWORD64 arr_d_contains = tt_d[a_d];
BYTE* arr_a = (BYTE*)&arr_a_contains;
BYTE* arr_b = (BYTE*)&arr_b_contains;
BYTE* arr_c = (BYTE*)&arr_c_contains;
BYTE* arr_d = (BYTE*)&arr_d_contains;
for (s = 0; s < cnt_a; s++)
{
for (t = 0; t < cnt_b; t++)
{
for (u = 0; u < cnt_c; u++)
{
for (v = 0; v < cnt_d; v++)
{
BYTE aa = arr_a[s];
BYTE bb = arr_b[t];
BYTE cc = arr_c[u];
BYTE dd = arr_d[v];
DWORD hash_val = aa | (bb << 8) | (cc << 16) | (dd << 24);
if (!inited)
{
map_keys[indices++] = hash_val;
}
else
{
DWORD pos = my_bsearch(map_keys, 0x1000000, hash_val);
if (pos != 0xffffffff)
{
if (++map_vals[pos] == 4)
{
return pos;
}
}
}
}
}
}
}
}
}
}
}
}
if (!inited)
{
qsort(map_keys, 0x1000000);
junk_memset(map_vals, 1, 0x1000000);
}
return 0xffffffff;
}
DWORD de_trans(DWORD j, DWORD k,DWORD* in)
{
DWORD ret = 0xffffffff;
DWORD i, m;
DWORD mat_idx = 16 * ((j << 2) | k);
DWORD table_base = 256 * mat_idx;
BYTE* table_sorted = second_tables_sorted + table_base;
DWORD* tables_lens = second_tables_sorted_len + mat_idx;
DWORD idxs[4] = { 0,1,2,3 };
DWORD sizes[4] = { 0 };
BOOL inited = FALSE;
for (i = 0; i < 4; i++)
{
DWORD sum = 1;
for (m = 0; m < 4; m++)
{
sum *= tables_lens[(i << 2) | m];
}
sizes[i] = sum;
}
idsort(sizes, idxs, 4);
for (i = 0; i < 4; i++)
{
int cur_idx = idxs[i];
BYTE input = (BYTE)(*in >> (cur_idx << 3));
ret = de_trans_quad(second_tables_sorted, second_tables_sorted_len, second_tables_indices, j, k, cur_idx, input, inited);
inited = TRUE;
}
return map_keys[ret];
}
void trans(DWORD j, DWORD k, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
DWORD idx = 16 * ((j << 2) | k);
DWORD table_base = 256 * idx;
WORD low, high;
DWORD aa = (DWORD)*a;
DWORD bb = (DWORD)*b;
DWORD cc = (DWORD)*c;
DWORD dd = (DWORD)*d;
BYTE* tables = first_tables + table_base;
BYTE a_a = tables[256 * 0 + aa];
BYTE a_b = tables[256 * 4 + aa];
BYTE a_c = tables[256 * 8 + aa];
BYTE a_d = tables[256 * 12 + aa];
BYTE b_a = tables[256 * 1 + bb];
BYTE b_b = tables[256 * 5 + bb];
BYTE b_c = tables[256 * 9 + bb];
BYTE b_d = tables[256 * 13 + bb];
BYTE c_a = tables[256 * 2 + cc];
BYTE c_b = tables[256 * 6 + cc];
BYTE c_c = tables[256 * 10 + cc];
BYTE c_d = tables[256 * 14 + cc];
BYTE d_a = tables[256 * 3 + dd];
BYTE d_b = tables[256 * 7 + dd];
BYTE d_c = tables[256 * 11 + dd];
BYTE d_d = tables[256 * 15 + dd];
low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
aa = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
bb = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
cc = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
dd = (table_ptr[high] << 4) | table_ptr[low];
my_printf("%02X %02X %02X %02X\n", aa, bb, cc, dd);
tables = second_tables + table_base;
a_a = tables[256 * 0 + aa];
a_b = tables[256 * 4 + aa];
a_c = tables[256 * 8 + aa];
a_d = tables[256 * 12 + aa];
b_a = tables[256 * 1 + bb];
b_b = tables[256 * 5 + bb];
b_c = tables[256 * 9 + bb];
b_d = tables[256 * 13 + bb];
c_a = tables[256 * 2 + cc];
c_b = tables[256 * 6 + cc];
c_c = tables[256 * 10 + cc];
c_d = tables[256 * 14 + cc];
d_a = tables[256 * 3 + dd];
d_b = tables[256 * 7 + dd];
d_c = tables[256 * 11 + dd];
d_d = tables[256 * 15 + dd];
low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
*a = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
*b = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
*c = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
*d = (table_ptr[high] << 4) | table_ptr[low];
}
void sub_401270(BYTE* input, BYTE* out)
{
DWORD m, n, i, j, k;
for (i = 0; i < 16; i++)
{
input[i] ^= consts[i];
}
for (j = 0; j < 13; j++)
{
sub_4011A0(input);
for (k = 0; k < 4; k++)
{
trans(j, k, input + k * 4, input + k * 4 + 1, input + k * 4 + 2, input + k * 4 + 3);
}
}
sub_4011A0(input);
for (m = 0; m < 16; m++)
{
input[m] = third_tables[(m << 8) | input[m]];
}
for (n = 0; n < 16; n++)
{
out[n] = input[n];
}
}
void gen_tables(BYTE* buff)
{
BYTE* unk_51E000 = buff + 0x11ba00;
BYTE* unk_866000 = buff + 0x463a00;
BYTE* unk_7F6000 = buff + 0x3f3a00;
BYTE* unk_43D000 = buff + 0x3aa00;
DWORD i, j, k, m;
DWORD low_high, low_low;
for (low_high = 0; low_high < 256; low_high++)
{
for (low_low = 0; low_low < 256; low_low++)
{
BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
BYTE low = (high_nib << 4) | low_nib;
table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
}
}
for (i = 0; i < 16; i++)
{
v2_inv[v2[i]] = (BYTE)i;
}
for (m = 0; m < 16; m++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
BYTE* table_inv = unk_43D000_inv + 256 * m;
for (i = 0; i < 256; i++)
{
table_inv[table[i]] = (BYTE)i;
}
}
for (j = 0; j < 13; j++)
{
for (k = 0; k < 4; k++)
{
DWORD idx = (j << 12) | (k << 10);
DWORD table_base = 256 * 16 * (idx >> 10);
for (i = 0; i < 256; i++)
{
BYTE* tables = first_tables + table_base;
tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];
tables = second_tables + table_base;
tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
}
}
}
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
{
BYTE* ptr_first_orig = first_tables + m;
BYTE* ptr_second_orig = second_tables + m;
BYTE* ptr_first_dst = first_tables_map + m;
BYTE* ptr_second_dst = second_tables_map + m;
for (i = 0; i < 256; i++)
{
ptr_first_dst[ptr_first_orig[i]] = 1;
ptr_second_dst[ptr_second_orig[i]] = 1;
}
}
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
{
BYTE* ptr_first_orig = first_tables_map + m;
BYTE* ptr_second_orig = second_tables_map + m;
BYTE* ptr_first_dst = first_tables_sorted + m;
BYTE* ptr_second_dst = second_tables_sorted + m;
int len_indice = m >> 8;
first_tables_sorted_len[len_indice] = 0;
second_tables_sorted_len[len_indice] = 0;
for (i = 0; i < 256; i++)
{
if (ptr_first_orig[i])
{
ptr_first_dst[first_tables_sorted_len[len_indice]++] = (BYTE)i;
}
if (ptr_second_orig[i])
{
ptr_second_dst[second_tables_sorted_len[len_indice]++] = (BYTE)i;
}
}
}
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)// bucket by output value: each 8-byte slot lists the input indices that map to that value
{
BYTE tmp_indices_first[256];
BYTE tmp_indices_second[256];
junk_memset(tmp_indices_first, 0, 256);
junk_memset(tmp_indices_second, 0, 256);
BYTE* ptr_first_orig = first_tables + m;
BYTE* ptr_second_orig = second_tables + m;
DWORD64* ptr_first_dst = first_tables_indices + m;
DWORD64* ptr_second_dst = second_tables_indices + m;
for (i = 0; i < 256; i++)
{
DWORD first = (DWORD)ptr_first_orig[i];
DWORD second = (DWORD)ptr_second_orig[i];
BYTE* arr_first = (BYTE*)(ptr_first_dst + first);
BYTE* arr_second = (BYTE*)(ptr_second_dst + second);
arr_first[(DWORD)(tmp_indices_first[first]++)] = (BYTE)i;
arr_second[(DWORD)(tmp_indices_second[second]++)] = (BYTE)i;
}
}
for (m = 0; m < 16; m++)
{
for (i = 0; i < 256; i++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
third_tables[256 * m + i] = table[i];
}
}
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
DWORD n, m, i, j, k;
for (n = 0; n < 16; n++)
{
out[n] = input[n];
}
for (m = 0; m < 16; m++)
{
BYTE* table_inv = unk_43D000_inv + 256 * m;
out[m] = table_inv[out[m]];
}
sub_4011A0_inv(out);
for (j = 0; j < 13; j++)
{
for (k = 0; k < 4; k++)
{
de_trans(j, k, (DWORD*)(input + k * 4));
}
sub_4011A0_inv(out);
}
for (i = 0; i < 16; i++)
{
out[i] ^= consts[i];
}
}
int main()
{
DWORD i;
first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
first_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
first_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
first_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
second_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
map_keys = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
map_vals = (BYTE*)VirtualAlloc(NULL, 0x1000000, MEM_COMMIT, PAGE_READWRITE);
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
void* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
ReadFile(file, buff, 0xb00000, &hhh, NULL);
CloseHandle(file);
gen_tables(buff);
BYTE input[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
BYTE out[16] = { 0 };
BYTE out2[16] = { 0 };
for (i = 0; i < 4; i++)
{
trans(0, 0, input + i * 4, input + i * 4 + 1, input + i * 4 + 2, input + i * 4 + 3);
my_printf("%08X\n", de_trans(0, 0, (DWORD*)(input + i * 4)));
}
/*sub_401270(input, out);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out[i]);//E4 11 53 70 7B E9 64 C4 EF 7D 51 74 EB 4B 3B 75
}
my_printf("\n\n");
sub_401270_inv(out, out2);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out2[i]);
}
my_printf("\n\n");*/
VirtualFree(buff, 0, MEM_RELEASE);
VirtualFree(map_vals, 0, MEM_RELEASE);
VirtualFree(map_keys, 0, MEM_RELEASE);
VirtualFree(second_tables_indices, 0, MEM_RELEASE);
VirtualFree(first_tables_indices, 0, MEM_RELEASE);
VirtualFree(second_tables_sorted, 0, MEM_RELEASE);
VirtualFree(first_tables_sorted, 0, MEM_RELEASE);
VirtualFree(second_tables_map, 0, MEM_RELEASE);
VirtualFree(first_tables_map, 0, MEM_RELEASE);
VirtualFree(second_tables, 0, MEM_RELEASE);
VirtualFree(first_tables, 0, MEM_RELEASE);
return 0xb19b00b5;
}
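That is basically the prototype. It can still be made faster, though.
Optimizing the algorithm is not hard. It only takes a few plain algorithms (NOIP junior-division level): binary search, quicksort, two-pointer intersection of sorted arrays... Anything fancier does not fit here, and is not needed here either...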
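The three building blocks named above can be sketched in portable C as follows. This is only a minimal illustration, not the solver itself: the names `sort_u32`, `contains_sorted` and `intersect_sorted` are made up for this example, and it leans on the standard library's `qsort` instead of a hand-written quicksort.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for qsort: ascending order of unsigned 32-bit values. */
static int cmp_u32(const void* pa, const void* pb)
{
    uint32_t a = *(const uint32_t*)pa;
    uint32_t b = *(const uint32_t*)pb;
    return (a > b) - (a < b);
}

/* Sort an array of 32-bit candidates in ascending order. */
void sort_u32(uint32_t* arr, size_t len)
{
    qsort(arr, len, sizeof(arr[0]), cmp_u32);
}

/* Binary search on a sorted array: returns 1 if target is present, else 0. */
int contains_sorted(const uint32_t* arr, size_t len, uint32_t target)
{
    size_t lo = 0, hi = len; /* search window is [lo, hi) */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] == target) return 1;
        if (arr[mid] < target) lo = mid + 1;
        else hi = mid;
    }
    return 0;
}

/* Two-pointer intersection of two sorted arrays.
   out needs room for min(na, nb) entries; returns the number written. */
size_t intersect_sorted(const uint32_t* a, size_t na, const uint32_t* b, size_t nb, uint32_t* out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb)
    {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[n++] = a[i]; i++; j++; }
    }
    return n;
}
```

Sorting costs O(n log n) per candidate set and the intersection is linear. Once one sorted set exists, later candidates can be filtered against it with `contains_sorted` as they are generated, so they never have to be stored in full.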
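Optimizing the tables is simple too: merge tables of the same kind (a term I made up while writing this post; for example BYTE becomes WORD, WORD becomes DWORD, and so on...). There is nothing else to it. The code is simple as well and you will get it at a glance, and I am only here to share the approach anyway...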
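One possible reading of "merging tables of the same kind", sketched under my own assumptions: four BYTE tables that are always indexed by the same byte get packed into one DWORD-sized table, so a single lookup returns all four outputs. `pack_tables` and `lane` are hypothetical helper names, not taken from the solver.

```c
#include <stdint.h>

/* Pack four 256-entry byte tables that share an index into one 32-bit table.
   Byte lane 0 holds t0[i], lane 1 holds t1[i], and so on. */
void pack_tables(const uint8_t* t0, const uint8_t* t1, const uint8_t* t2, const uint8_t* t3, uint32_t* packed)
{
    int i;
    for (i = 0; i < 256; i++)
    {
        packed[i] = (uint32_t)t0[i]
                  | ((uint32_t)t1[i] << 8)
                  | ((uint32_t)t2[i] << 16)
                  | ((uint32_t)t3[i] << 24);
    }
}

/* Extract byte lane n (0..3) from a packed entry. */
uint8_t lane(uint32_t entry, int n)
{
    return (uint8_t)(entry >> (8 * n));
}
```

This matches the access pattern in `trans`, where one input byte such as `aa` is looked up in four separate 256-byte tables (offsets 256 * 0, 256 * 4, 256 * 8 and 256 * 12).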
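Further optimization plus a few small changes gives the final version: single-threaded, pure CPU, a little SIMD (basically useless; how much can one shuffle buy you...)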
#include<windows.h>
#include<immintrin.h>
BYTE unk_43D000_inv[256 * 16];
const BYTE consts[] = { 0x65,0xD6,0xCD,0xFE,0xFF,0x1C,0x41,0x65,0x15,0x6E,0x18,0x4C,0xF5,0xB9,0x4E,0x13 };
BYTE table_ptr[65536];
BYTE* first_tables = 0;
BYTE* second_tables = 0;
BYTE* first_tables_map = 0;
BYTE* second_tables_map = 0;
BYTE* first_tables_sorted = 0;
BYTE* second_tables_sorted = 0;
DWORD64* first_tables_indices = 0;
DWORD64* second_tables_indices = 0;
DWORD first_tables_sorted_len[13 * 4 * 16];
DWORD second_tables_sorted_len[13 * 4 * 16];
BYTE third_tables[16 * 256];
DWORD* probs_0 = 0;
DWORD* probs_1 = 0;
typedef int my_sprintf(char* a, size_t b, const char* c, va_list d);
void my_printf(const char* format, ...)
{
DWORD i;
char buffer[1024];
for (i = 0; i < 1024; i++)
{
buffer[i] = 0;
}
va_list args;
va_start(args, format);
PVOID fuck_crt = GetProcAddress(GetModuleHandleA("ntdll.dll"), "_vsnprintf");
((my_sprintf*)fuck_crt)(buffer, sizeof(buffer), format, args);
DWORD bytes_written;
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), buffer, lstrlenA(buffer), &bytes_written, NULL);
}
void junk_memcpy(void* dst, void* src, size_t sz)// only usable when dst is 16-byte aligned and sz is a multiple of 16
{
while (sz)
{
_mm_store_si128((__m128i*)((DWORD64)dst + sz - 16), _mm_loadu_si128((__m128i*)((DWORD64)src + sz - 16)));
sz -= 16;
}
}
void junk_memset(void* mem, BYTE val, size_t sz)// only usable when mem is 16-byte aligned and sz is a multiple of 16
{
while (sz)
{
_mm_store_si128((__m128i*)((DWORD64)mem + sz - 16), _mm_set1_epi8(val));
sz -= 16;
}
}
DWORD binary_search(DWORD* arr, int start, int end, DWORD target)
{
int l = start, r = end;
DWORD ret = 0xffffffff;
while (l <= r)
{
int mid = l + ((r - l) >> 1);
if (arr[mid] == target)
{
ret = mid;
break;
}
else if (arr[mid] < target)
{
l = mid + 1;
}
else
{
r = mid - 1;
}
}
return ret;
}
DWORD my_bsearch(DWORD* arr, DWORD len, DWORD target)
{
return binary_search(arr, 0, (int)len - 1, target);// no high-jitter "optimizations" from me #(smirk)
}
DWORD intersec(DWORD* arr1, DWORD len1, DWORD* arr2, DWORD len2)
{
int i = 0, j = 0;
DWORD outlen = 0;
while (i < (int)len1 && j < (int)len2)
{
if (arr1[i] < arr2[j])
{
i++;
}
else if (arr1[i] > arr2[j])
{
j++;
}
else
{
arr1[outlen++] = arr1[i];
i++;
j++;
}
}
return outlen;
}
int partition(DWORD* arr, int low, int high)
{
int i = low, j = high;
DWORD p = arr[low];
while (i < j)
{
while (i<j && arr[j]>p)
{
j--;
}
if (i < j)
{
arr[i] ^= arr[j];
arr[j] ^= arr[i];
arr[i] ^= arr[j];
i++;
}
while (i < j && arr[i] <= p)
{
i++;
}
if (i < j)
{
arr[i] ^= arr[j];
arr[j] ^= arr[i];
arr[i] ^= arr[j];
j--;
}
}
arr[i] = p;
return i;
}
void _qsort(DWORD* arr, int low, int high)
{
if (low >= 0 && high >= 0 && low < high)
{
int mid = partition(arr, low, high);
_qsort(arr, low, mid - 1);
_qsort(arr, mid + 1, high);
}
}
#define qsort(arr,len) _qsort((arr),0,(len)-1)
void idsort(DWORD* arr, DWORD* ids, DWORD n)
{
int i, j;
for (i = 0; i < (int)n - 1; i++)
{
for (j = 0; j < (int)n - i - 1; j++)
{
if (arr[j] > arr[j + 1])
{
arr[j] ^= arr[j + 1];
arr[j + 1] ^= arr[j];
arr[j] ^= arr[j + 1];
ids[j] ^= ids[j + 1];
ids[j + 1] ^= ids[j];
ids[j] ^= ids[j + 1];
}
}
}
}
void sub_4011A0(BYTE* a1)
{
*(__m128i*)a1 = _mm_shuffle_epi8(*(__m128i*)a1, _mm_set_epi8(11, 6, 1, 12, 7, 2, 13, 8, 3, 14, 9, 4, 15, 10, 5, 0));
}
void sub_4011A0_inv(BYTE* a1)
{
*(__m128i*)a1 = _mm_shuffle_epi8(*(__m128i*)a1, _mm_set_epi8(3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13, 0));
}
DWORD de_trans_quad(DWORD* probs, BYTE* table0, DWORD* table1, DWORD64* table2,
DWORD j, DWORD k, DWORD idx, BYTE in, DWORD* priv_tab, DWORD priv_len, BOOL search)
{
DWORD s, t, u, v;
DWORD w, x, y, z;
BYTE low = in & 0xf;
BYTE high = in >> 4;
DWORD mat_idx = (j << 6) | (k << 4);
DWORD table_base = mat_idx << 8;
BYTE* tables0 = table0 + table_base;
DWORD* tables1 = table1 + mat_idx;
DWORD64* tables2 = table2 + table_base;
BYTE* t_a = tables0 + (idx << 10);
BYTE* t_b = tables0 + ((idx << 10) | 0x100);
BYTE* t_c = tables0 + ((idx << 10) | 0x200);
BYTE* t_d = tables0 + ((idx << 10) | 0x300);
DWORD64* tt_a = tables2 + (idx << 10);
DWORD64* tt_b = tables2 + ((idx << 10) | 0x100);
DWORD64* tt_c = tables2 + ((idx << 10) | 0x200);
DWORD64* tt_d = tables2 + ((idx << 10) | 0x300);
DWORD sz_a = tables1[idx << 2];
DWORD sz_b = tables1[(idx << 2) | 1];
DWORD sz_c = tables1[(idx << 2) | 2];
DWORD sz_d = tables1[(idx << 2) | 3];
DWORD cnt_a = 256 / sz_a;
DWORD cnt_b = 256 / sz_b;
DWORD cnt_c = 256 / sz_c;
DWORD cnt_d = 256 / sz_d;
DWORD indices = 0;
for (w = 0; w < sz_a; w++)
{
for (x = 0; x < sz_b; x++)
{
for (y = 0; y < sz_c; y++)
{
for (z = 0; z < sz_d; z++)
{
WORD a_a = (WORD)t_a[w];// forward computation
WORD a_b = (WORD)t_b[x];
WORD a_c = (WORD)t_c[y];
WORD a_d = (WORD)t_d[z];
WORD low_word = ((a_a & 0xF) << 12) | ((a_b & 0xF) << 8) | ((a_c & 0xF) << 4) | (a_d & 0xF);
WORD high_word = ((a_a >> 4) << 12) | ((a_b >> 4) << 8) | ((a_c >> 4) << 4) | (a_d >> 4);
if (low == table_ptr[low_word] && high == table_ptr[high_word])
{
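BYTE* arr_a = (BYTE*)&tt_a[a_a];// walk every index in the original table that yields this recorded value; their combinations are all the candidate original inputs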
BYTE* arr_b = (BYTE*)&tt_b[a_b];
BYTE* arr_c = (BYTE*)&tt_c[a_c];
BYTE* arr_d = (BYTE*)&tt_d[a_d];
for (s = 0; s < cnt_a; s++)
{
for (t = 0; t < cnt_b; t++)
{
for (u = 0; u < cnt_c; u++)
{
for (v = 0; v < cnt_d; v++)
{
DWORD val = arr_a[s] | (arr_b[t] << 8) | (arr_c[u] << 16) | (arr_d[v] << 24);
if (!search || my_bsearch(priv_tab, priv_len, val) != 0xffffffff)// binary-search the sorted list, computing the intersection on the fly
{
probs[indices++] = val;
}
}
}
}
}
}
}
}
}
}
return indices;
}
void de_trans_half(BYTE* table0, DWORD* table1, DWORD64* table2, DWORD j, DWORD k, DWORD* in)
{
DWORD i, m;
DWORD mat_idx = (j << 6) | (k << 4);
DWORD table_base = mat_idx << 8;
DWORD* tables_lens = table1 + mat_idx;
DWORD idxs[4] = { 0,1,2,3 };
DWORD sizes[4] = { 0 };
for (i = 0; i < 4; i++)
{
DWORD sum = 1;
for (m = 0; m < 4; m++)
{
sum *= tables_lens[(i << 2) | m];
}
sizes[i] = sum;
}
idsort(sizes, idxs, 4);// tiny input, so bubble sort; start from the row with the fewest candidates: the total amount of data is fixed, so starting small gives the narrowest range
DWORD cur_idx = idxs[0];
DWORD outlen_1 = de_trans_quad(probs_0, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), NULL, 0, FALSE);
qsort(probs_0, outlen_1);// first candidate set, 0x1000000 entries; large, so quicksort it for the intersection that follows
cur_idx = idxs[1];
DWORD outlen_2 = de_trans_quad(probs_1, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), NULL, 0, FALSE);
qsort(probs_1, outlen_2);// quicksort the second candidate set, 0x1000000 entries
DWORD outlen = intersec(probs_0, outlen_1, probs_1, outlen_2);// intersect the two sets; narrows the candidates to 0x10000
cur_idx = idxs[2];// from here on, filter with binary search
outlen = de_trans_quad(probs_1, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), probs_0, outlen, TRUE);
qsort(probs_1, outlen);// narrows the candidates to 0x100
cur_idx = idxs[3];
de_trans_quad(probs_0, table0, table1, table2, j, k, cur_idx, (BYTE)(*in >> (cur_idx << 3)), probs_1, outlen, TRUE);// narrows down to the result
*in = probs_0[0];
}
void trans(DWORD j, DWORD k, BYTE* a, BYTE* b, BYTE* c, BYTE* d)
{
DWORD aa = (DWORD)*a;
DWORD bb = (DWORD)*b;
DWORD cc = (DWORD)*c;
DWORD dd = (DWORD)*d;
my_printf("%02X%02X%02X%02X\n", dd, cc, bb, aa);
DWORD mat_idx = (j << 6) | (k << 4);
DWORD table_base = mat_idx << 8;
WORD low, high;
BYTE* tables = first_tables + table_base;
BYTE a_a = tables[256 * 0 + aa];
BYTE a_b = tables[256 * 4 + aa];
BYTE a_c = tables[256 * 8 + aa];
BYTE a_d = tables[256 * 12 + aa];
BYTE b_a = tables[256 * 1 + bb];
BYTE b_b = tables[256 * 5 + bb];
BYTE b_c = tables[256 * 9 + bb];
BYTE b_d = tables[256 * 13 + bb];
BYTE c_a = tables[256 * 2 + cc];
BYTE c_b = tables[256 * 6 + cc];
BYTE c_c = tables[256 * 10 + cc];
BYTE c_d = tables[256 * 14 + cc];
BYTE d_a = tables[256 * 3 + dd];
BYTE d_b = tables[256 * 7 + dd];
BYTE d_c = tables[256 * 11 + dd];
BYTE d_d = tables[256 * 15 + dd];
low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
aa = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
bb = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
cc = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
dd = (table_ptr[high] << 4) | table_ptr[low];
my_printf("%02X%02X%02X%02X\n", dd, cc, bb, aa);
tables = second_tables + table_base;
a_a = tables[256 * 0 + aa];
a_b = tables[256 * 4 + aa];
a_c = tables[256 * 8 + aa];
a_d = tables[256 * 12 + aa];
b_a = tables[256 * 1 + bb];
b_b = tables[256 * 5 + bb];
b_c = tables[256 * 9 + bb];
b_d = tables[256 * 13 + bb];
c_a = tables[256 * 2 + cc];
c_b = tables[256 * 6 + cc];
c_c = tables[256 * 10 + cc];
c_d = tables[256 * 14 + cc];
d_a = tables[256 * 3 + dd];
d_b = tables[256 * 7 + dd];
d_c = tables[256 * 11 + dd];
d_d = tables[256 * 15 + dd];
low = ((a_a & 0xF) << 12) | ((b_a & 0xF) << 8) | ((c_a & 0xF) << 4) | (d_a & 0xF);
high = ((a_a >> 4) << 12) | ((b_a >> 4) << 8) | ((c_a >> 4) << 4) | (d_a >> 4);
*a = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_b & 0xF) << 12) | ((b_b & 0xF) << 8) | ((c_b & 0xF) << 4) | (d_b & 0xF);
high = ((a_b >> 4) << 12) | ((b_b >> 4) << 8) | ((c_b >> 4) << 4) | (d_b >> 4);
*b = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_c & 0xF) << 12) | ((b_c & 0xF) << 8) | ((c_c & 0xF) << 4) | (d_c & 0xF);
high = ((a_c >> 4) << 12) | ((b_c >> 4) << 8) | ((c_c >> 4) << 4) | (d_c >> 4);
*c = (table_ptr[high] << 4) | table_ptr[low];
low = ((a_d & 0xF) << 12) | ((b_d & 0xF) << 8) | ((c_d & 0xF) << 4) | (d_d & 0xF);
high = ((a_d >> 4) << 12) | ((b_d >> 4) << 8) | ((c_d >> 4) << 4) | (d_d >> 4);
*d = (table_ptr[high] << 4) | table_ptr[low];
}
void sub_401270(BYTE* input, BYTE* out)
{
DWORD m, n, i, j, k;
for (n = 0; n < 16; n++)
{
out[n] = input[n];
}
for (i = 0; i < 16; i++)
{
out[i] ^= consts[i];
}
for (j = 0; j < 13; j++)
{
sub_4011A0(out);
for (k = 0; k < 4; k++)
{
trans(j, k, out + k * 4, out + k * 4 + 1, out + k * 4 + 2, out + k * 4 + 3);
}
}
sub_4011A0(out);
for (m = 0; m < 16; m++)
{
out[m] = third_tables[(m << 8) | out[m]];
}
}
void gen_tables()
{
DWORD hhh = 0;
HANDLE file = CreateFileA("C:\\Users\\n00bzx\\Desktop\\Devil.exe", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
BYTE* buff = (BYTE*)VirtualAlloc(NULL, 0xb00000, MEM_COMMIT, PAGE_READWRITE);
ReadFile(file, buff, 0xb00000, &hhh, NULL);
CloseHandle(file);
BYTE* unk_51E000 = buff + 0x11ba00;
BYTE* unk_866000 = buff + 0x463a00;
BYTE* unk_7F6000 = buff + 0x3f3a00;
BYTE* unk_43D000 = buff + 0x3aa00;
DWORD i, j, k, m;
DWORD low_high, low_low;
for (low_high = 0; low_high < 256; low_high++)
{
for (low_low = 0; low_low < 256; low_low++)
{
BYTE low_nib = unk_866000[0xea000 + 768 + low_low];
BYTE high_nib = unk_866000[0xea000 + 512 + low_high];
BYTE low = (high_nib << 4) | low_nib;
table_ptr[low_high * 256 + low_low] = unk_866000[0xea000 + 1280 + (DWORD)low];
}
}
for (m = 0; m < 16; m++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
BYTE* table_inv = unk_43D000_inv + 256 * m;
for (i = 0; i < 256; i++)
{
table_inv[table[i]] = (BYTE)i;
}
}
first_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
for (j = 0; j < 13; j++)
{
for (k = 0; k < 4; k++)
{
DWORD idx = (j << 12) | (k << 10);
DWORD table_base = 256 * 16 * (idx >> 10);
for (i = 0; i < 256; i++)
{
BYTE* tables = first_tables + table_base;
tables[256 * 0 + i] = unk_51E000[3 + 4 * (0x8f000 + idx + i)];
tables[256 * 1 + i] = unk_51E000[3 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_51E000[3 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_51E000[3 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_51E000[2 + 4 * (0x8f000 + idx + i)];
tables[256 * 5 + i] = unk_51E000[2 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_51E000[2 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_51E000[2 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_51E000[1 + 4 * (0x8f000 + idx + i)];
tables[256 * 9 + i] = unk_51E000[1 + 4 * (0x8f000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_51E000[1 + 4 * (0x8f000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_51E000[1 + 4 * (0x8f000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_51E000[4 * (0x8f000 + idx + i)];
tables[256 * 13 + i] = unk_51E000[4 * (0x8f000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_51E000[4 * (0x8f000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_51E000[4 * (0x8f000 + 768 + idx + i)];
tables = second_tables + table_base;
tables[256 * 0 + i] = unk_7F6000[3 + 4 * (0xe000 + idx + i)];
tables[256 * 1 + i] = unk_7F6000[3 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 2 + i] = unk_7F6000[3 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 3 + i] = unk_7F6000[3 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 4 + i] = unk_7F6000[2 + 4 * (0xe000 + idx + i)];
tables[256 * 5 + i] = unk_7F6000[2 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 6 + i] = unk_7F6000[2 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 7 + i] = unk_7F6000[2 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 8 + i] = unk_7F6000[1 + 4 * (0xe000 + idx + i)];
tables[256 * 9 + i] = unk_7F6000[1 + 4 * (0xe000 + 256 + idx + i)];
tables[256 * 10 + i] = unk_7F6000[1 + 4 * (0xe000 + 512 + idx + i)];
tables[256 * 11 + i] = unk_7F6000[1 + 4 * (0xe000 + 768 + idx + i)];
tables[256 * 12 + i] = unk_7F6000[4 * (0xe000 + idx + i)];
tables[256 * 13 + i] = unk_7F6000[4 * (0xe000 + 256 + idx + i)];
tables[256 * 14 + i] = unk_7F6000[4 * (0xe000 + 512 + idx + i)];
tables[256 * 15 + i] = unk_7F6000[4 * (0xe000 + 768 + idx + i)];
}
}
}
for (m = 0; m < 16; m++)
{
for (i = 0; i < 256; i++)
{
BYTE* table = unk_43D000 + 0xb6000 + 53248 + 256 * m;
third_tables[256 * m + i] = table[i];
}
}
VirtualFree(buff, 0, MEM_RELEASE);
first_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables_map = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
{
BYTE* ptr_first_orig = first_tables + m;
BYTE* ptr_second_orig = second_tables + m;
BYTE* ptr_first_dst = first_tables_map + m;
BYTE* ptr_second_dst = second_tables_map + m;
for (i = 0; i < 256; i++)
{
ptr_first_dst[ptr_first_orig[i]] = 1;
ptr_second_dst[ptr_second_orig[i]] = 1;
}
}
first_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
second_tables_sorted = (BYTE*)VirtualAlloc(NULL, 13 * 4 * 16 * 256, MEM_COMMIT, PAGE_READWRITE);
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)
{
BYTE* ptr_first_orig = first_tables_map + m;
BYTE* ptr_second_orig = second_tables_map + m;
BYTE* ptr_first_dst = first_tables_sorted + m;
BYTE* ptr_second_dst = second_tables_sorted + m;
int len_indice = m >> 8;
first_tables_sorted_len[len_indice] = 0;
second_tables_sorted_len[len_indice] = 0;
for (i = 0; i < 256; i++)
{
if (ptr_first_orig[i])
{
ptr_first_dst[first_tables_sorted_len[len_indice]++] = (BYTE)i;
}
if (ptr_second_orig[i])
{
ptr_second_dst[second_tables_sorted_len[len_indice]++] = (BYTE)i;
}
}
}
VirtualFree(second_tables_map, 0, MEM_RELEASE);
VirtualFree(first_tables_map, 0, MEM_RELEASE);
first_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
second_tables_indices = (DWORD64*)VirtualAlloc(NULL, 13 * 4 * 16 * 256 * 8, MEM_COMMIT, PAGE_READWRITE);
for (m = 0; m < 13 * 4 * 16 * 256; m += 256)// bucket by output value: each 8-byte slot lists the input indices that map to that value
{
BYTE tmp_indices_first[256];
BYTE tmp_indices_second[256];
junk_memset(tmp_indices_first, 0, 256);
junk_memset(tmp_indices_second, 0, 256);
BYTE* ptr_first_orig = first_tables + m;
BYTE* ptr_second_orig = second_tables + m;
DWORD64* ptr_first_dst = first_tables_indices + m;
DWORD64* ptr_second_dst = second_tables_indices + m;
for (i = 0; i < 256; i++)
{
DWORD first = (DWORD)ptr_first_orig[i];
DWORD second = (DWORD)ptr_second_orig[i];
BYTE* arr_first = (BYTE*)(ptr_first_dst + first);
BYTE* arr_second = (BYTE*)(ptr_second_dst + second);
arr_first[(DWORD)(tmp_indices_first[first]++)] = (BYTE)i;
arr_second[(DWORD)(tmp_indices_second[second]++)] = (BYTE)i;
}
}
}
void sub_401270_inv(BYTE* input, BYTE* out)
{
DWORD n, m, i, k;
int j;
for (n = 0; n < 16; n++)
{
out[n] = input[n];
}
for (m = 0; m < 16; m++)
{
BYTE* table_inv = unk_43D000_inv + 256 * m;
out[m] = table_inv[out[m]];
}
sub_4011A0_inv(out);
for (j = 12; j >= 0; j--)
{
for (k = 0; k < 4; k++)
{
de_trans_half(second_tables_sorted, second_tables_sorted_len, second_tables_indices, (DWORD)j, k, (DWORD*)(out + k * 4));
my_printf("%08X\n", *(DWORD*)(out + k * 4));
de_trans_half(first_tables_sorted, first_tables_sorted_len, first_tables_indices, (DWORD)j, k, (DWORD*)(out + k * 4));
my_printf("%08X\n", *(DWORD*)(out + k * 4));
}
sub_4011A0_inv(out);
}
for (i = 0; i < 16; i++)
{
out[i] ^= consts[i];
}
}
int main()
{
DWORD i;
gen_tables();
BYTE input[] = { 0xA0,0xA8,0xAC,0xA7,0xA9,0xB6,0x95,0x79,0xBD,0x76,0x7D,0xA9,0x29,0x5F,0xB9,0x42 };
BYTE out[16] = { 0 };
BYTE out2[16] = { 0 };
probs_0 = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
probs_1 = (DWORD*)VirtualAlloc(NULL, 0x1000000 * 4, MEM_COMMIT, PAGE_READWRITE);
trans(0, 0, input + 2 * 4, input + 2 * 4 + 1, input + 2 * 4 + 2, input + 2 * 4 + 3);
de_trans_half(second_tables_sorted, second_tables_sorted_len, second_tables_indices, 0, 0, (DWORD*)(input + 2 * 4));
my_printf("%08X\n", *(DWORD*)(input + 2 * 4));
de_trans_half(first_tables_sorted, first_tables_sorted_len, first_tables_indices, 0, 0, (DWORD*)(input + 2 * 4));
my_printf("%08X\n", *(DWORD*)(input + 2 * 4));
/*sub_401270(input, out);
for (i = 0; i < 16; i++)
{
my_printf("%02X ", out[i]);
}
my_printf("\n\n");
sub_401270_inv(out, out2);
for (i = 0; i < 16; i++)
{
my_printf("%02X", out2[i]);
}
my_printf("\n\n");*/
VirtualFree(probs_1, 0, MEM_RELEASE);
VirtualFree(probs_0, 0, MEM_RELEASE);
VirtualFree(second_tables_indices, 0, MEM_RELEASE);
VirtualFree(first_tables_indices, 0, MEM_RELEASE);
VirtualFree(second_tables_sorted, 0, MEM_RELEASE);
VirtualFree(first_tables_sorted, 0, MEM_RELEASE);
VirtualFree(second_tables, 0, MEM_RELEASE);
VirtualFree(first_tables, 0, MEM_RELEASE);
return 0xb19b00b5;
}
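Same as before: it will not compile as-is, but it represents my approach. Measured, the slowest run finishes in half an hour (little jitter, fairly stable), which brings the running time down to about 1/8 of the original.
This challenge took me about 24 hours in total. I actually wrote 4 versions of the script: a direct brute-force version, a SIMD version, a first optimized version, and a second optimized version. The 4 hours quoted earlier is an estimate (a single trans takes 2 minutes, so 2 * 2 * 13 * 4 minutes is roughly 4 hours, single-threaded); in practice I only ran one trans.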
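The estimate above can be checked with a few lines of arithmetic. The post does not spell out the factors, so the reading here is my assumption: 2 minutes per brute-forced trans, 2 table stages (`first_tables` and `second_tables`), 13 rounds, and 4 columns per round. The function name `estimate_minutes` is made up for this sketch.

```c
/* Total single-threaded brute-force time in minutes,
   assuming every trans inversion costs the same. */
unsigned estimate_minutes(unsigned minutes_per_trans, unsigned stages, unsigned rounds, unsigned columns)
{
    return minutes_per_trans * stages * rounds * columns;
}
```

With these factors, `estimate_minutes(2, 2, 13, 4)` gives 208 minutes, a bit under three and a half hours, which the post rounds to roughly 4 hours.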
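Had I solved it during the contest, it would have taken me about 3 hours: ten minutes to reverse down to the algorithm, half an hour to write the direct brute force, 10 minutes to add multithreading, and 2 hours of brute forcing to get the result (multithreaded brute force; the 4*2 trans inversions in each group can run in parallel, I have a 4-core CPU, and this excludes interference).
In the actual contest, first blood went to 数码暴龙 at 2 hours, and by 4 hours there were already 10 solves.
So what does that make me? What have I actually done from the day I started learning until now? My skills still have a lot of room to grow, and I will need to keep learning for the rest of my life.
My boss taught me that it is better to think about practical things.
Fuzhou is so hot... I really want to go to Jinan... let's go in winter... boss, wait for me... #(smirk)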