This is a novel topic. I have some questions.
Sorry I cannot type Chinese. I hope you are able to understand my English.
Question 1:
How do you know if a variable has a backup? A variable can be stored not only in a register or memory, but also in a file or the registry. In the latter case, you have to identify the Windows APIs related to I/O, right? Besides, a variable can be encrypted before being stored, and decrypted before being used.
Question 2:
If a variable is encrypted before being backed up, what is the complexity of finding the correlation between the decrypted value and the original value?
For example, suppose var2 = algorithm1(var1) and var3 = algorithm2(var2):
if algorithm1 = AES and algorithm2 = MD5, then var3 is not a backup of var1;
but if algorithm1 = AES and algorithm2 = inverse_AES, then var3 is a backup of var1.
Theoretically it is possible to analyze the flow of the algorithm, but the complexity may exceed the capability of contemporary CPU and memory, and you may consume all the memory on your machine before you get the result.
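To make the example concrete, here is a toy Python sketch of what I mean (a self-inverse XOR routine stands in for AES, and the variable names are only illustrative):

import hashlib

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for algorithm1 (self-inverse: applying it twice restores the input).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"\x13\x37"
var1 = b"original value"

# Case A: algorithm2 is a one-way hash -> var3 cannot be a backup of var1.
var2 = xor_cipher(var1, key)
var3 = hashlib.md5(var2).digest()
print(var3 == var1)        # False: the correlation is lost

# Case B: algorithm2 inverts algorithm1 -> var3 is a backup of var1.
var3 = xor_cipher(var2, key)
print(var3 == var1)        # True: var3 restores var1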
Question 3:
Your algorithm depends on identifying variables and constants. BUT, how can you tell if src is a constant?
Consider the following example:
00401000 B8 40C54801 mov eax, 0148C540
Is src constant? Maybe yes, maybe no.
The following SMC can change it easily:
mov dword ptr [00401001], 12345678
And this SMC can hide itself deeply, transformed, maybe in a guardian thread.
Similarly, the imm value 12345678 cannot be taken for granted as a constant.
The imm value at 00401001 can also be modified by a guardian process, or even from ring 0 in the kernel.
In such an SMC case, it is extremely hard to trace its origin, and you have to assume there is no constant at all.
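Here is a small Python simulation of that patch (simulated code bytes, not real process memory), just to show how little the bytes themselves tell you:

# Simulated code bytes of: 00401000  B8 40 C5 48 01   mov eax, 0148C540
code = bytearray.fromhex("B840C54801")
snapshot = bytes(code)                 # baseline taken at analysis time

# Guardian thread/process patches the 4-byte immediate at offset 1
# (the equivalent of: mov dword ptr [00401001], 12345678).
code[1:5] = (0x12345678).to_bytes(4, "little")

# A later integrity pass only sees that the "constant" changed;
# it cannot tell, from the bytes alone, who changed it or why.
changed = [i for i, (a, b) in enumerate(zip(snapshot, code)) if a != b]
print("modified offsets:", changed)    # [1, 2, 3, 4]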
about question 3
Every value coded into the program is a "constant". The concept of a "constant" is that its value is solid: every time you run the program and execution reaches this point, it gets the same value.
When one constant replaces another, a new data chain is opened, and when we patch it, we can surely find our way out through the old chain and deploy the new one.
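A rough Python sketch of what I mean by opening a new chain on a patch (the record layout is only an illustration, not an actual data structure):

from dataclasses import dataclass, field

@dataclass
class DataChain:
    origin: int                       # the constant that started the chain
    events: list = field(default_factory=list)

chains = []

def define_constant(value: int) -> DataChain:
    chain = DataChain(origin=value, events=[f"defined as {value:#x}"])
    chains.append(chain)
    return chain

def patch_constant(old: DataChain, new_value: int) -> DataChain:
    # The old chain is closed but kept, so we can still walk back through it.
    old.events.append(f"overwritten by {new_value:#x}")
    return define_constant(new_value)

c1 = define_constant(0x0148C540)
c2 = patch_constant(c1, 0x12345678)   # the patch opens a new chain
print(len(chains), c1.events, c2.events)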
about question 1
It's a compiler-design concept called SSA ("static single-assignment form"), which means that every time you move a piece of data from one place to another, you build a copy of it, and every time you change a piece of data, you kill a copy of it. I think we had better keep the information about these copies in a database.
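Something like this minimal Python sketch is what I have in mind for the copy database (the layout is just an assumption for illustration):

# Minimal SSA-style bookkeeping: each location gets a new version on every
# definition, and copies record which version they were taken from.
versions = {}     # location -> current version number
copies = []       # (dst_location, dst_version, src_location, src_version)

def define(loc):
    # A write to `loc` kills the old version and starts a new one.
    versions[loc] = versions.get(loc, 0) + 1
    return versions[loc]

def move(dst, src):
    # dst = src: the new version of dst is recorded as a copy of src.
    src_ver = versions.get(src, 0)
    dst_ver = define(dst)
    copies.append((dst, dst_ver, src, src_ver))

define("eax")                 # eax gets a value (version 1)
move("[mem]", "eax")          # [mem]_1 is a copy of eax_1
define("eax")                 # eax changes -> eax_1 is dead, but the copy survives
print(copies)                 # [('[mem]', 1, 'eax', 1)]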
about question 2
If a value has only one copy and you change it, it means you have just given this variable a new value, and you can find it as a node in the data chain after a long processing time. The way it looks may change, but since it is the cause of the result, we can trace back from the result, through every preceding procedure, to it.
The difficult thing is that we must find the encryption key, another factor in the chain. There are two data chains in the program: one is the encrypted data, the other is the decryption key. As we see with WinRAR, you can hardly decrypt the data without a key. But in a packed program, as long as it can run, the key is surely embedded in the same program, and all we have to do is find it.
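A tiny Python sketch of that backward walk, assuming a def-use log was recorded during tracing (the log format here is invented for illustration):

# Each entry: result -> the operands that produced it (a recorded def-use log).
defs = {
    "plaintext": ["ciphertext", "key"],   # plaintext = decrypt(ciphertext, key)
    "ciphertext": ["packed_section"],     # read from the packed image
    "key": ["embedded_constant"],         # the key must be somewhere in the program
}

def trace_back(value, depth=0):
    # Walk from the result back through every producing operand.
    print("  " * depth + value)
    for src in defs.get(value, []):
        trace_back(src, depth + 1)

trace_back("plaintext")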
about question 2
When you read a value from a Windows API, such as a file or registry call, our program can flag it and let you yourself decide whether to label it a constant or a variable; the human brain is much more capable of handling this kind of information than a machine.
That's because the Windows API is the same everywhere, and though the packer writer may disguise the API calls, in the end they always go down into the kernel. The packer writer may even ship a driver, but we can put our own driver in the kernel first.
In this way, the packer can't invoke an API while hiding from us. And since we trace its every step, when it invokes an API we decide whether the data read in is a solid value or generated on the fly.
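Roughly like this Python sketch (the hook itself is left abstract; in reality it would sit in the kernel driver, and the API names are just examples):

labels = {}   # (api, source) -> "constant" or "variable"

def ask_human(api, source, value):
    # In the real tool this would prompt the analyst;
    # here we just pretend the human answered for the two examples below.
    return "constant" if api == "ReadFile" else "variable"

def on_api_read(api, source, value):
    # Called by the tracer (kernel hook) whenever the target reads data through an API.
    key = (api, source)
    if key not in labels:
        labels[key] = ask_human(api, source, value)
    return labels[key]

# The packer reads its own overlay (solid every run) and the current tick count (not).
on_api_read("ReadFile", "packed.exe", b"\x4d\x5a\x90\x00")
on_api_read("GetTickCount", "-", b"\x12\x34\x00\x00")
print(labels)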
"complexity may exceed the capability of contemporary CPU and memory"
That's why we must find a way to structure our data structures and algorithms efficiently, e.g., analyzing a loop only once, not 1000 times.
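For example, a loop summary cache could look like this minimal Python sketch (a caching scheme assumed for illustration, not a specific published algorithm):

loop_summaries = {}   # loop start address -> summarized effect

def analyze_loop_body(start_addr):
    # Expensive per-instruction analysis; done only on the first encounter.
    print(f"analyzing loop at {start_addr:#x} once...")
    return {"writes": ["eax", "[edi]"], "reads": ["esi", "ecx"]}

def on_loop_iteration(start_addr):
    if start_addr not in loop_summaries:
        loop_summaries[start_addr] = analyze_loop_body(start_addr)
    return loop_summaries[start_addr]   # the 999 later iterations are free

for _ in range(1000):
    on_loop_iteration(0x401000)
print(loop_summaries[0x401000])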
For my questions 1 & 2, I agree it's theoretically possible to determine the data flow.
My concern is that the complexity may be huge: when analyzing APIs and encryption/decryption algorithms, a lot of other variables are introduced, and you need a huge database to store them. Think about IDA... even plain static disassembly and analysis is already slow.
For your answer to question 3, I agree with your idea, but how can you tell when a new chain should be generated? And how can you trace its origin?
You can monitor several hotspots, but not the entire memory. There is no extra space to store the state of every memory block.
Also, memory breakpoints are not practical. Think about OD... if a memory breakpoint is triggered too often, then most CPU resources are wasted filtering out the false alarms. If I hide the patch operation among 1,000,000 normal memory accesses, how are you able to detect it?
Hmm... the only solution I can come up with is to use remote debugging or VM debugging, with a huge amount of memory to store the state of each memory block of the debuggee.
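Something like this Python sketch of lazily allocated per-page state is the best I can imagine (the page size and state fields are placeholders):

PAGE_SIZE = 0x1000
shadow = {}   # page base -> per-page state; allocated lazily, only for touched pages

def on_access(address, kind):
    page = address & ~(PAGE_SIZE - 1)
    state = shadow.setdefault(page, {"reads": 0, "writes": 0, "dirty": False})
    state[kind] += 1
    if kind == "writes":
        state["dirty"] = True

# One patch hidden among 1,000,000 normal reads still leaves a dirty page behind.
for i in range(1_000_000):
    on_access(0x00500000 + (i % 64) * 4, "reads")
on_access(0x00401001, "writes")

print([hex(p) for p, s in shadow.items() if s["dirty"]])   # ['0x401000']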
This is where we have to step in and analyze. Of course, at this point the machine will prompt us: although the confusion value is randomly generated, and a and b were "short-circuited" when c = a + b was computed, the short circuit seems to be lifted a moment later when e = e - a is computed. Our intelligent analyzer lacks the experience to know whether it should treat this as a lifted short circuit, but it will present the relevant information and let a human decide. And once we confirm that a point is a "taint", every preceding step of the whole information chain is traced back and treated as a fixed value.
That way, what remains is the real information flow and the key.
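A minimal Python sketch of that confirmation step (the chain records and names are only illustrative):

# A recorded chain of steps, oldest first; "ambiguous" marks a suspected short circuit.
chain = [
    {"op": "a = key_material",  "ambiguous": False, "fixed": False},
    {"op": "b = random_junk()", "ambiguous": False, "fixed": False},
    {"op": "c = a + b",         "ambiguous": True,  "fixed": False},  # a, b short-circuited?
    {"op": "e = c - b",         "ambiguous": False, "fixed": False},
    {"op": "e = e - a",         "ambiguous": True,  "fixed": False},  # short circuit lifted?
]

def confirm_taint(index):
    # Human says chain[index] is a taint point: treat everything before it as fixed.
    for step in chain[:index]:
        step["fixed"] = True

# The analyzer would present the ambiguous steps; here we pretend the human picked the last one.
ambiguous = [i for i, s in enumerate(chain) if s["ambiguous"]]
confirm_taint(ambiguous[-1])
print([s["op"] for s in chain if s["fixed"]])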
Now let's raise the difficulty a bit.
This time, what if we remove the variable e? You will surely find that the program can't go on. Yes, it gets into the packer's code but can't return to the normal program flow. That's why any packer, when processing its own code, must keep the original data in a place it can find, and keep a pointer to that place, or a pointer to that pointer, held in its own registers the whole time.
The performance issue is still what I'm worried about.
Let's consider the timing first and assume that you have a sufficiently large buffer to store the data flows extracted from the code.
How do you collect information from instructions? I think it is necessary to trace through every instruction, and there are two ways of doing it:
setting TF to 1 in a debugger at every single instruction, or analyzing the target in a VM. Either way is much slower than running without a debugger or VM, especially with loops. You don't want to wait for hours before a message box pops up, do you?
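A quick back-of-envelope estimate in Python (the slowdown factor and instruction counts are assumptions picked only for illustration):

instructions = 1_000_000_000      # e.g. an unpacking loop executing 10^9 instructions
native_ips   = 1_000_000_000      # ~10^9 instructions/second natively (assumed)
single_step_overhead = 10_000     # assumed cost factor per instruction when single-stepping

native_seconds = instructions / native_ips
traced_seconds = native_seconds * single_step_overhead
print(f"native: ~{native_seconds:.0f} s, single-stepped: ~{traced_seconds / 3600:.1f} hours")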
You may claim that the analyzer can skip a loop once it is recognized. But what if it needs assistance from a human brain to detect a well-transformed loop? It may eat up all the CPU resources on a deliberately crafted, hard-to-detect loop.
If you use static analysis only, it cannot cope with patching or SMC, and you risk losing the trace.
BTW, I'm not criticizing your algorithm. I just want to figure out whether it is feasible, given limited CPU and memory resources, and of course time. If so, it could also be applied to virus analysis, or to building a polymorphic engine based on the data flow, etc.