[Translation] Detecting Command Line Obfuscation Using Machine Learning
Posted: 2019-1-13 10:26
Translator: 玉林小学生 Proofreader: Daemond
This blog post presents a machine learning (ML) approach to solving an emerging security problem: detecting obfuscated Windows command line invocations on endpoints. We start out with an introduction to this relatively new threat capability, and then discuss how such problems have traditionally been handled. We then describe a machine learning approach to solving this problem and point out how ML vastly simplifies development and maintenance of a robust obfuscation detector. Finally, we present the results obtained using two different ML techniques and compare the benefits of each.
Malicious actors are increasingly “living off the land,” using built-in utilities such as PowerShell and the Windows Command Processor (cmd.exe) as part of their infection workflow in an effort to minimize the chance of detection and bypass whitelisting defense strategies. The release of new obfuscation tools makes detection of these threats even more difficult by adding a layer of indirection between the visible syntax and the final behavior of the command. For example, Invoke-Obfuscation and Invoke-DOSfuscation are two recently released tools that automate the obfuscation of PowerShell and Windows command lines respectively.
The traditional pattern matching and rule-based approaches for detecting obfuscation are difficult to develop and generalize, and can pose a huge maintenance headache for defenders. We will show how using ML techniques can address this problem.
Detecting obfuscated command lines is a very useful technique because it allows defenders to reduce the data they must review by providing a strong filter for possibly malicious activity. While there are some examples of “legitimate” obfuscation in the wild, in the overwhelming majority of cases the presence of obfuscation serves as a signal of malicious intent.
There has been a long history of obfuscation being employed to hide the presence of malware, ranging from encryption of malicious payloads (starting with the Cascade virus) and obfuscation of strings, to JavaScript obfuscation. The purpose of obfuscation is two-fold:
In that sense, command line obfuscation is not a new problem – it is just that the target of obfuscation (the Windows Command Processor) is relatively new. The recent release of tools such as Invoke-Obfuscation (for PowerShell) and Invoke-DOSfuscation (for cmd.exe) has demonstrated just how flexible these commands are, and how even incredibly complex obfuscation will still run commands effectively.
There are two categorical axes in the space of obfuscated vs. non-obfuscated command lines: simple/complex and clear/obfuscated (see Figure 1 and Figure 2). For this discussion “simple” means generally short and relatively uncomplicated, but can still contain obfuscation, while “complex” means long, complicated strings that may or may not be obfuscated. Thus, the simple/complex axis is orthogonal to obfuscated/unobfuscated. The interplay of these two axes produces many boundary cases where simple heuristics to detect if a script is obfuscated (e.g. length of a command) will produce false positives on unobfuscated samples. The flexibility of the command line processor makes classification a difficult task from an ML perspective.
Traditional obfuscation detection can be split into three approaches. One approach is to write a large number of complex regular expressions to match the most commonly abused syntax of the Windows command line. Figure 3 shows one such regular expression that attempts to match ampersand chaining with a call command, a common pattern seen in obfuscation. Figure 4 shows an example command sequence this regex is designed to detect.
There are two problems with this approach. First, it is virtually impossible to develop regular expressions to cover every possible abuse of the command line. The flexibility of the command line results in a non-regular language, which is feasible yet impractical to express using regular expressions. A second issue with this approach is that even if a regular expression exists for the technique a malicious sample is using, a determined attacker can make minor modifications to avoid the regular expression. Figure 5 shows a minor modification to the sequence in Figure 4, which avoids the regex detection.
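The fragility described above can be sketched in a few lines of Python. The pattern below is a hypothetical stand-in, not the regular expression from Figure 3, and both command lines are invented for illustration. It only shows how a pattern keyed on a literal `call` after ampersand chaining can be sidestepped with a caret, which cmd.exe treats as an escape character.

```python
import re

# Hypothetical pattern (NOT the regex from Figure 3): flags "&" or "&&"
# chaining that is followed by a literal `call` keyword and an argument.
CALL_CHAIN = re.compile(r"&{1,2}\s*call\s+\S", re.IGNORECASE)


def looks_like_call_chain(command_line: str) -> bool:
    """Return True if the command line chains into a literal `call`."""
    return CALL_CHAIN.search(command_line) is not None


# Illustrative command lines, invented for this sketch.
plain = "cmd /c set x=whoami && call %x%"
evasive = "cmd /c set x=whoami && c^all %x%"  # caret splits the keyword

print(looks_like_call_chain(plain))    # True
print(looks_like_call_chain(evasive))  # False
```

Patching the pattern to tolerate carets would catch this one variant, but every other insertion point or escape mechanism would need the same treatment, which is the maintenance problem the paragraph describes.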
The second approach, which is closer to an ML approach, involves writing complex if-then rules. However, these rules are hard to derive, are complex to verify, and pose a significant maintenance burden as malware authors evolve their techniques to escape detection by such rules. Figure 6 shows one such if-then rule.
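As a rough illustration of what such a rule might look like (this is not the rule from Figure 6), the sketch below hard-codes a handful of conditions. The function name and every threshold are assumptions chosen only to make the example concrete.

```python
def is_suspicious(command_line: str) -> bool:
    """Hand-written if-then rule; every threshold is an illustrative guess."""
    carets = command_line.count("^")
    percents = command_line.count("%")
    ampersands = command_line.count("&")

    # Rule 1: many caret escapes in one command line.
    if carets > 5:
        return True
    # Rule 2: heavy variable expansion in a long command line.
    if percents > 10 and len(command_line) > 200:
        return True
    # Rule 3: cmd invoked with a long chain of ampersands.
    if "cmd" in command_line.lower() and ampersands > 4:
        return True
    return False


print(is_suspicious("dir C:\\Users"))          # False
print(is_suspicious("c^m^d /c e^c^h^o h^i"))   # True (six carets)
```

Each threshold has to be justified, tested against benign data, and revisited whenever attackers adapt, which is where the verification and maintenance cost comes from.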
A third approach is to combine regular expressions and if-then rules. This greatly increases the development and maintenance burden, and still suffers from the same weaknesses that make the first two approaches fragile. Figure 7 shows an example of an if-then rule with regular expressions. It is easy to appreciate how burdensome it is to generate, test, and maintain such rules, and to determine their efficacy.
Using ML simplifies the solution to these problems. We will illustrate two ML approaches: a feature-based approach and a feature-less end-to-end approach.
There are some ML techniques that can work with any kind of raw data (provided it is numeric), and neural networks are a prime example. Most other ML algorithms require the modeler to extract pertinent information, called features, from raw data before they are fed into the algorithm. Some examples of this latter type are tree-based algorithms, which we will also look at in this blog (we described the structure and uses of tree-based algorithms in a previous blog post, where we used a gradient-boosted tree-based model).
Neural networks are a type of ML algorithm that has recently become very popular and consists of a series of elements called neurons. A neuron is essentially an element that takes a set of inputs, computes a weighted sum of these inputs, and then feeds the sum into a non-linear function. It has been shown that a relatively shallow network of neurons can approximate any continuous mapping between input and output. The specific type of neural network we used for this research is what is called a Convolutional Neural Network (CNN), which was developed primarily for computer vision applications, but has also found success in other domains including natural language processing. One of the main benefits of a neural network is that it can be trained without having to manually engineer features.
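The neuron described above can be written out directly. This is a minimal sketch: the sigmoid is just one possible choice of non-linear function, and the bias term is a common addition that the text does not mention, so both are assumptions.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through a non-linear function."""
    # Weighted sum; the bias term is an assumption, not stated in the post.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid chosen for illustration; other non-linearities work the same way.
    return 1.0 / (1.0 + math.exp(-z))


# The weighted sum is 1.0*0.5 + 2.0*(-0.25) = 0.0, and sigmoid(0.0) = 0.5.
print(neuron([1.0, 2.0], [0.5, -0.25]))  # 0.5
```

A network stacks many of these units in layers, and training adjusts the weights so that the final output matches the labels.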
While neural networks can be used with feature data, one of the attractions of this approach is that it can work with raw data (converted into numeric form) without doing any feature design or extraction. The first step in the model is converting text data into numeric form. We used a character-based encoding where each character type was encoded by a real-valued number. The value was automatically derived during training and conveys semantic information about the relationships between characters as they apply to cmd.exe syntax.
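The text-to-numeric step might look roughly like the following sketch. The fixed length, padding value, embedding width, and the `EmbeddingTable` class are all hypothetical. In a real model the embedding values would be learned during training; here they are random placeholders so that the lookup mechanics are visible.

```python
import random

MAX_LEN = 16    # hypothetical fixed input length
EMBED_DIM = 4   # hypothetical embedding width
PAD = 0         # padding code; assumes NUL never appears in real input


def encode(command_line, max_len=MAX_LEN):
    """Map each character to its Unicode code point, then truncate or pad."""
    codes = [ord(ch) for ch in command_line[:max_len]]
    return codes + [PAD] * (max_len - len(codes))


class EmbeddingTable:
    """Lookup table from code point to vector (random stand-in for learned values)."""

    def __init__(self, dim=EMBED_DIM, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._rows = {}

    def lookup(self, code):
        # Create a row on first use; the same code always maps to the same row.
        if code not in self._rows:
            self._rows[code] = [self._rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self._rows[code]


table = EmbeddingTable()
codes = encode("cmd /c dir")
vectors = [table.lookup(c) for c in codes]
print(codes[:3])                       # [99, 109, 100]
print(len(vectors), len(vectors[0]))   # 16 4
```

The resulting sequence of vectors is what the later layers of the network would consume.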
We also experimented with hand-engineered features and a Gradient Boosted Decision Tree algorithm. The features developed for this model were largely statistical in nature – derived from the presence and frequency of character sets and keywords. For example, the presence of dozens of ‘%’ characters or long, contiguous strings might contribute to detecting potential obfuscation. While no single feature will perfectly separate the two classes, a combination of features as present in a tree-based model can learn flexible patterns in the data. The expectation is that those patterns are robust and can generalize to future obfuscation variants.
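Statistical features of this kind can be sketched as a small extraction function. The feature names and the keyword list below are hypothetical and are not the feature set used in the post. They only illustrate counting characters, keywords, and contiguous runs.

```python
import re

# Hypothetical keyword list; the post does not enumerate its keywords here.
KEYWORDS = ("set", "call", "for", "cmd")


def extract_features(command_line):
    """Turn one raw command line into a dict of simple numeric features."""
    lowered = command_line.lower()
    tokens = command_line.split()
    return {
        "length": len(command_line),
        "percent_count": command_line.count("%"),
        "caret_count": command_line.count("^"),
        # Longest whitespace-free run, a proxy for long contiguous strings.
        "longest_token": max((len(t) for t in tokens), default=0),
        # Whole-word keyword occurrences, case-insensitive.
        "keyword_hits": sum(
            len(re.findall(r"\b" + re.escape(k) + r"\b", lowered))
            for k in KEYWORDS
        ),
    }


features = extract_features("cmd /c set a=1 && call echo %a%")
print(features["percent_count"], features["keyword_hits"])  # 2 3
```

A gradient boosted tree model would then be trained on one such feature vector per command line.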
To develop our models, we collected non-obfuscated data from tens of thousands of endpoint events and generated obfuscated data using a variety of methods in Invoke-DOSfuscation. We developed our models using roughly 80 percent of the data as training data, and tested them on the remaining 20 percent. We ensured that our train-test split was stratified. For featureless ML (i.e. neural networks), we simply input Unicode code points into the first layer of the CNN model. The first layer converts the code points into semantically meaningful numerical representations (called embeddings) before feeding them into the rest of the neural network.
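A stratified 80/20 split keeps the ratio of obfuscated to non-obfuscated samples the same in both partitions. The sketch below is a minimal standard-library version of that idea, not the tooling the authors used. The function name, seed, and toy data are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split (sample, label) pairs so each label keeps the same test fraction."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)  # shuffle within each class before cutting
        n_test = round(len(group) * test_fraction)
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


# Toy data: 80 clear (label 0) and 20 obfuscated (label 1) command lines.
samples = [f"command-{i}" for i in range(100)]
labels = [0] * 80 + [1] * 20
train, test = stratified_split(samples, labels)
print(len(train), len(test))                         # 80 20
print(sum(1 for _, label in test if label == 1))     # 4
```

Without stratification, a random split of imbalanced data could leave the test set with too few obfuscated samples to measure detection rates reliably.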
For the Gradient Boosted Tree method, we generated a number of features from the raw command lines. The following are some of them: