Precision (PRE): mmdthash = ssdeep = vhash > tlsh
Based on the test results in this article, with the mmdthash threshold at 0.95 and the ssdeep and tlsh thresholds at 0.8, the overall ranking of the sensitive hashes is:
tlsh > mmdthash > ssdeep > vhash
Overview of the four sensitive hashes:
Building on the mmdthash test data and results from the article "python_mmdt: KNN machine-learning classification result analysis (part 5)", this article runs a comparative test of ssdeep, tlsh, and vhash. That is, for each pair of samples associated by mmdthash, we compute their ssdeep, tlsh, and vhash similarities, then statistically analyze the outliers, accuracy, recall, and precision to compare the sensitive hash algorithms against each other.
Note: installing ssdeep and tlsh on Windows is rather painful, so the tests were run directly in a Linux environment on a Raspberry Pi.
ssdeep installation:
Alternatively, if the Linux build environment is reasonably complete (including automake and similar tools), the ssdeep fuzzy hashing library can be installed directly via pip: BUILD_LIB=1 pip install ssdeep
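As a sketch of what "a reasonably complete build environment" means on Debian-flavored systems such as Raspberry Pi OS (package names taken from the python-ssdeep documentation; adjust for your distribution):

```shell
# Build prerequisites for compiling the bundled libfuzzy from source
sudo apt-get install build-essential libffi-dev python3-dev automake autoconf libtool
# Then build and install the Python binding in one step
BUILD_LIB=1 pip install ssdeep
```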
Using Python's ssdeep library, compute the ssdeep values of the 785 test files and save them to a file in JSON format; the code is as follows:
An example of the ssdeep results:
tlsh installation: pip install py-tlsh
Using Python's tlsh library, compute the tlsh values of the 785 test files and save them to a file in JSON format; the code is as follows:
An example of the tlsh results:
VirusTotal has not open-sourced the vhash algorithm, so at present vhash can only be obtained by querying VirusTotal's web API. The same API response also carries the file's ssdeep and tlsh values (tlsh has presumably been on VirusTotal for a relatively short time, so some older samples are missing tlsh values). VirusTotal's API documentation is available here; the documentation pages let you test the endpoints directly and generate client code for your language, which is very convenient.
Register a VirusTotal account and apply for an api_key as described in the VirusTotal documentation; this takes only a few minutes. When querying the API, mind the rate limits.
The Python code used to query VirusTotal is as follows:
The vhash query results look like:
Use the test results from "python_mmdt: KNN machine-learning classification result analysis (part 5)".
Merge the data from the ssdeep_test.json, tlsh_test.json, vhash_test.json, and mmdthash_test.json files generated above, keyed as a dictionary, into ssdeep_tlsh_vhash_mmdthash_test.json; for example:
Using the mmdthash classification results from "python_mmdt: KNN machine-learning classification result analysis (part 5)" as the baseline, compute the ssdeep, tlsh, and vhash similarity values between the associated files. Three points in the computation deserve attention:
The comparison code is as follows:
Related files and download links:
ssdeep_tlsh_vhash_mmdthash_test.xlsx
Sample data:
As described above, we use the mmdthash detections as the baseline and compare the ssdeep, tlsh, and vhash results against it.
Comparison against mmdthash detections
With the mmdthash similarity threshold at 0.95 and similarities sorted in descending order, the first 133 files are detected as malicious. Of these, 132 are detected correctly as malicious files; the last one is a misdetection, where the malware-family classification is inconsistent.
Comparison against mmdthash misses
With the mmdthash similarity threshold at 0.95 and similarities sorted in descending order, the last 267 files are not detected. Of these, 200 are correct (clean files), while 67 are wrong: malicious files predicted as clean.
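From these baseline counts (132 true positives, 1 false positive, 200 true negatives, 67 false negatives) the precision, recall, and accuracy of mmdthash at the 0.95 threshold can be recomputed directly:

```python
def precision_recall(tp, fp, tn, fn):
    # Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

# Counts taken from the mmdthash baseline above (threshold 0.95)
p, r, a = precision_recall(tp=132, fp=1, tn=200, fn=67)
print('precision=%.3f recall=%.3f accuracy=%.3f' % (p, r, a))
# prints: precision=0.992 recall=0.663 accuracy=0.830
```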
Comparison
After manually analyzing the samples behind the outliers, with the mmdthash threshold at 0.95 and the ssdeep and tlsh thresholds at 0.8, the statistics are as follows:
As shown in the figure:
Under the same thresholds (mmdthash at 0.95, ssdeep and tlsh at 0.8), the following conclusions can be drawn:
In summary, based on the test results in this article, the overall ranking of the sensitive hashes is:
tlsh > mmdthash > ssdeep > vhash
Other files account for 2%.
To highlight the differences, the runs were deliberately executed on a low-performance Raspberry Pi; the timings are as follows:
import os
import sys
import hashlib
import json
import ssdeep

def list_dir(root_dir):
    # Yield the full path of each entry directly under root_dir
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

def gen_sha1(file_name):
    # SHA-1 of the file content, used as the sample identifier
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = hashlib.sha1()
    _s.update(s)
    return _s.hexdigest()

def main():
    file_path = sys.argv[1]
    ssdeep_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        ssdeep_hash = ssdeep.hash_from_file(file_name)
        print('%s,%s' % (file_sha1, ssdeep_hash))
        ssdeep_dict[file_sha1] = ssdeep_hash
    with open('ssdeep_test.json', 'w') as f:
        f.write(json.dumps(ssdeep_dict, indent=4))

if __name__ == '__main__':
    main()
cat ssdeep_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "6144:w9qaZ5E6fCvH5H42SUiTV2MTb54y94HTFboTWhmzeOws:w9d96yeKV2MTb5X4zZQWhmqd",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "1536:NxiIXeGNc0BL0IFx34bPMkG/KsrKlEqjjPWUJ7h/dbZkv13t43O:eIXeGNtV0KIQjr5ehlbSv13t43O",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "1536:qKXppRU0D2KmMESllkQSp5jcUyT/jAdp/hsonBqar5mVNCG:JpGjKm9fQSp5sjAfAa1mVMG",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "3072:XwZcFNCpegr+L3Y5D+LRohyOBGbNc8GMmE/A9VpGLGWtQeGwX1gnuZPZc2:XHCNEY5D+LfOi3GbE/AsAeGwXwc5",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "384:enXKs1aOcWkZ1WgoELXuf9OO5GD+IGA4p1XMWfg7CF:enp1aOasDOOM+ut",
    ......
}
import os
import sys
import hashlib
import json
import tlsh

def list_dir(root_dir):
    # Yield the full path of each entry directly under root_dir
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

def gen_sha1(file_name):
    # SHA-1 of the file content, used as the sample identifier
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = hashlib.sha1()
    _s.update(s)
    return _s.hexdigest()

def gen_tlsh(file_name):
    # tlsh.hash() computes the TLSH digest over the raw file bytes
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = tlsh.hash(s)
    return _s

def main():
    file_path = sys.argv[1]
    tlsh_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        tlsh_hash = gen_tlsh(file_name)
        print('%s,%s' % (file_sha1, tlsh_hash))
        tlsh_dict[file_sha1] = tlsh_hash
    with open('tlsh_test.json', 'w') as f:
        f.write(json.dumps(tlsh_dict, indent=4))

if __name__ == '__main__':
    main()
cat tlsh_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "T1616423D5248C5DF8E251CCF4C73AB60493EADA48BF516B75BDD9C2692FF2480C93A214",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "T13D73024483EBEDA8EE040AB0124C43B9CBAD8D1B7659653DFD3864D1FC064AE47269A6",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "T1CF93293D766924E5E139C17CC5474E0AF772B025071227EF06A4C2BE1F97BE06C39AA5",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "T17F34391A57EC0465F1B7923589B34919F233B8625731E2DF109082BC2E27FD8BE36B56",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "T12D5208C71F69F7D4C19F85F84A3B623E1EA4616A6111412057DD3E92BC1C3DBFA2A09C",
    ......
}
import sys
import json
import requests
from time import sleep

x_apikey = 'xxxx'

def read_hash(file_name):
    # One sha1 hash per line in the input file
    with open(file_name, 'r') as f:
        datas = f.readlines()
    return [file_hash.strip() for file_hash in datas]

def parse_vt_report(vt_report_json):
    # Pull vhash/magic/tlsh/ssdeep out of the VirusTotal v3 file report
    attributes = vt_report_json.get('data', {}).get('attributes', {})
    parse_data = dict()
    if attributes:
        parse_data['vhash'] = attributes.get('vhash', '')
        parse_data['magic'] = attributes.get('magic', '')
        parse_data['tlsh'] = attributes.get('tlsh', '')
        parse_data['ssdeep'] = attributes.get('ssdeep', '')
    return parse_data

def vt_search(sha1_hash):
    url = "https://www.virustotal.com/api/v3/files/{}".format(sha1_hash)
    headers = {
        "Accept": "application/json",
        "x-apikey": x_apikey
    }
    response = requests.request("GET", url, headers=headers)
    parse_data = dict()  # return an empty dict if the response cannot be parsed
    try:
        parse_data = parse_vt_report(response.json())
    except Exception as e:
        print('error: %s, reason: %s' % (sha1_hash, str(e)))
    return parse_data

def main():
    file_path = sys.argv[1]
    vhash_dict = dict()
    file_hashs = read_hash(file_path)
    for file_hash in file_hashs:
        parse_data = vt_search(file_hash)
        print('%s,%s' % (file_hash, json.dumps(parse_data)))
        if parse_data:
            vhash_dict[file_hash] = parse_data
        else:
            break  # stop on the first failed query
        sleep(1)  # stay under the API rate limit
    with open('vhash_test.json', 'w') as f:
        f.write(json.dumps(vhash_dict, indent=4))

if __name__ == '__main__':
    main()