[Share] GPUs + Information Security issue papers.

rockinuk

2009-11-15 21:51


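At the request of forum owner Kanxue, who hopes the cryptography group can produce some work (or products),
I have gathered every paper I could find in IEEE/IEL on GPUs applied to encryption and decryption.
(I could not find any papers on this topic in the SDOS/SDOL databases.)
Please take a look.
Let us first understand which areas of cryptography GPUs can be applied to, and how feasible that is.
Thanks.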

Coprocessor Computing with FPGA and GPU.pdf (249.9 KB)
High-Speed Private Information Retrieval Computation on GPU.pdf (736.5 KB)
Implementation of Advanced Encryption Standard for encryption and decryption of images and text .pdf (729.4 KB)
Implementations of hardware acceleration for MD4-family algorithms based on GPU.pdf (230.5 KB)
Voice Command Recognition with Dynamic Time Warping (DTW) using Graphics Processing Units (GPU) .pdf (694.1 KB)
Efficient implementation for MD5-RC4 encryption using GPU with CUDA.rar (668.6 KB)
CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography.rar (1.33 MB)




Latest replies (33)

Floor 2 | kanxue | 2009-11-15 21:52
Thanks for the hard work, rockinuk.
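
Floor 3 | Guest | 2009-11-16 23:16
At R's request, and drawing on the papers provided above, here is roughly how GPUs are put to use:
1. The GPU (Graphics Processing Unit) has evolved from a dedicated graphics processor into a high-performance, flexible programmable unit with a much wider range of applications.
2. Modern GPU architecture is built on general-purpose stream processors. With a dedicated API, stream processors can be assigned to different tasks like building blocks, and are no longer bound to the classic graphics pipeline.
3. General APIs such as OpenGL and DirectX let developers access the data exchanged between CPU and GPU and control the rendering process. Vendor-specific APIs give developers broader capabilities. These include:
   (1) NVIDIA: Compute Unified Device Architecture (CUDA) http://www.nvidia.cn/object/cuda_home_cn.html
   (2) AMD (ATI): Close To the Metal (CTM) http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx
   (3) Microsoft: Microsoft Accelerator http://research.microsoft.com/en-us/projects/Accelerator/
4. To exploit parallelism, the computation must be divisible into blocks, and each block into threads. Each thread runs one atomic piece of work (the kernel, loosely translated), and its output is one part of the overall result.
The common MD4/MD5, RC4 and AES algorithms meet condition 4: each consists of a number of rounds whose internal operations are atomic and do not interfere with one another, so a GPU can speed them up. The papers above all reach this conclusion.
Because the three APIs are tied to specific hardware and are mutually incompatible, the choice of API comes down to whatever hardware the developer already has. So I think the discussion should focus on how to split the key computation of a given application into blocks. Only then does the GPU's parallelism really pay off.
Well, I have tossed out my brick; now throw your jade at me.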
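
Point 4 above (split the work into blocks, and each block into threads) can be sketched in CUDA as follows. This is only a minimal sketch under stated assumptions: the input is an array of independent 32-bit words, and `toy_mix` is a hypothetical placeholder for "one atomic piece of work", not a real cipher. The names `toy_mix` and `mix_kernel` are made up for illustration.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one independent unit of work (NOT a real cipher):
// XOR a word with a key word, then rotate left by 5 bits.
__host__ __device__ inline uint32_t toy_mix(uint32_t w, uint32_t key) {
    uint32_t x = w ^ key;
    return (x << 5) | (x >> 27);
}

// Each thread produces exactly one output word; the blocks tile the array.
__global__ void mix_kernel(const uint32_t* in, uint32_t* out, uint32_t key, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) out[i] = toy_mix(in[i], key);        // guard the partial last block
}

int main() {
    const int n = 1 << 20;
    const uint32_t key = 0x9E3779B9u;  // arbitrary constant chosen for the demo
    const size_t bytes = n * sizeof(uint32_t);
    std::vector<uint32_t> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = (uint32_t)i;

    uint32_t *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void**)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                           // threads per block
    const int blocks = (n + threads - 1) / threads;    // blocks per grid
    mix_kernel<<<blocks, threads>>>(d_in, d_out, key, n);
    if (cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "kernel or copy failed\n");
        return 1;
    }

    // Compare every GPU result with the same function evaluated on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != toy_mix(h_in[i], key)) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

The sketch works because no word depends on any other word. Where an algorithm chains its steps, the independence usually has to be found elsewhere, for example across separate messages or candidate keys.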

Floor 4 | qsyqsy | 2009-11-17 18:51
Quoting: "Well, I have tossed out my brick; now throw your jade at me."
Better broken jade than an intact brick. That said, the brick was thrown quite well.

Floor 5 | deryope | 2009-11-18 12:57
Learning from this. I hear some decryption software uses NVIDIA's GPU libraries to accelerate its computation, but making that practical presumably requires a deep grounding in hardware programming.
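
Floor 6 | Guest | 2009-11-18 20:48
Once the vendor provides an API (CUDA/CTM/MA), developers can build practical software without going deep into hardware programming. The key is how to split the algorithm so that it makes full use of the GPU's parallelism. If you merely use a single SP (stream processor) serially, it will never be fast.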

Floor 7 | deryope | 2009-11-20 09:31
Message received. I am very interested in this and hope to contribute to the coding. I will go and collect some programming material first.
My machine really is out of date: only desktop cards from the NVIDIA GeForce 8400 upward support CUDA. Here is the list of supported products: http://www.nvidia.cn/object/cuda_learn_products_cn.html
Microsoft's solution does run on DX9-capable cards, but it has to be written in C# or .NET... quite a bind.
Found a good place, GPGPU Stream Processing (CUDA\OpenCL\DirectCompute): http://www.opengpu.org/bbs/forumdisplay.php?fid=6

Floor 8 | 芳草碧连 | 2009-11-20 09:43
This seems to be completely beyond me.

Floor 9 | deryope | 2009-11-20 11:02
Great! CUDA really is a good thing: it is completely transparent to the programmer and quite easy to pick up. A quick skim of the Programming Guide is enough to understand how CUDA drives computation on the GPU, for example this vector addition example (NVIDIA CUDA Programming Guide 2.3, p. 18):

    // Device code
    __global__ void VecAdd(const float* A, const float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code
    int main()
    {
        int N = ...;
        size_t size = N * sizeof(float);

        // Allocate input vectors h_A and h_B and output vector h_C in host memory
        float* h_A = (float*)malloc(size);
        float* h_B = (float*)malloc(size);
        float* h_C = (float*)malloc(size);

        // Allocate vectors in device memory
        float* d_A; cudaMalloc((void**)&d_A, size);
        float* d_B; cudaMalloc((void**)&d_B, size);
        float* d_C; cudaMalloc((void**)&d_C, size);

        // Copy vectors from host memory to device memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

        // Copy result from device memory to host memory
        // h_C contains the result in host memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

        // Free host memory
        free(h_A);
        free(h_B);
        free(h_C);
    }

This code uses the cuda* functions to work with device memory, then invokes VecAdd, the function to be run on the GPU, as VecAdd<<<xx, xxx>>>(d_A, d_B, d_C, N); (<<<...>>> is CUDA's own launch syntax and marks the function as executed by the GPU), passing in d_A, d_B and d_C, which live in device memory. When the computation finishes, d_C in device memory is copied back to h_C in host memory.
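
The guide's listing leaves out input initialization and error handling. The sketch below is one way a complete run might look, under stated assumptions: `N = 1000` is an arbitrary size chosen so that the last block is only partly used, and `CUDA_CHECK` and `blocks_for` are hypothetical helpers, not part of the CUDA API. For N = 1000 and 256 threads per block the ceiling division gives 4 blocks (1024 threads), and the `if (i < N)` guard keeps the 24 surplus threads idle.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Ceiling division: number of blocks needed to cover n elements.
inline int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];  // guard: the last block may overshoot N
}

int main() {
    const int N = 1000;  // assumed size for the demo
    const size_t size = N * sizeof(float);
    std::vector<float> h_A(N), h_B(N), h_C(N);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * (float)i; }

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d_A, size));
    CUDA_CHECK(cudaMalloc((void**)&d_B, size));
    CUDA_CHECK(cudaMalloc((void**)&d_C, size));
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = blocks_for(N, threads);
    VecAdd<<<blocks, threads>>>(d_A, d_B, d_C, N);
    CUDA_CHECK(cudaGetLastError());       // reports launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while running
    CUDA_CHECK(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost));

    // The sums are small integers, so float addition is exact here.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != h_A[i] + h_B[i]) ++mismatches;
    printf("blocks=%d mismatches=%d\n", blocks, mismatches);

    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    return mismatches != 0;
}
```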

Floor 10 | jackozoo | 2009-11-20 12:58
C = A + B; how is the carry from this addition handled?
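
Floor 11 | deryope | 2009-11-20 14:12
It is demo code from the programming guide, and it probably never considered that... In fact, notice that h_A and h_B are never even given initial values. Still, the structure alone already shows how convenient CUDA is. Adding matrices this way is certainly faster than looping N*N times with for.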
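
On the carry question: the listing adds `float` values, so no carry arises. Carries matter for the multi-word integers that cryptographic arithmetic uses. The sketch below shows one possible approach, under stated assumptions: numbers are 128 bits wide, stored as four 32-bit limbs in little-endian order, and each thread adds one independent pair while propagating the carry serially inside the thread. `U128`, `add_u128` and `add_kernel` are hypothetical names.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// 128-bit unsigned integer as four 32-bit limbs, least significant limb first.
struct U128 { uint32_t limb[4]; };

// out = a + b (mod 2^128); returns the carry out of the top limb (0 or 1).
__host__ __device__ inline uint32_t add_u128(const U128& a, const U128& b, U128& out) {
    uint64_t carry = 0;
    for (int k = 0; k < 4; ++k) {
        uint64_t s = (uint64_t)a.limb[k] + b.limb[k] + carry;  // at most 33 bits
        out.limb[k] = (uint32_t)s;                             // low 32 bits
        carry = s >> 32;                                       // carry into next limb
    }
    return (uint32_t)carry;
}

// One thread per pair: the pairs are independent, the limbs within a pair are not.
__global__ void add_kernel(const U128* a, const U128* b, U128* c, uint32_t* carry, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) carry[i] = add_u128(a[i], b[i], c[i]);
}

int main() {
    const int n = 2;
    U128 h_a[n] = {{{0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u}},
                   {{0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu}}};
    U128 h_b[n] = {{{1u, 0u, 0u, 0u}}, {{1u, 0u, 0u, 0u}}};
    U128 h_c[n];
    uint32_t h_carry[n];

    U128 *d_a, *d_b, *d_c;
    uint32_t* d_carry;
    cudaMalloc((void**)&d_a, sizeof(h_a));
    cudaMalloc((void**)&d_b, sizeof(h_b));
    cudaMalloc((void**)&d_c, sizeof(h_c));
    cudaMalloc((void**)&d_carry, sizeof(h_carry));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    add_kernel<<<1, 32>>>(d_a, d_b, d_c, d_carry, n);
    cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(h_carry, d_carry, sizeof(h_carry), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    for (int i = 0; i < n; ++i) {
        U128 ref;
        uint32_t ref_carry = add_u128(h_a[i], h_b[i], ref);  // host reference
        assert(ref_carry == h_carry[i]);
        for (int k = 0; k < 4; ++k) assert(ref.limb[k] == h_c[i].limb[k]);
        printf("sum %d = %08x %08x %08x %08x, carry out %u\n", i,
               h_c[i].limb[3], h_c[i].limb[2], h_c[i].limb[1], h_c[i].limb[0], h_carry[i]);
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_carry);
    return 0;
}
```

Keeping the carry chain inside one thread avoids any communication between threads. Spreading a single very wide addition over many threads would need a separate carry-resolution scheme, which this sketch does not attempt.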

Floor 12 | deryope | 2009-11-20 15:00
I found several projects that use CUDA to accelerate MD5 cracking. Anyone with suitable hardware can try building one:
Nightingale is a GPU password software designed to encrypt or bruteforce password using GPU http://code.google.com/p/nightingale/
CUDA MD5 http://code.google.com/p/cudamd5/
HashClash http://code.google.com/p/hashclash/
NVIDIA CUDA campus programming contest downloads http://cuda.csdn.net/Contest/pro/nvidia_show.aspx
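
A hash search on a GPU can give each thread one candidate: the thread turns its index into a string, hashes it, and compares the result with the target. The sketch below illustrates only that indexing idea and is not taken from any of the projects listed. To stay short it assumes fixed-length lowercase candidates and uses 32-bit FNV-1a as a non-cryptographic stand-in for MD5. `PW_LEN`, `SPACE`, `index_to_candidate` and `search_kernel` are hypothetical names.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

#define PW_LEN 4                          // assumed fixed candidate length
#define SPACE (26u * 26u * 26u * 26u)     // number of lowercase candidates

// FNV-1a 32-bit: a NON-cryptographic stand-in for MD5, used only for brevity.
__host__ __device__ inline uint32_t fnv1a(const char* s, int len) {
    uint32_t h = 2166136261u;             // FNV offset basis
    for (int i = 0; i < len; ++i) { h ^= (uint8_t)s[i]; h *= 16777619u; }
    return h;
}

// Map a candidate index to a lowercase string, least significant letter first.
__host__ __device__ inline void index_to_candidate(uint32_t idx, char* out) {
    for (int k = 0; k < PW_LEN; ++k) { out[k] = (char)('a' + idx % 26u); idx /= 26u; }
}

// One thread per candidate; a matching index is recorded with atomicMin.
__global__ void search_kernel(uint32_t target, unsigned int* found) {
    uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= SPACE) return;
    char cand[PW_LEN];
    index_to_candidate(idx, cand);
    if (fnv1a(cand, PW_LEN) == target) atomicMin(found, idx);  // keep the smallest hit
}

int main() {
    const char secret[PW_LEN + 1] = "cuda";      // demo input, known in advance
    const uint32_t target = fnv1a(secret, PW_LEN);

    unsigned int h_found = 0xFFFFFFFFu;          // sentinel: nothing found yet
    unsigned int* d_found = nullptr;
    cudaMalloc((void**)&d_found, sizeof(unsigned int));
    cudaMemcpy(d_found, &h_found, sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (int)((SPACE + threads - 1) / threads);
    search_kernel<<<blocks, threads>>>(target, d_found);
    cudaError_t err = cudaMemcpy(&h_found, d_found, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_found);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    if (h_found == 0xFFFFFFFFu) {
        printf("no candidate found\n");
        return 1;
    }
    char out[PW_LEN + 1] = {0};
    index_to_candidate(h_found, out);
    // A 32-bit hash can collide, so check the hash rather than the string.
    assert(fnv1a(out, PW_LEN) == target);
    printf("candidate \"%s\" hashes to the target value\n", out);
    return 0;
}
```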
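
Floor 13 | qsyqsy | 2009-11-20 18:31
After reading up on the GPUs project and thinking it over carefully, I believe that beyond arab's brick we still need to organize things and analyze the candidate algorithms.
Quoting: "each consists of a number of rounds whose internal operations are atomic and do not interfere with one another"
I have not fully digested the papers yet. I will come back with an analysis later...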
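
Floor 14 | 饮水思源 | 2009-11-20 18:40
I do not understand this yet, but I am willing to contribute to the cryptography group. If you need me for anything, just say so.
Personally, I really do want to study this... but it is hard without a foundation.
I have only just started university and have learned only the basics.
I will keep working at it!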

Floor 15 | qsyqsy | 2009-11-21 09:40
I feel rather unqualified now. Writing the code will probably be hard for me, though I do understand the algorithms a little.
PS: R, roughly how long is the GPUs project expected to take?
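
Floor 16 | rockinuk | 2009-11-21 14:21
The aim is to produce a piece of software.
The short-term goal is a small program.
The medium-term goal is software that can factor 512-bit RSA.
The long-term goal will be planned once we are able to reach the medium-term goal.
The above is a list of milestones only, not a timetable.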
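
Floor 17 | cykerr | 2009-11-21 23:14
Message received. I also saw the material rockinuk put together so promptly, and I was touched. I have visited many forums, met many moderators, and been a moderator myself, and I have never seen one this responsible and hardworking; I doubt I could do the same. So let me offer my sincere thanks here. The papers rockinuk collected already cover the concept, role and principles of the GPU, so I will not repeat them. Applying the GPU to cryptography has a single purpose: exploiting the GPU's parallel processing to increase speed. As arab summarized, the common algorithms all fit the GPU model. Since I am interested in digital signatures, let me toss out one more brick: implementing SHA-1 on the GPU.

1. Implementation
Step 1: Append padding bits (a single 1 followed by 0s) to the end of the message so that its length in bits satisfies length ≡ 448 (mod 512).
Step 2: Append a 64-bit block holding the original message length as a 64-bit unsigned integer (most significant byte first).
Step 3: Initialize the message digest buffer. A 160-bit buffer holds the intermediate and final results of the hash function. It can be viewed as five 32-bit registers (A, B, C, D, E).
Step 4: Process the message in 512-bit blocks, 80 steps per block.
Step 5: Output the final 160-bit message digest.
Step 4 is the key to implementing the digital fingerprint. Here Yq is the 512-bit input block, CVq is the result produced by the previous step, and CV0 is the initial value fixed by the algorithm. f1, f2, f3 and f4 have a similar structure, but each round uses a different primitive logical function. Taking the first 20 steps as an example, the basic processing is shown in Figure 3 [attach]图2图3.JPG[/attach], where Wt is a 32-bit word derived from the current 512-bit input block, Kt is an additive constant, and Si denotes a circular left shift by i bits. The CPU implementation is shown in Figure 4 [attach]图4图5.JPG[/attach]: it computes A, B, C, D and E sequentially. Using the branching constructs of CUDA 2.0, the program flow can be changed to the one shown in Figure 5 and run on the GPU, so that the five registers are computed in parallel and the speed increases.

2. Results
The implementation was run on three sets of character data on both the CPU and the GPU, with identical results. The timing comparison is in Table 1 [attach]表1.JPG[/attach]. The test environment was:
  Dual-core CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86 GHz
  GPU: NVIDIA GeForce 8800 GT
  Set 1: input abc
  Digest: A9993E36 4706816A BA3E2571 7850C26C 9CD0D89D
  Set 2: input abcd
  Digest: 81FE8BFE 87576C3E CB22426F 8E578473 82917ACF
  Set 3: input abcde
  Digest: 03DE6C57 0BFE24BF C328CCD7 CA46876E ADAF4334
Comparing the running times on the CPU and the GPU, the gain in efficiency is not large. When the five registers A, B, C, D and E are computed, the data for register A takes longer, and the other registers have to wait once they have finished.

3. Improvements
The method above does make the digital fingerprint more efficient on the GPU, but not dramatically. In SHA-1 the time needed to process the data for A, B, C, D and E differs greatly, and every step needs the result of the previous step as input, so data parallelism is poorly exploited. Further analysis shows that computing register A requires the content of A from the previous step, so it has to wait. At that point, however, the contents of B, C, D and E are already available (their computation is much lighter than A's), so B, C, D and E can be processed before they are added to the content of A, which improves efficiency, as shown in Figure 6 [attach]图6.JPG[/attach]. The actual implementation runs six GPU processors in parallel. Because each processor runs a different program, the branching constructs of CUDA 2.0 are used before the run starts to decide which program each processor should execute. Several processors also need the same data, so the contents of A, B, C, D and E are stored in global variables that every processor can read. To keep the processors coordinated, some extra variables are added to control when each processor starts running the SHA-1 program. The task assignment and timing of each processor are shown in Table 2 [attach]表2.JPG[/attach]. After this improvement the computation became faster still; the running times are in Table 3 [attach]表3.JPG[/attach].

4. Conclusion
To make full use of the GPU's parallel processing and speed up SHA-1, the algorithm was parallelized, giving a fast SHA-1 implementation. The experimental results show that SHA-1 ran nearly three times faster on the GPU than on the CPU. Haha, a huge advantage. SHA-1 itself still has parallelism left to exploit, so there is room for further speedup. Looking forward to implementations of other algorithms!

Attachments: 表1.JPG, 表2.JPG, 表3.JPG, 图2图3.JPG, 图4图5.JPG, 图6.JPG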
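
The five steps above can be sketched as a small CUDA program. Note the assumptions: this sketch parallelizes across independent messages (one thread per message), which is a different decomposition from the per-register scheme the post describes. It also handles only messages of at most 55 bytes, so that each fits in a single padded 512-bit block. `SLOT`, `sha1_short` and `sha1_kernel` are hypothetical names, while the constants and round functions are those of the SHA-1 standard.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define SLOT 56  // assumed fixed-size slot per message (at most 55 bytes used)

__host__ __device__ inline uint32_t rotl32(uint32_t x, int s) {
    return (x << s) | (x >> (32 - s));
}

// SHA-1 of one short message (len <= 55), following steps 1 to 5.
__host__ __device__ inline void sha1_short(const uint8_t* msg, int len, uint32_t* digest) {
    uint8_t block[64];
    for (int i = 0; i < 64; ++i) block[i] = 0;
    for (int i = 0; i < len; ++i) block[i] = msg[i];
    block[len] = 0x80;                                  // step 1: a 1 bit, then 0s
    uint64_t bits = (uint64_t)len * 8;                  // step 2: big-endian length
    for (int i = 0; i < 8; ++i) block[63 - i] = (uint8_t)(bits >> (8 * i));

    uint32_t w[80];                                     // message schedule Wt
    for (int t = 0; t < 16; ++t)
        w[t] = ((uint32_t)block[4 * t] << 24) | ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] << 8) | (uint32_t)block[4 * t + 3];
    for (int t = 16; t < 80; ++t)
        w[t] = rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);

    const uint32_t h0 = 0x67452301u, h1 = 0xEFCDAB89u, h2 = 0x98BADCFEu,
                   h3 = 0x10325476u, h4 = 0xC3D2E1F0u;  // step 3: initial value CV0
    uint32_t a = h0, b = h1, c = h2, d = h3, e = h4;
    for (int t = 0; t < 80; ++t) {                      // step 4: 80 steps
        uint32_t f, k;
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        uint32_t tmp = rotl32(a, 5) + f + e + k + w[t];
        e = d; d = c; c = rotl32(b, 30); b = a; a = tmp;
    }
    digest[0] = h0 + a; digest[1] = h1 + b; digest[2] = h2 + c;  // step 5
    digest[3] = h3 + d; digest[4] = h4 + e;
}

// One thread per message: the messages are independent of one another.
__global__ void sha1_kernel(const uint8_t* msgs, const int* lens, uint32_t* digests, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) sha1_short(msgs + i * SLOT, lens[i], digests + i * 5);
}

int main() {
    const int n = 3;
    const char* inputs[n] = {"abc", "abcd", "abcde"};
    uint8_t h_msgs[n * SLOT] = {0};
    int h_lens[n];
    uint32_t h_dig[n * 5];
    for (int i = 0; i < n; ++i) {
        h_lens[i] = (int)strlen(inputs[i]);
        memcpy(h_msgs + i * SLOT, inputs[i], h_lens[i]);
    }
    uint8_t* d_msgs; int* d_lens; uint32_t* d_dig;
    cudaMalloc((void**)&d_msgs, sizeof(h_msgs));
    cudaMalloc((void**)&d_lens, sizeof(h_lens));
    cudaMalloc((void**)&d_dig, sizeof(h_dig));
    cudaMemcpy(d_msgs, h_msgs, sizeof(h_msgs), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lens, h_lens, sizeof(h_lens), cudaMemcpyHostToDevice);
    sha1_kernel<<<1, 32>>>(d_msgs, d_lens, d_dig, n);
    cudaError_t err = cudaMemcpy(h_dig, d_dig, sizeof(h_dig), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }
    for (int i = 0; i < n; ++i) {
        uint32_t ref[5];
        sha1_short(h_msgs + i * SLOT, h_lens[i], ref);  // host reference
        for (int k = 0; k < 5; ++k) assert(ref[k] == h_dig[i * 5 + k]);
        printf("%-5s %08X %08X %08X %08X %08X\n", inputs[i], h_dig[i * 5],
               h_dig[i * 5 + 1], h_dig[i * 5 + 2], h_dig[i * 5 + 3], h_dig[i * 5 + 4]);
    }
    cudaFree(d_msgs); cudaFree(d_lens); cudaFree(d_dig);
    return 0;
}
```

Within one message every step depends on the previous one, which is the bottleneck the post reports. Hashing many messages at once avoids that dependency, at the cost of needing many inputs to keep the GPU busy.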

Floor 18 | rockinuk | 2009-11-22 04:00
Thank you, cykerr, for such useful material.
Today I happened upon a master's thesis from National Tsing Hua University in Taiwan. The thesis has not been made public, so the full text cannot be obtained. I will try to get hold of it and share it with everyone.
Note: National Taiwan University, National Chiao Tung University, National Tsing Hua University and National Cheng Kung University are among Taiwan's top five universities in engineering.
=====
http://140.113.39.130/cgi-bin/gs/hugsweb.cgi?o=dnthucdr&i=sGH000936350.id
Title: 基於繪圖處理器之封包檢測技術研究
Other title: A Graphics Processor-based Packet Inspection Scheme
Author: 徐獻文 (Hsien-Wen Hsu)
Description: Master's thesis, Department of Computer Science, National Tsing Hua University, GH000936350
Date: 2006
Keywords: graphics card, string matching, network security, GPU
Abstract: Network security has recently become increasingly important. Hence, related technologies like deep packet inspection are being extensively researched and widely implemented. This study proposes a novel scheme and architecture for packet content inspection using graphics processing units (GPUs). The proposed method takes a common component of personal computers, namely the GPU, as the accelerator for pattern matching, which is the critical problem in deep packet inspection. With the native parallel computing power of GPUs, the multiple fragment shaders are used as parallel pattern matching engines, and together with the GPU's parallel pipelines they maximize pattern matching performance. These features are very attractive for content signature recognition, which is becoming increasingly popular. Since most network traffic is aggregated from many sessions, the proposed scheme and architecture are particularly designed for processing multi-session network traffic simultaneously. The well-known automaton structure is converted to a data structure that the GPU can execute, and various graphics processing languages are used to handle the data flow, which manages the state transition procedure, and the control flow, which is in charge of system I/O. The proposed scheme and architecture work well with many automaton-based pattern matching algorithms. This study focuses on the most famous one, the Aho and Corasick (AC) algorithm. Besides presenting the scheme in theory and implementing it on a commodity GPU in practice, this study analyzes the performance of the proposed approach through evaluations, such as showing the memory layout of the automaton in the GPU once it has been converted to textures, determining the memory size of all necessary data structures, and analyzing the throughput. The experimental results indicate that the proposed pattern matching approach can achieve 6.4 Gbps throughput on a graphics card with a market price of less than US$400, an improvement on other approaches that use hardware accelerators. This study also shows that the proposed scheme can put the GPU to use for pattern matching without any new hardware, since the GPU installed in common PCs is idle most of the time. The processing time of the proposed implementation is about 2/3 less than that of other software-based pattern matching implementations. The proposed concept, Network Processing on Graphics Processing Units, is a novel research field in network processing.
Permanent link: http://nthur.lib.nthu.edu.tw/handle/987654321/35871
Source link: http://thesis.nthu.edu.tw/cgi-bin/gs/hugsweb.cgi?o=dnthucdr&i=sGH000936350.id
Category: [Department of Computer Science] Theses and dissertations
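
The abstract's core idea is to turn an automaton into a table the GPU can walk, with many sessions scanned in parallel. The sketch below illustrates that idea under stated assumptions and is not the thesis's method: it uses a CUDA kernel rather than fragment shaders and textures, builds a small dense Aho-Corasick DFA on the host, and gives each thread one toy "packet" held in a fixed 16-byte slot. `build_dfa`, `scan` and `scan_kernel` are hypothetical names.

```cuda
#include <cassert>
#include <cstdio>
#include <cstring>
#include <queue>
#include <string>
#include <vector>
#include <cuda_runtime.h>

#define ALPHA 256  // one table column per byte value

// Build a dense Aho-Corasick DFA on the host.
// next[s * ALPHA + b] is the state after reading byte b in state s;
// hits[s] is the number of patterns that end when state s is reached.
void build_dfa(const std::vector<std::string>& pats,
               std::vector<int>& next, std::vector<int>& hits) {
    next.assign(ALPHA, -1);
    hits.assign(1, 0);
    for (const std::string& p : pats) {           // 1) insert patterns into a trie
        int s = 0;
        for (unsigned char b : p) {
            if (next[s * ALPHA + b] < 0) {
                next[s * ALPHA + b] = (int)hits.size();
                hits.push_back(0);
                next.resize(next.size() + ALPHA, -1);
            }
            s = next[s * ALPHA + b];
        }
        ++hits[s];
    }
    std::vector<int> fail(hits.size(), 0);
    std::queue<int> q;
    for (int b = 0; b < ALPHA; ++b) {             // 2) depth-1 states fail to the root
        int t = next[b];
        if (t < 0) next[b] = 0; else q.push(t);
    }
    while (!q.empty()) {                          // 3) BFS: failure links, fill gaps
        int s = q.front(); q.pop();
        hits[s] += hits[fail[s]];
        for (int b = 0; b < ALPHA; ++b) {
            int& t = next[s * ALPHA + b];
            if (t < 0) t = next[fail[s] * ALPHA + b];
            else { fail[t] = next[fail[s] * ALPHA + b]; q.push(t); }
        }
    }
}

// Walk the table over one buffer and count pattern occurrences.
__host__ __device__ inline int scan(const int* next, const int* hits,
                                    const unsigned char* data, int len) {
    int s = 0, count = 0;
    for (int k = 0; k < len; ++k) { s = next[s * ALPHA + data[k]]; count += hits[s]; }
    return count;
}

// One thread per packet: the packets are scanned independently.
__global__ void scan_kernel(const int* next, const int* hits, const unsigned char* data,
                            const int* lens, int stride, int* counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = scan(next, hits, data + (size_t)i * stride, lens[i]);
}

int main() {
    std::vector<int> next, hits;
    build_dfa({"he", "she", "his", "hers"}, next, hits);
    const int n = 3, stride = 16;                 // three toy packets, 16-byte slots
    const char* pkts[n] = {"ushers", "this is his", "nothing"};
    unsigned char h_data[n * stride] = {0};
    int h_lens[n], h_counts[n];
    for (int i = 0; i < n; ++i) {
        h_lens[i] = (int)strlen(pkts[i]);
        memcpy(h_data + i * stride, pkts[i], h_lens[i]);
    }
    int *d_next, *d_hits, *d_lens, *d_counts; unsigned char* d_data;
    cudaMalloc((void**)&d_next, next.size() * sizeof(int));
    cudaMalloc((void**)&d_hits, hits.size() * sizeof(int));
    cudaMalloc((void**)&d_data, sizeof(h_data));
    cudaMalloc((void**)&d_lens, sizeof(h_lens));
    cudaMalloc((void**)&d_counts, sizeof(h_counts));
    cudaMemcpy(d_next, next.data(), next.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_hits, hits.data(), hits.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lens, h_lens, sizeof(h_lens), cudaMemcpyHostToDevice);
    scan_kernel<<<1, 32>>>(d_next, d_hits, d_data, d_lens, stride, d_counts, n);
    cudaError_t err = cudaMemcpy(h_counts, d_counts, sizeof(h_counts), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }
    for (int i = 0; i < n; ++i) {                 // GPU result must match the host walk
        assert(h_counts[i] == scan(next.data(), hits.data(), h_data + i * stride, h_lens[i]));
        printf("\"%s\": %d matches\n", pkts[i], h_counts[i]);
    }
    cudaFree(d_next); cudaFree(d_hits); cudaFree(d_data); cudaFree(d_lens); cudaFree(d_counts);
    return 0;
}
```

A dense table costs 256 entries per state but makes each step a single lookup with no branching on failure links, which keeps the threads in lockstep.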

Floor 19 | cykerr | 2009-11-22 10:27
Yes, the nature of the GPU lets it perform very well in many fields, and many people are now devoting themselves to this worthwhile work.

Floor 20 | rockinuk | 2009-11-22 10:52
The literature on applying GPUs to encryption and decryption is not yet extensive, so I reckon there is still room to do something in this area.
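
Floor 21 | cykerr | 2009-11-22 10:57
Not really; many people are researching this now. Compared with a CPU, a GPU devotes more of its design to ALUs for data computation, which especially suits work that can be described as data-parallel. That is certainly good news for speeding up common algorithms.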
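
Floor 22 | rockinuk | 2009-11-22 11:06
What I mean is that I have not heard of cryptographers in the research departments of MIT, Harvard, Cambridge, Oxford, Microsoft or IBM working on GPUs. That is why the papers and reference material I could provide after searching the databases are limited.
There is still something of a gap between academia and industry. The encouraging part is that at least a few papers exist to consult, which suggests this kind of research is not yet widespread and we need not worry that the topic has been done to death.
Digital signatures, for example, were worked to exhaustion by academia about ten years ago, and few people can still produce anything new there.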
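
Floor 23 | cykerr | 2009-11-22 11:31
Applying GPUs to cryptography is a trend. A few years ago, to make GPUs easier to use for general-purpose computing, NVIDIA introduced the Compute Unified Device Architecture (CUDA), a hardware and software architecture that issues and manages computations on the GPU as a parallel computing device, without having to map them to a graphics API.
Haha, it is normal for papers to be few. I can tell you do not play games. A little GPU history shows that this only got started in the last few years, which is also why CUDA only works on the GeForce 8800, the Quadro FX 5600/4600 series and higher-end cards. Haha, once everyone's machine has that configuration, it will be hard not to use it. There are plenty of academic articles in this area; look for ones on graphics processing. Personally I think the prospects for GPUs in cryptography are fairly good.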
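
Floor 24 | cykerr | 2009-11-22 11:45
I have gone through historical material on GPUs and have reservations about NVIDIA's criteria for dividing GPUs into generations. Counting from the SGI era, GPU development roughly went through: discrete integrated circuits (SGI graphics cards had up to 3 boards) => single-chip integrated circuits (the masterpiece of the Voodoo3 era) => multi-GPU (taken to the extreme on the Voodoo6) => T&L (GeForce 256) => SM 1 to 3 (DX8/DX9-class programmable units) => SM 4.0 (unified architecture, DX10 class; fixed function reduced to a second-class citizen, that is, to an interface concept) => As for the future, the SGI-like graphics pipeline cannot rule forever and should step aside when the time comes. Future architectures will surely fight a head-on battle with CPU coprocessors. Who will win? I think nobody will; only consumers can win. So learning from each other's strengths is the way for both sides to win.
PS: I left out the mainframe era of graphics hardware, though it probably should be counted too.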

Floor 25 | rockinuk | 2009-11-22 16:03
Yes, cykerr, I do not play games (PC online games and so on).
I have reservations about the claim that applying GPUs to cryptography is a trend.
But I will keep working in this direction (the GPUs project).


最新回复 (33) 1 2 ▶
kanxue 雪币： 47147 活跃值： (20460) 能力值： (RANK：350 ) 在线值：发帖 2375 回帖 17045 粉丝 541 关注私信	kanxue 8 2 楼 rockinuk辛苦了 2009-11-15 21:52 0
游客雪币：能力值： (RANK： ) 在线值：发帖 0 回帖 0 粉丝关注私信	游客 3 楼应R大的要求, 综合以上提供的这些论文的内容, 应用GPU的大致原理如下: 一. GPU(Graphics Processing Units, 图形处理器)已经从专门图形处理器进化到高性能, 灵活的可编程单元, 应用范围更加广泛. 二. 现代GPU的体系架构是基于通用流处理器(general purpose stream processors)的, 利用专门的API可以象搭积木一样任意指定不同的通用流处理器处理不同的任务, 而不再受限于经典的绘图流水线(classic graphics pipeline) 三. 通用的API, 如OpenGL和DirectX, 可以让开发者存取CPU和GPU之间的交换数据, 控制render进程; 厂家独有的API则为开发者提供了更广泛的应用能力. 这些API包括: 1. NVidia: Compute Unified Device Architecture (CUDA) http://www.nvidia.cn/object/cuda_home_cn.html 2. AMD(ATI): Close To the Metal (CTM) http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx 3. Microsoft: Microsoft Accelerator http://research.microsoft.com/en-us/projects/Accelerator/ 四. 如果需要使用到并行的特性, 计算方法必须能分成若干个块(block), 而每个块又能分解成为线程(thread), 每个线程执行一段原子操作(kernel, 意译), 其产出物是计算结果的一部分. 常见的MD4/5, RC4, AES都满足第四个条件, 即都由若干轮组成, 每轮内部都是原子操作, 相互不干扰, 所以使用GPU可以提高运算速度, 此结论在上述论文中均有提及. 由于现有的三大API和硬件相关, 而且互不兼容, 在选择API上只能根据开发者自己现有的硬件来决定, 所以我觉得讨论应该集中在如何将特定应用中的关键计算分解成多个块上, 这样才能真正发挥GPU并行处理的优势. 嗯, 砖头抛完了, 大家拿玉来砸我吧. 2009-11-16 23:16 编辑删除 0
qsyqsy 雪币： 234 活跃值： (10) 能力值： ( LV2，RANK：10 ) 在线值：发帖 28 回帖 685 粉丝 0 关注私信	qsyqsy 4 楼嗯, 砖头抛完了, 大家拿玉来砸我吧宁为玉碎，不为砖全不过话说砖的确丢得不错 2009-11-17 18:51 0
deryope 雪币： 232 活跃值： (10) 能力值： ( LV4，RANK：50 ) 在线值：发帖 7 回帖 141 粉丝 0 关注私信	deryope 1 5 楼学习下，听说有些解密软件就是通过NV的GPU库加速计算的，不过想必需要涉及很深的硬件编程理论才能实用化。 2009-11-18 12:57 0
游客雪币：能力值： (RANK： ) 在线值：发帖 0 回帖 0 粉丝关注私信	游客 6 楼厂家提供API(CUDA/CTM/MA)之后, 开发者不需要涉及很深的硬件编程都能实用化, 关键是如何拆分算法, 使得它能充分发挥GPU的并行能力. 否则仅仅串行的使用一个SP(stream processors), 怎么也快不到哪里去的. 2009-11-18 20:48 编辑删除 0
deryope 雪币： 232 活跃值： (10) 能力值： ( LV4，RANK：50 ) 在线值：发帖 7 回帖 141 粉丝 0 关注私信	deryope 1 7 楼消息已收到，个人对这个很感兴趣，希望能够在代码编写上出一份力。先去收集些相关的编程资料好了。我机器果然过时了，NV GeForce 8400以上的台式机显卡才支持CUDA。这是支持的产品列表：http://www.nvidia.cn/object/cuda_learn_products_cn.html 微软的方案虽然能让支持DX9的显卡跑起来，不过居然要用C#或.Net写... 真是陷入窘境了找到个不错的地方 GPGPU Stream Processing(CUDA\OpenCL\DirectCompute)：http://www.opengpu.org/bbs/forumdisplay.php?fid=6 2009-11-20 09:31 0
芳草碧连雪币： 290 活跃值： (11) 能力值： ( LV2，RANK：10 ) 在线值：发帖 19 回帖 327 粉丝 0 关注私信	芳草碧连 8 楼貌似这个完全不懂 2009-11-20 09:43 0
deryope 雪币： 232 活跃值： (10) 能力值： ( LV4，RANK：50 ) 在线值：发帖 7 回帖 141 粉丝 0 关注私信	deryope 1 9 楼赞！CUDA果然是个好东西，对编程人员来说完全是透明的，并且相当容易上手。稍微扫了一眼 Programming Guide 就可以明白利用CUDA操作GPU进行运算是怎么回事，例如这个矩阵加法例子（NVIDIA CUDA Programming Guide 2.3: p18）： // Device code __global__ void[B] VecAdd[/B](float* A, float* B, float* C) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < N) C[i] = A[i] + B[i]; } // Host code int main() { int N = ...; size_t size = N * sizeof(float); // Allocate input vectors h_A and h_B in host memory float* h_A = malloc(size); float* h_B = malloc(size); // Allocate vectors in device memory float* d_A; cudaMalloc((void*)&d_A, size); float d_B; cudaMalloc((void*)&d_B, size); float d_C; cudaMalloc((void*)&d_C, size); // Copy vectors from host memory to device memory cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); // Invoke kernel int threadsPerBlock = 256; int blocksPerGrid = (N + threadsPerBlock – 1) / threadsPerBlock; [B]VecAdd[/B]<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C); // Copy result from device memory to host memory // h_C contains the result in host memory cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // Free device memory cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); } 这段代码利用cuda这种函数在设备内存中进行操作，然后将要GPU运算的VecAdd函数通过 VecAdd<<<xx, xxx>>>(d_A, d_B, d_C); 这样的方式调用（<<<..>>>是CUDA自定义的操作，表示该函数由GPU负责），后面传入在设备内存中的 d_A, d_B, d_C。算完后再将设备内存中的 d_C 写回主机内存 h_C 中。 2009-11-20 11:02 0
jackozoo 雪币： 1450 活跃值： (35) 能力值： (RANK：680 ) 在线值：发帖 50 回帖 775 粉丝 3 关注私信	jackozoo 14 10 楼 C = A + B; 这里加的进位怎么处理? 2009-11-20 12:58 0
deryope 雪币： 232 活跃值： (10) 能力值： ( LV4，RANK：50 ) 在线值：发帖 7 回帖 141 粉丝 0 关注私信	deryope 1 11 楼开发手册的演示代码，估计压根没考虑这个... 实际上你看他 h_A, h_B 的初值都没给，不过从结构上已经可以体现CUDA的方便了。矩阵相加这样做肯定比 for (N*N) 次要快 2009-11-20 14:12 0
deryope 雪币： 232 活跃值： (10) 能力值： ( LV4，RANK：50 ) 在线值：发帖 7 回帖 141 粉丝 0 关注私信	deryope 1 12 楼找了几个关于用CUDA加速MD5破解的工程，有条件的可以编译一个试试看： Nightingale is a a GPU password software designed to encrypt or bruteforce password using GPU http://code.google.com/p/nightingale/ CUDA MD5 http://code.google.com/p/cudamd5/ HashClash http://code.google.com/p/hashclash/ NVIDIA CUDA校园程序设计大赛相关下载 http://cuda.csdn.net/Contest/pro/nvidia_show.aspx 2009-11-20 15:00 0
qsyqsy 雪币： 234 活跃值： (10) 能力值： ( LV2，RANK：10 ) 在线值：发帖 28 回帖 685 粉丝 0 关注私信	qsyqsy 13 楼后来看了GPUS的专案，仔细思考了一下，我想，除了arab的砖头，估计还得整理一下，把可能的算法分析一下。由若干轮组成, 每轮内部都是原子操作, 相互不干扰论文还没看透彻，等等在来分析。。。。。。。。。。 2009-11-20 18:31 0
饮水思源雪币： 132 活跃值： (28) 能力值： ( LV2，RANK：10 ) 在线值：发帖 17 回帖 149 粉丝 0 关注私信	饮水思源 14 楼虽然是看不懂，但是我也愿意为密码学小组贡献一份力量。。有需要我的尽管说就我个人而言，我是真想研究这个。。。哎，没有基础真不好办刚刚上大学，学了点肤浅的知识我会继续努力的！ 2009-11-20 18:40 0
qsyqsy 雪币： 234 活跃值： (10) 能力值： ( LV2，RANK：10 ) 在线值：发帖 28 回帖 685 粉丝 0 关注私信	qsyqsy 15 楼现在觉得很不称职，搞代码编写我估计有点困难，算法还有点懂 ps:R大，GPUs专案的计划，大概要搞多少时间？ 2009-11-21 09:40 0
rockinuk 雪币： 2096 活跃值： (100) 能力值： (RANK：420 ) 在线值：发帖 613 回帖 1939 粉丝 7 关注私信	rockinuk 8 16 楼搞出一個軟件出來。近程目標是搞出一個小軟件。中程目標是搞出一個可以分解 RSA 512 bits 的軟件。遠程目標，等有辦法完成中程目標時，再計劃。以上只有進度表，沒有時間表。 2009-11-21 14:21 0
cykerr 雪币： 259 活跃值： (10) 能力值： ( LV2，RANK：10 ) 在线值：发帖 0 回帖 50 粉丝 0 关注私信	cykerr 17 楼信息已收到同时也看到rockinuk第一时间整理的资料，很感动，我去过很多论坛遇到很多版主，自己也当过版主，从来没见过这么负责和勤劳的版主，自问也做不到这样。所以在这里我由衷的向你表示我的感激。关于GPU概念、GPU的作用、GPU的原理rockinuk收集的论文都有相关介绍我就不在赘述了。使用GPU应用于密码学的目的只有一个就是发挥GPU并行处理的优势，从而提高其运算速度。如arab所总结的常见的算法都符合GPU原理。本人因对数签比较感兴趣，在这帮大家再抛一砖头。将SHA-1在GPU实现。 1、言归正传实现开始：步骤1：添加填充位(一个1和若干个0)。在消息的最后添加适当的填充位使得数据位的长度满足长=448mod312。步骤2：添加一个64位块，表示原始消息长度，64位无符号整数(最高有效字节在前)。步骤3：初试化消息摘要的缓冲区。一个160位消息摘要缓冲区用以保存中间和最终敞列函数的结果。它可以表示5个32位的寄存器(A，B，C，D，E)。步骤4：以512位数据块为单位处理消息。共计80步。步骤5：输出最终的160位消息摘要。这过程中，步骤4是数字指纹实现的关键。其中Yq是输入的512位数据，CVq是上二步产生的结果值，CV0是算法确定的初始值。f1、f2、f3，f4有相似的结构，但每个循环使用不同的原始逻辑函数。以前20步为例，其基本的处理过程如图3所示[attach]图2图3.JPG[/attach]，图中Wt是从当前512位输入数据块导出的32位字，Kt一个用于加法的常量，Si表示循环左移i位。在CPU上的实现过程如图4所示[attach] 图4图5.JPG [/attach]，顺序执行A、B、C、D、E的求解。利用CUDA2．0的分支结构编程，可以将程序的流程改为如图5所示，在GPU上执行。将5个寄存器中数据的求解过程并行处理，从而提高速度。 2、运行结果：算法的实现采用了三组字符数据分别在CPU和GPU上运行，得到的结果相同，运算时间的对比见表1[attach] 表1.JPG [/attach] 。其中CPU的运行环境为：　　双核CPU：InteI(R)Core(TM)2 CPU 6300@1.86Ghz 　　Intel(R)Core(TM)2 CPU 6300@1.86Ghz 　　GPU采用NVIDIA GeForce 8800 GT。　　第一组：输入字符为：abc 　　最后消息摘要的值为：A9993E36 4706816A BA3E2571 7850C26C 9CDOD89D 　　第二组：输入字符为：abce 　　最后消息摘要的值为：l FE8BFE 87576C3E CB22426F 8E578473 829 l 7ACF 　　第三组：输入字符为：abcdef 　　最后消息摘要的值为：03DE6C57 OBFE24BF C328CCD7 CA46876E ADAF4334 对比CPU和GPU上算法的运行时间，可以发现效率的提高并不是很大，这是因为在计算A、B、C、D、E五个寄存器中的数据时，A寄存器内的数据需要时间较长，其他寄存器处理完成后需要等待。 3、改进与提高：利用上述方法在GPU实现数字指纹可以提高效率，但是效果不足很明显，这足因为SHA-1算法中A，B、C、D、E五个处理器中数据处理的过程需要的时间相差很大，且每一步的计算都需要上一步运算的结果作为输入，对数据计算的并行性利用不高。近一步分析发现，计算A寄存器中的数据时，由于要使用上一次计算结果中的A寄存器内容，所以必须等待。但此时B、C、D、E寄存器中的内容已经存在(由于计算量相对于A来说，要小的多)，在和A寄存器中内容相加前，可以对B，C、D、E中的数据先进行处理，从而提高效率，如图6所示[attach] 图6.JPG [/attach] 。在具体的实现过程中，采用GPU上六个处理器进行并行处理。由于每个处理器上运行的程序不同，因此运行开始前需利用CUDA2．0中的分支结构判断处理器上应运行的程序。同时，程序中多个处理器需要同一个数据，将A、B、C、D、E中的数据存储任全局变量中，使每个处理器都能读取。为了保证处理器能够协调工作，需要增加一些变量来控制处理器j：开始运行SHA-I算法程序。每个处理器的任务分配与工作时序如表2所示[attach]表2.JPG[/attach]。经过改进，计算的速度得到了进一步的提高，运行时间如表3所示[attach]表3.JPG[/attach]。 
4、结论：为了充分利用GPU的并行处理能力，提高SHA-1算法的速度，这里通过对该算法进行并行处理，快速实现了SHA-1算法。实验结果表明，GPU上运行SHA-1算法所需时间比CPU上节约了近3倍。哈哈，巨大的优势，鉴于SHA-1算法自身仍有并行性有待开发，算法的运算速度也还有提升的空间。期待其他算法的实现！！！上传的附件：表1.JPG （12.87kb，39次下载）表2.JPG （21.68kb，39次下载）表3.JPG （13.11kb，39次下载）图2图3.JPG （49.29kb，39次下载）图4图5.JPG （19.94kb，39次下载）图6.JPG （23.59kb，39次下载） 2009-11-21 23:14 0
rockinuk 雪币： 2096 活跃值： (100) 能力值： (RANK：420 ) 在线值：发帖 613 回帖 1939 粉丝 7 关注私信	rockinuk 8 18 楼謝謝 cykerr 大大這麼有用的材料。今天無意間發現了一篇論文，為台灣清華大學碩士論文，但該論文未公開，所以無法取得全文。我會想辦法去取得這份論文，並提供給大家參考。 ※台灣大學、台灣交通大學、台灣清華大學、台灣成功大學，在工程領域上為台灣 Top 5 的大學。 ===== http://140.113.39.130/cgi-bin/gs/hugsweb.cgi?o=dnthucdr&i=sGH000936350.id 題名: 基於繪圖處理器之封包檢測技術研究其他題名: A Graphics Processor-based Packet Inspection Scheme 作者: 徐獻文 Hsien-Wen Hsu 描述: 碩士國立清華大學資訊工程學系 GH000936350 日期: 2006 關鍵詞: 顯示卡字串比對網路安全 GPU String matching Network security 摘要: 近年來網路安全備受重視，深層封包檢測等相關技術被廣泛地研究與應用。本論文提出了一個基於繪圖處理器之深層封包檢測方案與架構 -- 利用個人電腦上之繪圖卡來作為字串比對之加速器。由於繪圖卡天生具有平行運算能力，我們利用繪圖卡中數量極多且可平行運算的Fragment Shader做為平行字串比對處理器，搭配繪圖卡的平行管線，將字串比對的效能發揮到極限。這對需求度日益增高的深層封包檢測是十分具有吸引力的，因為大部分的網路流量都是由多個連線所聚集而來的，所以本論文提出的方案與架構特別針對如何平行地處理這些網路流量有一巧妙之設計與安排。我們將熟知的自動機結構，轉化為可由繪圖卡所執行之資料結構，並透過了各種圖形處理程式語言來處理其有關狀態轉換過程的 data flow以及與系統I/O運作的control flow。因此本論文提出的方案與架構將會適用於許多以自動機為基礎的字串比對演算法，例如著名的Aho and Corasick(AC) 演算法。本論文除理論上提出了架構，並將之實現於繪圖卡之中。並透過實驗計算來分析效能表現。包括展示將自動機轉換為材質後，其在繪圖卡記憶體中的佈局，並計算出各種資料結構所需的記憶體大小外，更客觀地去分析其處理效能表現。經由實驗證明，本論文提出的方案與其他同樣利用硬體加速器的方案相比較下，我們可將目前市面上售價不到美金400元的繪圖卡，擁有高達6.4G bps的字串比對能力。另外，透過實驗本論文證明可讓一般個人電腦中大部分時間為閒置的繪圖卡充分地發揮其效用，在使用者無須再添購任何的新硬體的情況下，與其他的純軟體字串比對實作相比較，約可以減少2/3以上字串比對時間。我們相信此一研究將會成為利用繪圖卡進行網路封包處理的濫觴。 Network security has recently become increasingly important. Hence, related technologies like deep packet inspection are being extensively researched and widely implemented. This study proposes a novel scheme and architecture for packet content inspection using graphics processing units (GPUs). The proposed method takes the common component of personal computers, namely GPU, as the accelerator for pattern matching which the critical problem in deep packet inspection. With the native parallel computing power of GPUs, the multiple fragment shaders are considered as parallel pattern matching engines, and with simultaneous pipelines they maximize power of pattern matching. 
These features are very attractive for content signature recognition, which is becoming increasingly popular. Since most network traffic is aggregated from many sessions, the proposed scheme and architecture are designed to process multi-session network traffic simultaneously. The well-known automaton structure is converted to a data structure executed on the GPU, handling both the data flow, which manages the state transition procedure, and the control flow, which is in charge of system I/O, via various graphics processing languages. The proposed scheme and architecture work well with many automaton-based pattern matching algorithms. This study focuses on the best known one, the Aho-Corasick (AC) algorithm. Besides presenting the scheme in theory and implementing it on a commodity GPU in practice, this study evaluates the proposed approach by showing the memory fingerprint of the automaton in the GPU, determining the memory size of all necessary data structures, and analyzing the throughput. The experimental results indicate that the proposed pattern matching approach can achieve 6.4 Gbps throughput on a graphics card with a market price below US$400, an improvement over other accelerator-based approaches. This study also shows that the proposed scheme can exploit GPU resources to accelerate pattern matching, since the GPU installed in common PCs is idle most of the time. The processing time of the proposed software-based implementation is 2/3 less than that of other software-based pattern matching implementations. The proposed concept, Network Processing on Graphics Processing Units, is a novel research field in network processing.

References:
[1] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, vol. 20, issue 10, Oct. 1977, pp. 761–772.
[2] A. V. Aho and M. J. Corasick, “Efficient string matching: An aid to bibliographic search,” Communications of the ACM, vol. 18, issue 6, Jun. 1975, pp. 333–340.
[3] S. Wu and U. Manber, “A fast algorithm for multi-pattern searching,” Technical Report TR-94-17, Department of Computer Science, University of Arizona, 1994.
[4] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, “Deterministic memory-efficient string matching algorithms for intrusion detection,” Proceedings of the IEEE Infocom Conference, 2004, pp. 333–340.
[5] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood, “Deep packet inspection using parallel Bloom filters,” IEEE Micro, vol. 24, no. 1, 2004, pp. 52–61.
[6] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos, “Implementation of a content-scanning module for an internet firewall,” Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, Apr. 9–11, 2003, pp. 31–38.
[7] F. Yu, R. H. Katz, and T. V. Lakshman, “Gigabit rate packet pattern-matching using TCAM,” Proceedings of the 12th IEEE International Conference on Network Protocols (ICNP’04), Oct. 5–8, 2004, pp. 174–183.
[8] C. Courcoubetis and V. A. Siris, “Measurement and analysis of real network traffic,” Proceedings of the 7th Hellenic Conference on Informatics (HCI’99), Aug. 1999.
[9] R. T. Liu, N. F. Huang, C. H. Chen, and C. N. Kao, “A fast string matching algorithm for network processor-based intrusion detection systems,” ACM Transactions on Embedded Computing Systems, vol. 3, no. 3, Aug. 2004, pp. 614–633.
[10] M. Pharr and R. Fernando, “GPU Gems 2,” Addison Wesley, 2004.
[11] GPGPU: General-Purpose Computation on GPUs. http://www.gpgpu.org
[12] P. Trancoso and M. Charalambous, “Exploring graphics processor performance for general purpose applications,” 8th Euromicro Conference on Digital System Design (DSD’05), 2005, pp. 306–313.
[13] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha, “GPUTeraSort: High performance graphics co-processor sorting for large database management,” Proceedings of the ACM SIGMOD Conference, Chicago, IL, Jun. 2006.
[14] P. Kipfer, M. Segal, and R. Westermann, “UberFlow: A GPU-based particle engine,” Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, session: Computation, 2004, pp. 115–122.
[15] I. Rudomín, B. Hernandez, and E. Millán, “Fragment shaders for agent animation using finite state machines,” Simulation Modelling Practice and Theory, vol. 13, issue 8, Programmable Graphics Hardware, Nov. 2005, pp. 741–751, Elsevier (preprint).
[16] D. L. Cook, J. Ioannidis, A. D. Keromytis, and J. Luck, “CryptoGraphics: Secret key cryptography using graphics cards,” Proceedings of the RSA Conference, Cryptographer's Track (CT-RSA), 2005, pp. 334–350.
[17] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, “LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware,” ACM/IEEE SC 2005 Conference (SC’05), 2005, pp. 3.
[18] J. D. Hall, N. A. Carr, and J. C. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Technical Report UIUCDCS-R-2003-2328, University of Illinois, Apr. 2003.
[19] U. Kapasi, W. J. Dally, et al., “The Imagine stream processor,” IEEE International Conference on Computer Design, Sep. 2002, pp. 282–288.
[20] DEFCON. http://cctf.shmoo.com
[21] M. Roesch, “Snort: Lightweight intrusion detection for networks,” Proceedings of the 1999 USENIX LISA Systems Administration Conference, Nov. 1999. http://www.snort.org/
[22] R. J. Rost, “OpenGL(R) shading language,” Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, 2004.
[23] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: Stream computing on graphics hardware,” ACM Transactions on Graphics (SIGGRAPH), Aug. 2004, pp. 777–786.
[24] R. S. Wright and M. Sweet, “OpenGL SuperBible,” Waite Group Press, Indianapolis, 2000.
[25] “NVIDIA developer tools: NVShaderPerf,” http://developer.nvidia.com/object/nvshaderperf_home.html
[26] Y. H. Cho, S. Navab, and W. Mangione-Smith, “Specialized hardware for deep network packet filtering,” Proceedings of 12th International Conference on Field, vol. 2438, Sep. 2–4, 2002, pp. 452.
[27] T. Song, W. Zhang, Z. Tang, and D. Wang, “Alphabet based selected character decoding for area efficient pattern matching architecture on FPGAs,” Second International Conference on Embedded Software and Systems (ICESS’05), 2005, pp. 276–283.
[28] H. Bos and K. Huang, “Towards software-based signature detection for intrusion prevention on the network card,” Proceedings of the Eighth International Symposium on Recent Advances in Intrusion Detection (RAID2005), Seattle, Washington, Sep. 2005.
[29] M. Norton, “Optimizing pattern matching for intrusion detection,” Jul. 2004. http://docs.idsresearch.org/OptimizingPatternMatchingForIDS.pdf
[30] “Graphic Remedy gDEBugger,” http://www.gremedy.com, 2005.
[31] L. Tan, B. Brotherton, and T. Sherwood, “Bit-split string-matching engines for intrusion detection and prevention,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 3, no. 1, Jun. 2006.

Permanent link: http://nthur.lib.nthu.edu.tw/handle/987654321/35871
Source link: http://thesis.nthu.edu.tw/cgi-bin/gs/hugsweb.cgi?o=dnthucdr&i=sGH000936350.id
Appears in collections: [Department of Computer Science] Doctoral and master's theses
2009-11-22 04:00
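To make the idea in the abstract above more concrete, here is a minimal CPU-side sketch in Python of how an Aho-Corasick automaton can be flattened into a dense state-transition table and then applied to several sessions in lockstep. It is only an illustration under assumptions: the function names (`build_ac_table`, `scan_sessions`) are hypothetical, and the thesis itself is described as using graphics processing languages on the GPU, not this code. The point is that once the automaton is a plain 2-D table, each session needs only its own current state, so the per-session work is independent and could in principle be handed to separate GPU threads or fragments.

```python
from collections import deque

ALPHABET = 256  # one table column per possible byte value


def build_ac_table(patterns):
    """Build a dense Aho-Corasick DFA from byte-string patterns.

    Returns (table, out) where table[state][byte] is the next state and
    out[state] is the set of pattern indices that end at that state.
    """
    goto = [{}]    # trie edges for each state
    out = [set()]  # patterns recognized at each state
    for idx, pat in enumerate(patterns):
        s = 0
        for b in pat:
            if b not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][b] = len(goto) - 1
            s = goto[s][b]
        out[s].add(idx)

    fail = [0] * len(goto)
    table = [[0] * ALPHABET for _ in goto]
    queue = deque()
    for b, t in goto[0].items():
        table[0][b] = t
        queue.append(t)
    # Breadth-first pass: fill in failure links and complete every table row.
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for b in range(ALPHABET):
            t = goto[s].get(b)
            if t is None:
                table[s][b] = table[fail[s]][b]
            else:
                table[s][b] = t
                fail[t] = table[fail[s]][b]
                queue.append(t)
    return table, out


def scan_sessions(table, out, sessions):
    """Advance one automaton state per session, one byte per step.

    Returns, for each session, a list of (end_position, pattern_index).
    """
    states = [0] * len(sessions)
    matches = [[] for _ in sessions]
    longest = max((len(s) for s in sessions), default=0)
    for pos in range(longest):
        # Each iteration of this inner loop touches only its own session's
        # state, which is the part a GPU could run in parallel.
        for i, data in enumerate(sessions):
            if pos < len(data):
                states[i] = table[states[i]][data[pos]]
                for p in sorted(out[states[i]]):
                    matches[i].append((pos, p))
    return matches
```

For example, with the patterns `b"he"`, `b"she"`, `b"his"`, `b"hers"` and the two sessions `b"ushers"` and `b"this"`, `scan_sessions` returns `[[(3, 0), (3, 1), (5, 3)], [(3, 2)]]`. The table for these four patterns has 10 states of 256 entries each, which gives a feel for the kind of memory-size analysis the abstract mentions.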
cykerr (post 19): Right, the GPU's characteristics let it deliver excellent performance in many fields, and a lot of people are now working on this worthwhile effort. 2009-11-22 10:27
rockinuk (post 20): The literature on applying GPUs to encryption and decryption is still not very extensive, so I expect there is still room to do something in this area. 2009-11-22 10:52
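cykerr (post 21): Not really. Plenty of people are researching this now, because compared with a CPU, the GPU architecture devotes many more ALUs to data computation, which makes it especially suitable for workloads that can be expressed as data-parallel computation. That is undoubtedly good news for speeding up common algorithms. 2009-11-22 10:57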
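rockinuk (post 22): What I mean is that I have not heard of cryptographers in the research groups at MIT, Harvard, Cambridge, Oxford, Microsoft, or IBM working on GPUs or the like, which is why the papers and reference material I could provide after searching the databases are limited. There is still something of a gap between academia and industry. The encouraging part is that there are at least some papers to consult, which suggests research of this kind is not yet widespread, so there is no need to worry that the topic has been done to death. Digital signatures, for example, were worked to exhaustion by academia about ten years ago, and few people can still produce anything new there. 2009-11-22 11:06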
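cykerr (post 23): Applying GPUs to cryptography is a trend. A few years ago, to make GPUs easier to use for general-purpose computing, NVIDIA introduced the Compute Unified Device Architecture (CUDA), a hardware and software architecture that issues and manages computations on the GPU as a parallel computing device without mapping them to a graphics API. Heh, it is quite normal that there are few papers. I can tell you don't play games: if you look even briefly at the history of GPUs, you will see that this only started in the last few years, which is also why CUDA only works on the GeForce 8800, the Quadro FX 5600/4600 series, and higher-end graphics cards. Heh, once everyone's machine has been upgraded to that configuration, you won't be able to avoid using it. There are many academic articles in this area; try looking for articles on graphics processing. Personally I think the prospects for GPUs in cryptography are fairly good. 2009-11-22 11:31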
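cykerr (post 24): I have read through historical material on GPUs, and I have reservations about the criteria NVIDIA uses to divide GPUs into generations. Counting from the SGI era, GPU development has gone roughly through these stages: discrete integrated circuits (an SGI graphics card could consist of up to 3 boards) => single-chip integrated circuits (the masterpiece of the Voodoo3 era) => multi-GPU (taken to the extreme on the Voodoo6) => T&L (GeForce 256) => SM 1 to 3 (DX8/DX9-class programmable units) => SM 4.0 (unified architecture, DX10 class, with fixed function demoted to a second-class citizen, that is, to a mere interface concept). As for the future, the SGI-like graphics pipeline cannot reign forever and will have to step aside when the time comes. Future architectures will certainly fight a head-on battle with CPU coprocessors. Who will win? I think neither; only consumers will win. So the way for both sides to win is to learn from each other's strengths. PS: I did not count the mainframe era of graphics hardware, although it probably should be counted too. 2009-11-22 11:45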
rockinuk (post 25): Yes, cykerr, I indeed don't play games (PC online games and so on). I have reservations about the claim that applying GPUs to cryptography is a trend, but I will keep working in this area (the GPUs project). 2009-11-22 16:03