首页
社区
课程
招聘
[分享]具偵錯能力的帶符號快速加法電路之設計
发表于: 2009-12-29 18:00 3842

[分享]具偵錯能力的帶符號快速加法電路之設計

2009-12-29 18:00
3842
=== 提供給 deryope 組長的一帖 【讨论】关于并行大数加法的,遇到困境了... 有關加法的設計之論文 === (沒有 parallel, 只有 fast)

系統編號: 087CCIT0442033
出版年: 1999  
研究生: 賴瑞昌  
研究生(英文姓名): Jui-Chang Lai  
全文資料 :  電子全文下載  
論文名稱:  具偵錯能力的帶符號快速加法電路之設計

英文論文名稱: Design of a Fast Signed Binary Adder with Error Detection

指導教授: 婁德權  
指導教授(英文姓名):  Der-Chyuan Lou  
學位類別:  碩士
校院名稱:  中正理工學院 
系所名稱:  電機工程研究所
學號:  881325  
學年度:  87  
語文別:  中文  
論文頁數:  72  
關鍵詞:  帶符號二進位數字系統 ; 進位傳遞延遲 ; 漣波進位加法電路  
英文關鍵詞:  signed binary digit system ; propagation delay ;
ripple carry adder  
被引用次數:  0
[ 摘要 ]  
密碼系統不論是以軟體或硬體去實現,大部份都需要用到模乘法與模指數的運算。而這些運算的最基本單元就是加法器,而且此類加法器不但要快速,更要能偵錯。因為密碼系統講求的就是安全性,若在基本的電路中就有閃失,那就不堪稱為安全性高的密碼系統了。傳統的二進位加法電路會產生進位傳遞延遲(carry propagation delay) ,因此計算速度緩慢。雖然前看進位(carry look-ahead)加法電路可改善此問題,但其電路複雜度隨著位元數的增加而變大,需要很大的推動電流,並不實用。解決進位傳遞延遲的有效方法是使用帶符號的二進位數字(signed binary digit,簡稱SBD)系統,本論文以SBD為加法運算之數系,提出了一個快速而且可以偵錯的加法器架構,以提高資訊安全裝置之容錯能力。  
[ 英文摘要 ]  
Most of the cryptosystem, which implemented in software or hardware methods, need modular multiplication and modular exponentiation. In the hardware design of a good cryptosystem, the basic circuit should not include any error. The adder, which constructs the basic device of the cryptosystem, should operate fast and provide the ability of error detection. The traditional binary adder has the carry propagation delay so it is slow. The carry look-ahead adder improves the carry propagation delay but it has another disadvantage, i.e., more bits size of the circuit involves the more complexity of the circuit. It is unrealistic that very large fan-in are required by the carry look-ahead adder. One of the valid way to solve the carry propagation delay is using the signed binary digit (SBD) system. In this thesis, we propose a SBD based adder, which is fast and error detection. Base on this adder, the cryptosystem is more robust.

[ 論文目次 ]  
誌 謝 ii
摘 要 iii
Abstract iv
目 錄 v
表 錄 vii
圖 錄 viii
1. 前言 1
2. 快速加法電路 6
2.1. 前看進位加法器(carry look-ahead adder). 6
2.2. 進位選擇加法器(carry select adder) 24
2.3. 進位儲存加法器(carry save adder) 27
2.4. 帶符號二進位數字加法器(signed binary digit adder) 32
3. 無進位且結構規則的帶符號快速加法電路 43
3.1. 無進位且結構規則的帶符號快速加法電路設計 43
3.2. 無進位且結構規則的帶符號快速加法電路之硬體實現 51
3.3. 討論 54
4. 具偵錯能力的帶符號快速加法電路 55
4.1. 具偵錯能力的帶符號快速加法電路設計 55
4.2. 具偵錯能力的帶符號快速加法電路之硬體實現 57
4.3. 討論 60
5. 結論 67
參考文獻 69
論 著 72
自 傳 73  

[ 參考文獻]  
1. Avizienis A., “Signed-digit number representations for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. 10, pp. 389-400, 1961.
2. Brent R. P. and Kung H.-T., “A Regular Layout for Parallel Adders,” IEEE Transactions on Computers, vol. 31, no. 3, pp. 260-264, Mar. 1982.
3. Briggs W. S. and Matula D. W., “Signed digit multiplier,” U. S. Patent No. 5144576, Sep. 1992.
4. Briggs W. S. and Matula D. W., “A 1769 bit multiply and add unit with redundant binary feedback and single cycle latency,” Proceedings of the Eleventh Symposium on Computer Arithmetic, 1993, pp. 163-170.
5. Darley H. M., “Signed digit adder circuit,” U. S. Patent No. 4979140, Dec. 1990.
6. Harata Y., Nakamura Y., Nagese H., Takigawa M., and Takagi N., “A high-speed multiplier using a redundant binary adder tree,” IEEE Journal on Solid-State Circuits, vol. 22, no. 2, pp. 28-34, Feb. 1987.
7. Jedwab J. and Mitchell C. J., “Minimum weight modified signed-digit representations and fast exponentiation,” IEE Electronics Letters, vol. 25, no. 17, pp. 1171-1172, Aug. 1989.
8. Kantabutra V., “A recursive carry-lookahead/carry-select hybrid adder,” IEEE Transactions on Computers, vol. 42, no. 12, pp. 1495-1499, Dec. 1993.
9. Knuth D. E., The art of computer programming, vol. 2: semi-numerical algorithms, Addison-Wesley, USA, 1981, 2nd edition.
10. Koren I., Computer arithmetic algorithms, Prentice-Hall, Englewood Cliffs, N. J., 1993.
11. Lou D.-C. and Chang C.-C., “Fast Exponentiation method obtained by folding the exponent in half,” IEE Electronics Letters, vol. 32, no. 11, pp. 984-985, May. 1996.
12. Parhami B., “Generalized signed-digit number systems: a unifying framework for redundant number representations,” IEEE Transactions on Computers, vol. 39, no. 1, pp. 89-98, Jan. 1990.
13. Patterson D. A. and Hennessy J. L., Computer architecture: a quantitative approach, Morgan Kaufmann, San Francisco, C.A., 1996.
14. Rivest R., Shamir A., and Adleman L., “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the Association for Computing Machinery, vol. 21, no. 2, pp. 120-126, Feb. 1978.
15. Takagi N. and Yajima S., “On-line error-detectable high-speed multiplier using redundant binary representation and three-rail logic,” IEEE Transactions on Computers, vol. 36, no. 11, pp. 1310-1317, Nov. 1987.
16. Taylor F. J., “A VLSI residue arithmetic multiplier,” IEEE Transactions on Computers, vol. 31, no. 6, pp. 540-546, June. 1982.
17. Thornton M. A., “Signed binary addition circuitry with inherent even parity outputs,” IEEE Transactions on Computers, vol. 46, no. 7, pp. 811-816, July. 1997.
18.Walter C. D., “Systolic modular multiplication,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 376-378, Mar. 1993.

[招生]科锐逆向工程师培训(2024年11月15日实地,远程教学同时开班, 第51期)

上传的附件:
收藏
免费 0
支持
分享
最新回复 (15)
雪    币: 2096
活跃值: (100)
能力值: (RANK:420 )
在线值:
发帖
回帖
粉丝
2
When A, B, and the carry-in are all HIGH, a sum of 1 and a carry-out are produced. First, consider A and B. When both are HIGH, the output of gate 1 is LOW, and the output of gate 2 is HIGH, giving us a carry-out at gate 5. The carry-in produces a 1 output at gate 3, giving us a sum of 1. The output of the full adder is 112. The sum of 12 plus 12 plus 12 is 11

PARALLEL ADDERS
The adders discussed in the previous section have been limited to adding single-digit binary numbers and carries. The largest sum that can be obtained using a full adder is 112. Parallel adders let us add multiple-digit numbers. If we place full adders in parallel, we can add two- or four-digit numbers or any other size desired. Figure 3-9 uses STANDARD SYMBOLS to show a parallel adder capable of adding two, two-digit binary numbers. In previous discussions we have depicted circuits with individual logic gates shown. Standard symbols (blocks) allow us to analyze circuits with inputs and outputs only. One standard symbol may actually contain many and various types of gates and circuits. The addend would be input on the A inputs (A2 = MSD, A1 = LSD), and the augend input on the B inputs (B2 = MSD, B1  = LSD). For this explanation we will assume there is no input to C0 (carry from a previous circuit).

Figure 3-9. —Parallel binary adder.
Now let’s add some two-digit numbers. To add 102 (addend) and 012 (augend), assume there are numbers at the appropriate inputs. The addend inputs will be 1 on A2 and 0 on A1. The augend inputs will be 0 on B2 and 1 on B1. Working from right to left, as we do in normal addition, let’s calculate the outputs of each full adder. With A1 at 0 and B1 at 1, the output of adder 1 will be a sum (S1) of 1 with no carry (C1). Since A2 is 1 and B2 is 0, we have a sum (S2) of 1 with no carry (C2) from adder 1. To determine the sum, read the outputs (C2, S 2, and S1) from left to right. In this case, C2 = 0, S2 = 1, and S1 = 1. The sum, then, of 102 and 012 is 0112  or 112. To add 112 and 012, assume one number is applied to A1 and A2, and the other to B1 and B2, as shown in figure 3-10. Adder 1 produces a sum (S1) of 0 and a carry (C1) of 1. Adder 2 gives us a sum (S2)

Source from http://www.tpub.com/content/neets/14185/css/14185_126.htm
2009-12-29 18:16
0
雪    币: 2096
活跃值: (100)
能力值: (RANK:420 )
在线值:
发帖
回帖
粉丝
3
加法和減法的運算,可以在一個加法器中結合起來,做成一個二進位的並行加法器
。要達成此一目的,就在每一個全加器中增加一個Exclusive-OR閘。下圖所示為一4-bit
的加減法器。S輸入為運算控制;當S=0時做加法運算,S=1時做減法運算。


Source from http://www.cis.nctu.edu.tw/~info24/advanced/add_sub.htm
2009-12-29 18:17
0
雪    币: 2096
活跃值: (100)
能力值: (RANK:420 )
在线值:
发帖
回帖
粉丝
4
Parallel Adders

1. Introduction
The saying goes that if you can count, you can control. Addition is a fundamental operation for any digital system, digital signal processing or control system. A fast and accurate operation of a digital system is greatly influenced by the performance of the resident adders. Adders are also very important component in digital systems because of their extensive use in other basic digital operations such as subtraction, multiplication and division. Hence, improving performance of the digital adder would greatly advance the execution of binary operations inside a circuit compromised of such blocks. The performance of a digital circuit block is gauged by analyzing its power dissipation, layout area and its operating speed. 2. Types of Adders In this lecture we will review the implementation technique of several types of adders and study their characteristics and performance.
These are:
1) Ripple carry adder, or carry propagate adder,
2) Carry look-ahead adder
3) Carry skip adder,
4) Manchester chain adder,
5) Carry select adders,
6) Pre-Fix Adders,
7) Multi-operand adder,
8) Carry save Adder,
9) Pipelined parallel adder,

Source from http://www.ece.concordia.ca/~asim/COEN_6501/Lecture_Notes/L2_Notes.pdf
上传的附件:
2009-12-29 18:25
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
5
呃,忘了关注密码学小组的帖子,原来rockinuk版主两天前就已经帮我找了这么多资料。正好三天假,我先下载下来好好研究下,看能不能有所启发!
感谢
2009-12-31 08:34
0
雪    币: 2096
活跃值: (100)
能力值: (RANK:420 )
在线值:
发帖
回帖
粉丝
6
Parallelization - Parallel Adder

The following is a description of a basic single-processor code and its transition into a parallel code. We hope this example of a simple parallelization process serves to demonstrate what steps and calls are needed for codes whose work can be performed independently of one another. A simple "divide and conquer" approach is appropriate here.  

To the experienced parallel programmer, this example is a "trivially parallelizable" code because the entire work of the code can be partitioned between the processors very simply. That is the point of this discussion: we are demonstrating how one might go about converting this kind of single-processor code into a parallel code.

This technique could apply to your situation in at least three possible ways. First, this technique is applicable to to problems such as minmax searches for the best move in a game such as chess, "monte carlo" techniques, or computationally-intensive searches through a large parameter space. If you are fortunate enough to have a code of this category, then you need to look little further than this technique. Second, selected components of a code could also fall under this category, so this technique could be useful to apply to those portions of a single-processor code. Finally, if this is your first look at parallel code, this example serves as an excellent start to learn how to go about writing and thinking about parallel codes.

Before: Single-Processor adder


We begin with a code utilizing single-processor execution. This example, adder, sums the square of each integer from 1 to 10000, inclusive. It is organized into three routines:

1) kernelroutine - performs the elemental work of squaring each integer

2) computeloop - allocates memory, loops over the integers, calls kernelroutine, and saves and sums their results

3) main - performs any required initialization, calls computeloop, saves its results into a file, printf's the sum, and deallocates memory

The code is structured in such a way to clearly separate the pieces that perform the work that is secondary yet essential to solving the problem. These include allocating the appropriate memory and saving data to an output file. Besides identifying where these functions are being performed, the structure makes it obvious that a much more complicated problem can substitute for the kernelroutine shown. We also intend this explicit structure to represent corresponding portions of a much larger code. The C source is shown in Listing 1.

Listing 1 - adder.c 
See also adder.f90 

--------------------------------------------------------------------------------
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */

/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input); 
double kernelroutine(double input) 
	{
	return (input+1)*(input+1); 
	}

/* HighestIndex specifies how highest index to sample */
#define HighestIndex			10000L

void computeloop(double *theSum, double **theArray); 
void computeloop(double *theSum, double **theArray)
	{
	/* local copies of the data to be output */
	double *myArray, mySum=0; 
	/* limit of the loop */
	long loopEnd=HighestIndex; 
	
	/* allocate an array to hold all the results */
	myArray=malloc(sizeof(double)*loopEnd); 
	
	if (myArray) {
		long index; 
		
		/* loop over indicies */
		for(index=0; index<loopEnd; index++) {
	/* call the kernel routine for each index, and save into the array */
			myArray[index]=kernelroutine(index); 
			/* sum as we go */
			mySum+=myArray[index]; 
			
			}
		
		}
	
	/* return the sum and the array */
	*theSum=mySum; 
	*theArray=myArray; 
	}
	
int main(int argc, char *argv[])
	{
	/* main copies of the sum and the array */
	double theSum, *theArray=NULL; 
	
	printf("Beginning computation...\n"); 
	
	computeloop(&theSum, &theArray); 
	
	if (theArray) {/* error checking */
		FILE *fp; 
		
		/* save the array into a data file */
		fp=fopen("output", "w"); 
		if (fp) {
			printf("writing array...\n"); 
			
			fwrite(theArray, sizeof(double), HighestIndex, fp); 
			
			fclose(fp); 
			}
		
		printf("the sum is %f\n", theSum); 
		
		/* clean up after myself */
		free(theArray); 
		
		}
	else 
		printf("memory allocation failure\n"); 
	
	return 0; 
	}

 


When this code is run with its HighestIndex constant set to 10000, the code states that the sum is 333383335000 and produces a 80000 character long binary file containing the squares of integers 1 through 10000.

After: paralleladder

A key to parallelizing an application is choosing the appropriate partitioning of the problem. The obvious choice here is to divide the number of integers by the number of processors. However, a few details must be worked out: What if the number of processors doesn't divide evenly into the number of integers? How does each processor know which integers to work on so they don't do the same integer twice or miss one? Once they have the answers, how do the sum numbers on one processor with those on another, and how do these processors save the data into one file?

The parallelization process here is primarily a matter of managing and delegating data and computation. Changes to main are limited to "boilerplate" code that prepare and tear down the parallel environment and pass the sufficient information to computeloop so that it can organize the work and a if test so only processor 0 creates an output file. The most important modifications are in computeloop to perform the coordination and calculate the partitioning between processors. Given the identification number of the processor it is running on and the number of processors in the system, computeloop calculates how to partition the problem among the processors. To minimize bottlenecks, the executable running on each processor calculates its assignment itself, without a central authority delegating assignments. Once it finishes its work, it collects the data back to processor 0 so main can write the output in one convenient place for the user.

The detailed answer is in the code. Listing 2 shows the C source of paralleladder.c, with the changes relative to adder.c underlined.

Listing 2 - paralleladder.c, with changes compared to adder.c underlined 
See also paralleladder.f90 

--------------------------------------------------------------------------------

#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */

#include "mpi.h" /* MPI library */

/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input); 
double kernelroutine(double input) 
	{
	return (input+1)*(input+1); 
	}

/* HighestIndex specifies how highest index to sample */
#define HighestIndex			10000L

void computeloop(double *theSum, double **theArray, int idproc, int nproc); 
void computeloop(double *theSum, double **theArray, int idproc, int nproc)
	{
	/* local copies of the data to be output */
	double *myArray, mySum=0; 
	/* limit of the loop */
	long loopEnd=(HighestIndex+nproc-1)/nproc;
	/* just this proc's piece */
	
	/* this processor's index offset */
	long offset=idproc*loopEnd;
	
	/* allocate an array to hold all the results */
	myArray=malloc(sizeof(double)*loopEnd); 
	
	if (myArray) {
		long index; 
		
		/* loop over indicies */
		for(index=0; index<loopEnd; index++) {
/* call the kernel routine for each index, and save into the array */
			myArray[index]=kernelroutine(index+offset); 
			/* sum as we go */
			if (index+offset<HighestIndex)
			/* limit to the desired indicies */
				mySum+=myArray[index]; 
			
			}
		
		/* proc 0 needs to hold the entire array */
		{double *bigArray=idproc?NULL:malloc(sizeof(double)*loopEnd*nproc);
	
		/* gathers the data from the other arrays ... */
		MPI_Gather(myArray, loopEnd, MPI_DOUBLE, 
			bigArray, loopEnd, MPI_DOUBLE, 
			/* ... to proc 0 */ 0, MPI_COMM_WORLD); 
		
		if (!idproc) {
			free(myArray); 
			myArray=bigArray; 
			}
		}
		
		/* performs a parallel sum across processors 
			and saves the result ... */
		MPI_Reduce(&mySum, &mySum, 1, MPI_DOUBLE, MPI_SUM, 
			/* ... at proc 0 */ 0, MPI_COMM_WORLD); 
		
		}
	
	/* return the sum and the array */
	*theSum=mySum; 
	*theArray=myArray; 
	}
	
int ppinit(int argc, char *argv[], int *idproc, int *nproc); 
void ppexit(void);

int main(int argc, char *argv[])
	{
	/* main copies of the sum and the array */
	double theSum, *theArray=NULL; 
	int idproc, nproc, ierr; 
	
	/* initialize parallel processing */
	ierr = ppinit(argc, argv, &idproc, &nproc); 
	
	if (ierr) return ierr; /* stop right there if there's a problem */
	
	printf("I'm processor #%d in a %d-processor cluster.\n", idproc, nproc); 
	
	printf("Beginning computation...\n"); 
	
	computeloop(&theSum, &theArray, idproc, nproc); 
	
	if (theArray) {/* error checking */
		
		if (!idproc) {/* only processor 0 */
			FILE *fp; 
		
		/* save the array into a data file */
			fp=fopen("output", "w"); 
			if (fp) {
				printf("writing array...\n"); 
				
				fwrite(theArray, sizeof(double), HighestIndex, fp); 
				
				fclose(fp); 
				}
			
			printf("the sum is %f\n", theSum); 
			
			}
		
		/* clean up after myself */
		free(theArray); 
		
		}
	else 
		printf("memory allocation failure\n"); 
	
	/* only proc 0 pauses for user exit */
	if (!idproc) {
		printf("press return to continue\n"); 
		getc(stdin); 
		}
	
	ppexit(); 
	
	return 0; 
	}
	
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
#include <SIOUX.h>
#endif 

int ppinit(int argc, char *argv[], int *idproc, int *nproc)
	{
/* this subroutine initializes parallel processing
   idproc = processor id
   nproc = number of real or virtual processors obtained */
   int ierr;
   *nproc=0; 

/* initialize the MPI execution environment */

   ierr = MPI_Init(&argc, &argv);
   if (!ierr) {

/* determine the rank of the calling process in the communicator */
		ierr = MPI_Comm_rank(MPI_COMM_WORLD, idproc);

/* determine the size of the group associated with the communicator */
		ierr = MPI_Comm_size(MPI_COMM_WORLD, nproc);

#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
		SIOUXSettings.asktosaveonclose=0; 
		SIOUXSettings.autocloseonquit=1;
#endif 
		
		}
	
	return ierr;
	}

void ppexit(void)
	{
/* this subroutine terminates parallel processing */
	int ierr;

/* terminate MPI execution environment */
	ierr = MPI_Finalize();
	
	}

 


The changes from adder to paralleladder:

mpi.h - the header file for the MPI library, required to access information about the parallel system and perform communication

idproc, nproc - nproc describes how many processors are currently running this job and idproc identifies the designation, labeled from 0 to nproc - 1, of this processor. This information is sufficient to identify exactly which part of the problem this instance of the executable should work on. In this case, these variables are supplied to computeloop by main.

loopEnd=(HighestIndex+nproc-1)/nproc; - loopEnd describes how many integers this processor should perform kernelroutine as it was in adder. Here, we choose to make have each processor perform the same amount of work, yet the total amount of work is at least that necessary to complete the problem. Depending on HighestIndex and nproc, the last processor might do a little too much, but that processor would end up waiting for the others to catch up if it did skip the excess work and it's only a small waste if HighestIndex is sufficiently large.

offset=idproc*loopEnd; - By shifting the start of the sampling for this particular processor, offset is how each processor knows not to overlap work with other processors, without skipping the needed work. With this variable and loopEnd, this is sufficient information to specify the partition of work assigned to this processor.

the malloc - performs the memory allocation for each processor. Except for processor 0, the memory allocation is the amount sufficient to hold the results of this processor's partition of work. In the case of processor 0, it must create an array big enough hold the results of its own work and that of every other processor to save the data later.

if (index+offset<HighestIndex) - limit the summation to only the indicies we want in the sum

MPI_COMM_WORLD - MPI defines communicator worlds or communicators that define a set of processors that can communicate with each other. At initialization, one communicator, MPI_COMM_WORLD, covers all the processors in the system. Other MPI calls can define arbitrary subsets of MPI_COMM_WORLD, making it possible to confine a code to a particular processor subset just by passing it the appropriate communicator. In simpler cases such as this, using MPI_COMM_WORLD is sufficient.

MPI_Gather - fills bigArray of processor 0 with the data in myArray from the other processors in preparation for saving into one data file. myArray, loopEnd, and MPI_DOUBLE specify the first element, size, and data type of the array. 0 specifies where the data is to be collected. This call can be considered a "convenience" routine that simplifies the process of "gathering" regularly organized data from other processors.

MPI_Reduce - computes the sum, specified by MPI_SUM, of each mySum variable of each processor and collects the result in processor 0. This call can sum an entire array of values, so mySum is passed as an array of length one. When supplied MPI_MAX or MPI_MIN, it can also perform other operations such as comparing values across processors and retaining the maximum or minimum values.

ppinit - performs initialization of the parallel computing environment. Part of the boilerplate for MPI codes, this routine calls:
MPI_Init - performs the actual initialization of MPI. It returns the error code; zero means no error.
MPI_Comm_size - accesses the processor count of the parallel system
MPI_Comm_rank - accesses the identification number of this particular processor

ppexit - by calling MPI_Finalize, terminates the parallel computing environment. Also part of the boilerplate of MPI codes, a call to MPI_Finalize is needed to properly cleanup and close the connections between codes running on other processors and release control.

__MWERKS__ - these #ifdef blocks merely compensate for a pecularity about the Metrowerks CodeWarrior compiler. These blocks allows the executable on other processors terminate automatically. If you are using another compiler, you don't have to worry about this code.
Note that kernelroutine did not need to be altered at all. That reflects the independence of the work performed by this routine. The code that organizes and coordinates the work between processors is in computeloop. main's changes merely perform the boilerplate work of setting up and tearing down the parallel environment, plus an if statement so only processor 0 writes the output file.

Something to note is that this code does not explicitly call MPI_Send or MPI_Recv or any other elemental communication calls. Instead it calls MPI_Gather and MPI_Reduce. There isn't anything wrong with using elemental versus collective calls, if either will do the job. In fact, the MPI_Gather could just as easily be replaced with the following:

Listing 3 - an alternative to the above MPI_Gather call 
--------------------------------------------------------------------------------

if (idproc) 
	MPI_Send(myArray, loopEnd, MPI_DOUBLE, 0, idproc, MPI_COMM_WORLD); 
else {
	long otherProc; 
	MPI_Status status; 
	for(index=0; index<loopEnd; index++) bigArray[index]=myArray[index]; 
	for(otherProc=1; otherProc<nproc; otherProc++)
		MPI_Recv(&bigArray[otherProc*loopEnd], loopEnd, MPI_DOUBLE, 
			otherProc, otherProc, MPI_COMM_WORLD, &status); 
	}

 


References

Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum is a good reference for the MPI library. In addition, many of the techniques and style of the above parallel code were adopted from Dr. Viktor K. Decyk. He expresses these ideas in a few references, including:

V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers In Physics, 7, p. 418 (1993).
V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications, 87, p. 87 (1995).
You're welcome to read more about parallel computing via the Tutorials page.

sample Code:
adder.c
adder.f90
paralleladder.c
paralleladder.f90

Source from http://daugerresearch.com/vault/paralleladder.shtml
上传的附件:
2009-12-31 08:51
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
7
这个运算上,很多大数库也是如此处理的。比如libtommath中,用一个单独的变量 sign 表示大数的符号位,假设a,b大于0,这样一来:
(a + b) or (-a + -b),调用原子级加法(不管符号位,直接相加两个数的数字部分),计算结果c的符号与a相同,
(-a + b) or (a + -b),调用原子级减法,c符号设为 |a|, |b| 中较大数的符号。

这样的优势在于求负数只需换一下 sign 就行了,而不用把整个数组给跑一遍。
2009-12-31 14:00
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
8
顺着这个 L2_Notes.pdf 說的找了一下,收获还不错。虽然并行加法上思路还不是很清晰(看了几个方法,里面讲的Carry-Save Addition、carry look-ahead addition似乎都是针对有限位数计算时进位的优化,感觉这么复杂的逻辑拿到GPU上不一定特别有用),不过找到个不错的并行乘法

原文链接:Comba multiplication By Paul G. Comba


原理如下,设下划线表进位,我们一般做乘法:
    2 3
x   8 9
=======
  2_0_7
1_8_4
=======
2_0 4 7


Comba的计算方式:(保留进位,算完再加)
         2  3
x        8  9
  ===========
        18 27
     16 24
  ===========
   2 _0 _4 _7


这个方法提出已经是20多年前了,后人还做了一个更好的改进。首先把两个数写在两张纸的角落,倒转其中一张纸和写个另一张纸对接,比如23倒转后是32,然后如下操作:
  |
  |3 2
  `----
----.      => 3*9           = 27   => 2047
 8 9|                        / |
    |                       /  |
                  <________/   |
  |              / carry 2     |
  |3 2          /              |
  `----        /               |
  ----.    => 2 + 3*8 + 2*9 = 44
   8 9|                      / |
      |                     /  |
                  <________/   |
  |              / carry 4     |
  |3 2          /              |
  `----        /               |
    ----.  => 4 + 2*8       = 20
     8 9|
        |


这样做的话可以减少乘法次数,对于两个大数相乘优势很大

顺便贴两个我找到的资料,都在这里汇总一下好了:
Carry-Save Addition
www.ece.tamu.edu/~sshakkot/courses/ecen248/csa-notes.pdf

Conditional-Sum Addition
Chap8 - Conditional Sum Adders
2010-1-2 10:13
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
9
并行加法上,借鉴Comba这个保留进位的思路,我想出一种不太完美的解决办法。
其实并行加法需要解决的问题就是carry往哪放的问题,高位计算是需要低位的进位才导致了无法并行。那么如果能把进位结果保存在当前位上,这个问题就解决了。比如计算 999 + 1:
   9 9 9
+  0 0 1
========
   9 9 0
+    1_0    <= 加上进位
========
   9 0 0
+  1_0      <= 又进位了,这是最棘手的问题


解决办法,进位后把结果直接保留在该位上:
   9  9  9
+  0  0  1
==========
   9  9 10

其结果上 9×100 + 9×10 + 10 = 1000 是一样的。

这个理论很简单,放到GPU中就是(假设计算 c[n] = a[n] + b[n] ):
1. 并行计算 c = a + b,将进位结果存入u
2. 同步各线程,保证计算结束
3. 计算 c = c + u[i-1],此时如果发生进位,则直接存入c


当然这样c[i]保存数的范围会变成一个可变值,这会带来很多不方便的地方,我在考虑是否运用超前进位加法器的思路,当需要统一范围的c[i]时,有选择的在尽可能少的循环次数内把c[i] = c[i] + u[i-1]这里的进位给消除掉。这样兼顾了运算速度并且其他算法实现也会简单点(至少考虑乘法计算的溢出时少了一个不确定因素)。

再次感谢R大的资料,顺着参考资料、关键词搜索容易很多
2010-1-2 10:38
0
雪    币: 2096
活跃值: (100)
能力值: (RANK:420 )
在线值:
发帖
回帖
粉丝
10
建議 deryope 去翻一翻書本-- Algorithm, 有一個 chapter 是介紹 distribution sysyem 或是  parallel 的部份。
順便閱讀 Synchronization  及 Assynchronization
2010-1-2 23:41
0
雪    币: 259
活跃值: (10)
能力值: ( LV2,RANK:10 )
在线值:
发帖
回帖
粉丝
11
嗯,用线程号来对齐的话,除非线程数和矩阵列数相等或成倍数,否则就会造成内存错位,而用原始数据自行对齐效果比较好。
2010-1-2 23:47
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
12
好的,我去找来看看先。



不是倍数也没有关系,比如512个线程计算一个300位的加法,那么让多出的212线程闲着好了,虽然没有最大限度利用GPU性能
用原始数据自行对齐?这个怎么做?刚学CUDA...
2010-1-4 09:22
0
雪    币: 259
活跃值: (10)
能力值: ( LV2,RANK:10 )
在线值:
发帖
回帖
粉丝
13
这个说起来有点长,G8800的带宽理论上可以达到72GB/s,而目前的一些高阶CPU的带宽一般在10GB/s的级别上,并且G8800内部含有128个处理器,单核处理器的频率达到1.3GHz,虽然频率较慢,但是内核数上的压倒性优势远强于CPU。但这些优势的背后也有一些制约我们进行并行编程的因素,像对于分支语句的运行效率极差,所以在程序中要尽量减少分支语句。目前显卡只支持32bit运算数,所以我们的一般只是进行单精度浮点数的运算。

所以在编程时要注意一些地方如;

显卡运算中涉及位对齐的标志,最好不要用块号或线程号。尤其在这次实验中,线性化的矩阵表示,如果要使用线程号来对齐的话,除非线程数和矩阵列数相等或成倍数,否则很有可能造成内存错位,而用原始数据自行对齐效果比较好。

在保证正确的基础上,尽量减少代码量,能不初始化的地方都不要初始化。因为你编写的那一段核心代码在运算期间运行的总次数是块数乘以线程数之多,每一条语句都有可能增加很大的时间开销,最小化核心代码是非常重要的。在精简过程中也有些小技巧,比如初始时给变量在CPU上分配显存的global memory时,变量本身会被初始化为0,那么在核心函数中就可以省去初始化的步骤,但是在优化过程中如果要用到高速存取的块内share memory的话,声明后若不进行初始化,则结果会出错,这是值得注意的,但是即使是share memory的初始化,这可以可以减少消耗的时间。

尽量减少分支语句。这个可以说是除了算法以外最重要的一点了,GPU对于分支语句的处理速度真是让人不敢恭维,经过测试,发现CPU在有无分支的两种不同函数表达下的运算时间竟也相差6到7倍。首先,当需要进行编号对齐时,将很有可能用到分支语句对编号的位置进行判断,此时以前的我们写程序的风格都是一串的if语句,但是后来发现if的分支太多了,看着自己都难受,于是进行的删减,对于只可能出现其中之一的情况,可以将if改成if...else if...else的形式,并从前到后发生的可能性依次减弱,则可以提高程序性能。但是即使如此,分支语句依然庞大,经过分析,索性直接将if完全省去,使用编号的计算结果直接索引,完全没有了分支语句,性能极大提高,这也是我最终的程序能比同屋的人快出0.1ms的主要原因了。当然提到减少分支语句,不可不说的是对于循环语句的优化,其实for循环中,每次循环的过程需要一次判断和一次控制变量的运算和赋值,最初还有一次是控制变量的初始化,当然为了减少代码,控制变量最好在声明时初始化,那么此时声明时在for的参量内还是外部声明就没有什么关系了。而对于无法确定循环次数,需要在运算中判断终止条件的情况,while(1)或while(true)的形式自然是要比在for里把循环控制变量设置得很大,然后还是每次都要累加的代码不仅风格好,而且效率高很多很多。然而对于已知循环次数的情况呢,有一种虽然代码风格很囧,但是效率很高的方法,就是直接将循环展开,每次只做判断,而将循环变量省去,不仅没有了循环变量的计算,更是免去了判断时循环变量的存取时间,因为显存上global memory的存取latency与主存相当,然而CPU计算就没有这种事情,因为CPU的高速缓存的存取时间在ns级别上都是可以直接忽略的,GPU也有类似的部分。

说到存取速度,可以说是GPU运算的一个软肋,但是块内share memory的应用,为GPU的进一步推广和效率优化带来了福音。当将原始数据从主存向显存复制时,纵使显卡有极大的带宽,复制速度依然受到global memory存取速度的限制。然而在运算中,如果还是直接使用global memory的话,由于数据量的巨大将造成很大的延时,所以编程中不可不用高速内存。每个线程都将要遍历的数据先放进share memory中再进行处理。据悉share memory的存取速度可与CPU的cache相媲美。如上面提到的,share memory声明后若不初始化将无法使用,然而如果将初始化就已经看做是运算的第一步的话,那可以将循环次数减少一次。但需要非常注意的是,share memory的使用范围是块,其中的内容无法在块间进行通信,所以块数较多的情况下,如果想要很好的结果,一个函数可能很难满足要求。可以将每块的运算结果先复制回global memory中,然后利用其他函数,将这些结果重新整合在同一块中进行操作,最后将结果复制回主存,这样的话可以极大减小主存和显存通信造成的延时。

最重要的,就是算法了。所以显存的线程存取模式也放在这里写了。由于极大的并行处理,GPU运算时线程会以极快的速度切换,每个线程在执行一个循环后就会被切换到下一个线程,所以如果一个线程在编写时采用连续存取的话,在实际运行中将不会是连续的,而是不停地来回跳动,由于读取原始数据是从global memory中,而global memory慢的可不仅是存取速度,连寻址速度都让人无法忍受,所以编程时,每个线程读取的两个数据中需要隔开线程数个数据以供其他数据取用,这样就可以做到连续访存,而提高效率。对于加法,最优的算法无非就是树状加法(也称扇入加法),这种算法可以将N个数据加总进行N次加法优化到N个数据加总需要lbN次加法,在N很大时,两种算法的时间复杂度根本不在一个阶上。这里也没有什么太多说的,注意的就是线程数需要取到2的幂次,否则会无法对齐,加总时要采用循环展开法。

基本就这么多,你如果对这有兴趣可以去些专业图形编程网站去看看,了解下GPU上的并行编程模型导论,我也只知皮毛。
2010-1-4 21:52
0
雪    币: 232
活跃值: (10)
能力值: ( LV4,RANK:50 )
在线值:
发帖
回帖
粉丝
14
所以在编程时要注意一些地方如;

显卡运算中涉及位对齐的标志,最好不要用块号或线程号。尤其在这次实验中,线性化的矩阵表示,如果要使用线程号来对齐的话,除非线程数和矩阵列数相等或成倍数,否则很有可能造成内存错位,而用原始数据自行对齐效果比较好。


还是没明白为什么最好不要用线程号... 上面那句话说的,应该是将矩阵线性化表示后,使用线程号对其遇到的问题(这也取决于具体算法)
难道这样计算出来的idx会出现重复?
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) c[idx] = a[idx]+ b[idx];

并且如果只有一个block时,threadIdx.x应该是唯一的。

share memory和global memory速度差异这个我知道一点,CSDN的CUDA版上看到过几个类似讨论。
目前CUDA大数库方面资料很少,国内我只搜到《GPU高性能运算之CUDA》作者OpenHero的的一个开发计划,其中有大数计算的事项,只不过好像还在开发中,没有进一步的消息
2010-1-5 09:35
0
雪    币: 259
活跃值: (10)
能力值: ( LV2,RANK:10 )
在线值:
发帖
回帖
粉丝
15
用线程直接访问这些存储体需要几百个存储周期,而访间片上的共享存储体只需要几个存储周期。

张舒的《GPU高性能运算之CUDA》网上好像已经有电子版,你找下。
2010-1-8 07:17
0
雪    币: 259
活跃值: (10)
能力值: ( LV2,RANK:10 )
在线值:
发帖
回帖
粉丝
16
其实个问题说详细点要从设计构架上看,因为CPU上,都有分支预测单元,包括用BTB(Branch Target Buffer, 分支预测缓冲区)和PHT(Pattern Hi story Table,模式历史表) ,从而提高流水线的利用率。
而GUP却不具备这功能,google发现分支预测只有对指令间并行的架构才有意义,当走到分支指令的时候因为判断结果还没出来,CPU不知道下一条指令在哪里,所以需要预测分支的走向,以避免流水线停顿。而GPU走的是线程并行路线,不要说分支预测了,就是在一个Warp里,32个线程必须走一样的指令路径,如果有分支的话则不同线程只能串行化执行,可以说是比CPU原始的多。GPU现在面临的问题不是分支预测,而是如何动态重组warp,使得一个warp内的线程走相同的指令路径。

而且从耗能的角度看,分支预测消耗的晶体管量太多,NV和AMD出于功耗和成本都无法实现。要是有分支预测,那CPU基本可以退出历史舞台了。
2010-1-8 11:22
0
游客
登录 | 注册 方可回帖
返回
//