Title (English): Design of a Fast Signed Binary Adder with Error Detection
Advisor: Der-Chyuan Lou (婁德權)
Degree: Master's
Institution: Chung Cheng Institute of Technology
Department: Graduate School of Electrical Engineering
Student ID: 881325
Academic Year: 87 (ROC)
Language: Chinese
Pages: 72
Keywords: signed binary digit system; carry propagation delay; ripple carry adder circuit
English Keywords: signed binary digit system; propagation delay; ripple carry adder
Times Cited: 0
[ Abstract ]
Whether a cryptosystem is implemented in software or in hardware, most of its work involves modular multiplication and modular exponentiation. The most basic building block of these operations is the adder, and such an adder must not only be fast but also able to detect errors: a cryptosystem's whole purpose is security, and a fault in its most basic circuits disqualifies it as a high-security system. The traditional binary adder suffers from carry propagation delay, so its computation is slow. The carry look-ahead adder alleviates this problem, but its circuit complexity grows with the number of bits and it demands very large drive currents, which makes it impractical. An effective way to eliminate carry propagation delay is to use a signed binary digit (SBD) number system. Taking SBD as the number system for addition, this thesis proposes a fast adder architecture with error detection, raising the fault tolerance of information-security devices.
[ Abstract (English) ]
Most cryptosystems, whether implemented in software or in hardware, require modular multiplication and modular exponentiation. In the hardware design of a good cryptosystem, the basic circuits must be free of errors. The adder, the basic building block of a cryptosystem, should therefore operate quickly and provide error detection. The traditional binary adder suffers from carry propagation delay and is therefore slow. The carry look-ahead adder reduces this delay but has another disadvantage: its circuit complexity grows with the word size, and the very large fan-in it requires makes it impractical. One effective way to eliminate carry propagation delay is to use the signed binary digit (SBD) system. In this thesis we propose an SBD-based adder that is both fast and capable of error detection; built on this adder, a cryptosystem becomes more robust.
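As a quick illustration (ours, not part of the thesis record): an SBD number draws each digit from {-1, 0, 1}, so the same value has several representations. Writing $\bar{1}$ for the digit $-1$:

$$3 = (011)_2 = (10\bar{1})_{\mathrm{SBD}}, \qquad 4 - 1 = 3 .$$

It is this redundancy that allows each sum digit to be chosen so that any carry is absorbed by the neighboring position, making the addition time independent of the word length.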
When A, B, and the carry-in are all HIGH, a sum of 1 and a carry-out are produced. First, consider A and B. When both are HIGH, the output of gate 1 is LOW, and the output of gate 2 is HIGH, giving us a carry-out at gate 5. The carry-in produces a 1 output at gate 3, giving us a sum of 1. The output of the full adder is 11₂. The sum of 1₂ plus 1₂ plus 1₂ is 11₂.
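The same behavior can be written directly as Boolean expressions. The short C sketch below is ours, not part of the original text; the gate numbering above refers to a specific figure, so here we keep only the logic.

#include <stdio.h>

/* One full adder: sum = A xor B xor Cin, carry-out = AB + Cin(A xor B). */
static void full_adder(int a, int b, int cin, int *sum, int *cout)
{
    *sum  = a ^ b ^ cin;
    *cout = (a & b) | (cin & (a ^ b));
}

int main(void)
{
    int sum, cout;
    full_adder(1, 1, 1, &sum, &cout);            /* all three inputs HIGH */
    printf("carry-out=%d sum=%d\n", cout, sum);  /* prints 1 1, i.e. 11 in binary */
    return 0;
}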
PARALLEL ADDERS
The adders discussed in the previous section have been limited to adding single-digit binary numbers and carries. The largest sum that can be obtained using a full adder is 11₂. Parallel adders let us add multiple-digit numbers. If we place full adders in parallel, we can add two- or four-digit numbers or any other size desired. Figure 3-9 uses STANDARD SYMBOLS to show a parallel adder capable of adding two two-digit binary numbers. In previous discussions we have depicted circuits with individual logic gates shown. Standard symbols (blocks) allow us to analyze circuits with inputs and outputs only. One standard symbol may actually contain many and various types of gates and circuits. The addend would be input on the A inputs (A2 = MSD, A1 = LSD), and the augend input on the B inputs (B2 = MSD, B1 = LSD). For this explanation we will assume there is no input to C0 (carry from a previous circuit).
Figure 3-9. —Parallel binary adder.
Now let's add some two-digit numbers. To add 10₂ (addend) and 01₂ (augend), assume there are numbers at the appropriate inputs. The addend inputs will be 1 on A2 and 0 on A1. The augend inputs will be 0 on B2 and 1 on B1. Working from right to left, as we do in normal addition, let's calculate the outputs of each full adder. With A1 at 0 and B1 at 1, the output of adder 1 will be a sum (S1) of 1 with no carry (C1). Since A2 is 1 and B2 is 0, we have a sum (S2) of 1 with no carry (C2) from adder 1. To determine the sum, read the outputs (C2, S2, and S1) from left to right. In this case, C2 = 0, S2 = 1, and S1 = 1. The sum, then, of 10₂ and 01₂ is 011₂ or 11₂. To add 11₂ and 01₂, assume one number is applied to A1 and A2, and the other to B1 and B2, as shown in figure 3-10. Adder 1 produces a sum (S1) of 0 and a carry (C1) of 1. Adder 2 gives us a sum (S2) of 0 and a carry (C2) of 1, so the sum of 11₂ and 01₂ is 100₂.
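To make the chaining concrete, here is a small C sketch (ours, not from the original text) that ripples the carry through a chain of full adders, mirroring the two-digit example above; array index 0 is the LSD.

#include <stdio.h>

/* Add two n-bit binary numbers held as arrays of 0/1 digits,
   index 0 = least significant digit; returns the final carry-out. */
static int parallel_add(const int *a, const int *b, int *s, int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        s[i]  = a[i] ^ b[i] ^ carry;                     /* sum bit   */
        carry = (a[i] & b[i]) | (carry & (a[i] ^ b[i])); /* carry bit */
    }
    return carry;
}

int main(void)
{
    int a[2] = {1, 1};  /* 11 in binary: a[1] = MSD, a[0] = LSD */
    int b[2] = {1, 0};  /* 01 in binary */
    int s[2], c;
    c = parallel_add(a, b, s, 2);
    printf("%d%d%d\n", c, s[1], s[0]);  /* prints 100 */
    return 0;
}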
1. Introduction
The saying goes that if you can count, you can control. Addition is a fundamental operation for any digital system, digital signal processing, or control system. The speed and accuracy of a digital system are strongly influenced by the performance of its resident adders. Adders are also very important components of digital systems because of their extensive use in other basic operations such as subtraction, multiplication, and division. Hence, improving the performance of the digital adder would greatly speed up the execution of binary operations inside a circuit composed of such blocks. The performance of a digital circuit block is gauged by analyzing its power dissipation, layout area, and operating speed.
2. Types of Adders
In this lecture we will review the implementation techniques of several types of adders and study their characteristics and performance; a short sketch of the carry look-ahead adder follows the list.
These are:
1) Ripple carry adder (carry propagate adder)
2) Carry look-ahead adder
3) Carry skip adder
4) Manchester chain adder
5) Carry select adder
6) Prefix adder
7) Multi-operand adder
8) Carry save adder
9) Pipelined parallel adder
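As a flavor of the second entry, here is a small C sketch (ours, not from the lecture) of a 4-bit carry look-ahead stage: every carry is computed directly from generate (g = ab) and propagate (p = a xor b) signals in two gate levels, rather than waiting for a ripple, and the widening product terms show why the required fan-in grows with the word size.

#include <stdio.h>

/* 4-bit carry look-ahead: carries computed directly from
   generate (g = a AND b) and propagate (p = a XOR b) signals. */
static int cla4(const int *a, const int *b, int *s, int c0)
{
    int g[4], p[4], c[5];
    for (int i = 0; i < 4; i++) {
        g[i] = a[i] & b[i];
        p[i] = a[i] ^ b[i];
    }
    c[0] = c0;
    /* Each carry is a two-level AND-OR expression of g, p, and c0;
       note how the product terms grow with the bit position, which
       is the fan-in problem noted earlier. */
    c[1] = g[0] | (p[0] & c[0]);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    for (int i = 0; i < 4; i++)
        s[i] = p[i] ^ c[i];
    return c[4];
}

int main(void)
{
    int a[4] = {1, 0, 1, 0};  /* 0101 = 5, index 0 = LSD */
    int b[4] = {1, 1, 0, 0};  /* 0011 = 3 */
    int s[4];
    int cout = cla4(a, b, s, 0);
    printf("%d%d%d%d%d\n", cout, s[3], s[2], s[1], s[0]);  /* prints 01000, i.e. 8 */
    return 0;
}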
The following is a description of a basic single-processor code and its transition into a parallel code. We hope this example of a simple parallelization process serves to demonstrate what steps and calls are needed for codes whose work can be performed independently of one another. A simple "divide and conquer" approach is appropriate here.
To the experienced parallel programmer, this example is a "trivially parallelizable" code because the entire work of the code can be partitioned between the processors very simply. That is the point of this discussion: we are demonstrating how one might go about converting this kind of single-processor code into a parallel code.
This technique could apply to your situation in at least three possible ways. First, this technique is applicable to problems such as minimax searches for the best move in a game such as chess, Monte Carlo techniques, or computationally intensive searches through a large parameter space. If you are fortunate enough to have a code of this category, then you need look little further than this technique. Second, selected components of a code could also fall under this category, so this technique could be usefully applied to those portions of a single-processor code. Finally, if this is your first look at parallel code, this example serves as an excellent starting point for learning how to write and think about parallel codes.
Before: Single-Processor adder
We begin with a code utilizing single-processor execution. This example, adder, sums the square of each integer from 1 to 10000, inclusive. It is organized into three routines:
1) kernelroutine - performs the elemental work of squaring each integer
2) computeloop - allocates memory, loops over the integers, calls kernelroutine, and saves and sums their results
3) main - performs any required initialization, calls computeloop, saves its results into a file, printf's the sum, and deallocates memory
The code is structured to clearly separate out the pieces of work that are secondary yet essential to solving the problem, such as allocating the appropriate memory and saving data to an output file. Besides identifying where these functions are performed, this structure makes it obvious that a much more complicated problem could be substituted for the kernelroutine shown. We also intend this explicit structure to represent corresponding portions of a much larger code. The C source is shown in Listing 1.
Listing 1 - adder.c
See also adder.f90
--------------------------------------------------------------------------------
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input);
double kernelroutine(double input)
{
return (input+1)*(input+1);
}
/* HighestIndex specifies the highest index to sample */
#define HighestIndex 10000L
void computeloop(double *theSum, double **theArray);
void computeloop(double *theSum, double **theArray)
{
/* local copies of the data to be output */
double *myArray, mySum=0;
/* limit of the loop */
long loopEnd=HighestIndex;
/* allocate an array to hold all the results */
myArray=malloc(sizeof(double)*loopEnd);
if (myArray) {
long index;
/* loop over indices */
for(index=0; index<loopEnd; index++) {
/* call the kernel routine for each index, and save into the array */
myArray[index]=kernelroutine(index);
/* sum as we go */
mySum+=myArray[index];
}
}
/* return the sum and the array */
*theSum=mySum;
*theArray=myArray;
}
int main(int argc, char *argv[])
{
/* main copies of the sum and the array */
double theSum, *theArray=NULL;
printf("Beginning computation...\n");
computeloop(&theSum, &theArray);
if (theArray) {/* error checking */
FILE *fp;
/* save the array into a data file */
fp=fopen("output", "w");
if (fp) {
printf("writing array...\n");
fwrite(theArray, sizeof(double), HighestIndex, fp);
fclose(fp);
}
printf("the sum is %f\n", theSum);
/* clean up after myself */
free(theArray);
}
else
printf("memory allocation failure\n");
return 0;
}
When this code is run with its HighestIndex constant set to 10000, the code reports that the sum is 333383335000 and produces an 80,000-byte binary file containing the squares of the integers 1 through 10000.
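As a quick check (ours, not part of the original text), the reported sum agrees with the closed form for a sum of squares, and 10,000 doubles of 8 bytes each account for the 80,000-byte file:

$$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}, \qquad \frac{10000 \cdot 10001 \cdot 20001}{6} = 333{,}383{,}335{,}000 .$$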
After: paralleladder
A key to parallelizing an application is choosing the appropriate partitioning of the problem. The obvious choice here is to divide the number of integers by the number of processors. However, a few details must be worked out: What if the number of processors doesn't divide evenly into the number of integers? How does each processor know which integers to work on, so that no integer is done twice and none is missed? Once they have their results, how are the numbers on one processor summed with those on another, and how do these processors save the data into one file?
The parallelization process here is primarily a matter of managing and delegating data and computation. Changes to main are limited to "boilerplate" code that prepares and tears down the parallel environment, passes sufficient information to computeloop so that it can organize the work, and adds an if test so that only processor 0 creates an output file. The most important modifications are in computeloop, which performs the coordination and calculates the partitioning between processors. Given the identification number of the processor it is running on and the number of processors in the system, computeloop calculates how to partition the problem among the processors. To minimize bottlenecks, the executable running on each processor calculates its own assignment, without a central authority delegating assignments. Once it finishes its work, it collects the data back to processor 0 so that main can write the output in one convenient place for the user.
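To make the partitioning concrete (our numbers, not from the text): with HighestIndex = 10000 and nproc = 3, loopEnd = (10000 + 3 - 1)/3 = 3334 and the offsets are 0, 3334, and 6668. Processor 2 nominally covers indices 6668 through 10001, two more than exist; the index+offset<HighestIndex test in the listing below keeps those two out of the sum, at the cost of two wasted kernelroutine calls.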
The detailed answer is in the code. Listing 2 shows the C source of paralleladder.c; the changes relative to adder.c are itemized after the listing.
Listing 2 - paralleladder.c
See also paralleladder.f90
--------------------------------------------------------------------------------
#include <stdio.h> /* standard I/O routines */
#include <stdlib.h> /* standard library routines, such as memory */
#include <math.h> /* math libraries */
#include "mpi.h" /* MPI library */
/* A routine performing elemental work.
This can be replaced with a much larger routine. */
double kernelroutine(double input);
double kernelroutine(double input)
{
return (input+1)*(input+1);
}
/* HighestIndex specifies the highest index to sample */
#define HighestIndex 10000L
void computeloop(double *theSum, double **theArray, int idproc, int nproc);
void computeloop(double *theSum, double **theArray, int idproc, int nproc)
{
/* local copies of the data to be output */
double *myArray, mySum=0;
/* limit of the loop */
long loopEnd=(HighestIndex+nproc-1)/nproc;
/* just this proc's piece */
/* this processor's index offset */
long offset=idproc*loopEnd;
/* allocate an array to hold all the results */
myArray=malloc(sizeof(double)*loopEnd);
if (myArray) {
long index;
/* loop over indices */
for(index=0; index<loopEnd; index++) {
/* call the kernel routine for each index, and save into the array */
myArray[index]=kernelroutine(index+offset);
/* sum as we go */
if (index+offset<HighestIndex)
/* limit to the desired indicies */
mySum+=myArray[index];
}
/* proc 0 needs to hold the entire array */
{double *bigArray=idproc?NULL:malloc(sizeof(double)*loopEnd*nproc);
/* gathers the data from the other arrays ... */
MPI_Gather(myArray, loopEnd, MPI_DOUBLE,
bigArray, loopEnd, MPI_DOUBLE,
/* ... to proc 0 */ 0, MPI_COMM_WORLD);
if (!idproc) {
free(myArray);
myArray=bigArray;
}
}
/* performs a parallel sum across processors
and saves the result ... */
{double totalSum;
/* MPI does not allow the send and receive buffers of MPI_Reduce
to overlap, so the total arrives in a separate variable */
MPI_Reduce(&mySum, &totalSum, 1, MPI_DOUBLE, MPI_SUM,
/* ... at proc 0 */ 0, MPI_COMM_WORLD);
if (!idproc) mySum=totalSum;
}
}
/* return the sum and the array */
*theSum=mySum;
*theArray=myArray;
}
int ppinit(int argc, char *argv[], int *idproc, int *nproc);
void ppexit(void);
int main(int argc, char *argv[])
{
/* main copies of the sum and the array */
double theSum, *theArray=NULL;
int idproc, nproc, ierr;
/* initialize parallel processing */
ierr = ppinit(argc, argv, &idproc, &nproc);
if (ierr) return ierr; /* stop right there if there's a problem */
printf("I'm processor #%d in a %d-processor cluster.\n", idproc, nproc);
printf("Beginning computation...\n");
computeloop(&theSum, &theArray, idproc, nproc);
if (theArray) {/* error checking */
if (!idproc) {/* only processor 0 */
FILE *fp;
/* save the array into a data file */
fp=fopen("output", "w");
if (fp) {
printf("writing array...\n");
fwrite(theArray, sizeof(double), HighestIndex, fp);
fclose(fp);
}
printf("the sum is %f\n", theSum);
}
/* clean up after myself */
free(theArray);
}
else
printf("memory allocation failure\n");
/* only proc 0 pauses for user exit */
if (!idproc) {
printf("press return to continue\n");
getc(stdin);
}
ppexit();
return 0;
}
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
#include <SIOUX.h>
#endif
int ppinit(int argc, char *argv[], int *idproc, int *nproc)
{
/* this subroutine initializes parallel processing
idproc = processor id
nproc = number of real or virtual processors obtained */
int ierr;
*nproc=0;
/* initialize the MPI execution environment */
ierr = MPI_Init(&argc, &argv);
if (!ierr) {
/* determine the rank of the calling process in the communicator */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, idproc);
/* determine the size of the group associated with the communicator */
ierr = MPI_Comm_size(MPI_COMM_WORLD, nproc);
#ifdef __MWERKS__
/* only for Metrowerks CodeWarrior */
SIOUXSettings.asktosaveonclose=0;
SIOUXSettings.autocloseonquit=1;
#endif
}
return ierr;
}
void ppexit(void)
{
/* this subroutine terminates parallel processing */
int ierr;
/* terminate MPI execution environment */
ierr = MPI_Finalize();
}
The changes from adder to paralleladder:
mpi.h - the header file for the MPI library, required to access information about the parallel system and perform communication
idproc, nproc - nproc describes how many processors are currently running this job, and idproc identifies this processor's designation, labeled from 0 to nproc - 1. This information is sufficient to identify exactly which part of the problem this instance of the executable should work on. In this case, these variables are supplied to computeloop by main.
loopEnd=(HighestIndex+nproc-1)/nproc; - loopEnd describes how many integers this processor should run kernelroutine on, playing the role that HighestIndex did in adder. Here, we choose to have each processor perform the same amount of work, while the total amount of work is at least that necessary to complete the problem. Depending on HighestIndex and nproc, the last processor might do a little too much, but it would otherwise just end up waiting for the others to catch up, and the waste is small if HighestIndex is sufficiently large.
offset=idproc*loopEnd; - offset shifts the start of the sampling for this particular processor; it is how each processor knows not to overlap its work with the other processors while skipping none of the needed work. Together with loopEnd, this is sufficient information to specify this processor's partition of the work.
the malloc - performs the memory allocation for each processor. Except on processor 0, the allocation is just big enough to hold the results of this processor's partition of work. Processor 0 must create an array big enough to hold the results of its own work and that of every other processor, so the data can be saved later.
if (index+offset<HighestIndex) - limits the summation to only the indices we want in the sum
MPI_COMM_WORLD - MPI defines communicator worlds or communicators that define a set of processors that can communicate with each other. At initialization, one communicator, MPI_COMM_WORLD, covers all the processors in the system. Other MPI calls can define arbitrary subsets of MPI_COMM_WORLD, making it possible to confine a code to a particular processor subset just by passing it the appropriate communicator. In simpler cases such as this, using MPI_COMM_WORLD is sufficient.
MPI_Gather - fills bigArray of processor 0 with the data in myArray from the other processors in preparation for saving into one data file. myArray, loopEnd, and MPI_DOUBLE specify the first element, size, and data type of the array. 0 specifies where the data is to be collected. This call can be considered a "convenience" routine that simplifies the process of "gathering" regularly organized data from other processors.
MPI_Reduce - computes the sum, specified by MPI_SUM, of the mySum variable on each processor and collects the result at processor 0 (in the code above it arrives in a separate variable, totalSum, because MPI does not allow the same buffer to be used for both send and receive). This call can sum an entire array of values, so mySum is passed as an array of length one. When supplied MPI_MAX or MPI_MIN, it can also perform other operations, such as comparing values across processors and retaining the maximum or minimum.
ppinit - performs initialization of the parallel computing environment. Part of the boilerplate for MPI codes, this routine calls:
MPI_Init - performs the actual initialization of MPI. It returns the error code; zero means no error.
MPI_Comm_size - accesses the processor count of the parallel system
MPI_Comm_rank - accesses the identification number of this particular processor
ppexit - by calling MPI_Finalize, terminates the parallel computing environment. Also part of the boilerplate of MPI codes, a call to MPI_Finalize is needed to properly clean up and close the connections between codes running on other processors and to release control.
__MWERKS__ - these #ifdef blocks merely compensate for a peculiarity of the Metrowerks CodeWarrior compiler. They allow the executables on the other processors to terminate automatically. If you are using another compiler, you don't have to worry about this code.
Note that kernelroutine did not need to be altered at all. That reflects the independence of the work performed by this routine. The code that organizes and coordinates the work between processors is in computeloop. main's changes merely perform the boilerplate work of setting up and tearing down the parallel environment, plus an if statement so only processor 0 writes the output file.
Something to note is that this code does not explicitly call MPI_Send or MPI_Recv or any other elemental communication calls. Instead it calls MPI_Gather and MPI_Reduce. There isn't anything wrong with using elemental versus collective calls, if either will do the job. In fact, the MPI_Gather could just as easily be replaced with the following:
Listing 3 - an alternative to the above MPI_Gather call
--------------------------------------------------------------------------------
if (idproc)
MPI_Send(myArray, loopEnd, MPI_DOUBLE, 0, idproc, MPI_COMM_WORLD);
else {
long otherProc;
MPI_Status status;
for(index=0; index<loopEnd; index++) bigArray[index]=myArray[index];
for(otherProc=1; otherProc<nproc; otherProc++)
MPI_Recv(&bigArray[otherProc*loopEnd], loopEnd, MPI_DOUBLE,
otherProc, otherProc, MPI_COMM_WORLD, &status);
}
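In the same spirit, the MPI_Reduce call could be replaced with elemental calls. Here is a sketch (ours, following the pattern of Listing 3) in which processor 0 receives and accumulates each partial sum itself:
--------------------------------------------------------------------------------
if (idproc)
    MPI_Send(&mySum, 1, MPI_DOUBLE, 0, idproc, MPI_COMM_WORLD);
else {
    long otherProc;
    MPI_Status status;
    double totalSum=mySum, otherSum;
    for(otherProc=1; otherProc<nproc; otherProc++) {
        /* receive each processor's partial sum and accumulate it */
        MPI_Recv(&otherSum, 1, MPI_DOUBLE,
                 otherProc, otherProc, MPI_COMM_WORLD, &status);
        totalSum+=otherSum;
    }
    mySum=totalSum;
}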
References
Using MPI by William Gropp, Ewing Lusk and Anthony Skjellum is a good reference for the MPI library. In addition, many of the techniques and style of the above parallel code were adopted from Dr. Viktor K. Decyk. He expresses these ideas in a few references, including:
V. K. Decyk, "How to Write (Nearly) Portable Fortran Programs for Parallel Computers", Computers In Physics, 7, p. 418 (1993).
V. K. Decyk, "Skeleton PIC Codes for Parallel Computers", Computer Physics Communications, 87, p. 87 (1995).
You're welcome to read more about parallel computing via the Tutorials page.
Many big-number libraries handle this operation in the same way. In libtommath, for example, a single separate variable, sign, represents the sign of a big number. Assuming a and b are greater than 0:
(a + b) or (-a + -b): call the low-level addition (which ignores the sign and adds the two magnitudes directly); the sign of the result c is the same as a's.
(-a + b) or (a + -b): call the low-level subtraction; the sign of c is set to the sign of whichever of |a| and |b| is larger.
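A minimal C sketch of that dispatch logic (ours; a toy unsigned long long stands in for the digit array, whereas real libtommath code works digit by digit and also normalizes a negative zero):

#include <stdio.h>

/* Toy "big number": a magnitude plus a separate sign flag, mirroring
   how libtommath pairs a digit array with a sign variable. */
typedef struct {
    unsigned long long mag;  /* stands in for the digit array */
    int neg;                 /* 1 if negative */
} bignum;

/* Signed addition c = a + b, dispatching on the sign flags as described above. */
static bignum big_add(bignum a, bignum b)
{
    bignum c;
    if (a.neg == b.neg) {
        /* (a + b) or (-a + -b): add magnitudes, keep a's sign */
        c.mag = a.mag + b.mag;
        c.neg = a.neg;
    } else if (a.mag >= b.mag) {
        /* signs differ, |a| >= |b|: subtract, take a's sign */
        c.mag = a.mag - b.mag;
        c.neg = a.neg;
    } else {
        /* signs differ, |b| > |a|: subtract the other way, take b's sign */
        c.mag = b.mag - a.mag;
        c.neg = b.neg;
    }
    return c;
}

int main(void)
{
    bignum a = {7, 1}, b = {3, 0};                /* -7 + 3 */
    bignum c = big_add(a, b);
    printf("%s%llu\n", c.neg ? "-" : "", c.mag);  /* prints -4 */
    return 0;
}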
To explain this in more detail, one has to look at the design of the architecture: CPUs all have branch prediction units, including a BTB (Branch Target Buffer) and a PHT (Pattern History Table), to improve pipeline utilization.
GPUs, however, lack this capability. A little searching shows that branch prediction only makes sense for architectures that exploit instruction-level parallelism: when execution reaches a branch instruction, the outcome of the test is not yet known, so the CPU does not know where the next instruction is and must predict which way the branch will go to avoid stalling the pipeline. GPUs instead take the thread-parallel route. Never mind branch prediction: within a single warp, all 32 threads must follow the same instruction path, and if there is a branch, the diverging threads can only execute serially; in this respect a GPU is far more primitive than a CPU. The problem GPUs face today is not branch prediction but how to dynamically regroup warps so that the threads within a warp follow the same instruction path.