The theory in this section is only covered briefly. The reader is encouraged to refer to Bruce Schneier’s Applied Cryptography for more details.
A symmetric cipher is characterised by the functions
The intent of a cipher is that the function E be non-invertible without the key. This ensures that the plaintext remains secret to people without the key.
A block cipher is one where data is processed in discrete blocks. The input plaintext or ciphertext is broken up into blocks of the appropriate size. An example is DES, which processes data in 64 bit blocks. A stream cipher is one which works with much smaller units of data – often a single bit at a time. A5/1 is a common stream cipher. Stream ciphers are used to generate a key stream, which is then XORd with the plaintext to produce the ciphertext. XORing the ciphertext with the key stream again will decrypt the data.
In a known plaintext attack, the attacker possesses some ciphertext and the matching plaintext. The goal is to find the key. This is the attack method usually used in research; possessing or being able to infer part of the plaintext is a reasonably safe assumption. E-mail headers and IP packets always begin in the same way, for example.
In a ciphertext only attack, the attacker possesses some ciphertext. These attacks are more difficult to perform. Usually, the attacker relies on some properties of the plaintext to determine when they are successful (such as character distributions or language statistics).
There are two criteria for a symmetric cipher to be considered secure :
In a key search attack, the attacker tries every possible key as input to the cipher. The known piece of ciphertext is also used as an input. In a known plaintext attack, the trial plaintext from the cipher output is compared to the known plaintext. In a ciphertext only attack, heuristics are used to determine if the output is valid plaintext.
Most ciphers consist of a key setup phase and an operation phase. During key setup, the internal state is initialised. During operation, input ciphertext or plaintext is encrypted or decrypted. Key setup only needs to be conducted once for each key that is used.
When a cipher is used in practice, the key is usually kept constant for a long period while the plaintext or ciphertext input is varied frequently. Key setup is performed only once, and the cipher is designed to handle the rapid change in input.
Exhaustive key search reverses this by keeping the input constant while changing the key frequently. The main implication from this is that key setup must be performed very frequently. Many ciphers exploit this to improve resistance to key search attacks by having a very long key setup period. The key setup period is often comprised of the encryption algorithm itself. This greatly increases the time and resources needed to conduct a successful exhaustive key search.
Commercial chips that perform encryption or decryption may not be suitable for use in a key search machine if they are not designed to have the key changed frequently. Conversely, it may be possible to optimise a custom key search design by precomputing (partially evaluating) parts of the algorithm, since the ciphertext is known in advance. This technique has already been used to produce very fast and efficient cipher implementations by including the key in the design itself.
When attacking a block cipher, one output block is usually tested for each key. If the output block matches the known plaintext, tests with more blocks are conducted to verify that the key is correct. The further checking step is important. There may be several keys that give the same plaintext output if the key size is longer than the block size and only a single block is checked.
Stream ciphers are often faster to conduct brute-force attacks against because incorrect keys can be quickly eliminated. A simple approach would be to generate a quantity of the key stream and XOR that with the ciphertext to generate the plaintext. The stream cipher can then be treated in exactly the same way as a block cipher. Efficiency can be slightly improved by ignoring the XOR stage and simply searching for the correct key stream. The amount of key stream to be generated must balance out the number of false alarms with the amount of time taken to check each key. Generating more of the key stream will cut down on false alarms, but take more time.
The main problem with this approach is that it is very inefficient. The entire block of key stream must be generated before it is checked for correctness. The first few bits to be generated may be enough to determine that a key is incorrect. A more efficient algorithm would then be:
With this algorithm, an average of two units of key stream need to be generated for each trial key. This is far more efficient than the simple algorithm, which may need to generate a large amount of key stream to avoid returning an excessive number of potential keys.
As with a block cipher, generating the first unit of key stream may require a lengthy key setup phase be carried out. Many key search attacks avoid this by searching for the initial state of the cipher after the key setup has been completed. This is not always feasible; some stream ciphers such as RC4 have very large internal states.
Most ciphers make use of a number of common operations. These operations typically retain entropy to ensure random-looking output. They may also introduce nonlinearities in the output. Ciphers generally operations from a number of algebraic groups to improve their strength.
|DES||56||Block||Bit permute, rotate, XOR, table lookup (6×4)|||
|RC4||64||Stream||Add, table read and write (8×8), XOR|||
|Rijndael||128/192/256||Block||Table lookup (8×8), rotate, multiply (GF(28)), XOR|||
|3DES||112/168||Block||Bit permute, rotate, XOR, table lookup (6×4)|||
|IDEA||128||Block||XOR, add, multiply (16 bits), rotate|||
|Blowfish||32-448||Block||XOR, add, table read and write (8×32)|||
|RC5||0-2040||Block||XOR, add, variable rotate|||
|Skipjack||80||Block||XOR, shift, add, table lookup (8×8)|||
Table sizes are specified by (x×y), where x is the number of bits in the index and y is the number of bits in the output. For example, 6×4 can be modelled by a RAM with 6 address bits and 4 data bits. Addition is considered to include subtraction as a trivial extension.
By testing the relative speed of each operation on each implementation technology, it should be possible to gain insights into which ciphers will run quickly on which technology.
The Data Encryption Standard (DES) has been extensively used and studied for decades. Several linear and differential attacks against it have been discovered, but the most effective attack remains exhaustive key search. It works with 64 bit blocks and was originally designed for fast hardware implementations. It has been the subject of several contests.
DES has enjoyed widespread military, government and commercial use in the past, notably within the banking and finance sectors. Its key length is only 56 bits, which is considered far too weak nowadays. Many attacks on the key length of DES have been performed, some of which are described below.
DES remains in use through a variant called Triple DES (or 3DES). In 3DES, the DES cipher is applied three times with two or three different keys. This is an effective method of increasing the strength of the DES cipher, but care must be taken during implementation to ensure that all of the keys are different. Clayton and Bond successfully attacked a secure processor (formerly used in ATMs) that utilises 3DES . They exploit protocol flaws to force the processor to use duplicate keys. They can then perform a key search attack on the reduced key space to determine a “master key”.
RC5 is a simple block cipher designed by Ronald Rivest . Despite its simple structure, very few attacks better than exhaustive key search have been discovered. It is fully parameterised, so the block length, number of rounds and key length can be selected to suit the application. It is best known as the cipher being attacked by the RSA Secret Key Challenges . The parameters for the challenges are selected to be efficient on modern CPUs. RC5 is not widely used for commercial applications and has been patented by RSA Data Security.
Software is usually the first platform that a cipher will be implemented with. General-purpose CPUs are cheap, plentiful, and fast for most tasks. Many ciphers are designed for an efficient software implementation.
CPUs are designed to perform serial operations very quickly. Regardless of the amount of available chip area, the need to operate serially remains. This has lead to modern CPUs expending a large amount of area on prediction and caching circuitry. A doubling in chip area for a CPU will not result in a doubling of performance.
Exhaustive key search is highly parallelisable; the task can be split perfectly between any number of processing units. This means that using multiple specialised processing units instead of a single CPU may give higher performance.
Recent CPUs have demonstrated a move towards increasing parallelism. Multiple execution units, deep pipelines, Symmetric Multiprocessing (SMP) and techniques such as Intel’s HyperThreading all serve to increase the level of parallelism.
Eli Biham pioneered a technique which later became known as bitslicing . The paper deals primarily with its application to the DES cipher, but it is applicable to many other algorithms. In it, each register of a CPU is viewed as a large number of single-bit registers. This allows a large number of single-bit operations to be performed in parallel. For an algorithm such as DES which is composed largely of single-bit operations, this provides a very large performance gain.
An ASIC (Application Specific Integrated Circuit) is a chip that has been designed for a particular purpose. They are usually very fast for whatever application they have been designed for, but cannot be modified after fabrication. Initial fabrication costs are very high, but can be amortized over many chips. The unit price per chip is usually quite low. There is a far greater development effort required than for software, and more than for FPGAs. Gate array designs reduce the high cost of development significantly, but reduce achievable density. They work by placing gates over the entire silicon area of a device during fabrication and linking the gates with metal layers later.
FPGA designs can be converted to equivalent gate array ASIC designs at a relatively low cost. This technique was used for the machine described in . Designs implemented in this way tend to be faster and cheaper than those on FPGAs, but not as fast as a dedicated ASIC design. It is an attractive option where development time and cost are important and the number of FPGAs needed make the implementation cost prohibitive. Many of the issues affecting FPGA designs (such as timing) also apply to ASICs. Routing tends to be much less problematic on an ASIC compared with an FPGA.
FPGAs (Field Programmable Gate Arrays) combine software and hardware approaches. They are chips that can have their internal layout reconfigured at any time. This is usually achieved by loading a bitstream from a ROM or controller. An FPGA typically contains a large number of logic blocks. “Programming” an FPGA determines how the logic blocks are wired together. Modern FPGAs may contain tens of thousands of logic blocks, each of which contains latches, combinational functions and other logic. High-end FPGAs such as the Virtex II Pro  even integrate CPU cores within the normal FPGA fabric. There is a move to providing dedicated hardware within FPGAs, such as RAM and communication controllers.
FPGA performance approaches that of an ASIC, but their general structure makes them slower and less efficient. Much less can be done within the same amount of silicon area. They also generate more heat and use more power than an equivalent ASIC. FPGAs do not have the high up-front cost of an ASIC, but cost more per unit. They are ideal for prototyping and development, since they can be reprogrammed quickly at no cost.
Developing a design for an FPGA is generally more time consuming than writing normal software due to the low-level nature of the design. There are also far fewer competent FPGA designers than software programmers, increasing development cost and time.
The Pilchard development board
Pilchard is a low-cost FPGA development board . It contains a Xilinx Virtex E device; the device used for this thesis was an XCV1000E-6HQ240. The Pilchard board and FPGA device used are shown above.
Pilchard interfaces to the RAM bus of certain motherboards. From the perspective of the FPGA designer, it provides a simple synchronous 100MHz or 133MHz bus with no interrupts or DMA. It appears as memory range to the programmer, so registers within the FPGA can map directly to variables in the driver software. This combination allows easy system development, high performance and low FPGA resource requirements.
A number of external I/O pins are available that can be interfaced to other hardware. Space is also available on the circuit board for ROM chips that can load a bitstream into the FPGA device on power up.
Software, FPGA and ASIC components can be profitably combined. All known FPGA and ASIC-based key search machines are controlled by a general purpose computer.
One useful idea is to use each technology for the task it excels at. For example, a machine using all three technologies might use a computer as a primary controller, FPGAs as lower-level controllers, and ASICs for each search unit. The computer is useful for human interface and easy reconfigurability. The FPGAs are useful for their high speed, I/O capabilities and reconfigurability. ASICs have the advantages of very high speed and low price in very large quantities. This scheme is particularly desirable for ciphers where suitable ASICs are available commercially, reducing fabrication costs.
Another possibility is to perform hardware-fast operations on ASICs and FPGAs, and software-fast operations on computers. This is generally infeasible due to the “I/O gap” – latencies between the ASIC/FPGA and the CPU far outweigh any speed benefit. Pilchard interfaces with the memory bus of a computer and can thus provide a very high bandwidth and low latency connection to the CPU. The integrated CPUs in Virtex II Pro FPGAs also provide similar benefits.
The integrated CPUs on Virtex II Pro FPGAs provide another option for ciphers favouring software implementations. Each Virtex II Pro FPGA contains up to four PowerPC 405 cores . These CPUs could be used to perform the bulk of the cipher operations while the surrounding FPGA fabric handles control, communication and testing of results. A very large number of CPUs could be integrated into a small space using this technique.
Most previous hardware key search machines have been designed to locate DES keys. This is because DES is very fast in hardware, widely deployed and has a dangerously short key length. There are also political issues involved with DES and its selection as a standard.
Many hardware key search machines have been produced in the past. Most of these are not practical machines. They are used to gather performance estimates with a certain technology or technique. More key search machines are known to exist; only those with notable features have been presented in this section.
Most of these performance figures have caveats; they may be estimates, approximations, or based on implementations which were not completely carried out.
A number of papers provide theoretical estimates of the cost of breaking ciphers with a hardware key search engine. Minimal Key Lengths for Symmetric Ciphers to Provide Adequate Commercial Security  is a prime example of this. It lacks practical grounding, but is still contains useful estimates and background. In 1996, it predicts that a $200 FPGA (AT&T ORCA) can test 30 million DES keys per second, and that a $10 ASIC can test 200 million DES keys per second. For $300,000, an FPGA-based machine should be able to crack a DES key every 19 days, and an ASIC-based machine every three hours.
Wiener’s Efficient DES Key Search  describes a theoretical hardware DES key search machine based around a custom ASIC. The machine was designed, but not built. Just about every detail of the machine was described, including the schematics, interfaces and physical requirements. Each ASIC in the design is estimated to be able to check 50 million keys per second. Pipelined search units and an LFSR key generator are used. In 1993, a machine costing $100,000 is estimated to be able to crack a DES key every 35 hours, on average. These estimates were updated in 1997 to take newer technology and further analysis into account . In this paper, a $100,000 machine should be capable of cracking a DES key in six hours. The speed estimates given in this paper are the basis of those presented in .
McLaughlin presented a high-level design for a DES key search machine . The paper ignores low-level issues and focuses on the high-level functionality of the machine. Its main features are the use of a fuzzy comparer and specialist key generators.
Diffie and Hellman produced a paper in 1977 that counters objections to the possibility of a key search machine . Objections to the reliability, size, speed, power and cost of a key search machine were countered, and a system architecture based around a million search chips presented. Despite the (comparatively) primitive technology available at the time, a key search machine is still believed to be feasible.
The machine was based around a very large number of search units. Each search unit takes 16 clock cycles to check a DES key. 24 search units were built into a custom ASIC design that ran at 40MHz. 64 ASICs were placed on each circuit board and 27 boards constructed. Taking into account faulty search units, the entire machine was capable of a search rate of 92.6 billion keys per second, or an average search time of 4.5 days. The machine was built with a budget of $250,000.
A flexible plaintext recognition scheme was used that allows selective matching against certain characters, as well as specialised modes for the Blaze Challenge . This allowed the machine to conduct ciphertext-only attacks.
Kocher further elaborated on the technical problems inherent with building such a large machine. Power and heat issues were the main ones dealt with. The ASICs used had to be produced successfully with a single attempt, leading to a number of design compromises. Had this requirement been lifted, both the performance and correctness of the design could have been improved significantly.
Many of the political issues involved with DES were also covered. These focused primarily on the disparity between what government officials report and what the DES cracking machine is capable of.
Many small key search engines have been produced in an attempt to gauge how processing power has changed with time. These are all based on reconfigurable hardware (FPGAs or CPLDs).
Hamer and Chow implemented a DES key search machine on the Transmogrifier 2a, a system containing 32 linked Altera FPGAs . Their design features a long DES pipeline and an LFSR key generator design that minimises the need for key schedule logic. Each FPGA ran at 25MHz, giving an aggregate search rate of 800Mkeys/sec.
Tsoi, Lee and Leong implemented an RC4 key search machine  using a Pilchard board. Their design used 96 search units running at 50MHz to achieve a total rate of 6.06Mkeys/sec. Not all of the FPGA resources were used; the number of search units was limited by the number of RAM blocks available. The FPGA implementation ran approximately 58 times faster than a software implementation running on a Pentium 4 1500MHz. Kundarewich, Wilton and Hu also implemented an RC4 key search machine using Altera CPLDs, and obtained a search rate of 40kkeys/sec . Brief cost and performance comparisons were carried out.
Deeper investigation into architectural decisions was made by Kaps and Paar . They explored the idea of an algorithm independent key search machine on an FPGA, focusing on DES. Several architectural options for DES were investigated and implemented on Xilinx FPGAs. Their key search design would be capable of 6.29Mkeys/sec.
Goldberg and Wagner performed an analysis of RC4, A5, DES and CDMF implementations on a CPLD board . The performance of a variety of CPLDs was compared with their cost, noting that low end CPLDs generally give the best price/performance ratio. Comparisons were also made against equivalent software implementations. The RC4 cipher was found to be faster in software than hardware, the opposite result to that of  and .
Clayton and Bond exploited a variety of protocol flaws to successfully attack a security module that was previously used in ATMs . They were able to successfully recover 3DES keys from the device with the assistance of a cheap FPGA board. By implementing a more practical attack they were able to learn more about the difficulties and benefits of working within a real environment.
Pornin and Stern attacked A5/1 using a combination of software and hardware approaches . Software was used to reduce the search space of initial states, while hardware was used to conduct an exhaustive search over this subspace. A board containing four Xilinx 4010E FPGAs was used in conjunction with a 500MHz Alpha workstation. Each FPGA contains 12 search units, each checking one initial state every 65 cycles. The FPGAs were clocked at 50MHz, giving a total search rate of 37 million initial states per second. Using two boards with one workstation allowed an initial state to be determined in 2.5 days on average, far faster than exhaustive key search alone. Keller and Seitz took a more analytical approach by using backtracking to reduce the search space . Their implementation was performed on a Xilinx XC4062 FPGA.
Several organisations have implemented software to conduct distributed key search attacks against ciphers using network-connected hosts. distributed.net  is the largest and most well-known of these. Others include DESCHALL  and SolNET . The basic idea is the same: run a piece of software on many hosts and coordinate their efforts with a central server. The software is configured to run during idle time on the hosts. Buffering schemes allow hosts to continue working on their part of the task when not connected to a network. Regardless of the precise task being performed, work is usually divided into “blocks” which take a (relatively) short period of time to complete. A client connects to the server to be allocated a number of blocks and does not communicate again until those blocks are complete.
These efforts have been quite successful so far. distributed.net has successfully completed RSA Data Security’s RC5-64, RC5-56, DES II-1 challenges , as well as a similar challenge from CS Communications & Systems . They completed the DES-III challenge with the help of the EFF DES cracker. DESCHALL completed the DES-I challenge. A group headed by Germano Caronni and containing 3500 computers completed the RC5-48 challenge. Ian Goldberg used 250 computers to complete RC5-40 .
 M. Blaze, W. Diffie, R. L. Rivest, B. Schneier, T. Shimomura, E. Thompson, and M. Wiener, “Minimal key lengths for symmetric ciphers to provide adequate commercial security,” A Report by an Ad Hoc Group of Cryptographers and Computer Scientists, January 1996. [Online]. Available: http://www.schneier.com/paper-keylength.pdf
 B. Schneier, “Description of a new variable-length key, 64-bit block cipher (Blowfish), in Lecture Notes in Computer Science, no. 809. Springer-Verlag, 1994, pp. 191 204. [Online]. Available: http://www.schneier.com/paper-blowfish-fse.html
 J. Keller and B. Seitz, “A hardware-based attack on the A5/1 stream cipher,” in APC 2001. VDE Verlag, 2001, pp. 155-158. [Online]. Available: http://www.informatik.fernuni-hagen.de/ti2/papers/apc2001-final.pdf
 National Security Agency. (1998, May) Skipjack and KEA algorithm specifications. [Online]. Available: http://csrc.nist.gov/encryption/skipjack/skipjack.pdf
 E. Biham, “A fast new DES implementation in software,” Lecture Notes in Computer Science, vol. 1267, pp. 260 ??, 1997. [Online]. Available: http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/1997/CS/CS08%91.ps.gz
 Xilinx, Inc., “Virtex-II Pro complete data sheet,” September 2003, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
 P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee, “Pilchard - a reconfigurable computing platform with memory slot interface,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2001. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/pilchard_fccm01.pdf
 I. Goldberg and D. Wagner, “Architectural considerations for cryptanalytic hardware,” CS252 Report, 1996. [Online]. Available: http://www.cs.berkeley.edu/~iang/isaac/hardware/paper.ps
 M. J. Wiener, “Efficient DES key search - an update,” in Cryptobytes, RSA Laboratories, Ed., 1997, vol. 3, no. 2, pp. 6 8. [Online]. Available: ftp://ftp.rsasecurity.com/pub/cryptobytes/crypto3n2.pdf
 I. Hamer and P. Chow, “DES cracking on the Transmogrifier 2a,” in Lecture Notes in Computer Science, ser. Cryptographic Hardware and Embedded Systems. Springer-Verlag, 1999, no. 1717, pp. 13 24. [Online]. Available: http://www.eecg.toronto.edu/~pc/research/publications/des.ches99.ps.gz
 P. D. Kundarewich, S. J. Wilton, and A. J. Hu, “A cpld-based rc4 cracking system,” in Canadian Conference on Electrical and Computer Engineering, 1999. [Online]. Available: http://www.ee.ubc.ca/~stevew/papers/pdf/ccece99.pdf
 K. L. K.H. Tsoi and P. Leong, “A massively parallel RC4 key search engine,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002, pp. 13 21. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/vrvw_fccm02.pdf
 P. Kocher, “Breaking DES,” in Cryptobytes, RSA Laboratories, Ed., 1999, vol. 4, no. 2, pp. 1 5. [Online]. Available: ftp://ftp.rsasecurity.com/pub/cryptobytes/crypto4n2.pdf
 M. Blaze. (1997, June) A better DES challenge. [Online]. Available: http://www.privacy.nb.ca/cryptography/archives/cryptography/html/1997-0%6/0127.html
 R. Clayton and M. Bond. Experience using a low-cost fpga design to crack des keys. [Online]. Available: http://www.cl.cam.ac.uk/users/rnc1/descrack/DEScracker.html
 T. Pornin and J. Stern, “Software-hardware trade-offs; application to A5/1 cryptanalysis,” in Lecture Notes in Computer Science, ser. CHES 99. Springer-Verlag, 2000, pp. 318 327. [Online]. Available: http://www.di.ens.fr/~stern/data/St91.pdf
 (2003, October) distributed.net: Node Zero. [Online]. Available: http://www.distributed.net/
 C. M. Curtin. (1998, June) DESCHALL. [Online]. Available: http://www.interhack.net/projects/deschall/
 The RSA Laboratories Secret-Key Challenge. RSA Security. [Online]. Available: http://www.rsasecurity.com/rsalabs/challenges/secretkey/index.html
 (2003, October) distributed.net: Project CSC. [Online]. Available: http://www.distributed.net/csc/
 (1997, January) 40-bit crypto proves no problem. [Online]. Available: http://news.com.com/2100-1017-266268.html?legacy=cnet