Most key search machines are designed around similar ideas. A controller operates a number of independent search units. This controller usually interfaces with a general purpose computer. Each search unit contains a key generator, decryptor (or encryptor) and comparator. The key generator produces trial keys that need to be checked. Some designs combine the key generator and decryptor modules to improve performance. The decryptor decrypts the known ciphertext with the trial key. An encryptor can also be used with some limitations. The comparator checks the plaintext that is generated by the decryptor to see if it is correct. If it is, the controller is signalled.
Due to its complexity, the cipher module is usually considered to be the bottleneck in the system. All other modules must be able to operate at least as quickly.
Conceptual key search machine design
A counter is an obvious choice for a key generator. The count sequence is predictable, but performance is not adequate on all devices. Xilinx FPGAs provide a dedicated carry chain which improves performance significantly.
The EFF DES cracker  uses a counter where the 24 most significant bits are held constant and the 32 least significant bits counted. This technique is useful to reduce a counter’s propagation delay. The most significant bits must be counted and loaded externally. This scheme introduces the idea of a “block” of keys – a subset of the key space which can be searched in a short period of time. The 24 constant bits can be viewed as the block number. There must be a mechanism with which the controller can detect the end of block condition and start the key generator on a new block.
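In software terms, the block scheme can be sketched as follows. This is a hypothetical model, not EFF's implementation; the 24/32 bit split and the `block_counter` names simply mirror the description above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of an EFF-style block counter: the 24 bit block
 * number is held constant while the low 32 bits count. */
typedef struct {
    uint32_t block;   /* 24 bit block number, loaded by the controller */
    uint32_t low;     /* 32 bit counting portion */
    bool     done;    /* end-of-block flag, polled by the controller */
} block_counter;

static void counter_load(block_counter *c, uint32_t block) {
    c->block = block & 0xFFFFFF;   /* keep only 24 bits */
    c->low   = 0;
    c->done  = false;
}

/* Produce the next 56 bit trial key; raise `done` when the block is
 * exhausted so the controller can load a new block number. */
static uint64_t counter_next(block_counter *c) {
    uint64_t key = ((uint64_t)c->block << 32) | c->low;
    if (c->low == 0xFFFFFFFFu)
        c->done = true;
    c->low++;                      /* wraps to 0 at end of block */
    return key;
}
```

In hardware only the 32 bit increment sits on the critical path; loading a new block number is a rare, slow operation handled by the controller.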
One design  uses a single counter that is shared between all of the available search units. Each unit adds or concatenates a unique ID to the counter value to obtain its trial key. This scheme works well when the number of search units is a power of two since the ID can simply be concatenated, saving resources.
A number of designs , ,  use Linear Feedback Shift Registers (LFSRs) to generate trial keys. The main advantage of an LFSR over a counter is its high speed; propagation delays remain constant regardless of the length of the LFSR. One disadvantage is that an LFSR's output sequence does not follow numerical order, so evenly dividing a large key space between search units requires more effort than with a counter. One simple scheme is to use an LFSR shorter than the key and set the remainder of the key bits to a constant value. This works similarly to the block scheme for counters described above; the LFSR can be 32 bits long, with the remaining 24 bits set by the controller.
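A software model of such a generator might look like this. The polynomial and the `lfsr_step`/`next_trial_key` names are illustrative, not taken from any of the cited designs:

```c
#include <stdint.h>

/* Hypothetical 32 bit Galois LFSR key generator; the upper 24 key
 * bits are held constant (the "block number"), as in the counter
 * scheme. Polynomial x^32 + x^22 + x^2 + x + 1, a maximal-length
 * choice, so all 2^32 - 1 nonzero states are visited. */
#define LFSR_POLY 0x80200003u

static uint32_t lfsr_step(uint32_t state) {
    uint32_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= LFSR_POLY;    /* feedback taps applied in one XOR */
    return state;
}

static uint64_t next_trial_key(uint32_t *state, uint32_t block) {
    *state = lfsr_step(*state);
    return ((uint64_t)(block & 0xFFFFFF) << 32) | *state;
}
```

Note that a maximal-length LFSR never visits the all-zero state, so a real machine would have to check the one key with a zero low word separately.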
When performing a known plaintext attack, the choice of encryptor or decryptor is dependent on which has higher performance. Most ciphers (including all stream ciphers) have identical performance regardless of their mode. Some may have a more efficient implementation when implemented in one way or another. RC5  is an example of this. An RC5 encryptor can operate more efficiently than a decryptor because the order that the S array is used in during key setup matches that used in the encryption stage, allowing the phases to overlap.
Most key search machines will use a decryptor. Ciphertext-only attacks require one, and known-plaintext attacks also require one when the plaintext is imperfectly known, since a flexible comparator must be able to examine the trial plaintext to detect a correct decryption.
Two major approaches are used when implementing the cipher module: a long pipeline or a small iterative module. Most ciphers consist of a number of round functions, making an iterative implementation natural. Pipelined approaches can achieve much greater speeds at the expense of FPGA resources. DES is frequently implemented in either form. The iterated version takes at least 16 cycles (one for each application of the round function) to produce one block of output, while the pipelined version can produce one block of output every clock cycle. The resource savings made by an iterated cipher implementation almost never outweigh the loss of speed, although resource constraints may force a cipher to be implemented in iterative form.
An FPGA-Based Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists  explores these issues in depth. It presents FPGA performance figures for the MARS, RC6, Rijndael, Serpent and Twofish ciphers. Loop unrolling, pipelining and sub-pipelining are investigated as architectural choices. In most cases, a pipelined implementation was fastest. Fast DES Implementation for FPGAs and Its Application to a Universal Key-Search Machine  explores pipelined, combinatorial and iterative approaches for the DES cipher.
Some ciphers may benefit from having values precomputed during compilation time. This is usually used to achieve higher performance in systems that have very infrequent key changes. FPGAs fit well with this approach, allowing the programmed-in key to be changed with only a small period of downtime. The speed efficiency of an FPGA DES implementation was improved significantly using this technique . The utility of this technique in a key search machine is dependent on the cipher. The plaintext or ciphertext would be compiled into the design instead of the key. This may yield improvements for some ciphers, but exploratory experiments only showed very small resource savings.
The environment that the key search machine operates in determines the choice of comparator. If a perfect ciphertext/plaintext pair is known, simply checking for bit equality will be adequate. Ignoring certain bits in the trial plaintext may be a useful extension when only a portion of the sought plaintext is known.
If a ciphertext-only attack will be attempted or the plaintext is not precisely known, it may be necessary to implement a heuristic matching scheme. Such a scheme will generally flag a number of keys as potential matches and allow humans or software to check them further for correctness.
A simple scheme to detect ASCII text is to require that the most significant bit of each plaintext byte be 0. This can be generalised into a statistical approach that scores each byte in the trial plaintext according to its probability of occurrence. A Programmable Plaintext Recognizer  uses similar ideas to extend Wiener’s theoretical key search machine . Applying compression to a message before encrypting it causes these heuristics to fail; this is an effective countermeasure against any statistical comparator, since compression makes the message “look like” random data.
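The MSB heuristic reduces to a single mask test on a 64 bit plaintext word; a minimal sketch (the function name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Flag a match only if the most significant bit of every byte in the
 * 64 bit trial plaintext is clear, i.e. all bytes are 7 bit ASCII. */
static bool looks_like_ascii(uint64_t plaintext) {
    return (plaintext & 0x8080808080808080ULL) == 0;
}
```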
Some applications may also benefit from a specialist comparator. A machine designed to solve the Blaze Challenge  would need a specialist comparator that flags a match on any block fitting the form of the solution (in this case, a plaintext composed only of a single repeated byte).
At some point in a key search machine’s operation it will be necessary to return potential keys to the host computer. Several schemes have been used to achieve this goal.
Most key search machines simply stop running when a match is found and wait for the computer to read out the key value. This is simple and flexible, but inefficient when many keys need to be returned – while a search unit is waiting to release the key it halted on, it cannot be used to search the key space.
A hardware buffer can be used to reduce the waiting time. When a key needs to be returned it is read into the hardware buffer, and the controller can read the keys out. This has the advantage of improved efficiency, but costs hardware resources.
One novel approach is to measure the amount of time needed to find the key. Using knowledge of how quickly the key space can be searched, an approximate trial key can be found. A number of keys need to be checked to account for timer inaccuracies. This method removes the need for key storage and retrieval hardware.
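The arithmetic involved can be sketched as follows; the `estimate_key` helper and its parameters are hypothetical, and a real machine would widen the window to cover its worst-case timer error:

```c
#include <stdint.h>

/* Hypothetical sketch of the timing approach: estimate the matching
 * key from the elapsed time and the known search rate, then return a
 * window of candidates to account for timer inaccuracy. */
typedef struct { uint64_t lo, hi; } key_window;

static key_window estimate_key(uint64_t start_key,
                               uint64_t keys_per_sec,
                               double elapsed_sec,
                               uint64_t slack) {
    uint64_t estimate = start_key + (uint64_t)(keys_per_sec * elapsed_sec);
    key_window w = { estimate - slack, estimate + slack };
    return w;
}
```

Software would then re-check each key in the window to find the exact match, trading a small amount of host CPU time for the removal of key storage and retrieval hardware.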
The programming interface for this design is supplied in Key search engine 1 interface.
To produce an FPGA-based key search machine which can operate independently of the cipher algorithm. It should communicate with a computer for instructions and data. It should be reasonably scalable for large key cracks, and be easily modifiable for ciphertext-only attacks. It should allow rapid prototyping of key search machines for different ciphers.
Initial key search machine top level design
The bus provided by the Pilchard interface runs synchronously at 100 or 133MHz. The remainder of the key search machine must operate at this speed.
The top-level Status register provides general status information for the entire key search machine.
The key buffer stores potentially correct keys for the computer to read out and check further. This prevents search units from being paused for very long when a potential key is located. It is particularly useful for ciphertext-only attacks, where there may be a large number of potentially correct keys. 256 keys can be stored; this figure fully utilises the four Block SelectRAM units that are needed to store a 64 bit word.
The controller operates the search bus. It relays commands from the computer to individual search units, stores the ciphertext and plaintext registers, and polls each search unit on the bus to see if there are any keys waiting. It uses a simple state machine. While there are no commands waiting to execute, it polls search units to see if there are any keys waiting. If a key is found, it is read into the key buffer. If a command arrives, it temporarily stops polling and executes the command.
Initial key search machine search unit design
Each search unit has its own status register which the controller uses to determine if a key has been located. The key generator provides a trial key to the decryptor, which uses the key and the supplied ciphertext to produce trial plaintext. This trial plaintext is compared with the known plaintext or has a set of heuristics applied to determine if it appears to be valid. If it is, the search unit is halted until it is instructed to restart by the controller.
Several problems were identified with the original key search machine that justified the design of a new one.
The new design and its driver software took approximately four days to implement and debug. The programming interface is described in Key search machine 2 interfaces.
Revised key search machine design
There are two controllers in this design; a master and a slave. The master handles all communication with the host computer and links to the slave with an asynchronous bus modelled closely on VME. The slave controller’s only purpose is to link the asynchronous bus and the search bus, which can run synchronously at any speed. This allows search units to run at any speed, simplifying cipher implementation.
The search unit is designed to use a block system for key allocation. Only the block number is transmitted to the search unit. When retrieving the key from the search unit, the least significant 32 bits of the key are returned. The software is expected to track which search unit is searching which block.
The search bus runs at the same speed as the search units. The two clock domains (SDRAM clock and search unit clock) are linked with an asynchronous bus using similar protocols to VME.
High speed search units were later identified as a problem; it was found to be difficult to route a wide high speed bus over the entire FPGA and still meet timing constraints. A potential improvement to the machine would be to decouple the clock rate of the search bus from that of the search units or make the bus completely asynchronous. The latter was the original intent of the asynchronous bus, but the resources required to implement it made it unwieldy to use on every search unit. It remains the best solution for a large-scale machine (at least between FPGA devices).
Two linear counters were implemented. The first was a simple 64 bit counter. The second was designed to work with block schemes and added functionality to allow counting to be inhibited for ciphers that do not need a new key every clock cycle. It counts through 32 bits of range and has a further 48 bits of range that is set externally.
A comparator that checks for exact bit equality was implemented. It was 64 bits wide. It flags a match when its two inputs are identical. To ensure that it runs quickly enough with high clock speeds, it was implemented as a short pipeline. On the first clock cycle, four 16 bit segments of the trial plaintext are compared individually. On the second cycle if the result of these four comparisons is true, a match is flagged.
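A software model of the two-stage comparison, where each stage corresponds to one clock cycle in hardware:

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of the pipelined equality comparator: stage one compares four
 * 16 bit segments independently, stage two ANDs the four results. */
static bool compare_64(uint64_t trial, uint64_t known) {
    bool seg[4];
    for (int i = 0; i < 4; i++) {                 /* stage 1 (cycle 1) */
        uint16_t a = (uint16_t)(trial >> (16 * i));
        uint16_t b = (uint16_t)(known >> (16 * i));
        seg[i] = (a == b);
    }
    return seg[0] && seg[1] && seg[2] && seg[3];  /* stage 2 (cycle 2) */
}
```

Splitting the comparison keeps each stage's logic depth small, which is what allows the comparator to keep pace with a high clock rate.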
A simple statistical comparator was implemented using some of the ideas within . Its purpose is to use the probabilities of different bytes within the produced plaintext to determine if the plaintext “looks right”. The definition of “looks right” varies depending on the attack scenario; English text would have different statistical properties to an executable file, for example.
The algorithm used is fairly simple. The comparator takes a 64 bit input and splits it into 8 bit bytes. Each byte value has an assigned “score” – higher scores correspond with more frequently occurring byte values. The scores for each byte are added and compared against a threshold value. If the threshold is exceeded, a match is flagged.
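The steps above can be modelled in software as follows; the `statistical_match` name and the strict "greater than" threshold semantics are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

/* Software model of the statistical comparator: `score` maps each
 * byte value to an 8 bit frequency score; the eight per-byte scores
 * are summed and compared against a threshold. */
static bool statistical_match(uint64_t plaintext,
                              const uint8_t score[256],
                              unsigned threshold) {
    unsigned total = 0;
    for (int i = 0; i < 8; i++)
        total += score[(plaintext >> (8 * i)) & 0xFF];
    return total > threshold;
}
```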
Implementing the algorithm at speed was more challenging. The main design constraint was that the comparator be no slower than any decryption module – in this case, the DES module running at over 149MHz and producing one word of plaintext per cycle. To meet this timing requirement, the steps of the algorithm were split up as much as possible.
The figure below shows the steps performed by the implementation. Four RAM blocks were used to store the byte value scores (8 bits each). Each RAM block has two ports, allowing a total of eight memory lookups every cycle. The scores are added in parallel in pairs to minimise delays on each cycle. Finally, the total score is compared with the threshold (which is set by the plaintext value). Splitting up the steps in this way produces a deep pipeline, but allows very high clock rates.
Statistical comparator design
The threshold comparison stage is the main timing bottleneck. Speed improvements can be made by reducing the comparison resolution. By only comparing the most significant four bits, synthesis reports a maximum speed of 181MHz. The required resolution depends on the statistical properties of the text being attacked.
A small C program was written to generate character scores from files. It counts the frequency of each character in the file. The scores are then normalised down to 8 bits and output in a format suitable for entry directly into the VHDL RAM initialisation code. This data could also be used to modify the bitstream after compilation if desired. This program allows the comparator to be “trained” on similar input data to what is expected.
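The original program is not reproduced here; the following is a speculative reconstruction of its core steps, omitting the VHDL RAM initialisation output formatting:

```c
#include <stdio.h>
#include <stdint.h>

/* Speculative reconstruction of the score-generation tool: count
 * byte frequencies from a training file, then normalise the counts
 * down to 8 bit scores suitable for the comparator's RAM. */
static void count_bytes(FILE *f, unsigned long count[256]) {
    int ch;
    for (int i = 0; i < 256; i++)
        count[i] = 0;
    while ((ch = fgetc(f)) != EOF)
        count[ch]++;
}

static void normalise_scores(const unsigned long count[256],
                             uint8_t score[256]) {
    unsigned long max = 1;                /* avoid division by zero */
    for (int i = 0; i < 256; i++)
        if (count[i] > max)
            max = count[i];
    for (int i = 0; i < 256; i++)         /* scale to 0..255 */
        score[i] = (uint8_t)((count[i] * 255) / max);
}
```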
The DES implementation is a modified copy of the DES demonstration provided with the Pilchard board, which is itself a modified version of the Xilinx optimised DES implementation in . The order of the round functions was reversed to convert the encryptor into a decryptor.
Registers were added to the key schedule logic, but later removed when the efficient keying system described in  was implemented. This scheme integrates an LFSR key generator with the DES key schedule logic. A 72 bit LFSR with taps suitable for a 56 bit LFSR was used. As previous keys are shifted through the LFSR they remain available to the key schedule logic, which can generate the necessary subkeys with rotations. This saved approximately 500 slices previously used for subkey registers. Subkey generation was essentially free, although fanout on the LFSR bits did reduce the speed slightly.
The possibility of attacking a key and its complement simultaneously was considered. This halves the search space, but not the search time. The decryption portion of DES comprised the bulk of the area requirement in a hardware implementation, and this improvement only saves key schedule logic. After implementing the LFSR keying scheme above, performance improvements would be negligible.
Using the XCV1000E, a key search machine containing controllers and five search units was operated at 100MHz, giving a total search rate of 500Mkeys/sec.
The A5/1 implementation was produced from scratch using the algorithm description given in . It aims to find the initial key state rather than the key itself. Time constraints did not allow the more efficient stream cipher attack in Stream Ciphers to be implemented, and so no further work was performed using this module.
The (already small) resources needed to implement the A5/1 module could be further reduced by configuring the Xilinx LUTs as shift registers . This would complicate key loading; the entire key state could no longer be loaded in a single cycle.
The RC5 implementation was produced completely from scratch using the algorithm description given by Rivest . It implemented RC5-32/12/9. It was intended to be used to complete the RSA Secret-Key Challenge contests . The possibility of connecting the complete key search machine to distributed.net  was considered as an extension.
Few prior works in this area could be located.  claims to have schematics for a functional RC5 implementation on Xilinx FPGAs, but they are no longer available and the author could not be contacted.  contains a Verilog model which was not found to be useful.
A fully pipelined design similar to that used for DES was investigated. This possibility was considered to be impractical due to the large number of registers needed for the S array.
After implementing the iterative version, the possibility of implementing a pipelined version was considered again. This time, the number of LUTs required was identified as being excessive. A prototype implementation determined that each stage of the key mixing phase would require 256 LUTs, and each half-round of the decryption phase would require 192 LUTs. Given 78 mixing steps and 24 decryption half-rounds, the number of LUTs required is 24576 – coincidentally, the exact number of LUTs available on the Virtex 1000E. Many more would be required for state decoding, communication, key generation, comparisons, routing overhead and so on. This possibility was not investigated further, but would almost certainly be feasible given more hardware resources to work with. Such an implementation would be able to provide very high search rates on sufficiently large FPGA devices.
An iterative design for the RC5 implementation was used. Block SelectRAM memories within the FPGA were used to store the S array. The number of RAM blocks was anticipated to be the limiting factor, similar to the RC4 key search engine described in . The L array was stored in three rotating registers; this eased timing constraints and prevented reads and writes to the RAM becoming a bottleneck.
The key mixing phase of RC5 took the bulk of the time needed to check a key. It required 78 iterations, each of which consists of a read and a write to the S and L arrays. To minimise the time required per cycle, the key mixing stage of the algorithm was set up to operate continuously on two separate regions of RAM. The initialisation and decryption stages were arranged to work on the opposite region of RAM. When a key mix phase completes, the decryption and initialisation phases begin on that region of RAM. In this way, the average time required to check a key would effectively be the time required to perform the key mixing phase.
RC5 RAM timing
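For reference, the key mixing phase described above can be modelled in software, following Rivest's published algorithm for RC5-32/12/9: 26 S words, 3 L words, and 3 × max(26, 3) = 78 mixing iterations, matching the iteration count quoted here:

```c
#include <stdint.h>

/* Reference model of RC5-32/12/9 S array setup: initialisation
 * followed by 78 key mixing iterations over S (26 words) and the
 * key-derived L array (3 words). */
#define P32 0xB7E15163u
#define Q32 0x9E3779B9u
#define ROTL(x, n) (((x) << ((n) & 31)) | ((x) >> ((32 - (n)) & 31)))

static void rc5_setup(uint32_t S[26], uint32_t L[3]) {
    uint32_t A = 0, B = 0;
    int i = 0, j = 0;
    S[0] = P32;                            /* initialisation phase */
    for (int k = 1; k < 26; k++)
        S[k] = S[k - 1] + Q32;
    for (int s = 0; s < 78; s++) {         /* key mixing phase */
        A = S[i] = ROTL(S[i] + A + B, 3);
        B = L[j] = ROTL(L[j] + A + B, A + B);
        i = (i + 1) % 26;
        j = (j + 1) % 3;
    }
}
```

Each mixing iteration reads and writes one S word and one L word, which is why the hardware's RAM access pattern for this phase dominates the cycle budget.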
The key mix phase needs to be completed as quickly as possible. The decryption and initialisation phases are not timing critical, and can be completed more slowly to save FPGA resources. The decryption module takes advantage of this by performing twice as many iterations, each implementing half a round, and interchanging the A and B registers at the end of each half-round. In this way the subtract, shift, XOR and RAM lookup resources can be reused. The initialisation module performs the additions required to initialise the S array, even though these results could be trivially precomputed; this also saves FPGA resources.
The general goal for the key mix operation is to complete as quickly as possible. The general goal for the decryption and initialisation operations is to use as few resources as possible, so long as the time taken for these two operations does not exceed that needed by the key mix operation.
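The half-round trick can be demonstrated in software: two applications of a single `half_round` datapath, with the register interchange, produce the same result as one full RC5 decryption round (function names here are illustrative):

```c
#include <stdint.h>

/* One half-round of RC5 decryption with the A/B register interchange,
 * so a single subtract/rotate/XOR datapath serves the whole round. */
#define ROTR(x, n) (((x) >> ((n) & 31)) | ((x) << ((32 - (n)) & 31)))

static void half_round(uint32_t *a, uint32_t *b, uint32_t s) {
    uint32_t t = ROTR(*b - s, *a) ^ *a;
    *b = *a;                 /* interchange the registers */
    *a = t;
}

/* One full decryption round, for comparison. */
static void full_round(uint32_t *a, uint32_t *b,
                       uint32_t s_even, uint32_t s_odd) {
    *b = ROTR(*b - s_odd, *a) ^ *a;
    *a = ROTR(*a - s_even, *b) ^ *b;
}
```

Applying `half_round` with the odd subkey and then the even subkey leaves (A, B) identical to one `full_round`, at half the hardware cost and twice the cycle count – exactly the trade described above.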
One problematic area in the implementation was the 32 bit barrel shifter required by RC5. The initial naïve implementation required 352 slices; with the help of  this was improved to 80 slices. One shifter is required for each of the key mix stage and the decryption stages. These account for a significant amount of the resource usage. Some research and experimentation was conducted to find smaller or faster shifter designs, without success. Shrinking or speeding up the barrel shifters would provide large benefits to the overall performance of the design.
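The staged structure that shrinks a barrel shifter can be sketched as a chain of five conditional stages, one per bit of the rotation amount; each `if` corresponds to a row of 2-to-1 multiplexers in hardware:

```c
#include <stdint.h>

/* Staged 32 bit barrel rotator: five multiplexer stages, one per bit
 * of the 5 bit rotation amount. Behaviour matches a plain
 * rotate-left by n. */
static uint32_t barrel_rotl(uint32_t x, unsigned n) {
    if (n & 16) x = (x << 16) | (x >> 16);
    if (n & 8)  x = (x << 8)  | (x >> 24);
    if (n & 4)  x = (x << 4)  | (x >> 28);
    if (n & 2)  x = (x << 2)  | (x >> 30);
    if (n & 1)  x = (x << 1)  | (x >> 31);
    return x;
}
```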
Running the module at 100MHz proved difficult. Routing delays introduced after the place and route stage were the cause of the problem; congestion was present at one of the RAM blocks. The delay at this point increased when the number of search units was increased, suggesting that floorplanning may be useful to reduce the delay or at least make it consistent. A brief unsuccessful attempt at floorplanning was made.
Two approaches were used to solve this problem. Originally, two RAM blocks were used to provide a 32 bit wide RAM, with one port used by the key schedule module and the other by the decryptor and initialisation module. First, the number of RAM blocks was doubled (they were not otherwise being used) and writes were made to both pairs. Reads could then be made from either pair of RAM blocks, allowing unrelated logic to be moved to different areas of the FPGA by the place and route tools, which helped to reduce delays. Second, a wait state was added after RAM access, allowing the module to meet its timing requirements at the cost of reduced performance.
The total time required to check an RC5 key is 469 clock cycles. Each iteration needs 6 clock cycles, and 78 iterations are required. One cycle is needed for initialisation. At the target clock speed of 100MHz, this gives a search rate of 213,220 keys/sec. 16 search units could be fit into an XCV1000E device, giving an aggregate search rate of 3.4Mkeys/sec.
The possibility of increasing the clock speed of the RC5 module was investigated, but found to be counterproductive. The intent was to balance the time spent in each pipeline stage better, hopefully overcoming the increase in resource usage and number of stages required. Registers were inserted at locations responsible for timing limitations. These registers did not increase resource usage significantly due to the structure of the Virtex slice . The number of cycles per round increased from 5 to 8 and the synthesis clock speed from 102MHz to 142MHz, which was not an effective tradeoff. Many previously trivial operations such as the comparison needed to be split into stages instead of being simple combinatorial operations, which greatly increased the complexity of the source code. The overall resource usage also increased.
Replacing each bit in the three registers used to implement the L array with a short LUT shift register would reduce the resources allocated and potentially ease routing.
Some work was conducted to see if it was possible to take shortcuts in the key mixing operation; this was unsuccessful.
Including the ciphertext and IV at synthesis time reduced resource usage for the search unit to 539 slices. This would be a worthwhile approach for an attack where the ciphertext and IV are known in advance. It would generally not be suitable for an ASIC implementation.
This module was implemented before the second key search machine. Performance could be improved by running at a lower clock speed with fewer pipeline stages.
Benchmarks were conducted on a number of different CPUs to measure how quickly they could perform key searches. Setting up and running the benchmarks was very rapid, so many different CPUs were tested to determine if any would provide significant price/performance advantages.
Pre-written benchmarks were used. These benchmarks were faster and more thoroughly tested than what could otherwise be produced in the available time.
Each benchmark was run at least three times until consistent results were achieved. Linux benchmarks were run as the root user, prefixing the benchmark command with
nice -20 to ensure that the benchmark ran with the highest priority.
Tables containing the gathered results are given in CPU benchmark results. distributed.net maintains an online database  of search rates for each CPU, allowing some of the benchmark results to be verified.
Two benchmark programs were used: the distributed.net client version 19991117 (which had to be compiled from source), and the SolNET DES client . The distributed.net client gave far better benchmark results, but could only be run on Linux machines with appropriate compiler versions. Neither DES client had been optimised for modern CPUs.
dnetc -benchmark des was used to run the distributed.net benchmarks, and
desclient-x86-linux -m for the SolNET benchmarks. The SolNET client’s benchmark results were unstable on faster CPUs, requiring them to be run a large number of times.
The DES benchmark results for newer CPUs could not be verified against the distributed.net online database because those CPUs did not exist at the time the online benchmarks were gathered. The results for older CPUs were far higher than those in the online database.
Benchmarks for Celeron, P4HT and Athlon XP (Barton) CPUs had to be inferred from others based on the same core. The Mkeys/sec/MHz ratios obtained for RC5 remained fairly constant under this assumption, and this is assumed to remain true for DES.
The distributed.net client version 03033120 was used to conduct RC5-72 benchmarks. Binaries from the distributed.net website were downloaded for the relevant platform, unpacked, and the benchmark executed from the command line with
dnetc -benchmark rc5-72.
The RC5 benchmark results were verified against those in the distributed.net database. Confusion is apparent with the Athlon speed ratings; it is not obvious whether an entry marked “1900” refers to a 1900+ or a 1900MHz Athlon. Nevertheless, the RC5 benchmark results gathered were found to mesh well with those in the database.
No Celeron machines based on the Pentium IV core were available to run benchmarks on, so the online benchmark results were used for analysis. These appeared internally consistent, so a Mkeys/sec/MHz rating was determined and averaged across the available benchmark results to reduce error. This rating was used to infer the missing benchmark results.
 J. Keller and B. Seitz, “A hardware-based attack on the A5/1 stream cipher,” in APC 2001. VDE Verlag, 2001, pp. 155–158. [Online]. Available: http://www.informatik.fernuni-hagen.de/ti2/papers/apc2001-nal.pdf
 P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee, “Pilchard - a reconfigurable computing platform with memory slot interface,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2001. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/pilchard_fccm01.pdf
 I. Goldberg and D. Wagner, “Architectural considerations for cryptanalytic hardware,” CS252 Report, 1996. [Online]. Available: http://www.cs.berkeley.edu/~iang/isaac/hardware/paper.ps
 I. Hamer and P. Chow, “DES cracking on the Transmogrifier 2a,” in Lecture Notes in Computer Science, ser. Cryptographic Hardware and Embedded Systems. Springer-Verlag, 1999, no. 1717, pp. 13–24. [Online]. Available: http://www.eecg.toronto.edu/~pc/research/publications/des.ches99.ps.gz
 K. H. Tsoi, K. H. Lee, and P. Leong, “A massively parallel RC4 key search engine,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002, pp. 13–21. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/vrvw_fccm02.pdf
 M. Blaze. (1997, June) A better DES challenge. [Online]. Available: http://www.privacy.nb.ca/cryptography/archives/cryptography/html/1997-0%6/0127.html
 (2003, October) distributed.net: Node Zero. [Online]. Available: http://www.distributed.net/
 The RSA Laboratories Secret-Key Challenge. RSA Security. [Online]. Available: http://www.rsasecurity.com/rsalabs/challenges/secretkey/index.html
 A. Elbirt, W. Yip, B. Chetwynd, and C. Paar, “An FPGA-based performance evaluation of the AES block cipher candidate algorithm finalists,” IEEE Transactions on VLSI Systems, vol. 9, no. 4, August 2001.
 J. Leonard and W. H. Mangione-Smith, “A case study of partially evaluated hardware circuits: Key-specific DES,” in Field-Programmable Logic and Applications. 7th International Workshop, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds., vol. 1304. London, U.K.: Springer-Verlag, 1997, pp. 151–160. [Online]. Available: http://citeseer.nj.nec.com/leonard97case.html
 D. Wagner and S. M. Bellovin, “A programmable plaintext recognizer,” 1994. [Online]. Available: ftp://ftp.research.att.com/dist/smb/recog.ps
 C. Eilbeck. My crypto page. [Online]. Available: http://www.yordas.demon.co.uk/crypto/
 Xilinx, Inc. SRL16 16-bit shift register look-up-table (LUT). [Online]. Available: http://toolbox.xilinx.com/docsan/xilinx5/data/docs/lib/lib0393_377.html
 E. Soha. (1998, May) RC5 on FPGAs. No longer available from original source. [Online]. Available: http://web.archive.org/web/19981205053422/http://www-inst.eecs.berkeley%.edu/~barrel/rc5.html
 Xilinx, Inc., “XC4000E and XC4000X Series Field Programmable Gate Arrays,” May 1999. [Online]. Available: http://www.xilinx.com/bvdocs/publications/4000.pdf
 Xilinx, Inc., “Virtex-E 1.8V Field Programmable Gate Arrays,” July 2002. [Online]. Available: http://direct.xilinx.com/bvdocs/publications/ds022.pdf
 distributed.net. (2003, October) distributed.net: Client Speed Comparisons. [Online]. Available: http://n0cgi.distributed.net/speed/
 (1997, May) SolNET DES Challenge Attack: Download Page. [Online]. Available: http://www.des.sollentuna.se/download.html