FPGAs in cryptanalysis: Analysis

Many factors affect the time taken to conduct an exhaustive key search attack:

Key length. Increasing the key length used by a cipher will dramatically increase the size of the key space and hence the time required. Using long keys is by far the most effective countermeasure to an exhaustive key search attack.
Available resources. An attacker with more resources (money, computational power, people) can sweep the key space more rapidly.
Cipher design. The design of a cipher has a strong influence on how long an exhaustive key search will take. Several factors contribute to this:
- Key setup time. In normal use, a cipher’s key setup time is usually negligible. When conducting a key search attack, the key setup phase must be performed for every key. Long key setup times can frustrate key search attacks.
- Cipher operations. Some operations will take longer to perform than others. Some ciphers are designed to use operations that are fast on one particular technology and slow on another.
- Cipher speed. On identical implementation technologies, some ciphers can encrypt or decrypt more quickly than others.
- Frequency of register access. Pipelined designs can achieve significant resource savings if registers are accessed infrequently (discussed in Estimating FPGA resource usage for pipelined cipher implementations.) Iterative designs often have less constraints on their RAM or register access.
Availability of accurate plaintext and ciphertext pairs. Possessing only ciphertext or incomplete ciphertext and plaintext may reduce the speed and accuracy of the exhaustive key search attack. Ideally, several perfect pairs of ciphertext and plaintext should be available. If many pairs are available, time-memory tradeoff attacks may be more appropriate for some ciphers.

Software factors

The most significant factor that affects the time required to conduct a key search using CPUs is the word length of the cipher. The speed of the cipher tends to be highest on a CPU whose word length matches that of the cipher. All processing must be performed in units of this word length.

Ciphers that have very small states may be able to operate completely within CPU registers, which will improve performance. Smaller speed benefits will arise if the state is small enough to fit inside the highest-level cache. Very few ciphers have state sizes over a few thousand bytes.

Some cipher implementations may exchange storage space for processing power by precomputing values. One software implementation of A5/1 [44] exploits this by precomputing all possible values of the individual LFSRs. Instead of performing the normal shift and XOR operations, output is generated by modifying pointers to the precomputed values. This requires almost 128MB of memory, but greatly improves the performance of A5/1 in software.

FPGA factors

The most significant factor affecting the speed of exhaustive key search on FPGAs is whether the cipher can be implemented as a long pipeline as opposed to an iterative structure. Pipelined structures make far more efficient use of the FPGA resources and can achieve higher clock speeds. Any cipher can be implemented as a long pipeline, but the required FPGA resources are often prohibitive. Generally, a long pipeline will check one key every clock cycle.

FPGAs

Operation performance on FPGAs is influenced by the time required to complete the operation and the resource usage. Many operations can be completed more quickly by using more resources. Conversely, using less resources allows a greater level of parallelism, increasing overall performance. Finding the correct balance is difficult and may require several attempts at implementation.

Results giving the time and space requirements of a number of operations are given in LUTs and time required per bit for cipher operations. Not all operations are covered, and optimisations using RAM or other FPGA features are ignored. In particular, large data-dependent table lookups will be more efficiently performed using the RAM blocks contained within most Xilinx FPGAs. Similarly, multiplications may be better performed using onboard Virtex II, Virtex II Pro or Spartan 3 multiplier resources.

Addition and subtraction are dependent on the width of the data being added, due to the Xilinx carry chain. Variable rotation is achieved using the scheme in [39], and assumes the use of Xilinx slice multiplexers. Data-dependent table lookup does not assume this because each Xilinx family has different multiplexer resources available which will affect the structures used for implementation.

Operation	LUTs	Time (LUT depth)
Bit permute	0	0
Fixed rotate or shift	0	0
XOR (up to 4 inputs)	1	1
Data-dependant table read or write (1-4 address bits)	1	1
Data-dependant table lookup ($w$ address bits, $w>4$)	$2^{w-3}-1$	$w-3$
Add or subtract (two inputs)	1	Depends on data width
Variable rotate or shift over $w$ bits	$\log_{2}w$	$\frac{\log_{2}w}{2}$

CPUs

In contrast with FPGAs, storage space is not a concern for software implementations. When the word size of the operation inputs is equal to or less than that of the CPU, most modern CPUs can complete any operation very quickly. The operations being performed thus become less important.

Word size is the major factor when determining operation speed on CPUs. If the word size of the operation is greater than that of the CPU, performance will generally be halved (or slightly worse).

If the word size of the operation is less than that of the CPU, processing resources are being wasted. We then need to examine the algorithm to determine if any parallelism can be applied to make the best use of the resources that are available. For example, it may be possible to pack a number of short-word operands into one word and perform an operation over a number of words simultaneously. Bitslicing [9] is another approach that is used to accelerate single-bit operations on CPUs with a wide native word length.

Introduction

It is useful to determine whether a pipelined cipher implementation is feasible before beginning work on the implementation itself. This section describes a method that can be used to estimate the resource usage of a cipher given details of its algorithm.

Assumptions

A number of assumptions are made to facilitate this analysis:

FPGAs are comprised of many logic cells (LCs.) A logic cell is comprised of a four input look-up table (LUT) and a flip-flop (FF). This is similar to the LC layout used in most Xilinx and Altera FPGAs, and is shown in Simplified FPGA logic cell.
The state size remains constant throughout the key setup and decryption phases. This is rarely true, but does not significantly affect the results. The analysis can be extended to handle changing state sizes.
The key setup and decryption phases are composed of a number of rounds. Each round performs the same operations. The logic used to perform each round maps the state from round $r$ to round $r+1$.

Simplified FPGA logic cell

Factors

To estimate the quantity of resources required, we consider:

The size of the state during key setup and decryption
The operations that are performed during key setup and decryption
What quantity of the state is modified during each step or round of key setup and decryption

The final result will specify the number of LCs required for a fully pipelined implementation of the cipher. The mapping from LC to real FPGA resources is generally trivial. A Virtex, Spartan IIE, Virtex II or Virtex II Pro slice maps to two LCs. Spartan 3 devices have fewer effective LCs because half of the LUTs on the device have reduced functionality and do not support usage as a RAM or a shift register.

Method

We define a number of variables, shown in the table below.

$s$	State size in bits
$n$	Number of rounds needed to complete a phase
$r$	Number of LUTs required to perform a round
$m$	Number of bits of state modified during a round
$c$	Number of LCs required to perform a phase

The state size $s$ is the number of bits that need to be stored between rounds. This includes all registers and modifiable lookup tables used in a phase. Usually this figure can be determined by summing the total number of bits of storage used by the cipher.

The number of rounds $n$ is usually specified by the cipher. Modelling the cipher in this way forces each round to complete in one clock cycle, which may result in very long combinational delays for some ciphers. This will be discussed further below.

The number of LUTs $r$ required to perform a round is determined by examining at the operations performed during that round and the number of bits that these operations are performed on. These results are summarised in LUTs and time required per bit for cipher operations. Another strategy would be to implement the round function and use synthesis results to estimate the resource usage.

If a state bit is modified during a round, its result can be stored in the same LC as the LUT that modified it. A dedicated LC is needed otherwise. This gives us value $m$.

The number of LCs needed to complete a cipher phase is thus:

$$c=n(r+s-m) \text{ (equation 4.1)}$$

We can then compare the number of LCs in the result to the number of LCs in a device to determine whether a pipelined implementation is feasible. Alternatively, we can use the LC estimate to determine what the smallest device needed is. The LC figure can also be used as a “difficulty rating”; ciphers with high LC counts will generally take more FPGA resources and time to attack.

Optimisations

Multiplexers

Each Xilinx slice contains a number of additional multiplexers which can reduce the number of LCs needed for large table lookups.

Shift registers

Xilinx LUTs can also operate as shift registers [36]. This is very useful when a state bit is not modified for one or more clock cycles. Instead of chaining FFs together, a single LUT can replace up to 16 FFs. Analysis of the data dependencies in the cipher algorithm can provide justification to reduce the effective $s$ value significantly. Synthesis tools will sometimes perform this optimisation automatically. We can then obtain an alternate equation to obtain the number of LCs required:

$$c=S+nr\label{eq:LC2} \text{ (equation 4.2)}$$

where $S$ is the number of LCs used to store the state over the entire pipeline. Implementations using shift registers for storage cannot pack the shift registers into the same LC as the LUT performing the calculation, since the shift register uses the LUT. All state that is modified can be latched in the same LC as the LUT that performed the calculation. We thus do not require variable $m$.

Short pipeline stages

Pipelined implementations on FPGAs can be sped up by making each pipeline stage as short as possible. No additional cost is incurred when using the latches included in LCs, making very deep pipelines at high speeds a good design approach. Instead of completing an entire round in a single clock cycle, it is usually more efficient to break up the round and complete it over more clock cycles (at a higher clock rate). The design will still check an average of one key per clock cycle at the higher clock rate.

Splitting up the pipeline stages requires analysis of the data dependencies within each round, and would significantly complicate this analysis.

Using short pipeline stages will increase the number of LUTs that will need to be used as shift registers, and hence increase overall resource usage slightly. At worst, doubling the clock rate of a design will double the number of LCs used only for data storage. Smaller penalties will be incurred if shift registers are less than half full.

Some operations (particularly boolean operations like XOR) can often be collapsed into other operations, particularly if there are spare LUT input lines. This is highly dependent on the cipher algorithm.

RC5 example

Rivest describes the RC5 algorithm in [6]. It consists of three phases: initialisation, key mixing and decryption. The initialisation phase is ignored because it can be trivially precomputed and will not consume FPGA resources when implemented in this way.

Key mixing

The key mixing phase of RC5-32/12/9 uses an S array of 26 words and an L array of 3 words. Each word is 32 bits wide, giving a total state size ($s$) of 928 bits. 78 rounds ($n$) are needed. Each round modifies 64 bits of state ($m$.)

To determine $r$, we examine the operations being performed. All of the table lookups operate in a predictable order, removing the need for additional multiplexers. Each round contains five adds, a fixed rotation and a variable rotation, all over 32 bits. One of the adds ($A+B$ in the second line) is performed twice, and can be ignored. This gives an $r$ value of $32(4\times1+0+\log_{2}32)=288$ and a final $c$ value of $78(288+928-64)=89856$. This represents a slice count that can only be achieved in very large FPGAs. The estimated $r$ value is close to the value of 256 obtained in the trial implementation.

Significant resource savings can be achieved by recognising that each bit in the S array is only accessed once every 26 rounds, and each bit in the L array is only accessed once every 3 rounds. We can thus collapse a significant number of the LCs used purely for storage into shift register LUTs. 7488 LCs are used to store the L array as it travels through the pipeline; this can be reduced to 2496. 64896 total FFs are used to store the S array. Two shift registers are needed to provide the 25 cycle delay on each bit, and each bit is accessed three times. The total number of LCs needed is thus $26\times32\times2\times3=4992$. This demonstrates that the frequency of register access has a significant bearing on the efficiency of cipher implementations on FPGAs. The S array stores almost 9 times as much data as the L array, but requires only twice the resources.

Using Equation 4.2, we obtain an LC count of 22464 for the key mixing phase of RC5.

The resource usage can be further optimised by noticing that elements of the S array are initialised sequentially, and so not all values need to be stored until 26 rounds have been completed.

Decryption

The decryption phase of RC5 uses the same S array as the key mixing phase, but does not require the L array. Each round consists of two half-rounds which are identical except for the source and destination of the results. We can use this property to double the number of rounds and halve the number of LCs required per round. It consists of 22 half-rounds, each of which contains a subtraction, a variable rotation and an XOR. All operations are performed on 32 bit words. Two additional subtractions are performed after the half-rounds. We thus have $s=26\times32=832$, $n=22$, $r=32(1+5+1)=192$ (which matches the trial implementation) and $m=64$. This gives $c=22(192+832-64)=21120$.

Again, significant resource savings can be achieved by using shift registers for storage. Each word in the S array is only accessed once and is never written to, so we need only provide storage for the period between the start of the decryption phase and the point where it is accessed. At worst, this will be 26 rounds, requiring two shift registers per bit. There are 26 words storing 32 bits each, giving an LC count of $26\times32\times2=1664$. Applying Equation 4.2 again gives $c=1664+24\times192=6272$. The final subtractions require an additional 64 LCs, giving a total of 6336 LCs.

Results

Using the figures determined from Equation 4.1, we obtain a total LC count of 110976. This translates to 55488 slices – barely fitting within the largest Virtex II Pro device that is planned for production (and is not yet even shipping.)

Using the Xilinx-optimised figures from Equation 4.2 gives a total LC count of 28800. This converts to a far more practical 14400 slices, within the capacities of many larger devices.

It should be noted that the figures generated by this analysis technique tend to be conservative and ignore many potential resource optimisations. It also ignores issues of pipelining within the cipher rounds which are difficult to deal with in such a general sense. Both of these areas can provide significant resource and speed advantages in an actual implementation.

Pricing data for Virtex E, Spartan IIE, Virtex II and Virtex II Pro FPGAs in quantities of 25-99 was obtained from Avnet’s website [45], and is current as of 15 October, 2003. The XC2V40 and XC2V80 device pricing is for a quantity of 100 or more.

Pricing data for Spartan 3 devices was obtained from Ernest Peltzer [46] of Sensory Networks, and are projected prices for Q1 2004 in quantities of 100 or more. Spartan 3 devices only started shipping very recently and so pricing data is both difficult to obtain and very likely to change.

This analysis assumes that the cipher module is the slowest part of the total key search machine. It ignores resources that would be dedicated to the PC interface, but includes those that are required for each search unit. Interface overheads are ignored because in a large-scale design each FPGA does not need its own PC interface and controller; the search bus can be connected directly to FPGA pads.

Speed grades and packaging

Each FPGA is available in a variety of speed grades and several packaging options. In general, the slowest speed grade and the smallest package gives the best price/performance ratio. Low FPGA speeds can be compensated for by using more FPGAs, and packaging is not important since only a few I/O lines are needed. This greatly simplifies the analysis by allowing a large percentage of the FPGA devices available to be ignored.

Families

The maximum attainable speed and resource usage is the same within each FPGA family. This allows performance estimates to be generated with far less effort. Synthesis estimates for the DES and RC5 search units are used. These are less accurate than those obtained after place and route, but remain valid across different capacity FPGAs within a family. These results are summarised in FPGA price/performance tables. Search unit resource costs are estimated in the same way.

Relative clock speed across FPGA families

The maximum clock speed for a given design varies greatly depending on the FPGA family used. Interestingly, the current “budget” family (Spartan 3) achieves the highest clock rates. This can be explained by their 90nm manufacturing process; the Virtex II and Virtex II Pro use 150nm and 120nm processes.

RC5 requires half as many RAM blocks on a Virtex II, Virtex II Pro or Spartan 3 as it does on a Virtex E or Spartan IIE. This is because the RC5 implementation needs a 32 bit wide RAM, and the Virtex E and Spartan IIE RAM is only 16 bits wide.

DES

FPGA price/performance for DES

FPGA price/performance for DES, showing low-end detail

The above figures show the price and performance of each device in each Xilinx FPGA family for the DES cipher. The second shows the same data, but zoomed to show detail for the low-end FPGAs. “Kinks” in the graph appear where moving to the next largest device does not improve performance (since there are not enough available resources to add another search unit). Smaller kinks are visible when moving to the next largest package.

We can see that Spartan 3 FPGAs give by far the best performance for a given price. This is not surprising; they are positioned as a budget FPGA and can achieve higher clock rates than the other families being considered. Again, it should be pointed out that they have only recently started shipping and pricing will likely be volatile for some time. The pricing figures used for the analysis were also projected figures for Q1 2004 and for a larger quantity than the other families.

Of the mature families, Spartan IIE devices give the best price/performance ratio. Their performance is quite limited, however. Virtex II Pro devices can achieve spectacularly high search rates, but at a high cost per device.

The figure below shows the price/performance ratio achieved by each device in the Virtex II Pro family. The device with the best ratio is the XC2VP20, with the XC2VP30 and XC2VP40 close behind. In a real system where PCB costs and assembly have to be taken into account, it may be worthwhile purchasing a smaller number of faster FPGAs with a worse price/performance ratio. These three devices use the FG676 package; the jump in price to the XC2VP50 can be explained by the larger package (FF1152). The XC2VP100 has a far worse ratio than the others, and a much larger package (FF1696). Like the Spartan 3 devices, it has only recently started shipping and may still have unstable pricing.

Virtex II Pro price/performance ratio by device for DES

RC5

FPGA price/performance for RC5

Relative FPGA price and performance for RC5 is shown above. It is similar to that for DES, but with less gap between the Virtex E and Virtex II/Virtex II Pro families. Detail near the low end is also very similar. Relative pricing and performance within the family remains the same as for DES.

RC4

FPGA price/performance for RC4

FPGA price/performance for RC4, showing low-end detail

RC4’s performance is entirely constrained by the number of RAM blocks available on the FPGA. This gives quite difference price/performance results. From the figures, we can see that the Virtex E and Virtex II families are quite similarly placed. Virtex II Pro devices perform much better when cost is taken into account. Examining the low-end detail shows that the Spartan IIE family remains competitive for far longer than the Spartan 3, in contrast with the other results. Again, the XC2VP20 remains the most cost-effective choice in the Virtex II Pro family.

CPU pricing data was obtained from Sastradi Satria of OnLine Centre [47]. This is for quantities of 10 and is specified in AUD without GST. This is useful for comparing CPUs, but makes comparing price/performance ratios between CPUs and FPGAs difficult. Several other pricing sources were located but not used due to accuracy or quantity issues.

Benchmark results are listed in CPU benchmark results, and should be interpreted taking into account the problems noted in Software benchmark results. These results were scaled by the clock speed to obtain performance estimates for each CPU that is currently being sold. This assumes that performance will scale linearly with CPU speed, which is generally true for exhaustive key search.

When comparing performance between CPUs the SolNET benchmarks were used because they have been performed on a wider variety of CPUs. They are not directly applicable to CPU to FPGA comparisons.

No pricing data was available for mobile Intel CPUs (PIII-M, P4-M, Centrino). This would be useful when considering a very large-scale key search machine based on CPUs; these CPUs use much less power and generate far less heat. PIII-M, Centrino and G3 are particularly interesting due to their high benchmark results at comparatively low clock rates.

These comparisons ignore the cost of support hardware, which can be expected to be several times that of the CPU device itself in some cases.

DES

The figures below show the price and performance of each CPU family for the DES cipher. We can see that the CPUs within a family that achieve the highest search rates are disproportionately expensive. It is scarcely worth trying to achieve a search rate over 12Mkeys/sec with an Athlon XP or 11Mkeys/sec with a Pentium 4 because the price increases so steeply.

CPU price/performance by family for DES

CPU price/performance by device for DES

The Athlon XP curve contains a number of kinks; these occur because pricing increases with their performance rating. This performance rating is not in line with actual performance, however – Barton core Athlons have a higher performance rating than their clock speed (and measured performance) would suggest. We can see that the Duron line appears to fit reasonably well with the Athlon XP line. Celeron CPUs achieve higher performance than their price would otherwise suggest.

Examination of the data shows that the two Duron data points have a linear price/performance relationship. In most practical systems, it would thus be best to choose the faster of the two in order to save on auxiliary costs (support hardware, space, etc.)

The second figure above shows the price/performance ratio for each device under consideration. We can see that the slowest device in each family generally gives the best ratio. The Durons have exactly the same price/performance ratio. The Athlon XP 2200+ and Celeron 2200 provide a marginally better ratio than their neighbours. All Pentium 4 devices are quite expensive for the performance that they give. The exact ratio needed for a large-scale machine will depend on the price of the support hardware, but in general any Duron, Celeron or Athlon XP up to 2600+ will provide a good price/performance ratio.

RC5

CPU price/performance by family for RC5

CPU price/performance by device for RC5

The top figure shows the price/performance ratios of each CPU family for RC5. We can see that the Celeron and Pentium 4 families are far less competitive for RC5; the most expensive Pentium 4 HT device barely outperforms the cheapest Duron! The Duron and Athlon XP families remain similarly positioned relative to each other. The same kinks in the Athlon XP curve are apparent.

The bottom figure shows the price/performance ratios of each device for RC5. As with DES, the cheapest device in each family provides the best ratio (with the minor exceptions noted for DES). Unlike DES, however, the Celeron family is no longer as competitive. Key search machine designers would do best to select the Duron 1600 or a low-end Athlon XP. Both Pentium 4 varieties remain very expensive for their performance.

CPUs and FPGAs

Ciphers

The pricing and performance data for CPUs and FPGAs is not directly comparable. CPU prices are given in AUD for quantities of 10; FPGA prices are given in USD for quantities of 24-99. The CPU performance is based on benchmark results, while the FPGA performance is based on synthesis estimates.

Nonetheless, we can scale the CPU pricing based on the current exchange rate, and scale the FPGA performance based on measured performance. At the time of writing, one Australian dollar is worth 0.700639 U.S. dollars. The predicted performance for the XCV1000E running DES was 894Mkeys/sec; achieved performance was 500Mkeys/sec. The DES CPU performance also needs to be scaled up by approximately 2.5 to account for the low speeds achieved by the SolNET client compared with the distributed.net client. The predicted FPGA performance for RC5 matched quite closely with the achieved performance (predictions were 1.0625 times faster.) Scaling with these figures ignores many factors but will suffice for this analysis.

CPU and FPGA family comparison for DES

The figure above shows the price/performance ratios for each CPU and FPGA family. The entire CPU range is compressed into the left-hand side of the graph; even at the high end, they do not come anywhere near the search rate of a low-end FPGA. It can be seen that searching DES on general purpose CPUs is very costly compared to searching with FPGAs.

CPU and FPGA family comparison for RC5

The figure above shows the same comparison for the RC5 cipher with the less competitive FPGA families removed. CPUs now perform better than FPGAs at the same price. They still cannot match the performance offered by high-end FPGAs.

These two comparisons show that the choice of implementation technology can greatly affect the time and cost to perform an exhaustive key search. In a practical key search machine, the technology must be selected to match the cipher being attacked.

In general

Over time, FPGAs will most likely become more efficient for key search than CPUs. This is because CPU performance does not scale linearly with available silicon area; it is limited by bus speeds, interactions between instructions and limited parallelism. FPGA performance for key search will scale linearly; if twice as much silicon area is available, twice as many search units can be implemented. FPGAs will thus become more important in future cryptanalysis. Already, improved CPU performance is becoming dependent on increasing parallelism; SIMD techniques and HyperThreading are examples of this.

CPUs are, of course, far easier to obtain than FPGAs. Many organisations already have a large computing infrastructure that could be used to perform key searches. distributed.net and other software RC5 efforts have demonstrated the feasibility of this approach. FPGA hardware is very rare in comparison, especially in the quantities that would be needed to conduct key searches.

CPUs need a large amount of support hardware (heatsinks, RAM, chipsets, multi-voltage power supplies and so on) which drives up the cost of a CPU-based key search machine. All of this hardware is very cheap and available. Storage space for it becomes more of a concern. FPGAs have a clear advantage here; many FPGAs can be mounted on a card that will fit within a computer case. It may be possible to design a motherboard for commodity CPUs that has some of these advantages.

Ultimately, neither CPUs nor FPGAs are very efficient for conducting key searches compared with ASICs. CPUs are inefficient due to their support hardware and program-based operation; FPGAs are inefficient because of their generic hardware structure. ASICs have custom hardware and a low unit price, but very high initial price.

Extrapolation for ASICs

In [48], Craig Ulmer reports that ASIC implementations can achieve three times the speed of an FPGA implementation, and ten times the density. This is useful as a general guide, but not in this analysis since the area required by FPGA dice is not easily obtained.

Instead, we can estimate ASIC costs based on the gate count required and infer the cost of a gate array device. During the Map phase of FPGA compilation, ISE reports the ‘equivalent gate count’ for an ASIC implementation of the design. This is based partly on the data contained within [49], and can be used to determine an approximate ASIC cost. The DES implementation on the XCV1000E uses 453,968 gates, and the RC5 implementation uses 1,353,397 gates. RC5’s large gate count is due to the amount of RAM used, including the additional RAM blocks used to reduce routing delays on the FPGA implementation. Both of these figures include interface and controller logic. It would also be possible to find a tradeoff between die size with final cost.

According to [50], a Virtex E design should be implementable with a CMOS-10HD gate array. This is designed around a 250nm process, which seems reasonable for a 1:1 speed and density conversion; the Virtex E family uses a 180nm process. No measures on the physical size of a CMOS-10HD die were available, but [51] claims 15k gates/mm² for a CMOS-9HD die. Assuming that the number of gates per mm² scales linearly with feature size, we get approximately 30k gates/mm². Assuming a 50% gate utilisation ratio gives us approximately 900kgates for DES, or 30mm².

MOSIS [52] provides small-quantity ASIC fabrication. They also provide an online price list [53]. We select the TSMC 250nm process (CL025) as one that should be suitable; other 250nm processes are approximately the same price. This gives a fabrication cost of $44,200 for 40 parts. $2,500 more will be required for packaging [54]. This is a high average price per part, but not a great deal more than the price of the XCV1000E. No pricing data was available for larger quantities.

Comparison of CPUs, FPGAs and ASICs for DES

Without further pricing data, it is difficult to perform an intelligent cost comparison involving ASICs. We can, however, use the price of packaging as a bare minimum cost per device to determine a price point at which ASICs become a viable option. The figure shows ASICs, CPUs and FPGAs compared at their best price points for the DES cipher. We can see that to assemble a machine equivalent in power to the EFF DES Cracker, FPGAs remain the most cost-effective choice. ASICs do not become price-effective until the machine performance reaches almost 400Gkeys/sec – over four times the performance of the EFF machine.

Spartan 3 FPGAs were not included in this comparison due to their uncertain pricing and performance. In addition, their $/Mkeys/sec ratio is below that of the ASIC design given, meaning that the two curves would never converge (as they would if all fabrication options had been considered.) It will be interesting to update this analysis when Spartan 3 pricing stabilises.

The below figure shows known FPGA DES key search machines and the performance that was predicted by Blaze et al. in [2]. Extrapolating their estimates with Moore’s Law gives an estimate of 1120Mkeys/sec for a $200 FPGA today. Performance estimates for the FPGAs priced around $200 are also shown.

Previous FPGA DES key search machines and performance estimates

The graph shows that no implementation has matched the performance predicted in 1996 for FPGA devices, regardless of the price of FPGA device used. The implementation presented in this thesis moves closer to predictions (as a percentage of expected performance) but still falls short. It also uses an FPGA that costs $938 today, well in excess of the $200 quote given. Other $200 FPGA devices are predicted to achieve similar performance.

The XC3S1000 is interesting; it has a very high capacity for its price. It still falls short of the estimate, but not by much. Its predicted price is also well under $200. It will be interesting to see if this price remains accurate in Q1 2004. No larger Spartan 3 devices are shipping yet, so a device with a price closer to $200 could not be selected.

CPUs

CPUs are more suited to RC5 key search than FPGAs. A large-scale machine to complete the RSA RC5-72 challenge in one year might be considered. This requires an average search rate of almost 150Tkeys/sec to conduct a complete sweep of the key space (half a year on average to find the key). The most cost-effective CPU is the Duron 1600, achieving 5.0Mkeys/sec at a price of $58. To reach the target search rate, almost 30 million CPUs will be needed, costing over $1.7 billion. This is before considering extras such as RAM, motherboards, power, heat removal and storage space.

At present, distributed.net is achieving a key rate of approximately 120Gkeys/sec [55]. At the current rate, the RC5-72 challenge will probably be solved in 624 years.

A more feasible machine might attempt to match the performance of the EFF DES Cracker, which achieved 92.6Gkeys/sec using a large number of gate array ASICs. The most cost-effective CPU is again the Duron 1600, achieving 21.5Mkeys/sec (distributed.net scale) for $58. Over 4300 CPUs would be needed at a cost of over $250,000. This is not excessively expensive, but again ignores support hardware and other extras.

FPGAs

FPGAs perform extremely well for DES key search. A machine matching the speed of the EFF DES Cracker could be constructed from XC2S200E devices ($25, 149Mkeys/sec). Spartan 3 devices were not considered due to their unstable pricing. 622 devices would be needed, at a total cost of $15,540. The EFF machine spent $130,000 on materials; it is not clear how much of this was spent on ASIC fabrication.

Alternatively, XC2VP20 devices could be used. They are slightly more expensive at a given search rate, but far fewer devices would be needed. This would reduce auxiliary costs significantly. To match the performance of the EFF DES Cracker, a mere 78 devices would be needed. Each device costs $299, giving a total component cost of $23,322. In contrast, the EFF machine used 1536 devices spanning many circuit boards and several physical cabinets.

At the top end of the FPGA spectrum, XC2VP100 devices could be used. These are the largest Xilinx devices that are currently shipping. Only 16 devices would be required. The total device cost would be $89,264, but the physical space consumed by the machine would be very small – less than that of a single board in the EFF DES Cracker.

FPGAs are generally more expensive than CPUs when performing RC5 key searches, and so will not be considered. Using the theoretical pipelined design may be profitable; a single (expensive) FPGA could search 100–200Mkeys/sec.

Accounting for hardware costs

The device cost of a large-scale key search machine is not the only factor affecting a machine’s cost. All technologies require circuit boards, controllers, assembly, testing, power, storage and cooling considerations to be addressed.

ASICs and FPGAs

ASIC and FPGA implementations can use the estimates provided by Wiener [15]. A circuit board that can support 120 small package ICs is reported to cost $300. The devices used by Wiener are 18mm square. FG676 packages (as used on the XC2VP20) are 27mm square, fitting approximately 35 devices per board. PQ208 packages (as used on the XCS200E) are approximately the same size (28mm.) The microcontrollers used by Wiener are not needed since controllers can be integrated into the FPGAs. Assuming that only one FG676 or PQ208 can fit into the space occupied by four of Wiener’s ASICs allows 35 devices per board.

From this we can see the value of high-density devices. One XC2VP20 has eight times the performance of an XCS200E. A machine capable of 100Gkeys/sec would be take just over four days on average to find a key. Taking into account circuit board costs, a machine using XCS200E devices would span 671 devices, 20 boards and cost $22,641. A similar machine using XC2VP20 devices would span 84 devices, three boards and cost $26,016. Factoring in controllers, power supplies and mechanical concerns according to Wiener’s figures gives a total of $33,741 for the XCS200E machine and $33,816 for the XC2VP20 machine. Power consumption, heat generation and storage space for the XC2VP20 machine would be significantly lower at only a very small increase in total price.

Using the preliminary pricing for XC3S1000 devices gives 89 devices, three boards and $13,663. Most of the cost in this machine is devoted to auxiliary hardware, meaning that higher speed machines would cost less in relation to the key rate achieved.

Cost of key search machines and their expected search time

Figure “Cost of key search machines” infers the cost to find a DES key in a given amount of time. Wiener’s estimates, the EFF DES Cracker and the estimates proposed by this work for XC2VP20 devices are shown. We can see that to achieve very short search times a lot of money must be spent, and vice-versa.

CPUs

Commodity computer hardware pricing is needed to infer the remainder of the hardware costs. Assuming that computers can boot from a network, each CPU will need a heatsink, motherboard, case, power supply, network adapter and a small amount of RAM. Using an all-in-one motherboard and low quality case reduces costs significantly. Each CPU will need approximately AUD$175 in support hardware.

It has been shown time and time again that using a long key is the best way to protect a cipher from an exhaustive key search attack. Assuming a cipher with a 90 bit key length (as recommended in [22]) that can be attacked at the same speed as DES, 132 billion XC2S200E devices would be needed to cover the key space in a year, at a price of $3.3 trillion. Even using the best available Spartan 3 device (XC3S400) will cost $850 billion and need over 35 billion devices. Of course, attacks become even more infeasible if the key length is increased only slightly.

Capabilities

Well-resourced entities can feasibly attack ciphers that use long key lengths. Assuming the performance of the XC2VP20 machine is maintained regardless of key length, we can see the cost to obtain a key within a year in the table below. If an attacker is prepared to wait for a year, 56 bit ciphers like DES are trivial to defeat using current technology. Each additional bit added to the key doubles the cost to break it within one year.

5660646872768092

Key length	Cost	Potential attacker
$305	Bored teenager
$4,900	Employed adult
$78,000	Business department
$1.25 million	Large business
$20 million	Small government
$320 million	Large government entity
$5.1 billion	Significant inter-government collaboration
$21 trillion	Infeasible?

Minimum key lengths

The minimum key length required for a system depends on who the potential attackers will be. Assuming that we want messages encrypted today to be completely undecipherable to all known attackers, a 92 bit minimum key length seems to be appropriate.

Future data security must also be considered. Moore’s Law is currently the de facto method of predicting future computing capabilities. Over an 18 month period, Moore’s Law states that transistors per IC (or computing power) will double. If we apply this to a 20 year period, messages must be encrypted with 14 bits of additional key length to remain secure against all known potential attackers. 106 bits appears to be a suitable minimum key length to keep data secure for the next 20 years.

Data that needs to be kept secure over a longer period of time (such as census data) will need an even longer key. To keep data secure for the next hundred years, a 159 bit key seems appropriate.

All of these estimates ignore future computing technologies that may become available or new cryptanalytic techniques which may decrease the strength of a given cipher. Predicting suitable key lengths can be likened to telling the future. Given the low additional processing cost, using a key length of 192 or 256 bits should protect data from all known potential attackers in the foreseeable future.

Alternatives

When attacking a well-designed system that uses cryptography, it is rarely profitable to attack the ciphers themselves. Normally there are weaker areas of the system, such as software bugs, inadequate security policies, weak passwords and the people involved in the system. It will almost certainly be easier to exploit one of these areas (particularly people) to complete an attack against a system.

One of the strongest reasons that this research is valuable is in the context of legal restrictions on the use and export of strong cryptography. With it we can evaluate different ciphers in the face of key length restrictions, as well as the availability of the technology required to perform an exhaustive key search attack.

Cryptography controls

Unless otherwise referenced, information in this section was assimilated from [56], [57] and [58]. They should be consulted for more details.

Local regulations on cryptography have changed significantly in recent years. Australia is a party to the Wassenaar Agreement, which restricts the export of “dual-use goods”, including encryption. It is vague in parts (particularly as to what constitutes an “export”), but sufficiently restrictive to raise concerns. Australia’s regulations are more restrictive in that any export requires approval from the Defence Signals Directorate (DSD), which deals with Australia’s signals intelligence and information security [59]. The Defence and Strategic Goods List [60] describes goods which may be subject to export controls, including encryption. Many different products are covered by the legislation, including nuclear, biological, optical, semiconductor and other technology goods. It is frequently updated.

Obtaining export approval usually involves submitting an export application [61] with the DSD. To date no applications have been rejected, although some companies have been informed that their applications will be rejected without having applied. An early assessment of cryptographic goods can be performed to determine if export approval will be required for that good [62].

There are no restrictions on the of cryptography within Australia or the importation of cryptography into Australia.

The United States has comparatively tight controls on cryptographic exports. Currently, symmetric cryptography using up to 56 bit keys is able to be exported once it has undergone a one-time review. Export of any cryptography is not permitted to seven “terrorist countries (also known as “Tier 4” [63].) As can be seen in the results from this thesis, 56 bit symmetric cryptography is not very resistant to brute force attacks. Someone operating under these constraints would be advised to select a cipher that is relatively slow and expensive to attack, such as RC5.

The legal export status of this thesis and its accompanying CD can be questioned. The thesis itself is probably safe to export without approval, since it does not contain any cryptographic algorithms. The CD can almost certainly not be exported without approval, since it contains cryptographic source code.

Computing controls

Regulatory issues exist for exports of high performance computers to certain countries. This was most visible when exports of the Playstation II gaming console to China were denied [64] for fears that they may enhance China’s military capability. The regulations have recently been updated to increase the allowed performance of exported devices [65]. Computer exports are controlled for “Tier 3” countries, which generally includes any countries that are not allied with the United States. Exemptions can be obtained to bypass these controls.

It is not clear whether FPGA devices are subject to export controls, but it would almost certainly be easy to force designs to fall under various performance classifications.

[2] M. Blaze, W. Diffie, R. L. Rivest, B. Schneier, T. Shimomura, E. Thompson, and M. Wiener, “Minimal key lengths for symmetric ciphers to provide adequate commercial security,” A Report by an Ad Hoc Group of Cryptographers and Computer Scientists, January 1996. [Online]. Available: http://www.schneier.com/paper-keylength.pdf

[9] E. Biham, “A fast new DES implementation in software,” Lecture Notes in Computer Science, vol. 1267, pp. 260 ??, 1997. [Online]. Available: http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/1997/CS/CS08%91.ps.gz

[36] Xilinx, Inc. SRL16 16-bit shift register look-up-table (LUT). [Online]. Available: http://toolbox.xilinx.com/docsan/xilinx5/data/docs/lib/lib0393_377.html

[39] P. Alfke and B. New, “Multiplexers and barrel shifters in XC3000/XC3100,” Xilinx, Inc., Tech. Rep. [Online]. Available: http://direct.xilinx.com/bvdocs/appnotes/xapp026.pdf

[44] A. Biryukov, A. Shamir, and D. Wagner, “Real time cryptanalysis of A5/1 on a PC,” Lecture Notes in Computer Science, vol. 1978, pp. 1+, 2001.

[45] (2003, October) Avnet electronics marketing. [Online]. Available: http://em.avnet.com/

[46] E. Peltzer, October 2003, private communication.

[47] S. Satria, October 2003, private communication.

[48] C. D. Ulmer, “Configurable Computing: Practical Use of Field Programmable Gate Arrays,” Ph.D. dissertation, School of Electrical and Computer Engineering, Georgia Institute of Technology, January 1999. [Online]. Available: http://users.ece.gatech.edu/~grimace/research/reports/qual_report.pdf

[49] Gate count capacity metrics for FPGAs, Feb 1997. [Online]. Available: http: //www.xilinx.com/bvdocs/appnotes/xapp059.pdf

[50] NEC Electronics America, Inc. FPGA to ASIC Conversion. [Online]. Available: http://www.necelam.com/asic/conversion.cfm

[51] NEC Electronics. NEC: Gate array information. [Online]. Available: http://www.necgatearray.com/content.nsf/webpages/gatearrayinfo

[52] The MOSIS Service. MOSIS Integrated Circuit Fabrication Service. [Online]. Available: http://www.mosis.org/

[53] The MOSIS Service. Domestic Price List for MOSIS IC Prototyping Service. [Online]. Available: http://www.mosis.org/Orders/Prices/price-list-domestic.html#tsmc25_logi%c

[54] The MOSIS Service. MOSIS Domestic Price List for ASAT Plastic Packages. [Online]. Available: http://www.mosis.org/products/assembly/plastic/price_domestic_asat.html

[55] distributed.net. RC5-72 Live Stats. [Online]. Available: http://www1.distributed.net/~pstadt/rc5-72/

[56] G. Pure and G. Taylor. The Australian cryptography FAQ. [Online]. Available: http://www.efa.org.au/Issues/Crypto/cryptfaq.html

[57] B.-J. Koops. Crypto law survey. [Online]. Available: http://rechten.uvt.nl/koops/cryptolaw/cls2.htm

[58] Electronic Frontiers Australia Inc. Crypto politics. [Online]. Available: http: //www.efa.org.au/Issues/Crypto/crypto2.html

[59] Defence Signals Directorate. Defence Signals Directorate. [Online]. Available: http://www.dsd.gov.au/

[60] Department of Defence. Defence and strategic goods list. [Online]. Available: http://www.defence.gov.au/dmo/id/export/DSGL_2003.pdf

[61] Department of Defence. Export application. [Online]. Available: http://www.defence.gov.au/dmo/id/export/dsec/AC717_Oct_03.pdf

[62] Department of Defense. Application for one time review. [Online]. Available: http://www.defence.gov.au/dmo/id/export/dsec/Onetime.pdf

[63] U. S. Bureau of Industry and Security. HPC - CTP Chart. [Online]. Available: http://www.bxa.doc.gov/HPCs/ctpchart.htm

[64] D. E. Sanger. Letting the Chips Fall Where They May. [Online]. Available: http://www.nytimes.com/library/review/061399china-chips-review.html

[65] Deputy Press Secretary. President changes export controls on computers. [Online]. Available: http://www.whitehouse.gov/news/releases/2002/01/20020102-3.html

\(s\)	State size in bits
\(n\)	Number of rounds needed to complete a phase
\(r\)	Number of LUTs required to perform a round
\(m\)	Number of bits of state modified during a round
\(c\)	Number of LCs required to perform a phase

Ian Howson