2. Background

This chapter summarises the history of programmable logic and surveys the research in the field. It covers some of the more important commercial architectures with particular attention to those intended for computing applications and run-time reconfiguration. A number of research architectures are also described.

2.1 Programmable logic

Over the years many advances have been made in the field of digital logic. From the days of discrete transistors progress has been made to small-scale integrated circuits initially, and then to successively greater densities with each generation of logic.

As the level of integration available became higher, so also did the level of specialisation for each available part become greater. This left hardware designers with the option of either using existing solutions or designing their own from smaller scale logic at much greater expense and with much greater size and power consumption. Designers wanting to develop sophisticated custom circuits were greatly stymied by this.

The advent of programmable logic devices (PLDs) was the first step in alleviating this problem. This allowed the function of a number of small scale logic devices to be placed on a single chip, reducing the amount of board space required for this kind of ``glue logic''. Configuration of these devices was through antifuses which when programmed appropriately allowed the designer to determine which inputs and outputs of each logic gate were connected to which internal buses. See figure 2.1.

Figure 2.1 - Partial circuit of a 22V10 PAL

Following PLDs came a solution to larger scale problems - gate arrays. These devices operate much like mask-programmed ROMs, except that rather than programming a memory image the designer was able to determine the wiring layout of the logic. Their great advantage over designing a custom chip for the same task was cost - since each production run involved changing only the wiring areas most of the transistor area of the die was known to work correctly and was able to be left unaltered. Since they were much cheaper than designing a custom chip for a task, gate arrays naturally became quite popular.

The disadvantage of gate arrays was although the lead times for production were not the several months that could be expected for a custom chip, they were still in the range of several weeks. Their density and speed was not as great as that of an equivalent custom chip, but this cost was greatly offset by the economy of design time.

The logical progression of combining the programmability of the smaller PLDs with the size of gate arrays led to programmable gate arrays (PGAs). These used a system of fusible links to determine the logic function and wiring of the device. While slow to program and relatively expensive per item compared to mask programmed gate arrays, they for the first time allowed the designer to quickly prototype complex circuits.

Fusible-link PGAs have one major frustration for the logic designer - they can be programmed only once so an iterative process of prototyping means programming, testing and discarding many chips. This can be quite expensive. As a result field programmable gate arrays were developed. With the configuration stored in static RAM (``Random Access Memory'') or EPROM (``Erasable Programmable Read Only Memory'') designers were now able to quickly and cheaply prototype a system which could then be either put into production using FPGAs or by using the same design with a mask programmed gate array.

2.1.1 FPGAs

FPGAs are general purpose programmable logic devices. Their programmable logic blocks can be connected with programmable interconnect, allowing designers to implement multi-level logic not possible with the two-level logic structure of PLDs. The capabilities and implementations of FPGAs vary widely between the available families however.

EPROM FPGAs are usually arranged like a set of PLDs connected with a configurable interconnection system. Some of these devices are Xilinx's EPLD, Altera's MAX series and AMD's Mach.

SRAM FPGAs can be generally categorised as island-style or cellular-style. Island-style ones include Xilinx LCAs, Altera Flex and AT&T's Orca. Cellular-style FPGAs include Triptych [BorrielloEbeling95], Algotronix's CAL, and Amtel's CLi [Furtek93].

2.1.2 FPGAs in electronic design

FPGAs have significant advantages over mask programmed gate arrays when applied to conventional logic design situations. The design cycle for FPGAs is much quicker - test pattern generation, mask making, wafer fabrication, packaging and testing are all replaced by a relatively quick download process. The expensive equipment necessary to perform this wafer fabrication is also not necessary - even small companies can afford the equipment to program FPGAs.

Given this quick turnaround time it is much easier to verify the correctness of a design - and also much less expensive to have made a mistake. Testing of the raw wafers is also much easier - since all of them are identical they are easily tested before sale instead of requiring a different test regime for each design.

The life cycle of a product using FPGAs can be extended by providing updated configurations. This also means that a product can be released before all its logic function has even be designed, and the improved logic configuration provided as a later upgrade along with any other extensions to the product. Variations in product demand are also more easily accommodated as they do not suffer from the long lead times associated with producing a batch of mask programmed gate arrays.

FPGAs do cost significantly more per item than mask programmed logic, however. High-volume applications make mask programmed logic look increasingly attractive, and the density achieved on FPGAs is considerably less than that possible with mask programmed gate arrays or custom devices. The programmable switches in FPGAs are larger and add more capacitance than the equivalent mask programmed links. This leads to FPGAs being larger and slower.

The greater size of FPGA logic also means that its logic function is smaller when compared to a gate array of similar size which is composed of the denser mask programmed logic. Current FPGAs are about ten times less dense than the equivalent mask programmed devices, so FPGAs are considerably more expensive when used to implement a specific logic function. This greater size for equivalent logic function also leads to interconnect being longer and slower - about twice as slow as a mask programmed gate array.

2.1.3 FPGA programming technology

An important element in programmable logic is the programming system itself - for programmable logic to be viable it must not use excessive amounts of die space. PGAs and PLDs used special antifuse circuit elements which could be permanently ``blown'' at programming time to allow a circuit to be created. These antifuses are quite small so it is feasible for them to be located at almost any point where two lines may need to be connected. Most PGAs employ large numbers of antifuses in crossbar-style arrays to provide versatile interconnect. This means that the majority of antifuses remain unused, but given the design of the interconnect arrays and the size of the antifuses this is not an excessive expense.

Antifuses use a higher-than-normal voltage programming pulse to permanently change the character of their silicon. An oxide-nitride-oxide dielectric normally prevents current from flowing between diffusion and polysilicon layers. When a programming pulse is applied the dielectric is melted and a circuit is formed between the diffusion and polysilicon, as shown in figure 2.2. [Actel94] [Greene93]

Figure [2.2 - "An antifuse"]

This programming method is very space efficient, but its irreversible nature is unsuited to reprogrammable FPGAs. SRAM-based FPGAs have the advantage that they can be reprogrammed at will, requiring no new hardware when changes are made to a logic design. Consequently their programming cells are rather different from antifuses.

An example of the type of programming cell used in SRAM-based FPGAs is shown in figure 2.3. The circular arrangement of inverters stores the programming value. The value stored in the cell can be changed using the input transistor. The input transistor is capable of stronger drive than the competing inverter so it is capable of overpowering the inverter to change the state of the cell. An N-transistor is shown on the output of the cell. This would be used if the cell was to be used as a direct replacement for an antifuse, allowing a circuit to be created or not as desired. Alternatively the programmable cell might be used as a configuration bit within a logic array cell in which case the inverter outputs could be used directly as the inputs to multiplexers or other logic.

Figure [2.3 - "A programmable interconnect point"]

SRAM-based FPGAs are volatile so they must be reprogrammed every time power is applied before they can be used. This is obviously a disadvantage and for this reason manufacturers have gone to some effort to make power-up reprogramming of FPGAs easy. Most commercial FPGAs have the ability to read their configuration from an EPROM automatically on power-up.

2.1.4 Commercial FPGA architectures

Most commerically-available FPGA architectures to date have been aimed at the electronic design market. Until recently little effort has been made to create FPGAs which are well suited to custom computing applications. The following sections describe FPGA designs for electronic applications, and culminate in some designs which are better suited to custom computing applications.

2.1.4.1 Xilinx 4000

One of the first and most popular SRAM-based FPGA architectures was developed by Xilinx. This architecture has gone through several design revisions, but still remains the definitive example of ``island-style'' FPGAs [Trimburger94]. Figure 2.4 shows the structure of the Xilinx 4000 series FPGAs [Xilinx94]. The routing structure of these devices allows arbitrary routing from any point to any other point. Routing resources are limited however, constraining the number and length of connection paths to keep the expense of routing resources relatively low.

Figure [2.4 - "Xilinx XC4000 interconnect"]

Each CLB (``configurable logic block'') has twelve main inputs which are connected to tracks adjacent to the cell, some in each of the four directions. The four outputs are each sent to two of the four directions, one vertical and one horizontal - making it easy to route an output signal in whatever direction is desired. Not all of the inputs and outputs are normally used - in fact it is common that most will be unused.

The routing structure has three lengths of lines - single, double and long. Single length lines travel the length of a single cell before entering a switch matrix. These switch matrices allow any horizontal line to be connected to any vertical line, as shown in figure 2.4. Double-length lines provide an option to allow longer transmission with lower propagation delays. Passing a signal through a switch matrix inevitably delays it slightly, so using longer length lines reduces the number of switches and hence the propagation delays. Long lines cover half the length of the die without entering a switch matrix. They are particularly useful for low-skew broadcast signals and permit long-distance communication to occur without excessive delays. Note that significantly less of the longer lines are available than local lines.

Note the relative sparseness of switch units compared to the PAL (``Programmable Array Logic'') design in figure 2.1. This is due to the greater expense of SRAM switches compared to antifuses.

Figure [2.5 - "Circuit of a Xilinx XC4000 cell" (from www.xilinx.com)]

Each CLB contains two four-input lookup tables (LUTs) which can act independently and send their outputs through a latch if desired. The LUTs each allow arbitrary functions of up to four inputs to be created. The outputs of the two LUTs can be put through another LUT if desired and combined with an external signal to provide some functions of up to nine inputs in a single cell. See figure 2.5.

Other functions are also available, including the ability to access configuration SRAM as a register from within the circuit.

The philosophy of the Xilinx LCA is representative of that of the island-style FPGAs - relatively powerful functional blocks are connected with versatile but expensive interconnect. The space taken by a CLB is a fraction of that used by its associated interconnect, so it is worthwhile creating more powerful CLBs to reduce the number of them needed to create a complex circuit.

The alternative to this philosophy is to make the interconnect far less versatile - and so far less expensive. With interconnect taking less space it is viable to greatly increase the number of logic cells which can be accommodated within a die. This is the essence of the ``cellular-style'' FPGAs [Trimburger94].

2.1.4.2 Xilinx 6200 series

Figure 2.6 - Xilinx XC6200 random access programming interface

The Xilinx XC6200 series is a recent development aimed at the growing reconfigurable computing market. It incorporates a RAM-style configuration interface which permits it to be partially or wholly reconfigured at runtime, as shown in figure 2.6. Significantly, this series has taken a step away from Xilinx's traditional island model. Cells have efficient local interconnect to their neighbours. This emphasis on locality is also employed in grouping cells together into 4 by 4 blocks with fast local interconnect. See figures 2.7 and 2.8.

Figure 2.7 - A block of Xilinx XC6200 cells
Figure 2.8 - Xilinx XC6200 nearest-neighbour communication

Another significant change in this architecture is a move from PLD-style crossbar interconnect to input multiplexers as shown in figure 2.9. Each of three input multiplexers can select an input from cells in one of four directions, or from one of four adjacent medium-distance interconnect lines. Each cell has four outputs, one for each direction. Each output may be configured to take either a routing function or output the value computed by the function unit. This approach is very much like the cellular-style approach used in Triptych and Atmel, amongst others. As will be seen later, the Xilinx XC6200 uses a similar approach to that which was independently developed for the Switchblade architecture described in this thesis.

Figure 2.9 - Xilinx XC6200 cell logic

2.1.4.3 Research on FPGAs

In 1970, Shoup described a system which used flip-flops to program a logic array instead of the more conventional mask programming or fusible links [Shoup70]. This was essentially the birth of the FPGA. Shoup's research used an array structure very similar to the PGAs it was based on and while some promising results were obtained the idea was well ahead of its time - the levels of integration available at the time were not adequate to make FPGAs viable.

In the early eighties Snyder described the ``CHiP'' architecture - a system of programmable interconnect which can be seen as the precursor to the interconnect schemes used in current-day FPGAs [Snyder82]. This was not a logic array design - rather it described a way of programmably interconnecting higher-level microprocessors - but it took the idea of field-programmable interconnection a step further than the simple scheme proposed by Shoup a decade earlier. One particularly interesting feature of the CHiP architecture was its ability to switch between a number of available configurations in response to a broadcast signal.

In the mid eighties the idea resurfaced with startup company Xilinx's 2000 series field-programmable gate arrays. The idea quickly spread, with a number of other companies - notably Altera - vying for this new segment of the programmable logic market.

Figure 2.10 shows the structure of the Altera FLEX device. This is very similar in design to the fusible link PGAs which preceded it, but with SRAM programmable switches substituted for the antifuses.

Figure 2.10 - Altera FLEX architecture

The Plessey/Pilkington ERA (``Electrically Reprogrammable Array'') was an early FPGA with fine-grained design attributes similar to conventional gate arrays rather than programmable logic devices [DillienPhillips89] [HowardTaylor92]. Pilkington's later DPLD (``Dynamically Programmable Logic Device'') took a similar fine grained approach but with enhancements to the cell and routing structures [Jhitta93].

The CAL architecture was a system which specifically addressed systolic and cellular automaton algorithms - algorithms which are not efficiently implemented on conventional computers [KeanGray90]. Labyrinth was another architecture aimed at massively parallel, register-intensive computation [FurtekStone90]. It had a single-bit register and a half-adder per cell, complemented by a highly symmetrical, scalable architecture.

Triptych (figure 2.11) took a different approach from the Xilinx ``island style'' designs which favoured increasingly complex logic blocks with expensive interconnect structures. Instead, the Triptych ``cellular'' approach (as characterised by Trimburger [Trimburger94]) involves a homogeneous array of very simple logic blocks with very limited interconnect. The basic structure shown at the left also has access to more conventional routing channels as shown the centre diagram. On the right is shown the result of reflecting the same structure in both directions. Logic blocks acting as interconnect provide extra routing resources if more sophisticated routing is required [EbelingBoriello91]. This approach has the advantage of simplifying the placement and routing process and can offer density improvements, particularly in applications with regular structure. Each individual cell is relatively simple and compact, making high densities possible as shown in figure 2.12. A later development of the architecture was called ``Montage'' and consisted entirely of asynchronous logic whilst retaining the same general structure [HauckBorriello92] [BorrielloEbeling95].

Figure 2.11 - Triptych routing architecture
Figure 2.11 - Triptych cell architecture

A step into the other direction complexity-wise was ORCA, AT&T's FPGA [HillBritton92]. This system had complex cells, each of which could process four bits simultaneously. Applicability to computing applications was enhanced by this approach, though at the expense of generality.

Tabula Rasa was an FPGA developed at AT&T Bell Labs which was specifically aimed at rapid development of logic [Hill90]. The system facilitated monitoring of the circuit under test.

Chow and Rose chose an experimental approach in determining their FPGA architecture [ChowRose93]. They found that a more successful approach was a ``mini-tiles'' approach, similar to that used in PLAs. Each tile has a degree of local routing resources, and tiles are connected by a higher level global routing structure.

DPGA (``Dynamic Programmable Gate Array'') was an architecture designed by Bolotski and DeHon [BolotskiDehon94] [Dehon94]. This was probably the first real implementation of multiple configuration stores in an FPGA, though the idea had been outlined earlier by Saleeba in [Saleeba93]. Their architecture including a method for nonintrusive background loading of configurations is described in more detail in [Tau95] and [Dehon96a]. This work was later enhanced with multiplexed interconnect and renamed TSFPGA (``Time-Switched Field Programmable Gate Array'') [Dehon96a]. A good study of the applications and benefits of multiple configuration stores was also presented by DeHon [Dehon96].

A device specifically intended for boolean neural networks was designed by Duggan and Haycock [DugganHaycock93]. A 3d structured FPGA with optical interconnect is described in [DepreitereNeefs94]. An architecture aimed at variable word length computers is described in [HartensteinKress94]. MATRIX, an array of small, reconfigurable general purpose processors is connected with reconfigurable routing in [MirskyDehon96]. Other FPGA architectures included TUTCA [KorpiharjuViitanen91], Teramac [AmersonCarter96], one using area I/O [Maheshwari96], an improved cascade circuit [Woo96], an architecture with special support for on-chip memory [Wilton96], an FPGA with support for both analog and digital circuits [Chow96] and the architecture used in the HARP system [StansfieldPage95]. There is also an unusual architecture which implements data driven virtual hardware [LingAmano93].

The Aptix CLi architecture is an example of a ``cellular'' architecture [Jenkins94]. Compared to the Xilinx architecture it has relatively small amounts of long-distance routing resources. Each cell receives two signals from each of four of its neighbours and sends its own two outputs to all four of its neighbours. The logic cells are much smaller and simpler, and require much fewer programming bits.

Atmel's ``Cache Logic''TM is a related cellular architecture [Rosenberg93]. The circuit of an Atmel cell is shown in figure 2.13. The relatively simple logic cell is almost identical to CLi's. Figure 2.14.

Figure 2.13 - Atmel cell circuit
Figure 2.14 - Atmel interconnection scheme

A number of researchers have investigated the properties of reconfigurable logic array architectures from a more theoretical point of view. Rose and Francis investigated the basic questions of how many inputs per logic block was optimal and how much functionality should be embedded in each logic block [RoseFrancis90]. They concluded that the maximum functionality per block input should be provided. Extending this work, Rose and Brown investigated the benefits of routing resources at various levels [RoseBrown91].

Chan and Schlag discuss the tradeoffs between routability, rearrangeability and speed in [ChanSchlag93]. Hill and Woo look at the possibility of splitting LUTs into multiple, smaller LUTs in [HillWoo93].

Viability of reconfigurable logic arrays compared with conventional processors is investigated in [AlbaharnaCheung94] and [AlbaharnaCheung94b]. They conclude that a radical change in approach would be needed for logic arrays to be a viable substitute. This appears to be in radical contrast to [VuilleminBertin96], which concludes that not only are reconfigurable processors much faster than general processors (when applied to certain problems) but predicts how the gap will grow over time using Noyce's law.

2.2 Extended general purpose processors

Some researchers have taken the approach of implementing a conventional processor on reconfigurable logic. The advantage is that certain parameters of the processor can be adjusted to better suit a particular application.

In 1988, Wolfe and Shen introduced the use of Xilinx FPGAs as a tool for application specific processor design. They described a ``flexible processor'' which could be configured for a specific task using a number of parameters [WolfeShen88]. This idea has been taken further by Maki, Whitaker and Ganesh in their 1991 ``Reconfigurable Data Path Processor'' which permitted even more fundamental control over the organisation of the components of the processor [MakiWhitaker91] and similarly in [WangGulak93]. A compiler to create customised coprocessors was described in [RazdanSmith94], and a system to create custom microprocessors is described in [Page93].

A number of researchers have implemented general purpose processors on FPGAs since then, including microcontrollers [GrunbacherJaud92] [Davidson93], fuzzy logic microcontrollers [SurmannUngering92], digital signal processors [PetersenHutchings95], parameterised RISC (reduced instruction set computing) processors [Meier95], VLIW (very long instruction word) machines [IseliSanchez93] and asynchronous RISC processors [Brunvand92].

Some researchers have concentrated on automatic creation of customised instruction sets specifically for an application [AthanasSilverman93] [LewisVanierssel93]. Witting's OneChip incorporated a general purpose processor core with FPGA functional units [Witting95a]. Wirthlin and Hutchings investigated the idea of a dynamic instruction set computer which changed its instruction set over the period of execution [WirthlinGilson94] [WirthlinHutchings95] [WirthlinHutchings95b].

2.3 Custom computing machines

2.3.1 Configurable computing devices

FPGAs were originally conceived as an easily reprogrammable PGA. The applications were thought to be limited to those which PGAs undertook - generally application-specific hardware for computers. Once in production these static circuit designs rarely change (if ever) so the reprogrammable aspect of the devices can only be considered an aid to quick, cheap development. Given the high cost of FPGAs compared to Mask Programmed Gate Arrays (MPGAs) they are really only suited to lower-volume applications where development cost is significant compared to production costs.

It wasn't long before people realised that reprogrammable logic devices had greater potential. Xilinx described applications where hardware which had multiple operating modes could be programmed with a different configuration for each of these modes. An example is a monitor which could operate in either a horizontal (landscape) or vertical (portrait) mode [TannguyenOyama90]. When rotated the device reloaded its configuration to correspond to its new orientation. Other examples include a tape reader with separate code generation and code checking modes, printer interfaces and Charge Coupled Device (CCD) camera interfaces [Xilinx91] [Fawcett93] [Mayrhofer94] [Shand95].

FPGAs also struck a chord with people involved in designing special-purpose computing devices or ``Custom Computing Machines'' as they have come to be known. Development of these devices using Application-Specific Integrated Circuits (ASICs) and MPGAs was very slow and expensive. Using FPGAs as replacements for MPGAs was certainly an improvement but they quickly realised that in many cases they could use the same hardware for entirely different applications, simply changing the configuration to suit its new role. This meant not only a huge saving in the cost and time taken to develop new hardware for each special-purpose computing project, but the same hardware could be used on rotation by several projects if conditions permitted this. For instance one project could be running while another was undergoing placement and routing.

This new kind of usage of FPGAs was the birth of a new kind of computing device - the reconfigurable logic array processor. Early experiments with reconfigurable logic array processors showed that the technology had a great deal of promise, and consequently this has become a popular field of investigation.

Perle was the logic array processor which was the basis for many of these early experiments [BertinRoncin89]. It consisted of an array of Xilinx FPGAs with a Motorola VME bus interface to a host processor. The later model ``Perle-1'' also included 32MB of RAM for intermediate storage of results. A number of applications including high-precision arithmetic, the popular RSA public-key encryption scheme, neural networks, finite difference calculations and image processing demonstrated in some cases several orders of magnitude performance gains over conventional processors. The RSA applications were also an order of magnitude faster than the fastest ASIC of the time - an impressive result considering that reconfigurable logic is inherently slower than custom implementations [BertinRoncin93].

The reason for this surprising speed advantage over custom silicon was that in reconfigurable logic the algorithms could be easily modified and incrementally improved with a much quicker design cycle than that involved in ASICs. By the time a single ASIC design could reach silicon a reconfigurable design had gone through several generations of improvement.

A system called ``Programmable Active Memory'' was used to code the speed-critical part of an algorithm, which was then mapped on to the two dimensional array of FPGAs using an optimisation process. The coding process was quite laborious and slow, but still much quicker than hardware design iterations. The XACT Design Editor is used to create data in the PAM model.

A similar project with a slightly different focus was ``SPLASH'' [GokhaleHolmes91] [Lopresti91]. Instead of the two-dimensional array architecture used by Perle, the SPLASH developers opted for a linear arrangement designed specifically for one-dimensional systolic problems. While this architecture is less general than Perle, it performed extremely well in its application of DNA (Deoxyribonucleic Acid) sequence matching, two orders of magnitude faster than Cray supercomputers.

SPLASH uses the VHDL hardware-description language [IEEE88].

One further advantage of reconfigurable logic array processors which these projects brought to the fore was the economies of scale which could be realised with FPGAs which could not be as easily achieved with the ASIC solutions previously used. Since the FPGA array was usually reused for a number of projects it became economically viable to make quite large arrays, where ASIC solutions were more heavily constrained by funding and were generally not as large.

The benefits of quick design cycles leading to faster development of algorithms and economy leading to larger arrays means that reconfigurable logic arrays are not only faster than conventional CPUs at many applications, they are often also faster than custom silicon.

A very popular application of reconfigurable logic array processors is in logic emulation. It is hard to imagine an application which more naturally suits the nature of logic arrays. The RPM logic emulator by Quickturn Design Systems Inc. views logic arrays as a method for prototyping designs for ASICs [Walters90]. Designs can be rapidly prototyped on reconfigurable logic, tested in-circuit, and then known-good designs can be sent to a foundry to create equivalent silicon. Of course the FPGA version may not be able to run at quite the same speeds as purpose-built silicon but it is still millions of times faster than simulation for large circuits.

RPM is programmed using a large netlist which is partitioned over a number of FPGAs. A variety of high-level tools can be used to create the netlist.

``A Flexible Processor'' described by Wolfe and Shen in 1988 emulated a conventional CPU in several FPGAs and could be configured to emulate a variety of processor architectures [WolfeShen88]. Arithmetic and Logic Unit (ALU) operations, instruction sets and pipelining schemes could be reconfigured to determine which schemes were most successful for particular applications. While this processor was a conventional circuitry-into-FPGA approach, it demonstrates that conventional sequential processor architectures can be implemented quite effectively using reconfigurable logic arrays.

2.3.2 Systems

Many reconfigurable logic computing systems have been developed. Most of these are research tools, but commercial systems are starting to appear [Mchenry95]. By far the most popular application of these systems is problems previously left to custom computing machines, so unsurprisingly most of the architectures are targetted at these kinds of problems.

Most system architectures are based on commercial FPGAs, and these tend to be connected together in ways which are fairly predictable given the devices. System architecture with reconfigurable processors is a relatively young field and most effort so far is concentrated on providing data from external devices to the logic array quickly enough to keep the array busy.

Perle was based around the coprocessor model of a conventional processor which accessed a reprogrammable logic array as a peripheral device (it was not a coprocessor in the true sense of the word). The array was comprised of Xilinx devices. Two generations of Perle exist - Perle-0 [BertinRoncin89] and Perle-1 [BertinTouati94] [VuilleminBertin96].

SPLASH is a system using Xilinx FPGAs which was one of the earliest research vehicles for reconfigurable computing [Waugh91] [GokhaleHolmes91]. It is a plugin board for a host computer and provides a set of FPGAs connected in a linear arrangement which is optimised for vector-style operations. A later refinement ``Splash 2'' provided up to 16 boards each of which comprised 16 Xilinx 4010 FPGAs connected linearly and also through a crossbar [ArnoldBuell92] [Arnold93].

A large number of other systems have been created, many of which are interesting but do not warrant specific comment with respect to this thesis [RosendahlPaille91] [AthanasSilverman91] [MakiWhitaker91] [CockshottShaw92] [SueyoshiApduhan92] [Athanas92] [MonaghanCowen93] [CuccaroReese93] [Wood93] [WazlowskiAgarwal93] [AgarwalWazlowski94] [Box94] [AmersonCarter95] [BoxNieznanski95] [DarnauerGaray95].

ARMEN was a machine with a slightly different approach. It combined a number of transputers with FPGAs, and is discussed in more detail in section 2.4.1 [RaimbaultLavenier93] [ChampeauPape94].

SPACE is a machine which was designed to exploit the correspondence between the locality of circuit elements in the system and the locations of computational elements in some applications - for example cellular automata [ShawMilne92] [MilneCockshott93].

Some architectures are more obviously defined by their intended applications - for instance the GANGLION system for neural networks [CoxBlanz92] and MORRPH for image processing [DrayerKing95] [MoranMartinez95].

A number of systems are directed at circuit prototyping, with the aim of implementing the final circuits on another architecture. Anyboard [PetersenThomae91] [ThomaeVandenbout92] [Vandenbout93], Realiser [Butts92], HARP [Page94] [LawrenceKay94], Teramac [CulbertsonAmerson95] and PROTEUS [HayashiMiyazaki95].

Of more immediate relevance to this thesis is work in making complete computer systems entirely based on reconfigurable processing. Linde and Nordstrom took this approach, building a machine comprised of FPGAs and memory [LindeNordstrom92]. The system was successful, although it was a fairly straightforward architecture. Casselman offered a rather more unusual approach, connecting his FPGAs with programmable interconnect - offering yet another dimension of reconfigurability [Casselman93].

A number of projects have regarded FPGAs as coprocessors [ChowCasselman95], with degrees of success varying from ``the custom computers based only on FPGA execution units show little performance improvement over state-of-the-art workstations'' [BergmannMudge94] [AlbaharnaCheung96] to orders of magnitude speedup [GokhaleHolmes91].

A very interesting run-time reconfigurable system developed by Wirthlin and Hutchings was able to ``cache'' configurations on the logic array [WirthlinHutchings96]. This approach is essentially equivalent to the multiple configuration store approach used by Saleeba [Saleeba93] or DeHon [Dehon94].

An alternative system which combined fixed and reconfigurable logic to gain benefits from both approaches is described in [BakkesDuplessis96]. In an application involving large amounts of floating point multiplication they showed that special purpose floating point units were faster than reconfigurable hardware only, but that a combination could yield even better results.

2.3.3 Applications

Throughout the history of reconfigurable logic arrays the strongest driving force behind their development has been from people wanting to use the properties that set them aside from more conventional forms of computer. Most commonly the people who started using reconfigurable logic came from a ``Custom Computing Machine'' (CCM) background where it is natural for developers of an application to be looking for these kinds of unconventional hardware solutions. Soon it became clear to many people in the field that reconfigurable logic arrays offered many compelling benefits for people creating custom computing machines and reconfigurable logic became strongly championed by these people.

As a result of this application-derived background, many sophisticated applications have been developed for reconfigurable hardware and its value in this area has been clearly established. As yet there is a paucity of applications in other areas so the value of this technology is not so well established in non-CCM uses.

Fawcett characterised the application of configurable hardware as being divided into three areas - systems with built-in diagnostics, adaptable system designs and systems with multi-purpose hardware [Fawcett93]. It can be argued that this rather limiting classification scheme can be broadened to accept most general computing tasks when more recent advances in the technology are taken into consideration. The following brief survey of reconfigurable computing applications serves to demonstrate the breadth of application to date, even on the current FPGA architectures which were not designed with configurable computing in mind.

Since the intent of this thesis is to widen the scope of applicability of logic array processors it pays to take a look first at exactly where they are currently being applied to understand what their current limitations are.

2.3.3.1 Database searching and sequence comparison

A popular application of reconfigurable logic is in genetics. Quick searching for sequences of genetic code in very large databases is important to geneticists, yet these operations are very expensive and slow on conventional computers. Reconfigurable logic is well suited to this type of application and good results have been obtained from its application here.

The SPLASH system was a landmark reconfigurable logic system using Xilinx devices [GokhaleHolmes91]. Its first major appliation was in genetic sequence comparison [Lopresti91]. Subsequently a more advanced system, SPLASH2, was developed [ArnoldBuell92] and again it was applied to genetic sequences [Hoang93]. Related work on SPLASH2 included text keyword searching [PryorThistle93].

Others have also applied reconfigurable logic to database searching [Fagin93] [GuntherMilne96] - in the case of Gunther and Milne using run-time partial reconfiguration to enhance performance of the search. Speed gains in these types of applications are typically described as ``order of magnitude'' [Fagin93].

2.3.3.2 Neural networks

Another popular application of reconfigurable logic is in neural networks. Neural networks are particularly well-suited to reconfigurable logic for a number of reasons. Often a serious restriction of reconfigurable logic systems is their I/O - often they can process data much quicker than I/O and memory devices can supply it. Neural networks on the other hand tend to be relatively self-contained and so their performance is more often bounded by processing speed than I/O. Neural networks also have the ability to be ``trained'' - a feature which lends itself to subsequent optimisation and reconfiguration of the network to a particular state of training.

``RM-nc'' was an early system designed specifically for neural network applications [ErdoganWahab92] [ErdoganHong93]. This system required the neural network to be described in VHDL, and provided a fully parallel implementation with three layer back propagation - both forward and backward passes being implemented in logic.

Daalen and Jeavons used a bit serial stochastic approach combined with reconfiguration to develop a high performance neural network system [DaalenJeavons93]. Eldredge took a run-time reconfiguration approach to maximise the density of a Xilinx-based neural network system for his masters thesis [Eldredge93]. He described the performance and density advantages of his RRANN system [EldredgeHutchings94] [EldredgeHutchings96] and developed the run-time reconfiguration algorithms [EldredgeHutchings94b] [EldredgeHutchings94c].

Guccione and Gonzalez used their earlier programming system work [GuccioneGonzalez93] to construct a multi-layer feed-forward neural network using a vector based data parallel approach [GuccioneGonzalez93b]. The algorithm was described using a subset of C and automatically translated into logic.

Duggan and Haycock took a slightly different approach. They argued that while reconfigurable logic is a good medium for hardware neural networks the available devices of the day were poorly suited to the task. As a result they developed their own device which was specifically aimed at boolean neural network applications [DugganHaycock93].

Another stochastic approach, this time using Xilinx lookup tables achieved a very high density - up to two synapses per Xilinx CLB [BadeHutchings94]. A different approach to increased density was achieved by time-multiplexing sections of the algorithm on to a single Atmel FPGA using Atmel's facility for run-time reconfiguration [LysaghtStockwood94]. This provided the higher density as a tradeoff for slower overall system speed.

2.3.3.3 Image processing

Image processing is another field which lends itself to the use of reconfigurable logic. The tendency of image processing problems to break down into a large number of very similar computations means that most image processing problems are very well suited to the fine-grained parallelism which reconfigurable logic arrays offer.

Work by Lazarus and Meyer demonstrated the feasibility of using FPGAs for avionics signal processing [LazarusMeyer93]. Their system provided a library of primitives which could be configured into the array for different types of processing. A similar system was concurrently developed by Ross and Vellacott [RossVellacott93], and another proof-of-concept system with a bent to digital filtering was developed by Chou [Chou93].

A 3*3 programmable convolution system is described in [BarrosAkil92]. A fifth-order wave digital elliptical filter implementation is described in [LeeserChapman93]. This work compares the efficiency of filter systems created using synthesis tools against hand-optimised logic. Similar concerns with synthesis tools led to Shoup's parameterised system for generating convolution filters [Shoup93].

A more specialised application of FPGAs in image processing is the optical character recognition (OCR) algorithm developed by Tse, Leung and Cheng [TseLeung93]. This was used in Chinese character recognition and achieved speedups of three orders of magnitude when compared with contemporaneous personal computer (PC) systems.

An application using multimode reconfigurability for telescope image data acquisition and processing is described in [Shand95].

The SPLASH [GokhaleHolmes91] system was applied to video-rate image processing by Abbott and Athanas [AbbottAthanas94]. They implemented a number of image processing and vision algorithms including the Hough transform and pyramid generation. Their approach is described in more detail in [AthanasAbbott94] and [AthanasAbbott95]. Real-time vision is also addressed by Quenot and Kraljic's ``Data-Flow Functional Computer'' [QuenotKraljic94]. This machine generates logic from algorithms written in a functional language and its effectiveness is demonstrated with several algorithms including connected component labelling, nonlinear filtering and coloured object tracking. A Canny edge detector was implemented in [Luk94]. A high performance scalable convolution system is described in [RathaJain95]. This operates at video rates and is suitable for computer vision applications.

A system called MORRPH was developed in 1995 and is specifically aimed at realtime image processing problems [DrayerKing95] [MoranMartinez95]. It allows easy addition of external special-purpose support chips matched to the needs of an application. It was used mainly as a preprocessor to more complex algorithms run on a host PC.

Villasenor, Schoner, Chia and Zapata describe a system for automatic target recognition in [VillasenorSchoner96]. They take advantage of the reconfigurability to generate target-specific configurations and find that the application maps very well to FPGAs. Culbertson, Amerson, Carter, Kuekes and Snider describe an algorithm on the Teramac system for automatic artery extraction in medical images in [CulbertsonAmerson96].

2.3.3.4 Custom computing machines

Various other applications traditionally attacked by researchers in the field of custom computing machines have also been investigated using FPGAs. Cellular automata systems have been developed with some success [BouazzaChampeau91] [HowardTaylor92], in the latter case using the Plessey ERA FPGA.

Classic computational problems such as the many-bodied problem and the travelling salesman problem have been implemented on reconfigurable logic. The many-bodied problem is attacked by Monaghan and Noakes in [MonaghanNoakes92], specifically as a cheaper alternative to supercomputers. The SPLASH2 system is applied to the travelling salesman problem in [GrahamNelson95] using a genetic algorithm approach. This achieved an order of magnitude speedup over high-end workstations. A detailed performance analysis of the system in comparison to workstations is provided in [GrahamNelson96]. Genetic algorithms are also implemented on FPGAs and applied to the chip partitioning problem in [SitkoffWazlowski95].

DES encryption is implemented using Xilinx devices in [TseYuk93]. Fast RSA encryption is implemented in [ShandBertin91] and [ShandVuillemin93] using the Perle system [VuilleminBertin96], and they achieve order of magnitude improvements over competing technologies.

Originally developed in 1989 [BertinRoncin89], the Perle system was one of the first and best-developed systems. It used Xilinx FPGAs and has been applied to a wide range of problems including computer arithmetic, cryptography, error correction, image analysis, stereo vision, video compression, sound synthesis, neural networks, high-energy physics, thermodynamics, biology and astronomy, often with groundbreaking results. These results are summarised in [VuilleminBertin96].

Variable word length arithmetic is studied in [ShandBertin91] and [Daumas94]. Perle-1 is used to evaluate polynomials at high speed in [Ercegovac95] using a most-significant-digit-first approach.

A system specifically designed to solve Monte-Carlo problems was developed Cowen and Monaghan [CowenMonaghan94]. This system achieved order of magnitude speedups over Digital Signal Processing (DSP) hardware.

2.3.3.5 Graphics

A promising area for configurable hardware is in computer graphics. Graphics algorithms which are capable of operating at frame rates are much easier to realise with configurable hardware than general purpose processors.

An early feasibility study was done by Singh and Bellec in which they investigated a number of FPGA graphics architectures [SinghBellec94]. They compared reconfigurable graphics hardware with custom hardware and general purpose processors.

Run-time reconfiguration was applied by Schoner, Jones and Villasenor to video coding. Their system used a number of passes each of which involved a different configuration.

Volume visualisation is a popular scientific application for graphically viewing data which requires powerful realtime processing. Dao, Cook and Silver presented a volume rendering system using a ray casting approach as a coprocessor on PC hardware [DaoCook95]. The Teramac system was applied to volume visalisation by Culbertson, Amerson, Carter, Kuekes and Snider [CulbertsonAmerson96]. They used parameterised designs to improve the scalability of their system and improved their tools to operate better with this mode of working.

2.3.3.6 Logic emulation

Another area with traditional appeal for FPGAs is in logic emulation. Since a reconfigurable logic array is the closest thing to the target application which offers reprogrammability their utility in the area is obvious. Lysaght had students using FPGAs to do experiments in hardware multitasking [Lysaght91]. Babb and Terrier proposed ``virtual wires'' which were in fact a form of multiplexed interconnect as a way of overcoming the I/O limitations of logic emulators [BabbTessier93]. Rapid logic prototyping with FPGAs was covered in [Spie95].

2.3.3.7 Other applications

For completeness, this section covers some other applications that reconfigurable logic has been applied to. Camerota and Rosenberg detailed ways of using Atmel FPGAs for data acquisition [CamerotaRosenberg93]. Reconfigurable logic has also been used in fuzzy logic controllers [SchmitThomas95], telescope control [ShandWei95] particle accelerator control [MollVuillemin95] and infra-red missile warning applications [Box94]. Heeb and Pfister described a workstation fitted with an FPGA coporocessor as standard [HeebPfister92]. The pioneers from the Perle project applied their system to a wide variety of applications including arithmetic, cryptography, error correction, image analysis, stereo vision, video compression, sound synthesis, neural networks, high-energy physics, thermodynamics, biology and astronomy [VuilleminBertin96].

2.4 FPGA programming systems

Most programming systems for FPGAs have grown out of older MPGA systems. This means that not only are the systems often not particularly well adapted to FPGAs, they are very poorly suited to dynamically reconfigurable logic arrays.

Traditionally, the software has viewed PGAs as a collection of gates and the methods derived from this view persist today. While FPGAs were originally designed to replace large numbers of gates this is not necessarily the most efficient way of using them. The optimisation algorithms used in many packages attempt to factor logic heavily, resulting in large numbers of small gates. FPGA architectures with complicated cells such as Xilinx are poorly suited to this method as they can express far more complicated functions using their LUTs.

Added to this are the fundamental differences in the implementation of FPGAs. Issues related to the slow interconnect in FPGAs have a substantial effect on the optimal placement and routing, yet these factors are largely ignored by existing software.

At this low-level end of the problem more work clearly remains to be done to improve the mapping of circuits on to arrays. This remains outside the scope of this thesis however.

An interesting trend which corresponds to the advent of reconfigurable logic array processors is an increase in interest in programming logic arrays with higher-level languages. The influx of people with a software background rather than a hardware background bring with it a different attitude - users want to quickly and easily write and test code rather than going through a lengthy hardware design process each time they write a program. This new attitude has resulted in much interesting work in mapping algorithms to hardware which are covered in section 2.4.3.

For any modern computer system to be usable it must have a suitable programming system. This section reviews the work to date. Unfortunately it was not possible within the contraints of the project to develop a programming system for Switchblade so this remains a challenge for the future.

2.4.1 Parallel programming

Reconfigurable logic arrays can be considered a type of processor uniquely well suited to parallel processing. One traditional problem of parallel program is that the problem must be cleverly divided between the available processing resources. Often these resources are quite coarse-grained and not in a form convenient for the algorithm. In addition to this the algorithm's needs might change during the course of its execution and yet the resources of a conventional multiprocessor computer will not.

A great deal of effort in parallel programming goes into adapting the algorithm to the available architecture. With reconfigurable logic this kind of problem ceases to be an issue - the logic array can be programmed to offer exactly the level of parallelism and fine- or coarse-grainedness required for the situation. Not surprisingly the field has attracted the attention of a number of researchers in parallel programming.

A pioneer of the concept of reconfigurable computing, Estrin described the advantages of reconfigurable architectures to parallel programming in his paper ``Parallel Processing in a Restructurable Computer System'' [EstrinBussell63]. Although predating FPGAs by many years the ideas described in this paper are equally valid today.

More recently research into fine-grained parallelism has led parallel programming researchers to investigate programmable logic. A system called ``ARMEN'' based on Transputers and FPGAs provided a balance of coarse-grained MIMD (multiple instruction, multiple data) parallelism from its general purpose processors and fine-grained parallelism from the FPGAs [RaimbaultLavenier93]. Each node in the system has access to a ring of FPGAs, as well as a set of global special-purpose coprocessors [ChampeauPape94]. The synthesis system used to tranform parallel algorithms into object code is described in [DhaussyFilloque94].

The ``SPACE'' architecture (Scalable Parallel Architecture for Concurrency Experiments) is another architecture aimed at investigating parallelism [ShawMilne92]. The benefits of fine grained parallelism were investigated on this system with a particular bent towards highly parallel applications with a high degree of local communication - such as road traffic systems [MilneCockshott93].

Bolotski, DeHon and Knight created a high-level model to unify logic arrays with other computational devices [BolotskiDehon94] [Dehon96a]. Using this unified ``RP-space'' model they created DPGA, a logic array architecture which attempted to combine the best features of FPGAs and SIMD (single instruction multiple data) processors. The important architectural feature introduced here was the use of multiple configuration stores. This unified theory of computing was later applied to create TSFPGA (which introduced multiplexed interconnect) and MATRIX, a more conventional multiprocessor architecture with the ability to assign instruction resources at runtime [MirskyDehon96]. These are all described in [Dehon96a].

2.4.2 Theory

Some remarkable early research into reconfigurable computing was done by Estrin in the early 1960s, long before FPGAs appeared [Estrin60] [EstrinBussell63]. Estrin and his collaborators predicted that performance advantages could be obtained for a variety of problems using reconfigurable architectures. The levels of integration available prohibited the complexity achievable now, so much less ambitious techniques such as manual reconfiguration were considered a good option at the time.

The team's theoretical work identified the key features of the technology and the problems which required solution - such as the circuitry vs. software partitioning problem [EstrinTurn63]. They created a partitioning system which used a successive approximation approach and heuristics. The system performed well on the problems which they set it, but no further work was done in the field until the late 80s and the advent of FPGAs.

At this point a number of groups rediscovered the idea of reconfigurable computing fairly simultaneously. Wolfe and Shen described ``Flexible Processors'' [WolfeShen88] based on Xilinx FPGAs. Gray and Kean proposed ``Configurable Hardware'' [GrayKean89], describing what was to become their ``CAL'' architecture [KeanGray90].

Bertin and his associates developed an abstraction called ``Programmable Active Memories'' with a physical implementation based on Xilinx FPGAs - their ``Perle'' system. Perle is a very well developed system, consisting of a virtual machine executing on a conventional processor which reconfigures the logic array coprocessor as required. Many applications have been developed for various versions of Perle, with very impressive results [BertinRoncin89] [BertinRoncinVuillemin89] [ShandBertin91] [BertinRoncin93] [Touati93] [BertinTouati94] [MollVuillemin95] [Ercegovac95] [VuilleminBertin96].

The Perle researchers claim that as VLSI (``Very Large Scale Integration'') technology has improved the performance advantage of reconfigurable processors over conventional processors in many applications has increased. Vuillemin and Bertin applied Noyce's law to show how this performance gap will continue to increase over time [VuilleminBertin96].

Formal verification of reconfigurable modules became possible with Shaw and Milne's work which involved translating ``Occam'' language programs into asynchronous modules and using the ``Circal'' process algebra to formally verify them [ShawMilne92] [CockshottShaw92].

Work towards a unified approach to describing hardware and software has made some progress. Some software tools which provide various levels of transparency between the two modes are described in the next section. A theoretical model and an accompanying set of tools for unifying the two media are described in [Page96]. Singh and Hogg developed a system which goes further by describing run-time reconfigurable algorithms in terms of partial evaluation, though this system seemed to be limited in its scope [SinghHogg96].

2.4.3 Software tools

Handcrafted software tools for reconfigurable logic systems abound in the research environment and yet so far there is little in the way of generally available tools. As yet, the field is young enough that there is a great deal of work to be done on some fundamental aspects of tool creation. High-level language to logic compilation is in its infancy. The partitioning of an algorithm into sections some of which are executed in hardware and some of which run in software is also very much a research area, particularly with regard to run-time reconfiguration.

A number of researchers have concentrated on the conventional tools available to them, creating applications using hardware description languages such as VHDL [KempaJung92] [ArnoldBuell93] [Singh95] and in some cases adapting this approach to the new medium [GokhaleKopser90].

The desire to program these systems using more convenient and familiar tools resulted in a rash of high-level language to FPGA compilers, particularly in the popular language C [Athanas92] [Arnold93] [GuccioneGonzalez93] [Athanas93] [AgarwalWazlowski94] [WoForward94] [AgarwalWazlowski94] [WoForward94] [JantschEllervee94] [Gokhale94] [Galloway95] [IseliSanchez95] [ClarkHutchings96]. Others have chosen to use their own languages and systems which may be better suited to defining a logic algorithm; Occam [PageLuk91] [CockshottShaw92] [ShawMilne92], Ruby [LukLok93] [Luk95] [GuoLuk95], and Handel [SpiveyPage93]. These tools provided the ability to create applications much more quickly than their VHDL equivalents, although with some considerable expense in their efficiency of utilisation of the logic resource.

Simultaneously, many researchers have attacked the partitioning problem - in many cases as part of a high-level language to FPGA synthesis system [Athanas93] [ThomaeVandenbout92] [LukLok93] [WoForward94] [GokhaleMarks95]. Other related synthesis work includes [HartensteinHirschbiel91] [WuPerkowski92].

Some tools are specifically geared at rapid creation of a subclass of configurations. For example tools to enable rapid prototyping of DSPs [GebotysGebotys94], customised formats for fast floating point units [ShiraziWalters95] or custom general-purpose processors, tailored to a specific task [RazdanSmith94] [Luk95].

Other tools which have been created include simulation tools for runtime reconfiguration [LysaghtStockwood95], prototyping/validation tools for embedded systems [HerpelHeld94], and C++ libraries for simulation and design of reconfigurable logic applications [Touati93].

2.5 Dynamic reconfiguration

One interesting optimisation which is possible with reconfigurable logic is the ability to reconfigure the logic array at runtime. This has become an active research area, with many applications having exploited this feature despite the poor support for it in currently available technology.

This poor support comes down to one essential problem - how to reconfigure a device to its new state as rapidly as possible. Many current-generation devices are configured bit-serially (which is slow), only the entire device can only be reconfigured (no partial reconfiguration is possible) and the device is offline for the duration of the reconfiguration process. These factors combine to make runtime reconfiguration slow and inconvenient.

Four features which may aid an architecture in efficient run-time reconfiguration [Saleeba93] are:

  • Partial reconfiguration
  • Fast reconfiguration
  • Background loading
  • Multiple configuration stores

Early experiments using run-time reconfiguration include neural networks [Eldredge93] [DaalenJeavons93] [EldredgeHutchings94] [EldredgeHutchings94b] [LysaghtStockwood94] [EldredgeHutchings96], graphics and image processing [RossVellacott93] [SinghBellec94] [SchonerJones95] [Shand95] and database searching [LemoineMerceron95] [GuntherMilne96]. Other uses include hardware multitasking [Lysaght91], adaptable bit-length arithmetic [Daumas94] and a general purpose processor with an instruction set that varies during execution [WirthlinHutchings95]. These experiments indicate that in some applications there is substantial gain to be had from run-time reconfiguration.

Self-reconfiguration is the ability of a reconfigurable logic array processor to control its own reconfiguration as described by Saleeba [Saleeba93] and French [FrenchTaylor93]. These early papers proposed simple bus interfaces to the configuration store. Self-reconfiguration interfaces are explored in more detail in this thesis.

Partial reconfiguration is an important feature for run-time reconfigurable systems as it makes loading much less expensive as described by Saleeba [Saleeba93]. Lysaght and Dunlop discuss it and the concept of ``Logic Caching'' in [LysaghtDunlop93]. The Atmel ``Cache Logic''TM series of FPGAs is derived from these ideas.

Hadley and Hutchings discussed the benefits of partial reconfiguration with respect to their RRANN neural network system [HadleyHutchings95] [HadleyHutchings95b] [Hadley95], and found that in this application a doubling of speed could be achieved. With Wirthlin, Hutchings worked on a dynamic instruction set computer which used partial reconfiguration to change the instruction set at run-time [WirthlinHutchings95b].

DeHon and his associates have also worked on improving reconfiguration speed. Their DPGA and TSFPGA architectures provide both partial reconfiguration through a RAM-style bus interface and multiple configuration stores to allow instantaneous switching between configurations [Dehon94] [Tau95] [Dehon96a]. This also provides a basis for background loading as one context can be operating while another is being loaded through the partial reconfiguration interface.

A recent architecture offering partial reconfiguration is the Xilinx XC6200 series of FPGAs. These provide a bus-style host processor interface which can be used to randomly access the configuration store and hence partially reconfigure the device.

The ability to reconfigure a processing element at runtime leads to the question of how to decide what configurations to load into configuration store at what times. This problem of ``temporal partitioning'' is addressed in [GokhaleMarks95]. A rather more simplistic view is taken in [HastieCliff90], where programs are considered as a set of ``hardware subroutines'' to be loaded on demand.

Another system which uses the idea of caching configurations is that of Wirthlin and Hutchings [WirthlinHutchings96]. They describe how the caching system reduced the amount of time required to run-time reconfigure their logic array as reconfigurations were often re-used. In a more general discussion they discuss the division between compile-time and run-time reconfiguration [HutchingsWirthlin95]. They propose run-time constant propagation as a method of obtaining further speed increases in [WirthlinHutchings97a], a similar idea to the ``data-folding'' scheme described in [Foulk93].

The idea of multiple configuration stores which can be selected between by a broadcast signal was introduced by Snyder [Snyder82]. This was not however in the context of a logic array, rather it was used to switch the programmable interconnect of an array of higher-level microprocessors.

Partial evaluation is a technique which can be used to defer computation until it is strictly required. Singh and Hogg investigated a field of problems which can be expressed using partial evaluation in such a way as to map each part to a configuration - leading directly to a dynamically reconfigurable version of the algorithm [SinghHogg96].

2.5.1 Multiplexing

Little research has been done so far into using multiplexing as a means of reducing the expense of routing signals over long distances on logic arrays. Babb and Tessier investigated the use of multiplexing to overcome I/O limitations on their logic emulator [BabbTessier93] but didn't seek to extend this work to reducing internal routing expense. Bit serial circuit construction methods were investigated by Isshiki [Isshiki96] but again he didn't apply his methods to logic array architecture.

Simultaneous with the multiplexed routing system described in this thesis, DeHon was developing TSFPGA - a logic array architecture with time-switched internal routing [Dehon96a]. TSFPGA uses subarrays of logic blocks each of which has time switched input registers. The subarray is arranged as a set of logic blocks which share input lines and each LUT output is taken as an input to a crossbar. Crossbar outputs are sent to neighbouring subarrays both horizontally and vertically. TSFPGA has no facility for fast non-multiplexed communication between subarrays.