This chapter summarises the history of programmable logic and surveys the research in the field. It covers some of the more important commercial architectures with particular attention to those intended for computing applications and run-time reconfiguration. A number of research architectures are also described.
Over the years many advances have been made in the field of digital logic. From the days of discrete transistors progress has been made to small-scale integrated circuits initially, and then to successively greater densities with each generation of logic.
As the level of integration available became higher, so also did the level of specialisation for each available part become greater. This left hardware designers with the option of either using existing solutions or designing their own from smaller scale logic at much greater expense and with much greater size and power consumption. Designers wanting to develop sophisticated custom circuits were greatly stymied by this.
The advent of programmable logic devices (PLDs) was the first
step in alleviating this problem. This allowed the function of a
number of small scale logic devices to be placed on a single chip,
reducing the amount of board space required for this kind of ``glue
logic''. Configuration of these devices was through antifuses which
when programmed appropriately allowed the designer to determine which
inputs and outputs of each logic gate were connected to which internal
buses. See figure 2.1.
Following PLDs came a solution to larger scale problems -
gate arrays. These devices operate much like mask-programmed
ROMs, except that rather than programming a memory image the designer
was able to determine the wiring layout of the logic. Their great
advantage over designing a custom chip for the same task was cost -
since each production run involved changing only the wiring areas
most of the transistor area of the die was known to work correctly
and was able to be left unaltered. Since they were much cheaper than
designing a custom chip for a task, gate arrays naturally became
quite popular.
The disadvantage of gate arrays was although the lead times for
production were not the several months that could be expected for a
custom chip, they were still in the range of several weeks. Their
density and speed was not as great as that of an equivalent custom
chip, but this cost was greatly offset by the economy of design time.
The logical progression of combining the programmability of the
smaller PLDs with the size of gate arrays led to programmable
gate arrays (PGAs). These used a system of fusible links to
determine the logic function and wiring of the device. While slow to
program and relatively expensive per item compared to mask programmed
gate arrays, they for the first time allowed the designer to quickly
prototype complex circuits.
Fusible-link PGAs have one major frustration for the logic
designer - they can be programmed only once so an iterative process
of prototyping means programming, testing and discarding many chips.
This can be quite expensive. As a result field programmable gate
arrays were developed. With the configuration stored in static
RAM (``Random Access Memory'') or EPROM (``Erasable Programmable Read Only Memory'') designers
were now able to quickly and cheaply prototype a system which could
then be either put into production using FPGAs or by using the
same design with a mask programmed gate array.
FPGAs are general purpose programmable logic devices. Their
programmable logic blocks can be connected with programmable
interconnect, allowing designers to implement multi-level logic not
possible with the two-level logic structure of PLDs. The capabilities
and implementations of FPGAs vary widely between the available
families however.
EPROM FPGAs are usually arranged like a set of PLDs connected with
a configurable interconnection system. Some of these devices are
Xilinx's EPLD, Altera's MAX series and AMD's Mach.
SRAM FPGAs can be generally categorised as island-style
or cellular-style. Island-style ones include Xilinx LCAs,
Altera Flex and AT&T's Orca. Cellular-style FPGAs include Triptych
[BorrielloEbeling95],
Algotronix's CAL, and Amtel's CLi
[Furtek93].
FPGAs have significant advantages over mask programmed gate arrays
when applied to conventional logic design situations. The design cycle
for FPGAs is much quicker - test pattern generation, mask making,
wafer fabrication, packaging and testing are all replaced by a
relatively quick download process. The expensive equipment necessary
to perform this wafer fabrication is also not necessary - even small
companies can afford the equipment to program FPGAs.
Given this quick turnaround time it is much easier to verify the
correctness of a design - and also much less expensive to have made a
mistake. Testing of the raw wafers is also much easier - since all of
them are identical they are easily tested before sale instead of
requiring a different test regime for each design.
The life cycle of a product using FPGAs can be extended by
providing updated configurations. This also means that a product can
be released before all its logic function has even be designed, and
the improved logic configuration provided as a later upgrade along
with any other extensions to the product. Variations in product
demand are also more easily accommodated as they do not suffer from
the long lead times associated with producing a batch of mask
programmed gate arrays.
FPGAs do cost significantly more per item than mask programmed
logic, however. High-volume applications make mask programmed logic
look increasingly attractive, and the density achieved on FPGAs is
considerably less than that possible with mask programmed gate arrays
or custom devices. The programmable switches in FPGAs are larger and
add more capacitance than the equivalent mask programmed links. This
leads to FPGAs being larger and slower.
The greater size of FPGA logic also means that its logic function
is smaller when compared to a gate array of similar size which is
composed of the denser mask programmed logic. Current FPGAs are about
ten times less dense than the equivalent mask programmed devices, so
FPGAs are considerably more expensive when used to implement a
specific logic function. This greater size for equivalent logic
function also leads to interconnect being longer and slower - about
twice as slow as a mask programmed gate array.
An important element in programmable logic is the programming
system itself - for programmable logic to be viable it must not use
excessive amounts of die space. PGAs and PLDs used special antifuse
circuit elements which could be permanently ``blown'' at programming
time to allow a circuit to be created. These antifuses are quite
small so it is feasible for them to be located at almost any point
where two lines may need to be connected. Most PGAs employ large
numbers of antifuses in crossbar-style arrays to provide versatile
interconnect. This means that the majority of antifuses remain unused,
but given the design of the interconnect arrays and the size of the
antifuses this is not an excessive expense.
Antifuses use a higher-than-normal voltage programming pulse to
permanently change the character of their silicon. An
oxide-nitride-oxide dielectric normally prevents current from flowing
between diffusion and polysilicon layers. When a programming pulse is
applied the dielectric is melted and a circuit is formed between the
diffusion and polysilicon, as shown in
figure 2.2.
[Actel94]
[Greene93]
This programming method is very space efficient, but its
irreversible nature is unsuited to reprogrammable FPGAs.
SRAM-based FPGAs have the advantage that they can be reprogrammed at
will, requiring no new hardware when changes are made to a logic
design. Consequently their programming cells are rather different from
antifuses.
An example of the type of programming cell used in SRAM-based
FPGAs is shown in
figure 2.3.
The circular arrangement of inverters stores the programming value.
The value stored in the cell can be changed using the input
transistor. The input transistor is capable of stronger drive than the
competing inverter so it is capable of overpowering the inverter to
change the state of the cell. An N-transistor is shown on the output
of the cell. This would be used if the cell was to be used as a
direct replacement for an antifuse, allowing a circuit to be created
or not as desired. Alternatively the programmable cell might be used
as a configuration bit within a logic array cell in which case the
inverter outputs could be used directly as the inputs to multiplexers
or other logic.
SRAM-based FPGAs are volatile so they must be reprogrammed every
time power is applied before they can be used. This is obviously a
disadvantage and for this reason manufacturers have gone to some
effort to make power-up reprogramming of FPGAs easy. Most commercial
FPGAs have the ability to read their configuration from an EPROM
automatically on power-up.
Most commerically-available FPGA architectures to date have been
aimed at the electronic design market. Until recently little effort
has been made to create FPGAs which are well suited to custom
computing applications. The following sections describe FPGA
designs for electronic applications, and culminate in some designs
which are better suited to custom computing applications.
One of the first and most popular SRAM-based FPGA architectures
was developed by Xilinx. This architecture has gone through several
design revisions, but still remains the definitive example of
``island-style'' FPGAs
[Trimburger94].
Figure 2.4
shows the structure of the Xilinx 4000 series FPGAs
[Xilinx94].
The routing structure of
these devices allows arbitrary routing from any point to any other
point. Routing resources are limited however, constraining the number
and length of connection paths to keep the expense of routing
resources relatively low.
Each CLB (``configurable logic block'') has twelve main inputs
which are connected to tracks adjacent to the cell, some in each of
the four directions. The four outputs are each sent to two of the
four directions, one vertical and one horizontal - making it easy to
route an output signal in whatever direction is desired. Not all of
the inputs and outputs are normally used - in fact it is common that
most will be unused.
The routing structure has three lengths of lines - single, double
and long. Single length lines travel the length of a single cell
before entering a switch matrix. These switch matrices allow any
horizontal line to be connected to any vertical line, as shown in
figure 2.4.
Double-length lines provide an option to allow longer
transmission with lower propagation delays. Passing a signal through a
switch matrix inevitably delays it slightly, so using longer length
lines reduces the number of switches and hence the propagation delays.
Long lines cover half the length of the die without entering a switch
matrix. They are particularly useful for low-skew broadcast signals
and permit long-distance communication to occur without excessive
delays. Note that significantly less of the longer lines are available
than local lines.
Note the relative sparseness of switch units compared to the PAL
(``Programmable Array Logic'') design in
figure 2.1.
This is due to the greater expense of SRAM switches compared to antifuses.
Each CLB contains two four-input lookup tables (LUTs) which can
act independently and send their outputs through a latch if desired.
The LUTs each allow arbitrary functions of up to four inputs to be
created. The outputs of the two LUTs can be put through another LUT
if desired and combined with an external signal to provide some
functions of up to nine inputs in a single cell. See
figure 2.5.
Other functions are also available, including the ability to
access configuration SRAM as a register from within the circuit.
The philosophy of the Xilinx LCA is representative of that of the
island-style FPGAs - relatively powerful functional blocks are
connected with versatile but expensive interconnect. The space taken
by a CLB is a fraction of that used by its associated interconnect,
so it is worthwhile creating more powerful CLBs to reduce the number
of them needed to create a complex circuit.
The alternative to this philosophy is to make the interconnect
far less versatile - and so far less expensive. With interconnect
taking less space it is viable to greatly increase the number of
logic cells which can be accommodated within a die. This is the
essence of the ``cellular-style'' FPGAs
[Trimburger94].
The Xilinx XC6200 series is a recent development aimed at the
growing reconfigurable computing market. It incorporates a RAM-style
configuration interface which permits it to be partially or wholly
reconfigured at runtime, as shown in
figure 2.6.
Significantly, this
series has taken a step away from Xilinx's traditional island model.
Cells have efficient local interconnect to their neighbours. This
emphasis on locality is also employed in grouping cells together into
4 by 4 blocks with fast local interconnect. See
figures 2.7
and 2.8.
Another significant change in this architecture is a move from
PLD-style crossbar interconnect to input multiplexers as shown in
figure 2.9.
Each of three
input multiplexers can select an input from cells in one of four
directions, or from one of four adjacent medium-distance interconnect
lines. Each cell has four outputs, one for each direction. Each
output may be configured to take either a routing function or output
the value computed by the function unit. This approach is very much
like the cellular-style approach used in Triptych and Atmel, amongst
others. As will be seen later, the Xilinx XC6200 uses a similar
approach to that which was independently developed for the
Switchblade architecture described in this thesis.
In 1970, Shoup described a system which used flip-flops to
program a logic array instead of the more conventional mask
programming or fusible links [Shoup70]. This was essentially the
birth of the FPGA. Shoup's research used an array structure very
similar to the PGAs it was based on and while some promising results
were obtained the idea was well ahead of its time - the levels of
integration available at the time were not adequate to make FPGAs
viable.
In the early eighties Snyder described the ``CHiP'' architecture -
a system of programmable interconnect which can be seen as the
precursor to the interconnect schemes used in current-day FPGAs
[Snyder82]. This was not a logic
array design - rather it described a way of programmably interconnecting
higher-level microprocessors - but it took the idea of
field-programmable interconnection a step further than the simple
scheme proposed by Shoup a decade earlier. One particularly
interesting feature of the CHiP architecture was its ability to switch
between a number of available configurations in response to a
broadcast signal.
In the mid eighties the idea resurfaced with startup company
Xilinx's 2000 series field-programmable gate arrays. The idea quickly
spread, with a number of other companies - notably Altera - vying for
this new segment of the programmable logic market.
Figure 2.10
shows the structure of the Altera FLEX device. This is
very similar in design to the fusible link PGAs which preceded it, but
with SRAM programmable switches substituted for the antifuses.
The Plessey/Pilkington ERA (``Electrically Reprogrammable Array'')
was an early FPGA with fine-grained
design attributes similar to conventional gate arrays rather than
programmable logic devices
[DillienPhillips89]
[HowardTaylor92].
Pilkington's later DPLD (``Dynamically Programmable Logic Device'') took a similar fine grained approach but with
enhancements to the cell and routing structures
[Jhitta93].
The CAL architecture was a system which specifically addressed
systolic and cellular automaton algorithms - algorithms which are not
efficiently implemented on conventional computers
[KeanGray90].
Labyrinth was another architecture aimed at massively
parallel, register-intensive computation
[FurtekStone90]. It had a
single-bit register and a half-adder per cell, complemented by a
highly symmetrical, scalable architecture.
Triptych
(figure 2.11)
took a different approach from the Xilinx
``island style'' designs which favoured increasingly complex logic
blocks with expensive interconnect structures. Instead, the Triptych
``cellular'' approach (as characterised by Trimburger
[Trimburger94])
involves a
homogeneous array of very simple logic blocks with very limited
interconnect. The basic structure shown at the left also has access to more
conventional routing channels as shown the centre diagram. On the right is shown the
result of reflecting the same structure in both directions.
Logic blocks acting as interconnect provide extra routing resources if
more sophisticated routing is required
[EbelingBoriello91].
This approach has the advantage of simplifying the placement and routing
process and can offer density improvements, particularly in
applications with regular structure. Each individual cell is relatively
simple and compact, making high densities possible as shown in
figure 2.12.
A later development of the
architecture was called ``Montage'' and consisted entirely of
asynchronous logic whilst retaining the same general structure
[HauckBorriello92]
[BorrielloEbeling95].
A step into the other direction complexity-wise was ORCA, AT&T's
FPGA [HillBritton92].
This system had complex cells, each of which
could process four bits simultaneously. Applicability to computing
applications was enhanced by this approach, though at the expense of
generality.
Tabula Rasa was an FPGA developed at AT&T Bell Labs which was
specifically aimed at rapid development of logic
[Hill90]. The system
facilitated monitoring of the circuit under test.
Chow and Rose chose an experimental approach in determining their
FPGA architecture [ChowRose93].
They found that a more successful
approach was a ``mini-tiles'' approach, similar to that used in PLAs.
Each tile has a degree of local routing resources, and tiles are
connected by a higher level global routing structure.
DPGA (``Dynamic Programmable Gate Array'') was an architecture designed by Bolotski and DeHon
[BolotskiDehon94]
[Dehon94].
This was probably the first real
implementation of multiple configuration stores in an FPGA, though the
idea had been outlined earlier by Saleeba in
[Saleeba93].
Their architecture including a
method for nonintrusive background loading of configurations is
described in more detail in
[Tau95] and
[Dehon96a]. This work was
later enhanced with multiplexed interconnect and renamed TSFPGA
(``Time-Switched Field Programmable Gate Array'')
[Dehon96a]. A good study of the
applications and benefits of multiple configuration stores was also
presented by DeHon [Dehon96].
A device specifically intended for boolean neural networks was
designed by Duggan and Haycock
[DugganHaycock93]. A 3d
structured FPGA with optical interconnect is described in
[DepreitereNeefs94].
An architecture aimed at variable word length
computers is described in
[HartensteinKress94].
MATRIX, an array of small, reconfigurable general purpose processors
is connected with reconfigurable routing in
[MirskyDehon96]. Other FPGA
architectures included TUTCA
[KorpiharjuViitanen91],
Teramac
[AmersonCarter96], one
using area I/O
[Maheshwari96], an improved
cascade circuit [Woo96], an
architecture with special support for on-chip memory
[Wilton96], an FPGA with support
for both analog and digital circuits
[Chow96] and the architecture used
in the HARP system
[StansfieldPage95]. There
is also an unusual architecture which implements data driven virtual
hardware
[LingAmano93].
The Aptix CLi architecture is an example of a ``cellular'' architecture
[Jenkins94]. Compared to the
Xilinx architecture it has relatively small amounts of long-distance
routing resources. Each cell receives two signals from each of four
of its neighbours and sends its own two outputs to all four of its
neighbours. The logic cells are much smaller and simpler, and require
much fewer programming bits.
Atmel's ``Cache Logic'' A number of researchers have investigated the properties of
reconfigurable logic array architectures from a more theoretical
point of view. Rose and Francis investigated the basic questions of
how many inputs per logic block was optimal and how much functionality
should be embedded in each logic block
[RoseFrancis90].
They concluded
that the maximum functionality per block input should be provided.
Extending this work, Rose and Brown investigated the benefits of
routing resources at various levels
[RoseBrown91].
Chan and Schlag discuss the tradeoffs between routability,
rearrangeability and speed in
[ChanSchlag93].
Hill and Woo look at the
possibility of splitting LUTs into multiple, smaller LUTs in
[HillWoo93].
Viability of reconfigurable logic arrays compared with
conventional processors is investigated in
[AlbaharnaCheung94] and
[AlbaharnaCheung94b].
They conclude that a radical change in approach
would be needed for logic arrays to be a viable substitute. This
appears to be in radical contrast to
[VuilleminBertin96], which
concludes that not only are reconfigurable processors much faster than
general processors (when applied to certain problems) but predicts how
the gap will grow over time using Noyce's law.
Some researchers have taken the approach of implementing a
conventional processor on reconfigurable logic. The advantage is that
certain parameters of the processor can be adjusted to better suit a
particular application.
In 1988, Wolfe and Shen introduced the use of Xilinx FPGAs as a
tool for application specific processor design. They described a
``flexible processor'' which could be configured for a specific task
using a number of parameters
[WolfeShen88].
This idea has been taken further by Maki, Whitaker
and Ganesh in their 1991 ``Reconfigurable Data Path Processor'' which
permitted even more fundamental control over the organisation of the
components of the processor
[MakiWhitaker91]
and similarly in
[WangGulak93].
A compiler to create customised coprocessors was
described in
[RazdanSmith94],
and a system to create custom
microprocessors is described in
[Page93].
A number of researchers have implemented general purpose
processors on FPGAs since then, including microcontrollers
[GrunbacherJaud92]
[Davidson93],
fuzzy logic microcontrollers
[SurmannUngering92],
digital signal processors
[PetersenHutchings95],
parameterised RISC (reduced instruction set computing) processors
[Meier95], VLIW (very long
instruction word) machines
[IseliSanchez93]
and asynchronous RISC processors
[Brunvand92].
Some researchers have concentrated on automatic creation of
customised instruction sets specifically for an application
[AthanasSilverman93]
[LewisVanierssel93].
Witting's OneChip
incorporated a general purpose processor core with FPGA functional
units
[Witting95a].
Wirthlin and Hutchings investigated the idea of a
dynamic instruction set computer which changed its instruction set
over the period of execution
[WirthlinGilson94]
[WirthlinHutchings95]
[WirthlinHutchings95b].
FPGAs were originally conceived as an easily reprogrammable PGA.
The applications were thought to be limited to those which PGAs
undertook - generally application-specific hardware for computers.
Once in production these static circuit designs rarely change (if
ever) so the reprogrammable aspect of the devices can only be
considered an aid to quick, cheap development. Given the high cost of
FPGAs compared to Mask Programmed Gate Arrays (MPGAs) they are really
only suited to lower-volume applications where development cost is
significant compared to production costs.
It wasn't long before people realised that reprogrammable logic
devices had greater potential. Xilinx described applications where
hardware which had multiple operating modes could be programmed with
a different configuration for each of these modes. An
example is a monitor which could operate in either a horizontal
(landscape) or vertical (portrait) mode
[TannguyenOyama90].
When rotated the
device reloaded its configuration to correspond to its new
orientation. Other examples include a tape reader with separate code
generation and code checking modes, printer interfaces and Charge
Coupled Device (CCD) camera
interfaces [Xilinx91]
[Fawcett93]
[Mayrhofer94]
[Shand95].
FPGAs also struck a chord with people involved in designing
special-purpose computing devices or ``Custom Computing Machines'' as
they have come to be known. Development of these devices using
Application-Specific Integrated Circuits (ASICs)
and MPGAs was very slow and expensive. Using FPGAs as replacements
for MPGAs was certainly an improvement but they quickly realised that
in many cases they could use the same hardware for entirely different
applications, simply changing the configuration to suit its new role.
This meant not only a huge saving in the cost and time taken to
develop new hardware for each special-purpose computing project, but
the same hardware could be used on rotation by several projects if
conditions permitted this. For instance one project could be running
while another was undergoing placement and routing.
This new kind of usage of FPGAs was the birth of a new kind of
computing device - the reconfigurable logic array processor. Early
experiments with reconfigurable logic array processors showed that
the technology had a great deal of promise, and consequently this has
become a popular field of investigation.
Perle was the logic array processor which was the basis for many
of these early experiments
[BertinRoncin89].
It consisted of an array
of Xilinx FPGAs with a Motorola VME bus interface to a host processor. The
later model ``Perle-1'' also included 32MB of RAM for intermediate
storage of results. A number of applications including high-precision
arithmetic, the popular RSA public-key encryption scheme, neural
networks, finite difference
calculations and image processing demonstrated in some cases several
orders of magnitude performance gains over conventional processors.
The RSA applications were also an order of magnitude faster than the
fastest ASIC of the time - an impressive result considering that
reconfigurable logic is inherently slower than custom implementations
[BertinRoncin93].
The reason for this surprising speed advantage over custom silicon
was that in reconfigurable logic the algorithms could be easily
modified and incrementally improved with a much quicker design cycle
than that involved in ASICs. By the time a single ASIC design could
reach silicon a reconfigurable design had gone through several
generations of improvement.
A system called ``Programmable Active Memory'' was used to code the
speed-critical part of an algorithm, which was then mapped on to the
two dimensional array of FPGAs using an optimisation process. The
coding process was quite laborious and slow, but still much quicker
than hardware design iterations. The XACT Design Editor is used to
create data in the PAM model.
A similar project with a slightly different focus was ``SPLASH''
[GokhaleHolmes91]
[Lopresti91].
Instead of the two-dimensional array
architecture used by Perle, the SPLASH developers opted for a linear
arrangement designed specifically for one-dimensional systolic
problems. While this architecture is less general than Perle, it
performed extremely well in its application of DNA (Deoxyribonucleic
Acid) sequence
matching, two orders of magnitude faster than Cray supercomputers.
SPLASH uses the VHDL hardware-description language
[IEEE88].
One further advantage of reconfigurable logic array processors
which these projects brought to the fore was the economies of scale
which could be realised with FPGAs which could not be as easily
achieved with the ASIC solutions previously used. Since the FPGA array
was usually reused for a number of projects it became economically
viable to make quite large arrays, where ASIC solutions were more
heavily constrained by funding and were generally not as large.
The benefits of quick design cycles leading to faster development
of algorithms and economy leading to larger arrays means that
reconfigurable logic arrays are not only faster than conventional CPUs
at many applications, they are often also faster than custom silicon.
A very popular application of reconfigurable logic array
processors is in logic emulation. It is hard to imagine an
application which more naturally suits the nature of logic arrays. The
RPM logic emulator by Quickturn Design Systems Inc. views logic arrays
as a method for prototyping designs for ASICs
[Walters90]. Designs can
be rapidly prototyped on reconfigurable logic, tested in-circuit, and
then known-good designs can be sent to a foundry to create equivalent
silicon. Of course the FPGA version may not be able to run at quite
the same speeds as purpose-built silicon but it is still millions of
times faster than simulation for large circuits.
RPM is programmed using a large netlist which is partitioned over
a number of FPGAs. A variety of high-level tools can be used to create
the netlist.
``A Flexible Processor'' described by Wolfe and Shen in 1988
emulated a conventional CPU in several FPGAs and could be configured
to emulate a variety of processor architectures
[WolfeShen88].
Arithmetic and Logic Unit (ALU)
operations, instruction sets and pipelining schemes could be
reconfigured to determine which schemes were most successful for
particular applications. While this processor was a conventional
circuitry-into-FPGA approach, it demonstrates that conventional
sequential processor architectures can be implemented quite
effectively using reconfigurable logic arrays.
Many reconfigurable logic computing systems have been developed.
Most of these are research tools, but commercial systems are starting
to appear [Mchenry95]. By far
the most popular application of these systems is problems previously
left to custom computing machines, so unsurprisingly most of the
architectures are targetted at these kinds of problems.
Most system architectures are based on commercial FPGAs, and these
tend to be connected together in ways which are fairly predictable
given the devices. System architecture with reconfigurable processors
is a relatively young field and most effort so far is concentrated on
providing data from external devices to the logic array quickly enough
to keep the array busy.
Perle was based around the coprocessor model of a conventional
processor which accessed a reprogrammable logic array as a peripheral
device (it was not a coprocessor in the true sense of the word). The
array was comprised of Xilinx devices. Two generations of Perle exist
- Perle-0
[BertinRoncin89]
and Perle-1
[BertinTouati94]
[VuilleminBertin96].
SPLASH is a system using Xilinx FPGAs which was one of the
earliest research vehicles for reconfigurable computing
[Waugh91]
[GokhaleHolmes91].
It is a plugin board for a host computer and
provides a set of FPGAs connected in a linear arrangement which is
optimised for vector-style operations. A later refinement ``Splash 2''
provided up to 16 boards each of which comprised 16 Xilinx 4010 FPGAs
connected linearly and also through a crossbar
[ArnoldBuell92]
[Arnold93].
A large number of other systems have been created, many of which
are interesting but do not warrant specific comment with respect to this
thesis
[RosendahlPaille91]
[AthanasSilverman91]
[MakiWhitaker91]
[CockshottShaw92]
[SueyoshiApduhan92]
[Athanas92]
[MonaghanCowen93]
[CuccaroReese93]
[Wood93]
[WazlowskiAgarwal93]
[AgarwalWazlowski94]
[Box94]
[AmersonCarter95]
[BoxNieznanski95]
[DarnauerGaray95].
ARMEN was a machine with a slightly different approach. It combined
a number of transputers with FPGAs, and is discussed in more detail in
section 2.4.1
[RaimbaultLavenier93]
[ChampeauPape94].
SPACE is a machine which was designed to exploit the
correspondence between the locality of circuit elements in the system
and the locations of computational elements in some applications - for
example cellular automata
[ShawMilne92]
[MilneCockshott93].
Some architectures are more obviously defined by their intended
applications - for instance the GANGLION system for neural networks
[CoxBlanz92]
and MORRPH for image processing
[DrayerKing95]
[MoranMartinez95].
A number of systems are directed at circuit prototyping, with the
aim of implementing the final circuits on another architecture.
Anyboard
[PetersenThomae91]
[ThomaeVandenbout92]
[Vandenbout93],
Realiser [Butts92],
HARP [Page94]
[LawrenceKay94],
Teramac
[CulbertsonAmerson95]
and PROTEUS
[HayashiMiyazaki95].
Of more immediate relevance to this thesis is work in making
complete computer systems entirely based on reconfigurable
processing. Linde and Nordstrom took this approach, building a machine
comprised of FPGAs and memory
[LindeNordstrom92].
The system was
successful, although it was a fairly straightforward architecture.
Casselman offered a rather more unusual approach, connecting his FPGAs
with programmable interconnect - offering yet another dimension of
reconfigurability
[Casselman93].
A number of projects have regarded FPGAs as coprocessors
[ChowCasselman95],
with degrees of success varying from ``the custom
computers based only on FPGA execution units show little performance
improvement over state-of-the-art workstations''
[BergmannMudge94]
[AlbaharnaCheung96]
to orders of magnitude speedup
[GokhaleHolmes91].
A very interesting run-time reconfigurable system developed by
Wirthlin and Hutchings was able to ``cache'' configurations on the
logic array
[WirthlinHutchings96].
This approach is essentially equivalent to the multiple configuration store
approach used by Saleeba
[Saleeba93]
or DeHon
[Dehon94].
An alternative system which combined fixed and reconfigurable
logic to gain benefits from both approaches is described in
[BakkesDuplessis96].
In an application involving large amounts of
floating point multiplication they showed that special purpose
floating point units were faster than reconfigurable hardware only,
but that a combination could yield even better results.
Throughout the history of reconfigurable logic arrays the
strongest driving force behind their development has been from people
wanting to use the properties that set them aside from more
conventional forms of computer. Most commonly the people who started
using reconfigurable logic came from a ``Custom Computing Machine''
(CCM) background where it is natural for developers of an application to be
looking for these kinds of unconventional hardware solutions. Soon it
became clear to many people in the field that reconfigurable logic
arrays offered many compelling benefits for people creating custom
computing machines and reconfigurable logic became strongly
championed by these people.
As a result of this application-derived background, many
sophisticated applications have been developed for reconfigurable
hardware and its value in this area has been clearly established. As
yet there is a paucity of applications in other areas so the value of
this technology is not so well established in non-CCM uses.
Fawcett characterised the application of configurable hardware as
being divided into three areas - systems with built-in diagnostics,
adaptable system designs and systems with multi-purpose hardware
[Fawcett93].
It can be argued that this rather limiting
classification scheme can be broadened to accept most general
computing tasks when more recent advances in the technology are taken
into consideration. The following brief survey of reconfigurable
computing applications serves to demonstrate the breadth of
application to date, even on the current FPGA architectures which
were not designed with configurable computing in mind.
Since the intent of this thesis is to widen the scope of
applicability of logic array processors it pays to take a look first
at exactly where they are currently being applied to understand what
their current limitations are.
A popular application of reconfigurable logic is in genetics.
Quick searching for sequences of genetic code in very large databases
is important to geneticists, yet these operations are very expensive
and slow on conventional computers. Reconfigurable logic is well
suited to this type of application and good results have been
obtained from its application here.
The SPLASH system was a landmark reconfigurable logic system
using Xilinx devices
[GokhaleHolmes91].
Its first major appliation
was in genetic sequence comparison
[Lopresti91].
Subsequently a more
advanced system, SPLASH2, was developed
[ArnoldBuell92]
and again it
was applied to genetic sequences
[Hoang93].
Related work on SPLASH2
included text keyword searching
[PryorThistle93].
Others have also applied reconfigurable logic to database
searching [Fagin93]
[GuntherMilne96]
- in the case of Gunther and
Milne using run-time partial reconfiguration to enhance performance
of the search. Speed gains in these types of applications are
typically described as ``order of magnitude''
[Fagin93].
Another popular application of reconfigurable logic is in neural
networks. Neural networks are particularly well-suited to
reconfigurable logic for a number of reasons. Often a serious
restriction of reconfigurable logic systems is their I/O - often they
can process data much quicker than I/O and memory devices can supply
it. Neural networks on the other hand tend to be relatively
self-contained and so their performance is more often bounded by
processing speed than I/O. Neural networks also have the ability to be
``trained'' - a feature which lends itself to subsequent optimisation
and reconfiguration of the network to a particular state of training.
``RM-nc'' was an early system designed specifically for neural
network applications
[ErdoganWahab92]
[ErdoganHong93]. This system
required the neural network to be described in VHDL, and provided a
fully parallel implementation with three layer back propagation -
both forward and backward passes being implemented in logic.
Daalen and Jeavons used a bit serial stochastic approach combined
with reconfiguration to develop a high performance neural network
system
[DaalenJeavons93].
Eldredge took a run-time reconfiguration
approach to maximise the density of a Xilinx-based neural network
system for his masters thesis
[Eldredge93]. He described the
performance and density advantages of his RRANN system
[EldredgeHutchings94]
[EldredgeHutchings96]
and developed the
run-time reconfiguration algorithms
[EldredgeHutchings94b]
[EldredgeHutchings94c].
Guccione and Gonzalez used their earlier programming system work
[GuccioneGonzalez93]
to construct a multi-layer feed-forward neural
network using a vector based data parallel approach
[GuccioneGonzalez93b].
The algorithm was described using a subset of
C and automatically translated into logic.
Duggan and Haycock took a slightly different approach. They
argued that while reconfigurable logic is a good medium for hardware
neural networks the available devices of the day were poorly suited
to the task. As a result they developed their own device which was
specifically aimed at boolean neural network applications
[DugganHaycock93].
Another stochastic approach, this time using Xilinx lookup tables
achieved a very high density - up to two synapses per Xilinx CLB
[BadeHutchings94].
A different approach to increased density was
achieved by time-multiplexing sections of the algorithm on to a
single Atmel FPGA using Atmel's facility for run-time reconfiguration
[LysaghtStockwood94].
This provided the higher density as a tradeoff
for slower overall system speed.
Image processing is another field which lends itself to the use of
reconfigurable logic. The tendency of image processing problems to
break down into a large number of very similar computations means that
most image processing problems are very well suited to the
fine-grained parallelism which reconfigurable logic arrays offer.
Work by Lazarus and Meyer demonstrated the feasibility of using
FPGAs for avionics signal processing
[LazarusMeyer93].
Their system
provided a library of primitives which could be configured into the
array for different types of processing. A similar system was
concurrently developed by Ross and Vellacott
[RossVellacott93],
and
another proof-of-concept system with a bent to digital filtering was
developed by Chou [Chou93].
A 3*3 programmable convolution system is described in
[BarrosAkil92].
A fifth-order wave digital elliptical filter
implementation is described in
[LeeserChapman93].
This work compares
the efficiency of filter systems created using synthesis tools
against hand-optimised logic. Similar concerns with synthesis tools
led to Shoup's parameterised system for generating convolution
filters [Shoup93].
A more specialised application of FPGAs in image processing is the
optical character recognition (OCR) algorithm developed by Tse, Leung and Cheng
[TseLeung93].
This was
used in Chinese character recognition and achieved speedups of three
orders of magnitude when compared with contemporaneous personal
computer (PC) systems.
An application using multimode reconfigurability for telescope
image data acquisition and processing is described in
[Shand95].
The SPLASH
[GokhaleHolmes91]
system was applied to video-rate
image processing by Abbott and Athanas
[AbbottAthanas94].
They
implemented a number of image processing and vision algorithms
including the Hough transform and pyramid generation. Their approach
is described in more detail in
[AthanasAbbott94]
and
[AthanasAbbott95].
Real-time vision is also addressed by Quenot and
Kraljic's ``Data-Flow Functional Computer''
[QuenotKraljic94].
This machine
generates logic from algorithms written in a functional language and
its effectiveness is demonstrated with several algorithms including
connected component labelling, nonlinear filtering and coloured
object tracking. A Canny edge detector was implemented in
[Luk94].
A
high performance scalable convolution system is described in
[RathaJain95].
This operates at video rates and is suitable for
computer vision applications.
A system called MORRPH was developed in 1995 and is specifically
aimed at realtime image processing problems
[DrayerKing95]
[MoranMartinez95].
It allows easy addition of external
special-purpose support chips matched to the needs of an application.
It was used mainly as a preprocessor to more complex algorithms run
on a host PC.
Villasenor, Schoner, Chia and Zapata describe a system for
automatic target recognition in
[VillasenorSchoner96].
They take
advantage of the reconfigurability to generate target-specific
configurations and find that the application maps very well to FPGAs.
Culbertson, Amerson, Carter, Kuekes and Snider describe an algorithm
on the Teramac system for automatic artery extraction in medical
images in
[CulbertsonAmerson96].
Various other applications traditionally attacked by researchers
in the field of custom computing machines have also been investigated
using FPGAs. Cellular automata systems have been developed with some
success
[BouazzaChampeau91]
[HowardTaylor92],
in the latter case using
the Plessey ERA FPGA.
Classic computational problems such as the many-bodied problem
and the travelling salesman problem have been implemented on
reconfigurable logic. The many-bodied problem is attacked by Monaghan
and Noakes in
[MonaghanNoakes92],
specifically as a cheaper
alternative to supercomputers. The SPLASH2 system is applied to the
travelling salesman problem in
[GrahamNelson95]
using a genetic
algorithm approach. This achieved an order of magnitude speedup over
high-end workstations. A detailed performance analysis of the system
in comparison to workstations is provided in
[GrahamNelson96].
Genetic algorithms are also implemented on FPGAs and applied to the
chip partitioning problem in
[SitkoffWazlowski95].
DES encryption is implemented using Xilinx devices
in [TseYuk93].
Fast RSA encryption is implemented in
[ShandBertin91]
and [ShandVuillemin93]
using the Perle system
[VuilleminBertin96],
and they achieve order of magnitude improvements over competing
technologies.
Originally developed in 1989
[BertinRoncin89],
the Perle system
was one of the first and best-developed systems. It used Xilinx FPGAs
and has been applied to a wide range of problems including computer
arithmetic, cryptography, error correction, image analysis, stereo
vision, video compression, sound synthesis, neural networks,
high-energy physics, thermodynamics, biology and astronomy, often
with groundbreaking results. These results are summarised in
[VuilleminBertin96].
Variable word length arithmetic is studied in
[ShandBertin91]
and
[Daumas94].
Perle-1 is used to evaluate polynomials at high speed in
[Ercegovac95]
using a most-significant-digit-first approach.
A system specifically designed to solve Monte-Carlo problems was
developed Cowen and Monaghan
[CowenMonaghan94].
This system achieved
order of magnitude speedups over Digital Signal Processing (DSP) hardware.
A promising area for configurable hardware is in computer
graphics. Graphics algorithms which are capable of operating at frame
rates are much easier to realise with configurable hardware than
general purpose processors.
An early feasibility study was done by Singh and Bellec in which
they investigated a number of FPGA graphics architectures
[SinghBellec94].
They compared reconfigurable graphics hardware with
custom hardware and general purpose processors.
Run-time reconfiguration was applied by Schoner, Jones and
Villasenor to video coding. Their system used a number of passes each
of which involved a different configuration.
Volume visualisation is a popular scientific application for
graphically viewing data which requires powerful realtime processing.
Dao, Cook and Silver presented a volume rendering system using a ray
casting approach as a coprocessor on PC hardware
[DaoCook95].
The
Teramac system was applied to volume visalisation by Culbertson,
Amerson, Carter, Kuekes and Snider
[CulbertsonAmerson96].
They used
parameterised designs to improve the scalability of their system and
improved their tools to operate better with this mode of working.
Another area with traditional appeal for FPGAs is in logic
emulation. Since a reconfigurable logic array is the closest thing to
the target application which offers reprogrammability their utility in
the area is obvious. Lysaght had students using FPGAs to do
experiments in hardware multitasking
[Lysaght91].
Babb and Terrier proposed ``virtual wires'' which were in fact a form of
multiplexed interconnect as a way of overcoming the I/O limitations of
logic emulators
[BabbTessier93].
Rapid logic prototyping with FPGAs was covered in
[Spie95].
For completeness, this section covers some other applications
that reconfigurable logic has been applied to. Camerota and Rosenberg
detailed ways of using Atmel FPGAs for data acquisition
[CamerotaRosenberg93].
Reconfigurable logic has also been used in fuzzy logic controllers
[SchmitThomas95],
telescope control
[ShandWei95]
particle accelerator control
[MollVuillemin95]
and infra-red missile warning applications
[Box94].
Heeb and Pfister described a workstation fitted with an FPGA
coporocessor as standard
[HeebPfister92].
The pioneers from the Perle project applied their system to a wide
variety of applications including
arithmetic, cryptography, error correction, image analysis, stereo
vision, video compression, sound synthesis, neural networks,
high-energy physics, thermodynamics, biology and astronomy
[VuilleminBertin96].
Most programming systems for FPGAs have grown out of older MPGA
systems. This means that not only are the systems often not
particularly well adapted to FPGAs, they are very poorly suited to
dynamically reconfigurable logic arrays.
Traditionally, the software has viewed PGAs as a collection of
gates and the methods derived from this view persist today. While
FPGAs were originally designed to replace large numbers of gates this
is not necessarily the most efficient way of using them. The
optimisation algorithms used in many packages attempt to factor logic
heavily, resulting in large numbers of small gates. FPGA architectures
with complicated cells such as Xilinx are poorly suited to this method
as they can express far more complicated functions using their LUTs.
Added to this are the fundamental differences in the
implementation of FPGAs. Issues related to the slow interconnect in
FPGAs have a substantial effect on the optimal placement and routing,
yet these factors are largely ignored by existing software.
At this low-level end of the problem more work clearly remains to
be done to improve the mapping of circuits on to arrays. This remains
outside the scope of this thesis however.
An interesting trend which corresponds to the advent of
reconfigurable logic array processors is an increase in interest in
programming logic arrays with higher-level languages. The influx of
people with a software background rather than a hardware background
bring with it a different attitude - users want to quickly and easily
write and test code rather than going through a lengthy hardware
design process each time they write a program. This new attitude has
resulted in much interesting work in mapping algorithms to hardware
which are covered in section 2.4.3.
For any modern computer system to be usable it must have a
suitable programming system. This section reviews the work to date.
Unfortunately it was not possible within the contraints of the
project to develop a programming system for Switchblade so this
remains a challenge for the future.
Reconfigurable logic arrays can be considered a type of processor
uniquely well suited to parallel processing. One traditional problem
of parallel program is that the problem must be cleverly divided
between the available processing resources. Often these resources are
quite coarse-grained and not in a form convenient for the algorithm.
In addition to this the algorithm's needs might change during the
course of its execution and yet the resources of a conventional
multiprocessor computer will not.
A great deal of effort in parallel programming goes into adapting
the algorithm to the available architecture. With reconfigurable
logic this kind of problem ceases to be an issue - the logic array can
be programmed to offer exactly the level of parallelism and fine- or
coarse-grainedness required for the situation. Not surprisingly the
field has attracted the attention of a number of researchers in
parallel programming.
A pioneer of the concept of reconfigurable computing, Estrin
described the advantages of reconfigurable architectures to parallel
programming in his paper ``Parallel Processing in a Restructurable
Computer System''
[EstrinBussell63].
Although predating FPGAs by many
years the ideas described in this paper are equally valid today.
More recently research into fine-grained parallelism has led
parallel programming researchers to investigate programmable logic. A
system called ``ARMEN'' based on Transputers and FPGAs provided a
balance of coarse-grained MIMD (multiple instruction, multiple data)
parallelism from its general purpose
processors and fine-grained parallelism from the FPGAs
[RaimbaultLavenier93].
Each node in the system has access to a ring
of FPGAs, as well as a set of global special-purpose coprocessors
[ChampeauPape94].
The synthesis system used to tranform parallel
algorithms into object code is described in
[DhaussyFilloque94].
The ``SPACE'' architecture (Scalable Parallel Architecture for
Concurrency Experiments) is another architecture aimed at
investigating parallelism
[ShawMilne92].
The benefits of fine grained
parallelism were investigated on this system with a particular bent
towards highly parallel applications with a high degree of local
communication - such as road traffic systems
[MilneCockshott93].
Bolotski, DeHon and Knight created a high-level model to unify
logic arrays with other computational devices
[BolotskiDehon94]
[Dehon96a].
Using this unified ``RP-space'' model they created
DPGA, a logic array architecture which attempted to combine the best
features of FPGAs and SIMD (single instruction multiple data) processors. The important architectural
feature introduced here was the use of multiple configuration stores.
This unified theory of computing was later applied to create TSFPGA (which
introduced multiplexed interconnect) and MATRIX, a more conventional
multiprocessor architecture with the ability to assign instruction
resources at runtime
[MirskyDehon96].
These are all described in
[Dehon96a].
Some remarkable early research into reconfigurable computing was
done by Estrin in the early 1960s, long before FPGAs appeared
[Estrin60]
[EstrinBussell63].
Estrin and his collaborators predicted
that performance advantages could be obtained for a variety of
problems using reconfigurable architectures. The levels of
integration available prohibited the complexity achievable now, so
much less ambitious techniques such as manual reconfiguration were
considered a good option at the time.
The team's theoretical work identified the key features of the
technology and the problems which required solution - such as the
circuitry vs. software partitioning problem
[EstrinTurn63]. They
created a partitioning system which used a successive approximation
approach and heuristics. The system performed well on the problems
which they set it, but no further work was done in the field until
the late 80s and the advent of FPGAs.
At this point a number of groups rediscovered the idea of
reconfigurable computing fairly simultaneously. Wolfe and Shen
described ``Flexible Processors''
[WolfeShen88]
based on Xilinx FPGAs.
Gray and Kean proposed ``Configurable Hardware''
[GrayKean89],
describing what was to become their ``CAL'' architecture
[KeanGray90].
Bertin and his associates developed an abstraction called
``Programmable Active Memories'' with a physical implementation based
on Xilinx FPGAs - their ``Perle''
system. Perle is a very well developed system, consisting of
a virtual machine executing on a conventional processor which
reconfigures the logic array coprocessor as required. Many
applications have been developed for various versions of Perle, with
very impressive results
[BertinRoncin89]
[BertinRoncinVuillemin89]
[ShandBertin91]
[BertinRoncin93]
[Touati93]
[BertinTouati94]
[MollVuillemin95]
[Ercegovac95]
[VuilleminBertin96].
The Perle researchers claim that as VLSI (``Very Large Scale
Integration'') technology has improved
the performance advantage of
reconfigurable processors over conventional processors in many
applications has increased. Vuillemin and Bertin applied Noyce's law
to show how this performance gap will continue to increase over time
[VuilleminBertin96].
Formal verification of reconfigurable modules became possible with
Shaw and Milne's work which involved translating ``Occam'' language
programs into asynchronous modules and using the ``Circal'' process
algebra to formally verify them
[ShawMilne92]
[CockshottShaw92].
Work towards a unified approach to describing hardware and
software has made some progress. Some software tools which provide
various levels of transparency between the two modes are described in
the next section. A theoretical model and an accompanying set of
tools for unifying the two media are described in
[Page96]. Singh
and Hogg developed a system which goes further by describing run-time
reconfigurable algorithms in terms of partial evaluation, though this
system seemed to be limited in its scope
[SinghHogg96].
Handcrafted software tools for reconfigurable logic systems
abound in the research environment and yet so far there is little in
the way of generally available tools. As yet, the field is young
enough that there is a great deal of work to be done on some
fundamental aspects of tool creation. High-level language to logic
compilation is in its infancy. The partitioning of an algorithm into
sections some of which are executed in hardware and some of which run
in software is also very much a research area, particularly with
regard to run-time reconfiguration.
A number of researchers have concentrated on the conventional
tools available to them, creating applications using hardware
description languages such as VHDL [KempaJung92] [ArnoldBuell93]
[Singh95] and in some cases adapting this approach to the new medium
[GokhaleKopser90].
The desire to program these systems using more convenient and
familiar tools resulted in a rash of high-level language to FPGA
compilers, particularly in the popular language C
[Athanas92]
[Arnold93]
[GuccioneGonzalez93]
[Athanas93]
[AgarwalWazlowski94]
[WoForward94]
[AgarwalWazlowski94]
[WoForward94]
[JantschEllervee94]
[Gokhale94]
[Galloway95]
[IseliSanchez95]
[ClarkHutchings96].
Others have chosen
to use their own languages and systems which may be better suited to
defining a logic algorithm; Occam
[PageLuk91]
[CockshottShaw92]
[ShawMilne92],
Ruby
[LukLok93]
[Luk95]
[GuoLuk95],
and Handel
[SpiveyPage93].
These tools provided the ability to create
applications much more quickly than their VHDL equivalents, although
with some considerable expense in their efficiency of utilisation of
the logic resource.
Simultaneously, many researchers have attacked the partitioning
problem - in many cases as part of a high-level language to FPGA
synthesis system
[Athanas93]
[ThomaeVandenbout92]
[LukLok93]
[WoForward94]
[GokhaleMarks95].
Other related synthesis work includes
[HartensteinHirschbiel91]
[WuPerkowski92].
Some tools are specifically geared at rapid creation of a subclass
of configurations. For example tools to enable rapid prototyping of
DSPs
[GebotysGebotys94],
customised formats for fast floating point
units
[ShiraziWalters95]
or custom general-purpose processors,
tailored to a specific task
[RazdanSmith94]
[Luk95].
Other tools which have been created include simulation tools for
runtime reconfiguration
[LysaghtStockwood95],
prototyping/validation
tools for embedded systems
[HerpelHeld94],
and C++ libraries for
simulation and design of reconfigurable logic applications
[Touati93].
One interesting optimisation which is possible with reconfigurable
logic is the ability to reconfigure the logic array at runtime. This
has become an active research area, with many applications having
exploited this feature despite the poor support for it in currently
available technology.
This poor support comes down to one essential problem - how to
reconfigure a device to its new state as rapidly as possible. Many
current-generation devices are configured bit-serially (which is
slow), only the entire device can only be reconfigured (no partial
reconfiguration is possible) and the device is offline for the
duration of the reconfiguration process. These factors combine to make
runtime reconfiguration slow and inconvenient.
Four features which may aid an architecture in efficient run-time
reconfiguration
[Saleeba93] are:
Early experiments using run-time reconfiguration include neural
networks
[Eldredge93]
[DaalenJeavons93]
[EldredgeHutchings94]
[EldredgeHutchings94b]
[LysaghtStockwood94]
[EldredgeHutchings96],
graphics and image processing
[RossVellacott93]
[SinghBellec94]
[SchonerJones95]
[Shand95]
and database searching
[LemoineMerceron95]
[GuntherMilne96].
Other uses include hardware multitasking
[Lysaght91],
adaptable bit-length arithmetic
[Daumas94]
and a general purpose processor with an instruction set that varies
during execution
[WirthlinHutchings95].
These experiments indicate that in some applications there is
substantial gain to be had from run-time reconfiguration.
Self-reconfiguration is the ability of a reconfigurable logic
array processor to control its own reconfiguration as described by
Saleeba
[Saleeba93]
and French
[FrenchTaylor93].
These early papers proposed simple bus interfaces
to the configuration store. Self-reconfiguration interfaces are
explored in more detail in this thesis.
Partial reconfiguration is an important feature for run-time
reconfigurable systems as it makes loading much less expensive
as described by Saleeba
[Saleeba93].
Lysaght and Dunlop discuss it and
the concept of ``Logic Caching'' in
[LysaghtDunlop93].
The Atmel ``Cache
Logic'' Hadley and Hutchings discussed the benefits of partial
reconfiguration with respect to their RRANN neural network system
[HadleyHutchings95]
[HadleyHutchings95b]
[Hadley95],
and found that in this application a doubling of speed could be
achieved. With Wirthlin, Hutchings worked on a dynamic instruction set
computer which used partial reconfiguration to change the instruction
set at run-time
[WirthlinHutchings95b].
DeHon and his associates have also worked on improving
reconfiguration speed. Their DPGA and TSFPGA architectures provide
both partial reconfiguration through a RAM-style bus interface and
multiple configuration stores to allow instantaneous switching between
configurations
[Dehon94]
[Tau95]
[Dehon96a].
This also provides a basis
for background loading as one context can be operating while another
is being loaded through the partial reconfiguration interface.
A recent architecture offering partial reconfiguration is the
Xilinx XC6200 series of FPGAs. These provide a bus-style host processor
interface which can be used to randomly access the configuration store
and hence partially reconfigure the device.
The ability to reconfigure a processing element at runtime leads
to the question of how to decide what configurations to load into
configuration store at what times. This problem of ``temporal
partitioning'' is addressed in
[GokhaleMarks95].
A rather more simplistic view is taken in
[HastieCliff90],
where programs are
considered as a set of ``hardware subroutines'' to be loaded on demand.
Another system which uses the idea of caching configurations is
that of Wirthlin and Hutchings
[WirthlinHutchings96].
They describe
how the caching system reduced the amount of time required to
run-time reconfigure their logic array as reconfigurations were often
re-used. In a more general discussion they discuss the division
between compile-time and run-time reconfiguration
[HutchingsWirthlin95].
They propose run-time constant propagation as a method of obtaining
further speed increases in
[WirthlinHutchings97a],
a similar idea to the ``data-folding'' scheme described in
[Foulk93].
The idea of multiple configuration stores which can be
selected between by a broadcast signal was introduced by Snyder
[Snyder82].
This was not however in the context of a logic array, rather it was
used to switch the programmable interconnect of an array of
higher-level microprocessors.
Partial evaluation is a technique which can be used to defer
computation until it is strictly required. Singh and Hogg
investigated a field of problems which can be expressed using partial
evaluation in such a way as to map each part to a configuration -
leading directly to a dynamically reconfigurable version of the
algorithm [SinghHogg96].
Little research has been done so far into using multiplexing as a
means of reducing the expense of routing signals over long distances
on logic arrays. Babb and Tessier investigated the use of multiplexing
to overcome I/O limitations on their logic emulator
[BabbTessier93]
but didn't seek to extend this work to reducing internal routing
expense. Bit serial circuit construction methods were investigated by
Isshiki
[Isshiki96]
but again he didn't apply his methods to logic array
architecture.
Simultaneous with the multiplexed routing system described in this
thesis, DeHon was developing TSFPGA - a logic array architecture with
time-switched internal routing
[Dehon96a].
TSFPGA uses subarrays of logic blocks each of which has time switched
input registers. The subarray is arranged as a set of logic blocks
which share input lines and each LUT output is taken as an input to a
crossbar. Crossbar outputs are sent to neighbouring subarrays both
horizontally and vertically. TSFPGA has no facility for
fast non-multiplexed communication between subarrays.
Figure 2.1 - Partial circuit of a 22V10 PAL
2.1.1 FPGAs
2.1.2 FPGAs in electronic design
2.1.3 FPGA programming technology
Figure [2.2 - "An antifuse"]
Figure [2.3 - "A programmable interconnect point"]
2.1.4 Commercial FPGA architectures
2.1.4.1 Xilinx 4000
Figure [2.4 - "Xilinx XC4000 interconnect"]
Figure [2.5 - "Circuit of a Xilinx XC4000 cell" (from www.xilinx.com)]
2.1.4.2 Xilinx 6200 series
Figure 2.6 - Xilinx XC6200 random access programming interface
Figure 2.7 - A block of Xilinx XC6200 cells
Figure 2.8 - Xilinx XC6200 nearest-neighbour communication
Figure 2.9 - Xilinx XC6200 cell logic
2.1.4.3 Research on FPGAs
Figure 2.10 - Altera FLEX architecture
Figure 2.11 - Triptych routing architecture
Figure 2.11 - Triptych cell architecture
Figure 2.13 - Atmel cell circuit
Figure 2.14 - Atmel interconnection scheme
2.2 Extended general purpose processors
2.3 Custom computing machines
2.3.1 Configurable computing devices
2.3.2 Systems
2.3.3 Applications
2.3.3.1 Database searching and sequence comparison
2.3.3.2 Neural networks
2.3.3.3 Image processing
2.3.3.4 Custom computing machines
2.3.3.5 Graphics
2.3.3.6 Logic emulation
2.3.3.7 Other applications
2.4 FPGA programming systems
2.4.1 Parallel programming
2.4.2 Theory
2.4.3 Software tools
2.5 Dynamic reconfiguration
2.5.1 Multiplexing