Figure[5.4]
shows three architectures which resulted from separate
runs using different circuits to prime the parameters used in
evolving the system. Each of these architectures was found to be
reasonably effective for a wider domain of circuit types and despite
the very significant differences in their overall design the results
showed no clear winner. Not surprisingly, each was more successful in
dealing with circuits which had similar characteristics to the
circuits which were used to parameterise the original design.
Priming the system with circuits which do a large amount of
longer-distance routing resulted in the architecture shown in
figure[5.4]b.
This architecture doesn't vary a great deal from the ``best''
architecture in essentials, but it has a slightly higher emphasis on
second-level routing.
Figure[5.4]c
results from reducing the cost of routing via other
dedicated cells to an unreasonably low level. Note how multiplexer 1
has taken very well to the ``hinted'' symmetrical layout, though
multiplexers 2 and 3 are still relatively asymmetrical. The big
difference is that the second level multiplexer has been stripped of
all its external inputs and no second level routing is needed at all
- connections to adjacent cells were adequate and in the rare case
where extra routing is necessary it can be performed using extra
cells without excessive cost.
It is clear that there is no one great answer to the need for
efficient configurable logic array architectures. Certain
architectures favour certain styles of circuit. Architectures which
work well with a wide variety of circuits tend to be more expensive
than architectures which are suited to a specific style of circuit.
Experimental results demonstrate that a wide variety of architectures
can fill this general role quite well, with no clear winner emerging
- at least with the architectural options made available to the
simulated system.
The most interesting effect demonstrated by the genetic algorithm
approach to logic array design was in the gains which can be made
from asymmetrical design of cells. Symmetrical connections are
inherently poor at using the die space efficiently as they tend to
duplicate functionality across the cell's inputs. Since there is no
need to send the same signal to multiple inputs at once, it is
possible to eliminate the more redundant combinations of routing
options. This in turn means that each input tends to have completely
different routing abilities - hence the asymmetry.
Genetic algorithms proved to be very useful in creating the
design. Experimentation with new architectural ideas in this
environment quickly demonstrated their worth or lack thereof. Perhaps
the most unsatisfactory aspect of this potentially useful tool was
the tendency to randomness in the architectures which it created.
While with some considerable effort a relatively random design can be
converted to an equivalent one which is less chaotic it is not clear
how to cause an evolutionary system to naturally pick solutions which
are more regular by nature. Mind you it's not clear that more regular
solutions are necessarily advantageous, but they are at least more
aesthetically pleasing than highly random ones.
The final architecture which was developed for Switchblade
is shown in
figure[ 5.3]a.
This architecture was created using the
genetic algorithm described above, in combination with a cost
function which reflected the features desired for the
Switchblade system -
- Multiple configuration stores
- Quick switching between configurations
- Efficient handling of a variety of circuit types
- Cell functionality
- Routing
The architecture is based around a 64-entry lookup table
(figure[ 5.3]). This
provides six bits of input which can be used to select the one output
bit. Under normal operation three of the six table inputs are devoted
to selecting one of eight configurations. The same three bits are used
to select one of eight configurations for all the remaining
configurable options in the cell, although this is not explicitly
shown in the diagram. The three configuration bits are transmitted to
cells in a region from a nearby shared-signal distribution point.
The remaining three inputs of each LUT are taken from three input
multiplexers. These route input signals from a number of possible
sources to each of the inputs. The output of the LUT can be either
directed first through a D-latch or alternatively taken directly to
the primary output of the cell.
Two other shared signals are available to cells. One is a global
reset signal which returns the D-latch to a configuration determined
state, and the other is a common clock signal which can be used to
strobe the D-latch.
A fourth input multiplexer is used to provide an extra signal in
cases where it may be useful. This fourth input is directed to the
secondary output of the cell, providing extra connectivity which can
be routed ``over'' logic cells without affecting their operation. This
is intended to be used as a routing scheme which is mostly
independent of the operation of the host cell. Signals can be routed
at random from secondary output to secondary output over a reasonable
distance, independent of the operation of the cells which the signal
passes through.
The fourth input can also be used as the clock source to the
D-latch if it is in use, or in cases where a three-input LUT is
inadequate it can be used as an extra LUT input.
The ability to trade off the number of available configurations
against LUT width is not a feature which is automatically available
to user programs. The security model ensures that user programs are
not able to use more than their allotted single configuration unless
they successfully request two adjacent configurations from the
operating system.
Another use of the fourth input multiplexer is to provide a data
input to the LUT. It is possible to configure the LUT as a single bit
wide addressable register, much in the same way that Xilinx LCAs
allow
[Xilinx94].
Data is latched into the RAM using a data read
strobe signal which is taken from the local shared-clock signal. The
remaining three multiplexer inputs act as address lines, providing
eight bits of data stored per configuration.
If more storage is required this can be achieved through the use
of multiple configurations. Address bits sent to the local shared
signal distribution point are broadcast to nearby cells as
configuration selection bits, allowing up to sixty four bits to be
accessed per cell if all the configurations are used.
The strongest influence of the genetic algorithm used in the
design of Switchblade can be seen in the optimisation of
interconnect. Primed with a set of statistics on routing, the genetic
algorithm favoured an architecture which permitted direct connections
from any three neighbouring cells or indirect connections to more
distant cells.
This design favours the primary use of the cell as a three-input
LUT with orthogonal inputs - the multiplexers are optimised on the
assumption that any combination of one, two or three input functions
may be required from any combination of available signal sources. It
is not important which input each signal is sent to however,
as the LUT may be configured appropriately for any arrangement of
inputs. This makes it possible to eliminate a substantial number of
options from the multiplexers, resulting in a set of multiplexers
roughly half the size of those which would otherwise be necessary.
The genetic algorithm's choice of three-input LUTs in the cells
was no coincidence - the general architecture which the sample
routing data was taken from was a three-input architecture so it is
natural that the genetic algorithm would optimise for this
arrangement. It was realised in advance that this bias would be
apparent but for efficiency of the cost function it was decided to
limit the number of inputs to three. The choice of three was quite
deliberate - this number has been empirically found to work well with
a good variety of circuits
[EbelingBoriello91]
and kept the size of
expensive LUT tables relatively small while simultaneously allowing a
number of configurations to be selected from. Note that the
architecture allows the fourth input to be also used as a LUT input
if necessary. This halves the number of available configurations,
however.
Compared with some other logic array designs, the large input
multiplexers initially seem like an expensive routing option. There
are however two factors which weigh in favour of this scheme.
Firstly, inputs are only connected locally. This means that a huge
amount of die space is saved from routing costs already. The large
crossbar system used in many logic arrays is replaced, and the size
of the multiplexers is small relative to that of the sea of
interconnect. Since connections are all to neighbouring cells there
is no need to have large numbers of long lines surrounding all the
cells. This style of interconnection is popular in architectures for
configurable computing - an example is the Xilinx 6200 series (which
was announced after this architecture was developed).
The second reason why multiplexers are not so great an expense is
due to the multiple-configuration feature. Multiple configurations
necessarily mean a large LUT. The size of the LUT was taken into
consideration in the cost function and compared to it the multiplexers
are relatively small.
One final, compelling reason why multiplexers were chosen instead
of a conventional crossbar arrangement is that input multiplexers
cannot be programmed in such a way as to damage the device. Since
rogue circuits are not only possible but likely in this
self-reconfiguring architecture, the device needs to be inherently
solid enough to resist damage through misconfiguration.
The inputs available to each multiplexer are shown in
figure[5.4]a
and
table[ 5.1].
Both the first multiplexer and the fourth multiplexer
have their primary inputs directly to the north, south, east and west.
In the case of the first multiplexer this is directly to the outputs
of neighbouring cells and in the case of the fourth multiplexer it is
to the outputs of the fourth multiplexers of neighbouring cells -
providing a separate routing layer which passes signals around
independent of the underlying logic cells.
| |
lower | |
upper | |
other |
| |
NW | N | NE | E |
SE | S | SW | W | |
NW | N | NE | E |
SE | S | SW | W | |
UP | B1 | B2 |
| mux 1 | |
| X | | X |
| X | | X | |
X | X | X | |
X | | X | X | |
X | X | X |
| mux 2 | |
X | X | X | X |
X | X | X | X | |
X | | X | |
| | | | |
X | X | |
| mux 3 | |
X | X | | X |
X | X | | X | |
X | | | X |
| X | X | | |
X | | X |
| level 2 mux | |
| | X | |
| | X | | |
| X | | X |
| X | | X | |
| | |
Table 5.1 - Switchblade input multiplexer assignments
The three LUT input multiplexers each have some inputs devoted to
obtaining signals from the upper routing layer. Note that not as many
inputs are taken from the upper level though - only twelve as opposed
to eighteen inputs. This is due in part to the lower frequency of
signals being required from the upper level, and in part to the
availability of the ``UP'' signal. The ``UP'' signal derives from the
output of the fourth multiplexer, providing an opportunity to use its
additional routing resources if the upper layer is not being
otherwise used. Note that the genetic algorithm has taken this
resource into account and has eliminated some inputs with this kind of
routing in mind.
The ``B1'' and ``B2'' inputs are two inputs from the local broadcast
mechanism. These come directly from the local routing centres
described later, and often contain data transmitted over a relatively
long distance using the medium distance interconnect system also
described later.
The architectural features described to this point are mostly
fairly conventional. While the exact circuit of the logic block is
new and the addition of a separate routing layer is slightly
different from other schemes, the main point of interest in the
design so far is the use of a genetic algorithm as an aid to design.
The remainder of this chapter describes features which are
specifically aimed at turning a plain logic array into a device
suitable for a new type of configurable computing system.
To date designs for configurable computing have been heavily
based on standard logic arrays. The purpose of Switchblade is
to take a step forward from this very basic and specialised type of
computing engine and develop a more sophisticated architecture with
support for a much wider spectrum of use, in much the same way that a
modern multitasking computer system transcends a microcontroller.
The remainder of the chapter details a number of advances aimed at
turning logic arrays into a computing device better suited for general
purpose computing.
Multiple configuration stores are an architectural feature which
has already been discussed in some detail in the previous chapters.
Switchblade provides built-in support for eight configuration
stores. If logic blocks are configured to take three inputs all eight
configurations can be used. An alternate setup is available where
logic blocks can take four inputs but only four configurations are
available.
Switchblade uses a relatively conservative eight
configurations. This was chosen as a good compromise between keeping
logic block size down and providing extra functionality per block.
Research shows that eight configurations is more than enough for most
applications anyway
[Dehon94]
[Dehon96].
More importantly as a
multitasking system Switchblade is able to make better use of
the multiple configuration stores which are available to it. Since
multiple, large concurrent tasks are likely to be alternating in
their access to the logic array it is likely that much of the array
will be alternating between several different configurations when the
system is heavily loaded. This is the scenario where multiple
configuration stores really excel. Chapter 6 contains some simulations
which show the value of multiple configuration stores in a
multitasking system.
Most systems which support multiple configuration stores also
suffer from a lack of flexibility in the way that their multiple
configurations can be switched - often the entire array must be
switched simultaneously. Switchblade is capable of switching
configurations in a very flexible, per-logic-block manner. This
allows maximum use to be made of the multiple-configuration feature -
different parts of the same circuit can easily be switching
configurations independently.
Logic cells can operate in three different modes. The LUT table
lookup mode and the addressable register mode were discussed in
section 5.1.7. The remaining mode allows a cell to be converted into
a centre for the transmission and distribution of logic signals.
These signals are of two types - shared signals for local broadcast
and individual signals for medium distance interconnect.
Some sections of logic will require a high density of such
signals, and this will naturally result in such sections having
larger numbers of logic cells configured as signal distribution
cells. Conversely, highly regular applications will tend to have a
larger proportion of local interconnect and hence a lower density of
signal distribution cells. In this way the amount of die area devoted
to routing resources can be directly adjusted to circuit needs. Unlike
most existing architectures where logic and routing resources are
predetermined by the logic array architect and in any given
application tend to be either underutilised or at a premium, this
system allows routing resources to be configured dynamically along
with the configuration for each individual circuit.
Shared signals are a form of localised broadcast signal. A number
of these signals are distributed from each signal distribution cell.
An example of these signals are the three configuration bits which
are used to select which part of each cell's configuration store is
used. Two other such signals relate to the D-latch - a ``reset'' signal
used to set the D-latches to a predetermined state (set in the
configuration) and a clock signal used to trigger the latch. Note
that the broadcast clock signal is not necessarily the clock used by
all cells - individual cells may be configured to take their clocks
from other sources.
Another localised broadcast signal is the data read strobe. This
signal is used by cells configured as RAM units. When strobed it
causes a new value to be latched into RAM. Unlike the clock signal
this function can only be performed using the localised broadcast
system due to the limited number of multiplexed inputs available on
each cell.
Signals can be injected into the broadcast system using a cell
configured for signal distribution. In the simple case this cell can
take one or two inputs on its input multiplexers and place them on
selected broadcast lines to be sent to the surrounding circuit area.
The LUT bits which would otherwise be used for table lookup are
instead used to select which destination signal each of the inputs is
sent to.
Broadcast signals are distributed to the three neighbouring cells
directly. Each group of four cells is similar, but the layout of each
of the cells is flipped such that the circuitry involved with
broadcast signals meets at the conjunction of the four cells. This
reduces the line length of the interconnect to a tiny fraction of
what would otherwise be necessary.
The point at the conjunction of the cells is not only used to
communicate between the cells - it also has specialised routing
circuitry known as a ``local routing center'' (see
figure[ 5.7]).
In a bid to reduce the cost of routing a large number of signals to
every cell these signals are sent in multiplexed form between each
group of four cells, requiring relatively few lines. This adds up to
a substantial saving in die real estate - as well as the signals
previously mentioned all the configuration layer signals are also
transmitted this way.
Figure 5.7 - Sixteen cells and their four routing centres
The multiplexed signals are by default distributed in a fractal
pattern designed to minimise signal skew. This is complicated by the
fact that broadcast signals can be limited to specific areas - for
instance a single task may use a limited area of the logic array and
when its configuration is switched out only the given task should be
switched out - not other, unrelated tasks.
Broadcast signals are routable in a limited way. Each local
routing center can elect to source its broadcast signals from any of
the north, south, east or west directions, or from a logic cell
configured for signal distribution. Moreover, at each junction of
cells independent of whether a local routing center exists at that
point it is possible to route the signal from any of the four
standard directions. This allows the broadcast signals to be routed
flexibly.
The cost of this routing is less than might otherwise be
supposed. Since these signals are multiplexed on a single line, the
routing need only be performed for one line rather than lines for all
the component signals. This results in a vast improvement of
efficiency in transmitting these signals - only a fraction of the die
space is used compared to that which would otherwise be required.
Flexible routing of these local broadcast signals offers some
quite powerful behaviour. A circuit can take a set of broadcast
signals, modify some at will, and then distribute this modified
signal set to a special local routing domain configured to its own
needs. This is useful for instance when creating such circuits as
synchronous counters - an appropriate clock and reset signal can be
injected into a section of the logic array covering the counters
only. No other routing resources are necessary to deliver these
signals to all the latches in the counter. An example of fractal
broadcast routing to a restricted area is given in
figure[ 5.8].
There are some limitations on transmitting signals using this
system however. It is not particularly useful as a general
point-to-point communications mechanism since signals tend always to
travel in one direction down the broadcast routing hierarchy towards
the terminating logic cells. While exceptions can be made in this
routing scheme for the convenience of specific cases it isn't
possible to implement situations where lines cross each other, and
most other such situations where routing is non-trivial are also
impossible with the broadcast scheme.
Figure 5.8 - Broadcast signal distribution in an arbitrary area
Signal distribution cells are not used only for distributing
broadcast signals - they can be used for space-efficient transmission
of signals over moderate distances. This mechanism doesn't use the
broadcast interconnect system, instead it uses the normal cell-to-cell
interconnect system in a multiplexed mode. This provides the benefits
of the flexible cell-to-cell routing in combination with the efficient
space use of a multiplexed scheme.
Operation of the system is identical to that of the broadcast
scheme except that the input signal is taken from the cell's fourth
input multiplexer rather than the broadcast system and the resulting
signal is directed to the secondary output rather than to other
cells. It is intended that the separate routing layer provided by
this input and output be used to transmit the multiplexed signal
between cells.
Conceptually a multiplexed link is sent over a number of
signal distribution cells using the normal forms of routing. Each
signal distribution cell can then add data to the link and take
signals from the link. The first cell in the link will add a signal
to it, to be retrieved at some point later at one or perhaps several
locations. Other cells may add other signals to the link at any
point.
Signal transmission in the routing scheme is unidirectional.
This in turn means that signals can only pass along the link in one
direction. Most communication in circuits tends to favour one
direction, so it is likely that signals in the primary direction
would be most efficiently sent on a multiplexed link and the
relatively few return signals may be cheaper to route directly
without multiplexing. Alternatively multiplexing may be used both
ways using two separate multiplexed links, depending on what the
circuit requires.
The nature of Switchblade's linear multiplexed links is
rather different from TSFPGA which was developed independently
[BolotskiDehon94].
TSFPGA is a rather unusual architecture with a quite different form
of interconnect. It is difficult to estimate how these schemes would
compare as they use rather different principles. One potential
advantage of Switchblade's system is that multiplexing need
only be used where it is needed, meaning that connections to
neighbouring cells are very fast. Rather than accepting the space
cost of multiplexing as standard Switchblade encourages the
creation of multiplexed links only where they are of significant
benefit. This encourages the use of the link for a number of
unrelated signals which happen to be travelling in the same general
direction. Other schemes are only really useful for groups of signals
which are sourced and terminated together. It is likely that a link
could travel far beyond where the original signal was delivered, with
new signals being added all the time and the multiplexer slots of
delivered signals being filled with new signals to be sent further
down the line. It could be quite reasonable for such a link to be
routed all over a logic array apparently at random. The logic to the
routing would only become apparent when the sourcing and sinking of
each signal at the correct locations was verified.
An alternative in some situations is to route a transmission link
in a ring, feeding the end of the link back into the start. This has
the advantage that return signals can be fed back to earlier parts of
a circuit. It may or may not be a viable option depending on how
efficient it is to loop the link back in the individual circumstance.
Note that this is not strictly a token ring as no token is passed
between cells - synchronisation is performed using a different method
which is described later.
Each multiplexed line is only capable of carrying a limited
number of signals at one time. Switchblade sets this limit at
sixteen, an arbitrary choice reflecting the tradeoff between
maximising utilisation of the interconnect and the speed and expense
of demultiplexing the signals.
Multiplexed transmission systems naturally have a cost - the
delay between a signal being ready and it being transmitted down the
line. This cost is minimised by the use of a high clock rate in
switching between the transmission channels. The clock used is the
common high-speed clock - which is used by the broadcast signal
system as well. The broadcast system provides a synchronisation
signal which is used to reset the demultiplexers after each set of
signals has been transmitted. This keeps all the signal distribution
cells in synchrony.
The chief advantage of this multiplexed form of signal
transmission over a simpler, non-multiplexed system is that
interconnect can be used much more efficiently. Interconnect is a
very expensive resource in logic arrays, with as much as 80% of
circuit area being devoted to interconnect. Much of this space ends
up unused in any given circuit and much of it is used to route
low-bandwidth signals over long distances.
The multiplexed system described provides an improvement to both
of these problems. Since Switchblade has relatively sparse
non-local routing resources the amount of die area devoted to both
local and non-local routing at this level is appropriately low -
approximately 10% (although another 10% was expended on the
higher-level routing system described in the next chapter). When the
occasion calls for additional routing resources signal distribution
cells can be added which not only provide longer distance routing,
they provide it in an efficient multiplexed form which does not
require extra interconnect resources specifically for this purpose.
Adding signal distribution cells is quite costly - each cell used
for signal distribution cannot be used for logic functions - however
traded off against the high cost of long unmultiplexed interconnect
resources this cost is not excessive. More importantly the signal
distribution cells allow routing resources to be matched to circuit
requirements - resources can be placed exactly where they're needed,
avoiding the problem of conventional systems where interconnect is
relatively unused in most places and can be heavily oversubscribed in
other places. Sections of a circuit where routing resources are at a
premium tend to have placement problems, causing many cells to be
unused or used only for routing purposes. Switchblade's system
allows routing resources to be added flexibly, preventing both the
under-use and the starvation problems.
Figure 5.10 - Interconnect area cost
Figure[ 5.10]
compares the resource expense of routing in Switchblade to the
Xilinx XC4000 series FPGAs. In Switchblade routing between two
adjacent logic cells uses special local interconnect - it doesn't
draw on the general interconnect resources as the Xilinx does. Using
the second-level routing system for interconnect results in an
expense equal to the distance, minus the local connection which can
be made at the final step.
Multiplexed interconnect on the other hand offers a slower but
more efficient method for transmitting signals over longer distances.
In the best case up to sixteen signals can be multiplexed together,
though for the graph line a more realistic average of six has been
assumed. In the real world it is likely that shorter lines will avoid
the inconvenience of multiplexing but longer lines are more likely to
take advantage of the density gains it offers. The ``realistic average
Switchblade expense'' line is based on an estimate of the likely
combination of the two forms of interconnect.
Compared to the Xilinx 4000, Switchblade's interconnect is
slightly more compact even in the worst case and significantly more
compact even on average. In the best case it is vastly more efficient
of array space.
Figure 5.11 - Interconnect propagation delay
Switchblade is not as impressive where it comes to
propagation delays however. Most delays originate from routing
switches. Ignoring the difference in routing switches and array
technology for the purposes of this comparison, it is clear to see
that Xilinx's ``long lines'' have great speed benefits in sending
signals over long distances
(figure[ 5.11]).
Switchblade has a switch at every step of the route, which
introduces significant propagation delays on lines which are routed
over long distances. One possible saving grace here is that many
long-haul signals will be transmitted between contours in an
asynchronous, pipelined fashion by the virtual logic system described
in the next chapter so very long local signals are unlikely to be
generated by the synthesis tools in any case.
The logarithmic curve for the Xilinx device is an estimation.
This is based on the observed frequency of switches in lines of
various lengths in a number of Xilinx layouts.
Each group of four logic array cells is adjacent to a single
local routing centre. These local routing centres have the task of
multiplexing and demultiplexing broadcast signals for the surrounding
logic cells. The die space for each of these blocks is located in a
concavity of the four surrounding logic cells as shown in
figure[ 5.7].
The transmission of broadcast signals uses a very similar system
to that described for signal distribution cells. The global high
speed, low-skew clock signal provides a time base for the
multiplexing/demultiplexing process and a separate global
synchronisation signal ensures that the demultiplexing process is
synchronised between all cells.
A single data line is used to transfer multiplexed data to cells.
It is reserved entirely for system functions such as cell
configuration and global reset signals. This line is routed in a
similar manner to the previously described shared signals which are
used for local broadcast. The system broadcast line reaches each cell
via its adjacent network routing centre and the demultiplexed signals
provide configuration control as well as other circuit-wide useful
signals such as clocks and reset signals.
Each local routing centre is addressable for configuration - this
allows a common programming signal to be broadcast to a large number
of cells and each to respond individually to its configuration
instructions. Note that the system is not limited to this style of
one-cell-at-a-time configuration - by separately routing the system
broadcast line to multiple cells it is possible to configure them all
simultaneously. More commonly it is routed using the same type of
low-skew fractal routing scheme as used by the global clock - a scheme
which is relatively slow but simpler to manage.
A second configuration line is available to user circuits. This
provides user access to the configuration system. The powerful
``self-reconfiguration'' feature described in earlier chapters is made
possible through this line. Using this feature circuits can
dynamically allocate additional circuit space and populate it with
circuits which are generated on the fly as program requirements
dictate. This ability to expand the parallel processing abilities of
a circuit based on dynamic program requirements is unequalled in
current architectures.
Unlike the system configuration line no special routing resources
are provided for the user configuration line. The line is taken from
the user-level circuit and gated into the configuration system. It is
really just a back door into the system configuration logic which
allows user processes to broadcast their own configuration signals on
the system lines.
This user configuration line should not be confused with the user
data line used to distribute shared signals. The user downstream data
line is available for user signals and may be routed in a similar way
to the system configuration line. This is the data line which is used
by shared signal distribution cells. Its functionality was described
earlier, and does not (generally) carry configuration information. In
contrast the user configuration line is not routed, it derives from
the local cell. Just to confuse the issue further, it is quite common
to take signals from the user data line and place them on the user
configuration line. Keeping the operation of these signals separately
configurable allows greater flexibility however.
There are restrictions on what users may configure. A two-level
security system is imposed which limits the scope of user
reconfigurations. User-level tasks may reconfigure their own
user-level circuits, but may not interfere with system tasks or other
user-level tasks. Along with the other configuration signals is a
``user level'' signal which determines if certain types of
reconfiguration are permitted. Most importantly, user level tasks are
prevented from reconfiguring the routing of the system configuration
lines. Were they able to do this they could deny the system control
of their part of the array and attempt attacks on other parts of the
array, virus-style.
Control of user-configurable parts of the array is ensured by
directing a surrounding perimeter of logic to ignore the
user-configurable part (by making sure that no configuration data is
obtained from the user-configurable area) and having the operating
system preconfigure a configuration line routing scheme which is
only accessible from outside the user-configurable section. If a
runaway configuration should occur in this enclosed environment it
cannot harm anything around it and can be easily rendered harmless by
reconfiguration from outside.
Note that unlike some other architectures it is not possible to
reconfigure the system fatally - all configurations are valid although
most will be meaningless. Many other systems (eg. Xilinx) can be
damaged by an incorrect configuration.
Switchblade has multiple configuration stores which can be
switched between by the system or by user processes to a limited
extent. Most programs will desire to isolate from each other
unrelated configurations which reside in different configuration
stores but use the same die area. User-configurable areas could
potentially attack configurations other than their own. This is
prevented by the simple expedient of disallowing user-controlled
configuration switching in areas which are user-configurable.
Configuration switching is normally performed using signal
distribution cells to change the appropriate local broadcast signals.
The modified signals are distributed to other cells, which then
switch to the selected configuration. Setting of the configuration
bits is simply disallowed in areas which have the ``user level''
signal set, preventing user-configurable areas from attacking other
configurations.
Why can we trust configuration switching in user circuits which
are not user-configured, but cannot trust ones which are user
configured? The answer is the compiler - the compiler is a trusted
part of the system which must check circuits to ensure that they
aren't capable of scurrilously changing configurations to ones they
would not normally be allowed access to. The compiler may elect to
use multiple configurations if switching between configurations will
make circuit operation more efficient, but it requires special
cooperation from the OS and the loader to be able to place and load a
configuration which uses configuration switching.
Configurations contain a great density of data relative to that
which is stored in individual cells' latches. Most user-defined
configurations which are run-time configured will also be quite
repetitive and in general will have few customisations, so it's
likely that most of the configuration data will simply be copied from
another source. When copying new configurations to user-defined
circuit areas a great deal of data has to be brought in, and yet the
surrounding array logic blocks don't provide a good density of data
to store these static configurations. This makes it attractive to
store this configuration data in the configuration stores of other
cells. The alternatives would be to bring in the configuration data
from off the processor, which would be very slow, or store it
one-bit-per-cell in the cells' latches, which would be very
inefficient.
To expedite the copying of data from one cell to another the
configuration system permits a ``burst mode'' transfer of data from one
cell to another. A special configuration broadcast signal can be set
to indicate that the next sequence of bits will be a complete set
of configuration data for one configuration of a cell rather than the
usual set of multiplexed broadcast signals. This mode is also the
normal mode used to program cells as it is much quicker than
programming individual configuration bits, which is also possible.
Switchblade is designed as a multitasking, multiprocessing
system. It requires the ability to pre-emptively switch tasks out and
continue them at a later stage. A mechanism is required which can
stop tasks in their tracks so the operating system has a chance to
switch them out in a consistent state. A broadcast ``freeze'' signal
provides exactly this function, latching all cell outputs to their
current state and preventing any further changes in circuit state.
The ability to freeze and reconfigure areas is complicated by the
fact that broadcast signals could conceivably be tampered with,
routed in strange ways or even not be routed to certain parts of the
circuit at all. This is one important reason why the security system
prevents user-configurable areas from tampering with the system
broadcast lines - the system reserves the right to pre-empt any task
at any time with a freeze signal. Non user-configurable areas are less
worrisome as the compiler ensures that they are well behaved in this
respect.
On power-up the logic array must be configured to a consistent
state before it is usable. A central system configuration port
provides a point for a bootstrapping bitstream to be injected which
starts the system in a known state. Since it is assumed that there is
no general purpose coprocessor to perform the configuration for it,
the logic array must perform the majority of the bootstrapping itself.
Bootstrapping firmware in EPROM provides the initial trivial
programming - a state machine which can be used as a simple sequential
processor - and from there the logic array handles its own
configuration.
Bootstrapping the system is more difficult than may initially be
apparent. On power up the configuration of the array may be
completely random, and since it is possible that the routing will not
be set to take input from the central system configuration port it may
not normally be possible to externally configure the system at all.
For this reason a routing override signal is available to the
external configuration system. This signal is multiplexed with all
the other system configuration signals and when received by a routing
centre causes that centre to switch to receiving system configuration
from the source of the override signal. The signal is passed between
local routing centres using the normal routing system. It is not
generated from anywhere within the logic array - it is only intended
for use in the initial configuration process.
This method of setting the initial state of the array has the
advantage of not requiring much extra logic in each cell devoted to
setting the startup state. More importantly it eliminates the need for
a global reset line which would be very expensive.
I/O in Switchblade uses the array's normal connectivity to
the periphery of the array as the prime routing. In the place of logic
cells the periphery hosts I/O cells which are really just simplified
logic cells. Instead of having four inputs, two outputs and a LUT they
have only one input and one output.
Each I/O cell corresponds to one external line which can be
configured as either an input or an output. In input mode the output
of the I/O cell into the logic array directly follows the logic level
present on the external line. When configured as an output the output
tri-state buffers are turned on and drive the line to a logic level
corresponding to the cell's internal input. The tri-state buffers are
current-limited to prevent damage if the direction of the external
connections is misconfigured.
The Switchblade system uses a logic array as the primary
computing device. External to this logic array are I/O components
such as disk drives and terminals as shown in
figure [ 5.12].
Other devices more closely
associated with the computing function are also connected to the
array. This includes RAM and may also include special function units
such as floating-point coprocessors.
Figure 5.12 - System architecture
I/O devices interface directly to the array for direct control
from the processor. Updates to controller hardware can be achieved as
software upgrades as the majority of controller logic is implemented
on the logic array. This need not be overly expensive as much of the
controller logic can be implemented in a single configuration which is
switched in only as necessary. Between uses the array space can be
re-used by other tasks. Fast task switching means that other tasks can
be switched in and out many times a second if necessary.
It would not be desirable to move this much controller logic into
the array in every application. If ASIC I/O controllers are already
available which serve the purpose well they would probably be more
efficient than an array implementation. Since Switchblade
comes from an experimental environment it was decided to allow as
much flexibility as possible and move as much control to the logic
array as possible. This more elegant solution is not necessarily
isolated to prototypes either - in research and development
environments or other experimental environments direct control of
hardware can greatly facilitate development of control hardware.
Storage is an important part of modern computers. Logic arrays
provide huge potential parallel computing power, limited mainly by the
ability to transfer enough data through the system to keep the array
``fed''. This favours the use of fast RAMs. Yet many applications
require large amounts of data to be processed quickly, which calls for
large memory area. On the other hand the multitasking, multiuser nature of
Switchblade implies a need for virtual memory. Clearly a
compromise is required.
The compromise chosen is to provide a limited number of fast RAMs
for high-throughput computation and a large central
virtually-addressed memory pool. No cache memory is used - it is
assumed that frequently-accessed values will be stored in registers
on the array itself.
The fast RAMs are intended to be used as temporary storage in
high speed vector-style computation. A common use would be to store an
initial data set in one RAM, filter it through a computation on the
array to another RAM, and then reverse the process to perform another
computation and leave the result in the first RAM. This process could
continue as long as necessary, processing the data ``coctail shaker''
style.
Fast RAMs are a scarce resource. Only eight are provided in the
prototype Switchblade design, meaning that tasks could well be
stalled waiting for a suitable set of RAMs to become available. RAMs
are not easily context-switched, either. Unlike the quick-switching
multiple-context ability of the logic array the RAMs are single units
which must be painstakingly copied into and out of to change contexts.
An OS feature which alleviates this situation to some extent is
the addition of fencepost registers to the interfacing logic to each
RAM. Note that since the OS is in dynamic logic this is a true OS
feature, not a hardware feature. These fencepost registers allow the
system to dynamically allocate sections of RAM to different tasks and
provide protection between tasks, at the cost of reducing further the
amount of RAM available to each task.
Data for computation must be obtained from somewhere and the
results must be stored somewhere. Generally this data's location is
either dynamic RAM or disk. Modern computer systems usually provide a
virtual memory system which performs both disk caching and storage
abstraction functions. Switchblade is no different. A
conventional memory management unit (MMU) design is used, with an
address translation cache
external to the main logic array. See
figure [ 5.13].
Most other functions of the MMU are
provided by the OS. This minimal-hardware approach is similar to that
taken by early MIPS processors
[KaneHeinrich92],
except that in this case
the OS functions are in dynamic hardware which can be switched in very
rapidly when needed.
Figure 5.13 - External memory system
The Switchblade architecture is based on the assumption
that logic arrays will be large enough to support this kind of general
purpose multi-processing activity. This is not currently the case,
though it is far closer to reality now than when the project began.
An estimate was made that an array size of 64k gates with eight
configuration stores was likely to be reasonable for today's
applications running on Switchblade. This is laid out as a 32
by 32 array of pages, each page containing 8 by 8 logic cells.
Section 6.8 contains a performance assessment of the architecture
using a simulation of ``gcc'', a well-known C compiler. This test
concludes that the array parameters are well chosen. Current logic
arrays are around 16k gates maximum, such as the Xilinx XC6264 and
DPGA
[BolotskiDehon94].
In 1994 Vuillemin used Moore's law to predict that the number of logic
blocks on a single die would increase from around 400 in 1992 to
around 25,000 by 2001
[Vuillemin94].
In reality we're already ahead of this curve, demonstrating the
vitality of improvement in the area. This has been aided in part by
architectural improvements such as a move away from island-style
designs to take the speed of advance beyond that available from
fabrication advances alone. Re-applying Moore's law (which states that
feature size is shrinking by a factor
1/alpha approximately equals 1.25
each year, Switchblade is likely to be viable by the year 2003.
Switchblade provides a logic array architecture which is
similar in basic function to some conventional logic arrays. Where it
differs greatly from conventional logic arrays is in the support
structures it provides for the low-level logic functions.
- Multiplexed medium distance interconnect
- Multiplexing greatly reduces
interconnect costs, while allowing more resources to be flexibly
allocated to interconnect where needed.
Multiple configuration stores
Fast task switching is made possible multiple configuration
stores. They increase the utilisation of the silicon by making it
possible to gain more functionality from logic blocks without
increasing their size to a prohibitive extent.
Routable broadcast system
Routable broadcast allows transmission of signals to widespread
and user-definable areas, reducing the amount of repetitive
interconnect required. Transmission is multiplexed, making it cheap to
distribute a large number of potentially useful signals to a wide
area.
Self-reconfiguration system
The configuration system allows both system-controlled configuration
and user-controlled configuration from any point in the array. There
are no restrictions on the number of independent areas which can be
simultaneously reconfigured. This system offers both high speed and
great flexibility in the way it can be reconfigured.
Security
Low-level security features provide primitives for protection
between tasks. The two level user/system security model provides
enough security support for sophisticated operating system security models to be
built on top of it.
Genetic algorithms
Genetic algorithms provide an interesting alternative to
conventional human methods of computer design. The results obtained by
applying a genetic algorithm to logic array design were surprising and
revealing.
The features that Switchblade's logic architecture
provides are geared at providing a structure for a complete computing
system based around a logic array architecture. This is a significant
advance upon existing systems which either treat logic arrays as
simple hardware devices or assume that they are a coprocessor to a
more complex computing device. The features described here only go
part of the way to realising the goal of a complete logic array
computing device - the next chapter describes the final step;
virtualising the logic array itself.