5. Self-Reconfiguring Logic Array

The Switchblade architecture employs a low-level logic array architecture which improves on existing designs in a number of ways.

5.1 Genetic parameterisation

A bewildering array of options are available to the engineer choosing a logic array. Designing a logic array offers even more options and tradeoffs - interconnection methods, routing flexibility versus space used, the choice of small, simple cells versus larger, functionally sophisticated cells and the decision of where the limitations of the design should lie. These are all difficult decisions to make. Worse, it is often quite unclear at the design stage how such low-level decisions will affect the real applications of the logic array.

Rather than making an ad-hoc decision about which design features would result in the best performance, it was decided to use a genetic optimisation method to find a good general array architecture which would be well suited to a variety of applications. This architecture would need to provide enough flexibility to be able to map circuits efficiently on to the cells, yet this flexibility should not make each cell so large and complex that a simpler but more restrictive design would be more efficient.

Genetic algorithms give us the opportunity to compare a large number of architectures against each other and let the system gradually refine the feature set towards a good design. The initial feature set is completely random - and the architecture is correspondingly poor in performance. After a number of generations some features have been completely eliminated and others have been adjusted to work well in combination with each other. The keystone to this evolutionary approach is the function which assesses how well an architecture performs - the cost function.

5.1.1 Cost function

The feature domain of this experiment was made as broad as possible to enhance the range of possible solutions while still ensuring that the specific architectural features desired for this project were retained. The genetic material of each architecture is capable of expressing a number of design features - the number of inputs to each cell's LUT, where those inputs can be obtained from, separate routing circuits and other features such as the multiplexed routing scheme discussed later. The intention was to see which combination of these features gave the best result.

Figure 5.1 - Cost function

The cost function must accurately reflect how well a given set of features will actually perform, as shown in figure5.1. This is necessarily a tradeoff of cell functionality against space used, so the cost function is calculated as the product of cell size and functional cost. Cell size is calculated as an approximation based on the die space used by various features in existing logic arrays.

The functional cost of a set of features is rather more difficult to calculate, particularly as there is no definitive answer - any given logic array architecture will be better suited to some applications than others. It is necessary to test an architecture against a number of applications to determine how well it will perform.

Unfortunately the obvious way of doing this is horribly expensive - the most effective way of determining performance is to do circuit synthesis of a number of sample circuits on each of the architectures being tried. The cost of synthesis for all these circuits on even a single architecture is in the range of hours. This is not practical when many generations of many architectures must be scored in this way for each simulation run.

A more practical cost function was developed by deriving various statistics from circuits which had already been synthesised, and using these statistics to form a set of ``real world'' functions which the logic array was required to be able to perform. For instance a tally of inter-cell connections was kept, noting how far and in what direction each connection was. This was used to determine whether routing resources in each architecture would be adequate to a task or whether they would be underutilised.

The disadvantage of this much quicker method is that it does bias the results in favour of the style of architecture that the circuit was originally laid out for. This was considered an acceptable tradeoff as the biasing effects were to some extent ameliorated by performing the synthesis process on a generic architecture which was designed to exhibit relatively small bias towards any particular architecture. In the event that the synthesis process had used a function or a routing ability which was not available on a specific architecture, the cost of an equivalent alternative function or route was added into the total cost of the architecture.

The creation of such a cost function is rather more sensitive to small factors than might initially be supposed. Changes to parameters such as the die size of features, the set of sample circuits used to prime the cost function and the features of the ``generic'' architecture used for placement and routing all had a very significant effect on the outcome. These effects will be discussed in more detail later.

5.1.2 Genetic algorithm

The genetic algorithm takes the initial random genetic code of the population of architectures and through natural selection optimises towards the more effective architectures. The system used was an asexual reproduction scheme which has the virtue of being simple and effective. The most successful architectures of each generation were mated with the remaining architectures, producing a new generation. A caveat was that these successful architectures were given the chance to live on into the next generation so the best genetic material wouldn't be lost.

The purpose of this work wasn't to explore genetic algorithms so much as to find a good logic array architecture. As a result the genetic algorithm used is relatively unsophisticated and ad-hoc. Greater depth of study of genetic algorithms might have yielded a faster genetic algorithm however the one used proved quite adequate in the circumstances.

One problem of genetic algorithms is their tendency to find local minima and then become entrenched at that solution and fail to find other, better solutions. A couple of strategies were used to avoid this.

Figure 5.2 - Genetic diversity algorithm

Firstly, genetic diversity was encouraged by having several separate gene pools. Figure5.2 shows how each of these pools was effectively a patriarchy, with a successful ``stag'' architecture governing an individual pool of ``doe'' architectures. While these gene pools were mostly distinct, occasional indiscretions were permitted to allow the more successful stags to introduce their genetic material into the gene pools of the less successful stags. This environment was designed to provide some genetic diversity in an attempt to escape local minima while encouraging the groups towards solutions which weren't too disparate.

The effect of this approach was that evolution took significantly longer than a one-stag system, but better solutions were often found in the end.

The second technique was an oscillating mutation rate, used to avoid local minima. The mutation rate of the system was initially low but increased gradually over fifty generations. In nearly all cases a stable state had already been achieved before the maximum mutation rate was reached. The period of high mutation rate then tended to perturb the system enough that other nearby (but perhaps slightly less optimal) solutions were found and explored by some of the gene pools. The mutation rate would then drop again, allowing fine-tuning of these solutions to occur.

This oscillating mutation rate was occasionally successful in producing a better solution in combination with the separate gene pools, but for the most part didn't make a great deal of difference once a steady ``best'' solution had been achieved.

5.1.3 Symmetry

One result of these experiments in genetic logic array design was that the architectures the system created were often quite asymmetrical. This is not surprising since random mutation is an important part of the process. What is slightly more surprising is that the best solutions are asymmetrical. On the other hand most computer architectures designed by humans are symmetrical. This raises the question of whether symmetrical designs are actually superior to asymmetrical designs or whether they are simply more appealing to computer designers.

Generally speaking, symmetrical designs are often easier and more efficient to lay out on a die than asymmetrical ones so there is a definite advantage to some types of symmetry. Conceivably this factor could be included in the cost function, as a factor affecting cell size. It would however be quite expensive to calculate accurately, necessitating that some sort of VLSI layout be generated for each ``organism''. The computational cost of this was considered to be unacceptably high for this experiment.

One question remains however - whether genetic algorithms tend to generate asymmetrical layouts just by chance, or whether there are real advantages to be had from asymmetry. To investigate this further a set of experiments were run where the most successful organism of each generation was seeded with predefined sections of symmetrical genetic material. The intent was to see whether this symmetrical structure would be incorporated in successful organisms and whether these organisms would be as successful as other runs which didn't have genetic tampering occurring.

The results of this test were quite interesting - while there are many free variables in the genetic code which may be manipulated, the effects of some changes can be quite wide-reaching and complex. For instance in one experiment it was desired to configure the routing to have a symmetrical north-south-east-west grid structure. Setting the second-level routing to have this structure was successful, as was setting one of the input multiplexers to have this structure. However from this point on all other attempts to impose this structure on the array failed - resulting either in unstable genetic pools or in architectures which were significantly less successful than those which didn't have the symmetry imposed on them. The unstable genetic pools were forced towards solutions which were close to the imposed sections of genetic code but all the best solutions were mutations from the introduced code, indicating that the introduced code was inherently suboptimal.

Some time was spent in attempting to evolve a gene set which was both symmetrical and similarly low in cost to the best asymmetrical solutions. This process consisted of trial-and-error creation of symmetrical genetic code which was then inserted into the most successful members of into the gene pool each generation. This process was later refined by stopping the insertion when the system was approaching a good result - allowing mutation to vary the solution if the imposed genetic code was fundamentally inefficient.

After a great deal of trial-and-error experimentation some basic facts were determined -

  • It is possible to significantly increase symmetry without adversely affecting the efficiency of the resultant architecture
  • For the cost function which was used it was not possible to create a completely symmetrical or even mostly symmetrical architecture which performed as well as asymmetrical ones.
  • Varying the cost function or some parameters of the cost function had an effect on the extent of the ability of the array to be symmetrical.
  • Despite this, some asymmetry seemed to be inherent in the system.

Why would asymmetrical architectures be favoured over symmetrical ones? The answer is in the elimination of redundant routes. The modelled system allowed array cells of up to three inputs to be created, each of which could take its inputs directly from one of eight neighbouring cells or from a separate ``indirect'' routing system.

In all cases the natural selection process tended to weed out any tendency of the architecture to duplicate the function of any input on other inputs - since the inputs are otherwise equivalent there is no need for any two inputs to be able to derive a signal from the same source. This is further complicated by the fact that inputs may come from a large number of sources and the three inputs must between them be able to make connections to any combination of three sources. If they can't make a direct connection an additional cost for extra routing must be incurred.

The result is that the input configurability for each of the three inputs tends to be very different. This is in stark contrast to most man-made architectures in which all inputs are very similar in the ways in which they may be configured.

5.1.4 Is asymmetry bad?

If asymmetrical logic arrays tend to be more efficient than symmetrical ones is there any good reason for avoiding them? The answer to this is not entirely clearcut. In general the less ordered a VLSI circuit is, the more expensive of routing resources and unused die area it tends to be. This applies more to some structures than others however - for instance empty holes in a RAM array would simply be wasted space but on the other hand multiplexers can be compacted so as not to waste space in this way. The parameters of the test architecture were designed so that multiplexers are used for inputs - eliminating this cost of asymmetry to a large extent.

Other designs might not be so well adapted to asymmetrical design. Crossbars of the kind commonly used in routing between cells (for example in the Xilinx FPGAs [Xilinx94] ) do not benefit significantly from the elimination of unused routing options - in most cases the same space is used whether an individual routing option is available or not. If using a genetic algorithm with a crossbar architecture some attention should be paid to reflecting this effect in the cost function.

5.1.5 Results

The final low-level architecture chosen for Switchblade used parameters designed to give good general performance for a variety of circuits. It was geared specifically to provide features useful for the multiple configuration store architecture and the genetic algorithm was given a variety of resources it could draw on, each with a corresponding cost in array space. The general style of cell circuit which was developed is shown in figure5.3.

Figure 5.3 - The circuit for a cell

For some choices of parameters the results obtained were clearly not what was intended. An example is the cost which is added for cases where the architecture can't route either directly or indirectly, and connections have to be routed by the use of additional cells which have no logic function. This kind of cost is hard to estimate in a general cost function as it depends largely on the individual placement issues in each case. Moreover while this parameter is not heavily used, the cost is large enough that it does have a significant effect on the result when indirect routing of this type is used. High values tend to cause the array to have expensive routing resources in order to avoid incurring the expense at all, and low values tend to strip down the routing resources to a ridiculously low level. This kind of behaviour is not particularly surprising but it does outline the sensitivity of the results to the parameters of the experiment.

Some effort was spent in evolving a balanced architecture which demonstrated as much symmetry as could be supported without degrading performance. It was tried against a wide variety of circuit types. The result is shown in figure5.4a. Each block represents an input multiplexer, the first three used directly as LUT input and the fourth being used for the second-level routing system. The arrows shown entering each block at a base level are inputs taken directly from the outputs of neighbouring cells in specific directions. The elevated inputs are taken from the second-level multiplexer of neighbouring cells. A kinked-back arrow indicates that the input can be taken from the cell's own second-level routing.


a) The best balanced architecture, manually ``hinted'' to be more symmetrical.
This is the architecture used in Switchblade

b) Based on the assumption that routing via other cells is very costly

c) Assumes that routing via other cells is cheap - no need for second-level routing at all
Figure 5.4 - Three evolved architectures

Figure5.4 shows three architectures which resulted from separate runs using different circuits to prime the parameters used in evolving the system. Each of these architectures was found to be reasonably effective for a wider domain of circuit types and despite the very significant differences in their overall design the results showed no clear winner. Not surprisingly, each was more successful in dealing with circuits which had similar characteristics to the circuits which were used to parameterise the original design.

Priming the system with circuits which do a large amount of longer-distance routing resulted in the architecture shown in figure5.4b. This architecture doesn't vary a great deal from the ``best'' architecture in essentials, but it has a slightly higher emphasis on second-level routing.

Figure5.4c results from reducing the cost of routing via other dedicated cells to an unreasonably low level. Note how multiplexer 1 has taken very well to the ``hinted'' symmetrical layout, though multiplexers 2 and 3 are still relatively asymmetrical. The big difference is that the second level multiplexer has been stripped of all its external inputs and no second level routing is needed at all - connections to adjacent cells were adequate and in the rare case where extra routing is necessary it can be performed using extra cells without excessive cost.

5.1.6 Conclusions

It is clear that there is no one great answer to the need for efficient configurable logic array architectures. Certain architectures favour certain styles of circuit. Architectures which work well with a wide variety of circuits tend to be more expensive than architectures which are suited to a specific style of circuit. Experimental results demonstrate that a wide variety of architectures can fill this general role quite well, with no clear winner emerging - at least with the architectural options made available to the simulated system.

The most interesting effect demonstrated by the genetic algorithm approach to logic array design was in the gains which can be made from asymmetrical design of cells. Symmetrical connections are inherently poor at using the die space efficiently as they tend to duplicate functionality across the cell's inputs. Since there is no need to send the same signal to multiple inputs at once, it is possible to eliminate the more redundant combinations of routing options. This in turn means that each input tends to have completely different routing abilities - hence the asymmetry.

Genetic algorithms proved to be very useful in creating the design. Experimentation with new architectural ideas in this environment quickly demonstrated their worth or lack thereof. Perhaps the most unsatisfactory aspect of this potentially useful tool was the tendency to randomness in the architectures which it created. While with some considerable effort a relatively random design can be converted to an equivalent one which is less chaotic it is not clear how to cause an evolutionary system to naturally pick solutions which are more regular by nature. Mind you it's not clear that more regular solutions are necessarily advantageous, but they are at least more aesthetically pleasing than highly random ones.

5.1.7 Switchblade low-level architecture

The final architecture which was developed for Switchblade is shown in figure 5.3a. This architecture was created using the genetic algorithm described above, in combination with a cost function which reflected the features desired for the Switchblade system -

  • Multiple configuration stores
  • Quick switching between configurations
  • Efficient handling of a variety of circuit types
    • Cell functionality
    • Routing

The architecture is based around a 64-entry lookup table (figure 5.3). This provides six bits of input which can be used to select the one output bit. Under normal operation three of the six table inputs are devoted to selecting one of eight configurations. The same three bits are used to select one of eight configurations for all the remaining configurable options in the cell, although this is not explicitly shown in the diagram. The three configuration bits are transmitted to cells in a region from a nearby shared-signal distribution point.

The remaining three inputs of each LUT are taken from three input multiplexers. These route input signals from a number of possible sources to each of the inputs. The output of the LUT can be either directed first through a D-latch or alternatively taken directly to the primary output of the cell.

Two other shared signals are available to cells. One is a global reset signal which returns the D-latch to a configuration determined state, and the other is a common clock signal which can be used to strobe the D-latch.

A fourth input multiplexer is used to provide an extra signal in cases where it may be useful. This fourth input is directed to the secondary output of the cell, providing extra connectivity which can be routed ``over'' logic cells without affecting their operation. This is intended to be used as a routing scheme which is mostly independent of the operation of the host cell. Signals can be routed at random from secondary output to secondary output over a reasonable distance, independent of the operation of the cells which the signal passes through.

The fourth input can also be used as the clock source to the D-latch if it is in use, or in cases where a three-input LUT is inadequate it can be used as an extra LUT input.

The ability to trade off the number of available configurations against LUT width is not a feature which is automatically available to user programs. The security model ensures that user programs are not able to use more than their allotted single configuration unless they successfully request two adjacent configurations from the operating system.

Another use of the fourth input multiplexer is to provide a data input to the LUT. It is possible to configure the LUT as a single bit wide addressable register, much in the same way that Xilinx LCAs allow [Xilinx94]. Data is latched into the RAM using a data read strobe signal which is taken from the local shared-clock signal. The remaining three multiplexer inputs act as address lines, providing eight bits of data stored per configuration.

If more storage is required this can be achieved through the use of multiple configurations. Address bits sent to the local shared signal distribution point are broadcast to nearby cells as configuration selection bits, allowing up to sixty four bits to be accessed per cell if all the configurations are used.

5.2 Local Interconnect

The strongest influence of the genetic algorithm used in the design of Switchblade can be seen in the optimisation of interconnect. Primed with a set of statistics on routing, the genetic algorithm favoured an architecture which permitted direct connections from any three neighbouring cells or indirect connections to more distant cells.

This design favours the primary use of the cell as a three-input LUT with orthogonal inputs - the multiplexers are optimised on the assumption that any combination of one, two or three input functions may be required from any combination of available signal sources. It is not important which input each signal is sent to however, as the LUT may be configured appropriately for any arrangement of inputs. This makes it possible to eliminate a substantial number of options from the multiplexers, resulting in a set of multiplexers roughly half the size of those which would otherwise be necessary.

The genetic algorithm's choice of three-input LUTs in the cells was no coincidence - the general architecture which the sample routing data was taken from was a three-input architecture so it is natural that the genetic algorithm would optimise for this arrangement. It was realised in advance that this bias would be apparent but for efficiency of the cost function it was decided to limit the number of inputs to three. The choice of three was quite deliberate - this number has been empirically found to work well with a good variety of circuits [EbelingBoriello91] and kept the size of expensive LUT tables relatively small while simultaneously allowing a number of configurations to be selected from. Note that the architecture allows the fourth input to be also used as a LUT input if necessary. This halves the number of available configurations, however.

Compared with some other logic array designs, the large input multiplexers initially seem like an expensive routing option. There are however two factors which weigh in favour of this scheme. Firstly, inputs are only connected locally. This means that a huge amount of die space is saved from routing costs already. The large crossbar system used in many logic arrays is replaced, and the size of the multiplexers is small relative to that of the sea of interconnect. Since connections are all to neighbouring cells there is no need to have large numbers of long lines surrounding all the cells. This style of interconnection is popular in architectures for configurable computing - an example is the Xilinx 6200 series (which was announced after this architecture was developed).

The second reason why multiplexers are not so great an expense is due to the multiple-configuration feature. Multiple configurations necessarily mean a large LUT. The size of the LUT was taken into consideration in the cost function and compared to it the multiplexers are relatively small.

One final, compelling reason why multiplexers were chosen instead of a conventional crossbar arrangement is that input multiplexers cannot be programmed in such a way as to damage the device. Since rogue circuits are not only possible but likely in this self-reconfiguring architecture, the device needs to be inherently solid enough to resist damage through misconfiguration.

The inputs available to each multiplexer are shown in figure5.4a and table 5.1. Both the first multiplexer and the fourth multiplexer have their primary inputs directly to the north, south, east and west. In the case of the first multiplexer this is directly to the outputs of neighbouring cells and in the case of the fourth multiplexer it is to the outputs of the fourth multiplexers of neighbouring cells - providing a separate routing layer which passes signals around independent of the underlying logic cells.

lower
upper
other
NWNNEE SESSWW NWNNEE SESSWW UPB1B2
mux 1
 X X  X X XXX  X XX XXX
mux 2
XXXX XXXX X X       XX 
mux 3
XX X XX X X  X  XX  X X
level 2 mux
  X    X   X X  X X    
Table 5.1 - Switchblade input multiplexer assignments

The three LUT input multiplexers each have some inputs devoted to obtaining signals from the upper routing layer. Note that not as many inputs are taken from the upper level though - only twelve as opposed to eighteen inputs. This is due in part to the lower frequency of signals being required from the upper level, and in part to the availability of the ``UP'' signal. The ``UP'' signal derives from the output of the fourth multiplexer, providing an opportunity to use its additional routing resources if the upper layer is not being otherwise used. Note that the genetic algorithm has taken this resource into account and has eliminated some inputs with this kind of routing in mind.

The ``B1'' and ``B2'' inputs are two inputs from the local broadcast mechanism. These come directly from the local routing centres described later, and often contain data transmitted over a relatively long distance using the medium distance interconnect system also described later.

5.3 Logic arrays for computing

The architectural features described to this point are mostly fairly conventional. While the exact circuit of the logic block is new and the addition of a separate routing layer is slightly different from other schemes, the main point of interest in the design so far is the use of a genetic algorithm as an aid to design. The remainder of this chapter describes features which are specifically aimed at turning a plain logic array into a device suitable for a new type of configurable computing system.

To date designs for configurable computing have been heavily based on standard logic arrays. The purpose of Switchblade is to take a step forward from this very basic and specialised type of computing engine and develop a more sophisticated architecture with support for a much wider spectrum of use, in much the same way that a modern multitasking computer system transcends a microcontroller.

The remainder of the chapter details a number of advances aimed at turning logic arrays into a computing device better suited for general purpose computing.

5.4 Multiple configuration stores

Multiple configuration stores are an architectural feature which has already been discussed in some detail in the previous chapters. Switchblade provides built-in support for eight configuration stores. If logic blocks are configured to take three inputs all eight configurations can be used. An alternate setup is available where logic blocks can take four inputs but only four configurations are available.

Switchblade uses a relatively conservative eight configurations. This was chosen as a good compromise between keeping logic block size down and providing extra functionality per block. Research shows that eight configurations is more than enough for most applications anyway [Dehon94] [Dehon96]. More importantly as a multitasking system Switchblade is able to make better use of the multiple configuration stores which are available to it. Since multiple, large concurrent tasks are likely to be alternating in their access to the logic array it is likely that much of the array will be alternating between several different configurations when the system is heavily loaded. This is the scenario where multiple configuration stores really excel. Chapter 6 contains some simulations which show the value of multiple configuration stores in a multitasking system.

Most systems which support multiple configuration stores also suffer from a lack of flexibility in the way that their multiple configurations can be switched - often the entire array must be switched simultaneously. Switchblade is capable of switching configurations in a very flexible, per-logic-block manner. This allows maximum use to be made of the multiple-configuration feature - different parts of the same circuit can easily be switching configurations independently.

5.5 Signal distribution cells

Logic cells can operate in three different modes. The LUT table lookup mode and the addressable register mode were discussed in section 5.1.7. The remaining mode allows a cell to be converted into a centre for the transmission and distribution of logic signals. These signals are of two types - shared signals for local broadcast and individual signals for medium distance interconnect.

Some sections of logic will require a high density of such signals, and this will naturally result in such sections having larger numbers of logic cells configured as signal distribution cells. Conversely, highly regular applications will tend to have a larger proportion of local interconnect and hence a lower density of signal distribution cells. In this way the amount of die area devoted to routing resources can be directly adjusted to circuit needs. Unlike most existing architectures where logic and routing resources are predetermined by the logic array architect and in any given application tend to be either underutilised or at a premium, this system allows routing resources to be configured dynamically along with the configuration for each individual circuit.

5.5.1 Shared signals

Shared signals are a form of localised broadcast signal. A number of these signals are distributed from each signal distribution cell. An example of these signals are the three configuration bits which are used to select which part of each cell's configuration store is used. Two other such signals relate to the D-latch - a ``reset'' signal used to set the D-latches to a predetermined state (set in the configuration) and a clock signal used to trigger the latch. Note that the broadcast clock signal is not necessarily the clock used by all cells - individual cells may be configured to take their clocks from other sources.

Another localised broadcast signal is the data read strobe. This signal is used by cells configured as RAM units. When strobed it causes a new value to be latched into RAM. Unlike the clock signal this function can only be performed using the localised broadcast system due to the limited number of multiplexed inputs available on each cell.

Signals can be injected into the broadcast system using a cell configured for signal distribution. In the simple case this cell can take one or two inputs on its input multiplexers and place them on selected broadcast lines to be sent to the surrounding circuit area. The LUT bits which would otherwise be used for table lookup are instead used to select which destination signal each of the inputs is sent to.

Broadcast signals are distributed to the three neighbouring cells directly. Each group of four cells is similar, but the layout of each of the cells is flipped such that the circuitry involved with broadcast signals meets at the conjunction of the four cells. This reduces the line length of the interconnect to a tiny fraction of what would otherwise be necessary.

The point at the conjunction of the cells is not only used to communicate between the cells - it also has specialised routing circuitry known as a ``local routing center'' (see figure 5.7). In a bid to reduce the cost of routing a large number of signals to every cell these signals are sent in multiplexed form between each group of four cells, requiring relatively few lines. This adds up to a substantial saving in die real estate - as well as the signals previously mentioned all the configuration layer signals are also transmitted this way.

Figure 5.7 - Sixteen cells and their four routing centres

The multiplexed signals are by default distributed in a fractal pattern designed to minimise signal skew. This is complicated by the fact that broadcast signals can be limited to specific areas - for instance a single task may use a limited area of the logic array and when its configuration is switched out only the given task should be switched out - not other, unrelated tasks.

Broadcast signals are routable in a limited way. Each local routing center can elect to source its broadcast signals from any of the north, south, east or west directions, or from a logic cell configured for signal distribution. Moreover, at each junction of cells independent of whether a local routing center exists at that point it is possible to route the signal from any of the four standard directions. This allows the broadcast signals to be routed flexibly.

The cost of this routing is less than might otherwise be supposed. Since these signals are multiplexed on a single line, the routing need only be performed for one line rather than lines for all the component signals. This results in a vast improvement of efficiency in transmitting these signals - only a fraction of the die space is used compared to that which would otherwise be required.

Flexible routing of these local broadcast signals offers some quite powerful behaviour. A circuit can take a set of broadcast signals, modify some at will, and then distribute this modified signal set to a special local routing domain configured to its own needs. This is useful for instance when creating such circuits as synchronous counters - an appropriate clock and reset signal can be injected into a section of the logic array covering the counters only. No other routing resources are necessary to deliver these signals to all the latches in the counter. An example of fractal broadcast routing to a restricted area is given in figure 5.8.

There are some limitations on transmitting signals using this system however. It is not particularly useful as a general point-to-point communications mechanism since signals tend always to travel in one direction down the broadcast routing hierarchy towards the terminating logic cells. While exceptions can be made in this routing scheme for the convenience of specific cases it isn't possible to implement situations where lines cross each other, and most other such situations where routing is non-trivial are also impossible with the broadcast scheme.

Figure 5.8 - Broadcast signal distribution in an arbitrary area

5.5.2 Medium distance interconnect

Signal distribution cells are not used only for distributing broadcast signals - they can be used for space-efficient transmission of signals over moderate distances. This mechanism doesn't use the broadcast interconnect system, instead it uses the normal cell-to-cell interconnect system in a multiplexed mode. This provides the benefits of the flexible cell-to-cell routing in combination with the efficient space use of a multiplexed scheme.

Operation of the system is identical to that of the broadcast scheme except that the input signal is taken from the cell's fourth input multiplexer rather than the broadcast system and the resulting signal is directed to the secondary output rather than to other cells. It is intended that the separate routing layer provided by this input and output be used to transmit the multiplexed signal between cells.

Conceptually a multiplexed link is sent over a number of signal distribution cells using the normal forms of routing. Each signal distribution cell can then add data to the link and take signals from the link. The first cell in the link will add a signal to it, to be retrieved at some point later at one or perhaps several locations. Other cells may add other signals to the link at any point.

Signal transmission in the routing scheme is unidirectional. This in turn means that signals can only pass along the link in one direction. Most communication in circuits tends to favour one direction, so it is likely that signals in the primary direction would be most efficiently sent on a multiplexed link and the relatively few return signals may be cheaper to route directly without multiplexing. Alternatively multiplexing may be used both ways using two separate multiplexed links, depending on what the circuit requires.

The nature of Switchblade's linear multiplexed links is rather different from TSFPGA which was developed independently [BolotskiDehon94]. TSFPGA is a rather unusual architecture with a quite different form of interconnect. It is difficult to estimate how these schemes would compare as they use rather different principles. One potential advantage of Switchblade's system is that multiplexing need only be used where it is needed, meaning that connections to neighbouring cells are very fast. Rather than accepting the space cost of multiplexing as standard Switchblade encourages the creation of multiplexed links only where they are of significant benefit. This encourages the use of the link for a number of unrelated signals which happen to be travelling in the same general direction. Other schemes are only really useful for groups of signals which are sourced and terminated together. It is likely that a link could travel far beyond where the original signal was delivered, with new signals being added all the time and the multiplexer slots of delivered signals being filled with new signals to be sent further down the line. It could be quite reasonable for such a link to be routed all over a logic array apparently at random. The logic to the routing would only become apparent when the sourcing and sinking of each signal at the correct locations was verified.

An alternative in some situations is to route a transmission link in a ring, feeding the end of the link back into the start. This has the advantage that return signals can be fed back to earlier parts of a circuit. It may or may not be a viable option depending on how efficient it is to loop the link back in the individual circumstance. Note that this is not strictly a token ring as no token is passed between cells - synchronisation is performed using a different method which is described later.

Each multiplexed line is only capable of carrying a limited number of signals at one time. Switchblade sets this limit at sixteen, an arbitrary choice reflecting the tradeoff between maximising utilisation of the interconnect and the speed and expense of demultiplexing the signals.

Multiplexed transmission systems naturally have a cost - the delay between a signal being ready and it being transmitted down the line. This cost is minimised by the use of a high clock rate in switching between the transmission channels. The clock used is the common high-speed clock - which is used by the broadcast signal system as well. The broadcast system provides a synchronisation signal which is used to reset the demultiplexers after each set of signals has been transmitted. This keeps all the signal distribution cells in synchrony.

5.5.3 Advantages of multiplexed interconnect

The chief advantage of this multiplexed form of signal transmission over a simpler, non-multiplexed system is that interconnect can be used much more efficiently. Interconnect is a very expensive resource in logic arrays, with as much as 80% of circuit area being devoted to interconnect. Much of this space ends up unused in any given circuit and much of it is used to route low-bandwidth signals over long distances.

The multiplexed system described provides an improvement to both of these problems. Since Switchblade has relatively sparse non-local routing resources the amount of die area devoted to both local and non-local routing at this level is appropriately low - approximately 10% (although another 10% was expended on the higher-level routing system described in the next chapter). When the occasion calls for additional routing resources signal distribution cells can be added which not only provide longer distance routing, they provide it in an efficient multiplexed form which does not require extra interconnect resources specifically for this purpose.

Adding signal distribution cells is quite costly - each cell used for signal distribution cannot be used for logic functions - however traded off against the high cost of long unmultiplexed interconnect resources this cost is not excessive. More importantly the signal distribution cells allow routing resources to be matched to circuit requirements - resources can be placed exactly where they're needed, avoiding the problem of conventional systems where interconnect is relatively unused in most places and can be heavily oversubscribed in other places. Sections of a circuit where routing resources are at a premium tend to have placement problems, causing many cells to be unused or used only for routing purposes. Switchblade's system allows routing resources to be added flexibly, preventing both the under-use and the starvation problems.

5.5.4 Interconnect characteristics

Figure 5.10 - Interconnect area cost

Figure 5.10 compares the resource expense of routing in Switchblade to the Xilinx XC4000 series FPGAs. In Switchblade routing between two adjacent logic cells uses special local interconnect - it doesn't draw on the general interconnect resources as the Xilinx does. Using the second-level routing system for interconnect results in an expense equal to the distance, minus the local connection which can be made at the final step.

Multiplexed interconnect on the other hand offers a slower but more efficient method for transmitting signals over longer distances. In the best case up to sixteen signals can be multiplexed together, though for the graph line a more realistic average of six has been assumed. In the real world it is likely that shorter lines will avoid the inconvenience of multiplexing but longer lines are more likely to take advantage of the density gains it offers. The ``realistic average Switchblade expense'' line is based on an estimate of the likely combination of the two forms of interconnect.

Compared to the Xilinx 4000, Switchblade's interconnect is slightly more compact even in the worst case and significantly more compact even on average. In the best case it is vastly more efficient of array space.

Figure 5.11 - Interconnect propagation delay

Switchblade is not as impressive where it comes to propagation delays however. Most delays originate from routing switches. Ignoring the difference in routing switches and array technology for the purposes of this comparison, it is clear to see that Xilinx's ``long lines'' have great speed benefits in sending signals over long distances (figure 5.11). Switchblade has a switch at every step of the route, which introduces significant propagation delays on lines which are routed over long distances. One possible saving grace here is that many long-haul signals will be transmitted between contours in an asynchronous, pipelined fashion by the virtual logic system described in the next chapter so very long local signals are unlikely to be generated by the synthesis tools in any case.

The logarithmic curve for the Xilinx device is an estimation. This is based on the observed frequency of switches in lines of various lengths in a number of Xilinx layouts.

5.5.5 Shared signal distribution

Each group of four logic array cells is adjacent to a single local routing centre. These local routing centres have the task of multiplexing and demultiplexing broadcast signals for the surrounding logic cells. The die space for each of these blocks is located in a concavity of the four surrounding logic cells as shown in figure 5.7.

The transmission of broadcast signals uses a very similar system to that described for signal distribution cells. The global high speed, low-skew clock signal provides a time base for the multiplexing/demultiplexing process and a separate global synchronisation signal ensures that the demultiplexing process is synchronised between all cells.

5.5.5.1 System configuration line

A single data line is used to transfer multiplexed data to cells. It is reserved entirely for system functions such as cell configuration and global reset signals. This line is routed in a similar manner to the previously described shared signals which are used for local broadcast. The system broadcast line reaches each cell via its adjacent network routing centre and the demultiplexed signals provide configuration control as well as other circuit-wide useful signals such as clocks and reset signals.

Each local routing centre is addressable for configuration - this allows a common programming signal to be broadcast to a large number of cells and each to respond individually to its configuration instructions. Note that the system is not limited to this style of one-cell-at-a-time configuration - by separately routing the system broadcast line to multiple cells it is possible to configure them all simultaneously. More commonly it is routed using the same type of low-skew fractal routing scheme as used by the global clock - a scheme which is relatively slow but simpler to manage.

5.5.5.2 User configuration line

A second configuration line is available to user circuits. This provides user access to the configuration system. The powerful ``self-reconfiguration'' feature described in earlier chapters is made possible through this line. Using this feature circuits can dynamically allocate additional circuit space and populate it with circuits which are generated on the fly as program requirements dictate. This ability to expand the parallel processing abilities of a circuit based on dynamic program requirements is unequalled in current architectures.

Unlike the system configuration line no special routing resources are provided for the user configuration line. The line is taken from the user-level circuit and gated into the configuration system. It is really just a back door into the system configuration logic which allows user processes to broadcast their own configuration signals on the system lines.

This user configuration line should not be confused with the user data line used to distribute shared signals. The user downstream data line is available for user signals and may be routed in a similar way to the system configuration line. This is the data line which is used by shared signal distribution cells. Its functionality was described earlier, and does not (generally) carry configuration information. In contrast the user configuration line is not routed, it derives from the local cell. Just to confuse the issue further, it is quite common to take signals from the user data line and place them on the user configuration line. Keeping the operation of these signals separately configurable allows greater flexibility however.

5.5.5.3 User configuration security

There are restrictions on what users may configure. A two-level security system is imposed which limits the scope of user reconfigurations. User-level tasks may reconfigure their own user-level circuits, but may not interfere with system tasks or other user-level tasks. Along with the other configuration signals is a ``user level'' signal which determines if certain types of reconfiguration are permitted. Most importantly, user level tasks are prevented from reconfiguring the routing of the system configuration lines. Were they able to do this they could deny the system control of their part of the array and attempt attacks on other parts of the array, virus-style.

Control of user-configurable parts of the array is ensured by directing a surrounding perimeter of logic to ignore the user-configurable part (by making sure that no configuration data is obtained from the user-configurable area) and having the operating system preconfigure a configuration line routing scheme which is only accessible from outside the user-configurable section. If a runaway configuration should occur in this enclosed environment it cannot harm anything around it and can be easily rendered harmless by reconfiguration from outside.

Note that unlike some other architectures it is not possible to reconfigure the system fatally - all configurations are valid although most will be meaningless. Many other systems (eg. Xilinx) can be damaged by an incorrect configuration.

Switchblade has multiple configuration stores which can be switched between by the system or by user processes to a limited extent. Most programs will desire to isolate from each other unrelated configurations which reside in different configuration stores but use the same die area. User-configurable areas could potentially attack configurations other than their own. This is prevented by the simple expedient of disallowing user-controlled configuration switching in areas which are user-configurable. Configuration switching is normally performed using signal distribution cells to change the appropriate local broadcast signals. The modified signals are distributed to other cells, which then switch to the selected configuration. Setting of the configuration bits is simply disallowed in areas which have the ``user level'' signal set, preventing user-configurable areas from attacking other configurations.

Why can we trust configuration switching in user circuits which are not user-configured, but cannot trust ones which are user configured? The answer is the compiler - the compiler is a trusted part of the system which must check circuits to ensure that they aren't capable of scurrilously changing configurations to ones they would not normally be allowed access to. The compiler may elect to use multiple configurations if switching between configurations will make circuit operation more efficient, but it requires special cooperation from the OS and the loader to be able to place and load a configuration which uses configuration switching.

5.5.5.4 Copying configurations

Configurations contain a great density of data relative to that which is stored in individual cells' latches. Most user-defined configurations which are run-time configured will also be quite repetitive and in general will have few customisations, so it's likely that most of the configuration data will simply be copied from another source. When copying new configurations to user-defined circuit areas a great deal of data has to be brought in, and yet the surrounding array logic blocks don't provide a good density of data to store these static configurations. This makes it attractive to store this configuration data in the configuration stores of other cells. The alternatives would be to bring in the configuration data from off the processor, which would be very slow, or store it one-bit-per-cell in the cells' latches, which would be very inefficient.

To expedite the copying of data from one cell to another the configuration system permits a ``burst mode'' transfer of data from one cell to another. A special configuration broadcast signal can be set to indicate that the next sequence of bits will be a complete set of configuration data for one configuration of a cell rather than the usual set of multiplexed broadcast signals. This mode is also the normal mode used to program cells as it is much quicker than programming individual configuration bits, which is also possible.

5.5.5.5 Freeze mode

Switchblade is designed as a multitasking, multiprocessing system. It requires the ability to pre-emptively switch tasks out and continue them at a later stage. A mechanism is required which can stop tasks in their tracks so the operating system has a chance to switch them out in a consistent state. A broadcast ``freeze'' signal provides exactly this function, latching all cell outputs to their current state and preventing any further changes in circuit state.

The ability to freeze and reconfigure areas is complicated by the fact that broadcast signals could conceivably be tampered with, routed in strange ways or even not be routed to certain parts of the circuit at all. This is one important reason why the security system prevents user-configurable areas from tampering with the system broadcast lines - the system reserves the right to pre-empt any task at any time with a freeze signal. Non user-configurable areas are less worrisome as the compiler ensures that they are well behaved in this respect.

5.5.5.6 Bootstrapping

On power-up the logic array must be configured to a consistent state before it is usable. A central system configuration port provides a point for a bootstrapping bitstream to be injected which starts the system in a known state. Since it is assumed that there is no general purpose coprocessor to perform the configuration for it, the logic array must perform the majority of the bootstrapping itself. Bootstrapping firmware in EPROM provides the initial trivial programming - a state machine which can be used as a simple sequential processor - and from there the logic array handles its own configuration.

Bootstrapping the system is more difficult than may initially be apparent. On power up the configuration of the array may be completely random, and since it is possible that the routing will not be set to take input from the central system configuration port it may not normally be possible to externally configure the system at all.

For this reason a routing override signal is available to the external configuration system. This signal is multiplexed with all the other system configuration signals and when received by a routing centre causes that centre to switch to receiving system configuration from the source of the override signal. The signal is passed between local routing centres using the normal routing system. It is not generated from anywhere within the logic array - it is only intended for use in the initial configuration process.

This method of setting the initial state of the array has the advantage of not requiring much extra logic in each cell devoted to setting the startup state. More importantly it eliminates the need for a global reset line which would be very expensive.

5.6 I/O

I/O in Switchblade uses the array's normal connectivity to the periphery of the array as the prime routing. In the place of logic cells the periphery hosts I/O cells which are really just simplified logic cells. Instead of having four inputs, two outputs and a LUT they have only one input and one output.

Each I/O cell corresponds to one external line which can be configured as either an input or an output. In input mode the output of the I/O cell into the logic array directly follows the logic level present on the external line. When configured as an output the output tri-state buffers are turned on and drive the line to a logic level corresponding to the cell's internal input. The tri-state buffers are current-limited to prevent damage if the direction of the external connections is misconfigured.

5.7 System design

The Switchblade system uses a logic array as the primary computing device. External to this logic array are I/O components such as disk drives and terminals as shown in figure 5.12. Other devices more closely associated with the computing function are also connected to the array. This includes RAM and may also include special function units such as floating-point coprocessors.

Figure 5.12 - System architecture

5.7.1 I/O

I/O devices interface directly to the array for direct control from the processor. Updates to controller hardware can be achieved as software upgrades as the majority of controller logic is implemented on the logic array. This need not be overly expensive as much of the controller logic can be implemented in a single configuration which is switched in only as necessary. Between uses the array space can be re-used by other tasks. Fast task switching means that other tasks can be switched in and out many times a second if necessary.

It would not be desirable to move this much controller logic into the array in every application. If ASIC I/O controllers are already available which serve the purpose well they would probably be more efficient than an array implementation. Since Switchblade comes from an experimental environment it was decided to allow as much flexibility as possible and move as much control to the logic array as possible. This more elegant solution is not necessarily isolated to prototypes either - in research and development environments or other experimental environments direct control of hardware can greatly facilitate development of control hardware.

5.7.2 RAM

Storage is an important part of modern computers. Logic arrays provide huge potential parallel computing power, limited mainly by the ability to transfer enough data through the system to keep the array ``fed''. This favours the use of fast RAMs. Yet many applications require large amounts of data to be processed quickly, which calls for large memory area. On the other hand the multitasking, multiuser nature of Switchblade implies a need for virtual memory. Clearly a compromise is required.

The compromise chosen is to provide a limited number of fast RAMs for high-throughput computation and a large central virtually-addressed memory pool. No cache memory is used - it is assumed that frequently-accessed values will be stored in registers on the array itself.

The fast RAMs are intended to be used as temporary storage in high speed vector-style computation. A common use would be to store an initial data set in one RAM, filter it through a computation on the array to another RAM, and then reverse the process to perform another computation and leave the result in the first RAM. This process could continue as long as necessary, processing the data ``coctail shaker'' style.

Fast RAMs are a scarce resource. Only eight are provided in the prototype Switchblade design, meaning that tasks could well be stalled waiting for a suitable set of RAMs to become available. RAMs are not easily context-switched, either. Unlike the quick-switching multiple-context ability of the logic array the RAMs are single units which must be painstakingly copied into and out of to change contexts.

An OS feature which alleviates this situation to some extent is the addition of fencepost registers to the interfacing logic to each RAM. Note that since the OS is in dynamic logic this is a true OS feature, not a hardware feature. These fencepost registers allow the system to dynamically allocate sections of RAM to different tasks and provide protection between tasks, at the cost of reducing further the amount of RAM available to each task.

5.7.2.1 Virtual memory

Data for computation must be obtained from somewhere and the results must be stored somewhere. Generally this data's location is either dynamic RAM or disk. Modern computer systems usually provide a virtual memory system which performs both disk caching and storage abstraction functions. Switchblade is no different. A conventional memory management unit (MMU) design is used, with an address translation cache external to the main logic array. See figure 5.13. Most other functions of the MMU are provided by the OS. This minimal-hardware approach is similar to that taken by early MIPS processors [KaneHeinrich92], except that in this case the OS functions are in dynamic hardware which can be switched in very rapidly when needed.

Figure 5.13 - External memory system

5.7.3 Array parameters

The Switchblade architecture is based on the assumption that logic arrays will be large enough to support this kind of general purpose multi-processing activity. This is not currently the case, though it is far closer to reality now than when the project began.

An estimate was made that an array size of 64k gates with eight configuration stores was likely to be reasonable for today's applications running on Switchblade. This is laid out as a 32 by 32 array of pages, each page containing 8 by 8 logic cells.

Section 6.8 contains a performance assessment of the architecture using a simulation of ``gcc'', a well-known C compiler. This test concludes that the array parameters are well chosen. Current logic arrays are around 16k gates maximum, such as the Xilinx XC6264 and DPGA [BolotskiDehon94]. In 1994 Vuillemin used Moore's law to predict that the number of logic blocks on a single die would increase from around 400 in 1992 to around 25,000 by 2001 [Vuillemin94]. In reality we're already ahead of this curve, demonstrating the vitality of improvement in the area. This has been aided in part by architectural improvements such as a move away from island-style designs to take the speed of advance beyond that available from fabrication advances alone. Re-applying Moore's law (which states that feature size is shrinking by a factor 1/alpha approximately equals 1.25 each year, Switchblade is likely to be viable by the year 2003.

5.8 Conclusion

Switchblade provides a logic array architecture which is similar in basic function to some conventional logic arrays. Where it differs greatly from conventional logic arrays is in the support structures it provides for the low-level logic functions.

Multiplexed medium distance interconnect
Multiplexing greatly reduces interconnect costs, while allowing more resources to be flexibly allocated to interconnect where needed.

Multiple configuration stores
Fast task switching is made possible multiple configuration stores. They increase the utilisation of the silicon by making it possible to gain more functionality from logic blocks without increasing their size to a prohibitive extent.

Routable broadcast system
Routable broadcast allows transmission of signals to widespread and user-definable areas, reducing the amount of repetitive interconnect required. Transmission is multiplexed, making it cheap to distribute a large number of potentially useful signals to a wide area.

Self-reconfiguration system
The configuration system allows both system-controlled configuration and user-controlled configuration from any point in the array. There are no restrictions on the number of independent areas which can be simultaneously reconfigured. This system offers both high speed and great flexibility in the way it can be reconfigured.

Security
Low-level security features provide primitives for protection between tasks. The two level user/system security model provides enough security support for sophisticated operating system security models to be built on top of it.

Genetic algorithms
Genetic algorithms provide an interesting alternative to conventional human methods of computer design. The results obtained by applying a genetic algorithm to logic array design were surprising and revealing.

The features that Switchblade's logic architecture provides are geared at providing a structure for a complete computing system based around a logic array architecture. This is a significant advance upon existing systems which either treat logic arrays as simple hardware devices or assume that they are a coprocessor to a more complex computing device. The features described here only go part of the way to realising the goal of a complete logic array computing device - the next chapter describes the final step; virtualising the logic array itself.