Stack Computers: the new wave
Philip Koopman


6.2 ARCHITECTURAL DIFFERENCES FROM CONVENTIONAL MACHINES

The most obvious difference between stack machines and conventional machines is the use of 0-operand stack addressing instead of register or memory based addressing schemes. This difference, when combined with support of quick subroutine calls, makes stack machines superior to conventional machines in the areas of program size, processor complexity, system complexity, processor performance, and consistency of program execution.

6.2.1 Program size

Not surprisingly, this explosion in program complexity leads to a seeming contradiction of the saying that "programs expand to fill all available memory, and then some." The amount of program memory available for an application is fixed by the economics of the actual cost of the memory chips and printed circuit board space. It is also affected by mechanical limits such as power, cooling, or the number of expansion slots in the system (limits which also figure in the economic picture). Even with an unlimited budget, electrical loading considerations and the speed-of-light wiring delay limit bring an ultimate limit to the number of fast memory chips that may be used by a processor.

Small program sizes reduce memory costs, component count, and power requirements, and can improve system speed by allowing the cost effective use of smaller, higher speed memory chips. Additional benefits include better performance in a virtual memory environment (Sweet & Sandman 1982, Moon 1985), and a requirement for less cache memory to achieve a given hit ratio. Some applications, notably embedded microprocessor applications, are very sensitive to the costs of printed circuit board space and memory chips, since these resources form a substantial proportion of all system costs (Ditzel et al. 1987b).

The traditional solution for a growing program size is to employ a hierarchy of memory devices with a series of capacity/cost/access-time tradeoffs. A hierarchy might consist of (from cheapest/biggest/slowest to most expensive/smallest/fastest): magnetic tape, optical disk, hard disk, dynamic memory, off-chip cache memory, and on-chip instruction buffer memory. So a more correct version of the saying that "memory is cheap" might be that "slow memory is cheap, but fast memory is very dear indeed."

The memory problem comes down to one of supplying a sufficient quantity of memory fast enough to support the processor at a price that can be afforded. This is accomplished by fitting the most program possible into the fastest level of the memory hierarchy.

The usual way to manage the fastest level of the memory hierarchy is by using cache memories. Cache memories work on the principle that a small section of a program is likely to be used more than once within a short period of time. Thus, the first time a small group of instructions is referenced, it is copied from slow memory into the fast cache memory and saved for later use. This decreases the access delay on the second and subsequent accesses to program fragments. Since cache memory has a limited capacity, any instruction fetched into cache is eventually discarded when its slot must be used to hold a more recently fetched instruction. The problem with cache memory is that it must be big enough to hold enough program fragments long enough for the eventual reuse to occur.

A cache memory that is big enough to hold a certain number of instructions, called the "working set," can significantly improve system performance. How does the size of a program affect this performance increase? If we assume a given number of high level language operations in the working set, consider the effect of increasing the compactness of the encoding of instructions. Intuitively, if a sequence of instructions to accomplish a high level language statement is more compact on machine A than machine B, then machine A needs a smaller number of bytes of cache to hold the instructions generated for the same source code as machine B. This means that machine A needs a smaller cache to achieve the same average memory response time and performance.

By way of example, Davidson and Vaughan (1987) suggest that RISC computer programs can be up to 2.5 times bigger than CISC versions of the same programs (although other sources, especially RISC vendors, would place this number at perhaps 1.5 times bigger). They also suggest that RISC computers need a cache size that is twice as large as a CISC cache to achieve the same performance. Furthermore, a RISC machine with twice the cache of a CISC machine will still generate twice the number of cache misses (since a constant miss ratio generates twice as many misses for twice as many accesses), resulting in a need for higher speed main memory devices as well for equal performance. This is corroborated by the rule of thumb that a RISC processor in the 10 MIPS (Million RISC Instructions Per Second) performance range needs 128K bytes of cache memory for satisfactory performance, while high end CISC processors typically need no more than 64K bytes.

Small program size on stack machines not only decreases system costs by eliminating memory chips, but can actually improve system performance. This happens by increasing the chance that an instruction will be resident in high speed memory when needed, possibly by using the small program size as a justification for placing an entire program in fast memory.

How can it be that stack processors have such small memory requirements? There are two factors that account for the extremely small program sizes possible on stack machines. The more obvious factor, and the one usually cited in the literature, is that stack machines have small instruction formats. Conventional architectures must specify not only an operation on each instruction, but also operands and addressing modes. For example, a typical register-based machine instruction to add two numbers together might be:

ADD R1,R2

This instruction must not only specify the ADD opcode, but also the fact that the addition is being done on two registers, and that the registers are R1 and R2.

A less obvious, but actually more important reason for stack machines having more compact code is that they efficiently support code with many frequently reused subroutines, often called threaded code (Bell 1973, Dewar 1975). While such code is possible on conventional machines, the execution speed penalty is severe. In fact, one of the most elementary compiler optimizations for both RISC and CISC machines is to compile procedure calls as in-line macros. This, added to most programmers' experience that too many procedure calls on a conventional machine can ruin performance, discourages the use of small, frequently reused subroutines.

There are several qualifications associated with the claim that stack machines have more compact code than other machines, especially since we are not presenting the results of a comprehensive study here. Program size measures depend largely on the language being used, the compiler, and programming style, as well as the instruction set of the processor being used. Also, the studies by Harris, Ohran, and Schoellkopf were mostly for stack machines that used variable length instructions, while machines described in this book use 16 or 32 bit fixed length instructions. Counterbalancing the fixed instruction length is the fact that processors running Forth can have smaller programs than other stack machines. The programs are smaller because they use frequent subroutine calls, allowing a high degree of code reuse within a single application program. And, as we shall see in a later section, the fixed instruction length for even 32-bit processors such as the RTX 32P does not cost as much program memory space as one might think.



6.2.2 Processor and system complexity

When speaking of the complexity of a computer, two levels are important: processor complexity, and system complexity. Processor complexity is the amount of logic (measured in chip area, number of transistors, etc.) in the actual core of the processor that does the computations. System complexity considers the processor embedded in a fully functional system which contains support circuitry, the memory hierarchy, and software.

CISC computers have become substantially more complex over the years. This complexity arises from the need to be very good at all their many functions simultaneously. A large degree of their complexity stems from an attempt to tightly encode a wide variety of instructions using a large number of instruction formats. Added complexity comes from their support of multiple programming and data models. Any machine that is reasonably efficient at processing COBOL packed decimal data types on a time sliced basis while running double-precision floating point FORTRAN matrix operations and LISP expert systems is bound to be complex!

The complexity of CISC machines is partially the result of encoding instructions to keep programs relatively small. The goal is to reduce the semantic gap between high level languages and the machine in order to produce more efficient code. Unfortunately, this may lead to a situation where almost all available chip area is used for the control and data paths (for instance the Motorola 680x0 and Intel 80x86 products). The extremes to which some CISC processors take the complexity of the core processor may seem excessive, but they are driven by a common and well founded goal: establishment of a consistent and simple interface between hardware and software. The success that this approach can have is demonstrated by the IBM System/370 line of computers. This computer family encompasses a wide range of price and performance, from personal computer plug-in cards to supercomputers, all with the same assembly language instruction set.

Ôie"c,ean anl consMst%ntmNterfàce detween harÄwáre ald soDtwre av t`e cs3embly langEage hdv%l meanr"that compklmrS"need nou be exkus3avely ammPlex v/*prkdu+u$s%asoniBlE sode. anl thát they$máy be$reyse` among $many DIffcP%Nð me#Hines(of$the`sa,e famm 9. !Enothe² ad~anvage of( SIsC p:ocesrorr hs 4ha4- JsinCE )nstructinNs q2e very sompagt- thei !Do nod`req5yse a l`rg% casheme}ory fgv aKceptabne s[stem terâoz-e~ce. So, CISC macjinms `ive tpadef off yngrmIseæ proaessor cgíp|exity$f.r 2educmd system ckopèexky.( 4/P>

The concept behind RISC machines is to make the processor faster by reducing its complexity. To this end, RISC processors have fewer transistors in the actual processor control circuitry than CISC machines. This is accomplished by having simple instruction formats and instructions with low semantic content; they don't do much work, but don't take much time to do it. The instruction formats are usually chosen to correspond with requirements for running a particular programming language and task, typically integer arithmetic in the C programming language.

This reduced processor complexity is not without a substantial cost. Most RISC processors have a large bank of registers to allow quick access to frequently used data. These register banks must be dual-ported memories (allowing two simultaneous accesses at different addresses) to allow fetching both source operands on every cycle. Furthermore, because of the low semantic content of their instructions, RISC processors need much higher memory bandwidth to keep instructions flowing into the CPU. This means that substantial on-chip and system-wide resources must be devoted to cache memory to attain acceptable performance. Also, RISC processors characteristically have an internal instruction pipeline. This means that extra hardware or compiler techniques must be provided to manage the pipeline. Special attention and extra hardware resources must be used to ensure that the pipeline state is correctly saved and restored when interrupts are received.

Finally, different RISC implementation strategies make significant demands on compilers, such as: scheduling pipeline usage to avoid hazards, filling branch delay slots, and managing allocation and spilling of the register banks. While the decreased complexity of the processor makes it easier to get bug-free hardware, even more complexity shows up in the compiler. This is bound to make compilers complex as well as expensive to develop and debug.

The reduced complexity of RISC processors comes, then, with an offsetting (perhaps even more severe) increase in system complexity.

Stack machines strive to achieve a balance between processor complexity and system complexity. Stack machine designs realize processor simplicity not by restricting the number of instructions, but rather by limiting the data upon which instructions may operate: all operations are on the top stack elements. In this sense, stack machines are "reduced operand set computers" as opposed to "reduced instruction set computers."

Limiting the operand selection instead of how much work the instruction may do has several advantages. Instructions may be very compact, since they need specify only the actual operation, not where the sources are to be obtained. The on-chip stack memory can be single ported, since only a single element needs to be pushed or popped from the stack per clock cycle (assuming the top two stack elements are held in registers.) More importantly, since all operands are known in advance to be the top stack elements, no pipelining is needed to fetch operands. The operands are always immediately available in the top-of-stack registers. As an example of this, consider the T and N registers in the NC4016 design, and contrast these with the dozens or hundreds of randomly accessible registers found on a RISC machine.

Having implicit operand selection also simplifies instruction formats. Even RISC machines must have multiple instruction formats. Consider, though, that stack machines have few instruction formats, even to the extreme of having only one instruction format for the RTX 32P. Limiting the number of instruction formats simplifies instruction decoding logic, and speeds up system operation.

Stack machines are extraordinarily simple: 16-bit stack machines typically use only 20 to 35 thousand transistors for the processor core. In contrast, the Intel 80386 chip has 275 thousand transistors and the Motorola 68020 has 200 thousand transistors. Even taking into account that the 80386 and 68020 are 32-bit machines, the difference is significant.

Stack machine compilers are also simple, because instructions are very consistent in format and operand selection. In fact, most compilers for register machines go through a stack-like view of the source program for expression evaluation, then map that information onto a register set. Stack machine compilers have that much less work to do in mapping the stack-like version of the source code into assembly language. Forth compilers, in particular, are well known to be exceedingly simple and flexible.

Stack computer systems are also simple as a whole. Because stack programs are so small, exotic cache control schemes are not required for good performance. Typically the entire program can fit into cache-speed memory chips without the complexity of cache control circuitry.

In those cases where the program and/or data is too large to fit in affordable memory, a software-managed memory hierarchy can be used: frequently used subroutines and program segments can be placed in high speed memory, while infrequently used program segments are placed in slow memory. Inexpensive single-cycle calls to the frequent sections in the high speed memory make this technique very effective.

The Data Stack acts as a data cache for most purposes, such as in procedure parameter passing, and data elements can be moved in and out of high speed memory under software control as desired. While a traditional data cache, and to a lesser extent an instruction cache, might give some speed improvements, they are certainly not required, nor even desirable, for most small- to medium-sized applications.

Stack machines, therefore, achieve reduced processor complexity by limiting the operands available to the instruction. This does not force a reduction of the number of potential instructions available, nor does it cause an explosion in the amount of support hardware and software required to operate the processor. The result of this reduced complexity is that stack computers have more room left for program memory or other special purpose hardware on-chip. An interesting implication is that, since stack programs are so small, program memory for many applications can be entirely on-chip. This on-chip memory is faster than off-chip cache memory would be, eliminating the need for complex cache control circuitry while sacrificing none of the speed.


6.2.3 Processor performance

Processor performance is a very tricky area to talk about. Untold energy has been spent debating which processor is better than another, often based on sketchy evidence of questionable benchmarks, heated by the flames of self interest and product loyalty (or purchase rationalization).

Some of the reasons that comparisons are so difficult stem from the question of application area. Benchmarks that measure performance at integer arithmetic are not adequate for floating point performance, business applications, or symbolic processing. About the best that one can hope for when using a benchmark is to claim that processor A is better than processor B when installed in the given hardware (with associated caches, memories, disks, clock speeds, etc.), using the given operating systems, using the given compilers, using the given source programming language, but only when running the benchmark that was measured. Clearly, measuring the performance of different machines is a difficult matter.

Measuring the performance of radically different architectures is even harder. At the core of this difficulty is quantifying how much work is done by a single instruction. Since the amount of work done by a polynomial evaluation instruction in a VAX is different than a register-to-register move in a RISC machine, the whole concept of "Instructions Per Second" is tenuous at best (even when normalized to a standardized instruction measure, using those same benchmarks that we don't really trust). Adding to the problem is that different processors are built using different technology (bipolar, ECL, SOS, NMOS, and CMOS, with varying feature sizes) and different levels of design sophistication (expensive full-custom layout, standard cell automatic layout, and gate array layout). Yet, the very concept of comparing architectures requires deducting the effects of differences in implementation technologies. Furthermore, performance varies greatly with the characteristics of the software being executed. The problem is that in real life, the effectiveness of a particular computer is measured not only by processor speed, but also by the quality and performance of the system hardware, operating system, programming language, and compiler.

All these difficulties should lead the reader to the conclusion that the problem of finding exact performance measures is not going to be resolved here. Instead, we shall concentrate on a discussion of some reasons why stack machines can be made to go faster than other types of machines on an instruction-by-instruction basis, why stack machines have good system speed characteristics, and what kinds of programs stack machines are well suited to.

6.2.3.1 Instruction execution rate

[Figure 6.1a]
Figure 6.1(a) -- Instruction phase overlapping -- raw instruction phases.

The most sophisticated RISC processors boast that they have the highest possible instruction execution rate -- one instruction per processor clock cycle. This is accomplished by pipelining instructions into some sequence of instruction address generation, instruction fetch, instruction decode, data fetch, instruction execute, and data store cycles as shown in Figure 6.1a. This breakdown of instruction execution accelerates overall instruction flow, but introduces a number of problems. The most significant of these problems is management of data to avoid hazards caused by data dependencies. This problem comes about when one instruction depends upon the result of the previous instruction. This can create a problem, because the second instruction must wait for the first instruction to store its results before it can fetch its own operands. There are several hardware and software strategies to alleviate the impact of data dependencies, but none of them completely solves it.

Stack machines can execute programs as quickly as RISC machines, perhaps even faster, without the data dependency problem. It has been said that register machines are more efficient than stack machines because register machines can be pipelined for speed while stack machines cannot. This perceived limitation stems from the fact that each instruction depends on the effect of the previous instruction on the stack. The whole point is, however, that stack machines do not need to be pipelined to get the same speed as RISC machines.

Consider how the RISC machine instruction pipeline can be modified when it is redesigned for a stack machine. Both machines need to fetch the instruction, and on both machines this can be done in parallel with processing previous instructions. For convenience, we shall lump this stage in with instruction decoding. RISC and some stack machines need to decode the instruction, although stack machines such as the RTX 32P do not need to perform conditional operations to extract parameter fields from the instruction or choose which format to use, and are therefore simpler than RISC machines.

In the next step of the pipeline, the major difference becomes apparent. RISC machines must spend a pipeline stage accessing operands for the instruction after (at least some of) the decoding is completed. A RISC instruction specifies two or more registers as inputs to the ALU for the operation. A stack machine does not need to fetch the data; they will be waiting on top of the stack when needed. This means that as a minimum, the stack machine can dispense with the operand fetch portion of the pipeline. Actually, the stack access can also be made faster than the register access. This is because a single-ported stack can be made smaller, and therefore faster than a dual-ported register memory.

The instruction execute portion of both the RISC and stack machine are judged to be about the same since the same sort of ALU can be used by both systems. But, even in this area some stack machines can gain an advantage over RISC machines by precomputing ALU functions based on the top-of-stack elements before the instruction is even decoded, as is done on the M17 stack machine.

The operand storage phase takes another pipeline stage in some RISC designs, since the result must be written back into the register file. This write conflicts with reads that need to take place for new instructions beginning execution, causing delays or requiring a triple-ported register file. This can require holding the ALU output in a register, then using that register in the next clock cycle as a source for the register file write operation. In contrast, the stack machine simply deposits the ALU output result in the top-of-stack register and is done. An additional problem is that extra data forwarding logic must be provided in a RISC machine to prevent waiting for the result to be written back into the register file if the ALU output is needed as an input for the next instruction. A stack machine always has the ALU output available as one of the implied inputs to the ALU.

[Figure 6.1b]
Figure 6.1(b) -- Instruction phase overlapping -- typical RISC machine.

[Figure 6.1c]
Figure 6.1(c) -- Instruction phase overlapping -- typical stack machine.

Figure 6.1b shows that RISC machines need at least three pipeline stages and perhaps four to maintain the same throughput: instruction fetch, operand fetch, and instruction execute/operand store. Also, we have noted that there are several problems inherent with the RISC approach, such as data dependencies and resource contention, that are simply not present in the stack machine. Figure 6.1c shows that stack machines need only a two-stage pipeline: instruction fetch and instruction execute.

What this all means is that there is no reason that stack machines should be any slower than RISC machines in executing instructions, and there is a good chance that stack machines can be made faster and simpler using the same fabrication technology.

6.2.3.2 System Performance

System performance is even more difficult to measure than raw processor performance. System performance includes not only how many instructions can be performed per second on straight-line code, but also speed in handling interrupts, context switches, and system performance degradation because of factors such as conditional branches and procedure calls. Approaches such as the Three-Dimensional Computer Performance technique (Rabbat et al. 1988) are better measures of system performance than the raw instruction execution rate.

RISC and CISC machines are usually constructed to execute straight-line code as the general case. Frequent procedure calls can seriously degrade the performance of these machines. The cost of procedure calls not only includes the cost of saving the program counter and fetching a different stream of instructions, but also the cost of saving and restoring registers, arranging parameters, and any pipeline breaking that may occur. The very existence of a structure called the Return Address Stack should imply how much importance stack machines place upon flow-of-control structures such as procedure calls. Since stack machines keep all working variables on a hardware stack, the setup time required for preparing parameters to pass to subroutines is very low, usually a single DUP or OVER instruction.

Conditional branches are a difficult thing for any processor to handle. The reason is that instruction prefetching schemes and pipelines depend upon uninterrupted program execution to keep busy, and conditional branches force a wait while the branch outcome is being resolved. The only other option is to forge ahead on one of the possible paths in the hopes that there is nondestructive work to be done while waiting for the branch to take effect. RISC machines handle the conditional branch problem by using a "branch delay slot" (McFarling & Hennessy 1986) and placing a nondestructive instruction or no-op, which is always executed, after the branch.

Stack machines handle branches in different manners, all of which result in a single-cycle branch without the need for a delay slot and the compiler complexity that it entails. The NC4016 and RTX 2000 handle the problem by specifying memory that is faster than the processor cycle. This means that there is time in the processor cycle to generate an address based on a conditional branch and still have the next instruction fetched by the end of the clock cycle. This approach works well, but runs into trouble as processor speed increases beyond affordable program memory speed.

The FRISC 3 generates the condition for a branch on one instruction, then accomplishes the branch with the next instruction. This is really a rather clever approach, since a comparison or other operation is needed before most branches on any machine. Instead of just doing the comparison operation (usually a subtraction), the FRISC 3 also specifies which condition code is of interest for the next branch. This moves much of the branching decision into the comparison instruction, and only requires the testing of a single bit when executing the succeeding conditional branch.

The RTX 32P uses its microcode to combine comparisons and branches into a two-instruction-cycle combination that takes the same time as a comparison instruction followed by a conditional branch. For example, the combination = 0BRANCH can be combined into a single four-microcycle (two instruction cycle) operation.

Interrupt handling is much simpler on stack machines than on either RISC or CISC machines. On CISC machines, complex instructions that take many cycles may be so long that they need to be interruptible. This can force a great amount of processing overhead and control logic to save and restore the state of the machine within the middle of an instruction. RISC machines are not too much better off, since they have a pipeline that needs to be saved and restored for each interrupt. They also have registers that need to be saved and restored in order to give the interrupt service routine resources with which to work. It is common to spend several microseconds responding to an interrupt on a RISC or CISC machine.

Stack machines, on the other hand, can typically handle interrupts within a few clock cycles. Interrupts are treated as hardware invoked subroutine calls. There is no pipeline to flush or save, so the only thing a stack processor needs to do to process an interrupt is to insert the interrupt response address as a subroutine call into the instruction stream, and push the interrupt mask register onto the stack while masking interrupts (to prevent an infinite recursion of interrupt service calls). Once the interrupt service routine is entered, no registers need be saved, since the new routine can simply push its data onto the top of the stack. As an example of how fast interrupt servicing can be on a stack processor, the RTX 2000 spends only 4 clock cycles (400 ns) between the time an interrupt request is asserted and the time the first instruction of the interrupt service routine is being executed.

Context switching is perceived as being slower for a stack machine than other machines. However, as experimental results presented later will show, this is not the case.

A final advantage of stack machines is that their simplicity leaves room for algorithm-specific hardware on customized microcontroller implementations. For example, the Harris RTX 2000 has an on-chip hardware multiplier. Other examples of application specific hardware for semicustom components might be an FFT address generator, A/D or D/A converters, or communication ports. Features such as these can significantly reduce the parts count in a finished system and dramatically decrease program execution time.

6.2.3.3 Which programs are most suitable?

The type of programs which stack machines process very efficiently include: subroutine intensive programs, programs with a large number of control flow structures, programs that perform symbolic computation (which often involves intensive use of stack structures and recursion), programs that are designed to handle frequent interrupts, and programs designed for limited memory space.


6.2.4 Program execution consistency

Advanced RISC and CISC machines rely on many special techniques that give them statistically higher performance over long time periods without guaranteeing high performance during short time periods. System design techniques that have these characteristics include: instruction prefetch queues, complex pipelines, scoreboarding, cache memories, branch target buffers, and branch prediction buffers. The problem is that these techniques cannot guarantee increased instantaneous performance at any particular time. An unfortunate sequence of external events or internal data values may cause bursts of cache misses, queue flushes, and other delays. While high average performance is acceptable for some programs, predictably high instantaneous performance is important for many real time applications.

Stack machines use none of these statistical speedup techniques to achieve good system performance. As a result of the simplicities of stack machine program execution, stack machines have a very consistent performance at every time scale. As we shall see in Chapter 8, this has a significant impact on real time control applications programming.



Phil Koopman -- koopman@cmu.edu