Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.

Chapter 6. Understanding Stack Machines


6.5 INTERRUPTS AND MULTI-TASKING

There are three components to the performance of processing interrupts. The first component is the amount of time that elapses between the time that an interrupt request is received by the processor and the time that the processor takes action to begin processing the interrupt service routine. This delay is called interrupt latency.

The second component of interrupt service performance is interrupt processing time. This is the amount of time that the processor spends actually saving the machine state of the interrupted job and diverting execution to the interrupt service routine. Usually the amount of machine state saved is minimal, on the presumption that the interrupt service routine can minimize costs by saving only those additional registers that it plans to use. Sometimes, one sees the term "interrupt latency" used to describe the sum of these first two components.

The third component of interrupt service performance is what we shall call state saving overhead. This is the amount of time taken to save machine registers that are not automatically saved by the interrupt processing logic, but which must be saved in order for the interrupt service routine to do its job. The state saving overhead can vary considerably, depending upon the complexity of the interrupt service routine. In the extreme case, state saving overhead can involve a complete context switch between multi-tasking jobs.

Of course, the costs of restoring all the machine state and returning to the interrupted routine are a consideration in determining overall system performance. We shall not consider them explicitly here, since they tend to be roughly equal to the state saving time (since everything that is saved must be restored), and are not as important in meeting a time-critical deadline for responding to an interrupt.


6.5.1 Interrupt response latency

CISC machines may have instructions which take a very long time to execute, degrading interrupt response latency performance. Stack machines, like RISC machines, can have a very quick interrupt response latency. This is because most stack machine instructions are only a single cycle long, so at worst only a few clock cycles elapse before an interrupt request is acknowledged and the interrupt is processed.

Once the interrupt is processed, however, the difference between RISC and stack machines becomes apparent. RISC machines must go through a tricky pipeline saving procedure upon recognizing an interrupt, as well as a pipeline restoring procedure when returning from the interrupt, in order to avoid losing information about partially processed instructions. Stack machines, on the other hand, have no instruction execution pipeline, so only the address of the next instruction to be executed needs to be saved. This means that stack machines can treat an interrupt as a hardware generated procedure call. Of course, since procedure calls are very fast, interrupt processing time is very low.
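The interrupt-as-procedure-call idea can be pictured with a minimal Python sketch (illustrative names only; this does not model any particular chip): the only state the hardware saves on interrupt entry is the return address, pushed on the return stack exactly as an ordinary subroutine call would do.

```python
# Toy stack-machine model: an interrupt is handled as a hardware
# generated procedure call -- only the return address is saved.
# All names here are illustrative, not taken from any real processor.

class StackMachine:
    def __init__(self):
        self.data_stack = []    # operand stack (no register file to save)
        self.return_stack = []  # subroutine/interrupt return addresses
        self.pc = 0             # address of the next instruction

    def call(self, target):
        """Ordinary subroutine call: push return address, jump."""
        self.return_stack.append(self.pc)
        self.pc = target

    def interrupt(self, isr_address):
        """Interrupt entry: identical to a call -- one push, one jump."""
        self.call(isr_address)

    def ret(self):
        """Return from subroutine or interrupt: one pop."""
        self.pc = self.return_stack.pop()

m = StackMachine()
m.pc = 100                 # program is about to execute address 100
m.data_stack += [1, 2, 3]  # partial results of the interrupted program
m.interrupt(900)           # vector to an ISR at address 900
m.data_stack.append(42)    # ISR works above the program's operands...
m.data_stack.pop()
m.ret()                    # ...and returns; nothing else was saved
assert m.pc == 100 and m.data_stack == [1, 2, 3]
```

The interrupted program's operands are never copied anywhere; the service routine simply works above them on the same stack.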

6.5.1.1 Instruction restartability

There is one possible problem with stack machine interrupt response latency. That is the issue of streamed instructions and microcoded loops.

Streamed instructions are used to repetitively execute an operation, such as writing the top data stack element to memory. These instructions are implemented using an instruction repeat feature on the NC4016 and RTX 2000, an instruction buffer on the M17, and microcoded loops on the CPU/16 and RTX 32P. These features are very useful since they can be used to build efficient string manipulation primitives and stack underflow/overflow service routines. The problem is that, in most cases, these instructions are also non-interruptible.

One solution is to make these instructions interruptible with extra control hardware, which may increase processor complexity quite a bit. A potentially hard problem that non-stack processors have with this solution is the issue of saving intermediate data results. With a stack processor this is not a problem, since intermediate results are already present on a stack, which is the normal mechanism for saving state during an interrupt.
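The point about intermediate results can be made concrete with a sketch (purely illustrative Python, not any processor's actual microcode): a "streamed store" keeps its loop state -- the target address and remaining count -- on the data stack, so the operation can be suspended mid-stream by an interrupt and resumed afterwards with no hidden state.

```python
# Sketch of a restartable "streamed store": the loop state (address,
# count) lives on the data stack, so an interrupt can be taken between
# steps and the instruction resumed afterwards.  Illustrative only.

memory = [0] * 16

def streamed_store(stack, budget):
    """Pop (value, addr, count) and store value into count cells
    starting at addr, but stop after `budget` steps (simulating an
    interrupt) by pushing the remaining state back on the stack."""
    count, addr, value = stack.pop(), stack.pop(), stack.pop()
    while count and budget:
        memory[addr] = value
        addr, count, budget = addr + 1, count - 1, budget - 1
    if count:  # interrupted mid-stream: state goes back on the stack
        stack += [value, addr, count]
    return count == 0

stack = [7, 4, 6]                        # store 7 into memory[4..9]
done = streamed_store(stack, budget=2)   # "interrupt" after 2 stores
assert not done and stack == [7, 6, 4]   # resume state held on the stack
done = streamed_store(stack, budget=99)  # resume after the interrupt
assert done and memory[4:10] == [7] * 6
```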


6.5.2 Lightweight interrupts

Let us examine three different degrees of state saving required by different interrupt categories: fast interrupts, lightweight threads for multi-tasking, and full context switching.

Fast interrupts are the kind most frequently seen at run time. These interrupts do things such as add a few milliseconds to the time-of-day counter, or copy a byte from an input port to a memory buffer. When conventional machines handle this kind of interrupt, they must usually save two or three registers in program memory to create working room in the register file. In stack machines, absolutely no state saving is required. The interrupt service routine can simply push its information on top of the stack without disturbing information from the program that was interrupted. So, for fast service interrupts, stack machines have zero state saving overhead.

Lightweight threads are tasks in a multi-tasking system which have a similar execution strategy as the interrupts just described. They can reap the benefits of multi-tasking without the cost of starting and stopping full-fledged processes. A stack machine can implement lightweight threads simply by requiring that each task run a short sequence of instructions when invoked, then relinquish control to the central task manager. This can be called non-preemptive, or cooperative task management. If each task starts and stops its operation with no parameters on the stack, then there is no overhead for context switches between tasks. The cost for this method of multi-tasking is essentially zero, since a task only relinquishes its control to the task manager at a logical breaking point in the program, where the stack probably would have been empty anyway.

From these two examples, we can see that interrupt processing and lightweight thread multi-tasking are very inexpensive on stack processors. The only issue that remains, then, is that of full-fledged, preemptive multi-tasking accomplished with context switching.
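The cooperative scheme can be sketched with Python generators standing in for lightweight threads (an analogy only -- the names and structure here are assumptions, not stack hardware): each task runs a short burst of work and relinquishes control at a point where it carries no pending operands.

```python
# Cooperative ("non-preemptive") multi-tasking sketch: each task runs
# a short burst and relinquishes control at a clean breaking point, so
# no operand state is carried across the switch.  Python generators
# stand in for lightweight threads here.

def counter_task(name, log, n):
    for i in range(n):
        log.append((name, i))  # a short burst of work
        yield                  # relinquish control; "stack" is empty here

log = []
tasks = [counter_task("A", log, 2), counter_task("B", log, 2)]

# Round-robin task manager: the "context switch" saves and restores
# nothing beyond each task's own resume point.
while tasks:
    task = tasks.pop(0)
    try:
        next(task)
        tasks.append(task)     # still alive: back of the queue
    except StopIteration:
        pass                   # task finished

assert log == [("A", 0), ("B", 0), ("A", 1), ("B", 1)]
```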


6.5.3 Context switches

The perception that stack machines are slow at context switching is usually based on having to save a tremendous amount of stack buffer space into program memory. This idea that stack machines are any worse at multi-tasking than other machines is patently false.

Context switching is a potentially expensive operation on any system. On RISC and CISC computers with cache memory, context switching can be more expensive than the manufacturers would have one believe, as a result of hidden performance degradation caused by increased cache misses after the context switch. To the extent that RISC machines use large register files, they face exactly the same problems that are faced by stack machines. An added disadvantage of RISC machines is that their random access to registers dictates saving all registers (or adding complicated hardware to detect which registers are in use), whereas a stack machine can speedily save only the active area of the stack buffer.

6.5.3.1 A context switching experiment

Table 6.7 shows data gathered from a trace-driven simulation of the number of memory cycles spent saving and restoring data stack elements for Forth programs in a context switching environment. The programs simulated were Queens, Hanoi, and a Quick-sort program. Small values of N were used for Queens and Hanoi in order to keep the running time of the simulator reasonable. Both the effects of stack overflow and underflow as well as context switching were measured, since they interact heavily in such an environment.

Table 6.7. Memory cycles expended for data stack spills for different buffer sizes and context swapping frequencies.



Buffer
Size     timer=100  timer=500  timer=1000  timer=10000 
2           ...         ...         16124       ...
4           ...         9924        9524        ...
8           ...         3150        ...         ...
12          ...         ...         3068        ...
16          11602       2642        ...         632
20          12886       3122        1846        626
24          13120       2876        1518        330
28          14488       3058        1584        242
32          15032       3072        1556        124
36          15458       3108        1568        82

Table 6.7(a) Page-managed buffer management.

Buffer
Size     timer=100  timer=500  timer=1000  timer=10000 
2           26424       24992       24798       24626
4           11628       8912        8548        8282
8           7504        3378        2762        2314
12          6986        1930        1286        630
16          7022        1876        1144        322
20          7022        1852        1084        180
24          7022        1880        1066        124
28          7022        1820        1062        90
32          7022        1828        1060        80
36          7022        1822        1048        80

Table 6.7(b) Demand-fed buffer management.

[Figure 6.4]
Figure 6.4 -- Overhead for page managed stack.

Table 6.7a and Figure 6.4 show the results for a page-managed stack. The notation "xxx CLOCKS/SWITCH" indicates the number of clock cycles between context switches. At 100 clock cycles between context switches, the number of memory cycles expended on managing the stack decreases as the buffer size increases. This is because of the effects of a reduced spilling rate while the program accesses the stack. As the buffer size increases beyond 8 elements, however, the memory traffic increases since the increasingly large buffers are constantly copied in and out of memory on context switches.

Notice how the program behaves at 500 cycles between context switches. Even at this relatively high rate (which corresponds to 20 000 context switches per second for a 10 MHz processor -- an excessively high rate in practice), the cost of context switching is only about 0.08 clocks per instruction for a stack buffer size greater than 12. Since in this experiment each instruction averaged 1.680 clocks without context switching overhead, this only amounts to a 4.7% overhead. At 10 000 cycles between context switches (a millisecond between context switches), the overhead is less than 1%.

How is it possible to have such low overhead? One reason is that the average stack depth is only 12.1 elements during the execution of these three heavily recursive programs. That means that, since there is never very much information on the stack, very little information needs to be saved on a context switch. In fact, compared to a 32-register RISC machine, the stack machine simulated in this experiment actually has less state to save on a context switch.
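The context-switch rates quoted for the 10 MHz processor follow from simple arithmetic, which can be checked directly:

```python
# Quick arithmetic check of the context-switch rates quoted above
# for a 10 MHz (10 000 000 cycles/second) processor.
clock_hz = 10_000_000

switches_per_sec = clock_hz / 500   # one switch every 500 cycles
assert switches_per_sec == 20_000   # "20 000 context switches per second"

period_sec = 10_000 / clock_hz      # one switch every 10 000 cycles
assert period_sec == 0.001          # "a millisecond between switches"
```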

[Figure 6.5]
Figure 6.5 -- Overhead for demand fed managed stack.

Table 6.7b and Figure 6.5 show the results of the same simulation run using a demand-fed stack management algorithm. In these results, the rise on the 100-cycle-interval curve when more than 12 elements are in the stack buffer is almost nonexistent. This is because the stack was not refilled when restoring the machine state, but rather was allowed to refill during program execution in a demand-driven fashion. For reasonable context switching frequencies (less than 1000 per second), the demand-fed strategy is somewhat better than the paged strategy, but not by an overwhelming margin.
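The qualitative difference between the two policies can be sketched with a toy cost model (an illustration of the idea only, not the simulator that produced Table 6.7): the page-managed scheme pays to save and restore every buffered element at each switch, while the demand-fed scheme refills only the elements the program actually pops back down to.

```python
# Toy cost model for the two stack-buffer policies, counting memory
# cycles per context switch (one cycle per element moved).
# Illustrative assumptions, not the experiment's actual cost model.

def page_managed_switch_cost(elements_in_buffer):
    # save every buffered element at switch-out, restore all at switch-in
    return 2 * elements_in_buffer

def demand_fed_switch_cost(elements_in_buffer, elements_popped_later):
    # save every buffered element, but refill only what is later touched
    return elements_in_buffer + elements_popped_later

# With 12 elements buffered but only 3 popped before the task pushes
# fresh data, the demand-fed policy moves fewer elements:
assert page_managed_switch_cost(12) == 24
assert demand_fed_switch_cost(12, 3) == 15
```

When the program later consumes its entire saved stack, the two policies converge, which is consistent with the modest margin seen in the measurements.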

6.5.3.2 Multiple stack spaces for multi-tasking

There is an approach that can be used with stack machines which can eliminate even the modest costs associated with context switching that we have seen. Instead of using a single large stack for all programs, high-priority/time-critical portions of a program can be assigned their own stack space. This means that each process uses a stack pointer and stack limit registers to carve out a piece of the stack for its use. Upon encountering a context switch, the process manager simply saves the current stack pointer for the process, since it already knows what the stack limits are. When the new stack pointer value and stack limit registers are loaded, the new process is ready to execute. No time at all is spent copying stack elements to and from memory.

The amount of stack memory needed by most programs is typically rather small. Furthermore, it can be guaranteed by design to be small in short, time-critical processes. So, even a modest stack buffer of 128 elements can be divided up among four processes with 32 elements each. If more than four processes are needed by the multi-tasking system, one of the buffers can be designated the low priority scratch buffer, which is to be shared using copy-in and copy-out among all the low priority tasks.
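This partitioning scheme can be sketched as follows (illustrative Python; the register names `base`, `limit`, and `sp` and the buffer size are assumptions for the sketch, not a description of any shipped design). A context switch loads three values and copies no stack elements.

```python
# Sketch of partitioning one on-chip stack buffer among processes:
# each process owns a [base, limit) window and a saved stack pointer,
# so a context switch is O(1) -- no elements move to or from memory.

BUFFER_SIZE = 128
buffer = [0] * BUFFER_SIZE

class Process:
    def __init__(self, base, limit):
        self.base, self.limit = base, limit  # this process's window
        self.sp = base                       # saved stack pointer

# Four processes, 32 elements each
processes = [Process(i * 32, (i + 1) * 32) for i in range(4)]

class CPU:
    def switch_to(self, proc):
        """Context switch: just load SP and limit registers."""
        self.proc = proc
        self.sp, self.limit = proc.sp, proc.limit

    def push(self, value):
        assert self.sp < self.limit, "stack overflow for this window"
        buffer[self.sp] = value
        self.sp += 1
        self.proc.sp = self.sp  # modelled eagerly; hardware keeps SP

cpu = CPU()
cpu.switch_to(processes[0]); cpu.push(11)
cpu.switch_to(processes[1]); cpu.push(22)
cpu.switch_to(processes[0]); cpu.push(33)  # resumes where it left off
assert buffer[0:2] == [11, 33] and buffer[32] == 22
```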

From this discussion we can see that the notion that stack processors have too large a state to save for effective multi-tasking is a myth. In fact, in many cases stack processors can be better at multi-tasking and interrupt processing than any other kind of computer. Hayes and Fraeman (1989) have independently obtained results for stack spilling and context switching costs on the FRISC 3 that support these conclusions.
