o1i's Planet AROS

September 08, 2015


Amiga Parallel Port: How fast can you go?

In my plipbox project a fairly fast 8-bit AVR MCU running at 16 MHz was connected to the Amiga’s parallel port to transfer incoming and outgoing IP packets from/to the attached Ethernet controller. A protocol on the parallel port was devised to quickly transmit the bytes in both directions. In version 0.6 a data rate of up to 240 KB/s was achieved… The question now is whether this is the top speed we can get, or whether the parallel port is capable of more.

This blog post shows the results of the experiments I performed with the parallel port on my Amiga. It tries to show the different classes of transfers possible on this port and gives the maximum achievable speed of each class.

Since the available documents and data sheets all lack an exact description of the I/O part on the peripheral side of the device, this blog post is also an attempt to document this undocumented side of the parallel port (or: “What you always wanted to know about your CIA 8520 and never dared to ask”).

1. Introduction

1.1 The CIA 8520 and the Parallel Port

The Amiga has two 8520 CIA (Complex Interface Adapter) chips, called CIA A and CIA B. Each CIA has two I/O ports (Port A, Port B) of 8 bits each, and every bit can be individually configured as a peripheral input or output.

The parallel port’s pins fall into three groups:

  • 8 Data Pins (In or Out), Pin 2-9
  • 1 Strobe Line (Pin 1), 1 Ack Line (Pin 10): Hardware Handshake
  • 3 Control Lines (BUSY, POUT, SELECT) (Pin 11, 12, 13)

Those pins are connected to the two CIAs as follows:

  • CIA A, Port B: 8 Data Pins
  • CIA A, PC and FLAG pins: Strobe and Ack for Hardware Handshake
  • CIA B, Port A, Bits 0,1,2: BUSY, POUT, SELECT

While CIA A Port B handles the data pins, CIA B Port A handles the 3 control lines. Note that the other bits of this port are connected to serial port lines.

In the Amiga memory map both CIAs are mapped to different memory ranges. Here is an excerpt with the registers useful for parallel port programming. (See the Amiga Hardware Reference Manual, Appendix F for a complete list)

Address   Name  Default  Description
BFE101    prb    0xff    Parallel port
BFE301    ddrb   0x00    Direction for port B (BFE101);1=output (can be in or out)
BFD000    pra    0xff    /DTR  /RTS  /CD   /CTS  /DSR   SEL   POUT  BUSY
BFD200    ddra   0xc0    Direction for Port A (BFD000);1 = output (set to 0xFF)

The data direction registers (DDR) of both ports set each bit of the port to either input (0) or output (1). The logic of the data pins is not inverted, i.e. a 1 in a register is a high (5V) value on the line.

The default values show the setup after the Amiga has booted: all parallel port pins are inputs.
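The register table above can be written down as C constants. The following is only a sketch: the macro and function names are made up for illustration, and the helper merely computes a new direction value for the three control lines while leaving the serial port bits (3..7) of CIA B Port A untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Register addresses as listed in the table above. */
#define CIAA_PRB   0xBFE101u  /* CIA A Port B: parallel data    */
#define CIAA_DDRB  0xBFE301u  /* direction for CIA A Port B     */
#define CIAB_PRA   0xBFD000u  /* CIA B Port A: control + serial */
#define CIAB_DDRA  0xBFD200u  /* direction for CIA B Port A     */

/* Parallel port control lines in CIA B Port A (bits 0..2). */
#define PAR_BUSY       0x01u
#define PAR_POUT       0x02u
#define PAR_SEL        0x04u
#define PAR_CTRL_MASK  (PAR_BUSY | PAR_POUT | PAR_SEL)

/* Hypothetical helper: returns the DDRA value in which exactly the
   control lines named in out_lines are outputs (1), the other control
   lines are inputs (0), and the serial bits keep their old direction. */
uint8_t ddra_with_ctrl_outputs(uint8_t old_ddra, uint8_t out_lines)
{
    assert((out_lines & ~PAR_CTRL_MASK) == 0u);
    return (uint8_t)((old_ddra & ~PAR_CTRL_MASK) | out_lines);
}
```

With the boot default of 0xc0, making SEL an output would give 0xc4. On real hardware the result would be written to the register at CIAB_DDRA.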

1.2. The CIAs in the Amiga system

Compared to the MC680x0 CPU of an Amiga, the CIA is a fairly slow device. It can handle a clock rate of up to 1 or 2 MHz while the CPU runs at 7 MHz or more. The MC68000 CPU architecture offers a special access mode for such devices that is based on a slower clock called the E clock. It runs at 1/10th of the CPU clock speed.

Let’s look at some numbers:

  • CPU Clock F_CPU =  7.16 MHz (NTSC)  7.09 MHz (PAL)
  • E Clock F_ECLK = F_CPU / 10 = 716 kHz (NTSC)  709 kHz (PAL)
  • E cycle length t_ECLK = 1.40 us (NTSC) 1.41 us (PAL)

This means the CPU accesses the CIA at a rate of at most F_ECLK. An access is a read or write of a register. So when we transfer data we either read or write the data register of CIA A Port B. Even if we access only this register, the top speed we can ever achieve on this port is one byte per E cycle, or 716/709 KB/s max!
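The arithmetic behind this limit can be sketched in a few lines of C. The clock values are the rounded ones from the list above (the exact crystal frequencies differ slightly), the function names are made up for illustration, and 1 KB is counted as 1000 bytes. The second parameter gives the limit when a byte costs more than one CIA access.

```c
#include <assert.h>
#include <stdint.h>

/* CPU clocks as rounded in the text. */
#define F_CPU_NTSC_HZ 7160000u
#define F_CPU_PAL_HZ  7090000u

/* The E clock runs at 1/10th of the CPU clock. */
uint32_t e_clock_hz(uint32_t f_cpu_hz)
{
    return f_cpu_hz / 10u;
}

/* Upper bound in bytes per second when every byte costs
   e_cycles_per_byte CIA accesses (one access per E cycle). */
uint32_t max_bytes_per_sec(uint32_t f_cpu_hz, uint32_t e_cycles_per_byte)
{
    assert(e_cycles_per_byte > 0u);
    return e_clock_hz(f_cpu_hz) / e_cycles_per_byte;
}
```

For example, max_bytes_per_sec(F_CPU_NTSC_HZ, 1u) yields 716000 bytes per second, the 716 KB/s quoted above.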

If you look at the data sheet of the CIA’s ancestor, the MOS 6526, you will see that the E clock interval is divided into two sections: a HI and a LO range. In the HI range (4/10 of t_ECLK) the CPU accesses the device; in the LO range (6/10 of t_ECLK) the device starts to realize the change made by the CPU, i.e. if a port bit is set to output, it drives the pin low or high accordingly. On a read, the data has to be stable on the port before the CPU accesses it in the HI range.

Here are the numbers (naming according to the 6526 data sheet):

  • t_CHW (Clock High Width) = 4 / 10 * t_ECLK = 560 ns (NTSC)  564 ns (PAL)
  • t_CLW (Clock Low Width) = 6 / 10 * t_ECLK = 840 ns (NTSC)  846 ns (PAL)

Some interesting limits of the 6526 chip:

  • t_PD (Output Delay on Write): max 1 us
  • t_PS (Port Setup Time on Read): min 300 ns

Since t_PD can be as long as 1 us, a port update may take most of the 1.4 us E cycle: it can outlast the LO range (840 ns) and overlap the next HI range for CPU access.

  <- t_ECLK -> 
   ____        ____
--|    |______|    |______|
   CPU |------->
 Write    t_PD
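The phase widths and the overlap shown in the diagram can be checked with a short calculation. This is a sketch under one assumption taken from the diagram: the output delay t_PD starts at the end of the HI range. The function names are invented for illustration, and all times are in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* HI range: 4/10 of the E cycle (t_CHW). */
uint32_t clock_high_ns(uint32_t t_eclk_ns)
{
    assert(t_eclk_ns <= UINT32_MAX / 4u);   /* guard against overflow */
    return t_eclk_ns * 4u / 10u;
}

/* LO range: 6/10 of the E cycle (t_CLW). */
uint32_t clock_low_ns(uint32_t t_eclk_ns)
{
    assert(t_eclk_ns <= UINT32_MAX / 6u);   /* guard against overflow */
    return t_eclk_ns * 6u / 10u;
}

/* How far a port update of t_pd_ns reaches into the next HI range;
   0 if it completes within the LO range. */
uint32_t overlap_into_next_high_ns(uint32_t t_eclk_ns, uint32_t t_pd_ns)
{
    uint32_t low = clock_low_ns(t_eclk_ns);
    return (t_pd_ns > low) ? (t_pd_ns - low) : 0u;
}
```

With the NTSC cycle length of 1400 ns and the worst-case t_PD of 1000 ns, the update would reach 160 ns into the next HI range.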

1.3 Hardware Handshake with Strobe

The parallel port offers two pins for hardware handshaking called Strobe and Ack. The hardware handshake signals the external peripheral whenever new data has been written to (or read from!) the external port of the CIA. After the data is valid, the CIA sends a short active-low pulse on Strobe to signal the receiver. The receiver then reads the data byte and acknowledges the transfer by pulling Ack low. The Amiga detects the Ack pulse either by polling or by interrupt and then transmits the next byte.

While strobing (i.e. generating the Strobe pulse) after a Port B read/write happens automatically, the Ack pulse that confirms it has to be generated explicitly by the peripheral.

Let’s look at a time sheet with some E clock cycles (.H and .L being the high and low range of the cycle):

ECycle   CPU                 CIA Port B     Strobe
0.H      Write Port B=42     -              H
0.L      -                   42!            H
1.H      -                   42!            H
1.L      -                   42             L
2.H      -                   42             L
2.L      -                   42             H

This example writes a byte with value 42 to Port B. The CIA realizes this value in the next two sub cycles (denoted with !), and beginning with 1.L a stable value of 42 is available on the output of Port B. Then Strobe goes low for a full E cycle.

We see that Strobe has to be delayed, otherwise a peer reading on the falling edge of Strobe won’t see a stable data signal.

The interesting questions that now arise are:

  • What is the strobe delay in cycles of the 8520 CIA on the Amiga?
  • What is the strobe width in cycles?
  • How fast can we transfer data and still get valid strobes?

The answer to the first one can be found in the Amiga Hardware Reference Manual (Appendix F, section “Handshaking”):

PC will go low on the third cycle after a port B access.

But the other ones are not answered in the docs. So it’s time for some experiments…

2. My Experiments

My setup is an Amiga 500 with an ACA500 and an ACA1230/33 accelerator attached. A plipbox device running firmware version 0.6 was attached unless otherwise stated.

2.1 Setup Port

Using ASMone I quickly hacked some code to set the parallel port to data output and all lines to low/zero:

  lea $bfe101,a0 ; parallel port data
  lea $bfe301,a1 ; parallel port ddr
  move.b #$ff,(a1) ; all bits to output
  move.b #$00,(a0) ; set all lines to low/zero

2.2 Writing a byte

With the port set up, let’s conduct the first experiment: write a $ff byte to the parallel port and capture the lines with a logic analyzer. The code:

  lea $bfe101,a0
  move.b #$ff,d0

  move.b d0,(a0)

The scope triggered on falling edge of strobe:


Write $ff (Port was $00)


The first interesting fact we see here is the strobe width: it’s 2.813 us or 2 * t_ECLK!

So the strobe width of the CIA 8520 is (in contrast to the 6526’s 1 E) 2 E long! t_SW = 2 E

Let’s repeat the write. Now write a $00 to a port that has been initialized with $ff:


Write $00 (Port was $ff)



The notable difference here is the point in time at which the port signal changes:

  • LO->HI: late at end of cycle
  • HI->LO: early at the beginning of the cycle

Note: the markers are aligned to the beginning of the strobe (falling edge) in steps of 1 E (i.e. 1.4 us).

If we assume that the strobe starts with the LO range of the E cycle, then the markers and the beginning of the strobe denote the HI->LO transition inside an E cycle.

If we compare these lines with the typical E cycle diagram of a data sheet, they mark the center of the cycles and not the borders!

              ^ visible marker              ^ strobe falling edge
              |                             |
|<-- t_CHW -->|<--- t_CLW --->|<-- t_CHW -->|<--- t_CLW --->|...
|------- E cycle 0 -----------|------- E cycle 1 -----------|

              | H>L changes          L>H    |  signal changes

With this shift in mind we can conclude that the actual CPU write of this byte happened just to the left of the first marker in the lower image (i.e. in t_CHW).

Let’s write down the strobe sequence in a time sheet, first for the $00 write:

EClock   CPU      PortB   Strobe   Annotation
0.H      w00      ff      H
0.L      -        00!     H        realizing 00 on port
1.H      -        00*     H        already 00 stable on port
1.L      -        00      H        \ safety range
2.H      -        00      H        /
2.L      -        00      L        strobe begin
3.H      -        00      L        \ strobe width: 2 E cycles
3.L      -        00      L        /
4.H      -        00      L        strobe end
4.L      -        00      H

and the $ff write:

EClock   CPU      PortB   Strobe   Annotation
0.H      wff      00      H
0.L      -        ff!     H        realizing ff on port
1.H      -        ff!     H        needs this range, too
1.L      -        ff      H
2.H      -        ff      H
2.L      -        ff      L
3.H      -        ff      L
3.L      -        ff      L
4.H      -        ff      L
4.L      -        ff      H

We can see the strobe starting in the third cycle as stated in the docs. It keeps a safety range of one E cycle after setting up the values before beginning the strobe.

2.3 Writing multiple bytes in a row

What happens to the strobe signal if we write two or more bytes in a row (i.e. one byte in each E cycle)?

Let’s see and write two bytes (port again set up with $00):

  lea $bfe101,a0
  move.b #$ff,d0
  moveq  #$00,d1

  move.b d0,(a0) ; write in 0.H
  move.b d1,(a0) ; write in 1.H


Write $ff and $00 (Port was $00)


The time sheet:

EClock   CPU      PortB   Strobe   Annotation
0.H      wff      00      H
--- Marker
0.L      -        ff!     H        realizing ff on port (slow)
1.H      w00      ff!     H        needs this range, too
--- Marker
1.L      -        00!     H        realizing 00 on port (fast)
2.H      -        00*     H
2.L      -        00      L        regular strobe begin  (1st E)
3.H      -        00      L
3.L      -        00      L                              (2nd E)
4.H      -        00      L        regular strobe end
4.L      -        00      L        extended strobe begin (3rd E)
5.H      -        00      L        extended strobe end
5.L      -        00      H

What do we see?

  • A strobe of length 3 * E! The strobes of the first and the second write are somehow merged.
  • The $ff write happens right before the left marker and is established on the port between the two markers. (LO->HI transition = slow)
  • The $00 write happens right before the right marker and is established right after the marker. (HI->LO transition = fast)
  • The $ff value is only valid at the end of the interval between the two markers!

Let’s write 4 bytes in a row:

Write $ff, $00, $ff, $00 (Port was $00)

Interesting Result:

  • Still a strobe of 3E length! The strobe does not get any longer, no matter how many bytes you send. It seems that the strobe logic gets stuck.
  • The data values $ff, $00, and $ff are valid at the end of the E ranges around the falling edge of strobe

Now 4 bytes starting with $00 (port was $ff):

Write $00,$ff,$00,$ff (Port was $00)


Same result here:

  • A 3E Strobe and nothing more!
  • Data again valid at the end of the E cycles around falling edge of strobe

To sum up this experiment: while we can write to the CIA from the Amiga at E cycle speed, the resulting strobe signals are no longer usable! However, all data values appear on the port lines (for fragments of the E cycle).

Let’s call these non-stop writes to the CIA 1E transfers, and let’s now experiment with transfers that take more E clock cycles per byte.

The data transfer speed of 1E transfers is the E clock speed, i.e. 716/709 KB/s.

What is the lowest xE transfer that generates useable strobes?

2.4 2E Transfers

OK, we need to pause between the data writes from the Amiga. To be precise, we want to wait for multiples of the E clock. The best way to “wait” for (or rather waste) an E clock cycle is to actually perform a register access to one of the CIAs. Make sure the access has no side effects: reading a Port A register (which does not trigger a strobe) already does the trick.

The code for a 2E transfer now does a write (first E cycle) and one pause (second E cycle) and looks like this:

  lea $bfe101,a0
  lea $bfd000,a1 ; let's use CIAB Port A to "waste" E cycles
  move.b #$ff,d0
  moveq  #$00,d1

  move.b d0,(a0) ; 1E write in 0.H
  tst.b  (a1)    ; 1E waste cycle by reading register (1.H)
                 ; =2E transfer per byte

  move.b d1,(a0) ; write in 2.H
  tst.b  (a1)    ; waste E cycle (3.H)


2E Transfer writing $55,$aa,$55,$aa,... (port was $ff)

2E Transfer writing $aa,$55,$aa,$55,… (port was $aa)


  • Strobe is back at 2E width. But only the first one is visible! All others are gone :(
  • The port data is valid for a complete 1E range (the other 1E range is needed for port setup)
  • Note: instead of reading a “waste” value in the second E access to the CIA, you can also perform a single control signal write. In the picture above I toggled the SEL signal. This gives you the exact location of the 1.H, 3.H, … accesses and can be used on the receiver side as a sync signal! (Very useful since strobe is broken here)
  • Note 2: with only one spare E cycle, toggling SEL (or POUT, BUSY) means you can only write Port A, but not read it beforehand. Therefore, you cannot update only the parallel port bits. You have to ignore the serial line bits in the same port and always write them with a constant value -> the serial lines don’t work with 2E transfers! In other words: there is no system-friendly way to implement it…
  • The data transfer speed of 2E is half of 1E: 354.5 – 358 KB/s

Time Sheet:

EClock   CPU      PortB   Strobe   Annotation
0.H      w55      ff      H
0.L      -        55!     H        realizing 55 on port
1.H      <waste>  55!     H        needs this range, too
1.L      -        55      H        
2.H      waa      55      H
2.L      -        aa!     L        regular strobe begin
3.H      <waste>  aa!     L
--- Marker
3.L      -        aa      L
4.H      w55      aa      L        regular strobe end
4.L      -        55!     H
5.H      -        55!     H
--- Marker
5.L      -        55      H

2.5 3E Transfers

Since 2E transfers still have broken strobe output, let’s add another “wasted” cycle and set up a 3E transfer. With two spare E cycle accesses in our transfer loop we can also use the two cycles to perform a read/modify/write operation on a register. E.g. a bclr (bit clear) or bset (bit set) operation can modify a control line of the parallel port, which then serves as a “clock” line for our data transfer.

Code Example:

  lea $bfe101,a0
  lea $bfd000,a1 ; let's use CIAB Port A to "waste" E cycles
  move.b #$ff,d0
  moveq  #$00,d1

  move.b d0,(a0) ; 1E write data
  tst.b  (a1)    ; 2E waste cycles
  tst.b  (a1)

  move.b d1,(a0) ; 1E write data
  bclr   d1,(a1) ; 2E cycles to clear "clock" line (bit 0)

  move.b d1,(a0) ; 1E write data
  bset   d1,(a1) ; 2E cycles to set "clock" line

A scope plot of a 3E transfer:

3E Transfer with $aa,$55,$aa writes (Port was $55)

3E Transfer with $55,$aa,$55,$aa writes (Port was $00)

  • Ah! Now we have valid strobes! Makes sense: the timing per byte is now 3E, with 2E for the (fixed) strobe width and 1E for the spacing between strobes.
  • Data transfer speed for a 3E transfer is a third of the 1E speed: 236 – 239 KB/s  
  • The current 0.6 plipbox implementation uses a 3E transfer method and achieves the calculated limit of about 240 KB/s.
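The access pattern of a 3E write loop can also be sketched in C. This is not the plipbox code: the function name and the choice of bit 0 as the “clock” line are assumptions for illustration, and the loop toggles the clock on every byte instead of alternating bclr and bset. A C compiler gives no cycle-exact guarantees either; the sketch only shows that each byte costs three CIA accesses (one data write plus a read and a write of the control port).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLK_BIT 0x01u  /* assumption: bit 0 of CIA B Port A as "clock" */

/* Send n bytes; data_port and ctrl_port point to the two CIA registers
   (or to plain variables when trying the code on a host machine). */
void send_3e(volatile uint8_t *data_port, volatile uint8_t *ctrl_port,
             const uint8_t *buf, size_t n)
{
    assert(data_port != NULL && ctrl_port != NULL);
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        *data_port = buf[i];                    /* 1st access: data write   */
        uint8_t ctrl = *ctrl_port;              /* 2nd access: read control */
        *ctrl_port = (uint8_t)(ctrl ^ CLK_BIT); /* 3rd access: toggle clock */
    }
}
```

On the Amiga the two pointers would be (volatile uint8_t *)0xBFE101 and (volatile uint8_t *)0xBFD000. Because the control port is read before it is written back, the serial line bits keep their values, which is what the second spare E cycle buys compared to a 2E transfer.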

Time sheet 3E Transfer:

EClock   CPU      PortB   Strobe   Annotation
0.H      w55      ff      H
0.L      -        55!     H        realizing $55 on port
1.H      <waste1> 55!     H        needs this range, too
1.L      -        55      H        
2.H      <waste2> 55      H
2.L      -        55      L        regular strobe begin
3.H      waa      55      L
3.L      -        aa!     L
4.H      <w1>     aa!     L        regular strobe end
4.L      -        aa      H
5.H      <w2>     aa      H
5.L      -        aa      L        next strobe begin
6.H      w55      aa      L
6.L      -        55!     L
7.H      <w1>     55!     L        next strobe end
7.L      -        55      H

Note: you can see that the first value (here $55) is valid during the H->L falling edge of the first strobe. That’s the point in time when the external device reads the value.

You can now continue to add waste cycles and introduce 4E, 5E, … transfers. But they do not really make sense as they only move the strobes further apart. You cannot really use the extra E cycles…

Here is an example of a 4E transfer:

4E Transfer writing $55,$aa,$55,$aa (Port was $00)


Note the 4E strobe cycle: 2E strobe and 2E spacing between strobes.

2.6 What about read transfers?

In the above experiments I always talked about writing bytes to the port. But what changes if we want to read data with 1E, 2E, or 3E transfers?

  • Strobing is essentially the same: the strobe is generated after a read operation.
  • The device feeding the port needs to set up the data to be read before the .H range in which the CPU performs the read operation

Here is a time sheet of a 3E read:

EClock   CPU      PortB   Strobe   Annotation
-1.H     -        11!     H        (safe setup time)
-1.L     -        11!     H        device sets up data on PortB
0.H      r11      11      H        CIA reads PortB
0.L      -        11      H
1.H      <waste1> 11      H
1.L      -        11      H        
2.H      <waste2> 22!     H        (safe setup time)
2.L      -        22!     L        device sets new data on PortB
3.H      r22      22      L        CIA reads PortB
3.L      -        22      L
4.H      <w1>     22      L        regular strobe end
4.L      -        22      H
5.H      <w2>     22      H


  • The external device needs to set up the data right before the CPU access. While the .L sub cycle before the read might suffice for a stable read, it is safer to set up the data already in the .H range before it
  • If you use a parallel port control line to “clock” the data, you can set the line before the first CPU read and start reading with the first byte.
  • If you want to use the strobes to sync your reads then you have a problem: the strobe signal arrives _after_ the read! To get in sync with this signal you must use a trick: first perform a dummy CPU read just to generate a strobe and then use this strobe to sync your device’s writes:
    • In the above time sheet we dummy read at 0.H
    • The device already sets up data 0x22
    • The CPU performs the next read at 3.H and gets 0x22
    • The device waits for the rising edge of strobe (4.H – 4.L) and sets the next data
  • Reading in 2E and even 1E gets more difficult, as in the worst case no “clock” signal is available and the device has to use a sampling pattern with a fixed E size to set up the data in time. It is still an open question whether a stable 1E transport can be written this way.
  • In most reader code the interrupts have to be disabled on the Amiga side; otherwise the CPU reads and the data set up in fixed time steps by the device can get out of step, and a CPU read returns a wrong value.

3. Summary

This (rather long) blog article shows you all the details of transferring data over the parallel port at the maximum possible speed. We discovered some interesting anomalies in strobe generation at these high transfer rates.

I introduced a new speed classification for the parallel transfer types called 1E, 2E, or 3E transfers.

The top speeds achievable with the xE transfers are:

1E: 709..716 KB/s
2E: 355..358 KB/s
3E: 236..239 KB/s

Current plipbox version 0.6 implements a 3E transfer using external control lines for clocking. I am currently experimenting with a 3E transfer using only strobes as signalling (it frees control lines for other functions). Another interesting coding exercise will be a 2E or even a 1E transfer… Now the technical background is available!

by lallafa at September 08, 2015 08:04 PM

August 31, 2015


Dopus5.91 released

About a year after the first open source release, we are happy to release a new version of Dopus 5!

Download it from http://dopus5.org or from https://sourceforge.net/p/dopus5allamigas/ in the Files section.

August 31, 2015 10:15 PM

August 25, 2015

Icaros Desktop

Accessing the Tube!

In Italy we have a motto which says, once translated into "barely English", a single image worths a thousand words. That's why I can't really stop myself from sharing the following image with you: Yes, this basically means Deadwood has made the miracle and yes, you will be playing YouTube videos on Icaros Desktop starting with next update. No more scripts to download

by Paolo Besser (noreply@blogger.com) at August 25, 2015 01:57 PM

July 28, 2015


Hard drive tab..

Not working yet (buttons not yet activated), but at least it displays the actual config.

Now if SourceForge would come back to life, I could commit all that stuff. This all will be a big commit, no chance to ever track the single changes..

by noreply@blogger.com (o1i) at July 28, 2015 02:45 PM

July 23, 2015


Happy birthday Amiga!

Happy 30th birthday, Amiga!

Your journey started 30 years ago, on 23rd July 1985,
when the first Amiga 1000 was introduced to the clueless public at a clumsy, but epic event.
This journey is never ending, still going on after 30 years.

We love you.

by noreply@blogger.com (Álmos Rajnai) at July 23, 2015 07:48 AM

July 09, 2015



The Listtree class is .. not my friend ;-). But somehow with a lot of trial and error, I managed to get it working:

Well, the nice icons are missing (I tried to get icons working, but did I already say, Listtree is not my friend?), but it is close enough to WinUAE:

by noreply@blogger.com (o1i) at July 09, 2015 01:42 PM

May 12, 2015


Combo Boxes

WinUAE in Windows has nice comboboxes to select rom and adf images:

AROS only offers Cycle and String gadgets or listviews. None of the three can give you the functions of a combobox. I tried to work around it, but ended up wasting a lot of time, as those three can’t emulate all combobox features.

Even back in gtk-mui times, I wanted a combobox custom class, so now it was time to code one:

It even works with type-ahead ;-). There are still some minor bugs left, or some bugs in other parts of the gui were added during combobox development.

So there is still progress, but as always, time is much too limited to really progress fast.

PS: Forgot to mention, I moved my development environment to Debian/64bit, so from now on, x86_64/ABI_V1 is the primary target.

by noreply@blogger.com (o1i) at May 12, 2015 01:14 PM

May 01, 2015


Spin locks and the beauty of conditional instructions


Low-level toying with multiple CPUs without proper locking mechanisms is asking for trouble. I have already seen many cryptic boot logs from native AROS on RaspberryPi2 which you simply cannot decode. This happens every time more than one core tries to speak over the serial line.

The locking primitive which we have just added to AROS is a spin lock. It does not have an owner, so one cannot re-enter it: trying to do so will result in an endless loop with no exit. The spin lock can be obtained either for reading or for writing. When the spin lock is in read mode, it can be acquired by many clients, but as long as at least one of them is holding a read lock, code that wants to switch it into write mode has to wait. When the spin lock is in write mode, it gives exclusive access to exactly one caller. Until it is released again, no other code can obtain the lock at all.

So, here it goes, the spin lock:

typedef struct {
    volatile unsigned long lock;
} spinlock_t;

#define SPINLOCK_INIT_WRITE_LOCKED  { 0x80000000 }

The spin lock comes with three default initializers for those who want to put it in some defined state into e.g. data section. The lock uses one 32-bit value which defines the state of lock:

  • lock == 0 – the lock is in its free state, everyone can lock it in either mode
  • lock > 0 – locked in READ mode. Everyone can lock it in READ mode (up to 2^31 times, then it wraps), but an attempt to lock it in WRITE mode will block until it is free.
  • lock == 0x80000000 – the lock is in WRITE mode. Further attempts to lock it in either mode will block.

The code for locking and unlocking uses the LDREX and STREX instructions which guarantee exclusive access to addressed memory. The code uses also a nice feature of ARM processors – conditional execution of instructions. Let’s look at the code – it assumes that register r0 points to the lock

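    mov       r3, #0x80000000   @ value for WRITE mode
1:  ldrex     r2, [r0]          @ load lock value, mark exclusive access
    teq       r2, #0            @ is the lock free?
    wfene                       @ no: sleep until an event or interrupt
    strexeq   r2, r3, [r0]      @ yes: try to store, r2 = 0 on success
    teq       r2, #0
    bne       1b                @ retry if lock was taken or store failed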

There is only a single loop inside. When the function finishes, the spin lock is acquired in WRITE mode. How does it work? The LDREX instruction reads the lock value into register r2 and marks exclusive access to the addressed memory. The lock value is compared against zero. If the lock value was not zero, the WFE instruction is executed (please note the “ne” suffix). It puts the CPU into sleep mode until either an interrupt occurs or an event is sent from any other core. If the lock value was zero, the WFE instruction is not executed at all. The next instruction is the conditional variant of STREX. It is executed only if the lock value equals zero (the spin lock is free; note the “eq” suffix after STREX). STREX stores register r3 at the address pointed to by register r0. If the write succeeds, i.e. exclusive access was still granted, register r2 is set to 0; if the write fails, r2 contains 1. Finally, register r2 is tested against 0 and, if it’s not zero, we jump back and repeat.

Please note that in the second comparison r2 can contain one of three values:

  • 0, if STREXeq was executed and succeeded,
  • 1, if STREXeq was executed and failed,
  • 0x80000000, if the lock was already acquired and our CPU went to sleep (WFEne).

The last case means that the CPU has received either an event (from another CPU core when it released a spin lock) or an interrupt. In both cases the CPU will re-attempt to acquire the lock. It wakes up, STREXeq is not executed, 0x80000000 is compared against 0x00000000 and, since they are not equal, the CPU branches. Nice, isn’t it?

There is one more scenario to be considered. What happens if there was an interrupt triggered between LDREX and STREX? Well, in that case AROS code needs to release the exclusive memory by either issuing a CLREX instruction (ARM v7 cpus and up) or by issuing a dummy STREX instruction to some arbitrary memory location. In that case the interrupted code will re-attempt the process of obtaining a spin lock.

Now after the locks were added and properly used, you can turn this:

[KRN:ide27 modbces 08_dritpri fl #s veacion nam00:
0 (0147300 7ff0)
nif8 1or08# 1ls @ 0x50 "ex0c.
 iKRar C
e f81a03tr: p110 .2
 41N]expansi CPlib60ry01
 815f] Core105CP2 =600001bu
libra Cor
 80e f70: 100001
e @ 0e500c:ec00
0x0001889a 8RN]41o"er2 .library@
b3e10 C1 e41 Bootstrad t.re @ 0x0"
[d0c: or9 2 cpu1 ontek.reizurc12
+ KR1a1Ca8e 2 cp 01tx @ 0pr00e3eb0
08] 0fm2948_ini4_c1r 43

into this:

[KRN:BCM2708] Initialising Multicore System
[KRN:BCM2708] bcm2708_init: Copy SMP trampoline from f800074c to 00002000 (100 bytes)
[KRN:BCM2708] bcm2708_init: Patching data for trampoline at offset 80
[KRN:BCM2708] bcm2708_init: Attempting to wake core #1
[KRN:BCM2708] bcm2708_init: core #1 stack @ 0x000b4380 (sp=0x000dc370)
[KRN:BCM2708] bcm2708_init: core #1 fiq stack @ 0x000dc390 (sp=0x000dd380)
[KRN:BCM2708] bcm2708_init: core #1 tls @ 0x000dd3a0
[KRN] Core 1 Boostrapping..
[KRN] Core 1 CPSR=600001d3
[KRN] Core 1 CPSR=60000193
[KRN] Core 1 TLS @ 0x000dd3a0
[KRN] Core 1 KernelBase @ 0x000b3ec0
[KRN] Core 1 SysBase @ 0x000b3200
[KRN] Core 1 Bootstrap task @ 0x000dd3c0
[KRN] Core 1 cpu context size 2124
[KRN] Core 1 cpu ctx @ 0x000dd460
[KRN:BCM2708] bcm2708_init_core(1)
[KRN] Core 1 operational
[KRN] Core 1 waiting for interrupts
[KRN:BCM2708] bcm2708_init: Attempting to wake core #2

by michal at May 01, 2015 06:17 PM

April 27, 2015


All your nightly are belong to us

Yay, I’ve killed all nightly builds. Sorry 😉

That was the short version. Last weekend I was busy removing some legal hacks from the AROS sources. The hack on the schedule was the commonly used ThisTask pointer in SysBase. Now, at least in my local branch of AROS for RaspberryPi, SysBase->ThisTask points to a nirvana place where all code is either happily crashing, or dead, or both. ThisTask points to NULL :)

No, it didn’t disappear completely. The ThisTask pointer has been moved to (and is used in) something similar to a thread local storage. It is local, but not local to a thread. It is local to a CPU core. On RPi2 we use four independent local storages and each of them has its own ThisTask pointer. Don’t hold your breath, it’s not SMP yet. Far from it :) The scheduler works only on CPU#0. At least for now.

The TLS is used exclusively by kernel.resource, which knows best about the low-level part of the system. Exec has gained two new architecture-specific macros, named GET_THIS_TASK and SET_THIS_TASK(x). On all other architectures they expand to SysBase->ThisTask; on RaspberryPi they expand to TLS_GET(ThisTask) and the equivalent TLS_SET. What about the rest of the AROS code? Well, there the only sane way to get ThisTask shall be used: the FindTask(NULL) call.
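The idea of a per-core ThisTask can be illustrated with a small host-side model. Everything in it except the macro names GET_THIS_TASK, SET_THIS_TASK, TLS_GET and TLS_SET is hypothetical: the struct layouts, the sketch_ names, the two-argument form of TLS_SET and the global variable that stands in for “which core am I running on” are assumptions made for this sketch and do not show how kernel.resource really finds its local storage.

```c
#include <assert.h>
#include <stddef.h>

struct Task { const char *name; };            /* stand-in for Exec's Task */

struct sketch_tls { struct Task *ThisTask; }; /* one storage per CPU core */

#define SKETCH_CORES 4
struct sketch_tls sketch_tls_area[SKETCH_CORES];
unsigned sketch_core_id;  /* hypothetical: real code would ask the CPU */

/* Returns the local storage of the core we pretend to run on. */
struct sketch_tls *sketch_current_tls(void)
{
    assert(sketch_core_id < SKETCH_CORES);
    return &sketch_tls_area[sketch_core_id];
}

#define TLS_GET(field)       (sketch_current_tls()->field)
#define TLS_SET(field, val)  (sketch_current_tls()->field = (val))

#define GET_THIS_TASK        TLS_GET(ThisTask)
#define SET_THIS_TASK(x)     TLS_SET(ThisTask, (x))
```

Setting the task while sketch_core_id is 1 leaves the pointer seen on core 0 untouched, which is the property a per-core ThisTask needs once more than one core runs tasks.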

And here we come to the point where I’ve killed all nightlies. During my ThisTask removal fun I accidentally broke one macro in the AROSTCP network stack :) It should be fixed already.

by michal at April 27, 2015 08:27 PM

April 21, 2015


Hello Core 1, hello Core 2, Core 3 – wake up!

Porting AROS to RaspberryPi is a lot of fun, as I have said before. There’s also a lot of frustration, and you know that. This time because of 4 CPU cores…

From the very beginning I noticed that the frame buffer was relatively slow. At least not as fast as I would expect from a nearly 1 GHz machine. Well, the issue was there, but I ignored it at first. I carried on with the AROS port and came to a point where AROS was booting into the desktop and running programs. As a simple example I added Clock to the WBStartup folder, thus making this app start automatically once the system is up. Of course I had full debug output enabled on the screen console and over the serial port.

Huh, it took AROS nearly 30 seconds to boot. Not bad, but could be better for sure. The slow redrawing of the screen was worrying me, but hey, we do have the simplest graphics driver ever. No acceleration, just a simple portion of memory filled pixel by pixel (with some help from our base graphics class of course). So far so good.


Then out of curiosity I decided to take a look at an old Raspberry Pi model I have on my desktop. I booted it, looked at the Clock and went mad. The old Raspberry Pi with its ARM11 CPU booted in about 20 seconds. That is 2/3 of the time the RaspberryPi2 needed! Can’t be, I thought. The new machine cannot be that bad, can it? Have I missed some cache setup? The frame buffer can’t be cached, right? Why was the Linux frame buffer console faster?

Finally I found a forum where Bare Metal guys were discussing their great efforts to develop standalone software for RaspberryPi. Luckily for me, one of them had an issue similar to mine. He also led me to the final solution. It turned out that the CPU cores of RaspberryPi 2 are not silently sleeping and waiting for an interrupt when start.elf transfers control over to the ARM CPU. No, instead they are busy looping and polling the registers, anxiously waiting to start and do some useful work. As you can imagine, polling is not a very efficient technique, rather the contrary. The additional CPU cores were stealing precious bus cycles, leaving fewer for CPU#0, which was actually running AROS code. Eureka!

There are two solutions and I have found both of them working with AROS. The first one is to extend the config.txt file (the file which is read and parsed by VideoCore). There, one has to add the following parameter


It forces the additional CPUs to go to sleep and wait for interrupts instead of busy looping. I tested it and it really helped. After adding that line AROS really flies on that tiny computer! The frame buffer refreshes quickly, the display redraws quickly, and a few demos redraw their windows nearly immediately. Boo! Now the machine not only feels faster than the old RPi, it actually is faster.

Just letting the additional CPUs sleep is good, but not something I liked very much. Sure, start.elf does a good job, but I wanted to make AROS do that job. So I started to code :) I wrote a small assembly routine, a trampoline which initializes the caches and MMU of the woken-up core. The trampoline also initializes the supervisor stack and jumps to a routine in C code. At the moment the C routine is rather simple. It checks the CPU type, enables VFP and enters an endless wait-for-interrupt loop. Ah, the C routine babbles to the system log of course, to let me know it is actually working. What I got was:

[KRN] Co]e o Co eUp ani idiwir igr rrutatuots

Uh. Not very readable. Forgot something? Ah yes, there is no locking in our bug() function, which means all cores were fighting over the serial line. Proper locking will come later, since it has to be done right; for now I have only added some delays. This is how it looks now

[Screenshot, 2015-04-21 at 21.53.40]

Please note that the “Core x up and waiting” lines are each sent to the console by a different ARM core. It’s not SMP, not even AMP. It’s just a small initialization routine. But at least it works as expected…
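The garbled log line above is what several unsynchronized writers produce on one shared output. As an illustration only (this is Python, not AROS code, and the lock and the message text are assumptions made for the example), a minimal sketch of serializing whole messages with a lock could look like this:

```python
import threading

serial = []                     # shared "serial line": one entry per character
serial_lock = threading.Lock()  # hypothetical lock guarding the shared line

def bug(message):
    """Write one complete message to the shared line without interleaving."""
    with serial_lock:           # only one writer emits characters at a time
        for ch in message + "\n":
            serial.append(ch)

def core_main(core):
    # Illustrative message text, modelled on the "Core x up and waiting" lines.
    bug(f"[KRN] Core {core} up and waiting")

# Three extra "cores" report in concurrently.
threads = [threading.Thread(target=core_main, args=(n,)) for n in range(1, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

output = "".join(serial)
print(output, end="")
```

The order of the three lines may differ from run to run, but each line arrives intact because a writer holds the lock for the whole message rather than per character.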

And with current setup AROS really flies on the RaspberryPi 2 😀


by michal at April 21, 2015 08:50 PM

April 18, 2015


Raspberry Pi

Eons ago I was involved in several ARM-related projects. One of them was to make a linux-hosted port of AROS for ARM devices. These were the days full of fun and joy (if everything worked well) and frustration (if everything failed). After that my engagement in AROS dropped nearly to zero. There were, of course, some exceptions like improvements in memory management (TLSF support) or improvements in x86_64 AROS. But none of them were as low-level as I wished them to be.

Since at work we started to use some ARM-based embedded machines for our electronics, I had some fun coding for them. Not really low level, but weird enough :) All this led me to the idea of buying an ARM platform and making a native AROS for it.

Even if there are better machines available, I have decided to support RaspberryPi. One of the reasons was the availability of the rPi code in the AROS repository: our great developer Nick Andrews had already started a port of AROS for that machine and made great progress with it. Another reason, a very important one, is the huge community behind Raspberry.

So, the board, the RaspberryPi 2, has been bought :)


During the last weeks Nick and I had fun bringing the AROS port back into a usable state, rewriting it and improving it in many places. Code which initially did not work with rPi2 boards at all now boots equally well (or equally badly) on both rPi and rPi2 into Wanderer, the desktop environment of AROS. The kernel of our system is loaded at the virtual address 0xf8000000. The read-only portion of the kernel is MMU-protected against writes. All caches and write buffers are enabled. Slowly all bits and pieces are being improved, and we are doing our best to get USB on-the-go up and running. Having it would allow us to actually use AROS on these nice machines already.

Meanwhile, I’m completing our small EABI library for ARM cpus so that we could build entire AROS with gcc5 compiler. Well, fun :)

by michal at April 18, 2015 10:11 PM

April 17, 2015

Icaros Desktop

Your Icaros HTTP server

Icaros Desktop has 'hidden' a little treasure for years, called Snug. Made by the same author as Yafs, the beloved FTP server which lets us share files from the AROS machine with the local network, Snug is a lovely HTTP server which does exactly what it is meant for: publishing (simple) web pages and allowing directories to be browsed through a web browser (like Internet Explorer, Chrome or Firefox

by Paolo Besser (noreply@blogger.com) at April 17, 2015 03:21 PM

April 15, 2015

Icaros Desktop

Adding submenus to initial GRUB boot menu

Work on Icaros 2.0.4 is going on, even if I recently forgot to add any status update on this page. So I decided to write a little tutorial about a nice customization I'm adding to the distribution. Unluckily, only those who install v2.0.4 from scratch (Live! or Light versions) will see that, while users of current editions will need to manually make some modifications to a single text file

by Paolo Besser (noreply@blogger.com) at April 15, 2015 05:17 PM

April 10, 2015



Over two years passed since last entry on this page — two years only but it feels like eons. I think it’s time to reactivate this blog :)


So, reboot…

by michal at April 10, 2015 02:50 PM

April 05, 2015


A lua shell for FS-UAE

While FS-UAE recently added a scripting interface with a Lua binding, it only lets you write scripts with hooks that are called on certain emulator events. I hacked this scripting interface and added a Lua remote shell. With this shell you can connect while the emulator is running and issue commands. I also started to add disk image related functions to the Lua binding. With these features combined I could show off the power of a scripting shell by writing a tool to insert floppy and CD-ROM disk images while the emulator is running, a long-awaited and missing feature…

1. Build FS-UAE with the lua shell

The lua shell is for now only available as a source code patch and you need to compile a fresh FS-UAE to use it. But that is not too difficult:

The code is hosted on the lua branch in my github repository:


I also submitted the patch as a pull request for mainline FS-UAE and hope that Frode will like this feature and include it :)

Clone this branch and start the compilation with the following options:

$ ./bootstrap
$ mkdir build
$ cd build
$ ../configure --enable-lua

Now you can compile it:

$ make
$ (cd dist/macosx && make)  # only on Mac OS X

This results in a new FS-UAE binary with lua shell support.

2. Configure and First Start

You have to enable the lua shell in order to use it.

Either add an option to one of your .fs-uae config files:

lua_shell = 1

or give the option on the command line:

fs-uae ... --lua_shell=1

The lua shell opens a TCP/IP socket on localhost port 6800 and waits for incoming client connections, e.g. via telnet or putty.

You can use the following options to change these settings:

lua_shell_addr = "localhost"
lua_shell_port = 6800

Now launch FS-UAE with the lua shell option enabled and have a look at the log files. They usually reside in Documents/FS-UAE/Cache/Logs/fs-uae.log.txt.

There, look for messages starting with lua-shell:

$ grep lua-shell ~/Documents/FS-UAE/Cache/Logs/fs-uae.log.txt
lua-shell: addr=, port=6800
lua-shell: +listener: 20
lua-shell: -listener
lua-shell: stopping done...

If you see these messages you should be able to connect while the emulator is running:

$ telnet localhost 6800
Connected to localhost.
Escape character is '^]'.
FS-UAE 2.5.29dev Lua 5.2

Ok. Now let’s see what you can do in the shell…

3. Using the shell

The lua shell is very similar to the interactive lua interpreter that is shipped with lots of lua distributions (see doc). You can enter a valid lua statement that is then evaluated in the current lua state:

> print "hello, world!"
hello, world!

Note that the output of the print command is redirected to your shell. Return values are not printed automatically: to evaluate an expression and see its result you need to prepend a = (or return):

> =2+3

Besides most of the libraries that Lua already ships, the lua shell in FS-UAE also registers special emulator commands for you (see the next section for details):

> =fsuae.floppy.get_num_drives()

This returns the number of virtual floppy drives that are currently emulated.

If you want to quit the shell then enter the quit() command:

> quit()

You can also simply close the connection from telnet or putty…
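The telnet session above can also be scripted. Here is a minimal sketch of a one-shot client in Python; the function name run_lua is made up for this example, and it assumes that a newline terminates a statement and that the shell closes the connection after quit(). If it does not, the read simply ends at the timeout.

```python
import socket

def run_lua(statement, host="localhost", port=6800, timeout=2.0):
    """Send one Lua statement to the lua shell and return all text received.

    The returned text includes the greeting banner and any prompts,
    exactly as a telnet user would see them.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        # Assumption: the shell reads line by line, like the telnet session.
        sock.sendall(statement.encode("utf-8") + b"\n")
        sock.sendall(b"quit()\n")      # ask the shell to end the session
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:           # peer closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass                       # no more data within the timeout
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With the emulator running, print(run_lua("=fsuae.floppy.get_num_drives()")) should show the banner followed by the drive count.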

3. FS-UAE Lua commands

Currently, the following modules are defined with commands:

  • fsemu – generic emulator commands e.g. for changing display
  • fsuae – commands available in the FS-UAE adaption layer of UAE
  • uae – core Amiga emulator commands, e.g. read Amiga memory

Each module is defined in a lua table with the same name and is thus accessed with this prefix. For an up to date list of commands have a look at the corresponding source files:

  • fsemu in libfsemu/src/emu/emu_lua.c
  • fsuae in src/fs-uae/lualibfsuae.c
  • uae in src/lualibuae.cpp

In this post I’ll focus on the drive image functions I’ve added:

fsuae.floppy.get_num_drives()
  return: the number of floppy drives currently active
fsuae.floppy.set_file(num, path)
  num: index of floppy 0..3
  path: file path of drive image (e.g. adf file)
  return: -
  num: index of floppy 0..3
  return: file path of current drive image or empty string

The same functions are also available for CD-ROM images: just replace floppy with cdrom in the above commands…

Now you can insert and eject floppy images when you run a lua shell during FS-UAE’s operation:

> fsuae.floppy.set_file(0, "/path/to/my/test.adf")  -- insert image into DF0
> fsuae.floppy.set_file(0, "") -- eject image in DF0

You are now able to control your floppies interactively via shell, but there is more…
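When a program generates these statements, the image path has to be turned into a valid Lua string literal first. A small sketch of how that could be done in Python follows; the helper names lua_quote and set_file_statement are hypothetical and not part of FS-UAE, and only the set_file call itself comes from the list above.

```python
def lua_quote(text):
    """Return text as a double-quoted Lua string literal."""
    escaped = (
        text.replace("\\", "\\\\")   # backslashes first, so later escapes survive
            .replace('"', '\\"')     # then the quote character itself
            .replace("\n", "\\n")    # keep the statement on a single line
    )
    return '"' + escaped + '"'

def set_file_statement(drive, path, kind="floppy"):
    """Build the Lua statement that inserts (or, with "", ejects) an image."""
    if kind not in ("floppy", "cdrom"):
        raise ValueError(f"unknown drive kind: {kind!r}")
    if not 0 <= drive <= 3:
        raise ValueError(f"drive index out of range 0..3: {drive}")
    return f"fsuae.{kind}.set_file({drive}, {lua_quote(path)})"

print(set_file_statement(0, "/path/to/my/test.adf"))
print(set_file_statement(0, ""))
```

Quoting matters mostly for Windows-style paths, where an unescaped backslash would be read by Lua as the start of an escape sequence.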

4. fs-uae-ctl Command Line Utility

You can also write utility programs that use the shell to communicate with FS-UAE while it’s running.

fs-uae-ctl, found in the new tools directory of the FS-UAE source tree, is a small Python 3.x tool that lets you manage the floppy and CD-ROM images from the command line:

$ python3 fs-uae-ctl df0 # return the current image attached to DF0
df0 empty
$ python3 fs-uae-ctl df0 /path/to/my/test.adf # insert new image
$ python3 fs-uae-ctl df0 eject # remove current image

Use df1 to df3 to access the other drives (if they are enabled).

Use cd0 to cd3 to access the CD-ROM images.

Some extra options allow you to change the host or port where to find the lua shell. Example:

... --port 6811 --host my.host.ip
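If you write a similar tool yourself, the drive names used above map naturally onto the floppy and cdrom function groups and a drive index. The following sketch shows one way to do that mapping; parse_drive is a hypothetical helper, not taken from the fs-uae-ctl source.

```python
import re

def parse_drive(name):
    """Map 'df0'..'df3' or 'cd0'..'cd3' to a (kind, index) pair.

    kind is 'floppy' or 'cdrom', matching the fsuae.floppy / fsuae.cdrom
    function groups; index is the drive number 0..3.
    """
    match = re.fullmatch(r"(df|cd)([0-3])", name.lower())
    if match is None:
        raise ValueError(f"unknown drive name: {name!r}")
    kind = "floppy" if match.group(1) == "df" else "cdrom"
    return kind, int(match.group(2))

print(parse_drive("df0"))
print(parse_drive("cd3"))
```

Rejecting unknown names up front keeps a typo like df4 from ever reaching the emulator as a Lua statement.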

5. fs-uae-ctl-ui GUI Image Changer Utility

Of course you could get fancier and write a GUI-based utility. Here it is: fs-uae-ctl-ui, also found in tools. It requires PyQt4, so make sure it is installed in your Python 3.x setup, e.g. by installing the package python3-pyqt4 on an Ubuntu/Debian Linux system. Mac OS X MacPorts users install the port py34-pyqt4.

If you run the tool, you’ll see a nice image change window:

fs-uae-ctl-ui: FS-UAE Image Changer Utility

Usage is really simple:

First press button Connect to connect with the running FS-UAE lua shell. If all went well then the enabled drives are also enabled in the UI window. See the status line at the bottom for error messages if any.

On an enabled drive slot you can insert a new disk image by entering a new path name into the edit box. If you press button then a file selector will be opened and allows you to choose a new image to be inserted. Pressing the button with the Eject Symbol ejects the image in the drive slot…

There is also a tab for CD-ROMs with the same feature set.

6. Dev Tools and more…

Both fs-uae-ctl utilities use a common Python 3 library that is also shipped. It allows you to integrate lua shell access to FS-UAE with a few lines of python code and also provides some classes that wrap the floppy and CD-ROM image functions… A high level Emu class wraps everything together:

import fsuae
emu = fsuae.Emu()
if not emu.connect():
  # connection failed: report why
  print("ERROR", emu.getError())
else:
  # connected: query the number of drives
  print("Drives", emu.getNumDrives())

So writing your own tool is not really an issue…

And user jbl007 of the English Amiga Board already picked up the idea while discussing the lua shell (see EAB Thread) and created his own tool that offers the image control in a very compact menu attached to the systray icon of FS-UAE! See his GitHub Repo:


His launcher fs-uae.py does all the magic…

(BTW: jbl007 has also written a very nice command line frontend for FS-UAE called amiga that lets you run FS-UAE for some standard Amiga models without editing an fs-uae config file. It creates the necessary file automatically during startup…)

That’s it… I hope you enjoy the new lua shell for FS-UAE and if you find new interesting uses of it then drop a comment or join the discussion on EAB…

by lallafa at April 05, 2015 05:49 PM