Update:

I now have the bare bones of a program that takes a buffer of step requests (number of steps, step rate, direction) for four axes (X/Y/Z/A) and outputs them to the GPIO pins under interrupt-driven timing. Using an out-of-the-box user-space handler, rather than a kernel-level interrupt handler (which I will write once I've worked out how to get the Linux cross compiler to work), I am able to get 5,000 steps/sec on each axis with <10 µs delta between channels (10 kHz interrupt). That uses around 50% CPU on a non-overclocked Pi.
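For anyone wondering how a 10 kHz tick turns into 5,000 steps/sec, here is a rough sketch of the idea in C. It is not my actual code: the names (`StepRequest`, `axis_load`, `stepper_tick`) and the bit layout are made up for illustration, the GPIO register is replaced by a plain variable so it runs anywhere, and it holds one request per axis rather than a real buffer. Each axis has an accumulator that decides on every tick whether to toggle its step pin, so a full pulse (high then low) takes at least two ticks.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_AXES 4        /* X, Y, Z, A */
#define TICK_HZ  10000u   /* timer interrupt rate (10 kHz) */

/* One step request: how many steps, how fast, which way. */
typedef struct {
    uint32_t steps;    /* number of steps to emit */
    uint32_t rate_hz;  /* step rate in steps per second */
    uint8_t  dir;      /* direction, 0 or 1 */
} StepRequest;

/* Per-axis state kept between ticks (hypothetical layout). */
typedef struct {
    uint32_t steps_left;
    uint32_t rate_hz;
    uint32_t acc;         /* phase accumulator */
    uint8_t  dir;
    uint8_t  step_level;  /* current level of the step pin */
} AxisState;

static AxisState axes[NUM_AXES];

/* Stand-in for the real GPIO output register.
 * Assumed layout: bit 2*i = step pin, bit 2*i+1 = direction pin. */
static uint32_t gpio_shadow;

/* Load a request into one axis. A pulse needs a high tick and a
 * low tick, so the step rate cannot exceed half the tick rate. */
void axis_load(int axis, StepRequest req)
{
    assert(axis >= 0 && axis < NUM_AXES);
    assert(req.rate_hz <= TICK_HZ / 2);
    AxisState *a = &axes[axis];
    a->steps_left = req.steps;
    a->rate_hz    = req.rate_hz;
    a->dir        = req.dir ? 1 : 0;
    a->acc        = 0;
    a->step_level = 0;
}

/* Called once per timer tick. All axes are updated in the same
 * call and written out as one word, which keeps the skew between
 * channels small. Returns the word that was written. */
uint32_t stepper_tick(void)
{
    uint32_t out = 0;
    for (int i = 0; i < NUM_AXES; i++) {
        AxisState *a = &axes[i];
        if (a->steps_left > 0) {
            a->acc += 2u * a->rate_hz;   /* two toggles per step */
            if (a->acc >= TICK_HZ) {
                a->acc -= TICK_HZ;
                a->step_level ^= 1u;
                if (a->step_level == 0)  /* falling edge ends a step */
                    a->steps_left--;
            }
        }
        out |= (uint32_t)a->step_level << (2 * i);
        out |= (uint32_t)a->dir        << (2 * i + 1);
    }
    gpio_shadow = out;
    return out;
}
```

At 5,000 steps/sec the accumulator overflows on every tick, so the pin toggles each time and you get one complete step per two ticks. Slower rates just overflow less often. On real hardware `gpio_shadow` would be replaced by writes to the GPIO set/clear registers.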

Next is to get a basic G-code interpreter running and do the vector maths to handle acceleration/deceleration.