I've not looked at the code for GRBL, but from what you're describing I'd guess it uses a fixed clock tick, and then scales the required steps to that tick. What that means is the resultant pulses may not be optimally timed.

For example, if you have a 100Hz tick and you need a 40Hz pulse train, that means you need to generate a pulse every 2.5 ticks. Since pulses can only fall on whole ticks, you have to scale to the available tick, so over 10 ticks you'd end up with something like 0010100101. Instead of a steady 25 millisecond spacing, you'd alternate between 20 and 30 millisecond gaps between pulses.
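To make the tick-scaling concrete, here's a minimal sketch of the usual way it's done: a Bresenham-style accumulator that adds the desired pulse rate on every tick and fires whenever a whole pulse's worth has accumulated. This is just an illustration of the principle, not GRBL's actual code.

```python
def pulse_pattern(tick_hz, pulse_hz, n_ticks):
    """Return one 0/1 flag per tick; 1 means 'emit a pulse on this tick'."""
    acc = 0
    out = []
    for _ in range(n_ticks):
        acc += pulse_hz        # accumulate the fraction of a pulse owed per tick
        if acc >= tick_hz:     # a whole pulse is now due
            acc -= tick_hz     # keep the remainder, so no error accumulates
            out.append(1)
        else:
            out.append(0)
    return out

# 40Hz train on a 100Hz tick: 4 pulses per 10 ticks, gaps alternating 2/3 ticks
print("".join(map(str, pulse_pattern(100, 40, 10))))  # -> 0010100101
```

The remainder carried in `acc` is what guarantees the long-term average rate is exactly right, even though individual gaps jitter between 20 and 30 milliseconds.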

Good motion controllers will nearly always use an FPGA to generate the required pulses, and will run on a very similar principle. But because with an FPGA you are programming logic gates directly (you're not relying on embedded code and additional hardware layers to process that code), they run far more efficiently, with far less latency. You also gain more flexibility, in that you can potentially use independent pre-scalers for each pulse channel, so pulses from different channels don't end up aligned. Even then, at some point there will most likely be slight timing inconsistencies, as not every possible output frequency can be matched precisely.
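The per-channel idea can be sketched as a DDS-style phase accumulator per axis: each channel adds its own tuning word on every clock, and a pulse fires whenever the accumulator wraps. On an FPGA every channel's accumulator runs in parallel, so no shared pre-scaler is needed. This is an assumed, simplified design simulated in Python, not any specific controller's implementation; the clock rate and accumulator width are arbitrary choices.

```python
CLOCK_HZ = 1_000_000      # assumed clock driving the accumulators
ACC_BITS = 32             # assumed accumulator width
WRAP = 1 << ACC_BITS

def tuning_word(pulse_hz):
    """Phase increment per clock for the requested output frequency."""
    return round(pulse_hz * WRAP / CLOCK_HZ)

def count_pulses(channels_hz, n_clocks):
    """Simulate n_clocks and count output pulses on each channel."""
    words = [tuning_word(hz) for hz in channels_hz]
    accs = [0] * len(channels_hz)
    pulses = [0] * len(channels_hz)
    for _ in range(n_clocks):
        for i, w in enumerate(words):
            accs[i] = (accs[i] + w) % WRAP
            if accs[i] < w:       # accumulator wrapped this clock -> pulse
                pulses[i] += 1
    return pulses

# Two unrelated rates over one simulated second; each lands within a
# pulse of its target, since the tuning word is a rounded fraction.
print(count_pulses([40, 25_000], CLOCK_HZ))
```

The rounding in `tuning_word` is exactly the "not every frequency can be matched precisely" caveat: the achievable rates are quantised to `CLOCK_HZ / 2**ACC_BITS` steps, though with a 32-bit accumulator that quantisation is tiny.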

I'd like to be able to programme FPGAs, but all my attempts so far have ended in frustration, as I struggle to get my head around VHDL or Verilog. It's one of those things where I know what I'd like to achieve, but I've never found any good guides that explain the basics, or that have examples that actually work :-/