OK, I decided to take a proper look at this.
My thinking is that there are three main possibilities for what is wrong:
1) The contents, sequence or timing of the packets sent during logon is sometimes wrong.
2) Something is being corrupted while sending or receiving data (maybe some bug at the data queuing level).
3) The protocol implemented is not quite what the client expects. It's close, as it often works, but maybe it's not quite right.
Number 1) seems the most likely to me, or at least the easiest starting point, so I've modified my server to dump, with timestamps, all of the packets sent and received by both the world and zone servers. I know it can already do some of this, but I wanted it in a form I could analyse more easily. My plan is then to capture a successful and a broken zone-in and look for differences in the data, its sequence or the timing. That will be a fairly painful process though. One theory I have is that since this involves both the world and zone servers, which run as different processes, maybe it's a timing-related problem between the two of some kind?
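To make the comparison less painful, something like the rough Python sketch below could do the first pass. It is only an illustration: the dump format it reads (one packet per line as "timestamp server direction opcode hex-payload") is made up for the example and is not what the server actually writes, and the names parse_log, compare and gap_tolerance are hypothetical. It aligns the two captures by (server, direction, opcode) and reports packets that are missing or extra, payloads that differ, and gaps between packets that differ by more than a tolerance.

```python
import difflib
from typing import NamedTuple


class Packet(NamedTuple):
    ts: float       # seconds since the capture started
    server: str     # "world" or "zone"
    direction: str  # "in" or "out", relative to the server
    opcode: str     # opcode as hex text
    payload: str    # payload as hex text (may be empty)


def parse_log(text):
    """Parse the ASSUMED dump format: 'ts server direction opcode [payload]'."""
    packets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(maxsplit=4)
        ts, server, direction, opcode = parts[:4]
        payload = parts[4] if len(parts) > 4 else ""
        packets.append(Packet(float(ts), server, direction, opcode, payload))
    return packets


def key(p):
    # Packets are aligned on who sent what; payload and timing are checked after.
    return (p.server, p.direction, p.opcode)


def compare(good, bad, gap_tolerance=0.25):
    """Return a list of human-readable differences between two captures."""
    findings = []
    matcher = difflib.SequenceMatcher(
        a=[key(p) for p in good], b=[key(p) for p in bad], autojunk=False
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Packets present in only one capture, or in a different order.
            findings.append(
                f"sequence {tag}: good[{i1}:{i2}] "
                f"{[key(p) for p in good[i1:i2]]} vs bad[{j1}:{j2}] "
                f"{[key(p) for p in bad[j1:j2]]}"
            )
            continue
        for i, j in zip(range(i1, i2), range(j1, j2)):
            if good[i].payload != bad[j].payload:
                findings.append(
                    f"payload differs at good[{i}] / bad[{j}] "
                    f"(opcode {good[i].opcode})"
                )
            if i > 0 and j > 0:
                # Compare the gap to the previous packet in each capture.
                gap_good = good[i].ts - good[i - 1].ts
                gap_bad = bad[j].ts - bad[j - 1].ts
                if abs(gap_good - gap_bad) > gap_tolerance:
                    findings.append(
                        f"timing differs before good[{i}] / bad[{j}]: "
                        f"{gap_good:.3f}s vs {gap_bad:.3f}s"
                    )
    return findings
```

Feeding it the text of a good and a bad capture, e.g. compare(parse_log(good_text), parse_log(bad_text)), gives a short list of places to look at first. Payload differences will include fields that legitimately change between sessions (session IDs, timestamps inside the packet and so on), so those would still need filtering by hand.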
Anyway, I'll analyse the data I've got and see if I can figure it out. If that doesn't work I'll look at options 2) and 3).
Any suggestions from developers would be welcome, particularly any information on how the login is supposed to work and what the zone and world servers do, and in what order.