Socket problems
I've been working on a server program for work. For the past few days, Victor's been off on-site doing a beta test of it, and it seemed to be going so well. Then today happened.
He sent me a log showing a bogus error code from a connection attempt. How connecting works is that the server creates a socket and does a non-blocking connect on it. The main program goes on its way while the network thread waits for the socket to become writable. When the socket's ready, its fd is sent via Carbon Event to the main thread, where it's tested to see if it's really connected, and then further processed.
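For anyone who hasn't seen the pattern, here's a minimal sketch of the connect-then-wait-for-writable step. It's in Python rather than the C the server is written in, the names `start_connect` and `wait_writable` are made up for illustration, and the Carbon Event hand-off to the main thread is left out entirely.

```python
import errno
import os
import select
import socket


def start_connect(address):
    """Begin a non-blocking connect and return the socket right away."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex() returns an errno value instead of raising. EINPROGRESS
    # (or EWOULDBLOCK on some platforms) means the handshake is under way.
    err = sock.connect_ex(address)
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def wait_writable(sock, timeout):
    """Block in select() until sock is writable or the timeout expires."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return sock in writable
```

Writable only means the connect attempt has finished, not that it succeeded, which is why the socket still has to be tested afterwards.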
What was happening is that somewhere else in the code the connection attempt timed out, and close() was called on the socket. Meanwhile, the network thread was still blocked in select() on that socket. What's select() supposed to do when one of the sockets you're blocking on disappears? I don't know, and apparently neither does Darwin. So I get a Carbon Event saying the socket's ready, I test it to see if it's really connected, I get some bogus error code, and there ya go.
So the fix: now, whenever I close a socket, I send a message to the network thread telling it to stop listening on that socket. (This wasn't a problem in OT land, since there we were using an async callback, and closing the endpoint cancelled any callbacks.)
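Here's a rough sketch of one way that kind of hand-off can look, again in Python and not the actual server code. Everything in it is an assumption for illustration: the `NetworkThread` class, the command queue, and the socketpair used to wake up select(). The point it tries to show is that abandon() only returns once the select loop has dropped the socket, so the close() that follows can't pull the fd out from under a blocked select().

```python
import queue
import select
import socket
import threading


class NetworkThread:
    """Select loop that stops watching a socket when asked to."""

    def __init__(self, on_ready):
        self._on_ready = on_ready      # called on the network thread
        self._watched = set()          # touched only by the network thread
        self._commands = queue.Queue()
        # Writing a byte to _wake_w makes select() return so that queued
        # commands are handled promptly.
        self._wake_r, self._wake_w = socket.socketpair()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _post(self, verb, sock=None):
        # Must not be called from on_ready: it waits on the network thread.
        done = threading.Event()
        self._commands.put((verb, sock, done))
        self._wake_w.send(b"x")
        done.wait()                    # return only once the thread has acted

    def watch(self, sock):
        self._post("watch", sock)

    def abandon(self, sock):
        """Stop watching sock. Only after this returns is close() safe."""
        self._post("abandon", sock)

    def stop(self):
        self._post("stop")
        self._thread.join()
        self._wake_r.close()
        self._wake_w.close()

    def _run(self):
        while True:
            readable, writable, _ = select.select(
                [self._wake_r], list(self._watched), [])
            if readable:
                self._wake_r.recv(4096)
                if not self._handle_commands():
                    return
            for sock in writable:
                # Skip sockets abandoned while select() was returning.
                if sock in self._watched:
                    self._watched.discard(sock)
                    self._on_ready(sock)

    def _handle_commands(self):
        """Apply queued commands; return False when asked to stop."""
        while True:
            try:
                verb, sock, done = self._commands.get_nowait()
            except queue.Empty:
                return True
            if verb == "watch":
                self._watched.add(sock)
            elif verb == "abandon":
                self._watched.discard(sock)
            done.set()
            if verb == "stop":
                return False
```

With something like this, the timeout path would call `net.abandon(sock)` and only then `sock.close()`.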
(By the way, the best test to see if a non-blocking connect succeeded in Darwin is straight out of Stevens's UNIX Network Programming. Call getpeername(), and see if you get a -1 return. If you do, you know the socket didn't connect. Then call getsockopt() with the SOL_SOCKET/SO_ERROR selector to pull the error out of the socket.)
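A small sketch of that test, in Python for brevity (the helper name `connect_error` is made up). One difference from C: Python's getpeername() raises OSError instead of returning -1. Falling back to the exception's own errno when SO_ERROR reads as 0 is my addition, not something from Stevens.

```python
import socket


def connect_error(sock):
    """Return 0 if sock is connected, otherwise an errno value."""
    try:
        sock.getpeername()             # succeeds only on a connected socket
        return 0
    except OSError as exc:
        # Pull the pending error out of the socket (this also clears it).
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        # Assumption: if no error is pending, report why getpeername()
        # failed (typically ENOTCONN).
        return err or exc.errno
```

Call it once the socket has come back writable from select(): a result of 0 means the connection is up, and anything else is the errno to log.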
Solving that bug was nice, but the problem is that's not why Victor sent me the log. The real reason is that the log caught a problem where every single one of my connections died with an EPIPE error. Two dozen connections, all dead within a second. That's ridiculously bizarre, and I don't know what to make of it yet.
I went through and ifdef-ed out the speed networking code (which should have been ifdef-ed in the first place) for now. It's reasonably elegant code and it significantly speeds things up, but it's not as well tested as the basic socket code, and I need to simplify things to find out where the problem lies.
So tomorrow, back to hacking on the networking code...