OSError: [Errno 5] Input/output error
After identifying the problem, I logged in to the server via SSH, restarted the service with debug logging enabled, and started processing a sequence of documents that guaranteed the workers would have to be restarted. I was really surprised to see that the service coped with the problem and brought up all the workers.
I verified that the only place where the service could have crashed was while starting a new worker process, before the actual fork. Tracing through the Python standard library, I found that starting a new process eventually calls Popen(self).
Inspecting multiprocessing/forking.py:
    if sys.platform != 'win32':
        # some unimportant stuff

        class Popen(object):

            def __init__(self, process_obj):
                sys.stdout.flush()
                sys.stderr.flush()
                self.returncode = None

                self.pid = os.fork()
                if self.pid == 0:
                    if 'random' in sys.modules:
                        import random
                        random.seed()
                    code = process_obj._bootstrap()
                    sys.stdout.flush()
                    sys.stderr.flush()
                    os._exit(code)
So before os.fork is actually called, the script tries to flush the standard output and error streams. The I/O error was therefore caused by our script trying to flush stderr/stdout to the /dev/tty device, which became unavailable after a while (the SSH session was dropped after the daemon had been started). This also explains why the service recovered during my test: my SSH session was still open, so the terminal was still there. I searched the script for any leftover print statements or logging StreamHandlers. After a long investigation it turned out that a third-party library occasionally logged errors using a StreamHandler...
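A search like this can be partly automated. The snippet below is a minimal sketch, not the script I actually used: the helper name find_console_handlers is made up for illustration, and it relies on logging.Logger.manager.loggerDict, a semi-internal registry of the logging module. It lists every logger that has a StreamHandler pointing at stdout or stderr at the time it is called, so it only helps once the third-party library has been imported and has configured its logging.

```python
import logging
import sys


def find_console_handlers():
    """Return (logger_name, handler) pairs whose handler writes to stdout/stderr.

    Hypothetical helper for illustration; it only sees handlers that are
    already attached when it is called.
    """
    console_streams = [
        s for s in (sys.stdout, sys.stderr, sys.__stdout__, sys.__stderr__)
        if s is not None
    ]
    loggers = [logging.getLogger()]  # start with the root logger
    # loggerDict also contains PlaceHolder objects, which carry no handlers.
    loggers += [
        obj for obj in logging.Logger.manager.loggerDict.values()
        if isinstance(obj, logging.Logger)
    ]
    found = []
    for logger in loggers:
        for handler in logger.handlers:
            if not isinstance(handler, logging.StreamHandler):
                continue
            stream = getattr(handler, "stream", None)
            # FileHandler subclasses StreamHandler, so compare the stream itself.
            if any(stream is s for s in console_streams):
                found.append((logger.name, handler))
    return found


if __name__ == "__main__":
    for name, handler in find_console_handlers():
        print("%s -> %r" % (name, handler))
```

Calling it right before the daemon detaches from the terminal, and logging the result to a file, should show which loggers would still try to write to the terminal.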
So the lesson learned: always verify that your daemon scripts don't write to stdout/stderr, or make sure the streams are redirected in the init script. Besides the fact that stderr won't provide any valuable information when you're offline, you can easily run into similar problems.
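If changing the init script is not an option, the redirect can also be done from inside the daemon at startup. The following is a minimal sketch under that assumption, not the fix I deployed: the function name redirect_standard_streams and its target parameter are made up for illustration. It points file descriptors 0, 1 and 2 at /dev/null (or at a log file), so a later flush no longer touches the terminal.

```python
import os
import sys


def redirect_standard_streams(target=os.devnull):
    """Point fds 0, 1 and 2 away from the terminal.

    Hypothetical helper: stdin reads from /dev/null, while stdout and
    stderr are appended to `target` (by default also /dev/null).
    """
    # Flush while the old streams are still usable.
    sys.stdout.flush()
    sys.stderr.flush()

    read_fd = os.open(os.devnull, os.O_RDONLY)
    write_fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    os.dup2(read_fd, 0)   # stdin
    os.dup2(write_fd, 1)  # stdout
    os.dup2(write_fd, 2)  # stderr

    # Close the temporary descriptors unless they are the standard ones.
    for fd in (read_fd, write_fd):
        if fd > 2:
            os.close(fd)


if __name__ == "__main__":
    redirect_standard_streams()
    print("this goes to /dev/null instead of the terminal")
```

Because the redirect works at the file descriptor level, it also covers output from third-party libraries and child processes, not only print calls in your own code.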
KR