Submit Blog  RSS Feeds

Sunday, August 31, 2014

Possible I/O Errors while starting a new process from a daemon script (Popen)

I recently had a problem with one of my document processing application crashing after trying to revive malfunctioned workers, with an:

OSError: [Errno 5] Input/output error

After identifying the problem I logged in to server via SSH, restarted the service with debug logging enabled and started processing a sequence of documents, that guaranteed that the workers will have to be restarted. I was really surprised to see that the service managed to cope with the problem and bring up all the workers.

I verified that the only place that service could have crashed was starting the new worker process before the actual fork. I traced in python std libraries, that starting a new process eventually calls Popen(self).

Inspecting multiprocessing/forking.py:


if sys.platform != 'win32': 

    # some not importent stuff
    class Popen(object):

        def __init__(self, process_obj):
            sys.stdout.flush()
            sys.stderr.flush()
            self.returncode = None

            self.pid = os.fork()
            if self.pid == 0:
                if 'random' in sys.modules:
                    import random
                    random.seed() 
                code = process_obj._bootstrap()
                sys.stdout.flush()
                sys.stderr.flush()
                os._exit(code)

So before os.fork is actually called, the script tries to flush standard output and error streams. So the IOError that was caused by our script trying to flush stderr/stdout to the /dev/tty device, which was unavailable after a period of time (dropping ssh session after starting the daemon). I investigated the script for any left behind print / logging StreamHandlers. After a long investigation it occurred that a 3-rd party library was rarely logging some errors using a StreamHandler...

So the lesson learned is always verify that Your daemon scripts don't write stdout/stderr scripts or make sure the streams are redirect in the init script. Besides the fact that stderr won't provide any valuable information when You're off-line, You can easily run in to similar problems.

KR
free counters