On Thu, 15 Sep 2005, Jamis Buck wrote:

> Rails applications that use FCGI have been observing some strange behavior.
> I have a hypothesis regarding the cause, but I'd like some feedback as to
> whether it is a reasonable hypothesis, and any solutions/workarounds that
> people might have.

on which platforms?

> Sometimes (and some apps experience this more frequently than others) a FCGI
> process that is not currently handling a request will fail to respond to a
> signal (specifically USR1 or HUP) until a request is received.

just to clarify - a fcgi process is __always__ handling a request.  for
instance, if i run this code as a fcgi process:

   [ahoward@localhost html]$ cat ./env.fcgi
   #! /usr/local/bin/ruby
   require 'fcgi'
   loaded, pid = Time::now, Process::pid
   FCGI.each_cgi do |cgi|
     env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
     content = <<-html
       LOADED @ #{ loaded } <br>\n
       PID @ #{ pid } <br>\n
       <hr><hr>
       #{ env }
     html
     cgi.out{ content }
   end

   [ahoward@localhost html]$ links -dump http://localhost/env.fcgi |grep PID
      PID @ 12568

and then check that process

   [root@localhost ahoward]# strace -p 12568
   Process 12568 attached - interrupt to quit
   select(1, [0], NULL, NULL, NULL ...

i see it's waiting for a request, blocked in select to multiplex io.
checking os_unix.c in the fcgi lib source we see

   void OS_ShutdownPending()
   {
       shutdownPending = TRUE;
   }
   static void OS_Sigusr1Handler(int signo)
   {
       OS_ShutdownPending();
   }

   ...

   int OS_Accept(int listen_sock, int fail_on_intr, const char *webServerAddrs)
   {
       int socket = -1;
       union {
           struct sockaddr_un un;
           struct sockaddr_in in;
       } sa;

       for (;;) {
           if (AcquireLock(listen_sock, fail_on_intr))
               return -1;

           for (;;) {
               do {
   #ifdef HAVE_SOCKLEN
                   socklen_t len = sizeof(sa);
   #else
                   int len = sizeof(sa);
   #endif
                   if (shutdownPending) break;
                   /* There's a window here */

                   socket = accept(listen_sock, (struct sockaddr *)&sa, &len);
               } while (socket < 0
                        && errno == EINTR
                        && ! fail_on_intr
                        && ! shutdownPending);


   ...
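
the shape of that c code - a handler that only records the signal, and a loop
that checks the flag before it blocks again - can be sketched in ruby.  this is
only an illustration, not the fcgi lib's code: serve_until_signalled and the
list of fake requests are made up, and the process signals itself to stand in
for an operator running kill.

```ruby
# hypothetical sketch of the "set a flag, check it at safe points" pattern.
# nothing here talks to a real fcgi socket.
def serve_until_signalled(requests)
  shutdown_pending = false
  # the handler does no real work - it only records that USR1 arrived
  old_handler = trap('USR1') { shutdown_pending = true }

  served = []
  requests.each do |req|
    break if shutdown_pending        # checked before "accepting" the next one
    served << req                    # stand-in for accept + handle
    if req == 2
      Process.kill('USR1', Process.pid)  # simulate the signal arriving mid-run
      sleep 0.1                          # give the handler time to run
    end
  end
  served
ensure
  trap('USR1', old_handler || 'DEFAULT') # put the previous handler back
end

p serve_until_signalled([1, 2, 3, 4, 5])
```

the request being served when the signal lands is finished normally, and the
loop stops before picking up the next one.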


so it seems that the signal handler sets a global flag which is checked at
appropriate times.  we can send a signal to the process and see what happens:

   [root@localhost html]# kill -HUP 12568

and, back in our strace window we see:

   --- SIGHUP (Hangup) @ 0 (0) ---
   rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
   rt_sigaction(SIGINT, {SIG_DFL}, {0x80a4884, [], SA_RESTART}, 8) = 0
   exit_group(1)                           = ?
   Process 12568 detached

looks fine - so it does, in fact, receive and handle the signal asap.  but
wait a minute.... it exited with 1 for failure.  checking the apache logs we
see :

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" (pid 12568) terminated by calling exit with status '1'
   [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" restarted (pid 12614)

seems __ok__.  but let's do it a few times:

   [root@localhost html]# echo `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
   12614
   [root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
   [root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`

now we check the logs:

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:15:34 2005] [error] [client 127.0.0.1] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/env.fcgi"
   [Thu Sep 15 10:15:34 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so now the bloody thing won't run for ten minutes!  the apache process manager
is preventing rapid startup/shutdown by buggy fcgi processes, and this makes
sense since thousands of them could hose a system.
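
to make the backoff concrete, here is a rough ruby model of what the log
message describes: a process that fails to stay up 30 seconds, 3 times running,
gets its restart interval pushed to 600 seconds.  this is a guess at the
policy, not mod_fastcgi's implementation - RestartPolicy, next_delay and the 5
second normal delay are all assumptions.

```ruby
# hypothetical model of the restart backoff seen in the apache log.
# the numbers 30, 3 and 600 come from the log line; the rest is assumed.
class RestartPolicy
  def initialize(min_uptime: 30, max_quick_exits: 3, backoff: 600, normal_delay: 5)
    @min_uptime      = min_uptime       # seconds a process must survive
    @max_quick_exits = max_quick_exits  # quick exits tolerated in a row
    @backoff         = backoff          # delay once the limit is reached
    @normal_delay    = normal_delay     # assumed delay for an ordinary restart
    @quick_exits     = 0
  end

  # called when the process exits; returns seconds to wait before restarting
  def next_delay(uptime)
    if uptime < @min_uptime
      @quick_exits += 1
    else
      @quick_exits = 0                  # a long-lived run clears the count
    end
    @quick_exits >= @max_quick_exits ? @backoff : @normal_delay
  end
end

policy = RestartPolicy.new
p [4, 5, 4].map { |uptime| policy.next_delay(uptime) }
```

note that this model does not look at the exit status at all.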

but, let's assume we sometimes want to shutdown nicely and know what we are
doing.  we run this:

   [ahoward@localhost html]$ cat env2.fcgi
   #! /usr/local/bin/ruby
   require 'fcgi'
   trap('USR2'){ exit 0 }
   loaded, pid = Time::now, Process::pid
   FCGI.each_cgi do |cgi|
     env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
     content = <<-html
       LOADED @ #{ loaded } <br>\n
       PID @ #{ pid } <br>\n
       <hr><hr>
       #{ env }
     html
     cgi.out{ content }
   end

   [ahoward@localhost html]$ lynx -dump http://localhost/env2.fcgi |grep PID
      PID @ 12690


note that this one exits immediately with success, doing no cleanup, if it
gets USR2.  let's test it out:

   [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
   [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
   [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`

checking the log

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12865) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12877)
   [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12877) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12883)
   [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12883) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so this is better - by exiting with zero we at least got a few restarts out of
it - the process manager thought this was ok and just logged it.  however,
restarting too rapidly caused us to be backed off into oblivion.  there are
config options to control this, but consider setting them to NOT back off - a
typo in a script would cause a loop in the webserver where it just tried over
and over to restart the app.  a bunch of these could easily bring a system to
its knees.  so i'm thinking that 'fixing' this problem would create a far
worse one with system crashing implications.

so i'm not sure what to do, but adding a signal handler that exits with
success may be a start in the right direction.  this would allow nice restarts
so long as you didn't do them too quickly.  if you are doing them too quickly
you really shouldn't be hitting the fcgi page anyhow so maybe this is good
enough.
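
a way to check the 'exits with success' part without apache in the loop: fork
a child that installs the same USR2 trap, signal it, and look at the status the
parent collects.  this is a sketch under assumptions - exit_status_after_usr2
is a made-up helper, the child just sleeps instead of sitting in
FCGI.each_cgi, and it needs fork so it is unix only.

```ruby
# hypothetical check that a process with trap('USR2'){ exit 0 } really does
# report status 0 to whoever is waiting on it.
def exit_status_after_usr2
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    trap('USR2') { exit 0 }   # same handler as in env2.fcgi
    writer.puts 'ready'       # tell the parent the trap is installed
    writer.close
    sleep                     # stand-in for blocking while waiting on a request
  end
  writer.close
  reader.gets                 # wait for 'ready' so the signal can't race the trap
  reader.close
  Process.kill('USR2', pid)
  _, status = Process.wait2(pid)
  status.exitstatus           # nil if the child was killed by the signal instead
end

p exit_status_after_usr2
```

without the trap the default action for USR2 kills the child, and exitstatus
comes back nil rather than 0.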

so... all that is totally nix/apache specific and i'd imagine none of it would
work in windows.  but maybe it's a start ;-)

please let me know if you end up learning more - i'll apply anything i find to
my acgi package since all the same things apply there.

cheers.

-a
-- 
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze.  --Nagarjuna
===============================================================================