On Thu, 15 Sep 2005, Jamis Buck wrote:

> Rails applications that use FCGI have been observing some strange behavior.
> I have a hypothesis regarding the cause, but I'd like some feedback as to
> whether it is a reasonable hypothesis, and any solutions/workarounds that
> people might have.

on which platforms?

> Sometimes (and some apps experience this more frequently than others) a FCGI
> process that is not currently handling a request will fail to respond to a
> signal (specifically USR1 or HUP) until a request is received.

just to clarify - a fcgi process is __always__ handling a request.  for
instance, if i run this code as a fcgi process:

  [ahoward@localhost html]$ cat ./env.fcgi
  #! /usr/local/bin/ruby
  require 'fcgi'

  loaded, pid = Time::now, Process::pid

  FCGI.each_cgi do |cgi|
    env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
    content = <<-html
      LOADED @ #{ loaded } <br>\n
      PID @ #{ pid } <br>\n
      <hr><hr>
      #{ env }
    html
    cgi.out{ content }
  end

  [ahoward@localhost html]$ links -dump http://localhost/env.fcgi |grep PID
  PID @ 12568

and then check that process

  [root@localhost ahoward]# strace -p 12568
  Process 12568 attached - interrupt to quit
  select(1, [0], NULL, NULL, NULL
  ...

i see it's waiting for a request and blocked in select to io multiplex.
checking os_unix.c in the fcgi lib source we see

  void OS_ShutdownPending()
  {
      shutdownPending = TRUE;
  }

  static void OS_Sigusr1Handler(int signo)
  {
      OS_ShutdownPending();
  }
  ...
  int OS_Accept(int listen_sock, int fail_on_intr, const char *webServerAddrs)
  {
      int socket = -1;
      union {
          struct sockaddr_un un;
          struct sockaddr_in in;
      } sa;

      for (;;) {
          if (AcquireLock(listen_sock, fail_on_intr))
              return -1;

          for (;;) {
              do {
  #ifdef HAVE_SOCKLEN
                  socklen_t len = sizeof(sa);
  #else
                  int len = sizeof(sa);
  #endif
                  if (shutdownPending) break;
                  /* There's a window here */

                  socket = accept(listen_sock, (struct sockaddr *)&sa, &len);
              } while (socket < 0
                       && errno == EINTR
                       && ! fail_on_intr
                       && ! shutdownPending);
  ...
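
the pattern in that excerpt (a handler that only records the shutdown request,
and a loop that checks the flag before it goes back to waiting) can be sketched
in ruby.  this is only an illustration under assumptions: ShutdownFlag and
serve_until_shutdown are made-up names, not part of the fcgi library, and the
process signals itself to stand in for an outside kill.

```ruby
# Sketch only: ShutdownFlag and serve_until_shutdown are hypothetical names,
# not part of the fcgi library.
class ShutdownFlag
  def initialize
    @pending = false
  end

  # Meant to be called from a signal handler: record the request and return.
  def request!
    @pending = true
  end

  def pending?
    @pending
  end
end

# Handle requests one at a time and check the flag before each one, much as
# OS_Accept checks shutdownPending before it calls accept().
def serve_until_shutdown(flag, requests)
  handled = []
  requests.each do |req|
    break if flag.pending?
    handled << yield(req)
  end
  handled
end

flag = ShutdownFlag.new
trap('USR1') { flag.request! }

handled = serve_until_shutdown(flag, %w[a b c]) do |req|
  if req == 'b'
    # Simulate a signal that arrives while request "b" is being handled.
    Process.kill('USR1', Process.pid)
    sleep 0.01 until flag.pending?
  end
  req.upcase
end

p handled  # prints ["A", "B"]: "b" is finished, "c" is never started
```

the request in flight is completed and the loop stops before taking the next
one, which is the behaviour the flag check is there to provide.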
so it seems that the signal handler sets a global flag which is checked at
appropriate times.  we can send a signal to the process and see what happens:

  [root@localhost html]# kill -HUP 12568

and, back in our strace window we see:

  --- SIGHUP (Hangup) @ 0 (0) ---
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  rt_sigaction(SIGINT, {SIG_DFL}, {0x80a4884, [], SA_RESTART}, 8) = 0
  exit_group(1)                           = ?
  Process 12568 detached

looks fine - so it does, in fact, receive and handle the signal asap.  but wait
a minute.... it exited with 1 for failure.  checking the apache logs we see:

  [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
  [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" (pid 12568) terminated by calling exit with status '1'
  [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" restarted (pid 12614)

seems __ok__.  but let's do it a few times:

  [root@localhost html]# echo `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
  12614
  [root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
  [root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`

now we check the logs:

  [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
  [Thu Sep 15 10:15:34 2005] [error] [client 127.0.0.1] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/env.fcgi"
  [Thu Sep 15 10:15:34 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so now the bloody thing won't run for ten minutes!  the apache process manager
is preventing rapid startup/shutdown by buggy fcgi processes and this makes
sense since thousands of them could hose a system.  but, let's assume we
sometimes want to shutdown nicely and know what we are doing.  we run this:

  [ahoward@localhost html]$ cat env2.fcgi
  #! /usr/local/bin/ruby
  require 'fcgi'

  trap('USR2'){ exit 0 }

  loaded, pid = Time::now, Process::pid

  FCGI.each_cgi do |cgi|
    env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
    content = <<-html
      LOADED @ #{ loaded } <br>\n
      PID @ #{ pid } <br>\n
      <hr><hr>
      #{ env }
    html
    cgi.out{ content }
  end

  [ahoward@localhost html]$ lynx -dump http://localhost/env2.fcgi |grep PID
  PID @ 12690

note that this one exits, doing no cleanup, immediately with success if it gets
USR2.  let's test it out:

  [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
  [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
  [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`

checking the log

  [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
  [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12865) terminated by calling exit with status '0'
  [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12877)
  [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12877) terminated by calling exit with status '0'
  [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12883)
  [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12883) terminated by calling exit with status '0'
  [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so this is better - at least we got a few restarts out of it this time by
exiting with zero - the process manager thought this was ok and just logged
it.  however, restarting too rapidly still caused us to be backed off into
oblivion.
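
the back-off seen in that log can be modelled roughly as below.  this is a
guess at the policy implied by the log message (30 seconds, 3 attempts, 600
seconds), not mod_fastcgi's actual code: RestartThrottle, its parameters and
the zero-second normal delay are all assumptions made for illustration.

```ruby
# Hypothetical model of a restart throttle; NOT mod_fastcgi's implementation.
# The defaults are taken from the log message quoted above.
class RestartThrottle
  def initialize(min_uptime: 30, max_attempts: 3, backoff: 600)
    @min_uptime   = min_uptime    # seconds a process must survive to count as healthy
    @max_attempts = max_attempts  # consecutive quick exits tolerated before backing off
    @backoff      = backoff       # seconds to wait once the limit is reached
    @quick_exits  = 0
  end

  # Record that the process exited after running for `uptime` seconds and
  # return how many seconds to wait before the next restart.
  def record_exit(uptime)
    if uptime >= @min_uptime
      @quick_exits = 0            # a long-lived process resets the count
    else
      @quick_exits += 1
    end
    @quick_exits >= @max_attempts ? @backoff : 0
  end
end

throttle = RestartThrottle.new
# Three exits within a few seconds of each start, as in the USR2 test above.
p [4, 5, 4].map { |uptime| throttle.record_exit(uptime) }  # prints [0, 0, 600]
```

note that the exit status plays no part in this model: under these assumptions
a process that exits with 0 three times in a row is throttled just like one
that exits with 1.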
there are config options to control this, but consider setting them to NOT
back off - a typo in a script would cause a loop in the webserver where it just
tried over and over to restart the app.  a bunch of these could easily bring a
system to its knees.  so i'm thinking that 'fixing' this problem would create a
far worse one with system crashing implications.

so i'm not sure what to do, but adding a signal handler that exits with success
may be a start in the right direction.  this would allow nice restarts so long
as you didn't do them too quickly.  if you are doing them too quickly you
really shouldn't be hitting the fcgi page anyhow so maybe this is good enough.

so... all that is totally nix/apache specific and i'd imagine none of it would
work in windows.  but maybe it's a start ;-)

please let me know if you end up learning more - i'll apply anything i find to
my acgi package since all the same things apply there.

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze.  --Nagarjuna
===============================================================================