From mboxrd@z Thu Jan 1 00:00:00 1970 From: Willem Riede Subject: [RFC] Change signal used to exit scsi error handlers Date: Wed, 1 Jan 2003 16:05:55 -0500 Sender: linux-scsi-owner@vger.kernel.org Message-ID: <20030101210555.GS1378@linnie.riede.org> Reply-To: wrlk@riede.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="pvezYHf7grwyp3Bc" Content-Transfer-Encoding: 8bit Return-path: Received: from linnie.riede.org (localhost.localdomain [127.0.0.1]) by linnie.riede.org (8.11.6/8.11.6) with ESMTP id h01L5tf08805 for ; Wed, 1 Jan 2003 16:05:55 -0500 Content-Disposition: inline List-Id: linux-scsi@vger.kernel.org To: linux-scsi@vger.kernel.org --pvezYHf7grwyp3Bc Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit I earlier reported, that the error handler for ide-scsi exits prematurely if modprobed from rc.sysinit. I put in some debug prints to apprehend the culprit responsible for sending the SIGHUP signal that causes the exit. This is what my log captured: Jan 1 12:20:13 fallguy kernel: Process 223 [modprobe] starting scsi error handler Jan 1 12:20:13 fallguy kernel: Wake up parent of scsi_eh_2, pid 224 Jan 1 12:20:13 fallguy kernel: Signals pending for scsi_eh_2: 00000000 00000000 Jan 1 12:20:13 fallguy kernel: Error handler scsi_eh_2 sleeping Jan 1 12:20:13 fallguy kernel: scsi2 : SCSI host adapter emulation for IDE ATAPI devices [detected devices skipped] Jan 1 12:20:14 fallguy kernel: Signal 15 sent from 181 [rc.sysinit] to 182 [getkey] Jan 1 12:20:14 fallguy kernel: Signal 1 sent from 22 [init] to 22 [init] Jan 1 12:20:14 fallguy kernel: Signal 18 sent from 22 [init] to 22 [init] Jan 1 12:20:14 fallguy kernel: Signal 1 sent from 22 [init] to 22 [init] Jan 1 12:20:14 fallguy kernel: Signal 1 sent from 22 [init] to 24 [initlog] Jan 1 12:20:14 fallguy kernel: Signal 1 sent from 22 [init] to 78 [khubd] Jan 1 12:20:14 fallguy kernel: Signal 1 sent from 22 [init] to 224 [scsi_eh_2] Jan 1 12:20:14 fallguy kernel: Signals pending for scsi_eh_2: 00000001 00000000 Jan 1 12:20:14 fallguy kernel: Error handler scsi_eh_2 exiting Here is a snapshot of some processes made during rc.sysinit: F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND 100 0 1 0 15 0 1332 420 schedu S ? 0:05 init ... 040 0 22 1 16 0 1332 388 wait4 S tty1 0:00 init 000 0 23 22 15 0 4116 1316 wait4 S tty1 0:00 /bin/bash / 040 0 24 23 16 0 2160 1364 schedu S tty1 0:00 /sbin/initl ... Init must have forked to exec bash to exec rc.sysinit which then gets re-executed through initlog. When rc.sysinit ends, the last thing it does is send that TERM signal from sub-process 181 to getkey (process 182) -- the 'Signal 15 ...' line above. As the forked init (process 22) exits, it sends a flurry of signals to all surviving processes created from it. That looks like standard "if I am to die I need to take all my offspring down with me" behavior -- do you agree? Since we want error handlers to survive, IMHO that means that the choice of signal for error handler exit is unfortunate. The source of scsi_error suggests SIGPWR might be a worthy alternative. I think that is true. From inspecting init source, it is not capable of sending SIGPWR. SIGPWR should never be sent by dying processes (its sole use should be from a power daemon _to_ init to shut the system down when the juice is running out). So I suggest the following changes to hosts.c and scsi_error.c: --- drivers/scsi/hosts.c Tue Dec 24 09:59:30 2002 +++ /home/wriede/develop/hosts.c Wed Jan 1 15:09:05 2003 @@ -337,7 +337,7 @@ if (shost->ehandler) { DECLARE_MUTEX_LOCKED(sem); shost->eh_notify = &sem; - send_sig(SIGHUP, shost->ehandler, 1); + send_sig(SIGPWR, shost->ehandler, 1); down(&sem); shost->eh_notify = NULL; } --- drivers/scsi/scsi_error.c Tue Dec 24 09:59:30 2002 +++ /home/wriede/develop/scsi_error.c Wed Jan 1 15:21:46 2003 @@ -52,8 +52,12 @@ * go to single-user mode. For that matter, init also sends SIGKILL, * so we mustn't enable that one either. We use SIGHUP instead. Other * options would be SIGPWR, I suppose. + * + * Changed behavior 1/1/2003 - it turns out, that SIGHUP can get sent + * to error handlers from a process responsible for their creation. + * To sidestep that issue, we now use SIGPWR as suggested above. */ -#define SHUTDOWN_SIGS (sigmask(SIGHUP)) +#define SHUTDOWN_SIGS (sigmask(SIGPWR)) #ifdef DEBUG #define SENSE_TIMEOUT SCSI_TIMEOUT Seperatly, I'd like to suggest improving the debug printout associated with the error handler process. Full diffs against 2.5.53 attached. If accepted, they need to go in 2.4.x too, as I have confirmed, that the same problem exists there. Comments, please. Willem Riede. --pvezYHf7grwyp3Bc Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="hosts.patch" --- drivers/scsi/hosts.c Tue Dec 24 09:59:30 2002 +++ /home/wriede/develop/hosts.c Wed Jan 1 15:09:05 2003 @@ -337,7 +337,7 @@ if (shost->ehandler) { DECLARE_MUTEX_LOCKED(sem); shost->eh_notify = &sem; - send_sig(SIGHUP, shost->ehandler, 1); + send_sig(SIGPWR, shost->ehandler, 1); down(&sem); shost->eh_notify = NULL; } --pvezYHf7grwyp3Bc Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="scsi_error.patch" --- drivers/scsi/scsi_error.c Tue Dec 24 09:59:30 2002 +++ /home/wriede/develop/scsi_error.c Wed Jan 1 15:21:46 2003 @@ -52,8 +52,12 @@ * go to single-user mode. For that matter, init also sends SIGKILL, * so we mustn't enable that one either. We use SIGHUP instead. Other * options would be SIGPWR, I suppose. + * + * Changed behavior 1/1/2003 - it turns out, that SIGHUP can get sent + * to error handlers from a process responsible for their creation. + * To sidestep that issue, we now use SIGPWR as suggested above. */ -#define SHUTDOWN_SIGS (sigmask(SIGHUP)) +#define SHUTDOWN_SIGS (sigmask(SIGPWR)) #ifdef DEBUG #define SENSE_TIMEOUT SCSI_TIMEOUT @@ -1619,7 +1623,7 @@ /* * Wake up the thread that created us. */ - SCSI_LOG_ERROR_RECOVERY(3, printk("Wake up parent \n")); + SCSI_LOG_ERROR_RECOVERY(3, printk("Wake up parent of scsi_eh_%d\n",shost->host_no)); up(shost->eh_notify); @@ -1629,7 +1633,7 @@ * away and die. This typically happens if the user is * trying to unload a module. */ - SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler sleeping\n")); + SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler scsi_eh_%d sleeping\n",shost->host_no)); /* * Note - we always use down_interruptible with the semaphore @@ -1644,7 +1648,7 @@ if (signal_pending(current)) break; - SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler waking up\n")); + SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler scsi_eh_%d waking up\n",shost->host_no)); shost->eh_active = 1; @@ -1672,7 +1676,7 @@ } - SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler exiting\n")); + SCSI_LOG_ERROR_RECOVERY(1, printk("Error handler scsi_eh_%d exiting\n",shost->host_no)); /* * Make sure that nobody tries to wake us up again. --pvezYHf7grwyp3Bc--