From: Douglas Gilbert <dougg@torque.net>
To: "Dailey, Nate" <Nate.Dailey@stratus.com>
Cc: linux-scsi@vger.kernel.org
Subject: Re: requesting advice: oops when doing sg_dd IO and device is removed
Date: Tue, 18 Oct 2005 14:54:37 +1000 [thread overview]
Message-ID: <4354800D.309@torque.net> (raw)
In-Reply-To: <92952AEF1F064042B6EF2522E0EEF437024334E7@EXNA.corp.stratus.com>
Nate,
I always suspected that if one really tried hard enough
then unexpectedly closing a file descriptor (e.g. via
control-C on an app) at the same time as there are
outstanding commands could cause problems. Obviously
you have succeeded demonstrating that with your experiment.
When I fought with this problem in the lk 2.4 series,
someone skilled at finding flaws in locking logic
suggested there was no solution (at least the way
I was doing it) ...
Waiting a close() call (i.e. sg_release() ), potentially
indefinitely, was very unpopular with application clients.
The sg driver did that in the old days, and I was abused
from several quarters (one name springs to mind). So I
tried to design sg_release() so it wouldn't ever hang
the application client.
On reflection, I think to allow sg_release() to go
through quickly, another kernel thread is needed that
is passed ownership of unfinished commands. What do
you think?
Doug Gilbert
Dailey, Nate wrote:
> Doug (and anyone else who might be interested), I'm looking for some
> advice as to how best to fix a problem I hit in sg.
>
> I was doing the following:
>
> - start up several processes reading from an sg device with sg_dd
> - every other iteration, kill the processes
> - remove the device (echo 1 > /sys/bus/scsi/devices/1:0:1:0/delete)
> - add the device (echo "0 1 0" > /sys/class/scsi_host/host1/scan)
>
> This was on a 2.6.9 kernel, with a patch:
>
> http://marc.theaimsgroup.com/?l=linux-scsi&m=112749860222533&w=2
>
> That I've submitted to the LSML... patch was intended to fix a different
> oops when doing the above test.
>
> Here's what I saw, after the device was removed:
>
> Debug: sleeping function called from invalid context at
> include/linux/rwsem.h:6in_atomic():1[expected: 0], irqs_disabled():0
> [<c011fcd1>] __might_sleep+0x7d/0x88
> [<c021cb5e>] device_del+0x1e/0x90
> [<f8840ccc>] scsi_device_dev_release+0xdf/0x13a [scsi_mod]
> [<c021c8e9>] device_release+0x11/0x40
> [<c01bf03f>] kobject_cleanup+0x40/0x60
> [<c01bf05f>] kobject_release+0x0/0x8
> [<c01bf305>] kref_put+0x42/0x45
> [<f883e61b>] scsi_next_command+0xc/0x14 [scsi_mod]
> [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod]
> [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod]
> [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod]
> [<c0126424>] __do_softirq+0x4c/0xb1
> [<c010812f>] do_softirq+0x4f/0x56
> =======================
> [<c0107a44>] do_IRQ+0x1a2/0x1ae
> [<c02d1a8c>] common_interrupt+0x18/0x20
> [<c0104018>] default_idle+0x0/0x2c
> [<c0104041>] default_idle+0x29/0x2c
> [<c010409d>] cpu_idle+0x26/0x3b
> [<c0390786>] start_kernel+0x199/0x19d
>
> (It looks like a command comes back, we drop the last reference on the
> device, the device is freed... because we're in softirq, and device_del
> may eventually sleep (in this kernel version, anyway), we get this debug
> message. I don't think this should normally happen when a command comes
> back, for reasons I'll discuss below)
>
> Continuing on...
>
> Badness in kref_get at lib/kref.c:32
> [<c01bf2bb>] kref_get+0x24/0x2c
> [<c01beffb>] kobject_get+0xf/0x13
> [<c021cb32>] get_device+0xe/0x14
> [<f883ef34>] scsi_request_fn+0x19/0x319 [scsi_mod]
> [<c022213e>] blk_run_queue+0x20/0x2f
> [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod]
> [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod]
> [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod]
> [<c0126424>] __do_softirq+0x4c/0xb1
> [<c010812f>] do_softirq+0x4f/0x56
> =======================
> [<c0107a44>] do_IRQ+0x1a2/0x1ae
> [<c02d1a8c>] common_interrupt+0x18/0x20
> [<c0104018>] default_idle+0x0/0x2c
> [<c0104041>] default_idle+0x29/0x2c
> [<c010409d>] cpu_idle+0x26/0x3b
> [<c0390786>] start_kernel+0x199/0x19d
>
> (So, there were more commands on the device's queue... presumably for
> sg, since I wasn't doing any other IO to this disk. We're trying to get
> a reference on the device, but the ref count has already gone to 0 so
> this "Badness" message results.)
>
> Continuing on...
>
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000008
> printing eip:
> c0229678
> *pde = 00004001
> Oops: 0000 [#1]
> SMP
> Modules linked in: sg(U) parport_pc lp parport autofs4 nfs lockd sunrpc
> md5
> ip)
> CPU: 0
> EIP: 0060:[<c0229678>] Tainted: PF VLI
> EFLAGS: 00010046 (2.6.9-20.ELsmp)
> EIP is at cfq_next_request+0x7/0x35
> eax: f7b6f5e0 ebx: 00000000 ecx: c03249bc edx: f243a594
> esi: f7b6f5e0 edi: f7f5b000 ebp: f7b6f5e0 esp: c03c7f80
> ds: 007b es: 007b ss: 0068
> Process swapper (pid: 0, threadinfo=c03c7000 task=c031ea80)
> Stack: f7b6f5e0 f7b6f5e0 c02209dc f7b6f5e0 f243a400 f7f5b000 f883ef51
> f7b6f5e0
> 00000246 f243a400 00000000 c022213e f7d62800 f3872800 f883a0a8
> f7f5b000
> f883ab15 00002002 f3872800 c03c7fd4 f883aa3a c03c7fd4 c03c7fd4
> 00000003
> Call Trace:
> [<c02209dc>] elv_next_request+0xbe/0xce
> [<f883ef51>] scsi_request_fn+0x36/0x319 [scsi_mod]
> [<c022213e>] blk_run_queue+0x20/0x2f
> [<f883a0a8>] scsi_release_request+0x8/0x10 [scsi_mod]
> [<f883ab15>] scsi_finish_command+0xad/0xb1 [scsi_mod]
> [<f883aa3a>] scsi_softirq+0xb6/0xbe [scsi_mod]
> [<c0126424>] __do_softirq+0x4c/0xb1
> [<c010812f>] do_softirq+0x4f/0x56
> =======================
> [<c0107a44>] do_IRQ+0x1a2/0x1ae
> [<c02d1a8c>] common_interrupt+0x18/0x20
> [<c0104018>] default_idle+0x0/0x2c
> [<c0104041>] default_idle+0x29/0x2c
> [<c010409d>] cpu_idle+0x26/0x3b
> [<c0390786>] start_kernel+0x199/0x19d
> Code: 01 00 00 00 89 e8 eb b6 8b 04 24 3b 43 24 0f 92 c0 31 d2 85 ff 0f
> 95 c2
>
> (It was inevitable that something bad would happen when trying to run
> the queue on a device which had been freed)
>
>
> My theory here is that sg should be holding a reference on the SCSI
> device until all IO comes back. With or without the patch I mentioned
> above, this is not the case. Although sg_release does take out an extra
> reference if there's any IO still outstanding, if sg_remove is called,
> this reference is dropped. So after sg_remove, there could still be IO
> to the device outstanding, but no reference on the device (from sg).
> This means the device could get freed even though there's still some sg
> IO on the queue.
>
> Does this analysis sound reasonable? I'm a bit puzzled by the fact that
> sd (for example) doesn't appear to have this problem (maybe sd_release
> just never gets called until all IO comes back?).
>
>
> So, how to fix this?
>
> The first thing I tried was adding a kref to the Sg_device struct,
> patterned after the reference counting in sd, sr, st. In sg_release, I
> would drop a reference only if there was no IO outstanding. If there
> _was_ IO outstanding, I would drop the reference in sg_cmd_done when the
> last IO came back. In theory, this would ensure that sg kept a reference
> on the device until all IO came back.
>
> First problem: I was protecting the kref with a mutex, as sd, sr, etc.
> do. Trying to down the mutex from sg_cmd_done didn't work (as this is
> done from softirq). Okay, so I took out the mutex, just to see how
> things would work. Still there was a problem... doing the final
> scsi_device_put in sg_cmd_done won't work, because
> scsi_device_dev_release calls device_del, which may sleep (in this
> kernel version, anyway... I think this has changed in later kernels). So
> it seems trying to drop a reference when the last command comes back is
> out of the question.
>
> The next thing I tried was simply sleeping in sg_release until all
> outstanding IO comes back (changing sg_remove_sfp to return a count of
> outstanding IOs), and only then dropping the device reference (patch
> below). This seems to be working fine. But I'm wondering if this is
> really a good solution? I'm not sure I can think of anything else... any
> ideas?
>
> Thanks!
>
> Nate Dailey
> Stratus Technologies
>
>
> Here's the "wait in sg_release" patch (which I think is still useful,
> even if it's not strictly required). This is against a 2.6.9 kernel, but
> I can resubmit a patch against a later kernel if desired...
>
> --- sg.c.orig 2005-10-13 16:50:00.806798387 -0400
> +++ sg.c 2005-10-13 16:07:06.365733548 -0400
> @@ -49,6 +49,7 @@ static int sg_version_num = 30531; /* 2
> #include <linux/seq_file.h>
> #include <linux/blkdev.h>
> #include <linux/delay.h>
> +#include <linux/kref.h>
>
> #include "scsi.h"
> #include <scsi/scsi_host.h>
> @@ -173,6 +174,7 @@ typedef struct sg_device { /* holds the
> char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10->
> all devs */
> struct gendisk *disk;
> struct cdev * cdev; /* char_dev [sysfs:
> /sys/cdev/major/sg<n>] */
> + struct kref kref;
> } Sg_device;
>
> static int sg_fasync(int fd, struct file *filp, int mode);
> @@ -218,11 +220,61 @@ static Sg_device **sg_dev_arr = NULL;
> static int sg_dev_max;
> static int sg_nr_dev;
>
> +/* This semaphore is used to mediate the 0->1 reference get in the
> + * face of object destruction (i.e. we can't allow a get on an
> + * object after last put) */
> +static DECLARE_MUTEX(sg_ref_sem);
> +
> +static void scsi_generic_release(struct kref *kref);
> +#define to_scsi_generic(obj) container_of(obj,Sg_device,kref)
> +
> #define SZ_SG_HEADER sizeof(struct sg_header)
> #define SZ_SG_IO_HDR sizeof(sg_io_hdr_t)
> #define SZ_SG_IOVEC sizeof(sg_iovec_t)
> #define SZ_SG_REQ_INFO sizeof(sg_req_info_t)
>
> +static Sg_device *scsi_generic_get(int dev)
> +{
> + Sg_device *sdp = NULL;
> +
> + down(&sg_ref_sem);
> +
> + sdp = sg_get_dev (dev);
> + if (!sdp) goto out;
> +
> + if (sdp->detached) {
> + sdp = NULL;
> + goto out;
> + }
> +
> + kref_get(&sdp->kref);
> +
> + if (!sdp->device)
> + goto out_put;
> +
> + if (scsi_device_get(sdp->device))
> + goto out_put;
> +
> + goto out;
> +
> +out_put:
> + kref_put(&sdp->kref, scsi_generic_release);
> + sdp = NULL;
> +out:
> + up(&sg_ref_sem);
> + return sdp;
> +}
> +
> +static void scsi_generic_put(Sg_device *sdp)
> +{
> + struct scsi_device *sdev = sdp->device;
> +
> + down(&sg_ref_sem);
> + kref_put(&sdp->kref, scsi_generic_release);
> + scsi_device_put(sdev);
> + up(&sg_ref_sem);
> +}
> +
> static int
> sg_open(struct inode *inode, struct file *filp)
> {
> @@ -235,17 +287,8 @@ sg_open(struct inode *inode, struct file
>
> nonseekable_open(inode, filp);
> SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev,
> flags));
> - sdp = sg_get_dev(dev);
> - if ((!sdp) || (!sdp->device))
> + if (!(sdp = scsi_generic_get(dev)))
> return -ENXIO;
> - if (sdp->detached)
> - return -ENODEV;
> -
> - /* This driver's module count bumped by fops_get in <linux/fs.h>
> */
> - /* Prevent the device driver from vanishing while we sleep */
> - retval = scsi_device_get(sdp->device);
> - if (retval)
> - return retval;
>
> if (!((flags & O_NONBLOCK) ||
> scsi_block_when_processing_errors(sdp->device))) {
> @@ -302,7 +345,7 @@ sg_open(struct inode *inode, struct file
> return 0;
>
> error_out:
> - scsi_device_put(sdp->device);
> + scsi_generic_put(sdp);
> return retval;
> }
>
> @@ -317,13 +360,14 @@ sg_release(struct inode *inode, struct f
> return -ENXIO;
> SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n",
> sdp->disk->disk_name));
> sg_fasync(-1, filp, 0); /* remove filp from async notification
> list */
> - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp
> gone */
> - if (!sdp->detached) {
> - scsi_device_put(sdp->device);
> - }
> - sdp->exclude = 0;
> - wake_up_interruptible(&sdp->o_excl_wait);
> - }
> +
> + /* Don't release the device until all IO comes back. */
> + while (sg_remove_sfp(sdp, sfp)){
> + msleep(10);
> + }
> + sdp->exclude = 0;
> + wake_up_interruptible(&sdp->o_excl_wait);
> + scsi_generic_put(sdp);
> return 0;
> }
>
> @@ -1238,7 +1285,7 @@ sg_cmd_done(Scsi_Cmnd * SCpnt)
> sfp = srp->parentfp;
> if (sfp)
> sdp = sfp->parentdp;
> - if ((NULL == sdp) || sdp->detached) {
> + if (NULL == sdp) {
> printk(KERN_INFO "sg_cmd_done: device detached\n");
> scsi_release_request(SRpnt);
> return;
> @@ -1292,18 +1339,8 @@ sg_cmd_done(Scsi_Cmnd * SCpnt)
> scsi_release_request(SRpnt);
> SRpnt = NULL;
> - if (sfp->closed) { /* whoops this fd already released,
> cleanup */
> - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed,
> freeing ...\n"));
> - sg_finish_rem_req(srp);
> - srp = NULL;
> - if (NULL == sfp->headrp) {
> - SCSI_LOG_TIMEOUT(1, printk("sg...bh: already
> closed, final cleanup\n"));
> - if (0 == sg_remove_sfp(sdp, sfp)) { /*
> device still present */
> - scsi_device_put(sdp->device);
> - }
> - sfp = NULL;
> - }
> - } else if (srp && srp->orphan) {
> +
> + if (srp && srp->orphan) {
> if (sfp->keep_orphan)
> srp->sg_io_owned = 0;
> else {
> @@ -1441,6 +1478,7 @@ sg_add(struct class_device *cl_dev)
> }
> k = error;
> sdp = sg_dev_arr[k];
> + kref_init(&sdp->kref);
>
> devfs_mk_cdev(MKDEV(SCSI_GENERIC_MAJOR, k),
> S_IFCHR | S_IRUSR | S_IWUSR | S_IRGRP,
> @@ -1493,45 +1531,33 @@ sg_remove(struct class_device *cl_dev)
> unsigned long iflags;
> Sg_fd *sfp;
> Sg_fd *tsfp;
> - Sg_request *srp;
> - Sg_request *tsrp;
> - int k, delay;
> + int k;
>
> if (NULL == sg_dev_arr)
> return;
> - delay = 0;
> write_lock_irqsave(&sg_dev_arr_lock, iflags);
> for (k = 0; k < sg_dev_max; k++) {
> - sdp = sg_dev_arr[k];
> - if ((NULL == sdp) || (sdp->device != scsidp))
> + if ((NULL == sg_dev_arr[k]) ||
> + (sg_dev_arr[k]->device != scsidp))
> continue; /* dirty but lowers nesting */
> + sdp = sg_dev_arr[k];
> + sdp->detached = 1;
> +
> if (sdp->headfp) {
> - sdp->detached = 1;
> for (sfp = sdp->headfp; sfp; sfp = tsfp) {
> tsfp = sfp->nextfp;
> - for (srp = sfp->headrp; srp; srp = tsrp)
> {
> - tsrp = srp->nextrp;
> - if (sfp->closed || (0 ==
> sg_srp_done(srp, sfp)))
> - sg_finish_rem_req(srp);
> - }
> - if (sfp->closed) {
> - scsi_device_put(sdp->device);
> - __sg_remove_sfp(sdp, sfp);
> - } else {
> - delay = 1;
> + if (!(sfp->closed)) {
>
> wake_up_interruptible(&sfp->read_wait);
> kill_fasync(&sfp->async_qp,
> SIGPOLL,
> POLL_HUP);
> }
> }
> SCSI_LOG_TIMEOUT(3, printk("sg_detach: dev=%d,
> dirty\n", k));
> - if (NULL == sdp->headfp) {
> - sg_dev_arr[k] = NULL;
> - }
> } else { /* nothing active, simple case */
> SCSI_LOG_TIMEOUT(3, printk("sg_detach:
> dev=%d\n", k));
> - sg_dev_arr[k] = NULL;
> }
> +
> + sg_dev_arr[k] = NULL;
> sg_nr_dev--;
> break;
> }
> @@ -1543,14 +1569,32 @@ sg_remove(struct class_device *cl_dev)
> cdev_del(sdp->cdev);
> sdp->cdev = NULL;
> devfs_remove("%s/generic", scsidp->devfs_name);
> - put_disk(sdp->disk);
> - sdp->disk = NULL;
> - if (NULL == sdp->headfp)
> - kfree((char *) sdp);
> +
> + down(&sg_ref_sem);
> + kref_put(&sdp->kref, scsi_generic_release);
> + up(&sg_ref_sem);
> }
> +}
> +
> +/**
> + * scsi_generic_release - Called to free the Sg_device structure
> + * @kref: pointer to embedded kref
> + *
> + * sg_ref_sem must be held entering this routine. Because it is
> + * called on last put, you should always use the
> scsi_generic_get()
> + * and scsi_generic_put() helpers which manipulate the semaphore
> + * directly and never do a direct kref_put().
> + **/
> +static void scsi_generic_release(struct kref *kref)
> +{
> + Sg_device *sdp = to_scsi_generic(kref);
> + struct gendisk *disk = sdp->disk;
> +
> + disk->private_data = NULL;
> + put_disk(disk);
>
> - if (delay)
> - msleep(10); /* dirty detach so delay device
> destruction */
> + kfree((char *) sdp);
> + return;
> }
>
> /* Set 'perm' (4th argument) to 0 to disable module_param's definition
> @@ -2484,14 +2528,13 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd *
> sg_page_free((char *) sfp, sizeof (Sg_fd));
> }
>
> -/* Returns 0 in normal case, 1 when detached and sdp object removed */
> +/* Returns a count of IOs still in progress. */
> static int
> sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp)
> {
> Sg_request *srp;
> Sg_request *tsrp;
> int dirty = 0;
> - int res = 0;
>
> for (srp = sfp->headrp; srp; srp = tsrp) {
> tsrp = srp->nextrp;
> @@ -2505,30 +2548,16 @@ sg_remove_sfp(Sg_device * sdp, Sg_fd * s
>
> write_lock_irqsave(&sg_dev_arr_lock, iflags);
> __sg_remove_sfp(sdp, sfp);
> - if (sdp->detached && (NULL == sdp->headfp)) {
> - int k, maxd;
> -
> - maxd = sg_dev_max;
> - for (k = 0; k < maxd; ++k) {
> - if (sdp == sg_dev_arr[k])
> - break;
> - }
> - if (k < maxd)
> - sg_dev_arr[k] = NULL;
> - kfree((char *) sdp);
> - res = 1;
> - }
> write_unlock_irqrestore(&sg_dev_arr_lock, iflags);
> } else {
> - /* MOD_INC's to inhibit unloading sg and associated
> adapter driver */
> - /* only bump the access_count if we actually succeeded
> in
> - * throwing another counter on the host module */
> - scsi_device_get(sdp->device); /* XXX: retval ignored?
> */
> + /* The non-zero dirty count will be returned. This will
> tell
> + the caller not to drop the reference on this device.
> */
> +
> sfp->closed = 1; /* flag dirty state on this fd
> */
> SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d
> writes pending\n",
> dirty));
> }
> - return res;
> + return dirty;
> }
>
> static int
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2005-10-18 4:53 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-10-17 13:33 requesting advice: oops when doing sg_dd IO and device is removed Dailey, Nate
2005-10-18 4:54 ` Douglas Gilbert [this message]
2005-10-18 22:02 ` Tony Battersby
-- strict thread matches above, loose matches on Subject: below --
2005-10-18 21:00 Dailey, Nate
2005-10-20 16:48 Dailey, Nate
2005-10-24 13:04 ` Douglas Gilbert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4354800D.309@torque.net \
--to=dougg@torque.net \
--cc=Nate.Dailey@stratus.com \
--cc=linux-scsi@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).