* [PATCH 0/2] sg: fix races during device removal (v2) @ 2009-01-05 19:07 Tony Battersby 2009-01-08 23:21 ` Douglas Gilbert 2009-01-10 17:26 ` FUJITA Tomonori 0 siblings, 2 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-05 19:07 UTC (permalink / raw) To: Douglas Gilbert, FUJITA Tomonori, James.Bottomley Cc: Christoph Hellwig, linux-scsi This is version 2 of my patch, this time using a kref instead of int. There are also a lot of other changes since the first version since I found even more bugs both through testing and source code analysis. I have split the patch up into two parts: the first patch fixes races between open, close, device removal, and command completion. The second patch fixes some races I spotted in ioctl(SG_IO). Below are a list of test cases fixed by the patch. ---------- test #1 open /dev/sgX send a command that takes a long time (e.g. any tape drive seek command) before command completes, echo 1 > /sys/class/scsi_generic/sgX/device/delete without patch: oops with patch: test program gets ENODEV immediately keventd sleeps until the cmd is complete this is suboptimal since it starves other users of keventd while waiting for the command to complete, but it is better than an oops ---------- test #2 open /dev/sgX send a command that takes a long time (e.g. any tape drive seek command) without waiting for it to complete close fd before command completes, echo 1 > /sys/class/scsi_generic/sgX/device/delete without patch: oops when the command does complete (sg_rq_end_io() bad pointer deref) with patch: keventd sleeps until the cmd is complete this is suboptimal since it starves other users of keventd while waiting for the command to complete, but it is better than an oops ---------- test #3 open /dev/sgX send a command that takes a long time (e.g. any tape drive seek command) without waiting for it to complete close fd rmmod sg without patch: rmmod succeeds without waiting for the command to complete oops when the command does complete (sg_rq_end_io() callback no longer exists) with patch: sg module usage count does not drop to 0 until the command completes, so cannot rmmod ---------- test #4 open /dev/sgX loop: send commands, check results unplug SAS cable mptsas automatically removes the device and fails active commands the test program detects the failed commands and closes its fds at the same time this results in the following sequence: sg_remove() enter sg_release() enter sg_release() exit sg_remove() exit without patch: oops with patch: ok ---------- test #5 open /dev/sgX loop: send commands, check results unplug SAS cable mptsas automatically removes the device and fails active commands the test program detects the failed commands, but sleeps for a second before closing its fds this results in the following sequence: sg_remove() enter sg_remove() exit sg_release() enter sg_release() exit without patch: oops with patch: ok ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-05 19:07 [PATCH 0/2] sg: fix races during device removal (v2) Tony Battersby @ 2009-01-08 23:21 ` Douglas Gilbert 2009-01-10 17:26 ` FUJITA Tomonori 1 sibling, 0 replies; 41+ messages in thread From: Douglas Gilbert @ 2009-01-08 23:21 UTC (permalink / raw) To: Tony Battersby Cc: FUJITA Tomonori, James.Bottomley, Christoph Hellwig, linux-scsi Tony Battersby wrote: > This is version 2 of my patch, this time using a kref instead of int. > There are also a lot of other changes since the first version since I > found even more bugs both through testing and source code analysis. > I have split the patch up into two parts: the first patch fixes > races between open, close, device removal, and command completion. > The second patch fixes some races I spotted in ioctl(SG_IO). Okay, lets go with v2. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-05 19:07 [PATCH 0/2] sg: fix races during device removal (v2) Tony Battersby 2009-01-08 23:21 ` Douglas Gilbert @ 2009-01-10 17:26 ` FUJITA Tomonori 2009-01-12 21:09 ` Tony Battersby 2009-01-14 20:31 ` Tony Battersby 1 sibling, 2 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-10 17:26 UTC (permalink / raw) To: tonyb; +Cc: dgilbert, fujita.tomonori, James.Bottomley, hch, linux-scsi On Mon, 05 Jan 2009 14:07:19 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > This is version 2 of my patch, this time using a kref instead of int. > There are also a lot of other changes since the first version since I > found even more bugs both through testing and source code analysis. > I have split the patch up into two parts: the first patch fixes > races between open, close, device removal, and command completion. > The second patch fixes some races I spotted in ioctl(SG_IO). > > Below are a list of test cases fixed by the patch. Thanks for the great work, > ---------- > > test #1 > > open /dev/sgX > send a command that takes a long time (e.g. any tape drive seek > command) > before command completes, echo 1 > > /sys/class/scsi_generic/sgX/device/delete > > without patch: > oops > > with patch: > test program gets ENODEV immediately > keventd sleeps until the cmd is complete > this is suboptimal since it starves other users of keventd while > waiting for the command to complete, but it is better than an oops As you said, we can do better. It's not correct to make class_interface's remove_dev hook, sg_remove wait for the completion of all the commands. > test #2 > > open /dev/sgX > send a command that takes a long time (e.g. any tape drive seek > command) without waiting for it to complete > close fd > before command completes, echo 1 > > /sys/class/scsi_generic/sgX/device/delete > > without patch: > oops when the command does complete (sg_rq_end_io() bad pointer deref) > > with patch: > keventd sleeps until the cmd is complete > this is suboptimal since it starves other users of keventd while > waiting for the command to complete, but it is better than an oops Ditto. The main problem is about the lifetime of sg_dev and sg_fd. I think that we can fix it in a simpler and better way. You use kref for sg_dev. We can use kref for sg_fd too. I think that that's all we need to do. A sg_request gets the reference of sg_fd. It makes sure that sg_fd doesn't go away if any outstanding sg_requests exists. sg_fd gets the reference of sg_dev (with scsi_device) and makes sure that the sg module doesn't go away. Then sg_remove doesn't wait for anything. sg_fd gets the reference of sg_dev so sg_dev doesn't go away. sg_rq_end_io puts the reference of sg_fd then sg_fd is freed and sg_dev is freed safely. Here's a patch to do the above. It's even simpler than the current sg code. I confirmed that sg can pass the first, second and third tests with this patch. I think that 4th and 5th tests fail because of the lifetime of sg_dev and sg_fd. Can you please try these tests with this patch? = From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Date: Sat, 10 Jan 2009 23:25:24 +0900 Subject: [PATCH] sg: fix the lifetime management of sg_dev and sg_fd To fix the lifetime of sg_dev and sg_fd, this patch adds kref for sg_dev and sg_fd. A sg_request gets the reference of sg_fd. It makes sure that sg_fd doesn't go away if any outstanding sg_requests exists. sg_fd gets the reference of sg_dev (with scsi_device) and makes sure that the sg module doesn't go away. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> --- drivers/scsi/sg.c | 198 +++++++++++++++++++++++++---------------------------- 1 files changed, 93 insertions(+), 105 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 5103855..11b4a56 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -158,6 +158,8 @@ typedef struct sg_fd { /* holds the state of a file descriptor */ char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ + struct kref f_ref; + struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ @@ -171,6 +173,7 @@ typedef struct sg_device { /* holds the state of each scsi generic device */ char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ struct gendisk *disk; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ + struct kref d_ref; } Sg_device; static int sg_fasync(int fd, struct file *filp, int mode); @@ -194,13 +197,13 @@ static void sg_build_reserve(Sg_fd * sfp, int req_size); static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); +static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static int sg_res_in_use(Sg_fd * sfp); static Sg_device *sg_get_dev(int dev); +static void sg_put_dev(struct sg_device *sdp); #ifdef CONFIG_SCSI_PROC_FS static int sg_last_dev(void); #endif @@ -238,21 +241,19 @@ sg_open(struct inode *inode, struct file *filp) SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); sdp = sg_get_dev(dev); if ((!sdp) || (!sdp->device)) { - unlock_kernel(); - return -ENXIO; + retval = -ENXIO; + goto sg_put; } if (sdp->detached) { - unlock_kernel(); - return -ENODEV; + retval = -ENODEV; + goto sg_put; } /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); - if (retval) { - unlock_kernel(); - return retval; - } + if (retval) + goto sg_put; if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { @@ -308,11 +309,12 @@ sg_open(struct inode *inode, struct file *filp) retval = -ENOMEM; goto error_out; } - unlock_kernel(); - return 0; - - error_out: - scsi_device_put(sdp->device); + retval = 0; +error_out: + if (retval) + scsi_device_put(sdp->device); +sg_put: + sg_put_dev(sdp); unlock_kernel(); return retval; } @@ -327,13 +329,11 @@ sg_release(struct inode *inode, struct file *filp) if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ - if (!sdp->detached) { - scsi_device_put(sdp->device); - } - sdp->exclude = 0; - wake_up_interruptible(&sdp->o_excl_wait); - } + + sdp->exclude = 0; + wake_up_interruptible(&sdp->o_excl_wait); + + kref_put(&sfp->f_ref, sg_remove_sfp); return 0; } @@ -1259,12 +1259,10 @@ static void sg_rq_end_io(struct request *rq, int uptodate) return; } sfp = srp->parentfp; - if (sfp) - sdp = sfp->parentdp; - if ((NULL == sdp) || sdp->detached) { + sdp = sfp->parentdp; + + if (sdp->detached) printk(KERN_INFO "sg_cmd_done: device detached\n"); - return; - } sense = rq->sense; result = rq->errors; @@ -1303,18 +1301,7 @@ static void sg_rq_end_io(struct request *rq, int uptodate) } /* Rely on write phase to clean out srp status values, so no "else" */ - if (sfp->closed) { /* whoops this fd already released, cleanup */ - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); - sg_finish_rem_req(srp); - srp = NULL; - if (NULL == sfp->headrp) { - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ - scsi_device_put(sdp->device); - } - sfp = NULL; - } - } else if (srp && srp->orphan) { + if (srp->orphan) { if (sfp->keep_orphan) srp->sg_io_owned = 0; else { @@ -1322,7 +1309,8 @@ static void sg_rq_end_io(struct request *rq, int uptodate) srp = NULL; } } - if (sfp && srp) { + + if (srp) { /* Now wake up any sg_read() that is waiting for this packet. */ kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); write_lock_irqsave(&sfp->rq_list_lock, iflags); @@ -1330,6 +1318,8 @@ static void sg_rq_end_io(struct request *rq, int uptodate) wake_up_interruptible(&sfp->read_wait); write_unlock_irqrestore(&sfp->rq_list_lock, iflags); } + + kref_put(&sfp->f_ref, sg_remove_sfp); } static struct file_operations sg_fops = { @@ -1391,6 +1381,7 @@ static Sg_device *sg_alloc(struct gendisk *disk, struct scsi_device *scsidp) init_waitqueue_head(&sdp->o_excl_wait); sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); sdp->index = k; + kref_init(&sdp->d_ref); error = 0; out: @@ -1496,41 +1487,18 @@ sg_remove(struct device *cl_dev, struct class_interface *cl_intf) unsigned long iflags; Sg_fd *sfp; Sg_fd *tsfp; - Sg_request *srp; - Sg_request *tsrp; - int delay; if (!sdp) return; - delay = 0; + sdp->detached = 1; + write_lock_irqsave(&sg_index_lock, iflags); - if (sdp->headfp) { - sdp->detached = 1; - for (sfp = sdp->headfp; sfp; sfp = tsfp) { - tsfp = sfp->nextfp; - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) - sg_finish_rem_req(srp); - } - if (sfp->closed) { - scsi_device_put(sdp->device); - __sg_remove_sfp(sdp, sfp); - } else { - delay = 1; - wake_up_interruptible(&sfp->read_wait); - kill_fasync(&sfp->async_qp, SIGPOLL, - POLL_HUP); - } - } - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); - if (NULL == sdp->headfp) { - idr_remove(&sg_index_idr, sdp->index); - } - } else { /* nothing active, simple case */ - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); - idr_remove(&sg_index_idr, sdp->index); + for (sfp = sdp->headfp; sfp; sfp = tsfp) { + tsfp = sfp->nextfp; + + wake_up_interruptible(&sfp->read_wait); + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } write_unlock_irqrestore(&sg_index_lock, iflags); @@ -1540,11 +1508,8 @@ sg_remove(struct device *cl_dev, struct class_interface *cl_intf) sdp->cdev = NULL; put_disk(sdp->disk); sdp->disk = NULL; - if (NULL == sdp->headfp) - kfree(sdp); - if (delay) - msleep(10); /* dirty detach so delay device destruction */ + sg_put_dev(sdp); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); @@ -1993,6 +1958,7 @@ sg_add_request(Sg_fd * sfp) if (resp) { resp->nextrp = NULL; resp->header.duration = jiffies_to_msecs(jiffies); + kref_get(&sfp->f_ref); } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return resp; @@ -2060,6 +2026,7 @@ sg_add_sfp(Sg_device * sdp, int dev) init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); + kref_init(&sfp->f_ref); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; @@ -2087,6 +2054,9 @@ sg_add_sfp(Sg_device * sdp, int dev) sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); + + kref_get(&sdp->d_ref); + __module_get(THIS_MODULE); return sfp; } @@ -2114,48 +2084,38 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); sg_remove_scat(&sfp->reserve); } - sfp->parentdp = NULL; SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); kfree(sfp); } -/* Returns 0 in normal case, 1 when detached and sdp object removed */ -static int -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) +static void sg_remove_sfp_usercontext(struct work_struct *work) { + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); + struct sg_device *sdp = sfp->parentdp; + unsigned long iflags; Sg_request *srp; Sg_request *tsrp; - int dirty = 0; - int res = 0; for (srp = sfp->headrp; srp; srp = tsrp) { tsrp = srp->nextrp; - if (sg_srp_done(srp, sfp)) - sg_finish_rem_req(srp); - else - ++dirty; + sg_finish_rem_req(srp); } - if (0 == dirty) { - unsigned long iflags; - write_lock_irqsave(&sg_index_lock, iflags); - __sg_remove_sfp(sdp, sfp); - if (sdp->detached && (NULL == sdp->headfp)) { - idr_remove(&sg_index_idr, sdp->index); - kfree(sdp); - res = 1; - } - write_unlock_irqrestore(&sg_index_lock, iflags); - } else { - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ - /* only bump the access_count if we actually succeeded in - * throwing another counter on the host module */ - scsi_device_get(sdp->device); /* XXX: retval ignored? */ - sfp->closed = 1; /* flag dirty state on this fd */ - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", - dirty)); - } - return res; + write_lock_irqsave(&sg_index_lock, iflags); + __sg_remove_sfp(sdp, sfp); + write_unlock_irqrestore(&sg_index_lock, iflags); + + scsi_device_put(sdp->device); + + sg_put_dev(sdp); + + module_put(THIS_MODULE); +} + +static void sg_remove_sfp(struct kref *kref) +{ + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); } static int @@ -2197,6 +2157,28 @@ sg_last_dev(void) } #endif +static void sg_device_release(struct kref *kref) +{ + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); + unsigned long flags; + + SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); + + write_lock_irqsave(&sg_index_lock, flags); + idr_remove(&sg_index_idr, sdp->index); + write_unlock_irqrestore(&sg_index_lock, flags); + + kfree(sdp); +} + +static void sg_put_dev(struct sg_device *sdp) +{ + if (!sdp) + return; + + kref_put(&sdp->d_ref, sg_device_release); +} + static Sg_device * sg_get_dev(int dev) { @@ -2205,6 +2187,8 @@ sg_get_dev(int dev) read_lock_irqsave(&sg_index_lock, iflags); sdp = idr_find(&sg_index_idr, dev); + if (sdp) + kref_get(&sdp->d_ref); read_unlock_irqrestore(&sg_index_lock, iflags); return sdp; @@ -2478,6 +2462,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v) (int) scsi_device_online(scsidp)); else seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); + sg_put_dev(sdp); return 0; } @@ -2498,6 +2483,7 @@ static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v) scsidp->vendor, scsidp->model, scsidp->rev); else seq_printf(s, "<no active device>\n"); + sg_put_dev(sdp); return 0; } @@ -2582,7 +2568,7 @@ static int sg_proc_seq_show_debug(struct seq_file *s, void *v) if (NULL == scsidp) { seq_printf(s, "device %d detached ??\n", (int)it->index); - return 0; + goto out; } if (sg_get_nth_sfp(sdp, 0)) { @@ -2602,6 +2588,8 @@ static int sg_proc_seq_show_debug(struct seq_file *s, void *v) } sg_proc_debug_helper(s, sdp); } +out: + sg_put_dev(sdp); return 0; } -- 1.6.0.6 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-10 17:26 ` FUJITA Tomonori @ 2009-01-12 21:09 ` Tony Battersby 2009-01-13 16:24 ` FUJITA Tomonori 2009-01-14 20:31 ` Tony Battersby 1 sibling, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-12 21:09 UTC (permalink / raw) To: FUJITA Tomonori; +Cc: dgilbert, James.Bottomley, hch, linux-scsi FUJITA Tomonori wrote: > >> test #1 >> >> open /dev/sgX >> send a command that takes a long time (e.g. any tape drive seek >> command) >> before command completes, echo 1 > >> /sys/class/scsi_generic/sgX/device/delete >> >> without patch: >> oops >> >> with patch: >> test program gets ENODEV immediately >> keventd sleeps until the cmd is complete >> this is suboptimal since it starves other users of keventd while >> waiting for the command to complete, but it is better than an oops >> > > As you said, we can do better. It's not correct to make > class_interface's remove_dev hook, sg_remove wait for the completion > of all the commands. > > The main problem is about the lifetime of sg_dev and sg_fd. I think > that we can fix it in a simpler and better way. > > You use kref for sg_dev. We can use kref for sg_fd too. I think that > that's all we need to do. > > Thanks for the review. Your patch behaves very similarly to my patch with the wait_event() removed from sg_remove(). With my patch, it was not strictly necessary for sg_remove() to wait until all commands had gone through sg_rq_end_io() - the refcounting, locking, and other checks in my patch would have taken care of freeing sdp, sfp, and related resources at the right time, even if sg_remove() didn't wait. The reason I added the wait is because without it keventd would still sleep (outside of sg_remove()) waiting for the command to complete anyway, except that when the command did complete, the kernel would spew some scary-looking error messages: mptscsih: ioc1: ERROR - Received a mf that was already freed mptscsih: ioc1: ERROR - req_idx=beaf req_idx_MR=beaf mf=ce504b00 mr=00000000 sc=00000000 So I figured it would be better to wait in sg_remove() and avoid the error above, instead of waiting somewhere else for the same amount of time and getting an error. With your patch, I get the above error again, and keventd blocks for the same amount of time, so initially it did not look like an improvement. However, since this issue has surfaced again, I went ahead and analyzed why keventd is waiting outside of sg_remove(), and discovered that it is a bug in mptscsih (*). When I retested with sym53c8xx or open-iscsi, keventd did not block waiting for the command to complete. I will send a patch for mptscsih later. So we can accomplish the same effect with my patch just by removing the wait_event() in sg_remove() and the other code associated with it. Now the question is - do we still want to use krefs for the sg_fd? I had considered it when coming up with my patch, but I decided not to add the overhead of two extra atomic ops per command. Atomic ops for opening and closing fds are OK, but I didn't want to slow down the command processing critical paths. I haven't measured the overhead, so maybe it isn't too bad, but it is something to consider. Also, I spent a lot of time analyzing locks and potential races with my patch, and your patch reverts some of the fixes that I incorporated in my patch. Just to point out one example that I spotted quickly: sg_rq_end_io() needs to check srp->orphan and set srp->done = 1 while holding sfp->rq_list_lock to avoid races with sg_ioctl(SG_IO) (see my [PATCH 2/2] sg: fix races with ioctl(SG_IO)). But if you really want to go with a kref on sg_fd and get/put on every command, let me know and I will check and test everything again in detail. If not, I will resubmit my patch with the wait_event and associated code removed. As a side note, scsi_remove_device() is called from keventd when using the /sys interface to delete a device, but may be called from a different process when using a different interface to delete a device. So under different circumstances the sleep may block rmmod, iscsi_wq, etc., instead of keventd. (*) Here is why mptscsih causes keventd to sleep: scsi_remove_device() -> __scsi_remove_device() -> mptscsih_slave_destroy() First, mptscsih_search_running_cmds() calls sc->scsi_done() (sg_rq_end_io()) on all outstanding commands. Then, mptscsih_synchronize_cache() is called. It has the following check: if (vdevice->vtarget->type != TYPE_DISK || vdevice->vtarget->deleted || !vdevice->configured_lun) return; However, this check is buggy. My test is using a tape drive, so the type != TYPE_DISK check should make the function return without doing anything. However, vtarget->type appears not to be set. Also, you might hope that the vtarget->deleted check might make the function return, but it does not. So it ends up sending a synchronize cache disk command to a tape drive being deleted, and waits for it to complete. Since the tape drive does not do command queueing, this forces it to wait for the tape drive seek command previously sent by my test program to finish. But when the seek command does finish, the message frame was already freed by mptscsih_search_running_cmds(), so it gives the error message above. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-12 21:09 ` Tony Battersby @ 2009-01-13 16:24 ` FUJITA Tomonori 0 siblings, 0 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-13 16:24 UTC (permalink / raw) To: tonyb; +Cc: fujita.tomonori, dgilbert, James.Bottomley, hch, linux-scsi On Mon, 12 Jan 2009 16:09:25 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > So we can accomplish the same effect with my patch just by removing the > wait_event() in sg_remove() and the other code associated with it. Now > the question is - do we still want to use krefs for the sg_fd? I had > considered it when coming up with my patch, but I decided not to add the > overhead of two extra atomic ops per command. Atomic ops for opening > and closing fds are OK, but I didn't want to slow down the command > processing critical paths. I haven't measured the overhead, so maybe it > isn't too bad, but it is something to consider. kref_get() adds nothing. We already have lots of atomic_inc per scsi command. Compared with the rest of the overhead of scsi command processing, kref_get is simply nothing. > Also, I spent a lot of time analyzing locks and potential races with my > patch, and your patch reverts some of the fixes that I incorporated in > my patch. Just to point out one example that I spotted quickly: > sg_rq_end_io() needs to check srp->orphan and set srp->done = 1 while > holding sfp->rq_list_lock to avoid races with sg_ioctl(SG_IO) (see my > [PATCH 2/2] sg: fix races with ioctl(SG_IO)). Yeah, I know that sg has other races but let's fix the most critical one first. BTW, I think that we can do better about srp->done too. > But if you really want to > go with a kref on sg_fd and get/put on every command, let me know and I > will check and test everything again in detail. If not, I will resubmit > my patch with the wait_event and associated code removed. Using kref for lifetime management is pretty common in Linux kernel. Other developers can easily understand the code. For example, scsi mid-layer calls get_device() on every command, which is similar to what we need to solve about sg_fd and commands. So I like this way but if you can fix the lifetime of sg_dev and sg_fd in a simpler way than my patch (less lines), then we need to consider about it. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-10 17:26 ` FUJITA Tomonori 2009-01-12 21:09 ` Tony Battersby @ 2009-01-14 20:31 ` Tony Battersby 2009-01-14 21:39 ` Greg KH 2009-01-19 6:57 ` FUJITA Tomonori 1 sibling, 2 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-14 20:31 UTC (permalink / raw) To: FUJITA Tomonori Cc: dgilbert, James.Bottomley, hch, linux-scsi, Greg Kroah-Hartman [CC: Greg Kroah-Hartman <greg@kroah.com>] Fujita, your patch results in simpler code than my version 2 patch, but it still leaves some subtle races and other problems. The crux of the problem is that kref_put() needs do be done while holding a lock if there is still a way for some other CPU to find a reference to the object in between the time the refcount drops to 0 and the time the destructor is called. For example, look at sg_get_dev() and sg_put_dev(). In my patch, I locked sg_index_lock in sg_put_dev() and called kref_put() while holding the lock. In your patch, you moved the lock from sg_put_dev() to the destructor function. This introduces a subtle race with sg_get_dev(): CPU 1: sg_put_dev() kref_put(): refcount 1 -> 0 CPU 2: sg_get_dev() lock sg_index_lock idr_find() kref_get(): refcount 0 -> 1: bug unlock sg_index_lock CPU 1: sg_device_release() lock sg_index_lock idr_remove unlock sg_index_lock kfree(sdp) CPU 2: access sdp which has already been freed sg_put_dev(): double-free bug Besides holding the lock during kref_put(), I also considered two other simple ways to avoid this race: 1) Do idr_remove() from sg_remove(). 2) Return NULL in sg_get_dev() if sdp->detached. However, both of these options would have changed the behavior of the /proc/scsi/sg/* functions that show information for devices that are in the process of being detached. I wanted to fix bugs without changing other behavior, so I chose to call kref_put() under lock in my previous patches. A similar problem exists with sg_remove_sfp() in your patch. The refcount for the sg_fd can drop to 0 while a different CPU can still find a reference to it by scanning sg_device::headfp. The good news is that most places that scan sg_fd's using sdp->headfp never drop sg_index_lock while still holding a reference, so for most cases it is safe for the destructor to work the way you have it. The one exception is sg_get_nth_sfp(), which needs to kref_get() the sg_fd it returns (my version 2 patch doesn't fix this problem either since I just noticed it). But then the kref_get() would have a race similar to the one I described above between sg_get_dev() and sg_put_dev(). There are several solutions to this problem: 1) Call kref_put(&sfp->f_ref, sg_remove_sfp) while holding sg_index_lock, and have the destructor unlink sfp from sdp->headfp before dropping the lock. This is ugly since the destructor may want to drop sg_index_lock (e.g. to call scsi_device_put()), but the saved irqflags variable is in a different function. 2) Unlink sfp from sdp->headfp in sg_release(). The downside to this is that the /proc/scsi/sg/* functions no longer show fds that have been closed but still have commands pending. 3) Set sfp->closed in sg_release() and have sg_get_nth_sfp() return NULL if sfp->closed. Same downside as #2. 4) Add a variation of kref_get() that uses atomic_inc_not_zero(). When I was designing my previous patches, I anticipated running into these complications with kref, which is one reason I avoided it. Other people have experienced similar problems with the kref interface (search "kref_get_not_zero" or "cgroups: fix probable race with put_css_set[_taskexit] and find_css_set"). If you still want to use kref, I think that we are either going to have to change the behavior of /proc/scsi/sg/*, or else use atomic_inc_not_zero(). Since I still prefer not to change valid user-visible behavior in a patch that fixes so many possible oopses, I will go with atomic_inc_not_zero(). And, since I am going that route, I will use it for sg_get_dev() also, so that sg_put_dev() doesn't need to acquire sg_index_lock. For the purposes of this patch, I added the local function sg_kref_get_not_zero() to sg.c. It would be better to add kref_get_not_zero() to kref.c, but I see in the mailing list archives that there has been some resistance to that idea because it complicates the kref interface. However, since it has come up more than once, perhaps it would be better to go ahead and make it an official part of the interface. If anyone wants to support that idea, I will break it out into a separate patch. Your patch also has a problem with management of f_ref. You do kref_get() in sg_add_request() and kref_put() in sg_rq_end_io()(). However, an error can occur in between sg_add_request() and sending the command to the midlevel, so sg_rq_end_io() may never be called. It is better to do kref_get() just before sending the command to the midlevel; that way the refcount counts active commands waiting for the completion callback rather than prepared requests. Your patch doesn't fix the problem of various functions calling SCSI_LOG_TIMEOUT with sdp->disk->disk_name after sg_remove() sets sdp->disk to NULL. This is why I moved put_disk() from sg_remove() to sg_device_release(). Your patch doesn't fix the problem of forgetting to do wake_up_interruptible(&sdp->o_excl_wait) after removing sfp from sdp->headfp. This is important if sg_release() is called with commands still active; the last command that finishes should wake up any O_EXCL waiters. So below is version 3 of my patch combining my ideas and your ideas. For testing with mpt, I commented out mptscsih_synchronize_cache() in mptscsih.c as a workaround for the bug I mentioned in a previous message. This patch still fixes all the test cases previously listed, but with the additional improvements that it is simpler and it no longer blocks keventd waiting for commands to complete. My previously-posted "[PATCH 2/2] sg: fix races with ioctl(SG_IO)" still needs to be applied after this patch. = From: Tony Battersby <tonyb@cybernetics.com> Date: Wed, 14 Jan 2009 15:30:00 -0500 Subject: [PATCH] sg: fix races during device removal (v3) sg has the following problems related to device removal: * opening a sg fd races with removing a device * closing a sg fd races with removing a device * /proc/scsi/sg/* access races with removing a device * command completion races with removing a device * command completion races with closing a sg fd * can rmmod sg with active commands These problems can cause kernel oopses, memory-use-after-free, or double-free errors. This patch fixes these problems by using krefs to manage the lifetime of sg_device and sg_fd. Each command submitted to the midlevel holds a reference to sg_fd until the completion callback. This ensures that sg_fd doesn't go away if the fd is closed with commands still outstanding. sg_fd gets the reference of sg_device (with scsi_device) and also makes sure that the sg module doesn't go away. /proc functions get and put references to sg_fd and sg_device as needed. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- sg.c | 342 ++++++++++++++++++++++++++++++++++++------------------------------- 1 file changed, 184 insertions(+), 158 deletions(-) --- linux-2.6.29-rc1-git4/drivers/scsi/sg.c.orig 2009-01-14 14:16:14.000000000 -0500 +++ linux-2.6.29-rc1-git4/drivers/scsi/sg.c 2009-01-14 14:25:45.000000000 -0500 @@ -158,6 +158,8 @@ typedef struct sg_fd { /* holds the sta char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ + struct kref f_ref; + struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ @@ -171,6 +173,7 @@ typedef struct sg_device { /* holds the char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ struct gendisk *disk; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ + struct kref d_ref; } Sg_device; static int sg_fasync(int fd, struct file *filp, int mode); @@ -194,13 +197,13 @@ static void sg_build_reserve(Sg_fd * sfp static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); +static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static int sg_res_in_use(Sg_fd * sfp); static Sg_device *sg_get_dev(int dev); +static void sg_put_dev(struct sg_device *sdp); #ifdef CONFIG_SCSI_PROC_FS static int sg_last_dev(void); #endif @@ -210,6 +213,37 @@ static int sg_last_dev(void); #define SZ_SG_IOVEC sizeof(sg_iovec_t) #define SZ_SG_REQ_INFO sizeof(sg_req_info_t) +/** + * sg_kref_get_not_zero - increment refcount for object if not already 0. + * @kref: object. + * + * If the object is not being destroyed, then increment the refcount and return + * true. Otherwise, return false without doing anything. + * + * Depending on how you manage the referenced object, there may be ways of + * finding the object in between the time kref_put() decrements the refcount to + * 0 and the time the destructor finishes. For example, if the object can be + * found in a lookup table, and the destructor removes the object from that + * table, then one CPU may still be able to find the object in that table + * for a brief period while a different CPU is running the destructor. To + * prevent problems, either call kref_put() while holding a lock that prevents + * another CPU from finding the object, or else use this function instead of + * kref_get() when using the lookup table to find the object. + * + * This function should be called while holding whatever lock is used to + * protect the data structure used to find the object. If this function + * returns false, then do not access the object after dropping the lock, and do + * not call kref_put(). + */ +static bool __must_check sg_kref_get_not_zero(struct kref *kref) +{ + if (likely(atomic_inc_not_zero(&kref->refcount))) { + smp_mb__after_atomic_inc(); + return true; + } + return false; +} + static int sg_allow_access(struct file *filp, unsigned char *cmd) { struct sg_fd *sfp = (struct sg_fd *)filp->private_data; @@ -238,21 +272,19 @@ sg_open(struct inode *inode, struct file SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); sdp = sg_get_dev(dev); if ((!sdp) || (!sdp->device)) { - unlock_kernel(); - return -ENXIO; + retval = -ENXIO; + goto sg_put; } if (sdp->detached) { - unlock_kernel(); - return -ENODEV; + retval = -ENODEV; + goto sg_put; } /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); - if (retval) { - unlock_kernel(); - return retval; - } + if (retval) + goto sg_put; if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { @@ -303,16 +335,19 @@ sg_open(struct inode *inode, struct file if ((sfp = sg_add_sfp(sdp, dev))) filp->private_data = sfp; else { - if (flags & O_EXCL) + if (flags & O_EXCL) { sdp->exclude = 0; /* undo if error */ + wake_up_interruptible(&sdp->o_excl_wait); + } retval = -ENOMEM; goto error_out; } - unlock_kernel(); - return 0; - - error_out: - scsi_device_put(sdp->device); + retval = 0; +error_out: + if (retval) + scsi_device_put(sdp->device); +sg_put: + sg_put_dev(sdp); unlock_kernel(); return retval; } @@ -327,13 +362,13 @@ sg_release(struct inode *inode, struct f if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ - if (!sdp->detached) { - scsi_device_put(sdp->device); - } - sdp->exclude = 0; - wake_up_interruptible(&sdp->o_excl_wait); - } + + sfp->closed = 1; + + sdp->exclude = 0; + wake_up_interruptible(&sdp->o_excl_wait); + + kref_put(&sfp->f_ref, sg_remove_sfp); return 0; } @@ -755,6 +790,7 @@ sg_common_write(Sg_fd * sfp, Sg_request hp->duration = jiffies_to_msecs(jiffies); srp->rq->timeout = timeout; + kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk, srp->rq, 1, sg_rq_end_io); return 0; @@ -1247,25 +1283,28 @@ sg_mmap(struct file *filp, struct vm_are static void sg_rq_end_io(struct request *rq, int uptodate) { struct sg_request *srp = rq->end_io_data; - Sg_device *sdp = NULL; + Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; - int result, resid; + int result, resid, done = 1; - if (NULL == srp) { - printk(KERN_ERR "sg_cmd_done: NULL request\n"); + if (unlikely(NULL == srp)) { /* shouldn't happen */ + printk(KERN_ERR "sg_rq_end_io: NULL request\n"); return; } + sfp = srp->parentfp; - if (sfp) - sdp = sfp->parentdp; - if ((NULL == sdp) || sdp->detached) { - printk(KERN_INFO "sg_cmd_done: device detached\n"); + if (unlikely(NULL == sfp)) { /* shouldn't happen */ + printk(KERN_ERR "sg_rq_end_io: NULL sg_fd\n"); return; } + sdp = sfp->parentdp; + if (unlikely(sdp->detached)) + printk(KERN_INFO "sg_rq_end_io: device detached\n"); + sense = rq->sense; result = rq->errors; resid = rq->data_len; @@ -1303,33 +1342,25 @@ static void sg_rq_end_io(struct request } /* Rely on write phase to clean out srp status values, so no "else" */ - if (sfp->closed) { /* whoops this fd already released, cleanup */ - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); - sg_finish_rem_req(srp); - srp = NULL; - if (NULL == sfp->headrp) { - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ - scsi_device_put(sdp->device); - } - sfp = NULL; - } - } else if (srp && srp->orphan) { + write_lock_irqsave(&sfp->rq_list_lock, iflags); + if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; - else { - sg_finish_rem_req(srp); - srp = NULL; - } + else + done = 0; } - if (sfp && srp) { - /* Now wake up any sg_read() that is waiting for this packet. */ - kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 1; + srp->done = done; + write_unlock_irqrestore(&sfp->rq_list_lock, iflags); + + if (likely(done)) { + /* Now wake up any sg_read() that is waiting for this + packet. */ wake_up_interruptible(&sfp->read_wait); - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); - } + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); + } else + sg_finish_rem_req(srp); /* call with srp->done == 0 */ + + kref_put(&sfp->f_ref, sg_remove_sfp); } static struct file_operations sg_fops = { @@ -1391,6 +1422,7 @@ static Sg_device *sg_alloc(struct gendis init_waitqueue_head(&sdp->o_excl_wait); sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); sdp->index = k; + kref_init(&sdp->d_ref); error = 0; out: @@ -1488,63 +1520,33 @@ out: return error; } -static void -sg_remove(struct device *cl_dev, struct class_interface *cl_intf) +static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; - Sg_fd *tsfp; - Sg_request *srp; - Sg_request *tsrp; - int delay; - if (!sdp) + if (!sdp || sdp->detached) return; - delay = 0; - write_lock_irqsave(&sg_index_lock, iflags); - if (sdp->headfp) { - sdp->detached = 1; - for (sfp = sdp->headfp; sfp; sfp = tsfp) { - tsfp = sfp->nextfp; - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) - sg_finish_rem_req(srp); - } - if (sfp->closed) { - scsi_device_put(sdp->device); - __sg_remove_sfp(sdp, sfp); - } else { - delay = 1; - wake_up_interruptible(&sfp->read_wait); - kill_fasync(&sfp->async_qp, SIGPOLL, - POLL_HUP); - } - } - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); - if (NULL == sdp->headfp) { - idr_remove(&sg_index_idr, sdp->index); - } - } else { /* nothing active, simple case */ - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); - idr_remove(&sg_index_idr, sdp->index); + SCSI_LOG_TIMEOUT(3, printk("sg_remove: %s\n", sdp->disk->disk_name)); + + sdp->detached = 1; + + read_lock_irqsave(&sg_index_lock, iflags); + for (sfp = sdp->headfp; sfp; sfp = sfp->nextfp) { + wake_up_interruptible(&sfp->read_wait); + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } - write_unlock_irqrestore(&sg_index_lock, iflags); + read_unlock_irqrestore(&sg_index_lock, iflags); sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic"); device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; - put_disk(sdp->disk); - sdp->disk = NULL; - if (NULL == sdp->headfp) - kfree(sdp); - if (delay) - msleep(10); /* dirty detach so delay device destruction */ + sg_put_dev(sdp); /* put initial reference from kref_init. */ } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); @@ -2033,18 +2035,21 @@ sg_remove_request(Sg_fd * sfp, Sg_reques } #ifdef CONFIG_SCSI_PROC_FS -static Sg_fd * -sg_get_nth_sfp(Sg_device * sdp, int nth) +static Sg_fd *sg_get_next_sfp(Sg_device *sdp, Sg_fd *prev_sfp) { - Sg_fd *resp; + Sg_fd *sfp; unsigned long iflags; - int k; read_lock_irqsave(&sg_index_lock, iflags); - for (k = 0, resp = sdp->headfp; resp && (k < nth); - ++k, resp = resp->nextfp) ; + sfp = (prev_sfp == NULL) ? sdp->headfp : prev_sfp->nextfp; + for ( ; sfp != NULL; sfp = sfp->nextfp) { + if (sg_kref_get_not_zero(&sfp->f_ref)) + break; + } read_unlock_irqrestore(&sg_index_lock, iflags); - return resp; + if (prev_sfp != NULL) + kref_put(&prev_sfp->f_ref, sg_remove_sfp); + return sfp; } #endif @@ -2062,6 +2067,7 @@ sg_add_sfp(Sg_device * sdp, int dev) init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); + kref_init(&sfp->f_ref); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; @@ -2089,15 +2095,49 @@ sg_add_sfp(Sg_device * sdp, int dev) sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); + + kref_get(&sdp->d_ref); + __module_get(THIS_MODULE); return sfp; } -static void -__sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) +static void sg_remove_sfp_usercontext(struct work_struct *work) { + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); + struct sg_device *sdp = sfp->parentdp; + + /* Cleanup any responses which were never read(). */ + while (sfp->headrp) + sg_finish_rem_req(sfp->headrp); + + if (sfp->reserve.bufflen > 0) { + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", + (int) sfp->reserve.bufflen, + (int) sfp->reserve.k_use_sg)); + sg_remove_scat(&sfp->reserve); + } + + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: %s, sfp=0x%p\n", + sdp->disk->disk_name, + sfp)); + kfree(sfp); + + scsi_device_put(sdp->device); + sg_put_dev(sdp); + module_put(THIS_MODULE); +} + +static void sg_remove_sfp(struct kref *kref) +{ + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); + struct sg_device *sdp = sfp->parentdp; Sg_fd *fp; Sg_fd *prev_fp; + unsigned long iflags; + write_lock_irqsave(&sg_index_lock, iflags); prev_fp = sdp->headfp; if (sfp == prev_fp) sdp->headfp = prev_fp->nextfp; @@ -2110,54 +2150,10 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * prev_fp = fp; } } - if (sfp->reserve.bufflen > 0) { - SCSI_LOG_TIMEOUT(6, - printk("__sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", - (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); - sg_remove_scat(&sfp->reserve); - } - sfp->parentdp = NULL; - SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); - kfree(sfp); -} - -/* Returns 0 in normal case, 1 when detached and sdp object removed */ -static int -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) -{ - Sg_request *srp; - Sg_request *tsrp; - int dirty = 0; - int res = 0; - - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sg_srp_done(srp, sfp)) - sg_finish_rem_req(srp); - else - ++dirty; - } - if (0 == dirty) { - unsigned long iflags; + write_unlock_irqrestore(&sg_index_lock, iflags); + wake_up_interruptible(&sdp->o_excl_wait); - write_lock_irqsave(&sg_index_lock, iflags); - __sg_remove_sfp(sdp, sfp); - if (sdp->detached && (NULL == sdp->headfp)) { - idr_remove(&sg_index_idr, sdp->index); - kfree(sdp); - res = 1; - } - write_unlock_irqrestore(&sg_index_lock, iflags); - } else { - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ - /* only bump the access_count if we actually succeeded in - * throwing another counter on the host module */ - scsi_device_get(sdp->device); /* XXX: retval ignored? */ - sfp->closed = 1; /* flag dirty state on this fd */ - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", - dirty)); - } - return res; + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); } static int @@ -2199,6 +2195,28 @@ sg_last_dev(void) } #endif +static void sg_device_destroy(struct kref *kref) +{ + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); + unsigned long flags; + + SCSI_LOG_TIMEOUT(3, + printk("sg_device_destroy: %s\n", + sdp->disk->disk_name)); + + write_lock_irqsave(&sg_index_lock, flags); + idr_remove(&sg_index_idr, sdp->index); + write_unlock_irqrestore(&sg_index_lock, flags); + put_disk(sdp->disk); + kfree(sdp); +} + +static void sg_put_dev(struct sg_device *sdp) +{ + if (sdp) + kref_put(&sdp->d_ref, sg_device_destroy); +} + static Sg_device * sg_get_dev(int dev) { @@ -2207,6 +2225,9 @@ sg_get_dev(int dev) read_lock_irqsave(&sg_index_lock, iflags); sdp = idr_find(&sg_index_idr, dev); + if (sdp) + if (!sg_kref_get_not_zero(&sdp->d_ref)) + sdp = NULL; read_unlock_irqrestore(&sg_index_lock, iflags); return sdp; @@ -2480,6 +2501,7 @@ static int sg_proc_seq_show_dev(struct s (int) scsi_device_online(scsidp)); else seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); + sg_put_dev(sdp); return 0; } @@ -2500,19 +2522,19 @@ static int sg_proc_seq_show_devstrs(stru scsidp->vendor, scsidp->model, scsidp->rev); else seq_printf(s, "<no active device>\n"); + sg_put_dev(sdp); return 0; } -static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp) +static void sg_proc_debug_helper(struct seq_file *s, Sg_device *sdp, Sg_fd *fp) { int k, m, new_interface, blen, usg; Sg_request *srp; - Sg_fd *fp; const sg_io_hdr_t *hp; const char * cp; unsigned int ms; - for (k = 0; (fp = sg_get_nth_sfp(sdp, k)); ++k) { + for (k = 0; fp != NULL; ++k, fp = sg_get_next_sfp(sdp, fp)) { seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k + 1, jiffies_to_msecs(fp->timeout), @@ -2580,14 +2602,16 @@ static int sg_proc_seq_show_debug(struct sdp = it ? sg_get_dev(it->index) : NULL; if (sdp) { struct scsi_device *scsidp = sdp->device; + Sg_fd *fp; if (NULL == scsidp) { seq_printf(s, "device %d detached ??\n", (int)it->index); - return 0; + goto out; } - if (sg_get_nth_sfp(sdp, 0)) { + fp = sg_get_next_sfp(sdp, NULL); + if (fp) { seq_printf(s, " >>> device=%s ", sdp->disk->disk_name); if (sdp->detached) @@ -2601,9 +2625,11 @@ static int sg_proc_seq_show_debug(struct scsidp->host->hostt->emulated); seq_printf(s, " sg_tablesize=%d excl=%d\n", sdp->sg_tablesize, sdp->exclude); + sg_proc_debug_helper(s, sdp, fp); } - sg_proc_debug_helper(s, sdp); } +out: + sg_put_dev(sdp); return 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 20:31 ` Tony Battersby @ 2009-01-14 21:39 ` Greg KH 2009-01-14 21:59 ` Tony Battersby 2009-01-14 22:33 ` Stefan Richter 2009-01-19 6:57 ` FUJITA Tomonori 1 sibling, 2 replies; 41+ messages in thread From: Greg KH @ 2009-01-14 21:39 UTC (permalink / raw) To: Tony Battersby Cc: FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi On Wed, Jan 14, 2009 at 03:31:15PM -0500, Tony Battersby wrote: > 4) Add a variation of kref_get() that uses atomic_inc_not_zero(). Ick. > When I was designing my previous patches, I anticipated running into > these complications with kref, which is one reason I avoided it. Heh. > Other people have experienced similar problems with the kref interface > (search "kref_get_not_zero" or "cgroups: fix probable race with > put_css_set[_taskexit] and find_css_set"). If you still want to use > kref, I think that we are either going to have to change the behavior > of /proc/scsi/sg/*, or else use atomic_inc_not_zero(). Since I still > prefer not to change valid user-visible behavior in a patch that fixes > so many possible oopses, I will go with atomic_inc_not_zero(). And, > since I am going that route, I will use it for sg_get_dev() also, > so that sg_put_dev() doesn't need to acquire sg_index_lock. > > For the purposes of this patch, I added the local function > sg_kref_get_not_zero() to sg.c. It would be better to add > kref_get_not_zero() to kref.c, but I see in the mailing list archives > that there has been some resistance to that idea because it complicates > the kref interface. However, since it has come up more than once, > perhaps it would be better to go ahead and make it an official part > of the interface. If anyone wants to support that idea, I will break > it out into a separate patch. I'd be interested in seeing how you propose such a change so that it works properly for people, because as you have noted, others have had this same problem. I strongly object to the use of sg_kref_get_not_zero(), as you are just providing a private version of this function, which you shouldn't be doing. thanks, greg k-h ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 21:39 ` Greg KH @ 2009-01-14 21:59 ` Tony Battersby 2009-01-14 22:33 ` Stefan Richter 1 sibling, 0 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-14 21:59 UTC (permalink / raw) To: Greg KH; +Cc: FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Greg KH wrote: > I'd be interested in seeing how you propose such a change so that it > works properly for people, because as you have noted, others have had > this same problem. > > I strongly object to the use of sg_kref_get_not_zero(), as you are just > providing a private version of this function, which you shouldn't be > doing. > > OK, then I'll just send kref_get_not_zero() as a separate patch to kref.c. If there are no objections to that patch, then I'll update my sg patch to use the new officially-blessed function. In the meantime, the patch I already posted should work the same way as a split patch, so if anyone has any other objections, then let me know. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 21:39 ` Greg KH 2009-01-14 21:59 ` Tony Battersby @ 2009-01-14 22:33 ` Stefan Richter 2009-01-14 22:53 ` Tony Battersby 1 sibling, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-14 22:33 UTC (permalink / raw) To: Greg KH, Tony Battersby Cc: FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Greg KH wrote: > On Wed, Jan 14, 2009 at 03:31:15PM -0500, Tony Battersby wrote: >> 4) Add a variation of kref_get() that uses atomic_inc_not_zero(). > > Ick. I missed the most part of this thread. But in the grandparent post, Tony wrote: >> Fujita, your patch results in simpler code than my version 2 patch, >> but it still leaves some subtle races and other problems. The crux >> of the problem is that kref_put() needs do be done while holding a >> lock if there is still a way for some other CPU to find a reference >> to the object in between the time the refcount drops to 0 and the >> time the destructor is called. This sounds bogus to me. If "some other CPU can find a reference to the object" after the reference count dropped to zero, then the problem is IMO clear and simple: Some site did not increase the refcount when it should. -- Stefan Richter -=====-==--= ---= -===- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 22:33 ` Stefan Richter @ 2009-01-14 22:53 ` Tony Battersby 2009-01-14 23:47 ` Stefan Richter 0 siblings, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-14 22:53 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > Greg KH wrote: > >> On Wed, Jan 14, 2009 at 03:31:15PM -0500, Tony Battersby wrote: >> >>> 4) Add a variation of kref_get() that uses atomic_inc_not_zero(). >>> >> Ick. >> > > I missed the most part of this thread. But in the grandparent post, > Tony wrote: > > >>> Fujita, your patch results in simpler code than my version 2 patch, >>> but it still leaves some subtle races and other problems. The crux >>> of the problem is that kref_put() needs do be done while holding a >>> lock if there is still a way for some other CPU to find a reference >>> to the object in between the time the refcount drops to 0 and the >>> time the destructor is called. >>> > > This sounds bogus to me. If "some other CPU can find a reference to the > object" after the reference count dropped to zero, then the problem is > IMO clear and simple: > > Some site did not increase the refcount when it should. > The original code makes it possible to find the object and get information about it right up until the point that the destructor is called. However, adding a reference just for this purpose would prevent the object from ever being freed. Removing the ability to get information about closed fds or deleted devices that still have outstanding commands changes the user-visible behavior, which I didn't want to do in a patch that is just trying to fix a bunch of oopses. However, if the solution that I have proposed isn't acceptable, then that is exactly what will have to happen. I can live with that if that is the consensus, but I was trying to avoid it. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 22:53 ` Tony Battersby @ 2009-01-14 23:47 ` Stefan Richter 2009-01-15 14:47 ` Tony Battersby 0 siblings, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-14 23:47 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > Stefan Richter wrote: >> If "some other CPU can find a reference to the >> object" after the reference count dropped to zero, then the problem is >> IMO clear and simple: >> >> Some site did not increase the refcount when it should. >> > The original code makes it possible to find the object and get > information about it right up until the point that the destructor is > called. However, adding a reference just for this purpose would prevent > the object from ever being freed. No, why? There is a list or idr with pointers to your objects? As long as the list contains a pointer to object A, this list needs to own one reference count to A. Right after A was deregistered from the list, A's reference count is decremented on behalf of the list. If some site looks objects up in that list and uses the objects, it increases the refcount of such an object while it accesses the list. When done with the object, it decrements the object's refcount on this site's behalf. Eventually, some site will have been the last one to put away a pointer to the object. Then, and only then, the kref goes down to zero and the destructor is executed. > Removing the ability to get > information about closed fds or deleted devices that still have > outstanding commands changes the user-visible behavior, I'm not saying you are to remove such an ability; I'm just saying that as long as any site is ably to get to an object, the refcount of the object can't be zero. Bring it down to zero _after_ you made the object invisible to others. -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 23:47 ` Stefan Richter @ 2009-01-15 14:47 ` Tony Battersby 2009-01-15 16:22 ` Stefan Richter 0 siblings, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-15 14:47 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > Tony Battersby wrote: > >> Stefan Richter wrote: >> >>> If "some other CPU can find a reference to the >>> object" after the reference count dropped to zero, then the problem is >>> IMO clear and simple: >>> >>> Some site did not increase the refcount when it should. >>> >>> >> The original code makes it possible to find the object and get >> information about it right up until the point that the destructor is >> called. However, adding a reference just for this purpose would prevent >> the object from ever being freed. >> > > No, why? > > There is a list or idr with pointers to your objects? As long as the > list contains a pointer to object A, this list needs to own one > reference count to A. Right after A was deregistered from the list, A's > reference count is decremented on behalf of the list. > > If some site looks objects up in that list and uses the objects, it > increases the refcount of such an object while it accesses the list. > When done with the object, it decrements the object's refcount on this > site's behalf. > > Eventually, some site will have been the last one to put away a pointer > to the object. Then, and only then, the kref goes down to zero and the > destructor is executed. > > >> Removing the ability to get >> information about closed fds or deleted devices that still have >> outstanding commands changes the user-visible behavior, >> > > I'm not saying you are to remove such an ability; I'm just saying that > as long as any site is ably to get to an object, the refcount of the > object can't be zero. Bring it down to zero _after_ you made the object > invisible to others. > Here is an example the way my patch does it: open fd: kref_init(): refcount = 1 send command: kref_get(): refcount 1 -> 2 close fd: kref_put(): refcount 2 -> 1 /proc access begin: idr_find(); kref_get_not_zero(): refcount 1 -> 2 /proc access end: kref_put(): refcount 2 -> 1 command completes: kref_put(): refcount 1 -> 0: idr_remove() Here is the alternative: open fd: kref_init(): refcount = 1 send command: kref_get(): refcount 1 -> 2 close fd: idr_remove(); kref_put(): refcount 2 -> 1 /proc access begin: idr_find(): *** NOT FOUND *** command completes: kref_put(): refcount 1 -> 0 The only obvious places to remove the idr are either from the destructor (as done by my patch) or from close (which is what you are probably thinking). If you do it from close, /proc no longer gives information about fds that have been closed but still have commands pending. I don't see an obvious way to remove the idr from the command completion path before kref_put(), since there can be multiple commands outstanding, and we want to leave the idr in place until the last command completes. Most users of kref would do the idr_remove() from close(), since the object doesn't stay around long enough afterward for it to matter. However, some sg commands can take minutes to complete (or sometimes never complete unless the user explicitly takes action to abort them). While there are outstanding commands, the sg module cannot be unloaded, and re-opening the sg device and trying to send more commands may hang if the old commands need to be aborted first. It is nice to have some way of telling the user why rmmod sg fails and why newly-sent commands don't complete. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 14:47 ` Tony Battersby @ 2009-01-15 16:22 ` Stefan Richter 2009-01-15 16:44 ` Stefan Richter ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-15 16:22 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > Here is an example the way my patch does it: > > open fd: kref_init(): refcount = 1 > send command: kref_get(): refcount 1 -> 2 > close fd: kref_put(): refcount 2 -> 1 > /proc access begin: idr_find(); kref_get_not_zero(): refcount 1 -> 2 > /proc access end: kref_put(): refcount 2 -> 1 > command completes: kref_put(): refcount 1 -> 0: idr_remove() > > Here is the alternative: > > open fd: kref_init(): refcount = 1 > send command: kref_get(): refcount 1 -> 2 > close fd: idr_remove(); kref_put(): refcount 2 -> 1 > /proc access begin: idr_find(): *** NOT FOUND *** > command completes: kref_put(): refcount 1 -> 0 > > The only obvious places to remove the idr are either from the destructor > (as done by my patch) or from close (which is what you are probably > thinking). I'm not thinking of a particular solution; I'm not familiar with sg. > If you do it from close, /proc no longer gives information > about fds that have been closed but still have commands pending. I > don't see an obvious way to remove the idr from the command completion > path before kref_put(), since there can be multiple commands > outstanding, and we want to leave the idr in place until the last > command completes. [...] I believe your kref_get_not_zero() invention is because you want to count two unrelated numbers in the same counter. This won't work, I'm afraid. One number is - the number of unfinished transactions (request->response) of an sg device. (Is that struct sg_device, or struct sg_fd?) The other number is - the number of contexts which peek and poke at the device representation's memory. (struct sg_device, or struct sg_fd?) These two numbers are of course unrelated. Hence I recommend you count them in separate counters. The second counter can be ideally implemented by means of the <linux/kref.h> API. The first counter can perhaps be a kref too, or it can be an atomic_t, or a plain integer which is accessed inside lock-protected regions. Which of them works best depends on what happens in detail. Perhaps kref is a good fit for this as well. Now, what to do with these counters. It's entirely obvious how you count the first number: Inserted a new request? -> t++. Got a response or timeout or whatever kind of transaction completion? -> t--. It's also entirely obvious how you count the second number: The device representation is created and put into the idr: -> ref = 1. The reason is of course because there is now one reference to that instance owned by the idr. When a /proc user looks into the idr and "borrows" the instance: -> ref++, and when that user is done: -> ref--. The reason for doing so is that the /proc user did not actually borrow the reference, it in fact made a copy of the reference. But that's not all. A transaction representation obviously refers to the device instance as well. So a request gets a ref++ too, and a response does a ref--. Well, that's just like what you described above, except that there is this other independent counter t. And we need one more thing to finally get all pieces together: A state flag which says "device still open? yes/no". This flag is necessary to decide when to take the device representation out of the idr: If the device is closed _and_ there are no outstanding transactions anymore, you unregister the device representation from the idr. You can do a little trick here and combine this flag with the counter t: Simply say t = 1 for "this device is open and has no outstanding transactions yet". Then increment and decrement for requests and responses. Also do t-- once on close(). Then t == 0 means "closed and all transactions completed". And any t != 0 means of course "there are t - 1 transactions pending". You could also shift it all by one and say "t == -1 means device closed and 0 transactions pending" but the former way allows you to use the kref API. (However, if you ever need to know whether the device is open or closed, independently whether there are transactions pending or not, then you can't fold this "open?" flag into the transaction counter t.) So we already discussed what happens when t drops to 0: The device representation is deregistered from the idr. Of course at this moment you also do a ref-- because the reference that was owned by the idr has been deleted now. And when ref eventually becomes 0, you clean up and kfree() the device representation, i.e. call its destructor. (Furthermore, the device representation surely refers to objects of the SCSI core. Hence the constructor and destructor of the sg device representation engage in refcounting of those SCSI core objects. And I presume they refcount the sg module, so that sg cannot be unloaded as long as some context may call sg's functions.) PS: In case you need to wait somewhere for a refcount dropping to 0, the <linux/completion.h> API may be a very good fit. -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 16:22 ` Stefan Richter @ 2009-01-15 16:44 ` Stefan Richter 2009-01-15 18:17 ` Tony Battersby 2009-01-16 8:09 ` Stefan Richter 2 siblings, 0 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-15 16:44 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi I wrote: > You can do a little trick here and combine this flag with the counter t: > Simply say t = 1 for "this device is open and has no outstanding > transactions yet". Then increment and decrement for requests and > responses. Also do t-- once on close(). Then t == 0 means "closed and > all transactions completed". And any t != 0 means of course "there are > t - 1 transactions pending". You could also shift it all by one and say > "t == -1 means device closed and 0 transactions pending" but the former > way allows you to use the kref API. (However, if you ever need to know > whether the device is open or closed, independently whether there are > transactions pending or not, then you can't fold this "open?" flag into > the transaction counter t.) Also, of course you can't use the kref API for this counter t if you want to read the number of pending transactions somewhere. Use an atomic_t then, or an int whose accessors are serialized by a lock. -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 16:22 ` Stefan Richter 2009-01-15 16:44 ` Stefan Richter @ 2009-01-15 18:17 ` Tony Battersby 2009-01-15 18:47 ` Stefan Richter 2009-01-16 8:09 ` Stefan Richter 2 siblings, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-15 18:17 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > I believe your kref_get_not_zero() invention is because you want to > count two unrelated numbers in the same counter. This won't work, I'm > afraid. > > It does work actually, just in a way that people don't seem to like very much. And it is not really my invention; other people have run into the same problem and come up with the same solution. Sometimes people drop kref and go directly to using atomic ops just so that they can use atomic_inc_not_zero() without "violating" the kref API. > One number is > - the number of unfinished transactions (request->response) > of an sg device. (Is that struct sg_device, or struct sg_fd?) > > The other number is > - the number of contexts which peek and poke at the device > representation's memory. (struct sg_device, or struct sg_fd?) > > These two numbers are of course unrelated. Hence I recommend you count > them in separate counters. > > I might be able to get this to work, but is it really better than kref_get_not_zero()? And since there are actually two different types of objects being refcounted with the same issue (sg_fd and sg_device), applying the same "trick" to both might require (ugh) four counters total. Does anyone else favor this approach? I expect that if I put the effort into a patch that does this, someone else will just tell me to do it some other way. But if someone else seconds your proposal, I will give it a try. Please, cast your votes now. Rant: 1) I posted a patch that fixes a bunch of oopses. 2) I was told to use a kref because it makes things "simpler". 3) I posted a second patch using one kref, being very careful to avoid the subtle pitfalls of the kref API. 4) Someone else modified my patch to use two krefs to make things "even simpler", but it was subtly broken because krefs aren't an ideal fit for this problem. 5) I fixed the patch that used two krefs by extending the kref API. 6) I get flamed all to hell. 7) I am told to use three krefs now, using one kref to refcount another kref. However, it may be necessary to use as many as four. 8) I give up being a programmer, move to Wisconsin, and become a dairy farmer. I am of course just joking around and not actually offended, but this whole thing is starting to seem a bit silly. How much more complicated does it have to get just to make things simpler? NIH syndrome, anyone? --Tony "moo cow" Battersby ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 18:17 ` Tony Battersby @ 2009-01-15 18:47 ` Stefan Richter 2009-01-15 19:14 ` Stefan Richter 2009-01-15 19:20 ` Tony Battersby 0 siblings, 2 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-15 18:47 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > Stefan Richter wrote: >> I believe your kref_get_not_zero() invention is because you want to >> count two unrelated numbers in the same counter. This won't work, I'm >> afraid. >> > It does work actually, just in a way that people don't seem to like very > much. No, it doesn't work. You can track how many transactions are pending, and you can track how many sites look at memory X, but you can't track both issues in the same counter. If you only count pending transactions, you know when to deregister the device from the idr. But you don't know when it's OK to free the device's memory. If you cont only references to the memory, you know when it is OK to free it but you don't know when to deregister from the idr. > And it is not really my invention; other people have run into the > same problem and come up with the same solution. Perhaps these people did not analyze rigorously what it really is what they want to count? ... > Please, cast your votes now. I for one am not interested in what you guys do with sg. However, the proposal to change the kref API in a way which IMO undermines the whole principle of reference counting motivated me to comment in this thread, to try to understand what you want to solve, and to propose something. What is this kref_get_not_zero()? Do you want to encode object state into the reference counter? That's not the purpose of a reference count; use separate object attribute for that. The reference count says how many sites look at the object. Nothing more, nothing less. > 7) I am told to use three krefs now, using one kref to refcount another > kref. However, it may be necessary to use as many as four. ... > How much more complicated does it have to get just to make things simpler? As I said: Why count independent numbers with a single counter? Count number A with counter a. Count number B with counter b. This is easy, works 100%, and is obvious to everybody. How much simpler than that can it get? -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 18:47 ` Stefan Richter @ 2009-01-15 19:14 ` Stefan Richter 2009-01-15 19:20 ` Tony Battersby 1 sibling, 0 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-15 19:14 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi I wrote: > If you only count pending transactions, you know when to deregister the > device from the idr. But you don't know when it's OK to free the > device's memory. > > If you cont only references to the memory, you know when it is OK to > free it but you don't know when to deregister from the idr. PS: However, if you never copy the idr's reference, then you don't need that other ref-counter. To avoid such copies, every site which looks up the device in the idr would have to perform all its accesses to the device while inside an sg_index_lock protected section. Then only the transactions have copies of the pointer to the device representation in memory. This of course means that the transaction counter is equal to the reference counter (plus the one reference of the idr which you don't need to count because you remove that reference when there is also no transaction anymore). -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 18:47 ` Stefan Richter 2009-01-15 19:14 ` Stefan Richter @ 2009-01-15 19:20 ` Tony Battersby 2009-01-15 20:43 ` Stefan Richter 1 sibling, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-15 19:20 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > Tony Battersby wrote: > >> Stefan Richter wrote: >> >>> I believe your kref_get_not_zero() invention is because you want to >>> count two unrelated numbers in the same counter. This won't work, I'm >>> afraid. >>> >>> >> It does work actually, just in a way that people don't seem to like very >> much. >> > > No, it doesn't work. You can track how many transactions are pending, > and you can track how many sites look at memory X, but you can't track > both issues in the same counter. > > If you only count pending transactions, you know when to deregister the > device from the idr. But you don't know when it's OK to free the > device's memory. > > If you cont only references to the memory, you know when it is OK to > free it but you don't know when to deregister from the idr. > > It does work because kref_get_not_zero() must be called while holding a lock that prevents the destructor from freeing the memory (this requirement is in the comments that I put above the function). If kref_get_not_zero() returns false, then the caller forgets that it found the object, drops the lock, and lets the destructor continue. It is safe from all races that I can see if used properly. In my opinion, the only legitimate objection that one could have is if the API is too hard to understand and use correctly, thereby risking misuse or confusion by people who don't understand it. Let me give some examples: * Example 1 * CPU 1: kref_put(): refcount 1 -> 0 CPU 2: lock data structure find object kref_get_not_zero() returns false forget object unlock data structure CPU 1: lock data structure remove object unlock data structure free object * Example 2 * CPU 1: kref_put(): refcount 1 -> 0 lock data structure remove object unlock data structure CPU 2: lock data structure object not found unlock data structure CPU 1: free object Can you point out an actual problem where this will fail to do the right thing? Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 19:20 ` Tony Battersby @ 2009-01-15 20:43 ` Stefan Richter 2009-01-15 21:43 ` Tony Battersby 0 siblings, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-15 20:43 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > It does work because kref_get_not_zero() must be called while holding a > lock that prevents the destructor from freeing the memory [...] > lock data structure > find object > kref_get_not_zero() returns false > forget object > unlock data structure I have three comments to that. 1.: What are locks for? They serialize accesses to data, in order to protect the data from invalid modifications due to concurrent accesses. The sg_index_lock for example protects the data structures sg_index_idr and sdp->headfp from invalid modifications. (This can be seen quickly from the comment at the definition of sg_index_lock or from looking at the occurrences of sg_index_lock in the code.) What are reference counters for? Well, the name says it. I'd better ask: What is reference counting for? Of course to prevent premature destruction of objects; IOW to implement lifetime rules. Locks don't implement lifetime rules. They only implement serialization of accesses. Now, data structures like sg_index_idr have a possibly confusing relevance to lifetimes of other objects: sg_index_idr contains references to those objects. Whoever has read access to sg_index_idr has the ability to copy references. Due to that, sg_index_lock happens to influence /when/ such copies can be made. If you want you can implement some decisions whether to allow such copies inside the lock. But note: The lock's influence is incomplete! Once a copy of a reference has it made out of the serialized region, it can be again copied at will. Hence, lock's don't implement lifetime rules. They only implement serialization of data accesses. In your case, you want to extend sg_index_lock's purposes to also serialize access to a state variable of the sg device objects. Nothing wrong with that, as long as you and, later on, readers of your code are aware what is going on. But I suspect you aren't 100% aware what you are doing: You are speaking of "a lock that prevents the destructor from freeing the memory". That's not precise. If anything, the *state variable* prevents the destructor from doing its thing. The lock only serializes accesses to the state variable. 2.: Your kref_get_not_zero() tries to use private data of struct kref as a state variable of the object. That's wrong because kref is a reference counter, not a bitfield to memorize whatever object-specific states that could exist. The kref is the number of existing references. It's not saying anything about an object's state. Its saying something about the state of users of the object: There are this many users who aren't done yet. It can't be a good thing to confuse these independent matters. 3.: There can be only one correct variant of kref_get...(): void kref_get(struct kref *kref) { WARN_ON(!atomic_read(&kref->refcount)); atomic_inc(&kref->refcount); smp_mb__after_atomic_inc(); } WARN_ON says: Here is a kernel bug! And the above WARN_ON is straight to the point. Why is it a bug to call kref_get when the reference count is already 0? Simple: If you can kref_get an object, then it's because you got hold of a reference which you are now trying to copy. One reference is more than zero references! If your code works correctly, then there are no references lying around anywhere anymore at the moment when the reference count becomes 0. A kref_get_not_zero() however would only be a valid thing to do if we say "well, according to our count there are no references anymore, yet we have a couple references left somewhere, people still use them (notably to call kref_get_not_zero() on them) --- we don't care, because our reference count was only a meaningless approximation in the first place. Err, wait, why do we do this refcounting thing anyway?" So, there can't be any legal usage of a kref_get_not_zero(). If references are counted correctly, kref_get_not_zero() will never ever come across refcount == 0. -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 20:43 ` Stefan Richter @ 2009-01-15 21:43 ` Tony Battersby 2009-01-15 21:58 ` Stefan Richter 2009-01-16 0:53 ` Stefan Richter 0 siblings, 2 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-15 21:43 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > What are locks for? > > They serialize accesses to data, in order to protect the data from > invalid modifications due to concurrent accesses. > ... > What are reference counters for? > > Well, the name says it. I'd better ask: What is reference counting > for? Of course to prevent premature destruction of objects; IOW to > implement lifetime rules. > > Locks don't implement lifetime rules. They only implement serialization > of accesses. > > ... > In your case, you want to extend sg_index_lock's purposes to also > serialize access to a state variable of the sg device objects. Nothing > wrong with that, as long as you and, later on, readers of your code are > aware what is going on. > > OK, so your argument now seems to be that the code just might in practice work correctly, but in theory it is not philosophically right, semantically pure, or politically correct because it uses fundamental programming constructs in odd ways that are not the "right" ways taught in CS courses that everyone understands and loves. That is a valid viewpoint I suppose, but I tend to be much more pragmatic. BTW, you might want to do a quick grep of the kernel source for all users of atomic_inc_not_zero() and make sure they all follow the established CS curricula: net/netfilter/nf_conntrack_expect.c: struct nf_conntrack_expect * nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) { struct nf_conntrack_expect *i; rcu_read_lock(); i = __nf_ct_expect_find(net, tuple); if (i && !atomic_inc_not_zero(&i->use)) i = NULL; rcu_read_unlock(); return i; } void nf_ct_expect_put(struct nf_conntrack_expect *exp) { if (atomic_dec_and_test(&exp->use)) call_rcu(&exp->rcu, nf_ct_expect_free_rcu); } fs/nfs/nfs4state.c: static struct nfs4_state * __nfs4_find_state_byowner(struct inode *inode, struct nfs4_state_owner *owner) { struct nfs_inode *nfsi = NFS_I(inode); struct nfs4_state *state; list_for_each_entry(state, &nfsi->open_states, inode_states) { if (state->owner != owner) continue; if (atomic_inc_not_zero(&state->count)) return state; } return NULL; } struct nfs4_state * nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner) { struct nfs4_state *state, *new; struct nfs_inode *nfsi = NFS_I(inode); spin_lock(&inode->i_lock); state = __nfs4_find_state_byowner(inode, owner); spin_unlock(&inode->i_lock); ... } void nfs4_put_open_state(struct nfs4_state *state) { struct inode *inode = state->inode; struct nfs4_state_owner *owner = state->owner; if (!atomic_dec_and_lock(&state->count, &owner->so_lock)) return; spin_lock(&inode->i_lock); list_del(&state->inode_states); list_del(&state->open_states); spin_unlock(&inode->i_lock); spin_unlock(&owner->so_lock); iput(inode); nfs4_free_open_state(state); nfs4_put_state_owner(owner); } ... etc. But I'm sure all those silly network and RCU developers didn't understand what they were doing. I would still like for some other people to weigh in with their opinions before I spend more effort on another patch or else push forward with kref_get_not_zero(). Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 21:43 ` Tony Battersby @ 2009-01-15 21:58 ` Stefan Richter 2009-01-15 22:23 ` Tony Battersby 2009-01-16 0:53 ` Stefan Richter 1 sibling, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-15 21:58 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > OK, so your argument now seems to be that the code just might in > practice work correctly, but in theory it is not philosophically right, > semantically pure, or politically correct Nonsense. I wrote that there is a bug if you have a reference while the reference count is zero. -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 21:58 ` Stefan Richter @ 2009-01-15 22:23 ` Tony Battersby 2009-01-15 23:24 ` Stefan Richter 0 siblings, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-15 22:23 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan Richter wrote: > Tony Battersby wrote: > >> OK, so your argument now seems to be that the code just might in >> practice work correctly, but in theory it is not philosophically right, >> semantically pure, or politically correct >> > > Nonsense. > > I wrote that there is a bug if you have a reference while the reference > count is zero. > Being pragmatic, I call that a _philosophical_ objection rather than a bug because the code does actually work in practice as far as I can tell. But you can call it a bug if you like. If you want me to give in and call it a bug, then you will have to come up with an actual case where the code fails to do the right thing - memory use after free, double free, memory leak, oops, etc. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 22:23 ` Tony Battersby @ 2009-01-15 23:24 ` Stefan Richter 2009-01-16 14:16 ` Tony Battersby 0 siblings, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-15 23:24 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > Stefan Richter wrote: >> I wrote that there is a bug if you have a reference while the reference >> count is zero. >> > Being pragmatic, I call that a _philosophical_ objection rather than a > bug because the code does actually work in practice as far as I can > tell. But you can call it a bug if you like. If you want me to give in > and call it a bug, then you will have to come up with an actual case > where the code fails to do the right thing - memory use after free, > double free, memory leak, oops, etc. I understood that you want to fix bugs which are caused by lack of reference counting. You fix it by /reference count estimation/. (Yet you could also fix it by /reference counting/.) Actual reference counting would have the benefit that it works not only right after your fix, but it will continue to work when people proceed to modify sg, provided they follow the sound, proven, straight-forward, familiar rules of reference counting: - r++ when a copy of a reference is made. - r-- when a reference is given up. Sure, it will also continue to work if those who modify sg follow the ad-hoc rules of your special reference count estimation, which are: - ... - ... - ... I recommend to stick to the semantics of "refcount == 0 means there is no reference anymore" because it's simple and robust. This is pragmatism. I recommend it over "refcount == 0 means that there are, hmm, still some references somewhere, but believe me, you can ignore those" because that's not very pragmatic in code which is supposed to be maintainable and to not crash people's machines. In the subsystems which I maintain, I made the experience that reference counting is easier to get right and keep right than some locking tricks around reference stores or whatever other lifetime assumptions. Anyway. Fix sg in whatever way you prefer; you are doing the work, your opinion counts. Mine doesn't, as I don't contribute the code. (I spoke up nevertheless because the idea came up in this thread to change lib/kref in a way which is only useful to buggy refcounting. This worried me..) -- Stefan Richter -=====-==--= ---= -==== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 23:24 ` Stefan Richter @ 2009-01-16 14:16 ` Tony Battersby 0 siblings, 0 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-16 14:16 UTC (permalink / raw) To: Stefan Richter Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Stefan, Thank you for all of your comments, and for you concern for the quality of the code. In the end, we all want the same thing, and we are all trying to reach the same goal. You have given me an alternate way of looking at the problem. I am still not sure which way this one will go, but at the very least your objections will help me write better comments and documentation about the controversial approach I am taking. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 21:43 ` Tony Battersby 2009-01-15 21:58 ` Stefan Richter @ 2009-01-16 0:53 ` Stefan Richter 1 sibling, 0 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-16 0:53 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi Tony Battersby wrote: > net/netfilter/nf_conntrack_expect.c: > > struct nf_conntrack_expect * > nf_ct_expect_find_get(struct net *net, const struct nf_conntrack_tuple *tuple) > { > struct nf_conntrack_expect *i; > > rcu_read_lock(); > i = __nf_ct_expect_find(net, tuple); > if (i && !atomic_inc_not_zero(&i->use)) > i = NULL; > rcu_read_unlock(); > > return i; > } > > void nf_ct_expect_put(struct nf_conntrack_expect *exp) > { > if (atomic_dec_and_test(&exp->use)) > call_rcu(&exp->rcu, nf_ct_expect_free_rcu); > } > > > fs/nfs/nfs4state.c: > > static struct nfs4_state * > __nfs4_find_state_byowner(struct inode *inode, struct nfs4_state_owner *owner) > { > struct nfs_inode *nfsi = NFS_I(inode); > struct nfs4_state *state; > > list_for_each_entry(state, &nfsi->open_states, inode_states) { > if (state->owner != owner) > continue; > if (atomic_inc_not_zero(&state->count)) > return state; > } > return NULL; > } > > struct nfs4_state * > nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner) > { > struct nfs4_state *state, *new; > struct nfs_inode *nfsi = NFS_I(inode); > > spin_lock(&inode->i_lock); > state = __nfs4_find_state_byowner(inode, owner); > spin_unlock(&inode->i_lock); > ... > } > > void nfs4_put_open_state(struct nfs4_state *state) > { > struct inode *inode = state->inode; > struct nfs4_state_owner *owner = state->owner; > > if (!atomic_dec_and_lock(&state->count, &owner->so_lock)) > return; > spin_lock(&inode->i_lock); > list_del(&state->inode_states); > list_del(&state->open_states); > spin_unlock(&inode->i_lock); > spin_unlock(&owner->so_lock); > iput(inode); > nfs4_free_open_state(state); > nfs4_put_state_owner(owner); > } > > > ... etc. But I'm sure all those silly network and RCU developers didn't > understand what they were doing. Is it obvious from these excerpts what the lifetime rules of the objects are? Not really immediately. Do we know merely from looking at the code why its developers chose to do it this way? No. Is there a necessity to do similarly complicated things in sg? No, definitely not. -- Stefan Richter -=====-==--= ---= =---- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-15 16:22 ` Stefan Richter 2009-01-15 16:44 ` Stefan Richter 2009-01-15 18:17 ` Tony Battersby @ 2009-01-16 8:09 ` Stefan Richter 2 siblings, 0 replies; 41+ messages in thread From: Stefan Richter @ 2009-01-16 8:09 UTC (permalink / raw) To: Tony Battersby Cc: Greg KH, FUJITA Tomonori, dgilbert, James.Bottomley, hch, linux-scsi I wrote: > One number is > - the number of unfinished transactions (request->response) > of an sg device. [...] The other number is > - the number of contexts which peek and poke at the device > representation's memory. [...] > And we need one more thing to finally get all pieces together: A state > flag which says "device still open? yes/no". This flag is necessary to > decide when to take the device representation out of the idr: You probably don't need an explicit counter of pending transactions. This number is implicit in the number of list members in struct sg_fd.headrp. So you only need the reference counter and the open-or-closed flag (which would be different from struct sg_fd.closed). -- Stefan Richter -=====-==--= ---= =---- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-14 20:31 ` Tony Battersby 2009-01-14 21:39 ` Greg KH @ 2009-01-19 6:57 ` FUJITA Tomonori 2009-01-19 15:02 ` Tony Battersby ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-19 6:57 UTC (permalink / raw) To: tonyb; +Cc: fujita.tomonori, dgilbert, James.Bottomley, hch, linux-scsi, greg Sorry for the delay, On Wed, 14 Jan 2009 15:31:15 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > [CC: Greg Kroah-Hartman <greg@kroah.com>] > > Fujita, your patch results in simpler code than my version 2 patch, > but it still leaves some subtle races and other problems. The crux > of the problem is that kref_put() needs do be done while holding a > lock if there is still a way for some other CPU to find a reference > to the object in between the time the refcount drops to 0 and the > time the destructor is called. > > For example, look at sg_get_dev() and sg_put_dev(). In my patch, > I locked sg_index_lock in sg_put_dev() and called kref_put() while > holding the lock. In your patch, you moved the lock from sg_put_dev() > to the destructor function. This introduces a subtle race with > sg_get_dev(): > > CPU 1: > sg_put_dev() > kref_put(): refcount 1 -> 0 > > CPU 2: > sg_get_dev() > lock sg_index_lock > idr_find() > kref_get(): refcount 0 -> 1: bug > unlock sg_index_lock > > CPU 1: > sg_device_release() > lock sg_index_lock > idr_remove > unlock sg_index_lock > kfree(sdp) > > CPU 2: > access sdp which has already been freed > sg_put_dev(): double-free bug > > Besides holding the lock during kref_put(), I also considered two > other simple ways to avoid this race: > 1) Do idr_remove() from sg_remove(). > 2) Return NULL in sg_get_dev() if sdp->detached. > > However, both of these options would have changed the behavior of > the /proc/scsi/sg/* functions that show information for devices that > are in the process of being detached. I wanted to fix bugs without > changing other behavior, so I chose to call kref_put() under lock in > my previous patches. How about doing 2) and accessing to /proc/scsi/sg/* with sg_index_lock (don't use sg_get_device for it). What /proc/scsi/sg/* doing is abnormal from the perspective of the ref counting (accessing to something that is going away). >From the perspective of the ref counting, the best way is calling idr_remove in from sg_remove but as you said, it's not nice to change the behavior. If we don't use sg_get_device for /proc/scsi/sg/*, then we use sg_get_device for only sg_open So we can do 2) without changing the behavior. > A similar problem exists with sg_remove_sfp() in your patch. > The refcount for the sg_fd can drop to 0 while a different CPU can > still find a reference to it by scanning sg_device::headfp. The good > news is that most places that scan sg_fd's using sdp->headfp never > drop sg_index_lock while still holding a reference, so for most cases > it is safe for the destructor to work the way you have it. The one > exception is sg_get_nth_sfp(), which needs to kref_get() the sg_fd it > returns (my version 2 patch doesn't fix this problem either since I > just noticed it). But then the kref_get() would have a race similar > to the one I described above between sg_get_dev() and sg_put_dev(). If we go with the above way, we can solve this? sg_get_nth_sfp is used for only /proc/scsi/sg/*. > There are several solutions to this problem: > 1) Call kref_put(&sfp->f_ref, sg_remove_sfp) while holding > sg_index_lock, and have the destructor unlink sfp from sdp->headfp > before dropping the lock. This is ugly since the destructor may want > to drop sg_index_lock (e.g. to call scsi_device_put()), but the saved > irqflags variable is in a different function. > 2) Unlink sfp from sdp->headfp in sg_release(). The downside to this > is that the /proc/scsi/sg/* functions no longer show fds that have > been closed but still have commands pending. > 3) Set sfp->closed in sg_release() and have sg_get_nth_sfp() return > NULL if sfp->closed. Same downside as #2. > 4) Add a variation of kref_get() that uses atomic_inc_not_zero(). > > When I was designing my previous patches, I anticipated running into > these complications with kref, which is one reason I avoided it. > Other people have experienced similar problems with the kref interface > (search "kref_get_not_zero" or "cgroups: fix probable race with > put_css_set[_taskexit] and find_css_set"). If you still want to use > kref, I think that we are either going to have to change the behavior > of /proc/scsi/sg/*, or else use atomic_inc_not_zero(). Since I still > prefer not to change valid user-visible behavior in a patch that fixes > so many possible oopses, I will go with atomic_inc_not_zero(). And, > since I am going that route, I will use it for sg_get_dev() also, > so that sg_put_dev() doesn't need to acquire sg_index_lock. > > For the purposes of this patch, I added the local function > sg_kref_get_not_zero() to sg.c. It would be better to add > kref_get_not_zero() to kref.c, but I see in the mailing list archives > that there has been some resistance to that idea because it complicates > the kref interface. However, since it has come up more than once, > perhaps it would be better to go ahead and make it an official part > of the interface. If anyone wants to support that idea, I will break > it out into a separate patch. I've not read the whole discussion, but using the common mechanism in your original way is not good. It confuses other developers and other developers don't like it. > Your patch also has a problem with management of f_ref. You do > kref_get() in sg_add_request() and kref_put() in sg_rq_end_io()(). > However, an error can occur in between sg_add_request() and sending > the command to the midlevel, so sg_rq_end_io() may never be called. > It is better to do kref_get() just before sending the command to the > midlevel; that way the refcount counts active commands waiting for > the completion callback rather than prepared requests. Oops, yeah, you are right. > Your patch doesn't fix the problem of various functions calling > SCSI_LOG_TIMEOUT with sdp->disk->disk_name after sg_remove() sets > sdp->disk to NULL. This is why I moved put_disk() from sg_remove() > to sg_device_release(). As I wrote in the previous mail, I don't try to fix all the race in this patch. But yeah, we need to clean up sdp->disk in sg_device_release(). > Your patch doesn't fix the problem of forgetting to do > wake_up_interruptible(&sdp->o_excl_wait) after removing sfp from > sdp->headfp. This is important if sg_release() is called with > commands still active; the last command that finishes should wake up > any O_EXCL waiters. I think we can fix this even with my patch. Thanks, ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] sg: fix races during device removal (v2) 2009-01-19 6:57 ` FUJITA Tomonori @ 2009-01-19 15:02 ` Tony Battersby 2009-01-19 23:03 ` [PATCH 1/2] sg: fix races during device removal (v4) Tony Battersby 2009-01-19 23:06 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) Tony Battersby 2 siblings, 0 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-19 15:02 UTC (permalink / raw) To: FUJITA Tomonori; +Cc: dgilbert, James.Bottomley, hch, linux-scsi, greg FUJITA Tomonori wrote: >> Besides holding the lock during kref_put(), I also considered two >> other simple ways to avoid this race: >> 1) Do idr_remove() from sg_remove(). >> 2) Return NULL in sg_get_dev() if sdp->detached. >> >> However, both of these options would have changed the behavior of >> the /proc/scsi/sg/* functions that show information for devices that >> are in the process of being detached. I wanted to fix bugs without >> changing other behavior, so I chose to call kref_put() under lock in >> my previous patches. >> > > How about doing 2) and accessing to /proc/scsi/sg/* with > sg_index_lock (don't use sg_get_device for it). > > That's an excellent idea. If it works, I can forget this whole atomic_inc_not_zero() business and the pain that goes with it. > What /proc/scsi/sg/* doing is abnormal from the perspective of the ref > counting (accessing to something that is going away). > > >From the perspective of the ref counting, the best way is calling > idr_remove in from sg_remove but as you said, it's not nice to change > the behavior. > > Exactly. > If we don't use sg_get_device for /proc/scsi/sg/*, then we use > sg_get_device for only sg_open So we can do 2) without changing the > behavior. > > I will try to implement it this way today. Thanks for your input. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 1/2] sg: fix races during device removal (v4) 2009-01-19 6:57 ` FUJITA Tomonori 2009-01-19 15:02 ` Tony Battersby @ 2009-01-19 23:03 ` Tony Battersby 2009-01-20 1:06 ` FUJITA Tomonori 2009-01-19 23:06 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) Tony Battersby 2 siblings, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-19 23:03 UTC (permalink / raw) To: James.Bottomley; +Cc: FUJITA Tomonori, dgilbert, hch, linux-scsi sg has the following problems related to device removal: * opening a sg fd races with removing a device * closing a sg fd races with removing a device * /proc/scsi/sg/* access races with removing a device * command completion races with removing a device * command completion races with closing a sg fd * can rmmod sg with active commands These problems can cause kernel oopses, memory-use-after-free, or double-free errors. This patch fixes these problems by using krefs to manage the lifetime of sg_device and sg_fd. Each command submitted to the midlevel holds a reference to sg_fd until the completion callback. This ensures that sg_fd doesn't go away if the fd is closed with commands still outstanding. sg_fd gets the reference of sg_device (with scsi_device) and also makes sure that the sg module doesn't go away. /proc/scsi/sg/* functions don't play nicely with krefs because they give information about sg_fds which have been closed but not yet freed due to still having outstanding commands and sg_devices which have been removed but not yet freed due to still being referenced by one or more sg_fds. To deal with this safely without removing functionality, /proc functions now access sg_device and sg_fd while holding a lock instead of using kref_get()/kref_put(). Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- This version implements Fujita's suggestion to use locking instead of kref_get() for the /proc functions. This version does not use any controversial extensions to the kref API. To avoid changing the behavior of the /proc functions, it is still possible to find a sg_fd or a sg_device with a refcount of 0, since the last remaining references are removed from the kref_put() release functions. However, no one should be doing a kref_get() under these circumstances, so it shouldn't cause a problem. I added some comments warning about this non-standard behavior to reduce the risk of tripping up someone in the future. The original sg_open() would return -ENXIO if idr_find() returned NULL but -ENODEV if sdp->detached. It would be inconvenient to keep this behavior without inlining sg_get_dev() into sg_open(). Since sg_open() was the only remaining caller anyway, I went ahead and inlined it and removed sg_get_dev(), thereby enabling sg_open() to return the same error codes as before. This version still passes all the tests listed previously. However, mptscsih.c still needs some fixes; I will submit a patch for that later. sg.c | 386 ++++++++++++++++++++++++++++++++----------------------------------- 1 file changed, 186 insertions(+), 200 deletions(-) --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-19 16:19:01.000000000 -0500 +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-19 16:26:21.000000000 -0500 @@ -101,6 +101,7 @@ static int scatter_elem_sz_prev = SG_SCA #define SG_SECTOR_MSK (SG_SECTOR_SZ - 1) static int sg_add(struct device *, struct class_interface *); +static void sg_device_destroy(struct kref *kref); static void sg_remove(struct device *, struct class_interface *); static DEFINE_IDR(sg_index_idr); @@ -158,6 +159,12 @@ typedef struct sg_fd { /* holds the sta char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ + /* CAUTION! sfp is unlinked from sdp->headfp AFTER the refcount drops + to 0. When scanning sdp->headfp, it is not safe to call + kref_get(&sfp->f_ref) if sfp->closed because the refcount may + already be 0. */ + struct kref f_ref; + struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ @@ -171,6 +178,22 @@ typedef struct sg_device { /* holds the char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ struct gendisk *disk; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ + /* CAUTION! idr_remove() is called AFTER the refcount drops to 0. + When using idr_find(), it is not safe to call kref_get(&sdp->d_ref) + if sdp->detached because the refcount may already be 0. + + In order to get a reference safely using idr_find(): + 1) lock sg_index_lock (either read or write lock) + 2) call idr_find() + 3) if sdp->detached, pretend that idr_find() returned NULL, and do + not call kref_get() + 4) if !sdp->detached, then kref_get(&sdp->d_ref) + 5) unlock sg_index_lock + + Alternately, you can use idr_find() to access the device even if + sdp->detached if you do not drop sg_index_lock, in which case you + should NOT call kref_get(&sdp->d_ref). */ + struct kref d_ref; } Sg_device; static int sg_fasync(int fd, struct file *filp, int mode); @@ -194,13 +217,11 @@ static void sg_build_reserve(Sg_fd * sfp static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); +static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static int sg_res_in_use(Sg_fd * sfp); -static Sg_device *sg_get_dev(int dev); #ifdef CONFIG_SCSI_PROC_FS static int sg_last_dev(void); #endif @@ -236,23 +257,30 @@ sg_open(struct inode *inode, struct file lock_kernel(); nonseekable_open(inode, filp); SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); - sdp = sg_get_dev(dev); - if ((!sdp) || (!sdp->device)) { - unlock_kernel(); - return -ENXIO; - } - if (sdp->detached) { - unlock_kernel(); - return -ENODEV; + + retval = -ENXIO; + read_lock_irq(&sg_index_lock); + sdp = idr_find(&sg_index_idr, dev); + if (sdp) { + /* See comments about using idr_find() and d_ref. If + sdp->detached, then the refcount may already be 0, in which + case it would be a bug to do kref_get(). */ + if (!sdp->detached) + kref_get(&sdp->d_ref); + else { + sdp = NULL; + retval = -ENODEV; + } } + read_unlock_irq(&sg_index_lock); + if ((!sdp) || (!sdp->device)) + goto sg_put; /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); - if (retval) { - unlock_kernel(); - return retval; - } + if (retval) + goto sg_put; if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { @@ -303,16 +331,20 @@ sg_open(struct inode *inode, struct file if ((sfp = sg_add_sfp(sdp, dev))) filp->private_data = sfp; else { - if (flags & O_EXCL) + if (flags & O_EXCL) { sdp->exclude = 0; /* undo if error */ + wake_up_interruptible(&sdp->o_excl_wait); + } retval = -ENOMEM; goto error_out; } - unlock_kernel(); - return 0; - - error_out: - scsi_device_put(sdp->device); + retval = 0; +error_out: + if (retval) + scsi_device_put(sdp->device); +sg_put: + if (sdp) + kref_put(&sdp->d_ref, sg_device_destroy); unlock_kernel(); return retval; } @@ -327,13 +359,14 @@ sg_release(struct inode *inode, struct f if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ - if (!sdp->detached) { - scsi_device_put(sdp->device); - } - sdp->exclude = 0; - wake_up_interruptible(&sdp->o_excl_wait); - } + + sfp->closed = 1; + + sdp->exclude = 0; + wake_up_interruptible(&sdp->o_excl_wait); + + kref_put(&sfp->f_ref, sg_remove_sfp); + return 0; } @@ -755,6 +788,7 @@ sg_common_write(Sg_fd * sfp, Sg_request hp->duration = jiffies_to_msecs(jiffies); srp->rq->timeout = timeout; + kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk, srp->rq, 1, sg_rq_end_io); return 0; @@ -1247,25 +1281,28 @@ sg_mmap(struct file *filp, struct vm_are static void sg_rq_end_io(struct request *rq, int uptodate) { struct sg_request *srp = rq->end_io_data; - Sg_device *sdp = NULL; + Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; - int result, resid; + int result, resid, done = 1; - if (NULL == srp) { - printk(KERN_ERR "sg_cmd_done: NULL request\n"); + if (unlikely(NULL == srp)) { /* shouldn't happen */ + printk(KERN_ERR "sg_rq_end_io: NULL request\n"); return; } + sfp = srp->parentfp; - if (sfp) - sdp = sfp->parentdp; - if ((NULL == sdp) || sdp->detached) { - printk(KERN_INFO "sg_cmd_done: device detached\n"); + if (unlikely(NULL == sfp)) { /* shouldn't happen */ + printk(KERN_ERR "sg_rq_end_io: NULL sg_fd\n"); return; } + sdp = sfp->parentdp; + if (unlikely(sdp->detached)) + printk(KERN_INFO "sg_rq_end_io: device detached\n"); + sense = rq->sense; result = rq->errors; resid = rq->data_len; @@ -1303,33 +1340,25 @@ static void sg_rq_end_io(struct request } /* Rely on write phase to clean out srp status values, so no "else" */ - if (sfp->closed) { /* whoops this fd already released, cleanup */ - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); - sg_finish_rem_req(srp); - srp = NULL; - if (NULL == sfp->headrp) { - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ - scsi_device_put(sdp->device); - } - sfp = NULL; - } - } else if (srp && srp->orphan) { + write_lock_irqsave(&sfp->rq_list_lock, iflags); + if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; - else { - sg_finish_rem_req(srp); - srp = NULL; - } + else + done = 0; } - if (sfp && srp) { - /* Now wake up any sg_read() that is waiting for this packet. */ - kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 1; + srp->done = done; + write_unlock_irqrestore(&sfp->rq_list_lock, iflags); + + if (likely(done)) { + /* Now wake up any sg_read() that is waiting for this + packet. */ wake_up_interruptible(&sfp->read_wait); - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); - } + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); + } else + sg_finish_rem_req(srp); /* call with srp->done == 0 */ + + kref_put(&sfp->f_ref, sg_remove_sfp); } static struct file_operations sg_fops = { @@ -1391,6 +1420,7 @@ static Sg_device *sg_alloc(struct gendis init_waitqueue_head(&sdp->o_excl_wait); sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); sdp->index = k; + kref_init(&sdp->d_ref); error = 0; out: @@ -1488,49 +1518,45 @@ out: return error; } -static void -sg_remove(struct device *cl_dev, struct class_interface *cl_intf) +static void sg_device_destroy(struct kref *kref) +{ + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); + unsigned long flags; + + /* CAUTION! Note that the device can still be found via idr_find() + even though the refcount is 0. Therefore, do idr_remove() BEFORE + any other cleanup. */ + + write_lock_irqsave(&sg_index_lock, flags); + idr_remove(&sg_index_idr, sdp->index); + write_unlock_irqrestore(&sg_index_lock, flags); + + SCSI_LOG_TIMEOUT(3, + printk("sg_device_destroy: %s\n", + sdp->disk->disk_name)); + + put_disk(sdp->disk); + kfree(sdp); +} + +static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; - Sg_fd *tsfp; - Sg_request *srp; - Sg_request *tsrp; - int delay; - if (!sdp) + if (!sdp || sdp->detached) return; - delay = 0; + SCSI_LOG_TIMEOUT(3, printk("sg_remove: %s\n", sdp->disk->disk_name)); + + /* Need a write lock to set sdp->detached. */ write_lock_irqsave(&sg_index_lock, iflags); - if (sdp->headfp) { - sdp->detached = 1; - for (sfp = sdp->headfp; sfp; sfp = tsfp) { - tsfp = sfp->nextfp; - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) - sg_finish_rem_req(srp); - } - if (sfp->closed) { - scsi_device_put(sdp->device); - __sg_remove_sfp(sdp, sfp); - } else { - delay = 1; - wake_up_interruptible(&sfp->read_wait); - kill_fasync(&sfp->async_qp, SIGPOLL, - POLL_HUP); - } - } - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); - if (NULL == sdp->headfp) { - idr_remove(&sg_index_idr, sdp->index); - } - } else { /* nothing active, simple case */ - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); - idr_remove(&sg_index_idr, sdp->index); + sdp->detached = 1; + for (sfp = sdp->headfp; sfp; sfp = sfp->nextfp) { + wake_up_interruptible(&sfp->read_wait); + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } write_unlock_irqrestore(&sg_index_lock, iflags); @@ -1538,13 +1564,8 @@ sg_remove(struct device *cl_dev, struct device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; - put_disk(sdp->disk); - sdp->disk = NULL; - if (NULL == sdp->headfp) - kfree(sdp); - if (delay) - msleep(10); /* dirty detach so delay device destruction */ + kref_put(&sdp->d_ref, sg_device_destroy); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); @@ -1941,22 +1962,6 @@ sg_get_rq_mark(Sg_fd * sfp, int pack_id) return resp; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_request * -sg_get_nth_request(Sg_fd * sfp, int nth) -{ - Sg_request *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sfp->rq_list_lock, iflags); - for (k = 0, resp = sfp->headrp; resp && (k < nth); - ++k, resp = resp->nextrp) ; - read_unlock_irqrestore(&sfp->rq_list_lock, iflags); - return resp; -} -#endif - /* always adds to end of list */ static Sg_request * sg_add_request(Sg_fd * sfp) @@ -2032,22 +2037,6 @@ sg_remove_request(Sg_fd * sfp, Sg_reques return res; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_fd * -sg_get_nth_sfp(Sg_device * sdp, int nth) -{ - Sg_fd *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sg_index_lock, iflags); - for (k = 0, resp = sdp->headfp; resp && (k < nth); - ++k, resp = resp->nextfp) ; - read_unlock_irqrestore(&sg_index_lock, iflags); - return resp; -} -#endif - static Sg_fd * sg_add_sfp(Sg_device * sdp, int dev) { @@ -2062,6 +2051,7 @@ sg_add_sfp(Sg_device * sdp, int dev) init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); + kref_init(&sfp->f_ref); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; @@ -2089,15 +2079,53 @@ sg_add_sfp(Sg_device * sdp, int dev) sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); + + kref_get(&sdp->d_ref); + __module_get(THIS_MODULE); return sfp; } -static void -__sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) +static void sg_remove_sfp_usercontext(struct work_struct *work) +{ + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); + struct sg_device *sdp = sfp->parentdp; + + /* Cleanup any responses which were never read(). */ + while (sfp->headrp) + sg_finish_rem_req(sfp->headrp); + + if (sfp->reserve.bufflen > 0) { + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", + (int) sfp->reserve.bufflen, + (int) sfp->reserve.k_use_sg)); + sg_remove_scat(&sfp->reserve); + } + + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: %s, sfp=0x%p\n", + sdp->disk->disk_name, + sfp)); + kfree(sfp); + + scsi_device_put(sdp->device); + kref_put(&sdp->d_ref, sg_device_destroy); + module_put(THIS_MODULE); +} + +static void sg_remove_sfp(struct kref *kref) { + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); + struct sg_device *sdp = sfp->parentdp; Sg_fd *fp; Sg_fd *prev_fp; + unsigned long iflags; + + /* CAUTION! Note that sfp can still be found by walking sdp->headfp + even though the refcount is now 0. Therefore, unlink sfp from + sdp->headfp BEFORE doing any other cleanup. */ + write_lock_irqsave(&sg_index_lock, iflags); prev_fp = sdp->headfp; if (sfp == prev_fp) sdp->headfp = prev_fp->nextfp; @@ -2110,54 +2138,10 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * prev_fp = fp; } } - if (sfp->reserve.bufflen > 0) { - SCSI_LOG_TIMEOUT(6, - printk("__sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", - (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); - sg_remove_scat(&sfp->reserve); - } - sfp->parentdp = NULL; - SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); - kfree(sfp); -} - -/* Returns 0 in normal case, 1 when detached and sdp object removed */ -static int -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) -{ - Sg_request *srp; - Sg_request *tsrp; - int dirty = 0; - int res = 0; - - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sg_srp_done(srp, sfp)) - sg_finish_rem_req(srp); - else - ++dirty; - } - if (0 == dirty) { - unsigned long iflags; + write_unlock_irqrestore(&sg_index_lock, iflags); + wake_up_interruptible(&sdp->o_excl_wait); - write_lock_irqsave(&sg_index_lock, iflags); - __sg_remove_sfp(sdp, sfp); - if (sdp->detached && (NULL == sdp->headfp)) { - idr_remove(&sg_index_idr, sdp->index); - kfree(sdp); - res = 1; - } - write_unlock_irqrestore(&sg_index_lock, iflags); - } else { - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ - /* only bump the access_count if we actually succeeded in - * throwing another counter on the host module */ - scsi_device_get(sdp->device); /* XXX: retval ignored? */ - sfp->closed = 1; /* flag dirty state on this fd */ - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", - dirty)); - } - return res; + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); } static int @@ -2199,19 +2183,6 @@ sg_last_dev(void) } #endif -static Sg_device * -sg_get_dev(int dev) -{ - Sg_device *sdp; - unsigned long iflags; - - read_lock_irqsave(&sg_index_lock, iflags); - sdp = idr_find(&sg_index_idr, dev); - read_unlock_irqrestore(&sg_index_lock, iflags); - - return sdp; -} - #ifdef CONFIG_SCSI_PROC_FS static struct proc_dir_entry *sg_proc_sgp = NULL; @@ -2468,8 +2439,10 @@ static int sg_proc_seq_show_dev(struct s struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n", scsidp->host->host_no, scsidp->channel, @@ -2480,6 +2453,7 @@ static int sg_proc_seq_show_dev(struct s (int) scsi_device_online(scsidp)); else seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } @@ -2493,13 +2467,16 @@ static int sg_proc_seq_show_devstrs(stru struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", scsidp->vendor, scsidp->model, scsidp->rev); else seq_printf(s, "<no active device>\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } @@ -2512,7 +2489,8 @@ static void sg_proc_debug_helper(struct const char * cp; unsigned int ms; - for (k = 0; (fp = sg_get_nth_sfp(sdp, k)); ++k) { + for (k = 0, fp = sdp->headfp; fp != NULL; ++k, fp = fp->nextfp) { + read_lock(&fp->rq_list_lock); /* irqs already disabled */ seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k + 1, jiffies_to_msecs(fp->timeout), @@ -2522,7 +2500,9 @@ static void sg_proc_debug_helper(struct seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=%d\n", (int) fp->cmd_q, (int) fp->force_packid, (int) fp->keep_orphan, (int) fp->closed); - for (m = 0; (srp = sg_get_nth_request(fp, m)); ++m) { + for (m = 0, srp = fp->headrp; + srp != NULL; + ++m, srp = srp->nextrp) { hp = &srp->header; new_interface = (hp->interface_id == '\0') ? 0 : 1; if (srp->res_used) { @@ -2559,6 +2539,7 @@ static void sg_proc_debug_helper(struct } if (0 == m) seq_printf(s, " No requests active\n"); + read_unlock(&fp->rq_list_lock); } } @@ -2571,23 +2552,26 @@ static int sg_proc_seq_show_debug(struct { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; + unsigned long iflags; if (it && (0 == it->index)) { seq_printf(s, "max_active_device=%d(origin 1)\n", (int)it->max); seq_printf(s, " def_reserved_size=%d\n", sg_big_buff); } - sdp = it ? sg_get_dev(it->index) : NULL; + + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; if (sdp) { struct scsi_device *scsidp = sdp->device; if (NULL == scsidp) { seq_printf(s, "device %d detached ??\n", (int)it->index); - return 0; + goto out; } - if (sg_get_nth_sfp(sdp, 0)) { + if (sdp->headfp) { seq_printf(s, " >>> device=%s ", sdp->disk->disk_name); if (sdp->detached) @@ -2604,6 +2588,8 @@ static int sg_proc_seq_show_debug(struct } sg_proc_debug_helper(s, sdp); } +out: + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v4) 2009-01-19 23:03 ` [PATCH 1/2] sg: fix races during device removal (v4) Tony Battersby @ 2009-01-20 1:06 ` FUJITA Tomonori 2009-01-20 21:58 ` [PATCH 1/2] sg: fix races during device removal (v5) Tony Battersby 2009-01-20 22:00 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) Tony Battersby 0 siblings, 2 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-20 1:06 UTC (permalink / raw) To: tonyb; +Cc: James.Bottomley, fujita.tomonori, dgilbert, hch, linux-scsi On Mon, 19 Jan 2009 18:03:11 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > sg has the following problems related to device removal: > > * opening a sg fd races with removing a device > * closing a sg fd races with removing a device > * /proc/scsi/sg/* access races with removing a device > * command completion races with removing a device > * command completion races with closing a sg fd > * can rmmod sg with active commands > > These problems can cause kernel oopses, memory-use-after-free, or > double-free errors. This patch fixes these problems by using krefs > to manage the lifetime of sg_device and sg_fd. > > Each command submitted to the midlevel holds a reference to sg_fd > until the completion callback. This ensures that sg_fd doesn't go > away if the fd is closed with commands still outstanding. > > sg_fd gets the reference of sg_device (with scsi_device) and also > makes sure that the sg module doesn't go away. > > /proc/scsi/sg/* functions don't play nicely with krefs because they > give information about sg_fds which have been closed but not yet > freed due to still having outstanding commands and sg_devices which > have been removed but not yet freed due to still being referenced > by one or more sg_fds. To deal with this safely without removing > functionality, /proc functions now access sg_device and sg_fd while > holding a lock instead of using kref_get()/kref_put(). > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > --- > > This version implements Fujita's suggestion to use locking instead > of kref_get() for the /proc functions. > > This version does not use any controversial extensions to the kref API. > > To avoid changing the behavior of the /proc functions, it is still > possible to find a sg_fd or a sg_device with a refcount of 0, > since the last remaining references are removed from the kref_put() > release functions. However, no one should be doing a kref_get() > under these circumstances, so it shouldn't cause a problem. I added > some comments warning about this non-standard behavior to reduce the > risk of tripping up someone in the future. > > The original sg_open() would return -ENXIO if idr_find() returned > NULL but -ENODEV if sdp->detached. It would be inconvenient to > keep this behavior without inlining sg_get_dev() into sg_open(). > Since sg_open() was the only remaining caller anyway, I went ahead > and inlined it and removed sg_get_dev(), thereby enabling sg_open() > to return the same error codes as before. I think that it would be nice to use sg_get_dev and sg_put_dev. How about something like this: /* should be called with sg_index_lock held */ struct sg_device *sg_lookup_dev(int index) { return idr_find(&sg_index_idr, dev); } struct sg_device *sg_get_dev(int index) { struct sg_device *sdp; unsigned long flags; read_lock_irqsave(&sg_index_lock, flags); sdp = sg_lookup_dev(index); if (!sdp || !sdp->device) sdp = ERR_PTR(-ENXIO); else if (sdp->detached) sdp = ERR_PTR(-ENODEV); else kref_get(&sdp->d_ref); write_unlock_irqrestore(&sg_index_lock, flags); return sdp; } static void sg_put_dev(struct sg_device *sdp) { kref_put(&sdp->d_ref, sg_device_release); } The /proc functions use sg_lookup_dev. > This version still passes all the tests listed previously. However, > mptscsih.c still needs some fixes; I will submit a patch for that > later. > > sg.c | 386 ++++++++++++++++++++++++++++++++----------------------------------- > 1 file changed, 186 insertions(+), 200 deletions(-) > > --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-19 16:19:01.000000000 -0500 > +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-19 16:26:21.000000000 -0500 > @@ -101,6 +101,7 @@ static int scatter_elem_sz_prev = SG_SCA > #define SG_SECTOR_MSK (SG_SECTOR_SZ - 1) > > static int sg_add(struct device *, struct class_interface *); > +static void sg_device_destroy(struct kref *kref); > static void sg_remove(struct device *, struct class_interface *); > > static DEFINE_IDR(sg_index_idr); > @@ -158,6 +159,12 @@ typedef struct sg_fd { /* holds the sta > char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ > char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ > char mmap_called; /* 0 -> mmap() never called on this fd */ > + /* CAUTION! sfp is unlinked from sdp->headfp AFTER the refcount drops > + to 0. When scanning sdp->headfp, it is not safe to call > + kref_get(&sfp->f_ref) if sfp->closed because the refcount may > + already be 0. */ > + struct kref f_ref; > + struct execute_work ew; Documentation/CodingStyle says: The preferred style for long (multi-line) comments is: /* * This is the preferred style for multi-line * comments in the Linux kernel source code. * Please use it consistently. * * Description: A column of asterisks on the left side, * with beginning and ending almost-blank lines. */ But I think that we are fine without the comment. > } Sg_fd; > > typedef struct sg_device { /* holds the state of each scsi generic device */ > @@ -171,6 +178,22 @@ typedef struct sg_device { /* holds the > char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ > struct gendisk *disk; > struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ > + /* CAUTION! idr_remove() is called AFTER the refcount drops to 0. > + When using idr_find(), it is not safe to call kref_get(&sdp->d_ref) > + if sdp->detached because the refcount may already be 0. > + > + In order to get a reference safely using idr_find(): > + 1) lock sg_index_lock (either read or write lock) > + 2) call idr_find() > + 3) if sdp->detached, pretend that idr_find() returned NULL, and do > + not call kref_get() > + 4) if !sdp->detached, then kref_get(&sdp->d_ref) > + 5) unlock sg_index_lock > + > + Alternately, you can use idr_find() to access the device even if > + sdp->detached if you do not drop sg_index_lock, in which case you > + should NOT call kref_get(&sdp->d_ref). */ > + struct kref d_ref; Ditto. > } Sg_device; > > static int sg_fasync(int fd, struct file *filp, int mode); > @@ -194,13 +217,11 @@ static void sg_build_reserve(Sg_fd * sfp > static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); > static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); > static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); > -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); > -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); > +static void sg_remove_sfp(struct kref *); > static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); > static Sg_request *sg_add_request(Sg_fd * sfp); > static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); > static int sg_res_in_use(Sg_fd * sfp); > -static Sg_device *sg_get_dev(int dev); > #ifdef CONFIG_SCSI_PROC_FS > static int sg_last_dev(void); > #endif > @@ -236,23 +257,30 @@ sg_open(struct inode *inode, struct file > lock_kernel(); > nonseekable_open(inode, filp); > SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); > - sdp = sg_get_dev(dev); > - if ((!sdp) || (!sdp->device)) { > - unlock_kernel(); > - return -ENXIO; > - } > - if (sdp->detached) { > - unlock_kernel(); > - return -ENODEV; > + > + retval = -ENXIO; > + read_lock_irq(&sg_index_lock); > + sdp = idr_find(&sg_index_idr, dev); > + if (sdp) { > + /* See comments about using idr_find() and d_ref. If > + sdp->detached, then the refcount may already be 0, in which > + case it would be a bug to do kref_get(). */ > + if (!sdp->detached) > + kref_get(&sdp->d_ref); > + else { > + sdp = NULL; > + retval = -ENODEV; > + } > } > + read_unlock_irq(&sg_index_lock); > + if ((!sdp) || (!sdp->device)) > + goto sg_put; With the above sg_get_dev, we can do: sdp = sg_get_dev(dev); if (IS_ERR(sdp)) { unlock_kernel(); return PTR_ERR(sdp); } > /* This driver's module count bumped by fops_get in <linux/fs.h> */ > /* Prevent the device driver from vanishing while we sleep */ > retval = scsi_device_get(sdp->device); > - if (retval) { > - unlock_kernel(); > - return retval; > - } > + if (retval) > + goto sg_put; > > if (!((flags & O_NONBLOCK) || > scsi_block_when_processing_errors(sdp->device))) { > @@ -303,16 +331,20 @@ sg_open(struct inode *inode, struct file > if ((sfp = sg_add_sfp(sdp, dev))) > filp->private_data = sfp; > else { > - if (flags & O_EXCL) > + if (flags & O_EXCL) { > sdp->exclude = 0; /* undo if error */ > + wake_up_interruptible(&sdp->o_excl_wait); > + } > retval = -ENOMEM; > goto error_out; > } > - unlock_kernel(); > - return 0; > - > - error_out: > - scsi_device_put(sdp->device); > + retval = 0; > +error_out: > + if (retval) > + scsi_device_put(sdp->device); > +sg_put: > + if (sdp) > + kref_put(&sdp->d_ref, sg_device_destroy); > unlock_kernel(); > return retval; > } > @@ -327,13 +359,14 @@ sg_release(struct inode *inode, struct f > if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) > return -ENXIO; > SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); > - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ > - if (!sdp->detached) { > - scsi_device_put(sdp->device); > - } > - sdp->exclude = 0; > - wake_up_interruptible(&sdp->o_excl_wait); > - } > + > + sfp->closed = 1; > + > + sdp->exclude = 0; > + wake_up_interruptible(&sdp->o_excl_wait); > + > + kref_put(&sfp->f_ref, sg_remove_sfp); > + > return 0; > } > > @@ -755,6 +788,7 @@ sg_common_write(Sg_fd * sfp, Sg_request > hp->duration = jiffies_to_msecs(jiffies); > > srp->rq->timeout = timeout; > + kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ > blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk, > srp->rq, 1, sg_rq_end_io); > return 0; > @@ -1247,25 +1281,28 @@ sg_mmap(struct file *filp, struct vm_are > static void sg_rq_end_io(struct request *rq, int uptodate) > { > struct sg_request *srp = rq->end_io_data; > - Sg_device *sdp = NULL; > + Sg_device *sdp; > Sg_fd *sfp; > unsigned long iflags; > unsigned int ms; > char *sense; > - int result, resid; > + int result, resid, done = 1; > > - if (NULL == srp) { > - printk(KERN_ERR "sg_cmd_done: NULL request\n"); > + if (unlikely(NULL == srp)) { /* shouldn't happen */ > + printk(KERN_ERR "sg_rq_end_io: NULL request\n"); > return; > } I think that we can simply remove the above code. As you wrote, if it happens, we do something completely wrong. > + > sfp = srp->parentfp; > - if (sfp) > - sdp = sfp->parentdp; > - if ((NULL == sdp) || sdp->detached) { > - printk(KERN_INFO "sg_cmd_done: device detached\n"); > + if (unlikely(NULL == sfp)) { /* shouldn't happen */ > + printk(KERN_ERR "sg_rq_end_io: NULL sg_fd\n"); > return; > } Ditto. > + sdp = sfp->parentdp; > + if (unlikely(sdp->detached)) > + printk(KERN_INFO "sg_rq_end_io: device detached\n"); > + > sense = rq->sense; > result = rq->errors; > resid = rq->data_len; > @@ -1303,33 +1340,25 @@ static void sg_rq_end_io(struct request > } > /* Rely on write phase to clean out srp status values, so no "else" */ > > - if (sfp->closed) { /* whoops this fd already released, cleanup */ > - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); > - sg_finish_rem_req(srp); > - srp = NULL; > - if (NULL == sfp->headrp) { > - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); > - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ > - scsi_device_put(sdp->device); > - } > - sfp = NULL; > - } > - } else if (srp && srp->orphan) { > + write_lock_irqsave(&sfp->rq_list_lock, iflags); > + if (unlikely(srp->orphan)) { > if (sfp->keep_orphan) > srp->sg_io_owned = 0; > - else { > - sg_finish_rem_req(srp); > - srp = NULL; > - } > + else > + done = 0; > } > - if (sfp && srp) { > - /* Now wake up any sg_read() that is waiting for this packet. */ > - kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); > - write_lock_irqsave(&sfp->rq_list_lock, iflags); > - srp->done = 1; > + srp->done = done; > + write_unlock_irqrestore(&sfp->rq_list_lock, iflags); > + > + if (likely(done)) { > + /* Now wake up any sg_read() that is waiting for this > + packet. */ > wake_up_interruptible(&sfp->read_wait); > - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); > - } > + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); > + } else > + sg_finish_rem_req(srp); /* call with srp->done == 0 */ > + > + kref_put(&sfp->f_ref, sg_remove_sfp); > } > > static struct file_operations sg_fops = { > @@ -1391,6 +1420,7 @@ static Sg_device *sg_alloc(struct gendis > init_waitqueue_head(&sdp->o_excl_wait); > sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); > sdp->index = k; > + kref_init(&sdp->d_ref); > > error = 0; > out: > @@ -1488,49 +1518,45 @@ out: > return error; > } > > -static void > -sg_remove(struct device *cl_dev, struct class_interface *cl_intf) > +static void sg_device_destroy(struct kref *kref) > +{ > + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); > + unsigned long flags; > + > + /* CAUTION! Note that the device can still be found via idr_find() > + even though the refcount is 0. Therefore, do idr_remove() BEFORE > + any other cleanup. */ > + > + write_lock_irqsave(&sg_index_lock, flags); > + idr_remove(&sg_index_idr, sdp->index); > + write_unlock_irqrestore(&sg_index_lock, flags); > + > + SCSI_LOG_TIMEOUT(3, > + printk("sg_device_destroy: %s\n", > + sdp->disk->disk_name)); > + > + put_disk(sdp->disk); > + kfree(sdp); > +} > + > +static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf) > { > struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); > Sg_device *sdp = dev_get_drvdata(cl_dev); > unsigned long iflags; > Sg_fd *sfp; > - Sg_fd *tsfp; > - Sg_request *srp; > - Sg_request *tsrp; > - int delay; > > - if (!sdp) > + if (!sdp || sdp->detached) > return; > > - delay = 0; > + SCSI_LOG_TIMEOUT(3, printk("sg_remove: %s\n", sdp->disk->disk_name)); > + > + /* Need a write lock to set sdp->detached. */ > write_lock_irqsave(&sg_index_lock, iflags); > - if (sdp->headfp) { > - sdp->detached = 1; > - for (sfp = sdp->headfp; sfp; sfp = tsfp) { > - tsfp = sfp->nextfp; > - for (srp = sfp->headrp; srp; srp = tsrp) { > - tsrp = srp->nextrp; > - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) > - sg_finish_rem_req(srp); > - } > - if (sfp->closed) { > - scsi_device_put(sdp->device); > - __sg_remove_sfp(sdp, sfp); > - } else { > - delay = 1; > - wake_up_interruptible(&sfp->read_wait); > - kill_fasync(&sfp->async_qp, SIGPOLL, > - POLL_HUP); > - } > - } > - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); > - if (NULL == sdp->headfp) { > - idr_remove(&sg_index_idr, sdp->index); > - } > - } else { /* nothing active, simple case */ > - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); > - idr_remove(&sg_index_idr, sdp->index); > + sdp->detached = 1; > + for (sfp = sdp->headfp; sfp; sfp = sfp->nextfp) { > + wake_up_interruptible(&sfp->read_wait); > + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); > } > write_unlock_irqrestore(&sg_index_lock, iflags); > > @@ -1538,13 +1564,8 @@ sg_remove(struct device *cl_dev, struct > device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); > cdev_del(sdp->cdev); > sdp->cdev = NULL; > - put_disk(sdp->disk); > - sdp->disk = NULL; > - if (NULL == sdp->headfp) > - kfree(sdp); > > - if (delay) > - msleep(10); /* dirty detach so delay device destruction */ > + kref_put(&sdp->d_ref, sg_device_destroy); > } > > module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); > @@ -1941,22 +1962,6 @@ sg_get_rq_mark(Sg_fd * sfp, int pack_id) > return resp; > } > > -#ifdef CONFIG_SCSI_PROC_FS > -static Sg_request * > -sg_get_nth_request(Sg_fd * sfp, int nth) > -{ > - Sg_request *resp; > - unsigned long iflags; > - int k; > - > - read_lock_irqsave(&sfp->rq_list_lock, iflags); > - for (k = 0, resp = sfp->headrp; resp && (k < nth); > - ++k, resp = resp->nextrp) ; > - read_unlock_irqrestore(&sfp->rq_list_lock, iflags); > - return resp; > -} > -#endif > - > /* always adds to end of list */ > static Sg_request * > sg_add_request(Sg_fd * sfp) > @@ -2032,22 +2037,6 @@ sg_remove_request(Sg_fd * sfp, Sg_reques > return res; > } > > -#ifdef CONFIG_SCSI_PROC_FS > -static Sg_fd * > -sg_get_nth_sfp(Sg_device * sdp, int nth) > -{ > - Sg_fd *resp; > - unsigned long iflags; > - int k; > - > - read_lock_irqsave(&sg_index_lock, iflags); > - for (k = 0, resp = sdp->headfp; resp && (k < nth); > - ++k, resp = resp->nextfp) ; > - read_unlock_irqrestore(&sg_index_lock, iflags); > - return resp; > -} > -#endif > - > static Sg_fd * > sg_add_sfp(Sg_device * sdp, int dev) > { > @@ -2062,6 +2051,7 @@ sg_add_sfp(Sg_device * sdp, int dev) > init_waitqueue_head(&sfp->read_wait); > rwlock_init(&sfp->rq_list_lock); > > + kref_init(&sfp->f_ref); > sfp->timeout = SG_DEFAULT_TIMEOUT; > sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; > sfp->force_packid = SG_DEF_FORCE_PACK_ID; > @@ -2089,15 +2079,53 @@ sg_add_sfp(Sg_device * sdp, int dev) > sg_build_reserve(sfp, bufflen); > SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", > sfp->reserve.bufflen, sfp->reserve.k_use_sg)); > + > + kref_get(&sdp->d_ref); > + __module_get(THIS_MODULE); > return sfp; > } > > -static void > -__sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) > +static void sg_remove_sfp_usercontext(struct work_struct *work) > +{ > + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); > + struct sg_device *sdp = sfp->parentdp; > + > + /* Cleanup any responses which were never read(). */ > + while (sfp->headrp) > + sg_finish_rem_req(sfp->headrp); > + > + if (sfp->reserve.bufflen > 0) { > + SCSI_LOG_TIMEOUT(6, > + printk("sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", > + (int) sfp->reserve.bufflen, > + (int) sfp->reserve.k_use_sg)); > + sg_remove_scat(&sfp->reserve); > + } > + > + SCSI_LOG_TIMEOUT(6, > + printk("sg_remove_sfp: %s, sfp=0x%p\n", > + sdp->disk->disk_name, > + sfp)); > + kfree(sfp); > + > + scsi_device_put(sdp->device); > + kref_put(&sdp->d_ref, sg_device_destroy); > + module_put(THIS_MODULE); > +} > + > +static void sg_remove_sfp(struct kref *kref) > { > + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); > + struct sg_device *sdp = sfp->parentdp; > Sg_fd *fp; > Sg_fd *prev_fp; > + unsigned long iflags; > + > + /* CAUTION! Note that sfp can still be found by walking sdp->headfp > + even though the refcount is now 0. Therefore, unlink sfp from > + sdp->headfp BEFORE doing any other cleanup. */ > > + write_lock_irqsave(&sg_index_lock, iflags); > prev_fp = sdp->headfp; > if (sfp == prev_fp) > sdp->headfp = prev_fp->nextfp; > @@ -2110,54 +2138,10 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * > prev_fp = fp; > } > } > - if (sfp->reserve.bufflen > 0) { > - SCSI_LOG_TIMEOUT(6, > - printk("__sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", > - (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); > - sg_remove_scat(&sfp->reserve); > - } > - sfp->parentdp = NULL; > - SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); > - kfree(sfp); > -} > - > -/* Returns 0 in normal case, 1 when detached and sdp object removed */ > -static int > -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) > -{ > - Sg_request *srp; > - Sg_request *tsrp; > - int dirty = 0; > - int res = 0; > - > - for (srp = sfp->headrp; srp; srp = tsrp) { > - tsrp = srp->nextrp; > - if (sg_srp_done(srp, sfp)) > - sg_finish_rem_req(srp); > - else > - ++dirty; > - } > - if (0 == dirty) { > - unsigned long iflags; > + write_unlock_irqrestore(&sg_index_lock, iflags); > + wake_up_interruptible(&sdp->o_excl_wait); > > - write_lock_irqsave(&sg_index_lock, iflags); > - __sg_remove_sfp(sdp, sfp); > - if (sdp->detached && (NULL == sdp->headfp)) { > - idr_remove(&sg_index_idr, sdp->index); > - kfree(sdp); > - res = 1; > - } > - write_unlock_irqrestore(&sg_index_lock, iflags); > - } else { > - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ > - /* only bump the access_count if we actually succeeded in > - * throwing another counter on the host module */ > - scsi_device_get(sdp->device); /* XXX: retval ignored? */ > - sfp->closed = 1; /* flag dirty state on this fd */ > - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", > - dirty)); > - } > - return res; > + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); > } > > static int > @@ -2199,19 +2183,6 @@ sg_last_dev(void) > } > #endif > > -static Sg_device * > -sg_get_dev(int dev) > -{ > - Sg_device *sdp; > - unsigned long iflags; > - > - read_lock_irqsave(&sg_index_lock, iflags); > - sdp = idr_find(&sg_index_idr, dev); > - read_unlock_irqrestore(&sg_index_lock, iflags); > - > - return sdp; > -} > - > #ifdef CONFIG_SCSI_PROC_FS > > static struct proc_dir_entry *sg_proc_sgp = NULL; > @@ -2468,8 +2439,10 @@ static int sg_proc_seq_show_dev(struct s > struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; > Sg_device *sdp; > struct scsi_device *scsidp; > + unsigned long iflags; > > - sdp = it ? sg_get_dev(it->index) : NULL; > + read_lock_irqsave(&sg_index_lock, iflags); > + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; > if (sdp && (scsidp = sdp->device) && (!sdp->detached)) > seq_printf(s, "%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n", > scsidp->host->host_no, scsidp->channel, > @@ -2480,6 +2453,7 @@ static int sg_proc_seq_show_dev(struct s > (int) scsi_device_online(scsidp)); > else > seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); > + read_unlock_irqrestore(&sg_index_lock, iflags); > return 0; > } > > @@ -2493,13 +2467,16 @@ static int sg_proc_seq_show_devstrs(stru > struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; > Sg_device *sdp; > struct scsi_device *scsidp; > + unsigned long iflags; > > - sdp = it ? sg_get_dev(it->index) : NULL; > + read_lock_irqsave(&sg_index_lock, iflags); > + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; > if (sdp && (scsidp = sdp->device) && (!sdp->detached)) > seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", > scsidp->vendor, scsidp->model, scsidp->rev); > else > seq_printf(s, "<no active device>\n"); > + read_unlock_irqrestore(&sg_index_lock, iflags); > return 0; > } > > @@ -2512,7 +2489,8 @@ static void sg_proc_debug_helper(struct > const char * cp; > unsigned int ms; > > - for (k = 0; (fp = sg_get_nth_sfp(sdp, k)); ++k) { > + for (k = 0, fp = sdp->headfp; fp != NULL; ++k, fp = fp->nextfp) { > + read_lock(&fp->rq_list_lock); /* irqs already disabled */ > seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " > "(res)sgat=%d low_dma=%d\n", k + 1, > jiffies_to_msecs(fp->timeout), > @@ -2522,7 +2500,9 @@ static void sg_proc_debug_helper(struct > seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=%d\n", > (int) fp->cmd_q, (int) fp->force_packid, > (int) fp->keep_orphan, (int) fp->closed); > - for (m = 0; (srp = sg_get_nth_request(fp, m)); ++m) { > + for (m = 0, srp = fp->headrp; > + srp != NULL; > + ++m, srp = srp->nextrp) { > hp = &srp->header; > new_interface = (hp->interface_id == '\0') ? 0 : 1; > if (srp->res_used) { > @@ -2559,6 +2539,7 @@ static void sg_proc_debug_helper(struct > } > if (0 == m) > seq_printf(s, " No requests active\n"); > + read_unlock(&fp->rq_list_lock); > } > } > > @@ -2571,23 +2552,26 @@ static int sg_proc_seq_show_debug(struct > { > struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; > Sg_device *sdp; > + unsigned long iflags; > > if (it && (0 == it->index)) { > seq_printf(s, "max_active_device=%d(origin 1)\n", > (int)it->max); > seq_printf(s, " def_reserved_size=%d\n", sg_big_buff); > } > - sdp = it ? sg_get_dev(it->index) : NULL; > + > + read_lock_irqsave(&sg_index_lock, iflags); > + sdp = it ? idr_find(&sg_index_idr, it->index) : NULL; > if (sdp) { > struct scsi_device *scsidp = sdp->device; > > if (NULL == scsidp) { > seq_printf(s, "device %d detached ??\n", > (int)it->index); > - return 0; > + goto out; > } > > - if (sg_get_nth_sfp(sdp, 0)) { > + if (sdp->headfp) { > seq_printf(s, " >>> device=%s ", > sdp->disk->disk_name); > if (sdp->detached) > @@ -2604,6 +2588,8 @@ static int sg_proc_seq_show_debug(struct > } > sg_proc_debug_helper(s, sdp); > } > +out: > + read_unlock_irqrestore(&sg_index_lock, iflags); > return 0; > } > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 1/2] sg: fix races during device removal (v5) 2009-01-20 1:06 ` FUJITA Tomonori @ 2009-01-20 21:58 ` Tony Battersby 2009-01-21 18:25 ` Stefan Richter 2009-01-21 19:45 ` [PATCH 1/2] sg: fix races during device removal (v6) Tony Battersby 2009-01-20 22:00 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) Tony Battersby 1 sibling, 2 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-20 21:58 UTC (permalink / raw) To: James.Bottomley; +Cc: FUJITA Tomonori, dgilbert, hch, linux-scsi sg has the following problems related to device removal: * opening a sg fd races with removing a device * closing a sg fd races with removing a device * /proc/scsi/sg/* access races with removing a device * command completion races with removing a device * command completion races with closing a sg fd * can rmmod sg with active commands These problems can cause kernel oopses, memory-use-after-free, or double-free errors. This patch fixes these problems by using krefs to manage the lifetime of sg_device and sg_fd. Each command submitted to the midlevel holds a reference to sg_fd until the completion callback. This ensures that sg_fd doesn't go away if the fd is closed with commands still outstanding. sg_fd gets the reference of sg_device (with scsi_device) and also makes sure that the sg module doesn't go away. /proc/scsi/sg/* functions don't play nicely with krefs because they give information about sg_fds which have been closed but not yet freed due to still having outstanding commands and sg_devices which have been removed but not yet freed due to still being referenced by one or more sg_fds. To deal with this safely without removing functionality, /proc functions now access sg_device and sg_fd while holding a lock instead of using kref_get()/kref_put(). Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- This version addresses Fujita's comments about version 4. However, regarding the unnecessary checks in sg_rq_end_io(), I think that it is best to have some checks there to help catch bugs. I think BUG_ON is the best way to go for this instead of printk/return because bugs will be noticed and fixed instead of possibly going unnoticed. BUG_ON is also less distracting than printk/return when reading the source, and it also has a clearer intent. To that end, I removed the check for srp == NULL, which probably wasn't useful. I changed the check for sfp == NULL to BUG_ON(); this would happen if a bug caused sg_finish_rem_req() or sg_remove_request() to be called before sg_rq_end_io() or if a buggy LLDD called sc->scsi_done() twice for the same command. I also added BUG_ON(srp->done != 0) to catch similar problems. This version also changes locking in sg_alloc() to make it impossible for idr_find() called under lock to return sdp with sdp->device == NULL or some other not-quite-fully-initialized state, so it is no longer necessary to check for that condition after idr_find(). I also fixed sg_alloc() forgetting to call idr_remove() if k >= SG_MAX_DEVS. sg.c | 420 ++++++++++++++++++++++++++++++++----------------------------------- 1 file changed, 201 insertions(+), 219 deletions(-) --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-20 16:24:59.000000000 -0500 +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-20 16:32:28.000000000 -0500 @@ -101,6 +101,7 @@ static int scatter_elem_sz_prev = SG_SCA #define SG_SECTOR_MSK (SG_SECTOR_SZ - 1) static int sg_add(struct device *, struct class_interface *); +static void sg_device_destroy(struct kref *kref); static void sg_remove(struct device *, struct class_interface *); static DEFINE_IDR(sg_index_idr); @@ -158,6 +159,8 @@ typedef struct sg_fd { /* holds the sta char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ + struct kref f_ref; + struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ @@ -171,6 +174,7 @@ typedef struct sg_device { /* holds the char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ struct gendisk *disk; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ + struct kref d_ref; } Sg_device; static int sg_fasync(int fd, struct file *filp, int mode); @@ -194,13 +198,14 @@ static void sg_build_reserve(Sg_fd * sfp static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); +static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static int sg_res_in_use(Sg_fd * sfp); +static Sg_device *sg_lookup_dev(int dev); static Sg_device *sg_get_dev(int dev); +static void sg_put_dev(Sg_device *sdp); #ifdef CONFIG_SCSI_PROC_FS static int sg_last_dev(void); #endif @@ -237,22 +242,17 @@ sg_open(struct inode *inode, struct file nonseekable_open(inode, filp); SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); sdp = sg_get_dev(dev); - if ((!sdp) || (!sdp->device)) { - unlock_kernel(); - return -ENXIO; - } - if (sdp->detached) { - unlock_kernel(); - return -ENODEV; + if (IS_ERR(sdp)) { + retval = PTR_ERR(sdp); + sdp = NULL; + goto sg_put; } /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); - if (retval) { - unlock_kernel(); - return retval; - } + if (retval) + goto sg_put; if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { @@ -303,16 +303,20 @@ sg_open(struct inode *inode, struct file if ((sfp = sg_add_sfp(sdp, dev))) filp->private_data = sfp; else { - if (flags & O_EXCL) + if (flags & O_EXCL) { sdp->exclude = 0; /* undo if error */ + wake_up_interruptible(&sdp->o_excl_wait); + } retval = -ENOMEM; goto error_out; } - unlock_kernel(); - return 0; - - error_out: - scsi_device_put(sdp->device); + retval = 0; +error_out: + if (retval) + scsi_device_put(sdp->device); +sg_put: + if (sdp) + sg_put_dev(sdp); unlock_kernel(); return retval; } @@ -327,13 +331,13 @@ sg_release(struct inode *inode, struct f if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ - if (!sdp->detached) { - scsi_device_put(sdp->device); - } - sdp->exclude = 0; - wake_up_interruptible(&sdp->o_excl_wait); - } + + sfp->closed = 1; + + sdp->exclude = 0; + wake_up_interruptible(&sdp->o_excl_wait); + + kref_put(&sfp->f_ref, sg_remove_sfp); return 0; } @@ -755,6 +759,7 @@ sg_common_write(Sg_fd * sfp, Sg_request hp->duration = jiffies_to_msecs(jiffies); srp->rq->timeout = timeout; + kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk, srp->rq, 1, sg_rq_end_io); return 0; @@ -1247,24 +1252,21 @@ sg_mmap(struct file *filp, struct vm_are static void sg_rq_end_io(struct request *rq, int uptodate) { struct sg_request *srp = rq->end_io_data; - Sg_device *sdp = NULL; + Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; - int result, resid; + int result, resid, done = 1; + + BUG_ON(srp->done != 0); - if (NULL == srp) { - printk(KERN_ERR "sg_cmd_done: NULL request\n"); - return; - } sfp = srp->parentfp; - if (sfp) - sdp = sfp->parentdp; - if ((NULL == sdp) || sdp->detached) { - printk(KERN_INFO "sg_cmd_done: device detached\n"); - return; - } + BUG_ON(sfp == NULL); + + sdp = sfp->parentdp; + if (unlikely(sdp->detached)) + printk(KERN_INFO "sg_rq_end_io: device detached\n"); sense = rq->sense; result = rq->errors; @@ -1303,33 +1305,26 @@ static void sg_rq_end_io(struct request } /* Rely on write phase to clean out srp status values, so no "else" */ - if (sfp->closed) { /* whoops this fd already released, cleanup */ - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); - sg_finish_rem_req(srp); - srp = NULL; - if (NULL == sfp->headrp) { - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ - scsi_device_put(sdp->device); - } - sfp = NULL; - } - } else if (srp && srp->orphan) { + write_lock_irqsave(&sfp->rq_list_lock, iflags); + if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; - else { - sg_finish_rem_req(srp); - srp = NULL; - } + else + done = 0; } - if (sfp && srp) { - /* Now wake up any sg_read() that is waiting for this packet. */ - kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 1; + srp->done = done; + write_unlock_irqrestore(&sfp->rq_list_lock, iflags); + + if (likely(done)) { + /* Now wake up any sg_read() that is waiting for this + * packet. + */ wake_up_interruptible(&sfp->read_wait); - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); - } + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); + } else + sg_finish_rem_req(srp); /* call with srp->done == 0 */ + + kref_put(&sfp->f_ref, sg_remove_sfp); } static struct file_operations sg_fops = { @@ -1364,17 +1359,18 @@ static Sg_device *sg_alloc(struct gendis printk(KERN_WARNING "kmalloc Sg_device failure\n"); return ERR_PTR(-ENOMEM); } - error = -ENOMEM; + if (!idr_pre_get(&sg_index_idr, GFP_KERNEL)) { printk(KERN_WARNING "idr expansion Sg_device failure\n"); + error = -ENOMEM; goto out; } write_lock_irqsave(&sg_index_lock, iflags); - error = idr_get_new(&sg_index_idr, sdp, &k); - write_unlock_irqrestore(&sg_index_lock, iflags); + error = idr_get_new(&sg_index_idr, sdp, &k); if (error) { + write_unlock_irqrestore(&sg_index_lock, iflags); printk(KERN_WARNING "idr allocation Sg_device failure: %d\n", error); goto out; @@ -1391,6 +1387,9 @@ static Sg_device *sg_alloc(struct gendis init_waitqueue_head(&sdp->o_excl_wait); sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); sdp->index = k; + kref_init(&sdp->d_ref); + + write_unlock_irqrestore(&sg_index_lock, iflags); error = 0; out: @@ -1401,6 +1400,8 @@ static Sg_device *sg_alloc(struct gendis return sdp; overflow: + idr_remove(&sg_index_idr, k); + write_unlock_irqrestore(&sg_index_lock, iflags); sdev_printk(KERN_WARNING, scsidp, "Unable to attach sg device type=%d, minor " "number exceeds %d\n", scsidp->type, SG_MAX_DEVS - 1); @@ -1488,49 +1489,46 @@ out: return error; } -static void -sg_remove(struct device *cl_dev, struct class_interface *cl_intf) +static void sg_device_destroy(struct kref *kref) +{ + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); + unsigned long flags; + + /* CAUTION! Note that the device can still be found via idr_find() + * even though the refcount is 0. Therefore, do idr_remove() BEFORE + * any other cleanup. + */ + + write_lock_irqsave(&sg_index_lock, flags); + idr_remove(&sg_index_idr, sdp->index); + write_unlock_irqrestore(&sg_index_lock, flags); + + SCSI_LOG_TIMEOUT(3, + printk("sg_device_destroy: %s\n", + sdp->disk->disk_name)); + + put_disk(sdp->disk); + kfree(sdp); +} + +static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; - Sg_fd *tsfp; - Sg_request *srp; - Sg_request *tsrp; - int delay; - if (!sdp) + if (!sdp || sdp->detached) return; - delay = 0; + SCSI_LOG_TIMEOUT(3, printk("sg_remove: %s\n", sdp->disk->disk_name)); + + /* Need a write lock to set sdp->detached. */ write_lock_irqsave(&sg_index_lock, iflags); - if (sdp->headfp) { - sdp->detached = 1; - for (sfp = sdp->headfp; sfp; sfp = tsfp) { - tsfp = sfp->nextfp; - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) - sg_finish_rem_req(srp); - } - if (sfp->closed) { - scsi_device_put(sdp->device); - __sg_remove_sfp(sdp, sfp); - } else { - delay = 1; - wake_up_interruptible(&sfp->read_wait); - kill_fasync(&sfp->async_qp, SIGPOLL, - POLL_HUP); - } - } - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); - if (NULL == sdp->headfp) { - idr_remove(&sg_index_idr, sdp->index); - } - } else { /* nothing active, simple case */ - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); - idr_remove(&sg_index_idr, sdp->index); + sdp->detached = 1; + for (sfp = sdp->headfp; sfp; sfp = sfp->nextfp) { + wake_up_interruptible(&sfp->read_wait); + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } write_unlock_irqrestore(&sg_index_lock, iflags); @@ -1538,13 +1536,8 @@ sg_remove(struct device *cl_dev, struct device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; - put_disk(sdp->disk); - sdp->disk = NULL; - if (NULL == sdp->headfp) - kfree(sdp); - if (delay) - msleep(10); /* dirty detach so delay device destruction */ + sg_put_dev(sdp); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); @@ -1941,22 +1934,6 @@ sg_get_rq_mark(Sg_fd * sfp, int pack_id) return resp; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_request * -sg_get_nth_request(Sg_fd * sfp, int nth) -{ - Sg_request *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sfp->rq_list_lock, iflags); - for (k = 0, resp = sfp->headrp; resp && (k < nth); - ++k, resp = resp->nextrp) ; - read_unlock_irqrestore(&sfp->rq_list_lock, iflags); - return resp; -} -#endif - /* always adds to end of list */ static Sg_request * sg_add_request(Sg_fd * sfp) @@ -2032,22 +2009,6 @@ sg_remove_request(Sg_fd * sfp, Sg_reques return res; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_fd * -sg_get_nth_sfp(Sg_device * sdp, int nth) -{ - Sg_fd *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sg_index_lock, iflags); - for (k = 0, resp = sdp->headfp; resp && (k < nth); - ++k, resp = resp->nextfp) ; - read_unlock_irqrestore(&sg_index_lock, iflags); - return resp; -} -#endif - static Sg_fd * sg_add_sfp(Sg_device * sdp, int dev) { @@ -2062,6 +2023,7 @@ sg_add_sfp(Sg_device * sdp, int dev) init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); + kref_init(&sfp->f_ref); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; @@ -2089,15 +2051,54 @@ sg_add_sfp(Sg_device * sdp, int dev) sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); + + kref_get(&sdp->d_ref); + __module_get(THIS_MODULE); return sfp; } -static void -__sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) +static void sg_remove_sfp_usercontext(struct work_struct *work) +{ + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); + struct sg_device *sdp = sfp->parentdp; + + /* Cleanup any responses which were never read(). */ + while (sfp->headrp) + sg_finish_rem_req(sfp->headrp); + + if (sfp->reserve.bufflen > 0) { + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", + (int) sfp->reserve.bufflen, + (int) sfp->reserve.k_use_sg)); + sg_remove_scat(&sfp->reserve); + } + + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: %s, sfp=0x%p\n", + sdp->disk->disk_name, + sfp)); + kfree(sfp); + + scsi_device_put(sdp->device); + sg_put_dev(sdp); + module_put(THIS_MODULE); +} + +static void sg_remove_sfp(struct kref *kref) { + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); + struct sg_device *sdp = sfp->parentdp; Sg_fd *fp; Sg_fd *prev_fp; + unsigned long iflags; + + /* CAUTION! Note that sfp can still be found by walking sdp->headfp + * even though the refcount is now 0. Therefore, unlink sfp from + * sdp->headfp BEFORE doing any other cleanup. + */ + write_lock_irqsave(&sg_index_lock, iflags); prev_fp = sdp->headfp; if (sfp == prev_fp) sdp->headfp = prev_fp->nextfp; @@ -2110,54 +2111,10 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * prev_fp = fp; } } - if (sfp->reserve.bufflen > 0) { - SCSI_LOG_TIMEOUT(6, - printk("__sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", - (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); - sg_remove_scat(&sfp->reserve); - } - sfp->parentdp = NULL; - SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); - kfree(sfp); -} - -/* Returns 0 in normal case, 1 when detached and sdp object removed */ -static int -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) -{ - Sg_request *srp; - Sg_request *tsrp; - int dirty = 0; - int res = 0; - - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sg_srp_done(srp, sfp)) - sg_finish_rem_req(srp); - else - ++dirty; - } - if (0 == dirty) { - unsigned long iflags; + write_unlock_irqrestore(&sg_index_lock, iflags); + wake_up_interruptible(&sdp->o_excl_wait); - write_lock_irqsave(&sg_index_lock, iflags); - __sg_remove_sfp(sdp, sfp); - if (sdp->detached && (NULL == sdp->headfp)) { - idr_remove(&sg_index_idr, sdp->index); - kfree(sdp); - res = 1; - } - write_unlock_irqrestore(&sg_index_lock, iflags); - } else { - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ - /* only bump the access_count if we actually succeeded in - * throwing another counter on the host module */ - scsi_device_get(sdp->device); /* XXX: retval ignored? */ - sfp->closed = 1; /* flag dirty state on this fd */ - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", - dirty)); - } - return res; + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); } static int @@ -2199,19 +2156,38 @@ sg_last_dev(void) } #endif -static Sg_device * -sg_get_dev(int dev) +/* must be called with sg_index_lock held */ +static Sg_device *sg_lookup_dev(int dev) { - Sg_device *sdp; - unsigned long iflags; + return idr_find(&sg_index_idr, dev); +} - read_lock_irqsave(&sg_index_lock, iflags); - sdp = idr_find(&sg_index_idr, dev); - read_unlock_irqrestore(&sg_index_lock, iflags); +static Sg_device *sg_get_dev(int dev) +{ + struct sg_device *sdp; + unsigned long flags; + + read_lock_irqsave(&sg_index_lock, flags); + sdp = sg_lookup_dev(dev); + if (!sdp) + sdp = ERR_PTR(-ENXIO); + else if (sdp->detached) { + /* If sdp->detached, then the refcount may already be 0, in + * which case it would be a bug to do kref_get(). + */ + sdp = ERR_PTR(-ENODEV); + } else + kref_get(&sdp->d_ref); + read_unlock_irqrestore(&sg_index_lock, flags); return sdp; } +static void sg_put_dev(struct sg_device *sdp) +{ + kref_put(&sdp->d_ref, sg_device_destroy); +} + #ifdef CONFIG_SCSI_PROC_FS static struct proc_dir_entry *sg_proc_sgp = NULL; @@ -2468,8 +2444,10 @@ static int sg_proc_seq_show_dev(struct s struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n", scsidp->host->host_no, scsidp->channel, @@ -2480,6 +2458,7 @@ static int sg_proc_seq_show_dev(struct s (int) scsi_device_online(scsidp)); else seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } @@ -2493,16 +2472,20 @@ static int sg_proc_seq_show_devstrs(stru struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", scsidp->vendor, scsidp->model, scsidp->rev); else seq_printf(s, "<no active device>\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } +/* must be called while holding sg_index_lock */ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp) { int k, m, new_interface, blen, usg; @@ -2512,7 +2495,8 @@ static void sg_proc_debug_helper(struct const char * cp; unsigned int ms; - for (k = 0; (fp = sg_get_nth_sfp(sdp, k)); ++k) { + for (k = 0, fp = sdp->headfp; fp != NULL; ++k, fp = fp->nextfp) { + read_lock(&fp->rq_list_lock); /* irqs already disabled */ seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k + 1, jiffies_to_msecs(fp->timeout), @@ -2522,7 +2506,9 @@ static void sg_proc_debug_helper(struct seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=%d\n", (int) fp->cmd_q, (int) fp->force_packid, (int) fp->keep_orphan, (int) fp->closed); - for (m = 0; (srp = sg_get_nth_request(fp, m)); ++m) { + for (m = 0, srp = fp->headrp; + srp != NULL; + ++m, srp = srp->nextrp) { hp = &srp->header; new_interface = (hp->interface_id == '\0') ? 0 : 1; if (srp->res_used) { @@ -2559,6 +2545,7 @@ static void sg_proc_debug_helper(struct } if (0 == m) seq_printf(s, " No requests active\n"); + read_unlock(&fp->rq_list_lock); } } @@ -2571,39 +2558,34 @@ static int sg_proc_seq_show_debug(struct { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; + unsigned long iflags; if (it && (0 == it->index)) { seq_printf(s, "max_active_device=%d(origin 1)\n", (int)it->max); seq_printf(s, " def_reserved_size=%d\n", sg_big_buff); } - sdp = it ? sg_get_dev(it->index) : NULL; - if (sdp) { - struct scsi_device *scsidp = sdp->device; - if (NULL == scsidp) { - seq_printf(s, "device %d detached ??\n", - (int)it->index); - return 0; - } + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; + if (sdp && sdp->headfp) { + struct scsi_device *scsidp = sdp->device; - if (sg_get_nth_sfp(sdp, 0)) { - seq_printf(s, " >>> device=%s ", - sdp->disk->disk_name); - if (sdp->detached) - seq_printf(s, "detached pending close "); - else - seq_printf - (s, "scsi%d chan=%d id=%d lun=%d em=%d", - scsidp->host->host_no, - scsidp->channel, scsidp->id, - scsidp->lun, - scsidp->host->hostt->emulated); - seq_printf(s, " sg_tablesize=%d excl=%d\n", - sdp->sg_tablesize, sdp->exclude); - } + seq_printf(s, " >>> device=%s ", sdp->disk->disk_name); + if (sdp->detached) + seq_printf(s, "detached pending close "); + else + seq_printf + (s, "scsi%d chan=%d id=%d lun=%d em=%d", + scsidp->host->host_no, + scsidp->channel, scsidp->id, + scsidp->lun, + scsidp->host->hostt->emulated); + seq_printf(s, " sg_tablesize=%d excl=%d\n", + sdp->sg_tablesize, sdp->exclude); sg_proc_debug_helper(s, sdp); } + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v5) 2009-01-20 21:58 ` [PATCH 1/2] sg: fix races during device removal (v5) Tony Battersby @ 2009-01-21 18:25 ` Stefan Richter 2009-01-21 19:23 ` Tony Battersby 2009-01-21 19:45 ` [PATCH 1/2] sg: fix races during device removal (v6) Tony Battersby 1 sibling, 1 reply; 41+ messages in thread From: Stefan Richter @ 2009-01-21 18:25 UTC (permalink / raw) To: Tony Battersby Cc: James.Bottomley, FUJITA Tomonori, dgilbert, hch, linux-scsi Tony Battersby wrote: > I think BUG_ON is > the best way to go for this instead of printk/return because bugs > will be noticed and fixed instead of possibly going unnoticed. ... > static void sg_rq_end_io(struct request *rq, int uptodate) > { ... > + int result, resid, done = 1; > + > + BUG_ON(srp->done != 0); AFAIU this is typically called in atomic context. If so, WARN_ON is preferred. ... > + BUG_ON(sfp == NULL); > + > + sdp = sfp->parentdp; This would bring up a NULL pointer dereference dump anyway. -- Stefan Richter -=====-==--= ---= =-=-= http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v5) 2009-01-21 18:25 ` Stefan Richter @ 2009-01-21 19:23 ` Tony Battersby 0 siblings, 0 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-21 19:23 UTC (permalink / raw) To: Stefan Richter Cc: James.Bottomley, FUJITA Tomonori, dgilbert, hch, linux-scsi Stefan Richter wrote: > Tony Battersby wrote: > >> I think BUG_ON is >> the best way to go for this instead of printk/return because bugs >> will be noticed and fixed instead of possibly going unnoticed. >> > ... > >> static void sg_rq_end_io(struct request *rq, int uptodate) >> { >> > ... > >> + int result, resid, done = 1; >> + >> + BUG_ON(srp->done != 0); >> > > AFAIU this is typically called in atomic context. If so, WARN_ON is > preferred. > > Good point. I think the following would be the best way then: if (WARN_ON(srp->done != 0)) return; > ... > >> + BUG_ON(sfp == NULL); >> + >> + sdp = sfp->parentdp; >> > > This would bring up a NULL pointer dereference dump anyway. > Yes, and that would have the same problem as BUG_ON by killing the machine. So I think WARN_ON()/return works best for this too. Tony ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 1/2] sg: fix races during device removal (v6) 2009-01-20 21:58 ` [PATCH 1/2] sg: fix races during device removal (v5) Tony Battersby 2009-01-21 18:25 ` Stefan Richter @ 2009-01-21 19:45 ` Tony Battersby 2009-01-25 12:46 ` FUJITA Tomonori 2009-01-26 13:57 ` Douglas Gilbert 1 sibling, 2 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-21 19:45 UTC (permalink / raw) To: James.Bottomley; +Cc: FUJITA Tomonori, dgilbert, hch, linux-scsi sg has the following problems related to device removal: * opening a sg fd races with removing a device * closing a sg fd races with removing a device * /proc/scsi/sg/* access races with removing a device * command completion races with removing a device * command completion races with closing a sg fd * can rmmod sg with active commands These problems can cause kernel oopses, memory-use-after-free, or double-free errors. This patch fixes these problems by using krefs to manage the lifetime of sg_device and sg_fd. Each command submitted to the midlevel holds a reference to sg_fd until the completion callback. This ensures that sg_fd doesn't go away if the fd is closed with commands still outstanding. sg_fd gets the reference of sg_device (with scsi_device) and also makes sure that the sg module doesn't go away. /proc/scsi/sg/* functions don't play nicely with krefs because they give information about sg_fds which have been closed but not yet freed due to still having outstanding commands and sg_devices which have been removed but not yet freed due to still being referenced by one or more sg_fds. To deal with this safely without removing functionality, /proc functions now access sg_device and sg_fd while holding a lock instead of using kref_get()/kref_put(). Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- This version changes BUG_ON() to WARN_ON()/return as suggested by Stefan Richter. The second patch "[PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2)" is still the same as before, so I am not resending it. sg.c | 418 ++++++++++++++++++++++++++++++++----------------------------------- 1 file changed, 201 insertions(+), 217 deletions(-) --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-21 14:34:05.000000000 -0500 +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-21 14:36:00.000000000 -0500 @@ -101,6 +101,7 @@ static int scatter_elem_sz_prev = SG_SCA #define SG_SECTOR_MSK (SG_SECTOR_SZ - 1) static int sg_add(struct device *, struct class_interface *); +static void sg_device_destroy(struct kref *kref); static void sg_remove(struct device *, struct class_interface *); static DEFINE_IDR(sg_index_idr); @@ -158,6 +159,8 @@ typedef struct sg_fd { /* holds the sta char next_cmd_len; /* 0 -> automatic (def), >0 -> use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ + struct kref f_ref; + struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ @@ -171,6 +174,7 @@ typedef struct sg_device { /* holds the char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ struct gendisk *disk; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ + struct kref d_ref; } Sg_device; static int sg_fasync(int fd, struct file *filp, int mode); @@ -194,13 +198,14 @@ static void sg_build_reserve(Sg_fd * sfp static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp, int dev); -static int sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); -static void __sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp); +static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static int sg_res_in_use(Sg_fd * sfp); +static Sg_device *sg_lookup_dev(int dev); static Sg_device *sg_get_dev(int dev); +static void sg_put_dev(Sg_device *sdp); #ifdef CONFIG_SCSI_PROC_FS static int sg_last_dev(void); #endif @@ -237,22 +242,17 @@ sg_open(struct inode *inode, struct file nonseekable_open(inode, filp); SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags)); sdp = sg_get_dev(dev); - if ((!sdp) || (!sdp->device)) { - unlock_kernel(); - return -ENXIO; - } - if (sdp->detached) { - unlock_kernel(); - return -ENODEV; + if (IS_ERR(sdp)) { + retval = PTR_ERR(sdp); + sdp = NULL; + goto sg_put; } /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); - if (retval) { - unlock_kernel(); - return retval; - } + if (retval) + goto sg_put; if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { @@ -303,16 +303,20 @@ sg_open(struct inode *inode, struct file if ((sfp = sg_add_sfp(sdp, dev))) filp->private_data = sfp; else { - if (flags & O_EXCL) + if (flags & O_EXCL) { sdp->exclude = 0; /* undo if error */ + wake_up_interruptible(&sdp->o_excl_wait); + } retval = -ENOMEM; goto error_out; } - unlock_kernel(); - return 0; - - error_out: - scsi_device_put(sdp->device); + retval = 0; +error_out: + if (retval) + scsi_device_put(sdp->device); +sg_put: + if (sdp) + sg_put_dev(sdp); unlock_kernel(); return retval; } @@ -327,13 +331,13 @@ sg_release(struct inode *inode, struct f if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name)); - if (0 == sg_remove_sfp(sdp, sfp)) { /* Returns 1 when sdp gone */ - if (!sdp->detached) { - scsi_device_put(sdp->device); - } - sdp->exclude = 0; - wake_up_interruptible(&sdp->o_excl_wait); - } + + sfp->closed = 1; + + sdp->exclude = 0; + wake_up_interruptible(&sdp->o_excl_wait); + + kref_put(&sfp->f_ref, sg_remove_sfp); return 0; } @@ -755,6 +759,7 @@ sg_common_write(Sg_fd * sfp, Sg_request hp->duration = jiffies_to_msecs(jiffies); srp->rq->timeout = timeout; + kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ blk_execute_rq_nowait(sdp->device->request_queue, sdp->disk, srp->rq, 1, sg_rq_end_io); return 0; @@ -1247,24 +1252,23 @@ sg_mmap(struct file *filp, struct vm_are static void sg_rq_end_io(struct request *rq, int uptodate) { struct sg_request *srp = rq->end_io_data; - Sg_device *sdp = NULL; + Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; - int result, resid; + int result, resid, done = 1; - if (NULL == srp) { - printk(KERN_ERR "sg_cmd_done: NULL request\n"); + if (WARN_ON(srp->done != 0)) return; - } + sfp = srp->parentfp; - if (sfp) - sdp = sfp->parentdp; - if ((NULL == sdp) || sdp->detached) { - printk(KERN_INFO "sg_cmd_done: device detached\n"); + if (WARN_ON(sfp == NULL)) return; - } + + sdp = sfp->parentdp; + if (unlikely(sdp->detached)) + printk(KERN_INFO "sg_rq_end_io: device detached\n"); sense = rq->sense; result = rq->errors; @@ -1303,33 +1307,26 @@ static void sg_rq_end_io(struct request } /* Rely on write phase to clean out srp status values, so no "else" */ - if (sfp->closed) { /* whoops this fd already released, cleanup */ - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, freeing ...\n")); - sg_finish_rem_req(srp); - srp = NULL; - if (NULL == sfp->headrp) { - SCSI_LOG_TIMEOUT(1, printk("sg_cmd_done: already closed, final cleanup\n")); - if (0 == sg_remove_sfp(sdp, sfp)) { /* device still present */ - scsi_device_put(sdp->device); - } - sfp = NULL; - } - } else if (srp && srp->orphan) { + write_lock_irqsave(&sfp->rq_list_lock, iflags); + if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; - else { - sg_finish_rem_req(srp); - srp = NULL; - } + else + done = 0; } - if (sfp && srp) { - /* Now wake up any sg_read() that is waiting for this packet. */ - kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 1; + srp->done = done; + write_unlock_irqrestore(&sfp->rq_list_lock, iflags); + + if (likely(done)) { + /* Now wake up any sg_read() that is waiting for this + * packet. + */ wake_up_interruptible(&sfp->read_wait); - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); - } + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); + } else + sg_finish_rem_req(srp); /* call with srp->done == 0 */ + + kref_put(&sfp->f_ref, sg_remove_sfp); } static struct file_operations sg_fops = { @@ -1364,17 +1361,18 @@ static Sg_device *sg_alloc(struct gendis printk(KERN_WARNING "kmalloc Sg_device failure\n"); return ERR_PTR(-ENOMEM); } - error = -ENOMEM; + if (!idr_pre_get(&sg_index_idr, GFP_KERNEL)) { printk(KERN_WARNING "idr expansion Sg_device failure\n"); + error = -ENOMEM; goto out; } write_lock_irqsave(&sg_index_lock, iflags); - error = idr_get_new(&sg_index_idr, sdp, &k); - write_unlock_irqrestore(&sg_index_lock, iflags); + error = idr_get_new(&sg_index_idr, sdp, &k); if (error) { + write_unlock_irqrestore(&sg_index_lock, iflags); printk(KERN_WARNING "idr allocation Sg_device failure: %d\n", error); goto out; @@ -1391,6 +1389,9 @@ static Sg_device *sg_alloc(struct gendis init_waitqueue_head(&sdp->o_excl_wait); sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments); sdp->index = k; + kref_init(&sdp->d_ref); + + write_unlock_irqrestore(&sg_index_lock, iflags); error = 0; out: @@ -1401,6 +1402,8 @@ static Sg_device *sg_alloc(struct gendis return sdp; overflow: + idr_remove(&sg_index_idr, k); + write_unlock_irqrestore(&sg_index_lock, iflags); sdev_printk(KERN_WARNING, scsidp, "Unable to attach sg device type=%d, minor " "number exceeds %d\n", scsidp->type, SG_MAX_DEVS - 1); @@ -1488,49 +1491,46 @@ out: return error; } -static void -sg_remove(struct device *cl_dev, struct class_interface *cl_intf) +static void sg_device_destroy(struct kref *kref) +{ + struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); + unsigned long flags; + + /* CAUTION! Note that the device can still be found via idr_find() + * even though the refcount is 0. Therefore, do idr_remove() BEFORE + * any other cleanup. + */ + + write_lock_irqsave(&sg_index_lock, flags); + idr_remove(&sg_index_idr, sdp->index); + write_unlock_irqrestore(&sg_index_lock, flags); + + SCSI_LOG_TIMEOUT(3, + printk("sg_device_destroy: %s\n", + sdp->disk->disk_name)); + + put_disk(sdp->disk); + kfree(sdp); +} + +static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; - Sg_fd *tsfp; - Sg_request *srp; - Sg_request *tsrp; - int delay; - if (!sdp) + if (!sdp || sdp->detached) return; - delay = 0; + SCSI_LOG_TIMEOUT(3, printk("sg_remove: %s\n", sdp->disk->disk_name)); + + /* Need a write lock to set sdp->detached. */ write_lock_irqsave(&sg_index_lock, iflags); - if (sdp->headfp) { - sdp->detached = 1; - for (sfp = sdp->headfp; sfp; sfp = tsfp) { - tsfp = sfp->nextfp; - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sfp->closed || (0 == sg_srp_done(srp, sfp))) - sg_finish_rem_req(srp); - } - if (sfp->closed) { - scsi_device_put(sdp->device); - __sg_remove_sfp(sdp, sfp); - } else { - delay = 1; - wake_up_interruptible(&sfp->read_wait); - kill_fasync(&sfp->async_qp, SIGPOLL, - POLL_HUP); - } - } - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d, dirty\n", sdp->index)); - if (NULL == sdp->headfp) { - idr_remove(&sg_index_idr, sdp->index); - } - } else { /* nothing active, simple case */ - SCSI_LOG_TIMEOUT(3, printk("sg_remove: dev=%d\n", sdp->index)); - idr_remove(&sg_index_idr, sdp->index); + sdp->detached = 1; + for (sfp = sdp->headfp; sfp; sfp = sfp->nextfp) { + wake_up_interruptible(&sfp->read_wait); + kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } write_unlock_irqrestore(&sg_index_lock, iflags); @@ -1538,13 +1538,8 @@ sg_remove(struct device *cl_dev, struct device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; - put_disk(sdp->disk); - sdp->disk = NULL; - if (NULL == sdp->headfp) - kfree(sdp); - if (delay) - msleep(10); /* dirty detach so delay device destruction */ + sg_put_dev(sdp); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); @@ -1941,22 +1936,6 @@ sg_get_rq_mark(Sg_fd * sfp, int pack_id) return resp; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_request * -sg_get_nth_request(Sg_fd * sfp, int nth) -{ - Sg_request *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sfp->rq_list_lock, iflags); - for (k = 0, resp = sfp->headrp; resp && (k < nth); - ++k, resp = resp->nextrp) ; - read_unlock_irqrestore(&sfp->rq_list_lock, iflags); - return resp; -} -#endif - /* always adds to end of list */ static Sg_request * sg_add_request(Sg_fd * sfp) @@ -2032,22 +2011,6 @@ sg_remove_request(Sg_fd * sfp, Sg_reques return res; } -#ifdef CONFIG_SCSI_PROC_FS -static Sg_fd * -sg_get_nth_sfp(Sg_device * sdp, int nth) -{ - Sg_fd *resp; - unsigned long iflags; - int k; - - read_lock_irqsave(&sg_index_lock, iflags); - for (k = 0, resp = sdp->headfp; resp && (k < nth); - ++k, resp = resp->nextfp) ; - read_unlock_irqrestore(&sg_index_lock, iflags); - return resp; -} -#endif - static Sg_fd * sg_add_sfp(Sg_device * sdp, int dev) { @@ -2062,6 +2025,7 @@ sg_add_sfp(Sg_device * sdp, int dev) init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); + kref_init(&sfp->f_ref); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; @@ -2089,15 +2053,54 @@ sg_add_sfp(Sg_device * sdp, int dev) sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); + + kref_get(&sdp->d_ref); + __module_get(THIS_MODULE); return sfp; } -static void -__sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) +static void sg_remove_sfp_usercontext(struct work_struct *work) +{ + struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); + struct sg_device *sdp = sfp->parentdp; + + /* Cleanup any responses which were never read(). */ + while (sfp->headrp) + sg_finish_rem_req(sfp->headrp); + + if (sfp->reserve.bufflen > 0) { + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", + (int) sfp->reserve.bufflen, + (int) sfp->reserve.k_use_sg)); + sg_remove_scat(&sfp->reserve); + } + + SCSI_LOG_TIMEOUT(6, + printk("sg_remove_sfp: %s, sfp=0x%p\n", + sdp->disk->disk_name, + sfp)); + kfree(sfp); + + scsi_device_put(sdp->device); + sg_put_dev(sdp); + module_put(THIS_MODULE); +} + +static void sg_remove_sfp(struct kref *kref) { + struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); + struct sg_device *sdp = sfp->parentdp; Sg_fd *fp; Sg_fd *prev_fp; + unsigned long iflags; + + /* CAUTION! Note that sfp can still be found by walking sdp->headfp + * even though the refcount is now 0. Therefore, unlink sfp from + * sdp->headfp BEFORE doing any other cleanup. + */ + write_lock_irqsave(&sg_index_lock, iflags); prev_fp = sdp->headfp; if (sfp == prev_fp) sdp->headfp = prev_fp->nextfp; @@ -2110,54 +2113,10 @@ __sg_remove_sfp(Sg_device * sdp, Sg_fd * prev_fp = fp; } } - if (sfp->reserve.bufflen > 0) { - SCSI_LOG_TIMEOUT(6, - printk("__sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", - (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); - sg_remove_scat(&sfp->reserve); - } - sfp->parentdp = NULL; - SCSI_LOG_TIMEOUT(6, printk("__sg_remove_sfp: sfp=0x%p\n", sfp)); - kfree(sfp); -} - -/* Returns 0 in normal case, 1 when detached and sdp object removed */ -static int -sg_remove_sfp(Sg_device * sdp, Sg_fd * sfp) -{ - Sg_request *srp; - Sg_request *tsrp; - int dirty = 0; - int res = 0; - - for (srp = sfp->headrp; srp; srp = tsrp) { - tsrp = srp->nextrp; - if (sg_srp_done(srp, sfp)) - sg_finish_rem_req(srp); - else - ++dirty; - } - if (0 == dirty) { - unsigned long iflags; + write_unlock_irqrestore(&sg_index_lock, iflags); + wake_up_interruptible(&sdp->o_excl_wait); - write_lock_irqsave(&sg_index_lock, iflags); - __sg_remove_sfp(sdp, sfp); - if (sdp->detached && (NULL == sdp->headfp)) { - idr_remove(&sg_index_idr, sdp->index); - kfree(sdp); - res = 1; - } - write_unlock_irqrestore(&sg_index_lock, iflags); - } else { - /* MOD_INC's to inhibit unloading sg and associated adapter driver */ - /* only bump the access_count if we actually succeeded in - * throwing another counter on the host module */ - scsi_device_get(sdp->device); /* XXX: retval ignored? */ - sfp->closed = 1; /* flag dirty state on this fd */ - SCSI_LOG_TIMEOUT(1, printk("sg_remove_sfp: worrisome, %d writes pending\n", - dirty)); - } - return res; + execute_in_process_context(sg_remove_sfp_usercontext, &sfp->ew); } static int @@ -2199,19 +2158,38 @@ sg_last_dev(void) } #endif -static Sg_device * -sg_get_dev(int dev) +/* must be called with sg_index_lock held */ +static Sg_device *sg_lookup_dev(int dev) { - Sg_device *sdp; - unsigned long iflags; + return idr_find(&sg_index_idr, dev); +} - read_lock_irqsave(&sg_index_lock, iflags); - sdp = idr_find(&sg_index_idr, dev); - read_unlock_irqrestore(&sg_index_lock, iflags); +static Sg_device *sg_get_dev(int dev) +{ + struct sg_device *sdp; + unsigned long flags; + + read_lock_irqsave(&sg_index_lock, flags); + sdp = sg_lookup_dev(dev); + if (!sdp) + sdp = ERR_PTR(-ENXIO); + else if (sdp->detached) { + /* If sdp->detached, then the refcount may already be 0, in + * which case it would be a bug to do kref_get(). + */ + sdp = ERR_PTR(-ENODEV); + } else + kref_get(&sdp->d_ref); + read_unlock_irqrestore(&sg_index_lock, flags); return sdp; } +static void sg_put_dev(struct sg_device *sdp) +{ + kref_put(&sdp->d_ref, sg_device_destroy); +} + #ifdef CONFIG_SCSI_PROC_FS static struct proc_dir_entry *sg_proc_sgp = NULL; @@ -2468,8 +2446,10 @@ static int sg_proc_seq_show_dev(struct s struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n", scsidp->host->host_no, scsidp->channel, @@ -2480,6 +2460,7 @@ static int sg_proc_seq_show_dev(struct s (int) scsi_device_online(scsidp)); else seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } @@ -2493,16 +2474,20 @@ static int sg_proc_seq_show_devstrs(stru struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; + unsigned long iflags; - sdp = it ? sg_get_dev(it->index) : NULL; + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; if (sdp && (scsidp = sdp->device) && (!sdp->detached)) seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", scsidp->vendor, scsidp->model, scsidp->rev); else seq_printf(s, "<no active device>\n"); + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } +/* must be called while holding sg_index_lock */ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp) { int k, m, new_interface, blen, usg; @@ -2512,7 +2497,8 @@ static void sg_proc_debug_helper(struct const char * cp; unsigned int ms; - for (k = 0; (fp = sg_get_nth_sfp(sdp, k)); ++k) { + for (k = 0, fp = sdp->headfp; fp != NULL; ++k, fp = fp->nextfp) { + read_lock(&fp->rq_list_lock); /* irqs already disabled */ seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k + 1, jiffies_to_msecs(fp->timeout), @@ -2522,7 +2508,9 @@ static void sg_proc_debug_helper(struct seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=%d\n", (int) fp->cmd_q, (int) fp->force_packid, (int) fp->keep_orphan, (int) fp->closed); - for (m = 0; (srp = sg_get_nth_request(fp, m)); ++m) { + for (m = 0, srp = fp->headrp; + srp != NULL; + ++m, srp = srp->nextrp) { hp = &srp->header; new_interface = (hp->interface_id == '\0') ? 0 : 1; if (srp->res_used) { @@ -2559,6 +2547,7 @@ static void sg_proc_debug_helper(struct } if (0 == m) seq_printf(s, " No requests active\n"); + read_unlock(&fp->rq_list_lock); } } @@ -2571,39 +2560,34 @@ static int sg_proc_seq_show_debug(struct { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; + unsigned long iflags; if (it && (0 == it->index)) { seq_printf(s, "max_active_device=%d(origin 1)\n", (int)it->max); seq_printf(s, " def_reserved_size=%d\n", sg_big_buff); } - sdp = it ? sg_get_dev(it->index) : NULL; - if (sdp) { - struct scsi_device *scsidp = sdp->device; - if (NULL == scsidp) { - seq_printf(s, "device %d detached ??\n", - (int)it->index); - return 0; - } + read_lock_irqsave(&sg_index_lock, iflags); + sdp = it ? sg_lookup_dev(it->index) : NULL; + if (sdp && sdp->headfp) { + struct scsi_device *scsidp = sdp->device; - if (sg_get_nth_sfp(sdp, 0)) { - seq_printf(s, " >>> device=%s ", - sdp->disk->disk_name); - if (sdp->detached) - seq_printf(s, "detached pending close "); - else - seq_printf - (s, "scsi%d chan=%d id=%d lun=%d em=%d", - scsidp->host->host_no, - scsidp->channel, scsidp->id, - scsidp->lun, - scsidp->host->hostt->emulated); - seq_printf(s, " sg_tablesize=%d excl=%d\n", - sdp->sg_tablesize, sdp->exclude); - } + seq_printf(s, " >>> device=%s ", sdp->disk->disk_name); + if (sdp->detached) + seq_printf(s, "detached pending close "); + else + seq_printf + (s, "scsi%d chan=%d id=%d lun=%d em=%d", + scsidp->host->host_no, + scsidp->channel, scsidp->id, + scsidp->lun, + scsidp->host->hostt->emulated); + seq_printf(s, " sg_tablesize=%d excl=%d\n", + sdp->sg_tablesize, sdp->exclude); sg_proc_debug_helper(s, sdp); } + read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v6) 2009-01-21 19:45 ` [PATCH 1/2] sg: fix races during device removal (v6) Tony Battersby @ 2009-01-25 12:46 ` FUJITA Tomonori 2009-01-26 13:57 ` Douglas Gilbert 1 sibling, 0 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-25 12:46 UTC (permalink / raw) To: tonyb; +Cc: James.Bottomley, fujita.tomonori, dgilbert, hch, linux-scsi On Wed, 21 Jan 2009 14:45:50 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > sg has the following problems related to device removal: > > * opening a sg fd races with removing a device > * closing a sg fd races with removing a device > * /proc/scsi/sg/* access races with removing a device > * command completion races with removing a device > * command completion races with closing a sg fd > * can rmmod sg with active commands > > These problems can cause kernel oopses, memory-use-after-free, or > double-free errors. This patch fixes these problems by using krefs > to manage the lifetime of sg_device and sg_fd. > > Each command submitted to the midlevel holds a reference to sg_fd > until the completion callback. This ensures that sg_fd doesn't go > away if the fd is closed with commands still outstanding. > > sg_fd gets the reference of sg_device (with scsi_device) and also > makes sure that the sg module doesn't go away. > > /proc/scsi/sg/* functions don't play nicely with krefs because they > give information about sg_fds which have been closed but not yet > freed due to still having outstanding commands and sg_devices which > have been removed but not yet freed due to still being referenced > by one or more sg_fds. To deal with this safely without removing > functionality, /proc functions now access sg_device and sg_fd while > holding a lock instead of using kref_get()/kref_put(). > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > --- > > This version changes BUG_ON() to WARN_ON()/return as suggested by > Stefan Richter. > > The second patch "[PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2)" > is still the same as before, so I am not resending it. > > sg.c | 418 ++++++++++++++++++++++++++++++++----------------------------------- > 1 file changed, 201 insertions(+), 217 deletions(-) Looks good to me, Thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v6) 2009-01-21 19:45 ` [PATCH 1/2] sg: fix races during device removal (v6) Tony Battersby 2009-01-25 12:46 ` FUJITA Tomonori @ 2009-01-26 13:57 ` Douglas Gilbert 2009-01-28 1:51 ` FUJITA Tomonori 1 sibling, 1 reply; 41+ messages in thread From: Douglas Gilbert @ 2009-01-26 13:57 UTC (permalink / raw) To: Tony Battersby; +Cc: James.Bottomley, FUJITA Tomonori, hch, linux-scsi Tony Battersby wrote: > sg has the following problems related to device removal: > > * opening a sg fd races with removing a device > * closing a sg fd races with removing a device > * /proc/scsi/sg/* access races with removing a device > * command completion races with removing a device > * command completion races with closing a sg fd > * can rmmod sg with active commands > > These problems can cause kernel oopses, memory-use-after-free, or > double-free errors. This patch fixes these problems by using krefs > to manage the lifetime of sg_device and sg_fd. > > Each command submitted to the midlevel holds a reference to sg_fd > until the completion callback. This ensures that sg_fd doesn't go > away if the fd is closed with commands still outstanding. > > sg_fd gets the reference of sg_device (with scsi_device) and also > makes sure that the sg module doesn't go away. > > /proc/scsi/sg/* functions don't play nicely with krefs because they > give information about sg_fds which have been closed but not yet > freed due to still having outstanding commands and sg_devices which > have been removed but not yet freed due to still being referenced > by one or more sg_fds. To deal with this safely without removing > functionality, /proc functions now access sg_device and sg_fd while > holding a lock instead of using kref_get()/kref_put(). > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > --- > > This version changes BUG_ON() to WARN_ON()/return as suggested by > Stefan Richter. > > The second patch "[PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2)" > is still the same as before, so I am not resending it. > > sg.c | 418 ++++++++++++++++++++++++++++++++----------------------------------- > 1 file changed, 201 insertions(+), 217 deletions(-) > > --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-21 14:34:05.000000000 -0500 > +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-21 14:36:00.000000000 -0500 Tony, We seem to have consensus on this version (v6 20090121). Thanks for you work. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v6) 2009-01-26 13:57 ` Douglas Gilbert @ 2009-01-28 1:51 ` FUJITA Tomonori 2009-01-28 15:06 ` James Bottomley 0 siblings, 1 reply; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-28 1:51 UTC (permalink / raw) To: dgilbert; +Cc: tonyb, James.Bottomley, fujita.tomonori, hch, linux-scsi On Mon, 26 Jan 2009 08:57:20 -0500 Douglas Gilbert <dgilbert@interlog.com> wrote: > Tony Battersby wrote: > > sg has the following problems related to device removal: > > > > * opening a sg fd races with removing a device > > * closing a sg fd races with removing a device > > * /proc/scsi/sg/* access races with removing a device > > * command completion races with removing a device > > * command completion races with closing a sg fd > > * can rmmod sg with active commands > > > > These problems can cause kernel oopses, memory-use-after-free, or > > double-free errors. This patch fixes these problems by using krefs > > to manage the lifetime of sg_device and sg_fd. > > > > Each command submitted to the midlevel holds a reference to sg_fd > > until the completion callback. This ensures that sg_fd doesn't go > > away if the fd is closed with commands still outstanding. > > > > sg_fd gets the reference of sg_device (with scsi_device) and also > > makes sure that the sg module doesn't go away. > > > > /proc/scsi/sg/* functions don't play nicely with krefs because they > > give information about sg_fds which have been closed but not yet > > freed due to still having outstanding commands and sg_devices which > > have been removed but not yet freed due to still being referenced > > by one or more sg_fds. To deal with this safely without removing > > functionality, /proc functions now access sg_device and sg_fd while > > holding a lock instead of using kref_get()/kref_put(). > > > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > > --- > > > > This version changes BUG_ON() to WARN_ON()/return as suggested by > > Stefan Richter. > > > > The second patch "[PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2)" > > is still the same as before, so I am not resending it. > > > > sg.c | 418 ++++++++++++++++++++++++++++++++----------------------------------- > > 1 file changed, 201 insertions(+), 217 deletions(-) > > > > --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-21 14:34:05.000000000 -0500 > > +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-21 14:36:00.000000000 -0500 > > Tony, > We seem to have consensus on this version (v6 20090121). > > Thanks for you work. > > Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Can we also get your ACK on: [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) http://marc.info/?l=linux-scsi&m=123248892909435&w=2 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/2] sg: fix races during device removal (v6) 2009-01-28 1:51 ` FUJITA Tomonori @ 2009-01-28 15:06 ` James Bottomley 0 siblings, 0 replies; 41+ messages in thread From: James Bottomley @ 2009-01-28 15:06 UTC (permalink / raw) To: FUJITA Tomonori; +Cc: dgilbert, tonyb, hch, linux-scsi On Wed, 2009-01-28 at 10:51 +0900, FUJITA Tomonori wrote: > On Mon, 26 Jan 2009 08:57:20 -0500 > Douglas Gilbert <dgilbert@interlog.com> wrote: > > > Tony Battersby wrote: > > > sg has the following problems related to device removal: > > > > > > * opening a sg fd races with removing a device > > > * closing a sg fd races with removing a device > > > * /proc/scsi/sg/* access races with removing a device > > > * command completion races with removing a device > > > * command completion races with closing a sg fd > > > * can rmmod sg with active commands > > > > > > These problems can cause kernel oopses, memory-use-after-free, or > > > double-free errors. This patch fixes these problems by using krefs > > > to manage the lifetime of sg_device and sg_fd. > > > > > > Each command submitted to the midlevel holds a reference to sg_fd > > > until the completion callback. This ensures that sg_fd doesn't go > > > away if the fd is closed with commands still outstanding. > > > > > > sg_fd gets the reference of sg_device (with scsi_device) and also > > > makes sure that the sg module doesn't go away. > > > > > > /proc/scsi/sg/* functions don't play nicely with krefs because they > > > give information about sg_fds which have been closed but not yet > > > freed due to still having outstanding commands and sg_devices which > > > have been removed but not yet freed due to still being referenced > > > by one or more sg_fds. To deal with this safely without removing > > > functionality, /proc functions now access sg_device and sg_fd while > > > holding a lock instead of using kref_get()/kref_put(). > > > > > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > > > --- > > > > > > This version changes BUG_ON() to WARN_ON()/return as suggested by > > > Stefan Richter. > > > > > > The second patch "[PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2)" > > > is still the same as before, so I am not resending it. > > > > > > sg.c | 418 ++++++++++++++++++++++++++++++++----------------------------------- > > > 1 file changed, 201 insertions(+), 217 deletions(-) > > > > > > --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-21 14:34:05.000000000 -0500 > > > +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-21 14:36:00.000000000 -0500 > > > > Tony, > > We seem to have consensus on this version (v6 20090121). > > > > Thanks for you work. > > > > Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> > > Can we also get your ACK on: > > [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) > > http://marc.info/?l=linux-scsi&m=123248892909435&w=2 Actually, using maintainer's prerogative, I was taking this as ack to both. James ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) 2009-01-20 1:06 ` FUJITA Tomonori 2009-01-20 21:58 ` [PATCH 1/2] sg: fix races during device removal (v5) Tony Battersby @ 2009-01-20 22:00 ` Tony Battersby 2009-01-25 12:46 ` FUJITA Tomonori 1 sibling, 1 reply; 41+ messages in thread From: Tony Battersby @ 2009-01-20 22:00 UTC (permalink / raw) To: James.Bottomley; +Cc: FUJITA Tomonori, dgilbert, hch, linux-scsi sg_io_owned needs to be set before the command is sent to the midlevel; otherwise, a quickly-completing command may cause a different CPU to see "srp->done == 1 && !srp->sg_io_owned", which would lead to incorrect behavior. Check srp->done and set srp->orphan while holding rq_list_lock to prevent races with sg_rq_end_io(). There is no need to check sfp->closed from read/write/ioctl/poll/etc. since the kernel guarantees that this won't happen. The usefulness of sg_srp_done() was questionable before; now it is definitely not needed. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- Compared to the previous version of this patch, this version removes sg_srp_done() and the unnecessary check for sfp->closed. sg.c | 39 ++++++++++++++------------------------- 1 file changed, 14 insertions(+), 25 deletions(-) --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-20 16:32:28.000000000 -0500 +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-20 16:34:37.000000000 -0500 @@ -189,7 +189,7 @@ static ssize_t sg_new_read(Sg_fd * sfp, Sg_request * srp); static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, - int read_only, Sg_request **o_srp); + int read_only, int sg_io_owned, Sg_request **o_srp); static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking); static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer); @@ -561,7 +561,8 @@ sg_write(struct file *filp, const char _ return -EFAULT; blocking = !(filp->f_flags & O_NONBLOCK); if (old_hdr.reply_len < 0) - return sg_new_write(sfp, filp, buf, count, blocking, 0, NULL); + return sg_new_write(sfp, filp, buf, count, + blocking, 0, 0, NULL); if (count < (SZ_SG_HEADER + 6)) return -EIO; /* The minimum scsi command length is 6 bytes. */ @@ -642,7 +643,7 @@ sg_write(struct file *filp, const char _ static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, - size_t count, int blocking, int read_only, + size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp) { int k; @@ -662,6 +663,7 @@ sg_new_write(Sg_fd *sfp, struct file *fi SCSI_LOG_TIMEOUT(1, printk("sg_new_write: queue full\n")); return -EDOM; } + srp->sg_io_owned = sg_io_owned; hp = &srp->header; if (__copy_from_user(hp, buf, SZ_SG_IO_HDR)) { sg_remove_request(sfp, srp); @@ -766,18 +768,6 @@ sg_common_write(Sg_fd * sfp, Sg_request } static int -sg_srp_done(Sg_request *srp, Sg_fd *sfp) -{ - unsigned long iflags; - int done; - - read_lock_irqsave(&sfp->rq_list_lock, iflags); - done = srp->done; - read_unlock_irqrestore(&sfp->rq_list_lock, iflags); - return done; -} - -static int sg_ioctl(struct inode *inode, struct file *filp, unsigned int cmd_in, unsigned long arg) { @@ -809,27 +799,26 @@ sg_ioctl(struct inode *inode, struct fil return -EFAULT; result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR, - blocking, read_only, &srp); + blocking, read_only, 1, &srp); if (result < 0) return result; - srp->sg_io_owned = 1; while (1) { result = 0; /* following macro to beat race condition */ __wait_event_interruptible(sfp->read_wait, - (sdp->detached || sfp->closed || sg_srp_done(srp, sfp)), - result); + (srp->done || sdp->detached), + result); if (sdp->detached) return -ENODEV; - if (sfp->closed) - return 0; /* request packet dropped already */ - if (0 == result) + write_lock_irq(&sfp->rq_list_lock); + if (srp->done) { + srp->done = 2; + write_unlock_irq(&sfp->rq_list_lock); break; + } srp->orphan = 1; + write_unlock_irq(&sfp->rq_list_lock); return result; /* -ERESTARTSYS because signal hit process */ } - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 2; - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); result = sg_new_read(sfp, p, SZ_SG_IO_HDR, srp); return (result < 0) ? result : 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) 2009-01-20 22:00 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) Tony Battersby @ 2009-01-25 12:46 ` FUJITA Tomonori 0 siblings, 0 replies; 41+ messages in thread From: FUJITA Tomonori @ 2009-01-25 12:46 UTC (permalink / raw) To: tonyb; +Cc: James.Bottomley, fujita.tomonori, dgilbert, hch, linux-scsi On Tue, 20 Jan 2009 17:00:09 -0500 Tony Battersby <tonyb@cybernetics.com> wrote: > sg_io_owned needs to be set before the command is sent to the midlevel; > otherwise, a quickly-completing command may cause a different CPU > to see "srp->done == 1 && !srp->sg_io_owned", which would lead to > incorrect behavior. > > Check srp->done and set srp->orphan while holding rq_list_lock to > prevent races with sg_rq_end_io(). > > There is no need to check sfp->closed from read/write/ioctl/poll/etc. > since the kernel guarantees that this won't happen. > > The usefulness of sg_srp_done() was questionable before; now it is > definitely not needed. > > Signed-off-by: Tony Battersby <tonyb@cybernetics.com> > --- > > Compared to the previous version of this patch, this version removes > sg_srp_done() and the unnecessary check for sfp->closed. > > sg.c | 39 ++++++++++++++------------------------- It looks ok to me. Better (simpler) than the first version. I guess that we could simplify 'done' and 'orphan' stuff a bit more but we can do later on. ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 2/2] sg: fix races with ioctl(SG_IO) 2009-01-19 6:57 ` FUJITA Tomonori 2009-01-19 15:02 ` Tony Battersby 2009-01-19 23:03 ` [PATCH 1/2] sg: fix races during device removal (v4) Tony Battersby @ 2009-01-19 23:06 ` Tony Battersby 2 siblings, 0 replies; 41+ messages in thread From: Tony Battersby @ 2009-01-19 23:06 UTC (permalink / raw) To: James.Bottomley; +Cc: FUJITA Tomonori, dgilbert, hch, linux-scsi sg_io_owned needs to be set before the command is sent to the midlevel; otherwise, a quickly-completing command may cause a different CPU to see "srp->done == 1 && !srp->sg_io_owned", which would lead to incorrect behavior. Check srp->done and set srp->orphan while holding rq_list_lock to prevent races with sg_rq_end_io(). Signed-off-by: Tony Battersby <tonyb@cybernetics.com> --- Same as before, just rediffed against v4 of the previous patch. sg.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) --- linux-2.6.29-rc2/drivers/scsi/sg.c.orig 2009-01-19 16:26:21.000000000 -0500 +++ linux-2.6.29-rc2/drivers/scsi/sg.c 2009-01-19 17:53:28.000000000 -0500 @@ -208,7 +208,7 @@ static ssize_t sg_new_read(Sg_fd * sfp, Sg_request * srp); static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, - int read_only, Sg_request **o_srp); + int read_only, int sg_io_owned, Sg_request **o_srp); static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking); static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer); @@ -590,7 +590,8 @@ sg_write(struct file *filp, const char _ return -EFAULT; blocking = !(filp->f_flags & O_NONBLOCK); if (old_hdr.reply_len < 0) - return sg_new_write(sfp, filp, buf, count, blocking, 0, NULL); + return sg_new_write(sfp, filp, buf, count, + blocking, 0, 0, NULL); if (count < (SZ_SG_HEADER + 6)) return -EIO; /* The minimum scsi command length is 6 bytes. */ @@ -671,7 +672,7 @@ sg_write(struct file *filp, const char _ static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, - size_t count, int blocking, int read_only, + size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp) { int k; @@ -691,6 +692,7 @@ sg_new_write(Sg_fd *sfp, struct file *fi SCSI_LOG_TIMEOUT(1, printk("sg_new_write: queue full\n")); return -EDOM; } + srp->sg_io_owned = sg_io_owned; hp = &srp->header; if (__copy_from_user(hp, buf, SZ_SG_IO_HDR)) { sg_remove_request(sfp, srp); @@ -838,10 +840,9 @@ sg_ioctl(struct inode *inode, struct fil return -EFAULT; result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR, - blocking, read_only, &srp); + blocking, read_only, 1, &srp); if (result < 0) return result; - srp->sg_io_owned = 1; while (1) { result = 0; /* following macro to beat race condition */ __wait_event_interruptible(sfp->read_wait, @@ -851,14 +852,16 @@ sg_ioctl(struct inode *inode, struct fil return -ENODEV; if (sfp->closed) return 0; /* request packet dropped already */ - if (0 == result) + write_lock_irq(&sfp->rq_list_lock); + if (srp->done) { + srp->done = 2; + write_unlock_irq(&sfp->rq_list_lock); break; + } srp->orphan = 1; + write_unlock_irq(&sfp->rq_list_lock); return result; /* -ERESTARTSYS because signal hit process */ } - write_lock_irqsave(&sfp->rq_list_lock, iflags); - srp->done = 2; - write_unlock_irqrestore(&sfp->rq_list_lock, iflags); result = sg_new_read(sfp, p, SZ_SG_IO_HDR, srp); return (result < 0) ? result : 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2009-01-28 15:06 UTC | newest] Thread overview: 41+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-01-05 19:07 [PATCH 0/2] sg: fix races during device removal (v2) Tony Battersby 2009-01-08 23:21 ` Douglas Gilbert 2009-01-10 17:26 ` FUJITA Tomonori 2009-01-12 21:09 ` Tony Battersby 2009-01-13 16:24 ` FUJITA Tomonori 2009-01-14 20:31 ` Tony Battersby 2009-01-14 21:39 ` Greg KH 2009-01-14 21:59 ` Tony Battersby 2009-01-14 22:33 ` Stefan Richter 2009-01-14 22:53 ` Tony Battersby 2009-01-14 23:47 ` Stefan Richter 2009-01-15 14:47 ` Tony Battersby 2009-01-15 16:22 ` Stefan Richter 2009-01-15 16:44 ` Stefan Richter 2009-01-15 18:17 ` Tony Battersby 2009-01-15 18:47 ` Stefan Richter 2009-01-15 19:14 ` Stefan Richter 2009-01-15 19:20 ` Tony Battersby 2009-01-15 20:43 ` Stefan Richter 2009-01-15 21:43 ` Tony Battersby 2009-01-15 21:58 ` Stefan Richter 2009-01-15 22:23 ` Tony Battersby 2009-01-15 23:24 ` Stefan Richter 2009-01-16 14:16 ` Tony Battersby 2009-01-16 0:53 ` Stefan Richter 2009-01-16 8:09 ` Stefan Richter 2009-01-19 6:57 ` FUJITA Tomonori 2009-01-19 15:02 ` Tony Battersby 2009-01-19 23:03 ` [PATCH 1/2] sg: fix races during device removal (v4) Tony Battersby 2009-01-20 1:06 ` FUJITA Tomonori 2009-01-20 21:58 ` [PATCH 1/2] sg: fix races during device removal (v5) Tony Battersby 2009-01-21 18:25 ` Stefan Richter 2009-01-21 19:23 ` Tony Battersby 2009-01-21 19:45 ` [PATCH 1/2] sg: fix races during device removal (v6) Tony Battersby 2009-01-25 12:46 ` FUJITA Tomonori 2009-01-26 13:57 ` Douglas Gilbert 2009-01-28 1:51 ` FUJITA Tomonori 2009-01-28 15:06 ` James Bottomley 2009-01-20 22:00 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) (v2) Tony Battersby 2009-01-25 12:46 ` FUJITA Tomonori 2009-01-19 23:06 ` [PATCH 2/2] sg: fix races with ioctl(SG_IO) Tony Battersby
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox