From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josh Durgin Subject: Re: [PATCH v3 3/5] rbd: fix use-after free of rbd_dev->disk Date: Mon, 09 Sep 2013 09:15:01 -0700 Message-ID: <522DF405.6080602@inktank.com> References: <1378711022-20158-1-git-send-email-josh.durgin@inktank.com> <1378711022-20158-4-git-send-email-josh.durgin@inktank.com> <522DCF08.8050708@linaro.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mail-ob0-f170.google.com ([209.85.214.170]:56289 "EHLO mail-ob0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750990Ab3IIQPC (ORCPT ); Mon, 9 Sep 2013 12:15:02 -0400 Received: by mail-ob0-f170.google.com with SMTP id eh20so6119464obb.1 for ; Mon, 09 Sep 2013 09:15:01 -0700 (PDT) In-Reply-To: <522DCF08.8050708@linaro.org> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Alex Elder Cc: ceph-devel@vger.kernel.org On 09/09/2013 06:37 AM, Alex Elder wrote: > On 09/09/2013 02:17 AM, Josh Durgin wrote: >> Removing a device deallocates the disk, unschedules the watch, and >> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called >> from the watch callback, updates the disk size and rbd_dev >> structure. With no locking between them, rbd_dev_refresh() may use the >> device or rbd_dev after they've been freed. >> >> To fix this, check whether RBD_DEV_FLAG_REMOVING is set before >> updating the disk size in rbd_dev_refresh(). In order to prevent a >> race where rbd_dev_refresh() is already revalidating the disk when >> rbd_remove() is called, move the call to rbd_bus_del_dev() after the >> watch is unregistered and all notifies are complete. It's safe to >> defer deleting this structure because no new requests can be submitted >> once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be >> opened. >> >> Fixes: http://tracker.ceph.com/issues/5636 >> Signed-off-by: Josh Durgin > > I'm really sorry but I think I still see a problem. This stuff > is very tricky... > >> --- >> drivers/block/rbd.c | 34 +++++++++++++++++++++++++++------- >> 1 files changed, 27 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >> index 9e5f07f..fe5767a 100644 >> --- a/drivers/block/rbd.c >> +++ b/drivers/block/rbd.c >> @@ -3325,6 +3325,31 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev) >> clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags); >> } >> >> +static void rbd_dev_update_size(struct rbd_device *rbd_dev) >> +{ >> + sector_t size; >> + bool removing; >> + >> + /* >> + * Don't hold the lock while doing disk operations, >> + * or lock ordering will conflict with the bdev mutex via: >> + * rbd_add() -> blkdev_get() -> rbd_open() >> + */ >> + spin_lock_irq(&rbd_dev->lock); >> + removing = test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags); >> + spin_unlock_irq(&rbd_dev->lock); > > (This is not the problem I was referring to, but thinking about > it led me to the real issue. This is just some explanation > behind the logic, FYI.) > > It is safe to do so, of course, but it's not necessary to hold > the lock for just testing the bit. test_bit() (and clear_bit(), > etc.) are all atomic operations. The only reason the lock > is held in rbd_open() is so that testing the bit value and > updating the open count are done together, atomically. (The > same basic reasoning applies in rbd_remove()). > > It may be necessary, however, to do a (read or full) memory > barrier before calling test_bit() if you're not using the lock. > (My history with memory barriers in this code is a bit checkered > though. Looking at it again now I think I still got it wrong.) > >> + /* >> + * If the device is being removed, rbd_dev->disk has >> + * been destroyed, so don't try to update its size >> + */ >> + if (!removing) { > > Here's the problem. It is the same one faced by the open path. > > You determined above whether or not the device was getting removed. > But you haven't left any state that indicates that the image is > getting refreshed (or more specifically, getting its size updated). > > So, if the device is mapped but isn't actually opened by anybody > there's nothing stopping the code in rbd_remove() from going > ahead anyway. So the possibility of the structures getting freed > remains (though the window is a little smaller now). I think this is safe with the use of ceph_osdc_flush_notifies() before rbd_bus_del_dev(). Since the watch is already unregistered, flushing the notifies waits for all calls to rbd_watch_cb() to complete. This means that rbd_remove() can't actually free any of the rbd_dev structures until after the last revalidate_disk() etc. is done. Does this make sense, or am I missing something? This should probably go in a comment in rbd_remove(), since it's not obvious from the code there why the ordering of the last few calls makes sense. Josh > I think you can resolve this problem by using the open count. > That is, use the same interlock between the open count and the > removing flag that's used for rbd_open() and rbd_remove(). > The open count in some sense represents someone still needing > the data structures for the image. So treat the refresh activity > as one of those "somebodies" by claiming one of those opens > while the size update is going on. > > If you encapsulated some code now in place for this purpose, > something like the following might be helpful. (I've renamed > "open_count" to be "inuse_count" here.) > > static inline int rbd_request_access(struct rbd_device *rbd_dev) > { > int ret = 0; > > spin_lock_irq(&rbd_dev->lock); > if (!test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags)) > rbd_dev->inuse_count++; > else > ret = -ENOENT; > spin_unlock_irq(&rbd_dev->lock); > > return ret; > } > > static inline int rbd_release_access(struct rbd_device *rbd_dev) > { > int ret = 0; > > spin_lock_irq(&rbd_dev->lock); > if (rbd_dev->inuse_count) > rbd_dev->inuse_count--; > else > ret = -EINVAL; > spin_unlock_irq(&rbd_dev->lock); > > return ret; > } > > static inline int rbd_prevent_access(struct rbd_device *rbd_dev) > { > int ret = 0; > > spin_lock_irq(&rbd_dev->lock); > if (rbd_dev->inuse_count) > ret = -EBUSY; > else if (test_and_set_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags)) > ret = -EINPROGRESS; > spin_unlock_irq(&rbd_dev->lock); > > return ret; > } > > I'll assume you know what I'm talking about at this point and > won't go into exactly where and how you'd use these. > >> + size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE; >> + dout("setting size to %llu sectors", (unsigned long long)size); >> + set_capacity(rbd_dev->disk, size); >> + revalidate_disk(rbd_dev->disk); >> + } >> +}+- >> + >> static int rbd_dev_refresh(struct rbd_device *rbd_dev) >> { >> u64 mapping_size; > > Everything below is good. > >> @@ -3344,12 +3369,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) >> up_write(&rbd_dev->header_rwsem); >> >> if (mapping_size != rbd_dev->mapping.size) { >> - sector_t size; >> - >> - size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE; >> - dout("setting size to %llu sectors", (unsigned long long)size); >> - set_capacity(rbd_dev->disk, size); >> - revalidate_disk(rbd_dev->disk); >> + rbd_dev_update_size(rbd_dev); >> } >> >> return ret; >> @@ -5160,7 +5180,6 @@ static ssize_t rbd_remove(struct bus_type *bus, >> if (ret < 0 || already) >> return ret; >> >> - rbd_bus_del_dev(rbd_dev); >> ret = rbd_dev_header_watch_sync(rbd_dev, false); >> if (ret) >> rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret); >> @@ -5171,6 +5190,7 @@ static ssize_t rbd_remove(struct bus_type *bus, >> */ >> dout("%s: flushing notifies", __func__); >> ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc); >> + rbd_bus_del_dev(rbd_dev); >> rbd_dev_image_release(rbd_dev); >> module_put(THIS_MODULE); >> >> >