From: Josh Durgin <josh.durgin@inktank.com>
To: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
Date: Thu, 29 Aug 2013 11:36:06 -0700 [thread overview]
Message-ID: <521F9496.7040103@inktank.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1308290743380.10144@cobra.newdream.net>
On 08/29/2013 07:46 AM, Sage Weil wrote:
> On Wed, 28 Aug 2013, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>>
>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>> ---
>> drivers/block/rbd.c | 25 ++++++++++++++++++++++---
>> 1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>> static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>> {
>> u64 mapping_size;
>> - int ret;
>> + int ret = 0;
>>
>> rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>> down_write(&rbd_dev->header_rwsem);
>> + if (!rbd_dev->watch_event)
>> + goto out;
>> +
>> mapping_size = rbd_dev->mapping.size;
>> if (rbd_dev->image_format == 1)
>> ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>> /* If it's a mapped snapshot, validate its EXISTS flag */
>>
>> rbd_exists_validate(rbd_dev);
>> - up_write(&rbd_dev->header_rwsem);
>>
>> if (mapping_size != rbd_dev->mapping.size) {
>> sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>> set_capacity(rbd_dev->disk, size);
>> revalidate_disk(rbd_dev->disk);
>> }
>> -
>> +out:
>> + up_write(&rbd_dev->header_rwsem);
>> return ret;
>> }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>> if (ret < 0 || already)
>> return ret;
>>
>> + /*
>> + * take header semaphore while destroying the device and
>> + * canceling the watch so that device destruction will
>> + * not race with device updates from the watch callback
>> + */
>> + down_write(&rbd_dev->header_rwsem);
>> rbd_bus_del_dev(rbd_dev);
>> ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> + up_write(&rbd_dev->header_rwsem);
>
> If I'm reading thsi right, we're holding the rwsem exclusively, and
> blocking the workqueue, for the duration of the osd request to remove the
> watch. It worries me to block up the workqueue for that long (possibly
> for other images). I don't think it will deadlock, but if hte cluster is
> unhealthy this one request could take a while.
>
> What we if we just add a separate flag "shutting down" and use that
> instead of the rbd_dev->watch_event? Then we can up_write prior to the
> watch removal call...
Due to the lock inversion Alex mentioned, I'll use a separate mutex and
flag. Since we're draining the notify queue after this, only
rbd_bus_del_dev() needs to be protected from running at the same time -
canceling the watch doesn't matter for correctness.
Josh
> sage
>
>
>> +
>> if (ret)
>> rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> + /*
>> + * flush remaing watch callbacks - these don't update anything
>> + * anymore since rbd_dev->watch_event is NULL, but it avoids
>> + * the watch callback using a freed rbd_dev
>> + */
>> + dout("%s: flushing notifies", __func__);
>> + ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>> rbd_dev_image_release(rbd_dev);
>> module_put(THIS_MODULE);
>>
>> --
>> 1.7.2.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
next prev parent reply other threads:[~2013-08-29 18:38 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-08-29 6:24 [PATCH 0/3] shutdown race and debug fix Josh Durgin
2013-08-29 6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
2013-08-29 14:26 ` Alex Elder
2013-08-29 14:46 ` Sage Weil
2013-08-29 6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
2013-08-29 14:46 ` Sage Weil
2013-08-29 15:21 ` Alex Elder
2013-08-29 6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
2013-08-29 14:46 ` Sage Weil
2013-08-29 18:36 ` Josh Durgin [this message]
2013-08-29 15:21 ` Alex Elder
2013-08-29 18:33 ` Josh Durgin
2013-08-30 0:57 ` [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk Josh Durgin
2013-09-03 12:41 ` Alex Elder
2013-09-09 7:30 ` Josh Durgin
2013-08-30 0:57 ` [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev Josh Durgin
2013-09-03 12:45 ` Alex Elder
2013-08-30 0:57 ` [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous Josh Durgin
2013-09-03 13:05 ` Alex Elder
2013-09-03 13:07 ` Alex Elder
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=521F9496.7040103@inktank.com \
--to=josh.durgin@inktank.com \
--cc=ceph-devel@vger.kernel.org \
--cc=sage@inktank.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.