From mboxrd@z Thu Jan  1 00:00:00 1970
From: Josh Durgin <josh.durgin@inktank.com>
Subject: Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
Date: Thu, 29 Aug 2013 11:36:06 -0700
Message-ID: <521F9496.7040103@inktank.com>
References: <1377757447-23515-1-git-send-email-josh.durgin@inktank.com> <1377757447-23515-4-git-send-email-josh.durgin@inktank.com> <alpine.DEB.2.00.1308290743380.10144@cobra.newdream.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from mail-pb0-f44.google.com ([209.85.160.44]:45300 "EHLO
	mail-pb0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756672Ab3H2SiO (ORCPT
	<rfc822;ceph-devel@vger.kernel.org>); Thu, 29 Aug 2013 14:38:14 -0400
Received: by mail-pb0-f44.google.com with SMTP id xa7so818858pbc.31
        for <ceph-devel@vger.kernel.org>; Thu, 29 Aug 2013 11:38:13 -0700 (PDT)
In-Reply-To: <alpine.DEB.2.00.1308290743380.10144@cobra.newdream.net>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org

On 08/29/2013 07:46 AM, Sage Weil wrote:
> On Wed, 28 Aug 2013, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>>
>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>> ---
>>   drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>>   1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> -	int ret;
>> +	int ret = 0;
>>
>>   	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>>   	down_write(&rbd_dev->header_rwsem);
>> +	if (!rbd_dev->watch_event)
>> +		goto out;
>> +
>>   	mapping_size = rbd_dev->mapping.size;
>>   	if (rbd_dev->image_format == 1)
>>   		ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	/* If it's a mapped snapshot, validate its EXISTS flag */
>>
>>   	rbd_exists_validate(rbd_dev);
>> -	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>>   		sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   		set_capacity(rbd_dev->disk, size);
>>   		revalidate_disk(rbd_dev->disk);
>>   	}
>> -
>> +out:
>> +	up_write(&rbd_dev->header_rwsem);
>>   	return ret;
>>   }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * take header semaphore while destroying the device and
>> +	 * canceling the watch so that device destruction will
>> +	 * not race with device updates from the watch callback
>> +	 */
>> +	down_write(&rbd_dev->header_rwsem);
>>   	rbd_bus_del_dev(rbd_dev);
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> +	up_write(&rbd_dev->header_rwsem);
>
> If I'm reading thsi right, we're holding the rwsem exclusively, and
> blocking the workqueue, for the duration of the osd request to remove the
> watch.  It worries me to block up the workqueue for that long (possibly
> for other images).  I don't think it will deadlock, but if hte cluster is
> unhealthy this one request could take a while.
>
> What we if we just add a separate flag "shutting down" and use that
> instead of the rbd_dev->watch_event?  Then we can up_write prior to the
> watch removal call...

Due to the lock inversion Alex mentioned, I'll use a separate mutex and
flag. Since we're draining the notify queue after this, only
rbd_bus_del_dev() needs to be protected from running at the same time -
canceling the watch doesn't matter for correctness.

Josh

> sage
>
>
>> +
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> +	/*
>> +	 * flush remaing watch callbacks - these don't update anything
>> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
>> +	 * the watch callback using a freed rbd_dev
>> +	 */
>> +	dout("%s: flushing notifies", __func__);
>> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>   	rbd_dev_image_release(rbd_dev);
>>   	module_put(THIS_MODULE);
>>
>> --
>> 1.7.2.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>