[PATCH 0/3] shutdown race and debug fix

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] shutdown race and debug fix
@ 2013-08-29  6:24 Josh Durgin
  2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29  6:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin

These patches fix a race in rbd shutdown and a null deref when
debugging is enabled. They're also available in the wip-rbd-bugs branch
of ceph-client.git.

Josh Durgin (3):
  rbd: fix null dereference in dout
  libceph: add function to ensure notifies are complete
  rbd: close remove vs. notify race leading to use-after-free

 drivers/block/rbd.c             |   33 +++++++++++++++++++++++++++------
 include/linux/ceph/osd_client.h |    2 ++
 net/ceph/osd_client.c           |   11 +++++++++++
 3 files changed, 40 insertions(+), 6 deletions(-)

-- 
1.7.2.5

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] rbd: fix null dereference in dout
  2013-08-29  6:24 [PATCH 0/3] shutdown race and debug fix Josh Durgin
@ 2013-08-29  6:24 ` Josh Durgin
  2013-08-29 14:26   ` Alex Elder
  2013-08-29 14:46   ` Sage Weil
  2013-08-29  6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
  2 siblings, 2 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29  6:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin

The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always derefences it. Move this to another dout() protected
by a check that order is non-NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 80f787b..fef3687 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3702,12 +3702,14 @@ static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
 	if (ret < sizeof (size_buf))
 		return -ERANGE;
 
-	if (order)
+	if (order) {
 		*order = size_buf.order;
+		dout("  order %u", (unsigned int)*order);
+	}
 	*snap_size = le64_to_cpu(size_buf.size);
 
-	dout("  snap_id 0x%016llx order = %u, snap_size = %llu\n",
-		(unsigned long long)snap_id, (unsigned int)*order,
+	dout("  snap_id 0x%016llx snap_size = %llu\n",
+		(unsigned long long)snap_id,
 		(unsigned long long)*snap_size);
 
 	return 0;
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] libceph: add function to ensure notifies are complete
  2013-08-29  6:24 [PATCH 0/3] shutdown race and debug fix Josh Durgin
  2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
@ 2013-08-29  6:24 ` Josh Durgin
  2013-08-29 14:46   ` Sage Weil
  2013-08-29 15:21   ` Alex Elder
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
  2 siblings, 2 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29  6:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin

Without a way to flush the osd client's notify workqueue, a watch
event that is unregistered could continue receiving callbacks
indefinitely.

Unregistering the event simply means no new notifies are added to the
queue, but there may still be events in the queue that will call the
watch callback for the event. If the queue is flushed after the event
is unregistered, the caller can be sure no more watch callbacks will
occur for the canceled watch.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 include/linux/ceph/osd_client.h |    2 ++
 net/ceph/osd_client.c           |   11 +++++++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index ce6df39..8f47625 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -335,6 +335,8 @@ extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 				  struct ceph_osd_request *req);
 extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
 
+extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
+
 extern int ceph_osdc_readpages(struct ceph_osd_client *osdc,
 			       struct ceph_vino vino,
 			       struct ceph_file_layout *layout,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1606f74..2b4b32a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2216,6 +2216,17 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc)
 EXPORT_SYMBOL(ceph_osdc_sync);
 
 /*
+ * Call all pending notify callbacks - for use after a watch is
+ * unregistered, to make sure no more callbacks for it will be invoked
+ */
+extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc)
+{
+	flush_workqueue(osdc->notify_wq);
+}
+EXPORT_SYMBOL(ceph_osdc_flush_notifies);
+
+
+/*
  * init, shutdown
  */
 int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
  2013-08-29  6:24 [PATCH 0/3] shutdown race and debug fix Josh Durgin
  2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
  2013-08-29  6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
@ 2013-08-29  6:24 ` Josh Durgin
  2013-08-29 14:46   ` Sage Weil
                     ` (4 more replies)
  2 siblings, 5 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29  6:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin

Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, use the header_rwsem to protect all the work
rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
the block device is released and the watch is canceled.
rbd_bus_del_dev() ends up releasing the block device, so no requests
to the device remain after this. This makes all the work in
rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
the watch has been canceled.

Finally, flush the osd client's notify queue before deallocating the
rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
safely. No more notifies can enter the queue at this point since the
watch has been canceled.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |   25 ++++++++++++++++++++++---
 1 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fef3687..63e1590 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
 static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 {
 	u64 mapping_size;
-	int ret;
+	int ret = 0;
 
 	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
 	down_write(&rbd_dev->header_rwsem);
+	if (!rbd_dev->watch_event)
+		goto out;
+
 	mapping_size = rbd_dev->mapping.size;
 	if (rbd_dev->image_format == 1)
 		ret = rbd_dev_v1_header_info(rbd_dev);
@@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	/* If it's a mapped snapshot, validate its EXISTS flag */
 
 	rbd_exists_validate(rbd_dev);
-	up_write(&rbd_dev->header_rwsem);
 
 	if (mapping_size != rbd_dev->mapping.size) {
 		sector_t size;
@@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 		set_capacity(rbd_dev->disk, size);
 		revalidate_disk(rbd_dev->disk);
 	}
-
+out:
+	up_write(&rbd_dev->header_rwsem);
 	return ret;
 }
 
@@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
 	if (ret < 0 || already)
 		return ret;
 
+	/*
+	 * take header semaphore while destroying the device and
+	 * canceling the watch so that device destruction will
+	 * not race with device updates from the watch callback
+	 */
+	down_write(&rbd_dev->header_rwsem);
 	rbd_bus_del_dev(rbd_dev);
 	ret = rbd_dev_header_watch_sync(rbd_dev, false);
+	up_write(&rbd_dev->header_rwsem);
+
 	if (ret)
 		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
+
+	/*
+	 * flush remaing watch callbacks - these don't update anything
+	 * anymore since rbd_dev->watch_event is NULL, but it avoids
+	 * the watch callback using a freed rbd_dev
+	 */
+	dout("%s: flushing notifies", __func__);
+	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
 	rbd_dev_image_release(rbd_dev);
 	module_put(THIS_MODULE);
 
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] rbd: fix null dereference in dout
  2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
@ 2013-08-29 14:26   ` Alex Elder
  2013-08-29 14:46   ` Sage Weil
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Elder @ 2013-08-29 14:26 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 01:24 AM, Josh Durgin wrote:
> The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
> the dout() always derefences it. Move this to another dout() protected
> by a check that order is non-NULL.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  drivers/block/rbd.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 80f787b..fef3687 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3702,12 +3702,14 @@ static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
>  	if (ret < sizeof (size_buf))
>  		return -ERANGE;
>  
> -	if (order)
> +	if (order) {
>  		*order = size_buf.order;
> +		dout("  order %u", (unsigned int)*order);
> +	}
>  	*snap_size = le64_to_cpu(size_buf.size);
>  
> -	dout("  snap_id 0x%016llx order = %u, snap_size = %llu\n",
> -		(unsigned long long)snap_id, (unsigned int)*order,
> +	dout("  snap_id 0x%016llx snap_size = %llu\n",
> +		(unsigned long long)snap_id,
>  		(unsigned long long)*snap_size);
>  
>  	return 0;
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
@ 2013-08-29 14:46   ` Sage Weil
  2013-08-29 18:36     ` Josh Durgin
  2013-08-29 15:21   ` Alex Elder
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-08-29 14:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Wed, 28 Aug 2013, Josh Durgin wrote:
> Removing a device deallocates the disk, unschedules the watch, and
> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
> from the watch callback, updates the disk size and rbd_dev
> structure. With no locking between them, rbd_dev_refresh() may use the
> device or rbd_dev after they've been freed.
> 
> To fix this, use the header_rwsem to protect all the work
> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
> the block device is released and the watch is canceled.
> rbd_bus_del_dev() ends up releasing the block device, so no requests
> to the device remain after this. This makes all the work in
> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
> the watch has been canceled.
> 
> Finally, flush the osd client's notify queue before deallocating the
> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
> safely. No more notifies can enter the queue at this point since the
> watch has been canceled.
> 
> Fixes: http://tracker.ceph.com/issues/5636
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>  1 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index fef3687..63e1590 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  {
>  	u64 mapping_size;
> -	int ret;
> +	int ret = 0;
>  
>  	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>  	down_write(&rbd_dev->header_rwsem);
> +	if (!rbd_dev->watch_event)
> +		goto out;
> +
>  	mapping_size = rbd_dev->mapping.size;
>  	if (rbd_dev->image_format == 1)
>  		ret = rbd_dev_v1_header_info(rbd_dev);
> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	/* If it's a mapped snapshot, validate its EXISTS flag */
>  
>  	rbd_exists_validate(rbd_dev);
> -	up_write(&rbd_dev->header_rwsem);
>  
>  	if (mapping_size != rbd_dev->mapping.size) {
>  		sector_t size;
> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  		set_capacity(rbd_dev->disk, size);
>  		revalidate_disk(rbd_dev->disk);
>  	}
> -
> +out:
> +	up_write(&rbd_dev->header_rwsem);
>  	return ret;
>  }
>  
> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	if (ret < 0 || already)
>  		return ret;
>  
> +	/*
> +	 * take header semaphore while destroying the device and
> +	 * canceling the watch so that device destruction will
> +	 * not race with device updates from the watch callback
> +	 */
> +	down_write(&rbd_dev->header_rwsem);
>  	rbd_bus_del_dev(rbd_dev);
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
> +	up_write(&rbd_dev->header_rwsem);

If I'm reading thsi right, we're holding the rwsem exclusively, and 
blocking the workqueue, for the duration of the osd request to remove the 
watch.  It worries me to block up the workqueue for that long (possibly 
for other images).  I don't think it will deadlock, but if hte cluster is 
unhealthy this one request could take a while.

What we if we just add a separate flag "shutting down" and use that 
instead of the rbd_dev->watch_event?  Then we can up_write prior to the 
watch removal call...

sage


> +
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> +
> +	/*
> +	 * flush remaing watch callbacks - these don't update anything
> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
> +	 * the watch callback using a freed rbd_dev
> +	 */
> +	dout("%s: flushing notifies", __func__);
> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>  	rbd_dev_image_release(rbd_dev);
>  	module_put(THIS_MODULE);
>  
> -- 
> 1.7.2.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] rbd: fix null dereference in dout
  2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
  2013-08-29 14:26   ` Alex Elder
@ 2013-08-29 14:46   ` Sage Weil
  1 sibling, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-08-29 14:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

Reviewed-by: Sage Weil <sage@inktank.com>

On Wed, 28 Aug 2013, Josh Durgin wrote:

> The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
> the dout() always derefences it. Move this to another dout() protected
> by a check that order is non-NULL.
> 
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  drivers/block/rbd.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 80f787b..fef3687 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3702,12 +3702,14 @@ static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
>  	if (ret < sizeof (size_buf))
>  		return -ERANGE;
>  
> -	if (order)
> +	if (order) {
>  		*order = size_buf.order;
> +		dout("  order %u", (unsigned int)*order);
> +	}
>  	*snap_size = le64_to_cpu(size_buf.size);
>  
> -	dout("  snap_id 0x%016llx order = %u, snap_size = %llu\n",
> -		(unsigned long long)snap_id, (unsigned int)*order,
> +	dout("  snap_id 0x%016llx snap_size = %llu\n",
> +		(unsigned long long)snap_id,
>  		(unsigned long long)*snap_size);
>  
>  	return 0;
> -- 
> 1.7.2.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] libceph: add function to ensure notifies are complete
  2013-08-29  6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
@ 2013-08-29 14:46   ` Sage Weil
  2013-08-29 15:21   ` Alex Elder
  1 sibling, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-08-29 14:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

Reviewed-by: Sage Weil <sage@inktank.com>

On Wed, 28 Aug 2013, Josh Durgin wrote:

> Without a way to flush the osd client's notify workqueue, a watch
> event that is unregistered could continue receiving callbacks
> indefinitely.
> 
> Unregistering the event simply means no new notifies are added to the
> queue, but there may still be events in the queue that will call the
> watch callback for the event. If the queue is flushed after the event
> is unregistered, the caller can be sure no more watch callbacks will
> occur for the canceled watch.
> 
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  include/linux/ceph/osd_client.h |    2 ++
>  net/ceph/osd_client.c           |   11 +++++++++++
>  2 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index ce6df39..8f47625 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -335,6 +335,8 @@ extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
>  				  struct ceph_osd_request *req);
>  extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
>  
> +extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> +
>  extern int ceph_osdc_readpages(struct ceph_osd_client *osdc,
>  			       struct ceph_vino vino,
>  			       struct ceph_file_layout *layout,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 1606f74..2b4b32a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2216,6 +2216,17 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc)
>  EXPORT_SYMBOL(ceph_osdc_sync);
>  
>  /*
> + * Call all pending notify callbacks - for use after a watch is
> + * unregistered, to make sure no more callbacks for it will be invoked
> + */
> +extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc)
> +{
> +	flush_workqueue(osdc->notify_wq);
> +}
> +EXPORT_SYMBOL(ceph_osdc_flush_notifies);
> +
> +
> +/*
>   * init, shutdown
>   */
>  int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
> -- 
> 1.7.2.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] libceph: add function to ensure notifies are complete
  2013-08-29  6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
  2013-08-29 14:46   ` Sage Weil
@ 2013-08-29 15:21   ` Alex Elder
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Elder @ 2013-08-29 15:21 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 01:24 AM, Josh Durgin wrote:
> Without a way to flush the osd client's notify workqueue, a watch
> event that is unregistered could continue receiving callbacks
> indefinitely.
> 
> Unregistering the event simply means no new notifies are added to the
> queue, but there may still be events in the queue that will call the
> watch callback for the event. If the queue is flushed after the event
> is unregistered, the caller can be sure no more watch callbacks will
> occur for the canceled watch.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  include/linux/ceph/osd_client.h |    2 ++
>  net/ceph/osd_client.c           |   11 +++++++++++
>  2 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index ce6df39..8f47625 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -335,6 +335,8 @@ extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
>  				  struct ceph_osd_request *req);
>  extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
>  
> +extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> +
>  extern int ceph_osdc_readpages(struct ceph_osd_client *osdc,
>  			       struct ceph_vino vino,
>  			       struct ceph_file_layout *layout,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 1606f74..2b4b32a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2216,6 +2216,17 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc)
>  EXPORT_SYMBOL(ceph_osdc_sync);
>  
>  /*
> + * Call all pending notify callbacks - for use after a watch is
> + * unregistered, to make sure no more callbacks for it will be invoked
> + */
> +extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc)
> +{
> +	flush_workqueue(osdc->notify_wq);
> +}
> +EXPORT_SYMBOL(ceph_osdc_flush_notifies);
> +
> +
> +/*
>   * init, shutdown
>   */
>  int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
  2013-08-29 14:46   ` Sage Weil
@ 2013-08-29 15:21   ` Alex Elder
  2013-08-29 18:33     ` Josh Durgin
  2013-08-30  0:57   ` [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk Josh Durgin
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Alex Elder @ 2013-08-29 15:21 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 01:24 AM, Josh Durgin wrote:
> Removing a device deallocates the disk, unschedules the watch, and
> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
> from the watch callback, updates the disk size and rbd_dev
> structure. With no locking between them, rbd_dev_refresh() may use the
> device or rbd_dev after they've been freed.
> 
> To fix this, use the header_rwsem to protect all the work
> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
> the block device is released and the watch is canceled.

It seems to me this was a little bit of a tricky area.  As I recall
there was a problem related to lock inversion, holding the semaphore
while calling revalidate_disk().  (I'm sorry, this may be wrong and
I can't dig much into it further at the moment.)  In any case, make
sure you test with lockdep enabled and look for splats to the console.

Also, there was once a function rbd_update_mapping_size(),
which encapsulated both updating the size field and then
calling set_capacity() immediately thereafter.  When images
were refreshed (in rbd_dev_vX_refresh()), these updates would
be made by calling rbd_update_mapping_size().  That made good
sense, but after this commit:
    00a653e216  rbd: update capacity in rbd_dev_refresh()
the function became pretty trivial and it got removed.

I think due to the lockdep thing, revalidate_disk() was
later called outside the protection of the mutex.  It may
be necessary to at least move those size updates back into
the refresh functions to resolve the lockdep problem.

Anyway, having a function that encapsulates these size changes
like before might be nice, but it's not that important.  If
there really is a problem with calling revalidate_disk() with
the mutex held, then it's not all done in a single unit anyway.

> rbd_bus_del_dev() ends up releasing the block device, so no requests
> to the device remain after this. This makes all the work in
> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
> the watch has been canceled.

This makes sense, but a brief comment explaining why you're
skipping it at the point in the code would be helpful.

> Finally, flush the osd client's notify queue before deallocating the
> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
> safely. No more notifies can enter the queue at this point since the
> watch has been canceled.
> 
> Fixes: http://tracker.ceph.com/issues/5636
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Please verify there's no lockdep reports if you haven't already.
And consider my other comments, but barring lockdep issues:

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>  1 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index fef3687..63e1590 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  {
>  	u64 mapping_size;
> -	int ret;
> +	int ret = 0;
>  
>  	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>  	down_write(&rbd_dev->header_rwsem);
> +	if (!rbd_dev->watch_event)
> +		goto out;
> +
>  	mapping_size = rbd_dev->mapping.size;
>  	if (rbd_dev->image_format == 1)
>  		ret = rbd_dev_v1_header_info(rbd_dev);
> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	/* If it's a mapped snapshot, validate its EXISTS flag */
>  
>  	rbd_exists_validate(rbd_dev);
> -	up_write(&rbd_dev->header_rwsem);
>  
>  	if (mapping_size != rbd_dev->mapping.size) {
>  		sector_t size;
> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  		set_capacity(rbd_dev->disk, size);
>  		revalidate_disk(rbd_dev->disk);
>  	}
> -
> +out:
> +	up_write(&rbd_dev->header_rwsem);
>  	return ret;
>  }
>  
> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	if (ret < 0 || already)
>  		return ret;
>  
> +	/*
> +	 * take header semaphore while destroying the device and
> +	 * canceling the watch so that device destruction will
> +	 * not race with device updates from the watch callback
> +	 */
> +	down_write(&rbd_dev->header_rwsem);
>  	rbd_bus_del_dev(rbd_dev);
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
> +	up_write(&rbd_dev->header_rwsem);
> +
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> +
> +	/*
> +	 * flush remaing watch callbacks - these don't update anything
> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
> +	 * the watch callback using a freed rbd_dev
> +	 */
> +	dout("%s: flushing notifies", __func__);
> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>  	rbd_dev_image_release(rbd_dev);
>  	module_put(THIS_MODULE);
>  
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
  2013-08-29 15:21   ` Alex Elder
@ 2013-08-29 18:33     ` Josh Durgin
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29 18:33 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

On 08/29/2013 08:21 AM, Alex Elder wrote:
> On 08/29/2013 01:24 AM, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>
> It seems to me this was a little bit of a tricky area.  As I recall
> there was a problem related to lock inversion, holding the semaphore
> while calling revalidate_disk().  (I'm sorry, this may be wrong and
> I can't dig much into it further at the moment.)  In any case, make
> sure you test with lockdep enabled and look for splats to the console.

You're right, there was a lock inversion with the header_rwsem and
bdev->bd_mutex. I'll switch to using a new mutex to avoid this.

> Also, there was once a function rbd_update_mapping_size(),
> which encapsulated both updating the size field and then
> calling set_capacity() immediately thereafter.  When images
> were refreshed (in rbd_dev_vX_refresh()), these updates would
> be made by calling rbd_update_mapping_size().  That made good
> sense, but after this commit:
>      00a653e216  rbd: update capacity in rbd_dev_refresh()
> the function became pretty trivial and it got removed.
>
> I think due to the lockdep thing, revalidate_disk() was
> later called outside the protection of the mutex.  It may
> be necessary to at least move those size updates back into
> the refresh functions to resolve the lockdep problem.
>
> Anyway, having a function that encapsulates these size changes
> like before might be nice, but it's not that important.  If
> there really is a problem with calling revalidate_disk() with
> the mutex held, then it's not all done in a single unit anyway.

Good idea.

>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>
> This makes sense, but a brief comment explaining why you're
> skipping it at the point in the code would be helpful.

I'll add one. Thanks!

Josh

>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>
> Please verify there's no lockdep reports if you haven't already.
> And consider my other comments, but barring lockdep issues:
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> ---
>>   drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>>   1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> -	int ret;
>> +	int ret = 0;
>>
>>   	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>>   	down_write(&rbd_dev->header_rwsem);
>> +	if (!rbd_dev->watch_event)
>> +		goto out;
>> +
>>   	mapping_size = rbd_dev->mapping.size;
>>   	if (rbd_dev->image_format == 1)
>>   		ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	/* If it's a mapped snapshot, validate its EXISTS flag */
>>
>>   	rbd_exists_validate(rbd_dev);
>> -	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>>   		sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   		set_capacity(rbd_dev->disk, size);
>>   		revalidate_disk(rbd_dev->disk);
>>   	}
>> -
>> +out:
>> +	up_write(&rbd_dev->header_rwsem);
>>   	return ret;
>>   }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * take header semaphore while destroying the device and
>> +	 * canceling the watch so that device destruction will
>> +	 * not race with device updates from the watch callback
>> +	 */
>> +	down_write(&rbd_dev->header_rwsem);
>>   	rbd_bus_del_dev(rbd_dev);
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> +	up_write(&rbd_dev->header_rwsem);
>> +
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> +	/*
>> +	 * flush remaing watch callbacks - these don't update anything
>> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
>> +	 * the watch callback using a freed rbd_dev
>> +	 */
>> +	dout("%s: flushing notifies", __func__);
>> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>   	rbd_dev_image_release(rbd_dev);
>>   	module_put(THIS_MODULE);
>>
>>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free
  2013-08-29 14:46   ` Sage Weil
@ 2013-08-29 18:36     ` Josh Durgin
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Durgin @ 2013-08-29 18:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/29/2013 07:46 AM, Sage Weil wrote:
> On Wed, 28 Aug 2013, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, use the header_rwsem to protect all the work
>> rbd_dev_refresh() does, and take it exclusively in rbd_remove() where
>> the block device is released and the watch is canceled.
>> rbd_bus_del_dev() ends up releasing the block device, so no requests
>> to the device remain after this. This makes all the work in
>> rbd_dev_refresh() unnecessary, as well as race-prone, so skip it if
>> the watch has been canceled.
>>
>> Finally, flush the osd client's notify queue before deallocating the
>> rbd dev, so that any callbacks remaining can read rbd_dev->watch_event
>> safely. No more notifies can enter the queue at this point since the
>> watch has been canceled.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>> ---
>>   drivers/block/rbd.c |   25 ++++++++++++++++++++++---
>>   1 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..63e1590 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3327,10 +3327,13 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> -	int ret;
>> +	int ret = 0;
>>
>>   	rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
>>   	down_write(&rbd_dev->header_rwsem);
>> +	if (!rbd_dev->watch_event)
>> +		goto out;
>> +
>>   	mapping_size = rbd_dev->mapping.size;
>>   	if (rbd_dev->image_format == 1)
>>   		ret = rbd_dev_v1_header_info(rbd_dev);
>> @@ -3340,7 +3343,6 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	/* If it's a mapped snapshot, validate its EXISTS flag */
>>
>>   	rbd_exists_validate(rbd_dev);
>> -	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>>   		sector_t size;
>> @@ -3350,7 +3352,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   		set_capacity(rbd_dev->disk, size);
>>   		revalidate_disk(rbd_dev->disk);
>>   	}
>> -
>> +out:
>> +	up_write(&rbd_dev->header_rwsem);
>>   	return ret;
>>   }
>>
>> @@ -5159,10 +5162,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * take header semaphore while destroying the device and
>> +	 * canceling the watch so that device destruction will
>> +	 * not race with device updates from the watch callback
>> +	 */
>> +	down_write(&rbd_dev->header_rwsem);
>>   	rbd_bus_del_dev(rbd_dev);
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>> +	up_write(&rbd_dev->header_rwsem);
>
> If I'm reading thsi right, we're holding the rwsem exclusively, and
> blocking the workqueue, for the duration of the osd request to remove the
> watch.  It worries me to block up the workqueue for that long (possibly
> for other images).  I don't think it will deadlock, but if hte cluster is
> unhealthy this one request could take a while.
>
> What we if we just add a separate flag "shutting down" and use that
> instead of the rbd_dev->watch_event?  Then we can up_write prior to the
> watch removal call...

Due to the lock inversion Alex mentioned, I'll use a separate mutex and
flag. Since we're draining the notify queue after this, only
rbd_bus_del_dev() needs to be protected from running at the same time -
canceling the watch doesn't matter for correctness.

Josh

> sage
>
>
>> +
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>> +
>> +	/*
>> +	 * flush remaing watch callbacks - these don't update anything
>> +	 * anymore since rbd_dev->watch_event is NULL, but it avoids
>> +	 * the watch callback using a freed rbd_dev
>> +	 */
>> +	dout("%s: flushing notifies", __func__);
>> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>   	rbd_dev_image_release(rbd_dev);
>>   	module_put(THIS_MODULE);
>>
>> --
>> 1.7.2.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
  2013-08-29 14:46   ` Sage Weil
  2013-08-29 15:21   ` Alex Elder
@ 2013-08-30  0:57   ` Josh Durgin
  2013-09-03 12:41     ` Alex Elder
  2013-08-30  0:57   ` [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev Josh Durgin
  2013-08-30  0:57   ` [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous Josh Durgin
  4 siblings, 1 reply; 20+ messages in thread
From: Josh Durgin @ 2013-08-30  0:57 UTC (permalink / raw)
  To: ceph-devel

Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, add a shutting_down flag and a mutex protecting it to rbd_dev.
Take the mutex and check whether the flag is set before using rbd_dev->disk.
Move this disk-updating to a separate function as well.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |   39 +++++++++++++++++++++++++++++++++------
 1 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fef3687..8ab3362b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -343,6 +343,9 @@ struct rbd_device {
 	struct ceph_osd_event   *watch_event;
 	struct rbd_obj_request	*watch_request;
 
+	bool			shutting_down;	/* rbd_remove() in progress */
+	struct mutex		shutdown_lock;	/* protects shutting_down */
+
 	struct rbd_spec		*parent_spec;
 	u64			parent_overlap;
 	atomic_t		parent_ref;
@@ -3324,6 +3327,24 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
 		clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
 }
 
+static void rbd_dev_update_size(struct rbd_device *rbd_dev)
+{
+	sector_t size;
+
+	mutex_lock(&rbd_dev->shutdown_lock);
+	/*
+	 * If the device is being removed, rbd_dev->disk has
+	 * been destroyed, so don't try to update its size
+	 */
+	if (!rbd_dev->shutting_down) {
+		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
+		dout("setting size to %llu sectors", (unsigned long long)size);
+		set_capacity(rbd_dev->disk, size);
+		revalidate_disk(rbd_dev->disk);
+	}
+	mutex_unlock(&rbd_dev->shutdown_lock);
+}
+
 static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 {
 	u64 mapping_size;
@@ -3343,12 +3364,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	up_write(&rbd_dev->header_rwsem);
 
 	if (mapping_size != rbd_dev->mapping.size) {
-		sector_t size;
-
-		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
-		dout("setting size to %llu sectors", (unsigned long long)size);
-		set_capacity(rbd_dev->disk, size);
-		revalidate_disk(rbd_dev->disk);
+		rbd_dev_update_size(rbd_dev);
 	}
 
 	return ret;
@@ -3656,6 +3672,8 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	atomic_set(&rbd_dev->parent_ref, 0);
 	INIT_LIST_HEAD(&rbd_dev->node);
 	init_rwsem(&rbd_dev->header_rwsem);
+	mutex_init(&rbd_dev->shutdown_lock);
+	rbd_dev->shutting_down = false;
 
 	rbd_dev->spec = spec;
 	rbd_dev->rbd_client = rbdc;
@@ -5159,7 +5177,16 @@ static ssize_t rbd_remove(struct bus_type *bus,
 	if (ret < 0 || already)
 		return ret;
 
+	/*
+	 * hold shutdown_lock while destroying the device so that
+	 * device destruction will not race with device updates from
+	 * the watch callback
+	 */
+	mutex_lock(&rbd_dev->shutdown_lock);
+	rbd_dev->shutting_down = true;
 	rbd_bus_del_dev(rbd_dev);
+	mutex_unlock(&rbd_dev->shutdown_lock);
+
 	ret = rbd_dev_header_watch_sync(rbd_dev, false);
 	if (ret)
 		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
                     ` (2 preceding siblings ...)
  2013-08-30  0:57   ` [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk Josh Durgin
@ 2013-08-30  0:57   ` Josh Durgin
  2013-09-03 12:45     ` Alex Elder
  2013-08-30  0:57   ` [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous Josh Durgin
  4 siblings, 1 reply; 20+ messages in thread
From: Josh Durgin @ 2013-08-30  0:57 UTC (permalink / raw)
  To: ceph-devel

To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already be unregistered with the osd_client.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8ab3362b..2223617 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5190,6 +5190,13 @@ static ssize_t rbd_remove(struct bus_type *bus,
 	ret = rbd_dev_header_watch_sync(rbd_dev, false);
 	if (ret)
 		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
+
+	/*
+	 * flush remaining watch callbacks - these must be complete
+	 * before the osd_client is shutdown
+	 */
+	dout("%s: flushing notifies", __func__);
+	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
 	rbd_dev_image_release(rbd_dev);
 	module_put(THIS_MODULE);
 
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous
  2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
                     ` (3 preceding siblings ...)
  2013-08-30  0:57   ` [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev Josh Durgin
@ 2013-08-30  0:57   ` Josh Durgin
  2013-09-03 13:05     ` Alex Elder
  4 siblings, 1 reply; 20+ messages in thread
From: Josh Durgin @ 2013-08-30  0:57 UTC (permalink / raw)
  To: ceph-devel

The only user is rbd_watch_cb(). If this is asynchronous, there is no
tracking of when it completes, so it may still be in progress when the
osd_client is shut down.  This results in a BUG() since the osd client
assumes no requests are in flight when it stops. Since all notifies
are flushed now, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
---
 drivers/block/rbd.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 2223617..0e83a10 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2826,13 +2826,15 @@ static int rbd_obj_notify_ack(struct rbd_device *rbd_dev, u64 notify_id)
 	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
 	if (!obj_request->osd_req)
 		goto out;
-	obj_request->callback = rbd_obj_request_put;
 
 	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_NOTIFY_ACK,
 					notify_id, 0, 0);
 	rbd_osd_req_format_read(obj_request);
 
 	ret = rbd_obj_request_submit(osdc, obj_request);
+	if (ret)
+		goto out;
+	ret = rbd_obj_request_wait(obj_request);
 out:
 	if (ret)
 		rbd_obj_request_put(obj_request);
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk
  2013-08-30  0:57   ` [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk Josh Durgin
@ 2013-09-03 12:41     ` Alex Elder
  2013-09-09  7:30       ` Josh Durgin
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Elder @ 2013-09-03 12:41 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 07:57 PM, Josh Durgin wrote:
> Removing a device deallocates the disk, unschedules the watch, and
> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
> from the watch callback, updates the disk size and rbd_dev
> structure. With no locking between them, rbd_dev_refresh() may use the
> device or rbd_dev after they've been freed.
> 
> To fix this, add a shutting_down flag and a mutex protecting it to rbd_dev.
> Take the mutex and check whether the flag is set before using rbd_dev->disk.
> Move this disk-updating to a separate function as well.
> 
> Fixes: http://tracker.ceph.com/issues/5636
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

A few comments below.  I don't think you need the shutting_down
flag after all.  If you disagree, say so.  Either way though,
this looks good to me.

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  drivers/block/rbd.c |   39 +++++++++++++++++++++++++++++++++------
>  1 files changed, 33 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index fef3687..8ab3362b 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -343,6 +343,9 @@ struct rbd_device {
>  	struct ceph_osd_event   *watch_event;
>  	struct rbd_obj_request	*watch_request;
>  
> +	bool			shutting_down;	/* rbd_remove() in progress */

You could use a new rbd_dev_flags value for this.

In fact, now that I'm looking at this, I think the REMOVING flag
is already sufficient to indicate that bit of state.  (Sorry I
didn't see this before.)

You would still want the mutex so the shutdown won't happen until
an underway size update completed.  (Or you could add another
UPDATING_SIZE flag, but I think the mutex is better in this case.)

> +	struct mutex		shutdown_lock;	/* protects shutting_down */
> +
>  	struct rbd_spec		*parent_spec;
>  	u64			parent_overlap;
>  	atomic_t		parent_ref;
> @@ -3324,6 +3327,24 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>  		clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
>  }
>  
> +static void rbd_dev_update_size(struct rbd_device *rbd_dev)
> +{
> +	sector_t size;
> +
> +	mutex_lock(&rbd_dev->shutdown_lock);
> +	/*
> +	 * If the device is being removed, rbd_dev->disk has
> +	 * been destroyed, so don't try to update its size
> +	 */
> +	if (!rbd_dev->shutting_down) {
> +		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
> +		dout("setting size to %llu sectors", (unsigned long long)size);
> +		set_capacity(rbd_dev->disk, size);

Is it true you don't hit that locking problem because of the new mutex?

> +		revalidate_disk(rbd_dev->disk);
> +	}
> +	mutex_unlock(&rbd_dev->shutdown_lock);
> +}
> +
>  static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  {
>  	u64 mapping_size;
> @@ -3343,12 +3364,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>  	up_write(&rbd_dev->header_rwsem);
>  
>  	if (mapping_size != rbd_dev->mapping.size) {
> -		sector_t size;
> -
> -		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
> -		dout("setting size to %llu sectors", (unsigned long long)size);
> -		set_capacity(rbd_dev->disk, size);
> -		revalidate_disk(rbd_dev->disk);
> +		rbd_dev_update_size(rbd_dev);
>  	}
>  
>  	return ret;
> @@ -3656,6 +3672,8 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
>  	atomic_set(&rbd_dev->parent_ref, 0);
>  	INIT_LIST_HEAD(&rbd_dev->node);
>  	init_rwsem(&rbd_dev->header_rwsem);
> +	mutex_init(&rbd_dev->shutdown_lock);
> +	rbd_dev->shutting_down = false;
>  
>  	rbd_dev->spec = spec;
>  	rbd_dev->rbd_client = rbdc;
> @@ -5159,7 +5177,16 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	if (ret < 0 || already)
>  		return ret;
>  
> +	/*
> +	 * hold shutdown_lock while destroying the device so that
> +	 * device destruction will not race with device updates from
> +	 * the watch callback
> +	 */
> +	mutex_lock(&rbd_dev->shutdown_lock);
> +	rbd_dev->shutting_down = true;
>  	rbd_bus_del_dev(rbd_dev);
> +	mutex_unlock(&rbd_dev->shutdown_lock);
> +
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev
  2013-08-30  0:57   ` [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev Josh Durgin
@ 2013-09-03 12:45     ` Alex Elder
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Elder @ 2013-09-03 12:45 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 07:57 PM, Josh Durgin wrote:
> To ensure rbd_dev is not used after it's released, flush all pending
> notify callbacks before calling rbd_dev_image_release(). No new
> notifies can be added to the queue at this point because the watch has
> already be unregistered with the osd_client.
> 
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  drivers/block/rbd.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 8ab3362b..2223617 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -5190,6 +5190,13 @@ static ssize_t rbd_remove(struct bus_type *bus,
>  	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>  	if (ret)
>  		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
> +
> +	/*
> +	 * flush remaining watch callbacks - these must be complete
> +	 * before the osd_client is shutdown
> +	 */
> +	dout("%s: flushing notifies", __func__);
> +	ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>  	rbd_dev_image_release(rbd_dev);
>  	module_put(THIS_MODULE);
>  
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous
  2013-08-30  0:57   ` [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous Josh Durgin
@ 2013-09-03 13:05     ` Alex Elder
  2013-09-03 13:07       ` Alex Elder
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Elder @ 2013-09-03 13:05 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 08/29/2013 07:57 PM, Josh Durgin wrote:
> The only user is rbd_watch_cb(). If this is asynchronous, there is no

The only user *of rbd_obj_notify_ack()* is rbd_watch_cb().  ...

> tracking of when it completes, so it may still be in progress when the

I think what you're saying is that, because it's currently
asynchronous, the object request may still be in progress.
(I had a little trouble parsing this, as you can see...)

> osd_client is shut down.  This results in a BUG() since the osd client
> assumes no requests are in flight when it stops. Since all notifies
> are flushed now, waiting for the notify ack to complete before
> returning from the watch callback ensures there are no notify acks in
> flight during shutdown.
> 
> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
> ---
>  drivers/block/rbd.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 2223617..0e83a10 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2826,13 +2826,15 @@ static int rbd_obj_notify_ack(struct rbd_device *rbd_dev, u64 notify_id)

Since you are now making this function wait for the object request
to complete, you should rename the function to match the convention of
all the others that do this.  That is, please rename it:
    rbd_obj_notify_ack_sync()

There's another problem though.  This function used a clever trick of
assigning rbd_obj_request_put() as the object request's callback
function.  This allowed the request to just be sent and cleaned up
whenever it completed.

Since you're now waiting for completion here you need to do the
cleanup here as well.  (Otherwise it might have completed before
the wait call even begins, with unpleasant results.)

So here's what I think you need to do:
- Rename the function rbd_obj_notify_ack_sync()
- delete this line in that function:
        obj_request->callback = rbd_obj_request_put;
- Add your call to rbd_obj_request_wait() (though
  you could ignore the return value).
- Make the call to rbd_obj_request_put() at the end of
  that function unconditional.  I.e., drop the "if (ret)"
  and fix the indentation.

					-Alex

>  	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
>  	if (!obj_request->osd_req)
>  		goto out;
> -	obj_request->callback = rbd_obj_request_put;
>  
>  	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_NOTIFY_ACK,
>  					notify_id, 0, 0);
>  	rbd_osd_req_format_read(obj_request);
>  
>  	ret = rbd_obj_request_submit(osdc, obj_request);
> +	if (ret)
> +		goto out;
> +	ret = rbd_obj_request_wait(obj_request);
>  out:
>  	if (ret)
>  		rbd_obj_request_put(obj_request);
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous
  2013-09-03 13:05     ` Alex Elder
@ 2013-09-03 13:07       ` Alex Elder
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Elder @ 2013-09-03 13:07 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 09/03/2013 08:05 AM, Alex Elder wrote:
> There's another problem though.  This function used a clever trick of
> assigning rbd_obj_request_put() as the object request's callback
> function.  This allowed the request to just be sent and cleaned up
> whenever it completed.

Oh, sorry, I neglected to notice you already deleted this.
You still need to call it after waiting though, so make
the call to rbd_obj_request_put() unconditional at the end.

					-Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk
  2013-09-03 12:41     ` Alex Elder
@ 2013-09-09  7:30       ` Josh Durgin
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Durgin @ 2013-09-09  7:30 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

On 09/03/2013 05:41 AM, Alex Elder wrote:
> On 08/29/2013 07:57 PM, Josh Durgin wrote:
>> Removing a device deallocates the disk, unschedules the watch, and
>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>> from the watch callback, updates the disk size and rbd_dev
>> structure. With no locking between them, rbd_dev_refresh() may use the
>> device or rbd_dev after they've been freed.
>>
>> To fix this, add a shutting_down flag and a mutex protecting it to rbd_dev.
>> Take the mutex and check whether the flag is set before using rbd_dev->disk.
>> Move this disk-updating to a separate function as well.
>>
>> Fixes: http://tracker.ceph.com/issues/5636
>> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
>
> A few comments below.  I don't think you need the shutting_down
> flag after all.  If you disagree, say so.  Either way though,
> this looks good to me.

You're right, the extra lock turned out to be unnecessary.
Holding during revalidate_disk() also created a lock inversion
(since rbd_open() is called with bdev->lock held), so I've reworked
this in v3.

Thanks for the reviews!
Josh

> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> ---
>>   drivers/block/rbd.c |   39 +++++++++++++++++++++++++++++++++------
>>   1 files changed, 33 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index fef3687..8ab3362b 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -343,6 +343,9 @@ struct rbd_device {
>>   	struct ceph_osd_event   *watch_event;
>>   	struct rbd_obj_request	*watch_request;
>>
>> +	bool			shutting_down;	/* rbd_remove() in progress */
>
> You could use a new rbd_dev_flags value for this.
>
> In fact, now that I'm looking at this, I think the REMOVING flag
> is already sufficient to indicate that bit of state.  (Sorry I
> didn't see this before.)
>
> You would still want the mutex so the shutdown won't happen until
> an underway size update completed.  (Or you could add another
> UPDATING_SIZE flag, but I think the mutex is better in this case.)
>
>> +	struct mutex		shutdown_lock;	/* protects shutting_down */
>> +
>>   	struct rbd_spec		*parent_spec;
>>   	u64			parent_overlap;
>>   	atomic_t		parent_ref;
>> @@ -3324,6 +3327,24 @@ static void rbd_exists_validate(struct rbd_device *rbd_dev)
>>   		clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
>>   }
>>
>> +static void rbd_dev_update_size(struct rbd_device *rbd_dev)
>> +{
>> +	sector_t size;
>> +
>> +	mutex_lock(&rbd_dev->shutdown_lock);
>> +	/*
>> +	 * If the device is being removed, rbd_dev->disk has
>> +	 * been destroyed, so don't try to update its size
>> +	 */
>> +	if (!rbd_dev->shutting_down) {
>> +		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
>> +		dout("setting size to %llu sectors", (unsigned long long)size);
>> +		set_capacity(rbd_dev->disk, size);
>
> Is it true you don't hit that locking problem because of the new mutex?
>
>> +		revalidate_disk(rbd_dev->disk);
>> +	}
>> +	mutex_unlock(&rbd_dev->shutdown_lock);
>> +}
>> +
>>   static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   {
>>   	u64 mapping_size;
>> @@ -3343,12 +3364,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>   	up_write(&rbd_dev->header_rwsem);
>>
>>   	if (mapping_size != rbd_dev->mapping.size) {
>> -		sector_t size;
>> -
>> -		size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
>> -		dout("setting size to %llu sectors", (unsigned long long)size);
>> -		set_capacity(rbd_dev->disk, size);
>> -		revalidate_disk(rbd_dev->disk);
>> +		rbd_dev_update_size(rbd_dev);
>>   	}
>>
>>   	return ret;
>> @@ -3656,6 +3672,8 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
>>   	atomic_set(&rbd_dev->parent_ref, 0);
>>   	INIT_LIST_HEAD(&rbd_dev->node);
>>   	init_rwsem(&rbd_dev->header_rwsem);
>> +	mutex_init(&rbd_dev->shutdown_lock);
>> +	rbd_dev->shutting_down = false;
>>
>>   	rbd_dev->spec = spec;
>>   	rbd_dev->rbd_client = rbdc;
>> @@ -5159,7 +5177,16 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>   	if (ret < 0 || already)
>>   		return ret;
>>
>> +	/*
>> +	 * hold shutdown_lock while destroying the device so that
>> +	 * device destruction will not race with device updates from
>> +	 * the watch callback
>> +	 */
>> +	mutex_lock(&rbd_dev->shutdown_lock);
>> +	rbd_dev->shutting_down = true;
>>   	rbd_bus_del_dev(rbd_dev);
>> +	mutex_unlock(&rbd_dev->shutdown_lock);
>> +
>>   	ret = rbd_dev_header_watch_sync(rbd_dev, false);
>>   	if (ret)
>>   		rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-09-09  7:30 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-29  6:24 [PATCH 0/3] shutdown race and debug fix Josh Durgin
2013-08-29  6:24 ` [PATCH 1/3] rbd: fix null dereference in dout Josh Durgin
2013-08-29 14:26   ` Alex Elder
2013-08-29 14:46   ` Sage Weil
2013-08-29  6:24 ` [PATCH 2/3] libceph: add function to ensure notifies are complete Josh Durgin
2013-08-29 14:46   ` Sage Weil
2013-08-29 15:21   ` Alex Elder
2013-08-29  6:24 ` [PATCH 3/3] rbd: close remove vs. notify race leading to use-after-free Josh Durgin
2013-08-29 14:46   ` Sage Weil
2013-08-29 18:36     ` Josh Durgin
2013-08-29 15:21   ` Alex Elder
2013-08-29 18:33     ` Josh Durgin
2013-08-30  0:57   ` [PATCH v2 1/3] rbd: fix use-after free of rbd_dev->disk Josh Durgin
2013-09-03 12:41     ` Alex Elder
2013-09-09  7:30       ` Josh Durgin
2013-08-30  0:57   ` [PATCH v2 2/3] rbd: complete notifies before cleaning up osd_client and rbd_dev Josh Durgin
2013-09-03 12:45     ` Alex Elder
2013-08-30  0:57   ` [PATCH v2 3/3] rbd: make rbd_obj_notify_ack() synchronous Josh Durgin
2013-09-03 13:05     ` Alex Elder
2013-09-03 13:07       ` Alex Elder

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.