[PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
@ 2017-08-16 20:40 Waiman Long
  2017-08-17 13:34 ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2017-08-16 20:40 UTC (permalink / raw)
  To: Jens Axboe, Jeff Layton, J. Bruce Fields, Steven Rostedt,
	Ingo Molnar
  Cc: linux-kernel, linux-block, linux-fsdevel, Waiman Long

The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete
a partition in a block device and another task (CPU0) is accessing
tracing sysfs file in that partition.

To avoid that, accessing tracing sysfs file will now use a mutex
trylock loop and the operation will fail if a delete operation is
in progress.

Signed-off-by: Waiman Long <longman@redhat.com>
---

 v2:
   - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
   - Check for signal in the mutex_trylock loops.
   - Use usleep() instead of schedule() for RT tasks.

 block/ioctl.c           |  3 +++
 include/linux/fs.h      |  1 +
 kernel/trace/blktrace.c | 39 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 0de02ee..b920329 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -86,12 +86,15 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
 				return -EBUSY;
 			}
 			/* all seems OK */
+			smp_store_mb(bdev->bd_deleting, 1);
 			fsync_bdev(bdevp);
 			invalidate_bdev(bdevp);
 
 			mutex_lock_nested(&bdev->bd_mutex, 1);
 			delete_partition(disk, partno);
 			mutex_unlock(&bdev->bd_mutex);
+			smp_store_mb(bdev->bd_deleting, 0);
+
 			mutex_unlock(&bdevp->bd_mutex);
 			bdput(bdevp);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d..c2ba35e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -427,6 +427,7 @@ struct block_device {
 #endif
 	struct block_device *	bd_contains;
 	unsigned		bd_block_size;
+	int			bd_deleting;
 	struct hd_struct *	bd_part;
 	/* number of times partitions within this device have been opened. */
 	unsigned		bd_part_count;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index bc364f8..b2dffa9 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,8 @@
 #include <linux/time.h>
 #include <linux/uaccess.h>
 #include <linux/list.h>
+#include <linux/delay.h>
+#include <linux/sched/rt.h>
 
 #include "../../block/blk.h"
 
@@ -1605,6 +1607,23 @@ static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
 	return bdev_get_queue(bdev);
 }
 
+/*
+ * Read/write to the tracing sysfs file requires taking references to the
+ * sysfs file and then acquiring the bd_mutex. Deleting a block device
+ * requires acquiring the bd_mutex and then waiting for all the sysfs
+ * references to be gone. This can lead to deadlock if both operations
+ * happen simultaneously. To avoid this problem, read/write to the
+ * the tracing sysfs files can now fail if the bd_mutex cannot be
+ * acquired while a deletion operation is in progress.
+ *
+ * A mutex trylock loop is used assuming that tracing sysfs operations
+ * aren't frequently enough to cause any contention problem.
+ *
+ * For RT tasks, a running high priority task will prevent any lower
+ * priority RT tasks from being run. So they do need to actually sleep
+ * when the trylock fails to allow lower priority tasks to make forward
+ * progress.
+ */
 static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 					 struct device_attribute *attr,
 					 char *buf)
@@ -1622,7 +1641,15 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&bdev->bd_mutex);
+	while (!mutex_trylock(&bdev->bd_mutex)) {
+		if (READ_ONCE(bdev->bd_deleting))
+			goto out_bdput;
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			goto out_bdput;
+		}
+		rt_task(current) ? usleep_range(10, 10) : schedule();
+	}
 
 	if (attr == &dev_attr_enable) {
 		ret = sprintf(buf, "%u\n", !!q->blk_trace);
@@ -1683,7 +1710,15 @@ static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
 	if (q == NULL)
 		goto out_bdput;
 
-	mutex_lock(&bdev->bd_mutex);
+	while (!mutex_trylock(&bdev->bd_mutex)) {
+		if (READ_ONCE(bdev->bd_deleting))
+			goto out_bdput;
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			goto out_bdput;
+		}
+		rt_task(current) ? usleep_range(10, 10) : schedule();
+	}
 
 	if (attr == &dev_attr_enable) {
 		if (value)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-16 20:40 [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops Waiman Long
@ 2017-08-17 13:34 ` Steven Rostedt
  2017-08-17 16:24   ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 13:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Wed, 16 Aug 2017 16:40:40 -0400
Waiman Long <longman@redhat.com> wrote:

> The lockdep code had reported the following unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(s_active#228);
>                                lock(&bdev->bd_mutex/1);
>                                lock(s_active#228);
>   lock(&bdev->bd_mutex);

Can you show the exact locations of these locks. I have no idea where
this "s_active" is.

> 
>  *** DEADLOCK ***
> 
> The deadlock may happen when one task (CPU1) is trying to delete
> a partition in a block device and another task (CPU0) is accessing
> tracing sysfs file in that partition.
> 
> To avoid that, accessing tracing sysfs file will now use a mutex
> trylock loop and the operation will fail if a delete operation is
> in progress.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> 
>  v2:
>    - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
>    - Check for signal in the mutex_trylock loops.
>    - Use usleep() instead of schedule() for RT tasks.

I'm sorry but I really do hate this patch.

> 
>  block/ioctl.c           |  3 +++
>  include/linux/fs.h      |  1 +
>  kernel/trace/blktrace.c | 39 +++++++++++++++++++++++++++++++++++++--
>  3 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 0de02ee..b920329 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -86,12 +86,15 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
>  				return -EBUSY;
>  			}
>  			/* all seems OK */
> +			smp_store_mb(bdev->bd_deleting, 1);

No comment to explain what is happening here, and why.

>  			fsync_bdev(bdevp);
>  			invalidate_bdev(bdevp);
>  
>  			mutex_lock_nested(&bdev->bd_mutex, 1);
>  			delete_partition(disk, partno);
>  			mutex_unlock(&bdev->bd_mutex);
> +			smp_store_mb(bdev->bd_deleting, 0);
> +

ditto.

>  			mutex_unlock(&bdevp->bd_mutex);
>  			bdput(bdevp);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d..c2ba35e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -427,6 +427,7 @@ struct block_device {
>  #endif
>  	struct block_device *	bd_contains;
>  	unsigned		bd_block_size;
> +	int			bd_deleting;
>  	struct hd_struct *	bd_part;
>  	/* number of times partitions within this device have been opened. */
>  	unsigned		bd_part_count;
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index bc364f8..b2dffa9 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -27,6 +27,8 @@
>  #include <linux/time.h>
>  #include <linux/uaccess.h>
>  #include <linux/list.h>
> +#include <linux/delay.h>
> +#include <linux/sched/rt.h>
>  
>  #include "../../block/blk.h"
>  
> @@ -1605,6 +1607,23 @@ static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
>  	return bdev_get_queue(bdev);
>  }
>  
> +/*
> + * Read/write to the tracing sysfs file requires taking references to the

What's the "tracing sysfs" file? tracefs?

> + * sysfs file and then acquiring the bd_mutex. Deleting a block device
> + * requires acquiring the bd_mutex and then waiting for all the sysfs
> + * references to be gone. This can lead to deadlock if both operations
> + * happen simultaneously. To avoid this problem, read/write to the
> + * the tracing sysfs files can now fail if the bd_mutex cannot be
> + * acquired while a deletion operation is in progress.
> + *
> + * A mutex trylock loop is used assuming that tracing sysfs operations

A mutex trylock loop is not enough to stop a deadlock. But I'm guessing
the undocumented bd_deleting may prevent that.

> + * aren't frequently enough to cause any contention problem.
> + *
> + * For RT tasks, a running high priority task will prevent any lower
> + * priority RT tasks from being run. So they do need to actually sleep
> + * when the trylock fails to allow lower priority tasks to make forward
> + * progress.
> + */
>  static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
>  					 struct device_attribute *attr,
>  					 char *buf)
> @@ -1622,7 +1641,15 @@ static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
>  	if (q == NULL)
>  		goto out_bdput;
>  
> -	mutex_lock(&bdev->bd_mutex);
> +	while (!mutex_trylock(&bdev->bd_mutex)) {
> +		if (READ_ONCE(bdev->bd_deleting))
> +			goto out_bdput;
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out_bdput;
> +		}
> +		rt_task(current) ? usleep_range(10, 10) : schedule();

We need to come up with a better solution. This is just a hack that
circumvents a lot of the lockdep infrastructure.

-- Steve

> +	}
>  
>  	if (attr == &dev_attr_enable) {
>  		ret = sprintf(buf, "%u\n", !!q->blk_trace);
> @@ -1683,7 +1710,15 @@ static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
>  	if (q == NULL)
>  		goto out_bdput;
>  
> -	mutex_lock(&bdev->bd_mutex);
> +	while (!mutex_trylock(&bdev->bd_mutex)) {
> +		if (READ_ONCE(bdev->bd_deleting))
> +			goto out_bdput;
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out_bdput;
> +		}
> +		rt_task(current) ? usleep_range(10, 10) : schedule();
> +	}
>  
>  	if (attr == &dev_attr_enable) {
>  		if (value)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 13:34 ` Steven Rostedt
@ 2017-08-17 16:24   ` Waiman Long
  2017-08-17 20:30     ` Steven Rostedt
  2017-08-17 21:10     ` Steven Rostedt
  0 siblings, 2 replies; 17+ messages in thread
From: Waiman Long @ 2017-08-17 16:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On 08/17/2017 09:34 AM, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 16:40:40 -0400
> Waiman Long <longman@redhat.com> wrote:
>
>> The lockdep code had reported the following unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(s_active#228);
>>                                lock(&bdev->bd_mutex/1);
>>                                lock(s_active#228);
>>   lock(&bdev->bd_mutex);
> Can you show the exact locations of these locks. I have no idea where
> this "s_active" is.
The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

>>  *** DEADLOCK ***
>>
>> The deadlock may happen when one task (CPU1) is trying to delete
>> a partition in a block device and another task (CPU0) is accessing
>> tracing sysfs file in that partition.
>>
>> To avoid that, accessing tracing sysfs file will now use a mutex
>> trylock loop and the operation will fail if a delete operation is
>> in progress.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>
>>  v2:
>>    - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.=

>>    - Check for signal in the mutex_trylock loops.
>>    - Use usleep() instead of schedule() for RT tasks.
> I'm sorry but I really do hate this patch.

Any suggestion on how to make it better?

>>  block/ioctl.c           |  3 +++
>>  include/linux/fs.h      |  1 +
>>  kernel/trace/blktrace.c | 39 +++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 41 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index 0de02ee..b920329 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -86,12 +86,15 @@ static int blkpg_ioctl(struct block_device *bdev, =
struct blkpg_ioctl_arg __user
>>  				return -EBUSY;
>>  			}
>>  			/* all seems OK */
>> +			smp_store_mb(bdev->bd_deleting, 1);
> No comment to explain what is happening here, and why.

I am going to add a comment block here.

>>  			fsync_bdev(bdevp);
>>  			invalidate_bdev(bdevp);
>> =20
>>  			mutex_lock_nested(&bdev->bd_mutex, 1);
>>  			delete_partition(disk, partno);
>>  			mutex_unlock(&bdev->bd_mutex);
>> +			smp_store_mb(bdev->bd_deleting, 0);
>> +
> ditto.
>
>>  			mutex_unlock(&bdevp->bd_mutex);
>>  			bdput(bdevp);
>> =20
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6e1fd5d..c2ba35e 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -427,6 +427,7 @@ struct block_device {
>>  #endif
>>  	struct block_device *	bd_contains;
>>  	unsigned		bd_block_size;
>> +	int			bd_deleting;
>>  	struct hd_struct *	bd_part;
>>  	/* number of times partitions within this device have been opened. *=
/
>>  	unsigned		bd_part_count;
>> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
>> index bc364f8..b2dffa9 100644
>> --- a/kernel/trace/blktrace.c
>> +++ b/kernel/trace/blktrace.c
>> @@ -27,6 +27,8 @@
>>  #include <linux/time.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/list.h>
>> +#include <linux/delay.h>
>> +#include <linux/sched/rt.h>
>> =20
>>  #include "../../block/blk.h"
>> =20
>> @@ -1605,6 +1607,23 @@ static struct request_queue *blk_trace_get_queu=
e(struct block_device *bdev)
>>  	return bdev_get_queue(bdev);
>>  }
>> =20
>> +/*
>> + * Read/write to the tracing sysfs file requires taking references to=
 the
> What's the "tracing sysfs" file? tracefs?

I am referring to blk_trace sysfs files which are used for enable
tracing of block operations.

>> + * sysfs file and then acquiring the bd_mutex. Deleting a block devic=
e
>> + * requires acquiring the bd_mutex and then waiting for all the sysfs=

>> + * references to be gone. This can lead to deadlock if both operation=
s
>> + * happen simultaneously. To avoid this problem, read/write to the
>> + * the tracing sysfs files can now fail if the bd_mutex cannot be
>> + * acquired while a deletion operation is in progress.
>> + *
>> + * A mutex trylock loop is used assuming that tracing sysfs operation=
s
> A mutex trylock loop is not enough to stop a deadlock. But I'm guessing=

> the undocumented bd_deleting may prevent that.

Yes, that is what the bd_deleting flag is for.

I was thinking about failing the sysfs operation after a certain number
of trylock attempts, but then it will require changes to user space code
to handle the occasional failures. Finally I decided to fail it only
when a delete operation is in progress as all hopes are lost in this case=
=2E

>> + * aren't frequently enough to cause any contention problem.
>> + *
>> + * For RT tasks, a running high priority task will prevent any lower
>> + * priority RT tasks from being run. So they do need to actually slee=
p
>> + * when the trylock fails to allow lower priority tasks to make forwa=
rd
>> + * progress.
>> + */
>>  static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
>>  					 struct device_attribute *attr,
>>  					 char *buf)
>> @@ -1622,7 +1641,15 @@ static ssize_t sysfs_blk_trace_attr_show(struct=
 device *dev,
>>  	if (q =3D=3D NULL)
>>  		goto out_bdput;
>> =20
>> -	mutex_lock(&bdev->bd_mutex);
>> +	while (!mutex_trylock(&bdev->bd_mutex)) {
>> +		if (READ_ONCE(bdev->bd_deleting))
>> +			goto out_bdput;
>> +		if (signal_pending(current)) {
>> +			ret =3D -EINTR;
>> +			goto out_bdput;
>> +		}
>> +		rt_task(current) ? usleep_range(10, 10) : schedule();
> We need to come up with a better solution. This is just a hack that
> circumvents a lot of the lockdep infrastructure.
>
> -- Steve

The root cause is the lock inversion under this circumstance. I think
modifying the blk_trace code has the least impact overall. I agree that
the code is ugly. If you have a better suggestion, I will certainly like
to hear it.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 16:24   ` Waiman Long
@ 2017-08-17 20:30     ` Steven Rostedt
  2017-08-17 20:41       ` Bart Van Assche
  2017-08-17 21:10     ` Steven Rostedt
  1 sibling, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 20:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 12:24:39 -0400
Waiman Long <longman@redhat.com> wrote:

> On 08/17/2017 09:34 AM, Steven Rostedt wrote:
> > On Wed, 16 Aug 2017 16:40:40 -0400
> > Waiman Long <longman@redhat.com> wrote:
> >  
> >> The lockdep code had reported the following unsafe locking scenario:
> >>
> >>        CPU0                    CPU1
> >>        ----                    ----
> >>   lock(s_active#228);
> >>                                lock(&bdev->bd_mutex/1);
> >>                                lock(s_active#228);
> >>   lock(&bdev->bd_mutex);  
> > Can you show the exact locations of these locks. I have no idea where
> > this "s_active" is.  
> The s_active isn't an actual lock. It is a reference count (kn->count)
> on the sysfs (kernfs) file. Removal of a sysfs file, however, require
> a wait until all the references are gone. The reference count is
> treated like a rwsem using lockdep instrumentation code.

Which kernel is this? I don't see any lockdep annotation around
kn->count (doing a git grep, I find it referenced in fs/kernfs/dir.c)

> 
> >>  *** DEADLOCK ***
> >>
> >> The deadlock may happen when one task (CPU1) is trying to delete
> >> a partition in a block device and another task (CPU0) is accessing
> >> tracing sysfs file in that partition.
> >>
> >> To avoid that, accessing tracing sysfs file will now use a mutex
> >> trylock loop and the operation will fail if a delete operation is
> >> in progress.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>
> >>  v2:
> >>    - Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
> >>    - Check for signal in the mutex_trylock loops.
> >>    - Use usleep() instead of schedule() for RT tasks.  
> > I'm sorry but I really do hate this patch.  
> 
> Any suggestion on how to make it better?

I'd like to be able to at least trigger the warning. And see the lock
issues. I wont be able to recommend anything until I understand what is
happening.


> The root cause is the lock inversion under this circumstance. I think
> modifying the blk_trace code has the least impact overall. I agree that
> the code is ugly. If you have a better suggestion, I will certainly like
> to hear it.

Again, I need to see where the issue lies before recommending something
else. I would hope there is a more elegant solution to this.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 20:30     ` Steven Rostedt
@ 2017-08-17 20:41       ` Bart Van Assche
  2017-08-17 20:56         ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2017-08-17 20:41 UTC (permalink / raw)
  To: longman@redhat.com, rostedt@goodmis.org
  Cc: bfields@fieldses.org, mingo@kernel.org, jlayton@poochiereds.net,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	axboe@kernel.dk, linux-fsdevel@vger.kernel.org

T24gVGh1LCAyMDE3LTA4LTE3IGF0IDE2OjMwIC0wNDAwLCBTdGV2ZW4gUm9zdGVkdCB3cm90ZToN
Cj4gT24gVGh1LCAxNyBBdWcgMjAxNyAxMjoyNDozOSAtMDQwMA0KPiBXYWltYW4gTG9uZyA8bG9u
Z21hbkByZWRoYXQuY29tPiB3cm90ZToNCj4gDQo+ID4gT24gMDgvMTcvMjAxNyAwOTozNCBBTSwg
U3RldmVuIFJvc3RlZHQgd3JvdGU6DQo+ID4gPiBPbiBXZWQsIDE2IEF1ZyAyMDE3IDE2OjQwOjQw
IC0wNDAwDQo+ID4gPiBXYWltYW4gTG9uZyA8bG9uZ21hbkByZWRoYXQuY29tPiB3cm90ZToNCj4g
PiA+ICANCj4gPiA+ID4gVGhlIGxvY2tkZXAgY29kZSBoYWQgcmVwb3J0ZWQgdGhlIGZvbGxvd2lu
ZyB1bnNhZmUgbG9ja2luZyBzY2VuYXJpbzoNCj4gPiA+ID4gDQo+ID4gPiA+ICAgICAgICBDUFUw
ICAgICAgICAgICAgICAgICAgICBDUFUxDQo+ID4gPiA+ICAgICAgICAtLS0tICAgICAgICAgICAg
ICAgICAgICAtLS0tDQo+ID4gPiA+ICAgbG9jayhzX2FjdGl2ZSMyMjgpOw0KPiA+ID4gPiAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgbG9jaygmYmRldi0+YmRfbXV0ZXgvMSk7DQo+ID4g
PiA+ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBsb2NrKHNfYWN0aXZlIzIyOCk7DQo+
ID4gPiA+ICAgbG9jaygmYmRldi0+YmRfbXV0ZXgpOyAgDQo+ID4gPiANCj4gPiA+IENhbiB5b3Ug
c2hvdyB0aGUgZXhhY3QgbG9jYXRpb25zIG9mIHRoZXNlIGxvY2tzLiBJIGhhdmUgbm8gaWRlYSB3
aGVyZQ0KPiA+ID4gdGhpcyAic19hY3RpdmUiIGlzLiAgDQo+ID4gDQo+ID4gVGhlIHNfYWN0aXZl
IGlzbid0IGFuIGFjdHVhbCBsb2NrLiBJdCBpcyBhIHJlZmVyZW5jZSBjb3VudCAoa24tPmNvdW50
KQ0KPiA+IG9uIHRoZSBzeXNmcyAoa2VybmZzKSBmaWxlLiBSZW1vdmFsIG9mIGEgc3lzZnMgZmls
ZSwgaG93ZXZlciwgcmVxdWlyZQ0KPiA+IGEgd2FpdCB1bnRpbCBhbGwgdGhlIHJlZmVyZW5jZXMg
YXJlIGdvbmUuIFRoZSByZWZlcmVuY2UgY291bnQgaXMNCj4gPiB0cmVhdGVkIGxpa2UgYSByd3Nl
bSB1c2luZyBsb2NrZGVwIGluc3RydW1lbnRhdGlvbiBjb2RlLg0KPiANCj4gV2hpY2gga2VybmVs
IGlzIHRoaXM/IEkgZG9uJ3Qgc2VlIGFueSBsb2NrZGVwIGFubm90YXRpb24gYXJvdW5kDQo+IGtu
LT5jb3VudCAoZG9pbmcgYSBnaXQgZ3JlcCwgSSBmaW5kIGl0IHJlZmVyZW5jZWQgaW4gZnMva2Vy
bmZzL2Rpci5jKQ0KDQpBcyBmYXIgYXMgSSBrbm93IHRoZSBzX2FjdGl2ZSBsb2NrZGVwIGFubm90
YXRpb25zIHdlcmUgaW50cm9kdWNlZCBpbiAyMDA3DQp0aHJvdWdoIGNvbW1pdCAwYWI2NjA4OGM4
NTUgKCJzeXNmczogaW1wbGVtZW50IHN5c2ZzX2RpcmVudCBhY3RpdmUgcmVmZXJlbmNlDQphbmQg
aW1tZWRpYXRlIGRpc2Nvbm5lY3QiKS4gVG9kYXkgdGhlc2UgYW5ub3RhdGlvbnMgZXhpc3QgaW4g
a2VybmZzOg0KJCBnaXQgZ3JlcCAtbkh3IGRlcF9tYXAgZnMva2VybmZzDQpmcy9rZXJuZnMvZGly
LmM6NDIxOgkJcndzZW1fYWNxdWlyZV9yZWFkKCZrbi0+ZGVwX21hcCwgMCwgMSwgX1JFVF9JUF8p
Ow0KZnMva2VybmZzL2Rpci5jOjQ0MToJCXJ3c2VtX3JlbGVhc2UoJmtuLT5kZXBfbWFwLCAxLCBf
UkVUX0lQXyk7DQpmcy9rZXJuZnMvZGlyLmM6NDY4OgkJcndzZW1fYWNxdWlyZSgma24tPmRlcF9t
YXAsIDAsIDAsIF9SRVRfSVBfKTsNCmZzL2tlcm5mcy9kaXIuYzo0NzA6CQkJbG9ja19jb250ZW5k
ZWQoJmtuLT5kZXBfbWFwLCBfUkVUX0lQXyk7DQpmcy9rZXJuZnMvZGlyLmM6NDc4OgkJbG9ja19h
Y3F1aXJlZCgma24tPmRlcF9tYXAsIF9SRVRfSVBfKTsNCmZzL2tlcm5mcy9kaXIuYzo0Nzk6CQly
d3NlbV9yZWxlYXNlKCZrbi0+ZGVwX21hcCwgMSwgX1JFVF9JUF8pOw0KZnMva2VybmZzL2Rpci5j
OjEzODU6CQlyd3NlbV9hY3F1aXJlKCZrbi0+ZGVwX21hcCwgMCwgMSwgX1JFVF9JUF8pOw0KZnMv
a2VybmZzL2ZpbGUuYzoxMDAwOgkJbG9ja2RlcF9pbml0X21hcCgma24tPmRlcF9tYXAsICJzX2Fj
dGl2ZSIsIGtleSwgMCk7DQoNCkJhcnQu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 20:41       ` Bart Van Assche
@ 2017-08-17 20:56         ` Steven Rostedt
  0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 20:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: longman@redhat.com, bfields@fieldses.org, mingo@kernel.org,
	jlayton@poochiereds.net, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, axboe@kernel.dk,
	linux-fsdevel@vger.kernel.org

On Thu, 17 Aug 2017 20:41:10 +0000
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2017-08-17 at 16:30 -0400, Steven Rostedt wrote:
> > On Thu, 17 Aug 2017 12:24:39 -0400
> > Waiman Long <longman@redhat.com> wrote:
> >   
> > > On 08/17/2017 09:34 AM, Steven Rostedt wrote:  
> > > > On Wed, 16 Aug 2017 16:40:40 -0400
> > > > Waiman Long <longman@redhat.com> wrote:
> > > >    
> > > > > The lockdep code had reported the following unsafe locking scenario:
> > > > > 
> > > > >        CPU0                    CPU1
> > > > >        ----                    ----
> > > > >   lock(s_active#228);
> > > > >                                lock(&bdev->bd_mutex/1);
> > > > >                                lock(s_active#228);
> > > > >   lock(&bdev->bd_mutex);    
> > > > 
> > > > Can you show the exact locations of these locks. I have no idea where
> > > > this "s_active" is.    
> > > 
> > > The s_active isn't an actual lock. It is a reference count (kn->count)
> > > on the sysfs (kernfs) file. Removal of a sysfs file, however, require
> > > a wait until all the references are gone. The reference count is
> > > treated like a rwsem using lockdep instrumentation code.  
> > 
> > Which kernel is this? I don't see any lockdep annotation around
> > kn->count (doing a git grep, I find it referenced in fs/kernfs/dir.c)  
> 
> As far as I know the s_active lockdep annotations were introduced in 2007
> through commit 0ab66088c855 ("sysfs: implement sysfs_dirent active reference
> and immediate disconnect"). Today these annotations exist in kernfs:
> $ git grep -nHw dep_map fs/kernfs
> fs/kernfs/dir.c:421:		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
> fs/kernfs/dir.c:441:		rwsem_release(&kn->dep_map, 1, _RET_IP_);
> fs/kernfs/dir.c:468:		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
> fs/kernfs/dir.c:470:			lock_contended(&kn->dep_map, _RET_IP_);
> fs/kernfs/dir.c:478:		lock_acquired(&kn->dep_map, _RET_IP_);
> fs/kernfs/dir.c:479:		rwsem_release(&kn->dep_map, 1, _RET_IP_);
> fs/kernfs/dir.c:1385:		rwsem_acquire(&kn->dep_map, 0, 1, _RET_IP_);
> fs/kernfs/file.c:1000:		lockdep_init_map(&kn->dep_map, "s_active", key, 0);
> 

Ah thanks. I was searching for the wrong thing. :-/

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 16:24   ` Waiman Long
  2017-08-17 20:30     ` Steven Rostedt
@ 2017-08-17 21:10     ` Steven Rostedt
  2017-08-17 21:27       ` Waiman Long
  2017-08-17 21:30       ` Steven Rostedt
  1 sibling, 2 replies; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 21:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 12:24:39 -0400
Waiman Long <longman@redhat.com> wrote:

> >> + * sysfs file and then acquiring the bd_mutex. Deleting a block device
> >> + * requires acquiring the bd_mutex and then waiting for all the sysfs
> >> + * references to be gone. This can lead to deadlock if both operations
> >> + * happen simultaneously. To avoid this problem, read/write to the
> >> + * the tracing sysfs files can now fail if the bd_mutex cannot be
> >> + * acquired while a deletion operation is in progress.
> >> + *
> >> + * A mutex trylock loop is used assuming that tracing sysfs operations  
> > A mutex trylock loop is not enough to stop a deadlock. But I'm guessing
> > the undocumented bd_deleting may prevent that.  
> 
> Yes, that is what the bd_deleting flag is for.
> 
> I was thinking about failing the sysfs operation after a certain number
> of trylock attempts, but then it will require changes to user space code
> to handle the occasional failures. Finally I decided to fail it only
> when a delete operation is in progress as all hopes are lost in this case.

Actually, why not just fail the attr read on deletion? If it is being
deleted, and one is reading the attribute, perhaps -ENODEV is the
proper return value?

> 
> The root cause is the lock inversion under this circumstance. I think
> modifying the blk_trace code has the least impact overall. I agree that
> the code is ugly. If you have a better suggestion, I will certainly like
> to hear it.

Instead of playing games with taking the lock, the only way this race
is hit, is if the partition is being deleted and the sysfs attribute is
being read at the same time, correct? In that case, just return
-ENODEV, and be done with it.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 21:10     ` Steven Rostedt
@ 2017-08-17 21:27       ` Waiman Long
  2017-08-17 22:13         ` Steven Rostedt
  2017-08-17 21:30       ` Steven Rostedt
  1 sibling, 1 reply; 17+ messages in thread
From: Waiman Long @ 2017-08-17 21:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On 08/17/2017 05:10 PM, Steven Rostedt wrote:
> On Thu, 17 Aug 2017 12:24:39 -0400
> Waiman Long <longman@redhat.com> wrote:
>
>>>> + * sysfs file and then acquiring the bd_mutex. Deleting a block dev=
ice
>>>> + * requires acquiring the bd_mutex and then waiting for all the sys=
fs
>>>> + * references to be gone. This can lead to deadlock if both operati=
ons
>>>> + * happen simultaneously. To avoid this problem, read/write to the
>>>> + * the tracing sysfs files can now fail if the bd_mutex cannot be
>>>> + * acquired while a deletion operation is in progress.
>>>> + *
>>>> + * A mutex trylock loop is used assuming that tracing sysfs operati=
ons =20
>>> A mutex trylock loop is not enough to stop a deadlock. But I'm guessi=
ng
>>> the undocumented bd_deleting may prevent that. =20
>> Yes, that is what the bd_deleting flag is for.
>>
>> I was thinking about failing the sysfs operation after a certain numbe=
r
>> of trylock attempts, but then it will require changes to user space co=
de
>> to handle the occasional failures. Finally I decided to fail it only
>> when a delete operation is in progress as all hopes are lost in this c=
ase.
> Actually, why not just fail the attr read on deletion? If it is being
> deleted, and one is reading the attribute, perhaps -ENODEV is the
> proper return value?
>
>> The root cause is the lock inversion under this circumstance. I think
>> modifying the blk_trace code has the least impact overall. I agree tha=
t
>> the code is ugly. If you have a better suggestion, I will certainly li=
ke
>> to hear it.
> Instead of playing games with taking the lock, the only way this race
> is hit, is if the partition is being deleted and the sysfs attribute is=

> being read at the same time, correct? In that case, just return
> -ENODEV, and be done with it.
>
> -- Steve

It is actually what the patch is trying to do by checking for the
deletion flag in the mutex_trylock loop. Please note that mutex does not
guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
mutex_lock() and wait for it while cpu2 can set the deletion flag later
and get the mutex first before cpu1. So checking for the deletion flag
before taking the mutex isn't enough.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 21:10     ` Steven Rostedt
  2017-08-17 21:27       ` Waiman Long
@ 2017-08-17 21:30       ` Steven Rostedt
  2017-08-18 13:55         ` Waiman Long
  1 sibling, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 21:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 17:10:07 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:


> Instead of playing games with taking the lock, the only way this race
> is hit, is if the partition is being deleted and the sysfs attribute is
> being read at the same time, correct? In that case, just return
> -ENODEV, and be done with it.

Nevermind that wont work. Too bad there's not a mutex_lock_timeout()
that we could use in a loop. It would solve the issue of forward
progress with RT tasks, and will break after a timeout in case of
deadlock.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 21:27       ` Waiman Long
@ 2017-08-17 22:13         ` Steven Rostedt
  2017-08-17 22:18           ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 22:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 17:27:22 -0400
Waiman Long <longman@redhat.com> wrote:


> It is actually what the patch is trying to do by checking for the
> deletion flag in the mutex_trylock loop. Please note that mutex does not
> guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
> mutex_lock() and wait for it while cpu2 can set the deletion flag later
> and get the mutex first before cpu1. So checking for the deletion flag
> before taking the mutex isn't enough.

Yeah, I figured that out already (crossed emails). BTW, how did you
trigger this warning. I'm playing around with adding loop devices,
volume groups, and logical volumes, and reading the trace files
created in the sysfs directory, then removing those items, but it's
not triggering the "delete" path. What operation deletes the partition?

Thanks

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 22:13         ` Steven Rostedt
@ 2017-08-17 22:18           ` Steven Rostedt
  2017-08-17 23:23             ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 22:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 18:13:20 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 17 Aug 2017 17:27:22 -0400
> Waiman Long <longman@redhat.com> wrote:
> 
> 
> > It is actually what the patch is trying to do by checking for the
> > deletion flag in the mutex_trylock loop. Please note that mutex does not
> > guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
> > mutex_lock() and wait for it while cpu2 can set the deletion flag later
> > and get the mutex first before cpu1. So checking for the deletion flag
> > before taking the mutex isn't enough.  
> 
> Yeah, I figured that out already (crossed emails). BTW, how did you
> trigger this warning. I'm playing around with adding loop devices,
> volume groups, and logical volumes, and reading the trace files
> created in the sysfs directory, then removing those items, but it's
> not triggering the "delete" path. What operation deletes the partition?

I'm guessing that deleting an actual partition may work (unfortunately,
my test box has no partition to delete ;-) I'll find another box to
test on.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 22:18           ` Steven Rostedt
@ 2017-08-17 23:23             ` Steven Rostedt
  2017-08-18 13:42               ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2017-08-17 23:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On Thu, 17 Aug 2017 18:18:18 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 17 Aug 2017 18:13:20 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Thu, 17 Aug 2017 17:27:22 -0400
> > Waiman Long <longman@redhat.com> wrote:
> > 
> >   
> > > It is actually what the patch is trying to do by checking for the
> > > deletion flag in the mutex_trylock loop. Please note that mutex does not
> > > guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
> > > mutex_lock() and wait for it while cpu2 can set the deletion flag later
> > > and get the mutex first before cpu1. So checking for the deletion flag
> > > before taking the mutex isn't enough.    
> > 
> > Yeah, I figured that out already (crossed emails). BTW, how did you
> > trigger this warning. I'm playing around with adding loop devices,
> > volume groups, and logical volumes, and reading the trace files
> > created in the sysfs directory, then removing those items, but it's
> > not triggering the "delete" path. What operation deletes the partition?  
> 
> I'm guessing that deleting an actual partition may work (unfortunately,
> my test box has no partition to delete ;-) I'll find another box to
> test on.
>

OK, deleting a partition doesn't trigger the lockdep splat. But I also
added a printk in the BLKPG_DEL_PARTITION case switch, which also
doesn't print. What command do I need to do to trigger that path?

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 23:23             ` Steven Rostedt
@ 2017-08-18 13:42               ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-08-18 13:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1659 bytes --]

On 08/17/2017 07:23 PM, Steven Rostedt wrote:
> On Thu, 17 Aug 2017 18:18:18 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Thu, 17 Aug 2017 18:13:20 -0400
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>> On Thu, 17 Aug 2017 17:27:22 -0400
>>> Waiman Long <longman@redhat.com> wrote:
>>>
>>>   
>>>> It is actually what the patch is trying to do by checking for the
>>>> deletion flag in the mutex_trylock loop. Please note that mutex does not
>>>> guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
>>>> mutex_lock() and wait for it while cpu2 can set the deletion flag later
>>>> and get the mutex first before cpu1. So checking for the deletion flag
>>>> before taking the mutex isn't enough.    
>>> Yeah, I figured that out already (crossed emails). BTW, how did you
>>> trigger this warning. I'm playing around with adding loop devices,
>>> volume groups, and logical volumes, and reading the trace files
>>> created in the sysfs directory, then removing those items, but it's
>>> not triggering the "delete" path. What operation deletes the partition?  
>> I'm guessing that deleting an actual partition may work (unfortunately,
>> my test box has no partition to delete ;-) I'll find another box to
>> test on.
>>
> OK, deleting a partition doesn't trigger the lockdep splat. But I also
> added a printk in the BLKPG_DEL_PARTITION case switch, which also
> doesn't print. What command do I need to do to trigger that path?
>
> Thanks,
>
> -- Steve

Attached is a reproducer that was used to trigger the warning. Some
tuning may be needed depend on the actual configuration of the test machine.

Cheers,
Longman


[-- Attachment #2: run_test.sh --]
[-- Type: application/x-shellscript, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-17 21:30       ` Steven Rostedt
@ 2017-08-18 13:55         ` Waiman Long
  2017-08-18 16:21           ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2017-08-18 13:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jens Axboe, Jeff Layton, J. Bruce Fields, Ingo Molnar,
	linux-kernel, linux-block, linux-fsdevel

On 08/17/2017 05:30 PM, Steven Rostedt wrote:
> On Thu, 17 Aug 2017 17:10:07 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>
>> Instead of playing games with taking the lock, the only way this race
>> is hit, is if the partition is being deleted and the sysfs attribute i=
s
>> being read at the same time, correct? In that case, just return
>> -ENODEV, and be done with it.
> Nevermind that wont work. Too bad there's not a mutex_lock_timeout()
> that we could use in a loop. It would solve the issue of forward
> progress with RT tasks, and will break after a timeout in case of
> deadlock.
>
> -- Steve

I think it will be useful to have mutex_timed_lock(). RT-mutex does have
a timed version, so I guess it shouldn't be hard to implement one for
mutex. I can take a shot at trying to do that.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-18 13:55         ` Waiman Long
@ 2017-08-18 16:21           ` Bart Van Assche
  2017-08-18 17:22             ` Waiman Long
  2017-08-18 17:26             ` Steven Rostedt
  0 siblings, 2 replies; 17+ messages in thread
From: Bart Van Assche @ 2017-08-18 16:21 UTC (permalink / raw)
  To: longman@redhat.com, rostedt@goodmis.org
  Cc: bfields@fieldses.org, mingo@kernel.org, jlayton@poochiereds.net,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	axboe@kernel.dk, linux-fsdevel@vger.kernel.org

T24gRnJpLCAyMDE3LTA4LTE4IGF0IDA5OjU1IC0wNDAwLCBXYWltYW4gTG9uZyB3cm90ZToNCj4g
T24gMDgvMTcvMjAxNyAwNTozMCBQTSwgU3RldmVuIFJvc3RlZHQgd3JvdGU6DQo+ID4gT24gVGh1
LCAxNyBBdWcgMjAxNyAxNzoxMDowNyAtMDQwMA0KPiA+IFN0ZXZlbiBSb3N0ZWR0IDxyb3N0ZWR0
QGdvb2RtaXMub3JnPiB3cm90ZToNCj4gPiA+IEluc3RlYWQgb2YgcGxheWluZyBnYW1lcyB3aXRo
IHRha2luZyB0aGUgbG9jaywgdGhlIG9ubHkgd2F5IHRoaXMgcmFjZQ0KPiA+ID4gaXMgaGl0LCBp
cyBpZiB0aGUgcGFydGl0aW9uIGlzIGJlaW5nIGRlbGV0ZWQgYW5kIHRoZSBzeXNmcyBhdHRyaWJ1
dGUgaXMNCj4gPiA+IGJlaW5nIHJlYWQgYXQgdGhlIHNhbWUgdGltZSwgY29ycmVjdD8gSW4gdGhh
dCBjYXNlLCBqdXN0IHJldHVybg0KPiA+ID4gLUVOT0RFViwgYW5kIGJlIGRvbmUgd2l0aCBpdC4N
Cj4gPiANCj4gPiBOZXZlcm1pbmQgdGhhdCB3b250IHdvcmsuIFRvbyBiYWQgdGhlcmUncyBub3Qg
YSBtdXRleF9sb2NrX3RpbWVvdXQoKQ0KPiA+IHRoYXQgd2UgY291bGQgdXNlIGluIGEgbG9vcC4g
SXQgd291bGQgc29sdmUgdGhlIGlzc3VlIG9mIGZvcndhcmQNCj4gPiBwcm9ncmVzcyB3aXRoIFJU
IHRhc2tzLCBhbmQgd2lsbCBicmVhayBhZnRlciBhIHRpbWVvdXQgaW4gY2FzZSBvZg0KPiA+IGRl
YWRsb2NrLg0KPiANCj4gSSB0aGluayBpdCB3aWxsIGJlIHVzZWZ1bCB0byBoYXZlIG11dGV4X3Rp
bWVkX2xvY2soKS4gUlQtbXV0ZXggZG9lcyBoYXZlDQo+IGEgdGltZWQgdmVyc2lvbiwgc28gSSBn
dWVzcyBpdCBzaG91bGRuJ3QgYmUgaGFyZCB0byBpbXBsZW1lbnQgb25lIGZvcg0KPiBtdXRleC4g
SSBjYW4gdGFrZSBhIHNob3QgYXQgdHJ5aW5nIHRvIGRvIHRoYXQuDQoNCihqdXN0IGNhdWdodCB1
cCB3aXRoIHRoZSBlbnRpcmUgZS1tYWlsIHRocmVhZCkNCg0KU29ycnkgV2FpbWFuIGJ1dCBwZXJz
b25hbGx5IEkgdGhvcm91Z2hseSBkZXRlc3QgbG9vcHMgYXJvdW5kIG11dGV4X3RyeWxvY2soKSBv
cg0KbXV0ZXhfdGltZWRfbG9jaygpIGJlY2F1c2Ugc3VjaCBsb29wcyBhcmUgdXN1YWxseSB1c2Vk
IHRvIHBhcGVyIG92ZXIgYSBwcm9ibGVtDQppbnN0ZWFkIG9mIGZpeGluZyB0aGUgcm9vdCBjYXVz
ZS4gV2hhdCBJIHVuZGVyc3Rvb2QgZnJvbSB0aGUgY29tbWVudCBpbiB2MSBvZiB5b3VyDQpwYXRj
aCBpcyB0aGF0IGJkX211dGV4IGlzIG5vdCBvbmx5IGhlbGQgZHVyaW5nIGJsb2NrIGRldmljZSBj
cmVhdGlvbiBhbmQgcmVtb3ZhbA0KYnV0IGFkZGl0aW9uYWxseSB0aGF0IGJkX211dGV4IGlzIG9i
dGFpbmVkIGluc2lkZSBzeXNmcyBhdHRyaWJ1dGUgY2FsbGJhY2sgbWV0aG9kcz8NClRoYXQgcGF0
dGVybiBpcyBndWFyYW50ZWVkIHRvIGxlYWQgdG8gZGVhZGxvY2tzLiBTaW5jZSB0aGUgYmxvY2sg
ZGV2aWNlIHJlbW92YWwNCmNvZGUgd2FpdHMgdW50aWwgYWxsIHN5c2ZzIGNhbGxiYWNrIG1ldGhv
ZHMgaGF2ZSBmaW5pc2hlZCB0aGVyZSBpcyBubyBuZWVkIHRvDQpwcm90ZWN0IGFnYWluc3QgYmxv
Y2sgZGV2aWNlIHJlbW92YWwgaW5zaWRlIHRoZSBzeXNmcyBjYWxsYmFjayBtZXRob2RzLiBNeSBw
cm9wb3NhbA0KaXMgdG8gc3BsaXQgYmRfbXV0ZXg6IG9uZSBnbG9iYWwgbXV0ZXggdGhhdCBzZXJp
YWxpemVzIGJsb2NrIGRldmljZSBjcmVhdGlvbiBhbmQNCnJlbW92YWwgYW5kIG9uZSBtdXRleCBw
ZXIgYmxvY2sgZGV2aWNlIHRoYXQgc2VyaWFsaXplcyBjaGFuZ2VzIHRvIGEgc2luZ2xlIGJsb2Nr
DQpkZXZpY2UuIE9idGFpbmluZyB0aGUgZ2xvYmFsIG11dGV4IGZyb20gaW5zaWRlIGEgYmxvY2sg
ZGV2aWNlIHN5c2ZzIGNhbGxiYWNrDQpmdW5jdGlvbiBpcyBub3Qgc2FmZSBidXQgb2J0YWluaW5n
IHRoZSBwZXItYmxvY2stZGV2aWNlIG11dGV4IGZyb20gaW5zaWRlIGEgc3lzZnMNCmNhbGxiYWNr
IGZ1bmN0aW9uIGlzIHNhZmUuDQoNCkJhcnQu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-18 16:21           ` Bart Van Assche
@ 2017-08-18 17:22             ` Waiman Long
  2017-08-18 17:26             ` Steven Rostedt
  1 sibling, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-08-18 17:22 UTC (permalink / raw)
  To: Bart Van Assche, rostedt@goodmis.org
  Cc: bfields@fieldses.org, mingo@kernel.org, jlayton@poochiereds.net,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	axboe@kernel.dk, linux-fsdevel@vger.kernel.org

On 08/18/2017 12:21 PM, Bart Van Assche wrote:
> On Fri, 2017-08-18 at 09:55 -0400, Waiman Long wrote:
>> On 08/17/2017 05:30 PM, Steven Rostedt wrote:
>>> On Thu, 17 Aug 2017 17:10:07 -0400
>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>> Instead of playing games with taking the lock, the only way this rac=
e
>>>> is hit, is if the partition is being deleted and the sysfs attribute=
 is
>>>> being read at the same time, correct? In that case, just return
>>>> -ENODEV, and be done with it.
>>> Nevermind that wont work. Too bad there's not a mutex_lock_timeout()
>>> that we could use in a loop. It would solve the issue of forward
>>> progress with RT tasks, and will break after a timeout in case of
>>> deadlock.
>> I think it will be useful to have mutex_timed_lock(). RT-mutex does ha=
ve
>> a timed version, so I guess it shouldn't be hard to implement one for
>> mutex. I can take a shot at trying to do that.
> (just caught up with the entire e-mail thread)
>
> Sorry Waiman but personally I thoroughly detest loops around mutex_tryl=
ock() or
> mutex_timed_lock() because such loops are usually used to paper over a =
problem
> instead of fixing the root cause. What I understood from the comment in=
 v1 of your
> patch is that bd_mutex is not only held during block device creation an=
d removal
> but additionally that bd_mutex is obtained inside sysfs attribute callb=
ack methods?
> That pattern is guaranteed to lead to deadlocks. Since the block device=
 removal
> code waits until all sysfs callback methods have finished there is no n=
eed to
> protect against block device removal inside the sysfs callback methods.=
 My proposal

You are right. We don't really need to take the bd_mutex as the fact
that inside the sysfs callback method will guarantee the block device
won't go away.

> is to split bd_mutex: one global mutex that serializes block device cre=
ation and
> removal and one mutex per block device that serializes changes to a sin=
gle block
> device. Obtaining the global mutex from inside a block device sysfs cal=
lback
> function is not safe but obtaining the per-block-device mutex from insi=
de a sysfs
> callback function is safe.
>
> Bart.

The bd_mutex we are talking here is already per block device. I am
thinking about having a global blktrace mutex that is used to serialize
the read and write of blktrace attributes. Since blktrace sysfs files
are not supposed to be frequently accessed, having a global lock
shouldn't cause any problem.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops
  2017-08-18 16:21           ` Bart Van Assche
  2017-08-18 17:22             ` Waiman Long
@ 2017-08-18 17:26             ` Steven Rostedt
  1 sibling, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2017-08-18 17:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: longman@redhat.com, bfields@fieldses.org, mingo@kernel.org,
	jlayton@poochiereds.net, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, axboe@kernel.dk,
	linux-fsdevel@vger.kernel.org

On Fri, 18 Aug 2017 16:21:46 +0000
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> Sorry Waiman but personally I thoroughly detest loops around mutex_trylock() or
> mutex_timed_lock() because such loops are usually used to paper over a problem
> instead of fixing the root cause.

100% agree.

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-08-18 17:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-16 20:40 [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops Waiman Long
2017-08-17 13:34 ` Steven Rostedt
2017-08-17 16:24   ` Waiman Long
2017-08-17 20:30     ` Steven Rostedt
2017-08-17 20:41       ` Bart Van Assche
2017-08-17 20:56         ` Steven Rostedt
2017-08-17 21:10     ` Steven Rostedt
2017-08-17 21:27       ` Waiman Long
2017-08-17 22:13         ` Steven Rostedt
2017-08-17 22:18           ` Steven Rostedt
2017-08-17 23:23             ` Steven Rostedt
2017-08-18 13:42               ` Waiman Long
2017-08-17 21:30       ` Steven Rostedt
2017-08-18 13:55         ` Waiman Long
2017-08-18 16:21           ` Bart Van Assche
2017-08-18 17:22             ` Waiman Long
2017-08-18 17:26             ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).