* Race condition between "read CFQ stats" and "block device shutdown"
@ 2013-09-03 20:14 Anatol Pomozov
[not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-03 20:14 UTC (permalink / raw)
To: Cgroups, Tejun Heo, Hannes Reinecke, Jens Axboe
Hi,
I am running a program that checks "read CFQ stat files" for race
conditions with other events (e.g. device shutdown).

And I discovered an interesting bug. Here is the "double unlock" crash
trace for it:
print_unlock_imbalance_bug.isra.23+0x4/0x10
[ 261.453775] [<ffffffff810f7c65>] lock_release_non_nested.isra.39+0x2f5/0x300
[ 261.460900] [<ffffffff810f7cfe>] lock_release+0x8e/0x1f0
[ 261.466293] [<ffffffff81339030>] ? cfqg_prfill_service_level+0x60/0x60
[ 261.472894] [<ffffffff81005be3>] _raw_spin_unlock_irq+0x23/0x50
[ 261.478894] [<ffffffff8133559f>] blkcg_print_blkgs+0x8f/0x140
[ 261.484724] [<ffffffff81335515>] ? blkcg_print_blkgs+0x5/0x140
[ 261.490631] [<ffffffff81338a7f>] cfqg_print_weighted_queue_time+0x2f/0x40
[ 261.497489] [<ffffffff8110b793>] cgroup_seqfile_show+0x53/0x60
[ 261.503398] [<ffffffff811f1fe4>] seq_read+0x124/0x3a0
[ 261.508529] [<ffffffff811ce39d>] vfs_read+0xad/0x180
[ 261.513576] [<ffffffff811ce625>] SyS_read+0x55/0xa0
[ 261.518538] [<ffffffff81609f66>] cstar_dispatch+0x7/0x1f
blkcg_print_blkgs fails with double unlock? Hmm, I checked
cfqg_prfill_service_level and I did not find any places where unlock
can happen.
After some debugging I found that in blkcg_print_blkgs() the spinlock
passed to spin_lock_irq() differs from the object passed to
spin_unlock_irq() just a few lines below. It means the
request_queue->queue_lock spinlock changed out from under the function
while it was executing!

To make sure, I added the following debug patch:
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -465,10 +465,16 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
- spin_lock_irq(blkg->q->queue_lock);
+ spinlock_t *lock = blkg->q->queue_lock;
+ spinlock_t *new_lock;
+ spin_lock_irq(lock);
if (blkcg_policy_enabled(blkg->q, pol))
total += prfill(sf, blkg->pd[pol->plid], data);
- spin_unlock_irq(blkg->q->queue_lock);
+ new_lock = blkg->q->queue_lock;
+ if (lock != new_lock) {
+ pr_err("old lock %p %s new lock %p %s\n",
+        lock, lock->dep_map.name, new_lock, new_lock->dep_map.name);
+ }
+ spin_unlock_irq(lock);
}
rcu_read_unlock();
And indeed it shows the locks are different.

It comes from commit 777eb1bf1 "block: Free queue resources at
blk_release_queue()", which changes the lock when the device is
shutting down.

What would be the best fix for the issue?
^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-04  6:42   ` Hannes Reinecke
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-04 6:42 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Cgroups, Tejun Heo, Jens Axboe

On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
> Hi,
>
> I am running a program that checks "read CFQ stat files" for race
> conditions with other events (e.g. device shutdown).
>
> And I discovered an interesting bug. Here is the "double unlock" crash
> trace for it:
>
> [...]
>
> blkcg_print_blkgs fails with double unlock? Hmm, I checked
> cfqg_prfill_service_level and I did not find any places where unlock
> can happen.
>
> After some debugging I found that in blkcg_print_blkgs() the spinlock
> passed to spin_lock_irq() differs from the object passed to
> spin_unlock_irq() just a few lines below. It means the
> request_queue->queue_lock spinlock changed out from under the function
> while it was executing!
>
> To make sure, I added the following debug patch:
>
> [...]
>
> And indeed it shows the locks are different.
>
> It comes from commit 777eb1bf1 "block: Free queue resources at
> blk_release_queue()", which changes the lock when the device is
> shutting down.
>
> What would be the best fix for the issue?
>
The correct fix would be to add checks for 'blkg->q'; the mentioned
lock reassignment can only happen during queue shutdown.
So whenever the queue is dead or stopping we should refuse to print
anything here.

Try this:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 290792a..3e17841 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,

 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+		if (unlikely(blk_queue_dying(blkg->q)))
+			continue;
 		spin_lock_irq(blkg->q->queue_lock);
 		if (blkcg_policy_enabled(blkg->q, pol))
 			total += prfill(sf, blkg->pd[pol->plid], data);

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org	+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 12+ messages in thread
[parent not found: <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
@ 2013-09-04 15:45       ` Anatol Pomozov
  [not found]         ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-04 16:15       ` Anatol Pomozov
  1 sibling, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-04 15:45 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Cgroups, Tejun Heo, Jens Axboe

Hi

On Tue, Sep 3, 2013 at 11:42 PM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
>> [...]
>>
>> What would be the best fix for the issue?
>>
> The correct fix would be to add checks for 'blkg->q'; the mentioned
> lock reassignment can only happen during queue shutdown.
> So whenever the queue is dead or stopping we should refuse to print
> anything here.
>
> Try this:
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 290792a..3e17841 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
>
>  	rcu_read_lock();
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +		if (unlikely(blk_queue_dying(blkg->q)))
> +			continue;

I am not an expert in block code, so I have a few questions here:

- are we sure that this operation is atomic? What if blkg->q becomes
dead right after we checked it, and blkg->q->queue_lock got invalid so
we have the same crash as before?

- do we need to check blk_queue_dying() in every single place where we
iterate over blkcg->blkg_list?

- would it be better to remove the block cgroup from blkcg->blkg_list
before the queue is marked as dead?

>  		spin_lock_irq(blkg->q->queue_lock);
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			total += prfill(sf, blkg->pd[pol->plid], data);

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]         ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-04 16:07           ` Tejun Heo
  [not found]             ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-09-04 16:07 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe

Hello,

On Wed, Sep 04, 2013 at 08:45:33AM -0700, Anatol Pomozov wrote:
> I am not an expert in block code, so I have a few questions here:
>
> - are we sure that this operation is atomic? What if blkg->q becomes
> dead right after we checked it, and blkg->q->queue_lock got invalid so
> we have the same crash as before?

request_queue lock switching is something inherently broken in block
layer.  It's unsalvageable.  Maybe we can drop lock switching once
blk-mq is fully merged.

> - do we need to check blk_queue_dying() in every single place where we
> iterate over blkcg->blkg_list?
>
> - would it be better to remove the block cgroup from blkcg->blkg_list
> before the queue is marked as dead?

I have no idea.  No matter what we do, it's working around something
fundamentally broken in a half-assed way.  Hmmm... there was an effort
to remove the lock switching.  I don't think it actually happened
tho. :(

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]             ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-09-25 20:37               ` Anatol Pomozov
  2013-09-26 13:54                 ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-25 20:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Hannes Reinecke, Cgroups, Jens Axboe

Hi

On Wed, Sep 4, 2013 at 9:07 AM, Tejun Heo
<tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Sep 04, 2013 at 08:45:33AM -0700, Anatol Pomozov wrote:
>> I am not an expert in block code, so I have a few questions here:
>>
>> - are we sure that this operation is atomic? What if blkg->q becomes
>> dead right after we checked it, and blkg->q->queue_lock got invalid so
>> we have the same crash as before?
>
> request_queue lock switching is something inherently broken in block
> layer.  It's unsalvageable.

Fully agree. The problem is that request_queue->queue_lock is a shared
resource that is concurrently modified and accessed. In this case (one
thread changes it while another thread accesses it) we need
synchronization to prevent race conditions. So we would need a
spin_lock just to access the queue_lock spin_lock, otherwise we get a
crash like the one above...

> Maybe we can drop lock switching once blk-mq is fully merged.

Could you please provide more information about it? What is the timeline?

If there is an easy way to fix the race condition I would like to
help. Please give me some pointers on what direction I should move in.

PS Just a little bit of context on why I care about this bug. We test a
large farm that actively uses iscsi. We are going to have a lot of
iscsi device startups/shutdowns. I am testing whether this codepath has
race conditions, and I found the one above.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-25 20:37               ` Anatol Pomozov
@ 2013-09-26 13:54                 ` Tejun Heo
  2013-09-26 14:18                   ` Hannes Reinecke
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 13:54 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe, linux-scsi

Hello, (cc'ing linux-scsi)

On Wed, Sep 25, 2013 at 01:37:51PM -0700, Anatol Pomozov wrote:
> [...]
>
>> Maybe we can drop lock switching once blk-mq is fully merged.
>
> Could you please provide more information about it? What is the timeline?

I have no idea.  Hopefully, not too far out.  Jens would have better
idea.

> If there is an easy way to fix the race condition I would like to
> help. Please give me some pointers on what direction I should move in.

The first step would be identifying who are actually making use of
lock switching, why and how much difference it would make for them to
not do that.

> PS Just a little bit of context on why I care about this bug. We test a
> large farm that actively uses iscsi. We are going to have a lot of
> iscsi device startups/shutdowns. I am testing whether this codepath has
> race conditions, and I found the one above.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 13:54                 ` Tejun Heo
@ 2013-09-26 14:18                   ` Hannes Reinecke
  2013-09-26 14:20                     ` Tejun Heo
  [not found]                     ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-26 14:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Anatol Pomozov, Cgroups, Jens Axboe, linux-scsi

On 09/26/2013 03:54 PM, Tejun Heo wrote:
> [...]
>
>> If there is an easy way to fix the race condition I would like to
>> help. Please give me some pointers on what direction I should move in.
>
> The first step would be identifying who are actually making use of
> lock switching, why and how much difference it would make for them to
> not do that.
>
Typically, the lock is being used by the block drivers to synchronize
access between some internal data structures and the request queue
itself. You don't actually _need_ to do it that way, but removing the
lock switching would involve quite some redesign of these drivers.
Given that most of them are rather oldish I really wouldn't want to
touch them.

However, none of the modern devices should be using this lock
switching, so I would just ignore it.
EG SCSI most definitely doesn't use it.

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare@suse.de			+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 14:18                   ` Hannes Reinecke
@ 2013-09-26 14:20                     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 14:20 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Anatol Pomozov, Cgroups, Jens Axboe, linux-scsi

Hey, Hannes.

On Thu, Sep 26, 2013 at 04:18:34PM +0200, Hannes Reinecke wrote:
> However, none of the modern devices should be using this lock
> switching, so I would just ignore it.
> EG SCSI most definitely doesn't use it.

The kernel is crashing from it, so I don't think ignoring is an
acceptable strategy here.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]                     ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
@ 2013-09-26 16:23                       ` Anatol Pomozov
  2013-09-26 16:30                         ` Tejun Heo
  [not found]                         ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-26 16:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Tejun Heo, Cgroups, Jens Axboe, linux-scsi-u79uwXL29TY76Z2rM5mHXA

Hi

On Thu, Sep 26, 2013 at 7:18 AM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> [...]
>
> Typically, the lock is being used by the block drivers to synchronize
> access between some internal data structures and the request queue
> itself. You don't actually _need_ to do it that way, but removing the
> lock switching would involve quite some redesign of these drivers.
> Given that most of them are rather oldish I really wouldn't want to
> touch them.

Thanks for the info.

We use a modified version of the "sbull" block device driver from the
GKH book. We use it for testing the block device startup/shutdown path
plus CFQ manipulation.

The sbull driver uses the function

  blk_init_queue(..., &dev->qlock);

It passes a lock as the second parameter, and the function does the
lock swapping. According to your information, passing a lock to
blk_init_queue() is considered the "old, non-recommended way", so we
should modify our driver and avoid doing this. I'll take a look. Thanks.

> However, none of the modern devices should be using this lock
> switching, so I would just ignore it.
> EG SCSI most definitely doesn't use it.
>
> Cheers,
>
> Hannes

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 16:23                       ` Anatol Pomozov
@ 2013-09-26 16:30                         ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 16:30 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe, linux-scsi

Hello,

On Thu, Sep 26, 2013 at 09:23:19AM -0700, Anatol Pomozov wrote:
> We use a modified version of the "sbull" block device driver from the
> GKH book. We use it for testing the block device startup/shutdown path
> plus CFQ manipulation.
>
> The sbull driver uses the function
>
>   blk_init_queue(..., &dev->qlock);
>
> It passes a lock as the second parameter, and the function does the
> lock swapping. According to your information, passing a lock to
> blk_init_queue() is considered the "old, non-recommended way", so we
> should modify our driver and avoid doing this. I'll take a look.

I wonder whether we can make request_queue record the passed-in lock
and grab it in addition to the queue lock instead of replacing it when
invoking the request function.  This would mean more overhead for the
broken drivers but given how legacy they are at this point, I don't
think that's gonna be a big problem.

If that's likely to work, we may as well just push all the extra
locking to the legacy drivers so that their request functions take the
extra lock, and then we can remove the spinlock argument completely.
I'm not sure whether the locking in the request path is the only thing
necessary tho.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]                         ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-27  5:59                           ` Hannes Reinecke
  0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-27 5:59 UTC (permalink / raw)
  To: Anatol Pomozov
  Cc: Tejun Heo, Cgroups, Jens Axboe, linux-scsi-u79uwXL29TY76Z2rM5mHXA

On 09/26/2013 06:23 PM, Anatol Pomozov wrote:
> Hi
>
> On Thu, Sep 26, 2013 at 7:18 AM, Hannes Reinecke
> <hare-l3A5Bk7waGM@public.gmane.org> wrote:
>> [...]
>
> Thanks for the info.
>
> We use a modified version of the "sbull" block device driver from the
> GKH book. We use it for testing the block device startup/shutdown path
> plus CFQ manipulation.
>
> The sbull driver uses the function
>
>   blk_init_queue(..., &dev->qlock);
>
> It passes a lock as the second parameter, and the function does the
> lock swapping. According to your information, passing a lock to
> blk_init_queue() is considered the "old, non-recommended way", so we
> should modify our driver and avoid doing this. I'll take a look.
>
Yep. I would recommend using the queue_lock directly whenever you need
to pull requests off the request_queue, and using your own lock to
protect any internal structures associated with handling the request
internally.

Yes, this _might_ involve some lock dancing between the queue_lock and
the internal lock, but it has the advantage that the internal lock is
typically used far more often than the queue_lock itself. So you might
even get some speed advantage here, as you reduce contention on the
queue_lock.

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org	+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
  2013-09-04 15:45       ` Anatol Pomozov
@ 2013-09-04 16:15       ` Anatol Pomozov
  1 sibling, 0 replies; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-04 16:15 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Cgroups, Tejun Heo, Jens Axboe

Hi

On Tue, Sep 3, 2013 at 11:42 PM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
>> [...]
>>
>> What would be the best fix for the issue?
>>
> The correct fix would be to add checks for 'blkg->q'; the mentioned
> lock reassignment can only happen during queue shutdown.
> So whenever the queue is dead or stopping we should refuse to print
> anything here.
>
> Try this:
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 290792a..3e17841 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
>
>  	rcu_read_lock();
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +		if (unlikely(blk_queue_dying(blkg->q)))
> +			continue;
>  		spin_lock_irq(blkg->q->queue_lock);
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			total += prfill(sf, blkg->pd[pol->plid], data);

I ran my tests with this patch and unfortunately I still see the same
crash. Although the test runs much longer now - it needed ~1000 device
shutdowns before the oops. Previously it was just a dozen iterations.

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads: [~2013-09-27 5:59 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-03 20:14 Race condition between "read CFQ stats" and "block device shutdown" Anatol Pomozov
[not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-04 6:42 ` Hannes Reinecke
[not found] ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
2013-09-04 15:45 ` Anatol Pomozov
[not found] ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-04 16:07 ` Tejun Heo
[not found] ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-09-25 20:37 ` Anatol Pomozov
2013-09-26 13:54 ` Tejun Heo
2013-09-26 14:18 ` Hannes Reinecke
2013-09-26 14:20 ` Tejun Heo
[not found] ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
2013-09-26 16:23 ` Anatol Pomozov
2013-09-26 16:30 ` Tejun Heo
[not found] ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-27 5:59 ` Hannes Reinecke
2013-09-04 16:15 ` Anatol Pomozov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox