* Race condition between "read CFQ stats" and "block device shutdown"
@ 2013-09-03 20:14 Anatol Pomozov
[not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-03 20:14 UTC (permalink / raw)
To: Cgroups, Tejun Heo, Hannes Reinecke, Jens Axboe
Hi,
I am running a program that checks "read CFQ stat files" for race
conditions with other events (e.g. device shutdown).

And I discovered an interesting bug. Here is the "double unlock" crash
trace for it:
print_unlock_imbalance_bug.isra.23+0x4/0x10
[ 261.453775] [<ffffffff810f7c65>] lock_release_non_nested.isra.39+0x2f5/0x300
[ 261.460900] [<ffffffff810f7cfe>] lock_release+0x8e/0x1f0
[ 261.466293] [<ffffffff81339030>] ? cfqg_prfill_service_level+0x60/0x60
[ 261.472894] [<ffffffff81005be3>] _raw_spin_unlock_irq+0x23/0x50
[ 261.478894] [<ffffffff8133559f>] blkcg_print_blkgs+0x8f/0x140
[ 261.484724] [<ffffffff81335515>] ? blkcg_print_blkgs+0x5/0x140
[ 261.490631] [<ffffffff81338a7f>] cfqg_print_weighted_queue_time+0x2f/0x40
[ 261.497489] [<ffffffff8110b793>] cgroup_seqfile_show+0x53/0x60
[ 261.503398] [<ffffffff811f1fe4>] seq_read+0x124/0x3a0
[ 261.508529] [<ffffffff811ce39d>] vfs_read+0xad/0x180
[ 261.513576] [<ffffffff811ce625>] SyS_read+0x55/0xa0
[ 261.518538] [<ffffffff81609f66>] cstar_dispatch+0x7/0x1f
blkcg_print_blkgs fails with double unlock? Hmm, I checked
cfqg_prfill_service_level and I did not find any places where unlock
can happen.
After some debugging I found that in blkcg_print_blkgs() the spinlock
passed to spin_lock_irq() differs from the object passed to
spin_unlock_irq() just a few lines below. It means the
request_queue->queue_lock spinlock changed out from under the function
while it was executing!

To make sure, I added the following debug patch:
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -465,10 +465,16 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
- spin_lock_irq(blkg->q->queue_lock);
+ spinlock_t *lock = blkg->q->queue_lock;
+ spinlock_t *new_lock;
+ spin_lock_irq(lock);
if (blkcg_policy_enabled(blkg->q, pol))
total += prfill(sf, blkg->pd[pol->plid], data);
- spin_unlock_irq(blkg->q->queue_lock);
+ new_lock = blkg->q->queue_lock;
+ if (lock != new_lock) {
+ pr_err("old lock %p %s new lock %p %s\n",
+        lock, lock->dep_map.name, new_lock, new_lock->dep_map.name);
+ }
+ spin_unlock_irq(lock);
}
rcu_read_unlock();
And indeed it shows the locks are different.

It comes from commit 777eb1bf1 "block: Free queue resources at
blk_release_queue()", which changes the lock when the device is
shutting down.

What would be the best fix for the issue?
^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-04  6:42   ` Hannes Reinecke
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-04 6:42 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Cgroups, Tejun Heo, Jens Axboe

On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
> Hi,
>
> I am running a program that checks "read CFQ stat files" for race
> conditions with other events (e.g. device shutdown).
>
> And I discovered an interesting bug. Here is the "double unlock" crash
> trace for it:
>
> [...]
>
> blkcg_print_blkgs fails with double unlock? Hmm, I checked
> cfqg_prfill_service_level and I did not find any places where unlock
> can happen.
>
> After some debugging I found that in blkcg_print_blkgs() the spinlock
> passed to spin_lock_irq() differs from the object passed to
> spin_unlock_irq() just a few lines below. It means the
> request_queue->queue_lock spinlock changed out from under the function
> while it was executing!
>
> To make sure, I added the following debug patch:
>
> [...]
>
> And indeed it shows the locks are different.
>
> It comes from commit 777eb1bf1 "block: Free queue resources at
> blk_release_queue()", which changes the lock when the device is
> shutting down.
>
> What would be the best fix for the issue?
>
The correct fix would be to add checks for 'blkg->q'; the mentioned
lock reassignment can only happen during queue shutdown.
So whenever the queue is dead or stopping we should refuse to print
anything here.

Try this:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 290792a..3e17841 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,

 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+		if (unlikely(blk_queue_dying(blkg->q)))
+			continue;
 		spin_lock_irq(blkg->q->queue_lock);
 		if (blkcg_policy_enabled(blkg->q, pol))
 			total += prfill(sf, blkg->pd[pol->plid], data);

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org	+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 12+ messages in thread
[parent not found: <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
@ 2013-09-04 15:45       ` Anatol Pomozov
  [not found]         ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-09-04 16:15       ` Anatol Pomozov
  1 sibling, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-04 15:45 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Cgroups, Tejun Heo, Jens Axboe

Hi

On Tue, Sep 3, 2013 at 11:42 PM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
>> [...]
>>
>> What would be the best fix for the issue?
>>
> The correct fix would be to add checks for 'blkg->q'; the mentioned
> lock reassignment can only happen during queue shutdown.
> So whenever the queue is dead or stopping we should refuse to print
> anything here.
>
> Try this:
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 290792a..3e17841 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
>
>  	rcu_read_lock();
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +		if (unlikely(blk_queue_dying(blkg->q)))
> +			continue;

I am not an expert in block code, so I have a few questions here:

- are we sure that this operation is atomic? What if blkg->q becomes
dead right after we checked it, and blkg->q->queue_lock got invalid so
we have the same crash as before?

- do we need to check blk_queue_dying() in every single place where we
iterate over blkcg->blkg_list?

- would it be better to remove the block cgroup from blkcg->blkg_list
before the queue is marked as dead?

>  		spin_lock_irq(blkg->q->queue_lock);
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			total += prfill(sf, blkg->pd[pol->plid], data);

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]         ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-04 16:07           ` Tejun Heo
  [not found]             ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-09-04 16:07 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe

Hello,

On Wed, Sep 04, 2013 at 08:45:33AM -0700, Anatol Pomozov wrote:
> I am not an expert in block code, so I have a few questions here:
>
> - are we sure that this operation is atomic? What if blkg->q becomes
> dead right after we checked it, and blkg->q->queue_lock got invalid so
> we have the same crash as before?

request_queue lock switching is something inherently broken in block
layer.  It's unsalvageable.  Maybe we can drop lock switching once
blk-mq is fully merged.

> - do we need to check blk_queue_dying() in every single place where we
> iterate over blkcg->blkg_list?
>
> - would it be better to remove the block cgroup from blkcg->blkg_list
> before the queue is marked as dead?

I have no idea.  No matter what we do, it's working around something
fundamentally broken in a half-assed way.  Hmmm... there was an effort
to remove the lock switching.  I don't think it actually happened
tho. :(

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]             ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2013-09-25 20:37               ` Anatol Pomozov
  2013-09-26 13:54                 ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-25 20:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Hannes Reinecke, Cgroups, Jens Axboe

Hi

On Wed, Sep 4, 2013 at 9:07 AM, Tejun Heo
<tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Sep 04, 2013 at 08:45:33AM -0700, Anatol Pomozov wrote:
>> I am not an expert in block code, so I have a few questions here:
>>
>> - are we sure that this operation is atomic? What if blkg->q becomes
>> dead right after we checked it, and blkg->q->queue_lock got invalid so
>> we have the same crash as before?
>
> request_queue lock switching is something inherently broken in block
> layer.  It's unsalvageable.

Fully agree. The problem is that request_queue->queue_lock is a shared
resource that is concurrently modified and accessed. In this case (one
thread changes it while another thread accesses it) we need
synchronization to prevent race conditions. So we would need a
spin_lock just to access the queue_lock spin_lock, otherwise we get a
crash like the one above...

> Maybe we can drop lock switching once blk-mq is fully merged.

Could you please provide more information about it? What is the timeline?

If there is an easy way to fix the race condition I would like to
help. Please give me some pointers on what direction I should move in.

PS Just a little bit of context on why I care about this bug. We test a
large farm that actively uses iscsi. We are going to have a lot of
iscsi device startups/shutdowns. I am testing whether this codepath has
race conditions, and I found the one above.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-25 20:37               ` Anatol Pomozov
@ 2013-09-26 13:54                 ` Tejun Heo
  2013-09-26 14:18                   ` Hannes Reinecke
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 13:54 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe, linux-scsi

Hello, (cc'ing linux-scsi)

On Wed, Sep 25, 2013 at 01:37:51PM -0700, Anatol Pomozov wrote:
> [...]
>
>> Maybe we can drop lock switching once blk-mq is fully merged.
>
> Could you please provide more information about it? What is the timeline?

I have no idea.  Hopefully, not too far out.  Jens would have better
idea.

> If there is an easy way to fix the race condition I would like to
> help. Please give me some pointers on what direction I should move in.

The first step would be identifying who are actually making use of
lock switching, why and how much difference it would make for them to
not do that.

> PS Just a little bit of context on why I care about this bug. We test a
> large farm that actively uses iscsi. We are going to have a lot of
> iscsi device startups/shutdowns. I am testing whether this codepath has
> race conditions, and I found the one above.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 13:54                 ` Tejun Heo
@ 2013-09-26 14:18                   ` Hannes Reinecke
  2013-09-26 14:20                     ` Tejun Heo
  [not found]                     ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-26 14:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Anatol Pomozov, Cgroups, Jens Axboe, linux-scsi

On 09/26/2013 03:54 PM, Tejun Heo wrote:
> [...]
>
>> If there is an easy way to fix the race condition I would like to
>> help. Please give me some pointers on what direction I should move in.
>
> The first step would be identifying who are actually making use of
> lock switching, why and how much difference it would make for them to
> not do that.
>
Typically, the lock is being used by the block drivers to synchronize
access between some internal data structures and the request queue
itself. You don't actually _need_ to do it that way, but removing the
lock switching would involve quite some redesign of these drivers.
Given that most of them are rather oldish I really wouldn't want to
touch them.

However, none of the modern devices should be using this lock
switching, so I would just ignore it.
EG SCSI most definitely doesn't use it.

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare@suse.de			+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 14:18                   ` Hannes Reinecke
@ 2013-09-26 14:20                     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 14:20 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Anatol Pomozov, Cgroups, Jens Axboe, linux-scsi

Hey, Hannes.

On Thu, Sep 26, 2013 at 04:18:34PM +0200, Hannes Reinecke wrote:
> However, none of the modern devices should be using this lock
> switching, so I would just ignore it.
> EG SCSI most definitely doesn't use it.

The kernel is crashing from it, so I don't think ignoring is an
acceptable strategy here.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]                     ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
@ 2013-09-26 16:23                       ` Anatol Pomozov
  2013-09-26 16:30                         ` Tejun Heo
  [not found]                         ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-26 16:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Tejun Heo, Cgroups, Jens Axboe, linux-scsi-u79uwXL29TY76Z2rM5mHXA

Hi

On Thu, Sep 26, 2013 at 7:18 AM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> [...]
>
> Typically, the lock is being used by the block drivers to synchronize
> access between some internal data structures and the request queue
> itself. You don't actually _need_ to do it that way, but removing the
> lock switching would involve quite some redesign of these drivers.
> Given that most of them are rather oldish I really wouldn't want to
> touch them.

Thanks for the info.

We use a modified version of the "sbull" block device driver from the
GKH book. We use it for testing the block device startup/shutdown path
plus CFQ manipulation.

The sbull driver uses the function

  blk_init_queue(..., &dev->qlock);

It passes a lock as the second parameter, and the function does the
lock swapping. According to your information, passing a lock to
blk_init_queue() is considered the "old, non-recommended way", so we
should modify our driver and avoid doing this. I'll take a look. Thanks.

> However, none of the modern devices should be using this lock
> switching, so I would just ignore it.
> EG SCSI most definitely doesn't use it.
>
> Cheers,
>
> Hannes

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  2013-09-26 16:23                       ` Anatol Pomozov
@ 2013-09-26 16:30                         ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2013-09-26 16:30 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: Hannes Reinecke, Cgroups, Jens Axboe, linux-scsi

Hello,

On Thu, Sep 26, 2013 at 09:23:19AM -0700, Anatol Pomozov wrote:
> We use a modified version of the "sbull" block device driver from the
> GKH book. We use it for testing the block device startup/shutdown path
> plus CFQ manipulation.
>
> The sbull driver uses the function
>
>   blk_init_queue(..., &dev->qlock);
>
> It passes a lock as the second parameter, and the function does the
> lock swapping. According to your information, passing a lock to
> blk_init_queue() is considered the "old, non-recommended way", so we
> should modify our driver and avoid doing this. I'll take a look.

I wonder whether we can make request_queue record the passed-in lock
and grab it in addition to the queue lock instead of replacing it when
invoking the request function.  This would mean more overhead for the
broken drivers but given how legacy they are at this point, I don't
think that's gonna be a big problem.

If that's likely to work, we may as well just push all the extra
locking to the legacy drivers so that their request functions take the
extra lock, and then we can remove the spinlock argument completely.
I'm not sure whether the locking in the request path is the only thing
necessary tho.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]                         ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-09-27  5:59                           ` Hannes Reinecke
  0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2013-09-27 5:59 UTC (permalink / raw)
  To: Anatol Pomozov
  Cc: Tejun Heo, Cgroups, Jens Axboe, linux-scsi-u79uwXL29TY76Z2rM5mHXA

On 09/26/2013 06:23 PM, Anatol Pomozov wrote:
> Hi
>
> On Thu, Sep 26, 2013 at 7:18 AM, Hannes Reinecke
> <hare-l3A5Bk7waGM@public.gmane.org> wrote:
>> [...]
>
> Thanks for the info.
>
> We use a modified version of the "sbull" block device driver from the
> GKH book. We use it for testing the block device startup/shutdown path
> plus CFQ manipulation.
>
> The sbull driver uses the function
>
>   blk_init_queue(..., &dev->qlock);
>
> It passes a lock as the second parameter, and the function does the
> lock swapping. According to your information, passing a lock to
> blk_init_queue() is considered the "old, non-recommended way", so we
> should modify our driver and avoid doing this. I'll take a look.
>
Yep. I would recommend using the queue_lock directly whenever you need
to pull requests off the request_queue, and using your own lock to
protect any internal structures associated with handling the request
internally.

Yes, this _might_ involve some lock dancing between the queue_lock and
the internal lock, but it has the advantage that the internal lock is
typically used far more often than the queue_lock itself. So you might
even get some speed advantage here, as you reduce contention on the
queue_lock.

Cheers,

Hannes
--
Dr. Hannes Reinecke		zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org	+49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Race condition between "read CFQ stats" and "block device shutdown"
  [not found]     ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
  2013-09-04 15:45       ` Anatol Pomozov
@ 2013-09-04 16:15       ` Anatol Pomozov
  1 sibling, 0 replies; 12+ messages in thread
From: Anatol Pomozov @ 2013-09-04 16:15 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Cgroups, Tejun Heo, Jens Axboe

Hi

On Tue, Sep 3, 2013 at 11:42 PM, Hannes Reinecke
<hare-l3A5Bk7waGM@public.gmane.org> wrote:
> On 09/03/2013 10:14 PM, Anatol Pomozov wrote:
>> [...]
>>
>> What would be the best fix for the issue?
>>
> The correct fix would be to add checks for 'blkg->q'; the mentioned
> lock reassignment can only happen during queue shutdown.
> So whenever the queue is dead or stopping we should refuse to print
> anything here.
>
> Try this:
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 290792a..3e17841 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -504,6 +504,8 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
>
>  	rcu_read_lock();
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +		if (unlikely(blk_queue_dying(blkg->q)))
> +			continue;
>  		spin_lock_irq(blkg->q->queue_lock);
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			total += prfill(sf, blkg->pd[pol->plid], data);

I ran my tests with this patch and unfortunately I still see the same
crash. Although the test runs much longer now - it needed ~1000 device
shutdowns before the oops. Previously it was just a dozen iterations.

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads: [~2013-09-27 5:59 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-03 20:14 Race condition between "read CFQ stats" and "block device shutdown" Anatol Pomozov
[not found] ` <CAOMFOmXJ5ZTYdOvdUt-oxsouhPGRmMshCRhn6AFgmFAGZw5WZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-04 6:42 ` Hannes Reinecke
[not found] ` <5226D661.7070301-l3A5Bk7waGM@public.gmane.org>
2013-09-04 15:45 ` Anatol Pomozov
[not found] ` <CAOMFOmUCqXN1uaqBEWH3PStuZXvnvLw=YrARgv7DvqO6Y4bFPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-04 16:07 ` Tejun Heo
[not found] ` <20130904160723.GC26609-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2013-09-25 20:37 ` Anatol Pomozov
2013-09-26 13:54 ` Tejun Heo
2013-09-26 14:18 ` Hannes Reinecke
2013-09-26 14:20 ` Tejun Heo
[not found] ` <5244423A.2050107-l3A5Bk7waGM@public.gmane.org>
2013-09-26 16:23 ` Anatol Pomozov
2013-09-26 16:30 ` Tejun Heo
[not found] ` <CAOMFOmX2f35qWyTr7=1HNu=RMB_LMAmpMbYxSEsX1xgURhx_mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-09-27 5:59 ` Hannes Reinecke
2013-09-04 16:15 ` Anatol Pomozov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox