Races in sbitmap batched wakeups

Linux block layer
 help / color / mirror / Atom feed

* Races in sbitmap batched wakeups
@ 2022-06-16 17:21 Jan Kara
  2022-06-17  1:07 ` Ming Lei
  2022-06-17  1:40 ` Yu Kuai
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Kara @ 2022-06-16 17:21 UTC (permalink / raw)
  To: Jens Axboe, Omar Sandoval; +Cc: linux-block

Hello!

I've been debugging some customer reports of tasks hanging (forever)
waiting for free tags when in fact all tags are free. After looking into it
for some time I think I know what it happening. First, keep in mind that
it concerns a device which uses shared tags. There are 127 tags available
and the number of active queues using these tags is easily 40 or more. So
number of tags available for each device is rather small. Now I'm not sure
how batched wakeups can ever work in such situations, but maybe I'm missing
something.

So take for example a situation where two tags are available for a device,
they are both currently used. Now a process comes into blk_mq_get_tag() and
wants to allocate tag and goes to sleep. Now how can it ever be woken up if
wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
that's not enough to trigger the batched wakeup to really wakeup the
waiter...

Even if we have say 4 tags available so in theory there should be enough
wakeups to fill the batch, there can be the following problem. So 4 tags
are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
queue 0, one on wait queue 1. Now four IOs complete so
sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
waiters is woken up but the other one is still sleeping in wq1 and there
are not enough wakeups to fill the batch and wake it up? This is
essentially because we have lost three wakeups on wq0 because it didn't
have enough waiters to wake...

Finally, sbitmap_queue_wake_up() is racy and if two of them race together,
they can end up decrementing wait_cnt of wq which does not have any process
queued which again effectively leads to lost wakeups and possibly
indefinitely sleeping tasks.

Now this all looks so broken that I'm unsure whether I'm not missing
something fundamental. Also I'm not quite sure how to fix all this without
basically destroying the batched wakeup feature (but I didn't put too much
thought to this yet). Comments?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara
@ 2022-06-17  1:07 ` Ming Lei
  2022-06-17 10:50   ` Jan Kara
  2022-06-17  1:40 ` Yu Kuai
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2022-06-17  1:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block

On Thu, Jun 16, 2022 at 07:21:02PM +0200, Jan Kara wrote:
> Hello!
> 
> I've been debugging some customer reports of tasks hanging (forever)
> waiting for free tags when in fact all tags are free. After looking into it
> for some time I think I know what it happening. First, keep in mind that
> it concerns a device which uses shared tags. There are 127 tags available
> and the number of active queues using these tags is easily 40 or more. So
> number of tags available for each device is rather small. Now I'm not sure
> how batched wakeups can ever work in such situations, but maybe I'm missing
> something.
> 
> So take for example a situation where two tags are available for a device,
> they are both currently used. Now a process comes into blk_mq_get_tag() and
> wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> that's not enough to trigger the batched wakeup to really wakeup the
> waiter...

commit 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
is supposed for addressing this kind of issue.

> 
> Even if we have say 4 tags available so in theory there should be enough
> wakeups to fill the batch, there can be the following problem. So 4 tags
> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> queue 0, one on wait queue 1. Now four IOs complete so
> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> waiters is woken up but the other one is still sleeping in wq1 and there
> are not enough wakeups to fill the batch and wake it up? This is
> essentially because we have lost three wakeups on wq0 because it didn't
> have enough waiters to wake...

But the following completions will wake up the waiter in wq1, given
there are more in-flight.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-17  1:07 ` Ming Lei
@ 2022-06-17 10:50   ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2022-06-17 10:50 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block

On Fri 17-06-22 09:07:18, Ming Lei wrote:
> On Thu, Jun 16, 2022 at 07:21:02PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > I've been debugging some customer reports of tasks hanging (forever)
> > waiting for free tags when in fact all tags are free. After looking into it
> > for some time I think I know what it happening. First, keep in mind that
> > it concerns a device which uses shared tags. There are 127 tags available
> > and the number of active queues using these tags is easily 40 or more. So
> > number of tags available for each device is rather small. Now I'm not sure
> > how batched wakeups can ever work in such situations, but maybe I'm missing
> > something.
> > 
> > So take for example a situation where two tags are available for a device,
> > they are both currently used. Now a process comes into blk_mq_get_tag() and
> > wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> > that's not enough to trigger the batched wakeup to really wakeup the
> > waiter...
> 
> commit 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
> is supposed for addressing this kind of issue.

I have observed the deadlock with the above fixes applied.

> > Even if we have say 4 tags available so in theory there should be enough
> > wakeups to fill the batch, there can be the following problem. So 4 tags
> > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> > queue 0, one on wait queue 1. Now four IOs complete so
> > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> > waiters is woken up but the other one is still sleeping in wq1 and there
> > are not enough wakeups to fill the batch and wake it up? This is
> > essentially because we have lost three wakeups on wq0 because it didn't
> > have enough waiters to wake...
> 
> But the following completions will wake up the waiter in wq1, given
> there are more in-flight.

Well, there is only one more request in flight - from the unblocked waiter.
And once that request completes, it will generate just one wakeup which is
not enough to wake the waiter on wq1 because wake_batch is 4...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara
  2022-06-17  1:07 ` Ming Lei
@ 2022-06-17  1:40 ` Yu Kuai
  2022-06-17 11:31   ` Jan Kara
  1 sibling, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2022-06-17  1:40 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe, Omar Sandoval; +Cc: linux-block

在 2022/06/17 1:21, Jan Kara 写道:
> Hello!
> 
> I've been debugging some customer reports of tasks hanging (forever)
> waiting for free tags when in fact all tags are free. After looking into it
> for some time I think I know what it happening. First, keep in mind that
> it concerns a device which uses shared tags. There are 127 tags available
> and the number of active queues using these tags is easily 40 or more. So
> number of tags available for each device is rather small. Now I'm not sure
> how batched wakeups can ever work in such situations, but maybe I'm missing
> something.
> 
> So take for example a situation where two tags are available for a device,
> they are both currently used. Now a process comes into blk_mq_get_tag() and
> wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> that's not enough to trigger the batched wakeup to really wakeup the
> waiter...
> 
> Even if we have say 4 tags available so in theory there should be enough
> wakeups to fill the batch, there can be the following problem. So 4 tags
> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> queue 0, one on wait queue 1. Now four IOs complete so
> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> waiters is woken up but the other one is still sleeping in wq1 and there
> are not enough wakeups to fill the batch and wake it up? This is
> essentially because we have lost three wakeups on wq0 because it didn't
> have enough waiters to wake...
Hi,

 From what I see, if tags are shared for multiple devices, wake_batch
should make sure that all waiter will be woke up:

For example:
there are total 64 tags shared for two devices, then wake_batch is 4(if
both devices are active).  If there are waiters, which means at least 32
tags are grabed, thus 8 queues will ensure to wake up at least once
after 32 tags are freed.

> 
> Finally, sbitmap_queue_wake_up() is racy and if two of them race together,
> they can end up decrementing wait_cnt of wq which does not have any process
> queued which again effectively leads to lost wakeups and possibly
> indefinitely sleeping tasks.
> 

BTW, I do this implementation have some problems on concurrent
scenario, as described in following patch:

https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/

> Now this all looks so broken that I'm unsure whether I'm not missing
> something fundamental. Also I'm not quite sure how to fix all this without
> basically destroying the batched wakeup feature (but I didn't put too much
> thought to this yet). Comments?
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-17  1:40 ` Yu Kuai
@ 2022-06-17 11:31   ` Jan Kara
  2022-06-17 12:50     ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2022-06-17 11:31 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block

Hi!

On Fri 17-06-22 09:40:11, Yu Kuai wrote:
> 在 2022/06/17 1:21, Jan Kara 写道:
> > I've been debugging some customer reports of tasks hanging (forever)
> > waiting for free tags when in fact all tags are free. After looking into it
> > for some time I think I know what it happening. First, keep in mind that
> > it concerns a device which uses shared tags. There are 127 tags available
> > and the number of active queues using these tags is easily 40 or more. So
> > number of tags available for each device is rather small. Now I'm not sure
> > how batched wakeups can ever work in such situations, but maybe I'm missing
> > something.
> > 
> > So take for example a situation where two tags are available for a device,
> > they are both currently used. Now a process comes into blk_mq_get_tag() and
> > wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> > that's not enough to trigger the batched wakeup to really wakeup the
> > waiter...
> > 
> > Even if we have say 4 tags available so in theory there should be enough
> > wakeups to fill the batch, there can be the following problem. So 4 tags
> > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> > queue 0, one on wait queue 1. Now four IOs complete so
> > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> > waiters is woken up but the other one is still sleeping in wq1 and there
> > are not enough wakeups to fill the batch and wake it up? This is
> > essentially because we have lost three wakeups on wq0 because it didn't
> > have enough waiters to wake...
> 
> From what I see, if tags are shared for multiple devices, wake_batch
> should make sure that all waiter will be woke up:
> 
> For example:
> there are total 64 tags shared for two devices, then wake_batch is 4(if
> both devices are active).  If there are waiters, which means at least 32
> tags are grabed, thus 8 queues will ensure to wake up at least once
> after 32 tags are freed.

Well, yes, wake_batch is updated but as my example above shows it is not
enough to fix "wasted" wakeups.

> > Finally, sbitmap_queue_wake_up() is racy and if two of them race together,
> > they can end up decrementing wait_cnt of wq which does not have any process
> > queued which again effectively leads to lost wakeups and possibly
> > indefinitely sleeping tasks.
> > 
> 
> BTW, I do this implementation have some problems on concurrent
> scenario, as described in following patch:
> 
> https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/

Yes, as far as I can see you have identified similar races as I point out
in this email. But I'm not sure whether your patch fixes all the
possibilities for lost wakeups...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-17 11:31   ` Jan Kara
@ 2022-06-17 12:50     ` Yu Kuai
  2022-06-20 11:57       ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2022-06-17 12:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block

在 2022/06/17 19:31, Jan Kara 写道:
> Hi!
> 
> On Fri 17-06-22 09:40:11, Yu Kuai wrote:
>> 在 2022/06/17 1:21, Jan Kara 写道:
>>> I've been debugging some customer reports of tasks hanging (forever)
>>> waiting for free tags when in fact all tags are free. After looking into it
>>> for some time I think I know what it happening. First, keep in mind that
>>> it concerns a device which uses shared tags. There are 127 tags available
>>> and the number of active queues using these tags is easily 40 or more. So
>>> number of tags available for each device is rather small. Now I'm not sure
>>> how batched wakeups can ever work in such situations, but maybe I'm missing
>>> something.
>>>
>>> So take for example a situation where two tags are available for a device,
>>> they are both currently used. Now a process comes into blk_mq_get_tag() and
>>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if
>>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
>>> that's not enough to trigger the batched wakeup to really wakeup the
>>> waiter...
>>>
>>> Even if we have say 4 tags available so in theory there should be enough
>>> wakeups to fill the batch, there can be the following problem. So 4 tags
>>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
>>> queue 0, one on wait queue 1. Now four IOs complete so
>>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
>>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
>>> waiters is woken up but the other one is still sleeping in wq1 and there
>>> are not enough wakeups to fill the batch and wake it up? This is
>>> essentially because we have lost three wakeups on wq0 because it didn't
>>> have enough waiters to wake...
>>
>>  From what I see, if tags are shared for multiple devices, wake_batch
>> should make sure that all waiter will be woke up:
>>
>> For example:
>> there are total 64 tags shared for two devices, then wake_batch is 4(if
>> both devices are active).  If there are waiters, which means at least 32
>> tags are grabed, thus 8 queues will ensure to wake up at least once
>> after 32 tags are freed.
> 
> Well, yes, wake_batch is updated but as my example above shows it is not
> enough to fix "wasted" wakeups.

Tags can be preempted, which means new thread can be added to waitqueue
only if there are no free tags.

With the above condition, I can't think of any possibility how the
following scenario can be existed(dispite the wake ups can be missed):

Only wake_batch tags are still in use, while multiple waitqueues are
still active.

If you think this is possible, can you share the initial conditions and
how does it end up to the problematic scenario?

Thanks,
Kuai
> 
>>> Finally, sbitmap_queue_wake_up() is racy and if two of them race together,
>>> they can end up decrementing wait_cnt of wq which does not have any process
>>> queued which again effectively leads to lost wakeups and possibly
>>> indefinitely sleeping tasks.
>>>
>>
>> BTW, I do this implementation have some problems on concurrent
>> scenario, as described in following patch:
>>
>> https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/
> 
> Yes, as far as I can see you have identified similar races as I point out
> in this email. But I'm not sure whether your patch fixes all the
> possibilities for lost wakeups...
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-17 12:50     ` Yu Kuai
@ 2022-06-20 11:57       ` Jan Kara
  2022-06-20 13:11         ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2022-06-20 11:57 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu

Hello!

On Fri 17-06-22 20:50:17, Yu Kuai wrote:
> 在 2022/06/17 19:31, Jan Kara 写道:
> > On Fri 17-06-22 09:40:11, Yu Kuai wrote:
> > > 在 2022/06/17 1:21, Jan Kara 写道:
> > > > I've been debugging some customer reports of tasks hanging (forever)
> > > > waiting for free tags when in fact all tags are free. After looking into it
> > > > for some time I think I know what it happening. First, keep in mind that
> > > > it concerns a device which uses shared tags. There are 127 tags available
> > > > and the number of active queues using these tags is easily 40 or more. So
> > > > number of tags available for each device is rather small. Now I'm not sure
> > > > how batched wakeups can ever work in such situations, but maybe I'm missing
> > > > something.
> > > > 
> > > > So take for example a situation where two tags are available for a device,
> > > > they are both currently used. Now a process comes into blk_mq_get_tag() and
> > > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> > > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> > > > that's not enough to trigger the batched wakeup to really wakeup the
> > > > waiter...
> > > > 
> > > > Even if we have say 4 tags available so in theory there should be enough
> > > > wakeups to fill the batch, there can be the following problem. So 4 tags
> > > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> > > > queue 0, one on wait queue 1. Now four IOs complete so
> > > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> > > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> > > > waiters is woken up but the other one is still sleeping in wq1 and there
> > > > are not enough wakeups to fill the batch and wake it up? This is
> > > > essentially because we have lost three wakeups on wq0 because it didn't
> > > > have enough waiters to wake...
> > > 
> > >  From what I see, if tags are shared for multiple devices, wake_batch
> > > should make sure that all waiter will be woke up:
> > > 
> > > For example:
> > > there are total 64 tags shared for two devices, then wake_batch is 4(if
> > > both devices are active).  If there are waiters, which means at least 32
> > > tags are grabed, thus 8 queues will ensure to wake up at least once
> > > after 32 tags are freed.
> > 
> > Well, yes, wake_batch is updated but as my example above shows it is not
> > enough to fix "wasted" wakeups.
> 
> Tags can be preempted, which means new thread can be added to waitqueue
> only if there are no free tags.

Yes.

> With the above condition, I can't think of any possibility how the
> following scenario can be existed(dispite the wake ups can be missed):
> 
> Only wake_batch tags are still in use, while multiple waitqueues are
> still active.
> 
> If you think this is possible, can you share the initial conditions and
> how does it end up to the problematic scenario?

Very easily AFAICT. I've described the scenario in my original email but
let me maybe write it here with more detail. Let's assume we have 4 tags
available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4
for all waitqueues. All four tags are currently in use.

Now task T1 comes, wants a new tag:
blk_mq_get_tag()
  bt_wait_ptr() -> gets ws 0, wait_index incremented to 1
  goes to sleep on ws 0

Now task T2 comes, wants a new tag:
blk_mq_get_tag()
  bt_wait_ptr() -> gets ws 1, wait_index incremented to 2
  goes to sleep on ws 1

Now all four requests complete, this generates 4 calls to
sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0
and we do wake_up_nr(&ws->wait, 4). This wakes T1.

T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is
called for ws 1. wait_cnt is decremented to 3.

Now there's no IO in flight but we still have task sleeping in ws 1.
Everything is stuck until someone submits more IO (which may never happen
because everything ends up waiting on completion of IO T2 does).

Note that I have given an example with 4 tags available but in principle
the problem can happen with up to (wake_batch - 1) * SBQ_WAIT_QUEUES tags.
Hum, spelling it out so exactly this particular problem should be probably
fixed by changing the code in sbitmap_queue_recalculate_wake_batch() to not
force wake_batch to 4 if there is higher number of total tags available?
Laibin? That would be at least a quick fix for this particular problem.

And AFAICS this is independent problem from the one you are fixing in your
patch...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-20 11:57       ` Jan Kara
@ 2022-06-20 13:11         ` Yu Kuai
  2022-06-20 16:57           ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2022-06-20 13:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu

在 2022/06/20 19:57, Jan Kara 写道:
> Hello!
> 
> On Fri 17-06-22 20:50:17, Yu Kuai wrote:
>> 在 2022/06/17 19:31, Jan Kara 写道:
>>> On Fri 17-06-22 09:40:11, Yu Kuai wrote:
>>>> 在 2022/06/17 1:21, Jan Kara 写道:
>>>>> I've been debugging some customer reports of tasks hanging (forever)
>>>>> waiting for free tags when in fact all tags are free. After looking into it
>>>>> for some time I think I know what it happening. First, keep in mind that
>>>>> it concerns a device which uses shared tags. There are 127 tags available
>>>>> and the number of active queues using these tags is easily 40 or more. So
>>>>> number of tags available for each device is rather small. Now I'm not sure
>>>>> how batched wakeups can ever work in such situations, but maybe I'm missing
>>>>> something.
>>>>>
>>>>> So take for example a situation where two tags are available for a device,
>>>>> they are both currently used. Now a process comes into blk_mq_get_tag() and
>>>>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if
>>>>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
>>>>> that's not enough to trigger the batched wakeup to really wakeup the
>>>>> waiter...
>>>>>
>>>>> Even if we have say 4 tags available so in theory there should be enough
>>>>> wakeups to fill the batch, there can be the following problem. So 4 tags
>>>>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
>>>>> queue 0, one on wait queue 1. Now four IOs complete so
>>>>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
>>>>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
>>>>> waiters is woken up but the other one is still sleeping in wq1 and there
>>>>> are not enough wakeups to fill the batch and wake it up? This is
>>>>> essentially because we have lost three wakeups on wq0 because it didn't
>>>>> have enough waiters to wake...
>>>>
>>>>   From what I see, if tags are shared for multiple devices, wake_batch
>>>> should make sure that all waiter will be woke up:
>>>>
>>>> For example:
>>>> there are total 64 tags shared for two devices, then wake_batch is 4(if
>>>> both devices are active).  If there are waiters, which means at least 32
>>>> tags are grabed, thus 8 queues will ensure to wake up at least once
>>>> after 32 tags are freed.
>>>
>>> Well, yes, wake_batch is updated but as my example above shows it is not
>>> enough to fix "wasted" wakeups.
>>
>> Tags can be preempted, which means new thread can be added to waitqueue
>> only if there are no free tags.
> 
> Yes.
> 
>> With the above condition, I can't think of any possibility how the
>> following scenario can be existed(dispite the wake ups can be missed):
>>
>> Only wake_batch tags are still in use, while multiple waitqueues are
>> still active.
>>
>> If you think this is possible, can you share the initial conditions and
>> how does it end up to the problematic scenario?
> 
> Very easily AFAICT. I've described the scenario in my original email but
> let me maybe write it here with more detail. Let's assume we have 4 tags
> available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4
> for all waitqueues. All four tags are currently in use.
> 
> Now task T1 comes, wants a new tag:
> blk_mq_get_tag()
>    bt_wait_ptr() -> gets ws 0, wait_index incremented to 1
>    goes to sleep on ws 0
> 
> Now task T2 comes, wants a new tag:
> blk_mq_get_tag()
>    bt_wait_ptr() -> gets ws 1, wait_index incremented to 2
>    goes to sleep on ws 1
> 
> Now all four requests complete, this generates 4 calls to
> sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0
> and we do wake_up_nr(&ws->wait, 4). This wakes T1.
> 
> T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is
> called for ws 1. wait_cnt is decremented to 3.
> 
> Now there's no IO in flight but we still have task sleeping in ws 1.
> Everything is stuck until someone submits more IO (which may never happen
> because everything ends up waiting on completion of IO T2 does).
Hi, Jan

I assum that there should be at least 32 total tags, and at least 8
device are issuing io, so that there are only 4 tags available for
the devcie? (due to fair share).

If so, io from other devcies should trigger new wakeup.

Thanks,
Kuai
> 
> Note that I have given an example with 4 tags available but in principle
> the problem can happen with up to (wake_batch - 1) * SBQ_WAIT_QUEUES tags.
> Hum, spelling it out so exactly this particular problem should be probably
> fixed by changing the code in sbitmap_queue_recalculate_wake_batch() to not
> force wake_batch to 4 if there is higher number of total tags available?
> Laibin? That would be at least a quick fix for this particular problem.
> 
> And AFAICS this is independent problem from the one you are fixing in your
> patch...
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-20 13:11         ` Yu Kuai
@ 2022-06-20 16:57           ` Jan Kara
  2022-06-21  1:10             ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2022-06-20 16:57 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu

On Mon 20-06-22 21:11:25, Yu Kuai wrote:
> 在 2022/06/20 19:57, Jan Kara 写道:
> > Hello!
> > 
> > On Fri 17-06-22 20:50:17, Yu Kuai wrote:
> > > 在 2022/06/17 19:31, Jan Kara 写道:
> > > > On Fri 17-06-22 09:40:11, Yu Kuai wrote:
> > > > > 在 2022/06/17 1:21, Jan Kara 写道:
> > > > > > I've been debugging some customer reports of tasks hanging (forever)
> > > > > > waiting for free tags when in fact all tags are free. After looking into it
> > > > > > for some time I think I know what it happening. First, keep in mind that
> > > > > > it concerns a device which uses shared tags. There are 127 tags available
> > > > > > and the number of active queues using these tags is easily 40 or more. So
> > > > > > number of tags available for each device is rather small. Now I'm not sure
> > > > > > how batched wakeups can ever work in such situations, but maybe I'm missing
> > > > > > something.
> > > > > > 
> > > > > > So take for example a situation where two tags are available for a device,
> > > > > > they are both currently used. Now a process comes into blk_mq_get_tag() and
> > > > > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if
> > > > > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
> > > > > > that's not enough to trigger the batched wakeup to really wakeup the
> > > > > > waiter...
> > > > > > 
> > > > > > Even if we have say 4 tags available so in theory there should be enough
> > > > > > wakeups to fill the batch, there can be the following problem. So 4 tags
> > > > > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
> > > > > > queue 0, one on wait queue 1. Now four IOs complete so
> > > > > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
> > > > > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
> > > > > > waiters is woken up but the other one is still sleeping in wq1 and there
> > > > > > are not enough wakeups to fill the batch and wake it up? This is
> > > > > > essentially because we have lost three wakeups on wq0 because it didn't
> > > > > > have enough waiters to wake...
> > > > > 
> > > > >   From what I see, if tags are shared for multiple devices, wake_batch
> > > > > should make sure that all waiter will be woke up:
> > > > > 
> > > > > For example:
> > > > > there are total 64 tags shared for two devices, then wake_batch is 4(if
> > > > > both devices are active).  If there are waiters, which means at least 32
> > > > > tags are grabed, thus 8 queues will ensure to wake up at least once
> > > > > after 32 tags are freed.
> > > > 
> > > > Well, yes, wake_batch is updated but as my example above shows it is not
> > > > enough to fix "wasted" wakeups.
> > > 
> > > Tags can be preempted, which means new thread can be added to waitqueue
> > > only if there are no free tags.
> > 
> > Yes.
> > 
> > > With the above condition, I can't think of any possibility how the
> > > following scenario can be existed(dispite the wake ups can be missed):
> > > 
> > > Only wake_batch tags are still in use, while multiple waitqueues are
> > > still active.
> > > 
> > > If you think this is possible, can you share the initial conditions and
> > > how does it end up to the problematic scenario?
> > 
> > Very easily AFAICT. I've described the scenario in my original email but
> > let me maybe write it here with more detail. Let's assume we have 4 tags
> > available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4
> > for all waitqueues. All four tags are currently in use.
> > 
> > Now task T1 comes, wants a new tag:
> > blk_mq_get_tag()
> >    bt_wait_ptr() -> gets ws 0, wait_index incremented to 1
> >    goes to sleep on ws 0
> > 
> > Now task T2 comes, wants a new tag:
> > blk_mq_get_tag()
> >    bt_wait_ptr() -> gets ws 1, wait_index incremented to 2
> >    goes to sleep on ws 1
> > 
> > Now all four requests complete, this generates 4 calls to
> > sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0
> > and we do wake_up_nr(&ws->wait, 4). This wakes T1.
> > 
> > T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is
> > called for ws 1. wait_cnt is decremented to 3.
> > 
> > Now there's no IO in flight but we still have task sleeping in ws 1.
> > Everything is stuck until someone submits more IO (which may never happen
> > because everything ends up waiting on completion of IO T2 does).
> 
> I assum that there should be at least 32 total tags, and at least 8
> device are issuing io, so that there are only 4 tags available for
> the devcie? (due to fair share).
> 
> If so, io from other devcies should trigger new wakeup.

Well, there isn't necessarily any IO going on on other devices, there may
be 4 tags used on our device, the rest is free but we are not allowed to
use them. Sure eventually we should detect other devices are idle and
decrease tags->active_queues but that can take 30s or more...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Races in sbitmap batched wakeups
  2022-06-20 16:57           ` Jan Kara
@ 2022-06-21  1:10             ` Yu Kuai
  0 siblings, 0 replies; 10+ messages in thread
From: Yu Kuai @ 2022-06-21  1:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu



在 2022/06/21 0:57, Jan Kara 写道:
> On Mon 20-06-22 21:11:25, Yu Kuai wrote:
>> 在 2022/06/20 19:57, Jan Kara 写道:
>>> Hello!
>>>
>>> On Fri 17-06-22 20:50:17, Yu Kuai wrote:
>>>> 在 2022/06/17 19:31, Jan Kara 写道:
>>>>> On Fri 17-06-22 09:40:11, Yu Kuai wrote:
>>>>>> 在 2022/06/17 1:21, Jan Kara 写道:
>>>>>>> I've been debugging some customer reports of tasks hanging (forever)
>>>>>>> waiting for free tags when in fact all tags are free. After looking into it
>>>>>>> for some time I think I know what it happening. First, keep in mind that
>>>>>>> it concerns a device which uses shared tags. There are 127 tags available
>>>>>>> and the number of active queues using these tags is easily 40 or more. So
>>>>>>> number of tags available for each device is rather small. Now I'm not sure
>>>>>>> how batched wakeups can ever work in such situations, but maybe I'm missing
>>>>>>> something.
>>>>>>>
>>>>>>> So take for example a situation where two tags are available for a device,
>>>>>>> they are both currently used. Now a process comes into blk_mq_get_tag() and
>>>>>>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if
>>>>>>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but
>>>>>>> that's not enough to trigger the batched wakeup to really wakeup the
>>>>>>> waiter...
>>>>>>>
>>>>>>> Even if we have say 4 tags available so in theory there should be enough
>>>>>>> wakeups to fill the batch, there can be the following problem. So 4 tags
>>>>>>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait
>>>>>>> queue 0, one on wait queue 1. Now four IOs complete so
>>>>>>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements
>>>>>>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the
>>>>>>> waiters is woken up but the other one is still sleeping in wq1 and there
>>>>>>> are not enough wakeups to fill the batch and wake it up? This is
>>>>>>> essentially because we have lost three wakeups on wq0 because it didn't
>>>>>>> have enough waiters to wake...
>>>>>>
>>>>>>    From what I see, if tags are shared for multiple devices, wake_batch
>>>>>> should make sure that all waiter will be woke up:
>>>>>>
>>>>>> For example:
>>>>>> there are total 64 tags shared for two devices, then wake_batch is 4(if
>>>>>> both devices are active).  If there are waiters, which means at least 32
>>>>>> tags are grabed, thus 8 queues will ensure to wake up at least once
>>>>>> after 32 tags are freed.
>>>>>
>>>>> Well, yes, wake_batch is updated but as my example above shows it is not
>>>>> enough to fix "wasted" wakeups.
>>>>
>>>> Tags can be preempted, which means new thread can be added to waitqueue
>>>> only if there are no free tags.
>>>
>>> Yes.
>>>
>>>> With the above condition, I can't think of any possibility how the
>>>> following scenario can be existed(dispite the wake ups can be missed):
>>>>
>>>> Only wake_batch tags are still in use, while multiple waitqueues are
>>>> still active.
>>>>
>>>> If you think this is possible, can you share the initial conditions and
>>>> how does it end up to the problematic scenario?
>>>
>>> Very easily AFAICT. I've described the scenario in my original email but
>>> let me maybe write it here with more detail. Let's assume we have 4 tags
>>> available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4
>>> for all waitqueues. All four tags are currently in use.
>>>
>>> Now task T1 comes, wants a new tag:
>>> blk_mq_get_tag()
>>>     bt_wait_ptr() -> gets ws 0, wait_index incremented to 1
>>>     goes to sleep on ws 0
>>>
>>> Now task T2 comes, wants a new tag:
>>> blk_mq_get_tag()
>>>     bt_wait_ptr() -> gets ws 1, wait_index incremented to 2
>>>     goes to sleep on ws 1
>>>
>>> Now all four requests complete, this generates 4 calls to
>>> sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0
>>> and we do wake_up_nr(&ws->wait, 4). This wakes T1.
>>>
>>> T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is
>>> called for ws 1. wait_cnt is decremented to 3.
>>>
>>> Now there's no IO in flight but we still have task sleeping in ws 1.
>>> Everything is stuck until someone submits more IO (which may never happen
>>> because everything ends up waiting on completion of IO T2 does).
>>
>> I assum that there should be at least 32 total tags, and at least 8
>> device are issuing io, so that there are only 4 tags available for
>> the devcie? (due to fair share).
>>
>> If so, io from other devcies should trigger new wakeup.
> 
> Well, there isn't necessarily any IO going on on other devices, there may
> be 4 tags used on our device, the rest is free but we are not allowed to
> use them. Sure eventually we should detect other devices are idle and
> decrease tags->active_queues but that can take 30s or more...
> 
It's right queue can be active while there is no io, blk_mq_tag_idle()
will not be called untill such queue doesn't issue any io for a period
time, thus I do think io will not hung eventually, however, it can be
improved that io will rely on other devices to woke up.

BTW, I tried to improve that tags can be wasted for share tags long time
ago:

https://lore.kernel.org/all/20201226102808.2534966-1-yukuai3@huawei.com/

Thanks,
Kuai
> 								Honza
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-06-21  1:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara
2022-06-17  1:07 ` Ming Lei
2022-06-17 10:50   ` Jan Kara
2022-06-17  1:40 ` Yu Kuai
2022-06-17 11:31   ` Jan Kara
2022-06-17 12:50     ` Yu Kuai
2022-06-20 11:57       ` Jan Kara
2022-06-20 13:11         ` Yu Kuai
2022-06-20 16:57           ` Jan Kara
2022-06-21  1:10             ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox