* Races in sbitmap batched wakeups @ 2022-06-16 17:21 Jan Kara 2022-06-17 1:07 ` Ming Lei 2022-06-17 1:40 ` Yu Kuai 0 siblings, 2 replies; 10+ messages in thread From: Jan Kara @ 2022-06-16 17:21 UTC (permalink / raw) To: Jens Axboe, Omar Sandoval; +Cc: linux-block Hello! I've been debugging some customer reports of tasks hanging (forever) waiting for free tags when in fact all tags are free. After looking into it for some time I think I know what it happening. First, keep in mind that it concerns a device which uses shared tags. There are 127 tags available and the number of active queues using these tags is easily 40 or more. So number of tags available for each device is rather small. Now I'm not sure how batched wakeups can ever work in such situations, but maybe I'm missing something. So take for example a situation where two tags are available for a device, they are both currently used. Now a process comes into blk_mq_get_tag() and wants to allocate tag and goes to sleep. Now how can it ever be woken up if wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but that's not enough to trigger the batched wakeup to really wakeup the waiter... Even if we have say 4 tags available so in theory there should be enough wakeups to fill the batch, there can be the following problem. So 4 tags are in use, two processes come to blk_mq_get_tag() and sleep, one on wait queue 0, one on wait queue 1. Now four IOs complete so sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the waiters is woken up but the other one is still sleeping in wq1 and there are not enough wakeups to fill the batch and wake it up? This is essentially because we have lost three wakeups on wq0 because it didn't have enough waiters to wake... Finally, sbitmap_queue_wake_up() is racy and if two of them race together, they can end up decrementing wait_cnt of wq which does not have any process queued which again effectively leads to lost wakeups and possibly indefinitely sleeping tasks. Now this all looks so broken that I'm unsure whether I'm not missing something fundamental. Also I'm not quite sure how to fix all this without basically destroying the batched wakeup feature (but I didn't put too much thought to this yet). Comments? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara @ 2022-06-17 1:07 ` Ming Lei 2022-06-17 10:50 ` Jan Kara 2022-06-17 1:40 ` Yu Kuai 1 sibling, 1 reply; 10+ messages in thread From: Ming Lei @ 2022-06-17 1:07 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block On Thu, Jun 16, 2022 at 07:21:02PM +0200, Jan Kara wrote: > Hello! > > I've been debugging some customer reports of tasks hanging (forever) > waiting for free tags when in fact all tags are free. After looking into it > for some time I think I know what it happening. First, keep in mind that > it concerns a device which uses shared tags. There are 127 tags available > and the number of active queues using these tags is easily 40 or more. So > number of tags available for each device is rather small. Now I'm not sure > how batched wakeups can ever work in such situations, but maybe I'm missing > something. > > So take for example a situation where two tags are available for a device, > they are both currently used. Now a process comes into blk_mq_get_tag() and > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > that's not enough to trigger the batched wakeup to really wakeup the > waiter... commit 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") is supposed for addressing this kind of issue. > > Even if we have say 4 tags available so in theory there should be enough > wakeups to fill the batch, there can be the following problem. So 4 tags > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > queue 0, one on wait queue 1. Now four IOs complete so > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > waiters is woken up but the other one is still sleeping in wq1 and there > are not enough wakeups to fill the batch and wake it up? This is > essentially because we have lost three wakeups on wq0 because it didn't > have enough waiters to wake... But the following completions will wake up the waiter in wq1, given there are more in-flight. Thanks, Ming ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-17 1:07 ` Ming Lei @ 2022-06-17 10:50 ` Jan Kara 0 siblings, 0 replies; 10+ messages in thread From: Jan Kara @ 2022-06-17 10:50 UTC (permalink / raw) To: Ming Lei; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block On Fri 17-06-22 09:07:18, Ming Lei wrote: > On Thu, Jun 16, 2022 at 07:21:02PM +0200, Jan Kara wrote: > > Hello! > > > > I've been debugging some customer reports of tasks hanging (forever) > > waiting for free tags when in fact all tags are free. After looking into it > > for some time I think I know what it happening. First, keep in mind that > > it concerns a device which uses shared tags. There are 127 tags available > > and the number of active queues using these tags is easily 40 or more. So > > number of tags available for each device is rather small. Now I'm not sure > > how batched wakeups can ever work in such situations, but maybe I'm missing > > something. > > > > So take for example a situation where two tags are available for a device, > > they are both currently used. Now a process comes into blk_mq_get_tag() and > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > > that's not enough to trigger the batched wakeup to really wakeup the > > waiter... > > commit 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") > is supposed for addressing this kind of issue. I have observed the deadlock with the above fixes applied. > > Even if we have say 4 tags available so in theory there should be enough > > wakeups to fill the batch, there can be the following problem. So 4 tags > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > > queue 0, one on wait queue 1. Now four IOs complete so > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > > waiters is woken up but the other one is still sleeping in wq1 and there > > are not enough wakeups to fill the batch and wake it up? This is > > essentially because we have lost three wakeups on wq0 because it didn't > > have enough waiters to wake... > > But the following completions will wake up the waiter in wq1, given > there are more in-flight. Well, there is only one more request in flight - from the unblocked waiter. And once that request completes, it will generate just one wakeup which is not enough to wake the waiter on wq1 because wake_batch is 4... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara 2022-06-17 1:07 ` Ming Lei @ 2022-06-17 1:40 ` Yu Kuai 2022-06-17 11:31 ` Jan Kara 1 sibling, 1 reply; 10+ messages in thread From: Yu Kuai @ 2022-06-17 1:40 UTC (permalink / raw) To: Jan Kara, Jens Axboe, Omar Sandoval; +Cc: linux-block 在 2022/06/17 1:21, Jan Kara 写道: > Hello! > > I've been debugging some customer reports of tasks hanging (forever) > waiting for free tags when in fact all tags are free. After looking into it > for some time I think I know what it happening. First, keep in mind that > it concerns a device which uses shared tags. There are 127 tags available > and the number of active queues using these tags is easily 40 or more. So > number of tags available for each device is rather small. Now I'm not sure > how batched wakeups can ever work in such situations, but maybe I'm missing > something. > > So take for example a situation where two tags are available for a device, > they are both currently used. Now a process comes into blk_mq_get_tag() and > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > that's not enough to trigger the batched wakeup to really wakeup the > waiter... > > Even if we have say 4 tags available so in theory there should be enough > wakeups to fill the batch, there can be the following problem. So 4 tags > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > queue 0, one on wait queue 1. Now four IOs complete so > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > waiters is woken up but the other one is still sleeping in wq1 and there > are not enough wakeups to fill the batch and wake it up? This is > essentially because we have lost three wakeups on wq0 because it didn't > have enough waiters to wake... Hi, From what I see, if tags are shared for multiple devices, wake_batch should make sure that all waiter will be woke up: For example: there are total 64 tags shared for two devices, then wake_batch is 4(if both devices are active). If there are waiters, which means at least 32 tags are grabed, thus 8 queues will ensure to wake up at least once after 32 tags are freed. > > Finally, sbitmap_queue_wake_up() is racy and if two of them race together, > they can end up decrementing wait_cnt of wq which does not have any process > queued which again effectively leads to lost wakeups and possibly > indefinitely sleeping tasks. > BTW, I do this implementation have some problems on concurrent scenario, as described in following patch: https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/ > Now this all looks so broken that I'm unsure whether I'm not missing > something fundamental. Also I'm not quite sure how to fix all this without > basically destroying the batched wakeup feature (but I didn't put too much > thought to this yet). Comments? > > Honza > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-17 1:40 ` Yu Kuai @ 2022-06-17 11:31 ` Jan Kara 2022-06-17 12:50 ` Yu Kuai 0 siblings, 1 reply; 10+ messages in thread From: Jan Kara @ 2022-06-17 11:31 UTC (permalink / raw) To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block Hi! On Fri 17-06-22 09:40:11, Yu Kuai wrote: > 在 2022/06/17 1:21, Jan Kara 写道: > > I've been debugging some customer reports of tasks hanging (forever) > > waiting for free tags when in fact all tags are free. After looking into it > > for some time I think I know what it happening. First, keep in mind that > > it concerns a device which uses shared tags. There are 127 tags available > > and the number of active queues using these tags is easily 40 or more. So > > number of tags available for each device is rather small. Now I'm not sure > > how batched wakeups can ever work in such situations, but maybe I'm missing > > something. > > > > So take for example a situation where two tags are available for a device, > > they are both currently used. Now a process comes into blk_mq_get_tag() and > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > > that's not enough to trigger the batched wakeup to really wakeup the > > waiter... > > > > Even if we have say 4 tags available so in theory there should be enough > > wakeups to fill the batch, there can be the following problem. So 4 tags > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > > queue 0, one on wait queue 1. Now four IOs complete so > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > > waiters is woken up but the other one is still sleeping in wq1 and there > > are not enough wakeups to fill the batch and wake it up? This is > > essentially because we have lost three wakeups on wq0 because it didn't > > have enough waiters to wake... > > From what I see, if tags are shared for multiple devices, wake_batch > should make sure that all waiter will be woke up: > > For example: > there are total 64 tags shared for two devices, then wake_batch is 4(if > both devices are active). If there are waiters, which means at least 32 > tags are grabed, thus 8 queues will ensure to wake up at least once > after 32 tags are freed. Well, yes, wake_batch is updated but as my example above shows it is not enough to fix "wasted" wakeups. > > Finally, sbitmap_queue_wake_up() is racy and if two of them race together, > > they can end up decrementing wait_cnt of wq which does not have any process > > queued which again effectively leads to lost wakeups and possibly > > indefinitely sleeping tasks. > > > > BTW, I do this implementation have some problems on concurrent > scenario, as described in following patch: > > https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/ Yes, as far as I can see you have identified similar races as I point out in this email. But I'm not sure whether your patch fixes all the possibilities for lost wakeups... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-17 11:31 ` Jan Kara @ 2022-06-17 12:50 ` Yu Kuai 2022-06-20 11:57 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Yu Kuai @ 2022-06-17 12:50 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block 在 2022/06/17 19:31, Jan Kara 写道: > Hi! > > On Fri 17-06-22 09:40:11, Yu Kuai wrote: >> 在 2022/06/17 1:21, Jan Kara 写道: >>> I've been debugging some customer reports of tasks hanging (forever) >>> waiting for free tags when in fact all tags are free. After looking into it >>> for some time I think I know what it happening. First, keep in mind that >>> it concerns a device which uses shared tags. There are 127 tags available >>> and the number of active queues using these tags is easily 40 or more. So >>> number of tags available for each device is rather small. Now I'm not sure >>> how batched wakeups can ever work in such situations, but maybe I'm missing >>> something. >>> >>> So take for example a situation where two tags are available for a device, >>> they are both currently used. Now a process comes into blk_mq_get_tag() and >>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if >>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but >>> that's not enough to trigger the batched wakeup to really wakeup the >>> waiter... >>> >>> Even if we have say 4 tags available so in theory there should be enough >>> wakeups to fill the batch, there can be the following problem. So 4 tags >>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait >>> queue 0, one on wait queue 1. Now four IOs complete so >>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements >>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the >>> waiters is woken up but the other one is still sleeping in wq1 and there >>> are not enough wakeups to fill the batch and wake it up? This is >>> essentially because we have lost three wakeups on wq0 because it didn't >>> have enough waiters to wake... >> >> From what I see, if tags are shared for multiple devices, wake_batch >> should make sure that all waiter will be woke up: >> >> For example: >> there are total 64 tags shared for two devices, then wake_batch is 4(if >> both devices are active). If there are waiters, which means at least 32 >> tags are grabed, thus 8 queues will ensure to wake up at least once >> after 32 tags are freed. > > Well, yes, wake_batch is updated but as my example above shows it is not > enough to fix "wasted" wakeups. Tags can be preempted, which means new thread can be added to waitqueue only if there are no free tags. With the above condition, I can't think of any possibility how the following scenario can be existed(dispite the wake ups can be missed): Only wake_batch tags are still in use, while multiple waitqueues are still active. If you think this is possible, can you share the initial conditions and how does it end up to the problematic scenario? Thanks, Kuai > >>> Finally, sbitmap_queue_wake_up() is racy and if two of them race together, >>> they can end up decrementing wait_cnt of wq which does not have any process >>> queued which again effectively leads to lost wakeups and possibly >>> indefinitely sleeping tasks. >>> >> >> BTW, I do this implementation have some problems on concurrent >> scenario, as described in following patch: >> >> https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@huawei.com/ > > Yes, as far as I can see you have identified similar races as I point out > in this email. But I'm not sure whether your patch fixes all the > possibilities for lost wakeups... > > Honza > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-17 12:50 ` Yu Kuai @ 2022-06-20 11:57 ` Jan Kara 2022-06-20 13:11 ` Yu Kuai 0 siblings, 1 reply; 10+ messages in thread From: Jan Kara @ 2022-06-20 11:57 UTC (permalink / raw) To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu Hello! On Fri 17-06-22 20:50:17, Yu Kuai wrote: > 在 2022/06/17 19:31, Jan Kara 写道: > > On Fri 17-06-22 09:40:11, Yu Kuai wrote: > > > 在 2022/06/17 1:21, Jan Kara 写道: > > > > I've been debugging some customer reports of tasks hanging (forever) > > > > waiting for free tags when in fact all tags are free. After looking into it > > > > for some time I think I know what it happening. First, keep in mind that > > > > it concerns a device which uses shared tags. There are 127 tags available > > > > and the number of active queues using these tags is easily 40 or more. So > > > > number of tags available for each device is rather small. Now I'm not sure > > > > how batched wakeups can ever work in such situations, but maybe I'm missing > > > > something. > > > > > > > > So take for example a situation where two tags are available for a device, > > > > they are both currently used. Now a process comes into blk_mq_get_tag() and > > > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > > > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > > > > that's not enough to trigger the batched wakeup to really wakeup the > > > > waiter... > > > > > > > > Even if we have say 4 tags available so in theory there should be enough > > > > wakeups to fill the batch, there can be the following problem. So 4 tags > > > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > > > > queue 0, one on wait queue 1. Now four IOs complete so > > > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > > > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > > > > waiters is woken up but the other one is still sleeping in wq1 and there > > > > are not enough wakeups to fill the batch and wake it up? This is > > > > essentially because we have lost three wakeups on wq0 because it didn't > > > > have enough waiters to wake... > > > > > > From what I see, if tags are shared for multiple devices, wake_batch > > > should make sure that all waiter will be woke up: > > > > > > For example: > > > there are total 64 tags shared for two devices, then wake_batch is 4(if > > > both devices are active). If there are waiters, which means at least 32 > > > tags are grabed, thus 8 queues will ensure to wake up at least once > > > after 32 tags are freed. > > > > Well, yes, wake_batch is updated but as my example above shows it is not > > enough to fix "wasted" wakeups. > > Tags can be preempted, which means new thread can be added to waitqueue > only if there are no free tags. Yes. > With the above condition, I can't think of any possibility how the > following scenario can be existed(dispite the wake ups can be missed): > > Only wake_batch tags are still in use, while multiple waitqueues are > still active. > > If you think this is possible, can you share the initial conditions and > how does it end up to the problematic scenario? Very easily AFAICT. I've described the scenario in my original email but let me maybe write it here with more detail. Let's assume we have 4 tags available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4 for all waitqueues. All four tags are currently in use. Now task T1 comes, wants a new tag: blk_mq_get_tag() bt_wait_ptr() -> gets ws 0, wait_index incremented to 1 goes to sleep on ws 0 Now task T2 comes, wants a new tag: blk_mq_get_tag() bt_wait_ptr() -> gets ws 1, wait_index incremented to 2 goes to sleep on ws 1 Now all four requests complete, this generates 4 calls to sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0 and we do wake_up_nr(&ws->wait, 4). This wakes T1. T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is called for ws 1. wait_cnt is decremented to 3. Now there's no IO in flight but we still have task sleeping in ws 1. Everything is stuck until someone submits more IO (which may never happen because everything ends up waiting on completion of IO T2 does). Note that I have given an example with 4 tags available but in principle the problem can happen with up to (wake_batch - 1) * SBQ_WAIT_QUEUES tags. Hum, spelling it out so exactly this particular problem should be probably fixed by changing the code in sbitmap_queue_recalculate_wake_batch() to not force wake_batch to 4 if there is higher number of total tags available? Laibin? That would be at least a quick fix for this particular problem. And AFAICS this is independent problem from the one you are fixing in your patch... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-20 11:57 ` Jan Kara @ 2022-06-20 13:11 ` Yu Kuai 2022-06-20 16:57 ` Jan Kara 0 siblings, 1 reply; 10+ messages in thread From: Yu Kuai @ 2022-06-20 13:11 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu 在 2022/06/20 19:57, Jan Kara 写道: > Hello! > > On Fri 17-06-22 20:50:17, Yu Kuai wrote: >> 在 2022/06/17 19:31, Jan Kara 写道: >>> On Fri 17-06-22 09:40:11, Yu Kuai wrote: >>>> 在 2022/06/17 1:21, Jan Kara 写道: >>>>> I've been debugging some customer reports of tasks hanging (forever) >>>>> waiting for free tags when in fact all tags are free. After looking into it >>>>> for some time I think I know what it happening. First, keep in mind that >>>>> it concerns a device which uses shared tags. There are 127 tags available >>>>> and the number of active queues using these tags is easily 40 or more. So >>>>> number of tags available for each device is rather small. Now I'm not sure >>>>> how batched wakeups can ever work in such situations, but maybe I'm missing >>>>> something. >>>>> >>>>> So take for example a situation where two tags are available for a device, >>>>> they are both currently used. Now a process comes into blk_mq_get_tag() and >>>>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if >>>>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but >>>>> that's not enough to trigger the batched wakeup to really wakeup the >>>>> waiter... >>>>> >>>>> Even if we have say 4 tags available so in theory there should be enough >>>>> wakeups to fill the batch, there can be the following problem. So 4 tags >>>>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait >>>>> queue 0, one on wait queue 1. Now four IOs complete so >>>>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements >>>>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the >>>>> waiters is woken up but the other one is still sleeping in wq1 and there >>>>> are not enough wakeups to fill the batch and wake it up? This is >>>>> essentially because we have lost three wakeups on wq0 because it didn't >>>>> have enough waiters to wake... >>>> >>>> From what I see, if tags are shared for multiple devices, wake_batch >>>> should make sure that all waiter will be woke up: >>>> >>>> For example: >>>> there are total 64 tags shared for two devices, then wake_batch is 4(if >>>> both devices are active). If there are waiters, which means at least 32 >>>> tags are grabed, thus 8 queues will ensure to wake up at least once >>>> after 32 tags are freed. >>> >>> Well, yes, wake_batch is updated but as my example above shows it is not >>> enough to fix "wasted" wakeups. >> >> Tags can be preempted, which means new thread can be added to waitqueue >> only if there are no free tags. > > Yes. > >> With the above condition, I can't think of any possibility how the >> following scenario can be existed(dispite the wake ups can be missed): >> >> Only wake_batch tags are still in use, while multiple waitqueues are >> still active. >> >> If you think this is possible, can you share the initial conditions and >> how does it end up to the problematic scenario? > > Very easily AFAICT. I've described the scenario in my original email but > let me maybe write it here with more detail. Let's assume we have 4 tags > available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4 > for all waitqueues. All four tags are currently in use. > > Now task T1 comes, wants a new tag: > blk_mq_get_tag() > bt_wait_ptr() -> gets ws 0, wait_index incremented to 1 > goes to sleep on ws 0 > > Now task T2 comes, wants a new tag: > blk_mq_get_tag() > bt_wait_ptr() -> gets ws 1, wait_index incremented to 2 > goes to sleep on ws 1 > > Now all four requests complete, this generates 4 calls to > sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0 > and we do wake_up_nr(&ws->wait, 4). This wakes T1. > > T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is > called for ws 1. wait_cnt is decremented to 3. > > Now there's no IO in flight but we still have task sleeping in ws 1. > Everything is stuck until someone submits more IO (which may never happen > because everything ends up waiting on completion of IO T2 does). Hi, Jan I assum that there should be at least 32 total tags, and at least 8 device are issuing io, so that there are only 4 tags available for the devcie? (due to fair share). If so, io from other devcies should trigger new wakeup. Thanks, Kuai > > Note that I have given an example with 4 tags available but in principle > the problem can happen with up to (wake_batch - 1) * SBQ_WAIT_QUEUES tags. > Hum, spelling it out so exactly this particular problem should be probably > fixed by changing the code in sbitmap_queue_recalculate_wake_batch() to not > force wake_batch to 4 if there is higher number of total tags available? > Laibin? That would be at least a quick fix for this particular problem. > > And AFAICS this is independent problem from the one you are fixing in your > patch... > > Honza > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-20 13:11 ` Yu Kuai @ 2022-06-20 16:57 ` Jan Kara 2022-06-21 1:10 ` Yu Kuai 0 siblings, 1 reply; 10+ messages in thread From: Jan Kara @ 2022-06-20 16:57 UTC (permalink / raw) To: Yu Kuai; +Cc: Jan Kara, Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu On Mon 20-06-22 21:11:25, Yu Kuai wrote: > 在 2022/06/20 19:57, Jan Kara 写道: > > Hello! > > > > On Fri 17-06-22 20:50:17, Yu Kuai wrote: > > > 在 2022/06/17 19:31, Jan Kara 写道: > > > > On Fri 17-06-22 09:40:11, Yu Kuai wrote: > > > > > 在 2022/06/17 1:21, Jan Kara 写道: > > > > > > I've been debugging some customer reports of tasks hanging (forever) > > > > > > waiting for free tags when in fact all tags are free. After looking into it > > > > > > for some time I think I know what it happening. First, keep in mind that > > > > > > it concerns a device which uses shared tags. There are 127 tags available > > > > > > and the number of active queues using these tags is easily 40 or more. So > > > > > > number of tags available for each device is rather small. Now I'm not sure > > > > > > how batched wakeups can ever work in such situations, but maybe I'm missing > > > > > > something. > > > > > > > > > > > > So take for example a situation where two tags are available for a device, > > > > > > they are both currently used. Now a process comes into blk_mq_get_tag() and > > > > > > wants to allocate tag and goes to sleep. Now how can it ever be woken up if > > > > > > wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but > > > > > > that's not enough to trigger the batched wakeup to really wakeup the > > > > > > waiter... > > > > > > > > > > > > Even if we have say 4 tags available so in theory there should be enough > > > > > > wakeups to fill the batch, there can be the following problem. So 4 tags > > > > > > are in use, two processes come to blk_mq_get_tag() and sleep, one on wait > > > > > > queue 0, one on wait queue 1. Now four IOs complete so > > > > > > sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements > > > > > > wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the > > > > > > waiters is woken up but the other one is still sleeping in wq1 and there > > > > > > are not enough wakeups to fill the batch and wake it up? This is > > > > > > essentially because we have lost three wakeups on wq0 because it didn't > > > > > > have enough waiters to wake... > > > > > > > > > > From what I see, if tags are shared for multiple devices, wake_batch > > > > > should make sure that all waiter will be woke up: > > > > > > > > > > For example: > > > > > there are total 64 tags shared for two devices, then wake_batch is 4(if > > > > > both devices are active). If there are waiters, which means at least 32 > > > > > tags are grabed, thus 8 queues will ensure to wake up at least once > > > > > after 32 tags are freed. > > > > > > > > Well, yes, wake_batch is updated but as my example above shows it is not > > > > enough to fix "wasted" wakeups. > > > > > > Tags can be preempted, which means new thread can be added to waitqueue > > > only if there are no free tags. > > > > Yes. > > > > > With the above condition, I can't think of any possibility how the > > > following scenario can be existed(dispite the wake ups can be missed): > > > > > > Only wake_batch tags are still in use, while multiple waitqueues are > > > still active. > > > > > > If you think this is possible, can you share the initial conditions and > > > how does it end up to the problematic scenario? > > > > Very easily AFAICT. I've described the scenario in my original email but > > let me maybe write it here with more detail. Let's assume we have 4 tags > > available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4 > > for all waitqueues. All four tags are currently in use. > > > > Now task T1 comes, wants a new tag: > > blk_mq_get_tag() > > bt_wait_ptr() -> gets ws 0, wait_index incremented to 1 > > goes to sleep on ws 0 > > > > Now task T2 comes, wants a new tag: > > blk_mq_get_tag() > > bt_wait_ptr() -> gets ws 1, wait_index incremented to 2 > > goes to sleep on ws 1 > > > > Now all four requests complete, this generates 4 calls to > > sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0 > > and we do wake_up_nr(&ws->wait, 4). This wakes T1. > > > > T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is > > called for ws 1. wait_cnt is decremented to 3. > > > > Now there's no IO in flight but we still have task sleeping in ws 1. > > Everything is stuck until someone submits more IO (which may never happen > > because everything ends up waiting on completion of IO T2 does). > > I assum that there should be at least 32 total tags, and at least 8 > device are issuing io, so that there are only 4 tags available for > the devcie? (due to fair share). > > If so, io from other devcies should trigger new wakeup. Well, there isn't necessarily any IO going on on other devices, there may be 4 tags used on our device, the rest is free but we are not allowed to use them. Sure eventually we should detect other devices are idle and decrease tags->active_queues but that can take 30s or more... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Races in sbitmap batched wakeups 2022-06-20 16:57 ` Jan Kara @ 2022-06-21 1:10 ` Yu Kuai 0 siblings, 0 replies; 10+ messages in thread From: Yu Kuai @ 2022-06-21 1:10 UTC (permalink / raw) To: Jan Kara; +Cc: Jens Axboe, Omar Sandoval, linux-block, Laibin Qiu 在 2022/06/21 0:57, Jan Kara 写道: > On Mon 20-06-22 21:11:25, Yu Kuai wrote: >> 在 2022/06/20 19:57, Jan Kara 写道: >>> Hello! >>> >>> On Fri 17-06-22 20:50:17, Yu Kuai wrote: >>>> 在 2022/06/17 19:31, Jan Kara 写道: >>>>> On Fri 17-06-22 09:40:11, Yu Kuai wrote: >>>>>> 在 2022/06/17 1:21, Jan Kara 写道: >>>>>>> I've been debugging some customer reports of tasks hanging (forever) >>>>>>> waiting for free tags when in fact all tags are free. After looking into it >>>>>>> for some time I think I know what it happening. First, keep in mind that >>>>>>> it concerns a device which uses shared tags. There are 127 tags available >>>>>>> and the number of active queues using these tags is easily 40 or more. So >>>>>>> number of tags available for each device is rather small. Now I'm not sure >>>>>>> how batched wakeups can ever work in such situations, but maybe I'm missing >>>>>>> something. >>>>>>> >>>>>>> So take for example a situation where two tags are available for a device, >>>>>>> they are both currently used. Now a process comes into blk_mq_get_tag() and >>>>>>> wants to allocate tag and goes to sleep. Now how can it ever be woken up if >>>>>>> wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups but >>>>>>> that's not enough to trigger the batched wakeup to really wakeup the >>>>>>> waiter... >>>>>>> >>>>>>> Even if we have say 4 tags available so in theory there should be enough >>>>>>> wakeups to fill the batch, there can be the following problem. So 4 tags >>>>>>> are in use, two processes come to blk_mq_get_tag() and sleep, one on wait >>>>>>> queue 0, one on wait queue 1. Now four IOs complete so >>>>>>> sbitmap_queue_wake_up() gets called 4 times and the fourth call decrements >>>>>>> wait_cnt to 0 so it ends up calling wake_up_nr(wq0, 4). Fine, one of the >>>>>>> waiters is woken up but the other one is still sleeping in wq1 and there >>>>>>> are not enough wakeups to fill the batch and wake it up? This is >>>>>>> essentially because we have lost three wakeups on wq0 because it didn't >>>>>>> have enough waiters to wake... >>>>>> >>>>>> From what I see, if tags are shared for multiple devices, wake_batch >>>>>> should make sure that all waiter will be woke up: >>>>>> >>>>>> For example: >>>>>> there are total 64 tags shared for two devices, then wake_batch is 4(if >>>>>> both devices are active). If there are waiters, which means at least 32 >>>>>> tags are grabed, thus 8 queues will ensure to wake up at least once >>>>>> after 32 tags are freed. >>>>> >>>>> Well, yes, wake_batch is updated but as my example above shows it is not >>>>> enough to fix "wasted" wakeups. >>>> >>>> Tags can be preempted, which means new thread can be added to waitqueue >>>> only if there are no free tags. >>> >>> Yes. >>> >>>> With the above condition, I can't think of any possibility how the >>>> following scenario can be existed(dispite the wake ups can be missed): >>>> >>>> Only wake_batch tags are still in use, while multiple waitqueues are >>>> still active. >>>> >>>> If you think this is possible, can you share the initial conditions and >>>> how does it end up to the problematic scenario? >>> >>> Very easily AFAICT. I've described the scenario in my original email but >>> let me maybe write it here with more detail. Let's assume we have 4 tags >>> available for our device, wake_batch is 4, wait_index is 0, wait_cnt is 4 >>> for all waitqueues. All four tags are currently in use. >>> >>> Now task T1 comes, wants a new tag: >>> blk_mq_get_tag() >>> bt_wait_ptr() -> gets ws 0, wait_index incremented to 1 >>> goes to sleep on ws 0 >>> >>> Now task T2 comes, wants a new tag: >>> blk_mq_get_tag() >>> bt_wait_ptr() -> gets ws 1, wait_index incremented to 2 >>> goes to sleep on ws 1 >>> >>> Now all four requests complete, this generates 4 calls to >>> sbitmap_queue_wake_up() for ws 0, which decrements wait_cnt on ws 0 to 0 >>> and we do wake_up_nr(&ws->wait, 4). This wakes T1. >>> >>> T1 allocates a tag, does IO, IO completes. sbitmap_queue_wake_up() is >>> called for ws 1. wait_cnt is decremented to 3. >>> >>> Now there's no IO in flight but we still have task sleeping in ws 1. >>> Everything is stuck until someone submits more IO (which may never happen >>> because everything ends up waiting on completion of IO T2 does). >> >> I assum that there should be at least 32 total tags, and at least 8 >> device are issuing io, so that there are only 4 tags available for >> the devcie? (due to fair share). >> >> If so, io from other devcies should trigger new wakeup. > > Well, there isn't necessarily any IO going on on other devices, there may > be 4 tags used on our device, the rest is free but we are not allowed to > use them. Sure eventually we should detect other devices are idle and > decrease tags->active_queues but that can take 30s or more... > It's right queue can be active while there is no io, blk_mq_tag_idle() will not be called untill such queue doesn't issue any io for a period time, thus I do think io will not hung eventually, however, it can be improved that io will rely on other devices to woke up. BTW, I tried to improve that tags can be wasted for share tags long time ago: https://lore.kernel.org/all/20201226102808.2534966-1-yukuai3@huawei.com/ Thanks, Kuai > Honza > ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2022-06-21 1:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-06-16 17:21 Races in sbitmap batched wakeups Jan Kara 2022-06-17 1:07 ` Ming Lei 2022-06-17 10:50 ` Jan Kara 2022-06-17 1:40 ` Yu Kuai 2022-06-17 11:31 ` Jan Kara 2022-06-17 12:50 ` Yu Kuai 2022-06-20 11:57 ` Jan Kara 2022-06-20 13:11 ` Yu Kuai 2022-06-20 16:57 ` Jan Kara 2022-06-21 1:10 ` Yu Kuai
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox