* [PATCH v2 0/2] Fixes for zone write plugging
@ 2024-04-20 7:58 Damien Le Moal
2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Damien Le Moal @ 2024-04-20 7:58 UTC (permalink / raw)
To: Jens Axboe, linux-block
(Apologies for the churn. I sent the wrong patch 2...)
Jens,
A couple of fix patches for zone write plugging. The second fix
addresses something that is in fact likely to happen in production with
large servers with lots of SMR drives.
Changes from v1:
- Fix patch 2 to create the workqueue with max active set to zone plug
pool size.
Damien Le Moal (2):
block: prevent freeing a zone write plug too early
block: use a per disk workqueue for zone write plugging
block/blk-zoned.c | 34 ++++++++++++++++++++++++++--------
include/linux/blkdev.h | 1 +
2 files changed, 27 insertions(+), 8 deletions(-)
--
2.44.0
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-20 7:58 [PATCH v2 0/2] Fixes for zone write plugging Damien Le Moal @ 2024-04-20 7:58 ` Damien Le Moal 2024-04-22 6:23 ` Christoph Hellwig 2024-04-23 14:21 ` Jens Axboe 2024-04-20 7:58 ` [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging Damien Le Moal 2024-04-23 15:46 ` (subset) [PATCH v2 0/2] Fixes " Jens Axboe 2 siblings, 2 replies; 13+ messages in thread From: Damien Le Moal @ 2024-04-20 7:58 UTC (permalink / raw) To: Jens Axboe, linux-block The submission of plugged BIOs is done using a work struct executing the function blk_zone_wplug_bio_work(). This function gets and submits a plugged zone write BIO and is guaranteed to operate on a valid zone write plug (with a reference count higher than 0) on entry as plugged BIOs hold a reference on their zone write plugs. However, once a BIO is submitted with submit_bio_noacct_nocheck(), the BIO may complete before blk_zone_wplug_bio_work(), with the BIO completion trigering a release and freeing of the zone write plug if the BIO is the last write to a zone (making the zone FULL). This potentially can result in the zone write plug being freed while the work is still active. Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> --- block/blk-zoned.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 3befebe6b319..685f0b9159fd 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) struct blk_zone_wplug *zwplug = container_of(rcu_head, struct blk_zone_wplug, rcu_head); + flush_work(&zwplug->bio_work); + mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); } -- 2.44.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal @ 2024-04-22 6:23 ` Christoph Hellwig 2024-04-23 6:12 ` Damien Le Moal 2024-04-23 14:21 ` Jens Axboe 1 sibling, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2024-04-22 6:23 UTC (permalink / raw) To: Damien Le Moal; +Cc: Jens Axboe, linux-block On Sat, Apr 20, 2024 at 04:58:10PM +0900, Damien Le Moal wrote: > Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). Calling flush_work from a rcu callback is just asking for nasty deadlocks. What prevents you from just holding an extra zwplug reference while blk_zone_wplug_bio_work is running? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-22 6:23 ` Christoph Hellwig @ 2024-04-23 6:12 ` Damien Le Moal 0 siblings, 0 replies; 13+ messages in thread From: Damien Le Moal @ 2024-04-23 6:12 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jens Axboe, linux-block On 2024/04/22 16:23, Christoph Hellwig wrote: > On Sat, Apr 20, 2024 at 04:58:10PM +0900, Damien Le Moal wrote: >> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). > > Calling flush_work from a rcu callback is just asking for nasty > deadlocks. > > What prevents you from just holding an extra zwplug reference while > blk_zone_wplug_bio_work is running? Problem is that this extra reference needs to be released in blk_zone_wplug_bio_work(), before that function returns, and that is still the work thread context using zwplug->bio_work. So we always have a small window between the ref drop and the zone BIO work thread completing (context switch). If we get a BIO completion in that window and free the plug, then the BIO work struct may go away while the work thread is still referencing it. Given that freeing of plugs will happen only after the RCU grace periods elapses, I think this is all very unlikely to happen, but at the same time, I do not see any guarantee that this cannot happen... -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal 2024-04-22 6:23 ` Christoph Hellwig @ 2024-04-23 14:21 ` Jens Axboe 2024-04-23 15:16 ` Damien Le Moal 1 sibling, 1 reply; 13+ messages in thread From: Jens Axboe @ 2024-04-23 14:21 UTC (permalink / raw) To: Damien Le Moal, linux-block On 4/20/24 1:58 AM, Damien Le Moal wrote: > The submission of plugged BIOs is done using a work struct executing the > function blk_zone_wplug_bio_work(). This function gets and submits a > plugged zone write BIO and is guaranteed to operate on a valid zone > write plug (with a reference count higher than 0) on entry as plugged > BIOs hold a reference on their zone write plugs. However, once a BIO is > submitted with submit_bio_noacct_nocheck(), the BIO may complete before > blk_zone_wplug_bio_work(), with the BIO completion trigering a release > and freeing of the zone write plug if the BIO is the last write to a > zone (making the zone FULL). This potentially can result in the zone > write plug being freed while the work is still active. > > Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). > > Fixes: dd291d77cc90 ("block: Introduce zone write plugging") > Signed-off-by: Damien Le Moal <dlemoal@kernel.org> > --- > block/blk-zoned.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/block/blk-zoned.c b/block/blk-zoned.c > index 3befebe6b319..685f0b9159fd 100644 > --- a/block/blk-zoned.c > +++ b/block/blk-zoned.c > @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) > struct blk_zone_wplug *zwplug = > container_of(rcu_head, struct blk_zone_wplug, rcu_head); > > + flush_work(&zwplug->bio_work); > + > mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); > } This is totally backwards. First of all, if you actually had work that needed flushing at this point, the kernel would bomb spectacularly. Secondly, what's the point of using RCU to protect this, if you're now needing to flush work from the RCU callback? That's a clear sign that something is very wrong here with your references / RCU usage.. The work item should hold a reference to it, trying to paper around it like this is not going to work at all. Why is the work item racing with RCU freeing?! -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-23 14:21 ` Jens Axboe @ 2024-04-23 15:16 ` Damien Le Moal 2024-04-23 15:36 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Damien Le Moal @ 2024-04-23 15:16 UTC (permalink / raw) To: Jens Axboe, linux-block On 2024/04/24 0:21, Jens Axboe wrote: > On 4/20/24 1:58 AM, Damien Le Moal wrote: >> The submission of plugged BIOs is done using a work struct executing the >> function blk_zone_wplug_bio_work(). This function gets and submits a >> plugged zone write BIO and is guaranteed to operate on a valid zone >> write plug (with a reference count higher than 0) on entry as plugged >> BIOs hold a reference on their zone write plugs. However, once a BIO is >> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >> and freeing of the zone write plug if the BIO is the last write to a >> zone (making the zone FULL). This potentially can result in the zone >> write plug being freed while the work is still active. >> >> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >> >> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> >> --- >> block/blk-zoned.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >> index 3befebe6b319..685f0b9159fd 100644 >> --- a/block/blk-zoned.c >> +++ b/block/blk-zoned.c >> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >> struct blk_zone_wplug *zwplug = >> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >> >> + flush_work(&zwplug->bio_work); >> + >> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >> } > > This is totally backwards. First of all, if you actually had work that > needed flushing at this point, the kernel would bomb spectacularly. > Secondly, what's the point of using RCU to protect this, if you're now > needing to flush work from the RCU callback? That's a clear sign that > something is very wrong here with your references / RCU usage.. The work > item should hold a reference to it, trying to paper around it like this > is not going to work at all. > > Why is the work item racing with RCU freeing?! The work item is a field of the zone write plug. Zone write plugs have references to them as long as BIOs are in flight and and the zone is not full. The zone write plug freeing through rcu is triggered by the last write to a zone that makes the zone full. But the completion of this last write BIO may happen right after the work issued the BIO with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() returns, while the work item is still active. The actual freeing of the plug happens only after the rcu grace period, and I was not entirely sure if this is enough to guarantee that the work thread is finished. But checking how the workqueue code processes the work item by calling the work function (blk_zone_wplug_bio_work() in this case), there is no issue because the work item (struct work_struct) is not touched once the work function is called. So there are no issues/races with freeing the zone write plug. I was overthinking this. My bad. We can drop this patch. Apologies for the noise. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-23 15:16 ` Damien Le Moal @ 2024-04-23 15:36 ` Jens Axboe 2024-04-23 18:19 ` Damien Le Moal 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2024-04-23 15:36 UTC (permalink / raw) To: Damien Le Moal, linux-block On 4/23/24 9:16 AM, Damien Le Moal wrote: > On 2024/04/24 0:21, Jens Axboe wrote: >> On 4/20/24 1:58 AM, Damien Le Moal wrote: >>> The submission of plugged BIOs is done using a work struct executing the >>> function blk_zone_wplug_bio_work(). This function gets and submits a >>> plugged zone write BIO and is guaranteed to operate on a valid zone >>> write plug (with a reference count higher than 0) on entry as plugged >>> BIOs hold a reference on their zone write plugs. However, once a BIO is >>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >>> and freeing of the zone write plug if the BIO is the last write to a >>> zone (making the zone FULL). This potentially can result in the zone >>> write plug being freed while the work is still active. >>> >>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >>> >>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> >>> --- >>> block/blk-zoned.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >>> index 3befebe6b319..685f0b9159fd 100644 >>> --- a/block/blk-zoned.c >>> +++ b/block/blk-zoned.c >>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >>> struct blk_zone_wplug *zwplug = >>> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >>> >>> + flush_work(&zwplug->bio_work); >>> + >>> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >>> } >> >> This is totally backwards. First of all, if you actually had work that >> needed flushing at this point, the kernel would bomb spectacularly. >> Secondly, what's the point of using RCU to protect this, if you're now >> needing to flush work from the RCU callback? That's a clear sign that >> something is very wrong here with your references / RCU usage.. The work >> item should hold a reference to it, trying to paper around it like this >> is not going to work at all. >> >> Why is the work item racing with RCU freeing?! > > The work item is a field of the zone write plug. Zone write plugs have > references to them as long as BIOs are in flight and and the zone is > not full. The zone write plug freeing through rcu is triggered by the > last write to a zone that makes the zone full. But the completion of > this last write BIO may happen right after the work issued the BIO > with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() > returns, while the work item is still active. > > The actual freeing of the plug happens only after the rcu grace > period, and I was not entirely sure if this is enough to guarantee > that the work thread is finished. But checking how the workqueue code > processes the work item by calling the work function > (blk_zone_wplug_bio_work() in this case), there is no issue because > the work item (struct work_struct) is not touched once the work > function is called. So there are no issues/races with freeing the zone > write plug. I was overthinking this. My bad. We can drop this patch. > Apologies for the noise. I took a closer look at the zone write plug reference handling, and it still doesn't look very good. Why are some just atomic_dec and there's just one spot that does dec_and_test? This again looks like janky referencing, to be honest. The relationship seems like it should be pretty clear. Any bio inflight against this zone plug should have a reference to it, AND the owner should have a reference to it, otherwise any bio completion (which can happen at ANY time) could free it. Any dropping of the ref should use a helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug() does. There should be no doubt about the above at all. If the plug has been added to a workqueue, it should be quite obvious that of course it has a reference to it already, outside of the bio's that are in it. I'd strongly encourage getting this sorted out before the merge window, I'm not at all convinced it's correct as-is. It's certainly not obviously correct, which it should be. The RCU rules are pretty simple if the the references are done in the kernel idiomatic way, but they are not. -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-23 15:36 ` Jens Axboe @ 2024-04-23 18:19 ` Damien Le Moal 2024-04-24 13:58 ` Jens Axboe 0 siblings, 1 reply; 13+ messages in thread From: Damien Le Moal @ 2024-04-23 18:19 UTC (permalink / raw) To: Jens Axboe, linux-block On 2024/04/24 1:36, Jens Axboe wrote: > On 4/23/24 9:16 AM, Damien Le Moal wrote: >> On 2024/04/24 0:21, Jens Axboe wrote: >>> On 4/20/24 1:58 AM, Damien Le Moal wrote: >>>> The submission of plugged BIOs is done using a work struct executing the >>>> function blk_zone_wplug_bio_work(). This function gets and submits a >>>> plugged zone write BIO and is guaranteed to operate on a valid zone >>>> write plug (with a reference count higher than 0) on entry as plugged >>>> BIOs hold a reference on their zone write plugs. However, once a BIO is >>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >>>> and freeing of the zone write plug if the BIO is the last write to a >>>> zone (making the zone FULL). This potentially can result in the zone >>>> write plug being freed while the work is still active. >>>> >>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >>>> >>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> >>>> --- >>>> block/blk-zoned.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> >>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >>>> index 3befebe6b319..685f0b9159fd 100644 >>>> --- a/block/blk-zoned.c >>>> +++ b/block/blk-zoned.c >>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >>>> struct blk_zone_wplug *zwplug = >>>> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >>>> >>>> + flush_work(&zwplug->bio_work); >>>> + >>>> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >>>> } >>> >>> This is totally backwards. First of all, if you actually had work that >>> needed flushing at this point, the kernel would bomb spectacularly. >>> Secondly, what's the point of using RCU to protect this, if you're now >>> needing to flush work from the RCU callback? That's a clear sign that >>> something is very wrong here with your references / RCU usage.. The work >>> item should hold a reference to it, trying to paper around it like this >>> is not going to work at all. >>> >>> Why is the work item racing with RCU freeing?! >> >> The work item is a field of the zone write plug. Zone write plugs have >> references to them as long as BIOs are in flight and and the zone is >> not full. The zone write plug freeing through rcu is triggered by the >> last write to a zone that makes the zone full. But the completion of >> this last write BIO may happen right after the work issued the BIO >> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() >> returns, while the work item is still active. >> >> The actual freeing of the plug happens only after the rcu grace >> period, and I was not entirely sure if this is enough to guarantee >> that the work thread is finished. But checking how the workqueue code >> processes the work item by calling the work function >> (blk_zone_wplug_bio_work() in this case), there is no issue because >> the work item (struct work_struct) is not touched once the work >> function is called. So there are no issues/races with freeing the zone >> write plug. I was overthinking this. My bad. We can drop this patch. >> Apologies for the noise. > > I took a closer look at the zone write plug reference handling, and it > still doesn't look very good. Why are some just atomic_dec and there's > just one spot that does dec_and_test? This again looks like janky > referencing, to be honest. > > The relationship seems like it should be pretty clear. Any bio inflight > against this zone plug should have a reference to it, AND the owner > should have a reference to it, otherwise any bio completion (which can > happen at ANY time) could free it. Any dropping of the ref should use a > helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug() > does. > > There should be no doubt about the above at all. If the plug has been > added to a workqueue, it should be quite obvious that of course it has a > reference to it already, outside of the bio's that are in it. > > I'd strongly encourage getting this sorted out before the merge window, > I'm not at all convinced it's correct as-is. It's certainly not > obviously correct, which it should be. The RCU rules are pretty simple > if the the references are done in the kernel idiomatic way, but they are > not. OK. I am traveling this week so I will not be able to send a well-tested cleanup patch but I will do so first thing next week. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early 2024-04-23 18:19 ` Damien Le Moal @ 2024-04-24 13:58 ` Jens Axboe 0 siblings, 0 replies; 13+ messages in thread From: Jens Axboe @ 2024-04-24 13:58 UTC (permalink / raw) To: Damien Le Moal, linux-block On 4/23/24 12:19 PM, Damien Le Moal wrote: > On 2024/04/24 1:36, Jens Axboe wrote: >> On 4/23/24 9:16 AM, Damien Le Moal wrote: >>> On 2024/04/24 0:21, Jens Axboe wrote: >>>> On 4/20/24 1:58 AM, Damien Le Moal wrote: >>>>> The submission of plugged BIOs is done using a work struct executing the >>>>> function blk_zone_wplug_bio_work(). This function gets and submits a >>>>> plugged zone write BIO and is guaranteed to operate on a valid zone >>>>> write plug (with a reference count higher than 0) on entry as plugged >>>>> BIOs hold a reference on their zone write plugs. However, once a BIO is >>>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >>>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >>>>> and freeing of the zone write plug if the BIO is the last write to a >>>>> zone (making the zone FULL). This potentially can result in the zone >>>>> write plug being freed while the work is still active. >>>>> >>>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >>>>> >>>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> >>>>> --- >>>>> block/blk-zoned.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >>>>> index 3befebe6b319..685f0b9159fd 100644 >>>>> --- a/block/blk-zoned.c >>>>> +++ b/block/blk-zoned.c >>>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >>>>> struct blk_zone_wplug *zwplug = >>>>> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >>>>> >>>>> + flush_work(&zwplug->bio_work); >>>>> + >>>>> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >>>>> } >>>> >>>> This is totally backwards. First of all, if you actually had work that >>>> needed flushing at this point, the kernel would bomb spectacularly. >>>> Secondly, what's the point of using RCU to protect this, if you're now >>>> needing to flush work from the RCU callback? That's a clear sign that >>>> something is very wrong here with your references / RCU usage.. The work >>>> item should hold a reference to it, trying to paper around it like this >>>> is not going to work at all. >>>> >>>> Why is the work item racing with RCU freeing?! >>> >>> The work item is a field of the zone write plug. Zone write plugs have >>> references to them as long as BIOs are in flight and and the zone is >>> not full. The zone write plug freeing through rcu is triggered by the >>> last write to a zone that makes the zone full. But the completion of >>> this last write BIO may happen right after the work issued the BIO >>> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() >>> returns, while the work item is still active. >>> >>> The actual freeing of the plug happens only after the rcu grace >>> period, and I was not entirely sure if this is enough to guarantee >>> that the work thread is finished. But checking how the workqueue code >>> processes the work item by calling the work function >>> (blk_zone_wplug_bio_work() in this case), there is no issue because >>> the work item (struct work_struct) is not touched once the work >>> function is called. So there are no issues/races with freeing the zone >>> write plug. I was overthinking this. My bad. We can drop this patch. >>> Apologies for the noise. >> >> I took a closer look at the zone write plug reference handling, and it >> still doesn't look very good. Why are some just atomic_dec and there's >> just one spot that does dec_and_test? This again looks like janky >> referencing, to be honest. >> >> The relationship seems like it should be pretty clear. Any bio inflight >> against this zone plug should have a reference to it, AND the owner >> should have a reference to it, otherwise any bio completion (which can >> happen at ANY time) could free it. Any dropping of the ref should use a >> helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug() >> does. >> >> There should be no doubt about the above at all. If the plug has been >> added to a workqueue, it should be quite obvious that of course it has a >> reference to it already, outside of the bio's that are in it. >> >> I'd strongly encourage getting this sorted out before the merge window, >> I'm not at all convinced it's correct as-is. It's certainly not >> obviously correct, which it should be. The RCU rules are pretty simple >> if the the references are done in the kernel idiomatic way, but they are >> not. > > OK. I am traveling this week so I will not be able to send a > well-tested cleanup patch but I will do so first thing next week. Sounds good, thanks. -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging 2024-04-20 7:58 [PATCH v2 0/2] Fixes for zone write plugging Damien Le Moal 2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal @ 2024-04-20 7:58 ` Damien Le Moal 2024-04-22 6:25 ` Christoph Hellwig 2024-04-23 15:46 ` (subset) [PATCH v2 0/2] Fixes " Jens Axboe 2 siblings, 1 reply; 13+ messages in thread From: Damien Le Moal @ 2024-04-20 7:58 UTC (permalink / raw) To: Jens Axboe, linux-block A zone write plug BIO work function blk_zone_wplug_bio_work() calls submit_bio_noacct_nocheck() to execute the next unplugged BIO. This function may block. So executing zone plugs BIO works using the block layer global kblockd workqueue can potentially lead to preformance or latency issues as the number of concurrent work for a workqueue is limited to WQ_DFL_ACTIVE (256). 1) For a system with a large number of zoned disks, issuing write requests to otherwise unused zones may be delayed wiating for a work thread to become available. 2) Requeue operations which use kblockd but are independent of zone write plugging may alsoi end up being delayed. To avoid these potential performance issues, create a workqueue per zoned device to execute zone plugs BIO work. The workqueue max active parameter is set to the maximum number of zone write plugs allocated with the zone write plug mempool. This limit is equal to the maximum number of open zones of the disk and defaults to 128 for disks that do not have a limit on the number of open zones. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> --- block/blk-zoned.c | 32 ++++++++++++++++++++++++-------- include/linux/blkdev.h | 1 + 2 files changed, 25 insertions(+), 8 deletions(-) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 685f0b9159fd..4f5bf1c0a99f 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1131,7 +1131,7 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk, /* Schedule submission of the next plugged BIO if we have one. */ if (!bio_list_empty(&zwplug->bio_list)) { spin_unlock_irqrestore(&zwplug->lock, flags); - kblockd_schedule_work(&zwplug->bio_work); + queue_work(disk->zone_wplugs_wq, &zwplug->bio_work); return; } @@ -1334,7 +1334,7 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk, /* Restart BIO submission if we still have any BIO left. */ if (!bio_list_empty(&zwplug->bio_list)) { WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)); - kblockd_schedule_work(&zwplug->bio_work); + queue_work(disk->zone_wplugs_wq, &zwplug->bio_work); goto unlock; } @@ -1411,14 +1411,25 @@ static int disk_alloc_zone_resources(struct gendisk *disk, disk->zone_wplugs_pool = mempool_create_kmalloc_pool(pool_size, sizeof(struct blk_zone_wplug)); - if (!disk->zone_wplugs_pool) { - kfree(disk->zone_wplugs_hash); - disk->zone_wplugs_hash = NULL; - disk->zone_wplugs_hash_bits = 0; - return -ENOMEM; - } + if (!disk->zone_wplugs_pool) + goto free_hash; + + disk->zone_wplugs_wq = + alloc_workqueue("%s_zwplugs", WQ_MEM_RECLAIM | WQ_HIGHPRI, + pool_size, disk->disk_name); + if (!disk->zone_wplugs_wq) + goto destroy_pool; return 0; + +destroy_pool: + mempool_destroy(disk->zone_wplugs_pool); + disk->zone_wplugs_pool = NULL; +free_hash: + kfree(disk->zone_wplugs_hash); + disk->zone_wplugs_hash = NULL; + disk->zone_wplugs_hash_bits = 0; + return -ENOMEM; } static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk) @@ -1449,6 +1460,11 @@ void disk_free_zone_resources(struct gendisk *disk) { cancel_work_sync(&disk->zone_wplugs_work); + if (disk->zone_wplugs_wq) { + destroy_workqueue(disk->zone_wplugs_wq); + disk->zone_wplugs_wq = NULL; + } + disk_destroy_zone_wplugs_hash_table(disk); /* diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index c62536c78a46..a2847f8e5f82 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -191,6 +191,7 @@ struct gendisk { struct hlist_head *zone_wplugs_hash; struct list_head zone_wplugs_err_list; struct work_struct zone_wplugs_work; + struct workqueue_struct *zone_wplugs_wq; #endif /* CONFIG_BLK_DEV_ZONED */ #if IS_ENABLED(CONFIG_CDROM) -- 2.44.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging 2024-04-20 7:58 ` [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging Damien Le Moal @ 2024-04-22 6:25 ` Christoph Hellwig 2024-04-23 6:01 ` Damien Le Moal 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2024-04-22 6:25 UTC (permalink / raw) To: Damien Le Moal; +Cc: Jens Axboe, linux-block On Sat, Apr 20, 2024 at 04:58:11PM +0900, Damien Le Moal wrote: > A zone write plug BIO work function blk_zone_wplug_bio_work() calls > submit_bio_noacct_nocheck() to execute the next unplugged BIO. This > function may block. So executing zone plugs BIO works using the block > layer global kblockd workqueue can potentially lead to preformance or > latency issues as the number of concurrent work for a workqueue is > limited to WQ_DFL_ACTIVE (256). > 1) For a system with a large number of zoned disks, issuing write > requests to otherwise unused zones may be delayed wiating for a work > thread to become available. > 2) Requeue operations which use kblockd but are independent of zone > write plugging may alsoi end up being delayed. > > To avoid these potential performance issues, create a workqueue per > zoned device to execute zone plugs BIO work. The workqueue max active > parameter is set to the maximum number of zone write plugs allocated > with the zone write plug mempool. This limit is equal to the maximum > number of open zones of the disk and defaults to 128 for disks that do > not have a limit on the number of open zones. Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> Should the zone write plug submission do non-blocking submissions as well to avoid stalling in the workqueue thread all the time? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging 2024-04-22 6:25 ` Christoph Hellwig @ 2024-04-23 6:01 ` Damien Le Moal 0 siblings, 0 replies; 13+ messages in thread From: Damien Le Moal @ 2024-04-23 6:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jens Axboe, linux-block On 2024/04/22 16:25, Christoph Hellwig wrote: > On Sat, Apr 20, 2024 at 04:58:11PM +0900, Damien Le Moal wrote: >> A zone write plug BIO work function blk_zone_wplug_bio_work() calls >> submit_bio_noacct_nocheck() to execute the next unplugged BIO. This >> function may block. So executing zone plugs BIO works using the block >> layer global kblockd workqueue can potentially lead to preformance or >> latency issues as the number of concurrent work for a workqueue is >> limited to WQ_DFL_ACTIVE (256). >> 1) For a system with a large number of zoned disks, issuing write >> requests to otherwise unused zones may be delayed wiating for a work >> thread to become available. >> 2) Requeue operations which use kblockd but are independent of zone >> write plugging may alsoi end up being delayed. >> >> To avoid these potential performance issues, create a workqueue per >> zoned device to execute zone plugs BIO work. The workqueue max active >> parameter is set to the maximum number of zone write plugs allocated >> with the zone write plug mempool. This limit is equal to the maximum >> number of open zones of the disk and defaults to 128 for disks that do >> not have a limit on the number of open zones. > > Looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Should the zone write plug submission do non-blocking submissions as well > to avoid stalling in the workqueue thread all the time? I do not think that the stalling actually happens that often. The 2 main cases I see are: 1) Out of tag so we block on tag allocation when preparing the request in submit_bio_noacct_nocheck(), or 2) The device has BLK_MQ_F_BLOCKING set for its tag set (e.g. nullblk with memory backing). For (1), we could use RQF_NOWAIT to prevent blocking, but then we would need to retry later on with a timer to make forward progress for the plug. And I do not think we can actually avoid (2). So in the end, I do not see a clean way to completely avoid blocking in all cases. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: (subset) [PATCH v2 0/2] Fixes for zone write plugging 2024-04-20 7:58 [PATCH v2 0/2] Fixes for zone write plugging Damien Le Moal 2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal 2024-04-20 7:58 ` [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging Damien Le Moal @ 2024-04-23 15:46 ` Jens Axboe 2 siblings, 0 replies; 13+ messages in thread From: Jens Axboe @ 2024-04-23 15:46 UTC (permalink / raw) To: linux-block, Damien Le Moal On Sat, 20 Apr 2024 16:58:09 +0900, Damien Le Moal wrote: > (Apologies for the churn. I sent the wrong patch 2...) > > Jens, > > A couple of fix patches for zone write plugging. The second fix > addresses something that is in fact likely to happen in production with > large servers with lots of SMR drives. > > [...] Applied, thanks! [2/2] block: use a per disk workqueue for zone write plugging commit: a8f59e5a5deaf3e99a8b7252e96cee9af67858a9 Best regards, -- Jens Axboe ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2024-04-24 13:58 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-04-20 7:58 [PATCH v2 0/2] Fixes for zone write plugging Damien Le Moal 2024-04-20 7:58 ` [PATCH v2 1/2] block: prevent freeing a zone write plug too early Damien Le Moal 2024-04-22 6:23 ` Christoph Hellwig 2024-04-23 6:12 ` Damien Le Moal 2024-04-23 14:21 ` Jens Axboe 2024-04-23 15:16 ` Damien Le Moal 2024-04-23 15:36 ` Jens Axboe 2024-04-23 18:19 ` Damien Le Moal 2024-04-24 13:58 ` Jens Axboe 2024-04-20 7:58 ` [PATCH v2 2/2] block: use a per disk workqueue for zone write plugging Damien Le Moal 2024-04-22 6:25 ` Christoph Hellwig 2024-04-23 6:01 ` Damien Le Moal 2024-04-23 15:46 ` (subset) [PATCH v2 0/2] Fixes " Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox