* Re: [PATCH] blk-throttle: support io merge over iops_limit
From: Yu Kuai @ 2025-03-10 1:58 UTC (permalink / raw)
To: Tejun Heo, Yu Kuai
Cc: ming.lei, axboe, josef, linux-block, linux-kernel, cgroups,
yi.zhang, yangerkun, yukuai (C)
In-Reply-To: <Z8sZyElaHQQwKqpB@slm.duckdns.org>
Hi,
在 2025/03/08 0:07, Tejun Heo 写道:
> Hello,
>
> On Fri, Mar 07, 2025 at 05:01:52PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
>> support to account split IO for iops limit, because block layer provides
>> io accounting against split bio.
>>
>> However, io merge is still not handled, while block layer doesn't
>> account merged io for iops. Fix this problem by decreasing io_disp
>> if bio is merged, and following IO can use the extra budget. If io merge
>> concurrent with iops throttling, it's not handled if one more or one
>> less bio is dispatched, this is fine because as long as new slice is not
>> started, blk-throttle already preserve one extra slice for deviation,
>> and it's not worth it to handle the case that iops_limit rate is less than
>> one per slice.
>>
>> A regression test will be added for this case [1], before this patch,
>> the test will fail:
>>
>> +++ /root/blktests-mainline/results/nodev/throtl/007.out.bad
>> @@ -1,4 +1,4 @@
>> Running throtl/007
>> 1
>> -1
>> +11
>> Test complete
>>
>> [1] https://lore.kernel.org/all/20250307080318.3860858-2-yukuai1@huaweicloud.com/
>
> For blk-throtl, iops limit has meant the number of bios issued. I'm not
Yes, but it's a litter hard to explain to users the differece between IO
split and IO merge, they just think IO split is the numer of IOs issued
to disk, and IO merge is the number of IOs issued from user.
> necessarily against this change but this is significantly changing what a
> given configuration means. Also, if we're now doing hardware request based
> throttling, maybe we should just move this under rq-qos. That has the
> problem of not supporting bio-based drivers but maybe we can leave
> blk-throtl in deprecation mode and slowly phase it out.
Yes, we discussed this before.
And there is another angle that might convince you for the patch, if the
user workload triggers a lot of IO merge, and iops limit is above the
actual iops on disk, then before this patch, blk-throttle will make IO
merge impossible and resulting in performance degradation.
>
> Also, can you please make atomic_t conversion a separate patch and describe
> why that's being done?
Of course, the reason is that new helper will decrease the counter
outside lock.
Thanks,
Kuai
>
> Thanks.
>
^ permalink raw reply
* Re: [PATCH RFC 1/2] tests/throtl: add a new test 007
From: Yu Kuai @ 2025-03-10 1:43 UTC (permalink / raw)
To: Ming Lei, Yu Kuai; +Cc: shinichiro.kawasaki, linux-block, yangerkun, yukuai (C)
In-Reply-To: <Z8rAo8aCwi-OWADq@fedora>
Hi,
在 2025/03/07 17:47, Ming Lei 写道:
> On Fri, Mar 07, 2025 at 04:03:17PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Add test for IO merge over iops limit.
>>
>> Noted this test will fail for now, kernel solution is in development.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> tests/throtl/007 | 65 ++++++++++++++++++++++++++++++++++++++++++++
>> tests/throtl/007.out | 4 +++
>> 2 files changed, 69 insertions(+)
>> create mode 100755 tests/throtl/007
>> create mode 100644 tests/throtl/007.out
>>
>> diff --git a/tests/throtl/007 b/tests/throtl/007
>> new file mode 100755
>> index 0000000..597f879
>> --- /dev/null
>> +++ b/tests/throtl/007
>> @@ -0,0 +1,65 @@
>> +#!/bin/bash
>> +# SPDX-License-Identifier: GPL-3.0+
>> +# Copyright (C) 2025 Yu Kuai
>> +#
>> +# Test iops limit over io merge
>> +
>> +. tests/throtl/rc
>> +
>> +DESCRIPTION="basic functionality"
>> +QUICK=1
>> +
>> +requires() {
>> + _have_program taskset
>> + _have_program fio
>> +}
>> +
>> +# every 16 0.5k IO will merge into one 8k IO, ideally runtime is 1s,
>> +# however it's about 1.3s in practice
>> +__fio() {
>> + taskset -c 0 \
>> + fio -filename=/dev/$THROTL_DEV \
>> + -name=test \
>> + -size=1600k \
>> + -rw=write \
>> + -bs=512 \
>> + -iodepth=32 \
>> + -iodepth_low=16 \
>> + -iodepth_batch=16 \
>> + -numjobs=1 \
>> + -direct=1 \
>> + -ioengine=io_uring &> /dev/null
>> +}
>> +
>> +test_io() {
>> + start_time=$(date +%s.%N)
>> +
>> + {
>> + echo "$BASHPID" > "$CGROUP2_DIR/$THROTL_DIR/cgroup.procs"
>> + __fio
>> + } &
>> +
>> + wait $!
>> + end_time=$(date +%s.%N)
>> + elapsed=$(echo "$end_time - $start_time" | bc)
>> + printf "%.0f\n" "$elapsed"
>> +}
>> +
>> +test() {
>> + echo "Running ${TEST_NAME}"
>> +
>> + # iolatency is 10ms, iops is at most 200/s
>> + if ! _set_up_throtl irqmode=2 completion_nsec=10000000 hw_queue_depth=2; then
>> + return 1;
>> + fi
>> +
>> + test_io
>> +
>> + # 300 means 50% error range, no IO should be throttled
>> + _throtl_set_limits wiops=300
>> + test_io
>> + _throtl_remove_limits
>> +
>> + _clean_up_throtl
>> + echo "Test complete"
>> +}
>> diff --git a/tests/throtl/007.out b/tests/throtl/007.out
>> new file mode 100644
>> index 0000000..0d568ef
>> --- /dev/null
>> +++ b/tests/throtl/007.out
>> @@ -0,0 +1,4 @@
>> +Running throtl/007
>> +1
>> +1
>> +Test complete
>
> I'd suggest to check if actual iops matches with the iops limit directly,
> and it isn't intuitive to compare time taken in test wrt. iops throttle.
Yes, that's a good idea.
BTW, I'll wait if we agree with the change in kernel first before
sending the next version.
Thanks,
Kuai
>
> Thanks,
> Ming
>
>
> .
>
^ permalink raw reply
* Re: Bring back md-faulty? (was Re: dm-flakey: Fix memory corruption)
From: Mike Snitzer @ 2025-03-10 1:13 UTC (permalink / raw)
To: Kent Overstreet
Cc: dm-devel, Mikulas Patocka, Alasdair Kergon, linux-block,
Josef Bacik, Jens Axboe, Linus Torvalds
In-Reply-To: <xt5b22a3i6klpdsysz4ds3fubu3p6s22redu56w6ewc7mmcter@kcgbz3ccoedf>
On Sun, Mar 09, 2025 at 01:04:22PM -0400, Kent Overstreet wrote:
> On Sun, Mar 09, 2025 at 11:44:38AM -0400, Mike Snitzer wrote:
> > On Sat, Mar 08, 2025 at 04:50:05PM -0500, Kent Overstreet wrote:
> > > On Sat, Mar 08, 2025 at 01:19:30PM -0500, Mike Snitzer wrote:
> > > > On Sat, Mar 08, 2025 at 10:50:08AM -0500, Kent Overstreet wrote:
> > > > > So, this code clearly isn't getting tested - at all. Besides this
bug,
> > > > > the parsing for the "corrupt" modes is also broken.
> > > > >
> > > > > Guys, don't push broken crap, and figure out how to write some
tests.
> > > >
> > > > Thank you sir, may we have another?
> > > >
> > > > Like you never introduced a bug in your life?
> > > >
> > > > Not going to tolerate your entitled primadonna attitude here. You
are
> > > > capable of being better, you've chosen not to be on this issue
(twice)
> > >
> > > Talking about basic engineering standards is in no way "being a prima
> > > donna". Testing your changes is as basic as it gets, and this code
> > > wasn't tested _at all_.
> >
> > "entitled primadonna attitude" was me pulling punches.
> >
> > I don't disagree that this is a bug that was missed and that proper
> > testing hasn't been performed (I'd quibble with the no testing part
> > only because I cannot speak for Mikulas and don't like to assume I
> > know it all).
> >
> > But you're missing the very problematic detail: you used a bug in an
> > optional feature of the test-only dm-flakey target to try to take a
> > pound of flesh while preaching from your high horse. That is
> > unacceptable behaviour that won't be tolerated here. Be cool and
> > others will be in return (unless you keep setting fire to bridges).
> >
> > Fin.
>
> Mike, saying code needs to be tested is not an "entitled primadonna
> attitude".
Definition of primadonna:
"a very temperamental person with an inflated view of their own talent
or importance."
My issue from the start on Friday night (in private) has always been
how holier-than-thou yet abusive you've been since having discovered
this bug in dm-flakey's optional "corrupt_bio_byte" feature.
> Pushing completely broken code because you made no attempt to
> test it and then flipping out when called out over it - that is.
I didn't push commit 1d9a94389853 _because_ I "made no attempt to test
it". Commit 1d9a94389853 sought to fix a similar but different
corruption in the original "corrupt_bio_byte" implementation (which
proved useful for the specific case it was first developed for).
> To recap, we're not talking about some obscure corner cases, we're
> talking about core documentated functionality in dm-flakey that is
> completely broken in ways that show up immediately if you run it - and
> there's at least three bugs that I saw; the parsing code, the
> clone_bio() memory corruption, and the read side corruption still wasn't
> working when I fixed or worked around the other two (write side did).
>
> This isn't your personal project, this is the kernel - there are
> standards, and other people depend on your work. dm-flakey is used
> heavily by filesystem folks, and additionally, md-faulty was recently
> removed because, supposedly, dm-flakey was sufficient.
>
> And that's what I was using before, and it worked fine, so I'm willing
> to bring it back and maintain it if dm-flakey can't be relied on.
This dm-flakey corrupt_bio_byte bug will be fixed upstream this week.
But your recap is devoid of any understanding that my responses to
this dm-devel thread, and your private messages, have primarily taken
issue with how you've chosen to conduct yourself.
I'm not aware of dm-flakey's corrupt_bio_byte being used in upstream
testing frameworks (xfstests only uses flakey's basic capabilities).
Any willingness to elevate dm-flakey's corrupt_bio_byte to wider use
in upstream testing frameworks would have uncovered the need for your
fix. You were first to notice it. Rather than be cool about it,
you've been hostile from the start and completely misrepresented the
significance of the bug given the limited scope of who is impacted.
I really am done responding to your escalating campaign of self-owns.
^ permalink raw reply
* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
From: Damien Le Moal @ 2025-03-09 23:59 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-8-bmarzins@redhat.com>
On 3/10/25 07:29, Benjamin Marzinski wrote:
> dm_revalidate_zones() only allowed devices that had no zone resources
> set up to call blk_revalidate_disk_zones(). If the device already had
> zone resources, disk->nr_zones would always equal md->nr_zones so
> dm_revalidate_zones() returned without doing any work. Instead, always
> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>
> However, if the device emulates zone append operations and already has
> zone append emulation resources, the table size cannot change when
> loading a new table. Otherwise, all those resources will be garbage.
>
> If emulated zone append operations are needed and the zone write pointer
> offsets of the new table do not match those of the old table, writes to
> the device will still fail. This patch allows users to safely grow and
> shrink zone devices. But swapping arbitrary zoned tables will still not
> work.
I do not think that this patch correctly address the shrinking of dm zoned
device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
will leave already hashed zone write plugs outside of that new zone range in the
disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
reallocate the hash table if the number of zones shrinks. For a physical drive,
this can only happen if the drive is reformatted with some magic vendor unique
command, which is why this was never implemented as that is not a valid
production use case.
To make things simpler, I think we should allow growing/shrinking zoned device
tables, and much less swapping tables between zoned and not-zoned. I am more
inclined to avoid all these corner cases by simply not supporting table
switching for zoned device. That would be much safer I think.
No-one complained about any issue with table switching until now, which likely
means that no-one is using this. So what about simply returning an error for
table switching for a zoned device ? If someone request this support, we can
revisit this.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
> drivers/md/dm-zone.c | 23 +++++++++++++----------
> 1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index ac86011640c3..7e9ebeee7eac 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> if (!get_capacity(disk))
> return 0;
>
> - /* Revalidate only if something changed. */
> - if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> - DMINFO("%s using %s zone append",
> - disk->disk_name,
> - queue_emulates_zone_append(q) ? "emulated" : "native");
> - md->nr_zones = 0;
> - }
> -
> - if (md->nr_zones)
> - return 0;
> + DMINFO("%s using %s zone append", disk->disk_name,
> + queue_emulates_zone_append(q) ? "emulated" : "native");
>
> /*
> * Our table is not live yet. So the call to dm_get_live_table()
> @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> return 0;
> }
>
> + /*
> + * If the device needs zone append emulation, and the device already has
> + * zone append emulation resources, make sure that the chunk_sectors
> + * hasn't changed size. Otherwise those resources will be garbage.
> + */
> + if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> + q->limits.chunk_sectors != lim->chunk_sectors) {
> + DMERR("Cannot change zone size when swapping tables");
> + return -EINVAL;
> + }
> +
> /*
> * Warn once (when the capacity is not yet set) if the mapped device is
> * partially using zone resources of the target devices as that leads to
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs
From: Damien Le Moal @ 2025-03-09 23:31 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-6-bmarzins@redhat.com>
On 3/10/25 07:29, Benjamin Marzinski wrote:
> Previously disk_free_zone_resources() would only clean up zoned settings
> on a disk if the disk had write plugs allocated. Make it clean up zoned
> settings on any disk, so disks that don't allocate write plugs can use
> it as well.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Looks OK to me, but as commented in the cover letter, if we do not allow
swtiching tables for a zoned device, then I do not think this patch is needed.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 4/7] dm: fix dm_blk_report_zones
From: Damien Le Moal @ 2025-03-09 23:27 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-5-bmarzins@redhat.com>
On 3/10/25 07:29, Benjamin Marzinski wrote:
> If dm_get_live_table() returned NULL, dm_put_live_table() was never
> called. Also, if md->zone_revalidate_map is set, check that
> dm_blk_report_zones() is being called from the process that set it in
> __bind(). Otherwise the zone resources could change while accessing
> them. Finally, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Looks good to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
From: Damien Le Moal @ 2025-03-09 23:25 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-4-bmarzins@redhat.com>
On 3/10/25 07:28, Benjamin Marzinski wrote:
> If dm_table_set_restrictions() fails while swapping tables,
> device-mapper will continue using the previous table. It must be sure to
> leave the mapped_device in it's previous state on failure. Otherwise
> device-mapper could end up using the old table with settings from the
> unused table.
>
> Do not update the mapped device in dm_set_zones_restrictions(). Wait
> till after dm_table_set_restrictions() is sure to succeed to update the
> md zoned settings. Do the same with the dax settings, and if
> dm_revalidate_zones() fails, restore the original queue limits.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
> drivers/md/dm-table.c | 24 ++++++++++++++++--------
> drivers/md/dm-zone.c | 26 ++++++++++++++++++--------
> drivers/md/dm.h | 1 +
> 3 files changed, 35 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 0ef5203387b2..4003e84af11d 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> struct queue_limits *limits)
> {
> int r;
> + struct queue_limits old_limits;
>
> if (!dm_table_supports_nowait(t))
> limits->features &= ~BLK_FEAT_NOWAIT;
> @@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> if (dm_table_supports_flush(t))
> limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
>
> - if (dm_table_supports_dax(t, device_not_dax_capable)) {
> + if (dm_table_supports_dax(t, device_not_dax_capable))
> limits->features |= BLK_FEAT_DAX;
> - if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> - set_dax_synchronous(t->md->dax_dev);
> - } else
> + else
> limits->features &= ~BLK_FEAT_DAX;
>
> - if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> - dax_write_cache(t->md->dax_dev, true);
> -
> /* For a zoned table, setup the zone related queue attributes. */
> if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> (limits->features & BLK_FEAT_ZONED)) {
> @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> if (dm_table_supports_atomic_writes(t))
> limits->features |= BLK_FEAT_ATOMIC_WRITES;
>
> + old_limits = q->limits;
I am not sure this is safe to do like this since the user may be simultaneously
changing attributes, which would result in the old_limits struct being in an
inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
a queue_limits_get() helper for that though.
> r = queue_limits_set(q, limits);
...Or, we could modify queue_limits_set() to also return the old limit struct
under the q limits_lock. That maybe easier.
> if (r)
> return r;
> @@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> (limits->features & BLK_FEAT_ZONED)) {
> r = dm_revalidate_zones(t, q);
> - if (r)
> + if (r) {
> + queue_limits_set(q, &old_limits);
> return r;
> + }
> }
>
> + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
> + dm_finalize_zone_settings(t, limits);
> +
> + if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> + set_dax_synchronous(t->md->dax_dev);
> +
> + if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> + dax_write_cache(t->md->dax_dev, true);
> +
> dm_update_crypto_profile(q, t);
> return 0;
> }
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 20edd3fabbab..681058feb63b 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> * mapped device queue as needing zone append emulation.
> */
> WARN_ON_ONCE(queue_is_mq(q));
> - if (dm_table_supports_zone_append(t)) {
> - clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> - } else {
> - set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + if (!dm_table_supports_zone_append(t))
> lim->max_hw_zone_append_sectors = 0;
> - }
>
> /*
> * Determine the max open and max active zone limits for the mapped
> @@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> lim->zone_write_granularity = 0;
> lim->chunk_sectors = 0;
> lim->features &= ~BLK_FEAT_ZONED;
> - clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> - md->nr_zones = 0;
> - disk->nr_zones = 0;
> return 0;
> }
>
> @@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> return 0;
> }
>
> +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
> +{
> + struct mapped_device *md = t->md;
> +
> + if (lim->features & BLK_FEAT_ZONED) {
> + if (dm_table_supports_zone_append(t))
> + clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + else
> + set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + } else {
> + clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + md->nr_zones = 0;
> + md->disk->nr_zones = 0;
> + }
> +}
> +
> +
> /*
> * IO completion callback called from clone_endio().
> */
> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index a0a8ff119815..e5d3a9f46a91 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
> int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> struct queue_limits *lim);
> int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
> +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
> void dm_zone_endio(struct dm_io *io, struct bio *clone);
> #ifdef CONFIG_BLK_DEV_ZONED
> int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 2/7] dm: free table mempools if not used in __bind
From: Damien Le Moal @ 2025-03-09 23:19 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-3-bmarzins@redhat.com>
On 3/10/25 07:28, Benjamin Marzinski wrote:
> With request-based dm, the mempools don't need reloading when switching
> tables, but the unused table mempools are not freed until the active
> table is finally freed. Free them immediately if they are not needed.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails
From: Damien Le Moal @ 2025-03-09 23:18 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-2-bmarzins@redhat.com>
On 3/10/25 07:28, Benjamin Marzinski wrote:
> __bind was changing the disk capacity, geometry and mempools of the
> mapped device before calling dm_table_set_restrictions() which could
> fail, forcing dm to drop the new table. Failing here would leave the
> device using the old table but with the wrong capacity and mempools.
>
> Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
> capacity to be set, save the old version and restore it on failure.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Does this need a "Fixes" tag maybe ?
Otherwise looks good to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [RFC PATCH 0/7] dm: fix issues with swapping dm tables
From: Damien Le Moal @ 2025-03-09 23:16 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
On 3/10/25 07:28, Benjamin Marzinski wrote:
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
>
> Finally, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was suspended and
> modifying zoned resources or could possibly fail to end a srcu read
> section. More importantly, blk_revalidate_disk_zones() would never get
> called when updating a zoned table. This could cause the dm device to
> see the wrong zone write offsets, not have a large enough zwplugs
> reserved in its mempool, or read invalid memory when checking the
> conventional zones bitmap.
>
> This patchset fixes these issues. It does not make it so that
> device-mapper is able to load any zoned table from any other zoned
> table. Zoned dm-crypt devices can be safely grown and shrunk, but
> reloading a zoned dm-crypt device to, for instance, point at a
> completely different underlying device won't work correctly. IO might
> fail since the zone write offsets of the dm-crypt device will not be
> updated for all the existing zones with plugs. If the new device's zone
> offsets don't match the old device's offsets, IO to the zone will fail.
> If the ability to switch tables from a zoned dm-crypt device to an
> abritry other zoned dm-crypt device is important to people, it could be
> done as long as there are no plugged zones when dm suspends.
Thanks for fixing this.
Given that in the general case switching tables will always likely result in
unaligned write errors, I think we should just report a ENOTSUPP error if the
user attempts to swap tables.
> This patchset also doesn't touch the code for switching from a zoned to
> a non-zoned device. Switching from a zoned dm-crypt device to a
> non-zoned device is problematic if there are plugged zones, since the
> underlying device will not be prepared to handle these plugged writes.
> Switching to a target that does not pass down IOs, like the dm-error
> target, is fine. So is switching when there are no plugged zones, except
> that we do not free the zoned resources in this case, even though we
> safely could.
This is another case that does not make much sense in practice. So instead of
still allowing it knowing that it most likely will not work, we should return
ENOTSUPP here too I think.
> If people are interested in removing some of these limitations, I can
> send patches for them, but I'm not sure how much extra code we want,
> just to support niche zoned dm-crypt reloads.
I have never heard any complaints/bug reports from people attempting this. Which
likely means that no-one is needing/trying to do this. So as mentionned above,
we should make sure that this feature is not reported as not supported with a
ENOTSUPP error, and maybe a warning.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
If dm_table_set_restrictions() fails while swapping tables,
device-mapper will continue using the previous table. It must be sure to
leave the mapped_device in it's previous state on failure. Otherwise
device-mapper could end up using the old table with settings from the
unused table.
Do not update the mapped device in dm_set_zones_restrictions(). Wait
till after dm_table_set_restrictions() is sure to succeed to update the
md zoned settings. Do the same with the dax settings, and if
dm_revalidate_zones() fails, restore the original queue limits.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 24 ++++++++++++++++--------
drivers/md/dm-zone.c | 26 ++++++++++++++++++--------
drivers/md/dm.h | 1 +
3 files changed, 35 insertions(+), 16 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 0ef5203387b2..4003e84af11d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
int r;
+ struct queue_limits old_limits;
if (!dm_table_supports_nowait(t))
limits->features &= ~BLK_FEAT_NOWAIT;
@@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_flush(t))
limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
- if (dm_table_supports_dax(t, device_not_dax_capable)) {
+ if (dm_table_supports_dax(t, device_not_dax_capable))
limits->features |= BLK_FEAT_DAX;
- if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
- set_dax_synchronous(t->md->dax_dev);
- } else
+ else
limits->features &= ~BLK_FEAT_DAX;
- if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
- dax_write_cache(t->md->dax_dev, true);
-
/* For a zoned table, setup the zone related queue attributes. */
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
@@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_atomic_writes(t))
limits->features |= BLK_FEAT_ATOMIC_WRITES;
+ old_limits = q->limits;
r = queue_limits_set(q, limits);
if (r)
return r;
@@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
r = dm_revalidate_zones(t, q);
- if (r)
+ if (r) {
+ queue_limits_set(q, &old_limits);
return r;
+ }
}
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
+ dm_finalize_zone_settings(t, limits);
+
+ if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
+ set_dax_synchronous(t->md->dax_dev);
+
+ if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
+ dax_write_cache(t->md->dax_dev, true);
+
dm_update_crypto_profile(q, t);
return 0;
}
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 20edd3fabbab..681058feb63b 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
* mapped device queue as needing zone append emulation.
*/
WARN_ON_ONCE(queue_is_mq(q));
- if (dm_table_supports_zone_append(t)) {
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- } else {
- set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ if (!dm_table_supports_zone_append(t))
lim->max_hw_zone_append_sectors = 0;
- }
/*
* Determine the max open and max active zone limits for the mapped
@@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
lim->zone_write_granularity = 0;
lim->chunk_sectors = 0;
lim->features &= ~BLK_FEAT_ZONED;
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- md->nr_zones = 0;
- disk->nr_zones = 0;
return 0;
}
@@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
return 0;
}
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
+{
+ struct mapped_device *md = t->md;
+
+ if (lim->features & BLK_FEAT_ZONED) {
+ if (dm_table_supports_zone_append(t))
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ else
+ set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ } else {
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ md->nr_zones = 0;
+ md->disk->nr_zones = 0;
+ }
+}
+
+
/*
* IO completion callback called from clone_endio().
*/
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a0a8ff119815..e5d3a9f46a91 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *lim);
int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
void dm_zone_endio(struct dm_io *io, struct bio *clone);
#ifdef CONFIG_BLK_DEV_ZONED
int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
--
2.48.1
^ permalink raw reply related
* [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
dm_revalidate_zones() only allowed devices that had no zone resources
set up to call blk_revalidate_disk_zones(). If the device already had
zone resources, disk->nr_zones would always equal md->nr_zones so
dm_revalidate_zones() returned without doing any work. Instead, always
call blk_revalidate_disk_zones() if you are loading a new zoned table.
However, if the device emulates zone append operations and already has
zone append emulation resources, the table size cannot change when
loading a new table. Otherwise, all those resources will be garbage.
If emulated zone append operations are needed and the zone write pointer
offsets of the new table do not match those of the old table, writes to
the device will still fail. This patch allows users to safely grow and
shrink zone devices. But swapping arbitrary zoned tables will still not
work.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-zone.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index ac86011640c3..7e9ebeee7eac 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
if (!get_capacity(disk))
return 0;
- /* Revalidate only if something changed. */
- if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
- DMINFO("%s using %s zone append",
- disk->disk_name,
- queue_emulates_zone_append(q) ? "emulated" : "native");
- md->nr_zones = 0;
- }
-
- if (md->nr_zones)
- return 0;
+ DMINFO("%s using %s zone append", disk->disk_name,
+ queue_emulates_zone_append(q) ? "emulated" : "native");
/*
* Our table is not live yet. So the call to dm_get_live_table()
@@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
return 0;
}
+ /*
+ * If the device needs zone append emulation, and the device already has
+ * zone append emulation resources, make sure that the chunk_sectors
+ * hasn't changed size. Otherwise those resources will be garbage.
+ */
+ if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
+ q->limits.chunk_sectors != lim->chunk_sectors) {
+ DMERR("Cannot change zone size when swapping tables");
+ return -EINVAL;
+ }
+
/*
* Warn once (when the capacity is not yet set) if the mapped device is
* partially using zone resources of the target devices as that leads to
--
2.48.1
^ permalink raw reply related
* [RFC PATCH 0/7] dm: fix issues with swapping dm tables
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
There were multiple places in dm's __bind() function where it could fail
and not completely roll back, leaving the device using the the old
table, but with device limits and resources from the new table.
Additionally, unused mempools for request-based devices were not always
freed immediately.
Finally, there were a number of issues with switching zoned tables that
emulate zone append (in other words, dm-crypt on top of zoned devices).
dm_blk_report_zones() could be called while the device was suspended and
modifying zoned resources or could possibly fail to end a srcu read
section. More importantly, blk_revalidate_disk_zones() would never get
called when updating a zoned table. This could cause the dm device to
see the wrong zone write offsets, not have a large enough zwplugs
reserved in its mempool, or read invalid memory when checking the
conventional zones bitmap.
This patchset fixes these issues. It does not make it so that
device-mapper is able to load any zoned table from any other zoned
table. Zoned dm-crypt devices can be safely grown and shrunk, but
reloading a zoned dm-crypt device to, for instance, point at a
completely different underlying device won't work correctly. IO might
fail since the zone write offsets of the dm-crypt device will not be
updated for all the existing zones with plugs. If the new device's zone
offsets don't match the old device's offsets, IO to the zone will fail.
If the ability to switch tables from a zoned dm-crypt device to an
abritry other zoned dm-crypt device is important to people, it could be
done as long as there are no plugged zones when dm suspends.
This patchset also doesn't touch the code for switching from a zoned to
a non-zoned device. Switching from a zoned dm-crypt device to a
non-zoned device is problematic if there are plugged zones, since the
underlying device will not be prepared to handle these plugged writes.
Switching to a target that does not pass down IOs, like the dm-error
target, is fine. So is switching when there are no plugged zones, except
that we do not free the zoned resources in this case, even though we
safely could.
If people are interested in removing some of these limitations, I can
send patches for them, but I'm not sure how much extra code we want,
just to support niche zoned dm-crypt reloads.
Benjamin Marzinski (7):
dm: don't change md if dm_table_set_restrictions() fails
dm: free table mempools if not used in __bind
dm: handle failures in dm_table_set_restrictions
dm: fix dm_blk_report_zones
blk-zoned: clean up zone settings for devices without zwplugs
blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers
dm: allow devices to revalidate existing zones
block/blk-zoned.c | 78 ++++++++++++++++++++++--------------------
block/blk.h | 4 ---
drivers/md/dm-core.h | 1 +
drivers/md/dm-table.c | 24 ++++++++-----
drivers/md/dm-zone.c | 75 ++++++++++++++++++++++++++--------------
drivers/md/dm.c | 30 ++++++++--------
drivers/md/dm.h | 1 +
include/linux/blkdev.h | 4 +++
8 files changed, 128 insertions(+), 89 deletions(-)
--
2.48.1
^ permalink raw reply
* [RFC PATCH 6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
If device-mapper is swapping zoned tables, and
blk_revalidate_disk_zones() fails. It must retain its current zoned
resources since device-mapper will be failing back to using the previous
table and the zoned settings need to match the table. Allocating
unnecessary zwplugs is acceptable, but the zoned configuration must not
change. Otherwise it can run into errors like bdev_zone_is_seq()
reading invalid memory because disk->conv_zones_bitmap is the wrong
size. However if device-mapper did not previously have a zoned table, it
should free up the zoned resources, instead of leaving them allocated
and unused.
To solve this, do not free the zone resources when
blk_revalidate_disk_zones() fails for bio based drivers. Additionally,
delay copying the zoned settings to the gendisk until
disk_update_zone_resources() can no longer fail, and do not freeze the
queue for bio-based drivers, since this will hang if there are any
plugged zone write bios.
Also, export disk_free_zone_resources() so that device-mapper can choose
when to free the zoned resources.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
block/blk-zoned.c | 49 +++++++++++++++++++++++-------------------
block/blk.h | 4 ----
drivers/md/dm-zone.c | 3 +++
include/linux/blkdev.h | 4 ++++
4 files changed, 34 insertions(+), 26 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d7dc89cbdccb..3bec289d27db 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1343,22 +1343,17 @@ static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
disk->zone_wplugs_hash_bits = 0;
}
-static unsigned int disk_set_conv_zones_bitmap(struct gendisk *disk,
- unsigned long *bitmap)
+static void disk_set_conv_zones_bitmap(struct gendisk *disk,
+ unsigned long *bitmap)
{
- unsigned int nr_conv_zones = 0;
unsigned long flags;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- if (bitmap)
- nr_conv_zones = bitmap_weight(bitmap, disk->nr_zones);
bitmap = rcu_replace_pointer(disk->conv_zones_bitmap, bitmap,
lockdep_is_held(&disk->zone_wplugs_lock));
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
kfree_rcu_mightsleep(bitmap);
-
- return nr_conv_zones;
}
void disk_free_zone_resources(struct gendisk *disk)
@@ -1386,6 +1381,7 @@ void disk_free_zone_resources(struct gendisk *disk)
disk->last_zone_capacity = 0;
disk->nr_zones = 0;
}
+EXPORT_SYMBOL_GPL(disk_free_zone_resources);
static inline bool disk_need_zone_resources(struct gendisk *disk)
{
@@ -1434,24 +1430,23 @@ struct blk_revalidate_zone_args {
/*
* Update the disk zone resources information and device queue limits.
- * The disk queue is frozen when this is executed.
+ * The disk queue is frozen when this is executed on blk-mq drivers.
*/
static int disk_update_zone_resources(struct gendisk *disk,
struct blk_revalidate_zone_args *args)
{
struct request_queue *q = disk->queue;
- unsigned int nr_seq_zones, nr_conv_zones;
+ unsigned int nr_seq_zones, nr_conv_zones = 0;
unsigned int pool_size;
struct queue_limits lim;
+ int ret;
- disk->nr_zones = args->nr_zones;
- disk->zone_capacity = args->zone_capacity;
- disk->last_zone_capacity = args->last_zone_capacity;
- nr_conv_zones =
- disk_set_conv_zones_bitmap(disk, args->conv_zones_bitmap);
- if (nr_conv_zones >= disk->nr_zones) {
+ if (args->conv_zones_bitmap)
+ nr_conv_zones = bitmap_weight(args->conv_zones_bitmap,
+ args->nr_zones);
+ if (nr_conv_zones >= args->nr_zones) {
pr_warn("%s: Invalid number of conventional zones %u / %u\n",
- disk->disk_name, nr_conv_zones, disk->nr_zones);
+ disk->disk_name, nr_conv_zones, args->nr_zones);
return -ENODEV;
}
@@ -1463,7 +1458,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
* small ZNS namespace. For such case, assume that the zoned device has
* no zone resource limits.
*/
- nr_seq_zones = disk->nr_zones - nr_conv_zones;
+ nr_seq_zones = args->nr_zones - nr_conv_zones;
if (lim.max_open_zones >= nr_seq_zones)
lim.max_open_zones = 0;
if (lim.max_active_zones >= nr_seq_zones)
@@ -1493,7 +1488,19 @@ static int disk_update_zone_resources(struct gendisk *disk,
}
commit:
- return queue_limits_commit_update_frozen(q, &lim);
+ if (queue_is_mq(disk->queue))
+ ret = queue_limits_commit_update_frozen(q, &lim);
+ else
+ ret = queue_limits_commit_update(q, &lim);
+
+ if (!ret) {
+ disk->nr_zones = args->nr_zones;
+ disk->zone_capacity = args->zone_capacity;
+ disk->last_zone_capacity = args->last_zone_capacity;
+ disk_set_conv_zones_bitmap(disk, args->conv_zones_bitmap);
+ }
+
+ return ret;
}
static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
@@ -1648,8 +1655,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* and when the zone configuration of the gendisk changes (e.g. after a format).
* Before calling this function, the device driver must already have set the
* device zone size (chunk_sector limit) and the max zone append limit.
- * BIO based drivers can also use this function as long as the device queue
- * can be safely frozen.
*/
int blk_revalidate_disk_zones(struct gendisk *disk)
{
@@ -1709,13 +1714,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
/*
* Set the new disk zone parameters only once the queue is frozen and
- * all I/Os are completed.
+ * all I/Os are completed on blk-mq drivers.
*/
if (ret > 0)
ret = disk_update_zone_resources(disk, &args);
else
pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
- if (ret) {
+ if (ret && queue_is_mq(disk->queue)) {
unsigned int memflags = blk_mq_freeze_queue(q);
disk_free_zone_resources(disk);
diff --git a/block/blk.h b/block/blk.h
index 90fa5f28ccab..c84af503b77b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -454,7 +454,6 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
#ifdef CONFIG_BLK_DEV_ZONED
void disk_init_zone_resources(struct gendisk *disk);
-void disk_free_zone_resources(struct gendisk *disk);
static inline bool bio_zone_write_plugging(struct bio *bio)
{
return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
@@ -500,9 +499,6 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
static inline void disk_init_zone_resources(struct gendisk *disk)
{
}
-static inline void disk_free_zone_resources(struct gendisk *disk)
-{
-}
static inline bool bio_zone_write_plugging(struct bio *bio)
{
return false;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 11e19281bb64..ac86011640c3 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -159,6 +159,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
struct mapped_device *md = t->md;
struct gendisk *disk = md->disk;
int ret;
+ bool was_zoned = disk->nr_zones != 0;
if (!get_capacity(disk))
return 0;
@@ -187,6 +188,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
if (ret) {
DMERR("Revalidate zones failed %d", ret);
+ if (!was_zoned)
+ disk_free_zone_resources(disk);
return ret;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..51edf35ff715 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -690,12 +690,16 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
}
#ifdef CONFIG_BLK_DEV_ZONED
+void disk_free_zone_resources(struct gendisk *disk);
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return disk->nr_zones;
}
bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
#else /* CONFIG_BLK_DEV_ZONED */
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return 0;
--
2.48.1
^ permalink raw reply related
* [PATCH 4/7] dm: fix dm_blk_report_zones
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
If dm_get_live_table() returned NULL, dm_put_live_table() was never
called. Also, if md->zone_revalidate_map is set, check that
dm_blk_report_zones() is being called from the process that set it in
__bind(). Otherwise the zone resources could change while accessing
them. Finally, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-core.h | 1 +
drivers/md/dm-zone.c | 23 +++++++++++++++--------
2 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@ struct mapped_device {
#ifdef CONFIG_BLK_DEV_ZONED
unsigned int nr_zones;
void *zone_revalidate_map;
+ struct task_struct *revalidate_map_task;
#endif
#ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
{
struct mapped_device *md = disk->private_data;
struct dm_table *map;
- int srcu_idx, ret;
+ struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+ int srcu_idx, ret = -EIO;
- if (!md->zone_revalidate_map) {
- /* Regular user context */
+ if (!zone_revalidate_map || md->revalidate_map_task != current) {
+ /*
+ * Regular user context or
+ * Zone revalidation during __bind() is in progress, but this
+ * call is from a different process
+ */
if (dm_suspended_md(md))
return -EAGAIN;
map = dm_get_live_table(md, &srcu_idx);
- if (!map)
- return -EIO;
} else {
/* Zone revalidation during __bind() */
- map = md->zone_revalidate_map;
+ map = zone_revalidate_map;
}
- ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+ if (map)
+ ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+ data);
- if (!md->zone_revalidate_map)
+ if (!zone_revalidate_map)
dm_put_live_table(md, srcu_idx);
return ret;
@@ -175,7 +180,9 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
* our table for dm_blk_report_zones() to use directly.
*/
md->zone_revalidate_map = t;
+ md->revalidate_map_task = current;
ret = blk_revalidate_disk_zones(disk);
+ md->revalidate_map_task = NULL;
md->zone_revalidate_map = NULL;
if (ret) {
--
2.48.1
^ permalink raw reply related
* [PATCH 2/7] dm: free table mempools if not used in __bind
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
With request-based dm, the mempools don't need reloading when switching
tables, but the unused table mempools are not freed until the active
table is finally freed. Free them immediately if they are not needed.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f5c5ccb6f8d2..292414da871d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2461,10 +2461,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
* requests in the queue may refer to bio from the old bioset,
* so you must walk through the queue to unprep.
*/
- if (!md->mempools) {
+ if (!md->mempools)
md->mempools = t->mempools;
- t->mempools = NULL;
- }
+ else
+ dm_free_md_mempools(t->mempools);
} else {
/*
* The md may already have mempools that need changing.
@@ -2473,8 +2473,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
*/
dm_free_md_mempools(md->mempools);
md->mempools = t->mempools;
- t->mempools = NULL;
}
+ t->mempools = NULL;
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
--
2.48.1
^ permalink raw reply related
* [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
Previously disk_free_zone_resources() would only clean up zoned settings
on a disk if the disk had write plugs allocated. Make it clean up zoned
settings on any disk, so disks that don't allocate write plugs can use
it as well.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
block/blk-zoned.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 761ea662ddc3..d7dc89cbdccb 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1363,24 +1363,23 @@ static unsigned int disk_set_conv_zones_bitmap(struct gendisk *disk,
void disk_free_zone_resources(struct gendisk *disk)
{
- if (!disk->zone_wplugs_pool)
- return;
-
- if (disk->zone_wplugs_wq) {
- destroy_workqueue(disk->zone_wplugs_wq);
- disk->zone_wplugs_wq = NULL;
- }
+ if (disk->zone_wplugs_pool) {
+ if (disk->zone_wplugs_wq) {
+ destroy_workqueue(disk->zone_wplugs_wq);
+ disk->zone_wplugs_wq = NULL;
+ }
- disk_destroy_zone_wplugs_hash_table(disk);
+ disk_destroy_zone_wplugs_hash_table(disk);
- /*
- * Wait for the zone write plugs to be RCU-freed before
- * destorying the mempool.
- */
- rcu_barrier();
+ /*
+ * Wait for the zone write plugs to be RCU-freed before
+ * destorying the mempool.
+ */
+ rcu_barrier();
- mempool_destroy(disk->zone_wplugs_pool);
- disk->zone_wplugs_pool = NULL;
+ mempool_destroy(disk->zone_wplugs_pool);
+ disk->zone_wplugs_pool = NULL;
+ }
disk_set_conv_zones_bitmap(disk, NULL);
disk->zone_capacity = 0;
--
2.48.1
^ permalink raw reply related
* [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
In-Reply-To: <20250309222904.449803-1-bmarzins@redhat.com>
__bind was changing the disk capacity, geometry and mempools of the
mapped device before calling dm_table_set_restrictions() which could
fail, forcing dm to drop the new table. Failing here would leave the
device using the old table but with the wrong capacity and mempools.
Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
capacity to be set, save the old version and restore it on failure.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..f5c5ccb6f8d2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2421,21 +2421,29 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct queue_limits *limits)
{
struct dm_table *old_map;
- sector_t size;
+ sector_t size, old_size;
int ret;
lockdep_assert_held(&md->suspend_lock);
size = dm_table_get_size(t);
+ old_size = dm_get_size(md);
+ set_capacity(md->disk, size);
+
+ ret = dm_table_set_restrictions(t, md->queue, limits);
+ if (ret) {
+ set_capacity(md->disk, old_size);
+ old_map = ERR_PTR(ret);
+ goto out;
+ }
+
/*
* Wipe any geometry if the size of the table changed.
*/
- if (size != dm_get_size(md))
+ if (size != old_size)
memset(&md->geometry, 0, sizeof(md->geometry));
- set_capacity(md->disk, size);
-
dm_table_event_callback(t, event_callback, md);
if (dm_table_request_based(t)) {
@@ -2468,12 +2476,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
t->mempools = NULL;
}
- ret = dm_table_set_restrictions(t, md->queue, limits);
- if (ret) {
- old_map = ERR_PTR(ret);
- goto out;
- }
-
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);
--
2.48.1
^ permalink raw reply related
* Fwd: [PATCH] badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0
From: Coly Li @ 2025-03-09 16:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Dan Carpenter
In-Reply-To: <20250309160556.42854-1-colyli@kernel.org>
Hi Jens,
Could you please take a look at it and pick this patch into the for-6.15/block branch? The patch is generated based on the for-6.15/block branch.
Thanks in advance.
Coly Li
> From: Coly Li <colyli@kernel.org>
>
> In _badblocks_check(), there are lines of code like this,
> 1246 sectors -= len;
> [snipped]
> 1251 WARN_ON(sectors < 0);
>
> The WARN_ON() at line 1257 doesn't make sense because sectors is
> unsigned long long type and never to be <0.
>
> Fix it by checking directly checking whether sectors is less than len.
>
> Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> block/badblocks.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/badblocks.c b/block/badblocks.c
> index 673ef068423a..ece64e76fe8f 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -1242,14 +1242,15 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
> len = sectors;
>
> update_sectors:
> + /* This situation should never happen */
> + WARN_ON(sectors < len);
> +
> s += len;
> sectors -= len;
>
> if (sectors > 0)
> goto re_check;
>
> - WARN_ON(sectors < 0);
> -
> if (unacked_badblocks > 0)
> rv = -1;
> else if (acked_badblocks > 0)
> --
> 2.47.2
>
^ permalink raw reply
* Bring back md-faulty? (was Re: dm-flakey: Fix memory corruption)
From: Kent Overstreet @ 2025-03-09 17:04 UTC (permalink / raw)
To: Mike Snitzer
Cc: dm-devel, Mikulas Patocka, Alasdair Kergon, linux-block,
Josef Bacik, axboe, Linus Torvalds
In-Reply-To: <Z823ZmamUoGO8P6I@kernel.org>
On Sun, Mar 09, 2025 at 11:44:38AM -0400, Mike Snitzer wrote:
> On Sat, Mar 08, 2025 at 04:50:05PM -0500, Kent Overstreet wrote:
> > On Sat, Mar 08, 2025 at 01:19:30PM -0500, Mike Snitzer wrote:
> > > On Sat, Mar 08, 2025 at 10:50:08AM -0500, Kent Overstreet wrote:
> > > > So, this code clearly isn't getting tested - at all. Besides this bug,
> > > > the parsing for the "corrupt" modes is also broken.
> > > >
> > > > Guys, don't push broken crap, and figure out how to write some tests.
> > >
> > > Thank you sir, may we have another?
> > >
> > > Like you never introduced a bug in your life?
> > >
> > > Not going to tolerate your entitled primadonna attitude here. You are
> > > capable of being better, you've chosen not to be on this issue (twice)
> >
> > Talking about basic engineering standards is in no way "being a prima
> > donna". Testing your changes is as basic as it gets, and this code
> > wasn't tested _at all_.
>
> "entitled primadonna attitude" was me pulling punches.
>
> I don't disagree that this is a bug that was missed and that proper
> testing hasn't been performed (I'd quibble with the no testing part
> only because I cannot speak for Mikulas and don't like to assume I
> know it all).
>
> But you're missing the very problematic detail: you used a bug in an
> optional feature of the test-only dm-flakey target to try to take a
> pound of flesh while preaching from your high horse. That is
> unacceptable behaviour that won't be tolerated here. Be cool and
> others will be in return (unless you keep setting fire to bridges).
>
> Fin.
Mike, saying code needs to be tested is not an "entitled primadonna
attitude". Pushing completely broken code because you made no attempt to
test it and then flipping out when called out over it - that is.
To recap, we're not talking about some obscure corner cases, we're
talking about core documentated functionality in dm-flakey that is
completely broken in ways that show up immediately if you run it - and
there's at least three bugs that I saw; the parsing code, the
clone_bio() memory corruption, and the read side corruption still wasn't
working when I fixed or worked around the other two (write side did).
This isn't your personal project, this is the kernel - there are
standards, and other people depend on your work. dm-flakey is used
heavily by filesystem folks, and additionally, md-faulty was recently
removed because, supposedly, dm-flakey was sufficient.
And that's what I was using before, and it worked fine, so I'm willing
to bring it back and maintain it if dm-flakey can't be relied on.
^ permalink raw reply
* Re: [bug report] badblocks: improve badblocks_check() for multiple ranges handling
From: Coly Li @ 2025-03-09 16:06 UTC (permalink / raw)
To: Dan Carpenter; +Cc: linux-block, colyli
In-Reply-To: <c8e4f72d-7a9f-428d-a67b-41ddd4da8f2f@stanley.mountain>
> 2025年3月8日 18:25,Dan Carpenter <dan.carpenter@linaro.org> 写道:
>
> Hello Coly Li,
>
> Commit 3ea3354cb9f0 ("badblocks: improve badblocks_check() for
> multiple ranges handling") from Aug 12, 2023 (linux-next), leads to
> the following Smatch static checker warning:
>
> block/badblocks.c:1251 _badblocks_check()
> warn: unsigned 'sectors' is never less than zero.
>
> block/badblocks.c
> 1186 static int _badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
> ^^^^^^^^^^^^^^^^
> sector_t is u64.
Hi Dan,
Thank you for the notice.
>
> 1187 sector_t *first_bad, sector_t *bad_sectors)
> 1188 {
> 1189 int prev = -1, hint = -1, set = 0;
> 1190 struct badblocks_context bad;
> 1191 int unacked_badblocks = 0;
> 1192 int acked_badblocks = 0;
> 1193 u64 *p = bb->page;
> 1194 int len, rv;
> 1195
> 1196 re_check:
> 1197 bad.start = s;
> 1198 bad.len = sectors;
> 1199
> 1200 if (badblocks_empty(bb)) {
> 1201 len = sectors;
> 1202 goto update_sectors;
> 1203 }
> 1204
> 1205 prev = prev_badblocks(bb, &bad, hint);
> 1206
> 1207 /* start after all badblocks */
> 1208 if ((prev >= 0) &&
> 1209 ((prev + 1) >= bb->count) && !overlap_front(bb, prev, &bad)) {
> 1210 len = sectors;
> 1211 goto update_sectors;
> 1212 }
> 1213
> 1214 /* Overlapped with front badblocks record */
> 1215 if ((prev >= 0) && overlap_front(bb, prev, &bad)) {
> 1216 if (BB_ACK(p[prev]))
> 1217 acked_badblocks++;
> 1218 else
> 1219 unacked_badblocks++;
> 1220
> 1221 if (BB_END(p[prev]) >= (s + sectors))
> 1222 len = sectors;
> 1223 else
> 1224 len = BB_END(p[prev]) - s;
> 1225
> 1226 if (set == 0) {
> 1227 *first_bad = BB_OFFSET(p[prev]);
> 1228 *bad_sectors = BB_LEN(p[prev]);
> 1229 set = 1;
> 1230 }
> 1231 goto update_sectors;
> 1232 }
> 1233
> 1234 /* Not front overlap, but behind overlap */
> 1235 if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
> 1236 len = BB_OFFSET(p[prev + 1]) - bad.start;
> 1237 hint = prev + 1;
> 1238 goto update_sectors;
> 1239 }
> 1240
> 1241 /* not cover any badblocks range in the table */
> 1242 len = sectors;
> 1243
> 1244 update_sectors:
> 1245 s += len;
> 1246 sectors -= len;
> ^^^^^^^^^^^^^^
> Subtraction.
>
> 1247
> 1248 if (sectors > 0)
> 1249 goto re_check;
> 1250
> --> 1251 WARN_ON(sectors < 0);
> ^^^^^^^^^^^
>
Yes you are right, this is totally unnecessary and no sense. Let me fix it.
Thank you again !
Coly Li
> 1252
> 1253 if (unacked_badblocks > 0)
> 1254 rv = -1;
> 1255 else if (acked_badblocks > 0)
> 1256 rv = 1;
> 1257 else
> 1258 rv = 0;
> 1259
> 1260 return rv;
> 1261 }
>
> regards,
> dan carpenter
>
^ permalink raw reply
* Re: [bug report] badblocks: improve badblocks_check() for multiple ranges handling
From: Coly Li @ 2025-03-09 16:10 UTC (permalink / raw)
To: Dan Carpenter; +Cc: linux-block, colyli
In-Reply-To: <c8e4f72d-7a9f-428d-a67b-41ddd4da8f2f@stanley.mountain>
> 2025年3月8日 18:25,Dan Carpenter <dan.carpenter@linaro.org> 写道:
>
> Hello Coly Li,
>
> Commit 3ea3354cb9f0 ("badblocks: improve badblocks_check() for
> multiple ranges handling") from Aug 12, 2023 (linux-next), leads to
> the following Smatch static checker warning:
>
> block/badblocks.c:1251 _badblocks_check()
> warn: unsigned 'sectors' is never less than zero.
>
> block/badblocks.c
> 1186 static int _badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
> ^^^^^^^^^^^^^^^^
> sector_t is u64.
Hi Dan,
Thank you for the notice.
>
> 1187 sector_t *first_bad, sector_t *bad_sectors)
> 1188 {
> 1189 int prev = -1, hint = -1, set = 0;
> 1190 struct badblocks_context bad;
> 1191 int unacked_badblocks = 0;
> 1192 int acked_badblocks = 0;
> 1193 u64 *p = bb->page;
> 1194 int len, rv;
> 1195
> 1196 re_check:
> 1197 bad.start = s;
> 1198 bad.len = sectors;
> 1199
> 1200 if (badblocks_empty(bb)) {
> 1201 len = sectors;
> 1202 goto update_sectors;
> 1203 }
> 1204
> 1205 prev = prev_badblocks(bb, &bad, hint);
> 1206
> 1207 /* start after all badblocks */
> 1208 if ((prev >= 0) &&
> 1209 ((prev + 1) >= bb->count) && !overlap_front(bb, prev, &bad)) {
> 1210 len = sectors;
> 1211 goto update_sectors;
> 1212 }
> 1213
> 1214 /* Overlapped with front badblocks record */
> 1215 if ((prev >= 0) && overlap_front(bb, prev, &bad)) {
> 1216 if (BB_ACK(p[prev]))
> 1217 acked_badblocks++;
> 1218 else
> 1219 unacked_badblocks++;
> 1220
> 1221 if (BB_END(p[prev]) >= (s + sectors))
> 1222 len = sectors;
> 1223 else
> 1224 len = BB_END(p[prev]) - s;
> 1225
> 1226 if (set == 0) {
> 1227 *first_bad = BB_OFFSET(p[prev]);
> 1228 *bad_sectors = BB_LEN(p[prev]);
> 1229 set = 1;
> 1230 }
> 1231 goto update_sectors;
> 1232 }
> 1233
> 1234 /* Not front overlap, but behind overlap */
> 1235 if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
> 1236 len = BB_OFFSET(p[prev + 1]) - bad.start;
> 1237 hint = prev + 1;
> 1238 goto update_sectors;
> 1239 }
> 1240
> 1241 /* not cover any badblocks range in the table */
> 1242 len = sectors;
> 1243
> 1244 update_sectors:
> 1245 s += len;
> 1246 sectors -= len;
> ^^^^^^^^^^^^^^
> Subtraction.
>
> 1247
> 1248 if (sectors > 0)
> 1249 goto re_check;
> 1250
> --> 1251 WARN_ON(sectors < 0);
> ^^^^^^^^^^^
>
Yes you are right, this is totally unnecessary and no sense. Let me fix it.
Thank you again !
Coly Li
> 1252
> 1253 if (unacked_badblocks > 0)
> 1254 rv = -1;
> 1255 else if (acked_badblocks > 0)
> 1256 rv = 1;
> 1257 else
> 1258 rv = 0;
> 1259
> 1260 return rv;
> 1261 }
>
> regards,
> dan carpenter
>
^ permalink raw reply
* [PATCH] badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0
From: colyli @ 2025-03-09 16:05 UTC (permalink / raw)
To: linux-block; +Cc: axboe, Coly Li, Dan Carpenter
From: Coly Li <colyli@kernel.org>
In _badblocks_check(), there are lines of code like this,
1246 sectors -= len;
[snipped]
1251 WARN_ON(sectors < 0);
The WARN_ON() at line 1257 doesn't make sense because sectors is
unsigned long long type and never to be <0.
Fix it by checking directly checking whether sectors is less than len.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Coly Li <colyli@kernel.org>
---
block/badblocks.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/badblocks.c b/block/badblocks.c
index 673ef068423a..ece64e76fe8f 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1242,14 +1242,15 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, sector_t sectors,
len = sectors;
update_sectors:
+ /* This situation should never happen */
+ WARN_ON(sectors < len);
+
s += len;
sectors -= len;
if (sectors > 0)
goto re_check;
- WARN_ON(sectors < 0);
-
if (unacked_badblocks > 0)
rv = -1;
else if (acked_badblocks > 0)
--
2.47.2
^ permalink raw reply related
* [PATCH v2 5/5] rust: enable `clippy::as_underscore` lint
From: Tamir Duberstein @ 2025-03-09 16:00 UTC (permalink / raw)
To: Masahiro Yamada, Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Alex Gaynor, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Greg Kroah-Hartman, Rafael J. Wysocki,
Brendan Higgins, David Gow, Rae Moar, Bjorn Helgaas,
Luis Chamberlain, Russ Weight, Rob Herring, Saravana Kannan
Cc: linux-kbuild, linux-kernel, rust-for-linux, linux-kselftest,
kunit-dev, linux-pci, linux-block, devicetree, Tamir Duberstein
In-Reply-To: <20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com>
In Rust 1.63.0, Clippy introduced the `as_underscore` lint [1]:
> The conversion might include lossy conversion or a dangerous cast that
> might go undetected due to the type being inferred.
>
> The lint is allowed by default as using `_` is less wordy than always
> specifying the type.
Always specifying the type is especially helpful in function call
contexts where the inferred type may change at a distance. Specifying
the type also allows Clippy to spot more cases of `useless_conversion`.
The primary downside is the need to specify the type in trivial getters.
There are 4 such functions: 3 have become slightly less ergonomic, 1 was
revealed to be a `useless_conversion`.
While this doesn't eliminate unchecked `as` conversions, it makes such
conversions easier to scrutinize. It also has the slight benefit of
removing a degree of freedom on which to bikeshed. Thus apply the
changes and enable the lint -- no functional change intended.
Link: https://rust-lang.github.io/rust-clippy/master/index.html#as_underscore [1]
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
Makefile | 1 +
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 2 +-
rust/kernel/device_id.rs | 2 +-
rust/kernel/devres.rs | 15 ++++++++-------
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 2 +-
rust/kernel/io.rs | 18 +++++++++---------
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/of.rs | 6 +++---
rust/kernel/pci.rs | 9 ++++++---
rust/kernel/str.rs | 8 ++++----
rust/kernel/workqueue.rs | 2 +-
13 files changed, 38 insertions(+), 33 deletions(-)
diff --git a/Makefile b/Makefile
index bb15b86182a3..2af40bfed9ce 100644
--- a/Makefile
+++ b/Makefile
@@ -478,6 +478,7 @@ export rust_common_flags := --edition=2021 \
-Wunreachable_pub \
-Wclippy::all \
-Wclippy::as_ptr_cast_mut \
+ -Wclippy::as_underscore \
-Wclippy::ignored_unit_patterns \
-Wclippy::mut_mut \
-Wclippy::needless_bitwise_bool \
diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
index 864ff379dc91..d18ef55490da 100644
--- a/rust/kernel/block/mq/operations.rs
+++ b/rust/kernel/block/mq/operations.rs
@@ -101,7 +101,7 @@ impl<T: Operations> OperationsVTable<T> {
if let Err(e) = ret {
e.to_blk_status()
} else {
- bindings::BLK_STS_OK as _
+ bindings::BLK_STS_OK as u8
}
}
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 10c6d69be7f3..bcf2b73d9189 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -125,7 +125,7 @@ pub fn end_ok(this: ARef<Self>) -> Result<(), ARef<Self>> {
// success of the call to `try_set_end` guarantees that there are no
// `ARef`s pointing to this request. Therefore it is safe to hand it
// back to the block layer.
- unsafe { bindings::blk_mq_end_request(request_ptr, bindings::BLK_STS_OK as _) };
+ unsafe { bindings::blk_mq_end_request(request_ptr, bindings::BLK_STS_OK as u8) };
Ok(())
}
diff --git a/rust/kernel/device_id.rs b/rust/kernel/device_id.rs
index e5859217a579..4063f09d76d9 100644
--- a/rust/kernel/device_id.rs
+++ b/rust/kernel/device_id.rs
@@ -82,7 +82,7 @@ impl<T: RawDeviceId, U, const N: usize> IdArray<T, U, N> {
unsafe {
raw_ids[i]
.as_mut_ptr()
- .byte_offset(T::DRIVER_DATA_OFFSET as _)
+ .byte_add(T::DRIVER_DATA_OFFSET)
.cast::<usize>()
.write(i);
}
diff --git a/rust/kernel/devres.rs b/rust/kernel/devres.rs
index 598001157293..20159b7c9293 100644
--- a/rust/kernel/devres.rs
+++ b/rust/kernel/devres.rs
@@ -45,7 +45,7 @@ struct DevresInner<T> {
/// # Example
///
/// ```no_run
-/// # use kernel::{bindings, c_str, device::Device, devres::Devres, io::{Io, IoRaw}};
+/// # use kernel::{bindings, c_str, device::Device, devres::Devres, ffi::c_void, io::{Io, IoRaw}};
/// # use core::ops::Deref;
///
/// // See also [`pci::Bar`] for a real example.
@@ -59,19 +59,19 @@ struct DevresInner<T> {
/// unsafe fn new(paddr: usize) -> Result<Self>{
/// // SAFETY: By the safety requirements of this function [`paddr`, `paddr` + `SIZE`) is
/// // valid for `ioremap`.
-/// let addr = unsafe { bindings::ioremap(paddr as _, SIZE as _) };
+/// let addr = unsafe { bindings::ioremap(paddr as u64, SIZE) };
/// if addr.is_null() {
/// return Err(ENOMEM);
/// }
///
-/// Ok(IoMem(IoRaw::new(addr as _, SIZE)?))
+/// Ok(IoMem(IoRaw::new(addr as usize, SIZE)?))
/// }
/// }
///
/// impl<const SIZE: usize> Drop for IoMem<SIZE> {
/// fn drop(&mut self) {
/// // SAFETY: `self.0.addr()` is guaranteed to be properly mapped by `Self::new`.
-/// unsafe { bindings::iounmap(self.0.addr() as _); };
+/// unsafe { bindings::iounmap(self.0.addr() as *mut c_void); };
/// }
/// }
///
@@ -115,8 +115,9 @@ fn new(dev: &Device, data: T, flags: Flags) -> Result<Arc<DevresInner<T>>> {
// SAFETY: `devm_add_action` guarantees to call `Self::devres_callback` once `dev` is
// detached.
- let ret =
- unsafe { bindings::devm_add_action(dev.as_raw(), Some(inner.callback), data as _) };
+ let ret = unsafe {
+ bindings::devm_add_action(dev.as_raw(), Some(inner.callback), data.cast_mut().cast())
+ };
if ret != 0 {
// SAFETY: We just created another reference to `inner` in order to pass it to
@@ -130,7 +131,7 @@ fn new(dev: &Device, data: T, flags: Flags) -> Result<Arc<DevresInner<T>>> {
}
fn as_ptr(&self) -> *const Self {
- self as _
+ self
}
fn remove_action(this: &Arc<Self>) {
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index 8654d52b0bb9..eb8fa52f08ba 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -152,7 +152,7 @@ pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
/// Returns the error encoded as a pointer.
pub fn to_ptr<T>(self) -> *mut T {
// SAFETY: `self.0` is a valid error due to its invariant.
- unsafe { bindings::ERR_PTR(self.0.get() as _).cast() }
+ unsafe { bindings::ERR_PTR(self.0.get() as isize).cast() }
}
/// Returns a string representing the error, if one exists.
diff --git a/rust/kernel/firmware.rs b/rust/kernel/firmware.rs
index c5162fdc95ff..9a279330261c 100644
--- a/rust/kernel/firmware.rs
+++ b/rust/kernel/firmware.rs
@@ -61,7 +61,7 @@ fn request_internal(name: &CStr, dev: &Device, func: FwFunc) -> Result<Self> {
// SAFETY: `pfw` is a valid pointer to a NULL initialized `bindings::firmware` pointer.
// `name` and `dev` are valid as by their type invariants.
- let ret = unsafe { func.0(pfw as _, name.as_char_ptr(), dev.as_raw()) };
+ let ret = unsafe { func.0(pfw, name.as_char_ptr(), dev.as_raw()) };
if ret != 0 {
return Err(Error::from_errno(ret));
}
diff --git a/rust/kernel/io.rs b/rust/kernel/io.rs
index d4a73e52e3ee..d375eed73dc8 100644
--- a/rust/kernel/io.rs
+++ b/rust/kernel/io.rs
@@ -5,7 +5,7 @@
//! C header: [`include/asm-generic/io.h`](srctree/include/asm-generic/io.h)
use crate::error::{code::EINVAL, Result};
-use crate::{bindings, build_assert};
+use crate::{bindings, build_assert, ffi::c_void};
/// Raw representation of an MMIO region.
///
@@ -56,7 +56,7 @@ pub fn maxsize(&self) -> usize {
/// # Examples
///
/// ```no_run
-/// # use kernel::{bindings, io::{Io, IoRaw}};
+/// # use kernel::{bindings, ffi::c_void, io::{Io, IoRaw}};
/// # use core::ops::Deref;
///
/// // See also [`pci::Bar`] for a real example.
@@ -70,19 +70,19 @@ pub fn maxsize(&self) -> usize {
/// unsafe fn new(paddr: usize) -> Result<Self>{
/// // SAFETY: By the safety requirements of this function [`paddr`, `paddr` + `SIZE`) is
/// // valid for `ioremap`.
-/// let addr = unsafe { bindings::ioremap(paddr as _, SIZE as _) };
+/// let addr = unsafe { bindings::ioremap(paddr as u64, SIZE) };
/// if addr.is_null() {
/// return Err(ENOMEM);
/// }
///
-/// Ok(IoMem(IoRaw::new(addr as _, SIZE)?))
+/// Ok(IoMem(IoRaw::new(addr as usize, SIZE)?))
/// }
/// }
///
/// impl<const SIZE: usize> Drop for IoMem<SIZE> {
/// fn drop(&mut self) {
/// // SAFETY: `self.0.addr()` is guaranteed to be properly mapped by `Self::new`.
-/// unsafe { bindings::iounmap(self.0.addr() as _); };
+/// unsafe { bindings::iounmap(self.0.addr() as *mut c_void); };
/// }
/// }
///
@@ -119,7 +119,7 @@ pub fn $name(&self, offset: usize) -> $type_name {
let addr = self.io_addr_assert::<$type_name>(offset);
// SAFETY: By the type invariant `addr` is a valid address for MMIO operations.
- unsafe { bindings::$name(addr as _) }
+ unsafe { bindings::$name(addr as *const c_void) }
}
/// Read IO data from a given offset.
@@ -131,7 +131,7 @@ pub fn $try_name(&self, offset: usize) -> Result<$type_name> {
let addr = self.io_addr::<$type_name>(offset)?;
// SAFETY: By the type invariant `addr` is a valid address for MMIO operations.
- Ok(unsafe { bindings::$name(addr as _) })
+ Ok(unsafe { bindings::$name(addr as *const c_void) })
}
};
}
@@ -148,7 +148,7 @@ pub fn $name(&self, value: $type_name, offset: usize) {
let addr = self.io_addr_assert::<$type_name>(offset);
// SAFETY: By the type invariant `addr` is a valid address for MMIO operations.
- unsafe { bindings::$name(value, addr as _, ) }
+ unsafe { bindings::$name(value, addr as *mut c_void) }
}
/// Write IO data from a given offset.
@@ -160,7 +160,7 @@ pub fn $try_name(&self, value: $type_name, offset: usize) -> Result {
let addr = self.io_addr::<$type_name>(offset)?;
// SAFETY: By the type invariant `addr` is a valid address for MMIO operations.
- unsafe { bindings::$name(value, addr as _) }
+ unsafe { bindings::$name(value, addr as *mut c_void) }
Ok(())
}
};
diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs
index e14433b2ab9d..2c66e926bffb 100644
--- a/rust/kernel/miscdevice.rs
+++ b/rust/kernel/miscdevice.rs
@@ -33,7 +33,7 @@ impl MiscDeviceOptions {
pub const fn into_raw<T: MiscDevice>(self) -> bindings::miscdevice {
// SAFETY: All zeros is valid for this C type.
let mut result: bindings::miscdevice = unsafe { MaybeUninit::zeroed().assume_init() };
- result.minor = bindings::MISC_DYNAMIC_MINOR as _;
+ result.minor = bindings::MISC_DYNAMIC_MINOR as i32;
result.name = self.name.as_char_ptr();
result.fops = create_vtable::<T>();
result
diff --git a/rust/kernel/of.rs b/rust/kernel/of.rs
index 04f2d8ef29cb..40d1bd13682c 100644
--- a/rust/kernel/of.rs
+++ b/rust/kernel/of.rs
@@ -22,7 +22,7 @@ unsafe impl RawDeviceId for DeviceId {
const DRIVER_DATA_OFFSET: usize = core::mem::offset_of!(bindings::of_device_id, data);
fn index(&self) -> usize {
- self.0.data as _
+ self.0.data as usize
}
}
@@ -34,10 +34,10 @@ pub const fn new(compatible: &'static CStr) -> Self {
// SAFETY: FFI type is valid to be zero-initialized.
let mut of: bindings::of_device_id = unsafe { core::mem::zeroed() };
- // TODO: Use `clone_from_slice` once the corresponding types do match.
+ // TODO: Use `copy_from_slice` once stabilized for `const`.
let mut i = 0;
while i < src.len() {
- of.compatible[i] = src[i] as _;
+ of.compatible[i] = src[i];
i += 1;
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index 63218abb7a25..a925732f6c7a 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -166,7 +166,7 @@ unsafe impl RawDeviceId for DeviceId {
const DRIVER_DATA_OFFSET: usize = core::mem::offset_of!(bindings::pci_device_id, driver_data);
fn index(&self) -> usize {
- self.0.driver_data as _
+ self.0.driver_data
}
}
@@ -201,7 +201,10 @@ macro_rules! pci_device_table {
/// MODULE_PCI_TABLE,
/// <MyDriver as pci::Driver>::IdInfo,
/// [
-/// (pci::DeviceId::from_id(bindings::PCI_VENDOR_ID_REDHAT, bindings::PCI_ANY_ID as _), ())
+/// (
+/// pci::DeviceId::from_id(bindings::PCI_VENDOR_ID_REDHAT, bindings::PCI_ANY_ID as u32),
+/// (),
+/// )
/// ]
/// );
///
@@ -317,7 +320,7 @@ unsafe fn do_release(pdev: &Device, ioptr: usize, num: i32) {
// `ioptr` is valid by the safety requirements.
// `num` is valid by the safety requirements.
unsafe {
- bindings::pci_iounmap(pdev.as_raw(), ioptr as _);
+ bindings::pci_iounmap(pdev.as_raw(), ioptr as *mut kernel::ffi::c_void);
bindings::pci_release_region(pdev.as_raw(), num);
}
}
diff --git a/rust/kernel/str.rs b/rust/kernel/str.rs
index 6a1a982b946d..0b80a119d5f0 100644
--- a/rust/kernel/str.rs
+++ b/rust/kernel/str.rs
@@ -692,9 +692,9 @@ fn new() -> Self {
pub(crate) unsafe fn from_ptrs(pos: *mut u8, end: *mut u8) -> Self {
// INVARIANT: The safety requirements guarantee the type invariants.
Self {
- beg: pos as _,
- pos: pos as _,
- end: end as _,
+ beg: pos as usize,
+ pos: pos as usize,
+ end: end as usize,
}
}
@@ -719,7 +719,7 @@ pub(crate) unsafe fn from_buffer(buf: *mut u8, len: usize) -> Self {
///
/// N.B. It may point to invalid memory.
pub(crate) fn pos(&self) -> *mut u8 {
- self.pos as _
+ self.pos as *mut u8
}
/// Returns the number of bytes written to the formatter.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 8ff54105be3f..d03f3440cb5a 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -198,7 +198,7 @@ pub fn enqueue<W, const ID: u64>(&self, w: W) -> W::EnqueueOutput
unsafe {
w.__enqueue(move |work_ptr| {
bindings::queue_work_on(
- bindings::wq_misc_consts_WORK_CPU_UNBOUND as _,
+ bindings::wq_misc_consts_WORK_CPU_UNBOUND as i32,
queue_ptr,
work_ptr,
)
--
2.48.1
^ permalink raw reply related
* [PATCH v2 4/5] rust: enable `clippy::as_ptr_cast_mut` lint
From: Tamir Duberstein @ 2025-03-09 16:00 UTC (permalink / raw)
To: Masahiro Yamada, Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Alex Gaynor, Boqun Feng, Gary Guo, Björn Roy Baron,
Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
Danilo Krummrich, Greg Kroah-Hartman, Rafael J. Wysocki,
Brendan Higgins, David Gow, Rae Moar, Bjorn Helgaas,
Luis Chamberlain, Russ Weight, Rob Herring, Saravana Kannan
Cc: linux-kbuild, linux-kernel, rust-for-linux, linux-kselftest,
kunit-dev, linux-pci, linux-block, devicetree, Tamir Duberstein
In-Reply-To: <20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com>
In Rust 1.66.0, Clippy introduced the `as_ptr_cast_mut` lint [1]:
> Since `as_ptr` takes a `&self`, the pointer won’t have write
> permissions unless interior mutability is used, making it unlikely
> that having it as a mutable pointer is correct.
There is only one affected callsite, and the change amounts to replacing
`as _` with `.cast_mut().cast()`. This doesn't change the semantics, but
is more descriptive of what's going on.
Apply this change and enable the lint -- no functional change intended.
Link: https://rust-lang.github.io/rust-clippy/master/index.html#as_ptr_cast_mut [1]
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
Makefile | 1 +
rust/kernel/devres.rs | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/Makefile b/Makefile
index c62bae2b107b..bb15b86182a3 100644
--- a/Makefile
+++ b/Makefile
@@ -477,6 +477,7 @@ export rust_common_flags := --edition=2021 \
-Wrust_2018_idioms \
-Wunreachable_pub \
-Wclippy::all \
+ -Wclippy::as_ptr_cast_mut \
-Wclippy::ignored_unit_patterns \
-Wclippy::mut_mut \
-Wclippy::needless_bitwise_bool \
diff --git a/rust/kernel/devres.rs b/rust/kernel/devres.rs
index 3a9d998ec371..598001157293 100644
--- a/rust/kernel/devres.rs
+++ b/rust/kernel/devres.rs
@@ -143,7 +143,7 @@ fn remove_action(this: &Arc<Self>) {
bindings::devm_remove_action_nowarn(
this.dev.as_raw(),
Some(this.callback),
- this.as_ptr() as _,
+ this.as_ptr().cast_mut().cast(),
)
};
--
2.48.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox