* [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
@ 2024-01-12 1:13 Ming Lei
2024-01-12 3:05 ` Damien Le Moal
0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2024-01-12 1:13 UTC (permalink / raw)
To: linux-block, Damien Le Moal
Cc: ming.lei, Yi Zhang, John Meneghini, linux-nvme, hch, Keith Busch
----- Forwarded message from Ming Lei <ming.lei@redhat.com> -----
Add linux-nvme
Date: Fri, 12 Jan 2024 09:08:47 +0800
From: Ming Lei <ming.lei@redhat.com>
To: linux-block@vger.kernel.org, Damien Le Moal <dlemoal@kernel.org>
Cc: ming.lei@redhat.com, Yi Zhang <yi.zhang@redhat.com>, John Meneghini <jmeneghi@redhat.com>
Subject: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len
Hello Damien and Guys,
Yi reported the following failure:
Oct 18 15:24:15 localhost kernel: nvme nvme4: invalid zone size:196608 for namespace:1
Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, opened
Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, NETAPPX4022S173A4T0NTZ, S/N:S66NNE0T800169, FW:MVP40B7B, 4.09 TB
It looks like blk-zoned currently requires zone->len to be power_of_2() since
commit:
6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
And the original power_of_2() requirement is from the following commit
for ZBC and ZAC.
d9dd73087a8b ("block: Enhance blk_revalidate_disk_zones()")
Meanwhile, the block layer does support a non-power_of_2 chunk sectors limit.
The question is whether there is such a hard requirement for ZNS; I can't find
any such wording in the NVMe Zoned Namespace Command Set Specification.
So is this an NVMe firmware issue, or a blk-zoned problem with a too strict
(power_of_2) requirement on zone->len?
Thanks,
Ming
----- End forwarded message -----
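For reference, a minimal sketch of the kind of check that rejects the reported zone size. It assumes the value in the log (196608) is a count of 512-byte sectors; the helper name zone_len_is_pow2() is hypothetical and only mirrors the usual "exactly one bit set" test, it is not copied from the kernel sources.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if zone_sectors is a non-zero power of 2,
 * i.e. exactly one bit is set in its binary representation.
 */
static inline bool zone_len_is_pow2(uint64_t zone_sectors)
{
	return zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0;
}
```

With this helper, 196608 (0x30000, i.e. 3 * 65536) fails the check, while 524288 (a 256 MiB zone in 512-byte sectors) passes.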
* Re: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
2024-01-12 1:13 [Report] blk-zoned/ZNS: non_power_of_2 of zone->len] Ming Lei
@ 2024-01-12 3:05 ` Damien Le Moal
2024-01-12 3:29 ` Ming Lei
0 siblings, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2024-01-12 3:05 UTC (permalink / raw)
To: Ming Lei, linux-block
Cc: Yi Zhang, John Meneghini, linux-nvme, hch, Keith Busch
On 1/12/24 10:13, Ming Lei wrote:
> Hello Damien and Guys,
>
> Yi reported that the following failure:
>
> Oct 18 15:24:15 localhost kernel: nvme nvme4: invalid zone size:196608 for namespace:1
> Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, opened
> Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, NETAPPX4022S173A4T0NTZ, S/N:S66NNE0T800169, FW:MVP40B7B, 4.09 TB
>
> Looks current blk-zoned requires zone->len to be power_of_2() since
> commit:
>
> 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
>
> And the original power_of_2() requirement is from the following commit
> for ZBC and ZAC.
>
> d9dd73087a8b ("block: Enhance blk_revalidate_disk_zones()")
>
> Meantime block layer does support non-power_of_2 chunk sectors limit.
That is not true. It does. See blk_stack_limits(), which has:
	/* Set non-power-of-2 compatible chunk_sectors boundary */
	if (b->chunk_sectors)
		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
and the absence of any check on the value of chunk_sectors in
blk_queue_chunk_sectors().
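To illustrate what the gcd() stacking above does with a non-power-of-2 value, here is a small user-space sketch. gcd_u32() and stack_chunk_sectors() are hypothetical stand-ins written for this example, not the kernel functions themselves.

```c
#include <stdint.h>

/* Euclid's algorithm; stands in for the kernel's gcd() helper. */
static inline uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t r = a % b;

		a = b;
		b = r;
	}
	return a;
}

/*
 * Hypothetical stand-in for the chunk_sectors handling quoted above:
 * combine the top (t) and bottom (b) limits when the bottom one is set.
 * A top value of 0 (no limit yet) simply takes the bottom value.
 */
static inline uint32_t stack_chunk_sectors(uint32_t top, uint32_t bottom)
{
	if (bottom)
		top = gcd_u32(top, bottom);
	return top;
}
```

For example, stacking a 524288-sector limit on top of a 196608-sector limit gives 65536 sectors, a boundary that both devices share.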
> The question is if there is such hard requirement for ZNS, and I can't see
> any such words in NVMe Zoned Namespace Command Set Specification.
No, there are no requirements in ZNS for the zone size to be a power of 2 number
of sectors/LBAs. The same is also true for ZBC and ZAC (SCSI and ATA) SMR HDDs.
The requirement for the zone size to be a power of 2 number of sectors is
entirely in the kernel. The reason being that zoned block device support started
with SMR HDDs which all had a zone size of 256 MB (and still do) and no user
ever wanted anything else than that. So everything was coded with this
requirement, as that allowed many nice things like bit-shift/mask arithmetic for
conversions between zone number and sectors etc (and that of course is very
efficient).
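A rough sketch of the arithmetic difference mentioned above, assuming sectors are 512-byte units. All function names here are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Power-of-2 zone size: the zone number is a single right shift. */
static inline uint64_t zone_no_pow2(uint64_t sector, unsigned int zone_shift)
{
	return sector >> zone_shift;
}

/* Power-of-2 zone size: the offset within the zone is a single mask. */
static inline uint64_t zone_offset_pow2(uint64_t sector, uint64_t zone_sectors)
{
	/* The mask trick is only valid for a non-zero power of 2. */
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);
	return sector & (zone_sectors - 1);
}

/* Arbitrary zone size: a 64-bit division is needed instead. */
static inline uint64_t zone_no_any(uint64_t sector, uint64_t zone_sectors)
{
	return sector / zone_sectors;
}

/* Arbitrary zone size: a 64-bit modulo is needed instead. */
static inline uint64_t zone_offset_any(uint64_t sector, uint64_t zone_sectors)
{
	return sector % zone_sectors;
}
```

With 524288-sector zones (shift of 19), sector 1000000 is in zone 1 at offset 475712 using only a shift and a mask. With 196608-sector zones, the same sector is in zone 5 at offset 16960, which requires a division and a modulo.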
> So is it one NVMe firmware issue? or blk-zoned problem with too strict(power_of_2)
> requirement on zone->len?
It is the latter. There was a session at LSF/MM last year about this. I recall
that the conclusion was that unless there is a strong user demand for non power
of 2 zone size, we are not going to do anything about it. Because allowing
non-power of 2 zone size has some serious consequences all over the place,
including in FSes that natively support zoned devices. So relaxing that
requirement is not trivial.
--
Damien Le Moal
Western Digital Research
* Re: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
2024-01-12 3:05 ` Damien Le Moal
@ 2024-01-12 3:29 ` Ming Lei
2024-01-12 3:34 ` Damien Le Moal
2024-01-12 15:40 ` Pankaj Raghav (Samsung)
0 siblings, 2 replies; 6+ messages in thread
From: Ming Lei @ 2024-01-12 3:29 UTC (permalink / raw)
To: Damien Le Moal, Bart Van Assche
Cc: linux-block, Yi Zhang, John Meneghini, linux-nvme, hch,
Keith Busch
On Fri, Jan 12, 2024 at 12:05:45PM +0900, Damien Le Moal wrote:
> On 1/12/24 10:13, Ming Lei wrote:
> > Hello Damien and Guys,
> >
> > Yi reported that the following failure:
> >
> > Oct 18 15:24:15 localhost kernel: nvme nvme4: invalid zone size:196608 for namespace:1
> > Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, opened
> > Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, NETAPPX4022S173A4T0NTZ, S/N:S66NNE0T800169, FW:MVP40B7B, 4.09 TB
> >
> > Looks current blk-zoned requires zone->len to be power_of_2() since
> > commit:
> >
> > 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
> >
> > And the original power_of_2() requirement is from the following commit
> > for ZBC and ZAC.
> >
> > d9dd73087a8b ("block: Enhance blk_revalidate_disk_zones()")
> >
> > Meantime block layer does support non-power_of_2 chunk sectors limit.
>
> That is not true. It does. See blk_stack_limits(), which has:
>
> 	/* Set non-power-of-2 compatible chunk_sectors boundary */
> 	if (b->chunk_sectors)
> 		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>
> and the absence of any check on the value of chunk_sectors in
> blk_queue_chunk_sectors().
I meant that a non-power_of_2 chunk sectors limit is supported, see
07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
And device mapper uses that.
>
> > The question is if there is such hard requirement for ZNS, and I can't see
> > any such words in NVMe Zoned Namespace Command Set Specification.
>
> No, there are no requirements in ZNS for the zone size to be a power of 2 number
> of sectors/LBAs. The same is also true for ZBC and ZAC (SCSI and ATA) SMR HDDs.
> The requirement for the zone size to be a power of 2 number of sectors is
> entirely in the kernel. The reason being that zoned block device support started
> with SMR HDDs which all had a zone size of 256 MB (and still do) and no user
> ever wanted anything else than that. So everything was coded with this
> requirement, as that allowed many nice things like bit-shift/mask arithmetic for
> conversions between zone number and sectors etc (and that of course is very
> efficient).
Thanks for the clarification.
>
> > So is it one NVMe firmware issue? or blk-zoned problem with too strict(power_of_2)
> > requirement on zone->len?
>
> It is the latter. There was a session at LSF/MM last year about this. I recall
> that the conclusion was that unless there is a strong user demand for non power
> of 2 zone size, we are not going to do anything about it. Because allowing
> non-power of 2 zone size has some serious consequences all over the place,
> including in FSes that natively support zoned devices. So relaxing that
> requirement is not trivial.
Just saw Bart's work on supporting non-power_of_2 zone len:
https://lore.kernel.org/linux-block/dc89c70e-4931-baaf-c450-6801c200c1d7@acm.org/
IMO FS support is a separate topic, because filesystems aren't the only users;
also, without block layer support the device isn't usable at all, let alone by
an FS. Since non-power-of-2 zoned devices do exist, I'd suggest that Bart
restart the work so that Linux covers more zoned devices (including those with
a non-power-of-2 zone size).
Thanks,
Ming
* Re: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
2024-01-12 3:29 ` Ming Lei
@ 2024-01-12 3:34 ` Damien Le Moal
2024-01-12 3:46 ` Bart Van Assche
2024-01-12 15:40 ` Pankaj Raghav (Samsung)
1 sibling, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2024-01-12 3:34 UTC (permalink / raw)
To: Ming Lei, Bart Van Assche
Cc: linux-block, Yi Zhang, John Meneghini, linux-nvme, hch,
Keith Busch, Martin K . Petersen,
linux-scsi @ vger . kernel . org
On 1/12/24 12:29, Ming Lei wrote:
> On Fri, Jan 12, 2024 at 12:05:45PM +0900, Damien Le Moal wrote:
>> On 1/12/24 10:13, Ming Lei wrote:
>>> Hello Damien and Guys,
>>>
>>> Yi reported that the following failure:
>>>
>>> Oct 18 15:24:15 localhost kernel: nvme nvme4: invalid zone size:196608 for namespace:1
>>> Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, opened
>>> Oct 18 15:24:33 localhost smartd[2303]: Device: /dev/nvme4, NETAPPX4022S173A4T0NTZ, S/N:S66NNE0T800169, FW:MVP40B7B, 4.09 TB
>>>
>>> Looks current blk-zoned requires zone->len to be power_of_2() since
>>> commit:
>>>
>>> 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
>>>
>>> And the original power_of_2() requirement is from the following commit
>>> for ZBC and ZAC.
>>>
>>> d9dd73087a8b ("block: Enhance blk_revalidate_disk_zones()")
>>>
>>> Meantime block layer does support non-power_of_2 chunk sectors limit.
>>
>> That is not true. It does. See blk_stack_limits(), which has:
>>
>> 	/* Set non-power-of-2 compatible chunk_sectors boundary */
>> 	if (b->chunk_sectors)
>> 		t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>>
>> and the absence of any check on the value of chunk_sectors in
>> blk_queue_chunk_sectors().
>
> I meant non-power_of_2 chunk sectors limit is supported, see
>
> 07d098e6bbad ("block: allow 'chunk_sectors' to be non-power-of-2")
>
> And device mapper uses that.
>
>>
>>> The question is if there is such hard requirement for ZNS, and I can't see
>>> any such words in NVMe Zoned Namespace Command Set Specification.
>>
>> No, there are no requirements in ZNS for the zone size to be a power of 2 number
>> of sectors/LBAs. The same is also true for ZBC and ZAC (SCSI and ATA) SMR HDDs.
>> The requirement for the zone size to be a power of 2 number of sectors is
>> entirely in the kernel. The reason being that zoned block device support started
>> with SMR HDDs which all had a zone size of 256 MB (and still do) and no user
>> ever wanted anything else than that. So everything was coded with this
>> requirement, as that allowed many nice things like bit-shift/mask arithmetic for
>> conversions between zone number and sectors etc (and that of course is very
>> efficient).
>
> Thanks for the clarification.
>
>>
>>> So is it one NVMe firmware issue? or blk-zoned problem with too strict(power_of_2)
>>> requirement on zone->len?
>>
>> It is the latter. There was a session at LSF/MM last year about this. I recall
>> that the conclusion was that unless there is a strong user demand for non power
>> of 2 zone size, we are not going to do anything about it. Because allowing
>> non-power of 2 zone size has some serious consequences all over the place,
>> including in FSes that natively support zoned devices. So relaxing that
>> requirement is not trivial.
>
> Just saw Bart's work on supporting non-power_of_2 zone len:
>
> https://lore.kernel.org/linux-block/dc89c70e-4931-baaf-c450-6801c200c1d7@acm.org/
>
> IMO FS support might be another topic, cause FS isn't the only user,
> also without block layer support, the device isn't usable, not mention FS.
And if the FS requires a power of 2 zone size, that will create fragmentation of
the zoned device support: some devices will be usable with an FS, others not.
Not nice at all. That is *not* something that exists today, for any block
device. I am not very keen on going down such a route.
> Since non-power2 zoned device does exists, I'd suggest Bart to restart the
> work and let linux cover more zoned devices(include non-power 2 zone).
See above. Others (Keith, Christoph, Martin) may also have a different opinion.
--
Damien Le Moal
Western Digital Research
* Re: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
2024-01-12 3:34 ` Damien Le Moal
@ 2024-01-12 3:46 ` Bart Van Assche
0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2024-01-12 3:46 UTC (permalink / raw)
To: Damien Le Moal, Ming Lei
Cc: linux-block, Yi Zhang, John Meneghini, linux-nvme, hch,
Keith Busch, Martin K . Petersen,
linux-scsi @ vger . kernel . org, Jaegeuk Kim, Pankaj Raghav
On 1/11/24 19:34, Damien Le Moal wrote:
> On 1/12/24 12:29, Ming Lei wrote:
>> Just saw Bart's work on supporting non-power_of_2 zone len:
>>
>> https://lore.kernel.org/linux-block/dc89c70e-4931-baaf-c450-6801c200c1d7@acm.org/
Hmm ... weren't these patches developed by Pankaj Raghav from Samsung?
>> IMO FS support might be another topic, cause FS isn't the only user,
>> also without block layer support, the device isn't usable, not mention FS.
>
> And if the FS requires a power of 2 zone size, that will create fragmentation of
> the zoned device support: some devices will be usable with an FS, others not.
> Not nice at all. That is *not* something that exists today, for any block
> device. I am not very keen on going down such route.
F2FS supports zone sizes that are not a power of two. Recent Android
kernels have support for zone sizes that are not a power of two in the
block layer since UFS vendors requested support for this. We prefer to
have support in the upstream kernel for zone sizes that are not a power
of two because having to carry out-of-tree patches is painful.
Thanks,
Bart.
* Re: [Report] blk-zoned/ZNS: non_power_of_2 of zone->len]
2024-01-12 3:29 ` Ming Lei
2024-01-12 3:34 ` Damien Le Moal
@ 2024-01-12 15:40 ` Pankaj Raghav (Samsung)
1 sibling, 0 replies; 6+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-01-12 15:40 UTC (permalink / raw)
To: Ming Lei
Cc: Damien Le Moal, Bart Van Assche, linux-block, Yi Zhang,
John Meneghini, linux-nvme, hch, Keith Busch, p.raghav,
javier.gonz
Hi Ming,
> >
> > It is the latter. There was a session at LSF/MM last year about this. I recall
> > that the conclusion was that unless there is a strong user demand for non power
> > of 2 zone size, we are not going to do anything about it. Because allowing
> > non-power of 2 zone size has some serious consequences all over the place,
> > including in FSes that natively support zoned devices. So relaxing that
> > requirement is not trivial.
>
> Just saw Bart's work on supporting non-power_of_2 zone len:
>
> https://lore.kernel.org/linux-block/dc89c70e-4931-baaf-c450-6801c200c1d7@acm.org/
As Bart said, I did most of the work in 2022.
>
> IMO FS support might be another topic, cause FS isn't the only user,
> also without block layer support, the device isn't usable, not mention FS.
>
I also added a small dm target in the series that converts a non-po2
device to a po2 device to support existing FS without modifications
until native support is added in them.
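As a rough illustration of the idea only (this is not taken from the series, and the real target may work differently), one way such a remap could look is sketched below. It assumes the emulated zone size is the underlying zone size rounded up to the next power of 2, and that the sectors between the real zone size and the emulated one form an unmapped gap. roundup_pow2() and po2_to_npo2() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round v up to the next power of 2. Assumes 0 < v <= 2^63. */
static inline uint64_t roundup_pow2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Hypothetical remap: translate a sector on the emulated po2 device into
 * a sector on the underlying non-po2 device. Returns false when the
 * sector falls in the gap between the real and the emulated zone size.
 */
static inline bool po2_to_npo2(uint64_t sector, uint64_t dev_zone_sectors,
			       uint64_t *dev_sector)
{
	uint64_t emu_zone_sectors = roundup_pow2(dev_zone_sectors);
	uint64_t zone = sector / emu_zone_sectors;
	uint64_t offset = sector & (emu_zone_sectors - 1);

	if (offset >= dev_zone_sectors)
		return false;

	*dev_sector = zone * dev_zone_sectors + offset;
	return true;
}
```

Under these assumptions, a device with 196608-sector zones would be exposed with 262144-sector zones: emulated sector 262244 (zone 1, offset 100) maps to device sector 196708, while emulated sector 200000 lies in the gap of zone 0 and is not mapped.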
One of the main arguments against the support was the fragmentation it
may cause in the FS world for zoned devices. Given that F2FS already
supports non-po2 devices, it is only btrfs that will need some work to
have native non-po2 support.
> Since non-power2 zoned device does exists, I'd suggest Bart to restart the
> work and let linux cover more zoned devices(include non-power 2 zone).
>
I would be more than happy to provide my reviews if someone wants to do
a respin on that series. IIRC, the changes to the block layer were not
very intrusive.
--
Pankaj