* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
[not found] ` <20220315135245.eqf4tqngxxb7ymqa@unifi>
@ 2022-03-15 14:14 ` Johannes Thumshirn
2022-03-15 14:27 ` David Sterba
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Johannes Thumshirn @ 2022-03-15 14:14 UTC
To: Javier González, Christoph Hellwig
Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
Keith Busch, Pankaj Raghav, Adam Manzanares,
jiangbo.365@bytedance.com, kanchan Joshi, Jens Axboe,
Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-btrfs @ vger . kernel . org
On 15/03/2022 14:52, Javier González wrote:
> On 15.03.2022 14:30, Christoph Hellwig wrote:
>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>> file-system. As other interfaces arrive, this work will become natural.
>>>
>>> ZoneFS and btrfs are good targets for ZNS and these we can do. I would
>>> still do the work in phases to make sure we have enough early feedback
>>> from the community.
>>>
>>> Since this thread has been very active, I will wait some time for
>>> Christoph and others to catch up before we start sending code.
>>
>> Can someone summarize where we stand? Between the lack of quoting
>> from hell and overly long lines from corporate mail clients I've
>> mostly stopped reading this thread because it takes too much effort
>> to actually extract the information.
>
> Let me give it a try:
>
> - PO2 emulation in NVMe is a no-go. Drop this.
>
> - The arguments against supporting NPO2 are:
> - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
> can create confusion for users of both SMR and ZNS
>
> - Existing applications assume PO2 zone sizes, and probably do
> optimizations for these. These applications, if wanting to use
> ZNS will have to change the calculations
>
> - There is a fear for performance regressions.
>
> - It adds more work to you and other maintainers
>
> - The arguments in favour of NPO2 are:
> - Unmapped LBAs create holes that applications need to deal with.
> This affects mapping and performance due to splits. Bo explained
> this in a thread from Bytedance's perspective. I explained in an
> answer to Matias how we are not letting zones transition to
> offline in order to simplify the host stack. Not sure if this is
> something we want to bring to NVMe.
>
> - As ZNS adds more features and other protocols add support for
> zoned devices we will have more use-cases for the zoned block
> device. We will have to deal with this fragmentation at some
> point.
>
> - This is used in production workloads in Linux hosts. I would
> advocate for this not being off-tree as it will be a headache for
> all in the future.
>
> - If you agree that removing PO2 is an option, we can do the following:
> - Remove the constraint in the block layer and add ZoneFS support
> in a first patch.
>
> - Add btrfs support in a later patch
(+ linux-btrfs )
Please also make sure to support btrfs and not only throw some patches
over the fence. Zoned device support in btrfs is complex enough and has
quite some special casing vs regular btrfs, which we're working on getting
rid of. So having non-power-of-2 zone size, would also mean having NPO2
block-groups (and thus block-groups not aligned to the stripe size).
Just thinking of this and knowing I need to support it gives me a
headache.
Also please consult the rest of the btrfs developers for thoughts on this.
After all btrfs has full zoned support (including ZNS, not saying it's
perfect) and is also the default FS for at least two Linux distributions.
Thanks a lot,
Johannes
* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
2022-03-15 14:14 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Johannes Thumshirn
@ 2022-03-15 14:27 ` David Sterba
2022-03-15 19:56 ` Pankaj Raghav
2022-03-15 15:11 ` Javier González
2022-03-15 18:51 ` Pankaj Raghav
2 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2022-03-15 14:27 UTC
To: Johannes Thumshirn
Cc: Javier González, Christoph Hellwig, Matias Bjørling,
Damien Le Moal, Luis Chamberlain, Keith Busch, Pankaj Raghav,
Adam Manzanares, jiangbo.365@bytedance.com, kanchan Joshi,
Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-btrfs @ vger . kernel . org
On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote:
> On 15/03/2022 14:52, Javier González wrote:
> > On 15.03.2022 14:30, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and btrfs are good targets for ZNS and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand? Between the lack of quoting
> >> from hell and overly long lines from corporate mail clients I've
> >> mostly stopped reading this thread because it takes too much effort
> >> to actually extract the information.
> >
> > Let me give it a try:
> >
> > - PO2 emulation in NVMe is a no-go. Drop this.
> >
> > - The arguments against supporting NPO2 are:
> > - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
> > can create confusion for users of both SMR and ZNS
> >
> > - Existing applications assume PO2 zone sizes, and probably do
> > optimizations for these. These applications, if wanting to use
> > ZNS will have to change the calculations
> >
> > - There is a fear for performance regressions.
> >
> > - It adds more work to you and other maintainers
> >
> > - The arguments in favour of NPO2 are:
> > - Unmapped LBAs create holes that applications need to deal with.
> > This affects mapping and performance due to splits. Bo explained
> > this in a thread from Bytedance's perspective. I explained in an
> > answer to Matias how we are not letting zones transition to
> > offline in order to simplify the host stack. Not sure if this is
> > something we want to bring to NVMe.
> >
> > - As ZNS adds more features and other protocols add support for
> > zoned devices we will have more use-cases for the zoned block
> > device. We will have to deal with this fragmentation at some
> > point.
> >
> > - This is used in production workloads in Linux hosts. I would
> > advocate for this not being off-tree as it will be a headache for
> > all in the future.
> >
> > - If you agree that removing PO2 is an option, we can do the following:
> > - Remove the constraint in the block layer and add ZoneFS support
> > in a first patch.
> >
> > - Add btrfs support in a later patch
>
> (+ linux-btrfs )
>
> Please also make sure to support btrfs and not only throw some patches
> over the fence. Zoned device support in btrfs is complex enough and has
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
>
> Just thinking of this and knowing I need to support it gives me a
> headache.
PO2 is really easy to work with and I guess allocation on the physical
device could also benefit from that, I'm still puzzled why the NPO2 is
even proposed.
We can possibly hide the calculations behind some API so I hope in the
end it should be bearable. The size of block groups is flexible; we only
want some reasonable alignment.
> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's
> perfect) and is also the default FS for at least two Linux distributions.
I haven't read the whole thread yet, my impression is that some hardware
is deliberately breaking existing assumptions about zoned devices and in
turn breaking btrfs support. I hope I'm wrong on that or at least that
it's possible to work around it.
* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
2022-03-15 14:14 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Johannes Thumshirn
2022-03-15 14:27 ` David Sterba
@ 2022-03-15 15:11 ` Javier González
2022-03-15 18:51 ` Pankaj Raghav
2 siblings, 0 replies; 6+ messages in thread
From: Javier González @ 2022-03-15 15:11 UTC
To: Johannes Thumshirn
Cc: Christoph Hellwig, Matias Bjørling, Damien Le Moal,
Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
jiangbo.365@bytedance.com, kanchan Joshi, Jens Axboe,
Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-btrfs @ vger . kernel . org
On 15.03.2022 14:14, Johannes Thumshirn wrote:
>On 15/03/2022 14:52, Javier González wrote:
>> On 15.03.2022 14:30, Christoph Hellwig wrote:
>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>>> file-system. As other interfaces arrive, this work will become natural.
>>>>
>>>> ZoneFS and btrfs are good targets for ZNS and these we can do. I would
>>>> still do the work in phases to make sure we have enough early feedback
>>>> from the community.
>>>>
>>>> Since this thread has been very active, I will wait some time for
>>>> Christoph and others to catch up before we start sending code.
>>>
>>> Can someone summarize where we stand? Between the lack of quoting
>>> from hell and overly long lines from corporate mail clients I've
>>> mostly stopped reading this thread because it takes too much effort
>>> to actually extract the information.
>>
>> Let me give it a try:
>>
>> - PO2 emulation in NVMe is a no-go. Drop this.
>>
>> - The arguments against supporting NPO2 are:
>> - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>> can create confusion for users of both SMR and ZNS
>>
>> - Existing applications assume PO2 zone sizes, and probably do
>> optimizations for these. These applications, if wanting to use
>> ZNS will have to change the calculations
>>
>> - There is a fear for performance regressions.
>>
>> - It adds more work to you and other maintainers
>>
>> - The arguments in favour of NPO2 are:
>> - Unmapped LBAs create holes that applications need to deal with.
>> This affects mapping and performance due to splits. Bo explained
>> this in a thread from Bytedance's perspective. I explained in an
>> answer to Matias how we are not letting zones transition to
>> offline in order to simplify the host stack. Not sure if this is
>> something we want to bring to NVMe.
>>
>> - As ZNS adds more features and other protocols add support for
>> zoned devices we will have more use-cases for the zoned block
>> device. We will have to deal with this fragmentation at some
>> point.
>>
>> - This is used in production workloads in Linux hosts. I would
>> advocate for this not being off-tree as it will be a headache for
>> all in the future.
>>
>> - If you agree that removing PO2 is an option, we can do the following:
>> - Remove the constraint in the block layer and add ZoneFS support
>> in a first patch.
>>
>> - Add btrfs support in a later patch
>
>(+ linux-btrfs )
>
>Please also make sure to support btrfs and not only throw some patches
>over the fence. Zoned device support in btrfs is complex enough and has
>quite some special casing vs regular btrfs, which we're working on getting
>rid of. So having non-power-of-2 zone size, would also mean having NPO2
>block-groups (and thus block-groups not aligned to the stripe size).
Thanks for mentioning this, Johannes. If we say we will work with you in
supporting btrfs properly, we will.
I believe you have already seen a couple of patches fixing things for
zone support in btrfs in the last weeks.
>
>Just thinking of this and knowing I need to support it gives me a
>headache.
I hope we can help you with that. btrfs has no alignment to PO2
natively, so I am confident we can find a good solution.
>
>Also please consult the rest of the btrfs developers for thoughts on this.
>After all btrfs has full zoned support (including ZNS, not saying it's
>perfect) and is also the default FS for at least two Linux distributions.
Of course. We will work with you and other btrfs developers. Luis is
helping make sure that we have good tests for linux-next. This is in
part how we have found the problems with Append, which should be fixed
now.
>
>Thanks a lot,
> Johannes
* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
2022-03-15 14:14 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Johannes Thumshirn
2022-03-15 14:27 ` David Sterba
2022-03-15 15:11 ` Javier González
@ 2022-03-15 18:51 ` Pankaj Raghav
2022-03-16 8:37 ` Johannes Thumshirn
2 siblings, 1 reply; 6+ messages in thread
From: Pankaj Raghav @ 2022-03-15 18:51 UTC
To: Johannes Thumshirn, Javier González, Christoph Hellwig
Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
Keith Busch, Adam Manzanares, jiangbo.365@bytedance.com,
kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
Kanchan Joshi, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-btrfs @ vger . kernel . org
Hi Johannes,
On 2022-03-15 15:14, Johannes Thumshirn wrote:
> Please also make sure to support btrfs and not only throw some patches
> over the fence. Zoned device support in btrfs is complex enough and has
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2
I already made a simple btrfs npo2 POC, and it mostly involved changing
the po2-specific calculations to generic ones. I understand that
changing the calculations from using log & shifts to division will
incur some performance penalty, but I think we can wrap them with
helpers to minimize that impact.
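As an illustration of the kind of helper meant here, the sketch below
keeps the shift/mask fast path for po2 zone sizes and falls back to
division and modulo otherwise. It is not code from the POC: struct
zone_geom, zone_geom_init(), zone_no() and zone_offset() are
hypothetical names, and sizes are assumed to be in 512-byte sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical zone geometry; sizes are in 512-byte sectors. */
struct zone_geom {
	uint64_t zone_sectors;  /* zone size */
	unsigned int shift;     /* log2(zone_sectors), valid only if po2 */
	bool po2;               /* true when zone_sectors is a power of 2 */
};

static inline bool is_po2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Precompute whether the fast path applies, and the shift if it does. */
static inline struct zone_geom zone_geom_init(uint64_t zone_sectors)
{
	struct zone_geom g = { .zone_sectors = zone_sectors };

	assert(zone_sectors != 0);
	g.po2 = is_po2(zone_sectors);
	if (g.po2) {
		while ((1ULL << g.shift) < zone_sectors)
			g.shift++;
	}
	return g;
}

/* Index of the zone containing @sector. */
static inline uint64_t zone_no(const struct zone_geom *g, uint64_t sector)
{
	if (g->po2)
		return sector >> g->shift;      /* fast path: shift */
	return sector / g->zone_sectors;        /* generic path: division */
}

/* Offset of @sector from the start of its zone. */
static inline uint64_t zone_offset(const struct zone_geom *g, uint64_t sector)
{
	if (g->po2)
		return sector & (g->zone_sectors - 1);  /* fast path: mask */
	return sector % g->zone_sectors;                /* generic path: modulo */
}
```

With this shape, callers never open-code the shift, so existing po2
devices keep their current cost and only npo2 devices pay for the
division.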
> So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
>
I agree with your point that we risk not aligning to the stripe size
when we move to npo2 zone sizes; I believe the minimum stripe size is
64K (please correct me if I am wrong). As David Sterba mentioned in his
email, we could agree on some reasonable alignment, which I believe
would be the minimum stripe size of 64K, to avoid adding complexity to
the existing btrfs zoned support. It is a much milder constraint, one
that most devices can naturally adhere to, than the po2 zone size
requirement.
> Just thinking of this and knowing I need to support it gives me a
> headache.
>
This is definitely not some one-off patch that we want to upstream and
then disappear. As Javier already pointed out, we would be more than
happy to help you out here.
> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's
> perfect) and is also the default FS for at least two Linux distributions.
>
> Thanks a lot,
> Johannes
--
Regards,
Pankaj
* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
2022-03-15 14:27 ` David Sterba
@ 2022-03-15 19:56 ` Pankaj Raghav
0 siblings, 0 replies; 6+ messages in thread
From: Pankaj Raghav @ 2022-03-15 19:56 UTC
To: dsterba, Johannes Thumshirn, Javier González,
Christoph Hellwig, Matias Bjørling, Damien Le Moal,
Luis Chamberlain, Keith Busch, Adam Manzanares,
jiangbo.365@bytedance.com, kanchan Joshi, Jens Axboe,
Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-btrfs @ vger . kernel . org
Hi David,
On 2022-03-15 15:27, David Sterba wrote:
>
> PO2 is really easy to work with and I guess allocation on the physical
> device could also benefit from that, I'm still puzzled why the NPO2 is
> even proposed.
>
Quick recap:
Hardware NAND cannot naturally align to po2 zone sizes, which led to
having a zone cap and a zone size, where the zone cap is the actual
storage available in a zone. The main proposal is to remove the po2
constraint to get rid of these LBA holes (generally speaking). That is
why this whole effort was started.
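A small sketch of the arithmetic behind the LBA holes, assuming (purely
for illustration) that the zone size is the zone capacity rounded up to
the next power of 2. The helper names roundup_po2(), hole_per_zone() and
total_hole() are hypothetical and are not taken from any driver.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of 2 that is >= n (n must be non-zero and <= 2^63). */
static inline uint64_t roundup_po2(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0 && n <= (1ULL << 63));
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Unmapped LBAs in one zone: the gap between the zone capacity and the
 * (assumed) po2 zone size.
 */
static inline uint64_t hole_per_zone(uint64_t zone_cap)
{
	return roundup_po2(zone_cap) - zone_cap;
}

/* Unmapped LBAs across the whole device. */
static inline uint64_t total_hole(uint64_t zone_cap, uint64_t nr_zones)
{
	return hole_per_zone(zone_cap) * nr_zones;
}
```

For instance, a capacity of 96 units per zone gives a zone size of 128
and a hole of 32 units in every zone; a capacity that is already a
power of 2 leaves no hole.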
> We can possibly hide the calculations behind some API so I hope in the
> end it should be bearable. The size of block groups is flexible we only
> want some reasonable alignment.
>
I agree. I already replied to Johannes on what it might look like.
To reiterate, the reasonable alignment I was thinking of while doing a
POC for btrfs with npo2 zone sizes is the minimum stripe size that is
required by btrfs (64K), to reduce the impact of this change on the
zoned support in btrfs.
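To show how much milder a 64K alignment rule is than the po2 rule, here
is a minimal sketch of both checks. STRIPE_LEN uses the 64K value quoted
above; the function names are hypothetical and zone sizes are assumed to
be given in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (64ULL * 1024)  /* 64K, as quoted in this thread */

/* Current constraint: the zone size must be a power of 2. */
static inline bool zone_is_po2(uint64_t zone_bytes)
{
	return zone_bytes != 0 && (zone_bytes & (zone_bytes - 1)) == 0;
}

/* Relaxed constraint: the zone size must be a multiple of the stripe size. */
static inline bool zone_is_stripe_aligned(uint64_t zone_bytes)
{
	return zone_bytes != 0 && (zone_bytes % STRIPE_LEN) == 0;
}

/* Number of whole stripes in a zone; only meaningful when aligned. */
static inline uint64_t stripes_per_zone(uint64_t zone_bytes)
{
	assert(zone_is_stripe_aligned(zone_bytes));
	return zone_bytes / STRIPE_LEN;
}
```

A 96 MiB zone fails the po2 check but passes the 64K check, while any
po2 zone size of 64K or more passes both.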
> I haven't read the whole thread yet, my impression is that some hardware
> is deliberately breaking existing assumptions about zoned devices and in
> turn breaking btrfs support. I hope I'm wrong on that or at least that
> it's possible to work around it.
Based on the POC we did internally, it is definitely possible to support
it in btrfs. And making this change will not break the existing btrfs
support for zoned devices. A naive approach to making this change will
have some performance impact, as we will be changing the po2
calculations from log & shifts to divisions and multiplications. I
definitely think we can optimize it to minimize the impact on the
existing deployments.
--
Regards,
Pankaj
* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
2022-03-15 18:51 ` Pankaj Raghav
@ 2022-03-16 8:37 ` Johannes Thumshirn
0 siblings, 0 replies; 6+ messages in thread
From: Johannes Thumshirn @ 2022-03-16 8:37 UTC
To: Pankaj Raghav, Javier González, Christoph Hellwig
Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
Keith Busch, Adam Manzanares, jiangbo.365@bytedance.com,
kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
Kanchan Joshi, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-btrfs @ vger . kernel . org
On 15/03/2022 19:51, Pankaj Raghav wrote:
>> block-groups (and thus block-groups not aligned to the stripe size).
>>
> I agree with your point that we risk not aligning to stripe size when we
> move to npo2 zone size which I believe the minimum is 64K (please
> correct me if I am wrong). As David Sterba mentioned in his email, we
> could agree on some reasonable alignment, which I believe would be the
> minimum stripe size of 64k to avoid added complexity to the existing
> btrfs zoned support. And it is a much milder constraint that most
> devices can naturally adhere compared to the po2 zone size requirement.
>
What could be done is rounding a zone down to the next po2 (64k aligned),
but then we need to explicitly finish the zones.
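One possible reading of this, sketched under the assumption that the
file system would use only the largest po2 portion of each zone and
leave the rest unwritten: because the write pointer then never reaches
the end of the zone, the zone has to be finished explicitly. The names
rounddown_po2() and unused_after_rounddown() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of 2 that is <= n (n must be non-zero). */
static inline uint64_t rounddown_po2(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0);
	while (p <= n / 2)
		p <<= 1;
	return p;
}

/*
 * Part of the zone that stays unwritten when only the po2 portion is
 * used. If this is non-zero, the zone does not become full by writing
 * alone and must be finished explicitly.
 */
static inline uint64_t unused_after_rounddown(uint64_t zone_size)
{
	return zone_size - rounddown_po2(zone_size);
}
```

With a zone of 96 units, 64 would be usable and 32 would stay
unwritten, which is the cost of this approach.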
end of thread, other threads:[~2022-03-16 8:37 UTC | newest]
Thread overview: 6+ messages
[not found] <20220314073537.GA4204@lst.de>
[not found] ` <05a1fde2-12bd-1059-6177-2291307dbd8d@opensource.wdc.com>
[not found] ` <20220314104938.hv26bf5vah4x32c2@ArmHalley.local>
[not found] ` <BYAPR04MB49682B9263F21EE67070A4B1F10F9@BYAPR04MB4968.namprd04.prod.outlook.com>
[not found] ` <20220314195551.sbwkksv33ylhlyx2@ArmHalley.local>
[not found] ` <BYAPR04MB49688BD817284E5C317DD5D8F1109@BYAPR04MB4968.namprd04.prod.outlook.com>
[not found] ` <20220315130501.q7fjpqzutadadfu3@ArmHalley.localdomain>
[not found] ` <BYAPR04MB49689803ED6E1E32C49C6413F1109@BYAPR04MB4968.namprd04.prod.outlook.com>
[not found] ` <20220315132611.g5ert4tzuxgi7qd5@unifi>
[not found] ` <20220315133052.GA12593@lst.de>
[not found] ` <20220315135245.eqf4tqngxxb7ymqa@unifi>
2022-03-15 14:14 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Johannes Thumshirn
2022-03-15 14:27 ` David Sterba
2022-03-15 19:56 ` Pankaj Raghav
2022-03-15 15:11 ` Javier González
2022-03-15 18:51 ` Pankaj Raghav
2022-03-16 8:37 ` Johannes Thumshirn