linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Luis Chamberlain <mcgrof@kernel.org>, Keith Busch <kbusch@kernel.org>
Cc: "Christoph Hellwig" <hch@lst.de>,
	"Pankaj Raghav" <p.raghav@samsung.com>,
	"Adam Manzanares" <a.manzanares@samsung.com>,
	"Javier González" <javier.gonz@samsung.com>,
	jiangbo.365@bytedance.com, "kanchan Joshi" <joshi.k@samsung.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	"Sagi Grimberg" <sagi@grimberg.me>,
	"Matias Bjørling" <matias.bjorling@wdc.com>,
	"Pankaj Raghav" <pankydev8@gmail.com>,
	"Kanchan Joshi" <joshiiitr@gmail.com>,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
Date: Sat, 12 Mar 2022 16:58:08 +0900	[thread overview]
Message-ID: <bc0e53a9-f623-c69f-002e-d62e697a43d1@opensource.wdc.com> (raw)
In-Reply-To: <YivMBj7+j/EZcMVV@bombadil.infradead.org>

On 3/12/22 07:24, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
>> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
>>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
>>>
>>>> I'm starting to like the previous idea of creating an unholey
>>>> device-mapper for such users...
>>>
>>> Won't that restrict nvme with chunk size crap. For instance later if we
>>> want much larger block sizes.
>>
>> I'm not sure I understand. The chunk_size has nothing to do with the
>> block size. And while nvme is a user of this in some circumstances, it
>> can't be used concurrently with ZNS because the block layer appropriates
>> the field for the zone size.
> 
> Many device mapper targets split I/O into chunks, see max_io_len(),
> wouldn't this create an overhead?

Apart from the bio clone, the overhead should not be higher than what
the block layer already has. IOs that are too large or that are
straddling zones are split by the block layer, and DM splitting leads
generally to no split in the block layer for the underlying device IO.
DM essentially follows the same pattern: max_io_len() depends on the
target design limits, which in turn depend on the underlying device. For
a dm-unhole target, the IO size limit would typically be the same as
that of the underlying device.

> Using a device mapper target also creates a divergence in strategy
> for ZNS. Some will use the block device, others the dm target. The
> goal should be to create a unified path.

If we allow non power of 2 zone sized devices, the path will *never* be
unified because we will get fragmentation on what can run on these
devices as opposed to power of 2 sized ones. E.g. f2fs will not work for
the former but will for the latter. That is really not an ideal situation.

> 
> And all this, just because SMR. Is that worth it? Are we sure?

No. This is *not* because of SMR. Never has been. The first prototype
SMR drives I received in my lab 10 years ago did not have a power of 2
sized zone size because zones where naturally aligned to tracks, which
like NAND erase blocks, are not necessarily power of 2 sized. And all
zones were not even the same size. That was not usable.

The reason for the power of 2 requirement is 2 fold:
1) At the time we added zone support for SMR, chunk_sectors had to be a
power of 2 number of sectors.
2) SMR users did request power of 2 zone sizes and that all zones have
the same size as that simplified software design. There was even a
de-facto agreement that 256MB zone size is a good compromise between
usability and overhead of zone reclaim/GC. But that particular number is
for HDD due to their performance characteristics.

Hence the current Linux requirements which have been serving us well so
far. DM needed that chunk_sectors be changed to allow non power of 2
values. So the chunk_sectors requirement was lifted recently (can't
remember which version added this). Allowing non power of 2 zone size
would thus be more easily feasible now. Allowing devices with a non
power of 2 zone size is not technically difficult.

But...

The problem being raised is all about the fact that the power of 2 zone
size requirement creates a hole of unusable sectors in every zone when
the device implementation has a zone capacity lower than the zone size.

I have been arguing all along that I think this problem is a
non-problem, simply because a well designed application should *always*
use zones as storage containers without ever hoping that the next zone
in sequence can be used as well. The application should *never* consider
the entire LBA space of the device capacity without this zone split. The
zone based management of capacity is necessary for any good design to
deal correctly with write error recovery and active/open zone resources
management. And as Keith said. there is always a "hole" anyway for any
non-full zone, between the zone write pointer and the last usable sector
in the zone. Reads there are nonsensical and writes can only go to one
place.

Now, in the spirit of trying to facilitate software development for
zoned devices, we can try finding solutions to remove that hole. zonefs
is a obvious solution. But back to the previous point: with one zone ==
one file, there is no continuity in the storage address space that the
application can use. The application has to be designed to use
individual files representing a zone. And with such design, an
equivalent design directly using the block device file would have no
difficulties due to the the sector hole between zone capacity and zone
size. I have a prototype LevelDB implementation that can use both zonefs
and block device file on ZNS with only a few different lines of code to
prove this point.

The other solution would be adding a dm-unhole target to remap sectors
to remove the holes from the device address space. Such target would be
easy to write, but in my opinion, this would still not change the fact
that applications still have to deal with error recovery and active/open
zone resources. So they still have to be zone aware and operate per zone.

Furthermore, adding such DM target would create a non power of 2 zone
size zoned device which will need support from the block layer. So some
block layer functions will need to change. In the end, this may not be
different than enabling non power of 2 zone sized devices for ZNS.

And for this decision, I maintain some of my requirements:
1) The added overhead from multiplication & divisions should be
acceptable and not degrade performance. Otherwise, this would be a
disservice to the zone ecosystem.
2) Nothing that works today on available devices should break
3) Zone size requirements will still exist. E.g. btrfs 64K alignment
requirement

But even with all these properly addressed, f2fs will not work anymore,
some in-kernel users will still need some zone size requirements (btrfs)
and *all* applications using a zoned block device file will now have to
be designed based on non power of 2 zone size so that they can work on
all devices. Meaning that this is also potentially forcing changes on
existing applications to use newer zoned devices that may not have a
power of 2 zone size.

This entire discussion is about the problem that power of 2 zone size
creates (which again I think is a non-problem). However, based on the
arguments above, allowing non power of 2 zone sized devices is not
exactly problem free either.

My answer to your last question ("Are we sure?") is thus: No. I am not
sure this is a good idea. But as always, I would be happy to be proven
wrong. So far, I have not seen any argument doing that.

-- 
Damien Le Moal
Western Digital Research

  reply	other threads:[~2022-03-12  7:58 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20220308165414eucas1p106df0bd6a901931215cfab81660a4564@eucas1p1.samsung.com>
2022-03-08 16:53 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Pankaj Raghav
     [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
2022-03-08 17:14       ` Keith Busch
2022-03-08 17:43         ` Pankaj Raghav
2022-03-09  3:40       ` Damien Le Moal
2022-03-09 13:19         ` Pankaj Raghav
2022-03-09  3:44       ` Damien Le Moal
2022-03-09 13:35         ` Pankaj Raghav
     [not found]   ` <CGME20220308165428eucas1p14ea0a38eef47055c4fa41d695c5a249d@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops Pankaj Raghav
2022-03-09  3:46       ` Damien Le Moal
2022-03-09 14:02         ` Pankaj Raghav
     [not found]   ` <CGME20220308165432eucas1p18b36a238ef3f5a812ee7f9b0e52599a5@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 3/6] block: add a bool member to request_queue for power_of_2 emulation Pankaj Raghav
     [not found]   ` <CGME20220308165436eucas1p1b76f3cb5b4fa1f7d78b51a3b1b44d160@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices Pankaj Raghav
2022-03-09  4:04       ` Damien Le Moal
2022-03-09 14:33         ` Pankaj Raghav
2022-03-09 21:43           ` Damien Le Moal
2022-03-10 20:35             ` Luis Chamberlain
2022-03-10 23:50               ` Damien Le Moal
2022-03-11  0:56                 ` Luis Chamberlain
     [not found]   ` <CGME20220308165443eucas1p17e61670a5057f21a6c073711b284bfeb@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 5/6] null_blk: forward the sector value from null_handle_memory_backend Pankaj Raghav
     [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device Pankaj Raghav
2022-03-09  4:09       ` Damien Le Moal
2022-03-09 14:42         ` Pankaj Raghav
2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
2022-03-10 12:57     ` Pankaj Raghav
2022-03-10 13:07       ` Matias Bjørling
2022-03-10 13:14         ` Javier González
2022-03-10 14:58           ` Matias Bjørling
2022-03-10 15:07             ` Keith Busch
2022-03-10 15:16               ` Javier González
2022-03-10 23:44                 ` Damien Le Moal
2022-03-10 15:13             ` Javier González
2022-03-10 14:44       ` Christoph Hellwig
2022-03-11 20:19         ` Luis Chamberlain
2022-03-11 20:51           ` Keith Busch
2022-03-11 21:04             ` Luis Chamberlain
2022-03-11 21:31               ` Keith Busch
2022-03-11 22:24                 ` Luis Chamberlain
2022-03-12  7:58                   ` Damien Le Moal [this message]
2022-03-14  7:35                     ` Christoph Hellwig
2022-03-14  7:45                       ` Damien Le Moal
2022-03-14  7:58                         ` Christoph Hellwig
2022-03-14 10:49                         ` Javier González
2022-03-14 14:16                           ` Matias Bjørling
2022-03-14 16:23                             ` Luis Chamberlain
2022-03-14 19:30                               ` Matias Bjørling
2022-03-14 19:51                                 ` Luis Chamberlain
2022-03-15 10:45                                   ` Matias Bjørling
2022-03-14 19:55                             ` Javier González
2022-03-15 12:32                               ` Matias Bjørling
2022-03-15 13:05                                 ` Javier González
2022-03-15 13:14                                   ` Matias Bjørling
2022-03-15 13:26                                     ` Javier González
2022-03-15 13:30                                       ` Christoph Hellwig
2022-03-15 13:52                                         ` Javier González
2022-03-15 14:03                                           ` Matias Bjørling
2022-03-15 14:14                                           ` Johannes Thumshirn
2022-03-15 14:27                                             ` David Sterba
2022-03-15 19:56                                               ` Pankaj Raghav
2022-03-15 15:11                                             ` Javier González
2022-03-15 18:51                                             ` Pankaj Raghav
2022-03-16  8:37                                               ` Johannes Thumshirn
2022-03-15 17:00                                         ` Luis Chamberlain
2022-03-16  0:07                                           ` Damien Le Moal
2022-03-16  0:23                                             ` Luis Chamberlain
2022-03-16  0:46                                               ` Damien Le Moal
2022-03-16  1:24                                                 ` Luis Chamberlain
2022-03-16  1:44                                                   ` Damien Le Moal
2022-03-16  2:13                                                     ` Luis Chamberlain
2022-03-16  2:27                                               ` Martin K. Petersen
2022-03-16  2:41                                                 ` Luis Chamberlain
2022-03-16  8:44                                                 ` Javier González
2022-03-15 13:39                                       ` Matias Bjørling
2022-03-16  0:00                                   ` Damien Le Moal
2022-03-16  8:57                                     ` Javier González
2022-03-16 16:18                                     ` Pankaj Raghav
2022-03-14  8:36                     ` Matias Bjørling
2022-03-11 22:23             ` Adam Manzanares
2022-03-11 22:30               ` Keith Busch
2022-03-21 16:21             ` Jonathan Derrick
2022-03-21 16:44               ` Keith Busch
2022-03-10 17:38     ` Adam Manzanares
2022-03-14  7:36       ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bc0e53a9-f623-c69f-002e-d62e697a43d1@opensource.wdc.com \
    --to=damien.lemoal@opensource.wdc.com \
    --cc=a.manzanares@samsung.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=javier.gonz@samsung.com \
    --cc=jiangbo.365@bytedance.com \
    --cc=joshi.k@samsung.com \
    --cc=joshiiitr@gmail.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=matias.bjorling@wdc.com \
    --cc=mcgrof@kernel.org \
    --cc=p.raghav@samsung.com \
    --cc=pankydev8@gmail.com \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).