From: "Javier González" <javier@javigon.com>
To: "Matias Bjørling" <Matias.Bjorling@wdc.com>
Cc: Dave Chinner <david@fromorbit.com>,
Luis Chamberlain <mcgrof@kernel.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>,
Damien Le Moal <Damien.LeMoal@wdc.com>,
Bart Van Assche <bvanassche@acm.org>,
Hans Holmberg <Hans.Holmberg@wdc.com>,
Adam Manzanares <a.manzanares@samsung.com>,
Keith Busch <Keith.Busch@wdc.com>,
Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
Naohiro Aota <Naohiro.Aota@wdc.com>,
Pankaj Raghav <pankydev8@gmail.com>,
Kanchan Joshi <joshi.k@samsung.com>,
Nitesh Shetty <nj.shetty@samsung.com>
Subject: Re: [LSF/MM/BPF BoF] BoF for Zoned Storage
Date: Mon, 7 Mar 2022 12:29:45 +0100
Message-ID: <20220307112824.ehnnec5xv6fsvkpa@ArmHalley.local>
In-Reply-To: <BYAPR04MB496845AB3EEC1EAD8C7CE4D9F1089@BYAPR04MB4968.namprd04.prod.outlook.com>
On 07.03.2022 10:27, Matias Bjørling wrote:
>> > I understand that you point to ZoneFS for this. It is true that it was
>> > presented at the moment as the way to do raw zone access from
>> > user-space.
>> >
>> > However, there are no users of ZoneFS for ZNS devices that I am aware
>> > of (maybe for SMR this is a different story). The main open-source
>> > implementations out there for RocksDB that are being used in
>> > production (ZenFS and xZTL) rely on either raw zone block access or
>> > the generic char device in NVMe (/dev/ngXnY).
>>
>> That's exactly the situation we want to avoid.
>>
>> You're talking about accessing Zoned storage by knowing directly about how
>> the hardware works and interfacing directly with hardware specific device
>> commands.
>>
>> This is exactly what is wrong with this whole conversation - direct access to
>> hardware is fragile and very limiting, and the whole purpose of having an
>> operating system is to abstract the hardware functionality into a generally
>> usable API. That way when something new gets added to the hardware or
>> something gets removed, the applications don't break because they weren't
>> written with that sort of hardware functionality extension in mind.
>>
>> I understand that RocksDB probably went direct to the hardware because, at
>> the time, it was the only choice the developers had to make use of ZNS based
>> storage. I understand that.
>>
>> However, I also understand that there are *better options now* that allow
>> applications to target zone storage in a way that doesn't expose them to the
>> foibles of hardware support and storage protocol specifications and
>> characteristics.
>>
>> The generic interface that the kernel provides for zoned storage is called
>> ZoneFS. Forget about the fact it is a filesystem, all it does is provide userspace
>> with a named zone abstraction for a zoned device: every zone is an append-only
>> file.
>>
>> That's what I'm trying to get across here - this whole discussion about zone
>> capacity not matching zone size is a hardware/specification detail that
>> applications *do not need to know about* to use zone storage. That's
>> something that ZoneFS can/does hide from applications completely - the zone
>> files behave exactly the same from the user perspective regardless of whether
>> the hardware zone capacity is the same or less than the zone size.
>>
>> Expanding access to the hardware and/or raw block devices to ensure userspace
>> applications can directly manage zone write pointers, zone capacity/space
>> limits, etc is the wrong architectural direction to be taking. The sort of
>> *hardware quirks* being discussed in this thread need to be managed by the
>> kernel and hidden from userspace; userspace shouldn't need to care about
>> such weird and esoteric hardware and storage
>> protocol/specification/implementation differences.
>>
>> IMO, while RocksDB is the technology leader for ZNS, it is not the model that
>> new applications should be trying to emulate. They should be designed from
>> the ground up to use ZoneFS instead of directly accessing nvme devices or
>> trying to use the raw block devices for zoned storage. Use the generic kernel
>> abstraction for the hardware like applications do for all other things!
>>
>> > This is because having the capability to do zone management from
>> > applications that already work with objects fits much better.
>>
>> ZoneFS doesn't absolve applications from having to perform zone management
>> to pack its objects and garbage collect stale storage space. ZoneFS merely
>> provides a generic, file based, hardware independent API for performing these
>> zone management tasks.
>>
>> > My point is that there is space for both ZoneFS and raw zoned block
>> > device. And regarding !PO2 zone sizes, my point is that this can be
>> > leveraged both by btrfs and this raw zone block device.
>>
>> On that I disagree - any argument that starts with "we need raw zoned block
>> device access to ...." is starting from an invalid premise. We should be hiding
>> hardware quirks from userspace, not exposing them further.
>>
>> IMO, we want writing zone storage native applications to be simple and
>> approachable by anyone who knows how to write to append-only files. We do
>> not want such applications to be limited to people who have deep and rare
>> expertise in the dark details of, say, largely undocumented niche NVMe ZNS
>> specification and protocol quirks.
>>
>> ZoneFS provides us with a path to the former, what you are advocating is the
>> latter....
I agree with all you say. I can see ZoneFS becoming a generic zone API,
but we are not there yet. Rather than advocating for using raw devices,
I am describing how zoned devices are being consumed today. So to me
there are two things we need to consider: supporting current customers
and improving the way future customers consume these devices.

Coming back to the original topic of the LSF/MM discussion, what I would
like to propose is that we support existing, deployed devices that are
running in Linux and do not have PO2 zone sizes. These can then be
consumed by btrfs or presented to applications through ZoneFS. And for
existing customers, this will mean fewer headaches.

Note here that if we use ZoneFS and all we care about is zone capacity,
then the whole PO2 argument to make applications more efficient does not
apply anymore, as applications would be using the real capacity of the
zone. I very much like this approach.
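
To make the capacity point concrete, here is a minimal sketch of how an
application might size its writes from a ZoneFS zone file alone, without
knowing the zone size or whether it is a power of two. It assumes, as I
understand the kernel zonefs documentation, that for a sequential zone
file st_size reflects the write pointer offset and st_blocks (in 512-byte
units) reflects the maximum file size, i.e. the zone capacity. The helper
names and the mount point mentioned below are hypothetical.

```python
import os
from dataclasses import dataclass

SECTOR_BYTES = 512  # st_blocks is counted in 512-byte units


@dataclass
class ZoneSpace:
    capacity: int  # assumed: maximum file size in bytes (zone capacity)
    written: int   # assumed: bytes already appended (write pointer offset)

    @property
    def remaining(self) -> int:
        # Bytes that can still be appended before the zone file is full.
        return max(self.capacity - self.written, 0)


def zone_space_from_stat(st_size: int, st_blocks: int) -> ZoneSpace:
    """Derive zone usage from stat fields only (hypothetical helper)."""
    return ZoneSpace(capacity=st_blocks * SECTOR_BYTES, written=st_size)


def zone_space(path: str) -> ZoneSpace:
    """Query a zone file, e.g. a file under <mountpoint>/seq/."""
    st = os.stat(path)
    return zone_space_from_stat(st.st_size, st.st_blocks)
```

An application would call something like zone_space("/mnt/zonefs/seq/0")
(hypothetical path) and plan its appends against the remaining value. The
zone size never enters the calculation, so a non-PO2 zone size is invisible
at this level.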
>+ Hans (zenfs/rocksdb author)
>
>Dave, thank you for your great insight. It is a great argument for why zonefs makes sense. I must admit that Damien has been telling me this multiple times, but I didn't fully grok the benefits until seeing it in the light of this thread.
>
>Wrt RocksDB support using ZenFS - while raw block access was the initial approach, it is very easy to change to use the zonefs API. Hans has already whipped up a plan for how to do it.
This is great. We have been thinking for some time about aligning with
ZenFS for the in-kernel path. This might be the right time to take
action on this.