From: Joel Granados <j.granados@samsung.com>
To: Hans Holmberg <Hans.Holmberg@wdc.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jaegeuk@kernel.org" <jaegeuk@kernel.org>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"Damien Le Moal" <Damien.LeMoal@wdc.com>,
	"Dennis Maisenbacher" <dennis.maisenbacher@wdc.com>,
	"Naohiro Aota" <Naohiro.Aota@wdc.com>,
	"Johannes Thumshirn" <Johannes.Thumshirn@wdc.com>,
	"Aravind Ramesh" <Aravind.Ramesh@wdc.com>,
	"Jørgen Hansen" <Jorgen.Hansen@wdc.com>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"javier@javigon.com" <javier@javigon.com>,
	"hch@lst.de" <hch@lst.de>,
	"a.manzanares@samsung.com" <a.manzanares@samsung.com>,
	"guokuankuan@bytedance.com" <guokuankuan@bytedance.com>,
	"viacheslav.dubeyko@bytedance.com"
	<viacheslav.dubeyko@bytedance.com>
Subject: Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices
Date: Tue, 14 Feb 2023 22:08:00 +0100	[thread overview]
Message-ID: <20230214210800.mfrok5hfb4hdkph2@localhost> (raw)
In-Reply-To: <20230206134148.GD6704@gsv>


On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> Write amplification induced by garbage collection negatively impacts
> both the performance and the lifetime of storage devices.
> 
> With zoned storage now standardized for SMR hard drives
> and flash (both NVMe and UFS), we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.
I'm also very interested in discussions related to data placement and
would like to take part in this one.

> 
> Background
> ----------
> 
> Zoned block devices enable the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
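> 
> As a rough illustration of what that interface looks like from user
> space (a minimal sketch; the device path is a made-up example and
> most error handling is trimmed), the kernel exposes zones via the
> BLKREPORTZONE ioctl:
> 
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <fcntl.h>
>   #include <unistd.h>
>   #include <sys/ioctl.h>
>   #include <linux/blkzoned.h>
> 
>   int main(void)
>   {
>       unsigned int i, nz = 16;   /* report the first 16 zones */
>       int fd = open("/dev/nvme0n1", O_RDONLY);
>       struct blk_zone_report *rep;
> 
>       if (fd < 0) {
>           perror("open");
>           return 1;
>       }
>       rep = calloc(1, sizeof(*rep) + nz * sizeof(struct blk_zone));
>       rep->sector = 0;           /* start reporting from sector 0 */
>       rep->nr_zones = nz;
>       if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
>           perror("BLKREPORTZONE");
>           return 1;
>       }
>       /* each zone has a start, a length and a write pointer;
>        * sequential-write-required zones must be written at the wp */
>       for (i = 0; i < rep->nr_zones; i++)
>           printf("zone %u: start %llu len %llu wp %llu cond %u\n",
>                  i, rep->zones[i].start, rep->zones[i].len,
>                  rep->zones[i].wp, rep->zones[i].cond);
>       free(rep);
>       close(fd);
>       return 0;
>   }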
> 
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification.
> Less disk I/O per user write.
> 
> Reducing garbage collection I/O improves maximum user read and
> write throughput as well as tail latencies, see [1].
> 
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the
> lifetime of the media.
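> 
> As a back-of-envelope example of the stakes: with a greedy reclaim
> policy, a zone that is still 80% valid when it is reclaimed forces
> roughly 0.8 / (1 - 0.8) = 4 blocks of relocation I/O per block of
> user data written, i.e. a write amplification of about 5. If data
> with similar lifetimes is grouped so that whole zones expire
> together, write amplification stays close to 1.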
> 
> Current state
> -------------
> 
> To enable the performance benefits of zoned block devices
> a file system needs to:
> 
> 1) Comply with the write restrictions associated with the
> zoned device model.
> 
> 2) Make active choices when allocating file data into zones
> to minimize GC.
> 
> Of the upstream file systems, btrfs and f2fs support
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data, which helps reduce GC,
> but there is room for improvement.
> 
> 
> There is still work to be done
> ------------------------------
> 
> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices, along with xfs, ext4 and other
> file systems using the conventional block interface,
> and at least for modern applications doing log-structured,
> flash-friendly writes, much can be improved.
> 
> A good example of a flash-friendly workload is RocksDB [6],
> which both does append-only writes and has a good prediction model
> for the lifetime of its files (due to its LSM-tree-based data
> structures).
> 
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS[2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
> 
> I see no good reason why Linux kernel file systems (at least f2fs & btrfs)
> could not play as nicely with these workloads as ZenFS does, by just
> allocating file data blocks in a better way.
> 
> In addition to ZenFS we also have FlexAlloc [5].
> There are probably more data placement schemes for zoned storage out there.
> 
> I think we need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
> 
> I brought this up at LPC last year [4], but we did not have much time
> for discussion.
> 
> What is missing
> ---------------
> 
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decisions could be made to place
> files with similar lifetimes into the same zones.
> 
> To do this, file systems would have to utilize some sort of hint to
> separate data into different lifetime buckets and map those to
> different zones.
> 
> There is a user ABI for hints available - the write lifetime hint
> interface that was introduced for streams [3]. F2fs is currently the
> only user of it.
> 
> Btrfs and other file systems with zoned support could make use of it
> too, but it is limited to four relative lifetime values, which I'm
> afraid would be too limiting when multiple users share a disk.
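> 
> For reference, this is roughly how an application opts into those
> hints today (a minimal sketch; on older systems the RWH_* constants
> and F_SET_RW_HINT may need to come from <linux/fcntl.h>, and the
> file name is made up):
> 
>   #define _GNU_SOURCE
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <fcntl.h>
> 
>   int main(void)
>   {
>       int fd = open("000042.sst", O_WRONLY | O_CREAT, 0644);
>       /* e.g. an L0 SST file is expected to be compacted away soon */
>       uint64_t hint = RWH_WRITE_LIFE_SHORT;
> 
>       if (fd < 0) {
>           perror("open");
>           return 1;
>       }
>       if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
>           perror("F_SET_RW_HINT");
>       /* subsequent writes to this inode carry the lifetime hint
>        * down to the file system and device */
>       return 0;
>   }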
> 
> Maybe the lifetime hints could be combined with the process ID to
> separate different workloads better, or maybe we need something else.
> F2fs supports cold/hot data separation based on file extension, which
> is another solution.
> 
> This is the first thing I'd like to discuss.
> 
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
> 
> Testing/benchmarking
> --------------------
> 
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
> 
> Benchmarking and testing are generally hard to get right, and
> particularly hard when it comes to testing and benchmarking
> reclaim/garbage collection, so it would make sense to share the effort.
> 
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement: LSM-tree-based key-value stores
> (e.g. RocksDB, TerarkDB) and stream-processing apps like Apache Kafka;
> see the sketch below.
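> 
> Something along these lines, perhaps (a hypothetical fio job file
> sketching LSM-style flush + compaction traffic on a zoned device;
> the device path, sizes and job mix are illustrative only, not a
> tuned model):
> 
>   ; lsm-zbd.fio - crude stand-in for LSM flush + compaction traffic
>   [global]
>   ; zoned block device under test (assumed path)
>   filename=/dev/nvme0n1
>   ; respect zone write-pointer constraints
>   zonemode=zbd
>   direct=1
>   ioengine=libaio
>   bs=128k
>   ; append-only: sequential writes within zones
>   rw=write
>   max_open_zones=8
> 
>   ; short-lived L0 flush writes
>   [flush]
>   size=4g
>   iodepth=1
> 
>   ; longer-lived, larger compaction writes
>   [compaction]
>   size=32g
>   iodepth=4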
> 
> Once we have a set of benchmarks that we collectively care about, I
> think we can work towards generic data placement methods with some
> level of confidence that they will actually work in practice.
> 
> Creating a repository with a bunch of reclaim/gc stress tests and
> benchmarks would be beneficial not only for kernel file systems but
> also for user-space and distributed file systems such as Ceph.
> 
> Thanks,
> Hans
> 
> [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> [2] https://github.com/westerndigitalcorporation/zenfs
> [3] https://lwn.net/Articles/726477/
> [4] https://lpc.events/event/16/contributions/1231/
> [5] https://github.com/OpenMPDK/FlexAlloc
> [6] https://github.com/facebook/rocksdb

