From: Kanchan Joshi <joshi.k@samsung.com>
To: josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
axboe@kernel.dk, kbusch@kernel.org, hch@lst.de
Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, gost.dev@samsung.com,
Kanchan Joshi <joshi.k@samsung.com>
Subject: [RFC 0/3] Btrfs checksum offload
Date: Wed, 29 Jan 2025 19:32:04 +0530
Message-ID: <20250129140207.22718-1-joshi.k@samsung.com>
In-Reply-To: CGME20250129141039epcas5p11feb1be4124c0db3c5223325924183a3@epcas5p1.samsung.com
TL;DR first: this makes Btrfs chuck its checksum tree and leverage the
NVMe SSD for data checksumming.
Now, the longer version for why/how.
End-to-end data protection (E2EDP)-capable drives require the transfer
of integrity metadata (PI).
This is currently handled by the block layer, without filesystem
involvement/awareness.
The block layer attaches the metadata buffer, generates the checksum
(and reftag) for write I/O, and verifies it during read I/O.
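To make the generate/verify step concrete, here is a minimal userspace sketch of an 8-byte PI tuple. It assumes the CRC16 (T10-DIF) guard type and the Type 1 convention of using the low 32 bits of the LBA as the reftag; the function names are made up for illustration and are not the kernel's.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi_tuple(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Write path: build guard (2B), app tag (2B), ref tag (4B), big-endian."""
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF  # assumption: Type 1 style reftag
    return struct.pack(">HHI", guard, app_tag, ref_tag)


def verify_pi_tuple(block: bytes, lba: int, pi: bytes) -> bool:
    """Read path: recompute the guard and compare guard and reftag."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == (lba & 0xFFFFFFFF)


block = bytes(4096)
pi = make_pi_tuple(block, lba=5)
print(pi.hex(), verify_pi_tuple(block, 5, pi))
```

With a 4K + 8b format, one such tuple travels with every 4K block, which is the per-I/O work the block layer does today.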
Btrfs has its own data and metadata checksumming, which is currently
disconnected from the above.
It maintains a separate on-device 'checksum tree' for data checksums,
while the block layer will also be checksumming each Btrfs I/O.
There is value in avoiding a copy-on-write (COW) checksum tree on
a device that can anyway store checksums inline (as part of PI).
This would eliminate extra checksum writes/reads, making I/O
more CPU-efficient.
Additionally, usable space would increase, and write
amplification, both in Btrfs and eventually at the device level, would
be reduced [*].
NVMe drives can also automatically insert and strip the PI/checksum
and provide a per-I/O control knob (the PRACT bit) for this.
The block layer currently makes no attempt to detect or advertise this
offload.
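For readers less familiar with the knob: in the NVMe NVM Command Set, PRACT is bit 29 of Command Dword 12 in a Read/Write command, next to the three PRCHK bits (guard, app tag, ref tag). The sketch below only illustrates that bit layout; the helper name is hypothetical and this is not how the series or the driver builds commands.

```python
# Bit positions within Command Dword 12 of an NVMe Read/Write command.
PRCHK_REF = 1 << 26    # check the reference tag
PRCHK_APP = 1 << 27    # check the application tag
PRCHK_GUARD = 1 << 28  # check the guard (checksum)
PRACT = 1 << 29        # controller inserts PI on write, strips it on read


def rw_cdw12(nlb: int, pract: bool, check_guard: bool = True,
             check_ref: bool = True) -> int:
    """Hypothetical helper: compose CDW12 for an nlb-block Read/Write."""
    if not 1 <= nlb <= 0x10000:
        raise ValueError("nlb must be between 1 and 65536")
    dword = (nlb - 1) & 0xFFFF  # NLB is a 0's based field in bits 15:0
    if pract:
        dword |= PRACT
    if check_guard:
        dword |= PRCHK_GUARD
    if check_ref:
        dword |= PRCHK_REF
    return dword


print(hex(rw_cdw12(8, pract=True)))
```

Because the bit is per command, the host can choose on each I/O whether to supply and check PI itself or to leave it to the device.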
This patch series: (a) adds checksum offload awareness to the
block layer (patch #1),
(b) enables the NVMe driver to register and support the offload
(patch #2), and
(c) introduces an opt-in (datasum_offload mount option) in Btrfs to
apply checksum offload for data (patch #3).
[*] Here are some perf/write-amplification numbers from a randwrite
test [1] on 3 configs (same device):
Config 1: No meta format (4K) + Btrfs (base)
Config 2: Meta format (4K + 8b) + Btrfs (base)
Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
In configs 1 and 2, Btrfs operates with a checksum tree.
Only in config 2 does the block layer attach an integrity buffer to each
I/O and do checksum/reftag verification.
Only in config 3 does the offload take place, with the device
generating/verifying the checksum.
AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
FsW: writes issued to device (from iostat)
ExtraW: extra writes compared to AppW
Direct I/O
---------------------------------------------------------
Config    IOPS(K)    FsW(G)    ExtraW(G)
1         144        186       66
2         141        181       61
3         172        129       9
Buffered I/O
---------------------------------------------------------
Config    IOPS(K)    FsW(G)    ExtraW(G)
1         82         255       135
2         80         252       132
3         100        199       79
Write amplification is generally high (and that's understandable given
B-trees), but I am not sure why buffered I/O shows that much.
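As a quick cross-check of how the columns relate, the sketch below recomputes ExtraW (FsW minus AppW) and a write-amplification factor (FsW divided by AppW) from the Direct I/O rows. The variable and function names are only for illustration.

```python
APP_WRITES_G = 120  # AppW: 4 jobs, each writing 30G

# Direct I/O results: config -> (IOPS in K, FsW in G)
direct_io = {1: (144, 186), 2: (141, 181), 3: (172, 129)}


def summarize(results, app_writes):
    """Return config -> (ExtraW in G, write amplification factor)."""
    rows = {}
    for config, (_iops_k, fs_writes) in results.items():
        extra = fs_writes - app_writes
        waf = fs_writes / app_writes
        rows[config] = (extra, round(waf, 3))
    return rows


def iops_gain_percent(results, base, new):
    """Relative IOPS change of config `new` over config `base`, in percent."""
    return round(100.0 * (results[new][0] / results[base][0] - 1.0), 1)


for config, (extra, waf) in summarize(direct_io, APP_WRITES_G).items():
    print(f"config {config}: ExtraW={extra}G WAF={waf}")
print(f"config 3 vs 1 IOPS: +{iops_gain_percent(direct_io, 1, 3)}%")
```

For Direct I/O this puts the amplification factor at roughly 1.5 with the checksum tree (configs 1 and 2) and under 1.1 with the offload (config 3).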
[1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting
Kanchan Joshi (3):
block: add integrity offload
nvme: support integrity offload
btrfs: add checksum offload
block/bio-integrity.c | 42 ++++++++++++++++++++++++++++++++++++++-
block/t10-pi.c | 7 +++++++
drivers/nvme/host/core.c | 24 ++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
fs/btrfs/bio.c | 12 +++++++++++
fs/btrfs/fs.h | 1 +
fs/btrfs/super.c | 9 +++++++++
include/linux/blk_types.h | 3 +++
include/linux/blkdev.h | 7 +++++++
9 files changed, 105 insertions(+), 1 deletion(-)
--
2.25.1