public inbox for linux-btrfs@vger.kernel.org
From: Mark Harmstone <maharmstone@meta.com>
To: Kanchan Joshi <joshi.k@samsung.com>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"dsterba@suse.com" <dsterba@suse.com>, Chris Mason <clm@meta.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	"hch@lst.de" <hch@lst.de>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"gost.dev@samsung.com" <gost.dev@samsung.com>
Subject: Re: [RFC 0/3] Btrfs checksum offload
Date: Wed, 29 Jan 2025 15:55:33 +0000
Message-ID: <cdc17997-cb43-4254-a90a-b010fc6c9f5a@meta.com>
In-Reply-To: <20250129140207.22718-1-joshi.k@samsung.com>

On 29/1/25 14:02, Kanchan Joshi wrote:
>
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
> 
> There is value in avoiding a copy-on-write (COW) checksum tree on
> a device that can store checksums inline anyway (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
> 
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> The block layer currently makes no attempt to detect or advertise this offload.
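
For readers unfamiliar with the knob, here is a minimal sketch of how the PRACT bit sits in the control word of an NVMe read/write command. The bit values follow the NVME_RW_PRINFO_* constants in include/linux/nvme.h; the helper rw_control() and its have_host_pi parameter are hypothetical, and Type 1 protection information is assumed. It is an illustration, not the code from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the 16-bit control word, as in NVME_RW_PRINFO_* (include/linux/nvme.h). */
enum {
	PRINFO_PRCHK_REF   = 1 << 10,	/* check the reference tag */
	PRINFO_PRCHK_APP   = 1 << 11,	/* check the application tag */
	PRINFO_PRCHK_GUARD = 1 << 12,	/* check the guard (checksum) */
	PRINFO_PRACT       = 1 << 13,	/* controller inserts/strips PI itself */
};

/*
 * Hypothetical helper: build the control word for a Type 1 namespace.
 * When the host supplies no PI buffer, PRACT asks the controller to
 * generate the PI on write and to verify and strip it on read.
 */
static uint16_t rw_control(bool have_host_pi)
{
	uint16_t control = PRINFO_PRCHK_GUARD | PRINFO_PRCHK_REF;

	if (!have_host_pi)
		control |= PRINFO_PRACT;
	return control;
}
```

With no host PI buffer the sketch yields 0x3400 (PRACT plus guard and reftag checks); with a host buffer it yields 0x1400.
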
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).
> 
> [*] Here are some perf/write-amplification numbers from a randwrite test [1]
> on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
> 
> In configs 1 and 2, Btrfs operates with a checksum tree.
> Only in config 2 does the block layer attach an integrity buffer to each I/O and
> do checksum/reftag verification.
> Only in config 3 does the offload take place, with the device generating and
> verifying the checksum.
> 
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
> 
> Direct I/O
> ---------------------------------------------------------
> Config		IOPS(K)		FsW(G)		ExtraW(G)
> 1		144		186		66
> 2		141		181		61
> 3		172		129		9
> 
> Buffered I/O
> ---------------------------------------------------------
> Config		IOPS(K)		FsW(G)		ExtraW(G)
> 1		82		255		135
> 2		80		252		132
> 3		100		199		79
> 
> Write amplification is generally high (which is understandable given
> B-trees), but I am not sure why buffered I/O shows that much.
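
To make the table arithmetic explicit: ExtraW is FsW minus AppW, and the write-amplification factor is FsW divided by AppW. A small sketch, assuming AppW is fixed at 120G as stated above; the function names are made up for illustration.

```c
/* AppW from the cover letter: 4 jobs, each writing 30G. */
#define APP_W_G 120

/* Extra writes (in G) that reached the device beyond what the app issued. */
static int extra_w(int fs_w_g)
{
	return fs_w_g - APP_W_G;
}

/* Write-amplification factor: device writes per unit of app writes. */
static double write_amp(int fs_w_g)
{
	return (double)fs_w_g / APP_W_G;
}
```

For the direct I/O rows, FsW of 186G (config 1) gives ExtraW of 66G and a factor of about 1.55, while FsW of 129G (config 3) gives ExtraW of 9G and a factor of about 1.08.
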
> 
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting
> 
> 
> Kanchan Joshi (3):
>    block: add integrity offload
>    nvme: support integrity offload
>    btrfs: add checksum offload
> 
>   block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
>   block/t10-pi.c            |  7 +++++++
>   drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
>   drivers/nvme/host/nvme.h  |  1 +
>   fs/btrfs/bio.c            | 12 +++++++++++
>   fs/btrfs/fs.h             |  1 +
>   fs/btrfs/super.c          |  9 +++++++++
>   include/linux/blk_types.h |  3 +++
>   include/linux/blkdev.h    |  7 +++++++
>   9 files changed, 105 insertions(+), 1 deletion(-)
> 

There's also checksumming done on the metadata trees, which could be 
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum
type of "none"? The logic would be that it also zeroes out the csum
field in the B-tree headers.
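
Roughly what I have in mind, as a userspace sketch rather than a patch. The first four type values and the 32-byte csum field mirror the existing on-disk format; CSUM_TYPE_NONE, struct tree_header, csum_len() and finish_header() are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CSUM_SIZE 32	/* size of the csum field, as BTRFS_CSUM_SIZE */

enum csum_type {
	CSUM_TYPE_CRC32  = 0,
	CSUM_TYPE_XXHASH = 1,
	CSUM_TYPE_SHA256 = 2,
	CSUM_TYPE_BLAKE2 = 3,
	CSUM_TYPE_NONE   = 4,	/* hypothetical: no checksum stored at all */
};

/* Only the leading csum field of a B-tree header; the rest is omitted. */
struct tree_header {
	uint8_t csum[CSUM_SIZE];
};

/* Number of meaningful digest bytes for each type. */
static size_t csum_len(enum csum_type type)
{
	switch (type) {
	case CSUM_TYPE_CRC32:	return 4;
	case CSUM_TYPE_XXHASH:	return 8;
	case CSUM_TYPE_SHA256:	return 32;
	case CSUM_TYPE_BLAKE2:	return 32;
	case CSUM_TYPE_NONE:	return 0;
	}
	return 0;
}

/* Zero the csum field, then copy in the digest if the type has one. */
static void finish_header(struct tree_header *hdr, enum csum_type type,
			  const uint8_t *digest)
{
	size_t len = csum_len(type);

	memset(hdr->csum, 0, CSUM_SIZE);
	if (len > 0)
		memcpy(hdr->csum, digest, len);
}
```

With the "none" type the field is always all zeroes, so nothing needs to be computed or verified for metadata blocks either.
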

Mark
