linux-block.vger.kernel.org archive mirror
* [RFC 0/3] Btrfs checksum offload
       [not found] <CGME20250129141039epcas5p11feb1be4124c0db3c5223325924183a3@epcas5p1.samsung.com>
@ 2025-01-29 14:02 ` Kanchan Joshi
  2025-01-29 14:02   ` [RFC 1/3] block: add integrity offload Kanchan Joshi
                     ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-29 14:02 UTC (permalink / raw)
  To: josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev, Kanchan Joshi

TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
SSD for data checksumming.

Now, the longer version for why/how.

End-to-end data protection (E2EDP)-capable drives require the transfer
of integrity metadata (PI).
This is currently handled by the block layer, without filesystem
involvement/awareness.
The block layer attaches the metadata buffer, generates the checksum
(and reftag) for write I/O, and verifies it during read I/O.
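
For reference, the 8-byte PI tuple that gets generated per protection
interval looks like this (the Type 1 DIF layout, as in
include/linux/t10-pi.h):

struct t10_pi_tuple {
	__be16 guard_tag;	/* checksum over the data interval */
	__be16 app_tag;		/* application-defined, opaque */
	__be32 ref_tag;		/* reference tag, derived from the LBA */
};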

Btrfs has its own data and metadata checksumming, which is currently
disconnected from the above.
It maintains a separate on-device 'checksum tree' for data checksums,
while the block layer will also be checksumming each Btrfs I/O.

There is value in avoiding a Copy-on-write (COW) checksum tree on
a device that can anyway store checksums inline (as part of PI).
This would eliminate extra checksum writes/reads, making I/O
more CPU-efficient.
Additionally, usable space would increase, and write
amplification, both in Btrfs and eventually at the device level, would
be reduced [*].

NVMe drives can also automatically insert and strip the PI/checksum
and provide a per-I/O control knob (the PRACT bit) for this.
The block layer currently makes no attempt to detect or advertise this
offload.

This patch series: (a) adds checksum offload awareness to the
block layer (patch #1),
(b) enables the NVMe driver to register and support the offload
(patch #2), and
(c) introduces an opt-in (datasum_offload mount option) in Btrfs to
apply checksum offload for data (patch #3).
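
For illustration, the opt-in from a submitter's perspective reduces to
something like this (a minimal sketch mirroring patch #3; the helper
name is only illustrative):

static void opt_in_csum_offload(struct bio *bio)
{
	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);

	/* Request device-side checksumming only if the driver advertised it. */
	if (bi && bi->offload_type != BLK_INTEGRITY_OFFLOAD_NONE)
		bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
}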

[*] Here are some perf/write-amplification numbers from a randwrite test [1]
on 3 configs (same device):
Config 1: No meta format (4K) + Btrfs (base)
Config 2: Meta format (4K + 8b) + Btrfs (base)
Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)

In configs 1 and 2, Btrfs will operate with a checksum tree.
Only in config 2 will the block layer attach an integrity buffer to each
I/O and do checksum/reftag verification.
Only in config 3 will the offload take place, with the device
generating/verifying the checksum.

AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
FsW: writes issued to device (from iostat)
ExtraW: extra writes compared to AppW
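(ExtraW = FsW - AppW; e.g. config 1, direct I/O: 186G - 120G = 66G.)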

Direct I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		144		186		66
2		141		181		61
3		172		129		9

Buffered I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		82		255		135
2		80		181		132
3		100		199		79

Write amplification is generally high (and that's understandable given
B-trees), but it is not clear why buffered I/O shows that much.

[1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting


Kanchan Joshi (3):
  block: add integrity offload
  nvme: support integrity offload
  btrfs: add checksum offload

 block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
 block/t10-pi.c            |  7 +++++++
 drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
 drivers/nvme/host/nvme.h  |  1 +
 fs/btrfs/bio.c            | 12 +++++++++++
 fs/btrfs/fs.h             |  1 +
 fs/btrfs/super.c          |  9 +++++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  7 +++++++
 9 files changed, 105 insertions(+), 1 deletion(-)

-- 
2.25.1



* [RFC 1/3] block: add integrity offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
@ 2025-01-29 14:02   ` Kanchan Joshi
  2025-01-29 14:02   ` [RFC 2/3] nvme: support " Kanchan Joshi
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-29 14:02 UTC (permalink / raw)
  To: josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev, Kanchan Joshi,
	Anuj Gupta

Add an 'offload_type' field that is populated by the integrity providers
(e.g., drivers) based on the underlying capability/configuration.

- BLK_INTEGRITY_OFFLOAD_NONE: indicates the offload capability is absent.
- BLK_INTEGRITY_OFFLOAD_NO_BUF: offload for which no integrity buffer is
needed.
- BLK_INTEGRITY_OFFLOAD_BUF: offload for which an integrity buffer is
needed.

Make the block layer skip processing (checksum generation/verification,
reftag remapping) that is not compatible with the offload.

Users (e.g., filesystems) can set the REQ_INTEGRITY_OFFLOAD flag to
request the offload.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Co-developed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
 block/t10-pi.c            |  7 +++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  7 +++++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5d81ad9a3d20..05872b5ad9aa 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -413,6 +413,41 @@ int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
 	return ret;
 }
 
+static bool bio_integrity_offload(struct bio *bio)
+{
+	struct bio_integrity_payload *bip;
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned int len;
+	void *buf;
+	gfp_t gfp = GFP_NOIO;
+
+	if (bi->offload_type == BLK_INTEGRITY_OFFLOAD_NO_BUF)
+		return true;
+
+	/* Allocate kernel buffer for protection data */
+	len = bio_integrity_bytes(bi, bio_sectors(bio));
+	buf = kmalloc(len, gfp | __GFP_ZERO);
+	if (unlikely(buf == NULL))
+		goto err_end_io;
+
+	bip = bio_integrity_alloc(bio, gfp, 1);
+	if (IS_ERR(bip)) {
+		kfree(buf);
+		goto err_end_io;
+	}
+
+	bip->bip_flags |= BIP_BLOCK_INTEGRITY;
+	if (bio_integrity_add_page(bio, virt_to_page(buf), len,
+			offset_in_page(buf)) < len)
+		goto err_end_io;
+
+	return true;
+
+err_end_io:
+	bio->bi_status = BLK_STS_RESOURCE;
+	bio_endio(bio);
+	return false;
+}
 /**
  * bio_integrity_prep - Prepare bio for integrity I/O
  * @bio:	bio to prepare
@@ -443,6 +478,10 @@ bool bio_integrity_prep(struct bio *bio)
 	if (bio_integrity(bio))
 		return true;
 
+	if (bio->bi_opf & REQ_INTEGRITY_OFFLOAD &&
+	    bi->offload_type != BLK_INTEGRITY_OFFLOAD_NONE)
+		return bio_integrity_offload(bio);
+
 	switch (bio_op(bio)) {
 	case REQ_OP_READ:
 		if (bi->flags & BLK_INTEGRITY_NOVERIFY)
@@ -522,7 +561,8 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 		container_of(work, struct bio_integrity_payload, bip_work);
 	struct bio *bio = bip->bip_bio;
 
-	blk_integrity_verify(bio);
+	if (!(bio->bi_opf & REQ_INTEGRITY_OFFLOAD))
+		blk_integrity_verify(bio);
 
 	kfree(bvec_virt(bip->bip_vec));
 	bio_integrity_free(bio);
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2d05421f0fa5..9eca1ad5d5e6 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -452,6 +452,9 @@ void blk_integrity_prepare(struct request *rq)
 
 	if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
 		return;
+	if ((rq->cmd_flags & REQ_INTEGRITY_OFFLOAD) &&
+	    (bi->offload_type != BLK_INTEGRITY_OFFLOAD_NONE))
+		return;
 
 	if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
 		ext_pi_type1_prepare(rq);
@@ -466,6 +469,10 @@ void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
 	if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
 		return;
 
+	if ((rq->cmd_flags & REQ_INTEGRITY_OFFLOAD) &&
+	    (bi->offload_type != BLK_INTEGRITY_OFFLOAD_NONE))
+		return;
+
 	if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
 		ext_pi_type1_complete(rq, nr_bytes);
 	else
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7..65615dbc3e2d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -378,6 +378,7 @@ enum req_flag_bits {
 	__REQ_DRV,		/* for driver use */
 	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
 	__REQ_ATOMIC,		/* for atomic write operations */
+	__REQ_INTEGRITY_OFFLOAD,/* I/O that wants HW integrity offload */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -399,6 +400,8 @@ enum req_flag_bits {
 #define REQ_NOMERGE	(__force blk_opf_t)(1ULL << __REQ_NOMERGE)
 #define REQ_IDLE	(__force blk_opf_t)(1ULL << __REQ_IDLE)
 #define REQ_INTEGRITY	(__force blk_opf_t)(1ULL << __REQ_INTEGRITY)
+#define REQ_INTEGRITY_OFFLOAD \
+			(__force blk_opf_t)(1ULL << __REQ_INTEGRITY_OFFLOAD)
 #define REQ_FUA		(__force blk_opf_t)(1ULL << __REQ_FUA)
 #define REQ_PREFLUSH	(__force blk_opf_t)(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD	(__force blk_opf_t)(1ULL << __REQ_RAHEAD)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7ac153e4423a..ef061eb4cb73 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -113,6 +113,12 @@ enum blk_integrity_checksum {
 	BLK_INTEGRITY_CSUM_CRC64	= 3,
 } __packed ;
 
+enum blk_integrity_offload {
+	BLK_INTEGRITY_OFFLOAD_NONE	= 0,
+	BLK_INTEGRITY_OFFLOAD_NO_BUF	= 1,
+	BLK_INTEGRITY_OFFLOAD_BUF	= 2,
+} __packed;
+
 struct blk_integrity {
 	unsigned char				flags;
 	enum blk_integrity_checksum		csum_type;
@@ -120,6 +126,7 @@ struct blk_integrity {
 	unsigned char				pi_offset;
 	unsigned char				interval_exp;
 	unsigned char				tag_size;
+	unsigned char				offload_type;
 };
 
 typedef unsigned int __bitwise blk_mode_t;
-- 
2.25.1



* [RFC 2/3] nvme: support integrity offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
  2025-01-29 14:02   ` [RFC 1/3] block: add integrity offload Kanchan Joshi
@ 2025-01-29 14:02   ` Kanchan Joshi
  2025-01-29 14:02   ` [RFC 3/3] btrfs: add checksum offload Kanchan Joshi
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-29 14:02 UTC (permalink / raw)
  To: josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev, Kanchan Joshi

Register the integrity offload with the block layer if it is supported
by the device.

Serve incoming offload requests by setting the PRACT bit and enabling
the guard check.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 drivers/nvme/host/core.c | 24 ++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f4b8d6a0984a..1fae0a6a932e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -984,6 +984,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 {
 	u16 control = 0;
 	u32 dsmgmt = 0;
+	u8  type;
 
 	if (req->cmd_flags & REQ_FUA)
 		control |= NVME_RW_FUA;
@@ -1022,7 +1023,21 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 				return BLK_STS_NOTSUPP;
 			control |= NVME_RW_PRINFO_PRACT;
 		}
+		if (req->cmd_flags & REQ_INTEGRITY_OFFLOAD) {
+			type = ns->head->pi_offload_type;
 
+			if (type == BLK_INTEGRITY_OFFLOAD_NONE ||
+			    (type == BLK_INTEGRITY_OFFLOAD_BUF &&
+			     !blk_integrity_rq(req))) {
+				WARN_ON_ONCE(1);
+				return BLK_STS_NOTSUPP;
+			}
+
+			control |= NVME_RW_PRINFO_PRACT |
+				   NVME_RW_PRINFO_PRCHK_GUARD;
+			/* skip redundant processing for offload */
+			goto out;
+		}
 		if (bio_integrity_flagged(req->bio, BIP_CHECK_GUARD))
 			control |= NVME_RW_PRINFO_PRCHK_GUARD;
 		if (bio_integrity_flagged(req->bio, BIP_CHECK_REFTAG)) {
@@ -1037,6 +1052,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		}
 	}
 
+out:
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
 	return 0;
@@ -1846,6 +1862,14 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
 
 	bi->tuple_size = head->ms;
 	bi->pi_offset = info->pi_offset;
+	if (bi->csum_type != BLK_INTEGRITY_CSUM_NONE) {
+		if (head->ms == head->pi_size)
+			bi->offload_type = BLK_INTEGRITY_OFFLOAD_NO_BUF;
+		else
+			bi->offload_type = BLK_INTEGRITY_OFFLOAD_BUF;
+		head->pi_offload_type = bi->offload_type;
+	}
+
 	return true;
 }
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7be92d07430e..add04583b040 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -476,6 +476,7 @@ struct nvme_ns_head {
 	u16			pi_size;
 	u8			pi_type;
 	u8			guard_type;
+	u8			pi_offload_type;
 	struct list_head	entry;
 	struct kref		ref;
 	bool			shared;
-- 
2.25.1



* [RFC 3/3] btrfs: add checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
  2025-01-29 14:02   ` [RFC 1/3] block: add integrity offload Kanchan Joshi
  2025-01-29 14:02   ` [RFC 2/3] nvme: support " Kanchan Joshi
@ 2025-01-29 14:02   ` Kanchan Joshi
  2025-01-29 21:27     ` Qu Wenruo
  2025-01-29 14:55   ` [RFC 0/3] Btrfs " Johannes Thumshirn
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-29 14:02 UTC (permalink / raw)
  To: josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev, Kanchan Joshi

Add new mount option 'datasum_offload'.

When passed:
- Data checksumming at the FS level is disabled.
- Data checksumming at the device level is enabled. This is done by
setting the REQ_INTEGRITY_OFFLOAD flag for data I/O if the underlying
device is capable.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/btrfs/bio.c   | 12 ++++++++++++
 fs/btrfs/fs.h    |  1 +
 fs/btrfs/super.c |  9 +++++++++
 3 files changed, 22 insertions(+)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 7ea6f0b43b95..811d89c64991 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/bio.h>
+#include <linux/blk-integrity.h>
 #include "bio.h"
 #include "ctree.h"
 #include "volumes.h"
@@ -424,6 +425,15 @@ static void btrfs_clone_write_end_io(struct bio *bio)
 	bio_put(bio);
 }
 
+static void btrfs_prep_csum_offload_hw(struct btrfs_device *dev, struct bio *bio)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+
+	if (btrfs_test_opt(dev->fs_info, DATASUM_OFFLOAD) &&
+	    bi && bi->offload_type != BLK_INTEGRITY_OFFLOAD_NONE)
+		bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
+}
+
 static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 {
 	if (!dev || !dev->bdev ||
@@ -435,6 +445,8 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 	}
 
 	bio_set_dev(bio, dev->bdev);
+	if (!(bio->bi_opf & REQ_META))
+		btrfs_prep_csum_offload_hw(dev, bio);
 
 	/*
 	 * For zone append writing, bi_sector must point the beginning of the
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 79a1a3d6f04d..88e493967100 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -228,6 +228,7 @@ enum {
 	BTRFS_MOUNT_NOSPACECACHE		= (1ULL << 30),
 	BTRFS_MOUNT_IGNOREMETACSUMS		= (1ULL << 31),
 	BTRFS_MOUNT_IGNORESUPERFLAGS		= (1ULL << 32),
+	BTRFS_MOUNT_DATASUM_OFFLOAD		= (1ULL << 33),
 };
 
 /*
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7dfe5005129a..d0d5b35c2df9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -121,6 +121,7 @@ enum {
 	Opt_treelog,
 	Opt_user_subvol_rm_allowed,
 	Opt_norecovery,
+	Opt_datasum_offload,
 
 	/* Rescue options */
 	Opt_rescue,
@@ -223,6 +224,7 @@ static const struct fs_parameter_spec btrfs_fs_parameters[] = {
 	fsparam_string("compress-force", Opt_compress_force_type),
 	fsparam_flag_no("datacow", Opt_datacow),
 	fsparam_flag_no("datasum", Opt_datasum),
+	fsparam_flag_no("datasum_offload", Opt_datasum_offload),
 	fsparam_flag("degraded", Opt_degraded),
 	fsparam_string("device", Opt_device),
 	fsparam_flag_no("discard", Opt_discard),
@@ -323,6 +325,10 @@ static int btrfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
 			btrfs_clear_opt(ctx->mount_opt, NODATASUM);
 		}
 		break;
+	case Opt_datasum_offload:
+		btrfs_set_opt(ctx->mount_opt, NODATASUM);
+		btrfs_set_opt(ctx->mount_opt, DATASUM_OFFLOAD);
+		break;
 	case Opt_datacow:
 		if (result.negated) {
 			btrfs_clear_opt(ctx->mount_opt, COMPRESS);
@@ -1057,6 +1063,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",degraded");
 	if (btrfs_test_opt(info, NODATASUM))
 		seq_puts(seq, ",nodatasum");
+	if (btrfs_test_opt(info, DATASUM_OFFLOAD))
+		seq_puts(seq, ",datasum_offload");
 	if (btrfs_test_opt(info, NODATACOW))
 		seq_puts(seq, ",nodatacow");
 	if (btrfs_test_opt(info, NOBARRIER))
@@ -1434,6 +1442,7 @@ static void btrfs_emit_options(struct btrfs_fs_info *info,
 	btrfs_info_if_set(info, old, NODATASUM, "setting nodatasum");
 	btrfs_info_if_set(info, old, DEGRADED, "allowing degraded mounts");
 	btrfs_info_if_set(info, old, NODATASUM, "setting nodatasum");
+	btrfs_info_if_set(info, old, DATASUM_OFFLOAD, "setting datasum offload to the device");
 	btrfs_info_if_set(info, old, SSD, "enabling ssd optimizations");
 	btrfs_info_if_set(info, old, SSD_SPREAD, "using spread ssd allocation scheme");
 	btrfs_info_if_set(info, old, NOBARRIER, "turning off barriers");
-- 
2.25.1



* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (2 preceding siblings ...)
  2025-01-29 14:02   ` [RFC 3/3] btrfs: add checksum offload Kanchan Joshi
@ 2025-01-29 14:55   ` Johannes Thumshirn
  2025-01-31 10:19     ` Kanchan Joshi
  2025-01-29 15:28   ` Keith Busch
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2025-01-29 14:55 UTC (permalink / raw)
  To: Kanchan Joshi, josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
	axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> [...]

Hi Kanchan,

This is an interesting approach on offloading the checksum work. I've 
only had a quick glance over it with a birds eye view and one thing that 
I noticed is, the missing connection of error reporting between the layers.

For instance if we get a checksum error on btrfs we not only report it
in dmesg, but also try to repair the affected sector if we do have a 
data profile with redundancy.

So while this patchset offloads the submission side work of the checksum 
tree to the PI code, I don't see the back-propagation of the errors into 
btrfs and the triggering of the repair code.

I get it's an RFC, but as it is now it essentially breaks functionality 
we rely on. Can you add this part as well so we can evaluate the 
patchset not only from the write but also from the read side.

Byte,
	Johannes


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (3 preceding siblings ...)
  2025-01-29 14:55   ` [RFC 0/3] Btrfs " Johannes Thumshirn
@ 2025-01-29 15:28   ` Keith Busch
  2025-01-29 15:40     ` Christoph Hellwig
  2025-01-29 15:35   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Keith Busch @ 2025-01-29 15:28 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: josef, dsterba, clm, axboe, hch, linux-btrfs, linux-nvme,
	linux-block, gost.dev

On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> There is value in avoiding a Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.

Another potential benefit: if the device does the checksum, then I think
btrfs could avoid the stable page writeback overhead and let the
contents be changeable all the way until the data goes out on the wire.

Though I feel the very specific device format constraints that can
support an offload like this are unfortunate.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (4 preceding siblings ...)
  2025-01-29 15:28   ` Keith Busch
@ 2025-01-29 15:35   ` Christoph Hellwig
  2025-01-30  9:22     ` Kanchan Joshi
  2025-01-29 15:55   ` Mark Harmstone
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-29 15:35 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: josef, dsterba, clm, axboe, kbusch, hch, linux-btrfs, linux-nvme,
	linux-block, gost.dev

On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

That's not quite true.  The block layer automatically adds a PI
payload if none is added by the caller.  The caller can add its own
PI payload, but currently no file system does this - only the block
device fops as of 6.13 and the nvme and scsi targets.  But file systems
can do that, and I have (hacky and outdated) patches wiring this up
in XFS.

Note that the "auto-PI" vs "caller-PI" split isn't very clean
currently, which causes some confusion.  I have a series almost
ready to clean that up a bit.

> There is value in avoiding a Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).

Yes.

> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),

I've skipped over the patches and don't understand what this offload
awareness concept does compared to the file system simply attaching PI
metadata.

> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Not really important for an initial prototype, but incompatible on-disk
format changes like this need feature flags and not just a mount
option.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 15:28   ` Keith Busch
@ 2025-01-29 15:40     ` Christoph Hellwig
  2025-01-29 18:03       ` Keith Busch
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-29 15:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kanchan Joshi, josef, dsterba, clm, axboe, hch, linux-btrfs,
	linux-nvme, linux-block, gost.dev

On Wed, Jan 29, 2025 at 08:28:24AM -0700, Keith Busch wrote:
> On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> > There is value in avoiding a Copy-on-write (COW) checksum tree on
> > a device that can anyway store checksums inline (as part of PI).
> > This would eliminate extra checksum writes/reads, making I/O
> > more CPU-efficient.
> 
> Another potential benefit: if the device does the checksum, then I think
> btrfs could avoid the stable page writeback overhead and let the
> contents be changable all the way until it goes out on the wire.

If the device generates the checksum (aka DIF insert) that problem goes
away.  But we also lose integrity protection over the wire, which would
be unfortunate.

If you feed the checksum / guard tag from the kernel we still have the
same problem.  A while ago I did a prototype where we'd bubble up to the
fs that we had a guard tag error vs just the non-specific "protection
error" and the file system would then retry after copying.  This was
pretty sketchy as the error handling blew up frequently and at least my
version would only work for synchronous I/O and not with aio / io_uring
due to the missing MM context.  But if someone has enough spare cycles
that could be something interesting to look into again.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (5 preceding siblings ...)
  2025-01-29 15:35   ` Christoph Hellwig
@ 2025-01-29 15:55   ` Mark Harmstone
  2025-01-29 19:02   ` Goffredo Baroncelli
  2025-01-30 20:21   ` Martin K. Petersen
  8 siblings, 0 replies; 32+ messages in thread
From: Mark Harmstone @ 2025-01-29 15:55 UTC (permalink / raw)
  To: Kanchan Joshi, josef@toxicpanda.com, dsterba@suse.com,
	Chris Mason, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 29/1/25 14:02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> [...]

There's also checksumming done on the metadata trees, which could be 
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum 
type of "none"? With the logic being that it also zeroes out the csum 
field in the B-tree headers.

Mark


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 15:40     ` Christoph Hellwig
@ 2025-01-29 18:03       ` Keith Busch
  2025-01-30 12:54         ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Keith Busch @ 2025-01-29 18:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, josef, dsterba, clm, axboe, linux-btrfs,
	linux-nvme, linux-block, gost.dev

On Wed, Jan 29, 2025 at 04:40:25PM +0100, Christoph Hellwig wrote:
> > 
> > Another potential benefit: if the device does the checksum, then I think
> > btrfs could avoid the stable page writeback overhead and let the
> > contents be changeable all the way until the data goes out on the wire.
> 
> If the device generates the checksum (aka DIF insert) that problem goes
> away.  But we also lose integrity protection over the wire, which would
> be unfortunate.

If the "wire" is only PCIe, I don't see why it matters. What kind of
wire corruption goes undetected by the protocol's encoding and LCRC yet
would be caught by the host's CRC payload?


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (6 preceding siblings ...)
  2025-01-29 15:55   ` Mark Harmstone
@ 2025-01-29 19:02   ` Goffredo Baroncelli
  2025-01-30  9:33     ` Daniel Vacek
  2025-01-30 20:21   ` Martin K. Petersen
  8 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2025-01-29 19:02 UTC (permalink / raw)
  To: Kanchan Joshi, josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev

On 29/01/2025 15.02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
Maybe this is a stupid question, but if we can (will) avoid storing the
checksum in the FS, what is the advantage of having a COW filesystem?

My understanding is that a COW filesystem is needed mainly to synchronize
csum and data. Am I wrong?

[...]

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC 3/3] btrfs: add checksum offload
  2025-01-29 14:02   ` [RFC 3/3] btrfs: add checksum offload Kanchan Joshi
@ 2025-01-29 21:27     ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2025-01-29 21:27 UTC (permalink / raw)
  To: Kanchan Joshi, josef, dsterba, clm, axboe, kbusch, hch
  Cc: linux-btrfs, linux-nvme, linux-block, gost.dev



On 2025/1/30 00:32, Kanchan Joshi wrote:
> Add new mount option 'datasum_offload'.
> 
> When passed:
> - Data checksumming at the FS level is disabled.
> - Data checksumming at the device level is enabled. This is done by
> setting the REQ_INTEGRITY_OFFLOAD flag for data I/O if the underlying
> device is capable.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>


Just as mentioned by Christoph, the change to the csum tree is an
on-disk format change, which requires an extra incompat flag.
I believe that's fine because this is only a prototype.

But my concern is that the lack of a csum tree has the following problems:

- Requires all devices in the btrfs filesystem to have the same
   capability.
   E.g. you can no longer add a device without the REQ_INTEGRITY_OFFLOAD
   capability.
   This can be a very big problem, especially if one just wants to
   migrate the fs to another device.

- Less versatile compared to the nodatasum flag/mount option.
   The NODATASUM flag can be set on a per-inode basis, but the new
   no-datasum behavior is bound to the hardware storage.

And finally, my question on why to remove the btrfs data checksums.

I understand the device's own checksum is super fast and efficient, but 
that doesn't mean checksums at different layers are mutually exclusive.

Yes, it causes some extra workload, but since it's handled by hardware 
there is no obvious penalty.

On the other hand, it is adding one extra layer of protection on top of 
the existing btrfs checksum.

Thus if the end user fully trusts the hardware's protection, they can 
just use the nodatasum mount option and call it a day.
The benefit you have shown is really just the benefit of the "nodatasum" 
behavior, and that's more or less expected due to the COW overhead.

So I really prefer to let the end user choose what they want.

If they want to fully rely on the hardware's internal checksum, then 
configure the block device to do it, and create a btrfs with nodatasum.

If they want both the btrfs and the hardware checksum, just do it the 
usual way.

Thanks,
Qu

> [...]



* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 15:35   ` Christoph Hellwig
@ 2025-01-30  9:22     ` Kanchan Joshi
  2025-01-30 12:53       ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-30  9:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: josef, dsterba, clm, axboe, kbusch, linux-btrfs, linux-nvme,
	linux-block, gost.dev

On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
>> This patch series: (a) adds checksum offload awareness to the
>> block layer (patch #1),
> I've skipped over the patches and don't understand what this offload
> awareness concept does compared to the file system simply attaching PI
> metadata.

The difference is that the FS does not have to attach any PI for the
offload.

Offload is about the host doing as little as possible, and the closest 
we get there is by setting the PRACT bit.

For a pure offload, attaching PI is not really needed, neither by the
FS nor by the block layer.
When the device has the "ms == pi_size" format, we only need to send the
I/O with PRACT set, and the device takes care of attaching the integrity
buffer and generating/verifying the checksum.
This is abstracted as 'offload type 1' in this series.

For the other format, "ms > pi_size", we also set PRACT, but an
integrity buffer needs to be passed as well. This is abstracted as
'offload type 2'.
It is still an offload, as the checksum processing is done only by the
device.

Block-layer auto-PI is a good place for this because all the above
details are common and remain abstracted, while filesystems only need to
decide whether they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use
the facility.
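
In code terms, the split described above boils down to this (an
illustrative summary, not a patch):

/*
 * ms == pi_size  ->  BLK_INTEGRITY_OFFLOAD_NO_BUF:
 *	the host only sets PRACT; no integrity buffer is transferred,
 *	and the device inserts/strips the PI internally.
 *
 * ms >  pi_size  ->  BLK_INTEGRITY_OFFLOAD_BUF:
 *	the host sets PRACT and still transfers a zeroed metadata buffer
 *	(allocated in bio_integrity_offload()); the device fills and
 *	checks the PI portion of it.
 */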


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 19:02   ` Goffredo Baroncelli
@ 2025-01-30  9:33     ` Daniel Vacek
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Vacek @ 2025-01-30  9:33 UTC (permalink / raw)
  To: kreijack
  Cc: Kanchan Joshi, josef, dsterba, clm, axboe, kbusch, hch,
	linux-btrfs, linux-nvme, linux-block, gost.dev

On Wed, 29 Jan 2025 at 20:04, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 29/01/2025 15.02, Kanchan Joshi wrote:
> > TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> > SSD for data checksumming.
> >
> > Now, the longer version for why/how.
> >
> > End-to-end data protection (E2EDP)-capable drives require the transfer
> > of integrity metadata (PI).
> > This is currently handled by the block layer, without filesystem
> > involvement/awareness.
> > The block layer attaches the metadata buffer, generates the checksum
> > (and reftag) for write I/O, and verifies it during read I/O.
> >
> Maybe this is a stupid question, but if we can (will) avoid storing the
> checksum in the FS, what is the advantage of having a COW filesystem?

I was wondering the same. My understanding is the checksums are there
primarily to protect against untrusted devices or data transfers over
the line. And now suddenly we're going to trust them? What's even the
point then?

Is there any other advantage of having these checksums that I may be missing?
Perhaps logic code bugs accidentally corrupting the data? Is the
stored payload even ever touched? That would not be wanted, right?
Or perhaps data mangled on the storage by an attacker?

> My understanding is that a COW filesystem is needed mainly to synchronize
> csum and data. Am I wrong?
>
> [...]
>
> BR
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-30  9:22     ` Kanchan Joshi
@ 2025-01-30 12:53       ` Christoph Hellwig
  2025-01-31 10:29         ` Kanchan Joshi
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-30 12:53 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, josef, dsterba, clm, axboe, kbusch,
	linux-btrfs, linux-nvme, linux-block, gost.dev

On Thu, Jan 30, 2025 at 02:52:23PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
> >> This patch series: (a) adds checksum offload awareness to the
> >> block layer (patch #1),
> > I've skipped over the patches and don't understand what this offload
> > awareness concept does compared to the file system simply attaching PI
> > metadata.
> 
> The difference is that the FS does not have to attach any PI for the
> offload.
> 
> Offload is about the host doing as little as possible, and the closest 
> we get there is by setting the PRACT bit.

But that doesn't actually work.  The file system needs to be able
to verify the checksum for failing over to other mirrors, repair,
etc.  Also if you trust the device to get things right you do not
need to use PI at all - SSDs or hard drives that support PI generally
use PI internally anyway, and PRACT just means you treat a format
with PI like one without.  In other words - no need for an offload
here, you might as well just trust the device if you're not doing
end to end protection.



* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 18:03       ` Keith Busch
@ 2025-01-30 12:54         ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-30 12:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, josef, dsterba, clm, axboe,
	linux-btrfs, linux-nvme, linux-block, gost.dev

On Wed, Jan 29, 2025 at 11:03:36AM -0700, Keith Busch wrote:
> > away.  But we also lose integrity protection over the wire, which would
> > be unfortunate.
> 
> If the "wire" is only PCIe, I don't see why it matters. What kind of
> wire corruption goes undetected by the protocol's encoding and LCRC yet
> would be caught by the host's CRC payload?

The "wire" could be anything.  And includes a little more than than
than the wire, like the entire host side driver stack and the device
data path between the phy and wherever in the stack the PI insert/strip
accelerator sits.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:02 ` [RFC 0/3] Btrfs checksum offload Kanchan Joshi
                     ` (7 preceding siblings ...)
  2025-01-29 19:02   ` Goffredo Baroncelli
@ 2025-01-30 20:21   ` Martin K. Petersen
  2025-01-31  7:44     ` Christoph Hellwig
  2025-02-03 13:24     ` Kanchan Joshi
  8 siblings, 2 replies; 32+ messages in thread
From: Martin K. Petersen @ 2025-01-30 20:21 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: josef, dsterba, clm, axboe, kbusch, hch, linux-btrfs, linux-nvme,
	linux-block, gost.dev


Hi Kanchan!

> There is value in avoiding a Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, cross
the storage fabric, and arrive at the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI wrt. to the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was generated
N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is to protect against problems that happen between
data creation time and the data being persisted to media. Once the ECC
has been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.

At the time T10 PI was designed spinning rust was the only game in town.
And nobody was willing to take the performance hit of having to seek
twice per I/O to store PI separately from the data. And while schemes
involving sending all the PI ahead of the data were entertained, they
never came to fruition. Storing 512+8 in the same sector was a necessity
in the context of SCSI drives, not a desired behavior. Addressing that
in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAR may be more important than extending
the protection envelope for the data, and that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.
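
(Roughly: for random corruption of a block, a 16-bit guard tag leaves
about a 2^-16 ~= 1.5e-5 chance of an undetected error, versus about
2^-32 ~= 2.3e-10 for a 32-bit CRC.)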

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-30 20:21   ` Martin K. Petersen
@ 2025-01-31  7:44     ` Christoph Hellwig
  2025-02-03 19:31       ` Martin K. Petersen
  2025-02-03 13:24     ` Kanchan Joshi
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-31  7:44 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Kanchan Joshi, josef, dsterba, clm, axboe, kbusch, hch,
	linux-btrfs, linux-nvme, linux-block, gost.dev

On Thu, Jan 30, 2025 at 03:21:45PM -0500, Martin K. Petersen wrote:
> So to me, it's a highly desirable feature that btrfs stores its
> checksums elsewhere on media.

Except for SSDs it generally doesn't - the fact that they are written
at the same time means there is a very high chance they will end up
on media together for traditional SSD designs.  This might be different
when explicitly using some form of data placement scheme, and SSD
vendors might be able to place PI/metadata differently under the hood
when a big enough customer asks for it (they might not be very
happy about the request :)).


> But that's obviously a trade-off a user
> can make. In some cases media WAR may be more important than extending
> the protection envelope for the data, that's OK. I would suggest you
> look at using CRC32C given the intended 4KB block use case, though,
> because the 16-bit CRC isn't fantastic for large blocks.

One thing that I did implement for my XFS hack/prototype is the ability
to store a crc32c in the non-PI metadata supported by nvme.  This allows
for low-overhead data checksumming as you don't need a separate data
structure to track where the checksums for a data block are located, and
it doesn't require out-of-place writes.  It doesn't provide a ref tag
equivalent or device-side checking of the guard tag, unfortunately.

I never could come up with a good use of the app_tag for file systems,
so not wasting space for that is actually a good thing.
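
A rough sketch of that scheme, reusing the helpers that patch #1 also
builds on (the function name and error handling here are illustrative
only):

static int fs_attach_csum_meta(struct bio *bio, void *meta, unsigned int len)
{
	struct bio_integrity_payload *bip;

	/* One segment of caller-provided (kmalloc'ed) metadata that
	 * holds a crc32c per data block.
	 */
	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip))
		return PTR_ERR(bip);

	if (bio_integrity_add_page(bio, virt_to_page(meta), len,
				   offset_in_page(meta)) < len)
		return -ENOMEM;
	return 0;
}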


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-29 14:55   ` [RFC 0/3] Btrfs " Johannes Thumshirn
@ 2025-01-31 10:19     ` Kanchan Joshi
  2025-01-31 10:29       ` Johannes Thumshirn
  0 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-31 10:19 UTC (permalink / raw)
  To: Johannes Thumshirn, josef@toxicpanda.com, dsterba@suse.com,
	clm@fb.com, axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
> For instance if we get a checksum error on btrfs we not only report in
> in dmesg, but also try to repair the affected sector if we do have a
> data profile with redundancy.
> 
> So while this patchset offloads the submission side work of the checksum
> tree to the PI code, I don't see the back-propagation of the errors into
> btrfs and the triggering of the repair code.
> 
> I get it's a RFC, but as it is now it essentially breaks functionality
> we rely on. Can you add this part as well so we can evaluate the
> patchset not only from the write but also from the read side.

I tested the series for reads, but only the success cases. In this case 
checksum generation/verification happens only within the device. It was 
slightly tricky to inject an error, so I skipped that.

Since separate checksum I/Os are omitted, this is about handling the 
error condition in the data read I/O path itself. I have not yet checked 
whether the repair code triggers when Btrfs is working with the existing 
'nodatasum' mount option. But I get your point that this needs to be 
handled.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-31 10:19     ` Kanchan Joshi
@ 2025-01-31 10:29       ` Johannes Thumshirn
  2025-02-03 13:25         ` Kanchan Joshi
  0 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2025-01-31 10:29 UTC (permalink / raw)
  To: Kanchan Joshi, josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
	axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 31.01.25 11:19, Kanchan Joshi wrote:
> On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
>> For instance if we get a checksum error on btrfs we not only report in
>> in dmesg, but also try to repair the affected sector if we do have a
>> data profile with redundancy.
>>
>> So while this patchset offloads the submission side work of the checksum
>> tree to the PI code, I don't see the back-propagation of the errors into
>> btrfs and the triggering of the repair code.
>>
>> I get it's a RFC, but as it is now it essentially breaks functionality
>> we rely on. Can you add this part as well so we can evaluate the
>> patchset not only from the write but also from the read side.
> 
> I tested the series for read, but only the success cases. In this case
> checksum generation/verification happens only within the device. It was
> slightly tricky to inject an error and I skipped that.
> 
> Since separate checksum I/Os are omitted, this is about handling the
> error condition in data read I/O path itself. I have not yet checked if
> repair code triggers when Btrfs is working with existing 'nodatasum'
> mount option. But I get your point that this needs to be handled.
> 

So, as of now, this disables one of the most useful features of the FS, 
repairing bad data. The whole "story" for the RAID code in the FS is 
built around the assumption that we can repair bad data, unlike, say, 
MD RAID.




* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-30 12:53       ` Christoph Hellwig
@ 2025-01-31 10:29         ` Kanchan Joshi
  2025-01-31 10:42           ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-01-31 10:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: josef, dsterba, clm, axboe, kbusch, linux-btrfs, linux-nvme,
	linux-block, gost.dev

On 1/30/2025 6:23 PM, Christoph Hellwig wrote:
>> Difference is that FS does not have to attach any PI for offload.
>>
>> Offload is about the Host doing as little as possible, and the closest
>> we get there is by setting PRACT bit.
> But that doesn't actually work.  The file system needs to be able
> to verify the checksum for failing over to other mirrors, repair,
> etc.

Right. That sounds like reusing the existing code upon detecting a 
checksum-specific failure. So maybe that can be handled if this gets 
any further.

> Also if you trust the device to get things right you do not
> need to use PI at all - SSDs or hard drives that support PI generally
> use PI internally anyway, and PRACT just means you treat a format
> with PI like one without.  In other words - no need for an offload
> here, you might as well just trust the device if you're not doing
> end to end protection.

I agree that the device may be implementing internal E2E, but that's not 
a contract to be honored. The host can't trust it until the device says 
so explicitly.

Since Btrfs already has the 'nodatasum' mount option, I assume there are 
deployments that prefer to optimize checksumming cost. I thought Btrfs 
would be more comfortable giving up its own checksum tree, and would 
exercise this more often, if it knows for certain that the device is 
also capable.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-31 10:29         ` Kanchan Joshi
@ 2025-01-31 10:42           ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-01-31 10:42 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, josef, dsterba, clm, axboe, kbusch,
	linux-btrfs, linux-nvme, linux-block, gost.dev

On Fri, Jan 31, 2025 at 03:59:17PM +0530, Kanchan Joshi wrote:
> > Also if you trust the device to get things right you do not
> > need to use PI at all - SSDs or hard drives that support PI generally
> > use PI internally anyway, and PRACT just means you treat a format
> > with PI like one without.  In other words - no need for an offload
> > here, you might as well just trust the device if you're not doing
> > end to end protection.
> 
> I agree that the device may be implementing internal E2E, but that's not a 
> contract to be honored. The host can't trust it until the device says so explicitly.

But you're not doing end to end protection.  Once you set PRACT you
basically tell the device to pretend the LU/namespace was formatted
without protection information.  The fact that you even need the
flag has always been very confusing to me - the logical way to expose
PI would have been to make PRACT the default and require a flag to
actually look at the passed information.  I suspect for SCSI this
is a result of shoe-horning DIF and DIX into existing infrastructure,
and NVMe then blindly copied much of that without thinking about how it
fits into an architecture without a separate HBA and without all
the legacy concerns.
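
Roughly, at the driver level it looks like this minimal sketch (loosely
modeled on nvme_setup_rw(); the helper name is made up, the
NVME_RW_PRINFO_* flags are the real ones):

/*
 * If the host attaches no PI buffer, PRACT makes the controller
 * generate PI on writes and verify/strip it on reads -- the host sees
 * the namespace as if it were formatted without protection information.
 */
static void sketch_setup_pi(struct request *req, u16 *control)
{
	if (!blk_integrity_rq(req)) {
		/* no host PI buffer: device inserts/strips PI itself */
		*control |= NVME_RW_PRINFO_PRACT;
	} else {
		/* host passed PI: have the device check it end to end */
		*control |= NVME_RW_PRINFO_PRCHK_GUARD |
			    NVME_RW_PRINFO_PRCHK_REF;
	}
}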



* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-30 20:21   ` Martin K. Petersen
  2025-01-31  7:44     ` Christoph Hellwig
@ 2025-02-03 13:24     ` Kanchan Joshi
  1 sibling, 0 replies; 32+ messages in thread
From: Kanchan Joshi @ 2025-02-03 13:24 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: josef, dsterba, clm, axboe, kbusch, hch, linux-btrfs, linux-nvme,
	linux-block, gost.dev

On 1/31/2025 1:51 AM, Martin K. Petersen wrote:
>> There is value in avoiding Copy-on-write (COW) checksum tree on a
>> device that can anyway store checksums inline (as part of PI). This
>> would eliminate extra checksum writes/reads, making I/O more
>> CPU-efficient. Additionally, usable space would increase, and write
>> amplification, both in Btrfs and eventually at the device level, would
>> be reduced [*].
> I have a couple of observations.
> 
> First of all, there is no inherent benefit to PI if it is generated at
> the same time as the ECC. The ECC is usually far superior when it comes
> to protecting data at rest. And you'll still get an error if uncorrected
> corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
> any benefits in my book.

I fully agree, there is no benefit if we see it from the E2E use case. 
But for a different use case (i.e., checksum offload), 
BLK_INTEGRITY_OFFLOAD_NO_BUF is as good as it gets.

[Application -> FS -> Block-layer -> Device]

I understand that E2E gets stronger when the integrity/checksum is 
placed at the origin of the data (the application), and each component 
then collaborates in checking it.

But I am not doing E2E here. Call it abuse or creative, but I am using 
the same E2E-capable device to do checksum offload. If the device had 
exposed checksum offload in a different way, I would have taken that 
route rather than the E2E one.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-31 10:29       ` Johannes Thumshirn
@ 2025-02-03 13:25         ` Kanchan Joshi
  2025-02-03 13:40           ` Johannes Thumshirn
  0 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-02-03 13:25 UTC (permalink / raw)
  To: Johannes Thumshirn, josef@toxicpanda.com, dsterba@suse.com,
	clm@fb.com, axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>> I tested the series for reads, but only the success cases; in that case
>> checksum generation/verification happens only within the device. It was
>> slightly tricky to inject an error, so I skipped that.
>>
>> Since separate checksum I/Os are omitted, this is about handling the
>> error condition in the data read I/O path itself. I have not yet checked
>> whether the repair code triggers when Btrfs is working with the existing
>> 'nodatasum' mount option. But I get your point that this needs to be handled.
>>
> So this as of now disables one of the most useful features of the FS,
> repairing bad data. The whole "story" for the RAID code in the FS is
> built around the assumption that we can repair bad data, unlike, say,
> MD RAID.

Does repairing bad data work when Btrfs is mounted with NODATASUM?
If not, shouldn't the proposed option be seen as an upgrade over that?

You may already know, but I do not know how Btrfs currently decides to 
apply NODATASUM. With these patches it becomes possible to know whether 
checksum offload is supported by the underlying hardware. And that makes 
it possible to apply NODATASUM in an informed manner.
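
As a rough illustration (a sketch only; it assumes the RFC exposes
BLK_INTEGRITY_OFFLOAD_NO_BUF as a blk_integrity flag, and the helper is
hypothetical), a mount-time probe could look like:

static bool csum_offload_usable(struct block_device *bdev)
{
	struct blk_integrity *bi = blk_get_integrity(bdev->bd_disk);

	/* drop the csum tree only if the device advertises the offload */
	return bi && (bi->flags & BLK_INTEGRITY_OFFLOAD_NO_BUF);
}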

I have not reduced anything, but added an opt-in for deployments that 
may have a different definition of what is useful. Not all planets are 
Mars. The cost of the checksum tree will differ (say, on QLC vs SLC).


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-03 13:25         ` Kanchan Joshi
@ 2025-02-03 13:40           ` Johannes Thumshirn
  2025-02-03 14:03             ` Kanchan Joshi
  0 siblings, 1 reply; 32+ messages in thread
From: Johannes Thumshirn @ 2025-02-03 13:40 UTC (permalink / raw)
  To: Kanchan Joshi, josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
	axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 03.02.25 14:25, Kanchan Joshi wrote:
> On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>>> I tested the series for read, but only the success cases. In this case
>>> checksum generation/verification happens only within the device. It was
>>> slightly tricky to inject an error and I skipped that.
>>>
>>> Since separate checksum I/Os are omitted, this is about handling the
>>> error condition in data read I/O path itself. I have not yet checked if
>>> repair code triggers when Btrfs is working with existing 'nodatasum'
>>> mount option. But I get your point that this needs to be handled.
>>>
>> So this as of now disables one of the most useful features of the FS,
>> repairing bad data. The whole "story" for the RAID code in the FS is
>> build around this assumption, that we can repair bad data, unlike say MD
>> RAID.
> 
> Does repairing bad data work when Btrfs is mounted with NODATASUM?
> If not, shouldn't the proposed option be seen as an upgrade over that?
> 
> You may already know, but I do not know how Btrfs currently decides to
> apply NODATASUM. With these patches it becomes possible to know whether
> checksum offload is supported by the underlying hardware. And that makes
> it possible to apply NODATASUM in an informed manner.

NODATASUM is something I personally would only ever turn on for VM images, 
so we don't have the stable page vs checksums f*up (see Qu's answers).

> I have not reduced anything, but added an opt-in for deployments that
> may have a different definition of what is useful. Not all planets are
> Mars. The cost of the checksum tree will differ (say, on QLC vs SLC).
> 

But NODATASUM isn't something that is actively recommended unless you 
know what you're doing. I thought of your patches as an offload of the 
checksum tree to the T10-PI extended sector format, which I personally 
like. And it's not that hard to do that.

If it's about getting people to use NODATASUM, I'm really starting to 
dislike the patchset. Also, NODATASUM implies deactivated compression.

So to me this sounds like a good fit for one or two specific scenarios, 
but not all that helpful for the broader picture, which it could have 
been.


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-03 13:40           ` Johannes Thumshirn
@ 2025-02-03 14:03             ` Kanchan Joshi
  2025-02-03 14:41               ` Johannes Thumshirn
  0 siblings, 1 reply; 32+ messages in thread
From: Kanchan Joshi @ 2025-02-03 14:03 UTC (permalink / raw)
  To: Johannes Thumshirn, josef@toxicpanda.com, dsterba@suse.com,
	clm@fb.com, axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
> But NODATASUM isn't something that is actively recommended unless you
> know what you're doing. I thought of your patches as an offload of the
> checksum tree to the T10-PI extended sector format

You thought right; the patches do the "offload to T10-PI format" part. 
That part is generic for any upper-layer user; one only needs to send 
the REQ_INTEGRITY_OFFLOAD flag for that.
And for the other part, "suppress data csums at the Btrfs level", I 
thought of using NODATASUM.
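
From an upper layer's point of view, the opt-in is just this (a minimal
sketch; the submission helper is made up, REQ_INTEGRITY_OFFLOAD is the
flag proposed in patch #1):

static void submit_offloaded_io(struct bio *bio)
{
	/*
	 * Ask the block layer to let the device generate/verify the
	 * checksum (the PRACT path) instead of attaching a PI buffer.
	 */
	bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
	submit_bio(bio);
}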


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-03 14:03             ` Kanchan Joshi
@ 2025-02-03 14:41               ` Johannes Thumshirn
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Thumshirn @ 2025-02-03 14:41 UTC (permalink / raw)
  To: Kanchan Joshi, josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
	axboe@kernel.dk, kbusch@kernel.org, hch
  Cc: linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, gost.dev@samsung.com

On 03.02.25 15:04, Kanchan Joshi wrote:
> On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
>> But NODATASUM isn't something that is actively recommended unless you
>> know what you're doing. I thought of your patches as an offload of the
>> checksum tree to the T10-PI extended sector format
> 
> You thought right; the patches do the "offload to T10-PI format" part. That
> part is generic for any upper-layer user; one only needs to send the
> REQ_INTEGRITY_OFFLOAD flag for that.

That one I think is nice.

> And for the other part, "suppress data csums at the Btrfs level", I
> thought of using NODATASUM.
> 

That one I hate with a passion, IFF done the way it's done in this patchset.

For the generation/write side you can go that way. But it needs to be 
wired up so that btrfs can be the consumer (is this the correct term 
here?) of the checksums as well. Luckily 'btrfs_check_read_bio()' 
treats bio->bi_status != BLK_STS_OK the same way as 
!btrfs_data_csum_ok(), so that part looks safe to me. Also, 
btrfs_data_csum_ok() does an early return if there's no checksum, so I 
think we're good there.

In order to make scrub (and RAID5) work, you'd need a fallback for 
'btrfs_lookup_csums_bitmap()' that returns the sector checksum from the 
bio integrity layer instead of looking at the checksum tree.
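
To spell out the read-side shape (a self-contained sketch with stub
types, not the actual fs/btrfs/bio.c code; only the decision logic is
meant to be faithful, the types and fields are stand-ins):

#include <stdbool.h>

typedef int blk_status_t;
#define BLK_STS_OK 0

struct stub_read_bio {
	blk_status_t bi_status;	/* set by the block layer, e.g. on a PI error */
	bool have_csum;		/* does btrfs hold a csum for this sector? */
	bool csum_matches;
};

/* mirrors btrfs_data_csum_ok(): early-returns true when no csum exists */
static bool data_csum_ok(const struct stub_read_bio *rb)
{
	if (!rb->have_csum)
		return true;
	return rb->csum_matches;
}

static bool needs_repair(const struct stub_read_bio *rb)
{
	/* a device-reported PI failure is treated like a csum mismatch */
	return rb->bi_status != BLK_STS_OK || !data_csum_ok(rb);
}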


* Re: [RFC 0/3] Btrfs checksum offload
  2025-01-31  7:44     ` Christoph Hellwig
@ 2025-02-03 19:31       ` Martin K. Petersen
  2025-02-04  5:12         ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Martin K. Petersen @ 2025-02-03 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, josef, dsterba, clm, axboe,
	kbusch, linux-btrfs, linux-nvme, linux-block, gost.dev


Christoph,

> Except for SSDs it generally doesn't - the fact that they are written
> at the same time means there is a very high chance they will end up
> on media together for traditional SSD designs.
>
> This might be different when explicitly using some form of data
> placement scheme, and SSD vendors might be able to place PI/metadata
> differently under the hood when a big enough customer asks for it
> (they might not be very happy about the request :)).

There was a multi-vendor effort many years ago (first gen SSD era) to
make vendors guarantee that metadata and data would be written to
different channels. But performance got in the way, obviously.

> One thing that I did implement for my XFS hack/prototype is the ability
> to store a crc32c in the non-PI metadata supported by NVMe.  This allows
> for low overhead data checksumming as you don't need a separate data
> structure to track where the checksums for a data block are located, and
> it doesn't require out-of-place writes.  It doesn't provide a ref tag
> equivalent or device-side checking of the guard tag, unfortunately.

That sounds fine. Again, I don't have a problem with having the ability
to choose whether checksum placement or WAF is more important for a
given application.
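
For reference, the host-computed variant amounts to something like this
sketch (the helper is hypothetical; crc32c() is the kernel's):

#include <linux/crc32c.h>
#include <linux/string.h>

/*
 * Compute a plain CRC32C per data block and store it at the front of
 * that block's metadata slot. No guard/ref-tag semantics, so the
 * device checks nothing -- but there is no separate csum tree and no
 * out-of-place csum writes either.
 */
static void fill_block_csums(const u8 *data, u8 *meta, unsigned int blocks,
			     unsigned int block_size, unsigned int meta_size)
{
	unsigned int i;

	for (i = 0; i < blocks; i++) {
		u32 csum = crc32c(~0, data + i * block_size, block_size);

		memcpy(meta + i * meta_size, &csum, sizeof(csum));
	}
}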

> I never could come up with a good use of the app_tag for file systems,
> so not wasting space for that is actually a good thing.

I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
that would be a reasonable compromise between space and utility. But we
can't do that because of the app tag escape. We're essentially wasting 2
bytes per block to store a single bit flag.

In general I think 4096+16 is a reasonable format going forward, with
either CRC32C or CRC64 plus the full LBA as ref tag.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-03 19:31       ` Martin K. Petersen
@ 2025-02-04  5:12         ` Christoph Hellwig
  2025-02-04 12:52           ` Martin K. Petersen
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-02-04  5:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Kanchan Joshi, josef, dsterba, clm, axboe,
	kbusch, linux-btrfs, linux-nvme, linux-block, gost.dev

On Mon, Feb 03, 2025 at 02:31:13PM -0500, Martin K. Petersen wrote:
> > I never could come up with a good use of the app_tag for file systems,
> > so not wasting space for that is actually a good thing.
> 
> I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
> that would be a reasonable compromise between space and utility.

Agreed.

> But we
> can't do that because of the app tag escape. We're essentially wasting 2
> bytes per block to store a single bit flag.

Well, what do we actually need the app tag escape for except for
historical precedent?  In NVMe the app tag escape is an option for
deallocated blocks, but it's only one of the options, the other being
a synthetic guard/ref tag with the expected value.

> In general I think 4096+16 is a reasonable format going forward, with
> either CRC32C or CRC64 plus the full LBA as ref tag.

That would also work fine.  NVMe supports a 4-byte CRC32C guard + 2-byte
app tag + 10-byte ref/storage tag format and an 8-byte CRC64 guard +
2-byte app tag + 6-byte ref/storage tag format, although Linux only
supports the latter so far.
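
As structs, the two 16-byte layouts look like this (the CRC64 one
matches the kernel's struct crc64_pi_tuple in include/linux/t10-pi.h;
the CRC32C one is a sketch of the format Linux does not wire up yet):

#include <linux/types.h>

struct pi_16b_crc64 {		/* 64b Guard PI, as in crc64_pi_tuple */
	__be64 guard_tag;	/* CRC64 over the data block */
	__be16 app_tag;		/* including the 0xffff escape value */
	__u8   ref_tag[6];	/* storage + reference tag */
};

struct pi_16b_crc32c {		/* 32b Guard PI (sketch, not in Linux) */
	__be32 guard_tag;	/* CRC32C over the data block */
	__be16 app_tag;
	__u8   ref_tag[10];	/* storage + reference tag */
};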


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-04  5:12         ` Christoph Hellwig
@ 2025-02-04 12:52           ` Martin K. Petersen
  2025-02-04 13:49             ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Martin K. Petersen @ 2025-02-04 12:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, josef, dsterba, clm, axboe,
	kbusch, linux-btrfs, linux-nvme, linux-block, gost.dev


Christoph,

> Well, what do we actually need the app tag escape for except for
> historical precedent?  In NVMe the app tag escape is an option for
> deallocated blocks, but it's only one of the options, the other being
> a synthetic guard/ref tag with the expected value.

I have been told that some arrays use it to disable PI when writing the
RAID parity blocks. I guess that makes sense if the array firmware is
mixing data and parity blocks in a single write operation. For my test
tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

And obviously the original reason for the initialization pattern escape
made sense on a disk drive that had no FTL to track whether a block was
in a written state or not. But I really don't think any of this is
useful in a modern SSD or even array context.

>> In general I think 4096+16 is a reasonable format going forward, with
>> either CRC32C or CRC64 plus the full LBA as ref tag.
>
> That would also work fine.  NVMe supports a 4-byte CRC32C guard + 2-byte
> app tag + 10-byte ref/storage tag format and an 8-byte CRC64 guard +
> 2-byte app tag + 6-byte ref/storage tag format, although Linux only
> supports the latter so far.

Yep, the CRC32C variant should be easy to wire up. I've thought about
the storage tag but haven't really come up with a good use case. It's
essentially the same situation as with the app tag.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-04 12:52           ` Martin K. Petersen
@ 2025-02-04 13:49             ` Christoph Hellwig
  2025-02-05  2:31               ` Martin K. Petersen
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-02-04 13:49 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Kanchan Joshi, josef, dsterba, clm, axboe,
	kbusch, linux-btrfs, linux-nvme, linux-block, gost.dev

On Tue, Feb 04, 2025 at 07:52:38AM -0500, Martin K. Petersen wrote:
> I have been told that some arrays use it to disable PI when writing the
> RAID parity blocks. I guess that makes sense if the array firmware is
> mixing data and parity blocks in a single write operation. For my test
> tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

Hmm, why would you disable PI for parity blocks?  But yes, outside of
Linux there might be uses.  Just thinking of a "perfect" format for
our needs.

> >
> > That would also work fine.  NVMe supports a 4-byte CRC32C guard + 2-byte
> > app tag + 10-byte ref/storage tag format and an 8-byte CRC64 guard +
> > 2-byte app tag + 6-byte ref/storage tag format, although Linux only
> > supports the latter so far.
> 
> Yep, the CRC32C variant should be easy to wire up. I've thought about
> the storage tag but haven't really come up with a good use case. It's
> essentially the same situation as with the app tag.

Exactly, it's an app tag by other means.



* Re: [RFC 0/3] Btrfs checksum offload
  2025-02-04 13:49             ` Christoph Hellwig
@ 2025-02-05  2:31               ` Martin K. Petersen
  0 siblings, 0 replies; 32+ messages in thread
From: Martin K. Petersen @ 2025-02-05  2:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, josef, dsterba, clm, axboe,
	kbusch, linux-btrfs, linux-nvme, linux-block, gost.dev


Christoph,

> Hmm, why would you disable PI for parity blocks?

Some devices calculate P and Q on the data portion of the block only and
store valid PI. Others calculate P and Q for the entire block, including
metadata.

-- 
Martin K. Petersen	Oracle Linux Engineering

