* [LSF/MM/BPF TOPIC] File system checksum offload
[not found] <CGME20250130092400epcas5p1a3a9d899583e9502ed45fe500ae8a824@epcas5p1.samsung.com>
@ 2025-01-30 9:15 ` Kanchan Joshi
2025-01-30 14:28 ` Theodore Ts'o
0 siblings, 1 reply; 27+ messages in thread
From: Kanchan Joshi @ 2025-01-30 9:15 UTC (permalink / raw)
To: lsf-pc
Cc: linux-btrfs, linux-fsdevel, linux-nvme, linux-block, josef,
Kanchan Joshi
Hi All,
I would like to propose a discussion on employing checksum offload in
filesystems.
It would be good to co-locate this with the storage track, as the
finer details lie in the block layer and NVMe driver.
For Btrfs, I had a fleeting chat with Josef during the last LSFMM.
It seemed there would be value in optimizing the writes induced by the
separate checksum tree.
Anuj and I have developed an RFC. This may help us have a clearer
discussion and decide the path forward.
https://lore.kernel.org/linux-block/20250129140207.22718-1-joshi.k@samsung.com/
The proposed RFC maintains a generic infrastructure, allowing other
filesystems to adopt it easily.
XFS/Ext4 have native checksumming for metadata but not for data.
With this approach, they could just instruct the SSD to checksum the
data.
But I am unsure if there are FS-specific trade-offs. Maybe that can
also be up for discussion.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-30 9:15 ` [LSF/MM/BPF TOPIC] File system checksum offload Kanchan Joshi
@ 2025-01-30 14:28 ` Theodore Ts'o
2025-01-30 20:39 ` [Lsf-pc] " Martin K. Petersen
2025-01-31 13:11 ` Kanchan Joshi
0 siblings, 2 replies; 27+ messages in thread
From: Theodore Ts'o @ 2025-01-30 14:28 UTC (permalink / raw)
To: Kanchan Joshi
Cc: lsf-pc, linux-btrfs, linux-fsdevel, linux-nvme, linux-block,
josef
On Thu, Jan 30, 2025 at 02:45:45PM +0530, Kanchan Joshi wrote:
> I would like to propose a discussion on employing checksum offload in
> filesystems.
> It would be good to co-locate this with the storage track, as the
> finer details lie in the block layer and NVMe driver.
I wouldn't call this "file system offload". Enabling the data
integrity feature or whatever you want to call it is really a block
layer issue. The file system doesn't need to get involved at all.
Indeed, looking at the patch, the only reason why the file system is
getting involved is because (a) you've added a mount option, and (b)
the mount option flips a bit in the bio that gets sent to the block
layer.
But this could also be done by adding a queue specific flag, at which
point the file system doesn't need to be involved at all. Why would
you want to enable the data integrity feature on a per block I/O
basis, if the device supports it?
We can debate whether it should default to on if the device
supports it, or whether it needs to be explicitly enabled. It's true
that relative to not doing checksumming at all, it's not "free".
The CPU has to calculate the checksum, so there are power, CPU, and
memory bandwidth costs. I'd still tend to lean towards defaulting it
to on, so that the user doesn't need to do anything special if they have
hardware that is capable of supporting the data integrity feature.
It would be enlightening to measure the performance and power using
some file system benchmark with and without the block layer data
integrity enabled, with no other changes in the file system. If
there's no difference, then enabling it queue-wide, for any file
system, would be a no-brainer. If we discover that there is a
downside to enabling it, then we might decide how we want to enable
it.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-30 14:28 ` Theodore Ts'o
@ 2025-01-30 20:39 ` Martin K. Petersen
2025-01-31 4:40 ` Theodore Ts'o
2025-01-31 13:11 ` Kanchan Joshi
1 sibling, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2025-01-30 20:39 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Kanchan Joshi, lsf-pc, linux-btrfs, linux-fsdevel, linux-nvme,
linux-block, josef
Ted,
> It's true that relative to not doing checksumming at all, it's not
> "free". The CPU has to calculate the checksum, so there are power,
> CPU, and memory bandwidth costs. I'd still tend to lean towards
> defaulting it to on, so that the user doesn't need to do anything
> special if they have hardware that is capable of supporting the data
> integrity feature.
It already works that way. If a device advertises being
integrity-capable, the block layer will automatically generate
protection information on write and verify received protection
information on read. Leveraging hardware-accelerated CRC calculation if
the CPU is capable (PCLMULQDQ, etc.).
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-30 20:39 ` [Lsf-pc] " Martin K. Petersen
@ 2025-01-31 4:40 ` Theodore Ts'o
2025-01-31 7:07 ` Christoph Hellwig
0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2025-01-31 4:40 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Kanchan Joshi, lsf-pc, linux-btrfs, linux-fsdevel, linux-nvme,
linux-block, josef
On Thu, Jan 30, 2025 at 03:39:20PM -0500, Martin K. Petersen wrote:
> It already works that way. If a device advertises being
> integrity-capable, the block layer will automatically generate
> protection information on write and verify received protection
> information on read. Leveraging hardware-accelerated CRC calculation if
> the CPU is capable (PCLMULQDQ, etc.).
So I'm confused. If that's the case, why do we need Kanchan Joshi's
patch to set some magic bio flag and adding a mount option to btrfs?
- Ted
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-31 4:40 ` Theodore Ts'o
@ 2025-01-31 7:07 ` Christoph Hellwig
0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-01-31 7:07 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Martin K. Petersen, Kanchan Joshi, lsf-pc, linux-btrfs,
linux-fsdevel, linux-nvme, linux-block, josef
On Thu, Jan 30, 2025 at 08:40:15PM -0800, Theodore Ts'o wrote:
> On Thu, Jan 30, 2025 at 03:39:20PM -0500, Martin K. Petersen wrote:
> > It already works that way. If a device advertises being
> > integrity-capable, the block layer will automatically generate
> > protection information on write and verify received protection
> > information on read. Leveraging hardware-accelerated CRC calculation if
> > the CPU is capable (PCLMULQDQ, etc.).
>
> So I'm confused. If that's the case, why do we need Kanchan Joshi's
> patch to set some magic bio flag and adding a mount option to btrfs?
Well, as stated in my reply to the series, there really isn't a need.
The case with a buffer duplicates the existing auto-PI in the block
layer, and the no-buffer case reduces the protection envelope compared
to using auto-PI.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-30 14:28 ` Theodore Ts'o
2025-01-30 20:39 ` [Lsf-pc] " Martin K. Petersen
@ 2025-01-31 13:11 ` Kanchan Joshi
2025-02-03 7:47 ` Johannes Thumshirn
1 sibling, 1 reply; 27+ messages in thread
From: Kanchan Joshi @ 2025-01-31 13:11 UTC (permalink / raw)
To: Theodore Ts'o
Cc: lsf-pc, linux-btrfs, linux-fsdevel, linux-nvme, linux-block,
josef
On 1/30/2025 7:58 PM, Theodore Ts'o wrote:
> On Thu, Jan 30, 2025 at 02:45:45PM +0530, Kanchan Joshi wrote:
>> I would like to propose a discussion on employing checksum offload in
>> filesystems.
>> It would be good to co-locate this with the storage track, as the
>> finer details lie in the block layer and NVMe driver.
>
> I wouldn't call this "file system offload". Enabling the data
> integrity feature or whatever you want to call it is really a block
> layer issue. The file system doesn't need to get involved at all.
> Indeed, looking the patch, the only reason why the file system is
> getting involved is because (a) you've added a mount option, and (b)
> the mount option flips a bit in the bio that gets sent to the block
> layer.
The mount option was only for the RFC. If everything else gets sorted
out, it would be a matter of choosing whatever interface Btrfs prefers.
> But this could also be done by adding a queue specific flag, at which
> point the file system doesn't need to be involved at all. Why would
> you want to enable the data integrity feature on a per block I/O
> basis, if the device supports it?
Because I thought users (filesystems) would prefer flexibility. Per-I/O
control makes it possible to choose different policies for, say, data
and metadata. Let me outline the differences.
Block-layer auto integrity
- always attaches an integrity payload for each I/O.
- computes the checksum/reftag for each I/O on the CPU, which does not
do justice to the label 'offload'.
The patches make auto-integrity
- attach the integrity buffer only if the device configuration demands it.
- never compute the checksum/reftag at the block layer.
- keep the offload choice at the per-I/O level.
The Btrfs checksum tree is created only for data blocks, so the patches
apply the flag (REQ_INTEGRITY_OFFLOAD) to those, while metadata blocks,
which may be more important, continue to get checksummed at two levels
(block and device).
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-01-31 13:11 ` Kanchan Joshi
@ 2025-02-03 7:47 ` Johannes Thumshirn
2025-02-03 7:56 ` Christoph Hellwig
2025-02-03 13:32 ` Kanchan Joshi
0 siblings, 2 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2025-02-03 7:47 UTC (permalink / raw)
To: Kanchan Joshi, Theodore Ts'o
Cc: lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 31.01.25 14:11, Kanchan Joshi wrote:
> On 1/30/2025 7:58 PM, Theodore Ts'o wrote:
>> On Thu, Jan 30, 2025 at 02:45:45PM +0530, Kanchan Joshi wrote:
>>> I would like to propose a discussion on employing checksum offload in
>>> filesystems.
>>> It would be good to co-locate this with the storage track, as the
>>> finer details lie in the block layer and NVMe driver.
>>
>> I wouldn't call this "file system offload". Enabling the data
>> integrity feature or whatever you want to call it is really a block
>> layer issue. The file system doesn't need to get involved at all.
>> Indeed, looking the patch, the only reason why the file system is
>> getting involved is because (a) you've added a mount option, and (b)
>> the mount option flips a bit in the bio that gets sent to the block
>> layer.
>
> Mount option was only for the RFC. If everything else gets sorted, it
> would be about choosing whatever is liked by the Btrfs.
>> But this could also be done by adding a queue specific flag, at which
>> point the file system doesn't need to be involved at all. Why would
>> you want to enable the data integrity feature on a per block I/O
>> basis, if the device supports it?
>
> Because I thought users (filesystems) would prefer flexibility. Per-IO
> control helps to choose different policy for say data and meta. Let me
> outline the differences.
But data and metadata checksums are handled differently in btrfs. For
data we checksum each block and write it into the checksum tree. For
metadata the checksum is part of the metadata (see 'struct btrfs_header').
> Block-layer auto integrity
> - always attaches integrity-payload for each I/O.
> - it does compute checksum/reftag for each I/O. And this part does not
> do justice to the label 'offload'.
>
> The patches make auto-integrity
> - attach the integrity-buffer only if the device configuration demands.
> - never compute checksum/reftag at the block-layer.
> - keeps the offload choice at per I/O level.
>
> Btrfs checksum tree is created only for data blocks, so the patches
> apply the flag (REQ_INTEGRITY_OFFLOAD) on that. While metadata blocks,
> which maybe more important, continue to get checksummed at two levels
> (block and device).
The thing I don't like about the current RFC patchset is that it breaks
scrub, repair and device error statistics. It's nothing that can't be
solved, though. But as of now it just doesn't make any sense at all to
me. We at least need the FS to look at the BLK_STS_PROTECTION return
value and handle it accordingly in scrub, read repair and statistics.
And that's only for feature parity. I'd also like to see some
performance numbers and reduced-WAF numbers, to tell whether this is
really worth the hassle.
Thanks,
Johannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 7:47 ` Johannes Thumshirn
@ 2025-02-03 7:56 ` Christoph Hellwig
2025-02-03 8:04 ` Johannes Thumshirn
2025-02-03 13:32 ` Kanchan Joshi
1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-02-03 7:56 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On Mon, Feb 03, 2025 at 07:47:53AM +0000, Johannes Thumshirn wrote:
> The thing I don't like about the current RFC patchset is that it breaks
> scrub, repair and device error statistics. It's nothing that can't be
> solved, though. But as of now it just doesn't make any sense at all to
> me. We at least need the FS to look at the BLK_STS_PROTECTION return
> value and handle it accordingly in scrub, read repair and statistics.
>
> And that's only for feature parity. I'd also like to see some
> performance numbers and numbers of reduced WAF, if this is really worth
> the hassle.
If we can store checksums in metadata / extended LBAs, that will help
WAF a lot, and also performance, because you only need one write
instead of two dependent writes, and also just one read.
The checksums in the current PI formats (minus the new ones in NVMe)
aren't that good as Martin pointed out, but the biggest issue really
is that you need hardware that does support metadata or PI. SATA
doesn't support it at all. For NVMe PI support is generally a feature
that is supported by gold plated fully featured enterprise devices
but not the cheaper tiers. I've heard some talk of customers asking
for plain non-PI metadata in certain cheaper tiers, but not much of
that has actually materialized yet. If we ever get at least non-PI
metadata support on cheap NVMe drives the idea of storing checksums
there would become very, very useful.
FYI, I'll post my hacky XFS data checksumming code to show how relatively
simple using the out of band metadata is for file system based
checksumming.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 7:56 ` Christoph Hellwig
@ 2025-02-03 8:04 ` Johannes Thumshirn
2025-02-03 8:06 ` hch
2025-02-03 8:16 ` Qu Wenruo
0 siblings, 2 replies; 27+ messages in thread
From: Johannes Thumshirn @ 2025-02-03 8:04 UTC (permalink / raw)
To: hch@infradead.org
Cc: Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 03.02.25 08:56, Christoph Hellwig wrote:
> On Mon, Feb 03, 2025 at 07:47:53AM +0000, Johannes Thumshirn wrote:
>> The thing I don't like with the current RFC patchset is, it breaks
>> scrub, repair and device error statistics. It's nothing that can't be
>> solved though. But as of now it just doesn't make any sense at all to
>> me. We at least need the FS to look at the BLK_STS_PROTECTION return and
>> handle accordingly in scrub, read repair and statistics.
>>
>> And that's only for feature parity. I'd also like to see some
>> performance numbers and numbers of reduced WAF, if this is really worth
>> the hassle.
>
> If we can store checksums in metadata / extended LBA that will help
> WAF a lot, and also performance because you only need one write
> instead of two dependent writes, and also just one read.
Well, for the WAF part, it'll save us 32 bytes per FS sector (typically
4k) in the btrfs case; that's 32/4096, or ~0.8% of the space.
> The checksums in the current PI formats (minus the new ones in NVMe)
> aren't that good as Martin pointed out, but the biggest issue really
> is that you need hardware that does support metadata or PI. SATA
> doesn't support it at all. For NVMe PI support is generally a feature
> that is supported by gold plated fully featured enterprise devices
> but not the cheaper tiers. I've heard some talks of customers asking
> for plain non-PI metadata in certain cheaper tiers, but not much of
> that has actually materialized yet. If we ever get at least non-PI
> metadata support on cheap NVMe drives the idea of storing checksums
> there would become very, very useful.
>
> FYI, I'll post my hacky XFS data checksumming code to show how relatively
> simple using the out of band metadata is for file system based
> checksumming.
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:04 ` Johannes Thumshirn
@ 2025-02-03 8:06 ` hch
2025-02-03 8:16 ` Qu Wenruo
1 sibling, 0 replies; 27+ messages in thread
From: hch @ 2025-02-03 8:06 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: hch@infradead.org, Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On Mon, Feb 03, 2025 at 08:04:42AM +0000, Johannes Thumshirn wrote:
> Well for the WAF part, it'll save us 32 Bytes per FS sector (typically
> 4k) in the btrfs case, that's ~0.8% of the space.
It saves you from the cascading btree updates. With nocow this could
actually allow btrfs to write with WAF=1 for pure overwrites. But even
with data cow it will reduce a lot of churn, especially for O_SYNC
writes.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:04 ` Johannes Thumshirn
2025-02-03 8:06 ` hch
@ 2025-02-03 8:16 ` Qu Wenruo
2025-02-03 8:26 ` Matthew Wilcox
` (2 more replies)
1 sibling, 3 replies; 27+ messages in thread
From: Qu Wenruo @ 2025-02-03 8:16 UTC (permalink / raw)
To: Johannes Thumshirn, hch@infradead.org
Cc: Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 2025/2/3 18:34, Johannes Thumshirn wrote:
> On 03.02.25 08:56, Christoph Hellwig wrote:
>> On Mon, Feb 03, 2025 at 07:47:53AM +0000, Johannes Thumshirn wrote:
>>> The thing I don't like with the current RFC patchset is, it breaks
>>> scrub, repair and device error statistics. It's nothing that can't be
>>> solved though. But as of now it just doesn't make any sense at all to
>>> me. We at least need the FS to look at the BLK_STS_PROTECTION return and
>>> handle accordingly in scrub, read repair and statistics.
>>>
>>> And that's only for feature parity. I'd also like to see some
>>> performance numbers and numbers of reduced WAF, if this is really worth
>>> the hassle.
>>
>> If we can store checksums in metadata / extended LBA that will help
>> WAF a lot, and also performance because you only need one write
>> instead of two dependent writes, and also just one read.
>
> Well for the WAF part, it'll save us 32 Bytes per FS sector (typically
> 4k) in the btrfs case, that's ~0.8% of the space.
You forgot the csum tree COW part.
Updating the csum tree is pretty COW heavy, and that's going to cause
quite some wear.
Thus, although I do not think the RFC patch makes much sense compared to
the existing NODATASUM mount option, I'm interested in the hardware
csum handling.
>
>> The checksums in the current PI formats (minus the new ones in NVMe)
>> aren't that good as Martin pointed out, but the biggest issue really
>> is that you need hardware that does support metadata or PI. SATA
>> doesn't support it at all. For NVMe PI support is generally a feature
>> that is supported by gold plated fully featured enterprise devices
>> but not the cheaper tiers. I've heard some talks of customers asking
>> for plain non-PI metadata in certain cheaper tiers, but not much of
>> that has actually materialized yet. If we ever get at least non-PI
>> metadata support on cheap NVMe drives the idea of storing checksums
>> there would become very, very useful.
The other pain point of btrfs' data checksums is related to direct I/O
and the content changing mid-flight.
It's pretty common to reproduce: just start a VM with an image on btrfs,
set the VM cache mode to none (i.e. using direct I/O), run XFS/EXT4
inside the VM, and run some fsstress; it should cause btrfs to hit data
csum mismatch false alerts.
The root cause is the content changing during direct I/O: XFS/EXT4 do
not wait for folio writeback before re-dirtying the folio (if
AS_STABLE_WRITES is not set).
That's a valid optimization, but it means the contents can change.
(I know there is AS_STABLE_WRITES, but I'm not sure whether qemu will
pass that flag to virtio block devices inside the VM.)
And with btrfs' checksum calculation happening before the real bio is
submitted, if the contents change after the csum calculation and before
the bio finishes, we will get a csum mismatch.
So if the csum calculation can happen inside the hardware, it will solve
this problem of direct I/O and changing contents.
Thanks,
Qu
>>
>> FYI, I'll post my hacky XFS data checksumming code to show how relatively
>> simple using the out of band metadata is for file system based
>> checksumming.
>>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:16 ` Qu Wenruo
@ 2025-02-03 8:26 ` Matthew Wilcox
2025-02-03 8:30 ` hch
2025-02-03 8:26 ` hch
2025-02-03 13:27 ` Kanchan Joshi
2 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2025-02-03 8:26 UTC (permalink / raw)
To: Qu Wenruo
Cc: Johannes Thumshirn, hch@infradead.org, Kanchan Joshi,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Mon, Feb 03, 2025 at 06:46:49PM +1030, Qu Wenruo wrote:
> It's pretty common to reproduce, just start a VM with an image on btrfs, set
> the VM cache mode to none (aka, using direct IO), and run XFS/EXT4 inside
> the VM, run some fsstress it should cause btrfs to hit data csum mismatch
> false alerts.
>
> The root cause is the content change during direct IO, and XFS/EXT4 doesn't
> wait for folio writeback before dirtying the folio (if no AS_STABLE_WRITES
> set).
> That's a valid optimization, but that will cause contents change.
XFS honours the bdev flag:
static inline void xfs_update_stable_writes(struct xfs_inode *ip)
{
if (bdev_stable_writes(xfs_inode_buftarg(ip)->bt_bdev))
mapping_set_stable_writes(VFS_I(ip)->i_mapping);
else
mapping_clear_stable_writes(VFS_I(ip)->i_mapping);
}
so this is a block layer issue if it's not set.
> (I know there is the AS_STABLE_WRITES, but I'm not sure if qemu will pass
> that flag to virtio block devices inside the VM)
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:16 ` Qu Wenruo
2025-02-03 8:26 ` Matthew Wilcox
@ 2025-02-03 8:26 ` hch
2025-02-03 13:27 ` Kanchan Joshi
2 siblings, 0 replies; 27+ messages in thread
From: hch @ 2025-02-03 8:26 UTC (permalink / raw)
To: Qu Wenruo
Cc: Johannes Thumshirn, hch@infradead.org, Kanchan Joshi,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Mon, Feb 03, 2025 at 06:46:49PM +1030, Qu Wenruo wrote:
> The root cause is the content change during direct IO, and XFS/EXT4 doesn't
> wait for folio writeback before dirtying the folio (if no AS_STABLE_WRITES
> set).
> That's a valid optimization, but that will cause contents change.
>
> (I know there is the AS_STABLE_WRITES, but I'm not sure if qemu will pass
> that flag to virtio block devices inside the VM)
It doesn't, and even if it did you can force guests to add it. But it
would be an interesting experiment to support passing it through at
least for virtio-blk.
> And with btrfs' checksum calculation happening before submitting the real
> bio, it means if the contents changed after the csum calculation and before
> bio finished, we will got csum mismatch.
btrfs checksums before submitting the bio. But that doesn't change
anything, as you still have the same problem.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:26 ` Matthew Wilcox
@ 2025-02-03 8:30 ` hch
2025-02-03 8:36 ` Qu Wenruo
0 siblings, 1 reply; 27+ messages in thread
From: hch @ 2025-02-03 8:30 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Qu Wenruo, Johannes Thumshirn, hch@infradead.org, Kanchan Joshi,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Mon, Feb 03, 2025 at 08:26:33AM +0000, Matthew Wilcox wrote:
> so this is a block layer issue if it's not set.
And even if the flag is set, direct I/O ignores it. So while passing
such a flag through virtio might be useful, we eventually need to sort
out the fact that direct I/O doesn't respect it.
Locking up any thread touching memory under direct I/O might be quite
heavy handed, so this means bounce buffering on page fault. We had
plenty of discussion of this before, but I don't think anyone actually
looked into implementing it.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:30 ` hch
@ 2025-02-03 8:36 ` Qu Wenruo
2025-02-03 8:40 ` hch
0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-02-03 8:36 UTC (permalink / raw)
To: hch@infradead.org, Matthew Wilcox
Cc: Johannes Thumshirn, Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 2025/2/3 19:00, hch@infradead.org wrote:
> On Mon, Feb 03, 2025 at 08:26:33AM +0000, Matthew Wilcox wrote:
>> so this is a block layer issue if it's not set.
>
> And even if the flag is set direct I/O ignores it. So while passing
> through such a flag through virtio might be useful we need to eventually
> sort out the fact that direct I/O doesn't respect it.
The example I mentioned is just an easy one to reproduce; there are
even worse cases, like a multi-threaded workload where one thread is
doing direct I/O while another is modifying the buffer.
So passing AS_STABLE_WRITES is only a solution for the specific case I
mentioned, not a generic solution at all.
But I would still appreciate any movement to expose that flag to virtio-blk.
>
> Locking up any thread touching memory under direct I/O might be quite
> heavy handed, so this means bounce buffering on page fault. We had
> plenty of discussion of this before, but I don't think anyone actually
> looked into implementing it.
>
Thus my current plan to fix it is to make btrfs skip csums for direct I/O.
This will make btrfs align with the EXT4/XFS behavior, without the
complex AS_STABLE_WRITES passing (and there is no way for user space to
probe that flag, IIRC).
But that will break the current per-inode NODATASUM setting, and will
cause some incompatibility (a new incompat flag is needed, plus extra
handling if no data csum is found, extra fsck support, etc.).
Thanks,
Qu
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:36 ` Qu Wenruo
@ 2025-02-03 8:40 ` hch
2025-02-03 8:51 ` Qu Wenruo
0 siblings, 1 reply; 27+ messages in thread
From: hch @ 2025-02-03 8:40 UTC (permalink / raw)
To: Qu Wenruo
Cc: hch@infradead.org, Matthew Wilcox, Johannes Thumshirn,
Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On Mon, Feb 03, 2025 at 07:06:15PM +1030, Qu Wenruo wrote:
> Thus my current plan to fix it is to make btrfs skip csums for direct I/O.
> This will make btrfs align with EXT4/XFS behavior, without the complex
> AS_STABLE_WRITES passing (and there is no way for user space to probe
> that flag, IIRC).
>
> But that will break the current per-inode level NODATASUM setting, and will
> cause some incompatibility (a new incompat flag needed, extra handling if no
> data csum found, extra fsck support etc).
I don't think simply removing the checksums when using direct I/O is
a good idea, as it unexpectedly reduces the protection envelope. The
best (or least bad) fix would be to simply not support actual direct
I/O without NODATASUM and fall back to buffered I/O (preferably the new
uncached variant from Jens) unless explicitly overridden.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:40 ` hch
@ 2025-02-03 8:51 ` Qu Wenruo
2025-02-03 8:57 ` hch
0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-02-03 8:51 UTC (permalink / raw)
To: hch@infradead.org
Cc: Matthew Wilcox, Johannes Thumshirn, Kanchan Joshi,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On 2025/2/3 19:10, hch@infradead.org wrote:
> On Mon, Feb 03, 2025 at 07:06:15PM +1030, Qu Wenruo wrote:
>> Thus my current plan to fix it is to make btrfs to skip csum for direct IO.
>> This will make btrfs to align with EXT4/XFS behavior, without the complex
>> AS_STABLE_FLAGS passing (and there is no way for user space to probe that
>> flag IIRC).
>>
>> But that will break the current per-inode level NODATASUM setting, and will
>> cause some incompatibility (a new incompat flag needed, extra handling if no
>> data csum found, extra fsck support etc).
>
> I don't think simply removing the checksums when using direct I/O is
> a good idea as it unexpectedly reduces the protection envelope. The
> best (or least bad) fix would be to simply not support actually direct
> I/O without NODATASUM and fall back to buffered I/O (preferably the new
> uncached variant from Jens) unless explicitly overridden.
>
That always-falling-back-to-buffered-I/O idea sounds pretty good.
(For NODATASUM inodes, we do not need to fall back, though.)
The only concern is performance.
I guess even an uncached write still involves an extra folio copy, thus
not quite the same performance level as direct I/O?
And always falling back (for inodes with data csums) may also sound a
little like overkill.
If the program is properly coded and no contents change mid-flight, we
always pay the performance penalty without really getting any extra
benefit.
I guess it really depends on the performance of uncached writes.
(And I really hope it's not noticeably slower than direct I/O.)
Thanks,
Qu
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:51 ` Qu Wenruo
@ 2025-02-03 8:57 ` hch
0 siblings, 0 replies; 27+ messages in thread
From: hch @ 2025-02-03 8:57 UTC (permalink / raw)
To: Qu Wenruo
Cc: hch@infradead.org, Matthew Wilcox, Johannes Thumshirn,
Kanchan Joshi, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On Mon, Feb 03, 2025 at 07:21:08PM +1030, Qu Wenruo wrote:
> Always falling back to buffered IO sounds pretty good.
> (For NODATASUM inodes, we do not need the fallback though.)
Yes, that's what I meant above.
>
> The only concern is performance.
> I guess even an uncached write still involves an extra folio copy,
> thus it is not at quite the same performance level as direct IO?
In general buffered I/O is going to be slower, but at least the uncached
mode will avoid the cache pollution.
> And always falling back (for inodes with data csum) may also be a little
> overkill.
> If the program is properly coded and no contents change mid-flight, we
> always pay the performance penalty without any real extra benefit.
But you don't know that, and people have very different expectations of
"properly coded" :) So I'd opt for the safe variant (copy) and allow an
opt-in for the faster but less safe variant (real direct I/O without
copy, with checksums). And hopefully we can eventually find a version
that will bounce-buffer when pages that are in-flight for direct I/O
are modified, which would be safe and fast.
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 8:16 ` Qu Wenruo
2025-02-03 8:26 ` Matthew Wilcox
2025-02-03 8:26 ` hch
@ 2025-02-03 13:27 ` Kanchan Joshi
2025-02-03 23:17 ` Qu Wenruo
2025-02-04 5:16 ` hch
2 siblings, 2 replies; 27+ messages in thread
From: Kanchan Joshi @ 2025-02-03 13:27 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, hch@infradead.org
Cc: Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On 2/3/2025 1:46 PM, Qu Wenruo wrote:
>> Well, for the WAF part, it'll save us 32 bytes per FS sector (typically
>> 4k) in the btrfs case; that's ~0.8% of the space.
>
> You forgot the csum tree COW part.
>
> Updating csum tree is pretty COW heavy and that's going to cause quite
> some wearing.
>
> Thus, although I do not think the RFC patch makes much sense compared to
> just the existing NODATASUM mount option, I'm interested in the hardware
> csum handling.
But the patches do exactly that, i.e., hardware csum support. And the
posted numbers [*] are also with the hardware checksumming the data
blocks.
NODATASUM forgoes the data csums at the Btrfs level, but its scope of
control/influence ends there, as it knows nothing about what happens
underneath.
The proposed option (DATASUM_OFFLOAD) ensures that only the hardware
checksums the data blocks.
[*]
https://lore.kernel.org/linux-block/20250129140207.22718-1-joshi.k@samsung.com/
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 7:47 ` Johannes Thumshirn
2025-02-03 7:56 ` Christoph Hellwig
@ 2025-02-03 13:32 ` Kanchan Joshi
1 sibling, 0 replies; 27+ messages in thread
From: Kanchan Joshi @ 2025-02-03 13:32 UTC (permalink / raw)
To: Johannes Thumshirn, Theodore Ts'o
Cc: lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 2/3/2025 1:17 PM, Johannes Thumshirn wrote:
>>> I wouldn't call this "file system offload". Enabling the data
>>> integrity feature or whatever you want to call it is really a block
>>> layer issue. The file system doesn't need to get involved at all.
>>> Indeed, looking the patch, the only reason why the file system is
>>> getting involved is because (a) you've added a mount option, and (b)
>>> the mount option flips a bit in the bio that gets sent to the block
>>> layer.
>> Mount option was only for the RFC. If everything else gets sorted, it
>> would be about choosing whatever is liked by the Btrfs.
>> > But this could also be done by adding a queue specific flag, at which
>>> point the file system doesn't need to be involved at all. Why would
>>> you want to enable the data ingregity feature on a per block I/O
>>> basis, if the device supports it?
>> Because I thought users (filesystems) would prefer flexibility. Per-IO
>> control helps to choose different policy for say data and meta. Let me
>> outline the differences.
> But data and metadata checksums are handled differently in btrfs. For
> data we checksum each block and write it into the checksum tree. For
> metadata the checksum is part of the metadata (see 'struct btrfs_header').
Yes, I am aware that the btrfs meta checksum is put into an existing field
in the B-tree header. It only affects the compute part, not the
write-amplification part, unlike the data checksums. That is exactly why
I did not find it worth optimizing in the existing RFC.
>> Block-layer auto integrity
>> - always attaches integrity-payload for each I/O.
>> - it does compute checksum/reftag for each I/O. And this part does not
>> do justice to the label 'offload'.
>>
>> The patches make auto-integrity
>> - attach the integrity-buffer only if the device configuration demands.
>> - never compute checksum/reftag at the block-layer.
>> - keeps the offload choice at per I/O level.
>>
>> Btrfs checksum tree is created only for data blocks, so the patches
>> apply the flag (REQ_INTEGRITY_OFFLOAD) on that. While metadata blocks,
>> which maybe more important, continue to get checksummed at two levels
>> (block and device).
> The thing I don't like about the current RFC patchset is that it breaks
> scrub, repair and device error statistics. It's nothing that can't be
> solved though. But as of now it just doesn't make any sense at all to
> me. We at least need the FS to look at the BLK_STS_PROTECTION return and
> handle it accordingly in scrub, read repair and statistics.
>
> And that's only for feature parity. I'd also like to see some
> performance numbers and numbers of reduced WAF, if this is really worth
> the hassle.
Have you seen the numbers mentioned in the cover letter of the RFC?
For direct IO: IOPS increase from 144K to 172K, and extra writes
reduce from 66GB to 9GB.
Something similar for buffered IO:
https://lore.kernel.org/linux-block/20250129140207.22718-1-joshi.k@samsung.com/
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 13:27 ` Kanchan Joshi
@ 2025-02-03 23:17 ` Qu Wenruo
2025-02-04 5:48 ` hch
2025-02-04 5:16 ` hch
1 sibling, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-02-03 23:17 UTC (permalink / raw)
To: Kanchan Joshi, Johannes Thumshirn, hch@infradead.org
Cc: Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On 2025/2/3 23:57, Kanchan Joshi wrote:
> On 2/3/2025 1:46 PM, Qu Wenruo wrote:
>>> Well, for the WAF part, it'll save us 32 bytes per FS sector (typically
>>> 4k) in the btrfs case; that's ~0.8% of the space.
>>
>> You forgot the csum tree COW part.
>>
>> Updating csum tree is pretty COW heavy and that's going to cause quite
>> some wearing.
>>
>> Thus, although I do not think the RFC patch makes much sense compared to
>> just the existing NODATASUM mount option, I'm interested in the hardware
>> csum handling.
>
> But the patches do exactly that, i.e., hardware csum support. And the
> posted numbers [*] are also with the hardware checksumming the data blocks.
>
> NODATASUM forgoes the data csums at the Btrfs level, but its scope of
> control/influence ends there, as it knows nothing about what happens
> underneath.
> The proposed option (DATASUM_OFFLOAD) ensures that only the hardware
> checksums the data blocks.
My understanding is, if the hardware supports the extra payload, it's
better to let the user configure it.
Btrfs already has a way to disable its data checksum. It's the end
users' choice whether they want to trust the hardware.
The only place where btrfs may want to interact with this hardware csum
is metadata.
Doing the double checksum may waste extra writes, thus disabling either
the metadata csum or the hardware one looks more reasonable.
Otherwise we're pushing for a policy (changing btrfs' automatic csum
behavior), not a mechanism (the existing btrfs nodatacsum mount
option/per-inode flag).
And inside the kernel, we always provide a mechanism, not a policy.
Thanks,
Qu
>
> [*]
> https://lore.kernel.org/linux-block/20250129140207.22718-1-joshi.k@samsung.com/
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 13:27 ` Kanchan Joshi
2025-02-03 23:17 ` Qu Wenruo
@ 2025-02-04 5:16 ` hch
2025-03-18 7:06 ` Kanchan Joshi
1 sibling, 1 reply; 27+ messages in thread
From: hch @ 2025-02-04 5:16 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Qu Wenruo, Johannes Thumshirn, hch@infradead.org,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Mon, Feb 03, 2025 at 06:57:13PM +0530, Kanchan Joshi wrote:
> But, patches do exactly that i.e., hardware cusm support. And posted
> numbers [*] are also when hardware is checksumming the data blocks.
I'm still not sure why you think the series implements hardware
csum support.
The buf mode is just a duplicate implementation of the block layer's
automatic PI. The no-buf mode means PRACT, which lets the device
auto-generate and strip PI. Especially the latter (which is the
one that was benchmarked) literally provides no additional protection
over what the device would already do. It's the "trust me, bro" of
data integrity :) Which, to be fair, will work pretty well, as devices
that support PI are the creme de la creme of storage devices and
will have very good internal data protection. But the
point of data checksums is to not trust the storage device and
not trust the layers between the checksum generation and the storage
device.
IFF using PRACT is an acceptable level of protection, just running
NODATASUM and disabling PI generation/verification in the block
layer using the current sysfs attributes (or an in-kernel interface
for that) to force the driver to set PRACT will do exactly the same
thing.
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-03 23:17 ` Qu Wenruo
@ 2025-02-04 5:48 ` hch
0 siblings, 0 replies; 27+ messages in thread
From: hch @ 2025-02-04 5:48 UTC (permalink / raw)
To: Qu Wenruo
Cc: Kanchan Joshi, Johannes Thumshirn, hch@infradead.org,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Tue, Feb 04, 2025 at 09:47:20AM +1030, Qu Wenruo wrote:
> Btrfs already has a way to disable its data checksum. It's the end
> users' choice whether they want to trust the hardware.
Yes.
> The only place where btrfs may want to interact with this hardware csum
> is metadata.
> Doing the double checksum may waste extra writes, thus disabling either
> the metadata csum or the hardware one looks more reasonable.
Note that the most common (and often the only supported) PI checksum
algorithm is significantly weaker than the default btrfs checksum
algorithm.
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-02-04 5:16 ` hch
@ 2025-03-18 7:06 ` Kanchan Joshi
2025-03-18 8:07 ` hch
0 siblings, 1 reply; 27+ messages in thread
From: Kanchan Joshi @ 2025-03-18 7:06 UTC (permalink / raw)
To: hch@infradead.org
Cc: Qu Wenruo, Johannes Thumshirn, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 2/4/2025 10:46 AM, hch@infradead.org wrote:
> On Mon, Feb 03, 2025 at 06:57:13PM +0530, Kanchan Joshi wrote:
>> But the patches do exactly that, i.e., hardware csum support. And the
>> posted numbers [*] are also with the hardware checksumming the data blocks.
>
> I'm still not sure why you think the series implements hardware
> csum support.
The series ensures (a) that the host does not compute the csum, and (b)
that the device does.
Not sure if you were doubting the HW instead, but I checked that part
with a user-space nvme passthrough program which
- [during write] does not send a checksum and sets PRACT to 1.
- [during read] sends a metadata buffer and keeps PRACT at 0.
It reads back the correct data checksum, which the host never computed
(but the device did at write time).
> The buf mode is just a duplicate implementation of the block layer's
> automatic PI. The no-buf mode means PRACT, which lets the device
> auto-generate and strip PI.
Regardless of buf or no buf, it applies PRACT and only the device computes
the checksum. The two modes take shape only because of the way
PRACT works for two different device configurations:
#1: when meta-size == pi-size, we don't need to send a meta-buffer.
#2: when meta-size > pi-size, we need to.
Automatic PI helps for #2, as split handling of the meta-buffer comes for
free if the I/O is split. But overall, this is also about abstracting the
PRACT details so that each filesystem does not have to bother.
And changes to keep this abstracted in Auto-PI/NVMe are not much:
block/bio-integrity.c | 42 ++++++++++++++++++++++++++++++++++++++-
block/t10-pi.c | 7 +++++++
drivers/nvme/host/core.c | 24 ++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
> Especially the latter (which is the
> one that was benchmarked) literally provides no additional protection
> over what the device would already do. It's the "trust me, bro" of
> data integrity :) Which, to be fair, will work pretty well, as devices
> that support PI are the creme de la creme of storage devices and
> will have very good internal data protection. But the
> point of data checksums is to not trust the storage device and
> not trust the layers between the checksum generation and the storage
> device.
Right, I'm not saying that the protection is getting better. Just that any
offload is about trusting someone else with the job. We have other
instances like atomic writes, copy, write-zeroes, write-same, etc.
> IFF using PRACT is an acceptable level of protection, just running
> NODATASUM and disabling PI generation/verification in the block
> layer using the current sysfs attributes (or an in-kernel interface
> for that) to force the driver to set PRACT will do exactly the same
> thing.
I had considered that, but it can't work because:
- the sysfs attributes operate at the block-device level, for all read or
all write operations. That's not flexible enough for policies such as "do
something for some writes/reads but not for others", which can translate
to "do checksum offload for FS data, but keep things as is for FS meta"
or other combinations.
- if the I/O goes down to the driver without an integrity payload, the
driver will start failing (rather than setting PRACT) if the
configuration is "meta-size > pi-size". This part in nvme_setup_rw:
	if (!blk_integrity_rq(req)) {
		if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
			return BLK_STS_NOTSUPP;
		control |= NVME_RW_PRINFO_PRACT;
	}
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-03-18 7:06 ` Kanchan Joshi
@ 2025-03-18 8:07 ` hch
2025-03-19 18:06 ` Kanchan Joshi
0 siblings, 1 reply; 27+ messages in thread
From: hch @ 2025-03-18 8:07 UTC (permalink / raw)
To: Kanchan Joshi
Cc: hch@infradead.org, Qu Wenruo, Johannes Thumshirn,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com
On Tue, Mar 18, 2025 at 12:36:44PM +0530, Kanchan Joshi wrote:
> Right, I'm not saying that protection is getting better. Just that any
> offload is about trusting someone else with the job. We have other
> instances like atomic-writes, copy, write-zeroes, write-same etc.
So what is the use case for it? What is the "thread" model you are
trying to protect against (where thread here is borrowed from the
security world and implies data corruption caught by checksums)?
>
> > IFF using PRACT is an acceptable level of protection just running
> > NODATASUM and disabling PI generation/verification in the block
> > layer using the current sysfs attributes (or an in-kernel interface
> > for that) to force the driver to set PRACT will do exactly the same
> > thing.
>
> I had considered that, but it can't work because:
>
> - the sysfs attributes operate at the block-device level, for all read or
> all write operations. That's not flexible enough for policies such as "do
> something for some writes/reads but not for others", which can translate
> to "do checksum offload for FS data, but keep things as is for FS meta"
> or other combinations.
Well, we can easily do that using a per-I/O flag, assuming we have a
use case for it. But for that we need to find the use case first.
So, as with so many things (including some of my own), we really need
to go back and describe the problem we're trying to fix before
diving deep into the implementation.
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-03-18 8:07 ` hch
@ 2025-03-19 18:06 ` Kanchan Joshi
2025-03-20 5:48 ` hch
0 siblings, 1 reply; 27+ messages in thread
From: Kanchan Joshi @ 2025-03-19 18:06 UTC (permalink / raw)
To: hch@infradead.org
Cc: Qu Wenruo, Johannes Thumshirn, Theodore Ts'o,
lsf-pc@lists.linux-foundation.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, josef@toxicpanda.com
On 3/18/2025 1:37 PM, hch@infradead.org wrote:
> On Tue, Mar 18, 2025 at 12:36:44PM +0530, Kanchan Joshi wrote:
>> Right, I'm not saying that protection is getting better. Just that any
>> offload is about trusting someone else with the job. We have other
>> instances like atomic-writes, copy, write-zeroes, write-same etc.
>
> So wahst is the use case for it?
I tried to describe that in the cover letter of the PoC:
https://lore.kernel.org/linux-btrfs/20250129140207.22718-1-joshi.k@samsung.com/
> What is the "thread" model you are
> trying to protect against (where thread here is borrowed from the
> security world and implies data corruption caught by checksums)?
Seems you meant threat model. That was not on my mind for this series,
but sure, we don't boost integrity with offload.
>>
>>> IFF using PRACT is an acceptable level of protection just running
>>> NODATASUM and disabling PI generation/verification in the block
>>> layer using the current sysfs attributes (or an in-kernel interface
>>> for that) to force the driver to set PRACT will do exactly the same
>>> thing.
>>
>> I had considered that, but it can't work because:
>>
>> - the sysfs attributes operate at the block-device level, for all read or
>> all write operations. That's not flexible enough for policies such as "do
>> something for some writes/reads but not for others", which can translate
>> to "do checksum offload for FS data, but keep things as is for FS meta"
>> or other combinations.
>
> Well, we can easily do that using a per-I/O flag
Right, a per-I/O flag (named REQ_INTEGRITY_OFFLOAD) is what I did in the
patch:
https://lore.kernel.org/linux-btrfs/20250129140207.22718-2-joshi.k@samsung.com/
* Re: [LSF/MM/BPF TOPIC] File system checksum offload
2025-03-19 18:06 ` Kanchan Joshi
@ 2025-03-20 5:48 ` hch
0 siblings, 0 replies; 27+ messages in thread
From: hch @ 2025-03-20 5:48 UTC (permalink / raw)
To: Kanchan Joshi
Cc: hch@infradead.org, Qu Wenruo, Johannes Thumshirn,
Theodore Ts'o, lsf-pc@lists.linux-foundation.org,
linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
josef@toxicpanda.com, Martin K. Petersen
On Wed, Mar 19, 2025 at 11:36:28PM +0530, Kanchan Joshi wrote:
> I tried to describe that in the cover letter of the PoC:
> https://lore.kernel.org/linux-btrfs/20250129140207.22718-1-joshi.k@samsung.com/
I don't think it does, as people have pointed out multiple times.
Once you do checksums on the device, they are not end to end, either in
the T10 DIF definition or in that of file system checksums. You leave a
huge glaring gap in the protection envelope. So calling this "offload"
is extremely dishonest and misleading.
Thread overview: 27+ messages
[not found] <CGME20250130092400epcas5p1a3a9d899583e9502ed45fe500ae8a824@epcas5p1.samsung.com>
2025-01-30 9:15 ` [LSF/MM/BPF TOPIC] File system checksum offload Kanchan Joshi
2025-01-30 14:28 ` Theodore Ts'o
2025-01-30 20:39 ` [Lsf-pc] " Martin K. Petersen
2025-01-31 4:40 ` Theodore Ts'o
2025-01-31 7:07 ` Christoph Hellwig
2025-01-31 13:11 ` Kanchan Joshi
2025-02-03 7:47 ` Johannes Thumshirn
2025-02-03 7:56 ` Christoph Hellwig
2025-02-03 8:04 ` Johannes Thumshirn
2025-02-03 8:06 ` hch
2025-02-03 8:16 ` Qu Wenruo
2025-02-03 8:26 ` Matthew Wilcox
2025-02-03 8:30 ` hch
2025-02-03 8:36 ` Qu Wenruo
2025-02-03 8:40 ` hch
2025-02-03 8:51 ` Qu Wenruo
2025-02-03 8:57 ` hch
2025-02-03 8:26 ` hch
2025-02-03 13:27 ` Kanchan Joshi
2025-02-03 23:17 ` Qu Wenruo
2025-02-04 5:48 ` hch
2025-02-04 5:16 ` hch
2025-03-18 7:06 ` Kanchan Joshi
2025-03-18 8:07 ` hch
2025-03-19 18:06 ` Kanchan Joshi
2025-03-20 5:48 ` hch
2025-02-03 13:32 ` Kanchan Joshi