From: Qu Wenruo <wqu@suse.com>
To: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	"hch@infradead.org" <hch@infradead.org>
Cc: Kanchan Joshi <joshi.k@samsung.com>,
	Theodore Ts'o <tytso@mit.edu>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>
Subject: Re: [LSF/MM/BPF TOPIC] File system checksum offload
Date: Mon, 3 Feb 2025 18:46:49 +1030
Message-ID: <eaec853d-eda6-4ee9-abb6-e2fa32f54f5c@suse.com>
In-Reply-To: <bb516f19-a6b3-4c6b-89f9-928d46b66e2a@wdc.com>



On 2025/2/3 18:34, Johannes Thumshirn wrote:
> On 03.02.25 08:56, Christoph Hellwig wrote:
>> On Mon, Feb 03, 2025 at 07:47:53AM +0000, Johannes Thumshirn wrote:
>>> The thing I don't like with the current RFC patchset is, it breaks
>>> scrub, repair and device error statistics. It's nothing that can't be
>>> solved though. But as of now it just doesn't make any sense at all to
>>> me. We at least need the FS to look at the BLK_STS_PROTECTION return and
>>> handle accordingly in scrub, read repair and statistics.
>>>
>>> And that's only for feature parity. I'd also like to see some
>>> performance numbers and numbers of reduced WAF, if this is really worth
>>> the hassle.
>>
>> If we can store checksums in metadata / extended LBA that will help
>> WAF a lot, and also performance because you only need one write
>> instead of two dependent writes, and also just one read.
> 
> Well for the WAF part, it'll save us 32 Bytes per FS sector (typically
> 4k) in the btrfs case, that's ~0.8% of the space.

You forgot the csum tree COW part.

Updating the csum tree is pretty COW heavy, and that is going to cause
quite some wear.

Thus, although I do not think the RFC patch makes much sense compared to
just using the existing NODATASUM mount option, I'm interested in the
hardware csum handling.
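
To put rough numbers on the two costs above, here is a minimal sketch.
The 32-byte csum and 4k sector come from the quoted text; the 16 KiB
node size and the tree height of 3 are assumptions chosen purely for
illustration, and the helper name is hypothetical. It only shows why
the payload figure alone understates the write cost: a COW update
rewrites every node on the path from the modified leaf up to the root.

```python
# Payload overhead: csum bytes stored per filesystem sector.
CSUM_BYTES = 32        # from the quoted text
SECTOR_BYTES = 4096    # "typically 4k"
payload_overhead = CSUM_BYTES / SECTOR_BYTES

# COW cost: every node on the leaf-to-root path is rewritten.
NODE_BYTES = 16 * 1024  # assumption: 16 KiB tree nodes
TREE_HEIGHT = 3         # assumption: hypothetical csum tree height


def cow_bytes_for_one_leaf_update(height=TREE_HEIGHT, node=NODE_BYTES):
    """Bytes rewritten when one leaf changes in a COW B-tree."""
    return height * node


print(f"csum payload overhead: {payload_overhead:.2%}")
print(f"bytes rewritten per modified leaf: {cow_bytes_for_one_leaf_update()}")
```

Many csum updates can share one rewritten path within a transaction, so
the second number is a per-leaf figure, not a per-sector one.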

> 
>> The checksums in the current PI formats (minus the new ones in NVMe)
>> aren't that good as Martin pointed out, but the biggest issue really
>> is that you need hardware that does support metadata or PI.  SATA
>> doesn't support it at all.  For NVMe PI support is generally a feature
>> that is supported by gold plated fully featured enterprise devices
>> but not the cheaper tiers.  I've heard some talks of customers asking
>> for plain non-PI metadata in certain cheaper tiers, but not much of
>> that has actually materialized yet.  If we ever get at least non-PI
>> metadata support on cheap NVMe drives the idea of storing checksums
>> there would become very, very useful.

The other pain point of btrfs' data checksum is related to direct IO and
the contents changing halfway through the IO.

It's pretty easy to reproduce: just start a VM with an image on btrfs,
set the VM cache mode to none (aka, using direct IO), run XFS/EXT4
inside the VM, and run some fsstress. That should cause btrfs to hit
false data csum mismatch alerts.

The root cause is the contents changing during direct IO: XFS/EXT4
don't wait for folio writeback before dirtying the folio again (if
AS_STABLE_WRITES is not set).
That's a valid optimization, but it means the contents can change while
the write is still in flight.

(I know there is AS_STABLE_WRITES, but I'm not sure whether qemu will
pass that flag to virtio block devices inside the VM)

And with btrfs' checksum calculation happening before submitting the
real bio, it means that if the contents change after the csum
calculation and before the bio finishes, we will get a csum mismatch.

So if the csum calculation can happen inside the hardware, it will solve
the problem of csum mismatches caused by content changes during direct
IO.
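
The ordering problem can be shown with a small deterministic sketch.
This is not btrfs code: zlib.crc32 stands in for the real csum
algorithm, the function names are hypothetical, and the in-flight
modification is modelled as a plain callback instead of a real race.

```python
import zlib

SECTOR = 4096


def redirty(buf):
    """Model the guest modifying the page while the write is in flight."""
    buf[0] ^= 0xFF


def fs_side_write(buf, modify_in_flight):
    # The filesystem checksums the buffer before submitting the bio.
    csum = zlib.crc32(bytes(buf))
    # The buffer changes after the csum was taken.
    modify_in_flight(buf)
    # The device stores whatever is in memory at transfer time.
    on_disk = bytes(buf)
    return on_disk, csum


def device_side_write(buf, modify_in_flight):
    modify_in_flight(buf)
    on_disk = bytes(buf)
    # The device checksums exactly the data it stored.
    csum = zlib.crc32(on_disk)
    return on_disk, csum


def verify(on_disk, csum):
    return zlib.crc32(on_disk) == csum


fs_ok = verify(*fs_side_write(bytearray(SECTOR), redirty))
dev_ok = verify(*device_side_write(bytearray(SECTOR), redirty))
print("fs-side csum matches:", fs_ok)       # False
print("device-side csum matches:", dev_ok)  # True
```

In the first case the data on disk is fine from the guest's point of
view, but the csum describes a buffer state that was never written.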

Thanks,
Qu

>>
>> FYI, I'll post my hacky XFS data checksumming code to show how relatively
>> simple using the out of band metadata is for file system based
>> checksumming.
>>
> 

