From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Kanchan Joshi <joshi.k@samsung.com>
Cc: josef@toxicpanda.com, dsterba@suse.com, clm@fb.com,
axboe@kernel.dk, kbusch@kernel.org, hch@lst.de,
linux-btrfs@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, gost.dev@samsung.com
Subject: Re: [RFC 0/3] Btrfs checksum offload
Date: Thu, 30 Jan 2025 15:21:45 -0500
Message-ID: <yq134h0p1m5.fsf@ca-mkp.ca.oracle.com>
In-Reply-To: <20250129140207.22718-1-joshi.k@samsung.com> (Kanchan Joshi's message of "Wed, 29 Jan 2025 19:32:04 +0530")
Hi Kanchan!
> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
I have a couple of observations.
First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.
The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, through
the storage fabric, and out to the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI with respect to the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was generated
N seconds ago in host memory. And therefore we can write.
I.e. the goal of PI is to protect against problems that happen between data
creation time and the data being persisted to media. Once the ECC has
been calculated, PI essentially stops being interesting.
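To make the flow above concrete, here is a minimal sketch of generating a guard tag when the data is created and checking it again at the device before the write is committed. The CRC parameters are those of the T10 DIF 16-bit guard (polynomial 0x8BB7, zero initial value, no reflection). The function names `generate_guard` and `device_verify` are hypothetical and only illustrate the two ends of the path; this is not how any particular driver or device firmware is structured.

```python
T10_DIF_POLY = 0x8BB7  # polynomial of the 16-bit T10 DIF guard CRC


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: init 0, MSB first, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def generate_guard(block: bytes) -> int:
    """Hypothetical host-side step: compute the guard as soon as the data exists."""
    return crc16_t10dif(block)


def device_verify(block: bytes, guard: int) -> bool:
    """Hypothetical device-side step: recompute and compare before writing."""
    return crc16_t10dif(block) == guard


if __name__ == "__main__":
    block = bytes(range(256)) * 2          # one 512-byte sector
    guard = generate_guard(block)          # computed at data creation time

    # The data arrives intact: the device accepts the write.
    assert device_verify(block, guard)

    # A bit flips somewhere between host memory and the device: rejected.
    corrupted = bytearray(block)
    corrupted[100] ^= 0x01
    assert not device_verify(bytes(corrupted), guard)
```

The point of the sketch is the timing: the guard is computed from the original buffer, so any corruption introduced afterwards shows up as a mismatch at the far end. A checksum generated at the same moment as the ECC would have been computed over the already-corrupted data and would match.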
The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.
At the time T10 PI was designed, spinning rust was the only game in town.
And nobody was willing to take the performance hit of having to seek
twice per I/O to store PI separately from the data. And while schemes
involving sending all the PI ahead of the data were entertained, they
never came to fruition. Storing 512+8 in the same sector was a necessity
in the context of SCSI drives, not a desired behavior. Addressing that
in DIX was key.
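The two layouts discussed above can be sketched as follows: the 512+8 on-media format, where each sector carries its 8-byte PI tuple inline, and the DIX-style host view, where data and PI live in separate buffers. The tuple layout (16-bit guard, 16-bit application tag, 32-bit reference tag, big-endian) follows T10 PI; the helper names `pack_pi`, `interleave` and `deinterleave` are made up for this illustration, and the tag values are placeholders.

```python
import struct

SECTOR = 512
PI_FMT = ">HHI"                      # guard tag, application tag, reference tag
PI_SIZE = struct.calcsize(PI_FMT)    # 8 bytes per sector


def pack_pi(guard: int, app_tag: int, ref_tag: int) -> bytes:
    """Build one 8-byte PI tuple."""
    return struct.pack(PI_FMT, guard, app_tag, ref_tag)


def interleave(data: bytes, pi: bytes) -> bytes:
    """Separate buffers -> 520-byte sectors (512 data + 8 PI each)."""
    out = bytearray()
    for i in range(len(data) // SECTOR):
        out += data[i * SECTOR:(i + 1) * SECTOR]
        out += pi[i * PI_SIZE:(i + 1) * PI_SIZE]
    return bytes(out)


def deinterleave(buf: bytes) -> tuple[bytes, bytes]:
    """520-byte sectors -> separate data and PI buffers."""
    data, pi = bytearray(), bytearray()
    stride = SECTOR + PI_SIZE
    for off in range(0, len(buf), stride):
        data += buf[off:off + SECTOR]
        pi += buf[off + SECTOR:off + stride]
    return bytes(data), bytes(pi)


if __name__ == "__main__":
    data = bytes(8 * SECTOR)                              # a 4KB block
    # Placeholder tuples: guard and app tag left at 0, ref tag = sector index.
    pi = b"".join(pack_pi(0, 0, lba) for lba in range(8))
    wire = interleave(data, pi)
    assert len(wire) == 8 * 520
    assert deinterleave(wire) == (data, pi)
```

With the separate-buffer form, the host never has to handle buffers that are multiples of 520 bytes, and the data and PI can take different paths through memory before they meet again at the device.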
So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAR may be more important than extending
the protection envelope for the data, and that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.
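As a rough illustration of the CRC32C suggestion, the sketch below implements CRC32C (Castagnoli) bitwise and compares the approximate miss rate of a 16-bit and a 32-bit checksum. The miss-rate figure assumes corruption that effectively randomizes the block, in which case an n-bit checksum lets roughly 1 in 2^n corrupted blocks through; that is a simplifying assumption, not a statement about any specific error pattern. The helper `random_miss_rate` is hypothetical.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C: init 0xFFFFFFFF, LSB first, final XOR 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def random_miss_rate(bits: int) -> float:
    """Hypothetical helper: chance that random garbage matches an n-bit checksum."""
    return 2.0 ** -bits


if __name__ == "__main__":
    block = bytes(4096)                     # one 4KB block
    print(f"crc32c of a zeroed 4KB block: {crc32c(block):#010x}")
    ratio = random_miss_rate(16) / random_miss_rate(32)
    print(f"16-bit vs 32-bit miss rate ratio: {ratio:.0f}x")
```

A table-driven or hardware-accelerated CRC32C would be used in practice; the bitwise loop is only meant to show the parameters.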
--
Martin K. Petersen Oracle Linux Engineering