From: Qu Wenruo <wqu@suse.com>
To: Mark Harmstone <mark@harmstone.com>,
linux-btrfs@vger.kernel.org, boris@bur.io
Subject: Re: [PATCH v2] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
Date: Thu, 9 Apr 2026 20:38:58 +0930
Message-ID: <b0a7acec-73ce-4cc2-aecd-d0f686a36e4d@suse.com>
In-Reply-To: <20260408174642.136962-1-mark@harmstone.com>
On 2026/4/9 03:16, Mark Harmstone wrote:
> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
> query the on-disk csums for a file.
After some more discussion, I now understand why you want an
unprivileged ioctl instead of splitting the workload into fiemap plus
the csum tree search ioctl.
You want to do extra permission checks, which is impossible with the
csum tree search ioctl.
And if we allowed unprivileged csum tree search, it would expose all the
data checksums to an attacker.
A csum alone is not enough to reconstruct the plaintext, even for the
weakest CRC32C.
But it still leaks other properties of the data, e.g. whether some
blocks are all zero, or whether two blocks are (possibly) identical.
Not sure if you want to include a few words on this design decision
though.
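
To illustrate the leak, here is a minimal userspace sketch of what an
observer holding only the csums could infer. It assumes the standard
CRC-32C (Castagnoli) definition and a 4K sector size; the helper names
csum_looks_zeroed() and csums_suggest_same_data() are made up for this
example and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: 4K sectors. */
#define EXAMPLE_SECTORSIZE 4096

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78), slow but simple. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(data != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical helper: does this csum match the csum of an all-zero block? */
static int csum_looks_zeroed(uint32_t csum)
{
	static const uint8_t zero[EXAMPLE_SECTORSIZE];

	return csum == crc32c(zero, sizeof(zero));
}

/*
 * Hypothetical helper: equal csums suggest (but, with a 32-bit csum,
 * do not prove) that two blocks hold the same data.
 */
static int csums_suggest_same_data(uint32_t a, uint32_t b)
{
	return a == b;
}
```

Neither helper ever sees the file contents, which is the point: the
csums alone already answer both questions.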
>
> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
> the kernel, which details the offset and length we're interested in, and
> a buffer for the kernel to write its results into. The kernel writes a
> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
> csums if available.
>
> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
> the precomputed hash of the zero sector.
Well, mkfs is going to skip the range as a hole, which is even faster
than using any precalculated csum.
That said, the ZEROED flag may be useful for future users, so I don't
mind keeping it.
> Otherwise, it sets the type to
> BTRFS_GET_CSUMS_NO_CSUMS.
>
> We do store the csums of compressed extents, but we deliberately don't
> return them here: they're hashed over the compressed data, not the
> uncompressed data that's returned to userspace.
Considering we're already treating prealloc/hole with a dedicated ZEROED
flag, it may be better, just to keep things consistent, to provide an
ENCODED flag to indicate that the range is either compressed or, with
the upcoming encryption feature, encrypted.
We still don't provide the csum, but at least user space knows why.
>
> +#define GET_CSUMS_BUF_MAX (16 * 1024 * 1024)
Use SZ_16M instead.
[...]
> long btrfs_ioctl(struct file *file, unsigned int
> cmd, unsigned long arg)
> {
> @@ -5294,6 +5622,8 @@ long btrfs_ioctl(struct file *file, unsigned int
> #endif
> case BTRFS_IOC_SUBVOL_SYNC_WAIT:
> return btrfs_ioctl_subvol_sync(fs_info, argp);
> + case BTRFS_IOC_GET_CSUMS:
> + return btrfs_ioctl_get_csums(file, argp);
> #ifdef CONFIG_BTRFS_EXPERIMENTAL
> case BTRFS_IOC_SHUTDOWN:
> return btrfs_ioctl_shutdown(fs_info, arg);
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 9165154a274d94..d079e8b67fd740 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -1100,6 +1100,25 @@ enum btrfs_err_code {
> BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
> };
>
> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
> +#define BTRFS_GET_CSUMS_ZEROED 1
> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
> +
> +struct btrfs_ioctl_get_csums_entry {
> + __u64 offset; /* file offset of this range */
> + __u64 length; /* length in bytes */
> + __u32 type; /* BTRFS_GET_CSUMS_* type */
> + __u32 reserved; /* padding, must be 0 */
> +};
> +
> +struct btrfs_ioctl_get_csums_args {
> + __u64 offset; /* in/out: file offset */
> + __u64 length; /* in/out: range length */
> + __u64 buf_size; /* in/out: buffer capacity / bytes written */
> + __u8 buf[]; /* out: entries + csum data */
It may be worth adding more explanation of the output buffer format.
The resulting buffer would look something like the following example:
Input:
inode has [0, 4K) hole, [4K, 12K) data, isize 12K.
args.offset = 0
args.length = 1M
args.buf_size = 1M
Output:
args.offset = 0
args.length = 1M
args.buf_size = buf_size_out
buf:
| [0, 4K) ZEROED | [4K, 12K) HAS_CSUM | CSUM | [12K, 1M) ZEROED |
|<------------------------ buf_size_out ----------------------->|
I suggest this because it took me some time to work out the output
buffer format from the code, and it differs from my initial impression.
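
For reference, a rough userspace sketch of how a caller might walk such
a buffer. The entry layout and type values are mirrored locally from the
quoted patch rather than taken from an installed header. The walker
walk_csums_buf(), struct csums_summary, and the rule of one csum_size
csum per sectorsize bytes of a HAS_CSUMS range are my assumptions, not
something the patch spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrored locally from the quoted patch. */
#define BTRFS_GET_CSUMS_HAS_CSUMS 0
#define BTRFS_GET_CSUMS_ZEROED    1
#define BTRFS_GET_CSUMS_NO_CSUMS  2

struct csums_entry {
	uint64_t offset;   /* file offset of this range */
	uint64_t length;   /* length in bytes */
	uint32_t type;     /* BTRFS_GET_CSUMS_* type */
	uint32_t reserved; /* padding, must be 0 */
};

/* Hypothetical summary filled in by the walker below. */
struct csums_summary {
	size_t entries;
	uint64_t zeroed_bytes;
	uint64_t csummed_bytes;
	uint64_t nocsum_bytes;
};

/*
 * Walk "used" bytes of output buffer. Returns 0 on success, -1 if the
 * buffer is truncated or contains an unknown entry type.
 */
static int walk_csums_buf(const uint8_t *buf, size_t used,
			  uint32_t sectorsize, uint32_t csum_size,
			  struct csums_summary *out)
{
	size_t pos = 0;

	assert(buf != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	if (sectorsize == 0)
		return -1;

	while (pos < used) {
		struct csums_entry e;
		uint64_t csum_bytes;

		if (used - pos < sizeof(e))
			return -1;
		/* memcpy avoids relying on the entry being aligned. */
		memcpy(&e, buf + pos, sizeof(e));
		pos += sizeof(e);

		switch (e.type) {
		case BTRFS_GET_CSUMS_HAS_CSUMS:
			/* Only this type is followed by csum data. */
			if (e.length % sectorsize)
				return -1;
			csum_bytes = (e.length / sectorsize) * csum_size;
			if (csum_bytes > used - pos)
				return -1;
			pos += csum_bytes;
			out->csummed_bytes += e.length;
			break;
		case BTRFS_GET_CSUMS_ZEROED:
			out->zeroed_bytes += e.length;
			break;
		case BTRFS_GET_CSUMS_NO_CSUMS:
			out->nocsum_bytes += e.length;
			break;
		default:
			return -1;
		}
		out->entries++;
	}
	return 0;
}
```

With 4K sectors and 4-byte CRC32C csums, the example above would occupy
24 + 24 + 8 + 24 = 80 bytes, and the walker would report three entries.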
Another thing: it may be better to add a flags/version member to
btrfs_ioctl_get_csums_args.
If we later need to add new values to entry->type, use the reserved
entry padding for something, or even change the behavior of the output
buffer format, we must have a way to tell end users.
Otherwise looks good to me.
Thanks,
Qu