public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Mark Harmstone <mark@harmstone.com>
To: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org, boris@bur.io
Cc: neelx@suse.com
Subject: Re: [PATCH v2] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
Date: Mon, 13 Apr 2026 14:14:09 +0100	[thread overview]
Message-ID: <33a920c9-c876-45bb-b1aa-eed42472efc4@harmstone.com> (raw)
In-Reply-To: <b0a7acec-73ce-4cc2-aecd-d0f686a36e4d@suse.com>

On 09/04/2026 12.08 pm, Qu Wenruo wrote:
> 
> 
> 在 2026/4/9 03:16, Mark Harmstone 写道:
>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>> query the on-disk csums for a file.
> 
> After some more discussion, now I understand why you want an 
> unprivileged ioctl instead of splitting the workload into fiemap + csum 
> tree search ioctl.
> 
> You want to do extra permission checks, which is impossible for the csum 
> tree search ioctl.
> 
> And if we allow unprivileged csum tree search, it will expose all the 
> data checksum to an attacker.
> The csum itself is not enough to re-construct the plaintext even for the 
> weakest CRC32C.
> 
> But it is still enough info to know other aspects of some data, e.g. if 
> some blocks are all zero, or some two blocks are (possibly) the same etc.
> 
> Not sure if you want to include some short words on this design decision 
> though.

That's correct. I'll make sure the description is clearer for v3.

The reason arbitrary csums can't be returned to unprivileged users is 
principally because it is a good indication that a known block is 
somewhere on the system. An attack vector might be an unprivileged user 
scanning the whole filesystem to see if a known vulnerable ELF file is 
installed.

>> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
>> the kernel, which details the offset and length we're interested in, and
>> a buffer for the kernel to write its results into. The kernel writes a
>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>> csums if available.
>>
>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
>> the precomputed hash of the zero sector.
> 
> Well, for mkfs it's going to skip the range as a hole, which is even 
> faster than using any precalculated csum.
> 
> Although keeping the ZEROED flag may be useful for future users, I would 
> not mind to keep this flag.

I wrote this before you added your hole-scanning code to mkfs, but I'm 
keeping this because it's cheap and it might be useful... and you have 
to handle the case where userspace asks for the csum of a hole regardless.

One non-mkfs use that springs to mind is that XORing all the csums of a 
file together should give you a pretty fast ad-hoc checksum.

>> Otherwise, it sets the type to
>> BTRFS_GET_CSUMS_NO_CSUMS.
>>
>> We do store the csums of compressed extents, but we deliberately don't
>> return them here: they're hashed over the compressed data, not the
>> uncompressed data that's returned to userspace.
> 
> Consdiering we're already treating prealloc/hole with a dedicated ZEROED 
> flag, just to keep things consistent, it may be better to provide a 
> ENCODED flag, to indicate the range is either compressed or encrypted 
> for the incoming encyrption feature.
> 
> We still don't provide the csum, but just let the user space to know why.

This is free, so there's no reason not to.

At any rate, it ought to be erroring out if the encryption field is set 
on the file extent. From the looks of things Daniel's current encryption 
work is setting this.

There might one day be a use case for returning the csums of an 
encrypted extent, but as with compressed extents this shouldn't be done 
by default. At least there would be a one-to-one mapping between file 
blocks and encrypted sectors, so this could be done by extending this 
interface.

>> +#define GET_CSUMS_BUF_MAX    (16 * 1024 * 1024)
> 
> SZ_16M.
> 
> [...]
>>   long btrfs_ioctl(struct file *file, unsigned int
>>           cmd, unsigned long arg)
>>   {
>> @@ -5294,6 +5622,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>>   #endif
>>       case BTRFS_IOC_SUBVOL_SYNC_WAIT:
>>           return btrfs_ioctl_subvol_sync(fs_info, argp);
>> +    case BTRFS_IOC_GET_CSUMS:
>> +        return btrfs_ioctl_get_csums(file, argp);
>>   #ifdef CONFIG_BTRFS_EXPERIMENTAL
>>       case BTRFS_IOC_SHUTDOWN:
>>           return btrfs_ioctl_shutdown(fs_info, arg);
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 9165154a274d94..d079e8b67fd740 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -1100,6 +1100,25 @@ enum btrfs_err_code {
>>       BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
>>   };
>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>> +#define BTRFS_GET_CSUMS_HAS_CSUMS    0
>> +#define BTRFS_GET_CSUMS_ZEROED        1
>> +#define BTRFS_GET_CSUMS_NO_CSUMS    2
>> +
>> +struct btrfs_ioctl_get_csums_entry {
>> +    __u64 offset;        /* file offset of this range */
>> +    __u64 length;        /* length in bytes */
>> +    __u32 type;        /* BTRFS_GET_CSUMS_* type */
>> +    __u32 reserved;        /* padding, must be 0 */
>> +};
>> +
>> +struct btrfs_ioctl_get_csums_args {
>> +    __u64 offset;        /* in/out: file offset */
>> +    __u64 length;        /* in/out: range length */
>> +    __u64 buf_size;        /* in/out: buffer capacity / bytes written */
>> +    __u8 buf[];        /* out: entries + csum data */
> 
> Maybe you want to push more explanation on the output buffer format.
> 
> The resulted buffer would be something like the following example:
> 
> Input:
> 
>   inode has [0, 4K) hole, [4K, 12K) data, isize 12K.
> 
>   args.offset = 0
>   args.length = 1M
>   args.buf_size = 1M
> 
> Output:
> 
>   args.offset = 0
>   args.length = 1M
>   args.buf_size = buf_size_out
>   buf:
> 
>   | [0, 4K) ZEROED | [4K, 12K) HAS_CSUM | CSUM | [12K, 1M) ZEROED |
>   |<------------------------ buf_size_out ----------------------->|
> 
> As it takes me some time to understand the output buffer format from the 
> code, which is different from my initial impression.

Yes, that's right.

> Another thing is, it may be better to add a flag/version member to 
> btrfs_ioctl_get_csums_args.
> 
> If we need to add extra flags to entry->type, or utilize the reserved 
> entry padding for something, or even introduce some new behavior to the 
> output buffer format, we must have a way to tell the end users.

No objections here, that sounds like a good idea. Thanks Qu. One obvious 
future flag would be to return the csums of encrypted extents rather 
than ignoring them, if there was a use case for this.

> Otherwise looks good to me.
> 
> Thanks,
> Qu


  reply	other threads:[~2026-04-13 13:24 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-08 17:46 [PATCH v2] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-04-08 17:51 ` Mark Harmstone
2026-04-09 11:08 ` Qu Wenruo
2026-04-13 13:14   ` Mark Harmstone [this message]
2026-04-13 14:12     ` Daniel Vacek
2026-04-13 14:31       ` Mark Harmstone

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=33a920c9-c876-45bb-b1aa-eed42472efc4@harmstone.com \
    --to=mark@harmstone.com \
    --cc=boris@bur.io \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=neelx@suse.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox