From: Mark Harmstone <mark@harmstone.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
Date: Thu, 2 Apr 2026 18:05:53 +0100 [thread overview]
Message-ID: <97ff76b9-5c07-4083-a020-3499ff595460@harmstone.com> (raw)
In-Reply-To: <a1675480-628b-46ba-95c7-f5ee29d6bf14@gmx.com>
On 25/03/2026 9.04 pm, Qu Wenruo wrote:
>
>
> 在 2026/3/26 01:13, Mark Harmstone 写道:
>> On 25/03/2026 7.34 am, Qu Wenruo wrote:
>>>
>>>
>>> 在 2026/3/21 08:48, Qu Wenruo 写道:
>>>>
>>>>
>>>> 在 2026/3/20 23:20, Mark Harmstone 写道:
>>>>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>>>>> query the on-disk csums for a file.
>>>>>
>>>>> This is done by userspace passing a struct
>>>>> btrfs_ioctl_get_csums_args to
>>>>> the kernel, which details the offset and length we're interested
>>>>> in, and
>>>>> a buffer for the kernel to write its results into. The kernel writes a
>>>>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>>>>> csums if available.
>>>>>
>>>>> If the extent is an uncompressed, non-nodatasum extent, the kernel
>>>>> sets
>>>>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>>>>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>>>>> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
>>>>> the precomputed hash of the zero sector. Otherwise, it sets the
>>>>> type to
>>>>> BTRFS_GET_CSUMS_NO_CSUMS.
>>>>
>>>> I'm not sure if it's a good idea to put hole and preallocated range
>>>> into the same BTRFS_GET_CSUMS_SPARSE.
>>>>
>>>> Although both means there is no csum, hole case means there is
>>>> really no data extent, thus we should not create any extent instead
>>>> of writing zero.
>>
>> Thanks Qu.
>>
>> "SPARSE" is probably a bad name for it. It probably should be "ZERO"
>> or somesuch. The point is to tell userspace not to waste time
>> calculating csums, but use the precomputed values because the data
>> would be zero (for whatever reason).
>>
>>>> For preallocated, indicating it has no CSUM can allow mkfs to
>>>> distinguish hole and preallocated, thus change to zero writes to
>>>> prealloc, which is faster and make the resulted fs more aligned to
>>>> the source dir.
>>>>
>>>>
>>>> And for EOF checks, I think we don't need to bother that much, aka,
>>>> just let it return the regular results.
>>>>
>>>> My assumption is, the mkfs shouldn't pass a range completely beyond
>>>> the round_up(i_size), as non-reflink rootdir population would always
>>>> read out the content of the inode from the host fs.
>>>> Thus we won't really read beyond the inode size.
>>>
>>> After more investigation, I think we can put the hole/preallocation/
>>> compression detection into the user space.
>>>
>>> The hole detection is already pending for merge, mostly through
>>> SEEK_DATA/SEEK_HOLE flags of lseek():
>>>
>>> https://github.com/kdave/btrfs-progs/pull/1097
>>>
>>> I'm planning to implement preallocation detection through fiemap,
>>> which also allows us to detect compressed range and skip them for
>>> your case.
>>>
>>> With all those features implemented in progs, we can further simplify
>>> the get csum ioctl, to something more aligned to
>>> btrfs_lookup_csums_bitmap().
>>>
>>> We do not need to bother why there is no checksum for some ranges,
>>> that will be handled by progs first, we only need to return all the
>>> checksums found for the specified range.
>>>
>>> And as an extra safenet, use some bitmap inthe ioctl structure to
>>> indicate which ranges have checksum and which doesn't.
>>>
>>> This will definitely simplify the ioctl as we only need to do csum
>>> tree lookup, no need to bother anything in the subvolume tree.
>>
>> Unfortunately this won't work. You have to explicitly filter out
>> compressed extents, and identifying these requires checking the FS tree.
>
> That's done by progs through fiemap. There will be a flag ENCODED for
> compressed file extents.
No, this still won't work I'm afraid. The ioctl is answering the
question "what's the csum of the sector no. such-and-such in this
file?". That can't be answered for compressed extents, as the csums are
on the compressed data.
There might be a use case for "fetch the csums for a compressed extent",
but it'd be something different.
My concern about relying on ENCODED is that that would also be set when
we implement encryption. For an encrypted uncompressed extent the csum
*would* be meaningful.
>> The reason is that they may be "bookended", and you would leak
>> information about other files if you returned the whole of the csums
>> for the compressed extent. This is the reason why encoded read needs
>> root.
>
> Nope, for the fiemap call, we will never reach any bookend extents.
I really don't think the FIEMAP call achieves anything here. The kernel
still has to do a lookup in the FS tree to determine what the logical
address of the extent is. We can't allow (non-root) users to read the
csums of arbitrary sectors.
> Thanks,
> Qu
>
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>>
>>>>> We do store the csums of compressed extents, but we deliberately don't
>>>>> return them here: they're hashed over the compressed data, not the
>>>>> uncompressed data that's returned to userspace.
>>>>
>>>> I agree with the skip of compressed extents, but I'd prefer to have
>>>> a special flag to indicate that, other than NO_CSUMS.
>>>>
>>>> Or mkfs is unable to distinguish hole and compressed extents.
>>
>> Compressed extents result in NO_CSUMS, a hole results in SPARSE.
>>
>>>> [...]
>>>>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>>>>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>>>>> +#define BTRFS_GET_CSUMS_SPARSE 1
>>>>> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
>>>>> +
>>>>> +struct btrfs_ioctl_get_csums_entry {
>>>>> + __u64 offset; /* file offset of this range */
>>>>> + __u64 length; /* length in bytes */
>>>>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>>>>> + __u32 reserved; /* padding, must be 0 */
>>>>> +};
>>>>> +
>>>>> +struct btrfs_ioctl_get_csums_args {
>>>>> + __u64 offset; /* in/out: file offset */
>>>>> + __u64 length; /* in/out: range length */
>>>>> + __u64 buf_size; /* in/out: buffer capacity / bytes
>>>>> written */
>>>>> + __u8 buf[]; /* out: entries + csum data */
>>>>> +};
>>>>
>>>> From the progs usage, it is always a single
>>>> btrfs_ioctl_get_csums_entry at the beginning of buf[], then real
>>>> buffer for csum, can we just combine both structures into one?
>>>>
>>>> Furthermore, since we only query one extent at one time, the offset/
>>>> length are more or less duplicated between args and entry structure.
>>>>
>>>> We can just save the length into the args without the need for entry
>>>> members (except the type).
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> +
>>>>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_*
>>>>> flags. */
>>>>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>>>>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>>>>> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
>>>>> struct btrfs_ioctl_encoded_io_args)
>>>>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>>>>> struct btrfs_ioctl_subvol_wait)
>>>>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>>>>> + struct btrfs_ioctl_get_csums_args)
>>>>> /* Shutdown ioctl should follow XFS's interfaces, thus not using
>>>>> btrfs magic. */
>>>>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>>>>
>>>>
>>>
>>>
>>
>
next prev parent reply other threads:[~2026-04-02 17:06 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-20 12:50 [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-03-20 13:03 ` Mark Harmstone
2026-03-20 22:18 ` Qu Wenruo
2026-03-25 7:34 ` Qu Wenruo
2026-03-25 14:43 ` Mark Harmstone
2026-03-25 21:04 ` Qu Wenruo
2026-04-02 17:05 ` Mark Harmstone [this message]
2026-04-02 21:46 ` Qu Wenruo
2026-04-03 22:44 ` Boris Burkov
2026-04-03 23:00 ` Qu Wenruo
2026-04-07 18:13 ` Mark Harmstone
2026-04-07 21:52 ` Qu Wenruo
2026-04-07 22:13 ` Boris Burkov
2026-04-07 22:39 ` Qu Wenruo
2026-04-08 13:22 ` Mark Harmstone
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=97ff76b9-5c07-4083-a020-3499ff595460@harmstone.com \
--to=mark@harmstone.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox