From: Mark Harmstone <mark@harmstone.com>
To: linux-btrfs@vger.kernel.org, wqu@suse.com, boris@bur.io
Subject: Re: [PATCH v2] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
Date: Wed, 8 Apr 2026 18:51:49 +0100
Message-ID: <dc61a0a6-46d2-447d-9520-5b2a5d93a1c4@harmstone.com>
In-Reply-To: <20260408174642.136962-1-mark@harmstone.com>

The only change here is renaming SPARSE to ZEROED, to make it clearer
what the meaning is.

Qu suggested that we could make the output more structured, i.e. not
"__u8 buf[];", but we can return multiple csum entries in one call. So
for instance we could return [ZEROED, HAS_CSUMS, NO_CSUMS], for a file
with a sparse extent at the start, an uncompressed extent, and then a
compressed extent.
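
To make the buffer layout concrete, here is a minimal userspace sketch of
walking such a buffer, assuming the format used in this patch: each entry
header is followed by csum bytes only when its type is HAS_CSUMS. The
struct mirrors the uapi definition using fixed-width types;
walk_csum_entries() and the entry_cb callback are hypothetical names used
for illustration and are not part of the patch or of btrfs-progs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the uapi definitions in the patch (assumed layout). */
#define GET_CSUMS_HAS_CSUMS	0
#define GET_CSUMS_ZEROED	1
#define GET_CSUMS_NO_CSUMS	2

struct get_csums_entry {
	uint64_t offset;	/* file offset of this range */
	uint64_t length;	/* length in bytes */
	uint32_t type;		/* GET_CSUMS_* type */
	uint32_t reserved;	/* padding, 0 */
};

/* Hypothetical callback: csums is NULL unless the entry carries csums. */
typedef void (*entry_cb)(const struct get_csums_entry *e,
			 const uint8_t *csums, size_t csum_bytes, void *ctx);

/*
 * Walk `used` bytes of `buf` as filled in by the ioctl. Returns the number
 * of entries seen, or -1 if an entry or its csum data is truncated.
 */
static int walk_csum_entries(const uint8_t *buf, size_t used,
			     uint32_t sectorsize, uint32_t csum_size,
			     entry_cb cb, void *ctx)
{
	size_t pos = 0;
	int count = 0;

	assert(sectorsize != 0);

	while (pos < used) {
		struct get_csums_entry e;
		size_t csum_bytes = 0;

		if (used - pos < sizeof(e))
			return -1;
		/* Entries may be unaligned after csum data, so copy out. */
		memcpy(&e, buf + pos, sizeof(e));
		pos += sizeof(e);

		if (e.type == GET_CSUMS_HAS_CSUMS) {
			/* One csum per sector covered by this entry. */
			csum_bytes = (size_t)(e.length / sectorsize) * csum_size;
			if (used - pos < csum_bytes)
				return -1;
		}

		if (cb)
			cb(&e, csum_bytes ? buf + pos : NULL, csum_bytes, ctx);

		pos += csum_bytes;
		count++;
	}

	return count;
}
```

For the [ZEROED, HAS_CSUMS, NO_CSUMS] example with 4K sectors and 4-byte
crc32c csums, an 8K uncompressed extent would contribute a 24-byte header
plus 8 bytes of csums, and the other two entries 24 bytes each.
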

The progs PR https://github.com/kdave/btrfs-progs/pull/1096 needs to be
updated to rename SPARSE to ZEROED... but it also conflicts with Qu's PR
https://github.com/kdave/btrfs-progs/pull/1103, so one or the other has
to be rebased anyway.

On 08/04/2026 6.46 pm, Mark Harmstone wrote:
> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
> query the on-disk csums for a file.
>
> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
> the kernel, which details the offset and length we're interested in, and
> a buffer for the kernel to write its results into. The kernel writes a
> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
> csums if available.
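
As a rough illustration of the calling convention described above, the
sketch below sizes and allocates the argument struct in userspace. It
assumes the struct layout from this patch (mirrored locally with
fixed-width types); single_entry_buf_size() and alloc_get_csums_args() are
hypothetical helper names. The size computed is only enough when the whole
range comes back as one HAS_CSUMS entry; a range spanning several extents
needs one extra header per entry, and the kernel side of this patch never
uses more than 16 MiB of the buffer per call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the uapi structs in the patch (assumed layout). */
struct get_csums_entry {
	uint64_t offset;
	uint64_t length;
	uint32_t type;
	uint32_t reserved;
};

struct get_csums_args {
	uint64_t offset;	/* in/out: file offset */
	uint64_t length;	/* in/out: range length */
	uint64_t buf_size;	/* in/out: buffer capacity / bytes written */
	uint8_t buf[];		/* out: entries + csum data */
};

/* Bytes needed if the range is returned as a single HAS_CSUMS entry. */
static uint64_t single_entry_buf_size(uint64_t length, uint32_t sectorsize,
				      uint32_t csum_size)
{
	assert(sectorsize != 0);
	return sizeof(struct get_csums_entry) +
	       (length / sectorsize) * csum_size;
}

/*
 * Allocate zeroed args with room for buf_size bytes of output.
 * offset and length must be sector-aligned, per the checks in the patch.
 */
static struct get_csums_args *alloc_get_csums_args(uint64_t offset,
						   uint64_t length,
						   uint64_t buf_size)
{
	struct get_csums_args *args;

	/* The patch rejects buffers smaller than one entry header. */
	if (buf_size < sizeof(struct get_csums_entry))
		return NULL;

	args = calloc(1, sizeof(*args) + buf_size);
	if (!args)
		return NULL;

	args->offset = offset;
	args->length = length;
	args->buf_size = buf_size;
	return args;
}
```

The returned pointer would then be passed as the third argument of
ioctl(fd, BTRFS_IOC_GET_CSUMS, args) on a file opened for reading.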
>
> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
> the precomputed hash of the zero sector. Otherwise, it sets the type to
> BTRFS_GET_CSUMS_NO_CSUMS.
>
> We do store the csums of compressed extents, but we deliberately don't
> return them here: they're hashed over the compressed data, not the
> uncompressed data that's returned to userspace.
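
The decision described in the two paragraphs above can be summarised as a
small pure function. This is a simplified sketch, not the kernel code: the
extent_info struct and classify_extent() are hypothetical, and the type
values are copied from the uapi header in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Entry types, mirroring BTRFS_GET_CSUMS_* from the patch. */
#define GET_CSUMS_HAS_CSUMS	0
#define GET_CSUMS_ZEROED	1
#define GET_CSUMS_NO_CSUMS	2

/* Hypothetical, simplified description of one file extent. */
enum extent_kind { EXTENT_INLINE, EXTENT_PREALLOC, EXTENT_REG };

struct extent_info {
	enum extent_kind kind;
	bool is_hole;		/* regular extent with disk_bytenr == 0 */
	bool compressed;	/* any compression other than "none" */
	bool nodatasum;		/* inode has the NODATASUM flag set */
};

/* Return the entry type the ioctl would report for this extent. */
static uint32_t classify_extent(const struct extent_info *x)
{
	assert(x != NULL);

	if (x->nodatasum)
		return GET_CSUMS_NO_CSUMS;	/* nothing was ever checksummed */
	if (x->kind == EXTENT_INLINE)
		return GET_CSUMS_NO_CSUMS;	/* inline data has no csum items */
	if (x->kind == EXTENT_PREALLOC || x->is_hole)
		return GET_CSUMS_ZEROED;	/* reads back as zeroes */
	if (x->compressed)
		return GET_CSUMS_NO_CSUMS;	/* csums cover compressed bytes */
	return GET_CSUMS_HAS_CSUMS;
}
```

Implicit holes (NO_HOLES) and ranges beyond EOF have no extent item at
all; the patch reports those as ZEROED as well.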
>
> The main use case for this is speeding up mkfs.btrfs --rootdir. When
> the source FS is btrfs and uses the same csum algorithm, we can avoid
> having to recalculate the csums - in my synthetic benchmarks (16GB file
> on a spinning-rust drive), this resulted in a ~11% speed-up (218s to
> 196s).
>
> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
> reading the data entirely, resulting in a ~2200% speed-up on the same test
> (128s to 6s).
>
> # mkdir rootdir
> # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
>
> (without ioctl)
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir testimg
> ...
> real 3m37.965s
> user 0m5.496s
> sys 0m6.125s
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir --reflink testimg
> ...
> real 2m8.342s
> user 0m5.472s
> sys 0m1.667s
>
> (with ioctl)
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir testimg
> ...
> real 3m15.865s
> user 0m4.258s
> sys 0m6.261s
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir --reflink testimg
> ...
> real 0m5.847s
> user 0m2.899s
> sys 0m0.097s
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
> ---
>  fs/btrfs/ioctl.c           | 330 +++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/btrfs.h |  21 +++
>  2 files changed, 351 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b2e447f5005c16..5cdda33eeaf05a 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -56,6 +56,7 @@
> #include "uuid-tree.h"
> #include "ioctl.h"
> #include "file.h"
> +#include "file-item.h"
> #include "scrub.h"
> #include "super.h"
>
> @@ -5139,6 +5140,333 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
> }
> #endif
>
> +#define GET_CSUMS_BUF_MAX (16 * 1024 * 1024)
> +
> +static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
> +			      u64 len, u8 __user *buf)
> +{
> +	struct btrfs_root *csum_root;
> +	struct btrfs_ordered_sum *sums;
> +	LIST_HEAD(list);
> +	const u32 csum_size = fs_info->csum_size;
> +	int ret;
> +
> +	csum_root = btrfs_csum_root(fs_info, disk_bytenr);
> +
> +	ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
> +				      disk_bytenr + len - 1, &list, false);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* Clear the output buffer to handle potential gaps in csum coverage. */
> +	if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +	while (!list_empty(&list)) {
> +		u64 offset;
> +		size_t copy_size;
> +
> +		sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
> +		list_del(&sums->list);
> +
> +		offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
> +		copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
> +
> +		if (copy_to_user(buf + offset, sums->sums, copy_size)) {
> +			kfree(sums);
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		kfree(sums);
> +	}
> +
> +out:
> +	while (!list_empty(&list)) {
> +		sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
> +		list_del(&sums->list);
> +		kfree(sums);
> +	}
> +	return ret;
> +}
> +
> +static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_inode *bi = BTRFS_I(inode);
> +	struct btrfs_fs_info *fs_info = bi->root->fs_info;
> +	struct btrfs_root *root = bi->root;
> +	struct btrfs_ioctl_get_csums_args args;
> +	BTRFS_PATH_AUTO_FREE(path);
> +	const u64 ino = btrfs_ino(bi);
> +	const u32 sectorsize = fs_info->sectorsize;
> +	const u32 csum_size = fs_info->csum_size;
> +	u8 __user *ubuf;
> +	u64 buf_limit;
> +	u64 buf_used = 0;
> +	u64 cur_offset;
> +	u64 end_offset;
> +	u64 prev_extent_end;
> +	struct btrfs_key key;
> +	int ret;
> +
> +	if (!(file->f_mode & FMODE_READ))
> +		return -EBADF;
> +
> +	if (!S_ISREG(inode->i_mode))
> +		return -EINVAL;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (!IS_ALIGNED(args.offset, sectorsize) ||
> +	    !IS_ALIGNED(args.length, sectorsize))
> +		return -EINVAL;
> +	if (args.length == 0)
> +		return -EINVAL;
> +	if (args.offset + args.length < args.offset)
> +		return -EOVERFLOW;
> +	if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
> +		return -EINVAL;
> +
> +	buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
> +	ubuf = (u8 __user *)(argp + offsetof(struct btrfs_ioctl_get_csums_args, buf));
> +	cur_offset = args.offset;
> +	end_offset = args.offset + args.length;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
> +	if (ret)
> +		return ret;
> +
> +	btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
> +
> +	ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/* NODATASUM early exit. */
> +	if (bi->flags & BTRFS_INODE_NODATASUM) {
> +		struct btrfs_ioctl_get_csums_entry entry = {
> +			.offset = cur_offset,
> +			.length = end_offset - cur_offset,
> +			.type = BTRFS_GET_CSUMS_NO_CSUMS,
> +		};
> +
> +		if (copy_to_user(ubuf, &entry, sizeof(entry))) {
> +			ret = -EFAULT;
> +			goto out_unlock;
> +		}
> +
> +		buf_used = sizeof(entry);
> +		cur_offset = end_offset;
> +		goto done;
> +	}
> +
> +	prev_extent_end = cur_offset;
> +
> +	while (cur_offset < end_offset) {
> +		struct btrfs_file_extent_item *ei;
> +		struct extent_buffer *leaf;
> +		struct btrfs_ioctl_get_csums_entry entry;
> +		u64 extent_end;
> +		u64 disk_bytenr = 0;
> +		u64 extent_offset = 0;
> +		u64 range_start, range_len;
> +		u64 entry_csum_size;
> +		u64 key_offset;
> +		int extent_type;
> +		u8 compression;
> +
> +		/* Search for the extent at or before cur_offset. */
> +		key.objectid = ino;
> +		key.type = BTRFS_EXTENT_DATA_KEY;
> +		key.offset = cur_offset;
> +
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
> +			goto out_unlock;
> +
> +		if (ret > 0 && path->slots[0] > 0) {
> +			btrfs_item_key_to_cpu(path->nodes[0], &key,
> +					      path->slots[0] - 1);
> +			if (key.objectid == ino &&
> +			    key.type == BTRFS_EXTENT_DATA_KEY) {
> +				path->slots[0]--;
> +				if (btrfs_file_extent_end(path) <= cur_offset)
> +					path->slots[0]++;
> +			}
> +		}
> +
> +		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
> +			ret = btrfs_next_leaf(root, path);
> +			if (ret < 0)
> +				goto out_unlock;
> +			if (ret > 0) {
> +				ret = 0;
> +				btrfs_release_path(path);
> +				break;
> +			}
> +		}
> +
> +		leaf = path->nodes[0];
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +			btrfs_release_path(path);
> +			break;
> +		}
> +
> +		extent_end = btrfs_file_extent_end(path);
> +		key_offset = key.offset;
> +
> +		/* Read extent fields before releasing the path. */
> +		ei = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_type = btrfs_file_extent_type(leaf, ei);
> +		compression = btrfs_file_extent_compression(leaf, ei);
> +
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
> +			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
> +			if (disk_bytenr && compression == BTRFS_COMPRESS_NONE)
> +				extent_offset = btrfs_file_extent_offset(leaf, ei);
> +		}
> +
> +		btrfs_release_path(path);
> +
> +		/* Implicit hole (NO_HOLES feature). */
> +		if (prev_extent_end < key_offset) {
> +			u64 hole_end = min(key_offset, end_offset);
> +			u64 hole_len = hole_end - prev_extent_end;
> +
> +			if (prev_extent_end >= cur_offset) {
> +				memset(&entry, 0, sizeof(entry));
> +				entry.offset = prev_extent_end;
> +				entry.length = hole_len;
> +				entry.type = BTRFS_GET_CSUMS_ZEROED;
> +
> +				if (buf_used + sizeof(entry) > buf_limit)
> +					goto done;
> +				if (copy_to_user(ubuf + buf_used, &entry,
> +						 sizeof(entry))) {
> +					ret = -EFAULT;
> +					goto out_unlock;
> +				}
> +				buf_used += sizeof(entry);
> +				cur_offset = hole_end;
> +			}
> +
> +			if (key_offset >= end_offset) {
> +				cur_offset = end_offset;
> +				break;
> +			}
> +		}
> +
> +		/* Clamp to our query range. */
> +		range_start = max(cur_offset, key_offset);
> +		range_len = min(extent_end, end_offset) - range_start;
> +
> +		memset(&entry, 0, sizeof(entry));
> +		entry.offset = range_start;
> +		entry.length = range_len;
> +
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			entry.type = BTRFS_GET_CSUMS_NO_CSUMS;
> +			entry_csum_size = 0;
> +		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
> +			entry.type = BTRFS_GET_CSUMS_ZEROED;
> +			entry_csum_size = 0;
> +		} else {
> +			/* BTRFS_FILE_EXTENT_REG */
> +			if (disk_bytenr == 0) {
> +				/* Explicit hole. */
> +				entry.type = BTRFS_GET_CSUMS_ZEROED;
> +				entry_csum_size = 0;
> +			} else if (compression != BTRFS_COMPRESS_NONE) {
> +				entry.type = BTRFS_GET_CSUMS_NO_CSUMS;
> +				entry_csum_size = 0;
> +			} else {
> +				entry.type = BTRFS_GET_CSUMS_HAS_CSUMS;
> +				entry_csum_size = (range_len >> fs_info->sectorsize_bits) * csum_size;
> +			}
> +		}
> +
> +		/* Check if this entry (+ csum data) fits in the buffer. */
> +		if (buf_used + sizeof(entry) + entry_csum_size > buf_limit) {
> +			if (buf_used == 0) {
> +				ret = -EOVERFLOW;
> +				goto out_unlock;
> +			}
> +			goto done;
> +		}
> +
> +		if (copy_to_user(ubuf + buf_used, &entry, sizeof(entry))) {
> +			ret = -EFAULT;
> +			goto out_unlock;
> +		}
> +		buf_used += sizeof(entry);
> +
> +		if (entry.type == BTRFS_GET_CSUMS_HAS_CSUMS) {
> +			ret = copy_csums_to_user(fs_info,
> +					disk_bytenr + extent_offset + (range_start - key_offset),
> +					range_len, ubuf + buf_used);
> +			if (ret)
> +				goto out_unlock;
> +			buf_used += entry_csum_size;
> +		}
> +
> +		cur_offset = range_start + range_len;
> +		prev_extent_end = extent_end;
> +
> +		if (fatal_signal_pending(current)) {
> +			if (buf_used == 0) {
> +				ret = -EINTR;
> +				goto out_unlock;
> +			}
> +			goto done;
> +		}
> +
> +		cond_resched();
> +	}
> +
> +	/* Handle trailing implicit hole. */
> +	if (cur_offset < end_offset) {
> +		struct btrfs_ioctl_get_csums_entry entry = {
> +			.offset = prev_extent_end,
> +			.length = end_offset - prev_extent_end,
> +			.type = BTRFS_GET_CSUMS_ZEROED,
> +		};
> +
> +		if (buf_used + sizeof(entry) <= buf_limit) {
> +			if (copy_to_user(ubuf + buf_used, &entry,
> +					 sizeof(entry))) {
> +				ret = -EFAULT;
> +				goto out_unlock;
> +			}
> +			buf_used += sizeof(entry);
> +			cur_offset = end_offset;
> +		}
> +	}
> +
> +done:
> +	args.offset = cur_offset;
> +	args.length = (cur_offset < end_offset) ? end_offset - cur_offset : 0;
> +	args.buf_size = buf_used;
> +
> +	if (copy_to_user(argp, &args, sizeof(args)))
> +		ret = -EFAULT;
> +
> +out_unlock:
> +	btrfs_inode_unlock(bi, BTRFS_ILOCK_SHARED);
> +	return ret;
> +}
> +
> long btrfs_ioctl(struct file *file, unsigned int
> cmd, unsigned long arg)
> {
> @@ -5294,6 +5622,8 @@ long btrfs_ioctl(struct file *file, unsigned int
> #endif
> case BTRFS_IOC_SUBVOL_SYNC_WAIT:
> return btrfs_ioctl_subvol_sync(fs_info, argp);
> + case BTRFS_IOC_GET_CSUMS:
> + return btrfs_ioctl_get_csums(file, argp);
> #ifdef CONFIG_BTRFS_EXPERIMENTAL
> case BTRFS_IOC_SHUTDOWN:
> return btrfs_ioctl_shutdown(fs_info, arg);
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 9165154a274d94..d079e8b67fd740 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -1100,6 +1100,25 @@ enum btrfs_err_code {
> BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
> };
>
> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
> +#define BTRFS_GET_CSUMS_HAS_CSUMS	0
> +#define BTRFS_GET_CSUMS_ZEROED		1
> +#define BTRFS_GET_CSUMS_NO_CSUMS	2
> +
> +struct btrfs_ioctl_get_csums_entry {
> +	__u64 offset;		/* file offset of this range */
> +	__u64 length;		/* length in bytes */
> +	__u32 type;		/* BTRFS_GET_CSUMS_* type */
> +	__u32 reserved;		/* padding, must be 0 */
> +};
> +
> +struct btrfs_ioctl_get_csums_args {
> +	__u64 offset;		/* in/out: file offset */
> +	__u64 length;		/* in/out: range length */
> +	__u64 buf_size;		/* in/out: buffer capacity / bytes written */
> +	__u8 buf[];		/* out: entries + csum data */
> +};
> +
> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
> struct btrfs_ioctl_encoded_io_args)
> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
> struct btrfs_ioctl_subvol_wait)
> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
> + struct btrfs_ioctl_get_csums_args)
>
> /* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
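
Since offset, length and buf_size are all in/out fields, a caller that
wants the whole range has to loop. The helper below is a sketch of the
bookkeeping between calls, assuming the semantics in the patch's "done:"
label (offset and length are rewritten to the unprocessed remainder,
buf_size to the number of bytes written). The struct is a local mirror of
the uapi definition, and get_csums_needs_retry() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct btrfs_ioctl_get_csums_args (assumed layout). */
struct get_csums_args {
	uint64_t offset;	/* in/out: file offset */
	uint64_t length;	/* in/out: range length */
	uint64_t buf_size;	/* in/out: buffer capacity / bytes written */
	uint8_t buf[];		/* out: entries + csum data */
};

/*
 * Call after a successful ioctl and after consuming args->buf.
 * Returns false when the whole range has been reported. Otherwise resets
 * buf_size to the buffer capacity, which the kernel overwrote with the
 * number of bytes it produced, and returns true.
 *
 * Intended use (error handling omitted):
 *
 *	do {
 *		if (ioctl(fd, BTRFS_IOC_GET_CSUMS, args) < 0)
 *			break;
 *		consume(args->buf, args->buf_size);
 *	} while (get_csums_needs_retry(args, capacity));
 */
static bool get_csums_needs_retry(struct get_csums_args *args,
				  uint64_t capacity)
{
	assert(args != NULL);

	if (args->length == 0)
		return false;

	args->buf_size = capacity;
	return true;
}
```

Here consume() stands for whatever the caller does with the returned
entries; it is a placeholder, not an existing function.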