* [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
@ 2026-04-13 17:14 Mark Harmstone
2026-04-13 22:17 ` Qu Wenruo
2026-04-14 2:07 ` David Sterba
0 siblings, 2 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-13 17:14 UTC (permalink / raw)
To: linux-btrfs, wqu, boris; +Cc: Mark Harmstone
Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
query the on-disk csums for a file.
The ioctl is deliberately per-file rather than exposing raw csum tree
lookups, to avoid leaking information to users about files they may not
have access to.
This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
the kernel, which details the offset and length we're interested in, and
a buffer for the kernel to write its results into. The kernel writes a
struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
csums if available.
If the extent is an uncompressed, non-nodatasum extent, the kernel sets
the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
csums. If it is sparse, preallocated, or beyond the EOF, it sets the
type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
the precomputed hash of the zero sector. Otherwise, it sets the type to
BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
BTRFS_GET_CSUMS_ENCRYPTED, or BTRFS_GET_CSUMS_INLINE.
For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
produce the following output buffer:
| [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
We do store the csums of compressed extents, but we deliberately don't
return them here: they're hashed over the compressed data, not the
uncompressed data that's returned to userspace. The same will apply to
encrypted data once encryption is supported: those csums will be
computed over the ciphertext.
The main use case for this is for speeding up mkfs.btrfs --rootdir. For
the case when the source FS is btrfs and using the same csum algorithm,
we can avoid having to recalculate the csums - in my synthetic
benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
speed-up (218s to 196s).
When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
reading the data entirely, resulting in a ~2200% speed-up on the same test
(128s to 6s).
# mkdir rootdir
# dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
(without ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m37.965s
user 0m5.496s
sys 0m6.125s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 2m8.342s
user 0m5.472s
sys 0m1.667s
(with ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m15.865s
user 0m4.258s
sys 0m6.261s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 0m5.847s
user 0m2.899s
sys 0m0.097s
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
Changes since v2:
* Added flags field to args for future expansion, must be 0 for now
* Added check that extent isn't encrypted
* Split NO_CSUMS into multiple reasons, as this is cheap and userspace
might find it useful: COMPRESSED, ENCRYPTED, NODATASUM, and INLINE
* Changed to use SZ_16M macro
fs/btrfs/ioctl.c | 337 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 25 +++
2 files changed, 362 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b2e447f5005c16..cb37d793dbceff 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -56,6 +56,7 @@
#include "uuid-tree.h"
#include "ioctl.h"
#include "file.h"
+#include "file-item.h"
#include "scrub.h"
#include "super.h"
@@ -5139,6 +5140,340 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
}
#endif
+#define GET_CSUMS_BUF_MAX SZ_16M
+
+static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
+ u64 len, u8 __user *buf)
+{
+ struct btrfs_root *csum_root;
+ struct btrfs_ordered_sum *sums;
+ LIST_HEAD(list);
+ const u32 csum_size = fs_info->csum_size;
+ int ret;
+
+ csum_root = btrfs_csum_root(fs_info, disk_bytenr);
+
+ ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+ disk_bytenr + len - 1, &list, false);
+ if (ret < 0)
+ return ret;
+
+ /* Clear the output buffer to handle potential gaps in csum coverage. */
+ if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = 0;
+ while (!list_empty(&list)) {
+ u64 offset;
+ size_t copy_size;
+
+ sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
+ list_del(&sums->list);
+
+ offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
+ copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
+
+ if (copy_to_user(buf + offset, sums->sums, copy_size)) {
+ kfree(sums);
+ ret = -EFAULT;
+ goto out;
+ }
+
+ kfree(sums);
+ }
+
+out:
+ while (!list_empty(&list)) {
+ sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
+ list_del(&sums->list);
+ kfree(sums);
+ }
+ return ret;
+}
+
+static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
+{
+ struct inode *inode = file_inode(file);
+ struct btrfs_inode *bi = BTRFS_I(inode);
+ struct btrfs_fs_info *fs_info = bi->root->fs_info;
+ struct btrfs_root *root = bi->root;
+ struct btrfs_ioctl_get_csums_args args;
+ BTRFS_PATH_AUTO_FREE(path);
+ const u64 ino = btrfs_ino(bi);
+ const u32 sectorsize = fs_info->sectorsize;
+ const u32 csum_size = fs_info->csum_size;
+ u8 __user *ubuf;
+ u64 buf_limit;
+ u64 buf_used = 0;
+ u64 cur_offset;
+ u64 end_offset;
+ u64 prev_extent_end;
+ struct btrfs_key key;
+ int ret;
+
+ if (!(file->f_mode & FMODE_READ))
+ return -EBADF;
+
+ if (!S_ISREG(inode->i_mode))
+ return -EINVAL;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ if (!IS_ALIGNED(args.offset, sectorsize) ||
+ !IS_ALIGNED(args.length, sectorsize))
+ return -EINVAL;
+ if (args.length == 0)
+ return -EINVAL;
+ if (args.offset + args.length < args.offset)
+ return -EOVERFLOW;
+ if (args.flags != 0)
+ return -EINVAL;
+ if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
+ return -EINVAL;
+
+ buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
+ ubuf = (u8 __user *)(argp + offsetof(struct btrfs_ioctl_get_csums_args, buf));
+ cur_offset = args.offset;
+ end_offset = args.offset + args.length;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
+ if (ret)
+ return ret;
+
+ btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
+
+ ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
+ if (ret)
+ goto out_unlock;
+
+ /* NODATASUM early exit. */
+ if (bi->flags & BTRFS_INODE_NODATASUM) {
+ struct btrfs_ioctl_get_csums_entry entry = {
+ .offset = cur_offset,
+ .length = end_offset - cur_offset,
+ .type = BTRFS_GET_CSUMS_NODATASUM,
+ };
+
+ if (copy_to_user(ubuf, &entry, sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ buf_used = sizeof(entry);
+ cur_offset = end_offset;
+ goto done;
+ }
+
+ prev_extent_end = cur_offset;
+
+ while (cur_offset < end_offset) {
+ struct btrfs_file_extent_item *ei;
+ struct extent_buffer *leaf;
+ struct btrfs_ioctl_get_csums_entry entry;
+ u64 extent_end;
+ u64 disk_bytenr = 0;
+ u64 extent_offset = 0;
+ u64 range_start, range_len;
+ u64 entry_csum_size;
+ u64 key_offset;
+ int extent_type;
+ u8 compression;
+ u8 encryption;
+
+ /* Search for the extent at or before cur_offset. */
+ key.objectid = ino;
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = cur_offset;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
+ goto out_unlock;
+
+ if (ret > 0 && path->slots[0] > 0) {
+ btrfs_item_key_to_cpu(path->nodes[0], &key,
+ path->slots[0] - 1);
+ if (key.objectid == ino &&
+ key.type == BTRFS_EXTENT_DATA_KEY) {
+ path->slots[0]--;
+ if (btrfs_file_extent_end(path) <= cur_offset)
+ path->slots[0]++;
+ }
+ }
+
+ if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+ ret = btrfs_next_leaf(root, path);
+ if (ret < 0)
+ goto out_unlock;
+ if (ret > 0) {
+ ret = 0;
+ btrfs_release_path(path);
+ break;
+ }
+ }
+
+ leaf = path->nodes[0];
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+ btrfs_release_path(path);
+ break;
+ }
+
+ extent_end = btrfs_file_extent_end(path);
+ key_offset = key.offset;
+
+ /* Read extent fields before releasing the path. */
+ ei = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_type = btrfs_file_extent_type(leaf, ei);
+ compression = btrfs_file_extent_compression(leaf, ei);
+ encryption = btrfs_file_extent_encryption(leaf, ei);
+
+ if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr && compression == BTRFS_COMPRESS_NONE)
+ extent_offset = btrfs_file_extent_offset(leaf, ei);
+ }
+
+ btrfs_release_path(path);
+
+ /* Implicit hole (NO_HOLES feature). */
+ if (prev_extent_end < key_offset) {
+ u64 hole_end = min(key_offset, end_offset);
+ u64 hole_len = hole_end - prev_extent_end;
+
+ if (prev_extent_end >= cur_offset) {
+ memset(&entry, 0, sizeof(entry));
+ entry.offset = prev_extent_end;
+ entry.length = hole_len;
+ entry.type = BTRFS_GET_CSUMS_ZEROED;
+
+ if (buf_used + sizeof(entry) > buf_limit)
+ goto done;
+ if (copy_to_user(ubuf + buf_used, &entry,
+ sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+ cur_offset = hole_end;
+ }
+
+ if (key_offset >= end_offset) {
+ cur_offset = end_offset;
+ break;
+ }
+ }
+
+ /* Clamp to our query range. */
+ range_start = max(cur_offset, key_offset);
+ range_len = min(extent_end, end_offset) - range_start;
+
+ memset(&entry, 0, sizeof(entry));
+ entry.offset = range_start;
+ entry.length = range_len;
+
+ if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+ entry.type = BTRFS_GET_CSUMS_INLINE;
+ entry_csum_size = 0;
+ } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+ entry.type = BTRFS_GET_CSUMS_ZEROED;
+ entry_csum_size = 0;
+ } else {
+ /* BTRFS_FILE_EXTENT_REG */
+ if (disk_bytenr == 0) {
+ /* Explicit hole. */
+ entry.type = BTRFS_GET_CSUMS_ZEROED;
+ entry_csum_size = 0;
+ } else if (encryption != 0) {
+ entry.type = BTRFS_GET_CSUMS_ENCRYPTED;
+ entry_csum_size = 0;
+ } else if (compression != BTRFS_COMPRESS_NONE) {
+ entry.type = BTRFS_GET_CSUMS_COMPRESSED;
+ entry_csum_size = 0;
+ } else {
+ entry.type = BTRFS_GET_CSUMS_HAS_CSUMS;
+ entry_csum_size = (range_len >> fs_info->sectorsize_bits) * csum_size;
+ }
+ }
+
+ /* Check if this entry (+ csum data) fits in the buffer. */
+ if (buf_used + sizeof(entry) + entry_csum_size > buf_limit) {
+ if (buf_used == 0) {
+ ret = -EOVERFLOW;
+ goto out_unlock;
+ }
+ goto done;
+ }
+
+ if (copy_to_user(ubuf + buf_used, &entry, sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+
+ if (entry.type == BTRFS_GET_CSUMS_HAS_CSUMS) {
+ ret = copy_csums_to_user(fs_info,
+ disk_bytenr + extent_offset + (range_start - key_offset),
+ range_len, ubuf + buf_used);
+ if (ret)
+ goto out_unlock;
+ buf_used += entry_csum_size;
+ }
+
+ cur_offset = range_start + range_len;
+ prev_extent_end = extent_end;
+
+ if (fatal_signal_pending(current)) {
+ if (buf_used == 0) {
+ ret = -EINTR;
+ goto out_unlock;
+ }
+ goto done;
+ }
+
+ cond_resched();
+ }
+
+ /* Handle trailing implicit hole. */
+ if (cur_offset < end_offset) {
+ struct btrfs_ioctl_get_csums_entry entry = {
+ .offset = prev_extent_end,
+ .length = end_offset - prev_extent_end,
+ .type = BTRFS_GET_CSUMS_ZEROED,
+ };
+
+ if (buf_used + sizeof(entry) <= buf_limit) {
+ if (copy_to_user(ubuf + buf_used, &entry,
+ sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+ cur_offset = end_offset;
+ }
+ }
+
+done:
+ args.offset = cur_offset;
+ args.length = (cur_offset < end_offset) ? end_offset - cur_offset : 0;
+ args.buf_size = buf_used;
+
+ if (copy_to_user(argp, &args, sizeof(args)))
+ ret = -EFAULT;
+
+out_unlock:
+ btrfs_inode_unlock(bi, BTRFS_ILOCK_SHARED);
+ return ret;
+}
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -5294,6 +5629,8 @@ long btrfs_ioctl(struct file *file, unsigned int
#endif
case BTRFS_IOC_SUBVOL_SYNC_WAIT:
return btrfs_ioctl_subvol_sync(fs_info, argp);
+ case BTRFS_IOC_GET_CSUMS:
+ return btrfs_ioctl_get_csums(file, argp);
#ifdef CONFIG_BTRFS_EXPERIMENTAL
case BTRFS_IOC_SHUTDOWN:
return btrfs_ioctl_shutdown(fs_info, arg);
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 9165154a274d94..9b24218682990d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1100,6 +1100,29 @@ enum btrfs_err_code {
BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
};
+/* Types for struct btrfs_ioctl_get_csums_entry::type */
+#define BTRFS_GET_CSUMS_HAS_CSUMS 0
+#define BTRFS_GET_CSUMS_ZEROED 1
+#define BTRFS_GET_CSUMS_NODATASUM 2
+#define BTRFS_GET_CSUMS_COMPRESSED 3
+#define BTRFS_GET_CSUMS_ENCRYPTED 4
+#define BTRFS_GET_CSUMS_INLINE 5
+
+struct btrfs_ioctl_get_csums_entry {
+ __u64 offset; /* file offset of this range */
+ __u64 length; /* length in bytes */
+ __u32 type; /* BTRFS_GET_CSUMS_* type */
+ __u32 reserved; /* padding, must be 0 */
+};
+
+struct btrfs_ioctl_get_csums_args {
+ __u64 offset; /* in/out: file offset */
+ __u64 length; /* in/out: range length */
+ __u64 buf_size; /* in/out: buffer capacity / bytes written */
+ __u64 flags; /* in: flags, must be 0 for now */
+ __u8 buf[]; /* out: entries + csum data */
+};
+
/* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
#define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
@@ -1226,6 +1249,8 @@ enum btrfs_err_code {
struct btrfs_ioctl_encoded_io_args)
#define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
struct btrfs_ioctl_subvol_wait)
+#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
+ struct btrfs_ioctl_get_csums_args)
/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
#define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
--
2.52.0
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-13 17:14 [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
@ 2026-04-13 22:17 ` Qu Wenruo
2026-04-14 12:57 ` Mark Harmstone
2026-04-14 2:07 ` David Sterba
1 sibling, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2026-04-13 22:17 UTC (permalink / raw)
To: Mark Harmstone, linux-btrfs, wqu, boris
On 2026/4/14 02:44, Mark Harmstone wrote:
> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
> query the on-disk csums for a file.
>
> The ioctl is deliberately per-file rather than exposing raw csum tree
> lookups, to avoid leaking information to users about files they may not
> have access to.
>
> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
> the kernel, which details the offset and length we're interested in, and
> a buffer for the kernel to write its results into. The kernel writes a
> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
> csums if available.
>
> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
> the precomputed hash of the zero sector. Otherwise, it sets the type to
> BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
> BTRFS_GET_CSUMS_ENCRYPTED, or BTRFS_GET_CSUMS_INLINE.
>
> For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
> produce the following output buffer:
>
> | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
>
> We do store the csums of compressed extents, but we deliberately don't
> return them here: they're hashed over the compressed data, not the
> uncompressed data that's returned to userspace. The same will apply to
> encrypted data once encryption is supported: those csums will be
> computed over the ciphertext.
>
> The main use case for this is for speeding up mkfs.btrfs --rootdir. For
> the case when the source FS is btrfs and using the same csum algorithm,
> we can avoid having to recalculate the csums - in my synthetic
> benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
> speed-up (218s to 196s).
>
> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
> reading the data entirely, resulting in a ~2200% speed-up on the same test
> (128s to 6s).
>
> # mkdir rootdir
> # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
>
> (without ioctl)
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir testimg
> ...
> real 3m37.965s
> user 0m5.496s
> sys 0m6.125s
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir --reflink testimg
> ...
> real 2m8.342s
> user 0m5.472s
> sys 0m1.667s
>
> (with ioctl)
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir testimg
> ...
> real 3m15.865s
> user 0m4.258s
> sys 0m6.261s
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time mkfs.btrfs --rootdir rootdir --reflink testimg
> ...
> real 0m5.847s
> user 0m2.899s
> sys 0m0.097s
>
> Signed-off-by: Mark Harmstone <mark@harmstone.com>
Looks good to me for the current usage.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Although the discussion about encryption makes me wonder whether
btrfs_ioctl_get_csums_entry::type should be bit flags.
For now it's fine, but as Daniel mentioned, in the future we may want to
return the csums of such encrypted/compressed data.
In that case we'd need extra checks on @flags to determine whether an
ENCRYPTED/COMPRESSED entry is followed by a csum or not.
If @type were bit flags, we could easily have things like
COMPRESSED|HAS_CSUM to express that.
Feel free to ignore this - it would just make future expansion easier;
it's not critical, and we can definitely find a way to handle future
changes using the new @flags.
Thanks,
Qu
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-13 17:14 [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-04-13 22:17 ` Qu Wenruo
@ 2026-04-14 2:07 ` David Sterba
2026-04-15 13:56 ` Mark Harmstone
1 sibling, 1 reply; 7+ messages in thread
From: David Sterba @ 2026-04-14 2:07 UTC (permalink / raw)
To: Mark Harmstone; +Cc: linux-btrfs, wqu, boris
On Mon, Apr 13, 2026 at 06:14:16PM +0100, Mark Harmstone wrote:
> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
> query the on-disk csums for a file.
>
> The ioctl is deliberately per-file rather than exposing raw csum tree
> lookups, to avoid leaking information to users about files they may not
> have access to.
>
> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
> the kernel, which details the offset and length we're interested in, and
> a buffer for the kernel to write its results into. The kernel writes a
> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
> csums if available.
>
> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
> the precomputed hash of the zero sector. Otherwise, it sets the type to
> BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
> BTRFS_GET_CSUMS_ENCRYPTED, or BTRFS_GET_CSUMS_INLINE.
>
> For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
> produce the following output buffer:
>
> | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
>
> We do store the csums of compressed extents, but we deliberately don't
> return them here: they're hashed over the compressed data, not the
> uncompressed data that's returned to userspace. The same will apply to
> encrypted data once encryption is supported: those csums will be
> computed over the ciphertext.
>
> The main use case for this is for speeding up mkfs.btrfs --rootdir.
In the past reading the checksums was desired for deduplication, and we
had some proposed ioctls but I think they were not extensive enough to
also cover holes and compression. Basically a wrapper for the search
tree ioctl on the checksum tree.
The handling of all the types in this patch seems complete.
> For
> the case when the source FS is btrfs and using the same csum algorithm,
> we can avoid having to recalculate the csums - in my synthetic
> benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
> speed-up (218s to 196s).
>
> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
> reading the data entirely, resulting in a ~2200% speed-up on the same test
> (128s to 6s).
Repeated mkfs is a specific use case, normally it happens just once but I
understand that for preparing various images the time savings are
significant.
Once this ioctl is available I think the deduplication tools will make
use of it as an initial filter before actually doing the real
deduplication of extents.
As this is an unprivileged ioctl touching structures and memory, we'll
need a security pass on the final version. So far I haven't seen any
issues.
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -56,6 +56,7 @@
> #include "uuid-tree.h"
> #include "ioctl.h"
> #include "file.h"
> +#include "file-item.h"
> #include "scrub.h"
> #include "super.h"
>
> @@ -5139,6 +5140,340 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
> }
> #endif
>
> +#define GET_CSUMS_BUF_MAX SZ_16M
> +
> +static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
> + u64 len, u8 __user *buf)
> +{
> + struct btrfs_root *csum_root;
> + struct btrfs_ordered_sum *sums;
> + LIST_HEAD(list);
> + const u32 csum_size = fs_info->csum_size;
> + int ret;
> +
> + csum_root = btrfs_csum_root(fs_info, disk_bytenr);
> +
> + ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
> + disk_bytenr + len - 1, &list, false);
> + if (ret < 0)
> + return ret;
> +
> + /* Clear the output buffer to handle potential gaps in csum coverage. */
> + if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
Clearing the partial range here is good but I think it should be done
globally right at the start of the ioctl, regardless of how much data we
return.
OTOH this can have a performance hit in case a large buffer is provided
by default but used for various, possibly short, ranges not matching
the entire buffer. So repeated calls will have to clear 16M but return
a few kilobytes.
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = 0;
> + while (!list_empty(&list)) {
> + u64 offset;
> + size_t copy_size;
> +
> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
> + list_del(&sums->list);
> +
> + offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
> + copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
> +
> + if (copy_to_user(buf + offset, sums->sums, copy_size)) {
> + kfree(sums);
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + kfree(sums);
> + }
> +
> +out:
> + while (!list_empty(&list)) {
> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
> + list_del(&sums->list);
> + kfree(sums);
> + }
> + return ret;
> +}
> +
> +static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
> +{
> + struct inode *inode = file_inode(file);
Please use vfs_inode
> + struct btrfs_inode *bi = BTRFS_I(inode);
and plain inode
> + struct btrfs_fs_info *fs_info = bi->root->fs_info;
> + struct btrfs_root *root = bi->root;
> + struct btrfs_ioctl_get_csums_args args;
> + BTRFS_PATH_AUTO_FREE(path);
> + const u64 ino = btrfs_ino(bi);
> + const u32 sectorsize = fs_info->sectorsize;
> + const u32 csum_size = fs_info->csum_size;
> + u8 __user *ubuf;
> + u64 buf_limit;
> + u64 buf_used = 0;
> + u64 cur_offset;
> + u64 end_offset;
> + u64 prev_extent_end;
> + struct btrfs_key key;
> + int ret;
> +
> + if (!(file->f_mode & FMODE_READ))
> + return -EBADF;
> +
> + if (!S_ISREG(inode->i_mode))
> + return -EINVAL;
> +
> + if (copy_from_user(&args, argp, sizeof(args)))
> + return -EFAULT;
> +
> + if (!IS_ALIGNED(args.offset, sectorsize) ||
> + !IS_ALIGNED(args.length, sectorsize))
> + return -EINVAL;
> + if (args.length == 0)
> + return -EINVAL;
> + if (args.offset + args.length < args.offset)
> + return -EOVERFLOW;
> + if (args.flags != 0)
> + return -EINVAL;
> + if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
> + return -EINVAL;
> +
> + buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
> + ubuf = (u8 __user *)(argp + offsetof(struct btrfs_ioctl_get_csums_args, buf));
> + cur_offset = args.offset;
> + end_offset = args.offset + args.length;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
> + if (ret)
> + return ret;
> +
> + btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
This needs to be interruptible, all other points are. There's no inode
lock helper for that so you'd have to add one or call
down_read_interruptible().
> +
> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
> + if (ret)
> + goto out_unlock;
> +
> + /* NODATASUM early exit. */
> + if (bi->flags & BTRFS_INODE_NODATASUM) {
> + struct btrfs_ioctl_get_csums_entry entry = {
> + .offset = cur_offset,
> + .length = end_offset - cur_offset,
> + .type = BTRFS_GET_CSUMS_NODATASUM,
> + };
> +
> + if (copy_to_user(ubuf, &entry, sizeof(entry))) {
> + ret = -EFAULT;
> + goto out_unlock;
> + }
> +
> + buf_used = sizeof(entry);
> + cur_offset = end_offset;
> + goto done;
> + }
> +
> + prev_extent_end = cur_offset;
> +
> + while (cur_offset < end_offset) {
> + struct btrfs_file_extent_item *ei;
> + struct extent_buffer *leaf;
> + struct btrfs_ioctl_get_csums_entry entry;
Please initialize that right away with = { 0 }; it's done later by
memset() but for something filled partially it's better to do it at the
beginning than to risk forgetting to do it later.
Overall comment: please check the final patch for unnecessary line
breaks where lines overflow 80 chars only slightly. I've seen a few and
would have fixed them, but I think I've been pointing that out enough in
the past. I'll fix anything that's left if needed.
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-13 22:17 ` Qu Wenruo
@ 2026-04-14 12:57 ` Mark Harmstone
0 siblings, 0 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-14 12:57 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs, wqu, boris
On 13/04/2026 11.17 pm, Qu Wenruo wrote:
>
>
> 在 2026/4/14 02:44, Mark Harmstone 写道:
>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>> query the on-disk csums for a file.
>>
>> The ioctl is deliberately per-file rather than exposing raw csum tree
>> lookups, to avoid leaking information to users about files they may not
>> have access to.
>>
>> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
>> the kernel, which details the offset and length we're interested in, and
>> a buffer for the kernel to write its results into. The kernel writes a
>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>> csums if available.
>>
>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
>> the precomputed hash of the zero sector. Otherwise, it sets the type to
>> BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
>> BTRFS_GET_CSUMS_ENCRYPTED, or BTRFS_GET_CSUMS_INLINE.
>>
>> For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
>> produce the following output buffer:
>>
>> | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
>>
>> We do store the csums of compressed extents, but we deliberately don't
>> return them here: they're hashed over the compressed data, not the
>> uncompressed data that's returned to userspace. Similarly for encrypted
>> data, once encryption is supported: the csums will be over the
>> ciphertext.
>>
>> The main use case for this is for speeding up mkfs.btrfs --rootdir. For
>> the case when the source FS is btrfs and using the same csum algorithm,
>> we can avoid having to recalculate the csums - in my synthetic
>> benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
>> speed-up (218s to 196s).
>>
>> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
>> reading the data entirely, resulting in a ~2200% speed-up on the same test
>> (128s to 6s).
>>
>> # mkdir rootdir
>> # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
>>
>> (without ioctl)
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time mkfs.btrfs --rootdir rootdir testimg
>> ...
>> real 3m37.965s
>> user 0m5.496s
>> sys 0m6.125s
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time mkfs.btrfs --rootdir rootdir --reflink testimg
>> ...
>> real 2m8.342s
>> user 0m5.472s
>> sys 0m1.667s
>>
>> (with ioctl)
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time mkfs.btrfs --rootdir rootdir testimg
>> ...
>> real 3m15.865s
>> user 0m4.258s
>> sys 0m6.261s
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time mkfs.btrfs --rootdir rootdir --reflink testimg
>> ...
>> real 0m5.847s
>> user 0m2.899s
>> sys 0m0.097s
>>
>> Signed-off-by: Mark Harmstone <mark@harmstone.com>
>
> Looks good to me for the current usage.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
>
> Although the discussion about encryption makes me wonder if
> btrfs_ioctl_get_csums_entry::type should be bit flags.
>
> For now it's fine, but as Daniel mentioned, in the future we may want to
> return the csum of such encrypted/compressed data.
> In that case we need extra checks on the @flags to determine if
> ENCRYPTED/COMPRESSED is followed by a csum or not.
>
> If @type is bit flags, we can easily have things like COMPRESSED|HAS_CSUM
> to indicate that future change.
>
> Feel free to ignore this, it's just making the future expansion easier,
> not critical and we can definitely find a way using the new @flags to
> handle future change.
Thanks Qu. Yes, I agree, bit flags make more sense.
> Thanks,
> Qu
>
>> ---
>> Changes since v2:
>> * Added flags field to args for future expansion, must be 0 for now
>> * Added check that extent isn't encrypted
>> * Split NO_CSUMS into multiple reasons, as this is cheap and userspace
>> might find it useful: COMPRESSED, ENCRYPTED, NODATASUM, and INLINE
>> * Changed to use SZ_16M macro
>>
>> fs/btrfs/ioctl.c | 337 +++++++++++++++++++++++++++++++++++++
>> include/uapi/linux/btrfs.h | 25 +++
>> 2 files changed, 362 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index b2e447f5005c16..cb37d793dbceff 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -56,6 +56,7 @@
>> #include "uuid-tree.h"
>> #include "ioctl.h"
>> #include "file.h"
>> +#include "file-item.h"
>> #include "scrub.h"
>> #include "super.h"
>> @@ -5139,6 +5140,340 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
>> }
>> #endif
>> +#define GET_CSUMS_BUF_MAX SZ_16M
>> +
>> +static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
>> + u64 len, u8 __user *buf)
>> +{
>> + struct btrfs_root *csum_root;
>> + struct btrfs_ordered_sum *sums;
>> + LIST_HEAD(list);
>> + const u32 csum_size = fs_info->csum_size;
>> + int ret;
>> +
>> + csum_root = btrfs_csum_root(fs_info, disk_bytenr);
>> +
>> + ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
>> + disk_bytenr + len - 1, &list, false);
>> + if (ret < 0)
>> + return ret;
>> +
>> + /* Clear the output buffer to handle potential gaps in csum coverage. */
>> + if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + ret = 0;
>> + while (!list_empty(&list)) {
>> + u64 offset;
>> + size_t copy_size;
>> +
>> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
>> + list_del(&sums->list);
>> +
>> + offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
>> + copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
>> +
>> + if (copy_to_user(buf + offset, sums->sums, copy_size)) {
>> + kfree(sums);
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + kfree(sums);
>> + }
>> +
>> +out:
>> + while (!list_empty(&list)) {
>> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
>> + list_del(&sums->list);
>> + kfree(sums);
>> + }
>> + return ret;
>> +}
>> +
>> +static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
>> +{
>> + struct inode *inode = file_inode(file);
>> + struct btrfs_inode *bi = BTRFS_I(inode);
>> + struct btrfs_fs_info *fs_info = bi->root->fs_info;
>> + struct btrfs_root *root = bi->root;
>> + struct btrfs_ioctl_get_csums_args args;
>> + BTRFS_PATH_AUTO_FREE(path);
>> + const u64 ino = btrfs_ino(bi);
>> + const u32 sectorsize = fs_info->sectorsize;
>> + const u32 csum_size = fs_info->csum_size;
>> + u8 __user *ubuf;
>> + u64 buf_limit;
>> + u64 buf_used = 0;
>> + u64 cur_offset;
>> + u64 end_offset;
>> + u64 prev_extent_end;
>> + struct btrfs_key key;
>> + int ret;
>> +
>> + if (!(file->f_mode & FMODE_READ))
>> + return -EBADF;
>> +
>> + if (!S_ISREG(inode->i_mode))
>> + return -EINVAL;
>> +
>> + if (copy_from_user(&args, argp, sizeof(args)))
>> + return -EFAULT;
>> +
>> + if (!IS_ALIGNED(args.offset, sectorsize) ||
>> + !IS_ALIGNED(args.length, sectorsize))
>> + return -EINVAL;
>> + if (args.length == 0)
>> + return -EINVAL;
>> + if (args.offset + args.length < args.offset)
>> + return -EOVERFLOW;
>> + if (args.flags != 0)
>> + return -EINVAL;
>> + if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
>> + return -EINVAL;
>> +
>> + buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
>> + ubuf = (u8 __user *)(argp + offsetof(struct
>> btrfs_ioctl_get_csums_args, buf));
>> + cur_offset = args.offset;
>> + end_offset = args.offset + args.length;
>> +
>> + path = btrfs_alloc_path();
>> + if (!path)
>> + return -ENOMEM;
>> +
>> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
>> + if (ret)
>> + return ret;
>> +
>> + btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
>> +
>> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
>> + if (ret)
>> + goto out_unlock;
>> +
>> + /* NODATASUM early exit. */
>> + if (bi->flags & BTRFS_INODE_NODATASUM) {
>> + struct btrfs_ioctl_get_csums_entry entry = {
>> + .offset = cur_offset,
>> + .length = end_offset - cur_offset,
>> + .type = BTRFS_GET_CSUMS_NODATASUM,
>> + };
>> +
>> + if (copy_to_user(ubuf, &entry, sizeof(entry))) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> +
>> + buf_used = sizeof(entry);
>> + cur_offset = end_offset;
>> + goto done;
>> + }
>> +
>> + prev_extent_end = cur_offset;
>> +
>> + while (cur_offset < end_offset) {
>> + struct btrfs_file_extent_item *ei;
>> + struct extent_buffer *leaf;
>> + struct btrfs_ioctl_get_csums_entry entry;
>> + u64 extent_end;
>> + u64 disk_bytenr = 0;
>> + u64 extent_offset = 0;
>> + u64 range_start, range_len;
>> + u64 entry_csum_size;
>> + u64 key_offset;
>> + int extent_type;
>> + u8 compression;
>> + u8 encryption;
>> +
>> + /* Search for the extent at or before cur_offset. */
>> + key.objectid = ino;
>> + key.type = BTRFS_EXTENT_DATA_KEY;
>> + key.offset = cur_offset;
>> +
>> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> + if (ret < 0)
>> + goto out_unlock;
>> +
>> + if (ret > 0 && path->slots[0] > 0) {
>> + btrfs_item_key_to_cpu(path->nodes[0], &key,
>> + path->slots[0] - 1);
>> + if (key.objectid == ino &&
>> + key.type == BTRFS_EXTENT_DATA_KEY) {
>> + path->slots[0]--;
>> + if (btrfs_file_extent_end(path) <= cur_offset)
>> + path->slots[0]++;
>> + }
>> + }
>> +
>> + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
>> + ret = btrfs_next_leaf(root, path);
>> + if (ret < 0)
>> + goto out_unlock;
>> + if (ret > 0) {
>> + ret = 0;
>> + btrfs_release_path(path);
>> + break;
>> + }
>> + }
>> +
>> + leaf = path->nodes[0];
>> +
>> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>> + btrfs_release_path(path);
>> + break;
>> + }
>> +
>> + extent_end = btrfs_file_extent_end(path);
>> + key_offset = key.offset;
>> +
>> + /* Read extent fields before releasing the path. */
>> + ei = btrfs_item_ptr(leaf, path->slots[0],
>> + struct btrfs_file_extent_item);
>> + extent_type = btrfs_file_extent_type(leaf, ei);
>> + compression = btrfs_file_extent_compression(leaf, ei);
>> + encryption = btrfs_file_extent_encryption(leaf, ei);
>> +
>> + if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
>> + disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
>> + if (disk_bytenr && compression == BTRFS_COMPRESS_NONE)
>> + extent_offset = btrfs_file_extent_offset(leaf, ei);
>> + }
>> +
>> + btrfs_release_path(path);
>> +
>> + /* Implicit hole (NO_HOLES feature). */
>> + if (prev_extent_end < key_offset) {
>> + u64 hole_end = min(key_offset, end_offset);
>> + u64 hole_len = hole_end - prev_extent_end;
>> +
>> + if (prev_extent_end >= cur_offset) {
>> + memset(&entry, 0, sizeof(entry));
>> + entry.offset = prev_extent_end;
>> + entry.length = hole_len;
>> + entry.type = BTRFS_GET_CSUMS_ZEROED;
>> +
>> + if (buf_used + sizeof(entry) > buf_limit)
>> + goto done;
>> + if (copy_to_user(ubuf + buf_used, &entry,
>> + sizeof(entry))) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> + buf_used += sizeof(entry);
>> + cur_offset = hole_end;
>> + }
>> +
>> + if (key_offset >= end_offset) {
>> + cur_offset = end_offset;
>> + break;
>> + }
>> + }
>> +
>> + /* Clamp to our query range. */
>> + range_start = max(cur_offset, key_offset);
>> + range_len = min(extent_end, end_offset) - range_start;
>> +
>> + memset(&entry, 0, sizeof(entry));
>> + entry.offset = range_start;
>> + entry.length = range_len;
>> +
>> + if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>> + entry.type = BTRFS_GET_CSUMS_INLINE;
>> + entry_csum_size = 0;
>> + } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
>> + entry.type = BTRFS_GET_CSUMS_ZEROED;
>> + entry_csum_size = 0;
>> + } else {
>> + /* BTRFS_FILE_EXTENT_REG */
>> + if (disk_bytenr == 0) {
>> + /* Explicit hole. */
>> + entry.type = BTRFS_GET_CSUMS_ZEROED;
>> + entry_csum_size = 0;
>> + } else if (encryption != 0) {
>> + entry.type = BTRFS_GET_CSUMS_ENCRYPTED;
>> + entry_csum_size = 0;
>> + } else if (compression != BTRFS_COMPRESS_NONE) {
>> + entry.type = BTRFS_GET_CSUMS_COMPRESSED;
>> + entry_csum_size = 0;
>> + } else {
>> + entry.type = BTRFS_GET_CSUMS_HAS_CSUMS;
>> + entry_csum_size = (range_len >> fs_info->sectorsize_bits) * csum_size;
>> + }
>> + }
>> +
>> + /* Check if this entry (+ csum data) fits in the buffer. */
>> + if (buf_used + sizeof(entry) + entry_csum_size > buf_limit) {
>> + if (buf_used == 0) {
>> + ret = -EOVERFLOW;
>> + goto out_unlock;
>> + }
>> + goto done;
>> + }
>> +
>> + if (copy_to_user(ubuf + buf_used, &entry, sizeof(entry))) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> + buf_used += sizeof(entry);
>> +
>> + if (entry.type == BTRFS_GET_CSUMS_HAS_CSUMS) {
>> + ret = copy_csums_to_user(fs_info,
>> + disk_bytenr + extent_offset + (range_start - key_offset),
>> + range_len, ubuf + buf_used);
>> + if (ret)
>> + goto out_unlock;
>> + buf_used += entry_csum_size;
>> + }
>> +
>> + cur_offset = range_start + range_len;
>> + prev_extent_end = extent_end;
>> +
>> + if (fatal_signal_pending(current)) {
>> + if (buf_used == 0) {
>> + ret = -EINTR;
>> + goto out_unlock;
>> + }
>> + goto done;
>> + }
>> +
>> + cond_resched();
>> + }
>> +
>> + /* Handle trailing implicit hole. */
>> + if (cur_offset < end_offset) {
>> + struct btrfs_ioctl_get_csums_entry entry = {
>> + .offset = prev_extent_end,
>> + .length = end_offset - prev_extent_end,
>> + .type = BTRFS_GET_CSUMS_ZEROED,
>> + };
>> +
>> + if (buf_used + sizeof(entry) <= buf_limit) {
>> + if (copy_to_user(ubuf + buf_used, &entry,
>> + sizeof(entry))) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> + buf_used += sizeof(entry);
>> + cur_offset = end_offset;
>> + }
>> + }
>> +
>> +done:
>> + args.offset = cur_offset;
>> + args.length = (cur_offset < end_offset) ? end_offset - cur_offset : 0;
>> + args.buf_size = buf_used;
>> +
>> + if (copy_to_user(argp, &args, sizeof(args)))
>> + ret = -EFAULT;
>> +
>> +out_unlock:
>> + btrfs_inode_unlock(bi, BTRFS_ILOCK_SHARED);
>> + return ret;
>> +}
>> +
>> long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> {
>> @@ -5294,6 +5629,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>> #endif
>> case BTRFS_IOC_SUBVOL_SYNC_WAIT:
>> return btrfs_ioctl_subvol_sync(fs_info, argp);
>> + case BTRFS_IOC_GET_CSUMS:
>> + return btrfs_ioctl_get_csums(file, argp);
>> #ifdef CONFIG_BTRFS_EXPERIMENTAL
>> case BTRFS_IOC_SHUTDOWN:
>> return btrfs_ioctl_shutdown(fs_info, arg);
>> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
>> index 9165154a274d94..9b24218682990d 100644
>> --- a/include/uapi/linux/btrfs.h
>> +++ b/include/uapi/linux/btrfs.h
>> @@ -1100,6 +1100,29 @@ enum btrfs_err_code {
>> BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
>> };
>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>> +#define BTRFS_GET_CSUMS_ZEROED 1
>> +#define BTRFS_GET_CSUMS_NODATASUM 2
>> +#define BTRFS_GET_CSUMS_COMPRESSED 3
>> +#define BTRFS_GET_CSUMS_ENCRYPTED 4
>> +#define BTRFS_GET_CSUMS_INLINE 5
>> +
>> +struct btrfs_ioctl_get_csums_entry {
>> + __u64 offset; /* file offset of this range */
>> + __u64 length; /* length in bytes */
>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>> + __u32 reserved; /* padding, must be 0 */
>> +};
>> +
>> +struct btrfs_ioctl_get_csums_args {
>> + __u64 offset; /* in/out: file offset */
>> + __u64 length; /* in/out: range length */
>> + __u64 buf_size; /* in/out: buffer capacity / bytes written */
>> + __u64 flags; /* in: flags, must be 0 for now */
>> + __u8 buf[]; /* out: entries + csum data */
>> +};
>> +
>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>> @@ -1226,6 +1249,8 @@ enum btrfs_err_code {
>> struct btrfs_ioctl_encoded_io_args)
>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>> struct btrfs_ioctl_subvol_wait)
>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>> + struct btrfs_ioctl_get_csums_args)
>> /* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-14 2:07 ` David Sterba
@ 2026-04-15 13:56 ` Mark Harmstone
2026-04-15 16:43 ` David Sterba
0 siblings, 1 reply; 7+ messages in thread
From: Mark Harmstone @ 2026-04-15 13:56 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs, wqu, boris
Thanks David.
On 14/04/2026 3.07 am, David Sterba wrote:
> On Mon, Apr 13, 2026 at 06:14:16PM +0100, Mark Harmstone wrote:
>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>> query the on-disk csums for a file.
>>
>> The ioctl is deliberately per-file rather than exposing raw csum tree
>> lookups, to avoid leaking information to users about files they may not
>> have access to.
>>
>> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
>> the kernel, which details the offset and length we're interested in, and
>> a buffer for the kernel to write its results into. The kernel writes a
>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>> csums if available.
>>
>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>> type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
>> the precomputed hash of the zero sector. Otherwise, it sets the type to
>> BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
>> BTRFS_GET_CSUMS_ENCRYPTED, or BTRFS_GET_CSUMS_INLINE.
>>
>> For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
>> produce the following output buffer:
>>
>> | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
>>
>> We do store the csums of compressed extents, but we deliberately don't
>> return them here: they're hashed over the compressed data, not the
>> uncompressed data that's returned to userspace. Similarly for encrypted
>> data, once encryption is supported: the csums will be over the
>> ciphertext.
>>
>> The main use case for this is for speeding up mkfs.btrfs --rootdir.
>
> In the past reading the checksums was desired for deduplication, and we
> had some proposed ioctls but I think they were not extensive enough to
> also cover holes and compression. Basically a wrapper for the search
> tree ioctl on the checksum tree.
>
> Handling all the types in this patch seems to be complete.
>
>> For
>> the case when the source FS is btrfs and using the same csum algorithm,
>> we can avoid having to recalculate the csums - in my synthetic
>> benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
>> speed-up (218s to 196s).
>>
>> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
>> reading the data entirely, resulting in a ~2200% speed-up on the same test
>> (128s to 6s).
>
> Repeated mkfs is a specific use case, normally it happens just once but I
> understand that for preparing various images the time savings are
> significant.
>
> Once this ioctl is available I think the deduplication tools will make
> use of it as an initial filter before actually doing the real
> deduplication of extents.
IIRC the mathematics of the birthday paradox is that for an n-bit hash,
you need 2^(n/2) samples to have a 50% chance of a collision.
For 32-bit hashes, that means a 50% chance of a collision every 256 MB
of 4K sectors. For 256-bit hashes, that's 2^140 bytes... whatever that
works out as. So the dedupe logic would no doubt conclude that for
larger hash sizes GET_CSUMS itself will be sufficient.
> As this is an unprivileged ioctl touching structures and memory, we'll
> need a security pass on the final version. So far I haven't seen any
> issues.
>
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -56,6 +56,7 @@
>> #include "uuid-tree.h"
>> #include "ioctl.h"
>> #include "file.h"
>> +#include "file-item.h"
>> #include "scrub.h"
>> #include "super.h"
>>
>> @@ -5139,6 +5140,340 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
>> }
>> #endif
>>
>> +#define GET_CSUMS_BUF_MAX SZ_16M
>> +
>> +static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
>> + u64 len, u8 __user *buf)
>> +{
>> + struct btrfs_root *csum_root;
>> + struct btrfs_ordered_sum *sums;
>> + LIST_HEAD(list);
>> + const u32 csum_size = fs_info->csum_size;
>> + int ret;
>> +
>> + csum_root = btrfs_csum_root(fs_info, disk_bytenr);
>> +
>> + ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
>> + disk_bytenr + len - 1, &list, false);
>> + if (ret < 0)
>> + return ret;
>> +
>> + /* Clear the output buffer to handle potential gaps in csum coverage. */
>> + if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
>
> Clearing the partial range here is good but I think it should be done
> globally right at the start of the ioctl, regardless of how much data we
> return.
>
> OTOH this can have a performance hit in case a large buffer is provided
> by default but used for various, possibly short, ranges not matching
> the entire buffer. So repeated calls will have to clear 16M but return
> a few kilobytes.
That's fine - this isn't an interface change, so we can always change it
later if it turns out to be a bottleneck. At any rate it's going to be
faster than userspace calculating the csums manually.
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + ret = 0;
>> + while (!list_empty(&list)) {
>> + u64 offset;
>> + size_t copy_size;
>> +
>> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
>> + list_del(&sums->list);
>> +
>> + offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
>> + copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
>> +
>> + if (copy_to_user(buf + offset, sums->sums, copy_size)) {
>> + kfree(sums);
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + kfree(sums);
>> + }
>> +
>> +out:
>> + while (!list_empty(&list)) {
>> + sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
>> + list_del(&sums->list);
>> + kfree(sums);
>> + }
>> + return ret;
>> +}
>> +
>> +static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
>> +{
>> + struct inode *inode = file_inode(file);
>
> Please use vfs_inode
>
>> + struct btrfs_inode *bi = BTRFS_I(inode);
>
> and plain inode
>
>> + struct btrfs_fs_info *fs_info = bi->root->fs_info;
>> + struct btrfs_root *root = bi->root;
>> + struct btrfs_ioctl_get_csums_args args;
>> + BTRFS_PATH_AUTO_FREE(path);
>> + const u64 ino = btrfs_ino(bi);
>> + const u32 sectorsize = fs_info->sectorsize;
>> + const u32 csum_size = fs_info->csum_size;
>> + u8 __user *ubuf;
>> + u64 buf_limit;
>> + u64 buf_used = 0;
>> + u64 cur_offset;
>> + u64 end_offset;
>> + u64 prev_extent_end;
>> + struct btrfs_key key;
>> + int ret;
>> +
>> + if (!(file->f_mode & FMODE_READ))
>> + return -EBADF;
>> +
>> + if (!S_ISREG(inode->i_mode))
>> + return -EINVAL;
>> +
>> + if (copy_from_user(&args, argp, sizeof(args)))
>> + return -EFAULT;
>> +
>> + if (!IS_ALIGNED(args.offset, sectorsize) ||
>> + !IS_ALIGNED(args.length, sectorsize))
>> + return -EINVAL;
>> + if (args.length == 0)
>> + return -EINVAL;
>> + if (args.offset + args.length < args.offset)
>> + return -EOVERFLOW;
>> + if (args.flags != 0)
>> + return -EINVAL;
>> + if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
>> + return -EINVAL;
>> +
>> + buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
>> + ubuf = (u8 __user *)(argp + offsetof(struct btrfs_ioctl_get_csums_args, buf));
>> + cur_offset = args.offset;
>> + end_offset = args.offset + args.length;
>> +
>> + path = btrfs_alloc_path();
>> + if (!path)
>> + return -ENOMEM;
>> +
>> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
>> + if (ret)
>> + return ret;
>> +
>> + btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
>
> This needs to be interruptible, all other points are. There's no inode
> lock helper for that so you'd have to add one or call
> down_read_interruptible().
>
>> +
>> + ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
>> + if (ret)
>> + goto out_unlock;
>> +
>> + /* NODATASUM early exit. */
>> + if (bi->flags & BTRFS_INODE_NODATASUM) {
>> + struct btrfs_ioctl_get_csums_entry entry = {
>> + .offset = cur_offset,
>> + .length = end_offset - cur_offset,
>> + .type = BTRFS_GET_CSUMS_NODATASUM,
>> + };
>> +
>> + if (copy_to_user(ubuf, &entry, sizeof(entry))) {
>> + ret = -EFAULT;
>> + goto out_unlock;
>> + }
>> +
>> + buf_used = sizeof(entry);
>> + cur_offset = end_offset;
>> + goto done;
>> + }
>> +
>> + prev_extent_end = cur_offset;
>> +
>> + while (cur_offset < end_offset) {
>> + struct btrfs_file_extent_item *ei;
>> + struct extent_buffer *leaf;
>> + struct btrfs_ioctl_get_csums_entry entry;
>
> Please initialize that right away with = { 0 }; it's done later by
> memset() but for something filled partially it's better to do it at the
> beginning than to risk forgetting to do it later.
>
> Overall comment: please check the final patch for unnecessary line
> breaks where lines overflow 80 chars only slightly. I've seen a few and
> would have fixed them, but I think I've been pointing that out enough in
> the past. I'll fix anything that's left if needed.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-15 13:56 ` Mark Harmstone
@ 2026-04-15 16:43 ` David Sterba
2026-04-15 16:52 ` Mark Harmstone
0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2026-04-15 16:43 UTC (permalink / raw)
To: Mark Harmstone; +Cc: dsterba, linux-btrfs, wqu, boris
On Wed, Apr 15, 2026 at 02:56:49PM +0100, Mark Harmstone wrote:
> >> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
> >> reading the data entirely, resulting in a ~2200% speed-up on the same test
> >> (128s to 6s).
> >
> > Repeated mkfs is a specific use case, normally it happens just once, but I
> > understand that for preparing various images the time savings are
> > significant.
> >
> > Once this ioctl is available I think the deduplication tools will make
> > use of it as an initial filter before actually doing the real
> > deduplication of extents.
>
> IIRC the mathematics of the birthday paradox is that for an n-bit hash,
> you need 2^(n/2) sectors to have a 50% chance of a collision.
>
> For the 32-bit hashes, that means a 50% chance of a collision every 256
> MB. For the 256-bit hashes, that's 2^140 bytes... whatever that works
> out as. So the dedupe logic would no doubt be that for larger hash sizes
> GET_CSUMS itself will be sufficient.
It's used as a hint in the initial filter (e.g. in
https://github.com/lakshmipathi/dduper), the final deduplication does
full byte comparison (memcmp() inside vfs_dedupe_file_range_compare()).
* Re: [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-15 16:43 ` David Sterba
@ 2026-04-15 16:52 ` Mark Harmstone
0 siblings, 0 replies; 7+ messages in thread
From: Mark Harmstone @ 2026-04-15 16:52 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs, wqu, boris, lakshmipathi.g
On 15/04/2026 5.43 pm, David Sterba wrote:
> On Wed, Apr 15, 2026 at 02:56:49PM +0100, Mark Harmstone wrote:
>>>> When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
>>>> reading the data entirely, resulting in a ~2200% speed-up on the same test
>>>> (128s to 6s).
>>>
>>> Repeated mkfs is a specific use case, normally it happens just once, but I
>>> understand that for preparing various images the time savings are
>>> significant.
>>>
>>> Once this ioctl is available I think the deduplication tools will make
>>> use of it as an initial filter before actually doing the real
>>> deduplication of extents.
>>
>> IIRC the mathematics of the birthday paradox is that for an n-bit hash,
>> you need 2^(n/2) sectors to have a 50% chance of a collision.
>>
>> For the 32-bit hashes, that means a 50% chance of a collision every 256
>> MB. For the 256-bit hashes, that's 2^140 bytes... whatever that works
>> out as. So the dedupe logic would no doubt be that for larger hash sizes
>> GET_CSUMS itself will be sufficient.
>
> It's used as a hint in the initial filter (e.g. in
> https://github.com/lakshmipathi/dduper), the final deduplication does
> full byte comparison (memcmp() inside vfs_dedupe_file_range_compare()).
I've only had a brief look at the project, but what Lakshmipathi calls
the "insane mode" ought to be safe for SHA256.
end of thread [~2026-04-15 16:52 UTC]
Thread overview: 7+ messages
2026-04-13 17:14 [PATCH v3] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-04-13 22:17 ` Qu Wenruo
2026-04-14 12:57 ` Mark Harmstone
2026-04-14 2:07 ` David Sterba
2026-04-15 13:56 ` Mark Harmstone
2026-04-15 16:43 ` David Sterba
2026-04-15 16:52 ` Mark Harmstone