* [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
@ 2026-03-20 12:50 Mark Harmstone
2026-03-20 13:03 ` Mark Harmstone
2026-03-20 22:18 ` Qu Wenruo
0 siblings, 2 replies; 15+ messages in thread
From: Mark Harmstone @ 2026-03-20 12:50 UTC
To: linux-btrfs; +Cc: Mark Harmstone
Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
query the on-disk csums for a file.
This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
the kernel, which details the offset and length we're interested in, and
a buffer for the kernel to write its results into. The kernel writes a
struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
csums if available.
If the extent is an uncompressed, non-nodatasum extent, the kernel sets
the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
csums. If it is sparse, preallocated, or beyond the EOF, it sets the
type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
the precomputed hash of the zero sector. Otherwise, it sets the type to
BTRFS_GET_CSUMS_NO_CSUMS.
We do store the csums of compressed extents, but we deliberately don't
return them here: they're hashed over the compressed data, not the
uncompressed data that's returned to userspace.
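A minimal sketch of how userspace might walk the returned buffer, assuming the layout described above: each entry header is followed directly by one csum per sector when its type is BTRFS_GET_CSUMS_HAS_CSUMS, with no padding in between. The struct and type values mirror the uapi additions in this patch; `struct csum_buf_summary` and `summarize_csum_buf()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the uapi definitions proposed in this patch. */
#define BTRFS_GET_CSUMS_HAS_CSUMS 0
#define BTRFS_GET_CSUMS_SPARSE    1
#define BTRFS_GET_CSUMS_NO_CSUMS  2

struct btrfs_ioctl_get_csums_entry {
	uint64_t offset;
	uint64_t length;
	uint32_t type;
	uint32_t reserved;
};

/* Hypothetical summary type, not part of the patch. */
struct csum_buf_summary {
	size_t entries;        /* entry headers seen */
	size_t csum_bytes;     /* total csum payload bytes */
	uint64_t sparse_bytes; /* file bytes reported as SPARSE */
};

/*
 * Walk the first `used` bytes of the output buffer.
 * Returns 0 on success, -1 if the buffer is truncated or holds an
 * unknown entry type.
 */
static int summarize_csum_buf(const uint8_t *buf, size_t used,
			      uint32_t sectorsize, uint32_t csum_size,
			      struct csum_buf_summary *out)
{
	size_t pos = 0;

	assert(sectorsize != 0);
	memset(out, 0, sizeof(*out));

	while (pos < used) {
		struct btrfs_ioctl_get_csums_entry e;

		if (used - pos < sizeof(e))
			return -1;
		/* Entries may start at unaligned offsets, so copy them out. */
		memcpy(&e, buf + pos, sizeof(e));
		pos += sizeof(e);
		out->entries++;

		switch (e.type) {
		case BTRFS_GET_CSUMS_HAS_CSUMS: {
			uint64_t n = (e.length / sectorsize) * csum_size;

			if (n > used - pos)
				return -1;
			/* buf[pos .. pos + n) holds one csum per sector. */
			pos += (size_t)n;
			out->csum_bytes += (size_t)n;
			break;
		}
		case BTRFS_GET_CSUMS_SPARSE:
			out->sparse_bytes += e.length;
			break;
		case BTRFS_GET_CSUMS_NO_CSUMS:
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

The memcpy() matters: with a 4-byte csum and a single-sector extent, the next entry header starts 28 bytes into the buffer, which is not 8-byte aligned.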
The main use case for this is for speeding up mkfs.btrfs --rootdir. For
the case when the source FS is btrfs and using the same csum algorithm,
we can avoid having to recalculate the csums - in my synthetic
benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
speed-up (218s to 196s).
When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
reading the data entirely, resulting in a ~22x speed-up on the same test
(128s to 6s).
# mkdir rootdir
# dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304
(without ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m37.965s
user 0m5.496s
sys 0m6.125s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 2m8.342s
user 0m5.472s
sys 0m1.667s
(with ioctl)
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir testimg
...
real 3m15.865s
user 0m4.258s
sys 0m6.261s
# echo 3 > /proc/sys/vm/drop_caches
# time mkfs.btrfs --rootdir rootdir --reflink testimg
...
real 0m5.847s
user 0m2.899s
sys 0m0.097s
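The headline figures can be rechecked from the `real` times above. A small sketch that computes both the ratio (old time / new time) and the percentage gain ((ratio - 1) * 100), so either convention can be compared against the numbers; the helper names are illustrative only.

```c
#include <assert.h>

/* How many times faster: old wall time divided by new wall time. */
static double speedup_ratio(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}

/* Gain in percent: (ratio - 1) * 100. */
static double speedup_percent(double before_s, double after_s)
{
	return (speedup_ratio(before_s, after_s) - 1.0) * 100.0;
}
```

With the times above, speedup_percent(217.965, 195.865) is a little over 11, and speedup_ratio(128.342, 5.847) is just under 22.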
Signed-off-by: Mark Harmstone <mark@harmstone.com>
---
fs/btrfs/ioctl.c | 330 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 21 +++
2 files changed, 351 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a4d715bbed57ba..b7c8bfb90fed29 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -56,6 +56,7 @@
#include "uuid-tree.h"
#include "ioctl.h"
#include "file.h"
+#include "file-item.h"
#include "scrub.h"
#include "super.h"
@@ -5138,6 +5139,333 @@ static int btrfs_ioctl_shutdown(struct btrfs_fs_info *fs_info, unsigned long arg
}
#endif
+#define GET_CSUMS_BUF_MAX (16 * 1024 * 1024)
+
+static int copy_csums_to_user(struct btrfs_fs_info *fs_info, u64 disk_bytenr,
+ u64 len, u8 __user *buf)
+{
+ struct btrfs_root *csum_root;
+ struct btrfs_ordered_sum *sums;
+ LIST_HEAD(list);
+ const u32 csum_size = fs_info->csum_size;
+ int ret;
+
+ csum_root = btrfs_csum_root(fs_info, disk_bytenr);
+
+ ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
+ disk_bytenr + len - 1, &list, false);
+ if (ret < 0)
+ return ret;
+
+ /* Clear the output buffer to handle potential gaps in csum coverage. */
+ if (clear_user(buf, (len >> fs_info->sectorsize_bits) * csum_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = 0;
+ while (!list_empty(&list)) {
+ u64 offset;
+ size_t copy_size;
+
+ sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
+ list_del(&sums->list);
+
+ offset = ((sums->logical - disk_bytenr) >> fs_info->sectorsize_bits) * csum_size;
+ copy_size = (sums->len >> fs_info->sectorsize_bits) * csum_size;
+
+ if (copy_to_user(buf + offset, sums->sums, copy_size)) {
+ kfree(sums);
+ ret = -EFAULT;
+ goto out;
+ }
+
+ kfree(sums);
+ }
+
+out:
+ while (!list_empty(&list)) {
+ sums = list_first_entry(&list, struct btrfs_ordered_sum, list);
+ list_del(&sums->list);
+ kfree(sums);
+ }
+ return ret;
+}
+
+static int btrfs_ioctl_get_csums(struct file *file, void __user *argp)
+{
+ struct inode *inode = file_inode(file);
+ struct btrfs_inode *bi = BTRFS_I(inode);
+ struct btrfs_fs_info *fs_info = bi->root->fs_info;
+ struct btrfs_root *root = bi->root;
+ struct btrfs_ioctl_get_csums_args args;
+ BTRFS_PATH_AUTO_FREE(path);
+ const u64 ino = btrfs_ino(bi);
+ const u32 sectorsize = fs_info->sectorsize;
+ const u32 csum_size = fs_info->csum_size;
+ u8 __user *ubuf;
+ u64 buf_limit;
+ u64 buf_used = 0;
+ u64 cur_offset;
+ u64 end_offset;
+ u64 prev_extent_end;
+ struct btrfs_key key;
+ int ret;
+
+ if (!(file->f_mode & FMODE_READ))
+ return -EBADF;
+
+ if (!S_ISREG(inode->i_mode))
+ return -EINVAL;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ if (!IS_ALIGNED(args.offset, sectorsize) ||
+ !IS_ALIGNED(args.length, sectorsize))
+ return -EINVAL;
+ if (args.length == 0)
+ return -EINVAL;
+ if (args.offset + args.length < args.offset)
+ return -EOVERFLOW;
+ if (args.buf_size < sizeof(struct btrfs_ioctl_get_csums_entry))
+ return -EINVAL;
+
+ buf_limit = min_t(u64, args.buf_size, GET_CSUMS_BUF_MAX);
+ ubuf = (u8 __user *)(argp + offsetof(struct btrfs_ioctl_get_csums_args, buf));
+ cur_offset = args.offset;
+ end_offset = args.offset + args.length;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
+ if (ret)
+ return ret;
+
+ btrfs_inode_lock(bi, BTRFS_ILOCK_SHARED);
+
+ ret = btrfs_wait_ordered_range(bi, cur_offset, args.length);
+ if (ret)
+ goto out_unlock;
+
+ /* NODATASUM early exit. */
+ if (bi->flags & BTRFS_INODE_NODATASUM) {
+ struct btrfs_ioctl_get_csums_entry entry = {
+ .offset = cur_offset,
+ .length = end_offset - cur_offset,
+ .type = BTRFS_GET_CSUMS_NO_CSUMS,
+ };
+
+ if (copy_to_user(ubuf, &entry, sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ buf_used = sizeof(entry);
+ cur_offset = end_offset;
+ goto done;
+ }
+
+ prev_extent_end = cur_offset;
+
+ while (cur_offset < end_offset) {
+ struct btrfs_file_extent_item *ei;
+ struct extent_buffer *leaf;
+ struct btrfs_ioctl_get_csums_entry entry;
+ u64 extent_end;
+ u64 disk_bytenr = 0;
+ u64 extent_offset = 0;
+ u64 range_start, range_len;
+ u64 entry_csum_size;
+ u64 key_offset;
+ int extent_type;
+ u8 compression;
+
+ /* Search for the extent at or before cur_offset. */
+ key.objectid = ino;
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = cur_offset;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ if (ret < 0)
+ goto out_unlock;
+
+ if (ret > 0 && path->slots[0] > 0) {
+ btrfs_item_key_to_cpu(path->nodes[0], &key,
+ path->slots[0] - 1);
+ if (key.objectid == ino &&
+ key.type == BTRFS_EXTENT_DATA_KEY) {
+ path->slots[0]--;
+ if (btrfs_file_extent_end(path) <= cur_offset)
+ path->slots[0]++;
+ }
+ }
+
+ if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+ ret = btrfs_next_leaf(root, path);
+ if (ret < 0)
+ goto out_unlock;
+ if (ret > 0) {
+ ret = 0;
+ btrfs_release_path(path);
+ break;
+ }
+ }
+
+ leaf = path->nodes[0];
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+ btrfs_release_path(path);
+ break;
+ }
+
+ extent_end = btrfs_file_extent_end(path);
+ key_offset = key.offset;
+
+ /* Read extent fields before releasing the path. */
+ ei = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_type = btrfs_file_extent_type(leaf, ei);
+ compression = btrfs_file_extent_compression(leaf, ei);
+
+ if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+ if (disk_bytenr && compression == BTRFS_COMPRESS_NONE)
+ extent_offset = btrfs_file_extent_offset(leaf, ei);
+ }
+
+ btrfs_release_path(path);
+
+ /* Implicit hole (NO_HOLES feature). */
+ if (prev_extent_end < key_offset) {
+ u64 hole_end = min(key_offset, end_offset);
+ u64 hole_len = hole_end - prev_extent_end;
+
+ if (prev_extent_end >= cur_offset) {
+ memset(&entry, 0, sizeof(entry));
+ entry.offset = prev_extent_end;
+ entry.length = hole_len;
+ entry.type = BTRFS_GET_CSUMS_SPARSE;
+
+ if (buf_used + sizeof(entry) > buf_limit)
+ goto done;
+ if (copy_to_user(ubuf + buf_used, &entry,
+ sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+ cur_offset = hole_end;
+ }
+
+ if (key_offset >= end_offset) {
+ cur_offset = end_offset;
+ break;
+ }
+ }
+
+ /* Clamp to our query range. */
+ range_start = max(cur_offset, key_offset);
+ range_len = min(extent_end, end_offset) - range_start;
+
+ memset(&entry, 0, sizeof(entry));
+ entry.offset = range_start;
+ entry.length = range_len;
+
+ if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+ entry.type = BTRFS_GET_CSUMS_NO_CSUMS;
+ entry_csum_size = 0;
+ } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+ entry.type = BTRFS_GET_CSUMS_SPARSE;
+ entry_csum_size = 0;
+ } else {
+ /* BTRFS_FILE_EXTENT_REG */
+ if (disk_bytenr == 0) {
+ /* Explicit hole. */
+ entry.type = BTRFS_GET_CSUMS_SPARSE;
+ entry_csum_size = 0;
+ } else if (compression != BTRFS_COMPRESS_NONE) {
+ entry.type = BTRFS_GET_CSUMS_NO_CSUMS;
+ entry_csum_size = 0;
+ } else {
+ entry.type = BTRFS_GET_CSUMS_HAS_CSUMS;
+ entry_csum_size = (range_len >> fs_info->sectorsize_bits) * csum_size;
+ }
+ }
+
+ /* Check if this entry (+ csum data) fits in the buffer. */
+ if (buf_used + sizeof(entry) + entry_csum_size > buf_limit) {
+ if (buf_used == 0) {
+ ret = -EOVERFLOW;
+ goto out_unlock;
+ }
+ goto done;
+ }
+
+ if (copy_to_user(ubuf + buf_used, &entry, sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+
+ if (entry.type == BTRFS_GET_CSUMS_HAS_CSUMS) {
+ ret = copy_csums_to_user(fs_info,
+ disk_bytenr + extent_offset + (range_start - key_offset),
+ range_len, ubuf + buf_used);
+ if (ret)
+ goto out_unlock;
+ buf_used += entry_csum_size;
+ }
+
+ cur_offset = range_start + range_len;
+ prev_extent_end = extent_end;
+
+ if (fatal_signal_pending(current)) {
+ if (buf_used == 0) {
+ ret = -EINTR;
+ goto out_unlock;
+ }
+ goto done;
+ }
+
+ cond_resched();
+ }
+
+ /* Handle trailing implicit hole. */
+ if (cur_offset < end_offset) {
+ struct btrfs_ioctl_get_csums_entry entry = {
+ .offset = prev_extent_end,
+ .length = end_offset - prev_extent_end,
+ .type = BTRFS_GET_CSUMS_SPARSE,
+ };
+
+ if (buf_used + sizeof(entry) <= buf_limit) {
+ if (copy_to_user(ubuf + buf_used, &entry,
+ sizeof(entry))) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ buf_used += sizeof(entry);
+ cur_offset = end_offset;
+ }
+ }
+
+done:
+ args.offset = cur_offset;
+ args.length = (cur_offset < end_offset) ? end_offset - cur_offset : 0;
+ args.buf_size = buf_used;
+
+ if (copy_to_user(argp, &args, sizeof(args)))
+ ret = -EFAULT;
+
+out_unlock:
+ btrfs_inode_unlock(bi, BTRFS_ILOCK_SHARED);
+ return ret;
+}
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -5293,6 +5621,8 @@ long btrfs_ioctl(struct file *file, unsigned int
#endif
case BTRFS_IOC_SUBVOL_SYNC_WAIT:
return btrfs_ioctl_subvol_sync(fs_info, argp);
+ case BTRFS_IOC_GET_CSUMS:
+ return btrfs_ioctl_get_csums(file, argp);
#ifdef CONFIG_BTRFS_EXPERIMENTAL
case BTRFS_IOC_SHUTDOWN:
return btrfs_ioctl_shutdown(fs_info, arg);
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 9165154a274d94..db1374c892f825 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -1100,6 +1100,25 @@ enum btrfs_err_code {
BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
};
+/* Types for struct btrfs_ioctl_get_csums_entry::type */
+#define BTRFS_GET_CSUMS_HAS_CSUMS 0
+#define BTRFS_GET_CSUMS_SPARSE 1
+#define BTRFS_GET_CSUMS_NO_CSUMS 2
+
+struct btrfs_ioctl_get_csums_entry {
+ __u64 offset; /* file offset of this range */
+ __u64 length; /* length in bytes */
+ __u32 type; /* BTRFS_GET_CSUMS_* type */
+ __u32 reserved; /* padding, must be 0 */
+};
+
+struct btrfs_ioctl_get_csums_args {
+ __u64 offset; /* in/out: file offset */
+ __u64 length; /* in/out: range length */
+ __u64 buf_size; /* in/out: buffer capacity / bytes written */
+ __u8 buf[]; /* out: entries + csum data */
+};
+
/* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
#define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
#define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
@@ -1226,6 +1245,8 @@ enum btrfs_err_code {
struct btrfs_ioctl_encoded_io_args)
#define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
struct btrfs_ioctl_subvol_wait)
+#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
+ struct btrfs_ioctl_get_csums_args)
/* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
#define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
--
2.52.0
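A rough userspace sketch of the resume loop implied by the in/out fields of struct btrfs_ioctl_get_csums_args: on return the kernel advances offset, shrinks length to what remains, and sets buf_size to the bytes written. To keep the sketch testable without a patched kernel, the ioctl is injected through a callback. `get_csums_range()`, the two callback types and the `demo_*` helpers are hypothetical names, and `demo_call()` only imitates the bookkeeping, not real kernel output.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors struct btrfs_ioctl_get_csums_args from this patch. */
struct btrfs_ioctl_get_csums_args {
	uint64_t offset;
	uint64_t length;
	uint64_t buf_size;
	uint8_t buf[];
};

/* Hypothetical hooks. A real `call` would wrap
 * ioctl(fd, BTRFS_IOC_GET_CSUMS, args) and return -errno on failure. */
typedef int (*get_csums_call_fn)(void *ctx,
				 struct btrfs_ioctl_get_csums_args *args);
typedef void (*get_csums_chunk_fn)(void *ctx, const uint8_t *buf,
				   uint64_t used);

static int get_csums_range(uint64_t offset, uint64_t length, size_t capacity,
			   get_csums_call_fn call, get_csums_chunk_fn chunk,
			   void *ctx)
{
	struct btrfs_ioctl_get_csums_args *args;
	int ret = 0;

	args = calloc(1, sizeof(*args) + capacity);
	if (!args)
		return -ENOMEM;
	args->offset = offset;
	args->length = length;

	while (args->length > 0) {
		const uint64_t before = args->offset;

		/* buf_size is in/out: capacity going in, bytes written coming out. */
		args->buf_size = capacity;
		ret = call(ctx, args);
		if (ret < 0)
			break;
		chunk(ctx, args->buf, args->buf_size);
		/* Guard against a call that reports no progress. */
		if (args->offset == before) {
			ret = -EIO;
			break;
		}
	}
	free(args);
	return ret;
}

/* Demonstration stand-ins, not kernel behaviour. */
struct demo_ctx {
	unsigned int calls;
	uint64_t bytes_seen;
};

static int demo_call(void *ctx, struct btrfs_ioctl_get_csums_args *args)
{
	struct demo_ctx *d = ctx;
	uint64_t step = args->length < 8192 ? args->length : 8192;

	d->calls++;
	args->offset += step;
	args->length -= step;
	args->buf_size = 24; /* pretend one entry header was written */
	return 0;
}

static void demo_chunk(void *ctx, const uint8_t *buf, uint64_t used)
{
	struct demo_ctx *d = ctx;

	(void)buf;
	d->bytes_seen += used;
}
```

Going by the patch, a caller should also expect -EOVERFLOW when the first HAS_CSUMS entry does not fit in the buffer, and should not rely on more than GET_CSUMS_BUF_MAX (16 MiB) being used per call.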
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-20 12:50 [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
@ 2026-03-20 13:03 ` Mark Harmstone
2026-03-20 22:18 ` Qu Wenruo
1 sibling, 0 replies; 15+ messages in thread
From: Mark Harmstone @ 2026-03-20 13:03 UTC
To: Btrfs BTRFS
mkfs patches for this: https://github.com/kdave/btrfs-progs/pull/1096
On 20/03/2026 12.50 pm, Mark Harmstone wrote:
> [...]
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-20 12:50 [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-03-20 13:03 ` Mark Harmstone
@ 2026-03-20 22:18 ` Qu Wenruo
2026-03-25 7:34 ` Qu Wenruo
1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-03-20 22:18 UTC
To: Mark Harmstone, linux-btrfs
On 2026/3/20 23:20, Mark Harmstone wrote:
> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
> query the on-disk csums for a file.
>
> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
> the kernel, which details the offset and length we're interested in, and
> a buffer for the kernel to write its results into. The kernel writes a
> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
> csums if available.
>
> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
> the precomputed hash of the zero sector. Otherwise, it sets the type to
> BTRFS_GET_CSUMS_NO_CSUMS.
I'm not sure it's a good idea to put holes and preallocated ranges into
the same BTRFS_GET_CSUMS_SPARSE type.
Although both mean there is no csum, a hole means there really is no
data extent, so we should not create any extent rather than writing
zeros.
For preallocated ranges, indicating that they have no csum would let
mkfs distinguish holes from preallocated ranges and thus turn zero
writes into preallocation, which is faster and makes the resulting fs
match the source dir more closely.
And for the EOF checks, I don't think we need to bother that much,
i.e. just let it return the regular results.
My assumption is that mkfs shouldn't pass a range completely beyond
round_up(i_size), as non-reflink rootdir population always reads the
content of the inode from the host fs.
Thus we never really read beyond the inode size.
>
> We do store the csums of compressed extents, but we deliberately don't
> return them here: they're hashed over the compressed data, not the
> uncompressed data that's returned to userspace.
I agree with skipping compressed extents, but I'd prefer a dedicated
flag for them rather than NO_CSUMS.
Otherwise mkfs is unable to distinguish holes from compressed extents.
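To make the suggestion concrete, here is a sketch of how mkfs might dispatch on finer-grained types. Everything here is hypothetical: the posted patch defines only HAS_CSUMS, SPARSE and NO_CSUMS, and the `DEMO_*` names and the actions they map to are assumptions for illustration, not btrfs-progs behaviour.

```c
/* Hypothetical finer-grained result types (not in the posted patch). */
enum demo_csums_type {
	DEMO_CSUMS_HAS_CSUMS,
	DEMO_CSUMS_HOLE,
	DEMO_CSUMS_PREALLOC,
	DEMO_CSUMS_COMPRESSED,
	DEMO_CSUMS_NO_CSUMS,
};

/* Hypothetical actions on the mkfs side. */
enum demo_mkfs_action {
	DEMO_REUSE_CSUMS,     /* copy or reflink data, reuse returned csums */
	DEMO_SKIP_RANGE,      /* hole: create no extent at all */
	DEMO_CREATE_PREALLOC, /* preallocated extent, no zero writes */
	DEMO_READ_AND_HASH,   /* existing path: read data, compute csums */
};

static enum demo_mkfs_action demo_action_for(enum demo_csums_type type)
{
	switch (type) {
	case DEMO_CSUMS_HAS_CSUMS:
		return DEMO_REUSE_CSUMS;
	case DEMO_CSUMS_HOLE:
		return DEMO_SKIP_RANGE;
	case DEMO_CSUMS_PREALLOC:
		return DEMO_CREATE_PREALLOC;
	case DEMO_CSUMS_COMPRESSED:
	case DEMO_CSUMS_NO_CSUMS:
	default:
		/* No reusable csums: fall back to reading and hashing. */
		return DEMO_READ_AND_HASH;
	}
}
```

Compressed and NO_CSUMS ranges end up on the same fallback path in this sketch; keeping them as separate types only matters if mkfs later wants to treat them differently.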
[...]
> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
> +#define BTRFS_GET_CSUMS_SPARSE 1
> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
> +
> +struct btrfs_ioctl_get_csums_entry {
> + __u64 offset; /* file offset of this range */
> + __u64 length; /* length in bytes */
> + __u32 type; /* BTRFS_GET_CSUMS_* type */
> + __u32 reserved; /* padding, must be 0 */
> +};
> +
> +struct btrfs_ioctl_get_csums_args {
> + __u64 offset; /* in/out: file offset */
> + __u64 length; /* in/out: range length */
> + __u64 buf_size; /* in/out: buffer capacity / bytes written */
> + __u8 buf[]; /* out: entries + csum data */
> +};
Going by the progs usage, buf[] always holds a single
btrfs_ioctl_get_csums_entry at the beginning, followed by the actual
csum buffer. Can we just combine both structures into one?
Furthermore, since we only query one extent at a time, offset/length
are more or less duplicated between the args and entry structures.
We can just store the length in the args, without needing the entry
members (except the type).
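One possible shape for such a merged structure, as a sketch only. `struct demo_get_csums_args` and the helpers are hypothetical and the field semantics in the comments are assumptions; the idea is simply the args struct from the patch plus the entry's type field, with buf[] carrying csums and no entry header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical merged layout: one extent per call. */
struct demo_get_csums_args {
	uint64_t offset;   /* in: file offset to query */
	uint64_t length;   /* in: range length; out: length described */
	uint64_t buf_size; /* in: capacity of buf; out: csum bytes written */
	uint32_t type;     /* out: BTRFS_GET_CSUMS_* */
	uint32_t reserved; /* must be 0 */
	uint8_t buf[];     /* out: csums only, no entry header */
};

_Static_assert(sizeof(struct demo_get_csums_args) == 32,
	       "header has no implicit padding");
_Static_assert(offsetof(struct demo_get_csums_args, buf) == 32,
	       "csums start right after the header");

/* Bytes to allocate for a given csum capacity. */
static size_t demo_args_alloc_size(size_t csum_capacity)
{
	return sizeof(struct demo_get_csums_args) + csum_capacity;
}

/* Number of csums returned, given the filesystem's csum size. */
static uint64_t demo_csums_returned(const struct demo_get_csums_args *args,
				    uint32_t csum_size)
{
	return csum_size ? args->buf_size / csum_size : 0;
}
```

Compared with the posted layout, the csum payload would always start at a fixed, 8-byte-aligned offset.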
Thanks,
Qu
> +
> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
> struct btrfs_ioctl_encoded_io_args)
> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
> struct btrfs_ioctl_subvol_wait)
> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
> + struct btrfs_ioctl_get_csums_args)
>
> /* Shutdown ioctl should follow XFS's interfaces, thus not using btrfs magic. */
> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-20 22:18 ` Qu Wenruo
@ 2026-03-25 7:34 ` Qu Wenruo
2026-03-25 14:43 ` Mark Harmstone
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-03-25 7:34 UTC
To: Mark Harmstone, linux-btrfs
On 2026/3/21 08:48, Qu Wenruo wrote:
>
>
> On 2026/3/20 23:20, Mark Harmstone wrote:
>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>> query the on-disk csums for a file.
>>
>> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
>> the kernel, which details the offset and length we're interested in, and
>> a buffer for the kernel to write its results into. The kernel writes a
>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>> csums if available.
>>
>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
>> the precomputed hash of the zero sector. Otherwise, it sets the type to
>> BTRFS_GET_CSUMS_NO_CSUMS.
>
> I'm not sure if it's a good idea to put hole and preallocated range into
> the same BTRFS_GET_CSUMS_SPARSE.
>
> Although both means there is no csum, hole case means there is really no
> data extent, thus we should not create any extent instead of writing zero.
>
> For preallocated, indicating it has no CSUM can allow mkfs to
> distinguish hole and preallocated, thus change to zero writes to
> prealloc, which is faster and make the resulted fs more aligned to the
> source dir.
>
>
> And for EOF checks, I think we don't need to bother that much, aka, just
> let it return the regular results.
>
> My assumption is, the mkfs shouldn't pass a range completely beyond the
> round_up(i_size), as non-reflink rootdir population would always read
> out the content of the inode from the host fs.
> Thus we won't really read beyond the inode size.
After more investigation, I think we can move the
hole/preallocation/compression detection into user space.
Hole detection is already pending merge, mostly through the
SEEK_DATA/SEEK_HOLE whence values of lseek():
https://github.com/kdave/btrfs-progs/pull/1097
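For readers unfamiliar with that mechanism, the following is a minimal sketch of walking a file's data ranges with lseek(). The helper names `next_data_range()` and `print_data_ranges()` are made up for illustration and are not taken from the linked pull request; the fallback `SEEK_DATA`/`SEEK_HOLE` definitions assume Linux values.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the libc headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Find the next data range at or after pos.
 * Returns 1 and fills [*start, *end) if found, 0 if only a hole remains
 * up to EOF, -1 on error (errno set by lseek).
 */
static int next_data_range(int fd, off_t pos, off_t *start, off_t *end)
{
	off_t data, hole;

	assert(start != NULL && end != NULL);
	data = lseek(fd, pos, SEEK_DATA);
	if (data < 0)
		return errno == ENXIO ? 0 : -1;
	hole = lseek(fd, data, SEEK_HOLE);
	if (hole < 0)
		return -1;
	*start = data;
	*end = hole;
	return 1;
}

/* Print every data range; everything between the ranges is a hole. */
static int print_data_ranges(int fd)
{
	off_t pos = 0, start, end;
	int ret;

	while ((ret = next_data_range(fd, pos, &start, &end)) == 1) {
		printf("data: [%lld, %lld)\n", (long long)start, (long long)end);
		pos = end;
	}
	return ret;
}
```

Filesystems without real hole tracking report the whole file as one data range, so a caller would still get a correct, if less useful, answer.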
I'm planning to implement preallocation detection through fiemap, which
also allows us to detect compressed ranges and skip them for your case.
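A rough sketch of how userspace might sort fiemap extents into those categories follows. The enum and `classify_extent()` are hypothetical names; the `FIEMAP_EXTENT_*` flags are the ones from `<linux/fiemap.h>`, but treating inline, delalloc and unknown-location extents as "skip" is an assumption of this sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/fiemap.h>

enum extent_kind {
	EXTENT_REGULAR,  /* plain data: csums cover what read() returns */
	EXTENT_PREALLOC, /* preallocated, reads back as zeros */
	EXTENT_SKIP,     /* encoded, inline or not yet on disk */
};

/* Classify one extent as returned in fiemap's fm_extents[] array. */
static enum extent_kind classify_extent(const struct fiemap_extent *fe)
{
	assert(fe != NULL);
	if (fe->fe_flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC |
			    FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DATA_INLINE))
		return EXTENT_SKIP;
	if (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN)
		return EXTENT_PREALLOC;
	return EXTENT_REGULAR;
}
```

The extents themselves would come from an `FS_IOC_FIEMAP` call on the source file; only `EXTENT_REGULAR` ranges would then be passed on to a csum query.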
With all those features implemented in progs, we can further simplify
the get csum ioctl to something closer to
btrfs_lookup_csums_bitmap().
We do not need to care why some ranges have no checksum; progs will
have handled that first. We only need to return all the checksums
found for the specified range.
And as an extra safety net, use a bitmap in the ioctl structure to
indicate which ranges have a checksum and which don't.
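The bitmap idea can be sketched as one bit per sector of the queried range. The helper names below are hypothetical and the bit order (least significant bit first within each byte) is an assumption; the thread does not define either.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a bitmap with one bit per sector. */
static size_t csum_bitmap_bytes(size_t nsectors)
{
	return (nsectors + CHAR_BIT - 1) / CHAR_BIT;
}

/* Mark sector idx of the queried range as having a checksum. */
static void csum_bitmap_set(uint8_t *bitmap, size_t nbits, size_t idx)
{
	assert(bitmap != NULL && idx < nbits);
	bitmap[idx / CHAR_BIT] |= (uint8_t)(1u << (idx % CHAR_BIT));
}

/* Check whether sector idx of the queried range has a checksum. */
static bool csum_bitmap_test(const uint8_t *bitmap, size_t nbits, size_t idx)
{
	assert(bitmap != NULL && idx < nbits);
	return (bitmap[idx / CHAR_BIT] & (1u << (idx % CHAR_BIT))) != 0;
}
```

A caller would only trust the csum slot for sector `i` when `csum_bitmap_test()` returns true for it, and fall back to hashing the data otherwise.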
This will definitely simplify the ioctl, as we only need to do a csum tree
lookup, with no need to touch anything in the subvolume tree.
Thanks,
Qu
>
>>
>> We do store the csums of compressed extents, but we deliberately don't
>> return them here: they're hashed over the compressed data, not the
>> uncompressed data that's returned to userspace.
>
> I agree with the skip of compressed extents, but I'd prefer to have a
> special flag to indicate that, other than NO_CSUMS.
>
> Or mkfs is unable to distinguish hole and compressed extents.
>
> [...]
>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>> +#define BTRFS_GET_CSUMS_SPARSE 1
>> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
>> +
>> +struct btrfs_ioctl_get_csums_entry {
>> + __u64 offset; /* file offset of this range */
>> + __u64 length; /* length in bytes */
>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>> + __u32 reserved; /* padding, must be 0 */
>> +};
>> +
>> +struct btrfs_ioctl_get_csums_args {
>> + __u64 offset; /* in/out: file offset */
>> + __u64 length; /* in/out: range length */
>> + __u64 buf_size; /* in/out: buffer capacity / bytes written */
>> + __u8 buf[]; /* out: entries + csum data */
>> +};
>
> From the progs usage, it is always a single btrfs_ioctl_get_csums_entry
> at the beginning of buf[], then real buffer for csum, can we just
> combine both structures into one?
>
> Furthermore, since we only query one extent at one time, the offset/
> length are more or less duplicated between args and entry structure.
>
> We can just save the length into the args without the need for entry
> members (except the type).
>
> Thanks,
> Qu
>
>> +
>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
>> struct btrfs_ioctl_encoded_io_args)
>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>> struct btrfs_ioctl_subvol_wait)
>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>> + struct btrfs_ioctl_get_csums_args)
>> /* Shutdown ioctl should follow XFS's interfaces, thus not using
>> btrfs magic. */
>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-25 7:34 ` Qu Wenruo
@ 2026-03-25 14:43 ` Mark Harmstone
2026-03-25 21:04 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Mark Harmstone @ 2026-03-25 14:43 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 25/03/2026 7.34 am, Qu Wenruo wrote:
>
>
> On 2026/3/21 08:48, Qu Wenruo wrote:
>>
>>
>> On 2026/3/20 23:20, Mark Harmstone wrote:
>>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>>> query the on-disk csums for a file.
>>>
>>> This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
>>> the kernel, which details the offset and length we're interested in, and
>>> a buffer for the kernel to write its results into. The kernel writes a
>>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>>> csums if available.
>>>
>>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>>> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
>>> the precomputed hash of the zero sector. Otherwise, it sets the type to
>>> BTRFS_GET_CSUMS_NO_CSUMS.
>>
>> I'm not sure if it's a good idea to put hole and preallocated range
>> into the same BTRFS_GET_CSUMS_SPARSE.
>>
>> Although both means there is no csum, hole case means there is really
>> no data extent, thus we should not create any extent instead of
>> writing zero.
Thanks Qu.
"SPARSE" is probably a bad name for it. It probably should be "ZERO" or
somesuch. The point is to tell userspace not to waste time calculating
csums, but use the precomputed values because the data would be zero
(for whatever reason).
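As an illustration of what "use the precomputed values" could look like on the userspace side, here is a minimal sketch that hashes one zero-filled sector once and reuses the result. It assumes the crc32c csum type and a caller-supplied sector size; `crc32c()` is a plain bitwise implementation written for this example and `fill_zero_csums()` is a hypothetical helper, neither is btrfs-progs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hash one zero-filled sector once, then reuse it for every zero sector. */
static int fill_zero_csums(uint32_t *out, size_t nsectors, size_t sectorsize)
{
	uint8_t *zero;
	uint32_t csum;

	assert(out != NULL && sectorsize > 0);
	zero = calloc(1, sectorsize);
	if (!zero)
		return -1;
	csum = crc32c(zero, sectorsize);
	free(zero);
	for (size_t i = 0; i < nsectors; i++)
		out[i] = csum;
	return 0;
}
```

A real tool would cache the zero-sector csum per filesystem rather than recompute it per call, and would use an accelerated crc32c.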
>> For preallocated, indicating it has no CSUM can allow mkfs to
>> distinguish hole and preallocated, thus change to zero writes to
>> prealloc, which is faster and make the resulted fs more aligned to the
>> source dir.
>>
>>
>> And for EOF checks, I think we don't need to bother that much, aka,
>> just let it return the regular results.
>>
>> My assumption is, the mkfs shouldn't pass a range completely beyond
>> the round_up(i_size), as non-reflink rootdir population would always
>> read out the content of the inode from the host fs.
>> Thus we won't really read beyond the inode size.
>
> After more investigation, I think we can put the hole/preallocation/
> compression detection into the user space.
>
> The hole detection is already pending for merge, mostly through
> SEEK_DATA/SEEK_HOLE flags of lseek():
>
> https://github.com/kdave/btrfs-progs/pull/1097
>
> I'm planning to implement preallocation detection through fiemap, which
> also allows us to detect compressed range and skip them for your case.
>
> With all those features implemented in progs, we can further simplify
> the get csum ioctl, to something more aligned to
> btrfs_lookup_csums_bitmap().
>
> We do not need to bother why there is no checksum for some ranges, that
> will be handled by progs first, we only need to return all the checksums
> found for the specified range.
>
> And as an extra safety net, use a bitmap in the ioctl structure to
> indicate which ranges have a checksum and which don't.
>
> This will definitely simplify the ioctl as we only need to do csum tree
> lookup, no need to bother anything in the subvolume tree.
Unfortunately this won't work. You have to explicitly filter out
compressed extents, and identifying these requires checking the FS tree.
The reason is that they may be "bookended", and you would leak
information about other files if you returned the whole of the csums for
the compressed extent. This is the reason why encoded read needs root.
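The bookend concern can be made concrete with a small sketch. `struct file_extent_ref` is a simplified stand-in for the fields of a regular file extent item and `csum_range_for_ref()` is a hypothetical helper. For an uncompressed extent the file's slice maps to a sub-range of the on-disk extent, so only that sub-range's csums need to be exposed. For a compressed extent no such sub-range exists, because the csums cover the compressed bytes of the whole extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the fields of a regular file extent item. */
struct file_extent_ref {
	uint64_t disk_bytenr;    /* logical start of the on-disk extent */
	uint64_t disk_num_bytes; /* size of the whole on-disk extent */
	uint64_t offset;         /* where this file's slice starts inside it */
	uint64_t num_bytes;      /* length of this file's slice */
	bool compressed;
};

/*
 * Compute the logical range whose csums may be returned for this reference.
 * Returns false when no such range exists (compressed extent).
 */
static bool csum_range_for_ref(const struct file_extent_ref *ref,
			       uint64_t *start, uint64_t *len)
{
	assert(ref != NULL && start != NULL && len != NULL);
	if (ref->compressed)
		return false; /* csums cover compressed bytes of the whole extent */
	assert(ref->offset + ref->num_bytes <= ref->disk_num_bytes);
	*start = ref->disk_bytenr + ref->offset;
	*len = ref->num_bytes;
	return true;
}
```

For a file referencing 8 KiB in the middle of a 128 KiB extent, this yields only the 8 KiB slice, never the csums of the remaining 120 KiB that may belong to other files.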
> Thanks,
> Qu
>
>>
>>>
>>> We do store the csums of compressed extents, but we deliberately don't
>>> return them here: they're hashed over the compressed data, not the
>>> uncompressed data that's returned to userspace.
>>
>> I agree with the skip of compressed extents, but I'd prefer to have a
>> special flag to indicate that, other than NO_CSUMS.
>>
>> Or mkfs is unable to distinguish hole and compressed extents.
Compressed extents result in NO_CSUMS, a hole results in SPARSE.
>> [...]
>>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>>> +#define BTRFS_GET_CSUMS_SPARSE 1
>>> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
>>> +
>>> +struct btrfs_ioctl_get_csums_entry {
>>> + __u64 offset; /* file offset of this range */
>>> + __u64 length; /* length in bytes */
>>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>>> + __u32 reserved; /* padding, must be 0 */
>>> +};
>>> +
>>> +struct btrfs_ioctl_get_csums_args {
>>> + __u64 offset; /* in/out: file offset */
>>> + __u64 length; /* in/out: range length */
>>> + __u64 buf_size; /* in/out: buffer capacity / bytes
>>> written */
>>> + __u8 buf[]; /* out: entries + csum data */
>>> +};
>>
>> From the progs usage, it is always a single
>> btrfs_ioctl_get_csums_entry at the beginning of buf[], then real
>> buffer for csum, can we just combine both structures into one?
>>
>> Furthermore, since we only query one extent at one time, the offset/
>> length are more or less duplicated between args and entry structure.
>>
>> We can just save the length into the args without the need for entry
>> members (except the type).
>>
>> Thanks,
>> Qu
>>
>>> +
>>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_* flags. */
>>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>>> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
>>> struct btrfs_ioctl_encoded_io_args)
>>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>>> struct btrfs_ioctl_subvol_wait)
>>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>>> + struct btrfs_ioctl_get_csums_args)
>>> /* Shutdown ioctl should follow XFS's interfaces, thus not using
>>> btrfs magic. */
>>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>>
>>
>
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-25 14:43 ` Mark Harmstone
@ 2026-03-25 21:04 ` Qu Wenruo
2026-04-02 17:05 ` Mark Harmstone
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-03-25 21:04 UTC (permalink / raw)
To: Mark Harmstone, linux-btrfs
On 2026/3/26 01:13, Mark Harmstone wrote:
> On 25/03/2026 7.34 am, Qu Wenruo wrote:
>>
>>
>> On 2026/3/21 08:48, Qu Wenruo wrote:
>>>
>>>
>>> On 2026/3/20 23:20, Mark Harmstone wrote:
>>>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>>>> query the on-disk csums for a file.
>>>>
>>>> This is done by userspace passing a struct
>>>> btrfs_ioctl_get_csums_args to
>>>> the kernel, which details the offset and length we're interested in,
>>>> and
>>>> a buffer for the kernel to write its results into. The kernel writes a
>>>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>>>> csums if available.
>>>>
>>>> If the extent is an uncompressed, non-nodatasum extent, the kernel sets
>>>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>>>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>>>> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
>>>> the precomputed hash of the zero sector. Otherwise, it sets the type to
>>>> BTRFS_GET_CSUMS_NO_CSUMS.
>>>
>>> I'm not sure if it's a good idea to put hole and preallocated range
>>> into the same BTRFS_GET_CSUMS_SPARSE.
>>>
>>> Although both means there is no csum, hole case means there is really
>>> no data extent, thus we should not create any extent instead of
>>> writing zero.
>
> Thanks Qu.
>
> "SPARSE" is probably a bad name for it. It probably should be "ZERO" or
> somesuch. The point is to tell userspace not to waste time calculating
> csums, but use the precomputed values because the data would be zero
> (for whatever reason).
>
>>> For preallocated, indicating it has no CSUM can allow mkfs to
>>> distinguish hole and preallocated, thus change to zero writes to
>>> prealloc, which is faster and make the resulted fs more aligned to
>>> the source dir.
>>>
>>>
>>> And for EOF checks, I think we don't need to bother that much, aka,
>>> just let it return the regular results.
>>>
>>> My assumption is, the mkfs shouldn't pass a range completely beyond
>>> the round_up(i_size), as non-reflink rootdir population would always
>>> read out the content of the inode from the host fs.
>>> Thus we won't really read beyond the inode size.
>>
>> After more investigation, I think we can put the hole/preallocation/
>> compression detection into the user space.
>>
>> The hole detection is already pending for merge, mostly through
>> SEEK_DATA/SEEK_HOLE flags of lseek():
>>
>> https://github.com/kdave/btrfs-progs/pull/1097
>>
>> I'm planning to implement preallocation detection through fiemap,
>> which also allows us to detect compressed range and skip them for your
>> case.
>>
>> With all those features implemented in progs, we can further simplify
>> the get csum ioctl, to something more aligned to
>> btrfs_lookup_csums_bitmap().
>>
>> We do not need to bother why there is no checksum for some ranges,
>> that will be handled by progs first, we only need to return all the
>> checksums found for the specified range.
>>
>> And as an extra safety net, use a bitmap in the ioctl structure to
>> indicate which ranges have a checksum and which don't.
>>
>> This will definitely simplify the ioctl as we only need to do csum
>> tree lookup, no need to bother anything in the subvolume tree.
>
> Unfortunately this won't work. You have to explicitly filter out
> compressed extents, and identifying these requires checking the FS tree.
That's done by progs through fiemap. There will be a flag ENCODED for
compressed file extents.
> The reason is that they may be "bookended", and you would leak
> information about other files if you returned the whole of the csums for
> the compressed extent. This is the reason why encoded read needs root.
Nope, for the fiemap call, we will never reach any bookend extents.
Thanks,
Qu
>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> We do store the csums of compressed extents, but we deliberately don't
>>>> return them here: they're hashed over the compressed data, not the
>>>> uncompressed data that's returned to userspace.
>>>
>>> I agree with the skip of compressed extents, but I'd prefer to have a
>>> special flag to indicate that, other than NO_CSUMS.
>>>
>>> Or mkfs is unable to distinguish hole and compressed extents.
>
> Compressed extents result in NO_CSUMS, a hole results in SPARSE.
>
>>> [...]
>>>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>>>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>>>> +#define BTRFS_GET_CSUMS_SPARSE 1
>>>> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
>>>> +
>>>> +struct btrfs_ioctl_get_csums_entry {
>>>> + __u64 offset; /* file offset of this range */
>>>> + __u64 length; /* length in bytes */
>>>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>>>> + __u32 reserved; /* padding, must be 0 */
>>>> +};
>>>> +
>>>> +struct btrfs_ioctl_get_csums_args {
>>>> + __u64 offset; /* in/out: file offset */
>>>> + __u64 length; /* in/out: range length */
>>>> + __u64 buf_size; /* in/out: buffer capacity / bytes
>>>> written */
>>>> + __u8 buf[]; /* out: entries + csum data */
>>>> +};
>>>
>>> From the progs usage, it is always a single
>>> btrfs_ioctl_get_csums_entry at the beginning of buf[], then real
>>> buffer for csum, can we just combine both structures into one?
>>>
>>> Furthermore, since we only query one extent at one time, the offset/
>>> length are more or less duplicated between args and entry structure.
>>>
>>> We can just save the length into the args without the need for entry
>>> members (except the type).
>>>
>>> Thanks,
>>> Qu
>>>
>>>> +
>>>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_*
>>>> flags. */
>>>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>>>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>>>> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
>>>> struct btrfs_ioctl_encoded_io_args)
>>>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>>>> struct btrfs_ioctl_subvol_wait)
>>>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>>>> + struct btrfs_ioctl_get_csums_args)
>>>> /* Shutdown ioctl should follow XFS's interfaces, thus not using
>>>> btrfs magic. */
>>>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>>>
>>>
>>
>>
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-03-25 21:04 ` Qu Wenruo
@ 2026-04-02 17:05 ` Mark Harmstone
2026-04-02 21:46 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Mark Harmstone @ 2026-04-02 17:05 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 25/03/2026 9.04 pm, Qu Wenruo wrote:
>
>
> On 2026/3/26 01:13, Mark Harmstone wrote:
>> On 25/03/2026 7.34 am, Qu Wenruo wrote:
>>>
>>>
>>> On 2026/3/21 08:48, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2026/3/20 23:20, Mark Harmstone wrote:
>>>>> Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
>>>>> query the on-disk csums for a file.
>>>>>
>>>>> This is done by userspace passing a struct
>>>>> btrfs_ioctl_get_csums_args to
>>>>> the kernel, which details the offset and length we're interested
>>>>> in, and
>>>>> a buffer for the kernel to write its results into. The kernel writes a
>>>>> struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
>>>>> csums if available.
>>>>>
>>>>> If the extent is an uncompressed, non-nodatasum extent, the kernel
>>>>> sets
>>>>> the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
>>>>> csums. If it is sparse, preallocated, or beyond the EOF, it sets the
>>>>> type to BTRFS_GET_CSUMS_SPARSE - this is so userspace knows it can use
>>>>> the precomputed hash of the zero sector. Otherwise, it sets the
>>>>> type to
>>>>> BTRFS_GET_CSUMS_NO_CSUMS.
>>>>
>>>> I'm not sure if it's a good idea to put hole and preallocated range
>>>> into the same BTRFS_GET_CSUMS_SPARSE.
>>>>
>>>> Although both means there is no csum, hole case means there is
>>>> really no data extent, thus we should not create any extent instead
>>>> of writing zero.
>>
>> Thanks Qu.
>>
>> "SPARSE" is probably a bad name for it. It probably should be "ZERO"
>> or somesuch. The point is to tell userspace not to waste time
>> calculating csums, but use the precomputed values because the data
>> would be zero (for whatever reason).
>>
>>>> For preallocated, indicating it has no CSUM can allow mkfs to
>>>> distinguish hole and preallocated, thus change to zero writes to
>>>> prealloc, which is faster and make the resulted fs more aligned to
>>>> the source dir.
>>>>
>>>>
>>>> And for EOF checks, I think we don't need to bother that much, aka,
>>>> just let it return the regular results.
>>>>
>>>> My assumption is, the mkfs shouldn't pass a range completely beyond
>>>> the round_up(i_size), as non-reflink rootdir population would always
>>>> read out the content of the inode from the host fs.
>>>> Thus we won't really read beyond the inode size.
>>>
>>> After more investigation, I think we can put the hole/preallocation/
>>> compression detection into the user space.
>>>
>>> The hole detection is already pending for merge, mostly through
>>> SEEK_DATA/SEEK_HOLE flags of lseek():
>>>
>>> https://github.com/kdave/btrfs-progs/pull/1097
>>>
>>> I'm planning to implement preallocation detection through fiemap,
>>> which also allows us to detect compressed range and skip them for
>>> your case.
>>>
>>> With all those features implemented in progs, we can further simplify
>>> the get csum ioctl, to something more aligned to
>>> btrfs_lookup_csums_bitmap().
>>>
>>> We do not need to bother why there is no checksum for some ranges,
>>> that will be handled by progs first, we only need to return all the
>>> checksums found for the specified range.
>>>
>>> And as an extra safety net, use a bitmap in the ioctl structure to
>>> indicate which ranges have a checksum and which don't.
>>>
>>> This will definitely simplify the ioctl as we only need to do csum
>>> tree lookup, no need to bother anything in the subvolume tree.
>>
>> Unfortunately this won't work. You have to explicitly filter out
>> compressed extents, and identifying these requires checking the FS tree.
>
> That's done by progs through fiemap. There will be a flag ENCODED for
> compressed file extents.
No, this still won't work I'm afraid. The ioctl is answering the
question "what's the csum of the sector no. such-and-such in this
file?". That can't be answered for compressed extents, as the csums are
on the compressed data.
There might be a use case for "fetch the csums for a compressed extent",
but it'd be something different.
My concern about relying on ENCODED is that that would also be set when
we implement encryption. For an encrypted uncompressed extent the csum
*would* be meaningful.
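The ambiguity can be shown with the existing fiemap flags. The sketch below uses the real `FIEMAP_EXTENT_ENCODED` and `FIEMAP_EXTENT_DATA_ENCRYPTED` definitions from `<linux/fiemap.h>`, where the encrypted flag is documented as also setting the encoded one. Whether a future btrfs encryption implementation would report `FIEMAP_EXTENT_DATA_ENCRYPTED` is an assumption, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/fiemap.h>

/* Naive filter: treats every encoded extent as "compressed, skip it". */
static bool naive_skip_encoded(const struct fiemap_extent *fe)
{
	assert(fe != NULL);
	return (fe->fe_flags & FIEMAP_EXTENT_ENCODED) != 0;
}

/*
 * True when the flags alone cannot tell "encrypted only" apart from
 * "compressed and encrypted": both report ENCODED plus DATA_ENCRYPTED.
 */
static bool encoded_is_ambiguous(const struct fiemap_extent *fe)
{
	assert(fe != NULL);
	return (fe->fe_flags & FIEMAP_EXTENT_ENCODED) &&
	       (fe->fe_flags & FIEMAP_EXTENT_DATA_ENCRYPTED);
}
```

Under these assumptions a userspace filter based only on `ENCODED` would skip encrypted but uncompressed extents as well, which is the case described above.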
>> The reason is that they may be "bookended", and you would leak
>> information about other files if you returned the whole of the csums
>> for the compressed extent. This is the reason why encoded read needs
>> root.
>
> Nope, for the fiemap call, we will never reach any bookend extents.
I really don't think the FIEMAP call achieves anything here. The kernel
still has to do a lookup in the FS tree to determine what the logical
address of the extent is. We can't allow (non-root) users to read the
csums of arbitrary sectors.
> Thanks,
> Qu
>
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>>
>>>>> We do store the csums of compressed extents, but we deliberately don't
>>>>> return them here: they're hashed over the compressed data, not the
>>>>> uncompressed data that's returned to userspace.
>>>>
>>>> I agree with the skip of compressed extents, but I'd prefer to have
>>>> a special flag to indicate that, other than NO_CSUMS.
>>>>
>>>> Or mkfs is unable to distinguish hole and compressed extents.
>>
>> Compressed extents result in NO_CSUMS, a hole results in SPARSE.
>>
>>>> [...]
>>>>> +/* Types for struct btrfs_ioctl_get_csums_entry::type */
>>>>> +#define BTRFS_GET_CSUMS_HAS_CSUMS 0
>>>>> +#define BTRFS_GET_CSUMS_SPARSE 1
>>>>> +#define BTRFS_GET_CSUMS_NO_CSUMS 2
>>>>> +
>>>>> +struct btrfs_ioctl_get_csums_entry {
>>>>> + __u64 offset; /* file offset of this range */
>>>>> + __u64 length; /* length in bytes */
>>>>> + __u32 type; /* BTRFS_GET_CSUMS_* type */
>>>>> + __u32 reserved; /* padding, must be 0 */
>>>>> +};
>>>>> +
>>>>> +struct btrfs_ioctl_get_csums_args {
>>>>> + __u64 offset; /* in/out: file offset */
>>>>> + __u64 length; /* in/out: range length */
>>>>> + __u64 buf_size; /* in/out: buffer capacity / bytes
>>>>> written */
>>>>> + __u8 buf[]; /* out: entries + csum data */
>>>>> +};
>>>>
>>>> From the progs usage, it is always a single
>>>> btrfs_ioctl_get_csums_entry at the beginning of buf[], then real
>>>> buffer for csum, can we just combine both structures into one?
>>>>
>>>> Furthermore, since we only query one extent at one time, the offset/
>>>> length are more or less duplicated between args and entry structure.
>>>>
>>>> We can just save the length into the args without the need for entry
>>>> members (except the type).
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> +
>>>>> /* Flags for IOC_SHUTDOWN, must match XFS_FSOP_GOING_FLAGS_*
>>>>> flags. */
>>>>> #define BTRFS_SHUTDOWN_FLAGS_DEFAULT 0x0
>>>>> #define BTRFS_SHUTDOWN_FLAGS_LOGFLUSH 0x1
>>>>> @@ -1226,6 +1245,8 @@ enum btrfs_err_code {
>>>>> struct btrfs_ioctl_encoded_io_args)
>>>>> #define BTRFS_IOC_SUBVOL_SYNC_WAIT _IOW(BTRFS_IOCTL_MAGIC, 65, \
>>>>> struct btrfs_ioctl_subvol_wait)
>>>>> +#define BTRFS_IOC_GET_CSUMS _IOWR(BTRFS_IOCTL_MAGIC, 66, \
>>>>> + struct btrfs_ioctl_get_csums_args)
>>>>> /* Shutdown ioctl should follow XFS's interfaces, thus not using
>>>>> btrfs magic. */
>>>>> #define BTRFS_IOC_SHUTDOWN _IOR('X', 125, __u32)
>>>>
>>>>
>>>
>>>
>>
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-02 17:05 ` Mark Harmstone
@ 2026-04-02 21:46 ` Qu Wenruo
2026-04-03 22:44 ` Boris Burkov
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-04-02 21:46 UTC (permalink / raw)
To: Mark Harmstone, linux-btrfs
On 2026/4/3 03:35, Mark Harmstone wrote:
> On 25/03/2026 9.04 pm, Qu Wenruo wrote:
[...]
>>
>> That's done by progs through fiemap. There will be a flag ENCODED for
>> compressed file extents.
>
> No, this still won't work I'm afraid. The ioctl is answering the
> question "what's the csum of the sector no. such-and-such in this
> file?". That can't be answered for compressed extents, as the csums are
> on the compressed data.
My point is, since you are not trying to fetch the csums of compressed
extents in the first place, you don't need to worry about that situation at all.
And even for compressed extents, it is still possible to fetch the csums;
after all, we're just searching the csum tree for a given logical bytenr.
There will be some extra concerns, like fiemap not being able to return the real
compressed length, but again we ruled out compressed extents in the
first place.
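The arithmetic behind a lookup by logical bytenr can be sketched as follows, assuming a csum item stores one fixed-size csum per sector, starting at the item's logical start address. Both helper names are hypothetical and the sector and csum sizes are parameters rather than values read from a superblock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset of the csum for bytenr inside a csum item whose first
 * covered sector starts at logical address item_start.
 */
static size_t csum_offset_in_item(uint64_t item_start, uint64_t bytenr,
				  uint32_t sectorsize, uint32_t csum_size)
{
	assert(sectorsize > 0 && bytenr >= item_start);
	assert((bytenr - item_start) % sectorsize == 0);
	return (size_t)((bytenr - item_start) / sectorsize) * csum_size;
}

/* Number of csum bytes describing len bytes of sector-aligned data. */
static size_t csum_bytes_for_len(uint64_t len, uint32_t sectorsize,
				 uint32_t csum_size)
{
	assert(sectorsize > 0 && len % sectorsize == 0);
	return (size_t)(len / sectorsize) * csum_size;
}
```

For example, with 4 KiB sectors and 4-byte crc32c csums, 1 MiB of data is described by 1 KiB of csums.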
>
> There might be a use case for "fetch the csums for a compressed extent",
> but it'd be something different.
>
> My concern about relying on ENCODED is that that would also be set when
> we implement encryption. For an encrypted uncompressed extent the csum
> *would* be meaningful.
>
>>> The reason is that they may be "bookended", and you would leak
>>> information about other files if you returned the whole of the csums
>>> for the compressed extent. This is the reason why encoded read needs
>>> root.
>>
>> Nope, for the fiemap call, we will never reach any bookend extents.
>
> I really don't think the FIEMAP call achieves anything here. The kernel
> still has to do a lookup in the FS tree to determine what the logical
> address of the extent is.
The point is to minimize the work in the kernel.
We already have an ioctl that does the fs tree lookup, and it is definitely
enough for non-compressed extents: the fiemap ioctl.
Thus I do not want to introduce new code just to do a similar thing again.
> We can't allow (non-root) users to read the
> csums of arbitrary sectors.
That's your choice for the csum ioctl; I have no preference. Fiemap
doesn't require root privilege anyway, but if you want a root check I
see no problem with that either, since it's more secure.
Thanks,
Qu
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-02 21:46 ` Qu Wenruo
@ 2026-04-03 22:44 ` Boris Burkov
2026-04-03 23:00 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Boris Burkov @ 2026-04-03 22:44 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Mark Harmstone, linux-btrfs
On Fri, Apr 03, 2026 at 08:16:26AM +1030, Qu Wenruo wrote:
>
>
> On 2026/4/3 03:35, Mark Harmstone wrote:
> > On 25/03/2026 9.04 pm, Qu Wenruo wrote:
> [...]
> > >
> > > That's done by progs through fiemap. There will be a flag ENCODED
> > > for compressed file extents.
> >
> > No, this still won't work I'm afraid. The ioctl is answering the
> > question "what's the csum of the sector no. such-and-such in this
> > file?". That can't be answered for compressed extents, as the csums are
> > on the compressed data.
>
> My point is, since you are not trying to fetch the csums of compressed
> extents in the first place, you don't need to worry about that situation at all.
>
> And even for compressed extents, it is still possible to fetch the csums;
> after all, we're just searching the csum tree for a given logical bytenr.
>
> There will be some extra concerns, like fiemap not being able to return the real
> compressed length, but again we ruled out compressed extents in the first
> place.
>
I may be misinterpreting, but I feel like the question at hand is how
much hardening the ioctl requires to be correct, and how much work it
can delegate to userspace.
Suppose we make it require root. Then I think we could make the interface
much simpler and just use logical offsets instead of a file-based
interface, and leave it entirely up to userspace to figure out
which ranges they care about.
OTOH, if we agree we want the csums ioctl to be unprivileged, then
the interface must assume that the input could be bad: on
overlapping extents, not marked up properly, etc. In that case, I do
not know *exactly* what is redundant with fiemap, what is
necessary for safety vs. for caller convenience, etc.
Basically, I think the options are roughly:
- Mark's proposal: A smart, convenient GET_CSUMS that does everything
turnkey and as helpfully as possible. Lots of redundancy with fiemap.
Safe to make unprivileged.
- Qu's review: Require the user to do the fiemap part themselves and
don't make GET_CSUMS quite as turnkey. It is unclear to me whether it
is possible to make such a version unprivileged safely *without*
the fiemap redundancy.
- Boris's strawman: A dumb, inconvenient GET_CSUMS that expects a lot of
userspace but doesn't check anything and definitely needs root. If we
do go root-only, I feel like this might be the best interface?
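To make the third option more tangible, here is what a logical-address request might look like from userspace. Everything in it is hypothetical: the struct name, its fields, and the `alloc_logical_request()` helper are illustrations of the strawman, not a proposed UAPI, and `<stdint.h>` types stand in for `__u64`/`__u8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical root-only request keyed by logical address, not by file. */
struct get_csums_logical_args {
	uint64_t logical;  /* in: logical bytenr to start from */
	uint64_t length;   /* in: bytes of data to cover */
	uint64_t buf_size; /* in/out: buffer capacity / bytes written */
	uint8_t  buf[];    /* out: raw csums, one per sector */
};

/* Allocate a zeroed request with room for csum_bytes of results. */
static struct get_csums_logical_args *
alloc_logical_request(uint64_t logical, uint64_t length, size_t csum_bytes)
{
	struct get_csums_logical_args *args;

	assert(csum_bytes <= SIZE_MAX - sizeof(*args));
	args = calloc(1, sizeof(*args) + csum_bytes);
	if (!args)
		return NULL;
	args->logical = logical;
	args->length = length;
	args->buf_size = csum_bytes;
	return args;
}
```

In this shape the file-to-logical mapping, hole handling and compression filtering would all be the caller's job, which is why it could only be offered to privileged users.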
And the questions are:
1. How badly do we want non-root? In practice, mkfs is root when writing
disks but not necessarily when writing image files, so it's a bit of a
toss-up there. At Meta we tend to end up sad when mkfs has root-only
functionality that we want.
2. What is the bare minimum processing needed to safely allow non-root
callers with arbitrarily wrong input? I don't see how we can assume they
will use fiemap correctly and not hit bookends, or set correct tags on
the input, for a few examples.
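Part of any answer to the second question is generic range checking of untrusted input. The sketch below shows only that part: `validate_range()` is a hypothetical helper, and the choice of error codes is an assumption. It does not address the harder issues raised above, such as bookend extents or permission to read the underlying data.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Reject ranges an untrusted caller could use to confuse later arithmetic.
 * Returns 0 if the range is acceptable, a negative errno value otherwise.
 */
static int validate_range(uint64_t offset, uint64_t length, uint32_t sectorsize)
{
	assert(sectorsize > 0);
	if (length == 0)
		return -EINVAL;
	if (offset % sectorsize || length % sectorsize)
		return -EINVAL;
	if (offset > UINT64_MAX - length)
		return -EOVERFLOW; /* offset + length would wrap around */
	return 0;
}
```

Checks of this kind are cheap and would apply to every variant of the interface, privileged or not.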
Thanks,
Boris
> >
> > There might be a use case for "fetch the csums for a compressed extent",
> > but it'd be something different.
> >
> > My concern about relying on ENCODED is that that would also be set when
> > we implement encryption. For an encrypted uncompressed extent the csum
> > *would* be meaningful.
> >
> > > > The reason is that they may be "bookended", and you would leak
> > > > information about other files if you returned the whole of the
> > > > csums for the compressed extent. This is the reason why encoded
> > > > read needs root.
> > >
> > > Nope, for the fiemap call, we will never reach any bookend extents.
> >
> > I really don't think the FIEMAP call achieves anything here. The kernel
> > still has to do a lookup in the FS tree to determine what the logical
> > address of the extent is.
>
> The point is to minimize the work in kernel.
>
> We already have an ioctl that does the fs tree lookup, and it is definitely
> enough for non-compressed extents: the fiemap ioctl.
>
> Thus I do not want to introduce new code just to do a similar thing again.
>
> > We can't allow (non-root) users to read the csums of arbitrary sectors.
>
> That's your choice for the csum ioctl; I have no preference. Fiemap
> doesn't require root privilege anyway, but if you want a root check I see no
> problem with that either, since it's more secure.
>
> Thanks,
> Qu
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-03 22:44 ` Boris Burkov
@ 2026-04-03 23:00 ` Qu Wenruo
2026-04-07 18:13 ` Mark Harmstone
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-04-03 23:00 UTC (permalink / raw)
To: Boris Burkov; +Cc: Mark Harmstone, linux-btrfs
On 2026/4/4 09:14, Boris Burkov wrote:
> On Fri, Apr 03, 2026 at 08:16:26AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/4/3 03:35, Mark Harmstone wrote:
>>> On 25/03/2026 9.04 pm, Qu Wenruo wrote:
>> [...]
>>>>
>>>> That's done by progs through fiemap. There will be a flag ENCODED
>>>> for compressed file extents.
>>>
>>> No, this still won't work I'm afraid. The ioctl is answering the
>>> question "what's the csum of the sector no. such-and-such in this
>>> file?". That can't be answered for compressed extents, as the csums are
>>> on the compressed data.
>>
>> My point is, since you are not trying to fetching the csum of compressed
>> extent in the first place, you don't need to bother that situation at all.
>>
>> And even for compressed extents, it is still possible to fetch the csum,
>> after all we're just search the csum tree for a given logical bytenr.
>>
>> There will be some extra concerns like fiemap can not return the real
>> compressed length, but again we ruled our compressed extents in the first
>> place.
>>
>
> I may be misinterpreting, but I feel like the question at hand is how
> much hardening the ioctl requires to be correct, and how much work it
> can delegate to userspace.
>
> Suppose we make it require root, I think we could make the interface
> much simpler and just use the logical offsets instead of a file based
> interface, and we can leave it up to userspace entirely to figure out
> which ranges they care about.
>
> OTOH, if we agree we want the csums ioctl to be unprivileged, which
> means that the interface must assume that the input could be bad, on
> overlapping extents, not marked up properly, etc... In that case, I do
> not quite know *exactly* what is redundant with fiemap, what is exactly
> necessary for safety vs for caller convenience, etc.
>
> Basically, I think the options are roughly:
>
> - Mark's proposal: A smart, convenient GET_CSUMS that does everything
> turnkey and as helpfully as possible. Lots of redundance with fiemap.
> Safe to make unprivileged.
> - Qu's review: Require the user to do the fiemap part themselves and
> don't make GET_CSUMS quite as turnkey. It is unclear to me whether it
> is possible to make such a version unprivileged safely *without*
> the fiemap redundancy.
> - Boris's strawman: A dumb, inconvenient GET_CSUMS that expects a lot of
> userspace but doesn't check anything and definitely needs root. If we
> do go root-only, I feel like this might be the best interface?
Well, my idea is more aligned with yours, except for the root part.
What our ideas share is that the ioctl only handles things inside the
csum tree, without touching the subvolume tree.
Yes, bad inputs can leak a lot of information if we allow non-root users
to use this ioctl, but I doubt they can really do anything with the
information they get.
One still needs proper privilege to call fiemap on a file, so even if
one knows there are some csums at a random logical bytenr, the csums are
still useless unless one can access the fiemap results of the files that
use those bytenrs.
But I'm also fine with requiring root privilege for the ioctl, as to me
a stricter requirement has no obvious disadvantage, and it frees us from
safety concerns.
>
> And the questions are:
> 1. How badly do we want non-root? In practice, mkfs is root when writing
> disks but not necessarily when writing image files, so it's a bit of a
> toss up there. At meta we tend to end up sad when mkfs has root-only
> functionality that we want.
I'm fine either way.
> 2. What is the bare minimum processing needed to safely allow non-root
> callers with arbitrarily wrong input? I don't see how we can assume they
> will use fiemap correctly and not hit bookends, or set correct tags on
> the input, for a few examples.
They just pass a random logical address into the ioctl, and we return
whatever they ask for, including something to show which ranges have
csums, and the csums themselves.
Let me be clear again: it's just a variant of TREE_SEARCH, except that
the existing TREE_SEARCH is not good enough for csum tree searches.
We don't need to care whether it's bookended, compressed or whatever. If
they want to do stupid things, that's their choice, and just reading the
csums shouldn't cause any writes, side effects or damage to the fs, so
let them do whatever.
Thanks,
Qu
>
> Thanks,
> Boris
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-03 23:00 ` Qu Wenruo
@ 2026-04-07 18:13 ` Mark Harmstone
2026-04-07 21:52 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Mark Harmstone @ 2026-04-07 18:13 UTC (permalink / raw)
To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs
I think all three of us are confusing each other a little here.
The ioctl answers the question: if I were to read X bytes of data from a
file at Y offset and calculated the csums manually, what would the
values be? The kernel responds in one of three ways: with the values;
that the read is guaranteed to return zeroes, so we can use the
precomputed csum for the zero sector; or that the values aren't known
and userspace has to calculate them anyway.
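A minimal sketch of how a userspace caller might act on those three
answers. The enum and function names are hypothetical stand-ins, not the
definitions from the patch, and the csum size is passed in by the caller
as an assumption rather than queried from the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the three answers the kernel can give. */
enum csum_answer {
	ANSWER_HAS_CSUMS,	/* csums are supplied, one per sector */
	ANSWER_SPARSE,		/* range reads back as zeroes */
	ANSWER_NO_CSUMS,	/* nodatasum or compressed: caller must hash */
};

/*
 * Fill out[] with one csum per sector for a range.
 * Returns the number of sectors the caller still has to read and hash
 * itself (0 when out[] has been filled completely).
 */
size_t fill_csums(enum csum_answer answer, const uint8_t *kernel_csums,
		  const uint8_t *zero_sector_csum, size_t nr_sectors,
		  size_t csum_size, uint8_t *out)
{
	size_t i;

	assert(out != NULL && csum_size > 0);

	switch (answer) {
	case ANSWER_HAS_CSUMS:
		/* Reuse the on-disk csums as they are. */
		memcpy(out, kernel_csums, nr_sectors * csum_size);
		return 0;
	case ANSWER_SPARSE:
		/* Every sector is zeroes: repeat the precomputed csum. */
		for (i = 0; i < nr_sectors; i++)
			memcpy(out + i * csum_size, zero_sector_csum,
			       csum_size);
		return 0;
	case ANSWER_NO_CSUMS:
	default:
		/* Nothing usable: fall back to read-and-hash. */
		return nr_sectors;
	}
}
```

Only the last case costs any data reads, which is where the saving for
mkfs.btrfs --rootdir would come from.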
The values aren't known if it's a nodatasum file or if it's compressed.
We store the csums of compressed extents, but crucially they're
calculated over the compressed data. So there's no one-to-one mapping
between file blocks and compressed sectors (by definition, because it's
compressed), and bookending means that it might be data we don't have
access to.
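To make the missing one-to-one mapping concrete, a small worked example.
The helper name is hypothetical, and the 4 KiB sector size and the
extent sizes in the note below are illustrative assumptions, not figures
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sectors needed to hold `bytes`, rounding up. */
uint64_t sectors_for(uint64_t bytes, uint32_t sectorsize)
{
	assert(sectorsize > 0);
	return (bytes + sectorsize - 1) / sectorsize;
}
```

With 4 KiB sectors, 128 KiB of file data that compresses to 20 KiB spans
sectors_for(131072, 4096) = 32 file sectors, but only
sectors_for(20480, 4096) = 5 csums are stored, and none of those five
matches the hash of any uncompressed file sector.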
We absolutely can't give non-root users csums for arbitrary data; that's
definitely a security breach.
Userspace can already obtain the csums from the disk for a file by using
FIEMAP and the tree search ioctl. But I believe the consensus around the
tree search ioctl is a) that it's very difficult to use, as you need to
know the internals of btrfs, and b) that it requires CAP_SYS_ADMIN, at a
time when containerization and finer-grained access controls mean this
is frowned upon.
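For the FIEMAP half of that existing route, a rough sketch of filtering
extents by flag before any csum lookup. The flag names come from
<linux/fiemap.h>; the helper name and the exact set of flags treated as
unusable are assumptions for illustration, not what btrfs-progs does.

```c
#include <stdint.h>
#include <linux/fiemap.h>

/*
 * Returns 1 if an extent reported by FIEMAP could have per-sector csums
 * that match what read() returns, 0 if the caller should not bother
 * looking them up.
 */
int fiemap_extent_may_have_csums(uint32_t fe_flags)
{
	const uint32_t unusable =
		FIEMAP_EXTENT_ENCODED |		/* e.g. compressed */
		FIEMAP_EXTENT_DATA_INLINE |	/* data lives in metadata */
		FIEMAP_EXTENT_UNWRITTEN |	/* preallocated, reads as zero */
		FIEMAP_EXTENT_DELALLOC |	/* not written out yet */
		FIEMAP_EXTENT_UNKNOWN;		/* location not known */

	return (fe_flags & unusable) == 0;
}
```

As raised earlier in the thread, ENCODED would presumably also be set
for encrypted extents once those exist, so this flag alone cannot tell
compressed data apart from other encodings.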
This ioctl is a simpler way of doing the csum lookup, and without
requiring root.
On 04/04/2026 12.00 am, Qu Wenruo wrote:
>
>
> On 2026/4/4 09:14, Boris Burkov wrote:
>> On Fri, Apr 03, 2026 at 08:16:26AM +1030, Qu Wenruo wrote:
>>>
>>>
>>> On 2026/4/3 03:35, Mark Harmstone wrote:
>>>> On 25/03/2026 9.04 pm, Qu Wenruo wrote:
>>> [...]
>>>>>
>>>>> That's done by progs through fiemap. There will be a flag ENCODED
>>>>> for compressed file extents.
>>>>
>>>> No, this still won't work I'm afraid. The ioctl is answering the
>>>> question "what's the csum of the sector no. such-and-such in this
>>>> file?". That can't be answered for compressed extents, as the csums are
>>>> on the compressed data.
>>>
>>> My point is, since you are not trying to fetching the csum of compressed
>>> extent in the first place, you don't need to bother that situation at
>>> all.
>>>
>>> And even for compressed extents, it is still possible to fetch the csum,
>>> after all we're just search the csum tree for a given logical bytenr.
>>>
>>> There will be some extra concerns like fiemap can not return the real
>>> compressed length, but again we ruled our compressed extents in the
>>> first
>>> place.
>>>
>>
>> I may be misinterpreting, but I feel like the question at hand is how
>> much hardening the ioctl requires to be correct, and how much work it
>> can delegate to userspace.
>>
>> Suppose we make it require root, I think we could make the interface
>> much simpler and just use the logical offsets instead of a file based
>> interface, and we can leave it up to userspace entirely to figure out
>> which ranges they care about.
>>
>> OTOH, if we agree we want the csums ioctl to be unprivileged, which
>> means that the interface must assume that the input could be bad, on
>> overlapping extents, not marked up properly, etc... In that case, I do
>> not quite know *exactly* what is redundant with fiemap, what is exactly
>> necessary for safety vs for caller convenience, etc.
>>
>> Basically, I think the options are roughly:
>>
>> - Mark's proposal: A smart, convenient GET_CSUMS that does everything
>> turnkey and as helpfully as possible. Lots of redundance with fiemap.
>> Safe to make unprivileged.
>> - Qu's review: Require the user to do the fiemap part themselves and
>> don't make GET_CSUMS quite as turnkey. It is unclear to me whether it
>> is possible to make such a version unprivileged safely *without*
>> the fiemap redundancy.
>> - Boris's strawman: A dumb, inconvenient GET_CSUMS that expects a lot of
>> userspace but doesn't check anything and definitely needs root. If we
>> do go root-only, I feel like this might be the best interface?
>
> Well, my idea is more aligned with yours, except the root part.
>
> Our ideas share the same part, the ioctl just handles things inside the
> csum tree without bothering subvolume tree.
>
> Yes, bad inputs can lead to a lot of information leakage if we allow
> non-root users to use this ioctl, but I doubt if they can really do
> anything with the information they got.
>
> One still needs proper privilege to call fiemap on a file, so even if
> one knows there are some csum at random logical bytenr, unless they can
> access fiemap result of files that are utilizing those bytenrs, the csum
> is still useless.
>
> But I'm also fine with root privilege requirement for the ioctl too, as
> to me stricter requirement has no obvious disadvantage, and can release
> us from safety concerns.
>
>>
>> And the questions are:
>> 1. How badly do we want non-root? In practice, mkfs is root when writing
>> disks but not necessarily when writing image files, so it's a bit of a
>> toss up there. At meta we tend to end up sad when mkfs has root-only
>> functionality that we want.
>
> I'm fine either way.
>
>> 2. What is the bare minimum processing needed to safely allow non-root
>> callers with arbitrarily wrong input? I don't see how we can assume they
>> will use fiemap correctly and not hit bookends, or set correct tags on
>> the input, for a few examples.
>
> They just pass random logical into the ioctl, and we return whatever
> they want, including something to show which range has csum, and the csum.
>
> Let me be clear again, it's just a variant of TREE_SEARCH, except the
> existing TREE_SEARCH is not good enough for csum tree search.
>
> We don't need to bother if it's bookend/compressed or whatever, if they
> want to do stupid things, that's their choice, and just reading the csum
> shouldn't cause any writes/effects/damage to the fs, so let they do
> whatever.
>
> Thanks,
> Qu
>
>>
>> Thanks,
>> Boris
>>
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-07 18:13 ` Mark Harmstone
@ 2026-04-07 21:52 ` Qu Wenruo
2026-04-07 22:13 ` Boris Burkov
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-04-07 21:52 UTC (permalink / raw)
To: Mark Harmstone, Boris Burkov; +Cc: linux-btrfs
On 2026/4/8 03:43, Mark Harmstone wrote:
> I think all three of us are confusing each other a little here.
>
> The ioctl answers the question: if I were to read X bytes of data from a
> file at Y offset and calculated the csums manually, what would the value
> be? To which the kernel responds either with the values, that the read
> is guaranteed to return zero and thus we can use the precomputed csum
> for the zero sector, or that the value isn't known and userspace has to
> do it anyway.
>
> The value isn't known if it's a nodatasum file or if it's compressed. We
> store the csums of compressed extents, but crucially it's over the
> compressed data. So there's no one-to-one mapping between file blocks
> and compressed sectors (by definition, because it's compressed), and
> bookending means that it might be data we don't have access to.
>
> We absolutely can't give non-root users csums to arbitrary data, that's
> definitely a security breach.
If getting csums for a random logical address is a security breach, I do
not think the new GET_CSUMS ioctl is any better.
>
> Userspace can already obtain the csums from the disk for a file by using
> FIEMAP and the tree search ioctl. But I believe the consensus around the
> tree search ioctl is a) that it's very difficult to use, as you need to
> know the internals of btrfs,
I completely agree with this part. Furthermore, due to the layout of the
csum tree, one has to work around it by searching with a min_key much
smaller than the bytenr, which can mean unnecessary reads of earlier
leaves.
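A simplified model of why the search has to start below the bytenr: each
csum item is keyed by the logical address where its run of csums starts,
so the item covering a given bytenr can have a smaller key. The struct
and function below are hypothetical illustrations over a sorted array,
not the on-disk format or a kernel helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one csum item: a run of per-sector csums. */
struct csum_run {
	uint64_t start;		/* logical bytenr of the first sector */
	uint64_t nr_csums;	/* number of sectors this item covers */
};

/*
 * runs[] is sorted by start and non-overlapping.
 * Returns the index of the run covering bytenr, or -1 if there is none.
 */
long find_covering_run(const struct csum_run *runs, size_t nr,
		       uint64_t bytenr, uint32_t sectorsize)
{
	size_t lo = 0, hi = nr;

	/* Binary search for the first run that starts after bytenr. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (runs[mid].start <= bytenr)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return -1;	/* every run starts after bytenr */

	/* The only candidate is the run before it, keyed at or below bytenr. */
	if (bytenr - runs[lo - 1].start < runs[lo - 1].nr_csums * sectorsize)
		return (long)(lo - 1);
	return -1;
}
```

With a run of four csums keyed at 4096 and 4 KiB sectors, bytenr 8192 is
covered by an item whose key is below it, which a forward-only search
starting at 8192 would never see.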
Thanks,
Qu
> and b) it requires CAP_SYS_ADMIN, at a time
> when containerization and finer-grained access controls means this is
> frowned upon.
>
> This ioctl is a simpler way of doing the csum lookup, and without
> requiring root.
>
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-07 21:52 ` Qu Wenruo
@ 2026-04-07 22:13 ` Boris Burkov
2026-04-07 22:39 ` Qu Wenruo
0 siblings, 1 reply; 15+ messages in thread
From: Boris Burkov @ 2026-04-07 22:13 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Mark Harmstone, linux-btrfs
On Wed, Apr 08, 2026 at 07:22:44AM +0930, Qu Wenruo wrote:
>
>
> On 2026/4/8 03:43, Mark Harmstone wrote:
> > I think all three of us are confusing each other a little here.
> >
> > The ioctl answers the question: if I were to read X bytes of data from a
> > file at Y offset and calculated the csums manually, what would the value
> > be? To which the kernel responds either with the values, that the read
> > is guaranteed to return zero and thus we can use the precomputed csum
> > for the zero sector, or that the value isn't known and userspace has to
> > do it anyway.
> >
> > The value isn't known if it's a nodatasum file or if it's compressed. We
> > store the csums of compressed extents, but crucially it's over the
> > compressed data. So there's no one-to-one mapping between file blocks
> > and compressed sectors (by definition, because it's compressed), and
> > bookending means that it might be data we don't have access to.
> >
> > We absolutely can't give non-root users csums to arbitrary data, that's
> > definitely a security breach.
>
> If getting csums for random logical is a security breach, I do not think the
> new GET_CSUM ioctl is any better.
>
Isn't it better because you have to use a file we do permission checks
on? So it's not an arbitrary logical address, it's a logical address
used by a file you have access to? That might still be insecure against
some attack, though, what do I know.
> >
> > Userspace can already obtain the csums from the disk for a file by using
> > FIEMAP and the tree search ioctl. But I believe the consensus around the
> > tree search ioctl is a) that it's very difficult to use, as you need to
> > know the internals of btrfs,
>
> I completely agree with this part, furthermore due to the layout of csum
> tree, one has to workaround by searching with a much smaller value than the
> bytenr as the min_key, which means possible unnecessary reads of previous
> leaves.
>
> Thanks,
> Qu
>
> > and b) it requires CAP_SYS_ADMIN, at a time when containerization and
> > finer-grained access controls means this is frowned upon.
> >
> > This ioctl is a simpler way of doing the csum lookup, and without
> > requiring root.
> >
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-07 22:13 ` Boris Burkov
@ 2026-04-07 22:39 ` Qu Wenruo
2026-04-08 13:22 ` Mark Harmstone
0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2026-04-07 22:39 UTC (permalink / raw)
To: Boris Burkov; +Cc: Mark Harmstone, linux-btrfs
On 2026/4/8 07:43, Boris Burkov wrote:
[...]
>>> We absolutely can't give non-root users csums to arbitrary data, that's
>>> definitely a security breach.
>>
>> If getting csums for random logical is a security breach, I do not think the
>> new GET_CSUM ioctl is any better.
>>
>
> Isn't it better because you have to use a file we do permissions checks on?
>
> So it's not an arbitrary logical, it's a logical used by a file you have
> access to? That might still be insecure against some attack, though, what
> do I know.
OK, that makes sense now. Although I'm still not a huge fan of combining
two different tree search operations into one just to fulfill the
privilege check requirement.
* Re: [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl
2026-04-07 22:39 ` Qu Wenruo
@ 2026-04-08 13:22 ` Mark Harmstone
0 siblings, 0 replies; 15+ messages in thread
From: Mark Harmstone @ 2026-04-08 13:22 UTC (permalink / raw)
To: Qu Wenruo, Boris Burkov; +Cc: linux-btrfs
On 07/04/2026 11.39 pm, Qu Wenruo wrote:
>
>
> On 2026/4/8 07:43, Boris Burkov wrote:
> [...]
>>>> We absolutely can't give non-root users csums to arbitrary data, that's
>>>> definitely a security breach.
>>>
>>> If getting csums for random logical is a security breach, I do not
>>> think the
>>> new GET_CSUM ioctl is any better.
>>>
>>
>> Isn't it better because you have to use a file we do permissions
>> checks on?
>>
>> So it's not an arbitrary logical, it's a logical used by a file you have
>> access to? That might still be insecure against some attack, though, what
>> do I know.
>
> OK, that makes sense now. Although I'm still not a huge fan just to
> combine two different tree search operations into one, just to fulfill
> the privilege check requirement.
The privilege check is important: we are running non-root mkfs to create
(I think) VM images. And doing two tree searches is still massively
quicker than userspace reading the data from disk and calculating the
csums manually.
Plus there are other potential uses for this ioctl in the future:
something rsync-like, for instance, or deduplication.
Thread overview: 15+ messages
2026-03-20 12:50 [PATCH] btrfs: add BTRFS_IOC_GET_CSUMS ioctl Mark Harmstone
2026-03-20 13:03 ` Mark Harmstone
2026-03-20 22:18 ` Qu Wenruo
2026-03-25 7:34 ` Qu Wenruo
2026-03-25 14:43 ` Mark Harmstone
2026-03-25 21:04 ` Qu Wenruo
2026-04-02 17:05 ` Mark Harmstone
2026-04-02 21:46 ` Qu Wenruo
2026-04-03 22:44 ` Boris Burkov
2026-04-03 23:00 ` Qu Wenruo
2026-04-07 18:13 ` Mark Harmstone
2026-04-07 21:52 ` Qu Wenruo
2026-04-07 22:13 ` Boris Burkov
2026-04-07 22:39 ` Qu Wenruo
2026-04-08 13:22 ` Mark Harmstone