Linux Btrfs filesystem development


* Re: [PATCH 7.2 v3 01/12] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
From: Baolin Wang @ 2026-04-20  6:07 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-2-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that.
> MADV_COLLAPSE ignores shmem huge config, so exclude the check for shmem.
> 
> While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE.
> 
> Add a helper function mapping_pmd_thp_support() for FSes supporting large
> folio with at least PMD_ORDER.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
From: Qu Wenruo @ 2026-04-19 22:30 UTC (permalink / raw)
  To: Paul Richards; +Cc: linux-btrfs, dsterba
In-Reply-To: <CAMoswehjDpHm7_jmTPV_vKkqJSNKu5pCEDxmwGzAghhEwTGHSA@mail.gmail.com>



On 2026/4/20 04:10, Paul Richards wrote:
> Thanks Qu for your feedback! I wasn't expecting such a thorough review of
> this early RFC. I will reply to a few of your comments below.
> 
> For the comments I don't reply to, please know that I have read them and
> agree with them. If I follow up with another revision then I will address them
> there.
> 
> 
> On Sun, 19 Apr 2026 at 06:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2026/4/19 09:55, Qu Wenruo wrote:
>>>
>>>
>>> On 2026/4/19 00:08, Paul Richards wrote:
>>>> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
>>>> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
>>>> already supported by ext4 and xfs. The userspace contract is
>>>> documented in fallocate(2).
>>>>
>>>> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
>>>> moving punch_hole into its own function and decoupling locking from the
>>>> per-operation helpers. This is similar to the implementations for ext4
>>>> and xfs. The allocate-range and zero-range paths remain coupled since
>>>> they share some setup logic.
>>>>
>>>> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
>>>>
>>>> == Implementation approach ==
>>>>
>>>> For COLLAPSE_RANGE:
>>>>    - The removed region [offset, offset+len) is punched out via
>>>>      btrfs_replace_file_extents(), which handles boundary splitting.
>>>>    - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>>>>      leftward by len in forward order.
>>>>
>>>> For INSERT_RANGE:
>>>>    - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>>>>      by len in reverse order (required to avoid key collisions).
>>>>    - No pre-splitting of straddling extents is needed: the left portion
>>>>      of a straddling extent stays in place, the right portion is shifted;
>>>>      both reference the same physical extent via their existing
>>>>      extent_offset fields.
>>>>
>>>> For each shifted key, the corresponding back-reference in the extent
>>>> tree is updated via a shared helper btrfs_shift_extent_backref().
>>
>> After looking into each patch, I do not think the low level direct file
>> item change is a good idea, especially with your current implementation:
>>
>> - Can lock the inode for a very long time
>>     E.g. inserting a hole into the beginning of a very large file.
>>     We will lock the inode until all file items are iterated, which kills
>>     concurrency.

After looking further at the reflink code, I think this comment was a bit
of an overreaction: reflink also locks the involved range in one go, so it
can lock the inode for a very long time too.

So the long lock may not be a huge problem.

>>
>> - Possible problems with metadata reservation
>>
>> - Problems with ^no-holes collapse
>>     Will cause duplicated file offsets with hole file items.

This is still true, even if we only support insert/collapse range with
the no-holes feature.

The bigger problem is that even a fs with the no-holes feature can still
have explicit hole extents, because the fs may have been converted from
^no-holes and the existing holes are not removed.

I guess the initial design was mostly meant to make converting existing
fses much easier, at the cost that the kernel always has to handle
explicit holes.


One idea is to introduce something like COMPAT_RO_STRICT_NO_HOLE to
prevent explicit holes, which would make the low-level key modification
more feasible.
But such a new feature will need quite some time to get adopted.

>>
>> On the other hand, with reflink the insert/collapse can even be
>> implemented in user space, with a proper step setting, we can still
>> allow concurrent read/write out of the reflink ranges.
>>
>> This makes me wonder, are these features really that necessary?
>>
>> If you know some programs actively utilize these features for real world
>> benefits, but can not be done through reflink, please provide them.
>>
> 
> There is a proprietary tool at my $dayjob which uses insert and collapse range.
> This works great on ext4 and xfs, and I am personally interested in having it
> work on btrfs.
> 
> Emulating in userspace is entirely possible using the FICLONERANGE ioctl, but
> since ext4 doesn't support that operation this leaves userspace needing to
> use one solution for ext4 (fallocate insert/collapse) and another for btrfs
> (FICLONERANGE). By filling the gap of btrfs fallocate modes I was hoping
> to simplify userspace and reduce friction in supporting btrfs.
> 
> However, on balance I agree with your opinion that emulating in userspace
> is the best approach here. So for now I will pause my work on this patch series.

I think you may be interested in utilizing btrfs_clone() to implement 
the insert/collapse range ioctl.

The benefit is that you do not need to bother with the low-level file item
changes, nor the metadata reservation part (already done in btrfs_clone()).

However, it won't work if the src/dst ranges overlap. That can be worked
around by shrinking the reflink length to the minimum so that the ranges
no longer overlap, at the cost of more fragments.
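
As a rough illustration of that workaround, here is a userspace-style
sketch, not btrfs code. The names are made up for the example:
clone_range() is a hypothetical stand-in for one same-file FICLONERANGE
call and rejects overlapping ranges the way the real ioctl does, and
collapse_range() is the illustrative driver. It works on an in-memory
buffer so the stepping logic can be checked without a filesystem; block
alignment, locking and the final truncate are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for one FICLONERANGE call within a single file.
 * Like the ioctl, it refuses overlapping source and destination ranges.
 */
static int clone_range(char *file, size_t src, size_t dst, size_t n)
{
	if (src < dst + n && dst < src + n)
		return -1;	/* ranges overlap */
	memcpy(file + dst, file + src, n);
	return 0;
}

/*
 * Emulate COLLAPSE_RANGE: drop [offset, offset + len) and pull the tail
 * of the file left by len bytes, in forward order.
 *
 * Each step moves at most len bytes, so the destination always ends at
 * or before the start of the source and the two never overlap.
 *
 * Returns the new file size (the caller would truncate to it), or
 * (size_t)-1 if the range is empty or reaches EOF, or if a step fails.
 */
static size_t collapse_range(char *file, size_t size, size_t offset, size_t len)
{
	size_t src = offset + len;

	if (len == 0 || src >= size)
		return (size_t)-1;

	while (src < size) {
		size_t n = size - src < len ? size - src : len;

		if (clone_range(file, src, src - len, n))
			return (size_t)-1;
		src += n;
	}
	return size - len;
}
```

Collapsing 3 bytes at offset 2 of "ABCDEFGHIJ" takes two steps here and
leaves "ABFGHIJ" in the first 7 bytes. The step size is capped at len, so
a small len on a large file means many small clones, which is where the
extra fragments come from.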

And you may still want to dig deeper into the extra locks and other 
corner cases like the final truncation after collapse and the extra hole 
insertion after insert.

I guess it may be a little easier to implement this time.

Thanks,
Qu

> 
>>>>
>>>> == Testing ==
>>>>
>>>> Tested with a Rust-based functional test suite covering:
>>>>    - Collapse and insert at the start, middle of a file
>>>>    - Multiple sequential operations on the same file
>>>>    - Files with multiple extents (fsync between writes to force separate
>>>>      extent items)
>>>>    - Files with holes (explicit punch_hole and implicit sparse writes)
>>>>    - Compressed extents (mount -o compress=zstd)
>>>>    - Transaction cycling (interval reduced to 4 during testing, verified
>>>>      in dmesg logs)
>>>>    - Inline files, verified that -EOPNOTSUPP is returned.
>>>
>>> I guess that tool has never verified the contents, nor run multi-threaded
>>> stress tests, e.g. fsstress?
>>>
> 
> My tests did read back and verify file contents. They were not multi-threaded
> or particularly stressful.
> 
>>>>
>>>> The same tests pass on both btrfs and xfs (modulo the inline files).
>>>>
>>>> I have not run fstests which I know contains tests for INSERT_RANGE
>>>> and COLLAPSE_RANGE. I will do so.
>>>
>>> Thus I'd prefer a fstest run before whatever your local tool.
>>>
> 
> 100% agree that fstests would be better. I started with my own tests only to
> keep things very simple while the code got off the ground. If I carry this work
> on I will use fstests.
> 
>>>> == Notes ==
>>>>
>>>> This is my first kernel contribution. Development was significantly
>>>> assisted by an LLM (Amazon Q Developer). The implementation, testing,
>>>> and final review decisions are my own.
>>>
>>> I'm very interested in how the LLM is involved.
>>>
>>> You mentioned "implementation, testing, review" are on your own, this
>>> looks like everything is on your own.
>>>
>>> I don't think you're only using LLM to help understanding the code, thus
>>> it looks like implementation is contributed by the LLM.
>>>
>>> Please remember, you're the one explaining/defending the code.
>>> But as long as you can explain/defend the code, I'm fine with that.
>>>
> 
> My workflow with the LLM was this:
> 
> I provided a high level design, reference to the fallocate man page,
> pointers to the implementation of clone range in btrfs, and the existing
> fallocate implementations (for btrfs, ext4, and xfs). This was a couple
> of paragraphs and a handful of links.
> 
> I prompted the LLM to produce a detailed design document for how
> I could go about implementing insert and collapse for btrfs. The
> document it produced is English text of around 2000 words. It
> contains references to internal btrfs functions I needed and a
> description of the logic I needed to implement. It has no code.
> 
> I reviewed that doc, asked for clarifications, and the doc was revised.
> 
> I then crafted the code. I relied heavily on the design doc that the
> LLM had written but did not follow it to the letter. During this
> implementation work I asked the LLM to clarify a few points, and
> the design doc was revised by the LLM.
> 
> During debugging most of the bugs I diagnosed myself. There was one
> bug which stumped me however, so I asked the LLM to provide
> hypotheses which I then worked through to solve the bug. I crafted
> the code for the fix. ( The specific bug was stale/incorrect data being
> read after the operation returned. The code was already invalidating
> the page cache, but I was not aware of the extent map cache that
> needed to be invalidated too. )
> 
> I wrote the test code myself.
> 
> Once the tests were passing the LLM was used to update the design
> doc to align with the code I actually wrote. I also asked the LLM to
> review my code prior to submitting it to the mailing list. I addressed
> some of that feedback. As an example piece of feedback it
> recommended I make use of unlikely() in "if" statements testing for
> errors.
> 
> 
>>> Still, a newcomer with a sizable code change will always attract more
>>> scrutiny, so please don't expect rapid review/merge/etc.
>>>
> 
> I am not in a rush. This has been a low-priority side quest going back
> to 2023: https://lore.kernel.org/linux-btrfs/CAMosweitbAN5EPOgJCtrbkRAj1QSbsYt4uDGVMZ378YY7wjnRw@mail.gmail.com/
> 
> 



* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
From: Paul Richards @ 2026-04-19 18:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba
In-Reply-To: <412d6926-8613-4377-8f33-3cf0194f685b@gmx.com>

Thanks Qu for your feedback! I wasn't expecting such a thorough review of
this early RFC. I will reply to a few of your comments below.

For the comments I don't reply to, please know that I have read them and
agree with them. If I follow up with another revision then I will address them
there.

On Sun, 19 Apr 2026 at 06:09, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2026/4/19 09:55, Qu Wenruo wrote:
> >
> >
> > On 2026/4/19 00:08, Paul Richards wrote:
> >> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
> >> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
> >> already supported by ext4 and xfs. The userspace contract is
> >> documented in fallocate(2).
> >>
> >> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
> >> moving punch_hole into its own function and decoupling locking from the
> >> per-operation helpers. This is similar to the implementations for ext4
> >> and xfs. The allocate-range and zero-range paths remain coupled since
> >> they share some setup logic.
> >>
> >> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
> >>
> >> == Implementation approach ==
> >>
> >> For COLLAPSE_RANGE:
> >>   - The removed region [offset, offset+len) is punched out via
> >>     btrfs_replace_file_extents(), which handles boundary splitting.
> >>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
> >>     leftward by len in forward order.
> >>
> >> For INSERT_RANGE:
> >>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
> >>     by len in reverse order (required to avoid key collisions).
> >>   - No pre-splitting of straddling extents is needed: the left portion
> >>     of a straddling extent stays in place, the right portion is shifted;
> >>     both reference the same physical extent via their existing
> >>     extent_offset fields.
> >>
> >> For each shifted key, the corresponding back-reference in the extent
> >> tree is updated via a shared helper btrfs_shift_extent_backref().
>
> After looking into each patch, I do not think the low level direct file
> item change is a good idea, especially with your current implementation:
>
> - Can lock the inode for a very long time
>    E.g. inserting a hole into the beginning of a very large file.
>    We will lock the inode until all file items are iterated, which kills
>    concurrency.
>
> - Possible problems with metadata reservation
>
> - Problems with ^no-holes collapse
>    Will cause duplicated file offsets with hole file items.
>
> On the other hand, with reflink the insert/collapse can even be
> implemented in user space, with a proper step setting, we can still
> allow concurrent read/write out of the reflink ranges.
>
> This makes me wonder, are these features really that necessary?
>
> If you know some programs actively utilize these features for real world
> benefits, but can not be done through reflink, please provide them.
>

There is a proprietary tool at my $dayjob which uses insert and collapse range.
This works great on ext4 and xfs, and I am personally interested in having it
work on btrfs.

Emulating in userspace is entirely possible using the FICLONERANGE ioctl, but
since ext4 doesn't support that operation this leaves userspace needing to
use one solution for ext4 (fallocate insert/collapse) and another for btrfs
(FICLONERANGE). By filling the gap of btrfs fallocate modes I was hoping
to simplify userspace and reduce friction in supporting btrfs.

However, on balance I agree with your opinion that emulating in userspace
is the best approach here. So for now I will pause my work on this patch series.

> >>
> >> == Testing ==
> >>
> >> Tested with a Rust-based functional test suite covering:
> >>   - Collapse and insert at the start, middle of a file
> >>   - Multiple sequential operations on the same file
> >>   - Files with multiple extents (fsync between writes to force separate
> >>     extent items)
> >>   - Files with holes (explicit punch_hole and implicit sparse writes)
> >>   - Compressed extents (mount -o compress=zstd)
> >>   - Transaction cycling (interval reduced to 4 during testing, verified
> >>     in dmesg logs)
> >>   - Inline files, verified that -EOPNOTSUPP is returned.
> >
> > I guess that tool has never verified the contents, nor run multi-threaded
> > stress tests, e.g. fsstress?
> >

My tests did read back and verify file contents. They were not multi-threaded
or particularly stressful.

> >>
> >> The same tests pass on both btrfs and xfs (modulo the inline files).
> >>
> >> I have not run fstests which I know contains tests for INSERT_RANGE
> >> and COLLAPSE_RANGE. I will do so.
> >
> > Thus I'd prefer a fstest run before whatever your local tool.
> >

100% agree that fstests would be better. I started with my own tests only to
keep things very simple while the code got off the ground. If I carry this work
on I will use fstests.

> >> == Notes ==
> >>
> >> This is my first kernel contribution. Development was significantly
> >> assisted by an LLM (Amazon Q Developer). The implementation, testing,
> >> and final review decisions are my own.
> >
> > I'm very interested in how the LLM is involved.
> >
> > You mentioned "implementation, testing, review" are on your own, this
> > looks like everything is on your own.
> >
> > I don't think you're only using LLM to help understanding the code, thus
> > it looks like implementation is contributed by the LLM.
> >
> > Please remember, you're the one explaining/defending the code.
> > But as long as you can explain/defend the code, I'm fine with that.
> >

My workflow with the LLM was this:

I provided a high level design, reference to the fallocate man page,
pointers to the implementation of clone range in btrfs, and the existing
fallocate implementations (for btrfs, ext4, and xfs). This was a couple
of paragraphs and a handful of links.

I prompted the LLM to produce a detailed design document for how
I could go about implementing insert and collapse for btrfs. The
document it produced is English text of around 2000 words. It
contains references to internal btrfs functions I needed and a
description of the logic I needed to implement. It has no code.

I reviewed that doc, asked for clarifications, and the doc was revised.

I then crafted the code. I relied heavily on the design doc that the
LLM had written but did not follow it to the letter. During this
implementation work I asked the LLM to clarify a few points, and
the design doc was revised by the LLM.

During debugging most of the bugs I diagnosed myself. There was one
bug which stumped me however, so I asked the LLM to provide
hypotheses which I then worked through to solve the bug. I crafted
the code for the fix. ( The specific bug was stale/incorrect data being
read after the operation returned. The code was already invalidating
the page cache, but I was not aware of the extent map cache that
needed to be invalidated too. )

I wrote the test code myself.

Once the tests were passing the LLM was used to update the design
doc to align with the code I actually wrote. I also asked the LLM to
review my code prior to submitting it to the mailing list. I addressed
some of that feedback. As an example piece of feedback it
recommended I make use of unlikely() in "if" statements testing for
errors.

> > Still, a newcomer with a sizable code change will always attract more
> > scrutiny, so please don't expect rapid review/merge/etc.
> >

I am not in a rush. This has been a low-priority side quest going back
to 2023: https://lore.kernel.org/linux-btrfs/CAMosweitbAN5EPOgJCtrbkRAj1QSbsYt4uDGVMZ378YY7wjnRw@mail.gmail.com/

-- 
Paul Richards


* [PATCH 5/5] btrfs: expose scrub lifetime and session counters via sysfs
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide
In-Reply-To: <20260419142618.3147763-1-torsteine+linux@gmail.com>

From: Torstein Eide <torsteine@gmail.com>

Create two sysfs sub-hierarchies under each filesystem UUID and under
each devinfo/<devid>/ entry:

  /sys/fs/btrfs/<UUID>/scrub/
    reset                   (write "1" to zero all devices' lifetime
                             counters and mark them dirty)
    lifetime/               filesystem-wide sums of per-device lifetime
      data_extents_scrubbed   counters (one file per stat index)
      tree_extents_scrubbed
      ... (14 counters total)
    session/                filesystem-wide view of the most recent run
      data_extents_scrubbed   (summed across devices for counters,
      ...                      earliest t_start / latest t_end for time)
      status                  idle | running | finished | canceled
      t_start / t_end         Unix timestamps (seconds)
      duration_seconds        computed on read; live while running

  /sys/fs/btrfs/<UUID>/devinfo/<devid>/scrub/
    reset                   (per-device reset)
    lifetime/               per-device lifetime counters
    session/                per-device session counters + metadata
      last_physical           byte offset where the last run stopped

The attribute arrays are declared as 'const struct attribute *[]' so
they are compatible with sysfs_create_files() / sysfs_remove_files().
Forward declarations are placed before btrfs_sysfs_remove_mounted() and
btrfs_sysfs_remove_device() which reference them before their
definitions.

Signed-off-by: Torstein Eide <torsteine@gmail.com>
Assisted-by: Claude:claude-sonnet-4-6
---
 fs/btrfs/sysfs.c | 586 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/sysfs.h |   3 +
 2 files changed, 589 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 0d14570c8bc29..d5f12a40b31fc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/bug.h>
 #include <linux/list.h>
 #include <linux/string_choices.h>
+#include <linux/timekeeping.h>
 #include "messages.h"
 #include "ctree.h"
 #include "discard.h"
@@ -1707,6 +1708,14 @@ static void btrfs_sysfs_remove_fs_devices(struct btrfs_fs_devices *fs_devices)
 	}
 }
 
+/* Forward declarations for scrub sysfs attribute arrays defined later. */
+static const struct attribute *scrub_attrs[];
+static const struct attribute *scrub_lifetime_attrs[];
+static const struct attribute *scrub_session_attrs[];
+static const struct attribute *devid_scrub_attrs[];
+static const struct attribute *devid_scrub_lifetime_attrs[];
+static const struct attribute *devid_scrub_session_attrs[];
+
 void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info)
 {
 	struct kobject *fsid_kobj = &fs_info->fs_devices->fsid_kobj;
@@ -1723,6 +1732,21 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info)
 		kobject_del(fs_info->discard_kobj);
 		kobject_put(fs_info->discard_kobj);
 	}
+	if (fs_info->scrub_session_kobj) {
+		sysfs_remove_files(fs_info->scrub_session_kobj, scrub_session_attrs);
+		kobject_del(fs_info->scrub_session_kobj);
+		kobject_put(fs_info->scrub_session_kobj);
+	}
+	if (fs_info->scrub_lifetime_kobj) {
+		sysfs_remove_files(fs_info->scrub_lifetime_kobj, scrub_lifetime_attrs);
+		kobject_del(fs_info->scrub_lifetime_kobj);
+		kobject_put(fs_info->scrub_lifetime_kobj);
+	}
+	if (fs_info->scrub_kobj) {
+		sysfs_remove_files(fs_info->scrub_kobj, scrub_attrs);
+		kobject_del(fs_info->scrub_kobj);
+		kobject_put(fs_info->scrub_kobj);
+	}
 #ifdef CONFIG_BTRFS_DEBUG
 	if (fs_info->debug_kobj) {
 		sysfs_remove_files(fs_info->debug_kobj, btrfs_debug_mount_attrs);
@@ -1970,6 +1994,28 @@ void btrfs_sysfs_remove_device(struct btrfs_device *device)
 	if (device->bdev)
 		sysfs_remove_link(devices_kobj, bdev_kobj(device->bdev)->name);
 
+	/* Tear down devinfo/<devid>/scrub/ hierarchy (children before parent) */
+	if (device->scrub_session_kobj) {
+		sysfs_remove_files(device->scrub_session_kobj,
+				   devid_scrub_session_attrs);
+		kobject_del(device->scrub_session_kobj);
+		kobject_put(device->scrub_session_kobj);
+		device->scrub_session_kobj = NULL;
+	}
+	if (device->scrub_lifetime_kobj) {
+		sysfs_remove_files(device->scrub_lifetime_kobj,
+				   devid_scrub_lifetime_attrs);
+		kobject_del(device->scrub_lifetime_kobj);
+		kobject_put(device->scrub_lifetime_kobj);
+		device->scrub_lifetime_kobj = NULL;
+	}
+	if (device->scrub_kobj) {
+		sysfs_remove_files(device->scrub_kobj, devid_scrub_attrs);
+		kobject_del(device->scrub_kobj);
+		kobject_put(device->scrub_kobj);
+		device->scrub_kobj = NULL;
+	}
+
 	if (device->devid_kobj.state_initialized) {
 		kobject_del(&device->devid_kobj);
 		kobject_put(&device->devid_kobj);
@@ -2099,6 +2145,482 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
 
+/* ---------- scrub/lifetime/ and scrub/session/ sysfs attributes ---------- */
+
+/*
+ * Return the btrfs_device owning a kobject that lives two levels below
+ * devid_kobj, i.e. kobj->parent is scrub_kobj, kobj->parent->parent is
+ * devid_kobj.
+ */
+static struct btrfs_device *kobj_to_scrub_device(struct kobject *kobj)
+{
+	return container_of(kobj->parent->parent, struct btrfs_device, devid_kobj);
+}
+
+/*
+ * Return the btrfs_fs_info owning a kobject that lives two levels below
+ * fsid_kobj, i.e. kobj->parent is fs_info->scrub_kobj,
+ * kobj->parent->parent is &fs_devs->fsid_kobj.
+ */
+static struct btrfs_fs_info *kobj_to_scrub_fs_info(struct kobject *kobj)
+{
+	struct btrfs_fs_devices *fs_devs =
+		container_of(kobj->parent->parent, struct btrfs_fs_devices, fsid_kobj);
+	return fs_devs->fs_info;
+}
+
+/*
+ * Per-device scrub/lifetime/ attributes.
+ * Each show function reads an atomic64 from btrfs_device::scrub_stat_values[].
+ */
+#define DEV_SCRUB_LIFETIME_ATTR(_name, _idx)					\
+static ssize_t btrfs_devid_scrub_lifetime_##_name##_show(			\
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)	\
+{										\
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);			\
+	return sysfs_emit(buf, "%llu\n", btrfs_scrub_stat_read(dev, _idx));	\
+}										\
+BTRFS_ATTR(devid_scrub_lifetime, _name,					\
+	   btrfs_devid_scrub_lifetime_##_name##_show)
+
+DEV_SCRUB_LIFETIME_ATTR(data_extents_scrubbed, BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED);
+DEV_SCRUB_LIFETIME_ATTR(tree_extents_scrubbed, BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED);
+DEV_SCRUB_LIFETIME_ATTR(data_bytes_scrubbed,   BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED);
+DEV_SCRUB_LIFETIME_ATTR(tree_bytes_scrubbed,   BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED);
+DEV_SCRUB_LIFETIME_ATTR(read_errors,           BTRFS_SCRUB_STAT_READ_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(csum_errors,           BTRFS_SCRUB_STAT_CSUM_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(verify_errors,         BTRFS_SCRUB_STAT_VERIFY_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(no_csum,               BTRFS_SCRUB_STAT_NO_CSUM);
+DEV_SCRUB_LIFETIME_ATTR(csum_discards,         BTRFS_SCRUB_STAT_CSUM_DISCARDS);
+DEV_SCRUB_LIFETIME_ATTR(super_errors,          BTRFS_SCRUB_STAT_SUPER_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(malloc_errors,         BTRFS_SCRUB_STAT_MALLOC_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(uncorrectable_errors,  BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(corrected_errors,      BTRFS_SCRUB_STAT_CORRECTED_ERRORS);
+DEV_SCRUB_LIFETIME_ATTR(unverified_errors,     BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS);
+
+static const struct attribute *devid_scrub_lifetime_attrs[] = {
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, data_extents_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, tree_extents_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, data_bytes_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, tree_bytes_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, read_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, csum_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, verify_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, no_csum),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, csum_discards),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, super_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, malloc_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, uncorrectable_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, corrected_errors),
+	BTRFS_ATTR_PTR(devid_scrub_lifetime, unverified_errors),
+	NULL
+};
+
+/*
+ * Per-device scrub/session/ attributes.
+ */
+#define DEV_SCRUB_SESSION_ATTR(_name, _idx)					\
+static ssize_t btrfs_devid_scrub_session_##_name##_show(			\
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)	\
+{										\
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);			\
+	return sysfs_emit(buf, "%llu\n", btrfs_scrub_session_read(dev, _idx));	\
+}										\
+BTRFS_ATTR(devid_scrub_session, _name,					\
+	   btrfs_devid_scrub_session_##_name##_show)
+
+DEV_SCRUB_SESSION_ATTR(data_extents_scrubbed, BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED);
+DEV_SCRUB_SESSION_ATTR(tree_extents_scrubbed, BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED);
+DEV_SCRUB_SESSION_ATTR(data_bytes_scrubbed,   BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED);
+DEV_SCRUB_SESSION_ATTR(tree_bytes_scrubbed,   BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED);
+DEV_SCRUB_SESSION_ATTR(read_errors,           BTRFS_SCRUB_STAT_READ_ERRORS);
+DEV_SCRUB_SESSION_ATTR(csum_errors,           BTRFS_SCRUB_STAT_CSUM_ERRORS);
+DEV_SCRUB_SESSION_ATTR(verify_errors,         BTRFS_SCRUB_STAT_VERIFY_ERRORS);
+DEV_SCRUB_SESSION_ATTR(no_csum,               BTRFS_SCRUB_STAT_NO_CSUM);
+DEV_SCRUB_SESSION_ATTR(csum_discards,         BTRFS_SCRUB_STAT_CSUM_DISCARDS);
+DEV_SCRUB_SESSION_ATTR(super_errors,          BTRFS_SCRUB_STAT_SUPER_ERRORS);
+DEV_SCRUB_SESSION_ATTR(malloc_errors,         BTRFS_SCRUB_STAT_MALLOC_ERRORS);
+DEV_SCRUB_SESSION_ATTR(uncorrectable_errors,  BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS);
+DEV_SCRUB_SESSION_ATTR(corrected_errors,      BTRFS_SCRUB_STAT_CORRECTED_ERRORS);
+DEV_SCRUB_SESSION_ATTR(unverified_errors,     BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS);
+
+static ssize_t btrfs_devid_scrub_session_last_physical_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);
+
+	return sysfs_emit(buf, "%llu\n", READ_ONCE(dev->scrub_session_last_physical));
+}
+BTRFS_ATTR(devid_scrub_session, last_physical,
+	   btrfs_devid_scrub_session_last_physical_show);
+
+static ssize_t btrfs_devid_scrub_session_t_start_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);
+
+	return sysfs_emit(buf, "%llu\n", READ_ONCE(dev->scrub_session_t_start));
+}
+BTRFS_ATTR(devid_scrub_session, t_start, btrfs_devid_scrub_session_t_start_show);
+
+static ssize_t btrfs_devid_scrub_session_t_end_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);
+
+	return sysfs_emit(buf, "%llu\n", READ_ONCE(dev->scrub_session_t_end));
+}
+BTRFS_ATTR(devid_scrub_session, t_end, btrfs_devid_scrub_session_t_end_show);
+
+static ssize_t btrfs_devid_scrub_session_duration_seconds_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);
+	u64 t_start = READ_ONCE(dev->scrub_session_t_start);
+	u64 t_end   = READ_ONCE(dev->scrub_session_t_end);
+	u64 dur;
+
+	if (t_start == 0)
+		dur = 0;
+	else if (t_end != 0)
+		dur = t_end - t_start;
+	else
+		dur = (u64)ktime_get_real_seconds() - t_start;
+	return sysfs_emit(buf, "%llu\n", dur);
+}
+BTRFS_ATTR(devid_scrub_session, duration_seconds,
+	   btrfs_devid_scrub_session_duration_seconds_show);
+
+static const char * const btrfs_scrub_status_strings[] = {
+	[BTRFS_SCRUB_STATUS_IDLE]     = "idle",
+	[BTRFS_SCRUB_STATUS_RUNNING]  = "running",
+	[BTRFS_SCRUB_STATUS_FINISHED] = "finished",
+	[BTRFS_SCRUB_STATUS_CANCELED] = "canceled",
+};
+
+static ssize_t btrfs_devid_scrub_session_status_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *dev = kobj_to_scrub_device(kobj);
+	int status = atomic_read(&dev->scrub_session_status);
+
+	if (status < 0 || status >= ARRAY_SIZE(btrfs_scrub_status_strings))
+		return sysfs_emit(buf, "unknown\n");
+	return sysfs_emit(buf, "%s\n", btrfs_scrub_status_strings[status]);
+}
+BTRFS_ATTR(devid_scrub_session, status, btrfs_devid_scrub_session_status_show);
+
+static const struct attribute *devid_scrub_session_attrs[] = {
+	BTRFS_ATTR_PTR(devid_scrub_session, data_extents_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_session, tree_extents_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_session, data_bytes_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_session, tree_bytes_scrubbed),
+	BTRFS_ATTR_PTR(devid_scrub_session, read_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, csum_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, verify_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, no_csum),
+	BTRFS_ATTR_PTR(devid_scrub_session, csum_discards),
+	BTRFS_ATTR_PTR(devid_scrub_session, super_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, malloc_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, uncorrectable_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, corrected_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, unverified_errors),
+	BTRFS_ATTR_PTR(devid_scrub_session, last_physical),
+	BTRFS_ATTR_PTR(devid_scrub_session, t_start),
+	BTRFS_ATTR_PTR(devid_scrub_session, t_end),
+	BTRFS_ATTR_PTR(devid_scrub_session, duration_seconds),
+	BTRFS_ATTR_PTR(devid_scrub_session, status),
+	NULL
+};
+
+/*
+ * Per-device scrub/reset (write-only): "echo 1 > reset" zeroes lifetime
+ * counters and marks them dirty for the next transaction commit.
+ */
+static ssize_t btrfs_devid_scrub_reset_store(struct kobject *kobj,
+		struct kobj_attribute *a, const char *buf, size_t len)
+{
+	struct btrfs_device *dev;
+	unsigned long val;
+	int i;
+
+	if (kstrtoul(buf, 10, &val) || val != 1)
+		return -EINVAL;
+
+	/* kobj here is the scrub_kobj, one level below devid_kobj */
+	dev = container_of(kobj->parent, struct btrfs_device, devid_kobj);
+
+	for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++)
+		btrfs_scrub_stat_set(dev, i, 0);
+
+	btrfs_info(dev->fs_info,
+		   "scrub: lifetime stats reset for devid %llu by %s (%d)",
+		   dev->devid, current->comm, task_pid_nr(current));
+	return len;
+}
+BTRFS_ATTR_W(devid_scrub, reset, btrfs_devid_scrub_reset_store);
+
+static const struct attribute *devid_scrub_attrs[] = {
+	BTRFS_ATTR_PTR(devid_scrub, reset),
+	NULL
+};
+
+/* ---------- fs-level scrub/lifetime/ and scrub/session/ attributes ---------- */
+
+/*
+ * Filesystem-level lifetime counters: sum of all device lifetime counters.
+ */
+#define FS_SCRUB_LIFETIME_ATTR(_name, _idx)					\
+static ssize_t btrfs_scrub_lifetime_##_name##_show(				\
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)	\
+{										\
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);		\
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;			\
+	struct btrfs_device *dev;						\
+	u64 total = 0;								\
+										\
+	mutex_lock(&fs_devs->device_list_mutex);				\
+	list_for_each_entry(dev, &fs_devs->devices, dev_list)			\
+		total += btrfs_scrub_stat_read(dev, _idx);			\
+	mutex_unlock(&fs_devs->device_list_mutex);				\
+	return sysfs_emit(buf, "%llu\n", total);				\
+}										\
+BTRFS_ATTR(scrub_lifetime, _name, btrfs_scrub_lifetime_##_name##_show)
+
+FS_SCRUB_LIFETIME_ATTR(data_extents_scrubbed, BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED);
+FS_SCRUB_LIFETIME_ATTR(tree_extents_scrubbed, BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED);
+FS_SCRUB_LIFETIME_ATTR(data_bytes_scrubbed,   BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED);
+FS_SCRUB_LIFETIME_ATTR(tree_bytes_scrubbed,   BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED);
+FS_SCRUB_LIFETIME_ATTR(read_errors,           BTRFS_SCRUB_STAT_READ_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(csum_errors,           BTRFS_SCRUB_STAT_CSUM_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(verify_errors,         BTRFS_SCRUB_STAT_VERIFY_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(no_csum,               BTRFS_SCRUB_STAT_NO_CSUM);
+FS_SCRUB_LIFETIME_ATTR(csum_discards,         BTRFS_SCRUB_STAT_CSUM_DISCARDS);
+FS_SCRUB_LIFETIME_ATTR(super_errors,          BTRFS_SCRUB_STAT_SUPER_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(malloc_errors,         BTRFS_SCRUB_STAT_MALLOC_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(uncorrectable_errors,  BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(corrected_errors,      BTRFS_SCRUB_STAT_CORRECTED_ERRORS);
+FS_SCRUB_LIFETIME_ATTR(unverified_errors,     BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS);
+
+static const struct attribute *scrub_lifetime_attrs[] = {
+	BTRFS_ATTR_PTR(scrub_lifetime, data_extents_scrubbed),
+	BTRFS_ATTR_PTR(scrub_lifetime, tree_extents_scrubbed),
+	BTRFS_ATTR_PTR(scrub_lifetime, data_bytes_scrubbed),
+	BTRFS_ATTR_PTR(scrub_lifetime, tree_bytes_scrubbed),
+	BTRFS_ATTR_PTR(scrub_lifetime, read_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, csum_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, verify_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, no_csum),
+	BTRFS_ATTR_PTR(scrub_lifetime, csum_discards),
+	BTRFS_ATTR_PTR(scrub_lifetime, super_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, malloc_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, uncorrectable_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, corrected_errors),
+	BTRFS_ATTR_PTR(scrub_lifetime, unverified_errors),
+	NULL
+};
+
+/*
+ * Filesystem-level session counters: sum of per-device session values.
+ * Timing/status derive from per-device values.
+ */
+#define FS_SCRUB_SESSION_ATTR(_name, _idx)					\
+static ssize_t btrfs_scrub_session_##_name##_show(				\
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)	\
+{										\
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);		\
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;			\
+	struct btrfs_device *dev;						\
+	u64 total = 0;								\
+										\
+	mutex_lock(&fs_devs->device_list_mutex);				\
+	list_for_each_entry(dev, &fs_devs->devices, dev_list)			\
+		total += btrfs_scrub_session_read(dev, _idx);			\
+	mutex_unlock(&fs_devs->device_list_mutex);				\
+	return sysfs_emit(buf, "%llu\n", total);				\
+}										\
+BTRFS_ATTR(scrub_session, _name, btrfs_scrub_session_##_name##_show)
+
+FS_SCRUB_SESSION_ATTR(data_extents_scrubbed, BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED);
+FS_SCRUB_SESSION_ATTR(tree_extents_scrubbed, BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED);
+FS_SCRUB_SESSION_ATTR(data_bytes_scrubbed,   BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED);
+FS_SCRUB_SESSION_ATTR(tree_bytes_scrubbed,   BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED);
+FS_SCRUB_SESSION_ATTR(read_errors,           BTRFS_SCRUB_STAT_READ_ERRORS);
+FS_SCRUB_SESSION_ATTR(csum_errors,           BTRFS_SCRUB_STAT_CSUM_ERRORS);
+FS_SCRUB_SESSION_ATTR(verify_errors,         BTRFS_SCRUB_STAT_VERIFY_ERRORS);
+FS_SCRUB_SESSION_ATTR(no_csum,               BTRFS_SCRUB_STAT_NO_CSUM);
+FS_SCRUB_SESSION_ATTR(csum_discards,         BTRFS_SCRUB_STAT_CSUM_DISCARDS);
+FS_SCRUB_SESSION_ATTR(super_errors,          BTRFS_SCRUB_STAT_SUPER_ERRORS);
+FS_SCRUB_SESSION_ATTR(malloc_errors,         BTRFS_SCRUB_STAT_MALLOC_ERRORS);
+FS_SCRUB_SESSION_ATTR(uncorrectable_errors,  BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS);
+FS_SCRUB_SESSION_ATTR(corrected_errors,      BTRFS_SCRUB_STAT_CORRECTED_ERRORS);
+FS_SCRUB_SESSION_ATTR(unverified_errors,     BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS);
+
+static ssize_t btrfs_scrub_session_status_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);
+
+	if (atomic_read(&fs_info->scrubs_running) > 0)
+		return sysfs_emit(buf, "running\n");
+
+	/*
+	 * Not running: check if any device's last session finished or was
+	 * canceled.
+	 */
+	{
+		struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+		struct btrfs_device *dev;
+		int seen_finished = 0, seen_canceled = 0;
+
+		mutex_lock(&fs_devs->device_list_mutex);
+		list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+			int st = atomic_read(&dev->scrub_session_status);
+
+			if (st == BTRFS_SCRUB_STATUS_FINISHED)
+				seen_finished = 1;
+			else if (st == BTRFS_SCRUB_STATUS_CANCELED)
+				seen_canceled = 1;
+		}
+		mutex_unlock(&fs_devs->device_list_mutex);
+
+		if (seen_canceled)
+			return sysfs_emit(buf, "canceled\n");
+		if (seen_finished)
+			return sysfs_emit(buf, "finished\n");
+	}
+	return sysfs_emit(buf, "idle\n");
+}
+BTRFS_ATTR(scrub_session, status, btrfs_scrub_session_status_show);
+
+static ssize_t btrfs_scrub_session_t_start_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+	struct btrfs_device *dev;
+	u64 t_min = 0;
+
+	mutex_lock(&fs_devs->device_list_mutex);
+	list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+		u64 t = READ_ONCE(dev->scrub_session_t_start);
+
+		if (t && (!t_min || t < t_min))
+			t_min = t;
+	}
+	mutex_unlock(&fs_devs->device_list_mutex);
+	return sysfs_emit(buf, "%llu\n", t_min);
+}
+BTRFS_ATTR(scrub_session, t_start, btrfs_scrub_session_t_start_show);
+
+static ssize_t btrfs_scrub_session_t_end_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+	struct btrfs_device *dev;
+	u64 t_max = 0;
+
+	mutex_lock(&fs_devs->device_list_mutex);
+	list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+		u64 t = READ_ONCE(dev->scrub_session_t_end);
+
+		if (t > t_max)
+			t_max = t;
+	}
+	mutex_unlock(&fs_devs->device_list_mutex);
+	return sysfs_emit(buf, "%llu\n", t_max);
+}
+BTRFS_ATTR(scrub_session, t_end, btrfs_scrub_session_t_end_show);
+
+static ssize_t btrfs_scrub_session_duration_seconds_show(
+		struct kobject *kobj, struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = kobj_to_scrub_fs_info(kobj);
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+	struct btrfs_device *dev;
+	u64 t_start = 0, t_end = 0, dur;
+
+	mutex_lock(&fs_devs->device_list_mutex);
+	list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+		u64 ts = READ_ONCE(dev->scrub_session_t_start);
+		u64 te = READ_ONCE(dev->scrub_session_t_end);
+
+		if (ts && (!t_start || ts < t_start))
+			t_start = ts;
+		if (te > t_end)
+			t_end = te;
+	}
+	mutex_unlock(&fs_devs->device_list_mutex);
+
+	if (!t_start)
+		dur = 0;
+	else if (t_end)
+		dur = t_end - t_start;
+	else
+		dur = (u64)ktime_get_real_seconds() - t_start;
+	return sysfs_emit(buf, "%llu\n", dur);
+}
+BTRFS_ATTR(scrub_session, duration_seconds,
+	   btrfs_scrub_session_duration_seconds_show);
+
+static const struct attribute *scrub_session_attrs[] = {
+	BTRFS_ATTR_PTR(scrub_session, data_extents_scrubbed),
+	BTRFS_ATTR_PTR(scrub_session, tree_extents_scrubbed),
+	BTRFS_ATTR_PTR(scrub_session, data_bytes_scrubbed),
+	BTRFS_ATTR_PTR(scrub_session, tree_bytes_scrubbed),
+	BTRFS_ATTR_PTR(scrub_session, read_errors),
+	BTRFS_ATTR_PTR(scrub_session, csum_errors),
+	BTRFS_ATTR_PTR(scrub_session, verify_errors),
+	BTRFS_ATTR_PTR(scrub_session, no_csum),
+	BTRFS_ATTR_PTR(scrub_session, csum_discards),
+	BTRFS_ATTR_PTR(scrub_session, super_errors),
+	BTRFS_ATTR_PTR(scrub_session, malloc_errors),
+	BTRFS_ATTR_PTR(scrub_session, uncorrectable_errors),
+	BTRFS_ATTR_PTR(scrub_session, corrected_errors),
+	BTRFS_ATTR_PTR(scrub_session, unverified_errors),
+	BTRFS_ATTR_PTR(scrub_session, status),
+	BTRFS_ATTR_PTR(scrub_session, t_start),
+	BTRFS_ATTR_PTR(scrub_session, t_end),
+	BTRFS_ATTR_PTR(scrub_session, duration_seconds),
+	NULL
+};
+
+/*
+ * Filesystem-level scrub/reset (write-only): zeroes all devices' lifetime
+ * counters and marks them dirty.
+ */
+static ssize_t btrfs_scrub_reset_store(struct kobject *kobj,
+		struct kobj_attribute *a, const char *buf, size_t len)
+{
+	/* kobj is fs_info->scrub_kobj, parent is fsid_kobj */
+	struct btrfs_fs_devices *fs_devs =
+		container_of(kobj->parent, struct btrfs_fs_devices, fsid_kobj);
+	struct btrfs_fs_info *fs_info = fs_devs->fs_info;
+	struct btrfs_device *dev;
+	unsigned long val;
+	int i;
+
+	if (kstrtoul(buf, 10, &val) || val != 1)
+		return -EINVAL;
+
+	mutex_lock(&fs_devs->device_list_mutex);
+	list_for_each_entry(dev, &fs_devs->devices, dev_list)
+		for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++)
+			btrfs_scrub_stat_set(dev, i, 0);
+	mutex_unlock(&fs_devs->device_list_mutex);
+
+	btrfs_info(fs_info, "scrub: all lifetime stats reset by %s (%d)",
+		   current->comm, task_pid_nr(current));
+	return len;
+}
+BTRFS_ATTR_W(scrub, reset, btrfs_scrub_reset_store);
+
+static const struct attribute *scrub_attrs[] = {
+	BTRFS_ATTR_PTR(scrub, reset),
+	NULL
+};
+
 /*
  * Information about one device.
  *
@@ -2169,7 +2691,41 @@ int btrfs_sysfs_add_device(struct btrfs_device *device)
 		btrfs_warn(device->fs_info,
 			   "devinfo init for devid %llu failed: %d",
 			   device->devid, ret);
+		goto out;
+	}
+
+	/* Create devinfo/<devid>/scrub/ hierarchy */
+	device->scrub_kobj = kobject_create_and_add("scrub", &device->devid_kobj);
+	if (!device->scrub_kobj) {
+		ret = -ENOMEM;
+		btrfs_warn(device->fs_info,
+			   "scrub kobj init for devid %llu failed",
+			   device->devid);
+		goto out;
+	}
+	ret = sysfs_create_files(device->scrub_kobj, devid_scrub_attrs);
+	if (ret)
+		goto out;
+
+	device->scrub_lifetime_kobj = kobject_create_and_add("lifetime",
+							      device->scrub_kobj);
+	if (!device->scrub_lifetime_kobj) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = sysfs_create_files(device->scrub_lifetime_kobj,
+				 devid_scrub_lifetime_attrs);
+	if (ret)
+		goto out;
+
+	device->scrub_session_kobj = kobject_create_and_add("session",
+							     device->scrub_kobj);
+	if (!device->scrub_session_kobj) {
+		ret = -ENOMEM;
+		goto out;
 	}
+	ret = sysfs_create_files(device->scrub_session_kobj,
+				 devid_scrub_session_attrs);
 
 out:
 	memalloc_nofs_restore(nofs_flag);
@@ -2346,6 +2902,36 @@ int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info)
 	if (ret)
 		goto failure;
 
+	/* Create /sys/fs/btrfs/<UUID>/scrub/{lifetime,session}/ hierarchy */
+	fs_info->scrub_kobj = kobject_create_and_add("scrub", fsid_kobj);
+	if (!fs_info->scrub_kobj) {
+		ret = -ENOMEM;
+		goto failure;
+	}
+	ret = sysfs_create_files(fs_info->scrub_kobj, scrub_attrs);
+	if (ret)
+		goto failure;
+
+	fs_info->scrub_lifetime_kobj = kobject_create_and_add("lifetime",
+							       fs_info->scrub_kobj);
+	if (!fs_info->scrub_lifetime_kobj) {
+		ret = -ENOMEM;
+		goto failure;
+	}
+	ret = sysfs_create_files(fs_info->scrub_lifetime_kobj, scrub_lifetime_attrs);
+	if (ret)
+		goto failure;
+
+	fs_info->scrub_session_kobj = kobject_create_and_add("session",
+							      fs_info->scrub_kobj);
+	if (!fs_info->scrub_session_kobj) {
+		ret = -ENOMEM;
+		goto failure;
+	}
+	ret = sysfs_create_files(fs_info->scrub_session_kobj, scrub_session_attrs);
+	if (ret)
+		goto failure;
+
 	return 0;
 failure:
 	btrfs_sysfs_remove_mounted(fs_info);
diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
index 05498e5346c39..93da8ea06659e 100644
--- a/fs/btrfs/sysfs.h
+++ b/fs/btrfs/sysfs.h
@@ -41,6 +41,9 @@ int btrfs_sysfs_add_space_info_type(struct btrfs_space_info *space_info);
 void btrfs_sysfs_remove_space_info(struct btrfs_space_info *space_info);
 void btrfs_sysfs_update_devid(struct btrfs_device *device);
 
+int btrfs_sysfs_add_scrub_device(struct btrfs_device *device);
+void btrfs_sysfs_remove_scrub_device(struct btrfs_device *device);
+
 int btrfs_sysfs_add_one_qgroup(struct btrfs_fs_info *fs_info,
 				struct btrfs_qgroup *qgroup);
 void btrfs_sysfs_del_qgroups(struct btrfs_fs_info *fs_info);
-- 
2.48.1


* [PATCH 4/5] btrfs: hook scrub session tracking into btrfs_scrub_dev()
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide
In-Reply-To: <20260419142618.3147763-1-torsteine+linux@gmail.com>

From: Torstein Eide <torsteine@gmail.com>

At the start of a scrub (excluding device replace runs):
  - zero all per-device scrub_session_values[] counters
  - set scrub_session_last_physical to the start offset
  - record scrub_session_t_start (wall-clock seconds)
  - set scrub_session_status to BTRFS_SCRUB_STATUS_RUNNING

At the end of a scrub, call btrfs_update_scrub_stats(), which copies the
final btrfs_scrub_progress counters into the session arrays, adds them to
the lifetime totals, stamps t_end, and sets the status to CANCELED when
the return value is -ECANCELED and to FINISHED otherwise.

Device replace runs are excluded because their progress is tracked
separately and their statistics should not pollute the device's own
scrub history.
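
The session lifecycle described above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions: the
struct, the function names, and NR_COUNTERS are hypothetical and do not
match the kernel structures, and timestamps are passed in as plain
integers rather than read from ktime_get_real_seconds().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model; not the kernel's struct btrfs_device. */
enum session_status { ST_IDLE, ST_RUNNING, ST_FINISHED, ST_CANCELED };

#define NR_COUNTERS 4	/* arbitrary size chosen for the sketch */

struct scrub_session {
	uint64_t values[NR_COUNTERS];
	uint64_t last_physical;
	uint64_t t_start;	/* wall-clock seconds, 0 = never started */
	uint64_t t_end;		/* wall-clock seconds, 0 = not ended yet */
	enum session_status status;
};

/* Start of a scrub: wipe the previous session and mark it running. */
void session_begin(struct scrub_session *s, uint64_t start, uint64_t now)
{
	memset(s->values, 0, sizeof(s->values));
	s->last_physical = start;
	s->t_start = now;
	s->t_end = 0;
	s->status = ST_RUNNING;
}

/* End of a scrub: copy the final counters and stamp the end time. */
void session_end(struct scrub_session *s, const uint64_t *final,
		 uint64_t last_physical, uint64_t now, int canceled)
{
	assert(s->status == ST_RUNNING);	/* precondition of this model only */
	memcpy(s->values, final, sizeof(s->values));
	s->last_physical = last_physical;
	s->t_end = now;
	s->status = canceled ? ST_CANCELED : ST_FINISHED;
}

/* Elapsed seconds: 0 if never started, fixed once ended, live otherwise. */
uint64_t session_duration(const struct scrub_session *s, uint64_t now)
{
	if (s->t_start == 0)
		return 0;
	if (s->t_end != 0)
		return s->t_end - s->t_start;
	return now - s->t_start;
}
```

Clearing t_end at the start of a run is what lets a reader distinguish
"still running" (t_end == 0) from "ended" when computing a duration.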

Signed-off-by: Torstein Eide <torsteine@gmail.com>
Assisted-by: Claude:claude-sonnet-4-6
---
 fs/btrfs/scrub.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 1ac609239cbe3..a3bf3276f5f44 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -6,6 +6,7 @@
 #include <linux/blkdev.h>
 #include <linux/ratelimit.h>
 #include <linux/sched/mm.h>
+#include <linux/timekeeping.h>
 #include "ctree.h"
 #include "discard.h"
 #include "volumes.h"
@@ -3152,6 +3153,18 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	dev->scrub_ctx = sctx;
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 
+	if (!is_dev_replace) {
+		int i;
+
+		/* Reset session counters for this new scrub run */
+		for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++)
+			atomic64_set(&dev->scrub_session_values[i], 0);
+		dev->scrub_session_last_physical = start;
+		dev->scrub_session_t_start = ktime_get_real_seconds();
+		dev->scrub_session_t_end = 0;
+		atomic_set(&dev->scrub_session_status, BTRFS_SCRUB_STATUS_RUNNING);
+	}
+
 	/*
 	 * checking @scrub_pause_req here, we can avoid
 	 * race between committing transaction and scrubbing.
@@ -3207,6 +3220,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	if (progress)
 		memcpy(progress, &sctx->stat, sizeof(*progress));
 
+	if (!is_dev_replace)
+		btrfs_update_scrub_stats(dev, &sctx->stat, ret);
+
 	if (!is_dev_replace)
 		btrfs_info(fs_info, "scrub: %s on devid %llu with status: %d",
 			ret ? "not finished" : "finished", devid, ret);
-- 
2.48.1


* [PATCH 3/5] btrfs: persist scrub lifetime stats to the device tree
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide
In-Reply-To: <20260419142618.3147763-1-torsteine+linux@gmail.com>

From: Torstein Eide <torsteine@gmail.com>

Load and save per-device scrub lifetime counters using the on-disk item
introduced in the previous patch.

btrfs_init_scrub_stats() - called from open_ctree() after device stats
  are initialised.  Reads each device's BTRFS_SCRUB_STATS_OBJECTID item
  into the in-memory atomic arrays.  Devices with no item yet start with
  all counters at zero.  Items shorter than the current struct (from an
  older kernel) are handled by zero-filling the missing tail entries.

update_scrub_stat_item() / btrfs_run_scrub_stats() - mirror the
  existing btrfs_run_dev_stats() pattern.  Called from
  commit_cowonly_roots() after btrfs_run_dev_stats().  Iterates over all
  devices; for each device whose scrub_stats_ccnt dirty counter is
  non-zero, it writes the current atomic values back to the tree item,
  creating or replacing the item as needed.

btrfs_update_scrub_stats() - called from btrfs_scrub_dev() on
  completion (or cancellation).  Copies the final btrfs_scrub_progress
  counters into the session arrays and adds them to the lifetime totals,
  records t_end and last_physical, and sets the session status.
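
The short-item handling described for btrfs_init_scrub_stats() can be
illustrated with a small userspace sketch. It assumes the item is a flat
array of little-endian 64-bit values; get_le64(), load_counters() and
NR_COUNTERS are hypothetical names for illustration, not kernel helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_COUNTERS 4	/* arbitrary size chosen for the sketch */

/* Decode one little-endian u64 from bytes, independent of host order. */
uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Load NR_COUNTERS values from an item of item_size bytes.  Slots that
 * the item is too short to contain are set to zero, so an item written
 * with fewer counters still loads cleanly.
 */
void load_counters(const uint8_t *item, size_t item_size, uint64_t *out)
{
	size_t i;

	assert(out != NULL);
	for (i = 0; i < NR_COUNTERS; i++) {
		if (item_size >= (i + 1) * sizeof(uint64_t))
			out[i] = get_le64(item + i * sizeof(uint64_t));
		else
			out[i] = 0;
	}
}
```

The size check is made per slot, so a slot is read only when the item
holds all eight of its bytes; a partially present trailing slot is
treated as missing.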

Signed-off-by: Torstein Eide <torsteine@gmail.com>
Assisted-by: Claude:claude-sonnet-4-6
---
 fs/btrfs/disk-io.c     |   6 ++
 fs/btrfs/transaction.c |   3 +
 fs/btrfs/volumes.c     | 239 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 246 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8a11be02eeb9b..fab08780e403e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3545,6 +3545,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_block_groups;
 	}
 
+	ret = btrfs_init_scrub_stats(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init scrub_stats: %d", ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_init_dev_replace(fs_info);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init dev_replace: %d", ret);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 248adb785051b..65689a3abbdbc 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1375,6 +1375,9 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 		return ret;
 
 	ret = btrfs_run_dev_stats(trans);
+	if (ret)
+		return ret;
+	ret = btrfs_run_scrub_stats(trans);
 	if (ret)
 		return ret;
 	ret = btrfs_run_dev_replace(trans);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a88e68f905646..7de9396a52757 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -9,6 +9,7 @@
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
 #include <linux/semaphore.h>
+#include <linux/timekeeping.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
 #include <linux/namei.h>
@@ -8386,6 +8388,242 @@ int btrfs_get_dev_stats(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/* ---------- scrub lifetime stats: on-disk load/flush ---------- */
+
+static u64 btrfs_scrub_stats_value(const struct extent_buffer *eb,
+				   const struct btrfs_scrub_stats_item *ptr,
+				   int index)
+{
+	u64 val;
+
+	read_extent_buffer(eb, &val,
+			   offsetof(struct btrfs_scrub_stats_item, values) +
+			    ((unsigned long)ptr) + (index * sizeof(u64)),
+			   sizeof(val));
+	return le64_to_cpu(val);
+}
+
+static void btrfs_set_scrub_stats_value(struct extent_buffer *eb,
+					struct btrfs_scrub_stats_item *ptr,
+					int index, u64 val)
+{
+	__le64 leval = cpu_to_le64(val);
+
+	write_extent_buffer(eb, &leval,
+			    offsetof(struct btrfs_scrub_stats_item, values) +
+			     ((unsigned long)ptr) + (index * sizeof(u64)),
+			    sizeof(leval));
+}
+
+static int btrfs_device_init_scrub_stats(struct btrfs_device *device,
+					 struct btrfs_path *path)
+{
+	struct btrfs_scrub_stats_item *ptr;
+	struct extent_buffer *eb;
+	struct btrfs_key key;
+	int item_size;
+	int i, ret, slot;
+
+	key.objectid = BTRFS_SCRUB_STATS_OBJECTID;
+	key.type = BTRFS_PERSISTENT_ITEM_KEY;
+	key.offset = device->devid;
+
+	ret = btrfs_search_slot(NULL, device->fs_info->dev_root, &key, path, 0, 0);
+	if (ret) {
+		for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++)
+			atomic64_set(&device->scrub_stat_values[i], 0);
+		device->scrub_stats_valid = 1;
+		btrfs_release_path(path);
+		return ret < 0 ? ret : 0;
+	}
+
+	slot = path->slots[0];
+	eb = path->nodes[0];
+	item_size = btrfs_item_size(eb, slot);
+	ptr = btrfs_item_ptr(eb, slot, struct btrfs_scrub_stats_item);
+
+	for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++) {
+		u64 val = 0;
+
+		if (item_size >= (i + 1) * sizeof(__le64))
+			val = btrfs_scrub_stats_value(eb, ptr, i);
+		atomic64_set(&device->scrub_stat_values[i], (long long)val);
+	}
+
+	device->scrub_stats_valid = 1;
+	btrfs_release_path(path);
+	return 0;
+}
+
+int btrfs_init_scrub_stats(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	int ret = 0;
+
+	BTRFS_PATH_AUTO_FREE(path);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		ret = btrfs_device_init_scrub_stats(device, path);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&fs_devices->device_list_mutex);
+	return ret;
+}
+
+static int update_scrub_stat_item(struct btrfs_trans_handle *trans,
+				  struct btrfs_device *device)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_root *dev_root = fs_info->dev_root;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+	struct btrfs_scrub_stats_item *ptr;
+	int ret;
+	int i;
+
+	BTRFS_PATH_AUTO_FREE(path);
+
+	key.objectid = BTRFS_SCRUB_STATS_OBJECTID;
+	key.type = BTRFS_PERSISTENT_ITEM_KEY;
+	key.offset = device->devid;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
+	if (ret < 0) {
+		btrfs_warn(fs_info,
+			"error %d searching for scrub_stats item for device %s",
+			ret, btrfs_dev_name(device));
+		return ret;
+	}
+
+	if (ret == 0 &&
+	    btrfs_item_size(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
+		ret = btrfs_del_item(trans, dev_root, path);
+		if (ret) {
+			btrfs_warn(fs_info,
+				"delete undersized scrub_stats item for device %s failed %d",
+				btrfs_dev_name(device), ret);
+			return ret;
+		}
+		ret = 1;
+	}
+
+	if (ret == 1) {
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, dev_root, path,
+					      &key, sizeof(*ptr));
+		if (ret < 0) {
+			btrfs_warn(fs_info,
+				"insert scrub_stats item for device %s failed %d",
+				btrfs_dev_name(device), ret);
+			return ret;
+		}
+	}
+
+	eb = path->nodes[0];
+	ptr = btrfs_item_ptr(eb, path->slots[0], struct btrfs_scrub_stats_item);
+	for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++)
+		btrfs_set_scrub_stats_value(eb, ptr, i,
+					    btrfs_scrub_stat_read(device, i));
+	return 0;
+}
+
+/*
+ * Called from commit_cowonly_roots().  Flushes changed scrub lifetime stats
+ * to disk.  Mirrors btrfs_run_dev_stats().
+ */
+int btrfs_run_scrub_stats(struct btrfs_trans_handle *trans)
+{
+	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	int stats_cnt;
+	int ret = 0;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		stats_cnt = atomic_read(&device->scrub_stats_ccnt);
+		if (!device->scrub_stats_valid || stats_cnt == 0)
+			continue;
+
+		/*
+		 * LOAD-LOAD control dependency: reading scrub_stats_ccnt before
+		 * the counter values requires an explicit read barrier.  Pairs
+		 * with smp_mb__before_atomic() in btrfs_scrub_stat_add/set.
+		 */
+		smp_rmb();
+
+		ret = update_scrub_stat_item(trans, device);
+		if (ret)
+			break;
+		atomic_sub(stats_cnt, &device->scrub_stats_ccnt);
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	return ret;
+}
+
+/*
+ * Update per-device scrub stats after a scrub run completes (or is canceled).
+ * Accumulates session counters into the lifetime totals and records session
+ * metadata (timestamps, status, last_physical).
+ *
+ * @scrub_ret: return value from btrfs_scrub_dev(); -ECANCELED is recorded as
+ *             CANCELED, every other value (0 or an error) as FINISHED.
+ */
+void btrfs_update_scrub_stats(struct btrfs_device *dev,
+			      const struct btrfs_scrub_progress *progress,
+			      int scrub_ret)
+{
+	int i;
+	static const u64 offsets[BTRFS_SCRUB_STAT_VALUES_MAX] = {
+#define OFF(field) offsetof(struct btrfs_scrub_progress, field)
+		[BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED]  = OFF(data_extents_scrubbed),
+		[BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED]  = OFF(tree_extents_scrubbed),
+		[BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED]    = OFF(data_bytes_scrubbed),
+		[BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED]    = OFF(tree_bytes_scrubbed),
+		[BTRFS_SCRUB_STAT_READ_ERRORS]            = OFF(read_errors),
+		[BTRFS_SCRUB_STAT_CSUM_ERRORS]            = OFF(csum_errors),
+		[BTRFS_SCRUB_STAT_VERIFY_ERRORS]          = OFF(verify_errors),
+		[BTRFS_SCRUB_STAT_NO_CSUM]                = OFF(no_csum),
+		[BTRFS_SCRUB_STAT_CSUM_DISCARDS]          = OFF(csum_discards),
+		[BTRFS_SCRUB_STAT_SUPER_ERRORS]           = OFF(super_errors),
+		[BTRFS_SCRUB_STAT_MALLOC_ERRORS]          = OFF(malloc_errors),
+		[BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS]   = OFF(uncorrectable_errors),
+		[BTRFS_SCRUB_STAT_CORRECTED_ERRORS]       = OFF(corrected_errors),
+		[BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS]      = OFF(unverified_errors),
+#undef OFF
+	};
+
+	/* Update session counters from this run's progress struct */
+	for (i = 0; i < BTRFS_SCRUB_STAT_VALUES_MAX; i++) {
+		u64 val = *(const u64 *)((const u8 *)progress + offsets[i]);
+
+		atomic64_set(&dev->scrub_session_values[i], (long long)val);
+		/* Accumulate into lifetime totals */
+		btrfs_scrub_stat_add(dev, i, val);
+	}
+
+	dev->scrub_session_last_physical = progress->last_physical;
+	dev->scrub_session_t_end = ktime_get_real_seconds();
+
+	if (scrub_ret == -ECANCELED)
+		atomic_set(&dev->scrub_session_status, BTRFS_SCRUB_STATUS_CANCELED);
+	else
+		atomic_set(&dev->scrub_session_status, BTRFS_SCRUB_STATUS_FINISHED);
+}
+
 /*
  * Update the size and bytes used for each device where it changed.  This is
  * delayed since we would otherwise get errors while writing out the
@@ -8609,7 +8846,7 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 		btrfs_item_key_to_cpu(leaf, &key, slot);
 		if (key.type != BTRFS_DEV_EXTENT_KEY)
-			break;
+			goto next;
 		devid = key.objectid;
 		physical_offset = key.offset;
 
@@ -8631,7 +8868,7 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 			return ret;
 		prev_devid = devid;
 		prev_dev_ext_end = physical_offset + physical_len;
-
+next:
 		ret = btrfs_next_item(root, path);
 		if (ret < 0)
 			return ret;
-- 
2.48.1


* [PATCH 2/5] btrfs: add in-memory scrub lifetime and session fields
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide
In-Reply-To: <20260419142618.3147763-1-torsteine+linux@gmail.com>

From: Torstein Eide <torsteine@gmail.com>

Extend struct btrfs_device with two counter arrays and session metadata:

  scrub_stat_values[]    - lifetime totals, persisted across mounts via
                           the new BTRFS_SCRUB_STATS_OBJECTID tree item.
                           Updates are tracked by an atomic dirty counter
                           (scrub_stats_ccnt) and flushed at transaction
                           commit by btrfs_run_scrub_stats().

  scrub_session_values[] - per-run counters reset when a new scrub
                           starts; never written to disk.

  scrub_session_{t_start,t_end,last_physical,status} - timing and
                           progress metadata for the current or most
                           recent scrub session on this device.

Add matching kobject pointers (scrub_kobj, scrub_lifetime_kobj,
scrub_session_kobj) to both struct btrfs_device and struct btrfs_fs_info
for the sysfs hierarchy added in a later patch.

Add BTRFS_SCRUB_STATUS_* constants and the inline read/write helpers
btrfs_scrub_stat_{read,add,set}() and btrfs_scrub_session_read().
The add/set helpers include an smp_mb__before_atomic() to order stat
updates before the dirty-counter increment, pairing with the smp_rmb()
in btrfs_run_scrub_stats().
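
The dirty-counter pairing described above can be approximated in
userspace with C11 atomics. This is an analogue under stated
assumptions, not the kernel code: release/acquire ordering stands in for
the smp_mb__before_atomic()/smp_rmb() pair, and struct stats,
stats_init(), stat_add() and stats_flush() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_COUNTERS 4	/* arbitrary size chosen for the sketch */

struct stats {
	_Atomic uint64_t values[NR_COUNTERS];
	atomic_int dirty;	/* number of updates not yet flushed */
};

void stats_init(struct stats *s)
{
	int i;

	for (i = 0; i < NR_COUNTERS; i++)
		atomic_init(&s->values[i], 0);
	atomic_init(&s->dirty, 0);
}

/* Writer: update the value first, then bump the dirty counter. */
void stat_add(struct stats *s, int index, uint64_t val)
{
	atomic_fetch_add_explicit(&s->values[index], val,
				  memory_order_relaxed);
	/* Release: the value update is visible before the increment. */
	atomic_fetch_add_explicit(&s->dirty, 1, memory_order_release);
}

/*
 * Flusher: returns 1 and fills snapshot[] if anything was dirty, else 0.
 * Only the updates counted before the snapshot are subtracted, so an
 * update racing with the flush leaves the counter non-zero for the next
 * pass instead of being lost.
 */
int stats_flush(struct stats *s, uint64_t *snapshot)
{
	int seen = atomic_load_explicit(&s->dirty, memory_order_acquire);
	int i;

	if (seen == 0)
		return 0;
	for (i = 0; i < NR_COUNTERS; i++)
		snapshot[i] = atomic_load_explicit(&s->values[i],
						   memory_order_relaxed);
	atomic_fetch_sub_explicit(&s->dirty, seen, memory_order_relaxed);
	return 1;
}
```

Subtracting the observed count rather than resetting the counter to zero
is the same choice the flush path makes with atomic_sub().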

Signed-off-by: Torstein Eide <torsteine@gmail.com>
Assisted-by: Claude:claude-sonnet-4-6
---
 fs/btrfs/fs.h      |  5 +++
 fs/btrfs/volumes.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a4758d94b32e9..c333e30c93e40 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -714,6 +714,11 @@ struct btrfs_fs_info {
 	struct kobject *qgroups_kobj;
 	struct kobject *discard_kobj;
 
+	/* For /sys/fs/btrfs/<UUID>/scrub/{lifetime,session}/ */
+	struct kobject *scrub_kobj;
+	struct kobject *scrub_lifetime_kobj;
+	struct kobject *scrub_session_kobj;
+
 	/* Track the number of blocks (sectors) read by the filesystem. */
 	struct percpu_counter stats_read_blocks;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0082c166af91f..e77f726928abd 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -200,6 +200,29 @@ struct btrfs_device {
 	atomic_t dev_stats_ccnt;
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
+	/*
+	 * Scrub lifetime counters. Persisted via BTRFS_SCRUB_STATS_OBJECTID
+	 * tree items in the device tree; loaded at mount by
+	 * btrfs_init_scrub_stats(), flushed at commit by btrfs_run_scrub_stats().
+	 * Index values defined by BTRFS_SCRUB_STAT_* in btrfs_tree.h.
+	 */
+	int scrub_stats_valid;
+	atomic_t scrub_stats_ccnt;
+	atomic64_t scrub_stat_values[BTRFS_SCRUB_STAT_VALUES_MAX];
+
+	/*
+	 * Per-session scrub counters. Reset to zero when a new scrub starts on
+	 * this device. Never persisted to disk.
+	 */
+	atomic64_t scrub_session_values[BTRFS_SCRUB_STAT_VALUES_MAX];
+	/* Resume offset at end of session */
+	u64 scrub_session_last_physical;
+	/* Unix time (seconds) when the session started / ended (0 if idle) */
+	u64 scrub_session_t_start;
+	u64 scrub_session_t_end;
+	/* BTRFS_SCRUB_STATUS_* */
+	atomic_t scrub_session_status;
+
 	/*
 	 * Device's major-minor number. Must be set even if the device is not
 	 * opened (bdev == NULL), unless the device is missing.
@@ -211,6 +234,10 @@ struct btrfs_device {
 	struct completion kobj_unregister;
 	/* For sysfs/FSID/devinfo/devid/ */
 	struct kobject devid_kobj;
+	/* For sysfs/FSID/devinfo/devid/scrub/ hierarchy */
+	struct kobject *scrub_kobj;
+	struct kobject *scrub_lifetime_kobj;
+	struct kobject *scrub_session_kobj;
 
 	/* Bandwidth limit for scrub, in bytes */
 	u64 scrub_speed_max;
@@ -864,6 +891,58 @@ static inline void btrfs_dev_stat_set(struct btrfs_device *dev,
 	atomic_inc(&dev->dev_stats_ccnt);
 }
 
+/*
+ * Scrub session status values stored in device->scrub_session_status.
+ */
+#define BTRFS_SCRUB_STATUS_IDLE		0
+#define BTRFS_SCRUB_STATUS_RUNNING	1
+#define BTRFS_SCRUB_STATUS_FINISHED	2
+#define BTRFS_SCRUB_STATUS_CANCELED	3
+
+static inline u64 btrfs_scrub_stat_read(const struct btrfs_device *dev,
+					int index)
+{
+	return (u64)atomic64_read(&dev->scrub_stat_values[index]);
+}
+
+static inline void btrfs_scrub_stat_add(struct btrfs_device *dev,
+					int index, u64 val)
+{
+	atomic64_add((long long)val, &dev->scrub_stat_values[index]);
+	/*
+	 * Order the stat update before the dirty-counter increment so that
+	 * btrfs_run_scrub_stats() observes a consistent snapshot.  Pairs with
+	 * smp_rmb() in btrfs_run_scrub_stats().
+	 */
+	smp_mb__before_atomic();
+	atomic_inc(&dev->scrub_stats_ccnt);
+}
+
+static inline void btrfs_scrub_stat_set(struct btrfs_device *dev,
+					int index, u64 val)
+{
+	atomic64_set(&dev->scrub_stat_values[index], (long long)val);
+	/*
+	 * Order the stat update before the dirty-counter increment so that
+	 * btrfs_run_scrub_stats() observes a consistent snapshot.  Pairs with
+	 * smp_rmb() in btrfs_run_scrub_stats().
+	 */
+	smp_mb__before_atomic();
+	atomic_inc(&dev->scrub_stats_ccnt);
+}
+
+static inline u64 btrfs_scrub_session_read(const struct btrfs_device *dev,
+					   int index)
+{
+	return (u64)atomic64_read(&dev->scrub_session_values[index]);
+}
+
+int btrfs_init_scrub_stats(struct btrfs_fs_info *fs_info);
+int btrfs_run_scrub_stats(struct btrfs_trans_handle *trans);
+void btrfs_update_scrub_stats(struct btrfs_device *dev,
+			      const struct btrfs_scrub_progress *progress,
+			      int scrub_ret);
+
 static inline const char *btrfs_dev_name(const struct btrfs_device *device)
 {
 	if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
-- 
2.48.1


^ permalink raw reply related

* [PATCH 1/5] btrfs: uapi: introduce on-disk scrub stats item
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide
In-Reply-To: <20260419142618.3147763-1-torsteine+linux@gmail.com>

From: Torstein Eide <torsteine@gmail.com>

Add a new persistent item type for per-device scrub lifetime counters:

  Key: (BTRFS_SCRUB_STATS_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)

The item payload is struct btrfs_scrub_stats_item, an array of __le64
values indexed by BTRFS_SCRUB_STAT_* constants (data/tree extents and
bytes scrubbed, read/checksum/verify/malloc errors, etc.).

The array is designed to grow at the end; existing index values are
fixed and must not be renumbered.  New entries should be appended and
the item_size check in the reader handles shorter on-disk items from
older kernels gracefully.
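
As a rough illustration of the item_size handling described above, here is a
user-space sketch (not the reader added by this series). The names
scrub_stats_load(), get_le64() and SCRUB_STAT_VALUES_MAX are hypothetical
stand-ins; the sketch only assumes that a shorter item from an older kernel
should leave the missing trailing counters at zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BTRFS_SCRUB_STAT_VALUES_MAX in this sketch. */
#define SCRUB_STAT_VALUES_MAX 14

/* Decode one little-endian 64-bit value, independent of host endianness. */
static uint64_t get_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Load counters from an on-disk item of item_size bytes.  Counters the item
 * is too short to contain read as zero; trailing values this reader does not
 * know about are ignored.
 */
static void scrub_stats_load(const unsigned char *item, size_t item_size,
			     uint64_t out[SCRUB_STAT_VALUES_MAX])
{
	size_t on_disk = item_size / sizeof(uint64_t);

	assert(item != NULL || item_size == 0);
	for (size_t i = 0; i < SCRUB_STAT_VALUES_MAX; i++)
		out[i] = (i < on_disk) ? get_le64(item + i * 8) : 0;
}
```

With a 16-byte item, only the first two counters are taken from disk and the
remaining twelve stay zero.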

Signed-off-by: Torstein Eide <torsteine@gmail.com>
Assisted-by: Claude:claude-sonnet-4-6
---
 include/uapi/linux/btrfs_tree.h | 34 +++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index cc3b9f7dccafa..7eb775b6b6253 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -82,6 +82,9 @@
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
+/* per-device scrub lifetime stats in the device tree */
+#define BTRFS_SCRUB_STATS_OBJECTID 1ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -1139,6 +1142,37 @@ struct btrfs_dev_stats_item {
 	__le64 values[BTRFS_DEV_STAT_VALUES_MAX];
 } __attribute__ ((__packed__));
 
+/*
+ * Scrub lifetime error counters, stored per device in the device tree.
+ * Key: (BTRFS_SCRUB_STATS_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
+ *
+ * Index values are defined by the BTRFS_SCRUB_STAT_* constants below
+ * (part of the on-disk format, not exposed via ioctl).
+ */
+#define BTRFS_SCRUB_STAT_DATA_EXTENTS_SCRUBBED		0
+#define BTRFS_SCRUB_STAT_TREE_EXTENTS_SCRUBBED		1
+#define BTRFS_SCRUB_STAT_DATA_BYTES_SCRUBBED		2
+#define BTRFS_SCRUB_STAT_TREE_BYTES_SCRUBBED		3
+#define BTRFS_SCRUB_STAT_READ_ERRORS			4
+#define BTRFS_SCRUB_STAT_CSUM_ERRORS			5
+#define BTRFS_SCRUB_STAT_VERIFY_ERRORS			6
+#define BTRFS_SCRUB_STAT_NO_CSUM			7
+#define BTRFS_SCRUB_STAT_CSUM_DISCARDS			8
+#define BTRFS_SCRUB_STAT_SUPER_ERRORS			9
+#define BTRFS_SCRUB_STAT_MALLOC_ERRORS			10
+#define BTRFS_SCRUB_STAT_UNCORRECTABLE_ERRORS		11
+#define BTRFS_SCRUB_STAT_CORRECTED_ERRORS		12
+#define BTRFS_SCRUB_STAT_UNVERIFIED_ERRORS		13
+#define BTRFS_SCRUB_STAT_VALUES_MAX			14
+
+struct btrfs_scrub_stats_item {
+	/*
+	 * Grow at the end for future enhancements; keep existing values
+	 * unchanged.  Index values defined by BTRFS_SCRUB_STAT_* above.
+	 */
+	__le64 values[BTRFS_SCRUB_STAT_VALUES_MAX];
+} __attribute__ ((__packed__));
+
 #define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
 #define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
 
-- 
2.48.1


^ permalink raw reply related

* [PATCH 0/5] btrfs: add persistent scrub lifetime and session counters
From: Torstein Eide @ 2026-04-19 14:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Torstein Eide

From: Torstein Eide <torsteine@gmail.com>

btrfs currently exposes scrub progress only via ioctl which returns a
snapshot of a running scrub and nothing once it finishes.  There is no
persistent record of what previous scrubs found and no sysfs interface
for monitoring.

This series adds per-device scrub statistics that survive across
unmounts and are exposed via sysfs.  A new on-disk item
(BTRFS_SCRUB_STATS_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid) stores
14 __le64 lifetime counters per device in the device tree, following
the same dirty-counter/flush-at-commit pattern as btrfs_dev_stats.
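
A minimal user-space sketch of the dirty-counter/flush-at-commit idea, using
C11 atomics instead of the kernel's atomic_t. The names struct dev_stats,
stat_add() and stats_flush() are illustrative only and are not the functions
added by this series:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NSTATS 14	/* hypothetical; mirrors the 14 lifetime counters */

struct dev_stats {
	_Atomic uint64_t values[NSTATS];
	atomic_int dirty;	/* bumped on every counter update */
};

/* Update a counter first, then mark the device stats as dirty. */
static void stat_add(struct dev_stats *s, int idx, uint64_t val)
{
	assert(idx >= 0 && idx < NSTATS);
	atomic_fetch_add(&s->values[idx], val);
	atomic_fetch_add(&s->dirty, 1);
}

/*
 * "Commit time" flush: returns 0 if nothing changed since the last flush.
 * Otherwise copies the counters into snapshot[] and returns 1.  The observed
 * dirty count is subtracted rather than reset to zero, so an update that
 * races with the flush still leaves the stats marked dirty.
 */
static int stats_flush(struct dev_stats *s, uint64_t snapshot[NSTATS])
{
	int seen = atomic_load(&s->dirty);

	if (seen == 0)
		return 0;
	for (int i = 0; i < NSTATS; i++)
		snapshot[i] = atomic_load(&s->values[i]);
	atomic_fetch_sub(&s->dirty, seen);
	return 1;
}
```

The point of the pattern is that a commit with no scrub activity does no work
for these counters at all.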

Each device gains an in-memory session snapshot (reset at scrub
start, populated at scrub end) for the most recent run.

The sysfs layout under /sys/fs/btrfs/<UUID>/ and devinfo/<devid>/:

  scrub/lifetime/<counter>    accumulated totals across all runs
  scrub/session/<counter>     snapshot of the most recent run
  scrub/session/status        idle | running | finished | canceled
  scrub/session/duration_seconds
  scrub/reset                 write "1" to zero lifetime counters
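
A monitoring tool could read these files roughly as sketched here. This
assumes each counter file holds a single decimal number, which is an
assumption of the sketch; scrub_counter_path() and read_u64_file() are
hypothetical helper names:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/fs/btrfs/<UUID>/scrub/<scope>/<counter>", where scope is
 * "lifetime" or "session".  Returns 0 on success, -1 if buf is too small.
 */
static int scrub_counter_path(char *buf, size_t len, const char *uuid,
			      const char *scope, const char *counter)
{
	int n;

	assert(buf && uuid && scope && counter);
	n = snprintf(buf, len, "/sys/fs/btrfs/%s/scrub/%s/%s",
		     uuid, scope, counter);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one unsigned decimal value from a file; returns 0 on success. */
static int read_u64_file(const char *path, uint64_t *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu64, val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling scrub_counter_path() with scope "lifetime" and counter "csum_errors",
then passing the result to read_u64_file(), would fetch the accumulated
checksum error count for the whole filesystem.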

The plan is explained here: https://github.com/kdave/btrfs-progs/issues/1108

Tested with xfstests.

## Result

Below is a table with the results of a destructive test in two passes: one `dd` per device, then a scrub, repeated.

| FILE                  | Total                   | Device 1                | Device  2               |
|                       | Lifetime   | Session    | Lifetime   | Session    | Lifetime   | Session    |
|-----------------------|------------|------------|------------|------------|------------|------------|
| corrected_errors      | 28736      | 13120      | 15616      | 0          | 13120      | 13120      |
| csum_discards         | 0          | 0          | 0          | 0          | 0          | 0          |
| csum_errors           | 28736      | 13120      | 15616      | 0          | 13120      | 13120      |
| data_bytes_scrubbed   | 6174015488 | 3087007744 | 3087007744 | 1543503872 | 3087007744 | 1543503872 |
| data_extents_scrubbed | 94208      | 47104      | 47104      | 23552      | 47104      | 23552      |
| duration_seconds      |            | 1          |            | 1          |            | 1          |
| last_physical         |            |            |            | 2147483648 |            | 2126643200 |
| malloc_errors         | 0          | 0          | 0          | 0          | 0          | 0          |
| no_csum               | 0          | 0          | 0          | 0          | 0          | 0          |
| read_errors           | 0          | 0          | 0          | 0          | 0          | 0          |
| status                |            | finished   |            | finished   |            | finished   |
| super_errors          | 0          | 0          | 0          | 0          | 0          | 0          |
| t_end                 |            | 1776595754 |            | 1776595754 |            | 1776595754 |
| tree_bytes_scrubbed   | 6946816    | 3473408    | 3473408    | 1736704    | 3473408    | 1736704    |
| tree_extents_scrubbed | 424        | 212        | 212        | 106        | 212        | 106        |
| t_start               |            | 1776595753 |            | 1776595753 |            | 1776595753 |
| uncorrectable_errors  | 0          | 0          | 0          | 0          | 0          | 0          |
| unverified_errors     | 0          | 0          | 0          | 0          | 0          | 0          |
| verify_errors         | 0          | 0          | 0          | 0          | 0          | 0          |


Torstein Eide (5):
  btrfs: uapi: introduce on-disk scrub stats item
  btrfs: add in-memory scrub lifetime and session fields
  btrfs: persist scrub lifetime stats to the device tree
  btrfs: hook scrub session tracking into 'btrfs_scrub_dev()'
  btrfs: expose scrub lifetime and session counters via sysfs

 fs/btrfs/disk-io.c              |   6 +
 fs/btrfs/fs.h                   |   5 +
 fs/btrfs/scrub.c                |  16 +
 fs/btrfs/sysfs.c                | 586 ++++++++++++++++++++++++++++++++
 fs/btrfs/sysfs.h                |   3 +
 fs/btrfs/transaction.c          |   3 +
 fs/btrfs/volumes.c              | 239 ++++++++++++-
 fs/btrfs/volumes.h              |  79 +++++
 include/uapi/linux/btrfs_tree.h |  34 ++
 9 files changed, 969 insertions(+), 2 deletions(-)

-- 
2.48.1


^ permalink raw reply

* Re: [PATCH v2 00/19] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Vineeth Remanan Pillai @ 2026-04-19 13:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Dmitry Ilvokhin, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Jiri Pirko,
	Oded Gabbay, Koby Elbaz, dri-devel, Rafael J. Wysocki,
	Viresh Kumar, Gautham R. Shenoy, Huang Rui, Mario Limonciello,
	Len Brown, Srinivas Pandruvada, linux-pm, MyungJoo Ham,
	Kyungmin Park, Chanwoo Choi, Christian König, Sumit Semwal,
	linaro-mm-sig, Eddie James, Andrew Jeffery, Joel Stanley,
	linux-fsi, David Airlie, Simona Vetter, Alex Deucher,
	Danilo Krummrich, Matthew Brost, Philipp Stanner, Harry Wentland,
	Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires, linux-input,
	Wolfram Sang, linux-i2c, Mark Brown, Michael Hennerich,
	Nuno Sá, linux-spi, James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, Chris Mason, David Sterba, linux-btrfs,
	Thomas Gleixner, Andrew Morton, SeongJae Park, linux-mm,
	Borislav Petkov, Dave Hansen, x86, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260418190456.631df6f3@fedora>

On Sat, Apr 18, 2026 at 7:05 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 23 Mar 2026 12:00:19 -0400
> "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> wrote:
>
> >   if (trace_foo_enabled() && cond)
> >       trace_call__foo(args);   /* calls __do_trace_foo() directly */
>
> Hi Vineeth,
>
> Could you rebase this series on top of 7.1-rc1 when it comes out?
> Several of these patches were accepted already. Obviously drop those.
> They were the patches that added the feature, and any where the
> maintainer acked the patch.
>
> Now that the feature has been accepted, if you post the patch series
> again after 7.1-rc1 with all the patches that haven't been accepted
> yet, then the maintainers can simply take them directly. As the feature
> is now accepted, there's no dependency on it, and they don't need to go
> through the tracing tree.
>
Sure, will do. Thanks for merging this feature.

Thanks,
Vineeth

^ permalink raw reply

* Re: [PATCH v2 02/10] btrfs: reduce size of struct btrfs_block_group
From: Sun YangKai @ 2026-04-19  7:32 UTC (permalink / raw)
  To: fdmanana, linux-btrfs
In-Reply-To: <ef130a3bd2f994d615e9b7dc8b17132b6c3a7fc1.1776278490.git.fdmanana@suse.com>

>    struct btrfs_block_group {
>          struct btrfs_fs_info *     fs_info;              /*     0     8 */
>          struct btrfs_inode *       inode;                /*     8     8 */
>          spinlock_t                 lock __attribute__((__aligned__(4))); /*    16     4 */
>          unsigned int               ro;                   /*    20     4 */
>          u64                        start;                /*    24     8 */
>          u64                        length;               /*    32     8 */
>          u64                        pinned;               /*    40     8 */
>          u64                        reserved;             /*    48     8 */
>          u64                        used;                 /*    56     8 */
>          /* --- cacheline 1 boundary (64 bytes) --- */
>          u64                        delalloc_bytes;       /*    64     8 */
>          u64                        bytes_super;          /*    72     8 */
>          u64                        flags;                /*    80     8 */
>          u64                        cache_generation;     /*    88     8 */
>          u64                        global_root_id;       /*    96     8 */
>          u64                        remap_bytes;          /*   104     8 */
>          u32                        identity_remap_count; /*   112     4 */
>          u32                        last_identity_remap_count; /*   116     4 */
>          u64                        last_used;            /*   120     8 */
>          /* --- cacheline 2 boundary (128 bytes) --- */
>          u64                        last_remap_bytes;     /*   128     8 */
>          u64                        last_flags;           /*   136     8 */
>          u32                        bitmap_high_thresh;   /*   144     4 */
>          u32                        bitmap_low_thresh;    /*   148     4 */
>          struct rw_semaphore        data_rwsem __attribute__((__aligned__(8))); /*   152    40 */
>          /* --- cacheline 3 boundary (192 bytes) --- */
>          long unsigned int          full_stripe_len;      /*   192     8 */
>          long unsigned int          runtime_flags;        /*   200     8 */
>          int                        disk_cache_state;     /*   208     4 */
This seems to always be used as enum btrfs_disk_cache_state
>          int                        cached;               /*   212     4 */
This seems to always be used as enum btrfs_caching_type

Maybe we could change them to the proper enum types in this patch 
series, for better type safety and self-documentation.
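
A small sketch of why such a change should be layout-neutral. It uses
hypothetical enums and structs (not the real btrfs definitions) and assumes
the compiler gives these enums the size of int, which is the usual default
but not guaranteed by the C standard:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enums standing in for the kernel's state types. */
enum cache_state { CACHE_STATE_A, CACHE_STATE_B, CACHE_STATE_C };
enum caching_kind { CACHING_NO, CACHING_STARTED, CACHING_FINISHED };

/* The two fields as plain ints, as in the quoted layout. */
struct bg_with_ints {
	int disk_cache_state;
	int cached;
};

/* The same two fields with enum types. */
struct bg_with_enums {
	enum cache_state disk_cache_state;
	enum caching_kind cached;
};

/* Returns 1 if switching to enum types keeps size and field offsets. */
static int enum_fields_keep_layout(void)
{
	return sizeof(struct bg_with_enums) == sizeof(struct bg_with_ints) &&
	       offsetof(struct bg_with_enums, cached) ==
	       offsetof(struct bg_with_ints, cached);
}
```

If that holds for the real types, the pahole numbers above would not change.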

Thanks,
Sun YangKai
>          struct btrfs_caching_control * caching_ctl;      /*   216     8 */
>          struct btrfs_space_info *  space_info;           /*   224     8 */
>          struct btrfs_free_space_ctl * free_space_ctl;    /*   232     8 */
>          struct rb_node             cache_node __attribute__((__aligned__(8))); /*   240    24 */
>          /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
>          struct list_head           list;                 /*   264    16 */
>          refcount_t                 refs __attribute__((__aligned__(4))); /*   280     4 */
>          atomic_t                   frozen __attribute__((__aligned__(4))); /*   284     4 */
>          struct list_head           cluster_list;         /*   288    16 */
>          struct list_head           bg_list;              /*   304    16 */
>          /* --- cacheline 5 boundary (320 bytes) --- */
>          struct list_head           ro_list;              /*   320    16 */
>          struct list_head           discard_list;         /*   336    16 */
>          int                        discard_index;        /*   352     4 */
>          enum btrfs_discard_state   discard_state;        /*   356     4 */
>          u64                        discard_eligible_time; /*   360     8 */
>          u64                        discard_cursor;       /*   368     8 */
>          struct list_head           dirty_list;           /*   376    16 */
>          /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
>          struct list_head           io_list;              /*   392    16 */
>          struct btrfs_io_ctl        io_ctl;               /*   408    72 */
>          /* --- cacheline 7 boundary (448 bytes) was 32 bytes ago --- */
>          atomic_t                   reservations __attribute__((__aligned__(4))); /*   480     4 */
>          atomic_t                   nocow_writers __attribute__((__aligned__(4))); /*   484     4 */
>          struct mutex               free_space_lock __attribute__((__aligned__(8))); /*   488    32 */
>          /* --- cacheline 8 boundary (512 bytes) was 8 bytes ago --- */
>          bool                       using_free_space_bitmaps; /*   520     1 */
>          bool                       using_free_space_bitmaps_cached; /*   521     1 */
> 
>          /* XXX 2 bytes hole, try to pack */
>          /* Bitfield combined with previous fields */
> 
>          static enum btrfs_block_group_size_class size_class; /*     0: 0  0 */
>          int                        swap_extents;         /*   524     4 */
>          u64                        alloc_offset;         /*   528     8 */
>          u64                        zone_unusable;        /*   536     8 */
>          u64                        zone_capacity;        /*   544     8 */
>          u64                        meta_write_pointer;   /*   552     8 */
>          struct btrfs_chunk_map *   physical_map;         /*   560     8 */
>          struct list_head           active_bg_list;       /*   568    16 */
>          /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
>          struct work_struct         zone_finish_work;     /*   584    32 */
>          struct extent_buffer *     last_eb;              /*   616     8 */
>          u64                        reclaim_mark;         /*   624     8 */
> 
>          /* size: 632, cachelines: 10, members: 60, static members: 1 */
>          /* sum members: 630, holes: 1, sum holes: 2 */
>          /* sum bitfield members: 8 bits (1 bytes) */
>          /* forced alignments: 8 */
>          /* last cacheline: 56 bytes */
> 
>          /* BRAIN FART ALERT! 632 bytes != 630 (member bytes) + 8 (member bits) + 2 (byte holes) + 0 (bit holes), diff = -8 bits */
>    } __attribute__((__aligned__(8)));
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/block-group.h | 33 ++++++++++++++++-----------------
>   1 file changed, 16 insertions(+), 17 deletions(-)



^ permalink raw reply

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
From: Qu Wenruo @ 2026-04-19  5:08 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba
In-Reply-To: <92c96052-6619-4c25-8add-a85a4b76c060@gmx.com>



On 2026/4/19 09:55, Qu Wenruo wrote:
> 
> 
> On 2026/4/19 00:08, Paul Richards wrote:
>> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
>> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
>> already supported by ext4 and xfs. The userspace contract is
>> documented in fallocate(2).
>>
>> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
>> moving punch_hole into its own function and decoupling locking from the
>> per-operation helpers. This is similar to the implementations for ext4
>> and xfs. The allocate-range and zero-range paths remain coupled since
>> they share some setup logic.
>>
>> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
>>
>> == Implementation approach ==
>>
>> For COLLAPSE_RANGE:
>>   - The removed region [offset, offset+len) is punched out via
>>     btrfs_replace_file_extents(), which handles boundary splitting.
>>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>>     leftward by len in forward order.
>>
>> For INSERT_RANGE:
>>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>>     by len in reverse order (required to avoid key collisions).
>>   - No pre-splitting of straddling extents is needed: the left portion
>>     of a straddling extent stays in place, the right portion is shifted;
>>     both reference the same physical extent via their existing
>>     extent_offset fields.
>>
>> For each shifted key, the corresponding back-reference in the extent
>> tree is updated via a shared helper btrfs_shift_extent_backref().

After looking into each patch, I do not think the low level direct file 
item change is a good idea, especially with your current implementation:

- Can lock the inode for a very long time
   E.g. inserting a hole into the beginning of a very large file.
   We will lock the inode until all file items are iterated, which kills
   concurrency.

- Possible problems with metadata reservation

- Problems with ^no-holes collapse
   Will cause duplicated file offsets with hole file items.

On the other hand, with reflink the insert/collapse can even be 
implemented in user space; with a proper step size, we can still 
allow concurrent reads/writes outside the reflink ranges.
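
A rough sketch of how the steps for a user-space insert could be planned.
plan_insert_range() and struct clone_step are hypothetical names; each step
would then be carried out with a reflink call such as the FICLONERANGE ioctl,
and the gap at [offset, offset + len) would still need to be punched out
afterwards. The sketch assumes all values are already block aligned:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct clone_step {
	uint64_t src_off;
	uint64_t dst_off;
	uint64_t len;
};

/*
 * Plan the clone steps that move [offset, size) to [offset + len, size + len).
 * Steps go from the tail of the file backward so no step reads data that an
 * earlier step already overwrote, and each step is at most `step` bytes with
 * step <= len so source and destination of one step never overlap.
 * Returns the number of steps, or -1 on bad arguments or if out[] is too small.
 */
static long plan_insert_range(uint64_t offset, uint64_t len, uint64_t size,
			      uint64_t step, struct clone_step *out, size_t max)
{
	uint64_t end = size;
	size_t n = 0;

	assert(out != NULL);
	if (len == 0 || step == 0 || step > len || offset > size)
		return -1;
	while (end > offset) {
		uint64_t chunk = (end - offset < step) ? end - offset : step;

		if (n == max)
			return -1;
		out[n].src_off = end - chunk;
		out[n].dst_off = end - chunk + len;
		out[n].len = chunk;
		n++;
		end -= chunk;
	}
	return (long)n;
}
```

Only the range covered by the current step has to be protected from
concurrent writers, which is what allows I/O elsewhere in the file to go on.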

This makes me wonder: are these features really that necessary?

If you know of programs that actively use these features for real-world 
benefits that cannot be achieved through reflink, please provide them.

Thanks,
Qu


>>
>> After the key-shift loop, btrfs_drop_extent_map_range() is called to
>> invalidate the in-memory extent map cache. This is important for reads
>> after the fallocate operation to ensure they obtain data for the new
>> offsets.
>>
>> The page cache is flushed and invalidated upfront (before any extent
>> manipulation) following the ext4/xfs pattern. The inode lock
>> (BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
>> appearing during the operation.
> 
> So far the explanation looks fine, but I'll need to dig deeper for each 
> patch to be sure.
> 
>>
>> Transaction cycling: the key-shift loop cycles transactions every
>> BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
>> holding a single transaction open across a large number of extents.
>>
>> Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.
>>
>> == Known limitations ==
>>
>> INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
>> files will require promoting the existing inline extent to a regular
>> one, since inline extents are supported only at the very start of a
>> file.
> 
> This can be addressed, e.g. by reading out the inlined extent into the 
> page cache and redirtying the new page cache.
> 
> But I'd say it's not a critical part thus I'm totally fine without this 
> support for a while.
> 
>>
>> In the opposite direction, COLLAPSE_RANGE will not create inline files
>> like it should if the remaining data is small enough.
> 
> This is even less important, there are a lot of cases that we don't 
> inline a data extent, so keeping the extent uninlined is completely fine.
> 
>>
>> I intend to address both of these limitations.
>>
>> == Testing ==
>>
>> Tested with a Rust-based functional test suite covering:
>>   - Collapse and insert at the start, middle of a file
>>   - Multiple sequential operations on the same file
>>   - Files with multiple extents (fsync between writes to force separate
>>     extent items)
>>   - Files with holes (explicit punch_hole and implicit sparse writes)
>>   - Compressed extents (mount -o compress=zstd)
>>   - Transaction cycling (interval reduced to 4 during testing, verified
>>     in dmesg logs)
>>   - Inline files, verified that -EOPNOTSUPP is returned.
> 
> I guess that tool never verifies the contents, nor runs multi-threaded stress 
> tests, e.g. fsstress?
> 
>>
>> The same tests pass on both btrfs and xfs (modulo the inline files).
>>
>> I have not run fstests which I know contains tests for INSERT_RANGE
>> and COLLAPSE_RANGE. I will do so.
> 
> Thus I'd prefer an fstests run before whatever your local tool does.
> 
>>
>> == Questions for reviewers ==
>>
>> 1. Transaction cycling interval: we use 32 items per cycle. Is this
>>     the right threshold, or is there an established convention in btrfs
>>     for this kind of loop?
> 
> 
> We have btrfs_should_end_transaction() to do it, thus you do not need to 
> use a locally defined threshold.
> 
>>
>> 2. Extent lock scope for collapse: we hold the extent lock only on
>>     [offset, offset+len) during the hole punch, not on the full
>>     [offset, i_size) range that the key-shift loop operates on. Is
>>     this safe, or should we lock the full affected range?
> 
> I think you should lock the range [offset, U64_MAX).
> 
> Especially you have already dropped all cache for range [offset, 
> U64_MAX) before the lock.
> 
> The same will apply for insert, but it looks like you have not locked 
> any extent range for insert?
> 
>>
>> 3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
>>     operations, or should they be unconditionally available?
> 
> It's strongly recommended in this case.
> 
>>
>> == Notes ==
>>
>> This is my first kernel contribution. Development was significantly
>> assisted by an LLM (Amazon Q Developer). The implementation, testing,
>> and final review decisions are my own.
> 
> I'm very interested in how the LLM is involved.
> 
> You mentioned that "implementation, testing, review" are your own, which 
> reads as if everything is your own.
> 
> I don't think you're only using the LLM to help understand the code, thus 
> it looks like the implementation is contributed by the LLM.
> 
> Please remember, you're the one explaining/defending the code.
> But as long as you can explain/defend the code, I'm fine with that.
> 
> Still, a newcomer with a non-trivial code change will always attract more 
> scrutiny, so please don't expect rapid review/merge/etc.
> 
>>
>> Various btrfs_info() print statements, assertions, and comments that
>> were useful during development and testing have been left in place,
>> but will be removed or streamlined in the next revision.
> 
> Please don't. It's pretty easy to leave such scaffolding in the final 
> code, and it will be time consuming to make sure it is properly removed 
> in the final version.
> 
> Furthermore, dmesg based printk() is slow, thus it can mask a lot of race 
> related bugs just by slowing down the operations.
> 
> For critical operations, introduce trace events for them, there are 
> examples across several inode operations.
> 
> If you really want to debug during development, please introduce 
> temporary trace_printk()s in a dedicated patch, and just don't send that 
> debug patch.
> 
> Thanks,
> Qu
> 
>>
>> Paul Richards (3):
>>    btrfs: refactor btrfs_fallocate() ahead of supporting more modes
>>    btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
>>    btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
>>
>>   fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 578 insertions(+), 23 deletions(-)
>>
> 
> 


^ permalink raw reply

* Re: [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
From: Qu Wenruo @ 2026-04-19  4:44 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba
In-Reply-To: <20260418143808.199603-4-paul.richards@gmail.com>



On 2026/4/19 00:08, Paul Richards wrote:
> Assisted-by: Amazon Q Developer:auto/unknown
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
>   fs/btrfs/file.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 238 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 99d24bef5f88..b708bb6a1082 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2945,6 +2945,241 @@ static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>   	return ret;
>   }
>   
> +static int btrfs_insert_range(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_path *path;
> +	struct btrfs_trans_handle *trans = NULL;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	struct btrfs_key new_key;
> +	u64 ino = btrfs_ino(BTRFS_I(inode));
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> +		return -EOPNOTSUPP;
> +
> +	/* offset and len must be sector-aligned */
> +	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
> +		return -EINVAL;
> +
> +	/* offset must be within the file - use ftruncate to extend */
> +	if (offset >= inode->i_size)
> +		return -EINVAL;
> +
> +	/* result must not exceed the maximum file size */
> +	if (len > inode->i_sb->s_maxbytes - inode->i_size)
> +		return -EFBIG;
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_insert_range: ino=%llu offset=%lld len=%lld i_size=%lld",
> +		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
> +
> +	/* wait for any ordered extents in [offset, i_size) to complete */
> +	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +				       inode->i_size - offset);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Flush and invalidate the page cache for [offset, i_size) upfront,
> +	 * following the same pattern as btrfs_collapse_range().
> +	 */
> +	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
> +	if (ret)
> +		return ret;
> +	truncate_pagecache_range(inode, offset, LLONG_MAX);
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	trans = btrfs_start_transaction(root, 1);

As we may iterate through a lot of file items, the nr_items == 1 is 
definitely not correct.

On the other hand I'm not sure what's the correct nr_item to be passed 
here either.
Thus this may cause problems for metadata reservation. This also applies 
to the previous patch.

> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		trans = NULL;
> +		goto out_path;
> +	}
> +
> +	/*
> +	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= offset
> +	 * rightward by len bytes.
> +	 *
> +	 * We must iterate in reverse order (highest offset first) to avoid
> +	 * colliding with a key we haven't shifted yet - shifting forward
> +	 * would overwrite the next item's key before we process it.

OK, this explains why we can not have a shared helper to do the file 
offset shift.
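
To illustrate the ordering constraint with a toy model (a plain sorted array
standing in for the keys in a leaf; shift_keys() and keys_strictly_sorted()
are hypothetical names, not btrfs helpers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if keys[] is strictly increasing, as keys in a tree must be. */
static int keys_strictly_sorted(const uint64_t *keys, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (keys[i - 1] >= keys[i])
			return 0;
	return 1;
}

/*
 * Shift every key >= offset by len, one key at a time, either from the
 * highest key down (reverse != 0) or from the lowest key up.  Returns 1 if
 * the array stayed strictly sorted after every single update, 0 as soon as
 * an update produces a duplicate or out-of-order key.
 */
static int shift_keys(uint64_t *keys, size_t n, uint64_t offset,
		      uint64_t len, int reverse)
{
	assert(keys != NULL || n == 0);
	for (size_t step = 0; step < n; step++) {
		size_t i = reverse ? n - 1 - step : step;

		if (keys[i] < offset)
			continue;
		keys[i] += len;
		if (!keys_strictly_sorted(keys, n))
			return 0;
	}
	return 1;
}
```

With keys {0, 4096, 8192, 12288}, offset 4096 and len 4096, the reverse walk
keeps the ordering valid at every step, while the forward walk collides on
the first shifted key (4096 becomes 8192, which already exists).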

> +	 *
> +	 * No pre-splitting of straddling extents is needed. If an extent
> +	 * straddles offset, the left portion (key.offset < offset) stays
> +	 * in place and the right portion is shifted. Both reference the
> +	 * same physical extent via their existing extent_offset fields,
> +	 * which remain correct after the key shift.
> +	 */
> +
> +	int nr_shifted = 0;
> +
> +	/* Find the last extent item for this inode */
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = (u64)-1;
> +
> +	while (1) {
> +		struct btrfs_file_extent_item *fi;
> +		u64 disk_bytenr;
> +		u64 num_bytes;
> +		u64 extent_offset;
> +		int extent_type;
> +
> +		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);

Again you're doing btrfs_search_slot() for every file item, which is 
very inefficient.

> +		if (ret < 0)
> +			goto out_trans;
> +
> +		/*
> +		 * Search for (ino, EXTENT_DATA, -1) will never find an exact
> +		 * match, so ret == 1 and slot points one past the last item.
> +		 * Step back one slot to land on the last extent item.
> +		 */
> +		if (path->slots[0] == 0) {
> +			/* No items at all - nothing to shift */
> +			ret = 0;
> +			break;
> +		}
> +		path->slots[0]--;
> +
> +		leaf = path->nodes[0];
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +		/* If we've gone past this inode's items, we are done */
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		/* If this item is before the insertion point, we are done */
> +		if (key.offset < offset) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		btrfs_info(fs_info,
> +			   "btrfs_insert_range: shifting key offset %llu -> %llu",
> +			   key.offset, key.offset + len);
> +
> +		fi = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_type = btrfs_file_extent_type(leaf, fi);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +		extent_offset = btrfs_file_extent_offset(leaf, fi);
> +
> +		/*
> +		 * Inline extents must have key.offset == 0 and cannot be
> +		 * shifted to a non-zero offset - the tree checker enforces
> +		 * this invariant. Reject with -EOPNOTSUPP.
> +		 *
> +		 * An inline extent can only exist if the file's entire content
> +		 * fits within a single sector, meaning it is the only extent
> +		 * item for this inode. It will therefore always be the first
> +		 * item we encounter in the reverse iteration, before any keys
> +		 * have been shifted, so bailing here leaves the file in a
> +		 * consistent state.
> +		 *
> +		 * TODO: support this case by converting the inline extent to
> +		 * a regular extent first, then shifting it. This would allow
> +		 * INSERT_RANGE on small files, which xfs supports.
> +		 */
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +			ret = -EOPNOTSUPP;
> +			btrfs_release_path(path);
> +			goto out_trans;
> +		}
> +
> +		memcpy(&new_key, &key, sizeof(new_key));
> +		new_key.offset += len;
> +		btrfs_set_item_key_safe(trans, path, &new_key);
> +
> +		/* Update back-reference: drop old offset, add new offset */
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
> +			ret = btrfs_shift_extent_backref(trans, root, ino,
> +					disk_bytenr, num_bytes,
> +					key.offset - extent_offset,
> +					new_key.offset - extent_offset);
> +			if (unlikely(ret))
> +				goto out_trans;
> +		}
> +
> +		/*
> +		 * Step back to the previous item for the next iteration.
> +		 * If we've reached slot 0 we need to move to the previous leaf.
> +		 */
> +		nr_shifted++;
> +		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
> +			btrfs_info(fs_info,
> +			   "btrfs_insert_range: cycling transaction, nr_shifted=%d", nr_shifted);
> +
> +			inode_inc_iversion(inode);
> +			inode_set_mtime_to_ts(inode,
> +					      inode_set_ctime_current(inode));
> +			ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +			if (ret) {
> +				btrfs_release_path(path);
> +				goto out_trans;
> +			}
> +			btrfs_end_transaction(trans);
> +			btrfs_btree_balance_dirty(fs_info);
> +			trans = btrfs_start_transaction(root, 1);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				trans = NULL;
> +				btrfs_release_path(path);
> +				goto out_path;
> +			}
> +		}
> +
> +		/*
> +		 * Set key.offset to one below the current item so the next
> +		 * btrfs_search_slot lands on the item before it.
> +		 */
> +		if (key.offset == 0) {
> +			ret = 0;
> +			break;
> +		}
> +		key.offset--;
> +		btrfs_release_path(path);
> +	}
> +
> +	if (ret)
> +		goto out_trans;
> +
> +	/*
> +	 * Drop stale extent map entries so subsequent reads re-load correct
> +	 * mappings from the btree.
> +	 */
> +	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_insert_range: updating i_size %lld -> %lld",
> +		   inode->i_size, inode->i_size + len);
> +	inode_inc_iversion(inode);
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	i_size_write(inode, inode->i_size + len);
> +	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
> +	ret = btrfs_update_inode(trans, BTRFS_I(inode));

After shifting all file extents to their new file offsets, there is a 
hole in the range [@offset, @offset + @len), but no explicit hole extent 
is inserted for that range on a ^no-holes filesystem.

This will cause fsck errors.
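
To illustrate the kind of gap meant here, a minimal userspace sketch 
(not kernel code, all names are hypothetical) that models file extent 
items as (offset, len) pairs and checks whether they cover the file 
without a gap, which is roughly what is expected when the no-holes 
feature is disabled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one file extent item: [offset, offset + len). */
struct model_extent {
	uint64_t offset;
	uint64_t len;
};

/*
 * Return true if the extents, sorted by offset, cover [0, isize) with
 * neither a gap nor an overlap. On failure, *gap_start is set to the
 * first file offset where the expected coverage breaks.
 */
static bool model_extents_contiguous(const struct model_extent *ext,
				     size_t nr, uint64_t isize,
				     uint64_t *gap_start)
{
	uint64_t expected = 0;

	assert(gap_start != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (ext[i].offset != expected) {
			*gap_start = expected;
			return false;
		}
		expected = ext[i].offset + ext[i].len;
	}
	if (expected < isize) {
		*gap_start = expected;
		return false;
	}
	return true;
}
```

With two 4K extents at 0 and 4K, an 8K insert at offset 4K moves the 
second item to 12K and leaves [4K, 12K) uncovered until an explicit 
hole item of 8K is added at 4K.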

Thanks,
Qu

> +
> +out_trans:
> +	if (trans) {
> +		if (ret)
> +			btrfs_end_transaction(trans);
> +		else
> +			ret = btrfs_end_transaction(trans);
> +	}
> +out_path:
> +	btrfs_free_path(path);
> +	btrfs_info(fs_info, "btrfs_insert_range: returning %d", ret);
> +	return ret;
> +}
> +
>   static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   {
>   	struct inode *inode = file_inode(file);
> @@ -3596,6 +3831,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	case FALLOC_FL_COLLAPSE_RANGE:
>   		ret = btrfs_collapse_range(inode, offset, len);
>   		break;
> +	case FALLOC_FL_INSERT_RANGE:
> +		ret = btrfs_insert_range(inode, offset, len);
> +		break;
>   	default:
>   		ret = -EOPNOTSUPP;
>   	}


^ permalink raw reply

* Re: [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
From: Qu Wenruo @ 2026-04-19  1:29 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba
In-Reply-To: <20260418143808.199603-3-paul.richards@gmail.com>



On 2026/4/19 00:08, Paul Richards wrote:

Commit message please, especially when you're implementing a new feature.

> Assisted-by: Amazon Q Developer:auto/unknown
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
>   fs/btrfs/file.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 300 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0b5cc3cec675..99d24bef5f88 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -39,6 +39,13 @@
>   #include "super.h"
>   #include "print-tree.h"
>   
> +/*
> + * When we shift extents as part of fallocate insert or collapse we commit
> + * and cycle the transaction every BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL
> + * extents to avoid accumulating too many changes in one transaction.
> + */
> +#define BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32)
> +

As mentioned in the cover letter, you can go with 
btrfs_should_end_transaction() to do the check, which I believe is 
better than a fixed threshold.

>   /*
>    * Unlock folio after btrfs_file_write() is done with it.
>    */
> @@ -2648,6 +2655,296 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
>   	return ret;
>   }
>   
> +/*
> + * Update the extent back-reference in the extent tree when a
> + * BTRFS_EXTENT_DATA_KEY item is shifted to a new logical file offset.
> + * Drops the back-reference at old_file_offset and adds one at new_file_offset.
> + * Holes (disk_bytenr == 0) and inline extents have no back-references and
> + * must not be passed to this function.
> + */
> +static int btrfs_shift_extent_backref(struct btrfs_trans_handle *trans,
> +				      struct btrfs_root *root, u64 ino,
> +				      u64 disk_bytenr, u64 num_bytes,
> +				      u64 old_file_offset, u64 new_file_offset)
> +{
> +	struct btrfs_ref ref = {
> +		.bytenr = disk_bytenr,
> +		.num_bytes = num_bytes,
> +		.parent = 0,
> +		.owning_root = btrfs_root_id(root),
> +		.ref_root = btrfs_root_id(root),
> +	};
> +	int ret;
> +
> +	ref.action = BTRFS_DROP_DELAYED_REF;
> +	btrfs_init_data_ref(&ref, ino, old_file_offset, 0, false);
> +	ret = btrfs_free_extent(trans, &ref);
> +	if (unlikely(ret)) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
> +
> +	ref.action = BTRFS_ADD_DELAYED_REF;
> +	btrfs_init_data_ref(&ref, ino, new_file_offset, 0, false);
> +	ret = btrfs_inc_extent_ref(trans, &ref);
> +	if (unlikely(ret))
> +		btrfs_abort_transaction(trans, ret);
> +
> +	return ret;
> +}
> +
> +static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	u64 end = offset + len;
> +	struct btrfs_path *path;

I'd recommend going with auto-free for @path.

> +	struct btrfs_trans_handle *trans = NULL;
> +	struct extent_state *cached_state = NULL;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	struct btrfs_key new_key;
> +	u64 ino = btrfs_ino(BTRFS_I(inode));
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> +		return -EOPNOTSUPP;
> +
> +	/* offset and len must be sector-aligned */
> +	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))

I'd prefer more human-readable, separate checks on @offset and @len 
instead.

> +		return -EINVAL;
> +
> +	/* collapse range must not reach or pass EOF - use ftruncate instead */
> +	if (end >= inode->i_size)

And please also do an overflow check before this one.
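
Something along these lines, as a rough userspace sketch only (the 
function name and the error codes are assumptions, and 
__builtin_add_overflow() is a GCC/Clang builtin standing in for the 
kernel's overflow helpers):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the up-front argument checks for a collapse.
 * Offsets are signed 64-bit values, like loff_t.
 */
static int model_check_collapse_args(int64_t offset, int64_t len,
				     int64_t isize, uint32_t sectorsize)
{
	int64_t end;

	assert(sectorsize != 0);

	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* Separate alignment checks, one per argument. */
	if (offset % sectorsize != 0)
		return -EINVAL;
	if (len % sectorsize != 0)
		return -EINVAL;

	/* Overflow check before @end is used for anything. */
	if (__builtin_add_overflow(offset, len, &end))
		return -EINVAL;

	/* The collapsed range must end strictly before EOF. */
	if (end >= isize)
		return -EINVAL;

	return 0;
}
```

Each condition gets its own line, so it is obvious which argument was 
rejected, and @end is never computed from an overflowing sum.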

> +		return -EINVAL;
> +
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: ino=%llu offset=%lld len=%lld i_size=%lld",
> +		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);

Either change it to a trace event, or just drop it.

For internal debugging, introduce trace_printk()/printk() in a dedicated 
debug patch instead and never send that debug patch.

> +
> +	/* wait for any ordered extents in [offset, i_size) to complete */
> +	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +				       inode->i_size - offset);

I'd prefer to use (u64)-1 as the length to be extra safe.

It's possible that we have some OE beyond the rounded up isize.

> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Flush dirty pages and invalidate the page cache for [offset, i_size)
> +	 * before any extent manipulation, following the ext4/xfs pattern.
> +	 * We hold BTRFS_ILOCK_MMAP so no new dirty pages can appear during
> +	 * the operation. The page cache must be empty before we shift extent
> +	 * keys so that stale pages at the old offsets cannot be read back
> +	 * after the collapse.
> +	 */
> +	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
> +	if (ret)
> +		return ret;
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: nrpages before upfront invalidate=%lu (pages before offset=%llu not invalidated)",
> +		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
> +	truncate_pagecache_range(inode, offset, LLONG_MAX);
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: nrpages after upfront invalidate=%lu (expected %llu)",
> +		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Lock the range [offset, end) and invalidate the page cache within
> +	 * it. btrfs_punch_hole_lock_range() calls truncate_pagecache_range()
> +	 * internally in a retry loop.
> +	 */
> +	btrfs_punch_hole_lock_range(inode, offset, end - 1, &cached_state);

Please lock the range to (u64)-1.

You're modifying all the underlying extent maps beyond @offset.

> +
> +	/*
> +	 * Remove all extents in [offset, end). Passing NULL for extent_info
> +	 * means we are punching a hole. btrfs_replace_file_extents() splits
> +	 * any extent straddling the boundaries, drops extent refs, and
> +	 * returns a transaction handle for us to reuse.
> +	 */
> +	ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
> +					 offset, end - 1, NULL, &trans);

Not sure if this will work on filesystems without no-holes (i.e. the 
-O ^no-holes option for mkfs).

This will insert hole file extents to fill the range [@offset, @offset + 
@len); with ^no-holes, there will be file extent items with disk_bytenr 
== 0.

> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: btrfs_replace_file_extents ret=%d nrpages=%lu",
> +		   ret, inode->i_mapping->nrpages);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/*
> +	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= end
> +	 * leftward by len bytes.
> +	 *
> +	 * We iterate forward (lowest offset first) which is safe for a
> +	 * left-shift because the new key is always less than the old one,
> +	 * so we never collide with a key we haven't visited yet.
> +	 */
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = end;
> +
> +	int nr_shifted = 0;
> +	while (1) {
> +		struct btrfs_file_extent_item *fi;
> +		u64 disk_bytenr;
> +		u64 num_bytes;
> +		u64 extent_offset;
> +		int extent_type;
> +
> +		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
> +		if (ret < 0)
> +			goto out_trans;
> +
> +		/* If no exact match, slot points at the next item - that's fine */
> +
> +		leaf = path->nodes[0];
> +		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
> +			ret = btrfs_next_leaf(root, path);

This is not safe.
The next leaf may not be COWed.

If you check all btrfs_next_leaf() usages after a btrfs_search_slot() 
call, btrfs_search_slot() is never called with @cow == 1.

> +			if (ret < 0)
> +				goto out_trans;
> +			if (ret > 0) {
> +				/* No more items - we are done shifting */
> +				ret = 0;
> +				break;
> +			}
> +			leaf = path->nodes[0];
> +		}
> +
> +		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> +		/* Stop if we've moved past this inode's extent data items */
> +		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: shifting key offset %llu -> %llu",
> +			   key.offset, key.offset - len);
> +
> +		fi = btrfs_item_ptr(leaf, path->slots[0],
> +				    struct btrfs_file_extent_item);
> +		extent_type = btrfs_file_extent_type(leaf, fi);
> +		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +		extent_offset = btrfs_file_extent_offset(leaf, fi);
> +
> +		memcpy(&new_key, &key, sizeof(new_key));
> +		new_key.offset -= len;
> +		btrfs_set_item_key_safe(trans, path, &new_key);

For ^no-holes, this will cause problems, as there can be a hole extent 
with exactly the same offset.

> +
> +		/*
> +		 * Update the back-reference in the extent tree to reflect the
> +		 * new logical file offset. Holes (disk_bytenr == 0) and inline
> +		 * extents have no back-references to update.
> +		 */
> +		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
> +			ret = btrfs_shift_extent_backref(trans, root, ino,
> +					disk_bytenr, num_bytes,
> +					key.offset - extent_offset,
> +					new_key.offset - extent_offset);
> +			if (unlikely(ret))
> +				goto out_trans;
> +		}
> +
> +		/* Advance to the next item on the next iteration */
> +		key.offset = new_key.offset + 1;

This is very inefficient.

You're calling btrfs_search_slot() for every file extent item; please 
just don't.

Just handle all involved file extent items inside the leaf in one go.

> +		nr_shifted++;
> +
> +		/*
> +		 * Cycle the transaction every N items to avoid holding a
> +		 * single transaction open across a large number of extent
> +		 * items. btrfs_set_item_key_safe() modifies leaves in-place
> +		 * so we won't hit -ENOSPC; we use a simple counter instead.
> +		 * Update the inode at each cycle point so it is consistent
> +		 * on disk if a crash occurs mid-loop.
> +		 */
> +		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
> +			btrfs_info(fs_info,
> +			   "btrfs_collapse_range: cycling transaction, nr_shifted=%d", nr_shifted);
> +			inode_inc_iversion(inode);
> +			inode_set_mtime_to_ts(inode,
> +					      inode_set_ctime_current(inode));
> +			ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +			if (ret) {
> +				btrfs_release_path(path);
> +				goto out_trans;
> +			}
> +			btrfs_end_transaction(trans);
> +			btrfs_btree_balance_dirty(fs_info);
> +			trans = btrfs_start_transaction(root, 1);
> +			if (IS_ERR(trans)) {
> +				ret = PTR_ERR(trans);
> +				trans = NULL;
> +				btrfs_release_path(path);
> +				goto out_unlock;
> +			}
> +		}
> +
> +		btrfs_release_path(path);
> +	}

The whole loop can be extracted into a helper.

The only difference between insert and collapse is the direction of the 
file offset shift (plus for insert, minus for collapse).

Having two similar copies for insert and collapse is never going to 
improve maintainability.
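
As a rough idea of what one helper for both directions could look like, 
here is a userspace model (not kernel code, names are hypothetical) 
where a sorted array of file offsets stands in for the EXTENT_DATA keys 
of one inode. The sign of @delta picks the iteration order, so the keys 
stay strictly ascending after every single step:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Shift every key >= @start by @delta (positive for insert, negative
 * for collapse). Returns false on a collision or on offset wrap-around;
 * keys processed before the failure stay shifted.
 */
static bool model_shift_keys(uint64_t *keys, size_t nr, uint64_t start,
			     int64_t delta)
{
	/* Magnitude of the shift, computed without signed overflow. */
	uint64_t mag = delta < 0 ? 0 - (uint64_t)delta : (uint64_t)delta;
	size_t first = 0;

	assert(keys != NULL || nr == 0);

	/* Keys below @start stay in place. */
	while (first < nr && keys[first] < start)
		first++;

	if (delta < 0) {
		/* Left shift: lowest key first. */
		for (size_t i = first; i < nr; i++) {
			if (keys[i] < mag)
				return false;
			if (i > 0 && keys[i] - mag <= keys[i - 1])
				return false;
			keys[i] -= mag;
		}
	} else {
		/* Right shift: highest key first. */
		for (size_t i = nr; i > first; i--) {
			if (keys[i - 1] > UINT64_MAX - mag)
				return false;
			if (i < nr && keys[i - 1] + mag >= keys[i])
				return false;
			keys[i - 1] += mag;
		}
	}
	return true;
}
```

The collision branch is also where an item already sitting at the 
target offset would be detected, for example a leftover hole item.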

> +
> +	if (ret)
> +		goto out_trans;
> +
> +	/* Update i_size and the on-disk inode */
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: updating i_size %lld -> %lld nrpages=%lu",
> +		   inode->i_size, inode->i_size - len, inode->i_mapping->nrpages);
> +	/*
> +	 * Drop the extent map cache for [offset, i_size) so that subsequent
> +	 * reads re-load the correct mappings from the btree. The key-shift
> +	 * loop updated the btree but the in-memory extent map cache still
> +	 * has entries at the old logical offsets.
> +	 */
> +	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
> +	inode_inc_iversion(inode);
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	i_size_write(inode, inode->i_size - len);
> +	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
> +	ret = btrfs_update_inode(trans, BTRFS_I(inode));
> +
> +out_trans:
> +	if (trans) {
> +		if (ret)
> +			btrfs_end_transaction(trans);

Not sure if it's safe to end the transaction here. If we hit a critical 
error, some file extents may have been moved while others have not.

But at the same time, we may have already committed a transaction 
halfway, thus the inconsistency may already be there.

Are there any docs about the expected behavior if INSERT/COLLAPSE fails 
halfway?

Should the fs fully revert to the original state (not sure if that's 
even possible) or something else?

> +		else
> +			ret = btrfs_end_transaction(trans);
> +	}
> +out_unlock:
> +	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, offset, end - 1,
> +			    &cached_state);
> +	btrfs_info(fs_info,
> +		   "btrfs_collapse_range: post-unlock ret=%d i_size=%lld, invalidating page cache from %lld",
> +		   ret, inode->i_size, offset);
> +	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && !ret) {
> +		/*
> +		 * These are expected to be no-ops: ordered extents were drained
> +		 * at the start of this function and BTRFS_ILOCK_MMAP has been
> +		 * held throughout, so no new writes could have been submitted.
> +		 * The page cache was emptied from offset onwards upfront and
> +		 * nrpages in that region stayed 0 throughout the operation.
> +		 * Pages before offset are unaffected and may still be cached.
> +		 */
> +		int wait_ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
> +						    inode->i_size);

Hiding it behind DEBUG doesn't mean you can add whatever operations you 
like.

If you want to add this, please explain why.
If you cannot explain it, just don't add it in the formal patch.

Thanks,
Qu

> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: post-shift wait_ordered ret=%d",
> +			   wait_ret);
> +		ASSERT(wait_ret == 0);
> +		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
> +					      LLONG_MAX) == false);
> +		truncate_pagecache_range(inode, offset, LLONG_MAX);
> +		btrfs_info(fs_info,
> +			   "btrfs_collapse_range: truncate_pagecache_range done");
> +		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
> +					      LLONG_MAX) == false);
> +	}
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
>   static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   {
>   	struct inode *inode = file_inode(file);
> @@ -3296,6 +3593,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	case FALLOC_FL_PUNCH_HOLE:
>   		ret = btrfs_punch_hole(file, offset, len);
>   		break;
> +	case FALLOC_FL_COLLAPSE_RANGE:
> +		ret = btrfs_collapse_range(inode, offset, len);
> +		break;
>   	default:
>   		ret = -EOPNOTSUPP;
>   	}


^ permalink raw reply

* Re: [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes
From: Qu Wenruo @ 2026-04-19  0:57 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba
In-Reply-To: <20260418143808.199603-2-paul.richards@gmail.com>



On 2026/4/19 00:08, Paul Richards wrote:
> Refactor btrfs_fallocate() to switch and dispatch based on
> the mode argument, splitting most of the modes into separate
> functions. Only the "allocate range" and "zero range" functions
> remain coupled.
> 
> Signed-off-by: Paul Richards <paul.richards@gmail.com>
> ---
[...]
> @@ -3115,27 +3113,12 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
>   	int ret;
>   
> -	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
> -		return -EIO;
> -
> -	/* Do not allow fallocate in ZONED mode */
> -	if (btrfs_is_zoned(inode_to_fs_info(inode)))
> -		return -EOPNOTSUPP;
> +	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE || (mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);

This line is too long; checkpatch can catch this.

>   
>   	alloc_start = round_down(offset, blocksize);
>   	alloc_end = round_up(offset + len, blocksize);
>   	cur_offset = alloc_start;
>   
> -	/* Make sure we aren't being give some crap mode */
> -	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> -		     FALLOC_FL_ZERO_RANGE))
> -		return -EOPNOTSUPP;
> -
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> -		return btrfs_punch_hole(file, offset, len);
> -
> -	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> -
>   	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
>   		ret = inode_newsize_ok(inode, offset + len);
>   		if (ret)
> @@ -3185,7 +3168,6 @@ static long btrfs_fallocate(struct file *file, int mode,
>   
>   	if (mode & FALLOC_FL_ZERO_RANGE) {
>   		ret = btrfs_zero_range(inode, offset, len, mode);
> -		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>   		return ret;
>   	}
>   
> @@ -3282,11 +3264,46 @@ static long btrfs_fallocate(struct file *file, int mode,
>   	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
>   			    &cached_state);
>   out:
> -	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>   	extent_changeset_free(data_reserved);
>   	return ret;
>   }
>   
> +static long btrfs_fallocate(struct file *file, int mode,
> +			    loff_t offset, loff_t len)
> +{
> +	struct inode *inode = file_inode(file);
> +	int ret;
> +
> +	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
> +		return -EIO;
> +
> +	/* Do not allow fallocate in ZONED mode */
> +	if (btrfs_is_zoned(inode_to_fs_info(inode)))
> +		return -EOPNOTSUPP;
> +
> +	/* Check for options we do not support. */
> +	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
> +		return -EOPNOTSUPP;
> +
> +	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> +
> +	/* Only one mode bit may be set. */
> +	switch (mode & FALLOC_FL_MODE_MASK) {
> +	case FALLOC_FL_ALLOCATE_RANGE:
> +	case FALLOC_FL_ZERO_RANGE:
> +		ret = btrfs_allocate_or_zero_range(file, mode, offset, len);

I'd prefer to have separate functions for zero and allocate range.

There is some shared code between them, but not too much, just:

- isize check
- file modified check
- btrfs_cont_expand()/btrfs_truncate_block()
- btrfs_wait_ordered_range().

This can be extracted into a helper function, shared by both 
btrfs_allocate_range() and btrfs_zero_range().

This should make it a little easier to read.

Thanks,
Qu


> +		break;
> +	case FALLOC_FL_PUNCH_HOLE:
> +		ret = btrfs_punch_hole(file, offset, len);
> +		break;
> +	default:
> +		ret = -EOPNOTSUPP;
> +	}
> +
> +	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> +	return ret;
> +}
> +
>   /*
>    * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
>    * that has unflushed and/or flushing delalloc. There might be other adjacent


^ permalink raw reply

* Re: [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
From: Qu Wenruo @ 2026-04-19  0:25 UTC (permalink / raw)
  To: Paul Richards, linux-btrfs; +Cc: dsterba
In-Reply-To: <20260418143808.199603-1-paul.richards@gmail.com>



On 2026/4/19 00:08, Paul Richards wrote:
> This series adds support for FALLOC_FL_COLLAPSE_RANGE and
> FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
> already supported by ext4 and xfs. The userspace contract is
> documented in fallocate(2).
> 
> Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
> moving punch_hole into its own function and decoupling locking from the
> per-operation helpers. This is similar to the implementations for ext4
> and xfs. The allocate-range and zero-range paths remain coupled since
> they share some setup logic.
> 
> Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.
> 
> == Implementation approach ==
> 
> For COLLAPSE_RANGE:
>   - The removed region [offset, offset+len) is punched out via
>     btrfs_replace_file_extents(), which handles boundary splitting.
>   - All EXTENT_DATA keys with key.offset >= offset+len are shifted
>     leftward by len in forward order.
> 
> For INSERT_RANGE:
>   - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
>     by len in reverse order (required to avoid key collisions).
>   - No pre-splitting of straddling extents is needed: the left portion
>     of a straddling extent stays in place, the right portion is shifted;
>     both reference the same physical extent via their existing
>     extent_offset fields.
> 
> For each shifted key, the corresponding back-reference in the extent
> tree is updated via a shared helper btrfs_shift_extent_backref().
> 
> After the key-shift loop, btrfs_drop_extent_map_range() is called to
> invalidate the in-memory extent map cache. This is important for reads
> after the fallocate operation to ensure they obtain data for the new
> offsets.
> 
> The page cache is flushed and invalidated upfront (before any extent
> manipulation) following the ext4/xfs pattern. The inode lock
> (BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
> appearing during the operation.

So far the explanation looks fine, but I'll need to dig deeper for each 
patch to be sure.

> 
> Transaction cycling: the key-shift loop cycles transactions every
> BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
> holding a single transaction open across a large number of extents.
> 
> Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.
> 
> == Known limitations ==
> 
> INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
> files will require promoting the existing inline extent to a regular
> one, since inline extents are supported only at the very start of a
> file.

This can be addressed, e.g. by reading out the inlined extent into the 
page cache and redirty the new page cache.

But I'd say it's not a critical part thus I'm totally fine without this 
support for a while.

> 
> In the opposite direction, COLLAPSE_RANGE will not create inline files
> like it should if the remaining data is small enough.

This is even less important, there are a lot of cases that we don't 
inline a data extent, so keeping the extent uninlined is completely fine.

> 
> I intend to address both of these limitations.
> 
> == Testing ==
> 
> Tested with a Rust-based functional test suite covering:
>   - Collapse and insert at the start, middle of a file
>   - Multiple sequential operations on the same file
>   - Files with multiple extents (fsync between writes to force separate
>     extent items)
>   - Files with holes (explicit punch_hole and implicit sparse writes)
>   - Compressed extents (mount -o compress=zstd)
>   - Transaction cycling (interval reduced to 4 during testing, verified
>     in dmesg logs)
>   - Inline files, verified that -EOPNOTSUPP is returned.

I guess that tool never verifies the contents, nor runs multi-threaded 
stress tests, e.g. fsstress?
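
For content verification, one option is to keep an in-memory reference 
copy of the file and apply the same operation to it. A minimal sketch 
under that assumption (hypothetical names, plain userspace C, and more 
lenient about range limits than fallocate(2) itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Remove [off, off + len) from a buffer holding @size valid bytes.
 * Returns the new size, or @size unchanged if the range is invalid.
 */
static size_t model_collapse(uint8_t *buf, size_t size, size_t off,
			     size_t len)
{
	assert(buf != NULL);

	if (off > size || len > size - off)
		return size;
	memmove(buf + off, buf + off + len, size - off - len);
	return size - len;
}

/*
 * Insert @len zero bytes at @off. @cap is the buffer capacity.
 * Returns the new size, or @size unchanged if it does not fit.
 */
static size_t model_insert(uint8_t *buf, size_t size, size_t cap,
			   size_t off, size_t len)
{
	assert(buf != NULL);

	if (cap < size || off > size || len > cap - size)
		return size;
	memmove(buf + off + len, buf + off, size - off);
	memset(buf + off, 0, len);
	return size + len;
}
```

After each fallocate() call the test would read the whole file back and 
memcmp() it against the model buffer, ideally after dropping the page 
cache as well, so that the on-disk extents are what gets checked.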

> 
> The same tests pass on both btrfs and xfs (modulo the inline files).
> 
> I have not run fstests which I know contains tests for INSERT_RANGE
> and COLLAPSE_RANGE. I will do so.

Thus I'd prefer an fstests run before any local tool.

> 
> == Questions for reviewers ==
> 
> 1. Transaction cycling interval: we use 32 items per cycle. Is this
>     the right threshold, or is there an established convention in btrfs
>     for this kind of loop?


We have btrfs_should_end_transaction() to do it, thus you do not need to 
use a locally defined threshold.

> 
> 2. Extent lock scope for collapse: we hold the extent lock only on
>     [offset, offset+len) during the hole punch, not on the full
>     [offset, i_size) range that the key-shift loop operates on. Is
>     this safe, or should we lock the full affected range?

I think you should hold the lock on the range [offset, U64_MAX).

Especially since you have already dropped all cache for the range 
[offset, U64_MAX) before taking the lock.

The same applies to insert, but it looks like you have not locked any 
extent range for insert?

> 
> 3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
>     operations, or should they be unconditionally available?

It's strongly recommended in this case.

> 
> == Notes ==
> 
> This is my first kernel contribution. Development was significantly
> assisted by an LLM (Amazon Q Developer). The implementation, testing,
> and final review decisions are my own.

I'm very interested in how the LLM is involved.

You mentioned that "implementation, testing, review" are your own, which 
sounds like everything is your own.

I don't think you're only using the LLM to help understand the code, so 
it looks like the implementation is contributed by the LLM.

Please remember, you're the one explaining/defending the code.
But as long as you can explain/defend the code, I'm fine with that.

Still, a newcomer with a sizable code change will always attract more 
scrutiny, so please don't expect rapid review/merge/etc.

> 
> Various btrfs_info() print statements, assertions, and comments that
> were useful during development and testing have been left in place,
> but will be removed or streamlined in the next revision.

Please don't. It's pretty easy to leave such scaffolding in the final 
code, and it will be time consuming to make sure it is properly removed 
in the final version.

Furthermore, dmesg-based printk() is slow, thus it can mask a lot of 
race-related bugs just by slowing down the operations.

For critical operations, introduce trace events for them, there are 
examples across several inode operations.

If you really want to debug during development, please introduce 
temporary trace_printk()s in a dedicated patch, and just don't send that 
debug patch.

Thanks,
Qu

> 
> Paul Richards (3):
>    btrfs: refactor btrfs_fallocate() ahead of supporting more modes
>    btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
>    btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
> 
>   fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 578 insertions(+), 23 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH v2 00/19] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Steven Rostedt @ 2026-04-18 23:04 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Peter Zijlstra, Dmitry Ilvokhin, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Jiri Pirko,
	Oded Gabbay, Koby Elbaz, dri-devel, Rafael J. Wysocki,
	Viresh Kumar, Gautham R. Shenoy, Huang Rui, Mario Limonciello,
	Len Brown, Srinivas Pandruvada, linux-pm, MyungJoo Ham,
	Kyungmin Park, Chanwoo Choi, Christian König, Sumit Semwal,
	linaro-mm-sig, Eddie James, Andrew Jeffery, Joel Stanley,
	linux-fsi, David Airlie, Simona Vetter, Alex Deucher,
	Danilo Krummrich, Matthew Brost, Philipp Stanner, Harry Wentland,
	Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires, linux-input,
	Wolfram Sang, linux-i2c, Mark Brown, Michael Hennerich,
	Nuno Sá, linux-spi, James E.J. Bottomley, Martin K. Petersen,
	linux-scsi, Chris Mason, David Sterba, linux-btrfs,
	Thomas Gleixner, Andrew Morton, SeongJae Park, linux-mm,
	Borislav Petkov, Dave Hansen, x86, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260323160052.17528-1-vineeth@bitbyteword.org>

On Mon, 23 Mar 2026 12:00:19 -0400
"Vineeth Pillai (Google)" <vineeth@bitbyteword.org> wrote:

>   if (trace_foo_enabled() && cond)
>       trace_call__foo(args);   /* calls __do_trace_foo() directly */

Hi Vineeth,

Could you rebase this series on top of 7.1-rc1 when it comes out?
Several of these patches were accepted already. Obviously drop those.
They were the patches that added the feature, and any where the
maintainer acked the patch.

Now that the feature has been accepted, if you post the patch series
again after 7.1-rc1 with all the patches that haven't been accepted
yet, then the maintainers can simply take them directly. As the feature
is now accepted, there's no dependency on it, and they don't need to go
through the tracing tree.

Thanks,

-- Steve

* [syzbot] [btrfs?] kernel BUG in replace_file_extents
From: syzbot @ 2026-04-18 16:18 UTC (permalink / raw)
  To: clm, dsterba, linux-btrfs, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    43cfbdda5af6 Merge tag 'for-linus-iommufd' of git://git.ke..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14fb82d2580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=eae4973b8684a00c
dashboard link: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-43cfbdda.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/473adbf1a380/vmlinux-43cfbdda.xz
kernel image: https://storage.googleapis.com/syzbot-assets/63ebea9fbab4/bzImage-43cfbdda.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com

BTRFS info (device loop0): balance: start -d -m
BTRFS info (device loop0): relocating block group 6881280 flags data|metadata
BTRFS info (device loop0): relocating block group 5242880 flags data|metadata
BTRFS info (device loop0): found 22 extents, stage: move data extents
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:841!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5328 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:get_new_location fs/btrfs/relocation.c:838 [inline]
RIP: 0010:replace_file_extents+0x1574/0x1590 fs/btrfs/relocation.c:941
Code: 81 2c fe 49 8b 3e be 03 00 00 00 48 c7 c2 40 da 17 8c 44 89 e9 e8 cc dd 25 fd e8 f7 36 ff ff e9 a7 fc ff ff e8 ad 5f c0 fd 90 <0f> 0b e8 a5 5f c0 fd 90 0f 0b e8 9d 5f c0 fd 90 0f 0b e8 95 5f c0
RSP: 0000:ffffc900071f6e80 EFLAGS: 00010287
RAX: ffffffff8404c563 RBX: ffff88800e57f9a0 RCX: 0000000000100000
RDX: ffffc90021003000 RSI: 00000000000c282c RDI: 00000000000c282d
RBP: ffffc900071f7090 R08: ffffea0000ef04c7 R09: 1ffffd40001de098
R10: dffffc0000000000 R11: fffff940001de099 R12: 0000000000000483
R13: ffff88805b2e8000 R14: 0000000000000e70 R15: 80c06030180c2811
FS:  00007fa6cd9886c0(0000) GS:ffff88808c820000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2a6cc8ecf0 CR3: 000000001cdeb000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 btrfs_force_cow_block+0xa4d/0x2450 fs/btrfs/ctree.c:532
 btrfs_cow_block+0x3c4/0xa90 fs/btrfs/ctree.c:699
 do_relocation+0xd2e/0x1920 fs/btrfs/relocation.c:2288
 relocate_tree_block fs/btrfs/relocation.c:2538 [inline]
 relocate_tree_blocks+0x11ee/0x2020 fs/btrfs/relocation.c:2645
 relocate_block_group+0x76f/0xe70 fs/btrfs/relocation.c:3570
 do_nonremap_reloc+0xa8/0x5b0 fs/btrfs/relocation.c:5250
 btrfs_relocate_block_group+0x7e6/0xc40 fs/btrfs/relocation.c:5416
 btrfs_relocate_chunk+0x115/0x820 fs/btrfs/volumes.c:3598
 __btrfs_balance+0x1db0/0x2ae0 fs/btrfs/volumes.c:4509
 btrfs_balance+0xaf3/0x11b0 fs/btrfs/volumes.c:4896
 btrfs_ioctl_balance+0x3d3/0x610 fs/btrfs/ioctl.c:3453
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa6ccb9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa6cd987fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fa6cce16090 RCX: 00007fa6ccb9c819
RDX: 0000200000000180 RSI: 00000000c4009420 RDI: 0000000000000006
RBP: 00007fa6ccc32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa6cce16128 R14: 00007fa6cce16090 R15: 00007ffdf45f3438
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:get_new_location fs/btrfs/relocation.c:838 [inline]
RIP: 0010:replace_file_extents+0x1574/0x1590 fs/btrfs/relocation.c:941
Code: 81 2c fe 49 8b 3e be 03 00 00 00 48 c7 c2 40 da 17 8c 44 89 e9 e8 cc dd 25 fd e8 f7 36 ff ff e9 a7 fc ff ff e8 ad 5f c0 fd 90 <0f> 0b e8 a5 5f c0 fd 90 0f 0b e8 9d 5f c0 fd 90 0f 0b e8 95 5f c0
RSP: 0000:ffffc900071f6e80 EFLAGS: 00010287
RAX: ffffffff8404c563 RBX: ffff88800e57f9a0 RCX: 0000000000100000
RDX: ffffc90021003000 RSI: 00000000000c282c RDI: 00000000000c282d
RBP: ffffc900071f7090 R08: ffffea0000ef04c7 R09: 1ffffd40001de098
R10: dffffc0000000000 R11: fffff940001de099 R12: 0000000000000483
R13: ffff88805b2e8000 R14: 0000000000000e70 R15: 80c06030180c2811
FS:  00007fa6cd9886c0(0000) GS:ffff88808c820000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055557e0c9528 CR3: 000000001cdeb000 CR4: 0000000000352ef0


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* [PATCH 3/3] btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards
In-Reply-To: <20260418143808.199603-1-paul.richards@gmail.com>

Assisted-by: Amazon Q Developer:auto/unknown
Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 99d24bef5f88..b708bb6a1082 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2945,6 +2945,241 @@ static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	return ret;
 }
 
+static int btrfs_insert_range(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_path *path;
+	struct btrfs_trans_handle *trans = NULL;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	struct btrfs_key new_key;
+	u64 ino = btrfs_ino(BTRFS_I(inode));
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+		return -EOPNOTSUPP;
+
+	/* offset and len must be sector-aligned */
+	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* offset must be within the file - use ftruncate to extend */
+	if (offset >= inode->i_size)
+		return -EINVAL;
+
+	/* result must not exceed the maximum file size */
+	if (len > inode->i_sb->s_maxbytes - inode->i_size)
+		return -EFBIG;
+
+	btrfs_info(fs_info,
+		   "btrfs_insert_range: ino=%llu offset=%lld len=%lld i_size=%lld",
+		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
+
+	/* wait for any ordered extents in [offset, i_size) to complete */
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+				       inode->i_size - offset);
+	if (ret)
+		return ret;
+
+	/*
+	 * Flush and invalidate the page cache for [offset, i_size) upfront,
+	 * following the same pattern as btrfs_collapse_range().
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
+	truncate_pagecache_range(inode, offset, LLONG_MAX);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out_path;
+	}
+
+	/*
+	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= offset
+	 * rightward by len bytes.
+	 *
+	 * We must iterate in reverse order (highest offset first) to avoid
+	 * colliding with a key we haven't shifted yet - shifting forward
+	 * would overwrite the next item's key before we process it.
+	 *
+	 * No pre-splitting of straddling extents is needed. If an extent
+	 * straddles offset, the left portion (key.offset < offset) stays
+	 * in place and the right portion is shifted. Both reference the
+	 * same physical extent via their existing extent_offset fields,
+	 * which remain correct after the key shift.
+	 */
+
+	int nr_shifted = 0;
+
+	/* Find the last extent item for this inode */
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = (u64)-1;
+
+	while (1) {
+		struct btrfs_file_extent_item *fi;
+		u64 disk_bytenr;
+		u64 num_bytes;
+		u64 extent_offset;
+		int extent_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		if (ret < 0)
+			goto out_trans;
+
+		/*
+		 * Search for (ino, EXTENT_DATA, -1) will never find an exact
+		 * match, so ret == 1 and slot points one past the last item.
+		 * Step back one slot to land on the last extent item.
+		 */
+		if (path->slots[0] == 0) {
+			/* No items at all - nothing to shift */
+			ret = 0;
+			break;
+		}
+		path->slots[0]--;
+
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		/* If we've gone past this inode's items, we are done */
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+			ret = 0;
+			break;
+		}
+
+		/* If this item is before the insertion point, we are done */
+		if (key.offset < offset) {
+			ret = 0;
+			break;
+		}
+
+		btrfs_info(fs_info,
+			   "btrfs_insert_range: shifting key offset %llu -> %llu",
+			   key.offset, key.offset + len);
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+
+		/*
+		 * Inline extents must have key.offset == 0 and cannot be
+		 * shifted to a non-zero offset - the tree checker enforces
+		 * this invariant. Reject with -EOPNOTSUPP.
+		 *
+		 * An inline extent can only exist if the file's entire content
+		 * fits within a single sector, meaning it is the only extent
+		 * item for this inode. It will therefore always be the first
+		 * item we encounter in the reverse iteration, before any keys
+		 * have been shifted, so bailing here leaves the file in a
+		 * consistent state.
+		 *
+		 * TODO: support this case by converting the inline extent to
+		 * a regular extent first, then shifting it. This would allow
+		 * INSERT_RANGE on small files, which xfs supports.
+		 */
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+			ret = -EOPNOTSUPP;
+			btrfs_release_path(path);
+			goto out_trans;
+		}
+
+		memcpy(&new_key, &key, sizeof(new_key));
+		new_key.offset += len;
+		btrfs_set_item_key_safe(trans, path, &new_key);
+
+		/* Update back-reference: drop old offset, add new offset */
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
+			ret = btrfs_shift_extent_backref(trans, root, ino,
+					disk_bytenr, num_bytes,
+					key.offset - extent_offset,
+					new_key.offset - extent_offset);
+			if (unlikely(ret))
+				goto out_trans;
+		}
+
+		/*
+		 * Step back to the previous item for the next iteration.
+		 * If we've reached slot 0 we need to move to the previous leaf.
+		 */
+		nr_shifted++;
+		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
+			btrfs_info(fs_info,
+			   "btrfs_insert_range: cycling transaction, nr_shifted=%d", nr_shifted);
+
+			inode_inc_iversion(inode);
+			inode_set_mtime_to_ts(inode,
+					      inode_set_ctime_current(inode));
+			ret = btrfs_update_inode(trans, BTRFS_I(inode));
+			if (ret) {
+				btrfs_release_path(path);
+				goto out_trans;
+			}
+			btrfs_end_transaction(trans);
+			btrfs_btree_balance_dirty(fs_info);
+			trans = btrfs_start_transaction(root, 1);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				trans = NULL;
+				btrfs_release_path(path);
+				goto out_path;
+			}
+		}
+
+		/*
+		 * Set key.offset to one below the current item so the next
+		 * btrfs_search_slot lands on the item before it.
+		 */
+		if (key.offset == 0) {
+			ret = 0;
+			break;
+		}
+		key.offset--;
+		btrfs_release_path(path);
+	}
+
+	if (ret)
+		goto out_trans;
+
+	/*
+	 * Drop stale extent map entries so subsequent reads re-load correct
+	 * mappings from the btree.
+	 */
+	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
+
+	btrfs_info(fs_info,
+		   "btrfs_insert_range: updating i_size %lld -> %lld",
+		   inode->i_size, inode->i_size + len);
+	inode_inc_iversion(inode);
+	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+	i_size_write(inode, inode->i_size + len);
+	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
+	ret = btrfs_update_inode(trans, BTRFS_I(inode));
+
+out_trans:
+	if (trans) {
+		if (ret)
+			btrfs_end_transaction(trans);
+		else
+			ret = btrfs_end_transaction(trans);
+	}
+out_path:
+	btrfs_free_path(path);
+	btrfs_info(fs_info, "btrfs_insert_range: returning %d", ret);
+	return ret;
+}
+
 static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3596,6 +3831,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	case FALLOC_FL_COLLAPSE_RANGE:
 		ret = btrfs_collapse_range(inode, offset, len);
 		break;
+	case FALLOC_FL_INSERT_RANGE:
+		ret = btrfs_insert_range(inode, offset, len);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
-- 
2.53.0


* [PATCH 2/3] btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards
In-Reply-To: <20260418143808.199603-1-paul.richards@gmail.com>

Assisted-by: Amazon Q Developer:auto/unknown
Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 300 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0b5cc3cec675..99d24bef5f88 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -39,6 +39,13 @@
 #include "super.h"
 #include "print-tree.h"
 
+/*
+ * When we shift extents as part of fallocate insert or collapse we commit
+ * and cycle the transaction every BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL
+ * extents to avoid accumulating too many changes in one transaction.
+ */
+#define BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32)
+
 /*
  * Unlock folio after btrfs_file_write() is done with it.
  */
@@ -2648,6 +2655,296 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Update the extent back-reference in the extent tree when a
+ * BTRFS_EXTENT_DATA_KEY item is shifted to a new logical file offset.
+ * Drops the back-reference at old_file_offset and adds one at new_file_offset.
+ * Holes (disk_bytenr == 0) and inline extents have no back-references and
+ * must not be passed to this function.
+ */
+static int btrfs_shift_extent_backref(struct btrfs_trans_handle *trans,
+				      struct btrfs_root *root, u64 ino,
+				      u64 disk_bytenr, u64 num_bytes,
+				      u64 old_file_offset, u64 new_file_offset)
+{
+	struct btrfs_ref ref = {
+		.bytenr = disk_bytenr,
+		.num_bytes = num_bytes,
+		.parent = 0,
+		.owning_root = btrfs_root_id(root),
+		.ref_root = btrfs_root_id(root),
+	};
+	int ret;
+
+	ref.action = BTRFS_DROP_DELAYED_REF;
+	btrfs_init_data_ref(&ref, ino, old_file_offset, 0, false);
+	ret = btrfs_free_extent(trans, &ref);
+	if (unlikely(ret)) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
+	ref.action = BTRFS_ADD_DELAYED_REF;
+	btrfs_init_data_ref(&ref, ino, new_file_offset, 0, false);
+	ret = btrfs_inc_extent_ref(trans, &ref);
+	if (unlikely(ret))
+		btrfs_abort_transaction(trans, ret);
+
+	return ret;
+}
+
+static int btrfs_collapse_range(struct inode *inode, loff_t offset, loff_t len)
+{
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 end = offset + len;
+	struct btrfs_path *path;
+	struct btrfs_trans_handle *trans = NULL;
+	struct extent_state *cached_state = NULL;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	struct btrfs_key new_key;
+	u64 ino = btrfs_ino(BTRFS_I(inode));
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
+		return -EOPNOTSUPP;
+
+	/* offset and len must be sector-aligned */
+	if (!IS_ALIGNED(offset | len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* collapse range must not reach or pass EOF - use ftruncate instead */
+	if (end >= inode->i_size)
+		return -EINVAL;
+
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: ino=%llu offset=%lld len=%lld i_size=%lld",
+		   btrfs_ino(BTRFS_I(inode)), offset, len, inode->i_size);
+
+	/* wait for any ordered extents in [offset, i_size) to complete */
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+				       inode->i_size - offset);
+	if (ret)
+		return ret;
+
+	/*
+	 * Flush dirty pages and invalidate the page cache for [offset, i_size)
+	 * before any extent manipulation, following the ext4/xfs pattern.
+	 * We hold BTRFS_ILOCK_MMAP so no new dirty pages can appear during
+	 * the operation. The page cache must be empty before we shift extent
+	 * keys so that stale pages at the old offsets cannot be read back
+	 * after the collapse.
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping, offset, LLONG_MAX);
+	if (ret)
+		return ret;
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: nrpages before upfront invalidate=%lu (pages before offset=%llu not invalidated)",
+		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
+	truncate_pagecache_range(inode, offset, LLONG_MAX);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: nrpages after upfront invalidate=%lu (expected %llu)",
+		   inode->i_mapping->nrpages, offset >> PAGE_SHIFT);
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Lock the range [offset, end) and invalidate the page cache within
+	 * it. btrfs_punch_hole_lock_range() calls truncate_pagecache_range()
+	 * internally in a retry loop.
+	 */
+	btrfs_punch_hole_lock_range(inode, offset, end - 1, &cached_state);
+
+	/*
+	 * Remove all extents in [offset, end). Passing NULL for extent_info
+	 * means we are punching a hole. btrfs_replace_file_extents() splits
+	 * any extent straddling the boundaries, drops extent refs, and
+	 * returns a transaction handle for us to reuse.
+	 */
+	ret = btrfs_replace_file_extents(BTRFS_I(inode), path,
+					 offset, end - 1, NULL, &trans);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: btrfs_replace_file_extents ret=%d nrpages=%lu",
+		   ret, inode->i_mapping->nrpages);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Shift all BTRFS_EXTENT_DATA_KEY items with key.offset >= end
+	 * leftward by len bytes.
+	 *
+	 * We iterate forward (lowest offset first) which is safe for a
+	 * left-shift because the new key is always less than the old one,
+	 * so we never collide with a key we haven't visited yet.
+	 */
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = end;
+
+	int nr_shifted = 0;
+	while (1) {
+		struct btrfs_file_extent_item *fi;
+		u64 disk_bytenr;
+		u64 num_bytes;
+		u64 extent_offset;
+		int extent_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+		if (ret < 0)
+			goto out_trans;
+
+		/* If no exact match, slot points at the next item - that's fine */
+
+		leaf = path->nodes[0];
+		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0)
+				goto out_trans;
+			if (ret > 0) {
+				/* No more items - we are done shifting */
+				ret = 0;
+				break;
+			}
+			leaf = path->nodes[0];
+		}
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		/* Stop if we've moved past this inode's extent data items */
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+			ret = 0;
+			break;
+		}
+
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: shifting key offset %llu -> %llu",
+			   key.offset, key.offset - len);
+
+		fi = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		extent_type = btrfs_file_extent_type(leaf, fi);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		extent_offset = btrfs_file_extent_offset(leaf, fi);
+
+		memcpy(&new_key, &key, sizeof(new_key));
+		new_key.offset -= len;
+		btrfs_set_item_key_safe(trans, path, &new_key);
+
+		/*
+		 * Update the back-reference in the extent tree to reflect the
+		 * new logical file offset. Holes (disk_bytenr == 0) and inline
+		 * extents have no back-references to update.
+		 */
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE && disk_bytenr > 0) {
+			ret = btrfs_shift_extent_backref(trans, root, ino,
+					disk_bytenr, num_bytes,
+					key.offset - extent_offset,
+					new_key.offset - extent_offset);
+			if (unlikely(ret))
+				goto out_trans;
+		}
+
+		/* Advance to the next item on the next iteration */
+		key.offset = new_key.offset + 1;
+		nr_shifted++;
+
+		/*
+		 * Cycle the transaction every N items to avoid holding a
+		 * single transaction open across a large number of extent
+		 * items. btrfs_set_item_key_safe() modifies leaves in-place
+		 * so we won't hit -ENOSPC; we use a simple counter instead.
+		 * Update the inode at each cycle point so it is consistent
+		 * on disk if a crash occurs mid-loop.
+		 */
+		if (nr_shifted % BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL == 0) {
+			btrfs_info(fs_info,
+			   "btrfs_collapse_range: cycling transaction, nr_shifted=%d", nr_shifted);
+			inode_inc_iversion(inode);
+			inode_set_mtime_to_ts(inode,
+					      inode_set_ctime_current(inode));
+			ret = btrfs_update_inode(trans, BTRFS_I(inode));
+			if (ret) {
+				btrfs_release_path(path);
+				goto out_trans;
+			}
+			btrfs_end_transaction(trans);
+			btrfs_btree_balance_dirty(fs_info);
+			trans = btrfs_start_transaction(root, 1);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				trans = NULL;
+				btrfs_release_path(path);
+				goto out_unlock;
+			}
+		}
+
+		btrfs_release_path(path);
+	}
+
+	if (ret)
+		goto out_trans;
+
+	/* Update i_size and the on-disk inode */
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: updating i_size %lld -> %lld nrpages=%lu",
+		   inode->i_size, inode->i_size - len, inode->i_mapping->nrpages);
+	/*
+	 * Drop the extent map cache for [offset, i_size) so that subsequent
+	 * reads re-load the correct mappings from the btree. The key-shift
+	 * loop updated the btree but the in-memory extent map cache still
+	 * has entries at the old logical offsets.
+	 */
+	btrfs_drop_extent_map_range(BTRFS_I(inode), offset, (u64)-1, false);
+	inode_inc_iversion(inode);
+	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+	i_size_write(inode, inode->i_size - len);
+	btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
+	ret = btrfs_update_inode(trans, BTRFS_I(inode));
+
+out_trans:
+	if (trans) {
+		if (ret)
+			btrfs_end_transaction(trans);
+		else
+			ret = btrfs_end_transaction(trans);
+	}
+out_unlock:
+	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, offset, end - 1,
+			    &cached_state);
+	btrfs_info(fs_info,
+		   "btrfs_collapse_range: post-unlock ret=%d i_size=%lld, invalidating page cache from %lld",
+		   ret, inode->i_size, offset);
+	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && !ret) {
+		/*
+		 * These are expected to be no-ops: ordered extents were drained
+		 * at the start of this function and BTRFS_ILOCK_MMAP has been
+		 * held throughout, so no new writes could have been submitted.
+		 * The page cache was emptied from offset onwards upfront and
+		 * nrpages in that region stayed 0 throughout the operation.
+		 * Pages before offset are unaffected and may still be cached.
+		 */
+		int wait_ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset,
+						    inode->i_size);
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: post-shift wait_ordered ret=%d",
+			   wait_ret);
+		ASSERT(wait_ret == 0);
+		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
+					      LLONG_MAX) == false);
+		truncate_pagecache_range(inode, offset, LLONG_MAX);
+		btrfs_info(fs_info,
+			   "btrfs_collapse_range: truncate_pagecache_range done");
+		ASSERT(filemap_range_has_page(inode->i_mapping, offset,
+					      LLONG_MAX) == false);
+	}
+	btrfs_free_path(path);
+	return ret;
+}
+
 static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3296,6 +3593,9 @@ static long btrfs_fallocate(struct file *file, int mode,
 	case FALLOC_FL_PUNCH_HOLE:
 		ret = btrfs_punch_hole(file, offset, len);
 		break;
+	case FALLOC_FL_COLLAPSE_RANGE:
+		ret = btrfs_collapse_range(inode, offset, len);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
-- 
2.53.0


* [PATCH 1/3] btrfs: refactor btrfs_fallocate() ahead of supporting more modes
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards
In-Reply-To: <20260418143808.199603-1-paul.richards@gmail.com>

Refactor btrfs_fallocate() to switch and dispatch based on
the mode argument, splitting most of the modes into separate
functions. Only the "allocate range" and "zero range" functions
remain coupled.

Signed-off-by: Paul Richards <paul.richards@gmail.com>
---
 fs/btrfs/file.c | 63 +++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a4cb9d3cfc4e..0b5cc3cec675 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2668,8 +2668,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	bool truncated_block = false;
 	bool updated_inode = false;
 
-	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
-
 	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset, len);
 	if (ret)
 		goto out_only_mutex;
@@ -2712,7 +2710,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 		truncated_block = true;
 		ret = btrfs_truncate_block(BTRFS_I(inode), offset, orig_start, orig_end);
 		if (ret) {
-			btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 			return ret;
 		}
 	}
@@ -2809,7 +2806,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 				ret = ret2;
 		}
 	}
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 	return ret;
 }
 
@@ -2933,6 +2929,8 @@ static int btrfs_zero_range(struct inode *inode,
 	u64 bytes_to_reserve = 0;
 	bool space_reserved = false;
 
+	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, alloc_start,
 			      alloc_end - alloc_start);
 	if (IS_ERR(em)) {
@@ -3092,7 +3090,7 @@ static int btrfs_zero_range(struct inode *inode,
 	return ret;
 }
 
-static long btrfs_fallocate(struct file *file, int mode,
+static long btrfs_allocate_or_zero_range(struct file *file, int mode,
 			    loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
@@ -3115,27 +3113,12 @@ static long btrfs_fallocate(struct file *file, int mode,
 	int blocksize = BTRFS_I(inode)->root->fs_info->sectorsize;
 	int ret;
 
-	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
-		return -EIO;
-
-	/* Do not allow fallocate in ZONED mode */
-	if (btrfs_is_zoned(inode_to_fs_info(inode)))
-		return -EOPNOTSUPP;
+	ASSERT((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE || (mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ZERO_RANGE);
 
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
-	/* Make sure we aren't being give some crap mode */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_ZERO_RANGE))
-		return -EOPNOTSUPP;
-
-	if (mode & FALLOC_FL_PUNCH_HOLE)
-		return btrfs_punch_hole(file, offset, len);
-
-	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
-
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
 		ret = inode_newsize_ok(inode, offset + len);
 		if (ret)
@@ -3185,7 +3168,6 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
 		ret = btrfs_zero_range(inode, offset, len, mode);
-		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 		return ret;
 	}
 
@@ -3282,11 +3264,46 @@ static long btrfs_fallocate(struct file *file, int mode,
 	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			    &cached_state);
 out:
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 	extent_changeset_free(data_reserved);
 	return ret;
 }
 
+static long btrfs_fallocate(struct file *file, int mode,
+			    loff_t offset, loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	int ret;
+
+	if (btrfs_is_shutdown(inode_to_fs_info(inode)))
+		return -EIO;
+
+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_is_zoned(inode_to_fs_info(inode)))
+		return -EOPNOTSUPP;
+
+	/* Check for options we do not support. */
+	if (mode & ~(FALLOC_FL_MODE_MASK | FALLOC_FL_KEEP_SIZE))
+		return -EOPNOTSUPP;
+
+	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+
+	/* Only one mode bit may be set. */
+	switch (mode & FALLOC_FL_MODE_MASK) {
+	case FALLOC_FL_ALLOCATE_RANGE:
+	case FALLOC_FL_ZERO_RANGE:
+		ret = btrfs_allocate_or_zero_range(file, mode, offset, len);
+		break;
+	case FALLOC_FL_PUNCH_HOLE:
+		ret = btrfs_punch_hole(file, offset, len);
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+	}
+
+	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+	return ret;
+}
+
 /*
  * Helper for btrfs_find_delalloc_in_range(). Find a subrange in a given range
  * that has unflushed and/or flushing delalloc. There might be other adjacent
-- 
2.53.0


* [RFC PATCH 0/3] btrfs: implement FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE
From: Paul Richards @ 2026-04-18 14:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Paul Richards

This series adds support for FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE to btrfs_fallocate(). Both operations are
already supported by ext4 and xfs. The userspace contract is
documented in fallocate(2).
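
For readers unfamiliar with the two modes, here is a minimal userspace
sketch of how they are requested and which argument checks the patches
in this series apply. It is an illustration only, not code from the
series: check_shift_args(), collapse_range() and insert_range() are
hypothetical helper names, and the checks model only the alignment and
EOF rules (the -EFBIG check for INSERT_RANGE is left out).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Hypothetical helper modelling the argument checks in the patches.
 * Only meant for FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE.
 * Returns 0 if the arguments look acceptable, -EINVAL otherwise.
 */
int check_shift_args(int mode, off_t offset, off_t len,
		     off_t file_size, off_t block_size)
{
	assert(block_size > 0);

	if (offset < 0 || len <= 0)
		return -EINVAL;
	/* offset and len must both be multiples of the block size */
	if (offset % block_size != 0 || len % block_size != 0)
		return -EINVAL;
	/* both modes require offset to lie inside the file */
	if (offset >= file_size)
		return -EINVAL;
	/* a collapsed range must end strictly before EOF */
	if (mode == FALLOC_FL_COLLAPSE_RANGE && len >= file_size - offset)
		return -EINVAL;
	return 0;
}

/* Remove [offset, offset + len) and pull the tail of the file left. */
int collapse_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len);
}

/* Open a hole of len bytes at offset and push the tail of the file right. */
int insert_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len);
}
```

On a filesystem that supports the mode, collapse_range(fd, 4096, 8192)
would shrink the file by 8192 bytes; on failure fallocate() returns -1
and sets errno (EOPNOTSUPP where the mode is not implemented).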

Patch 1 refactors btrfs_fallocate() to dispatch via a switch statement,
moving punch_hole into its own function and decoupling locking from the
per-operation helpers. This is similar to the implementations for ext4
and xfs. The allocate-range and zero-range paths remain coupled since
they share some setup logic.

Patches 2 and 3 add COLLAPSE_RANGE and INSERT_RANGE respectively.

== Implementation approach ==

For COLLAPSE_RANGE:
 - The removed region [offset, offset+len) is punched out via
   btrfs_replace_file_extents(), which handles boundary splitting.
 - All EXTENT_DATA keys with key.offset >= offset+len are shifted
   leftward by len in forward order.

For INSERT_RANGE:
 - All EXTENT_DATA keys with key.offset >= offset are shifted rightward
   by len in reverse order (required to avoid key collisions).
 - No pre-splitting of straddling extents is needed: the left portion
   of a straddling extent stays in place, the right portion is shifted;
   both reference the same physical extent via their existing
   extent_offset fields.

For each shifted key, the corresponding back-reference in the extent
tree is updated via a shared helper btrfs_shift_extent_backref().
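
The ordering requirement above can be shown with a toy model. This is only
a sketch: a sorted array of file offsets stands in for the EXTENT_DATA
keys of one inode, and every name in it (toy_keys_ordered(),
toy_collapse_shift(), toy_insert_shift()) is made up for illustration. It
ignores straddling extents, back-references, and the real b-tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if keys[] is strictly increasing, i.e. no two keys collide. */
static int toy_keys_ordered(const uint64_t *keys, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (keys[i - 1] >= keys[i])
			return 0;
	return 1;
}

/*
 * Collapse: keys at or beyond offset + len move left by len, lowest first.
 * Assumes the punched range [offset, offset + len) holds no keys.
 * Returns 0, or -1 if any intermediate state had colliding keys.
 */
static int toy_collapse_shift(uint64_t *keys, size_t n,
			      uint64_t offset, uint64_t len)
{
	assert(toy_keys_ordered(keys, n));
	for (size_t i = 0; i < n; i++) {
		if (keys[i] < offset + len)
			continue;
		keys[i] -= len;
		if (!toy_keys_ordered(keys, n))
			return -1;
	}
	return 0;
}

/*
 * Insert: keys at or beyond offset move right by len. With reverse != 0
 * the highest key moves first; with reverse == 0 the lowest moves first.
 * Returns 0, or -1 if any intermediate state had colliding keys.
 */
static int toy_insert_shift(uint64_t *keys, size_t n,
			    uint64_t offset, uint64_t len, int reverse)
{
	assert(toy_keys_ordered(keys, n));
	for (size_t step = 0; step < n; step++) {
		size_t i = reverse ? n - 1 - step : step;

		if (keys[i] < offset)
			continue;
		keys[i] += len;
		if (!toy_keys_ordered(keys, n))
			return -1;
	}
	return 0;
}
```

With keys {0, 4096, 8192, 12288} and a 4096-byte insert at offset 4096,
the forward pass moves 4096 onto the still-present 8192 and fails, while
the reverse pass never produces a duplicate.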

After the key-shift loop, btrfs_drop_extent_map_range() is called to
invalidate the in-memory extent map cache. This is important for reads
after the fallocate operation to ensure they obtain data for the new
offsets.

The page cache is flushed and invalidated upfront (before any extent
manipulation) following the ext4/xfs pattern. The inode lock
(BTRFS_ILOCK_MMAP) is held throughout, preventing new dirty pages from
appearing during the operation.

Transaction cycling: the key-shift loop cycles transactions every
BTRFS_INSERT_COLLAPSE_TRANSACTION_CYCLE_INTERVAL (32) items to avoid
holding a single transaction open across a large number of extents.
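
A minimal sketch of the cycling pattern, using plain counters in place of
real transactions. The struct and function names are hypothetical and the
comments mark where the actual start/end transaction calls would sit; only
the interval value of 32 comes from the description above.

```c
#include <assert.h>

#define TOY_CYCLE_INTERVAL 32	/* same value as quoted above */

struct toy_stats {
	unsigned int transactions;	/* how many transactions were started */
	unsigned int max_items;		/* most items handled in one transaction */
};

/* Walk nr_items keys, closing the current transaction every interval items. */
static struct toy_stats toy_shift_with_cycling(unsigned int nr_items,
					       unsigned int interval)
{
	struct toy_stats st = { 0, 0 };
	unsigned int in_trans = 0;

	assert(interval > 0);
	for (unsigned int i = 0; i < nr_items; i++) {
		if (in_trans == 0)
			st.transactions++;	/* stands in for starting a transaction */
		in_trans++;			/* stands in for shifting one key */
		if (in_trans > st.max_items)
			st.max_items = in_trans;
		if (in_trans == interval)
			in_trans = 0;		/* stands in for ending the transaction */
	}
	return st;
}
```

In this model 100 items with an interval of 32 use four transactions, and
no transaction ever covers more than 32 items.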

Both operations are gated on CONFIG_BTRFS_EXPERIMENTAL.

== Known limitations ==

INSERT_RANGE returns -EOPNOTSUPP for inlined files. Supporting inline
files will require promoting the existing inline extent to a regular
one, since inline extents are supported only at the very start of a
file.

In the opposite direction, COLLAPSE_RANGE will not create inline files
like it should if the remaining data is small enough.

I intend to address both of these limitations.

== Testing ==

Tested with a Rust-based functional test suite covering:
 - Collapse and insert at the start and in the middle of a file
 - Multiple sequential operations on the same file
 - Files with multiple extents (fsync between writes to force separate
   extent items)
 - Files with holes (explicit punch_hole and implicit sparse writes)
 - Compressed extents (mount -o compress=zstd)
 - Transaction cycling (interval reduced to 4 during testing, verified
   in dmesg logs)
 - Inline files, verified that -EOPNOTSUPP is returned.

The same tests pass on both btrfs and xfs (modulo the inline files).

I have not yet run fstests, which I know contains tests for
INSERT_RANGE and COLLAPSE_RANGE. I will do so.

== Questions for reviewers ==

1. Transaction cycling interval: we use 32 items per cycle. Is this
   the right threshold, or is there an established convention in btrfs
   for this kind of loop?

2. Extent lock scope for collapse: we hold the extent lock only on
   [offset, offset+len) during the hole punch, not on the full
   [offset, i_size) range that the key-shift loop operates on. Is
   this safe, or should we lock the full affected range?

3. CONFIG_BTRFS_EXPERIMENTAL gate: is this the right gate for these
   operations, or should they be unconditionally available?

== Notes ==

This is my first kernel contribution. Development was significantly
assisted by an LLM (Amazon Q Developer). The implementation, testing,
and final review decisions are my own.

Various btrfs_info() print statements, assertions, and comments that
were useful during development and testing have been left in place,
but will be removed or streamlined in the next revision.

Paul Richards (3):
  btrfs: refactor btrfs_fallocate() ahead of supporting more modes
  btrfs: support for FALLOC_FL_COLLAPSE_RANGE in btrfs_fallocate()
  btrfs: support for FALLOC_FL_INSERT_RANGE in btrfs_fallocate()

 fs/btrfs/file.c | 601 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 578 insertions(+), 23 deletions(-)

-- 
2.53.0

^ permalink raw reply

* Re: [PATCH 7.2 v3 00/12] Remove read-only THP support for FSes without large folio support
From: Lorenzo Stoakes @ 2026-04-18  9:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox (Oracle), Song Liu, Chris Mason, David Sterba,
	Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	David Hildenbrand, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-btrfs, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest
In-Reply-To: <20260418024429.4055056-1-ziy@nvidia.com>

On Fri, Apr 17, 2026 at 10:44:17PM -0400, Zi Yan wrote:
> Hi all,
>
> This patchset removes READ_ONLY_THP_FOR_FS Kconfig and enables creating
> read-only THPs for FSes with large folio support (the supported orders
> need to include PMD_ORDER) by default.
>
> Before the patchset, the status of creating read-only THPs is below:

Good to specify the read-only bit up front!

>
>                             |    PF     | MADV_COLLAPSE | khugepaged |
>                             |-----------|---------------|------------|
>  large folio FSes only      |     ✓     |       x       |      x     |
>  READ_ONLY_THP_FOR_FS only  |     x     |       ✓       |      ✓     |
>  both                       |     ✓     |       ✓       |      ✓     |

These diagrams seem familiar :P but very nice, thanks!

And since we include cover letter in series in mm this should be some nice
documentation in the commit msg also.

>
> where READ_ONLY_THP_FOR_FS implies no large folio FSes.
>
>
> Now without READ_ONLY_THP_FOR_FS:
>
>                            |    PF     | MADV_COLLAPSE | khugepaged |
>                            |-----------|---------------|------------|
>  large folio FSes          |     ✓     |       ✓       |      ✓     |
>  no large folio FSes       |     x     |       x       |      x     |

This is really nice and clear thanks!

>
> This means no large folio FSes need to add large folio support (the
> supported orders need to include PMD_ORDER), so that they can leverage
> read-only THP creation function.
>
> To prevent breaking read-only THP support for large folio FSes,
> 1. first 4 patches enables the support, so that without READ_ONLY_THP_FOR_FS,
>    read-only THP still works for large folio FSes,

I guess this introduces what was previously supported by
CONFIG_READ_ONLY_THP_FOR_FS to large folios as part of that before removal of
the config option?

> 2. Patch 5 removes READ_ONLY_THP_FOR_FS Kconfig,
> 3. the rest of patches remove code related to READ_ONLY_THP_FOR_FS.

Makes sense thanks!

>
>
> The overview of the changes is:
>
> 1. collapse_file() checks for to-be-collapsed folio dirtiness after they
>    are locked, unmapped to make sure no new write happens. Before,
>    mapping->nr_thps and inode->i_writecount are used to cause read-only
>    THP truncation before a fd becomes writable.
>
> 2. hugepage_pmd_enabled() is true for anon, shmem, and file-backed cases
>    if the global khugepaged control is on, otherwise, khugepaged for
>    file-backed case is turned off and anon and shmem depend on per-size
>    control knobs.
>
> 3. collapse_file() from mm/khugepaged.c, instead of checking
>    CONFIG_READ_ONLY_THP_FOR_FS, makes sure the mapping_max_folio_order()
>    of struct address_space of the file is at least PMD_ORDER.
>
> 4. file_thp_enabled() also checks mapping_max_folio_order() instead and
>    no longer checks if the input file is opened as read-only (Change 1
>    handles read-write files).
>
> 5. truncate_inode_partial_folio() calls folio_split() directly instead
>    of the removed try_folio_split_to_order(), since large folios can
>    only show up on a FS with large folio support.
>
> 6. nr_thps is removed from struct address_space, since it is no longer
>    needed to drop all read-only THPs from a FS without large folio
>    support when the fd becomes writable. Its related filemap_nr_thps*()
>    are removed too.
>
> 7. folio_check_splittable() no longer checks READ_ONLY_THP_FOR_FS.
>
> 8. Updated comments in various places.
>
>
> Changelog
> ===
> From V2[3]:
> 1. removed unnecessary check in collapse_scan_file().
>
> 2. removed inode_is_open_for_write() check in file_thp_enabled().
>
> 3. changed hugepage_pmd_enabled() to return true if khugepaged global
>    control is on instead of false. cleaned up anon and shmem code in the
>    function.
>
> 4. moved folio dirtiness check after try_to_unmap() but before
>    try_to_unmap_flush(), since that is sufficient to prevent new writes.
>
> 5. reordered patch 4 and 5, so that khugepaged behavior does not change
>    after READ_ONLY_THP_FOR_FS is removed.
>
> 6. added read-write file test in khugepaged selftest.
>
> 7. removed the read-only file restriction from guard-region selftest.
>
> From V1[2]:
> 1. removed inode_is_open_for_write() check in collapse_file(), since the
>    added folio dirtiness check after try_to_unmap_flush() should be
>    sufficient to prevent writes to candidate folios.
>
> 2. removed READ_ONLY_THP_FOR_FS check in hugepage_pmd_enabled(), please
>    see Patch 5 and item 2 in the overview for more details.
>
> 3. moved the patch removing READ_ONLY_THP_FOR_FS Kconfig after enabling
>    khugepaged and MADV_COLLAPSE to create read-only THPs.
>
> 4. added mapping_pmd_thp_support() helper function.
>
> 5. used VM_WARN_ON_ONCE() in collapse_file() for mapping eligibility check
>    and address alignment check instead of if + return error code. Always
>    allow shmem, since MADV_COLLAPSE ignore shmem huge config.
>
> 6. added mapping eligibility check in collapse_scan_file().
>
> 7. removed trailing ; for folio_split() in the !CONFIG_TRANSPARENT_HUGEPAGE.
>
> 8. simplified code in folio_check_splittable() after removing
>    READ_ONLY_THP_FOR_FS code.
>
> 9. clarified that read-only THP works for FSes with PMD THP support by
>    default.
>
> From RFC[1]:
> 1. instead of removing READ_ONLY_THP_FOR_FS function entirely, turn it
>    on by default for all FSes with large folio support and the supported
>    orders includes PMD_ORDER.
>
> Suggestions and comments are welcome.
>
> Link: https://lore.kernel.org/all/20260323190644.1714379-1-ziy@nvidia.com/ [1]
> Link: https://lore.kernel.org/all/20260327014255.2058916-1-ziy@nvidia.com/ [2]
> Link: https://lore.kernel.org/all/20260413192030.3275825-1-ziy@nvidia.com/ [3]
>
> Zi Yan (12):
>   mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
>   mm/khugepaged: add folio dirty check after try_to_unmap()
>   mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
>   mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in
>     hugepage_pmd_enabled()
>   mm: remove READ_ONLY_THP_FOR_FS Kconfig option
>   mm: fs: remove filemap_nr_thps*() functions and their users
>   fs: remove nr_thps from struct address_space
>   mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS
>   mm/truncate: use folio_split() in truncate_inode_partial_folio()
>   fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS
>   selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
>   selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions
>
>  fs/btrfs/defrag.c                          |   3 -
>  fs/inode.c                                 |   3 -
>  fs/open.c                                  |  27 -----
>  include/linux/fs.h                         |   5 -
>  include/linux/huge_mm.h                    |  25 +----
>  include/linux/pagemap.h                    |  35 ++-----
>  include/linux/shmem_fs.h                   |   2 +-
>  mm/Kconfig                                 |  11 ---
>  mm/filemap.c                               |   1 -
>  mm/huge_memory.c                           |  39 ++------
>  mm/khugepaged.c                            |  86 ++++++++--------
>  mm/truncate.c                              |   8 +-
>  tools/testing/selftests/mm/guard-regions.c |  18 +---
>  tools/testing/selftests/mm/khugepaged.c    | 110 +++++++++++++++------
>  tools/testing/selftests/mm/run_vmtests.sh  |  12 ++-
>  15 files changed, 156 insertions(+), 229 deletions(-)
>
> --
> 2.43.0
>

^ permalink raw reply

* Re: [PATCH v2] btrfs: preallocate extent changeset before acquiring extent_io_tree lock
From: Qu Wenruo @ 2026-04-18  4:46 UTC (permalink / raw)
  To: tchou, linux-btrfs; +Cc: clm, dsterba
In-Reply-To: <9180d175-397c-45ba-94ac-8b6299fb9ec4@gmx.com>



On 2026/4/17 18:50, Qu Wenruo wrote:
> 
> 
> On 2026/4/17 17:08, tchou wrote:
>> In btrfs_clear_extent_bit_changeset(), the extent changeset ulist may 
>> need
>> to allocate new nodes. Currently, this can happen while holding the
>> extent_io_tree->lock spinlock. Although ulist_prealloc() uses GFP_NOFS to
>> avoid deadlock with filesystem reclaim, it's better to preallocate before
>> acquiring the spinlock to:
>>
>> 1. Avoid potential allocation failures while holding the lock
>> 2. Be consistent with set_extent_bit() which already preallocates both
>>     extent state and changeset before the spinlock
>> 3. Reduce lock contention by not doing allocations under the lock
>>
>> Preallocate the changeset ulist node before acquiring the spinlock,
>> mirroring the pattern used for extent state preallocation.
>>
>> Signed-off-by: Ting-Chang Hou <tchou@synology.com>
>> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>

Forgot to mention, please do not add Reviewed-by tags unless the 
reviewer has explicitly provided one.

Especially when you do not know which mail address the reviewer is 
really using.

You do not need to resend, I have fixed the address in the for-next branch.

Thanks,
Qu

> 
> Now pushed to for-next.
> 
> Thanks,
> Qu
> 
>> ---
>>   fs/btrfs/extent-io-tree.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
>> index 626702244809..b5ba650cdb55 100644
>> --- a/fs/btrfs/extent-io-tree.c
>> +++ b/fs/btrfs/extent-io-tree.c
>> @@ -663,6 +663,9 @@ int btrfs_clear_extent_bit_changeset(struct 
>> extent_io_tree *tree, u64 start, u64
>>            */
>>           prealloc = alloc_extent_state(mask);
>>       }
>> +    /* Preallocate the extent changeset ulist node before acquiring 
>> spinlock. */
>> +    if (changeset)
>> +        extent_changeset_prealloc(changeset, mask);
>>       spin_lock(&tree->lock);
>>       if (cached_state) {
> 
> 


^ permalink raw reply

* [PATCH 7.2 v3 12/12] selftests/mm: remove READ_ONLY_THP_FOR_FS code from guard-regions
From: Zi Yan @ 2026-04-18  2:44 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-btrfs,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-1-ziy@nvidia.com>

Any file system with large folio support whose supported orders include
PMD_ORDER can be used. There is no need to open the file read-only.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 tools/testing/selftests/mm/guard-regions.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 48e8b1539be3..117639891953 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -2203,17 +2203,6 @@ TEST_F(guard_regions, collapse)
 	if (variant->backing != ANON_BACKED)
 		ASSERT_EQ(ftruncate(self->fd, size), 0);
 
-	/*
-	 * We must close and re-open local-file backed as read-only for
-	 * CONFIG_READ_ONLY_THP_FOR_FS to work.
-	 */
-	if (variant->backing == LOCAL_FILE_BACKED) {
-		ASSERT_EQ(close(self->fd), 0);
-
-		self->fd = open(self->path, O_RDONLY);
-		ASSERT_GE(self->fd, 0);
-	}
-
 	ptr = mmap_(self, variant, NULL, size, PROT_READ, 0, 0);
 	ASSERT_NE(ptr, MAP_FAILED);
 
@@ -2237,9 +2226,10 @@ TEST_F(guard_regions, collapse)
 	/*
 	 * Now collapse the entire region. This should fail in all cases.
 	 *
-	 * The madvise() call will also fail if CONFIG_READ_ONLY_THP_FOR_FS is
-	 * not set for the local file case, but we can't differentiate whether
-	 * this occurred or if the collapse was rightly rejected.
+	 * The madvise() call will also fail if the file system does not support
+	 * large folio or the supported orders do not include PMD_ORDER for the
+	 * local file case, but we can't differentiate whether this occurred or
+	 * if the collapse was rightly rejected.
 	 */
 	EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);
 
-- 
2.43.0


^ permalink raw reply related
