Linux XFS filesystem development

* Re: [RFC PATCH 1/2] libfrog: make xfrog_defragrange return a positive valued
From: Darrick J. Wong @ 2026-01-20 17:20 UTC (permalink / raw)
  To: cem; +Cc: aalbersh, linux-xfs
In-Reply-To: <20260119142724.284933-2-cem@kernel.org>

On Mon, Jan 19, 2026 at 03:26:50PM +0100, cem@kernel.org wrote:
> From: Carlos Maiolino <cem@kernel.org>
> 
> Currently, the only user for xfrog_defragrange is xfs_fsr's packfile(),
> which expects error to be a positive value.
> 
> Whenever xfrog_defragrange fails, the switch case always falls into the
> default clause, making the error message pointless.
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  libfrog/file_exchange.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/libfrog/file_exchange.c b/libfrog/file_exchange.c
> index e6c3f486b0ff..31bbc6da60c3 100644
> --- a/libfrog/file_exchange.c
> +++ b/libfrog/file_exchange.c
> @@ -232,7 +232,7 @@ xfrog_defragrange(
>  	if (ret) {
>  		if (errno == EOPNOTSUPP || errno == ENOTTY)
>  			goto legacy_fallback;
> -		return -errno;
> +		return errno;

Hrmm.  If you're going to change the polarity of the error numbers (e.g.
negative to positive) then please update the comments.

That said, I'd prefer to keep the errno polarity the same at least
within a .c file ... even though libfrog is a mess of different error
number return strategies.  What if the callsite changed to:

	/* Swap the extents */
	error = -xfrog_defragrange(...);

and

	/* Snapshot file_fd before we start copying data... */
	error = -xfrog_defragrange_prep(...);

(and I guess io/exchrange.c also needs a fix)

	/* Snapshot the original file metadata in anticipation... */
	ret = -xfrog_commitrange_prep(...);

Hrm?

--D
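
The call-site negation suggested above can be sketched in isolation. This is
only an illustration: demo_defragrange() and demo_packfile() are hypothetical
stand-ins, not the libfrog or xfs_fsr functions, and EBUSY is an arbitrary
error picked for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a helper that keeps the
 * "negative errno on failure" convention.
 */
static int demo_defragrange(int should_fail)
{
	if (should_fail) {
		errno = EBUSY;
		return -errno;
	}
	return 0;
}

/* Hypothetical caller that wants a positive errno for its switch. */
static int demo_packfile(int should_fail)
{
	int error;

	/* Flip the sign at the call site; the helper stays untouched. */
	error = -demo_defragrange(should_fail);

	switch (error) {
	case 0:
		return 0;
	case EBUSY:
		fprintf(stderr, "busy: %s\n", strerror(error));
		return error;
	default:
		fprintf(stderr, "unexpected: %s\n", strerror(error));
		return error;
	}
}
```

With the sign flipped in the caller, the specific case labels become reachable
again while every helper in the file keeps a single return convention.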

>  	}
>  
>  	return 0;
> @@ -240,7 +240,7 @@ xfrog_defragrange(
>  legacy_fallback:
>  	ret = xfrog_ioc_swapext(file2_fd, xdf);
>  	if (ret)
> -		return -errno;
> +		return errno;
>  
>  	return 0;
>  }
> -- 
> 2.52.0
> 
> 

* Re: [PATCH v4 3/3] xfs: adjust handling of a few numerical mount options
From: Dmitry Antipov @ 2026-01-20 16:57 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Andrew Morton, Kees Cook, Carlos Maiolino, Christoph Hellwig,
	linux-xfs, linux-hardening
In-Reply-To: <aW-YP7wCEvRJzyfR@smile.fi.intel.com>

On Tue, 2026-01-20 at 16:59 +0200, Andy Shevchenko wrote:

> With all this, I do not see the point of having a new API.
> Also, where are the test cases for it?

If there is no point, why worry about tests?
Also, do you always communicate with people
just like they're your (well-) paid personnel?

Dmitry

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Lorenzo Stoakes @ 2026-01-20 16:50 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason Gunthorpe, Andrew Morton, Jarkko Sakkinen, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Greg Kroah-Hartman, Dan Williams, Vishal Verma,
	Dave Jiang, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Dave Airlie, Simona Vetter, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Christian König, Huang Rui,
	Matthew Auld, Matthew Brost, Alexander Viro, Christian Brauner,
	Jan Kara, Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador,
	David Hildenbrand (Red Hat), Konstantin Komarov, Mike Marshall,
	Martin Brandenburg, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, Carlos Maiolino, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Matthew Wilcox, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Zi Yan, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Jann Horn, Pedro Falcato,
	David Howells, Paul Moore, James Morris, Serge E. Hallyn,
	Yury Norov, Rasmus Villemoes, linux-sgx, linux-kernel, nvdimm,
	linux-cxl, dri-devel, intel-gfx, linux-fsdevel, linux-aio,
	linux-erofs, linux-ext4, linux-mm, ntfs3, devel, linux-xfs,
	keyrings, linux-security-module
In-Reply-To: <9ff58468-a72d-4984-95f4-d0a60554705d@app.fastmail.com>

On Tue, Jan 20, 2026 at 05:44:29PM +0100, Arnd Bergmann wrote:
> On Tue, Jan 20, 2026, at 17:22, Lorenzo Stoakes wrote:
> > On Tue, Jan 20, 2026 at 05:00:28PM +0100, Arnd Bergmann wrote:
> >> On Tue, Jan 20, 2026, at 16:10, Lorenzo Stoakes wrote:
> >> >
> >> > It strikes me that the key optimisation here is the inlining, now if the issue
> >> > is that ye olde compiler might choose not to inline very small functions (seems
> >> > unlikely) we could always throw in an __always_inline?
> >>
> >> I can think of three specific things going wrong with structures passed
> >> by value:
> >
> > I mean now you seem to be talking about it _in general_ which, _in theory_,
> > kills the whole concept of bitmap VMA flags _altogether_ really, or at
> > least any workable version of them.
>
> No, what I'm saying is "understand what the pitfalls are", not
> "don't do it". I think that is what Jason was also getting at.
>
>      Arnd

Ack sure and your input is appreciated :) It's important to kick the tyres
and be aware of possible issues.

Actually I think now I understand where Jason's coming from - the by-value
cases will be const value for the most part - which should make life MUCH
easier for the compiler and avoid a lot of the issues you raised.

So _hopefully_ we're mitigated. Again as I said, in cases where we might
not be, I will take action to figure out workarounds.

I'm excited by the proposed approach in general (+ again thanks to Jason for
opening my eyes to the possibility in the first place), so perhaps a
_little_ defensive, as it allows for a like-for-like replacement generally
which should HUGELY speed up + simplify the transition :)

Cheers, Lorenzo

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Lorenzo Stoakes @ 2026-01-20 16:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jarkko Sakkinen, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Christian Koenig, Huang Rui, Matthew Auld,
	Matthew Brost, Alexander Viro, Christian Brauner, Jan Kara,
	Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Mike Marshall, Martin Brandenburg, Tony Luck,
	Reinette Chatre, Dave Martin, James Morse, Babu Moger,
	Carlos Maiolino, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	Matthew Wilcox, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Zi Yan, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Jann Horn, Pedro Falcato, David Howells, Paul Moore,
	James Morris, Serge E . Hallyn, Yury Norov, Rasmus Villemoes,
	linux-sgx, linux-kernel, nvdimm, linux-cxl, dri-devel, intel-gfx,
	linux-fsdevel, linux-aio, linux-erofs, linux-ext4, linux-mm,
	ntfs3, devel, linux-xfs, keyrings, linux-security-module
In-Reply-To: <20260120152245.GC1134360@nvidia.com>

On Tue, Jan 20, 2026 at 11:22:45AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 20, 2026 at 03:10:54PM +0000, Lorenzo Stoakes wrote:
> > The natural implication of what you're saying is that we can no longer use this
> > from _anywhere_ because - hey - passing this by value is bad so now _everything_
> > has to be re-written as:
>
> No, I'm not saying that, I'm saying this specific case where you are
> making an accessor to reach an unknown value located on the heap

OK it would have been helpful for you to say that! Sometimes reviews feel
like a ratcheting series of 'what do you actually mean?'s... :)

> should be using a pointer as both a matter of style and to simplify
> life for the compiler.

OK fine.

>
> > 	vma_flags_t flags_to_set = mk_vma_flags(<flags>);
> >
> > 	if (vma_flags_test(&flags, &flags_to_set)) { ... }
>
> This is quite a different situation, it is a known const at compile
> time value located on the stack.

Well as a const time thing it'll be optimised to just a value assuming
nothing changes flags_to_set in the meantime. You'd hope.

Note that we have xxx_mask() variants, such that you can do, e.g.:

	vma_flags_t flags1 = mk_vma_flags(...);
	vma_flags_t flags2 = mk_vma_flags(...);

	if (vma_flags_test_mask(flags1, flags2)) {
		...
	}

ASIDE ->
	NOTE: A likely use of this, and one I already added is so we can do
	e.g.:

	#define VMA_REMAP_FLAGS mk_vma_flags(VMA_IO_BIT, VMA_PFNMAP_BIT, \
		VMA_DONTEXPAND_BIT, VMA_DONTDUMP_BIT)

	...

	if (vma_flags_test_mask(flags, VMA_REMAP_FLAGS)) { ... }

	Which would be effectively a const input anyway.
<- ASIDE

Or in a world where flags1 is a const pointer now:

	if (vma_flags_test_mask(&flags1, flags2)) { ... }

Which makes the form... kinda weird. Then again it's consistent with other
forms which update flags1, ofc we name this separately, e.g. flags, to_test
or flags, to_set so I guess not such a problem.

Now, nobody is _likely_ to do e.g.:

	if (vma_flags_test_mask(&vma1->flags, vma2->flags)) { ... }

In this situation, but they could.

However perhaps having one value pass-by-const-pointer and the other
by-value essentially documents the fact you're being dumb.

And if somebody really needs something like this (not sure why) we could
add something.

But yeah ok, I'll change this. It's more than this case, it's also all the
test stuff, but it shouldn't be a really huge change.

>
> > If it was just changing this one function I'd still object as it makes it differ
> > from _every other test predicate_ using vma_flags_t but maybe to humour you I'd
> > change it, but surely by this argument you're essentially objecting to the whole
> > series?
>
> I only think that if you are taking a heap input that is not of known
> value you should continue to pass by pointer as is generally expected
> in the C style we use.

Ack.

>
> And it isn't saying anything about the overall technique in the
> series, just a minor note about style.

OK good, though Arnd's reply feels more like a comment on the latter,
though only really doing pass-by-value for const values (in nearly all sane
cases) should hopefully mitigate.

>
> > I am not sure about this 'idiomatic kernel style' thing either, it feels rather
> > conjured. Yes you wouldn't ordinarily pass something larger than a register size
> > by-value, but here the intent is for it to be inlined anyway right?
>
> Well, exactly, we don't normally pass things larger than an integer
> by value, that isn't the style, so I don't think it is such a great
> thing to introduce here kind of unnecessarily.
>
> The troubles I recently had were linked to odd things like gcov and
> very old still supported versions of gcc. Also I saw a power compiler
> make a very strange choice to not inline something that evaluated to a
> constant.

Right ok.

>
> Jason

Thanks, Lorenzo
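
The mixed form discussed above (flags by const pointer, mask by value) can be
sketched in plain C. Everything here is hypothetical: demo_flags_t,
demo_mk_flags() and demo_flags_test_mask() are illustrative names, not the
helpers from the series, and this sketch assumes "test mask" means "all mask
bits are set", which the thread does not spell out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_BITS		128
#define DEMO_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_WORDS \
	((DEMO_NR_BITS + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

/* Hypothetical bitmap flags type, wider than one register. */
typedef struct {
	unsigned long words[DEMO_NR_WORDS];
} demo_flags_t;

/* Build a flags value from an array of bit numbers. */
static inline demo_flags_t demo_mk_flags_arr(const int *bits, size_t n)
{
	demo_flags_t f = { { 0 } };

	for (size_t i = 0; i < n; i++)
		f.words[bits[i] / DEMO_BITS_PER_WORD] |=
			1UL << (bits[i] % DEMO_BITS_PER_WORD);
	return f;
}

/* Variadic wrapper so callers can list bit numbers directly. */
#define demo_mk_flags(...) \
	demo_mk_flags_arr((const int[]){ __VA_ARGS__ }, \
			  sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int))

/*
 * Flags by const pointer (may live on the heap), mask by value
 * (typically a constant built at the call site).
 * Returns true only if every bit in mask is set in *flags.
 */
static inline bool demo_flags_test_mask(const demo_flags_t *flags,
					demo_flags_t mask)
{
	for (size_t i = 0; i < DEMO_NR_WORDS; i++)
		if ((flags->words[i] & mask.words[i]) != mask.words[i])
			return false;
	return true;
}
```

A call then reads demo_flags_test_mask(&obj->flags, demo_mk_flags(1, 70)),
which keeps the "unknown value behind a pointer, known constant by value"
split visible in the signature.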

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Arnd Bergmann @ 2026-01-20 16:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jason Gunthorpe, Andrew Morton, Jarkko Sakkinen, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Greg Kroah-Hartman, Dan Williams, Vishal Verma,
	Dave Jiang, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Dave Airlie, Simona Vetter, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Christian König, Huang Rui,
	Matthew Auld, Matthew Brost, Alexander Viro, Christian Brauner,
	Jan Kara, Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador,
	David Hildenbrand (Red Hat), Konstantin Komarov, Mike Marshall,
	Martin Brandenburg, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, Carlos Maiolino, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Matthew Wilcox, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Zi Yan, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Jann Horn, Pedro Falcato,
	David Howells, Paul Moore, James Morris, Serge E. Hallyn,
	Yury Norov, Rasmus Villemoes, linux-sgx, linux-kernel, nvdimm,
	linux-cxl, dri-devel, intel-gfx, linux-fsdevel, linux-aio,
	linux-erofs, linux-ext4, linux-mm, ntfs3, devel, linux-xfs,
	keyrings, linux-security-module
In-Reply-To: <44461883-a75c-466b-a278-97c4ab46b461@lucifer.local>

On Tue, Jan 20, 2026, at 17:22, Lorenzo Stoakes wrote:
> On Tue, Jan 20, 2026 at 05:00:28PM +0100, Arnd Bergmann wrote:
>> On Tue, Jan 20, 2026, at 16:10, Lorenzo Stoakes wrote:
>> >
>> > It strikes me that the key optimisation here is the inlining, now if the issue
>> > is that ye olde compiler might choose not to inline very small functions (seems
>> > unlikely) we could always throw in an __always_inline?
>>
>> I can think of three specific things going wrong with structures passed
>> by value:
>
> I mean now you seem to be talking about it _in general_ which, _in theory_,
> kills the whole concept of bitmap VMA flags _altogether_ really, or at
> least any workable version of them.

No, what I'm saying is "understand what the pitfalls are", not
"don't do it". I think that is what Jason was also getting at.

     Arnd

* [PATCH 6.12.y] xfs: set max_agbno to allow sparse alloc of last full inode chunk
From: Brian Foster @ 2026-01-20 16:40 UTC (permalink / raw)
  To: stable; +Cc: linux-xfs, Darrick J. Wong, Carlos Maiolino
In-Reply-To: <2026012006-doorway-print-237d@gregkh>

Sparse inode cluster allocation sets min/max agbno values to avoid
allocating an inode cluster that might map to an invalid inode
chunk. For example, we can't have an inode record mapped to agbno 0
or that extends past the end of a runt AG of misaligned size.

The initial calculation of max_agbno is unnecessarily conservative,
however. This has triggered a corner case allocation failure where a
small runt AG (i.e. 2063 blocks) is mostly full save for an extent
to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this
case, which happens to be the offset of the last possible valid
inode chunk in the AG. In practice, we should be able to allocate
the 4-block cluster at agbno 2052 to map to the parent inode record
at agbno 2048, but the max_agbno value precludes it.

Note that this can result in filesystem shutdown via dirty trans
cancel on stable kernels prior to commit 9eb775968b68 ("xfs: walk
all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because
the tail AG selection by the allocator sets t_highest_agno on the
transaction. If the inode allocator spins around and finds an inode
chunk with free inodes in an earlier AG, the subsequent dir name
creation path may still fail to allocate due to the AG restriction
and cancel.

To avoid this problem, update the max_agbno calculation to the agbno
prior to the last chunk aligned agbno in the AG. This is not
necessarily the last valid allocation target for a sparse chunk, but
since inode chunks (i.e. records) are chunk aligned and sparse
allocs are cluster sized/aligned, this allows the sb_spino_align
alignment restriction to take over and round down the max effective
agbno to within the last valid inode chunk in the AG.

Note that even though the allocator improvements in the
aforementioned commit seem to avoid this particular dirty trans
cancel situation, the max_agbno logic improvement still applies as
we should be able to allocate from an AG that has been appropriately
selected. The more important target for this patch however are
older/stable kernels prior to this allocator rework/improvement.

Cc: stable@vger.kernel.org # v4.2
Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
(cherry picked from commit c360004c0160dbe345870f59f24595519008926f)
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 6258527315f2..8223464e23e7 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -850,15 +850,16 @@ xfs_ialloc_ag_alloc(
 		 * invalid inode records, such as records that start at agbno 0
 		 * or extend beyond the AG.
 		 *
-		 * Set min agbno to the first aligned, non-zero agbno and max to
-		 * the last aligned agbno that is at least one full chunk from
-		 * the end of the AG.
+		 * Set min agbno to the first chunk aligned, non-zero agbno and
+		 * max to one less than the last chunk aligned agbno from the
+		 * end of the AG. We subtract 1 from max so that the cluster
+		 * allocation alignment takes over and allows allocation within
+		 * the last full inode chunk in the AG.
 		 */
 		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
 		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
 							pag->pag_agno),
-					    args.mp->m_sb.sb_inoalignmt) -
-				 igeo->ialloc_blks;
+					    args.mp->m_sb.sb_inoalignmt) - 1;

 		error = xfs_alloc_vextent_near_bno(&args,
 				XFS_AGB_TO_FSB(args.mp, pag->pag_agno,
-- 
2.52.0
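
The max_agbno arithmetic from the commit message above can be checked with a
small standalone sketch. The geometry is an assumption chosen to reproduce the
quoted numbers: an alignment of 8 blocks, a full inode chunk of 8 blocks, and a
sparse cluster alignment of 4 blocks. The demo_* helpers are hypothetical and
only mirror the two expressions in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static uint32_t demo_round_down(uint32_t x, uint32_t align)
{
	return x & ~(align - 1);
}

/* Old calculation: last aligned agbno minus one full inode chunk. */
static uint32_t demo_old_max_agbno(uint32_t agblocks, uint32_t inoalignmt,
				   uint32_t ialloc_blks)
{
	return demo_round_down(agblocks, inoalignmt) - ialloc_blks;
}

/* New calculation: one less than the last chunk aligned agbno. */
static uint32_t demo_new_max_agbno(uint32_t agblocks, uint32_t inoalignmt)
{
	return demo_round_down(agblocks, inoalignmt) - 1;
}
```

For a 2063-block AG under these assumptions, the old limit is 2048, which
excludes a cluster at agbno 2052. The new limit is 2055, and rounding that down
to the 4-block cluster alignment gives 2052, inside the chunk that starts at
2048.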

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Lorenzo Stoakes @ 2026-01-20 16:22 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason Gunthorpe, Andrew Morton, Jarkko Sakkinen, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Greg Kroah-Hartman, Dan Williams, Vishal Verma,
	Dave Jiang, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Dave Airlie, Simona Vetter, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Christian König, Huang Rui,
	Matthew Auld, Matthew Brost, Alexander Viro, Christian Brauner,
	Jan Kara, Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador,
	David Hildenbrand (Red Hat), Konstantin Komarov, Mike Marshall,
	Martin Brandenburg, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, Carlos Maiolino, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Matthew Wilcox, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Zi Yan, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Jann Horn, Pedro Falcato,
	David Howells, Paul Moore, James Morris, Serge E. Hallyn,
	Yury Norov, Rasmus Villemoes, linux-sgx, linux-kernel, nvdimm,
	linux-cxl, dri-devel, intel-gfx, linux-fsdevel, linux-aio,
	linux-erofs, linux-ext4, linux-mm, ntfs3, devel, linux-xfs,
	keyrings, linux-security-module
In-Reply-To: <1617ac60-6261-483d-aeb5-13aba5f477af@app.fastmail.com>

On Tue, Jan 20, 2026 at 05:00:28PM +0100, Arnd Bergmann wrote:
> On Tue, Jan 20, 2026, at 16:10, Lorenzo Stoakes wrote:
> > On Tue, Jan 20, 2026 at 09:36:19AM -0400, Jason Gunthorpe wrote:
> >
> > I am not sure about this 'idiomatic kernel style' thing either, it feels rather
> > conjured. Yes you wouldn't ordinarily pass something larger than a register size
> > by-value, but here the intent is for it to be inlined anyway right?
> >
> > It strikes me that the key optimisation here is the inlining, now if the issue
> > is that ye olde compiler might choose not to inline very small functions (seems
> > unlikely) we could always throw in an __always_inline?
>
> I can think of three specific things going wrong with structures passed
> by value:

I mean now you seem to be talking about it _in general_ which, _in theory_,
kills the whole concept of bitmap VMA flags _altogether_ really, or at
least any workable version of them.

But... no.

I'm not going to not do this because of perceived possible issues with ppc
and mips.

It's not reasonable to hold up a necessary change for the future of the
kernel IMO, and we can find workarounds as necessary should anything
problematic actually occur in practice.

I am happy to do so as maintainer of this work :)

>
> - functions that cannot be inlined are bound by the ELF ABI, and
>   several of them require structs to be passed on the stack regardless
>   of the size. Most of the popular architectures seem fine here, but
>   mips and powerpc look like they are affected.

I explicitly checked mips and it seemed fine, but have not gone super deep.

>
> - The larger the struct is, the more architectures are affected.
>   Parts of the amdgpu driver and the bcachefs file system ran into this

bcachefs is not in the kernel. We don't care about out-of-tree stuff by
convention.

amdgpu is more concerning, but...

>   with 64-bit structures passed by value on 32-bit architectures
>   causing horrible codegen even with inlining. I think it's
>   usually fine up to a single register size.

...32-bit kernels are not ones where you would anticipate incredible
performance for one, for another if any significant issues arise we can
look at arch-specific workarounds.

I already have vma_flags_*_word*() helpers to do things 'the old way' in
the worst case. More can be added if and when anything arises.

Again, I don't think we should hold up the rest of the kernel (being able
to transition to not being arbitrarily limited by VMA flag count is very
important) on this basis.

Also I've checked 32-bit code generation which _seemed_ fine at a
glance. Of course again I've not gone super deep on that.

>
> - clang's inlining algorithm works the other way round from gcc's:
>   inlining into the root caller first and sometimes leaving tiny
>   leaf function out of line unless you add __always_inline.

I already __always_inline all pertinent functions so hopefully that should
be no issue.

And for instance the assembly I shared earlier was built using clang, as I
now use clang for _all_ my builds locally.

>
>       Arnd

Thanks, Lorenzo

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Arnd Bergmann @ 2026-01-20 16:00 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jason Gunthorpe
  Cc: Andrew Morton, Jarkko Sakkinen, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, Dave Airlie,
	Simona Vetter, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Christian König, Huang Rui, Matthew Auld,
	Matthew Brost, Alexander Viro, Christian Brauner, Jan Kara,
	Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador,
	David Hildenbrand (Red Hat), Konstantin Komarov, Mike Marshall,
	Martin Brandenburg, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, Carlos Maiolino, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, Matthew Wilcox, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Zi Yan, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Jann Horn, Pedro Falcato,
	David Howells, Paul Moore, James Morris, Serge E. Hallyn,
	Yury Norov, Rasmus Villemoes, linux-sgx, linux-kernel, nvdimm,
	linux-cxl, dri-devel, intel-gfx, linux-fsdevel, linux-aio,
	linux-erofs, linux-ext4, linux-mm, ntfs3, devel, linux-xfs,
	keyrings, linux-security-module
In-Reply-To: <488a0fd8-5d64-4907-873b-60cefee96979@lucifer.local>

On Tue, Jan 20, 2026, at 16:10, Lorenzo Stoakes wrote:
> On Tue, Jan 20, 2026 at 09:36:19AM -0400, Jason Gunthorpe wrote:
>
> I am not sure about this 'idiomatic kernel style' thing either, it feels rather
> conjured. Yes you wouldn't ordinarily pass something larger than a register size
> by-value, but here the intent is for it to be inlined anyway right?
>
> It strikes me that the key optimisation here is the inlining, now if the issue
> is that ye olde compiler might choose not to inline very small functions (seems
> unlikely) we could always throw in an __always_inline?

I can think of three specific things going wrong with structures passed
by value:

- functions that cannot be inlined are bound by the ELF ABI, and
  several of them require structs to be passed on the stack regardless
  of the size. Most of the popular architectures seem fine here, but
  mips and powerpc look like they are affected.

- The larger the struct is, the more architectures are affected.
  Parts of the amdgpu driver and the bcachefs file system ran into this
  with 64-bit structures passed by value on 32-bit architectures
  causing horrible codegen even with inlining. I think it's
  usually fine up to a single register size.

- clang's inlining algorithm works the other way round from gcc's:
  inlining into the root caller first and sometimes leaving tiny
  leaf function out of line unless you add __always_inline.

      Arnd
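
As a rough illustration of the first and third points above, the two calling
styles can be put side by side. This is a sketch under assumptions: struct
demo_flags and both helpers are hypothetical, and the attributes are the
GCC/clang spellings rather than the kernel's __always_inline and noinline
macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-word flags struct: wider than a 32-bit register. */
struct demo_flags {
	uint32_t lo;
	uint32_t hi;
};

/*
 * By value: forced inline, so no struct copy has to cross a real
 * call boundary where the ABI would dictate how it is passed.
 */
static inline __attribute__((always_inline)) bool
demo_test_by_value(struct demo_flags flags, struct demo_flags mask)
{
	return ((flags.lo & mask.lo) | (flags.hi & mask.hi)) != 0;
}

/*
 * By const pointer: deliberately out of line. Only two addresses are
 * passed, whatever the ABI says about structs.
 */
static __attribute__((noinline)) bool
demo_test_by_ptr(const struct demo_flags *flags,
		 const struct demo_flags *mask)
{
	return ((flags->lo & mask->lo) | (flags->hi & mask->hi)) != 0;
}
```

Both helpers compute the same "any bit in common" result. The difference is
only in what the compiler has to do if the call is not inlined.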

* Re: [PATCH v7] xfs: add FALLOC_FL_WRITE_ZEROES to XFS code base
From: Darrick J. Wong @ 2026-01-20 15:57 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, hch, lukas
In-Reply-To: <20260120132056.534646-2-cem@kernel.org>

On Tue, Jan 20, 2026 at 02:20:50PM +0100, cem@kernel.org wrote:
> From: Lukas Herbolt <lukas@herbolt.com>
> 
> Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enables
> the unmap write zeroes operation.
> 
> Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
> [cem: rewrite xfs_falloc_zero_range() bits]
> ---
> 
> Christoph, Darrick, could you please review/ack this patch again? I
> needed to rewrite the xfs_falloc_zero_range() bits, because it
> conflicted with 66d78a11479c and 8dc15b7a6e59. This version aims mostly
> to remove one of the if-else nested levels to keep it a bit cleaner.
> 
> please let me know if you agree with this version, otherwise I'll ask
> Lukas to rebase it on top of the new code.
> 
> Thanks!
> 
>  fs/xfs/xfs_bmap_util.c | 10 ++++++++--
>  fs/xfs/xfs_bmap_util.h |  2 +-
>  fs/xfs/xfs_file.c      | 38 +++++++++++++++++++++++++++-----------
>  3 files changed, 36 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 0ab00615f1ad..74a7597d0998 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -642,11 +642,17 @@ xfs_free_eofblocks(
>  	return error;
>  }
>  
> +/*
> + * Callers can specify bmapi_flags. If XFS_BMAPI_ZERO is used, there are no
> + * further checks whether the hardware supports it, and it can fall back to
> + * software zeroing.
> + */
>  int
>  xfs_alloc_file_space(
>  	struct xfs_inode	*ip,
>  	xfs_off_t		offset,
> -	xfs_off_t		len)
> +	xfs_off_t		len,
> +	uint32_t		bmapi_flags)
>  {
>  	xfs_mount_t		*mp = ip->i_mount;
>  	xfs_off_t		count;
> @@ -748,7 +754,7 @@ xfs_alloc_file_space(
>  		 * will eventually reach the requested range.
>  		 */
>  		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
> -				allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
> +				allocatesize_fsb, bmapi_flags, 0, imapp,
>  				&nimaps);
>  		if (error) {
>  			if (error != -ENOSR)
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index c477b3361630..2895cc97a572 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -56,7 +56,7 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
>  
>  /* preallocation and hole punch interface */
>  int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> -		xfs_off_t len);
> +		xfs_off_t len, uint32_t bmapi_flags);
>  int	xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
>  int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index d36a9aafa8ab..b23f1373116e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1302,16 +1302,29 @@ xfs_falloc_zero_range(
>  
>  	if (xfs_falloc_force_zero(ip, ac)) {
>  		error = xfs_zero_range(ip, offset, len, ac, NULL);
> -	} else {
> -		error = xfs_free_file_space(ip, offset, len, ac);
> -		if (error)
> -			return error;
> +		goto out;
> +	}
>  
> -		len = round_up(offset + len, blksize) -
> -			round_down(offset, blksize);
> -		offset = round_down(offset, blksize);
> -		error = xfs_alloc_file_space(ip, offset, len);
> +	error = xfs_free_file_space(ip, offset, len, ac);
> +	if (error)
> +		return error;
> +
> +	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> +	offset = round_down(offset, blksize);
> +
> +	if (mode & FALLOC_FL_WRITE_ZEROES) {
> +		if (xfs_is_always_cow_inode(ip) ||
> +		    !bdev_write_zeroes_unmap_sectors(
> +				xfs_inode_buftarg(ip)->bt_bdev))
> +			return -EOPNOTSUPP;

Taking a second look -- this code allows ZERO_RANGE|WRITE_ZEROES to
punch out the file space but then fail with EOPNOTSUPP.  I think if
we're going to error out that way, we should do that at the top of the
function before any changes are made.

--D
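
The ordering requested above, rejecting an unsupported request before
anything is punched out, can be sketched with a toy model. All names here are
hypothetical: struct demo_file, DEMO_WRITE_ZEROES and demo_zero_range() only
model the control flow and are not the XFS functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for FALLOC_FL_WRITE_ZEROES. */
#define DEMO_WRITE_ZEROES	0x1u

/* Toy file state: counts how often each step actually ran. */
struct demo_file {
	bool	always_cow;		/* models xfs_is_always_cow_inode() */
	bool	dev_unmap_write_zeroes;	/* models the bdev capability check */
	int	punched;		/* models xfs_free_file_space() */
	int	allocated;		/* models xfs_alloc_file_space() */
};

static int demo_zero_range(struct demo_file *f, unsigned int mode)
{
	/* Validate first, so a failure leaves the file untouched. */
	if ((mode & DEMO_WRITE_ZEROES) &&
	    (f->always_cow || !f->dev_unmap_write_zeroes))
		return -EOPNOTSUPP;

	/* Only now perform the destructive step and the reallocation. */
	f->punched++;
	f->allocated++;
	return 0;
}
```

On a device without the capability, the call fails with -EOPNOTSUPP and the
punch counter stays at zero, which is the property the review comment asks for.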

> +		error = xfs_alloc_file_space(ip, offset, len,
> +					     XFS_BMAPI_ZERO);
> +	} else {
> +		error = xfs_alloc_file_space(ip, offset, len,
> +					     XFS_BMAPI_PREALLOC);
>  	}
> +
> +out:
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> @@ -1336,7 +1349,8 @@ xfs_falloc_unshare_range(
>  	if (error)
>  		return error;
>  
> -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
> +			XFS_BMAPI_PREALLOC);
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> @@ -1364,7 +1378,8 @@ xfs_falloc_allocate_range(
>  	if (error)
>  		return error;
>  
> -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
> +			XFS_BMAPI_PREALLOC);
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> @@ -1374,7 +1389,7 @@ xfs_falloc_allocate_range(
>  		(FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE |	\
>  		 FALLOC_FL_PUNCH_HOLE |	FALLOC_FL_COLLAPSE_RANGE |	\
>  		 FALLOC_FL_ZERO_RANGE |	FALLOC_FL_INSERT_RANGE |	\
> -		 FALLOC_FL_UNSHARE_RANGE)
> +		 FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
>  
>  STATIC long
>  __xfs_file_fallocate(
> @@ -1417,6 +1432,7 @@ __xfs_file_fallocate(
>  	case FALLOC_FL_INSERT_RANGE:
>  		error = xfs_falloc_insert_range(file, offset, len);
>  		break;
> +	case FALLOC_FL_WRITE_ZEROES:
>  	case FALLOC_FL_ZERO_RANGE:
>  		error = xfs_falloc_zero_range(file, mode, offset, len, ac);
>  		break;
> -- 
> 2.52.0
> 
> 

* Re: [PATCH] xfs: always allocate the free zone with the lowest index
From: Darrick J. Wong @ 2026-01-20 15:53 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: linux-xfs, Carlos Maiolino, Dave Chinner, Christoph Hellwig,
	dlemoal, johannes.thumshirn
In-Reply-To: <20260120085746.29980-1-hans.holmberg@wdc.com>

On Tue, Jan 20, 2026 at 09:57:46AM +0100, Hans Holmberg wrote:
> Zones in the beginning of the address space are typically mapped to
> higher bandwidth tracks on HDDs than those at the end of the address
> space. So, instead of allocating zones "round robin" across the whole
> address space, always allocate the zone with the lowest index.

Does it make any difference if it's a zoned ssd?  I'd imagine not, but I
wonder if there are any longer term side effects like lower-numbered
zones filling up and getting gc'd more often?

--D
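
The policy change described in the quoted commit message can be sketched with
a plain array of free-zone markers. This is an illustration only: the
demo_pick_* helpers and DEMO_NO_ZONE are hypothetical, and the real code walks
a marked xarray under a lock rather than a bool array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NO_ZONE	((size_t)-1)

/* New policy: always return the free zone with the lowest index. */
static size_t demo_pick_lowest(const bool *free_zone, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (free_zone[i])
			return i;
	return DEMO_NO_ZONE;
}

/*
 * Old policy, roughly: search from the cursor to the end, then wrap
 * around to the start of the address space.
 */
static size_t demo_pick_from_cursor(const bool *free_zone, size_t nr,
				    size_t cursor)
{
	for (size_t n = 0; n < nr; n++) {
		size_t i = (cursor + n) % nr;

		if (free_zone[i])
			return i;
	}
	return DEMO_NO_ZONE;
}
```

With zones 1, 3 and 4 free and the cursor at 2, the cursor-based search
returns zone 3 while the lowest-index search returns zone 1, so the new policy
keeps pulling allocations toward the start of the address space.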

> This increases average write bandwidth for overwrite workloads
> when less than the full capacity is being used. At ~50% utilization
> this improves bandwidth for a random file overwrite benchmark
> with 128MiB files and 256MiB zone capacity by 30%.
> 
> Running the same benchmark with small 2-8 MiB files at 67% capacity
> shows no significant difference in performance. Due to heavy
> fragmentation the whole zone range is in use, greatly limiting the
> number of free zones with high bandwidth.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> ---
> 
>  fs/xfs/xfs_zone_alloc.c | 47 +++++++++++++++--------------------------
>  fs/xfs/xfs_zone_priv.h  |  1 -
>  2 files changed, 17 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
> index bbcf21704ea0..d6c97026f733 100644
> --- a/fs/xfs/xfs_zone_alloc.c
> +++ b/fs/xfs/xfs_zone_alloc.c
> @@ -408,31 +408,6 @@ xfs_zone_free_blocks(
>  	return 0;
>  }
>  
> -static struct xfs_group *
> -xfs_find_free_zone(
> -	struct xfs_mount	*mp,
> -	unsigned long		start,
> -	unsigned long		end)
> -{
> -	struct xfs_zone_info	*zi = mp->m_zone_info;
> -	XA_STATE		(xas, &mp->m_groups[XG_TYPE_RTG].xa, start);
> -	struct xfs_group	*xg;
> -
> -	xas_lock(&xas);
> -	xas_for_each_marked(&xas, xg, end, XFS_RTG_FREE)
> -		if (atomic_inc_not_zero(&xg->xg_active_ref))
> -			goto found;
> -	xas_unlock(&xas);
> -	return NULL;
> -
> -found:
> -	xas_clear_mark(&xas, XFS_RTG_FREE);
> -	atomic_dec(&zi->zi_nr_free_zones);
> -	zi->zi_free_zone_cursor = xg->xg_gno;
> -	xas_unlock(&xas);
> -	return xg;
> -}
> -
>  static struct xfs_open_zone *
>  xfs_init_open_zone(
>  	struct xfs_rtgroup	*rtg,
> @@ -472,13 +447,25 @@ xfs_open_zone(
>  	bool			is_gc)
>  {
>  	struct xfs_zone_info	*zi = mp->m_zone_info;
> +	XA_STATE		(xas, &mp->m_groups[XG_TYPE_RTG].xa, 0);
>  	struct xfs_group	*xg;
>  
> -	xg = xfs_find_free_zone(mp, zi->zi_free_zone_cursor, ULONG_MAX);
> -	if (!xg)
> -		xg = xfs_find_free_zone(mp, 0, zi->zi_free_zone_cursor);
> -	if (!xg)
> -		return NULL;
> +	/*
> +	 * Pick the free zone with lowest index. Zones in the beginning of the
> +	 * address space typically provide higher bandwidth than those at the
> +	 * end of the address space on HDDs.
> +	 */
> +	xas_lock(&xas);
> +	xas_for_each_marked(&xas, xg, ULONG_MAX, XFS_RTG_FREE)
> +		if (atomic_inc_not_zero(&xg->xg_active_ref))
> +			goto found;
> +	xas_unlock(&xas);
> +	return NULL;
> +
> +found:
> +	xas_clear_mark(&xas, XFS_RTG_FREE);
> +	atomic_dec(&zi->zi_nr_free_zones);
> +	xas_unlock(&xas);
>  
>  	set_current_state(TASK_RUNNING);
>  	return xfs_init_open_zone(to_rtg(xg), 0, write_hint, is_gc);
> diff --git a/fs/xfs/xfs_zone_priv.h b/fs/xfs/xfs_zone_priv.h
> index ce7f0e2f4598..8fbf9a52964e 100644
> --- a/fs/xfs/xfs_zone_priv.h
> +++ b/fs/xfs/xfs_zone_priv.h
> @@ -72,7 +72,6 @@ struct xfs_zone_info {
>  	/*
>  	 * Free zone search cursor and number of free zones:
>  	 */
> -	unsigned long		zi_free_zone_cursor;
>  	atomic_t		zi_nr_free_zones;
>  
>  	/*
> -- 
> 2.40.1
> 
> 
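
For readers following along, here is a minimal userspace sketch of the two
allocation policies being compared. It is only a model: the zone_map
structure and both function names are hypothetical, a plain bool array stands
in for the XFS_RTG_FREE xarray mark, and there is no locking or reference
counting as in the real xfs_open_zone().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_ZONE	((size_t)-1)

/* Hypothetical stand-in for the per-zone XFS_RTG_FREE mark. */
struct zone_map {
	bool	*free;		/* free[i] is true when zone i is free */
	size_t	nr_zones;
	size_t	cursor;		/* only used by the round-robin policy */
};

/* Old policy: search from the cursor to the end, then wrap to zone 0. */
static size_t zone_alloc_round_robin(struct zone_map *zm)
{
	assert(zm && zm->free);

	for (size_t n = 0; n < zm->nr_zones; n++) {
		size_t i = (zm->cursor + n) % zm->nr_zones;

		if (zm->free[i]) {
			zm->free[i] = false;
			zm->cursor = i;	/* next search starts here */
			return i;
		}
	}
	return NO_ZONE;
}

/* New policy: always take the free zone with the lowest index. */
static size_t zone_alloc_lowest(struct zone_map *zm)
{
	assert(zm && zm->free);

	for (size_t i = 0; i < zm->nr_zones; i++) {
		if (zm->free[i]) {
			zm->free[i] = false;
			return i;
		}
	}
	return NO_ZONE;
}
```

In this model, once zone 0 is freed again the lowest-index policy hands it
out immediately, while the round-robin policy keeps moving toward the end of
the address space until it wraps. That also shows why the question about
low-numbered zones being reused more often is a fair one.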

* Re: [PATCH v2 1/2] fs: add FS_XFLAG_VERITY for fs-verity files
From: Darrick J. Wong @ 2026-01-20 15:51 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: linux-xfs, fstests, ebiggers
In-Reply-To: <20260119165644.2945008-2-aalbersh@kernel.org>

On Mon, Jan 19, 2026 at 05:56:42PM +0100, Andrey Albershteyn wrote:
> fs-verity introduced an inode flag for inodes that have fs-verity
> enabled. This patch adds the FS_XFLAG_VERITY file attribute, which can be
> retrieved with the FS_IOC_FSGETXATTR ioctl() and the file_getattr() syscall.
> 
> This flag is read-only and cannot be set with the corresponding set ioctl()
> or file_setattr(). FS_IOC_SETFLAGS requires the file to be opened for
> writing, which is not allowed for verity files. FS_IOC_FSSETXATTR and
> file_setattr() clear this flag from the user input.
> 
> As this is now a common flag for both flag interfaces (flags/xflags), add
> it to the overlapping flags list to exclude it from overwrite.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

Technically this uapi change should be cc'd to linux-api, but adding
a flag definition is fairly minor so:

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  Documentation/filesystems/fsverity.rst | 16 ++++++++++++++++
>  fs/file_attr.c                         |  4 ++++
>  include/linux/fileattr.h               |  6 +++---
>  include/uapi/linux/fs.h                |  1 +
>  4 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
> index 412cf11e3298..22b49b295d1f 100644
> --- a/Documentation/filesystems/fsverity.rst
> +++ b/Documentation/filesystems/fsverity.rst
> @@ -341,6 +341,22 @@ the file has fs-verity enabled.  This can perform better than
>  FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
>  opening the file, and opening verity files can be expensive.
>  
> +FS_IOC_FSGETXATTR
> +-----------------
> +
> +Since Linux v7.0, the FS_IOC_FSGETXATTR ioctl sets FS_XFLAG_VERITY (0x00020000)
> +in the returned flags when the file has verity enabled. Note that this attribute
> +cannot be set with FS_IOC_FSSETXATTR as enabling verity requires input
> +parameters. See FS_IOC_ENABLE_VERITY.
> +
> +file_getattr
> +------------
> +
> +Since Linux v7.0, the file_getattr() syscall sets FS_XFLAG_VERITY (0x00020000)
> +in the returned flags when the file has verity enabled. Note that this attribute
> +cannot be set with file_setattr() as enabling verity requires input parameters.
> +See FS_IOC_ENABLE_VERITY.
> +
>  .. _accessing_verity_files:
>  
>  Accessing verity files
> diff --git a/fs/file_attr.c b/fs/file_attr.c
> index 13cdb31a3e94..f44c873af92b 100644
> --- a/fs/file_attr.c
> +++ b/fs/file_attr.c
> @@ -37,6 +37,8 @@ void fileattr_fill_xflags(struct file_kattr *fa, u32 xflags)
>  		fa->flags |= FS_DAX_FL;
>  	if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
>  		fa->flags |= FS_PROJINHERIT_FL;
> +	if (fa->fsx_xflags & FS_XFLAG_VERITY)
> +		fa->flags |= FS_VERITY_FL;
>  }
>  EXPORT_SYMBOL(fileattr_fill_xflags);
>  
> @@ -67,6 +69,8 @@ void fileattr_fill_flags(struct file_kattr *fa, u32 flags)
>  		fa->fsx_xflags |= FS_XFLAG_DAX;
>  	if (fa->flags & FS_PROJINHERIT_FL)
>  		fa->fsx_xflags |= FS_XFLAG_PROJINHERIT;
> +	if (fa->flags & FS_VERITY_FL)
> +		fa->fsx_xflags |= FS_XFLAG_VERITY;
>  }
>  EXPORT_SYMBOL(fileattr_fill_flags);
>  
> diff --git a/include/linux/fileattr.h b/include/linux/fileattr.h
> index f89dcfad3f8f..3780904a63a6 100644
> --- a/include/linux/fileattr.h
> +++ b/include/linux/fileattr.h
> @@ -7,16 +7,16 @@
>  #define FS_COMMON_FL \
>  	(FS_SYNC_FL | FS_IMMUTABLE_FL | FS_APPEND_FL | \
>  	 FS_NODUMP_FL |	FS_NOATIME_FL | FS_DAX_FL | \
> -	 FS_PROJINHERIT_FL)
> +	 FS_PROJINHERIT_FL | FS_VERITY_FL)
>  
>  #define FS_XFLAG_COMMON \
>  	(FS_XFLAG_SYNC | FS_XFLAG_IMMUTABLE | FS_XFLAG_APPEND | \
>  	 FS_XFLAG_NODUMP | FS_XFLAG_NOATIME | FS_XFLAG_DAX | \
> -	 FS_XFLAG_PROJINHERIT)
> +	 FS_XFLAG_PROJINHERIT | FS_XFLAG_VERITY)
>  
>  /* Read-only inode flags */
>  #define FS_XFLAG_RDONLY_MASK \
> -	(FS_XFLAG_PREALLOC | FS_XFLAG_HASATTR)
> +	(FS_XFLAG_PREALLOC | FS_XFLAG_HASATTR | FS_XFLAG_VERITY)
>  
>  /* Flags to indicate valid value of fsx_ fields */
>  #define FS_XFLAG_VALUES_MASK \
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 66ca526cf786..70b2b661f42c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -253,6 +253,7 @@ struct file_attr {
>  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> +#define FS_XFLAG_VERITY		0x00020000	/* fs-verity enabled */
>  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
>  
>  /* the read-only stuff doesn't really belong here, but any other place is
> -- 
> 2.52.0
> 
> 
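
As an illustration of how userspace might consume the new flag, here is a
minimal sketch that queries a file with FS_IOC_FSGETXATTR and tests the bit.
The helper names xflags_has_verity() and file_has_verity() are made up for
this example, and the fallback #define uses the 0x00020000 value proposed in
the patch above, on the assumption that installed headers may not carry it
yet. On a kernel without the patch the bit would presumably just never be
set.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FS_XFLAG_VERITY
#define FS_XFLAG_VERITY	0x00020000	/* value proposed by the patch */
#endif

/* Pure helper: does an fsx_xflags word carry the verity bit? */
static bool xflags_has_verity(uint32_t xflags)
{
	return (xflags & FS_XFLAG_VERITY) != 0;
}

/*
 * Hypothetical wrapper: returns 1 if the file reports verity, 0 if it
 * does not, and -1 with errno set if the file cannot be queried.
 */
static int file_has_verity(const char *path)
{
	struct fsxattr fsx;
	int fd, ret, saved_errno;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;	/* report the ioctl error, not close()'s */
	if (ret < 0)
		return -1;

	return xflags_has_verity(fsx.fsx_xflags) ? 1 : 0;
}
```

Note that the flag is reported only; per the commit message, passing it back
through FS_IOC_FSSETXATTR or file_setattr() does not enable verity.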

* Re: [PATCH 3/3] xfs: switch (back) to a per-buftarg buffer hash
From: Darrick J. Wong @ 2026-01-20 15:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Carlos Maiolino, Dave Chinner, linux-xfs,
	syzbot+0391d34e801643e2809b
In-Reply-To: <20260120070615.GB3954@lst.de>

On Tue, Jan 20, 2026 at 08:06:15AM +0100, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 06:39:18PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 19, 2026 at 04:31:37PM +0100, Christoph Hellwig wrote:
> > > The per-AG buffer hashes were added when all buffer lookups took a
> > > per-hash lock.  Since then we've made lookups entirely lockless and
> > > removed the need for a hash-wide lock for inserts and removals as
> > > well.  With this there is no need to shard the hash, so reduce the
> > > used resources by using a per-buftarg hash for all buftargs.
> > 
> > Hey, not having all the per-ag buffer cache sounds neat!
> > 
> > > Long after writing this initially, syzbot found a problem in the
> > > buffer cache teardown order, which this happens to fix as well.
> > 
> > What did we get wrong, specifically?
> 
> Dave has a really good analysis here:
> 
> https://lore.kernel.org/linux-xfs/aLeUdemAZ5wmtZel@dread.disaster.area/

Can you Link: to that in this patch?  If I'm reading Linus' most recent
exposition correctly, links to other threads are still allowed.

> > Also: Is there a simpler fix for this bug that we can stuff into old lts
> > kernels?
> 
> I can't really think of anything much simpler, just different.  It would
> require some careful reordering of the unmount path, which is always
> hairy.
> 
> > Or is this fix independent of the b_hold and lockref changes
> > in the previous patches?
> 
> In theory it is, except that the old tricks with the refcount would
> make it very difficult.  I tried to reorder this twice and failed
> both times.

<nod> I figured that might be the case seeing as you cleaned up the
confusing b_hold rules.

--D

* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Jason Gunthorpe @ 2026-01-20 15:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jarkko Sakkinen, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Christian Koenig, Huang Rui, Matthew Auld,
	Matthew Brost, Alexander Viro, Christian Brauner, Jan Kara,
	Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Mike Marshall, Martin Brandenburg, Tony Luck,
	Reinette Chatre, Dave Martin, James Morse, Babu Moger,
	Carlos Maiolino, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	Matthew Wilcox, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Zi Yan, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Jann Horn, Pedro Falcato, David Howells, Paul Moore,
	James Morris, Serge E . Hallyn, Yury Norov, Rasmus Villemoes,
	linux-sgx, linux-kernel, nvdimm, linux-cxl, dri-devel, intel-gfx,
	linux-fsdevel, linux-aio, linux-erofs, linux-ext4, linux-mm,
	ntfs3, devel, linux-xfs, keyrings, linux-security-module
In-Reply-To: <488a0fd8-5d64-4907-873b-60cefee96979@lucifer.local>

On Tue, Jan 20, 2026 at 03:10:54PM +0000, Lorenzo Stoakes wrote:
> The natural implication of what you're saying is that we can no longer use this
> from _anywhere_ because - hey - passing this by value is bad so now _everything_
> has to be re-written as:

No, I'm not saying that, I'm saying this specific case where you are
making an accessor to reach an unknown value located on the heap
should be using a pointer as both a matter of style and to simplify
life for the compiler.

> 	vma_flags_t flags_to_set = mk_vma_flags(<flags>);
> 
> 	if (vma_flags_test(&flags, &flags_to_set)) { ... }

This is quite a different situation: it is a value known to be constant
at compile time, located on the stack.

> If it was just changing this one function I'd still object as it makes it differ
> from _every other test predicate_ using vma_flags_t but maybe to humour you I'd
> change it, but surely by this argument you're essentially objecting to the whole
> series?

I only think that if you are taking a heap input that is not of known
value you should continue to pass by pointer as is generally expected
in the C style we use.

And it isn't saying anything about the overall technique in the
series, just a minor note about style.

> I am not sure about this 'idiomatic kernel style' thing either, it feels rather
> conjured. Yes you wouldn't ordinarily pass something larger than a register size
> by-value, but here the intent is for it to be inlined anyway right?

Well, exactly, we don't normally pass things larger than an integer
by value, that isn't the style, so I don't think it is such a great
thing to introduce here kind of unnecessarily.

The troubles I recently had were linked to odd things like gcov and
very old still supported versions of gcc. Also I saw a power compiler
make a very strange choice to not inline something that evaluated to a
constant.

Jason

* Re: [PATCH 3/3] ovl: Use real disk UUID for origin file handles
From: Amir Goldstein @ 2026-01-20 15:12 UTC (permalink / raw)
  To: André Almeida
  Cc: Christoph Hellwig, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Carlos Maiolino,
	Chris Mason, David Sterba, Miklos Szeredi, Christian Brauner,
	Alexander Viro, Jan Kara, linux-nfs, linux-kernel, linux-xfs,
	linux-fsdevel, Qu Wenruo, linux-btrfs, linux-unionfs, kernel-dev,
	vivek, Ludovico de Nittis
In-Reply-To: <75a9247a-12f4-4066-9712-c70ab41c274f@igalia.com>

[-- Attachment #1: Type: text/plain, Size: 7394 bytes --]

On Mon, Jan 19, 2026 at 5:56 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Em 16/01/2026 14:06, Amir Goldstein escreveu:
> > On Fri, Jan 16, 2026 at 2:28 PM André Almeida <andrealmeid@igalia.com> wrote:
> >>
> >> [+CC SteamOS developers]
> >>
> >> Em 16/01/2026 06:55, Amir Goldstein escreveu:
> >>> On Thu, Jan 15, 2026 at 7:55 PM André Almeida <andrealmeid@igalia.com> wrote:
> >>>>
> >>>> Em 15/01/2026 13:07, Amir Goldstein escreveu:
> >>>>> On Thu, Jan 15, 2026 at 4:42 PM André Almeida <andrealmeid@igalia.com> wrote:
> >>>>>>
> >>>>>> Em 15/01/2026 04:23, Christoph Hellwig escreveu:
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>>
> >>>>>>> I still wonder what the use case is here.  Looking at André's original
> >>>>>>> mail it states:
> >>>>>>>
> >>>>>>> "However, btrfs mounts may have volatiles UUIDs. When mounting the exact same
> >>>>>>> disk image with btrfs, a random UUID is assigned for the following disks each
> >>>>>>> time they are mounted, stored at temp_fsid and used across the kernel as the
> >>>>>>> disk UUID. `btrfs filesystem show` presents that. Calling statfs() however
> >>>>>>> shows the original (and duplicated) UUID for all disks."
> >>>>>>>
> >>>>>>> and this doesn't even talk about multiple mounts, but looking at
> >>>>>>> device_list_add it seems to only set the temp_fsid flag when
> >>>>>>> same_fsid_diff_dev is set by find_fsid_by_device, which isn't documented
> >>>>>>> well, but does indeed seem to be done transparently when two file systems
> >>>>>>> with the same fsid are mounted.
> >>>>>>>
> >>>>>>> So André, can you confirm this is what you're worried about?  And btrfs
> >>>>>>> developers, I think the main problem is indeed that btrfs simply allows
> >>>>>>> mounting the same fsid twice.  Which is really fatal for anything using
> >>>>>>> the fsid/uuid, such as NFS exports, mount by fs uuid or any sb->s_uuid user.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I would like to be able to mount two cloned btrfs images and to
> >>>>>> use overlayfs with them. This is useful for the SteamOS A/B partition scheme.
> >>>>>>
> >>>>>>>> If so, I think it's time to revert the behavior before it's too late.
> >>>>>>>> Currently the main usage of such duplicated fsids is for Steam deck to
> >>>>>>>> maintain A/B partitions, I think they can accept a new compat_ro flag for
> >>>>>>>> that.
> >>>>>>>
> >>>>>>> What's an A/B partition?  And how are these safely used at the same time?
> >>>>>>>
> >>>>>>
> >>>>>> The Steam Deck has two main partitions to install SteamOS updates
> >>>>>> atomically. When you want to update the device, assuming that you are
> >>>>>> using partition A, the updater will write the new image in partition B,
> >>>>>> and vice versa. Then after the reboot, the system will mount the new
> >>>>>> image on B.
> >>>>>>
> >>>>>
> >>>>> And what do you expect to happen wrt overlayfs when switching from
> >>>>> image A to B?
> >>>>>
> >>>>> What are the origin file handles recorded in overlayfs index from image A
> >>>>> lower worth when the lower image is B?
> >>>>>
> >>>>> Is there any guarantee that file handles are relevant and point to the
> >>>>> same objects?
> >>>>>
> >>>>> The whole point of the overlayfs index feature is that overlayfs inodes
> >>>>> can have a unique id across copy-up.
> >>>>>
> >>>>> Please explain in more details exactly which overlayfs setup you are
> >>>>> trying to do with index feature.
> >>>>>
> >>>>
> >>>> The problem happens _before_ switching from A to B, it happens when
> >>>> trying to install the same image from A on B.
> >>>>
> >>>> During the image installation process, while running in A, the B image
> >>>> will be mounted more than once for some setup steps, and overlayfs is
> >>>> used for this. Because A has the same UUID, each time B is remounted
> >>>> it gets a new UUID, and then the installation scripts fail to mount the
> >>>> image.
> >>>
> >>> Please describe the exact overlayfs setup and specifically,
> >>> is it multi lower or single lower layer setup?
> >>> What reason do you need the overlayfs index for?
> >>> Can you mount with index=off which should relax the hard
> >>> requirement for match with the original lower layer uuid.
> >>>
> >>
> >> The setup has a single lower layer. This is how the mount command looks
> >> like:
> >>
> >> mount -t overlay -o
> >> "lowerdir=${DEV_DIR}/etc,upperdir=${DEV_DIR}/var/lib/overlays/etc/upper,workdir=${DEV_DIR}/var/lib/overlays/etc/work"
> >> none "${DEV_DIR}/etc"
> >>
> >> They would rather not disable index, to avoid mounting the wrong layers
> >> and to avoid corner cases with hardlinks.
> >
> > IIUC you have all the layers on the same fs ($DEV_DIR)?
> >
> > See mount option uuid=off, created for this exact use case:
> >
> > Documentation/filesystems/overlayfs.rst:
> > Note: the mount option uuid=off can be used to replace UUID of the underlying
> > filesystem in file handles with null, and effectively disable UUID checks. This
> > can be useful in case the underlying disk is copied and the UUID of this copy
> > is changed. This is only applicable if all lower/upper/work directories are on
> > the same filesystem, otherwise it will fallback to normal behaviour.
> >
> > commit 5830fb6b54f7167cc7c9d43612eb01c24312c7ca
> > Author: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> > Date:   Tue Oct 13 17:59:54 2020 +0300
> >
> >      ovl: introduce new "uuid=off" option for inodes index feature
> >
> >      This replaces uuid with null in overlayfs file handles and thus relaxes
> >      uuid checks for overlay index feature. It is only possible in case there is
> >      only one filesystem for all the work/upper/lower directories and bare file
> >      handles from this backing filesystem are unique. In other case when we have
> >      multiple filesystems lets just fallback to "uuid=on" which is and
> >      equivalent of how it worked before with all uuid checks.
> >
> >      This is needed when overlayfs is/was mounted in a container with index
> >      enabled ...
> >
> >      If you just change the uuid of the backing filesystem, overlay is not
> >      mounting any more. In Virtuozzo we copy container disks (ploops) when
> >      create the copy of container and we require fs uuid to be unique for a new
> >      container.
> >
> > TBH, I am trying to remember why we require upper/work to be on the
> > same fs as lower for uuid=off,index=on and I can't remember.
> > If this is important I can look into it.
> >
>
> Actually they are not in the same fs, upper and lower are coming from
> different fs', so when trying to mount I get the fallback to
> `uuid=null`. A quick hack circumventing this check makes the mount work.
>
> If you think this is the best way to solve this issue (rather than
> following the VFS helper path for instance),

That's up to you if you want to solve the "all lower layers on same fs"
or want to also allow lower layers on different fs.
The former could be solved by relaxing the ovl rules.

> please let me know how can
> I safely lift this restriction, like maybe adding a new flag for this?

I think the attached patch should work for you and should not
break anything.

It's only sanity tested; tests still need to be written to verify it.

Thanks,
Amir.

[-- Attachment #2: 0001-ovl-relax-requirement-for-uuid-off-index-on.patch --]
[-- Type: text/x-patch, Size: 5560 bytes --]

From 147e88d88b5dfbcdd23aff736e4d381a8af446f6 Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Tue, 20 Jan 2026 15:58:31 +0100
Subject: [PATCH] ovl: relax requirement for uuid=off,index=on

uuid=off,index=on required that all upper/lower directories are on the
same filesystem.

Relax the requirement so that only all the lower directories need to be
on the same filesystem.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/overlayfs.rst |  2 +-
 fs/overlayfs/namei.c                    | 21 +++++++++++++--------
 fs/overlayfs/overlayfs.h                |  2 ++
 fs/overlayfs/super.c                    | 13 +++++--------
 4 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index ab989807a2cb6..d4020eae1deba 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -755,7 +755,7 @@ read-write mount and will result in an error.
 Note: the mount option uuid=off can be used to replace UUID of the underlying
 filesystem in file handles with null, and effectively disable UUID checks. This
 can be useful in case the underlying disk is copied and the UUID of this copy
-is changed. This is only applicable if all lower/upper/work directories are on
+is changed. This is only applicable if all lower directories are on
 the same filesystem, otherwise it will fallback to normal behaviour.
 
 
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index e9a69c95be918..74c514603ac23 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -158,6 +158,18 @@ static struct ovl_fh *ovl_get_fh(struct ovl_fs *ofs, struct dentry *upperdentry,
 	goto out;
 }
 
+bool ovl_uuid_match(struct ovl_fs *ofs, const struct super_block *sb,
+		    const uuid_t *uuid)
+{
+	/*
+	 * Make sure that the stored uuid matches the uuid of the lower
+	 * layer where file handle will be decoded.
+	 * In case of uuid=off option just make sure that stored uuid is null.
+	 */
+	return ovl_origin_uuid(ofs) ? uuid_equal(uuid, &sb->s_uuid) :
+				      uuid_is_null(uuid);
+}
+
 struct dentry *ovl_decode_real_fh(struct ovl_fs *ofs, struct ovl_fh *fh,
 				  struct vfsmount *mnt, bool connected)
 {
@@ -167,14 +179,7 @@ struct dentry *ovl_decode_real_fh(struct ovl_fs *ofs, struct ovl_fh *fh,
 	if (!capable(CAP_DAC_READ_SEARCH))
 		return NULL;
 
-	/*
-	 * Make sure that the stored uuid matches the uuid of the lower
-	 * layer where file handle will be decoded.
-	 * In case of uuid=off option just make sure that stored uuid is null.
-	 */
-	if (ovl_origin_uuid(ofs) ?
-	    !uuid_equal(&fh->fb.uuid, &mnt->mnt_sb->s_uuid) :
-	    !uuid_is_null(&fh->fb.uuid))
+	if (!ovl_uuid_match(ofs, mnt->mnt_sb, &fh->fb.uuid))
 		return NULL;
 
 	bytes = (fh->fb.len - offsetof(struct ovl_fb, fid));
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f9ac9bdde8305..cf10661522106 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -710,6 +710,8 @@ static inline int ovl_check_fh_len(struct ovl_fh *fh, int fh_len)
 	return ovl_check_fb_len(&fh->fb, fh_len - OVL_FH_WIRE_OFFSET);
 }
 
+bool ovl_uuid_match(struct ovl_fs *ofs, const struct super_block *sb,
+		    const uuid_t *uuid);
 struct dentry *ovl_decode_real_fh(struct ovl_fs *ofs, struct ovl_fh *fh,
 				  struct vfsmount *mnt, bool connected);
 int ovl_check_origin_fh(struct ovl_fs *ofs, struct ovl_fh *fh, bool connected,
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index ba9146f22a2cc..8f0ecb4905e93 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -940,7 +940,7 @@ static bool ovl_lower_uuid_ok(struct ovl_fs *ofs, const uuid_t *uuid)
 		 * disable lower file handle decoding on all of them.
 		 */
 		if (ofs->fs[i].is_lower &&
-		    uuid_equal(&ofs->fs[i].sb->s_uuid, uuid)) {
+		    ovl_uuid_match(ofs, ofs->fs[i].sb, uuid)) {
 			ofs->fs[i].bad_uuid = true;
 			return false;
 		}
@@ -952,6 +952,7 @@ static bool ovl_lower_uuid_ok(struct ovl_fs *ofs, const uuid_t *uuid)
 static int ovl_get_fsid(struct ovl_fs *ofs, const struct path *path)
 {
 	struct super_block *sb = path->mnt->mnt_sb;
+	const uuid_t *uuid = ovl_origin_uuid(ofs) ? &sb->s_uuid : &uuid_null;
 	unsigned int i;
 	dev_t dev;
 	int err;
@@ -963,7 +964,7 @@ static int ovl_get_fsid(struct ovl_fs *ofs, const struct path *path)
 			return i;
 	}
 
-	if (!ovl_lower_uuid_ok(ofs, &sb->s_uuid)) {
+	if (!ovl_lower_uuid_ok(ofs, uuid)) {
 		bad_uuid = true;
 		if (ofs->config.xino == OVL_XINO_AUTO) {
 			ofs->config.xino = OVL_XINO_OFF;
@@ -976,8 +977,7 @@ static int ovl_get_fsid(struct ovl_fs *ofs, const struct path *path)
 		}
 		if (warn) {
 			pr_warn("%s uuid detected in lower fs '%pd2', falling back to xino=%s,index=off,nfs_export=off.\n",
-				uuid_is_null(&sb->s_uuid) ? "null" :
-							    "conflicting",
+				uuid_is_null(uuid) ? "null" : "conflicting",
 				path->dentry, ovl_xino_mode(&ofs->config));
 		}
 	}
@@ -1469,10 +1469,7 @@ static int ovl_fill_super_creds(struct fs_context *fc, struct super_block *sb)
 	if (!ovl_upper_mnt(ofs))
 		sb->s_flags |= SB_RDONLY;
 
-	if (!ovl_origin_uuid(ofs) && ofs->numfs > 1) {
-		pr_warn("The uuid=off requires a single fs for lower and upper, falling back to uuid=null.\n");
-		ofs->config.uuid = OVL_UUID_NULL;
-	} else if (ovl_has_fsid(ofs) && ovl_upper_mnt(ofs)) {
+	if (ovl_has_fsid(ofs) && ovl_upper_mnt(ofs)) {
 		/* Use per instance persistent uuid/fsid */
 		ovl_init_uuid_xattr(sb, ofs, &ctx->upper);
 	}
-- 
2.52.0


* Re: [PATCH RESEND 09/12] mm: make vm_area_desc utilise vma_flags_t only
From: Lorenzo Stoakes @ 2026-01-20 15:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jarkko Sakkinen, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Christian Koenig, Huang Rui, Matthew Auld,
	Matthew Brost, Alexander Viro, Christian Brauner, Jan Kara,
	Benjamin LaHaise, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Theodore Ts'o,
	Andreas Dilger, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Mike Marshall, Martin Brandenburg, Tony Luck,
	Reinette Chatre, Dave Martin, James Morse, Babu Moger,
	Carlos Maiolino, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	Matthew Wilcox, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Zi Yan, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Jann Horn, Pedro Falcato, David Howells, Paul Moore,
	James Morris, Serge E . Hallyn, Yury Norov, Rasmus Villemoes,
	linux-sgx, linux-kernel, nvdimm, linux-cxl, dri-devel, intel-gfx,
	linux-fsdevel, linux-aio, linux-erofs, linux-ext4, linux-mm,
	ntfs3, devel, linux-xfs, keyrings, linux-security-module
In-Reply-To: <20260120133619.GZ1134360@nvidia.com>

On Tue, Jan 20, 2026 at 09:36:19AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 20, 2026 at 09:46:05AM +0000, Lorenzo Stoakes wrote:
> > On Mon, Jan 19, 2026 at 07:14:03PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 19, 2026 at 09:19:11PM +0000, Lorenzo Stoakes wrote:
> > > > +static inline bool is_shared_maywrite(vma_flags_t flags)
> > > > +{
> > >
> > > I'm not sure it is ideal to pass this array by value? Seems like it
> > > might invite some negative optimizations since now the compiler has to
> > > optimize away a copy too.
> >
> > I really don't think so? This is inlined and thus collapses to a totally
> > standard vma_flags_test_all() which passes by value anyway.
>
> > Do you have specific examples or evidence the compiler will optimise poorly here
> > on that basis as compared to pass by reference? And pass by reference would
> > necessitate:
>
> I've recently seen enough cases of older compilers and other arches
> making weird choices to be a little concerned. In the above case
> there is no reason not to use a const pointer (and indeed that would
> be the expected idiomatic kernel style), so why take chances is my
> thinking.

With respect Jason, you're going to have to do better than that.

The entire implementation is dependent on passing-by-value.

Right now we can do:

	vma_flags_test(&flags, VMA_READ_BIT, VMA_WRITE_BIT, ...);

Which uses mk_vma_flags() in a macro to generalise to:

	vma_flags_test(&flags, <vma_flags_t value>);

The natural implication of what you're saying is that we can no longer use this
from _anywhere_ because - hey - passing this by value is bad so now _everything_
has to be re-written as:

	vma_flags_t flags_to_set = mk_vma_flags(<flags>);

	if (vma_flags_test(&flags, &flags_to_set)) { ... }

Right?

But is even that ok? Because presumably these compilers can inline, so that is
basically equivalent to what the macro's doing so does that rule out the VMA
bitmap flags concept altogether...

For hand-waved 'old compilers' (ok, people who use old compilers should not
expect optimal code) or 'other arches' (unspecified)?

If it was just changing this one function I'd still object as it makes it differ
from _every other test predicate_ using vma_flags_t but maybe to humour you I'd
change it, but surely by this argument you're essentially objecting to the whole
series?

I find it really strange you're going down this road as it was you who suggested
this approach in the first place and had to convince me the compiler would
manage it!...

Maybe I'm missing something here...

I am not sure about this 'idiomatic kernel style' thing either, it feels rather
conjured. Yes you wouldn't ordinarily pass something larger than a register size
by-value, but here the intent is for it to be inlined anyway right?

It strikes me that the key optimisation here is the inlining, now if the issue
is that ye olde compiler might choose not to inline very small functions (seems
unlikely) we could always throw in an __always_inline?

But it seems rather silly for a one-liner?

If the concern is deeper (not optimising the bitmap operations) then aren't you
saying no to the whole concept of the series?

Out of interest I godbolted a bunch of architectures:

x86-64
riscv
mips
s390x
sparc
arm7 32-bit
loongarch
m68k
xtensa

And found the manual method vs. the pass-by-value macro method were equivalent
in each case as far as I could tell.
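
For anyone who wants to repeat that kind of comparison, a minimal sketch of
the two calling conventions under discussion follows. Everything named demo_*
is hypothetical and is not the vma_flags_t API from the series; it only
assumes a small multi-word bitmap wrapped in a struct, with an "any bit set"
test written once by value and once by const pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_BITS	128
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_WORDS	((DEMO_NR_BITS + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

/* Hypothetical multi-word flags type, wrapped in a struct so it copies. */
typedef struct {
	unsigned long bits[DEMO_NR_WORDS];
} demo_flags_t;

/* Build a flags value with a single bit set. */
static inline demo_flags_t demo_mk_flag(unsigned int bit)
{
	demo_flags_t f = { { 0 } };

	assert(bit < DEMO_NR_BITS);
	f.bits[bit / DEMO_WORD_BITS] |= 1UL << (bit % DEMO_WORD_BITS);
	return f;
}

/* Pass-by-value predicate: true if any bit of mask is set in flags. */
static inline bool demo_test_by_value(demo_flags_t flags, demo_flags_t mask)
{
	for (size_t i = 0; i < DEMO_NR_WORDS; i++)
		if (flags.bits[i] & mask.bits[i])
			return true;
	return false;
}

/* Pass-by-pointer predicate with the same semantics. */
static inline bool demo_test_by_ptr(const demo_flags_t *flags,
				    const demo_flags_t *mask)
{
	for (size_t i = 0; i < DEMO_NR_WORDS; i++)
		if (flags->bits[i] & mask->bits[i])
			return true;
	return false;
}
```

Both predicates return the same answer for the same inputs; whether a given
compiler also emits the same code for them once inlined is exactly the open
question in this thread, and the sketch makes no claim either way.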

In the worst case, if we hit a weirdo case, we can always substitute something
manual; I have all the vma_flags_*word*() stuff available (which I recall you
objecting to...!)

I may have completely the wrong end of the stick here?...

>
> Jason

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH v4 3/3] xfs: adjust handling of a few numerical mount options
From: Andy Shevchenko @ 2026-01-20 14:59 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Andrew Morton, Kees Cook, Carlos Maiolino, Christoph Hellwig,
	linux-xfs, linux-hardening
In-Reply-To: <20260120141229.356513-3-dmantipov@yandex.ru>

On Tue, Jan 20, 2026 at 05:12:29PM +0300, Dmitry Antipov wrote:
> Prefer recently introduced 'memvalue()' over an ad-hoc 'suffix_kstrtoint()'
> and 'suffix_kstrtoull()' to parse and basically validate the values passed
> via 'logbsize', 'allocsize', and 'max_atomic_write' mount options, and
> reject non-power-of-two values passed via the first and second one early
> in 'xfs_fs_parse_param()' rather than in 'xfs_fs_validate_params()'.

...

> -	if (kstrtoint(value, base, &_res))
> -		ret = -EINVAL;
> -	kfree(value);
> -	*res = _res << shift_left_factor;
> -	return ret;

_res is an int; if it is negative, the shift above is UB according to the C
standard. So if this code ever ends up left-shifting a negative number, it is
on a slippery slope (I think it works as intended, but...).

That said, I assume this code was never designed to handle a negative value
in _res.
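
To illustrate the concern, here is a minimal userspace sketch of scaling a
parsed int by a power-of-two suffix without ever left-shifting a negative
value. The helper name demo_scale_int() and its error codes are assumptions
made for the example; it is not taken from the patch or from the XFS code:

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper: compute val << shift into *res, rejecting the
 * inputs for which a plain signed shift would be undefined or overflow.
 * *res is left untouched on error.
 */
static int demo_scale_int(int val, unsigned int shift, int *res)
{
	/* Left-shifting a negative signed value is UB, so refuse it. */
	if (val < 0)
		return -EINVAL;
	/* Conservatively refuse shifts that could reach the sign bit. */
	if (shift >= sizeof(int) * CHAR_BIT - 1)
		return -ERANGE;
	/* Refuse results that would not fit in an int. */
	if (val > (INT_MAX >> shift))
		return -ERANGE;

	/* Shift as unsigned; the checks above guarantee the result fits. */
	*res = (int)((unsigned int)val << shift);
	return 0;
}
```

For example, demo_scale_int(4, 10, &out) stores 4096, while a negative input
is rejected before any shift happens.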

With all this, I do not see the point of having a new API.
Also, where are the test cases for it?

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v4 2/3] lib: fix a few comments to match kernel-doc -Wreturn style
From: Andy Shevchenko @ 2026-01-20 14:47 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Andrew Morton, Kees Cook, Carlos Maiolino, Christoph Hellwig,
	linux-xfs, linux-hardening
In-Reply-To: <20260120141229.356513-2-dmantipov@yandex.ru>

On Tue, Jan 20, 2026 at 05:12:28PM +0300, Dmitry Antipov wrote:
> Fix 'get_option()', 'memparse()' and 'parse_option_str()' comments
> to match the commonly used style as suggested by kernel-doc -Wreturn.

Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>

Thanks for doing this change; I think it should go first for consistency's
sake.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v4 1/3] lib: introduce simple error-checking wrapper for memparse()
From: Andy Shevchenko @ 2026-01-20 14:46 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Andrew Morton, Kees Cook, Carlos Maiolino, Christoph Hellwig,
	linux-xfs, linux-hardening
In-Reply-To: <aW-VDu4aPV6kZv80@smile.fi.intel.com>

On Tue, Jan 20, 2026 at 04:45:39PM +0200, Andy Shevchenko wrote:
> On Tue, Jan 20, 2026 at 05:12:27PM +0300, Dmitry Antipov wrote:
> > Introduce 'memvalue()' which uses 'memparse()' to parse a string
> > with optional memory suffix into a non-negative number. If parsing
> > has succeeded, returns 0 and stores the result at the location
> > specified by the second argument. Otherwise returns -EINVAL and
> > leaves the location untouched.

Also this misses the cover letter to explain the motivation, changelog, etc.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v4 1/3] lib: introduce simple error-checking wrapper for memparse()
From: Andy Shevchenko @ 2026-01-20 14:45 UTC (permalink / raw)
  To: Dmitry Antipov
  Cc: Andrew Morton, Kees Cook, Carlos Maiolino, Christoph Hellwig,
	linux-xfs, linux-hardening
In-Reply-To: <20260120141229.356513-1-dmantipov@yandex.ru>

On Tue, Jan 20, 2026 at 05:12:27PM +0300, Dmitry Antipov wrote:
> Introduce 'memvalue()' which uses 'memparse()' to parse a string
> with optional memory suffix into a non-negative number. If parsing
> has succeeded, returns 0 and stores the result at the location
> specified by the second argument. Otherwise returns -EINVAL and
> leaves the location untouched.

...

> +/**
> + *	memvalue -  Wrap memparse() with simple error detection
> + *	@ptr: Where parse begins
> + *	@valptr: Where to store result
> + *
> + *	Uses memparse() to parse a string into a number stored at
> + *	@valptr, leaving memory at @valptr untouched in case of error.
> + *
> + *	Return: -EINVAL for a presumably negative value or if an
> + *	unrecognized character was encountered, and 0 otherwise.
> + */
> +int __must_check memvalue(const char *ptr, unsigned long long *valptr)
> +{
> +	unsigned long long ret;
> +	char *end;
> +
> +	if (*ptr == '-')
> +		return -EINVAL;
> +	ret = memparse(ptr, &end);
> +	if (*end)
> +		return -EINVAL;
> +	*valptr = ret;
> +	return 0;
> +}

My questions seem to be left unsettled:
- why -EINVAL and not -ERANGE in the first place;
- why do we need this patch _at all_, given how the callers work now
(without this change); in other words, why can't memparse() be
used directly?
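
To make the -EINVAL vs. -ERANGE distinction concrete, here is a rough
userspace analogue. It is only a sketch: demo_parse_size() is a hypothetical
name, it is built on strtoull() rather than memparse(), and it handles only
the K/M/G suffixes. Malformed input yields -EINVAL, while a value that does
not fit yields -ERANGE:

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace analogue; NOT memparse() or the proposed memvalue(). */
static int demo_parse_size(const char *s, unsigned long long *out)
{
	unsigned long long val;
	unsigned int shift = 0;
	char *end;

	/* Reject signs, leading whitespace and empty strings up front. */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;		/* number itself does not fit */

	switch (*end) {
	case 'K': case 'k': shift = 10; end++; break;
	case 'M': case 'm': shift = 20; end++; break;
	case 'G': case 'g': shift = 30; end++; break;
	default: break;
	}
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (val > (~0ULL >> shift))
		return -ERANGE;		/* suffix scaling would overflow */

	*out = val << shift;
	return 0;
}
```

With this split a caller can tell "32k" (accepted as 32768), "12x"
(malformed) and an over-long digit string (out of range) apart.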


-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v2 01/31] Documentation: document EXPORT_OP_NOLOCKS
From: Jeff Layton @ 2026-01-20 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Alexander Viro, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Amir Goldstein,
	Hugh Dickins, Baolin Wang, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Jan Kara, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Carlos Maiolino,
	Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko, Chris Mason,
	David Sterba, Luis de Bethencourt, Salah Triki, Phillip Lougher,
	Steve French, Paulo Alcantara, Ronnie Sahlberg, Shyam Prasad N,
	Bharath SM, Miklos Szeredi, Mike Marshall, Martin Brandenburg,
	Mark Fasheh, Joel Becker, Joseph Qi, Konstantin Komarov,
	Ryusuke Konishi, Trond Myklebust, Anna Schumaker, Dave Kleikamp,
	David Woodhouse, Richard Weinberger, Jan Kara,
	Andreas Gruenbacher, OGAWA Hirofumi, Jaegeuk Kim, Jonathan Corbet,
	David Laight, Dave Chinner, linux-nfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-ext4, linux-erofs, linux-xfs,
	ceph-devel, linux-btrfs, linux-cifs, samba-technical,
	linux-unionfs, devel, ocfs2-devel, ntfs3, linux-nilfs,
	jfs-discussion, linux-mtd, gfs2, linux-f2fs-devel, linux-doc
In-Reply-To: <707f08e114bf603caf7de020bb630d5477e86bca.camel@kernel.org>

On Tue, 2026-01-20 at 09:12 -0500, Jeff Layton wrote:
> On Tue, 2026-01-20 at 08:20 -0500, Jeff Layton wrote:
> > On Mon, 2026-01-19 at 23:44 -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 19, 2026 at 11:26:18AM -0500, Jeff Layton wrote:
> > > > +  EXPORT_OP_NOLOCKS - Disable file locking on this filesystem. Some
> > > > +    filesystems cannot properly support file locking as implemented by
> > > > +    nfsd. A case in point is reexport of NFS itself, which can't be done
> > > > +    safely without coordinating the grace period handling. Other clustered
> > > > +    and networked filesystems can be problematic here as well.
> > > 
> > > I'm not sure this is very useful.  It really needs to document what
> > > locking semantics nfs expects, because otherwise no reader will know
> > > if they set this or not.
> > 
> > Fair point. I'll see if I can draft something better. Suggestions
> > welcome.
> 
> How about this?
> 
> +  EXPORT_OP_NOLOCKS - Disable file locking on this filesystem. Filesystems
> +    that want to support locking over NFS must support POSIX file locking
> +    semantics and must handle lock recovery requests from clients after a
> +    reboot. Most local disk, RAM, or pseudo-filesystems use the generic POSIX
> +    locking support in the kernel and naturally provide this capability. Network
> +    or clustered filesystems usually need special handling to do this properly.

Even better, I think?

+
+  EXPORT_OP_NOLOCKS - Disable file locking on this filesystem. Filesystems
+    that want to support locking over NFS must support POSIX file locking
+    semantics. When the server reboots, the clients will issue requests to
+    recover their locks, which nfsd will issue to the filesystem as new lock
+    requests. Those must succeed in order for lock recovery to work. Most
+    local disk, RAM, or pseudo-filesystems use the generic POSIX locking
+    support in the kernel and naturally provide this capability. Network or
+    clustered filesystems usually need special handling to do this properly.
+    Set this flag on filesystems that can't guarantee the proper semantics
+    (e.g. reexported NFS).

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH 1/4] xfs: use blkdev_report_zones_cached()
From: Andrey Albershteyn @ 2026-01-20 14:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrey Albershteyn, Damien Le Moal, Darrick J . Wong,
	Carlos Maiolino, linux-xfs, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260109162324.2386829-2-hch@lst.de>

On 2026-01-09 17:22:52, Christoph Hellwig wrote:
> From: Damien Le Moal <dlemoal@kernel.org>
> 
> Source kernel commit: e04ccfc28252f181ea8d469d834b48e7dece65b2
> 
> Modify xfs_mount_zones() to replace the call to blkdev_report_zones()
> with blkdev_report_zones_cached() to speed-up mount operations.
> Since this causes xfs_zone_validate_seq() to see zones with the
> BLK_ZONE_COND_ACTIVE condition, this function is also modified to accept
> this condition as valid.
> 
> With this change, mounting a freshly formatted large capacity (30 TB)
> SMR HDD completes under 2s compared to over 4.7s before.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/platform_defs.h | 4 ++++
>  libxfs/xfs_zones.c      | 1 +
>  2 files changed, 5 insertions(+)
> 
> diff --git a/include/platform_defs.h b/include/platform_defs.h
> index da966490b0f5..cfdaca642645 100644
> --- a/include/platform_defs.h
> +++ b/include/platform_defs.h
> @@ -307,4 +307,8 @@ struct kvec {
>  	size_t iov_len;
>  };
>  
> +#ifndef BLK_ZONE_COND_ACTIVE /* added in Linux 6.19 */
> +#define BLK_ZONE_COND_ACTIVE	0xff

Hmm, I think #ifndef doesn't work for an enum member. Compiling against
Linux 6.19-rc6:

../include/platform_defs.h:311:33: error: expected identifier before numeric constant
  311 | #define BLK_ZONE_COND_ACTIVE    0xff
      |                                 ^~~~
/linux-headers-v6.19-rc6/include/linux/blkzoned.h:84:9: note: in expansion of macro ‘BLK_ZONE_COND_ACTIVE’
   84 |         BLK_ZONE_COND_ACTIVE    = 0xFF,
      |         ^~~~~~~~~~~~~~~~~~~~
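
The underlying issue can be reproduced in isolation. In the sketch below all
DEMO_* names are hypothetical stand-ins: the enum plays the role of the
kernel header, and the #ifndef shows that the preprocessor does not see enum
members at all. In the failing build the fallback #define is processed before
the kernel header, so the header's own enum member gets rewritten to
"0xff = 0xFF", which is what the compiler complains about:

```c
/* Stand-in for a kernel header that provides the value as an enum member. */
enum demo_zone_cond {
	DEMO_ZONE_COND_FULL	= 0xE,
	DEMO_ZONE_COND_ACTIVE	= 0xFF,
};

/*
 * Enum members are invisible to the preprocessor, so this test takes the
 * "not defined" branch even though DEMO_ZONE_COND_ACTIVE exists above.
 */
#ifndef DEMO_ZONE_COND_ACTIVE
#define DEMO_ENUM_VISIBLE_TO_CPP	0
#else
#define DEMO_ENUM_VISIBLE_TO_CPP	1
#endif

/*
 * One possible workaround (an assumption, not what the patch does): use a
 * private constant under a different name, so it can never collide with
 * the header's enum member.
 */
#define DEMO_COMPAT_ZONE_COND_ACTIVE	0xff
```

Another option would be a configure-time check that tests whether the header
already provides the member; which approach fits xfsprogs best is a separate
question.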

-- 
- Andrey


^ permalink raw reply

* [PATCH v6 16/16] ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION
From: Chuck Lever @ 2026-01-20 14:24 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-xfs, linux-cifs, linux-nfs,
	linux-f2fs-devel, hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
	almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
	adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
	trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260120142439.1821554-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

ksmbd hard-codes FILE_CASE_SENSITIVE_SEARCH and
FILE_CASE_PRESERVED_NAMES in FS_ATTRIBUTE_INFORMATION responses,
incorrectly indicating all exports are case-sensitive. This breaks
clients accessing case-insensitive filesystems like exFAT or
ext4/f2fs directories with casefold enabled.

Query actual case behavior via vfs_fileattr_get() and report accurate
attributes to SMB clients. Filesystems without ->fileattr_get continue
reporting default POSIX behavior (case-sensitive, case-preserving).

SMB's FS_ATTRIBUTE_INFORMATION reports per-share attributes from the
share root, not per-file. Shares mixing casefold and non-casefold
directories report the root directory's behavior.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/smb/server/smb2pdu.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/smb/server/smb2pdu.c b/fs/smb/server/smb2pdu.c
index 2fcd0d4d1fb0..257da9282bcf 100644
--- a/fs/smb/server/smb2pdu.c
+++ b/fs/smb/server/smb2pdu.c
@@ -13,6 +13,7 @@
 #include <linux/falloc.h>
 #include <linux/mount.h>
 #include <linux/filelock.h>
+#include <linux/fileattr.h>
 
 #include "glob.h"
 #include "smbfsctl.h"
@@ -5486,16 +5487,28 @@ static int smb2_get_info_filesystem(struct ksmbd_work *work,
 	case FS_ATTRIBUTE_INFORMATION:
 	{
 		FILE_SYSTEM_ATTRIBUTE_INFO *info;
+		struct file_kattr fa = {};
 		size_t sz;
+		u32 attrs;
+		int err;
 
 		info = (FILE_SYSTEM_ATTRIBUTE_INFO *)rsp->Buffer;
-		info->Attributes = cpu_to_le32(FILE_SUPPORTS_OBJECT_IDS |
-					       FILE_PERSISTENT_ACLS |
-					       FILE_UNICODE_ON_DISK |
-					       FILE_CASE_PRESERVED_NAMES |
-					       FILE_CASE_SENSITIVE_SEARCH |
-					       FILE_SUPPORTS_BLOCK_REFCOUNTING);
+		attrs = FILE_SUPPORTS_OBJECT_IDS |
+			FILE_PERSISTENT_ACLS |
+			FILE_UNICODE_ON_DISK |
+			FILE_SUPPORTS_BLOCK_REFCOUNTING;
 
+		err = vfs_fileattr_get(path.dentry, &fa);
+		if (err && err != -ENOIOCTLCMD) {
+			path_put(&path);
+			return err;
+		}
+		if (!(fa.fsx_xflags & FS_XFLAG_CASEFOLD))
+			attrs |= FILE_CASE_SENSITIVE_SEARCH;
+		if (!(fa.fsx_xflags & FS_XFLAG_CASENONPRESERVING))
+			attrs |= FILE_CASE_PRESERVED_NAMES;
+
+		info->Attributes = cpu_to_le32(attrs);
 		info->Attributes |= cpu_to_le32(server_conf.share_fake_fscaps);
 
 		if (test_share_config_flag(work->tcon->share_conf,
-- 
2.52.0


^ permalink raw reply related

* [PATCH v6 15/16] nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
From: Chuck Lever @ 2026-01-20 14:24 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-xfs, linux-cifs, linux-nfs,
	linux-f2fs-devel, hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
	almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
	adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
	trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260120142439.1821554-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

NFSD currently provides NFSv4 clients with hard-coded responses
indicating all exported filesystems are case-sensitive and
case-preserving. This is incorrect for case-insensitive filesystems
and ext4 directories with casefold enabled.

Query the underlying filesystem's actual case sensitivity via
nfsd_get_case_info() and return accurate values to clients. This
supports per-directory settings for filesystems that allow mixing
case-sensitive and case-insensitive directories within an export.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4xdr.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 51ef97c25456..a4988a643d12 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2933,6 +2933,8 @@ struct nfsd4_fattr_args {
 	u32			rdattr_err;
 	bool			contextsupport;
 	bool			ignore_crossmnt;
+	bool			case_insensitive;
+	bool			case_preserving;
 };
 
 typedef __be32(*nfsd4_enc_attr)(struct xdr_stream *xdr,
@@ -3131,6 +3133,18 @@ static __be32 nfsd4_encode_fattr4_acl(struct xdr_stream *xdr,
 	return nfs_ok;
 }
 
+static __be32 nfsd4_encode_fattr4_case_insensitive(struct xdr_stream *xdr,
+					const struct nfsd4_fattr_args *args)
+{
+	return nfsd4_encode_bool(xdr, args->case_insensitive);
+}
+
+static __be32 nfsd4_encode_fattr4_case_preserving(struct xdr_stream *xdr,
+					const struct nfsd4_fattr_args *args)
+{
+	return nfsd4_encode_bool(xdr, args->case_preserving);
+}
+
 static __be32 nfsd4_encode_fattr4_filehandle(struct xdr_stream *xdr,
 					     const struct nfsd4_fattr_args *args)
 {
@@ -3487,8 +3501,8 @@ static const nfsd4_enc_attr nfsd4_enc_fattr4_encode_ops[] = {
 	[FATTR4_ACLSUPPORT]		= nfsd4_encode_fattr4_aclsupport,
 	[FATTR4_ARCHIVE]		= nfsd4_encode_fattr4__noop,
 	[FATTR4_CANSETTIME]		= nfsd4_encode_fattr4__true,
-	[FATTR4_CASE_INSENSITIVE]	= nfsd4_encode_fattr4__false,
-	[FATTR4_CASE_PRESERVING]	= nfsd4_encode_fattr4__true,
+	[FATTR4_CASE_INSENSITIVE]	= nfsd4_encode_fattr4_case_insensitive,
+	[FATTR4_CASE_PRESERVING]	= nfsd4_encode_fattr4_case_preserving,
 	[FATTR4_CHOWN_RESTRICTED]	= nfsd4_encode_fattr4__true,
 	[FATTR4_FILEHANDLE]		= nfsd4_encode_fattr4_filehandle,
 	[FATTR4_FILEID]			= nfsd4_encode_fattr4_fileid,
@@ -3674,8 +3688,9 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 		if (err)
 			goto out_nfserr;
 	}
-	if ((attrmask[0] & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) &&
-	    !fhp) {
+	if ((attrmask[0] & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID |
+			    FATTR4_WORD0_CASE_INSENSITIVE |
+			    FATTR4_WORD0_CASE_PRESERVING)) && !fhp) {
 		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
 		status = nfserr_jukebox;
 		if (!tempfh)
@@ -3687,6 +3702,14 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 		args.fhp = tempfh;
 	} else
 		args.fhp = fhp;
+	if (attrmask[0] & (FATTR4_WORD0_CASE_INSENSITIVE |
+			   FATTR4_WORD0_CASE_PRESERVING)) {
+		status = nfsd_get_case_info(args.fhp, &args.case_insensitive,
+					    &args.case_preserving);
+		if (status != nfs_ok)
+			attrmask[0] &= ~(FATTR4_WORD0_CASE_INSENSITIVE |
+					 FATTR4_WORD0_CASE_PRESERVING);
+	}
 
 	if (attrmask[0] & FATTR4_WORD0_ACL) {
 		err = nfsd4_get_nfs4_acl(rqstp, dentry, &args.acl);
-- 
2.52.0


^ permalink raw reply related

* [PATCH v6 14/16] nfsd: Report export case-folding via NFSv3 PATHCONF
From: Chuck Lever @ 2026-01-20 14:24 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-xfs, linux-cifs, linux-nfs,
	linux-f2fs-devel, hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
	almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
	adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
	trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260120142439.1821554-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

The hard-coded MSDOS_SUPER_MAGIC check in nfsd3_proc_pathconf()
only recognizes FAT filesystems as case-insensitive. Modern
filesystems like F2FS, exFAT, and CIFS support case-insensitive
directories, but NFSv3 clients cannot discover this capability.

Query the export's actual case behavior through ->fileattr_get
instead. This allows NFSv3 clients to correctly handle case
sensitivity for any filesystem that implements the fileattr
interface. Filesystems without ->fileattr_get continue to report
the default POSIX behavior (case-sensitive, case-preserving).

This change assumes that the patch ("fat: Implement fileattr_get for
case sensitivity") has been applied, which ensures FAT filesystems
report their case behavior correctly via the fileattr interface.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs3proc.c | 18 ++++++++++--------
 fs/nfsd/vfs.c      | 25 +++++++++++++++++++++++++
 fs/nfsd/vfs.h      |  2 ++
 3 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 42adc5461db0..9be0aca01de0 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -717,17 +717,19 @@ nfsd3_proc_pathconf(struct svc_rqst *rqstp)
 
 	if (resp->status == nfs_ok) {
 		struct super_block *sb = argp->fh.fh_dentry->d_sb;
+		bool case_insensitive, case_preserving;
 
-		/* Note that we don't care for remote fs's here */
-		switch (sb->s_magic) {
-		case EXT2_SUPER_MAGIC:
+		if (sb->s_magic == EXT2_SUPER_MAGIC) {
 			resp->p_link_max = EXT2_LINK_MAX;
 			resp->p_name_max = EXT2_NAME_LEN;
-			break;
-		case MSDOS_SUPER_MAGIC:
-			resp->p_case_insensitive = 1;
-			resp->p_case_preserving  = 0;
-			break;
+		}
+
+		resp->status = nfsd_get_case_info(&argp->fh,
+						  &case_insensitive,
+						  &case_preserving);
+		if (resp->status == nfs_ok) {
+			resp->p_case_insensitive = case_insensitive;
+			resp->p_case_preserving = case_preserving;
 		}
 	}
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 168d3ccc8155..55cf0c0165c9 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -32,6 +32,7 @@
 #include <linux/writeback.h>
 #include <linux/security.h>
 #include <linux/sunrpc/xdr.h>
+#include <linux/fileattr.h>
 
 #include "xdr3.h"
 
@@ -2871,3 +2872,27 @@ nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
 
 	return err? nfserrno(err) : 0;
 }
+
+/**
+ * nfsd_get_case_info - get case sensitivity info for a file handle
+ * @fhp: file handle that has already been verified
+ * @case_insensitive: output, true if the filesystem is case-insensitive
+ * @case_preserving: output, true if the filesystem preserves case
+ *
+ * Returns nfs_ok on success, or an nfserr on failure.
+ */
+__be32
+nfsd_get_case_info(struct svc_fh *fhp, bool *case_insensitive,
+		   bool *case_preserving)
+{
+	struct file_kattr fa = {};
+	int err;
+
+	err = vfs_fileattr_get(fhp->fh_dentry, &fa);
+	if (err && err != -ENOIOCTLCMD)
+		return nfserrno(err);
+
+	*case_insensitive = fa.fsx_xflags & FS_XFLAG_CASEFOLD;
+	*case_preserving = !(fa.fsx_xflags & FS_XFLAG_CASENONPRESERVING);
+	return nfs_ok;
+}
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index e192dca4a679..1ff62eecec09 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -155,6 +155,8 @@ __be32		nfsd_readdir(struct svc_rqst *, struct svc_fh *,
 			     loff_t *, struct readdir_cd *, nfsd_filldir_t);
 __be32		nfsd_statfs(struct svc_rqst *, struct svc_fh *,
 				struct kstatfs *, int access);
+__be32		nfsd_get_case_info(struct svc_fh *fhp, bool *case_insensitive,
+				   bool *case_preserving);
 
 __be32		nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
 				struct dentry *dentry, int acc);
-- 
2.52.0


^ permalink raw reply related

* [PATCH v6 13/16] isofs: Implement fileattr_get for case sensitivity
From: Chuck Lever @ 2026-01-20 14:24 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-xfs, linux-cifs, linux-nfs,
	linux-f2fs-devel, hirofumi, linkinjeon, sj1557.seo, yuezhang.mo,
	almaz.alexandrovich, slava, glaubitz, frank.li, tytso,
	adilger.kernel, cem, sfrench, pc, ronniesahlberg, sprasad,
	trondmy, anna, jaegeuk, chao, hansg, senozhatsky, Chuck Lever
In-Reply-To: <20260120142439.1821554-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

Upper layers such as NFSD need a way to query whether a
filesystem handles filenames in a case-sensitive manner so
they can provide correct semantics to remote clients. Without
this information, NFS exports of ISO 9660 filesystems cannot
properly advertise their filename case behavior.

Implement isofs_fileattr_get() to report ISO 9660 case handling
behavior via the FS_XFLAG_CASEFOLD flag. The 'check=r' (relaxed)
mount option enables case-insensitive lookups, and this setting
determines the value reported. By default, Joliet extensions
operate in relaxed mode while plain ISO 9660 uses strict
(case-sensitive) mode. All ISO 9660 variants are case-preserving,
meaning filenames are stored exactly as they appear on the disc.

The callback is registered only on isofs_dir_inode_operations
because isofs has no custom inode_operations for regular
files, and symlinks use the generic page_symlink_inode_operations.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/isofs/dir.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 09df40b612fb..e1a708f219f7 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -13,6 +13,7 @@
  */
 #include <linux/gfp.h>
 #include "isofs.h"
+#include <linux/fileattr.h>

 int isofs_name_translate(struct iso_directory_record *de, char *new, struct inode *inode)
 {
@@ -266,6 +267,19 @@ static int isofs_readdir(struct file *file, struct dir_context *ctx)
 	return result;
 }

+static int isofs_fileattr_get(struct dentry *dentry, struct file_kattr *fa)
+{
+	struct isofs_sb_info *sbi = ISOFS_SB(dentry->d_sb);
+
+	/*
+	 * FS_XFLAG_CASEFOLD indicates case-insensitive lookups.
+	 * When check=r (relaxed) is set, lookups ignore case.
+	 */
+	if (sbi->s_check == 'r')
+		fa->fsx_xflags |= FS_XFLAG_CASEFOLD;
+	return 0;
+}
+
 const struct file_operations isofs_dir_operations =
 {
 	.llseek = generic_file_llseek,
@@ -279,6 +293,7 @@ const struct file_operations isofs_dir_operations =
 const struct inode_operations isofs_dir_inode_operations =
 {
 	.lookup = isofs_lookup,
+	.fileattr_get = isofs_fileattr_get,
 };

-- 
2.52.0

^ permalink raw reply related
