* [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Josef Bacik @ 2025-03-03 17:00 UTC
To: linux-fsdevel
Hello,
I've recently gotten annoyed with the current reference counting rules that
exist in the file system arena, specifically this pattern of having 0 referenced
objects that indicate that they're ready to be reclaimed.
This pattern consistently bites us in the ass, is error prone, and gives us a
lot of complicated logic around when an object is actually allowed to be
touched versus when it is not.
We do this everywhere, with inodes, dentries, and folios, but I specifically
went to change inodes recently thinking it would be the easiest, and I've run
into a few big questions. Currently I've got about 30 patches, and that is
mostly just modifying the existing file systems for a new inode_operation.
Before I devote more time to this silly path, I figured it'd be good to bring it
up to the group to get some input on what possible better solutions there would
be.
I'll try to make this as easy to follow as possible, but I spent a full day and
a half writing code and thinking about this and it's kind of complicated. I'll
break this up into sections to try and make it easier to digest.
WHAT DO I WANT
I want to have refcount 0 == we're freeing the object. This will give us clear
"I'm using this object, thus I have a reference count on it" rules, and we can
(hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
I_WILL_FREE).
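As a concrete illustration, the invariant I'm after is roughly this (a sketch
only, not the actual patchset; evict() stands in for whatever the teardown
path ends up being):

	void iput(struct inode *inode)
	{
		/*
		 * Dropping the last reference is the one and only
		 * trigger for teardown.  No zero-ref objects lingering
		 * in I_FREEING/I_WILL_FREE limbo waiting to be reaped.
		 */
		if (atomic_dec_and_test(&inode->i_count))
			evict(inode);	/* refcount 0 == we're freeing it */
	}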
HOW DO I WANT TO DO THIS
Well obviously we keep a reference count always whenever we are using the inode,
and we hold a reference when it is on a list. This means the i_io_list holds a
reference to the inode, that means the LRU list holds a reference to the inode.
This makes LRU handling easier, we just walk the objects and drop our reference
to the object. If it was truly the last reference then we free it, otherwise it
will get added back onto the LRU list when the next guy does an iput().
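As a rough sketch of what the LRU walk becomes under this model (illustrative
only; locking and the list_lru plumbing are elided):

	static void inode_lru_prune_one(struct inode *inode)
	{
		/* The LRU's own reference is the one we drop here. */
		list_del_init(&inode->i_lru);
		iput(inode);	/* frees the inode iff this was the last ref */
	}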
POTENTIAL PROBLEM #1
Now we're actively checking to see if this inode is on the LRU list and
potentially taking the lru list lock more often. I don't think this will be an
issue in practice, as we would check the inode flags before we take the lock,
so we would only marginally increase the lock contention on the LRU lock. We
could mitigate this by doing the LRU list add at lookup time, where we already
have to grab some of these locks, but I don't want to get into premature
optimization territory here. I'm just surfacing it as a potential problem.
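The check I have in mind looks roughly like this (I_ON_LRU is a hypothetical
flag, and inode_add_lru() stands in for the actual list manipulation):

	static void inode_lru_maybe_add(struct inode *inode)
	{
		/* Lockless fast path: already on the LRU, so the
		 * common case never touches the lru lock at all. */
		if (inode->i_state & I_ON_LRU)
			return;

		spin_lock(&inode->i_lock);
		if (!(inode->i_state & I_ON_LRU)) {
			inode->i_state |= I_ON_LRU;
			inode_add_lru(inode);	/* takes the lru lock */
		}
		spin_unlock(&inode->i_lock);
	}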
POTENTIAL PROBLEM #2
We have a fair bit of logic in writeback around when we can just skip writeback,
which amounts to we're currently doing the final truncate on an inode with
->i_nlink set. This is kind of a big problem actually, as we could now
potentially end up with a large dirty inode that has an nlink of 0, and no
current users, but would now be written back because it has a reference on it
from writeback. Before we could get into the iput() and clean everything up
before writeback would occur. Now writeback would occur, and then we'd clean up
the inode.
SOLUTION FOR POTENTIAL PROBLEM #1
I think we ignore this for now, get the patches written, do some benchmarking
and see if this actually shows up in benchmarks. If it does then we come up
with strategies to resolve this at that point.
SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
My initial thought was to just move the final unlink logic outside of evict, and
create a new reference count that represents the actual use of the inode. Then
when the actual use went to 0 we would do the final unlink, de-coupling the
cleanup of the on-disk inode (in the case of local file systems) from the
freeing of the memory.
This is a nice to have because the other thing that bites us occasionally is an
iput() in a place where we don't necessarily want to be/is safe to do the final
truncate on the inode. This would allow us to do the final truncate at a time
when it is safe to do so.
However this means adding a different reference count to the inode. I started
to do this work, but it runs into some ugliness around ->tmpfile and file
systems that don't use the normal inode caching things (bcachefs, xfs). I do
like this solution, but I'm not sure if it's worth the complexity.
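To sketch the shape of that split (all names invented for illustration, none
of this is the actual patchset):

	struct inode {
		...
		atomic_t	i_count;	/* passive: in-memory lifetime */
		atomic_t	i_active;	/* active: actual use of the inode */
	};

	void iput_active(struct inode *inode)
	{
		if (atomic_dec_and_test(&inode->i_active))
			inode_do_final_unlink(inode);	/* hypothetical helper */
		iput(inode);	/* drop the paired passive reference */
	}

The active count answers "does the on-disk inode still have users?", while the
passive count only answers "does the struct inode need to stay allocated?".
Every active reference is paired with a passive one, so the memory always
outlives the on-disk cleanup.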
The other solution here is to just say screw it, we'll just always writeback
dirty inodes, and if they were unlinked then they get unlinked like always. I
think this is also a fine solution, because generally speaking if you've got
memory pressure on the system and the file is dirty and still open, you'll be
writing it back normally anyway. But I don't know how people feel about this.
CONCLUSION
I'd love some feedback on my potential problems and solutions, as well as any
other problems people may see. If we can get some discussion beforehand I can
finish up these patches and get some testing in before LSFMMBPF and we can have
a proper in-person discussion about the realities of the patchset. Thanks,
Josef
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Dave Chinner @ 2025-03-04 8:04 UTC
To: Josef Bacik; +Cc: linux-fsdevel
On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> Hello,
>
> I've recently gotten annoyed with the current reference counting rules that
> exist in the file system arena, specifically this pattern of having 0 referenced
> objects that indicate that they're ready to be reclaimed.
Which means you are now talking about inode life cycle issues :)
There is some recent relevant discussion here:
https://lore.kernel.org/linux-fsdevel/ZvNWqhnUgqk5BlS4@dread.disaster.area/
And here:
https://lore.kernel.org/all/20241002014017.3801899-1-david@fromorbit.com/
> This pattern consistently bites us in the ass, is error prone, and gives us a lot of
> complicated logic around when an object is actually allowed to be touched versus
> when it is not.
>
> We do this everywhere, with inodes, dentries, and folios, but I specifically
> went to change inodes recently thinking it would be the easiest, and I've run
> into a few big questions. Currently I've got about 30 patches, and that is
> mostly just modifying the existing file systems for a new inode_operation.
> Before I devote more time to this silly path, I figured it'd be good to bring it
> up to the group to get some input on what possible better solutions there would
> be.
I want to get rid of the inode cache (i.e. the unreferenced inodes
on the LRU) altogether. To do that, everything that needs the inode
to stay in memory (e.g. writeback) needs to hold an active reference
to the inode.
As you've noticed, there are lots of hacks in code that tries to
hang subsystem state/objects off the inode without taking an inode
reference.
> I'll try to make this as easy to follow as possible, but I spent a full day and
> a half writing code and thinking about this and it's kind of complicated. I'll
> break this up into sections to try and make it easier to digest.
>
> WHAT DO I WANT
>
> I want to have refcount 0 == we're freeing the object. This will give us clear
> "I'm using this object, thus I have a reference count on it" rules, and we can
> (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> I_WILL_FREE).
Yes. IMO, when iput_final() is called, the VFS inode should be kaput
never to be found again. Nothing in the VFS should be accessing it,
nor should any VFS level code be allowed to take a new reference to
it at that point in time.
> HOW DO I WANT TO DO THIS
>
> Well obviously we keep a reference count always whenever we are using the inode,
> and we hold a reference when it is on a list. This means the i_io_list holds a
> reference to the inode, that means the LRU list holds a reference to the inode.
Add to that list:
- fsnotify objects that are currently swept at superblock teardown
via an inode cache walk.
- LSM modules (e.g. landlock) that sweep state at superblock
teardown via an inode cache walk.
- dquots
- writeback
- mapping tree containing folios (*)
- anything else that stores non-core state or private objects
in/on the VFS inode.
(*) this is why __inode_add_lru() is such a mess and why I NACKed it
being used in the mm subsystem in the first place. The mm subsystem
uses this to prevent unreferenced inodes that have page cache
attached to them from being added to the LRU. Instead, the mm
subsystem has been allowed to prevent unreferenced inodes from being
tracked by the VFS whilst the mm subsystem tracks and ages out
mm-owned objects attached to the inodes.
The whole reason this was done is to prevent the inode cache
shrinker from tossing inodes with cached pages before the workingset
aging mechanism decides the page cache needs to be tossed. But to do
this, games had to be played with the LRU because the places where
the mm wanted to mark the inode as "can be reclaimed" are places
where it is not safe to call iput()....
IOWs, any "all inode usage needs to be reference counted" scheme is going
to have to solve this .... mess in some way....
> This makes LRU handling easier, we just walk the objects and drop our reference
> to the object. If it was truly the last reference then we free it, otherwise it
> will get added back onto the LRU list when the next guy does an iput().
>
> POTENTIAL PROBLEM #1
>
> Now we're actively checking to see if this inode is on the LRU list and
> potentially taking the lru list lock more often. I don't think this will be an
> issue in practice, as we would check the inode flags before we take the lock,
> so we would only marginally increase the lock contention on the LRU lock.
> We could mitigate this
> by doing the LRU list add at lookup time, where we already have to grab some of
> these locks, but I don't want to get into premature optimization territory here.
> I'm just surfacing it as a potential problem.
The inode cache (and the LRU) needs to go away. Reference count
everything so page cache residency doesn't trigger inode reclaim,
and otherwise the dentry cache LRU keeps the hot working set on
inodes in cache. There is no reason for maintaining a separately
aged inode cache these days....
> POTENTIAL PROBLEM #2
>
> We have a fair bit of logic in writeback around when we can just skip writeback,
> which amounts to we're currently doing the final truncate on an inode with
> ->i_nlink set. This is kind of a big problem actually, as we could now
> potentially end up with a large dirty inode that has an nlink of 0, and no
> current users, but would now be written back because it has a reference on it
> from writeback. Before we could get into the iput() and clean everything up
> before writeback would occur. Now writeback would occur, and then we'd clean up
> the inode.
Yup, an active/passive reference pattern.
> SOLUTION FOR POTENTIAL PROBLEM #1
>
> I think we ignore this for now, get the patches written, do some benchmarking
> and see if this actually shows up in benchmarks. If it does then we come up
> with strategies to resolve this at that point.
Oh, I pretty much guarantee moving away from lazy LRU management
will show up in any benchmark that stresses cold cache behaviour :/
> SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
>
> My initial thought was to just move the final unlink logic outside of evict, and
That's what XFS does - we don't actually process the final unlink
stage until after ->destroy_inode has been called. We defer that
processing to background inodegc workers (i.e. processed after
unlink() returns to userspace), and it all happens without the VFS
knowing anything about it.
> create a new reference count that represents the actual use of the inode. Then
> when the actual use went to 0 we would do the final unlink, de-coupling the
> cleanup of the on-disk inode (in the case of local file systems) from the
> freeing of the memory.
Can we defer the final filesystem unlink processing like
XFS does? The filesystem effectively controls freeing of the inode
object, so it can live as long as the filesystem wants once the
VFS is actively done with it.
> This is a nice to have because the other thing that bites us occasionally is an
> iput() in a place where we don't necessarily want to be/is safe to do the final
> truncate on the inode. This would allow us to do the final truncate at a time
> when it is safe to do so.
Yes, this is one of the reasons why XFS has traditionally done this
final unlink stage outside the VFS inode life cycle. We also do
cleanup of speculative prealloc outside the VFS inode life cycle via
the same code paths, too.
> However this means adding a different reference count to the inode. I started
> to do this work, but it runs into some ugliness around ->tmpfile and file
> systems that don't use the normal inode caching things (bcachefs, xfs). I do
> like this solution, but I'm not sure if it's worth the complexity.
We should be able to make it work just fine with XFS if it allows
decoupled cleanup. i.e. it is run via a new cleanup method that
would be called instead of ->destroy_inode, leaving the filesystem
to call destroy_inode() when it is done?
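Something like this, perhaps (the method name is invented, and the exact
signature is up for grabs):

	struct super_operations {
		...
		/* called instead of ->destroy_inode; the filesystem
		 * calls destroy_inode() itself once its decoupled
		 * cleanup (final unlink, inodegc, etc) has finished */
		void (*cleanup_inode)(struct inode *inode);
	};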
> The other solution here is to just say screw it, we'll just always writeback
> dirty inodes, and if they were unlinked then they get unlinked like always. I
> think this is also a fine solution, because generally speaking if you've got
> memory pressure on the system and the file is dirty and still open, you'll be
> writing it back normally anyway. But I don't know how people feel about this.
>
> CONCLUSION
>
> I'd love some feedback on my potential problems and solutions, as well as any
> other problems people may see. If we can get some discussion beforehand I can
> finish up these patches and get some testing in before LSFMMBPF and we can have
> a proper in-person discussion about the realities of the patchset. Thanks,
I think there's several overlapping issues here:
1. everything that stores state on/in the VFS inode needs to take a
reference to the inode.
2. Hacks like "walk the superblock inode list to visit every cached
inode without holding references to them" need to go away
3. inode eviction needs to be reworked to allow decoupled processing
of the inode once all VFS references go away.
4. The inode LRU (i.e. unreferenced inode caching) needs to go away
Anything else I missed? :)
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Christian Brauner @ 2025-03-04 10:19 UTC
To: Josef Bacik; +Cc: linux-fsdevel
On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> Hello,
>
> I've recently gotten annoyed with the current reference counting rules that
> exist in the file system arena, specifically this pattern of having 0 referenced
> objects that indicate that they're ready to be reclaimed.
>
> This pattern consistently bites us in the ass, is error prone, and gives us a lot of
> complicated logic around when an object is actually allowed to be touched versus
> when it is not.
>
> We do this everywhere, with inodes, dentries, and folios, but I specifically
> went to change inodes recently thinking it would be the easiest, and I've run
> into a few big questions. Currently I've got about 30 patches, and that is
> mostly just modifying the existing file systems for a new inode_operation.
> Before I devote more time to this silly path, I figured it'd be good to bring it
> up to the group to get some input on what possible better solutions there would
> be.
>
> I'll try to make this as easy to follow as possible, but I spent a full day and
> a half writing code and thinking about this and it's kind of complicated. I'll
> break this up into sections to try and make it easier to digest.
>
> WHAT DO I WANT
>
> I want to have refcount 0 == we're freeing the object. This will give us clear
> "I'm using this object, thus I have a reference count on it" rules, and we can
> (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> I_WILL_FREE).
Yeah, I want to see the I_FREEING and I_WILL_FREE stuff go away. This bit
fiddling and waiting is terribly opaque for anyone who hasn't worked on
this since the dawn of time. So I'm all for it.
>
> HOW DO I WANT TO DO THIS
>
> Well obviously we keep a reference count always whenever we are using the inode,
> and we hold a reference when it is on a list. This means the i_io_list holds a
> reference to the inode, that means the LRU list holds a reference to the inode.
>
> This makes LRU handling easier, we just walk the objects and drop our reference
> to the object. If it was truly the last reference then we free it, otherwise it
> will get added back onto the LRU list when the next guy does an iput().
>
> POTENTIAL PROBLEM #1
>
> Now we're actively checking to see if this inode is on the LRU list and
> potentially taking the lru list lock more often. I don't think this will be an
> issue in practice, as we would check the inode flags before we take the lock,
> so we would only marginally increase the lock contention on the LRU lock. We
> could mitigate this
> by doing the LRU list add at lookup time, where we already have to grab some of
> these locks, but I don't want to get into premature optimization territory here.
> I'm just surfacing it as a potential problem.
Yes, ignore it for now.
So I agree that if we can try and remove the inode cache altogether that
would be pretty awesome and we know that we have support for attempting
that from Linus. But I'm not sure what regression potential that has.
There might just be enough implicit behavior that workloads depend on
that will bite us in the ass.
But I don't think you need to address this in this series. Your changes
might end up making it easier to experiment with the inode cache removal
though.
> POTENTIAL PROBLEM #2
>
> We have a fair bit of logic in writeback around when we can just skip writeback,
> which amounts to we're currently doing the final truncate on an inode with
> ->i_nlink set. This is kind of a big problem actually, as we could now
> potentially end up with a large dirty inode that has an nlink of 0, and no
> current users, but would now be written back because it has a reference on it
> from writeback. Before we could get into the iput() and clean everything up
> before writeback would occur. Now writeback would occur, and then we'd clean up
> the inode.
So in the old pattern you'd call iput_final() and then do writeback.
Whereas in the new pattern you'd do writeback before iput_final().
And this is a problem because it potentially delays freeing of the inode
for a long time?
>
> SOLUTION FOR POTENTIAL PROBLEM #1
>
> I think we ignore this for now, get the patches written, do some benchmarking
> and see if this actually shows up in benchmarks. If it does then we come up
> with strategies to resolve this at that point.
>
> SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
>
> My initial thought was to just move the final unlink logic outside of evict, and
> create a new reference count that represents the actual use of the inode. Then
> when the actual use went to 0 we would do the final unlink, de-coupling the
> cleanup of the on-disk inode (in the case of local file systems) from the
> freeing of the memory.
I really do like active/passive reference counts. I've used that pattern
for mount namespaces, seccomp filters and some other stuff quite
successfully. So I'm somewhat inclined to prefer that solution.
Imho, when active/passive patterns are needed or useful then it's
almost always because the original single reference counting mechanism
was semantically vague because it mixed two different meanings of the
reference count. So switching to an active/passive pattern will end up
clarifying things.
> This is a nice to have because the other thing that bites us occasionally is an
> iput() in a place where we don't necessarily want to be/is safe to do the final
> truncate on the inode. This would allow us to do the final truncate at a time
> when it is safe to do so.
>
> However this means adding a different reference count to the inode. I started
> to do this work, but it runs into some ugliness around ->tmpfile and file
> systems that don't use the normal inode caching things (bcachefs, xfs). I do
> like this solution, but I'm not sure if it's worth the complexity.
>
> The other solution here is to just say screw it, we'll just always writeback
> dirty inodes, and if they were unlinked then they get unlinked like always. I
> think this is also a fine solution, because generally speaking if you've got
> memory pressure on the system and the file is dirty and still open, you'll be
> writing it back normally anyway. But I don't know how people feel about this.
>
> CONCLUSION
>
> I'd love some feedback on my potential problems and solutions, as well as any
> other problems people may see. If we can get some discussion beforehand I can
> finish up these patches and get some testing in before LSFMMBPF and we can have
> a proper in-person discussion about the realities of the patchset. Thanks,
>
> Josef
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Josef Bacik @ 2025-03-04 14:56 UTC
To: Christian Brauner; +Cc: linux-fsdevel
On Tue, Mar 04, 2025 at 11:19:34AM +0100, Christian Brauner wrote:
> On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> > Hello,
> >
> > I've recently gotten annoyed with the current reference counting rules that
> > exist in the file system arena, specifically this pattern of having 0 referenced
> > objects that indicate that they're ready to be reclaimed.
> >
> > This pattern consistently bites us in the ass, is error prone, and gives us a lot of
> > complicated logic around when an object is actually allowed to be touched versus
> > when it is not.
> >
> > We do this everywhere, with inodes, dentries, and folios, but I specifically
> > went to change inodes recently thinking it would be the easiest, and I've run
> > into a few big questions. Currently I've got about 30 patches, and that is
> > mostly just modifying the existing file systems for a new inode_operation.
> > Before I devote more time to this silly path, I figured it'd be good to bring it
> > up to the group to get some input on what possible better solutions there would
> > be.
> >
> > I'll try to make this as easy to follow as possible, but I spent a full day and
> > a half writing code and thinking about this and it's kind of complicated. I'll
> > break this up into sections to try and make it easier to digest.
> >
> > WHAT DO I WANT
> >
> > I want to have refcount 0 == we're freeing the object. This will give us clear
> > "I'm using this object, thus I have a reference count on it" rules, and we can
> > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> > I_WILL_FREE).
>
> Yeah, I want to see the I_FREEING and I_WILL_FREE stuff go away. This bit
> fiddling and waiting is terribly opaque for anyone who hasn't worked on
> this since the dawn of time. So I'm all for it.
>
> >
> > HOW DO I WANT TO DO THIS
> >
> > Well obviously we keep a reference count always whenever we are using the inode,
> > and we hold a reference when it is on a list. This means the i_io_list holds a
> > reference to the inode, that means the LRU list holds a reference to the inode.
> >
> > This makes LRU handling easier, we just walk the objects and drop our reference
> > to the object. If it was truly the last reference then we free it, otherwise it
> > will get added back onto the LRU list when the next guy does an iput().
> >
> > POTENTIAL PROBLEM #1
> >
> > Now we're actively checking to see if this inode is on the LRU list and
> > potentially taking the lru list lock more often. I don't think this will be an
> > issue in practice, as we would check the inode flags before we take the lock,
> > so we would only marginally increase the lock contention on the LRU lock. We
> > could mitigate this
> > by doing the LRU list add at lookup time, where we already have to grab some of
> > these locks, but I don't want to get into premature optimization territory here.
> > I'm just surfacing it as a potential problem.
>
> Yes, ignore it for now.
>
> So I agree that if we can try and remove the inode cache altogether that
> would be pretty awesome and we know that we have support for attempting
> that from Linus. But I'm not sure what regression potential that has.
> There might just be enough implicit behavior that workloads depend on
> that will bite us in the ass.
>
> But I don't think you need to address this in this series. Your changes
> might end up making it easier to experiment with the inode cache removal
> though.
>
> > POTENTIAL PROBLEM #2
> >
> > We have a fair bit of logic in writeback around when we can just skip writeback,
> > which amounts to we're currently doing the final truncate on an inode with
> > ->i_nlink set. This is kind of a big problem actually, as we could now
> > potentially end up with a large dirty inode that has an nlink of 0, and no
> > current users, but would now be written back because it has a reference on it
> > from writeback. Before we could get into the iput() and clean everything up
> > before writeback would occur. Now writeback would occur, and then we'd clean up
> > the inode.
>
> So in the old pattern you'd call iput_final() and then do writeback.
> Whereas in the new pattern you'd do writeback before iput_final().
> And this is a problem because it potentially delays freeing of the inode
> for a long time?
Well we don't do the writeback in iput_final() if we've unlinked the inode.
Currently, you can have the following sequence of events:
  Open file
  Write to file
  Unlink file
  Close file
  iput_final()
    remove the inode from the io_list
    wait on current writeback that may have started before the iput_final()
    truncate the inode
    truncate the page cache
In this case we avoid the writeout completely, because once we've entered
iput_final() and set I_FREEING the writeback threads will skip that inode if
they try to write it back, so we skip a whole writeback cycle.
With my naive implementation of just holding reference counts for everything,
writeback now holds its reference, so what would happen now is:
  Open file
  Write to file
  Unlink file
  Close file
  <some time passes>
  Writeback occurs on the file
  iput_final()
    truncate the inode
So now we've written back the inode, and then we truncate it. Is this bad? Not
really, unless your workload does this and you're on SSDs, in which case you've
now got more write cycles than you had before. The slightly bigger problem is
that now you've got the final iput happening in the writeback threads, which
will induce latency in the whole system.
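For reference, the skip is roughly this check in the writeback path
(paraphrased from memory, so treat the details as approximate):

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
		/* iput_final() owns this inode now; writing it back
		 * would be wasted work, so skip the whole cycle. */
		spin_unlock(&inode->i_lock);
		continue;
	}

With everything reference counted, writeback holding its own reference means
we'd no longer take this bail-out for a dirty unlinked inode.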
>
> >
> > SOLUTION FOR POTENTIAL PROBLEM #1
> >
> > I think we ignore this for now, get the patches written, do some benchmarking
> > and see if this actually shows up in benchmarks. If it does then we come up
> > with strategies to resolve this at that point.
> >
> > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
> >
> > My initial thought was to just move the final unlink logic outside of evict, and
> > create a new reference count that represents the actual use of the inode. Then
> > when the actual use went to 0 we would do the final unlink, de-coupling the
> > cleanup of the on-disk inode (in the case of local file systems) from the
> > freeing of the memory.
>
> I really do like active/passive reference counts. I've used that pattern
> for mount namespaces, seccomp filters and some other stuff quite
> successfully. So I'm somewhat inclined to prefer that solution.
>
> Imho, when active/passive patterns are needed or useful then it's
> almost always because the original single reference counting mechanism
> was semantically vague because it mixed two different meanings of the
> reference count. So switching to an active/passive pattern will end up
> clarifying things.
>
Ok cool, that's the path I wandered down, and I wanted to make sure everybody
else was cool with it. I'll finish up these patches, get some testing in, and
get them out so we can have something concrete to look at. I'm limiting my
kernel development time to Fridays so it'll be either end of the week or next
week when the patches show up. Thanks,
Josef
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Josef Bacik @ 2025-03-04 15:02 UTC
To: Dave Chinner; +Cc: linux-fsdevel
On Tue, Mar 04, 2025 at 07:04:49PM +1100, Dave Chinner wrote:
> On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> > Hello,
> >
> > I've recently gotten annoyed with the current reference counting rules that
> > exist in the file system arena, specifically this pattern of having 0 referenced
> > objects that indicate that they're ready to be reclaimed.
>
> Which means you are now talking about inode life cycle issues :)
>
> There is some recent relevant discussion here:
>
> https://lore.kernel.org/linux-fsdevel/ZvNWqhnUgqk5BlS4@dread.disaster.area/
>
> And here:
>
> https://lore.kernel.org/all/20241002014017.3801899-1-david@fromorbit.com/
>
> > This pattern consistently bites us in the ass, is error prone, and gives us a lot of
> > complicated logic around when an object is actually allowed to be touched versus
> > when it is not.
> >
> > We do this everywhere, with inodes, dentries, and folios, but I specifically
> > went to change inodes recently thinking it would be the easiest, and I've run
> > into a few big questions. Currently I've got about 30 patches, and that is
> > mostly just modifying the existing file systems for a new inode_operation.
> > Before I devote more time to this silly path, I figured it'd be good to bring it
> > up to the group to get some input on what possible better solutions there would
> > be.
>
> I want to get rid of the inode cache (i.e. the unreferenced inodes
> on the LRU) altogether. To do that, everything that needs the inode
> to stay in memory (e.g. writeback) needs to hold an active reference
> to the inode.
>
> As you've noticed, there are lots of hacks in code that tries to
> hang subsystem state/objects off the inode without taking an inode
> reference.
>
> > I'll try to make this as easy to follow as possible, but I spent a full day and
> > a half writing code and thinking about this and it's kind of complicated. I'll
> > break this up into sections to try and make it easier to digest.
> >
> > WHAT DO I WANT
> >
> > I want to have refcount 0 == we're freeing the object. This will give us clear
> > "I'm using this object, thus I have a reference count on it" rules, and we can
> > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> > I_WILL_FREE).
>
> Yes. IMO, when iput_final() is called, the VFS inode should be kaput
> never to be found again. Nothing in the VFS should be accessing it,
> nor should any VFS level code be allowed to take a new reference to
> it at that point in time.
>
> > HOW DO I WANT TO DO THIS
> >
> > Well obviously we keep a reference count always whenever we are using the inode,
> > and we hold a reference when it is on a list. This means the i_io_list holds a
> > reference to the inode, that means the LRU list holds a reference to the inode.
>
> Add to that list:
>
> - fsnotify objects that are currently swept at superblock teardown
> via an inode cache walk.
> - LSM modules (e.g. landlock) that sweep state at superblock
> teardown via an inode cache walk.
> - dquots
> - writeback
> - mapping tree containing folios (*)
> - anything else that stores non-core state or private objects
> in/on the VFS inode.
>
> (*) this is why __inode_add_lru() is such a mess and why I NACKed it
> being used in the mm subsystem in the first place. The mm subsystem
> uses this to prevent unreferenced inodes that have page cache
> attached to them from being added to the LRU. Instead, the mm
> subsystem has been allowed to prevent unreferenced inodes from being
> tracked by the VFS whilst the mm subsystem tracks and ages out
> mm-owned objects attached to the inodes.
>
> The whole reason this was done is to prevent the inode cache
> shrinker from tossing inodes with cached pages before the workingset
> aging mechanism decides the page cache needs to be tossed. But to do
> this, games had to be played with the LRU because the places where
> the mm wanted to mark the inode as "can be reclaimed" are places
> where it is not safe to call iput()....
>
> IOWs, any "all inode usage needs to be reference counted" scheme is going
> to have to solve this .... mess in some way....
>
> > This makes LRU handling easier, we just walk the objects and drop our reference
> > to the object. If it was truly the last reference then we free it, otherwise it
> > will get added back onto the LRU list when the next guy does an iput().
> >
> > POTENTIAL PROBLEM #1
> >
> > Now we're actively checking to see if this inode is on the LRU list and
> > potentially taking the lru list lock more often. I don't think this will be an
> > issue in practice, as we would check the inode flags before we take the lock,
> > so we would only marginally increase the lock contention on the LRU lock.
>
> > We could mitigate this
> > by doing the LRU list add at lookup time, where we already have to grab some of
> > these locks, but I don't want to get into premature optimization territory here.
> > I'm just surfacing it as a potential problem.
>
> The inode cache (and the LRU) needs to go away. Reference count
> everything so page cache residency doesn't trigger inode reclaim,
> and otherwise the dentry cache LRU keeps the hot working set on
> inodes in cache. There is no reason for maintaining a separately
> aged inode cache these days....
>
> > POTENTIAL PROBLEM #2
> >
> > We have a fair bit of logic in writeback around when we can just skip writeback,
> > which amounts to we're currently doing the final truncate on an inode with
> > ->i_nlink set. This is kind of a big problem actually, as we could now
> > potentially end up with a large dirty inode that has an nlink of 0, and no
> > current users, but would now be written back because it has a reference on it
> > from writeback. Before we could get into the iput() and clean everything up
> > before writeback would occur. Now writeback would occur, and then we'd clean up
> > the inode.
>
> Yup, an active/passive reference pattern.
>
> > SOLUTION FOR POTENTIAL PROBLEM #1
> >
> > I think we ignore this for now, get the patches written, do some benchmarking
> > and see if this actually shows up in benchmarks. If it does then we come up
> > with strategies to resolve this at that point.
>
> Oh, I pretty much guarantee moving away from lazy LRU management
> will show up in any benchmark that stresses cold cache behaviour :/
>
> > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
> >
> > My initial thought was to just move the final unlink logic outside of evict, and
>
> That's what XFS does - we don't actually process the final unlink
> stage until after ->destroy_inode has been called. We defer that
> processing to background inodegc workers (i.e. processed after
> unlink() returns to userspace), and it all happens without the VFS
> knowing anything about it.
>
> > create a new reference count that represents the actual use of the inode. Then
> > when the actual use went to 0 we would do the final unlink, de-coupling the
> > cleanup of the on-disk inode (in the case of local file systems) from the
> > freeing of the memory.
>
> Can we defer the final filesystem unlink processing like
> XFS does? The filesystem effectively controls freeing of the inode
> object, so it can live as long as the filesystem wants once the
> VFS is actively done with it.
>
> > This is a nice to have because the other thing that bites us occasionally is an
> > iput() in a place where we don't necessarily want to be/is safe to do the final
> > truncate on the inode. This would allow us to do the final truncate at a time
> > when it is safe to do so.
>
> Yes, this is one of the reasons why XFS has traditionally done this
> final unlink stage outside the VFS inode life cycle. We also do
> cleanup of speculative prealloc outside the VFS inode life cycle via
> the same code paths, too.
>
> > However this means adding a different reference count to the inode. I started
> > to do this work, but it runs into some ugliness around ->tmpfile and file
> > systems that don't use the normal inode caching things (bcachefs, xfs). I do
> > like this solution, but I'm not sure if it's worth the complexity.
>
> We should be able to make it work just fine with XFS if it allows
> decoupled cleanup. i.e. it is run via a new cleanup method that
> would be called instead of ->destroy_inode, leaving the filesystem
> to call destroy_inode() when it is done?
>
> > The other solution here is to just say screw it, we'll just always writeback
> > dirty inodes, and if they were unlinked then they get unlinked like always. I
> > think this is also a fine solution, because generally speaking if you've got
> > memory pressure on the system and the file is dirty and still open, you'll be
> > writing it back normally anyway. But I don't know how people feel about this.
> >
> > CONCLUSION
> >
> > I'd love some feedback on my potential problems and solutions, as well as any
> > other problems people may see. If we can get some discussion beforehand I can
> > finish up these patches and get some testing in before LSFMMBPF and we can have
> > a proper in-person discussion about the realities of the patchset. Thanks,
>
> I think there's several overlapping issues here:
>
> 1. everything that stores state on/in the VFS inode needs to take a
> reference to the inode.
> 2. Hacks like "walk the superblock inode list to visit every cached
> inode without holding references to them" need to go away
> 3. inode eviction needs to be reworked to allow decoupled processing
> of the inode once all VFS references go away.
> 4. The inode LRU (i.e. unreferenced inode caching) needs to go away
>
> Anything else I missed? :)
Nope, that's all I've got on my list. I want to run my plan for decoupling
the inode eviction by you before I send my patches out to see if you have a
better idea. I know XFS has all this delaying stuff, but a lot of file systems
don't have that infrastructure, and I don't want to go around adding it to
everybody, so I still want a VFS-level hook to do the final truncate. The
question is where to put it, and will it at the very least not mess you guys up,
or in the best case be useful.
We're agreed on the active/passive refcount, so the active refcount will be the
arbiter of when we can do the final truncate. My current patchset adds a new
->final_unlink() method to the inode_operations, and if it's set we call it when
the active refcount goes to 0. Obviously a passive refcount is still held on
the inode while this operation is occurring.
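Roughly this shape, in other words (sketch only; i_active is the hypothetical
active count and the plumbing in the actual patches may differ):

	struct inode_operations {
		...
		/* called once, when the active refcount hits zero */
		void (*final_unlink)(struct inode *inode);
	};

	static void iput_active(struct inode *inode)
	{
		if (atomic_dec_and_test(&inode->i_active)) {
			if (inode->i_op->final_unlink)
				inode->i_op->final_unlink(inode);
		}
		iput(inode);	/* the passive ref held across the call */
	}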
I just want to spell it out, because some of what you've indicated is that you want
the file system to control when the final unlink happens, even outside of the
VFS. I'm not sure if you intend to say that it happens without a struct inode,
or you just mean that we'll delay it for whenever the file system wants to do
it, and we'll be holding the struct inode for that entire time. I'm assuming
it's the latter, but I just want to be sure I understand what you're saying
before I send patches and do something different and then you think I'm ignoring
you. Thanks,
Josef
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Dave Chinner @ 2025-03-04 22:23 UTC
To: Josef Bacik; +Cc: linux-fsdevel
On Tue, Mar 04, 2025 at 10:02:57AM -0500, Josef Bacik wrote:
> On Tue, Mar 04, 2025 at 07:04:49PM +1100, Dave Chinner wrote:
> > On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> > > CONCLUSION
> > >
> > > I'd love some feedback on my potential problems and solutions, as well as any
> > > other problems people may see. If we can get some discussion beforehand I can
> > > finish up these patches and get some testing in before LSFMMBPF and we can have
> > > a proper in-person discussion about the realities of the patchset. Thanks,
> >
> > I think there's several overlapping issues here:
> >
> > 1. everything that stores state on/in the VFS inode needs to take a
> > reference to the inode.
> > 2. Hacks like "walk the superblock inode list to visit every cached
> > inode without holding references to them" need to go away
> > 3. inode eviction needs to be reworked to allow decoupled processing
> > of the inode once all VFS references go away.
> > 4. The inode LRU (i.e. unreferenced inode caching) needs to go away
> >
> > Anything else I missed? :)
>
> Nope, that's all I've got on my list. I want to run my plan for decoupling
> the inode eviction by you before I send my patches out to see if you have a
> better idea. I know XFS has all this delaying stuff, but a lot of file systems
> don't have that infrastructure, and I don't want to go around adding it to
> everybody, so I still want a VFS-level hook to do the final truncate. The
> question is where to put it, and will it at the very least not mess you guys up,
> or in the best case be useful.
What I've been looking at locally is for the iput_final() processing
to be done asynchronously rather than in iput() context.
i.e. once the refcount goes to zero, we do all the cleanup stuff we
can do without blocking, then punt it to an async context to finish
the cleanup.
The problems I've had with this stem entirely from the "inodes with a
refcount of zero are still visible and accessed by the VFS code"
problem, and I hadn't got to solving it. The sb list iter work I
posted a while back (and need to finish off) was the first step I'd
made in that direction.
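The shape I've been playing with is roughly this (i_evict_work and
inode_evict_wq are placeholders, not actual fields):

	static void inode_evict_workfn(struct work_struct *work)
	{
		struct inode *inode =
			container_of(work, struct inode, i_evict_work);

		evict(inode);	/* all the teardown that may block */
	}

	void iput(struct inode *inode)
	{
		if (!atomic_dec_and_test(&inode->i_count))
			return;
		/* do the cleanup that cannot block here, then punt */
		INIT_WORK(&inode->i_evict_work, inode_evict_workfn);
		queue_work(inode_evict_wq, &inode->i_evict_work);
	}

Among other things, that is what makes iput() safe to call from any context,
which matters for the mm hacks mentioned below.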
> We're agreed on the active/passive refcount, so the active refcount will be the
> arbiter of when we can do the final truncate. My current patchset adds a new
> ->final_unlink() method to the inode_operations, and if it's set we call it when
> the active refcount goes to 0. Obviously a passive refcount is still held on
> the inode while this operation is occurring.
Yes, I think this allows async processing of the inode after the
active refcount reaches zero and iput_final() is called. The async
processing retains a passive reference all the way through until the
inode is completely destroyed and then freed.
> I just want to spell it out, because some of what you've indicated is that you want
> the file system to control when the final unlink happens, even outside of the
> VFS.
What I really want is for all iput_final/evict handling that may
need to block to be punted to an async processing context. That's
what we already do with XFS, and what I'd really like for the VFS to
natively support. i.e. passive reference counts allow decoupling of
inode eviction work from the task context that drops the last active
reference.
We need to do this to allow the nasty mm LRU hacks to be turned into
active reference counts because those hacks require iput() to be
safe to call from any context...
> I'm not sure if you intend to say that it happens without a struct inode,
> or you just mean that we'll delay it for whenever the file system wants to do
> it, and we'll be holding the struct inode for that entire time.
Yes, XFS holds the struct inode via the struct xfs_inode
wrapped around the outside of it. i.e. the VFS inode lifecycle is a
subset of the XFS inode lifecycle.
i.e. I would expect a filesystem with such a VFS vs FS
lifecycle setup to maintain a passive struct inode reference count
for its own inode wrapper structure. This means the object will
remain valid for as long as the filesystem needs its inode wrapper
structure to live.
The filesystem would add the passive reference at inode allocation.
This passive reference then gets removed when the filesystem is done
tearing down its internal inode structure, at which point the whole
object can be freed.
Also, we already do a -lot- of work behind the scenes without inode
references in the internal XFS code (e.g. inode writeback does
lockless, cache coherent unreferenced lookups of inodes). Converting
these internal lookups to use passive references would greatly
simplify the internal lookup vs reclaim interactions that we
currently have to handle....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Matthew Wilcox @ 2025-03-04 22:44 UTC
To: Josef Bacik; +Cc: linux-fsdevel
On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> I've recently gotten annoyed with the current reference counting rules that
> exist in the file system arena, specifically this pattern of having 0 referenced
> objects that indicate that they're ready to be reclaimed.
>
> We do this everywhere, with inodes, dentries, and folios, but I specifically
Folios? I don't think so. The only exceptions I know to the rule of
"when a folio refcount reaches 0 it is reclaimed" are:
- Frozen folios. Filesystems should never see a frozen folio.
- Devmem folios. They were freed when their refcount reached 1.
Alistair has that fixed in -next.
If there's something you don't like about the folio refcount, please let
me know. I have Some Thoughts:
- It's silly that the pagecache takes 2^order references on the folio.
It should be just 1.
- We play with the refcount in too many places. In conjunction with
the first point, what I'd like is if filemap_add_folio() consumed the
refcount on the folio passed to it. That is, today we allocate the
folio (refcount 1), pass it to filemap_add_folio() which increments
the refcount by 2^order and then put the refcount in readahead_folio().
We should do none of that; just have the pagecache assume the
refcount that was passed in. There are a few filesystems this would
break today ... need to finish some more conversions.
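To make that concrete, here's the dance today versus what I'd like, roughly
(paraphrasing; error handling is hand-waved):

	/* Today: three separate refcount manipulations per folio. */
	folio = filemap_alloc_folio(gfp, order);	/* refcount 1 */
	filemap_add_folio(mapping, folio, index, gfp);	/* +2^order */
	...
	folio = readahead_folio(ractl);			/* puts one ref */

	/* Instead: filemap_add_folio() consumes the allocation ref and
	 * the pagecache holds exactly one reference. */
	folio = filemap_alloc_folio(gfp, order);	/* refcount 1 */
	filemap_add_folio(mapping, folio, index, gfp);	/* consumes it */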
Anyway, what's your problem with the folio refcount? Filesystems
shouldn't need to care about folio refcounts (other than fuse which
decided to get all up in the mm's business).
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
From: Christoph Hellwig @ 2025-03-04 23:34 UTC
To: Josef Bacik; +Cc: linux-fsdevel
On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> Hello,
>
> I've recently gotten annoyed with the current reference counting rules that
> exist in the file system arena, specifically this pattern of having 0
> referenced objects that indicate that they're ready to be reclaimed.
The other way around is worse. E.g. the xfs_buf currently holds a
reference of 1 for buffers on the LRU, which makes a complete mess of
the buf rele and related code. And it prevents us from using the
lockref primitive. Switching it to have a refcount of zero greatly
cleans this up:
http://git.infradead.org/?p=users/hch/xfs.git;a=commitdiff;h=8b46634dd6199f332f09e6f730a7a8801547c8b5
I suspect the inode just needs clearer rules about what state an
inode can be in.