* [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes
@ 2025-03-03 17:00 Josef Bacik
2025-03-04 8:04 ` Dave Chinner
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Josef Bacik @ 2025-03-03 17:00 UTC (permalink / raw)
To: linux-fsdevel
Hello,
I've recently gotten annoyed with the current reference counting rules that
exist in the file system arena, specifically this pattern where objects with a
reference count of 0 indicate that they're ready to be reclaimed.
This pattern consistently bites us in the ass, is error prone, and gives us a
lot of complicated logic around when an object is actually allowed to be
touched versus when it is not.
We do this everywhere, with inodes, dentries, and folios, but I specifically
went to change inodes recently thinking it would be the easiest, and I've run
into a few big questions. Currently I've got ~30 patches, and that is
mostly just modifying the existing file systems for a new inode_operation.
Before I devote more time to this silly path, I figured it'd be good to bring it
up to the group to get some input on what possible better solutions there would
be.
I'll try to make this as easy to follow as possible, but I spent a full day and
a half writing code and thinking about this and it's kind of complicated. I'll
break this up into sections to try and make it easier to digest.
WHAT DO I WANT
I want to have refcount 0 == we're freeing the object. This will give us clear
"I'm using this object, thus I have a reference count on it" rules, and we can
(hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
I_WILL_FREE).
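
To make that concrete, here's a minimal sketch of the rule I'm after. This is
illustrative only, not code from the patchset; foo_inode, foo_igrab(),
foo_iput(), and foo_evict_and_free() are made-up names standing in for the
real inode machinery.

#include <linux/refcount.h>

/*
 * Toy model, not the real struct inode: the only legal transition to a
 * refcount of 0 is the one that frees the object, so there is no window
 * where a zero-ref object is still live and touchable.
 */
struct foo_inode {
        refcount_t count;
        /* ... the rest of the inode ... */
};

static void foo_evict_and_free(struct foo_inode *inode);

static inline struct foo_inode *foo_igrab(struct foo_inode *inode)
{
        /* Fails only if the object is already on its way to being freed. */
        if (!refcount_inc_not_zero(&inode->count))
                return NULL;
        return inode;
}

static inline void foo_iput(struct foo_inode *inode)
{
        /* Last reference gone == we're freeing the object, full stop. */
        if (refcount_dec_and_test(&inode->count))
                foo_evict_and_free(inode);
}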
HOW DO I WANT TO DO THIS
Well obviously we keep a reference count whenever we are using the inode, and
we hold a reference when it is on a list. This means the i_io_list holds a
reference to the inode, and the LRU list holds a reference to the inode.
This makes LRU handling easier: we just walk the objects and drop our reference
to each one. If it was truly the last reference then we free the inode,
otherwise it will get added back onto the LRU list when the next guy does an
iput().
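
Roughly, the shrinker side then collapses to something like this. Again a
sketch under the new rules, not code from the patches; the lru locking is
deliberately omitted and prune_inode_lru() is a made-up name.

#include <linux/fs.h>
#include <linux/list.h>

/*
 * Sketch: the LRU list owns a reference to every inode on it, so
 * reclaim is just "unlink from the list and drop the list's reference".
 */
static void prune_inode_lru(struct list_head *lru, unsigned long nr_to_scan)
{
        struct inode *inode, *next;

        list_for_each_entry_safe(inode, next, lru, i_lru) {
                if (!nr_to_scan--)
                        break;
                list_del_init(&inode->i_lru);
                /*
                 * If this was the last reference, iput() frees the inode;
                 * otherwise the next iput() re-adds it to the LRU.
                 */
                iput(inode);
        }
}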
POTENTIAL PROBLEM #1
Now we're actively checking to see if this inode is on the LRU list and
potentially taking the LRU list lock more often. I don't think this will be
much of a problem in practice: we would check the inode flags before taking
the lock, so we would only marginally increase the contention on the LRU lock.
We could mitigate this by doing the LRU list add at lookup time, where we
already have to grab some of these locks, but I don't want to get into
premature optimization territory here. I'm just surfacing it as a potential
problem.
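
For reference, the flags check I have in mind is the usual double-checked
test, something like the sketch below. I_ON_LRU, inode_lru, and
inode_lru_lock are made-up names, and the i_state locking rules are
simplified here (in reality i_state is protected by i_lock).

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(inode_lru);                /* hypothetical global LRU */
static DEFINE_SPINLOCK(inode_lru_lock);
#define I_ON_LRU        (1 << 30)           /* hypothetical i_state flag */

static void inode_lru_add_if_needed(struct inode *inode)
{
        /* Unlocked peek: the common case never touches the LRU lock. */
        if (inode->i_state & I_ON_LRU)
                return;

        spin_lock(&inode_lru_lock);
        if (!(inode->i_state & I_ON_LRU)) {
                inode->i_state |= I_ON_LRU;
                list_add_tail(&inode->i_lru, &inode_lru);
        }
        spin_unlock(&inode_lru_lock);
}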
POTENTIAL PROBLEM #2
We have a fair bit of logic in writeback around when we can just skip
writeback, which amounts to: we're currently doing the final truncate on an
inode whose ->i_nlink has gone to 0. This is kind of a big problem actually,
as we could now potentially end up with a large dirty inode that has an nlink
of 0 and no current users, but that would now be written back because
writeback holds a reference on it. Before, we could get into the iput() and
clean everything up before writeback would occur. Now writeback would occur,
and then we'd clean up the inode.
SOLUTION FOR POTENTIAL PROBLEM #1
I think we ignore this for now, get the patches written, do some benchmarking
and see if this actually shows up in benchmarks. If it does then we come up
with strategies to resolve this at that point.
SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
My initial thought was to just move the final unlink logic outside of evict()
and create a new reference count that represents the actual use of the inode.
Then when the actual use count went to 0 we would do the final unlink,
de-coupling the cleanup of the on-disk inode (in the case of local file
systems) from the freeing of the memory.
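
Something along these lines (sketch only; i_active_count and
do_final_unlink() are hypothetical names, and the set of active users
collectively holds one passive reference that iput_active() drops at the
end):

#include <linux/fs.h>
#include <linux/refcount.h>

static void do_final_unlink(struct inode *inode);   /* hypothetical */

/*
 * Sketch of the active/passive split: dropping the last *active*
 * reference is the safe point for the final unlink/truncate, while
 * freeing the memory still waits for the last *passive* reference.
 */
static void iput_active(struct inode *inode)
{
        /* i_active_count is a hypothetical second refcount_t field. */
        if (!refcount_dec_and_test(&inode->i_active_count))
                return;

        if (inode->i_nlink == 0)
                do_final_unlink(inode); /* on-disk cleanup, may block */

        iput(inode);    /* drop the passive ref held for the active users */
}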
This is a nice-to-have because the other thing that bites us occasionally is
an iput() in a place where it isn't necessarily safe or desirable to do the
final truncate on the inode. This would allow us to do the final truncate at a
time when it is safe to do so.
However this means adding a different reference count to the inode. I started
to do this work, but it runs into some ugliness around ->tmpfile and file
systems that don't use the normal inode caching things (bcachefs, xfs). I do
like this solution, but I'm not sure if it's worth the complexity.
The other solution here is to just say screw it, we'll always write back
dirty inodes, and if they were unlinked then they get unlinked like always. I
think this is also a fine solution, because generally speaking if you've got
memory pressure on the system and the file is dirty and still open, you'll be
writing it back normally anyway. But I don't know how people feel about this.
CONCLUSION
I'd love some feedback on my potential problems and solutions, as well as any
other problems people may see. If we can get some discussion beforehand I can
finish up these patches and get some testing in before LSFMMBPF and we can have
a proper in-person discussion about the realities of the patchset. Thanks,
Josef
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-03 17:00 [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes Josef Bacik @ 2025-03-04 8:04 ` Dave Chinner 2025-03-04 15:02 ` Josef Bacik 2025-03-04 10:19 ` Christian Brauner ` (2 subsequent siblings) 3 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2025-03-04 8:04 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-fsdevel On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > Hello, > > I've recently gotten annoyed with the current reference counting rules that > exist in the file system arena, specifically this pattern of having 0 referenced > objects that indicate that they're ready to be reclaimed. Which means you are now talking about inode life cycle issues :) There is some recent relevant discussion here: https://lore.kernel.org/linux-fsdevel/ZvNWqhnUgqk5BlS4@dread.disaster.area/ And here: https://lore.kernel.org/all/20241002014017.3801899-1-david@fromorbit.com/ > This pattern consistently bites us in the ass, is error prone, gives us a lot of > complicated logic around when an object is actually allowed to be touched versus > when it is not. > > We do this everywhere, with inodes, dentries, and folios, but I specifically > went to change inodes recently thinking it would be the easiest, and I've run > into a few big questions. Currently I've got about ~30 patches, and that is > mostly just modifying the existing file systems for a new inode_operation. > Before I devote more time to this silly path, I figured it'd be good to bring it > up to the group to get some input on what possible better solutions there would > be. I want to get rid of the inode cache (i.e. the unreferenced inodes on the LRU) altogether. To do that, everything that needs the inode to stay in memory (e.g. writeback) needs to hold an active reference to the inode. As you've noticed, there are lots of hacks in code that tries to hang subsystem state/objects off the inode without taking an inode reference. > I'll try to make this as easy to follow as possible, but I spent a full day and > a half writing code and thinking about this and it's kind of complicated. I'll > break this up into sections to try and make it easier to digest. > > WHAT DO I WANT > > I want to have refcount 0 == we're freeing the object. This will give us clear > "I'm using this object, thus I have a reference count on it" rules, and we can > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING | > I_WILL_FREE). Yes. IMO, when iput_final() is called, the VFS inode should be kaput, never to be found again. Nothing in the VFS should be accessing it, nor should any VFS level code be allowed to take a new reference to it at that point in time. > HOW DO I WANT TO DO THIS > > Well obviously we keep a reference count always whenever we are using the inode, > and we hold a reference when it is on a list. This means the i_io_list holds a > reference to the inode, that means the LRU list holds a reference to the inode. Add to that list:
- fsnotify objects that are currently swept at superblock teardown via an inode cache walk.
- LSM modules (e.g. landlock) whose state is swept at superblock teardown via an inode cache walk.
- dquots
- writeback
- mapping tree containing folios (*)
- anything else that stores non-core state or private objects in/on the VFS inode.
(*) this is why __inode_add_lru() is such a mess and why I NACKed it being used in the mm subsystem in the first place. The mm subsystem uses this to prevent unreferenced inodes that have page cache attached to them from being added to the LRU. Instead, the mm subsystem has been allowed to prevent unreferenced inodes from being tracked by the VFS whilst the mm subsystem tracks and ages out mm owned objects attached to the inodes. The whole reason this was done is to prevent the inode cache shrinker from tossing inodes with cached pages before the workingset aging mechanism decides the page cache needs to be tossed. But to do this, games had to be played with the LRU because the places where the mm wanted to mark the inode as "can be reclaimed" are places where it is not safe to call iput().... IOWs, any "all inode usage needs to be reference counted" scheme is going to have to solve this .... mess in some way.... > This makes LRU handling easier, we just walk the objects and drop our reference > to the object. If it was truly the last reference then we free it, otherwise it > will get added back onto the LRU list when the next guy does an iput(). > > POTENTIAL PROBLEM #1 > > Now we're actively checking to see if this inode is on the LRU list and > potentially taking the lru list lock more often. I don't think this will be the > case, as we would check the inode flags before we take the lock, so we would > martinally increase the lock contention on the LRU lock. > We could mitigate this > by doing the LRU list add at lookup time, where we already have to grab some of > these locks, but I don't want to get into premature optimization territory here. > I'm just surfacing it as a potential problem. The inode cache (and the LRU) needs to go away. Reference count everything so page cache residency doesn't trigger inode reclaim, and otherwise the dentry cache LRU keeps the hot working set on inodes in cache. There is no reason for maintaining a separately aged inode cache these days.... > POTENTIAL PROBLEM #2 > > We have a fair bit of logic in writeback around when we can just skip writeback, > which amounts to we're currently doing the final truncate on an inode with > ->i_nlink set. This is kind of a big problem actually, as we could no > potentially end up with a large dirty inode that has an nlink of 0, and no > current users, but would now be written back because it has a reference on it > from writeback. Before we could get into the iput() and clean everything up > before writeback would occur. Now writeback would occur, and then we'd clean up > the inode. Yup, an active/passive reference pattern. > SOLUTION FOR POTENTIAL PROBLEM #1 > > I think we ignore this for now, get the patches written, do some benchmarking > and see if this actually shows up in benchmarks. If it does then we come up > with strategies to resolve this at that point. Oh, I pretty much guarantee moving away from lazy LRU management will show up in any benchmark that stresses cold cache behaviour :/ > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here > > My initial thought was to just move the final unlink logic outside of evict, and That's what XFS does - we don't actually process the final unlink stage until after ->destroy_inode has been called. We defer that processing to background inodegc workers (i.e. processed after unlink() returns to userspace), and it all happens without the VFS knowing anything about it. > create a new reference count that represents the actual use of the inode.
Then > when the actual use went to 0 we would do the final unlink, de-coupling the > cleanup of the on-disk inode (in the case of local file systems) from the > freeing of the memory. Can we defer the final filesystem unlink processing like XFS does? The filesystem effectively controls freeing of the inode object, so it can live as long as the filesystem wants once the VFS is actively done with it. > This is a nice to have because the other thing that bites us occasionally is an > iput() in a place where we don't necessarily want to be/is safe to do the final > truncate on the inode. This would allow us to do the final truncate at a time > when it is safe to do so. Yes, this is one of the reasons why XFS has traditionally done this final unlink stage outside the VFS inode life cycle. We also do cleanup of speculative prealloc outside the VFS inode life cycle via the same code paths. > However this means adding a different reference count to the inode. I started > to do this work, but it runs into some ugliness around ->tmpfile and file > systems that don't use the normal inode caching things (bcachefs, xfs). I do > like this solution, but I'm not sure if it's worth the complexity. We should be able to make it work just fine with XFS if it allows decoupled cleanup. i.e. it is run via a new cleanup method that would be called instead of ->destroy_inode, leaving the filesystem to call destroy_inode() when it is done? > The other solution here is to just say screw it, we'll just always writeback > dirty inodes, and if they were unlinked then they get unlinked like always. I > think this is also a fine solution, because generally speaking if you've got > memory pressure on the system and the file is dirty and still open, you'll be > writing it back normally anyway. But I don't know how people feel about this. > > CONCLUSION > > I'd love some feedback on my potential problems and solutions, as well as any > other problems people may see. If we can get some discussion beforehand I can > finish up these patches and get some testing in before LSFMMBPF and we can have > a proper in-person discussion about the realities of the patchset. Thanks, I think there are several overlapping issues here:
1. everything that stores state on/in the VFS inode needs to take a reference to the inode.
2. Hacks like "walk the superblock inode list to visit every cached inode without holding references to them" need to go away.
3. inode eviction needs to be reworked to allow decoupled processing of the inode once all VFS references go away.
4. The inode LRU (i.e. unreferenced inode caching) needs to go away.
Anything else I missed? :) -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-04 8:04 ` Dave Chinner @ 2025-03-04 15:02 ` Josef Bacik 2025-03-04 22:23 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Josef Bacik @ 2025-03-04 15:02 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel On Tue, Mar 04, 2025 at 07:04:49PM +1100, Dave Chinner wrote: > On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > > Hello, > > > > I've recently gotten annoyed with the current reference counting rules that > > exist in the file system arena, specifically this pattern of having 0 referenced > > objects that indicate that they're ready to be reclaimed. > > Which means you are now talking about inode life cycle issues :) > > There is some recent relevant discussion here: > > https://lore.kernel.org/linux-fsdevel/ZvNWqhnUgqk5BlS4@dread.disaster.area/ > > And here: > > https://lore.kernel.org/all/20241002014017.3801899-1-david@fromorbit.com/ > > > This pattern consistently bites us in the ass, is error prone, gives us a lot of > > complicated logic around when an object is actually allowed to be touched versus > > when it is not. > > > > We do this everywhere, with inodes, dentries, and folios, but I specifically > > went to change inodes recently thinking it would be the easiest, and I've run > > into a few big questions. Currently I've got about ~30 patches, and that is > > mostly just modifying the existing file systems for a new inode_operation. > > Before I devote more time to this silly path, I figured it'd be good to bring it > > up to the group to get some input on what possible better solutions there would > > be. > > I want to get rid of the inode cache (i.e. the unreferenced inodes > on the LRU) altogether. To do that, everything that needs the inode > to stay in memory (e.g. writeback) needs to hold an active reference > to the inode. > > As you've noticed, there are lots of hacks in code that tries to > hang subsystem state/objects off the inode without taking an inode > reference. > > > I'll try to make this as easy to follow as possible, but I spent a full day and > > a half writing code and thinking about this and it's kind of complicated. I'll > > break this up into sections to try and make it easier to digest. > > > > WHAT DO I WANT > > > > I want to have refcount 0 == we're freeing the object. This will give us clear > > "I'm using this object, thus I have a reference count on it" rules, and we can > > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING | > > I_WILL_FREE). > > Yes. IMO, when iput_final() is called, the VFS inode should be kaput > never to be found again. Nothing in the VFS should be accessing it, > nor should any VFS level code be allowed to take a new reference to > it at that point in time. > > > HOW DO I WANT TO DO THIS > > > > Well obviously we keep a reference count always whenever we are using the inode, > > and we hold a reference when it is on a list. This means the i_io_list holds a > > reference to the inode, that means the LRU list holds a reference to the inode. > > Add to that list: > > - fsnotify objects that are currently sweep from superblock teardown > via an inode cache walk. > - LSM modules (e.g. landlock) that sweep state from superblock > teardown via an inode cache walk. > - dquots > - writeback > - mapping tree containing folios (*) > - anything else that stores non-core state or private objects > in/on the VFS inode. 
> > (*) this is why __inode_add_lru() is such a mess and why I NACKed it > being used in the mm subsystem in the first place. The mm subsystem > uses this to prevent unreferenced inodes that have page cache > attached to them from being added to the LRU. Instead, the mm > subsystem has been allows to prevent unreferenced inodes from being > tracked by the VFS whilst the mm subsystem tracks and ages out mm > owned objects attached to the inodes. > > The whole reason this was done is to prevent the inode cache > shrinker from tossing inodes with cached pages before the workingset > aging mechanism decides the page cache needs to be tossed. But to do > this, games had to be played with the LRU because the places where > the mm wanted to mark the inode as "can be reclaimed" are places > where it is not safe to call iput().... > > IOWs, any "all inode usage needs to be reference counted" is going > to have to solve this .... mess in some way.... > > > This makes LRU handling easier, we just walk the objects and drop our reference > > to the object. If it was truly the last reference then we free it, otherwise it > > will get added back onto the LRU list when the next guy does an iput(). > > > > POTENTIAL PROBLEM #1 > > > > Now we're actively checking to see if this inode is on the LRU list and > > potentially taking the lru list lock more often. I don't think this will be the > > case, as we would check the inode flags before we take the lock, so we would > > martinally increase the lock contention on the LRU lock. > > > We could mitigate this > > by doing the LRU list add at lookup time, where we already have to grab some of > > these locks, but I don't want to get into premature optimization territory here. > > I'm just surfacing it as a potential problem. > > The inode cache (and the LRU) needs to go away. Reference count > everything so page cache residency doesn't trigger inode reclaim, > and otherwise the dentry cache LRU keeps the hot working set on > inodes in cache. There is no reason for maintaining a separately > aged inode cache these days.... > > > POTENTIAL PROBLEM #2 > > > > We have a fair bit of logic in writeback around when we can just skip writeback, > > which amounts to we're currently doing the final truncate on an inode with > > ->i_nlink set. This is kind of a big problem actually, as we could no > > potentially end up with a large dirty inode that has an nlink of 0, and no > > current users, but would now be written back because it has a reference on it > > from writeback. Before we could get into the iput() and clean everything up > > before writeback would occur. Now writeback would occur, and then we'd clean up > > the inode. > > Yup, an active/passive reference pattern. > > > SOLUTION FOR POTENTIAL PROBLEM #1 > > > > I think we ignore this for now, get the patches written, do some benchmarking > > and see if this actually shows up in benchmarks. If it does then we come up > > with strategies to resolve this at that point. > > Oh, I pretty much guarantee moving away from lazy LRU management > will show up in any benchmark that stresses cold cache behaviour :/ > > > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here > > > > My initial thought was to just move the final unlink logic outside of evict, and > > That's what XFS does - we don't actually process the final unlink > stage until after ->destroy_inode has been called. We defer that > processing to background inodegc workers (i.e. 
processed after > unlink() returns to userspace), and it all happens without the VFS > knowing anything about it. > > > create a new reference count that represents the actual use of the inode. Then > > when the actual use went to 0 we would do the final unlink, de-coupling the > > cleanup of the on-disk inode (in the case of local file systems) from the > > freeing of the memory. > > Can we defer the final filesystem unlink processing like > XFS does? The filesystem effectively controls freeing of the inode > object, so it can live as long as the filesystem wants once the > VFS is actively done with it. > > > This is a nice to have because the other thing that bites us occasionally is an > > iput() in a place where we don't necessarily want to be/is safe to do the final > > truncate on the inode. This would allow us to do the final truncate at a time > > when it is safe to do so. > > Yes, this is one of the reasons why XFS has traditionally done this > final unlink stage outside the VFS inode life cycle. We also do > cleanup of speculative prealloc outside the VFS inode life cycle via > the same code paths, too. > > > However this means adding a different reference count to the inode. I started > > to do this work, but it runs into some ugliness around ->tmpfile and file > > systems that don't use the normal inode caching things (bcachefs, xfs). I do > > like this solution, but I'm not sure if it's worth the complexity. > > We should be able to make it work just fine with XFS if it allows > decoupled cleanup. i.e. it is run via a new cleanup method that > would be called instead of ->destroy_inode, leaving the filesystem > to call destroy_inode() when it is done? > > > The other solution here is to just say screw it, we'll just always writeback > > dirty inodes, and if they were unlinked then they get unlinked like always. I > > think this is also a fine solution, because generally speaking if you've got > > memory pressure on the system and the file is dirty and still open, you'll be > > writing it back normally anyway. But I don't know how people feel about this. > > > > CONCLUSION > > > > I'd love some feedback on my potential problems and solutions, as well as any > > other problems people may see. If we can get some discussion beforehand I can > > finish up these patches and get some testing in before LSFMMBPF and we can have > > a proper in-person discussion about the realities of the patchset. Thanks, > > I think there's several overlapping issues here: > > 1. everything that stores state on/in the VFS inode needs to take a > reference to the inode. > 2. Hacks like "walk the superblock inode list to visit every cached > inode without holding references to them" need to go away > 3. inode eviction needs to be reworked to allow decoupled processing > of the inode once all VFS references go away. > 4. The inode LRU (i.e. unreferenced inode caching) needs to go away > > Anything else I missed? :) Nope, that's all I've got on my list. I want to run my plan for decoupling the inode eviction by you before I send my patches out to see if you have a better idea. I know XFS has all this delaying stuff, but a lot of file systems don't have that infrastructure, and I don't want to go around adding it to everybody, so I still want to have a VFS hook to do the final truncate. The question is where to put it, and whether it will at the very least not mess you guys up, or in the best case be useful.
We're agreed on the active/passive refcount, so the active refcount will be the arbiter of when we can do the final truncate. My current patchset adds a new ->final_unlink() method to the inode_operations, and if it's set we call it when the active refcount goes to 0. Obviously a passive refcount is still held on the inode while this operation is occurring. I just want to spell it out, because some of what you've indicated is that you want the file system to control when the final unlink happens, even outside of the VFS. I'm not sure if you intend to say that it happens without a struct inode, or whether you just mean that we'll delay it for whenever the file system wants to do it, and we'll be holding the struct inode for that entire time. I'm assuming it's the latter, but I just want to be sure I understand what you're saying before I send patches, do something different, and leave you thinking I'm ignoring you. Thanks, Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-04 15:02 ` Josef Bacik @ 2025-03-04 22:23 ` Dave Chinner 0 siblings, 0 replies; 8+ messages in thread From: Dave Chinner @ 2025-03-04 22:23 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-fsdevel On Tue, Mar 04, 2025 at 10:02:57AM -0500, Josef Bacik wrote: > On Tue, Mar 04, 2025 at 07:04:49PM +1100, Dave Chinner wrote: > > On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > > > CONCLUSION > > > > > > I'd love some feedback on my potential problems and solutions, as well as any > > > other problems people may see. If we can get some discussion beforehand I can > > > finish up these patches and get some testing in before LSFMMBPF and we can have > > > a proper in-person discussion about the realities of the patchset. Thanks, > > > > I think there's several overlapping issues here: > > > > 1. everything that stores state on/in the VFS inode needs to take a > > reference to the inode. > > 2. Hacks like "walk the superblock inode list to visit every cached > > inode without holding references to them" need to go away > > 3. inode eviction needs to be reworked to allow decoupled processing > > of the inode once all VFS references go away. > > 4. The inode LRU (i.e. unreferenced inode caching) needs to go away > > > > Anything else I missed? :) > > Nope that's all what I've got on my list. I want to run my plan for decoupling > the inode eviction by you before I send my patches out to see if you have a > better idea. I know XFS has all this delaying stuff, but a lot of file systems > don't have that infrastructure, and I don't want to go around adding it to > everybody, so I still want to have a VFS hook way to do the final truncate. The > question is where to put it, and will it at the very least not mess you guys up, > or in the best case be useful. What I've been looking at locally is for the iput_final() processing to be done asynchronously rather than in iput() context. i.e. once the refcount goes to zero, we do all the cleanup stuff we can do without blocking, then punt it to an async context to finish the cleanup. The problems I've had with this are entirely from the "inodes with a refcount of zero are still visible and accessed by the VFS code" problem, and I hadn't got to solving that. The sb list iter work I posted a while back (and need to finish off) was the first step I'd made in that direction. > We're agreed on the active/passive refcount, so the active refcount will be the > arbiter of when we can do the final truncate. My current patchset adds a new > ->final_unlink() method to the inode_operations, and if it's set we call it when > the active refcount goes to 0. Obviously a passive refcount is still held on > the inode while this operation is occurring. Yes, I think this allows async processing of the inode after the active refcount reaches zero and iput_final() is called. The async processing retains a passive reference all the way through until the inode is completely destroyed and then freed. > I just want to spell it out, because some of what you've indicated is you want > the file system to control when the final unlink happens, even outside of the > VFS. What I really want is for all iput_final/evict handling that may need to block to be punted to an async processing context. That's what we already do with XFS, and what I'd really like for the VFS to natively support. i.e.
passive reference counts allow decoupling of inode eviction work from the task context that drops the last active reference. We need to do this to allow the nasty mm LRU hacks to be turned into active reference counts because those hacks require iput() to be safe to call from any context... > I'm not sure if you intend to say that it happens without a struct inode, > or you just mean that we'll delay it for whenever the file system wants to do > it, and we'll be holding the struct inode for that entire time. Yes, XFS holds the struct inode via the struct xfs_inode wrapped around the outside of it. i.e. the VFS inode lifecycle is a subset of the XFS inode lifecycle. i.e. I would expect a filesystem with such a VFS vs FS lifecycle setup to maintain a passive struct inode reference count for its own inode wrapper structure. This means the object will remain valid for as long as the filesystem needs its inode wrapper structure to live. The filesystem would add the passive reference at inode allocation. This passive reference then gets removed when the filesystem is done tearing down its internal inode structure and at that point the whole object can be freed. Also, we already do a -lot- of work behind the scenes without inode references in the internal XFS code (e.g. inode writeback does lockless, cache coherent unreferenced lookups of inodes). Converting these internal lookups to use passive references would greatly simplify the internal lookup vs reclaim interactions that we currently have to handle.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-03 17:00 [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes Josef Bacik 2025-03-04 8:04 ` Dave Chinner @ 2025-03-04 10:19 ` Christian Brauner 2025-03-04 14:56 ` Josef Bacik 2025-03-04 22:44 ` Matthew Wilcox 2025-03-04 23:34 ` Christoph Hellwig 3 siblings, 1 reply; 8+ messages in thread From: Christian Brauner @ 2025-03-04 10:19 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-fsdevel On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > Hello, > > I've recently gotten annoyed with the current reference counting rules that > exist in the file system arena, specifically this pattern of having 0 referenced > objects that indicate that they're ready to be reclaimed. > > This pattern consistently bites us in the ass, is error prone, gives us a lot of > complicated logic around when an object is actually allowed to be touched versus > when it is not. > > We do this everywhere, with inodes, dentries, and folios, but I specifically > went to change inodes recently thinking it would be the easiest, and I've run > into a few big questions. Currently I've got about ~30 patches, and that is > mostly just modifying the existing file systems for a new inode_operation. > Before I devote more time to this silly path, I figured it'd be good to bring it > up to the group to get some input on what possible better solutions there would > be. > > I'll try to make this as easy to follow as possible, but I spent a full day and > a half writing code and thinking about this and it's kind of complicated. I'll > break this up into sections to try and make it easier to digest. > > WHAT DO I WANT > > I want to have refcount 0 == we're freeing the object. This will give us clear > "I'm using this object, thus I have a reference count on it" rules, and we can > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING | > I_WILL_FREE). Yeah, I want to see the I_FREEING and I_WILL_FREE stuff go away. This bit fiddling and waiting is terribly opaque for anyone who hasn't worked on this since the dawn of time. So I'm all for it. > > HOW DO I WANT TO DO THIS > > Well obviously we keep a reference count always whenever we are using the inode, > and we hold a reference when it is on a list. This means the i_io_list holds a > reference to the inode, that means the LRU list holds a reference to the inode. > > This makes LRU handling easier, we just walk the objects and drop our reference > to the object. If it was truly the last reference then we free it, otherwise it > will get added back onto the LRU list when the next guy does an iput(). > > POTENTIAL PROBLEM #1 > > Now we're actively checking to see if this inode is on the LRU list and > potentially taking the lru list lock more often. I don't think this will be the > case, as we would check the inode flags before we take the lock, so we would > martinally increase the lock contention on the LRU lock. We could mitigate this > by doing the LRU list add at lookup time, where we already have to grab some of > these locks, but I don't want to get into premature optimization territory here. > I'm just surfacing it as a potential problem. Yes, ignore it for now. So I agree that if we can remove the inode cache altogether that would be pretty awesome, and we know that we have support from Linus for attempting that. But I'm not sure what regression potential that has.
There might just be enough implicit behavior that workloads depend on that will bite us in the ass. But I don't think you need to address this in this series. Your changes might end up making it easier to experiment with the inode cache removal though. > POTENTIAL PROBLEM #2 > > We have a fair bit of logic in writeback around when we can just skip writeback, > which amounts to we're currently doing the final truncate on an inode with > ->i_nlink set. This is kind of a big problem actually, as we could no > potentially end up with a large dirty inode that has an nlink of 0, and no > current users, but would now be written back because it has a reference on it > from writeback. Before we could get into the iput() and clean everything up > before writeback would occur. Now writeback would occur, and then we'd clean up > the inode. So in the old pattern you'd call iput_final() and then do writeback. Whereas in the new pattern you'd do writeback before iput_final(). And this is a problem because it potentially delays freeing of the inode for a long time? > > SOLUTION FOR POTENTIAL PROBLEM #1 > > I think we ignore this for now, get the patches written, do some benchmarking > and see if this actually shows up in benchmarks. If it does then we come up > with strategies to resolve this at that point. > > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here > > My initial thought was to just move the final unlink logic outside of evict, and > create a new reference count that represents the actual use of the inode. Then > when the actual use went to 0 we would do the final unlink, de-coupling the > cleanup of the on-disk inode (in the case of local file systems) from the > freeing of the memory. I really do like active/passive reference counts. I've used that pattern for mount namespaces, seccomp filters and some other stuff quite successfully. So I'm somewhat inclined to prefer that solution. Imho, when active/passive patterns are needed or useful it's almost always because the original single reference counting mechanism was semantically vague, mixing two different meanings of the reference count. So switching to an active/passive pattern will end up clarifying things. > This is a nice to have because the other thing that bites us occasionally is an > iput() in a place where we don't necessarily want to be/is safe to do the final > truncate on the inode. This would allow us to do the final truncate at a time > when it is safe to do so. > > However this means adding a different reference count to the inode. I started > to do this work, but it runs into some ugliness around ->tmpfile and file > systems that don't use the normal inode caching things (bcachefs, xfs). I do > like this solution, but I'm not sure if it's worth the complexity. > > The other solution here is to just say screw it, we'll just always writeback > dirty inodes, and if they were unlinked then they get unlinked like always. I > think this is also a fine solution, because generally speaking if you've got > memory pressure on the system and the file is dirty and still open, you'll be > writing it back normally anyway. But I don't know how people feel about this. > > CONCLUSION > > I'd love some feedback on my potential problems and solutions, as well as any > other problems people may see. If we can get some discussion beforehand I can > finish up these patches and get some testing in before LSFMMBPF and we can have > a proper in-person discussion about the realities of the patchset.
Thanks, > > Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-04 10:19 ` Christian Brauner @ 2025-03-04 14:56 ` Josef Bacik 0 siblings, 0 replies; 8+ messages in thread From: Josef Bacik @ 2025-03-04 14:56 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel On Tue, Mar 04, 2025 at 11:19:34AM +0100, Christian Brauner wrote: > On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > > Hello, > > > > I've recently gotten annoyed with the current reference counting rules that > > exist in the file system arena, specifically this pattern of having 0 referenced > > objects that indicate that they're ready to be reclaimed. > > > > This pattern consistently bites us in the ass, is error prone, gives us a lot of > > complicated logic around when an object is actually allowed to be touched versus > > when it is not. > > > > We do this everywhere, with inodes, dentries, and folios, but I specifically > > went to change inodes recently thinking it would be the easiest, and I've run > > into a few big questions. Currently I've got about ~30 patches, and that is > > mostly just modifying the existing file systems for a new inode_operation. > > Before I devote more time to this silly path, I figured it'd be good to bring it > > up to the group to get some input on what possible better solutions there would > > be. > > > > I'll try to make this as easy to follow as possible, but I spent a full day and > > a half writing code and thinking about this and it's kind of complicated. I'll > > break this up into sections to try and make it easier to digest. > > > > WHAT DO I WANT > > > > I want to have refcount 0 == we're freeing the object. This will give us clear > > "I'm using this object, thus I have a reference count on it" rules, and we can > > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING | > > I_WILL_FREE). > > Yeah, I want to see I_FREEING and I_WILL_FREE stuff to go away. This bit > fiddling and waiting is terribly opaque for anyone who hasn't worked on > this since the dawn of time. So I'm all for it. > > > > > HOW DO I WANT TO DO THIS > > > > Well obviously we keep a reference count always whenever we are using the inode, > > and we hold a reference when it is on a list. This means the i_io_list holds a > > reference to the inode, that means the LRU list holds a reference to the inode. > > > > This makes LRU handling easier, we just walk the objects and drop our reference > > to the object. If it was truly the last reference then we free it, otherwise it > > will get added back onto the LRU list when the next guy does an iput(). > > > > POTENTIAL PROBLEM #1 > > > > Now we're actively checking to see if this inode is on the LRU list and > > potentially taking the lru list lock more often. I don't think this will be the > > case, as we would check the inode flags before we take the lock, so we would > > martinally increase the lock contention on the LRU lock. We could mitigate this > > by doing the LRU list add at lookup time, where we already have to grab some of > > these locks, but I don't want to get into premature optimization territory here. > > I'm just surfacing it as a potential problem. > > Yes, ignore it for now. > > So I agree that if we can try and remove the inode cache altogether that > would be pretty awesome and we know that we have support for attempting > that from Linus. But I'm not sure what regression potential that has. 
> There might just be enough implicit behavior that workloads depend on > that will bite us in the ass. > > But I don't think you need to address this in this series. Your changes > might end up making it easier to experiemnt with the inode cache removal > though. > > > POTENTIAL PROBLEM #2 > > > > We have a fair bit of logic in writeback around when we can just skip writeback, > > which amounts to we're currently doing the final truncate on an inode with > > ->i_nlink set. This is kind of a big problem actually, as we could no > > potentially end up with a large dirty inode that has an nlink of 0, and no > > current users, but would now be written back because it has a reference on it > > from writeback. Before we could get into the iput() and clean everything up > > before writeback would occur. Now writeback would occur, and then we'd clean up > > the inode. > > So in the old pattern you'd call iput_final() and then do writeback. > Whereas in the new pattern you'd do writeback before iput_final(). > And this is a problem because it potentially delays freeing of the inode > for a long time? Well, we don't do the writeback in iput_final() if we've unlinked the inode. Currently, you can have the following sequence of events:
Open file
Write to file
Unlink file
Close file
iput_final()
  remove the inode from the io_list
  wait on current writeback that may have started before the iput_final()
  truncate the inode
  truncate the page cache
In this case we avoid the writeout completely, because once we've entered iput_final() and set I_FREEING the writeback threads will skip that inode if they try to write it back, so we skip a whole writeback cycle. With my naive implementation of just holding reference counts for everything, writeback now holds its reference, so what would happen now is:
Open file
Write to file
Unlink file
Close file
<some time passes>
Writeback occurs on the file
iput_final()
  truncate the inode
So now we've written back the inode, and then we truncate it. Is this bad? Not really, unless your workload does this and you're on SSDs; now you've got more write cycles than you had before. The slightly bigger problem is that now you've got the final iput happening in the writeback threads, which will induce latency in the whole system. > > > > > SOLUTION FOR POTENTIAL PROBLEM #1 > > > > I think we ignore this for now, get the patches written, do some benchmarking > > and see if this actually shows up in benchmarks. If it does then we come up > > with strategies to resolve this at that point. > > > > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here > > > > My initial thought was to just move the final unlink logic outside of evict, and > > create a new reference count that represents the actual use of the inode. Then > > when the actual use went to 0 we would do the final unlink, de-coupling the > > cleanup of the on-disk inode (in the case of local file systems) from the > > freeing of the memory. > > I really do like active/passive reference counts. I've used that pattern > for mount namespaces, seccomp filters and some other stuff quite > successfully. So I'm somewhat inclined to prefer that solution. > > Imho, when active/reference patterns are needed or useful then it's > almost always because the original single reference counting mechanism > was semantically vague because it mixed two different meanings of the > reference count. So switching to an active/passive pattern will end up > clarifying things.
>
Ok cool, that's the path I wandered down, and I wanted to make sure everybody else was cool with it. I'll finish up these patches and get some testing in and get them out so we can have something concrete to look at. I'm limiting my kernel development time to Fridays, so it'll be either the end of this week or next week when the patches show up. Thanks, Josef ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-03 17:00 [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes Josef Bacik 2025-03-04 8:04 ` Dave Chinner 2025-03-04 10:19 ` Christian Brauner @ 2025-03-04 22:44 ` Matthew Wilcox 2025-03-04 23:34 ` Christoph Hellwig 3 siblings, 0 replies; 8+ messages in thread From: Matthew Wilcox @ 2025-03-04 22:44 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-fsdevel On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > I've recently gotten annoyed with the current reference counting rules that > exist in the file system arena, specifically this pattern of having 0 referenced > objects that indicate that they're ready to be reclaimed. > > We do this everywhere, with inodes, dentries, and folios, but I specifically Folios? I don't think so. The only exceptions I know to the rule of "when a folio refcount reaches 0 it is reclaimed" are:
- Frozen folios. Filesystems should never see a frozen folio.
- Devmem folios. They were freed when their refcount reached 1. Alistair has that fixed in -next.
If there's something you don't like about the folio refcount, please let me know. I have Some Thoughts:
- It's silly that the pagecache takes 2^order references on the folio. It should be just 1.
- We play with the refcount in too many places.
In conjunction with the first point, what I'd like is for filemap_add_folio() to consume the refcount on the folio passed to it. That is, today we allocate the folio (refcount 1), pass it to filemap_add_folio() which increments the refcount by 2^n, and then put the refcount in readahead_folio(). We should do none of that; just have the pagecache assume the refcount that was passed in. There are a few filesystems this would break today ... need to finish some more conversions. Anyway, what's your problem with the folio refcount? Filesystems shouldn't need to care about folio refcounts (other than fuse which decided to get all up in the mm's business). ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes 2025-03-03 17:00 [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes Josef Bacik ` (2 preceding siblings ...) 2025-03-04 22:44 ` Matthew Wilcox @ 2025-03-04 23:34 ` Christoph Hellwig 3 siblings, 0 replies; 8+ messages in thread From: Christoph Hellwig @ 2025-03-04 23:34 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-fsdevel On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote: > Hello, > > I've recently gotten annoyed with the current reference counting rules that > exist in the file system arena, specifically this pattern of having 0 > referenced objects that indicate that they're ready to be reclaimed. The other way around is worse. E.g. the xfs_buf currently holds a reference count of 1 for buffers on the LRU, which makes a complete mess of xfs_buf_rele() and related code. And it prevents us from using the lockref primitive. Switching it to have a refcount of zero greatly cleans this up: http://git.infradead.org/?p=users/hch/xfs.git;a=commitdiff;h=8b46634dd6199f332f09e6f730a7a8801547c8b5 I suspect the inode just needs clearer rules about what state an inode can be in. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-03-04 23:34 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-03-03 17:00 [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes Josef Bacik 2025-03-04 8:04 ` Dave Chinner 2025-03-04 15:02 ` Josef Bacik 2025-03-04 22:23 ` Dave Chinner 2025-03-04 10:19 ` Christian Brauner 2025-03-04 14:56 ` Josef Bacik 2025-03-04 22:44 ` Matthew Wilcox 2025-03-04 23:34 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).