* [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
@ 2024-01-25 15:48 Steven Rostedt
2024-01-26 1:24 ` Greg Kroah-Hartman
0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-01-25 15:48 UTC (permalink / raw)
To: lsf-pc
Cc: linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds, Greg Kroah-Hartman
The tracefs file system was designed from the debugfs file system. The
rationale for separating tracefs from debugfs was to allow systems to
enable tracing but still keep debugfs disabled.
The debugfs API centers around dentry, e.g:
struct dentry *debugfs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
const struct file_operations *fops);
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
Where if you need to create a file in debugfs, you call the above
debugfs_create_file() code and it returns a dentry handle, that can be used
to delete that file later. If parent is NULL, it adds the file at the root
of the debugfs file system (/sys/kernel/debug), otherwise you could create
a directory within that file system with the debugfs_create_dir().
Behind the scenes, that dentry also has a created inode structure
representing it. This all happens regardless if debugfs is mounted or not!
As every trace event in the system is represented by a directory and
several files in tracefs's events directory, it created quite a lot of
dentries and inodes.
# find /sys/kernel/tracing/ | wc -l
18352
And if you create an instance it will duplicate all the events in the
instance directory:
# mkdir /sys/kernel/tracing/instances/foo
# find /sys/kernel/tracing/ | wc -l
36617
And that goes for every new instance you make!
# mkdir /sys/kernel/tracing/instances/bar
# find /sys/kernel/tracing/ | wc -l
54882
As having inodes and dentries created for all these files and directories
even when they are not used, wastes a lot of memory.
Two years ago at LSF/MM I presented changing how the events directory works
via a new "eventfs" file system. It would still be part of tracefs, but it
would dynamically create the inodes and dentries on the fly.
As I was new to how VFS works, and really didn't understand it as well as I
would have liked, I just got something working and finally submitted it.
But because of my inexperience, Linus had some strong issues against the
code. Part of this was because I was touching dentries when he said I
shouldn't be. But that is because the code was designed from debugfs, which
dentry is the central part of that code.
When Linus said to me:
"And dammit, it shouldn't be necessary. When the tree is mounted, there
should be no existing dentries."
(I'd share the link, but it was on the security list so there's no public
link for this conversation)
Linus's comment made me realize how debugfs was doing it wrong!
He was right, when a file system is mounted, it should not have any
dentries nor inodes. That's because dentry and inodes are basically "cache"
of the underlining file system. They should only be created when they are
referenced.
The debugfs and tracefs (and possibly other pseudo file systems) should not
be using dentry as a descriptor for the object. It should just create a
generic object that can save the fops, mode, parent, and data, and have the
dentries and inodes created when referenced just like any other file system
would.
Now that I have finished the eventfs file system, I would like to present a
proposal to make a more generic interface that the rest of tracefs and even
debugfs could use that wouldn't rely on dentry as the main handle.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-25 15:48 [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems Steven Rostedt
@ 2024-01-26 1:24 ` Greg Kroah-Hartman
2024-01-26 1:50 ` Steven Rostedt
0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2024-01-26 1:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Thu, Jan 25, 2024 at 10:48:22AM -0500, Steven Rostedt wrote:
> Now that I have finished the eventfs file system, I would like to present a
> proposal to make a more generic interface that the rest of tracefs and even
> debugfs could use that wouldn't rely on dentry as the main handle.
You mean like kernfs does for you today? :)
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 1:24 ` Greg Kroah-Hartman
@ 2024-01-26 1:50 ` Steven Rostedt
2024-01-26 1:59 ` Greg Kroah-Hartman
0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-01-26 1:50 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Thu, 25 Jan 2024 17:24:03 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> On Thu, Jan 25, 2024 at 10:48:22AM -0500, Steven Rostedt wrote:
> > Now that I have finished the eventfs file system, I would like to present a
> > proposal to make a more generic interface that the rest of tracefs and even
> > debugfs could use that wouldn't rely on dentry as the main handle.
>
> You mean like kernfs does for you today? :)
>
I tried to use kernfs when doing a lot of this and I had issues. I
don't remember what those were, but I can revisit it.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 1:50 ` Steven Rostedt
@ 2024-01-26 1:59 ` Greg Kroah-Hartman
2024-01-26 2:40 ` Steven Rostedt
0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2024-01-26 1:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Thu, Jan 25, 2024 at 08:50:55PM -0500, Steven Rostedt wrote:
> On Thu, 25 Jan 2024 17:24:03 -0800
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > On Thu, Jan 25, 2024 at 10:48:22AM -0500, Steven Rostedt wrote:
> > > Now that I have finished the eventfs file system, I would like to present a
> > > proposal to make a more generic interface that the rest of tracefs and even
> > > debugfs could use that wouldn't rely on dentry as the main handle.
> >
> > You mean like kernfs does for you today? :)
> >
>
> I tried to use kernfs when doing a lot of this and I had issues. I
> don't remember what those were, but I can revisit it.
You might, as kernfs makes it so that the filesystem structures are
created on demand, when accessed, and then removed when memory pressure
happens. That's what sysfs and configfs and cgroups use quite
successfully.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 1:59 ` Greg Kroah-Hartman
@ 2024-01-26 2:40 ` Steven Rostedt
2024-01-26 14:16 ` Greg Kroah-Hartman
0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-01-26 2:40 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Thu, 25 Jan 2024 17:59:40 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> > I tried to use kernfs when doing a lot of this and I had issues. I
> > don't remember what those were, but I can revisit it.
>
> You might, as kernfs makes it so that the filesystem structures are
> created on demand, when accessed, and then removed when memory pressure
> happens. That's what sysfs and configfs and cgroups use quite
> successfully.
kernfs doesn't look trivial and I can't find any documentation on how
to use it.
Should there be work to move debugfs over to kernfs?
I could look at it too, but as tracefs, and more specifically eventfs,
has 10s of thousands of files, I'm very concerned about meta data size.
Currently eventfs keeps a data structure for every directory, but for
the files, it only keeps an array of names and callbacks. When a
directory is registered, it lists the files it needs. eventfs is
specific that the number of files a directory has is always constant,
and files will not be removed or added once a directory is created.
This way, the information on how a file is created is done via a
callback that was registered when the directory was created.
For this use case, I don't think kernfs could be used. But I would
still like to talk about what I'm trying to accomplish, and perhaps see
if there's work that can be done to consolidate what is out there.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 2:40 ` Steven Rostedt
@ 2024-01-26 14:16 ` Greg Kroah-Hartman
2024-01-26 15:15 ` Steven Rostedt
0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2024-01-26 14:16 UTC (permalink / raw)
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Thu, Jan 25, 2024 at 09:40:07PM -0500, Steven Rostedt wrote:
> On Thu, 25 Jan 2024 17:59:40 -0800
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > > I tried to use kernfs when doing a lot of this and I had issues. I
> > > don't remember what those were, but I can revisit it.
> >
> > You might, as kernfs makes it so that the filesystem structures are
> > created on demand, when accessed, and then removed when memory pressure
> > happens. That's what sysfs and configfs and cgroups use quite
> > successfully.
>
> kernfs doesn't look trivial and I can't find any documentation on how
> to use it.
You have the code :)
> Should there be work to move debugfs over to kernfs?
Why? Are you seeing real actual memory use with debugfs that is causing
problems? That is why we made kernfs, because people were seeing this
in sysfs.
Don't change stuff unless you need to, right?
> I could look at it too, but as tracefs, and more specifically eventfs,
> has 10s of thousands of files, I'm very concerned about meta data size.
Do you have real numbers? If not, then don't worry about it :)
> Currently eventfs keeps a data structure for every directory, but for
> the files, it only keeps an array of names and callbacks. When a
> directory is registered, it lists the files it needs. eventfs is
> specific that the number of files a directory has is always constant,
> and files will not be removed or added once a directory is created.
>
> This way, the information on how a file is created is done via a
> callback that was registered when the directory was created.
That's fine, and shouldn't matter.
> For this use case, I don't think kernfs could be used. But I would
> still like to talk about what I'm trying to accomplish, and perhaps see
> if there's work that can be done to consolidate what is out there.
Again, look at kernfs if you care about the memory usage of your virtual
filesystem, that's what it is there for, you shouldn't have to reinvent
the wheel.
And the best part is, when people find issues with scaling or other
stuff with kernfs, your filesystem will then benifit (lots of tweaks
have gone into kernfs for this over the past few kernel releases.)
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 14:16 ` Greg Kroah-Hartman
@ 2024-01-26 15:15 ` Steven Rostedt
2024-01-26 15:41 ` Greg Kroah-Hartman
0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-01-26 15:15 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Fri, 26 Jan 2024 06:16:38 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> On Thu, Jan 25, 2024 at 09:40:07PM -0500, Steven Rostedt wrote:
> > On Thu, 25 Jan 2024 17:59:40 -0800
> > Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> >
> > > > I tried to use kernfs when doing a lot of this and I had issues. I
> > > > don't remember what those were, but I can revisit it.
> > >
> > > You might, as kernfs makes it so that the filesystem structures are
> > > created on demand, when accessed, and then removed when memory pressure
> > > happens. That's what sysfs and configfs and cgroups use quite
> > > successfully.
> >
> > kernfs doesn't look trivial and I can't find any documentation on how
> > to use it.
>
> You have the code :)
Really Greg?
I can write what I want to do twice as fast than trying to figure out why
someone else did what they did in their code, unless there's good
documentation on the subject.
>
> > Should there be work to move debugfs over to kernfs?
>
> Why? Are you seeing real actual memory use with debugfs that is causing
> problems? That is why we made kernfs, because people were seeing this
> in sysfs.
The reason I brought it up was from Linus's comment about dentries and
inodes should not exist if the file system isn't mounted. That's not the
case with debugfs. My question is, do we want debugfs to not use dentries
as its main handle?
>
> Don't change stuff unless you need to, right?
>
> > I could look at it too, but as tracefs, and more specifically eventfs,
> > has 10s of thousands of files, I'm very concerned about meta data size.
>
> Do you have real numbers? If not, then don't worry about it :)
I wouldn't be doing any of this without real numbers. They are in the
change log of eventfs.
See commits:
27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
5790b1fb3d672d9a1fe3881a7181dfdbe741568f
>
> > Currently eventfs keeps a data structure for every directory, but for
> > the files, it only keeps an array of names and callbacks. When a
> > directory is registered, it lists the files it needs. eventfs is
> > specific that the number of files a directory has is always constant,
> > and files will not be removed or added once a directory is created.
> >
> > This way, the information on how a file is created is done via a
> > callback that was registered when the directory was created.
>
> That's fine, and shouldn't matter.
>
> > For this use case, I don't think kernfs could be used. But I would
> > still like to talk about what I'm trying to accomplish, and perhaps see
> > if there's work that can be done to consolidate what is out there.
>
> Again, look at kernfs if you care about the memory usage of your virtual
> filesystem, that's what it is there for, you shouldn't have to reinvent
> the wheel.
Already did because it was much easier than trying to use kernfs without
documentation. I did try at first, and realized it was easier to do it
myself. tracefs was based on top of debugfs, and I saw no easy path to go
from that to kernfs.
>
> And the best part is, when people find issues with scaling or other
> stuff with kernfs, your filesystem will then benifit (lots of tweaks
> have gone into kernfs for this over the past few kernel releases.)
Code is already done. It would be a huge effort to try to convert it over
to kernfs without even knowing if it will regress the memory issues, which
I believe it would (as the second commit saved 2 megs by getting rid of
meta data per file, which kernfs would bring back).
So, unless there's proof that kernfs would not add that memory footprint
back, I have no time to waste on it.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 15:15 ` Steven Rostedt
@ 2024-01-26 15:41 ` Greg Kroah-Hartman
2024-01-26 16:44 ` Steven Rostedt
0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2024-01-26 15:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Fri, Jan 26, 2024 at 10:15:53AM -0500, Steven Rostedt wrote:
> On Fri, 26 Jan 2024 06:16:38 -0800
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > On Thu, Jan 25, 2024 at 09:40:07PM -0500, Steven Rostedt wrote:
> > > On Thu, 25 Jan 2024 17:59:40 -0800
> > > Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> > >
> > > > > I tried to use kernfs when doing a lot of this and I had issues. I
> > > > > don't remember what those were, but I can revisit it.
> > > >
> > > > You might, as kernfs makes it so that the filesystem structures are
> > > > created on demand, when accessed, and then removed when memory pressure
> > > > happens. That's what sysfs and configfs and cgroups use quite
> > > > successfully.
> > >
> > > kernfs doesn't look trivial and I can't find any documentation on how
> > > to use it.
> >
> > You have the code :)
>
> Really Greg?
>
> I can write what I want to do twice as fast than trying to figure out why
> someone else did what they did in their code, unless there's good
> documentation on the subject.
Sorry, that was snarky, but yes, there is no documentation for kernfs,
as it evolved over time with the users of it being converted to use it
as it went. I'd suggest looking at how cgroups uses it as odds are
that's the simplest way.
> > > Should there be work to move debugfs over to kernfs?
> >
> > Why? Are you seeing real actual memory use with debugfs that is causing
> > problems? That is why we made kernfs, because people were seeing this
> > in sysfs.
>
> The reason I brought it up was from Linus's comment about dentries and
> inodes should not exist if the file system isn't mounted. That's not the
> case with debugfs. My question is, do we want debugfs to not use dentries
> as its main handle?
In the long run, yes, I want the "handle" that all callers to debugfs to
NOT use a dentry, and have been slowly migrating away from allowing
debugfs to actually return a dentry to the caller. When that is
eventually finished, it will be an opaque "handle" that all users of
debugfs has and THEN we can convert debugfs to do whatever it wants to.
Again, long-term plans, slowly getting there, if only I had an intern or
10 to help out with it :)
But, this is only being driven by my "this feels like the wrong api to
use" ideas, and seeing how debugfs returning a dentry has been abused by
many subsystems in places, not by any real-world measurements of
"debugfs is using up too much memory!" like we have had for sysfs ever
since the beginning.
If someone comes up with a real workload that shows debugfs is just too
slow or taking up too much memory for their systems for functionality
that they rely on (that's the kicker), then the movement for debugfs to
kernfs would happen much faster as someone would actually have the need
to do so.
> > Don't change stuff unless you need to, right?
> >
> > > I could look at it too, but as tracefs, and more specifically eventfs,
> > > has 10s of thousands of files, I'm very concerned about meta data size.
> >
> > Do you have real numbers? If not, then don't worry about it :)
>
> I wouldn't be doing any of this without real numbers. They are in the
> change log of eventfs.
>
> See commits:
>
> 27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
> 5790b1fb3d672d9a1fe3881a7181dfdbe741568f
Sorry, I mean for debugfs.
> > Again, look at kernfs if you care about the memory usage of your virtual
> > filesystem, that's what it is there for, you shouldn't have to reinvent
> > the wheel.
>
> Already did because it was much easier than trying to use kernfs without
> documentation. I did try at first, and realized it was easier to do it
> myself. tracefs was based on top of debugfs, and I saw no easy path to go
> from that to kernfs.
Perhaps do some digging into history and see how we moved sysfs to
kernfs, as originally sysfs looked exactly like debugfs. That might
give you some ideas of what to do here.
> > And the best part is, when people find issues with scaling or other
> > stuff with kernfs, your filesystem will then benifit (lots of tweaks
> > have gone into kernfs for this over the past few kernel releases.)
>
> Code is already done. It would be a huge effort to try to convert it over
> to kernfs without even knowing if it will regress the memory issues, which
> I believe it would (as the second commit saved 2 megs by getting rid of
> meta data per file, which kernfs would bring back).
>
> So, unless there's proof that kernfs would not add that memory footprint
> back, I have no time to waste on it.
That's fine, I was just responding to your "do we need a in-kernel way
to do this type of thing" and I pointed out that kernfs already does
just that. Rolling your own is great, like you did, I'm not saying you
have to move to kernfs at all if you don't want to as I'm not the one
having to maintain eventfs :)
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 15:41 ` Greg Kroah-Hartman
@ 2024-01-26 16:44 ` Steven Rostedt
2024-01-27 10:15 ` Amir Goldstein
0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-01-26 16:44 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Fri, 26 Jan 2024 07:41:31 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> > The reason I brought it up was from Linus's comment about dentries and
> > inodes should not exist if the file system isn't mounted. That's not the
> > case with debugfs. My question is, do we want debugfs to not use dentries
> > as its main handle?
>
> In the long run, yes, I want the "handle" that all callers to debugfs to
> NOT use a dentry, and have been slowly migrating away from allowing
> debugfs to actually return a dentry to the caller. When that is
> eventually finished, it will be an opaque "handle" that all users of
> debugfs has and THEN we can convert debugfs to do whatever it wants to.
So it does sound like we are on the same page ;-)
>
> Again, long-term plans, slowly getting there, if only I had an intern or
> 10 to help out with it :)
Yeah, this is something we need to think about when people come up to us
and say "I'd like to be a kernel developer, is there anything you know of
that I can work on?" Add a KTODO?
>
> But, this is only being driven by my "this feels like the wrong api to
> use" ideas, and seeing how debugfs returning a dentry has been abused by
> many subsystems in places, not by any real-world measurements of
> "debugfs is using up too much memory!" like we have had for sysfs ever
> since the beginning.
So we have a bit of miscommunication. My motivation for this topic wasn't
necessary on memory overhead (but it does help). But more about the
correctness of debugfs. I can understand how you could have interpreted my
motivation, as eventfs was solely motivated by memory pressure. But this
thread was motivated by Linus's comment about dentries not being allocated
before mounting.
>
> If someone comes up with a real workload that shows debugfs is just too
> slow or taking up too much memory for their systems for functionality
> that they rely on (that's the kicker), then the movement for debugfs to
> kernfs would happen much faster as someone would actually have the need
> to do so.
Another motivation is to prevent another tracefs happening. That is,
another pseudo file system that copies debugfs like the way tracefs was
created. I've had a few conversations with others that say "we have a
special interface in debugfs but we want to move it out". And I've been
(incorrectly) telling them what I did with tracefs from debugfs.
>
> > > Don't change stuff unless you need to, right?
> > >
> > > > I could look at it too, but as tracefs, and more specifically eventfs,
> > > > has 10s of thousands of files, I'm very concerned about meta data size.
> > >
> > > Do you have real numbers? If not, then don't worry about it :)
> >
> > I wouldn't be doing any of this without real numbers. They are in the
> > change log of eventfs.
> >
> > See commits:
> >
> > 27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
> > 5790b1fb3d672d9a1fe3881a7181dfdbe741568f
>
> Sorry, I mean for debugfs.
No problem. This is how I figured we were talking pass each other. eventfs
was a big culprit in memory issues, as it has so many files. But now I'm
talking about correctness more than memory savings. And this came about
from my conversations with Linus pointing out that "I was doing it wrong" ;-)
>
> > > Again, look at kernfs if you care about the memory usage of your virtual
> > > filesystem, that's what it is there for, you shouldn't have to reinvent
> > > the wheel.
> >
> > Already did because it was much easier than trying to use kernfs without
> > documentation. I did try at first, and realized it was easier to do it
> > myself. tracefs was based on top of debugfs, and I saw no easy path to go
> > from that to kernfs.
>
> Perhaps do some digging into history and see how we moved sysfs to
> kernfs, as originally sysfs looked exactly like debugfs. That might
> give you some ideas of what to do here.
I believe one project that should come out of this (again for those that
want to be a kernel developer) is to document how to create a new pseudo
file system out of kernfs.
>
> > > And the best part is, when people find issues with scaling or other
> > > stuff with kernfs, your filesystem will then benifit (lots of tweaks
> > > have gone into kernfs for this over the past few kernel releases.)
> >
> > Code is already done. It would be a huge effort to try to convert it over
> > to kernfs without even knowing if it will regress the memory issues, which
> > I believe it would (as the second commit saved 2 megs by getting rid of
> > meta data per file, which kernfs would bring back).
> >
> > So, unless there's proof that kernfs would not add that memory footprint
> > back, I have no time to waste on it.
>
> That's fine, I was just responding to your "do we need a in-kernel way
> to do this type of thing" and I pointed out that kernfs already does
> just that. Rolling your own is great, like you did, I'm not saying you
> have to move to kernfs at all if you don't want to as I'm not the one
> having to maintain eventfs :)
Yeah. So now the focus is on keeping others from rolling their own unless
they have to. I (or more realistically, someone else) could possibly
convert the tracefs portion to kernfs (keeping eventfs separate as it is
from tracefs, due to the amount of files). It would probably take the same
effort as moving debugfs over to kernfs as the two are pretty much
identical.
Creating eventfs was a great learning experience for me. But it took much
more time than I had allocated for it (putting me way behind in other
responsibilities I have).
I still like to bring up this discussion with the hopes that someone may be
interested in fixing this.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-26 16:44 ` Steven Rostedt
@ 2024-01-27 10:15 ` Amir Goldstein
2024-01-27 14:54 ` Steven Rostedt
2024-01-27 14:59 ` James Bottomley
0 siblings, 2 replies; 18+ messages in thread
From: Amir Goldstein @ 2024-01-27 10:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm,
Christian Brauner, Al Viro, Linus Torvalds
On Fri, Jan 26, 2024 at 6:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 26 Jan 2024 07:41:31 -0800
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > > The reason I brought it up was from Linus's comment about dentries and
> > > inodes should not exist if the file system isn't mounted. That's not the
> > > case with debugfs. My question is, do we want debugfs to not use dentries
> > > as its main handle?
> >
> > In the long run, yes, I want the "handle" that all callers to debugfs to
> > NOT use a dentry, and have been slowly migrating away from allowing
> > debugfs to actually return a dentry to the caller. When that is
> > eventually finished, it will be an opaque "handle" that all users of
> > debugfs has and THEN we can convert debugfs to do whatever it wants to.
>
> So it does sound like we are on the same page ;-)
>
> >
> > Again, long-term plans, slowly getting there, if only I had an intern or
> > 10 to help out with it :)
>
> Yeah, this is something we need to think about when people come up to us
> and say "I'd like to be a kernel developer, is there anything you know of
> that I can work on?" Add a KTODO?
>
> >
> > But, this is only being driven by my "this feels like the wrong api to
> > use" ideas, and seeing how debugfs returning a dentry has been abused by
> > many subsystems in places, not by any real-world measurements of
> > "debugfs is using up too much memory!" like we have had for sysfs ever
> > since the beginning.
>
> So we have a bit of miscommunication. My motivation for this topic wasn't
> necessary on memory overhead (but it does help). But more about the
> correctness of debugfs. I can understand how you could have interpreted my
> motivation, as eventfs was solely motivated by memory pressure. But this
> thread was motivated by Linus's comment about dentries not being allocated
> before mounting.
>
> >
> > If someone comes up with a real workload that shows debugfs is just too
> > slow or taking up too much memory for their systems for functionality
> > that they rely on (that's the kicker), then the movement for debugfs to
> > kernfs would happen much faster as someone would actually have the need
> > to do so.
>
> Another motivation is to prevent another tracefs happening. That is,
> another pseudo file system that copies debugfs like the way tracefs was
> created. I've had a few conversations with others that say "we have a
> special interface in debugfs but we want to move it out". And I've been
> (incorrectly) telling them what I did with tracefs from debugfs.
>
> >
> > > > Don't change stuff unless you need to, right?
> > > >
> > > > > I could look at it too, but as tracefs, and more specifically eventfs,
> > > > > has 10s of thousands of files, I'm very concerned about meta data size.
> > > >
> > > > Do you have real numbers? If not, then don't worry about it :)
> > >
> > > I wouldn't be doing any of this without real numbers. They are in the
> > > change log of eventfs.
> > >
> > > See commits:
> > >
> > > 27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
> > > 5790b1fb3d672d9a1fe3881a7181dfdbe741568f
> >
> > Sorry, I mean for debugfs.
>
> No problem. This is how I figured we were talking pass each other. eventfs
> was a big culprit in memory issues, as it has so many files. But now I'm
> talking about correctness more than memory savings. And this came about
> from my conversations with Linus pointing out that "I was doing it wrong" ;-)
>
> >
> > > > Again, look at kernfs if you care about the memory usage of your virtual
> > > > filesystem, that's what it is there for, you shouldn't have to reinvent
> > > > the wheel.
> > >
> > > Already did because it was much easier than trying to use kernfs without
> > > documentation. I did try at first, and realized it was easier to do it
> > > myself. tracefs was based on top of debugfs, and I saw no easy path to go
> > > from that to kernfs.
> >
> > Perhaps do some digging into history and see how we moved sysfs to
> > kernfs, as originally sysfs looked exactly like debugfs. That might
> > give you some ideas of what to do here.
>
> I believe one project that should come out of this (again for those that
> want to be a kernel developer) is to document how to create a new pseudo
> file system out of kernfs.
>
> >
> > > > And the best part is, when people find issues with scaling or other
> > > > stuff with kernfs, your filesystem will then benifit (lots of tweaks
> > > > have gone into kernfs for this over the past few kernel releases.)
> > >
> > > Code is already done. It would be a huge effort to try to convert it over
> > > to kernfs without even knowing if it will regress the memory issues, which
> > > I believe it would (as the second commit saved 2 megs by getting rid of
> > > meta data per file, which kernfs would bring back).
> > >
> > > So, unless there's proof that kernfs would not add that memory footprint
> > > back, I have no time to waste on it.
> >
> > That's fine, I was just responding to your "do we need a in-kernel way
> > to do this type of thing" and I pointed out that kernfs already does
> > just that. Rolling your own is great, like you did, I'm not saying you
> > have to move to kernfs at all if you don't want to as I'm not the one
> > having to maintain eventfs :)
>
> Yeah. So now the focus is on keeping others from rolling their own unless
> they have to. I (or more realistically, someone else) could possibly
> convert the tracefs portion to kernfs (keeping eventfs separate as it is
> from tracefs, due to the amount of files). It would probably take the same
> effort as moving debugfs over to kernfs as the two are pretty much
> identical.
>
> Creating eventfs was a great learning experience for me. But it took much
> more time than I had allocated for it (putting me way behind in other
> responsibilities I have).
>
> I still like to bring up this discussion with the hopes that someone may be
> interested in fixing this.
>
I would like to attend the talk about what happened since we suggested
that you use kernfs in LSFMM 2022 and what has happened since.
I am being serious, I am not being sarcastic and I am not claiming that
you did anything wrong :)
Also ,please do not forget to also fill out the Google form:
https://forms.gle/TGCgBDH1x5pXiWFo7
So we have your attendance request with suggested topics in our spreadsheet.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 10:15 ` Amir Goldstein
@ 2024-01-27 14:54 ` Steven Rostedt
2024-01-27 14:59 ` James Bottomley
1 sibling, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2024-01-27 14:54 UTC (permalink / raw)
To: Amir Goldstein
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm,
Christian Brauner, Al Viro, Linus Torvalds
On Sat, 27 Jan 2024 12:15:00 +0200
Amir Goldstein <amir73il@gmail.com> wrote:
> >
>
> I would like to attend the talk about what happened since we suggested
> that you use kernfs in LSFMM 2022 and what has happened since.
It was the lack of documentation to understand the concept it was
using. As I was very familiar with the way debugfs worked, I couldn't
map that same logic to how kernfs worked for what I wanted to do. I
remember spending a lot of time on it but just kept getting lost. I then
went to see if just modifying the current method with tracefs that was
like debugfs and things made a lot more sense. I guess the biggest
failure in that was my thinking that using the dentry as the main
handle was the proper way to do things, as supposed to being the exact
opposite. If I had known that from the beginning, I probably would have
approached it much differently.
> I am being serious, I am not being sarcastic and I am not claiming that
> you did anything wrong :)
Thanks ;-)
>
> Also ,please do not forget to also fill out the Google form:
>
> https://forms.gle/TGCgBDH1x5pXiWFo7
Crap, I keep forgetting about that form.
>
> So we have your attendance request with suggested topics in our spreadsheet.
Appreciate it.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 10:15 ` Amir Goldstein
2024-01-27 14:54 ` Steven Rostedt
@ 2024-01-27 14:59 ` James Bottomley
2024-01-27 18:06 ` Matthew Wilcox
1 sibling, 1 reply; 18+ messages in thread
From: James Bottomley @ 2024-01-27 14:59 UTC (permalink / raw)
To: Amir Goldstein, Steven Rostedt
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm,
Christian Brauner, Al Viro, Linus Torvalds
On Sat, 2024-01-27 at 12:15 +0200, Amir Goldstein wrote:
> I would like to attend the talk about what happened since we
> suggested that you use kernfs in LSFMM 2022 and what has happened
> since. I am being serious, I am not being sarcastic and I am not
> claiming that you did anything wrong :)
Actually, could we do the reverse and use this session to investigate
what's wrong with the VFS for new coders? I had a somewhat similar
experience when I did shiftfs way back in 2017. There's a huge amount
of VFS knowledge you simply can't pick up reading the VFS API. The way
I did it was to look at existing filesystems (for me overlayfs was the
closes to my use case) as well (and of course configfs which proved to
be too narrow for the use case). I'd say it took a good six months
before I understood the subtleties enough to propose a new filesystem
and be capable of answering technical questions about it. And
remember, like Steve, I'm a fairly competent kernel programmer. Six
months plus of code reading is an enormous barrier to place in front of
anyone wanting to do a simple filesystem, and it would be way bigger if
that person were new(ish) to Linux.
It was also only after eventfs had gone around the houses several times
that people suggested kernfs; it wasn't the default answer (why not?).
Plus, if kernfs should have been the default answer early on, why is
there no documentation at all? I mean fine, eventfs isn't really a new
filesystem, it's an extension of the existing tracefs, which is perhaps
how it sailed under the radar until the initial blow up, but that still
doesn't answer how hostile an environment the VFS currently is to new
coders who don't have six months or more to invest.
James
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 14:59 ` James Bottomley
@ 2024-01-27 18:06 ` Matthew Wilcox
2024-01-27 19:44 ` Linus Torvalds
2024-01-27 20:07 ` James Bottomley
0 siblings, 2 replies; 18+ messages in thread
From: Matthew Wilcox @ 2024-01-27 18:06 UTC (permalink / raw)
To: James Bottomley
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc,
linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Sat, Jan 27, 2024 at 09:59:10AM -0500, James Bottomley wrote:
> On Sat, 2024-01-27 at 12:15 +0200, Amir Goldstein wrote:
> > I would like to attend the talk about what happened since we
> > suggested that you use kernfs in LSFMM 2022 and what has happened
> > since. I am being serious, I am not being sarcastic and I am not
> > claiming that you did anything wrong :)
>
> Actually, could we do the reverse and use this session to investigate
> what's wrong with the VFS for new coders? I had a somewhat similar
> experience when I did shiftfs way back in 2017. There's a huge amount
> of VFS knowledge you simply can't pick up reading the VFS API. The way
> I did it was to look at existing filesystems (for me overlayfs was the
> closes to my use case) as well (and of course configfs which proved to
> be too narrow for the use case). I'd say it took a good six months
> before I understood the subtleties enough to propose a new filesystem
> and be capable of answering technical questions about it. And
> remember, like Steve, I'm a fairly competent kernel programmer. Six
> months plus of code reading is an enormous barrier to place in front of
> anyone wanting to do a simple filesystem, and it would be way bigger if
> that person were new(ish) to Linux.
I'd suggest that eventfs and shiftfs are not "simple filesystems".
They're synthetic filesystems that want to do very different things
from block filesystems and network filesystems. We have a lot of
infrastructure in place to help authors of, say, bcachefs, but not a lot
of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
debugfs, etc).
I don't feel like I have a lot to offer in this area; it's not a
part of the VFS I'm comfortable with. I don't really understand the
dentry/vfsmount/... interactions. I'm more focused on the fs/mm/block
interactions. I would probably also struggle to write a synthetic
filesystem, while I could knock up something that's a clone of ext2 in
a matter of weeks.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 18:06 ` Matthew Wilcox
@ 2024-01-27 19:44 ` Linus Torvalds
2024-01-27 20:23 ` James Bottomley
2024-01-29 15:08 ` Christian Brauner
2024-01-27 20:07 ` James Bottomley
1 sibling, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2024-01-27 19:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: James Bottomley, Amir Goldstein, Steven Rostedt,
Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm,
Christian Brauner, Al Viro
On Sat, 27 Jan 2024 at 10:06, Matthew Wilcox <willy@infradead.org> wrote:
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a lot
> of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
> debugfs, etc).
Indeed.
I think it's worth pointing out three very _fundamental_ design issues
here, which all mean that a "regular filesystem" is in many ways much
simpler than a virtual one:
(a) this is what the VFS has literally primarily been designed for.
When you look at a lot of VFS issues, they almost all come from just
very basic "this is what a filesystem needs" issues, and particularly
performance issues. And when you talk "performance", the #1 thing is
caching. In fact, I'd argue that #2 is caching too. Caching is just
*so* important, and it really shows in the VFS. Think about just about
any part of the VFS, and it's all about caching filesystem data. It's
why the dentry cache exists, it's why the page / folios exist, it's
what 99% of all the VFS code is about.
And that performance / caching issue isn't just why most of the VFS
code exists, it's literally also the reason for most of the design
decisions. The dentry cache is a hugely complicated beast, and a *lot*
of the complications are directly related to one thing, and one thing
only: performance. It's why locking is so incredibly baroque.
Yes, there are other complications. The whole notion of "bind mounts"
is a huge complication that arguably isn't performance-related, and
it's why we have that combination of "vfsmount" and "dentry" that we
together call a "path". And that tends to confuse low-level filesystem
people, because the other thing the VFS layer does is to try to shield
the low-level filesystem from higher-level concepts like that, so that
the low-level filesystem literally doesn't have to know about "oh,
this same filesystem is mounted in five different places". The VFS
layer takes care of that, and the filesystem doesn't need to know.
So part of it is that the VFS has been designed for regular
filesystems, but the *other* part of the puzzle is on the other side:
(b) regular filesystems have been designed to be filesystems.
Ok, that may sound like a stupid truism, but when it comes to the
discussion of virtual filesystems and relative simplicity, it's quite
a big deal. The fact is, a regular filesystem has literally been
designed from the ground up to do regular filesystem things. And that
matters.
Yes, yes, many filesystems then made various bad design decisions, and
the world isn't perfect. But basically things like "read a directory"
and "read and write files" and "rename things" are all things that the
filesystem was *designed* for.
So the VFS layer was designed for real filesystems, and real
filesystems were designed to do filesystem operations, so they are not
just made to fit together, they are also all made to expose all the
normal read/write/open/stat/whatever system calls.
(c) none of the above is generally true of virtual filesystems
Sure, *some* virtual filesystems are designed to act like a filesystem
from the ground up. Something like "tmpfs" is obviously a virtual
filesystem, but it's "virtual" only in the sense that it doesn't have
much of a backing store. It's still designed primarily to *be* a
filesystem, and the only operations that happen on it are filesystem
operations.
So ignore 'tmpfs' here, and think about all the other virtual
filesystems we have.
And realize that hey aren't really designed to be filesystems per se -
they are literally designed to be something entirely different, and
the filesystem interface is then only a secondary thing - it's a
window into a strange non-filesystem world where normal filesystem
operations don't even exist, even if sometimes there can be some kind
of convoluted transformation for them.
So you have "simple" things like just plain read-only files in /proc,
and desp[ite being about as simple as they come, they fail miserably
at the most fundamental part of a file: you can't even 'stat()' them
and get sane file size data from them.
And "caching" - which was the #1 reason for most of the filesystem
code - ends up being much less so, although it turns out that it's
still hugely important because of the abstraction interface it allows.
So all those dentries, and all the complicated lookup code, end up
still being quite important to make the virtual filesystem look like a
filesystem at all: it's what gives you the 'getcwd()' system call,
it's what still gives you the whole bind mount thing, it really ends
up giving a lot of "structure" to the virtual filesystem that would be
an absolute nightmare without it. But it's a structure that is really
designed for something else.
Because the non-filesystem virtual part that a virtual filesystem is
actually trying to expose _as_ a filesystem to user space usually has
lifetime rules (and other rules) that are *entirely* unrelated to any
filesystem activity. A user can "chdir()" into a directory that
describes a process, but the lifetime of that process is then entirely
unrelated to that, and it can go away as a process, while the
directory still has to virtually exist.
That's part of what the VFS code gives a virtual filesystem: the
dentries etc end up being those things that hang around even when the
virtual part that they described may have disappeared. And you *need*
that, just to get sane UNIX 'home directory' semantics.
I think people often don't think of how much that VFS infrastructure
protects them from.
But it's also why virtual filesystems are generally a complete mess:
you have these two pieces, and they are really doing two *COMPLETELY*
different things.
It's why I told Steven so forcefully that tracefs must not mess around
with VFS internals. A virtual filesystem either needs to be a "real
filesystem" aka tmpfs and just leave it *all* to the VFS layer, or it
needs to just treat the dentries as a separate cache that the virtual
filesystem is *not* in charge of, and trust the VFS layer to do the
filesystem parts.
But no. You should *not* look at a virtual filesystem as a guide how
to write a filesystem, or how to use the VFS. Look at a real FS. A
simple one, and preferably one that is built from the ground up to
look like a POSIX one, so that you don't end up getting confused by
all the nasty hacks to make it all look ok.
IOW, while FAT is a simple filesystem, don't look at that one, just
because then you end up with all the complications that come from
decades of non-UNIX filesystem history.
I'd say "look at minix or sysv filesystems", except those may be
simple but they also end up being so legacy that they aren't good
examples. You shouldn't use buffer-heads for anything new. But they
are still probably good examples for one thing: if you want to
understand the real power of dentries, look at either of the minix or
sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
internal implementation of how a directory entry is then looked up on
disk - because that's obviously filesystem-specific - and instead just
look at the interface.
Linus
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 18:06 ` Matthew Wilcox
2024-01-27 19:44 ` Linus Torvalds
@ 2024-01-27 20:07 ` James Bottomley
1 sibling, 0 replies; 18+ messages in thread
From: James Bottomley @ 2024-01-27 20:07 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc,
linux-fsdevel, linux-mm, Christian Brauner, Al Viro,
Linus Torvalds
On Sat, 2024-01-27 at 18:06 +0000, Matthew Wilcox wrote:
> On Sat, Jan 27, 2024 at 09:59:10AM -0500, James Bottomley wrote:
> > On Sat, 2024-01-27 at 12:15 +0200, Amir Goldstein wrote:
> > > I would like to attend the talk about what happened since we
> > > suggested that you use kernfs in LSFMM 2022 and what has happened
> > > since. I am being serious, I am not being sarcastic and I am not
> > > claiming that you did anything wrong :)
> >
> > Actually, could we do the reverse and use this session to
> > investigate what's wrong with the VFS for new coders? I had a
> > somewhat similar experience when I did shiftfs way back in 2017.
> > There's a huge amount of VFS knowledge you simply can't pick up
> > reading the VFS API. The way I did it was to look at existing
> > filesystems (for me overlayfs was the closes to my use case) as
> > well (and of course configfs which proved to be too narrow for the
> > use case). I'd say it took a good six months before I understood
> > the subtleties enough to propose a new filesystem and be capable of
> > answering technical questions about it. And remember, like Steve,
> > I'm a fairly competent kernel programmer. Six months plus of code
> > reading is an enormous barrier to place in front of anyone wanting
> > to do a simple filesystem, and it would be way bigger if that
> > person were new(ish) to Linux.
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a
> lot of infrastructure for synthetic filesystems (procfs, overlayfs,
> sysfs, debugfs, etc).
I'm not going to disagree with this at all, but I also don't think it
makes the question any less valid when exposing features through the
filesystem is one of our default things to do. If anything it makes it
more urgent because some enterprising young thing is going create their
own fantastic synthetic filesystem for something and run headlong into
this.
> I don't feel like I have a lot to offer in this area; it's not a
> part of the VFS I'm comfortable with. I don't really understand the
> dentry/vfsmount/... interactions. I'm more focused on the
> fs/mm/block interactions. I would probably also struggle to write a
> synthetic filesystem, while I could knock up something that's a clone
> of ext2 in a matter of weeks.
OK, I have to confess the relationship of superblocks, struct vfsmount
and struct mnt (as it then was, it's struct mnt_idmap now) to the fs
tree was a huge part of that learning (as was the vagaries of the
dentry cache).
I'm not saying this is easy or something that interests everyone, but I
think receont history demonstrates it's something we should discuss and
try to do better at.
James
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 19:44 ` Linus Torvalds
@ 2024-01-27 20:23 ` James Bottomley
2024-01-29 15:08 ` Christian Brauner
1 sibling, 0 replies; 18+ messages in thread
From: James Bottomley @ 2024-01-27 20:23 UTC (permalink / raw)
To: Linus Torvalds, Matthew Wilcox
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc,
linux-fsdevel, linux-mm, Christian Brauner, Al Viro
On Sat, 2024-01-27 at 11:44 -0800, Linus Torvalds wrote:
[...]
> (c) none of the above is generally true of virtual filesystems
>
> Sure, *some* virtual filesystems are designed to act like a
> filesystem from the ground up. Something like "tmpfs" is obviously a
> virtual filesystem, but it's "virtual" only in the sense that it
> doesn't have much of a backing store. It's still designed primarily
> to *be* a filesystem, and the only operations that happen on it are
> filesystem operations.
>
> So ignore 'tmpfs' here, and think about all the other virtual
> filesystems we have.
Actually, I did look at tmpfs and it did help.
> And realize that hey aren't really designed to be filesystems per se
> - they are literally designed to be something entirely different, and
> the filesystem interface is then only a secondary thing - it's a
> window into a strange non-filesystem world where normal filesystem
> operations don't even exist, even if sometimes there can be some kind
> of convoluted transformation for them.
>
> So you have "simple" things like just plain read-only files in /proc,
> and desp[ite being about as simple as they come, they fail miserably
> at the most fundamental part of a file: you can't even 'stat()' them
> and get sane file size data from them.
Well, this is a big piece of the problem: when constructing a virtual
filesystem what properties do I really need to care about (like stat or
uniqueness of inode numbers) and what can I simply ignore? Ideally
this should be documented because you have to read a lot of code to get
an idea of what the must have properties are. I think a simple summary
of this would go a long way to getting people somewhat out of the swamp
that sucks you in when you try to construct virtual filesystems.
> And "caching" - which was the #1 reason for most of the filesystem
> code - ends up being much less so, although it turns out that it's
> still hugely important because of the abstraction interface it
> allows.
>
> So all those dentries, and all the complicated lookup code, end up
> still being quite important to make the virtual filesystem look like
> a filesystem at all: it's what gives you the 'getcwd()' system call,
> it's what still gives you the whole bind mount thing, it really ends
> up giving a lot of "structure" to the virtual filesystem that would
> be an absolute nightmare without it. But it's a structure that is
> really designed for something else.
I actually found dentries (which were the foundation of shiftfs) quite
easy. My biggest problem was the places in the code where we use a
bare dentry and I needed the struct mnt (or struct path) as well, but
that's a different discussion.
> Because the non-filesystem virtual part that a virtual filesystem is
> actually trying to expose _as_ a filesystem to user space usually has
> lifetime rules (and other rules) that are *entirely* unrelated to any
> filesystem activity. A user can "chdir()" into a directory that
> describes a process, but the lifetime of that process is then
> entirely unrelated to that, and it can go away as a process, while
> the directory still has to virtually exist.
On this alone, real filesystems do have the unplug problem as well
(device goes away while user is in the directory), so the solution that
works for them work for virtual filesystems as well.
> That's part of what the VFS code gives a virtual filesystem: the
> dentries etc end up being those things that hang around even when the
> virtual part that they described may have disappeared. And you *need*
> that, just to get sane UNIX 'home directory' semantics.
>
> I think people often don't think of how much that VFS infrastructure
> protects them from.
>
> But it's also why virtual filesystems are generally a complete mess:
> you have these two pieces, and they are really doing two *COMPLETELY*
> different things.
>
> It's why I told Steven so forcefully that tracefs must not mess
> around with VFS internals. A virtual filesystem either needs to be a
> "real filesystem" aka tmpfs and just leave it *all* to the VFS layer,
> or it needs to just treat the dentries as a separate cache that the
> virtual filesystem is *not* in charge of, and trust the VFS layer to
> do the filesystem parts.
>
> But no. You should *not* look at a virtual filesystem as a guide how
> to write a filesystem, or how to use the VFS. Look at a real FS. A
> simple one, and preferably one that is built from the ground up to
> look like a POSIX one, so that you don't end up getting confused by
> all the nasty hacks to make it all look ok.
Well, I did look at ext4 when I was wondering what a real filesystem
does, but we're back to having to read real and virtual filesystems now
just to understand what you have to do and hence we're back to the "how
do we make this easier" problem.
> IOW, while FAT is a simple filesystem, don't look at that one, just
> because then you end up with all the complications that come from
> decades of non-UNIX filesystem history.
>
> I'd say "look at minix or sysv filesystems", except those may be
> simple but they also end up being so legacy that they aren't good
> examples. You shouldn't use buffer-heads for anything new. But they
> are still probably good examples for one thing: if you want to
> understand the real power of dentries, look at either of the minix or
> sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
> internal implementation of how a directory entry is then looked up on
> disk - because that's obviously filesystem-specific - and instead
> just look at the interface.
So shall I put you down for helping with virtual filesystem
documentation then ... ?
James
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-27 19:44 ` Linus Torvalds
2024-01-27 20:23 ` James Bottomley
@ 2024-01-29 15:08 ` Christian Brauner
2024-01-29 15:57 ` Steven Rostedt
1 sibling, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2024-01-29 15:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, James Bottomley, Amir Goldstein, Steven Rostedt,
Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Al Viro
> But no. You should *not* look at a virtual filesystem as a guide how
> to write a filesystem, or how to use the VFS. Look at a real FS. A
> simple one, and preferably one that is built from the ground up to
> look like a POSIX one, so that you don't end up getting confused by
> all the nasty hacks to make it all look ok.
>
> IOW, while FAT is a simple filesystem, don't look at that one, just
> because then you end up with all the complications that come from
> decades of non-UNIX filesystem history.
>
> I'd say "look at minix or sysv filesystems", except those may be
> simple but they also end up being so legacy that they aren't good
> examples. You shouldn't use buffer-heads for anything new. But they
> are still probably good examples for one thing: if you want to
> understand the real power of dentries, look at either of the minix or
> sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
> internal implementation of how a directory entry is then looked up on
> disk - because that's obviously filesystem-specific - and instead just
> look at the interface.
I agree and I have to say I'm getting annoyed with this thread.
And I want to fundamentally oppose the notion that it's too difficult to
write a virtual filesystem. Just one look at how many virtual
filesystems we already have and how many are proposed. Recent example is
that KVM wanted to implement restricted memory as a stacking layer on
top of tmpfs which I luckily caught early and told them not to do.
If at all a surprising amount of people that have nothing to do with
filesystems manage to write filesystem drivers quickly and propose them
upstream. And I hope people take a couple of months to write a decently
sized/complex (virtual) filesystem.
And specifically for virtual filesystems they often aren't alike at
all. And that's got nothing to do with the VFS abstractions. It's
simply because a virtual filesystem is often used for purposes when
developers think that they want a filesystem like userspace interface
but don't want all of the actual filesystem semantics that come with it.
So they all differ from each other and what functionality they actually
implement.
And I somewhat oppose the notion that the VFS isn't documented. We do
have extensive documentation for locking rules, a constantly updated
changelog with fundamental changes to all VFS APIs and expectations
around it. Including very intricate details for the reader that really
needs to know everything. I wrote a whole document just on permission
checking and idmappings when we added that to the VFS. Both
implementation and theoretical background.
And stuff like overlayfs or shiftfs are completely separate stories
because they're even more special as they're (virtual) stacking
filesystems that challenge the VFS in way more radical ways than regular
virtual filesystems.
And I think (Amir may forgive me) that stacking filesystems are
generally an absolutely terrible idea as they complicate the VFS
massively and put us through an insane amount of pain. One just needs to
look at how much additional VFS machinery we have because of that and
how complicated our callchains can become because of that. It's just not
correct to even compare them to a boring virtual filesystem like
binderfs or bpffs.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
2024-01-29 15:08 ` Christian Brauner
@ 2024-01-29 15:57 ` Steven Rostedt
0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2024-01-29 15:57 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Matthew Wilcox, James Bottomley, Amir Goldstein,
Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Al Viro
On Mon, 29 Jan 2024 16:08:33 +0100
Christian Brauner <brauner@kernel.org> wrote:
> > But no. You should *not* look at a virtual filesystem as a guide how
> > to write a filesystem, or how to use the VFS. Look at a real FS. A
> > simple one, and preferably one that is built from the ground up to
> > look like a POSIX one, so that you don't end up getting confused by
> > all the nasty hacks to make it all look ok.
> >
> > IOW, while FAT is a simple filesystem, don't look at that one, just
> > because then you end up with all the complications that come from
> > decades of non-UNIX filesystem history.
> >
> > I'd say "look at minix or sysv filesystems", except those may be
> > simple but they also end up being so legacy that they aren't good
> > examples. You shouldn't use buffer-heads for anything new. But they
> > are still probably good examples for one thing: if you want to
> > understand the real power of dentries, look at either of the minix or
> > sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
> > internal implementation of how a directory entry is then looked up on
> > disk - because that's obviously filesystem-specific - and instead just
> > look at the interface.
>
> I agree and I have to say I'm getting annoyed with this thread.
>
> And I want to fundamentally oppose the notion that it's too difficult to
> write a virtual filesystem. Just one look at how many virtual
I guess you mean pseudo file systems? Somewhere along the discussion we
switched from saying pseudo to virtual. I may have been the culprit, I
don't remember and I'm not re-reading the thread to find out.
> filesystems we already have and how many are proposed. Recent example is
> that KVM wanted to implement restricted memory as a stacking layer on
> top of tmpfs which I luckily caught early and told them not to do.
>
> If at all a surprising amount of people that have nothing to do with
> filesystems manage to write filesystem drivers quickly and propose them
> upstream. And I hope people take a couple of months to write a decently
> sized/complex (virtual) filesystem.
I spent a lot of time on this. Let me give you a bit of history of where
tracefs/eventfs came from.
When we first started the tracing infrastructure, I wanted it to be easy to
debug embedded devices. I use to have my own tracer called "logdev" which
was a character device in /dev called /dev/logdev. I was able to write into
it for simple control actions.
But we needed a more complex system when we started integrating the
PREEMPT_RT latency tracer which eventually became the ftrace infrastructure.
As I wanted to still only need busybox to interact with it, I wanted to use
files and not system calls. I was recommended to use debugfs, and I did. It
became /sys/kernel/debug/tracing.
After a while, when tracing started to become useful in production systems,
people wanted access to tracing without having to have debugfs mounted.
That's because debugfs is a dumping ground to a lot of interactions with
the kernel, and people were legitimately worried about security
vulnerabilities it could expose.
I then asked about how to make /sys/kernel/debug/tracing its own file
system and was recommended to just start with debugfs (it's the easiest
concept of all the files systems to understand) and since tracing was
already used the debugfs API (with dentries as the handle) it made sense.
That created tracefs. Now you could mount tracefs at /sys/kernel/tracing
and even have debugfs configured out.
When the eBPF folks were using trace_printk directly into the main trace
buffer, I asked them to please use an instance instead. They told me that
an instance adds too much memory overhead. Over 20MBs! When I investigated,
I found that they were right. And most of that overhead was all the dentry
and inodes that were created for every directory and file that was used for
events. As there's 10s of thousands of files and directories that adds up.
And if you create a new instance, you create another 10s of thousands of
files and directories that are basically all the same.
This lead to the effort to create eventfs that would remove the overhead of
these inodes and dentries with just a light weight descriptor for every
directory. As there's just around 2000 directories, its the files that take
up most of the memory.
What got us here is the evolution of changes that were made. Now you can
argue that when tracefs was first moved out of debugfs I should have based
it on kernfs. I actually did look at that, but because it behaved so much
differently than debugfs (which was the only thing in VFS that I was
familiar with), I chose debugfs instead.
The biggest savings in eventfs is the fact that it has no meta data for
files. All the directories in eventfs has a fixed number of files when they
are created. The creating of a directory passes in an array that has a list
of names and callbacks to call when the file needs to be accessed. Note,
this array is static for all events. That is, there's one array for all
event files, and one array for all event systems, they are not allocated per
directory.
>
> And specifically for virtual filesystems they often aren't alike at
> all. And that's got nothing to do with the VFS abstractions. It's
> simply because a virtual filesystem is often used for purposes when
> developers think that they want a filesystem like userspace interface
> but don't want all of the actual filesystem semantics that come with it.
> So they all differ from each other and what functionality they actually
> implement.
I agree with the above.
>
> And I somewhat oppose the notion that the VFS isn't documented. We do
> have extensive documentation for locking rules, a constantly updated
> changelog with fundamental changes to all VFS APIs and expectations
> around it. Including very intricate details for the reader that really
> needs to know everything. I wrote a whole document just on permission
> checking and idmappings when we added that to the VFS. Both
> implementation and theoretical background.
I spent a lot of time reading the VFS documentation. The problem I had was
that it's very much focused for its main purpose. That is for real file
systems. It was hard to know what would apply to a pseudo file system and
what would not.
So I don't want to say that VFS isn't well documented. I would say that VFS
is a very big beast, and there's documentation that is focused on what the
majority want to do with it.
It's us outliers (pseudo file systems) that are screwing this up. And when
you come from an approach of "I just want an file systems like interface"
you really just want to know the bare minimum of VFS to get that done.
I've been approach countless of times by the embedded community (including
those that worked on the Mars helicopter) thanking me for having such a
nice file system like interface into tracing.
-- Steve
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-01-29 15:57 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-25 15:48 [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems Steven Rostedt
2024-01-26 1:24 ` Greg Kroah-Hartman
2024-01-26 1:50 ` Steven Rostedt
2024-01-26 1:59 ` Greg Kroah-Hartman
2024-01-26 2:40 ` Steven Rostedt
2024-01-26 14:16 ` Greg Kroah-Hartman
2024-01-26 15:15 ` Steven Rostedt
2024-01-26 15:41 ` Greg Kroah-Hartman
2024-01-26 16:44 ` Steven Rostedt
2024-01-27 10:15 ` Amir Goldstein
2024-01-27 14:54 ` Steven Rostedt
2024-01-27 14:59 ` James Bottomley
2024-01-27 18:06 ` Matthew Wilcox
2024-01-27 19:44 ` Linus Torvalds
2024-01-27 20:23 ` James Bottomley
2024-01-29 15:08 ` Christian Brauner
2024-01-29 15:57 ` Steven Rostedt
2024-01-27 20:07 ` James Bottomley
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).