* [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-25 15:48 UTC
To: lsf-pc
Cc: linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds, Greg Kroah-Hartman

The tracefs file system was designed from the debugfs file system. The rationale for separating tracefs from debugfs was to allow systems to enable tracing but still keep debugfs disabled.

The debugfs API centers around the dentry, e.g.:

  struct dentry *debugfs_create_file(const char *name, umode_t mode,
                                     struct dentry *parent, void *data,
                                     const struct file_operations *fops);

  struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);

To create a file in debugfs, you call debugfs_create_file(), and it returns a dentry handle that can be used to delete that file later. If parent is NULL, the file is added at the root of the debugfs file system (/sys/kernel/debug); otherwise you can create a directory within that file system with debugfs_create_dir() and pass it in as the parent. Behind the scenes, an inode structure is also created to represent that dentry. This all happens regardless of whether debugfs is mounted or not!

As every trace event in the system is represented by a directory and several files in tracefs's events directory, this creates quite a lot of dentries and inodes.

  # find /sys/kernel/tracing/ | wc -l
  18352

And if you create an instance, it will duplicate all the events in the instance directory:

  # mkdir /sys/kernel/tracing/instances/foo
  # find /sys/kernel/tracing/ | wc -l
  36617

And that goes for every new instance you make!

  # mkdir /sys/kernel/tracing/instances/bar
  # find /sys/kernel/tracing/ | wc -l
  54882

Having inodes and dentries created for all these files and directories even when they are not used wastes a lot of memory.

Two years ago at LSF/MM I presented changing how the events directory works via a new "eventfs" file system. It would still be part of tracefs, but it would dynamically create the inodes and dentries on the fly. As I was new to how the VFS works, and really didn't understand it as well as I would have liked, I just got something working and finally submitted it. But because of my inexperience, Linus had some strong issues with the code. Part of this was because I was touching dentries when he said I shouldn't be. But that is because the code was designed from debugfs, where the dentry is the central part of the code. When Linus said to me:

  "And dammit, it shouldn't be necessary. When the tree is mounted, there
   should be no existing dentries."

(I'd share the link, but it was on the security list so there's no public link for this conversation.)

Linus's comment made me realize that debugfs was doing it wrong! He was right: when a file system is mounted, it should not have any dentries or inodes. That's because dentries and inodes are basically a "cache" of the underlying file system. They should only be created when they are referenced.

debugfs and tracefs (and possibly other pseudo file systems) should not be using the dentry as the descriptor for an object. They should just create a generic object that can save the fops, mode, parent, and data, and have the dentries and inodes created when referenced, just like any other file system would.
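To illustrate, such a generic object might look something like the sketch below. None of these names exist in the kernel today; they are purely hypothetical, meant only to show the shape of an API that records what a file is while leaving dentry and inode creation to lookup time:

  #include <linux/fs.h>
  #include <linux/types.h>

  /*
   * Hypothetical dentry-free descriptor: it records everything needed
   * to materialize the file later, but allocates no dentry or inode.
   */
  struct pseudo_fs_node {
          const char                      *name;
          umode_t                         mode;
          struct pseudo_fs_node           *parent; /* NULL means fs root */
          const struct file_operations    *fops;   /* NULL for directories */
          void                            *data;   /* handed to fops on open */
  };

  /*
   * Same calling convention as debugfs_create_file(), but the returned
   * handle is an opaque object, not a dentry. Dentries and inodes would
   * only be instantiated by ->lookup() once the tree is mounted and
   * actually walked.
   */
  struct pseudo_fs_node *pseudo_fs_create_file(const char *name,
                                          umode_t mode,
                                          struct pseudo_fs_node *parent,
                                          void *data,
                                          const struct file_operations *fops);

With something like this, a file costs one small allocation until somebody actually references it, and the dentries and inodes behave as the cache they are meant to be.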
Now that I have finished the eventfs file system, I would like to present a proposal to make a more generic interface that the rest of tracefs and even debugfs could use, one that wouldn't rely on a dentry as the main handle.

-- Steve
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Greg Kroah-Hartman @ 2024-01-26 1:24 UTC
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Thu, Jan 25, 2024 at 10:48:22AM -0500, Steven Rostedt wrote:
> Now that I have finished the eventfs file system, I would like to present a
> proposal to make a more generic interface that the rest of tracefs and even
> debugfs could use, one that wouldn't rely on a dentry as the main handle.

You mean like kernfs does for you today? :)

thanks,

greg k-h
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-26 1:50 UTC
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Thu, 25 Jan 2024 17:24:03 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> On Thu, Jan 25, 2024 at 10:48:22AM -0500, Steven Rostedt wrote:
> > Now that I have finished the eventfs file system, I would like to present a
> > proposal to make a more generic interface that the rest of tracefs and even
> > debugfs could use, one that wouldn't rely on a dentry as the main handle.
>
> You mean like kernfs does for you today? :)

I tried to use kernfs when doing a lot of this and I had issues. I don't remember what those were, but I can revisit it.

-- Steve
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Greg Kroah-Hartman @ 2024-01-26 1:59 UTC
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Thu, Jan 25, 2024 at 08:50:55PM -0500, Steven Rostedt wrote:
> > You mean like kernfs does for you today? :)
>
> I tried to use kernfs when doing a lot of this and I had issues. I
> don't remember what those were, but I can revisit it.

You might, as kernfs makes it so that the filesystem structures are created on demand, when accessed, and then removed when memory pressure happens. That's what sysfs and configfs and cgroups use quite successfully.

thanks,

greg k-h
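For reference, the on-demand model being described looks roughly like this from a kernfs user's point of view. This is a minimal sketch modeled on how cgroup builds its hierarchy; the "events" name is made up, and the exact signatures and flags should be checked against include/linux/kernfs.h for the kernel version at hand:

  #include <linux/kernfs.h>
  #include <linux/err.h>

  static struct kernfs_root *example_root;

  static int __init example_init(void)
  {
          struct kernfs_node *dir;

          /* Allocate the backing tree; no dentries or inodes yet.
           * DEACTIVATED keeps it invisible until fully constructed. */
          example_root = kernfs_create_root(NULL,
                                  KERNFS_ROOT_CREATE_DEACTIVATED, NULL);
          if (IS_ERR(example_root))
                  return PTR_ERR(example_root);

          /* Only a kernfs_node is created here; the VFS objects for it
           * appear on lookup and can be reclaimed under memory pressure. */
          dir = kernfs_create_dir(kernfs_root_to_node(example_root),
                                  "events", 0755, NULL);
          if (IS_ERR(dir)) {
                  kernfs_destroy_root(example_root);
                  return PTR_ERR(dir);
          }

          kernfs_activate(kernfs_root_to_node(example_root));
          return 0;
  }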
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-26 2:40 UTC
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Thu, 25 Jan 2024 17:59:40 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> > I tried to use kernfs when doing a lot of this and I had issues. I
> > don't remember what those were, but I can revisit it.
>
> You might, as kernfs makes it so that the filesystem structures are
> created on demand, when accessed, and then removed when memory pressure
> happens. That's what sysfs and configfs and cgroups use quite
> successfully.

kernfs doesn't look trivial and I can't find any documentation on how to use it. Should there be work to move debugfs over to kernfs?

I could look at it too, but as tracefs, and more specifically eventfs, has 10s of thousands of files, I'm very concerned about metadata size. Currently eventfs keeps a data structure for every directory, but for the files, it only keeps an array of names and callbacks. When a directory is registered, it lists the files it needs. eventfs specifically requires that the number of files a directory has is constant, and that files will not be removed or added once a directory is created.

This way, the information needed to create a file is provided via a callback that was registered when the directory was created.

For this use case, I don't think kernfs could be used. But I would still like to talk about what I'm trying to accomplish, and perhaps see if there's work that can be done to consolidate what is out there.

-- Steve
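The array-of-names-and-callbacks arrangement described above ended up in include/linux/tracefs.h looking roughly like this (paraphrased from the eventfs interface; event_callback is a made-up stand-in for a real callback, and the header should be checked for the exact, current signatures):

  /* The callback runs at lookup time, only when a file is referenced;
   * it reports the mode, private data, and fops the file should get. */
  typedef int (*eventfs_callback)(const char *name, umode_t *mode,
                                  void **data,
                                  const struct file_operations **fops);

  struct eventfs_entry {
          const char              *name;
          eventfs_callback        callback;
  };

  /* One static, constant array per directory type: no per-file
   * metadata is allocated when a directory is registered. */
  static const struct eventfs_entry event_entries[] = {
          { .name = "enable", .callback = event_callback },
          { .name = "filter", .callback = event_callback },
          { .name = "format", .callback = event_callback },
  };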
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Greg Kroah-Hartman @ 2024-01-26 14:16 UTC
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Thu, Jan 25, 2024 at 09:40:07PM -0500, Steven Rostedt wrote:
> kernfs doesn't look trivial and I can't find any documentation on how
> to use it.

You have the code :)

> Should there be work to move debugfs over to kernfs?

Why? Are you seeing real, actual memory use with debugfs that is causing problems? That is why we made kernfs, because people were seeing this in sysfs.

Don't change stuff unless you need to, right?

> I could look at it too, but as tracefs, and more specifically eventfs,
> has 10s of thousands of files, I'm very concerned about metadata size.

Do you have real numbers? If not, then don't worry about it :)

> Currently eventfs keeps a data structure for every directory, but for
> the files, it only keeps an array of names and callbacks. When a
> directory is registered, it lists the files it needs. eventfs
> specifically requires that the number of files a directory has is
> constant, and that files will not be removed or added once a directory
> is created.
>
> This way, the information needed to create a file is provided via a
> callback that was registered when the directory was created.

That's fine, and shouldn't matter.

> For this use case, I don't think kernfs could be used. But I would
> still like to talk about what I'm trying to accomplish, and perhaps see
> if there's work that can be done to consolidate what is out there.

Again, look at kernfs if you care about the memory usage of your virtual filesystem; that's what it is there for, and you shouldn't have to reinvent the wheel.

And the best part is, when people find issues with scaling or other stuff with kernfs, your filesystem will then benefit (lots of tweaks have gone into kernfs for this over the past few kernel releases).

thanks,

greg k-h
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-26 15:15 UTC
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Fri, 26 Jan 2024 06:16:38 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> > kernfs doesn't look trivial and I can't find any documentation on how
> > to use it.
>
> You have the code :)

Really Greg?

I can write what I want to do twice as fast as I can figure out why someone else did what they did in their code, unless there's good documentation on the subject.

> > Should there be work to move debugfs over to kernfs?
>
> Why? Are you seeing real, actual memory use with debugfs that is causing
> problems? That is why we made kernfs, because people were seeing this
> in sysfs.

The reason I brought it up was Linus's comment that dentries and inodes should not exist if the file system isn't mounted. That's not the case with debugfs. My question is, do we want debugfs to not use dentries as its main handle?

> Don't change stuff unless you need to, right?
>
> > I could look at it too, but as tracefs, and more specifically eventfs,
> > has 10s of thousands of files, I'm very concerned about metadata size.
>
> Do you have real numbers? If not, then don't worry about it :)

I wouldn't be doing any of this without real numbers. They are in the change log of eventfs.

See commits:

  27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
  5790b1fb3d672d9a1fe3881a7181dfdbe741568f

> > For this use case, I don't think kernfs could be used. [...]
>
> Again, look at kernfs if you care about the memory usage of your virtual
> filesystem; that's what it is there for, and you shouldn't have to
> reinvent the wheel.

Already did, because it was much easier than trying to use kernfs without documentation. I did try at first, and realized it was easier to do it myself. tracefs was based on top of debugfs, and I saw no easy path to go from that to kernfs.

> And the best part is, when people find issues with scaling or other
> stuff with kernfs, your filesystem will then benefit (lots of tweaks
> have gone into kernfs for this over the past few kernel releases).

The code is already done. It would be a huge effort to try to convert it over to kernfs without even knowing if it will regress the memory issues, which I believe it would (as the second commit saved 2 megs by getting rid of metadata per file, which kernfs would bring back).

So, unless there's proof that kernfs would not add that memory footprint back, I have no time to waste on it.

-- Steve
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Greg Kroah-Hartman @ 2024-01-26 15:41 UTC
To: Steven Rostedt
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Fri, Jan 26, 2024 at 10:15:53AM -0500, Steven Rostedt wrote:
> > > kernfs doesn't look trivial and I can't find any documentation on how
> > > to use it.
> >
> > You have the code :)
>
> Really Greg?
>
> I can write what I want to do twice as fast as I can figure out why
> someone else did what they did in their code, unless there's good
> documentation on the subject.

Sorry, that was snarky, but yes, there is no documentation for kernfs, as it evolved over time with the users of it being converted to use it as it went. I'd suggest looking at how cgroups uses it, as odds are that's the simplest way.

> > > Should there be work to move debugfs over to kernfs?
> >
> > Why? Are you seeing real, actual memory use with debugfs that is causing
> > problems? That is why we made kernfs, because people were seeing this
> > in sysfs.
>
> The reason I brought it up was Linus's comment that dentries and
> inodes should not exist if the file system isn't mounted. That's not the
> case with debugfs. My question is, do we want debugfs to not use dentries
> as its main handle?

In the long run, yes, I want the "handle" that all callers of debugfs use to NOT be a dentry, and I have been slowly migrating away from allowing debugfs to actually return a dentry to the caller. When that is eventually finished, it will be an opaque "handle" that all users of debugfs have, and THEN we can convert debugfs to do whatever it wants to.

Again, long-term plans, slowly getting there, if only I had an intern or 10 to help out with it :)

But, this is only being driven by my "this feels like the wrong api to use" ideas, and seeing how debugfs returning a dentry has been abused by many subsystems in places, not by any real-world measurements of "debugfs is using up too much memory!" like we have had for sysfs ever since the beginning.

If someone comes up with a real workload that shows debugfs is just too slow or taking up too much memory for their systems for functionality that they rely on (that's the kicker), then the movement of debugfs to kernfs would happen much faster, as someone would actually have the need to do so.

> > Don't change stuff unless you need to, right?
> >
> > > I could look at it too, but as tracefs, and more specifically eventfs,
> > > has 10s of thousands of files, I'm very concerned about metadata size.
> >
> > Do you have real numbers? If not, then don't worry about it :)
>
> I wouldn't be doing any of this without real numbers. They are in the
> change log of eventfs.
>
> See commits:
>
>   27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
>   5790b1fb3d672d9a1fe3881a7181dfdbe741568f

Sorry, I meant for debugfs.

> > Again, look at kernfs if you care about the memory usage of your virtual
> > filesystem; that's what it is there for, and you shouldn't have to
> > reinvent the wheel.
>
> Already did, because it was much easier than trying to use kernfs without
> documentation. I did try at first, and realized it was easier to do it
> myself. tracefs was based on top of debugfs, and I saw no easy path to go
> from that to kernfs.

Perhaps do some digging into history and see how we moved sysfs to kernfs, as originally sysfs looked exactly like debugfs. That might give you some ideas of what to do here.

> > And the best part is, when people find issues with scaling or other
> > stuff with kernfs, your filesystem will then benefit (lots of tweaks
> > have gone into kernfs for this over the past few kernel releases).
>
> The code is already done. It would be a huge effort to try to convert it
> over to kernfs without even knowing if it will regress the memory issues,
> which I believe it would (as the second commit saved 2 megs by getting
> rid of metadata per file, which kernfs would bring back).
>
> So, unless there's proof that kernfs would not add that memory footprint
> back, I have no time to waste on it.

That's fine, I was just responding to your "do we need an in-kernel way to do this type of thing" and I pointed out that kernfs already does just that. Rolling your own is great, like you did; I'm not saying you have to move to kernfs at all if you don't want to, as I'm not the one having to maintain eventfs :)

thanks,

greg k-h
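The migration Greg describes is already visible in today's debugfs API: the simple value helpers were converted to return void, so a dentry never escapes to the caller for those files. A sketch under that assumption, with "mydrv" and "threshold" being made-up names:

  #include <linux/debugfs.h>

  static u32 threshold;
  static struct dentry *mydrv_dir;

  static int __init mydrv_debugfs_init(void)
  {
          /* Directories still hand back a dentry, used mainly as a
           * parent handle and for removal at teardown. */
          mydrv_dir = debugfs_create_dir("mydrv", NULL);

          /* Value helpers like this one now return void, so most
           * callers never see a dentry for an individual file. */
          debugfs_create_u32("threshold", 0644, mydrv_dir, &threshold);
          return 0;
  }

  static void __exit mydrv_debugfs_exit(void)
  {
          debugfs_remove(mydrv_dir); /* recursively removes the subtree */
  }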
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-26 16:44 UTC
To: Greg Kroah-Hartman
Cc: lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Fri, 26 Jan 2024 07:41:31 -0800
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> > The reason I brought it up was Linus's comment that dentries and
> > inodes should not exist if the file system isn't mounted. That's not the
> > case with debugfs. My question is, do we want debugfs to not use dentries
> > as its main handle?
>
> In the long run, yes, I want the "handle" that all callers of debugfs use
> to NOT be a dentry, and I have been slowly migrating away from allowing
> debugfs to actually return a dentry to the caller. When that is
> eventually finished, it will be an opaque "handle" that all users of
> debugfs have, and THEN we can convert debugfs to do whatever it wants to.

So it does sound like we are on the same page ;-)

> Again, long-term plans, slowly getting there, if only I had an intern or
> 10 to help out with it :)

Yeah, this is something we need to think about when people come up to us and say "I'd like to be a kernel developer, is there anything you know of that I can work on?" Add a KTODO?

> But, this is only being driven by my "this feels like the wrong api to
> use" ideas, and seeing how debugfs returning a dentry has been abused by
> many subsystems in places, not by any real-world measurements of
> "debugfs is using up too much memory!" like we have had for sysfs ever
> since the beginning.

So we have a bit of miscommunication. My motivation for this topic wasn't necessarily the memory overhead (though that does help); it is more about the correctness of debugfs. I can understand how you could have interpreted my motivation that way, as eventfs was solely motivated by memory pressure. But this thread was motivated by Linus's comment about dentries not being allocated before mounting.

> If someone comes up with a real workload that shows debugfs is just too
> slow or taking up too much memory for their systems for functionality
> that they rely on (that's the kicker), then the movement of debugfs to
> kernfs would happen much faster, as someone would actually have the need
> to do so.

Another motivation is to prevent another tracefs from happening. That is, another pseudo file system that copies debugfs the way tracefs was created. I've had a few conversations with others that say "we have a special interface in debugfs but we want to move it out". And I've been (incorrectly) telling them what I did with tracefs from debugfs.

> > > Do you have real numbers? If not, then don't worry about it :)
> >
> > I wouldn't be doing any of this without real numbers. They are in the
> > change log of eventfs.
> >
> > See commits:
> >
> >   27152bceea1df27ffebb12ac9cd9adbf2c4c3f35
> >   5790b1fb3d672d9a1fe3881a7181dfdbe741568f
>
> Sorry, I meant for debugfs.

No problem. This is how I figured we were talking past each other. eventfs was a big culprit in memory issues, as it has so many files. But now I'm talking about correctness more than memory savings. And this came about from my conversations with Linus pointing out that "I was doing it wrong" ;-)

> > Already did, because it was much easier than trying to use kernfs
> > without documentation. [...]
>
> Perhaps do some digging into history and see how we moved sysfs to
> kernfs, as originally sysfs looked exactly like debugfs. That might
> give you some ideas of what to do here.

I believe one project that should come out of this (again, for those that want to be a kernel developer) is to document how to create a new pseudo file system out of kernfs.

> > The code is already done. [...] So, unless there's proof that kernfs
> > would not add that memory footprint back, I have no time to waste on it.
>
> That's fine, I was just responding to your "do we need an in-kernel way
> to do this type of thing" and I pointed out that kernfs already does
> just that. Rolling your own is great, like you did; I'm not saying you
> have to move to kernfs at all if you don't want to, as I'm not the one
> having to maintain eventfs :)

Yeah. So now the focus is on keeping others from rolling their own unless they have to. I (or more realistically, someone else) could possibly convert the tracefs portion to kernfs (keeping eventfs separate as it is from tracefs, due to the number of files). It would probably take the same effort as moving debugfs over to kernfs, as the two are pretty much identical.

Creating eventfs was a great learning experience for me. But it took much more time than I had allocated for it (putting me way behind in other responsibilities I have).

I still like to bring up this discussion with the hope that someone may be interested in fixing this.

Thanks,

-- Steve
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Amir Goldstein @ 2024-01-27 10:15 UTC
To: Steven Rostedt
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Fri, Jan 26, 2024 at 6:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> [...]
>
> Creating eventfs was a great learning experience for me. But it took much
> more time than I had allocated for it (putting me way behind in other
> responsibilities I have).
>
> I still like to bring up this discussion with the hope that someone may
> be interested in fixing this.

I would like to attend a talk about what has happened since we suggested that you use kernfs at LSFMM 2022. I am being serious, I am not being sarcastic, and I am not claiming that you did anything wrong :)

Also, please do not forget to fill out the Google form:

https://forms.gle/TGCgBDH1x5pXiWFo7

so that we have your attendance request with suggested topics in our spreadsheet.

Thanks,
Amir.
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Steven Rostedt @ 2024-01-27 14:54 UTC
To: Amir Goldstein
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Sat, 27 Jan 2024 12:15:00 +0200
Amir Goldstein <amir73il@gmail.com> wrote:

> I would like to attend a talk about what has happened since we suggested
> that you use kernfs at LSFMM 2022.

It was the lack of documentation that kept me from understanding the concepts it was using. As I was very familiar with the way debugfs worked, I couldn't map that same logic to how kernfs worked for what I wanted to do. I remember spending a lot of time on it but just kept getting lost. I then went to see about just modifying the current debugfs-like method in tracefs, and things made a lot more sense.

I guess the biggest failure in that was my thinking that using the dentry as the main handle was the proper way to do things, as opposed to being the exact opposite. If I had known that from the beginning, I probably would have approached it much differently.

> I am being serious, I am not being sarcastic, and I am not claiming that
> you did anything wrong :)

Thanks ;-)

> Also, please do not forget to fill out the Google form:
>
> https://forms.gle/TGCgBDH1x5pXiWFo7

Crap, I keep forgetting about that form.

> so that we have your attendance request with suggested topics in our
> spreadsheet.

Appreciate it.

-- Steve
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: James Bottomley @ 2024-01-27 14:59 UTC
To: Amir Goldstein, Steven Rostedt
Cc: Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Sat, 2024-01-27 at 12:15 +0200, Amir Goldstein wrote:
> I would like to attend a talk about what has happened since we suggested
> that you use kernfs at LSFMM 2022. I am being serious, I am not being
> sarcastic, and I am not claiming that you did anything wrong :)

Actually, could we do the reverse and use this session to investigate what's wrong with the VFS for new coders? I had a somewhat similar experience when I did shiftfs way back in 2017. There's a huge amount of VFS knowledge you simply can't pick up by reading the VFS API. The way I did it was to look at existing filesystems (for me overlayfs was the closest to my use case, and of course configfs, which proved to be too narrow for the use case). I'd say it took a good six months before I understood the subtleties enough to propose a new filesystem and be capable of answering technical questions about it. And remember, like Steve, I'm a fairly competent kernel programmer. Six months plus of code reading is an enormous barrier to place in front of anyone wanting to do a simple filesystem, and it would be way bigger if that person were new(ish) to Linux.

It was also only after eventfs had gone around the houses several times that people suggested kernfs; it wasn't the default answer (why not?). Plus, if kernfs should have been the default answer early on, why is there no documentation at all?

I mean fine, eventfs isn't really a new filesystem, it's an extension of the existing tracefs, which is perhaps how it sailed under the radar until the initial blow up, but that still doesn't address how hostile an environment the VFS currently is to new coders who don't have six months or more to invest.

James
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Matthew Wilcox @ 2024-01-27 18:06 UTC
To: James Bottomley
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Sat, Jan 27, 2024 at 09:59:10AM -0500, James Bottomley wrote:
> Actually, could we do the reverse and use this session to investigate
> what's wrong with the VFS for new coders? I had a somewhat similar
> experience when I did shiftfs way back in 2017. There's a huge amount
> of VFS knowledge you simply can't pick up by reading the VFS API. [...]
> Six months plus of code reading is an enormous barrier to place in front
> of anyone wanting to do a simple filesystem, and it would be way bigger
> if that person were new(ish) to Linux.

I'd suggest that eventfs and shiftfs are not "simple filesystems". They're synthetic filesystems that want to do very different things from block filesystems and network filesystems. We have a lot of infrastructure in place to help authors of, say, bcachefs, but not a lot of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs, debugfs, etc).

I don't feel like I have a lot to offer in this area; it's not a part of the VFS I'm comfortable with. I don't really understand the dentry/vfsmount/... interactions. I'm more focused on the fs/mm/block interactions. I would probably also struggle to write a synthetic filesystem, while I could knock up something that's a clone of ext2 in a matter of weeks.
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: Linus Torvalds @ 2024-01-27 19:44 UTC
To: Matthew Wilcox
Cc: James Bottomley, Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro

On Sat, 27 Jan 2024 at 10:06, Matthew Wilcox <willy@infradead.org> wrote:
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a lot
> of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
> debugfs, etc).

Indeed. I think it's worth pointing out three very _fundamental_ design issues here, which all mean that a "regular filesystem" is in many ways much simpler than a virtual one:

 (a) this is what the VFS has literally primarily been designed for.

When you look at a lot of VFS issues, they almost all come from just very basic "this is what a filesystem needs" issues, and particularly performance issues. And when you talk "performance", the #1 thing is caching. In fact, I'd argue that #2 is caching too. Caching is just *so* important, and it really shows in the VFS.

Think about just about any part of the VFS, and it's all about caching filesystem data. It's why the dentry cache exists, it's why the page / folios exist, it's what 99% of all the VFS code is about.

And that performance / caching issue isn't just why most of the VFS code exists, it's literally also the reason for most of the design decisions. The dentry cache is a hugely complicated beast, and a *lot* of the complications are directly related to one thing, and one thing only: performance. It's why locking is so incredibly baroque.

Yes, there are other complications. The whole notion of "bind mounts" is a huge complication that arguably isn't performance-related, and it's why we have that combination of "vfsmount" and "dentry" that we together call a "path". And that tends to confuse low-level filesystem people, because the other thing the VFS layer does is to try to shield the low-level filesystem from higher-level concepts like that, so that the low-level filesystem literally doesn't have to know about "oh, this same filesystem is mounted in five different places". The VFS layer takes care of that, and the filesystem doesn't need to know.

So part of it is that the VFS has been designed for regular filesystems, but the *other* part of the puzzle is on the other side:

 (b) regular filesystems have been designed to be filesystems.

Ok, that may sound like a stupid truism, but when it comes to the discussion of virtual filesystems and relative simplicity, it's quite a big deal. The fact is, a regular filesystem has literally been designed from the ground up to do regular filesystem things. And that matters.

Yes, yes, many filesystems then made various bad design decisions, and the world isn't perfect. But basically things like "read a directory" and "read and write files" and "rename things" are all things that the filesystem was *designed* for.

So the VFS layer was designed for real filesystems, and real filesystems were designed to do filesystem operations, so they are not just made to fit together, they are also all made to expose all the normal read/write/open/stat/whatever system calls.

 (c) none of the above is generally true of virtual filesystems

Sure, *some* virtual filesystems are designed to act like a filesystem from the ground up. Something like "tmpfs" is obviously a virtual filesystem, but it's "virtual" only in the sense that it doesn't have much of a backing store. It's still designed primarily to *be* a filesystem, and the only operations that happen on it are filesystem operations.

So ignore 'tmpfs' here, and think about all the other virtual filesystems we have. And realize that they aren't really designed to be filesystems per se - they are literally designed to be something entirely different, and the filesystem interface is then only a secondary thing - it's a window into a strange non-filesystem world where normal filesystem operations don't even exist, even if sometimes there can be some kind of convoluted transformation for them.

So you have "simple" things like just plain read-only files in /proc, and despite being about as simple as they come, they fail miserably at the most fundamental part of a file: you can't even 'stat()' them and get sane file size data from them.

And "caching" - which was the #1 reason for most of the filesystem code - ends up being much less so, although it turns out that it's still hugely important because of the abstraction interface it allows.

So all those dentries, and all the complicated lookup code, end up still being quite important to make the virtual filesystem look like a filesystem at all: it's what gives you the 'getcwd()' system call, it's what still gives you the whole bind mount thing, it really ends up giving a lot of "structure" to the virtual filesystem that would be an absolute nightmare without it. But it's a structure that is really designed for something else.

Because the non-filesystem virtual part that a virtual filesystem is actually trying to expose _as_ a filesystem to user space usually has lifetime rules (and other rules) that are *entirely* unrelated to any filesystem activity. A user can "chdir()" into a directory that describes a process, but the lifetime of that process is then entirely unrelated to that, and it can go away as a process, while the directory still has to virtually exist.

That's part of what the VFS code gives a virtual filesystem: the dentries etc end up being those things that hang around even when the virtual part that they described may have disappeared. And you *need* that, just to get sane UNIX 'home directory' semantics.

I think people often don't think of how much that VFS infrastructure protects them from.

But it's also why virtual filesystems are generally a complete mess: you have these two pieces, and they are really doing two *COMPLETELY* different things.

It's why I told Steven so forcefully that tracefs must not mess around with VFS internals. A virtual filesystem either needs to be a "real filesystem" aka tmpfs and just leave it *all* to the VFS layer, or it needs to just treat the dentries as a separate cache that the virtual filesystem is *not* in charge of, and trust the VFS layer to do the filesystem parts.

But no. You should *not* look at a virtual filesystem as a guide for how to write a filesystem, or how to use the VFS. Look at a real FS. A simple one, and preferably one that is built from the ground up to look like a POSIX one, so that you don't end up getting confused by all the nasty hacks to make it all look ok.

IOW, while FAT is a simple filesystem, don't look at that one, just because then you end up with all the complications that come from decades of non-UNIX filesystem history.

I'd say "look at minix or sysv filesystems", except those may be simple but they also end up being so legacy that they aren't good examples. You shouldn't use buffer-heads for anything new. But they are still probably good examples for one thing: if you want to understand the real power of dentries, look at either of the minix or sysv 'namei.c' files. Just *look* at how simple they are. Ignore the internal implementation of how a directory entry is then looked up on disk - because that's obviously filesystem-specific - and instead just look at the interface.

Linus
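For reference, the minix ->lookup() being pointed at is essentially the following (condensed from fs/minix/namei.c; check the tree for the current exact code). The filesystem only translates a name into an inode; instantiating and caching the dentry is entirely the VFS's job:

  static struct dentry *minix_lookup(struct inode *dir,
                                     struct dentry *dentry,
                                     unsigned int flags)
  {
          struct inode *inode = NULL;
          ino_t ino;

          if (dentry->d_name.len > minix_sb(dir->i_sb)->s_namelen)
                  return ERR_PTR(-ENAMETOOLONG);

          /* Filesystem-specific part: name -> inode number -> inode. */
          ino = minix_inode_by_name(dentry);
          if (ino)
                  inode = minix_iget(dir->i_sb, ino);

          /* Hand the result to the dcache; a NULL inode simply creates
           * a negative dentry. */
          return d_splice_alias(inode, dentry);
  }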
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
From: James Bottomley @ 2024-01-27 20:23 UTC
To: Linus Torvalds, Matthew Wilcox
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Christian Brauner, Al Viro

On Sat, 2024-01-27 at 11:44 -0800, Linus Torvalds wrote:
[...]
> (c) none of the above is generally true of virtual filesystems
>
> Sure, *some* virtual filesystems are designed to act like a
> filesystem from the ground up. Something like "tmpfs" is obviously a
> virtual filesystem, but it's "virtual" only in the sense that it
> doesn't have much of a backing store. It's still designed primarily
> to *be* a filesystem, and the only operations that happen on it are
> filesystem operations.
>
> So ignore 'tmpfs' here, and think about all the other virtual
> filesystems we have.

Actually, I did look at tmpfs and it did help.

> And realize that they aren't really designed to be filesystems per se
> - they are literally designed to be something entirely different, and
> the filesystem interface is then only a secondary thing [...]
>
> So you have "simple" things like just plain read-only files in /proc,
> and despite being about as simple as they come, they fail miserably
> at the most fundamental part of a file: you can't even 'stat()' them
> and get sane file size data from them.

Well, this is a big piece of the problem: when constructing a virtual filesystem, what properties do I really need to care about (like stat or uniqueness of inode numbers) and what can I simply ignore? Ideally this should be documented, because you have to read a lot of code to get an idea of what the must-have properties are. I think a simple summary of this would go a long way to getting people somewhat out of the swamp that sucks you in when you try to construct virtual filesystems.

> And "caching" - which was the #1 reason for most of the filesystem
> code - ends up being much less so [...] it really ends up giving a lot
> of "structure" to the virtual filesystem that would be an absolute
> nightmare without it. But it's a structure that is really designed for
> something else.

I actually found dentries (which were the foundation of shiftfs) quite easy. My biggest problem was the places in the code where we use a bare dentry and I needed the struct mnt (or struct path) as well, but that's a different discussion.

> Because the non-filesystem virtual part that a virtual filesystem is
> actually trying to expose _as_ a filesystem to user space usually has
> lifetime rules (and other rules) that are *entirely* unrelated to any
> filesystem activity. A user can "chdir()" into a directory that
> describes a process, but the lifetime of that process is then
> entirely unrelated to that, and it can go away as a process, while
> the directory still has to virtually exist.

On this alone, real filesystems do have the unplug problem as well (the device goes away while the user is in the directory), so the solution that works for them works for virtual filesystems as well.

> But no. You should *not* look at a virtual filesystem as a guide for how
> to write a filesystem, or how to use the VFS. Look at a real FS. A
> simple one, and preferably one that is built from the ground up to
> look like a POSIX one, so that you don't end up getting confused by
> all the nasty hacks to make it all look ok.

Well, I did look at ext4 when I was wondering what a real filesystem does, but we're back to having to read real and virtual filesystems now just to understand what you have to do, and hence we're back to the "how do we make this easier" problem.

> I'd say "look at minix or sysv filesystems", except those may be
> simple but they also end up being so legacy that they aren't good
> examples. [...] Ignore the internal implementation of how a directory
> entry is then looked up on disk - because that's obviously
> filesystem-specific - and instead just look at the interface.

So shall I put you down for helping with virtual filesystem documentation then ... ?

James
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems 2024-01-27 19:44 ` Linus Torvalds 2024-01-27 20:23 ` James Bottomley @ 2024-01-29 15:08 ` Christian Brauner 2024-01-29 15:57 ` Steven Rostedt 1 sibling, 1 reply; 18+ messages in thread From: Christian Brauner @ 2024-01-29 15:08 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, James Bottomley, Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Al Viro > But no. You should *not* look at a virtual filesystem as a guide how > to write a filesystem, or how to use the VFS. Look at a real FS. A > simple one, and preferably one that is built from the ground up to > look like a POSIX one, so that you don't end up getting confused by > all the nasty hacks to make it all look ok. > > IOW, while FAT is a simple filesystem, don't look at that one, just > because then you end up with all the complications that come from > decades of non-UNIX filesystem history. > > I'd say "look at minix or sysv filesystems", except those may be > simple but they also end up being so legacy that they aren't good > examples. You shouldn't use buffer-heads for anything new. But they > are still probably good examples for one thing: if you want to > understand the real power of dentries, look at either of the minix or > sysv 'namei.c' files. Just *look* at how simple they are. Ignore the > internal implementation of how a directory entry is then looked up on > disk - because that's obviously filesystem-specific - and instead just > look at the interface. I agree and I have to say I'm getting annoyed with this thread. And I want to fundamentally oppose the notion that it's too difficult to write a virtual filesystem. Just one look at how many virtual filesystems we already have and how many are proposed. Recent example is that KVM wanted to implement restricted memory as a stacking layer on top of tmpfs which I luckily caught early and told them not to do. If at all a surprising amount of people that have nothing to do with filesystems manage to write filesystem drivers quickly and propose them upstream. And I hope people take a couple of months to write a decently sized/complex (virtual) filesystem. And specifically for virtual filesystems they often aren't alike at all. And that's got nothing to do with the VFS abstractions. It's simply because a virtual filesystem is often used for purposes when developers think that they want a filesystem like userspace interface but don't want all of the actual filesystem semantics that come with it. So they all differ from each other and what functionality they actually implement. And I somewhat oppose the notion that the VFS isn't documented. We do have extensive documentation for locking rules, a constantly updated changelog with fundamental changes to all VFS APIs and expectations around it. Including very intricate details for the reader that really needs to know everything. I wrote a whole document just on permission checking and idmappings when we added that to the VFS. Both implementation and theoretical background. And stuff like overlayfs or shiftfs are completely separate stories because they're even more special as they're (virtual) stacking filesystems that challenge the VFS in way more radical ways than regular virtual filesystems. And I think (Amir may forgive me) that stacking filesystems are generally an absolutely terrible idea as they complicate the VFS massively and put us through an insane amount of pain. 
One just needs to look at how much additional VFS machinery we carry
because of them, and how complicated our call chains can become as a
result. It's just not correct to even compare them to a boring virtual
filesystem like binderfs or bpffs.

^ permalink raw reply	[flat|nested] 18+ messages in thread
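To make "boring" concrete: a minimal virtual filesystem can leave
nearly everything to the VFS. The sketch below registers a pseudo
filesystem with the fs_context API, modeled on what fs/pipe.c does
with init_pseudo(); the "foo" name and magic number are made up, and
helper details may vary across kernel versions:

#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/pseudo_fs.h>

#define FOO_SUPER_MAGIC 0x464f4f21      /* placeholder magic number */

static int foo_init_fs_context(struct fs_context *fc)
{
        /* init_pseudo() sets up a ramfs-like superblock with sane
         * defaults; the VFS owns every dentry and inode from here. */
        struct pseudo_fs_context *ctx = init_pseudo(fc, FOO_SUPER_MAGIC);

        if (!ctx)
                return -ENOMEM;
        /* Optionally override ctx->ops / ctx->dops here. */
        return 0;
}

static struct file_system_type foo_fs_type = {
        .name            = "foo",
        .init_fs_context = foo_init_fs_context,
        .kill_sb         = kill_anon_super,
};

/* Registered at init time with register_filesystem(&foo_fs_type). */

Everything else - lookup, dcache lifetimes, mount semantics - is left
to the VFS, which is exactly the division of labor described above.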
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
  2024-01-29 15:08 ` Christian Brauner
@ 2024-01-29 15:57   ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread

From: Steven Rostedt @ 2024-01-29 15:57 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Matthew Wilcox, James Bottomley, Amir Goldstein,
    Greg Kroah-Hartman, lsf-pc, linux-fsdevel, linux-mm, Al Viro

On Mon, 29 Jan 2024 16:08:33 +0100
Christian Brauner <brauner@kernel.org> wrote:

> > But no. You should *not* look at a virtual filesystem as a guide how
> > to write a filesystem, or how to use the VFS. Look at a real FS. A
> > simple one, and preferably one that is built from the ground up to
> > look like a POSIX one, so that you don't end up getting confused by
> > all the nasty hacks to make it all look ok.
> >
> > IOW, while FAT is a simple filesystem, don't look at that one, just
> > because then you end up with all the complications that come from
> > decades of non-UNIX filesystem history.
> >
> > I'd say "look at minix or sysv filesystems", except those may be
> > simple but they also end up being so legacy that they aren't good
> > examples. You shouldn't use buffer-heads for anything new. But they
> > are still probably good examples for one thing: if you want to
> > understand the real power of dentries, look at either of the minix
> > or sysv 'namei.c' files. Just *look* at how simple they are. Ignore
> > the internal implementation of how a directory entry is then looked
> > up on disk - because that's obviously filesystem-specific - and
> > instead just look at the interface.
>
> I agree, and I have to say I'm getting annoyed with this thread.
>
> And I want to fundamentally oppose the notion that it's too difficult
> to write a virtual filesystem. Just take one look at how many virtual

I guess you mean pseudo file systems? Somewhere along the discussion we
switched from saying pseudo to virtual. I may have been the culprit; I
don't remember, and I'm not re-reading the thread to find out.

> filesystems we already have and how many are proposed. A recent
> example is that KVM wanted to implement restricted memory as a
> stacking layer on top of tmpfs, which I luckily caught early and told
> them not to do.
>
> If anything, a surprising number of people who have nothing to do
> with filesystems manage to write filesystem drivers quickly and
> propose them upstream. And I do hope people take a couple of months
> to write a decently sized/complex (virtual) filesystem.

I spent a lot of time on this. Let me give you a bit of history of
where tracefs/eventfs came from.

When we first started the tracing infrastructure, I wanted it to be
easy to debug embedded devices. I used to have my own tracer called
"logdev", a character device at /dev/logdev that I could also write
into for simple control actions. But we needed a more complex system
when we started integrating the PREEMPT_RT latency tracer, which
eventually became the ftrace infrastructure.

As I wanted to still only need busybox to interact with it, I wanted
to use files and not system calls. I was recommended to use debugfs,
and I did. It became /sys/kernel/debug/tracing.

After a while, when tracing started to become useful in production
systems, people wanted access to tracing without having to have
debugfs mounted. That's because debugfs is a dumping ground for a lot
of interactions with the kernel, and people were legitimately worried
about security vulnerabilities it could expose.
I then asked how to make /sys/kernel/debug/tracing its own file
system, and was recommended to just start from debugfs (it's
conceptually the easiest of all the file systems to understand), and
since tracing already used the debugfs API (with dentries as the
handle) it made sense. That created tracefs. Now you could mount
tracefs at /sys/kernel/tracing and even have debugfs configured out.

When the eBPF folks were using trace_printk() to write directly into
the main trace buffer, I asked them to please use an instance instead.
They told me that an instance adds too much memory overhead. Over
20MB! When I investigated, I found that they were right. And most of
that overhead was the dentries and inodes that were created for every
directory and file used for events. As there are tens of thousands of
those files and directories, it adds up. And if you create a new
instance, you create another tens of thousands of files and
directories that are basically all the same.

This led to the effort to create eventfs, which removes the overhead
of those inodes and dentries, keeping just a lightweight descriptor
for every directory. As there are only around 2000 directories, it's
the files that take up most of the memory.

What got us here is the evolution of changes that were made. Now you
can argue that when tracefs was first moved out of debugfs I should
have based it on kernfs. I actually did look at that, but because it
behaved so differently from debugfs (which was the only thing in VFS I
was familiar with), I chose debugfs instead.

The biggest savings in eventfs comes from the fact that it keeps no
metadata for files. Every directory in eventfs has a fixed set of
files, determined when the directory is created: creating a directory
passes in an array listing names and the callbacks to call when a file
needs to be accessed. Note, this array is static for all events. That
is, there's one array for all event files and one array for all event
systems; they are not allocated per directory. (A sketch of this entry
array appears after this message.)

> And specifically, virtual filesystems often aren't alike at all.
> That's got nothing to do with the VFS abstractions. It's simply
> because a virtual filesystem is often used when developers think they
> want a filesystem-like userspace interface but don't want all of the
> actual filesystem semantics that come with it. So they all differ in
> what functionality they actually implement.

I agree with the above.

> And I somewhat oppose the notion that the VFS isn't documented. We
> have extensive documentation for locking rules, a constantly updated
> changelog with fundamental changes to all VFS APIs and the
> expectations around them, including very intricate details for the
> reader who really needs to know everything. I wrote a whole document
> just on permission checking and idmappings when we added that to the
> VFS - both the implementation and the theoretical background.

I spent a lot of time reading the VFS documentation. The problem I had
was that it's very much focused on its main purpose, which is real
file systems. It was hard to know what would apply to a pseudo file
system and what would not.

So I don't want to say that VFS isn't well documented. I would say
that VFS is a very big beast, and the documentation is focused on what
the majority want to do with it. It's us outliers (the pseudo file
systems) that are screwing this up.
And when you come from the approach of "I just want a filesystem-like
interface", you really just want to know the bare minimum of VFS to
get that done. I've been approached countless times by the embedded
community (including people who worked on the Mars helicopter)
thanking me for having such a nice filesystem-like interface into
tracing.

-- Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread
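For reference, the entry array Steve describes looks roughly like
this. The sketch follows include/linux/tracefs.h around the time
eventfs was merged, reproduced from memory, so the signatures may have
changed since; event_callback and the file list here are illustrative:

/* Resolve one file name in a directory to its mode, data, and fops
 * at the moment it is actually looked up - no per-file metadata. */
typedef int (*eventfs_callback)(const char *name, umode_t *mode,
                                void **data,
                                const struct file_operations **fops);

struct eventfs_entry {
        const char       *name;
        eventfs_callback  callback;
};

/* One static array describes the files of *every* event directory. */
static const struct eventfs_entry event_entries[] = {
        { .name = "enable", .callback = event_callback },
        { .name = "filter", .callback = event_callback },
        { .name = "format", .callback = event_callback },
};

/* Creating a directory allocates only the lightweight descriptor;
 * dentries and inodes come into existence when the files are
 * referenced. */
ei = eventfs_create_dir(name, parent, event_entries,
                        ARRAY_SIZE(event_entries), data);

The callback resolves a name only at lookup time, which is what lets
eventfs avoid storing any per-file metadata up front.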
* Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
  2024-01-27 18:06 ` Matthew Wilcox
  2024-01-27 19:44   ` Linus Torvalds
@ 2024-01-27 20:07   ` James Bottomley
  1 sibling, 0 replies; 18+ messages in thread

From: James Bottomley @ 2024-01-27 20:07 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc,
    linux-fsdevel, linux-mm, Christian Brauner, Al Viro, Linus Torvalds

On Sat, 2024-01-27 at 18:06 +0000, Matthew Wilcox wrote:
> On Sat, Jan 27, 2024 at 09:59:10AM -0500, James Bottomley wrote:
> > On Sat, 2024-01-27 at 12:15 +0200, Amir Goldstein wrote:
> > > I would like to attend the talk about what has happened since we
> > > suggested that you use kernfs at LSFMM 2022. I am being serious,
> > > I am not being sarcastic, and I am not claiming that you did
> > > anything wrong :)
> >
> > Actually, could we do the reverse and use this session to
> > investigate what's wrong with the VFS for new coders? I had a
> > somewhat similar experience when I did shiftfs way back in 2017.
> > There's a huge amount of VFS knowledge you simply can't pick up by
> > reading the VFS API. The way I did it was to look at existing
> > filesystems (for me overlayfs was the closest to my use case), as
> > well as configfs (which of course proved to be too narrow for the
> > use case). I'd say it took a good six months before I understood
> > the subtleties well enough to propose a new filesystem and be
> > capable of answering technical questions about it. And remember,
> > like Steve, I'm a fairly competent kernel programmer. Six months
> > plus of code reading is an enormous barrier to place in front of
> > anyone wanting to do a simple filesystem, and it would be way
> > bigger if that person were new(ish) to Linux.
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a
> lot of infrastructure for synthetic filesystems (procfs, overlayfs,
> sysfs, debugfs, etc).

I'm not going to disagree with this at all, but I also don't think it
makes the question any less valid when exposing features through the
filesystem is one of our default things to do. If anything it makes it
more urgent, because some enterprising young thing is going to create
their own fantastic synthetic filesystem for something and run
headlong into this.

> I don't feel like I have a lot to offer in this area; it's not a part
> of the VFS I'm comfortable with. I don't really understand the
> dentry/vfsmount/... interactions. I'm more focused on the fs/mm/block
> interactions. I would probably also struggle to write a synthetic
> filesystem, while I could knock up something that's a clone of ext2
> in a matter of weeks.

OK, I have to confess the relationship of superblocks, struct vfsmount
and struct mnt (as it then was; it's struct mnt_idmap now) to the fs
tree was a huge part of that learning (as were the vagaries of the
dentry cache). I'm not saying this is easy or something that interests
everyone, but I think recent history demonstrates it's something we
should discuss and try to do better at.

James

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads: [~2024-01-29 15:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-25 15:48 [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems Steven Rostedt
2024-01-26  1:24 ` Greg Kroah-Hartman
2024-01-26  1:50   ` Steven Rostedt
2024-01-26  1:59     ` Greg Kroah-Hartman
2024-01-26  2:40       ` Steven Rostedt
2024-01-26 14:16         ` Greg Kroah-Hartman
2024-01-26 15:15           ` Steven Rostedt
2024-01-26 15:41             ` Greg Kroah-Hartman
2024-01-26 16:44               ` Steven Rostedt
2024-01-27 10:15                 ` Amir Goldstein
2024-01-27 14:54                   ` Steven Rostedt
2024-01-27 14:59                   ` James Bottomley
2024-01-27 18:06                     ` Matthew Wilcox
2024-01-27 19:44                       ` Linus Torvalds
2024-01-27 20:23                         ` James Bottomley
2024-01-29 15:08                         ` Christian Brauner
2024-01-29 15:57                           ` Steven Rostedt
2024-01-27 20:07                       ` James Bottomley