* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership [not found] ` <20240104043945.GQ1674809@ZenIV> @ 2024-01-04 15:05 ` Steven Rostedt 2024-01-04 18:25 ` Al Viro 2024-01-04 19:03 ` Matthew Wilcox 0 siblings, 2 replies; 10+ messages in thread From: Steven Rostedt @ 2024-01-04 15:05 UTC (permalink / raw) To: Al Viro Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc On Thu, 4 Jan 2024 04:39:45 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote: > > On Thu, 4 Jan 2024 01:48:37 +0000 > > Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > > > > > + /* Get the tracefs root from the parent */ > > > > + inode = d_inode(dentry->d_parent); > > > > + inode = d_inode(inode->i_sb->s_root); > > > > > > That makes no sense. First of all, for any positive dentry we have > > > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > > > dentries on given superblock. So what's the point of that dance? > > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > > > and be done with that... > > > > That was more of thinking that the dentry and dentry->d_parent are > > different. As dentry is part of eventfs and dentry->d_parent is part of > > tracefs. > > ??? > > >Currently they both have the same superblock so yeah, I could just > > write it that way too and it would work. But in my head, I was thinking > > that they behave differently and maybe one day eventfs would get its own > > superblock which would not work. > > ->d_parent *always* points to the same filesystem; if you get an (automounted?) > mountpoint there, ->d_parent simply won't work - it will point to dentry itself. > This is the "tribal knowledge" I'm talking about. I really didn't know how the root dentry parent worked. 
I guess that makes sense, as it matches the '..' of a directory, and the '/' directory '..' points to itself. Although mounted file systems do not behave that way. My /proc/.. is '/'. I just figured that the dentry->d_parent would be similar. Learn something everyday. > > To explain this better: > > > > /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events > > > > But everything but "events" in /sys/kernel/tracing/* is part of tracefs. > > Everything in /sys/kernel/tracing/events is part of eventfs. > > > > That was my thought process. But as both tracefs and eventfs still use > > tracefs_get_inode(), it would work as you state. > > > > I'll update that, as I don't foresee that eventfs will become its own file > > system. > > There is no way to get to underlying mountpoint by dentry - simply because > the same fs instance can be mounted in any number of places. OK, so the dentry is still separate from the path and tied closer to the inode. > > A crude overview of taxonomy: > > file_system_type: what filesystem instances belong to. Not quite the same > thing as fs driver (one driver can provide several of those). Usually > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). I don't know the difference between NFS and NFSv4 as I just used whatever was the latest. But I understand the ext[234] part. > > super_block: individual filesystem instance. Hosts dentry tree (connected or > several disconnected parts - think NFSv4 or the state while trying to get > a dentry by fhandle, etc.). I don't know how NFSv4 works, I'm only a user of it, I never actually looked at the code. So that's not the best example, at least for me. > > dentry: object in a filesystem's directory tree(s). Always belongs to > specific filesystem instance - that relationship never changes. Tree > structure (and names) _within_ _filesystem_ belong on that level. > ->d_parent is part of that tree structure; never NULL, root of a (sub)tree > has it pointing to itself. 
Might be negative, might refer to a filesystem object > (file, directory, symlink, etc.). This is useful. > > inode: filesystem object (file, directory, etc.). Always belongs to > specific filesystem instance. Non-directory inodes might have any > number of dentry instances (aliases) refering to it; a directory one - no > more than one. This above is very useful knowledge that I did not know. That directory inodes can only have a single dentry. > Filesystem object contents belongs here; multiple hardlinks > have different dentries and the same inode. So, can I assume that an inode could only have as many dentries as hard links? I know directories are only allowed to have a single hard link. Is that why they can only have a single dentry? > Of course, filesystem type in > question might have no such thing as multiple hardlinks - that's up to > filesystem. In general there is no way to find (or enumerate) such links; > e.g. a local filesystem might have an extra hardlink somewhere we had > never looked at and there won't be any dentries for such hardlinks and > no way to get them short of doing the full search of the entire tree. > The situation with e.g. NFS client is even worse, obviously. > > mount: in a sense, mount to super_block is what dentry is to inode. It > provides a view of (sub)tree hosted in given filesystem instance. The > same filesystem may have any number of mounts, refering to its subtrees > (possibly the same subtree for each, possibly all different - up to > the callers of mount(2)). They form mount tree(s) - that's where the > notions related to "this mounted on top of that" belong. Note that > they can be moved around - with no telling the filesystem about that > happening. Again, there's no such thing as "the mountpoint of given > filesystem instance" - it might be mounted in any number of places > at the same time. Specific mount - sure, no problem, except that it > can move around. > > namespace: mount tree. 
Unlike everything prior, this one is a part of > process state - same as descriptor table, mappings, etc. And I'm guessing namespace is for containers. At least that's what I've been assuming they are for. > > file: opened IO channel. It does refer to specific mount and specific > dentry (and thus filesystem instance and an inode on it). Current > IO position lives here, so does any per-open(2) state. And IIUC, this is what maps to a processes fd table. That is, the process's file descriptor number it passes to the kernel will be mapped to this "file". > > descriptor table: mapping from numbers to IO channels (opened files). This is that "process fd table" I mentioned above (I wrote that before reading this). > Again, a part of process state. dup() creates a new entry, with > reference to the same file as the old one; multiple open() of the Hmm, wouldn't "dup()" create another "file" that just points to the same dentry? It wouldn't be the "same file", or did you mean "file" from the user space point of view? > same pathname will each yield a separate opened file. _Some_ state > belongs here (close-on-exec, mostly). Note that there's no such > thing as "the descriptor of this file" - not even "the user-supplied > number that had been used to get the file we are currently reading > from", since that number might be refering to something entirely > different right after we'd resolved it to opened file and that > happens *without* disrupting the operation. This last paragraph confused me. What do you mean by ""referring to something entirely different"? Thanks for this overview. It was very useful, and something I think we should add to kernel doc. I did read Documentation/filesystems/vfs.rst but honestly, I think your writeup here is a better overview. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
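The remark earlier in this message that the '..' of the '/' directory points to itself can be observed from user space. The following is only a sketch under stated assumptions: the helper name same_object() is made up for illustration, and it compares (st_dev, st_ino) pairs, which is the closest user-space approximation of "same inode on the same filesystem instance"; dentries themselves are not visible from user space.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * same_object() is a hypothetical helper, not a kernel or libc API.
 * It returns true when both pathnames resolve to the same inode on the
 * same filesystem instance, and false otherwise (including on error).
 */
bool same_object(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return false;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling same_object("/", "/..") is expected to return true, since pathname resolution does not walk above the root of the caller.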
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership 2024-01-04 15:05 ` [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership Steven Rostedt @ 2024-01-04 18:25 ` Al Viro 2024-01-04 19:10 ` Steven Rostedt 2024-01-04 19:15 ` Steven Rostedt 2024-01-04 19:03 ` Matthew Wilcox 1 sibling, 2 replies; 10+ messages in thread From: Al Viro @ 2024-01-04 18:25 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote: > This is the "tribal knowledge" I'm talking about. I really didn't know how > the root dentry parent worked. I guess that makes sense, as it matches the > '..' of a directory, and the '/' directory '..' points to itself. Although > mounted file systems do not behave that way. My /proc/.. is '/'. I just > figured that the dentry->d_parent would be similar. Learn something everyday. What would you expect to happen if you have the same filesystem mounted in several places? Having separate dentry trees would be a nightmare - you'd get cache coherency problems from hell. It's survivable for procfs, but for something like a normal local filesystem it'd become very painful. And if we want them to share dentry tree, how do you choose where the .. would lead from the root dentry? The way it's done is that linkage between the trees is done separately - there's a tree of struct mount (well, forest, really - different processes can easily have separate trees, which is how namespaces are done) and each node in the mount tree refers to a dentry (sub)tree in some filesystem instance. Location is represented by (mount, dentry) pair and handling of .. is basically (modulo refcounting, locking, error handling, etc.) 
    while dentry == subtree_root(mount) && mount != mountpoint_mount(mount)
        // cross into the mountpoint under it
        dentry = mountpoint_dentry(mount)
        mount = mountpoint_mount(mount)
    go_into(mount, dentry->d_parent)

Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah:
; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah
will do it.  Then e.g. /mnt/blah/include will resolve to the same dentry as
/usr/lib/gcc/x86_64-linux-gnu/12/include, etc.
; chdir /mnt/blah
; ls
32                 crtprec80.o        libgomp.so         libsanitizer.spec
cc1                g++-mapper-server  libgomp.spec       libssp_nonshared.a
cc1plus            include            libitm.a           libstdc++.a
collect2           libasan.a          libitm.so          libstdc++fs.a
crtbegin.o         libasan_preinit.o  libitm.spec        libstdc++.so
crtbeginS.o        libasan.so         liblsan.a          libsupc++.a
crtbeginT.o        libatomic.a        liblsan_preinit.o  libtsan.a
crtend.o           libatomic.so       liblsan.so         libtsan_preinit.o
crtendS.o          libbacktrace.a     liblto_plugin.so   libtsan.so
crtfastmath.o      libcc1.so          libobjc.a          libubsan.a
crtoffloadbegin.o  libgcc.a           libobjc_gc.a       libubsan.so
crtoffloadend.o    libgcc_eh.a        libobjc_gc.so      lto1
crtoffloadtable.o  libgcc_s.so        libobjc.so         lto-wrapper
crtprec32.o        libgcov.a          libquadmath.a      plugin
crtprec64.o        libgomp.a          libquadmath.so     x32

We obviously want .. to resolve to /mnt, though.
; ls ..
; ls /usr/lib/gcc/x86_64-linux-gnu/
12

So the trigger for "cross into underlying mountpoint" has to be "dentry is
the root of subtree mount refers to" - it depends upon the mount we are
in.

> > Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
>
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

Not quite.  Single alias for directories is more about cache coherency
fun; we really can't afford multiple aliases for those.
For non-directories it's possible to have an entirely disconnected dentry
referring to that sucker; if somebody hands you an fhandle with no indication
of the parent directory, you might end up having to do one of those, no matter
how many times you find the same inode later.  Not an issue for tracefs, though.

> > namespace: mount tree.  Unlike everything prior, this one is a part of
> > process state - same as descriptor table, mappings, etc.
>
> And I'm guessing namespace is for containers. At least that's what I've
> been assuming they are for.

It predates containers by quite a few years, but yes, that's one of the
users.  It is related to virtual machines, in the same sense the set
of memory mappings is - each thread can be thought of as a VM, with
a bunch of components.  Just as mmap() manipulates the virtual address
translation for the threads that share memory space with the caller,
mount() manipulates the pathname resolution for the threads that share
the namespace with the caller.

> > descriptor table: mapping from numbers to IO channels (opened files).
>
> This is that "process fd table" I mentioned above (I wrote that before
> reading this).
>
> > Again, a part of process state.  dup() creates a new entry, with
> > reference to the same file as the old one; multiple open() of the
>
> Hmm, wouldn't "dup()" create another "file" that just points to the same
> dentry? It wouldn't be the "same file", or did you mean "file" from the
> user space point of view?

No.  The difference between open() and dup() is that the latter will
result in a descriptor that really refers to the same file.  Current
IO position belongs to IO channel; it doesn't matter for e.g. terminals,
but for regular file it immediately becomes an issue.
    fd1 = open("foo", 0);
    fd2 = open("foo", 0);
    read(fd1, &c1, 1);
    read(fd2, &c2, 1);
will result in the first byte of foo read into c1 and c2, but
    fd1 = open("foo", 0);
    fd2 = dup(fd1);
    read(fd1, &c1, 1);
    read(fd2, &c2, 1);
will have the first byte of foo in c1 and the second one - in c2.
open() yields a new IO channel attached to new descriptor; dup()
(and dup2()) attaches the existing IO channel to new descriptor.
fork() acts like dup() in that respect - child gets its descriptor
table populated with references to the same IO channels as the
parent does.

Any Unix since about '71 has it done that way and the same goes
for NT, DOS, etc. - you can't implement redirects to/from regular
files without that distinction.

Unfortunately, the terms are clumsy as hell - POSIX ends up with
"file descriptor" (for numbers) vs. "file description" (for IO
channels), which is hard to distinguish when reading and just
as hard to distinguish when listening.  "Opened file" (as IO
channel) vs. "file on disc" (as collection of data that might
be accessed via said channels) distinction on top of that also
doesn't help, to put it mildly.  It's many decades too late to
do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

The bottom line:
    * struct file represents an IO channel; it might be operating
on various objects, including regular files, pipes, sockets, etc.
    * current IO position is a property of IO channel.
    * struct files_struct represents a descriptor table; each of
those maps numbers to IO channels.
    * each thread uses a descriptor table to turn numbers ("file
descriptors") into struct file references.  Different threads might
share the same descriptor table or have separate descriptor tables.
current->files points to the descriptor table of the current thread.
    * open() creates a new IO channel and attaches it to an
unused position in descriptor table.
    * dup(n) takes the IO channel from position 'n' in descriptor
table and attaches it to an unused position.
    * dup2(old, new) takes the IO channel from position 'old' and
attaches it to position 'new'; if there used to be something in position
'new', it gets detached.
    * close(n) takes the IO channel from position 'n', flushes and
detaches it.  Note that the IO channel itself is *NOT* closed until
all references to it are gone.  E.g. open() + fork() + (in parent) close()
will end up with the child's descriptor table keeping a reference to
IO channel established by open(); close() in parent will not shut the
channel down.  The same goes for implicit close() done by dup2() or
by exit(), etc.
    * things like mmap() retain struct file references;
open() + mmap() + close() ends up with struct file left (in vma->vm_file)
alive and well for as long as the mapping exists, never mind the reference
that used to be in descriptor table.  In other words, IO channels can
exist with no references in any descriptor tables.  There are other
ways for such situation to occur (e.g. SCM_RIGHTS stuff); it's entirely
normal.

> > same pathname will each yield a separate opened file.  _Some_ state
> > belongs here (close-on-exec, mostly).  Note that there's no such
> > thing as "the descriptor of this file" - not even "the user-supplied
> > number that had been used to get the file we are currently reading
> > from", since that number might be referring to something entirely
> > different right after we'd resolved it to opened file and that
> > happens *without* disrupting the operation.
>
> This last paragraph confused me. What do you mean by "referring to
> something entirely different"?

Two threads share descriptor table; one of them is in
read(fd, ...), another does dup2(fd2, fd).  If read() gets past the
point where it gets struct file reference, it will keep accessing that
IO channel.  dup2() will replace the reference in descriptor table,
but that won't disrupt the read()...

>
> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc.
> I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

At the very least it would need serious reordering ;-/

^ permalink raw reply [flat|nested] 10+ messages in thread
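The open() versus dup() behaviour described in the message above can be checked from user space. What follows is a minimal sketch, not code from the thread: the function name second_read() and the scratch file under /tmp are assumptions made for illustration. It writes "ab" to a scratch file, reads one byte through a first descriptor, then one byte through a second descriptor obtained either by a second open() or by dup(), and reports what the second read saw.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * second_read() is a hypothetical helper for illustration only.
 * Returns the byte read through the second descriptor, or -1 on error.
 *   use_dup == 0: second descriptor comes from another open() (new IO channel)
 *   use_dup != 0: second descriptor comes from dup() (same IO channel)
 */
int second_read(int use_dup)
{
    char path[] = "/tmp/iochan-XXXXXX";   /* assumed scratch location */
    char c1 = 0, c2 = 0;
    int wfd, fd1, fd2 = -1, ret = -1;

    /* Create a scratch file whose contents are "ab". */
    wfd = mkstemp(path);
    if (wfd < 0)
        return -1;
    if (write(wfd, "ab", 2) != 2) {
        close(wfd);
        unlink(path);
        return -1;
    }
    close(wfd);

    fd1 = open(path, O_RDONLY);
    if (fd1 >= 0)
        fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);

    if (fd1 >= 0 && fd2 >= 0) {
        assert(fd1 != fd2);               /* two distinct numbers either way */
        /* First read always sees 'a'; what the second sees depends on
         * whether the two numbers share one current IO position. */
        if (read(fd1, &c1, 1) == 1 && read(fd2, &c2, 1) == 1)
            ret = (unsigned char)c2;
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd1 >= 0)
        close(fd1);
    unlink(path);
    return ret;
}
```

With two separate open() calls, second_read(0) should return 'a', because each IO channel starts at position 0. With dup(), second_read(1) should return 'b', because both numbers refer to one IO channel whose position the first read already advanced.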
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership 2024-01-04 18:25 ` Al Viro @ 2024-01-04 19:10 ` Steven Rostedt 2024-01-04 19:21 ` Linus Torvalds 2024-01-04 19:15 ` Steven Rostedt 1 sibling, 1 reply; 10+ messages in thread From: Steven Rostedt @ 2024-01-04 19:10 UTC (permalink / raw) To: Al Viro Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc On Thu, 4 Jan 2024 18:25:02 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote: > > > This is the "tribal knowledge" I'm talking about. I really didn't know how > > the root dentry parent worked. I guess that makes sense, as it matches the > > '..' of a directory, and the '/' directory '..' points to itself. Although > > mounted file systems do not behave that way. My /proc/.. is '/'. I just > > figured that the dentry->d_parent would be similar. Learn something everyday. > > What would you expect to happen if you have the same filesystem mounted in > several places? Having separate dentry trees would be a nightmare - you'd > get cache coherency problems from hell. It's survivable for procfs, but > for something like a normal local filesystem it'd become very painful. > And if we want them to share dentry tree, how do you choose where the .. > would lead from the root dentry? My mistake was thinking that the dentry was attached more to the path than the inode. But that doesn't seem to be the case. I wasn't sure if there was a way to get to a dentry from the inode. I see the i_dentry list, which is a list, where I got some of my idea that dentry was closer to path than inode. 
> > The way it's done is that linkage between the trees is done separately - > there's a tree of struct mount (well, forest, really - different processes > can easily have separate trees, which is how namespaces are done) and > each node in the mount tree refers to a dentry (sub)tree in some filesystem > instance. Location is represented by (mount, dentry) pair and handling of > .. is basically (modulo refcounting, locking, error handling, etc.) > while dentry == subtree_root(mount) && mount != mountpoint_mount(mount) > // cross into the mountpoint under it > dentry = mountpoint_dentry(mount) > mount = mountpoint_mount(mount) > go_into(mount, dentry->d_parent) > > Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah: > ; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah > will do it. Then e.g. /mnt/blah/include will resolve to the same dentry as > /usr/lib/gcc/x86_64-linux-gnu/12/include, etc. > ; chdir /mnt/blah > ; ls > 32 crtprec80.o libgomp.so libsanitizer.spec > cc1 g++-mapper-server libgomp.spec libssp_nonshared.a > cc1plus include libitm.a libstdc++.a > collect2 libasan.a libitm.so libstdc++fs.a > crtbegin.o libasan_preinit.o libitm.spec libstdc++.so > crtbeginS.o libasan.so liblsan.a libsupc++.a > crtbeginT.o libatomic.a liblsan_preinit.o libtsan.a > crtend.o libatomic.so liblsan.so libtsan_preinit.o > crtendS.o libbacktrace.a liblto_plugin.so libtsan.so > crtfastmath.o libcc1.so libobjc.a libubsan.a > crtoffloadbegin.o libgcc.a libobjc_gc.a libubsan.so > crtoffloadend.o libgcc_eh.a libobjc_gc.so lto1 > crtoffloadtable.o libgcc_s.so libobjc.so lto-wrapper > crtprec32.o libgcov.a libquadmath.a plugin > crtprec64.o libgomp.a libquadmath.so x32 > > We obviously want .. to resolve to /mnt, though. > ; ls .. > ; ls /usr/lib/gcc/x86_64-linux-gnu/ > 12 > > So the trigger for "cross into underlying mountpoint" has to be "dentry is > the root of subtree mount refers to" - it depends upon the mount we are > in. 
> > > > Filesystem object contents belongs here; multiple hardlinks > > > have different dentries and the same inode. > > > > So, can I assume that an inode could only have as many dentries as hard > > links? I know directories are only allowed to have a single hard link. Is > > that why they can only have a single dentry? > > Not quite. Single alias for directories is more about cache coherency > fun; we really can't afford multiple aliases for those. For non-directories > it's possible to have an entirely disconnected dentry refering to that > sucker; if somebody hands you an fhandle with no indication of the parent > directory, you might end up having to do one of those, no matter how many > times you find the same inode later. Not an issue for tracefs, though. > > > > namespace: mount tree. Unlike everything prior, this one is a part of > > > process state - same as descriptor table, mappings, etc. > > > > And I'm guessing namespace is for containers. At least that's what I've > > been assuming they are for. > > It predates containers by quite a few years, but yes, that's one of the > users. It is related to virtual machines, in the same sense the set > of memory mappings is - each thread can be thought of as a VM, with > a bunch of components. Just as mmap() manipulates the virtual address > translation for the threads that share memory space with the caller, > mount() manipulates the pathname resolution for the threads that share > the namespace with the caller. > > > > descriptor table: mapping from numbers to IO channels (opened files). > > > > This is that "process fd table" I mentioned above (I wrote that before > > reading this). > > > > > Again, a part of process state. dup() creates a new entry, with > > > reference to the same file as the old one; multiple open() of the > > > > Hmm, wouldn't "dup()" create another "file" that just points to the same > > dentry? 
It wouldn't be the "same file", or did you mean "file" from the > > user space point of view? > > No. The difference between open() and dup() is that the latter will > result in a descriptor that really refers to the same file. Current > IO position belongs to IO channel; it doesn't matter for e.g. terminals, > but for regular file it immediately becomes an issue. > fd1 = open("foo", 0); > fd2 = open("foo", 0); > read(fd1, &c1, 1); > read(fd2, &c2, 1); > will result in the first byte of foo read into c1 and c2, but > fd1 = open("foo", 0); > fd2 = dup(fd1); > read(fd1, &c1, 1); > read(fd2, &c2, 1); > will have the first byte of foo in c1 and the second one - in c2. > open() yields a new IO channel attached to new descriptor; dup() > (and dup2()) attaches the existing IO channel to new descriptor. > fork() acts like dup() in that respect - child gets its descriptor > table populated with references to the same IO channels as the > parent does. Ah, looking at the code I use dup() in, it's mostly for pipes in and for redirecting stdout,stdin, etc. So yeah, that makes sense. > > Any Unix since about '71 has it done that way and the same goes > for NT, DOS, etc. - you can't implement redirects to/from regular > files without that distinction. Yep, which is what I used it for. Just forgot the details. > > > > same pathname will each yield a separate opened file. _Some_ state > > > belongs here (close-on-exec, mostly). Note that there's no such > > > thing as "the descriptor of this file" - not even "the user-supplied > > > number that had been used to get the file we are currently reading > > > from", since that number might be refering to something entirely > > > different right after we'd resolved it to opened file and that > > > happens *without* disrupting the operation. > > > > This last paragraph confused me. What do you mean by ""referring to > > something entirely different"? 
> > Two threads share descriptor table; one of them is in > read(fd, ...), another does dup2(fd2, fd). If read() gets past the > point where it gets struct file reference, it will keep accessing that > IO channel. dup2() will replace the reference in descriptor table, > but that won't disrupt the read()... Oh, OK. So basically if fd 4 is a reference to /tmp/foo and you open /tmp/bar which gets fd2, and one thread is reading fd 4 (/tmp/foo), the other thread doing dup2(fd2, fd) will make fd 4 a reference to /tmp/bar but the read will finish reading /tmp/foo. But if the first thread were to do another read(fd, ...) it would then read /tmp/bar. In other words, it allows read() to stay atomic with respect to what it is reading until it returns. > > > > > Thanks for this overview. It was very useful, and something I think we > > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but > > honestly, I think your writeup here is a better overview. > > At the very least it would need serious reordering ;-/ Yeah, but this is all great information. Thanks for explaining it. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
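The /tmp/foo and /tmp/bar scenario in the message above can be approximated without threads. This is a sketch under loud assumptions: read_after_retarget() and open_scratch() are made-up names, two scratch files stand in for /tmp/foo and /tmp/bar, and a dup()ed descriptor called 'held' stands in for the struct file reference that an in-progress read() holds inside the kernel. The real reference is taken by the kernel and is not a descriptor at all; the stand-in only shows that replacing a descriptor table entry with dup2() leaves the old IO channel usable by whoever still holds it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file containing one byte and
 * return a read-only descriptor for it (or -1).  The name is unlinked
 * immediately; the open descriptor keeps the data reachable.
 */
static int open_scratch(char content)
{
    char path[] = "/tmp/retarget-XXXXXX";  /* assumed scratch location */
    int wfd = mkstemp(path);
    int rfd = -1;

    if (wfd < 0)
        return -1;
    if (write(wfd, &content, 1) == 1)
        rfd = open(path, O_RDONLY);
    close(wfd);
    unlink(path);
    return rfd;
}

/*
 * Returns the byte read after dup2(fd2, fd) has retargeted 'fd':
 *   through_held != 0: read via the reference taken before the dup2()
 *   through_held == 0: read via the number 'fd' itself
 * Returns -1 on error.
 */
int read_after_retarget(int through_held)
{
    int fd = open_scratch('F');            /* plays the role of /tmp/foo */
    int fd2 = open_scratch('B');           /* plays the role of /tmp/bar */
    int held = fd >= 0 ? dup(fd) : -1;     /* stand-in for read()'s reference */
    char c = 0;
    int ret = -1;

    if (fd >= 0 && fd2 >= 0 && held >= 0) {
        assert(held != fd && held != fd2);
        /* Replace the table entry for 'fd'; the old channel stays alive
         * because 'held' still refers to it. */
        if (dup2(fd2, fd) == fd &&
            read(through_held ? held : fd, &c, 1) == 1)
            ret = (unsigned char)c;
    }

    if (held >= 0)
        close(held);
    if (fd2 >= 0)
        close(fd2);
    if (fd >= 0)
        close(fd);
    return ret;
}
```

The reference taken before the dup2() should still yield 'F', while a fresh lookup of the same number should yield 'B', which matches the "next read(fd, ...) would then read /tmp/bar" description.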
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:10 ` Steven Rostedt
@ 2024-01-04 19:21 ` Linus Torvalds
  0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-01-04 19:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
      Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
      Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> My mistake was thinking that the dentry was attached more to the path than
> the inode. But that doesn't seem to be the case. I wasn't sure if there was
> a way to get to a dentry from the inode.

Yeah, so dentry->inode and path->dentry are one-way translations,
because the other way can have multiple different cases.

IOW, a path will specify *one* dentry, and a dentry will specify *one*
inode, but one inode can be associated with multiple dentries, and
there may be other undiscovered dentries that *would* point to it but
aren't even cached right now.

And a single dentry can be part of multiple paths, thanks to bind mounts.

The "inode->i_dentry" list is *not* a way to look up all dentries,
because - as mentioned - there may be potential other paths (and thus
other dentries) that lead to the same inode that just haven't been
looked up yet (or that have already been aged out of the cache).

Of course any *particular* filesystem may not have hard links (so one
inode has only one possible dentry), and you may not have bind mounts,
and it might be one of the virtual filesystems where everything is
always in memory, so none of the above problems are guaranteed to be
the case in any *particular* situation.

But it's all part of why the dcache is actually really subtle.
It's not just the RCU lookup rules and the specialized locking (both lockref
and the rather complicated rules about d_lock ordering), it's also
that whole "yeah, the filesystem only sees a 'dentry', but because of
bind mounts the vfs layer actually does things internally in terms of
'struct path' in order to be able to then show that single filesystem
in multiple places".

Etc etc.

There's a reason Al Viro ends up owning the dcache. Nobody else can
wrap their tiny little minds around it all.

              Linus

^ permalink raw reply [flat|nested] 10+ messages in thread
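The "one inode, several names" case from the message above is visible from user space through hard links. The sketch below is an illustration only: shared_inode_nlink() is a made-up name, the scratch directory under /tmp is an assumption, and the filesystem backing it is assumed to support hard links. User space cannot see dentries directly; it can only see that two names resolve to the same (st_dev, st_ino) pair, which is what two dentries pointing at one inode looks like from outside.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file, give it a second name with link(),
 * and return the link count if both names resolve to the same inode.
 * Returns -1 on error or if the names resolve to different inodes.
 */
long shared_inode_nlink(void)
{
    char dir[] = "/tmp/hardlink-XXXXXX";   /* assumed scratch location */
    char a[64], b[64];
    struct stat sa, sb;
    long ret = -1;
    int fd;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof(a), "%s/a", dir);
    snprintf(b, sizeof(b), "%s/b", dir);

    fd = open(a, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        /* Second name for the same filesystem object. */
        if (link(a, b) == 0 &&
            stat(a, &sa) == 0 && stat(b, &sb) == 0 &&
            sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino) {
            /* Same inode, so both lookups report the same link count. */
            assert(sa.st_nlink == sb.st_nlink);
            ret = (long)sb.st_nlink;
        }
        unlink(b);
        unlink(a);
    }
    rmdir(dir);
    return ret;
}
```

On a filesystem with hard-link support the function should return 2. Note that nothing here enumerates the names of an inode; as the message says, there is no general way to do that.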
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 18:25 ` Al Viro
  2024-01-04 19:10   ` Steven Rostedt
@ 2024-01-04 19:15   ` Steven Rostedt
  2024-01-04 19:26     ` Matthew Wilcox
  2024-01-04 19:35     ` Linus Torvalds
  1 sibling, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2024-01-04 19:15 UTC (permalink / raw)
  To: Al Viro
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
      Linus Torvalds, Christian Brauner, linux-fsdevel,
      Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 18:25:02 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> Unfortunately, the terms are clumsy as hell - POSIX ends up with
> "file descriptor" (for numbers) vs. "file description" (for IO
> channels), which is hard to distinguish when reading and just
> as hard to distinguish when listening.  "Opened file" (as IO
> channel) vs. "file on disc" (as collection of data that might
> be accessed via said channels) distinction on top of that also
> doesn't help, to put it mildly.  It's many decades too late to
> do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

Just so I understand this correctly.

"file descriptor" - is just what maps to a specific inode.

"file description" - is how the file is accessed (position in the file and
 flags associated to how it was opened)

Did I get that correct?

-- Steve

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:15 ` Steven Rostedt
@ 2024-01-04 19:26 ` Matthew Wilcox
  2024-01-04 19:35 ` Linus Torvalds
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2024-01-04 19:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
      Mathieu Desnoyers, Linus Torvalds, Christian Brauner,
      linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 02:15:17PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 18:25:02 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > Unfortunately, the terms are clumsy as hell - POSIX ends up with
> > "file descriptor" (for numbers) vs. "file description" (for IO
> > channels), which is hard to distinguish when reading and just
> > as hard to distinguish when listening.  "Opened file" (as IO
> > channel) vs. "file on disc" (as collection of data that might
> > be accessed via said channels) distinction on top of that also
> > doesn't help, to put it mildly.  It's many decades too late to
> > do anything about, unfortunately.  Pity the UNIX 101 students... ;-/
>
> Just so I understand this correctly.
>
> "file descriptor" - is just what maps to a specific inode.

No -- file descriptor is a number in fdtable that maps to a struct file.

> "file description" - is how the file is accessed (position in the file and
>  flags associated to how it was opened)

file description is POSIX's awful name for struct file.

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:15 ` Steven Rostedt
  2024-01-04 19:26   ` Matthew Wilcox
@ 2024-01-04 19:35   ` Linus Torvalds
  2024-01-04 20:02     ` Linus Torvalds
  2024-01-04 21:28     ` Al Viro
  1 sibling, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-01-04 19:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
      Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
      Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:14, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> "file descriptor" - is just what maps to a specific inode.

Nope. Technically and traditionally, file descriptor is just the
integer index that is used to look up a 'struct file *'.

Except in the kernel, we really just tend to use that term (well, I
do) for the 'struct file *' itself, since the integer 'fd' is usually
not really relevant except at the system call interface.

Which is *NOT* the inode, because the 'struct file' has other things
in it (the file position, the permissions that were used at open time
etc, close-on-exec state etc etc).

> "file description" - is how the file is accessed (position in the file and
>  flags associated to how it was opened)

That's a horrible term that shouldn't be used at all. Apparently some
people use it for what is our 'struct file *', also known as a "file
table entry".  Avoid it.

If anything, just use "fd" for the integer representation, and "file"
for the pointer to a 'struct file'.

But most of the time the two are conceptually interchangeable, in that
an 'fd' just translates directly to a 'struct file *'.
Note that while there's that conceptual direct translation, there's also very much a "time of use" issue, in that a "fd -> file" translation happens at one particular time and in one particular user context, and then it's *done* (so closing and possibly re-using the fd after it's been looked up does not actually affect an existing 'struct file *'). And while 'fd -> file' lookup is quick and common, the other way doesn't exist, because multiple 'fd's can map to one 'struct file *' thanks to dup() (and 'fork()', since a 'fd -> file' translation always happens within the context of a particular user space, an 'fd' in one process is obviously not the same as an 'fd' in another one). Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
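A rough userspace sketch of the fd vs. 'struct file' relationship described in the message above, assuming a POSIX system with a writable /tmp. The helper names (scratch_file, offset_seen_through_dup, byte_after_fd_reuse) are made up for illustration. It can only show the userspace-visible analogue: the in-kernel "time of use" lookup itself is not observable from here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* mkstemp() */
#endif
#include <assert.h>	/* assert() for callers checking the results */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: an already-unlinked scratch file holding "0123456789". */
static int scratch_file(void)
{
	char path[] = "/tmp/fd-demo-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the open file keeps the inode alive */
	if (write(fd, "0123456789", 10) != 10) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Two fds from dup() share one 'struct file', hence one file position.
 * Returns the position seen through the duplicate after seeking via the
 * original, or -1 on error.
 */
static off_t offset_seen_through_dup(void)
{
	int fd = scratch_file();
	int fd2;
	off_t pos = -1;

	if (fd < 0)
		return -1;
	fd2 = dup(fd);
	if (fd2 >= 0 && lseek(fd, 3, SEEK_SET) == 3)
		pos = lseek(fd2, 0, SEEK_CUR);
	if (fd2 >= 0)
		close(fd2);
	close(fd);
	return pos;
}

/*
 * Close the original fd and let a new open() take a descriptor number;
 * the duplicate still refers to the first file. Returns the byte read at
 * offset 5 through the duplicate, or -1 on error.
 */
static int byte_after_fd_reuse(void)
{
	int fd = scratch_file();
	int fd2, other;
	int ret = -1;
	char c;

	if (fd < 0)
		return -1;
	fd2 = dup(fd);
	close(fd);
	other = open("/dev/null", O_RDONLY);	/* usually lands on fd's old number */
	if (fd2 >= 0 && lseek(fd2, 5, SEEK_SET) == 5 && read(fd2, &c, 1) == 1)
		ret = c;
	if (other >= 0)
		close(other);
	if (fd2 >= 0)
		close(fd2);
	return ret;
}
```

Under these assumptions, offset_seen_through_dup() should report 3 and byte_after_fd_reuse() should report '5': the number 'fd' can be recycled freely without disturbing the file that another descriptor still references.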
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
From: Linus Torvalds @ 2024-01-04 20:02 UTC
To: Steven Rostedt
Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:35, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Which is *NOT* the inode, because the 'struct file' has other things
> in it (the file position, the permissions that were used at open time
> etc, close-on-exec state etc etc).

That close-on-exec thing was a particularly bad example of things that
are in the 'struct file', because it's in fact the only thing that
*isn't* in 'struct file' and is associated directly with the 'int fd'.

But hopefully the intent was clear despite me picking a particularly
bad example.

                Linus
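The correction above can be checked from userspace with fcntl(). The following is a minimal sketch, assuming a POSIX system where /dev/null exists; the function names are illustrative only. It contrasts the per-descriptor FD_CLOEXEC flag with a file status flag (O_APPEND), which is shared by every descriptor that refers to the same open file.

```c
#include <assert.h>	/* assert() for callers checking the results */
#include <fcntl.h>
#include <unistd.h>

/*
 * FD_CLOEXEC belongs to the descriptor: setting it on one fd leaves a
 * dup()'d fd untouched. Returns 1 if only the original carries the flag,
 * 0 if not, -1 on error.
 */
static int cloexec_is_per_fd(void)
{
	int fd = open("/dev/null", O_RDONLY);
	int fd2, f1 = -1, f2 = -1;

	if (fd < 0)
		return -1;
	fd2 = dup(fd);	/* the new descriptor starts with FD_CLOEXEC clear */
	if (fd2 >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) == 0) {
		f1 = fcntl(fd, F_GETFD);
		f2 = fcntl(fd2, F_GETFD);
	}
	if (fd2 >= 0)
		close(fd2);
	close(fd);
	if (f1 < 0 || f2 < 0)
		return -1;
	return (f1 & FD_CLOEXEC) && !(f2 & FD_CLOEXEC);
}

/*
 * File status flags belong to the open file: O_APPEND set through one fd
 * is visible through its duplicate. Returns 1 if so, 0 if not, -1 on error.
 */
static int status_flags_are_shared(void)
{
	int fd = open("/dev/null", O_WRONLY);
	int fd2, fl, seen = -1;

	if (fd < 0)
		return -1;
	fd2 = dup(fd);
	fl = fcntl(fd, F_GETFL);
	if (fd2 >= 0 && fl >= 0 && fcntl(fd, F_SETFL, fl | O_APPEND) == 0)
		seen = fcntl(fd2, F_GETFL);
	if (fd2 >= 0)
		close(fd2);
	close(fd);
	if (seen < 0)
		return -1;
	return (seen & O_APPEND) != 0;
}
```

Both functions are expected to return 1: close-on-exec follows the integer, while the status flags follow the open file.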
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
From: Al Viro @ 2024-01-04 21:28 UTC
To: Linus Torvalds
Cc: Steven Rostedt, LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 11:35:37AM -0800, Linus Torvalds wrote:

> > "file description" - is how the file is accessed (position in the file and
> > flags associated to how it was opened)
>
> That's a horrible term that shouldn't be used at all. Apparently some
> people use it for what is our 'struct file *', also known as a "file
> table entry". Avoid it.

Worse, really. As far as I can reconstruct what happened it was
something along the lines of "colloquial expression is 'opened file',
but that is confusing - sounds like a property+noun, so it might
be misparsed as a member of a subset of files satisfying the property
of 'being opened'; can't have that in a standard, let's come up with
something else". Except that what they did come up with had been
much worse, for obvious linguistic reasons.

The *ONLY* uses for that expression I can think of are

1. When reading POSIX texts, watch out for that one - if you see
them talking about a file descriptor in context where it really
should be about an opened file, check the wording. If it really
says "file descriptOR", it's probably a bug in the standard or a
codified bullshit practice. If it says "file descriptION" instead,
replace with "opened file" and move on.

2. An outstanding example of the taste of that bunch.

IO channel would be a saner variant, but it's far too late for that.

The 3-way distinction between descriptor/opened file/file as collection
of data needs to be explained in UNIX 101; it is userland-visible and
it has to be understood. Unfortunately, it's often done in a way that
leaves students seriously confused ;-/
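One possible way to show the "opened file" vs. "file as collection of data" half of that 3-way distinction is to open the same path twice. This is only a sketch under the assumption of a POSIX system with a writable /tmp; two_opens_one_file is a hypothetical name. Each open() yields its own opened file with its own position, yet both reach the same inode and the same data.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* mkstemp() */
#endif
#include <assert.h>	/* assert() for callers checking the result */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if two separate opens of one path have independent offsets
 * but the same inode and contents, 0 if not, -1 on setup failure.
 */
static int two_opens_one_file(void)
{
	char path[] = "/tmp/open-demo-XXXXXX";
	int a = mkstemp(path);	/* first opened file, read-write */
	int b, ok;
	struct stat sa, sb;
	char c = 0;

	if (a < 0)
		return -1;
	b = open(path, O_RDONLY);	/* second, independent opened file */
	unlink(path);
	if (b < 0) {
		close(a);
		return -1;
	}

	ok = write(a, "xyz", 3) == 3		/* advances a's position to 3 */
	  && lseek(a, 0, SEEK_CUR) == 3
	  && lseek(b, 0, SEEK_CUR) == 0		/* b's position did not move */
	  && read(b, &c, 1) == 1 && c == 'x'	/* but b sees the data a wrote */
	  && fstat(a, &sa) == 0 && fstat(b, &sb) == 0
	  && sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;

	close(a);
	close(b);
	return ok;
}
```

Compare this with dup(), where two descriptors share one opened file and therefore one position.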
* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
From: Matthew Wilcox @ 2024-01-04 19:03 UTC
To: Steven Rostedt
Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, Christian Brauner, linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:
> > file_system_type: what filesystem instances belong to. Not quite the same
> > thing as fs driver (one driver can provide several of those). Usually
> > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).
>
> I don't know the difference between NFS and NFSv4 as I just used whatever
> was the latest. But I understand the ext[234] part.

What Al's saying is that nfs.ko provides both nfs_fs_type and
nfs4_fs_type. ext4.ko provides ext2_fs_type, ext3_fs_type and
ext4_fs_type. This is allowed but anomalous. Most filesystems provide
only one, eg ocfs2_fs_type.

> >
> > super_block: individual filesystem instance. Hosts dentry tree (connected or
> > several disconnected parts - think NFSv4 or the state while trying to get
> > a dentry by fhandle, etc.).
>
> I don't know how NFSv4 works, I'm only a user of it, I never actually
> looked at the code. So that's not the best example, at least for me.

Right, so NFS (v4 or otherwise) is Special. In the protocol, files
are identified by a thing called an fhandle. This is (iirc) a 32-byte
identifier which must persist across server reboot. Originally it was
probably supposed to encode dev_t plus ino_t plus generation number.
But you can do all kinds of things in the NFS protocol with an fhandle
that you need a dentry for in Linux (like path walks). Unfortunately,
clients can't be told "Hey, we've lost context, please rewalk" (which
would have other problems anyway), so we need a way to find the dentry
for an fhandle. I understand this very badly, but essentially we end
up looking for canonical ones, and then creating isolated trees of
dentries if we can't find them. Sometimes we then graft these isolated
trees into the canonical spots if we end up connecting them through
various filesystem activity.

At least that's my understanding which probably contains several
misunderstandings.

> > Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
>
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

There could be more. For example, I could open("A"); ln("A", "B");
open("B"); rm("A"); ln("B", "C"); open("C"); rm("B"). Now there are
three dentries for this inode, its link count is currently one and
never exceeded two.

> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

Documentation/filesystems/locking.rst is often a better source, although
the two should really be merged. Not for the faint-hearted.
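The open/ln/rm sequence in the message above can be replayed from userspace as a sketch, assuming a POSIX system whose /tmp supports hard links. The struct and function names here are hypothetical. Dentries are not visible from userspace, so this only confirms the parts that are: all three opens reach the same inode, the link count peaks at two, and it ends at one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* mkdtemp() */
#endif
#include <assert.h>	/* assert() for callers checking the results */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

struct link_demo {
	int same_inode;		/* 1 if all three opens reached one inode */
	nlink_t max_links;	/* highest st_nlink observed along the way */
	nlink_t final_links;	/* st_nlink after the last unlink */
};

/* Record the current link count if it is a new maximum. */
static void note_links(int fd, struct link_demo *res)
{
	struct stat st;

	if (fstat(fd, &st) == 0 && st.st_nlink > res->max_links)
		res->max_links = st.st_nlink;
}

/* Returns 0 and fills *res on success, -1 on error. */
static int run_link_demo(struct link_demo *res)
{
	char dir[] = "/tmp/link-demo-XXXXXX";
	char a[64], b[64], c[64];
	struct stat sa, sb, sc;
	int fa = -1, fb = -1, fc = -1, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(a, sizeof(a), "%s/A", dir);
	snprintf(b, sizeof(b), "%s/B", dir);
	snprintf(c, sizeof(c), "%s/C", dir);
	res->max_links = 0;

	fa = open(a, O_CREAT | O_EXCL | O_RDONLY, 0600);	/* open("A") */
	if (fa < 0 || link(a, b) < 0)				/* ln("A", "B") */
		goto cleanup;
	note_links(fa, res);
	fb = open(b, O_RDONLY);					/* open("B") */
	if (fb < 0 || unlink(a) < 0 || link(b, c) < 0)		/* rm("A"); ln("B", "C") */
		goto cleanup;
	note_links(fa, res);
	fc = open(c, O_RDONLY);					/* open("C") */
	if (fc < 0 || unlink(b) < 0)				/* rm("B") */
		goto cleanup;

	if (fstat(fa, &sa) || fstat(fb, &sb) || fstat(fc, &sc))
		goto cleanup;
	res->same_inode = sa.st_dev == sb.st_dev && sb.st_dev == sc.st_dev &&
			  sa.st_ino == sb.st_ino && sb.st_ino == sc.st_ino;
	res->final_links = sc.st_nlink;
	ret = 0;
cleanup:
	if (fa >= 0)
		close(fa);
	if (fb >= 0)
		close(fb);
	if (fc >= 0)
		close(fc);
	unlink(a);	/* any of these may already be gone; errors ignored */
	unlink(b);
	unlink(c);
	rmdir(dir);
	return ret;
}
```

On success, same_inode should be 1, max_links 2 and final_links 1, matching the link counts stated in the message. The three dentries themselves remain a kernel-side detail that this sketch cannot observe.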