Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

linux-doc.vger.kernel.org archive mirror

* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
       [not found]     ` <20240104043945.GQ1674809@ZenIV>
@ 2024-01-04 15:05       ` Steven Rostedt
  2024-01-04 18:25         ` Al Viro
  2024-01-04 19:03         ` Matthew Wilcox
  0 siblings, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2024-01-04 15:05 UTC
  To: Al Viro
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Linus Torvalds, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 04:39:45 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote:
> > On Thu, 4 Jan 2024 01:48:37 +0000
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> >   
> > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > >   
> > > > +	/* Get the tracefs root from the parent */
> > > > +	inode = d_inode(dentry->d_parent);
> > > > +	inode = d_inode(inode->i_sb->s_root);    
> > > 
> > > That makes no sense.  First of all, for any positive dentry we have
> > > dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> > > dentries on given superblock.  So what's the point of that dance?
> > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> > > and be done with that...  
> > 
> > That was more of thinking that the dentry and dentry->d_parent are
> > different. As dentry is part of eventfs and dentry->d_parent is part of
> > tracefs.  
> 
> ???
> 
> > Currently they both have the same superblock so yeah, I could just
> > write it that way too and it would work. But in my head, I was thinking
> > that they behave differently and maybe one day eventfs would get its own
> > superblock which would not work.  
> 
> ->d_parent *always* points to the same filesystem; if you get an (automounted?)  
> mountpoint there, ->d_parent simply won't work - it will point to dentry itself.
> 

This is the "tribal knowledge" I'm talking about. I really didn't know how
the root dentry parent worked. I guess that makes sense, as it matches the
'..' of a directory, and the '/' directory '..' points to itself. Although
mounted file systems do not behave that way. My /proc/.. is '/'. I just
figured that the dentry->d_parent would be similar. Learn something every day.
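
The remark that the '..' of '/' points back to '/' can be checked from user
space. What follows is only a minimal sketch, not kernel code: the helper
name root_dotdot_is_root() is made up for illustration, and it compares the
device and inode numbers that stat(2) reports for "/" and "/..".

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns true if "/" and "/.." resolve to the same
 * object, i.e. the same device number and the same inode number.
 */
bool root_dotdot_is_root(void)
{
	struct stat root, up;

	if (stat("/", &root) != 0 || stat("/..", &up) != 0)
		return false;
	return root.st_dev == up.st_dev && root.st_ino == up.st_ino;
}
```

Calling root_dotdot_is_root() from a small main() is expected to return true,
since pathname resolution does not step above the process's root directory.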

> > To explain this better:
> > 
> >   /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events
> > 
> > But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
> > Everything in /sys/kernel/tracing/events is part of eventfs.
> > 
> > That was my thought process. But as both tracefs and eventfs still use
> > tracefs_get_inode(), it would work as you state.
> > 
> > I'll update that, as I don't foresee that eventfs will become its own file
> > system.  
> 
> There is no way to get to underlying mountpoint by dentry - simply because
> the same fs instance can be mounted in any number of places.

OK, so the dentry is still separate from the path and tied closer to the
inode.

> 
> A crude overview of taxonomy:
> 
> file_system_type: what filesystem instances belong to.  Not quite the same
> thing as fs driver (one driver can provide several of those).  Usually
> it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).

I don't know the difference between NFS and NFSv4 as I just used whatever
was the latest. But I understand the ext[234] part.

> 
> super_block: individual filesystem instance.  Hosts dentry tree (connected or
> several disconnected parts - think NFSv4 or the state while trying to get
> a dentry by fhandle, etc.).

I don't know how NFSv4 works, I'm only a user of it, I never actually
looked at the code. So that's not the best example, at least for me.

> 
> dentry: object in a filesystem's directory tree(s).  Always belongs to
> specific filesystem instance - that relationship never changes.  Tree
> structure (and names) _within_ _filesystem_ belong on that level.
> ->d_parent is part of that tree structure; never NULL, root of a (sub)tree  
> has it pointing to itself.  Might be negative, might refer to a filesystem object
> (file, directory, symlink, etc.).

This is useful.

> 
> inode: filesystem object (file, directory, etc.).  Always belongs to
> specific filesystem instance.  Non-directory inodes might have any
> number of dentry instances (aliases) referring to it; a directory one - no
> more than one.

This above is very useful knowledge that I did not know. That directory
inodes can only have a single dentry.

>  Filesystem object contents belongs here; multiple hardlinks
> have different dentries and the same inode.

So, can I assume that an inode could only have as many dentries as hard
links? I know directories are only allowed to have a single hard link. Is
that why they can only have a single dentry?

>  Of course, filesystem type in
> question might have no such thing as multiple hardlinks - that's up to
> filesystem.  In general there is no way to find (or enumerate) such links;
> e.g. a local filesystem might have an extra hardlink somewhere we had
> never looked at and there won't be any dentries for such hardlinks and
> no way to get them short of doing the full search of the entire tree.
> The situation with e.g. NFS client is even worse, obviously.
> 
> mount: in a sense, mount to super_block is what dentry is to inode.  It
> provides a view of (sub)tree hosted in given filesystem instance.  The
> same filesystem may have any number of mounts, referring to its subtrees
> (possibly the same subtree for each, possibly all different - up to
> the callers of mount(2)).  They form mount tree(s) - that's where the
> notions related to "this mounted on top of that" belong.  Note that
> they can be moved around - with no telling the filesystem about that
> happening.  Again, there's no such thing as "the mountpoint of given
> filesystem instance" - it might be mounted in any number of places
> at the same time.  Specific mount - sure, no problem, except that it
> can move around.
> 
> namespace: mount tree.  Unlike everything prior, this one is a part of
> process state - same as descriptor table, mappings, etc.

And I'm guessing namespace is for containers. At least that's what I've
been assuming they are for.

> 
> file: opened IO channel.  It does refer to specific mount and specific
> dentry (and thus filesystem instance and an inode on it).  Current
> IO position lives here, so does any per-open(2) state.

And IIUC, this is what maps to a process's fd table. That is, the process's
file descriptor number it passes to the kernel will be mapped to this
"file".

> 
> descriptor table: mapping from numbers to IO channels (opened files).

This is that "process fd table" I mentioned above (I wrote that before
reading this).

> Again, a part of process state.  dup() creates a new entry, with
> reference to the same file as the old one; multiple open() of the

Hmm, wouldn't "dup()" create another "file" that just points to the same
dentry? It wouldn't be the "same file", or did you mean "file" from the
user space point of view?

> same pathname will each yield a separate opened file.  _Some_ state
> belongs here (close-on-exec, mostly).  Note that there's no such
> thing as "the descriptor of this file" - not even "the user-supplied
> number that had been used to get the file we are currently reading
> from", since that number might be referring to something entirely
> different right after we'd resolved it to opened file and that
> happens *without* disrupting the operation.

This last paragraph confused me. What do you mean by "referring to
something entirely different"?

Thanks for this overview. It was very useful, and something I think we
should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
honestly, I think your writeup here is a better overview.

-- Steve


* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 15:05       ` [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership Steven Rostedt
@ 2024-01-04 18:25         ` Al Viro
  2024-01-04 19:10           ` Steven Rostedt
  2024-01-04 19:15           ` Steven Rostedt
  2024-01-04 19:03         ` Matthew Wilcox
  1 sibling, 2 replies; 10+ messages in thread
From: Al Viro @ 2024-01-04 18:25 UTC
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Linus Torvalds, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:

> This is the "tribal knowledge" I'm talking about. I really didn't know how
> the root dentry parent worked. I guess that makes sense, as it matches the
> '..' of a directory, and the '/' directory '..' points to itself. Although
> mounted file systems do not behave that way. My /proc/.. is '/'. I just
> figured that the dentry->d_parent would be similar. Learn something every day.

What would you expect to happen if you have the same filesystem mounted in
several places?  Having separate dentry trees would be a nightmare - you'd
get cache coherency problems from hell.  It's survivable for procfs, but
for something like a normal local filesystem it'd become very painful.
And if we want them to share dentry tree, how do you choose where the ..
would lead from the root dentry?

The way it's done is that linkage between the trees is done separately -
there's a tree of struct mount (well, forest, really - different processes
can easily have separate trees, which is how namespaces are done) and
each node in the mount tree refers to a dentry (sub)tree in some filesystem
instance.  Location is represented by (mount, dentry) pair and handling of
.. is basically (modulo refcounting, locking, error handling, etc.)
	while dentry == subtree_root(mount) && mount != mountpoint_mount(mount)
		// cross into the mountpoint under it
		dentry = mountpoint_dentry(mount)
		mount = mountpoint_mount(mount)
	go_into(mount, dentry->d_parent)
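
The pseudocode above can be turned into a small self-contained model. This
is only a sketch under stated assumptions: the toy_* types and functions are
invented for illustration and stand in for struct dentry, struct mount and
struct path; refcounting, locking and error handling are left out, just as
in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types, not the kernel's own definitions. */
struct toy_dentry {
	struct toy_dentry *d_parent;	/* never NULL; a tree root points to itself */
};

struct toy_mount {
	struct toy_dentry *root;	/* root of the subtree this mount shows */
	struct toy_mount *parent;	/* mount we sit on; the topmost points to itself */
	struct toy_dentry *mountpoint;	/* dentry in the parent mount that we cover */
};

struct toy_path {
	struct toy_mount *mnt;
	struct toy_dentry *dentry;
};

/* Resolve ".." from location p. */
struct toy_path toy_dotdot(struct toy_path p)
{
	assert(p.mnt && p.dentry);
	while (p.dentry == p.mnt->root && p.mnt != p.mnt->parent) {
		/* cross into the mountpoint under this mount */
		p.dentry = p.mnt->mountpoint;
		p.mnt = p.mnt->parent;
	}
	p.dentry = p.dentry->d_parent;
	return p;
}

/*
 * Model a bind mount of <root>/gnu/12 on <root>/mnt/blah (a shortened
 * version of the example path) and check that ".." from the top of the
 * bind mount lands on <root>/mnt in the top mount. Returns 1 if it does.
 */
int toy_bind_mount_demo(void)
{
	struct toy_dentry root = { &root };
	struct toy_dentry mnt = { &root }, blah = { &mnt };
	struct toy_dentry gnu = { &root }, twelve = { &gnu };
	struct toy_mount top = { &root, &top, NULL };
	struct toy_mount bind = { &twelve, &top, &blah };
	struct toy_path here = { &bind, &twelve };	/* cwd is /mnt/blah */
	struct toy_path up = toy_dotdot(here);

	return up.mnt == &top && up.dentry == &mnt;
}
```

In this model twelve.d_parent is still gnu, because d_parent only describes
the tree inside the filesystem. It is the (mount, dentry) pair that makes
".." come out as mnt.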

Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah:
; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah
will do it.  Then e.g. /mnt/blah/include will resolve to the same dentry as
/usr/lib/gcc/x86_64-linux-gnu/12/include, etc.
; chdir /mnt/blah
; ls
32                 crtprec80.o        libgomp.so         libsanitizer.spec
cc1                g++-mapper-server  libgomp.spec       libssp_nonshared.a
cc1plus            include            libitm.a           libstdc++.a
collect2           libasan.a          libitm.so          libstdc++fs.a
crtbegin.o         libasan_preinit.o  libitm.spec        libstdc++.so
crtbeginS.o        libasan.so         liblsan.a          libsupc++.a
crtbeginT.o        libatomic.a        liblsan_preinit.o  libtsan.a
crtend.o           libatomic.so       liblsan.so         libtsan_preinit.o
crtendS.o          libbacktrace.a     liblto_plugin.so   libtsan.so
crtfastmath.o      libcc1.so          libobjc.a          libubsan.a
crtoffloadbegin.o  libgcc.a           libobjc_gc.a       libubsan.so
crtoffloadend.o    libgcc_eh.a        libobjc_gc.so      lto1
crtoffloadtable.o  libgcc_s.so        libobjc.so         lto-wrapper
crtprec32.o        libgcov.a          libquadmath.a      plugin
crtprec64.o        libgomp.a          libquadmath.so     x32

We obviously want .. to resolve to /mnt, though.
; ls ..
; ls /usr/lib/gcc/x86_64-linux-gnu/
12

So the trigger for "cross into underlying mountpoint" has to be "dentry is
the root of subtree mount refers to" - it depends upon the mount we are
in.

> >  Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
> 
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

Not quite.  Single alias for directories is more about cache coherency
fun; we really can't afford multiple aliases for those.  For non-directories
it's possible to have an entirely disconnected dentry referring to that
sucker; if somebody hands you an fhandle with no indication of the parent
directory, you might end up having to do one of those, no matter how many
times you find the same inode later.  Not an issue for tracefs, though.

> > namespace: mount tree.  Unlike everything prior, this one is a part of
> > process state - same as descriptor table, mappings, etc.
> 
> And I'm guessing namespace is for containers. At least that's what I've
> been assuming they are for.

It predates containers by quite a few years, but yes, that's one of the
users.  It is related to virtual machines, in the same sense the set
of memory mappings is - each thread can be thought of as a VM, with
a bunch of components.  Just as mmap() manipulates the virtual address
translation for the threads that share memory space with the caller,
mount() manipulates the pathname resolution for the threads that share
the namespace with the caller.

> > descriptor table: mapping from numbers to IO channels (opened files).
> 
> This is that "process fd table" I mentioned above (I wrote that before
> reading this).
> 
> > Again, a part of process state.  dup() creates a new entry, with
> > reference to the same file as the old one; multiple open() of the
> 
> Hmm, wouldn't "dup()" create another "file" that just points to the same
> dentry? It wouldn't be the "same file", or did you mean "file" from the
> user space point of view?

No.  The difference between open() and dup() is that the latter will
result in a descriptor that really refers to the same file.  Current
IO position belongs to IO channel; it doesn't matter for e.g. terminals,
but for regular file it immediately becomes an issue.
	fd1 = open("foo", 0);
	fd2 = open("foo", 0);
	read(fd1, &c1, 1);
	read(fd2, &c2, 1);
will result in the first byte of foo read into c1 and c2, but
	fd1 = open("foo", 0);
	fd2 = dup(fd1);
	read(fd1, &c1, 1);
	read(fd2, &c2, 1);
will have the first byte of foo in c1 and the second one - in c2.
open() yields a new IO channel attached to new descriptor; dup()
(and dup2()) attaches the existing IO channel to new descriptor.
fork() acts like dup() in that respect - child gets its descriptor
table populated with references to the same IO channels as the
parent does.
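
The two fragments above can be combined into one runnable check. This is a
minimal sketch, assuming a writable /tmp; the helper name second_read() and
the temporary file name are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L	/* for mkstemp() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes "ab" to a temporary file, then reads one byte through each of two
 * descriptors. If use_dup is nonzero the second descriptor comes from
 * dup(), otherwise from a second open(). Returns the byte read through the
 * second descriptor, or -1 on error.
 */
int second_read(int use_dup)
{
	char path[] = "/tmp/dupdemoXXXXXX";
	char c1, c2;
	int tmp, fd1, fd2, ret = -1;

	tmp = mkstemp(path);
	if (tmp < 0)
		return -1;
	if (write(tmp, "ab", 2) != 2)
		goto out;

	fd1 = open(path, O_RDONLY);
	if (fd1 < 0)
		goto out;
	/* open() makes a new IO channel; dup() reuses the one behind fd1 */
	fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);
	if (fd2 >= 0) {
		if (read(fd1, &c1, 1) == 1 && read(fd2, &c2, 1) == 1)
			ret = c2;
		close(fd2);
	}
	close(fd1);
out:
	close(tmp);
	unlink(path);
	return ret;
}
```

With two open() calls the second read starts at offset 0 and sees 'a'; with
dup() both descriptors share one IO position, so the second read sees 'b'.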

Any Unix since about '71 has it done that way and the same goes
for NT, DOS, etc. - you can't implement redirects to/from regular
files without that distinction.

Unfortunately, the terms are clumsy as hell - POSIX ends up with
"file descriptor" (for numbers) vs. "file description" (for IO
channels), which is hard to distinguish when reading and just
as hard to distinguish when listening.  "Opened file" (as IO
channel) vs. "file on disc" (as collection of data that might
be accessed via said channels) distinction on top of that also
doesn't help, to put it mildly.  It's many decades too late to
do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

The bottom line:
	* struct file represents an IO channel; it might be operating
on various objects, including regular files, pipes, sockets, etc.
	* current IO position is a property of IO channel.
	* struct files_struct represents a descriptor table; each of
those maps numbers to IO channels.
	* each thread uses a descriptor table to turn numbers ("file
descriptors") into struct file references.  Different threads might
share the same descriptor table or have separate descriptor tables.
current->files points to the descriptor table of the current thread.
	* open() creates a new IO channel and attaches it to an
unused position in descriptor table.
	* dup(n) takes the IO channel from position 'n' in descriptor
table and attaches it to an unused position.
	* dup2(old, new) takes the IO channel from position 'old' and
attaches it to position 'new'; if there used to be something in position
'new', it gets detached.
	* close(n) takes the IO channel from position 'n', flushes and
detaches it.  Note that the IO channel itself is *NOT* closed until
all references to it are gone.  E.g. open() + fork() + (in parent) close()
will end up with the child's descriptor table keeping a reference to
IO channel established by open(); close() in parent will not shut the
channel down.  The same goes for implicit close() done by dup2() or
by exit(), etc.
	* things like mmap() retain struct file references;
open() + mmap() + close() ends up with struct file left (in vma->vm_file)
alive and well for as long as the mapping exists, never mind the reference
that used to be in descriptor table.  In other words, IO channels can
exist with no references in any descriptor tables.  There are other
ways for such situation to occur (e.g. SCM_RIGHTS stuff); it's entirely
normal.
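
The open() + mmap() + close() case from the list above can be observed from
user space. This is a minimal sketch, assuming a writable /tmp that supports
shared file mappings; the helper name mapped_byte_after_close() is made up
for illustration.

```c
#define _POSIX_C_SOURCE 200809L	/* for mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * open() + mmap() + close(): returns the byte read through the mapping
 * after the descriptor is gone, or -1 on error.
 */
int mapped_byte_after_close(void)
{
	char path[] = "/tmp/mmapdemoXXXXXX";
	char *p;
	int fd, ret = -1;

	fd = mkstemp(path);		/* creates and opens a temporary file */
	if (fd < 0)
		return -1;
	if (write(fd, "x", 1) == 1) {
		p = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
		close(fd);		/* the descriptor table entry is gone now */
		if (p != MAP_FAILED) {
			ret = p[0];	/* the mapping still reaches the file */
			munmap(p, 1);
		}
	} else {
		close(fd);
	}
	unlink(path);
	return ret;
}
```

The read through p succeeds even though no descriptor refers to the file any
more, which is the user-visible side of the mapping keeping its own
reference to the opened file.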

> > same pathname will each yield a separate opened file.  _Some_ state
> > belongs here (close-on-exec, mostly).  Note that there's no such
> > thing as "the descriptor of this file" - not even "the user-supplied
> > number that had been used to get the file we are currently reading
> > from", since that number might be referring to something entirely
> > different right after we'd resolved it to opened file and that
> > happens *without* disrupting the operation.
> 
> This last paragraph confused me. What do you mean by "referring to
> something entirely different"?

	Two threads share descriptor table; one of them is in
read(fd, ...), another does dup2(fd2, fd).  If read() gets past the
point where it gets struct file reference, it will keep accessing that
IO channel.  dup2() will replace the reference in descriptor table,
but that won't disrupt the read()...
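
A single-threaded model of that sequence may make the ordering easier to
follow. It is only a sketch under stated assumptions: the toy_* names are
invented for illustration, they stand in for struct file and the descriptor
table, and the locking a real implementation needs is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_FDS 8

/* Hypothetical stand-ins for struct file and struct files_struct. */
struct toy_file {
	int refcount;
	const char *name;
};

struct toy_fdtable {
	struct toy_file *slot[TOY_NR_FDS];
};

/* Look up a descriptor and take a reference, as a syscall would on entry. */
struct toy_file *toy_fget(struct toy_fdtable *t, int fd)
{
	struct toy_file *f = t->slot[fd];

	if (f)
		f->refcount++;
	return f;
}

void toy_fput(struct toy_file *f)
{
	assert(f->refcount > 0);
	f->refcount--;	/* a real implementation shuts the channel down at zero */
}

/* Point newfd at whatever oldfd refers to, detaching the previous occupant. */
void toy_dup2(struct toy_fdtable *t, int oldfd, int newfd)
{
	struct toy_file *incoming = t->slot[oldfd];
	struct toy_file *outgoing = t->slot[newfd];

	incoming->refcount++;
	t->slot[newfd] = incoming;
	if (outgoing)
		toy_fput(outgoing);
}

/* Returns 1 if a "read" that already resolved fd 4 keeps its file across dup2(). */
int toy_read_survives_dup2(void)
{
	struct toy_file foo = { 1, "/tmp/foo" }, bar = { 1, "/tmp/bar" };
	struct toy_fdtable t = { { NULL } };
	struct toy_file *in_flight;
	int ok;

	t.slot[4] = &foo;
	t.slot[5] = &bar;

	in_flight = toy_fget(&t, 4);	/* thread A: read(4, ...) resolves the number */
	toy_dup2(&t, 5, 4);		/* thread B: dup2(5, 4) */

	/* A still holds foo, while slot 4 now refers to bar */
	ok = in_flight == &foo && foo.refcount == 1 &&
	     t.slot[4] == &bar && bar.refcount == 2;
	toy_fput(in_flight);		/* thread A: read() returns */
	return ok && foo.refcount == 0;
}
```

The reference taken by toy_fget() is what keeps foo usable after the table
slot has been replaced; the last toy_fput() is the point where the channel
could finally be shut down.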

> 
> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

At the very least it would need serious reordering ;-/


* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 18:25         ` Al Viro
@ 2024-01-04 19:10           ` Steven Rostedt
  2024-01-04 19:21             ` Linus Torvalds
  2024-01-04 19:15           ` Steven Rostedt
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2024-01-04 19:10 UTC
  To: Al Viro
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Linus Torvalds, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 18:25:02 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:
> 
> > This is the "tribal knowledge" I'm talking about. I really didn't know how
> > the root dentry parent worked. I guess that makes sense, as it matches the
> > '..' of a directory, and the '/' directory '..' points to itself. Although
> > mounted file systems do not behave that way. My /proc/.. is '/'. I just
> > figured that the dentry->d_parent would be similar. Learn something every day.
> 
> What would you expect to happen if you have the same filesystem mounted in
> several places?  Having separate dentry trees would be a nightmare - you'd
> get cache coherency problems from hell.  It's survivable for procfs, but
> for something like a normal local filesystem it'd become very painful.
> And if we want them to share dentry tree, how do you choose where the ..
> would lead from the root dentry?

My mistake was thinking that the dentry was attached more to the path than
the inode. But that doesn't seem to be the case. I wasn't sure if there was
a way to get to a dentry from the inode. I see i_dentry, and the fact that it
is a list is where I got some of my idea that a dentry was closer to a path
than to an inode.

> 
> The way it's done is that linkage between the trees is done separately -
> there's a tree of struct mount (well, forest, really - different processes
> can easily have separate trees, which is how namespaces are done) and
> each node in the mount tree refers to a dentry (sub)tree in some filesystem
> instance.  Location is represented by (mount, dentry) pair and handling of
> .. is basically (modulo refcounting, locking, error handling, etc.)
> 	while dentry == subtree_root(mount) && mount != mountpoint_mount(mount)
> 		// cross into the mountpoint under it
> 		dentry = mountpoint_dentry(mount)
> 		mount = mountpoint_mount(mount)
> 	go_into(mount, dentry->d_parent)
> 
> Note that you can have e.g. /usr/lib/gcc/x86_64-linux-gnu/12 mounted on /mnt/blah:
> ; mount --bind /usr/lib/gcc/x86_64-linux-gnu/12 /mnt/blah
> will do it.  Then e.g. /mnt/blah/include will resolve to the same dentry as
> /usr/lib/gcc/x86_64-linux-gnu/12/include, etc.
> ; chdir /mnt/blah
> ; ls
> 32                 crtprec80.o        libgomp.so         libsanitizer.spec
> cc1                g++-mapper-server  libgomp.spec       libssp_nonshared.a
> cc1plus            include            libitm.a           libstdc++.a
> collect2           libasan.a          libitm.so          libstdc++fs.a
> crtbegin.o         libasan_preinit.o  libitm.spec        libstdc++.so
> crtbeginS.o        libasan.so         liblsan.a          libsupc++.a
> crtbeginT.o        libatomic.a        liblsan_preinit.o  libtsan.a
> crtend.o           libatomic.so       liblsan.so         libtsan_preinit.o
> crtendS.o          libbacktrace.a     liblto_plugin.so   libtsan.so
> crtfastmath.o      libcc1.so          libobjc.a          libubsan.a
> crtoffloadbegin.o  libgcc.a           libobjc_gc.a       libubsan.so
> crtoffloadend.o    libgcc_eh.a        libobjc_gc.so      lto1
> crtoffloadtable.o  libgcc_s.so        libobjc.so         lto-wrapper
> crtprec32.o        libgcov.a          libquadmath.a      plugin
> crtprec64.o        libgomp.a          libquadmath.so     x32
> 
> We obviously want .. to resolve to /mnt, though.
> ; ls ..
> ; ls /usr/lib/gcc/x86_64-linux-gnu/
> 12
> 
> So the trigger for "cross into underlying mountpoint" has to be "dentry is
> the root of subtree mount refers to" - it depends upon the mount we are
> in.
> 
> > >  Filesystem object contents belongs here; multiple hardlinks
> > > have different dentries and the same inode.  
> > 
> > So, can I assume that an inode could only have as many dentries as hard
> > links? I know directories are only allowed to have a single hard link. Is
> > that why they can only have a single dentry?  
> 
> Not quite.  Single alias for directories is more about cache coherency
> fun; we really can't afford multiple aliases for those.  For non-directories
> it's possible to have an entirely disconnected dentry referring to that
> sucker; if somebody hands you an fhandle with no indication of the parent
> directory, you might end up having to do one of those, no matter how many
> times you find the same inode later.  Not an issue for tracefs, though.
> 
> > > namespace: mount tree.  Unlike everything prior, this one is a part of
> > > process state - same as descriptor table, mappings, etc.  
> > 
> > And I'm guessing namespace is for containers. At least that's what I've
> > been assuming they are for.  
> 
> It predates containers by quite a few years, but yes, that's one of the
> users.  It is related to virtual machines, in the same sense the set
> of memory mappings is - each thread can be thought of as a VM, with
> a bunch of components.  Just as mmap() manipulates the virtual address
> translation for the threads that share memory space with the caller,
> mount() manipulates the pathname resolution for the threads that share
> the namespace with the caller.
> 
> > > descriptor table: mapping from numbers to IO channels (opened files).  
> > 
> > This is that "process fd table" I mentioned above (I wrote that before
> > reading this).
> >   
> > > Again, a part of process state.  dup() creates a new entry, with
> > > reference to the same file as the old one; multiple open() of the  
> > 
> > Hmm, wouldn't "dup()" create another "file" that just points to the same
> > dentry? It wouldn't be the "same file", or did you mean "file" from the
> > user space point of view?  
> 
> No.  The difference between open() and dup() is that the latter will
> result in a descriptor that really refers to the same file.  Current
> IO position belongs to IO channel; it doesn't matter for e.g. terminals,
> but for regular file it immediately becomes an issue.
> 	fd1 = open("foo", 0);
> 	fd2 = open("foo", 0);
> 	read(fd1, &c1, 1);
> 	read(fd2, &c2, 1);
> will result in the first byte of foo read into c1 and c2, but
> 	fd1 = open("foo", 0);
> 	fd2 = dup(fd1);
> 	read(fd1, &c1, 1);
> 	read(fd2, &c2, 1);
> will have the first byte of foo in c1 and the second one - in c2.
> open() yields a new IO channel attached to new descriptor; dup()
> (and dup2()) attaches the existing IO channel to new descriptor.
> fork() acts like dup() in that respect - child gets its descriptor
> table populated with references to the same IO channels as the
> parent does.

Ah, looking at the code I use dup() in, it's mostly for pipes and for
redirecting stdout, stdin, etc. So yeah, that makes sense.

> 
> Any Unix since about '71 has it done that way and the same goes
> for NT, DOS, etc. - you can't implement redirects to/from regular
> files without that distinction.

Yep, which is what I used it for. Just forgot the details.

> 
> > > same pathname will each yield a separate opened file.  _Some_ state
> > > belongs here (close-on-exec, mostly).  Note that there's no such
> > > thing as "the descriptor of this file" - not even "the user-supplied
> > > number that had been used to get the file we are currently reading
> > > from", since that number might be referring to something entirely
> > > different right after we'd resolved it to opened file and that
> > > happens *without* disrupting the operation.  
> > 
> > This last paragraph confused me. What do you mean by "referring to
> > something entirely different"?  
> 
> 	Two threads share descriptor table; one of them is in
> read(fd, ...), another does dup2(fd2, fd).  If read() gets past the
> point where it gets struct file reference, it will keep accessing that
> IO channel.  dup2() will replace the reference in descriptor table,
> but that won't disrupt the read()...

Oh, OK. So basically if fd 4 is a reference to /tmp/foo and you open
/tmp/bar which gets fd2, and one thread is reading fd 4 (/tmp/foo), the
other thread doing dup2(fd2, fd) will make fd 4 a reference to /tmp/bar but
the read will finish reading /tmp/foo.

But if the first thread were to do another read(fd, ...) it would then read
/tmp/bar. In other words, it allows read() to stay atomic with respect to
what it is reading until it returns.

> 
> > 
> > Thanks for this overview. It was very useful, and something I think we
> > should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> > honestly, I think your writeup here is a better overview.  
> 
> At the very least it would need serious reordering ;-/

Yeah, but this is all great information. Thanks for explaining it.

-- Steve


* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:10           ` Steven Rostedt
@ 2024-01-04 19:21             ` Linus Torvalds
  0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-01-04 19:21 UTC
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> My mistake was thinking that the dentry was attached more to the path than
> the inode. But that doesn't seem to be the case. I wasn't sure if there was
> a way to get to a dentry from the inode.

Yeah, so dentry->inode and path->dentry are one-way translations,
because the other way can have multiple different cases.

IOW, a path will specify *one* dentry, and a dentry will specify *one*
inode, but one inode can be associated with multiple dentries, and
there may be other undiscovered dentries that *would* point to it but
aren't even cached right now.

And a single dentry can be part of multiple paths, thanks to bind mounts.

The "inode->i_dentry" list is *not* a way to look up all dentries,
because - as mentioned - there may be potential other paths (and thus
other dentries) that lead to the same inode that just haven't been
looked up yet (or that have already been aged out of the cache).
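
The hard link case can be seen from user space: two names, one inode, and
nothing in the inode that records the names. This is a minimal sketch,
assuming a writable /tmp on a filesystem that supports hard links; the
helper name hardlink_shares_inode() and the file names are made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L	/* for mkstemp() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates a temporary file plus one hard link to it. Returns 1 if both
 * names report the same device and inode with a link count of 2, 0 if
 * they do not, and -1 if the files could not be set up.
 */
int hardlink_shares_inode(void)
{
	char a[] = "/tmp/linkdemoXXXXXX";
	char b[sizeof(a) + 4];		/* room for the ".lnk" suffix */
	struct stat sa, sb;
	int fd, ok = -1;

	fd = mkstemp(a);
	if (fd < 0)
		return -1;
	close(fd);
	snprintf(b, sizeof(b), "%s.lnk", a);

	if (link(a, b) == 0) {
		ok = 0;
		if (stat(a, &sa) == 0 && stat(b, &sb) == 0)
			ok = sa.st_dev == sb.st_dev &&
			     sa.st_ino == sb.st_ino &&
			     sa.st_nlink == 2;
		unlink(b);
	}
	unlink(a);
	return ok;
}
```

st_nlink says how many names exist, but stat() offers no way to ask what
they are. That matches the point above that the other names can only be
found by looking them up.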

Of course any *particular* filesystem may not have hard links (so one
inode has only one possible dentry), and you may not have bind mounts,
and it might be one of the virtual filesystems where everything is
always in memory, so none of the above problems are guaranteed to be
the case in any *particular* situation.

But it's all part of why the dcache is actually really subtle. It's
not just the RCU lookup rules and the specialized locking (both
lockref and the rather complicated rules about d_lock ordering), it's
also that whole "yeah, the filesystem only sees a 'dentry', but
because of bind mounts the vfs layer actually does things internally
in terms of 'struct path' in order to be able to then show that single
filesystem in multiple places".

Etc etc.

There's a reason Al Viro ends up owning the dcache. Nobody else can
wrap their tiny little minds around it all.

               Linus


* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 18:25         ` Al Viro
  2024-01-04 19:10           ` Steven Rostedt
@ 2024-01-04 19:15           ` Steven Rostedt
  2024-01-04 19:26             ` Matthew Wilcox
  2024-01-04 19:35             ` Linus Torvalds
  1 sibling, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2024-01-04 19:15 UTC
  To: Al Viro
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Linus Torvalds, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 18:25:02 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> Unfortunately, the terms are clumsy as hell - POSIX ends up with
> "file descriptor" (for numbers) vs. "file description" (for IO
> channels), which is hard to distinguish when reading and just
> as hard to distinguish when listening.  "Opened file" (as IO
> channel) vs. "file on disc" (as collection of data that might
> be accessed via said channels) distinction on top of that also
> doesn't help, to put it mildly.  It's many decades too late to
> do anything about, unfortunately.  Pity the UNIX 101 students... ;-/

Just so I understand this correctly.

"file descriptor" - is just what maps to a specific inode.

"file description" - is how the file is accessed (position in the file and
			flags associated to how it was opened)

Did I get that correct?

-- Steve


* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:15           ` Steven Rostedt
@ 2024-01-04 19:26             ` Matthew Wilcox
  2024-01-04 19:35             ` Linus Torvalds
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2024-01-04 19:26 UTC
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Linus Torvalds, Christian Brauner,
	linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 02:15:17PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 18:25:02 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > Unfortunately, the terms are clumsy as hell - POSIX ends up with
> > "file descriptor" (for numbers) vs. "file description" (for IO
> > channels), which is hard to distinguish when reading and just
> > as hard to distinguish when listening.  "Opened file" (as IO
> > channel) vs. "file on disc" (as collection of data that might
> > be accessed via said channels) distinction on top of that also
> > doesn't help, to put it mildly.  It's many decades too late to
> > do anything about, unfortunately.  Pity the UNIX 101 students... ;-/
> 
> Just so I understand this correctly.
> 
> "file descriptor" - is just what maps to a specific inode.

No -- file descriptor is a number in fdtable that maps to a struct file.

> "file description" - is how the file is accessed (position in the file and
> 			flags associated to how it was opened)

file description is posix's awful name for struct file.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:15           ` Steven Rostedt
  2024-01-04 19:26             ` Matthew Wilcox
@ 2024-01-04 19:35             ` Linus Torvalds
  2024-01-04 20:02               ` Linus Torvalds
  2024-01-04 21:28               ` Al Viro
  1 sibling, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-01-04 19:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:14, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> "file descriptor" - is just what maps to a specific inode.

Nope. Technically and traditionally, file descriptor is just the
integer index that is used to look up a 'struct file *'.

Except in the kernel, we really just tend to use that term (well, I
do) for the 'struct file *' itself, since the integer 'fd' is usually
not really relevant except at the system call interface.

Which is *NOT* the inode, because the 'struct file' has other things
in it (the file position, the permissions that were used at open time
etc, close-on-exec state etc etc).

> "file description" - is how the file is accessed (position in the file and
>                         flags associated to how it was opened)

That's a horrible term that shouldn't be used at all. Apparently some
people use it for what is our 'struct file *', also known as a "file
table entry".  Avoid it.

If anything, just use "fd" for the integer representation, and "file"
for the pointer to a 'struct file'.

But most of the time the two are conceptually interchangeable, in that
an 'fd' just translates directly to a 'struct file *'.

Note that while there's that conceptual direct translation, there's
also very much a "time of use" issue, in that a "fd -> file"
translation happens at one particular time and in one particular user
context, and then it's *done* (so closing and possibly re-using the fd
after it's been looked up does not actually affect an existing 'struct
file *').

And while 'fd -> file' lookup is quick and common, the other way
doesn't exist, because multiple 'fd's can map to one 'struct file *'
thanks to dup() (and 'fork()', since a 'fd -> file' translation always
happens within the context of a particular user space, an 'fd' in one
process is obviously not the same as an 'fd' in another one).

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:35             ` Linus Torvalds
@ 2024-01-04 20:02               ` Linus Torvalds
  2024-01-04 21:28               ` Al Viro
  1 sibling, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2024-01-04 20:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, 4 Jan 2024 at 11:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Which is *NOT* the inode, because the 'struct file' has other things
> in it (the file position, the permissions that were used at open time
> etc, close-on-exec state etc etc).

That close-on-exec thing was a particularly bad example of things that
are in the 'struct file', because it's in fact the only thing that
*isn't* in 'struct file' and is associated directly with the 'int fd'.

But hopefully the intent was clear despite me picking a particularly
bad example.

            Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 19:35             ` Linus Torvalds
  2024-01-04 20:02               ` Linus Torvalds
@ 2024-01-04 21:28               ` Al Viro
  1 sibling, 0 replies; 10+ messages in thread
From: Al Viro @ 2024-01-04 21:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Christian Brauner, linux-fsdevel,
	Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 11:35:37AM -0800, Linus Torvalds wrote:

> > "file description" - is how the file is accessed (position in the file and
> >                         flags associated to how it was opened)
> 
> That's a horrible term that shouldn't be used at all. Apparently some
> people use it for what is our 'struct file *', also known as a "file
> table entry".  Avoid it.

Worse, really.  As far as I can reconstruct what happened it was something
along the lines of "colloquial expression is 'opened file', but that is
confusing - sounds like a property+noun, so it might be misparsed as
a member of subset of files satisfying the property of 'being opened';
can't have that in a standard, let's come up with something else".
Except that what they did come up with had been much worse, for obvious
linguistic reasons.

The *ONLY* uses for that expression I can think of are
	1.  When reading POSIX texts, watch out for that one - if you
see them talking about a file descriptor in context where it really
should be about an opened file, check the wording.  If it really says
"file descriptOR", it's probably a bug in standard or a codified
bullshit practice.  If it says "file descriptION" instead, replace with
"opened file" and move on.
	2.  An outstanding example of the taste of that bunch.

IO channel would be a saner variant, but it's far too late for that.

The 3-way distinction between descriptor/opened file/file as collection of data
needs to be explained in UNIX 101; it is userland-visible and it has to be
understood.  Unfortunately, it's often done in a way that leaves students
seriously confused ;-/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
  2024-01-04 15:05       ` [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership Steven Rostedt
  2024-01-04 18:25         ` Al Viro
@ 2024-01-04 19:03         ` Matthew Wilcox
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2024-01-04 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, LKML, Linux Trace Kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Linus Torvalds, Christian Brauner,
	linux-fsdevel, Greg Kroah-Hartman, Jonathan Corbet, linux-doc

On Thu, Jan 04, 2024 at 10:05:44AM -0500, Steven Rostedt wrote:
> > file_system_type: what filesystem instances belong to.  Not quite the same
> > thing as fs driver (one driver can provide several of those).  Usually
> > it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).
> 
> I don't know the difference between NFS and NFSv4 as I just used whatever
> was the latest. But I understand the ext[234] part.

What Al's saying is that nfs.ko provides both nfs_fs_type and
nfs4_fs_type.  ext4.ko provides ext2_fs_type, ext3_fs_type and
ext4_fs_type.  This is allowed but anomalous.  Most filesystems provide
only one, eg ocfs2_fs_type.

> > 
> > super_block: individual filesystem instance.  Hosts dentry tree (connected or
> > several disconnected parts - think NFSv4 or the state while trying to get
> > a dentry by fhandle, etc.).
> 
> I don't know how NFSv4 works, I'm only a user of it, I never actually
> looked at the code. So that's not the best example, at least for me.

Right, so NFS (v4 or otherwise) is Special.  In the protocol, files
are identified by a thing called an fhandle.  This is (iirc) a 32-byte
identifier which must persist across server reboot.  Originally it was
probably supposed to encode dev_t plus ino_t plus generation number.
But you can do all kinds of things in the NFS protocol with an fhandle
that you need a dentry for in Linux (like path walks).  Unfortunately,
clients can't be told "Hey, we've lost context, please rewalk" (which
would have other problems anyway), so we need a way to find the dentry
for an fhandle.  I understand this very badly, but essentially we end
up looking for canonical ones, and then creating isolated trees of
dentries if we can't find them.  Sometimes we then graft these isolated
trees into the canonical spots if we end up connecting them through
various filesystem activity.

At least that's my understanding which probably contains several
misunderstandings.

> >  Filesystem object contents belongs here; multiple hardlinks
> > have different dentries and the same inode.
> 
> So, can I assume that an inode could only have as many dentries as hard
> links? I know directories are only allowed to have a single hard link. Is
> that why they can only have a single dentry?

There could be more.  For example, I could open("A"); ln("A", "B");
open("B"); rm("A"); ln("B", "C"); open("C"); rm("B").

Now there are three dentries for this inode, its link count is currently
one and never exceeded two.

> Thanks for this overview. It was very useful, and something I think we
> should add to kernel doc. I did read Documentation/filesystems/vfs.rst but
> honestly, I think your writeup here is a better overview.

Documentation/filesystems/locking.rst is often a better source, although
the two should really be merged.  Not for the faint-hearted.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-01-04 21:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240103203246.115732ec@gandalf.local.home>
     [not found] ` <20240104014837.GO1674809@ZenIV>
     [not found]   ` <20240103212506.41432d12@gandalf.local.home>
     [not found]     ` <20240104043945.GQ1674809@ZenIV>
2024-01-04 15:05       ` [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership Steven Rostedt
2024-01-04 18:25         ` Al Viro
2024-01-04 19:10           ` Steven Rostedt
2024-01-04 19:21             ` Linus Torvalds
2024-01-04 19:15           ` Steven Rostedt
2024-01-04 19:26             ` Matthew Wilcox
2024-01-04 19:35             ` Linus Torvalds
2024-01-04 20:02               ` Linus Torvalds
2024-01-04 21:28               ` Al Viro
2024-01-04 19:03         ` Matthew Wilcox
