Re: cache fixes (redux)

Linux 9p file system development
 help / color / mirror / Atom feed

From: asmadeus@codewreck.org
To: Christian Schoenebeck <linux_oss@crudebyte.com>
Cc: Eric Van Hensbergen <ericvh@kernel.org>,
	v9fs@lists.linux.dev, ericvh@gmail.com
Subject: Re: cache fixes (redux)
Date: Fri, 29 Dec 2023 21:50:39 +0900	[thread overview]
Message-ID: <ZY7An6VBb0GCEMFu@codewreck.org> (raw)
In-Reply-To: <2850709.f7ZBceQk0b@silver>

Christian Schoenebeck wrote on Fri, Dec 29, 2023 at 01:22:45PM +0100:
> Which is the expected, correct behaviour, as you can still distinguish the two
> by their device IDs:

(Yes, just meant to say one can produce collisions easily if they want)

> BTW even if qid.path length is increased as part of a 9p protocol change, it
> would still be tricky for client implementation side, as that number is too
> big as simply being exposed as inode number by client. IIRC virtiofsd is
> handling this by automatically creating separate devices when needed. I
> haven't checked how exactly, whether there is e.g. some easy way to create
> "subdevices" with Linux.

NFS also automatically creates a new mountpoint at junctions, so it's
definitely possible (fs/nfs/namespace.c):
/*      
 * nfs_d_automount - Handle crossing a mountpoint on the server
 * @path - The mountpoint
 *      
 * When we encounter a mountpoint on the server, we want to set up
 * a mountpoint on the client too, to prevent inode numbers from
 * colliding, and to allow "df" to work properly.
 * On NFSv4, we also want to allow for the fact that different
 * filesystems may be migrated to different servers in a failover
 * situation, and that different filesystems may want to use
 * different security flavours.
 */

It's not strictly needed though; in practice the vfs can handle
duplicate i_ino as long as appropriate functions are used for iget
(iget5_locked in fs/9p/vfs_inode.c); although now I'm thinking about it
that won't work for programs like tar that'll notice the inode number is
identical and consider the second file to be a hard link of the first...

I guess we could swipe it under the rug by making the inode number a
hash of everything in qid.path, and pretend collisions never happen™?

> > (iiuc from a networked filesystem's point of view (e.g. nfs), we ought
> > to check i_ino + i_generation as inode number can be reused after the
> > previous owner of the number was deleted, but we're guaranteed the
> > couple change... At any given point though i_ino is unique so we've been
> > relying on just that for two reasons: no room to send i_generation, and
> > the field isn't easily obtainable from userspace (need to get in through
> > bytes in export_to_handle_at which is file system dependant), so we only
> > rely on inode number -- if the field gets bigger it should be variable
> > length and we should write the whole mnt_id + export_to_handle_at
> > content for a stable, truly unique handle. That's what userspace NFS
> > servers do.)
> 
> Why is that? The inode number is a serial number which is consecutively
> incremented:

It's serial for tmpfs, but that's an implementation detail.
ext4 and xfs will both allocate through some larger stride.

... Actually didn't think it'd be this easy to reproduce, but ext4
reliably recyles the same inodes after unlink:
h
# touch a b c
# ls -li a b c
357 -rw-r--r--. 1 root root 0 Dec 29 21:39 a
365 -rw-r--r--. 1 root root 0 Dec 29 21:39 b
366 -rw-r--r--. 1 root root 0 Dec 29 21:39 c
# rm -f a b c
# touch new
# ls -li new
357 -rw-r--r--. 1 root root 0 Dec 29 21:40 new
# rm -f new
# touch again
# ls -li again
357 -rw-r--r--. 1 root root 0 Dec 29 21:40 again

(works even if I actually write something into the files and sync them)

> So I would not expect an inode number to be reused before the consecutive
> counter wraps at 2^64. And then, this discussion is about files still being
> open. So host's file system can't recycle the inode number unless 9p client
> closes the corresponding old FID.

If the file's still open, it's not removed from the fs -- we're
guaranted the inode won't come back yet.
For a networked file system though I was thinking of cache more than
open files, e.g. in the previous example a 9p client might think 'new'
and 'again' are the same file.
I guess it doesn't really matter since we have no cache invalidation /
refresh logic...

> > In cacheless modes we should drop dentries as soon as no file/process
> > use it, ad drop inodes as soon as there's no dentries left.
> > In cached mode we should have an extra ref to dentries (droppable on
> > memory pressure somehow -- we're not doing that correctly right now
> > afaik), that'll keep the inode alive, so we should try very hard to
> > always reuse the same inode/dentres to avoid needless lookups.
> 
> Yes, the cache currently just grows indefinitely. The question is what kind of
> metric shall be used to decide when to start dropping old things from the
> cache.

My understanding is that the vfs already can keep track of this for us.
Look at e.g. prune_icache_sb in fs/inode.c
It's called when there's memory pressure, and it'll call inode op's
evict_inode when actually freeing thing so we could close any still
remembered fid there, and I think it'd work out in general...
Except that we keep an explicit ref so it doesn't have anything to free
iirc.

Well, it's more stuff that'd need experimenting with...

-- 
Dominique Martinet | Asmadeus

next prev parent reply	other threads:[~2023-12-29 12:51 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-26 16:13 cache fixes (redux) Eric Van Hensbergen
2023-12-27  8:10 ` Dominique Martinet
2023-12-27 18:07   ` Eric Van Hensbergen
     [not found]   ` <CAFkjPT=8CZHATaraBrqAZGDKjQLOp=U1gdgteJ5jpXRGJyBojQ@mail.gmail.com>
2023-12-28  0:32     ` Dominique Martinet
2023-12-28 10:44       ` Christian Schoenebeck
2023-12-28 15:07         ` Eric Van Hensbergen
2023-12-28 21:48           ` asmadeus
2023-12-29 12:22             ` Christian Schoenebeck
2023-12-29 12:50               ` asmadeus [this message]
2023-12-29 16:16                 ` Eric Van Hensbergen
2023-12-29 16:06               ` Eric Van Hensbergen
2023-12-29 15:50             ` Eric Van Hensbergen
2024-01-03 18:09               ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
2024-01-03 20:24               ` cache fixes (redux) asmadeus
     [not found]                 ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
2024-01-03 22:42                   ` Fwd: " Eric Van Hensbergen
2024-01-03 22:48                 ` Eric Van Hensbergen
2024-01-04 12:37                   ` Christian Schoenebeck
2024-01-04 15:59                     ` Eric Van Hensbergen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZY7An6VBb0GCEMFu@codewreck.org \
    --to=asmadeus@codewreck.org \
    --cc=ericvh@gmail.com \
    --cc=ericvh@kernel.org \
    --cc=linux_oss@crudebyte.com \
    --cc=v9fs@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox