Linux 9p file system development
 help / color / mirror / Atom feed
From: asmadeus@codewreck.org
To: Eric Van Hensbergen <ericvh@kernel.org>
Cc: Christian Schoenebeck <linux_oss@crudebyte.com>,
	v9fs@lists.linux.dev, ericvh@gmail.com
Subject: Re: cache fixes (redux)
Date: Thu, 4 Jan 2024 05:24:41 +0900	[thread overview]
Message-ID: <ZZXCiSlrAMI5sMr3@codewreck.org> (raw)
In-Reply-To: <ZZWi6fbgy8oGwR1D@FV7GG9FTHL> <ZY7qr1Dy_aS716dC@FV7GG9FTHL>

Eric Van Hensbergen wrote on Fri, Dec 29, 2023 at 09:50:07AM -0600:
> > So to recap there should only be one inode, then as many dentries as
> > there are paths to it, and as many files as there are "open handles" to
> > it
> > 
> > In cacheless modes we should drop dentries as soon as no file/process
> > use it, ad drop inodes as soon as there's no dentries left.
> > In cached mode we should have an extra ref to dentries (droppable on
> > memory pressure somehow -- we're not doing that correctly right now
> > afaik), that'll keep the inode alive, so we should try very hard to
> > always reuse the same inode/dentres to avoid needless lookups.
> >
> 
> I'm not convinced of the second part, I think we don't want to hold the
> extra ref to dentries because that causes fid explosion.

(Note: no longer relevant if we get rid of fid lookup by dentry)

Might have the wrong word -- we don't want to prevent the vfs from
cleaning up if there's cache pressure, but we need to differentiate
cacheless (drop the inode as soon as it's no longer used) and cache
(keep it around while it's here)

It definitely causes total fids open to rise (which can be a probem
depending on the server), I guess we could count how many there are and
have a watermark to start reclaiming if too many are around as an
option, but I've sort of given up on that at this point...

> There's another thing which has bugged me looking through the code -
> there's lots of code (lookup, create, open, link, etc.) which passes
> inode *parent which we only use for session information, but really
> that should be fine as a start-point for finding the parent fid versus
> doing a d_parent traversal.

That makes sense, I sort of assumed we didn't have the parent's inode
but looking back we seem to always have an inode *dir..
Would be great to get rid of this dentry fid cache.

(Note the dentry will still be kept around by the vfs anyway, but we
don't care about that anymore if it doesn't hold fids)

> Then for refresh logic we have inode_revalidate which can do different
> things depending on cache policy:
>    - cacheless - release immediately, clunk all fids (this assumes the
> 	mountpoint inode will have a reference as long as we are mounted,
> 	but I think that is correct
>    - data/dir cache - same semantics as now, inode holds data cache, all good
>    - policy differences: (which will trigger on inode_revalidate/refresh)
> 	- none: always call walk/getattr to revalidate inode from server
> 	  (the combo of walk/getattr may be a good new compount proto element)
> 	- loose: same as current semantics, keep around forever if we can,
> 		always assume client side cache is correct.
> 	- tight: use qid.version from walk/re-walk to validate, and getattr
> 		if stale.
> 	- temporal: cache times out after a mount configured timeo and we call
> 		getattr
> 	- push(future): use a subscription method to get updates on cache
> 		invalidations from server.

Sounds good.


Also I need to correct one point I said in this thread: I had assumed
creat+unlink worked, but something as simple as fstat on the unlinked
file doesn't work in cacheless mode:
---
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
	int fd;
	struct stat statbuf;

	fd = open("test_unlink_stat", O_CREAT|O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (unlink("test_unlink_stat") < 0) {
		perror("unlink");
		return 2;
	}
	write(fd, "data\n", 5);
	if (fstat(fd, &statbuf) < 0) {
		perror("fstat");
		return 3;
	}
	printf("stat ok, size %zd\n", statbuf.st_size);
	return 0;
}
---

Perhaps that'll just start working with your fixes but either way it's
already not working cacheless, so I don't think you need to worry about
that at this point...

(I noticed because I finally took the time to boot a debian on 9p
rootfs, and apt breaks without this -- works fine with caches)


Eric Van Hensbergen wrote on Wed, Jan 03, 2024 at 12:09:45PM -0600:
> Okay, been delving through this code.  Most (all?) was added by
> Aneesh when he did the cache stuff and I clearly wasn't paying
> close attention a decade ago (mostly because I didn't care about
> the cache then).  In any case, I think there is a ton of code
> I am going to be able to delete (patches coming soon), but wanted
> to sanity check one thing with you folks.

Good work!

> Looking at the inode get code (which also doubles as inode creation
> code).  The inodes are hashed with qid2ino(qid) which I get to take
> the blame for because its been there for 18th years.
> If the size of the inode_tt is 64-bit, it just
> assigns the qid->path+2 (+2?)

hm, inode 1 is definitely usable (a tmpfs' root directory uses it), but
perhaps 0 would caus problems?
In which case +1 makes sense if qid->path can be 0, or +2 if it could be
-1 (I wouldn't put it past some servers); but I can only guess.

> to the i_ino.  If not (presumably for 32 bit)

I had assumed the conversion to inode64 was done in practice but the
struct inode's i_ino is still an unsigned long... Although kstat's ino
field is u64 so if userspace is willing to handle it we could just bump
the size.

I remember having troubles with 64 bit inodes on 32 bit binaries quite a
few years ago so I guess we still need to do that dance -- although not
necessarily internally; we could emulate xfs here:
 - they use xfs_ino_t internally which is unsigned long long (u64);
there's even a confusingly named i_ino in xfs_inode so the inode's
struct i_ino is actually never used as far as I can see
 - there are inode32/inode64 mount option to force allocating inodes in
the 32 bit range if requested (but doesn't try to remap existing inodes,
this is just for new files)

In our case we could add a 64bit inode field in v9fs_inode so we don't
need to worry about it, and remap for kstat (v9fs_vfs_getattr*) if a new
inode32 mount option is requested?

I wouldn't do it by default on 32 bit systems as that'll blow up, and
hopefully by now any userspace that'll upgrade their kernel will have
long converted to _FILE_OFFSET_BITS=64 (which also controls 64bit inode
for stat's st_ino)

> it sets the inode to (qid->path+2) ^ (qid->path+2 >> 32).

(Seems a bit simplistic but probably good enough in practice...)


> If ino_t goes to 64 bit this means we reduce precision for no reason,
> so that has to be fixed -- but I'm trying to remember the relevance
> of +2 which dates back all the way to the original submission.  I
> don't remember why we did this, but I think I'm gonna kill that in
> one patch and in a second patch fix the length check.

Unfortunately cannot help as my involvement started much later..

> Moving up the stack a bit, we use the modified qid.path to hash the
> inode, but then we also assign it (overriding the allocated inode number
> on the client).  I'm gonna say this is wrong (definitely wrong in the
> case where we are allocated inodes willy-nilly) and we should probably
> continue to use qid.path to hash, but then keep the client side inode
> number which is guarenteed unique on allocation.  Or do you folks think
> we should align client-side inode with server-side inode to help with
> debug?  We should still have consistent qid->paths regardless.  Thoughts?

I ended up suggesting something similar above without reading this
properly, if I read this correctly I think we agree -- use as much bits
as we can internally and keep it coherent, then just downsize for
getattr if required.

-- 
Dominique Martinet | Asmadeus

  parent reply	other threads:[~2024-01-03 20:25 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-26 16:13 cache fixes (redux) Eric Van Hensbergen
2023-12-27  8:10 ` Dominique Martinet
2023-12-27 18:07   ` Eric Van Hensbergen
     [not found]   ` <CAFkjPT=8CZHATaraBrqAZGDKjQLOp=U1gdgteJ5jpXRGJyBojQ@mail.gmail.com>
2023-12-28  0:32     ` Dominique Martinet
2023-12-28 10:44       ` Christian Schoenebeck
2023-12-28 15:07         ` Eric Van Hensbergen
2023-12-28 21:48           ` asmadeus
2023-12-29 12:22             ` Christian Schoenebeck
2023-12-29 12:50               ` asmadeus
2023-12-29 16:16                 ` Eric Van Hensbergen
2023-12-29 16:06               ` Eric Van Hensbergen
2023-12-29 15:50             ` Eric Van Hensbergen
2024-01-03 18:09               ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
2024-01-03 20:24               ` asmadeus [this message]
     [not found]                 ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
2024-01-03 22:42                   ` Fwd: cache fixes (redux) Eric Van Hensbergen
2024-01-03 22:48                 ` Eric Van Hensbergen
2024-01-04 12:37                   ` Christian Schoenebeck
2024-01-04 15:59                     ` Eric Van Hensbergen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZZXCiSlrAMI5sMr3@codewreck.org \
    --to=asmadeus@codewreck.org \
    --cc=ericvh@gmail.com \
    --cc=ericvh@kernel.org \
    --cc=linux_oss@crudebyte.com \
    --cc=v9fs@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox