Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit

From: Chris Down <chris@chrisdown.name>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com, Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Fri, 20 Dec 2019 12:16:15 +0000
Message-ID: <20191220121615.GB388018@chrisdown.name>
In-Reply-To: <CAOQ4uxjqSWcrA1reiyit9DRt+aq2tXBxLdPE31RrYw1yr=4hjg@mail.gmail.com>

Hi Amir,

Thanks for getting back, I appreciate it.

Amir Goldstein writes:
>How about something like this:
>
>/* just to explain - use an existing macro */
>shmem_ino_shift = ilog2(sizeof(void *));
>inode->i_ino = (__u64)inode >> shmem_ino_shift;
>
>This should solve the reported problem with little complexity,
>but it exposes internal kernel address to userspace.

One problem I can see with that approach is that get_next_ino doesn't 
discriminate based on context (for example, when it is called for a 
particular tmpfs mount), which means that wraparound risk is eventually still 
pushed to the limit on such machines for other users of get_next_ino (like 
named pipes, sockets, procfs, etc). Granted, collisions between those users are 
less likely, since they generally hold far fewer inodes at any one time than 
some tmpfs workloads do, but still.
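
To illustrate the wraparound being discussed, here is a minimal userspace
sketch of a single 32-bit counter shared by every caller. It is not the
kernel's get_next_ino (which also batches per CPU); the names last_ino_sketch
and next_ino_sketch are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one counter shared by all callers
 * (tmpfs, pipes, sockets, ...). Not the kernel implementation. */
static uint32_t last_ino_sketch;

/* Hand out the next number; never return 0. */
static uint32_t next_ino_sketch(void)
{
	uint32_t res = ++last_ino_sketch;

	if (res == 0)	/* the counter wrapped: skip 0 */
		res = ++last_ino_sketch;

	assert(res != 0);
	return res;
}
```

Once the counter has been advanced 2^32 times by any mix of callers, it starts
again at 1, even if some long-lived inode still holds that number.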

>Can we do anything to mitigate this risk?
>
>For example, instead of trying to maintain a unique map of
>ino_t to struct shmem_inode_info * in the system
>it would be enough (and less expensive) to maintain a unique map of
>shmem_ino_range_t to slab.
>The ino_range id can then be mixed with the relative object index in
>slab to compose i_ino.
>
>The big win here is not just having to allocate an id every bunch of inodes
>instead of every inode, but the fact that recycled (i.e. delete/create)
>shmem_inode_info objects get the same i_ino without having to
>allocate any id.
>
>This mimics a standard behavior of blockdev filesystem like ext4/xfs
>where inode number is determined by logical offset on disk and is
>quite often recycled on delete/create.
>
>I realize that the method I described with slab is crossing module layers
>and would probably be NACKED.

Yeah, that's more or less my concern with that approach as well, hence why I 
went for something that seemed less intrusive and keeps with the current inode 
allocation strategy :-)

>Similar result could be achieved by shmem keeping a small stash of
>recycled inode objects, which are not returned to slab right away and
>retain their allocated i_ino. This at least should significantly reduce the
>rate of burning get_next_ino allocation.

While this issue happens to present itself currently on tmpfs, I'm worried 
that, based on historic precedent, future users of get_next_ino might end up 
hitting this as well. That's the main reason why I'm inclined to try and 
improve get_next_ino's strategy itself.

>Anyway, to add another consideration to the mix, overlayfs uses
>the high ino bits to multiplex several layers into a single ino domain
>(mount option xino=on).
>
>tmpfs is a very commonly used filesystem as overlayfs upper layer,
>so many users are going to benefit from keeping the highest bits
>of tmpfs inode numbers unused.
>
>For this reason, I dislike the current "grow forever" approach of
>get_next_ino() and prefer that we use a smarter scheme when
>switching over to 64bit values.

By "a smarter scheme when switching over to 64bit values", you mean keeping 
i_ino as low magnitude as possible while still avoiding simultaneous reuse, 
right?
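
To make the concern about high bits concrete, here is a rough sketch of
multiplexing a layer id into the top bits of a 64-bit inode number. This is
not overlayfs's actual code; xino_pack_sketch, its parameters and the exact
bit layout are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: place layer_id in the top xino_bits bits of a
 * 64-bit inode number and keep real_ino in the remaining low bits.
 * Returns 0 on success, or -1 if either value does not fit.
 */
static int xino_pack_sketch(uint64_t real_ino, uint64_t layer_id,
			    unsigned int xino_bits, uint64_t *out)
{
	unsigned int shift;

	assert(xino_bits > 0 && xino_bits < 64);
	shift = 64 - xino_bits;

	if (real_ino >> shift)		/* underlying ino already uses the high bits */
		return -1;
	if (layer_id >> xino_bits)	/* layer id too large for the reserved bits */
		return -1;

	*out = (layer_id << shift) | real_ino;
	return 0;
}
```

With a scheme like this, an underlying inode number that has grown into the
reserved bits can no longer be packed, which is why keeping i_ino low in
magnitude matters.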

To that end, if we can reliably and expediently recycle inode numbers, I'm 
not against sticking to the existing typing scheme in get_next_ino. It's just a 
matter of agreeing on the method and on the level of the stack at which that 
should take place :-)

I'd appreciate your thoughts on ways forward. One potential option is to 
reimplement get_next_ino using an IDA, as mentioned in my patch message. Other 
than the potential to upset microbenchmarks, do you have concerns with that as 
a patch?
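
As a rough idea of the behaviour an ID allocator would give, here is a small
userspace stand-in that always hands out the lowest free number and lets
freed numbers be reused. It does not use the kernel IDA API (ida_alloc() and
ida_free()); the bitmap, the 1024-entry limit and the sketch_* names are
assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_IDS 1024u	/* arbitrary small limit for the sketch */

/* One bit per id; a set bit means the id is in use. */
static uint64_t sketch_bitmap[SKETCH_MAX_IDS / 64];

/* Return the lowest free id starting from 1, or 0 if none is left. */
static unsigned int sketch_ino_alloc(void)
{
	for (unsigned int id = 1; id < SKETCH_MAX_IDS; id++) {
		uint64_t bit = UINT64_C(1) << (id % 64);

		if (!(sketch_bitmap[id / 64] & bit)) {
			sketch_bitmap[id / 64] |= bit;
			return id;
		}
	}
	return 0;
}

/* Release an id so that a later allocation can reuse it. */
static void sketch_ino_free(unsigned int id)
{
	assert(id != 0 && id < SKETCH_MAX_IDS);
	sketch_bitmap[id / 64] &= ~(UINT64_C(1) << (id % 64));
}
```

The linear scan is only there to keep the sketch short. The point is that
numbers stay low and are recycled on delete/create, at the cost of more
bookkeeping per allocation than a plain counter.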

Thanks,

Chris

Thread overview: 19+ messages
2019-12-20  2:49 [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit Chris Down
2019-12-20  3:05 ` zhengbin (A)
2019-12-20  8:32 ` Amir Goldstein
2019-12-20 12:16   ` Chris Down [this message]
2019-12-20 13:41     ` Amir Goldstein
2019-12-20 16:46       ` Matthew Wilcox
2019-12-20 17:35         ` Amir Goldstein
2019-12-20 19:50           ` Matthew Wilcox
2019-12-23 20:45             ` Chris Down
2019-12-24  3:04               ` Amir Goldstein
2019-12-25 12:54                 ` Chris Down
2019-12-26  1:40                   ` zhengbin (A)
2019-12-20 21:30 ` Darrick J. Wong
2019-12-21  8:43   ` Amir Goldstein
2019-12-21 18:05     ` Darrick J. Wong
2019-12-21 10:16   ` Chris Down
2020-01-07 17:35     ` J. Bruce Fields
2020-01-07 17:44       ` Chris Down
2020-01-08  3:00         ` J. Bruce Fields