All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Down <chris@chrisdown.name>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com, Hugh Dickins <hughd@google.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	"zhengbin (A)" <zhengbin13@huawei.com>,
	Roman Gushchin <guro@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Wed, 25 Dec 2019 12:54:48 +0000	[thread overview]
Message-ID: <20191225125448.GA309148@chrisdown.name> (raw)
In-Reply-To: <CAOQ4uxjm5JMvfbi4xa3yaDwuM+XpNOSDrbVsHvJtkms00ZBnAg@mail.gmail.com>

Amir Goldstein writes:
>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>> bleeding of 32-bit inums on an affected workload:
>>
>>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>          4454 recycle_or_get_next_ino: not recycled
>>           546 recycle_or_get_next_ino: recycled
>>
>
>Too bad..
>Maybe recycled ino should be implemented all the same because it is simple
>and may improve workloads that are not so MEMCG intensive.

Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for 
this) since it's not a total solution for the problem, but still helps somewhat 
and we all seem to agree that it's overall an uncontroversial improvement.

>> Roman (who I've just added to cc) tells me that currently we only have
>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>> contributes fairly significantly here since there are multiple tasks across
>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>
>> I think this is a good start, but we need something of a different magnitude in
>> order to actually solve this problem with the current slab infrastructure. How
>> about something like the following?
>>
>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>> 2. Use get_next_ino_full in tmpfs (et al.)
>
>I would prefer that filesystems making heavy use of get_next_ino, be converted
>to use a private ino pool per sb:
>
>ino_pool_create()
>ino_pool_get_next()
>
>flags to ino_pool_create() can determine the desired ino range.
>Does the Facebook use case involve a single large tmpfs or many
>small ones? I would guess the latter and therefore we are trying to solve
>a problem that nobody really needs to solve (i.e. global efficient ino pool).

Unfortunately in the case under discussion, it's all in one large tmpfs in 
/dev/shm. I can empathise with that -- application owners often prefer to use 
the mounts provided to them rather than having to set up their own. For this 
one case we can change that, but I think it seems reasonable to support this 
case since using a single tmpfs can be a reasonable decision as an application 
developer, especially if you only have unprivileged access to the system.

>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>>     pass if they want the 32-bit inode numbers back. This would still allow
>>     people who want to make this tradeoff to use xino.
>
>inode32|inode64 (see man xfs(5)).

Ah great, thanks! I'll reuse precedent from those.

>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>
>
>I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
>might me best determined by CONFIG option, so distro builders could decide
>if they want to take the risk of breaking applications on tmpfs.

Sounds good.

>But if you implement per sb ino pool, maybe inode64 will no longer be
>required for your use case?

In this case I think per-sb ino pool will help a bit, but unfortunately not by 
an order of magnitude. As with the recycling patch this will help reduce thrash 
a bit but not conclusively prevent the problem from happening long-term. To fix 
that, I think we really do need the option to use ino_t-sized get_next_ino_full 
(or per-sb equivalent).

Happy holidays, and thanks for your feedback!

Chris

  reply	other threads:[~2019-12-25 12:54 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-20  2:49 [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit Chris Down
2019-12-20  3:05 ` zhengbin (A)
2019-12-20  8:32 ` Amir Goldstein
2019-12-20 12:16   ` Chris Down
2019-12-20 13:41     ` Amir Goldstein
2019-12-20 16:46       ` Matthew Wilcox
2019-12-20 17:35         ` Amir Goldstein
2019-12-20 19:50           ` Matthew Wilcox
2019-12-23 20:45             ` Chris Down
2019-12-24  3:04               ` Amir Goldstein
2019-12-25 12:54                 ` Chris Down [this message]
2019-12-26  1:40                   ` zhengbin (A)
2019-12-20 21:30 ` Darrick J. Wong
2019-12-21  8:43   ` Amir Goldstein
2019-12-21 18:05     ` Darrick J. Wong
2019-12-21 10:16   ` Chris Down
2020-01-07 17:35     ` J. Bruce Fields
2020-01-07 17:44       ` Chris Down
2020-01-08  3:00         ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191225125448.GA309148@chrisdown.name \
    --to=chris@chrisdown.name \
    --cc=amir73il@gmail.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jlayton@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=zhengbin13@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.