From: Chris Down <chris@chrisdown.name>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Al Viro <viro@zeniv.linux.org.uk>,
Jeff Layton <jlayton@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
kernel-team@fb.com, Hugh Dickins <hughd@google.com>,
Miklos Szeredi <miklos@szeredi.hu>,
"zhengbin (A)" <zhengbin13@huawei.com>,
Roman Gushchin <guro@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Wed, 25 Dec 2019 12:54:48 +0000 [thread overview]
Message-ID: <20191225125448.GA309148@chrisdown.name> (raw)
In-Reply-To: <CAOQ4uxjm5JMvfbi4xa3yaDwuM+XpNOSDrbVsHvJtkms00ZBnAg@mail.gmail.com>
Amir Goldstein writes:
>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>> bleeding of 32-bit inums on an affected workload:
>>
>> # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>> 4454 recycle_or_get_next_ino: not recycled
>> 546 recycle_or_get_next_ino: recycled
>>
>
>Too bad..
>Maybe recycled ino should be implemented all the same because it is simple
>and may improve workloads that are not so MEMCG intensive.
Yeah, I agree. I'll send the full patch over separately (i.e. not as a v2 of
this one), since it's not a complete solution to the problem, but it still
helps somewhat, and we all seem to agree that it's an uncontroversial
improvement overall.
>> Roman (who I've just added to cc) tells me that currently we only have
>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>> contributes fairly significantly here since there are multiple tasks across
>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>
>> I think this is a good start, but we need something of a different magnitude in
>> order to actually solve this problem with the current slab infrastructure. How
>> about something like the following?
>>
>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>> 2. Use get_next_ino_full in tmpfs (et al.)
>
>I would prefer that filesystems making heavy use of get_next_ino be converted
>to use a private ino pool per sb:
>
>ino_pool_create()
>ino_pool_get_next()
>
>flags to ino_pool_create() can determine the desired ino range.
>Does the Facebook use case involve a single large tmpfs or many
>small ones? I would guess the latter and therefore we are trying to solve
>a problem that nobody really needs to solve (i.e. global efficient ino pool).
Unfortunately, in the case under discussion it's all in one large tmpfs on
/dev/shm. I can empathise with that -- application owners often prefer to use
the mounts provided to them rather than having to set up their own. For this
one case we can change that, but it seems reasonable to support the
single-tmpfs case anyway, since sharing one large tmpfs can be a sensible
decision for an application developer, especially with only unprivileged
access to the system.
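The per-sb pool API sketched above names ino_pool_create()/ino_pool_get_next();
one possible userspace model (the flag name and all internals here are
invented purely for illustration, not a proposed implementation) could look
like:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag: constrain the pool to 32-bit inode numbers. */
#define INO_POOL_32BIT 0x1

struct ino_pool {
	uint64_t next;
	unsigned int flags;
};

struct ino_pool *ino_pool_create(unsigned int flags)
{
	struct ino_pool *pool = calloc(1, sizeof(*pool));

	if (pool) {
		pool->next = 1;	/* 0 is not a valid inode number */
		pool->flags = flags;
	}
	return pool;
}

uint64_t ino_pool_get_next(struct ino_pool *pool)
{
	uint64_t ino = pool->next++;

	if (pool->flags & INO_POOL_32BIT) {
		ino = (uint32_t)ino;	/* wrap within 32 bits... */
		if (ino == 0)		/* ...but still skip 0 */
			ino = (uint32_t)pool->next++;
	}
	return ino;
}
```

The point being that each sb wraps independently, so one hot tmpfs no longer
burns through the number space of every other get_next_ino() user.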
>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>> pass if they want the 32-bit inode numbers back. This would still allow
>> people who want to make this tradeoff to use xino.
>
>inode32|inode64 (see man xfs(5)).
Ah great, thanks! I'll reuse precedent from those.
>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>
>
>I don't know about disable, but the default mode for tmpfs (inode32|inode64)
>might be best determined by a CONFIG option, so distro builders could decide
>if they want to take the risk of breaking applications on tmpfs.
Sounds good.
>But if you implement per sb ino pool, maybe inode64 will no longer be
>required for your use case?
In this case I think a per-sb ino pool will help a bit, but unfortunately not
by an order of magnitude. As with the recycling patch, it will reduce the
thrash somewhat, but not conclusively prevent the problem from recurring
long-term. To fix that, I think we really do need the option to use an
ino_t-sized get_next_ino_full (or a per-sb equivalent).
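To put rough numbers on why the full width matters (the allocation rate below
is purely an illustrative assumption, not a measurement from our workload):

```c
/* Seconds until an n-bit inode-number space wraps at a given
 * sustained allocation rate. */
static double seconds_to_wrap(unsigned int bits, double allocs_per_sec)
{
	/* 2^bits, computed without overflowing a 64-bit shift */
	double space = (double)(1ULL << (bits - 1)) * 2.0;

	return space / allocs_per_sec;
}
```

At an assumed 100k allocations/sec, 32 bits wraps in roughly half a day,
while 64 bits lasts on the order of millions of years, which is why I don't
think a shared 32-bit counter can be made safe by reducing thrash alone.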
Happy holidays, and thanks for your feedback!
Chris
Thread overview: 19+ messages
2019-12-20 2:49 [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit Chris Down
2019-12-20 3:05 ` zhengbin (A)
2019-12-20 8:32 ` Amir Goldstein
2019-12-20 12:16 ` Chris Down
2019-12-20 13:41 ` Amir Goldstein
2019-12-20 16:46 ` Matthew Wilcox
2019-12-20 17:35 ` Amir Goldstein
2019-12-20 19:50 ` Matthew Wilcox
2019-12-23 20:45 ` Chris Down
2019-12-24 3:04 ` Amir Goldstein
2019-12-25 12:54 ` Chris Down [this message]
2019-12-26 1:40 ` zhengbin (A)
2019-12-20 21:30 ` Darrick J. Wong
2019-12-21 8:43 ` Amir Goldstein
2019-12-21 18:05 ` Darrick J. Wong
2019-12-21 10:16 ` Chris Down
2020-01-07 17:35 ` J. Bruce Fields
2020-01-07 17:44 ` Chris Down
2020-01-08 3:00 ` J. Bruce Fields