public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com
Subject: [PATCH 0/3] proc: improve root readdir latency with many threads
Date: Tue, 14 Jun 2022 14:09:46 -0400	[thread overview]
Message-ID: <20220614180949.102914-1-bfoster@redhat.com> (raw)

Hi all,

We have a user who has reported performance problems related to
(presumably) custom task monitoring on Linux systems when running
processes with large numbers of threads. Unfortunately I don't have much
information around the practical workload and observations, but only
that the problem had been narrowed down to excessive readdir latency of
the /proc root dir in the presence of large numbers of threads in the
associated pid namespace.

This latency boils down to the inefficient pid_namespace walk down in
the proc_pid_readdir() path. More specifically, every thread/task
allocates an associated struct pid, and the procfs next_tgid()
implementation walks every pid in the namespace looking for those with
an associated PIDTYPE_TGID task to fill into the directory listing.

Given that ids are part of the idr radix-tree, it seemed fairly logical
that this could be improved using an internal tree tag. I started
playing around with an approach that tagged and untagged ids based on
actual task association (i.e., attach_pid() and friends), but after some
thought and feedback came to the realization that this could probably be
simplified to just tag the pid once on allocation and allow procfs to
use it as a hint for root dir population. This works because post-fork
tgid task disassociation (without an exit() and freeing the pid) seems
to be uncommon. The only tool I've seen in my testing so far that leaves
around a tagged, non-TGID pid is chronyd, which appears to do a fork()
-> setsid() -> fork() pattern where the intermediate task exits but the
associated pid hangs around for the lifetime of the process due to the
PIDTYPE_SID association.

Therefore, this series implements this tgid tag hinting approach. Patch
1 includes a couple tweaks to the idr tree to support traditional
radix-tree tag propagation. Patch 2 defines the new tag and sets it on
pid allocation. Patch 3 updates procfs to use the tag for the readdir
pid_namespace traversal.

As far as testing goes, I've thrown this at fstests (not for filesystem
testing purposes, but moreso just because I had the test env handy and
it's a longish running task creation workload), LTP and some of the
kernel internal tests in tools/testing/selftests (clone, proc,
pid_namespace) without any obvious regressions. From the performance
angle, the user who reported this problem has provided some synthetic
tools to create dummy tasks/threads and run repeated readdir iterations
of /proc, which is what they've been using to compare results on Linux
kernels with some $other OS. These tools show a notable improvement in
terms of the number of /proc readdir iterations possible per-second. For
example, on 5.19.0-rc2 running on a mostly idle system with an active
100k thread process, readdirs-per-second improves from a baseline of
~285 to ~7.3k with the series applied. More detailed getdents() latency
numbers are included in the commit log of patch 3.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (3):
  radix-tree: propagate all tags in idr tree
  pid: use idr tag to hint pids associated with group leader tasks
  proc: use idr tgid tag hint to iterate pids in readdir

 fs/proc/base.c      |  2 +-
 include/linux/idr.h | 25 +++++++++++++++++++++++++
 include/linux/pid.h |  2 +-
 kernel/fork.c       |  2 +-
 kernel/pid.c        |  9 ++++++++-
 lib/radix-tree.c    | 26 +++++++++++++++-----------
 6 files changed, 51 insertions(+), 15 deletions(-)

-- 
2.34.1


             reply	other threads:[~2022-06-14 18:09 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14 18:09 Brian Foster [this message]
2022-06-14 18:09 ` [PATCH 1/3] radix-tree: propagate all tags in idr tree Brian Foster
2022-06-15 11:12   ` Christoph Hellwig
2022-06-15 12:57     ` Brian Foster
2022-06-15 13:40       ` Matthew Wilcox
2022-06-15 14:43         ` Brian Foster
2022-06-15 16:34           ` Matthew Wilcox
2022-06-28 12:55             ` Christian Brauner
2022-06-28 14:53               ` Brian Foster
2022-06-29 19:13                 ` Brian Foster
2022-07-11 20:24                 ` Matthew Wilcox
2022-06-15 13:33     ` Ian Kent
2022-06-14 18:09 ` [PATCH 2/3] pid: use idr tag to hint pids associated with group leader tasks Brian Foster
2022-06-14 18:09 ` [PATCH 3/3] proc: use idr tgid tag hint to iterate pids in readdir Brian Foster
2022-06-15 13:44   ` Matthew Wilcox
2022-06-15 14:44     ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220614180949.102914-1-bfoster@redhat.com \
    --to=bfoster@redhat.com \
    --cc=ikent@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=onestero@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox