Re: [PATCH 5.15] kernfs: switch global kernfs_rwsem lock to per-fs lock

public inbox for stable@vger.kernel.org

From: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org, Minchan Kim <minchan@kernel.org>,
	Sasha Levin <sashal@kernel.org>, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH 5.15] kernfs: switch global kernfs_rwsem lock to per-fs lock
Date: Fri, 29 Nov 2024 22:20:48 +0100
Message-ID: <95cf11dc-6771-4a53-9c34-20ee27bfeaa2@linux.microsoft.com>
In-Reply-To: <2024112923-constrict-respect-a0a6@gregkh>

On 29/11/2024 13:12, Greg Kroah-Hartman wrote:
> On Fri, Nov 29, 2024 at 12:32:36PM +0100, Jeremi Piotrowski wrote:
>> From: Minchan Kim <minchan@kernel.org>
>>
>> [ Upstream commit 393c3714081a53795bbff0e985d24146def6f57f ]
>>
>> The kernfs implementation has big lock granularity(kernfs_rwsem) so
>> every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
>> lock. It makes trouble for some cases to wait the global lock
>> for a long time even though they are totally independent contexts
>> each other.
>>
>> A general example is process A goes under direct reclaim with holding
>> the lock when it accessed the file in sysfs and process B is waiting
>> the lock with exclusive mode and then process C is waiting the lock
>> until process B could finish the job after it gets the lock from
>> process A.
>>
>> This patch switches the global kernfs_rwsem to per-fs lock, which
>> put the rwsem into kernfs_root.
>>
>> Suggested-by: Tejun Heo <tj@kernel.org>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>> Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org
>> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
>> ---
>> Hi Stable Maintainers,
>>
>> This upstream commit fixes a kernel hang due to severe lock contention on
>> kernfs_rwsem that occurs when container workloads perform a lot of cgroupfs
>> accesses. Could you please apply to 5.15.y? I cherry-pick the upstream commit
>> to v5.15.173 and then performed `git format-patch`.
> 
> This should not hang, but rather just reduce contention, right? Do you
> have real performance numbers that show this is needed? What workloads 
> are overloading cgroupfs?

"System hang due to the contention" might be a more accurate description. On a
Kubernetes node there is always a stream of processes (systemd, kubelet, containerd,
cadvisor) periodically opening, stat'ing and reading cgroupfs files. Java apps also love
reading cgroup files. Other operations, such as creating short-lived containers, take
the rwsem for writing when creating cgroups and when creating veth netdevs (veth netdev
creation takes the rwsem when creating its sysfs files). Systemd service startup also
contends for the same write lock.

It's not so much a particular workload as a matter of scale: cgroupfs read accesses
scale with the number of containers on a host. With enough readers and the right mix of
writers, write operations can take minutes.

Here are some real performance numbers: I have a representative reproducer with 50 cgroupfs
readers in a loop and a container batch job every minute. `systemctl status` times out
after 1m30s and container creation takes over 4m, which causes operations to pile up and
makes the situation even worse. With this patch included, under the same load the
operations finish in ~10s, preventing the system from becoming unresponsive.
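
The reader side of such a load can be approximated roughly as follows. This is a minimal
sketch, not the reproducer used above: the function names, the 50-thread default and the
choice to read every file under a directory are assumptions made for illustration. Pointed
at /sys/fs/cgroup, it starts a number of threads that repeatedly stat, open and read each
file for a fixed duration and returns how many reads succeeded.

```python
import os
import threading
import time


def list_files(root):
    """Collect the files under root (one walk, done up front)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def reader_loop(paths, deadline, counts, idx):
    """stat/open/read every file repeatedly until the deadline passes."""
    done = 0
    while time.monotonic() < deadline:
        for path in paths:
            try:
                os.stat(path)
                with open(path, "rb") as f:
                    f.read()
                done += 1
            except OSError:
                # Files can vanish (short-lived cgroups) or be write-only.
                pass
    counts[idx] = done


def run_readers(root, num_readers=50, seconds=1.0):
    """Run num_readers threads against root; return total successful reads."""
    paths = list_files(root)
    counts = [0] * num_readers
    deadline = time.monotonic() + seconds
    threads = [
        threading.Thread(target=reader_loop, args=(paths, deadline, counts, i))
        for i in range(num_readers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)
```

Running something like `run_readers("/sys/fs/cgroup", 50, 60.0)` while timing a write-side
operation (for example a container start) before and after the patch would give a rough
comparison; the writer side is deliberately left out of this sketch.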

This patch stops sysfs and cgroupfs modifications from contending for the same rwsem,
and it also lowers contention between different cgroup subsystems.
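
The effect of moving the lock from a single global to one per root can be pictured with a
toy model. This is only an illustration under stated assumptions: `KernfsRoot`,
`make_roots` and `blocks` are hypothetical names, and a plain mutex stands in for the
kernel's reader/writer semaphore, so reader/writer semantics are not modelled.

```python
import threading


class KernfsRoot:
    """Toy stand-in for a kernfs root (e.g. sysfs or one cgroup hierarchy)."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock  # the lock that guards this root's tree


def make_roots(names, per_root):
    """Build roots that share one lock (global scheme) or own one each (per-fs scheme)."""
    shared = threading.Lock()
    return {
        n: KernfsRoot(n, threading.Lock() if per_root else shared)
        for n in names
    }


def blocks(holder, waiter):
    """Return True if waiter's lock is unavailable while holder's lock is held."""
    with holder.lock:
        got = waiter.lock.acquire(blocking=False)
        if got:
            waiter.lock.release()
        return not got


old = make_roots(["sysfs", "cgroupfs"], per_root=False)
new = make_roots(["sysfs", "cgroupfs"], per_root=True)
print(blocks(old["sysfs"], old["cgroupfs"]))  # True: one lock for everything
print(blocks(new["sysfs"], new["cgroupfs"]))  # False: independent locks
```

Operations on the same root still serialize against each other in both schemes; only the
cross-filesystem contention goes away.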

> And why not just switch them to 6.1.y kernels or newer?

I wish we could just do that. Right now all our users are on 5.15, and a lot of their
workloads are sensitive to changes to any part of the container stack, including the
kernel version. So they will gradually migrate to kernel 6.1.y and newer as part of
upgrading their clusters to a new Kubernetes release, after they validate their workloads
on it. This is a slow process, and in the meantime they are hitting the issue that the
patch addresses. I'm sure there are other similar users of 5.15 out there.

> 
> thanks,
> 
> greg k-h

Thanks,
Jeremi

Thread overview: 5+ messages
2024-11-29 11:32 [PATCH 5.15] kernfs: switch global kernfs_rwsem lock to per-fs lock Jeremi Piotrowski
2024-11-29 12:12 ` Greg Kroah-Hartman
2024-11-29 21:20   ` Jeremi Piotrowski [this message]
2024-11-30 15:47     ` Greg Kroah-Hartman
2024-11-29 20:03 ` Sasha Levin