From: Roman Gushchin <roman.gushchin@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: "Greg Thelen" <gthelen@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Yosry Ahmed" <yosry.ahmed@linux.dev>,
	"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	"Meta kernel team" <kernel-team@meta.com>
Subject: Re: [PATCH] memcg: introduce non-blocking limit setting interfaces
Date: Fri, 18 Apr 2025 22:07:29 +0000
Message-ID: <aALNIVa3zxl9HFK5@google.com>
In-Reply-To: <ohrgrdyy36us7q3ytjm3pewsnkh3xwrtz4xdixxxa6hbzsj2ki@sn275kch6zkh>

On Fri, Apr 18, 2025 at 01:30:03PM -0700, Shakeel Butt wrote:
> On Fri, Apr 18, 2025 at 01:18:53PM -0700, Greg Thelen wrote:
> > On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > Setting the max and high limits can trigger synchronous reclaim and/or
> > > oom-kill if the usage is higher than the given limit. This behavior is
> > > fine for newly created cgroups but it can cause issues for the node
> > > controller while setting limits for existing cgroups.
> > >
> > > In our production multi-tenant and overcommitted environment, we are
> > > seeing priority inversion when the node controller dynamically adjusts
> > > the limits of running jobs of different priorities. Based on the system
> > > situation, the node controller may reduce the limits of lower priority
> > > jobs and increase the limits of higher priority jobs. However, we are
> > > seeing the node controller getting stuck for long periods of time
> > > reclaiming from lower priority jobs while setting their limits, and
> > > also spending a lot of its own CPU.
> > >
> > > One of the workarounds we are trying is to fork a new process which sets
> > > the limit of the lower priority job along with setting an alarm to get
> > > itself killed if it gets stuck in the reclaim for the lower priority job.
> > > However, we are finding it very unreliable and costly. Either we need a
> > > good enough time buffer for the alarm to be delivered after setting the
> > > limit and potentially spend a lot of CPU in the reclaim, or be unreliable
> > > in setting the limit for much shorter but cheaper (less reclaim) alarms.
> > >
> > > Let's introduce new limit setting interfaces which do not trigger
> > > reclaim and/or oom-kill and let the processes in the target cgroup
> > > trigger reclaim and/or throttling and/or oom-kill in their next charge
> > > request. This will make the node controller in multi-tenant
> > > overcommitted environments much more reliable.
> > 
> > Would opening the typical synchronous files (e.g. memory.max) with
> > O_NONBLOCK be a more general way to tell the kernel that the user
> > space controller doesn't want to wait? It's not quite consistent with
> > the traditional use of O_NONBLOCK, which makes operations either
> > fully succeed or fail, rather than altering the operation being requested.
> > But O_NONBLOCK would allow for the semantics of non-blocking
> > reclaim, if that's fast enough for your controller.

+1

> > 
> 
> We actually thought about O_NONBLOCK, but the challenge with that is how
> the node controller would know whether the underlying kernel has the
> "O_NONBLOCK implies no-reclaim/no-oom-kill" feature. I don't think opening
> memory.max with O_NONBLOCK will fail today, so the node controller would
> still need to implement the complicated fork+set-limit+alarm logic
> until the whole fleet has moved away from older kernels. Also, I have
> checked with the systemd folks and they are not happy to implement that
> complicated fork+set-limit+alarm logic.

Could the feature be advertised via /sys/kernel/cgroup/features?
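
To illustrate the idea, here is a rough userspace sketch of how a node
controller might probe /sys/kernel/cgroup/features (one feature name per
line) before relying on O_NONBLOCK semantics. The feature name
"memory_nonblock_limits" is purely hypothetical, since nothing in this
thread defines one, and the helper names are made up for the example. The
assumption that an O_NONBLOCK write to memory.max skips reclaim holds only
on a kernel that actually implements the proposal.

```python
import os
from pathlib import Path

FEATURES_FILE = Path("/sys/kernel/cgroup/features")
# Hypothetical feature name: the thread does not define one.
NONBLOCK_FEATURE = "memory_nonblock_limits"


def parse_features(text: str) -> set:
    """Return the feature names listed one per line in the features file."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def kernel_supports(feature: str, features_file: Path = FEATURES_FILE) -> bool:
    """Check whether the running kernel advertises the given cgroup feature."""
    try:
        return feature in parse_features(features_file.read_text())
    except OSError:
        # Missing or unreadable file: treat the feature as unsupported.
        return False


def set_limit(cgroup_dir: Path, limit: str, nonblocking: bool) -> None:
    """Write a limit to memory.max, optionally opening it with O_NONBLOCK.

    Assumption: on a kernel with the proposed feature, O_NONBLOCK means
    "do not reclaim or oom-kill synchronously". On older kernels the flag
    is accepted but changes nothing, hence the feature check.
    """
    flags = os.O_WRONLY | (os.O_NONBLOCK if nonblocking else 0)
    fd = os.open(cgroup_dir / "memory.max", flags)
    try:
        os.write(fd, limit.encode())
    finally:
        os.close(fd)
```

A controller could then call set_limit(cgroup_dir, limit,
nonblocking=kernel_supports(NONBLOCK_FEATURE)) and keep the
fork+set-limit+alarm path only for kernels that do not list the feature.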
