From: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, patches@lists.linux.dev,
Shakeel Butt <shakeelb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Muchun Song <muchun.song@linux.dev>, Tejun Heo <tj@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, regressions@lists.linux.dev,
mathieu.tortuyaux@gmail.com
Subject: Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes
Date: Wed, 20 Sep 2023 15:25:23 +0200 [thread overview]
Message-ID: <4eb47d6a-b127-4aad-af30-896c3b9505b4@linux.microsoft.com> (raw)
In-Reply-To: <ZQrSXh+riB7NnZuE@dhcp22.suse.cz>
On 9/20/2023 1:07 PM, Michal Hocko wrote:
> On Wed 20-09-23 12:04:48, Jeremi Piotrowski wrote:
>> On 9/20/2023 10:43 AM, Michal Hocko wrote:
>>> On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
>>>> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
>>>>> 6.1-stable review patch. If anyone has any objections, please let me know.
>>>>>
>>>>> ------------------
>>>>
>>>> Hi Greg/Michal,
>>>>
>>>> This commit breaks userspace which makes it a bad commit for mainline and an
>>>> even worse commit for stable.
>>>>
>>>> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
>>>> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
>>>> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
>>>> fine.
>>>
>>> Could you expand some more on why is the file read? It doesn't support
>>> writing to it for some time so how does reading it helps in any sense?
>>>
>>> Anyway, I do agree that the stable backport should be reverted.
>>>
>>
>> This file is read together with all the other memcg files. Each prefix:
>>
>> memory
>> memory.memsw
>> memory.kmem
>> memory.kmem.tcp
>>
>> is combined with these suffixes
>>
>> .usage_in_bytes
>> .max_usage_in_bytes
>> .failcnt
>> .limit_in_bytes
>>
>> and read, the values are then forwarded on to other components for scheduling decisions.
>> You want to know the limit when checking the usage (is the usage close to the limit or not).
>
> You know there is no kmem limit as there is no way to set it for some
> time (since 5.16 - i.e. 2 years ago). I can see that users following old
> kernels could have missed that though.
I know what you mean, but I think this generally went unnoticed because the limit file is read
unconditionally, but only written when a kmem limit is explicitly requested for a specific
container, which is rarely (if ever) done.
Regarding following old kernels: a majority of kubernetes users are still on 5.15 and only slowly
started shifting to >=6.1 very recently (this summer). This is mostly driven by distro vendor
policies which tend to follow the pattern of "follow LTS kernels but don't switch to the next
LTS immediately".
I know this is far from ideal for reporting these kinds of issues, would love to report
them as soon as a kernel release happens.
>
>> Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the
>> set missing is an anomaly. So maybe we could keep the dummy file just for the
>> sake of consistency? Cgroupv1 is legacy after all.
>
> What we had was a dummy file. It didn't allow to write any value so it
> would have always reported unlimited. The reason I've decided to remove
> the file was that there were other users not being able to handle the
> write failure while they are just fine not having the file. So we are
> effectively between a rock and hard place here. Either way something is
> broken. The other SW got fixed as well but similar to your case it takes
> some time to absorb the change through all 3rd party users.
>
>>>>> Address this by wiping out the file completely and effectively get back to
>>>>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
>>>>
>>>> On reads, the runc code checks for MEMCG_KMEM=n by checking
>>>> kmem.usage_in_bytes.
>
> Just one side note. Config options get renamed and their semantic
> changes over time so I would just recomment to never make any
> dependencies on any specific one.
>
Right, what i meant is the logic is this, with checking the "usage"
file to determine whether the controller is available:
value, err := fscommon.GetCgroupParamUint(path, usage)
if err != nil {
if name != "" && os.IsNotExist(err) {
// Ignore ENOENT as swap and kmem controllers
// are optional in the kernel.
return cgroups.MemoryData{}, nil
}
return cgroups.MemoryData{}, err
}
and if it is, then it proceeds to read "limit_in_bytes" and the others.
>>>> If it is present then runc expects the other cgroup files
>>>> to be there (including kmem.limit_in_bytes). So this change is not effectively
>>>> the same.
>>>>
>>>> Here's a link to the PR that would be needed to handle this change in userspace
>>>> (not merged yet and would need to be propagated through the ecosystem):
>>>>
>>>> https://github.com/opencontainers/runc/pull/4018.
>>>
>>> Thanks. Does that mean the revert is still necessary for the Linus tree
>>> or do you expect that the fix can be merged and propagated in a
>>> reasonable time?
>>>
>>
>> We can probably get runc and currently supported kubernetes versions patched in time
>> before 6.6 (or the next LTS kernel) hits LTS distros.
>>
>> But there's still a bunch of users running cgroupv1 with unsupported kubernetes
>> versions that are still taking kernel updates as they come, so this might get reported
>> again next year if it stays in mainline.
>
> I can see how 3rd party users are hard to get aligned but having a fix
> available should allow them to apply it or is there any actual roadblock
> for them to adapt as soon as they hit the issue?
>
The issue with this is that these users are running a frozen set of kubernetes (+runc)
binaries, but still pull kernel updates from the base distro. These kubernetes versions
are out of maintenance so the code will not get fixed and no one will release fixed
binaries.
> I mean, normally I would be just fine reverting this API change because
> it is disruptive but the only way to have the file available and not
> break somebody is to revert 58056f77502f ("memcg, kmem: further
> deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
> there but that sounds rather dubious. Although one could argue this
> would mimic nokmem kernel option.
>
I just want to make sure we don't introduce yet another new behavior in this legacy
system. I have not seen breakage due to 58056f77502f. Mimicing nokmem sounds good but
does this mean "don't enforce limits" (that should be fine) or "ignore writes to the limit"
(=don't event store the written limit). The latter might have unintended consequences.
next prev parent reply other threads:[~2023-09-20 13:25 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-17 19:12 [PATCH 6.1 000/219] 6.1.54-rc1 review Greg Kroah-Hartman
2023-09-17 20:47 ` SeongJae Park
2023-09-18 5:34 ` Takeshi Ogasawara
2023-09-18 6:42 ` Bagas Sanjaya
2023-09-18 11:24 ` Conor Dooley
2023-09-18 12:08 ` Ron Economos
2023-09-18 12:48 ` Jon Hunter
2023-09-18 18:34 ` Florian Fainelli
2023-09-18 18:41 ` Guenter Roeck
2023-09-18 20:56 ` Naresh Kamboju
2023-09-18 22:21 ` Shuah Khan
[not found] ` <20230917191042.204185566@linuxfoundation.org>
2023-09-20 8:11 ` [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes Jeremi Piotrowski
2023-09-20 8:43 ` Michal Hocko
2023-09-20 9:25 ` Greg Kroah-Hartman
2023-09-20 10:21 ` Jeremi Piotrowski
2023-09-20 10:45 ` Greg Kroah-Hartman
2023-09-20 11:08 ` Michal Hocko
2023-09-20 11:16 ` Greg Kroah-Hartman
2023-09-20 10:04 ` Jeremi Piotrowski
2023-09-20 11:07 ` Michal Hocko
2023-09-20 13:25 ` Jeremi Piotrowski [this message]
2023-09-20 13:47 ` Michal Hocko
2023-09-20 15:32 ` Shakeel Butt
2023-09-20 16:55 ` Michal Hocko
2023-09-20 19:46 ` Shakeel Butt
2023-09-20 20:08 ` Michal Hocko
2023-09-20 21:46 ` Shakeel Butt
2023-09-21 7:52 ` Michal Hocko
2023-09-21 10:43 ` Jeremi Piotrowski
2023-09-21 11:21 ` Michal Hocko
2023-09-21 17:25 ` Shakeel Butt
2023-09-21 19:50 ` Michal Hocko
2023-09-22 13:30 ` Johannes Weiner
2023-09-25 7:40 ` Michal Hocko
2023-09-22 23:00 ` Roman Gushchin
2023-09-25 7:41 ` Michal Hocko
2023-09-26 2:49 ` Roman Gushchin
2023-09-22 11:14 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-09-21 13:04 ` [PATCH 6.1 000/219] 6.1.54-rc1 review Conor Dooley
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4eb47d6a-b127-4aad-af30-896c3b9505b4@linux.microsoft.com \
--to=jpiotrowski@linux.microsoft.com \
--cc=akpm@linux-foundation.org \
--cc=gregkh@linuxfoundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.tortuyaux@gmail.com \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=patches@lists.linux.dev \
--cc=regressions@lists.linux.dev \
--cc=roman.gushchin@linux.dev \
--cc=shakeelb@google.com \
--cc=stable@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox