From: Yosry Ahmed
Date: Wed, 5 Oct 2022 10:20:54 -0700
Subject: Re: [RFC] memcg rstat flushing optimization
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin, Michal Koutný, Andrew Morton, Linux-MM, Cgroups, Greg Thelen

On Wed, Oct 5, 2022 at 9:30 AM Tejun Heo wrote:
>
> Hello,
>
> On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed wrote:
> > We have recently ran into a hard lockup on a machine with hundreds of
> > CPUs and thousands of memcgs during an rstat flush. There have also
> > been some discussions during LPC between myself, Michal Koutný, and
> > Shakeel about memcg rstat flushing optimization. This email is a
> > follow up on that, discussing possible ideas to optimize memcg rstat
> > flushing.
> >
> > Currently, mem_cgroup_flush_stats() is the main interface to flush
> > memcg stats.
> > It has some internal optimizations that can skip a flush
> > if there hasn't been significant updates in general. It always flushes
> > the entire memcg hierarchy, and always invokes flushing using
> > cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does
> > not sleep. As you can imagine, with a sufficiently large number of
> > memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> > in an extreme case like the one we ran into, cause a hard lockup
> > (despite periodically flushing every 4 seconds).
>
> How long were the stalls? Given that rstats are usually flushed by its

I think 10 seconds while interrupts are disabled is what we need for a
hard lockup, right?

> consumers, flushing taking some time might be acceptable but what's really
> problematic is that the whole thing is done with irq disabled. We can think
> about other optimizations later too but I think the first thing to do is
> making the flush code able to pause and resume. ie. flush in batches and
> re-enable irq / resched between batches. We'd have to pay attention to
> guaranteeing forward progress. It'd be ideal if we can structure iteration
> in such a way that resuming doesn't end up nodes which got added after it
> started flushing.

IIUC you mean that the caller of cgroup_rstat_flush() can call a
different variant that only flushes a part of the rstat tree then
returns, and the caller makes several calls interleaved by re-enabling
irq, right? Because the flushing code seems to already do this
internally if the non irqsafe version is used.

I think this might be tricky. In this case the path that caused the
lockup was
memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats().
Interrupts are disabled by callers of memcg_check_events(), but the
rstat flush call is made much deeper in the call stack. Whoever is
disabling interrupts doesn't have access to pause/resume flushing.
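
To make the pause/resume idea above concrete, here is a minimal userspace
sketch in Python. It is a toy model only, not kernel code: ToyRstat,
flush_batch(), flush_in_batches() and the between_batches callback are all
hypothetical names, and the callback merely stands in for re-enabling irq or
rescheduling between batches. The flush walks a snapshot of the nodes that
were dirty when it started, so nodes dirtied later cannot extend the current
flush, which is one possible way to guarantee forward progress.

```python
from collections import OrderedDict


class ToyRstat:
    """Toy model: per-node deltas waiting to be folded into flushed totals."""

    def __init__(self):
        self.pending = OrderedDict()  # node name -> unflushed delta
        self.totals = {}              # node name -> flushed total

    def update(self, node, delta):
        self.pending[node] = self.pending.get(node, 0) + delta

    def start_flush(self):
        # Snapshot the nodes that are dirty right now. Nodes dirtied later
        # wait for the next flush, which bounds the work of this one.
        return list(self.pending)

    def flush_batch(self, snapshot, cursor, batch_size):
        """Flush up to batch_size nodes from snapshot[cursor:]; return the new cursor."""
        end = min(cursor + batch_size, len(snapshot))
        for node in snapshot[cursor:end]:
            delta = self.pending.pop(node, 0)
            self.totals[node] = self.totals.get(node, 0) + delta
        return end


def flush_in_batches(rstat, batch_size, between_batches=lambda: None):
    """Flush everything in the snapshot, pausing after every batch."""
    snapshot = rstat.start_flush()
    cursor = 0
    batches = 0
    while cursor < len(snapshot):
        cursor = rstat.flush_batch(snapshot, cursor, batch_size)
        batches += 1
        between_batches()  # stand-in for re-enabling irq / resched
    return batches
```

Note that in this model the loop that pauses and resumes is driven by the
caller of flush_in_batches(), which is exactly the piece that a deep call
chain with interrupts already disabled does not have access to.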

There are also other code paths that used to use
cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
was introduced like mem_cgroup_wb_stats() [1].

This is why I suggested a selective flushing variant of
cgroup_rstat_flush_irqsafe(), so that flushers that need irq disabled
have the ability to only flush a subset of the stats to avoid long
stalls if possible.

[1] https://lore.kernel.org/lkml/20211001190040.48086-2-shakeelb@google.com/

>
> Thanks.
>
> --
> tejun
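
As an illustration of the selective flushing idea in the reply above, the
following Python sketch models a flush that touches only one subtree. It is
a simplified userspace model under stated assumptions, not the kernel's
rstat implementation: the Cgroup class, its pending/stat fields and
flush_subtree() are hypothetical. Each node folds its pending delta into its
own total and hands that delta to its parent as pending work, so ancestors
outside the flushed subtree are left for a later flush.

```python
class Cgroup:
    """Toy cgroup node with an unflushed delta and a flushed subtree total."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.pending = 0  # delta not yet folded into self.stat
        self.stat = 0     # flushed hierarchical total for this subtree
        if parent is not None:
            parent.children.append(self)


def flush_subtree(root):
    """Flush only root and its descendants; return the number of nodes visited."""
    visited = 0
    # Iterative post-order walk so children are folded in before their parent.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            stack.append((node, True))
            stack.extend((child, False) for child in node.children)
            continue
        visited += 1
        delta, node.pending = node.pending, 0
        node.stat += delta
        # Hand the delta to the parent as pending work. The parent is only
        # flushed in this call if it lies inside the subtree being walked.
        if node.parent is not None:
            node.parent.pending += delta
    return visited
```

In this model the cost of a flush scales with the size of the chosen
subtree rather than with the whole hierarchy, which is the property a
flusher running with irq disabled would want.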