Date: Wed, 20 Feb 2019 13:47:23 +1100
From: Dave Chinner
To: Roman Gushchin
Cc: "lsf-pc@lists.linux-foundation.org", "linux-fsdevel@vger.kernel.org",
    "linux-mm@kvack.org", "riel@surriel.com", "dchinner@redhat.com",
    "guroan@gmail.com", Kernel Team, "hannes@cmpxchg.org"
Subject: Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
Message-ID: <20190220024723.GA20682@dastard>
In-Reply-To: <20190219071329.GA7827@castle.DHCP.thefacebook.com>

On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> --
>
> Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> of accumulating dying memory cgroups. This is a serious problem: on
> most of our machines we've seen thousands of dying cgroups, and the
> corresponding memory footprint was measured in hundreds of
> megabytes. The problem was also independently discovered by other
> companies.
>
> The fixes were reverted due to an xfs regression investigated by
> Dave Chinner.

Context: it wasn't one regression that I investigated. We had
multiple bug reports with different regressions, and I saw evidence
on my own machines that something wasn't right because of the change
in the IO patterns in certain benchmarks. Some of the problems were
caused by the first patch, some were caused by the second patch.
This also affects ext4 (i.e. it's a general problem, not an XFS
problem), as has been reported a couple of times, including this one
overnight:

https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/

> Simultaneously we've seen a very small (0.18%) cpu regression on
> some hosts, which caused Rik van Riel to propose a patch [3], which
> aimed to fix the regression. The idea is to accumulate small memory
> pressure and apply it periodically, so that we don't overscan small
> shrinker lists. According to Jan Kara's data [4], Rik's patch
> partially fixed the regression, but not entirely.

Rik's patch was buggy and made invalid assumptions about how a cache
with a small number of freeable objects is a "small cache", so any
comparisons made with it are essentially worthless. More details
about the problems with the patch and approach here:

https://lore.kernel.org/stable/20190131224905.GN31397@rh/

> The path forward isn't entirely clear now, and the status quo isn't
> acceptable due to the memcg leak bug. Dave and Michal's position is
> to focus on the dying memory cgroup case and apply some artificial
> memory pressure on the corresponding slabs (probably during the
> cgroup deletion process). This approach can theoretically be less
> harmful for the subtle scanning balance, and not cause any
> regressions.

I outlined the dying memcg problem in patch[0] of the revert series:

https://lore.kernel.org/linux-mm/20190130041707.27750-1-david@fromorbit.com/

It basically documents the solution I proposed for dying memcg
cleanup:

dgc> e.g. add a garbage collector via a background workqueue that
dgc> sits on the dying memcg calling something like:
dgc>
dgc> void drop_slab_memcg(struct mem_cgroup *dying_memcg)
dgc> {
dgc>         unsigned long freed;
dgc>
dgc>         do {
dgc>                 struct mem_cgroup *memcg = NULL;
dgc>
dgc>                 freed = 0;
dgc>                 memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
dgc>                 do {
dgc>                         freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
dgc>                 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
dgc>         } while (freed > 0);
dgc> }

This is a pretty trivial piece of code and doesn't require changing
the core memory reclaim code at all.
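For illustration only, here's a rough, untested sketch of how such a
garbage collector could be hung off the async memcg cleanup via a
workqueue. The reap_work member and the hook point are hypothetical -
they only show the shape of the thing, not a claim about how it must
be implemented:

/*
 * Hypothetical sketch: drive drop_slab_memcg() from a background
 * workqueue when a memcg goes offline. The reap_work member does not
 * exist in struct mem_cgroup today; it is assumed here purely for
 * illustration.
 */
static void memcg_reap_workfn(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						reap_work);

	/* run the GC loop quoted above, then drop the ref we queued with */
	drop_slab_memcg(memcg);
	css_put(&memcg->css);
}

/* queued from the css_offline path, once the memcg stops taking charges */
static void memcg_schedule_reap(struct mem_cgroup *memcg)
{
	css_get(&memcg->css);
	INIT_WORK(&memcg->reap_work, memcg_reap_workfn);
	queue_work(system_unbound_wq, &memcg->reap_work);
}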
> In my opinion, it's not necessarily true. Slab objects can be
> shared between cgroups, and often can't be reclaimed on cgroup
> removal without an impact on the rest of the system.

I've already pointed out that it is preferable for shared objects to
stay in cache, not face expedited reclaim:

https://lore.kernel.org/linux-mm/20190131221904.GL4205@dastard/

dgc> However, the memcg reaper *doesn't need to be perfect* to solve
dgc> the "takes too long to clean up dying memcgs" problem. Even if
dgc> it leaves shared objects behind (which we want to do!), it still
dgc> trims those memcgs down to /just the shared objects still in
dgc> use/. And given that objects shared by memcgs are in the
dgc> minority (according to past discussions about the difficulties
dgc> of accounting them correctly) I think this is just fine.
dgc>
dgc> Besides, those remaining shared objects are the ones we want to
dgc> naturally age out under memory pressure, but otherwise the
dgc> memcgs will have been shaken clean of all other objects
dgc> accounted to them. i.e. the "dying memcg" memory footprint goes
dgc> down massively and the "long term buildup" of dying memcgs
dgc> basically goes away.

This all seems like pretty desirable cross-memcg working set
maintenance behaviour to me...

> Applying constant artificial memory pressure precisely only on
> objects accounted to dying cgroups is challenging and will likely
> cause a quite significant overhead.

I don't know where you got that from - the above example is clearly
a once-off cleanup. And executing it via a workqueue in the async
memcg cleanup path (which already runs through multiple work queues
to run and wait for different stages of cleanup) is not complex or
challenging, nor is it likely to add additional overhead, because it
means we will avoid the long term shrinker scanning overhead that
cleanup currently requires.

> Also, by "forgetting" some slab objects under light or even
> moderate memory pressure, we're wasting memory, which can be used
> for something useful.

Cached memory is not "forgotten" or "wasted memory". If the scan is
too small and not used, it is deferred to the next shrinker
invocation. This batching behaviour is intentionally done for scan
efficiency purposes. Don't take my word for it, read the discussion
that went along with commit 0b1fb40a3b12 ("mm: vmscan: shrink all
slab objects if tight on memory"):

https://lore.kernel.org/lkml/20140115012541.ad302526.akpm@linux-foundation.org/

From Andrew:

akpm> Actually, the intent of batching is to limit the number of
akpm> calls to ->scan(). At least, that was the intent when I wrote
akpm> it! This is a good principle and we should keep doing it. If
akpm> we're going to send the CPU away to tread on a pile of cold
akpm> cachelines, we should make sure that it does a good amount of
akpm> work while it's there.

IOWs, the "small scan" proposals defeat existing shrinker efficiency
optimisations. This change in behaviour is where the CPU usage
regressions in "small cache" scanning come from. As Andrew said:
scan batching is a good principle and we should keep doing it.
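To make that concrete, here's a simplified sketch of the batching
principle - an approximation, not the actual do_shrink_slab() code in
mm/vmscan.c, with names and details condensed. Pressure that is too
small to fill a batch is remembered in nr_deferred rather than
forcing a tiny ->scan() call on every reclaim pass:

/*
 * Simplified illustration of shrinker scan batching, not the real
 * mm/vmscan.c implementation.
 */
static unsigned long shrink_one(struct shrinker *s,
				struct shrink_control *sc,
				unsigned long delta,
				unsigned long freeable)
{
	unsigned long batch = s->batch ? s->batch : SHRINK_BATCH;
	unsigned long freed = 0;
	unsigned long total;

	/* new pressure for this pass plus previously deferred work */
	total = delta + atomic_long_xchg(&s->nr_deferred[sc->nid], 0);

	while (total >= batch || total >= freeable) {
		unsigned long ret;

		/* touch the cold cachelines once, do a real batch of work */
		sc->nr_to_scan = min(total, batch);
		ret = s->scan_objects(s, sc);
		if (ret == SHRINK_STOP)
			break;
		freed += ret;
		total -= sc->nr_to_scan;
	}

	/* too little to be worth a scan now - defer it to next time */
	atomic_long_add(total, &s->nr_deferred[sc->nid]);
	return freed;
}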
> Dying cgroups are just making this problem more obvious because of
> their size.

Dying cgroups see this as a problem only because they have extremely
poor life cycle management. Expediting dying memcg cache cleanup is
the way to fix this, and that does not need us to change global
memory reclaim behaviour.

> So, using "natural" memory pressure in a way that all slab objects
> are scanned periodically seems to me the best solution. The devil
> is in the details, and how to do it without causing any regressions
> is an open question now.
>
> Also, completely re-parenting slabs to the parent cgroup (not only
> shrinker lists) is a potential option to consider.

That should be done once the memcg gc thread has shrunk the caches
down to just the shared objects (which we want to keep in cache!)
that reference the dying memcg. That will get rid of all the
remaining references and allow the memcg to be reclaimed completely.

> It will be nice to discuss the problem at LSF/MM, agree on a
> general path and make a potential list of benchmarks which can be
> used to prove the solution.

In reality, it comes down to this - should we:

a) add a small amount of code into the subsystem to perform
   expedited reaping of subsystem-owned objects and test against the
   known, specific reproducing workload; or

b) change global memory reclaim algorithms in a way that affects
   every linux machine and workload in some way, resulting in us
   having to revalidate and rebalance memory reclaim for a large
   number of common workloads across all filesystems and subsystems
   that use shrinkers, on a wide range of different storage hardware
   and on both headless and desktop machines.

And when we look at it this way, if we end up with option b) as the
preferred solution then we've well and truly jumped the shark.

The validation effort required for option b) is way out of proportion
with the small niche of machines and environments affected by the
dying memcg problem, and the risk of regressions for users outside
these memcg-heavy environments is extremely high (as has already
been proven).

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com