Date: Tue, 29 Aug 2023 09:36:25 -1000
From: Tejun Heo <htejun@gmail.com>
To: Yosry Ahmed
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for
	userspace reads
Hello,

On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
...
> > So, the assumptions in the original design were:
> >
> > * Writers are high freq but readers are lower freq and can block.
> >
> > * The global lock is a mutex.
> >
> > * Back-to-back reads won't have too much to do because it only has to flush
> > what's been accumulated since the last flush, which took place just before.
> >
> > It's likely that the userspace side is gonna be just fine if we restore the
> > global lock to be a mutex and let them be. Most of the problems are caused
> > by trying to allow flushing from non-sleepable and kernel contexts.
>
> So basically restore the flush without disabling preemption, and if a
> userspace reader gets preempted while holding the mutex it's probably
> okay because we won't have high concurrency among userspace readers?
>
> I think Shakeel was worried that this may cause a priority inversion
> where a low priority task is preempted while holding the mutex, and
> prevents high priority tasks from acquiring it to read the stats and
> take actions (e.g. userspace OOMs).

We'll have to see, but I'm not sure this is going to be a huge problem. The
most common way that priority inversions are resolved is through work
conservation - ie. the system just runs out of other things to do, so
whatever is blocking the system gets to run and unblocks it.

It only really becomes a problem when work conservation breaks down on the
CPU side, which happens if the one holding the resource is:

1. blocked on IOs (usually through memory allocation but can be anything)

2. throttled by cpu.max.

#1 is not a factor here. #2 is, but that is a factor for everything in the
kernel and should really be solved from the cpu.max side. So, I think in
practice this should be fine, or at least not worse than anything else.

> > Would it make sense to distinguish what can and can't wait and make the
> > latter group always use the cached value? e.g. even in kernel, during oom
> > kill, waiting doesn't really matter and it can just wait to obtain the
> > up-to-date numbers.
>
> The problem with waiting for in-kernel flushers is that high
> concurrency leads to terrible serialization.
> Running a stress test
> with 100s of reclaimers where everyone waits shows ~3x slowdowns.
>
> This patch series is indeed trying to formalize a distinction between
> waiters who can wait and those who can't on the memcg side:
>
> - Unified flushers always flush the entire tree and only flush if no
> one else is flushing (no waiting), otherwise they use cached data and
> hope the concurrent flushing helps. This is what we currently do for
> most memcg contexts. This patch series opts userspace reads out of it.
>
> - Non-unified flushers only flush the subtree they care about and they
> wait if there are other flushers. This is what we currently do for
> some zswap accounting code. This patch series opts userspace readers
> into this.
>
> The problem Michal is raising is that dropping the lock can lead to an
> unbounded number of waiters and longer worst-case latency, especially
> as this is directly influenced by userspace. Reintroducing the mutex
> and removing the lock-dropping code fixes that problem, but then if
> the mutex holder gets preempted, we face another problem.
>
> Personally I think there is a good chance there won't be userspace
> latency problems due to the lock, as usually there isn't high
> concurrency among userspace readers, and we can deal with that problem
> if/when it happens. So far no problem is happening for cpu.stat, which
> has the same potential problem.

Maybe leave the global lock as-is and gate the userland flushers with a
mutex so that there's only ever one contending on the rstat lock from the
userland side?

Thanks.

-- 
tejun