From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 29 Aug 2023 12:54:06 -0700
Subject: Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for userspace reads
To: Tejun Heo
Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
 Shakeel Butt, Muchun Song, Ivan Babrou, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
> ...
> > > So, the assumptions in the original design were:
> > >
> > > * Writers are high freq but readers are lower freq and can block.
> > >
> > > * The global lock is mutex.
> > >
> > > * Back-to-back reads won't have too much to do because it only has to flush
> > >   what's been accumulated since the last flush which took place just before.
> > >
> > > It's likely that the userspace side is gonna be just fine if we restore the
> > > global lock to be a mutex and let them be. Most of the problems are caused
> > > by trying to allow flushing from non-sleepable and kernel contexts.
> >
> > So basically restore the flush without disabling preemption, and if a
> > userspace reader gets preempted while holding the mutex it's probably
> > okay because we won't have high concurrency among userspace readers?
> >
> > I think Shakeel was worried that this may cause a priority inversion
> > where a low priority task is preempted while holding the mutex, and
> > prevents high priority tasks from acquiring it to read the stats and
> > take actions (e.g. userspace OOMs).
>
> We'll have to see but I'm not sure this is going to be a huge problem. The
> most common way that priority inversions are resolved is through work
> conservation - ie. the system just runs out of other things to do, so
> whatever is blocking the system gets to run and unblocks it. It only really
> becomes a problem when work conservation breaks down on the CPU side, which
> happens if the one holding the resource is 1. blocked on IOs (usually
> through memory allocation but can be anything) 2. throttled by cpu.max.
>
> #1 is not a factor here. #2 is, but that is a factor for everything in the
> kernel and should really be solved from the cpu.max side. So, I think in
> practice, this should be fine, or at least not worse than anything else.

If that's not a concern I can just append a patch that changes the
spinlock to a mutex to this series. Shakeel, wdyt?

> > > Would it
> > > make sense to distinguish what can and can't wait and make the latter group
> > > always use cached value? e.g. even in kernel, during oom kill, waiting
> > > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> >
> > The problem with waiting for in-kernel flushers is that high
> > concurrency leads to terrible serialization.
> > Running a stress test with 100s of reclaimers where everyone waits
> > shows ~3x slowdowns.
> >
> > This patch series is indeed trying to formalize a distinction between
> > waiters who can wait and those who can't on the memcg side:
> >
> > - Unified flushers always flush the entire tree and only flush if no
> > one else is flushing (no waiting); otherwise they use cached data and
> > hope the concurrent flushing helps. This is what we currently do for
> > most memcg contexts. This patch series opts userspace reads out of it.
> >
> > - Non-unified flushers only flush the subtree they care about and they
> > wait if there are other flushers. This is what we currently do for
> > some zswap accounting code. This patch series opts userspace readers
> > into this.
> >
> > The problem Michal is raising is that dropping the lock can lead to an
> > unbounded number of waiters and longer worst-case latency, especially
> > as this is directly influenced by userspace. Reintroducing the mutex
> > and removing the lock-dropping code fixes that problem, but then if
> > the mutex holder gets preempted, we face another problem.
> >
> > Personally I think there is a good chance there won't be userspace
> > latency problems due to the lock, as usually there isn't high
> > concurrency among userspace readers, and we can deal with that problem
> > if/when it happens. So far no problem is happening for cpu.stat, which
> > has the same potential problem.
>
> Maybe leave the global lock as-is and gate the userland flushers with a
> mutex so that there's only ever one contending on the rstat lock from
> the userland side?

Waiman suggested this as well. We can do that for sure, although I
think we should wait until we are sure it's needed.

One question. If whoever is holding that mutex is either flushing with
the spinlock held or spinning (i.e. not sleepable or preemptable),
wouldn't this be equivalent to just changing the spinlock to a mutex
and disabling preemption while holding it?