From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 5 Aug 2024 11:50:59 -0700
Subject: Re: [PATCH V8 1/2] cgroup/rstat: Avoid flushing if there is an ongoing overlapping flush
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <172139415725.3084888.13770938453137383953.stgit@firesoul>
Content-Type: text/plain; charset="UTF-8"
On Mon, Aug 5, 2024 at 7:24 AM Jesper Dangaard Brouer wrote:
>
> On 02/08/2024 18.10, Yosry Ahmed wrote:
> > On Fri, Aug 2, 2024 at 4:43 AM Jesper Dangaard Brouer wrote:
> >>
> >> On 30/07/2024 20.54, Yosry Ahmed wrote:
> >>> [..]
> >>>>
> >>>> Well... I'm still not convinced that it makes sense to have level >= 2
> >>>> be the ongoing flusher.
> >>>>
> >>>> E.g. if a level 2 cgroup becomes the ongoing flusher, and kswapd starts
> >>>> 12 NUMA flushes at the same time, then the code will have these 12
> >>>> kswapd threads spin on the lock until the ongoing flusher finishes.
> >>>> That is likely what happened above (for a level 1). These 12 spinning
> >>>> (root) flushers will not recheck ongoing_flusher and will all flush
> >>>> the root (unnecessarily 11 times).
> >>>
> >>> Hmm, regardless of whether or not the level-2 cgroup becomes the
> >>> ongoing flusher, the kswapd threads will all spin on the lock anyway,
> >>> since none of them can be the ongoing flusher until the level-2 cgroup
> >>> finishes. Right?
> >>>
> >>> Is the scenario you have in mind that the level-2 cgroup starts
> >>> flushing at the same time as kswapd, so there is a race on who gets to
> >>> be the ongoing flusher? In this case as well, whoever gets the lock
> >>> will be the ongoing flusher anyway.
> >>>
> >>> Not allowing whoever is holding the lock to be the ongoing flusher
> >>> based on level is only useful when we can have multiple ongoing
> >>> flushers (with lock yielding). Right?
> >>>
> >>> Perhaps I am missing something here.
> >>>
> >>>> So, I don't think it is a good idea to have anything other than the
> >>>> root as the ongoing flusher.
> >>>>
> >>>> Can you explain/convince me why having sub-cgroups as the ongoing
> >>>> flusher is an advantage?
> >>>
> >>> I just don't see the benefit of the special casing here, as I
> >>> mentioned above. If I missed something please let me know.
> >>
> >> I do think you missed something. Let me try to explain this in another
> >> way. (I hope my frustration doesn't shine through.)
> >>
> >> The main purpose of the patch is/was to stop the thundering herd of
> >> kswapd threads flushing (the root cgroup) at exactly the same time,
> >> leading to lock contention. This happens all the time/constantly in
> >> production.
> >>
> >> The first versions (where ongoing was limited to root/level=0) solved
> >> this 100%. The patches that generalized this so that all levels can
> >> become the ongoing flusher no longer solve the problem!
> >>
> >> I hope it is clear what fails. E.g. when a level>0 cgroup becomes the
> >> ongoing flusher, and 12 kswapd threads simultaneously do a
> >> level=0/root-cgrp flush, then we have 12 CPU cores spinning on the
> >> rstat lock. (These 12 kswapd threads will all go through completing
> >> the flush, as they do not discover/recheck that the ongoing flusher
> >> had since become their own level.)
> >
> > I think we may be speaking past one another, let me try to clarify :)
> >
> > I agree with your assessment; all I am saying is that this restriction
> > is only needed because of lock yielding, and can be removed after
> > that, IIUC.
> >
> > The problem when we allow non-root ongoing flushers now is that when
> > the kswapd threads are woken up and the first one of them gets the
> > lock and does the flush, it may find that ongoing_flusher is already
> > set by another non-root flusher that yielded the lock. In this case,
> > the following kswapd flushers will spin on the lock instead of waiting
> > for the first kswapd to finish.
> >
> > If we remove lock yielding, then the above scenario cannot happen.
>
> I think this is where we disagree/talk past each other. Looking at the
> code, I do believe the situation *also* occurs without any lock
> yielding involved. Yes, the situation is far worse when we have lock
> yielding, but it also happens in the default case.
>
> > When the lock/mutex is held by a flusher, it is guaranteed that
> > ongoing_flusher is NULL and can be set by the flusher. In this case,
> > we should allow any cgroup to be the ongoing_flusher because there can
> > only be one anyway.
>
> With the current patch proposal [V8 or V9], assuming we have no lock
> yielding:
>
> Do we agree that 12 kswapd threads will be waiting on the lock when a
> level>0 cgroup was the ongoing flusher when they started?
> Then the level>0 cgroup finishes being the ongoing flusher.
> Then kswapd0 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd1 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd2 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd3 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd4 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd5 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd6 gets the lock, observes NULL as ongoing, and becomes ongoing.
> [etc]

How does preventing the level>0 cgroup from becoming the ongoing_flusher
change the above scenario? kswapd threads will still observe NULL as
ongoing and spin on the lock, right? (Still assuming no lock yielding.)
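To make sure we are talking about the same shape of code, here is a
simplified userspace model of the flush path as I understand it from
this thread. It uses pthreads instead of the kernel primitives, and
covers(), do_flush(), ongoing_lock and root_flushes are made up for
illustration; this is not the actual cgroup_rstat code, and the
level-based restriction on who may become ongoing is left out.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

struct cgroup {
        int level;              /* 0 == root */
        struct cgroup *parent;
};

/* Stand-ins for the contended flush lock and the ongoing-flusher marker. */
static pthread_mutex_t rstat_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ongoing_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ongoing_done = PTHREAD_COND_INITIALIZER;
static struct cgroup  *ongoing_flusher;   /* NULL when no flush is in flight */
static int root_flushes;                  /* counted under rstat_lock, for the demo below */

/* Is 'ancestor' equal to, or an ancestor of, 'cgrp'? */
static bool covers(struct cgroup *ancestor, struct cgroup *cgrp)
{
        for (; cgrp; cgrp = cgrp->parent)
                if (cgrp == ancestor)
                        return true;
        return false;
}

static void do_flush(struct cgroup *cgrp)
{
        if (cgrp->level == 0)
                root_flushes++;
        usleep(1000);           /* pretend walking/flushing the stats takes a while */
}

static void flush(struct cgroup *cgrp)
{
        /* Step 1: the ongoing-flusher check, done once on entry. */
        pthread_mutex_lock(&ongoing_lock);
        struct cgroup *ongoing = ongoing_flusher;
        if (ongoing && covers(ongoing, cgrp)) {
                /* Someone is already flushing (an ancestor of) us: wait, then skip. */
                while (ongoing_flusher == ongoing)
                        pthread_cond_wait(&ongoing_done, &ongoing_lock);
                pthread_mutex_unlock(&ongoing_lock);
                return;
        }
        pthread_mutex_unlock(&ongoing_lock);

        /* Step 2: everyone who got past step 1 queues up on the flush lock. */
        pthread_mutex_lock(&rstat_lock);

        /*
         * Step 3: no re-check of ongoing_flusher here. A thread that has
         * been waiting on rstat_lock finds ongoing_flusher == NULL (the
         * previous holder cleared it on the way out), installs itself and
         * performs a full flush, even if another thread flushed the very
         * same cgroup an instant earlier.
         */
        pthread_mutex_lock(&ongoing_lock);
        ongoing_flusher = cgrp;
        pthread_mutex_unlock(&ongoing_lock);

        do_flush(cgrp);

        pthread_mutex_lock(&ongoing_lock);
        ongoing_flusher = NULL;
        pthread_cond_broadcast(&ongoing_done);
        pthread_mutex_unlock(&ongoing_lock);

        pthread_mutex_unlock(&rstat_lock);
}

In this shape, the threads already queued on rstat_lock at step 2
behave the same no matter which cgroup was allowed to become
ongoing_flusher at step 1.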
>
> Please, let me know if I misunderstood my own code, and you believe
> this scenario cannot happen.
>
> When the above happens, the patch doesn't solve the kswapd thundering
> herd issue that we observe in production.
>
> The point/problem is that once kswapd is waiting on the lock, the code
> doesn't re-check the ongoing flusher, so every kswapd thread will be
> spinning and every kswapd thread will need to go through the flush.
> When a kswapd thread gets the lock, it will observe ongoing as NULL,
> so it cannot detect that another level=0 flusher was just the ongoing
> one.
>
> --Jesper
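Sticking with the toy model above (again purely illustrative, not the
patch), the walkthrough can be reproduced with a small driver: one
thread flushes a level-1 cgroup while 12 "kswapd" threads flush the
root. Both snippets go in one file and build with something like
gcc -pthread.

#include <stdio.h>

/* Continues the sketch above. */
static struct cgroup root  = { .level = 0, .parent = NULL };
static struct cgroup child = { .level = 1, .parent = &root };

static void *kswapd(void *arg)
{
        (void)arg;
        flush(&root);           /* a root (level=0) flush, as kswapd would do */
        return NULL;
}

static void *child_flusher(void *arg)
{
        (void)arg;
        flush(&child);          /* the level>0 flush that is in flight first */
        return NULL;
}

int main(void)
{
        pthread_t kt[12], ct;

        pthread_create(&ct, NULL, child_flusher, NULL);
        for (int i = 0; i < 12; i++)
                pthread_create(&kt[i], NULL, kswapd, NULL);
        for (int i = 0; i < 12; i++)
                pthread_join(kt[i], NULL);
        pthread_join(ct, NULL);

        /*
         * Usually prints 12: every kswapd thread that was queued on
         * rstat_lock did its own full root flush, because ongoing_flusher
         * was NULL again by the time it acquired the lock. It can print
         * less if a thread happens to do its step-1 check while another
         * root flush is already in flight; the threads already queued on
         * the lock never get that chance again.
         */
        printf("root flushes: %d\n", root_flushes);
        return 0;
}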