From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5BFA7C3DA4B for ; Wed, 17 Jul 2024 07:46:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9C0FE6B0099; Wed, 17 Jul 2024 03:46:07 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 970DD6B009A; Wed, 17 Jul 2024 03:46:07 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 85F7A6B009B; Wed, 17 Jul 2024 03:46:07 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 672776B0099 for ; Wed, 17 Jul 2024 03:46:07 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id E6882140917 for ; Wed, 17 Jul 2024 07:46:06 +0000 (UTC) X-FDA: 82348461132.11.24E981C Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf03.hostedemail.com (Postfix) with ESMTP id 250D820023 for ; Wed, 17 Jul 2024 07:46:04 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=JV4U7TLN; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf03.hostedemail.com: domain of hawk@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=hawk@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1721202321; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=cOKBmggBo4v8/tT8claX3VAfkaEzSLgWyNoVY6gWtLA=; b=RyeYG3WVwU5fkatZ26UemZUNbZJbim2CVY43F2mB+2GRuou5dCQGEYClt9VR32I3CZnRhg C7nTVdh/+f8HgQrpIP4H0LC2ux3a7mK5ddk5yc9aeksQnQ5a4fFSLZCbI2o4xHW3jsLdxM h958e8p/lUcIJZ357dt4I9XSidM4scQ= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1721202321; a=rsa-sha256; cv=none; b=4D5OZbrNW/rvd5ab1zSDEcujIuMO5W/NvYcCGkJpbqMJSmvgNdx/HB51WPUBk7JJ2MR953 CkGvSk0bxRoJcms51Utz7DPnYv0Vy6HMkmTYAOFZtB+zrM1V/0TQdSg8CNbYIPySCslA8t jOBfBzSjzy++7ADagW/v3hyXCXVzYbI= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=JV4U7TLN; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf03.hostedemail.com: domain of hawk@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=hawk@kernel.org Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 3DAEE61435; Wed, 17 Jul 2024 07:46:04 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1EACDC32782; Wed, 17 Jul 2024 07:46:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1721202363; bh=M1UAov2c9FIa5WR/+oBi3eCA5gwsOt1hDB+03zxYG4c=; h=Date:Subject:To:Cc:References:From:In-Reply-To:From; b=JV4U7TLNV/x4Mchsr5JmvrBdpgbJCzl3+aSXz06WivcXQwzX1bcKyk1UYdoIaZWMB QnUzBT2MIuvW/oMsd77wnpZwiK0lpI+mPXdxTlD2bJicB7lhxU37+6AUG8HitUTNkl R5UVXf3EFnc5haC0C+n6/PQDGKBLLPF3w93pH+n1GWGrKSBL0wOfoEViLJAXJuBMGn VgNBGL2ZaJPXiz52bs8jyeZ5EDfeVchNK1O6xzvg3X7Z26yDBvDiTQ2q7lP8Br9B8K Jv9ftcXSOww9/JdlL3Wm4ei5CObiUkOi1dheMGmTZKbmkMwGkeldSVFMqO0vosvLus iCKL7TReGNdMw== Message-ID: Date: Wed, 17 Jul 2024 09:46:00 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH V4 2/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes To: Yosry Ahmed Cc: Shakeel Butt , tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <171952310959.1810550.17003659816794335660.stgit@firesoul> <171952312320.1810550.13209360603489797077.stgit@firesoul> <4n3qu75efpznkomxytm7irwfiq44hhi4hb5igjbd55ooxgmvwa@tbgmwvcqsy75> <7ecdd625-37a0-49f1-92fc-eef9791fbe5b@kernel.org> <9a7930b9-dec0-418c-8475-5a7e18b3ec68@kernel.org> Content-Language: en-US From: Jesper Dangaard Brouer In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 250D820023 X-Stat-Signature: 63kew5kjjnq9zz9r4gmp389k5q5u8bw1 X-Rspam-User: X-HE-Tag: 1721202364-819611 X-HE-Meta: U2FsdGVkX1921+gx7XCEJXpEl8ELxDcwlUioBPfDpdap00oN+R118s+U8cjaAmQqU4hoK7bwo+DxCeeOMnIu/J7eF16n2GAT7twDGgRhzjKu/LiXXQo/N4CwroeZduq1/KS8CAyqwXHrCTlkrmC6EGDA/6UIGLZKnJopI4ppCtCCmW6Qr/vG5K3Bu4VcakEjKu2YIxyssV5XyxGmk39pPydBBtSAvCpLgLeOp/dKtSPVT0Y9iKeSmXrmBJ0gd6J82JYtvOGvVQMp19EYqGvMiBSVOTcOr1jS57JDngk6aFvcQIYeFtWNGSduCYilqp98Y3AeWCQ3NE9jsz/vs2TaFR74BwwfR/K0MYSVBj5IdZA/Z8iiMEzXNbvpg3XIjGGgamXyBKQG98/eMK2X+yl1bXP9I6aXx3UN4X3DuWne/RzVpSBADFmFsLad9Ha4uDQ/IeWUb03PBWrBcTJfoGVUJoN4memULxiJrgDCjiCP2zziA0AoclUC6TAJCHPVnq2J/b0tCLEqLjVTROZa5eEP+6lfR2OHKTQDNPOulnTwtGJ0qGKmrig6ljDDmO2wCUguLjf1TAm1/OuMopaJW8HEaznGbG7hnXrNhTwgA+IuRaGWtVEh4EkpLpmTJeS3m8rStTBwStBgy85E23/hobDp0cXStJqY0tndv12EBkOZYyzuuV0NM7/yJcSbs9pBikCcVN8PMRnRbkZawF1WtQEqmgtbQuTZuh4mVQ8Ydl76J4w/zlngEB//VJVAiiTZhhK5MndhAS4TIvfGYMgRXf1WD7+rsPM/K6mAomgAahC/0BOeh51Kd0zLcVzMHun+2o1Qx5vdi5p0j5e5GAfvB3HjykBY2MjS9u+qpOrJFSAe4PeR6OmgF0S3YFB1pWM9OisTv+MdwsbWHb+p5EIkTXurHRpVrfKjrSQKcbDHaQhaJnBnLFCdkwQRU0mebgzAZzUEEWSVfeAlwjJknZ3b1c0 7dLIuqKa 1RAWI0ExZRx9SM0KMxOlseYRhoMTe2w9BlzPePs0jRrHadqthswtY9KCLCOFG1rG9YiXDCTGHZ9qoj7wkirPx3pHhrFHva09sdzcJLU5bOBhGuV2FhwzFB71eWwPYVfG7lUxrcJJ4/SknUurn1E/gi8CqX+UbwXZccOfdo/35QV4xdDtyh12GDlEcTq2pRkebRoj7ch1XZbwplkZEz/DvE8rb06X/RmDdMOh8gxdU4NCAoMTmEqycLeIzA5BueBntCtFfx/q9tcuKxYZg6r8Q9ghRQBLs+K6JFoeGBpwE+mfjDZnkbMuh50/OJdN7XqfxdPrrfDNyDz3vd9BvSC3LIEDsJwRpdBDWO730 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000014, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 16/07/2024 23.54, Yosry Ahmed wrote: > On Mon, Jul 8, 2024 at 8:26 AM Jesper Dangaard Brouer wrote: >> >> >> On 28/06/2024 11.39, Jesper Dangaard Brouer wrote: >>> >>> >>> On 28/06/2024 01.34, Shakeel Butt wrote: >>>> On Thu, Jun 27, 2024 at 11:18:56PM GMT, Jesper Dangaard Brouer wrote: >>>>> Avoid lock contention on the global cgroup rstat lock caused by kswapd >>>>> starting on all NUMA nodes simultaneously. At Cloudflare, we observed >>>>> massive issues due to kswapd and the specific mem_cgroup_flush_stats() >>>>> call inlined in shrink_node, which takes the rstat lock. >>>>> >> [...] >>>>> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); >>>>> @@ -312,6 +315,45 @@ static inline void __cgroup_rstat_unlock(struct >>>>> cgroup *cgrp, int cpu_in_loop) >>>>> spin_unlock_irq(&cgroup_rstat_lock); >>>>> } >>>>> +#define MAX_WAIT msecs_to_jiffies(100) >>>>> +/* Trylock helper that also checks for on ongoing flusher */ >>>>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) >>>>> +{ >>>>> + bool locked = __cgroup_rstat_trylock(cgrp, -1); >>>>> + if (!locked) { >>>>> + struct cgroup *cgrp_ongoing; >>>>> + >>>>> + /* Lock is contended, lets check if ongoing flusher is already >>>>> + * taking care of this, if we are a descendant. >>>>> + */ >>>>> + cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >>>>> + if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) { >>>> >>>> I wonder if READ_ONCE() and cgroup_is_descendant() needs to happen >>>> within in rcu section. On a preemptable kernel, let's say we got >>>> preempted in between them, the flusher was unrelated and got freed >>>> before we get the CPU. In that case we are accessing freed memory. >>>> >>> >>> I have to think about this some more. >>> >> >> I don't think this is necessary. We are now waiting (for completion) and >> not skipping flush, because as part of take down function >> cgroup_rstat_exit() is called, which will call cgroup_rstat_flush(). >> >> >> void cgroup_rstat_exit(struct cgroup *cgrp) >> { >> int cpu; >> cgroup_rstat_flush(cgrp); >> >> > > Sorry for the late response, I was traveling for a bit. I will take a > look at your most recent version shortly. But I do have a comment > here. > > I don't see how this addresses Shakeel's concern. IIUC, if the cgroup > was freed after READ_ONCE() (and cgroup_rstat_flush() was called), > then cgroup_is_descendant() will access freed memory. We are not > holding the lock here so we are not preventing cgroup_rstat_flush() > from being called for the freed cgroup, right? If we go back to only allowing root-cgroup to be ongoing-flusher, then we could do a cgroup_rstat_flush(root) in cgroup_rstat_exit() to be sure nothing is left waiting for completion scheme. Right? IMHO the code is getting too complicated with sub-cgroup's as ongoing flushers which also required having 'completion' queues per cgroup. We should go back to only doing this for the root-cgroup. --Jesper