From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6234E1109; Wed, 17 Jul 2024 07:46:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1721202364; cv=none; b=gHNLEsxH0vTmDffOjrsITuI5y+eEAyxB3T0GTL7V/MEpbzP1sFgYgoyvLpZPzEVha570FDb6lDajyW6nCP5099S1csDTTT3npzTkQVAkJQ2oPMR6AdC3q6EkAGQGdDGJRtEeVLYmn2tYreBtJwB0w4xIm/H5/ZBcwhoI3r7sKuI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1721202364; c=relaxed/simple; bh=M1UAov2c9FIa5WR/+oBi3eCA5gwsOt1hDB+03zxYG4c=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=CaA0U1U2IXs6orzuU5cRUW1BANW2OAXcMbjXjS8mnE6qJkgNkTm5bo5wJkDUlt08WpGDIAZOFrt799SQu1hSWSdm79A9Gyzs1q+vPzEPzck3kQC0eJ4MQSBkQzFQdeZ00/frcygRr2Z62m4tL5P96TjjxE5MPVsmYEvJ5pN9Q1E= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=JV4U7TLN; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="JV4U7TLN" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1EACDC32782; Wed, 17 Jul 2024 07:46:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1721202363; bh=M1UAov2c9FIa5WR/+oBi3eCA5gwsOt1hDB+03zxYG4c=; h=Date:Subject:To:Cc:References:From:In-Reply-To:From; b=JV4U7TLNV/x4Mchsr5JmvrBdpgbJCzl3+aSXz06WivcXQwzX1bcKyk1UYdoIaZWMB QnUzBT2MIuvW/oMsd77wnpZwiK0lpI+mPXdxTlD2bJicB7lhxU37+6AUG8HitUTNkl R5UVXf3EFnc5haC0C+n6/PQDGKBLLPF3w93pH+n1GWGrKSBL0wOfoEViLJAXJuBMGn VgNBGL2ZaJPXiz52bs8jyeZ5EDfeVchNK1O6xzvg3X7Z26yDBvDiTQ2q7lP8Br9B8K Jv9ftcXSOww9/JdlL3Wm4ei5CObiUkOi1dheMGmTZKbmkMwGkeldSVFMqO0vosvLus iCKL7TReGNdMw== Message-ID: Date: Wed, 17 Jul 2024 09:46:00 +0200 Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH V4 2/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes To: Yosry Ahmed Cc: Shakeel Butt , tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <171952310959.1810550.17003659816794335660.stgit@firesoul> <171952312320.1810550.13209360603489797077.stgit@firesoul> <4n3qu75efpznkomxytm7irwfiq44hhi4hb5igjbd55ooxgmvwa@tbgmwvcqsy75> <7ecdd625-37a0-49f1-92fc-eef9791fbe5b@kernel.org> <9a7930b9-dec0-418c-8475-5a7e18b3ec68@kernel.org> Content-Language: en-US From: Jesper Dangaard Brouer In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 16/07/2024 23.54, Yosry Ahmed wrote: > On Mon, Jul 8, 2024 at 8:26 AM Jesper Dangaard Brouer wrote: >> >> >> On 28/06/2024 11.39, Jesper Dangaard Brouer wrote: >>> >>> >>> On 28/06/2024 01.34, Shakeel Butt wrote: >>>> On Thu, Jun 27, 2024 at 11:18:56PM GMT, Jesper Dangaard Brouer wrote: >>>>> Avoid lock contention on the global cgroup rstat lock caused by kswapd >>>>> starting on all NUMA nodes simultaneously. At Cloudflare, we observed >>>>> massive issues due to kswapd and the specific mem_cgroup_flush_stats() >>>>> call inlined in shrink_node, which takes the rstat lock. >>>>> >> [...] >>>>> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); >>>>> @@ -312,6 +315,45 @@ static inline void __cgroup_rstat_unlock(struct >>>>> cgroup *cgrp, int cpu_in_loop) >>>>> spin_unlock_irq(&cgroup_rstat_lock); >>>>> } >>>>> +#define MAX_WAIT msecs_to_jiffies(100) >>>>> +/* Trylock helper that also checks for on ongoing flusher */ >>>>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) >>>>> +{ >>>>> + bool locked = __cgroup_rstat_trylock(cgrp, -1); >>>>> + if (!locked) { >>>>> + struct cgroup *cgrp_ongoing; >>>>> + >>>>> + /* Lock is contended, lets check if ongoing flusher is already >>>>> + * taking care of this, if we are a descendant. >>>>> + */ >>>>> + cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >>>>> + if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) { >>>> >>>> I wonder if READ_ONCE() and cgroup_is_descendant() needs to happen >>>> within in rcu section. On a preemptable kernel, let's say we got >>>> preempted in between them, the flusher was unrelated and got freed >>>> before we get the CPU. In that case we are accessing freed memory. >>>> >>> >>> I have to think about this some more. >>> >> >> I don't think this is necessary. We are now waiting (for completion) and >> not skipping flush, because as part of take down function >> cgroup_rstat_exit() is called, which will call cgroup_rstat_flush(). >> >> >> void cgroup_rstat_exit(struct cgroup *cgrp) >> { >> int cpu; >> cgroup_rstat_flush(cgrp); >> >> > > Sorry for the late response, I was traveling for a bit. I will take a > look at your most recent version shortly. But I do have a comment > here. > > I don't see how this addresses Shakeel's concern. IIUC, if the cgroup > was freed after READ_ONCE() (and cgroup_rstat_flush() was called), > then cgroup_is_descendant() will access freed memory. We are not > holding the lock here so we are not preventing cgroup_rstat_flush() > from being called for the freed cgroup, right? If we go back to only allowing root-cgroup to be ongoing-flusher, then we could do a cgroup_rstat_flush(root) in cgroup_rstat_exit() to be sure nothing is left waiting for completion scheme. Right? IMHO the code is getting too complicated with sub-cgroup's as ongoing flushers which also required having 'completion' queues per cgroup. We should go back to only doing this for the root-cgroup. --Jesper