Date: Thu, 28 Nov 2024 09:05:34 +1100
From: Dave Chinner
To: Alice Ryhl
Cc: Johannes Weiner, Andrew Morton, Nhat Pham, Qi Zheng, Roman Gushchin,
    Muchun Song, Linux Memory Management List, Michal Hocko, Shakeel Butt,
    cgroups@vger.kernel.org, open list
Subject: Re: [QUESTION] What memcg lifetime is required by list_lru_add?
X-Mailing-List: cgroups@vger.kernel.org

On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> Dear SHRINKER and MEMCG experts,
>
> When using list_lru_add() and list_lru_del(), it seems to be required
> that you pass the same value of nid and memcg to both calls, since
> list_lru_del() might otherwise try to delete it from the wrong list /
> delete it while holding the wrong spinlock. I'm trying to understand
> the implications of this requirement on the lifetime of the memcg.
>
> Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> to keep the memcg object alive for the duration of list_lru_add().
> That rcu locking is used here seems to imply that without it, the
> memcg could be deallocated during the list_lru_add() call, which is of
> course bad.
> But rcu is not enough on its own to keep the memcg alive
> all the way until the list_lru_del_obj() call, so how does it ensure
> that the memcg stays valid for that long?

We don't care if the memcg goes away whilst there are objects on the
LRU. memcg destruction will reparent the objects to a different
memcg via memcg_reparent_list_lrus() before the memcg is torn down.
New objects should not be added to the memcg LRUs once the memcg
teardown process starts, so there should never be add vs reparent
races during teardown.

Hence all the list_lru_add_obj() function needs to do is ensure that
the locking/lifecycle rules for the memcg object that
mem_cgroup_from_slab_obj() returns are obeyed.

> And if there is a mechanism
> to keep the memcg alive for the entire duration between add and del,

It's enforced by the -complex- state machine used to tear down
control groups.

tl;dr: If the memcg gets torn down, it will reparent the objects on
the LRU to its parent memcg during the teardown process.

This reparenting happens in the cgroup ->css_offline() method, which
only happens after the cgroup reference count goes to zero and is
waited on via:

kill_css
  percpu_ref_kill_and_confirm(css_killed_ref_fn)
    css_killed_ref_fn
      offline_css
        mem_cgroup_css_offline
          memcg_offline_kmem
          {
            .....
            memcg_reparent_objcgs(memcg, parent);

            /*
             * After we have finished memcg_reparent_objcgs(), all list_lrus
             * corresponding to this cgroup are guaranteed to remain empty.
             * The ordering is imposed by list_lru_node->lock taken by
             * memcg_reparent_list_lrus().
             */
            memcg_reparent_list_lrus(memcg, parent)
          }

The cgroup teardown control code then schedules the freeing of the
memcg container via an RCU work callback once the reference count is
globally visible as killed and has gone to zero.

Hence the cgroup infrastructure requires RCU protection for the
duration of unreferenced cgroup object accesses.
This allows subsystems to perform operations on the cgroup object
without needing to hold a cgroup reference for every access. The
complex, multi-stage teardown process allows a cgroup to release the
objects it tracks, which avoids the need for every tracked object to
hold a reference count on the cgroup.

See the comment above css_free_rwork_fn() for more details about the
teardown process:

/*
 * css destruction is four-stage process.
 *
 * 1. Destruction starts.  Killing of the percpu_ref is initiated.
 *    Implemented in kill_css().
 *
 * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
 *    and thus css_tryget_online() is guaranteed to fail, the css can be
 *    offlined by invoking offline_css().  After offlining, the base ref is
 *    put.  Implemented in css_killed_work_fn().
 *
 * 3. When the percpu_ref reaches zero, the only possible remaining
 *    accessors are inside RCU read sections.  css_release() schedules the
 *    RCU callback.
 *
 * 4. After the grace period, the css can be freed.  Implemented in
 *    css_free_rwork_fn().
 *
 * It is actually hairier because both step 2 and 4 require process context
 * and thus involve punting to css->destroy_work adding two additional
 * steps to the already complex sequence.
 */

-Dave.
-- 
Dave Chinner
david@fromorbit.com
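
The reparenting argument above can be modelled in userspace. What
follows is a minimal single-threaded sketch in C, not kernel code:
every toy_* name is hypothetical, locking and RCU are left out
entirely, and the "walk up to a live ancestor" lookup is an assumption
of this model rather than a description of how list_lru resolves its
lists. It only illustrates why an object left on a dying memcg's LRU
can still be deleted later without the caller having pinned that memcg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_item {
	struct toy_item *prev, *next;
};

struct toy_memcg {
	struct toy_memcg *parent;
	struct toy_item lru;	/* circular list head */
	long nr_items;
	bool dying;		/* set once teardown has started */
};

static void toy_memcg_init(struct toy_memcg *cg, struct toy_memcg *parent)
{
	cg->parent = parent;
	cg->lru.prev = cg->lru.next = &cg->lru;
	cg->nr_items = 0;
	cg->dying = false;
}

/* Model assumption: a dying memcg resolves to its nearest live ancestor. */
static struct toy_memcg *toy_live_memcg(struct toy_memcg *cg)
{
	while (cg->dying && cg->parent)
		cg = cg->parent;
	return cg;
}

/* Append item to the tail of the resolved memcg's LRU. */
static void toy_lru_add(struct toy_memcg *cg, struct toy_item *item)
{
	cg = toy_live_memcg(cg);
	item->prev = cg->lru.prev;
	item->next = &cg->lru;
	cg->lru.prev->next = item;
	cg->lru.prev = item;
	cg->nr_items++;
}

/* Unlink item; it must currently sit on the resolved memcg's LRU. */
static void toy_lru_del(struct toy_memcg *cg, struct toy_item *item)
{
	cg = toy_live_memcg(cg);
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->prev = item->next = item;
	cg->nr_items--;
}

/* Teardown step: splice every item onto the parent, leave the child empty. */
static void toy_memcg_offline(struct toy_memcg *cg)
{
	struct toy_memcg *parent = cg->parent;

	assert(parent != NULL);	/* the root is never offlined in this model */
	cg->dying = true;
	if (cg->nr_items == 0)
		return;

	struct toy_item *first = cg->lru.next, *last = cg->lru.prev;

	first->prev = parent->lru.prev;
	parent->lru.prev->next = first;
	last->next = &parent->lru;
	parent->lru.prev = last;
	parent->nr_items += cg->nr_items;

	cg->lru.prev = cg->lru.next = &cg->lru;
	cg->nr_items = 0;
}

/* Add two items to a child, offline it, then delete one via the child. */
static long toy_demo(void)
{
	struct toy_memcg root, child;
	struct toy_item a, b;

	toy_memcg_init(&root, NULL);
	toy_memcg_init(&child, &root);
	toy_lru_add(&child, &a);
	toy_lru_add(&child, &b);
	toy_memcg_offline(&child);	/* both items now live on root */
	toy_lru_del(&child, &a);	/* resolves to root, unlinks a */
	return root.nr_items;
}
```

In this model toy_demo() leaves one item on the root list: the child is
empty after toy_memcg_offline(), and the later delete lands on the
parent's list even though the caller still names the child.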