From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-qk1-f169.google.com (mail-qk1-f169.google.com [209.85.222.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 17392223316 for ; Thu, 12 Dec 2024 18:30:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.169 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734028222; cv=none; b=Ba/5ymTIoF89axHPFJkgtmrnUuU2Bu7pLoZGgEMxSh48tRJXZhpCnc/INsecZdM5wkqkmWQ0/qEPQbVmAYPAB5Oa66S5m22N6+sTVh7Qtgusj5LhXg7e5a8/Bn8o7UurIV2a7D1LgSJovg1dfAElqaul8BcWTPAp6hjNSm/1R/Y= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734028222; c=relaxed/simple; bh=KW2GUcJjjEQ/caRd6tYdueccDNsofrCjGydOTPeUN+c=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=bnf+lr3xvRLHVEyCObo2im9VdL5Q1GvRv4EJR+kmhhCohJmXo/QUhQimhhZZHG1E4rq0nHcIgEI06AD57uNo9oZqiK3vJFbDjzgRc6vRrMIUuo0DMy5W95vf2A8qPKeT2P/bT1ia7oRTZ94/PIvB2W68OW6Zo3FDnoS3YiSxFjI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org; spf=pass smtp.mailfrom=cmpxchg.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20230601.gappssmtp.com header.i=@cmpxchg-org.20230601.gappssmtp.com header.b=JcMxufp+; arc=none smtp.client-ip=209.85.222.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=cmpxchg.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20230601.gappssmtp.com header.i=@cmpxchg-org.20230601.gappssmtp.com header.b="JcMxufp+" Received: by mail-qk1-f169.google.com with SMTP id af79cd13be357-7b15467f383so87246385a.3 for ; Thu, 12 Dec 2024 10:30:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20230601.gappssmtp.com; s=20230601; t=1734028218; x=1734633018; darn=vger.kernel.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=MmiY44ftef6O+ZMOcAuOGSJN7t6+XvI2HuwN5LfNTaQ=; b=JcMxufp+IwUOiCo0N7nB6AGbjDoNQwumEyJLnu+wWGXRcCPkupiYRK6fJJUDlCkEZJ Uor0TNANujrx9NEyzevkeD4YCDTDKtHwP13PEpNmibV8njpDN2juvECbUGajLMnkZ2Bb fznAyBiI7aDlonsYXmxq2CqW6UCoMBq/rM46mXb69UbnqM8cWD9xCuwu9M/NOXjWjFV0 Gsz7QQq3EYJg9kU9QRLqsfGQtvvsDVH40lBUgv9ZONRvxDTVbydZ4gVOjnFN4ITuBV6l nVAIEfxwT8XqcozwqxZCzaAN2wJJnrC0hisAdus3ecxs7SZNE+/b4Lsa4V+FMwD3JwJw ZzLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734028218; x=1734633018; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=MmiY44ftef6O+ZMOcAuOGSJN7t6+XvI2HuwN5LfNTaQ=; b=WDgS5PrVNvwfnBOm07on5inrpUgd7tOGwKciAMsgZZF+Mjb+nvxVQBuwLDgy0VGJI3 h/mz/0iW94uEKLjSAk/QbhH2sxm404EkhyZoIRaMG6dY1XAkxWkZ9VHT+Uzl6JOGsTwD x3RYLSB8vvQLR6fljnN/K534+52KYIyDr5HwaNiT9Ft6JAMmGJ3ZdHXTYtK9rEftTdrU 1JxRmEULYAKyefSiatkVXeWwiDxXzs6rpvo0+FwcMwcmDtInWQgyqf754k2yRgztqH9/ b+1VXlBfAVaPOhNvV5zPb+Pu0SNwFEOEJrdS2NXQQK102nS58j9KsOsASMCfNAJ0sm0b hUgg== X-Forwarded-Encrypted: i=1; AJvYcCUhC2ufxaYgjgzw24aSUZhDEpeEt7YvfKRxWiZwHkn3qm3xbyYV1U8QwzO0hthYbub0DfYXK8Ph@vger.kernel.org X-Gm-Message-State: AOJu0YzRqBuIbGwqOgxNmfpyNgZluao4orrJMyI7xBQ0nU3dtUE+ENjz ROs+oaGKEMoFsWn3EiCqoxbbDB7jhZrZAdSqVWslyqLIWkxyMbVOtxARISrEanc= X-Gm-Gg: ASbGncsqDgTlj87S3/wEvl1wEEXyYMaUg1eBpGEyO6OsC6dFJpLmsTWSDj97CZzoC1n ZkJ4k/3OEiXYauIXnS5gijc8Gukyg6OC7NkH+r9opYCcbx5632mH+Rd41ySFEooz9BVnrcswvvO 3JPim/XDkHjlYGQGmPk57WvZIFBL/Sy5RVOPAdiI1wGIgroz2VYXHXG5Ia7FT7wN9u1FuZTVtwR qf5VY4tm3Nsa6aytgqZa0+T+rA7UOOLHnV2j43U782LjmOappSsXf0= X-Google-Smtp-Source: AGHT+IG38MiE0nwEa6hQYO2vGYkkqMEfbHFTHod38nIta4Us7MLC6sgHlHqKAJsG3jnHQxWLOynmyg== X-Received: by 2002:a05:620a:1909:b0:7b6:cb8a:3d54 with SMTP id af79cd13be357-7b6f89eef4dmr208194685a.51.1734028217762; Thu, 12 Dec 2024 10:30:17 -0800 (PST) Received: from localhost ([2603:7000:c01:2716:da5e:d3ff:fee7:26e7]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-467a140f7b2sm2942531cf.79.2024.12.12.10.30.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Dec 2024 10:30:16 -0800 (PST) Date: Thu, 12 Dec 2024 13:30:12 -0500 From: Johannes Weiner To: Yosry Ahmed Cc: Rik van Riel , Balbir Singh , Michal Hocko , Roman Gushchin , hakeel Butt , Muchun Song , Andrew Morton , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap Message-ID: <20241212183012.GB1026@cmpxchg.org> References: <20241212115754.38f798b3@fangorn> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote: > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel wrote: > > > > A task already in exit can get stuck trying to allocate pages, if its > > cgroup is at the memory.max limit, the cgroup is using zswap, but > > zswap writeback is enabled, and the remaining memory in the cgroup is > > not compressible. > > > > This seems like an unlikely confluence of events, but it can happen > > quite easily if a cgroup is OOM killed due to exceeding its memory.max > > limit, and all the tasks in the cgroup are trying to exit simultaneously. > > > > When this happens, it can sometimes take hours for tasks to exit, > > as they are all trying to squeeze things into zswap to bring the group's > > memory consumption below memory.max. > > > > Allowing these exiting programs to push some memory from their own > > cgroup into swap allows them to quickly bring the cgroup's memory > > consumption below memory.max, and exit in seconds rather than hours. > > > > Signed-off-by: Rik van Riel > > Thanks for sending a v2. > > I still think maybe this needs to be fixed on the memcg side, at least > by not making exiting tasks try really hard to reclaim memory to the > point where this becomes a problem. IIUC there could be other reasons > why reclaim may take too long, but maybe not as pathological as this > case to be fair. I will let the memcg maintainers chime in for this. > > If there's a fundamental reason why this cannot be fixed on the memcg > side, I don't object to this change. > > Nhat, any objections on your end? I think your fleet workloads were > the first users of this interface. Does this break their expectations? Yes, I don't think we can do this, unfortunately :( There can be a variety of reasons for why a user might want to prohibit disk swap for a certain cgroup, and we can't assume it's okay to make exceptions. There might also not *be* any disk swap to overflow into after Nhat's virtual swap patches. Presumably zram would still have the issue too. So I'm also inclined to think this needs a reclaim/memcg-side fix. We have a somewhat tumultous history of policy in that space: commit 7775face207922ea62a4e96b9cd45abfdc7b9840 Author: Tetsuo Handa Date: Tue Mar 5 15:46:47 2019 -0800 memcg: killed threads should not invoke memcg OOM killer allowed dying tasks to simply force all charges and move on. This turned out to be too aggressive; there were instances of exiting, uncontained memcg tasks causing global OOMs. This lead to that: commit a4ebf1b6ca1e011289677239a2a361fde4a88076 Author: Vasily Averin Date: Fri Nov 5 13:38:09 2021 -0700 memcg: prohibit unconditional exceeding the limit of dying tasks which reverted the bypass rather thoroughly. Now NO dying tasks, *not even OOM victims*, can force charges. I am not sure this is correct, either: If we return -ENOMEM to an OOM victim in a fault, the fault handler will re-trigger OOM, which will find the existing OOM victim and do nothing, then restart the fault. This is a memory deadlock. The page allocator gives OOM victims access to reserves for that reason. Actually, it looks even worse. For some reason we're not triggering OOM from dying tasks: ret = task_is_dying() || out_of_memory(&oc); Even though dying tasks are in no way privileged or allowed to exit expediently. Why shouldn't they trigger the OOM killer like anybody else trying to allocate memory? As it stands, it seems we have dying tasks getting trapped in an endless fault->reclaim cycle; with no access to the OOM killer and no access to reserves. Presumably this is what's going on here? I think we want something like this: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 53db98d2c4a1..be6b6e72bde5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, if (mem_cgroup_margin(memcg) >= (1 << order)) goto unlock; - /* - * A few threads which were not waiting at mutex_lock_killable() can - * fail to bail out. Therefore, check again after holding oom_lock. - */ - ret = task_is_dying() || out_of_memory(&oc); + ret = out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (unlikely(current->flags & PF_MEMALLOC)) goto force; + if (unlikely(tsk_is_oom_victim(current))) + goto force; + if (unlikely(task_in_memcg_oom(current))) goto nomem;