From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63CC4C43334 for ; Mon, 27 Jun 2022 10:43:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id ACE836B0072; Mon, 27 Jun 2022 06:43:47 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A7E116B0073; Mon, 27 Jun 2022 06:43:47 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 920118E0001; Mon, 27 Jun 2022 06:43:47 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 82BB76B0072 for ; Mon, 27 Jun 2022 06:43:47 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id 5ADE7120EA2 for ; Mon, 27 Jun 2022 10:43:47 +0000 (UTC) X-FDA: 79623680094.05.FC4B64A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf17.hostedemail.com (Postfix) with ESMTP id F04F040024 for ; Mon, 27 Jun 2022 10:43:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1656326625; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=cpR/mdhvmKtoV4nOgDnYYNfJjR/S+MqVogf5uxZuQ3+HY3BkRU/DwpvfqegvWsFB2Wi8e2 I7ZQrxU5NeKwvI7LZWdihP2prBPmmHnIeZH14AEW3hy1NyUM67X5/JOWlxrQiLqTh6/39x VfR6jwu2CxhdNrpfg9YsDBcGW/Jv2vE= Received: from mail-lf1-f72.google.com (mail-lf1-f72.google.com [209.85.167.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-464-fFTCIYt9OV-8HSUCweeWag-1; Mon, 27 Jun 2022 06:43:44 -0400 X-MC-Unique: fFTCIYt9OV-8HSUCweeWag-1 Received: by mail-lf1-f72.google.com with SMTP id b2-20020a0565120b8200b00477a4532448so4534630lfv.22 for ; Mon, 27 Jun 2022 03:43:44 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:to:cc:references:from:in-reply-to :content-transfer-encoding; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=PLdzB0qO8jwlhqiwK3RpV9JmugBz4mz4Q0BlRLQSZPXgS9Gdio6etG/Cabczhml6zv nnQyO/lY/AKAsn0Bf/3XQ1VndogQxWwLpCIc+e7YND+wAIPGXGGxNrUgWWYrK/JCDREs W6y9hqZmJPS9gYcR5U5TZ0uSl2TS9Ewm28B1PCtWjpCHE8VvFFTZM/jwiQeTV6glnYOJ nwjsM/tPfhYb0l5wFFRF7uYiVKonAeIiTMPLYjMcnfl8C/EUkZLAiVRrXIYb7wFSsZe0 rtClPX4RkWuOLXIkM9SHFUJuvH/a5nU70fYm16+6U+7/PpRXw3pVQwHhZ4RKtGbUpSs1 1I8A== X-Gm-Message-State: AJIora/HrlV0Agjc27vACTGqU/faxzHp28ngELf6XKZ/ecrgo3o8YN8O b5hsw+ZM1cYYM3qHiC15dApLPTI4aUmSUmPEmkmgn19jW8+vTnGAKJgEc/1Qc5SGZm9FQe+bM7G b+ejidRf33w== X-Received: by 2002:a2e:8696:0:b0:25a:7673:d22a with SMTP id l22-20020a2e8696000000b0025a7673d22amr6354322lji.494.1656326621699; Mon, 27 Jun 2022 03:43:41 -0700 (PDT) X-Google-Smtp-Source: AGRyM1uCfw27rdAGlL6dzckO8TuLAXJ9YcQYVJ415tCpSJLMrbjD4x2skEZpY/oGXCMYll7PSmdtkg== X-Received: by 2002:a2e:8696:0:b0:25a:7673:d22a with SMTP id l22-20020a2e8696000000b0025a7673d22amr6354296lji.494.1656326621311; Mon, 27 Jun 2022 03:43:41 -0700 (PDT) Received: from [192.168.1.121] (91-145-109-188.bb.dnainternet.fi. [91.145.109.188]) by smtp.gmail.com with ESMTPSA id q10-20020a056512210a00b00477a287438csm1764197lfr.2.2022.06.27.03.43.39 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 27 Jun 2022 03:43:39 -0700 (PDT) Message-ID: Date: Mon, 27 Jun 2022 13:43:38 +0300 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Subject: Re: [PATCH v6 00/11] Use obj_cgroup APIs to charge the LRU pages To: Yosry Ahmed , Muchun Song Cc: Andrew Morton , Johannes Weiner , longman@redhat.com, Michal Hocko , Roman Gushchin , Shakeel Butt , Cgroups , duanxiongchun@bytedance.com, Linux Kernel Mailing List , Linux-MM References: <20220621125658.64935-1-songmuchun@bytedance.com> From: =?UTF-8?Q?Mika_Penttil=c3=a4?= In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1656326626; a=rsa-sha256; cv=none; b=fN7Sfde/ggcTeBU2DJwS9W7NER26OAN8iP3fcVylze72zfxzdzwaGLuJE6lNY6mAxxo3kp FklonS9ZkMU2SLDx8a7bRVz9aygSxdVENCSP7SSDUsNC8o7o0/MJf1+Yl2/qQEN0YKd3JB RlFkVy3pivaStdFUtDVgG5qspatxIYU= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1656326626; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=fXF2hGnsXLa/CM4Hy9moMRjgXzHOi4JtoSy15YEtK33y+f2Ir79qOIJ5A41GAApp5tm4bE aICVsv9ouOmvEl8UPekFkkT7ijw3rFAHyHgLF1vzX926T8lZWIQnSLpVqYvu2QdGHPmztu 4wlW0b3jOoC01Zt+eHC+3rWw0b0lWX4= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b="cpR/mdhv"; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf17.hostedemail.com: domain of mpenttil@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=mpenttil@redhat.com X-Stat-Signature: p9n4fdjckifgfq1eaxizsg8jix4s4pqr X-Rspamd-Queue-Id: F04F040024 Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b="cpR/mdhv"; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf17.hostedemail.com: domain of mpenttil@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=mpenttil@redhat.com X-Rspam-User: X-Rspamd-Server: rspam04 X-HE-Tag: 1656326625-397436 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 27.6.2022 11.05, Yosry Ahmed wrote: > On Mon, Jun 27, 2022 at 12:11 AM Muchun Song wrote: >> >> On Sun, Jun 26, 2022 at 03:32:02AM -0700, Yosry Ahmed wrote: >>> On Tue, Jun 21, 2022 at 5:57 AM Muchun Song wrote: >>>> >>>> This version is rebased on mm-unstable. Hopefully, Andrew can get this series >>>> into mm-unstable which will help to determine whether there is a problem or >>>> degradation. I am also doing some benchmark tests in parallel. >>>> >>>> Since the following patchsets applied. All the kernel memory are charged >>>> with the new APIs of obj_cgroup. >>>> >>>> commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages") >>>> commit b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages") >>>> >>>> But user memory allocations (LRU pages) pinning memcgs for a long time - >>>> it exists at a larger scale and is causing recurring problems in the real >>>> world: page cache doesn't get reclaimed for a long time, or is used by the >>>> second, third, fourth, ... instance of the same job that was restarted into >>>> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, >>>> and make page reclaim very inefficient. >>>> >>>> We can convert LRU pages and most other raw memcg pins to the objcg direction >>>> to fix this problem, and then the LRU pages will not pin the memcgs. >>>> >>>> This patchset aims to make the LRU pages to drop the reference to memory >>>> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number >>>> of the dying cgroups will not increase if we run the following test script. >>> >>> This is amazing work! >>> >>> Sorry if I came late, I didn't follow the threads of previous versions >>> so this might be redundant, I just have a couple of questions. >>> >>> a) If LRU pages keep getting parented until they reach root_mem_cgroup >>> (assuming they can), aren't these pages effectively unaccounted at >>> this point or leaked? Is there protection against this? >>> >> >> In this case, those pages are accounted in root memcg level. Unfortunately, >> there is no mechanism now to transfer a page's memcg from one to another. >> >>> b) Since moving charged pages between memcgs is now becoming easier by >>> using the APIs of obj_cgroup, I wonder if this opens the door for >>> future work to transfer charges to memcgs that are actually using >>> reparented resources. For example, let's say cgroup A reads a few >>> pages into page cache, and then they are no longer used by cgroup A. >>> cgroup B, however, is using the same pages that are currently charged >>> to cgroup A, so it keeps taxing cgroup A for its use. When cgroup A >>> dies, and these pages are reparented to A's parent, can we possibly >>> mark these reparented pages (maybe in the page tables somewhere) so >>> that next time they get accessed we recharge them to B instead >>> (possibly asynchronously)? >>> I don't have much experience about page tables but I am pretty sure >>> they are loaded so maybe there is no room in PTEs for something like >>> this, but I have always wondered about what we can do for this case >>> where a cgroup is consistently using memory charged to another cgroup. >>> Maybe when this memory is reparented is a good point in time to decide >>> to recharge appropriately. It would also fix the reparenty leak to >>> root problem (if it even exists). >>> >> >> From my point of view, this is going to be an improvement to the memcg >> subsystem in the future. IIUC, most reparented pages are page cache >> pages without be mapped to users. So page tables are not a suitable >> place to record this information. However, we already have this information >> in struct obj_cgroup and struct mem_cgroup. If a page's obj_cgroup is not >> equal to the page's obj_cgroup->memcg->objcg, it means this page have >> been reparented. I am thinking if a place where a page is mapped (probably >> page fault patch) or page (cache) is written (usually vfs write path) >> is suitable to transfer page's memcg from one to another. But need more > > Very good point about unmapped pages, I missed this. Page tables will > do us no good here. Such a change would indeed require careful thought > because (like you mentioned) there are multiple points in time where > it might be suitable to consider recharging the page (e.g. when the > page is mapped). This could be an incremental change though. Right now > we have no recharging at all, so maybe we can gradually add recharging > to suitable paths. > >> thinking, e.g. How to decide if a reparented page needs to be transferred? > > Maybe if (page's obj_cgroup->memcg == root_mem_cgroup) OR (memcg of > current is not a descendant of page's obj_cgroup->memcg) is a good > place to start? > > My rationale is that if the page is charged to root_mem_cgroup through > reparenting and a process in a memcg is using it then this is probably > an accounting leak. If a page is charged to a memcg A through > reparenting and is used by a memcg B in a different subtree, then > probably memcg B is getting away with using the page for free while A > is being taxed. If B is a descendant of A, it is still getting away > with using the page unaccounted, but at least it makes no difference > for A. > > One could argue that we might as well recharge a reparented page > anyway if the process is cheap (or done asynchronously), and the paths > where we do recharging are not very common. > > All of this might be moot, I am just thinking out loud. In any way > this would be future work and not part of this work. > I think you have to uncharge at the reparented parent to keep balances right (because parent is hierarchically charged thru page_counter). And maybe recharge after that if appropriate. > >> If we need more information to make this decision, where to store those >> information? This is my primary thoughts on this question. > >> >> Thanks. >> >>> Thanks again for this work and please excuse my ignorance if any part >>> of what I said doesn't make sense :) >>> >>>> >>>> ```bash >>>> #!/bin/bash >>>> >>>> dd if=/dev/zero of=temp bs=4096 count=1 >>>> cat /proc/cgroups | grep memory >>>> >>>> for i in {0..2000} >>>> do >>>> mkdir /sys/fs/cgroup/memory/test$i >>>> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs >>>> cat temp >> log >>>> echo $$ > /sys/fs/cgroup/memory/cgroup.procs >>>> rmdir /sys/fs/cgroup/memory/test$i >>>> done >>>> >>>> cat /proc/cgroups | grep memory >>>> >>>> rm -f temp log >>>> ``` >>>> >>>> v5: https://lore.kernel.org/all/20220530074919.46352-1-songmuchun@bytedance.com/ >>>> v4: https://lore.kernel.org/all/20220524060551.80037-1-songmuchun@bytedance.com/ >>>> v3: https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/ >>>> v2: https://lore.kernel.org/all/20210916134748.67712-1-songmuchun@bytedance.com/ >>>> v1: https://lore.kernel.org/all/20210814052519.86679-1-songmuchun@bytedance.com/ >>>> RFC v4: https://lore.kernel.org/all/20210527093336.14895-1-songmuchun@bytedance.com/ >>>> RFC v3: https://lore.kernel.org/all/20210421070059.69361-1-songmuchun@bytedance.com/ >>>> RFC v2: https://lore.kernel.org/all/20210409122959.82264-1-songmuchun@bytedance.com/ >>>> RFC v1: https://lore.kernel.org/all/20210330101531.82752-1-songmuchun@bytedance.com/ >>>> >>>> v6: >>>> - Collect Acked-by and Reviewed-by from Roman and Michal Koutný. Thanks. >>>> - Rebase to mm-unstable. >>>> >>>> v5: >>>> - Lots of improvements from Johannes, Roman and Waiman. >>>> - Fix lockdep warning reported by kernel test robot. >>>> - Add two new patches to do code cleanup. >>>> - Collect Acked-by and Reviewed-by from Johannes and Roman. >>>> - I didn't replace local_irq_disable/enable() to local_lock/unlock_irq() since >>>> local_lock/unlock_irq() takes an parameter, it needs more thinking to transform >>>> it to local_lock. It could be an improvement in the future. >>>> >>>> v4: >>>> - Resend and rebased on v5.18. >>>> >>>> v3: >>>> - Removed the Acked-by tags from Roman since this version is based on >>>> the folio relevant. >>>> >>>> v2: >>>> - Rename obj_cgroup_release_kmem() to obj_cgroup_release_bytes() and the >>>> dependencies of CONFIG_MEMCG_KMEM (suggested by Roman, Thanks). >>>> - Rebase to linux 5.15-rc1. >>>> - Add a new pacth to cleanup mem_cgroup_kmem_disabled(). >>>> >>>> v1: >>>> - Drop RFC tag. >>>> - Rebase to linux next-20210811. >>>> >>>> RFC v4: >>>> - Collect Acked-by from Roman. >>>> - Rebase to linux next-20210525. >>>> - Rename obj_cgroup_release_uncharge() to obj_cgroup_release_kmem(). >>>> - Change the patch 1 title to "prepare objcg API for non-kmem usage". >>>> - Convert reparent_ops_head to an array in patch 8. >>>> >>>> Thanks for Roman's review and suggestions. >>>> >>>> RFC v3: >>>> - Drop the code cleanup and simplification patches. Gather those patches >>>> into a separate series[1]. >>>> - Rework patch #1 suggested by Johannes. >>>> >>>> RFC v2: >>>> - Collect Acked-by tags by Johannes. Thanks. >>>> - Rework lruvec_holds_page_lru_lock() suggested by Johannes. Thanks. >>>> - Fix move_pages_to_lru(). >>>> >>>> Muchun Song (11): >>>> mm: memcontrol: remove dead code and comments >>>> mm: rename unlock_page_lruvec{_irq, _irqrestore} to >>>> lruvec_unlock{_irq, _irqrestore} >>>> mm: memcontrol: prepare objcg API for non-kmem usage >>>> mm: memcontrol: make lruvec lock safe when LRU pages are reparented >>>> mm: vmscan: rework move_pages_to_lru() >>>> mm: thp: make split queue lock safe when LRU pages are reparented >>>> mm: memcontrol: make all the callers of {folio,page}_memcg() safe >>>> mm: memcontrol: introduce memcg_reparent_ops >>>> mm: memcontrol: use obj_cgroup APIs to charge the LRU pages >>>> mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance function >>>> mm: lru: use lruvec lock to serialize memcg changes >>>> >>>> fs/buffer.c | 4 +- >>>> fs/fs-writeback.c | 23 +- >>>> include/linux/memcontrol.h | 218 +++++++++------ >>>> include/linux/mm_inline.h | 6 + >>>> include/trace/events/writeback.h | 5 + >>>> mm/compaction.c | 39 ++- >>>> mm/huge_memory.c | 153 ++++++++-- >>>> mm/memcontrol.c | 584 +++++++++++++++++++++++++++------------ >>>> mm/migrate.c | 4 + >>>> mm/mlock.c | 2 +- >>>> mm/page_io.c | 5 +- >>>> mm/swap.c | 49 ++-- >>>> mm/vmscan.c | 66 ++--- >>>> 13 files changed, 776 insertions(+), 382 deletions(-) >>>> >>>> >>>> base-commit: 882be1ed6b1b5073fc88552181b99bd2b9c0031f >>>> -- >>>> 2.11.0 >>>> >>>> >>> >