Subject: [PATCH v6 0/5] THP Shrinker
Date: Wed, 2 Nov 2022 23:01:42 -0700
From: Alexander Zhu

Changelog:
v5 to v6
    -removed PageSwapCache check from add_underutilized_thp as split_huge_page takes care of this already.
    -added check for PageHuge in add_underutilized_thp to account for hugetlbfs pages.
    -added Yu Zhao as author for the second patch
v4 to v5
    -split out split_huge_page changes into three different patches. One for zapping zero pages, one for not remapping zero pages, and one for self tests.
    -fixed bug with lru_to_folio, was corrupting the folio
    -fixed bug with memchr_inv in mm/thp_utilization. zero page should mean !memchr_inv(kaddr, 0, PAGE_SIZE)
v3 to v4
    -changed thp_utilization_bucket() function to take folios, saves conversion between page and folio
    -added newlines where they were previously missing in v2-v3
    -moved the thp utilization code out into its own file under mm/thp_utilization.c
    -removed is_anonymous_transparent_hugepage function. Use folio_test_anon and folio_test_trans_huge instead.
    -changed thp_number_utilized_pages to use memchr_inv
    -added some comments regarding trylock
    -changed the relock to be unconditional in low_util_free_page
    -only expose can_shrink_thp, abstract the thp_utilization and bucket logic to be private to mm/thp_utilization.c
v2 to v3
    -put_page() after trylock_page in low_util_free_page. put() to be called after get() call
    -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.
    -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
    -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru
    -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed.
v1 to v2
    -reversed ordering of is_transparent_hugepage and PageAnon in is_anon_transparent_hugepage, page->mapping is only meaningful for user pages
    -only trigger the unmap_clean/zap in split_huge_page on anonymous THPs. We cannot zap zero pages for file THPs.
    -modified split_huge_page self test based off more recent changes.
    -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
    -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page(), and the THP Shrinker
    -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. Holding it would lead to deadlock, as split_huge_page calls on_each_cpu_mask
    -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq.
RFC to v1
    -refactored out the code to obtain the thp_utilization_bucket, as that now has to be used in multiple places.
    -added support to map to the read only zero page when splitting a THP registered with userfaultfd.
    -added a self test to verify that userfaultfd change is working.
    -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing. We have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste.
    -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru.
    -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC.

Transparent Hugepages use a larger page size of 2MB in comparison to
normal sized pages that are 4KB. A larger page size allows for fewer TLB
misses and thus more efficient use of the CPU. Using a larger page size
also results in more memory waste, which can hurt performance in some
use cases. THPs are currently enabled in the Linux Kernel by
applications in limited virtual address ranges via the madvise system
call. The THP shrinker tries to find a balance between increased use of
THPs, and increased use of memory. It reduces memory usage by splitting
the underutilized THPs that are identified by the thp_utilization
scanner.

In our experiments we have noticed that the least utilized THPs are
almost entirely unutilized.
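
As a rough userspace illustration of what "utilized" means in this
series: the changelog defines a zero page as one for which
!memchr_inv(kaddr, 0, PAGE_SIZE) holds, so a utilized subpage is any 4KB
subpage containing at least one non-zero byte. The sketch below models
that rule on a plain 2MB byte buffer. It is not the kernel code; the
function names is_zero_page and count_utilized_subpages are hypothetical
and exist only for this example.

```python
PAGE_SIZE = 4096      # 4KB base page
HPAGE_PMD_NR = 512    # number of 4KB subpages in one 2MB THP


def is_zero_page(page: bytes) -> bool:
    """Return True if every byte is zero.

    Userspace analogue of !memchr_inv(kaddr, 0, PAGE_SIZE); the name is
    hypothetical and not taken from the patches.
    """
    return page.count(0) == len(page)


def count_utilized_subpages(thp) -> int:
    """Count the 4KB subpages of a 2MB buffer holding any non-zero byte."""
    if len(thp) != PAGE_SIZE * HPAGE_PMD_NR:
        raise ValueError("expected exactly one 2MB huge page")
    view = memoryview(thp)
    utilized = 0
    for i in range(HPAGE_PMD_NR):
        subpage = bytes(view[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
        if not is_zero_page(subpage):
            utilized += 1
    return utilized


if __name__ == "__main__":
    # A huge page where only two subpages were ever written to.
    thp = bytearray(PAGE_SIZE * HPAGE_PMD_NR)
    thp[5 * PAGE_SIZE + 17] = 1
    thp[511 * PAGE_SIZE] = 0xFF
    print(count_utilized_subpages(thp), "of", HPAGE_PMD_NR, "subpages utilized")
```

A THP like the one built in the usage block would have two utilized
subpages and 510 zero pages, which is the kind of page the shrinker is
meant to split.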

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

Above is a sample obtained from one of our test machines when THP is
always enabled. Each row lists the number of THPs in the bucket followed
by the number of zero pages they contain. Of the 1331 THPs in this
thp_utilization sample that have from 0-50 utilized subpages, we see
that there are 680884 zero pages. This comes out to
680884 / (512 * 1331) = 99.91% zero pages in the least utilized bucket.
This represents 680884 * 4KB = 2.7GB memory waste.

Also note that the vast majority of THPs are either in the least
utilized [0-50] or most utilized [460-512] buckets. The least utilized
THPs are responsible for almost all of the memory waste when THP is
always enabled. Thus by clearing out THPs in the lowest utilization
bucket we recover almost all of the wasted memory while keeping most of
the improvement in CPU efficiency. We have seen similar results on our
production hosts.

This patchset introduces the THP shrinker we have developed to identify
and split the least utilized THPs. It includes the thp_utilization
changes that group anonymous THPs into buckets, the split_huge_page()
changes that identify and zap zero 4KB pages within THPs, and the
shrinker changes. It should be noted that the split_huge_page() changes
are based off previous work done by Yu Zhao.

In the future, we intend to allow additional tuning to the shrinker
based on workload depending on CPU/IO/Memory pressure and the amount of
anonymous memory. The long term goal is to eventually always enable THP
for all applications and deprecate madvise entirely.

In production we thus far have observed a 2-3% reduction in overall CPU
usage on stateless web servers when THP is always enabled.
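
The arithmetic above can be rechecked with a short script. This is a
minimal sketch, assuming ten buckets over 512 subpages with boundaries
computed by integer division, which reproduces the range labels in the
sample output. The SAMPLE rows are copied from that output; the names
bucket_ranges, zero_fraction and wasted_kb are hypothetical and are not
functions from the patches.

```python
HPAGE_PMD_NR = 512   # 4KB subpages per 2MB THP
NR_BUCKETS = 10      # assumed bucket count, matching the ten sample rows
PAGE_KB = 4          # size of one base page in KB

# (number of THPs, number of zero pages) per bucket, from the sample output
SAMPLE = [
    (1331, 680884), (9, 3983), (3, 1187), (0, 0), (2, 539),
    (5, 1135), (1, 192), (0, 0), (1, 57), (400, 13),
]


def bucket_ranges(nr_buckets=NR_BUCKETS, nr_subpages=HPAGE_PMD_NR):
    """Return inclusive (start, end) utilized-subpage ranges per bucket."""
    ranges = []
    for i in range(nr_buckets):
        start = i * nr_subpages // nr_buckets
        if i + 1 == nr_buckets:
            end = nr_subpages          # last bucket includes a full THP
        else:
            end = (i + 1) * nr_subpages // nr_buckets - 1
        ranges.append((start, end))
    return ranges


def zero_fraction(nr_thps, nr_zero_pages):
    """Fraction of subpages in a bucket that are zero pages."""
    return nr_zero_pages / (nr_thps * HPAGE_PMD_NR)


def wasted_kb(nr_zero_pages):
    """Memory held by zero pages, in KB."""
    return nr_zero_pages * PAGE_KB


if __name__ == "__main__":
    for (start, end), (thps, zeros) in zip(bucket_ranges(), SAMPLE):
        print(f"Utilized[{start}-{end}]: {thps} {zeros}")

    thps, zeros = SAMPLE[0]
    total_zeros = sum(z for _, z in SAMPLE)
    print(f"zero pages in lowest bucket: {zero_fraction(thps, zeros):.2%}")
    print(f"waste in lowest bucket: {wasted_kb(zeros)} KB")
    print(f"share of all zero pages: {zeros / total_zeros:.2%}")
```

Note that 680884 * 4KB is 2723536 KB, so the 2.7GB figure corresponds to
counting 10^6 KB per GB. The last line of the usage block shows how much
of the total zero-page count in this sample sits in the [0-50] bucket.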

Alexander Zhu (4):
  mm: add thp_utilization metrics to debugfs
  mm: do not remap clean subpages when splitting isolated thp
  mm: add selftests to split_huge_page() to verify unmap/zap of zero
    pages
  mm: THP low utilization shrinker

Yu Zhao (1):
  mm: changes to split_huge_page() to free zero filled tail pages

 Documentation/admin-guide/mm/transhuge.rst    |   9 +
 include/linux/huge_mm.h                       |   9 +
 include/linux/list_lru.h                      |  24 ++
 include/linux/mm_types.h                      |   5 +
 include/linux/rmap.h                          |   2 +-
 include/linux/vm_event_item.h                 |   3 +
 mm/Makefile                                   |   2 +-
 mm/huge_memory.c                              | 156 +++++++++++-
 mm/list_lru.c                                 |  49 ++++
 mm/migrate.c                                  |  73 +++++-
 mm/migrate_device.c                           |   4 +-
 mm/page_alloc.c                               |   6 +
 mm/thp_utilization.c                          | 222 ++++++++++++++++++
 mm/vmstat.c                                   |   3 +
 .../selftests/vm/split_huge_page_test.c       | 115 ++++++++-
 tools/testing/selftests/vm/vm_util.c          |  23 ++
 tools/testing/selftests/vm/vm_util.h          |   3 +
 17 files changed, 690 insertions(+), 18 deletions(-)
 create mode 100644 mm/thp_utilization.c

-- 
2.30.2