From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <01899bc3-1920-4ff2-a470-decd1c282e38@gmail.com>
Date: Tue, 30 Jul 2024 18:22:46 +0100
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
From: Usama Arif <usamaarif642@gmail.com>
To: David Hildenbrand, akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, roman.gushchin@linux.dev, yuzhao@google.com, baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com
References: <20240730125346.1580150-1-usamaarif642@gmail.com> <3cd1b07d-7b02-4d37-918a-5759b23291fb@gmail.com> <73b97a03-3742-472f-9a36-26ba9009d715@gmail.com> <95ed1631-ff62-4627-8dc6-332096e673b4@redhat.com>
In-Reply-To: <95ed1631-ff62-4627-8dc6-332096e673b4@redhat.com>
Content-Type: text/plain; charset=UTF-8
On 30/07/2024 17:11, David Hildenbrand wrote:
> On 30.07.24 17:19, Usama Arif wrote:
>>
>>
>> On 30/07/2024 16:14, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>> The current upstream default policy for THP is always. However, Meta
>>>>> uses madvise in production, as the current THP=always policy vastly
>>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>>> excessive memory pressure and premature OOM killing.
>>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>>> THP=always. Using madvise hints means THPs aren't "transparent" and
>>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>>> collapse pages into THPs can be slow and unpredictable in terms of
>>>>> performance (i.e. you don't know when the collapse will happen), while
>>>>> production environments require predictable performance. If there is
>>>>> enough memory available, it's better for both performance and
>>>>> predictability to have a THP from fault time, i.e. THP=always, rather
>>>>> than wait for khugepaged to collapse it, and to deal with sparsely
>>>>> populated THPs when the system is running out of memory.
>>>>>
>>>>> This patch series is an attempt to mitigate the issue of running out
>>>>> of memory when THP is always enabled. During runtime, whenever a THP
>>>>> is faulted in or collapsed by khugepaged, the THP is added to a list.
>>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>>> shrinker, which goes through the list and checks if the THP is
>>>>> underutilized, i.e. how many of the base 4K pages of the entire THP
>>>>> are zero-filled. If this number goes above a certain threshold, the
>>>>> shrinker will attempt to split that THP. Then at remap time, the pages
>>>>> that were zero-filled are not remapped, hence saving memory. This
>>>>> method avoids the downside of wasting memory in areas where THP is
>>>>> sparsely filled when THP is always enabled, while still providing the
>>>>> upsides of THPs, like reduced TLB misses, without having to use
>>>>> madvise.
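(For context, the underutilization check boils down to asking, per 4K
subpage, "does this contain only zeroes?". A minimal sketch of such a
check -- the helper name is illustrative, this is not the code in the
series:

    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Sketch only: is this 4K subpage entirely zero-filled? */
    static bool subpage_is_zero_filled(struct page *page)
    {
            void *addr = kmap_local_page(page);
            /* memchr_inv() returns NULL iff every byte in the range
             * matches the given byte. */
            bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

            kunmap_local(addr);
            return zero;
    }

The shrinker splits a THP once the count of such subpages crosses the
configured threshold.)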
>>>>>
>>>>> Meta production workloads that were CPU bound (>99% CPU utilization)
>>>>> were tested with the THP shrinker. The results after 2 hours are as
>>>>> follows:
>>>>>
>>>>>                             | THP=madvise |  THP=always   | THP=always
>>>>>                             |             |               | + shrinker series
>>>>>                             |             |               | + max_ptes_none=409
>>>>> -----------------------------------------------------------------------------
>>>>> Performance improvement     |      -      |    +1.8%      |     +1.7%
>>>>> (over THP=madvise)          |             |               |
>>>>> -----------------------------------------------------------------------------
>>>>> Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> max_ptes_none=409 means that any THP that has more than 409 out of
>>>>> 512 (80%) zero-filled pages will be split.
>>>>>
>>>>> To test out the patches, the below commands without the shrinker will
>>>>> invoke the OOM killer immediately and kill stress, but will not fail
>>>>> with the shrinker:
>>>>>
>>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>> mkdir /sys/fs/cgroup/test
>>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>>> # each THP, i.e. vm-stride 50K.
>>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke the
>>>>> # OOM killer.
>>>>> # Without the shrinker, the OOM killer is invoked immediately,
>>>>> # irrespective of the max_ptes_none value, and kills stress.
>>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>>
>>>>> Patches 1-2 add back helper functions that were previously removed
>>>>> to operate on page lists (needed by patch 3).
>>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>>> waiting for page reclaim or migration.
>>>>> Patch 4 is a prerequisite for the THP shrinker to not remap
>>>>> zero-filled subpages when splitting a THP.
>>>>> Patch 6 adds support for the THP shrinker.
>>>>>
>>>>> (This patch series restarts the work on having a THP shrinker in the
>>>>> kernel, originally done in
>>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>>> The THP shrinker in this series is significantly different from the
>>>>> original one, hence it's labelled v1 (although the prerequisite to
>>>>> not remap clean subpages is the same).)
>>>>
>>>> As shared previously, there is one issue with uffd (even when it is
>>>> currently not active for a VMA!), where we must not zap present page
>>>> table entries.
>>>>
>>>> Something that is always possible (assuming no GUP pins, of course) is
>>>> replacing the zero-filled subpages by shared zeropages.
>>>>
>>>> Is that being done in this patch set already, or are we creating
>>>> pte_none() entries?
>>>>
>>>
>>> I think that's done in Patch 4/6. In function try_to_unmap_unused, we
>>> have the below, which I think does what you are suggesting, i.e. point
>>> to the shared zeropage and not clear the pte for a uffd-armed VMA:
>>>
>>>     if (userfaultfd_armed(pvmw->vma)) {
>>>         newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>                            pvmw->vma->vm_page_prot));
>>>         ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>         set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>     }
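(For intuition, a simplified userspace illustration, not part of the
series: a pte pointing at the shared zeropage behaves like untouched
anonymous memory -- reads return zeros without allocating a page, and
only a write triggers a copy-on-write allocation:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* A read fault on untouched anonymous memory maps the shared
             * zeropage; no real page is allocated and RSS stays flat. */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("%d\n", p[0]);   /* prints 0, backed by the zeropage */

            /* Only a write allocates a private page via copy-on-write. */
            p[0] = 1;
            return 0;
    }

Mapping the zero-filled subpages this way after a split restores exactly
that state, which is where the memory saving comes from.)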
>>
>>
>> Ah, are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to
>> false even if it's uffd? I think something like below would work in
>> that case.
>
> I remember one ugly case in QEMU with postcopy live-migration where we
> must not zap zero-filled pages. I am not 100% sure regarding THP (if it
> could be enabled at that point), but imagine the following:
>
> 1) mmap(), enable THP
> 2) Migrate a bunch of pages from the source during precopy (writing to
>    the memory). Might end up creating THPs (during fault/khugepaged)
> 3) Register UFFD on the VMA
> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
> 5) Discard any pages that have been re-dirtied or not migrated yet
> 6) Migrate-on-demand any holes using uffd
>
> If we discard zero-filled pages between 2) and 3), we might get wrong
> uffd notifications in 6) for pages that have already been migrated.
>
> I'll have to check if that actually happens in that sequence in QEMU:
> if QEMU would disable THP right before 2) we would be safe. But I
> recall that it is not the case :/
>

Thanks for the example! Just to understand the issue better, as I am not
very familiar with the live-migration code: the problem is only for
zero-filled pages that were migrated, right?

If a THP is created, a subpage of it was a zero-filled page that was
migrated, and the THP is split before the VMA is armed with uffd, then
userfaultfd_armed(pvmw->vma) will return false when splitting and the
entry will become pte_none. Afterwards, when the destination faults on
it, uffd will see that the pte is clear and will request the zero-page
back from the source, even though it has already been migrated.

If I understand the example correctly, the below diff over patch 6
should be good, i.e. just point to the shared zeropage instead of doing
pte_clear. This should still use the same amount of memory, although
ptep_clear_flush means it might be slightly more expensive.

diff --git a/mm/migrate.c b/mm/migrate.c
index 2731ac20ff33..52aa4770fbed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,14 +206,10 @@ static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
 	if (dirty)
 		return false;
 
-	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
-
-	if (userfaultfd_armed(pvmw->vma)) {
-		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
-					       pvmw->vma->vm_page_prot));
-		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
-		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
-	}
+	newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+				       pvmw->vma->vm_page_prot));
+	ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
 	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
 	return true;
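
(For completeness, a rough userspace sketch of the uffd interaction
being discussed -- illustrative only; the 2MB region size and
MISSING-mode registration are assumptions for the example. With the diff
above, a split leaves zero-filled subpages mapped to the shared
zeropage, so their ptes stay present and no MISSING event is raised for
them; only genuinely unpopulated pte_none holes still reach the handler:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            /* Create and handshake a userfaultfd. */
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = { .api = UFFD_API };
            ioctl(uffd, UFFDIO_API, &api);

            /* A PMD-sized anonymous region (2MB, for the example). */
            size_t len = 2UL << 20;
            char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Arm the VMA for MISSING faults (step 3 in the sequence
             * above). */
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)area, .len = len },
                    .mode = UFFDIO_REGISTER_MODE_MISSING,
            };
            ioctl(uffd, UFFDIO_REGISTER, &reg);

            /* From here on, touching a pte_none page blocks until a
             * handler resolves it via UFFDIO_COPY/UFFDIO_ZEROPAGE. A
             * subpage remapped to the shared zeropage is present, so
             * reading it returns 0 immediately and no event is raised. */
            return 0;
    }

That asymmetry is why pointing zapped subpages at the shared zeropage,
rather than leaving pte_none holes, avoids spurious faults for pages a
postcopy-style user has already migrated.)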