From: David Hildenbrand
Organization: Red Hat
Date: Thu, 8 Jun 2023 08:34:10 +0200
To: David Rientjes, Mike Kravetz
Cc: Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin,
 lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Peter Xu,
 Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
In-Reply-To: <686e3e61-704e-1258-8a8b-f18399b41668@google.com>
References: <20230306191944.GA15773@monkey> <20230602172723.GA3941@monkey>
 <7e0ce268-f374-8e83-2b32-7c53f025fec5@google.com>
 <7c42a738-d082-3338-dfb5-fd28f75edc58@redhat.com>
 <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>
 <20230607220651.GC4122@monkey>
 <686e3e61-704e-1258-8a8b-f18399b41668@google.com>

On 08.06.23 02:02, David Rientjes wrote:
> On Wed, 7 Jun 2023, Mike Kravetz wrote:
>
>>>>>> Are there strong objections to extending hugetlb for this support?
>>>>>
>>>>> I don't want to get too involved in this discussion (busy), but I
>>>>> absolutely agree on the points that were raised at LSF/MM that
>>>>>
>>>>> (A) hugetlb is complicated and very special (many things not integrated
>>>>> with core-mm, so we need special-casing all over the place).
>>>>> [example: what is a pte?]
>>>>>
>>>>> (B) We added a bunch of complexity in the past that some people
>>>>> considered very important (and it was not feature frozen, right? ;) ).
>>>>> Looking back, we might just not have done some of that, or done it
>>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
>>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>>>>> because it fails with NUMA/fork, ...)
>>>>>
>>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
>>>>> out of reach, maybe even impossible with all the complexity we added
>>>>> over the years (well, and keep adding).
>>>>>
>>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>>>>> old. So we managed to get quite far without that optimization.
>>>>>
>
> Sane handling for memory poisoning and optimizations for live migration
> are both much more important for the real-world 1GB hugetlb user, so it
> doesn't quite have that lengthy of a history.
>
> Unfortunately, cloud providers receive complaints about both of these from
> customers. They are one of the most significant causes for poor customer
> experience.
>
> While people have proposed 1GB THP support in the past, it was nacked, in
> part, because of the suggestion to just use existing 1GB support in
> hugetlb instead :)

Yes, because I still think that the use for "transparent" (for the user)
nowadays is very limited and not worth the complexity.

IMHO, what you really want is a pool of large pages (with guarantees
about availability and nodes) and fine control over who gets these
pages. That's what hugetlb provides.
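The pool guarantees mentioned above can be observed from user space. A
minimal sketch, assuming the usual hugetlb sysfs layout
(/sys/kernel/mm/hugepages for the global pools and
/sys/devices/system/node/nodeN/hugepages for the per-node pools); the
helper names pool_dir, read_counter and pool_status are hypothetical and
not part of any kernel or library API:

```python
from pathlib import Path
from typing import Dict, Optional

# Assumed sysfs locations of the hugetlb pools (global and per NUMA node).
GLOBAL_ROOT = Path("/sys/kernel/mm/hugepages")
NODE_ROOT = Path("/sys/devices/system/node")


def pool_dir(size_kb: int, node: Optional[int] = None) -> Path:
    """Return the sysfs directory of one hugetlb pool (hypothetical helper)."""
    name = f"hugepages-{size_kb}kB"
    if node is None:
        return GLOBAL_ROOT / name
    return NODE_ROOT / f"node{node}" / "hugepages" / name


def read_counter(directory: Path, counter: str) -> Optional[int]:
    """Read one integer counter file; None if it is missing or unreadable."""
    try:
        return int((directory / counter).read_text().strip())
    except (OSError, ValueError):
        return None


def pool_status(size_kb: int, node: Optional[int] = None) -> Dict[str, Optional[int]]:
    """Report how many pages the pool holds and how many are still free."""
    directory = pool_dir(size_kb, node)
    return {
        name: read_counter(directory, name)
        for name in ("nr_hugepages", "free_hugepages")
    }


if __name__ == "__main__":
    one_gb_in_kb = 1024 * 1024
    print("global 1GB pool:", pool_status(one_gb_in_kb))
    print("node 0 1GB pool:", pool_status(one_gb_in_kb, node=0))
```

On a machine without hugetlb support, or without a pool of the requested
size, both counters simply come back as None.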

In contrast to THP, you don't want to allow for

* Partially mmap, mremap, munmap, mprotect them
* Partially sharing them / COW'ing them
* Partially mixing them with other anon pages (MADV_DONTNEED + refault)
* Exclude them from some features KSM/swap
* (swap them out and eventually split them for that)

Because you don't want to get these pages PTE-mapped by the system
*unless* there is a real reason (HGM, hwpoison) -- you want guarantees.
Once such a page is PTE-mapped, you only want to collapse in place.

But you don't want special-HGM, you simply want the core to PTE-map them
like a (file) THP.

IMHO, getting that realized would be much easier if we didn't have to
care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD
sharing), but maybe there is a way ...

>
>>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
>>>>> guess nobody disagrees on that.
>>>>>
>>>>>
>>>>> But as discussed in that session, maybe we should just start anew and
>>>>> implement something that integrates nicely with the core, instead of
>>>>> making hugetlb more complicated and even more special.
>>>>>
>
> Certainly an ideal would be where we could support everybody's use cases
> in a much more cohesive way with the rest of the core MM. I'm
> particularly concerned about how long it will take to get to that state
> even if we had kernel developers committed to doing the work. Even if we
> had a design for this new subsystem that was more tightly coupled with the
> core MM, it would take O(years) to implement, test, extend for other
> architectures, and that's before any existing users of hugetlb could
> make the changes in the rest of their software stack to support it.

One interesting experiment would be to just take hugetlb and remove all
complexity (strip it to its core: a pooling of large pages without
special MAP_PRIVATE support, PMD sharing, reservations, ...).
Then, see how to get core-mm to just treat them like PUD/PMD-mapped folios
that can get PTE-mapped -- just like we have with FS-level THP.

Maybe we could then factor out what's shared with the old hugetlb
implementations (e.g., pooling) and have both co-exist (e.g., configured
at runtime).

The user-space interface for hugetlb would not change (well, except fail
MAP_PRIVATE for now)

(especially, no messing with anon hugetlb pages)

Again, the spirit would be "teach the core to just treat them like folios
that can get PTE-mapped" instead of "add HGM to hugetlb". If we can achieve
that without a hugetlb v2, great. But I think that will be harder ... but
I might be just wrong.

-- 
Cheers,

David / dhildenb