Date: Tue, 10 Jan 2023 15:40:07 +0100
To: "Yin, Fengwei", linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz, hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com, ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com, mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org, dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
References: <20230109072232.2398464-1-fengwei.yin@intel.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [RFC PATCH 0/4] Multiple consecutive page for anonymous mapping

On 10.01.23 04:57, Yin, Fengwei wrote:
>
>
> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>> On 09.01.23 08:22, Yin Fengwei wrote:
>>> In a nutshell:  4k is too small and 2M is too big.  We started
>>> asking ourselves whether there was something in the middle that
>>> we could do.  This series shows what that middle ground might
>>> look like.  It provides some of the benefits of THP while
>>> eliminating some of the downsides.
>>>
>>> This series uses "multiple consecutive pages" (mcpages) of
>>> between 8K and 2M of base pages for anonymous user space mappings.
>>> This will lead to less internal fragmentation versus 2M mappings
>>> and thus less memory consumption and wasted CPU time zeroing
>>> memory which will never be used.
>>
>> Hi,
> Hi,
>>
>> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
> For this patchset, yes. But mcpage can be enabled for page cache,
> swapping etc.

Right, PTE-mapping higher-order pages, in a faultaround fashion. But for pagecache etc. that doesn't require mcpage IMHO. I think it's the natural evolution of folios that Willy envisioned at some point.

>
>>
>> Some thoughts:
>>
>> (1) Faultaround might be unexpected for some workloads and increase
>>     memory consumption unnecessarily.
> Comparing to THP, the memory consumption and latency introduced by
> mcpage is minor.

But it exists :)

>
>>
>> Yes, something like that can happen with THP BUT
>>
>> (a) THP can be disabled or is frequently only enabled for madvised
>>     regions -- for example, exactly for this reason.
>> (b) Some workloads (especially memory ballooning) rely on memory not
>>     suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>>     because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>>     there is a PTE page table, we won't suddenly get a THP populated
>>     again (unless khugepaged is configured to fill holes).
>>
>>
>> I strongly assume we will need something similar to force-disable, selectively-enable etc.
> Agree.

Thinking again, we might want to piggy-back on the THP machinery/config knobs completely, hmm. After all, it's a similar concept to a THP (once we properly handle folios), just that we are not able to PMD-map the folio because it is too small.

Some applications that trigger MADV_NOHUGEPAGE don't want to get more pages populated than actually accessed.
userfaultfd users come to mind, where we might not even have the guarantee to see a UFFD registration before enabling MADV_NOHUGEPAGE and filling out some pages ... if we'd populate too many PTEs, we could miss uffd faults later ...

>
>>
>>
>> (2) This steals consecutive pages to immediately split them up
>>
>> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting up them immediately again smells like being the lowest priority among all higher-order users.
>>
> The motivations to split it immediately are:
> 1. All the sub-pages is just normal 4K page. No other changes need be
>    added to handle it.
> 2. splitting it before use doesn't involved complicated page lock handling.

I think for an upstream version we really want to avoid these splits.

>>>
>>> In the implementation, we allocate high order page with order of
>>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>>> physical contiguous memory is used and benefit sequential memory
>>> access latency.
>>>
>>> Then split the high order page. By doing this, the sub-page of
>>> mcpage is just 4K normal page. The current kernel page
>>> management is applied to "mc" pages without any changes. Batching
>>> page faults is allowed with mcpage and reduce page faults number.
>>>
>>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>>> increases memory consumption and latency of allocation page
>>> comparing to 4K base page.
>>>
>>> This series is the first step of mcpage. The future work can be
>>> enable mcpage for more components like page cache, swapping etc.
>>> Finally, most pages in system will be allocated/free/reclaimed
>>> with mcpage order.
>>
>> I think avoiding new, hard-to-get terminology ("mcpage") might be better.
>> I know, everybody wants to give its child a name, but the name is not really future proof: "multiple consecutive pages" might at one point be maybe just a folio.
>>
>> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>>
>> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
> "cont-pte" on ARM64 has fixed size which match the silicon definition.
> mcpage allows software/user to define the size which is not necessary
> to be exact same as silicon defined. Thanks.

Yes. And the whole concept is abstracted away: it's logically a single, larger PTE, and we can only map/unmap in that PTE granularity.

-- 
Thanks,

David / dhildenb
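
For reference, the order arithmetic from the quoted cover letter (order 2 for a 16KB mcpage, sizes between 8K and 2M) can be sketched roughly as below. This is only an illustration, assuming 4K base pages; BASE_PAGE_SHIFT and the helper names are hypothetical and are not taken from the series or from the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: 4K base pages, as in the thread. Hypothetical name. */
#define BASE_PAGE_SHIFT 12

/* Bytes covered by one physically contiguous allocation of this order. */
static inline size_t order_to_bytes(unsigned int order)
{
	/* Guard against shifting past the width of size_t. */
	assert(BASE_PAGE_SHIFT + order < sizeof(size_t) * CHAR_BIT);
	return (size_t)1 << (BASE_PAGE_SHIFT + order);
}

/* Base pages (and hence PTEs) populated when one such unit is mapped. */
static inline size_t order_to_nr_pages(unsigned int order)
{
	return (size_t)1 << order;
}

/*
 * Bytes that get zeroed and mapped but stay unused if the application
 * only ever touches a single base page inside the populated unit.
 */
static inline size_t untouched_bytes(unsigned int order)
{
	return order_to_bytes(order) - order_to_bytes(0);
}
```

Under these assumptions, order 1 is the 8K lower bound, order 2 is 16KB spread over four PTEs, and order 9 is the 2M THP size. A fault that populates an order-2 unit for one touched page leaves 12KB populated but unused, which is the kind of extra consumption point (1) is about.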