Message-ID: <99aa61b6-afc9-445f-8f50-1e017450efd1@huawei.com>
Date: Tue, 25 Jun 2024 20:23:20 +0800
From: Kefeng Wang
Subject: Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings
To: Baolin Wang, Ryan Roberts, Catalin Marinas, Will Deacon,
 Ard Biesheuvel, Marc Zyngier, James Morse, Andrey Ryabinin,
 Andrew Morton, Matthew Wilcox, Mark Rutland, David Hildenbrand,
 John Hubbard, Zi Yan, Barry Song <21cnbao@gmail.com>, Alistair Popple,
 Yang Shi, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
 "H. Peter Anvin", "Yin, Fengwei"
References: <20240215103205.2607016-1-ryan.roberts@arm.com>
 <20240215103205.2607016-19-ryan.roberts@arm.com>
 <1285eb59-fcc3-4db8-9dd9-e7c4d82b1be0@huawei.com>
 <8d57ed0d-fdd0-4fc6-b9f1-a6ac11ce93ce@arm.com>
 <018b5e83-789e-480f-82c8-a64515cdd14a@huawei.com>

On 2024/6/25 15:23, Baolin Wang wrote:
>
>
> On 2024/6/25 11:16, Kefeng Wang wrote:
>>
>>
>> On 2024/6/24 23:56, Ryan Roberts wrote:
>>> + Baolin Wang and Yin Fengwei, who may be able to help with this.
>>>
>>>
>>> Hi Kefeng,
>>>
>>> Thanks for the report!
>>>
>>>
>>> On 24/06/2024 15:30, Kefeng Wang wrote:
>>>> Hi Ryan,
>>>>
>>>> A big regression on page-fault3 ("Separate file shared mapping page
>>>> fault") testcase from will-it-scale on arm64, no issue on x86,
>>>>
>>>> ./page_fault3_processes -t 128 -s 5
>>>
>>> I see that this program is mkstemp'ing a file at
>>> "/tmp/willitscale.XXXXXX". Based on your description, I'm inferring
>>> that /tmp is backed by ext4 with your large folio patches enabled?
>>
>> Yes, /tmp is mounted on ext4; sorry, I forgot to mention that.
>>
>>>
>>>>
>>>> 1) large folio disabled on ext4:
>>>>     92378735
>>>> 2) large folio enabled on ext4 + CONTPTE enabled
>>>>     16164943
>>>> 3) large folio enabled on ext4 + CONTPTE disabled
>>>>     80364074
>>>> 4) large folio enabled on ext4 + CONTPTE enabled + large folio
>>>>    mapping enabled in finish_fault()[2]
>>>>     299656874
>>>>
>>>> We found *contpte_convert* consumes lots of CPU (76%) in case 2),
>>>
>>> contpte_convert() is expensive and to be avoided; in this case I
>>> expect it is repainting the PTEs with the PTE_CONT bit added in, and
>>> to do that it needs to invalidate the TLB for the virtual range. The
>>> code is there to mop up user space patterns where each page in a
>>> range is temporarily made RO, then later changed back. In this case,
>>> we want to re-fold the contpte range once all pages have been
>>> serviced in RO mode.
>>>
>>> Of course this path is only intended as a fallback, and the more
>>> optimal approach is to set_ptes() the whole folio in one go where
>>> possible - kind of what you are doing below.
>>>
>>>> and disappeared with the following change [2]. It is easy to
>>>> understand the difference between case 2) and case 4): case 2)
>>>> always maps a single page but always tries to fold contpte
>>>> mappings, which takes a lot of time. Case 4) is a workaround; is
>>>> there any better suggestion?
>>>
>>> See below.
>>>
>>>>
>>>> Thanks.
>>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale
>>>> [2] enable large folio mapping in finish_fault()
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 00728ea95583..5623a8ce3a1e 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4880,7 +4880,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>           * approach also applies to non-anonymous-shmem faults to avoid
>>>>           * inflating the RSS of the process.
>>>>           */
>>>> -       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>>>> +       if (unlikely(userfaultfd_armed(vma))) {
>>>
>>> The change to make finish_fault() handle multiple pages in one go is
>>> new; added by Baolin Wang at [1]. That extra conditional that you
>>> have removed is there to prevent RSS reporting bloat. See the
>>> discussion that starts at [2].
>>>
>>> Anyway, it was my vague understanding that the fault around
>>> mechanism (do_fault_around()) would ensure that (by default) 64K
>>> worth of pages get mapped together in a single set_ptes() call, via
>>> filemap_map_pages() -> filemap_map_folio_range(). Looking at the
>>> code, I guess fault around only applies to read faults. This test is
>>> doing a write fault.
>>>
>>> I guess we need to make a change a bit like what you have done, but
>>> also taking into account the fault_around configuration?
>
> For the writable mmap() of tmpfs, we will use the mTHP interface to
> control the size of folio to allocate, as discussed in a previous
> meeting [1], so I don't think the fault_around configuration will be
> helpful for tmpfs.

Yes, tmpfs is different from ext4.
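
For reference, the access pattern under discussion (the first touch of
each page in a MAP_SHARED file mapping is a store, so the read-only
fault-around path does not help) can be sketched in userspace roughly as
below. This is only an approximation based on the description in this
thread, not the will-it-scale source: touch_shared_file_pages() is a
hypothetical helper, and the temp path simply mirrors the one quoted
above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from will-it-scale): map a fresh temp file
 * MAP_SHARED and store one byte to each page, so that the first access
 * to every page is a write. Returns the number of pages written, or -1
 * on error.
 */
long touch_shared_file_pages(size_t npages)
{
	char path[] = "/tmp/willitscale.XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	long touched = 0;
	size_t len;
	char *map;
	int fd;

	assert(npages > 0);
	if (page <= 0)
		return -1;
	len = npages * (size_t)page;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file lives on until fd is closed */

	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* One store per page: each page is first touched by a write. */
	for (size_t off = 0; off < len; off += (size_t)page) {
		map[off] = 1;
		touched++;
	}

	munmap(map, len);
	close(fd);
	return touched;
}
```

Calling touch_shared_file_pages(16) on a 4K-page system covers one 64K
range, which is the size mentioned above for fault around.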
>
> For other filesystems, like ext4, I did not find the logic to
> determine what size of folio to allocate in the writable mmap() path
> (Kefeng, please correct me if I missed something). If there is a
> control like mTHP, we can rely on that instead of 'fault_around'?

For ext4 and most filesystems, the folio is allocated from
filemap_fault(); we don't have an explicit interface like mTHP to
control the folio size.

>
> [1]
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
>
>> Yes, the current change is not enough; I hit some issues and am still
>> debugging, so our direction is to try to map large folios in
>> do_shared_fault(), right?
>
> I think this is the right direction. I added the
> '!vma_is_anon_shmem(vma)' condition to gradually implement support for
> large folio mapping building, especially for writable mmap() support
> in tmpfs.
>
>>> [1]
>>> https://lore.kernel.org/all/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.com/
>>> [2]
>>> https://lore.kernel.org/linux-mm/13939ade-a99a-4075-8a26-9be7576b7e03@arm.com/
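
To make the folding condition discussed in this thread a bit more
concrete, here is a toy userspace model of when a block of PTEs could
carry the contiguous hint. It is a sketch under stated assumptions, not
the kernel's code: struct model_pte, model_fill() and model_can_fold()
are hypothetical names, MODEL_CONT_PTES assumes a 4K granule (16 entries,
64K), and the real check in the kernel has more conditions than the ones
modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CONT_PTES 16	/* assumption: 4K pages, 16 x 4K = 64K */

/* Toy PTE: just the fields this model cares about. */
struct model_pte {
	bool valid;
	uint64_t pfn;
	uint32_t prot;	/* opaque permission/attribute bits */
};

/*
 * Populate the first n entries of a block with consecutive pfns starting
 * at base_pfn, as if n single-page faults had been serviced in order.
 * The remaining entries are left invalid.
 */
void model_fill(struct model_pte *blk, size_t n, uint64_t base_pfn,
		uint32_t prot)
{
	assert(blk != NULL && n <= MODEL_CONT_PTES);
	for (size_t i = 0; i < MODEL_CONT_PTES; i++) {
		blk[i].valid = i < n;
		blk[i].pfn = i < n ? base_pfn + i : 0;
		blk[i].prot = i < n ? prot : 0;
	}
}

/*
 * In this model a block can be folded only if every entry is valid, the
 * pfns are consecutive and start on a block-aligned pfn, and all entries
 * share the same permission bits.
 */
bool model_can_fold(const struct model_pte *blk)
{
	assert(blk != NULL);
	if (!blk[0].valid || blk[0].pfn % MODEL_CONT_PTES != 0)
		return false;
	for (size_t i = 1; i < MODEL_CONT_PTES; i++) {
		if (!blk[i].valid ||
		    blk[i].pfn != blk[0].pfn + i ||
		    blk[i].prot != blk[0].prot)
			return false;
	}
	return true;
}
```

In this model a block populated one page at a time only becomes foldable
when its last entry goes in, whereas mapping the whole folio in one
set_ptes() call would install it in its final form from the start.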