From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <68f43b57-32b6-1844-a0a6-d22fb0e089aa@bytedance.com>
Date: Mon, 29 Aug 2022 22:00:47 +0800
Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages
To: David Hildenbrand, akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
 jgg@nvidia.com, tglx@linutronix.de, willy@infradead.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, muchun.song@linux.dev
References: <20220825101037.96517-1-zhengqi.arch@bytedance.com>
From: Qi Zheng <zhengqi.arch@bytedance.com>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets of two solutions:
>> - atomic refcount version:
>>   https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>> - percpu refcount version:
>>   https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>>    pte_unmap{_unlock}()
>> b. Automatically reclaim PTE page table pages in the non-reclaiming path
>>
>> For behavior a, there may be the following disadvantages, mentioned by
>> David Hildenbrand:
>> - It introduces a lot of complexity. It's not something easy to get in and
>>   most probably not easy to get out again
>> - It is inconvenient to extend to other architectures.
>>   For example, for the contiguous ptes of arm64, the pointer to the PTE entry
>>   is obtained directly through pte_offset_kernel() instead of
>>   pte_offset_map{_lock}()
>> - It has been found that pte_unmap() is missing in some places that only
>>   execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages,
>> especially when memory pressure is not high, and deferring to the reclaim
>> path may be a better choice.
>>
>> In addition, the above two solutions are only for empty PTE pages (a PTE page
>> where all entries are empty), and do not deal with the zero PTE page (a PTE
>> page where all page table entries are mapped to the shared zero page)
>> mentioned by David Hildenbrand:
>>
>> "Especially the shared zeropage is nasty, because there are
>> sane use cases that can trigger it. Assume you have a VM
>> (e.g., QEMU) that inflated the balloon to return free memory
>> to the hypervisor.
>>
>> Simply migrating that VM will populate the shared zeropage to
>> all inflated pages, because migration code ends up reading all
>> VM memory. Similarly, the guest can just read that memory as
>> well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch is to continue the discussion and fix the above
>> issues. The following is the solution to be discussed.
>
> Thanks for providing an alternative! It's certainly easier to digest :)

Hi David,

Nice to see your reply.

>
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduce a pte_refcount for each PTE page. We put the mapped and zero PTE
>> entry counts into the pte_refcount of the PTE page. The bitmask has the
>> following meaning:
>>
>> - bits 0-9 are the mapped PTE entry count
>> - bits 10-19 are the zero PTE entry count
>
> I guess we could factor the zero PTE change out, to have an even simpler
> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.

OK, we can deal with the empty PTE page case first.

>
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.

I see that a PTE marker entry is a non-present entry, not an empty entry
(pte_none()), so this case is already handled, which is also what is done
in [RFC PATCH 1/7].

>
>>
>> In this way, when the mapped PTE entry count is 0, we know that the current
>> PTE page is an empty PTE page, and when the zero PTE entry count is
>> PTRS_PER_PTE, we know that the current PTE page is a zero PTE page.
>>
>> We only update the pte_refcount when setting and clearing a PTE entry, and
>> since both operations are protected by the pte lock, pte_refcount can be a
>> non-atomic variable with little performance overhead.
>>
>> For the page table walker, we exclude it by holding the write lock of
>> mmap_lock when doing pmd_clear() (in the newly added path that reclaims PTE
>> pages).
>
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.

Oh, I forgot this. We should also hold the rmap lock(s), like
move_normal_pmd() does.

>
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.

Maybe we can iterate over the vma list and just process the 2M-aligned
part?
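As a rough illustration of that idea (only a sketch, not code from this
patchset: try_to_free_pte_table() is a made-up helper name, and it would
take that single VMA's rmap lock internally before doing pmd_clear()):

/*
 * Sketch: walk each VMA and only try to reclaim PTE page tables whose
 * whole PMD-sized (2M on x86-64) range lies inside this one VMA, so at
 * most one VMA's rmap lock(s) would be involved.
 */
static void reclaim_pte_tables(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	unsigned long addr, start, end;

	mmap_write_lock(mm);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* Keep only the PMD-aligned portion covered by this VMA. */
		start = ALIGN(vma->vm_start, PMD_SIZE);
		end = ALIGN_DOWN(vma->vm_end, PMD_SIZE);

		for (addr = start; addr < end; addr += PMD_SIZE)
			try_to_free_pte_table(vma, addr);
	}
	mmap_write_unlock(mm);
}

Page tables that intersect more than one VMA would simply be skipped,
which may be acceptable as a first step.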
>
> We might want/need another mechanism to synchronize against page table
> walkers.

This is a tricky problem, equivalent to narrowing the protection scope of
mmap_lock. Any preliminary ideas?

Thanks,
Qi

-- 
Thanks,
Qi