From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Peter Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov@nutanix.com, Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli, "Kirill A. Shutemov", Andrew Morton, Mike Kravetz
Subject: Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
Date: Tue, 17 Aug 2021 11:04:18 +0200
Message-ID: <16a765e7-c2a3-982a-e585-c04067766e3f@redhat.com>
In-Reply-To: <20210807032521.7591-1-peterx@redhat.com>
References: <20210807032521.7591-1-peterx@redhat.com>

Hi Peter,

a couple of comments, sorry for the late reply.
> Summary
> =======
>
> [Based on v5.14-rc4]
>
> This patchset enables PM_SWAP of pagemap on shmem. IOW userspace will be able
> to detect whether a shmem page is swapped out, just like anonymous pages.
>
> This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
> brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
> it the default yet. However IMHO 0.8% is still in an acceptable range that we
> can even make it the default at last. Comments are welcome here.

A special config option and added complexity, just to handle a
corner-case feature partially more correctly. Hm.

>
> There's one previous series that wanted to address the same issue but in
> another way by Tiberiu A Georgescu, here:
>
> https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
>
> In that series it's done by looking up the page cache for all none ptes. However
> I raised concern on 4x performance degradation for all shmem pagemap users.

Who cares? I am asking because for me, it's hard to imagine a workload
that actually cares about a 4x performance degradation when querying the
pagemap in very special cases, especially if it involves gigantic shmem
ranges. VM live migration -- sync will be a bit slower? CRIU sync will
be a bit slower? I mean, actual page dumping is a lot more expensive.
Really a problem?

I read that CRIU cares about shmem via pagemap [1], at least for
anonymous shared memory; not sure how memfd is treated, I assume in a
similar way. But I do wonder how it even works reliably, because it
relies on present/swapped-out and soft-dirty tracking, which are both
essentially broken e.g., when swapping out AFAICT. Looks like this
really needs a proper fix.
[1] https://criu.org/Shared_memory

>
> Unlike the other approach, this series has zero overhead on pagemap read
> because the PM_SWAP info is consolidated into the zapped PTEs directly.
>
> Goals
> =====
>
> One major goal of this series is to add the PM_SWAP support, the reason is as
> stated by Tiberiu and Ivan in the other patchset:
>
> https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
>
> As a summary: for some reason the userspace needs to scan the pages in the
> background, however that scanning could misguide page reclaim on which page is
> hot and which is cold. With correct PM_SWAP information, the userspace can
> correct the behavior of page reclaim by firstly fetching that info from
> pagemap, and explicit madvise(MADV_PAGEOUT). In this case, the pages are for
> the guest, but it can be any shmem page.
>
> Another major goal of this series is to do a proof-of-concept of the PTE marker
> idea, and that's also the major reason why it's RFC. So far PTE marker can
> potentially be the solution for below three problems that I'm aware of:
>
> (a) PM_SWAP on shmem
>
> (b) Userfaultfd-wp on shmem/hugetlbfs
>
> (c) PM_SOFT_DIRTY lost for shmem over swapping
>
> This series tries to resolve problem (a) which should be the simplest, ideally
> it should solve the immediate problem for the live migration issue raised by
> Tiberiu and Ivan on proactive paging out unused guest pages.
>
> Both (a) and (c) will be for performance-wise or statistic-wise.
>
> Scenario (b) will require pte markers as part of the function to trap writes to
> uffd-wp protected regions when the pages were e.g. swapped out or zapped for
> any reason.
>
> Currently, the uffd-wp shmem work (still during review on the list, latest v5, [1])
> used another solution called "special swap pte".
> It works similarly to PTE
> markers as both of the approaches are to persist information into zapped ptes,
> but people showed concern about that idea and it's suggested to use a safer
> (swp-entry level operation, not pte level), and arch-independent approach.
>
> Hopefully PTE markers satisfy these demands.
>
> Before I rework the uffd-wp series, I wanted to know whether this approach can
> be accepted upstream. So besides the swap part, comments on PTE markers will
> be extremely welcomed.

For uffd-wp in its current form, it would certainly be the way to go I
think. AFAICT, the idea of special swap entries isn't new, just that
it's limited to anonymous memory for now, which makes things like fork
and new mappings a lot cheaper.

>
> What is PTE Markers?
> ====================
>
> PTE markers are defined as some special PTEs that work like a "marker" just
> like in normal life. Firstly it uses a swap type, IOW it's not a valid/present
> pte, so the processor will trigger a page fault when it's accessed. Meanwhile,
> the format of the PTE is well-defined, so as to contain some information that
> we would like to know before/during the page access happening.
>
> In this specific case, when the shmem page is paged out, we set a marker
> showing that this page was paged out, then when pagemap is read about this pte,
> we know this is a swapped-out/very-cold page.
>
> This use case is not an obvious one but the simplest. The uffd-wp use
> case is more obvious (wr-protect is per-pte, so we can't save into page cache;
> meanwhile we need that info to persist across zappings e.g. thp split or page
> out of shmem pages).
>
> So in the future, it can contain more information, e.g., whether this pte is
> wr-protected by userfaultfd; whether this pte was written in this mm context
> for soft-dirtying. On 64 bit systems, we have a total of 58 bits (swp_offset).
>
> I'm also curious whether it can be further expanded to other mm areas. E.g.,
> logically it can work too for non-RAM based memories outside shmem/hugetlbfs,
> e.g. a common file system like ext4 or btrfs? As long as there will be a need
> to store some per-pte information across zapping of the ptes, then maybe it
> can be considered.

As already expressed, we should try storing as little information in
page tables as possible if we're dealing with shared memory. The
features we design around this all seem to over-complicate the actual
users, over-complicate fork, over-complicate handling on new mappings.

For uffd-wp in its current form, there seems to be no way around it, and
PTE markers seem to be what you want -- but as I already raised, the
feature itself on shmem is somewhat suboptimal, just like SOFTDIRTY
tracking on shmem.

But I guess I'm biased at this point because the main users of these
features actually want to query/set such properties for all sharers, not
individual processes; so the opinion of others would be appreciated.

>
> Known Issues/Concerns
> =====================
>
> About THP
> ---------
>
> Currently we don't need to worry about THP because paged-out shmem pages will
> be split when shrinking, IOW we only need to consider PTEs, and the markers
> will only be applied to a shmem pte, not a pmd or bigger.
>
> About PM_SWAP Accuracy
> ----------------------
>
> This is not an "accurate" solution to provide the PM_SWAP bit. Two examples:
>
>   - When process A & B both map shmem page P somewhere, it can happen that
>     only one of these ptes got marked with the pte marker. Imagine below
>     sequence:
>
>     0. Process A & B both map shmem page P somewhere
>     1. Process A zaps its pte of page P for some reason (e.g. thp split)
>     2. System decides to recycle page P
>     3. System replaces process B's pte (pointing to P) by the PTE marker
>     4. System _didn't_ replace process A's pte because it was a none pte, and
>        it'll continue to be a none pte
>     5. Only process B's relevant pte has the PTE marker after P is swapped out
>
>   - On fork, we don't copy shmem vma ptes, including the pte markers. So
>     even if page P was swapped out, only the parent process has the pte marker
>     installed; in the child it'll be a none pte if fork() happened after
>     pageout.
>
> Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should
> work in 99.99% of cases and it should already start to solve problems.

At least I don't like these semantics at all. PM_SWAP is a cached value
which might be under-represented and consequently wrong. Take CRIU as an
example, it has to be correct even if a process would remap a memory
region, fork() and unmap in the parent as far as I understand, ...

If we really care about performance for users with the old semantics,
introduce some runtime toggle that enables the new behavior (even for a
single process?) and consequently is a bit slower in corner cases. But I
really do wonder if we care at all about the performance degradation in
corner cases.

If we really care about performance for users with new semantics, then
let's do it properly and see how we can actually speed it up without
per-process page table hacks.

Anyhow, just my 2 cents.

-- 
Thanks,

David / dhildenb