From: David Hildenbrand
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Alistair Popple, Tiberiu Georgescu, ivan.teterevkov@nutanix.com, Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli, "Kirill A . Shutemov", Andrew Morton, Mike Kravetz
References: <20210807032521.7591-1-peterx@redhat.com> <16a765e7-c2a3-982a-e585-c04067766e3f@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
Date: Tue, 17 Aug 2021 20:46:45 +0200

>>
>> For uffd-wp in its current form, it would certainly be the way to go I
>> think.
>> AFAICT, the idea of special swap entries isn't new, just that it's
>> limited to anonymous memory for now, which makes things like fork and new
>> mappings a lot cheaper.
>
> Thanks for reviewing this series separately; yes I definitely wanted to get
> comments on both sides: one on the pte marker idea, the other is whether it's
> applicable to this swap+shmem use case.

Exactly.

>
> Here I really want to make the pte marker be flexible - it can be strict when
> necessary (it will be 100% strict with uffd-wp), then it can also be a hint
> just like what we have with available ptes on soft-dirty, idle, accessed bits.
> Here the swap bit I wanted to make it that kind, so we add zero overhead to
> fork() and we still solve problems.
>
> Same thing to "newly mapped shmem".  Do we have a use case for that?  If that's
> a hint bit, can we ignore it?

I am really not a fan of taking an already broken feature (broken
presence information for shmem in pagemap) and, instead of fixing it
properly, making it less broken and crossing fingers that nobody will
notice the remaining (undocumented) cases.

[...]

>> As already expressed, we should try storing as little information in page
>> tables as possible if we're dealing with shared memory. The features we
>> design around this all seem to over-complicate the actual users,
>> over-complicate fork, over-complicate handling on new mappings.
>
> I'll skip the last two "over-complicated" items, because as we've discussed I
> don't think we need to take care of them so far.  We can revisit when they
> become some kind of requirement.
>
> To me having PM_SWAP 99% right on shmem is still a progress comparing to
> completely missing it, even if it's not 100% right.  It's used for performance
> reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
> not a strict requirement.
>
> So I still cannot strictly follow why storing information in pte is so bad for
> file-backed, which I can see you really don't like it.  Could you share some
> detailed example?

I am not a fan of your approach of "hints", because while it might work
for some use cases, it might not work for others (see below for CRIU); I
would rather avoid such "inconsistent caches" where they are not
really required. But again, this is just my opinion and there might be
plenty of other people who will most probably disagree.

Storing hints in page tables isn't actually that bad, because we can
drop hints whenever we like (well, there are side effects, and once we
drop hints too often people might complain again) -- for example, when
reclaiming "empty" but actually "filled with hints" page tables. When we
rely on consistent values, fork() and mmap() are a problem. Well, and
page tables will have to stick around. At least for uffd-wp, mmap() is
not an issue, and we don't expect so many uffd-wp users that page
table consumption would matter -- my guess.

So I repeat, for uffd-wp in its current form, it sounds like the right
thing to do.

>> But I guess I'm biased at this point because the main users of these
>> features actually want to query/set such properties for all sharers, not
>> individual processes; so the opinion of others would be appreciated.
>>
>>>
>>> Known Issues/Concerns
>>> =====================
>>>
>>> About THP
>>> ---------
>>>
>>> Currently we don't need to worry about THP because paged out shmem pages will
>>> be split when shrinking, IOW we only need to consider PTE, and the markers will
>>> only be applied to a shmem pte not pmd or bigger.
>>>
>>> About PM_SWAP Accuracy
>>> ----------------------
>>>
>>> This is not an "accurate" solution to provide PM_SWAP bit.
>>> Two examples:
>>>
>>>    - When process A & B both map shmem page P somewhere, it can happen that only
>>>      one of these ptes got marked with the pte marker.  Imagine below sequence:
>>>
>>>      0. Process A & B both map shmem page P somewhere
>>>      1. Process A zap pte of page P for some reason (e.g. thp split)
>>>      2. System decides to recycle page P
>>>      3. System replace process B's pte (pointed to P) by PTE marker
>>>      4. System _didn't_ replace process A's pte because it was none pte, and
>>>         it'll continue to be none pte
>>>      5. Only process B's relevant pte has the PTE marker after P swapped out
>>>
>>>    - When fork, we don't copy shmem vma ptes, including the pte markers.  So
>>>      even if page P was swapped out, only the parent process has the pte marker
>>>      installed, in child it'll be none pte if fork() happened after pageout.
>>>
>>> Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
>>> work in 99.99% cases and it should already start to solve problems.
>>
>> At least I don't like these semantics at all. PM_SWAP is a cached value
>> which might be under-represented and consequently wrong.
>
> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> make it more accurate, but I think it's fine, because it's a hint.

That inconsistency doesn't really matter, as you can determine that
something is populated and worth dumping if it's either swapped or
present. It only has to be one of the two and not simply nothing.

I will shamelessly reference
tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated(), which
checks exactly for that (the test case uses only private anonymous memory).
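
To illustrate the "swapped or present" rule, here is a minimal sketch in C.
It is not the selftest's code; the helper names are made up for this mail.
The only thing taken as given is the documented /proc/<pid>/pagemap layout
(one 64-bit entry per virtual page, bit 63 = present, bit 62 = swapped).

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions from the documented /proc/<pid>/pagemap entry format. */
#define PM_ENTRY_SWAP    (1ULL << 62)
#define PM_ENTRY_PRESENT (1ULL << 63)

/* Hypothetical helper: a page counts as populated if it is present OR swapped. */
static inline bool pagemap_entry_is_populated(uint64_t entry)
{
	return (entry & (PM_ENTRY_PRESENT | PM_ENTRY_SWAP)) != 0;
}

/*
 * Hypothetical helper: read the pagemap entry that covers addr from an
 * already opened pagemap file descriptor. Returns 0 on success, -1 on error.
 */
static inline int pagemap_read_entry(int fd, const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;

	if (page_size <= 0)
		return -1;

	/* One 8-byte entry per virtual page, indexed by virtual page number. */
	offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size * sizeof(*entry));
	if (lseek(fd, offset, SEEK_SET) != offset)
		return -1;
	if (read(fd, entry, sizeof(*entry)) != (ssize_t)sizeof(*entry))
		return -1;
	return 0;
}
```

A caller would open /proc/self/pagemap read-only, call pagemap_read_entry()
for each page of interest and feed the result to
pagemap_entry_is_populated(). An entry with neither bit set is exactly the
"simply nothing" case.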
>
>> Take CRIU as an example, it has to be correct even if a process would remap a
>> memory region, fork() and unmap in the parent as far as I understand, ...
>
> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> bit needs to be accurate.  Maybe you mean the dirty bit?

https://criu.org/Shared_memory

"Dumping present pages"

"... CRIU does not dump all of the data. Instead, it determines which
pages contain it, and only dumps those pages. This is done similarly to
how regular memory dumping and restoring works, i.e. by looking for
PRESENT or SWAPPED bits in owners' pagemap entries."

-> If a page is neither PRESENT nor SWAPPED, it does not get dumped,
which makes perfect sense.

1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are set up.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
   pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.

Just one example; feel free to correct me.

There is also a mention of the mincore() system call:

"There is one particular feature of shared memory dumps worth
mentioning. Sometimes, a shared memory page can exist in the kernel, but
it is not mapped to any process. CRIU detects such pages by calling
mincore() on the shmem segment, which reports back the page in-memory
status. The mincore bitmap is when ANDed with the per-process ones."

Not sure if they actually mean ORed, because otherwise they'd be losing
pages that have been swapped out. "mincore() returns a vector that
indicates whether pages of the calling process's virtual memory are
resident in core (RAM)"

-- 
Thanks,

David / dhildenb
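
PS: to make steps 1) to 6) above concrete, here is a toy model in C. It is
only a sketch under my reading of the CRIU wiki text: struct proc_view and
both helpers are invented for illustration, nothing here is CRIU or kernel
code, and the mincore() step is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy per-process view of one shmem page (hypothetical, for illustration). */
struct proc_view {
	bool mapped;      /* process still maps the shmem region */
	bool pte_present; /* pagemap would report PRESENT */
	bool pte_marker;  /* hint installed, pagemap would report SWAPPED */
};

/* Dump rule as quoted above: some owner reports PRESENT or SWAPPED. */
static inline bool page_would_be_dumped(const struct proc_view *procs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (procs[i].mapped &&
		    (procs[i].pte_present || procs[i].pte_marker))
			return true;
	}
	return false;
}

/* Replays steps 1) to 5); fork_copies_hints selects the fork() semantics. */
static inline bool scenario_page_gets_dumped(bool fork_copies_hints)
{
	/* 1) A sets up shared memory and writes data to it. */
	struct proc_view a = { .mapped = true, .pte_present = true };
	struct proc_view b = { 0 };

	/* 2) System swaps out the page, the hint is set up in A. */
	a.pte_present = false;
	a.pte_marker = true;

	/* 3) A forks B; with hint semantics the marker is not copied. */
	b.mapped = true;
	b.pte_marker = fork_copies_hints && a.pte_marker;

	/* 4) A unmaps the shared memory, its hint is dropped. */
	a = (struct proc_view){ 0 };

	/* 5) The dump only looks at PRESENT or SWAPPED in the pagemaps. */
	struct proc_view procs[] = { a, b };
	return page_would_be_dumped(procs, 2);
}
```

With fork_copies_hints == false the function returns false, which is
step 6): B later touches a page that was never dumped. With consistent
values copied at fork() it returns true, but then we are back to the
fork() overhead discussed earlier in this thread.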