From: David Hildenbrand
Organization: Red Hat
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Alistair Popple,
    Tiberiu Georgescu, ivan.teterevkov@nutanix.com, Mike Rapoport,
    Hugh Dickins, Matthew Wilcox, Andrea Arcangeli, "Kirill A . Shutemov",
    Andrew Morton, Mike Kravetz
Subject: Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
Date: Wed, 18 Aug 2021 10:24:06 +0200
References: <20210807032521.7591-1-peterx@redhat.com> <16a765e7-c2a3-982a-e585-c04067766e3f@redhat.com>

On 17.08.21 22:24, Peter Xu wrote:
> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
>>> Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not
>>> accurate from the 1st day, imho. E.g., when a page is being migrated from numa
>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can
>>> make it more accurate, but I think it's fine, because it's a hint.
>>
>> That inconsistency doesn't really matter as you can determine if something
>> is present and worth dumping if it's either swapped or present. As long as
>> it's one of both but not simply nothing.
>>
>> I will shamelessly reference
>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
>> checks exactly for that (the test case uses only private anonymous memory).
>
> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
> PM_SWAP for that when it goes beyond anonymous private memories - when shmem
> swapped out the pte can be none, then the test case can fail even if it
> shouldn't, imho.

Exactly, because the pagemap is fairly completely broken for shmem.

>
> The mincore() syscall seems to be ideally the thing you may want to make it
> accurate, but again it's not a problem for current anonymous private memories.

I haven't checked the details, but I believe the mincore() syscall won't
report swapped out pages. At least according to its documentation:

"mincore() returns a vector that indicates whether pages of the calling
process's virtual memory are resident in core (RAM), and so will not
cause a disk access (page fault) if referenced."

(to protect it from swapping and relying on mincore() we would have to
mlock that memory; we'd want MCL_ONFAULT to be able to test
MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek)

>
>>
>>>
>>>> Take CRIU as an example, it has to be correct even if a process would remap a
>>>> memory region, fork() and unmap in the parent as far as I understand, ...
>>>
>>> Are you talking about dirty bit or swap bit?
>>> I'm a bit confused on why swap
>>> bit needs to be accurate. Maybe you mean the dirty bit?
>>
>> https://criu.org/Shared_memory
>>
>> "Dumping present pages"
>>
>> "... CRIU does not dump all of the data. Instead, it determines which pages
>> contain it, and only dumps those pages. This is done similarly to how
>> regular memory dumping and restoring works, i.e. by looking for PRESENT or
>> SWAPPED bits in owners' pagemap entries."
>>
>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
>> makes perfect sense.
>>
>> 1) Process A sets up shared memory and writes data to it.
>> 2) System swaps out memory, hints are setup.
>> 3) Process A forks Process B, hints are not copied.
>> 4) Process A unmaps shared memory, hints are dropped.
>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
>> pagemap.
>> 6) Process B uses memory in shared memory region. Pages were not migrated.
>>
>> Just one example; feel free to correct me.
>
> I think pte marker won't crash criu, what will happen is that it'll see more
> ptes that used to be none that become the pte markers. This reminded me that
> maybe I should teach up mincore() syscall to also be aware of the pte marker at
> least, and all non_swap_entry() callers.
>

I haven't checked what mincore() is doing, but from what I understand
when reading the CRIU doc and the mincore() doc, it does the right thing
without requiring any fiddling with pte marker hints. I assume you
merely have a performance improvement in mind.

>>
>>
>> There is notion of the mincore() systemcall:
>>
>> "There is one particular feature of shared memory dumps worth mentioning.
>> Sometimes, a shared memory page can exist in the kernel, but it is not
>> mapped to any process. CRIU detects such pages by calling mincore() on the
>> shmem segment, which reports back the page in-memory status. The mincore
>> bitmap is when ANDed with the per-process ones."
>>
>> Not sure if they actually mean ORed, because otherwise they'd be losing
>> pages that have been swapped out. "mincore() returns a vector that indicates
>> whether pages of the calling process's virtual memory are resident in core
>> (RAM)"
>
> I am wildly guessing they ORed the two just because PM_SWAP is not working
> properly for shmem, so the OR happens only for shmem. Criu may not only rely
> on mincore() because they also want the dirty bits.
>
> Btw, I noticed in 2016 criu switched from mincore() to lseek():
>
> https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Interesting. That's certainly what we want when it comes to skipping
holes in files. (before reading that, I wasn't even aware that mincore()
existed)

>
> Criu should want to know "whether this page has valid data" not "whether this
> page has swapped out", so lseek() seems to be more suitable, which I'm not
> aware of before.

Again, just as you, I learned a lot :)

>
> I'm now wondering whether for Tiberiu's case mincore() can also be used. It
> should just still be a bit slow because it'll look up the cache too, but it
> should work similarly like the original proposal.
>

Very right, maybe we can just avoid tampering with pagemap on shmem
completely (which sounds like an excellent idea to me) and document it
as "On shared memory, we will never indicate SWAPPED if the pages have
been swapped out. Further, PRESENT might be under-indicated if a shared
page is currently not mapped into the page table of a process.". I saw
there was a related, proposed doc update, maybe we can fine-tune that.

-- 
Thanks,

David / dhildenb