Subject: Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
From: Nadav Amit
To: David Hildenbrand
Cc: Peter Xu, Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen, Mike Rapoport
Date: Tue, 14 Jun 2022 09:18:43 -0700
In-Reply-To: <3eea2e6e-1646-546a-d9ef-d30052c00c7d@redhat.com>
References: <20220613204043.98432-1-namit@vmware.com> <3eea2e6e-1646-546a-d9ef-d30052c00c7d@redhat.com>
On Jun 14, 2022, at 8:22 AM, David Hildenbrand wrote:

> On 13.06.22 22:40, Nadav Amit wrote:
>> From: Nadav Amit
>>
>> As we know, using a PTE on x86 with a cleared access-bit (aka young-bit)
>> takes ~600 cycles more than when the access-bit is set. At the same
>> time, setting the access-bit for memory that is not used (e.g.,
>> prefetched) can introduce greater overheads, as the prefetched memory
>> is reclaimed later than it should be.
>>
>> Userfaultfd currently does not set the access-bit (excluding the
>> huge-pages case). Arguably, it is best to let the uffd monitor control
>> whether the access-bit should be set or not. The expected use is for
>> the monitor to request userfaultfd to set the access-bit when the copy
>> operation is done to resolve a page-fault, and not to set the young-bit
>> when the memory is prefetched.
>
> Thinking out loud about existing users: postcopy live migration in QEMU
> has two usages for placement of pages:
>
> a) Resolving a fault. E.g., a VCPU might be waiting for the resolution
> to make progress.
> b) Background migration to converge without faults on all relevant
> pages.
>
> I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG; in b) we don't want it.
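For concreteness, the monitor side of that a)/b) split could look something like the sketch below. The struct mirrors uffdio_copy from <linux/userfaultfd.h> so the snippet is self-contained, and the UFFDIO_COPY_MODE_YOUNG bit value is hypothetical; the real value would be assigned in the uapi header if the patch lands.

```c
#include <stdint.h>

/* Mirror of struct uffdio_copy from <linux/userfaultfd.h>, redeclared
 * here so the sketch is self-contained. */
struct uffdio_copy {
	uint64_t dst;
	uint64_t src;
	uint64_t len;
	uint64_t mode;
	int64_t  copy;	/* bytes copied or -errno, written by the kernel */
};

#define UFFDIO_COPY_MODE_DONTWAKE	((uint64_t)1 << 0)
/* Hypothetical bit value for the proposed mode. */
#define UFFDIO_COPY_MODE_YOUNG		((uint64_t)1 << 1)

/* Build the UFFDIO_COPY argument: set the young hint only when a
 * faulting thread is actually waiting on this page (case a); leave it
 * clear for background/prefetch placement (case b). */
static struct uffdio_copy make_copy(uint64_t dst, uint64_t src,
				    uint64_t len, int resolving_fault)
{
	struct uffdio_copy uc = { .dst = dst, .src = src, .len = len };

	uc.mode = resolving_fault ? UFFDIO_COPY_MODE_YOUNG : 0;
	return uc;
}
```

The actual placement would still be ioctl(uffd, UFFDIO_COPY, &uc) as today; only the mode bit differs between the two paths.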
>
> I wonder, however, instead of calling this "young", which implies what
> the OS should or shouldn't do, whether to define this as a hint that
> the placed page is very likely to be accessed next.
>
> I'm bad at naming; UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
> have in mind. How about UFFDIO_COPY_MODE_WILLNEED_READ?
>
>> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request that
>> the young bit be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE, set the
>> bit unconditionally, since the former is only used to resolve
>> page-faults and the latter would not benefit from not setting the
>> access-bit.
>>
>> Cc: Mike Kravetz
>> Cc: Hugh Dickins
>> Cc: Andrew Morton
>> Cc: Axel Rasmussen
>> Cc: Peter Xu
>> Cc: David Hildenbrand
>> Cc: Mike Rapoport
>> Signed-off-by: Nadav Amit
>>
>> ---
>>
>> There are 2 possible enhancements:
>>
>> 1. Use the flag to decide whether to mark the PTE as dirty (for
>> writable PTEs). I guess that setting the dirty-bit is as expensive as
>> setting the access-bit, and setting it introduces similar tradeoffs,
>> as mentioned above.
>>
>> 2. Introduce a similar mode for write-protect and use this information
>> for setting both the young and dirty bits. Makes one wonder whether
>> mprotect() should also set the bit in certain cases...
>
> I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
> UFFDIO_COPY_MODE_WRITE_ACCESS_LIKELY could even make sense. I feel like
> it could.
>
> For example, QEMU knows if a page fault it's resolving was due to a
> read or a write fault and could use that information accordingly. Of
> course, even if we currently have a read fault, we don't completely
> know whether we could get a write fault immediately after.
>
> Especially in the context of UFFDIO_ZEROPAGE,
> UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage
> but instead populate an actual page and mark it accessed+dirty.
> I even have a use case for that ;)
>
> The kernel could decide how to treat these hints -- for example, if it
> doesn't want user space to mess with access/dirty bits, it could just
> mostly ignore the hints.

I can do that. I think users can do the zero-page copy themselves today,
but whatever you prefer.

But I cannot take it anymore: the list of arguments for uffd operations is
getting crazy. I would like to collect all the possible arguments that are
used for a uffd operation into some "struct uffd_op". Any objection?
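Roughly what I have in mind, as a sketch only (all field names are made up
for illustration, not a concrete uapi proposal):

```c
#include <stdint.h>

/* One argument block shared by the uffd operations, instead of the
 * per-ioctl uffdio_copy / uffdio_zeropage / uffdio_writeprotect / ...
 * structs. Field names here are illustrative. */
struct uffd_op {
	uint64_t dst;	/* destination range start */
	uint64_t src;	/* source address (copy only, otherwise unused) */
	uint64_t len;	/* range length */
	uint64_t mode;	/* UFFDIO_*_MODE_* flags, incl. access/dirty hints */
	int64_t  ret;	/* bytes handled or -errno, written by the kernel */
};
```

Each ioctl would then take a struct uffd_op and ignore the fields it does
not use, so adding a flag would not require adding yet another struct
layout.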