From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 21 Feb 2023 17:12:50 -0500
From: Peter Xu
To: David Stevens
Cc: linux-mm@kvack.org, Matthew Wilcox, Andrew Morton,
	"Kirill A . Shutemov", Yang Shi, David Hildenbrand, Hugh Dickins,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd
References: <20230217085439.2826375-1-stevensd@google.com>
	<20230217085439.2826375-3-stevensd@google.com>
In-Reply-To: <20230217085439.2826375-3-stevensd@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Fri, Feb 17, 2023 at 05:54:38PM +0900, David Stevens wrote:
> From: David Stevens
>
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> collapsing a shmem range would result in replacing an empty page with a
> THP, to avoid breaking userfaultfd.
>
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race
> occurs. This makes such a race indistinguishable to the collapse
> occurring immediately before the userfaultfd registration.
>
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop.
> The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
>
> The fact that we don't fill holes during the initial iteration means
> that collapse_file now has to handle faults occurring during the
> collapse. This is done by re-validating the number of missing pages
> after acquiring the page cache lock for the final time.
>
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
>
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
>
> Signed-off-by: David Stevens

It'll be great to have another eye looking, but... AFAICT this works for
us.  Thanks David, this is better than what I suggested.

Acked-by: Peter Xu

-- 
Peter Xu