Date: Wed, 15 Feb 2023 17:48:18 -0500
From: Peter Xu <peterx@redhat.com>
To: David Stevens
Cc: linux-mm@kvack.org, Matthew Wilcox, Andrew Morton,
	"Kirill A. Shutemov", Yang Shi, David Hildenbrand, Hugh Dickins,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
References: <20230214075710.2401855-1-stevensd@google.com>
 <20230214075710.2401855-2-stevensd@google.com>
In-Reply-To: <20230214075710.2401855-2-stevensd@google.com>

On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> From: David Stevens
>
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
> collapsing a shmem range would result in replacing an empty page with a
> THP, so that it doesn't break userfaultfd.
>
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race.
> This makes such a race indistinguishable from the collapse occurring
> immediately before the userfaultfd registration.
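(For context, the userspace expectation described above is the usual
MODE_MISSING registration flow; a rough sketch, with error handling
omitted and uffd/addr/len as placeholders:

	/* needs <linux/userfaultfd.h>, <sys/ioctl.h>, <unistd.h>;
	 * uffd comes from the userfaultfd(2) syscall */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	struct uffd_msg msg;

	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* later, in the fault-handling thread: */
	read(uffd, &msg, sizeof(msg));
	if (msg.event == UFFD_EVENT_PAGEFAULT) {
		/* msg.arg.pagefault.address names the missing page;
		 * userspace resolves it with UFFDIO_COPY/UFFDIO_ZEROPAGE */
	}

A collapse that silently filled a hole in such a range would mean the
read() never reports the fault userspace is waiting for.)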
>
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
>
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
>
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
>
> Signed-off-by: David Stevens
> ---
>  mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b648f1053d95..8c2e2349e883 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_PAGE_FILLED,

PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.

>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *  - allocate and lock a new huge page;
>   *  - scan page cache replacing old pages with the new one
>   *    + swap/gup in pages if necessary;
> - *    + fill in gaps;

IIUC it's not a complete removal, but just moved downwards:

>   *    + keep old pages around in case rollback is required;
> + *  - finalize updates to the page cache;

       + fill in gaps with RETRY entries
       + detect race conditions with userfaultfds

>   *  - if replacing succeeds:
>   *    + copy data over;
>   *    + free old pages;
> @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  				result = SCAN_TRUNCATED;
>  				goto xa_locked;
>  			}
> -			xas_set(&xas, index);
> +			xas_set(&xas, index + 1);
>  		}
>  		if (!shmem_charge(mapping->host, 1)) {
>  			result = SCAN_FAIL;
>  			goto xa_locked;
>  		}
> -		xas_store(&xas, hpage);
>  		nr_none++;
>  		continue;
>  	}
> @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		put_page(page);
>  		goto xa_unlocked;
>  	}
> +
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		xas_unlock_irq(&xas);
> +		i_mmap_lock_read(mapping);
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas)) {
> +				xas_store(&xas, XA_RETRY_ENTRY);
> +				nr_none_check++;
> +			}
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;
> +			goto immap_locked;
> +		}
> +
> +		/*
> +		 * If userspace observed a missing page in a VMA with an armed
> +		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> +		 * that page, so we need to roll back to avoid suppressing such
> +		 * an event. Any userfaultfds armed after this point will not be
> +		 * able to observe any missing pages due to the previously
> +		 * inserted retry entries.
> +		 */
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				goto immap_locked;
> +			}
> +		}
> +
> +immap_locked:
> +		i_mmap_unlock_read(mapping);
> +		if (result != SCAN_SUCCEED) {
> +			xas_set(&xas, start);
> +			for (index = start; index < end; index++) {
> +				if (xas_next(&xas) == XA_RETRY_ENTRY)
> +					xas_store(&xas, NULL);
> +			}
> +
> +			goto xa_locked;
> +		}
> +	}
> +

Up to here all looks fine to me (ignoring patch 1 for now; assuming the
hpage is always uptodate).

My question is: after this point we'll release the page cache lock again
before try_to_unmap_flush(), but is it safe to keep the RETRY entries in
place while the page cache lock is not held?  It means other threads can
be left spinning on them.

I assume the page lock is always safe because it is sleepable, but I'm
not sure about the page cache lock here.

>  	nr = thp_nr_pages(hpage);
>
>  	if (is_shmem)
> @@ -2068,15 +2118,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	}
>
>  	xas_set(&xas, start);
> -	xas_for_each(&xas, page, end - 1) {
> +	end = index;
> +	for (index = start; index < end; index++) {
> +		xas_next(&xas);
>  		page = list_first_entry_or_null(&pagelist,
>  				struct page, lru);
>  		if (!page || xas.xa_index < page->index) {
> -			if (!nr_none)
> -				break;
>  			nr_none--;
> -			/* Put holes back where they were */
> -			xas_store(&xas, NULL);
>  			continue;
>  		}
>
> @@ -2592,11 +2640,13 @@ static int madvise_collapse_errno(enum scan_result r)
>  	case SCAN_ALLOC_HUGE_PAGE_FAIL:
>  		return -ENOMEM;
>  	case SCAN_CGROUP_CHARGE_FAIL:
> +	case SCAN_EXCEED_NONE_PTE:
>  		return -EBUSY;
>  	/* Resource temporary unavailable - trying again might succeed */
>  	case SCAN_PAGE_LOCK:
>  	case SCAN_PAGE_LRU:
>  	case SCAN_DEL_PAGE_LRU:
> +	case SCAN_PAGE_FILLED:
>  		return -EAGAIN;
>  	/*
>  	 * Other: Trying again likely not to succeed / error intrinsic to
> --
> 2.39.1.581.gbfd45094c4-goog
>

--
Peter Xu
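PS: to make the "spinning" concern concrete, the RCU-only lookup paths in
mm/filemap.c follow roughly this pattern (simplified, not the exact
upstream code):

	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (xas_retry(&xas, folio))
		goto repeat;	/* retries while XA_RETRY_ENTRY is installed */
	/* ... use folio ... */
	rcu_read_unlock();

So the retry entries are never observed as present pages, but readers do
busy-wait on them until the entries are replaced or cleared.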