Date: Wed, 22 Sep 2021 17:20:57 -0400
From: Peter Xu
To: Axel Rasmussen
Cc: LKML, Linux MM, Andrew Morton, Andrea Arcangeli, Hugh Dickins, Nadav Amit
Subject: Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
References: <20210922175156.130228-1-peterx@redhat.com>

On Wed, Sep 22, 2021 at 01:49:42PM -0700, Axel Rasmussen wrote:
> Sorry for missing the other thread.

Heh, you didn't miss the other thread; I just posted both emails within a
few hours of each other. :)

>
> Unfortunately, I think shmem THP *doesn't* really work with minor
> faults, and what's worse, just checking the VMA flag isn't enough.

As I replied to myself (sorry to have done that), I now think minor mode is
fine, but let's see what else I've missed, which is possible... Please see
below.

>
> First, let me note the guarantee UFFD minor faults are trying to
> provide: for a given mapping, any minor fault (that is, pte_none() but
> a page is present in the page cache) must result in a minor userfault
> event.
> Furthermore, the only way the fault may be resolved (i.e., a
> PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace.

Yes.

>
> A typical use case for minor faults is, we have two mappings (i.e.,
> two VMAs), both pointing to the same underlying physical memory. It's
> typical for both to have MAP_SHARED. It's typical for one of these
> mappings to be fully faulted in (i.e., all of its PTEs exist), while
> the other one has some missing PTEs. The problem is, khugepaged might
> scan *either* of the two mappings. Say it picks the fully-faulted VMA:
> even if we set khugepaged_max_ptes_none to zero, it will still go
> ahead and collapse these pages - because *this* VMA has no missing
> PTEs.

Yes.

>
> Why is this a problem? When we collapse, we install a PMD, for *all*
> VMAs which reference these pages. In other words, we might install
> PTEs for the other, minor-fault-registered mapping, and therefore
> userfaults will never trigger for some of those regions, even though
> userspace never UFFDIO_CONTINUE-ed them.

Nope - we don't install a PMD for file-backed memory, do we?

Please see khugepaged_scan_pmd() - that one does install PMDs, but it's
anonymous-only code.

Then please also see khugepaged_scan_file() - that one handles file-backed
memory (aka, shmem), and it does _not_ install a pmd, afaict. The
installation is lazy.

Not installing the pmd means uffd-minor can still trap any further faults
just like before, afaiu.

One trivial detail: when the pmd is missing, the next page fault takes a
slightly different code path. In __handle_mm_fault() we'll first try
create_huge_pmd() once; however, since shmem doesn't provide huge_fault(),
we'll take the VM_FAULT_FALLBACK path, and things proceed as before when
faulting on a small pte.

The next UFFDIO_CONTINUE will allocate that missing pmd again, but it'll
install only a 4K page.
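
To make the fault path described above easier to follow, here is a minimal
user-space toy model in C. It is a sketch under stated assumptions, not
kernel code: every type and function name (toy_vma, toy_fault,
toy_handle_fault, and so on) is hypothetical, and it only mirrors the
decision order described in this mail (one huge-path attempt, fallback
because shmem has no huge_fault(), then the usual 4K minor-fault trap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; every name here is hypothetical, not kernel API. */
enum toy_result {
    TOY_MAPPED_PMD,       /* a huge mapping was installed */
    TOY_MAPPED_PTE,       /* handled at 4K granularity without a userfault */
    TOY_MINOR_USERFAULT   /* fault is reported to the uffd-minor handler */
};

struct toy_vma {
    bool uffd_minor;      /* VMA is registered for minor faults */
    bool has_huge_fault;  /* models a huge_fault() handler; false for shmem */
};

struct toy_fault {
    bool pmd_none;        /* no page table under the PMD, e.g. after collapse */
    bool pte_none;        /* PTE missing (only meaningful if !pmd_none) */
    bool page_in_cache;   /* backing page is present in the page cache */
    bool tried_huge;      /* output: the huge path was attempted once */
};

enum toy_result toy_handle_fault(const struct toy_vma *vma, struct toy_fault *f)
{
    assert(vma != NULL && f != NULL);
    f->tried_huge = false;

    if (f->pmd_none) {
        /* Models the single create_huge_pmd() attempt. */
        f->tried_huge = true;
        if (vma->has_huge_fault)
            return TOY_MAPPED_PMD;
        /* No huge_fault(): models VM_FAULT_FALLBACK, continue at 4K. */
    }

    /* With no page table under the PMD, every PTE below it is missing. */
    bool pte_missing = f->pmd_none || f->pte_none;

    if (pte_missing && f->page_in_cache && vma->uffd_minor)
        return TOY_MINOR_USERFAULT;

    return TOY_MAPPED_PTE;
}
```

In this model a shmem-like VMA (has_huge_fault == false) with pmd_none set
still ends in TOY_MINOR_USERFAULT; the only visible difference from the
plain missing-pte case is that tried_huge is set.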
>
> I *think* the right place to check for this and solve it is in
> retract_page_tables(), and I have a patch which does this. I've been
> hesitant to send it though, as due to a lack of time and the
> complexity involved I haven't been able to write a clear reproducer
> program, which my patch clearly fixes. :/

Yes, retract_page_tables() could drop the pmd pgtable for minor fault, but
IMHO that's fine too, as mentioned above. Minor mode should only care about
trapping the page fault when the next access comes.

retract_page_tables() will wipe the pmd pgtable page. That's not fine for
uffd-wp, but IMHO it's still fine for minor mode, as it will keep trapping
the old missing ptes; the difference is that it'll generate even more traps
(rather than on the pte holes only, it'll now generate one message for each
4k over the merged 2M).

As I mentioned in the other thread, I think that'll cause false-positive
minor fault messages, but IMHO that's fine, and minor fault userspace always
needs to handle that.

I fully agree with you that a reproducer would be very nice to try. So if my
understanding is correct, the reproducer won't really fail on minor mode in
any way; it'll just need to be prepared to receive more messages than it
should.

Thanks,

-- 
Peter Xu
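
The "more traps, but nothing missed" argument above can be illustrated with
a small user-space toy model in C. This is only a sketch: the names
(toy_range, toy_count_traps, toy_retract, toy_handle_minor) are
hypothetical, and the constant 512 assumes 4K base pages under a 2M PMD.
It models a handler that tolerates a repeated fault on a page it has
already resolved, which is the behaviour the mail says userspace needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 2M / 4K, assuming 4K base pages and a 2M PMD (e.g. x86-64). */
#define TOY_PTES_PER_PMD 512

/* Hypothetical model of one 2M range in the minor-registered mapping. */
struct toy_range {
    bool pte_present[TOY_PTES_PER_PMD]; /* PTE installed in this mapping */
    bool continued[TOY_PTES_PER_PMD];   /* userspace already resolved it */
};

/* Number of minor-fault messages if every 4K page is touched once. */
size_t toy_count_traps(const struct toy_range *r)
{
    size_t traps = 0;
    for (size_t i = 0; i < TOY_PTES_PER_PMD; i++)
        if (!r->pte_present[i])
            traps++;
    return traps;
}

/* Models the page table being dropped: all PTEs in the range vanish. */
void toy_retract(struct toy_range *r)
{
    for (size_t i = 0; i < TOY_PTES_PER_PMD; i++)
        r->pte_present[i] = false;
}

/*
 * Models an idempotent handler: it re-installs the PTE (standing in for a
 * successful UFFDIO_CONTINUE) and reports whether the fault was a repeat
 * for a page that userspace had already resolved earlier.
 */
bool toy_handle_minor(struct toy_range *r, size_t idx)
{
    assert(idx < TOY_PTES_PER_PMD);
    bool repeat = r->continued[idx];
    r->continued[idx] = true;
    r->pte_present[idx] = true;
    return repeat;
}
```

With 12 pte holes, the model reports 12 possible traps before
toy_retract() and 512 afterwards; the extra 500 are repeats that the
handler resolves again without treating them as errors.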