Date: Wed, 23 Sep 2020 14:17:56 -0300
From: Jason Gunthorpe
To: Peter Xu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Linus Torvalds,
	Michal Hocko, Kirill Shutemov, Jann Horn, Oleg Nesterov, Kirill Tkhai,
	Hugh Dickins, Leon Romanovsky, Jan Kara, John Hubbard,
	Christoph Hellwig, Andrew Morton, Andrea Arcangeli
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()
Message-ID: <20200923171756.GC9916@ziepe.ca>
References: <20200921211744.24758-1-peterx@redhat.com>
	<20200921212031.25233-1-peterx@redhat.com>
	<20200922120505.GH8409@ziepe.ca>
	<20200923152409.GC59978@xz-x1>
In-Reply-To: <20200923152409.GC59978@xz-x1>

On Wed, Sep 23, 2020 at 11:24:09AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected.  So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied).
> > > We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages.  But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries.  It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough.  It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu
> > > mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > > 1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > >  	src_page = pmd_page(pmd);
> > >  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > +	/*
> > > +	 * If this page is a potentially pinned page, split and retry the fault
> > > +	 * with smaller page size.  Normally this should not happen because the
> > > +	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
> > > +	 * best effort that the pinned pages won't be replaced by another
> > > +	 * random page during the coming copy-on-write.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > +		     page_maybe_dma_pinned(src_page))) {
> > > +		pte_free(dst_mm, pgtable);
> > > +		spin_unlock(src_ptl);
> > > +		spin_unlock(dst_ptl);
> > > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > +		return -EAGAIN;
> > > +	}
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.

Sure, splitting is safe, but testing has_pinned without checking COW is
not, for the reasons Jann explained.

The 'maybe' in page_maybe_dma_pinned() means it can return true when
the correct answer is false. It can never return false when the
correct answer is true.

It is the same when has_pinned is involved: the combined expression
must never return false when true is correct. Which means it can only
be applied to COW cases.

Jason
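
To make the constraint concrete, here is a rough userspace model of the
ordering I mean. It is only a sketch, not kernel code: struct model_mm,
struct model_page, must_copy_at_fork() and model_self_check() are
hypothetical names made up for illustration. Only the is_cow_mapping()
test and the two VM_* flag values are meant to mirror include/linux/mm.h.
The point is that the combined has_pinned/maybe-pinned test is consulted
only after the mapping is known to be COW.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in include/linux/mm.h. */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Same test as the kernel helper: private and possibly writable. */
static bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical stand-ins for the mm and page state used by the check. */
struct model_mm {
	bool has_pinned;	/* models src_mm->has_pinned */
};

struct model_page {
	bool maybe_dma_pinned;	/* models page_maybe_dma_pinned(page) */
};

/*
 * Hypothetical helper: should fork() avoid write protecting this page?
 *
 * The combined test may report true when the right answer is false, but
 * it must never report false when the right answer is true.  In this
 * model it is therefore only consulted for COW mappings.
 */
static bool must_copy_at_fork(unsigned long vm_flags,
			      const struct model_mm *src_mm,
			      const struct model_page *page)
{
	if (!is_cow_mapping(vm_flags))
		return false;	/* combined test is not consulted here */

	return src_mm->has_pinned && page->maybe_dma_pinned;
}

/* Small self check of the model, using the hypothetical helper above. */
static void model_self_check(void)
{
	struct model_mm pinned_mm = { .has_pinned = true };
	struct model_mm plain_mm = { .has_pinned = false };
	struct model_page page = { .maybe_dma_pinned = true };

	/* Private, possibly writable mapping: the test applies. */
	assert(must_copy_at_fork(VM_MAYWRITE, &pinned_mm, &page));

	/* Shared mapping: not COW, so the test is skipped. */
	assert(!must_copy_at_fork(VM_SHARED | VM_MAYWRITE, &pinned_mm, &page));

	/* COW mapping in an mm that never pinned anything. */
	assert(!must_copy_at_fork(VM_MAYWRITE, &plain_mm, &page));
}
```

In terms of the hunk quoted above, that would amount to an
is_cow_mapping() test on the vma flags ahead of the has_pinned check,
but treat that placement as an assumption of this sketch rather than a
tested change.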