From mboxrd@z Thu Jan  1 00:00:00 1970
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Cc: peterx@redhat.com,
	Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>,
	Yang Shi <shy828301@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>,
	Jann Horn <jannh@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	Jan Kara <jack@suse.cz>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
Date: Fri, 25 Sep 2020 18:25:59 -0400
Message-Id: <20200925222600.6832-4-peterx@redhat.com>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20200925222600.6832-1-peterx@redhat.com>
References: <20200925222600.6832-1-peterx@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
List-ID: <linux-mm.kvack.org>

This patch allows copy_pte_range() to do early cow if the pages were
pinned on the source mm.  Currently we don't have an accurate way to
know whether a page is pinned or not; the only thing we have is
page_maybe_dma_pinned().  However, that's good enough for now,
especially with the newly added mm->has_pinned flag, which makes sure
we won't affect processes that never pinned any pages.
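
For readers less familiar with why the check is only a "maybe", here is
a minimal userspace sketch.  It assumes the order-0 page case, where (as
far as I understand it) a pin is accounted by adding a bias of 1024
(GUP_PIN_COUNTING_BIAS) to the page refcount.  All sketch_* names are
hypothetical, and has_pinned is a plain bool here rather than the
atomic_t the patch reads with atomic_read().

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors GUP_PIN_COUNTING_BIAS (1U << 10); assumed for order-0 pages. */
#define SKETCH_PIN_BIAS 1024

struct sketch_page { int refcount; };   /* hypothetical stand-in for struct page */
struct sketch_mm   { bool has_pinned; };/* hypothetical stand-in for mm_struct */

/* A normal reference, like get_page(). */
static void sketch_get_page(struct sketch_page *p)
{
	p->refcount += 1;
}

/* A FOLL_PIN style reference: marks the mm and adds the bias. */
static void sketch_pin_page(struct sketch_mm *mm, struct sketch_page *p)
{
	mm->has_pinned = true;
	p->refcount += SKETCH_PIN_BIAS;
}

/* Heuristic only: many normal references look the same as one pin. */
static bool sketch_maybe_pinned(const struct sketch_page *p)
{
	return p->refcount >= SKETCH_PIN_BIAS;
}

/* The gate used at fork time: the cheap mm flag first, then the heuristic. */
static bool sketch_needs_early_cow(const struct sketch_mm *mm,
				   const struct sketch_page *p)
{
	return mm->has_pinned && sketch_maybe_pinned(p);
}
```

A page that merely has 1024 normal references is reported as maybe
pinned, but the mm flag keeps processes that never pinned anything off
the early-cow path.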

It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte().  Unfortunately we can't, because we are holding the
page table locks of both the parent and child processes there.  So the
page allocation needs to be done outside copy_one_pte().
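
The resulting "ask for a page, drop the locks, allocate, retry" flow can
be sketched in userspace roughly as follows.  This is an illustration
only, not the kernel code: the sketch_* names, the tiny page size and
the boolean "lock" are all assumptions made for the example, and
malloc() stands in for the sleeping allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 64	/* hypothetical tiny "page" */

struct sketch_slot {
	bool pinned;	/* stands in for the "maybe pinned" check */
	char *src;	/* parent's page */
	char *dst;	/* child's page: shared pointer or private copy */
};

static bool sketch_locked;	/* stands in for both page table locks */

/* Runs with the "locks" held, so it must never allocate. */
static int sketch_copy_one(struct sketch_slot *s, char **prealloc)
{
	assert(sketch_locked);
	if (!s->pinned) {
		s->dst = s->src;	/* normal path: share the page */
		return 0;
	}
	if (!*prealloc)
		return -EAGAIN;		/* ask the caller for a page */
	memcpy(*prealloc, s->src, SKETCH_PAGE_SIZE);
	s->dst = *prealloc;		/* consume the preallocated page */
	*prealloc = NULL;
	return 0;
}

/* Copies every slot; returns 0 on success or -ENOMEM. */
static int sketch_copy_range(struct sketch_slot *slots, size_t n)
{
	char *prealloc = NULL;
	size_t i = 0;
	int ret = 0;

	while (i < n) {
		sketch_locked = true;
		for (; i < n; i++) {
			ret = sketch_copy_one(&slots[i], &prealloc);
			if (ret == -EAGAIN)
				break;	/* retry slot i after allocating */
		}
		sketch_locked = false;

		if (ret == -EAGAIN) {
			/* Locks are dropped, so allocating is fine here. */
			prealloc = malloc(SKETCH_PAGE_SIZE);
			if (!prealloc)
				return -ENOMEM;
			ret = 0;
		}
	}
	free(prealloc);	/* NULL unless a preallocated page went unused */
	return ret;
}
```

Calling sketch_copy_range() on an array that mixes pinned and unpinned
slots leaves the unpinned ones sharing the parent's buffer and gives
each pinned one a private copy, which the caller owns.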

There is some trickiness in copy_present_pte(), mainly the wrprotect
trick to block concurrent fast-gup.  The comments in the function
explain it in more detail.
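
As a rough illustration of why the order matters, here is a userspace
model using C11 atomics.  It assumes that the fast-gup side pins the
page first and then re-checks the pte, backing off if it changed, which
is my reading of gup_pte_range().  The sketch_* names and the two-field
"pte" are hypothetical simplifications.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_pte {
	atomic_bool writable;	/* stands in for the pte write bit */
	atomic_int pins;	/* stands in for the page pin count */
};

/* Fast-gup side: pin first, then re-check the pte and back off. */
static bool sketch_fast_gup_write(struct sketch_pte *p)
{
	if (!atomic_load(&p->writable))
		return false;
	atomic_fetch_add(&p->pins, 1);
	if (!atomic_load(&p->writable)) {
		atomic_fetch_sub(&p->pins, 1);	/* pte changed: undo the pin */
		return false;
	}
	return true;
}

/* Fork side: write-protect first, then look at the pin count. */
static bool sketch_fork_must_copy(struct sketch_pte *p)
{
	bool was_writable = atomic_exchange(&p->writable, false);

	if (atomic_load(&p->pins) > 0) {
		/* Parent keeps the page, so give the write bit back. */
		atomic_store(&p->writable, was_writable);
		return true;
	}
	return false;	/* stays write-protected for normal cow */
}
```

With this ordering, either fork observes the pin and copies the page for
the child, or the fast-gup side observes the cleared write bit and
fails.  A pin that is taken and then undone may still make fork copy
needlessly, which costs a page copy but is otherwise harmless.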

During review, Oleg Nesterov reported a (probably harmless) bug: we
didn't reset entry.val properly in copy_pte_range(), so there is a
chance of calling add_swap_count_continuation() multiple times on the
same swp entry.  That should be harmless, since even if it happens,
add_swap_count_continuation() will return directly after noticing that
there is already enough space for the swp counter.  So instead of a
standalone stable patch, it is fixed up in this patch directly.

Reference discussions:

  https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 156 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, stru=
ct mm_struct *src_mm,
 	return 0;
 }
=20
-static inline void
+/*
+ * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated=
 page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		struct vm_area_struct *new,
+		unsigned long addr, int *rss, struct page **prealloc)
 {
 	unsigned long vm_flags =3D vma->vm_flags;
 	pte_t pte =3D *src_pte;
 	struct page *page;
=20
+	page =3D vm_normal_page(vma, addr, pte);
+	if (page) {
+		if (is_cow_mapping(vm_flags)) {
+			bool is_write =3D pte_write(pte);
+
+			/*
+			 * The trick starts.
+			 *
+			 * What we want to do is to check whether this page may
+			 * have been pinned by the parent process.  If so,
+			 * instead of wrprotect the pte on both sides, we copy
+			 * the page immediately so that we'll always guarantee
+			 * the pinned page won't be randomly replaced in the
+			 * future.
+			 *
+			 * To achieve this, we do the following:
+			 *
+			 * 1. Write-protect the pte if it's writable.  This is
+			 *    to protect concurrent write fast-gup with
+			 *    FOLL_PIN, so that we'll fail the fast-gup with
+			 *    the write bit removed.
+			 *
+			 * 2. Check page_maybe_dma_pinned() to see whether this
+			 *    page may have been pinned.
+			 *
+			 * The order of these steps is important to serialize
+			 * against the fast-gup code (gup_pte_range()) on the
+			 * pte check and try_grab_compound_head(), so that
+			 * we'll make sure either we'll capture that fast-gup
+			 * so we'll copy the pinned page here, or we'll fail
+			 * that fast-gup.
+			 */
+			if (is_write) {
+				ptep_set_wrprotect(src_mm, addr, src_pte);
+				/*
+				 * This is not needed for serializing fast-gup,
+				 * however always make it consistent with
+				 * src_pte, since we'll need it when current
+				 * page is not pinned.
+				 */
+				pte =3D pte_wrprotect(pte);
+			}
+
+			if (atomic_read(&src_mm->has_pinned) &&
+			    page_maybe_dma_pinned(page)) {
+				struct page *new_page =3D *prealloc;
+
+				/*
+				 * This is possibly pinned page, need to copy.
+				 * Safe to release the write bit if necessary.
+				 */
+				if (is_write)
+					set_pte_at(src_mm, addr, src_pte,
+						   pte_mkwrite(pte));
+
+				/* If we don't have a pre-allocated page, ask */
+				if (!new_page)
+					return -EAGAIN;
+
+				/*
+				 * We have a prealloc page, all good!  Take it
+				 * over and copy the page & arm it.
+				 */
+				*prealloc =3D NULL;
+				copy_user_highpage(new_page, page, addr, vma);
+				__SetPageUptodate(new_page);
+				pte =3D mk_pte(new_page, new->vm_page_prot);
+				pte =3D pte_sw_mkyoung(pte);
+				pte =3D maybe_mkwrite(pte_mkdirty(pte), new);
+				page_add_new_anon_rmap(new_page, new, addr, false);
+				rss[mm_counter(new_page)]++;
+				set_pte_at(dst_mm, addr, dst_pte, pte);
+				return 0;
+			}
+
+			/*
+			 * Logically we should recover the wrprotect() for
+			 * fast-gup, however when reach here it also means we
+			 * actually need to wrprotect() it again for cow.
+			 * Simply keep everything.  Note that there's another
+			 * chunk of cow logic below, but we should still need
+			 * that for !page case.
+			 */
+		}
+		get_page(page);
+		page_dup_rmap(page, false);
+		rss[mm_counter(page)]++;
+	}
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -807,14 +901,27 @@ copy_present_pte(struct mm_struct *dst_mm, struct m=
m_struct *src_mm,
 	if (!(vm_flags & VM_UFFD_WP))
 		pte =3D pte_clear_uffd_wp(pte);
=20
-	page =3D vm_normal_page(vma, addr, pte);
-	if (page) {
-		get_page(page);
-		page_dup_rmap(page, false);
-		rss[mm_counter(page)]++;
+	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
+static inline struct page *
+page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
+		   unsigned long addr)
+{
+	struct page *new_page;
+
+	new_page =3D alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+	if (!new_page)
+		return NULL;
+
+	if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		return NULL;
 	}
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
=20
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return new_page;
 }
=20
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *sr=
c_mm,
@@ -825,16 +932,20 @@ static int copy_pte_range(struct mm_struct *dst_mm,=
 struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress =3D 0;
+	int progress, ret =3D 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry =3D (swp_entry_t){0};
+	struct page *prealloc =3D NULL;
=20
 again:
+	progress =3D 0;
 	init_rss_vec(rss);
=20
 	dst_pte =3D pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
-		return -ENOMEM;
+	if (!dst_pte) {
+		ret =3D -ENOMEM;
+		goto out;
+	}
 	src_pte =3D pte_offset_map(src_pmd, addr);
 	src_ptl =3D pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -866,8 +977,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, =
struct mm_struct *src_mm,
 			progress +=3D 8;
 			continue;
 		}
-		copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
-				 vma, addr, rss);
+		/* copy_present_pte() will clear `*prealloc' if consumed */
+		ret =3D copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+				       vma, new, addr, rss, &prealloc);
+		/*
+		 * If we need a pre-allocated page for this pte, drop the
+		 * locks, allocate, and try again.
+		 */
+		if (unlikely(ret =3D=3D -EAGAIN))
+			break;
+		if (unlikely(prealloc)) {
+			/*
+			 * pre-alloc page cannot be reused by next time so as
+			 * to strictly follow mempolicy (e.g., alloc_page_vma()
+			 * will allocate page according to address).  This
+			 * could only happen if one pinned pte changed.
+			 */
+			put_page(prealloc);
+			prealloc =3D NULL;
+		}
 		progress +=3D 8;
 	} while (dst_pte++, src_pte++, addr +=3D PAGE_SIZE, addr !=3D end);
=20
@@ -879,13 +1007,25 @@ static int copy_pte_range(struct mm_struct *dst_mm=
, struct mm_struct *src_mm,
 	cond_resched();
=20
 	if (entry.val) {
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+			ret =3D -ENOMEM;
+			goto out;
+		}
+		entry.val =3D 0;
+	} else if (ret) {
+		WARN_ON_ONCE(ret !=3D -EAGAIN);
+		prealloc =3D page_copy_prealloc(src_mm, vma, addr);
+		if (!prealloc)
 			return -ENOMEM;
-		progress =3D 0;
+		/* We've captured and resolved the error. Reset, try again. */
+		ret =3D 0;
 	}
 	if (addr !=3D end)
 		goto again;
-	return 0;
+out:
+	if (unlikely(prealloc))
+		put_page(prealloc);
+	return ret;
 }
=20
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_str=
uct *src_mm,
--=20
2.26.2