From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: peterx@redhat.com, Jason Gunthorpe, John Hubbard, Andrew Morton, Christoph Hellwig, Yang Shi, Oleg Nesterov, Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Jann Horn, Linus Torvalds, Michal Hocko, Jan Kara, Andrea Arcangeli, Leon Romanovsky
Subject: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes
Date: Fri, 25 Sep 2020 18:25:59 -0400
Message-Id: <20200925222600.6832-4-peterx@redhat.com>
In-Reply-To: <20200925222600.6832-1-peterx@redhat.com>
References: <20200925222600.6832-1-peterx@redhat.com>

It allows copy_pte_range() to do early cow if the pages were pinned on the
source mm.

Currently we don't have an accurate way to know whether a page is pinned or
not.  The only thing we have is page_maybe_dma_pinned().  However, that's
good enough for now, especially with the newly added mm->has_pinned flag,
which makes sure we won't affect processes that never pinned any pages.

It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte().  Unfortunately we can't, because we hold the page table
locks of both the parent and child processes.  So the page allocation needs
to be done outside copy_one_pte().

There is some trickery in copy_present_pte(), mainly the wrprotect trick to
block concurrent fast-gup.  The comments in the function explain it in
place.

Oleg Nesterov reported a (probably harmless) bug during review: we didn't
reset entry.val properly in copy_pte_range(), so there's potentially a
chance of calling add_swap_count_continuation() multiple times on the same
swp entry.  That should be harmless, since even if it happens, the same
function (add_swap_count_continuation()) will return directly once it
notices that there's enough space for the swp counter.  So instead of a
standalone stable patch, it is touched up in this patch directly.

Reference discussions:

  https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
---
 mm/memory.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 156 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return 0;
 }
 
-static inline void
+/*
+ * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		struct vm_area_struct *new,
+		unsigned long addr, int *rss, struct page **prealloc)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 
+	page = vm_normal_page(vma, addr, pte);
+	if (page) {
+		if (is_cow_mapping(vm_flags)) {
+			bool is_write = pte_write(pte);
+
+			/*
+			 * The trick starts.
+			 *
+			 * What we want to do is to check whether this page may
+			 * have been pinned by the parent process. If so,
+			 * instead of wrprotect the pte on both sides, we copy
+			 * the page immediately so that we'll always guarantee
+			 * the pinned page won't be randomly replaced in the
+			 * future.
+			 *
+			 * To achieve this, we do the following:
+			 *
+			 * 1. Write-protect the pte if it's writable. This is
+			 *    to protect concurrent write fast-gup with
+			 *    FOLL_PIN, so that we'll fail the fast-gup with
+			 *    the write bit removed.
+			 *
+			 * 2. Check page_maybe_dma_pinned() to see whether this
+			 *    page may have been pinned.
+			 *
+			 * The order of these steps is important to serialize
+			 * against the fast-gup code (gup_pte_range()) on the
+			 * pte check and try_grab_compound_head(), so that
+			 * we'll make sure either we'll capture that fast-gup
+			 * so we'll copy the pinned page here, or we'll fail
+			 * that fast-gup.
+			 */
+			if (is_write) {
+				ptep_set_wrprotect(src_mm, addr, src_pte);
+				/*
+				 * This is not needed for serializing fast-gup,
+				 * however always make it consistent with
+				 * src_pte, since we'll need it when current
+				 * page is not pinned.
+				 */
+				pte = pte_wrprotect(pte);
+			}
+
+			if (atomic_read(&src_mm->has_pinned) &&
+			    page_maybe_dma_pinned(page)) {
+				struct page *new_page = *prealloc;
+
+				/*
+				 * This is possibly pinned page, need to copy.
+				 * Safe to release the write bit if necessary.
+				 */
+				if (is_write)
+					set_pte_at(src_mm, addr, src_pte,
+						   pte_mkwrite(pte));
+
+				/* If we don't have a pre-allocated page, ask */
+				if (!new_page)
+					return -EAGAIN;
+
+				/*
+				 * We have a prealloc page, all good! Take it
+				 * over and copy the page & arm it.
+				 */
+				*prealloc = NULL;
+				copy_user_highpage(new_page, page, addr, vma);
+				__SetPageUptodate(new_page);
+				pte = mk_pte(new_page, new->vm_page_prot);
+				pte = pte_sw_mkyoung(pte);
+				pte = maybe_mkwrite(pte_mkdirty(pte), new);
+				page_add_new_anon_rmap(new_page, new, addr, false);
+				rss[mm_counter(new_page)]++;
+				set_pte_at(dst_mm, addr, dst_pte, pte);
+				return 0;
+			}
+
+			/*
+			 * Logically we should recover the wrprotect() for
+			 * fast-gup, however when reach here it also means we
+			 * actually need to wrprotect() it again for cow.
+			 * Simply keep everything. Note that there's another
+			 * chunk of cow logic below, but we should still need
+			 * that for !page case.
+			 */
+		}
+		get_page(page);
+		page_dup_rmap(page, false);
+		rss[mm_counter(page)]++;
+	}
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -807,14 +901,27 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
-	page = vm_normal_page(vma, addr, pte);
-	if (page) {
-		get_page(page);
-		page_dup_rmap(page, false);
-		rss[mm_counter(page)]++;
+	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
+static inline struct page *
+page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
+		   unsigned long addr)
+{
+	struct page *new_page;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+	if (!new_page)
+		return NULL;
+
+	if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		return NULL;
 	}
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
 
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return new_page;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -825,16 +932,20 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress = 0;
+	int progress, ret = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	struct page *prealloc = NULL;
 
 again:
+	progress = 0;
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
-		return -ENOMEM;
+	if (!dst_pte) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	src_pte = pte_offset_map(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -866,8 +977,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			progress += 8;
 			continue;
 		}
-		copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
-				 vma, addr, rss);
+		/* copy_present_pte() will clear `*prealloc' if consumed */
+		ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+				       vma, new, addr, rss, &prealloc);
+		/*
+		 * If we need a pre-allocated page for this pte, drop the
+		 * locks, allocate, and try again.
+		 */
+		if (unlikely(ret == -EAGAIN))
+			break;
+		if (unlikely(prealloc)) {
+			/*
+			 * pre-alloc page cannot be reused by next time so as
+			 * to strictly follow mempolicy (e.g., alloc_page_vma()
+			 * will allocate page according to address). This
+			 * could only happen if one pinned pte changed.
+			 */
+			put_page(prealloc);
+			prealloc = NULL;
+		}
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -879,13 +1007,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	cond_resched();
 
 	if (entry.val) {
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		entry.val = 0;
+	} else if (ret) {
+		WARN_ON_ONCE(ret != -EAGAIN);
+		prealloc = page_copy_prealloc(src_mm, vma, addr);
+		if (!prealloc)
 			return -ENOMEM;
-		progress = 0;
+		/* We've captured and resolved the error. Reset, try again. */
+		ret = 0;
 	}
 	if (addr != end)
 		goto again;
-	return 0;
+out:
+	if (unlikely(prealloc))
+		put_page(prealloc);
+	return ret;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-- 
2.26.2
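
For readers who want to see the -EAGAIN/prealloc retry pattern from
copy_pte_range() in isolation, here is a minimal userspace sketch.  It is
not kernel code and not part of the patch: struct slot, copy_one(),
copy_range() and the lock_held flag are hypothetical stand-ins for a pte,
copy_present_pte(), copy_pte_range() and the page table locks, and malloc()
stands in for page_copy_prealloc().  It only models the control flow: never
allocate while the "lock" is held, report -EAGAIN, drop the lock, allocate,
and resume at the same entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SLOT_SIZE 16

/* Hypothetical stand-in for a pte plus the page it maps. */
struct slot {
	char *data;	/* the "page" */
	bool pinned;	/* models page_maybe_dma_pinned() returning true */
};

/* Models the page table locks; allocation is forbidden while set. */
static bool lock_held;

/*
 * Models copy_present_pte(): share unpinned entries, copy pinned ones.
 * Returns 0 on success, or -EAGAIN if a preallocated buffer is needed.
 */
static int copy_one(struct slot *dst, const struct slot *src, char **prealloc)
{
	assert(lock_held);

	if (!src->pinned) {
		dst->data = src->data;	/* share, as normal COW would */
		dst->pinned = false;
		return 0;
	}

	/* No preallocated buffer yet: ask the caller for one. */
	if (!*prealloc)
		return -EAGAIN;

	/* Consume the preallocated buffer and copy early. */
	memcpy(*prealloc, src->data, SLOT_SIZE);
	dst->data = *prealloc;
	dst->pinned = false;
	*prealloc = NULL;
	return 0;
}

/* Models copy_pte_range(): allocate only after dropping the lock. */
static int copy_range(struct slot *dst, const struct slot *src, size_t n)
{
	char *prealloc = NULL;
	size_t i = 0;
	int ret = 0;

again:
	lock_held = true;
	for (; i < n; i++) {
		ret = copy_one(&dst[i], &src[i], &prealloc);
		if (ret == -EAGAIN)
			break;	/* i is left pointing at the entry to retry */
	}
	lock_held = false;

	if (ret == -EAGAIN) {
		prealloc = malloc(SLOT_SIZE);
		if (!prealloc)
			return -ENOMEM;
		ret = 0;	/* error captured and resolved, try again */
		goto again;
	}

	free(prealloc);		/* no-op when the buffer was consumed */
	return ret;
}
```

Calling copy_range() on an array where only some entries have pinned set
should leave the unpinned entries sharing the source buffer and give each
pinned entry a private copy, with one lock drop per pinned entry.  The real
function additionally has to respect mempolicy, memcg charging and the
fast-gup ordering described in the comments of the patch, none of which
this sketch attempts to model.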