Subject: Re: [PATCH] hugetlb: freeze allocated pages before creating hugetlb pages
From: Muchun Song <muchun.song@linux.dev>
Date: Wed, 10 Aug 2022 14:20:50 +0800
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song,
    Joao Martins, Matthew Wilcox, Michal Hocko, Peter Xu, Miaohe Lin,
    Andrew Morton
In-Reply-To: <20220808212836.111749-1-mike.kravetz@oracle.com>
References: <20220808212836.111749-1-mike.kravetz@oracle.com>

> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> When creating hugetlb pages, the hugetlb code must first allocate
> contiguous pages from a low level allocator such as buddy, cma or
> memblock. The pages returned from these low level allocators are
> ref counted. This creates potential issues with other code taking
> speculative references on these pages before they can be transformed to
> a hugetlb page. This issue has been addressed with methods and code
> such as that provided in [1].
>
> Recent discussions about vmemmap freeing [2] have indicated that it
> would be beneficial to freeze all sub pages, including the head page
> of pages returned from low level allocators before converting to a
> hugetlb page. This helps avoid races if we want to replace the page
> containing vmemmap for the head page.
>
> There have been proposals to change at least the buddy allocator to
> return frozen pages as described at [3]. If such a change is made, it
> can be employed by the hugetlb code. However, as mentioned above
> hugetlb uses several low level allocators, so each would need to be
> modified to return frozen pages. For now, we can manually freeze the
> returned pages. This is done in two places:
> 1) alloc_buddy_huge_page, only the returned head page is ref counted.
> We freeze the head page, retrying once in the VERY rare case where
> there may be an inflated ref count.
> 2) prep_compound_gigantic_page, for gigantic pages the current code
> freezes all pages except the head page. New code will simply freeze
> the head page as well.
>
> In a few other places, code checks for inflated ref counts on newly
> allocated hugetlb pages. With the modifications to freeze after
> allocating, this code can be removed.
>
> After hugetlb pages are freshly allocated, they are often added to the
> hugetlb free lists. Since these pages were previously ref counted, this
> was done via put_page() which would end up calling the hugetlb
> destructor: free_huge_page. With changes to freeze pages, we simply
> call free_huge_page directly to add the pages to the free list.
>
> In a few other places, freshly allocated hugetlb pages were immediately
> put into use, and the expectation was they were already ref counted. In
> these cases, we must manually ref count the page.
>
> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
> [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb.c | 97 +++++++++++++++++++---------------------------------
> 1 file changed, 35 insertions(+), 62 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 28516881a1b2..6b90d85d545b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>  {
>  	int i, j;
>  	int nr_pages = 1 << order;
> -	struct page *p = page + 1;
> +	struct page *p = page;
>
>  	/* we rely on prep_new_huge_page to set the destructor */
>  	set_compound_order(page, order);
> -	__ClearPageReserved(page);
>  	__SetPageHead(page);
> -	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +	for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>  		/*
>  		 * For gigantic hugepages allocated through bootmem at
>  		 * boot, it's safer to be consistent with the not-gigantic
> @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>  		} else {
>  			VM_BUG_ON_PAGE(page_count(p), p);
>  		}
> -		set_compound_head(p, page);
> +		if (i != 0)
> +			set_compound_head(p, page);
>  	}
>  	atomic_set(compound_mapcount_ptr(page), -1);
>  	atomic_set(compound_pincount_ptr(page), 0);
> @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  	int order = huge_page_order(h);
>  	struct page *page;
>  	bool alloc_try_hard = true;
> +	bool retry = true;
>
>  	/*
>  	 * By default we always try hard to allocate the page with
> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  		gfp_mask |= __GFP_RETRY_MAYFAIL;
>  	if (nid == NUMA_NO_NODE)
>  		nid = numa_mem_id();
> +retry:
>  	page = __alloc_pages(gfp_mask, order, nid, nmask);
> +
> +	/* Freeze head page */
> +	if (!page_ref_freeze(page, 1)) {

Hi Mike,

I saw Matthew has introduced a new helper alloc_frozen_pages() in thread [1]
to allocate a frozen page. With it we would not need to handle the unexpected
refcount case here, which should be easy. Is there any consideration why we
do not choose alloc_frozen_pages()?

[1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u

Muchun,
Thanks.
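For reference, below is a minimal sketch of the freeze-and-retry pattern the
hunk above introduces, next to what the call site could look like with the
frozen-page allocator from [1]. This is illustrative only: the wrapper name is
hypothetical, and alloc_frozen_pages() is the interface proposed in that
series, not something available in this tree.

#include <linux/gfp.h>
#include <linux/nodemask.h>
#include <linux/page_ref.h>

/*
 * Manual freeze, as in the patch above: page_ref_freeze(page, 1) only
 * succeeds when the ref count is exactly 1, i.e. nobody took a speculative
 * reference between the buddy allocation and this point.
 * (Hypothetical wrapper, shown only for illustration.)
 */
static struct page *alloc_frozen_head_manually(gfp_t gfp_mask, unsigned int order,
					       int nid, nodemask_t *nmask)
{
	struct page *page;
	bool retry = true;

again:
	page = __alloc_pages(gfp_mask, order, nid, nmask);
	if (page && !page_ref_freeze(page, 1)) {
		/* Unexpected extra reference: return the page and retry once. */
		__free_pages(page, order);
		if (retry) {
			retry = false;
			goto again;
		}
		page = NULL;
	}
	return page;	/* NULL, or a page whose ref count is 0 */
}

/*
 * With the allocator change proposed in [1] (interface assumed from that
 * series), the retry would disappear because the page already comes back
 * frozen, e.g.:
 *
 *	page = alloc_frozen_pages(gfp_mask, order);
 */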
> +		__free_pages(page, order);
> +		if (retry) {	/* retry once */
> +			retry = false;
> +			goto retry;
> +		}
> +		/* WOW! twice in a row. */
> +		pr_warn("HugeTLB head page unexpected inflated ref count\n");
> +		page = NULL;
> +	}
> +
>  	if (page)
>  		__count_vm_event(HTLB_BUDDY_PGALLOC);
>  	else
> @@ -1961,6 +1976,9 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  /*
>   * Common helper to allocate a fresh hugetlb page. All specific allocators
>   * should use this function to get new hugetlb pages
> + *
> + * Note that returned page is 'frozen': ref count of head page and all tail
> + * pages is zero.
>   */
>  static struct page *alloc_fresh_huge_page(struct hstate *h,
>  		gfp_t gfp_mask, int nid, nodemask_t *nmask,
> @@ -2018,7 +2036,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	if (!page)
>  		return 0;
>
> -	put_page(page); /* free it into the hugepage allocator */
> +	free_huge_page(page); /* free it into the hugepage allocator */
>
>  	return 1;
>  }
> @@ -2175,10 +2193,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
>   * Allocates a fresh surplus page from the page allocator.
>   */
>  static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> -		int nid, nodemask_t *nmask, bool zero_ref)
> +						int nid, nodemask_t *nmask)
>  {
>  	struct page *page = NULL;
> -	bool retry = false;
>
>  	if (hstate_is_gigantic(h))
>  		return NULL;
> @@ -2188,7 +2205,6 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  		goto out_unlock;
>  	spin_unlock_irq(&hugetlb_lock);
>
> -retry:
>  	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>  	if (!page)
>  		return NULL;
> @@ -2204,34 +2220,10 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>  		SetHPageTemporary(page);
>  		spin_unlock_irq(&hugetlb_lock);
> -		put_page(page);
> +		free_huge_page(page);
>  		return NULL;
>  	}
>
> -	if (zero_ref) {
> -		/*
> -		 * Caller requires a page with zero ref count.
> -		 * We will drop ref count here. If someone else is holding
> -		 * a ref, the page will be freed when they drop it. Abuse
> -		 * temporary page flag to accomplish this.
> -		 */
> -		SetHPageTemporary(page);
> -		if (!put_page_testzero(page)) {
> -			/*
> -			 * Unexpected inflated ref count on freshly allocated
> -			 * huge. Retry once.
> -			 */
> -			pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
> -			spin_unlock_irq(&hugetlb_lock);
> -			if (retry)
> -				return NULL;
> -
> -			retry = true;
> -			goto retry;
> -		}
> -		ClearHPageTemporary(page);
> -	}
> -
>  	h->surplus_huge_pages++;
>  	h->surplus_huge_pages_node[page_to_nid(page)]++;
>
> @@ -2253,6 +2245,9 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	if (!page)
>  		return NULL;
>
> +	/* fresh huge pages are frozen */
> +	set_page_refcounted(page);
> +
>  	/*
>  	 * We do not account these pages as surplus because they are only
>  	 * temporary and will be released properly on the last reference
> @@ -2280,14 +2275,14 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  		gfp_t gfp = gfp_mask | __GFP_NOWARN;
>
>  		gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> -		page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
> +		page = alloc_surplus_huge_page(h, gfp, nid, nodemask);
>
>  		/* Fallback to all nodes if page==NULL */
>  		nodemask = NULL;
>  	}
>
>  	if (!page)
> -		page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
> +		page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
>  	mpol_cond_put(mpol);
>  	return page;
>  }
> @@ -2358,7 +2353,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  	spin_unlock_irq(&hugetlb_lock);
>  	for (i = 0; i < needed; i++) {
>  		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
> -				NUMA_NO_NODE, NULL, true);
> +				NUMA_NO_NODE, NULL);
>  		if (!page) {
>  			alloc_ok = false;
>  			break;
>  		}
> @@ -2720,7 +2715,6 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  {
>  	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  	int nid = page_to_nid(old_page);
> -	bool alloc_retry = false;
>  	struct page *new_page;
>  	int ret = 0;
>
> @@ -2731,30 +2725,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  	 * the pool. This simplifies and let us do most of the processing
>  	 * under the lock.
>  	 */
> -alloc_retry:
>  	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
>  	if (!new_page)
>  		return -ENOMEM;
> -	/*
> -	 * If all goes well, this page will be directly added to the free
> -	 * list in the pool. For this the ref count needs to be zero.
> -	 * Attempt to drop now, and retry once if needed. It is VERY
> -	 * unlikely there is another ref on the page.
> -	 *
> -	 * If someone else has a reference to the page, it will be freed
> -	 * when they drop their ref. Abuse temporary page flag to accomplish
> -	 * this. Retry once if there is an inflated ref count.
> -	 */
> -	SetHPageTemporary(new_page);
> -	if (!put_page_testzero(new_page)) {
> -		if (alloc_retry)
> -			return -EBUSY;
> -
> -		alloc_retry = true;
> -		goto alloc_retry;
> -	}
> -	ClearHPageTemporary(new_page);
> -
>  	__prep_new_huge_page(h, new_page);
>
>  retry:
> @@ -2934,6 +2907,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  		}
>  		spin_lock_irq(&hugetlb_lock);
>  		list_add(&page->lru, &h->hugepage_activelist);
> +		set_page_refcounted(page);
>  		/* Fall through */
>  	}
>  	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
> @@ -3038,7 +3012,7 @@ static void __init gather_bootmem_prealloc(void)
>  		if (prep_compound_gigantic_page(page, huge_page_order(h))) {
>  			WARN_ON(PageReserved(page));
>  			prep_new_huge_page(h, page, page_to_nid(page));
> -			put_page(page); /* add to the hugepage allocator */
> +			free_huge_page(page); /* add to the hugepage allocator */
>  		} else {
>  			/* VERY unlikely inflated ref count on a tail page */
>  			free_gigantic_page(page, huge_page_order(h));
> @@ -3070,7 +3044,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>  				&node_states[N_MEMORY], NULL);
>  			if (!page)
>  				break;
> -			put_page(page); /* free it into the hugepage allocator */
> +			free_huge_page(page); /* free it into the hugepage allocator */
>  		}
>  		cond_resched();
>  	}
> @@ -3459,9 +3433,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
>  		else
>  			prep_compound_page(page + i, target_hstate->order);
>  		set_page_private(page + i, 0);
> -		set_page_refcounted(page + i);
>  		prep_new_huge_page(target_hstate, page + i, nid);
> -		put_page(page + i);
> +		free_huge_page(page + i);
>  	}
>  	mutex_unlock(&target_hstate->resize_lock);
>
> --
> 2.37.1
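As a footnote to the refcount hand-off described in the changelog, here is a
condensed sketch of the two paths a freshly allocated, frozen hugetlb page can
take after this patch. free_huge_page() and set_page_refcounted() are the
existing mm helpers the patch calls; the wrapper below is hypothetical and only
groups the two cases for illustration.

#include <linux/mm.h>
#include <linux/hugetlb.h>	/* free_huge_page() */
#include "internal.h"		/* set_page_refcounted(), as included from mm/hugetlb.c */

/* Hypothetical helper: not in the patch, shown only to contrast the paths. */
static void hand_off_fresh_hugetlb_page(struct page *page, bool use_now)
{
	if (!use_now) {
		/*
		 * Pool path. The old code did put_page(), which dropped the
		 * allocator's reference and reached free_huge_page() through
		 * the compound page destructor. A frozen page has no
		 * reference to drop, so call the destructor directly to put
		 * it on the hstate free list.
		 */
		free_huge_page(page);
	} else {
		/*
		 * Immediate-use path. Callers used to inherit the allocator's
		 * reference; with frozen pages they must re-establish it
		 * explicitly (ref count 0 -> 1).
		 */
		set_page_refcounted(page);
	}
}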