From mboxrd@z Thu Jan  1 00:00:00 1970
To: Oscar Salvador, Mike Kravetz
Cc: Muchun Song, Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20210208103812.32056-1-osalvador@suse.de> <20210208103812.32056-3-osalvador@suse.de>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
Message-ID: <9ed946df-9c6c-9a4d-4be9-2f32809974f7@redhat.com>
Date: Wed, 10 Feb 2021
09:23:59 +0100
In-Reply-To: <20210208103812.32056-3-osalvador@suse.de>

On 08.02.21 11:38, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle because, in order to
> guarantee no userspace application disruption, we need to replace
> the current free hugepage with a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced. This function will first try to get a new fresh hugetlb
> page, and if it succeeds, it will dissolve the old one.

Thanks for looking into this! Can we move this patch to #1 in the
series? It is the easier case.

I also wonder whether, on the memory unplug path, we should try to keep
nr_pages constant by at least trying to allocate a new page if required,
and print a warning if that fails (after all, we're messing with
something configured by the admin - "nr_pages"). Note that gigantic
pages are special (below).
> Signed-off-by: Oscar Salvador
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/compaction.c         | 11 +++++++++++
>  mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>
> +bool alloc_and_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					unsigned long addr,
>  					int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			low_pfn += compound_nr(page) - 1;
>  			goto isolate_success_no_list;
>  		}
> +	} else {

} else if (alloc_and_dissolve_huge_page(page)) {
	...

> +		/*
> +		 * Free hugetlb page. Allocate a new one and
> +		 * dissolve this one if we succeed.
> +		 */
> +		if (alloc_and_dissolve_huge_page(page)) {
> +			unsigned long order = buddy_order_unsafe(page);
> +
> +			low_pfn += (1UL << order) - 1;
> +			continue;
> +		}

Note that there is a very ugly corner case we will have to handle
gracefully (I think also in patch #1):

Assume you allocated a gigantic page (and assume that we are not using
CMA for gigantic pages for simplicity). Assume you want to allocate
another one.
alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the
first allocated page. It will try to alloc_and_dissolve_huge_page() the
existing gigantic page. To do that, it will
alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.

We really don't want to mess with gigantic pages (migrate, dissolve)
while allocating a gigantic page. I think the easiest (and cleanest) way
forward is to not mess (isolate, migrate, dissolve) with gigantic pages
at all.

Gigantic pages are not movable, so they won't be placed on random CMA /
ZONE_MOVABLE. Some hstate_is_gigantic(h) calls (maybe inside
alloc_and_dissolve_huge_page()?) along with a nice comment might be
good enough to avoid having to pass down some kind of alloc_contig
context. I even think that should be handled inside
alloc_and_dissolve_huge_page().

(The main issue is that, in contrast to CMA, plain alloc_contig_pages()
has no memory of which parts were allocated: it will simply try
re-allocating what it previously allocated and never freed - which is
usually fine, unless we're dealing with such special cases.)

Apart from that, not messing with gigantic pages feels like the right
approach (allocating/migrating gigantic pages is just horribly slow and
most probably not worth it anyway).

-- 
Thanks,

David / dhildenb