From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [External] [PATCH v2 09/11] hugetlb: batch PMD split for bulk vmemmap dedup
From: Muchun Song <muchun.song@linux.dev>
Date: Wed, 6 Sep 2023 19:34:51 +0800
To: Joao Martins
Cc: Muchun Song, Mike Kravetz, Linux-MM, LKML, Oscar Salvador,
 David Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
 Naoya Horiguchi, Michal Hocko, Matthew Wilcox, Xiongchun Duan,
 Andrew Morton
References: <20230905214412.89152-1-mike.kravetz@oracle.com>
 <20230905214412.89152-10-mike.kravetz@oracle.com>
 <0b0609d8-bc87-0463-bafd-9613f0053039@linux.dev>
 <2e942706-5772-0a93-bab3-902644c578e7@oracle.com>

> On Sep 6, 2023, at 17:44, Joao Martins wrote:
>
> On 06/09/2023 10:32, Muchun Song wrote:
>>> On Sep 6, 2023, at 17:26, Joao Martins wrote:
>>> On 06/09/2023 10:11, Muchun Song wrote:
>>>> On Wed, Sep 6, 2023 at 4:25 PM Muchun Song wrote:
>>>>> On 2023/9/6 05:44, Mike Kravetz wrote:
>>>>>> From: Joao Martins
>>>>>>
>>>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>>>> flush.
>>>>>>
>>>>>> Rebased and updated by Mike Kravetz
>>>>>>
>>>>>> Signed-off-by: Joao Martins
>>>>>> Signed-off-by: Mike Kravetz
>>>>>> ---
>>>>>> mm/hugetlb_vmemmap.c | 72 +++++++++++++++++++++++++++++++++++++++++---
>>>>>> 1 file changed, 68 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>>> index a715712df831..d956551699bc 100644
>>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>>> @@ -37,7 +37,7 @@ struct vmemmap_remap_walk {
>>>>>> 	struct list_head *vmemmap_pages;
>>>>>> };
>>>>>>
>>>>>> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>>> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
>>>>>> {
>>>>>> 	pmd_t __pmd;
>>>>>> 	int i;
>>>>>> @@ -80,7 +80,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>>>>>> 		/* Make pte visible before pmd. See comment in pmd_install(). */
>>>>>> 		smp_wmb();
>>>>>> 		pmd_populate_kernel(&init_mm, pmd, pgtable);
>>>>>> -		flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>>> +		if (flush)
>>>>>> +			flush_tlb_kernel_range(start, start + PMD_SIZE);
>>>>>> 	} else {
>>>>>> 		pte_free_kernel(&init_mm, pgtable);
>>>>>> 	}
>>>>>> @@ -127,11 +128,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>>>> 	do {
>>>>>> 		int ret;
>>>>>>
>>>>>> -		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
>>>>>> +		ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>>>>> +					     walk->remap_pte != NULL);
>>>>>
>>>>> It is better to only make @walk->remap_pte indicate whether we should go
>>>>> to the last page table level. I suggest reusing VMEMMAP_NO_TLB_FLUSH
>>>>> to indicate whether we should flush the TLB at pmd level. It'll be more
>>>>> clear.
>>>>>
>>>>>> 		if (ret)
>>>>>> 			return ret;
>>>>>>
>>>>>> 		next = pmd_addr_end(addr, end);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * We are only splitting, not remapping the hugetlb vmemmap
>>>>>> +		 * pages.
>>>>>> +		 */
>>>>>> +		if (!walk->remap_pte)
>>>>>> +			continue;
>>>>>> +
>>>>>> 		vmemmap_pte_range(pmd, addr, next, walk);
>>>>>> 	} while (pmd++, addr = next, addr != end);
>>>>>>
>>>>>> @@ -198,7 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>>>> 			return ret;
>>>>>> 	} while (pgd++, addr = next, addr != end);
>>>>>>
>>>>>> -	flush_tlb_kernel_range(start, end);
>>>>>> +	if (walk->remap_pte)
>>>>>> +		flush_tlb_kernel_range(start, end);
>>>>>>
>>>>>> 	return 0;
>>>>>> }
>>>>>> @@ -297,6 +308,35 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>>>> 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>>>> }
>>>>>>
>>>>>> +/**
>>>>>> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
>>>>>> + *                      backing PMDs of the directmap into PTEs
>>>>>> + * @start:	start address of the vmemmap virtual address range that we want
>>>>>> + *		to remap.
>>>>>> + * @end:	end address of the vmemmap virtual address range that we want to
>>>>>> + *		remap.
>>>>>> + * @reuse:	reuse address.
>>>>>> + *
>>>>>> + * Return: %0 on success, negative error code otherwise.
>>>>>> + */
>>>>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>>>>> +			       unsigned long reuse)
>>>>>> +{
>>>>>> +	int ret;
>>>>>> +	struct vmemmap_remap_walk walk = {
>>>>>> +		.remap_pte = NULL,
>>>>>> +	};
>>>>>> +
>>>>>> +	/* See the comment in the vmemmap_remap_free(). */
>>>>>> +	BUG_ON(start - reuse != PAGE_SIZE);
>>>>>> +
>>>>>> +	mmap_read_lock(&init_mm);
>>>>>> +	ret = vmemmap_remap_range(reuse, end, &walk);
>>>>>> +	mmap_read_unlock(&init_mm);
>>>>>> +
>>>>>> +	return ret;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>>  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>>>>>  *                      to the page which @reuse is mapped to, then free vmemmap
>>>>>> @@ -602,11 +642,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>>>> 	free_vmemmap_page_list(&vmemmap_pages);
>>>>>> }
>>>>>>
>>>>>> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>>>> +{
>>>>>> +	unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>>>>> +	unsigned long vmemmap_reuse;
>>>>>> +
>>>>>> +	if (!vmemmap_should_optimize(h, head))
>>>>>> +		return;
>>>>>> +
>>>>>> +	vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>>>>>> +	vmemmap_reuse = vmemmap_start;
>>>>>> +	vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>>>> +	 * @vmemmap_end]
>>>>>> +	 */
>>>>>> +	vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>>>>> +}
>>>>>> +
>>>>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>>>>> {
>>>>>> 	struct folio *folio;
>>>>>> 	LIST_HEAD(vmemmap_pages);
>>>>>>
>>>>>> +	list_for_each_entry(folio, folio_list, lru)
>>>>>> +		hugetlb_vmemmap_split(h, &folio->page);
>>>>>
>>>>> Maybe it is reasonable to add a return value to hugetlb_vmemmap_split()
>>>>> to indicate whether it has completed successfully; if it fails, it must be
>>>>> OOM, in which case there is no sense to continue to split the page table
>>>>> and optimize the vmemmap pages subsequently, right?
>>>>
>>>> Sorry, it is reasonable to continue to optimize the vmemmap pages
>>>> subsequently since it should succeed because those vmemmap pages
>>>> have been split successfully previously.
>>>>
>>>> Seems we should continue to optimize vmemmap once hugetlb_vmemmap_split()
>>>> fails; then we will have more memory to continue to split.
>>>
>>> Good point
>>>
>>>> But it will
>>>> make hugetlb_vmemmap_optimize_folios() a little complex. I'd like to
>>>> hear you guys' opinions here.
>>>>
>>> I think it won't add that much complexity if we don't optimize too much of the
>>> slowpath (when we are out of memory). In the batch freeing patch we could
>>> additionally test the return value of __hugetlb_vmemmap_optimize() for ENOMEM
>>> and free the currently stored vmemmap_pages (if any), and keep iterating the
>>> optimize loop. Should be simple enough and make this a bit more resilient to
>>> that scenario.
>>
>> Yep, we could try this.
>>
>>> But we would need to keep the earlier check you commented above
>>> (where we use @remap_pte to defer PMD flush).
>>
>> I think 2 flags will be suitable for you, one is VMEMMAP_REMAP_NO_TLB_FLUSH,
>> another is VMEMMAP_SPLIT_NO_TLB_FLUSH.
>
> This means going back to the v1. I thought we agreed to consolidate/simplify
> into one flag, and use @remap_pte to differentiate between split and remap.

But it is a little different: we use @remap_pte to indicate whether we should
go to the last level (i.e. do the remap), while the flag is used to indicate
whether we should flush the TLB when splitting is necessary (note that a remap
also needs a split). It means splitting and skipping the TLB flush are not
bound together.
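Roughly something like the below, completely untested and only to illustrate
the semantics; the @flags field, the helper name and the exact flag names are
placeholders here, not code from this series:

#define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)	/* caller batches the flush for PMD splits */
#define VMEMMAP_REMAP_NO_TLB_FLUSH	BIT(1)	/* caller batches the flush for PTE remaps */

static int vmemmap_pmd_entry_sketch(pmd_t *pmd, unsigned long addr,
				    unsigned long next,
				    struct vmemmap_remap_walk *walk)
{
	int ret;

	/*
	 * A split happens for both the split-only pass and the remap pass;
	 * whether it flushes per PMD depends only on the flag, not on
	 * @remap_pte.
	 */
	ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
				     !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
	if (ret)
		return ret;

	/* @remap_pte only selects whether we descend to the PTE level. */
	if (walk->remap_pte)
		vmemmap_pte_range(pmd, addr, next, walk);

	return 0;
}

And vmemmap_remap_range() would then only do its final flush_tlb_kernel_range()
when walk->remap_pte is set and VMEMMAP_REMAP_NO_TLB_FLUSH is not.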
Sorry, I just want to keep the semantics clear. Thanks.
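P.S. For the ENOMEM fallback discussed above, the batch freeing patch could do
something roughly like the below (again untested; the __hugetlb_vmemmap_optimize()
signature and its error return are assumptions about that patch, and the batched
TLB flushing is left out of the sketch):

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	/* First pass: split the PMDs; a failure only leaves some PMDs unsplit. */
	list_for_each_entry(folio, folio_list, lru)
		hugetlb_vmemmap_split(h, &folio->page);

	/* Second pass: remap and accumulate the freed vmemmap pages. */
	list_for_each_entry(folio, folio_list, lru) {
		int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
						     &vmemmap_pages);

		/*
		 * On ENOMEM, give back what has been accumulated so far and
		 * keep iterating; the freed pages may allow the remaining
		 * folios to be split and optimized.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
		}
	}

	free_vmemmap_page_list(&vmemmap_pages);
}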