Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup
From: Muchun Song <muchun.song@linux.dev>
Date: Thu, 21 Sep 2023 09:42:10 +0800
To: Joao Martins
Cc: Mike Kravetz, Muchun Song, Oscar Salvador, David Hildenbrand,
 Miaohe Lin, David Rientjes, Anshuman Khandual, Naoya Horiguchi,
 Barry Song <21cnbao@gmail.com>, Michal Hocko, Matthew Wilcox,
 Xiongchun Duan, Linux-MM, Andrew Morton, LKML
Message-Id: <98613579-0644-4532-8DCC-D4BA7183AB15@linux.dev>
In-Reply-To: <257b5833-5aaf-4748-a576-7610bd36e632@oracle.com>
References: <20230918230202.254631-1-mike.kravetz@oracle.com>
 <20230918230202.254631-7-mike.kravetz@oracle.com>
 <9c627733-e6a2-833b-b0f9-d59552f6ab0d@linux.dev>
 <07192BE2-C66E-4F74-8F76-05F57777C6B7@linux.dev>
 <83B874B6-FF22-4588-90A9-31644D598032@linux.dev>
 <5bd9c4f5-3411-4327-a495-ce6672150977@oracle.com>
 <257b5833-5aaf-4748-a576-7610bd36e632@oracle.com>
> On Sep 20, 2023, at 18:39, Joao Martins wrote:
> 
> On 20/09/2023 03:47, Muchun Song wrote:
>>> On Sep 19, 2023, at 23:09, Joao Martins wrote:
>>> On 19/09/2023 09:57, Muchun Song wrote:
>>>>> On Sep 19, 2023, at 16:55, Joao Martins wrote:
>>>>> On 19/09/2023 09:41, Muchun Song wrote:
>>>>>>> On Sep 19, 2023, at 16:26, Joao Martins wrote:
>>>>>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>>>>>         list_for_each_entry(folio, folio_list, lru) {
>>>>>>>>>                 int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>>>>>>                                                      &vmemmap_pages);
>>>>>>>>
>>>>>>>> This is unlikely to fail, since the page table allocation
>>>>>>>> is moved to the above.
>>>>>>>
>>>>>>>> (Note that the head vmemmap page allocation
>>>>>>>> is not mandatory.)
>>>>>>>
>>>>>>> Good point, that I almost forgot.
>>>>>>>
>>>>>>>> So we should handle the error case in the above
>>>>>>>> splitting operation.
>>>>>>>
>>>>>>> But back to the previous discussion in v2... the thinking was that /some/
>>>>>>> PMDs got split, and that could allow some PTE remapping to occur and free
>>>>>>> some pages back (each page allows 6 more splits, worst case). Then the next
>>>>>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>>>>>> hugepages that failed the batch PMD split (as we only defer the PTE remap
>>>>>>> TLB flush in this stage).
>>>>>>
>>>>>> Oh, yes. Maybe we could break the above traversal as early as possible
>>>>>> once we hit an ENOMEM?
>>>>>>
>>>>>
>>>>> Sounds good -- no point in continuing to try splitting if we are failing
>>>>> with OOM.
>>>>>
>>>>> Perhaps a comment in both of these clauses (the early break on split and the
>>>>> OOM handling in batch optimize) could help make this clear.
>>>>
>>>> Makes sense.
>>>
>>> These are the changes I have so far for this patch based on the discussion so
>>> far. For the next one it's at the end:
>>
>> Code looks good to me. One nit below.
>>
> Thanks
> 
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index e8bc2f7567db..d9c6f2cf698c 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -27,7 +27,8 @@
>>>   * @reuse_addr:     the virtual address of the @reuse_page page.
>>>   * @vmemmap_pages:  the list head of the vmemmap pages that can be freed
>>>   *                  or is mapped from.
>>> - * @flags:          used to modify behavior in bulk operations
>>> + * @flags:          used to modify behavior in vmemmap page table walking
>>> + *                  operations.
>>>   */
>>>  struct vmemmap_remap_walk {
>>>          void                    (*remap_pte)(pte_t *pte, unsigned long addr,
>>> @@ -36,6 +37,8 @@ struct vmemmap_remap_walk {
>>>          struct page             *reuse_page;
>>>          unsigned long           reuse_addr;
>>>          struct list_head        *vmemmap_pages;
>>> +
>>> +/* Skip the TLB flush when we split the PMD */
>>>  #define VMEMMAP_SPLIT_NO_TLB_FLUSH      BIT(0)
>>>          unsigned long           flags;
>>>  };
>>> @@ -132,7 +135,7 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>>          int ret;
>>>
>>>          ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>> -                                     walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
>>> +                                     !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
>>>          if (ret)
>>>                  return ret;
>>>
>>> @@ -677,13 +680,13 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>>>          free_vmemmap_page_list(&vmemmap_pages);
>>>  }
>>>
>>> -static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>> +static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>  {
>>>          unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>>          unsigned long vmemmap_reuse;
>>>
>>>          if (!vmemmap_should_optimize(h, head))
>>> -                return;
>>> +                return 0;
>>>
>>>          vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>>>          vmemmap_reuse = vmemmap_start;
>>> @@ -693,7 +696,7 @@ static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>>           * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>           * @vmemmap_end]
>>>           */
>>> -        vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>> +        return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>>  }
>>>
>>>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>>> @@ -701,8 +704,18 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
>>>          struct folio *folio;
>>>          LIST_HEAD(vmemmap_pages);
>>>
>>> -        list_for_each_entry(folio, folio_list, lru)
>>> -                hugetlb_vmemmap_split(h, &folio->page);
>>> +        list_for_each_entry(folio, folio_list, lru) {
>>> +                int ret = hugetlb_vmemmap_split(h, &folio->page);
>>> +
>>> +                /*
>>> +                 * Spliting the PMD requires allocating a page, thus lets fail
>>                      ^^^^                                 ^^^
>>                      Splitting                            page table page
>>
>> I'd like to specify the functionality of the allocated page.
>>
> OK
> 
>>> +                 * early once we encounter the first OOM. No point in retrying
>>> +                 * as it can be dynamically done on remap with the memory
>>> +                 * we get back from the vmemmap deduplication.
>>> +                 */
>>> +                if (ret == -ENOMEM)
>>> +                        break;
>>> +        }
>>>
>>>          flush_tlb_all();
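
A side note on the "!" fix in vmemmap_pmd_range() above, since it is easy to
read past: the last argument of split_vmemmap_huge_pmd() in this series says
whether the split itself should flush the TLB, so the no-flush flag has to be
inverted on the way down. In sketch form (assuming the signature used in this
series; not a verbatim quote of the source):

        /*
         * Sketch: @flush tells the split whether to flush the TLB itself.
         * Batch callers set VMEMMAP_SPLIT_NO_TLB_FLUSH (so @flush must be
         * false) and issue a single flush_tlb_all() once the whole folio
         * list has been split; non-batch callers keep flushing per split.
         */
        static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
                                          bool flush);

        ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
                                     !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));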
>>>
>>> For patch 7, I only have commentary added, derived from this earlier
>>> discussion above:
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index d9c6f2cf698c..f6a1020a4b6a 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -40,6 +40,8 @@ struct vmemmap_remap_walk {
>>>
>>>  /* Skip the TLB flush when we split the PMD */
>>>  #define VMEMMAP_SPLIT_NO_TLB_FLUSH      BIT(0)
>>> +/* Skip the TLB flush when we remap the PTE */
>>>  #define VMEMMAP_REMAP_NO_TLB_FLUSH      BIT(1)
>>>          unsigned long           flags;
>>>  };
>>>
>>> @@ -721,19 +739,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
>>>
>>>          list_for_each_entry(folio, folio_list, lru) {
>>>                  int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>                                                       &vmemmap_pages,
>>>                                                       VMEMMAP_REMAP_NO_TLB_FLUSH);
>>>
>>>                  /*
>>>                   * Pages to be freed may have been accumulated. If we
>>>                   * encounter an ENOMEM, free what we have and try again.
>>> +                 * This can occur in the case that both spliting fails
>>                                                           ^^^
>>                                                           splitting
>>
> ok
> 
>>> +                 * halfway and head page allocation also failed. In this
>>                                  ^^^^^^^
>>                                  head vmemmap page
>>
> ok
> 
>> Otherwise:
>>
>> Reviewed-by: Muchun Song
>>
> Thanks, I assume that's for both patches?

Yes. Thanks.

> 
>> Thanks.
>>
>>> +                 * case __hugetlb_vmemmap_optimize() would free memory
>>> +                 * allowing more vmemmap remaps to occur.
>>>                   */
>>>                  if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
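
For anyone who wants the full shape of that ENOMEM branch, the recovery the
quoted comment describes boils down to roughly the following (an illustrative
sketch of the described behavior, not the actual hunk): flush the deferred TLB
entries so the accumulated vmemmap pages are safe to hand back, free them, and
then retry the folio that failed once memory has been returned to the
allocator.

                if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
                        /*
                         * Sketch: make the deferred remaps visible before
                         * freeing, give the accumulated vmemmap pages back
                         * to the allocator, then retry this folio once.
                         */
                        flush_tlb_all();
                        free_vmemmap_page_list(&vmemmap_pages);
                        INIT_LIST_HEAD(&vmemmap_pages);
                        __hugetlb_vmemmap_optimize(h, &folio->page,
                                                   &vmemmap_pages,
                                                   VMEMMAP_REMAP_NO_TLB_FLUSH);
                }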