From: Baoquan He <bhe@redhat.com>
To: Hailong Liu <hailong.liu@oppo.com>
Cc: Barry Song <21cnbao@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>,
Christoph Hellwig <hch@infradead.org>,
Lorenzo Stoakes <lstoakes@gmail.com>,
Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Tangquan Zheng <zhengtangquan@oppo.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
Date: Fri, 26 Jul 2024 17:29:26 +0800 [thread overview]
Message-ID: <ZqNsdmxa6cQfxKFP@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20240726084809.gdz2axvawwwekpu6@oppo.com>
On 07/26/24 at 04:48pm, Hailong Liu wrote:
> On Fri, 26. Jul 16:37, Baoquan He wrote:
> > On 07/26/24 at 05:29pm, Barry Song wrote:
> > > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > > >
> > > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > > [...]
> > > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > > allocation fails, it will go normal mapping.
> > > > > > >
> > > > > > > However I also save the fallback position. The ones before this position are
> > > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > > "support the combination of PMD and PTE mapping". this will take some
> > > > > > > times as it needs to address the corner cases and do some tests.
> > > > > >
> > > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > > and vmalloc_huge().
> > > > > >
> > > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > > it's a best effort behaviour.
> > > > > >
> > > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > > because the align requirement of huge. the real size is 4M.
> > > > > if allocation first order-9 successfully and the next failed. becuase the
> > > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > > order9 support huge mapping, but order0 not.
> > > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > > mapping, the huge mapping would not exist.
> > > > >
> > > > > > mm/mm_init.c <<alloc_large_system_hash>>
> > > > > > table = vmalloc_huge(size, gfp_flags);
> > > > > > net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > > > new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > > net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > > > udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > > >
> > > > > > Maybe we should add code comment or document to notice people that the
> > > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > > use it after boot.
> > > > > >
> > > > > > >
> > > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > > other to address the current mapping issue.
> > > > > > >
> > > > > > > --
> > > > > > > help you, help me,
> > > > > > > Hailong.
> > > > > > >
> > > > > >
> > > > > >
> > > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > > fallback to order 0, actuaally without this commit
> > > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > > >
> > > > fail:
> > > > if (shift > PAGE_SHIFT) {
> > > > shift = PAGE_SHIFT;
> > > > align = real_align;
> > > > size = real_size;
> > > > goto again;
> > > > }
> > > >
> > > > So do we really need fallback to order-0 if nofail?
> > >
> > > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > > And just remove __GFP_NOFAIL bit when we are still allocating
> > > by high-order. When "goto again" happens, we will allocate by
> > > order-0, in this case, we keep the __GFP_NOFAIL.
> >
> > With Michal's patch, the fallback will be able to satisfy the allocation
> > for nofail case because it fallback to 0-order plus __GFP_NOFAIL. The
>
> Hi Baoquan:
>
> int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>
> WARN_ON(page_shift < PAGE_SHIFT);
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT)
> return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
> for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { ---> huge mapping
> int err;
>
> err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0
Thanks. I have got this issue from your patch. I mean if we have adjusted
the page_shift and page_order after fallback with the draft patch I
proposed, Barry still mentioned the nofail issue, that confuses me.
> page_shift);
> if (err)
> return err;
>
> addr += 1UL << page_shift;
> }
>
> return 0;
> }
> > 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
> > problemtic since it could jump to fail becuase vmap_pages_range()
> > invocation failed, or partially allocate huge parges and break down,
> > then it will ignore the already allocated pages, and do all the thing again.
> >
> > The only thing 'if (shift > PAGE_SHIFT)' checking and handling makes
> > sense is it fallback to the real_size and real_align. BUT we need handle
> > the fail separately, e.g
> > 1)__get_vm_area_node() failed;
> > 2)vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> > 3)vmap_pages_range() failed;
> >
> > Honestly, I didn't see where the nofail is mishandled, could you point
> > it out specifically? I could miss it.
> >
> > Thanks
> > Baoquan
> >
>
> --
> help you, help me,
> Hailong.
>
next prev parent reply other threads:[~2024-07-26 9:29 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-25 3:53 [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0 hailong.liu
2024-07-25 6:21 ` Barry Song
2024-07-25 9:17 ` Hailong Liu
2024-07-25 9:34 ` Barry Song
2024-07-25 9:58 ` Hailong Liu
2024-07-25 10:22 ` Barry Song
2024-07-25 11:39 ` Baoquan He
2024-07-25 16:40 ` Hailong Liu
2024-07-26 1:31 ` Barry Song
2024-07-26 2:31 ` Baoquan He
2024-07-26 3:53 ` Barry Song
2024-07-29 1:48 ` Baoquan He
2024-07-30 3:24 ` Hailong Liu
2024-07-30 4:29 ` Baoquan He
2024-07-26 4:00 ` Hailong Liu
2024-07-26 5:03 ` Hailong Liu
2024-07-26 5:29 ` Barry Song
2024-07-26 8:37 ` Baoquan He
2024-07-26 8:48 ` Hailong Liu
2024-07-26 9:00 ` Hailong Liu
2024-07-26 9:29 ` Baoquan He [this message]
2024-07-26 9:15 ` Barry Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZqNsdmxa6cQfxKFP@MiWiFi-R3L-srv \
--to=bhe@redhat.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=hailong.liu@oppo.com \
--cc=hch@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lstoakes@gmail.com \
--cc=mhocko@suse.com \
--cc=urezki@gmail.com \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=zhengtangquan@oppo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.