From: Minchan Kim <minchan@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>, John Dias <joaodias@google.com>,
Suren Baghdasaryan <surenb@google.com>,
pullip.cho@samsung.com,
Chris Goldsworthy <cgoldswo@codeaurora.org>
Subject: Re: [RFC 0/7] Support high-order page bulk allocation
Date: Tue, 18 Aug 2020 08:15:43 -0700
Message-ID: <20200818151543.GE3852332@google.com>
In-Reply-To: <7c07e8cf-6adc-92be-d819-d60a389559d8@redhat.com>
On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> There is a need from special HW that requires bulk allocation of
> >>>>>>> high-order pages. For example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is to use a CMA area, because
> >>>>>>> the page allocator with compaction under memory pressure easily
> >>>>>>> fails to meet the requirement and is too slow for 4800
> >>>>>>> allocations. However, CMA also has the following drawbacks:
> >>>>>>>
> >>>>>>> * 4800 order-4 cma_alloc calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory at once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> even one page in that range couldn't be migrated out, which happens
> >>>>>>> easily with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >>>>>> chunks and splitting them. That would already heavily reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>> alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system has lots of non-fragmented free memory.
> >>>>> However, once it is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and that's
> >>>>> why CMA was there.
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing, as
> >>>>> well as fragmentation avoidance, so that it has less chance of being
> >>>>> stolen from by others and a higher success ratio. That's why I want
> >>>>> this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have mentioned that. The 300M I mentioned is really the
> >>> minimum size. In some scenarios, we need way more than 300M, up to
> >>> several GB. Furthermore, the demand will only increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> A usecase is a device setting up an exclusive CMA area when the
> >>>>> system boots. When the device needs 4800 * order-4 pages, it could
> >>>>> call this bulk API against the area so that it is effectively
> >>>>> guaranteed to allocate enough pages, fast.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which the user
> >>> ends up perceiving as lag.
> >>
> >> Okay, but in theory, your device's needs are very similar to
> >> application needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> >
> > Yes.
> >
> >>
> >> I don't get quite yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> >
> > Yes, with CMA it could be more reliable and fast enough with a little
> > tweaking. Currently, CMA is too slow due to the IPI overheads below:
> >
> > 1. set_migratetype_isolate does drain_all_pages for every pageblock.
> > 2. __alloc_contig_migrate_range does migrate_prep.
> > 3. alloc_contig_range does lru_add_drain_all.
> >
> > Thus, if we need to increase the call frequency as you suggest, the
> > setup overhead also scales up with the size. Such overhead makes sense
> > when the caller requests big contiguous memory, but it's too much for
> > normal high-order allocations.
> >
> > Maybe we could optimize those call sites to reduce or remove the
> > frequency of those IPI calls in a smarter way, but that would end up
> > having to balance success ratio against speed.
> >
> > Another concern with using the existing CMA API is that it tries to
> > make the allocation succeed at the cost of latency. For example,
> > waiting for page writeback.
> >
> > That's where this new API semantic comes from, as a compromise: I
> > believe we need some way to separate the original CMA alloc (biased
> > toward being guaranteed but slower) from this new API (biased toward
> > being fast but less guaranteed).
> >
> > Is there any idea that works without tweaking the existing cma API?
>
> Let me try to summarize:
>
> 1. Your driver needs a lot of order-4 pages. And it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
>
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
>
> 3. Doing a whole bunch of order-4 cma allocations is slow.
>
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
>
>
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves them by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, temporary pinning is not the only reason
CMA can fail. Historically, there have been various problems that can turn
a "temporary" pin into a "non-temporary" one, like page writeback or
indirect dependencies between objects.
>
> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense when such high-order allocations are dominant in the
system workload, because the TLB benefit would outweigh the cost of the
frequent migration overhead. However, that's not our usecase.
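
For reference, the proactive compaction work in [2] is tunable from
userspace; assuming it lands in the form described there, the knob would
look like this (a sketch, needs a kernel with the feature):

```shell
# 0 disables proactive compaction; higher values (up to 100) make the
# kernel compact memory more aggressively in the background.
sysctl -w vm.compaction_proactiveness=20
```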
>
> [1]
> https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/
> [2] https://nitingupta.dev/post/proactive-compaction/
>

I understand the pfn stuff in the API is not pretty, but the concept makes
sense to me: go through the *migratable area* and allocate pages of the
wanted order with best effort. It looks like a GFP_NORETRY version of
kmem_cache_alloc_bulk.

How about this?
int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
Thread overview: 27+ messages
2020-08-14 17:31 [RFC 0/7] Support high-order page bulk allocation Minchan Kim
2020-08-14 17:31 ` [RFC 1/7] mm: page_owner: split page by order Minchan Kim
2020-08-14 17:31 ` [RFC 2/7] mm: introduce split_page_by_order Minchan Kim
2020-08-14 17:31 ` [RFC 3/7] mm: compaction: deal with upcoming high-order page splitting Minchan Kim
2020-08-14 17:31 ` [RFC 4/7] mm: factor __alloc_contig_range out Minchan Kim
2020-08-14 17:31 ` [RFC 5/7] mm: introduce alloc_pages_bulk API Minchan Kim
2020-08-17 17:40 ` David Hildenbrand
2020-08-14 17:31 ` [RFC 6/7] mm: make alloc_pages_bulk best effort Minchan Kim
2020-08-14 17:31 ` [RFC 7/7] mm/page_isolation: avoid drain_all_pages for alloc_pages_bulk Minchan Kim
2020-08-14 17:40 ` [RFC 0/7] Support high-order page bulk allocation Matthew Wilcox
2020-08-14 20:55 ` Minchan Kim
2020-08-18 2:16 ` Cho KyongHo
2020-08-18 9:22 ` Cho KyongHo
2020-08-16 12:31 ` David Hildenbrand
2020-08-17 15:27 ` Minchan Kim
2020-08-17 15:45 ` David Hildenbrand
2020-08-17 16:30 ` Minchan Kim
2020-08-17 16:44 ` David Hildenbrand
2020-08-17 17:03 ` David Hildenbrand
2020-08-17 23:34 ` Minchan Kim
2020-08-18 7:42 ` Nicholas Piggin
2020-08-18 7:49 ` David Hildenbrand
2020-08-18 15:15 ` Minchan Kim [this message]
2020-08-18 15:58 ` Matthew Wilcox
2020-08-18 16:22 ` David Hildenbrand
2020-08-18 16:49 ` Minchan Kim
2020-08-19 0:27 ` Yang Shi