Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Vlastimil Babka <vbabka@suse.cz>, linux-mm@kvack.org
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	David Gibson <david@gibson.dropbear.id.au>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Joonsoo Kim <js1304@gmail.com>, Mel Gorman <mgorman@suse.de>,
	Michal Hocko <mhocko@suse.cz>, Paul Mackerras <paulus@samba.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	linux-kernel@vger.kernel.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Alexander Graf <agraf@suse.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
Date: Mon, 17 Aug 2015 19:11:01 +1000	[thread overview]
Message-ID: <55D1A525.5090706@ozlabs.ru> (raw)
In-Reply-To: <55D1910C.7070006@suse.cz>

On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
> On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
>> This is about VFIO aka PCI passthrough used from QEMU.
>> KVM is irrelevant here.
>>
>> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
>> and these pages are movable which is ok. They may happen to be allocated
>> from the contiguous memory allocation zone (CMA). Which is also ok as
>> long they are movable.
>>
>> However if the guest starts using VFIO (which can be hotplugged into
>> the guest), in most cases it involves DMA which requires guest RAM pages
>> to be pinned and not move once their addresses are programmed to
>> the hardware for DMA.
>>
>> So we end up in a situation when quite many pages in CMA are not movable
>> anymore. And we get bunch of these:
>>
>> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
>> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
>> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
>
> IIRC CMA was for mobile devices and their camera/codec drivers and you
> don't use QEMU on those? What do you need CMA for in your case?


I do not want QEMU to get memory from CMA, this is my point. It just 
happens sometime that the kernel allocates movable pages from there.


>
>> This is a very rough patch to start the conversation about how to move
>> pages properly. mm/page_alloc.c does this and
>> arch/powerpc/mm/mmu_context_iommu.c exploits it.
>
> OK such conversation should probably start by mentioning the VM_PINNED
> effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
>
> It's more general approach to dealing with pinned pages, and moving them
> out of CMA area (and compacting them in general) prior pinning is one of
> the things that should be done within that framework.


And I assume these patches did not go anywhere, right?...



> Then there's the effort to enable migrating pages other than LRU during
> compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
> I don't know if that would be applicable in your use case, i.e. are the
> pins for DMA short-lived and can the isolation/migration code wait a bit
> for the transfer to finish so it can grab them, or something?


Pins for DMA are long-lived, pretty much as long as the guest is running. 
So this "compaction" is too late.


>>
>> Please do not comment on the style and code placement,
>> this is just to give some context :)
>>
>> Obviously, this does not work well - it manages to migrate only few pages
>> and crashes as it is missing locks/disabling interrupts and I probably
>> should not just remove pages from LRU list (normally, I guess, only these
>> can migrate) and a million of other things.
>>
>> The questions are:
>>
>> - what is the correct way of telling if the page is in CMA?
>> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
>
> Should be.
>
>> - how to tell MM to move page away? I am calling migrate_pages() with
>> an get_new_page callback which allocates a page with GFP_USER but without
>> GFP_MOVABLE which should allocate new page out of CMA which seems ok but
>> there is a little convern that we might want to add MOVABLE back when
>> VFIO device is unplugged from the guest.
>
> Hmm, once the page is allocated, then the migratetype is not tracked
> anywhere (except in page_owner debug data). But the unmovable allocations
> might exhaust available unmovable pageblocks and lead to fragmentation. So
> "add MOVABLE back" would be too late. Instead we would need to tell the
>allocator somehow to give us movable page but outside of CMA.

It is it movable, why do we care if it is in CMA or not?

> CMA's own
> __alloc_contig_migrate_range() avoids this problem by allocating movable
> pages, but the range has been already page-isolated and thus the allocator
> won't see the pages there.You obviously can't take this approach and
> isolate all CMA pageblocks like that.  That smells like a new __GFP_FLAG, meh.


I understood (more or less) all of it except the __GFP_FLAG - when/what 
would use it?



>> - do I need to isolate pages by using isolate_migratepages_range,
>> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
>> I dropped them for now and the patch uses only @migratepages from
>> the compact_control struct.
>
> You don't have to do reclaim_clean_pages_from_list(), but the isolation has
> to be careful, yeah.


The isolation here means the whole CMA zone isolation which I "obviously 
can't take this approach"? :)


>> - are there any flags in madvise() to address this (could not
>> locate any relevant)?
>
> AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE)
>
>> - what else is missing? disabled interrupts? locks?
>
> See what isolate_migratepages_block() does.


Thanks for the pointers! I'll have a closer look at Peter's patchset.


-- 
Alexey

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2015-08-17  9:11 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-05  8:08 [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning Alexey Kardashevskiy
2015-08-15 10:50 ` Alexey Kardashevskiy
2015-08-17  7:45 ` Vlastimil Babka
2015-08-17  9:11   ` Alexey Kardashevskiy [this message]
2015-08-17  9:53     ` Vlastimil Babka
2015-08-17  9:58       ` Benjamin Herrenschmidt
2015-08-17  9:57     ` Benjamin Herrenschmidt
2015-08-31 13:48     ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55D1A525.5090706@ozlabs.ru \
    --to=aik@ozlabs.ru \
    --cc=agraf@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=alexander.h.duyck@redhat.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=benh@kernel.crashing.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=hannes@cmpxchg.org \
    --cc=js1304@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=paulus@samba.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sasha.levin@oracle.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).