From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>,
	Alex Williamson <alex.williamson@redhat.com>
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>,
	qemu-devel@nongnu.org, qemu-ppc@nongnu.org,
	Piotr Jaroszynski <pjaroszynski@nvidia.com>,
	Jose Ricardo Ziviani <joserz@linux.ibm.com>
Subject: Re: [Qemu-devel] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough
Date: Mon, 11 Feb 2019 14:49:49 +1100	[thread overview]
Message-ID: <45e89e77-02ad-aebf-ab2f-e5e7f7e12ecb@ozlabs.ru> (raw)
In-Reply-To: <20190208052849.GB6434@umbus.fritz.box>



On 08/02/2019 16:28, David Gibson wrote:
> On Thu, Feb 07, 2019 at 08:26:20PM -0700, Alex Williamson wrote:
>> On Fri, 8 Feb 2019 13:29:37 +1100
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On 08/02/2019 02:18, Alex Williamson wrote:
>>>> On Thu, 7 Feb 2019 15:43:18 +1100
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>   
>>>>> On 07/02/2019 04:22, Daniel Henrique Barboza wrote:  
>>>>>> Based on this series, I've sent a Libvirt patch to allow a QEMU process
>>>>>> to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
>>>>>> GPU:
>>>>>>
>>>>>> https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html
>>>>>>
>>>>>>
>>>>>> In that thread, Alex raised concerns about allowing QEMU to freely lock
>>>>>> all the memory it wants. Is this an issue to be considered in the review
>>>>>> of this series here?
>>>>>>
>>>>>> Reading the patches, especially patch 3/3, it seems to me that QEMU is
>>>>>> going to lock the KVM memory to populate the NUMA node with the GPU's
>>>>>> own memory, so at first glance there is no risk of taking over the
>>>>>> host RAM.
>>>>>> Am I missing something?
>>>>>
>>>>>
>>>>> The GPU memory belongs to the device; it is not visible to the host as
>>>>> memory blocks and is not covered by page structs. To the host it looks
>>>>> more like MMIO, which is passed through to the guest without that locked
>>>>> memory accounting. I'd expect libvirt to keep working as usual except that:
>>>>>
>>>>> when libvirt calculates the amount of memory needed for TCE tables
>>>>> (which is guestRAM/64k*8), it now needs to use the end of the last GPU
>>>>> RAM window as the guest RAM size. For example, in QEMU HMP "info mtree -f":
>>>>>
>>>>> FlatView #2
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory-0", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr
>>>>>
>>>>> So previously the DMA window would cover 0x7fffffff+1, now it has to
>>>>> cover 0x11fffffffff+1.  
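[A minimal sketch of the TCE table sizing arithmetic quoted above; the function and constant names are illustrative, not QEMU or libvirt API. With 64 KiB IOMMU pages and 8 bytes per TCE entry, the table must cover everything up to the end of the last GPU RAM window shown in the flatview:]

```python
TCE_PAGE_SIZE = 64 * 1024   # 64 KiB IOMMU page on sPAPR
TCE_ENTRY_SIZE = 8          # bytes per TCE entry

def tce_table_bytes(dma_window_end: int) -> int:
    """Bytes of TCE table needed to map IOVA range [0, dma_window_end)."""
    entries = dma_window_end // TCE_PAGE_SIZE
    return entries * TCE_ENTRY_SIZE

# Guest RAM only: window covers 0x7fffffff + 1 (2 GiB) -> 256 KiB of TCEs
ram_only = tce_table_bytes(0x7fffffff + 1)

# With the nvlink2-mr region: window covers 0x11fffffffff + 1 -> 144 MiB
with_gpu = tce_table_bytes(0x11fffffffff + 1)
```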
>>>>
>>>> This looks like a chicken and egg problem, you're saying libvirt needs
>>>> to query mtree to understand the extent of the GPU layout, but we need
>>>> to specify the locked memory limits in order for QEMU to start?  Is
>>>> libvirt supposed to start the VM with unlimited locked memory and fix
>>>> it at some indeterminate point in the future?  Run a dummy VM with
>>>> unlimited locked memory in order to determine the limits for the real
>>>> VM?  Neither of these sound practical.  Thanks,  
>>>
>>>
>>> QEMU maps GPU RAM at known locations (which only depend on the vPHB's
>>> index or can be set explicitly), and libvirt knows how many GPUs are
>>> passed through, so it is quite easy to calculate the required amount of memory.
>>>
>>> Here is the window start calculation:
>>> https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812
>>>
>>> We do not exactly know the GPU RAM window size until QEMU reads it from
>>> VFIO/nvlink2 but we know that all existing hardware has a window of
>>> 128GB (the adapters I have access to only have 16/32GB on board).
>>
>> So you're asking that libvirt add 128GB per GPU with magic nvlink
>> properties, which may be 8x what's actually necessary and libvirt
>> determines which GPUs to apply this to how?  Does libvirt need to sort
>> through device tree properties for this?  Thanks,
> 
> Hm.  If the GPU memory is really separate from main RAM, which it
> sounds like, I don't think it makes sense to account it against the
> same locked memory limit as regular RAM.


This is accounting for the TCE table that covers GPU RAM, not for the GPU
RAM itself.

So I am asking libvirt to add 128GB/64k*8=16MB to locked_vm. It
already does the same for the guest RAM.
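[A hedged sketch of that per-GPU increment; the helper name is hypothetical. A 128 GiB GPU RAM window, mapped with 64 KiB pages at 8 bytes per TCE entry, costs 16 MiB of TCE table, which libvirt would add to locked_vm the same way it already does for guest RAM:]

```python
GPU_WINDOW = 128 * 1024**3  # 128 GiB window on all existing hardware

def locked_vm_delta_per_gpu(window=GPU_WINDOW, page=64 * 1024, entry=8):
    """locked_vm bytes to reserve for one GPU's TCE table: window/page*entry."""
    return window // page * entry

print(locked_vm_delta_per_gpu())  # 16777216 bytes = 16 MiB per GPU
```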


-- 
Alexey
