Re: [Qemu-devel] [PATCH qemu 1/3] vfio/spapr: Fix indirect levels calculation

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: qemu-devel@nongnu.org, qemu-ppc@nongnu.org,
	Piotr Jaroszynski <pjaroszynski@nvidia.com>,
	Jose Ricardo Ziviani <joserz@linux.ibm.com>,
	Daniel Henrique Barboza <danielhb413@gmail.com>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [Qemu-devel] [PATCH qemu 1/3] vfio/spapr: Fix indirect levels calculation
Date: Tue, 5 Feb 2019 16:54:28 +1100	[thread overview]
Message-ID: <20190205055427.GK22661@umbus.fritz.box> (raw)
In-Reply-To: <20190117025115.81178-2-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 5875 bytes --]

On Thu, Jan 17, 2019 at 01:51:13PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can but there is no way knowing the actual limit.
> 
> This makes a better guess for the number of levels and if the kernel
> fails to allocate that, this increases the level numbers till succeeded
> or reached the 64bit limit.
> 
> This adds levels to the trace point.
> 
> This may cause the kernel to warn about failed allocation:
>    [65122.837458] Failed to allocate a TCE memory, level shift=28
> which might happen if MAX_ORDER is not large enough as it can vary:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/spapr.c      | 38 +++++++++++++++++++++++++++++---------
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..2675bf5 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -146,7 +146,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      int ret;
>      IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>      uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
> -    unsigned entries, pages;
> +    unsigned entries, bits_total, bits_per_level, max_levels;
>      struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>      long systempagesize = qemu_getrampagesize();

So, "systempagesize" is a bad name, since this is *not* the system
page size, but rather the (minimum) page size we're using to map guest
RAM...

> @@ -176,16 +176,35 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      create.window_size = int128_get64(section->size);
>      create.page_shift = ctz64(pagesize);
>      /*
> -     * SPAPR host supports multilevel TCE tables, there is some
> -     * heuristic to decide how many levels we want for our table:
> -     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     * SPAPR host supports multilevel TCE tables. We try to guess optimal
> +     * levels number and if this fails (for example due to the host memory
> +     * fragmentation), we increase levels. The DMA address structure is:
> +     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
> +     * where:
> +     *   r = reserved (bits >= 55 are reserved in the existing hardware)
> +     *   i = IOMMU page offset (64K in this example)
> +     *   x = bits to index a TCE which can be split to equal chunks to index
> +     *      within the level.
> +     * The aim is to split "x" to smaller possible number of levels.
>       */
>      entries = create.window_size >> create.page_shift;
> -    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> -    pages = MAX(pow2ceil(pages), 1); /* Round up */
> -    create.levels = ctz64(pages) / 6 + 1;
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    /* bits_total is number of "x" needed */
> +    bits_total = ctz64((entries * sizeof(uint64_t)));
> +    /*
> +     * bits_per_level is a safe guess of how much we can allocate per level:
> +     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
> +     * is usually bigger than that.
> +     */
> +    bits_per_level = ctz64(systempagesize) + 8;

..which is relevant, because here you really do want the system page
size, since you're guestimating what the kernel will be able to put on
a page, entirely unrelated to the guest page size.

> +    create.levels = bits_total / bits_per_level;

Don't you need a div-rounded-up here?

> +    create.levels = MAX(1, create.levels);
> +    max_levels = (64 - create.page_shift) / ctz64(systempagesize);

And I think you want the real system page size here again.

> +    for ( ; create.levels <= max_levels; ++create.levels) {
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (!ret) {
> +            break;
> +        }
> +    }
>      if (ret) {
>          error_report("Failed to create a window, ret = %d (%m)", ret);
>          return -errno;
> @@ -200,6 +219,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>          return -EINVAL;
>      }
>      trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.levels,
>                                     create.window_size,
>                                     create.start_addr);
>      *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0f9f810..adfa75e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

next prev parent reply	other threads:[~2019-02-05  5:54 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-17  2:51 [Qemu-devel] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough Alexey Kardashevskiy
2019-01-17  2:51 ` [Qemu-devel] [PATCH qemu 1/3] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
2019-02-05  5:54   ` David Gibson [this message]
2019-01-17  2:51 ` [Qemu-devel] [PATCH qemu 2/3] vfio: Make vfio_get_region_info_cap public Alexey Kardashevskiy
2019-01-17  2:51 ` [Qemu-devel] [PATCH qemu 3/3] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
2019-02-03 23:59 ` [Qemu-devel] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough Alexey Kardashevskiy
2019-02-06 17:22 ` Daniel Henrique Barboza
2019-02-07  4:43   ` Alexey Kardashevskiy
2019-02-07 15:18     ` Alex Williamson
2019-02-08  2:29       ` Alexey Kardashevskiy
2019-02-08  3:26         ` Alex Williamson
2019-02-08  5:28           ` David Gibson
2019-02-08 15:52             ` Alex Williamson
2019-02-08 16:25               ` Daniel Henrique Barboza
2019-02-11  3:49             ` Alexey Kardashevskiy
2019-02-11  6:07               ` Alex Williamson
2019-02-11  7:46                 ` Alexey Kardashevskiy
2019-02-14  5:02                   ` David Gibson
2019-02-14  4:59               ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190205055427.GK22661@umbus.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=danielhb413@gmail.com \
    --cc=joserz@linux.ibm.com \
    --cc=pjaroszynski@nvidia.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).