All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: qemu-devel@nongnu.org, qemu-ppc@nongnu.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Reza Arbab <arbab@linux.ibm.com>,
	Piotr Jaroszynski <pjaroszynski@nvidia.com>,
	Jose Ricardo Ziviani <joserz@linux.ibm.com>,
	Daniel Henrique Barboza <danielhb413@gmail.com>
Subject: Re: [Qemu-devel] [PATCH qemu v2 1/4] vfio/spapr: Fix indirect levels calculation
Date: Fri, 15 Feb 2019 12:03:17 +1100	[thread overview]
Message-ID: <20190215010317.GD4573@umbus.fritz.box> (raw)
In-Reply-To: <20190214052144.59541-2-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 6045 bytes --]

On Thu, Feb 14, 2019 at 04:21:41PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can but there is no way knowing the actual limit.
> 
> This makes a better guess for the number of levels and if the kernel
> fails to allocate that, this increases the level numbers till succeeded
> or reached the 64bit limit.
> 
> This adds levels to the trace point.
> 
> This may cause the kernel to warn about failed allocation:
>    [65122.837458] Failed to allocate a TCE memory, level shift=28
> which might happen if MAX_ORDER is not large enough as it can vary:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

So, looking at this, it seems kind of bogus that qemu chooses the
number of levels, when those levels aren't really visible to it.  The
kernel has better information to choose a sensible number of levels -
witness qemu here attempting to guess what the kernel will do.

But, I guess the interface is what it is now, so,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> Changes:
> v2:
> * replace systempagesize with getpagesize() when calculating bits_per_level/max_levels
> ---
>  hw/vfio/spapr.c      | 41 ++++++++++++++++++++++++++++++++---------
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..302d6b0 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -146,7 +146,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      int ret;
>      IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>      uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
> -    unsigned entries, pages;
> +    unsigned entries, bits_total, bits_per_level, max_levels;
>      struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>      long systempagesize = qemu_getrampagesize();
>  
> @@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      create.window_size = int128_get64(section->size);
>      create.page_shift = ctz64(pagesize);
>      /*
> -     * SPAPR host supports multilevel TCE tables, there is some
> -     * heuristic to decide how many levels we want for our table:
> -     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     * SPAPR host supports multilevel TCE tables. We try to guess optimal
> +     * levels number and if this fails (for example due to the host memory
> +     * fragmentation), we increase levels. The DMA address structure is:
> +     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
> +     * where:
> +     *   r = reserved (bits >= 55 are reserved in the existing hardware)
> +     *   i = IOMMU page offset (64K in this example)
> +     *   x = bits to index a TCE which can be split to equal chunks to index
> +     *      within the level.
> +     * The aim is to split "x" to smaller possible number of levels.
>       */
>      entries = create.window_size >> create.page_shift;
> -    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> -    pages = MAX(pow2ceil(pages), 1); /* Round up */
> -    create.levels = ctz64(pages) / 6 + 1;
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    /* bits_total is number of "x" needed */
> +    bits_total = ctz64(entries * sizeof(uint64_t));
> +    /*
> +     * bits_per_level is a safe guess of how much we can allocate per level:
> +     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
> +     * is usually bigger than that.
> +     * Below we look at getpagesize() as TCEs are allocated from system pages.
> +     */
> +    bits_per_level = ctz64(getpagesize()) + 8;
> +    create.levels = bits_total / bits_per_level;
> +    if (bits_total % bits_per_level) {
> +        ++create.levels;
> +    }
> +    max_levels = (64 - create.page_shift) / ctz64(getpagesize());
> +    for ( ; create.levels <= max_levels; ++create.levels) {
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (!ret) {
> +            break;
> +        }
> +    }
>      if (ret) {
>          error_report("Failed to create a window, ret = %d (%m)", ret);
>          return -errno;
> @@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>          return -EINVAL;
>      }
>      trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.levels,
>                                     create.window_size,
>                                     create.start_addr);
>      *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index f41ca96..579be19 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -128,6 +128,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2019-02-15  3:39 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-14  5:21 [Qemu-devel] [PATCH qemu v2 0/4] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
2019-02-14  5:21 ` [Qemu-devel] [PATCH qemu v2 1/4] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
2019-02-15  1:03   ` David Gibson [this message]
2019-02-14  5:21 ` [Qemu-devel] [PATCH qemu v2 2/4] vfio/spapr: Rename local systempagesize variable Alexey Kardashevskiy
2019-02-15  3:32   ` David Gibson
2019-02-15  3:37     ` Alexey Kardashevskiy
2019-02-15  3:57       ` David Gibson
2019-02-15  4:47         ` Alexey Kardashevskiy
2019-02-14  5:21 ` [Qemu-devel] [PATCH qemu v2 3/4] vfio: Make vfio_get_region_info_cap public Alexey Kardashevskiy
2019-02-14  5:21 ` [Qemu-devel] [PATCH qemu v2 4/4] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
2019-02-15  3:22   ` David Gibson
2019-02-15  4:42     ` Alexey Kardashevskiy
2019-02-15  5:30       ` David Gibson
2019-02-19  0:52         ` Alexey Kardashevskiy
2019-02-14 23:37 ` [Qemu-devel] [PATCH qemu v2 0/4] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alex Williamson
2019-02-15  0:35   ` Alexey Kardashevskiy
2019-02-15  3:24     ` David Gibson
2019-02-15  3:32       ` Alexey Kardashevskiy
2019-02-15  3:54         ` David Gibson
2019-02-15  4:34           ` Alexey Kardashevskiy
2019-02-15  5:21             ` David Gibson
2019-02-25  6:39               ` Alexey Kardashevskiy
2019-02-28  2:21                 ` David Gibson
2019-02-15  3:22   ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190215010317.GD4573@umbus.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=arbab@linux.ibm.com \
    --cc=danielhb413@gmail.com \
    --cc=joserz@linux.ibm.com \
    --cc=pjaroszynski@nvidia.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.