Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH] documentation: Add disclaimer
From: Paul E. McKenney @ 2016-04-14 21:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mips, linux-ia64, Michael S. Tsirkin, Will Deacon,
	virtualization, H. Peter Anvin, sparclinux, Ingo Molnar,
	linux-arch, linux-s390, Russell King - ARM Linux,
	user-mode-linux-devel, linux-sh, Michael Ellerman, x86, xen-devel,
	Ingo Molnar, linux-xtensa, james.hogan, Arnd Bergmann,
	Stefano Stabellini, adi-buildroot-devel, Leonid Yegoshin,
	ddaney.cavm, Thomas Gleixner, linux-metag
In-Reply-To: <20160127083546.GJ6357@twins.programming.kicks-ass.net>

On Wed, Jan 27, 2016 at 09:35:46AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 12:11:43PM -0800, Paul E. McKenney wrote:
> > So Peter, would you like to update your patch to include yourself
> > and Will as authors?
> 
> Sure, here goes.
> 
> ---
> Subject: documentation: Add disclaimer
> 
> It appears people are reading this document as a requirements list for
> building hardware. This is not the intent of this document. Nor is it
> particularly suited for this purpose.
> 
> The primary purpose of this document is our collective attempt to define
> a set of primitives that (hopefully) allow us to write correct code on
> the myriad of SMP platforms Linux supports.
> 
> Its a definite work in progress as our understanding of these platforms,
> and memory ordering in general, progresses.
> 
> Nor does being mentioned in this document mean we think its a
> particularly good idea; the data dependency barrier required by Alpha
> being a prime example. Yes we have it, no you're insane to require it
> when building new hardware.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Rather belatedly queued and pushed to -rcu, apologies for the delay.
One minor edit noted below.

							Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a61be39c7b51..98626125f484 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -4,8 +4,24 @@
> 
>  By: David Howells <dhowells@redhat.com>
>      Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> +    Will Deacon <will.deacon@arm.com>
> +    Peter Zijlstra <peterz@infradead.org>
> 
> -Contents:
> +==========
> +DISCLAIMER
> +==========
> +
> +This document is not a specification; it is intentionally (for the sake of
> +brevity) and unintentionally (due to being human) incomplete. This document is
> +meant as a guide to using the various memory barriers provided by Linux, but
> +in case of any doubt (and there are many) please ask.
> +
> +I repeat, this document is not a specification of what Linux expects from

s/I/To/ because there is more than one author.

> +hardware.
> +
> +========
> +CONTENTS
> +========
> 
>   (*) Abstract memory access model.
> 
> 

^ permalink raw reply

* Re: [PATCH] documentation: Add disclaimer
From: Paul E. McKenney @ 2016-04-14 21:40 UTC (permalink / raw)
  To: David Howells
  Cc: linux-mips, linux-ia64, Michael S. Tsirkin, Peter Zijlstra,
	Will Deacon, virtualization, H. Peter Anvin, sparclinux,
	Ingo Molnar, linux-arch, linux-s390, Russell King - ARM Linux,
	user-mode-linux-devel, linux-sh, Michael Ellerman, x86, xen-devel,
	Ingo Molnar, linux-xtensa, james.hogan, Arnd Bergmann,
	Stefano Stabellini, adi-buildroot-devel, Leonid Yegoshin,
	ddaney.cavm, Thomas Gleixner
In-Reply-To: <15882.1453906627@warthog.procyon.org.uk>

On Wed, Jan 27, 2016 at 02:57:07PM +0000, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +==========
> > +DISCLAIMER
> > +==========
> > +
> > +This document is not a specification; it is intentionally (for the sake of
> > +brevity) and unintentionally (due to being human) incomplete. This document is
> > +meant as a guide to using the various memory barriers provided by Linux, but
> > +in case of any doubt (and there are many) please ask.
> > +
> > +I repeat, this document is not a specification of what Linux expects from
> > +hardware.
> 
> The purpose of this document is twofold:
> 
>  (1) to specify the minimum functionality that one can rely on for any
>      particular barrier, and
> 
>  (2) to provide a guide as to how to use the barriers that are available.
> 
> Note that an architecture can provide more than the minimum requirement for
> any particular barrier, but if the barrier provides less than that, it is
> incorrect.
> 
> Note also that it is possible that a barrier may be a no-op for an
> architecture because the way that arch works renders an explicit barrier
> unnecessary in that case.
> 
> > +
> 
> Can you bung an extra blank line in here if you have to redo this at all?

Done as part of your patch.  Again, apologies for the delay.

							Thanx, Paul

> > +========
> > +CONTENTS
> > +========
> >  
> >   (*) Abstract memory access model.
> >  
> 
> David
> 

^ permalink raw reply

* [patch] virtio: Silence uninitialized variable warning
From: Dan Carpenter @ 2016-04-15 14:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kernel-janitors, linux-kernel, virtualization

Smatch complains that we might not initialize "queue".  The issue is
callers like setup_vq() from virtio_pci_modern.c where "num" could be
something like 2 and "vring_align" is 64.  In that case, vring_size() is
less than PAGE_SIZE.  It won't happen in real life, but we're getting
the value of "num" from a register so it's not really possible to tell
what value it holds with static analysis.

Let's just silence the warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5c802d4..ca6bfdd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1006,7 +1006,7 @@ struct virtqueue *vring_create_virtqueue(
 	const char *name)
 {
 	struct virtqueue *vq;
-	void *queue;
+	void *queue = NULL;
 	dma_addr_t dma_addr;
 	size_t queue_size_in_bytes;
 	struct vring vring;

^ permalink raw reply related

* List Adminstration Help
From: Chris Wright @ 2016-04-15 19:27 UTC (permalink / raw)
  To: virtualization

Hi all,

This list is run as a moderated list to minimize spam (gets a decent
amount despite filtering).  I've tended to this for years, and I'd
like to add some folks from community to help so that posts are
quickly approved.  Please let me know if you'd like to help.

thanks,
-chris

^ permalink raw reply

* Re: [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
From: Sergey Senozhatsky @ 2016-04-17 15:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-7-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> +static int get_zspage_inuse(struct page *first_page)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
> +static void set_zspage_inuse(struct page *first_page, int val)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
> +static void mod_zspage_inuse(struct page *first_page, int val)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
>  static void get_zspage_mapping(struct page *first_page,
>  				unsigned int *class_idx,
>  				enum fullness_group *fullness)
>  {
> -	unsigned long m;
> +	struct zs_meta *m;
> +
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	m = (struct zs_meta *)&first_page->mapping;
..
>  static void set_zspage_mapping(struct page *first_page,
>  				unsigned int class_idx,
>  				enum fullness_group fullness)
>  {
> +	struct zs_meta *m;
> +
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> +	m = (struct zs_meta *)&first_page->mapping;
> +	m->fullness = fullness;
> +	m->class = class_idx;
>  }


a nitpick: this

	struct zs_meta *m;
	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
	m = (struct zs_meta *)&first_page->mapping;


seems to be common in several places, may be it makes sense to
factor it out and turn into a macro or a static inline helper?

other than that, looks good to me

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v3 05/16] zsmalloc: keep max_object in size_class
From: Sergey Senozhatsky @ 2016-04-17 15:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-6-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> 
> Every zspage in a size_class has same number of max objects so
> we could move it to a size_class.
> 

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset
From: Sergey Senozhatsky @ 2016-04-17 15:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-8-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> We don't use page->_mapcount any more so no need to reset.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> ---
>  mm/zsmalloc.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 4dd72a803568..0f6cce9b9119 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -922,7 +922,6 @@ static void reset_page(struct page *page)
>  	set_page_private(page, 0);
>  	page->mapping = NULL;
>  	page->freelist = NULL;
> -	page_mapcount_reset(page);
>  }
>  
>  static void free_zspage(struct page *first_page)
> -- 
> 1.9.1
> 

^ permalink raw reply

* Re: [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist
From: Sergey Senozhatsky @ 2016-04-17 15:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-10-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> For supporting migration from VM, we need to have address_space
> on every page so zsmalloc shouldn't use page->mapping. So,
> this patch moves zs_meta from mapping to freelist.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

a small get_zspage_meta() helper would make this patch shorter :)

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> ---
>  mm/zsmalloc.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 807998462539..d4d33a819832 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -29,7 +29,7 @@
>   *		Look at size_class->huge.
>   *	page->lru: links together first pages of various zspages.
>   *		Basically forming list of zspages in a fullness group.
> - *	page->mapping: override by struct zs_meta
> + *	page->freelist: override by struct zs_meta
>   *
>   * Usage of struct page flags:
>   *	PG_private: identifies the first component page
> @@ -418,7 +418,7 @@ static int get_zspage_inuse(struct page *first_page)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  
>  	return m->inuse;
>  }
> @@ -429,7 +429,7 @@ static void set_zspage_inuse(struct page *first_page, int val)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->inuse = val;
>  }
>  
> @@ -439,7 +439,7 @@ static void mod_zspage_inuse(struct page *first_page, int val)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->inuse += val;
>  }
>  
> @@ -449,7 +449,7 @@ static void set_freeobj(struct page *first_page, int idx)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->freeobj = idx;
>  }
>  
> @@ -459,7 +459,7 @@ static unsigned long get_freeobj(struct page *first_page)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	return m->freeobj;
>  }
>  
> @@ -471,7 +471,7 @@ static void get_zspage_mapping(struct page *first_page,
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	*fullness = m->fullness;
>  	*class_idx = m->class;
>  }
> @@ -484,7 +484,7 @@ static void set_zspage_mapping(struct page *first_page,
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->fullness = fullness;
>  	m->class = class_idx;
>  }
> @@ -946,7 +946,6 @@ static void reset_page(struct page *page)
>  	clear_bit(PG_private, &page->flags);
>  	clear_bit(PG_private_2, &page->flags);
>  	set_page_private(page, 0);
> -	page->mapping = NULL;
>  	page->freelist = NULL;
>  }
>  
> @@ -1056,6 +1055,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>  
>  		INIT_LIST_HEAD(&page->lru);
>  		if (i == 0) {	/* first page */
> +			page->freelist = NULL;
>  			SetPagePrivate(page);
>  			set_page_private(page, 0);
>  			first_page = page;
> @@ -2068,9 +2068,9 @@ static int __init zs_init(void)
>  
>  	/*
>  	 * A zspage's a free object index, class index, fullness group,
> -	 * inuse object count are encoded in its (first)page->mapping
> +	 * inuse object count are encoded in its (first)page->freelist
>  	 * so sizeof(struct zs_meta) should be less than
> -	 * sizeof(page->mapping(i.e., unsigned long)).
> +	 * sizeof(page->freelist(i.e., void *)).
>  	 */
>  	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
>  
> -- 
> 1.9.1
> 

^ permalink raw reply

* Re: [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping
From: Sergey Senozhatsky @ 2016-04-17 15:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-9-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> +static void objidx_to_page_and_offset(struct size_class *class,
> +				struct page *first_page,
> +				unsigned long obj_idx,
> +				struct page **obj_page,
> +				unsigned long *offset_in_page)
>  {
> -	unsigned long obj;
> +	int i;
> +	unsigned long offset;
> +	struct page *cursor;
> +	int nr_page;
>  
> -	if (!page) {
> -		VM_BUG_ON(obj_idx);
> -		return NULL;
> -	}
> +	offset = obj_idx * class->size;

so we already know the `offset' before we call objidx_to_page_and_offset(),
thus we can drop `struct size_class *class' and `obj_idx', and pass
`long obj_offset'  (which is `obj_idx * class->size') instead, right?

we also _may be_ can return `cursor' from the function.

static struct page *objidx_to_page_and_offset(struct page *first_page,
					unsigned long obj_offset,
					unsigned long *offset_in_page);

this can save ~20 instructions, which is not so terrible for a hot path
like obj_malloc(). what do you think?

well, seems that `unsigned long *offset_in_page' can be calculated
outside of this function too, it's basically

	*offset_in_page = (obj_idx * class->size) & ~PAGE_MASK;

so we don't need to supply it to this function, nor modify it there.
which can save ~40 instructions on my system. does this sound silly?

	-ss

> +	cursor = first_page;
> +	nr_page = offset >> PAGE_SHIFT;
>  
> -	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
> -	obj |= ((obj_idx) & OBJ_INDEX_MASK);
> -	obj <<= OBJ_TAG_BITS;
> +	*offset_in_page = offset & ~PAGE_MASK;
> +
> +	for (i = 0; i < nr_page; i++)
> +		cursor = get_next_page(cursor);
>  
> -	return (void *)obj;
> +	*obj_page = cursor;
>  }

	-ss

^ permalink raw reply

* Re: [PATCH v3 10/16] zsmalloc: factor page chain functionality out
From: Sergey Senozhatsky @ 2016-04-18  0:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-11-git-send-email-minchan@kernel.org>

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> @@ -1421,7 +1434,6 @@ static unsigned long obj_malloc(struct size_class *class,
>  	unsigned long m_offset;
>  	void *vaddr;
>  
> -	handle |= OBJ_ALLOCATED_TAG;

a nitpick, why did you replace this ALLOCATED_TAG assignment
with 2 'handle | OBJ_ALLOCATED_TAG'?

	-ss

>  	obj = get_freeobj(first_page);
>  	objidx_to_page_and_offset(class, first_page, obj,
>  				&m_page, &m_offset);
> @@ -1431,10 +1443,10 @@ static unsigned long obj_malloc(struct size_class *class,
>  	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
>  	if (!class->huge)
>  		/* record handle in the header of allocated chunk */
> -		link->handle = handle;
> +		link->handle = handle | OBJ_ALLOCATED_TAG;
>  	else
>  		/* record handle in first_page->private */
> -		set_page_private(first_page, handle);
> +		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
>  	kunmap_atomic(vaddr);
>  	mod_zspage_inuse(first_page, 1);
>  	zs_stat_inc(class, OBJ_USED, 1);

^ permalink raw reply

* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
From: Sergey Senozhatsky @ 2016-04-18  1:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-12-git-send-email-minchan@kernel.org>

Hello Minchan,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> @@ -1835,23 +1827,31 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>  			if (!migrate_zspage(pool, class, &cc))
>  				break;
>  
> -			putback_zspage(pool, class, dst_page);
> +			VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				dst_page) == ZS_EMPTY, dst_page);

can this VM_BUG_ON_PAGE() condition ever be true?

>  		}
>  		/* Stop if we couldn't find slot */
>  		if (dst_page == NULL)
>  			break;
> -		putback_zspage(pool, class, dst_page);
> -		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
> +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				dst_page) == ZS_EMPTY, dst_page);

hm... this VM_BUG_ON_PAGE(dst_page) is sort of confusing. under what
circumstances it can be true?

a minor nit, it took me some time (need some coffee I guess) to
correctly parse this macro wrapper

		VM_BUG_ON_PAGE(putback_zspage(pool, class,
			dst_page) == ZS_EMPTY, dst_page);

may be do it like:
		fullness = putback_zspage(pool, class, dst_page);
		VM_BUG_ON_PAGE(fullness == ZS_EMPTY, dst_page);


well, if we want to VM_BUG_ON_PAGE() at all. there haven't been any
problems with compaction, is there any specific reason these macros
were added?



> +		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
>  			pool->stats.pages_compacted += class->pages_per_zspage;
> -		spin_unlock(&class->lock);
> +			spin_unlock(&class->lock);
> +			free_zspage(pool, class, src_page);

do we really need to free_zspage() out of class->lock?
wouldn't something like this

		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
			pool->stats.pages_compacted += class->pages_per_zspage;
			free_zspage(pool, class, src_page);
		}
		spin_unlock(&class->lock);

be simpler?

besides, free_zspage() now updates class stats out of class lock,
not critical but still.

	-ss

> +		} else {
> +			spin_unlock(&class->lock);
> +		}
> +
>  		cond_resched();
>  		spin_lock(&class->lock);
>  	}
>  
>  	if (src_page)
> -		putback_zspage(pool, class, src_page);
> +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				src_page) == ZS_EMPTY, src_page);
>  
>  	spin_unlock(&class->lock);
>  }

^ permalink raw reply

* [PATCH RFC 0/3] virtio-pci: iommu support
From: Michael S. Tsirkin @ 2016-04-18  9:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Wei Liu, kvm, qemu-devel, virtualization, Christian Borntraeger,
	Andy Lutomirski, David Woodhouse

This is an attempt to allow enabling IOMMU for DMA.
Design:
	- new feature bit IOMMU_PLATFORM which means
          host won't bypass IOMMU
	- virtio core uses DMA API if it sees IOMMU_PLATFORM
	- add quirk for vfio to disable device unless IOMMU_PLATFORM is set
          or the no-iommu mode is enabled
	- while I'm not sure how it will be used, it seems like a good idea to
	  also have ability to distinguish between a legacy device and one
	  where iommu is bypassed intentionally.  To this end, add another feature bit
	  IOMMU_PASSTHROUGH. We don't acknowledge it if IOMMU_PLATFORM is set.

TODO:
	- I'm not sure whether there are setups that mix IOMMU
	  and no-IOMMU configs. If so, failing on probe might not
	  be the right thing to do, should fail binding to IOMMU group instead.


Michael S. Tsirkin (3):
  virtio: add features for IOMMU control
  vfio: report group noiommu status
  vfio: add virtio pci quirk

 drivers/vfio/pci/vfio_pci_private.h          |   1 +
 include/uapi/linux/virtio_config.h           |  10 +-
 drivers/vfio/pci/vfio_pci.c                  |  13 ++-
 drivers/vfio/pci/vfio_pci_virtio.c           | 142 +++++++++++++++++++++++++++
 drivers/vfio/platform/vfio_platform_common.c |   2 +-
 drivers/vfio/vfio.c                          |   5 +-
 drivers/virtio/virtio_ring.c                 |  18 +++-
 Documentation/vfio.txt                       |   4 +-
 drivers/vfio/pci/Makefile                    |   1 +
 9 files changed, 190 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_virtio.c

-- 
MST

^ permalink raw reply

* [PATCH RFC 1/3] virtio: add features for IOMMU control
From: Michael S. Tsirkin @ 2016-04-18  9:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Wei Liu, kvm, qemu-devel, virtualization, Christian Borntraeger,
	Andy Lutomirski, David Woodhouse
In-Reply-To: <1460973374-32719-1-git-send-email-mst@redhat.com>

The interaction between virtio and DMA API is messy.

On most systems with virtio, physical addresses match bus addresses,
and it doesn't particularly matter whether we use the DMA API.

On some systems, including Xen and any system with a physical device
that speaks virtio behind a physical IOMMU, we must use the DMA API
for virtio DMA to work at all.

Add a feature bit to detect that: VIRTIO_F_IOMMU_PLATFORM.

On other systems, including SPARC and PPC64, virtio-pci devices are
enumerated as though they are behind an IOMMU, but the virtio host
ignores the IOMMU, so we must either pretend that the IOMMU isn't
there or somehow map everything as the identity.

Add a feature bit for that as well: VIRTIO_F_IOMMU_PASSTHROUGH: without
VIRTIO_F_IOMMU_PLATFORM, it means that virtio will bypass the IOMMU.
With VIRTIO_F_IOMMU_PLATFORM, it suggests that guest maps everything as
the identity (this last bit isn't trivial to implement so ignore this
hint for now).

If not there, we preserve historic behavior and bypass the DMA
API unless within Xen guest.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/uapi/linux/virtio_config.h | 10 +++++++++-
 drivers/virtio/virtio_ring.c       | 18 +++++++++++++++++-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 4cb65bb..952775c 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		33
+#define VIRTIO_TRANSPORT_F_END		35
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -63,4 +63,12 @@
 /* v1.0 compliant. */
 #define VIRTIO_F_VERSION_1		32
 
+/* Request IOMMU passthrough (if available)
+ * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
+ * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
+ */
+#define VIRTIO_F_IOMMU_PASSTHROUGH	33
+
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM		34
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5c802d4..0436bd2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -117,7 +117,10 @@ struct vring_virtqueue {
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
 /*
- * The interaction between virtio and a possible IOMMU is a mess.
+ * Modern virtio devices might set feature bits to specify whether
+ * they use or unconditionally bypass the platform IOMMU.
+ *
+ * If not there, the interaction between virtio and DMA API is messy.
  *
  * On most systems with virtio, physical addresses match bus addresses,
  * and it doesn't particularly matter whether we use the DMA API.
@@ -137,6 +140,13 @@ struct vring_virtqueue {
 
 static bool vring_use_dma_api(struct virtio_device *vdev)
 {
+	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
+		return true;
+
+	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PASSTHROUGH))
+		return false;
+
+	/* Otherwise, we are left to guess. */
 	/*
 	 * In theory, it's possible to have a buggy QEMU-supposed
 	 * emulated Q35 IOMMU and Xen enabled at the same time.  On
@@ -1099,6 +1109,12 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_F_VERSION_1:
 			break;
+		case VIRTIO_F_IOMMU_PASSTHROUGH:
+			break;
+		case VIRTIO_F_IOMMU_PLATFORM:
+			/* Ignore passthrough hint for now, obey kernel config. */
+			__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PASSTHROUGH);
+			break;
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
MST

^ permalink raw reply related

* [PATCH RFC 2/3] vfio: report group noiommu status
From: Michael S. Tsirkin @ 2016-04-18  9:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Julia Lawall, Wei Liu, kvm, Jonathan Corbet, linux-doc,
	qemu-devel, virtualization, Christian Borntraeger,
	Andy Lutomirski, Baptiste Reynal, David Woodhouse, Dan Carpenter
In-Reply-To: <1460973374-32719-1-git-send-email-mst@redhat.com>

When using vfio, callers might want to know whether device is added to a
regular group or an non-iommu group.

Report this status from vfio_add_group_dev.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c                  | 2 +-
 drivers/vfio/platform/vfio_platform_common.c | 2 +-
 drivers/vfio/vfio.c                          | 5 ++++-
 Documentation/vfio.txt                       | 4 +++-
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 712a849..d622a41 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1119,7 +1119,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
-	if (ret) {
+	if (ret < 0) {
 		vfio_iommu_group_put(group, &pdev->dev);
 		kfree(vdev);
 		return ret;
diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index e65b142..bf74e21 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -568,7 +568,7 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev,
 	}
 
 	ret = vfio_add_group_dev(dev, &vfio_platform_ops, vdev);
-	if (ret) {
+	if (ret < 0) {
 		iommu_group_put(group);
 		return ret;
 	}
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5..67db231 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -756,6 +756,7 @@ int vfio_add_group_dev(struct device *dev,
 	struct iommu_group *iommu_group;
 	struct vfio_group *group;
 	struct vfio_device *device;
+	int noiommu;
 
 	iommu_group = iommu_group_get(dev);
 	if (!iommu_group)
@@ -791,6 +792,8 @@ int vfio_add_group_dev(struct device *dev,
 		return PTR_ERR(device);
 	}
 
+	noiommu = group->noiommu;
+
 	/*
 	 * Drop all but the vfio_device reference.  The vfio_device holds
 	 * a reference to the vfio_group, which holds a reference to the
@@ -798,7 +801,7 @@ int vfio_add_group_dev(struct device *dev,
 	 */
 	vfio_group_put(group);
 
-	return 0;
+	return noiommu;
 }
 EXPORT_SYMBOL_GPL(vfio_add_group_dev);
 
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 1dd3fdd..d76be0f 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -259,7 +259,9 @@ extern void *vfio_del_group_dev(struct device *dev);
 
 vfio_add_group_dev() indicates to the core to begin tracking the
 specified iommu_group and register the specified dev as owned by
-a VFIO bus driver.  The driver provides an ops structure for callbacks
+a VFIO bus driver.  A negative return value indicates failure.
+A positive return value indicates that an unsafe noiommu mode
+is in use.  The driver provides an ops structure for callbacks
 similar to a file operations structure:
 
 struct vfio_device_ops {
-- 
MST

^ permalink raw reply related

* [PATCH RFC 3/3] vfio: add virtio pci quirk
From: Michael S. Tsirkin @ 2016-04-18  9:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Wei Liu, kvm, Julia Lawall, qemu-devel, virtualization,
	Christian Borntraeger, Andy Lutomirski, Paolo Bonzini, Feng Wu,
	David Woodhouse, Dan Carpenter
In-Reply-To: <1460973374-32719-1-git-send-email-mst@redhat.com>

Modern virtio pci devices can set VIRTIO_F_IOMMU_PLATFORM
to signal they are safe to use with an IOMMU.

Without this bit, exposing the device to userspace is unsafe, so probe
and fail VFIO initialization unless noiommu is enabled.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vfio/pci/vfio_pci_private.h |   1 +
 drivers/vfio/pci/vfio_pci.c         |  11 +++
 drivers/vfio/pci/vfio_pci_virtio.c  | 135 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/Makefile           |   1 +
 4 files changed, 148 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_virtio.c

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546..604d445 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -130,4 +130,5 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+extern int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu);
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d622a41..2bb8c76 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1125,6 +1125,17 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return ret;
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_REDHAT_QUMRANET &&
+	    ((ret = vfio_pci_virtio_quirk(vdev, ret)))) {
+		dev_warn(&vdev->pdev->dev,
+			 "Failed to setup Virtio for VFIO\n");
+		vfio_del_group_dev(&pdev->dev);
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
+
 	if (vfio_pci_is_vga(pdev)) {
 		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
 		vga_set_legacy_decoding(pdev,
diff --git a/drivers/vfio/pci/vfio_pci_virtio.c b/drivers/vfio/pci/vfio_pci_virtio.c
new file mode 100644
index 0000000..1a32064
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_virtio.c
@@ -0,0 +1,135 @@
+/*
+ * VFIO PCI Intel Graphics support
+ *
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register a device specific region through which to provide read-only
+ * access to the Intel IGD opregion.  The register defining the opregion
+ * address is also virtualized to prevent user modification.
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/virtio_pci.h>
+#include <linux/virtio_config.h>
+
+#include "vfio_pci_private.h"
+
+/**
+ * virtio_pci_find_capability - walk capabilities to find device info.
+ * @dev: the pci device
+ * @cfg_type: the VIRTIO_PCI_CAP_* value we seek
+ *
+ * Returns offset of the capability, or 0.
+ */
+static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type)
+{
+	int pos;
+
+	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+	     pos > 0;
+	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+		u8 type;
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 cfg_type),
+				     &type);
+
+		if (type != cfg_type)
+			continue;
+
+		/* Ignore structures with reserved BAR values */
+		if (type != VIRTIO_PCI_CAP_PCI_CFG) {
+			u8 bar;
+
+			pci_read_config_byte(dev, pos +
+					     offsetof(struct virtio_pci_cap,
+						      bar),
+					     &bar);
+			if (bar > 0x5)
+				continue;
+		}
+
+		return pos;
+	}
+	return 0;
+}
+
+
+int vfio_pci_virtio_quirk(struct vfio_pci_device *vdev, int noiommu)
+{
+	struct pci_dev *dev = vdev->pdev;
+	int common, cfg;
+	u32 features;
+	u32 offset;
+	u8 bar;
+
+	/* Without an IOMMU, we don't care */
+	if (noiommu)
+		return 0;
+	/* Check whether device enforces the IOMMU correctly */
+
+	/*
+	 * All modern devices must have common and cfg capabilities. We use cfg
+	 * capability for access so that we don't need to worry about resource
+	 * availability. Slow but sure.
+	 * Note that all vendor-specific fields we access are little-endian
+	 * which matches what pci config accessors expect, so they do byteswap
+	 * for us if appropriate.
+	 */
+	common = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_COMMON_CFG);
+	cfg = virtio_pci_find_capability(dev, VIRTIO_PCI_CAP_PCI_CFG);
+	if (!cfg || !common) {
+                dev_warn(&dev->dev,
+                         "Virtio device lacks common or pci cfg.\n");
+		return -ENODEV;
+	}
+
+	pci_read_config_byte(dev, common + offsetof(struct virtio_pci_cap,
+						    bar),
+			     &bar);
+	pci_read_config_dword(dev, common + offsetof(struct virtio_pci_cap,
+						    offset),
+			     &offset);
+
+	/* Program cfg capability for dword access into common cfg. */
+	pci_write_config_byte(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						  cap.bar),
+			      bar);
+	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						   cap.length),
+			       0x4);
+
+	/* Select features dword that has VIRTIO_F_IOMMU_PLATFORM. */
+	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						  cap.offset),
+			       offset + offsetof(struct virtio_pci_common_cfg,
+						 device_feature_select));
+	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						  pci_cfg_data),
+			       VIRTIO_F_IOMMU_PLATFORM / 32);
+
+	/* Get the features dword. */
+	pci_write_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						  cap.offset),
+			       offset + offsetof(struct virtio_pci_common_cfg,
+						 device_feature));
+	pci_read_config_dword(dev, cfg + offsetof(struct virtio_pci_cfg_cap,
+						  pci_cfg_data),
+			      &features);
+
+	/* Does this device obey the platform's IOMMU? If not it's an error. */
+	if (!(features & (0x1 << (VIRTIO_F_IOMMU_PLATFORM % 32)))) {
+                dev_warn(&dev->dev,
+                         "Virtio device lacks VIRTIO_F_IOMMU_PLATFORM.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..e9b20e7 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y += vfio_pci_virtio.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
-- 
MST

^ permalink raw reply related

* [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-18 11:47 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	peterx, virtualization, Amit Shah, Stefan Hajnoczi, kvm, pbonzini,
	David Woodhouse

This adds a flag to enable/disable bypassing the IOMMU by
virtio devices.

This is on top of patch
http://article.gmane.org/gmane.comp.emulators.qemu/403467
virtio: convert to use DMA api

Tested with patchset
http://article.gmane.org/gmane.linux.kernel.virtualization/27545
virtio-pci: iommu support

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---
 include/hw/virtio/virtio-access.h              | 3 ++-
 include/hw/virtio/virtio.h                     | 6 +++++-
 include/standard-headers/linux/virtio_config.h | 8 ++++++++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 967cc75..bb6f34e 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -23,7 +23,8 @@ static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
     BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
-    if (k->get_dma_as) {
+    if ((vdev->host_features & (0x1ULL << VIRTIO_F_IOMMU_PLATFORM)) &&
+        k->get_dma_as) {
         return k->get_dma_as(qbus->parent);
     }
     return &address_space_memory;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b12faa9..34d3041 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -228,7 +228,11 @@ typedef struct VirtIORNGConf VirtIORNGConf;
     DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
                       VIRTIO_F_NOTIFY_ON_EMPTY, true), \
     DEFINE_PROP_BIT64("any_layout", _state, _field, \
-                      VIRTIO_F_ANY_LAYOUT, true)
+                      VIRTIO_F_ANY_LAYOUT, true), \
+    DEFINE_PROP_BIT64("iommu_passthrough", _state, _field, \
+                      VIRTIO_F_IOMMU_PASSTHROUGH, false), \
+    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
+                      VIRTIO_F_IOMMU_PLATFORM, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index bcc445b..5564dab 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -61,4 +61,12 @@
 /* v1.0 compliant. */
 #define VIRTIO_F_VERSION_1		32
 
+/* Request IOMMU passthrough (if available)
+ * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
+ * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
+ */
+#define VIRTIO_F_IOMMU_PASSTHROUGH	33
+
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM		34
 #endif /* _LINUX_VIRTIO_CONFIG_H */
-- 
MST

^ permalink raw reply related

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-18 11:58 UTC (permalink / raw)
  To: Michael S. Tsirkin, qemu-devel, linux-kernel
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	peterx, virtualization, Amit Shah, Stefan Hajnoczi, kvm, pbonzini
In-Reply-To: <1460979793-6621-1-git-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 988 bytes --]

On Mon, 2016-04-18 at 14:47 +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.

I'm still deeply unhappy with having this kind of hack in the virtio
code at all, as you know. Drivers should just use the DMA API and if
the *platform* wants to make it a no-op for a specific device, then it
can.

Remember, this isn't just virtio either. Don't we have *precisely* the
same issue with assigned PCI devices on a system with an emulated Intel
IOMMU? The assigned PCI devices aren't covered by the emulated IOMMU,
and the platform needs to know to bypass *those* too.

Now, we've had this conversation, and we accepted the hack in virtio
for now until the platforms (especially SPARC and Power IIRC) can get
their act together and make their DMA API implementations not broken.

But now you're adding this hack to the public API where we have to
support it for ever. Please, can't we avoid that?

-- 
dwmw2



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-18 13:12 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <1460980717.12793.43.camel@infradead.org>

On Mon, Apr 18, 2016 at 07:58:37AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 14:47 +0300, Michael S. Tsirkin wrote:
> > This adds a flag to enable/disable bypassing the IOMMU by
> > virtio devices.
> 
> I'm still deeply unhappy with having this kind of hack in the virtio
> code at all, as you know. Drivers should just use the DMA API and if
> the *platform* wants to make it a no-op for a specific device, then it
> can.
> 
> Remember, this isn't just virtio either. Don't we have *precisely* the
> same issue with assigned PCI devices on a system with an emulated Intel
> IOMMU? The assigned PCI devices aren't covered by the emulated IOMMU,
> and the platform needs to know to bypass *those* too.
> 
> Now, we've had this conversation, and we accepted the hack in virtio
> for now until the platforms (especially SPARC and Power IIRC) can get
> their act together and make their DMA API implementations not broken.
> 
> But now you're adding this hack to the public API where we have to
> support it for ever. Please, can't we avoid that?

I'm not sure I understand the issue.  The public API is not about how
the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
It's about telling device whether to obey the IOMMU and
about discovering whether a device is in fact under the IOMMU.

Once DMA API allows bypassing IOMMU per device we'll be
able to drop the ugly hack from virtio drivers, simply keying it
off the given flag.


> -- 
> dwmw2
> 
> 

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-18 14:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <20160418160731-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 898 bytes --]

On Mon, 2016-04-18 at 16:12 +0300, Michael S. Tsirkin wrote:
> I'm not sure I understand the issue.  The public API is not about how
> the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
> It's about telling device whether to obey the IOMMU and
> about discovering whether a device is in fact under the IOMMU.

Apologies, I was wrongly reading this as a kernel patch.

After a brief struggle with "telling device whether to obey the IOMMU",
which is obviously completely impossible from the guest kernel, I
realise my mistake :)

So... on x86 how does this get reflected in the DMAR tables that the
guest BIOS presents to the guest kernel, so that the guest kernel
*knows* which devices are behind which IOMMU?

(And are you fixing the case of assigned PCI devices, which aren't
behind any IOMMU, at the same time as you answer that? :)

-- 
dwmw2



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-18 14:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <1460988232.22654.7.camel@infradead.org>

On Mon, Apr 18, 2016 at 10:03:52AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 16:12 +0300, Michael S. Tsirkin wrote:
> > I'm not sure I understand the issue.  The public API is not about how
> > the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
> > It's about telling device whether to obey the IOMMU and
> > about discovering whether a device is in fact under the IOMMU.
> 
> Apologies, I was wrongly reading this as a kernel patch.
> 
> After a brief struggle with "telling device whether to obey the IOMMU",
> which is obviously completely impossible from the guest kernel, I
> realise my mistake :)
> 
> So... on x86 how does this get reflected in the DMAR tables that the
> guest BIOS presents to the guest kernel, so that the guest kernel
> *knows* which devices are behind which IOMMU?

This patch doesn't change DMAR tables, it creates a way for virtio
device to tell guest "I obey what DMAR tables tell you, you can stop
doing hacks".

And as PPC guys seem adamant that platform tools there are no good for
that purpose, there's another bit that says "ignore what platform tells
you, I'm not a real device - I'm part of hypervisor and I bypass the
IOMMU".


> (And are you fixing the case of assigned PCI devices, which aren't
> behind any IOMMU, at the same time as you answer that? :)

No - Aviv B.D. has patches on list to fix that.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-18 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <20160418170534-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 1741 bytes --]

On Mon, 2016-04-18 at 17:23 +0300, Michael S. Tsirkin wrote:
> 
> This patch doesn't change DMAR tables, it creates a way for virtio
> device to tell guest "I obey what DMAR tables tell you, you can stop
> doing hacks".
> 
> And as PPC guys seem adamant that platform tools there are no good for
> that purpose, there's another bit that says "ignore what platform tells
> you, I'm not a real device - I'm part of hypervisor and I bypass the
> IOMMU".

...

+/* Request IOMMU passthrough (if available)
+ * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
+ * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
+ */
+#define VIRTIO_F_IOMMU_PASSTHROUGH     33
+
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM                34

OK... let's see if I can reconcile those descriptions coherently.

Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
its own operating system's IOMMU code is expected to be broken, and
that the virtio driver should eschew the DMA API? And that the guest OS
cannot further assign the affected device to any of *its* nested
guests? Not that the broken IOMMU code in said guest OS will know the
latter, of course.

With VIRTIO_F_IOMMU_PLATFORM set, VIRTIO_F_IOMMU_PASSTHROUGH is just a
*hint*, suggesting that the guest OS should *request* a passthrough
mapping from the IOMMU? Via a driver←→IOMMU API which doesn't yet exist
in Linux, since we only have 'iommu=pt' on the command line for that?

And having *neither* of those bits sets is the status quo, which means
that your OS code might well be broken and need you to eschew the DMA
API, but maybe not.


-- 
dwmw2



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-18 15:30 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <1460992923.3765.8.camel@infradead.org>

On Mon, Apr 18, 2016 at 11:22:03AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 17:23 +0300, Michael S. Tsirkin wrote:
> > 
> > This patch doesn't change DMAR tables, it creates a way for virtio
> > device to tell guest "I obey what DMAR tables tell you, you can stop
> > doing hacks".
> > 
> > And as PPC guys seem adamant that platform tools there are no good for
> > that purpose, there's another bit that says "ignore what platform tells
> > you, I'm not a real device - I'm part of hypervisor and I bypass the
> > IOMMU".
> 
> ...
> 
> +/* Request IOMMU passthrough (if available)
> + * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
> + * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
> + */
> +#define VIRTIO_F_IOMMU_PASSTHROUGH     33
> +
> +/* Do not bypass the IOMMU (if configured) */
> +#define VIRTIO_F_IOMMU_PLATFORM                34
> 
> OK... let's see if I can reconcile those descriptions coherently.
> 
> Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> its own operating system's IOMMU code is expected to be broken, and
> that the virtio driver should eschew the DMA API?

No - it tells guest that e.g. the ACPI tables (or whatever the
equivalent is) do not match reality with respect to this device
since IOMMU is ignored by hypervisor.
Hypervisor has no idea what does guest IOMMU code do - hopefully
it is not actually broken.

> And that the guest OS
> cannot further assign the affected device to any of *its* nested
> guests? Not that the broken IOMMU code in said guest OS will know the
> latter, of course.
> 
> With VIRTIO_F_IOMMU_PLATFORM set, VIRTIO_F_IOMMU_PASSTHROUGH is just a
> *hint*, suggesting that the guest OS should *request* a passthrough
> mapping from the IOMMU?

Right. But it'll work correctly if you don't.

> Via a driver←→IOMMU API which doesn't yet exist
> in Linux, since we only have 'iommu=pt' on the command line for that?
> 
> And having *neither* of those bits sets is the status quo, which means
> that your OS code might well be broken and need you to eschew the DMA
> API, but maybe not.


The status quo is that that the IOMMU might well be bypassed
and then you need to program physical addresses into the device,
but maybe not. If DMA API does not give you physical addresses, you
need to bypass it, but hypervisor does not know or care.


> 
> -- 
> dwmw2
> 
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-18 15:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <20160418182320-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 2542 bytes --]

On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> 
> > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > its own operating system's IOMMU code is expected to be broken, and
> > that the virtio driver should eschew the DMA API?
> 
> No - it tells guest that e.g. the ACPI tables (or whatever the
> equivalent is) do not match reality with respect to this device
> since IOMMU is ignored by hypervisor.
> Hypervisor has no idea what does guest IOMMU code do - hopefully
> it is not actually broken.

OK, that makes sense — thanks.

So where the platform *does* have a way to coherently tell the guest
that some devices are behind and IOMMU and some aren't, we should never
see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
the truth; qv.)

This should *only* be a crutch for platforms which cannot properly
convey that information from the hypervisor to the guest. It should be
clearly documented "thou shalt not use this unless you've first
attempted to fix the broken platform to get it right for itself".

And if we look at it as such... does it make more sense for this to be
a more *generic* qemu←→guest interface? That way the software hacks can
live in the OS IOMMU code where they belong, and prevent assignment to
nested guests for example. And can cover cases like assigned PCI
devices in existing qemu/x86 which need the same treatment.

Put another way: if we're going to add code to the guest OS to look at
this information, why can't we add that code in the guest's IOMMU
support instead, to look at an out-of-band qemu-specific "ignore IOMMU
for these devices" list instead?

> The status quo is that that the IOMMU might well be bypassed
> and then you need to program physical addresses into the device,
> but maybe not. If DMA API does not give you physical addresses, you
> need to bypass it, but hypervisor does not know or care.

Right. The status quo is that qemu doesn't provide correct information
about IOMMU topology to guests, and they have to have heuristics to
work out whether to eschew the IOMMU for a given device or not. This is
true for virtio and assigned PCI devices alike.

Furthermore, some platforms don't *have* a standard way for qemu to
'tell the truth' to the guests, and that's where the real fun comes in.
But still, I'd like to see a generic solution for that lack instead of
a virtio-specific hack.

-- 
dwmw2



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-18 16:27 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Wei Liu, Andy Lutomirski, qemu-block, Christian Borntraeger,
	qemu-devel, peterx, linux-kernel, Amit Shah, Stefan Hajnoczi, kvm,
	pbonzini, virtualization
In-Reply-To: <1460994701.3765.23.camel@infradead.org>

On Mon, Apr 18, 2016 at 11:51:41AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> > 
> > > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > > its own operating system's IOMMU code is expected to be broken, and
> > > that the virtio driver should eschew the DMA API?
> > 
> > No - it tells guest that e.g. the ACPI tables (or whatever the
> > equivalent is) do not match reality with respect to this device
> > since IOMMU is ignored by hypervisor.
> > Hypervisor has no idea what does guest IOMMU code do - hopefully
> > it is not actually broken.
> 
> OK, that makes sense — thanks.
> 
> So where the platform *does* have a way to coherently tell the guest
> that some devices are behind and IOMMU and some aren't, we should never
> see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
> perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
> the truth; qv.)
> 
> This should *only* be a crutch for platforms which cannot properly
> convey that information from the hypervisor to the guest. It should be
> clearly documented "thou shalt not use this unless you've first
> attempted to fix the broken platform to get it right for itself".
> 
> And if we look at it as such... does it make more sense for this to be
> a more *generic* qemu←→guest interface? That way the software hacks can
> live in the OS IOMMU code where they belong, and prevent assignment to
> nested guests for example. And can cover cases like assigned PCI
> devices in existing qemu/x86 which need the same treatment.
>
> Put another way: if we're going to add code to the guest OS to look at
> this information, why can't we add that code in the guest's IOMMU
> support instead, to look at an out-of-band qemu-specific "ignore IOMMU
> for these devices" list instead?

I balk at adding more hacks to a broken system. My goals are
merely to
- make things work correctly with an IOMMU and new guests,
  so people can use userspace drivers with virtio devices
- prevent security risks when guest kernel mistakenly thinks
  it's protected by an IOMMU, but in fact isn't
- avoid breaking any working configurations

Looking at guest code, it looks like virtio was always
bypassing the IOMMU even if configured, but no other
guest driver did.

This makes me think the problem where guest drivers
ignore the IOMMU is virtio specific
and so a virtio specific solution seems cleaner.

The problem for assigned devices is IMHO different: they bypass
the guest IOMMU too but no guest driver knows about this,
so guests do not work. Seems cleaner to fix QEMU to make
existing guests work.


> > The status quo is that that the IOMMU might well be bypassed
> > and then you need to program physical addresses into the device,
> > but maybe not. If DMA API does not give you physical addresses, you
> > need to bypass it, but hypervisor does not know or care.
> 
> Right. The status quo is that qemu doesn't provide correct information
> about IOMMU topology to guests, and they have to have heuristics to
> work out whether to eschew the IOMMU for a given device or not. This is
> true for virtio and assigned PCI devices alike.

True but I think we should fix QEMU to shadow IOMMU
page tables for assigned devices. This seems rather
possible with VT-D, and there are patches already on list.

It looks like this will fix all legacy guests which is
much nicer than what you suggest which will only help new guests.

> Furthermore, some platforms don't *have* a standard way for qemu to
> 'tell the truth' to the guests, and that's where the real fun comes in.
> But still, I'd like to see a generic solution for that lack instead of
> a virtio-specific hack.

But the issue is not just these holes.  E.g. with VT-D it is only easy
to emulate because there's a "caching mode" hook. It is fundamentally
paravirtualization.  So a completely generic solution would be a
paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
justified if many platforms have hard to emulate interfaces.



> -- 
> dwmw2
> 
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Potential hang in virtnet_send_command
From: Christian Ehrhardt @ 2016-04-18 17:07 UTC (permalink / raw)
  To: virtualization; +Cc: Thomas Monjalon, Michael S. Tsirkin


[-- Attachment #1.1: Type: text/plain, Size: 1373 bytes --]

Hi,
I was debugging an issue caused by a bad usage of dpdk.
That is fixed in the meantime and I'll backport that into our code as well.

But along debugging that, I found a potential hang in virtnet_send_command
that my case ran into.
The following code can become an infinite loop:
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
 */
while (!virtqueue_get_buf(vi->cvq, &tmp) &&
       !virtqueue_is_broken(vi->cvq))
        cpu_relax();

In my case dpdk broke something - not exactly clear what - and due to that
following calls through virtnet_send_command ran into this hang.
Effectively it seems that the buffers didn't get refreshed at all anymore.

That said the dpdk issue to touch devices that belong to a kernel owned
driver is fixed, so one could leave the code as is for now.
Yet I wanted to make you aware in case you would vote for a time or retry
based upper limit on that loop to avoid hangs - who knows what else might
bring it in this broken state in a different case.

Kind Regards,
Christian Ehrhardt

P.S.
Steps to reproduce, backtraces and more data can be found in:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1570195
A general setup howto for DPDK in KVM guests which is a prereq is at:
https://help.ubuntu.com/16.04/serverguide/DPDK.html#dpdk-in-guest

[-- Attachment #1.2: Type: text/html, Size: 1864 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox