Linux virtualization list

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Minchan Kim @ 2016-04-04  1:39 UTC
  To: Vlastimil Babka
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
	virtualization, bfields, linux-mm, Gioh Kim, Mel Gorman,
	Sangseok Lee, Andrew Morton, Naoya Horiguchi, Joonsoo Kim, koct9i,
	Al Viro
In-Reply-To: <56FE706D.7080507@suse.cz>

On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >Procedure of page migration is as follows:
> >
> >First of all, it should isolate a page from LRU and try to
> >migrate the page. If it is successful, it releases the page
> >for freeing. Otherwise, it should put the page back to LRU
> >list.
> >
> >For LRU pages, we have used putback_lru_page for both freeing
> >and putback to LRU list. It's okay because put_page is aware of
> >LRU list so if it releases last refcount of the page, it removes
> >the page from LRU list. However, it performs unnecessary operations
> >(e.g., lru_cache_add, pagevec and flags operations; the cost is
> >not significant, but they are not worth doing) and makes it harder to support new
> >non-lru page migration because put_page isn't aware of non-lru
> >page's data structure.
> >
> >To solve the problem, we can add new hook in put_page with
> >PageMovable flags check but it can increase overhead in
> >hot path and needs new locking scheme to stabilize the flag check
> >with put_page.
> >
> >So, this patch cleans it up by separating the two semantics (i.e., put and putback).
> >If migration is successful, use put_page instead of putback_lru_page and
> >use putback_lru_page only on failure. That makes code more readable
> >and doesn't add overhead in put_page.
> >
> >Comment from Vlastimil
> >"Yeah, and compaction (perhaps also other migration users) has to drain
> >the lru pvec... Getting rid of this stuff is worth even by itself."
> >
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> [...]
> 
> >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  		list_del(&page->lru);
> >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> >  				page_is_file_cache(page));
> >-		/* Soft-offlined page shouldn't go through lru cache list */
> >+	}
> >+
> >+	/*
> >+	 * If migration is successful, drop the reference grabbed during
> >+	 * isolation. Otherwise, restore the page to LRU list unless we
> >+	 * want to retry.
> >+	 */
> >+	if (rc == MIGRATEPAGE_SUCCESS) {
> >+		put_page(page);
> >  		if (reason == MR_MEMORY_FAILURE) {
> >-			put_page(page);
> >  			if (!test_set_page_hwpoison(page))
> >  				num_poisoned_pages_inc();
> >-		} else
> >+		}
> 
> Hmm, I didn't notice it previously, or it's due to rebasing, but it
> seems that you restricted the memory failure handling (i.e. setting
> hwpoison) to MIGRATEPAGE_SUCCESS, while previously it was done for all
> non-EAGAIN results. I think that goes against the intention of
> hwpoison, which is IIRC to catch and kill the poor process that
> still uses the page?

That's why I Cc'ed Naoya Horiguchi: to catch mistakes I might make.

Thanks for catching it, Vlastimil.
It was my mistake. But while at it, I looked over the hwpoison code and
saw that the other places which increase num_poisoned_pages are successful
migration, an already freed page, and a successfully invalidated page.
IOW, they are all cases where the page was successfully isolated, so I guess
the count should be increased only when migration succeeds?
And when I read memory_failure, it bails out without killing if it
encounters a HWPoisoned page, so I think it's not for catching and
killing the poor process.

> 
> Also (but not your fault) the put_page() preceding
> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> pin we are releasing and which one we still have (hopefully? if I
> read description of da1b13ccfbebe right) otherwise it looks like
> doing something with a page that we just potentially freed.

Yes, I had the same question while reading the code. I think the refcount
being released is the one taken by get_any_page.

Naoya, could you answer the two questions above?

Thanks.

> 
> >+	} else {
> >+		if (rc != -EAGAIN)
> >  			putback_lru_page(page);
> >+		if (put_new_page)
> >+			put_new_page(newpage, private);
> >+		else
> >+			put_page(newpage);
> >  	}
> >
> >-	/*
> >-	 * If migration was not successful and there's a freeing callback, use
> >-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
> >-	 * during isolation.
> >-	 */
> >-	if (put_new_page)
> >-		put_new_page(newpage, private);
> >-	else if (unlikely(__is_movable_balloon_page(newpage))) {
> >-		/* drop our reference, page already in the balloon */
> >-		put_page(newpage);
> >-	} else
> >-		putback_lru_page(newpage);
> >-
> >  	if (result) {
> >  		if (rc)
> >  			*result = rc;
> >
> 
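
For readers following the thread, the put/putback split described in the
changelog can be sketched as a small user-space model. This is only an
illustration under stated assumptions: struct model_page and every model_*
name are hypothetical stand-ins, not kernel APIs, and the hwpoison handling
discussed above is left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's MIGRATEPAGE_SUCCESS. */
#define MODEL_MIGRATEPAGE_SUCCESS 0

/* Hypothetical, heavily reduced model of struct page. */
struct model_page {
	int refcount;	/* pins currently held on the page */
	bool on_lru;	/* true while the page sits on an LRU list */
};

/* Model of put_page(): drop one pin and do nothing else. */
static void model_put_page(struct model_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Model of putback_lru_page(): return to the LRU, then drop the pin. */
static void model_putback_lru_page(struct model_page *page)
{
	assert(page->refcount > 0);
	page->on_lru = true;
	page->refcount--;
}

/* Tail of the migration path for the old page, as the patch arranges it. */
static void model_finish_migration(struct model_page *page, int rc)
{
	if (rc == MODEL_MIGRATEPAGE_SUCCESS)
		model_put_page(page);		/* success: just drop the pin */
	else if (rc != -EAGAIN)
		model_putback_lru_page(page);	/* hard failure: back to LRU */
	/* rc == -EAGAIN: leave the page isolated so the caller can retry */
}
```

In this model only the hard-failure path touches the LRU state, which is
the point of the cleanup: the success path no longer pays for a putback.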

* [PATCH] virtio: virtio 1.0 cs04 spec compliance for reset
From: Michael S. Tsirkin @ 2016-04-03 12:26 UTC
  To: linux-kernel; +Cc: Bandan Das, stable, virtualization

The spec says: after writing 0 to device_status, the driver MUST wait
for a read of device_status to return 0 before reinitializing the
device.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_pci_modern.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index f6f28cc..e76bd91 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -17,6 +17,7 @@
  *
  */
 
+#include <linux/delay.h>
 #define VIRTIO_PCI_NO_LEGACY
 #include "virtio_pci_common.h"
 
@@ -271,9 +272,13 @@ static void vp_reset(struct virtio_device *vdev)
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	/* 0 status means a reset. */
 	vp_iowrite8(0, &vp_dev->common->device_status);
-	/* Flush out the status write, and flush in device writes,
-	 * including MSI-X interrupts, if any. */
-	vp_ioread8(&vp_dev->common->device_status);
+	/* After writing 0 to device_status, the driver MUST wait for a read of
+	 * device_status to return 0 before reinitializing the device.
+	 * This will flush out the status write, and flush in device writes,
+	 * including MSI-X interrupts, if any.
+	 */
+	while (vp_ioread8(&vp_dev->common->device_status))
+		msleep(1);
 	/* Flush pending VQ/configuration callbacks. */
 	vp_synchronize_vectors(vdev);
 }
-- 
MST
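
The wait-for-zero behaviour the patch adds can be illustrated with a
user-space model. This is a sketch, not the driver: struct model_dev and
the model_* accessors are hypothetical, and the device is assumed to keep
reporting its old status for a few reads after reset is requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model: reset takes a few status reads to complete. */
struct model_dev {
	uint8_t status;		/* value the device currently reports */
	unsigned int busy_reads;/* reads that still return the old status */
	bool reset_pending;	/* reset requested but not yet complete */
};

static void model_iowrite8(struct model_dev *dev, uint8_t val)
{
	if (val == 0)
		dev->reset_pending = true;	/* writing 0 requests a reset */
	else
		dev->status = val;
}

static uint8_t model_ioread8(struct model_dev *dev)
{
	if (dev->reset_pending) {
		if (dev->busy_reads > 0) {
			dev->busy_reads--;	/* still resetting */
		} else {
			dev->status = 0;	/* reset finished */
			dev->reset_pending = false;
		}
	}
	return dev->status;
}

/* Write 0, then poll until the device reports 0. Returns the poll count. */
static unsigned int model_reset(struct model_dev *dev)
{
	unsigned int polls = 0;

	model_iowrite8(dev, 0);
	while (model_ioread8(dev) != 0)
		polls++;	/* the patch sleeps with msleep(1) here */

	assert(!dev->reset_pending);	/* safe to reinitialize from here on */
	return polls;
}
```

A single read after the write, as in the code being replaced, would return
the stale status in this model and the driver would proceed too early.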

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
From: Vlastimil Babka @ 2016-04-01 21:29 UTC
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, dri-devel,
	virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, Mel Gorman,
	Sangseok Lee, jlayton, Joonsoo Kim, koct9i, Al Viro
In-Reply-To: <1459321935-3655-3-git-send-email-minchan@kernel.org>

Might have been better as a separate migration patch and then a compaction
patch. It's prefixed mm/compaction, but most changes are in mm/migrate.c.

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> Until now we have allowed migration only for LRU pages, and that was
> enough to make high-order pages. But recently, embedded systems (e.g.,
> webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
> so we have seen several reports of trouble with small high-order
> allocations. There were several efforts to fix the problem
> (e.g., enhancing the compaction algorithm, SLUB fallback to 0-order pages,
> reserved memory, vmalloc and so on), but if there are lots of
> non-movable pages in the system, those solutions are void in the long run.
>
> So, this patch adds a facility to turn non-movable pages into
> movable ones. For this feature, the patch adds migration-related
> functions to address_space_operations as well as some page flags.
>
> Basically, this patch supports two page-flags and two functions related
> to page migration. The flag and page->mapping stability are protected
> by PG_lock.
>
> 	PG_movable
> 	PG_isolated
>
> 	bool (*isolate_page) (struct page *, isolate_mode_t);
> 	void (*putback_page) (struct page *);
>
> The duties of a subsystem that wants to make its pages migratable are
> as follows:
>
> 1. It should register address_space to page->mapping then mark
> the page as PG_movable via __SetPageMovable.
>
> 2. It should mark the page as PG_isolated via SetPageIsolated
> if isolation is successful and return true.

Ah another thing to document (especially in the comments/Doc) is that the 
subsystem must not expect anything to survive in page.lru (or fields that union 
it) after having isolated successfully.

> 3. If migration is successful, it should clear PG_isolated and
> PG_movable of the page for free preparation then release the
> reference of the page to free.
>
> 4. If migration fails, putback function of subsystem should
> clear PG_isolated via ClearPageIsolated.
>
> 5. If a subsystem wants to release an isolated page, it should
> clear PG_isolated but not PG_movable. Instead, VM will do it.

Under lock? Or just with ClearPageIsolated?

> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   Documentation/filesystems/Locking      |   4 +
>   Documentation/filesystems/vfs.txt      |   5 +
>   fs/proc/page.c                         |   3 +
>   include/linux/fs.h                     |   2 +
>   include/linux/migrate.h                |   2 +
>   include/linux/page-flags.h             |  31 ++++++
>   include/uapi/linux/kernel-page-flags.h |   1 +
>   mm/compaction.c                        |  14 ++-
>   mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
>   9 files changed, 217 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9bfdcb3..0bb79560abb3 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -195,7 +195,9 @@ unlocks and drops the reference.
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	int (*migratepage)(struct address_space *, struct page *, struct page *);
> +	void (*putback_page) (struct page *);
>   	int (*launder_page)(struct page *);
>   	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
>   	int (*error_remove_page)(struct address_space *, struct page *);
> @@ -219,7 +221,9 @@ invalidatepage:		yes
>   releasepage:		yes
>   freepage:		yes
>   direct_IO:
> +isolate_page:		yes
>   migratepage:		yes (both)
> +putback_page:		yes
>   launder_page:		yes
>   is_partially_uptodate:	yes
>   error_remove_page:	yes
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index b02a7d598258..4c1b6c3b4bc8 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -592,9 +592,14 @@ struct address_space_operations {
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	/* isolate a page for migration */
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	/* migrate the contents of a page to the specified target */
>   	int (*migratepage) (struct page *, struct page *);
> +	/* put the page back to right list */

... "after a failed migration" ?

> +	void (*putback_page) (struct page *);
>   	int (*launder_page) (struct page *);
> +
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
>   	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 3ecd445e830d..ce3d08a4ad8d 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
>   	if (page_is_idle(page))
>   		u |= 1 << KPF_IDLE;
>
> +	if (PageMovable(page))
> +		u |= 1 << KPF_MOVABLE;
> +
>   	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
>
>   	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index da9e67d937e5..36f2d610e7a8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -401,6 +401,8 @@ struct address_space_operations {
>   	 */
>   	int (*migratepage) (struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
> +	bool (*isolate_page)(struct page *, isolate_mode_t);
> +	void (*putback_page)(struct page *);
>   	int (*launder_page) (struct page *);
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9b50325e4ddf..404fbfefeb33 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
>   extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>   		unsigned long private, enum migrate_mode mode, int reason);
> +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> +extern void putback_movable_page(struct page *page);
>
>   extern int migrate_prep(void);
>   extern int migrate_prep_local(void);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b0c77..77ebf8fdbc6e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -129,6 +129,10 @@ enum pageflags {
>
>   	/* Compound pages. Stored in first tail page's flags */
>   	PG_double_map = PG_private_2,
> +
> +	/* non-lru movable pages */
> +	PG_movable = PG_reclaim,
> +	PG_isolated = PG_owner_priv_1,

Documentation should probably state that these fields alias and subsystem 
supporting the movable pages shouldn't use them elsewhere.

Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
page->mapping->a_ops->isolate_page exists for PageMovable() pages. What if it's 
a false positive on a PG_reclaim page? Can we rely on PG_reclaim always (and 
without races) implying PageLRU() so that we don't even attempt 
isolate_movable_page()?

>   };
>
>   #ifndef __GENERATING_BOUNDS_H
> @@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
>   	atomic_set(&page->_mapcount, -1);
>   }
>
> +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)

IIRC this was what Gioh's previous attempts used instead of PG_movable? Is it 
still needed? Doesn't it prevent a driver providing movable *and* mapped pages?
If it's to distinguish the PG_reclaim alias that I mention above, it seems like
overkill to me. Why would we need both a special mapcount value and a flag?
Checking that page->mapping->a_ops->isolate_page exists before calling it should 
be enough to resolve the ambiguity?

> +
> +static inline int PageMovable(struct page *page)
> +{
> +	return ((test_bit(PG_movable, &(page)->flags) &&
> +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> +		|| PageBalloon(page));
> +}
> +
> +/* Caller should hold a PG_lock */
> +static inline void __SetPageMovable(struct page *page,
> +				struct address_space *mapping)
> +{
> +	page->mapping = mapping;
> +	__set_bit(PG_movable, &page->flags);
> +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> +}
> +
> +static inline void __ClearPageMovable(struct page *page)
> +{
> +	atomic_set(&page->_mapcount, -1);
> +	__clear_bit(PG_movable, &(page)->flags);
> +	page->mapping = NULL;
> +}
> +
> +PAGEFLAG(Isolated, isolated, PF_ANY);
> +
>   /*
>    * If network-based swap is enabled, sl*b must keep track of whether pages
>    * were allocated from pfmemalloc reserves.
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 5da5f8751ce7..a184fd2434fa 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -34,6 +34,7 @@
>   #define KPF_BALLOON		23
>   #define KPF_ZERO_PAGE		24
>   #define KPF_IDLE		25
> +#define KPF_MOVABLE		26
>
>
>   #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ccf97b02b85f..7557aedddaee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
>   		/*
>   		 * Check may be lockless but that's ok as we recheck later.
> -		 * It's possible to migrate LRU pages and balloon pages
> +		 * It's possible to migrate LRU and movable kernel pages.
>   		 * Skip any other type of page
>   		 */
>   		is_lru = PageLRU(page);
> @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   					goto isolate_success;
>   				}
>   			}
> +
> +			if (unlikely(PageMovable(page)) &&
> +					!PageIsolated(page)) {
> +				if (locked) {
> +					spin_unlock_irqrestore(&zone->lru_lock,
> +									flags);
> +					locked = false;
> +				}
> +
> +				if (isolate_movable_page(page, isolate_mode))
> +					goto isolate_success;
> +			}
>   		}
>
>   		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 53529c805752..b56bf2b3fe8c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -73,6 +73,85 @@ int migrate_prep_local(void)
>   	return 0;
>   }
>
> +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> +{
> +	bool ret = false;

Maintaining "ret" seems useless here. All the "goto out*" statements are 
executed only when ret is false, and ret == true is returned by a different return.

> +
> +	/*
> +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> +	 * or just got freed under us.
> +	 *
> +	 * In case we 'win' a race for a movable page being freed under us and
> +	 * raise its refcount preventing __free_pages() from doing its job
> +	 * the put_page() at the end of this block will take care of
> +	 * release this page, thus avoiding a nasty leakage.
> +	 */
> +	if (unlikely(!get_page_unless_zero(page)))
> +		goto out;
> +
> +	/*
> +	 * Check PG_movable before holding a PG_lock because page's owner
> +	 * assumes anybody doesn't touch PG_lock of newly allocated page.
> +	 */
> +	if (unlikely(!PageMovable(page)))
> +		goto out_putpage;
> +	/*
> +	 * As movable pages are not isolated from LRU lists, concurrent
> +	 * compaction threads can race against page migration functions
> +	 * as well as race against the releasing a page.
> +	 *
> +	 * In order to avoid having an already isolated movable page
> +	 * being (wrongly) re-isolated while it is under migration,
> +	 * or to avoid attempting to isolate pages being released,
> +	 * lets be sure we have the page lock
> +	 * before proceeding with the movable page isolation steps.
> +	 */
> +	if (unlikely(!trylock_page(page)))
> +		goto out_putpage;
> +
> +	if (!PageMovable(page) || PageIsolated(page))
> +		goto out_no_isolated;
> +
> +	ret = page->mapping->a_ops->isolate_page(page, mode);
> +	if (!ret)
> +		goto out_no_isolated;
> +
> +	WARN_ON_ONCE(!PageIsolated(page));
> +	unlock_page(page);
> +	return ret;
> +
> +out_no_isolated:
> +	unlock_page(page);
> +out_putpage:
> +	put_page(page);
> +out:
> +	return ret;
> +}
> +
> +/* It should be called on page which is PG_movable */
> +void putback_movable_page(struct page *page)
> +{
> +	/*
> +	 * 'lock_page()' stabilizes the page and prevents races against
> +	 * concurrent isolation threads attempting to re-isolate it.
> +	 */
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +
> +	lock_page(page);
> +	if (PageIsolated(page)) {
> +		struct address_space *mapping;
> +
> +		mapping = page_mapping(page);
> +		mapping->a_ops->putback_page(page);
> +		WARN_ON_ONCE(PageIsolated(page));
> +	} else {
> +		__ClearPageMovable(page);
> +	}
> +	unlock_page(page);
> +	/* drop the extra ref count taken for movable page isolation */
> +	put_page(page);
> +}
> +
>   /*
>    * Put previously isolated pages back onto the appropriate lists
>    * from where they were once taken off for compaction/migration.
> @@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
>   		list_del(&page->lru);
>   		dec_zone_page_state(page, NR_ISOLATED_ANON +
>   				page_is_file_cache(page));
> -		if (unlikely(isolated_balloon_page(page)))
> +		if (unlikely(isolated_balloon_page(page))) {
>   			balloon_page_putback(page);
> -		else
> +		} else if (unlikely(PageMovable(page))) {
> +			if (PageIsolated(page)) {
> +				putback_movable_page(page);
> +			} else {
> +				__ClearPageMovable(page);

We don't do lock_page() here, so what prevents parallel compaction isolating the 
same page?

> +				put_page(page);
> +			}
> +		} else {
>   			putback_lru_page(page);
> +		}
>   	}
>   }
>
> @@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>    ***********************************************************/
>
>   /*
> - * Common logic to directly migrate a single page suitable for
> + * Common logic to directly migrate a single LRU page suitable for
>    * pages that do not use PagePrivate/PagePrivate2.
>    *
>    * Pages are locked upon entry and exit.
> @@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   				enum migrate_mode mode)
>   {
>   	struct address_space *mapping;
> -	int rc;
> +	int rc = -EAGAIN;
> +	bool lru_movable = true;
>
>   	VM_BUG_ON_PAGE(!PageLocked(page), page);
>   	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>
>   	mapping = page_mapping(page);
> -	if (!mapping)
> -		rc = migrate_page(mapping, newpage, page, mode);
> -	else if (mapping->a_ops->migratepage)
> -		/*
> -		 * Most pages have a mapping and most filesystems provide a
> -		 * migratepage callback. Anonymous pages are part of swap
> -		 * space which also has its own migratepage callback. This
> -		 * is the most common path for page migration.
> -		 */
> -		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
> -	else
> -		rc = fallback_migrate_page(mapping, newpage, page, mode);
> +	/*
> +	 * In case of non-lru page, it could be released after
> +	 * isolation step. In that case, we shouldn't try
> +	 * fallback migration which was designed for LRU pages.
> +	 *
> +	 * The rule for such case is that subsystem should clear
> +	 * PG_isolated but remains PG_movable so VM should catch
> +	 * it and clear PG_movable for it.
> +	 */
> +	if (unlikely(PageMovable(page))) {

Can false positive from PG_reclaim occur here?

> +		lru_movable = false;
> +		VM_BUG_ON_PAGE(!mapping, page);
> +		if (!PageIsolated(page)) {
> +			rc = MIGRATEPAGE_SUCCESS;
> +			__ClearPageMovable(page);
> +			goto out;
> +		}
> +	}
> +
> +	if (likely(lru_movable)) {
> +		if (!mapping)
> +			rc = migrate_page(mapping, newpage, page, mode);
> +		else if (mapping->a_ops->migratepage)
> +			/*
> +			 * Most pages have a mapping and most filesystems
> +			 * provide a migratepage callback. Anonymous pages
> +			 * are part of swap space which also has its own
> +			 * migratepage callback. This is the most common path
> +			 * for page migration.
> +			 */
> +			rc = mapping->a_ops->migratepage(mapping, newpage,
> +							page, mode);
> +		else
> +			rc = fallback_migrate_page(mapping, newpage,
> +							page, mode);
> +	} else {
> +		rc = mapping->a_ops->migratepage(mapping, newpage,
> +						page, mode);
> +		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> +			PageIsolated(page));
> +	}
>
>   	/*
>   	 * When successful, old pagecache page->mapping must be cleared before
> @@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   		if (!PageAnon(page))
>   			page->mapping = NULL;
>   	}
> +out:
>   	return rc;
>   }
>
> @@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   			put_new_page(newpage, private);
>   		else
>   			put_page(newpage);
> +		if (PageMovable(page))
> +			__ClearPageMovable(page);
>   		goto out;
>   	}
>
> @@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   				num_poisoned_pages_inc();
>   		}
>   	} else {
> -		if (rc != -EAGAIN)
> -			putback_lru_page(page);
> +		if (rc != -EAGAIN) {
> +			/*
> +			 * subsystem couldn't remove PG_movable since page is
> +			 * isolated so PageMovable check is not racy in here.
> +			 * But PageIsolated check can be racy but it's okay
> +			 * because putback_movable_page checks it under PG_lock
> +			 * again.
> +			 */
> +			if (unlikely(PageMovable(page))) {
> +				if (PageIsolated(page))
> +					putback_movable_page(page);
> +				else {
> +					__ClearPageMovable(page);

Again, we don't do lock_page() here, so what prevents parallel compaction 
isolating the same page?

Sorry for so many questions, hope they all have good answers and this series is 
a success :) Thanks for picking it up.

> +					put_page(page);
> +				}
> +			} else {
> +				putback_lru_page(page);
> +			}
> +		}
> +
>   		if (put_new_page)
>   			put_new_page(newpage, private);
>   		else
>
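
Two of the review points above (dropping the "ret" variable, and checking
that an isolate_page callback exists before calling it) can be sketched in
a user-space model. All names here are hypothetical: struct model_page and
struct model_aops only mimic the fields the quoted function touches, and
the real function's locking and memory-ordering details are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page;

/* Stands in for the isolate_page member of address_space_operations. */
struct model_aops {
	bool (*isolate_page)(struct model_page *page);
};

struct model_page {
	int refcount;
	bool movable;			/* models PageMovable() */
	bool isolated;			/* models PageIsolated() */
	bool locked;			/* models PG_locked */
	const struct model_aops *aops;	/* models page->mapping->a_ops */
};

static bool model_get_page_unless_zero(struct model_page *page)
{
	if (page->refcount == 0)
		return false;
	page->refcount++;
	return true;
}

static bool model_trylock_page(struct model_page *page)
{
	if (page->locked)
		return false;
	page->locked = true;
	return true;
}

/* Example driver callback: mark the page isolated and report success. */
static bool model_driver_isolate(struct model_page *page)
{
	page->isolated = true;
	return true;
}

static bool model_isolate_movable_page(struct model_page *page)
{
	if (!model_get_page_unless_zero(page))
		return false;
	if (!page->movable)		/* lockless pre-check */
		goto out_putpage;
	if (!model_trylock_page(page))
		goto out_putpage;
	if (!page->movable || page->isolated)	/* recheck under the lock */
		goto out_unlock;
	/* Guard against a false positive: no callback, no isolation. */
	if (!page->aops || !page->aops->isolate_page)
		goto out_unlock;
	if (!page->aops->isolate_page(page))
		goto out_unlock;

	assert(page->isolated);		/* the patch uses WARN_ON_ONCE */
	page->locked = false;
	return true;			/* the pin is kept while isolated */

out_unlock:
	page->locked = false;
out_putpage:
	page->refcount--;
	return false;
}
```

Every failure label returns false directly, so no result variable has to be
carried through the function. Whether the callback check alone resolves the
PG_reclaim ambiguity is the open question raised in the review, not
something this model answers.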

* [RFC v3 -next 2/2] virtio_net: Read the advised MTU
From: Aaron Conole @ 2016-04-01 19:32 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta
In-Reply-To: <1459539136-13948-1-git-send-email-aconole@redhat.com>

This patch checks the feature bit for the VIRTIO_NET_F_MTU feature. If it
exists, read the advised MTU and use it.

No proper error handling is provided for the case where a user changes the
negotiated MTU. A future commit will add proper error handling. Instead, a
warning is emitted if the guest changes the device MTU after previously
being given advice.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
v2:
* Whitespace cleanup in the last hunk
* Code style change around the pr_warn
* Additional test for mtu change before printing warning
v3:
* removed the mtu change warning

 drivers/net/virtio_net.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 49d84e5..2308083 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1450,6 +1450,7 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
 
 static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 {
+	struct virtnet_info *vi = netdev_priv(dev);
 	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
 		return -EINVAL;
 	dev->mtu = new_mtu;
@@ -1896,6 +1897,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
 		vi->has_cvq = true;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_MTU)) {
+		dev->mtu = virtio_cread16(vdev,
+					  offsetof(struct virtio_net_config,
+						   mtu));
+	}
+
 	if (vi->any_header_sg)
 		dev->needed_headroom = vi->hdr_len;
 
@@ -2081,6 +2088,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ,
 	VIRTIO_NET_F_CTRL_MAC_ADDR,
 	VIRTIO_F_ANY_LAYOUT,
+	VIRTIO_NET_F_MTU,
 };
 
 static struct virtio_driver virtio_net_driver = {
-- 
2.5.5
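
The probe-time logic in this patch amounts to "use the device's advised MTU
only if the feature bit was negotiated". A minimal user-space sketch of
that decision follows. struct model_vdev and the model_* helpers are
hypothetical, the bit number 25 is taken from patch 1/2 of this series, and
the config-space access that virtio_cread16() performs in the driver is
reduced to a plain field read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VIRTIO_NET_F_MTU 25	/* bit number from patch 1/2 */

/* Hypothetical model of the negotiated state of a virtio-net device. */
struct model_vdev {
	uint64_t features;	/* negotiated feature bits */
	uint16_t config_mtu;	/* MTU advertised in device config space */
};

static bool model_has_feature(const struct model_vdev *vdev, unsigned int bit)
{
	assert(bit < 64);
	return (vdev->features >> bit) & 1;
}

/* Pick the MTU at probe time: advised value if offered, else the default. */
static unsigned int model_probe_mtu(const struct model_vdev *vdev,
				    unsigned int default_mtu)
{
	if (model_has_feature(vdev, MODEL_VIRTIO_NET_F_MTU))
		return vdev->config_mtu;
	return default_mtu;
}
```

Note that the model, like the patch, does not validate the advised value
against any minimum or maximum; that is left to later error handling.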

* [RFC v3 -net 1/2] virtio: Start feature MTU support
From: Aaron Conole @ 2016-04-01 19:32 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta
In-Reply-To: <1459539136-13948-1-git-send-email-aconole@redhat.com>

This commit adds the feature bit and associated mtu device entry for the
virtio network device. Future commits will make use of these bits to support
negotiated MTU.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
---
v2,v3:
* No change

 include/uapi/linux/virtio_net.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index ec32293..41a6a01 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -55,6 +55,7 @@
 #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
+#define VIRTIO_NET_F_MTU 25	/* Device supports Default MTU Negotiation */
 
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
@@ -73,6 +74,8 @@ struct virtio_net_config {
 	 * Legal values are between 1 and 0x8000
 	 */
 	__u16 max_virtqueue_pairs;
+	/* Default maximum transmit unit advice */
+	__u16 mtu;
 } __attribute__((packed));
 
 /*
-- 
2.5.5
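
Because the config struct is packed, appending the field fixes its offset
in config space. The sketch below mirrors the layout in user space to show
where the new field would land. It assumes the two fields that precede
max_virtqueue_pairs in the uapi header (mac[6] and status), which this
hunk does not show; the struct name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the packed config layout; not the uapi definition. */
struct model_virtio_net_config {
	uint8_t  mac[6];		/* assumed: not visible in the hunk */
	uint16_t status;		/* assumed: not visible in the hunk */
	uint16_t max_virtqueue_pairs;
	uint16_t mtu;			/* field added by this patch */
} __attribute__((packed));

/* Byte offset at which a driver would read the advised MTU. */
static size_t model_mtu_offset(void)
{
	return offsetof(struct model_virtio_net_config, mtu);
}
```

Under these assumptions the field sits at byte 10 and the struct grows from
10 to 12 bytes, so a device without the feature simply has a shorter config
space.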

* [RFC v3 -next 0/2] virtio-net: Advised MTU feature
From: Aaron Conole @ 2016-04-01 19:32 UTC
  To: netdev, Michael S. Tsirkin, virtualization, linux-kernel,
	Paolo Abeni, Sergei Shtylyov, Pankaj Gupta

The following series adds the ability for a hypervisor to set an MTU on the
guest during feature negotiation phase. This is useful for VM orchestration
when, for instance, tunneling is involved and the MTU of the various systems
should be homogeneous.

The first patch adds the feature bit as described in the proposed VIRTIO spec
addition found at
https://lists.oasis-open.org/archives/virtio-dev/201603/msg00001.html
The second patch adds a user of the bit, and a warning when the guest changes
the MTU from the hypervisor advised MTU. Future patches may add more thorough
error handling.

v2:
* Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni
* Additional test before printing a warning

v3:
* Removed the warning when changing MTU (which simplified the code)

Aaron Conole (2):
  virtio: Start feature MTU support
  virtio_net: Read the advised MTU

 drivers/net/virtio_net.c        | 8 ++++++++
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 11 insertions(+)

-- 
2.5.5

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
From: Vlastimil Babka @ 2016-04-01 14:38 UTC
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Jonathan Corbet, Hugh Dickins, linux-kernel,
	virtualization, bfields, linux-mm, Gioh Kim, Mel Gorman,
	Sangseok Lee, jlayton, Joonsoo Kim, koct9i, Al Viro
In-Reply-To: <1459321935-3655-4-git-send-email-minchan@kernel.org>

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> This patch describes what a subsystem should do for non-lru movable
> page supporting.

Intentionally reading this first without studying the code to better catch 
things that would seem obvious otherwise.

> Cc: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   Documentation/filesystems/vfs.txt | 11 ++++++-
>   Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 78 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 4c1b6c3b4bc8..d63142f8ed7b 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -752,12 +752,21 @@ struct address_space_operations {
>           and transfer data directly between the storage and the
>           application's address space.
>
> +  isolate_page: Called by the VM when isolating a movable non-lru page.
> +	If page is successfully isolated, we should mark the page as
> +	PG_isolated via __SetPageIsolated.

Patch 02 changelog suggests SetPageIsolated, so this is confusing. I guess the 
main point is that there might be parallel attempts and only one is allowed to 
succeed, right? Whether it's done by atomic ops or otherwise doesn't matter to 
e.g. compaction.
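
To illustrate the "only one attempt may succeed" point, here is a minimal
user-space sketch using C11 atomics. It is not kernel code: struct fake_page,
FAKE_PG_ISOLATED and both helper names are made up for the illustration, and a
real implementer might just as well serialize on the page lock instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only models a flags word. */
struct fake_page {
	atomic_uint flags;
};

#define FAKE_PG_ISOLATED (1u << 0)	/* illustrative bit, not the kernel's */

/*
 * Atomically set the isolated bit. Only the caller that observes the bit
 * as previously clear "wins"; every parallel attempt gets false.
 */
static bool fake_isolate_page(struct fake_page *page)
{
	unsigned int old;

	assert(page != NULL);
	old = atomic_fetch_or(&page->flags, FAKE_PG_ISOLATED);
	return !(old & FAKE_PG_ISOLATED);
}

/* Undo the isolation, e.g. after a failed migration. */
static void fake_putback_page(struct fake_page *page)
{
	assert(page != NULL);
	atomic_fetch_and(&page->flags, ~FAKE_PG_ISOLATED);
}
```

From the caller's side (compaction, say) only the boolean result matters, which
is why the documentation could describe the contract rather than a specific
flag helper.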

>     migrate_page:  This is used to compact the physical memory usage.
>           If the VM wants to relocate a page (maybe off a memory card
>           that is signalling imminent failure) it will pass a new page
>   	and an old page to this function.  migrate_page should
>   	transfer any private data across and update any references
> -        that it has to the page.
> +	that it has to the page. If migrated page is non-lru page,
> +	we should clear PG_isolated and PG_movable via __ClearPageIsolated
> +	and __ClearPageMovable.

Similar concern as __SetPageIsolated.

> +
> +  putback_page: Called by the VM when isolated page's migration fails.
> +	We should clear PG_isolated marked in isolated_page function.

Note this kind of wording is less confusing and could be used above wrt my concerns.

>
>     launder_page: Called before freeing a page - it writes back the dirty page. To
>     	prevent redirtying the page, it is kept locked during the whole
> diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
> index fea5c0864170..c4e7551a414e 100644
> --- a/Documentation/vm/page_migration
> +++ b/Documentation/vm/page_migration
> @@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
>   20. The new page is moved to the LRU and can be scanned by the swapper
>       etc again.
>
> -Christoph Lameter, May 8, 2006.
> +C. Non-LRU Page migration
> +-------------------------
> +
> +Although original migration aimed for reducing the latency of memory access
> +for NUMA, compaction who want to create high-order page is also main customer.
> +
> +Ppage migration's disadvantage is that it was designed to migrate only
> +*LRU* pages. However, there are potential non-lru movable pages which can be
> +migrated in system, for example, zsmalloc, virtio-balloon pages.
> +For virtio-balloon pages, some parts of migration code path was hooked up
> +and added virtio-balloon specific functions to intercept logi.

logi -> logic?

> +It's too specific to one subsystem so other subsystem who want to make
> +their pages movable should add own specific hooks in migration path.

s/should/would have to/ I guess?

> +To solve such problem, VM supports non-LRU page migration which provides
> +generic functions for non-LRU movable pages without needing subsystem
> +specific hook in mm/{migrate|compact}.c.
> +
> +If a subsystem want to make own pages movable, it should mark pages as
> +PG_movable via __SetPageMovable. __SetPageMovable needs address_space for
> +argument for register functions which will be called by VM.
> +
> +Three functions in address_space_operation related to non-lru movable page:
> +
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
> +	int (*migratepage) (struct address_space *,
> +		struct page *, struct page *, enum migrate_mode);
> +	void (*putback_page)(struct page *);
> +
> +1. Isolation
> +
> +What VM expected on isolate_page of subsystem is to set PG_isolated flags
> +of the page if it was successful. With that, concurrent isolation among
> +CPUs skips the isolated page by other CPU earlier. VM calls isolate_page
> +under PG_lock of page. If a subsystem cannot isolate the page, it should
> +return false.

Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.

In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned about doing 
this in general under page_lock and suggested that each user handles concurrent 
calls to isolate_page() internally. Might be more generic that way, even if all 
current implementers will actually use the page lock.

Also it's worth reading that mail in full and incorporating here, as there are 
more concerns related to concurrency that should be documented, e.g. with pages 
that can be mapped to userspace. Not a case with zram and balloon pages I guess, 
but one of Gioh's original use cases was a driver which IIRC could map pages. So 
the design and documentation should keep that in mind.

> +2. Migration
> +
> +After successful isolation, VM calls migratepage. The migratepage's goal is
> +to move content of the old page to new page and set up struct page fields
> +of new page. If migration is successful, subsystem should release old page's
> +refcount to free. Keep in mind that subsystem should clear PG_movable and
> +PG_isolated before releasing the refcount.  If everything are done, user
> +should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
> +at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
> +migration because VM interprets -EAGAIN as "temporal migration failure".
> +
> +3. Putback
> +
> +If migration was unsuccessful, VM calls putback_page. The subsystem should
> +insert isolated page to own data structure again if it has. And subsystem
> +should clear PG_isolated which was marked in isolation step.
> +
> +Note about releasing page:
> +
> +Subsystem can release pages whenever it want but if it releses the page
> +which is already isolated, it should clear PG_isolated but doesn't touch
> +PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
> +his job done. Otherweise, subsystem should clear both page flags before
> +releasing the page.

I don't understand this right now. But maybe I will get it after reading the 
patches and suggest some improved wording here.

> +
> +Note about PG_isolated:
> +
> +PG_isolated check on a page is valid only if the page's flag is already
> +set to PG_movable.

But it's not possible to check both atomically, so I guess it implies checking 
under page lock? If that's true, should be explicit.

Thanks!

> +Christoph Lameter, May 8, 2006.
> +Minchan Kim, Mar 28, 2016.
>


* [RFC v5 5/5] VSOCK: Add Makefile and Kconfig
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

Enable virtio-vsock and vhost-vsock.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v4:
 * Make checkpatch.pl happy with longer option description
 * Clarify dependency on virtio rather than QEMU as suggested by Alex
   Bennee
v3:
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann <arnd@arndb.de>)
---
 drivers/vhost/Kconfig  | 15 +++++++++++++++
 drivers/vhost/Makefile |  4 ++++
 net/vmw_vsock/Kconfig  | 19 +++++++++++++++++++
 net/vmw_vsock/Makefile |  2 ++
 4 files changed, 40 insertions(+)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 533eaf0..d7aae9e 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -21,6 +21,21 @@ config VHOST_SCSI
 	Say M here to enable the vhost_scsi TCM fabric module
 	for use with virtio-scsi guests
 
+config VHOST_VSOCK
+	tristate "vhost virtio-vsock driver"
+	depends on VSOCKETS && EVENTFD
+	select VIRTIO_VSOCKETS_COMMON
+	select VHOST
+	select VHOST_RING
+	default n
+	---help---
+	This kernel module can be loaded in the host kernel to provide AF_VSOCK
+	sockets for communicating with guests.  The guests must have the
+	virtio_transport.ko driver loaded to use the virtio-vsock device.
+
+	To compile this driver as a module, choose M here: the module will be called
+	vhost_vsock.
+
 config VHOST_RING
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index e0441c3..6b012b9 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -4,5 +4,9 @@ vhost_net-y := net.o
 obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o
 vhost_scsi-y := scsi.o
 
+obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
+vhost_vsock-y := vsock.o
+
 obj-$(CONFIG_VHOST_RING) += vringh.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index 14810ab..f27e74b 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -26,3 +26,22 @@ config VMWARE_VMCI_VSOCKETS
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called vmw_vsock_vmci_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS
+	tristate "virtio transport for Virtual Sockets"
+	depends on VSOCKETS && VIRTIO
+	select VIRTIO_VSOCKETS_COMMON
+	help
+	  This module implements a virtio transport for Virtual Sockets.
+
+	  Enable this transport if your Virtual Machine host supports Virtual
+	  Sockets over virtio.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called virtio_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS_COMMON
+       tristate
+       ---help---
+         This option is selected by any driver which needs to access
+         the virtio_vsock.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 2ce52d7..cf4c294 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,5 +1,7 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS) += virtio_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += virtio_transport_common.o
 
 vsock-y += af_vsock.o vsock_addr.o
 
-- 
2.5.5


* [RFC v5 4/5] VSOCK: Introduce vhost_vsock.ko
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

VM sockets vhost transport implementation.  This driver runs on the
host.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v5:
 * Only take rx/tx virtqueues, userspace handles the other virtqueues
 * Explicitly skip instances without a CID when transferring packets
 * Add VHOST_VSOCK_START ioctl to begin vhost virtqueue processing
 * Reset established connections when device is closed
v4:
 * Add MAINTAINERS file entry
 * virtqueue used len is now sizeof(pkt->hdr) + pkt->len instead of just
   pkt->len
 * checkpatch.pl cleanups
 * Clarify struct vhost_vsock locking
 * Add comments about optimization that disables virtqueue notify
 * Drop unused vhost_vsock_handle_ctl_kick()
 * Call wake_up() after decrementing total_tx_buf to prevent deadlock
v3:
 * Remove unneeded variable used to store return value
   (Fengguang Wu <fengguang.wu@intel.com> and Julia Lawall
   <julia.lawall@lip6.fr>)
v2:
 * Add missing total_tx_buf decrement
 * Support flexible rx/tx descriptor layout
 * Refuse to assign reserved CIDs
 * Refuse guest CID if already in use
 * Only accept correctly addressed packets
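
The CID rules listed under v2 and v5 (refuse reserved CIDs, refuse a CID that
is already in use, skip instances without a CID) can be sketched as a small
user-space model. This is only an illustration under stated assumptions:
struct model_vsock, model_set_cid() and MODEL_VMADDR_CID_HOST are hypothetical
names, a plain array stands in for the global instance list, and no locking is
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as VMADDR_CID_HOST; CIDs 0..2 are treated as reserved. */
#define MODEL_VMADDR_CID_HOST 2u

/* Hypothetical stand-in for one device instance. */
struct model_vsock {
	uint32_t guest_cid;	/* 0 means "no CID assigned yet" */
};

/*
 * Assign cid to self unless it is reserved or owned by another instance.
 * Returns 0 on success, -EINVAL or -EADDRINUSE otherwise.
 */
static int model_set_cid(struct model_vsock *all, size_t n,
			 struct model_vsock *self, uint32_t cid)
{
	size_t i;

	assert(self != NULL);

	if (cid <= MODEL_VMADDR_CID_HOST)
		return -EINVAL;

	for (i = 0; i < n; i++) {
		/* Instances without a CID can never match */
		if (all[i].guest_cid == 0)
			continue;
		if (all[i].guest_cid == cid && &all[i] != self)
			return -EADDRINUSE;
	}

	self->guest_cid = cid;
	return 0;
}
```

Re-assigning the CID an instance already owns succeeds in this model, matching
the "other != vsock" check in the patch.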
---
 MAINTAINERS           |   2 +
 drivers/vhost/vsock.c | 694 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vsock.h |   5 +
 3 files changed, 701 insertions(+)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1ed7364..7899c3c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11612,6 +11612,8 @@ F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport_common.c
 F:	net/vmw_vsock/virtio_transport.c
+F:	drivers/vhost/vsock.c
+F:	drivers/vhost/vsock.h
 
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
new file mode 100644
index 0000000..8488d01
--- /dev/null
+++ b/drivers/vhost/vsock.c
@@ -0,0 +1,694 @@
+/*
+ * vhost transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <net/sock.h>
+#include <linux/virtio_vsock.h>
+#include <linux/vhost.h>
+
+#include <net/af_vsock.h>
+#include "vhost.h"
+#include "vsock.h"
+
+#define VHOST_VSOCK_DEFAULT_HOST_CID	2
+
+enum {
+	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+};
+
+/* Used to track all the vhost_vsock instances on the system. */
+static LIST_HEAD(vhost_vsock_list);
+static DEFINE_MUTEX(vhost_vsock_mutex);
+
+struct vhost_vsock {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[2];
+
+	/* Link to global vhost_vsock_list, protected by vhost_vsock_mutex */
+	struct list_head list;
+
+	struct vhost_work send_pkt_work;
+	wait_queue_head_t send_wait;
+
+	/* Fields protected by vqs[VSOCK_VQ_RX].mutex */
+	struct list_head send_pkt_list;	/* host->guest pending packets */
+	u32 total_tx_buf;
+
+	u32 guest_cid;
+};
+
+static u32 vhost_transport_get_local_cid(void)
+{
+	return VHOST_VSOCK_DEFAULT_HOST_CID;
+}
+
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+	struct vhost_vsock *vsock;
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_for_each_entry(vsock, &vhost_vsock_list, list) {
+		u32 other_cid = vsock->guest_cid;
+
+		/* Skip instances that have no CID yet */
+		if (other_cid == 0)
+			continue;
+
+		if (other_cid == guest_cid) {
+			mutex_unlock(&vhost_vsock_mutex);
+			return vsock;
+		}
+	}
+	mutex_unlock(&vhost_vsock_mutex);
+
+	return NULL;
+}
+
+static void
+vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
+			    struct vhost_virtqueue *vq)
+{
+	bool added = false;
+
+	mutex_lock(&vq->mutex);
+
+	/* Avoid further vmexits, we're already processing the virtqueue */
+	vhost_disable_notify(&vsock->dev, vq);
+
+	for (;;) {
+		struct virtio_vsock_pkt *pkt;
+		struct iov_iter iov_iter;
+		unsigned out, in;
+		size_t nbytes;
+		size_t len;
+		int head;
+
+		if (list_empty(&vsock->send_pkt_list)) {
+			vhost_enable_notify(&vsock->dev, vq);
+			break;
+		}
+
+		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head < 0)
+			break;
+
+		if (head == vq->num) {
+			/* We cannot finish yet if more buffers snuck in while
+			 * re-enabling notify.
+			 */
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				       struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+
+		if (out) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
+			break;
+		}
+
+		len = iov_length(&vq->iov[out], in);
+		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
+
+		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
+		if (nbytes != sizeof(pkt->hdr)) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Faulted on copying pkt hdr\n");
+			break;
+		}
+
+		nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
+		if (nbytes != pkt->len) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Faulted on copying pkt buf\n");
+			break;
+		}
+
+		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+		added = true;
+
+		vsock->total_tx_buf -= pkt->len;
+
+		virtio_transport_free_pkt(pkt);
+	}
+	if (added)
+		vhost_signal(&vsock->dev, vq);
+	mutex_unlock(&vq->mutex);
+
+	if (added)
+		wake_up(&vsock->send_wait);
+}
+
+static void vhost_transport_send_pkt_work(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_vsock *vsock;
+
+	vsock = container_of(work, struct vhost_vsock, send_pkt_work);
+	vq = &vsock->vqs[VSOCK_VQ_RX];
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int
+vhost_transport_send_one_pkt(struct vhost_vsock *vsock,
+			     struct virtio_vsock_pkt *pkt)
+{
+	struct vhost_virtqueue *vq = &vsock->vqs[VSOCK_VQ_RX];
+
+	/* Queue it up in vhost work */
+	mutex_lock(&vq->mutex);
+	list_add_tail(&pkt->list, &vsock->send_pkt_list);
+	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+	mutex_unlock(&vq->mutex);
+
+	return pkt->len;
+}
+
+static int
+vhost_transport_send_pkt_no_sock(struct virtio_vsock_pkt *pkt)
+{
+	struct vhost_vsock *vsock;
+
+	/* Find the vhost_vsock according to guest context id  */
+	vsock = vhost_vsock_get(le32_to_cpu(pkt->hdr.dst_cid));
+	if (!vsock) {
+		virtio_transport_free_pkt(pkt);
+		return -ENODEV;
+	}
+
+	return vhost_transport_send_one_pkt(vsock, pkt);
+}
+
+static int
+vhost_transport_send_pkt(struct vsock_sock *vsk,
+			 struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	struct virtio_vsock_sock *vvs;
+	struct virtio_vsock_pkt *pkt;
+	struct vhost_virtqueue *vq;
+	struct vhost_vsock *vsock;
+	u32 pkt_len = info->pkt_len;
+	DEFINE_WAIT(wait);
+
+	src_cid = vhost_transport_get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	if (!info->remote_cid) {
+		dst_cid	= vsk->remote_addr.svm_cid;
+		dst_port = vsk->remote_addr.svm_port;
+	} else {
+		dst_cid = info->remote_cid;
+		dst_port = info->remote_port;
+	}
+
+	/* Find the vhost_vsock according to guest context id  */
+	vsock = vhost_vsock_get(dst_cid);
+	if (!vsock)
+		return -ENODEV;
+
+	vvs = vsk->trans;
+	vq = &vsock->vqs[VSOCK_VQ_RX];
+
+	/* we can send less than pkt_len bytes */
+	if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+		pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+
+	/* virtio_transport_get_credit might return less than pkt_len credit */
+	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
+
+	/* Do not send zero length OP_RW pkt */
+	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+		return pkt_len;
+
+	/* Respect global tx buf limitation */
+	mutex_lock(&vq->mutex);
+	while (pkt_len + vsock->total_tx_buf > VIRTIO_VSOCK_MAX_TX_BUF_SIZE) {
+		prepare_to_wait_exclusive(&vsock->send_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+		mutex_unlock(&vq->mutex);
+		schedule();
+		mutex_lock(&vq->mutex);
+		finish_wait(&vsock->send_wait, &wait);
+	}
+	vsock->total_tx_buf += pkt_len;
+	mutex_unlock(&vq->mutex);
+
+	pkt = virtio_transport_alloc_pkt(info, pkt_len,
+					 src_cid, src_port,
+					 dst_cid, dst_port);
+	if (!pkt) {
+		mutex_lock(&vq->mutex);
+		vsock->total_tx_buf -= pkt_len;
+		mutex_unlock(&vq->mutex);
+		virtio_transport_put_credit(vvs, pkt_len);
+		wake_up(&vsock->send_wait);
+		return -ENOMEM;
+	}
+
+	virtio_transport_inc_tx_pkt(vvs, pkt);
+
+	return vhost_transport_send_one_pkt(vsock, pkt);
+}
+
+static struct virtio_vsock_pkt *
+vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
+		      unsigned int out, unsigned int in)
+{
+	struct virtio_vsock_pkt *pkt;
+	struct iov_iter iov_iter;
+	size_t nbytes;
+	size_t len;
+
+	if (in != 0) {
+		vq_err(vq, "Expected 0 input buffers, got %u\n", in);
+		return NULL;
+	}
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	len = iov_length(vq->iov, out);
+	iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
+
+	nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
+	if (nbytes != sizeof(pkt->hdr)) {
+		vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
+		       sizeof(pkt->hdr), nbytes);
+		kfree(pkt);
+		return NULL;
+	}
+
+	if (le16_to_cpu(pkt->hdr.type) == VIRTIO_VSOCK_TYPE_STREAM)
+		pkt->len = le32_to_cpu(pkt->hdr.len);
+
+	/* No payload */
+	if (!pkt->len)
+		return pkt;
+
+	/* The pkt is too big */
+	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
+	if (!pkt->buf) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
+	if (nbytes != pkt->len) {
+		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
+		       pkt->len, nbytes);
+		virtio_transport_free_pkt(pkt);
+		return NULL;
+	}
+
+	return pkt;
+}
+
+static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+	struct virtio_vsock_pkt *pkt;
+	int head;
+	unsigned int out, in;
+	bool added = false;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&vsock->dev, vq);
+	for (;;) {
+		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head < 0)
+			break;
+
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		pkt = vhost_vsock_alloc_pkt(vq, out, in);
+		if (!pkt) {
+			vq_err(vq, "Faulted on pkt\n");
+			continue;
+		}
+
+		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+		added = true;
+
+		/* Only accept correctly addressed packets */
+		if (le32_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid)
+			virtio_transport_recv_pkt(pkt);
+		else
+			virtio_transport_free_pkt(pkt);
+	}
+	if (added)
+		vhost_signal(&vsock->dev, vq);
+	mutex_unlock(&vq->mutex);
+}
+
+static void vhost_vsock_handle_rx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int vhost_vsock_start(struct vhost_vsock *vsock)
+{
+	size_t i;
+	int ret;
+
+	mutex_lock(&vsock->dev.mutex);
+
+	ret = vhost_dev_check_owner(&vsock->dev);
+	if (ret)
+		goto err;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+
+		if (!vhost_vq_access_ok(vq)) {
+			ret = -EFAULT;
+			mutex_unlock(&vq->mutex);
+			goto err_vq;
+		}
+
+		if (!vq->private_data) {
+			vq->private_data = vsock;
+			vhost_vq_init_access(vq);
+		}
+
+		mutex_unlock(&vq->mutex);
+	}
+
+	mutex_unlock(&vsock->dev.mutex);
+	return 0;
+
+err_vq:
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vq->private_data = NULL;
+		mutex_unlock(&vq->mutex);
+	}
+err:
+	mutex_unlock(&vsock->dev.mutex);
+	return ret;
+}
+
+static void vhost_vsock_stop(struct vhost_vsock *vsock)
+{
+	size_t i;
+
+	mutex_lock(&vsock->dev.mutex);
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vq->private_data = NULL;
+		mutex_unlock(&vq->mutex);
+	}
+
+	mutex_unlock(&vsock->dev.mutex);
+}
+
+static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
+{
+	struct vhost_virtqueue **vqs;
+	struct vhost_vsock *vsock;
+	int ret;
+
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
+	if (!vsock)
+		return -ENOMEM;
+
+	vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	vqs[VSOCK_VQ_TX] = &vsock->vqs[VSOCK_VQ_TX];
+	vqs[VSOCK_VQ_RX] = &vsock->vqs[VSOCK_VQ_RX];
+	vsock->vqs[VSOCK_VQ_TX].handle_kick = vhost_vsock_handle_tx_kick;
+	vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
+
+	vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs));
+
+	file->private_data = vsock;
+	init_waitqueue_head(&vsock->send_wait);
+	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_add_tail(&vsock->list, &vhost_vsock_list);
+	mutex_unlock(&vhost_vsock_mutex);
+	return 0;
+
+out:
+	kfree(vsock);
+	return ret;
+}
+
+static void vhost_vsock_flush(struct vhost_vsock *vsock)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
+		if (vsock->vqs[i].handle_kick)
+			vhost_poll_flush(&vsock->vqs[i].poll);
+	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+}
+
+static void vhost_vsock_reset_orphans(struct sock *sk)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+
+	lock_sock(sk);
+	if (!vhost_vsock_get(vsk->local_addr.svm_cid)) {
+		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_err = ECONNRESET;
+		sk->sk_error_report(sk);
+	}
+	release_sock(sk);
+}
+
+static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
+{
+	struct vhost_vsock *vsock = file->private_data;
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_del(&vsock->list);
+	mutex_unlock(&vhost_vsock_mutex);
+
+	/* Iterating over all connections for all CIDs to find orphans is
+	 * inefficient.  Room for improvement here. */
+	vsock_for_each_connected_socket(vhost_vsock_reset_orphans);
+
+	vhost_vsock_stop(vsock);
+	vhost_vsock_flush(vsock);
+	vhost_dev_stop(&vsock->dev);
+	vhost_dev_cleanup(&vsock->dev, false);
+	kfree(vsock->dev.vqs);
+	kfree(vsock);
+	return 0;
+}
+
+static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u32 guest_cid)
+{
+	struct vhost_vsock *other;
+
+	/* Refuse reserved CIDs */
+	if (guest_cid <= VMADDR_CID_HOST)
+		return -EINVAL;
+
+	/* Refuse if CID is already in use */
+	other = vhost_vsock_get(guest_cid);
+	if (other && other != vsock)
+		return -EADDRINUSE;
+
+	mutex_lock(&vhost_vsock_mutex);
+	vsock->guest_cid = guest_cid;
+	mutex_unlock(&vhost_vsock_mutex);
+
+	return 0;
+}
+
+static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
+{
+	struct vhost_virtqueue *vq;
+	int i;
+
+	if (features & ~VHOST_VSOCK_FEATURES)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&vsock->dev.mutex);
+	if ((features & (1 << VHOST_F_LOG_ALL)) &&
+	    !vhost_log_access_ok(&vsock->dev)) {
+		mutex_unlock(&vsock->dev.mutex);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+		mutex_lock(&vq->mutex);
+		vq->acked_features = features;
+		mutex_unlock(&vq->mutex);
+	}
+	mutex_unlock(&vsock->dev.mutex);
+	return 0;
+}
+
+static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
+				  unsigned long arg)
+{
+	struct vhost_vsock *vsock = f->private_data;
+	void __user *argp = (void __user *)arg;
+	u64 __user *featurep = argp;
+	u32 __user *cidp = argp;
+	u32 guest_cid;
+	u64 features;
+	int r;
+
+	switch (ioctl) {
+	case VHOST_VSOCK_SET_GUEST_CID:
+		if (get_user(guest_cid, cidp))
+			return -EFAULT;
+		return vhost_vsock_set_cid(vsock, guest_cid);
+	case VHOST_VSOCK_START:
+		return vhost_vsock_start(vsock);
+	case VHOST_GET_FEATURES:
+		features = VHOST_VSOCK_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		return vhost_vsock_set_features(vsock, features);
+	default:
+		mutex_lock(&vsock->dev.mutex);
+		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
+		if (r == -ENOIOCTLCMD)
+			r = vhost_vring_ioctl(&vsock->dev, ioctl, argp);
+		else
+			vhost_vsock_flush(vsock);
+		mutex_unlock(&vsock->dev.mutex);
+		return r;
+	}
+}
+
+static const struct file_operations vhost_vsock_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_vsock_dev_open,
+	.release        = vhost_vsock_dev_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_vsock_dev_ioctl,
+};
+
+static struct miscdevice vhost_vsock_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vhost-vsock",
+	.fops = &vhost_vsock_fops,
+};
+
+static struct virtio_transport vhost_transport = {
+	.transport = {
+		.get_local_cid            = vhost_transport_get_local_cid,
+
+		.init                     = virtio_transport_do_socket_init,
+		.destruct                 = virtio_transport_destruct,
+		.release                  = virtio_transport_release,
+		.connect                  = virtio_transport_connect,
+		.shutdown                 = virtio_transport_shutdown,
+
+		.dgram_enqueue            = virtio_transport_dgram_enqueue,
+		.dgram_dequeue            = virtio_transport_dgram_dequeue,
+		.dgram_bind               = virtio_transport_dgram_bind,
+		.dgram_allow              = virtio_transport_dgram_allow,
+
+		.stream_enqueue           = virtio_transport_stream_enqueue,
+		.stream_dequeue           = virtio_transport_stream_dequeue,
+		.stream_has_data          = virtio_transport_stream_has_data,
+		.stream_has_space         = virtio_transport_stream_has_space,
+		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+		.stream_is_active         = virtio_transport_stream_is_active,
+		.stream_allow             = virtio_transport_stream_allow,
+
+		.notify_poll_in           = virtio_transport_notify_poll_in,
+		.notify_poll_out          = virtio_transport_notify_poll_out,
+		.notify_recv_init         = virtio_transport_notify_recv_init,
+		.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+		.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+		.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+		.notify_send_init         = virtio_transport_notify_send_init,
+		.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+		.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+		.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+		.set_buffer_size          = virtio_transport_set_buffer_size,
+		.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+		.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+		.get_buffer_size          = virtio_transport_get_buffer_size,
+		.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+		.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+	},
+
+	.send_pkt = vhost_transport_send_pkt,
+	.send_pkt_no_sock = vhost_transport_send_pkt_no_sock,
+};
+
+static int __init vhost_vsock_init(void)
+{
+	int ret;
+
+	ret = vsock_core_init(&vhost_transport.transport);
+	if (ret < 0)
+		return ret;
+	return misc_register(&vhost_vsock_misc);
+}
+
+static void __exit vhost_vsock_exit(void)
+{
+	misc_deregister(&vhost_vsock_misc);
+	vsock_core_exit();
+}
+
+module_init(vhost_vsock_init);
+module_exit(vhost_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("vhost transport for vsock");
diff --git a/drivers/vhost/vsock.h b/drivers/vhost/vsock.h
new file mode 100644
index 0000000..173f9fc
--- /dev/null
+++ b/drivers/vhost/vsock.h
@@ -0,0 +1,5 @@
+#ifndef VHOST_VSOCK_H
+#define VHOST_VSOCK_H
+#define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u32)
+#define VHOST_VSOCK_START		_IO(VHOST_VIRTIO, 0x61)
+#endif
-- 
2.5.5


* [RFC v5 3/5] VSOCK: Introduce virtio_transport.ko
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

VM sockets virtio transport implementation.  This driver runs in the
guest.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v5:
 * Add transport reset event handling
 * Drop ctrl virtqueue
v4:
 * Add MAINTAINERS file entry
 * Drop short/long rx packets
 * checkpatch.pl cleanups
 * Clarify locking in struct virtio_vsock
 * Narrow local variable scopes as suggested by Alex Bennee
 * Call wake_up() after decrementing total_tx_buf to avoid deadlock
v2:
 * Fix total_tx_buf accounting
 * Add virtio_transport global mutex to prevent races
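
The total_tx_buf accounting mentioned under v2 and v4 can be modelled in user
space roughly as below. This is a hedged sketch, not the driver code: struct
model_tx and both helpers are hypothetical, the limit field stands in for the
transport's tx buffer cap, and the model returns false where the driver would
sleep on tx_wait. The wakeups counter marks the point where wake_up() would be
called, i.e. after the decrement, so a sleeping sender re-checks against the
updated total.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the tx accounting state. */
struct model_tx {
	uint32_t total_tx_buf;	/* bytes currently reserved for tx */
	uint32_t limit;		/* stand-in for the tx buffer cap */
	unsigned int wakeups;	/* where the driver would call wake_up() */
};

/* Reserve len bytes; false means the sender would have to wait. */
static bool model_tx_reserve(struct model_tx *tx, uint32_t len)
{
	/* Widen before adding so the comparison cannot wrap */
	if ((uint64_t)tx->total_tx_buf + len > tx->limit)
		return false;

	tx->total_tx_buf += len;
	return true;
}

/* Release len bytes, then signal waiters (decrement first, wake second). */
static void model_tx_release(struct model_tx *tx, uint32_t len)
{
	assert(len <= tx->total_tx_buf);
	tx->total_tx_buf -= len;
	tx->wakeups++;
}
```

If the release path skipped either step, a blocked sender could never see the
freed space, which is the deadlock the v4 note refers to.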
---
 MAINTAINERS                      |   1 +
 net/vmw_vsock/virtio_transport.c | 584 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 585 insertions(+)
 create mode 100644 net/vmw_vsock/virtio_transport.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fb9c47a..1ed7364 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11611,6 +11611,7 @@ S:	Maintained
 F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport_common.c
+F:	net/vmw_vsock/virtio_transport.c
 
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
new file mode 100644
index 0000000..45472e0
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport.c
@@ -0,0 +1,584 @@
+/*
+ * virtio transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Some of the code is taken from Gerd Hoffmann <kraxel@redhat.com>'s
+ * early virtio-vsock proof-of-concept bits.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+#include <net/sock.h>
+#include <linux/mutex.h>
+#include <net/af_vsock.h>
+
+static struct workqueue_struct *virtio_vsock_workqueue;
+static struct virtio_vsock *the_virtio_vsock;
+static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock);
+
+struct virtio_vsock {
+	struct virtio_device *vdev;
+	struct virtqueue *vqs[VSOCK_VQ_MAX];
+
+	/* Virtqueue processing is deferred to a workqueue */
+	struct work_struct tx_work;
+	struct work_struct rx_work;
+	struct work_struct event_work;
+
+	wait_queue_head_t tx_wait;	/* for waiting for tx resources */
+
+	/* The following fields are protected by tx_lock.  vqs[VSOCK_VQ_TX]
+	 * must be accessed with tx_lock held.
+	 */
+	struct mutex tx_lock;
+	u32 total_tx_buf;
+
+	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
+	 * must be accessed with rx_lock held.
+	 */
+	struct mutex rx_lock;
+	int rx_buf_nr;
+	int rx_buf_max_nr;
+
+	/* The following fields are protected by event_lock.
+	 * vqs[VSOCK_VQ_EVENT] must be accessed with event_lock held.
+	 */
+	struct mutex event_lock;
+	struct virtio_vsock_event event_list[8];
+
+	u32 guest_cid;
+};
+
+static struct virtio_vsock *virtio_vsock_get(void)
+{
+	return the_virtio_vsock;
+}
+
+static u32 virtio_transport_get_local_cid(void)
+{
+	struct virtio_vsock *vsock = virtio_vsock_get();
+
+	return vsock->guest_cid;
+}
+
+static int
+virtio_transport_send_one_pkt(struct virtio_vsock *vsock,
+			      struct virtio_vsock_pkt *pkt)
+{
+	struct scatterlist hdr, buf, *sgs[2];
+	int ret, in_sg = 0, out_sg = 0;
+	struct virtqueue *vq;
+	DEFINE_WAIT(wait);
+
+	vq = vsock->vqs[VSOCK_VQ_TX];
+
+	/* Put pkt in the virtqueue */
+	sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+	sgs[out_sg++] = &hdr;
+	if (pkt->buf) {
+		sg_init_one(&buf, pkt->buf, pkt->len);
+		sgs[out_sg++] = &buf;
+	}
+
+	mutex_lock(&vsock->tx_lock);
+	while ((ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt,
+					GFP_KERNEL)) < 0) {
+		prepare_to_wait_exclusive(&vsock->tx_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+		mutex_unlock(&vsock->tx_lock);
+		schedule();
+		mutex_lock(&vsock->tx_lock);
+		finish_wait(&vsock->tx_wait, &wait);
+	}
+	virtqueue_kick(vq);
+	mutex_unlock(&vsock->tx_lock);
+
+	return pkt->len;
+}
+
+static int
+virtio_transport_send_pkt_no_sock(struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock *vsock;
+
+	vsock = virtio_vsock_get();
+	if (!vsock) {
+		virtio_transport_free_pkt(pkt);
+		return -ENODEV;
+	}
+
+	return virtio_transport_send_one_pkt(vsock, pkt);
+}
+
+static int
+virtio_transport_send_pkt(struct vsock_sock *vsk,
+			  struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	struct virtio_vsock_sock *vvs;
+	struct virtio_vsock_pkt *pkt;
+	struct virtio_vsock *vsock;
+	u32 pkt_len = info->pkt_len;
+	DEFINE_WAIT(wait);
+
+	vsock = virtio_vsock_get();
+	if (!vsock)
+		return -ENODEV;
+
+	src_cid	= virtio_transport_get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	if (!info->remote_cid) {
+		dst_cid	= vsk->remote_addr.svm_cid;
+		dst_port = vsk->remote_addr.svm_port;
+	} else {
+		dst_cid = info->remote_cid;
+		dst_port = info->remote_port;
+	}
+
+	vvs = vsk->trans;
+
+	if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+		pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
+	/* Do not send zero length OP_RW pkt*/
+	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+		return pkt_len;
+
+	/* Respect global tx buf limitation */
+	mutex_lock(&vsock->tx_lock);
+	while (pkt_len + vsock->total_tx_buf > VIRTIO_VSOCK_MAX_TX_BUF_SIZE) {
+		prepare_to_wait_exclusive(&vsock->tx_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+		mutex_unlock(&vsock->tx_lock);
+		schedule();
+		mutex_lock(&vsock->tx_lock);
+		finish_wait(&vsock->tx_wait, &wait);
+	}
+	vsock->total_tx_buf += pkt_len;
+	mutex_unlock(&vsock->tx_lock);
+
+	pkt = virtio_transport_alloc_pkt(info, pkt_len,
+					 src_cid, src_port,
+					 dst_cid, dst_port);
+	if (!pkt) {
+		mutex_lock(&vsock->tx_lock);
+		vsock->total_tx_buf -= pkt_len;
+		mutex_unlock(&vsock->tx_lock);
+		virtio_transport_put_credit(vvs, pkt_len);
+		wake_up(&vsock->tx_wait);
+		return -ENOMEM;
+	}
+
+	virtio_transport_inc_tx_pkt(vvs, pkt);
+
+	return virtio_transport_send_one_pkt(vsock, pkt);
+}
+
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
+{
+	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	struct virtio_vsock_pkt *pkt;
+	struct scatterlist hdr, buf, *sgs[2];
+	struct virtqueue *vq;
+	int ret;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+
+	do {
+		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+		if (!pkt)
+			break;
+
+		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
+		if (!pkt->buf) {
+			virtio_transport_free_pkt(pkt);
+			break;
+		}
+
+		pkt->len = buf_len;
+
+		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+		sgs[0] = &hdr;
+
+		sg_init_one(&buf, pkt->buf, buf_len);
+		sgs[1] = &buf;
+		ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
+		if (ret) {
+			virtio_transport_free_pkt(pkt);
+			break;
+		}
+		vsock->rx_buf_nr++;
+	} while (vq->num_free);
+	if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
+		vsock->rx_buf_max_nr = vsock->rx_buf_nr;
+	virtqueue_kick(vq);
+}
+
+static void virtio_transport_send_pkt_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, tx_work);
+	struct virtqueue *vq;
+	bool added = false;
+
+	vq = vsock->vqs[VSOCK_VQ_TX];
+	mutex_lock(&vsock->tx_lock);
+	do {
+		struct virtio_vsock_pkt *pkt;
+		unsigned int len;
+
+		virtqueue_disable_cb(vq);
+		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
+			vsock->total_tx_buf -= pkt->len;
+			virtio_transport_free_pkt(pkt);
+			added = true;
+		}
+	} while (!virtqueue_enable_cb(vq));
+	mutex_unlock(&vsock->tx_lock);
+
+	if (added)
+		wake_up(&vsock->tx_wait);
+}
+
+static void virtio_transport_recv_pkt_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, rx_work);
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+	mutex_lock(&vsock->rx_lock);
+	do {
+		struct virtio_vsock_pkt *pkt;
+		unsigned int len;
+
+		virtqueue_disable_cb(vq);
+		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
+			vsock->rx_buf_nr--;
+
+			/* Drop short/long packets */
+			if (unlikely(len < sizeof(pkt->hdr) ||
+				     len > sizeof(pkt->hdr) + pkt->len)) {
+				virtio_transport_free_pkt(pkt);
+				continue;
+			}
+
+			pkt->len = len - sizeof(pkt->hdr);
+			virtio_transport_recv_pkt(pkt);
+		}
+	} while (!virtqueue_enable_cb(vq));
+
+	if (vsock->rx_buf_nr < vsock->rx_buf_max_nr / 2)
+		virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+}
+
+/* event_lock must be held */
+static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
+				       struct virtio_vsock_event *event)
+{
+	struct scatterlist sg;
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_EVENT];
+
+	sg_init_one(&sg, event, sizeof(*event));
+
+	return virtqueue_add_inbuf(vq, &sg, 1, event, GFP_KERNEL);
+}
+
+/* event_lock must be held */
+static void virtio_vsock_event_fill(struct virtio_vsock *vsock)
+{
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->event_list); i++) {
+		struct virtio_vsock_event *event = &vsock->event_list[i];
+
+		virtio_vsock_event_fill_one(vsock, event);
+	}
+
+	virtqueue_kick(vsock->vqs[VSOCK_VQ_EVENT]);
+}
+
+static void virtio_vsock_reset_sock(struct sock *sk)
+{
+	lock_sock(sk);
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = ECONNRESET;
+	sk->sk_error_report(sk);
+	release_sock(sk);
+}
+
+static void virtio_vsock_update_guest_cid(struct virtio_vsock *vsock)
+{
+	struct virtio_device *vdev = vsock->vdev;
+	u32 guest_cid;
+
+	vdev->config->get(vdev, offsetof(struct virtio_vsock_config, guest_cid),
+			  &guest_cid, sizeof(guest_cid));
+	vsock->guest_cid = le32_to_cpu(guest_cid);
+}
+
+/* event_lock must be held */
+static void virtio_vsock_event_handle(struct virtio_vsock *vsock,
+				      struct virtio_vsock_event *event)
+{
+	switch (le32_to_cpu(event->id)) {
+	case VIRTIO_VSOCK_EVENT_TRANSPORT_RESET:
+		virtio_vsock_update_guest_cid(vsock);
+		vsock_for_each_connected_socket(virtio_vsock_reset_sock);
+		break;
+	}
+}
+
+static void virtio_transport_event_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, event_work);
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_EVENT];
+
+	mutex_lock(&vsock->event_lock);
+
+	do {
+		struct virtio_vsock_event *event;
+		unsigned int len;
+
+		virtqueue_disable_cb(vq);
+		while ((event = virtqueue_get_buf(vq, &len)) != NULL) {
+			if (len == sizeof(*event))
+				virtio_vsock_event_handle(vsock, event);
+
+			virtio_vsock_event_fill_one(vsock, event);
+		}
+	} while (!virtqueue_enable_cb(vq));
+
+	virtqueue_kick(vsock->vqs[VSOCK_VQ_EVENT]);
+
+	mutex_unlock(&vsock->event_lock);
+}
+
+static void virtio_vsock_event_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->event_work);
+}
+
+static void virtio_vsock_tx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->tx_work);
+}
+
+static void virtio_vsock_rx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+}
+
+static struct virtio_transport virtio_transport = {
+	.transport = {
+		.get_local_cid            = virtio_transport_get_local_cid,
+
+		.init                     = virtio_transport_do_socket_init,
+		.destruct                 = virtio_transport_destruct,
+		.release                  = virtio_transport_release,
+		.connect                  = virtio_transport_connect,
+		.shutdown                 = virtio_transport_shutdown,
+
+		.dgram_bind               = virtio_transport_dgram_bind,
+		.dgram_dequeue            = virtio_transport_dgram_dequeue,
+		.dgram_enqueue            = virtio_transport_dgram_enqueue,
+		.dgram_allow              = virtio_transport_dgram_allow,
+
+		.stream_dequeue           = virtio_transport_stream_dequeue,
+		.stream_enqueue           = virtio_transport_stream_enqueue,
+		.stream_has_data          = virtio_transport_stream_has_data,
+		.stream_has_space         = virtio_transport_stream_has_space,
+		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+		.stream_is_active         = virtio_transport_stream_is_active,
+		.stream_allow             = virtio_transport_stream_allow,
+
+		.notify_poll_in           = virtio_transport_notify_poll_in,
+		.notify_poll_out          = virtio_transport_notify_poll_out,
+		.notify_recv_init         = virtio_transport_notify_recv_init,
+		.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+		.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+		.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+		.notify_send_init         = virtio_transport_notify_send_init,
+		.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+		.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+		.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+		.set_buffer_size          = virtio_transport_set_buffer_size,
+		.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+		.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+		.get_buffer_size          = virtio_transport_get_buffer_size,
+		.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+		.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+	},
+
+	.send_pkt		= virtio_transport_send_pkt,
+	.send_pkt_no_sock	= virtio_transport_send_pkt_no_sock,
+};
+
+static int virtio_vsock_probe(struct virtio_device *vdev)
+{
+	vq_callback_t *callbacks[] = {
+		virtio_vsock_rx_done,
+		virtio_vsock_tx_done,
+		virtio_vsock_event_done,
+	};
+	static const char * const names[] = {
+		"rx",
+		"tx",
+		"event",
+	};
+	struct virtio_vsock *vsock = NULL;
+	int ret;
+
+	ret = mutex_lock_interruptible(&the_virtio_vsock_mutex);
+	if (ret)
+		return ret;
+
+	/* Only one virtio-vsock device per guest is supported */
+	if (the_virtio_vsock) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
+	if (!vsock) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	vsock->vdev = vdev;
+
+	ret = vsock->vdev->config->find_vqs(vsock->vdev, VSOCK_VQ_MAX,
+					    vsock->vqs, callbacks, names);
+	if (ret < 0)
+		goto out;
+
+	virtio_vsock_update_guest_cid(vsock);
+
+	ret = vsock_core_init(&virtio_transport.transport);
+	if (ret < 0)
+		goto out_vqs;
+
+	vsock->rx_buf_nr = 0;
+	vsock->rx_buf_max_nr = 0;
+
+	vdev->priv = vsock;
+	the_virtio_vsock = vsock;
+	init_waitqueue_head(&vsock->tx_wait);
+	mutex_init(&vsock->tx_lock);
+	mutex_init(&vsock->rx_lock);
+	mutex_init(&vsock->event_lock);
+	INIT_WORK(&vsock->rx_work, virtio_transport_recv_pkt_work);
+	INIT_WORK(&vsock->tx_work, virtio_transport_send_pkt_work);
+	INIT_WORK(&vsock->event_work, virtio_transport_event_work);
+
+	mutex_lock(&vsock->rx_lock);
+	virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+
+	mutex_lock(&vsock->event_lock);
+	virtio_vsock_event_fill(vsock);
+	mutex_unlock(&vsock->event_lock);
+
+	mutex_unlock(&the_virtio_vsock_mutex);
+	return 0;
+
+out_vqs:
+	vsock->vdev->config->del_vqs(vsock->vdev);
+out:
+	kfree(vsock);
+	mutex_unlock(&the_virtio_vsock_mutex);
+	return ret;
+}
+
+static void virtio_vsock_remove(struct virtio_device *vdev)
+{
+	struct virtio_vsock *vsock = vdev->priv;
+
+	flush_work(&vsock->rx_work);
+	flush_work(&vsock->tx_work);
+	flush_work(&vsock->event_work);
+
+	vdev->config->reset(vdev);
+
+	mutex_lock(&the_virtio_vsock_mutex);
+	the_virtio_vsock = NULL;
+	vsock_core_exit();
+	mutex_unlock(&the_virtio_vsock_mutex);
+
+	vdev->config->del_vqs(vdev);
+
+	kfree(vsock);
+}
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_VSOCK, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static unsigned int features[] = {
+};
+
+static struct virtio_driver virtio_vsock_driver = {
+	.feature_table = features,
+	.feature_table_size = ARRAY_SIZE(features),
+	.driver.name = KBUILD_MODNAME,
+	.driver.owner = THIS_MODULE,
+	.id_table = id_table,
+	.probe = virtio_vsock_probe,
+	.remove = virtio_vsock_remove,
+};
+
+static int __init virtio_vsock_init(void)
+{
+	int ret;
+
+	virtio_vsock_workqueue = alloc_workqueue("virtio_vsock", 0, 0);
+	if (!virtio_vsock_workqueue)
+		return -ENOMEM;
+	ret = register_virtio_driver(&virtio_vsock_driver);
+	if (ret)
+		destroy_workqueue(virtio_vsock_workqueue);
+	return ret;
+}
+
+static void __exit virtio_vsock_exit(void)
+{
+	unregister_virtio_driver(&virtio_vsock_driver);
+	destroy_workqueue(virtio_vsock_workqueue);
+}
+
+module_init(virtio_vsock_init);
+module_exit(virtio_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("virtio transport for vsock");
+MODULE_DEVICE_TABLE(virtio, id_table);
-- 
2.5.5

^ permalink raw reply related
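
The short/long packet check in virtio_transport_recv_pkt_work() above
can be restated as a small userspace predicate.  This is a sketch
under assumptions: the DEMO_ names and both helpers are hypothetical,
and the header size of 36 bytes is what struct virtio_vsock_hdr adds
up to in this series (eight __le32 fields plus two __le16 fields).

```c
/* Userspace restatement of the rx length check; names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HDR_SIZE	36u	/* sizeof(struct virtio_vsock_hdr) here */
#define DEMO_RX_BUF	4096u	/* VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE */

/* used_len is the byte count the device reports for one rx buffer;
 * buf_len is the payload buffer size posted with that buffer.
 */
static bool rx_len_valid(uint32_t used_len, uint32_t buf_len)
{
	return used_len >= DEMO_HDR_SIZE &&
	       used_len <= DEMO_HDR_SIZE + buf_len;
}

/* Payload bytes carried by a buffer that passed rx_len_valid(). */
static uint32_t rx_payload_len(uint32_t used_len)
{
	assert(used_len >= DEMO_HDR_SIZE);
	return used_len - DEMO_HDR_SIZE;
}
```

A buffer that reports fewer bytes than the header, or more than the
header plus the posted payload buffer, is dropped before the payload
length is computed.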

* [RFC v5 2/5] VSOCK: Introduce virtio_vsock_common.ko
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

This module contains the common code and header files for the
virtio_transport and vhost_vsock kernel modules.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v5:
 * Add event virtqueue, struct virtio_vsock_event, and transport reset
   event
 * Reorder virtqueue indices: rx, tx, event
 * Drop unused virtqueue_pairs config field
 * Drop unused ctrl virtqueue
 * Switch to a free virtio device ID, the previous one was reserved
v4:
 * Add MAINTAINERS file entry
 * checkpatch.pl cleanups
 * linux_vsock.h: drop wrong copy-pasted license header
 * Move tx sock refcounting to virtio_transport_alloc/free_pkt() to fix
   leaks in error paths
 * Add send_pkt_no_sock() to send RST packets with no listen socket
 * Rename per-socket state from virtio_transport to virtio_vsock_sock
 * Move send_pkt_ops to new virtio_transport struct
 * Drop dumppkt function, packet capture will be added in the future
 * Drop empty virtio_transport_dec_tx_pkt()
 * Allow guest->host connections again
 * Use trace events instead of pr_debug()
v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
 * Only allow host->guest connections (same security model as latest
   VMware)
v2:
 * Fix peer_buf_alloc inheritance on child socket
 * Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
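
For anyone implementing the other side of the hardware interface, the
packet header that this patch adds in include/uapi/linux/virtio_vsock.h
can be mirrored in userspace as shown here.  This is a sketch, not the
kernel's code: the demo_ names are hypothetical, and fixed-width host
integers stand in for the __le32/__le16 types, with an explicit
little-endian encoder so the result does not depend on host byte order.

```c
/* Userspace mirror of the wire header; all demo_ names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HDR_SIZE 36

struct demo_vsock_hdr {
	uint32_t src_cid;
	uint32_t src_port;
	uint32_t dst_cid;
	uint32_t dst_port;
	uint32_t len;
	uint16_t type;		/* 1 == VIRTIO_VSOCK_TYPE_STREAM */
	uint16_t op;		/* enum virtio_vsock_op value */
	uint32_t flags;
	uint32_t buf_alloc;
	uint32_t fwd_cnt;
};

_Static_assert(sizeof(struct demo_vsock_hdr) == DEMO_HDR_SIZE,
	       "header is expected to have no padding");
_Static_assert(offsetof(struct demo_vsock_hdr, type) == 20,
	       "type follows five 32-bit fields");

static void demo_put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

static void demo_put_le32(uint8_t *p, uint32_t v)
{
	demo_put_le16(p, (uint16_t)(v & 0xffff));
	demo_put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Serialize h into out[] in little-endian wire order. */
static void demo_hdr_encode(const struct demo_vsock_hdr *h,
			    uint8_t out[DEMO_HDR_SIZE])
{
	assert(h != NULL && out != NULL);
	demo_put_le32(out + 0, h->src_cid);
	demo_put_le32(out + 4, h->src_port);
	demo_put_le32(out + 8, h->dst_cid);
	demo_put_le32(out + 12, h->dst_port);
	demo_put_le32(out + 16, h->len);
	demo_put_le16(out + 20, h->type);
	demo_put_le16(out + 22, h->op);
	demo_put_le32(out + 24, h->flags);
	demo_put_le32(out + 28, h->buf_alloc);
	demo_put_le32(out + 32, h->fwd_cnt);
}
```

The two compile-time checks document the layout assumption: five
32-bit fields, two 16-bit fields, then three more 32-bit fields, with
no padding in between.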
---
 MAINTAINERS                                        |  10 +
 include/linux/virtio_vsock.h                       | 167 ++++
 .../trace/events/vsock_virtio_transport_common.h   | 144 ++++
 include/uapi/linux/virtio_ids.h                    |   1 +
 include/uapi/linux/virtio_vsock.h                  |  94 +++
 net/vmw_vsock/virtio_transport_common.c            | 838 +++++++++++++++++++++
 6 files changed, 1254 insertions(+)
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/trace/events/vsock_virtio_transport_common.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

diff --git a/MAINTAINERS b/MAINTAINERS
index da3e4d8..fb9c47a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11602,6 +11602,16 @@ S:	Maintained
 F:	drivers/media/v4l2-core/videobuf2-*
 F:	include/media/videobuf2-*
 
+VIRTIO AND VHOST VSOCK DRIVER
+M:	Stefan Hajnoczi <stefanha@redhat.com>
+L:	kvm@vger.kernel.org
+L:	virtualization@lists.linux-foundation.org
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	include/linux/virtio_vsock.h
+F:	include/uapi/linux/virtio_vsock.h
+F:	net/vmw_vsock/virtio_transport_common.c
+
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
 S:	Maintained
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
new file mode 100644
index 0000000..4c3d8e6
--- /dev/null
+++ b/include/linux/virtio_vsock.h
@@ -0,0 +1,167 @@
+#ifndef _LINUX_VIRTIO_VSOCK_H
+#define _LINUX_VIRTIO_VSOCK_H
+
+#include <uapi/linux/virtio_vsock.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+#include <net/af_vsock.h>
+
+#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE	128
+#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE		(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE	(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
+#define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
+#define VIRTIO_VSOCK_MAX_TX_BUF_SIZE		(1024 * 1024 * 16)
+#define VIRTIO_VSOCK_MAX_DGRAM_SIZE		(1024 * 64)
+
+enum {
+	VSOCK_VQ_RX     = 0, /* for host to guest data */
+	VSOCK_VQ_TX     = 1, /* for guest to host data */
+	VSOCK_VQ_EVENT  = 2,
+	VSOCK_VQ_MAX    = 3,
+};
+
+/* Per-socket state (accessed via vsk->trans) */
+struct virtio_vsock_sock {
+	struct vsock_sock *vsk;
+
+	/* Protected by lock_sock(sk_vsock(trans->vsk)) */
+	u32 buf_size;
+	u32 buf_size_min;
+	u32 buf_size_max;
+
+	struct mutex tx_lock;
+	struct mutex rx_lock;
+
+	/* Protected by tx_lock */
+	u32 tx_cnt;
+	u32 buf_alloc;
+	u32 peer_fwd_cnt;
+	u32 peer_buf_alloc;
+
+	/* Protected by rx_lock */
+	u32 fwd_cnt;
+	u32 rx_bytes;
+	struct list_head rx_queue;
+};
+
+struct virtio_vsock_pkt {
+	struct virtio_vsock_hdr	hdr;
+	struct work_struct work;
+	struct list_head list;
+	void *buf;
+	u32 len;
+	u32 off;
+};
+
+struct virtio_vsock_pkt_info {
+	u32 remote_cid, remote_port;
+	struct msghdr *msg;
+	u32 pkt_len;
+	u16 type;
+	u16 op;
+	u32 flags;
+};
+
+struct virtio_transport {
+	/* This must be the first field */
+	struct vsock_transport transport;
+
+	/* Send packet for a specific socket */
+	int (*send_pkt)(struct vsock_sock *vsk,
+			struct virtio_vsock_pkt_info *info);
+
+	/* Send packet without a socket (e.g. RST).  Prefer send_pkt() over
+	 * send_pkt_no_sock() when a socket exists.
+	 */
+	int (*send_pkt_no_sock)(struct virtio_vsock_pkt *pkt);
+};
+
+struct virtio_vsock_pkt *
+virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
+			   size_t len,
+			   u32 src_cid,
+			   u32 src_port,
+			   u32 dst_cid,
+			   u32 dst_port);
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len,
+				int type);
+int
+virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk);
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				 struct vsock_sock *psk);
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk);
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vs, u64 val);
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now);
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_available_now);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
+bool virtio_transport_stream_allow(u32 cid, u32 port);
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr);
+bool virtio_transport_dgram_allow(u32 cid, u32 port);
+
+int virtio_transport_connect(struct vsock_sock *vsk);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
+
+void virtio_transport_release(struct vsock_sock *vsk);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len);
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct msghdr *msg,
+			       size_t len);
+
+void virtio_transport_destruct(struct vsock_sock *vsk);
+
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
+u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
+void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
+
+#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/include/trace/events/vsock_virtio_transport_common.h b/include/trace/events/vsock_virtio_transport_common.h
new file mode 100644
index 0000000..b7f1d62
--- /dev/null
+++ b/include/trace/events/vsock_virtio_transport_common.h
@@ -0,0 +1,144 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vsock
+
+#if !defined(_TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H) || \
+    defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H
+
+#include <linux/tracepoint.h>
+
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_TYPE_STREAM);
+
+#define show_type(val) \
+	__print_symbolic(val, { VIRTIO_VSOCK_TYPE_STREAM, "STREAM" })
+
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_INVALID);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_REQUEST);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RESPONSE);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RST);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_SHUTDOWN);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RW);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_CREDIT_UPDATE);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_CREDIT_REQUEST);
+
+#define show_op(val) \
+	__print_symbolic(val, \
+			 { VIRTIO_VSOCK_OP_INVALID, "INVALID" }, \
+			 { VIRTIO_VSOCK_OP_REQUEST, "REQUEST" }, \
+			 { VIRTIO_VSOCK_OP_RESPONSE, "RESPONSE" }, \
+			 { VIRTIO_VSOCK_OP_RST, "RST" }, \
+			 { VIRTIO_VSOCK_OP_SHUTDOWN, "SHUTDOWN" }, \
+			 { VIRTIO_VSOCK_OP_RW, "RW" }, \
+			 { VIRTIO_VSOCK_OP_CREDIT_UPDATE, "CREDIT_UPDATE" }, \
+			 { VIRTIO_VSOCK_OP_CREDIT_REQUEST, "CREDIT_REQUEST" })
+
+TRACE_EVENT(virtio_transport_alloc_pkt,
+	TP_PROTO(
+		 __u32 src_cid, __u32 src_port,
+		 __u32 dst_cid, __u32 dst_port,
+		 __u32 len,
+		 __u16 type,
+		 __u16 op,
+		 __u32 flags
+	),
+	TP_ARGS(
+		src_cid, src_port,
+		dst_cid, dst_port,
+		len,
+		type,
+		op,
+		flags
+	),
+	TP_STRUCT__entry(
+		__field(__u32, src_cid)
+		__field(__u32, src_port)
+		__field(__u32, dst_cid)
+		__field(__u32, dst_port)
+		__field(__u32, len)
+		__field(__u16, type)
+		__field(__u16, op)
+		__field(__u32, flags)
+	),
+	TP_fast_assign(
+		__entry->src_cid = src_cid;
+		__entry->src_port = src_port;
+		__entry->dst_cid = dst_cid;
+		__entry->dst_port = dst_port;
+		__entry->len = len;
+		__entry->type = type;
+		__entry->op = op;
+		__entry->flags = flags;
+	),
+	TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x",
+		  __entry->src_cid, __entry->src_port,
+		  __entry->dst_cid, __entry->dst_port,
+		  __entry->len,
+		  show_type(__entry->type),
+		  show_op(__entry->op),
+		  __entry->flags)
+);
+
+TRACE_EVENT(virtio_transport_recv_pkt,
+	TP_PROTO(
+		 __u32 src_cid, __u32 src_port,
+		 __u32 dst_cid, __u32 dst_port,
+		 __u32 len,
+		 __u16 type,
+		 __u16 op,
+		 __u32 flags,
+		 __u32 buf_alloc,
+		 __u32 fwd_cnt
+	),
+	TP_ARGS(
+		src_cid, src_port,
+		dst_cid, dst_port,
+		len,
+		type,
+		op,
+		flags,
+		buf_alloc,
+		fwd_cnt
+	),
+	TP_STRUCT__entry(
+		__field(__u32, src_cid)
+		__field(__u32, src_port)
+		__field(__u32, dst_cid)
+		__field(__u32, dst_port)
+		__field(__u32, len)
+		__field(__u16, type)
+		__field(__u16, op)
+		__field(__u32, flags)
+		__field(__u32, buf_alloc)
+		__field(__u32, fwd_cnt)
+	),
+	TP_fast_assign(
+		__entry->src_cid = src_cid;
+		__entry->src_port = src_port;
+		__entry->dst_cid = dst_cid;
+		__entry->dst_port = dst_port;
+		__entry->len = len;
+		__entry->type = type;
+		__entry->op = op;
+		__entry->flags = flags;
+		__entry->buf_alloc = buf_alloc;
+		__entry->fwd_cnt = fwd_cnt;
+	),
+	TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x "
+		  "buf_alloc=%u fwd_cnt=%u",
+		  __entry->src_cid, __entry->src_port,
+		  __entry->dst_cid, __entry->dst_port,
+		  __entry->len,
+		  show_type(__entry->type),
+		  show_op(__entry->op),
+		  __entry->flags,
+		  __entry->buf_alloc,
+		  __entry->fwd_cnt)
+);
+
+#endif /* _TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H */
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE vsock_virtio_transport_common
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 77925f5..3228d58 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -41,5 +41,6 @@
 #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
 #define VIRTIO_ID_GPU          16 /* virtio GPU */
 #define VIRTIO_ID_INPUT        18 /* virtio input */
+#define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 0000000..12946ab
--- /dev/null
+++ b/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,94 @@
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2013-2015
+ * Copyright (C) Asias He <asias@redhat.com>, 2013
+ * Copyright (C) Stefan Hajnoczi <stefanha@redhat.com>, 2015
+ */
+
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VSOCK_H
+
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+
+struct virtio_vsock_config {
+	__le32 guest_cid;
+};
+
+enum virtio_vsock_event_id {
+	VIRTIO_VSOCK_EVENT_TRANSPORT_RESET = 0,
+};
+
+struct virtio_vsock_event {
+	__le32 id;
+};
+
+struct virtio_vsock_hdr {
+	__le32	src_cid;
+	__le32	src_port;
+	__le32	dst_cid;
+	__le32	dst_port;
+	__le32	len;
+	__le16	type;		/* enum virtio_vsock_type */
+	__le16	op;		/* enum virtio_vsock_op */
+	__le32	flags;
+	__le32	buf_alloc;
+	__le32	fwd_cnt;
+};
+
+enum virtio_vsock_type {
+	VIRTIO_VSOCK_TYPE_STREAM = 1,
+};
+
+enum virtio_vsock_op {
+	VIRTIO_VSOCK_OP_INVALID = 0,
+
+	/* Connect operations */
+	VIRTIO_VSOCK_OP_REQUEST = 1,
+	VIRTIO_VSOCK_OP_RESPONSE = 2,
+	VIRTIO_VSOCK_OP_RST = 3,
+	VIRTIO_VSOCK_OP_SHUTDOWN = 4,
+
+	/* To send payload */
+	VIRTIO_VSOCK_OP_RW = 5,
+
+	/* Tell the peer our credit info */
+	VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6,
+	/* Request the peer to send the credit info to us */
+	VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7,
+};
+
+/* VIRTIO_VSOCK_OP_SHUTDOWN flags values */
+enum virtio_vsock_shutdown {
+	VIRTIO_VSOCK_SHUTDOWN_RCV = 1,
+	VIRTIO_VSOCK_SHUTDOWN_SEND = 2,
+};
+
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
new file mode 100644
index 0000000..5b9e202
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -0,0 +1,838 @@
+/*
+ * common code for virtio vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/list.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+
+#include <net/sock.h>
+#include <net/af_vsock.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/vsock_virtio_transport_common.h>
+
+static const struct virtio_transport *virtio_transport_get_ops(void)
+{
+	const struct vsock_transport *t = vsock_core_get_transport();
+
+	return container_of(t, struct virtio_transport, transport);
+}
+
+static int virtio_transport_send_pkt(struct vsock_sock *vsk,
+				     struct virtio_vsock_pkt_info *info)
+{
+	return virtio_transport_get_ops()->send_pkt(vsk, info);
+}
+
+static int virtio_transport_send_pkt_no_sock(struct virtio_vsock_pkt *pkt)
+{
+	return virtio_transport_get_ops()->send_pkt_no_sock(pkt);
+}
+
+struct virtio_vsock_pkt *
+virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
+			   size_t len,
+			   u32 src_cid,
+			   u32 src_port,
+			   u32 dst_cid,
+			   u32 dst_port)
+{
+	struct virtio_vsock_pkt *pkt;
+	int err;
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	pkt->hdr.type		= cpu_to_le16(info->type);
+	pkt->hdr.op		= cpu_to_le16(info->op);
+	pkt->hdr.src_cid	= cpu_to_le32(src_cid);
+	pkt->hdr.src_port	= cpu_to_le32(src_port);
+	pkt->hdr.dst_cid	= cpu_to_le32(dst_cid);
+	pkt->hdr.dst_port	= cpu_to_le32(dst_port);
+	pkt->hdr.flags		= cpu_to_le32(info->flags);
+	pkt->len		= len;
+	pkt->hdr.len		= cpu_to_le32(len);
+
+	if (info->msg && len > 0) {
+		pkt->buf = kmalloc(len, GFP_KERNEL);
+		if (!pkt->buf)
+			goto out_pkt;
+		err = memcpy_from_msg(pkt->buf, info->msg, len);
+		if (err)
+			goto out;
+	}
+
+	trace_virtio_transport_alloc_pkt(src_cid, src_port,
+					 dst_cid, dst_port,
+					 len,
+					 info->type,
+					 info->op,
+					 info->flags);
+
+	return pkt;
+
+out:
+	kfree(pkt->buf);
+out_pkt:
+	kfree(pkt);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_alloc_pkt);
+
+static void virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
+					struct virtio_vsock_pkt *pkt)
+{
+	vvs->rx_bytes += pkt->len;
+}
+
+static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
+					struct virtio_vsock_pkt *pkt)
+{
+	vvs->rx_bytes -= pkt->len;
+	vvs->fwd_cnt += pkt->len;
+}
+
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
+{
+	mutex_lock(&vvs->tx_lock);
+	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
+	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
+	mutex_unlock(&vvs->tx_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
+
+u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
+{
+	u32 ret;
+
+	mutex_lock(&vvs->tx_lock);
+	ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
+	if (ret > credit)
+		ret = credit;
+	vvs->tx_cnt += ret;
+	mutex_unlock(&vvs->tx_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_credit);
+
+void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit)
+{
+	mutex_lock(&vvs->tx_lock);
+	vvs->tx_cnt -= credit;
+	mutex_unlock(&vvs->tx_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_put_credit);
+
+static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
+					       int type,
+					       struct virtio_vsock_hdr *hdr)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
+		.type = type,
+	};
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+
+static ssize_t
+virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
+				   struct msghdr *msg,
+				   size_t len)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_pkt *pkt;
+	size_t bytes, total = 0;
+	int err = -EFAULT;
+
+	mutex_lock(&vvs->rx_lock);
+	while (total < len &&
+	       vvs->rx_bytes > 0 &&
+	       !list_empty(&vvs->rx_queue)) {
+		pkt = list_first_entry(&vvs->rx_queue,
+				       struct virtio_vsock_pkt, list);
+
+		bytes = len - total;
+		if (bytes > pkt->len - pkt->off)
+			bytes = pkt->len - pkt->off;
+
+		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
+		if (err)
+			goto out;
+		total += bytes;
+		pkt->off += bytes;
+		if (pkt->off == pkt->len) {
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			list_del(&pkt->list);
+			virtio_transport_free_pkt(pkt);
+		}
+	}
+	mutex_unlock(&vvs->rx_lock);
+
+	/* Send a credit pkt to peer */
+	virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
+					    NULL);
+
+	return total;
+
+out:
+	mutex_unlock(&vvs->rx_lock);
+	if (total)
+		err = total;
+	return err;
+}
+
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len, int flags)
+{
+	if (flags & MSG_PEEK)
+		return -EOPNOTSUPP;
+
+	return virtio_transport_stream_do_dequeue(vsk, msg, len);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);
+
+int
+virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	mutex_lock(&vvs->rx_lock);
+	bytes = vvs->rx_bytes;
+	mutex_unlock(&vvs->rx_lock);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_data);
+
+static s64 virtio_transport_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	bytes = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
+	if (bytes < 0)
+		bytes = 0;
+
+	return bytes;
+}
+
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	mutex_lock(&vvs->tx_lock);
+	bytes = virtio_transport_has_space(vsk);
+	mutex_unlock(&vvs->tx_lock);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_space);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				    struct vsock_sock *psk)
+{
+	struct virtio_vsock_sock *vvs;
+
+	vvs = kzalloc(sizeof(*vvs), GFP_KERNEL);
+	if (!vvs)
+		return -ENOMEM;
+
+	vsk->trans = vvs;
+	vvs->vsk = vsk;
+	if (psk) {
+		struct virtio_vsock_sock *ptrans = psk->trans;
+
+		vvs->buf_size	= ptrans->buf_size;
+		vvs->buf_size_min = ptrans->buf_size_min;
+		vvs->buf_size_max = ptrans->buf_size_max;
+		vvs->peer_buf_alloc = ptrans->peer_buf_alloc;
+	} else {
+		vvs->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
+		vvs->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
+		vvs->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
+	}
+
+	vvs->buf_alloc = vvs->buf_size;
+
+	mutex_init(&vvs->rx_lock);
+	mutex_init(&vvs->tx_lock);
+	INIT_LIST_HEAD(&vvs->rx_queue);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_do_socket_init);
+
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_buffer_size);
+
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size_min;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_min_buffer_size);
+
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size_max;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
+
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val < vvs->buf_size_min)
+		vvs->buf_size_min = val;
+	if (val > vvs->buf_size_max)
+		vvs->buf_size_max = val;
+	vvs->buf_size = val;
+	vvs->buf_alloc = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
+
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val > vvs->buf_size)
+		vvs->buf_size = val;
+	vvs->buf_size_min = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_min_buffer_size);
+
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val < vvs->buf_size)
+		vvs->buf_size = val;
+	vvs->buf_size_max = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_max_buffer_size);
+
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now)
+{
+	if (vsock_stream_has_data(vsk))
+		*data_ready_now = true;
+	else
+		*data_ready_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_in);
+
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_avail_now)
+{
+	s64 free_space;
+
+	free_space = vsock_stream_has_space(vsk);
+	if (free_space > 0)
+		*space_avail_now = true;
+	else if (free_space == 0)
+		*space_avail_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_out);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_init);
+
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_block);
+
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_dequeue);
+
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_post_dequeue);
+
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_init);
+
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_block);
+
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_enqueue);
+
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_rcvhiwat);
+
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
+
+bool virtio_transport_stream_allow(u32 cid, u32 port)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
+
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
+
+bool virtio_transport_dgram_allow(u32 cid, u32 port)
+{
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
+
+int virtio_transport_connect(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_REQUEST,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+	};
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_connect);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_SHUTDOWN,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.flags = (mode & RCV_SHUTDOWN ?
+			  VIRTIO_VSOCK_SHUTDOWN_RCV : 0) |
+			 (mode & SEND_SHUTDOWN ?
+			  VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
+	};
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
+
+void virtio_transport_release(struct vsock_sock *vsk)
+{
+	struct sock *sk = &vsk->sk;
+
+	/* Tell other side to terminate connection */
+	if (sk->sk_type == SOCK_STREAM &&
+	    vsk->peer_shutdown != SHUTDOWN_MASK &&
+	    sk->sk_state == SS_CONNECTED)
+		(void)virtio_transport_shutdown(vsk, SHUTDOWN_MASK);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_release);
+
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct msghdr *msg,
+			       size_t dgram_len)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.msg = msg,
+		.pkt_len = len,
+	};
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_enqueue);
+
+void virtio_transport_destruct(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	kfree(vvs);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_destruct);
+
+static int virtio_transport_send_reset(struct vsock_sock *vsk,
+				       struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RST,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+	};
+
+	/* Send RST only if the original pkt is not a RST pkt */
+	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+		return 0;
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+
+/* Normally packets are associated with a socket.  There may be no socket if an
+ * attempt was made to connect to a socket that does not exist.
+ */
+static int virtio_transport_send_reset_no_sock(struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RST,
+		.type = le16_to_cpu(pkt->hdr.type),
+	};
+
+	/* Send RST only if the original pkt is not a RST pkt */
+	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+		return 0;
+
+	pkt = virtio_transport_alloc_pkt(&info, 0,
+					 le32_to_cpu(pkt->hdr.dst_cid),
+					 le32_to_cpu(pkt->hdr.dst_port),
+					 le32_to_cpu(pkt->hdr.src_cid),
+					 le32_to_cpu(pkt->hdr.src_port));
+	if (!pkt)
+		return -ENOMEM;
+
+	return virtio_transport_send_pkt_no_sock(pkt);
+}
+
+static int
+virtio_transport_recv_connecting(struct sock *sk,
+				 struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	int err;
+	int skerr;
+
+	switch (le16_to_cpu(pkt->hdr.op)) {
+	case VIRTIO_VSOCK_OP_RESPONSE:
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		vsock_insert_connected(vsk);
+		sk->sk_state_change(sk);
+		break;
+	case VIRTIO_VSOCK_OP_INVALID:
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		skerr = ECONNRESET;
+		err = 0;
+		goto destroy;
+	default:
+		skerr = EPROTO;
+		err = -EINVAL;
+		goto destroy;
+	}
+	return 0;
+
+destroy:
+	virtio_transport_send_reset(vsk, pkt);
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = skerr;
+	sk->sk_error_report(sk);
+	return err;
+}
+
+static int
+virtio_transport_recv_connected(struct sock *sk,
+				struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	int err = 0;
+
+	switch (le16_to_cpu(pkt->hdr.op)) {
+	case VIRTIO_VSOCK_OP_RW:
+		pkt->len = le32_to_cpu(pkt->hdr.len);
+		pkt->off = 0;
+
+		mutex_lock(&vvs->rx_lock);
+		virtio_transport_inc_rx_pkt(vvs, pkt);
+		list_add_tail(&pkt->list, &vvs->rx_queue);
+		mutex_unlock(&vvs->rx_lock);
+
+		sk->sk_data_ready(sk);
+		return err;
+	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
+		sk->sk_write_space(sk);
+		break;
+	case VIRTIO_VSOCK_OP_SHUTDOWN:
+		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
+			vsk->peer_shutdown |= RCV_SHUTDOWN;
+		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
+			vsk->peer_shutdown |= SEND_SHUTDOWN;
+		if (vsk->peer_shutdown == SHUTDOWN_MASK &&
+		    vsock_stream_has_data(vsk) <= 0)
+			sk->sk_state = SS_DISCONNECTING;
+		if (le32_to_cpu(pkt->hdr.flags))
+			sk->sk_state_change(sk);
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		sock_set_flag(sk, SOCK_DONE);
+		vsk->peer_shutdown = SHUTDOWN_MASK;
+		if (vsock_stream_has_data(vsk) <= 0)
+			sk->sk_state = SS_DISCONNECTING;
+		sk->sk_state_change(sk);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	virtio_transport_free_pkt(pkt);
+	return err;
+}
+
+static int
+virtio_transport_send_response(struct vsock_sock *vsk,
+			       struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RESPONSE,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.remote_cid = le32_to_cpu(pkt->hdr.src_cid),
+		.remote_port = le32_to_cpu(pkt->hdr.src_port),
+	};
+
+	return virtio_transport_send_pkt(vsk, &info);
+}
+
+/* Handle server socket */
+static int
+virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vsock_sock *vchild;
+	struct sock *child;
+
+	if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
+		virtio_transport_send_reset(vsk, pkt);
+		return -EINVAL;
+	}
+
+	if (sk_acceptq_is_full(sk)) {
+		virtio_transport_send_reset(vsk, pkt);
+		return -ENOMEM;
+	}
+
+	child = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
+			       sk->sk_type, 0);
+	if (!child) {
+		virtio_transport_send_reset(vsk, pkt);
+		return -ENOMEM;
+	}
+
+	sk->sk_ack_backlog++;
+
+	lock_sock(child);
+
+	child->sk_state = SS_CONNECTED;
+
+	vchild = vsock_sk(child);
+	vsock_addr_init(&vchild->local_addr, le32_to_cpu(pkt->hdr.dst_cid),
+			le32_to_cpu(pkt->hdr.dst_port));
+	vsock_addr_init(&vchild->remote_addr, le32_to_cpu(pkt->hdr.src_cid),
+			le32_to_cpu(pkt->hdr.src_port));
+
+	vsock_insert_connected(vchild);
+	vsock_enqueue_accept(sk, child);
+	virtio_transport_send_response(vchild, pkt);
+
+	release_sock(child);
+
+	sk->sk_data_ready(sk);
+	return 0;
+}
+
+static void virtio_transport_space_update(struct sock *sk,
+					  struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	bool space_available;
+
+	/* buf_alloc and fwd_cnt are always included in the hdr */
+	mutex_lock(&vvs->tx_lock);
+	vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
+	vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
+	space_available = virtio_transport_has_space(vsk);
+	mutex_unlock(&vvs->tx_lock);
+
+	if (space_available)
+		sk->sk_write_space(sk);
+}
+
+/* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
+ * lock.
+ */
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
+{
+	struct sockaddr_vm src, dst;
+	struct vsock_sock *vsk;
+	struct sock *sk;
+
+	vsock_addr_init(&src, le32_to_cpu(pkt->hdr.src_cid),
+			le32_to_cpu(pkt->hdr.src_port));
+	vsock_addr_init(&dst, le32_to_cpu(pkt->hdr.dst_cid),
+			le32_to_cpu(pkt->hdr.dst_port));
+
+	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
+					dst.svm_cid, dst.svm_port,
+					le32_to_cpu(pkt->hdr.len),
+					le16_to_cpu(pkt->hdr.type),
+					le16_to_cpu(pkt->hdr.op),
+					le32_to_cpu(pkt->hdr.flags),
+					le32_to_cpu(pkt->hdr.buf_alloc),
+					le32_to_cpu(pkt->hdr.fwd_cnt));
+
+	if (le16_to_cpu(pkt->hdr.type) != VIRTIO_VSOCK_TYPE_STREAM) {
+		(void)virtio_transport_send_reset_no_sock(pkt);
+		goto free_pkt;
+	}
+
+	/* The socket must be in connected or bound table
+	 * otherwise send reset back
+	 */
+	sk = vsock_find_connected_socket(&src, &dst);
+	if (!sk) {
+		sk = vsock_find_bound_socket(&dst);
+		if (!sk) {
+			(void)virtio_transport_send_reset_no_sock(pkt);
+			goto free_pkt;
+		}
+	}
+
+	vsk = vsock_sk(sk);
+
+	virtio_transport_space_update(sk, pkt);
+
+	lock_sock(sk);
+
+	/* Update CID in case it has changed after a transport reset event */
+	vsk->local_addr.svm_cid = dst.svm_cid;
+
+	switch (sk->sk_state) {
+	case VSOCK_SS_LISTEN:
+		virtio_transport_recv_listen(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTING:
+		virtio_transport_recv_connecting(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTED:
+		virtio_transport_recv_connected(sk, pkt);
+		break;
+	default:
+		virtio_transport_free_pkt(pkt);
+		break;
+	}
+	release_sock(sk);
+
+	/* Release refcnt obtained when we fetched this socket out of the
+	 * bound or connected list.
+	 */
+	sock_put(sk);
+	return;
+
+free_pkt:
+	virtio_transport_free_pkt(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
+
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
+{
+	kfree(pkt->buf);
+	kfree(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("common code for virtio vsock");
-- 
2.5.5


* [RFC v5 1/5] VSOCK: transport-specific vsock_transport functions
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Greg Kurz,
	virtualization, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>

struct vsock_transport contains function pointers called by AF_VSOCK
core code.  The transport may want its own transport-specific function
pointers and they can be added after struct vsock_transport.

Allow the transport to fetch vsock_transport.  It can downcast it to
access transport-specific function pointers.

The virtio transport will use this.
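
To illustrate the downcast, here is a minimal userspace sketch of the
pattern.  It is not kernel code: struct core_transport stands in for
struct vsock_transport, and every demo_* name is hypothetical.  The only
point is that the generic table is embedded as the first part of a larger
transport-specific table, and a container_of()-style macro recovers the
outer structure from the pointer the core returns.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the core's generic ops table (hypothetical; the real
 * struct vsock_transport has many more members).
 */
struct core_transport {
	int (*connect)(int fd);
};

/* Hypothetical transport-specific table: generic ops first, extras after. */
struct demo_transport {
	struct core_transport transport;
	int (*send_pkt)(int len);
};

/* Userspace approximation of the kernel's container_of(), without the
 * type checking the kernel macro performs.
 */
#define demo_container_of(ptr, type, member) \
	((type *)((const char *)(ptr) - offsetof(type, member)))

static int demo_connect(int fd) { return fd; }
static int demo_send_pkt(int len) { return len * 2; }

static const struct demo_transport demo_ops = {
	.transport = { .connect = demo_connect },
	.send_pkt  = demo_send_pkt,
};

/* What the core hands back: only the generic part is visible. */
static const struct core_transport *demo_core_get_transport(void)
{
	return &demo_ops.transport;
}

/* Downcast to reach the transport-specific function pointers. */
static const struct demo_transport *demo_get_ops(void)
{
	const struct core_transport *t = demo_core_get_transport();

	assert(t != NULL);
	return demo_container_of(t, const struct demo_transport, transport);
}
```

A caller would then write demo_get_ops()->send_pkt(len).  The downcast is
only valid if the registered table really is embedded in the larger
structure, which is why only the transport itself should perform it.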

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h   | 3 +++
 net/vmw_vsock/af_vsock.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index e9eb2d6..23f5525 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -165,6 +165,9 @@ static inline int vsock_core_init(const struct vsock_transport *t)
 }
 void vsock_core_exit(void);
 
+/* The transport may downcast this to access transport-specific functions */
+const struct vsock_transport *vsock_core_get_transport(void);
+
 /**** UTILS ****/
 
 void vsock_release_pending(struct sock *pending);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index bbe65dc..1e5f5ed 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1987,6 +1987,15 @@ void vsock_core_exit(void)
 }
 EXPORT_SYMBOL_GPL(vsock_core_exit);
 
+const struct vsock_transport *vsock_core_get_transport(void)
+{
+	/* vsock_register_mutex not taken since only the transport uses this
+	 * function and only while registered.
+	 */
+	return transport;
+}
+EXPORT_SYMBOL_GPL(vsock_core_get_transport);
+
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMware Virtual Socket Family");
 MODULE_VERSION("1.0.1.0-k");
-- 
2.5.5


* [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Stefan Hajnoczi @ 2016-04-01 14:23 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, Matt Benjamin, Greg Kurz,
	virtualization, Christoffer Dall

This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).

I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
first I want to share the latest version of the code.  Several people are
playing with vsock now so sharing the latest code should avoid duplicate work.

v5:
 * Transport reset event for live migration support
 * Reorder virtqueues, drop unused ctrl virtqueue
 * Switch to a free virtio device ID
 * More small changes, see patches for individual items

v4:
 * Addressed code review comments from Alex Bennee
 * MAINTAINERS file entries for new files
 * Trace events instead of pr_debug()
 * RST packet is sent when there is no listen socket
 * Allow guest->host connections again (began discussing netfilter support with
   Matt Benjamin instead of hard-coding security policy in virtio-vsock code)
 * Many checkpatch.pl cleanups (will be 100% clean in v5)

v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
   (also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
 * Only allow host->guest connections (same security model as latest
   VMware)
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann <arnd@arndb.de>)
 * Remove unneeded variable used to store return value
   (Fengguang Wu <fengguang.wu@intel.com> and Julia Lawall
   <julia.lawall@lip6.fr>)

v2:
 * Rebased onto Linux v4.4-rc2
 * vhost: Refuse to assign reserved CIDs
 * vhost: Refuse guest CID if already in use
 * vhost: Only accept correctly addressed packets (no spoofing!)
 * vhost: Support flexible rx/tx descriptor layout
 * vhost: Add missing total_tx_buf decrement
 * virtio_transport: Fix total_tx_buf accounting
 * virtio_transport: Add virtio_transport global mutex to prevent races
 * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
 * common: Fix peer_buf_alloc inheritance on child socket

This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
AF_VSOCK is designed for communication between virtual machines and
hypervisors.  It is currently only implemented for VMware's VMCI transport.

Most of the work was done by Asias He and Gerd Hoffmann a while back.  I have
picked up the series again.

The QEMU userspace changes are here:
https://github.com/stefanha/qemu/commits/vsock

Why virtio-vsock?
-----------------
Guest<->host communication is currently done over the virtio-serial device.
This makes it hard to port sockets API-based applications and is limited to
static ports.

virtio-vsock uses the sockets API so that applications can rely on familiar
SOCK_STREAM semantics.  Applications on the host can easily connect to guest
agents because the sockets API allows multiple connections to a listen socket
(unlike virtio-serial).  This simplifies the guest<->host communication and
eliminates the need for extra processes on the host to arbitrate virtio-serial
ports.

Overview
--------
This series adds 3 pieces:

1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko

2. virtio_transport.ko - guest driver

3. drivers/vhost/vsock.ko - host driver

Howto
-----
The following kernel options are needed:
  CONFIG_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS_COMMON=y
  CONFIG_VHOST_VSOCK=m

Launch QEMU as follows:
  # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3

Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
is 2 and the guest must be assigned a CID (3 in the example above).
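
As a rough illustration, a host-side client for the setup above might look
like the sketch below.  It assumes a userspace build with
<linux/vm_sockets.h> available and a guest agent already listening; the
demo_* helpers are names made up for this example, and the port is chosen
by the caller (nothing in this series assigns one).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Fill an AF_VSOCK address.  With the QEMU command line above the guest
 * CID is 3; the host is VMADDR_CID_HOST (2).
 */
static void demo_vsock_addr(struct sockaddr_vm *addr,
			    unsigned int cid, unsigned int port)
{
	assert(addr != NULL);
	memset(addr, 0, sizeof(*addr));
	addr->svm_family = AF_VSOCK;
	addr->svm_cid = cid;
	addr->svm_port = port;
}

/* Connect a SOCK_STREAM vsock socket to cid:port.
 * Returns the connected fd, or -1 with errno set.
 */
static int demo_vsock_connect(unsigned int cid, unsigned int port)
{
	struct sockaddr_vm addr;
	int fd;

	demo_vsock_addr(&addr, cid, port);

	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		int saved = errno;	/* close() may overwrite errno */

		close(fd);
		errno = saved;
		return -1;
	}

	return fd;
}
```

On the host, demo_vsock_connect(3, port) would reach the guest from the
example; the returned fd is used with the usual read()/write()/close()
calls.  A guest-side listener would use the same address structure with
bind(), listen() and accept().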

Asias He (4):
  VSOCK: Introduce virtio_vsock_common.ko
  VSOCK: Introduce virtio_transport.ko
  VSOCK: Introduce vhost_vsock.ko
  VSOCK: Add Makefile and Kconfig

Stefan Hajnoczi (1):
  VSOCK: transport-specific vsock_transport functions

 MAINTAINERS                                        |  13 +
 drivers/vhost/Kconfig                              |  15 +
 drivers/vhost/Makefile                             |   4 +
 drivers/vhost/vsock.c                              | 694 +++++++++++++++++
 drivers/vhost/vsock.h                              |   5 +
 include/linux/virtio_vsock.h                       | 167 ++++
 include/net/af_vsock.h                             |   3 +
 .../trace/events/vsock_virtio_transport_common.h   | 144 ++++
 include/uapi/linux/virtio_ids.h                    |   1 +
 include/uapi/linux/virtio_vsock.h                  |  94 +++
 net/vmw_vsock/Kconfig                              |  19 +
 net/vmw_vsock/Makefile                             |   2 +
 net/vmw_vsock/af_vsock.c                           |   9 +
 net/vmw_vsock/virtio_transport.c                   | 584 ++++++++++++++
 net/vmw_vsock/virtio_transport_common.c            | 838 +++++++++++++++++++++
 15 files changed, 2592 insertions(+)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/trace/events/vsock_virtio_transport_common.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport.c
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

-- 
2.5.5


* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Vlastimil Babka @ 2016-04-01 12:58 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, virtualization,
	bfields, linux-mm, Gioh Kim, Mel Gorman, Sangseok Lee, jlayton,
	Naoya Horiguchi, Joonsoo Kim, koct9i, Al Viro
In-Reply-To: <1459321935-3655-2-git-send-email-minchan@kernel.org>

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> Procedure of page migration is as follows:
>
> First of all, it should isolate a page from LRU and try to
> migrate the page. If it is successful, it releases the page
> for freeing. Otherwise, it should put the page back to LRU
> list.
>
> For LRU pages, we have used putback_lru_page for both freeing
> and putback to LRU list. It's okay because put_page is aware of
> LRU list so if it releases last refcount of the page, it removes
> the page from LRU list. However, It makes unnecessary operations
> (e.g., lru_cache_add, pagevec and flags operations. It would be
> not significant but no worth to do) and harder to support new
> non-lru page migration because put_page isn't aware of non-lru
> page's data structure.
>
> To solve the problem, we can add new hook in put_page with
> PageMovable flags check but it can increase overhead in
> hot path and needs new locking scheme to stabilize the flag check
> with put_page.
>
> So, this patch cleans it up to divide two semantic(ie, put and putback).
> If migration is successful, use put_page instead of putback_lru_page and
> use putback_lru_page only on failure. That makes code more readable
> and doesn't add overhead in put_page.
>
> Comment from Vlastimil
> "Yeah, and compaction (perhaps also other migration users) has to drain
> the lru pvec... Getting rid of this stuff is worth even by itself."
>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

[...]

> @@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   		list_del(&page->lru);
>   		dec_zone_page_state(page, NR_ISOLATED_ANON +
>   				page_is_file_cache(page));
> -		/* Soft-offlined page shouldn't go through lru cache list */
> +	}
> +
> +	/*
> +	 * If migration is successful, drop the reference grabbed during
> +	 * isolation. Otherwise, restore the page to LRU list unless we
> +	 * want to retry.
> +	 */
> +	if (rc == MIGRATEPAGE_SUCCESS) {
> +		put_page(page);
>   		if (reason == MR_MEMORY_FAILURE) {
> -			put_page(page);
>   			if (!test_set_page_hwpoison(page))
>   				num_poisoned_pages_inc();
> -		} else
> +		}

Hmm, I didn't notice it previously, or it's due to rebasing, but it seems that
you restricted the memory failure handling (i.e. setting hwpoison) to
MIGRATEPAGE_SUCCESS, while previously it was done for all non-EAGAIN results. I
think that goes against the intention of hwpoison, which is IIRC to catch and
kill the poor process that still uses the page?

Also (but not your fault) the put_page() preceding test_set_page_hwpoison(page)
IMHO deserves a comment saying which pin we are releasing and which one we still
have (hopefully? if I read the description of da1b13ccfbebe right); otherwise it
looks like we are doing something with a page that we just potentially freed.
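
To make the behavioural difference concrete, here is a small user-space
model of the two decision paths. It is only a sketch and not kernel code:
the enum values are stand-ins, and the two helper names are made up for
illustration. The "old" rule is taken from the description above (hwpoison
set for every non-EAGAIN result), the "new" rule from the quoted hunk.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the kernel constants; this is a model, not kernel code. */
enum { MIGRATEPAGE_SUCCESS = 0 };
enum migrate_reason { MR_COMPACTION, MR_MEMORY_FAILURE };

/* Before the patch: hwpoison is set for every result except -EAGAIN. */
static bool old_marks_hwpoison(int rc, enum migrate_reason reason)
{
	return rc != -EAGAIN && reason == MR_MEMORY_FAILURE;
}

/* After the patch: hwpoison is set only when migration succeeded. */
static bool new_marks_hwpoison(int rc, enum migrate_reason reason)
{
	return rc == MIGRATEPAGE_SUCCESS && reason == MR_MEMORY_FAILURE;
}
```

With rc == -ENOMEM and reason == MR_MEMORY_FAILURE the old helper returns
true and the new one returns false, which is the case in question.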

> +	} else {
> +		if (rc != -EAGAIN)
>   			putback_lru_page(page);
> +		if (put_new_page)
> +			put_new_page(newpage, private);
> +		else
> +			put_page(newpage);
>   	}
>
> -	/*
> -	 * If migration was not successful and there's a freeing callback, use
> -	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
> -	 * during isolation.
> -	 */
> -	if (put_new_page)
> -		put_new_page(newpage, private);
> -	else if (unlikely(__is_movable_balloon_page(newpage))) {
> -		/* drop our reference, page already in the balloon */
> -		put_page(newpage);
> -	} else
> -		putback_lru_page(newpage);
> -
>   	if (result) {
>   		if (rc)
>   			*result = rc;
>


* [patch] virtio: silence uninitialized variable warnings
From: Dan Carpenter @ 2016-04-01 11:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kernel-janitors, linux-kernel, virtualization

Most ->get() functions seem to call BUG_ON() if offset + len is out of
range, but rproc_virtio_get() returns early without initializing ret.
Presumably it can't actually happen but it leads to a static checker
warning.  Let's just initialize "ret".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 6e6cb0c9..597dbef 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -334,7 +334,7 @@ static inline void virtio_cread_bytes(struct virtio_device *vdev,
 
 static inline u8 virtio_cread8(struct virtio_device *vdev, unsigned int offset)
 {
-	u8 ret;
+	u8 ret = 0;
 	vdev->config->get(vdev, offset, &ret, sizeof(ret));
 	return ret;
 }
@@ -348,7 +348,7 @@ static inline void virtio_cwrite8(struct virtio_device *vdev,
 static inline u16 virtio_cread16(struct virtio_device *vdev,
 				 unsigned int offset)
 {
-	u16 ret;
+	u16 ret = 0;
 	vdev->config->get(vdev, offset, &ret, sizeof(ret));
 	return virtio16_to_cpu(vdev, (__force __virtio16)ret);
 }
@@ -363,7 +363,7 @@ static inline void virtio_cwrite16(struct virtio_device *vdev,
 static inline u32 virtio_cread32(struct virtio_device *vdev,
 				 unsigned int offset)
 {
-	u32 ret;
+	u32 ret = 0;
 	vdev->config->get(vdev, offset, &ret, sizeof(ret));
 	return virtio32_to_cpu(vdev, (__force __virtio32)ret);
 }
@@ -378,7 +378,7 @@ static inline void virtio_cwrite32(struct virtio_device *vdev,
 static inline u64 virtio_cread64(struct virtio_device *vdev,
 				 unsigned int offset)
 {
-	u64 ret;
+	u64 ret = 0;
 	__virtio_cread_many(vdev, offset, &ret, 1, sizeof(ret));
 	return virtio64_to_cpu(vdev, (__force __virtio64)ret);
 }
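
For illustration only, the following user-space sketch shows why the
initializer matters when a ->get() implementation returns early. The names
mock_config, mock_get() and mock_cread8() are hypothetical and exist only
in this example; the early return models the rproc_virtio_get() behaviour
described in the changelog.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte config space standing in for a device's config. */
static const uint8_t mock_config[4] = { 0x11, 0x22, 0x33, 0x44 };

/*
 * Hypothetical ->get(): on an out-of-range request it returns early
 * without writing to buf, instead of calling BUG_ON().
 */
static void mock_get(unsigned int offset, void *buf, size_t len)
{
	if (offset > sizeof(mock_config) || len > sizeof(mock_config) - offset)
		return;
	memcpy(buf, mock_config + offset, len);
}

/* Mirrors the patched virtio_cread8(): ret is defined on every path. */
static uint8_t mock_cread8(unsigned int offset)
{
	uint8_t ret = 0;

	mock_get(offset, &ret, sizeof(ret));
	return ret;
}
```

An in-range read such as mock_cread8(1) returns the stored byte, while an
out-of-range read such as mock_cread8(4) returns 0 rather than an
indeterminate value.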


* Re: Ballooning on TPS!=HPS hosts
From: Amit Shah @ 2016-04-01 10:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: lvivier, aarcange, drjones, qemu-devel, Virtualization List,
	lcapitulino
In-Reply-To: <20160331180024.GN2265@work-vm>

CC'ing virtualization list.

On (Thu) 31 Mar 2016 [19:00:24], Dr. David Alan Gilbert wrote:
> Hi,
>   I was reading the balloon code and am confused as to how/if ballooning
> works on hosts where the host page size is larger than the
> target page size.
> 
> static void balloon_page(void *addr, int deflate)
> {
> #if defined(__linux__)
>     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
>                                          kvm_has_sync_mmu())) {
>         qemu_madvise(addr, TARGET_PAGE_SIZE,
>                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>     }
> #endif
> }
> 
> The virtio-balloon code only does stuff through balloon_page,
> and an madvise DONTNEED should fail if you try and do it on
> a size smaller than the host page size.  So does ballooning work on
> Power/ARM?
> 
> Am I misunderstanding this?

I think you're right.  Guess no one's tested this in such scenarios
yet.

> Of course looking at the above we won't actually generate an error since
> we don't check the return of qemu_madvise.

... at least we can deflate the balloon in case the madvise fails, so
the guest can use the pages it's given us.

> We have three sizes:
>     a) host page size
>     b) target page size
>     c) VIRTIO_BALLOON_PFN_SHIFT
> 
>  c == 12 (4k) for everyone
>  
> 
>     1) I think the virtio-balloon code needs to coalesce adjacent requests
>       and call balloon_page on whole chunks at once, passing a length.
>     2) Why does balloon_page use TARGET_PAGE_SIZE, ignoring anything else?
>        Shouldn't it be 1 << VIRTIO_BALLOON_PFN_SHIFT?
>     3) I'm guessing the guest kernel doesn't know the host page size, so
>        how can it know what size chunks of balloon to work in?
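
As a rough sketch of what point 1) could look like, the helper below scans
a sorted list of 4k balloon PFNs and reports only those host pages that
are covered completely. This is an assumption-laden illustration, not QEMU
code: coalesce_host_pages() and BALLOON_PFN_SHIFT are names invented here,
and the host page size is simply passed in by the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define BALLOON_PFN_SHIFT 12	/* mirrors VIRTIO_BALLOON_PFN_SHIFT == 12 */

/*
 * pfns must be sorted and free of duplicates; host_page_size must be a
 * power of two >= 4096. The start address of every fully covered host
 * page is stored in out (up to out_max entries). The return value is the
 * total number of fully covered host pages, even if it exceeds out_max.
 */
static size_t coalesce_host_pages(const uint64_t *pfns, size_t n,
				  uint64_t host_page_size,
				  uint64_t *out, size_t out_max)
{
	uint64_t per_host = host_page_size >> BALLOON_PFN_SHIFT;
	size_t found = 0;
	size_t i = 0;

	while (i < n) {
		uint64_t first = pfns[i];
		size_t run = 1;

		/* Measure the run of consecutive PFNs starting at index i. */
		while (i + run < n && pfns[i + run] == first + run)
			run++;

		/* Round the run start up to a host page boundary. */
		uint64_t aligned = (first + per_host - 1) / per_host * per_host;
		uint64_t end = first + run;	/* one past the last PFN */

		for (; aligned + per_host <= end; aligned += per_host) {
			if (found < out_max)
				out[found] = aligned << BALLOON_PFN_SHIFT;
			found++;
		}
		i += run;
	}
	return found;
}
```

With a 64k host page size, PFNs 10..40 yield exactly one complete host
page (PFNs 16..31); the partial pieces at either end would have to be
remembered until the guest balloons the rest of the page. A caller could
obtain the host page size from getpagesize(), for example.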

Thanks,

		Amit


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Juergen Gross @ 2016-04-01  9:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <20160401091507.GG3448@twins.programming.kicks-ass.net>

On 01/04/16 11:15, Peter Zijlstra wrote:
> On Fri, Apr 01, 2016 at 11:03:21AM +0200, Juergen Gross wrote:
>>> Maybe just make the vpin thing an option like:
>>>
>>> 	smp_call_on_cpu(int (*func)(void *), int phys_cpu);
> 
>>> Also; is something like the vpin thing possible on KVM? because if we're
>>> going to expose it to generic code like this we had maybe look at wider
>>> support.
>>
>> It is necessary for dom0 under Xen. I don't think there is a need to do
>> this on KVM as a guest has no direct access to e.g. BIOS functions of
>> the real hardware and the host system needs no vcpu pinning. I'm not
>> sure about VMWare.
> 
> OK, then can we WARN if .phys=1 and the platform doesn't support it?
> 

Yes, good idea.


Juergen


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Peter Zijlstra @ 2016-04-01  9:15 UTC (permalink / raw)
  To: Juergen Gross
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <56FE3959.2030508@suse.com>

On Fri, Apr 01, 2016 at 11:03:21AM +0200, Juergen Gross wrote:
> > Maybe just make the vpin thing an option like:
> > 
> > 	smp_call_on_cpu(int (*func)(void *), int phys_cpu);

> > Also; is something like the vpin thing possible on KVM? because if we're
> > going to expose it to generic code like this we had maybe look at wider
> > support.
> 
> It is necessary for dom0 under Xen. I don't think there is a need to do
> this on KVM as a guest has no direct access to e.g. BIOS functions of
> the real hardware and the host system needs no vcpu pinning. I'm not
> sure about VMWare.

OK, then can we WARN if .phys=1 and the platform doesn't support it?


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Juergen Gross @ 2016-04-01  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <20160401084408.GF3448@twins.programming.kicks-ass.net>

On 01/04/16 10:44, Peter Zijlstra wrote:
> On Fri, Apr 01, 2016 at 10:28:46AM +0200, Juergen Gross wrote:
>> On 01/04/16 09:43, Peter Zijlstra wrote:
>>> On Fri, Apr 01, 2016 at 09:14:33AM +0200, Juergen Gross wrote:
>>>> --- a/kernel/smp.c
>>>> +++ b/kernel/smp.c
>>>> @@ -14,6 +14,7 @@
>>>>  #include <linux/smp.h>
>>>>  #include <linux/cpu.h>
>>>>  #include <linux/sched.h>
>>>> +#include <linux/hypervisor.h>
>>>>  
>>>>  #include "smpboot.h"
>>>>  
>>>> @@ -758,9 +759,14 @@ struct smp_sync_call_struct {
>>>>  static void smp_call_sync_callback(struct work_struct *work)
>>>>  {
>>>>  	struct smp_sync_call_struct *sscs;
>>>> +	unsigned int cpu;
>>>>  
>>>>  	sscs = container_of(work, struct smp_sync_call_struct, work);
>>>> +	cpu = get_cpu();
>>>> +	hypervisor_pin_vcpu(cpu);
>>>>  	sscs->ret = sscs->func(sscs->data);
>>>> +	hypervisor_pin_vcpu(-1);
>>>> +	put_cpu();
>>>>  
>>>>  	complete(&sscs->done);
>>>>  }
>>>
>>> So I don't really like this; it adds the requirement that the function
>>> cannot schedule, which greatly limits the utility of the construct. At
>>> this point you might as well use the regular IPI stuff.
>>
>> Main reason for disabling preemption was to avoid any suspend/resume
>> cycles while vcpu pinning is active.
>>
>> With the switch to workqueues this might not be necessary, if I've read
>> try_to_freeze_tasks() correctly. Can you confirm, please?
> 
> This is not something we should worry about; the caller should ensure
> the CPU stays valid; typically I would expect a caller to do
> get_online_cpus() before 'computing' what CPU to send the function to.

Okay.

> 
>>> So I would propose you add:
>>>
>>> 	smp_call_on_cpu()
>>>
>>> As per patch 2. No promises about physical or anything. This means it
>>> can be used freely by anyone that wants to run a function on another
>>> cpu -- a much more useful thing.
>>
>> Okay.
>>
>>> And then build a phys variant on top.
>>
>> Hmm, I'm not sure I understand what you are suggesting here.
>>
>> Should this phys variant make use of smp_call_on_cpu() via an
>> intermediate function called on the dedicated cpu which is doing the
>> pinning and calling the user function then?
>>
>> Or do you want the phys variant to either use smp_call_on_cpu() or to
>> do the pinning and call the user function by itself depending on the
>> environment (pinning supported)?
> 
> Yeah, uhmm.. not sure on the details; my brain is having a hard time
> engaging this morning.
> 
> Maybe just make the vpin thing an option like:
> 
> 	smp_call_on_cpu(int (*func)(void *), int phys_cpu);

Okay.

> Also; is something like the vpin thing possible on KVM? because if we're
> going to expose it to generic code like this we had maybe look at wider
> support.

It is necessary for dom0 under Xen. I don't think there is a need to do
this on KVM as a guest has no direct access to e.g. BIOS functions of
the real hardware and the host system needs no vcpu pinning. I'm not
sure about VMWare.

Juergen


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Peter Zijlstra @ 2016-04-01  8:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <56FE313E.9060804@suse.com>

On Fri, Apr 01, 2016 at 10:28:46AM +0200, Juergen Gross wrote:
> On 01/04/16 09:43, Peter Zijlstra wrote:
> > On Fri, Apr 01, 2016 at 09:14:33AM +0200, Juergen Gross wrote:
> >> --- a/kernel/smp.c
> >> +++ b/kernel/smp.c
> >> @@ -14,6 +14,7 @@
> >>  #include <linux/smp.h>
> >>  #include <linux/cpu.h>
> >>  #include <linux/sched.h>
> >> +#include <linux/hypervisor.h>
> >>  
> >>  #include "smpboot.h"
> >>  
> >> @@ -758,9 +759,14 @@ struct smp_sync_call_struct {
> >>  static void smp_call_sync_callback(struct work_struct *work)
> >>  {
> >>  	struct smp_sync_call_struct *sscs;
> >> +	unsigned int cpu;
> >>  
> >>  	sscs = container_of(work, struct smp_sync_call_struct, work);
> >> +	cpu = get_cpu();
> >> +	hypervisor_pin_vcpu(cpu);
> >>  	sscs->ret = sscs->func(sscs->data);
> >> +	hypervisor_pin_vcpu(-1);
> >> +	put_cpu();
> >>  
> >>  	complete(&sscs->done);
> >>  }
> > 
> > So I don't really like this; it adds the requirement that the function
> > cannot schedule, which greatly limits the utility of the construct. At
> > this point you might as well use the regular IPI stuff.
> 
> Main reason for disabling preemption was to avoid any suspend/resume
> cycles while vcpu pinning is active.
> 
> With the switch to workqueues this might not be necessary, if I've read
> try_to_freeze_tasks() correctly. Can you confirm, please?

This is not something we should worry about; the caller should ensure
the CPU stays valid; typically I would expect a caller to do
get_online_cpus() before 'computing' what CPU to send the function to.

> > So I would propose you add:
> > 
> > 	smp_call_on_cpu()
> > 
> > As per patch 2. No promises about physical or anything. This means it
> > can be used freely by anyone that wants to run a function on another
> > cpu -- a much more useful thing.
> 
> Okay.
> 
> > And then build a phys variant on top.
> 
> Hmm, I'm not sure I understand what you are suggesting here.
> 
> Should this phys variant make use of smp_call_on_cpu() via an
> intermediate function called on the dedicated cpu which is doing the
> pinning and calling the user function then?
> 
> Or do you want the phys variant to either use smp_call_on_cpu() or to
> do the pinning and call the user function by itself depending on the
> environment (pinning supported)?

Yeah, uhmm.. not sure on the details; my brain is having a hard time
engaging this morning.

Maybe just make the vpin thing an option like:

	smp_call_on_cpu(int (*func)(void *), int phys_cpu);

Also; is something like the vpin thing possible on KVM? because if we're
going to expose it to generic code like this we had maybe look at wider
support.


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Juergen Gross @ 2016-04-01  8:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <20160401074325.GC12845@twins.programming.kicks-ass.net>

On 01/04/16 09:43, Peter Zijlstra wrote:
> On Fri, Apr 01, 2016 at 09:14:33AM +0200, Juergen Gross wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/smp.h>
>>  #include <linux/cpu.h>
>>  #include <linux/sched.h>
>> +#include <linux/hypervisor.h>
>>  
>>  #include "smpboot.h"
>>  
>> @@ -758,9 +759,14 @@ struct smp_sync_call_struct {
>>  static void smp_call_sync_callback(struct work_struct *work)
>>  {
>>  	struct smp_sync_call_struct *sscs;
>> +	unsigned int cpu;
>>  
>>  	sscs = container_of(work, struct smp_sync_call_struct, work);
>> +	cpu = get_cpu();
>> +	hypervisor_pin_vcpu(cpu);
>>  	sscs->ret = sscs->func(sscs->data);
>> +	hypervisor_pin_vcpu(-1);
>> +	put_cpu();
>>  
>>  	complete(&sscs->done);
>>  }
> 
> So I don't really like this; it adds the requirement that the function
> cannot schedule, which greatly limits the utility of the construct. At
> this point you might as well use the regular IPI stuff.

Main reason for disabling preemption was to avoid any suspend/resume
cycles while vcpu pinning is active.

With the switch to workqueues this might not be necessary, if I've read
try_to_freeze_tasks() correctly. Can you confirm, please?

> You can easily avoid this constraint by using:
> 
> 	hypervisor_pin_vcpu(smp_processor_id());
> 
> Also, for the vpinning stuff, the UP version below is sufficient, even
> on SMP systems (with the current !preempt constraint). Which seems to
> suggest we're not having the right interface for this.
> 
> So I would propose you add:
> 
> 	smp_call_on_cpu()
> 
> As per patch 2. No promises about physical or anything. This means it
> can be used freely by anyone that wants to run a function on another
> cpu -- a much more useful thing.

Okay.

> And then build a phys variant on top.

Hmm, I'm not sure I understand what you are suggesting here.

Should this phys variant make use of smp_call_on_cpu() via an
intermediate function called on the dedicated cpu which is doing the
pinning and calling the user function then?

Or do you want the phys variant to either use smp_call_on_cpu() or to
do the pinning and call the user function by itself depending on the
environment (pinning supported)?


Juergen


* Re: [PATCH v3 5/6] virt, sched: add cpu pinning to smp_call_sync_on_phys_cpu()
From: Peter Zijlstra @ 2016-04-01  7:43 UTC (permalink / raw)
  To: Juergen Gross
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <1459494874-12194-6-git-send-email-jgross@suse.com>

On Fri, Apr 01, 2016 at 09:14:33AM +0200, Juergen Gross wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -14,6 +14,7 @@
>  #include <linux/smp.h>
>  #include <linux/cpu.h>
>  #include <linux/sched.h>
> +#include <linux/hypervisor.h>
>  
>  #include "smpboot.h"
>  
> @@ -758,9 +759,14 @@ struct smp_sync_call_struct {
>  static void smp_call_sync_callback(struct work_struct *work)
>  {
>  	struct smp_sync_call_struct *sscs;
> +	unsigned int cpu;
>  
>  	sscs = container_of(work, struct smp_sync_call_struct, work);
> +	cpu = get_cpu();
> +	hypervisor_pin_vcpu(cpu);
>  	sscs->ret = sscs->func(sscs->data);
> +	hypervisor_pin_vcpu(-1);
> +	put_cpu();
>  
>  	complete(&sscs->done);
>  }

So I don't really like this; it adds the requirement that the function
cannot schedule, which greatly limits the utility of the construct. At
this point you might as well use the regular IPI stuff.

You can easily avoid this constraint by using:

	hypervisor_pin_vcpu(smp_processor_id());

Also, for the vpinning stuff, the UP version below is sufficient, even
on SMP systems (with the current !preempt constraint). Which seems to
suggest we're not having the right interface for this.

So I would propose you add:

	smp_call_on_cpu()

As per patch 2. No promises about physical or anything. This means it
can be used freely by anyone that wants to run a function on another
cpu -- a much more useful thing.

And then build a phys variant on top.
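
One possible reading of this layering, as a user-space model only: a
generic call helper that makes no promise about physical CPUs, and a phys
variant that wraps the user function in a small trampoline doing the
pin/unpin. All names below (mock_pin_vcpu, mock_call_on_cpu,
mock_call_on_phys_cpu, sample_func) are hypothetical; the real generic
helper would queue work on the target CPU instead of calling func directly.

```c
/* Hypothetical stand-in for hypervisor_pin_vcpu(); records requests. */
static int last_pin_request = -1;
static int pin_calls;

static void mock_pin_vcpu(int cpu)
{
	last_pin_request = cpu;
	pin_calls++;
}

/* Generic variant: run func, no promise about physical CPUs. */
static int mock_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par)
{
	(void)cpu;	/* a real implementation would queue work on this CPU */
	return func(par);
}

struct phys_call {
	int (*func)(void *);
	void *par;
	unsigned int cpu;
};

/* Intermediate function: pin, call the user function, unpin. */
static int phys_trampoline(void *data)
{
	struct phys_call *pc = data;
	int ret;

	mock_pin_vcpu((int)pc->cpu);
	ret = pc->func(pc->par);
	mock_pin_vcpu(-1);
	return ret;
}

/* Physical variant layered on top of the generic one. */
static int mock_call_on_phys_cpu(unsigned int cpu, int (*func)(void *),
				 void *par)
{
	struct phys_call pc = { .func = func, .par = par, .cpu = cpu };

	return mock_call_on_cpu(cpu, phys_trampoline, &pc);
}

/* Sample callee used to exercise the helpers. */
static int sample_func(void *par)
{
	return *(int *)par + 1;
}
```

In this model the user function is free to sleep, because the pinning no
longer relies on get_cpu()/put_cpu() around the call.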


> diff --git a/kernel/up.c b/kernel/up.c
> index afd395c..725ec44 100644
> --- a/kernel/up.c
> +++ b/kernel/up.c
> @@ -6,6 +6,7 @@
>  #include <linux/kernel.h>
>  #include <linux/export.h>
>  #include <linux/smp.h>
> +#include <linux/hypervisor.h>
>  
>  int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
>  				int wait)
> @@ -85,9 +86,17 @@ EXPORT_SYMBOL(on_each_cpu_cond);
>  
>  int smp_call_sync_on_phys_cpu(unsigned int cpu, int (*func)(void *), void *par)
>  {
> +	int ret;
> +
>  	if (cpu != 0)
>  		return -EINVAL;
>  
> -	return func(par);
> +	preempt_disable();
> +	hypervisor_pin_vcpu(0);
> +	ret = func(par);
> +	hypervisor_pin_vcpu(-1);
> +	preempt_enable();
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(smp_call_sync_on_phys_cpu);
> -- 
> 2.6.2
> 


* Re: [PATCH v3 2/6] smp: add function to execute a function synchronously on a physical cpu
From: Juergen Gross @ 2016-04-01  7:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <20160401073733.GB3448@twins.programming.kicks-ass.net>

On 01/04/16 09:37, Peter Zijlstra wrote:
> On Fri, Apr 01, 2016 at 09:14:30AM +0200, Juergen Gross wrote:
>> +	if (cpu >= nr_cpu_ids)
>> +		return -EINVAL;
> 
>> +	if (cpu != 0)
>> +		return -EINVAL;
> 
> The other functions return -ENXIO for this.

Aah, okay. Will change.


Juergen


* Re: [PATCH v3 2/6] smp: add function to execute a function synchronously on a physical cpu
From: Peter Zijlstra @ 2016-04-01  7:37 UTC (permalink / raw)
  To: Juergen Gross
  Cc: x86, jeremy, jdelvare, konrad.wilk, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <1459494874-12194-3-git-send-email-jgross@suse.com>

On Fri, Apr 01, 2016 at 09:14:30AM +0200, Juergen Gross wrote:
> +	if (cpu >= nr_cpu_ids)
> +		return -EINVAL;

> +	if (cpu != 0)
> +		return -EINVAL;

The other functions return -ENXIO for this.


* [PATCH v3 6/6] xen: add xen_pin_vcpu() to support calling functions on a dedicated pcpu
From: Juergen Gross @ 2016-04-01  7:14 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, konrad.wilk, peterz, hpa,
	akataria, x86, virtualization, chrisw, mingo, david.vrabel,
	Douglas_Warzecha, pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459494874-12194-1-git-send-email-jgross@suse.com>

Some hardware models (e.g. Dell Studio 1555 laptops) require calls to
the firmware to be issued on cpu 0 only. As Dom0 might have to use
these calls, add xen_pin_vcpu() to achieve this functionality.

In case either the domain doesn't have the privilege to make the
related hypercall or the hypervisor isn't supporting it, issue a
warning once and disable further pinning attempts.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/enlighten.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 880862c..7907bcf8 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1885,6 +1885,45 @@ static void xen_set_cpu_features(struct cpuinfo_x86 *c)
 	}
 }
 
+static void xen_pin_vcpu(int cpu)
+{
+	static bool disable_pinning;
+	struct sched_pin_override pin_override;
+	int ret;
+
+	if (disable_pinning)
+		return;
+
+	pin_override.pcpu = cpu;
+	ret = HYPERVISOR_sched_op(SCHEDOP_pin_override, &pin_override);
+	if (cpu < 0)
+		return;
+
+	switch (ret) {
+	case -ENOSYS:
+		pr_warn("The kernel tried to call a function on physical cpu %d, but Xen isn't\n"
+			"supporting this. In case of problems you might consider vcpu pinning.\n",
+			cpu);
+		disable_pinning = true;
+		break;
+	case -EPERM:
+		WARN(1, "Trying to pin vcpu without having privilege to do so\n");
+		disable_pinning = true;
+		break;
+	case -EINVAL:
+	case -EBUSY:
+		pr_warn("The kernel tried to call a function on physical cpu %d, but this cpu\n"
+			"seems not to be available. Please check your Xen cpu configuration.\n",
+			cpu);
+		break;
+	case 0:
+		break;
+	default:
+		WARN(1, "rc %d while trying to pin vcpu\n", ret);
+		disable_pinning = true;
+	}
+}
+
 const struct hypervisor_x86 x86_hyper_xen = {
 	.name			= "Xen",
 	.detect			= xen_platform,
@@ -1893,6 +1932,7 @@ const struct hypervisor_x86 x86_hyper_xen = {
 #endif
 	.x2apic_available	= xen_x2apic_para_available,
 	.set_cpu_features       = xen_set_cpu_features,
+	.pin_vcpu               = xen_pin_vcpu,
 };
 EXPORT_SYMBOL(x86_hyper_xen);
 
-- 
2.6.2
