Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-24 16:14 UTC (permalink / raw)
  To: jasowang
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <4ECE3F0D.3090908@redhat.com>

On Thu, Nov 24, 2011 at 08:56:45PM +0800, jasowang wrote:
> On 11/24/2011 06:34 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
> >>On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
> >>>On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>>>It was reported that the macvtap device selects a
> >>>>different vhost (when used with multiqueue feature)
> >>>>for incoming packets of a single connection. Use
> >>>>packet hash first. Patch tested on MQ virtio_net.
> >>>So this is sure to address the problem, why exactly does this happen?
> >>Ixgbe has flow director and bind queue to host cpu, so it can make
> >>sure the packet of a flow to be handled by the same queue/cpu. So
> >>when vhost thread moves from one host cpu to another, ixgbe would
> >>therefore send the packet to the new cpu/queue.
> >Confused. How does ixgbe know about vhost thread moving?
> 
> As far as I can see, ixgbe binds queues to physical cpu, so let consider:
> 
> vhost thread transmits packets of flow A on processor M
> during packet transmission, ixgbe driver programs the card to
> deliver the packet of flow A to queue/cpu M through flow director
> (see ixgbe_atr())
> vhost thread then receives packet of flow A with from M
> ...
> vhost thread transmits packets of flow A on processor N
> ixgbe driver programs the flow director to change the delivery of
> flow A to queue N ( cpu N )
> vhost thread then receives packet of flow A with from N
> ...
> 
> So, for a single flow A, we may get different queue mappings. Using
> rxhash instead may solve this issue.

Or better, transmit a single flow from a single vhost thread.

If packets of a single flow get spread over different CPUs,
they will get reordered and things are not going to work well.

> >
> >>>Does your device spread a single flow across multiple RX queues? Would
> >>>not that cause trouble in the TCP layer?
> >>>It would seem that using the recorded queue should be faster with
> >>>less cache misses. Before we give up on that, I'd
> >>>like to understand why it's wrong. Do you know?
> >>>
> >>>>Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> >>>>---
> >>>>  drivers/net/macvtap.c |   16 ++++++++--------
> >>>>  1 file changed, 8 insertions(+), 8 deletions(-)
> >>>>
> >>>>diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
> >>>>--- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
> >>>>+++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
> >>>>@@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
> >>>>  	if (!numvtaps)
> >>>>  		goto out;
> >>>>
> >>>>+	/* Check if we can use flow to select a queue */
> >>>>+	rxq = skb_get_rxhash(skb);
> >>>>+	if (rxq) {
> >>>>+		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>>>+		if (tap)
> >>>>+			goto out;
> >>>>+	}
> >>>>+
> >>>>  	if (likely(skb_rx_queue_recorded(skb))) {
> >>>>  		rxq = skb_get_rx_queue(skb);
> >>>>
> >>>>@@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
> >>>>  			goto out;
> >>>>  	}
> >>>>
> >>>>-	/* Check if we can use flow to select a queue */
> >>>>-	rxq = skb_get_rxhash(skb);
> >>>>-	if (rxq) {
> >>>>-		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>>>-		if (tap)
> >>>>-			goto out;
> >>>>-	}
> >>>>-
> >>>>  	/* Everything failed - find first available queue */
> >>>>  	for (rxq = 0; rxq<   MAX_MACVTAP_QUEUES; rxq++) {
> >>>>  		tap = rcu_dereference(vlan->taps[rxq]);
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>>the body of a message to majordomo@vger.kernel.org
> >>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25  2:58 UTC (permalink / raw)
  To: jasowang
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem, jeffrey.t.kirsher
In-Reply-To: <4ECE4004.8010107@redhat.com>

jasowang <jasowang@redhat.com> wrote on 11/24/2011 06:30:52 PM:

>
> >> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>> It was reported that the macvtap device selects a
> >>> different vhost (when used with multiqueue feature)
> >>> for incoming packets of a single connection. Use
> >>> packet hash first. Patch tested on MQ virtio_net.
> >> So this is sure to address the problem, why exactly does this happen?
> >> Does your device spread a single flow across multiple RX queues? Would
> >> not that cause trouble in the TCP layer?
> >> It would seem that using the recorded queue should be faster with
> >> less cache misses. Before we give up on that, I'd
> >> like to understand why it's wrong. Do you know?
> > I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
> > calls skb_record_rx_queue when a skb is allocated. When a packet
> > is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
> > recorded value is different for most skbs when I ran a single
> > stream TCP stream test (does skbs move between the rx_rings?).
>
> Yes, it moves. It depends on last processor or tx queue who transmits
> the packets of this stream. Because ixgbe select tx queue based on the
> processor id, so if vhost thread transmits skbs on different processors,
> the skb of a single stream may comes from different rx ring.

But I don't see transmit going on different queues,
only incoming.

- KK

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25  3:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124161430.GC26770@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/24/2011 09:44:31 PM:

> > As far as I can see, ixgbe binds queues to physical cpu, so let
consider:
> >
> > vhost thread transmits packets of flow A on processor M
> > during packet transmission, ixgbe driver programs the card to
> > deliver the packet of flow A to queue/cpu M through flow director
> > (see ixgbe_atr())
> > vhost thread then receives packet of flow A with from M
> > ...
> > vhost thread transmits packets of flow A on processor N
> > ixgbe driver programs the flow director to change the delivery of
> > flow A to queue N ( cpu N )
> > vhost thread then receives packet of flow A with from N
> > ...
> >
> > So, for a single flow A, we may get different queue mappings. Using
> > rxhash instead may solve this issue.
>
> Or better, transmit a single flow from a single vhost thread.
>
> If packets of a single flow get spread over different CPUs,
> they will get reordered and things are not going to work well.

My testing so far shows that guest sends on (e.g.) TXQ#2
only, which is handled by vhost#2; and this doesn't change
for the entire duration of the test. Incoming keeps
changing for different packets but become same with
this patch. To iterate, I have not seen the following:

"
vhost thread transmits packets of flow A on processor M
...
vhost thread transmits packets of flow A on processor N
"

- KK

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25  3:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124161430.GC26770@redhat.com>

On 11/25/2011 12:14 AM, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2011 at 08:56:45PM +0800, jasowang wrote:
>> >  On 11/24/2011 06:34 PM, Michael S. Tsirkin wrote:
>>> >  >On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
>>>> >  >>On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
>>>>> >  >>>On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>>>>> >  >>>>It was reported that the macvtap device selects a
>>>>>> >  >>>>different vhost (when used with multiqueue feature)
>>>>>> >  >>>>for incoming packets of a single connection. Use
>>>>>> >  >>>>packet hash first. Patch tested on MQ virtio_net.
>>>>> >  >>>So this is sure to address the problem, why exactly does this happen?
>>>> >  >>Ixgbe has flow director and bind queue to host cpu, so it can make
>>>> >  >>sure the packet of a flow to be handled by the same queue/cpu. So
>>>> >  >>when vhost thread moves from one host cpu to another, ixgbe would
>>>> >  >>therefore send the packet to the new cpu/queue.
>>> >  >Confused. How does ixgbe know about vhost thread moving?
>> >  
>> >  As far as I can see, ixgbe binds queues to physical cpu, so let consider:
>> >  
>> >  vhost thread transmits packets of flow A on processor M
>> >  during packet transmission, ixgbe driver programs the card to
>> >  deliver the packet of flow A to queue/cpu M through flow director
>> >  (see ixgbe_atr())
>> >  vhost thread then receives packet of flow A with from M
>> >  ...
>> >  vhost thread transmits packets of flow A on processor N
>> >  ixgbe driver programs the flow director to change the delivery of
>> >  flow A to queue N ( cpu N )
>> >  vhost thread then receives packet of flow A with from N
>> >  ...
>> >  
>> >  So, for a single flow A, we may get different queue mappings. Using
>> >  rxhash instead may solve this issue.
> Or better, transmit a single flow from a single vhost thread.

It has already worked this way, as the tx queue were choose based on tx 
hash in guest(),  but vhost thread can move among processors.

>
> If packets of a single flow get spread over different CPUs,
> they will get reordered and things are not going to work well.
>

The problem is that vhost does not handle TCP itself but ixgbe driver 
would think it does, so the nic would deliver packets of a single flow 
to different CPUs when the vhost thread who does the transmission moves.

So, in conclusion, if we do not consider features of  under layer nic, 
using rxhash instead of queue mappings to identify a flow is better.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25  3:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem, jeffrey.t.kirsher
In-Reply-To: <OF2886C7FB.135CFC23-ON65257953.000F9C8A-65257953.00101CBF@in.ibm.com>

On 11/25/2011 10:58 AM, Krishna Kumar2 wrote:
> jasowang<jasowang@redhat.com>  wrote on 11/24/2011 06:30:52 PM:
>
>>>> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>>>> It was reported that the macvtap device selects a
>>>>> different vhost (when used with multiqueue feature)
>>>>> for incoming packets of a single connection. Use
>>>>> packet hash first. Patch tested on MQ virtio_net.
>>>> So this is sure to address the problem, why exactly does this happen?
>>>> Does your device spread a single flow across multiple RX queues? Would
>>>> not that cause trouble in the TCP layer?
>>>> It would seem that using the recorded queue should be faster with
>>>> less cache misses. Before we give up on that, I'd
>>>> like to understand why it's wrong. Do you know?
>>> I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
>>> calls skb_record_rx_queue when a skb is allocated. When a packet
>>> is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
>>> recorded value is different for most skbs when I ran a single
>>> stream TCP stream test (does skbs move between the rx_rings?).
>> Yes, it moves. It depends on last processor or tx queue who transmits
>> the packets of this stream. Because ixgbe select tx queue based on the
>> processor id, so if vhost thread transmits skbs on different processors,
>> the skb of a single stream may comes from different rx ring.
> But I don't see transmit going on different queues,
> only incoming.
>
> - KK
>

Maybe I'm not clear enough, I mean the processor of host and tx queue of 
ixgbe.  So you would see, for a single vhost thread, as it moves among 
host cpus, it would use different tx queues of ixgbe. I think if you pin 
the vhost thread on host cpu, you may get consistent rx queue no.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25  3:21 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem
In-Reply-To: <OF86B15006.7F4BE3FB-ON65257953.00104C3E-65257953.0010EFCA@in.ibm.com>

On 11/25/2011 11:07 AM, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"<mst@redhat.com>  wrote on 11/24/2011 09:44:31 PM:
>
>>> As far as I can see, ixgbe binds queues to physical cpu, so let
> consider:
>>> vhost thread transmits packets of flow A on processor M
>>> during packet transmission, ixgbe driver programs the card to
>>> deliver the packet of flow A to queue/cpu M through flow director
>>> (see ixgbe_atr())
>>> vhost thread then receives packet of flow A with from M
>>> ...
>>> vhost thread transmits packets of flow A on processor N
>>> ixgbe driver programs the flow director to change the delivery of
>>> flow A to queue N ( cpu N )
>>> vhost thread then receives packet of flow A with from N
>>> ...
>>>
>>> So, for a single flow A, we may get different queue mappings. Using
>>> rxhash instead may solve this issue.
>> Or better, transmit a single flow from a single vhost thread.
>>
>> If packets of a single flow get spread over different CPUs,
>> they will get reordered and things are not going to work well.
> My testing so far shows that guest sends on (e.g.) TXQ#2
> only, which is handled by vhost#2; and this doesn't change
> for the entire duration of the test. Incoming keeps
> changing for different packets but become same with
> this patch. To iterate, I have not seen the following:

Yes because guest chose the txq of virtio-net based on hash.

>
> "
> vhost thread transmits packets of flow A on processor M
> ...
> vhost thread transmits packets of flow A on processor N
> "

My description is not clear again :(
I mean the same vhost thead:

vhost thread #0 transmits packets of flow A on processor M
...
vhost thread #0 move to another process N and start to transmit packets 
of flow A

> - KK
>

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25  4:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem
In-Reply-To: <4ECF09D5.4010700@redhat.com>

Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
>
> My description is not clear again :(
> I mean the same vhost thead:
>
> vhost thread #0 transmits packets of flow A on processor M
> ...
> vhost thread #0 move to another process N and start to transmit packets
> of flow A

Thanks for clarifying. Yes, binding vhosts to CPU's
makes the incoming packet go to the same vhost each
time. BTW, are you doing any binding and/or irqbalance
when you run your tests? I am not running either at
this time, but thought both might be useful.

- KK

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: David Miller @ 2011-11-25  6:35 UTC (permalink / raw)
  To: krkumar2; +Cc: arnd, mst, netdev, virtualization, levinsasha928
In-Reply-To: <OF835A6FF5.7D752AD1-ON65257953.0016350C-65257953.00169AF4@in.ibm.com>

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 25 Nov 2011 09:39:11 +0530

> Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
>>
>> My description is not clear again :(
>> I mean the same vhost thead:
>>
>> vhost thread #0 transmits packets of flow A on processor M
>> ...
>> vhost thread #0 move to another process N and start to transmit packets
>> of flow A
> 
> Thanks for clarifying. Yes, binding vhosts to CPU's
> makes the incoming packet go to the same vhost each
> time. BTW, are you doing any binding and/or irqbalance
> when you run your tests? I am not running either at
> this time, but thought both might be useful.

So are we going with this patch or are we saying that vhost binding
is a requirement?

^ permalink raw reply

* Re: Fwd: Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
From: Mathieu Desnoyers @ 2011-11-25  7:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Krishna Kumar, gorcunov, Lai Jiangshan, kvm, Michael S. Tsirkin,
	Asias He, virtualization, Pekka Enberg, Benjamin Poirier,
	Sasha Levin, netdev, mingo, Paul E. McKenney

Hi Stephen,

Benjamin forwarded me your email stating:

> I have been playing with userspace-rcu which has a number of neat
> lockless routines for queuing and hashing. But there aren't kernel versions
> and several of them may require cmpxchg to work.

Just FYI, I made sure a few years ago that cmpxchg is implemented on all
architectures within the Linux kernel (using a interrupt disable
fallback on the cases where it is not supported architecturally, on
UP-only architectures), so we should be good to use the lock-free
structures as-is in the kernel on this front. As for the RCU use by
these structures, userspace RCU has very much the same semantic as in
the kernel, so we can implement and test these structures in userspace
and eventually port them to the kernel as needed.

Lai Jiangshan is actively working at making sure the user-level
implementation of the RCU lock-free hash table (currently in a
development branch of the userspace RCU git tree : urcu/ht-shrink, not
yet in master) is suitable for use in the Linux kernel too.

Best regards,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH 1/1] Staging: hv: mousevsc: Remove the mouse driver from the staging tree
From: K. Y. Srinivasan @ 2011-11-26  5:28 UTC (permalink / raw)
  To: gregkh, linux-kernel, devel, virtualization, ohering, joe,
	dmitry.torokhov, jkosina
  Cc: K. Y. Srinivasan

The mouse driver for Hyper-V is a HID compliant mouse driver.
Now that all relevant review comments have been addressed,
Jiri has agreed to merge this mouse driver into 3.3 hid.git.
This patch takes care of removing the mouse driver from the
staging area. Greg, if you can ack this patch, Jiri will proceed
with his merge.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/staging/hv/Kconfig    |    6 -
 drivers/staging/hv/Makefile   |    1 -
 drivers/staging/hv/hv_mouse.c |  582 -----------------------------------------
 3 files changed, 0 insertions(+), 589 deletions(-)
 delete mode 100644 drivers/staging/hv/hv_mouse.c

diff --git a/drivers/staging/hv/Kconfig b/drivers/staging/hv/Kconfig
index 072185e..6c0dc30 100644
--- a/drivers/staging/hv/Kconfig
+++ b/drivers/staging/hv/Kconfig
@@ -9,9 +9,3 @@ config HYPERV_NET
 	depends on HYPERV && NET
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
-
-config HYPERV_MOUSE
-	tristate "Microsoft Hyper-V mouse driver"
-	depends on HYPERV && HID
-	help
-	  Select this option to enable the Hyper-V mouse driver.
diff --git a/drivers/staging/hv/Makefile b/drivers/staging/hv/Makefile
index 0f55cee..fbe9a42 100644
--- a/drivers/staging/hv/Makefile
+++ b/drivers/staging/hv/Makefile
@@ -1,6 +1,5 @@
 obj-$(CONFIG_HYPERV_STORAGE)	+= hv_storvsc.o
 obj-$(CONFIG_HYPERV_NET)	+= hv_netvsc.o
-obj-$(CONFIG_HYPERV_MOUSE)	+= hv_mouse.o
 
 hv_storvsc-y := storvsc_drv.o
 hv_netvsc-y := netvsc_drv.o netvsc.o rndis_filter.o
diff --git a/drivers/staging/hv/hv_mouse.c b/drivers/staging/hv/hv_mouse.c
deleted file mode 100644
index d503cbb..0000000
--- a/drivers/staging/hv/hv_mouse.c
+++ /dev/null
@@ -1,582 +0,0 @@
-/*
- *  Copyright (c) 2009, Citrix Systems, Inc.
- *  Copyright (c) 2010, Microsoft Corporation.
- *  Copyright (c) 2011, Novell Inc.
- *
- *  This program is free software; you can redistribute it and/or modify it
- *  under the terms and conditions of the GNU General Public License,
- *  version 2, as published by the Free Software Foundation.
- *
- *  This program is distributed in the hope it will be useful, but WITHOUT
- *  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- *  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
- *  more details.
- */
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/device.h>
-#include <linux/completion.h>
-#include <linux/input.h>
-#include <linux/hid.h>
-#include <linux/hiddev.h>
-#include <linux/hyperv.h>
-
-
-struct hv_input_dev_info {
-	unsigned int size;
-	unsigned short vendor;
-	unsigned short product;
-	unsigned short version;
-	unsigned short reserved[11];
-};
-
-/* The maximum size of a synthetic input message. */
-#define SYNTHHID_MAX_INPUT_REPORT_SIZE 16
-
-/*
- * Current version
- *
- * History:
- * Beta, RC < 2008/1/22        1,0
- * RC > 2008/1/22              2,0
- */
-#define SYNTHHID_INPUT_VERSION_MAJOR	2
-#define SYNTHHID_INPUT_VERSION_MINOR	0
-#define SYNTHHID_INPUT_VERSION		(SYNTHHID_INPUT_VERSION_MINOR | \
-					 (SYNTHHID_INPUT_VERSION_MAJOR << 16))
-
-
-#pragma pack(push, 1)
-/*
- * Message types in the synthetic input protocol
- */
-enum synthhid_msg_type {
-	SYNTH_HID_PROTOCOL_REQUEST,
-	SYNTH_HID_PROTOCOL_RESPONSE,
-	SYNTH_HID_INITIAL_DEVICE_INFO,
-	SYNTH_HID_INITIAL_DEVICE_INFO_ACK,
-	SYNTH_HID_INPUT_REPORT,
-	SYNTH_HID_MAX
-};
-
-/*
- * Basic message structures.
- */
-struct synthhid_msg_hdr {
-	enum synthhid_msg_type type;
-	u32 size;
-};
-
-struct synthhid_msg {
-	struct synthhid_msg_hdr header;
-	char data[1]; /* Enclosed message */
-};
-
-union synthhid_version {
-	struct {
-		u16 minor_version;
-		u16 major_version;
-	};
-	u32 version;
-};
-
-/*
- * Protocol messages
- */
-struct synthhid_protocol_request {
-	struct synthhid_msg_hdr header;
-	union synthhid_version version_requested;
-};
-
-struct synthhid_protocol_response {
-	struct synthhid_msg_hdr header;
-	union synthhid_version version_requested;
-	unsigned char approved;
-};
-
-struct synthhid_device_info {
-	struct synthhid_msg_hdr header;
-	struct hv_input_dev_info hid_dev_info;
-	struct hid_descriptor hid_descriptor;
-};
-
-struct synthhid_device_info_ack {
-	struct synthhid_msg_hdr header;
-	unsigned char reserved;
-};
-
-struct synthhid_input_report {
-	struct synthhid_msg_hdr header;
-	char buffer[1];
-};
-
-#pragma pack(pop)
-
-#define INPUTVSC_SEND_RING_BUFFER_SIZE		(10*PAGE_SIZE)
-#define INPUTVSC_RECV_RING_BUFFER_SIZE		(10*PAGE_SIZE)
-
-
-enum pipe_prot_msg_type {
-	PIPE_MESSAGE_INVALID,
-	PIPE_MESSAGE_DATA,
-	PIPE_MESSAGE_MAXIMUM
-};
-
-
-struct pipe_prt_msg {
-	enum pipe_prot_msg_type type;
-	u32 size;
-	char data[1];
-};
-
-struct  mousevsc_prt_msg {
-	enum pipe_prot_msg_type type;
-	u32 size;
-	union {
-		struct synthhid_protocol_request request;
-		struct synthhid_protocol_response response;
-		struct synthhid_device_info_ack ack;
-	};
-};
-
-/*
- * Represents an mousevsc device
- */
-struct mousevsc_dev {
-	struct hv_device	*device;
-	bool			init_complete;
-	bool			connected;
-	struct mousevsc_prt_msg	protocol_req;
-	struct mousevsc_prt_msg	protocol_resp;
-	/* Synchronize the request/response if needed */
-	struct completion	wait_event;
-	int			dev_info_status;
-
-	struct hid_descriptor	*hid_desc;
-	unsigned char		*report_desc;
-	u32			report_desc_size;
-	struct hv_input_dev_info hid_dev_info;
-	struct hid_device       *hid_device;
-};
-
-
-static struct mousevsc_dev *mousevsc_alloc_device(struct hv_device *device)
-{
-	struct mousevsc_dev *input_dev;
-
-	input_dev = kzalloc(sizeof(struct mousevsc_dev), GFP_KERNEL);
-
-	if (!input_dev)
-		return NULL;
-
-	input_dev->device = device;
-	hv_set_drvdata(device, input_dev);
-	init_completion(&input_dev->wait_event);
-	input_dev->init_complete = false;
-
-	return input_dev;
-}
-
-static void mousevsc_free_device(struct mousevsc_dev *device)
-{
-	kfree(device->hid_desc);
-	kfree(device->report_desc);
-	hv_set_drvdata(device->device, NULL);
-	kfree(device);
-}
-
-static void mousevsc_on_receive_device_info(struct mousevsc_dev *input_device,
-				struct synthhid_device_info *device_info)
-{
-	int ret = 0;
-	struct hid_descriptor *desc;
-	struct mousevsc_prt_msg ack;
-
-	input_device->dev_info_status = -ENOMEM;
-
-	input_device->hid_dev_info = device_info->hid_dev_info;
-	desc = &device_info->hid_descriptor;
-	if (desc->bLength == 0)
-		goto cleanup;
-
-	input_device->hid_desc = kzalloc(desc->bLength, GFP_ATOMIC);
-
-	if (!input_device->hid_desc)
-		goto cleanup;
-
-	memcpy(input_device->hid_desc, desc, desc->bLength);
-
-	input_device->report_desc_size = desc->desc[0].wDescriptorLength;
-	if (input_device->report_desc_size == 0) {
-		input_device->dev_info_status = -EINVAL;
-		goto cleanup;
-	}
-
-	input_device->report_desc = kzalloc(input_device->report_desc_size,
-					  GFP_ATOMIC);
-
-	if (!input_device->report_desc) {
-		input_device->dev_info_status = -ENOMEM;
-		goto cleanup;
-	}
-
-	memcpy(input_device->report_desc,
-	       ((unsigned char *)desc) + desc->bLength,
-	       desc->desc[0].wDescriptorLength);
-
-	/* Send the ack */
-	memset(&ack, 0, sizeof(struct mousevsc_prt_msg));
-
-	ack.type = PIPE_MESSAGE_DATA;
-	ack.size = sizeof(struct synthhid_device_info_ack);
-
-	ack.ack.header.type = SYNTH_HID_INITIAL_DEVICE_INFO_ACK;
-	ack.ack.header.size = 1;
-	ack.ack.reserved = 0;
-
-	ret = vmbus_sendpacket(input_device->device->channel,
-			&ack,
-			sizeof(struct pipe_prt_msg) - sizeof(unsigned char) +
-			sizeof(struct synthhid_device_info_ack),
-			(unsigned long)&ack,
-			VM_PKT_DATA_INBAND,
-			VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
-
-	if (!ret)
-		input_device->dev_info_status = 0;
-
-cleanup:
-	complete(&input_device->wait_event);
-
-	return;
-}
-
-static void mousevsc_on_receive(struct hv_device *device,
-				struct vmpacket_descriptor *packet)
-{
-	struct pipe_prt_msg *pipe_msg;
-	struct synthhid_msg *hid_msg;
-	struct mousevsc_dev *input_dev = hv_get_drvdata(device);
-	struct synthhid_input_report *input_report;
-
-	pipe_msg = (struct pipe_prt_msg *)((unsigned long)packet +
-						(packet->offset8 << 3));
-
-	if (pipe_msg->type != PIPE_MESSAGE_DATA)
-		return;
-
-	hid_msg = (struct synthhid_msg *)pipe_msg->data;
-
-	switch (hid_msg->header.type) {
-	case SYNTH_HID_PROTOCOL_RESPONSE:
-		/*
-		 * While it will be impossible for us to protect against
-		 * malicious/buggy hypervisor/host, add a check here to
-		 * ensure we don't corrupt memory.
-		 */
-		if ((pipe_msg->size + sizeof(struct pipe_prt_msg)
-			- sizeof(unsigned char))
-			> sizeof(struct mousevsc_prt_msg)) {
-			WARN_ON(1);
-			break;
-		}
-
-		memcpy(&input_dev->protocol_resp, pipe_msg,
-		       pipe_msg->size + sizeof(struct pipe_prt_msg) -
-		       sizeof(unsigned char));
-		complete(&input_dev->wait_event);
-		break;
-
-	case SYNTH_HID_INITIAL_DEVICE_INFO:
-		WARN_ON(pipe_msg->size < sizeof(struct hv_input_dev_info));
-
-		/*
-		 * Parse out the device info into device attr,
-		 * hid desc and report desc
-		 */
-		mousevsc_on_receive_device_info(input_dev,
-			(struct synthhid_device_info *)pipe_msg->data);
-		break;
-	case SYNTH_HID_INPUT_REPORT:
-		input_report =
-			(struct synthhid_input_report *)pipe_msg->data;
-		if (!input_dev->init_complete)
-			break;
-		hid_input_report(input_dev->hid_device,
-				HID_INPUT_REPORT, input_report->buffer,
-				input_report->header.size, 1);
-		break;
-	default:
-		pr_err("unsupported hid msg type - type %d len %d",
-		       hid_msg->header.type, hid_msg->header.size);
-		break;
-	}
-
-}
-
-static void mousevsc_on_channel_callback(void *context)
-{
-	const int packet_size = 0x100;
-	int ret;
-	struct hv_device *device = context;
-	u32 bytes_recvd;
-	u64 req_id;
-	struct vmpacket_descriptor *desc;
-	unsigned char	*buffer;
-	int	bufferlen = packet_size;
-
-	buffer = kmalloc(bufferlen, GFP_ATOMIC);
-	if (!buffer)
-		return;
-
-	do {
-		ret = vmbus_recvpacket_raw(device->channel, buffer,
-					bufferlen, &bytes_recvd, &req_id);
-
-		switch (ret) {
-		case 0:
-			if (bytes_recvd <= 0) {
-				kfree(buffer);
-				return;
-			}
-			desc = (struct vmpacket_descriptor *)buffer;
-
-			switch (desc->type) {
-			case VM_PKT_COMP:
-				break;
-
-			case VM_PKT_DATA_INBAND:
-				mousevsc_on_receive(device, desc);
-				break;
-
-			default:
-				pr_err("unhandled packet type %d, tid %llx len %d\n",
-					desc->type, req_id, bytes_recvd);
-				break;
-			}
-
-			break;
-
-		case -ENOBUFS:
-			kfree(buffer);
-			/* Handle large packet */
-			bufferlen = bytes_recvd;
-			buffer = kmalloc(bytes_recvd, GFP_ATOMIC);
-
-			if (!buffer)
-				return;
-
-			break;
-		}
-	} while (1);
-
-}
-
-static int mousevsc_connect_to_vsp(struct hv_device *device)
-{
-	int ret = 0;
-	int t;
-	struct mousevsc_dev *input_dev = hv_get_drvdata(device);
-	struct mousevsc_prt_msg *request;
-	struct mousevsc_prt_msg *response;
-
-	request = &input_dev->protocol_req;
-	memset(request, 0, sizeof(struct mousevsc_prt_msg));
-
-	request->type = PIPE_MESSAGE_DATA;
-	request->size = sizeof(struct synthhid_protocol_request);
-	request->request.header.type = SYNTH_HID_PROTOCOL_REQUEST;
-	request->request.header.size = sizeof(unsigned int);
-	request->request.version_requested.version = SYNTHHID_INPUT_VERSION;
-
-	ret = vmbus_sendpacket(device->channel, request,
-				sizeof(struct pipe_prt_msg) -
-				sizeof(unsigned char) +
-				sizeof(struct synthhid_protocol_request),
-				(unsigned long)request,
-				VM_PKT_DATA_INBAND,
-				VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
-	if (ret)
-		goto cleanup;
-
-	t = wait_for_completion_timeout(&input_dev->wait_event, 5*HZ);
-	if (!t) {
-		ret = -ETIMEDOUT;
-		goto cleanup;
-	}
-
-	response = &input_dev->protocol_resp;
-
-	if (!response->response.approved) {
-		pr_err("synthhid protocol request failed (version %d)\n",
-		       SYNTHHID_INPUT_VERSION);
-		ret = -ENODEV;
-		goto cleanup;
-	}
-
-	t = wait_for_completion_timeout(&input_dev->wait_event, 5*HZ);
-	if (!t) {
-		ret = -ETIMEDOUT;
-		goto cleanup;
-	}
-
-	/*
-	 * We should have gotten the device attr, hid desc and report
-	 * desc at this point
-	 */
-	ret = input_dev->dev_info_status;
-
-cleanup:
-	return ret;
-}
-
-static int mousevsc_hid_open(struct hid_device *hid)
-{
-	return 0;
-}
-
-static int mousevsc_hid_start(struct hid_device *hid)
-{
-	return 0;
-}
-
-static void mousevsc_hid_close(struct hid_device *hid)
-{
-}
-
-static void mousevsc_hid_stop(struct hid_device *hid)
-{
-}
-
-static struct hid_ll_driver mousevsc_ll_driver = {
-	.open = mousevsc_hid_open,
-	.close = mousevsc_hid_close,
-	.start = mousevsc_hid_start,
-	.stop = mousevsc_hid_stop,
-};
-
-static struct hid_driver mousevsc_hid_driver;
-
-static int mousevsc_probe(struct hv_device *device,
-			const struct hv_vmbus_device_id *dev_id)
-{
-	int ret;
-	struct mousevsc_dev *input_dev;
-	struct hid_device *hid_dev;
-
-	input_dev = mousevsc_alloc_device(device);
-
-	if (!input_dev)
-		return -ENOMEM;
-
-	ret = vmbus_open(device->channel,
-		INPUTVSC_SEND_RING_BUFFER_SIZE,
-		INPUTVSC_RECV_RING_BUFFER_SIZE,
-		NULL,
-		0,
-		mousevsc_on_channel_callback,
-		device
-		);
-
-	if (ret)
-		goto probe_err0;
-
-	ret = mousevsc_connect_to_vsp(device);
-
-	if (ret)
-		goto probe_err1;
-
-	/* workaround SA-167 */
-	if (input_dev->report_desc[14] == 0x25)
-		input_dev->report_desc[14] = 0x29;
-
-	hid_dev = hid_allocate_device();
-	if (IS_ERR(hid_dev)) {
-		ret = PTR_ERR(hid_dev);
-		goto probe_err1;
-	}
-
-	hid_dev->ll_driver = &mousevsc_ll_driver;
-	hid_dev->driver = &mousevsc_hid_driver;
-	hid_dev->bus = BUS_VIRTUAL;
-	hid_dev->vendor = input_dev->hid_dev_info.vendor;
-	hid_dev->product = input_dev->hid_dev_info.product;
-	hid_dev->version = input_dev->hid_dev_info.version;
-	input_dev->hid_device = hid_dev;
-
-	sprintf(hid_dev->name, "%s", "Microsoft Vmbus HID-compliant Mouse");
-
-	ret = hid_parse_report(hid_dev, input_dev->report_desc,
-				input_dev->report_desc_size);
-
-	if (ret) {
-		hid_err(hid_dev, "parse failed\n");
-		goto probe_err2;
-	}
-
-	ret = hid_hw_start(hid_dev, HID_CONNECT_HIDINPUT | HID_CONNECT_HIDDEV);
-
-	if (ret) {
-		hid_err(hid_dev, "hw start failed\n");
-		goto probe_err2;
-	}
-
-	input_dev->connected = true;
-	input_dev->init_complete = true;
-
-	return ret;
-
-probe_err2:
-	hid_destroy_device(hid_dev);
-
-probe_err1:
-	vmbus_close(device->channel);
-
-probe_err0:
-	mousevsc_free_device(input_dev);
-
-	return ret;
-}
-
-
-static int mousevsc_remove(struct hv_device *dev)
-{
-	struct mousevsc_dev *input_dev = hv_get_drvdata(dev);
-
-	vmbus_close(dev->channel);
-	hid_destroy_device(input_dev->hid_device);
-	mousevsc_free_device(input_dev);
-
-	return 0;
-}
-
-static const struct hv_vmbus_device_id id_table[] = {
-	/* Mouse guid */
-	{ VMBUS_DEVICE(0x9E, 0xB6, 0xA8, 0xCF, 0x4A, 0x5B, 0xc0, 0x4c,
-		       0xB9, 0x8B, 0x8B, 0xA1, 0xA1, 0xF3, 0xF9, 0x5A) },
-	{ },
-};
-
-MODULE_DEVICE_TABLE(vmbus, id_table);
-
-static struct  hv_driver mousevsc_drv = {
-	.name = KBUILD_MODNAME,
-	.id_table = id_table,
-	.probe = mousevsc_probe,
-	.remove = mousevsc_remove,
-};
-
-static int __init mousevsc_init(void)
-{
-	return vmbus_driver_register(&mousevsc_drv);
-}
-
-static void __exit mousevsc_exit(void)
-{
-	vmbus_driver_unregister(&mousevsc_drv);
-}
-
-MODULE_LICENSE("GPL");
-MODULE_VERSION(HV_DRV_VERSION);
-module_init(mousevsc_init);
-module_exit(mousevsc_exit);
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH] vhost-net: Acquire device lock when releasing device
From: David Miller @ 2011-11-26 20:45 UTC (permalink / raw)
  To: levinsasha928; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <1321607982-16283-1-git-send-email-levinsasha928@gmail.com>

From: Sasha Levin <levinsasha928@gmail.com>
Date: Fri, 18 Nov 2011 11:19:42 +0200

> Device lock should be held when releasing a device, and specifically
> when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:
 ...
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>

Michael et al., are you guys going to gather this fix or should I
apply it directly to thet net tree?

Thanks.

^ permalink raw reply

* Re: [PATCH 3/3] Staging: hv: mousevsc: Use the KBUILD_MODNAME macro
From: Greg KH @ 2011-11-27  0:59 UTC (permalink / raw)
  To: H Hartley Sweeten
  Cc: K. Y. Srinivasan, gregkh@suse.de, linux-kernel@vger.kernel.org,
	devel@linuxdriverproject.org, virtualization@lists.osdl.org,
	ohering@suse.com, joe@perches.com, dmitry.torokhov@gmail.com,
	jkosina@suse.cz
In-Reply-To: <ADE657CA350FB648AAC2C43247A983F001F39DEE70F6@AUSP01VMBX24.collaborationhost.net>

On Fri, Oct 28, 2011 at 06:02:22PM -0500, H Hartley Sweeten wrote:
> On Friday, October 28, 2011 3:11 PM, K. Y. Srinivasan wrote:
> > 
> > Use the KBUILD_MODNAME macro.
> > 
> > Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
> > ---
> >  drivers/staging/hv/hv_mouse.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/staging/hv/hv_mouse.c b/drivers/staging/hv/hv_mouse.c
> > index c22f729..2c2e1b4 100644
> > --- a/drivers/staging/hv/hv_mouse.c
> > +++ b/drivers/staging/hv/hv_mouse.c
> > @@ -578,7 +578,7 @@ static const struct hv_vmbus_device_id id_table[] = {
> >  MODULE_DEVICE_TABLE(vmbus, id_table);
> >  
> >  static struct  hv_driver mousevsc_drv = {
> > -	.name = "mousevsc",
> > +	.name = KBUILD_MODNAME,
> >  	.id_table = id_table,
> >  	.probe = mousevsc_probe,
> >  	.remove = mousevsc_remove,
> 
> It's just my opinion, but I'm not sure this is better.
> 
> By changing the name to KBUILD_MODNAME you can no longer grep the source to
> find the driver based on it's name:

That's fine.

> $ grep -r name * | grep mousevsc
> drivers/staging/hv/hv_mouse.c:  .name = "mousevsc",
> 
> Also, doesn't this change the name of the driver from "mousevsc" to "hv_mouse"?

Yes.

> Does anything else need modified to handle the name change or is the matching
> handled strictly by the id_table?

No, this only shows up in sysfs, no other changes needed.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 1/1] Staging: hv: mousevsc: Remove the mouse driver from the staging tree
From: Greg KH @ 2011-11-27  1:00 UTC (permalink / raw)
  To: K. Y. Srinivasan
  Cc: gregkh, linux-kernel, devel, virtualization, ohering, joe,
	dmitry.torokhov, jkosina
In-Reply-To: <1322285308-3507-1-git-send-email-kys@microsoft.com>

On Fri, Nov 25, 2011 at 09:28:28PM -0800, K. Y. Srinivasan wrote:
> The mouse driver for Hyper-V is a HID compliant mouse driver.
> Now that all relevant review comments have been addressed,
> Jiri has agreed to merge this mouse driver into 3.3 hid.git.
> This patch takes care of removing the mouse driver from the
> staging area. Greg, if you can ack this patch, Jiri will proceed
> with his merge.
> 
> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>

Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

Nice job.

greg k-h

^ permalink raw reply

* [PATCH 28/62] vhost: remove the second argument of k[un]map_atomic()
From: Cong Wang @ 2011-11-27  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, kvm, Michael S. Tsirkin, netdev, virtualization, akpm
In-Reply-To: <1322371662-26166-1-git-send-email-amwang@redhat.com>

Signed-off-by: Cong Wang <amwang@redhat.com>
---
 drivers/vhost/vhost.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c14c42b..bdb2d64 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -937,9 +937,9 @@ static int set_bit_to_user(int nr, void __user *addr)
 	if (r < 0)
 		return r;
 	BUG_ON(r != 1);
-	base = kmap_atomic(page, KM_USER0);
+	base = kmap_atomic(page);
 	set_bit(bit, base);
-	kunmap_atomic(base, KM_USER0);
+	kunmap_atomic(base);
 	set_page_dirty_lock(page);
 	put_page(page);
 	return 0;
-- 
1.7.4.4

^ permalink raw reply related

* Re: [PATCH] vhost-net: Acquire device lock when releasing device
From: Michael S. Tsirkin @ 2011-11-27 16:49 UTC (permalink / raw)
  To: Sasha Levin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1321607982-16283-1-git-send-email-levinsasha928@gmail.com>

On Fri, Nov 18, 2011 at 11:19:42AM +0200, Sasha Levin wrote:
> Device lock should be held when releasing a device, and specifically
> when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:
> 
> [ 2025.642835] ===============================
> [ 2025.643838] [ INFO: suspicious RCU usage. ]
> [ 2025.645182] -------------------------------
> [ 2025.645927] drivers/vhost/vhost.c:475 suspicious rcu_dereference_protected() usage!
> [ 2025.647329]
> [ 2025.647330] other info that might help us debug this:
> [ 2025.647331]
> [ 2025.649042]
> [ 2025.649043] rcu_scheduler_active = 1, debug_locks = 1
> [ 2025.650235] no locks held by trinity/21042.
> [ 2025.650971]
> [ 2025.650972] stack backtrace:
> [ 2025.651789] Pid: 21042, comm: trinity Not tainted 3.2.0-rc2-sasha-00057-ga9098b3 #5
> [ 2025.653342] Call Trace:
> [ 2025.653792]  [<ffffffff810b4a6a>] lockdep_rcu_suspicious+0xaf/0xb9
> [ 2025.654916]  [<ffffffff818d4c2c>] vhost_dev_cleanup+0x342/0x3ac
> [ 2025.655985]  [<ffffffff818d4f18>] vhost_net_release+0x48/0x7f
> [ 2025.657247]  [<ffffffff811416e3>] fput+0x11e/0x1dc
> [ 2025.658091]  [<ffffffff8113f1ec>] filp_close+0x6e/0x79
> [ 2025.658964]  [<ffffffff81089ed7>] put_files_struct+0xcc/0x196
> [ 2025.659971]  [<ffffffff8108a034>] exit_files+0x46/0x4f
> [ 2025.660865]  [<ffffffff8108a716>] do_exit+0x264/0x75c
> [ 2025.662201]  [<ffffffff8113f490>] ? fsnotify_modify+0x60/0x68
> [ 2025.663260]  [<ffffffff81bbdbca>] ? sysret_check+0x2e/0x69
> [ 2025.664269]  [<ffffffff8108acc1>] do_group_exit+0x83/0xb1
> [ 2025.665448]  [<ffffffff8108ad01>] sys_exit_group+0x12/0x16
> [ 2025.666396]  [<ffffffff81bbdb92>] system_call_fastpath+0x16/0x1b
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> ---
>  drivers/vhost/net.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 882a51f..c9be601 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -586,6 +586,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	struct socket *tx_sock;
>  	struct socket *rx_sock;
>  
> +	mutex_lock(&n->dev.mutex);
>  	vhost_net_stop(n, &tx_sock, &rx_sock);
>  	vhost_net_flush(n);
>  	vhost_dev_cleanup(&n->dev);
> @@ -596,6 +597,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	/* We do an extra flush before freeing memory,
>  	 * since jobs can re-queue themselves. */
>  	vhost_net_flush(n);
> +	mutex_unlock(&n->dev.mutex);
>  	kfree(n);
>  	return 0;
>  }

This calls fput fom release under lock which is generally a bad idea,
as locks become nested then. For example, consider what would happen if this
somehow triggers a release which in turn needs the same mutex.

And, we are releasing the device, so it seems better to check
that no one has the lock.
How about the following instead? What do you think?

-->

vhost: fix release path lockdep checks

We shouldn't hold any locks on release path. Pass a flag to
vhost_dev_cleanup to use the lockdep info correctly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a8c95ef..96f9769 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -424,7 +424,7 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	if (!memory)
 		return -ENOMEM;
 
-	vhost_dev_cleanup(dev);
+	vhost_dev_cleanup(dev, true);
 
 	memory->nregions = 0;
 	RCU_INIT_POINTER(dev->memory, memory);
@@ -455,8 +455,8 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 	return j;
 }
 
-/* Caller should have device mutex */
-void vhost_dev_cleanup(struct vhost_dev *dev)
+/* Caller should have device mutex if and only if locked is set */
+void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
 	int i;
 
@@ -493,7 +493,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
 	kfree(rcu_dereference_protected(dev->memory,
-					lockdep_is_held(&dev->mutex)));
+					locked ==
+						lockdep_is_held(&dev->mutex)));
 	RCU_INIT_POINTER(dev->memory, NULL);
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3e8cc3..97e18d3 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -164,7 +164,7 @@ struct vhost_dev {
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
-void vhost_dev_cleanup(struct vhost_dev *);
+void vhost_dev_cleanup(struct vhost_dev *, bool locked);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);

-- 
MST

^ permalink raw reply related

* Re: [PATCH] vhost-net: Acquire device lock when releasing device
From: Michael S. Tsirkin @ 2011-11-27 16:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, virtualization, levinsasha928, kvm, linux-kernel
In-Reply-To: <20111126.154503.1317670837061338352.davem@davemloft.net>

On Sat, Nov 26, 2011 at 03:45:03PM -0500, David Miller wrote:
> From: Sasha Levin <levinsasha928@gmail.com>
> Date: Fri, 18 Nov 2011 11:19:42 +0200
> 
> > Device lock should be held when releasing a device, and specifically
> > when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:
>  ...
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: kvm@vger.kernel.org
> > Cc: virtualization@lists.linux-foundation.org
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> 
> Michael et al., are you guys going to gather this fix or should I
> apply it directly to thet net tree?
> 
> Thanks.

I think it needs a small tweak before being applied.

-- 
MST

^ permalink raw reply

* Re: [PATCH] vhost-net: Acquire device lock when releasing device
From: Michael S. Tsirkin @ 2011-11-27 17:06 UTC (permalink / raw)
  To: Sasha Levin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20111127164925.GA24941@redhat.com>

On Sun, Nov 27, 2011 at 06:49:27PM +0200, Michael S. Tsirkin wrote:
> On Fri, Nov 18, 2011 at 11:19:42AM +0200, Sasha Levin wrote:
> > Device lock should be held when releasing a device, and specifically
> > when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:
> > 
> > [ 2025.642835] ===============================
> > [ 2025.643838] [ INFO: suspicious RCU usage. ]
> > [ 2025.645182] -------------------------------
> > [ 2025.645927] drivers/vhost/vhost.c:475 suspicious rcu_dereference_protected() usage!
> > [ 2025.647329]
> > [ 2025.647330] other info that might help us debug this:
> > [ 2025.647331]
> > [ 2025.649042]
> > [ 2025.649043] rcu_scheduler_active = 1, debug_locks = 1
> > [ 2025.650235] no locks held by trinity/21042.
> > [ 2025.650971]
> > [ 2025.650972] stack backtrace:
> > [ 2025.651789] Pid: 21042, comm: trinity Not tainted 3.2.0-rc2-sasha-00057-ga9098b3 #5
> > [ 2025.653342] Call Trace:
> > [ 2025.653792]  [<ffffffff810b4a6a>] lockdep_rcu_suspicious+0xaf/0xb9
> > [ 2025.654916]  [<ffffffff818d4c2c>] vhost_dev_cleanup+0x342/0x3ac
> > [ 2025.655985]  [<ffffffff818d4f18>] vhost_net_release+0x48/0x7f
> > [ 2025.657247]  [<ffffffff811416e3>] fput+0x11e/0x1dc
> > [ 2025.658091]  [<ffffffff8113f1ec>] filp_close+0x6e/0x79
> > [ 2025.658964]  [<ffffffff81089ed7>] put_files_struct+0xcc/0x196
> > [ 2025.659971]  [<ffffffff8108a034>] exit_files+0x46/0x4f
> > [ 2025.660865]  [<ffffffff8108a716>] do_exit+0x264/0x75c
> > [ 2025.662201]  [<ffffffff8113f490>] ? fsnotify_modify+0x60/0x68
> > [ 2025.663260]  [<ffffffff81bbdbca>] ? sysret_check+0x2e/0x69
> > [ 2025.664269]  [<ffffffff8108acc1>] do_group_exit+0x83/0xb1
> > [ 2025.665448]  [<ffffffff8108ad01>] sys_exit_group+0x12/0x16
> > [ 2025.666396]  [<ffffffff81bbdb92>] system_call_fastpath+0x16/0x1b
> > 
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: kvm@vger.kernel.org
> > Cc: virtualization@lists.linux-foundation.org
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > ---
> >  drivers/vhost/net.c |    2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 882a51f..c9be601 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -586,6 +586,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	struct socket *tx_sock;
> >  	struct socket *rx_sock;
> >  
> > +	mutex_lock(&n->dev.mutex);
> >  	vhost_net_stop(n, &tx_sock, &rx_sock);
> >  	vhost_net_flush(n);
> >  	vhost_dev_cleanup(&n->dev);
> > @@ -596,6 +597,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	/* We do an extra flush before freeing memory,
> >  	 * since jobs can re-queue themselves. */
> >  	vhost_net_flush(n);
> > +	mutex_unlock(&n->dev.mutex);
> >  	kfree(n);
> >  	return 0;
> >  }
> 
> This calls fput fom release under lock which is generally a bad idea,
> as locks become nested then. For example, consider what would happen if this
> somehow triggers a release which in turn needs the same mutex.
> 
> And, we are releasing the device, so it seems better to check
> that no one has the lock.
> How about the following instead? What do you think?
> 
> -->
> 
> vhost: fix release path lockdep checks
> 
> We shouldn't hold any locks on release path. Pass a flag to
> vhost_dev_cleanup to use the lockdep info correctly.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---

Sorry, this got cut short. Here's the full patch (1st chunk was
missing).  Does this solve the problem for you?


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8b16d16..9bba4b3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -598,7 +598,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
-	vhost_dev_cleanup(&n->dev);
+	vhost_dev_cleanup(&n->dev, false);
 	if (tx_sock)
 		fput(tx_sock->file);
 	if (rx_sock)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a8c95ef..96f9769 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -424,7 +424,7 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	if (!memory)
 		return -ENOMEM;
 
-	vhost_dev_cleanup(dev);
+	vhost_dev_cleanup(dev, true);
 
 	memory->nregions = 0;
 	RCU_INIT_POINTER(dev->memory, memory);
@@ -455,8 +455,8 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
 	return j;
 }
 
-/* Caller should have device mutex */
-void vhost_dev_cleanup(struct vhost_dev *dev)
+/* Caller should have device mutex if and only if locked is set */
+void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
 	int i;
 
@@ -493,7 +493,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
 	kfree(rcu_dereference_protected(dev->memory,
-					lockdep_is_held(&dev->mutex)));
+					locked ==
+						lockdep_is_held(&dev->mutex)));
 	RCU_INIT_POINTER(dev->memory, NULL);
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3e8cc3..97e18d3 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -164,7 +164,7 @@ struct vhost_dev {
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
-void vhost_dev_cleanup(struct vhost_dev *);
+void vhost_dev_cleanup(struct vhost_dev *, bool locked);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);

^ permalink raw reply related

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-27 17:14 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <OF835A6FF5.7D752AD1-ON65257953.0016350C-65257953.00169AF4@in.ibm.com>

On Fri, Nov 25, 2011 at 09:39:11AM +0530, Krishna Kumar2 wrote:
> Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
> >
> > My description is not clear again :(
> > I mean the same vhost thead:
> >
> > vhost thread #0 transmits packets of flow A on processor M
> > ...
> > vhost thread #0 move to another process N and start to transmit packets
> > of flow A
> 
> Thanks for clarifying. Yes, binding vhosts to CPU's
> makes the incoming packet go to the same vhost each
> time.

Interesting, but still not sure why.
What if you bind the VCPUs but not the vhost thread?

> BTW, are you doing any binding and/or irqbalance
> when you run your tests? I am not running either at
> this time, but thought both might be useful.
> 
> - KK

Either pinning or irqbalance is a good idea.
Doing neither means you get a random CPU handling
interrupts.

-- 
MST

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-27 17:23 UTC (permalink / raw)
  To: David Miller; +Cc: krkumar2, arnd, netdev, virtualization, levinsasha928
In-Reply-To: <20111125.013552.1613051198566054931.davem@davemloft.net>

On Fri, Nov 25, 2011 at 01:35:52AM -0500, David Miller wrote:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Fri, 25 Nov 2011 09:39:11 +0530
> 
> > Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
> >>
> >> My description is not clear again :(
> >> I mean the same vhost thead:
> >>
> >> vhost thread #0 transmits packets of flow A on processor M
> >> ...
> >> vhost thread #0 move to another process N and start to transmit packets
> >> of flow A
> > 
> > Thanks for clarifying. Yes, binding vhosts to CPU's
> > makes the incoming packet go to the same vhost each
> > time. BTW, are you doing any binding and/or irqbalance
> > when you run your tests? I am not running either at
> > this time, but thought both might be useful.
> 
> So are we going with this patch or are we saying that vhost binding
> is a requirement?

I think it's a good idea to make sure we understand the problem
root cause well before applying the patch. We still
have a bit of time before 3.2. In particular, why does
the vhost thread bounce between CPUs so much?

Long term it seems the best way is to expose the preferred mapping
from the guest and forward it to the device.

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio-mmio: Devices parameter parsing
From: Rusty Russell @ 2011-11-28  0:31 UTC (permalink / raw)
  To: Pawel Moll
  Cc: linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <1322071732.3093.491.camel@hornet.cambridge.arm.com>

On Wed, 23 Nov 2011 18:08:52 +0000, Pawel Moll <pawel.moll@arm.com> wrote:
> On Tue, 2011-11-22 at 00:44 +0000, Rusty Russell wrote: 
> > Or would it be simpler to enhance sscanf() with some weird format option
> > for suffixing?  I haven't looked for similar cases, but I'd suspect a
> > big win in general.
> > 
> > This would be neater than anything else we've got:
> >         if (sscanf(device, "%llu@%llu[KMG]:%u", ...) != 3
> >             && sscanf(device, "%llu@%llu[KMG]:%u:%u", ...) != 4)
> >                 return -EINVAL;
> 
> sscanf was a good hint! Thanks, why haven't I thought of it myself? ;-)
> 
> That's what I came up with:
> 
> static int vm_cmdline_set(const char *device,
>                 const struct kernel_param *kp)
> {
>         struct resource resources[2] = {};
>         char *str;
>         long long int base;
>         int processed, consumed = 0;
>         struct platform_device *pdev;
>         
>         resources[0].flags = IORESOURCE_MEM;
>         resources[1].flags = IORESOURCE_IRQ;
>         
>         resources[0].end = memparse(device, &str) - 1;
>         
>         processed = sscanf(str, "@%lli:%u%n:%d%n", 
>                         &base, &resources[1].start, &consumed,
>                         &vm_cmdline_id, &consumed);
>                         
>         if (processed < 2 || processed > 3 || str[consumed])
>                 return -EINVAL;
>                 
>         resources[0].start = base;
>         resources[0].end += base;
>         resources[1].end = resources[1].start;
> 
> The only bit missing from sscanf() would be some sort of "%m" format,
> which behaved like memparse and also processed unsigned number with "0
> base" (hard to believe but the only "universal" - as in octal, dec and
> hex - format is %i, which is signed). But still, looks quite neat
> already :-)

Yeah, something like %pK for kernel pointers in printk; you need a
suffix so that gcc won't get upset.  The "[KMG]" suffix hack was my
poor suggestion...

> I'll try to have a look at the "late parameters" idea tomorrow. Any
> early warnings?

Off the top of my head, this makes me think of the way initcalls are
ordered.  We could put a parameter parsing initcall at the start of each
initcall level in include/asm-generic/vmlinux.lds.h's INITCALLS macro.

Then we steal four bits from struct kernel_param's flags to indicate the
level of the initcall (-1 == existing ones, otherwise N == before level
N initcalls).

Good luck!
Rusty.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Rusty Russell @ 2011-11-28  0:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <20111124070728.GH29994@redhat.com>

On Thu, 24 Nov 2011 09:11:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote:
> > It'll be *clearer* to have two completely separate paths than to fill
> > with if() statements.
> 
> Well, look at my patches. See how each new feature basically adds *one*
> if statement.

Which means every new feature doubles the number of paths to test.  Both
in qemu and the kernel.  And I know we're not very good at testing.

And you've still got alignment setting and guest-set ringsize to go,
which are more complex.

> Yes, we could declare all existing features to be required,
> this is what you are advocating. But in a couple of years
> more if statements have accumulated and we still don't have
> a flying car. Meanwhile the driver has a huge legacy section
> which is a huge pain to maintain.

We've only implemented one layout change in 4 years (MSI-X) with one
more definite (64-bit features).  The rate of requests is also slowing:
they mainly came from early re-implementations.

And by drawing a line under legacy mode, we don't have to test those new
options with legacy mode.

> > And a rewrite won't hurt the driver.
> 
> Wow. Is everyone going for the rewrite these days?

Ok, I'll confess that was an ill-advised comment.  The virtio-pci driver
is actually quite nice.

> And yes it will hurt, it will hurt bisectability and testability.
> What you propose is a huge patch that just changes all of it.
> If there's a breakage, bisect just points at a rewrite.

I don't have a problem with bisectability; it's great in the transition.

But you're suggesting we maintain all those different variations
forever, and that burden be on other implementations too.

> What I would like to see is incremental patches. So I would like to
> 1. Split driver config from common config
> 2. Split isr/notifications out too
> 3. Copy config in MMIO
> 4. Add a new config
> 5. Add length checks
> 6. Add 64 bit features
> 7. Add alignment programming
> 
> At each point we can bisect.  It worked well for event index, qemu had a
> flag to turn it off, each time someone came in and went 'maybe it's
> event index' we just tested without.

Ok, I would like to see this too.  But I think the intermediate stages
should be disallowed in the final patch.

> > But to be honest I don't really care about the Linux driver: we're
> > steeped in this stuff and we'll get it right.
> 
> Maybe. What about windows drivers?

Yes.  My point was: I think one big switch is easier than lots of little
ones.

> > But I'm *terrified* of making the spec more complex;
> 
> All you do is move stuff around. Why do you think it simplifies the spec
> so much?

No, but it reduces the yuk factor.  Which has been important to adoption.

And that's *not* all I do: reducing the number of options definitely
simplifies the spec.  For example, the spec should currently say
(looking at your implementation):

  Notifying the device
  ====================
  If you find a valid VIRTIO_PCI_CAP_NOTIFY_CFG capability, and you can
  map 2 bytes within it, those two bytes should be used to notify the
  device of new descriptors in its virtqueues, by writing the index of the
  virtqueue to that mapping.

  If the capability is missing or malformed or you cannot map it, the
  virtqueue index should be written to the VIRTIO_PCI_QUEUE_NOTIFY offset
  of the legacy bar.

Vs:

  Notifying the device
  ====================
  The index of the virtqueue containing new descriptors should be written
  to the location specified by the VIRTIO_PCI_CAP_NOTIFY_CFG capability.
  (Unless the device is in legacy mode, see Appendix Y: Legacy Mode).

> > I *really* want to banish the legacy stuff to an appendix where noone will
> > ever know it's there :)
> 
> This does not make life easy for implementations as they
> need to implement it anyway. What makes life easy is
> making new features optional, so you pick just what you like.

Today, that's true.  Tomorrow, it's not.

I'd like to see kvmtools remove support for legacy mode altogether, but
they probably have existing users.

And there are boutique implementations who don't care about
compatibility with older versions, but they implement virtio because
it's well specified, and clear.

> For example, we might want to backport some new feature into
> rhel6.X at some point. I would like the diff to be small,
> and easily understandable in its impact.

Sure.  And I'm determined to reduce long-term complexity.

Look, I try to be more inclusive and polite than Linus, but at some
point more verbiage is wasted.  We will have single Legacy Mode switch.
Accept it, or fork the standard.

If you want to reuse the same structure, we're going to need to figure
out how to specify the virtqueue address without a fixed alignment, and
how to specify the alignment itself.

Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-28  4:25 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem
In-Reply-To: <OF835A6FF5.7D752AD1-ON65257953.0016350C-65257953.00169AF4@in.ibm.com>

On 11/25/2011 12:09 PM, Krishna Kumar2 wrote:
> Jason Wang<jasowang@redhat.com>  wrote on 11/25/2011 08:51:57 AM:
>> My description is not clear again :(
>> I mean the same vhost thead:
>>
>> vhost thread #0 transmits packets of flow A on processor M
>> ...
>> vhost thread #0 move to another process N and start to transmit packets
>> of flow A
> Thanks for clarifying. Yes, binding vhosts to CPU's
> makes the incoming packet go to the same vhost each
> time. BTW, are you doing any binding and/or irqbalance
> when you run your tests? I am not running either at
> this time, but thought both might be useful.

I'm using ixgbe for testing also, for host, its driver seems provide irq 
affinity hint, so no binding or irqbalance is needed. For guest, 
irqbalance is used in guest. I've tried bind irq in guest, and it can 
improve the rx performance.

> - KK
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-28  4:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, arnd, netdev, virtualization, levinsasha928,
	David Miller
In-Reply-To: <20111127172333.GD31987@redhat.com>

On 11/28/2011 01:23 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 25, 2011 at 01:35:52AM -0500, David Miller wrote:
>> From: Krishna Kumar2<krkumar2@in.ibm.com>
>> Date: Fri, 25 Nov 2011 09:39:11 +0530
>>
>>> Jason Wang<jasowang@redhat.com>  wrote on 11/25/2011 08:51:57 AM:
>>>> My description is not clear again :(
>>>> I mean the same vhost thead:
>>>>
>>>> vhost thread #0 transmits packets of flow A on processor M
>>>> ...
>>>> vhost thread #0 move to another process N and start to transmit packets
>>>> of flow A
>>> Thanks for clarifying. Yes, binding vhosts to CPU's
>>> makes the incoming packet go to the same vhost each
>>> time. BTW, are you doing any binding and/or irqbalance
>>> when you run your tests? I am not running either at
>>> this time, but thought both might be useful.
>> So are we going with this patch or are we saying that vhost binding
>> is a requirement?
> I think it's a good idea to make sure we understand the problem
> root cause well before applying the patch. We still
> have a bit of time before 3.2. In particular, why does
> the vhost thread bounce between CPUs so much?

Other than this, since we could not assume the behavior of the under 
nic, using rxhash to identify a flow is more generic way.
>
> Long term it seems the best way is to expose the preferred mapping
> from the guest and forward it to the device.
>

I was working on this and hope to post it soon.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-28  8:41 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <87vcq5t69c.fsf@rustcorp.com.au>

On Mon, Nov 28, 2011 at 11:25:43AM +1030, Rusty Russell wrote:
> > > But I'm *terrified* of making the spec more complex;
> > 
> > All you do is move stuff around. Why do you think it simplifies the spec
> > so much?
> 
> No, but it reduces the yuk factor.  Which has been important to adoption.

Sorry if I'm dense. Could you please clarify: do you think we can live
with the slightly higher yuk factor assuming the spec moves the
legacy mode into an appendix as you explain below and driver has a
single 'legacy' switch?

> And that's *not* all I do: reducing the number of options definitely
> simplifies the spec.  For example, the spec should currently say
> (looking at your implementation):
> 
>   Notifying the device
>   ====================
>   If you find a valid VIRTIO_PCI_CAP_NOTIFY_CFG capability, and you can
>   map 2 bytes within it, those two bytes should be used to notify the
>   device of new descriptors in its virtqueues, by writing the index of the
>   virtqueue to that mapping.
> 
>   If the capability is missing or malformed or you cannot map it, the
>   virtqueue index should be written to the VIRTIO_PCI_QUEUE_NOTIFY offset
>   of the legacy bar.
> 
> Vs:
> 
>   Notifying the device
>   ====================
>   The index of the virtqueue containing new descriptors should be written
>   to the location specified by the VIRTIO_PCI_CAP_NOTIFY_CFG capability.
>   (Unless the device is in legacy mode, see Appendix Y: Legacy Mode).

Yes, I agree, this is better.

...

> Look, I try to be more inclusive and polite than Linus, but at some
> point more verbiage is wasted.
> We will have single Legacy Mode switch.

Sorry, I'm adding more verbiage :( 
When you say a single Legacy Mode switch, you mean that the driver will
assume either legacy layout or the new one, correct?

> Accept it, or fork the standard.
>
> If you want to reuse the same structure, we're going to need to figure
> out how to specify the virtqueue address without a fixed alignment, and
> how to specify the alignment itself.

I think I see a way to do that in a relatively painless way.
Do you prefer seeing driver patches or spec? Or are you not interested
in reusing the same structure at all?

-- 
MST

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Sasha Levin @ 2011-11-28  9:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Michael S. Tsirkin,
	Alexey Kardashevskiy, Wang Sheng-Hui, lkml - Kernel Mailing List,
	virtualization, Christian Borntraeger, Amit Shah
In-Reply-To: <87vcq5t69c.fsf@rustcorp.com.au>

On Mon, 2011-11-28 at 11:25 +1030, Rusty Russell wrote:
> I'd like to see kvmtools remove support for legacy mode altogether,
> but they probably have existing users.

While we can't simply remove it right away, instead of mixing our
implementation for both legacy and new spec in the same code we can
split the virtio-pci implementation into two:

	- virtio/virtio-pci-legacy.c
	- virtio/virtio-pci.c

At that point we can #ifdef the entire virtio-pci-legacy.c for now and
remove it at the same time legacy virtio-pci is removed from the kernel.

I think this is something very similar to what you want done in the
kernel code, so an added plus is that the usermode code will be
mirroring the kernel code - which is something we try to have in the KVM
tool :)

-- 

Sasha.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox