Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-24 16:14 UTC (permalink / raw)
  To: jasowang
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <4ECE3F0D.3090908@redhat.com>

On Thu, Nov 24, 2011 at 08:56:45PM +0800, jasowang wrote:
> On 11/24/2011 06:34 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
> >>On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
> >>>On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>>>It was reported that the macvtap device selects a
> >>>>different vhost (when used with multiqueue feature)
> >>>>for incoming packets of a single connection. Use
> >>>>packet hash first. Patch tested on MQ virtio_net.
> >>>So this is sure to address the problem, why exactly does this happen?
> >>Ixgbe has flow director and bind queue to host cpu, so it can make
> >>sure the packet of a flow to be handled by the same queue/cpu. So
> >>when vhost thread moves from one host cpu to another, ixgbe would
> >>therefore send the packet to the new cpu/queue.
> >Confused. How does ixgbe know about vhost thread moving?
> 
> As far as I can see, ixgbe binds queues to physical cpu, so let consider:
> 
> vhost thread transmits packets of flow A on processor M
> during packet transmission, ixgbe driver programs the card to
> deliver the packet of flow A to queue/cpu M through flow director
> (see ixgbe_atr())
> vhost thread then receives packet of flow A with from M
> ...
> vhost thread transmits packets of flow A on processor N
> ixgbe driver programs the flow director to change the delivery of
> flow A to queue N ( cpu N )
> vhost thread then receives packet of flow A with from N
> ...
> 
> So, for a single flow A, we may get different queue mappings. Using
> rxhash instead may solve this issue.

Or better, transmit a single flow from a single vhost thread.

If packets of a single flow get spread over different CPUs,
they will get reordered and things are not going to work well.

> >
> >>>Does your device spread a single flow across multiple RX queues? Would
> >>>not that cause trouble in the TCP layer?
> >>>It would seem that using the recorded queue should be faster with
> >>>less cache misses. Before we give up on that, I'd
> >>>like to understand why it's wrong. Do you know?
> >>>
> >>>>Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> >>>>---
> >>>>  drivers/net/macvtap.c |   16 ++++++++--------
> >>>>  1 file changed, 8 insertions(+), 8 deletions(-)
> >>>>
> >>>>diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
> >>>>--- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
> >>>>+++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
> >>>>@@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
> >>>>  	if (!numvtaps)
> >>>>  		goto out;
> >>>>
> >>>>+	/* Check if we can use flow to select a queue */
> >>>>+	rxq = skb_get_rxhash(skb);
> >>>>+	if (rxq) {
> >>>>+		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>>>+		if (tap)
> >>>>+			goto out;
> >>>>+	}
> >>>>+
> >>>>  	if (likely(skb_rx_queue_recorded(skb))) {
> >>>>  		rxq = skb_get_rx_queue(skb);
> >>>>
> >>>>@@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
> >>>>  			goto out;
> >>>>  	}
> >>>>
> >>>>-	/* Check if we can use flow to select a queue */
> >>>>-	rxq = skb_get_rxhash(skb);
> >>>>-	if (rxq) {
> >>>>-		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>>>-		if (tap)
> >>>>-			goto out;
> >>>>-	}
> >>>>-
> >>>>  	/* Everything failed - find first available queue */
> >>>>  	for (rxq = 0; rxq<   MAX_MACVTAP_QUEUES; rxq++) {
> >>>>  		tap = rcu_dereference(vlan->taps[rxq]);
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>>the body of a message to majordomo@vger.kernel.org
> >>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-24 16:12 UTC (permalink / raw)
  To: jasowang
  Cc: Krishna Kumar2, arnd, netdev, virtualization, levinsasha928,
	davem, jeffrey.t.kirsher
In-Reply-To: <4ECE4004.8010107@redhat.com>

On Thu, Nov 24, 2011 at 09:00:52PM +0800, jasowang wrote:
> On 11/24/2011 07:14 PM, Krishna Kumar2 wrote:
> >"Michael S. Tsirkin"<mst@redhat.com>  wrote on 11/24/2011 03:29:03 PM:
> >
> >>Subject Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
> >>
> >>On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>>It was reported that the macvtap device selects a
> >>>different vhost (when used with multiqueue feature)
> >>>for incoming packets of a single connection. Use
> >>>packet hash first. Patch tested on MQ virtio_net.
> >>So this is sure to address the problem, why exactly does this happen?
> >>Does your device spread a single flow across multiple RX queues? Would
> >>not that cause trouble in the TCP layer?
> >>It would seem that using the recorded queue should be faster with
> >>less cache misses. Before we give up on that, I'd
> >>like to understand why it's wrong. Do you know?
> >I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
> >calls skb_record_rx_queue when a skb is allocated. When a packet
> >is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
> >recorded value is different for most skbs when I ran a single
> >stream TCP stream test (does skbs move between the rx_rings?).
> 
> Yes, it moves. It depends on last processor or tx queue who
> transmits the packets of this stream. Because ixgbe select tx queue
> based on the processor id, so if vhost thread transmits skbs on
> different processors, the skb of a single stream may comes from
> different rx ring.

Is that so? In that case, it looks like there's bug on the transmit
side, causing skbs to get scattered between ixgbe tx queues,
and the receive side trouble follows.

> >With this patch, macvtap selects the same device for each
> >incoming packet. I can add some debugs in ixgbe to see what is
> >happening also. I am not sure if Sasha was using a different
> >device.
> >
> >Cc'ing Jeffrey in case he can add something.
> >
> >thanks,
> >
> >- KK
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: jasowang @ 2011-11-24 13:00 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
	davem, jeffrey.t.kirsher
In-Reply-To: <OF12661931.5C4899F8-ON65257952.003B84A6-65257952.003D8DA6@in.ibm.com>

On 11/24/2011 07:14 PM, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"<mst@redhat.com>  wrote on 11/24/2011 03:29:03 PM:
>
>> Subject Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
>>
>> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>> It was reported that the macvtap device selects a
>>> different vhost (when used with multiqueue feature)
>>> for incoming packets of a single connection. Use
>>> packet hash first. Patch tested on MQ virtio_net.
>> So this is sure to address the problem, why exactly does this happen?
>> Does your device spread a single flow across multiple RX queues? Would
>> not that cause trouble in the TCP layer?
>> It would seem that using the recorded queue should be faster with
>> less cache misses. Before we give up on that, I'd
>> like to understand why it's wrong. Do you know?
> I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
> calls skb_record_rx_queue when a skb is allocated. When a packet
> is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
> recorded value is different for most skbs when I ran a single
> stream TCP stream test (does skbs move between the rx_rings?).

Yes, it moves. It depends on last processor or tx queue who transmits 
the packets of this stream. Because ixgbe select tx queue based on the 
processor id, so if vhost thread transmits skbs on different processors, 
the skb of a single stream may comes from different rx ring.

> With this patch, macvtap selects the same device for each
> incoming packet. I can add some debugs in ixgbe to see what is
> happening also. I am not sure if Sasha was using a different
> device.
>
> Cc'ing Jeffrey in case he can add something.
>
> thanks,
>
> - KK
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: jasowang @ 2011-11-24 12:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124103449.GA16031@redhat.com>

On 11/24/2011 06:34 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
>> On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>>> It was reported that the macvtap device selects a
>>>> different vhost (when used with multiqueue feature)
>>>> for incoming packets of a single connection. Use
>>>> packet hash first. Patch tested on MQ virtio_net.
>>> So this is sure to address the problem, why exactly does this happen?
>> Ixgbe has flow director and bind queue to host cpu, so it can make
>> sure the packet of a flow to be handled by the same queue/cpu. So
>> when vhost thread moves from one host cpu to another, ixgbe would
>> therefore send the packet to the new cpu/queue.
> Confused. How does ixgbe know about vhost thread moving?

As far as I can see, ixgbe binds queues to physical cpu, so let consider:

vhost thread transmits packets of flow A on processor M
during packet transmission, ixgbe driver programs the card to deliver 
the packet of flow A to queue/cpu M through flow director (see ixgbe_atr())
vhost thread then receives packet of flow A with from M
...
vhost thread transmits packets of flow A on processor N
ixgbe driver programs the flow director to change the delivery of flow A 
to queue N ( cpu N )
vhost thread then receives packet of flow A with from N
...

So, for a single flow A, we may get different queue mappings. Using 
rxhash instead may solve this issue.

>
>>> Does your device spread a single flow across multiple RX queues? Would
>>> not that cause trouble in the TCP layer?
>>> It would seem that using the recorded queue should be faster with
>>> less cache misses. Before we give up on that, I'd
>>> like to understand why it's wrong. Do you know?
>>>
>>>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>>>> ---
>>>>   drivers/net/macvtap.c |   16 ++++++++--------
>>>>   1 file changed, 8 insertions(+), 8 deletions(-)
>>>>
>>>> diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
>>>> --- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
>>>> +++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
>>>> @@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
>>>>   	if (!numvtaps)
>>>>   		goto out;
>>>>
>>>> +	/* Check if we can use flow to select a queue */
>>>> +	rxq = skb_get_rxhash(skb);
>>>> +	if (rxq) {
>>>> +		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
>>>> +		if (tap)
>>>> +			goto out;
>>>> +	}
>>>> +
>>>>   	if (likely(skb_rx_queue_recorded(skb))) {
>>>>   		rxq = skb_get_rx_queue(skb);
>>>>
>>>> @@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
>>>>   			goto out;
>>>>   	}
>>>>
>>>> -	/* Check if we can use flow to select a queue */
>>>> -	rxq = skb_get_rxhash(skb);
>>>> -	if (rxq) {
>>>> -		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
>>>> -		if (tap)
>>>> -			goto out;
>>>> -	}
>>>> -
>>>>   	/* Everything failed - find first available queue */
>>>>   	for (rxq = 0; rxq<   MAX_MACVTAP_QUEUES; rxq++) {
>>>>   		tap = rcu_dereference(vlan->taps[rxq]);
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-24 11:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: arnd, netdev, virtualization, levinsasha928, davem,
	jeffrey.t.kirsher
In-Reply-To: <20111124095902.GD14491@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/24/2011 03:29:03 PM:

> Subject Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
>
> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> > It was reported that the macvtap device selects a
> > different vhost (when used with multiqueue feature)
> > for incoming packets of a single connection. Use
> > packet hash first. Patch tested on MQ virtio_net.
>
> So this is sure to address the problem, why exactly does this happen?
> Does your device spread a single flow across multiple RX queues? Would
> not that cause trouble in the TCP layer?
> It would seem that using the recorded queue should be faster with
> less cache misses. Before we give up on that, I'd
> like to understand why it's wrong. Do you know?

I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
calls skb_record_rx_queue when a skb is allocated. When a packet
is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
recorded value is different for most skbs when I ran a single
stream TCP stream test (does skbs move between the rx_rings?).
With this patch, macvtap selects the same device for each
incoming packet. I can add some debugs in ixgbe to see what is
happening also. I am not sure if Sasha was using a different
device.

Cc'ing Jeffrey in case he can add something.

thanks,

- KK

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-24 10:34 UTC (permalink / raw)
  To: jasowang
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <4ECE18D5.3060605@redhat.com>

On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
> On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>It was reported that the macvtap device selects a
> >>different vhost (when used with multiqueue feature)
> >>for incoming packets of a single connection. Use
> >>packet hash first. Patch tested on MQ virtio_net.
> >So this is sure to address the problem, why exactly does this happen?
> 
> Ixgbe has flow director and bind queue to host cpu, so it can make
> sure the packet of a flow to be handled by the same queue/cpu. So
> when vhost thread moves from one host cpu to another, ixgbe would
> therefore send the packet to the new cpu/queue.

Confused. How does ixgbe know about vhost thread moving?

> >Does your device spread a single flow across multiple RX queues? Would
> >not that cause trouble in the TCP layer?
> >It would seem that using the recorded queue should be faster with
> >less cache misses. Before we give up on that, I'd
> >like to understand why it's wrong. Do you know?
> >
> >>Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> >>---
> >>  drivers/net/macvtap.c |   16 ++++++++--------
> >>  1 file changed, 8 insertions(+), 8 deletions(-)
> >>
> >>diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
> >>--- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
> >>+++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
> >>@@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
> >>  	if (!numvtaps)
> >>  		goto out;
> >>
> >>+	/* Check if we can use flow to select a queue */
> >>+	rxq = skb_get_rxhash(skb);
> >>+	if (rxq) {
> >>+		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>+		if (tap)
> >>+			goto out;
> >>+	}
> >>+
> >>  	if (likely(skb_rx_queue_recorded(skb))) {
> >>  		rxq = skb_get_rx_queue(skb);
> >>
> >>@@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
> >>  			goto out;
> >>  	}
> >>
> >>-	/* Check if we can use flow to select a queue */
> >>-	rxq = skb_get_rxhash(skb);
> >>-	if (rxq) {
> >>-		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> >>-		if (tap)
> >>-			goto out;
> >>-	}
> >>-
> >>  	/* Everything failed - find first available queue */
> >>  	for (rxq = 0; rxq<  MAX_MACVTAP_QUEUES; rxq++) {
> >>  		tap = rcu_dereference(vlan->taps[rxq]);
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: jasowang @ 2011-11-24 10:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124095902.GD14491@redhat.com>

On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>> It was reported that the macvtap device selects a
>> different vhost (when used with multiqueue feature)
>> for incoming packets of a single connection. Use
>> packet hash first. Patch tested on MQ virtio_net.
> So this is sure to address the problem, why exactly does this happen?

Ixgbe has flow director and bind queue to host cpu, so it can make sure 
the packet of a flow to be handled by the same queue/cpu. So when vhost 
thread moves from one host cpu to another, ixgbe would therefore send 
the packet to the new cpu/queue.
> Does your device spread a single flow across multiple RX queues? Would
> not that cause trouble in the TCP layer?
> It would seem that using the recorded queue should be faster with
> less cache misses. Before we give up on that, I'd
> like to understand why it's wrong. Do you know?
>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> ---
>>   drivers/net/macvtap.c |   16 ++++++++--------
>>   1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
>> --- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
>> +++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
>> @@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
>>   	if (!numvtaps)
>>   		goto out;
>>
>> +	/* Check if we can use flow to select a queue */
>> +	rxq = skb_get_rxhash(skb);
>> +	if (rxq) {
>> +		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
>> +		if (tap)
>> +			goto out;
>> +	}
>> +
>>   	if (likely(skb_rx_queue_recorded(skb))) {
>>   		rxq = skb_get_rx_queue(skb);
>>
>> @@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
>>   			goto out;
>>   	}
>>
>> -	/* Check if we can use flow to select a queue */
>> -	rxq = skb_get_rxhash(skb);
>> -	if (rxq) {
>> -		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
>> -		if (tap)
>> -			goto out;
>> -	}
>> -
>>   	/* Everything failed - find first available queue */
>>   	for (rxq = 0; rxq<  MAX_MACVTAP_QUEUES; rxq++) {
>>   		tap = rcu_dereference(vlan->taps[rxq]);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-11-24  9:59 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124081714.26635.68141.sendpatchset@krkumar2.in.ibm.com>

On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> It was reported that the macvtap device selects a
> different vhost (when used with multiqueue feature)
> for incoming packets of a single connection. Use
> packet hash first. Patch tested on MQ virtio_net.

So this is sure to address the problem, why exactly does this happen?
Does your device spread a single flow across multiple RX queues? Would
not that cause trouble in the TCP layer?
It would seem that using the recorded queue should be faster with
less cache misses. Before we give up on that, I'd
like to understand why it's wrong. Do you know?

> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  drivers/net/macvtap.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
> --- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
> +++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
> @@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
>  	if (!numvtaps)
>  		goto out;
>  
> +	/* Check if we can use flow to select a queue */
> +	rxq = skb_get_rxhash(skb);
> +	if (rxq) {
> +		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> +		if (tap)
> +			goto out;
> +	}
> +
>  	if (likely(skb_rx_queue_recorded(skb))) {
>  		rxq = skb_get_rx_queue(skb);
>  
> @@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
>  			goto out;
>  	}
>  
> -	/* Check if we can use flow to select a queue */
> -	rxq = skb_get_rxhash(skb);
> -	if (rxq) {
> -		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> -		if (tap)
> -			goto out;
> -	}
> -
>  	/* Everything failed - find first available queue */
>  	for (rxq = 0; rxq < MAX_MACVTAP_QUEUES; rxq++) {
>  		tap = rcu_dereference(vlan->taps[rxq]);

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: jasowang @ 2011-11-24  9:36 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: arnd, mst, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124081714.26635.68141.sendpatchset@krkumar2.in.ibm.com>

On 11/24/2011 04:17 PM, Krishna Kumar wrote:
> It was reported that the macvtap device selects a
> different vhost (when used with multiqueue feature)
> for incoming packets of a single connection. Use
> packet hash first. Patch tested on MQ virtio_net.
>
> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
> ---
>   drivers/net/macvtap.c |   16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
> --- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
> +++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
> @@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
>   	if (!numvtaps)
>   		goto out;
>
> +	/* Check if we can use flow to select a queue */
> +	rxq = skb_get_rxhash(skb);
> +	if (rxq) {
> +		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> +		if (tap)
> +			goto out;
> +	}
> +
>   	if (likely(skb_rx_queue_recorded(skb))) {
>   		rxq = skb_get_rx_queue(skb);
>   

Looks reasonable and we can improve this further by implementing  
smarter queue selection with the co-operation of guest. I think one 
initiate of this is to let the packets of a flow to be handled by the 
same guest cpu/vhost thread. This can be done by using accelerate rfs in 
virtio-net driver and 'tell' the macvtap/tap which queue should a packet 
go. I've started  working on prototype of this, it should solves the 
issue of packet steering in guest.

One more thought on this is,  If a nic (such as ixgbe) with flow 
director or similar technology is used, it can make sure the packet 
belongs to a flow to be handled by host cpu. If we can make use of this 
feature, more cache locality would be gained.
> @@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
>   			goto out;
>   	}
>
> -	/* Check if we can use flow to select a queue */
> -	rxq = skb_get_rxhash(skb);
> -	if (rxq) {
> -		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
> -		if (tap)
> -			goto out;
> -	}
> -
>   	/* Everything failed - find first available queue */
>   	for (rxq = 0; rxq<  MAX_MACVTAP_QUEUES; rxq++) {
>   		tap = rcu_dereference(vlan->taps[rxq]);
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar @ 2011-11-24  8:17 UTC (permalink / raw)
  To: arnd; +Cc: Krishna Kumar, mst, netdev, virtualization, levinsasha928, davem

It was reported that the macvtap device selects a
different vhost (when used with multiqueue feature)
for incoming packets of a single connection. Use
packet hash first. Patch tested on MQ virtio_net.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/macvtap.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff -ruNp org/drivers/net/macvtap.c new/drivers/net/macvtap.c
--- org/drivers/net/macvtap.c	2011-10-22 08:38:01.000000000 +0530
+++ new/drivers/net/macvtap.c	2011-11-16 18:34:51.000000000 +0530
@@ -175,6 +175,14 @@ static struct macvtap_queue *macvtap_get
 	if (!numvtaps)
 		goto out;
 
+	/* Check if we can use flow to select a queue */
+	rxq = skb_get_rxhash(skb);
+	if (rxq) {
+		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
+		if (tap)
+			goto out;
+	}
+
 	if (likely(skb_rx_queue_recorded(skb))) {
 		rxq = skb_get_rx_queue(skb);
 
@@ -186,14 +194,6 @@ static struct macvtap_queue *macvtap_get
 			goto out;
 	}
 
-	/* Check if we can use flow to select a queue */
-	rxq = skb_get_rxhash(skb);
-	if (rxq) {
-		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
-		if (tap)
-			goto out;
-	}
-
 	/* Everything failed - find first available queue */
 	for (rxq = 0; rxq < MAX_MACVTAP_QUEUES; rxq++) {
 		tap = rcu_dereference(vlan->taps[rxq]);

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-24  7:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <87ty5uxso3.fsf@rustcorp.com.au>

On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote:
> Because I suspect we'll be different enough anyway, once we change the
> way we allocate the ring, and write the alignment.

Well, it'd be easy to just add pa_high.

> It'll be *clearer* to have two completely separate paths than to fill
> with if() statements.

Well, look at my patches. See how each new feature basically adds *one*
if statement.

Yes, we could declare all existing features to be required,
this is what you are advocating. But in a couple of years
more if statements have accumulated and we still don't have
a flying car. Meanwhile the driver has a huge legacy section
which is a huge pain to maintain.

> And a rewrite won't hurt the driver.

Wow. Is everyone going for the rewrite these days?
After a lot of work we will be back at the point where we left,
with a minor tweak to enable a huge number of slow devices.
Meanwhile we have no time to work on real problems,
such as ring fragmentation that Avi pointed out,
support for very large requests.

And yes it will hurt, it will hurt bisectability and testability.
What you propose is a huge patch that just changes all of it.
If there's a breakage, bisect just points at a rewrite.
What I would like to see is incremental patches. So I would like to
1. Split driver config from common config
2. Split isr/notifications out too
3. Copy config in MMIO
4. Add a new config
5. Add length checks
6. Add 64 bit features
7. Add alignment programming

At each point we can bisect.  It worked well for event index, qemu had a
flag to turn it off, each time someone came in and went 'maybe it's
event index' we just tested without.

> But to be honest I don't really care about the Linux driver: we're
> steeped in this stuff and we'll get it right.

Maybe. What about windows drivers?

> But I'm *terrified* of making the spec more complex;

All you do is move stuff around. Why do you think it simplifies the spec
so much?

> implementations will get it wrong.

Just to clarify: you assume that if virtio pci spec is changed, then
there will be many more new implementations than existing ones, so it is
worth it to make life (temporarily, for several short years) harder for
all the existing ones?

Look at the qemu side for example. The way I proposed - with optional
capabilities, each one with a separate fallback - the change can be
implemented very gradually.  Each step will be bisectable, and testable.

> I *really* want to banish the legacy stuff to an appendix where noone will
> ever know it's there :)

This does not make life easy for implementations as they
need to implement it anyway. What makes life easy is
making new features optional, so you pick just what you like.

For example, we might want to backport some new feature into
rhel6.X at some point. I would like the diff to be small,
and easily understandable in its impact.

-- 
MST

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-24  6:24 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <87ty5uxso3.fsf@rustcorp.com.au>

On Thu, Nov 24, 2011 at 11:06:44AM +1030, Rusty Russell wrote:
> > > +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> > > +struct virtio_pci_common_cfg {
> > > +	/* About the whole device. */
> > > +	__u64 device_features;	/* read-only */
> > > +	__u64 guest_features;	/* read-write */
> > > +	__u64 queue_address;	/* read-write */
> > > +	__u16 msix_config;	/* read-write */
> > > +	__u8 device_status;	/* read-write */
> > > +	__u8 unused;
> > > +
> > > +	/* About a specific virtqueue. */
> > > +	__u16 queue_select;	/* read-write */
> > > +	__u16 queue_align;	/* read-write, power of 2. */
> > > +	__u16 queue_size;	/* read-write, power of 2. */
> > > +	__u16 queue_msix_vector;/* read-write */
> > > +};
> > 
> > Slightly confusing as the registers are in fact little endian ...
> 
> Good point, should mark them appropriately with __le16.  That makes it
> even clearer.
> 
> Thanks,
> Rusty.

Do we still require atomic access to fields in common cfg?
If yes it's a problem as some systems don't have 64 bit
addresses. If no, implementations might get harder.

-- 
MST

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Rusty Russell @ 2011-11-24  1:07 UTC (permalink / raw)
  To: Sasha Levin, Michael S. Tsirkin
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Amit Shah
In-Reply-To: <1322041120.3581.6.camel@lappy>

On Wed, 23 Nov 2011 11:38:40 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> > > +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> > > +struct virtio_pci_common_cfg {
> > > +	/* About the whole device. */
> > > +	__u64 device_features;	/* read-only */
> > > +	__u64 guest_features;	/* read-write */
> > 
> > We currently require atomic accesses to common fields.
> > Some architectures might not have such for 64 bit,
> > so these need to be split I think ...
> 
> We can consider stealing the feature implementation from virtio-mmio:
> Use the first 32 bits as a selector and the last as the features
> themselves.
> 
> It's more complex to work with, but it provides 2**32 32 feature bits
> (which should be enough for a while) and it solves the atomic access
> issue.

That works.  I don't expect we'll need more than 64 features given that
virtio_net hasn't seen a new one in over a year, but it's gone from 5 to
18 in 4 years, so another 32 bits at that rate only gets us another decade.

Unfortunately, it doesn't solve the queue_address problem.

We currently multiply the (32-bit) address by 4096 (the alignment).  If
drivers are going to reduce alignment, that makes it trickier, hence the
change here to a 64.  Simplicity.

We could cheat and assert that a 32-bit write there implies 0 in the
upper bits, and hope that all 64 bit platforms can do a 64-bit atomic
write.  Or insist the value not be intepreted until the guest writes its
(first) feature bit.

Perhaps we really need a "I'm done configuring!" trigger, now we can't
use the feature bit field for that.

Thoughts welcome...
Rusty.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Rusty Russell @ 2011-11-24  0:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <20111123084640.GE22734@redhat.com>

On Wed, 23 Nov 2011 10:46:41 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> > On Tue, 22 Nov 2011 20:36:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > Here's an updated vesion.
> > > I'm alternating between updating the spec and the driver,
> > > spec update to follow.
> > 
> > Don't touch the spec yet, we have a long way to go :(
> > 
> > I want the ability for driver to set the ring size, and the device to
> > set the alignment.
> 
> Did you mean driver to be able to set the alignment? This
> is what BIOS guys want - after BIOS completes, guest driver gets handed
> control and sets its own alignment to save memory.

Yep, sorry.

But we really do want the guest to set the ring size.  Because it has to
be guest-physical-contiguous, the host currently sets a very small ring,
because the guest is useless if it can't allocate.

Either way, it's now the driver's responsibility to write those fields.

> > That's a bigger change than you have here.
> 
> Why can't we just add the new registers at the end?
> With the new capability, we have as much space as we like for that.

We could, for sure.

> > I imagine it almost rips the driver into two completely different drivers.
> 
> If you insist on moving all the rest of registers around, certainly. But
> why do this?

Because I suspect we'll be different enough anyway, once we change the
way we allocate the ring, and write the alignment.  It'll be *clearer*
to have two completely separate paths than to fill with if() statements.
And a rewrite won't hurt the driver.

But to be honest I don't really care about the Linux driver: we're
steeped in this stuff and we'll get it right.  But I'm *terrified* of
making the spec more complex; implementations will get it wrong.  I
*really* want to banish the legacy stuff to an appendix where noone will
ever know it's there :)

> Renaming constants in exported headers will break userspace builds.
> Do we care? Why not?

As the patch shows, I decided not to do that.  It's a nice heads-up, but
breaking older versions of the code is just mean.  Hence this:

> > +#ifndef __KERNEL__
> > +/* Don't break compile of old userspace code.  These will go away. */
> > +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
> > +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
> > +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
> > +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
> > +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
> > +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
> > +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
> > +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
> > +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
> > +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
> > +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
> > +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
> > +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN
> > +#endif /* ...!KERNEL */

...
> > +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> > +struct virtio_pci_common_cfg {
> > +	/* About the whole device. */
> > +	__u64 device_features;	/* read-only */
> > +	__u64 guest_features;	/* read-write */
> > +	__u64 queue_address;	/* read-write */
> > +	__u16 msix_config;	/* read-write */
> > +	__u8 device_status;	/* read-write */
> > +	__u8 unused;
> > +
> > +	/* About a specific virtqueue. */
> > +	__u16 queue_select;	/* read-write */
> > +	__u16 queue_align;	/* read-write, power of 2. */
> > +	__u16 queue_size;	/* read-write, power of 2. */
> > +	__u16 queue_msix_vector;/* read-write */
> > +};
> 
> Slightly confusing as the registers are in fact little endian ...

Good point, should mark them appropriately with __le16.  That makes it
even clearer.

Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio-mmio: Devices parameter parsing
From: Pawel Moll @ 2011-11-23 18:08 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <87fwhhx9is.fsf@rustcorp.com.au>

On Tue, 2011-11-22 at 00:53 +0000, Rusty Russell wrote:
> > Any ideas?
> Perhaps not today.  So, we will need a linked list of devices and
> resources.  Yeah, that means allocating, which means YA
> slab_is_available()/alloc_bootmem() hack.

Hm. It doesn't sound like a good deal, really... Loads of tricks to get
data I need quite late in the boot process...

> Think of yourself as a pioneer...

I had a vague idea of "late parameters" last night, which would be
parsed once whole machinery is up and running. Will look around and try
to propose something.

On Tue, 2011-11-22 at 00:44 +0000, Rusty Russell wrote: 
> Or would it be simpler to enhance sscanf() with some weird format option
> for suffixing?  I haven't looked for similar cases, but I'd suspect a
> big win in general.
> 
> This would be neater than anything else we've got:
>         if (sscanf(device, "%llu@%llu[KMG]:%u", ...) != 3
>             && sscanf(device, "%llu@%llu[KMG]:%u:%u", ...) != 4)
>                 return -EINVAL;

sscanf was a good hint! Thanks, why haven't I thought of it myself? ;-)

That's what I came up with:

static int vm_cmdline_set(const char *device,
                const struct kernel_param *kp)
{
        struct resource resources[2] = {};
        char *str;
        long long int base;
        int processed, consumed = 0;
        struct platform_device *pdev;
        
        resources[0].flags = IORESOURCE_MEM;
        resources[1].flags = IORESOURCE_IRQ;
        
        resources[0].end = memparse(device, &str) - 1;
        
        processed = sscanf(str, "@%lli:%u%n:%d%n", 
                        &base, &resources[1].start, &consumed,
                        &vm_cmdline_id, &consumed);
                        
        if (processed < 2 || processed > 3 || str[consumed])
                return -EINVAL;
                
        resources[0].start = base;
        resources[0].end += base;
        resources[1].end = resources[1].start;

The only bit missing from sscanf() would be some sort of "%m" format,
which behaved like memparse and also processed unsigned number with "0
base" (hard to believe but the only "universal" - as in octal, dec and
hex - format is %i, which is signed). But still, looks quite neat
already :-)

I'll try to have a look at the "late parameters" idea tomorrow. Any
early warnings?

Cheers!

Pawel

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-23 15:34 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <20111123084640.GE22734@redhat.com>

On Wed, Nov 23, 2011 at 10:46:40AM +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> > On Tue, 22 Nov 2011 20:36:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > Here's an updated vesion.
> > > I'm alternating between updating the spec and the driver,
> > > spec update to follow.
> > 
> > Don't touch the spec yet, we have a long way to go :(
> > 
> > I want the ability for driver to set the ring size, and the device to
> > set the alignment.
> 
> Did you mean driver to be able to set the alignment? This
> is what BIOS guys want - after BIOS completes, guest driver gets handed
> control and sets its own alignment to save memory.
> 
> > That's a bigger change than you have here.
> 
> Why can't we just add the new registers at the end?
> With the new capability, we have as much space as we like for that.

Didn't have the time to finish the patch today, but just to clarify,
we can apply something like the below on top of my patch,
and then config_len can be checked to see whether the new fields
like alignment are available.

We probably can make the alignment smaller and save some memory -
the default value could set the default alignment.

Ring size, naturally, can just be made writeable - BIOS can put a
small value there, but linux guest would just use the defaults
so no need for any code changes at all.

As a bonus, the capability length is verified to be large enough.
The change's pretty small, isn't it?

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 681347b..39433d3 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -37,8 +37,9 @@ struct virtio_pci_device
 	struct virtio_device vdev;
 	struct pci_dev *pci_dev;
 
-	/* the IO address for the common PCI config space */
+	/* the IO address and length for the common PCI config space */
 	void __iomem *ioaddr;
+	unsigned config_len;
 	/* the IO address for device specific config */
 	void __iomem *ioaddr_device;
 	/* the IO address to use for notifications operations */
@@ -101,22 +102,24 @@ static void __iomem *virtio_pci_legacy_map(struct virtio_pci_device *vp_dev)
 #endif
 
 /*
- * With PIO, device-specific config moves as MSI-X is enabled/disabled.
- * Use this accessor to keep pointer to that config in sync.
+ * With PIO, device-specific config moves as MSI-X is enabled/disabled,
+ * configuration region length changes as well.  Use this accessor to keep
+ * pointer to that config and common config length, in sync.
  */
 static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled)
 {
 	void __iomem* m;
 	vp_dev->msix_enabled = enabled;
 	m = virtio_pci_legacy_map(vp_dev);
-	if (m)
+	if (m) {
 		vp_dev->ioaddr_device = m + VIRTIO_PCI_CONFIG(vp_dev);
-	else
+		vp_dev->config_len = VIRTIO_PCI_CONFIG(vp_dev);
+	} else
 		vp_dev->ioaddr_device = vp_dev->device_map;
 }
 
 static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id,
-					u32 align)
+					u32 align, unsigned *lenp)
 {
         u32 size;
         u32 offset;
@@ -160,12 +163,16 @@ static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap
         offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK;
 	/* Align offset appropriately */
 	offset &= ~(align - 1);
+	if (lenp && size < *lenp)
+		goto err;
 
 	/* It's possible for a device to supply a huge config space,
 	 * but we'll never need to map more than a page ATM. */
 	p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE);
 	if (!p)
 		dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory");
+	if (lenp)
+		*lenp = min(size, PAGE_SIZE);
 	return p;
 err:
 	dev_err(&vp_dev->vdev.dev, "Unable to parse virtio pci capability");
@@ -188,22 +195,24 @@ static void virtio_pci_iounmap(struct virtio_pci_device *vp_dev)
 
 static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 {
+	unsigned common_len = VIRTIO_PCI_COMMON_CFG_MINSIZE;
 	vp_dev->isr_map = virtio_pci_map_cfg(vp_dev,
 					     VIRTIO_PCI_CAP_ISR_CFG,
-					     sizeof(u8));
+					     sizeof(u8), NULL);
 	vp_dev->notify_map = virtio_pci_map_cfg(vp_dev,
 						VIRTIO_PCI_CAP_NOTIFY_CFG,
-						sizeof(u16));
+						sizeof(u16), NULL);
 	vp_dev->common_map = virtio_pci_map_cfg(vp_dev,
 						VIRTIO_PCI_CAP_COMMON_CFG,
-						sizeof(u32));
+						sizeof(u32), &common_len);
 	vp_dev->device_map = virtio_pci_map_cfg(vp_dev,
 						VIRTIO_PCI_CAP_DEVICE_CFG,
-						sizeof(u8));
+						sizeof(u8), NULL);
 
 	if (vp_dev->notify_map && vp_dev->isr_map &&
 	    vp_dev->common_map && vp_dev->device_map) {
 		vp_dev->ioaddr = vp_dev->common_map;
+		vp_dev->config_len = common_len;
 		vp_dev->ioaddr_isr = vp_dev->isr_map;
 		vp_dev->ioaddr_notify = vp_dev->notify_map;
 		vp_dev->ioaddr_device = vp_dev->device_map;
@@ -221,6 +230,7 @@ static int virtio_pci_iomap(struct virtio_pci_device *vp_dev)
 			goto err;
 		}
 		vp_dev->ioaddr = m;
+		vp_dev->config_len = VIRTIO_PCI_CONFIG(vp_dev);
 		vp_dev->ioaddr_isr = m + VIRTIO_PCI_ISR;
 		vp_dev->ioaddr_notify = m + VIRTIO_PCI_QUEUE_NOTIFY;
 		vp_dev->ioaddr_device = m + VIRTIO_PCI_CONFIG(vp_dev);
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index d6568e7..8133851 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -133,4 +133,6 @@
 #define VIRTIO_PCI_CAP_CFG_OFF_MASK	0xffffffff
 #define VIRTIO_PCI_CAP_CFG_OFF_SHIFT	0
 
+#define VIRTIO_PCI_COMMON_CFG_MINSIZE	24
+
 #endif

^ permalink raw reply related

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Amit Shah @ 2011-11-23 13:15 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Miche Baker-Harvey, linux-kernel,
	virtualization, Anton Blanchard, Mike Waychison, ppc-dev,
	Greg Kroah-Hartman, Eric Northrup
In-Reply-To: <1322053564.3581.19.camel@lappy>

On (Wed) 23 Nov 2011 [15:06:04], Sasha Levin wrote:
> On Wed, 2011-11-23 at 18:26 +0530, Amit Shah wrote:
> > On (Wed) 23 Nov 2011 [16:08:52], Amit Shah wrote:
> > > With this setup, with and without patches, I can spawn two consoles
> > > via:
> > > 
> > > /sbin/agetty /dev/hvc0 9600 vt100
> > > /sbin/agetty /dev/hvc1 9600 vt100
> > > 
> > > (Strange thing is, the second one gives a 'password incorrect' error
> > > on login attempts, while the first one logs in fine.  I do remember
> > > testing multiple consoles just fine a year and a half back, so no idea
> > > why this isn't behaving as expected -- but it mostly looks like a
> > > userspace issue rather than kernel one.)
> > 
> > Right -- when I test this on the Fedora 11 VM I used back then, the
> > two consoles work just fine without these patches.  When I use
> > something newer (F14), I get the weird password rejection, with and
> > without your patches.
> 
> It's not a weird password rejection, it's probably not set as a secure
> terminal :)
> 
> Try adding hvc1 to /etc/securetty on the guest.

Ah, of course :-)

		Amit

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Sasha Levin @ 2011-11-23 13:06 UTC (permalink / raw)
  To: Amit Shah
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Miche Baker-Harvey, linux-kernel,
	virtualization, Anton Blanchard, Mike Waychison, ppc-dev,
	Greg Kroah-Hartman, Eric Northrup
In-Reply-To: <20111123125657.GH16665@amit-x200.redhat.com>

On Wed, 2011-11-23 at 18:26 +0530, Amit Shah wrote:
> On (Wed) 23 Nov 2011 [16:08:52], Amit Shah wrote:
> > With this setup, with and without patches, I can spawn two consoles
> > via:
> > 
> > /sbin/agetty /dev/hvc0 9600 vt100
> > /sbin/agetty /dev/hvc1 9600 vt100
> > 
> > (Strange thing is, the second one gives a 'password incorrect' error
> > on login attempts, while the first one logs in fine.  I do remember
> > testing multiple consoles just fine a year and a half back, so no idea
> > why this isn't behaving as expected -- but it mostly looks like a
> > userspace issue rather than kernel one.)
> 
> Right -- when I test this on the Fedora 11 VM I used back then, the
> two consoles work just fine without these patches.  When I use
> something newer (F14), I get the weird password rejection, with and
> without your patches.

It's not a weird password rejection, it's probably not set as a secure
terminal :)

Try adding hvc1 to /etc/securetty on the guest.

-- 

Sasha.

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Amit Shah @ 2011-11-23 12:56 UTC (permalink / raw)
  To: Miche Baker-Harvey
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <20111123103852.GG16665@amit-x200.redhat.com>

On (Wed) 23 Nov 2011 [16:08:52], Amit Shah wrote:
> With this setup, with and without patches, I can spawn two consoles
> via:
> 
> /sbin/agetty /dev/hvc0 9600 vt100
> /sbin/agetty /dev/hvc1 9600 vt100
> 
> (Strange thing is, the second one gives a 'password incorrect' error
> on login attempts, while the first one logs in fine.  I do remember
> testing multiple consoles just fine a year and a half back, so no idea
> why this isn't behaving as expected -- but it mostly looks like a
> userspace issue rather than kernel one.)

Right -- when I test this on the Fedora 11 VM I used back then, the
two consoles work just fine without these patches.  When I use
something newer (F14), I get the weird password rejection, with and
without your patches.

		Amit

^ permalink raw reply

* Re: [PATCH v3 2/3] hvc_init(): Enforce one-time initialization.
From: Amit Shah @ 2011-11-23 10:38 UTC (permalink / raw)
  To: Miche Baker-Harvey
  Cc: Stephen Rothwell, xen-devel, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, linux-kernel, virtualization,
	Anton Blanchard, Mike Waychison, ppc-dev, Greg Kroah-Hartman,
	Eric Northrup
In-Reply-To: <CAB8RdapbaueyWLJuJDE_ZdvLkRNM4sQHTxgmgO=jrnXZ6YPSmA@mail.gmail.com>

On (Thu) 17 Nov 2011 [10:57:37], Miche Baker-Harvey wrote:
> Rusty, Michael, Stephen, et al,
> 
> Thanks for your comments on these patches.
> 
> For what I'm trying to do, all three patches are necessary, but maybe
> I'm going about it the wrong way. Your input would be appreciated.
> I'm in no way claiming that these patches are "right", just that it's
> working for me, and that what's in the current pool is not.
> 
> What I'm trying to do is:
> On X86,
> under KVM,
> start a virtio console device,
> with multiple ports on the device,
> at least one of which is also a console (as well as ttyS0).
> 
> (Eventually, we want to be able to add virtio console ports on the
> fly, and to have multiple virtio console ports be consoles.)

Are you using kvm-tool to do this?  QEMU can already hot-plug ports
and virtio-console (virtio-serial) devices.

> When all three of the patches are in place, this works great. (By
> great, I mean that getty's start up on all of ttyS0, hvc0 and hvc1,
> and console output goes to ttyS0 and to hvc0.
> "who" shows three users:  ttyS0, hvc0, and hvc1.
> "cat /proc/consoles" shows both ttyS0 and hvc0.
> I can use all three getty's, and console output really does appear on
> both the consoles.
> 
> Based on Rusty's comments, I tried removing each of the patches
> individually. Here's what happens today. I've seen other failure modes
> depending on what precisely I'm passing the guest.
> There's three patches:
> 1/3 "fix locking of vtermno"
> 2/3 "enforce one-time initialization with hvc_init
> "3/3 "use separate struct console * for each console"
> 
> If I remove the "fix locking of vtermno", I only get one virtio
> console terminal.  "who" shows the ttyS0 and the hvc0, and I can log
> into the gettys on both. I don't get the second virtio console getty.
> Interestingly, hvc0 shows up in /proc/consoles twice, and in fact the
> console output is dumped twice to hvc0 (as you'd expect from looking
> at printk.c, each line appears twice, followed by the next line.)

I don't really understand why.  "fix locking of vtermno" adds locks in
init_port_console(), which is called from add_port(), which should be
serialised due to port additions being on work items on the same
workqueue.  I don't see a race here.

> If I remove the "enforce one-time initialization with hvc_init" patch,
> which makes sure only a single thread is doing the hvc_init, and gates
> anyone from continuing until it has completed, I get different
> failures, including hangs, and dereferences of NULL pointers.
> 
> If I remove the "use separate struct console * for each console"patch,
> what I'm seeing now is that while all of ttyS0, hvc0, and hvc1 are
> present with gettys running on them, of the three, only ttyS0 is a
> console.

I don't see any difference in my testing with and without these
patches.

This is how I tested with qemu:

./x86_64-softmmu/qemu-system-x86_64 -m 512 -smp 2 -chardev
socket,path=/tmp/foo,server,nowait,id=foo -chardev
socket,path=/tmp/bar,server,nowait,id=bar -device virtio-serial
-device virtconsole,chardev=foo,nr=4 -device
virtconsole,chardev=bar,nr=3 -net none  -kernel
/home/amit/src/linux/arch/x86/boot/bzImage -append 'root=/dev/sda1
console=tty0 console=ttyS0' -initrd /tmp/initramfs.img
/guests/f14-nolvm.qcow2 -enable-kvm -snapshot

With this setup, with and without patches, I can spawn two consoles
via:

/sbin/agetty /dev/hvc0 9600 vt100
/sbin/agetty /dev/hvc1 9600 vt100

(Strange thing is, the second one gives a 'password incorrect' error
on login attempts, while the first one logs in fine.  I do remember
testing multiple consoles just fine a year and a half back, so no idea
why this isn't behaving as expected -- but it mostly looks like a
userspace issue rather than kernel one.)

As mentioned earlier, I've not looked at the hvc code, but given my
testing results, I'd like to first understand what you're seeing and
what your environment is.

		Amit

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Sasha Levin @ 2011-11-23  9:44 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Michael S. Tsirkin,
	Alexey Kardashevskiy, Wang Sheng-Hui, lkml - Kernel Mailing List,
	virtualization, Christian Borntraeger, Amit Shah
In-Reply-To: <87hb1v1scp.fsf@rustcorp.com.au>

On Wed, 2011-11-23 at 13:02 +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 20:36:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Here's an updated vesion.
> > I'm alternating between updating the spec and the driver,
> > spec update to follow.
> 
> Don't touch the spec yet, we have a long way to go :(
> 
> I want the ability for driver to set the ring size, and the device to
> set the alignment.  That's a bigger change than you have here.  I
> imagine it almost rips the driver into two completely different drivers.
> 
> This is the kind of thing I had in mind, for the header.  Want me to
> code up the rest?
> 
> (I've avoided adding the constants for the new layout: a struct is more
> compact and more descriptive).
> 
> Cheers,
> Rusty.
> 
> diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
> --- a/include/linux/virtio_pci.h
> +++ b/include/linux/virtio_pci.h
> @@ -42,56 +42,74 @@
>  #include <linux/virtio_config.h>
>  
>  /* A 32-bit r/o bitmask of the features supported by the host */
> -#define VIRTIO_PCI_HOST_FEATURES	0
> +#define VIRTIO_PCI_LEGACY_HOST_FEATURES		0
>  
>  /* A 32-bit r/w bitmask of features activated by the guest */
> -#define VIRTIO_PCI_GUEST_FEATURES	4
> +#define VIRTIO_PCI_LEGACY_GUEST_FEATURES	4
>  
>  /* A 32-bit r/w PFN for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_PFN		8
> +#define VIRTIO_PCI_LEGACY_QUEUE_PFN		8
>  
>  /* A 16-bit r/o queue size for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_NUM		12
> +#define VIRTIO_PCI_LEGACY_QUEUE_NUM		12
>  
>  /* A 16-bit r/w queue selector */
> -#define VIRTIO_PCI_QUEUE_SEL		14
> +#define VIRTIO_PCI_LEGACY_QUEUE_SEL		14
>  
>  /* A 16-bit r/w queue notifier */
> -#define VIRTIO_PCI_QUEUE_NOTIFY		16
> +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY		16
>  
>  /* An 8-bit device status register.  */
> -#define VIRTIO_PCI_STATUS		18
> +#define VIRTIO_PCI_LEGACY_STATUS		18
>  
>  /* An 8-bit r/o interrupt status register.  Reading the value will return the
>   * current contents of the ISR and will also clear it.  This is effectively
>   * a read-and-acknowledge. */
> -#define VIRTIO_PCI_ISR			19
> -
> -/* The bit of the ISR which indicates a device configuration change. */
> -#define VIRTIO_PCI_ISR_CONFIG		0x2
> +#define VIRTIO_PCI_LEGACY_ISR			19
>  
>  /* MSI-X registers: only enabled if MSI-X is enabled. */
>  /* A 16-bit vector for configuration changes. */
> -#define VIRTIO_MSI_CONFIG_VECTOR        20
> +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR        20
>  /* A 16-bit vector for selected queue notifications. */
> -#define VIRTIO_MSI_QUEUE_VECTOR         22
> -/* Vector value used to disable MSI for queue */
> -#define VIRTIO_MSI_NO_VECTOR            0xffff
> +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR         22
>  
>  /* The remaining space is defined by each driver as the per-driver
>   * configuration space */
> -#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
> +#define VIRTIO_PCI_LEGACY_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
> +
> +/* How many bits to shift physical queue address written to QUEUE_PFN.
> + * 12 is historical, and due to x86 page size. */
> +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT	12
> +
> +/* The alignment to use between consumer and producer parts of vring.
> + * x86 pagesize again. */
> +#define VIRTIO_PCI_LEGACY_VRING_ALIGN		4096
> +
> +#ifndef __KERNEL__
> +/* Don't break compile of old userspace code.  These will go away. */

Maybe wrap it in:
#ifndef VIRTIO_PCI_NO_LEGACY

#warning Please stop using old defines

[...]

> +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
> +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
> +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
> +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
> +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
> +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
> +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
> +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
> +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
> +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
> +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
> +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
> +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN

#endif /* VIRTIO_PCI_NO_LEGACY */

> +#endif /* ...!KERNEL */
>  
>  /* Virtio ABI version, this must match exactly */
>  #define VIRTIO_PCI_ABI_VERSION		0
>  
> -/* How many bits to shift physical queue address written to QUEUE_PFN.
> - * 12 is historical, and due to x86 page size. */
> -#define VIRTIO_PCI_QUEUE_ADDR_SHIFT	12
> +/* Vector value used to disable MSI for queue */
> +#define VIRTIO_MSI_NO_VECTOR            0xffff
>  
> -/* The alignment to use between consumer and producer parts of vring.
> - * x86 pagesize again. */
> -#define VIRTIO_PCI_VRING_ALIGN		4096
> +/* The bit of the ISR which indicates a device configuration change. */
> +#define VIRTIO_PCI_ISR_CONFIG		0x2
>  
>  /*
>   * Layout for Virtio PCI vendor specific capability (little-endian):
> @@ -133,4 +151,20 @@
>  #define VIRTIO_PCI_CAP_CFG_OFF_MASK	0xffffffff
>  #define VIRTIO_PCI_CAP_CFG_OFF_SHIFT	0
>  
> +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> +struct virtio_pci_common_cfg {
> +	/* About the whole device. */
> +	__u64 device_features;	/* read-only */
> +	__u64 guest_features;	/* read-write */
> +	__u64 queue_address;	/* read-write */

queue address should be in the queue specific fields below

> +	__u16 msix_config;	/* read-write */
> +	__u8 device_status;	/* read-write */
> +	__u8 unused;
> +
> +	/* About a specific virtqueue. */
> +	__u16 queue_select;	/* read-write */
> +	__u16 queue_align;	/* read-write, power of 2. */
> +	__u16 queue_size;	/* read-write, power of 2. */
> +	__u16 queue_msix_vector;/* read-write */

Maybe we should reserve a bunch of bytes here for future extensions of
virtqueues. Otherwise, non virtqueue options may be added here which
will cause fragmentation of virtqueue specific features.

> +};
>  #endif
> 
> 
> 

-- 

Sasha.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Sasha Levin @ 2011-11-23  9:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Amit Shah
In-Reply-To: <20111123084932.GF22734@redhat.com>

On Wed, 2011-11-23 at 10:49 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> > +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> > +struct virtio_pci_common_cfg {
> > +	/* About the whole device. */
> > +	__u64 device_features;	/* read-only */
> > +	__u64 guest_features;	/* read-write */
> 
> We currently require atomic accesses to common fields.
> Some architectures might not have such for 64 bit,
> so these need to be split I think ...

We can consider stealing the feature implementation from virtio-mmio:
Use the first 32 bits as a selector and the last as the features
themselves.

It's more complex to work with, but it provides 2**32 32 feature bits
(which should be enough for a while) and it solves the atomic access
issue.

> > +	__u64 queue_address;	/* read-write */
> > +	__u16 msix_config;	/* read-write */
> > +	__u8 device_status;	/* read-write */
> > +	__u8 unused;
> > +
> > +	/* About a specific virtqueue. */
> > +	__u16 queue_select;	/* read-write */
> > +	__u16 queue_align;	/* read-write, power of 2. */
> > +	__u16 queue_size;	/* read-write, power of 2. */
> > +	__u16 queue_msix_vector;/* read-write */
> > +};
> >  #endif
> > 
> > 

-- 

Sasha.

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-23  8:49 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <87hb1v1scp.fsf@rustcorp.com.au>

On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> +struct virtio_pci_common_cfg {
> +	/* About the whole device. */
> +	__u64 device_features;	/* read-only */
> +	__u64 guest_features;	/* read-write */

We currently require atomic accesses to common fields.
Some architectures might not have such for 64 bit,
so these need to be split I think ...

> +	__u64 queue_address;	/* read-write */
> +	__u16 msix_config;	/* read-write */
> +	__u8 device_status;	/* read-write */
> +	__u8 unused;
> +
> +	/* About a specific virtqueue. */
> +	__u16 queue_select;	/* read-write */
> +	__u16 queue_align;	/* read-write, power of 2. */
> +	__u16 queue_size;	/* read-write, power of 2. */
> +	__u16 queue_msix_vector;/* read-write */
> +};
>  #endif
> 
> 

^ permalink raw reply

* Re: [PATCHv3 RFC] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin @ 2011-11-23  8:46 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, kvm, Pawel Moll, Wang Sheng-Hui,
	Alexey Kardashevskiy, lkml - Kernel Mailing List, virtualization,
	Christian Borntraeger, Sasha Levin, Amit Shah
In-Reply-To: <87hb1v1scp.fsf@rustcorp.com.au>

On Wed, Nov 23, 2011 at 01:02:22PM +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 20:36:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > Here's an updated vesion.
> > I'm alternating between updating the spec and the driver,
> > spec update to follow.
> 
> Don't touch the spec yet, we have a long way to go :(
> 
> I want the ability for driver to set the ring size, and the device to
> set the alignment.

Did you mean driver to be able to set the alignment? This
is what BIOS guys want - after BIOS completes, guest driver gets handed
control and sets its own alignment to save memory.

> That's a bigger change than you have here.

Why can't we just add the new registers at the end?
With the new capability, we have as much space as we like for that.

> I imagine it almost rips the driver into two completely different drivers.

If you insist on moving all the rest of registers around, certainly. But
why do this?

> This is the kind of thing I had in mind, for the header.  Want me to
> code up the rest?
> 
> (I've avoided adding the constants for the new layout: a struct is more
> compact and more descriptive).
> 
> Cheers,
> Rusty.

Renaming constants in exported headers will break userspace builds.
Do we care? Why not?

> diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
> --- a/include/linux/virtio_pci.h
> +++ b/include/linux/virtio_pci.h
> @@ -42,56 +42,74 @@
>  #include <linux/virtio_config.h>
>  
>  /* A 32-bit r/o bitmask of the features supported by the host */
> -#define VIRTIO_PCI_HOST_FEATURES	0
> +#define VIRTIO_PCI_LEGACY_HOST_FEATURES		0
>  
>  /* A 32-bit r/w bitmask of features activated by the guest */
> -#define VIRTIO_PCI_GUEST_FEATURES	4
> +#define VIRTIO_PCI_LEGACY_GUEST_FEATURES	4
>  
>  /* A 32-bit r/w PFN for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_PFN		8
> +#define VIRTIO_PCI_LEGACY_QUEUE_PFN		8
>  
>  /* A 16-bit r/o queue size for the currently selected queue */
> -#define VIRTIO_PCI_QUEUE_NUM		12
> +#define VIRTIO_PCI_LEGACY_QUEUE_NUM		12
>  
>  /* A 16-bit r/w queue selector */
> -#define VIRTIO_PCI_QUEUE_SEL		14
> +#define VIRTIO_PCI_LEGACY_QUEUE_SEL		14
>  
>  /* A 16-bit r/w queue notifier */
> -#define VIRTIO_PCI_QUEUE_NOTIFY		16
> +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY		16
>  
>  /* An 8-bit device status register.  */
> -#define VIRTIO_PCI_STATUS		18
> +#define VIRTIO_PCI_LEGACY_STATUS		18
>  
>  /* An 8-bit r/o interrupt status register.  Reading the value will return the
>   * current contents of the ISR and will also clear it.  This is effectively
>   * a read-and-acknowledge. */
> -#define VIRTIO_PCI_ISR			19
> -
> -/* The bit of the ISR which indicates a device configuration change. */
> -#define VIRTIO_PCI_ISR_CONFIG		0x2
> +#define VIRTIO_PCI_LEGACY_ISR			19
>  
>  /* MSI-X registers: only enabled if MSI-X is enabled. */
>  /* A 16-bit vector for configuration changes. */
> -#define VIRTIO_MSI_CONFIG_VECTOR        20
> +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR        20
>  /* A 16-bit vector for selected queue notifications. */
> -#define VIRTIO_MSI_QUEUE_VECTOR         22
> -/* Vector value used to disable MSI for queue */
> -#define VIRTIO_MSI_NO_VECTOR            0xffff
> +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR         22
>  
>  /* The remaining space is defined by each driver as the per-driver
>   * configuration space */
> -#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
> +#define VIRTIO_PCI_LEGACY_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
> +
> +/* How many bits to shift physical queue address written to QUEUE_PFN.
> + * 12 is historical, and due to x86 page size. */
> +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT	12
> +
> +/* The alignment to use between consumer and producer parts of vring.
> + * x86 pagesize again. */
> +#define VIRTIO_PCI_LEGACY_VRING_ALIGN		4096
> +
> +#ifndef __KERNEL__
> +/* Don't break compile of old userspace code.  These will go away. */
> +#define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
> +#define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
> +#define VIRTIO_PCI_LEGACY_QUEUE_PFN VIRTIO_PCI_QUEUE_PFN
> +#define VIRTIO_PCI_LEGACY_QUEUE_NUM VIRTIO_PCI_QUEUE_NUM
> +#define VIRTIO_PCI_LEGACY_QUEUE_SEL VIRTIO_PCI_QUEUE_SEL
> +#define VIRTIO_PCI_LEGACY_QUEUE_NOTIFY VIRTIO_PCI_QUEUE_NOTIFY
> +#define VIRTIO_PCI_LEGACY_STATUS VIRTIO_PCI_STATUS
> +#define VIRTIO_PCI_LEGACY_ISR VIRTIO_PCI_ISR
> +#define VIRTIO_MSI_LEGACY_CONFIG_VECTOR VIRTIO_MSI_CONFIG_VECTOR
> +#define VIRTIO_MSI_LEGACY_QUEUE_VECTOR VIRTIO_MSI_QUEUE_VECTOR
> +#define VIRTIO_PCI_LEGACY_CONFIG(dev) VIRTIO_PCI_CONFIG(dev)
> +#define VIRTIO_PCI_LEGACY_QUEUE_ADDR_SHIFT VIRTIO_PCI_QUEUE_ADDR_SHIFT
> +#define VIRTIO_PCI_LEGACY_VRING_ALIGN VIRTIO_PCI_VRING_ALIGN
> +#endif /* ...!KERNEL */
>  
>  /* Virtio ABI version, this must match exactly */
>  #define VIRTIO_PCI_ABI_VERSION		0
>  
> -/* How many bits to shift physical queue address written to QUEUE_PFN.
> - * 12 is historical, and due to x86 page size. */
> -#define VIRTIO_PCI_QUEUE_ADDR_SHIFT	12
> +/* Vector value used to disable MSI for queue */
> +#define VIRTIO_MSI_NO_VECTOR            0xffff
>  
> -/* The alignment to use between consumer and producer parts of vring.
> - * x86 pagesize again. */
> -#define VIRTIO_PCI_VRING_ALIGN		4096
> +/* The bit of the ISR which indicates a device configuration change. */
> +#define VIRTIO_PCI_ISR_CONFIG		0x2
>  
>  /*
>   * Layout for Virtio PCI vendor specific capability (little-endian):
> @@ -133,4 +151,20 @@
>  #define VIRTIO_PCI_CAP_CFG_OFF_MASK	0xffffffff
>  #define VIRTIO_PCI_CAP_CFG_OFF_SHIFT	0
>  
> +/* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */
> +struct virtio_pci_common_cfg {
> +	/* About the whole device. */
> +	__u64 device_features;	/* read-only */
> +	__u64 guest_features;	/* read-write */
> +	__u64 queue_address;	/* read-write */
> +	__u16 msix_config;	/* read-write */
> +	__u8 device_status;	/* read-write */
> +	__u8 unused;
> +
> +	/* About a specific virtqueue. */
> +	__u16 queue_select;	/* read-write */
> +	__u16 queue_align;	/* read-write, power of 2. */
> +	__u16 queue_size;	/* read-write, power of 2. */
> +	__u16 queue_msix_vector;/* read-write */
> +};

Slightly confusing as the registers are in fact little endian ...

>  #endif
> 
> 

^ permalink raw reply

* Re: [PATCH 5 of 5] virtio: expose added descriptors immediately
From: Michael S. Tsirkin @ 2011-11-23  8:30 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Christoph Hellwig, linux-kernel, kvm, virtualization
In-Reply-To: <87ty5v1vqy.fsf@rustcorp.com.au>

On Wed, Nov 23, 2011 at 11:49:01AM +1030, Rusty Russell wrote:
> On Tue, 22 Nov 2011 08:29:08 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Tue, Nov 22, 2011 at 11:03:04AM +1030, Rusty Russell wrote:
> > > -	/* If you haven't kicked in this long, you're probably doing something
> > > -	 * wrong. */
> > > -	WARN_ON(vq->num_added > vq->vring.num);
> > > +	/* This is very unlikely, but theoretically possible.  Kick
> > > +	 * just in case. */
> > > +	if (unlikely(vq->num_added == 65535))
> > 
> > This is 0xffff but why use the decimal notation?
> 
> Interesting.  Why use hex?  Feels more like binary?

Just easier to see it's the largest 16 bit number.

> But I've changed it to "(1 << 16) - 1" to be clear.

That's even better.

> > > +		virtqueue_kick(_vq);
> > >  
> > >  	pr_debug("Added buffer head %i to %p\n", head, vq);
> > >  	END_USE(vq);
> > 
> > We also still need to reset vq->num_added, right?
> 
> virtqueue_kick does that for us.
> 
> Cheers,
> Rusty.

Right.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox