Netdev List


* Re: [PATCH net-next 1/2] tg3: Fix NETIF_F_LOOPBACK error
From: Matt Carlson @ 2011-05-20  1:59 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: Matthew Carlson, David Miller, linux-netdev
In-Reply-To: <BANLkTi==K_eTcqQ39HwcBcg9AyuMLuhz6w@mail.gmail.com>

On Thu, May 19, 2011 at 06:15:18PM -0700, Mahesh Bandewar wrote:
> On Thu, May 19, 2011 at 6:11 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> > Mahesh Bandewar noticed that the features cleanup in commit
> > 0da0606f493c5cdab74bdcc96b12f4305ad94085, entitled
> > "tg3: Consolidate all netdev feature assignments", mistakenly sets
> > NETIF_F_LOOPBACK by default.  This patch corrects the error.
> >
> > Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
> > ---
> >  drivers/net/tg3.c |    3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> > index 012ce70..0b78c5d 100644
> > --- a/drivers/net/tg3.c
> > +++ b/drivers/net/tg3.c
> > @@ -15080,6 +15080,8 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >                        features |= NETIF_F_TSO_ECN;
> >        }
> >
> > +       dev->features |= features;
> > +
> >        /*
> >         * Add loopback capability only for a subset of devices that support
> >         * MAC-LOOPBACK. Eventually this need to be enhanced to allow INT-PHY
> > @@ -15090,7 +15092,6 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >                /* Add the loopback capability */
> >                features |= NETIF_F_LOOPBACK;
> >
> > -       dev->features |= features;
> >        dev->hw_features |= features;
> >        dev->vlan_features |= features;
> I think this line should go up too. Otherwise newly created vlan
> device(s) will have spurious loopback bit set.

Yes.  You are right.  I thought vlan_features functioned like
hw_features.



* Re: [PATCH net-next 1/2] tg3: Fix NETIF_F_LOOPBACK error
From: Mahesh Bandewar @ 2011-05-20  1:15 UTC (permalink / raw)
  To: Matt Carlson; +Cc: David Miller, linux-netdev
In-Reply-To: <1305853864-2135-2-git-send-email-mcarlson@broadcom.com>

On Thu, May 19, 2011 at 6:11 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> Mahesh Bandewar noticed that the features cleanup in commit
> 0da0606f493c5cdab74bdcc96b12f4305ad94085, entitled
> "tg3: Consolidate all netdev feature assignments", mistakenly sets
> NETIF_F_LOOPBACK by default.  This patch corrects the error.
>
> Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
> ---
>  drivers/net/tg3.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index 012ce70..0b78c5d 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -15080,6 +15080,8 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>                        features |= NETIF_F_TSO_ECN;
>        }
>
> +       dev->features |= features;
> +
>        /*
>         * Add loopback capability only for a subset of devices that support
>         * MAC-LOOPBACK. Eventually this need to be enhanced to allow INT-PHY
> @@ -15090,7 +15092,6 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>                /* Add the loopback capability */
>                features |= NETIF_F_LOOPBACK;
>
> -       dev->features |= features;
>        dev->hw_features |= features;
>        dev->vlan_features |= features;
I think this line should go up too. Otherwise newly created vlan
device(s) will have spurious loopback bit set.
>
> --
> 1.7.3.4
>
>


* [PATCH net-next 0/2] tg3: Quickfixes
From: Matt Carlson @ 2011-05-20  1:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, mcarlson

This patchset applies some quickfixes to the previous patchset.




* [PATCH net-next 1/2] tg3: Fix NETIF_F_LOOPBACK error
From: Matt Carlson @ 2011-05-20  1:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, mcarlson

Mahesh Bandewar noticed that the features cleanup in commit
0da0606f493c5cdab74bdcc96b12f4305ad94085, entitled
"tg3: Consolidate all netdev feature assignments", mistakenly sets
NETIF_F_LOOPBACK by default.  This patch corrects the error.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
---
 drivers/net/tg3.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 012ce70..0b78c5d 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -15080,6 +15080,8 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 			features |= NETIF_F_TSO_ECN;
 	}
 
+	dev->features |= features;
+
 	/*
 	 * Add loopback capability only for a subset of devices that support
 	 * MAC-LOOPBACK. Eventually this need to be enhanced to allow INT-PHY
@@ -15090,7 +15092,6 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 		/* Add the loopback capability */
 		features |= NETIF_F_LOOPBACK;
 
-	dev->features |= features;
 	dev->hw_features |= features;
 	dev->vlan_features |= features;
 
-- 
1.7.3.4




* [PATCH net-next 2/2] tg3: Add braces around 5906 workaround.
From: Matt Carlson @ 2011-05-20  1:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, mcarlson

Commit dabc5c670d3f86d15ee4f42ab38ec5bd2682487d, entitled
"tg3: Move TSO_CAPABLE assignment", moved some TSO flagging code around.
In the process it failed to add braces around an exceptional 5906
condition.  This patch fixes the problem.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
---
 drivers/net/tg3.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 0b78c5d..1d91b35 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -13707,9 +13707,11 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 				     tp->pcie_cap + PCI_EXP_LNKCTL,
 				     &lnkctl);
 		if (lnkctl & PCI_EXP_LNKCTL_CLKREQ_EN) {
-			if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
+			if (GET_ASIC_REV(tp->pci_chip_rev_id) ==
+			    ASIC_REV_5906) {
 				tg3_flag_clear(tp, HW_TSO_2);
 				tg3_flag_clear(tp, TSO_CAPABLE);
+			}
 			if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5784 ||
 			    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5761 ||
 			    tp->pci_chip_rev_id == CHIPREV_ID_57780_A0 ||
-- 
1.7.3.4




* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Rick Jones @ 2011-05-20  0:46 UTC (permalink / raw)
  To: Ben Greear; +Cc: Stephen Hemminger, netdev
In-Reply-To: <4DD5B7B3.2000505@candelatech.com>

On Thu, 2011-05-19 at 17:37 -0700, Ben Greear wrote:
> On 05/19/2011 05:24 PM, Rick Jones wrote:
> >>>> [root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
> >>>> tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
> >>>> tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
> >>>> tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
> >>>> tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
> >>>> tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
> >>>> tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
> >>>> tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
> >>>> tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
> >>>> tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED
> >>>
> >>> I take it your system has higher values for the tcp_wmem value:
> >>>
> >>> net.ipv4.tcp_wmem = 4096 16384 4194304
> >>
> >> Yes:
> >> [root@i7-965-1 igb]# cat /proc/sys/net/ipv4/tcp_wmem
> >> 4096	16384	50000000
> >
> > Why?!?  Are you trying to get link-rate to Mars or something?  (I assume
> > tcp_rmem is similarly set...)  If you are indeed doing one 1 GbE, and no
> > more than 100ms then the default (?) of 4194304 should have been more
> > than sufficient.
> 
> Well, we occasionally do tests over emulated links that have several
> seconds of delay and may be running multiple Gbps.  Either way,
> I'd hope that offering extra RAM to a subsystem wouldn't cause it
> to go nuts.  

It has been my experience that the autotuning tends to grow things
beyond the bandwidthXdelay product.

As for several seconds of delay and multiple Gbps - unless you are
shooting the Moon, sounds like bufferbloat?-)

> Assuming this isn't some magical 1Gbps issue, you
> could probably hit the same problem with a wifi link and
> default tcp_wmem settings...

Do you also increase tx queues for the NIC(s)?

rick



* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Ben Greear @ 2011-05-20  0:37 UTC (permalink / raw)
  To: rick.jones2; +Cc: Stephen Hemminger, netdev
In-Reply-To: <1305851079.8149.1127.camel@tardy>

On 05/19/2011 05:24 PM, Rick Jones wrote:
>>>> [root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
>>>> tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
>>>> tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
>>>> tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
>>>> tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
>>>> tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
>>>> tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
>>>> tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
>>>> tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
>>>> tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED
>>>
>>> I take it your system has higher values for the tcp_wmem value:
>>>
>>> net.ipv4.tcp_wmem = 4096 16384 4194304
>>
>> Yes:
>> [root@i7-965-1 igb]# cat /proc/sys/net/ipv4/tcp_wmem
>> 4096	16384	50000000
>
> Why?!?  Are you trying to get link-rate to Mars or something?  (I assume
> tcp_rmem is similarly set...)  If you are indeed doing one 1 GbE, and no
> more than 100ms then the default (?) of 4194304 should have been more
> than sufficient.

Well, we occasionally do tests over emulated links that have several
seconds of delay and may be running multiple Gbps.  Either way,
I'd hope that offering extra RAM to a subsystem wouldn't cause it
to go nuts.  Assuming this isn't some magical 1Gbps issue, you
could probably hit the same problem with a wifi link and
default tcp_wmem settings...

>>> and whatever is creating the TCP connections is not making explicit
>>> setsockopt() calls to set SO_*BUF.
>>
>> It is configured not to, but if you know of an independent way to verify
>> that, I'm interested.
>
> You could always strace the code.

Yeah...might be easier in this case to just comment out all those calls
and do a quick test.  Will be tomorrow before I can get to
that, however..

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Rick Jones @ 2011-05-20  0:24 UTC (permalink / raw)
  To: Ben Greear; +Cc: Stephen Hemminger, netdev
In-Reply-To: <4DD5B202.7080701@candelatech.com>

> >> [root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
> >> tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
> >> tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
> >> tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
> >> tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
> >> tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
> >> tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
> >> tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
> >> tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
> >> tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED
> >
> > I take it your system has higher values for the tcp_wmem value:
> >
> > net.ipv4.tcp_wmem = 4096 16384 4194304
> 
> Yes:
> [root@i7-965-1 igb]# cat /proc/sys/net/ipv4/tcp_wmem
> 4096	16384	50000000

Why?!?  Are you trying to get link-rate to Mars or something?  (I assume
tcp_rmem is similarly set...)  If you are indeed doing one 1 GbE, and no
more than 100ms then the default (?) of 4194304 should have been more
than sufficient.

> > and whatever is creating the TCP connections is not making explicit
> > setsockopt() calls to set SO_*BUF.
> 
> It is configured not to, but if you know of an independent way to verify
> that, I'm interested.

You could always strace the code.

rick



* Re: [PATCH net-next-2.6] macvlan: remove one synchronize_rcu() call
From: Ben Greear @ 2011-05-20  0:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1305843856.3156.36.camel@edumazet-laptop>

On 05/19/2011 03:24 PM, Eric Dumazet wrote:
> When one macvlan device is dismantled, we can avoid one
> synchronize_rcu() call done after deletion from hash list, since caller
> will perform a synchronize_net() call after its ndo_stop() call.
>
> Add a new netdev->dismantle field to signal this dismantle intent.

I applied this to today's wireless-testing kernel.  There is a consistent
speedup in deleting mac-vlans!  I wouldn't read much into changes in
creating macvlans or adding IPs..those numbers just jump around a bit
from run to run.

Before the patch:

[root@lec2010-ath9k-1 lanforge]# /mnt/b32/greearb/tmp/test_macvlans.pl 500 macvlan
Creating 500 macvlan.
Created 500 macvlan in 12.662865 seconds (0.02532573 per interface).
Added IP addresses in 9.104435 seconds (0.01820887 per addr).
Deleted 500 macvlan in 25.424282 seconds. (0.050848564 per interface)

After the patch:

[root@lec2010-ath9k-1 lanforge]# /mnt/b32/greearb/tmp/test_macvlans.pl 500 macvlan
Creating 500 macvlan.
Created 500 macvlan in 12.461308 seconds (0.024922616 per interface).
Added IP addresses in 8.787694 seconds (0.017575388 per addr).
Deleted 500 macvlan in 21.831413 seconds. (0.043662826 per interface)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Ben Greear @ 2011-05-20  0:12 UTC (permalink / raw)
  To: rick.jones2; +Cc: Stephen Hemminger, netdev
In-Reply-To: <1305849940.8149.1122.camel@tardy>

On 05/19/2011 05:05 PM, Rick Jones wrote:
> On Thu, 2011-05-19 at 16:42 -0700, Ben Greear wrote:
>> On 05/19/2011 04:20 PM, Ben Greear wrote:
>>> On 05/19/2011 04:18 PM, Stephen Hemminger wrote:
>>
>>>> If you overdrive, TCP expects your network emulator to have
>>>> some but limited queueing (like a real router).
>>>
>>> The emulator is fine, it's not being over-driven (and has limited
>>> queueing if it was
>>> being over-driven). The queues that are backing up are in the tcp
>>> sockets on the
>>> sending machine.
>>>
>>> But, just to make sure, I'll re-run the test with a looped back cable...
>>
>> Well, with looped back cable, it isn't so bad.  I still see a small drop
>> in aggregate throughput (around 900Mbps instead of 950Mbps), and
>> latency goes above 600ms, but it still performs better than when
>> going through the emulator.
>>
>> At 950+Mbps, the emulator is going to impart 1-2 ms of latency
>> even when configured for wide-open.
>>
>> If I use a bridge in place of the emulator, it seems to settle on
>> around 450Mbps in one direction and 945Mbps in the other (on the wire),
>> with round-trip latencies often over 5 seconds (user-space to user-space),
>> and a consistent large chunk of data in the socket send buffers:
>>
>> [root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
>> tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
>> tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
>> tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
>> tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
>> tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
>> tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
>> tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
>> tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
>> tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED
>
> I take it your system has higher values for the tcp_wmem value:
>
> net.ipv4.tcp_wmem = 4096 16384 4194304

Yes:
[root@i7-965-1 igb]# cat /proc/sys/net/ipv4/tcp_wmem
4096	16384	50000000

> and whatever is creating the TCP connections is not making explicit
> setsockopt() calls to set SO_*BUF.

It is configured not to, but if you know of an independent way to verify
that, I'm interested.

Thanks,
Ben

>
> rick jones


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Rick Jones @ 2011-05-20  0:05 UTC (permalink / raw)
  To: Ben Greear; +Cc: Stephen Hemminger, netdev
In-Reply-To: <4DD5AAFC.8070509@candelatech.com>

On Thu, 2011-05-19 at 16:42 -0700, Ben Greear wrote:
> On 05/19/2011 04:20 PM, Ben Greear wrote:
> > On 05/19/2011 04:18 PM, Stephen Hemminger wrote:
> 
> >> If you overdrive, TCP expects your network emulator to have
> >> some but limited queueing (like a real router).
> >
> > The emulator is fine, it's not being over-driven (and has limited
> > queueing if it was
> > being over-driven). The queues that are backing up are in the tcp
> > sockets on the
> > sending machine.
> >
> > But, just to make sure, I'll re-run the test with a looped back cable...
> 
> Well, with looped back cable, it isn't so bad.  I still see a small drop
> in aggregate throughput (around 900Mbps instead of 950Mbps), and
> latency goes above 600ms, but it still performs better than when
> going through the emulator.
> 
> At 950+Mbps, the emulator is going to impart 1-2 ms of latency
> even when configured for wide-open.
> 
> If I use a bridge in place of the emulator, it seems to settle on
> around 450Mbps in one direction and 945Mbps in the other (on the wire),
> with round-trip latencies often over 5 seconds (user-space to user-space),
> and a consistent large chunk of data in the socket send buffers:
> 
> [root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
> tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
> tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
> tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
> tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
> tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
> tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
> tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
> tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
> tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED

I take it your system has higher values for the tcp_wmem value:

net.ipv4.tcp_wmem = 4096 16384 4194304

and whatever is creating the TCP connections is not making explicit
setsockopt() calls to set SO_*BUF.

rick jones



* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Ben Greear @ 2011-05-19 23:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <4DD5A5CD.7040303@candelatech.com>

On 05/19/2011 04:20 PM, Ben Greear wrote:
> On 05/19/2011 04:18 PM, Stephen Hemminger wrote:

>> If you overdrive, TCP expects your network emulator to have
>> some but limited queueing (like a real router).
>
> The emulator is fine, it's not being over-driven (and has limited
> queueing if it was
> being over-driven). The queues that are backing up are in the tcp
> sockets on the
> sending machine.
>
> But, just to make sure, I'll re-run the test with a looped back cable...

Well, with looped back cable, it isn't so bad.  I still see a small drop
in aggregate throughput (around 900Mbps instead of 950Mbps), and
latency goes above 600ms, but it still performs better than when
going through the emulator.

At 950+Mbps, the emulator is going to impart 1-2 ms of latency
even when configured for wide-open.

If I use a bridge in place of the emulator, it seems to settle on
around 450Mbps in one direction and 945Mbps in the other (on the wire),
with round-trip latencies often over 5 seconds (user-space to user-space),
and a consistent large chunk of data in the socket send buffers:

[root@i7-965-1 igb]# netstat -an|grep tcp|grep 8.1.1
tcp        0      0 8.1.1.1:33038               0.0.0.0:*                   LISTEN
tcp        0      0 8.1.1.1:33040               0.0.0.0:*                   LISTEN
tcp        0      0 8.1.1.1:33042               0.0.0.0:*                   LISTEN
tcp        0 9328612 8.1.1.2:33039               8.1.1.1:33040               ESTABLISHED
tcp        0 17083176 8.1.1.1:33038               8.1.1.2:33037               ESTABLISHED
tcp        0 9437340 8.1.1.2:33037               8.1.1.1:33038               ESTABLISHED
tcp        0 17024620 8.1.1.1:33040               8.1.1.2:33039               ESTABLISHED
tcp        0 19557040 8.1.1.1:33042               8.1.1.2:33041               ESTABLISHED
tcp        0 9416600 8.1.1.2:33041               8.1.1.1:33042               ESTABLISHED

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-19 23:41 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1305834169.32080.81.camel@localhost.localdomain>

On Thu, May 19, 2011 at 12:42:49PM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > > > Yes, I agree.  I think for tcpdump, we really need to copy the
> > > data
> > > > > anyway, to avoid guest changing it in between.  So we do that
> > and
> > > then
> > > > > use the copy everywhere, release the old one. Hmm? 
> > > > 
> > > > Yes. Old one use zerocopy, new one use copy data.
> > > > 
> > > > Thanks
> > > > Shirley
> > > 
> > > No, that's wrong, as they might become different with a
> > > malicious guest. As long as we copied already, let's release
> > > the data and have everyone use the copy. 
> > 
> > Ok, I will patch pskb_expand_head to test it out. 
> 
> I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to
> convert a zero-copy skb to a copy skb to avoid this kind of issue.
> 
> This overhead won't impact macvtap/vhost TX zero-copy normally.
> 
> Shirley

OK, that will handle packet socket at least in that it won't crash :)

So the requirements are
- data must be released in a timely fashion (e.g. unlike virtio-net
  tun or bridge)
- no filtering based on data (data is mapped in guest)
- SG support
- HIGHDMA support (on arches where this makes sense)
- on fast path no calls to skb_copy, skb_clone, pskb_copy,
  pskb_expand_head as these are slow

First 2 requirements are a must, all other requirements
are just dependencies to make sure zero copy will be faster
than non zero copy.

Using a new feature bit is probably the simplest approach to
this. macvtap on top of most physical NICs most likely works
correctly so it seems a bit more work than it needs to be,
but it's also the safest one I think ...

-- 
MST


* Re: [PATCH net-next 10/13] tg3: Consolidate all netdev feature assignments
From: Matt Carlson @ 2011-05-19 23:51 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: Matthew Carlson, David Miller, linux-netdev
In-Reply-To: <BANLkTim7uERxfwd2isfBUbX7WS8-X-Rj9w@mail.gmail.com>

On Thu, May 19, 2011 at 04:15:52PM -0700, Mahesh Bandewar wrote:
> On Thu, May 19, 2011 at 3:12 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> > This patch consolidates all the netdev feature bit assignments to one
> > location.
> >
> > Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
> > Reviewed-by: Michael Chan <mchan@broadcom.com>
> > ---
> >  drivers/net/tg3.c |   51 ++++++++++++++++++++++++---------------------------
> >  1 files changed, 24 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> > index 09fe067..5bf2ce1 100644
> > --- a/drivers/net/tg3.c
> > +++ b/drivers/net/tg3.c
> > @@ -13602,19 +13602,6 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
> >            tg3_flag(tp, 5750_PLUS))
> >                tg3_flag_set(tp, 5705_PLUS);
> >
> > -       /* 5700 B0 chips do not support checksumming correctly due
> > -        * to hardware bugs.
> > -        */
> > -       if (tp->pci_chip_rev_id != CHIPREV_ID_5700_B0) {
> > -               u32 features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
> > -
> > -               if (tg3_flag(tp, 5755_PLUS))
> > -                       features |= NETIF_F_IPV6_CSUM;
> > -               tp->dev->features |= features;
> > -               tp->dev->hw_features |= features;
> > -               tp->dev->vlan_features |= features;
> > -       }
> > -
> >        /* Determine TSO capabilities */
> >        if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5719)
> >                ; /* Do nothing. HW bug. */
> > @@ -14922,7 +14909,7 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >        u32 sndmbx, rcvmbx, intmbx;
> >        char str[40];
> >        u64 dma_mask, persist_dma_mask;
> > -       u32 hw_features = 0;
> > +       u32 features = 0;
> >
> >        printk_once(KERN_INFO "%s\n", version);
> >
> > @@ -14958,8 +14945,6 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >
> >        SET_NETDEV_DEV(dev, &pdev->dev);
> >
> > -       dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
> > -
> >        tp = netdev_priv(dev);
> >        tp->pdev = pdev;
> >        tp->dev = dev;
> > @@ -15039,7 +15024,7 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >        if (dma_mask > DMA_BIT_MASK(32)) {
> >                err = pci_set_dma_mask(pdev, dma_mask);
> >                if (!err) {
> > -                       dev->features |= NETIF_F_HIGHDMA;
> > +                       features |= NETIF_F_HIGHDMA;
> >                        err = pci_set_consistent_dma_mask(pdev,
> >                                                          persist_dma_mask);
> >                        if (err < 0) {
> > @@ -15060,6 +15045,18 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >
> >        tg3_init_bufmgr_config(tp);
> >
> > +       features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
> > +
> > +       /* 5700 B0 chips do not support checksumming correctly due
> > +        * to hardware bugs.
> > +        */
> > +       if (tp->pci_chip_rev_id != CHIPREV_ID_5700_B0) {
> > +               features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
> > +
> > +               if (tg3_flag(tp, 5755_PLUS))
> > +                       features |= NETIF_F_IPV6_CSUM;
> > +       }
> > +
> >        /* TSO is on by default on chips that support hardware TSO.
> >         * Firmware TSO on older chips gives lower performance, so it
> >         * is off by default, but can be enabled using ethtool.
> > @@ -15067,24 +15064,20 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >        if ((tg3_flag(tp, HW_TSO_1) ||
> >             tg3_flag(tp, HW_TSO_2) ||
> >             tg3_flag(tp, HW_TSO_3)) &&
> > -           (dev->features & NETIF_F_IP_CSUM))
> > -               hw_features |= NETIF_F_TSO;
> > +           (features & NETIF_F_IP_CSUM))
> > +               features |= NETIF_F_TSO;
> >        if (tg3_flag(tp, HW_TSO_2) || tg3_flag(tp, HW_TSO_3)) {
> > -               if (dev->features & NETIF_F_IPV6_CSUM)
> > -                       hw_features |= NETIF_F_TSO6;
> > +               if (features & NETIF_F_IPV6_CSUM)
> > +                       features |= NETIF_F_TSO6;
> >                if (tg3_flag(tp, HW_TSO_3) ||
> >                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5761 ||
> >                    (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5784 &&
> >                     GET_CHIP_REV(tp->pci_chip_rev_id) != CHIPREV_5784_AX) ||
> >                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5785 ||
> >                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_57780)
> > -                       hw_features |= NETIF_F_TSO_ECN;
> > +                       features |= NETIF_F_TSO_ECN;
> >        }
> >
> > -       dev->hw_features |= hw_features;
> > -       dev->features |= hw_features;
> > -       dev->vlan_features |= hw_features;
> > -
> >        /*
> >         * Add loopback capability only for a subset of devices that support
> >         * MAC-LOOPBACK. Eventually this need to be enhanced to allow INT-PHY
> > @@ -15093,7 +15086,11 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
> >        if (GET_ASIC_REV(tp->pci_chip_rev_id) != ASIC_REV_5780 &&
> >            !tg3_flag(tp, CPMU_PRESENT))
> >                /* Add the loopback capability */
> > -               dev->hw_features |= NETIF_F_LOOPBACK;
> > +               features |= NETIF_F_LOOPBACK;
> > +
> > +       dev->features |= features;
> This will set the loopback by default for the described sub-set.
> > +       dev->hw_features |= features;
> Yes, you just want to add that here only.

Yes.  Good catch.  I'll post a follow on patch ASAP.



* Re: [PATCH] drivers/net: ks8842 Fix crash on received packet when in PIO mode.
From: Dennis Aberilla @ 2011-05-19 23:30 UTC (permalink / raw)
  To: David Miller; +Cc: info, netdev
In-Reply-To: <20110519.161228.1163475688237712245.davem@davemloft.net>

On Thu, May 19, 2011 at 04:12:28PM -0400, David Miller wrote:
> From: Dennis Aberilla <dennis.aberilla@mimomax.com>
> Date: Thu, 19 May 2011 10:59:47 +1200
> 
> > This patch fixes a kernel crash during packet reception due to not
> > enough allocated bytes for the skb. This applies to the driver when
> > running in PIO mode in an ISA bus setup.
> > 
> > Signed-off-by: Dennis Aberilla <denzzzhome@yahoo.com>
> 
> If you're trying to accommodate the fact that the loops iterate
> always by 2 or 4 bytes at a time, then you will need to allocate
> up to "3" bytes of slack space, not just "2".
> 
> You need to describe exactly what the precise problem is in your
> commit message, or else people might find it difficult to figure
> out exactly what the problem is.

Ah right, it should actually be 3. Yes, it accounts for the fact that the
loops are reading the Rx buffer 4 bytes at a time. Sorry it wasn't too
clear previously.

|Dennis

^ permalink raw reply

* Re: [PATCH] networking: NET_CLS_ROUTE4 depends on INET
From: David Miller @ 2011-05-19 23:23 UTC (permalink / raw)
  To: randy.dunlap; +Cc: netdev, hadi
In-Reply-To: <20110519162218.b0344392.randy.dunlap@oracle.com>

From: Randy Dunlap <randy.dunlap@oracle.com>
Date: Thu, 19 May 2011 16:22:18 -0700

> From: Randy Dunlap <randy.dunlap@oracle.com>
> 
> IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
> IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
> is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.
> 
> warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)
> 
> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>

Applied, thanks Randy.

^ permalink raw reply

* [PATCH] networking: NET_CLS_ROUTE4 depends on INET
From: Randy Dunlap @ 2011-05-19 23:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, Jamal Hadi Salim

From: Randy Dunlap <randy.dunlap@oracle.com>

IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.

warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 net/sched/Kconfig |    1 +
 1 file changed, 1 insertion(+)

--- lnx-2639.orig/net/sched/Kconfig
+++ lnx-2639/net/sched/Kconfig
@@ -277,6 +277,7 @@ config NET_CLS_TCINDEX
 
 config NET_CLS_ROUTE4
 	tristate "Routing decision (ROUTE)"
+	depends on INET
 	select IP_ROUTE_CLASSID
 	select NET_CLS
 	---help---

^ permalink raw reply

* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Ben Greear @ 2011-05-19 23:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20110519161827.2ba4b40e@nehalam>

On 05/19/2011 04:18 PM, Stephen Hemminger wrote:
> On Thu, 19 May 2011 15:47:14 -0700
> Ben Greear <greearb@candelatech.com> wrote:
>
>> I noticed something that struck me as a bit weird today,
>> but perhaps it's normal.
>>
>> I was using our application to create 3 TCP streams from one port to
>> another (1Gbps, igb driver), running through a network emulator.
>> Traffic is flowing bi-directional in each connection.
>>
>> I am doing 24k byte writes per system call.  I tried 100ms, 10ms, and 1ms
>> latency (one-way) in the emulator, but behaviour is similar in each case.
>> The rest of this info was gathered with 1ms delay in the emulator.
>>
>> If I ask all 3 connections to run 1Gbps, netstat shows 30+GB in the
>> sending queues and 1+ second latency (user-space to user-space).  Aggregate
>> throughput is around 700Mbps in each direction.
>>
>> But, if I ask each of the connections to run at 300Mbps, latency averages
>> 2ms and each connection runs right at 300Mbps (950Mbps or so on the wire).
>>
>> It seems that when you over-drive the link, things back up and perform
>> quite badly over-all.
>>
>> This is a core-i7 3.2GHz with 12GB RAM, Fedora 14, 2.6.38.6 kernel
>> (with some hacks), 64-bit OS and user-space app.  Quick testing on 2.6.36.3
>> showed similar results, so I don't think it's a regression.
>>
>> I am curious if others see similar results?
>>
>> Thanks,
>> Ben
>>
>
> If you overdrive, TCP expects your network emulator to have
> some, but limited, queueing (like a real router).

The emulator is fine, it's not being over-driven (and has limited queueing if it was
being over-driven).  The queues that are backing up are in the tcp sockets on the
sending machine.

But, just to make sure, I'll re-run the test with a looped back cable...

Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCHv2 00/14] virtio and vhost-net performance enhancements
From: David Miller @ 2011-05-19 23:20 UTC (permalink / raw)
  To: mst
  Cc: linux-kernel, rusty, cotte, borntraeger, linux390, schwidefsky,
	heiko.carstens, xma, lguest, virtualization, netdev, linux-s390,
	kvm, krkumar2, tahm, steved, habanero
In-Reply-To: <cover.1305846412.git.mst@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 20 May 2011 02:10:07 +0300

> Rusty, I think it will be easier to merge vhost and virtio bits in one
> go. Can it all go in through your tree (Dave in the past acked
> sending a very similar patch through you so should not be a problem)?

And in case you want an explicit ack for the net bits:

Acked-by: David S. Miller <davem@davemloft.net>

:-)

^ permalink raw reply

* Re: TCP funny-ness when over-driving a 1Gbps link.
From: Stephen Hemminger @ 2011-05-19 23:18 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4DD59DF2.2070707@candelatech.com>

On Thu, 19 May 2011 15:47:14 -0700
Ben Greear <greearb@candelatech.com> wrote:

> I noticed something that struck me as a bit weird today,
> but perhaps it's normal.
> 
> I was using our application to create 3 TCP streams from one port to
> another (1Gbps, igb driver), running through a network emulator.
> Traffic is flowing bi-directional in each connection.
> 
> I am doing 24k byte writes per system call.  I tried 100ms, 10ms, and 1ms
> latency (one-way) in the emulator, but behaviour is similar in each case.
> The rest of this info was gathered with 1ms delay in the emulator.
> 
> If I ask all 3 connections to run 1Gbps, netstat shows 30+GB in the
> sending queues and 1+ second latency (user-space to user-space).  Aggregate
> throughput is around 700Mbps in each direction.
> 
> But, if I ask each of the connections to run at 300Mbps, latency averages
> 2ms and each connection runs right at 300Mbps (950Mbps or so on the wire).
> 
> It seems that when you over-drive the link, things back up and perform
> quite badly over-all.
> 
> This is a core-i7 3.2GHz with 12GB RAM, Fedora 14, 2.6.38.6 kernel
> (with some hacks), 64-bit OS and user-space app.  Quick testing on 2.6.36.3
> showed similar results, so I don't think it's a regression.
> 
> I am curious if others see similar results?
> 
> Thanks,
> Ben
> 

If you overdrive, TCP expects your network emulator to have
some, but limited, queueing (like a real router).

-- 

^ permalink raw reply

* Re: [PATCH net-next 10/13] tg3: Consolidate all netdev feature assignments
From: Mahesh Bandewar @ 2011-05-19 23:15 UTC (permalink / raw)
  To: Matt Carlson; +Cc: David Miller, linux-netdev
In-Reply-To: <1305843176-32358-11-git-send-email-mcarlson@broadcom.com>

On Thu, May 19, 2011 at 3:12 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> This patch consolidates all the netdev feature bit assignments to one
> location.
>
> Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
> Reviewed-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/tg3.c |   51 ++++++++++++++++++++++++---------------------------
>  1 files changed, 24 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index 09fe067..5bf2ce1 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -13602,19 +13602,6 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
>            tg3_flag(tp, 5750_PLUS))
>                tg3_flag_set(tp, 5705_PLUS);
>
> -       /* 5700 B0 chips do not support checksumming correctly due
> -        * to hardware bugs.
> -        */
> -       if (tp->pci_chip_rev_id != CHIPREV_ID_5700_B0) {
> -               u32 features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
> -
> -               if (tg3_flag(tp, 5755_PLUS))
> -                       features |= NETIF_F_IPV6_CSUM;
> -               tp->dev->features |= features;
> -               tp->dev->hw_features |= features;
> -               tp->dev->vlan_features |= features;
> -       }
> -
>        /* Determine TSO capabilities */
>        if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5719)
>                ; /* Do nothing. HW bug. */
> @@ -14922,7 +14909,7 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>        u32 sndmbx, rcvmbx, intmbx;
>        char str[40];
>        u64 dma_mask, persist_dma_mask;
> -       u32 hw_features = 0;
> +       u32 features = 0;
>
>        printk_once(KERN_INFO "%s\n", version);
>
> @@ -14958,8 +14945,6 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>
>        SET_NETDEV_DEV(dev, &pdev->dev);
>
> -       dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
> -
>        tp = netdev_priv(dev);
>        tp->pdev = pdev;
>        tp->dev = dev;
> @@ -15039,7 +15024,7 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>        if (dma_mask > DMA_BIT_MASK(32)) {
>                err = pci_set_dma_mask(pdev, dma_mask);
>                if (!err) {
> -                       dev->features |= NETIF_F_HIGHDMA;
> +                       features |= NETIF_F_HIGHDMA;
>                        err = pci_set_consistent_dma_mask(pdev,
>                                                          persist_dma_mask);
>                        if (err < 0) {
> @@ -15060,6 +15045,18 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>
>        tg3_init_bufmgr_config(tp);
>
> +       features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
> +
> +       /* 5700 B0 chips do not support checksumming correctly due
> +        * to hardware bugs.
> +        */
> +       if (tp->pci_chip_rev_id != CHIPREV_ID_5700_B0) {
> +               features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
> +
> +               if (tg3_flag(tp, 5755_PLUS))
> +                       features |= NETIF_F_IPV6_CSUM;
> +       }
> +
>        /* TSO is on by default on chips that support hardware TSO.
>         * Firmware TSO on older chips gives lower performance, so it
>         * is off by default, but can be enabled using ethtool.
> @@ -15067,24 +15064,20 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>        if ((tg3_flag(tp, HW_TSO_1) ||
>             tg3_flag(tp, HW_TSO_2) ||
>             tg3_flag(tp, HW_TSO_3)) &&
> -           (dev->features & NETIF_F_IP_CSUM))
> -               hw_features |= NETIF_F_TSO;
> +           (features & NETIF_F_IP_CSUM))
> +               features |= NETIF_F_TSO;
>        if (tg3_flag(tp, HW_TSO_2) || tg3_flag(tp, HW_TSO_3)) {
> -               if (dev->features & NETIF_F_IPV6_CSUM)
> -                       hw_features |= NETIF_F_TSO6;
> +               if (features & NETIF_F_IPV6_CSUM)
> +                       features |= NETIF_F_TSO6;
>                if (tg3_flag(tp, HW_TSO_3) ||
>                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5761 ||
>                    (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5784 &&
>                     GET_CHIP_REV(tp->pci_chip_rev_id) != CHIPREV_5784_AX) ||
>                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5785 ||
>                    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_57780)
> -                       hw_features |= NETIF_F_TSO_ECN;
> +                       features |= NETIF_F_TSO_ECN;
>        }
>
> -       dev->hw_features |= hw_features;
> -       dev->features |= hw_features;
> -       dev->vlan_features |= hw_features;
> -
>        /*
>         * Add loopback capability only for a subset of devices that support
>         * MAC-LOOPBACK. Eventually this need to be enhanced to allow INT-PHY
> @@ -15093,7 +15086,11 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
>        if (GET_ASIC_REV(tp->pci_chip_rev_id) != ASIC_REV_5780 &&
>            !tg3_flag(tp, CPMU_PRESENT))
>                /* Add the loopback capability */
> -               dev->hw_features |= NETIF_F_LOOPBACK;
> +               features |= NETIF_F_LOOPBACK;
> +
> +       dev->features |= features;
This will set the loopback by default for the described sub-set.
> +       dev->hw_features |= features;
Yes, you just want to add that here only.
> +       dev->vlan_features |= features;
>
>        if (tp->pci_chip_rev_id == CHIPREV_ID_5705_A1 &&
>            !tg3_flag(tp, TSO_CAPABLE) &&
> --
> 1.7.3.4

^ permalink raw reply

* [PATCHv2 14/14] vhost: fix 64 bit features
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1305846412.git.mst@redhat.com>

Update vhost_has_feature to make it work correctly for bits >= 32.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8e03379..64889d2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -123,7 +123,7 @@ struct vhost_dev {
 	struct vhost_memory __rcu *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
-	unsigned acked_features;
+	u64 acked_features;
 	struct vhost_virtqueue *vqs;
 	int nvqs;
 	struct file *log_file;
@@ -176,14 +176,14 @@ enum {
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
 };
 
-static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
+static inline bool vhost_has_feature(struct vhost_dev *dev, int bit)
 {
-	unsigned acked_features;
+	u64 acked_features;
 
 	/* TODO: check that we are running from vhost_worker or dev mutex is
 	 * held? */
 	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
-	return acked_features & (1 << bit);
+	return acked_features & (1ull << bit);
 }
 
 #endif
-- 
1.7.5.53.gc233e
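
The reason for the 1ull change can be shown with a user-space sketch. It is
an illustration under assumptions, not the vhost code: the function name is
hypothetical and the RCU dereference is left out. With a plain int constant
the shift is done in int width, so bits 32 and above cannot be tested; a
64-bit constant makes the mask as wide as the feature word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test one bit of a 64-bit feature word. */
static bool toy_has_feature(uint64_t acked_features, int bit)
{
	assert(bit >= 0 && bit < 64);

	/* 1ULL keeps the shift 64 bits wide, so bits 32..63 are reachable. */
	return (acked_features & (1ULL << bit)) != 0;
}
```

The explicit comparison with 0 gives the same result as the implicit
conversion to bool in the patch.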

^ permalink raw reply related

* [PATCHv2 13/14] virtio_test: update for 64 bit features
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1305846412.git.mst@redhat.com>

Extend the virtio_test tool so it can work with
64 bit features.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/virtio_test.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 74d3331..96cf9bf 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -55,7 +55,6 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 {
 	struct vhost_vring_state state = { .index = info->idx };
 	struct vhost_vring_file file = { .index = info->idx };
-	unsigned long long features = dev->vdev.features[0];
 	struct vhost_vring_addr addr = {
 		.index = info->idx,
 		.desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
@@ -63,6 +62,10 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 		.used_user_addr = (uint64_t)(unsigned long)info->vring.used,
 	};
 	int r;
+	unsigned long long features = dev->vdev.features[0];
+	if (sizeof features > sizeof dev->vdev.features[0])
+		features |= ((unsigned long long)dev->vdev.features[1]) << 32;
+
 	r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
 	assert(r >= 0);
 	state.num = info->vring.num;
@@ -107,7 +110,8 @@ static void vdev_info_init(struct vdev_info* dev, unsigned long long features)
 	int r;
 	memset(dev, 0, sizeof *dev);
 	dev->vdev.features[0] = features;
-	dev->vdev.features[1] = features >> 32;
+	if (sizeof features > sizeof dev->vdev.features[0])
+		dev->vdev.features[1] = features >> 32;
 	dev->buf_size = 1024;
 	dev->buf = malloc(dev->buf_size);
 	assert(dev->buf);
-- 
1.7.5.53.gc233e

^ permalink raw reply related

* [PATCHv2 12/14] virtio: 64 bit features
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero
In-Reply-To: <cover.1305846412.git.mst@redhat.com>

Extend features to 64 bit so we can use more
transport bits.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++----
 drivers/s390/kvm/kvm_virtio.c  |    8 ++++----
 drivers/virtio/virtio.c        |    8 ++++----
 drivers/virtio/virtio_pci.c    |   34 ++++++++++++++++++++++++++++------
 drivers/virtio/virtio_ring.c   |    2 ++
 include/linux/virtio.h         |    2 +-
 include/linux/virtio_config.h  |   15 +++++++++------
 include/linux/virtio_pci.h     |    9 ++++++++-
 8 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 69c84a1..d2d6953 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -93,17 +93,17 @@ static unsigned desc_size(const struct lguest_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 lg_get_features(struct virtio_device *vdev)
+static u64 lg_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
 	u8 *in_features = lg_features(desc);
 
 	/* We do this the slow but generic way. */
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 
 	return features;
 }
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 414427d..c56293c 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -79,16 +79,16 @@ static unsigned desc_size(const struct kvm_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 kvm_get_features(struct virtio_device *vdev)
+static u64 kvm_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct kvm_device_desc *desc = to_kvmdev(vdev)->desc;
 	u8 *in_features = kvm_vq_features(desc);
 
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 	return features;
 }
 
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index efb35aa..52b24d7 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -112,7 +112,7 @@ static int virtio_dev_probe(struct device *_d)
 	struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
 	struct virtio_driver *drv = container_of(dev->dev.driver,
 						 struct virtio_driver, driver);
-	u32 device_features;
+	u64 device_features;
 
 	/* We have a driver! */
 	add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -124,14 +124,14 @@ static int virtio_dev_probe(struct device *_d)
 	memset(dev->features, 0, sizeof(dev->features));
 	for (i = 0; i < drv->feature_table_size; i++) {
 		unsigned int f = drv->feature_table[i];
-		BUG_ON(f >= 32);
-		if (device_features & (1 << f))
+		BUG_ON(f >= 64);
+		if (device_features & (1ull << f))
 			set_bit(f, dev->features);
 	}
 
 	/* Transport features always preserved to pass to finalize_features. */
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++)
-		if (device_features & (1 << i))
+		if (device_features & (1ull << i))
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 4fb5b2b..04b216f 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -44,6 +44,8 @@ struct virtio_pci_device
 	spinlock_t lock;
 	struct list_head virtqueues;
 
+	/* 64 bit features */
+	int features_hi;
 	/* MSI-X support */
 	int msix_enabled;
 	int intx_enabled;
@@ -103,26 +105,46 @@ static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
 }
 
 /* virtio config->get_features() implementation */
-static u32 vp_get_features(struct virtio_device *vdev)
+static u64 vp_get_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
-	/* When someone needs more than 32 feature bits, we'll need to
+	/* When someone needs more than 32 feature bits, we need to
 	 * steal a bit to indicate that the rest are somewhere else. */
-	return ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	flo = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(0x1 << VIRTIO_F_FEATURES_HI,
+			  vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+		fhi = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+		fhi = 0;
+	}
+	return (((u64)fhi) << 32) | flo;
 }
 
 /* virtio config->finalize_features() implementation */
 static void vp_finalize_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
 	/* Give virtio_ring a chance to accept features. */
 	vring_transport_features(vdev);
 
-	/* We only support 32 feature bits. */
-	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 1);
-	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
+	/* We only support 64 feature bits. */
+	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 64 / BITS_PER_LONG);
+	flo = vdev->features[0];
+	fhi = vdev->features[64 / BITS_PER_LONG - 1] >> (BITS_PER_LONG - 32);
+	iowrite32(flo, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(fhi, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+	}
 }
 
 /* virtio config->get() implementation */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 8218fe6..4a7a651 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -534,6 +534,8 @@ void vring_transport_features(struct virtio_device *vdev)
 
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
 		switch (i) {
+		case VIRTIO_F_FEATURES_HI:
+			break;
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
 		case VIRTIO_RING_F_EVENT_IDX:
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 58c0953..944ebcd 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -119,7 +119,7 @@ struct virtio_device {
 	struct virtio_config_ops *config;
 	struct list_head vqs;
 	/* Note that this is a Linux set_bit-style bitmap. */
-	unsigned long features[1];
+	unsigned long features[64 / BITS_PER_LONG];
 	void *priv;
 };
 
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 800617b..b1a1981 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -18,16 +18,19 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED		0x80
 
-/* Some virtio feature bits (currently bits 28 through 31) are reserved for the
+/* Some virtio feature bits (currently bits 28 through 39) are reserved for the
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		32
+#define VIRTIO_TRANSPORT_F_END		40
 
 /* Do we get callbacks when the ring is completely used, even if we've
  * suppressed them? */
 #define VIRTIO_F_NOTIFY_ON_EMPTY	24
 
+/* Enables feature bits 32 to 63 (only really required for virtio_pci). */
+#define VIRTIO_F_FEATURES_HI		31
+
 #ifdef __KERNEL__
 #include <linux/err.h>
 #include <linux/virtio.h>
@@ -72,7 +75,7 @@
  * @del_vqs: free virtqueues found by find_vqs().
  * @get_features: get the array of feature bits for this device.
  *	vdev: the virtio_device
- *	Returns the first 32 feature bits (all we currently need).
+ *	Returns the first 64 feature bits (all we currently need).
  * @finalize_features: confirm what device features we'll be using.
  *	vdev: the virtio_device
  *	This gives the final feature bits for the device: it can change
@@ -92,7 +95,7 @@ struct virtio_config_ops {
 			vq_callback_t *callbacks[],
 			const char *names[]);
 	void (*del_vqs)(struct virtio_device *);
-	u32 (*get_features)(struct virtio_device *vdev);
+	u64 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 };
 
@@ -110,9 +113,9 @@ static inline bool virtio_has_feature(const struct virtio_device *vdev,
 {
 	/* Did you forget to fix assumptions on max features? */
 	if (__builtin_constant_p(fbit))
-		BUILD_BUG_ON(fbit >= 32);
+		BUILD_BUG_ON(fbit >= 64);
 	else
-		BUG_ON(fbit >= 32);
+		BUG_ON(fbit >= 64);
 
 	if (fbit < VIRTIO_TRANSPORT_F_START)
 		virtio_check_driver_offered_feature(vdev, fbit);
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index 9a3d7c4..90f9725 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -55,9 +55,16 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+/* An extended 32-bit r/o bitmask of the features supported by the host */
+#define VIRTIO_PCI_HOST_FEATURES_HI	24
+
+/* An extended 32-bit r/w bitmask of features activated by the guest */
+#define VIRTIO_PCI_GUEST_FEATURES_HI	28
+
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#define VIRTIO_PCI_CONFIG(dev)		((dev)->features_hi ? 32 : \
+						(dev)->msix_enabled ? 24 : 20)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.5.53.gc233e
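
The index and shift arithmetic used to pull the low and high 32-bit halves
out of an unsigned long bitmap can be tried in user space. This is a sketch
under assumptions: the TOY_ names and helper functions are hypothetical, and
it only mirrors the expressions features[0] and
features[64 / BITS_PER_LONG - 1] >> (BITS_PER_LONG - 32) from the patch, for
a long that is either 32 or 64 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_FEATURE_WORDS (64 / TOY_BITS_PER_LONG)

/* Spread a 64-bit value over a bitmap of TOY_FEATURE_WORDS longs. */
static void toy_features_from_u64(unsigned long *f, uint64_t v)
{
	size_t i;

	assert(TOY_BITS_PER_LONG == 32 || TOY_BITS_PER_LONG == 64);
	for (i = 0; i < TOY_FEATURE_WORDS; i++)
		f[i] = (unsigned long)(v >> (i * TOY_BITS_PER_LONG));
}

/* Low 32 feature bits: always the bottom of word 0. */
static uint32_t toy_features_lo(const unsigned long *f)
{
	return (uint32_t)f[0];
}

/* High 32 feature bits: word 1 on 32-bit longs, top half of word 0 on 64-bit. */
static uint32_t toy_features_hi(const unsigned long *f)
{
	return (uint32_t)(f[TOY_FEATURE_WORDS - 1] >> (TOY_BITS_PER_LONG - 32));
}

/* Recombine the halves, as vp_get_features does with fhi and flo. */
static uint64_t toy_features_join(uint32_t flo, uint32_t fhi)
{
	return ((uint64_t)fhi << 32) | flo;
}
```

On a 64-bit long the last word is word 0 and the shift is 32; on a 32-bit
long the last word is word 1 and the shift is 0, so the same expression
yields the high half in both cases.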

^ permalink raw reply related

* [PATCHv2 11/14] virtio: don't delay avail index update
From: Michael S. Tsirkin @ 2011-05-19 23:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1305846412.git.mst@redhat.com>

Update avail index immediately instead of upon kick:
for virtio-net RX this helps parallelism with the host.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c |   28 +++++++++++++++++++---------
 1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eed5f29..8218fe6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -89,7 +89,7 @@ struct vring_virtqueue
 	unsigned int num_free;
 	/* Head of free buffer list. */
 	unsigned int free_head;
-	/* Number we've added since last sync. */
+	/* Number we've added since last kick. */
 	unsigned int num_added;
 
 	/* Last used index we've seen. */
@@ -174,6 +174,13 @@ int virtqueue_add_buf_gfp(struct virtqueue *_vq,
 
 	BUG_ON(data == NULL);
 
+	/* Prevent drivers from adding more than num bufs without a kick. */
+	if (vq->num_added == vq->vring.num) {
+		printk(KERN_ERR "gaaa!!!\n");
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && (out + in) > 1 && vq->num_free) {
@@ -227,8 +234,14 @@ add_head:
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync).  FIXME: avoid modulus here? */
-	avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
+	avail = vq->vring.avail->idx % vq->vring.num;
 	vq->vring.avail->ring[avail] = head;
+	vq->num_added++;
+
+	/* Descriptors and available array need to be set before we expose the
+	 * new available array entries. */
+	virtio_wmb();
+	vq->vring.avail->idx++;
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
@@ -242,17 +255,14 @@ void virtqueue_kick(struct virtqueue *_vq)
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	u16 new, old;
 	START_USE(vq);
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb();
-
-	old = vq->vring.avail->idx;
-	new = vq->vring.avail->idx = old + vq->num_added;
-	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
+	new = vq->vring.avail->idx;
+	old = new - vq->num_added;
+	vq->num_added = 0;
+
 	if (vq->event ?
 	    vring_need_event(vring_avail_event(&vq->vring), new, old) :
 	    !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
-- 
1.7.5.53.gc233e

^ permalink raw reply related
