Netdev List
 help / color / mirror / Atom feed
* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: David Miller @ 2010-09-30 18:50 UTC (permalink / raw)
  To: wg-5Yr1BZd7O62+XT7JhA+gdA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	sameo-VuQAYsv1563Yd54FQh9/CA,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	masa-korg-ECg8zkTtlr0C6LszWs/t0g,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	morinaga526-ECg8zkTtlr0C6LszWs/t0g,
	meego-dev-WXzIur8shnEAvxtiuMwx3w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w, chripell-VaTbYqLCNhc,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CA4541F.5040804-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
Date: Thu, 30 Sep 2010 11:10:55 +0200

> On 09/24/2010 12:24 PM, Masayuki Ohtak wrote:
>> +};
>> +
>> +static struct can_bittiming_const pch_can_bittiming_const = {
>> +	.name = KBUILD_MODNAME,
> 
> Not sure what KBUILD_MODNAME is. Should be "pch_can", the name of the
> driver.

That's what KBUILD_MODNAME will be defined to when this gets
compiled :-)

^ permalink raw reply

* Re: how to use secure_tcp_sequence_number
From: Eric Dumazet @ 2010-09-30 18:21 UTC (permalink / raw)
  To: Nicola Padovano; +Cc: netfilter-devel, netdev
In-Reply-To: <AANLkTikWop213k7cBXAaoCMw_8LouX-qauYEbHD=d1Vh@mail.gmail.com>

Le jeudi 30 septembre 2010 à 19:01 +0200, Nicola Padovano a écrit :
> If i attempt to insert the module that uses the
> secure_tcp_sequence_number i get this error message:
> 
> from insmod:  -1 Invalid module format
> and from dmesg: version magic '2.6.35.4 SMP mod_unload 586 ' should be
> '2.6.35.4 SMP mod_unload modversions 586 '
> 
> what's the matter?

diff --git a/drivers/char/random.c b/drivers/char/random.c
index caef35a..f5ef7ba 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1550,6 +1550,7 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
 
 	return seq;
 }
+EXPORT_SYMBOL(secure_tcp_sequence_number);
 
 /* Generate secure starting point for ephemeral IPV4 transport port search */
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: Packet time delays on multi-core systems
From: Alexey Vlasov @ 2010-09-30 18:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <1285869782.2615.920.camel@edumazet-laptop>

On Thu, Sep 30, 2010 at 08:03:02PM +0200, Eric Dumazet wrote:
> >  
> > The last test were made already concerning such rx queue binding:
> > # cat /proc/irq/60/smp_affinity
> > 001000
> > # cat /proc/irq/61/smp_affinity
> > 010000
> > # cat /proc/irq/62/smp_affinity
> > 080000
> > # cat /proc/irq/63/smp_affinity
> > 800000
> > 
> 
> Why 60, 61, 62, 63 ? This should be 753, 754, 755, 756

I've got several similar servers, the interrups 60-63 are on
that one that I can test now, so this isn't a mistake.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply

* Re: Packet time delays on multi-core systems
From: Eric Dumazet @ 2010-09-30 18:03 UTC (permalink / raw)
  To: Alexey Vlasov; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <20100930173732.GB4094@beaver.vrungel.ru>

Le jeudi 30 septembre 2010 à 21:37 +0400, Alexey Vlasov a écrit :
> On Thu, Sep 30, 2010 at 02:44:29PM +0200, Eric Dumazet wrote:
> > Le jeudi 30 septembre 2010 ?? 16:23 +0400, Alexey Vlasov a ??crit :
> > > On Thu, Sep 30, 2010 at 08:33:52AM +0200, Eric Dumazet wrote:
> > > > Le jeudi 30 septembre 2010 ?? 10:24 +0400, Alexey Vlasov a ??crit :
> > > > > Here I found some dude with the same problem:
> > > > > http://lkml.org/lkml/2010/7/9/340
> > >  
> > > Well I put interrups from NIC, namely tx/rx query, to different
> > > processors and got normal pings by adding LOG rule.
> > > 
> > > I also found that overruns is constantly growing, I don't know if these are connected.
> > > RX packets:2831439546 errors:0 dropped:134726 overruns:947671733 frame:0
> > > TX packets:2880849825 errors:0 dropped:0 overruns:0 carrier:0
> > > 
> 
> Too early to be happy, concerning one rule- the situation got better, but still
> there are some time delays. But adding one more rule:
> -A INPUT -p all -m state --state INVALID -j LOG --log-prefix
> "ipsec:IN-INVALID "
> it got totally wrecked:
> ...
> 64 bytes from (10.0.2.17): icmp_seq=24 ttl=64 time=0.342 ms
> 64 bytes from (10.0.2.17): icmp_seq=25 ttl=64 time=1868 ms
> 64 bytes from (10.0.2.17): icmp_seq=26 ttl=64 time=1448 ms
> 64 bytes from (10.0.2.17): icmp_seq=27 ttl=64 time=447 ms
> 64 bytes from (10.0.2.17): icmp_seq=28 ttl=64 time=0.196 ms
> ...
> 100 packets transmitted, 100 received, 0% packet loss, time 99990ms
> rtt min/avg/max/mdev = 0.108/39.068/1868.663/237.507 ms, pipe 2
> 
> # iptables -L -v -n
> Chain INPUT (policy ACCEPT 601K packets, 475M bytes)
>  pkts bytes target     prot opt in     out     source               destination
>   275 11096 LOG        all  --  *      *       0.0.0.0/0            0.0.0.0/0           state INVALID LOG flags 0 level 4 prefix `ipsec:IN-INVALID '
> 
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> Chain OUTPUT (policy ACCEPT 529K packets, 561M bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 13979  839K LOG        tcp  --  *      *       0.0.0.0/0            0.0.0.0/0           tcp dpt:80 flags:0x17/0x02 LOG flags 8 level 4 prefix `ipsec:SYN-OUTPUT-DROP '
>  
> > > Here goes the typical distribution of interrups on new servers:
> > >            CPU0    CPU1    CPU2    CPU3 ... CPU23
> > > 752:         11       0       0       0 ...     0 PCI-MSI-edge eth0
> > > 753: 2799366721       0       0       0 ...     0 PCI-MSI-edge eth0-rx3
> > > 754: 2821840553       0       0       0 ...     0 PCI-MSI-edge eth0-rx2
> > > 755: 2786117044       0       0       0 ...     0 PCI-MSI-edge eth0-rx1
> > > 756: 2896099336       0       0       0 ...     0 PCI-MSI-edge eth0-rx0
> > > 757: 1808404680       0       0       0 ...     0 PCI-MSI-edge eth0-tx3
> > > 758: 1797855130       0       0       0 ...     0 PCI-MSI-edge eth0-tx2
> > > 759: 1807222032       0       0       0 ...     0 PCI-MSI-edge eth0-tx1
> > > 760: 1820309360       0       0       0 ...     0 PCI-MSI-edge eth0-tx0
> > > 
> > 
> > echo 01 >/proc/irq/*/eth0-rx0/../smp_affinity
> > echo 02 >/proc/irq/*/eth0-rx1/../smp_affinity
> > echo 04 >/proc/irq/*/eth0-rx2/../smp_affinity
> > echo 08 >/proc/irq/*/eth0-rx3/../smp_affinity
> > 
> > 
> > cat /proc/irq/*/eth0-rx0/../smp_affinity
> > cat /proc/irq/*/eth0-rx1/../smp_affinity
> > cat /proc/irq/*/eth0-rx2/../smp_affinity
> > cat /proc/irq/*/eth0-rx3/../smp_affinity
>  
> The last test were made already concerning such rx queue binding:
> # cat /proc/irq/60/smp_affinity
> 001000
> # cat /proc/irq/61/smp_affinity
> 010000
> # cat /proc/irq/62/smp_affinity
> 080000
> # cat /proc/irq/63/smp_affinity
> 800000
> 

Why 60, 61, 62, 63 ? This should be 753, 754, 755, 756



> Now ksoftirqd eats not only one processor but all oness where I assigned the IRQs.
> 
> > > On the old ones:
> > >            CPU0       CPU1       CPU2  ...      CPU8
> > > 502:  522320256  522384039  522327386  ... 522380267 PCI-MSI-edge eth0
> > > 
> > 
> > What network driver is it (newbox), was it (old box) ?
> 
> newbox:
> 01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network
> Connection (rev 02)
> driver: igb
> version: 1.3.16-k2
> firmware-version: 2.1-0
> bus-info: 0000:01:00.0
> 
> oldbox:
> 05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
> Ethernet Controller (Copper) (rev 01)
> driver: e1000e
> version: 0.3.3.3-k6
> firmware-version: 1.0-0
> bus-info: 0000:05:00.0
> 

^ permalink raw reply

* [PATCH net-next] net: introduce DST_NOCACHE flag
From: Eric Dumazet @ 2010-09-30 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.

When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())

But we dont need to call neigh_hh_init() for dst that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on neigh->lock rwlock.

Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
inserted in route cache.

With the stress test bench, sending 160000000 frames on one neighbour,
results are :

Before patch:

real	2m28.406s
user	0m11.781s
sys	36m17.964s


After patch:

real	1m26.532s
user	0m12.185s
sys	20m3.903s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/dst.h    |    9 +++++----
 net/core/neighbour.c |    4 +++-
 net/ipv4/route.c     |    1 +
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index aa53fbc..a217c83 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -43,10 +43,11 @@ struct dst_entry {
 	short			error;
 	short			obsolete;
 	int			flags;
-#define DST_HOST		1
-#define DST_NOXFRM		2
-#define DST_NOPOLICY		4
-#define DST_NOHASH		8
+#define DST_HOST		0x0001
+#define DST_NOXFRM		0x0002
+#define DST_NOPOLICY		0x0004
+#define DST_NOHASH		0x0008
+#define DST_NOCACHE		0x0010
 	unsigned long		expires;
 
 	unsigned short		header_len;	/* more space at head required */
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 96b1a74..b142a0d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1210,7 +1210,9 @@ int neigh_resolve_output(struct sk_buff *skb)
 	if (!neigh_event_send(neigh, skb)) {
 		int err;
 		struct net_device *dev = neigh->dev;
-		if (dev->header_ops->cache && !dst->hh) {
+		if (dev->header_ops->cache &&
+		    !dst->hh &&
+		    !(dst->flags & DST_NOCACHE)) {
 			write_lock_bh(&neigh->lock);
 			if (!dst->hh)
 				neigh_hh_init(neigh, dst, dst->ops->protocol);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 98beda4..b0c7a87 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1107,6 +1107,7 @@ restart:
 		 * on the route gc list.
 		 */
 
+		rt->dst.flags |= DST_NOCACHE;
 		if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
 			int err = arp_bind_neighbour(&rt->dst);
 			if (err) {



^ permalink raw reply related

* Re: Packet time delays on multi-core systems
From: Alexey Vlasov @ 2010-09-30 17:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <1285850669.2615.426.camel@edumazet-laptop>

On Thu, Sep 30, 2010 at 02:44:29PM +0200, Eric Dumazet wrote:
> Le jeudi 30 septembre 2010 ?? 16:23 +0400, Alexey Vlasov a ??crit :
> > On Thu, Sep 30, 2010 at 08:33:52AM +0200, Eric Dumazet wrote:
> > > Le jeudi 30 septembre 2010 ?? 10:24 +0400, Alexey Vlasov a ??crit :
> > > > Here I found some dude with the same problem:
> > > > http://lkml.org/lkml/2010/7/9/340
> >  
> > Well I put interrups from NIC, namely tx/rx query, to different
> > processors and got normal pings by adding LOG rule.
> > 
> > I also found that overruns is constantly growing, I don't know if these are connected.
> > RX packets:2831439546 errors:0 dropped:134726 overruns:947671733 frame:0
> > TX packets:2880849825 errors:0 dropped:0 overruns:0 carrier:0
> > 

Too early to be happy, concerning one rule- the situation got better, but still
there are some time delays. But adding one more rule:
-A INPUT -p all -m state --state INVALID -j LOG --log-prefix
"ipsec:IN-INVALID "
it got totally wrecked:
...
64 bytes from (10.0.2.17): icmp_seq=24 ttl=64 time=0.342 ms
64 bytes from (10.0.2.17): icmp_seq=25 ttl=64 time=1868 ms
64 bytes from (10.0.2.17): icmp_seq=26 ttl=64 time=1448 ms
64 bytes from (10.0.2.17): icmp_seq=27 ttl=64 time=447 ms
64 bytes from (10.0.2.17): icmp_seq=28 ttl=64 time=0.196 ms
...
100 packets transmitted, 100 received, 0% packet loss, time 99990ms
rtt min/avg/max/mdev = 0.108/39.068/1868.663/237.507 ms, pipe 2

# iptables -L -v -n
Chain INPUT (policy ACCEPT 601K packets, 475M bytes)
 pkts bytes target     prot opt in     out     source               destination
  275 11096 LOG        all  --  *      *       0.0.0.0/0            0.0.0.0/0           state INVALID LOG flags 0 level 4 prefix `ipsec:IN-INVALID '

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 529K packets, 561M bytes)
 pkts bytes target     prot opt in     out     source               destination
13979  839K LOG        tcp  --  *      *       0.0.0.0/0            0.0.0.0/0           tcp dpt:80 flags:0x17/0x02 LOG flags 8 level 4 prefix `ipsec:SYN-OUTPUT-DROP '
 
> > Here goes the typical distribution of interrups on new servers:
> >            CPU0    CPU1    CPU2    CPU3 ... CPU23
> > 752:         11       0       0       0 ...     0 PCI-MSI-edge eth0
> > 753: 2799366721       0       0       0 ...     0 PCI-MSI-edge eth0-rx3
> > 754: 2821840553       0       0       0 ...     0 PCI-MSI-edge eth0-rx2
> > 755: 2786117044       0       0       0 ...     0 PCI-MSI-edge eth0-rx1
> > 756: 2896099336       0       0       0 ...     0 PCI-MSI-edge eth0-rx0
> > 757: 1808404680       0       0       0 ...     0 PCI-MSI-edge eth0-tx3
> > 758: 1797855130       0       0       0 ...     0 PCI-MSI-edge eth0-tx2
> > 759: 1807222032       0       0       0 ...     0 PCI-MSI-edge eth0-tx1
> > 760: 1820309360       0       0       0 ...     0 PCI-MSI-edge eth0-tx0
> > 
> 
> echo 01 >/proc/irq/*/eth0-rx0/../smp_affinity
> echo 02 >/proc/irq/*/eth0-rx1/../smp_affinity
> echo 04 >/proc/irq/*/eth0-rx2/../smp_affinity
> echo 08 >/proc/irq/*/eth0-rx3/../smp_affinity
> 
> 
> cat /proc/irq/*/eth0-rx0/../smp_affinity
> cat /proc/irq/*/eth0-rx1/../smp_affinity
> cat /proc/irq/*/eth0-rx2/../smp_affinity
> cat /proc/irq/*/eth0-rx3/../smp_affinity
 
The last test were made already concerning such rx queue binding:
# cat /proc/irq/60/smp_affinity
001000
# cat /proc/irq/61/smp_affinity
010000
# cat /proc/irq/62/smp_affinity
080000
# cat /proc/irq/63/smp_affinity
800000

Now ksoftirqd eats not only one processor but all oness where I assigned the IRQs.

> > On the old ones:
> >            CPU0       CPU1       CPU2  ...      CPU8
> > 502:  522320256  522384039  522327386  ... 522380267 PCI-MSI-edge eth0
> > 
> 
> What network driver is it (newbox), was it (old box) ?

newbox:
01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network
Connection (rev 02)
driver: igb
version: 1.3.16-k2
firmware-version: 2.1-0
bus-info: 0000:01:00.0

oldbox:
05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)
driver: e1000e
version: 0.3.3.3-k6
firmware-version: 1.0-0
bus-info: 0000:05:00.0

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply

* Re: how to use secure_tcp_sequence_number
From: Nicola Padovano @ 2010-09-30 17:01 UTC (permalink / raw)
  To: netfilter-devel, netdev
In-Reply-To: <AANLkTimA6jJbCvtxFi41JbH_XSJtKo0FrDE=fQ2OiafY@mail.gmail.com>

If i attempt to insert the module that uses the
secure_tcp_sequence_number i get this error message:

from insmod:  -1 Invalid module format
and from dmesg: version magic '2.6.35.4 SMP mod_unload 586 ' should be
'2.6.35.4 SMP mod_unload modversions 586 '

what's the matter?

On Wed, Sep 29, 2010 at 6:46 PM, Nicola Padovano
<nicola.padovano@gmail.com> wrote:
> Hi all. How can I export the secure_tcp_sequence_number to use it in my modules?
>
> --
> Nicola Padovano
> e-mail: nicola.padovano@gmail.com
> web: http://npadovano.altervista.org
>
> "My only ambition is not be anything at all; it seems the most
> sensible thing" (C. Bukowski)
>



-- 
Nicola Padovano
e-mail: nicola.padovano@gmail.com
web: http://npadovano.altervista.org

"My only ambition is not be anything at all; it seems the most
sensible thing" (C. Bukowski)

^ permalink raw reply

* [net-next-2.6 PATCH 3/3] bonding: reread information about speed and duplex when interface goes up
From: Krzysztof Piotr Oledzki @ 2010-09-30 16:19 UTC (permalink / raw)
  To: fubar, bonding-devel, netdev

>From 43285224a785e90c7d4cff2be0766ca8df6ddfb9 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki <ole@ans.pl>
Date: Thu, 30 Sep 2010 17:09:02 +0200
Subject: bonding: reread information about speed and duplex when interface goes up

When an interface was enslaved when it was down, bonding thinks
it has speed -1 even after it goes up. This leads into selecting
a wrong active interface in active/backup mode on mixed 10G/1G or
1G/100M environment.

before:
 bonding: bond0: link status definitely up for interface eth5, 100 Mbps full duplex.
 bonding: bond0: link status definitely up for interface eth0, 100 Mbps full duplex.

after:
 bonding: bond0: link status definitely up for interface eth5, 10000 Mbps full duplex.
 bonding: bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>

---
 drivers/net/bonding/bond_main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 721abc4..e409c14 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2368,6 +2368,8 @@ static void bond_miimon_commit(struct bonding *bond)
 				slave->state = BOND_STATE_BACKUP;
 			}
 
+			bond_update_speed_duplex(slave);
+
 			pr_info("%s: link status definitely up for interface %s, %d Mbps %s duplex.\n",
 				bond->dev->name, slave->dev->name,
 				slave->speed, slave->duplex ? "full" : "half");
-- 
1.7.1


^ permalink raw reply related

* [net-next-2.6 PATCH 2/3] bonding: add Speed/Duplex information to /proc/net/bonding/bond
From: Krzysztof Piotr Oledzki @ 2010-09-30 16:19 UTC (permalink / raw)
  To: fubar, bonding-devel, netdev

>From b340edd6763b52a2d0f1e0ae256748a21bd19005 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki <ole@ans.pl>
Date: Thu, 30 Sep 2010 17:54:15 +0200
Subject: bonding: add Speed/Duplex information to /proc/net/bonding/bond*

Effect:
 Slave Interface: eth5
 MII Status: up
 Speed: 10000 Mbps
 Duplex: full
 Link Failure Count: 0
 Permanent HW addr: XX:XX:XX:XX:XX:XX
 Slave queue ID: 0

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
---
 drivers/net/bonding/bond_main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e409c14..61c8971 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3315,6 +3315,8 @@ static void bond_info_show_slave(struct seq_file *seq,
 	seq_printf(seq, "\nSlave Interface: %s\n", slave->dev->name);
 	seq_printf(seq, "MII Status: %s\n",
 		   (slave->link == BOND_LINK_UP) ?  "up" : "down");
+	seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
+	seq_printf(seq, "Duplex: %s\n", slave->duplex ? "full" : "half");
 	seq_printf(seq, "Link Failure Count: %u\n",
 		   slave->link_failure_count);
 
-- 
1.7.1


^ permalink raw reply related

* [net-next-2.6 PATCH 1/3] bonding: print information about speed and duplex seen by the driver
From: Krzysztof Piotr Oledzki @ 2010-09-30 16:18 UTC (permalink / raw)
  To: fubar, bonding-devel, netdev

>From d4fb10933bb88f2bf410b0abe42f151c13e3df85 Mon Sep 17 00:00:00 2001
From: Krzysztof Piotr Oledzki <ole@ans.pl>
Date: Thu, 30 Sep 2010 17:05:19 +0200
Subject: bonding: print information about speed and duplex seen by the driver

before:
 bonding: bond0: link status definitely up for interface eth5
 bonding: bond0: link status definitely up for interface eth0

after:
 bonding: bond0: link status definitely up for interface eth5, 100 Mbps full duplex.
 bonding: bond0: link status definitely up for interface eth0, 100 Mbps full duplex.


Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>

---
 drivers/net/bonding/bond_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index fb70c3e..721abc4 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2368,8 +2368,9 @@ static void bond_miimon_commit(struct bonding *bond)
 				slave->state = BOND_STATE_BACKUP;
 			}
 
-			pr_info("%s: link status definitely up for interface %s.\n",
-				bond->dev->name, slave->dev->name);
+			pr_info("%s: link status definitely up for interface %s, %d Mbps %s duplex.\n",
+				bond->dev->name, slave->dev->name,
+				slave->speed, slave->duplex ? "full" : "half");
 
 			/* notify ad that the link status has changed */
 			if (bond->params.mode == BOND_MODE_8023AD)
-- 
1.7.1


^ permalink raw reply related

* [PATCH net-next] neigh: reorder fields in struct neighbour
From: Eric Dumazet @ 2010-09-30 15:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On 64bit arches, there are two 32bit holes that we can remove.

sizeof(struct neighbour) shrinks from 0xf8 to 0xf0 bytes

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/neighbour.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 242879b..7d08fd1 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -94,7 +94,7 @@ struct neighbour {
 	struct neighbour	*next;
 	struct neigh_table	*tbl;
 	struct neigh_parms	*parms;
-	struct net_device		*dev;
+	struct net_device	*dev;
 	unsigned long		used;
 	unsigned long		confirmed;
 	unsigned long		updated;
@@ -102,11 +102,11 @@ struct neighbour {
 	__u8			nud_state;
 	__u8			type;
 	__u8			dead;
+	atomic_t		refcnt;
 	atomic_t		probes;
 	rwlock_t		lock;
 	unsigned char		ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
 	struct hh_cache		*hh;
-	atomic_t		refcnt;
 	int			(*output)(struct sk_buff *skb);
 	struct sk_buff_head	arp_queue;
 	struct timer_list	timer;
@@ -163,7 +163,7 @@ struct neigh_table {
 	atomic_t		entries;
 	rwlock_t		lock;
 	unsigned long		last_rand;
-	struct kmem_cache		*kmem_cachep;
+	struct kmem_cache	*kmem_cachep;
 	struct neigh_statistics	__percpu *stats;
 	struct neighbour	**hash_buckets;
 	unsigned int		hash_mask;



^ permalink raw reply related

* Re: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.
From: Eric Dumazet @ 2010-09-30 14:22 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <fda262fa746dfc1feb8543cf88d5ecf60001c186.1285853725.git.xiaohui.xin@intel.com>

Le jeudi 30 septembre 2010 à 22:04 +0800, xiaohui.xin@intel.com a
écrit :
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> The hook is called in netif_receive_skb().
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
> Reviewed-by: Jeff Dike <jdike@linux.intel.com>
> ---
>  net/core/dev.c |   35 +++++++++++++++++++++++++++++++++++
>  1 files changed, 35 insertions(+), 0 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c11e32c..83172b8 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2517,6 +2517,37 @@ err:
>  EXPORT_SYMBOL(netdev_mp_port_prep);
>  #endif
>  
> +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
> +/* Add a hook to intercept mediate passthru(zero-copy) packets,
> + * and insert it to the socket queue owned by mp_port specially.
> + */
> +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
> +					       struct packet_type **pt_prev,
> +					       int *ret,
> +					       struct net_device *orig_dev)
> +{
> +	struct mp_port *mp_port = NULL;
> +	struct sock *sk = NULL;
> +
> +	if (!dev_is_mpassthru(skb->dev))
> +		return skb;
> +	mp_port = skb->dev->mp_port;
> +
> +	if (*pt_prev) {
> +		*ret = deliver_skb(skb, *pt_prev, orig_dev);
> +		*pt_prev = NULL;
> +	}
> +
> +	sk = mp_port->sock->sk;
> +	skb_queue_tail(&sk->sk_receive_queue, skb);
> +	sk->sk_state_change(sk);
> +
> +	return NULL;
> +}
> +#else
> +#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
> +#endif
> +
>  /**
>   *	netif_receive_skb - process receive buffer from network
>   *	@skb: buffer to process
> @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
>  ncls:
>  #endif
>  
> +	/* To intercept mediate passthru(zero-copy) packets here */
> +	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
> +	if (!skb)
> +		goto out;
>  	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>  	if (!skb)
>  		goto out;

This does not apply to current net-next-2.6

We now have dev->rx_handler (currently for bridge or macvlan)


commit ab95bfe01f9872459c8678572ccadbf646badad0
Author: Jiri Pirko <jpirko@redhat.com>
Date:   Tue Jun 1 21:52:08 2010 +0000

    net: replace hooks in __netif_receive_skb V5
    
    What this patch does is it removes two receive frame hooks (for bridge and for
    macvlan) from __netif_receive_skb. These are replaced them with a single
    hook for both. It only supports one hook per device because it makes no
    sense to do bridging and macvlan on the same device.
    
    Then a network driver (of virtual netdev like macvlan or bridge) can register
    an rx_handler for needed net device.
    
    Signed-off-by: Jiri Pirko <jpirko@redhat.com>
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>




^ permalink raw reply

* [PATCH v12 17/17]add two new ioctls for mp device.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The patch add two ioctls for mp device.
One is for userspace to query how much memory locked to make mp device
run smoothly. Another one is for userspace to set how much meory locked
it really wants.

---
 drivers/vhost/mpassthru.c |  109 +++++++++++++++++++++++----------------------
 include/linux/mpassthru.h |    2 +
 2 files changed, 58 insertions(+), 53 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 1a114d1..41aa59e 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -54,6 +54,8 @@
 #define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
 #define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
 
+#define DEFAULT_NEED   ((8192*2*2)*4096)
+
 struct frag {
 	u16     offset;
 	u16     size;
@@ -102,8 +104,10 @@ struct page_pool {
 	spinlock_t		read_lock;
 	/* record the orignal rlimit */
 	struct rlimit		o_rlim;
-	/* record the locked pages */
-	int			lock_pages;
+	/* userspace wants to locked */
+	int			locked_pages;
+	/* currently locked pages */
+	int                     cur_pages;
 	/* the device according to */
 	struct net_device	*dev;
 	/* the mp_port according to dev */
@@ -117,6 +121,7 @@ struct mp_struct {
 	struct net_device       *dev;
 	struct page_pool	*pool;
 	struct socket           socket;
+	struct task_struct      *user;
 };
 
 struct mp_file {
@@ -207,8 +212,8 @@ static int page_pool_attach(struct mp_struct *mp)
 	pool->port.ctor = page_ctor;
 	pool->port.sock = &mp->socket;
 	pool->port.hash = mp_lookup;
-	pool->lock_pages = 0;
-
+	pool->locked_pages = 0;
+	pool->cur_pages = 0;
 	/* locked by mp_mutex */
 	dev->mp_port = &pool->port;
 	mp->pool = pool;
@@ -236,37 +241,6 @@ struct page_info *info_dequeue(struct page_pool *pool)
 	return info;
 }
 
-static int set_memlock_rlimit(struct page_pool *pool, int resource,
-			      unsigned long cur, unsigned long max)
-{
-	struct rlimit new_rlim, *old_rlim;
-	int retval;
-
-	if (resource != RLIMIT_MEMLOCK)
-		return -EINVAL;
-	new_rlim.rlim_cur = cur;
-	new_rlim.rlim_max = max;
-
-	old_rlim = current->signal->rlim + resource;
-
-	/* remember the old rlimit value when backend enabled */
-	pool->o_rlim.rlim_cur = old_rlim->rlim_cur;
-	pool->o_rlim.rlim_max = old_rlim->rlim_max;
-
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-			!capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	retval = security_task_setrlimit(resource, &new_rlim);
-	if (retval)
-		return retval;
-
-	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
-	task_unlock(current->group_leader);
-	return 0;
-}
-
 static void mp_ki_dtor(struct kiocb *iocb)
 {
 	struct page_info *info = (struct page_info *)(iocb->private);
@@ -286,7 +260,7 @@ static void mp_ki_dtor(struct kiocb *iocb)
 		}
 	}
 	/* Decrement the number of locked pages */
-	info->pool->lock_pages -= info->pnum;
+	info->pool->cur_pages -= info->pnum;
 	kmem_cache_free(ext_page_info_cache, info);
 
 	return;
@@ -319,6 +293,7 @@ static int page_pool_detach(struct mp_struct *mp)
 {
 	struct page_pool *pool;
 	struct page_info *info;
+	struct task_struct *tsk = mp->user;
 	int i;
 
 	/* locked by mp_mutex */
@@ -334,9 +309,9 @@ static int page_pool_detach(struct mp_struct *mp)
 		kmem_cache_free(ext_page_info_cache, info);
 	}
 
-	set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
-			   pool->o_rlim.rlim_cur,
-			   pool->o_rlim.rlim_max);
+	down_write(&tsk->mm->mmap_sem);
+	tsk->mm->locked_vm -= pool->locked_pages;
+	up_write(&tsk->mm->mmap_sem);
 
 	/* locked by mp_mutex */
 	pool->dev->mp_port = NULL;
@@ -534,14 +509,11 @@ static struct page_info *alloc_page_info(struct page_pool *pool,
 	int rc;
 	int i, j, n = 0;
 	int len;
-	unsigned long base, lock_limit;
+	unsigned long base;
 	struct page_info *info = NULL;
 
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
-
-	if (pool->lock_pages + count > lock_limit && npages) {
-		printk(KERN_INFO "exceed the locked memory rlimit.");
+	if (pool->cur_pages + count > pool->locked_pages) {
+		printk(KERN_INFO "Exceed memory lock rlimt.");
 		return NULL;
 	}
 
@@ -603,7 +575,7 @@ static struct page_info *alloc_page_info(struct page_pool *pool,
 			mp_hash_insert(pool, info->pages[i], info);
 	}
 	/* increment the number of locked pages */
-	pool->lock_pages += j;
+	pool->cur_pages += j;
 	return info;
 
 failed:
@@ -890,7 +862,7 @@ copy:
 				info->pages[i] = NULL;
 			}
 	}
-	if (!pool->lock_pages)
+	if (!pool->cur_pages)
 		sock->sk->sk_state_change(sock->sk);
 
 	if (info != NULL) {
@@ -974,12 +946,6 @@ proceed:
 		count--;
 	}
 
-	if (!pool->lock_pages) {
-		set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
-				iocb->ki_user_data * 4096 * 2,
-				iocb->ki_user_data * 4096 * 2);
-	}
-
 	/* Translate address to kernel */
 	info = alloc_page_info(pool, iocb, iov, count, frags, npages, 0);
 	if (!info)
@@ -1081,8 +1047,10 @@ static long mp_chr_ioctl(struct file *file, unsigned int cmd,
 	struct mp_struct *mp;
 	struct net_device *dev;
 	void __user* argp = (void __user *)arg;
+	unsigned long  __user *limitp = argp;
 	struct ifreq ifr;
 	struct sock *sk;
+	unsigned long limit, locked, lock_limit;
 	int ret;
 
 	ret = -EINVAL;
@@ -1122,6 +1090,7 @@ static long mp_chr_ioctl(struct file *file, unsigned int cmd,
 			goto err_dev_put;
 		}
 		mp->dev = dev;
+		mp->user = current;
 		ret = -ENOMEM;
 
 		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
@@ -1166,6 +1135,40 @@ err_dev_put:
 		rtnl_unlock();
 		break;
 
+	case MPASSTHRU_SET_MEM_LOCKED:
+		ret = copy_from_user(&limit, limitp, sizeof limit);
+		if (ret < 0)
+			return ret;
+
+		mp = mp_get(mfile);
+		if (!mp)
+			return -ENODEV;
+
+		limit = PAGE_ALIGN(limit) >> PAGE_SHIFT;
+		down_write(&current->mm->mmap_sem);
+		locked = limit + current->mm->locked_vm;
+		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+		if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+			up_write(&current->mm->mmap_sem);
+			mp_put(mfile);
+			return -ENOMEM;
+		}
+		current->mm->locked_vm = locked;
+		up_write(&current->mm->mmap_sem);
+
+		mutex_lock(&mp_mutex);
+		mp->pool->locked_pages = limit;
+		mutex_unlock(&mp_mutex);
+
+		mp_put(mfile);
+		return 0;
+
+	case MPASSTHRU_GET_MEM_LOCKED_NEED:
+		limit = DEFAULT_NEED;
+		return copy_to_user(limitp, &limit, sizeof limit);
+
+
 	default:
 		break;
 	}
diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
index c0973b6..efd12ec 100644
--- a/include/linux/mpassthru.h
+++ b/include/linux/mpassthru.h
@@ -8,6 +8,8 @@
 /* ioctl defines */
 #define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
 #define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+#define MPASSTHRU_SET_MEM_LOCKED       _IOW('M', 215, unsigned long)
+#define MPASSTHRU_GET_MEM_LOCKED_NEED  _IOR('M', 216, unsigned long)
 
 #ifdef __KERNEL__
 #if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 15/17]An example how to modifiy NIC driver to use napi_gro_frags() interface
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver.
It provides API is_rx_buffer_mapped_as_page() to indicate
if the driver use napi_gro_frags() interface or not.
The example allocates 2 pages for DMA for one ring descriptor
using netdev_alloc_page(). When packets is coming, using
napi_gro_frags() to allocate skb and to receive the packets.

---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  151 ++++++++++++++++++++++++++++++++--------
 2 files changed, 125 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..905d6d2 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                            PCI_DMA_FROMDEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                         rx_ring->rx_buf_len,
 			                         PCI_DMA_FROMDEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = pci_map_page(pdev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
 	dma_addr_t dma;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return (!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page;
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -868,29 +899,71 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
+		if (is_no_buffer(rx_buffer_info))
+			break;
+
 		cleaned = true;
-		skb = rx_buffer_info->skb;
-		prefetch(skb->data);
-		rx_buffer_info->skb = NULL;
 
-		if (rx_buffer_info->dma) {
-			if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
-			    (!(staterr & IXGBE_RXD_STAT_EOP)) &&
-				 (!(skb->prev)))
-				/*
-				 * When HWRSC is enabled, delay unmapping
-				 * of the first packet. It carries the
-				 * header information, HW may still
-				 * access the header after the writeback.
-				 * Only unmap it when EOP is reached
-				 */
-				IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
-			else
-				pci_unmap_single(pdev, rx_buffer_info->dma,
-				                 rx_ring->rx_buf_len,
-				                 PCI_DMA_FROMDEVICE);
-			rx_buffer_info->dma = 0;
-			skb_put(skb, len);
+		if (!rx_buffer_info->mapped_as_page) {
+			skb = rx_buffer_info->skb;
+			prefetch(skb->data);
+			rx_buffer_info->skb = NULL;
+
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 &
+					IXGBE_FLAG2_RSC_ENABLED) &&
+					(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+					(!(skb->prev)))
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->dma =
+							rx_buffer_info->dma;
+				else
+					pci_unmap_single(pdev,
+							rx_buffer_info->dma,
+							rx_ring->rx_buf_len,
+							PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_put(skb, len);
+			}
+		} else {
+			skb = napi_get_frags(napi);
+			prefetch(rx_buffer_info->page_skb_offset);
+			rx_buffer_info->skb = NULL;
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 &
+					IXGBE_FLAG2_RSC_ENABLED) &&
+					(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+					(!(skb->prev)))
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->dma =
+						rx_buffer_info->dma;
+				else
+					pci_unmap_page(pdev, rx_buffer_info->dma,
+							PAGE_SIZE / 2,
+							PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_fill_page_desc(skb,
+						skb_shinfo(skb)->nr_frags,
+						rx_buffer_info->page_skb,
+						rx_buffer_info->page_skb_offset,
+						len);
+				rx_buffer_info->page_skb = NULL;
+				skb->len += len;
+				skb->data_len += len;
+				skb->truesize += len;
+			}
 		}
 
 		if (upper_len) {
@@ -956,6 +1029,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				rx_buffer_info->dma = next_buffer->dma;
 				next_buffer->skb = skb;
 				next_buffer->dma = 0;
+				if (rx_buffer_info->mapped_as_page) {
+					rx_buffer_info->page_skb =
+							next_buffer->page_skb;
+					next_buffer->page_skb = NULL;
+				}
 			} else {
 				skb->next = next_buffer->skb;
 				skb->next->prev = skb;
@@ -975,7 +1053,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		skb->protocol = eth_type_trans(skb, adapter->netdev);
+		if (!rx_buffer_info->mapped_as_page)
+			skb->protocol = eth_type_trans(skb, adapter->netdev);
 #ifdef IXGBE_FCOE
 		/* if ddp, not passing to ULD unless for FCP_RSP or error */
 		if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
@@ -984,7 +1063,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				goto next_desc;
 		}
 #endif /* IXGBE_FCOE */
-		ixgbe_receive_skb(q_vector, skb, staterr, rx_ring, rx_desc);
+
+		if (!rx_buffer_info->mapped_as_page)
+			ixgbe_receive_skb(q_vector, skb, staterr,
+						rx_ring, rx_desc);
+		else {
+			skb_record_rx_queue(skb, rx_ring->queue_index);
+			napi_gro_frags(napi);
+		}
 
 next_desc:
 		rx_desc->wb.upper.status_error = 0;
@@ -3131,9 +3217,16 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 
 		rx_buffer_info = &rx_ring->rx_buffer_info[i];
 		if (rx_buffer_info->dma) {
-			pci_unmap_single(pdev, rx_buffer_info->dma,
-			                 rx_ring->rx_buf_len,
-			                 PCI_DMA_FROMDEVICE);
+			if (!rx_buffer_info->mapped_as_page) {
+				pci_unmap_single(pdev, rx_buffer_info->dma,
+						rx_ring->rx_buf_len,
+						PCI_DMA_FROMDEVICE);
+			} else {
+				pci_unmap_page(pdev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->page_skb = NULL;
+			}
 			rx_buffer_info->dma = 0;
 		}
 		if (rx_buffer_info->skb) {
@@ -3158,7 +3251,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 			               PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			rx_buffer_info->page_dma = 0;
 		}
-		put_page(rx_buffer_info->page);
+		netdev_free_page(adapter->netdev, rx_buffer_info->page);
 		rx_buffer_info->page = NULL;
 		rx_buffer_info->page_offset = 0;
 	}
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 14/17] Provides multiple submits and asynchronous notifications.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  341 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   79 ++++++++++++
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 407 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..44f4b15 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -39,6 +41,8 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -93,11 +98,183 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_desc(&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len += iocb->ki_nbytes;
+
+					if (iocb->ki_dtor)
+						iocb->ki_dtor(iocb);
+					kmem_cache_free(net->cache, iocb);
+
+					if (unlikely(vq_log)) {
+						if (!log)
+							__vhost_get_desc(
+							&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+						vhost_log_write(
+							vq, vq_log, log, size);
+					}
+				} else
+					break;
+
+				i++;
+				if (count)
+					iocb = notify_dequeue(vq);
+			}
+			vhost_add_used_and_signal_n(
+					&net->dev, vq, vq->heads, hc);
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct list_head *entry, *tmp;
+	unsigned long flags;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_for_each_safe(entry, tmp, &vq->notifier) {
+		iocb = list_entry(entry,
+				  struct kiocb, ki_list);
+		if (!iocb->ki_flags)
+			continue;
+		list_del(&iocb->ki_list);
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+				 struct vhost_virtqueue *vq,
+				 unsigned head)
+{
+	struct kiocb *iocb = NULL;
+
+	if (!is_async_vq(vq))
+		return NULL;
+
+	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+	if (!iocb)
+		return NULL;
+	iocb->private = vq;
+	iocb->ki_pos = head;
+	iocb->ki_dtor = handle_iocb;
+	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX])
+		iocb->ki_user_data = vq->num;
+	iocb->ki_iovec = vq->hdr;
+	return iocb;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -130,6 +307,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	vhost_hlen = vq->vhost_hlen;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_desc(&net->dev, vq, vq->iov,
 				      ARRAY_SIZE(vq->iov),
@@ -138,10 +317,13 @@ static void handle_tx(struct vhost_net *net)
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
-			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
-				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-				break;
+			if (!is_async_vq(vq)) {
+				if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+					tx_poll_start(net, sock);
+					set_bit(SOCK_ASYNC_NOSPACE,
+						&sock->flags);
+					break;
+				}
 			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
@@ -158,6 +340,13 @@ static void handle_tx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
 		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
+
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
@@ -166,12 +355,18 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -183,6 +378,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -205,7 +402,8 @@ static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned in, log, s;
+	struct kiocb *iocb = NULL;
+	unsigned in, out, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -225,25 +423,42 @@ static void handle_rx(struct vhost_net *net)
 	int err, headcount, datalen;
 	size_t vhost_hlen;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+		      !is_async_vq(vq)))
 		return;
-
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
 	vhost_hlen = vq->vhost_hlen;
 
+	/* In async cases, when write log is enabled, in case the submitted
+	 * buffers did not get log info before the log enabling, so we'd
+	 * better recompute the log info when needed. We do this in
+	 * handle_async_rx_events_notify().
+	 */
+
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	while ((datalen = vhost_head_len(vq, sock->sk))) {
-		headcount = vhost_get_desc_n(vq, vq->heads,
-					     datalen + vhost_hlen,
-					     &in, vq_log, &log);
+	handle_async_rx_events_notify(net, vq, sock);
+
+	while (is_async_vq(vq) ||
+		(datalen = vhost_head_len(vq, sock->sk)) != 0) {
+		if (is_async_vq(vq))
+			headcount =
+				vhost_get_desc(&net->dev, vq, vq->iov,
+						ARRAY_SIZE(vq->iov),
+						&out, &in,
+						vq->log, &log);
+		else
+			headcount = vhost_get_desc_n(vq, vq->heads,
+						     datalen + vhost_hlen,
+						     &in, vq_log, &log);
 		if (headcount < 0)
 			break;
 		/* OK, now we need to know about added descriptors. */
-		if (!headcount) {
+		if ((!headcount && !is_async_vq(vq)) ||
+			(headcount == vq->num && is_async_vq(vq))) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -256,7 +471,12 @@ static void handle_rx(struct vhost_net *net)
 		}
 		/* We don't need to be notified again. */
 		/* Skip header. TODO: support TSO. */
+		if (is_async_vq(vq) && vhost_hlen == sizeof(hdr)) {
+			vq->hdr[0].iov_len = vhost_hlen;
+			goto nomove;
+		}
 		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
+nomove:
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
 		/* Sanity check */
@@ -266,13 +486,23 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, headcount);
+			if (!iocb)
+				break;
+		}
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_desc(vq, headcount);
 			break;
 		}
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != datalen) {
 			pr_err("Discarded rx packet: "
 			       " len %d, expected %zd\n", err, datalen);
@@ -280,6 +510,9 @@ static void handle_rx(struct vhost_net *net)
 			continue;
 		}
 		len = err;
+		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
+			hdr.num_buffers = headcount;
+
 		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr,
 				     vhost_hlen);
 		if (err) {
@@ -287,18 +520,7 @@ static void handle_rx(struct vhost_net *net)
 			       vq->iov->iov_base, err);
 			break;
 		}
-		/* TODO: Should check and handle checksum. */
-		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
-			struct iovec *iov = vhost_hlen ? vq->hdr : vq->iov;
-
-			if (memcpy_toiovecend(iov, (unsigned char *)&headcount,
-				      offsetof(typeof(hdr), num_buffers),
-				      sizeof(hdr.num_buffers))) {
-				vq_err(vq, "Failed num_buffers write");
-				vhost_discard_desc(vq, headcount);
-				break;
-			}
-		}
+
 		len += vhost_hlen;
 		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
 					    headcount);
@@ -311,6 +533,8 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -364,6 +588,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 
 	f->private_data = n;
 
@@ -427,6 +652,21 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+	/* clean the notifier */
+	struct vhost_virtqueue *vq;
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		vq = &n->dev.vqs[VHOST_NET_VQ_TX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -443,6 +683,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_async_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -494,21 +735,58 @@ static struct socket *get_tap_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+				 enum vhost_vq_link_state *state)
 {
 	struct socket *sock;
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
+
+	*state = VHOST_VQ_LINK_SYNC;
+
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	/* If we dont' have notify_cache, then dont do mpassthru */
+	if (!notify_cache)
+		return ERR_PTR(-ENOTSOCK);
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		*state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache)
+			n->cache = notify_cache;
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -532,12 +810,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(vq, fd, &vq->link_state);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
@@ -687,6 +967,9 @@ int vhost_net_init(void)
 	r = misc_register(&vhost_net_misc);
 	if (r)
 		goto err_reg;
+	notify_cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
 	return 0;
 err_reg:
 	vhost_cleanup();
@@ -700,6 +983,8 @@ void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
 	vhost_cleanup();
+	if (notify_cache)
+		kmem_cache_destroy(notify_cache);
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 118c8e0..66ff5c5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -909,6 +909,85 @@ err:
 	return r;
 }
 
+unsigned __vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[], unsigned int iov_size,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head)
+{
+	struct vring_desc desc;
+	unsigned int i, found = 0;
+	u16 last_avail_idx;
+	int ret;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+	if (unlikely(log))
+		*log_num = 0;
+
+	i = head;
+	do {
+		unsigned iov_count = *in_num + *out_num;
+		if (i >= vq->num) {
+			vq_err(vq, "Desc index is %u > %u, head = %u",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		if (++found > vq->num) {
+			vq_err(vq, "Loop detected: last one at %u "
+			       "vq size %u head %u\n",
+			       i, vq->num, head);
+			return vq->num;
+		}
+		ret = copy_from_user(&desc, vq->desc + i, sizeof desc);
+		if (ret) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+			ret = get_indirect(dev, vq, iov, iov_size,
+					   out_num, in_num,
+					   log, log_num, &desc);
+			if (ret < 0) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return vq->num;
+			}
+			continue;
+		}
+
+		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
+				     iov_size - iov_count);
+		if (ret < 0) {
+			vq_err(vq, "Translation failure %d descriptor idx %d\n",
+			       ret, i);
+			return vq->num;
+		}
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += ret;
+			if (unlikely(log)) {
+				log[*log_num].addr = desc.addr;
+				log[*log_num].len = desc.len;
+				++*log_num;
+			}
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (*in_num) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				return vq->num;
+			}
+			*out_num += ret;
+		}
+	} while ((i = next_desc(&desc)) != -1);
+
+	return head;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 08d740a..54c6d0b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 0,
+	VHOST_VQ_LINK_ASYNC = 1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -98,6 +103,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
 };
 
 struct vhost_dev {
@@ -125,6 +134,11 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
 		     int datalen, unsigned int *iovcount, struct vhost_log *log,
 		     unsigned int *log_num);
+unsigned __vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			struct iovec iov[], unsigned int iov_count,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			unsigned int head);
 unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
@@ -165,6 +179,7 @@ enum {
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features = rcu_dereference(dev->acked_features);
+	acked_features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
 	return acked_features & (1 << bit);
 }
 
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 11/17] Add header file for mp device.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..c0973b6
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,26 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c11e32c..83172b8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 09/17] Don't do skb recycle, if device use external buffer.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 76c8301..1aede3a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 08/17] Modify netdev_free_page() to release external buffer
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ab29675..3d7f70e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1512,9 +1512,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index da02fa1..76c8301 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,6 +306,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 07/17]Modify netdev_alloc_page() to get external buffer
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Currently, it can get external buffers from mp device.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..da02fa1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mp_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

    If buffer is external, then use the callback to destruct
    buffers.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
    Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
    Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/skbuff.c      |    8 ++++++++
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 74af06c..ab29675 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,10 +197,11 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* The structure is for a skb which pages may point to
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..117d82b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->tx_flags.flags = 0;
+	shinfo->destructor_arg = NULL;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
@@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 05/17] Add a function to indicate if device use external buffer.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f5024..9ed4fa2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1611,6 +1611,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mp_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the driver want to allocate external buffers,
then it can export it's capability, as the skb
buffer header length, the page length can be DMA, etc.
The external buffers owner may utilize this.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9ef9bf1..24a31e7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -651,6 +651,12 @@ struct mp_port	{
  * int (*ndo_set_vf_tx_rate)(struct net_device *dev, int vf, int rate);
  * int (*ndo_get_vf_config)(struct net_device *dev,
  *			    int vf, struct ifla_vf_info *ivf);
+ *
+ * int (*ndo_mp_port_prep)(struct net_device *dev, struct mp_port *port);
+ *	If the driver want to allocate external buffers,
+ *	then it can export it's capability, as the skb
+ *	buffer header length, the page length can be DMA, etc.
+ *	The external buffers owner may utilize this.
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -713,6 +719,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mp_port *port);
+#endif
 };
 
 /*
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 02/17] Add a new struct for device to manipulate external buffer.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Add a structure in structure net_device, the new field is
named as mp_port. It's for mediate passthru (zero-copy).
It contains the capability for the net device driver,
a socket, and an external buffer creator, external means
skb buffer belongs to the device may not be allocated from
kernel space.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..9ef9bf1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,28 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/*The structure for mediate passthru(zero-copy). */
+struct mp_port	{
+	/* the header len */
+	int		hdr_len;
+	/* the max payload len for one descriptor */
+	int		data_len;
+	/* the pages for DMA in one time */
+	int		npages;
+	/* the socket bind to */
+	struct socket	*sock;
+	/* the header len for virtio-net */
+	int		vnet_hlen;
+	/* the external buffer page creator */
+	struct skb_ext_page *(*ctor)(struct mp_port *,
+				struct sk_buff *, int);
+	/* the hash function attached to find according
+	 * backend ring descriptor info for one external
+	 * buffer page.
+	 */
+	struct skb_ext_page *(*hash)(struct net_device *,
+				struct page *);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +974,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mp_port		*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.7.3

^ permalink raw reply related

* [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

We provide an zero-copy method which driver side may get external
buffers to DMA. Here external means driver don't use kernel space
to allocate skb buffers. Currently the external buffer can be from
guest virtio-net driver.

The idea is simple, just to pin the guest VM user space and then
let host NIC driver has the chance to directly DMA to it. 
The patches are based on vhost-net backend driver. We add a device
which provides proto_ops as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
get copyless data transfer thru guest virtio-net frontend.

patch 01-10:  	net core and kernel changes.
patch 11-13:  	new device as interface to mantpulate external buffers.
patch 14: 	for vhost-net.
patch 15:	An example on modifying NIC driver to using napi_gro_frags().
patch 16:	An example how to get guest buffers based on driver
		who using napi_gro_frags().
patch 17:	It's a patch to address comments from Michael S. Thirkin
		to add 2 new ioctls in mp device.
		We split it out here to make easier reiewer.
		Need to revise.

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to NIC driver for rx when
a page constructor API is invoked. Means NICs can allocate user buffers
from a page constructor. We add a hook in netif_receive_skb() function
to intercept the incoming packets, and notify the zero-copy device.

For write, the zero-copy deivce may allocates a new host skb and puts
payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchroush mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it looks generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
	
	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup(all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Remove the struct vhost_notifier with struct kiocb.
	Let vhost-net backend to alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten 
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent Dos
	by using RLIMIT_MEMLOCK
	

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.
	

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor which may happen in interrupt context
	-This remove the potential issues which lock called in interrupt context
	make the cache used by mp, vhost as static, and created/destoryed during
	modules init/exit functions.
	-This makes multiple mp guest created at the same time.

what we have done in v7:
	some cleanup prepared to suppprt PS mode

what we have done in v8:
	discarding the modifications to point skb->data to guest buffer directly.
	Add code to modify driver to support napi_gro_frags() with Herbert's comments.
	To support PS mode.
	Add mergeable buffer support in mp device.
	Add GSO/GRO support in mp deice.
	Address comments from Eric Dumazet about cache line and rcu usage.

what we have done in v9:
	v8 patch is based on a fix in dev_gro_receive().
	But Herbert did not agree with the fix we have sent out.
	And he suggest another fix. v9 is modified to base on that fix.
	

what we have done in v10:
	Fix a partial csum error.
	Cleanup some unused fields with struct page_info{} in mp device.
	Modify kmem_cache_zalloc() to kmem_cache_alloc() based on Michael S. Thirkin.

what we have done in v11:
	Address comments from Michael S. Thirkin to add two new ioctls in mp device.
	But still need to revise.

what we have done in v11:
	Address most comments from Ben Hutchings, except the compat ioctls.
	As the comments are sparse, so do not make a split patch.
	Change struct mpassthru_port to struct mp_port, and struct page_ctor
	to struct page_pool.
 
Performance:
	We have seen the performance data request from mailling-list.
	And we are now looking into this.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox