Netdev List


* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-04 12:24 UTC
  To: Rusty Russell; +Cc: Krishna Kumar2, davem, netdev, yvugenfi
In-Reply-To: <20101102161730.GA32311@redhat.com>

On Tue, Nov 02, 2010 at 06:17:30PM +0200, Michael S. Tsirkin wrote:
> On Fri, Oct 29, 2010 at 09:58:40PM +1030, Rusty Russell wrote:
> > On Fri, 29 Oct 2010 09:25:09 pm Krishna Kumar2 wrote:
> > > Rusty Russell <rusty@rustcorp.com.au> wrote on 10/29/2010 03:17:24 PM:
> > > 
> > > > > Oct 17 10:22:40 localhost kernel: net eth0: Unexpected TX queue failure: -28
> > > > > Oct 17 10:28:22 localhost kernel: net eth0: Unexpected TX queue failure: -28
> > > > > Oct 17 10:35:58 localhost kernel: net eth0: Unexpected TX queue failure: -28
> > > > > Oct 17 10:41:06 localhost kernel: net eth0: Unexpected TX queue failure: -28
> > > > >
> > > > > I initially changed the check from -ENOMEM to -ENOSPC, but
> > > > > virtqueue_add_buf can return only -ENOSPC when it doesn't have
> > > > > space for new request.  Patch removes redundant checks but
> > > > > displays the failure errno.
> > > > >
> > > > > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> > > > > ---
> > > > >  drivers/net/virtio_net.c |   15 ++++-----------
> > > > >  1 file changed, 4 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
> > > > > --- org/drivers/net/virtio_net.c   2010-10-11 10:20:02.000000000 +0530
> > > > > +++ new/drivers/net/virtio_net.c   2010-10-21 17:37:45.000000000 +0530
> > > > > @@ -570,17 +570,10 @@ static netdev_tx_t start_xmit(struct sk_
> > > > >
> > > > >     /* This can happen with OOM and indirect buffers. */
> > > > >     if (unlikely(capacity < 0)) {
> > > > > -      if (net_ratelimit()) {
> > > > > -         if (likely(capacity == -ENOMEM)) {
> > > > > -            dev_warn(&dev->dev,
> > > > > -                "TX queue failure: out of memory\n");
> > > > > -         } else {
> > > > > -            dev->stats.tx_fifo_errors++;
> > > > > -            dev_warn(&dev->dev,
> > > > > -                "Unexpected TX queue failure: %d\n",
> > > > > -                capacity);
> > > > > -         }
> > > > > -      }
> > > > > +      if (net_ratelimit())
> > > > > +         dev_warn(&dev->dev,
> > > > > +             "TX queue failure (%d): out of memory\n",
> > > > > +             capacity);
> > > >
> > > > Hold on... you were getting -ENOSPC, which shouldn't happen.  What makes
> > > > you think it's out of memory?
> > > 
> > > virtqueue_add_buf_gfp returns only -ENOSPC on failure, whether
> > > direct or indirect descriptors are used, so isn't -ENOSPC
> > > "expected"? (vring_add_indirect returns -ENOMEM on memory
> > > failure, but that is masked out and we go direct which is
> > > the failure point).
> > 
> > Ah, OK, gotchya.
> > I'm not even sure the fallback to linear makes sense; if we're failing
> > kmallocs we should probably just return -ENOMEM.  Would mean we can
> > tell the difference between "out of space" (which should never happen
> > since we stop the queue when we have < 2+MAX_SKB_FRAGS slots left)
> > and this case.
> > 
> > Michael, what do you think?
> > 
> > Thanks,
> > Rusty.
> 
> Let's make sure I understand the issue: we use indirect buffers
> so we assume there's still a lot of space in the ring, then
> allocation for the indirect fails and so we return -ENOSPC?
> 
> So first, I agree it's a bug.  But I am not sure killing the fallback
> is such a good idea: recovering from add buf failure is hard
> generally, we should try to accommodate if we can. Let's just fix
> the return code for now?
> 
> And generally, we should be smarter: as long as the ring is almost
> empty, and s/g list is short, it is a waste to use indirect buffers.
> BTW we have had a FIXME there for a long while, I think Yan suggested
> increasing that threshold to 3. Yan?
> 
> Further, maybe preallocating some memory for the indirect buffers might
> be a good idea.
> 
> In short, lots of good ideas, let's start with the minimal patch that is
> a good 2.6.37 candidate too. How about the following (untested)?
> 
> virtio: fix add_buf return code for OOM
> 
> add_buf returned ENOSPC on out of memory: this is a bug
> as at least virtio-net expects ENOMEM and handles it
> specially. Fix that.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

I thought about this some more.  I think the original
code is actually correct in returning ENOSPC: indirect
buffers are nice, but it's a mistake
to rely on them as a memory allocation might fail.

And if you look at virtio-net, it is dropping packets
under memory pressure which is not really a happy outcome:
the packet will get freed, reallocated and we get another one,
adding pressure on the allocator instead of releasing it
until we free up some buffers.

So I now think we should calculate the capacity
assuming non-indirect entries, and if we manage to
use indirect, all the better.

So below is what I propose now - as a replacement for
my original patch.  Krishna Kumar, Rusty, what do you think?

Separately I'm also considering moving the
	if (vq->num_free < out + in)
check earlier in the function to keep all users honest,
but need to check what the implications are for e.g. block.
Thoughts on this?

---->

virtio: return correct capacity to users

We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail.

So return the number of buffers we can guarantee users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1475ed6..cc2f73e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -230,9 +230,6 @@ add_head:
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
 
-	/* If we're indirect, we can fit many (assuming not OOM). */
-	if (vq->indirect)
-		return vq->num_free ? vq->vring.num : 0;
 	return vq->num_free;
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
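
To make the effect on callers concrete, here is a minimal userspace sketch of
the two return-value policies. It is only a model under stated assumptions:
struct vq_model, capacity_old(), capacity_new() and fits_without_indirect()
are hypothetical names, not kernel code, and the numbers in the note below are
made up.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the fields the patch touches. */
struct vq_model {
	unsigned int num;       /* ring size (vq->vring.num in the patch) */
	unsigned int num_free;  /* free descriptors (vq->num_free) */
	bool indirect;          /* indirect descriptors in use */
};

/* Before the patch: with indirect buffers, report the whole ring. */
static unsigned int capacity_old(const struct vq_model *vq)
{
	if (vq->indirect)
		return vq->num_free ? vq->num : 0;
	return vq->num_free;
}

/* After the patch: report only what is guaranteed without an allocation. */
static unsigned int capacity_new(const struct vq_model *vq)
{
	return vq->num_free;
}

/* Would a request of `needed` descriptors fit if the indirect
 * allocation failed and we had to fall back to direct entries? */
static bool fits_without_indirect(const struct vq_model *vq,
				  unsigned int needed)
{
	return vq->num_free >= needed;
}
```

With num = 256, num_free = 1 and indirect set, capacity_old() reports 256, so
a driver that stops its queue based on the reported capacity keeps
transmitting, although a 3-descriptor request no longer fits once the indirect
allocation fails. capacity_new() reports 1 in the same state.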


* Re: Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 11:35 UTC
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1288869699.2659.77.camel@edumazet-laptop>

On Thu, 2010-11-04 at 12:21 +0100, Eric Dumazet wrote:
> 
> Hmm, a review of the code spotted a bug in fib_result_assign()
> 
> Please try the following patch:
> 
> Thanks again!
> 
> [PATCH] fib: fib_result_assign() should not change fib refcounts
> 
> After commit ebc0ffae5 (RCU conversion of fib_lookup()),
> fib_result_assign()  should not change fib refcounts anymore.
> 
> Thanks to Michael who did the bisection and bug report.
> 
> Reported-by: Michael Ellerman <michael@ellerman.id.au>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/fib_lookup.h |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
> index a29edf2..c079cc0 100644
> --- a/net/ipv4/fib_lookup.h
> +++ b/net/ipv4/fib_lookup.h
> @@ -47,11 +47,8 @@ extern int fib_detect_death(struct fib_info *fi, int order,
>  static inline void fib_result_assign(struct fib_result *res,
>  				     struct fib_info *fi)
>  {
> -	if (res->fi != NULL)
> -		fib_info_put(res->fi);
> +	/* we used to play games with refcounts, but we now use RCU */
>  	res->fi = fi;
> -	if (fi != NULL)
> -		atomic_inc(&fi->fib_clntref);
>  }
>  
>  #endif /* _FIB_LOOKUP_H */

Perfect, that fixes it, thanks!

cheers




* vhost-net-next updates
From: Michael S. Tsirkin @ 2010-11-04 11:26 UTC
  To: Shirley Ma; +Cc: krkumar2, netdev, kvm, linux-kernel

I pushed out some optimization patches on vhost-net-next
branch on my vhost tree (intended for 2.6.38).
It would be helpful if people working on vhost-net optimizations
base their work on that tree just to make sure comparisons
are apples to apples.

I might rebase this as I didn't send a pull request to Dave yet
but I'll try not to.  So far I have:

8b7347a vhost: get/put_user -> __get/__put_user
dfe5ac5 vhost: copy_to_user -> __copy_to_user
64e1c80 vhost-net: batch use/unuse mm
533a19b vhost: put mm after thread stop
3fcedec drivers/vhost/vhost.c: delete double assignment

Thanks!

-- 
MST


* Re: Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 11:23 UTC
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1288869699.2659.77.camel@edumazet-laptop>

On Thu, 2010-11-04 at 12:21 +0100, Eric Dumazet wrote:
> On Thursday, 4 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> > On Thursday, 4 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > > Hi all,
> > > 
> > > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > > "Freeing alive fib_info" messages, from free_fib_info().
> > > 
> > > Actually I only get one per boot, when network interfaces come up.
> > > Seemingly related I am getting refcount problems when I shutdown, ie.
> > > unregister_netdevice() sees a usage count of 1, which never decrements.
> > > 
> > > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > > 
> > >     fib: RCU conversion of fib_lookup()
> > >     
> > >     fib_lookup() converted to be called in RCU protected context, no
> > >     reference taken and released on a contended cache line (fib_clntref)
> > >     
> > > 
> > > Is this a bug in that commit, or a driver bug exposed?
> > 
> > Hi Michael, thanks for the report (and painful bisection I guess)
> > 
> > That's hard to say... Is it reproducible on my machine?
> > 
> 
> Hmm, a review of the code spotted a bug in fib_result_assign()

Aha, I was just adding some debug in there. Let me test the patch.

cheers



* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 11:21 UTC
  To: michael; +Cc: netdev
In-Reply-To: <1288866626.2659.71.camel@edumazet-laptop>

On Thursday, 4 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> On Thursday, 4 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > Hi all,
> > 
> > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > "Freeing alive fib_info" messages, from free_fib_info().
> > 
> > Actually I only get one per boot, when network interfaces come up.
> > Seemingly related I am getting refcount problems when I shutdown, ie.
> > unregister_netdevice() sees a usage count of 1, which never decrements.
> > 
> > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > 
> >     fib: RCU conversion of fib_lookup()
> >     
> >     fib_lookup() converted to be called in RCU protected context, no
> >     reference taken and released on a contended cache line (fib_clntref)
> >     
> > 
> > Is this a bug in that commit, or a driver bug exposed?
> 
> Hi Michael, thanks for the report (and painful bisection I guess)
> 
> That's hard to say... Is it reproducible on my machine?
> 

Hmm, a review of the code spotted a bug in fib_result_assign()

Please try the following patch:

Thanks again!

[PATCH] fib: fib_result_assign() should not change fib refcounts

After commit ebc0ffae5 (RCU conversion of fib_lookup()),
fib_result_assign()  should not change fib refcounts anymore.

Thanks to Michael who did the bisection and bug report.

Reported-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/fib_lookup.h |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index a29edf2..c079cc0 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -47,11 +47,8 @@ extern int fib_detect_death(struct fib_info *fi, int order,
 static inline void fib_result_assign(struct fib_result *res,
 				     struct fib_info *fi)
 {
-	if (res->fi != NULL)
-		fib_info_put(res->fi);
+	/* we used to play games with refcounts, but we now use RCU */
 	res->fi = fi;
-	if (fi != NULL)
-		atomic_inc(&fi->fib_clntref);
 }
 
 #endif /* _FIB_LOOKUP_H */
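
One plausible reading of the imbalance can be sketched in userspace. This is a
simplified, single-threaded model and not the kernel code: struct fi_model,
struct res_model, fi_put(), lookup_rcu(), assign_old() and assign_new() are
hypothetical names. It assumes each entry starts with one reference owned by
the table and that an RCU lookup stores the pointer without taking a
reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, single-threaded model of struct fib_info. */
struct fi_model {
	int clntref;       /* models fib_clntref */
	bool dead;         /* models fib_dead */
	bool freed_alive;  /* set where "Freeing alive fib_info" would print */
};

struct res_model {
	struct fi_model *fi;
};

/* Models fib_info_put(): drop a reference; the last one frees the entry. */
static void fi_put(struct fi_model *fi)
{
	if (--fi->clntref == 0 && !fi->dead)
		fi->freed_alive = true;
}

/* Models an RCU lookup: store the pointer, take no reference. */
static void lookup_rcu(struct res_model *res, struct fi_model *fi)
{
	res->fi = fi;
}

/* Old fib_result_assign(): adjusts refcounts on both entries. */
static void assign_old(struct res_model *res, struct fi_model *fi)
{
	if (res->fi != NULL)
		fi_put(res->fi);   /* drops a reference the lookup never took */
	res->fi = fi;
	if (fi != NULL)
		fi->clntref++;     /* reference that no caller releases */
}

/* Patched fib_result_assign(): plain pointer assignment. */
static void assign_new(struct res_model *res, struct fi_model *fi)
{
	res->fi = fi;
}
```

In this model, a lookup followed by assign_old() drops the table's only
reference on the old entry while it is still alive, and leaves an extra
reference on the new one, which may correspond to the usage count that never
decrements at shutdown. assign_new() leaves both counts untouched.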




* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 10:46 UTC
  To: michael; +Cc: netdev
In-Reply-To: <1288866626.2659.71.camel@edumazet-laptop>

On Thursday, 4 November 2010 at 11:30 +0100, Eric Dumazet wrote:
> On Thursday, 4 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> > Hi all,
> > 
> > I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> > "Freeing alive fib_info" messages, from free_fib_info().
> > 
> > Actually I only get one per boot, when network interfaces come up.
> > Seemingly related I am getting refcount problems when I shutdown, ie.
> > unregister_netdevice() sees a usage count of 1, which never decrements.
> > 
> > Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> > 
> >     fib: RCU conversion of fib_lookup()
> >     
> >     fib_lookup() converted to be called in RCU protected context, no
> >     reference taken and released on a contended cache line (fib_clntref)
> >     
> > 
> > Is this a bug in that commit, or a driver bug exposed?
> 
> Hi Michael, thanks for the report (and painful bisection I guess)
> 
> That's hard to say... Is it reproducible on my machine?

You could possibly get a stack trace with the following change; this might help to spot the bug.

Thanks

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3e0da3e..8039db0 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,6 +159,7 @@ void free_fib_info(struct fib_info *fi)
 {
 	if (fi->fib_dead == 0) {
 		pr_warning("Freeing alive fib_info %p\n", fi);
+		WARN_ON_ONCE(1);
 		return;
 	}
 	change_nexthops(fi) {





* Re: Freeing alive fib_info caused by ebc0ffae5
From: Eric Dumazet @ 2010-11-04 10:30 UTC
  To: michael; +Cc: netdev
In-Reply-To: <1288866186.30549.10.camel@concordia>

On Thursday, 4 November 2010 at 21:23 +1100, Michael Ellerman wrote:
> Hi all,
> 
> I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
> "Freeing alive fib_info" messages, from free_fib_info().
> 
> Actually I only get one per boot, when network interfaces come up.
> Seemingly related I am getting refcount problems when I shutdown, ie.
> unregister_netdevice() sees a usage count of 1, which never decrements.
> 
> Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.
> 
>     fib: RCU conversion of fib_lookup()
>     
>     fib_lookup() converted to be called in RCU protected context, no
>     reference taken and released on a contended cache line (fib_clntref)
>     
> 
> Is this a bug in that commit, or a driver bug exposed?

Hi Michael, thanks for the report (and painful bisection I guess)

That's hard to say... Is it reproducible on my machine?

Thanks




* Re: [RFC 0/3] MPEG2/TS drop analyzer iptables match extension
From: Jan Engelhardt @ 2010-11-04 10:29 UTC
  To: Jesper Dangaard Brouer
  Cc: Netfilter Developers, paulmck, Eric Dumazet, netdev, Solvik Blum
In-Reply-To: <Pine.LNX.4.64.1011040953590.19565@ask.diku.dk>


On Thursday 2010-11-04 10:20, Jesper Dangaard Brouer wrote:
> On Thu, 4 Nov 2010, Jan Engelhardt wrote:
>> On Tuesday 2010-10-19 16:21, Jesper Dangaard Brouer wrote:
>>>
>>> This is my iptables match module for analyzing IPTV MPEG2/TS streams.
>>> Currently it only detects dropped packets, but I want to extend it for
>>> analyzing jitter and bursts.
>>>
>>> Jan Engelhardt convinced me that I should just send the module as-is
>>> for review on the list.  I wrote the code in 2009, and have only done
>>> some minor changes to make it work on kernel 2.6.35 since.
>>
>> This now lives in the mp2t branch (since NFWS already actually) of xt-a,
>> and I have taken the liberty to start updating it to higher standards.
>> Please watch that branch, as I don't have any MPEG equipment around me
>> to do runtime tests.
>
> Jan, I would actually like to maintain the source via my own git tree. And I
> would gladly accept your patches against that tree.

I do not mind who is hosting what parts, as git repos can be
transferred easily, but I strongly suggest not to decouple xt_mp2t
from (any clone of) the xtables-addons structure base, because doing
so would bring you back to square one with regard to maintenance.

I recognize you may dislike splitting up the IPTV codebase, so I
propose that you make use of submodules, and have an Xt-a clone as
one submodule. That would allow merging in both directions.


* Freeing alive fib_info caused by ebc0ffae5
From: Michael Ellerman @ 2010-11-04 10:23 UTC
  To: netdev; +Cc: eric.dumazet

Hi all,

I'm running Linus' latest or thereabouts (ff8b16d), and I'm seeing
"Freeing alive fib_info" messages, from free_fib_info().

Actually I only get one per boot, when network interfaces come up.
Seemingly related I am getting refcount problems when I shutdown, ie.
unregister_netdevice() sees a usage count of 1, which never decrements.

Bisect says it's ebc0ffae5 which causes the problem, or makes it appear.

    fib: RCU conversion of fib_lookup()

    fib_lookup() converted to be called in RCU protected context, no
    reference taken and released on a contended cache line (fib_clntref)

Is this a bug in that commit, or a driver bug exposed?

cheers


* Re: [PATCH 1/2] r6040: fix multicast operations
From: Florian Fainelli @ 2010-11-04 10:04 UTC
  To: Shawn Lin; +Cc: netdev, Marc Leclerc, Albert Chen, David Miller, Ben Hutchings
In-Reply-To: <1288788689.1837.141.camel@shawn-desktop>

Hello Shawn,

On Wednesday 03 November 2010 13:51:29 Shawn Lin wrote:
> On Wed, 2010-10-20 at 23:09 +0200, Florian Fainelli wrote:
> > This patch fixes the following issues with the r6040 NIC operating in
> > multicast:
> > 
> > 1) When the IFF_ALLMULTI flag is set, we should write 0xffff to the NIC
> >    hash table registers to make it process multicast traffic
> > 
> > 2) When the number of multicast addresses to handle is smaller than
> >    MCAST_MAX we should use the NIC multicast registers MID1_{L,M,H}.
> > 
> > 3) The hashing of the address was not correct, due to an invalid
> >    subtraction (15 - (crc & 0x0f)) instead of (crc & 0x0f)
> 
> I suggest modifying the comment as follows.
> 
> 3) The hashing of the address was not correct, due to an invalid
>  subtraction (15 - (crc & 0x0f)) instead of (crc & 0x0f) and an
>  incorrect crc algorithm (ether_crc_le) instead of (ether_crc).
> 
> [...]
> 
> The original code I submitted to Florian has some issues mentioned by Ben
> Hutchings.
> 
> This revision fixes these issues and another issue about the sequence of
> configuring multicast hash table registers.
> 
> The correct sequence is to enable the multicast function before writing values
> to the hash table registers. I have verified it on my platform.
> 
> The hash algorithm is provided by hardware designers. I also re-confirmed
> it with RDC's engineer.
> 
> Please let me know if anyone has questions.
> 
> The version is for net-next-2.6:

Please resubmit the patch with your Signed-off-by tag and the Tested-by: 
to keep track of the issue. Thank you!
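
For reference, the bit selection described in point 3 above can be sketched in
userspace. This is a minimal sketch, not driver code: crc32_le_model(),
bitrev32_model(), ether_crc_model() and hash_add() are hypothetical names, and
it assumes that the kernel's ether_crc() is the bit-reversed crc32_le() seeded
with ~0.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Bitwise little-endian CRC-32 (polynomial 0xEDB88320), no final XOR. */
static uint32_t crc32_le_model(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
	}
	return crc;
}

/* Reverse the bit order of a 32-bit word. */
static uint32_t bitrev32_model(uint32_t x)
{
	uint32_t r = 0;

	for (int i = 0; i < 32; i++) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Assumed equivalent of ether_crc(): bit-reversed crc32_le seeded with ~0. */
static uint32_t ether_crc_model(size_t len, const uint8_t *data)
{
	return bitrev32_model(crc32_le_model(~0u, data, len));
}

/* Set the hash-table bit for one multicast address, as in the new code:
 * the top 6 CRC bits select one of 4 registers and one of 16 bits. */
static void hash_add(uint16_t hash_table[4], const uint8_t addr[ETH_ALEN])
{
	uint32_t crc = ether_crc_model(ETH_ALEN, addr) >> 26;

	hash_table[crc >> 4] |= (uint16_t)(1u << (crc & 0xf));
}
```

Each address sets exactly one of the 64 bits spread over the four 16-bit words
that the patch writes to MAR0..MAR3. The old code used ether_crc_le() and
(15 - (crc & 0xf)), which selects a different bit for the same address.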

> 
> ---
> diff --git a/drivers/net/r6040.c b/drivers/net/r6040.c
> index 0b014c8..e88e171 100644
> --- a/drivers/net/r6040.c
> +++ b/drivers/net/r6040.c
> @@ -69,6 +69,8 @@
> 
>  /* MAC registers */
>  #define MCR0		0x00	/* Control register 0 */
> +#define  PROMISC	0x0020  /* Promiscuous mode */
> +#define  HASH_EN	0x0100  /* Enable multicast hash table function */
>  #define MCR1		0x04	/* Control register 1 */
>  #define  MAC_RST	0x0001	/* Reset the MAC */
>  #define MBCR		0x08	/* Bus control */
> @@ -851,77 +853,84 @@ static void r6040_multicast_list(struct net_device *dev)
>  {
>  	struct r6040_private *lp = netdev_priv(dev);
>  	void __iomem *ioaddr = lp->base;
> -	u16 *adrp;
> -	u16 reg;
>  	unsigned long flags;
>  	struct netdev_hw_addr *ha;
>  	int i;
> +	u16 hash_table[4] = { 0, };
> 
> -	/* MAC Address */
> -	adrp = (u16 *)dev->dev_addr;
> -	iowrite16(adrp[0], ioaddr + MID_0L);
> -	iowrite16(adrp[1], ioaddr + MID_0M);
> -	iowrite16(adrp[2], ioaddr + MID_0H);
> -
> -	/* Promiscous Mode */
>  	spin_lock_irqsave(&lp->lock, flags);
> 
>  	/* Clear AMCP & PROM bits */
> -	reg = ioread16(ioaddr) & ~0x0120;
> +	lp->mcr0 = ioread16(ioaddr + MCR0) & ~(PROMISC | HASH_EN);
> +
> +	/* Promiscuous Mode */
>  	if (dev->flags & IFF_PROMISC) {
> -		reg |= 0x0020;
> -		lp->mcr0 |= 0x0020;
> +		lp->mcr0 |= PROMISC;
>  	}
> -	/* Too many multicast addresses
> -	 * accept all traffic */
> -	else if ((netdev_mc_count(dev) > MCAST_MAX) ||
> -		 (dev->flags & IFF_ALLMULTI))
> -		reg |= 0x0020;
> -
> -	iowrite16(reg, ioaddr);
> -	spin_unlock_irqrestore(&lp->lock, flags);
> +	/* Enable multicast hash table function to
> +	 * receive all multicast packets. */
> +	else if (dev->flags & IFF_ALLMULTI) {
> +		lp->mcr0 |= HASH_EN;
> +
> +		for (i = 0; i < MCAST_MAX ; i++) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +		}
> 
> -	/* Build the hash table */
> -	if (netdev_mc_count(dev) > MCAST_MAX) {
> -		u16 hash_table[4];
> +		for (i = 0; i < 4; i++)
> +			hash_table[i] = 0xffff;
> +	}
> +	/* Use internal multicast address registers if the number of
> +	 * multicast addresses is not greater than MCAST_MAX. */
> +	else if (netdev_mc_count(dev) <= MCAST_MAX) {
> +		i = 0;
> +		netdev_for_each_mc_addr(ha, dev) {
> +			u16 *adrp = (u16 *) ha->addr;
> +			iowrite16(adrp[0], ioaddr + MID_1L + 8 * i);
> +			iowrite16(adrp[1], ioaddr + MID_1M + 8 * i);
> +			iowrite16(adrp[2], ioaddr + MID_1H + 8 * i);
> +			i++;
> +		}
> +		while (i < MCAST_MAX) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +			i++;
> +		}
> +	}
> +	/* Otherwise, Enable multicast hash table function. */
> +	else {
>  		u32 crc;
> 
> -		for (i = 0; i < 4; i++)
> -			hash_table[i] = 0;
> +		lp->mcr0 |= HASH_EN;
> 
> -		netdev_for_each_mc_addr(ha, dev) {
> -			char *addrs = ha->addr;
> +		for (i = 0; i < MCAST_MAX ; i++) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +		}
> 
> -			if (!(*addrs & 1))
> -				continue;
> +		/* Build multicast hash table */
> +		netdev_for_each_mc_addr(ha, dev) {
> +			u8 *addrs = ha->addr;
> 
> -			crc = ether_crc_le(6, addrs);
> +			crc = ether_crc(ETH_ALEN, addrs);
>  			crc >>= 26;
> -			hash_table[crc >> 4] |= 1 << (15 - (crc & 0xf));
> +			hash_table[crc >> 4] |= 1 << (crc & 0xf);
>  		}
> -		/* Fill the MAC hash tables with their values */
> +	}
> +	iowrite16(lp->mcr0, ioaddr + MCR0);
> +
> +	/* Fill the MAC hash tables with their values */
> +	if (lp->mcr0 & HASH_EN) {
>  		iowrite16(hash_table[0], ioaddr + MAR0);
>  		iowrite16(hash_table[1], ioaddr + MAR1);
>  		iowrite16(hash_table[2], ioaddr + MAR2);
>  		iowrite16(hash_table[3], ioaddr + MAR3);
>  	}
> -	/* Multicast Address 1~4 case */
> -	i = 0;
> -	netdev_for_each_mc_addr(ha, dev) {
> -		if (i >= MCAST_MAX)
> -			break;
> -		adrp = (u16 *) ha->addr;
> -		iowrite16(adrp[0], ioaddr + MID_1L + 8 * i);
> -		iowrite16(adrp[1], ioaddr + MID_1M + 8 * i);
> -		iowrite16(adrp[2], ioaddr + MID_1H + 8 * i);
> -		i++;
> -	}
> -	while (i < MCAST_MAX) {
> -		iowrite16(0xffff, ioaddr + MID_1L + 8 * i);
> -		iowrite16(0xffff, ioaddr + MID_1M + 8 * i);
> -		iowrite16(0xffff, ioaddr + MID_1H + 8 * i);
> -		i++;
> -	}
> +
> +	spin_unlock_irqrestore(&lp->lock, flags);
>  }
> 
>  static void netdev_get_drvinfo(struct net_device *dev,
> ---
> 
> The version is for 2.6.32.y and 2.6.27.y:
> 
> ---
> diff --git a/drivers/net/r6040.c b/drivers/net/r6040.c
> index 9ee9f01..f9af419 100644
> --- a/drivers/net/r6040.c
> +++ b/drivers/net/r6040.c
> @@ -69,6 +69,8 @@
> 
>  /* MAC registers */
>  #define MCR0		0x00	/* Control register 0 */
> +#define  PROMISC	0x0020  /* Promiscuous mode */
> +#define  HASH_EN	0x0100  /* Enable multicast hash table function */
>  #define MCR1		0x04	/* Control register 1 */
>  #define  MAC_RST	0x0001	/* Reset the MAC */
>  #define MBCR		0x08	/* Bus control */
> @@ -935,76 +937,88 @@ static void r6040_multicast_list(struct net_device *dev)
>  {
>  	struct r6040_private *lp = netdev_priv(dev);
>  	void __iomem *ioaddr = lp->base;
> -	u16 *adrp;
> -	u16 reg;
>  	unsigned long flags;
>  	struct dev_mc_list *dmi = dev->mc_list;
>  	int i;
> +	u16 hash_table[4] = { 0, };
> 
> -	/* MAC Address */
> -	adrp = (u16 *)dev->dev_addr;
> -	iowrite16(adrp[0], ioaddr + MID_0L);
> -	iowrite16(adrp[1], ioaddr + MID_0M);
> -	iowrite16(adrp[2], ioaddr + MID_0H);
> -
> -	/* Promiscous Mode */
>  	spin_lock_irqsave(&lp->lock, flags);
> 
>  	/* Clear AMCP & PROM bits */
> -	reg = ioread16(ioaddr) & ~0x0120;
> +	lp->mcr0 = ioread16(ioaddr + MCR0) & ~(PROMISC | HASH_EN);
> +
> +	/* Promiscuous Mode */
>  	if (dev->flags & IFF_PROMISC) {
> -		reg |= 0x0020;
> -		lp->mcr0 |= 0x0020;
> +		lp->mcr0 |= PROMISC;
>  	}
> -	/* Too many multicast addresses
> -	 * accept all traffic */
> -	else if ((dev->mc_count > MCAST_MAX)
> -		|| (dev->flags & IFF_ALLMULTI))
> -		reg |= 0x0020;
> +	/* Enable multicast hash table function to
> +	 * receive all multicast packets. */
> +	else if (dev->flags & IFF_ALLMULTI) {
> +		lp->mcr0 |= HASH_EN;
> +
> +		for (i = 0; i < MCAST_MAX ; i++) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +		}
> 
> -	iowrite16(reg, ioaddr);
> -	spin_unlock_irqrestore(&lp->lock, flags);
> +		for (i = 0; i < 4; i++)
> +			hash_table[i] = 0xffff;
> +	}
> +	/* Use internal multicast address registers if the number of
> +	 * multicast addresses is not greater than MCAST_MAX. */
> +	else if (dev->mc_count <= MCAST_MAX) {
> +		i = 0;
> +		while (i < dev->mc_count) {
> +			u16 *adrp = (u16 *) dmi->dmi_addr;
> +			dmi = dmi->next;
> 
> -	/* Build the hash table */
> -	if (dev->mc_count > MCAST_MAX) {
> -		u16 hash_table[4];
> +			iowrite16(adrp[0], ioaddr + MID_1L + 8 * i);
> +			iowrite16(adrp[1], ioaddr + MID_1M + 8 * i);
> +			iowrite16(adrp[2], ioaddr + MID_1H + 8 * i);
> +			i++;
> +		}
> +		while (i < MCAST_MAX) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +			i++;
> +		}
> +	}
> +	/* Otherwise, Enable multicast hash table function. */
> +	else {
>  		u32 crc;
> 
> -		for (i = 0; i < 4; i++)
> -			hash_table[i] = 0;
> +		lp->mcr0 |= HASH_EN;
> 
> -		for (i = 0; i < dev->mc_count; i++) {
> -			char *addrs = dmi->dmi_addr;
> +		for (i = 0; i < MCAST_MAX ; i++) {
> +			iowrite16(0, ioaddr + MID_1L + 8 * i);
> +			iowrite16(0, ioaddr + MID_1M + 8 * i);
> +			iowrite16(0, ioaddr + MID_1H + 8 * i);
> +		}
> 
> +		/* Build multicast hash table */
> +		for (i = 0; i < dev->mc_count; i++) {
> +			u8 *addrs = dmi->dmi_addr;
>  			dmi = dmi->next;
> 
> -			if (!(*addrs & 1))
> -				continue;
> -
> -			crc = ether_crc_le(6, addrs);
> +			crc = ether_crc(ETH_ALEN, addrs);
>  			crc >>= 26;
> -			hash_table[crc >> 4] |= 1 << (15 - (crc & 0xf));
> +			hash_table[crc >> 4] |= 1 << (crc & 0xf);
>  		}
> -		/* Fill the MAC hash tables with their values */
> +
> +	}
> +	iowrite16(lp->mcr0, ioaddr + MCR0);
> +
> +	/* Fill the MAC hash tables with their values */
> +	if (lp->mcr0 & HASH_EN) {
>  		iowrite16(hash_table[0], ioaddr + MAR0);
>  		iowrite16(hash_table[1], ioaddr + MAR1);
>  		iowrite16(hash_table[2], ioaddr + MAR2);
>  		iowrite16(hash_table[3], ioaddr + MAR3);
>  	}
> -	/* Multicast Address 1~4 case */
> -	dmi = dev->mc_list;
> -	for (i = 0, dmi; (i < dev->mc_count) && (i < MCAST_MAX); i++) {
> -		adrp = (u16 *)dmi->dmi_addr;
> -		iowrite16(adrp[0], ioaddr + MID_1L + 8*i);
> -		iowrite16(adrp[1], ioaddr + MID_1M + 8*i);
> -		iowrite16(adrp[2], ioaddr + MID_1H + 8*i);
> -		dmi = dmi->next;
> -	}
> -	for (i = dev->mc_count; i < MCAST_MAX; i++) {
> -		iowrite16(0xffff, ioaddr + MID_1L + 8*i);
> -		iowrite16(0xffff, ioaddr + MID_1M + 8*i);
> -		iowrite16(0xffff, ioaddr + MID_1H + 8*i);
> -	}
> +
> +	spin_unlock_irqrestore(&lp->lock, flags);
>  }
> 
>  static void netdev_get_drvinfo(struct net_device *dev,
> ---


* Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
From: Michael S. Tsirkin @ 2010-11-04  9:30 UTC
  To: Shirley Ma; +Cc: David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288849126.12932.4.camel@localhost.localdomain>

On Wed, Nov 03, 2010 at 10:38:46PM -0700, Shirley Ma wrote:
> On Wed, 2010-11-03 at 12:48 +0200, Michael S. Tsirkin wrote:
> > I mean in practice, you see a benefit from this patch?
> 
> Yes, I tested it. It does benefit the performance.
> 
> > > My concern here is whether checking only at setup would be
> > > sufficient for security?
> > 
> > It better be sufficient because the checks that put_user does
> > are not effective when run from the kernel thread, anyway.
> > 
> > > Would there be a case where the guest could corrupt the ring
> > > later? If not, that's OK.
> > 
> > You mean change the pointer after it's checked?
> > If you see such a case, please holler.
> 
> I wondered about it; I have no such case in mind.
> 
> > To clarify: the combination of __put_user and separate
> > signalling is giving the same performance benefit as your
> > patch?
> 
> Yes, it has similar performance, though I haven't finished comparing all
> message sizes yet.
> 
> > I am mostly concerned with adding code that seems to help
> > speed for reasons we don't completely understand, because
> > then we might break the optimization easily without noticing.
> 
> I don't think the patch I submitted would break anything.

No, I just meant that when a patch gives some benefit, I'd like
to understand where the benefit comes from so that I don't
break it later.

> It just
> reduced the cost of the 3 put_user() calls per used buffer, and of guest
> signaling, by going from one-to-one to many-to-one.

One thing to note is that deferred signalling needs to be
benchmarked with old guests which don't orphan skbs on xmit
(or disable orphaning in both networking stack and virtio-net).

> 
> Thanks
> Shirley

OK, so I guess I'll queue the __put_user etc. patches for net-next. On top of
this, I think a patch which defers signalling would be nice to have;
then we can figure out whether a separate heads array still has benefits
for the non-zero-copy case: if yes, what they are; if no, whether it should be
used for zero copy only or for both non-zero copy and zero copy.

Makes sense?

-- 
MST

^ permalink raw reply
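
As a rough illustration of the deferred (many-to-one) signalling discussed in the message above, the sketch below accumulates completed buffers and signals the guest once per batch rather than once per buffer. It is a user-space model, not vhost code: `used_batch`, `batch_add`, `batch_flush` and the threshold `BATCH_MAX` are hypothetical names chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 4	/* hypothetical flush threshold */

struct used_batch {
	unsigned int heads[BATCH_MAX];	/* completed heads not yet published */
	size_t pending;			/* number of valid entries in heads[] */
	unsigned long signals;		/* times the guest was signalled */
	unsigned long published;	/* heads handed back to the guest so far */
};

/* Publish every pending head and signal the guest once for the whole batch. */
static void batch_flush(struct used_batch *b)
{
	if (!b->pending)
		return;
	b->published += b->pending;
	b->pending = 0;
	b->signals++;
}

/* Record one completed buffer; signal only when the batch fills up. */
static void batch_add(struct used_batch *b, unsigned int head)
{
	assert(b->pending < BATCH_MAX);
	b->heads[b->pending++] = head;
	if (b->pending == BATCH_MAX)
		batch_flush(b);
}
```

With ten completions and a threshold of four, this model signals twice while filling batches and once more on the final explicit flush, instead of ten times. A real implementation would also need a timeout or end-of-burst flush so that a partial batch is not held back indefinitely.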

* Re: [RFC 0/3] MPEG2/TS drop analyzer iptables match extension
From: Jesper Dangaard Brouer @ 2010-11-04  9:20 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developers, paulmck, Eric Dumazet, netdev, Solvik Blum
In-Reply-To: <alpine.LNX.2.01.1011040113270.8604@obet.zrqbmnf.qr>

On Thu, 4 Nov 2010, Jan Engelhardt wrote:
> On Tuesday 2010-10-19 16:21, Jesper Dangaard Brouer wrote:
>>
>> This is my iptables match module for analyzing IPTV MPEG2/TS streams.
>> Currently it only detects dropped packets, but I want to extend it for
>> analyzing jitter and bursts.
>>
>> Jan Engelhardt convinced me that I should just send the module as-is
>> for review on the list.  I wrote the code in 2009, and have only done
>> some minor changes to make it work on kernel 2.6.35 since.
>
> This now lives in the mp2t branch (since NFWS already actually) of xt-a,
> and I have taken the liberty to start updating it to higher standards.
> Please watch that branch, as I don't have any MPEG equipment around me
> to do runtime tests.

Jan, I would actually like to maintain the source via my own git tree. 
And I would gladly accept your patches against that tree.

Since the workshop, I have been busy "Open Sourcing" the project myself.
I now have a git repository, which also contains the collector daemon,
web-code and database layout.  The git tree also contains some README
documentation.  For example, it contains a description of how to set up a
testlab with MPEG2 streaming via VLC and controlled packet drops via
tc-netem, so you can perform your runtime tests.

I didn't plan to release the project just yet, as I wanted to do some 
renaming.  E.g. I want to rename the mp2t module to mpeg2ts, and I want to
rename the collector daemon from tvprobe to iptv-analyzer.

The git tree is temporarily located on people.netfilter.org:

  git://people.netfilter.org/hawk/iptv-analyzer.git

I have bought the domain iptv-analyzer.org, but I have not installed any 
servers on that domain, yet.

Solvik Blum (Cc'ed) is currently helping me out with the web-frontend.

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply
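
For readers unfamiliar with how drops can be detected in an MPEG2 transport stream, here is a minimal user-space sketch. It assumes detection is based on the 4-bit continuity counter in the TS packet header, which is the standard mechanism but is not confirmed by the message above to be what the mp2t module does. `cc_state` and `cc_update` are hypothetical names, and the sketch ignores duplicate packets, adaptation-field-only packets and null packets, which a real analyzer has to handle.

```c
#include <assert.h>
#include <stdint.h>

#define TS_SYNC_BYTE 0x47	/* first byte of every 188-byte TS packet */

/* Per-PID tracking state; the caller keeps one instance per PID. */
struct cc_state {
	int seen;		/* nonzero once a packet has been observed */
	uint8_t last_cc;	/* continuity counter of the previous packet */
	unsigned long drops;	/* total packets inferred as missing */
};

/* 13-bit PID: low 5 bits of byte 1 followed by all of byte 2. */
static uint16_t ts_pid(const uint8_t *pkt)
{
	return (uint16_t)(((pkt[1] & 0x1F) << 8) | pkt[2]);
}

/* 4-bit continuity counter: low nibble of byte 3. */
static uint8_t ts_cc(const uint8_t *pkt)
{
	return pkt[3] & 0x0F;
}

/* Returns how many packets appear to be missing before this one. */
static unsigned int cc_update(struct cc_state *st, const uint8_t *pkt)
{
	unsigned int missing = 0;
	uint8_t cc;

	assert(pkt[0] == TS_SYNC_BYTE);
	cc = ts_cc(pkt);
	if (st->seen)
		missing = ((unsigned int)cc - st->last_cc - 1u) & 0x0F;
	st->seen = 1;
	st->last_cc = cc;
	st->drops += missing;
	return missing;
}
```

Because the counter is only 4 bits wide, a burst that loses a multiple of 16 packets on one PID is invisible to this check, which is one reason jitter and burst analysis need additional information.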

* Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: Eric Dumazet @ 2010-11-04  9:07 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike
In-Reply-To: <1288861465.2659.44.camel@edumazet-laptop>

On Thursday, 04 November 2010 at 10:04 +0100, Eric Dumazet wrote:

> Hmm, I suggest you read the comment two lines above.
> 
> If destructor_arg is now cleared each time we allocate a new skb, then,
> please move it before dataref in shinfo structure, so that the following
> memset() does the job efficiently...


Something like :

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..2dca504 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -195,6 +195,9 @@ struct skb_shared_info {
 	__be32          ip6_frag_id;
 	__u8		tx_flags;
 	struct sk_buff	*frag_list;
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
 	struct skb_shared_hwtstamps hwtstamps;
 
 	/*
@@ -202,9 +205,6 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void *		destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };




^ permalink raw reply related
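
To illustrate why the field order matters in the diff above: as I understand it, skb allocation clears the shared info with a single memset() that stops at `dataref`, so any field declared before `dataref` is zeroed for free. The sketch below models that in user space. `shinfo_model`, `shinfo_init` and `NFRAGS` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NFRAGS 4	/* stand-in for MAX_SKB_FRAGS */

/* Simplified model of the layout proposed in the diff above. */
struct shinfo_model {
	unsigned short nr_frags;
	unsigned short gso_size;
	void *frag_list;
	void *destructor_arg;	/* now placed before dataref */
	int dataref;		/* fields from here on are not cleared */
	void *frags[NFRAGS];	/* must stay last */
};

/* Clear everything up to (not including) dataref, then take one reference. */
static void shinfo_init(struct shinfo_model *s)
{
	assert(s != NULL);
	memset(s, 0, offsetof(struct shinfo_model, dataref));
	s->dataref = 1;
}
```

With `destructor_arg` placed after `dataref`, the same memset() would leave it uninitialized and a separate store would be needed on every allocation.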

* [PATCH v14 14/17] Add a kconfig entry and make entry for mp device.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  Zero-copy network I/O support. We call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3


^ permalink raw reply related

* [PATCH v14 13/17] Add mp(mediate passthru) device.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

This patch adds the mp (mediate passthru) device, which is now
based on the vhost-net backend driver and provides proto_ops
to send/receive guest buffer data from/to the guest virtio-net
driver.
It also exports async functions which can be used by other
drivers like macvtap to utilize zero-copy too.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1515 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1515 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..492430c
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1515 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+#include "../net/bonding/bonding.h"
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device       *dev;
+	struct page_pool	*pool;
+	struct socket           socket;
+	struct socket_wq	wq;
+	struct mm_struct	*mm;
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info = NULL;
+
+	if (npages != 1)
+		BUG();
+	pool = container_of(port, struct page_pool, port);
+
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = mp->pool;
+	struct page_info *info;
+
+	info = mp_hash_lookup(pool, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+struct page_pool *page_pool_create(struct net_device *dev,
+				  struct socket *sock)
+{
+	struct page_pool *pool;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+	int rc;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	/* How to deal with bonding device:
+	 * check if all the slaves are capable of zero-copy.
+	 * if not, fail.
+	 */
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			rc = netdev_mp_port_prep(slave->dev, &pool->port);
+			if (rc)
+				break;
+		}
+		read_unlock(&bond->lock);
+	} else
+		rc = netdev_mp_port_prep(dev, &pool->port);
+	if (rc)
+		goto fail;
+
+	INIT_LIST_HEAD(&pool->readq);
+	spin_lock_init(&pool->read_lock);
+	pool->hash_table =
+		kzalloc(sizeof(struct page_info *) * HASH_BUCKETS, GFP_KERNEL);
+	if (!pool->hash_table)
+		goto fail;
+
+	pool->dev = dev;
+	pool->port.ctor = page_ctor;
+	pool->port.sock = sock;
+	pool->port.hash = mp_lookup;
+	pool->locked_pages = 0;
+	pool->cur_pages = 0;
+	pool->orig_locked_vm = 0;
+
+	/* for bonding device, assign all the slaves the same page_pool */
+	if (master) {
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			dev_hold(slave->dev);
+			slave->dev->mp_port = &pool->port;
+		}
+		read_unlock(&bond->lock);
+	} else {
+		dev_hold(dev);
+		dev->mp_port = &pool->port;
+	}
+
+	return pool;
+fail:
+	kfree(pool);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(page_pool_create);
+
+void dev_bond_hold(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			if (slave->dev != dev)
+				dev_hold(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	}
+}
+
+void dev_bond_put(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			if (slave->dev != dev)
+				dev_put(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	}
+}
+
+void dev_change_state(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			dev_change_flags(slave->dev,
+					 slave->dev->flags & (~IFF_UP));
+			dev_change_flags(slave->dev,
+					 slave->dev->flags | IFF_UP);
+		}
+		read_unlock(&bond->lock);
+	} else {
+		dev_change_flags(dev, dev->flags & (~IFF_UP));
+		dev_change_flags(dev, dev->flags | IFF_UP);
+	}
+}
+EXPORT_SYMBOL_GPL(dev_change_state);
+
+static int mp_page_pool_attach(struct mp_struct *mp, struct page_pool *pool)
+{
+	int rc = 0;
+	/* should be protected by mp_mutex */
+	if (mp->pool) {
+		rc = -EBUSY;
+		goto fail;
+	}
+	if (mp->dev != pool->dev) {
+		rc = -EFAULT;
+		goto fail;
+	}
+	mp->pool = pool;
+	return 0;
+fail:
+	kfree(pool->hash_table);
+	kfree(pool);
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_pool *pool)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	return info;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		mp_hash_delete(info->pool, info);
+		if (info->skb) {
+			info->skb->destructor = NULL;
+			kfree_skb(info->skb);
+		}
+	}
+	/* Decrement the number of locked pages */
+	info->pool->cur_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool)
+{
+	struct page_info *info;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+
+	if (!pool)
+		return;
+
+	while ((info = info_dequeue(pool))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	down_write(&mm->mmap_sem);
+	mm->locked_vm -= pool->locked_pages;
+	up_write(&mm->mmap_sem);
+
+	master = pool->dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			slave->dev->mp_port = NULL;
+			dev_put(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	} else {
+		pool->dev->mp_port = NULL;
+		dev_put(pool->dev);
+	}
+
+	kfree(pool->hash_table);
+	kfree(pool);
+}
+EXPORT_SYMBOL_GPL(page_pool_destroy);
+
+static void mp_page_pool_detach(struct mp_struct *mp)
+{
+	/* locked by mp_mutex */
+	if (mp->pool) {
+		page_pool_destroy(mp->mm, mp->pool);
+		mp->pool = NULL;
+	}
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	mp->mfile = NULL;
+	master = mp->dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i)
+			dev_change_flags(slave->dev,
+					 slave->dev->flags & ~IFF_UP);
+		read_unlock(&bond->lock);
+	} else
+		dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	mp_page_pool_detach(mp);
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i)
+			dev_change_flags(slave->dev,
+					 slave->dev->flags | IFF_UP);
+		read_unlock(&bond->lock);
+	} else
+		dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count)) {
+		if (!mfile->mp)
+			return;
+		if (!rtnl_is_locked()) {
+			rtnl_lock();
+			mp_detach(mfile->mp);
+			rtnl_unlock();
+		} else
+			mp_detach(mfile->mp);
+	}
+}
+
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_ext_page *ext_page)
+{
+	struct page_info *info;
+	struct page_pool *pool;
+	struct sock *sk;
+	struct sk_buff *skb;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	pool = info->pool;
+	skb = info->skb;
+
+	if (info->flags == INFO_READ) {
+		create_iocb(info, 0);
+		return;
+	}
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb_tag(info->iocb);
+	sk = pool->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For transmitting small external buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_pool *pool,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info =
+		kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->ext_page.dtor = page_dtor;
+	info->pool = pool;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	info->pnum = 0;
+	return info;
+}
+
+typedef u32 key_mp_t;
+static inline key_mp_t mp_hash(struct page *page, int buckets)
+{
+	key_mp_t k;
+#if BITS_PER_LONG == 64
+	k = ((((unsigned long)page << 32UL) >> 32UL) /
+			sizeof(struct page)) % buckets ;
+#elif BITS_PER_LONG == 32
+	k = ((unsigned long)page / sizeof(struct page)) % buckets;
+#endif
+
+	return k;
+}
+
+static void mp_hash_insert(struct page_pool *pool,
+		struct page *page, struct page_info *page_info)
+{
+	struct page_info *tmp;
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	if (!pool->hash_table[key]) {
+		pool->hash_table[key] = page_info;
+		return;
+	}
+
+	tmp = pool->hash_table[key];
+	while (tmp->next)
+		tmp = tmp->next;
+
+	tmp->next = page_info;
+	page_info->prev = tmp;
+	return;
+}
+
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info)
+{
+	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		if (tmp == info) {
+			if (!tmp->prev) {
+				pool->hash_table[key] = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = NULL;
+			} else {
+				tmp->prev->next = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = tmp->prev;
+			}
+			return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page)
+{
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	int i;
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		for (i = 0; i < tmp->pnum; i++) {
+			if (tmp->pages[i] == page)
+				return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_pool *pool,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = NULL;
+
+	if (pool->cur_pages + count > pool->locked_pages) {
+		printk(KERN_INFO "Exceed memory lock rlimit.\n");
+		return NULL;
+	}
+
+	info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->skb = NULL;
+	info->next = info->prev = NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(pool->dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->ext_page.dtor = page_dtor;
+	info->ext_page.page = info->pages[0];
+	info->pool = pool;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	else
+		info->flags = INFO_READ;
+
+	if (info->flags == INFO_READ) {
+		if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
+			frags[0].offset = iocb->ki_iovec[0].iov_len;
+			pool->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
+		}
+		for (i = 0; i < j; i++)
+			mp_hash_insert(pool, info->pages[i], info);
+	}
+	/* increment the number of locked pages */
+	pool->cur_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return NULL;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	wait_queue_head_t *wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_sync_poll(wqueue, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	wait_queue_head_t *wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_sync_poll(wqueue, POLLOUT);
+}
+
+void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	int len;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		struct page *page;
+		int off;
+		int size = 0, i = 0;
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct skb_ext_page *ext_page =
+			(struct skb_ext_page *)(shinfo->destructor_arg);
+		struct virtio_net_hdr_mrg_rxbuf hdr = {
+			.hdr.flags = 0,
+			.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+		};
+
+		if (!ext_page) {
+			kfree_skb(skb);
+			continue;
+		}
+		if (skb->ip_summed == CHECKSUM_COMPLETE)
+			printk(KERN_INFO "Complete checksum occurs\n");
+
+		if (shinfo->frags[0].page == ext_page->page) {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			if (shinfo->nr_frags)
+				hdr.num_buffers = shinfo->nr_frags;
+			else
+				hdr.num_buffers = shinfo->nr_frags + 1;
+		} else {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			hdr.num_buffers = shinfo->nr_frags + 1;
+		}
+		skb_push(skb, ETH_HLEN);
+
+		if (skb_is_gso(skb)) {
+			hdr.hdr.hdr_len = skb_headlen(skb);
+			hdr.hdr.gso_size = shinfo->gso_size;
+			if (shinfo->gso_type & SKB_GSO_TCPV4)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (shinfo->gso_type & SKB_GSO_TCPV6)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (shinfo->gso_type & SKB_GSO_UDP)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+		} else
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			hdr.hdr.csum_start =
+				skb->csum_start - skb_headroom(skb);
+			hdr.hdr.csum_offset = skb->csum_offset;
+		}
+
+		off = info->hdr[0].iov_len;
+		len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
+		if (len) {
+			pr_debug("Unable to write vnet_hdr at addr '%p': '%d'\n",
+				info->iov, len);
+			goto clean;
+		}
+
+		memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
+
+		info->iocb->ki_left = hdr.num_buffers;
+		if (shinfo->frags[0].page == ext_page->page) {
+			size = shinfo->frags[0].size +
+				shinfo->frags[0].page_offset - off;
+			i = 1;
+		} else {
+			size = skb_headlen(skb);
+			i = 0;
+		}
+		create_iocb(info, off + size);
+		for (i = i; i < shinfo->nr_frags; i++) {
+			page = shinfo->frags[i].page;
+			info = mp_hash_lookup(pool, shinfo->frags[i].page);
+			create_iocb(info, shinfo->frags[i].size);
+		}
+		info->skb = skb;
+		shinfo->nr_frags = 0;
+		shinfo->destructor_arg = NULL;
+		continue;
+clean:
+		kfree_skb(skb);
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	return;
+}
+EXPORT_SYMBOL_GPL(async_data_ready);
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = NULL;
+
+	pool = mp->pool;
+	if (!pool)
+		return;
+	return async_data_ready(sk, pool);
+}
+
+static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
+					   size_t len, size_t linear,
+					   int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+			err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
+		struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+int async_sendmsg(struct sock *sk, struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count)
+{
+	struct virtio_net_hdr vnet_hdr = {0};
+	int hdr_len = 0;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	hdr_len = sizeof(vnet_hdr);
+	if ((total - iocb->ki_iovec[0].iov_len) < 0)
+		return -EINVAL;
+
+	rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
+	if (rc < 0)
+		return -EINVAL;
+
+	if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+			vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+			vnet_hdr.hdr_len)
+		vnet_hdr.hdr_len = vnet_hdr.csum_start +
+			vnet_hdr.csum_offset + 2;
+
+	if (vnet_hdr.hdr_len > total)
+		return -EINVAL;
+
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = mp_alloc_skb(sk, NET_IP_ALIGN, header,
+			iocb->ki_iovec[0].iov_len, 1, &rc);
+	if (!skb)
+		goto drop;
+
+	skb_set_network_header(skb, ETH_HLEN);
+	memcpy_fromiovec(skb->data, iov, header);
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
+	if (rc)
+		goto drop;
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(pool, iocb, total);
+	} else {
+		info = alloc_page_info(pool, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; i < info->pnum; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (!pool->cur_pages)
+		sk->sk_state_change(sk);
+
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = pool->dev->master ? pool->dev->master : pool->dev;
+		create_iocb(info, total);
+		dev_queue_xmit(skb);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	pool->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(async_sendmsg);
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+
+	pool = mp->pool;
+	if (!pool)
+		return -ENODEV;
+	return async_sendmsg(sock->sk, iocb, pool, iov, count);
+}
+
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags)
+{
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	if (!pool)
+		return -EINVAL;
+
+	/* Error detections in case invalid external buffer */
+	if (count > 2 && iov[1].iov_len < pool->port.hdr_len &&
+			pool->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = pool->port.npages;
+	payload = pool->port.data_len;
+
+	/* If KVM guest virtio-net FE driver use SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+proceed:
+	/* skip the virtnet head */
+	if (count > 1) {
+		iov++;
+		count--;
+	}
+
+	/* Translate address to kernel */
+	info = alloc_page_info(pool, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	iocb->ki_iovec[0].iov_len = 0;
+	iocb->ki_left = 0;
+	info->desc_pos = iocb->ki_pos;
+
+	if (count > 1) {
+		iov--;
+		count++;
+	}
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&pool->read_lock, flag);
+	list_add_tail(&info->list, &pool->readq);
+	spin_unlock_irqrestore(&pool->read_lock, flag);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(async_recvmsg);
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+
+	pool = mp->pool;
+	if (!pool)
+		return -EINVAL;
+
+	return async_recvmsg(iocb, pool, iov, count, flags);
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+
+	pr_debug("mp: mp_chr_open\n");
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (mp) {
+		mp_detach(mp);
+		sock_put(mp->socket.sk);
+	}
+	mp_put(mfile);
+	return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	struct page_pool *pool;
+	void __user* argp = (void __user *)arg;
+	unsigned long  __user *limitp = argp;
+	struct ifreq ifr;
+	struct sock *sk;
+	unsigned long limit, locked, lock_limit;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -ENODEV;
+
+		rtnl_lock();
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev) {
+			rtnl_unlock();
+			break;
+		}
+		dev_bond_hold(dev);
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+
+		/* the device can only be bound once */
+		if (dev_is_mpassthru(dev))
+			goto err_dev_put;
+
+		ret = -EFAULT;
+
+		if (!(dev->features & NETIF_F_SG)) {
+			pr_debug("The device has no SG features.\n");
+			goto err_dev_put;
+		}
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		mp->mm = get_task_mm(current);
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		mp->socket.wq = &mp->wq;
+		init_waitqueue_head(&mp->wq.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
+
+		pool = page_pool_create(dev, &mp->socket);
+		if (!pool) {
+			ret = -EFAULT;
+			goto err_free_sk;
+		}
+
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = mp_page_pool_attach(mp, pool);
+		if (ret < 0)
+			goto err_free_sk;
+		dev_bond_put(dev);
+		dev_put(dev);
+		dev_change_state(dev);
+out:
+		mutex_unlock(&mp_mutex);
+		rtnl_unlock();
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		mfile->mp = NULL;
+		kfree(mp);
+err_dev_put:
+		dev_bond_put(dev);
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		rtnl_lock();
+		ret = do_unbind(mfile);
+		rtnl_unlock();
+		break;
+
+	case MPASSTHRU_SET_MEM_LOCKED:
+		ret = -EFAULT;
+		if (copy_from_user(&limit, limitp, sizeof limit))
+			return ret;
+
+		mp = mp_get(mfile);
+		if (!mp)
+			return -ENODEV;
+
+		mutex_lock(&mp_mutex);
+		if (mp->mm != current->mm) {
+			mutex_unlock(&mp_mutex);
+			return -EPERM;
+		}
+
+		limit = PAGE_ALIGN(limit) >> PAGE_SHIFT;
+		down_write(&mp->mm->mmap_sem);
+		if (!mp->pool->locked_pages)
+			mp->pool->orig_locked_vm = mp->mm->locked_vm;
+		locked = limit + mp->pool->orig_locked_vm;
+		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+		if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+			up_write(&mp->mm->mmap_sem);
+			mutex_unlock(&mp_mutex);
+			mp_put(mfile);
+			return -ENOMEM;
+		}
+		mp->mm->locked_vm = locked;
+		up_write(&mp->mm->mmap_sem);
+
+		mp->pool->locked_pages = limit;
+		mutex_unlock(&mp_mutex);
+
+		mp_put(mfile);
+		return 0;
+
+	case MPASSTHRU_GET_MEM_LOCKED_NEED:
+		limit = DEFAULT_NEED;
+		return copy_to_user(limitp, &limit, sizeof limit) ?
+			-EFAULT : 0;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->wq.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result = 0;
+
+	if (!mp)
+		return -EBADFD;
+	sk = mp->socket.sk;
+	/* currently, async is not supported.
+	 * but we may support real async aio from user application,
+	 * maybe qemu virtio-net backend.
+	 */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -ENOMEM;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+
+	mp_put(file->private_data);
+	return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	rtnl_lock();
+	do_unbind(mfile);
+	rtnl_unlock();
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+				unsigned long arg)
+{
+	return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = mp_chr_compat_ioctl,
+#endif
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mp_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+	struct sock *sk;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		sock = dev->mp_port->sock;
+		mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+		do_unbind(mp->mfile);
+		break;
+	case NETDEV_CHANGE:
+		sk = dev->mp_port->sock->sk;
+		sk->sk_state_change(sk);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int err = 0;
+
+	ext_page_info_cache = kmem_cache_create("skb_page_info",
+						sizeof(struct page_info),
+						0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!ext_page_info_cache)
+		return -ENOMEM;
+
+	err = misc_register(&mp_miscdev);
+	if (err) {
+		printk(KERN_ERR "mp: Can't register misc device\n");
+		kmem_cache_destroy(ext_page_info_cache);
+	} else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+				mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return err;
+}
+
+void mp_exit(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+	kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 12/17] Add header file for mp device.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |  133 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 133 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..1115f55
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,133 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+#define MPASSTHRU_SET_MEM_LOCKED       _IOW('M', 215, unsigned long)
+#define MPASSTHRU_GET_MEM_LOCKED_NEED  _IOR('M', 216, unsigned long)
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+#define DEFAULT_NEED   ((8192*2*2)*4096)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+#define HASH_BUCKETS    (8192*2)
+struct page_info {
+	struct list_head        list;
+	struct page_info        *next;
+	struct page_info        *prev;
+	struct page             *pages[MAX_SKB_FRAGS];
+	struct sk_buff          *skb;
+	struct page_pool        *pool;
+
+	/* The pointer relayed to the skb, to indicate
+	 * whether it is an externally allocated skb or a kernel one
+	 */
+	struct skb_ext_page    ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ                      0
+#define INFO_WRITE                     1
+	unsigned                flags;
+	/* exact number of locked pages */
+	unsigned                pnum;
+
+	/* The fields after this point are for the backend
+	 * driver, currently vhost-net.
+	 */
+	/* the kiocb structure this info is related to */
+	struct kiocb            *iocb;
+	/* the ring descriptor index */
+	unsigned int            desc_pos;
+	/* the iovec coming from the backend, we only
+	 * need a few of them */
+	struct iovec            hdr[2];
+	struct iovec            iov[2];
+};
+
+struct page_pool {
+	/* the queue for rx side */
+	struct list_head        readq;
+	/* the lock to protect readq */
+	spinlock_t              read_lock;
+	/* record the original rlimit */
+	struct rlimit           o_rlim;
+	/* number of pages userspace wants to lock */
+	int                     locked_pages;
+	/* currently locked pages */
+	int                     cur_pages;
+	/* the memory locked before */
+	unsigned long		orig_locked_vm;
+	/* the device this pool belongs to */
+	struct net_device       *dev;
+	/* the mp_port corresponding to dev */
+	struct mp_port          port;
+	/* the hash_table list to find each locked page */
+	struct page_info        **hash_table;
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock);
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags);
+int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		  struct page_pool *pool, struct iovec *iov,
+		  int count);
+void async_data_ready(struct sock *sk, struct page_pool *pool);
+void dev_change_state(struct net_device *dev);
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline struct page_pool *page_pool_create(struct net_device *dev,
+		struct socket *sock)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		struct iovec *iov, int count, int flags)
+{
+	return -EINVAL;
+}
+static inline int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		struct page_pool *pool, struct iovec *iov,
+		int count)
+{
+	return -EINVAL;
+}
+static inline void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	return;
+}
+static inline void dev_change_state(struct net_device *dev)
+{
+	return;
+}
+static inline void page_pool_destroy(struct mm_struct *mm,
+				     struct page_pool *pool)
+{
+	return;
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 11/17] Add a hook to intercept external buffers from NIC driver.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The hook is called in __netif_receive_skb().

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 84fbb83..bdad1c8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2814,6 +2814,40 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Hook to intercept mediate passthru (zero-copy) packets
+ * and insert them into the socket queue owned by the mp_port.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev) && !dev_is_mpassthru(orig_dev))
+		return skb;
+	if (dev_is_mpassthru(skb->dev))
+		mp_port = skb->dev->mp_port;
+	else if (orig_dev->master == skb->dev && dev_is_mpassthru(orig_dev))
+		mp_port = orig_dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
+#endif
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2891,6 +2925,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* Intercept mediate passthru (zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
+
 	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
@@ -2983,6 +3022,7 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 10/17] If device is in zero-copy mode first, bonding will fail.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If a device is already in zero-copy mode when it is enslaved, bonding
cannot handle it, so this patch makes bond_enslave() fail in that case.

If the bond is created first and one of its devices is later put into
zero-copy mode, the mp device handles it: it first checks whether all
the slaves have the zero-copy capability. If not, it fails as well;
otherwise it puts all the slaves into zero-copy mode.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
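Note for reviewers: the first rule above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: struct
netdev_stub, struct mp_port_stub, stub_is_mpassthru() and stub_enslave()
are hypothetical stand-ins for illustration, not kernel symbols, and the
bonding-first case handled by the mp device is not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder for struct mp_port; its contents do not matter here. */
struct mp_port_stub { int unused; };

/* Hypothetical, minimal model of a net_device for this sketch. */
struct netdev_stub {
	struct mp_port_stub *mp_port;	/* non-NULL once bound to the mp device */
	bool enslaved;
};

/* Same test as dev_is_mpassthru(): a device with an mp_port is in zero-copy mode. */
static bool stub_is_mpassthru(const struct netdev_stub *dev)
{
	return dev && dev->mp_port;
}

/* Refuse to enslave a device that is already in zero-copy mode. */
static int stub_enslave(struct netdev_stub *slave)
{
	if (stub_is_mpassthru(slave))
		return -EBUSY;
	slave->enslaved = true;
	return 0;
}
```

Calling stub_enslave() on a device whose mp_port is set returns -EBUSY
and leaves the device untouched, which is the behaviour the hunk below
adds to bond_enslave().
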
 drivers/net/bonding/bond_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3b16f62..dfb6a2c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1428,6 +1428,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			   bond_dev->name);
 	}

+	/* if the device is in zero-copy mode before bonding, fail it. */
+	if (dev_is_mpassthru(slave_dev))
+		return -EBUSY;
+
 	/* already enslaved */
 	if (slave_dev->flags & IFF_SLAVE) {
 		pr_debug("Error, Device was already enslaved\n");
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 09/17] Don't do skb recycle, if device use external buffer.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02439e0..196aa99 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -558,6 +558,12 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return false;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * have an external buffer, so don't recycle it
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return false;
+
 	skb_release_head_state(skb);
 
 	shinfo = skb_shinfo(skb);
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the buffer is external, use the callback supplied by its owner to
release it.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
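Note for reviewers: the callback pattern described above can be modelled
in userspace as follows. This is a sketch only; struct ext_page, struct
buf, ext_page_release() and buf_release_data() are hypothetical
stand-ins for struct skb_ext_page, struct sk_buff and the
skb_release_data() branch, and the "released" field exists purely so the
effect can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct skb_ext_page: only the callback is modelled. */
struct ext_page {
	void (*dtor)(struct ext_page *);
	int released;	/* set by the callback so callers can observe it */
};

/* Hypothetical stand-in for an skb: a zero-copy flag and a destructor_arg slot. */
struct buf {
	int dev_is_mpassthru;
	struct ext_page *destructor_arg;
};

/* Example owner callback: just records that it ran. */
static void ext_page_release(struct ext_page *p)
{
	p->released = 1;
}

/* Mirrors the added branch: invoke the owner's callback only for
 * buffers that belong to a zero-copy device and carry a callback.
 */
static void buf_release_data(struct buf *b)
{
	if (b->dev_is_mpassthru) {
		struct ext_page *p = b->destructor_arg;

		if (p && p->dtor)
			p->dtor(p);
	}
}
```

A buffer on a non-zero-copy device, or one with a NULL destructor_arg,
passes through buf_release_data() without any callback being invoked.
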
 net/core/skbuff.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..5e6d69c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
+	shinfo->destructor_arg = NULL;
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
 
@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.7.3


^ permalink raw reply related

* [PATCH v14 05/17] Add a function to indicate if device use external buffer.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8dcf6de..f91d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1739,6 +1739,11 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mp_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3

^ permalink raw reply related

* [PATCH v14 04/17] Add a function make external buffer owner to query capability.
From: xiaohui.xin @ 2010-11-04  9:05 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1288860477.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The external buffer owner can use this function to query the
capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
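Note for reviewers: the bounds that netdev_mp_port_prep() applies to the
values reported by the driver can be modelled in userspace as follows.
This is a sketch only; SKETCH_PAGE_SIZE, SKETCH_MAX_FRAGS, struct
port_caps and port_caps_valid() are hypothetical names, and the two
constants are assumed values standing in for PAGE_SIZE and
MAX_SKB_FRAGS.

```c
#include <stdbool.h>

/* Assumed values for illustration; the kernel uses PAGE_SIZE and MAX_SKB_FRAGS. */
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_MAX_FRAGS 18

/* Hypothetical mirror of the three mp_port fields the patch inspects. */
struct port_caps {
	int hdr_len;	/* header size reported by the driver */
	int npages;	/* pages per descriptor */
	int data_len;	/* payload a descriptor may carry */
};

/* Returns true when the reported values pass the same bounds as the patch. */
static bool port_caps_valid(const struct port_caps *c)
{
	if (c->hdr_len <= 0)
		return false;
	if (c->npages <= 0 || c->npages > SKETCH_MAX_FRAGS)
		return false;
	/* payload must lie in [PAGE * (npages - 1), PAGE * npages] */
	if (c->data_len < SKETCH_PAGE_SIZE * (c->npages - 1) ||
	    c->data_len > SKETCH_PAGE_SIZE * c->npages)
		return false;
	return true;
}
```

With these assumed constants, a driver reporting npages = 2 must report
a data_len between 4096 and 8192 bytes for the check to pass.
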
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 575777f..8dcf6de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1736,6 +1736,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+				struct mp_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 660dd41..84fbb83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2942,6 +2942,47 @@ out:
 	return ret;
 }
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we query the NIC driver for the capabilities it can
+ * provide, especially for packet split mode. For now we only
+ * query for the header size and the payload a descriptor
+ * may carry.
+ * Currently this is only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else
+		return -EINVAL;
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3

^ permalink raw reply related
