Netdev List


* Re: Issue with NFS over NAT since kernel 3.8.x
From: Richard Weinberger @ 2013-10-05 11:32 UTC (permalink / raw)
  To: leroy christophe; +Cc: linux-net, netdev
In-Reply-To: <524FDCBE.9070407@c-s.fr>

On Sat, Oct 5, 2013 at 11:32 AM, leroy christophe
<christophe.leroy@c-s.fr> wrote:
> Is there any change between 3.7 and 3.8 which could explain it ? How can I
> investigate this problem ?

You can run git bisect between 3.7 and 3.8 to find the exact change.

-- 
Thanks,
//richard

^ permalink raw reply

* Re: kernel BUG at net/core/skbuff.c:1048!
From: Patrick McHardy @ 2013-10-05 10:01 UTC (permalink / raw)
  To: Wim Vandersmissen; +Cc: netdev
In-Reply-To: <524E8BFB.9080708@icts.kuleuven.be>

On Fri, Oct 04, 2013 at 11:35:55AM +0200, Wim Vandersmissen wrote:
> Hi,
> 
> Got the following BUG when using ipv6 netfilter/conntrack/ipv6
> forwarding and traffic flowing.
> 
> No issues in 3.4.x, but triggered in 3.10.x (introduced in 3.7)

Please also send me your 3.10.x config and the ip6tables rules you're
using.

Are you actually using IPv6 NAT? 

Thanks!

> git bisect tells me:
> 
> 58a317f1061c894d2344c0b6a18ab4a64b69b815 is the first bad commit
> commit 58a317f1061c894d2344c0b6a18ab4a64b69b815
> Author: Patrick McHardy <kaber@trash.net>
> Date:   Sun Aug 26 19:14:12 2012 +0200
> netfilter: ipv6: add IPv6 NAT support
> 
> 
> 
> 
> kernel: kernel BUG at net/core/skbuff.c:1048!
> kernel: invalid opcode: 0000 [#1] SMP
> kernel: icrocode]
> kernel: CPU 2
> kernel: Pid: 0, comm: swapper/2 Not tainted 3.6.0-rc2+ #1 HP
> ProLiant DL380 G6
> kernel: RIP: 0010:[<ffffffff8126d8b8>]  [<ffffffff8126d8b8>]
> pskb_expand_head+0x30/0x210
> kernel: RSP: 0018:ffff88019fc239f0  EFLAGS: 00010202
> kernel: RAX: 0000000000000001 RBX: ffff88018ae50880 RCX: 0000000000000020
> kernel: RDX: 0000000000000000 RSI: 00000000000002c0 RDI: ffff88018ae50880
> kernel: RBP: 0000000000000020 R08: 0000000000000000 R09: 0000000000000000
> kernel: R10: ffff88018af4a2c0 R11: ffffffffa0271ff8 R12: 0000000000000000
> kernel: R13: ffff880196ca26c0 R14: ffff88018ac55456 R15: ffffffff814b0f40
> kernel: FS:  0000000000000000(0000) GS:ffff88019fc20000(0000)
> knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> kernel: CR2: ffffffffff600000 CR3: 000000018b2ab000 CR4: 00000000000007e0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> kernel: Process swapper/2 (pid: 0, threadinfo ffff8801990ca000, task
> ffff88019909e090)
> kernel: Stack:
> kernel: 0000000800000040 ffff88018ae50880 ffff88018ae50880 0000000000000000
> kernel: ffff880196ca26c0 ffff88018ac55456 ffffffff814b0f40 ffffffffa02725ab
> kernel: 0000000800000000 0006000000000000 402c022a00000000 ffff88018ae508a8
> kernel: Call Trace:
> kernel: <IRQ>
> kernel: [<ffffffffa02725ab>] ? ip6_forward+0x5b3/0x72a [ipv6]
> kernel: [<ffffffffa0273bf1>] ? ip6_input+0x51/0x51 [ipv6]
> kernel: [<ffffffffa02f7407>] ? __ipv6_conntrack_in+0xed/0x153
> [nf_conntrack_ipv6]
> kernel: [<ffffffff81298f00>] ? nf_iterate+0x50/0x8b
> kernel: [<ffffffff810372eb>] ? mod_timer+0x15e/0x16c
> kernel: [<ffffffffa0273b0a>] ? ip6_xmit+0x2d2/0x368 [ipv6]
> kernel: [<ffffffffa0273bf1>] ? ip6_input+0x51/0x51 [ipv6]
> kernel: [<ffffffff81299135>] ? nf_hook_slow+0x67/0xfb
> kernel: [<ffffffffa0273bf1>] ? ip6_input+0x51/0x51 [ipv6]
> kernel: [<ffffffffa0273bf1>] ? ip6_input+0x51/0x51 [ipv6]
> kernel: [<ffffffffa02f22c0>] ? nf_ct_frag6_output+0x97/0xe1 [nf_defrag_ipv6]

^ permalink raw reply

* Issue with NFS over NAT since kernel 3.8.x
From: leroy christophe @ 2013-10-05  9:32 UTC (permalink / raw)
  To: linux-net, netdev

I have a system with a panel PC running RedHat 7.3, mounting NFS from a 
RedHat 9 NFS Server.

In between, I have a router which is running standard Kernel 3.8.13 and 
provides NAT (Masquerade) to the PC.

Reading the directory and a first file from the PC works well. But when 
trying to read a second file, the router stops forwarding the packets 
from the server back to the masqueraded PC.

It was working well with Kernel 3.7.10.
It still fails with Kernel 3.10.

Is there any change between 3.7 and 3.8 which could explain it ? How can 
I investigate this problem ?

Christophe

^ permalink raw reply

* Re: tx checksum offload in rtl8168evl disabled in driver
From: Francois Romieu @ 2013-10-05  9:22 UTC (permalink / raw)
  To: jason.morgan; +Cc: netdev, Hayes Wang
In-Reply-To: <OF48469889.D817DB84-ON80257BFA.0032A86F-80257BFA.0032E20E@aveillant.com>

(please don't top post)

jason.morgan@aveillant.com <jason.morgan@aveillant.com> :
> Ubuntu 12.04.3 LTS + Kernel 3.8.13-8 64bit
> 
> I've patched the driver to allow tx checksum offload for this chip and
> found the following:
>
> MTU 9000 standard driver:
> 517Mbps with 2k + header frames
> 
> MTU 9000 patched driver:
> 770Mbps with 2k + header frames
> 
> 100% transfer without error (1e6 frames)

(Ok, so that's 20 ~ 30s worth of traffic)

> 48% increase in performance combined with a massive decrease in CPU
> effort is not to be sniffed at.

*sniff* :o)

It depends on the CPU. You did not specify it and you did not give numbers
for the decrease (did you use 'perf' btw ?). They would be welcome.

> IMO tx offload should be more prevalent as the frames grow, to reduce 
> CPU load.

I can't disagree.

> OK, so make the default OFF if there is a silicon error (that spans
> multiple chips?),

Yes, I want safe defaults for the kernel.

I give the manufacturer's explanations a lot of credit when they're
related to hardware (up to the point where the marketing or legal dept
kicks in). If we want to balance these with experimental evidence, the
latter must be really, really strong.

> but why prevent it being turned on in the driver? 
> even if there is a kernel message that this might cause problems.

Two points:
- it's a hack: ethtool will return success. A kernel message is not a
  substitute for "Yes, I opt in for problems".
- we can't tell when it's safe and when it isn't.

-- 
Ueimor

^ permalink raw reply

* [PATCH] net: wan: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-10-05  4:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Michael Opdenacker

This patch proposes to remove the use of the IRQF_DISABLED flag.

It has been a no-op since 2.6.35 and will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/wan/hostess_sv11.c | 2 +-
 drivers/net/wan/sealevel.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wan/hostess_sv11.c b/drivers/net/wan/hostess_sv11.c
index 3d80e42..3d74166 100644
--- a/drivers/net/wan/hostess_sv11.c
+++ b/drivers/net/wan/hostess_sv11.c
@@ -220,7 +220,7 @@ static struct z8530_dev *sv11_init(int iobase, int irq)
 	/* We want a fast IRQ for this device. Actually we'd like an even faster
 	   IRQ ;) - This is one driver RtLinux is made for */
 
-	if (request_irq(irq, z8530_interrupt, IRQF_DISABLED,
+	if (request_irq(irq, z8530_interrupt, 0,
 			"Hostess SV11", sv) < 0) {
 		pr_warn("IRQ %d already in use\n", irq);
 		goto err_irq;
diff --git a/drivers/net/wan/sealevel.c b/drivers/net/wan/sealevel.c
index 4f77484..27860b4 100644
--- a/drivers/net/wan/sealevel.c
+++ b/drivers/net/wan/sealevel.c
@@ -266,7 +266,7 @@ static __init struct slvl_board *slvl_init(int iobase, int irq,
 	/* We want a fast IRQ for this device. Actually we'd like an even faster
 	   IRQ ;) - This is one driver RtLinux is made for */
 
-	if (request_irq(irq, z8530_interrupt, IRQF_DISABLED,
+	if (request_irq(irq, z8530_interrupt, 0,
 			"SeaLevel", dev) < 0) {
 		pr_warn("IRQ %d already in use\n", irq);
 		goto err_request_irq;
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH] irda: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-10-05  4:39 UTC (permalink / raw)
  To: samuel; +Cc: netdev, linux-kernel, Michael Opdenacker

This patch proposes to remove the use of the IRQF_DISABLED flag.

It has been a no-op since 2.6.35 and will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/irda/bfin_sir.c | 4 ++--
 drivers/net/irda/donauboe.c | 4 ++--
 drivers/net/irda/sh_irda.c  | 2 +-
 drivers/net/irda/sh_sir.c   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/irda/bfin_sir.c b/drivers/net/irda/bfin_sir.c
index c74f384..303c4bd 100644
--- a/drivers/net/irda/bfin_sir.c
+++ b/drivers/net/irda/bfin_sir.c
@@ -411,12 +411,12 @@ static int bfin_sir_startup(struct bfin_sir_port *port, struct net_device *dev)
 
 #else
 
-	if (request_irq(port->irq, bfin_sir_rx_int, IRQF_DISABLED, "BFIN_SIR_RX", dev)) {
+	if (request_irq(port->irq, bfin_sir_rx_int, 0, "BFIN_SIR_RX", dev)) {
 		dev_warn(&dev->dev, "Unable to attach SIR RX interrupt\n");
 		return -EBUSY;
 	}
 
-	if (request_irq(port->irq+1, bfin_sir_tx_int, IRQF_DISABLED, "BFIN_SIR_TX", dev)) {
+	if (request_irq(port->irq+1, bfin_sir_tx_int, 0, "BFIN_SIR_TX", dev)) {
 		dev_warn(&dev->dev, "Unable to attach SIR TX interrupt\n");
 		free_irq(port->irq, dev);
 		return -EBUSY;
diff --git a/drivers/net/irda/donauboe.c b/drivers/net/irda/donauboe.c
index 31bcb98..768dfe9 100644
--- a/drivers/net/irda/donauboe.c
+++ b/drivers/net/irda/donauboe.c
@@ -1352,7 +1352,7 @@ toshoboe_net_open (struct net_device *dev)
     return 0;
 
   rc = request_irq (self->io.irq, toshoboe_interrupt,
-                    IRQF_SHARED | IRQF_DISABLED, dev->name, self);
+                    IRQF_SHARED, dev->name, self);
   if (rc)
   	return rc;
 
@@ -1559,7 +1559,7 @@ toshoboe_open (struct pci_dev *pci_dev, const struct pci_device_id *pdid)
   self->io.fir_base = self->base;
   self->io.fir_ext = OBOE_IO_EXTENT;
   self->io.irq = pci_dev->irq;
-  self->io.irqflags = IRQF_SHARED | IRQF_DISABLED;
+  self->io.irqflags = IRQF_SHARED;
 
   self->speed = self->io.speed = 9600;
   self->async = 0;
diff --git a/drivers/net/irda/sh_irda.c b/drivers/net/irda/sh_irda.c
index 4455425..ff45cd0 100644
--- a/drivers/net/irda/sh_irda.c
+++ b/drivers/net/irda/sh_irda.c
@@ -804,7 +804,7 @@ static int sh_irda_probe(struct platform_device *pdev)
 		goto err_mem_4;
 
 	platform_set_drvdata(pdev, ndev);
-	err = request_irq(irq, sh_irda_irq, IRQF_DISABLED, "sh_irda", self);
+	err = request_irq(irq, sh_irda_irq, 0, "sh_irda", self);
 	if (err) {
 		dev_warn(&pdev->dev, "Unable to attach sh_irda interrupt\n");
 		goto err_mem_4;
diff --git a/drivers/net/irda/sh_sir.c b/drivers/net/irda/sh_sir.c
index 89682b4..8d9ae5a 100644
--- a/drivers/net/irda/sh_sir.c
+++ b/drivers/net/irda/sh_sir.c
@@ -761,7 +761,7 @@ static int sh_sir_probe(struct platform_device *pdev)
 		goto err_mem_4;
 
 	platform_set_drvdata(pdev, ndev);
-	err = request_irq(irq, sh_sir_irq, IRQF_DISABLED, "sh_sir", self);
+	err = request_irq(irq, sh_sir_irq, 0, "sh_sir", self);
 	if (err) {
 		dev_warn(&pdev->dev, "Unable to attach sh_sir interrupt\n");
 		goto err_mem_4;
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH] net: hamradio/yam: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-10-05  4:25 UTC (permalink / raw)
  To: jpr; +Cc: linux-hams, netdev, linux-kernel, Michael Opdenacker

This patch proposes to remove the use of the IRQF_DISABLED flag.

It has been a no-op since 2.6.35 and will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/hamradio/yam.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index 0721e72..ff31ff0 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -888,7 +888,7 @@ static int yam_open(struct net_device *dev)
 		goto out_release_base;
 	}
 	outb(0, IER(dev->base_addr));
-	if (request_irq(dev->irq, yam_interrupt, IRQF_DISABLED | IRQF_SHARED, dev->name, dev)) {
+	if (request_irq(dev->irq, yam_interrupt, IRQF_SHARED, dev->name, dev)) {
 		printk(KERN_ERR "%s: irq %d busy\n", dev->name, dev->irq);
 		ret = -EBUSY;
 		goto out_release_base;
-- 
1.8.1.2


^ permalink raw reply related

* [PATCH] net: hamradio/scc: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-10-05  4:22 UTC (permalink / raw)
  To: jreuter; +Cc: linux-hams, netdev, linux-kernel, Michael Opdenacker

This patch proposes to remove the use of the IRQF_DISABLED flag.

It has been a no-op since 2.6.35 and will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/hamradio/scc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index bc1d521..4bc6ee8 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1734,7 +1734,7 @@ static int scc_net_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 			if (!Ivec[hwcfg.irq].used && hwcfg.irq)
 			{
 				if (request_irq(hwcfg.irq, scc_isr,
-						IRQF_DISABLED, "AX.25 SCC",
+						0, "AX.25 SCC",
 						(void *)(long) hwcfg.irq))
 					printk(KERN_WARNING "z8530drv: warning, cannot get IRQ %d\n", hwcfg.irq);
 				else
-- 
1.8.1.2

^ permalink raw reply related

* RE: [PATCH net-next 02/10] qlcnic: Enhance ethtool to display ring indices and interrupt mask
From: Himanshu Madhani @ 2013-10-04 22:21 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Dept-NX Linux NIC Driver
In-Reply-To: <1380919061.3214.13.camel@bwh-desktop.uk.level5networks.com>

> 
> This is really sad; why don't you write a proper dump parser for ethtool
> rather than including markers that make it slightly easier to read hex
> dumps?
> 
> And when changing the dump format in an incompatible way like this, you
> should also bump the version number.
> 

We will resubmit the patch after making the appropriate changes.

> Ben.
> 
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH net-next 01/10] qlcnic: Print informational messages only once during driver load.
From: Stephen Hemminger @ 2013-10-04 21:29 UTC (permalink / raw)
  To: Himanshu Madhani
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, Sucheta Chakraborty
In-Reply-To: <5fea5316a6664e19c516bbd26428a5034657b130.1380937706.git.himanshu.madhani@qlogic.com>

On Fri, 4 Oct 2013 14:30:48 -0400
Himanshu Madhani <himanshu.madhani@qlogic.com> wrote:

> From: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
> 
> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
> ---
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic.h        |  1 +
>  .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c    | 12 -----------
>  .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_vnic.c  | 25 ++++++++++++++++++----
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c   |  1 +
>  4 files changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h b/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
> index 81bf836..a3c4379 100644
> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
> @@ -1199,6 +1199,7 @@ struct qlcnic_npar_info {
>  	u8	promisc_mode;
>  	u8	offload_flags;
>  	u8      pci_func;
> +	u8      mac[ETH_ALEN];
>  };
>  

>  

There is a field in struct net_device, perm_addr, which should probably be used for this.

And then this could be corrected:

static void qlcnic_dcb_get_perm_hw_addr(struct net_device *netdev, u8 *addr)
{
	memcpy(addr, netdev->dev_addr, netdev->addr_len);
}
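
For illustration only, here is a minimal userspace sketch of the correction suggested above: copy from the permanent address rather than the current one. The struct and function names (fake_netdev, fake_get_perm_hw_addr, FAKE_ETH_ALEN) are hypothetical stand-ins, not the driver's or the kernel's definitions; only the idea of reading perm_addr instead of dev_addr comes from the message.

```c
#include <assert.h>
#include <string.h>

#define FAKE_ETH_ALEN 6 /* hypothetical stand-in for ETH_ALEN */

/* Hypothetical mock of the two address fields discussed above. */
struct fake_netdev {
	unsigned char dev_addr[FAKE_ETH_ALEN];  /* current address, may be changed at runtime */
	unsigned char perm_addr[FAKE_ETH_ALEN]; /* permanent (burned-in) address */
	unsigned char addr_len;
};

/* Report the permanent address, not the currently configured one. */
static void fake_get_perm_hw_addr(const struct fake_netdev *nd,
				  unsigned char *addr)
{
	assert(nd->addr_len <= FAKE_ETH_ALEN);
	memcpy(addr, nd->perm_addr, nd->addr_len);
}
```

With this shape, changing dev_addr afterwards does not affect what the helper reports, which is presumably the point of having a separate permanent field.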

^ permalink raw reply

* Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
From: Ben Hutchings @ 2013-10-04 21:29 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Bjorn Helgaas, Ralf Baechle, Michael Ellerman,
	Benjamin Herrenschmidt, Martin Schwidefsky, Ingo Molnar,
	Tejun Heo, Dan Williams, Andy King, Jon Mason, Matt Porter,
	linux-pci, linux-mips, linuxppc-dev, linux390, linux-s390, x86,
	linux-ide, iss_storagedev, linux-nvme, linux-rdma, netdev,
	e1000-devel, linux-driver, Solarflare linux maintainers
In-Reply-To: <20131004082920.GA4536@dhcp-26-207.brq.redhat.com>

On Fri, 2013-10-04 at 10:29 +0200, Alexander Gordeev wrote:
> On Thu, Oct 03, 2013 at 11:49:45PM +0100, Ben Hutchings wrote:
> > On Wed, 2013-10-02 at 12:48 +0200, Alexander Gordeev wrote:
> > > This update converts pci_enable_msix() and pci_enable_msi_block()
> > > interfaces to canonical kernel functions and makes them return an
> > > error code in case of failure or 0 in case of success.
> > [...]
> > 
> > I think this is fundamentally flawed: pci_msix_table_size() and
> > pci_get_msi_cap() can only report the limits of the *device* (which the
> > driver usually already knows), whereas MSI allocation can also be
> > constrained due to *global* limits on the number of distinct IRQs.
> 
> Even the current implementation by no means addresses it. Although it
> might seem a case for architectures to report the number of IRQs available
> for a driver to retry, in fact they all just fail. The same applies to
> *any* other type of resource involved: irq_desc's, CPU interrupt vector
> space, msi_desc's etc. No platform cares about it and just bails out once
> a constraint is met (please correct me if I am wrong here). Given that Linux
> has been doing well even on embedded I think we should not change it.
>
> The only exception to the above is pSeries platform which takes advantage
> of the current design (to implement MSI quota). There are indications we
> can satisfy pSeries requirements, but the design proposed in this RFC
> is not going to change drastically anyway. The start of the discussion
> is here: https://lkml.org/lkml/2013/9/5/293

All I can see there is that Tejun didn't think that the global limits
and positive return values were implemented by any architecture.  But
you have a counter-example, so I'm not sure what your point is.

It has been quite a while since I saw this happen on x86.  But I just
checked on a test system running RHEL 5 i386 (Linux 2.6.18).  If I ask
for 16 MSI-X vectors on a device that supports 1024, the return value is
8, and indeed I can then successfully allocate 8.

Now that's going quite a way back, and it may be that global limits
aren't a significant problem any more.  With the x86_64 build of RHEL 5
on an identical system, I can allocate 16 or even 32, so this is
apparently not a hardware limit in this case.

> > Currently pci_enable_msix() will report a positive value if it fails due
> > to the global limit.  Your patch 7 removes that.  pci_enable_msi_block()
> > unfortunately doesn't appear to do this.
> 
> pci_enable_msi_block() can do more than one MSI only on x86 (with IOMMU),
> but it does not bother to return positive numbers, indeed.
> 
> > It seems to me that a more useful interface would take a minimum and
> > maximum number of vectors from the driver.  This wouldn't allow the
> > driver to specify that it could only accept, say, any even number within
> > a certain range, but you could still leave the current functions
> > available for any driver that needs that.
> 
> Mmmm.. I am not sure I am getting it. Could you please rephrase?

Most drivers seem to either:
(a) require exactly a certain number of MSI vectors, or
(b) require a minimum number of MSI vectors, usually want to allocate
more, and work with any number in between

We can support drivers in both classes by adding new allocation
functions that allow specifying a minimum (required) and maximum
(wanted) number of MSI vectors.  Those in class (a) would just specify
the same value for both.  These new functions can take account of any
global limit or allocation policy without any further changes to the
drivers that use them.

The few drivers with more specific requirements would still need to
implement the currently recommended loop, using the old allocation
functions.
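
To make the minimum/maximum idea concrete, here is a rough userspace sketch under stated assumptions. fake_enable_exact() is a hypothetical stand-in for an "enable exactly n vectors" call that returns 0 on success, a positive count when fewer vectors are available, and a negative value on a hard error (the behaviour described earlier in this message). fake_enable_range() is the hypothetical new helper wrapping the retry loop. None of these names are existing kernel interfaces.

```c
#include <assert.h>

/* Hypothetical device model: `avail` simulates a global vector limit. */
struct fake_dev {
	int avail;   /* vectors the platform can currently hand out */
	int enabled; /* vectors enabled so far, 0 if none */
};

/* Stand-in for an "enable exactly n" call: 0 on success, a positive
 * hint when only fewer vectors are available, negative on hard error. */
static int fake_enable_exact(struct fake_dev *dev, int n)
{
	if (n <= 0)
		return -1;
	if (dev->avail == 0)
		return -2;         /* nothing left at all */
	if (n > dev->avail)
		return dev->avail; /* hint: this many could be allocated */
	dev->enabled = n;
	return 0;
}

/* Ask for up to `maxvec`, accept no fewer than `minvec`.
 * Returns the number of vectors enabled, or a negative error. */
static int fake_enable_range(struct fake_dev *dev, int minvec, int maxvec)
{
	int want = maxvec;

	assert(minvec >= 1 && minvec <= maxvec);

	for (;;) {
		int rc = fake_enable_exact(dev, want);

		if (rc == 0)
			return want; /* got exactly what was requested */
		if (rc < 0)
			return rc;   /* hard error, give up */
		if (rc < minvec)
			return -3;   /* hint is below the required minimum */
		want = rc;           /* retry with the hinted count */
	}
}
```

A class (a) driver would call fake_enable_range(dev, n, n); a class (b) driver would pass its required minimum and its preferred maximum and then work with whatever count comes back.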

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH net-next v2 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Eric Dumazet @ 2013-10-04 21:20 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: David Miller, netdev, tomk, Shawn Bohrer
In-Reply-To: <1380914896-24754-4-git-send-email-shawn.bohrer@gmail.com>

On Fri, 2013-10-04 at 14:28 -0500, Shawn Bohrer wrote:
> From: Shawn Bohrer <sbohrer@rgmadvisors.com>
> 
> Since the removal of the routing cache, fib_compute_spec_dst() does a
> fib_table lookup for each UDP multicast packet received.  This has
> introduced a performance regression for some UDP workloads.
> 
> This change skips populating the packet info for sockets that do not have
> IP_PKTINFO set.
> 
> Benchmark results from a netperf UDP_RR test:
> Before 89789.68 transactions/s
> After  90587.62 transactions/s
> 
> Benchmark results from a fio 1 byte UDP multicast pingpong test
> (Multicast one way unicast response):
> Before 12.63us RTT
> After  12.48us RTT
> 
> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
> ---
> v2 changes:
> 
> * ipv4_pktinfo_prepare() now takes a const struct sock*


Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] udp: Add udp early demux
From: Eric Dumazet @ 2013-10-04 21:16 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: David Miller, netdev, tomk, Shawn Bohrer
In-Reply-To: <20131004210511.GA12356@sbohrermbp13-local.rgmadvisors.com>

On Fri, 2013-10-04 at 16:05 -0500, Shawn Bohrer wrote:

> Same thing must be true in the multicast case correct? I'll fix them
> both.

Yes.

And you could state in the title or changelog that you took care of IPv4
only (which is fine, but worth mentioning)

Also, unicast lookup should use the secondary hash on (local port, local
address) for best hash distribution for this particular lookup for a
connected socket.

(Take a look at commits 5051ebd27 and 512615b6b84 for details)

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] udp: Add udp early demux
From: Shawn Bohrer @ 2013-10-04 21:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, tomk, Shawn Bohrer
In-Reply-To: <1380916926.3564.30.camel@edumazet-glaptop.roam.corp.google.com>

On Fri, Oct 04, 2013 at 01:02:06PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-04 at 14:28 -0500, Shawn Bohrer wrote:
> 
> > +
> > +/* For unicast we should only early demux connected sockets or we can
> > + * break forwarding setups.  The chains here can be long so only check
> > + * if the first socket is an exact match and if not move on.
> > + */
> > +static struct sock *__udp4_lib_demux_lookup(struct net *net,
> > +					    __be16 loc_port, __be32 loc_addr,
> > +					    __be16 rmt_port, __be32 rmt_addr,
> > +					    int dif)
> > +{
> > +	struct sock *sk, *result;
> > +	struct hlist_nulls_node *node;
> > +	unsigned short hnum = ntohs(loc_port);
> > +	unsigned int slot = udp_hashfn(net, hnum, udp_table.mask);
> > +	struct udp_hslot *hslot = &udp_table.hash[slot];
> > +	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr)
> > +	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
> > +
> > +	rcu_read_lock();
> > +	result = NULL;
> > +	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
> > +		if (INET_MATCH(sk, net, acookie,
> > +			       rmt_addr, loc_addr, ports, dif))
> > +			result = sk;
> > +		/* Only check first socket in chain */
> > +		break;
> > +	}
> > +
> > +	if (result) {
> > +		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> > +			result = NULL;
> 
> Here you must check again the keys (because of UDP sockets being
> SLAB_DESTROY_BY_RCU , this socket might have been freed and reused
> elsewhere)
> 
> 	else
> 		if (unlikely(!INET_MATCH(result, net, acookie,
> 					 rmt_addr, loc_addr,
> 					 ports, dif))) {
> 			sock_put(result);
> 			result = NULL;
> 		}
 
Same thing must be true in the multicast case correct? I'll fix them
both.
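
For reference, the "match, take a reference, match again" pattern being discussed can be sketched in plain C11 as below. This is a userspace illustration only: fake_sock, ref_inc_not_zero(), ref_put(), pin_and_recheck() and lookup_pinned() are hypothetical names, and a single integer key stands in for the address/port tuple that INET_MATCH compares.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object whose memory may be recycled for another flow. */
struct fake_sock {
	atomic_int refcnt; /* 0 means the object is being freed */
	unsigned int key;  /* stands in for the address/port tuple */
};

/* Take a reference only if the count is still non-zero. */
static bool ref_inc_not_zero(struct fake_sock *sk)
{
	int old = atomic_load(&sk->refcnt);

	while (old != 0) {
		/* On failure, `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&sk->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void ref_put(struct fake_sock *sk)
{
	atomic_fetch_sub(&sk->refcnt, 1);
}

/* Pin the candidate, then compare the key again: between the first
 * match and the pin, the object may have been freed and reused. */
static struct fake_sock *pin_and_recheck(struct fake_sock *sk,
					 unsigned int key)
{
	if (!ref_inc_not_zero(sk))
		return NULL;      /* lost the race with the final put */
	if (sk->key != key) { /* recycled for a different flow */
		ref_put(sk);
		return NULL;
	}
	return sk;
}

/* Full lookup step for one candidate: match, pin, match again. */
static struct fake_sock *lookup_pinned(struct fake_sock *candidate,
				       unsigned int key)
{
	if (candidate == NULL || candidate->key != key)
		return NULL;
	return pin_and_recheck(candidate, key);
}
```

The second comparison looks redundant in single-threaded code; it only matters when another CPU can free and reuse the object between the first match and the reference being taken.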

--
Shawn

^ permalink raw reply

* [PATCH net-next] net: fujitsu: Remove ISA dependency from Kconfig
From: Matthew Whitehead @ 2013-10-04 21:03 UTC (permalink / raw)
  To: netdev; +Cc: Matthew Whitehead

There are no longer any ISA drivers in the fujitsu directory, so remove the
dependency from the Kconfig.

Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
---
 drivers/net/ethernet/fujitsu/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/fujitsu/Kconfig b/drivers/net/ethernet/fujitsu/Kconfig
index 6231bc0..1085257 100644
--- a/drivers/net/ethernet/fujitsu/Kconfig
+++ b/drivers/net/ethernet/fujitsu/Kconfig
@@ -5,7 +5,7 @@
 config NET_VENDOR_FUJITSU
 	bool "Fujitsu devices"
 	default y
-	depends on ISA || PCMCIA
+	depends on PCMCIA
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y
 	  and read the Ethernet-HOWTO, available from
-- 
1.7.2.5

^ permalink raw reply related

* Re: [PATCH net-next 02/10] qlcnic: Enhance ethtool to display ring indices and interrupt mask
From: Ben Hutchings @ 2013-10-04 20:37 UTC (permalink / raw)
  To: Himanshu Madhani; +Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, Pratik Pujar
In-Reply-To: <6802e3c08711090eada36dead165ab931c4e61db.1380937706.git.himanshu.madhani@qlogic.com>

On Fri, 2013-10-04 at 14:30 -0400, Himanshu Madhani wrote:
> From: Pratik Pujar <pratik.pujar@qlogic.com>
> 
> o Updated ethtool -d <ethX> option to display ring indices for Transmit(Tx),
>   Receive(Rx), and Status(St) rings.
> o Updated ethtool -d <ethX> option to display ring interrupt mask for Transmit(Tx),
>   and Status(St) rings.
> 
> Signed-off-by: Pratik Pujar <pratik.pujar@qlogic.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
> ---
>  .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c    |  8 +--
>  .../net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c    | 61 +++++++++++++++++-----
>  2 files changed, 51 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> index 66e94dc..c2df4ce 100644
> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
[...]
> @@ -512,21 +527,39 @@ qlcnic_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *p)
>  	if (!test_bit(__QLCNIC_DEV_UP, &adapter->state))
>  		return;
>  
> -	regs_buff[i++] = 0xFFEFCDAB; /* Marker btw regs and ring count*/
> -
> -	regs_buff[i++] = 1; /* No. of tx ring */
> -	regs_buff[i++] = le32_to_cpu(*(adapter->tx_ring->hw_consumer));
> -	regs_buff[i++] = readl(adapter->tx_ring->crb_cmd_producer);
> -
> -	regs_buff[i++] = 2; /* No. of rx ring */
> -	regs_buff[i++] = readl(recv_ctx->rds_rings[0].crb_rcv_producer);
> -	regs_buff[i++] = readl(recv_ctx->rds_rings[1].crb_rcv_producer);
> -
> -	regs_buff[i++] = adapter->max_sds_rings;
> +	/* Marker btw regs and TX ring count */
> +	regs_buff[i++] = QLCNIC_TX_RING_MARKER;
[...]

This is really sad; why don't you write a proper dump parser for ethtool
rather than including markers that make it slightly easier to read hex
dumps?

And when changing the dump format in an incompatible way like this, you
should also bump the version number.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Build failure after merge of the drm-intel tree
From: Mark Brown @ 2013-10-04 20:22 UTC (permalink / raw)
  To: Brett Rudley, Arend van Spriel, Franky Lin, Hante Meuleman,
	John W. Linville, Larry Finger, Chaoming Li, Joe Perches,
	David S. Miller
  Cc: linux-next, linux-kernel, Thierry Reding, netdev,
	brcm80211-dev-list, linux-wireless
In-Reply-To: <20131003171440.GD27287@sirena.org.uk>

[-- Attachment #1: Type: text/plain, Size: 16197 bytes --]

While merging the wireless-next tree into -next there were conflicts in
several headers between commits from Joe Perches in the net-next tree
that remove extern from headers and development in the wireless tree.

I've resolved this as below (sorry, got some extra stuff that resolved
automatically).

diff --cc drivers/net/wireless/ath/ath10k/debug.h
index bb00633,fa58148..6576b82
--- a/drivers/net/wireless/ath/ath10k/debug.h
+++ b/drivers/net/wireless/ath/ath10k/debug.h
@@@ -37,11 -38,13 +38,13 @@@ enum ath10k_debug_mask 
  
  extern unsigned int ath10k_debug_mask;
  
 -extern __printf(1, 2) int ath10k_info(const char *fmt, ...);
 -extern __printf(1, 2) int ath10k_err(const char *fmt, ...);
 -extern __printf(1, 2) int ath10k_warn(const char *fmt, ...);
 +__printf(1, 2) int ath10k_info(const char *fmt, ...);
 +__printf(1, 2) int ath10k_err(const char *fmt, ...);
 +__printf(1, 2) int ath10k_warn(const char *fmt, ...);
  
  #ifdef CONFIG_ATH10K_DEBUGFS
+ int ath10k_debug_start(struct ath10k *ar);
+ void ath10k_debug_stop(struct ath10k *ar);
  int ath10k_debug_create(struct ath10k *ar);
  void ath10k_debug_read_service_map(struct ath10k *ar,
  				   void *service_map,
diff --cc drivers/net/wireless/brcm80211/brcmfmac/bcmsdh_sdmmc.c
index c3462b7,091c905..2a23bf2
--- a/drivers/net/wireless/brcm80211/brcmfmac/bcmsdh_sdmmc.c
+++ b/drivers/net/wireless/brcm80211/brcmfmac/bcmsdh_sdmmc.c
@@@ -464,9 -459,11 +459,9 @@@ static struct sdio_driver brcmf_sdmmc_d
  
  static int brcmf_sdio_pd_probe(struct platform_device *pdev)
  {
 -	int ret;
 -
  	brcmf_dbg(SDIO, "Enter\n");
  
- 	brcmfmac_sdio_pdata = pdev->dev.platform_data;
+ 	brcmfmac_sdio_pdata = dev_get_platdata(&pdev->dev);
  
  	if (brcmfmac_sdio_pdata->power_on)
  		brcmfmac_sdio_pdata->power_on();
diff --cc drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
index 7f1340d,200ee9b..a6eb09e
--- a/drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
+++ b/drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
@@@ -132,34 -132,34 +132,34 @@@ struct pktq *brcmf_bus_gettxq(struct br
   * interface functions from common layer
   */
  
 -extern bool brcmf_c_prec_enq(struct device *dev, struct pktq *q,
 -			 struct sk_buff *pkt, int prec);
 +bool brcmf_c_prec_enq(struct device *dev, struct pktq *q, struct sk_buff *pkt,
 +		      int prec);
  
  /* Receive frame for delivery to OS.  Callee disposes of rxp. */
- void brcmf_rx_frames(struct device *dev, struct sk_buff_head *rxlist);
 -extern void brcmf_rx_frame(struct device *dev, struct sk_buff *rxp);
++void brcmf_rx_frame(struct device *dev, struct sk_buff *rxp);
  
  /* Indication from bus module regarding presence/insertion of dongle. */
 -extern int brcmf_attach(uint bus_hdrlen, struct device *dev);
 +int brcmf_attach(uint bus_hdrlen, struct device *dev);
  /* Indication from bus module regarding removal/absence of dongle */
 -extern void brcmf_detach(struct device *dev);
 +void brcmf_detach(struct device *dev);
  /* Indication from bus module that dongle should be reset */
 -extern void brcmf_dev_reset(struct device *dev);
 +void brcmf_dev_reset(struct device *dev);
  /* Indication from bus module to change flow-control state */
 -extern void brcmf_txflowblock(struct device *dev, bool state);
 +void brcmf_txflowblock(struct device *dev, bool state);
  
  /* Notify the bus has transferred the tx packet to firmware */
 -extern void brcmf_txcomplete(struct device *dev, struct sk_buff *txp,
 -			     bool success);
 +void brcmf_txcomplete(struct device *dev, struct sk_buff *txp, bool success);
  
 -extern int brcmf_bus_start(struct device *dev);
 +int brcmf_bus_start(struct device *dev);
  
  #ifdef CONFIG_BRCMFMAC_SDIO
 -extern void brcmf_sdio_exit(void);
 -extern void brcmf_sdio_init(void);
 +void brcmf_sdio_exit(void);
 +void brcmf_sdio_init(void);
 +void brcmf_sdio_register(void);
  #endif
  #ifdef CONFIG_BRCMFMAC_USB
 -extern void brcmf_usb_exit(void);
 -extern void brcmf_usb_init(void);
 +void brcmf_usb_exit(void);
 +void brcmf_usb_register(void);
  #endif
  
  #endif				/* _BRCMF_BUS_H_ */
diff --cc drivers/net/wireless/rtlwifi/rtl8188ee/phy.h
index 71ddf4f,d4545f0..8e1f1be
--- a/drivers/net/wireless/rtlwifi/rtl8188ee/phy.h
+++ b/drivers/net/wireless/rtlwifi/rtl8188ee/phy.h
@@@ -200,26 -200,29 +200,25 @@@ enum _ANT_DIV_TYPE 
  	CGCS_RX_SW_ANTDIV		= 0x05,
  };
  
 -extern u32 rtl88e_phy_query_bb_reg(struct ieee80211_hw *hw,
 -				   u32 regaddr, u32 bitmask);
 -extern void rtl88e_phy_set_bb_reg(struct ieee80211_hw *hw,
 -				  u32 regaddr, u32 bitmask, u32 data);
 -extern u32 rtl88e_phy_query_rf_reg(struct ieee80211_hw *hw,
 -				   enum radio_path rfpath, u32 regaddr,
 -				   u32 bitmask);
 -extern void rtl88e_phy_set_rf_reg(struct ieee80211_hw *hw,
 -				  enum radio_path rfpath, u32 regaddr,
 -				  u32 bitmask, u32 data);
 -extern bool rtl88e_phy_mac_config(struct ieee80211_hw *hw);
 -extern bool rtl88e_phy_bb_config(struct ieee80211_hw *hw);
 -extern bool rtl88e_phy_rf_config(struct ieee80211_hw *hw);
 -extern void rtl88e_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 -extern void rtl88e_phy_get_txpower_level(struct ieee80211_hw *hw,
 -					 long *powerlevel);
 -extern void rtl88e_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
 -extern void rtl88e_phy_set_bw_mode_callback(struct ieee80211_hw *hw);
 -extern void rtl88e_phy_set_bw_mode(struct ieee80211_hw *hw,
 -				   enum nl80211_channel_type ch_type);
 -extern void rtl88e_phy_sw_chnl_callback(struct ieee80211_hw *hw);
 -extern u8 rtl88e_phy_sw_chnl(struct ieee80211_hw *hw);
 -extern void rtl88e_phy_iq_calibrate(struct ieee80211_hw *hw, bool b_recovery);
 +u32 rtl88e_phy_query_bb_reg(struct ieee80211_hw *hw, u32 regaddr, u32 bitmask);
 +void rtl88e_phy_set_bb_reg(struct ieee80211_hw *hw, u32 regaddr, u32 bitmask,
 +			   u32 data);
 +u32 rtl88e_phy_query_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
 +			    u32 regaddr, u32 bitmask);
 +void rtl88e_phy_set_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
 +			   u32 regaddr, u32 bitmask, u32 data);
 +bool rtl88e_phy_mac_config(struct ieee80211_hw *hw);
 +bool rtl88e_phy_bb_config(struct ieee80211_hw *hw);
 +bool rtl88e_phy_rf_config(struct ieee80211_hw *hw);
 +void rtl88e_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 +void rtl88e_phy_get_txpower_level(struct ieee80211_hw *hw, long *powerlevel);
 +void rtl88e_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
- void rtl88e_phy_scan_operation_backup(struct ieee80211_hw *hw, u8 operation);
 +void rtl88e_phy_set_bw_mode_callback(struct ieee80211_hw *hw);
 +void rtl88e_phy_set_bw_mode(struct ieee80211_hw *hw,
 +			    enum nl80211_channel_type ch_type);
 +void rtl88e_phy_sw_chnl_callback(struct ieee80211_hw *hw);
 +u8 rtl88e_phy_sw_chnl(struct ieee80211_hw *hw);
 +void rtl88e_phy_iq_calibrate(struct ieee80211_hw *hw, bool b_recovery);
  void rtl88e_phy_lc_calibrate(struct ieee80211_hw *hw);
  void rtl88e_phy_set_rfpath_switch(struct ieee80211_hw *hw, bool bmain);
  bool rtl88e_phy_config_rf_with_headerfile(struct ieee80211_hw *hw,
diff --cc drivers/net/wireless/rtlwifi/rtl8192ce/phy.h
index f8973e5,aeb268b..9bb4658
--- a/drivers/net/wireless/rtlwifi/rtl8192ce/phy.h
+++ b/drivers/net/wireless/rtlwifi/rtl8192ce/phy.h
@@@ -199,14 -200,15 +197,14 @@@ bool rtl92c_phy_mac_config(struct ieee8
  bool rtl92ce_phy_bb_config(struct ieee80211_hw *hw);
  bool rtl92c_phy_rf_config(struct ieee80211_hw *hw);
  bool rtl92c_phy_config_rf_with_feaderfile(struct ieee80211_hw *hw,
 -						 enum radio_path rfpath);
 +					  enum radio_path rfpath);
  void rtl92c_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 -void rtl92c_phy_get_txpower_level(struct ieee80211_hw *hw,
 -					 long *powerlevel);
 +void rtl92c_phy_get_txpower_level(struct ieee80211_hw *hw, long *powerlevel);
  void rtl92c_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
- bool rtl92c_phy_update_txpower_dbm(struct ieee80211_hw *hw, long power_indbm);
- void rtl92c_phy_scan_operation_backup(struct ieee80211_hw *hw, u8 operation);
+ bool rtl92c_phy_update_txpower_dbm(struct ieee80211_hw *hw,
+ 					  long power_indbm);
  void rtl92c_phy_set_bw_mode(struct ieee80211_hw *hw,
 -				   enum nl80211_channel_type ch_type);
 +			    enum nl80211_channel_type ch_type);
  void rtl92c_phy_sw_chnl_callback(struct ieee80211_hw *hw);
  u8 rtl92c_phy_sw_chnl(struct ieee80211_hw *hw);
  void rtl92c_phy_iq_calibrate(struct ieee80211_hw *hw, bool b_recovery);
@@@ -217,10 -220,10 +215,10 @@@ void _rtl92ce_phy_lc_calibrate(struct i
  void rtl92c_phy_set_rfpath_switch(struct ieee80211_hw *hw, bool bmain);
  bool rtl92c_phy_config_rf_with_headerfile(struct ieee80211_hw *hw,
  					  enum radio_path rfpath);
- bool rtl8192_phy_check_is_legal_rfpath(struct ieee80211_hw *hw, u32 rfpath);
- bool rtl92c_phy_set_io_cmd(struct ieee80211_hw *hw, enum io_type iotype);
+ bool rtl8192_phy_check_is_legal_rfpath(struct ieee80211_hw *hw,
+ 					      u32 rfpath);
  bool rtl92ce_phy_set_rf_power_state(struct ieee80211_hw *hw,
 -					  enum rf_pwrstate rfpwr_state);
 +				    enum rf_pwrstate rfpwr_state);
  void rtl92ce_phy_set_rf_on(struct ieee80211_hw *hw);
  bool rtl92c_phy_set_io_cmd(struct ieee80211_hw *hw, enum io_type iotype);
  void rtl92c_phy_set_io(struct ieee80211_hw *hw);
diff --cc drivers/net/wireless/rtlwifi/rtl8192de/phy.h
index 0f993f4,bef3040..33df0d1c
--- a/drivers/net/wireless/rtlwifi/rtl8192de/phy.h
+++ b/drivers/net/wireless/rtlwifi/rtl8192de/phy.h
@@@ -127,24 -125,26 +125,23 @@@ static inline void rtl92d_release_cckan
  			*flag);
  }
  
 -extern u32 rtl92d_phy_query_bb_reg(struct ieee80211_hw *hw,
 -				   u32 regaddr, u32 bitmask);
 -extern void rtl92d_phy_set_bb_reg(struct ieee80211_hw *hw,
 -				  u32 regaddr, u32 bitmask, u32 data);
 -extern u32 rtl92d_phy_query_rf_reg(struct ieee80211_hw *hw,
 -				   enum radio_path rfpath, u32 regaddr,
 -				   u32 bitmask);
 -extern void rtl92d_phy_set_rf_reg(struct ieee80211_hw *hw,
 -				  enum radio_path rfpath, u32 regaddr,
 -				  u32 bitmask, u32 data);
 -extern bool rtl92d_phy_mac_config(struct ieee80211_hw *hw);
 -extern bool rtl92d_phy_bb_config(struct ieee80211_hw *hw);
 -extern bool rtl92d_phy_rf_config(struct ieee80211_hw *hw);
 -extern bool rtl92c_phy_config_rf_with_feaderfile(struct ieee80211_hw *hw,
 -						 enum radio_path rfpath);
 -extern void rtl92d_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 -extern void rtl92d_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
 -extern void rtl92d_phy_set_bw_mode(struct ieee80211_hw *hw,
 -				   enum nl80211_channel_type ch_type);
 -extern u8 rtl92d_phy_sw_chnl(struct ieee80211_hw *hw);
 +u32 rtl92d_phy_query_bb_reg(struct ieee80211_hw *hw, u32 regaddr, u32 bitmask);
 +void rtl92d_phy_set_bb_reg(struct ieee80211_hw *hw, u32 regaddr, u32 bitmask,
 +			   u32 data);
 +u32 rtl92d_phy_query_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
 +			    u32 regaddr, u32 bitmask);
 +void rtl92d_phy_set_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
 +			   u32 regaddr, u32 bitmask, u32 data);
 +bool rtl92d_phy_mac_config(struct ieee80211_hw *hw);
 +bool rtl92d_phy_bb_config(struct ieee80211_hw *hw);
 +bool rtl92d_phy_rf_config(struct ieee80211_hw *hw);
 +bool rtl92c_phy_config_rf_with_feaderfile(struct ieee80211_hw *hw,
 +					  enum radio_path rfpath);
 +void rtl92d_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 +void rtl92d_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
- void rtl92d_phy_scan_operation_backup(struct ieee80211_hw *hw, u8 operation);
 +void rtl92d_phy_set_bw_mode(struct ieee80211_hw *hw,
 +			    enum nl80211_channel_type ch_type);
 +u8 rtl92d_phy_sw_chnl(struct ieee80211_hw *hw);
  bool rtl92d_phy_config_rf_with_headerfile(struct ieee80211_hw *hw,
  					  enum rf_content content,
  					  enum radio_path rfpath);
diff --cc drivers/net/wireless/rtlwifi/rtl8723ae/phy.h
index bbb950d,3d8f9e3..bb18023
--- a/drivers/net/wireless/rtlwifi/rtl8723ae/phy.h
+++ b/drivers/net/wireless/rtlwifi/rtl8723ae/phy.h
@@@ -183,33 -183,34 +183,30 @@@ struct tx_power_struct 
  	u32 mcs_original_offset[4][16];
  };
  
 -extern u32 rtl8723ae_phy_query_bb_reg(struct ieee80211_hw *hw,
 -				      u32 regaddr, u32 bitmask);
 -extern void rtl8723ae_phy_set_bb_reg(struct ieee80211_hw *hw,
 -				     u32 regaddr, u32 bitmask, u32 data);
 -extern u32 rtl8723ae_phy_query_rf_reg(struct ieee80211_hw *hw,
 -				      enum radio_path rfpath, u32 regaddr,
 -				      u32 bitmask);
 -extern void rtl8723ae_phy_set_rf_reg(struct ieee80211_hw *hw,
 -				     enum radio_path rfpath, u32 regaddr,
 -				     u32 bitmask, u32 data);
 -extern bool rtl8723ae_phy_mac_config(struct ieee80211_hw *hw);
 -extern bool rtl8723ae_phy_bb_config(struct ieee80211_hw *hw);
 -extern bool rtl8723ae_phy_rf_config(struct ieee80211_hw *hw);
 -extern bool rtl92c_phy_config_rf_with_feaderfile(struct ieee80211_hw *hw,
 -						 enum radio_path rfpath);
 -extern void rtl8723ae_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 -extern void rtl8723ae_phy_get_txpower_level(struct ieee80211_hw *hw,
 -					    long *powerlevel);
 -extern void rtl8723ae_phy_set_txpower_level(struct ieee80211_hw *hw,
 -					    u8 channel);
 -extern bool rtl8723ae_phy_update_txpower_dbm(struct ieee80211_hw *hw,
 -					     long power_indbm);
 -extern void rtl8723ae_phy_set_bw_mode_callback(struct ieee80211_hw *hw);
 -extern void rtl8723ae_phy_set_bw_mode(struct ieee80211_hw *hw,
 -				      enum nl80211_channel_type ch_type);
 -extern void rtl8723ae_phy_sw_chnl_callback(struct ieee80211_hw *hw);
 -extern u8 rtl8723ae_phy_sw_chnl(struct ieee80211_hw *hw);
 -extern void rtl8723ae_phy_iq_calibrate(struct ieee80211_hw *hw, bool recovery);
 +u32 rtl8723ae_phy_query_bb_reg(struct ieee80211_hw *hw, u32 regaddr,
 +			       u32 bitmask);
- void rtl8723ae_phy_set_bb_reg(struct ieee80211_hw *hw, u32 regaddr, u32 bitmask,
- 			      u32 data);
- u32 rtl8723ae_phy_query_rf_reg(struct ieee80211_hw *hw,
- 			       enum radio_path rfpath, u32 regaddr,
- 			       u32 bitmask);
- void rtl8723ae_phy_set_rf_reg(struct ieee80211_hw *hw,
- 			      enum radio_path rfpath, u32 regaddr, u32 bitmask,
- 			      u32 data);
++void rtl8723ae_phy_set_bb_reg(struct ieee80211_hw *hw, u32 regaddr,
++			      u32 bitmask, u32 data);
++u32 rtl8723ae_phy_query_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
++			       u32 regaddr, u32 bitmask);
++void rtl8723ae_phy_set_rf_reg(struct ieee80211_hw *hw, enum radio_path rfpath,
++			      u32 regaddr, u32 bitmask, u32 data);
 +bool rtl8723ae_phy_mac_config(struct ieee80211_hw *hw);
 +bool rtl8723ae_phy_bb_config(struct ieee80211_hw *hw);
 +bool rtl8723ae_phy_rf_config(struct ieee80211_hw *hw);
 +bool rtl92c_phy_config_rf_with_feaderfile(struct ieee80211_hw *hw,
 +					  enum radio_path rfpath);
 +void rtl8723ae_phy_get_hw_reg_originalvalue(struct ieee80211_hw *hw);
 +void rtl8723ae_phy_get_txpower_level(struct ieee80211_hw *hw, long *powerlevel);
 +void rtl8723ae_phy_set_txpower_level(struct ieee80211_hw *hw, u8 channel);
 +bool rtl8723ae_phy_update_txpower_dbm(struct ieee80211_hw *hw,
- 				      long power_indbm);
- void rtl8723ae_phy_scan_operation_backup(struct ieee80211_hw *hw, u8 operation);
++                                      long power_indbm);
 +void rtl8723ae_phy_set_bw_mode_callback(struct ieee80211_hw *hw);
 +void rtl8723ae_phy_set_bw_mode(struct ieee80211_hw *hw,
 +			       enum nl80211_channel_type ch_type);
 +void rtl8723ae_phy_sw_chnl_callback(struct ieee80211_hw *hw);
 +u8 rtl8723ae_phy_sw_chnl(struct ieee80211_hw *hw);
 +void rtl8723ae_phy_iq_calibrate(struct ieee80211_hw *hw, bool recovery);
  void rtl8723ae_phy_lc_calibrate(struct ieee80211_hw *hw);
  void rtl8723ae_phy_set_rfpath_switch(struct ieee80211_hw *hw, bool bmain);
  bool rtl8723ae_phy_config_rf_with_headerfile(struct ieee80211_hw *hw,

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

* [PATCH 2/2] ixgbe: enable l2 forwarding acceleration for macvlans
From: Neil Horman @ 2013-10-04 20:10 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Andy Gospodarek, David Miller, Neil Horman
In-Reply-To: <1380917405-23801-1-git-send-email-nhorman@tuxdriver.com>

Now that the L2 acceleration ops from the previous patch are in place, enable
ixgbe to take advantage of them.  Allow the driver to allocate queues for a
macvlan so that, when we transmit a frame, the switching is done in hardware
inside the ixgbe card rather than in software.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |  33 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   4 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_l2a.h     |  54 ++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c     |  15 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    | 373 +++++++++++++++++------
 5 files changed, 387 insertions(+), 92 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_l2a.h
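
A rough userspace sketch of the bookkeeping described above, for illustration
only and not part of the patch.  It assumes a bitmask with one bit per pool
(in the spirit of adapter->fwd_bitmask, with bit 0 held by the PF) and the
modulo mapping from a hardware ring index to a per-netdev queue index that
ixgbe_alloc_q_vector() uses once more than one pool is active.  The names
pool_map, pool_alloc(), pool_free() and ring_queue_index() are hypothetical,
and the "lowest free bit" allocation policy is an assumption, since the
allocation path is not visible in this part of the diff.

```c
#include <assert.h>

#define MAX_POOLS 32	/* mirrors IXGBE_MAX_VMDQ_INDICES in this patch */

/* Hypothetical stand-in for adapter->fwd_bitmask: one bit per in-use pool. */
struct pool_map {
	unsigned long in_use;
};

/* Claim the lowest free pool; return its index, or -1 if all are taken. */
static int pool_alloc(struct pool_map *m)
{
	int pool;

	for (pool = 0; pool < MAX_POOLS; pool++) {
		if (!(m->in_use & (1UL << pool))) {
			m->in_use |= 1UL << pool;
			return pool;
		}
	}
	return -1;
}

/* Release a pool; pool 0 is reserved for the PF and is never released. */
static void pool_free(struct pool_map *m, int pool)
{
	assert(pool > 0 && pool < MAX_POOLS);
	m->in_use &= ~(1UL << pool);
}

/*
 * Map a hardware ring index to the queue index seen by the owning netdev.
 * With several pools each netdev only sees its own slice of the rings, so
 * the index wraps at queues_per_pool; with a single pool it is unchanged.
 */
static unsigned int ring_queue_index(unsigned int ring_idx,
				     unsigned int num_pools,
				     unsigned int queues_per_pool)
{
	if (num_pools > 1)
		return ring_idx % queues_per_pool;
	return ring_idx;
}
```

Starting from a map with only bit 0 set, successive pool_alloc() calls hand out
pools 1, 2, ... in order, and a freed pool becomes the next one handed out
again.  With 4 queues per pool, hardware ring 9 shows up as queue 1 of its
netdev.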

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 0ac6b11..e924efa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -219,6 +219,17 @@ enum ixgbe_ring_state_t {
 	__IXGBE_RX_FCOE,
 };
 
+struct ixgbe_fwd_adapter {
+	unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
+	struct net_device *netdev;
+	struct ixgbe_adapter *real_adapter;
+	unsigned int tx_base_queue;
+	unsigned int rx_base_queue;
+	struct net_device_stats net_stats;
+	int pool;
+	bool online;
+};
+
 #define check_for_tx_hang(ring) \
 	test_bit(__IXGBE_TX_DETECT_HANG, &(ring)->state)
 #define set_check_for_tx_hang(ring) \
@@ -236,6 +247,7 @@ struct ixgbe_ring {
 	struct ixgbe_q_vector *q_vector; /* backpointer to host q_vector */
 	struct net_device *netdev;	/* netdev ring belongs to */
 	struct device *dev;		/* device for DMA mapping */
+	struct ixgbe_fwd_adapter *l2_accel_priv;
 	void *desc;			/* descriptor ring memory */
 	union {
 		struct ixgbe_tx_buffer *tx_buffer_info;
@@ -244,6 +256,7 @@ struct ixgbe_ring {
 	unsigned long last_rx_timestamp;
 	unsigned long state;
 	u8 __iomem *tail;
+	struct net_device *vmdq_netdev;
 	dma_addr_t dma;			/* phys. address of descriptor ring */
 	unsigned int size;		/* length in bytes */
 
@@ -288,11 +301,15 @@ enum ixgbe_ring_f_enum {
 };
 
 #define IXGBE_MAX_RSS_INDICES  16
-#define IXGBE_MAX_VMDQ_INDICES 64
+#define IXGBE_MAX_VMDQ_INDICES 32
 #define IXGBE_MAX_FDIR_INDICES 63	/* based on q_vector limit */
 #define IXGBE_MAX_FCOE_INDICES  8
 #define MAX_RX_QUEUES (IXGBE_MAX_FDIR_INDICES + 1)
 #define MAX_TX_QUEUES (IXGBE_MAX_FDIR_INDICES + 1)
+#define IXGBE_MAX_L2A_QUEUES 4
+#define IXGBE_MAX_L2A_QUEUES 4
+#define IXGBE_BAD_L2A_QUEUE 3
+
 struct ixgbe_ring_feature {
 	u16 limit;	/* upper limit on feature indices */
 	u16 indices;	/* current value of indices */
@@ -738,6 +755,7 @@ struct ixgbe_adapter {
 #endif /*CONFIG_DEBUG_FS*/
 
 	u8 default_up;
+	unsigned long fwd_bitmask; /* Bitmask indicating in use pools */
 };
 
 struct ixgbe_fdir_filter {
@@ -879,9 +897,14 @@ static inline void ixgbe_dbg_adapter_exit(struct ixgbe_adapter *adapter) {}
 static inline void ixgbe_dbg_init(void) {}
 static inline void ixgbe_dbg_exit(void) {}
 #endif /* CONFIG_DEBUG_FS */
+static inline struct net_device *netdev_ring(const struct ixgbe_ring *ring)
+{
+	return ring->vmdq_netdev ? ring->vmdq_netdev : ring->netdev;
+}
+
 static inline struct netdev_queue *txring_txq(const struct ixgbe_ring *ring)
 {
-	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
+	return netdev_get_tx_queue(netdev_ring(ring), ring->queue_index);
 }
 
 extern void ixgbe_ptp_init(struct ixgbe_adapter *adapter);
@@ -915,4 +938,10 @@ extern void ixgbe_ptp_check_pps_event(struct ixgbe_adapter *adapter, u32 eicr);
 void ixgbe_sriov_reinit(struct ixgbe_adapter *adapter);
 #endif
 
+int ixgbe_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd);
+int ixgbe_write_uc_addr_list(struct net_device *netdev);
+netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
+				  struct ixgbe_adapter *adapter,
+				  struct ixgbe_ring *tx_ring);
+void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring);
 #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index e8649ab..277af14 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -150,8 +150,8 @@ static const char ixgbe_gstrings_test[][ETH_GSTRING_LEN] = {
 };
 #define IXGBE_TEST_LEN sizeof(ixgbe_gstrings_test) / ETH_GSTRING_LEN
 
-static int ixgbe_get_settings(struct net_device *netdev,
-                              struct ethtool_cmd *ecmd)
+int ixgbe_get_settings(struct net_device *netdev,
+		       struct ethtool_cmd *ecmd)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_l2a.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_l2a.h
new file mode 100644
index 0000000..2f36584
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_l2a.h
@@ -0,0 +1,54 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2013 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+#include "ixgbe.h"
+
+
+static inline void ixgbe_irq_enable_queues(struct ixgbe_adapter *adapter,
+					   u64 qmask)
+{
+	u32 mask;
+	struct ixgbe_hw *hw = &adapter->hw;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		mask = (IXGBE_EIMS_RTX_QUEUE & qmask);
+		IXGBE_WRITE_REG(hw, IXGBE_EIMS, mask);
+		break;
+	case ixgbe_mac_82599EB:
+	case ixgbe_mac_X540:
+		mask = (qmask & 0xFFFFFFFF);
+		if (mask)
+			IXGBE_WRITE_REG(hw, IXGBE_EIMS_EX(0), mask);
+		mask = (qmask >> 32);
+		if (mask)
+			IXGBE_WRITE_REG(hw, IXGBE_EIMS_EX(1), mask);
+		break;
+	default:
+		break;
+	}
+	/* skip the flush */
+}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 90b4e10..e2dd635 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -500,7 +500,8 @@ static bool ixgbe_set_sriov_queues(struct ixgbe_adapter *adapter)
 #endif
 
 	/* only proceed if SR-IOV is enabled */
-	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
+	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) &&
+	    !(adapter->flags & IXGBE_FLAG_VMDQ_ENABLED))
 		return false;
 
 	/* Add starting offset to total pool count */
@@ -852,7 +853,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 
 		/* apply Tx specific ring traits */
 		ring->count = adapter->tx_ring_count;
-		ring->queue_index = txr_idx;
+		if (adapter->num_rx_pools > 1)
+			ring->queue_index =
+				txr_idx % adapter->num_rx_queues_per_pool;
+		else
+			ring->queue_index = txr_idx;
 
 		/* assign ring to adapter */
 		adapter->tx_ring[txr_idx] = ring;
@@ -895,7 +900,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 #endif /* IXGBE_FCOE */
 		/* apply Rx specific ring traits */
 		ring->count = adapter->rx_ring_count;
-		ring->queue_index = rxr_idx;
+		if (adapter->num_rx_pools > 1)
+			ring->queue_index =
+				rxr_idx % adapter->num_rx_queues_per_pool;
+		else
+			ring->queue_index = rxr_idx;
 
 		/* assign ring to adapter */
 		adapter->rx_ring[rxr_idx] = ring;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0ade0cd..5fa553f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -52,6 +52,7 @@
 #include "ixgbe_common.h"
 #include "ixgbe_dcb_82599.h"
 #include "ixgbe_sriov.h"
+#include "ixgbe_l2a.h"
 
 char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
@@ -118,6 +119,8 @@ static DEFINE_PCI_DEVICE_TABLE(ixgbe_pci_tbl) = {
 };
 MODULE_DEVICE_TABLE(pci, ixgbe_pci_tbl);
 
+netdev_tx_t ixgbe_fwd_xmit_frame(struct sk_buff *skb, void *priv);
+
 #ifdef CONFIG_IXGBE_DCA
 static int ixgbe_notify_dca(struct notifier_block *, unsigned long event,
 			    void *p);
@@ -872,7 +875,8 @@ static u64 ixgbe_get_tx_completed(struct ixgbe_ring *ring)
 
 static u64 ixgbe_get_tx_pending(struct ixgbe_ring *ring)
 {
-	struct ixgbe_adapter *adapter = netdev_priv(ring->netdev);
+	struct net_device *dev = ring->netdev;
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_hw *hw = &adapter->hw;
 
 	u32 head = IXGBE_READ_REG(hw, IXGBE_TDH(ring->reg_idx));
@@ -1055,7 +1059,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 			tx_ring->next_to_use, i,
 			tx_ring->tx_buffer_info[i].time_stamp, jiffies);
 
-		netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
+		netif_stop_subqueue(netdev_ring(tx_ring), tx_ring->queue_index);
 
 		e_info(probe,
 		       "tx hang %d detected on queue %d, resetting adapter\n",
@@ -1072,16 +1076,16 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 				  total_packets, total_bytes);
 
 #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
-	if (unlikely(total_packets && netif_carrier_ok(tx_ring->netdev) &&
+	if (unlikely(total_packets && netif_carrier_ok(netdev_ring(tx_ring)) &&
 		     (ixgbe_desc_unused(tx_ring) >= TX_WAKE_THRESHOLD))) {
 		/* Make sure that anybody stopping the queue after this
 		 * sees the new next_to_clean.
 		 */
 		smp_mb();
-		if (__netif_subqueue_stopped(tx_ring->netdev,
+		if (__netif_subqueue_stopped(netdev_ring(tx_ring),
 					     tx_ring->queue_index)
 		    && !test_bit(__IXGBE_DOWN, &adapter->state)) {
-			netif_wake_subqueue(tx_ring->netdev,
+			netif_wake_subqueue(netdev_ring(tx_ring),
 					    tx_ring->queue_index);
 			++tx_ring->tx_stats.restart_queue;
 		}
@@ -1226,7 +1230,7 @@ static inline void ixgbe_rx_hash(struct ixgbe_ring *ring,
 				 union ixgbe_adv_rx_desc *rx_desc,
 				 struct sk_buff *skb)
 {
-	if (ring->netdev->features & NETIF_F_RXHASH)
+	if (netdev_ring(ring)->features & NETIF_F_RXHASH)
 		skb->rxhash = le32_to_cpu(rx_desc->wb.lower.hi_dword.rss);
 }
 
@@ -1260,10 +1264,12 @@ static inline void ixgbe_rx_checksum(struct ixgbe_ring *ring,
 				     union ixgbe_adv_rx_desc *rx_desc,
 				     struct sk_buff *skb)
 {
+	struct net_device *dev = netdev_ring(ring);
+
 	skb_checksum_none_assert(skb);
 
 	/* Rx csum disabled */
-	if (!(ring->netdev->features & NETIF_F_RXCSUM))
+	if (!(dev->features & NETIF_F_RXCSUM))
 		return;
 
 	/* if IP and error */
@@ -1559,7 +1565,7 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
 				     union ixgbe_adv_rx_desc *rx_desc,
 				     struct sk_buff *skb)
 {
-	struct net_device *dev = rx_ring->netdev;
+	struct net_device *dev = netdev_ring(rx_ring);
 
 	ixgbe_update_rsc_stats(rx_ring, skb);
 
@@ -1739,7 +1745,7 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 				  union ixgbe_adv_rx_desc *rx_desc,
 				  struct sk_buff *skb)
 {
-	struct net_device *netdev = rx_ring->netdev;
+	struct net_device *netdev = netdev_ring(rx_ring);
 
 	/* verify that the packet does not have any known errors */
 	if (unlikely(ixgbe_test_staterr(rx_desc,
@@ -1905,7 +1911,7 @@ static struct sk_buff *ixgbe_fetch_rx_buffer(struct ixgbe_ring *rx_ring,
 #endif
 
 		/* allocate a skb to store the frags */
-		skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
+		skb = netdev_alloc_skb_ip_align(netdev_ring(rx_ring),
 						IXGBE_RX_HDR_SIZE);
 		if (unlikely(!skb)) {
 			rx_ring->rx_stats.alloc_rx_buff_failed++;
@@ -1986,6 +1992,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	int ddp_bytes;
 	unsigned int mss = 0;
+	struct net_device *netdev = netdev_ring(rx_ring);
 #endif /* IXGBE_FCOE */
 	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
 
@@ -1993,6 +2000,10 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		union ixgbe_adv_rx_desc *rx_desc;
 		struct sk_buff *skb;
 
+		if (rx_ring->l2_accel_priv) {
+			printk(KERN_CRIT "RECEIVING ON AN ACCELERATED QUEUE\n");
+		}
+
 		/* return some buffers to hardware, one at a time is too slow */
 		if (cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
 			ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
@@ -2041,7 +2052,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			/* include DDPed FCoE data */
 			if (ddp_bytes > 0) {
 				if (!mss) {
-					mss = rx_ring->netdev->mtu -
+					mss = netdev->mtu -
 						sizeof(struct fcoe_hdr) -
 						sizeof(struct fc_frame_header) -
 						sizeof(struct fcoe_crc_eof);
@@ -2455,58 +2466,6 @@ static void ixgbe_check_lsc(struct ixgbe_adapter *adapter)
 	}
 }
 
-static inline void ixgbe_irq_enable_queues(struct ixgbe_adapter *adapter,
-					   u64 qmask)
-{
-	u32 mask;
-	struct ixgbe_hw *hw = &adapter->hw;
-
-	switch (hw->mac.type) {
-	case ixgbe_mac_82598EB:
-		mask = (IXGBE_EIMS_RTX_QUEUE & qmask);
-		IXGBE_WRITE_REG(hw, IXGBE_EIMS, mask);
-		break;
-	case ixgbe_mac_82599EB:
-	case ixgbe_mac_X540:
-		mask = (qmask & 0xFFFFFFFF);
-		if (mask)
-			IXGBE_WRITE_REG(hw, IXGBE_EIMS_EX(0), mask);
-		mask = (qmask >> 32);
-		if (mask)
-			IXGBE_WRITE_REG(hw, IXGBE_EIMS_EX(1), mask);
-		break;
-	default:
-		break;
-	}
-	/* skip the flush */
-}
-
-static inline void ixgbe_irq_disable_queues(struct ixgbe_adapter *adapter,
-					    u64 qmask)
-{
-	u32 mask;
-	struct ixgbe_hw *hw = &adapter->hw;
-
-	switch (hw->mac.type) {
-	case ixgbe_mac_82598EB:
-		mask = (IXGBE_EIMS_RTX_QUEUE & qmask);
-		IXGBE_WRITE_REG(hw, IXGBE_EIMC, mask);
-		break;
-	case ixgbe_mac_82599EB:
-	case ixgbe_mac_X540:
-		mask = (qmask & 0xFFFFFFFF);
-		if (mask)
-			IXGBE_WRITE_REG(hw, IXGBE_EIMC_EX(0), mask);
-		mask = (qmask >> 32);
-		if (mask)
-			IXGBE_WRITE_REG(hw, IXGBE_EIMC_EX(1), mask);
-		break;
-	default:
-		break;
-	}
-	/* skip the flush */
-}
-
 /**
  * ixgbe_irq_enable - Enable default interrupt generation settings
  * @adapter: board private structure
@@ -2946,6 +2905,7 @@ static void ixgbe_configure_msi_and_legacy(struct ixgbe_adapter *adapter)
 void ixgbe_configure_tx_ring(struct ixgbe_adapter *adapter,
 			     struct ixgbe_ring *ring)
 {
+	struct net_device *netdev = netdev_ring(ring);
 	struct ixgbe_hw *hw = &adapter->hw;
 	u64 tdba = ring->dma;
 	int wait_loop = 10;
@@ -3005,7 +2965,7 @@ void ixgbe_configure_tx_ring(struct ixgbe_adapter *adapter,
 		struct ixgbe_q_vector *q_vector = ring->q_vector;
 
 		if (q_vector)
-			netif_set_xps_queue(adapter->netdev,
+			netif_set_xps_queue(netdev,
 					    &q_vector->affinity_mask,
 					    ring->queue_index);
 	}
@@ -3395,7 +3355,7 @@ static void ixgbe_setup_psrtype(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 	int rss_i = adapter->ring_feature[RING_F_RSS].indices;
-	int p;
+	u16 pool;
 
 	/* PSRTYPE must be initialized in non 82598 adapters */
 	u32 psrtype = IXGBE_PSRTYPE_TCPHDR |
@@ -3412,9 +3372,8 @@ static void ixgbe_setup_psrtype(struct ixgbe_adapter *adapter)
 	else if (rss_i > 1)
 		psrtype |= 1 << 29;
 
-	for (p = 0; p < adapter->num_rx_pools; p++)
-		IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(VMDQ_P(p)),
-				psrtype);
+	for_each_set_bit(pool, &adapter->fwd_bitmask, 32)
+		IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(VMDQ_P(pool)), psrtype);
 }
 
 static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter)
@@ -3683,6 +3642,8 @@ static void ixgbe_vlan_strip_disable(struct ixgbe_adapter *adapter)
 	case ixgbe_mac_82599EB:
 	case ixgbe_mac_X540:
 		for (i = 0; i < adapter->num_rx_queues; i++) {
+			if (adapter->rx_ring[i]->vmdq_netdev)
+				continue;
 			j = adapter->rx_ring[i]->reg_idx;
 			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
 			vlnctrl &= ~IXGBE_RXDCTL_VME;
@@ -3713,6 +3674,8 @@ static void ixgbe_vlan_strip_enable(struct ixgbe_adapter *adapter)
 	case ixgbe_mac_82599EB:
 	case ixgbe_mac_X540:
 		for (i = 0; i < adapter->num_rx_queues; i++) {
+			if (adapter->rx_ring[i]->vmdq_netdev)
+				continue;
 			j = adapter->rx_ring[i]->reg_idx;
 			vlnctrl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(j));
 			vlnctrl |= IXGBE_RXDCTL_VME;
@@ -3743,15 +3706,16 @@ static void ixgbe_restore_vlan(struct ixgbe_adapter *adapter)
  *                0 on no addresses written
  *                X on writing X addresses to the RAR table
  **/
-static int ixgbe_write_uc_addr_list(struct net_device *netdev)
+int ixgbe_write_uc_addr_list(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_hw *hw = &adapter->hw;
 	unsigned int rar_entries = hw->mac.num_rar_entries - 1;
 	int count = 0;
 
-	/* In SR-IOV mode significantly less RAR entries are available */
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+	/* In SR-IOV/VMDQ modes significantly fewer RAR entries are available */
+	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED ||
+	    adapter->flags & IXGBE_FLAG_VMDQ_ENABLED)
 		rar_entries = IXGBE_MAX_PF_MACVLANS - 1;
 
 	/* return ENOMEM indicating insufficient memory for addresses */
@@ -3772,6 +3736,7 @@ static int ixgbe_write_uc_addr_list(struct net_device *netdev)
 			count++;
 		}
 	}
+
 	/* write the addresses in reverse order to avoid write combining */
 	for (; rar_entries > 0 ; rar_entries--)
 		hw->mac.ops.clear_rar(hw, rar_entries);
@@ -4133,6 +4098,7 @@ static void ixgbe_configure(struct ixgbe_adapter *adapter)
 	ixgbe_configure_virtualization(adapter);
 
 	ixgbe_set_rx_mode(adapter->netdev);
+
 	ixgbe_restore_vlan(adapter);
 
 	switch (hw->mac.type) {
@@ -4459,7 +4425,7 @@ void ixgbe_reset(struct ixgbe_adapter *adapter)
  * ixgbe_clean_rx_ring - Free Rx Buffers per Queue
  * @rx_ring: ring to free buffers from
  **/
-static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
+void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 {
 	struct device *dev = rx_ring->dev;
 	unsigned long size;
@@ -4838,6 +4804,8 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
 		return -EIO;
 	}
 
+	/* PF holds first pool slot */
+	set_bit(0, &adapter->fwd_bitmask);
 	set_bit(__IXGBE_DOWN, &adapter->state);
 
 	return 0;
@@ -5143,7 +5111,7 @@ static int ixgbe_change_mtu(struct net_device *netdev, int new_mtu)
 static int ixgbe_open(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	int err;
+	int err, queues;
 
 	/* disallow open during test */
 	if (test_bit(__IXGBE_TESTING, &adapter->state))
@@ -5168,16 +5136,22 @@ static int ixgbe_open(struct net_device *netdev)
 		goto err_req_irq;
 
 	/* Notify the stack of the actual queue counts. */
-	err = netif_set_real_num_tx_queues(netdev,
-					   adapter->num_rx_pools > 1 ? 1 :
-					   adapter->num_tx_queues);
+	if (adapter->num_rx_pools > 1 &&
+	    adapter->num_tx_queues > IXGBE_MAX_L2A_QUEUES)
+		queues = IXGBE_MAX_L2A_QUEUES;
+	else
+		queues = adapter->num_tx_queues;
+
+	err = netif_set_real_num_tx_queues(netdev, queues);
 	if (err)
 		goto err_set_queues;
 
-
-	err = netif_set_real_num_rx_queues(netdev,
-					   adapter->num_rx_pools > 1 ? 1 :
-					   adapter->num_rx_queues);
+	if (adapter->num_rx_pools > 1 &&
+	    adapter->num_rx_queues > IXGBE_MAX_L2A_QUEUES)
+		queues = IXGBE_MAX_L2A_QUEUES;
+	else
+		queues = adapter->num_rx_queues;
+	err = netif_set_real_num_rx_queues(netdev, queues);
 	if (err)
 		goto err_set_queues;
 
@@ -5215,7 +5189,6 @@ static int ixgbe_close(struct net_device *netdev)
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
 	ixgbe_ptp_stop(adapter);
-
 	ixgbe_down(adapter);
 	ixgbe_free_irq(adapter);
 
@@ -6576,7 +6549,7 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
 
 static int __ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
 {
-	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
+	netif_stop_subqueue(netdev_ring(tx_ring), tx_ring->queue_index);
 	/* Herbert's original patch had:
 	 *  smp_mb__after_netif_stop_queue();
 	 * but since that doesn't exist yet, just open code it. */
@@ -6588,7 +6561,7 @@ static int __ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
 		return -EBUSY;
 
 	/* A reprieve! - use start_queue because it doesn't call schedule */
-	netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
+	netif_start_subqueue(netdev_ring(tx_ring), tx_ring->queue_index);
 	++tx_ring->tx_stats.restart_queue;
 	return 0;
 }
@@ -6639,6 +6612,9 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 			  struct ixgbe_ring *tx_ring)
 {
 	struct ixgbe_tx_buffer *first;
+#ifdef IXGBE_FCOE
+	struct net_device *dev;
+#endif
 	int tso;
 	u32 tx_flags = 0;
 	unsigned short f;
@@ -6730,9 +6706,10 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 	first->protocol = protocol;
 
 #ifdef IXGBE_FCOE
+	dev = netdev_ring(tx_ring);
 	/* setup tx offload for FCoE */
 	if ((protocol == __constant_htons(ETH_P_FCOE)) &&
-	    (tx_ring->netdev->features & (NETIF_F_FSO | NETIF_F_FCOE_CRC))) {
+	    (dev->features & (NETIF_F_FSO | NETIF_F_FCOE_CRC))) {
 		tso = ixgbe_fso(tx_ring, first, &hdr_len);
 		if (tso < 0)
 			goto out_drop;
@@ -6784,7 +6761,15 @@ static netdev_tx_t ixgbe_xmit_frame(struct sk_buff *skb,
 		skb_set_tail_pointer(skb, 17);
 	}
 
-	tx_ring = adapter->tx_ring[skb->queue_mapping];
+	if (skb->accel_priv) {
+		struct ixgbe_fwd_adapter *fwd_adapter = skb->accel_priv;
+		unsigned int queue;
+
+		queue = skb->queue_mapping + fwd_adapter->tx_base_queue;
+		tx_ring = fwd_adapter->real_adapter->tx_ring[queue];
+	} else 
+		tx_ring = adapter->tx_ring[skb->queue_mapping];
+
 	return ixgbe_xmit_frame_ring(skb, adapter, tx_ring);
 }
 
@@ -7057,6 +7042,7 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 	 */
 	if (netif_running(dev))
 		ixgbe_close(dev);
+
 	ixgbe_clear_interrupt_scheme(adapter);
 
 #ifdef CONFIG_IXGBE_DCB
@@ -7305,6 +7291,217 @@ static int ixgbe_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 	return ndo_dflt_bridge_getlink(skb, pid, seq, dev, mode);
 }
 
+static void ixgbe_irq_disable_queues(struct ixgbe_adapter *adapter, u64 qmask)
+{
+	u32 mask;
+	struct ixgbe_hw *hw = &adapter->hw;
+
+	switch (hw->mac.type) {
+	case ixgbe_mac_82598EB:
+		mask = (IXGBE_EIMS_RTX_QUEUE & qmask);
+		IXGBE_WRITE_REG(hw, IXGBE_EIMC, mask);
+		break;
+	case ixgbe_mac_82599EB:
+	case ixgbe_mac_X540:
+		mask = (qmask & 0xFFFFFFFF);
+		if (mask)
+			IXGBE_WRITE_REG(hw, IXGBE_EIMC_EX(0), mask);
+		mask = (qmask >> 32);
+		if (mask)
+			IXGBE_WRITE_REG(hw, IXGBE_EIMC_EX(1), mask);
+		break;
+	default:
+		break;
+	}
+}
+
+static void ixgbe_add_mac_filter(struct ixgbe_adapter *adapter,
+				 u8 *addr, u16 pool)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	unsigned int entry;
+
+	entry = hw->mac.num_rar_entries - pool;
+	hw->mac.ops.set_rar(hw, entry, addr, VMDQ_P(pool), IXGBE_RAH_AV);
+}
+
+static void ixgbe_fwd_psrtype(struct ixgbe_fwd_adapter *vadapter)
+{
+	struct ixgbe_adapter *adapter = vadapter->real_adapter;
+	int rss_i = vadapter->netdev->real_num_rx_queues;
+	struct ixgbe_hw *hw = &adapter->hw;
+	u16 pool = vadapter->pool;
+	u32 psrtype = IXGBE_PSRTYPE_TCPHDR |
+		      IXGBE_PSRTYPE_UDPHDR |
+		      IXGBE_PSRTYPE_IPV4HDR |
+		      IXGBE_PSRTYPE_L2HDR |
+		      IXGBE_PSRTYPE_IPV6HDR;
+
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		return;
+
+	if (rss_i > 3)
+		psrtype |= 2 << 29;
+	else if (rss_i > 1)
+		psrtype |= 1 << 29;
+
+	IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(VMDQ_P(pool)), psrtype);
+}
+
+static void ixgbe_enable_fwd_ring(struct ixgbe_adapter *adapter,
+				  struct ixgbe_ring *rx_ring,
+				  struct ixgbe_fwd_adapter *accel)
+{
+	rx_ring->l2_accel_priv = accel;
+	ixgbe_configure_rx_ring(adapter, rx_ring);
+}
+
+
+static void ixgbe_disable_fwd_ring(struct ixgbe_fwd_adapter *vadapter,
+				   struct ixgbe_ring *rx_ring)
+{
+	struct ixgbe_adapter *adapter = vadapter->real_adapter;
+	int index = rx_ring->queue_index + vadapter->rx_base_queue;
+
+	/* shutdown specific queue receive and wait for dma to settle */
+	ixgbe_disable_rx_queue(adapter, rx_ring);
+	usleep_range(10000, 20000);
+	ixgbe_irq_disable_queues(adapter, ((u64)1 << index));
+	ixgbe_clean_rx_ring(rx_ring);
+	rx_ring->l2_accel_priv = NULL;
+}
+
+int ixgbe_fwd_ring_up(struct net_device *vdev, struct ixgbe_fwd_adapter *accel)
+{
+	struct ixgbe_adapter *adapter = accel->real_adapter;
+	unsigned int rxbase = accel->pool * adapter->num_rx_queues_per_pool;
+	unsigned int txbase = accel->pool * adapter->num_rx_queues_per_pool;
+	int err, i;
+
+
+	accel->rx_base_queue = rxbase;
+	accel->tx_base_queue = txbase;
+
+	for (i = 0; i < vdev->num_rx_queues; i++)
+		ixgbe_disable_fwd_ring(accel, adapter->rx_ring[rxbase + i]);
+
+	for (i = 0; i < vdev->num_rx_queues; i++) {
+		adapter->rx_ring[rxbase + i]->vmdq_netdev = vdev;
+		ixgbe_enable_fwd_ring(adapter, adapter->rx_ring[rxbase + i], accel);
+	}
+
+	for (i = 0; i < vdev->num_tx_queues; i++)
+		adapter->tx_ring[txbase + i]->vmdq_netdev = vdev;
+
+	if (is_valid_ether_addr(vdev->dev_addr))
+		ixgbe_add_mac_filter(adapter, vdev->dev_addr, accel->pool);
+
+	err = netif_set_real_num_tx_queues(vdev, vdev->num_tx_queues);
+	if (err)
+		goto err_set_queues;
+	err = netif_set_real_num_rx_queues(vdev, vdev->num_rx_queues);
+	if (err)
+		goto err_set_queues;
+
+	ixgbe_fwd_psrtype(accel);
+	netif_tx_start_all_queues(vdev);
+	return 0;
+err_set_queues:
+	for (i = 0; i < vdev->num_rx_queues; i++)
+		ixgbe_disable_fwd_ring(accel, adapter->rx_ring[rxbase + i]);
+	return err;
+}
+
+int ixgbe_fwd_ring_down(struct net_device *vdev, struct ixgbe_fwd_adapter *accel)
+{
+	struct ixgbe_adapter *adapter = accel->real_adapter;
+	unsigned int rxbase = accel->rx_base_queue;
+	int i;
+
+	netif_tx_stop_all_queues(vdev);
+
+	for (i = 0; i < vdev->num_rx_queues; i++)
+		ixgbe_disable_fwd_ring(accel, adapter->rx_ring[rxbase + i]);
+
+	return 0;
+}
+
+static void* ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
+{
+	struct ixgbe_fwd_adapter *fwd_adapter = NULL;
+	struct ixgbe_adapter *adapter = netdev_priv(pdev);
+	int pool, vmdq_pool, base_queue;
+	int err;
+
+	/* Check for hardware restriction on number of rx/tx queues */
+	if (vdev->num_rx_queues != vdev->num_tx_queues ||
+	    vdev->num_tx_queues > IXGBE_MAX_L2A_QUEUES ||
+	    vdev->num_tx_queues == IXGBE_BAD_L2A_QUEUE) {
+		netdev_info(pdev, "%s: Supports RX/TX Queue counts 1,2, and 4\n",
+		       pdev->name);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (adapter->num_rx_pools > IXGBE_MAX_VMDQ_INDICES)
+		return ERR_PTR(-EBUSY);
+
+	fwd_adapter = kcalloc(1, sizeof(struct ixgbe_fwd_adapter), GFP_KERNEL);
+	if (!fwd_adapter)
+		return ERR_PTR(-ENOMEM);
+
+	pool = find_first_zero_bit(&adapter->fwd_bitmask, 32);
+	adapter->num_rx_pools++;
+	set_bit(pool, &adapter->fwd_bitmask);
+
+	/* Enable VMDq flag so device will be set in VM mode */
+	adapter->flags |= IXGBE_FLAG_VMDQ_ENABLED | IXGBE_FLAG_SRIOV_ENABLED;
+	adapter->ring_feature[RING_F_VMDQ].limit = adapter->num_rx_pools;
+	adapter->ring_feature[RING_F_VMDQ].offset = 0;
+	adapter->ring_feature[RING_F_RSS].limit = IXGBE_MAX_L2A_QUEUES;
+
+	/* Force reinit of ring allocation with VMDQ enabled */
+	ixgbe_setup_tc(pdev, netdev_get_num_tc(pdev));
+
+	/* Configure VSI adapter structure */
+	vmdq_pool = VMDQ_P(pool);
+	base_queue = vmdq_pool * adapter->num_rx_queues_per_pool;
+
+	netdev_dbg(pdev, "pool %i:%i queues %i:%i VSI bitmask %lx\n",
+		   pool, adapter->num_rx_pools,
+		   base_queue, base_queue + adapter->num_rx_queues_per_pool,
+		   adapter->fwd_bitmask);
+
+	fwd_adapter->pool = pool;
+	fwd_adapter->netdev = vdev;
+	fwd_adapter->real_adapter = adapter;
+	fwd_adapter->rx_base_queue = base_queue;
+	fwd_adapter->tx_base_queue = base_queue;
+
+	err = ixgbe_fwd_ring_up(vdev, fwd_adapter);
+	if (err) {
+		kfree(fwd_adapter);
+		return ERR_PTR(err);
+	}
+	return fwd_adapter;
+}
+
+static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
+{
+	struct ixgbe_fwd_adapter *fwd_adapter = priv; 
+	struct ixgbe_adapter *adapter = fwd_adapter->real_adapter;
+
+	clear_bit(fwd_adapter->pool, &adapter->fwd_bitmask);
+	adapter->num_rx_pools--;
+
+	ixgbe_fwd_ring_down(fwd_adapter->netdev, fwd_adapter);
+
+	netdev_dbg(pdev, "pool %i:%i queues %i:%i VSI bitmask %lx\n",
+		   fwd_adapter->pool, adapter->num_rx_pools,
+		   fwd_adapter->rx_base_queue,
+		   fwd_adapter->rx_base_queue + adapter->num_rx_queues_per_pool,
+		   adapter->fwd_bitmask);
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -7351,6 +7548,11 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_bridge_getlink	= ixgbe_ndo_bridge_getlink,
 };
 
+const struct forwarding_accel_ops  ixgbe_fwd_ops = {
+	.fwd_accel_add_station = ixgbe_fwd_add,
+	.fwd_accel_del_station = ixgbe_fwd_del,
+};
+
 /**
  * ixgbe_enumerate_functions - Get the number of ports this device has
  * @adapter: adapter structure
@@ -7554,6 +7756,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	}
 
 	netdev->netdev_ops = &ixgbe_netdev_ops;
+	netdev->fwd_ops = &ixgbe_fwd_ops;
 	ixgbe_set_ethtool_ops(netdev);
 	netdev->watchdog_timeo = 5 * HZ;
 	strncpy(netdev->name, pci_name(pdev), sizeof(netdev->name) - 1);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 1/2] net: Add layer 2 hardware acceleration operations for macvlan devices
From: Neil Horman @ 2013-10-04 20:10 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Andy Gospodarek, David Miller, Neil Horman
In-Reply-To: <1380917405-23801-1-git-send-email-nhorman@tuxdriver.com>

Add an operations structure that allows a network interface to export the fact
that it supports packet forwarding in hardware between physical interfaces and
other mac layer devices assigned to it (such as macvlans).  This operations
structure can be used by virtual mac devices to bypass software switching so
that forwarding can be done in hardware more efficiently.
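
As a rough illustration of the idea, here is a user-space sketch of an ops
table guarded by a capability flag, with an all-zero default table so callers
never need a NULL check on the ops pointer.  Only FA_FLG_STA_SUPPORT is taken
from the patch; every model_* and toy_* name is hypothetical and this is not
kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define FA_FLG_STA_SUPPORT (1u << 1)    /* same bit value as in the patch */

struct model_dev;

/* Hypothetical stand-in for struct forwarding_accel_ops. */
struct model_fwd_ops {
    unsigned int flags;
    void *(*add_station)(struct model_dev *lower, struct model_dev *upper);
    void (*del_station)(struct model_dev *lower, void *priv);
};

struct model_dev {
    const char *name;
    const struct model_fwd_ops *fwd_ops;
    void *fwd_priv;
};

/* All-zero ops: no flag is set, so the NULL callbacks are never called. */
static const struct model_fwd_ops model_default_fwd_ops = { 0 };

static void model_dev_init(struct model_dev *dev, const char *name)
{
    dev->name = name;
    dev->fwd_ops = &model_default_fwd_ops;
    dev->fwd_priv = NULL;
}

static int model_supports(const struct model_dev *dev, unsigned int feature)
{
    return (dev->fwd_ops->flags & feature) != 0;
}

/* Returns 1 if the upper device was offloaded, 0 if it must fall back
 * to software switching. */
static int model_open(struct model_dev *lower, struct model_dev *upper)
{
    upper->fwd_priv = NULL;
    if (model_supports(lower, FA_FLG_STA_SUPPORT)) {
        /* The flag promises that the callbacks are populated. */
        assert(lower->fwd_ops->add_station != NULL);
        upper->fwd_priv = lower->fwd_ops->add_station(lower, upper);
    }
    return upper->fwd_priv != NULL;
}

/* Toy "driver" that accepts every station. */
static int toy_station;

static void *toy_add(struct model_dev *lower, struct model_dev *upper)
{
    (void)lower;
    (void)upper;
    return &toy_station;
}

static void toy_del(struct model_dev *lower, void *priv)
{
    (void)lower;
    (void)priv;
}

static const struct model_fwd_ops toy_fwd_ops = {
    .flags = FA_FLG_STA_SUPPORT,
    .add_station = toy_add,
    .del_station = toy_del,
};
```

A device left on the default ops reports no support and takes the software
path; a device that installs toy_fwd_ops is offloaded.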

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
---
 drivers/net/macvlan.c      | 32 ++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |  1 +
 include/linux/netdevice.h  | 22 ++++++++++++++++++++++
 include/linux/skbuff.h     |  9 ++++++---
 net/core/dev.c             |  3 +++
 5 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 9bf46bd..38d0fc5 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -297,7 +297,17 @@ netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 	int ret;
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 
+	if (vlan->fwd_priv) {
+		skb->dev = vlan->lowerdev;
+		skb->accel_priv = vlan->fwd_priv;
+		ret = dev_queue_xmit(skb);
+		if (likely(ret == NETDEV_TX_OK))
+			goto update_stats;
+	}
+
+	skb->accel_priv = NULL;
 	ret = macvlan_queue_xmit(skb, dev);
+update_stats:
 	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
 		struct macvlan_pcpu_stats *pcpu_stats;
 
@@ -347,6 +357,18 @@ static int macvlan_open(struct net_device *dev)
 		goto hash_add;
 	}
 
+	if (fwd_accel_supports(lowerdev, FA_FLG_STA_SUPPORT)) {
+		vlan->fwd_priv = fwd_accel_add_station(lowerdev, dev);
+		/*
+		 * If we get a NULL pointer back, or if we get an error
+		 * then we should just fall through to the non accelerated path
+		 */
+		if (IS_ERR_OR_NULL(vlan->fwd_priv))
+			vlan->fwd_priv = NULL;
+		else
+			return 0;
+	}
+
 	err = -EBUSY;
 	if (macvlan_addr_busy(vlan->port, dev->dev_addr))
 		goto out;
@@ -367,6 +389,10 @@ hash_add:
 del_unicast:
 	dev_uc_del(lowerdev, dev->dev_addr);
 out:
+	if (vlan->fwd_priv) {
+		fwd_accel_del_station(lowerdev, vlan->fwd_priv);
+		vlan->fwd_priv = NULL;
+	}
 	return err;
 }
 
@@ -391,6 +417,11 @@ static int macvlan_stop(struct net_device *dev)
 
 hash_del:
 	macvlan_hash_del(vlan, !dev->dismantle);
+	if (vlan->fwd_priv) {
+		fwd_accel_del_station(lowerdev, vlan->fwd_priv);
+		vlan->fwd_priv = NULL;
+	}
+
 	return 0;
 }
 
@@ -801,6 +832,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 		if (err < 0)
 			return err;
 	}
+
 	port = macvlan_port_get_rtnl(lowerdev);
 
 	/* Only 1 macvlan device can be created in passthru mode */
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index ddd33fd..c270285 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -61,6 +61,7 @@ struct macvlan_dev {
 	struct hlist_node	hlist;
 	struct macvlan_port	*port;
 	struct net_device	*lowerdev;
+	void			*fwd_priv;
 	struct macvlan_pcpu_stats __percpu *pcpu_stats;
 
 	DECLARE_BITMAP(mc_filter, MACVLAN_MC_FILTER_SZ);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3de49ac..ea18f07 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1100,6 +1100,27 @@ struct net_device_ops {
 };
 
 /*
+ * Flags to enumerate hardware acceleration support
+ */
+#define FA_FLG_STA_SUPPORT (1 << 1)
+
+#define fwd_accel_supports(dev, feature) ((dev)->fwd_ops->flags & (feature))
+#define fwd_accel_add_station(pdev, vdev) (pdev)->fwd_ops->fwd_accel_add_station(pdev, vdev)
+#define fwd_accel_del_station(pdev, priv) (pdev)->fwd_ops->fwd_accel_del_station(pdev, priv)
+
+struct forwarding_accel_ops {
+	unsigned int flags;
+
+	/*
+	 * fwd_accel_[add|del]_station must be set if
+	 * FA_FLG_STA_SUPPORT is set
+	 */
+	void*	(*fwd_accel_add_station)(struct net_device *pdev,
+					struct net_device *vdev);
+	void	(*fwd_accel_del_station)(struct net_device *pdev, void *priv);
+};
+
+/*
  *	The DEVICE structure.
  *	Actually, this whole structure is a big mistake.  It mixes I/O
  *	data with strictly "high-level" data, and it has to know about
@@ -1183,6 +1204,7 @@ struct net_device {
 	/* Management operations */
 	const struct net_device_ops *netdev_ops;
 	const struct ethtool_ops *ethtool_ops;
+	const struct forwarding_accel_ops *fwd_ops;
 
 	/* Hardware header description */
 	const struct header_ops *header_ops;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2ddb48d..0be9152 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -426,9 +426,12 @@ struct sk_buff {
 	char			cb[48] __aligned(8);
 
 	unsigned long		_skb_refdst;
-#ifdef CONFIG_XFRM
-	struct	sec_path	*sp;
-#endif
+
+	union {
+		struct	sec_path	*sp;
+		void 			*accel_priv;
+	};
+
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
diff --git a/net/core/dev.c b/net/core/dev.c
index 5c713f2..5f99382 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5992,6 +5992,7 @@ struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
 }
 
 static const struct ethtool_ops default_ethtool_ops;
+static const struct forwarding_accel_ops default_fwd_ops;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops)
@@ -6090,6 +6091,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev->group = INIT_NETDEV_GROUP;
 	if (!dev->ethtool_ops)
 		dev->ethtool_ops = &default_ethtool_ops;
+	if (!dev->fwd_ops)
+		dev->fwd_ops = &default_fwd_ops;
 	return dev;
 
 free_all:
-- 
1.8.3.1

^ permalink raw reply related

* [RFC PATCH 0/2 v2] net: alternate proposal for using macvlans with forwarding acceleration
From: Neil Horman @ 2013-10-04 20:10 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Andy Gospodarek, David Miller
In-Reply-To: <1380140209-24587-1-git-send-email-nhorman@tuxdriver.com>

Hey all-
     Here's the next, updated version of the vsi/macvlan integration that we've
been discussing.

Some change notes:

* Changes to the forwarding ops structure - Removed the priv_size field, and
added a flags field.  Removal of the priv_size field was accomplished by just
having the add method return a void * and using ERR_PTR and PTR_ERR checks,
which also allows us to allocate memory for the acceleration path in the driver,
which I like.  I'm still not super happy with how I'm using the flags (currently
only used to indicate support for feature sets), but at least we have the flags
now, and they can be exposed to user space via iproute2 or ethtool if need be.

* Changes to the Transmit path - Specifically I'm using dev_queue_xmit to send
frames now, which I like as it makes the macvlan subject to the lowerdevs qdisc
configuration.

* Changes to the acceleration fail path behavior - Now if we don't/can't use
acceleration, we just fall back to using the normal macvlan software switch
strategy

* General cleanups (some renaming that I'm not super sure of, but I thought
forwarding acceleration (fwd) would be a better prefix than l2 acceleration).
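
For readers less familiar with the ERR_PTR convention mentioned in the first
note, the following is a small user-space model of it, plus the "fall back to
software on any failure" behaviour from the third note.  All model_* names are
hypothetical, and the rule that only 1, 2 or 4 queues are accepted is just a
toy stand-in for a driver restriction; the real kernel helpers live in
include/linux/err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno in the top 4095 values of the address space. */
static inline void *model_err_ptr(long error)
{
    assert(error < 0 && error >= -MODEL_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

static inline long model_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static inline int model_is_err_or_null(const void *ptr)
{
    return !ptr || model_is_err(ptr);
}

struct model_station {
    int queues;
};

/* Hypothetical driver "add" method: the driver allocates its own private
 * state and reports failures through the returned pointer. */
static void *model_add_station(int queues)
{
    struct model_station *st;

    if (queues != 1 && queues != 2 && queues != 4)
        return model_err_ptr(-EINVAL);

    st = calloc(1, sizeof(*st));
    if (!st)
        return model_err_ptr(-ENOMEM);

    st->queues = queues;
    return st;
}

/* Caller side: any error or NULL means "use the software path". */
static void *model_try_accel(int queues)
{
    void *priv = model_add_station(queues);

    if (model_is_err_or_null(priv))
        return NULL;
    return priv;
}
```

Because the error travels inside the pointer, the core needs no priv_size
field: the driver decides how much memory its private state takes.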

Still a long way to go I think, and lots of tweaking to do, but I didn't want to
keep you waiting John.  Anywho, take a look at what I'm doing and feel free to
rip it apart.

Thanks!
Neil

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] udp: Add udp early demux
From: Eric Dumazet @ 2013-10-04 20:02 UTC (permalink / raw)
  To: Shawn Bohrer; +Cc: David Miller, netdev, tomk, Shawn Bohrer
In-Reply-To: <1380914896-24754-3-git-send-email-shawn.bohrer@gmail.com>

On Fri, 2013-10-04 at 14:28 -0500, Shawn Bohrer wrote:

> +
> +/* For unicast we should only early demux connected sockets or we can
> + * break forwarding setups.  The chains here can be long so only check
> + * if the first socket is an exact match and if not move on.
> + */
> +static struct sock *__udp4_lib_demux_lookup(struct net *net,
> +					    __be16 loc_port, __be32 loc_addr,
> +					    __be16 rmt_port, __be32 rmt_addr,
> +					    int dif)
> +{
> +	struct sock *sk, *result;
> +	struct hlist_nulls_node *node;
> +	unsigned short hnum = ntohs(loc_port);
> +	unsigned int slot = udp_hashfn(net, hnum, udp_table.mask);
> +	struct udp_hslot *hslot = &udp_table.hash[slot];
> +	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr)
> +	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
> +
> +	rcu_read_lock();
> +	result = NULL;
> +	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
> +		if (INET_MATCH(sk, net, acookie,
> +			       rmt_addr, loc_addr, ports, dif))
> +			result = sk;
> +		/* Only check first socket in chain */
> +		break;
> +	}
> +
> +	if (result) {
> +		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
> +			result = NULL;

Here you must check again the keys (because of UDP sockets being
SLAB_DESTROY_BY_RCU, this socket might have been freed and reused
elsewhere)

	else
		if (unlikely(!INET_MATCH(result, net, acookie,
					 rmt_addr, loc_addr,
					 ports, dif))) {
			sock_put(result);
			result = NULL;
		}


> +	}
> +	rcu_read_unlock();
> +	return result;
> +}
> +
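
The recheck described above can be modelled in user space roughly as follows.
This is only a sketch under stated assumptions: C11 atomics stand in for the
kernel refcount helpers, a callback simulates the window in which the object
is freed and reused, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_sock {
    atomic_int refcnt;      /* 0 means the object is being freed */
    uint32_t rmt_addr;
    uint16_t rmt_port;
};

static bool model_match(const struct model_sock *sk,
                        uint32_t addr, uint16_t port)
{
    return sk->rmt_addr == addr && sk->rmt_port == port;
}

/* Take a reference unless the count has already dropped to zero. */
static bool model_ref_get_not_zero(struct model_sock *sk)
{
    int old = atomic_load(&sk->refcnt);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&sk->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

static void model_sock_put(struct model_sock *sk)
{
    int old = atomic_fetch_sub(&sk->refcnt, 1);

    assert(old > 0);
}

/* Test hook that runs between the first match and the refcount grab. */
typedef void (*model_race_hook)(struct model_sock *sk);

static struct model_sock *model_lookup(struct model_sock *sk,
                                       uint32_t addr, uint16_t port,
                                       model_race_hook hook)
{
    if (!model_match(sk, addr, port))
        return NULL;

    if (hook)
        hook(sk);   /* the memory may be recycled for another flow here */

    if (!model_ref_get_not_zero(sk))
        return NULL;

    /* The keys may have changed before we pinned the object: recheck. */
    if (!model_match(sk, addr, port)) {
        model_sock_put(sk);
        return NULL;
    }
    return sk;
}

/* Simulates the same memory now describing a different connection. */
static void model_reuse_for_other_flow(struct model_sock *sk)
{
    sk->rmt_addr = 0x0a000063;
    sk->rmt_port = 9999;
}
```

Without the second model_match() call, the lookup would hand back a pinned
object that now belongs to a different flow.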

^ permalink raw reply

* [PATCH net-next v2 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Shawn Bohrer @ 2013-10-04 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1380914896-24754-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

Since the removal of the routing cache, fib_compute_spec_dst() does a
fib_table lookup for each UDP multicast packet received.  This has
introduced a performance regression for some UDP workloads.

This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.
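
The effect can be pictured with a small user-space model that counts how often
the expensive lookup runs.  This is a sketch only: the pkt_* names and
PKT_CMSG_PKTINFO are hypothetical stand-ins for the kernel structures and
IP_CMSG_PKTINFO, and zeroing the fields in the else branch is an assumption
about the part of the function the diff does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKT_CMSG_PKTINFO 0x1u

struct pkt_info {
    int ifindex;
    uint32_t spec_dst;
};

struct pkt_sock {
    unsigned int cmsg_flags;
};

struct pkt_skb {
    bool has_route;
    int iif;
    uint32_t daddr;
    struct pkt_info cb;
};

/* Counts how often the costly route lookup would have run. */
static unsigned long pkt_fib_lookups;

static uint32_t pkt_compute_spec_dst(const struct pkt_skb *skb)
{
    assert(skb->has_route);
    pkt_fib_lookups++;
    return skb->daddr;
}

static void pkt_info_prepare(const struct pkt_sock *sk, struct pkt_skb *skb)
{
    /* Only pay for the lookup if the socket asked for packet info. */
    if ((sk->cmsg_flags & PKT_CMSG_PKTINFO) && skb->has_route) {
        skb->cb.ifindex = skb->iif;
        skb->cb.spec_dst = pkt_compute_spec_dst(skb);
    } else {
        skb->cb.ifindex = 0;
        skb->cb.spec_dst = 0;
    }
}
```

Sockets that never set the option skip the lookup entirely, which is where
the saving in the numbers below would come from.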

Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After  90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After  12.48us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
v2 changes:

* ipv4_pktinfo_prepare() now takes a const struct sock*

 include/net/ip.h       |    2 +-
 net/ipv4/ip_sockglue.c |    5 +++--
 net/ipv4/raw.c         |    2 +-
 net/ipv4/udp.c         |    2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 16078f4..b39ebe5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -459,7 +459,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
  *	Functions provided by ip_sockglue.c
  */
 
-void ipv4_pktinfo_prepare(struct sk_buff *skb);
+void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 56e3445..0626f2c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1052,11 +1052,12 @@ e_inval:
  * destination in skb->cb[] before dst drop.
  * This way, receiver doesnt make cache line misses to read rtable.
  */
-void ipv4_pktinfo_prepare(struct sk_buff *skb)
+void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb)
 {
 	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
 
-	if (skb_rtable(skb)) {
+	if ((inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
+	    skb_rtable(skb)) {
 		pktinfo->ipi_ifindex = inet_iif(skb);
 		pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
 	} else {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index b2fa14c..41e1d28 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -299,7 +299,7 @@ static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	/* Charge it to the socket. */
 
-	ipv4_pktinfo_prepare(skb);
+	ipv4_pktinfo_prepare(sk, skb);
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a3e575f..79017ff 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1544,7 +1544,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	rc = 0;
 
-	ipv4_pktinfo_prepare(skb);
+	ipv4_pktinfo_prepare(sk, skb);
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk))
 		rc = __udp_queue_rcv_skb(sk, skb);
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next v2 2/3] udp: Add udp early demux
From: Shawn Bohrer @ 2013-10-04 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1380914896-24754-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host.  Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups.  Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.
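
The two lookup policies described above can be sketched in user space like
this.  It is a simplified model, not the patch itself: a plain linked list
replaces the RCU nulls hash chain, matching is reduced to a port and a remote
address, and all chain_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chain_sock {
    uint16_t local_port;
    uint32_t remote_addr;   /* 0 = unconnected, accepts any sender */
    struct chain_sock *next;
};

/* Multicast-style match: unconnected sockets act as wildcards. */
static int chain_mcast_match(const struct chain_sock *sk,
                             uint16_t port, uint32_t from)
{
    return sk->local_port == port &&
           (sk->remote_addr == 0 || sk->remote_addr == from);
}

/* Unicast-style match: only a connected socket for this sender. */
static int chain_exact_match(const struct chain_sock *sk,
                             uint16_t port, uint32_t from)
{
    return sk->local_port == port && sk->remote_addr == from;
}

/* Multicast: walk the whole chain, succeed only on exactly one match. */
static struct chain_sock *chain_mcast_demux(struct chain_sock *head,
                                            uint16_t port, uint32_t from)
{
    struct chain_sock *sk, *result = NULL;
    unsigned int count = 0;

    for (sk = head; sk; sk = sk->next) {
        if (chain_mcast_match(sk, port, from)) {
            result = sk;
            count++;
        }
    }
    assert((count == 0) == (result == NULL));
    return count == 1 ? result : NULL;
}

/* Unicast: look at the first socket only, never search the chain. */
static struct chain_sock *chain_ucast_demux(struct chain_sock *head,
                                            uint16_t port, uint32_t from)
{
    if (head && chain_exact_match(head, port, from))
        return head;
    return NULL;
}
```

A miss in either function is not an error; the packet simply takes the
regular, slower lookup later in the receive path.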

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After  89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After  12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
v2 Changes:

* Unicast UDP early demux now requires an exact socket match and only
tests first socket in UDP hash chain.

 include/net/sock.h |    2 +-
 include/net/udp.h  |    1 +
 net/ipv4/af_inet.c |    1 +
 net/ipv4/udp.c     |  188 +++++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 173 insertions(+), 19 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index e3bf213..7953254 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -218,7 +218,7 @@ struct cg_proto;
   *	@sk_lock:	synchronizer
   *	@sk_rcvbuf: size of receive buffer in bytes
   *	@sk_wq: sock wait queue and async head
-  *	@sk_rx_dst: receive input route used by early tcp demux
+  *	@sk_rx_dst: receive input route used by early demux
   *	@sk_dst_cache: destination cache
   *	@sk_dst_lock: destination cache lock
   *	@sk_policy: flow policy
diff --git a/include/net/udp.h b/include/net/udp.h
index 510b8cb..fe4ba9f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,6 +175,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		     unsigned int hash2_nulladdr);
 
 /* net/ipv4/udp.c */
+void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
 				  const struct sock *));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cfeb85c..35913fb 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1546,6 +1546,7 @@ static const struct net_protocol tcp_protocol = {
 };
 
 static const struct net_protocol udp_protocol = {
+	.early_demux =	udp_v4_early_demux,
 	.handler =	udp_rcv,
 	.err_handler =	udp_err,
 	.no_policy =	1,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5950e12..a3e575f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -103,6 +103,7 @@
 #include <linux/seq_file.h>
 #include <net/net_namespace.h>
 #include <net/icmp.h>
+#include <net/inet_hashtables.h>
 #include <net/route.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
@@ -565,6 +566,26 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup);
 
+static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
+				       __be16 loc_port, __be32 loc_addr,
+				       __be16 rmt_port, __be32 rmt_addr,
+				       int dif, unsigned short hnum)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	if (!net_eq(sock_net(sk), net) ||
+	    udp_sk(sk)->udp_port_hash != hnum ||
+	    (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
+	    (inet->inet_dport != rmt_port && inet->inet_dport) ||
+	    (inet->inet_rcv_saddr && inet->inet_rcv_saddr != loc_addr) ||
+	    ipv6_only_sock(sk) ||
+	    (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		return false;
+	if (!ip_mc_sf_allow(sk, loc_addr, rmt_addr, dif))
+		return false;
+	return true;
+}
+
 static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
 					     __be16 loc_port, __be32 loc_addr,
 					     __be16 rmt_port, __be32 rmt_addr,
@@ -575,20 +596,11 @@ static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
 	unsigned short hnum = ntohs(loc_port);
 
 	sk_nulls_for_each_from(s, node) {
-		struct inet_sock *inet = inet_sk(s);
-
-		if (!net_eq(sock_net(s), net) ||
-		    udp_sk(s)->udp_port_hash != hnum ||
-		    (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
-		    (inet->inet_dport != rmt_port && inet->inet_dport) ||
-		    (inet->inet_rcv_saddr &&
-		     inet->inet_rcv_saddr != loc_addr) ||
-		    ipv6_only_sock(s) ||
-		    (s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
-			continue;
-		if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
-			continue;
-		goto found;
+		if (__udp_is_mcast_sock(net, s,
+					loc_port, loc_addr,
+					rmt_port, rmt_addr,
+					dif, hnum))
+			goto found;
 	}
 	s = NULL;
 found:
@@ -1581,6 +1593,14 @@ static void flush_stack(struct sock **stack, unsigned int count,
 		kfree_skb(skb1);
 }
 
+static void udp_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+
+	dst_hold(dst);
+	sk->sk_rx_dst = dst;
+}
+
 /*
  *	Multicasts and broadcasts go to each listener.
  *
@@ -1709,11 +1729,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
-	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
-		return __udp4_lib_mcast_deliver(net, skb, uh,
-				saddr, daddr, udptable);
+	if (skb->sk) {
+		int ret;
+		sk = skb->sk;
+
+		if (unlikely(sk->sk_rx_dst == NULL))
+			udp_sk_rx_dst_set(sk, skb);
+
+		ret = udp_queue_rcv_skb(sk, skb);
+
+		/* a return value > 0 means to resubmit the input, but
+		 * it wants the return to be -protocol, or 0
+		 */
+		if (ret > 0)
+			return -ret;
+		return 0;
+	} else {
+		if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
+			return __udp4_lib_mcast_deliver(net, skb, uh,
+					saddr, daddr, udptable);
 
-	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+		sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+	}
 
 	if (sk != NULL) {
 		int ret;
@@ -1771,6 +1808,121 @@ drop:
 	return 0;
 }
 
+/* We can only early demux multicast if there is a single matching socket.
+ * If more than one socket found returns NULL
+ */
+static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
+						  __be16 loc_port, __be32 loc_addr,
+						  __be16 rmt_port, __be32 rmt_addr,
+						  int dif)
+{
+	struct sock *sk, *result;
+	struct hlist_nulls_node *node;
+	unsigned short hnum = ntohs(loc_port);
+	unsigned int count, slot = udp_hashfn(net, hnum, udp_table.mask);
+	struct udp_hslot *hslot = &udp_table.hash[slot];
+
+	rcu_read_lock();
+begin:
+	count = 0;
+	result = NULL;
+	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
+		if (__udp_is_mcast_sock(net, sk,
+					loc_port, loc_addr,
+					rmt_port, rmt_addr,
+					dif, hnum)) {
+			result = sk;
+			++count;
+		}
+	}
+	/*
+	 * if the nulls value we got at the end of this lookup is
+	 * not the expected one, we must restart lookup.
+	 * We probably met an item that was moved to another chain.
+	 */
+	if (get_nulls_value(node) != slot)
+		goto begin;
+
+	if (result) {
+		if (count != 1 ||
+		    unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
+			result = NULL;
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+/* For unicast we should only early demux connected sockets or we can
+ * break forwarding setups.  The chains here can be long so only check
+ * if the first socket is an exact match and if not move on.
+ */
+static struct sock *__udp4_lib_demux_lookup(struct net *net,
+					    __be16 loc_port, __be32 loc_addr,
+					    __be16 rmt_port, __be32 rmt_addr,
+					    int dif)
+{
+	struct sock *sk, *result;
+	struct hlist_nulls_node *node;
+	unsigned short hnum = ntohs(loc_port);
+	unsigned int slot = udp_hashfn(net, hnum, udp_table.mask);
+	struct udp_hslot *hslot = &udp_table.hash[slot];
+	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr)
+	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
+
+	rcu_read_lock();
+	result = NULL;
+	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
+		if (INET_MATCH(sk, net, acookie,
+			       rmt_addr, loc_addr, ports, dif))
+			result = sk;
+		/* Only check first socket in chain */
+		break;
+	}
+
+	if (result) {
+		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
+			result = NULL;
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+void udp_v4_early_demux(struct sk_buff *skb)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	const struct udphdr *uh = udp_hdr(skb);
+	struct sock *sk;
+	struct dst_entry *dst;
+	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+
+	/* validate the packet */
+	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
+		return;
+
+	if (skb->pkt_type == PACKET_BROADCAST ||
+	    skb->pkt_type == PACKET_MULTICAST)
+		sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
+						   uh->source, iph->saddr, dif);
+	else if (skb->pkt_type == PACKET_HOST)
+		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
+					     uh->source, iph->saddr, dif);
+	else
+		return;
+
+	if (!sk)
+		return;
+
+	skb->sk = sk;
+	skb->destructor = sock_edemux;
+	dst = sk->sk_rx_dst;
+
+	if (dst)
+		dst = dst_check(dst, 0);
+	if (dst)
+		skb_dst_set_noref(skb, dst);
+}
+
 int udp_rcv(struct sk_buff *skb)
 {
 	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
-- 
1.7.7.6
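The two lookup policies in the patch above (multicast: early demux only when exactly one socket matches; unicast: test only the first socket in the hash chain) can be illustrated with a simplified userspace model. This is only a sketch: `struct toy_sock`, `toy_mcast_demux()` and `toy_ucast_demux()` are hypothetical names, a plain array stands in for the RCU-protected hash chain, and the device, refcount and source-filter checks done by the kernel code are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bound UDP socket; not the kernel's struct sock. */
struct toy_sock {
	uint16_t loc_port;
	uint32_t loc_addr;	/* 0 means "any local address" */
	uint16_t rmt_port;	/* 0 means "not connected" */
	uint32_t rmt_addr;	/* 0 means "not connected" */
};

/* Wildcard-aware match, loosely modelled on the multicast socket check. */
static int toy_mcast_match(const struct toy_sock *sk,
			   uint16_t loc_port, uint32_t loc_addr,
			   uint16_t rmt_port, uint32_t rmt_addr)
{
	if (sk->loc_port != loc_port)
		return 0;
	if (sk->loc_addr && sk->loc_addr != loc_addr)
		return 0;
	if (sk->rmt_port && sk->rmt_port != rmt_port)
		return 0;
	if (sk->rmt_addr && sk->rmt_addr != rmt_addr)
		return 0;
	return 1;
}

/* Multicast policy: walk the whole chain, return the match only if unique. */
static const struct toy_sock *toy_mcast_demux(const struct toy_sock *chain,
					      size_t n,
					      uint16_t loc_port, uint32_t loc_addr,
					      uint16_t rmt_port, uint32_t rmt_addr)
{
	const struct toy_sock *result = NULL;
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (toy_mcast_match(&chain[i], loc_port, loc_addr,
				    rmt_port, rmt_addr)) {
			result = &chain[i];
			count++;
		}
	}
	return count == 1 ? result : NULL;
}

/* Unicast policy: exact 4-tuple match against the first chain entry only. */
static const struct toy_sock *toy_ucast_demux(const struct toy_sock *chain,
					      size_t n,
					      uint16_t loc_port, uint32_t loc_addr,
					      uint16_t rmt_port, uint32_t rmt_addr)
{
	if (n == 0)
		return NULL;
	if (chain[0].loc_port == loc_port && chain[0].loc_addr == loc_addr &&
	    chain[0].rmt_port == rmt_port && chain[0].rmt_addr == rmt_addr)
		return &chain[0];
	return NULL;
}
```

In this model a second listener on the same multicast port makes `toy_mcast_demux()` return NULL, and a connected socket that is not first in its chain is not found by `toy_ucast_demux()`. In both cases the real code would simply skip early demux and leave the packet to the regular receive-path lookup.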


* [PATCH net-next v2 1/3] udp: Only allow busy read/poll on connected sockets
From: Shawn Bohrer @ 2013-10-04 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tomk, Eric Dumazet, Shawn Bohrer
In-Reply-To: <1380914896-24754-1-git-send-email-shawn.bohrer@gmail.com>

From: Shawn Bohrer <sbohrer@rgmadvisors.com>

UDP sockets can receive packets from multiple endpoints, so those
packets may arrive on multiple receive queues.  Since packets can
arrive on multiple receive queues, we should not mark the napi_id for
all packets.  This makes busy read/poll work only for connected UDP
sockets.

This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c |    5 +++--
 net/ipv6/udp.c |    5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c41833e..5950e12 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1405,8 +1405,10 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
 
-	if (inet_sk(sk)->inet_daddr)
+	if (inet_sk(sk)->inet_daddr) {
 		sock_rps_save_rxhash(sk, skb);
+		sk_mark_napi_id(sk, skb);
+	}
 
 	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
@@ -1716,7 +1718,6 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk != NULL) {
 		int ret;
 
-		sk_mark_napi_id(sk, skb);
 		ret = udp_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8119791..3753247 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -549,8 +549,10 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int rc;
 
-	if (!ipv6_addr_any(&inet6_sk(sk)->daddr))
+	if (!ipv6_addr_any(&inet6_sk(sk)->daddr)) {
 		sock_rps_save_rxhash(sk, skb);
+		sk_mark_napi_id(sk, skb);
+	}
 
 	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
@@ -844,7 +846,6 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk != NULL) {
 		int ret;
 
-		sk_mark_napi_id(sk, skb);
 		ret = udpv6_queue_rcv_skb(sk, skb);
 		sock_put(sk);
 
-- 
1.7.7.6
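For readers who want to try this from userspace, the following is a minimal sketch of a socket that would qualify under the patch above: connect() on a UDP socket sets the destination address that the inet_daddr check tests. The helper name `open_connected_udp()` and its parameters are hypothetical. SO_BUSY_POLL is the per-socket busy-poll option, and raising it may require CAP_NET_ADMIN, so the sketch treats a failure to set it as non-fatal.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* asm-generic value; assumption for older headers */
#endif

/*
 * Hypothetical helper: returns a connected UDP fd, or -1 on error.
 * *busy_poll_set is set to 1 if SO_BUSY_POLL was accepted, else 0.
 */
static int open_connected_udp(const char *peer_ip, uint16_t peer_port,
			      int busy_usecs, int *busy_poll_set)
{
	struct sockaddr_in peer;
	int fd;

	assert(peer_ip != NULL && busy_poll_set != NULL);

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(peer_port);
	if (inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/* For UDP, connect() only records the peer; no packet is sent. */
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		close(fd);
		return -1;
	}

	/* May fail without CAP_NET_ADMIN; the socket stays usable either way. */
	*busy_poll_set = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
				    &busy_usecs, sizeof(busy_usecs)) == 0;
	return fd;
}
```

A caller would then use recv() or poll() on the returned descriptor as usual. Whether busy polling actually takes effect also depends on driver support and on the kernel being built with the feature, which this sketch does not check.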

