Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] gianfar: Wait for both RX and TX to stop
From: Andy Fleming @ 2010-04-18 23:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

When gracefully stopping the controller, the driver was continuing if
*either* RX or TX had stopped.  We need to wait for both, or the
controller could get into an invalid state.

Signed-off-by: Andy Fleming <afleming@freescale.com>
---
 drivers/net/gianfar.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 032073d..6038397 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -1571,8 +1571,9 @@ static void gfar_halt_nodisable(struct net_device *dev)
 		tempval |= (DMACTRL_GRS | DMACTRL_GTS);
 		gfar_write(&regs->dmactrl, tempval);
 
-		while (!(gfar_read(&regs->ievent) &
-			 (IEVENT_GRSC | IEVENT_GTSC)))
+		while ((gfar_read(&regs->ievent) &
+			 (IEVENT_GRSC | IEVENT_GTSC)) !=
+			 (IEVENT_GRSC | IEVENT_GTSC))
 			cpu_relax();
 	}
 }
-- 
1.6.5.2.g6ff9a


^ permalink raw reply related

* Re: [PATCH 1/6] X25: Use identifiers for X25 to device interface
From: andrew hendry @ 2010-04-18 23:51 UTC (permalink / raw)
  To: John Hughes; +Cc: netdev
In-Reply-To: <4BCAE84A.2030701@Calva.COM>

Hi John,

Thanks, missed that, ill see if any of the driver areas have comments
and make a V2 patch with explicit values.

Your right, it would be nice if xotd/xoe could pickup the same
definition from user space.
include/net/x25device.h is not intended as a user space header.
include/linux/net/x25.h is the user space header for the x25 socket
layer interface, so I don't think these definitions should go there.

I think the right way to do it is add a new userspace header
include/linux/if_x25.h then include that from net/x25_device.h?

Regards,
Andrew.

On Sun, Apr 18, 2010 at 9:08 PM, John Hughes <john@calva.com> wrote:
> John Hughes wrote:
>>
>> Shouldn't you use explicit values here?
>>
>> enum {
>>    X25_IFACE_DATA = 0x00, /* explicit value for ABI stability */
>>    ...
>
> Oh, and is net/x25device.h suitable for inclusion from user space (xotd for
> example)?
>
>

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-19  2:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271590476.16881.4925.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]

Thanks Eric. I tried to visualize your results - attached.
There are 2-3 odd numbers (labelled with *) but other
than that results are as expected...

I did run some experiments with some udp sink server
and i saw the IPIs amortized; unfortunately sky2 h/ware 
proved to be bottleneck (at > 750Kpps incoming, it started 
dropping and wasnt recording the drops, so i had to slow things down). I
need to digest my results a little more - but it seems i was getting
better throughput results with RPS (i.e it was able to sink
more packets)..

cheers,
jamal

[-- Attachment #2: erichw.pdf --]
[-- Type: application/pdf, Size: 187023 bytes --]

^ permalink raw reply

* [PATCH BUG-FIX] ipv6: allow to send packet after receiving ICMPv6 Too Big message with MTU field less than  IPV6_MIN_MTU
From: Shan Wei @ 2010-04-19  2:58 UTC (permalink / raw)
  To: David Miller, yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	魏勇军, vladislav.yasevich
  Cc: kuznet, pekkas, jmorris, Patrick McHardy, eric.dumazet, Shan Wei,
	sri, Herbert Xu, netdev@vger.kernel.org, linux-sctp

According to RFC2460, PMTU is set to the IPv6 Minimum Link
MTU (1280) and a fragment header should always be included
after a node receiving Too Big message reporting PMTU is
less than the IPv6 Minimum Link MTU.

After receiving a ICMPv6 Too Big message reporting PMTU is
less than the IPv6 Minimum Link MTU, sctp *can't* send any
data/control chunk that total length including IPv6 head 
and IPv6 extend head is less than IPV6_MIN_MTU(1280 bytes).

The failure occured in p6_fragment(), about reason 
see following(take SHUTDOWN chunk for example):
sctp_packet_transmit (SHUTDOWN chunk, len=16 byte)
|------sctp_v6_xmit (local_df=0)
   |------ip6_xmit
       |------ip6_output (dst_allfrag is ture)
           |------ip6_fragment

In ip6_fragment(), for local_df=0, drops the the packet
and returns EMSGSIZE.

The patch fixes it with adding check length of skb->len.
In this case, Ipv6 not to fragment upper protocol data,
just only add a fragment header before it. 

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/ipv6/ip6_output.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5129a16..c3edb6c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -625,7 +625,7 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	/* We must not fragment if the socket is set to force MTU discovery
 	 * or if the skb it not generated by a local socket.
 	 */
-	if (!skb->local_df) {
+	if (!skb->local_df && skb->len > mtu) {
 		skb->dev = skb_dst(skb)->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
 		IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
--
1.6.3.3 

^ permalink raw reply related

* Re: [PATCH BUG-FIX] ipv6: allow to send packet after receiving ICMPv6 Too Big message with MTU field less than  IPV6_MIN_MTU
From: Herbert Xu @ 2010-04-19  3:55 UTC (permalink / raw)
  To: Shan Wei
  Cc: David Miller, yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	魏勇军, vladislav.yasevich, kuznet, pekkas,
	jmorris, Patrick McHardy, eric.dumazet, sri,
	netdev@vger.kernel.org, linux-sctp
In-Reply-To: <4BCBC6CE.9020302@cn.fujitsu.com>

On Mon, Apr 19, 2010 at 10:58:22AM +0800, Shan Wei wrote:
> 
> According to RFC2460, PMTU is set to the IPv6 Minimum Link
> MTU (1280) and a fragment header should always be included
> after a node receiving Too Big message reporting PMTU is
> less than the IPv6 Minimum Link MTU.
> 
> After receiving a ICMPv6 Too Big message reporting PMTU is
> less than the IPv6 Minimum Link MTU, sctp *can't* send any
> data/control chunk that total length including IPv6 head 
> and IPv6 extend head is less than IPV6_MIN_MTU(1280 bytes).
> 
> The failure occured in p6_fragment(), about reason 
> see following(take SHUTDOWN chunk for example):
> sctp_packet_transmit (SHUTDOWN chunk, len=16 byte)
> |------sctp_v6_xmit (local_df=0)
>    |------ip6_xmit
>        |------ip6_output (dst_allfrag is ture)
>            |------ip6_fragment
> 
> In ip6_fragment(), for local_df=0, drops the the packet
> and returns EMSGSIZE.
> 
> The patch fixes it with adding check length of skb->len.
> In this case, Ipv6 not to fragment upper protocol data,
> just only add a fragment header before it. 
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

The patch looks good to me.

If we wanted to optimise the allfrags case it may be better
to reserve the space beforehand and generate the fragment header
at the same time as we're doing the IPv6 header.

But it can't be all that important as it's been broken for so
many years.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 11/13] bnx2x: Forbid LRO on when XEN is enabled
From: David Miller @ 2010-04-19  3:58 UTC (permalink / raw)
  To: bhutchings; +Cc: vladz, eilong, netdev, dmitry
In-Reply-To: <1271603485.3679.210.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Sun, 18 Apr 2010 16:11:25 +0100

> On Sun, 2010-04-18 at 17:50 +0300, Vladislav Zolotarov wrote:
>> LRO cannot be used on virtualized environment, so it is best
>> to disable it on compile time in XEN kernel.
> [...]
> 
> If you want to cripple your performance, I'm not going to stand in your
> way, but be aware that most distribution kernels now enable CONFIG_XEN.

There is also zero reason to do this.

When the XEN code or whatever setups up the bridge, LRO will be
disabled by the bridge code.

It automatically issues ethtool LRO disable commands to the devices
involved in the bridge configuration.  And it has done this for
a _long_ time.  The ipv4 routing code will do the same.

This change is bogus and likely based upon behavior in some ancient
distribution kernel that lacks the ethtool facilities.

I absolutely refuse to apply this.

^ permalink raw reply

* Re: [PATCH 2/13] bnx2x: Use VPD-R V0 entry to display firmware revision
From: David Miller @ 2010-04-19  4:00 UTC (permalink / raw)
  To: bhutchings; +Cc: vladz, netdev, eilong, dmitry, mcarlson
In-Reply-To: <1271603308.3679.208.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Sun, 18 Apr 2010 16:08:28 +0100

> On Sun, 2010-04-18 at 17:48 +0300, Vladislav Zolotarov wrote:
>> Use VPD-R V0 entry to display firmware revision
> [...]
> 
> Matt Carlson already added VPD support functions and definitions to the
> PCI core; you should use them.

Agreed.

Really disappointed that you guy's can't keep track of generic
infrastructure created by co-workers let alone other people
in the networking developer community.

Judging by this and the LRO disabling change, I can only
come to the conclusion that you do development in something
similar to a bubble and are impervious to what's going on
outside of your domain :-)

^ permalink raw reply

* Re: [PATCH 0/13] bnx2x: Buf fixes and enhancements
From: David Miller @ 2010-04-19  4:07 UTC (permalink / raw)
  To: vladz; +Cc: netdev, eilong, dmitry, yanivr
In-Reply-To: <1271602046.27235.165.camel@lb-tlvb-vladz>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 18 Apr 2010 17:47:26 +0300

> Dave, hi.
> Pls., apply the following series of patches.
> It includes a few bug fixes and enhancements.

Several of these patches are inappropriate or need corrections, and
you also have not specified where you want these patches applied
(net-2.6 or net-next-2.6)

You also misspelled "bug" in the subject line here, which to me is yet
another sign that you put very little care into this patch series.  It
feels very "rushed" and "reactionary" rather than a well thought out
set of changes.

You really need to fix all of these issues, do the necessary
research, take your time, and resubmit a higher quality set
of well documented changes.

Thanks.

^ permalink raw reply

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: David Miller @ 2010-04-19  4:10 UTC (permalink / raw)
  To: vladz; +Cc: eilong, netdev, dmitry
In-Reply-To: <1271602224.27235.199.camel@lb-tlvb-vladz>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 18 Apr 2010 17:50:24 +0300

> The default was changed to save memory on 32bits systems.
> 
> Author: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

This is absolutely rediculious.

There is no reason in the world to restrict multiqueue to only 64-bit
systems.

If there is a memory allocation issue, fix that!  Is it vmalloc() space
usage?

You don't even describe what the hell the problem is, so nobody can
figure out _WHY_ you're making this change.  That's so incredibly
frustrating because it makes it impossible for anyone to sanely
evaluate the patch.

The more I read of this patch series, the more I am incredibly
disappointed with the poor quality of these changes.

This is the worst patch series submitted by Broadcom engieers,
ever!

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  4:18 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <1271611557.24099.11.camel@debian>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Sun, 18 Apr 2010 19:25:57 +0200

> This patch adds a module parameter for setting the MAC address.
> 
> To ensure this MAC address is used, the MAC address is written
> after each hardware reset.
> 
> Signed-off-by: Richard Röjfors <richard.rojfors@pelagicore.com>

Your driver isn't unique, or special, and it doesn't need
to do special things to handle this situation.

Either probe the information from somewhere, or use a random
MAC address.

I'm not applying this.

^ permalink raw reply

* [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-19  4:57 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a callback function pointer to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
callback for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |   27 +++++++++++++++++++++++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..f2a7d0b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -78,6 +78,8 @@ enum {
 };
 
 typedef irqreturn_t (*irq_handler_t)(int, void *);
+typedef unsigned int (*irq_affinity_hint_t)(cpumask_var_t, unsigned int,
+                                            void *);
 
 /**
  * struct irqaction - per interrupt action descriptor
@@ -105,6 +107,18 @@ struct irqaction {
 	unsigned long thread_flags;
 };
 
+/**
+ * struct irqaffinityhint - per interrupt affinity helper
+ * @callback:	device driver callback function
+ * @dev:	reference for the affected device
+ * @irq:	interrupt number
+ */
+struct irqaffinityhint {
+	irq_affinity_hint_t callback;
+	void *dev;
+	int irq;
+};
+
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
@@ -209,6 +223,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                      irq_affinity_hint_t callback);
+extern int irq_unregister_affinity_hint(unsigned int irq);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +240,16 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                             irq_affinity_hint_t callback)
+{
+	return -EINVAL;
+}
+
+static inline int irq_unregister_affinity_hint(unsigned int irq);
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..bd73e9b 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	struct irqaffinityhint	*hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..3674b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,42 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, void *dev,
+                               irq_affinity_hint_t callback)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	if (!desc->hint)
+		desc->hint = kmalloc(sizeof(struct irqaffinityhint),
+		                     GFP_KERNEL);
+	desc->hint->callback = callback;
+	desc->hint->dev = dev;
+	desc->hint->irq = irq;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_register_affinity_hint);
+
+int irq_unregister_affinity_hint(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	kfree(desc->hint);
+	desc->hint = NULL;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_unregister_affinity_hint);
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +952,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint callback is cleaned up */
+	kfree(desc->hint);
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..59110a3 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,23 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	struct cpumask mask;
+	unsigned int ret = 0;
+
+	if (desc->hint && desc->hint->callback) {
+		ret = desc->hint->callback(&mask, (long)m->private,
+		                           desc->hint->dev);
+		if (!ret)
+			seq_cpumask(m, &mask);
+	}
+
+	seq_putc(m, '\n');
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -79,11 +96,23 @@ free_cpumask:
 	return err;
 }
 
+static ssize_t irq_affinity_hint_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *pos)
+{
+	/* affinity_hint is read-only from proc */
+	return -EOPNOTSUPP;
+}
+
 static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 {
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +121,14 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= irq_affinity_hint_proc_write,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +268,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,


^ permalink raw reply related

* [PATCH RFC: linux-next 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-19  4:58 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100419045741.30276.23233.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   51 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..3e00d41 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1031,6 +1031,36 @@ next_desc:
 	return cleaned;
 }
 
+static unsigned int ixgbe_irq_affinity_callback(cpumask_var_t mask,
+                                                unsigned int irq, void *dev)
+{
+	struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)dev;
+	int i, q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
+
+	if (test_bit(__IXGBE_DOWN, &adapter->state))
+		return -EINVAL;
+
+	/*
+	 * Loop through the msix_entries array, looking for the vector that
+	 * matches the irq passed to us.  Once we find it, use that index to
+	 * grab the corresponding q_vector (1 to 1 mapping), and grab the
+	 * cpumask from that q_vector.
+	 */
+	for (i = 0; i < q_vectors; i++)
+		if (adapter->msix_entries[i].vector == irq)
+			break;
+
+	if (i == q_vectors)
+		return -EINVAL;
+
+	if (adapter->q_vector[i]->affinity_mask)
+		cpumask_copy(mask, adapter->q_vector[i]->affinity_mask);
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ixgbe_clean_rxonly(struct napi_struct *, int);
 /**
  * ixgbe_configure_msix - Configure MSI-X hardware
@@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint callback.
+		 */
+		alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           adapter,
+		                           &ixgbe_irq_affinity_callback);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3258,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3291,14 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* unregister the affinity_hint callback */
+		irq_unregister_affinity_hint(adapter->msix_entries[i].vector);
+		/* release the CPU mask memory */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);
@@ -4052,6 +4100,7 @@ static void ixgbe_free_q_vectors(struct ixgbe_adapter *adapter)
 		struct ixgbe_q_vector *q_vector = adapter->q_vector[q_idx];
 		adapter->q_vector[q_idx] = NULL;
 		netif_napi_del(&q_vector->napi);
+		free_cpumask_var(q_vector->affinity_mask);
 		kfree(q_vector);
 	}
 }

^ permalink raw reply related

* [net-2.6 PATCH] ixgbe: add IntMode module parameter
From: Jeff Kirsher @ 2010-04-19  5:03 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Nicholas Nunley, John Ronciak, Jeff Kirsher

From: Nick Nunley <nicholasx.d.nunley@intel.com>

The 82598 has an erratum where some system designs can increase the
likelihood of a system hang due to a PCIe resource contention between the
82598 and another device in legacy INTx mode.

This issue can happen in any number of configurations with the 82598
and another device in legacy interrupt mode.  While we know of a
specific case where this is happening, this can happen with other
devices as well if configured in a certain way depending on the
system's slot configuration among other things.  This is alos why we
need to have the flexibility to disable MSI-X interrupts for the 82598
when these configurations are found having this deadlock situation.

The errata being worked - around with this patch - has to do with
any type of device that has a pending legacy INTx to the IOH's
internal IOAPIC which has priority from the shared resources
(round-robin to ensure forward progress / fairness).  The internal
IOAPIC is a PCI-E endpoint hence transactions to it are viewed as
peer-to-peer from a shared resource perspective.  To make forward
progress on the legacy INTx, the outbound completion FIFOs for the
IOU (resource pool1) are checked for available space due to shared
resources.  The IOH is allowed to have in/out dependency as a root
is full and not making forward progress because the 82598 has entered
its "erratum window"  where the 82598 is throttling the release of posted
credits back to the IOH as it has pending MSI-X interrupts to the IOH.
The 82598 will only release the posted credits back to IOH after the
inbound posted writes are serviced and the IOH releases the posted
credits back to the 82598 (i.e., the 82598 in/out dependency
erratum).  Since the priority is on the other device's legacy INTx,
the 82598 inbound posted transaction cannot make forward progress
that results in a link/forward progress deadlock condition.

The 82598 Specification Update has the errata list and can be found
here, it's erratum 38:
http://www.intel.com/design/network/specupdt/321040.pdf

This patch adds the IntMode module parameter to ixgbe to allow
user selection of interrupt mode (MSI-X, MSI, legacy) on driver
load.  To work around the errata described above, ixgbe needs to
be able to disable MSI-X for affected configurations.

Signed-off-by: Nicholas Nunley <nicholasx.d.nunley@intel.com>
Signed-off-by: John Ronciak <john.ronciak@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/Makefile      |    2 -
 drivers/net/ixgbe/ixgbe.h       |    1 
 drivers/net/ixgbe/ixgbe_main.c  |   26 ++++---
 drivers/net/ixgbe/ixgbe_param.c |  153 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 172 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ixgbe/ixgbe_param.c

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index 8f81efb..820491f 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
               ixgbe_82599.o ixgbe_82598.o ixgbe_phy.o ixgbe_sriov.o \
-              ixgbe_mbx.o
+              ixgbe_mbx.o ixgbe_param.o
 
 ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
                               ixgbe_dcb_82599.o ixgbe_dcb_nl.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..106f197 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -436,6 +436,7 @@ extern int ixgbe_copy_dcb_cfg(struct ixgbe_dcb_config *src_dcb_cfg,
 extern char ixgbe_driver_name[];
 extern const char ixgbe_driver_version[];
 
+extern void ixgbe_check_options(struct ixgbe_adapter *adapter);
 extern int ixgbe_up(struct ixgbe_adapter *adapter);
 extern void ixgbe_down(struct ixgbe_adapter *adapter);
 extern void ixgbe_reinit_locked(struct ixgbe_adapter *adapter);
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a98ff0e..d04e095 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3949,6 +3949,9 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	int err = 0;
 	int vector, v_budget;
 
+	if (!(adapter->flags & IXGBE_FLAG_MSIX_CAPABLE))
+		goto try_msi;
+
 	/*
 	 * It's easy to be greedy for MSI-X vectors, but it really
 	 * doesn't do us much good if we have a lot more vectors
@@ -3981,16 +3984,10 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 			goto out;
 	}
 
-	adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
-	adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
-	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
-	adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
-	adapter->atr_sample_rate = 0;
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
-		ixgbe_disable_sriov(adapter);
-
-	ixgbe_set_num_queues(adapter);
 
+try_msi:
+	if (!(adapter->flags & IXGBE_FLAG_MSI_CAPABLE))
+		goto legacy;
 	err = pci_enable_msi(adapter->pdev);
 	if (!err) {
 		adapter->flags |= IXGBE_FLAG_MSI_ENABLED;
@@ -4000,7 +3997,16 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 		/* reset err */
 		err = 0;
 	}
+legacy:
+	adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
+	adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
+	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
+	adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
+	adapter->atr_sample_rate = 0;
+	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+		ixgbe_disable_sriov(adapter);
 
+	ixgbe_set_num_queues(adapter);
 out:
 	return err;
 }
@@ -6199,6 +6205,8 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 		goto err_sw_init;
 	}
 
+	ixgbe_check_options(adapter);
+
 	ixgbe_probe_vf(adapter, ii);
 
 	netdev->features = NETIF_F_SG |
diff --git a/drivers/net/ixgbe/ixgbe_param.c b/drivers/net/ixgbe/ixgbe_param.c
new file mode 100644
index 0000000..98c5626
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_param.c
@@ -0,0 +1,153 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2010 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+#include "ixgbe.h"
+
+/* This is the only thing that needs to be changed to adjust the
+ * maximum number of ports that the driver can manage.
+ */
+
+#define IXGBE_MAX_NIC 32
+
+#define OPTION_UNSET    -1
+
+/* All parameters are treated the same, as an integer array of values.
+ * This macro just reduces the need to repeat the same declaration code
+ * over and over (plus this helps to avoid typo bugs).
+ */
+
+#define IXGBE_PARAM_INIT { [0 ... IXGBE_MAX_NIC] = OPTION_UNSET }
+#define IXGBE_PARAM(X, desc) \
+	static int __devinitdata X[IXGBE_MAX_NIC+1] = IXGBE_PARAM_INIT; \
+	static unsigned int num_##X; \
+	module_param_array_named(X, X, int, &num_##X, 0); \
+	MODULE_PARM_DESC(X, desc);
+
+/* IntMode (Interrupt Mode)
+ *
+ * Valid Range: 0-2
+ *  - 0 - Legacy Interrupt
+ *  - 1 - MSI Interrupt
+ *  - 2 - MSI-X Interrupt(s)
+ *
+ * Default Value: 2
+ */
+IXGBE_PARAM(IntMode, "Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2");
+#define IXGBE_INT_LEGACY		      0
+#define IXGBE_INT_MSI			      1
+#define IXGBE_INT_MSIX			      2
+#define IXGBE_DEFAULT_INT	 IXGBE_INT_MSIX
+
+struct ixgbe_option {
+	const char *name;
+	const char *err;
+	int def;
+	int min;
+	int max;
+};
+
+static int __devinit ixgbe_validate_option(unsigned int *value,
+					   struct ixgbe_option *opt,
+					   struct ixgbe_adapter *adapter)
+{
+	if (*value == OPTION_UNSET) {
+		*value = opt->def;
+		return 0;
+	}
+
+	if (*value >= opt->min && *value <= opt->max) {
+		dev_info(&adapter->pdev->dev,
+			 "%s set to %d\n", opt->name, *value);
+		return 0;
+	} else {
+		dev_info(&adapter->pdev->dev,
+			 "Invalid %s specified (%d),  %s\n",
+			 opt->name, *value, opt->err);
+		*value = opt->def;
+		return -1;
+	}
+}
+
+/**
+ * ixgbe_check_options - Range Checking for Command Line Parameters
+ * @adapter: board private structure
+ *
+ * This routine checks all command line parameters for valid user
+ * input.  If an invalid value is given, or if no user specified
+ * value exists, a default value is used.  The final value is stored
+ * in a variable in the adapter structure.
+ **/
+void __devinit ixgbe_check_options(struct ixgbe_adapter *adapter)
+{
+	int bd = adapter->bd_number;
+	u32 *aflags = &adapter->flags;
+
+	if (bd >= IXGBE_MAX_NIC) {
+		dev_notice(&adapter->pdev->dev,
+			   "Warning: no configuration for board #%d\n", bd);
+		dev_notice(&adapter->pdev->dev,
+			   "Using defaults for all values\n");
+	}
+
+	{ /* Interrupt Mode */
+		unsigned int int_mode;
+		static struct ixgbe_option opt = {
+			.name = "Interrupt Mode",
+			.err =
+			  "using default of "__MODULE_STRING(IXGBE_DEFAULT_INT),
+			.def = IXGBE_DEFAULT_INT,
+			.min = IXGBE_INT_LEGACY,
+			.max = IXGBE_INT_MSIX
+		};
+
+		if (num_IntMode > bd) {
+			int_mode = IntMode[bd];
+			ixgbe_validate_option(&int_mode, &opt, adapter);
+			switch (int_mode) {
+			case IXGBE_INT_MSIX:
+				*aflags |= IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			case IXGBE_INT_MSI:
+				*aflags &= ~IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			case IXGBE_INT_LEGACY:
+			default:
+				*aflags &= ~IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags &= ~IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			}
+		} else {
+			*aflags |= IXGBE_FLAG_MSIX_CAPABLE;
+			*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+		}
+	}
+}


^ permalink raw reply related

* Re: [net-2.6 PATCH] ixgbe: add IntMode module parameter
From: David Miller @ 2010-04-19  5:09 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, nicholasx.d.nunley, john.ronciak
In-Reply-To: <20100419050315.24758.62479.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Sun, 18 Apr 2010 22:03:59 -0700

> This patch adds the IntMode module parameter to ixgbe to allow
> user selection of interrupt mode (MSI-X, MSI, legacy) on driver
> load.  To work around the errata described above, ixgbe needs to
> be able to disable MSI-X for affected configurations.

This is not how we handle situations like this.

Here is one acceptable way to handle this:

Turn MSI/MSI-X off for every system that uses this PCI-E
complex chipset.  Add a whitelist that allows enabling.

Here is another:

Turn MSI/MSI-X on by default and have blacklists.

Anything that requires unusual user interactions, such as specifying
special module parameters, for correct operations is absolutely
unacceptable.

I don't want to hear how difficult it is to determine whether a
system will hit this bug or not.  If it's hard, just turn MSI
off unconditionally with this chipset until you can detect things
properly.

^ permalink raw reply

* Re: [PATCH 0/13] bnx2x: Buf fixes and enhancements
From: Eilon Greenstein @ 2010-04-19  6:06 UTC (permalink / raw)
  To: David Miller
  Cc: Vladislav Zolotarov, netdev@vger.kernel.org, Dmitry Kravkov,
	Yaniv Rosner
In-Reply-To: <20100418.210723.71102685.davem@davemloft.net>

On Sun, 2010-04-18 at 21:07 -0700, David Miller wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 18 Apr 2010 17:47:26 +0300
> 
> > Dave, hi.
> > Pls., apply the following series of patches.
> > It includes a few bug fixes and enhancements.
> 
> Several of these patches are inappropriate or need corrections, and
> you also have not specified where you want these patches applied
> (net-2.6 or net-next-2.6)
> 

We will re-submit the path series with the following fixes:
- Omitting patch 11 (disabling LRO and leaving only GRO on XEN kernel on
compile time)
- restructuring patch 2 (using generic infrastructures for VPD-R-V0
access)
- Better description - especially on email 0...
- I'm sending another email regrading the number of queues on 32 bits
kernel.


Thanks for all the inputs,
Eilon



^ permalink raw reply

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: Eilon Greenstein @ 2010-04-19  6:06 UTC (permalink / raw)
  To: David Miller; +Cc: Vladislav Zolotarov, netdev@vger.kernel.org, Dmitry Kravkov
In-Reply-To: <20100418.211023.104052199.davem@davemloft.net>

On Sun, 2010-04-18 at 21:10 -0700, David Miller wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 18 Apr 2010 17:50:24 +0300
> 
> > The default was changed to save memory on 32bits systems.
> > 
> > Author: Dmitry Kravkov <dmitry@broadcom.com>
> > Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> > Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> > Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
> 
> This is absolutely rediculious.
> 
> There is no reason in the world to restrict multiqueue to only 64-bit
> systems.
> 
> If there is a memory allocation issue, fix that!  Is it vmalloc() space
> usage?

Yes - running low on vmalloc is the reason. This patch does not restrict
the usage of multi queue on 32bits - it is just changing the default
queues values from the number of available CPUs to 1. If the user will
use another value, it will work as before.

We encounter few issues were a 32bit kernel was installed on a
multi-core platform and the driver allocated "too many" queues. One way
to go is to limit the queue size and the other is the number of queues -
leaving it as is caused issues due to running low on kernel virtual
memory so a change was needed.

Dave, what is your preference:
- re-submitting this patch with a proper description?
- limit the size of each queue?
- other?

Thanks,
Eilon

^ permalink raw reply

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: David Miller @ 2010-04-19  6:29 UTC (permalink / raw)
  To: eilong; +Cc: vladz, netdev, dmitry
In-Reply-To: <1271657181.27125.20.camel@lb-tlvb-eilong.il.broadcom.com>

From: "Eilon Greenstein" <eilong@broadcom.com>
Date: Mon, 19 Apr 2010 09:06:21 +0300

> Yes - running low on vmalloc is the reason. This patch does not restrict
> the usage of multi queue on 32bits - it is just changing the default
> queues values from the number of available CPUs to 1. If the user will
> use another value, it will work as before.

Changing the default to 1 is equivalent to disabling multi-queue.

> We encounter few issues were a 32bit kernel was installed on a
> multi-core platform and the driver allocated "too many" queues. One way
> to go is to limit the queue size and the other is the number of queues -
> leaving it as is caused issues due to running low on kernel virtual
> memory so a change was needed.

Look into using an allocation strategy other than vmalloc(), in fact
I can't see why you need to much memory that vmalloc() must be used
in the first place.

^ permalink raw reply

* [PATCH RFC]: soreuseport: Bind multiple sockets to same port
From: Tom Herbert @ 2010-04-19  6:33 UTC (permalink / raw)
  To: davem, netdev

This is some work we've done to scale TCP listeners/UDP servers.  It
might be apropos with some of the discussion on SO_REUSEADDR for UDP.
---
This patch implements so_reuseport (SO_REUSEPORT socket option) for
TCP and UDP.  For TCP, so_reuseport allows multiple listener sockets
to be bound to the same port.  In the case of UDP, so_reuseport allows
multiple sockets to bind to the same port.  To prevent port hijacking
all sockets bound to the same port using so_reuseport must have the
same uid.  Received packets are distributed to multiple sockets bound
to the same port using a 4-tuple hash.

The motivating case for so_resuseport in TCP would be something like
a web server binding to port 80 running with multiple threads, where
each thread might have it's own listener socket.  This could be done
as an alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

The TCP implementation has a problem in that the request sockets for a
listener are attached to a listener socket.  If a SYN is received, a
listener socket is chosen and request structure is created (SYN-RECV
state).  If the subsequent ack in 3WHS does not match the same port
by so_reusport, the connection state is not found (reset) and the
request structure is orphaned.  This scenario would occur when the
number of listener sockets bound to a port changes (new ones are
added, or old ones closed).  We are looking for a solution to this,
maybe allow multiple sockets to share the same request table...

The motivating case for so_reuseport in UDP would be something like a
DNS server.  An alternative would be to recv on the same socket from
multiple threads.  As in the case of TCP, the load across these threads
tends to be disproportionate and we also see a lot of contection on
the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
sockets to bind to the same port, however there is no provision to
prevent hijacking and nothing to distribute packets across all the
sockets sharing the same bound port.  This patch does not change the
semantics of SO_REUSEADDR, but provides usable functionality of it
for unicast.
---
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..37b699f 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -22,7 +22,7 @@
 #define SO_PRIORITY	12
 #define SO_LINGER	13
 #define SO_BSDCOMPAT	14
-/* To add :#define SO_REUSEPORT 15 */
+#define SO_REUSEPORT	15
 
 #ifndef SO_PASSCRED /* powerpc only differs in these */
 #define SO_PASSCRED	16
diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c..bd8f0b6 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -51,6 +51,12 @@
 #define INET_ADDRSTRLEN		(16)
 #define INET6_ADDRSTRLEN	(48)
 
+static inline u32 inet_next_pseudo_random32(u32 seed)
+{
+	/* Pseudo random number generator from numerical recipes */
+	return seed * 1664525 + 1013904223;
+}
+
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 74358d1..0887675 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -81,7 +81,9 @@ struct inet_bind_bucket {
 	struct net		*ib_net;
 #endif
 	unsigned short		port;
-	signed short		fastreuse;
+	signed char		fastreuse;
+	signed char		fastreuseport;
+	int			fastuid;
 	int			num_owners;
 	struct hlist_node	node;
 	struct hlist_head	owners;
@@ -257,15 +259,19 @@ extern void inet_unhash(struct sock *sk);
 
 extern struct sock *__inet_lookup_listener(struct net *net,
 					   struct inet_hashinfo *hashinfo,
+					   const __be32 saddr,
+					   const __be16 sport,
 					   const __be32 daddr,
 					   const unsigned short hnum,
 					   const int dif);
 
 static inline struct sock *inet_lookup_listener(struct net *net,
 		struct inet_hashinfo *hashinfo,
+		__be32 saddr, __be16 sport,
 		__be32 daddr, __be16 dport, int dif)
 {
-	return __inet_lookup_listener(net, hashinfo, daddr, ntohs(dport), dif);
+	return __inet_lookup_listener(net, hashinfo, saddr, sport,
+	    daddr, ntohs(dport), dif);
 }
 
 /* Socket demux engine toys. */
@@ -356,7 +362,8 @@ static inline struct sock *__inet_lookup(struct net *net,
 	struct sock *sk = __inet_lookup_established(net, hashinfo,
 				saddr, sport, daddr, hnum, dif);
 
-	return sk ? : __inet_lookup_listener(net, hashinfo, daddr, hnum, dif);
+	return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
+	    daddr, hnum, dif);
 }
 
 static inline struct sock *inet_lookup(struct net *net,
diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..58c6fa9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -114,6 +114,7 @@ struct net;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
  *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
@@ -140,7 +141,8 @@ struct sock_common {
 	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_reuseport:1;
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
@@ -233,6 +235,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_reuseport		__sk_common.skc_reuseport
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
diff --git a/net/core/sock.c b/net/core/sock.c
index 7effa1e..801efc6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -497,6 +497,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	case SO_REUSEADDR:
 		sk->sk_reuse = valbool;
 		break;
+	case SO_REUSEPORT:
+		sk->sk_reuseport = valbool;
+		break;
 	case SO_TYPE:
 	case SO_PROTOCOL:
 	case SO_DOMAIN:
@@ -780,6 +783,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = sk->sk_reuse;
 		break;
 
+	case SO_REUSEPORT:
+		v.val = sk->sk_reuseport;
+		break;
+
 	case SO_KEEPALIVE:
 		v.val = !!sock_flag(sk, SOCK_KEEPOPEN);
 		break;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..02961c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -56,6 +56,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	struct sock *sk2;
 	struct hlist_node *node;
 	int reuse = sk->sk_reuse;
+	int reuseport = sk->sk_reuseport;
+	int uid = sock_i_uid((struct sock *)sk);
 
 	/*
 	 * Unlike other sk lookup places we do not check
@@ -70,8 +72,11 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
-			if (!reuse || !sk2->sk_reuse ||
-			    sk2->sk_state == TCP_LISTEN) {
+			if ((!reuse || !sk2->sk_reuse ||
+			    sk2->sk_state == TCP_LISTEN) &&
+			    (!reuseport || !sk2->sk_reuseport ||
+			    (sk2->sk_state != TCP_TIME_WAIT &&
+			     uid != sock_i_uid(sk2)))) {
 				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
@@ -96,6 +101,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	int ret, attempts = 5;
 	struct net *net = sock_net(sk);
 	int smallest_size = -1, smallest_rover;
+	int uid = sock_i_uid(sk);
 
 	local_bh_disable();
 	if (!snum) {
@@ -113,9 +119,12 @@ again:
 			spin_lock(&head->lock);
 			inet_bind_bucket_for_each(tb, node, &head->chain)
 				if (net_eq(ib_net(tb), net) && tb->port == rover) {
-					if (tb->fastreuse > 0 &&
-					    sk->sk_reuse &&
-					    sk->sk_state != TCP_LISTEN &&
+					if (((tb->fastreuse > 0 &&
+					      sk->sk_reuse &&
+					      sk->sk_state != TCP_LISTEN) ||
+					     (tb->fastreuseport > 0 &&
+					      sk->sk_reuseport &&
+					      tb->fastuid == uid)) &&
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
@@ -165,14 +174,18 @@ have_snum:
 	goto tb_not_found;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (tb->fastreuse > 0 &&
-		    sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+		if (((tb->fastreuse > 0 &&
+		      sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
+		     (tb->fastreuseport > 0 &&
+		      sk->sk_reuseport && tb->fastuid == uid)) &&
 		    smallest_size == -1) {
 			goto success;
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+				if (((sk->sk_reuse &&
+				      sk->sk_state != TCP_LISTEN) ||
+				     sk->sk_reuseport) &&
 				    smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
@@ -191,9 +204,23 @@ tb_not_found:
 			tb->fastreuse = 1;
 		else
 			tb->fastreuse = 0;
-	} else if (tb->fastreuse &&
-		   (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
-		tb->fastreuse = 0;
+		if (sk->sk_reuseport) {
+			tb->fastreuseport = 1;
+			tb->fastuid = uid;
+		} else {
+			tb->fastreuseport = 0;
+			tb->fastuid = 0;
+		}
+	} else {
+		if (tb->fastreuse &&
+		    (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
+			tb->fastreuse = 0;
+		if (tb->fastreuseport &&
+		    (!sk->sk_reuseport || tb->fastuid != uid)) {
+			tb->fastreuseport = 0;
+			tb->fastuid = 0;
+		}
+	}
 success:
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, snum);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..f7c23e1 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -38,6 +38,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 		write_pnet(&tb->ib_net, hold_net(net));
 		tb->port      = snum;
 		tb->fastreuse = 0;
+		tb->fastreuseport = 0;
 		tb->num_owners = 0;
 		INIT_HLIST_HEAD(&tb->owners);
 		hlist_add_head(&tb->node, &head->chain);
@@ -129,16 +130,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
 		__be32 rcv_saddr = inet->inet_rcv_saddr;
-		score = sk->sk_family == PF_INET ? 1 : 0;
+		score = sk->sk_family == PF_INET ? 2 : 1;
 		if (rcv_saddr) {
 			if (rcv_saddr != daddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -154,6 +155,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 
 struct sock *__inet_lookup_listener(struct net *net,
 				    struct inet_hashinfo *hashinfo,
+				    const __be32 saddr, __be16 sport,
 				    const __be32 daddr, const unsigned short hnum,
 				    const int dif)
 {
@@ -161,26 +163,39 @@ struct sock *__inet_lookup_listener(struct net *net,
 	struct hlist_nulls_node *node;
 	unsigned int hash = inet_lhashfn(net, hnum);
 	struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-	int score, hiscore;
+	int score, hiscore, matches = 0, reuseport = 0;
+	u32 phash = 0;
 
 	rcu_read_lock();
 begin:
 	result = NULL;
-	hiscore = -1;
+	hiscore = 0;
 	sk_nulls_for_each_rcu(sk, node, &ilb->head) {
 		score = compute_score(sk, net, hnum, daddr, dif);
 		if (score > hiscore) {
 			result = sk;
 			hiscore = score;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				phash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == hiscore && reuseport) {
+			matches++;
+			if (((u64)phash * matches) >> 32 == 0)
+				result = sk;
+			phash = inet_next_pseudo_random32(phash);
 		}
 	}
-	/*
-	 * if the nulls value we got at the end of this lookup is
-	 * not the expected one, we must restart lookup.
+/*
+ * if the nulls value we got at the end of this lookup is
+ * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
 	if (get_nulls_value(node) != hash + LISTENING_NULLS_BASE)
 		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -467,7 +482,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			inet_bind_bucket_for_each(tb, node, &head->chain) {
 				if (net_eq(ib_net(tb), net) &&
 				    tb->port == port) {
-					if (tb->fastreuse >= 0)
+					if (tb->fastreuse >= 0 ||
+					    tb->fastreuseport >= 0)
 						goto next_port;
 					WARN_ON(hlist_empty(&tb->owners));
 					if (!check_established(death_row, sk,
@@ -484,6 +500,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 				break;
 			}
 			tb->fastreuse = -1;
+			tb->fastreuseport = -1;
 			goto ok;
 
 		next_port:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad08392..71cf97d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1735,6 +1735,7 @@ do_time_wait:
 	case TCP_TW_SYN: {
 		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
 							&tcp_hashinfo,
+							iph->saddr, th->source,
 							iph->daddr, th->dest,
 							inet_iif(skb));
 		if (sk2) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 666b963..59404fa 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -135,6 +135,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 {
 	struct sock *sk2;
 	struct hlist_nulls_node *node;
+	int uid = sock_i_uid(sk);
 
 	sk_nulls_for_each(sk2, node, &hslot->head)
 		if (net_eq(sock_net(sk2), net) &&
@@ -143,6 +144,8 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
+		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
+		      uid != sock_i_uid(sk2)) &&
 		    (*saddr_comp)(sk, sk2)) {
 			if (bitmap)
 				__set_bit(udp_sk(sk2)->udp_port_hash >> log,
@@ -332,26 +335,26 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 			!ipv6_only_sock(sk)) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_rcv_saddr) {
 			if (inet->inet_rcv_saddr != daddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -360,7 +363,6 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 /*
  * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, inet_num)
  */
-#define SCORE2_MAX (1 + 2 + 2 + 2)
 static inline int compute_score2(struct sock *sk, struct net *net,
 				 __be32 saddr, __be16 sport,
 				 __be32 daddr, unsigned int hnum, int dif)
@@ -375,21 +377,21 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 		if (inet->inet_num != hnum)
 			return -1;
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -404,19 +406,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash = 0;
 
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
 		score = compute_score2(sk, net, saddr, sport,
 				      daddr, hnum, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
-			if (score == SCORE2_MAX)
-				goto exact_match;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = inet_next_pseudo_random32(hash);
 		}
 	}
 	/*
@@ -428,7 +440,6 @@ begin:
 		goto begin;
 
 	if (result) {
-exact_match:
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
 		else if (unlikely(compute_score2(result, net, saddr, sport,
@@ -452,7 +463,8 @@ static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
 	struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash;
 
 	rcu_read_lock();
 	if (hslot->count > 10) {
@@ -481,13 +493,24 @@ static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	}
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
 		score = compute_score(sk, net, saddr, hnum, sport,
 				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = inet_next_pseudo_random32(hash);
 		}
 	}
 	/*

^ permalink raw reply related

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  6:36 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <4BCBF53D.1040800@pelagicore.com>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Mon, 19 Apr 2010 08:16:29 +0200

> I posted a new patch where the MAC address is passed via
> platform data, hope that's an accepted way.

I saw also that you mentioned that there is no real
way to probe this information.

Therefore I worry that your plan might be to simply
use some kernel command line or module option somewhere
else to fill in this platform_device value.

Is that what you plan to do?

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: Richard Röjfors @ 2010-04-19  6:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100418.211800.112605815.davem@davemloft.net>

On 04/19/2010 06:18 AM, David Miller wrote:
> From: Richard Röjfors<richard.rojfors@pelagicore.com>
> Date: Sun, 18 Apr 2010 19:25:57 +0200
>
>> This patch adds a module parameter for setting the MAC address.
>>
>> To ensure this MAC address is used, the MAC address is written
>> after each hardware reset.
>>
>> Signed-off-by: Richard Röjfors<richard.rojfors@pelagicore.com>
>
> Your driver isn't unique, or special, and it doesn't need
> to do special things to handle this situation.

I posted a new patch where the MAC address is passed via
platform data, hope that's an accepted way.

Thanks,
--Richard

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: Richard Röjfors @ 2010-04-19  6:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100418.233631.115927208.davem@davemloft.net>

On 04/19/2010 08:36 AM, David Miller wrote:
> From: Richard Röjfors<richard.rojfors@pelagicore.com>
> Date: Mon, 19 Apr 2010 08:16:29 +0200
>
>> I posted a new patch where the MAC address is passed via
>> platform data, hope that's an accepted way.
>
> I saw also that you mentioned that there is no real
> way to probe this information.
>
> Therefore I worry that your plan might be to simply
> use some kernel command line or module option somewhere
> else to fill in this platform_device value.
>
> Is that what you plan to do?

In the lab; yes.
In the end, there will be a flash available to store
system wide parameters.

--Richard

^ permalink raw reply

* Re: [PATCH BUG-FIX] ipv6: allow to send packet after receiving ICMPv6 Too Big message with MTU field less than  IPV6_MIN_MTU
From: Shan Wei @ 2010-04-19  6:49 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	魏勇军, vladislav.yasevich, kuznet, pekkas,
	jmorris, Patrick McHardy, eric.dumazet, sri,
	netdev@vger.kernel.org, linux-sctp
In-Reply-To: <20100419035535.GA7011@gondor.apana.org.au>

Herbert Xu wrote, at 04/19/2010 11:55 AM:
> 
> The patch looks good to me.

Thanks for reviewing this patch.

> If we wanted to optimise the allfrags case it may be better
> to reserve the space beforehand and generate the fragment header
> at the same time as we're doing the IPv6 header.
> 
> But it can't be all that important as it's been broken for so
> many years.

If somebody needs one patch to fix the broken,
I am pleased to do so. 

-- 
Best Regards
-----
Shan Wei


> 
> Thanks,



^ permalink raw reply

* ARP updates and GARP
From: Sasha Levin @ 2010-04-19  6:54 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

We are currently testing IP fail-over on storage devices, and have observed an issue with the IP transfer from one device to another.

Assuming we have 2 storage devices A and B, and a server C which uses the storage, the scenario is:

1. Device A sends an ARP request which server C sees – server C updates it’s ARP table with the MAC of device A.
2. Device A fails, Device B takes over the IP and sends out a GARP.
3. Even though device C sees the GARP, it ignores it and keeps trying to communicate with device A until the entry is removed from its cache and a new ARP request is generated.

The code which causes this is located in arp_process@/net/ipv4/arp.c:

override = time_after(jiffies, n->updated + n->parms->locktime);

/* Broadcast replies and request packets
   do not assert neighbour reachability.
 */
if (arp->ar_op != htons(ARPOP_REPLY) ||
    skb->pkt_type != PACKET_HOST)
        state = NUD_STALE;
neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0);
neigh_release(n);

According to the code, this scenario happens because the kernel ignores any ARP updates which happened in a short period after the previous ARP update. The reason which was stated in the comments is  “If several different ARP replies follows back-to-back, use the FIRST one. It is possible, if several proxy agents are active. Taking the first reply prevents arp trashing and chooses the fastest router.”.

This, however, doesn’t take into account GARPs which are not being sent by ARP proxies anyway and just ignores them too – causing a loss of communication for over a minute until the ARP cache refreshes.

Is there another reason for this rule? If not, Is possible to submit a patch which will take GARPs into account when ignoring ARP updates?

Thanks!

Sasha.

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  7:13 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <4BCBFC62.8010200@pelagicore.com>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Mon, 19 Apr 2010 08:46:58 +0200

> On 04/19/2010 08:36 AM, David Miller wrote:
>> From: Richard Röjfors<richard.rojfors@pelagicore.com>
>> Date: Mon, 19 Apr 2010 08:16:29 +0200
>>
>>> I posted a new patch where the MAC address is passed via
>>> platform data, hope that's an accepted way.
>>
>> I saw also that you mentioned that there is no real
>> way to probe this information.
>>
>> Therefore I worry that your plan might be to simply
>> use some kernel command line or module option somewhere
>> else to fill in this platform_device value.
>>
>> Is that what you plan to do?
> 
> In the lab; yes.
> In the end, there will be a flash available to store
> system wide parameters.

Ok, your long term plan is fine I guess.

^ permalink raw reply

* Re: Phylib polling when doing mdio_read will cause system response and transfer speed drop
From: Wolfram Sang @ 2010-04-19  7:21 UTC (permalink / raw)
  To: Bryan Wu; +Cc: afleming, davem, netdev, LKML
In-Reply-To: <4BC50BF5.7080700@canonical.com>

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

On Tue, Apr 13, 2010 at 05:27:33PM -0700, Bryan Wu wrote:

> I found the root cause is the polling operation in the mdio_read 
> function. When we transfer large files, we experienced many times of 
> timeout issue. So I got several question here:

Same here, I saw the 'MDIO Timeout' Message occasionally.

> 1. Need I return -ETIMEDOUT when polling timeout. If I don't return 
> -ETIMEOUT, the performance improved a lot. And after check other drivers, 
> some don't return anything, some return 0, some return negative value. 
> What's the rule for this mdio_read polling timeout case.
>
> 2. How to do polling busy waiting? Normally, we won't buys wait very long 
> in polling. But hardware is not perfect every time. Running cpu_relax() 
> 10000 times in polling will cause our system response very bad when 
> hardware don't set the flag as we expected. Maybe udelay(25) 10 times or 
> msleep(1) 10 times is better than that.
>
> I got a patch to recover this issue,  
> http://kernel.ubuntu.com/git?p=roc/ubuntu-lucid.git;a=commitdiff;h=5d77e3409b319ca84183bf1d2fd158a9c864e03f.

Can't help with the details, but the patch seems to help here, too
(and not only because of the removed printk ;)).

Regards,

   Wolfram

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox