Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-19 10:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <20100415100546.GA17035@redhat.com>

> Michael,
> >>> The idea is simple, just to pin the guest VM user space and then
> >>> let host NIC driver has the chance to directly DMA to it. 
> >>> The patches are based on vhost-net backend driver. We add a device
> >>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> >>> send/recv directly to/from the NIC driver. KVM guest who use the
> >>> vhost-net backend may bind any ethX interface in the host side to
> >>> get copyless data transfer thru guest virtio-net frontend.
> >>> 
> >>> The scenario is like this:
> >>> 
> >>> The guest virtio-net driver submits multiple requests thru vhost-net
> >>> backend driver to the kernel. And the requests are queued and then
> >>> completed after corresponding actions in h/w are done.
> >>> 
> >>> For read, user space buffers are dispensed to NIC driver for rx when
> >>> a page constructor API is invoked. Means NICs can allocate user buffers
> >>> from a page constructor. We add a hook in netif_receive_skb() function
> >>> to intercept the incoming packets, and notify the zero-copy device.
> >>> 
> >>> For write, the zero-copy deivce may allocates a new host skb and puts
> >>> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> >>> The request remains pending until the skb is transmitted by h/w.
> >>> 
> >>> Here, we have ever considered 2 ways to utilize the page constructor
> >>> API to dispense the user buffers.
> >>> 
> >>> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> >>> 	structure of sk_buff, and the data pointer is pointing to a 
> >>> 	user buffer which is coming from a page constructor API.
> >>> 	Then the shinfo of the skb is also from guest.
> >>> 	When packet is received from hardware, the skb->data is filled
> >>> 	directly by h/w. What we have done is in this way.
> >>> 
> >>> 	Pros:	We can avoid any copy here.
> >>> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> >>> 		the same method with the host NIC drivers, say the size
> >>> 		of netdev_alloc_skb() and the same reserved space in the
> >>> 		head of skb. Many NIC drivers are the same with guest and
> >>> 		ok for this. But some lastest NIC drivers reserves special
> >>> 		room in skb head. To deal with it, we suggest to provide
> >>> 		a method in guest virtio-net driver to ask for parameter
> >>> 		we interest from the NIC driver when we know which device 
> >>> 		we have bind to do zero-copy. Then we ask guest to do so.
> >>> 		Is that reasonable?
> >>Unfortunately, this would break compatibility with existing virtio.
> >>This also complicates migration.  
>> You mean any modification to the guest virtio-net driver will break the
>> compatibility? We tried to enlarge the virtio_net_config to contains the
>> 2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
>> will check the feature flag, and get the parameters, then virtio-net driver use
>> it to allocate buffers. How about this?

>This means that we can't, for example, live-migrate between different systems
>without flushing outstanding buffers.

Ok. What we have thought about now is to do something with skb_reserve().
If the device is binded by mp, then skb_reserve() will do nothing with it.

> >>What is the room in skb head used for?
> >I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
>> NET_IP_ALIGN.

>Looking at code, this seems to do with alignment - could just be
>a performance optimization.

> >>> Two:	Modify driver to get user buffer allocated from a page constructor
> >>> 	API(to substitute alloc_page()), the user buffer are used as payload
> >>> 	buffers and filled by h/w directly when packet is received. Driver
> >>> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> >>> 	the head buffer side, let host allocates skb, and h/w fills it. 
> >>> 	After that, the data filled in host skb header will be copied into
> >>> 	guest header buffer which is submitted together with the payload buffer.
> >>> 
> >>> 	Pros:	We could less care the way how guest or host allocates their
> >>> 		buffers.
> >>> 	Cons:	We still need a bit copy here for the skb header.
> >>> 
> >>> We are not sure which way is the better here. 
> >>The obvious question would be whether you see any speed difference
> >>with the two approaches. If no, then the second approach would be
> >>better.
> 
>> I remember the second approach is a bit slower in 1500MTU. 
>> But we did not tested too much.

>Well, that's an important datapoint. By the way, you'll need
>header copy to activate LRO in host, so that's a good
>reason to go with option 2 as well.


> >>> This is the first thing we want
> >>> to get comments from the community. We wish the modification to the network
> >>> part will be generic which not used by vhost-net backend only, but a user
> >>> application may use it as well when the zero-copy device may provides async
> >>> read/write operations later.
> >>> 
> >>> Please give comments especially for the network part modifications.
> >>> 
> >>> 
> >>> We provide multiple submits and asynchronous notifiicaton to 
> >>>vhost-net too.
> >>> 
> >>> Our goal is to improve the bandwidth and reduce the CPU usage.
> >>> Exact performance data will be provided later. But for simple
> >>> test with netperf, we found bindwidth up and CPU % up too,
> >>> but the bindwidth up ratio is much more than CPU % up ratio.
> >>> 
> >>> What we have not done yet:
> >>> 	packet split support
> 
> >>What does this mean, exactly?
>> We can support 1500MTU, but for jumbo frame, since vhost driver before don't 
> >support mergeable buffer, we cannot try it for multiple sg.

>I do not see why, vhost currently supports 64K buffers with indirect
>descriptors.

The receive_skb() in guest virtio-net driver will merge the multiple sg to skb frags, how can indirect descriptors to that?

>>> A jumbo frame will split 5
>>> frags and hook them once a descriptor, so the user buffer allocation is greatly dependent
>>> on how guest virtio-net drivers submits buffers. We think mergeable buffer is suitable for >>>it. 
> 
> >> 	To support GRO
>>> Actually, I think if the mergeable buffer may get good performance, then GRO is not 
>>> so important then.
> >>And TSO/GSO?
>>> Do we really need them?

>>My guess would be yes. Mergeable buffers is a memory saving
>>optimization, not a performance optimization, I don't see
>>that it can help. And I think you can't solely rely on jumbo frames
>>in hardware, not everyone can enable them.

>Having said that, number one priority is getting decent performance
>out of the driver, in whatever way you find fit. I was just
>suggesting obvious ways to do this.

Thanks.

> >> 	Performance tuning
> >> 
> >> what we have done in v1:
> >> 	polish the RCU usage
> >> 	deal with write logging in asynchroush mode in vhost
> >> 	add notifier block for mp device
> >> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> >> 	add mp_dev_change_flags() for mp device to change NIC state
> >> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> >> 	a small fix for missing dev_put when fail
> >> 	using dynamic minor instead of static minor number
> >> 	a __KERNEL__ protect to mp_get_sock()
> >> 
> >> what we have done in v2:
> >> 	
> >> 	remove most of the RCU usage, since the ctor pointer is only
> >> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> >> 	stopped to get good cleanup(all outstanding requests are finished),
> >> 	so the ctor pointer cannot be raced into wrong situation.
> >> 
> >> 	Remove the struct vhost_notifier with struct kiocb.
> >> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> >> 	via sendmsg/recvmsg.
> >> 
> >> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> >> 
> >> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> >> 
> >> 
> >> Comments not addressed yet in this time:
> >> 	the async write logging is not satified by vhost-net
> >> 	Qemu needs a sync write
> >> 	a limit for locked pages from get_user_pages_fast()
> >> 	
> >> 		
> >> performance:
> >> 	using netperf with GSO/TSO disabled, 10G NIC, 
> >> 	disabled packet split mode, with raw socket case compared to vhost.
> >> 
> >> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> >> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* [PATCH net-2.6] cleanup: remove two unnecessary exports in skbuff.c.
From: Rami Rosen @ 2010-04-19  9:50 UTC (permalink / raw)
  To: davem, netdev

[-- Attachment #1: Type: text/plain, Size: 240 bytes --]

Hi,
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in
skbuff.c ; this patch removes these two exports.

Regards,
Rami Rosen

Signed-off-by: Rami Rosen <ramirose@gmail.com>

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 666 bytes --]

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..799b89d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -126,7 +126,6 @@ void skb_over_panic(struct sk_buff *skb, int sz, void *here)
 	       skb->dev ? skb->dev->name : "<NULL>");
 	BUG();
 }
-EXPORT_SYMBOL(skb_over_panic);

 /**
  *	skb_under_panic	- 	private function
@@ -146,7 +145,6 @@ void skb_under_panic(struct sk_buff *skb, int sz, void *here)
 	       skb->dev ? skb->dev->name : "<NULL>");
 	BUG();
 }
-EXPORT_SYMBOL(skb_under_panic);

 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the

^ permalink raw reply related

* Re: [RFC] rps: shortcut net_rps_action()
From: Changli Gao @ 2010-04-19  9:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, David Miller, netdev
In-Reply-To: <1271669822.16881.7520.camel@edumazet-laptop>

On Mon, Apr 19, 2010 at 5:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
> RPS is not active.
>
> I add a flag to scan cpumask only if at least one IPI was scheduled.
> Even cpumask_weight() might be expensive on some setups, where
> nr_cpumask_bits could be very big (4096 for example)

How about using a array to save the cpu IDs. The number of CPUs, to
which the IPI will be sent, should be small.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [RFC] rps: shortcut net_rps_action()
From: Eric Dumazet @ 2010-04-19  9:37 UTC (permalink / raw)
  To: Tom Herbert, David Miller; +Cc: netdev
In-Reply-To: <1271590476.16881.4925.camel@edumazet-laptop>

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

I add a flag to scan cpumask only if at least one IPI was scheduled.
Even cpumask_weight() might be expensive on some setups, where
nr_cpumask_bits could be very big (4096 for example)

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    5 +-
 net/core/dev.c            |   73 ++++++++++++++++--------------------
 2 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..283d3ef 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1389,8 +1389,11 @@ struct softnet_data {
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	unsigned int		rps_ipis_scheduled;
+	unsigned int		rps_select;
+	cpumask_t		rps_mask[2];
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 	unsigned int		input_queue_head;
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..3e6e420 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2347,19 +2347,14 @@ done:
 }
 
 /*
- * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * sofnet_data holds the per-CPU mask of CPUs for which IPIs are scheduled
  * to be sent to kick remote softirq processing.  There are two masks since
- * the sending of IPIs must be done with interrupts enabled.  The select field
+ * the sending of IPIs must be done with interrupts enabled.  The rps_select field
  * indicates the current mask that enqueue_backlog uses to schedule IPIs.
  * select is flipped before net_rps_action is called while still under lock,
  * net_rps_action then uses the non-selected mask to send the IPIs and clears
  * it without conflicting with enqueue_backlog operation.
  */
-struct rps_remote_softirq_cpus {
-	cpumask_t mask[2];
-	int select;
-};
-static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
 
 /* Called from hardirq (IPI) context */
 static void trigger_softirq(void *data)
@@ -2403,10 +2398,10 @@ enqueue:
 		if (napi_schedule_prep(&queue->backlog)) {
 #ifdef CONFIG_RPS
 			if (cpu != smp_processor_id()) {
-				struct rps_remote_softirq_cpus *rcpus =
-				    &__get_cpu_var(rps_remote_softirq_cpus);
+				struct softnet_data *myqueue = &__get_cpu_var(softnet_data);
 
-				cpu_set(cpu, rcpus->mask[rcpus->select]);
+				cpu_set(cpu, myqueue->rps_mask[myqueue->rps_select]);
+				myqueue->rps_ipis_scheduled = 1;
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 				goto enqueue;
 			}
@@ -2911,7 +2906,9 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
-/* Network device is going away, flush any packets still pending  */
+/* Network device is going away, flush any packets still pending
+ * Called with irqs disabled.
+ */
 static void flush_backlog(void *arg)
 {
 	struct net_device *dev = arg;
@@ -3340,24 +3337,36 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
-#ifdef CONFIG_RPS
 /*
- * net_rps_action sends any pending IPI's for rps.  This is only called from
- * softirq and interrupts must be enabled.
+ * net_rps_action sends any pending IPI's for rps.
+ * Note: called with local irq disabled, but exits with local irq enabled.
  */
-static void net_rps_action(cpumask_t *mask)
+static void net_rps_action(void)
 {
-	int cpu;
+#ifdef CONFIG_RPS
+	if (percpu_read(softnet_data.rps_ipis_scheduled)) {
+		struct softnet_data *queue = &__get_cpu_var(softnet_data);
+		int cpu, select = queue->rps_select;
+		cpumask_t *mask;
+		
+		queue->rps_ipis_scheduled = 0;
+		queue->rps_select ^= 1;
 
-	/* Send pending IPI's to kick RPS processing on remote cpus. */
-	for_each_cpu_mask_nr(cpu, *mask) {
-		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
-		if (cpu_online(cpu))
-			__smp_call_function_single(cpu, &queue->csd, 0);
-	}
-	cpus_clear(*mask);
-}
+		local_irq_enable();
+
+		mask = &queue->rps_mask[select];
+
+		/* Send pending IPI's to kick RPS processing on remote cpus. */
+		for_each_cpu_mask_nr(cpu, *mask) {
+			struct softnet_data *remqueue = &per_cpu(softnet_data, cpu);
+			if (cpu_online(cpu))
+				__smp_call_function_single(cpu, &remqueue->csd, 0);
+		}
+		cpus_clear(*mask);
+	} else
 #endif
+		local_irq_enable();
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3365,10 +3374,6 @@ static void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
-#ifdef CONFIG_RPS
-	int select;
-	struct rps_remote_softirq_cpus *rcpus;
-#endif
 
 	local_irq_disable();
 
@@ -3431,17 +3436,7 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
-#ifdef CONFIG_RPS
-	rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
-	select = rcpus->select;
-	rcpus->select ^= 1;
-
-	local_irq_enable();
-
-	net_rps_action(&rcpus->mask[select]);
-#else
-	local_irq_enable();
-#endif
+	net_rps_action();
 
 #ifdef CONFIG_NET_DMA
 	/*



^ permalink raw reply related

* [GIT]: Networking
From: David Miller @ 2010-04-19  7:38 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix TX lockups in forcedeth due to an incorrect chipset ID check
   wrt. whether to enable a TX hw bug workaround or not.  Fix from
   Ayaz Abdulla.

2) Fix some virtualization problems by orphan'ing the SKB on TX
   in tun driver.  From Michael S. Tsirkin.

3) Some minor fallout from the slab.h cleanups in gigaset driver,
   from Tilman Schmidt.

4) hdlc_ppp can crash on rmmod due to lack of PPP tx queue flush,
   fix from Krzysztof Halasa.

5) Three fixes from Eric Dumazet:
   a) ip_dev_loopback_xmit() needs to use netif_rx_ni() since it sometimes
      is invoked from user context and therefore an explicit check and
      run of softirqs is necessary.
   b) dev_pic_tx() should not cache a socket TX queue selection unless
      the socket cache'd dst matches the one currently hung off of the
      skb
   c) Fix lockdep false positives in fib_trie

6) AF_PACKET erroneously restricts to init_net in some ioctls, fix
   from Daniel Lezcano.

7) iwlwifi active chain detection fix from Johannes Berg.

Please pull, thanks a lot!

The following changes since commit 13bd8e4673d527a9e48f41956b11d391e7c2cfe0:
  Linus Torvalds (1):
        Merge branch 'for-linus' of git://git.kernel.org/.../anholt/drm-intel

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Ayaz Abdulla (1):
      forcedeth: fix tx limit2 flag check

Daniel Lezcano (1):
      packet : remove init_net restriction

David S. Miller (1):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6

Eric Dumazet (3):
      fib: suppress lockdep-RCU false positive in FIB trie.
      net: dev_pick_tx() fix
      ip: Fix ip_dev_loopback_xmit()

Johannes Berg (1):
      iwlwifi: work around bogus active chains detection

Krzysztof Halasa (1):
      WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.

Michael S. Tsirkin (1):
      tun: orphan an skb on tx

Tilman Schmidt (1):
      gigaset: include cleanup cleanup

 drivers/isdn/gigaset/bas-gigaset.c       |    5 -----
 drivers/isdn/gigaset/capi.c              |    2 --
 drivers/isdn/gigaset/common.c            |    2 --
 drivers/isdn/gigaset/gigaset.h           |    2 +-
 drivers/isdn/gigaset/i4l.c               |    1 -
 drivers/isdn/gigaset/interface.c         |    1 -
 drivers/isdn/gigaset/proc.c              |    1 -
 drivers/isdn/gigaset/ser-gigaset.c       |    3 ---
 drivers/isdn/gigaset/usb-gigaset.c       |    4 ----
 drivers/net/forcedeth.c                  |    2 +-
 drivers/net/tun.c                        |    4 ++++
 drivers/net/wan/hdlc_ppp.c               |    6 ++++++
 drivers/net/wireless/iwlwifi/iwl-calib.c |   12 ++++++++++++
 net/core/dev.c                           |    8 ++++++--
 net/ipv4/fib_trie.c                      |    4 +++-
 net/ipv4/ip_output.c                     |    2 +-
 net/ipv6/ip6_output.c                    |    2 +-
 net/packet/af_packet.c                   |    2 --
 18 files changed, 35 insertions(+), 28 deletions(-)

^ permalink raw reply

* Re: [PATCH RFC]: soreuseport: Bind multiple sockets to same port
From: Eric Dumazet @ 2010-04-19  7:28 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004182321480.1822@pokey.mtv.corp.google.com>

Le dimanche 18 avril 2010 à 23:33 -0700, Tom Herbert a écrit :
> This is some work we've done to scale TCP listeners/UDP servers.  It
> might be apropos with some of the discussion on SO_REUSEADDR for UDP.
> ---
> This patch implements so_reuseport (SO_REUSEPORT socket option) for
> TCP and UDP.  For TCP, so_reuseport allows multiple listener sockets
> to be bound to the same port.  In the case of UDP, so_reuseport allows
> multiple sockets to bind to the same port.  To prevent port hijacking
> all sockets bound to the same port using so_reuseport must have the
> same uid.  Received packets are distributed to multiple sockets bound
> to the same port using a 4-tuple hash.
> 
> The motivating case for so_resuseport in TCP would be something like
> a web server binding to port 80 running with multiple threads, where
> each thread might have it's own listener socket.  This could be done
> as an alternative to other models: 1) have one listener thread which
> dispatches completed connections to workers. 2) accept on a single
> listener socket from multiple threads.  In case #1 the listener thread
> can easily become the bottleneck with high connection turn-over rate.
> In case #2, the proportion of connections accepted per thread tends
> to be uneven under high connection load (assuming simple event loop:
> while (1) { accept(); process() }, wakeup does not promote fairness
> among the sockets.  We have seen the  disproportion to be as high
> as 3:1 ratio between thread accepting most connections and the one
> accepting the fewest.  With so_reusport the distribution is
> uniform.
> 
> The TCP implementation has a problem in that the request sockets for a
> listener are attached to a listener socket.  If a SYN is received, a
> listener socket is chosen and request structure is created (SYN-RECV
> state).  If the subsequent ack in 3WHS does not match the same port
> by so_reusport, the connection state is not found (reset) and the
> request structure is orphaned.  This scenario would occur when the
> number of listener sockets bound to a port changes (new ones are
> added, or old ones closed).  We are looking for a solution to this,
> maybe allow multiple sockets to share the same request table...
> 
> The motivating case for so_reuseport in UDP would be something like a
> DNS server.  An alternative would be to recv on the same socket from
> multiple threads.  As in the case of TCP, the load across these threads
> tends to be disproportionate and we also see a lot of contection on
> the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
> sockets to bind to the same port, however there is no provision to
> prevent hijacking and nothing to distribute packets across all the
> sockets sharing the same bound port.  This patch does not change the
> semantics of SO_REUSEADDR, but provides usable functionality of it
> for unicast.


Hmm...

I am wondering how this thing is scalable...

In fact it is not.

We live in a world with 16 cpus machines not uncommon right now.

High perf DNS server on such machine would have 16 threads, and probably
64 threads in two years.

I understand you want 16 UDP sockets to avoid lock contention, but
__udp4_lib_lookup() becomes a nightmare (It may already be ...)

My idea was to add a cpu lookup key.

thread0 would use a new setsockopt() option to bind a socket to a
virtual cpu0. Then do its normal bind( port=53)

...

threadN would use a new setsockopt() option to bind a socket to a
virtual cpuN. Then do its normal bind( port=53)

Each thread then do its normal worker loop.

Then, when receiving a frame on cpuN, we would automatically select the
right socket because its score is higher than others.


Another possibility would be to extend socket structure to be able to
have a dynamically sized queues/locks.




^ permalink raw reply

* Re: Phylib polling when doing mdio_read will cause system response and transfer speed drop
From: Wolfram Sang @ 2010-04-19  7:21 UTC (permalink / raw)
  To: Bryan Wu; +Cc: afleming, davem, netdev, LKML
In-Reply-To: <4BC50BF5.7080700@canonical.com>

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

On Tue, Apr 13, 2010 at 05:27:33PM -0700, Bryan Wu wrote:

> I found the root cause is the polling operation in the mdio_read 
> function. When we transfer large files, we experienced many times of 
> timeout issue. So I got several question here:

Same here, I saw the 'MDIO Timeout' Message occasionally.

> 1. Need I return -ETIMEDOUT when polling timeout. If I don't return 
> -ETIMEOUT, the performance improved a lot. And after check other drivers, 
> some don't return anything, some return 0, some return negative value. 
> What's the rule for this mdio_read polling timeout case.
>
> 2. How to do polling busy waiting? Normally, we won't buys wait very long 
> in polling. But hardware is not perfect every time. Running cpu_relax() 
> 10000 times in polling will cause our system response very bad when 
> hardware don't set the flag as we expected. Maybe udelay(25) 10 times or 
> msleep(1) 10 times is better than that.
>
> I got a patch to recover this issue,  
> http://kernel.ubuntu.com/git?p=roc/ubuntu-lucid.git;a=commitdiff;h=5d77e3409b319ca84183bf1d2fd158a9c864e03f.

Can't help with the details, but the patch seems to help here, too
(and not only because of the removed printk ;)).

Regards,

   Wolfram

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  7:13 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <4BCBFC62.8010200@pelagicore.com>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Mon, 19 Apr 2010 08:46:58 +0200

> On 04/19/2010 08:36 AM, David Miller wrote:
>> From: Richard Röjfors<richard.rojfors@pelagicore.com>
>> Date: Mon, 19 Apr 2010 08:16:29 +0200
>>
>>> I posted a new patch where the MAC address is passed via
>>> platform data, hope that's an accepted way.
>>
>> I saw also that you mentioned that there is no real
>> way to probe this information.
>>
>> Therefore I worry that your plan might be to simply
>> use some kernel command line or module option somewhere
>> else to fill in this platform_device value.
>>
>> Is that what you plan to do?
> 
> In the lab; yes.
> In the end, there will be a flash available to store
> system wide parameters.

Ok, your long term plan is fine I guess.

^ permalink raw reply

* ARP updates and GARP
From: Sasha Levin @ 2010-04-19  6:54 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

We are currently testing IP fail-over on storage devices, and have observed an issue with the IP transfer from one device to another.

Assuming we have 2 storage devices A and B, and a server C which uses the storage, the scenario is:

1. Device A sends an ARP request which server C sees – server C updates it’s ARP table with the MAC of device A.
2. Device A fails, Device B takes over the IP and sends out a GARP.
3. Even though device C sees the GARP, it ignores it and keeps trying to communicate with device A until the entry is removed from its cache and a new ARP request is generated.

The code which causes this is located in arp_process@/net/ipv4/arp.c:

override = time_after(jiffies, n->updated + n->parms->locktime);

/* Broadcast replies and request packets
   do not assert neighbour reachability.
 */
if (arp->ar_op != htons(ARPOP_REPLY) ||
    skb->pkt_type != PACKET_HOST)
        state = NUD_STALE;
neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0);
neigh_release(n);

According to the code, this scenario happens because the kernel ignores any ARP updates which happened in a short period after the previous ARP update. The reason which was stated in the comments is  “If several different ARP replies follows back-to-back, use the FIRST one. It is possible, if several proxy agents are active. Taking the first reply prevents arp trashing and chooses the fastest router.”.

This, however, doesn’t take into account GARPs which are not being sent by ARP proxies anyway and just ignores them too – causing a loss of communication for over a minute until the ARP cache refreshes.

Is there another reason for this rule? If not, Is possible to submit a patch which will take GARPs into account when ignoring ARP updates?

Thanks!

Sasha.

^ permalink raw reply

* Re: [PATCH BUG-FIX] ipv6: allow to send packet after receiving ICMPv6 Too Big message with MTU field less than  IPV6_MIN_MTU
From: Shan Wei @ 2010-04-19  6:49 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	魏勇军, vladislav.yasevich, kuznet, pekkas,
	jmorris, Patrick McHardy, eric.dumazet, sri,
	netdev@vger.kernel.org, linux-sctp
In-Reply-To: <20100419035535.GA7011@gondor.apana.org.au>

Herbert Xu wrote, at 04/19/2010 11:55 AM:
> 
> The patch looks good to me.

Thanks for reviewing this patch.

> If we wanted to optimise the allfrags case it may be better
> to reserve the space beforehand and generate the fragment header
> at the same time as we're doing the IPv6 header.
> 
> But it can't be all that important as it's been broken for so
> many years.

If somebody needs one patch to fix the broken,
I am pleased to do so. 

-- 
Best Regards
-----
Shan Wei


> 
> Thanks,



^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: Richard Röjfors @ 2010-04-19  6:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100418.233631.115927208.davem@davemloft.net>

On 04/19/2010 08:36 AM, David Miller wrote:
> From: Richard Röjfors<richard.rojfors@pelagicore.com>
> Date: Mon, 19 Apr 2010 08:16:29 +0200
>
>> I posted a new patch where the MAC address is passed via
>> platform data, hope that's an accepted way.
>
> I saw also that you mentioned that there is no real
> way to probe this information.
>
> Therefore I worry that your plan might be to simply
> use some kernel command line or module option somewhere
> else to fill in this platform_device value.
>
> Is that what you plan to do?

In the lab; yes.
In the end, there will be a flash available to store
system wide parameters.

--Richard

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: Richard Röjfors @ 2010-04-19  6:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100418.211800.112605815.davem@davemloft.net>

On 04/19/2010 06:18 AM, David Miller wrote:
> From: Richard Röjfors<richard.rojfors@pelagicore.com>
> Date: Sun, 18 Apr 2010 19:25:57 +0200
>
>> This patch adds a module parameter for setting the MAC address.
>>
>> To ensure this MAC address is used, the MAC address is written
>> after each hardware reset.
>>
>> Signed-off-by: Richard Röjfors<richard.rojfors@pelagicore.com>
>
> Your driver isn't unique, or special, and it doesn't need
> to do special things to handle this situation.

I posted a new patch where the MAC address is passed via
platform data, hope that's an accepted way.

Thanks,
--Richard

^ permalink raw reply

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  6:36 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <4BCBF53D.1040800@pelagicore.com>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Mon, 19 Apr 2010 08:16:29 +0200

> I posted a new patch where the MAC address is passed via
> platform data, hope that's an accepted way.

I saw also that you mentioned that there is no real
way to probe this information.

Therefore I worry that your plan might be to simply
use some kernel command line or module option somewhere
else to fill in this platform_device value.

Is that what you plan to do?

^ permalink raw reply

* [PATCH RFC]: soreuseport: Bind multiple sockets to same port
From: Tom Herbert @ 2010-04-19  6:33 UTC (permalink / raw)
  To: davem, netdev

This is some work we've done to scale TCP listeners/UDP servers.  It
might be apropos with some of the discussion on SO_REUSEADDR for UDP.
---
This patch implements so_reuseport (SO_REUSEPORT socket option) for
TCP and UDP.  For TCP, so_reuseport allows multiple listener sockets
to be bound to the same port.  In the case of UDP, so_reuseport allows
multiple sockets to bind to the same port.  To prevent port hijacking
all sockets bound to the same port using so_reuseport must have the
same uid.  Received packets are distributed to multiple sockets bound
to the same port using a 4-tuple hash.

The motivating case for so_resuseport in TCP would be something like
a web server binding to port 80 running with multiple threads, where
each thread might have it's own listener socket.  This could be done
as an alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

The TCP implementation has a problem in that the request sockets for a
listener are attached to a listener socket.  If a SYN is received, a
listener socket is chosen and request structure is created (SYN-RECV
state).  If the subsequent ack in 3WHS does not match the same port
by so_reusport, the connection state is not found (reset) and the
request structure is orphaned.  This scenario would occur when the
number of listener sockets bound to a port changes (new ones are
added, or old ones closed).  We are looking for a solution to this,
maybe allow multiple sockets to share the same request table...

The motivating case for so_reuseport in UDP would be something like a
DNS server.  An alternative would be to recv on the same socket from
multiple threads.  As in the case of TCP, the load across these threads
tends to be disproportionate and we also see a lot of contection on
the socket lock.  Note that SO_REUSEADDR already allows multiple UDP
sockets to bind to the same port, however there is no provision to
prevent hijacking and nothing to distribute packets across all the
sockets sharing the same bound port.  This patch does not change the
semantics of SO_REUSEADDR, but provides usable functionality of it
for unicast.
---
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..37b699f 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -22,7 +22,7 @@
 #define SO_PRIORITY	12
 #define SO_LINGER	13
 #define SO_BSDCOMPAT	14
-/* To add :#define SO_REUSEPORT 15 */
+#define SO_REUSEPORT	15
 
 #ifndef SO_PASSCRED /* powerpc only differs in these */
 #define SO_PASSCRED	16
diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c..bd8f0b6 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -51,6 +51,12 @@
 #define INET_ADDRSTRLEN		(16)
 #define INET6_ADDRSTRLEN	(48)
 
+static inline u32 inet_next_pseudo_random32(u32 seed)
+{
+	/* Pseudo random number generator from numerical recipes */
+	return seed * 1664525 + 1013904223;
+}
+
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 74358d1..0887675 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -81,7 +81,9 @@ struct inet_bind_bucket {
 	struct net		*ib_net;
 #endif
 	unsigned short		port;
-	signed short		fastreuse;
+	signed char		fastreuse;
+	signed char		fastreuseport;
+	int			fastuid;
 	int			num_owners;
 	struct hlist_node	node;
 	struct hlist_head	owners;
@@ -257,15 +259,19 @@ extern void inet_unhash(struct sock *sk);
 
 extern struct sock *__inet_lookup_listener(struct net *net,
 					   struct inet_hashinfo *hashinfo,
+					   const __be32 saddr,
+					   const __be16 sport,
 					   const __be32 daddr,
 					   const unsigned short hnum,
 					   const int dif);
 
 static inline struct sock *inet_lookup_listener(struct net *net,
 		struct inet_hashinfo *hashinfo,
+		__be32 saddr, __be16 sport,
 		__be32 daddr, __be16 dport, int dif)
 {
-	return __inet_lookup_listener(net, hashinfo, daddr, ntohs(dport), dif);
+	return __inet_lookup_listener(net, hashinfo, saddr, sport,
+	    daddr, ntohs(dport), dif);
 }
 
 /* Socket demux engine toys. */
@@ -356,7 +362,8 @@ static inline struct sock *__inet_lookup(struct net *net,
 	struct sock *sk = __inet_lookup_established(net, hashinfo,
 				saddr, sport, daddr, hnum, dif);
 
-	return sk ? : __inet_lookup_listener(net, hashinfo, daddr, hnum, dif);
+	return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
+	    daddr, hnum, dif);
 }
 
 static inline struct sock *inet_lookup(struct net *net,
diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..58c6fa9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -114,6 +114,7 @@ struct net;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
  *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
@@ -140,7 +141,8 @@ struct sock_common {
 	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_reuseport:1;
 	int			skc_bound_dev_if;
 	union {
 		struct hlist_node	skc_bind_node;
@@ -233,6 +235,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_reuseport		__sk_common.skc_reuseport
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
diff --git a/net/core/sock.c b/net/core/sock.c
index 7effa1e..801efc6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -497,6 +497,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	case SO_REUSEADDR:
 		sk->sk_reuse = valbool;
 		break;
+	case SO_REUSEPORT:
+		sk->sk_reuseport = valbool;
+		break;
 	case SO_TYPE:
 	case SO_PROTOCOL:
 	case SO_DOMAIN:
@@ -780,6 +783,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = sk->sk_reuse;
 		break;
 
+	case SO_REUSEPORT:
+		v.val = sk->sk_reuseport;
+		break;
+
 	case SO_KEEPALIVE:
 		v.val = !!sock_flag(sk, SOCK_KEEPOPEN);
 		break;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..02961c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -56,6 +56,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
 	struct sock *sk2;
 	struct hlist_node *node;
 	int reuse = sk->sk_reuse;
+	int reuseport = sk->sk_reuseport;
+	int uid = sock_i_uid((struct sock *)sk);
 
 	/*
 	 * Unlike other sk lookup places we do not check
@@ -70,8 +72,11 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
-			if (!reuse || !sk2->sk_reuse ||
-			    sk2->sk_state == TCP_LISTEN) {
+			if ((!reuse || !sk2->sk_reuse ||
+			    sk2->sk_state == TCP_LISTEN) &&
+			    (!reuseport || !sk2->sk_reuseport ||
+			    (sk2->sk_state != TCP_TIME_WAIT &&
+			     uid != sock_i_uid(sk2)))) {
 				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
@@ -96,6 +101,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	int ret, attempts = 5;
 	struct net *net = sock_net(sk);
 	int smallest_size = -1, smallest_rover;
+	int uid = sock_i_uid(sk);
 
 	local_bh_disable();
 	if (!snum) {
@@ -113,9 +119,12 @@ again:
 			spin_lock(&head->lock);
 			inet_bind_bucket_for_each(tb, node, &head->chain)
 				if (net_eq(ib_net(tb), net) && tb->port == rover) {
-					if (tb->fastreuse > 0 &&
-					    sk->sk_reuse &&
-					    sk->sk_state != TCP_LISTEN &&
+					if (((tb->fastreuse > 0 &&
+					      sk->sk_reuse &&
+					      sk->sk_state != TCP_LISTEN) ||
+					     (tb->fastreuseport > 0 &&
+					      sk->sk_reuseport &&
+					      tb->fastuid == uid)) &&
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
@@ -165,14 +174,18 @@ have_snum:
 	goto tb_not_found;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (tb->fastreuse > 0 &&
-		    sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+		if (((tb->fastreuse > 0 &&
+		      sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
+		     (tb->fastreuseport > 0 &&
+		      sk->sk_reuseport && tb->fastuid == uid)) &&
 		    smallest_size == -1) {
 			goto success;
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+				if (((sk->sk_reuse &&
+				      sk->sk_state != TCP_LISTEN) ||
+				     sk->sk_reuseport) &&
 				    smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
@@ -191,9 +204,23 @@ tb_not_found:
 			tb->fastreuse = 1;
 		else
 			tb->fastreuse = 0;
-	} else if (tb->fastreuse &&
-		   (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
-		tb->fastreuse = 0;
+		if (sk->sk_reuseport) {
+			tb->fastreuseport = 1;
+			tb->fastuid = uid;
+		} else {
+			tb->fastreuseport = 0;
+			tb->fastuid = 0;
+		}
+	} else {
+		if (tb->fastreuse &&
+		    (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
+			tb->fastreuse = 0;
+		if (tb->fastreuseport &&
+		    (!sk->sk_reuseport || tb->fastuid != uid)) {
+			tb->fastreuseport = 0;
+			tb->fastuid = 0;
+		}
+	}
 success:
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, snum);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..f7c23e1 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -38,6 +38,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 		write_pnet(&tb->ib_net, hold_net(net));
 		tb->port      = snum;
 		tb->fastreuse = 0;
+		tb->fastreuseport = 0;
 		tb->num_owners = 0;
 		INIT_HLIST_HEAD(&tb->owners);
 		hlist_add_head(&tb->node, &head->chain);
@@ -129,16 +130,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
 		__be32 rcv_saddr = inet->inet_rcv_saddr;
-		score = sk->sk_family == PF_INET ? 1 : 0;
+		score = sk->sk_family == PF_INET ? 2 : 1;
 		if (rcv_saddr) {
 			if (rcv_saddr != daddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -154,6 +155,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 
 struct sock *__inet_lookup_listener(struct net *net,
 				    struct inet_hashinfo *hashinfo,
+				    const __be32 saddr, __be16 sport,
 				    const __be32 daddr, const unsigned short hnum,
 				    const int dif)
 {
@@ -161,26 +163,39 @@ struct sock *__inet_lookup_listener(struct net *net,
 	struct hlist_nulls_node *node;
 	unsigned int hash = inet_lhashfn(net, hnum);
 	struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-	int score, hiscore;
+	int score, hiscore, matches = 0, reuseport = 0;
+	u32 phash = 0;
 
 	rcu_read_lock();
 begin:
 	result = NULL;
-	hiscore = -1;
+	hiscore = 0;
 	sk_nulls_for_each_rcu(sk, node, &ilb->head) {
 		score = compute_score(sk, net, hnum, daddr, dif);
 		if (score > hiscore) {
 			result = sk;
 			hiscore = score;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				phash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == hiscore && reuseport) {
+			matches++;
+			if (((u64)phash * matches) >> 32 == 0)
+				result = sk;
+			phash = inet_next_pseudo_random32(phash);
 		}
 	}
-	/*
-	 * if the nulls value we got at the end of this lookup is
-	 * not the expected one, we must restart lookup.
+/*
+ * if the nulls value we got at the end of this lookup is
+ * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
 	if (get_nulls_value(node) != hash + LISTENING_NULLS_BASE)
 		goto begin;
+
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
@@ -467,7 +482,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			inet_bind_bucket_for_each(tb, node, &head->chain) {
 				if (net_eq(ib_net(tb), net) &&
 				    tb->port == port) {
-					if (tb->fastreuse >= 0)
+					if (tb->fastreuse >= 0 ||
+					    tb->fastreuseport >= 0)
 						goto next_port;
 					WARN_ON(hlist_empty(&tb->owners));
 					if (!check_established(death_row, sk,
@@ -484,6 +500,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 				break;
 			}
 			tb->fastreuse = -1;
+			tb->fastreuseport = -1;
 			goto ok;
 
 		next_port:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad08392..71cf97d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1735,6 +1735,7 @@ do_time_wait:
 	case TCP_TW_SYN: {
 		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
 							&tcp_hashinfo,
+							iph->saddr, th->source,
 							iph->daddr, th->dest,
 							inet_iif(skb));
 		if (sk2) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 666b963..59404fa 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -135,6 +135,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 {
 	struct sock *sk2;
 	struct hlist_nulls_node *node;
+	int uid = sock_i_uid(sk);
 
 	sk_nulls_for_each(sk2, node, &hslot->head)
 		if (net_eq(sock_net(sk2), net) &&
@@ -143,6 +144,8 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
 		    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 		     sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
+		    (!sk2->sk_reuseport || !sk->sk_reuseport ||
+		      uid != sock_i_uid(sk2)) &&
 		    (*saddr_comp)(sk, sk2)) {
 			if (bitmap)
 				__set_bit(udp_sk(sk2)->udp_port_hash >> log,
@@ -332,26 +335,26 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 			!ipv6_only_sock(sk)) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_rcv_saddr) {
 			if (inet->inet_rcv_saddr != daddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -360,7 +363,6 @@ static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
 /*
  * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, inet_num)
  */
-#define SCORE2_MAX (1 + 2 + 2 + 2)
 static inline int compute_score2(struct sock *sk, struct net *net,
 				 __be32 saddr, __be16 sport,
 				 __be32 daddr, unsigned int hnum, int dif)
@@ -375,21 +377,21 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 		if (inet->inet_num != hnum)
 			return -1;
 
-		score = (sk->sk_family == PF_INET ? 1 : 0);
+		score = (sk->sk_family == PF_INET ? 2 : 1);
 		if (inet->inet_daddr) {
 			if (inet->inet_daddr != saddr)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (inet->inet_dport) {
 			if (inet->inet_dport != sport)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 		if (sk->sk_bound_dev_if) {
 			if (sk->sk_bound_dev_if != dif)
 				return -1;
-			score += 2;
+			score += 4;
 		}
 	}
 	return score;
@@ -404,19 +406,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash = 0;
 
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
 		score = compute_score2(sk, net, saddr, sport,
 				      daddr, hnum, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
-			if (score == SCORE2_MAX)
-				goto exact_match;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = inet_next_pseudo_random32(hash);
 		}
 	}
 	/*
@@ -428,7 +440,6 @@ begin:
 		goto begin;
 
 	if (result) {
-exact_match:
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
 		else if (unlikely(compute_score2(result, net, saddr, sport,
@@ -452,7 +463,8 @@ static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask);
 	struct udp_hslot *hslot2, *hslot = &udptable->hash[slot];
-	int score, badness;
+	int score, badness, matches = 0, reuseport = 0;
+	u32 hash;
 
 	rcu_read_lock();
 	if (hslot->count > 10) {
@@ -481,13 +493,24 @@ static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	}
 begin:
 	result = NULL;
-	badness = -1;
+	badness = 0;
 	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
 		score = compute_score(sk, net, saddr, hnum, sport,
 				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
+			reuseport = sk->sk_reuseport;
+			if (reuseport) {
+				hash = inet_ehashfn(net, daddr, hnum,
+				    saddr, htons(sport));
+				matches = 1;
+			}
+		} else if (score == badness && reuseport) {
+			matches++;
+			if (((u64)hash * matches) >> 32 == 0)
+				result = sk;
+			hash = inet_next_pseudo_random32(hash);
 		}
 	}
 	/*

^ permalink raw reply related

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: David Miller @ 2010-04-19  6:29 UTC (permalink / raw)
  To: eilong; +Cc: vladz, netdev, dmitry
In-Reply-To: <1271657181.27125.20.camel@lb-tlvb-eilong.il.broadcom.com>

From: "Eilon Greenstein" <eilong@broadcom.com>
Date: Mon, 19 Apr 2010 09:06:21 +0300

> Yes - running low on vmalloc is the reason. This patch does not restrict
> the usage of multi queue on 32bits - it is just changing the default
> queues values from the number of available CPUs to 1. If the user will
> use another value, it will work as before.

Changing the default to 1 is equivalent to disabling multi-queue.

> We encounter few issues were a 32bit kernel was installed on a
> multi-core platform and the driver allocated "too many" queues. One way
> to go is to limit the queue size and the other is the number of queues -
> leaving it as is caused issues due to running low on kernel virtual
> memory so a change was needed.

Look into using an allocation strategy other than vmalloc(), in fact
I can't see why you need to much memory that vmalloc() must be used
in the first place.

^ permalink raw reply

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: Eilon Greenstein @ 2010-04-19  6:06 UTC (permalink / raw)
  To: David Miller; +Cc: Vladislav Zolotarov, netdev@vger.kernel.org, Dmitry Kravkov
In-Reply-To: <20100418.211023.104052199.davem@davemloft.net>

On Sun, 2010-04-18 at 21:10 -0700, David Miller wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 18 Apr 2010 17:50:24 +0300
> 
> > The default was changed to save memory on 32bits systems.
> > 
> > Author: Dmitry Kravkov <dmitry@broadcom.com>
> > Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> > Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> > Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
> 
> This is absolutely rediculious.
> 
> There is no reason in the world to restrict multiqueue to only 64-bit
> systems.
> 
> If there is a memory allocation issue, fix that!  Is it vmalloc() space
> usage?

Yes - running low on vmalloc is the reason. This patch does not restrict
the usage of multi queue on 32bits - it is just changing the default
queues values from the number of available CPUs to 1. If the user will
use another value, it will work as before.

We encounter few issues were a 32bit kernel was installed on a
multi-core platform and the driver allocated "too many" queues. One way
to go is to limit the queue size and the other is the number of queues -
leaving it as is caused issues due to running low on kernel virtual
memory so a change was needed.

Dave, what is your preference:
- re-submitting this patch with a proper description?
- limit the size of each queue?
- other?

Thanks,
Eilon

^ permalink raw reply

* Re: [PATCH 0/13] bnx2x: Buf fixes and enhancements
From: Eilon Greenstein @ 2010-04-19  6:06 UTC (permalink / raw)
  To: David Miller
  Cc: Vladislav Zolotarov, netdev@vger.kernel.org, Dmitry Kravkov,
	Yaniv Rosner
In-Reply-To: <20100418.210723.71102685.davem@davemloft.net>

On Sun, 2010-04-18 at 21:07 -0700, David Miller wrote:
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 18 Apr 2010 17:47:26 +0300
> 
> > Dave, hi.
> > Pls., apply the following series of patches.
> > It includes a few bug fixes and enhancements.
> 
> Several of these patches are inappropriate or need corrections, and
> you also have not specified where you want these patches applied
> (net-2.6 or net-next-2.6)
> 

We will re-submit the path series with the following fixes:
- Omitting patch 11 (disabling LRO and leaving only GRO on XEN kernel on
compile time)
- restructuring patch 2 (using generic infrastructures for VPD-R-V0
access)
- Better description - especially on email 0...
- I'm sending another email regrading the number of queues on 32 bits
kernel.


Thanks for all the inputs,
Eilon



^ permalink raw reply

* Re: [net-2.6 PATCH] ixgbe: add IntMode module parameter
From: David Miller @ 2010-04-19  5:09 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, nicholasx.d.nunley, john.ronciak
In-Reply-To: <20100419050315.24758.62479.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Sun, 18 Apr 2010 22:03:59 -0700

> This patch adds the IntMode module parameter to ixgbe to allow
> user selection of interrupt mode (MSI-X, MSI, legacy) on driver
> load.  To work around the errata described above, ixgbe needs to
> be able to disable MSI-X for affected configurations.

This is not how we handle situations like this.

Here is one acceptable way to handle this:

Turn MSI/MSI-X off for every system that uses this PCI-E
complex chipset.  Add a whitelist that allows enabling.

Here is another:

Turn MSI/MSI-X on by default and have blacklists.

Anything that requires unusual user interactions, such as specifying
special module parameters, for correct operations is absolutely
unacceptable.

I don't want to hear how difficult it is to determine whether a
system will hit this bug or not.  If it's hard, just turn MSI
off unconditionally with this chipset until you can detect things
properly.

^ permalink raw reply

* [net-2.6 PATCH] ixgbe: add IntMode module parameter
From: Jeff Kirsher @ 2010-04-19  5:03 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Nicholas Nunley, John Ronciak, Jeff Kirsher

From: Nick Nunley <nicholasx.d.nunley@intel.com>

The 82598 has an erratum where some system designs can increase the
likelihood of a system hang due to a PCIe resource contention between the
82598 and another device in legacy INTx mode.

This issue can happen in any number of configurations with the 82598
and another device in legacy interrupt mode.  While we know of a
specific case where this is happening, this can happen with other
devices as well if configured in a certain way depending on the
system's slot configuration among other things.  This is alos why we
need to have the flexibility to disable MSI-X interrupts for the 82598
when these configurations are found having this deadlock situation.

The errata being worked - around with this patch - has to do with
any type of device that has a pending legacy INTx to the IOH's
internal IOAPIC which has priority from the shared resources
(round-robin to ensure forward progress / fairness).  The internal
IOAPIC is a PCI-E endpoint hence transactions to it are viewed as
peer-to-peer from a shared resource perspective.  To make forward
progress on the legacy INTx, the outbound completion FIFOs for the
IOU (resource pool1) are checked for available space due to shared
resources.  The IOH is allowed to have in/out dependency as a root
is full and not making forward progress because the 82598 has entered
its "erratum window"  where the 82598 is throttling the release of posted
credits back to the IOH as it has pending MSI-X interrupts to the IOH.
The 82598 will only release the posted credits back to IOH after the
inbound posted writes are serviced and the IOH releases the posted
credits back to the 82598 (i.e., the 82598 in/out dependency
erratum).  Since the priority is on the other device's legacy INTx,
the 82598 inbound posted transaction cannot make forward progress
that results in a link/forward progress deadlock condition.

The 82598 Specification Update has the errata list and can be found
here, it's erratum 38:
http://www.intel.com/design/network/specupdt/321040.pdf

This patch adds the IntMode module parameter to ixgbe to allow
user selection of interrupt mode (MSI-X, MSI, legacy) on driver
load.  To work around the errata described above, ixgbe needs to
be able to disable MSI-X for affected configurations.

Signed-off-by: Nicholas Nunley <nicholasx.d.nunley@intel.com>
Signed-off-by: John Ronciak <john.ronciak@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/Makefile      |    2 -
 drivers/net/ixgbe/ixgbe.h       |    1 
 drivers/net/ixgbe/ixgbe_main.c  |   26 ++++---
 drivers/net/ixgbe/ixgbe_param.c |  153 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 172 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ixgbe/ixgbe_param.c

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index 8f81efb..820491f 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
               ixgbe_82599.o ixgbe_82598.o ixgbe_phy.o ixgbe_sriov.o \
-              ixgbe_mbx.o
+              ixgbe_mbx.o ixgbe_param.o
 
 ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
                               ixgbe_dcb_82599.o ixgbe_dcb_nl.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..106f197 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -436,6 +436,7 @@ extern int ixgbe_copy_dcb_cfg(struct ixgbe_dcb_config *src_dcb_cfg,
 extern char ixgbe_driver_name[];
 extern const char ixgbe_driver_version[];
 
+extern void ixgbe_check_options(struct ixgbe_adapter *adapter);
 extern int ixgbe_up(struct ixgbe_adapter *adapter);
 extern void ixgbe_down(struct ixgbe_adapter *adapter);
 extern void ixgbe_reinit_locked(struct ixgbe_adapter *adapter);
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a98ff0e..d04e095 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3949,6 +3949,9 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	int err = 0;
 	int vector, v_budget;
 
+	if (!(adapter->flags & IXGBE_FLAG_MSIX_CAPABLE))
+		goto try_msi;
+
 	/*
 	 * It's easy to be greedy for MSI-X vectors, but it really
 	 * doesn't do us much good if we have a lot more vectors
@@ -3981,16 +3984,10 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 			goto out;
 	}
 
-	adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
-	adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
-	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
-	adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
-	adapter->atr_sample_rate = 0;
-	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
-		ixgbe_disable_sriov(adapter);
-
-	ixgbe_set_num_queues(adapter);
 
+try_msi:
+	if (!(adapter->flags & IXGBE_FLAG_MSI_CAPABLE))
+		goto legacy;
 	err = pci_enable_msi(adapter->pdev);
 	if (!err) {
 		adapter->flags |= IXGBE_FLAG_MSI_ENABLED;
@@ -4000,7 +3997,16 @@ static int ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 		/* reset err */
 		err = 0;
 	}
+legacy:
+	adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
+	adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
+	adapter->flags &= ~IXGBE_FLAG_FDIR_HASH_CAPABLE;
+	adapter->flags &= ~IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
+	adapter->atr_sample_rate = 0;
+	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+		ixgbe_disable_sriov(adapter);
 
+	ixgbe_set_num_queues(adapter);
 out:
 	return err;
 }
@@ -6199,6 +6205,8 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 		goto err_sw_init;
 	}
 
+	ixgbe_check_options(adapter);
+
 	ixgbe_probe_vf(adapter, ii);
 
 	netdev->features = NETIF_F_SG |
diff --git a/drivers/net/ixgbe/ixgbe_param.c b/drivers/net/ixgbe/ixgbe_param.c
new file mode 100644
index 0000000..98c5626
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_param.c
@@ -0,0 +1,153 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2010 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+#include "ixgbe.h"
+
+/* This is the only thing that needs to be changed to adjust the
+ * maximum number of ports that the driver can manage.
+ */
+
+#define IXGBE_MAX_NIC 32
+
+#define OPTION_UNSET    -1
+
+/* All parameters are treated the same, as an integer array of values.
+ * This macro just reduces the need to repeat the same declaration code
+ * over and over (plus this helps to avoid typo bugs).
+ */
+
+#define IXGBE_PARAM_INIT { [0 ... IXGBE_MAX_NIC] = OPTION_UNSET }
+#define IXGBE_PARAM(X, desc) \
+	static int __devinitdata X[IXGBE_MAX_NIC+1] = IXGBE_PARAM_INIT; \
+	static unsigned int num_##X; \
+	module_param_array_named(X, X, int, &num_##X, 0); \
+	MODULE_PARM_DESC(X, desc);
+
+/* IntMode (Interrupt Mode)
+ *
+ * Valid Range: 0-2
+ *  - 0 - Legacy Interrupt
+ *  - 1 - MSI Interrupt
+ *  - 2 - MSI-X Interrupt(s)
+ *
+ * Default Value: 2
+ */
+IXGBE_PARAM(IntMode, "Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2");
+#define IXGBE_INT_LEGACY		      0
+#define IXGBE_INT_MSI			      1
+#define IXGBE_INT_MSIX			      2
+#define IXGBE_DEFAULT_INT	 IXGBE_INT_MSIX
+
+struct ixgbe_option {
+	const char *name;
+	const char *err;
+	int def;
+	int min;
+	int max;
+};
+
+static int __devinit ixgbe_validate_option(unsigned int *value,
+					   struct ixgbe_option *opt,
+					   struct ixgbe_adapter *adapter)
+{
+	if (*value == OPTION_UNSET) {
+		*value = opt->def;
+		return 0;
+	}
+
+	if (*value >= opt->min && *value <= opt->max) {
+		dev_info(&adapter->pdev->dev,
+			 "%s set to %d\n", opt->name, *value);
+		return 0;
+	} else {
+		dev_info(&adapter->pdev->dev,
+			 "Invalid %s specified (%d),  %s\n",
+			 opt->name, *value, opt->err);
+		*value = opt->def;
+		return -1;
+	}
+}
+
+/**
+ * ixgbe_check_options - Range Checking for Command Line Parameters
+ * @adapter: board private structure
+ *
+ * This routine checks all command line parameters for valid user
+ * input.  If an invalid value is given, or if no user specified
+ * value exists, a default value is used.  The final value is stored
+ * in a variable in the adapter structure.
+ **/
+void __devinit ixgbe_check_options(struct ixgbe_adapter *adapter)
+{
+	int bd = adapter->bd_number;
+	u32 *aflags = &adapter->flags;
+
+	if (bd >= IXGBE_MAX_NIC) {
+		dev_notice(&adapter->pdev->dev,
+			   "Warning: no configuration for board #%d\n", bd);
+		dev_notice(&adapter->pdev->dev,
+			   "Using defaults for all values\n");
+	}
+
+	{ /* Interrupt Mode */
+		unsigned int int_mode;
+		static struct ixgbe_option opt = {
+			.name = "Interrupt Mode",
+			.err =
+			  "using default of "__MODULE_STRING(IXGBE_DEFAULT_INT),
+			.def = IXGBE_DEFAULT_INT,
+			.min = IXGBE_INT_LEGACY,
+			.max = IXGBE_INT_MSIX
+		};
+
+		if (num_IntMode > bd) {
+			int_mode = IntMode[bd];
+			ixgbe_validate_option(&int_mode, &opt, adapter);
+			switch (int_mode) {
+			case IXGBE_INT_MSIX:
+				*aflags |= IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			case IXGBE_INT_MSI:
+				*aflags &= ~IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			case IXGBE_INT_LEGACY:
+			default:
+				*aflags &= ~IXGBE_FLAG_MSIX_CAPABLE;
+				*aflags &= ~IXGBE_FLAG_MSI_CAPABLE;
+				break;
+			}
+		} else {
+			*aflags |= IXGBE_FLAG_MSIX_CAPABLE;
+			*aflags |= IXGBE_FLAG_MSI_CAPABLE;
+		}
+	}
+}


^ permalink raw reply related

* [PATCH RFC: linux-next 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-19  4:58 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100419045741.30276.23233.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   51 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..3e00d41 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1031,6 +1031,36 @@ next_desc:
 	return cleaned;
 }
 
+static unsigned int ixgbe_irq_affinity_callback(cpumask_var_t mask,
+                                                unsigned int irq, void *dev)
+{
+	struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)dev;
+	int i, q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
+
+	if (test_bit(__IXGBE_DOWN, &adapter->state))
+		return -EINVAL;
+
+	/*
+	 * Loop through the msix_entries array, looking for the vector that
+	 * matches the irq passed to us.  Once we find it, use that index to
+	 * grab the corresponding q_vector (1 to 1 mapping), and grab the
+	 * cpumask from that q_vector.
+	 */
+	for (i = 0; i < q_vectors; i++)
+		if (adapter->msix_entries[i].vector == irq)
+			break;
+
+	if (i == q_vectors)
+		return -EINVAL;
+
+	if (adapter->q_vector[i]->affinity_mask)
+		cpumask_copy(mask, adapter->q_vector[i]->affinity_mask);
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ixgbe_clean_rxonly(struct napi_struct *, int);
 /**
  * ixgbe_configure_msix - Configure MSI-X hardware
@@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint callback.
+		 */
+		alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           adapter,
+		                           &ixgbe_irq_affinity_callback);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3258,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3291,14 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* unregister the affinity_hint callback */
+		irq_unregister_affinity_hint(adapter->msix_entries[i].vector);
+		/* release the CPU mask memory */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);
@@ -4052,6 +4100,7 @@ static void ixgbe_free_q_vectors(struct ixgbe_adapter *adapter)
 		struct ixgbe_q_vector *q_vector = adapter->q_vector[q_idx];
 		adapter->q_vector[q_idx] = NULL;
 		netif_napi_del(&q_vector->napi);
+		free_cpumask_var(q_vector->affinity_mask);
 		kfree(q_vector);
 	}
 }

^ permalink raw reply related

* [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-19  4:57 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a callback function pointer to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
callback for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |   27 +++++++++++++++++++++++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..f2a7d0b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -78,6 +78,8 @@ enum {
 };
 
 typedef irqreturn_t (*irq_handler_t)(int, void *);
+typedef unsigned int (*irq_affinity_hint_t)(cpumask_var_t, unsigned int,
+                                            void *);
 
 /**
  * struct irqaction - per interrupt action descriptor
@@ -105,6 +107,18 @@ struct irqaction {
 	unsigned long thread_flags;
 };
 
+/**
+ * struct irqaffinityhint - per interrupt affinity helper
+ * @callback:	device driver callback function
+ * @dev:	reference for the affected device
+ * @irq:	interrupt number
+ */
+struct irqaffinityhint {
+	irq_affinity_hint_t callback;
+	void *dev;
+	int irq;
+};
+
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
@@ -209,6 +223,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                      irq_affinity_hint_t callback);
+extern int irq_unregister_affinity_hint(unsigned int irq);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +240,16 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                             irq_affinity_hint_t callback)
+{
+	return -EINVAL;
+}
+
+static inline int irq_unregister_affinity_hint(unsigned int irq);
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..bd73e9b 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	struct irqaffinityhint	*hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..3674b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,42 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, void *dev,
+                               irq_affinity_hint_t callback)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	if (!desc->hint)
+		desc->hint = kmalloc(sizeof(struct irqaffinityhint),
+		                     GFP_KERNEL);
+	desc->hint->callback = callback;
+	desc->hint->dev = dev;
+	desc->hint->irq = irq;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_register_affinity_hint);
+
+int irq_unregister_affinity_hint(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	kfree(desc->hint);
+	desc->hint = NULL;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_unregister_affinity_hint);
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +952,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint callback is cleaned up */
+	kfree(desc->hint);
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..59110a3 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,23 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	struct cpumask mask;
+	unsigned int ret = 0;
+
+	if (desc->hint && desc->hint->callback) {
+		ret = desc->hint->callback(&mask, (long)m->private,
+		                           desc->hint->dev);
+		if (!ret)
+			seq_cpumask(m, &mask);
+	}
+
+	seq_putc(m, '\n');
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -79,11 +96,23 @@ free_cpumask:
 	return err;
 }
 
+static ssize_t irq_affinity_hint_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *pos)
+{
+	/* affinity_hint is read-only from proc */
+	return -EOPNOTSUPP;
+}
+
 static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 {
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +121,14 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= irq_affinity_hint_proc_write,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +268,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,


^ permalink raw reply related

* Re: [PATCH] ks8842: Add module param for setting mac address
From: David Miller @ 2010-04-19  4:18 UTC (permalink / raw)
  To: richard.rojfors; +Cc: netdev
In-Reply-To: <1271611557.24099.11.camel@debian>

From: Richard Röjfors <richard.rojfors@pelagicore.com>
Date: Sun, 18 Apr 2010 19:25:57 +0200

> This patch adds a module parameter for setting the MAC address.
> 
> To ensure this MAC address is used, the MAC address is written
> after each hardware reset.
> 
> Signed-off-by: Richard Röjfors <richard.rojfors@pelagicore.com>

Your driver isn't unique, or special, and it doesn't need
to do special things to handle this situation.

Either probe the information from somewhere, or use a random
MAC address.

I'm not applying this.

^ permalink raw reply

* Re: [PATCH 10/13] bnx2x: Set default value of num_queues to 1 on 32-bit platforms
From: David Miller @ 2010-04-19  4:10 UTC (permalink / raw)
  To: vladz; +Cc: eilong, netdev, dmitry
In-Reply-To: <1271602224.27235.199.camel@lb-tlvb-vladz>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 18 Apr 2010 17:50:24 +0300

> The default was changed to save memory on 32bits systems.
> 
> Author: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

This is absolutely rediculious.

There is no reason in the world to restrict multiqueue to only 64-bit
systems.

If there is a memory allocation issue, fix that!  Is it vmalloc() space
usage?

You don't even describe what the hell the problem is, so nobody can
figure out _WHY_ you're making this change.  That's so incredibly
frustrating because it makes it impossible for anyone to sanely
evaluate the patch.

The more I read of this patch series, the more I am incredibly
disappointed with the poor quality of these changes.

This is the worst patch series submitted by Broadcom engieers,
ever!

^ permalink raw reply

* Re: [PATCH 0/13] bnx2x: Buf fixes and enhancements
From: David Miller @ 2010-04-19  4:07 UTC (permalink / raw)
  To: vladz; +Cc: netdev, eilong, dmitry, yanivr
In-Reply-To: <1271602046.27235.165.camel@lb-tlvb-vladz>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 18 Apr 2010 17:47:26 +0300

> Dave, hi.
> Pls., apply the following series of patches.
> It includes a few bug fixes and enhancements.

Several of these patches are inappropriate or need corrections, and
you also have not specified where you want these patches applied
(net-2.6 or net-next-2.6)

You also misspelled "bug" in the subject line here, which to me is yet
another sign that you put very little care into this patch series.  It
feels very "rushed" and "reactionary" rather than a well thought out
set of changes.

You really need to fix all of these issues, do the necessary
research, take your time, and resubmit a higher quality set
of well documented changes.

Thanks.

^ permalink raw reply

* Re: [PATCH 2/13] bnx2x: Use VPD-R V0 entry to display firmware revision
From: David Miller @ 2010-04-19  4:00 UTC (permalink / raw)
  To: bhutchings; +Cc: vladz, netdev, eilong, dmitry, mcarlson
In-Reply-To: <1271603308.3679.208.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Sun, 18 Apr 2010 16:08:28 +0100

> On Sun, 2010-04-18 at 17:48 +0300, Vladislav Zolotarov wrote:
>> Use VPD-R V0 entry to display firmware revision
> [...]
> 
> Matt Carlson already added VPD support functions and definitions to the
> PCI core; you should use them.

Agreed.

Really disappointed that you guy's can't keep track of generic
infrastructure created by co-workers let alone other people
in the networking developer community.

Judging by this and the LRO disabling change, I can only
come to the conclusion that you do development in something
similar to a bubble and are impervious to what's going on
outside of your domain :-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox