Netdev List
* Re: [PATCH] CONFIG_SMP should be CONFIG_RPS
From: Eric Dumazet @ 2010-04-13  6:59 UTC
  To: Changli Gao; +Cc: David S. Miller, netdev, Tom Herbert
In-Reply-To: <1271169406-8115-1-git-send-email-xiaosuo@gmail.com>

On Tuesday, 13 April 2010 at 22:36 +0800, Changli Gao wrote:
> CONFIG_SMP should be CONFIG_RPS
> 
> Guard the csd member of struct softnet_data with CONFIG_RPS rather than
> CONFIG_SMP, since it is only used by RPS.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  include/linux/netdevice.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d1a21b5..0efb36e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1331,7 +1331,7 @@ struct softnet_data {
>  	struct sk_buff		*completion_queue;
>  
>  	/* Elements below can be accessed between CPUs for RPS */
> -#ifdef CONFIG_SMP
> +#ifdef CONFIG_RPS
>  	struct call_single_data	csd ____cacheline_aligned_in_smp;
>  #endif
>  	struct sk_buff_head	input_pkt_queue;

Thanks Changli, this is part of the RFS patches :)




* [PATCH] CONFIG_SMP should be CONFIG_RPS
From: Changli Gao @ 2010-04-13 14:36 UTC
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Tom Herbert, Changli Gao

CONFIG_SMP should be CONFIG_RPS

Guard the csd member of struct softnet_data with CONFIG_RPS rather than
CONFIG_SMP, since it is only used by RPS.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..0efb36e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1331,7 +1331,7 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 
 	/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 #endif
 	struct sk_buff_head	input_pkt_queue;


* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-13  5:56 UTC
  To: Changli Gao; +Cc: Benny Amorsen, zhigang gong, netdev
In-Reply-To: <u2y412e6f7f1004121618p6d6eff30q8a45a03faa59a912@mail.gmail.com>

On Tuesday, 13 April 2010 at 07:18 +0800, Changli Gao wrote:
> On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen <benny+usenet@amorsen.dk> wrote:
> >
> >  99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
> >  100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
> >  101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
> >  102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
> >  103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
> >  104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
> >  105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
> >  106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
> >  107:          0          0          1          0   PCI-MSI-edge      eth1
> >  108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
> >  109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
> >  110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
> >  111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
> >  112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
> >  113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
> >  114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
> >  115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
> >  116:          0          1          0          0   PCI-MSI-edge      eth0
> >
> >
> > irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
> > which to my mind should cause the same problem as before (where CPU1 and
> > CPU3 were handling all packets). Yet the box clearly works much better
> > than before.
> 
> irqbalanced? I don't think it can work properly. Try RPS from the netdev and
> linux-next trees, and if the CPU load isn't even, try this patch:
> http://patchwork.ozlabs.org/patch/49915/ .
> 
> 

Don't try RPS on multiqueue devices!

If the number of queues matches the number of CPUs, it brings nothing but
extra latency!

Benny, I am not sure your irqbalance is up to date with multiqueue devices;
you might need to disable it and set the affinity of each interrupt manually:

echo 01 >/proc/irq/100/smp_affinity
echo 02 >/proc/irq/101/smp_affinity
echo 04 >/proc/irq/102/smp_affinity
echo 08 >/proc/irq/103/smp_affinity
echo 10 >/proc/irq/104/smp_affinity
echo 20 >/proc/irq/105/smp_affinity
echo 40 >/proc/irq/106/smp_affinity
echo 80 >/proc/irq/107/smp_affinity

echo 01 >/proc/irq/108/smp_affinity
echo 02 >/proc/irq/109/smp_affinity
echo 04 >/proc/irq/110/smp_affinity
echo 08 >/proc/irq/111/smp_affinity
echo 10 >/proc/irq/112/smp_affinity
echo 20 >/proc/irq/113/smp_affinity
echo 40 >/proc/irq/114/smp_affinity
echo 80 >/proc/irq/115/smp_affinity
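
For reference, each value written above is a hexadecimal CPU bitmask with a
single bit set: bit n selects CPU n. A minimal userspace sketch of that
mapping follows. The helper names are made up for illustration, and the
sketch assumes fewer than 32 CPUs so that the mask fits in one word.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: mask that pins an IRQ to exactly one CPU.
 * Assumes cpu < 32 so the mask fits in a single 32-bit word. */
static unsigned int affinity_mask_for_cpu(unsigned int cpu)
{
	assert(cpu < 32);
	return 1u << cpu;
}

/* Print the shell command that would pin `irq` to `cpu`. */
static void print_affinity_cmd(unsigned int irq, unsigned int cpu)
{
	printf("echo %02x >/proc/irq/%u/smp_affinity\n",
	       affinity_mask_for_cpu(cpu), irq);
}
```

Calling print_affinity_cmd() once per queue interrupt, with a different CPU
for each queue of a device, spreads the queues the way the list above does.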




* Re: [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API
From: FUJITA Tomonori @ 2010-04-13  5:27 UTC
  To: linux; +Cc: fujita.tomonori, netdev, davem, linux-arm-kernel, linux-kernel
In-Reply-To: <20100412193536.GO3048@n2100.arm.linux.org.uk>

On Mon, 12 Apr 2010 20:35:36 +0100
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Mon, Apr 05, 2010 at 12:39:32PM +0900, FUJITA Tomonori wrote:
> > I don't have arm hardware that uses dmabounce so I can't confirm the
> > problem but seems that dmabounce doesn't work for some drivers...
> 
> Patch reviews fine, except for one niggle.  I too don't have hardware
> I can test (well, I do except the kernel stopped supporting the UDA1341
> audio codec on the SA1110 Neponset.)

Thanks for reviewing.

> > @@ -171,10 +172,17 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_
> >  	read_lock_irqsave(&device_info->lock, flags);
> >  
> >  	list_for_each_entry(b, &device_info->safe_buffers, node)
> > -		if (b->safe_dma_addr == safe_dma_addr) {
> > -			rb = b;
> > -			break;
> > -		}
> > +		if (for_sync) {
> > +			if (b->safe_dma_addr <= safe_dma_addr &&
> > +			    safe_dma_addr < b->safe_dma_addr + b->size) {
> > +				rb = b;
> > +				break;
> > +			}
> > +		} else
> > +			if (b->safe_dma_addr == safe_dma_addr) {
> > +				rb = b;
> > +				break;
> > +			}
> 
> This is the niggle; I don't like this indentation style.  If you want to
> indent this if () statement, then please format like this:
> 
> 		} else {
> 			if (b->safe...) {
> 				...
> 			}
> 		}
> 
> or format it as:
> 
> 		} else if (b->safe...) {
> 			...
> 		}

ok, here's the fixed patch.

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API

Some network drivers do a partial sync with
dma_sync_single_for_{device|cpu}. The dma_addr argument might not be
the same as the one passed into the mapping API.

This makes find_safe_buffer() match any address that falls inside a
safe buffer when it is called for dma_sync_single_for_{device|cpu}.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/arm/common/dmabounce.c |   30 +++++++++++++++++++++---------
 1 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index cc0a932..2e6deec 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -163,7 +163,8 @@ alloc_safe_buffer(struct dmabounce_device_info *device_info, void *ptr,
 
 /* determine if a buffer is from our "safe" pool */
 static inline struct safe_buffer *
-find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr)
+find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr,
+		 int for_sync)
 {
 	struct safe_buffer *b, *rb = NULL;
 	unsigned long flags;
@@ -171,9 +172,17 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_
 	read_lock_irqsave(&device_info->lock, flags);
 
 	list_for_each_entry(b, &device_info->safe_buffers, node)
-		if (b->safe_dma_addr == safe_dma_addr) {
-			rb = b;
-			break;
+		if (for_sync) {
+			if (b->safe_dma_addr <= safe_dma_addr &&
+			    safe_dma_addr < b->safe_dma_addr + b->size) {
+				rb = b;
+				break;
+			}
+		} else {
+			if (b->safe_dma_addr == safe_dma_addr) {
+				rb = b;
+				break;
+			}
 		}
 
 	read_unlock_irqrestore(&device_info->lock, flags);
@@ -205,7 +214,8 @@ free_safe_buffer(struct dmabounce_device_info *device_info, struct safe_buffer *
 /* ************************************************** */
 
 static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
-		dma_addr_t dma_addr, const char *where)
+						dma_addr_t dma_addr, const char *where,
+						int for_sync)
 {
 	if (!dev || !dev->archdata.dmabounce)
 		return NULL;
@@ -216,7 +226,7 @@ static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
 			pr_err("unknown device: Trying to %s invalid mapping\n", where);
 		return NULL;
 	}
-	return find_safe_buffer(dev->archdata.dmabounce, dma_addr);
+	return find_safe_buffer(dev->archdata.dmabounce, dma_addr, for_sync);
 }
 
 static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
@@ -286,7 +296,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
 static inline void unmap_single(struct device *dev, dma_addr_t dma_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap");
+	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap", 0);
 
 	if (buf) {
 		BUG_ON(buf->size != size);
@@ -398,7 +408,7 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
@@ -411,6 +421,8 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	DO_STATS(dev->archdata.dmabounce->bounce_count++);
 
 	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) {
+		if (addr != buf->safe_dma_addr)
+			off = addr - buf->safe_dma_addr;
 		dev_dbg(dev, "%s: copy back safe %p to unsafe %p size %d\n",
 			__func__, buf->safe + off, buf->ptr + off, sz);
 		memcpy(buf->ptr + off, buf->safe + off, sz);
@@ -427,7 +439,7 @@ int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
-- 
1.6.5
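
The lookup rule in the patch above can be modelled in userspace as follows.
This is only a sketch: struct toy_safe_buffer and both helpers are
stand-ins invented for illustration, a plain array replaces the kernel
list, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the bounce-buffer bookkeeping; not the kernel struct. */
struct toy_safe_buffer {
	uint32_t safe_dma_addr;	/* start of the bounce buffer */
	size_t size;		/* length given to the mapping call */
};

/*
 * Exact match when for_sync == 0 (unmap), containment match when
 * for_sync != 0 (sync), mirroring the two branches in the patch.
 */
static const struct toy_safe_buffer *
toy_find_safe_buffer(const struct toy_safe_buffer *bufs, size_t n,
		     uint32_t addr, int for_sync)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_safe_buffer *b = &bufs[i];

		if (for_sync) {
			if (b->safe_dma_addr <= addr &&
			    addr < b->safe_dma_addr + b->size)
				return b;
		} else if (b->safe_dma_addr == addr) {
			return b;
		}
	}
	return NULL;
}

/* Offset of a partial sync inside the buffer that contains it. */
static size_t toy_sync_offset(const struct toy_safe_buffer *b, uint32_t addr)
{
	assert(b->safe_dma_addr <= addr && addr < b->safe_dma_addr + b->size);
	return addr - b->safe_dma_addr;
}
```

With a buffer mapped at 0x1000 and 0x800 bytes long, a sync at 0x1040 is
found only by the containment match and yields an offset of 0x40.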


* Re: [PATCH] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13  5:19 UTC
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100412.214453.57473458.davem@davemloft.net>

On Tue, Apr 13, 2010 at 12:44 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Tue, 13 Apr 2010 18:41:08 +0800
>
>> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
>> contention and irq disabling/enabling.
>
> In exchange for a bunch of new atomic operations.
>
> No, thanks.
>

Oh, I can eliminate the atomic operations entirely if I use another
variable instead. I'll submit another version later.


* Re: [PATCH] net: batch skb dequeueing from softnet input_pkt_queue
From: David Miller @ 2010-04-13  4:44 UTC
  To: xiaosuo; +Cc: eric.dumazet, netdev
In-Reply-To: <1271155268-2999-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Tue, 13 Apr 2010 18:41:08 +0800

> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.

In exchange for a bunch of new atomic operations.

No, thanks.


* Re: [Bugme-new] [Bug 15720] New: IPv6's ipv4-compatibility addresses don't bind
From: Andrew Morton @ 2010-04-13  1:22 UTC
  To: netdev; +Cc: bugzilla-daemon, bugme-daemon, charles
In-Reply-To: <bug-15720-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 7 Apr 2010 23:17:32 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15720
> 
>            Summary: IPv6's ipv4-compatibility addresses don't bind

A 2.6.9 -> 2.6.32 regression ;)

>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.32-2-686-bigmem
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV6
>         AssignedTo: yoshfuji@linux-ipv6.org
>         ReportedBy: charles@kde.org
>         Regression: Yes
> 
> 
> When attempting to bind to an address using ipv4-compatibility, for example,
> "::ffff:127.0.0.1", Linux refuses to bind to that address when
> /proc/sys/net/ipv6/bindv6only is set.
> 
> Yes, you could say "but you specifically told ipv6 to not bind to ipv4
> addresses!" However, ::ffff:127.0.0.1 is *clearly* an ipv4 address, it's not an
> alternate representation of an ipv6 address, it's an ipv4 address and only
> ipv4.
> 
> This seems to not have been the case as of linux 2.6.9, although I'm not sure
> at what version this changed.
> 
> It seems to me that the intent of "bindv6only" was to not bind to the ipv4
> address when you bind to all addresses (specifically ipv6 address "::"). So
> when you bind to ::, an ipv4 client connects to you, and it appears to be
> connecting from ::ffff:192.168.5.5. I don't think its intent was to effectively
> disable binding to ::ffff:x.x.x.x addresses - just breaking that feature makes
> no sense.
> 
> The Linux 2.6.9 approach seems to match MacOS's (and I'm pretty sure Solaris's,
> too).
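
For reference, addresses of the form ::ffff:a.b.c.d are what RFC 4291 calls
IPv4-mapped IPv6 addresses (the "IPv4-compatible" form is ::a.b.c.d). A
minimal userspace sketch for telling them apart follows; is_v4_mapped() is
a name made up for illustration, while inet_pton() and
IN6_IS_ADDR_V4MAPPED() are the standard socket helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Returns 1 if `text` parses as an IPv4-mapped IPv6 address, 0 if it is
 * some other IPv6 address, and -1 if it does not parse as IPv6 at all.
 */
static int is_v4_mapped(const char *text)
{
	struct in6_addr a;

	assert(text != NULL);
	if (inet_pton(AF_INET6, text, &a) != 1)
		return -1;
	return IN6_IS_ADDR_V4MAPPED(&a) ? 1 : 0;
}
```

Under this check, "::ffff:127.0.0.1" is IPv4-mapped while "::" and "::1"
are not.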




* Re: [PATCH] rps: add flow director support
From: Changli Gao @ 2010-04-13  3:11 UTC
  To: Tom Herbert; +Cc: David S. Miller, netdev
In-Reply-To: <h2m65634d661004121013uf2c86b81ndded3bb138dee7a9@mail.gmail.com>

On Tue, Apr 13, 2010 at 1:13 AM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Apr 12, 2010 at 7:27 AM, Changli Gao <xiaosuo@gmail.com> wrote:
>> On Mon, Apr 12, 2010 at 9:34 PM, Tom Herbert <therbert@google.com> wrote:
>>
>
> Ideally, this should replace rps_cpus if it's a better interface....
> right now these would be conflicting interfaces.
>

How about sw-rxs and sw-rx-$ (SoftWare Receive queue)? It is a
software emulation of a hardware receive queue, as softnet is to the NIC.
sw-rx-$ is the equivalent of /proc/irq/$/smp_affinity, except that it
takes a cpuid instead of a cpumask.

Does anyone support this interface replacing the current rps_cpus? I do. :)

>
> It's probably a little more work, but the CPU->weight mappings could
> be implemented to cause minimal disruption in the rps_map.  Also, if
> OOO is an issue, then the mitigation technique in RFS could be applied
> (this will work best when hash table is larger I believe).
>

I am thinking about the cost of keeping packets in order. Is it really
worth it for this kind of random migration?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* [PATCH] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13 10:41 UTC
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Changli Gao

batch skb dequeueing from softnet input_pkt_queue

batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention and irq disabling/enabling.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |    1 +
 net/core/dev.c            |   36 +++++++++++++++++++++++++++---------
 2 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f3f8cca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1336,6 +1336,7 @@ struct softnet_data {
 #endif
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
+	atomic_t		input_qlen;
 };
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..8816204 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2339,10 +2339,11 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 	__get_cpu_var(netdev_rx_stat).total++;
 
 	rps_lock(queue);
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
+	if (atomic_read(&queue->input_qlen) <= netdev_max_backlog) {
+		if (atomic_read(&queue->input_qlen)) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			atomic_inc(&queue->input_qlen);
 			rps_unlock(queue);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2801,6 +2802,7 @@ static void flush_backlog(void *arg)
 	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
 		if (skb->dev == dev) {
 			__skb_unlink(skb, &queue->input_pkt_queue);
+			atomic_dec(&queue->input_qlen);
 			kfree_skb(skb);
 		}
 	rps_unlock(queue);
@@ -3111,25 +3113,38 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
+	struct sk_buff_head skb_queue;
 
+	__skb_queue_head_init(&skb_queue);
 	napi->weight = weight_p;
 	do {
 		struct sk_buff *skb;
 
 		local_irq_disable();
 		rps_lock(queue);
-		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
+		skb_queue_splice_tail_init(&queue->input_pkt_queue, &skb_queue);
+		if (skb_queue_empty(&skb_queue)) {
 			__napi_complete(napi);
-			rps_unlock(queue);
-			local_irq_enable();
 			break;
 		}
 		rps_unlock(queue);
 		local_irq_enable();
 
-		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+		while ((skb = __skb_dequeue(&skb_queue))) {
+			atomic_dec(&queue->input_qlen);
+			__netif_receive_skb(skb);
+			if (++work < quota && jiffies == start_time)
+				continue;
+			local_irq_disable();
+			rps_lock(queue);
+			skb_queue_splice(&skb_queue, &queue->input_pkt_queue);
+			goto out;
+		}
+	} while (1);
+
+out:
+	rps_unlock(queue);
+	local_irq_enable();
 
 	return work;
 }
@@ -5488,8 +5503,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
+		atomic_dec(&oldsd->input_qlen);
 		netif_rx(skb);
+	}
 
 	return NOTIFY_OK;
 }
@@ -5709,6 +5726,7 @@ static int __init net_dev_init(void)
 
 		queue = &per_cpu(softnet_data, i);
 		skb_queue_head_init(&queue->input_pkt_queue);
+		atomic_set(&queue->input_qlen, 0);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 

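
The batching idea in the patch above can be shown with a small userspace
model: move the whole shared queue to a private list in one locked step,
then process the packets without holding the lock. Everything here is a
stand-in invented for illustration. The lock is modelled only by a counter,
and the quota check, the time check, the re-splicing of leftovers and the
atomic length counter from the patch are all left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pkt {
	struct toy_pkt *next;
	int id;
};

struct toy_queue {
	struct toy_pkt *head, *tail;
	unsigned int lock_round_trips;	/* stands in for lock/unlock pairs */
};

static void toy_enqueue(struct toy_queue *q, struct toy_pkt *p)
{
	assert(p != NULL);
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/* Move every queued packet to a private list in O(1). */
static struct toy_pkt *toy_splice_all(struct toy_queue *q)
{
	struct toy_pkt *batch;

	q->lock_round_trips++;		/* the lock is taken once per batch */
	batch = q->head;
	q->head = q->tail = NULL;
	return batch;
}

/* Drain the queue in batches; returns the number of packets processed. */
static int toy_process_backlog(struct toy_queue *q)
{
	struct toy_pkt *batch;
	int work = 0;

	while ((batch = toy_splice_all(q)) != NULL) {
		while (batch) {
			struct toy_pkt *p = batch;

			batch = p->next;
			(void)p->id;	/* packet handled here, lock not held */
			work++;
		}
	}
	return work;
}
```

Three queued packets cost two lock round trips in this model (one for the
batch, one to find the queue empty) instead of four with per-packet
dequeueing.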

* Re: Linux arp flux problem
From: Ming-Ching Tiew @ 2010-04-13  2:18 UTC
  To: Net Dev
In-Reply-To: <908408.23515.qm@web31502.mail.mud.yahoo.com>



--- On Mon, 4/12/10, Ming-Ching Tiew <mctiew@yahoo.com> wrote:

> From: Ming-Ching Tiew <mctiew@yahoo.com>
> Subject: Linux arp flux problem
> To: "Net Dev" <netdev@vger.kernel.org>
> Date: Monday, April 12, 2010, 3:16 AM
> 
> The following link explains the Linux arp flux problem
> pretty well, and I myself have been burnt badly by a live
> site where the "arp_filter" does not help at all.
> 
>          http://linux-ip.net/html/ether-arp.html
> 
> And I tested the kernel patch by Julian Anastasov, and it
> works 100% reliably :-
> 
>          http://www.ssi.bg/~ja/#hidden
> 
> My question is: the patch has been around for many years,
> so why has it not been included in the kernel? Is Linux
> supposed to have this "side effect" of arp flux
> on purpose?
> 

May I propose that the said patch ( http://www.ssi.bg/~ja/#hidden )
be accepted into the kernel? And if it does not qualify, may I know why?

Thank you.


      


* Re: [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Tetsuo Handa @ 2010-04-13  1:21 UTC
  To: amwang
  Cc: opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel
In-Reply-To: <20100412100816.5302.74919.sendpatchset@localhost.localdomain>

Hello.

> --- linux-2.6.orig/drivers/infiniband/core/cma.c
> +++ linux-2.6/drivers/infiniband/core/cma.c
> @@ -1980,6 +1980,8 @@ retry:
>  	/* FIXME: add proper port randomization per like inet_csk_get_port */
>  	do {
>  		ret = idr_get_new_above(ps, bind_list, next_port, &port);
> +		if (!ret && inet_is_reserved_local_port(port))
> +			ret = -EAGAIN;
>  	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
>  
>  	if (ret)
> 
I think the above part is wrong. The program below
--------------------
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/delay.h>	/* msleep() */
#include <linux/idr.h>

static DEFINE_IDR(idr);
static int idr_demo_init(void)
{
	int next_port = 65530;
	int port = 0;
	int ret = -EINTR;
	while (!signal_pending(current)) {
		msleep(1000);
		ret = idr_get_new_above(&idr, NULL, next_port, &port);
		printk(KERN_INFO "idr_get_new_above() = %d\n", ret);
		if (!ret) {
			/* Emulate inet_is_reserved_local_port(port) = true */
			printk(KERN_INFO "Port %u is reserved.\n", port);
			ret = -EAGAIN;
		}
		if (ret == -EAGAIN) {
			if (idr_pre_get(&idr, GFP_KERNEL)) {
				printk(KERN_INFO "idr_pre_get() succeeded.\n");
				continue;
			}
			printk(KERN_INFO "idr_pre_get() failed.\n");
			break;
		} else {
			printk(KERN_INFO "next_port=%u port=%u\n",
			       next_port, port);
			break;
		}
	}
	if (!ret)
		idr_remove(&idr, port);
	idr_destroy(&idr);
	return -EINVAL;
}
module_init(idr_demo_init);
MODULE_LICENSE("GPL");
--------------------
generated the output below.

idr_get_new_above() = -11
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65530 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65531 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65532 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65533 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65534 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65535 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65536 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
Port 65537 is reserved.
idr_pre_get() succeeded.
idr_get_new_above() = 0
(...snipped...)

This result suggests that, if all ports were reserved, the above loop would
continue until idr_pre_get() fails because it runs out of memory.

Also, if idr_get_new_above() returned 0, bind_list (which is a kmalloc()ed
pointer) has already been installed into a free slot (see the comment on
idr_get_new_above_int()). Thus, simply calling idr_get_new_above() again will
install the same pointer into multiple slots. I guess it will malfunction later.
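
Note also that the IDs in the output above run past 65535. One possible
shape for a loop that avoids both problems is sketched below in userspace:
a reserved hit is released again before retrying, and the search stops at
the top of the port range. This is an assumption-laden model, not a tested
fix for cma.c. The toy_* allocator is a stand-in for the idr, and the
"every port from 65530 up is reserved" policy is invented for the demo.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PORT 65535

/* Userspace stand-in for an ID allocator; not the kernel idr. */
struct toy_ids {
	bool used[TOY_MAX_PORT + 1];
};

/* Claim the lowest free id >= start; returns -1 when none is left. */
static int toy_get_new_above(struct toy_ids *ids, int start)
{
	for (int id = start; id <= TOY_MAX_PORT; id++) {
		if (!ids->used[id]) {
			ids->used[id] = true;
			return id;
		}
	}
	return -1;
}

static void toy_remove(struct toy_ids *ids, int id)
{
	ids->used[id] = false;
}

/* Hypothetical policy for the demo: every port >= 65530 is reserved. */
static bool toy_is_reserved(int port)
{
	return port >= 65530;
}

/*
 * Pick a non-reserved port >= next_port. A reserved hit is released and
 * the search restarts one past it, so the loop ends at TOY_MAX_PORT
 * instead of walking upwards without bound.
 */
static int toy_pick_port(struct toy_ids *ids, int next_port)
{
	assert(next_port >= 0);
	while (next_port <= TOY_MAX_PORT) {
		int port = toy_get_new_above(ids, next_port);

		if (port < 0)
			return -1;
		if (!toy_is_reserved(port))
			return port;
		toy_remove(ids, port);	/* do not leave the slot occupied */
		next_port = port + 1;
	}
	return -1;
}
```

Starting at 65530 with this policy, toy_pick_port() fails cleanly after six
attempts and leaves no slot occupied.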


* [PATCH] myri10ge: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC
  To: netdev; +Cc: fujita.tomonori, Andrew Gallatin, Brice Goglin
In-Reply-To: <1271118734-28353-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents, since the PCI DMA state API will become obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Andrew Gallatin <gallatin@myri.com>
Cc: Brice Goglin <brice@myri.com>
---
 drivers/net/myri10ge/myri10ge.c |   44 +++++++++++++++++++-------------------
 1 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 72b4b19..958dc28 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -110,15 +110,15 @@ MODULE_LICENSE("Dual BSD/GPL");
 struct myri10ge_rx_buffer_state {
 	struct page *page;
 	int page_offset;
-	 DECLARE_PCI_UNMAP_ADDR(bus)
-	 DECLARE_PCI_UNMAP_LEN(len)
+	DEFINE_DMA_UNMAP_ADDR(bus);
+	DEFINE_DMA_UNMAP_LEN(len);
 };
 
 struct myri10ge_tx_buffer_state {
 	struct sk_buff *skb;
 	int last;
-	 DECLARE_PCI_UNMAP_ADDR(bus)
-	 DECLARE_PCI_UNMAP_LEN(len)
+	DEFINE_DMA_UNMAP_ADDR(bus);
+	DEFINE_DMA_UNMAP_LEN(len);
 };
 
 struct myri10ge_cmd {
@@ -1234,7 +1234,7 @@ myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct myri10ge_rx_buf *rx,
 		rx->info[idx].page_offset = rx->page_offset;
 		/* note that this is the address of the start of the
 		 * page */
-		pci_unmap_addr_set(&rx->info[idx], bus, rx->bus);
+		dma_unmap_addr_set(&rx->info[idx], bus, rx->bus);
 		rx->shadow[idx].addr_low =
 		    htonl(MYRI10GE_LOWPART_TO_U32(rx->bus) + rx->page_offset);
 		rx->shadow[idx].addr_high =
@@ -1266,7 +1266,7 @@ myri10ge_unmap_rx_page(struct pci_dev *pdev,
 	/* unmap the recvd page if we're the only or last user of it */
 	if (bytes >= MYRI10GE_ALLOC_SIZE / 2 ||
 	    (info->page_offset + 2 * bytes) > MYRI10GE_ALLOC_SIZE) {
-		pci_unmap_page(pdev, (pci_unmap_addr(info, bus)
+		pci_unmap_page(pdev, (dma_unmap_addr(info, bus)
 				      & ~(MYRI10GE_ALLOC_SIZE - 1)),
 			       MYRI10GE_ALLOC_SIZE, PCI_DMA_FROMDEVICE);
 	}
@@ -1373,21 +1373,21 @@ myri10ge_tx_done(struct myri10ge_slice_state *ss, int mcp_index)
 			tx->info[idx].last = 0;
 		}
 		tx->done++;
-		len = pci_unmap_len(&tx->info[idx], len);
-		pci_unmap_len_set(&tx->info[idx], len, 0);
+		len = dma_unmap_len(&tx->info[idx], len);
+		dma_unmap_len_set(&tx->info[idx], len, 0);
 		if (skb) {
 			ss->stats.tx_bytes += skb->len;
 			ss->stats.tx_packets++;
 			dev_kfree_skb_irq(skb);
 			if (len)
 				pci_unmap_single(pdev,
-						 pci_unmap_addr(&tx->info[idx],
+						 dma_unmap_addr(&tx->info[idx],
 								bus), len,
 						 PCI_DMA_TODEVICE);
 		} else {
 			if (len)
 				pci_unmap_page(pdev,
-					       pci_unmap_addr(&tx->info[idx],
+					       dma_unmap_addr(&tx->info[idx],
 							      bus), len,
 					       PCI_DMA_TODEVICE);
 		}
@@ -2094,20 +2094,20 @@ static void myri10ge_free_rings(struct myri10ge_slice_state *ss)
 		/* Mark as free */
 		tx->info[idx].skb = NULL;
 		tx->done++;
-		len = pci_unmap_len(&tx->info[idx], len);
-		pci_unmap_len_set(&tx->info[idx], len, 0);
+		len = dma_unmap_len(&tx->info[idx], len);
+		dma_unmap_len_set(&tx->info[idx], len, 0);
 		if (skb) {
 			ss->stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 			if (len)
 				pci_unmap_single(mgp->pdev,
-						 pci_unmap_addr(&tx->info[idx],
+						 dma_unmap_addr(&tx->info[idx],
 								bus), len,
 						 PCI_DMA_TODEVICE);
 		} else {
 			if (len)
 				pci_unmap_page(mgp->pdev,
-					       pci_unmap_addr(&tx->info[idx],
+					       dma_unmap_addr(&tx->info[idx],
 							      bus), len,
 					       PCI_DMA_TODEVICE);
 		}
@@ -2761,8 +2761,8 @@ again:
 	idx = tx->req & tx->mask;
 	tx->info[idx].skb = skb;
 	bus = pci_map_single(mgp->pdev, skb->data, len, PCI_DMA_TODEVICE);
-	pci_unmap_addr_set(&tx->info[idx], bus, bus);
-	pci_unmap_len_set(&tx->info[idx], len, len);
+	dma_unmap_addr_set(&tx->info[idx], bus, bus);
+	dma_unmap_len_set(&tx->info[idx], len, len);
 
 	frag_cnt = skb_shinfo(skb)->nr_frags;
 	frag_idx = 0;
@@ -2865,8 +2865,8 @@ again:
 		len = frag->size;
 		bus = pci_map_page(mgp->pdev, frag->page, frag->page_offset,
 				   len, PCI_DMA_TODEVICE);
-		pci_unmap_addr_set(&tx->info[idx], bus, bus);
-		pci_unmap_len_set(&tx->info[idx], len, len);
+		dma_unmap_addr_set(&tx->info[idx], bus, bus);
+		dma_unmap_len_set(&tx->info[idx], len, len);
 	}
 
 	(req - rdma_count)->rdma_count = rdma_count;
@@ -2903,19 +2903,19 @@ abort_linearize:
 	idx = tx->req & tx->mask;
 	tx->info[idx].skb = NULL;
 	do {
-		len = pci_unmap_len(&tx->info[idx], len);
+		len = dma_unmap_len(&tx->info[idx], len);
 		if (len) {
 			if (tx->info[idx].skb != NULL)
 				pci_unmap_single(mgp->pdev,
-						 pci_unmap_addr(&tx->info[idx],
+						 dma_unmap_addr(&tx->info[idx],
 								bus), len,
 						 PCI_DMA_TODEVICE);
 			else
 				pci_unmap_page(mgp->pdev,
-					       pci_unmap_addr(&tx->info[idx],
+					       dma_unmap_addr(&tx->info[idx],
 							      bus), len,
 					       PCI_DMA_TODEVICE);
-			pci_unmap_len_set(&tx->info[idx], len, 0);
+			dma_unmap_len_set(&tx->info[idx], len, 0);
 			tx->info[idx].skb = NULL;
 		}
 		idx = (idx + 1) & tx->mask;
-- 
1.6.5
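
The renamed helpers can be exercised in userspace with stand-in macros, as
in the sketch below. The macro bodies are modelled on how the kernel
definitions are understood to look when the architecture needs unmap state,
and that is an assumption: on other configurations they are expected to
expand to nothing. dma_addr_t is a local typedef, and the toy_* names are
invented for illustration. Note that the new DEFINE_* macros take a
trailing semicolon at the point of use, as in the hunks above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* stand-in for the kernel type */

/* Stand-ins for the DMA unmap-state macros (assumed expansions). */
#define DEFINE_DMA_UNMAP_ADDR(ADDR_NAME)	dma_addr_t ADDR_NAME
#define DEFINE_DMA_UNMAP_LEN(LEN_NAME)		uint32_t LEN_NAME
#define dma_unmap_addr(PTR, ADDR_NAME)		((PTR)->ADDR_NAME)
#define dma_unmap_addr_set(PTR, ADDR_NAME, VAL)	(((PTR)->ADDR_NAME) = (VAL))
#define dma_unmap_len(PTR, LEN_NAME)		((PTR)->LEN_NAME)
#define dma_unmap_len_set(PTR, LEN_NAME, VAL)	(((PTR)->LEN_NAME) = (VAL))

struct toy_tx_buffer_state {
	int last;
	DEFINE_DMA_UNMAP_ADDR(bus);
	DEFINE_DMA_UNMAP_LEN(len);
};

/* Record a mapping so that it can be undone later. */
static void toy_record_mapping(struct toy_tx_buffer_state *s,
			       dma_addr_t bus, uint32_t len)
{
	assert(len > 0);	/* a zero length means "nothing mapped" */
	dma_unmap_addr_set(s, bus, bus);
	dma_unmap_len_set(s, len, len);
}

/* Fetch the recorded length and clear it, as the tx-done path does. */
static uint32_t toy_take_len(struct toy_tx_buffer_state *s)
{
	uint32_t len = dma_unmap_len(s, len);

	dma_unmap_len_set(s, len, 0);
	return len;
}
```

The accessors read and write the same fields the old pci_unmap_* macros
did, which is why the patch can claim no functional change.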



* [PATCH] cxgb3: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC
  To: netdev; +Cc: fujita.tomonori, Divy Le Ray
In-Reply-To: <1271118734-28353-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents, since the PCI DMA state API will become obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/cxgb3/sge.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index 07d7e7f..5962b91 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -118,7 +118,7 @@ struct rx_sw_desc {                /* SW state per Rx descriptor */
 		struct sk_buff *skb;
 		struct fl_pg_chunk pg_chunk;
 	};
-	DECLARE_PCI_UNMAP_ADDR(dma_addr);
+	DEFINE_DMA_UNMAP_ADDR(dma_addr);
 };
 
 struct rsp_desc {		/* response queue descriptor */
@@ -208,7 +208,7 @@ static inline int need_skb_unmap(void)
 	 * unmapping by checking if DECLARE_PCI_UNMAP_ADDR defines anything.
 	 */
 	struct dummy {
-		DECLARE_PCI_UNMAP_ADDR(addr);
+		DEFINE_DMA_UNMAP_ADDR(addr);
 	};
 
 	return sizeof(struct dummy) != 0;
@@ -363,7 +363,7 @@ static void clear_rx_desc(struct pci_dev *pdev, const struct sge_fl *q,
 		put_page(d->pg_chunk.page);
 		d->pg_chunk.page = NULL;
 	} else {
-		pci_unmap_single(pdev, pci_unmap_addr(d, dma_addr),
+		pci_unmap_single(pdev, dma_unmap_addr(d, dma_addr),
 				 q->buf_size, PCI_DMA_FROMDEVICE);
 		kfree_skb(d->skb);
 		d->skb = NULL;
@@ -419,7 +419,7 @@ static inline int add_one_rx_buf(void *va, unsigned int len,
 	if (unlikely(pci_dma_mapping_error(pdev, mapping)))
 		return -ENOMEM;
 
-	pci_unmap_addr_set(sd, dma_addr, mapping);
+	dma_unmap_addr_set(sd, dma_addr, mapping);
 
 	d->addr_lo = cpu_to_be32(mapping);
 	d->addr_hi = cpu_to_be32((u64) mapping >> 32);
@@ -515,7 +515,7 @@ nomem:				q->alloc_failed++;
 				break;
 			}
 			mapping = sd->pg_chunk.mapping + sd->pg_chunk.offset;
-			pci_unmap_addr_set(sd, dma_addr, mapping);
+			dma_unmap_addr_set(sd, dma_addr, mapping);
 
 			add_one_rx_chunk(mapping, d, q->gen);
 			pci_dma_sync_single_for_device(adap->pdev, mapping,
@@ -791,11 +791,11 @@ static struct sk_buff *get_packet(struct adapter *adap, struct sge_fl *fl,
 		if (likely(skb != NULL)) {
 			__skb_put(skb, len);
 			pci_dma_sync_single_for_cpu(adap->pdev,
-					    pci_unmap_addr(sd, dma_addr), len,
+					    dma_unmap_addr(sd, dma_addr), len,
 					    PCI_DMA_FROMDEVICE);
 			memcpy(skb->data, sd->skb->data, len);
 			pci_dma_sync_single_for_device(adap->pdev,
-					    pci_unmap_addr(sd, dma_addr), len,
+					    dma_unmap_addr(sd, dma_addr), len,
 					    PCI_DMA_FROMDEVICE);
 		} else if (!drop_thres)
 			goto use_orig_buf;
@@ -810,7 +810,7 @@ recycle:
 		goto recycle;
 
 use_orig_buf:
-	pci_unmap_single(adap->pdev, pci_unmap_addr(sd, dma_addr),
+	pci_unmap_single(adap->pdev, dma_unmap_addr(sd, dma_addr),
 			 fl->buf_size, PCI_DMA_FROMDEVICE);
 	skb = sd->skb;
 	skb_put(skb, len);
@@ -843,7 +843,7 @@ static struct sk_buff *get_packet_pg(struct adapter *adap, struct sge_fl *fl,
 	struct sk_buff *newskb, *skb;
 	struct rx_sw_desc *sd = &fl->sdesc[fl->cidx];
 
-	dma_addr_t dma_addr = pci_unmap_addr(sd, dma_addr);
+	dma_addr_t dma_addr = dma_unmap_addr(sd, dma_addr);
 
 	newskb = skb = q->pg_skb;
 	if (!skb && (len <= SGE_RX_COPY_THRES)) {
@@ -2097,7 +2097,7 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
 	fl->credits--;
 
 	pci_dma_sync_single_for_cpu(adap->pdev,
-				    pci_unmap_addr(sd, dma_addr),
+				    dma_unmap_addr(sd, dma_addr),
 				    fl->buf_size - SGE_PG_RSVD,
 				    PCI_DMA_FROMDEVICE);
 
-- 
1.6.5


^ permalink raw reply related

* [PATCH] chelsio: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC (permalink / raw)
  To: netdev; +Cc: fujita.tomonori, Divy Le Ray
In-Reply-To: <1271118734-28353-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents since the PCI DMA state API will be obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/chelsio/sge.c |   50 ++++++++++++++++++++++----------------------
 1 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index df3a141..475304f 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -162,14 +162,14 @@ struct respQ_e {
  */
 struct cmdQ_ce {
 	struct sk_buff *skb;
-	DECLARE_PCI_UNMAP_ADDR(dma_addr);
-	DECLARE_PCI_UNMAP_LEN(dma_len);
+	DEFINE_DMA_UNMAP_ADDR(dma_addr);
+	DEFINE_DMA_UNMAP_LEN(dma_len);
 };
 
 struct freelQ_ce {
 	struct sk_buff *skb;
-	DECLARE_PCI_UNMAP_ADDR(dma_addr);
-	DECLARE_PCI_UNMAP_LEN(dma_len);
+	DEFINE_DMA_UNMAP_ADDR(dma_addr);
+	DEFINE_DMA_UNMAP_LEN(dma_len);
 };
 
 /*
@@ -518,8 +518,8 @@ static void free_freelQ_buffers(struct pci_dev *pdev, struct freelQ *q)
 	while (q->credits--) {
 		struct freelQ_ce *ce = &q->centries[cidx];
 
-		pci_unmap_single(pdev, pci_unmap_addr(ce, dma_addr),
-				 pci_unmap_len(ce, dma_len),
+		pci_unmap_single(pdev, dma_unmap_addr(ce, dma_addr),
+				 dma_unmap_len(ce, dma_len),
 				 PCI_DMA_FROMDEVICE);
 		dev_kfree_skb(ce->skb);
 		ce->skb = NULL;
@@ -633,9 +633,9 @@ static void free_cmdQ_buffers(struct sge *sge, struct cmdQ *q, unsigned int n)
 	q->in_use -= n;
 	ce = &q->centries[cidx];
 	while (n--) {
-		if (likely(pci_unmap_len(ce, dma_len))) {
-			pci_unmap_single(pdev, pci_unmap_addr(ce, dma_addr),
-					 pci_unmap_len(ce, dma_len),
+		if (likely(dma_unmap_len(ce, dma_len))) {
+			pci_unmap_single(pdev, dma_unmap_addr(ce, dma_addr),
+					 dma_unmap_len(ce, dma_len),
 					 PCI_DMA_TODEVICE);
 			if (q->sop)
 				q->sop = 0;
@@ -851,8 +851,8 @@ static void refill_free_list(struct sge *sge, struct freelQ *q)
 		skb_reserve(skb, sge->rx_pkt_pad);
 
 		ce->skb = skb;
-		pci_unmap_addr_set(ce, dma_addr, mapping);
-		pci_unmap_len_set(ce, dma_len, dma_len);
+		dma_unmap_addr_set(ce, dma_addr, mapping);
+		dma_unmap_len_set(ce, dma_len, dma_len);
 		e->addr_lo = (u32)mapping;
 		e->addr_hi = (u64)mapping >> 32;
 		e->len_gen = V_CMD_LEN(dma_len) | V_CMD_GEN1(q->genbit);
@@ -1059,13 +1059,13 @@ static inline struct sk_buff *get_packet(struct pci_dev *pdev,
 		skb_reserve(skb, 2);	/* align IP header */
 		skb_put(skb, len);
 		pci_dma_sync_single_for_cpu(pdev,
-					    pci_unmap_addr(ce, dma_addr),
-					    pci_unmap_len(ce, dma_len),
+					    dma_unmap_addr(ce, dma_addr),
+					    dma_unmap_len(ce, dma_len),
 					    PCI_DMA_FROMDEVICE);
 		skb_copy_from_linear_data(ce->skb, skb->data, len);
 		pci_dma_sync_single_for_device(pdev,
-					       pci_unmap_addr(ce, dma_addr),
-					       pci_unmap_len(ce, dma_len),
+					       dma_unmap_addr(ce, dma_addr),
+					       dma_unmap_len(ce, dma_len),
 					       PCI_DMA_FROMDEVICE);
 		recycle_fl_buf(fl, fl->cidx);
 		return skb;
@@ -1077,8 +1077,8 @@ use_orig_buf:
 		return NULL;
 	}
 
-	pci_unmap_single(pdev, pci_unmap_addr(ce, dma_addr),
-			 pci_unmap_len(ce, dma_len), PCI_DMA_FROMDEVICE);
+	pci_unmap_single(pdev, dma_unmap_addr(ce, dma_addr),
+			 dma_unmap_len(ce, dma_len), PCI_DMA_FROMDEVICE);
 	skb = ce->skb;
 	prefetch(skb->data);
 
@@ -1100,8 +1100,8 @@ static void unexpected_offload(struct adapter *adapter, struct freelQ *fl)
 	struct freelQ_ce *ce = &fl->centries[fl->cidx];
 	struct sk_buff *skb = ce->skb;
 
-	pci_dma_sync_single_for_cpu(adapter->pdev, pci_unmap_addr(ce, dma_addr),
-			    pci_unmap_len(ce, dma_len), PCI_DMA_FROMDEVICE);
+	pci_dma_sync_single_for_cpu(adapter->pdev, dma_unmap_addr(ce, dma_addr),
+			    dma_unmap_len(ce, dma_len), PCI_DMA_FROMDEVICE);
 	pr_err("%s: unexpected offload packet, cmd %u\n",
 	       adapter->name, *skb->data);
 	recycle_fl_buf(fl, fl->cidx);
@@ -1182,7 +1182,7 @@ static inline unsigned int write_large_page_tx_descs(unsigned int pidx,
 			write_tx_desc(e1, *desc_mapping, SGE_TX_DESC_MAX_PLEN,
 				      *gen, nfrags == 0 && *desc_len == 0);
 			ce1->skb = NULL;
-			pci_unmap_len_set(ce1, dma_len, 0);
+			dma_unmap_len_set(ce1, dma_len, 0);
 			*desc_mapping += SGE_TX_DESC_MAX_PLEN;
 			if (*desc_len) {
 				ce1++;
@@ -1233,7 +1233,7 @@ static inline void write_tx_descs(struct adapter *adapter, struct sk_buff *skb,
 	e->addr_hi = (u64)desc_mapping >> 32;
 	e->len_gen = V_CMD_LEN(first_desc_len) | V_CMD_GEN1(gen);
 	ce->skb = NULL;
-	pci_unmap_len_set(ce, dma_len, 0);
+	dma_unmap_len_set(ce, dma_len, 0);
 
 	if (PAGE_SIZE > SGE_TX_DESC_MAX_PLEN &&
 	    desc_len > SGE_TX_DESC_MAX_PLEN) {
@@ -1257,8 +1257,8 @@ static inline void write_tx_descs(struct adapter *adapter, struct sk_buff *skb,
 	}
 
 	ce->skb = NULL;
-	pci_unmap_addr_set(ce, dma_addr, mapping);
-	pci_unmap_len_set(ce, dma_len, skb->len - skb->data_len);
+	dma_unmap_addr_set(ce, dma_addr, mapping);
+	dma_unmap_len_set(ce, dma_len, skb->len - skb->data_len);
 
 	for (i = 0; nfrags--; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
@@ -1284,8 +1284,8 @@ static inline void write_tx_descs(struct adapter *adapter, struct sk_buff *skb,
 			write_tx_desc(e1, desc_mapping, desc_len, gen,
 				      nfrags == 0);
 		ce->skb = NULL;
-		pci_unmap_addr_set(ce, dma_addr, mapping);
-		pci_unmap_len_set(ce, dma_len, frag->size);
+		dma_unmap_addr_set(ce, dma_addr, mapping);
+		dma_unmap_len_set(ce, dma_len, frag->size);
 	}
 	ce->skb = skb;
 	wmb();
-- 
1.6.5


^ permalink raw reply related
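
Editorial sketch (not part of the posted patch): the conversions in this
series are mechanical renames, and the reason they are safe is easier to
see with a small userspace model of the state macros. The macro names
below match the ones used in the diff. Everything else is an assumption
made for illustration: the `dma_addr_t` typedef, the
`DEMO_NEED_DMA_MAP_STATE` switch (a hypothetical stand-in for whatever
kernel configuration decides whether unmap state is needed), the
`struct ce_demo` type loosely modeled on `struct cmdQ_ce`, and the two
helper functions.

```c
#include <stdint.h>

/* Assumption: userspace stand-in for the kernel's dma_addr_t. */
typedef uint64_t dma_addr_t;

/*
 * Hypothetical switch for this demo only. Remove it to model a platform
 * that needs no unmap state, where the fields vanish from the struct.
 */
#define DEMO_NEED_DMA_MAP_STATE 1

#ifdef DEMO_NEED_DMA_MAP_STATE
#define DEFINE_DMA_UNMAP_ADDR(ADDR_NAME)        dma_addr_t ADDR_NAME
#define DEFINE_DMA_UNMAP_LEN(LEN_NAME)          uint32_t LEN_NAME
#define dma_unmap_addr(PTR, ADDR_NAME)          ((PTR)->ADDR_NAME)
#define dma_unmap_addr_set(PTR, ADDR_NAME, VAL) (((PTR)->ADDR_NAME) = (VAL))
#define dma_unmap_len(PTR, LEN_NAME)            ((PTR)->LEN_NAME)
#define dma_unmap_len_set(PTR, LEN_NAME, VAL)   (((PTR)->LEN_NAME) = (VAL))
#else
#define DEFINE_DMA_UNMAP_ADDR(ADDR_NAME)
#define DEFINE_DMA_UNMAP_LEN(LEN_NAME)
#define dma_unmap_addr(PTR, ADDR_NAME)          (0)
#define dma_unmap_addr_set(PTR, ADDR_NAME, VAL) do { } while (0)
#define dma_unmap_len(PTR, LEN_NAME)            (0)
#define dma_unmap_len_set(PTR, LEN_NAME, VAL)   do { } while (0)
#endif

/* Hypothetical context entry, loosely modeled on struct cmdQ_ce. */
struct ce_demo {
	void *skb;
	DEFINE_DMA_UNMAP_ADDR(dma_addr);
	DEFINE_DMA_UNMAP_LEN(dma_len);
};

/* Record the mapping so it can be undone later (cf. refill_free_list). */
static void ce_demo_map(struct ce_demo *ce, dma_addr_t mapping, uint32_t len)
{
	dma_unmap_addr_set(ce, dma_addr, mapping);
	dma_unmap_len_set(ce, dma_len, len);
}

/* A zero length marks "nothing to unmap" (cf. free_cmdQ_buffers). */
static int ce_demo_is_mapped(const struct ce_demo *ce)
{
	return dma_unmap_len(ce, dma_len) != 0;
}
```

Because the driver only ever touches the fields through these accessors,
renaming the pci_* spellings to the dma_* ones changes no behavior as
long as both families expand to the same thing, which is what the
"No functional change" note in the patch description claims.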

* [PATCH] qlge: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC (permalink / raw)
  To: netdev; +Cc: fujita.tomonori, Ron Mercer
In-Reply-To: <1271118734-28353-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents since the PCI DMA state API will be obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge.h      |    8 +++---
 drivers/net/qlge/qlge_main.c |   58 +++++++++++++++++++++---------------------
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index 8b742b6..20624ba 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -1344,8 +1344,8 @@ struct oal {
 };
 
 struct map_list {
-	DECLARE_PCI_UNMAP_ADDR(mapaddr);
-	DECLARE_PCI_UNMAP_LEN(maplen);
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 };
 
 struct tx_ring_desc {
@@ -1373,8 +1373,8 @@ struct bq_desc {
 	} p;
 	__le64 *addr;
 	u32 index;
-	 DECLARE_PCI_UNMAP_ADDR(mapaddr);
-	 DECLARE_PCI_UNMAP_LEN(maplen);
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 };
 
 #define QL_TXQ_IDX(qdev, skb) (smp_processor_id()%(qdev->tx_ring_count))
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 76df968..fa4b24c 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -1057,7 +1057,7 @@ static struct bq_desc *ql_get_curr_lchunk(struct ql_adapter *qdev,
 	struct bq_desc *lbq_desc = ql_get_curr_lbuf(rx_ring);
 
 	pci_dma_sync_single_for_cpu(qdev->pdev,
-					pci_unmap_addr(lbq_desc, mapaddr),
+					dma_unmap_addr(lbq_desc, mapaddr),
 				    rx_ring->lbq_buf_size,
 					PCI_DMA_FROMDEVICE);
 
@@ -1170,8 +1170,8 @@ static void ql_update_lbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 
 			map = lbq_desc->p.pg_chunk.map +
 				lbq_desc->p.pg_chunk.offset;
-				pci_unmap_addr_set(lbq_desc, mapaddr, map);
-			pci_unmap_len_set(lbq_desc, maplen,
+				dma_unmap_addr_set(lbq_desc, mapaddr, map);
+			dma_unmap_len_set(lbq_desc, maplen,
 					rx_ring->lbq_buf_size);
 				*lbq_desc->addr = cpu_to_le64(map);
 
@@ -1241,8 +1241,8 @@ static void ql_update_sbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 					sbq_desc->p.skb = NULL;
 					return;
 				}
-				pci_unmap_addr_set(sbq_desc, mapaddr, map);
-				pci_unmap_len_set(sbq_desc, maplen,
+				dma_unmap_addr_set(sbq_desc, mapaddr, map);
+				dma_unmap_len_set(sbq_desc, maplen,
 						  rx_ring->sbq_buf_size);
 				*sbq_desc->addr = cpu_to_le64(map);
 			}
@@ -1298,18 +1298,18 @@ static void ql_unmap_send(struct ql_adapter *qdev,
 					     "unmapping OAL area.\n");
 			}
 			pci_unmap_single(qdev->pdev,
-					 pci_unmap_addr(&tx_ring_desc->map[i],
+					 dma_unmap_addr(&tx_ring_desc->map[i],
 							mapaddr),
-					 pci_unmap_len(&tx_ring_desc->map[i],
+					 dma_unmap_len(&tx_ring_desc->map[i],
 						       maplen),
 					 PCI_DMA_TODEVICE);
 		} else {
 			netif_printk(qdev, tx_done, KERN_DEBUG, qdev->ndev,
 				     "unmapping frag %d.\n", i);
 			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(&tx_ring_desc->map[i],
+				       dma_unmap_addr(&tx_ring_desc->map[i],
 						      mapaddr),
-				       pci_unmap_len(&tx_ring_desc->map[i],
+				       dma_unmap_len(&tx_ring_desc->map[i],
 						     maplen), PCI_DMA_TODEVICE);
 		}
 	}
@@ -1348,8 +1348,8 @@ static int ql_map_send(struct ql_adapter *qdev,
 
 	tbd->len = cpu_to_le32(len);
 	tbd->addr = cpu_to_le64(map);
-	pci_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr, map);
-	pci_unmap_len_set(&tx_ring_desc->map[map_idx], maplen, len);
+	dma_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr, map);
+	dma_unmap_len_set(&tx_ring_desc->map[map_idx], maplen, len);
 	map_idx++;
 
 	/*
@@ -1402,9 +1402,9 @@ static int ql_map_send(struct ql_adapter *qdev,
 			tbd->len =
 			    cpu_to_le32((sizeof(struct tx_buf_desc) *
 					 (frag_cnt - frag_idx)) | TX_DESC_C);
-			pci_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr,
+			dma_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr,
 					   map);
-			pci_unmap_len_set(&tx_ring_desc->map[map_idx], maplen,
+			dma_unmap_len_set(&tx_ring_desc->map[map_idx], maplen,
 					  sizeof(struct oal));
 			tbd = (struct tx_buf_desc *)&tx_ring_desc->oal;
 			map_idx++;
@@ -1425,8 +1425,8 @@ static int ql_map_send(struct ql_adapter *qdev,
 
 		tbd->addr = cpu_to_le64(map);
 		tbd->len = cpu_to_le32(frag->size);
-		pci_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr, map);
-		pci_unmap_len_set(&tx_ring_desc->map[map_idx], maplen,
+		dma_unmap_addr_set(&tx_ring_desc->map[map_idx], mapaddr, map);
+		dma_unmap_len_set(&tx_ring_desc->map[map_idx], maplen,
 				  frag->size);
 
 	}
@@ -1742,8 +1742,8 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 		 */
 		sbq_desc = ql_get_curr_sbuf(rx_ring);
 		pci_unmap_single(qdev->pdev,
-				pci_unmap_addr(sbq_desc, mapaddr),
-				pci_unmap_len(sbq_desc, maplen),
+				dma_unmap_addr(sbq_desc, mapaddr),
+				dma_unmap_len(sbq_desc, maplen),
 				PCI_DMA_FROMDEVICE);
 		skb = sbq_desc->p.skb;
 		ql_realign_skb(skb, hdr_len);
@@ -1774,18 +1774,18 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			 */
 			sbq_desc = ql_get_curr_sbuf(rx_ring);
 			pci_dma_sync_single_for_cpu(qdev->pdev,
-						    pci_unmap_addr
+						    dma_unmap_addr
 						    (sbq_desc, mapaddr),
-						    pci_unmap_len
+						    dma_unmap_len
 						    (sbq_desc, maplen),
 						    PCI_DMA_FROMDEVICE);
 			memcpy(skb_put(skb, length),
 			       sbq_desc->p.skb->data, length);
 			pci_dma_sync_single_for_device(qdev->pdev,
-						       pci_unmap_addr
+						       dma_unmap_addr
 						       (sbq_desc,
 							mapaddr),
-						       pci_unmap_len
+						       dma_unmap_len
 						       (sbq_desc,
 							maplen),
 						       PCI_DMA_FROMDEVICE);
@@ -1798,9 +1798,9 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			ql_realign_skb(skb, length);
 			skb_put(skb, length);
 			pci_unmap_single(qdev->pdev,
-					 pci_unmap_addr(sbq_desc,
+					 dma_unmap_addr(sbq_desc,
 							mapaddr),
-					 pci_unmap_len(sbq_desc,
+					 dma_unmap_len(sbq_desc,
 						       maplen),
 					 PCI_DMA_FROMDEVICE);
 			sbq_desc->p.skb = NULL;
@@ -1839,9 +1839,9 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 				return NULL;
 			}
 			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc,
+				       dma_unmap_addr(lbq_desc,
 						      mapaddr),
-				       pci_unmap_len(lbq_desc, maplen),
+				       dma_unmap_len(lbq_desc, maplen),
 				       PCI_DMA_FROMDEVICE);
 			skb_reserve(skb, NET_IP_ALIGN);
 			netif_printk(qdev, rx_status, KERN_DEBUG, qdev->ndev,
@@ -1874,8 +1874,8 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 		int size, i = 0;
 		sbq_desc = ql_get_curr_sbuf(rx_ring);
 		pci_unmap_single(qdev->pdev,
-				 pci_unmap_addr(sbq_desc, mapaddr),
-				 pci_unmap_len(sbq_desc, maplen),
+				 dma_unmap_addr(sbq_desc, mapaddr),
+				 dma_unmap_len(sbq_desc, maplen),
 				 PCI_DMA_FROMDEVICE);
 		if (!(ib_mac_rsp->flags4 & IB_MAC_IOCB_RSP_HS)) {
 			/*
@@ -2737,8 +2737,8 @@ static void ql_free_sbq_buffers(struct ql_adapter *qdev, struct rx_ring *rx_ring
 		}
 		if (sbq_desc->p.skb) {
 			pci_unmap_single(qdev->pdev,
-					 pci_unmap_addr(sbq_desc, mapaddr),
-					 pci_unmap_len(sbq_desc, maplen),
+					 dma_unmap_addr(sbq_desc, mapaddr),
+					 dma_unmap_len(sbq_desc, maplen),
 					 PCI_DMA_FROMDEVICE);
 			dev_kfree_skb(sbq_desc->p.skb);
 			sbq_desc->p.skb = NULL;
-- 
1.6.5


^ permalink raw reply related

* [PATCH] qla3xxx: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC (permalink / raw)
  To: netdev; +Cc: fujita.tomonori, Ron Mercer
In-Reply-To: <1271118734-28353-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents since the PCI DMA state API will be obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc:  Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qla3xxx.c |   64 ++++++++++++++++++++++++------------------------
 drivers/net/qla3xxx.h |    8 +++---
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index fc11ad3..01a6ca3 100644
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -343,8 +343,8 @@ static void ql_release_to_lrg_buf_free_list(struct ql3_adapter *qdev,
 			    cpu_to_le32(LS_64BITS(map));
 			lrg_buf_cb->buf_phy_addr_high =
 			    cpu_to_le32(MS_64BITS(map));
-			pci_unmap_addr_set(lrg_buf_cb, mapaddr, map);
-			pci_unmap_len_set(lrg_buf_cb, maplen,
+			dma_unmap_addr_set(lrg_buf_cb, mapaddr, map);
+			dma_unmap_len_set(lrg_buf_cb, maplen,
 					  qdev->lrg_buffer_len -
 					  QL_HEADER_SPACE);
 		}
@@ -1924,8 +1924,8 @@ static int ql_populate_free_queue(struct ql3_adapter *qdev)
 				    cpu_to_le32(LS_64BITS(map));
 				lrg_buf_cb->buf_phy_addr_high =
 				    cpu_to_le32(MS_64BITS(map));
-				pci_unmap_addr_set(lrg_buf_cb, mapaddr, map);
-				pci_unmap_len_set(lrg_buf_cb, maplen,
+				dma_unmap_addr_set(lrg_buf_cb, mapaddr, map);
+				dma_unmap_len_set(lrg_buf_cb, maplen,
 						  qdev->lrg_buffer_len -
 						  QL_HEADER_SPACE);
 				--qdev->lrg_buf_skb_check;
@@ -2041,16 +2041,16 @@ static void ql_process_mac_tx_intr(struct ql3_adapter *qdev,
 	}
 
 	pci_unmap_single(qdev->pdev,
-			 pci_unmap_addr(&tx_cb->map[0], mapaddr),
-			 pci_unmap_len(&tx_cb->map[0], maplen),
+			 dma_unmap_addr(&tx_cb->map[0], mapaddr),
+			 dma_unmap_len(&tx_cb->map[0], maplen),
 			 PCI_DMA_TODEVICE);
 	tx_cb->seg_count--;
 	if (tx_cb->seg_count) {
 		for (i = 1; i < tx_cb->seg_count; i++) {
 			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(&tx_cb->map[i],
+				       dma_unmap_addr(&tx_cb->map[i],
 						      mapaddr),
-				       pci_unmap_len(&tx_cb->map[i], maplen),
+				       dma_unmap_len(&tx_cb->map[i], maplen),
 				       PCI_DMA_TODEVICE);
 		}
 	}
@@ -2119,8 +2119,8 @@ static void ql_process_mac_rx_intr(struct ql3_adapter *qdev,
 
 	skb_put(skb, length);
 	pci_unmap_single(qdev->pdev,
-			 pci_unmap_addr(lrg_buf_cb2, mapaddr),
-			 pci_unmap_len(lrg_buf_cb2, maplen),
+			 dma_unmap_addr(lrg_buf_cb2, mapaddr),
+			 dma_unmap_len(lrg_buf_cb2, maplen),
 			 PCI_DMA_FROMDEVICE);
 	prefetch(skb->data);
 	skb->ip_summed = CHECKSUM_NONE;
@@ -2165,8 +2165,8 @@ static void ql_process_macip_rx_intr(struct ql3_adapter *qdev,
 
 	skb_put(skb2, length);	/* Just the second buffer length here. */
 	pci_unmap_single(qdev->pdev,
-			 pci_unmap_addr(lrg_buf_cb2, mapaddr),
-			 pci_unmap_len(lrg_buf_cb2, maplen),
+			 dma_unmap_addr(lrg_buf_cb2, mapaddr),
+			 dma_unmap_len(lrg_buf_cb2, maplen),
 			 PCI_DMA_FROMDEVICE);
 	prefetch(skb2->data);
 
@@ -2454,8 +2454,8 @@ static int ql_send_map(struct ql3_adapter *qdev,
 	oal_entry->dma_lo = cpu_to_le32(LS_64BITS(map));
 	oal_entry->dma_hi = cpu_to_le32(MS_64BITS(map));
 	oal_entry->len = cpu_to_le32(len);
-	pci_unmap_addr_set(&tx_cb->map[seg], mapaddr, map);
-	pci_unmap_len_set(&tx_cb->map[seg], maplen, len);
+	dma_unmap_addr_set(&tx_cb->map[seg], mapaddr, map);
+	dma_unmap_len_set(&tx_cb->map[seg], maplen, len);
 	seg++;
 
 	if (seg_cnt == 1) {
@@ -2488,9 +2488,9 @@ static int ql_send_map(struct ql3_adapter *qdev,
 				oal_entry->len =
 				    cpu_to_le32(sizeof(struct oal) |
 						OAL_CONT_ENTRY);
-				pci_unmap_addr_set(&tx_cb->map[seg], mapaddr,
+				dma_unmap_addr_set(&tx_cb->map[seg], mapaddr,
 						   map);
-				pci_unmap_len_set(&tx_cb->map[seg], maplen,
+				dma_unmap_len_set(&tx_cb->map[seg], maplen,
 						  sizeof(struct oal));
 				oal_entry = (struct oal_entry *)oal;
 				oal++;
@@ -2512,8 +2512,8 @@ static int ql_send_map(struct ql3_adapter *qdev,
 			oal_entry->dma_lo = cpu_to_le32(LS_64BITS(map));
 			oal_entry->dma_hi = cpu_to_le32(MS_64BITS(map));
 			oal_entry->len = cpu_to_le32(frag->size);
-			pci_unmap_addr_set(&tx_cb->map[seg], mapaddr, map);
-			pci_unmap_len_set(&tx_cb->map[seg], maplen,
+			dma_unmap_addr_set(&tx_cb->map[seg], mapaddr, map);
+			dma_unmap_len_set(&tx_cb->map[seg], maplen,
 					  frag->size);
 		}
 		/* Terminate the last segment. */
@@ -2539,22 +2539,22 @@ map_error:
 		   (seg == 12 && seg_cnt > 13) ||      /* but necessary. */
 		   (seg == 17 && seg_cnt > 18)) {
 			pci_unmap_single(qdev->pdev,
-				pci_unmap_addr(&tx_cb->map[seg], mapaddr),
-				pci_unmap_len(&tx_cb->map[seg], maplen),
+				dma_unmap_addr(&tx_cb->map[seg], mapaddr),
+				dma_unmap_len(&tx_cb->map[seg], maplen),
 				 PCI_DMA_TODEVICE);
 			oal++;
 			seg++;
 		}
 
 		pci_unmap_page(qdev->pdev,
-			       pci_unmap_addr(&tx_cb->map[seg], mapaddr),
-			       pci_unmap_len(&tx_cb->map[seg], maplen),
+			       dma_unmap_addr(&tx_cb->map[seg], mapaddr),
+			       dma_unmap_len(&tx_cb->map[seg], maplen),
 			       PCI_DMA_TODEVICE);
 	}
 
 	pci_unmap_single(qdev->pdev,
-			 pci_unmap_addr(&tx_cb->map[0], mapaddr),
-			 pci_unmap_addr(&tx_cb->map[0], maplen),
+			 dma_unmap_addr(&tx_cb->map[0], mapaddr),
+			 dma_unmap_addr(&tx_cb->map[0], maplen),
 			 PCI_DMA_TODEVICE);
 
 	return NETDEV_TX_BUSY;
@@ -2841,8 +2841,8 @@ static void ql_free_large_buffers(struct ql3_adapter *qdev)
 		if (lrg_buf_cb->skb) {
 			dev_kfree_skb(lrg_buf_cb->skb);
 			pci_unmap_single(qdev->pdev,
-					 pci_unmap_addr(lrg_buf_cb, mapaddr),
-					 pci_unmap_len(lrg_buf_cb, maplen),
+					 dma_unmap_addr(lrg_buf_cb, mapaddr),
+					 dma_unmap_len(lrg_buf_cb, maplen),
 					 PCI_DMA_FROMDEVICE);
 			memset(lrg_buf_cb, 0, sizeof(struct ql_rcv_buf_cb));
 		} else {
@@ -2912,8 +2912,8 @@ static int ql_alloc_large_buffers(struct ql3_adapter *qdev)
 				return -ENOMEM;
 			}
 
-			pci_unmap_addr_set(lrg_buf_cb, mapaddr, map);
-			pci_unmap_len_set(lrg_buf_cb, maplen,
+			dma_unmap_addr_set(lrg_buf_cb, mapaddr, map);
+			dma_unmap_len_set(lrg_buf_cb, maplen,
 					  qdev->lrg_buffer_len -
 					  QL_HEADER_SPACE);
 			lrg_buf_cb->buf_phy_addr_low =
@@ -3793,13 +3793,13 @@ static void ql_reset_work(struct work_struct *work)
 				       "%s: Freeing lost SKB.\n",
 				       qdev->ndev->name);
 				pci_unmap_single(qdev->pdev,
-					 pci_unmap_addr(&tx_cb->map[0], mapaddr),
-					 pci_unmap_len(&tx_cb->map[0], maplen),
+					 dma_unmap_addr(&tx_cb->map[0], mapaddr),
+					 dma_unmap_len(&tx_cb->map[0], maplen),
 					 PCI_DMA_TODEVICE);
 				for(j=1;j<tx_cb->seg_count;j++) {
 					pci_unmap_page(qdev->pdev,
-					       pci_unmap_addr(&tx_cb->map[j],mapaddr),
-					       pci_unmap_len(&tx_cb->map[j],maplen),
+					       dma_unmap_addr(&tx_cb->map[j],mapaddr),
+					       dma_unmap_len(&tx_cb->map[j],maplen),
 					       PCI_DMA_TODEVICE);
 				}
 				dev_kfree_skb(tx_cb->skb);
diff --git a/drivers/net/qla3xxx.h b/drivers/net/qla3xxx.h
index 7113e71..3362a66 100644
--- a/drivers/net/qla3xxx.h
+++ b/drivers/net/qla3xxx.h
@@ -998,8 +998,8 @@ enum link_state_t {
 struct ql_rcv_buf_cb {
 	struct ql_rcv_buf_cb *next;
 	struct sk_buff *skb;
-	 DECLARE_PCI_UNMAP_ADDR(mapaddr);
-	 DECLARE_PCI_UNMAP_LEN(maplen);
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 	__le32 buf_phy_addr_low;
 	__le32 buf_phy_addr_high;
 	int index;
@@ -1029,8 +1029,8 @@ struct oal {
 };
 
 struct map_list {
-	 DECLARE_PCI_UNMAP_ADDR(mapaddr);
-	 DECLARE_PCI_UNMAP_LEN(maplen);
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 };
 
 struct ql_tx_buf_cb {
-- 
1.6.5


^ permalink raw reply related

* [PATCH] tg3: use the DMA state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-13  0:32 UTC (permalink / raw)
  To: netdev; +Cc: fujita.tomonori, Matt Carlson, Michael Chan

This replaces the PCI DMA state API (include/linux/pci-dma.h) with the
DMA equivalents since the PCI DMA state API will be obsolete.

No functional change.

For further information about the background:

http://marc.info/?l=linux-netdev&m=127037540020276&w=2

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
---
 drivers/net/tg3.c |   42 +++++++++++++++++++++---------------------
 drivers/net/tg3.h |    2 +-
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 460a0c2..46cf84c 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4379,7 +4379,7 @@ static void tg3_tx(struct tg3_napi *tnapi)
 		}
 
 		pci_unmap_single(tp->pdev,
-				 pci_unmap_addr(ri, mapping),
+				 dma_unmap_addr(ri, mapping),
 				 skb_headlen(skb),
 				 PCI_DMA_TODEVICE);
 
@@ -4393,7 +4393,7 @@ static void tg3_tx(struct tg3_napi *tnapi)
 				tx_bug = 1;
 
 			pci_unmap_page(tp->pdev,
-				       pci_unmap_addr(ri, mapping),
+				       dma_unmap_addr(ri, mapping),
 				       skb_shinfo(skb)->frags[i].size,
 				       PCI_DMA_TODEVICE);
 			sw_idx = NEXT_TX(sw_idx);
@@ -4431,7 +4431,7 @@ static void tg3_rx_skb_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
 	if (!ri->skb)
 		return;
 
-	pci_unmap_single(tp->pdev, pci_unmap_addr(ri, mapping),
+	pci_unmap_single(tp->pdev, dma_unmap_addr(ri, mapping),
 			 map_sz, PCI_DMA_FROMDEVICE);
 	dev_kfree_skb_any(ri->skb);
 	ri->skb = NULL;
@@ -4497,7 +4497,7 @@ static int tg3_alloc_rx_skb(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	}
 
 	map->skb = skb;
-	pci_unmap_addr_set(map, mapping, mapping);
+	dma_unmap_addr_set(map, mapping, mapping);
 
 	desc->addr_hi = ((u64)mapping >> 32);
 	desc->addr_lo = ((u64)mapping & 0xffffffff);
@@ -4542,8 +4542,8 @@ static void tg3_recycle_rx(struct tg3_napi *tnapi,
 	}
 
 	dest_map->skb = src_map->skb;
-	pci_unmap_addr_set(dest_map, mapping,
-			   pci_unmap_addr(src_map, mapping));
+	dma_unmap_addr_set(dest_map, mapping,
+			   dma_unmap_addr(src_map, mapping));
 	dest_desc->addr_hi = src_desc->addr_hi;
 	dest_desc->addr_lo = src_desc->addr_lo;
 
@@ -4611,13 +4611,13 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 		opaque_key = desc->opaque & RXD_OPAQUE_RING_MASK;
 		if (opaque_key == RXD_OPAQUE_RING_STD) {
 			ri = &tp->prodring[0].rx_std_buffers[desc_idx];
-			dma_addr = pci_unmap_addr(ri, mapping);
+			dma_addr = dma_unmap_addr(ri, mapping);
 			skb = ri->skb;
 			post_ptr = &std_prod_idx;
 			rx_std_posted++;
 		} else if (opaque_key == RXD_OPAQUE_RING_JUMBO) {
 			ri = &tp->prodring[0].rx_jmb_buffers[desc_idx];
-			dma_addr = pci_unmap_addr(ri, mapping);
+			dma_addr = dma_unmap_addr(ri, mapping);
 			skb = ri->skb;
 			post_ptr = &jmb_prod_idx;
 		} else
@@ -5439,12 +5439,12 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
 			len = skb_shinfo(skb)->frags[i-1].size;
 
 		pci_unmap_single(tp->pdev,
-				 pci_unmap_addr(&tnapi->tx_buffers[entry],
+				 dma_unmap_addr(&tnapi->tx_buffers[entry],
 						mapping),
 				 len, PCI_DMA_TODEVICE);
 		if (i == 0) {
 			tnapi->tx_buffers[entry].skb = new_skb;
-			pci_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
+			dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
 					   new_addr);
 		} else {
 			tnapi->tx_buffers[entry].skb = NULL;
@@ -5574,7 +5574,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 	}
 
 	tnapi->tx_buffers[entry].skb = skb;
-	pci_unmap_addr_set(&tnapi->tx_buffers[entry], mapping, mapping);
+	dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping, mapping);
 
 	if ((tp->tg3_flags3 & TG3_FLG3_USE_JUMBO_BDFLAG) &&
 	    !mss && skb->len > ETH_DATA_LEN)
@@ -5600,7 +5600,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 				goto dma_error;
 
 			tnapi->tx_buffers[entry].skb = NULL;
-			pci_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
+			dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
 					   mapping);
 
 			tg3_set_txd(tnapi, entry, mapping, len,
@@ -5630,7 +5630,7 @@ dma_error:
 	entry = tnapi->tx_prod;
 	tnapi->tx_buffers[entry].skb = NULL;
 	pci_unmap_single(tp->pdev,
-			 pci_unmap_addr(&tnapi->tx_buffers[entry], mapping),
+			 dma_unmap_addr(&tnapi->tx_buffers[entry], mapping),
 			 skb_headlen(skb),
 			 PCI_DMA_TODEVICE);
 	for (i = 0; i <= last; i++) {
@@ -5638,7 +5638,7 @@ dma_error:
 		entry = NEXT_TX(entry);
 
 		pci_unmap_page(tp->pdev,
-			       pci_unmap_addr(&tnapi->tx_buffers[entry],
+			       dma_unmap_addr(&tnapi->tx_buffers[entry],
 					      mapping),
 			       frag->size, PCI_DMA_TODEVICE);
 	}
@@ -5800,7 +5800,7 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 	}
 
 	tnapi->tx_buffers[entry].skb = skb;
-	pci_unmap_addr_set(&tnapi->tx_buffers[entry], mapping, mapping);
+	dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping, mapping);
 
 	would_hit_hwbug = 0;
 
@@ -5836,7 +5836,7 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 					       len, PCI_DMA_TODEVICE);
 
 			tnapi->tx_buffers[entry].skb = NULL;
-			pci_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
+			dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
 					   mapping);
 			if (pci_dma_mapping_error(tp->pdev, mapping))
 				goto dma_error;
@@ -5901,7 +5901,7 @@ dma_error:
 	entry = tnapi->tx_prod;
 	tnapi->tx_buffers[entry].skb = NULL;
 	pci_unmap_single(tp->pdev,
-			 pci_unmap_addr(&tnapi->tx_buffers[entry], mapping),
+			 dma_unmap_addr(&tnapi->tx_buffers[entry], mapping),
 			 skb_headlen(skb),
 			 PCI_DMA_TODEVICE);
 	for (i = 0; i <= last; i++) {
@@ -5909,7 +5909,7 @@ dma_error:
 		entry = NEXT_TX(entry);
 
 		pci_unmap_page(tp->pdev,
-			       pci_unmap_addr(&tnapi->tx_buffers[entry],
+			       dma_unmap_addr(&tnapi->tx_buffers[entry],
 					      mapping),
 			       frag->size, PCI_DMA_TODEVICE);
 	}
@@ -6194,7 +6194,7 @@ static void tg3_free_rings(struct tg3 *tp)
 			}
 
 			pci_unmap_single(tp->pdev,
-					 pci_unmap_addr(txp, mapping),
+					 dma_unmap_addr(txp, mapping),
 					 skb_headlen(skb),
 					 PCI_DMA_TODEVICE);
 			txp->skb = NULL;
@@ -6204,7 +6204,7 @@ static void tg3_free_rings(struct tg3 *tp)
 			for (k = 0; k < skb_shinfo(skb)->nr_frags; k++) {
 				txp = &tnapi->tx_buffers[i & (TG3_TX_RING_SIZE - 1)];
 				pci_unmap_page(tp->pdev,
-					       pci_unmap_addr(txp, mapping),
+					       dma_unmap_addr(txp, mapping),
 					       skb_shinfo(skb)->frags[k].size,
 					       PCI_DMA_TODEVICE);
 				i++;
@@ -10686,7 +10686,7 @@ static int tg3_run_loopback(struct tg3 *tp, int loopback_mode)
 
 	rx_skb = tpr->rx_std_buffers[desc_idx].skb;
 
-	map = pci_unmap_addr(&tpr->rx_std_buffers[desc_idx], mapping);
+	map = dma_unmap_addr(&tpr->rx_std_buffers[desc_idx], mapping);
 	pci_dma_sync_single_for_cpu(tp->pdev, map, rx_len, PCI_DMA_FROMDEVICE);
 
 	for (i = 14; i < tx_len; i++) {
diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
index 5d7f72a..3f149f3 100644
--- a/drivers/net/tg3.h
+++ b/drivers/net/tg3.h
@@ -2512,7 +2512,7 @@ struct tg3_hw_stats {
  */
 struct ring_info {
 	struct sk_buff			*skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
+	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
 struct tg3_config_info {
-- 
1.6.5


^ permalink raw reply related

* Re: [PATCH v4] rfs: Receive Flow Steering
From: Stephen Hemminger @ 2010-04-13  0:12 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, eric.dumazet, Ingo Molnar
In-Reply-To: <alpine.DEB.1.00.1004121651460.31468@pokey.mtv.corp.google.com>

On Mon, 12 Apr 2010 17:03:39 -0700 (PDT)
Tom Herbert <therbert@google.com> wrote:

> The basic idea of RFS is that when an application calls recvmsg
> (or sendmsg) the application's running CPU is stored in a hash
> table that is indexed by the connection's rxhash which is stored in
> the socket structure.  The rxhash is passed in skb's received on
> the connection from netif_receive_skb.  For each received packet,
> the associated rxhash is used to look up the CPU in the hash table;
> if a valid CPU is set, then the packet is steered to that CPU using
> the RPS mechanisms.

There are two sometimes conflicting models:

One model is to have the flows be dispersed and let the scheduler
be smarter about running the applications on the right CPUs where
the packets arrive.

The other is to have the flows redirected to the CPU where the application
previously ran, which is what RFS does.

For benchmarks and private fixed-configuration systems it is tempting
to just nail everything down, i.e. use hard SMP affinity for hardware,
processes, and flows.  But this is the wrong solution for general-purpose
systems with varying workloads and requirements.  How well does RFS really
work when applications, processes, and sockets come and go or get migrated
among CPUs by the scheduler?  My concern is that this overlaps with
scheduler design and might be a step backwards.


-- 

^ permalink raw reply
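
Editorial sketch (not from the thread): the mechanism quoted in the
message above can be modeled in a few lines of userspace C. This is a
toy under stated assumptions, not the code from the RFS patch: the table
size, the `NO_CPU` marker, and the names `flow_table`, `flow_record`
and `flow_steer` are all hypothetical. It only shows the two steps the
description names: recording the application's CPU under the flow's
rxhash on recvmsg/sendmsg, and looking it up for each received packet.

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 256u   /* hypothetical size; must be a power of two */
#define NO_CPU          0xffffu /* hypothetical "no valid CPU recorded" marker */

struct flow_table {
	uint16_t cpu[FLOW_TABLE_SIZE];
};

/* Start with no flow having a recorded CPU. */
static void flow_table_init(struct flow_table *t)
{
	for (unsigned int i = 0; i < FLOW_TABLE_SIZE; i++)
		t->cpu[i] = NO_CPU;
}

/* recvmsg/sendmsg side: remember the CPU the application is running on. */
static void flow_record(struct flow_table *t, uint32_t rxhash, uint16_t cpu)
{
	t->cpu[rxhash & (FLOW_TABLE_SIZE - 1)] = cpu;
}

/*
 * Receive side: return the recorded CPU for this flow, or fallback_cpu
 * (e.g. whatever plain RPS would have chosen) if none is recorded.
 */
static uint16_t flow_steer(const struct flow_table *t, uint32_t rxhash,
			   uint16_t fallback_cpu)
{
	uint16_t cpu = t->cpu[rxhash & (FLOW_TABLE_SIZE - 1)];

	return cpu != NO_CPU ? cpu : fallback_cpu;
}
```

In this toy model an entry is only refreshed by the next `flow_record()`
call, so if the scheduler migrates the application, packets keep going to
the old CPU until the application calls recvmsg or sendmsg again. Two
flows whose hashes share the low bits also share one slot. Both effects
are relevant to the question raised above about how the scheme behaves
when processes move between CPUs.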

* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-04-13  0:08 UTC (permalink / raw)
  To: Chris Leech
  Cc: netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy,
	bonding-devel@lists.sourceforge.net
In-Reply-To: <20100412233509.GA32302@cleech-lnx.jf.intel.com>

Chris Leech <christopher.leech@intel.com> wrote:

>On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
>> 	Is the FCoE supposed to run over the inactive bonding slave?  Or
>> am I misunderstanding what you're saying?  I had thought the LLDP, et
>> al, special case in bonding was to permit, essentially, path discovery,
>> not necessarily active use of the inactive slave.
>
>That's what I'm trying to do, yes.  Mostly because it's a setup that
>would work if you removed the FCoE traffic from the network data path,
>and only converged at the driver level and below.  It's possible that
>the answer is "don't do that".

	So, basically, you want the bond to act like usual for "regular"
ethernet traffic, but act like the slaves are independent from the bond
for the magic FCoE traffic, right?

	I'm not really sure if that's a "don't do that" or not.

>> 	Not that this is necessarily bad; the "drop stuff on inactive
>> slaves" is really there for duplicate suppression, but it also can
>> uncover network topology issues, e.g., network layouts that won't work
>> if the devices fail, but appear to work during testing because the
>> "inactive" slave still receives traffic (it hasn't really failed).
>> 
>> >The problem is that it doesn't work for hardware accelerated VLAN
>> >devices, because the VLAN receive paths have their own
>> >skb_bond_should_drop calls that were not updated.
>> >
>> >From what I can tell, VLAN receives always end up going through
>> >netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>> >the frame isn't dropped the first time.  I think the bonding checks in
>> >__vlan_hwaccel_rx and vlan_gro_common should just be removed.
>> 
>> 	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
>> the original receiving device in skb->dev; by the time the packet gets
>> to netif_receive_skb, the original slave the packet was received on has
>> been lost (and replaced with the VLAN device).  Various things are
>> interested in that, in particular the "arp_validate" and the "inactive
>> slave drop" logic for bonding depend on knowing the real device the
>> packet arrived on.
>> 
>> 	I note that the vlan accel logic doesn't change skb_iif to the
>> VLAN device; it remains as the original device.  I suppose one
>> alternative would be to convert the bonding drop, et al, logic to use
>> skb_iif instead of skb->dev; if that works, then I think the VLAN core
>> would not need to call skb_bond_should_drop, which in turn would be a
>> bit more complicated as it would have to look up the dev from the
>> skb_iif.  There's already some code in bonding that takes advantage of
>> this property of the VLANs, so maybe this is the way to go.
>
>Thanks, I'll take another look and see if I can come up with something
>better.

	I looked at the skb_bond_should_drop stuff a bit more after I
wrote that; it's not as easy as I had suspected.  The big sticking point
is that currently the test in netif_receive_skb (now __netif_receive_skb
in net-next-2.6) is on skb->dev->master to identify packets arriving on
slaves of bonding.  The VLAN skb->dev has ->master set to NULL.  Doing
that test against skb->skb_iif would be much more expensive, as it would
require a device lookup for every packet.

	So, I suspect that something has to happen in the VLAN
acceleration path, although I don't know exactly what.  I don't know if
it would be possible to flag the packets in some special way to indicate
that they're "bonding slave" packets, or if it's better to keep the
current structure and just fix the calls somehow.

	-J


>> >I haven't quite figured out what I think the correct change for
>> >null_or_bond is.  I suspect it involves not using NULL at all.  I can
>> >see how it addresses the arp_ip_target on a VLAN issue, but this is also
>> >changing the receive matching rules for other traffic in unexpected
>> >ways.
>> 
>> 	I'll hazard a guess that something like this might do it:
>> 
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index b98ddc6..cc665bb 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2735,7 +2735,7 @@ ncls:
>>  			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>  		if (ptype->type == type && (ptype->dev == null_or_orig ||
>>  		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
>> -		     ptype->dev == null_or_bond)) {
>> +		     (null_or_bond && (ptype->dev == null_or_bond))) {
>>  			if (pt_prev)
>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>  			pt_prev = ptype;
>> 
>> 
>> 	I haven't tested this, but the theory is to only test against
>> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
>> traffic over bonding.
>
>Yes, that should do it.
>
>	- Chris

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [PATCH v4] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-13  0:03 UTC (permalink / raw)
  To: davem, netdev, eric.dumazet

Version 4 of RFS:
- Use a mutex in rps_sock_flow_sysctl for mutual exclusion between
concurrent writers; this also allows calling vmalloc.
- Removed extra space before "rc = sock_queue_rcv_skb(sk, skb);"
- Make changelog < 70 chars
- Ensure calls to smp_processor_id in netif_rx are made in a
non-preemptible region
---
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash, which is stored in
the socket structure.  The rxhash is passed in skbs received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
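
As a rough userspace illustration of the idea above (not the kernel
code itself), the desired-CPU table can be pictured as a small array
indexed by the masked rxhash.  All sketch_* names and the fixed table
size are made up for this example; the patch's real helpers are
rps_record_sock_flow and rps_reset_sock_flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NO_CPU  0xffff	/* mirrors RPS_NO_CPU in the patch */
#define SKETCH_ENTRIES 8	/* hypothetical size, must be a power of two */

/* Hypothetical stand-in for the global sock flow table. */
struct sketch_sock_flow_table {
	unsigned int mask;		/* number of entries - 1 */
	uint16_t ents[SKETCH_ENTRIES];	/* desired CPU per hash bucket */
};

/* Mark every bucket as "no desired CPU known". */
static void sketch_init(struct sketch_sock_flow_table *t)
{
	unsigned int i;

	t->mask = SKETCH_ENTRIES - 1;
	for (i = 0; i < SKETCH_ENTRIES; i++)
		t->ents[i] = SKETCH_NO_CPU;
}

/* recvmsg/sendmsg side: remember which CPU the application ran on. */
static void sketch_record(struct sketch_sock_flow_table *t,
			  uint32_t hash, uint16_t cpu)
{
	/* As in the patch, a zero hash means "no hash recorded". */
	if (t && hash)
		t->ents[hash & t->mask] = cpu;
}

/* Receive side: look up the desired CPU for a packet's rxhash. */
static uint16_t sketch_lookup(const struct sketch_sock_flow_table *t,
			      uint32_t hash)
{
	assert(t != NULL);
	return t->ents[hash & t->mask];
}
```

Two flows whose hashes collide in the same bucket simply overwrite
each other's hint; the table is only advisory, so a collision costs
locality rather than correctness.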

The complication with this simple approach is that it could
allow out-of-order (OOO) packets.  If threads are thrashing around
CPUs or multiple threads are trying to read from the same sockets,
a quickly changing CPU value in the hash table could cause rampant
OOO packets; we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
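
The counter bookkeeping can be sketched in plain C as follows.  This
is only a model under simplifying assumptions (no locking, no real
packet list); struct sketch_backlog and both functions are
hypothetical names, with head standing in for input_queue_head and
qlen for input_pkt_queue.qlen.

```c
#include <assert.h>

/* Hypothetical model of one CPU's backlog queue counters. */
struct sketch_backlog {
	unsigned int head;	/* incremented on every dequeue */
	unsigned int qlen;	/* packets currently queued */
};

/*
 * Enqueue one packet and return the queue tail counter, i.e. the
 * value the flow entry would save as its "last tail" at this point.
 */
static unsigned int sketch_enqueue(struct sketch_backlog *q)
{
	q->qlen++;
	return q->head + q->qlen;
}

/* Dequeue one packet; the head counter advances by one. */
static void sketch_dequeue(struct sketch_backlog *q)
{
	assert(q->qlen > 0);
	q->qlen--;
	q->head++;
}
```

A packet whose saved tail value is T has left the queue once the
head counter has reached T, which is what the head/tail comparison
described in this changelog relies on.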

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks whether the queue head has advanced
past the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in-order delivery.
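
The three conditions above can be condensed into a small standalone
function.  This is a sketch for illustration, not the get_rps_cpu
code: the sketch_* names are invented, CPU online state and queue
head counters are passed in as plain arguments, and the signed cast
assumes the usual two's complement conversion so that the comparison
survives counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NO_CPU 0xffff	/* mirrors RPS_NO_CPU in the patch */

/* Hypothetical per-flow entry: current CPU plus saved tail counter. */
struct sketch_dev_flow {
	uint16_t cpu;
	unsigned int last_qtail;
};

/*
 * Decide which CPU a flow's next packet goes to.
 *   desired_cpu  - CPU recorded by recvmsg/sendmsg for this flow
 *   cur_online   - whether rflow->cpu is currently online
 *   cur_head     - queue head counter of rflow->cpu's backlog
 *   desired_head - queue head counter of desired_cpu's backlog
 */
static uint16_t sketch_pick_cpu(struct sketch_dev_flow *rflow,
				uint16_t desired_cpu, bool cur_online,
				unsigned int cur_head,
				unsigned int desired_head)
{
	assert(rflow != NULL);

	if (rflow->cpu != desired_cpu &&
	    (rflow->cpu == SKETCH_NO_CPU || !cur_online ||
	     (int)(cur_head - rflow->last_qtail) >= 0)) {
		/* Nothing for this flow is still queued on the old CPU. */
		rflow->cpu = desired_cpu;
		if (desired_cpu != SKETCH_NO_CPU)
			rflow->last_qtail = desired_head;
	}
	return rflow->cpu;
}
```

Comparing via the signed difference rather than a plain >= keeps the
test meaningful after the unsigned counters wrap, as long as the two
values are less than 2^31 apart.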

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality.
2) this allows lockless access to the table.  The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_sock_flow_entries" sysctl (added to net_core_table by this
patch) sets the number of entries in the rps_sock_flow_table, and
the per rxqueue sysfs entry "rps_flow_cnt" contains the number of
entries in the rps_dev_flow table for the rxqueue.  Both are rounded
up to a power of two.
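
The power-of-two sizing is what lets a lookup use a mask instead of a
modulo.  A minimal userspace sketch, assuming a simple loop in place
of the kernel's roundup_pow_of_two (both function names here are
hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (n must be 1..2^31). */
static unsigned int sketch_roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n > 0 && n <= 0x80000000u);
	while (p < n)
		p <<= 1;
	return p;
}

/* Map a flow hash to a table slot; size must be a power of two. */
static unsigned int sketch_index(uint32_t hash, unsigned int size)
{
	assert(size > 0 && (size & (size - 1)) == 0);
	return hash & (size - 1);
}
```

So a request for, say, 1000 entries would yield a 1024-entry table
with mask 1023, and every rxhash maps into it with a single AND.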

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.

e1000e on 8 core Intel
   No RFS or RPS               104K tps at 30% CPU
   No RFS (best RPS config)    290K tps at 63% CPU
   RFS                         303K tps at 61% CPU

RPC test      tps   CPU%  50/90/99% usec latency  Latency StdDev
  No RFS/RPS  103K  48%   757/900/3185            4472.35
  RPS only    174K  73%   415/993/2468            491.66
  RFS         223K  73%   379/651/1382            315.61

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..573e775 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,77 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+	u16 cpu;
+	u16 fill;
+	unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct work_struct free_work;
+	struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+	unsigned int mask;
+	u16 ents[0];
+};
+#define	RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
+    (_num * sizeof(u16)))
+
+extern int rps_sock_flow_sysctl(ctl_table *table, int write,
+				void __user *buffer, size_t *lenp,
+				loff_t *ppos);
+
+#define RPS_NO_CPU 0xffff
+
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+					u32 hash)
+{
+	if (table && hash) {
+		unsigned int cpu, index = hash & table->mask;
+
+		/* We only give a hint, preemption can change cpu under us */
+		cpu = raw_smp_processor_id();
+
+		if (table->ents[index] != cpu)
+			table->ents[index] = cpu;
+	}
+}
+
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+				       u32 hash)
+{
+	if (table && hash)
+		table->ents[hash & table->mask] = RPS_NO_CPU;
+}
+
+extern struct rps_sock_flow_table *rps_sock_flow_table;
+
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 	struct rps_map *rps_map;
+	struct rps_dev_flow_table *rps_flow_table;
 	struct kobject kobj;
 	struct netdev_rx_queue *first;
 	atomic_t count;
 } ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1331,13 +1394,21 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 
 	/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
+static inline void incr_input_queue_head(struct softnet_data *queue)
+{
+#ifdef CONFIG_RPS
+	queue->input_queue_head++;
+#endif
+}
+
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
 #define HAVE_NETIF_QUEUE
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83fd344..b487bc1 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -21,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/jhash.h>
+#include <linux/netdevice.h>
 
 #include <net/flow.h>
 #include <net/sock.h>
@@ -101,6 +102,7 @@ struct rtable;
  * @uc_ttl - Unicast TTL
  * @inet_sport - Source port
  * @inet_id - ID counter for DF pkts
+ * @rxhash - flow hash received from netif layer
  * @tos - TOS
  * @mc_ttl - Multicasting TTL
  * @is_icsk - is this an inet_connection_sock?
@@ -124,6 +126,9 @@ struct inet_sock {
 	__u16			cmsg_flags;
 	__be16			inet_sport;
 	__u16			inet_id;
+#ifdef CONFIG_RPS
+	__u32			rxhash;
+#endif
 
 	struct ip_options	*opt;
 	__u8			tos;
@@ -219,4 +224,37 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
 	return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
 }
 
+static inline void inet_rps_record_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	struct rps_sock_flow_table *sock_flow_table;
+
+	rcu_read_lock();
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+	rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_reset_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	struct rps_sock_flow_table *sock_flow_table;
+
+	rcu_read_lock();
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+	rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+{
+#ifdef CONFIG_RPS
+	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
+		inet_rps_reset_flow(sk);
+		inet_sk(sk)->rxhash = rxhash;
+	}
+#endif
+}
 #endif	/* _INET_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..7dbe64e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2203,22 +2203,81 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 #ifdef CONFIG_RPS
+/* One global table that all flow-based protocols share. */
+struct rps_sock_flow_table *rps_sock_flow_table;
+EXPORT_SYMBOL(rps_sock_flow_table);
+
+int rps_sock_flow_sysctl(ctl_table *table, int write, void __user *buffer,
+			 size_t *lenp, loff_t *ppos)
+{
+	unsigned int orig_size, size;
+	int ret, i;
+	ctl_table tmp = {
+		.data = &size,
+		.maxlen = sizeof(size),
+		.mode = table->mode
+	};
+	struct rps_sock_flow_table *orig_sock_table, *sock_table;
+	static DEFINE_MUTEX(sock_flow_mutex);
+
+	mutex_lock(&sock_flow_mutex);
+
+	orig_sock_table = rps_sock_flow_table;
+	size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
+
+	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+
+	if (write) {
+		if (size) {
+			size = roundup_pow_of_two(size);
+			if (size != orig_size) {
+				sock_table =
+				    vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
+				if (!sock_table) {
+					mutex_unlock(&sock_flow_mutex);
+					return -ENOMEM;
+				}
+
+				sock_table->mask = size - 1;
+			} else
+				sock_table = orig_sock_table;
+
+			for (i = 0; i < size; i++)
+				sock_table->ents[i] = RPS_NO_CPU;
+		} else
+			sock_table = NULL;
+
+		if (sock_table != orig_sock_table) {
+			rcu_assign_pointer(rps_sock_flow_table, sock_table);
+			synchronize_rcu();
+			vfree(orig_sock_table);
+		}
+	}
+
+	mutex_unlock(&sock_flow_mutex);
+
+	return ret;
+}
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
+ * rcu_read_lock must be held on entry.
  */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+		       struct rps_dev_flow **rflowp)
 {
 	struct ipv6hdr *ip6;
 	struct iphdr *ip;
 	struct netdev_rx_queue *rxqueue;
 	struct rps_map *map;
+	struct rps_dev_flow_table *flow_table;
+	struct rps_sock_flow_table *sock_flow_table;
 	int cpu = -1;
 	u8 ip_proto;
+	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
-	rcu_read_lock();
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2233,7 +2292,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 	} else
 		rxqueue = dev->_rx;
 
-	if (!rxqueue->rps_map)
+	if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
 		goto done;
 
 	if (skb->rxhash)
@@ -2285,9 +2344,48 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 		skb->rxhash = 1;
 
 got_hash:
+	flow_table = rcu_dereference(rxqueue->rps_flow_table);
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	if (flow_table && sock_flow_table) {
+		u16 next_cpu;
+		struct rps_dev_flow *rflow;
+
+		rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
+		tcpu = rflow->cpu;
+
+		next_cpu = sock_flow_table->ents[skb->rxhash &
+		    sock_flow_table->mask];
+
+		/*
+		 * If the desired CPU (where last recvmsg was done) is
+		 * different from current CPU (one in the rx-queue flow
+		 * table entry), switch if one of the following holds:
+		 *   - Current CPU is unset (equal to RPS_NO_CPU).
+		 *   - Current CPU is offline.
+		 *   - The current CPU's queue tail has advanced beyond the
+		 *     last packet that was enqueued using this table entry.
+		 *     This guarantees that all previous packets for the flow
+		 *     have been dequeued, thus preserving in order delivery.
+		 */
+		if (unlikely(tcpu != next_cpu) &&
+		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
+		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
+		      rflow->last_qtail)) >= 0)) {
+			tcpu = rflow->cpu = next_cpu;
+			if (tcpu != RPS_NO_CPU)
+				rflow->last_qtail = per_cpu(softnet_data,
+				    tcpu).input_queue_head;
+		}
+		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
+			*rflowp = rflow;
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
 	map = rcu_dereference(rxqueue->rps_map);
 	if (map) {
-		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;
@@ -2296,7 +2394,6 @@ got_hash:
 	}
 
 done:
-	rcu_read_unlock();
 	return cpu;
 }
 
@@ -2322,13 +2419,14 @@ static void trigger_softirq(void *data)
 	__napi_schedule(&queue->backlog);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_RPS */
 
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
  */
-static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
+			      unsigned int *qtail)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
@@ -2343,6 +2441,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 		if (queue->input_pkt_queue.qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
+#ifdef CONFIG_RPS
+			*qtail = queue->input_queue_head +
+			    queue->input_pkt_queue.qlen;
+#endif
 			rps_unlock(queue);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2357,11 +2459,10 @@ enqueue:
 
 				cpu_set(cpu, rcpus->mask[rcpus->select]);
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-			} else
-				__napi_schedule(&queue->backlog);
-#else
-			__napi_schedule(&queue->backlog);
+				goto enqueue;
+			}
 #endif
+			__napi_schedule(&queue->backlog);
 		}
 		goto enqueue;
 	}
@@ -2392,7 +2493,8 @@ enqueue:
 
 int netif_rx(struct sk_buff *skb)
 {
-	int cpu;
+	unsigned int qtail;
+	int err;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2402,14 +2504,26 @@ int netif_rx(struct sk_buff *skb)
 		net_timestamp(skb);
 
 #ifdef CONFIG_RPS
-	cpu = get_rps_cpu(skb->dev, skb);
-	if (cpu < 0)
-		cpu = smp_processor_id();
+	{
+		struct rps_dev_flow voidflow, *rflow = &voidflow;
+		int cpu;
+
+		rcu_read_lock();
+
+		cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		if (cpu < 0)
+			cpu = smp_processor_id();
+
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+
+		rcu_read_unlock();
+	}
 #else
-	cpu = smp_processor_id();
+	preempt_disable();
+	err = enqueue_to_backlog(skb, smp_processor_id(), &qtail);
+	preempt_enable();
 #endif
-
-	return enqueue_to_backlog(skb, cpu);
+	return err;
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2776,17 +2890,22 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	int cpu;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
+	int cpu, err;
+
+	rcu_read_lock();
 
-	cpu = get_rps_cpu(skb->dev, skb);
+	cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-	if (cpu < 0)
-		return __netif_receive_skb(skb);
-	else
-		return enqueue_to_backlog(skb, cpu);
-#else
-	return __netif_receive_skb(skb);
+	if (cpu >= 0) {
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+		rcu_read_unlock();
+		return err;
+	}
+
+	rcu_read_unlock();
 #endif
+	return __netif_receive_skb(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
@@ -2802,6 +2921,7 @@ static void flush_backlog(void *arg)
 		if (skb->dev == dev) {
 			__skb_unlink(skb, &queue->input_pkt_queue);
 			kfree_skb(skb);
+			incr_input_queue_head(queue);
 		}
 	rps_unlock(queue);
 }
@@ -3125,6 +3245,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			local_irq_enable();
 			break;
 		}
+		incr_input_queue_head(queue);
 		rps_unlock(queue);
 		local_irq_enable();
 
@@ -5488,8 +5609,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
+		incr_input_queue_head(oldsd);
+	}
 
 	return NOTIFY_OK;
 }
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 96ed690..e518bee 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -601,22 +601,105 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	return len;
 }
 
+static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+					   struct rx_queue_attribute *attr,
+					   char *buf)
+{
+	struct rps_dev_flow_table *flow_table;
+	unsigned int val = 0;
+
+	rcu_read_lock();
+	flow_table = rcu_dereference(queue->rps_flow_table);
+	if (flow_table)
+		val = flow_table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, "%u\n", val);
+}
+
+static void rps_dev_flow_table_release_work(struct work_struct *work)
+{
+	struct rps_dev_flow_table *table = container_of(work,
+	    struct rps_dev_flow_table, free_work);
+
+	vfree(table);
+}
+
+static void rps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct rps_dev_flow_table *table = container_of(rcu,
+	    struct rps_dev_flow_table, rcu);
+
+	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
+	schedule_work(&table->free_work);
+}
+
+ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+				     struct rx_queue_attribute *attr,
+				     const char *buf, size_t len)
+{
+	unsigned int count;
+	char *endp;
+	struct rps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(rps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	count = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EINVAL;
+
+	if (count) {
+		int i;
+
+		count = roundup_pow_of_two(count);
+		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = count - 1;
+		for (i = 0; i < count; i++)
+			table->flows[i].cpu = RPS_NO_CPU;
+	} else
+		table = NULL;
+
+	spin_lock(&rps_dev_flow_lock);
+	old_table = queue->rps_flow_table;
+	rcu_assign_pointer(queue->rps_flow_table, table);
+	spin_unlock(&rps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, rps_dev_flow_table_release);
+
+	return len;
+}
+
 static struct rx_queue_attribute rps_cpus_attribute =
 	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
 
+
+static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
+	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
+	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+
 static struct attribute *rx_queue_default_attrs[] = {
 	&rps_cpus_attribute.attr,
+	&rps_dev_flow_table_cnt_attribute.attr,
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
-	struct rps_map *map = queue->rps_map;
 	struct netdev_rx_queue *first = queue->first;
 
-	if (map)
-		call_rcu(&map->rcu, rps_map_release);
+	if (queue->rps_map)
+		call_rcu(&queue->rps_map->rcu, rps_map_release);
+
+	if (queue->rps_flow_table)
+		call_rcu(&queue->rps_flow_table->rcu,
+		    rps_dev_flow_table_release);
 
 	if (atomic_dec_and_test(&first->count))
 		kfree(first);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b7b6b82..9eb2f67 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -82,6 +82,14 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#ifdef CONFIG_RPS
+	{
+		.procname	= "rps_sock_flow_entries",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= rps_sock_flow_sysctl
+	},
+#endif
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a0beb32..3703b5e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -419,6 +419,8 @@ int inet_release(struct socket *sock)
 	if (sk) {
 		long timeout;
 
+		inet_rps_reset_flow(sk);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -720,6 +722,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -728,12 +732,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-
 static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 			     size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -743,6 +748,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 	return sock_no_sendpage(sock, page, offset, size, flags);
 }
 
+int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		 size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	inet_rps_record_flow(sk);
+
+	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
+				   flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(inet_recvmsg);
 
 int inet_shutdown(struct socket *sock, int how)
 {
@@ -872,7 +893,7 @@ const struct proto_ops inet_stream_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -899,7 +920,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -929,7 +950,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a24995c..ad08392 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1672,6 +1672,8 @@ process:
 
 	skb->dev = NULL;
 
+	inet_rps_save_rxhash(sk, skb->rxhash);
+
 	bh_lock_sock_nested(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fef859..666b963 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1217,6 +1217,7 @@ int udp_disconnect(struct sock *sk, int flags)
 	sk->sk_state = TCP_CLOSE;
 	inet->inet_daddr = 0;
 	inet->inet_dport = 0;
+	inet_rps_save_rxhash(sk, 0);
 	sk->sk_bound_dev_if = 0;
 	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
 		inet_reset_saddr(sk);
@@ -1258,8 +1259,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-	int rc = sock_queue_rcv_skb(sk, skb);
+	int rc;
+
+	if (inet_sk(sk)->inet_daddr)
+		inet_rps_save_rxhash(sk, skb->rxhash);
 
+	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 

^ permalink raw reply related

* Re: Receive issues with bonding and vlans
From: Chris Leech @ 2010-04-12 23:35 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy,
	bonding-devel@lists.sourceforge.net
In-Reply-To: <29849.1271113851@death.nxdomain.ibm.com>

On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
> 	Is the FCoE supposed to run over the inactive bonding slave?  Or
> am I misunderstanding what you're saying?  I had thought the LLDP, et
> al, special case in bonding was to permit, essentially, path discovery,
> not necessarily active use of the inactive slave.

That's what I'm trying to do, yes.  Mostly because it's a setup that
would work if you removed the FCoE traffic from the network data path,
and only converged at the driver level and below.  It's possible that
the answer is "don't do that".

> 	Not that this is necessarily bad; the "drop stuff on inactive
> slaves" is really there for duplicate suppression, but it also can
> uncover network topology issues, e.g., network layouts that won't work
> if the devices fail, but appear to work during testing because the
> "inactive" slave still receives traffic (it hasn't really failed).
> 
> >The problem is that it doesn't work for hardware accelerated VLAN
> >devices, because the VLAN receive paths have their own
> >skb_bond_should_drop calls that were not updated.
> >
> >From what I can tell, VLAN receives always end up going through
> >netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
> >the frame isn't dropped the first time.  I think the bonding checks in
> >__vlan_hwaccel_rx and vlan_gro_common should just be removed.
> 
> 	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
> the original receiving device in skb->dev; by the time the packet gets
> to netif_receive_skb, the original slave the packet was received on has
> been lost (and replaced with the VLAN device).  Various things are
> interested in that, in particular the "arp_validate" and the "inactive
> slave drop" logic for bonding depend on knowing the real device the
> packet arrived on.
> 
> 	I note that the vlan accel logic doesn't change skb_iif to the
> VLAN device; it remains as the original device.  I suppose one
> alternative would be to convert the bonding drop, et al, logic to use
> skb_iif instead of skb->dev; if that works, then I think the VLAN core
> would not need to call skb_bond_should_drop, which in turn would be a
> bit more complicated as it would have to look up the dev from the
> skb_iif.  There's already some code in bonding that takes advantage of
> this property of the VLANs, so maybe this is the way to go.

Thanks, I'll take another look and see if I can come up with something
better.

> >I haven't quite figured out what I think the correct change for
> >null_or_bond is.  I suspect it involves not using NULL at all.  I can
> >see how it addresses the arp_ip_target on a VLAN issue, but this is also
> >changing the receive matching rules for other traffic in unexpected
> >ways.
> 
> 	I'll hazard a guess that something like this might do it:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b98ddc6..cc665bb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2735,7 +2735,7 @@ ncls:
>  			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>  		if (ptype->type == type && (ptype->dev == null_or_orig ||
>  		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
> -		     ptype->dev == null_or_bond)) {
> +		     (null_or_bond && (ptype->dev == null_or_bond))) {
>  			if (pt_prev)
>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>  			pt_prev = ptype;
> 
> 
> 	I haven't tested this, but the theory is to only test against
> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
> traffic over bonding.

Yes, that should do it.

	- Chris


^ permalink raw reply

* Re: [PATCH] vlan: remove receive checks for bonding
From: Jay Vosburgh @ 2010-04-12 23:19 UTC (permalink / raw)
  To: Chris Leech; +Cc: netdev, Andy Gospodarek, Patrick McHardy, bonding-devel
In-Reply-To: <20100412221723.8068.75393.stgit@localhost6.localdomain6>

Chris Leech <christopher.leech@intel.com> wrote:

>The checks in the hardware accelerated receive path are not up to date
>with what's in netif_receive_skb, which will get called anyway if the
>frame is not dropped in the vlan code.
>
>Signed-off-by: Chris Leech <christopher.leech@intel.com>

NAK.

	As I explained in a reply to Chris's separate message detailing
the problem he sees, the skb_bond_should_drop logic as implemented is
dependent upon knowing the original skb->dev the packet arrived on,
prior to VLAN reassigning it.

	That's not to say there's nothing wrong here, but removing the
calls will break other things.

	-J

>---
>
> net/8021q/vlan_core.c |    6 ------
> 1 files changed, 0 insertions(+), 6 deletions(-)
>
>diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>index c584a0a..7576f9c 100644
>--- a/net/8021q/vlan_core.c
>+++ b/net/8021q/vlan_core.c
>@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
> 	if (netpoll_rx(skb))
> 		return NET_RX_DROP;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>-
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
> 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
> {
> 	struct sk_buff *p;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>-
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
> 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: Strange packet drops with heavy firewalling
From: Changli Gao @ 2010-04-12 23:18 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: zhigang gong, netdev
In-Reply-To: <1271091990.2858.409.camel@ursa.amorsen.dk>

On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen <benny+usenet@amorsen.dk> wrote:
>
>  99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
>  100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
>  101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
>  102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
>  103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
>  104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
>  105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
>  106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
>  107:          0          0          1          0   PCI-MSI-edge      eth1
>  108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
>  109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
>  110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
>  111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
>  112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
>  113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
>  114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
>  115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
>  116:          0          1          0          0   PCI-MSI-edge      eth0
>
>
> irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
> which to my mind should cause the same problem as before (where CPU1 and
> CPU3 were handling all packets). Yet the box clearly works much better
> than before.

irqbalanced? I don't think it can work properly. Try RPS in the netdev
and linux-next trees, and if the CPU load isn't even, try this patch:
http://patchwork.ozlabs.org/patch/49915/ .


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-04-12 23:10 UTC (permalink / raw)
  To: Chris Leech; +Cc: netdev, Andy Gospodarek, Patrick McHardy, bonding-devel
In-Reply-To: <20100412221645.8068.71073.stgit@localhost6.localdomain6>

Chris Leech <christopher.leech@intel.com> wrote:

>Quick summary: VLANs and bonding are interacting in strange ways in the
>receive path, VLAN devices do not act the same as real Ethernet devices,
>hardware accelerated VLANs do not act the same as software tagged VLANs,
>and I think frames are incorrectly being passed up to protocols from
>inactive bonding links.
>
>I've been looking at high availability configurations for converged LAN
>+ SAN networking, trying to see what running FCoE and IP traffic looked
>like with bonding and dm_multipath.  The goal is to allow sysadmins to
>use the tools they are already using with separate LAN and SAN adapters,
>now on a single converged adapter.
>
>The setup I'm trying to use looks like this; with IP traffic running on
>bond0, storage VLANs created on eth0 and eth1, and FCoE running on the
>VLANs.  Both switches provide Fibre Channel Forwarder (FCF) services,
>and connect to the same LAN and SAN.
>
>	 .-----------------------------------------.
>	 |             .--------------.            |
>	 |             | dm_multipath |            |
>	 |             '--------------'            |
>	 |                     ^                   |
>	 | .----------.        |      .----------. |
>	 | | fc_host0 |--------'------| fc_host1 | |
>	 | '----------'               '----------' |
>	 |       ^                          ^      |
>	 |       |                          |      |
>	 | .----------.   .-------.   .----------. |
>	 | | eth0.101 |   | bond0 |   | eth1.101 | |
>	 | '----------'   '-------'   '----------' |
>	 |       ^            ^             ^      |
>	 |       | .------.   |    .------. |      |
>	 |       '-| eth0 |---'----| eth1 |-'      |
>	 |         '------'        '------'        |
>	 '-------------|---------------|-----------'
>	               |               |
>	               v               v
>	         .----------.    .----------.
>	         | switch A |----| switch B |
>	         '----------'    '----------'
>	             |  |            |  |
>	          .--'--'------------'--'-.
>	          |                       |
>	          v                       v
>	     .-,(  ),-.               .-,(  ),-.    
>	  .-(          )-.         .-(          )-. 
>	 (     FC SAN     )       (     IP LAN     )
>	  '-(          ).-'        '-(          ).-'
>	      '-.( ).-'                '-.( ).-'    
>
>bond0 is in active-backup mode, but FCoE is actively running on both
>links providing two different paths into the SAN.  This configuration
>matches a typical HA setup with separate Ethernet + FC adapters.  In
>this case I'm interested in software convergence where all traffic
>passes through the standard network transmit and receive paths.
>
>The VLANs aren't strictly required by FCoE, but using them is the best
>practice recommended by switch vendors.  The FCF switches map FC VSANs to
>VLANs.
>
>Ever since this series of changes to net/core/dev.c
>
>  Author: Joe Eykholt <jre@nuovasystems.com>
>  Date:   Wed Jul 2 18:22:02 2008 -0700
>  net/core: Uninline skb_bond().
>  net/core: Allow certain receives on inactive slave.
>  net/core: Allow receive on active slaves.
>
>it has been possible to receive directly on both active and inactive
>slave links if the packet_type specifies the slave device.  This
>combined with the PACKET_ORIGDEV socket option allowed for FCoE to run
>on the slave devices (DCB link configuration uses a userspace LLDP
>agent, and FCoE includes a VLAN discovery protocol that is implemented
>in userspace as well).

	Is the FCoE supposed to run over the inactive bonding slave?  Or
am I misunderstanding what you're saying?  I had thought the LLDP, et
al, special case in bonding was to permit, essentially, path discovery,
not necessarily active use of the inactive slave.

	Not that this is necessarily bad; the "drop stuff on inactive
slaves" is really there for duplicate suppression, but it also can
uncover network topology issues, e.g., network layouts that won't work
if the devices fail, but appear to work during testing because the
"inactive" slave still receives traffic (it hasn't really failed).

>The problem is that it doesn't work for hardware accelerated VLAN
>devices, because the VLAN receive paths have their own
>skb_bond_should_drop calls that were not updated.
>
>From what I can tell, VLAN receives always end up going through
>netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>the frame isn't dropped the first time.  I think the bonding checks in
>__vlan_hwaccel_rx and vlan_gro_common should just be removed.

	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
the original receiving device in skb->dev; by the time the packet gets
to netif_receive_skb, the original slave the packet was received on has
been lost (and replaced with the VLAN device).  Various things are
interested in that, in particular the "arp_validate" and the "inactive
slave drop" logic for bonding depend on knowing the real device the
packet arrived on.

	I note that the vlan accel logic doesn't change skb_iif to the
VLAN device; it remains as the original device.  I suppose one
alternative would be to convert the bonding drop, et al, logic to use
skb_iif instead of skb->dev; if that works, then I think the VLAN core
would not need to call skb_bond_should_drop, which in turn would be a
bit more complicated as it would have to look up the dev from the
skb_iif.  There's already some code in bonding that takes advantage of
this property of the VLANs, so maybe this is the way to go.
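
	As a rough illustration of the skb_iif idea, here is a userspace
toy model (not kernel code).  The two assignments in toy_vlan_hwaccel_rx
follow the order seen in the quoted __vlan_hwaccel_rx hunk; struct
toy_netdev, struct toy_skb and toy_dev_by_index are hypothetical
stand-ins invented for this sketch, and the linear scan only stands in
for whatever ifindex lookup the real code would use.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct toy_netdev {
	int ifindex;
	const char *name;
};

struct toy_skb {
	const struct toy_netdev *dev;	/* device the stack currently sees */
	int skb_iif;			/* ifindex of the receiving device */
};

/*
 * Same order as the quoted hunk: record the receiving ifindex first,
 * then repoint skb->dev at the VLAN device.
 */
static void toy_vlan_hwaccel_rx(struct toy_skb *skb,
				const struct toy_netdev *vlan_dev)
{
	skb->skb_iif = skb->dev->ifindex;
	skb->dev = vlan_dev;
}

/* Recover a device from an ifindex; returns NULL if none matches. */
static const struct toy_netdev *
toy_dev_by_index(const struct toy_netdev *devs, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}
```

	After toy_vlan_hwaccel_rx runs, skb->dev names the VLAN device,
but a lookup on skb_iif still returns the slave the frame arrived on.
That is the property a bonding drop check based on skb_iif would rely
on, at the cost of the extra lookup mentioned above.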

>@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
> 	if (netpoll_rx(skb))
> 		return NET_RX_DROP;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>-
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
> 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
> {
> 	struct sk_buff *p;
>
>-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
>-		goto drop;
>-
> 	skb->skb_iif = skb->dev->ifindex;
> 	__vlan_hwaccel_put_tag(skb, vlan_tci);
> 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
>
>That fixes my setup ... but thinking about it raised some more
>questions.  The VLAN discovery tool I wrote shouldn't have worked: I
>didn't bother to bind a packet socket to each interface I wanted to use.
>So a single unbound packet socket is successfully passing traffic on
>both active and inactive slave interfaces, which from my understanding
>shouldn't work.  It's easier for me this way, but it still seems wrong.

	There is no logic to block transmission on bonding slave
devices; anything is free to bind to the slave and send whatever it
wants.

	The reception logic (to drop most traffic on the inactive slave
in an active-backup bond) was originally put in place to prevent
duplicate packets from being received for broadcasts and for unicasts
when the switch floods to all ports (which can happen during the
interval that a switch is still learning the MAC address).
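
	For comparison with the unbound socket described above, a
minimal sketch of what binding a packet socket to one interface could
look like from userspace.  The helper names fill_bind_addr and
open_bound_packet_socket are made up for this example and are not taken
from the discovery tool; the calls themselves (socket, bind,
if_nametoindex, setsockopt with PACKET_ORIGDEV) are the standard
AF_PACKET interfaces.  Opening the socket needs CAP_NET_RAW.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a sockaddr_ll that restricts a packet socket to one interface. */
static void fill_bind_addr(struct sockaddr_ll *sll, int ifindex,
			   unsigned short proto)
{
	memset(sll, 0, sizeof(*sll));
	sll->sll_family = AF_PACKET;
	sll->sll_protocol = htons(proto);
	sll->sll_ifindex = ifindex;
}

/*
 * Open a raw packet socket bound to ifname and ask for the original
 * receiving device to be reported (PACKET_ORIGDEV).
 * Returns the fd, or -1 on any failure.
 */
static int open_bound_packet_socket(const char *ifname, unsigned short proto)
{
	struct sockaddr_ll sll;
	unsigned int ifindex = if_nametoindex(ifname);
	int one = 1;
	int fd;

	if (ifindex == 0)
		return -1;	/* no such interface */

	fd = socket(AF_PACKET, SOCK_RAW, htons(proto));
	if (fd < 0)
		return -1;

	fill_bind_addr(&sll, (int)ifindex, proto);
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0 ||
	    setsockopt(fd, SOL_PACKET, PACKET_ORIGDEV, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

	A tool would call open_bound_packet_socket once per slave (for
example "eth0" and "eth1") instead of relying on a single unbound
socket seeing frames from both.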

>I think the problem was introduced with these changes.
>
>  Author: Andy Gospodarek <andy@greyhouse.net>
>  Date:   Wed Jan 6 12:56:37 2010 +0000
>  fix bonding: allow arp_ip_targets on separate vlans to use arp validation
>  Date:   Mon Dec 14 10:48:58 2009 +0000
>  bonding: allow arp_ip_targets on separate vlans to use arp validation
>
>The use of null_or_bond in netif_receive_skb looks suspicious to me.  In
>the presence of both bonding and VLANs it probably does what was
>intended.  Without VLANs however, it is always set to NULL which matches
>unbound packet_types.  So unbound packet_types will process all frames
>received on an inactive slave link, ignoring the result of
>skb_bond_should_drop.
>
>I haven't quite figured out what I think the correct change for
>null_or_bond is.  I suspect it involves not using NULL at all.  I can
>see how it addresses the arp_ip_target on a VLAN issue, but this is also
>changing the receive matching rules for other traffic in unexpected
>ways.

	I'll hazard a guess that something like this might do it:

diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..cc665bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2735,7 +2735,7 @@ ncls:
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
 		if (ptype->type == type && (ptype->dev == null_or_orig ||
 		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == null_or_bond)) {
+		     (null_or_bond && (ptype->dev == null_or_bond))) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;


	I haven't tested this, but the theory is to only test against
null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN
traffic over bonding.
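
	To make the difference concrete, a userspace toy model of the
match test before and after the change above.  struct toy_dev and the
two match_* helpers are hypothetical names for this sketch; only the
boolean expressions mirror the diff.  The scenario in the note below
additionally assumes that null_or_orig is set to the receiving device
when skb_bond_should_drop says to drop, which is how the exact-match
delivery on inactive slaves is understood to work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct net_device; only pointer identity matters. */
struct toy_dev {
	const char *name;
};

/* Match test before the change; ptype_dev == NULL means "unbound". */
static bool match_old(const struct toy_dev *ptype_dev,
		      const struct toy_dev *null_or_orig,
		      const struct toy_dev *skb_dev,
		      const struct toy_dev *orig_dev,
		      const struct toy_dev *null_or_bond)
{
	return ptype_dev == null_or_orig || ptype_dev == skb_dev ||
	       ptype_dev == orig_dev || ptype_dev == null_or_bond;
}

/* Match test after the change: null_or_bond only counts when non-NULL. */
static bool match_new(const struct toy_dev *ptype_dev,
		      const struct toy_dev *null_or_orig,
		      const struct toy_dev *skb_dev,
		      const struct toy_dev *orig_dev,
		      const struct toy_dev *null_or_bond)
{
	return ptype_dev == null_or_orig || ptype_dev == skb_dev ||
	       ptype_dev == orig_dev ||
	       (null_or_bond && ptype_dev == null_or_bond);
}
```

	For a non-VLAN frame on an inactive slave eth1 (null_or_orig,
skb_dev and orig_dev all eth1, null_or_bond NULL), an unbound handler
matches under match_old because NULL == null_or_bond, but not under
match_new.  A handler bound to eth1 matches in both.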

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply related

* [PATCH] vlan: remove receive checks for bonding
From: Chris Leech @ 2010-04-12 22:17 UTC (permalink / raw)
  To: netdev, Andy Gospodarek, Patrick McHardy; +Cc: bonding-devel
In-Reply-To: <20100412221645.8068.71073.stgit@localhost6.localdomain6>

The checks in the hardware accelerated receive path are not up to date
with what's in netif_receive_skb, which will get called anyway if the
frame is not dropped in the vlan code.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
---

 net/8021q/vlan_core.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index c584a0a..7576f9c 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
 
-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
-		goto drop;
-
 	skb->skb_iif = skb->dev->ifindex;
 	__vlan_hwaccel_put_tag(skb, vlan_tci);
 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 {
 	struct sk_buff *p;
 
-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
-		goto drop;
-
 	skb->skb_iif = skb->dev->ifindex;
 	__vlan_hwaccel_put_tag(skb, vlan_tci);
 	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);


^ permalink raw reply related

