Netdev List


* [Patch] atl1c: Add missing PCI device ID
From: Chuck Ebbert @ 2011-02-02 15:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jie Yang


Commit 8f574b35f22fbb9b5e5f1d11ad6b55b6f35f4533 added support for a new
adapter but failed to add it to the PCI device table.

Signed-off-by: Chuck Ebbert <cebbert@redhat.com>

--- vanilla-2.6.38-rc3.orig/drivers/net/atl1c/atl1c_main.c
+++ vanilla-2.6.38-rc3/drivers/net/atl1c/atl1c_main.c
@@ -48,6 +48,7 @@ static DEFINE_PCI_DEVICE_TABLE(atl1c_pci
 	{PCI_DEVICE(PCI_VENDOR_ID_ATTANSIC, PCI_DEVICE_ID_ATHEROS_L2C_B)},
 	{PCI_DEVICE(PCI_VENDOR_ID_ATTANSIC, PCI_DEVICE_ID_ATHEROS_L2C_B2)},
 	{PCI_DEVICE(PCI_VENDOR_ID_ATTANSIC, PCI_DEVICE_ID_ATHEROS_L1D)},
+	{PCI_DEVICE(PCI_VENDOR_ID_ATTANSIC, PCI_DEVICE_ID_ATHEROS_L1D_2_0)},
 	/* required last entry */
 	{ 0 }
 };
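
Editorial sketch, not part of the patch: a minimal user-space model of why a missing table entry matters. A driver's ID table is scanned up to the all-zero terminator, so a device that is not listed never matches and is never bound. The struct, function, and `DEMO_*` names are hypothetical, and the numeric IDs are placeholders rather than values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct pci_device_id (hypothetical, user space). */
struct demo_pci_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder ID values for illustration only; not taken from the patch. */
#define DEMO_VENDOR      0x1969
#define DEMO_DEV_L1D     0x1073
#define DEMO_DEV_L1D_2_0 0x1083

/* Table before the fix: the new adapter is absent. */
static const struct demo_pci_id demo_tbl_before[] = {
	{ DEMO_VENDOR, DEMO_DEV_L1D },
	{ 0, 0 }	/* required last entry */
};

/* Table after the fix: the new adapter is listed. */
static const struct demo_pci_id demo_tbl_after[] = {
	{ DEMO_VENDOR, DEMO_DEV_L1D },
	{ DEMO_VENDOR, DEMO_DEV_L1D_2_0 },
	{ 0, 0 }	/* required last entry */
};

/* Walk the table until the all-zero terminator; return the match or NULL. */
static const struct demo_pci_id *
demo_match(const struct demo_pci_id *tbl, uint16_t vendor, uint16_t device)
{
	assert(tbl != NULL);
	for (; tbl->vendor || tbl->device; tbl++)
		if (tbl->vendor == vendor && tbl->device == device)
			return tbl;
	return NULL;
}
```

With the "before" table a lookup for the new adapter returns NULL; with the "after" table it returns the added entry.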

^ permalink raw reply

* bonding + igb TX queue warnings
From: Phil Oester @ 2011-02-02 15:55 UTC (permalink / raw)
  To: netdev

Running 2.6.36.3 here, and brought up a bond on 2 igb interfaces.
Syslog now getting flooded with this:

kernel: bond0 selects TX queue 16, but real number of TX queues is 16

Any way to change which queue bond0 is selecting or otherwise fix
this?

Phil
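
Editorial sketch, not confirmed in this message: one possible cause of a "selects TX queue 16 ... is 16" warning is an off-by-one. The kernel's skb_record_rx_queue() stores the receive queue as index + 1 so that 0 can mean "nothing recorded"; a queue selector that returns the raw stored value would then hand back 16 for rx queue 15. Whether bonding in 2.6.36.3 does exactly this is an assumption. The model below is user-space C with hypothetical `demo_*` names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of the skb queue_mapping field. */
struct demo_skb {
	uint16_t queue_mapping;	/* 0 = nothing recorded, else rx queue + 1 */
};

/* Record the receive queue with the +1 offset. */
static void demo_record_rx_queue(struct demo_skb *skb, uint16_t rx_queue)
{
	assert(skb != NULL);
	skb->queue_mapping = (uint16_t)(rx_queue + 1);
}

/* Selector that forgets the offset (the suspected bug, as an assumption). */
static uint16_t demo_select_raw(const struct demo_skb *skb)
{
	return skb->queue_mapping;
}

/* Selector that undoes the offset and falls back to queue 0. */
static uint16_t demo_select_fixed(const struct demo_skb *skb)
{
	return skb->queue_mapping ? (uint16_t)(skb->queue_mapping - 1) : 0;
}

/* Valid TX queue indices are 0 .. real_num_tx_queues - 1. */
static int demo_queue_in_range(uint16_t txq, uint16_t real_num_tx_queues)
{
	return txq < real_num_tx_queues;
}
```

For a packet received on rx queue 15 of a 16-queue device, the raw selector yields 16 (out of range) while the corrected one yields 15.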

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 15:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296661371.25430.13.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters, for example: I could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > > > usage.
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters, but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

And this is with sndbuf=0 in host, yes?
And do you see a lot of tx interrupts?
How many packets per interrupt?

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 15:47 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296661185.25430.10.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > Yes, I think doing this in the host is much simpler,
> > just send an interrupt after there's a decent amount
> > of space in the queue.
> > 
> > Having said that the simple heuristic that I coded
> > might be a bit too simple.
> 
> From the debugging output I have seen so far (a single small message
> TCP_STREAM test), I think the right approach is to patch both guest and
> vhost.

One problem is slowing down the guest helps here.
So there's a chance that just by adding complexity
in guest driver we get a small improvement :(

We can't rely on a patched guest anyway, so
I think it is best to test guest and host changes separately.

And I do agree something needs to be done in guest too,
for example when vqs share an interrupt, we
might invoke a callback when we see vq is not empty
even though it's not requested. Probably should
check interrupts enabled here?

> The problem I have found is a regression for the single small
> message TCP_STREAM test. The old kernel works well for TCP_STREAM, only the
> new kernel has the problem.

Likely new kernel is faster :)

> For Steven's problem, it's a multiple stream TCP_RR issue: the old guest
> doesn't perform well, and neither does the new guest kernel. We tested the
> patch reducing vhost signaling before, and it didn't help performance at all.
> 
> Thanks
> Shirley

Yes, it seems unrelated to tx interrupts.

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 15:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202104946.GC8505@redhat.com>

On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > w/i guest change, I played around the parameters, for example: I could
> > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > > usage.
> > 
> > I meant w/o guest change, only vhost changes. Sorry about that.
> > 
> > Shirley
> 
> Ah, excellent. What were the parameters? 

I used half of the ring size 129 for packet counters, but the
performance is still not as good as dropping packets on guest, 3.7 Gb/s
vs. 6.2Gb/s.

Shirley


^ permalink raw reply

* kernel panic when exiting a network namespace
From: Daniel Lezcano @ 2011-02-02 15:40 UTC (permalink / raw)
  To: Linux Netdev List

Hi All,

If we create a network namespace and a veth device pair with one side
in the new netns and the other side in the old netns, then the kernel
panics when the new netns exits.

That does not happen when both sides are in the same network namespace.

linux-swk0 login: BUG: unable to handle kernel paging request at ffff88003aacfce0
IP: [<ffffffff812d2ebc>] unregister_netdevice_queue+0x4d/0x85
PGD 160b063 PUD 160f063 PMD 1ffd3067 PTE 3aacf160
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/virtual/block/ram9/uevent
CPU 0
Modules linked in:

Pid: 5, comm: kworker/u:0 Not tainted 2.6.38-rc2+ #4 /Bochs
RIP: 0010:[<ffffffff812d2ebc>]  [<ffffffff812d2ebc>] unregister_netdevice_queue+0x4d/0x85
RSP: 0018:ffff88003ecf1ca0  EFLAGS: 00010282
RAX: ffff88003aacfcd8 RBX: ffff88003908b800 RCX: 0000000000000000
RDX: ffff88003aacfcd8 RSI: ffff88003ecf1ce0 RDI: ffff88003908b8b0
RBP: ffff88003ecf1cb0 R08: ffffffff816655e0 R09: 00014b8a6008606d
R10: 0000000000000002 R11: 0000000000000004 R12: ffff88003ecf1ce0
R13: ffff88003ecf1ce0 R14: ffff88003ecf1d60 R15: ffff8800394a17f0
FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88003aacfce0 CR3: 0000000039587000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88003ecf0000, task ffff88003ece1100)
Stack:
  ffff88003ecf1ce0 ffff8800397fd800 ffff88003ecf1cd0 ffffffff8129c570
  ffff88003908b800 ffff8800394a1740 ffff88003ecf1d20 ffffffff812d2f4c
  ffff88003ecf1ce0 ffff88003ecf1ce0 2222222222222222 ffffffff81665220
Call Trace:
  [<ffffffff8129c570>] veth_dellink+0x16/0x26
  [<ffffffff812d2f4c>] default_device_exit_batch+0x58/0xc1
  [<ffffffff812cd73d>] ? cleanup_net+0x0/0x192
  [<ffffffff812cd116>] ops_exit_list+0x4e/0x56
  [<ffffffff812cd82d>] cleanup_net+0xf0/0x192
  [<ffffffff81042a11>] process_one_work+0x24d/0x407
  [<ffffffff81042984>] ? process_one_work+0x1c0/0x407
  [<ffffffff81042ede>] worker_thread+0x1b8/0x30a
  [<ffffffff81042d26>] ? worker_thread+0x0/0x30a
  [<ffffffff81046072>] kthread+0x7c/0x84
  [<ffffffff810034b4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8139bebe>] ? restore_args+0x0/0x30
  [<ffffffff81045ff6>] ? kthread+0x0/0x84
  [<ffffffff810034b0>] ? kernel_thread_helper+0x0/0x10
Code: 48 c7 c7 3d a2 53 81 e8 ca 60 0c 00 e8 ce 5e 0c 00 4d 85 e4 74 26 48 8b 93 b0 00 00 00 48 8b 83 b8 00 00 00 48 8d bb b0 00 00 00 <48> 89 42 08 48 89 10 4c 89 e2 49 8b 74 24 08 eb 1d 48 89 df e8
RIP  [<ffffffff812d2ebc>] unregister_netdevice_queue+0x4d/0x85
  RSP <ffff88003ecf1ca0>
CR2: ffff88003aacfce0
---[ end trace 66938c79ba1c0677 ]---



addr2line -e ./vmlinux ffffffff8129c570

==> net-2.6/drivers/net/veth.c:455

static void veth_dellink(struct net_device *dev, struct list_head *head)
{
     struct veth_priv *priv;
     struct net_device *peer;

     priv = netdev_priv(dev);
     peer = priv->peer;

     unregister_netdevice_queue(dev, head);
==>    unregister_netdevice_queue(peer, head); <==
}
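
Editorial sketch, a reading of the dump rather than a confirmed diagnosis: the faulting bytes marked in the Code: line, `<48> 89 42 08`, decode to `mov %rax,0x8(%rdx)`, and the following `48 89 10` to `mov %rdx,(%rax)`. That pair is the shape of a doubly linked list unlink (`next->prev = prev; prev->next = next`). With RDX = ffff88003aacfcd8 from the register dump, the first store lands at RDX + 8, which matches CR2. The model below uses a hypothetical fixed-width struct so the offset does not depend on the build host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit model of struct list_head: two 8-byte pointers, next then prev. */
struct demo_list_head64 {
	uint64_t next;
	uint64_t prev;
};

/* Address written by the first store of a list unlink: next->prev = prev. */
static uint64_t demo_unlink_first_store(uint64_t next_ptr)
{
	assert(next_ptr != 0);
	return next_ptr + offsetof(struct demo_list_head64, prev);
}
```

Feeding in the RDX value from the oops gives the CR2 address, which suggests the neighbouring list node was no longer mapped when the device was unlinked from its list. Which list that is, and why the node vanished, is not established here.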



^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202104832.GA8505@redhat.com>

On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> Yes, I think doing this in the host is much simpler,
> just send an interrupt after there's a decent amount
> of space in the queue.
> 
> Having said that the simple heuristic that I coded
> might be a bit too simple.

From the debugging output I have seen so far (a single small message
TCP_STREAM test), I think the right approach is to patch both guest and
vhost. The problem I have found is a regression for the single small
message TCP_STREAM test. The old kernel works well for TCP_STREAM, only the
new kernel has the problem.

For Steven's problem, it's a multiple stream TCP_RR issue: the old guest
doesn't perform well, and neither does the new guest kernel. We tested the
patch reducing vhost signaling before, and it didn't help performance at all.

Thanks
Shirley


^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 15:08 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <4D49726C.6020103@univ-nantes.fr>

On Wednesday 02 February 2011 at 16:04 +0100, Yann Dupont wrote:
> >
> Ok, will do it at 18:30 CET (to minimize impact)
> Is the suspected bug SLUB related?
> 

No: it can be a corruption from another part of the kernel.

> The 2.6.34.2 kernel previously used on that server used SLAB.
> 
> 
> 2 questions :
> -How can I be sure slub_nomerge is active ? Boot message ?


# ls -l /sys/kernel/slab/

If you have symlinks: merge is on (default)

If you don't have symlinks: nomerge is in effect
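
Editorial sketch of the check described above: count the symbolic links directly inside a directory such as /sys/kernel/slab/. A count of zero would correspond to the "nomerge" case and a positive count to merged caches. The function name is hypothetical; it uses only POSIX opendir/readdir/lstat.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Count symbolic links directly inside dir; return -1 if it cannot be read. */
static int demo_count_symlinks(const char *dir)
{
	char path[4096];
	struct dirent *de;
	struct stat st;
	int n = 0;
	DIR *d;

	assert(dir != NULL);
	d = opendir(dir);
	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		int len = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);

		/* Skip entries whose full path does not fit the buffer. */
		if (len < 0 || (size_t)len >= sizeof(path))
			continue;
		/* lstat() inspects the link itself, not its target. */
		if (lstat(path, &st) == 0 && S_ISLNK(st.st_mode))
			n++;
	}
	closedir(d);
	return n;
}
```

A caller would pass "/sys/kernel/slab" and treat a result of 0 as "no merged caches"; -1 means the directory could not be opened (for example on a kernel not using SLUB).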

> -Is there a very severe impact on performance ?
> 

not at all

> Regards,
> 

^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Yann Dupont @ 2011-02-02 15:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1296658407.20445.19.camel@edumazet-laptop>

On 02/02/2011 15:53, Eric Dumazet wrote:
> On Wednesday 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
>> On 02/02/2011 12:24, Eric Dumazet wrote:
>>> On Wednesday 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>>>> On Wednesday 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>>>> Hello.
>>>>> We recently upgraded one machine with vanilla 2.6.37, and have had 2
>>>>> kernel oopses since. Each oops came after ~1 week of uptime.
>>>>> The last oops was last night but we didn't get any trace.
>>> oops, 2.6.37 "only"
>>>
>>>> Yes this is a known problem.
>>>>
>>>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>>>
>>>> I believe David will send it to stable team shortly, if not already
>>>> done :)
>>> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
>>> affected by the problem.
>>>
>>> So it's another problem... Is there anything particular you do on this
>>> machine?
>>>
>>>
>>>
>>>
>> Nothing really special there, we run a lot (20) of KVM guests (mainly
>> Linux firewalls for lots of different VLANs), so we have a lot of
>> bridges, VLANs & tun/tap.
>> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
>> bug already sent to netdev - more to come in the next mail)
>>
>> Hard to say if this BUG is new in 2.6.37. This host was running fine
>> with 2.6.34.2 since August 2010.
>> Bisecting will be hard due to the time to trigger the bug (and the fact
>> that this machine is a production machine)
>>
>> Anyway, I can test with a specific kernel version if you suspect something.
>>
> I suspect a mem corruption from another layer (not inetpeer)
>
> Unfortunately many kmem caches share the "64 bytes" cache.
>
> Could you please add "slub_nomerge" on your boot command ?
>
Ok, will do it at 18:30 CET (to minimize impact)
Is the suspected bug SLUB related?

The 2.6.34.2 kernel previously used on that server used SLAB.


2 questions :
-How can I be sure slub_nomerge is active ? Boot message ?
-Is there a very severe impact on performance ?

Regards,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 14:53 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <4D495765.4090806@univ-nantes.fr>

On Wednesday 02 February 2011 at 14:08 +0100, Yann Dupont wrote:
> On 02/02/2011 12:24, Eric Dumazet wrote:
> > On Wednesday 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> >> On Wednesday 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> >>> Hello.
> >>> We recently upgraded one machine with vanilla 2.6.37, and have had 2
> >>> kernel oopses since. Each oops came after ~1 week of uptime.
> >>> The last oops was last night but we didn't get any trace.
> > oops, 2.6.37 "only"
> >
> >> Yes this is a known problem.
> >>
> >> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> >>
> >> I believe David will send it to stable team shortly, if not already
> >> done :)
> > Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> > affected by the problem.
> >
> > So it's another problem... Is there anything particular you do on this
> > machine?
> >
> >
> >
> >
> Nothing really special there, we run a lot (20) of KVM guests (mainly
> Linux firewalls for lots of different VLANs), so we have a lot of
> bridges, VLANs & tun/tap.
> Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
> bug already sent to netdev - more to come in the next mail)
> 
> Hard to say if this BUG is new in 2.6.37. This host was running fine 
> with 2.6.34.2 since August 2010.
> Bisecting will be hard due to the time to trigger the bug (and the fact 
> that this machine is a production machine)
> 
> Anyway, I can test with a specific kernel version if you suspect something.
> 

I suspect a mem corruption from another layer (not inetpeer)

Unfortunately many kmem caches share the "64 bytes" cache.

Could you please add "slub_nomerge" on your boot command ?


This way, we can separate corruptions on each cache.


In your crash, one inetpeer contains garbage in its unused_lists next/prev
pointers:

RCX: 0000000000000005
RDX: 0b000209f1beadde

Definitely something overwrote these values with non-pointer values.
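
Editorial sketch of the "non-pointer" observation: on x86_64 with 48-bit virtual addresses (an assumption about this machine), kernel pointers sit in the upper canonical half, so they compare at or above 0xffff800000000000. Neither RCX nor RDX above does, while RAX from the same oops (ffff8803d3158018) does. The second helper lists a value's bytes in little-endian memory order, which is how the overwriting data would have been laid out; what those bytes represent is not established in this thread. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes x86_64 with 48-bit virtual addresses (upper canonical half). */
static int demo_looks_like_kernel_ptr(uint64_t v)
{
	return v >= 0xffff800000000000ULL;
}

/* Bytes of v in the order they sit in little-endian memory. */
static void demo_memory_bytes(uint64_t v, uint8_t out[8])
{
	int i;

	assert(out != NULL);
	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}
```

For RDX the in-memory byte sequence is de ad be f1 09 02 00 0b, which can be compared against data structures suspected of overrunning into the inetpeer object.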

^ permalink raw reply

* [PATCH] bna: use device model DMA API
From: Ivan Vecera @ 2011-02-02 14:37 UTC (permalink / raw)
  To: netdev; +Cc: rmody, ddutt

Use DMA API as PCI equivalents will be deprecated.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
 drivers/net/bna/bnad.c |  108 +++++++++++++++++++++++++-----------------------
 drivers/net/bna/bnad.h |    2 +-
 2 files changed, 57 insertions(+), 53 deletions(-)

diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index fad9126..9f356d5 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -126,22 +126,22 @@ bnad_free_all_txbufs(struct bnad *bnad,
 		}
 		unmap_array[unmap_cons].skb = NULL;
 
-		pci_unmap_single(bnad->pcidev,
-				 pci_unmap_addr(&unmap_array[unmap_cons],
+		dma_unmap_single(&bnad->pcidev->dev,
+				 dma_unmap_addr(&unmap_array[unmap_cons],
 						dma_addr), skb_headlen(skb),
-						PCI_DMA_TODEVICE);
+						DMA_TO_DEVICE);
 
-		pci_unmap_addr_set(&unmap_array[unmap_cons], dma_addr, 0);
+		dma_unmap_addr_set(&unmap_array[unmap_cons], dma_addr, 0);
 		if (++unmap_cons >= unmap_q->q_depth)
 			break;
 
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			pci_unmap_page(bnad->pcidev,
-				       pci_unmap_addr(&unmap_array[unmap_cons],
+			dma_unmap_page(&bnad->pcidev->dev,
+				       dma_unmap_addr(&unmap_array[unmap_cons],
 						      dma_addr),
 				       skb_shinfo(skb)->frags[i].size,
-				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(&unmap_array[unmap_cons], dma_addr,
+				       DMA_TO_DEVICE);
+			dma_unmap_addr_set(&unmap_array[unmap_cons], dma_addr,
 					   0);
 			if (++unmap_cons >= unmap_q->q_depth)
 				break;
@@ -199,23 +199,23 @@ bnad_free_txbufs(struct bnad *bnad,
 		sent_bytes += skb->len;
 		wis -= BNA_TXQ_WI_NEEDED(1 + skb_shinfo(skb)->nr_frags);
 
-		pci_unmap_single(bnad->pcidev,
-				 pci_unmap_addr(&unmap_array[unmap_cons],
+		dma_unmap_single(&bnad->pcidev->dev,
+				 dma_unmap_addr(&unmap_array[unmap_cons],
 						dma_addr), skb_headlen(skb),
-				 PCI_DMA_TODEVICE);
-		pci_unmap_addr_set(&unmap_array[unmap_cons], dma_addr, 0);
+				 DMA_TO_DEVICE);
+		dma_unmap_addr_set(&unmap_array[unmap_cons], dma_addr, 0);
 		BNA_QE_INDX_ADD(unmap_cons, 1, unmap_q->q_depth);
 
 		prefetch(&unmap_array[unmap_cons + 1]);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			prefetch(&unmap_array[unmap_cons + 1]);
 
-			pci_unmap_page(bnad->pcidev,
-				       pci_unmap_addr(&unmap_array[unmap_cons],
+			dma_unmap_page(&bnad->pcidev->dev,
+				       dma_unmap_addr(&unmap_array[unmap_cons],
 						      dma_addr),
 				       skb_shinfo(skb)->frags[i].size,
-				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(&unmap_array[unmap_cons], dma_addr,
+				       DMA_TO_DEVICE);
+			dma_unmap_addr_set(&unmap_array[unmap_cons], dma_addr,
 					   0);
 			BNA_QE_INDX_ADD(unmap_cons, 1, unmap_q->q_depth);
 		}
@@ -340,19 +340,22 @@ static void
 bnad_free_all_rxbufs(struct bnad *bnad, struct bna_rcb *rcb)
 {
 	struct bnad_unmap_q *unmap_q;
+	struct bnad_skb_unmap *unmap_array;
 	struct sk_buff *skb;
 	int unmap_cons;
 
 	unmap_q = rcb->unmap_q;
+	unmap_array = unmap_q->unmap_array;
 	for (unmap_cons = 0; unmap_cons < unmap_q->q_depth; unmap_cons++) {
-		skb = unmap_q->unmap_array[unmap_cons].skb;
+		skb = unmap_array[unmap_cons].skb;
 		if (!skb)
 			continue;
-		unmap_q->unmap_array[unmap_cons].skb = NULL;
-		pci_unmap_single(bnad->pcidev, pci_unmap_addr(&unmap_q->
-					unmap_array[unmap_cons],
-					dma_addr), rcb->rxq->buffer_size,
-					PCI_DMA_FROMDEVICE);
+		unmap_array[unmap_cons].skb = NULL;
+		dma_unmap_single(&bnad->pcidev->dev,
+				 dma_unmap_addr(&unmap_array[unmap_cons],
+						dma_addr),
+				 rcb->rxq->buffer_size,
+				 DMA_FROM_DEVICE);
 		dev_kfree_skb(skb);
 	}
 	bnad_reset_rcb(bnad, rcb);
@@ -391,9 +394,10 @@ bnad_alloc_n_post_rxbufs(struct bnad *bnad, struct bna_rcb *rcb)
 		skb->dev = bnad->netdev;
 		skb_reserve(skb, NET_IP_ALIGN);
 		unmap_array[unmap_prod].skb = skb;
-		dma_addr = pci_map_single(bnad->pcidev, skb->data,
-			rcb->rxq->buffer_size, PCI_DMA_FROMDEVICE);
-		pci_unmap_addr_set(&unmap_array[unmap_prod], dma_addr,
+		dma_addr = dma_map_single(&bnad->pcidev->dev, skb->data,
+					  rcb->rxq->buffer_size,
+					  DMA_FROM_DEVICE);
+		dma_unmap_addr_set(&unmap_array[unmap_prod], dma_addr,
 				   dma_addr);
 		BNA_SET_DMA_ADDR(dma_addr, &rxent->host_addr);
 		BNA_QE_INDX_ADD(unmap_prod, 1, unmap_q->q_depth);
@@ -434,8 +438,9 @@ bnad_poll_cq(struct bnad *bnad, struct bna_ccb *ccb, int budget)
 	struct bna_rcb *rcb = NULL;
 	unsigned int wi_range, packets = 0, wis = 0;
 	struct bnad_unmap_q *unmap_q;
+	struct bnad_skb_unmap *unmap_array;
 	struct sk_buff *skb;
-	u32 flags;
+	u32 flags, unmap_cons;
 	u32 qid0 = ccb->rcb[0]->rxq->rxq_id;
 	struct bna_pkt_rate *pkt_rt = &ccb->pkt_rate;
 
@@ -456,17 +461,17 @@ bnad_poll_cq(struct bnad *bnad, struct bna_ccb *ccb, int budget)
 			rcb = ccb->rcb[1];
 
 		unmap_q = rcb->unmap_q;
+		unmap_array = unmap_q->unmap_array;
+		unmap_cons = unmap_q->consumer_index;
 
-		skb = unmap_q->unmap_array[unmap_q->consumer_index].skb;
+		skb = unmap_array[unmap_cons].skb;
 		BUG_ON(!(skb));
-		unmap_q->unmap_array[unmap_q->consumer_index].skb = NULL;
-		pci_unmap_single(bnad->pcidev,
-				 pci_unmap_addr(&unmap_q->
-						unmap_array[unmap_q->
-							    consumer_index],
+		unmap_array[unmap_cons].skb = NULL;
+		dma_unmap_single(&bnad->pcidev->dev,
+				 dma_unmap_addr(&unmap_array[unmap_cons],
 						dma_addr),
-						rcb->rxq->buffer_size,
-						PCI_DMA_FROMDEVICE);
+				 rcb->rxq->buffer_size,
+				 DMA_FROM_DEVICE);
 		BNA_QE_INDX_ADD(unmap_q->consumer_index, 1, unmap_q->q_depth);
 
 		/* Should be more efficient ? Performance ? */
@@ -1015,9 +1020,9 @@ bnad_mem_free(struct bnad *bnad,
 			if (mem_info->mem_type == BNA_MEM_T_DMA) {
 				BNA_GET_DMA_ADDR(&(mem_info->mdl[i].dma),
 						dma_pa);
-				pci_free_consistent(bnad->pcidev,
-						mem_info->mdl[i].len,
-						mem_info->mdl[i].kva, dma_pa);
+				dma_free_coherent(&bnad->pcidev->dev,
+						  mem_info->mdl[i].len,
+						  mem_info->mdl[i].kva, dma_pa);
 			} else
 				kfree(mem_info->mdl[i].kva);
 		}
@@ -1047,8 +1052,9 @@ bnad_mem_alloc(struct bnad *bnad,
 		for (i = 0; i < mem_info->num; i++) {
 			mem_info->mdl[i].len = mem_info->len;
 			mem_info->mdl[i].kva =
-				pci_alloc_consistent(bnad->pcidev,
-						mem_info->len, &dma_pa);
+				dma_alloc_coherent(&bnad->pcidev->dev,
+						mem_info->len, &dma_pa,
+						GFP_KERNEL);
 
 			if (mem_info->mdl[i].kva == NULL)
 				goto err_return;
@@ -2600,9 +2606,9 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 	unmap_q->unmap_array[unmap_prod].skb = skb;
 	BUG_ON(!(skb_headlen(skb) <= BFI_TX_MAX_DATA_PER_VECTOR));
 	txqent->vector[vect_id].length = htons(skb_headlen(skb));
-	dma_addr = pci_map_single(bnad->pcidev, skb->data, skb_headlen(skb),
-		PCI_DMA_TODEVICE);
-	pci_unmap_addr_set(&unmap_q->unmap_array[unmap_prod], dma_addr,
+	dma_addr = dma_map_single(&bnad->pcidev->dev, skb->data,
+				  skb_headlen(skb), DMA_TO_DEVICE);
+	dma_unmap_addr_set(&unmap_q->unmap_array[unmap_prod], dma_addr,
 			   dma_addr);
 
 	BNA_SET_DMA_ADDR(dma_addr, &txqent->vector[vect_id].host_addr);
@@ -2630,11 +2636,9 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 
 		BUG_ON(!(size <= BFI_TX_MAX_DATA_PER_VECTOR));
 		txqent->vector[vect_id].length = htons(size);
-		dma_addr =
-			pci_map_page(bnad->pcidev, frag->page,
-				     frag->page_offset, size,
-				     PCI_DMA_TODEVICE);
-		pci_unmap_addr_set(&unmap_q->unmap_array[unmap_prod], dma_addr,
+		dma_addr = dma_map_page(&bnad->pcidev->dev, frag->page,
+					frag->page_offset, size, DMA_TO_DEVICE);
+		dma_unmap_addr_set(&unmap_q->unmap_array[unmap_prod], dma_addr,
 				   dma_addr);
 		BNA_SET_DMA_ADDR(dma_addr, &txqent->vector[vect_id].host_addr);
 		BNA_QE_INDX_ADD(unmap_prod, 1, unmap_q->q_depth);
@@ -3022,14 +3026,14 @@ bnad_pci_init(struct bnad *bnad,
 	err = pci_request_regions(pdev, BNAD_NAME);
 	if (err)
 		goto disable_device;
-	if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
-	    !pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
+	if (!dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) &&
+	    !dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64))) {
 		*using_dac = 1;
 	} else {
-		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		err = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
 		if (err) {
-			err = pci_set_consistent_dma_mask(pdev,
-						DMA_BIT_MASK(32));
+			err = dma_set_coherent_mask(&pdev->dev,
+						    DMA_BIT_MASK(32));
 			if (err)
 				goto release_regions;
 		}
diff --git a/drivers/net/bna/bnad.h b/drivers/net/bna/bnad.h
index 8b1d515..a89117f 100644
--- a/drivers/net/bna/bnad.h
+++ b/drivers/net/bna/bnad.h
@@ -181,7 +181,7 @@ struct bnad_rx_info {
 /* Unmap queues for Tx / Rx cleanup */
 struct bnad_skb_unmap {
 	struct sk_buff		*skb;
-	DECLARE_PCI_UNMAP_ADDR(dma_addr)
+	DEFINE_DMA_UNMAP_ADDR(dma_addr);
 };
 
 struct bnad_unmap_q {
-- 
1.7.3.4


^ permalink raw reply related

* Re: possible issue between bridge igmp/multicast  handling & bnx2x on kernel 2.6.34 and >
From: Yann Dupont @ 2011-02-02 13:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1294399705.3306.8.camel@edumazet-laptop>

> On Friday 07 January 2011 at 11:40 +0100, Yann Dupont wrote:
>> On 04/01/2011 14:40, Yann Dupont wrote:
>> ...
>>> We just added BCM57711 10G cards (bnx2x driver) on our blade servers
>>> (connected to 10G Power Connect M8024).
>>> Since then, we are experiencing random loss of packets.
>>>
>>> Symptom: packets are lost on some VLANs for a few seconds, then
>>> things go back to normal (and stop again a few minutes later)
>>>
>> As I haven't had an answer so far, I dug a little more and captured more
>> packets.
>> I just noticed that one event triggers the problem: an IPv6 neighbor
>> discovery packet.
>>
>> This is , of course, a multicast packet.
>>
>> Just saw that 2.6.36.3 should include this fix :
>>
Just a little update: the problem doesn't seem to be what we thought at
first.

It may not be related to the bnx2x driver after all.
We noticed the same symptoms on a target machine using the bnx2
driver (we missed that at first since the outages are much briefer).

We now rather suspect our own firewall (also Linux in a KVM
machine), since without it we don't get any more problems and the packet
drops occur on _THIS_ network, when packets are routed by _THIS_ firewall.

Anyway, all of that is very puzzling. We have made a lot of network
dumps and we really have no clue what's happening there.
We don't understand why, if the problem is really on our firewall
machine, setting CONFIG_BRIDGE_IGMP_SNOOPING to 'n' on the target
machine effectively fixes the problem, especially since it doesn't seem
related at all to our setup and we don't see anything in our network
dumps that could explain this.

It's probably not a single problem, but a sum of different problems.
We will keep searching.
Sorry for the noise.

Regards,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Yann Dupont @ 2011-02-02 13:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1296645887.20445.11.camel@edumazet-laptop>

On 02/02/2011 12:24, Eric Dumazet wrote:
> On Wednesday 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>> On Wednesday 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>> Hello.
>>> We recently upgraded one machine with vanilla 2.6.37, and have had 2
>>> kernel oopses since. Each oops came after ~1 week of uptime.
>>> The last oops was last night but we didn't get any trace.
> oops, 2.6.37 "only"
>
>> Yes this is a known problem.
>>
>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>
>> I believe David will send it to stable team shortly, if not already
>> done :)
> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> affected by the problem.
>
> So it's another problem... Is there anything particular you do on this
> machine?
>
>
>
>
Nothing really special there, we run a lot (20) of KVM guests (mainly
Linux firewalls for lots of different VLANs), so we have a lot of
bridges, VLANs & tun/tap.
Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
bug already sent to netdev - more to come in the next mail)

Hard to say if this BUG is new in 2.6.37. This host was running fine 
with 2.6.34.2 since August 2010.
Bisecting will be hard due to the time to trigger the bug (and the fact 
that this machine is a production machine)

Anyway, I can test with a specific kernel version if you suspect something.

Regards,


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 11:24 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <1296643972.20445.9.camel@edumazet-laptop>

On Wednesday 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> On Wednesday 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> > Hello.
> > We recently upgraded one machine with vanilla 2.6.37, and have had 2
> > kernel oopses since. Each oops came after ~1 week of uptime.
> > The last oops was last night but we didn't get any trace.

oops, 2.6.37 "only"

> Yes this is a known problem.
> 
> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> 
> I believe David will send it to stable team shortly, if not already
> done :)

Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
affected by the problem.

So it's another problem... Is there anything particular you do on this
machine?

^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 10:52 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <4D491B8D.1000107@univ-nantes.fr>

On Wednesday 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> Hello.
> We recently upgraded one machine with vanilla 2.6.37, and have had 2
> kernel oopses since. Each oops came after ~1 week of uptime.
> The last oops was last night but we didn't get any trace.
> 
> Here is the previous oops :
> 
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316042] 
> BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316096] 
> IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316135] PGD 0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316157] 
> Oops: 0002 [#1] SMP
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316188] 
> last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316234] CPU 1
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316240] 
> Modules linked in: xt_physdev ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 
> ipt_LOG xt_multiport xt_limit nf_conntrack_tftp nf_conntrack_ftp tun 
> ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT 
> xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm ipv6 8021q 
> bridge stp ext2 mbcache fuse snd_pcm snd_timer snd soundcore 
> snd_page_alloc i5000_edac edac_core psmouse evdev i5k_amb tpm_tis tpm 
> joydev dcdbas tpm_bios pcspkr rng_core ghes shpchp serio_raw pci_hotplug 
> processor hed button thermal_sys xfs exportfs dm_mod sg sr_mod sd_mod 
> cdrom usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt uhci_hcd 
> mptsas mptscsih ehci_hcd mptbase bnx2 scsi_transport_sas scsi_mod [last 
> unloaded: scsi_wait_scan]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316694]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316715] 
> Pid: 0, comm: kworker/0:0 Not tainted 2.6.37-dsiun-110105 #17 
> 0MY736/PowerEdge M600
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316761] 
> RIP: 0010:[<ffffffff8130e6bf>]  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316808] 
> RSP: 0018:ffff8800cfc43e20  EFLAGS: 00010202
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316834] 
> RAX: ffff8803d3158018 RBX: ffff8803d3158000 RCX: 0000000000000005
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316878] 
> RDX: 0b000209f1beadde RSI: 00000000000000ac RDI: ffffffff8152a970
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318512] 
> RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318560] 
> R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfc43ea0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318604] 
> R13: 0000000000000100 R14: ffff88040fc99fd8 R15: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318652] 
> FS:  0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318698] 
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318725] 
> CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318768] 
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318812] 
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318855] 
> Process kworker/0:0 (pid: 0, threadinfo ffff88040fc98000, task 
> ffff88040fc6c2e0)
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318901] 
> Stack:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318921]  
> 0000000000000082 00000001029221c1 00000000000248f6 ffffffff8130e988
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318971]  
> ffff88040fc90000 ffff88040fc90000 ffffffff8152a9a0 ffffffff8105e95f
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319021]  
> ffff8800cfc43e58 ffff88040fc91020 ffffffff8130e950 ffff88040fc99fd8
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319072] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319093] <IRQ>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319116]  
> [<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319146]  
> [<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319175]  
> [<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319204]  
> [<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319232]  
> [<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319260]  
> [<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319288]  
> [<ffffffff81005f75>] ? do_softirq+0x65/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319315]  
> [<ffffffff81056745>] ? irq_exit+0x85/0x90
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319343]  
> [<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319373]  
> [<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319401] <EOI>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319427]  
> [<ffffffffa032218c>] ? acpi_idle_enter_bm+0x243/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319473]  
> [<ffffffffa0322185>] ? acpi_idle_enter_bm+0x23c/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319519]  
> [<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319547]  
> [<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319573] 
> Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b 
> 15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 
> 51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319768] 
> RIP  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319797]  
> RSP <ffff8800cfc43e20>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319820] 
> CR2: 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320187] 
> ---[ end trace eaf3ed2d46c78768 ]---
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320257] 
> Kernel panic - not syncing: Fatal exception in interrupt
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320329] 
> Pid: 0, comm: kworker/0:0 Tainted: G      D     2.6.37-dsiun-110105 #17
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320418] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320481] 
> <IRQ>  [<ffffffff8137c75e>] ? panic+0x92/0x1a2
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320601]  
> [<ffffffff81007357>] ? oops_end+0xe7/0xf0
> 
> 
> Any ideas ??


Hi Yann

Yes, this is a known problem.

Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
(inetpeer: Use correct AVL tree base pointer in inet_getpeer())

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492

I believe David will send it to stable team shortly, if not already
done :)

Thanks



^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:49 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296632029.26937.871.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > w/i guest change, I played around the parameters,for example: I could
> > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > size,
> > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > usage. 
> 
> I meant w/o guest change, only vhost changes. Sorry about that.
> 
> Shirley

Ah, excellent. What were the parameters?

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296630891.26937.870.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 11:14:51PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > > 
> > > > The way I am changing is only when netif queue has stopped, then
> > we
> > > > start to count num_free descriptors to send the signal to wake
> > netif
> > > > queue. 
> > > 
> > > I forgot to mention, the code change I am making is in guest kernel,
> > in
> > > xmit call back only wake up the queue when it's stopped && num_free
> > >=
> > > 1/2 *vq->num, I add a new API in virtio_ring.
> > 
> > Interesting. Yes, I agree an API extension would be helpful. However,
> > wouldn't just the signaling reduction be enough, without guest
> > changes?
> 
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage.

We need to consider them separately IMO.  What's the best we can get
without guest changes?  And which parameters give it?
There will always be old guests, and as far as I can tell
it should work better from the host.

> > > However vhost signaling reduction is needed as well. The patch I
> > > submitted a while ago showed both CPUs and BW improvement.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Which patch was that? 
> 
> The patch was called "vhost: TX used buffer guest signal accumulation".
Yes, a somewhat similar idea.

> You suggested to split add_used_bufs and signal.
Exactly. And this is basically what this patch does.

> I am still thinking
> what's the best approach to cooperate guest (virtio_kick) and
> vhost(handle_tx), vhost(signaling) and guest (xmit callback) to reduce
> the overheads, so I haven't submit the new patch yet.
> 
> Thanks
> Shirley


-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Shirley Ma, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <OFF5778D3C.84F46700-ON6525782B.00230A94-6525782B.00240890@in.ibm.com>

On Wed, Feb 02, 2011 at 12:04:37PM +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)
> 
> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.
> 
> I got a small improvement in my testing upto some number of threads
> (32 or 48?), but beyond that I was getting a regression.
> 
> Thanks,
> 
> - KK
> 
> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.

Yes, I think doing this in the host is much simpler,
just send an interrupt after there's a decent amount
of space in the queue.

Having said that, the simple heuristic that I coded
might be a bit too simple.
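
For illustration only, the threshold idea discussed in this thread (signal
only once a stopped queue has regained a decent amount of space, here half
of the ring) could be sketched roughly as below. The struct and function
names are hypothetical and are not the real virtio or vhost data
structures; the one-half threshold is the value mentioned earlier in the
thread, not a tuned number.

```c
#include <stdbool.h>

/* Hypothetical, simplified ring state; not the real virtio/vhost structs. */
struct ring_state {
	unsigned int num;      /* total descriptors in the ring */
	unsigned int num_free; /* descriptors currently free */
	bool stopped;          /* true while the transmit queue is stopped */
};

/*
 * Return true only when the queue is stopped and at least half of the
 * ring is free again, so that one signal covers many freed buffers.
 */
bool should_signal(const struct ring_state *r)
{
	return r->stopped && r->num_free >= r->num / 2;
}
```

With a 256-entry ring this stays quiet at 100 free descriptors and fires
at 128, and it never fires while the queue is running.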

-- 
MST

^ permalink raw reply

* Re: Bonding on bond
From: Nicolas de Pesloüan @ 2011-02-02 10:19 UTC (permalink / raw)
  To: Jay Vosburgh, Jiri Bohac
  Cc: bonding-devel@lists.sourceforge.net, netdev@vger.kernel.org
In-Reply-To: <15526.1296261528@death>

Le 29/01/2011 01:38, Jay Vosburgh a écrit :
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

[snip]

>> However, the ingress path doesn't work at all. bond0 is unable to receive any packets (ARP or IP).
>
> 	In light of this, I don't see a problem with disallowing nesting
> of bonds.  It should be documented in bonding.txt.

Ok, I will do that.

Jiri, any trouble with me stealing your patch (code) and adding the documentation update part? Or do 
you prefer to do it yourself?

[snip]

>> That being said, we still miss a way to achieve a simple configuration
>> with several links doing load balancing to a switch and one or several
>> links doing fail over to another switch, both switches *not* being 802.3ad
>> capable.
>
> 	This is a harder problem, but it's something that doesn't work
> today (and I suspect hasn't for a long time, so if somebody was using
> this, I think there would have been some discussion).

In the meantime, I will state in the documentation that:

- nesting is not allowed.
- only the above particular setup would possibly require nesting.
- this can be achieved using 802.3ad mode, connected to 802.3ad capable switches.

>> Should we arrange for bonding to be allowed to nest, for this purpose, or
>> should we find a way to setup this configuration with a single level of
>> bonding ? I would prefer the second, but...
>
> 	I'm not sure that either is necessary; 802.3ad will do this
> today, and few current production switches lack 802.3ad support.
>
> 	Adding support for etherchannel (i.e., not 802.3ad) gang
> failover is nontrivial, because the multiple etherchannel port groups
> will have to be managed separately, and most likely assigned manually.
> Sure, it'd be nice to have, but I'm not sure if it's a benefit worth the
> effort.

I'm far from an 802.3ad (802.1AX) specialist, but... wouldn't it be possible to force the aggregator 
by hand, for every slave, to achieve the same effect as receiving LACPDUs, when connected to non 
802.3ad capable switches?

echo 802.3ad > /sys/class/net/bond0/bonding/mode
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves
echo +eth2 > /sys/class/net/bond0/bonding/slaves
echo 1 > /sys/class/net/bond0/bonding/ad_aggregator_eth0 # those sysfs entries to be created...
echo 1 > /sys/class/net/bond0/bonding/ad_aggregator_eth1
echo 2 > /sys/class/net/bond0/bonding/ad_aggregator_eth2

> 	Either way, for now, since I recall you mentioned in another
> email that you'd crashed the system from nesting bonds, I don't see a
> problem with disallowing nesting and updating the documentation with a
> bit of this discussion (e.g., "nesting doesn't work, you're probably
> trying to do gang failover, which 802.3ad already does for you").

Thanks.

	Nicolas.

^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Vlad Dogaru @ 2011-02-02  9:56 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Stephen Hemminger
In-Reply-To: <4D492289.8090708@trash.net>

On Wed, Feb 02, 2011 at 10:23:21AM +0100, Patrick McHardy wrote:
> On 02.02.2011 10:13, Vlad Dogaru wrote:
> > On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
> >> On 26.01.2011 17:41, Vlad Dogaru wrote:
> >>> Use the group keyword to specify what group the device should belong to.
> >>> Since the kernel uses numbers internally, mapping of group names to
> >>> numbers is defined in /etc/iproute2/group_map. Example usage:
> >>>
> >>>   ip link set dev eth0 group default
> >>>
> >>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
> >>>  			if (get_integer(&mtu, *argv, 0))
> >>>  				invarg("Invalid \"mtu\" value\n", *argv);
> >>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> >>> +		} else if (strcmp(*argv, "group") == 0) {
> >>> +			NEXT_ARG();
> >>> +			if (group != -1)
> >>> +				duparg("group", *argv);
> >>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> >>> +				invarg("Invalid \"group\" value\n", *argv);
> >>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
> >>
> >> I think it would be preferable to use a function similar to
> >> rt_realm_n2a() that can also handle plain numerical values.
> > 
> > The a2n() functions are rather complex for this case: they employ
> > caching and store a table. I suppose this is because multiple calls to
> > them are possible in a single run and the correspondence has to be made
> > in both ways (a2n and n2a).
> > 
> > A network group is only converted to a number at most once for each ip
> > process spawned, so storing a table is not really helpful. What could,
> > however, help is using get_integer before lookup_map_id. Only if
> > get_integer fails would we look up the symbolic group name.
> 
> Actually that's not entirely correct, the caches are (also) maintained
> to speed up batch mode, in which case there could also be multiple name
> to group mappings.

Both comments noted. I will respin the patches dropping the devgroup
keyword and implementing caching for groups.

Thanks for the feedback.

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-02-02  9:54 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <19551.1296268113@death>

Le 29/01/2011 03:28, Jay Vosburgh a écrit :
> 	I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> 	In my mind, this proposal is two separate pieces:
>
> 	First, a piece to make round-robin a selectable hash for
> xmit_hash_policy.  The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> 	Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC.  This should
> be a separate option from the round-robin hash policy.  I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this).  I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6."  Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't.  This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this).  I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> 	Comments?

Looks really sensible to me.

I just propose the following option and option values : "src_mac_select" (instead of mac_select), 
with "default" and "slave_mac" (instead of slave_src_mac) as possible values. In the future, we 
might need a "dst_mac_select" option... :-)
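
As a rough sketch of how such an option might behave (nothing like this
exists in the bonding driver today; the enum, its values and the function
below are hypothetical, and the "everything but ARP" rule is only Jay's
tentative suggestion):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for the proposed "src_mac_select" option. */
enum src_mac_select {
	SRC_MAC_DEFAULT, /* always use the bond's MAC (current behaviour) */
	SRC_MAC_SLAVE,   /* use the sending slave's own MAC, except for ARP */
};

/* Pick the source MAC for an outgoing frame under the given policy. */
const uint8_t *pick_src_mac(enum src_mac_select policy, bool is_arp,
			    const uint8_t *bond_mac,
			    const uint8_t *slave_mac)
{
	if (policy == SRC_MAC_SLAVE && !is_arp)
		return slave_mac;
	return bond_mac;
}
```

Keeping the policy as an enum rather than a boolean leaves room for
further selections later, which is the point Jay made about avoiding a
simple on/off switch.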

Also, are there any risks that this kind of session load-balancing won't properly cooperate with 
multiqueue (as explained in "Overriding Configuration for Special Cases" in 
Documentation/networking/bonding.txt)? I think it is important to ensure we keep the ability to fine 
tune the egress path selection.

	Nicolas.

^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  9:23 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <20110202091315.GJ2494@cormyr>

On 02.02.2011 10:13, Vlad Dogaru wrote:
> On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
>> On 26.01.2011 17:41, Vlad Dogaru wrote:
>>> Use the group keyword to specify what group the device should belong to.
>>> Since the kernel uses numbers internally, mapping of group names to
>>> numbers is defined in /etc/iproute2/group_map. Example usage:
>>>
>>>   ip link set dev eth0 group default
>>>
>>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>>>  			if (get_integer(&mtu, *argv, 0))
>>>  				invarg("Invalid \"mtu\" value\n", *argv);
>>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
>>> +		} else if (strcmp(*argv, "group") == 0) {
>>> +			NEXT_ARG();
>>> +			if (group != -1)
>>> +				duparg("group", *argv);
>>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
>>> +				invarg("Invalid \"group\" value\n", *argv);
>>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
>>
>> I think it would be preferable to use a function similar to
>> rt_realm_n2a() that can also handle plain numerical values.
> 
> The a2n() functions are rather complex for this case: they employ
> caching and store a table. I suppose this is because multiple calls to
> them are possible in a single run and the correspondence has to be made
> in both ways (a2n and n2a).
> 
> A network group is only converted to a number at most once for each ip
> process spawned, so storing a table is not really helpful. What could,
> however, help is using get_integer before lookup_map_id. Only if
> get_integer fails would we look up the symbolic group name.

Actually that's not entirely correct, the caches are (also) maintained
to speed up batch mode, in which case there could also be multiple name
to group mappings.
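
To illustrate the batch-mode point, a minimal sketch of a lookup that
loads its table once and reuses it on later calls might look like this.
All names and the table contents are hypothetical; a real implementation
would parse the mapping file rather than use a built-in array.

```c
#include <stddef.h>
#include <string.h>

struct group_entry {
	const char *name;
	int id;
};

static const struct group_entry *group_cache; /* NULL until first use */
static size_t group_cache_len;
static int group_map_loads;                   /* counts table loads */

/* Stand-in for reading the mapping file; the entries are made up. */
static void load_group_map(void)
{
	static const struct group_entry entries[] = {
		{ "default", 0 },
		{ "lab", 10 },
	};

	group_cache = entries;
	group_cache_len = sizeof(entries) / sizeof(entries[0]);
	group_map_loads++;
}

/* Translate a group name to its number; return 0 on success, -1 if unknown. */
int cached_group_a2n(const char *name, int *id)
{
	size_t i;

	if (!name)
		return -1;
	if (!group_cache)
		load_group_map(); /* happens only on the first lookup */

	for (i = 0; i < group_cache_len; i++) {
		if (strcmp(name, group_cache[i].name) == 0) {
			*id = group_cache[i].id;
			return 0;
		}
	}
	return -1;
}

/* Expose the load counter so the caching effect can be observed. */
int group_map_load_count(void)
{
	return group_map_loads;
}
```

However many names a batch run translates, the table is loaded a single
time.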

^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  9:21 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <20110202091315.GJ2494@cormyr>

On 02.02.2011 10:13, Vlad Dogaru wrote:
> On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
>> On 26.01.2011 17:41, Vlad Dogaru wrote:
>>> Use the group keyword to specify what group the device should belong to.
>>> Since the kernel uses numbers internally, mapping of group names to
>>> numbers is defined in /etc/iproute2/group_map. Example usage:
>>>
>>>   ip link set dev eth0 group default
>>>
>>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>>>  			if (get_integer(&mtu, *argv, 0))
>>>  				invarg("Invalid \"mtu\" value\n", *argv);
>>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
>>> +		} else if (strcmp(*argv, "group") == 0) {
>>> +			NEXT_ARG();
>>> +			if (group != -1)
>>> +				duparg("group", *argv);
>>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
>>> +				invarg("Invalid \"group\" value\n", *argv);
>>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
>>
>> I think it would be preferable to use a function similar to
>> rt_realm_n2a() that can also handle plain numerical values.
> 
> The a2n() functions are rather complex for this case: they employ
> caching and store a table. I suppose this is because multiple calls to
> them are possible in a single run and the correspondence has to be made
> in both ways (a2n and n2a).
> 
> A network group is only converted to a number at most once for each ip
> process spawned, so storing a table is not really helpful. What could,
> however, help is using get_integer before lookup_map_id. Only if
> get_integer fails would we look up the symbolic group name.
> 
> Does that make sense?

Sure, that would be fine as well.

One more thing I find confusing is that for assigning a group
to a device the parameter is called "group", while for performing
actions on a group it's called "devgroup". Why not simply use
"group" for both cases? The case "ip link set devgroup X group Y"
doesn't work anyway, since the IFLA_GROUP attribute is used
for both.

^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Vlad Dogaru @ 2011-02-02  9:13 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Stephen Hemminger
In-Reply-To: <4D491C3C.2010805@trash.net>

On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
> On 26.01.2011 17:41, Vlad Dogaru wrote:
> > Use the group keyword to specify what group the device should belong to.
> > Since the kernel uses numbers internally, mapping of group names to
> > numbers is defined in /etc/iproute2/group_map. Example usage:
> > 
> >   ip link set dev eth0 group default
> > 
> > @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
> >  			if (get_integer(&mtu, *argv, 0))
> >  				invarg("Invalid \"mtu\" value\n", *argv);
> >  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> > +		} else if (strcmp(*argv, "group") == 0) {
> > +			NEXT_ARG();
> > +			if (group != -1)
> > +				duparg("group", *argv);
> > +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> > +				invarg("Invalid \"group\" value\n", *argv);
> > +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
> 
> I think it would be preferable to use a function similar to
> rt_realm_n2a() that can also handle plain numerical values.

The a2n() functions are rather complex for this case: they employ
caching and store a table. I suppose this is because multiple calls to
them are possible in a single run and the correspondence has to be made
in both ways (a2n and n2a).

A network group is only converted to a number at most once for each ip
process spawned, so storing a table is not really helpful. What could,
however, help is using get_integer before lookup_map_id. Only if
get_integer fails would we look up the symbolic group name.

Does that make sense?
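
A minimal sketch of that ordering, assuming a small built-in table in
place of /etc/iproute2/group_map: parse_number() and parse_group() are
hypothetical stand-ins for get_integer() and lookup_map_id(), and the
"uplink" entry and the value 0 for "default" are assumptions made for the
example.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the entries of /etc/iproute2/group_map. */
struct group_name {
	const char *name;
	int id;
};

static const struct group_name group_table[] = {
	{ "default", 0 },
	{ "uplink", 1 }, /* made-up entry for the example */
};

/* Parse a non-negative integer; return 0 on success, -1 otherwise. */
int parse_number(const char *arg, int *out)
{
	char *end;
	long val;

	if (!arg || !*arg)
		return -1;
	errno = 0;
	val = strtol(arg, &end, 0);
	if (errno || *end != '\0' || val < 0 || val > INT_MAX)
		return -1;
	*out = (int)val;
	return 0;
}

/* Try the numeric form first; fall back to the symbolic name. */
int parse_group(const char *arg, int *group)
{
	size_t i;

	if (!arg)
		return -1;
	if (parse_number(arg, group) == 0)
		return 0;

	for (i = 0; i < sizeof(group_table) / sizeof(group_table[0]); i++) {
		if (strcmp(arg, group_table[i].name) == 0) {
			*group = group_table[i].id;
			return 0;
		}
	}
	return -1;
}
```

With this ordering "7" and "uplink" are both accepted, and a name that is
neither numeric nor present in the table is rejected.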

^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  8:56 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1296060086-18777-2-git-send-email-ddvlad@rosedu.org>

On 26.01.2011 17:41, Vlad Dogaru wrote:
> Use the group keyword to specify what group the device should belong to.
> Since the kernel uses numbers internally, mapping of group names to
> numbers is defined in /etc/iproute2/group_map. Example usage:
> 
>   ip link set dev eth0 group default
> 
> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>  			if (get_integer(&mtu, *argv, 0))
>  				invarg("Invalid \"mtu\" value\n", *argv);
>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> +		} else if (strcmp(*argv, "group") == 0) {
> +			NEXT_ARG();
> +			if (group != -1)
> +				duparg("group", *argv);
> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> +				invarg("Invalid \"group\" value\n", *argv);
> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);

I think it would be preferable to use a function similar to
rt_realm_n2a() that can also handle plain numerical values.

^ permalink raw reply
