Netdev List
* Re: kernel 2.6.37 : oops in cleanup_once
From: Yann Dupont @ 2011-02-02 13:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1296645887.20445.11.camel@edumazet-laptop>

On 02/02/2011 12:24, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
>> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
>>> Hello.
>>> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
>>> kernel oops since. Each oops is after ~1 week of uptime.
>>> The last oops was last night but we didn't have any trace.
> oops, 2.6.37 "only"
>
>> Yes this is a known problem.
>>
>> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
>>
>> I believe David will send it to stable team shortly, if not already
>> done :)
> Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
> affected by the problem.
>
> So it's another problem... Is there anything particular you do on this
> machine ?
>
>
>
>
Nothing really special there, we run a lot (20) of KVM guests (mainly
Linux firewalls for many different VLANs), so we have a lot of
bridges, VLANs & tun/tap devices.
Oh, and CONFIG_BRIDGE_IGMP_SNOOPING is set to n (because of the other
bug already sent to netdev - more to come in the next mail)

Hard to say if this BUG is new in 2.6.37. This host was running fine 
with 2.6.34.2 since August 2010.
Bisecting will be hard due to the time to trigger the bug (and the fact 
that this machine is a production machine)

Anyway, I can test with a specific kernel version if you suspect something.

Regards,


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 11:24 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <1296643972.20445.9.camel@edumazet-laptop>

On Wednesday, 02 February 2011 at 11:52 +0100, Eric Dumazet wrote:
> On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> > Hello.
> > We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> > kernel oops since. Each oops is after ~1 week of uptime.
> > The last oops was last night but we didn't have any trace.

oops, 2.6.37 "only"

> Yes this is a known problem.
> 
> Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> (inetpeer: Use correct AVL tree base pointer in inet_getpeer())
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492
> 
> I believe David will send it to stable team shortly, if not already
> done :)

Please ignore, this patch was for linux-2.6 tree, 2.6.37 was not
affected by the problem.

So it's another problem... Is there anything particular you do on this
machine ?


* Re: kernel 2.6.37 : oops in cleanup_once
From: Eric Dumazet @ 2011-02-02 10:52 UTC (permalink / raw)
  To: Yann Dupont; +Cc: linux-kernel, netdev
In-Reply-To: <4D491B8D.1000107@univ-nantes.fr>

On Wednesday, 02 February 2011 at 09:53 +0100, Yann Dupont wrote:
> Hello.
> We recently upgraded one machine with vanilla 2.6.37, and experienced 2
> kernel oops since. Each oops is after ~1 week of uptime.
> The last oops was last night but we didn't have any trace.
> 
> Here is the previous oops :
> 
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316042] 
> BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316096] 
> IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316135] PGD 0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316157] 
> Oops: 0002 [#1] SMP
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316188] 
> last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316234] CPU 1
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316240] 
> Modules linked in: xt_physdev ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 
> ipt_LOG xt_multiport xt_limit nf_conntrack_tftp nf_conntrack_ftp tun 
> ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT 
> xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm ipv6 8021q 
> bridge stp ext2 mbcache fuse snd_pcm snd_timer snd soundcore 
> snd_page_alloc i5000_edac edac_core psmouse evdev i5k_amb tpm_tis tpm 
> joydev dcdbas tpm_bios pcspkr rng_core ghes shpchp serio_raw pci_hotplug 
> processor hed button thermal_sys xfs exportfs dm_mod sg sr_mod sd_mod 
> cdrom usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt uhci_hcd 
> mptsas mptscsih ehci_hcd mptbase bnx2 scsi_transport_sas scsi_mod [last 
> unloaded: scsi_wait_scan]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316694]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316715] 
> Pid: 0, comm: kworker/0:0 Not tainted 2.6.37-dsiun-110105 #17 
> 0MY736/PowerEdge M600
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316761] 
> RIP: 0010:[<ffffffff8130e6bf>]  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316808] 
> RSP: 0018:ffff8800cfc43e20  EFLAGS: 00010202
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316834] 
> RAX: ffff8803d3158018 RBX: ffff8803d3158000 RCX: 0000000000000005
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.316878] 
> RDX: 0b000209f1beadde RSI: 00000000000000ac RDI: ffffffff8152a970
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318512] 
> RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318560] 
> R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfc43ea0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318604] 
> R13: 0000000000000100 R14: ffff88040fc99fd8 R15: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318652] 
> FS:  0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318698] 
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318725] 
> CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318768] 
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318812] 
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318855] 
> Process kworker/0:0 (pid: 0, threadinfo ffff88040fc98000, task 
> ffff88040fc6c2e0)
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318901] 
> Stack:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318921]  
> 0000000000000082 00000001029221c1 00000000000248f6 ffffffff8130e988
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.318971]  
> ffff88040fc90000 ffff88040fc90000 ffffffff8152a9a0 ffffffff8105e95f
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319021]  
> ffff8800cfc43e58 ffff88040fc91020 ffffffff8130e950 ffff88040fc99fd8
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319072] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319093] <IRQ>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319116]  
> [<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319146]  
> [<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319175]  
> [<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319204]  
> [<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319232]  
> [<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319260]  
> [<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319288]  
> [<ffffffff81005f75>] ? do_softirq+0x65/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319315]  
> [<ffffffff81056745>] ? irq_exit+0x85/0x90
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319343]  
> [<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319373]  
> [<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319401] <EOI>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319427]  
> [<ffffffffa032218c>] ? acpi_idle_enter_bm+0x243/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319473]  
> [<ffffffffa0322185>] ? acpi_idle_enter_bm+0x23c/0x27b [processor]
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319519]  
> [<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319547]  
> [<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319573] 
> Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b 
> 15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 
> 51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319768] 
> RIP  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319797]  
> RSP <ffff8800cfc43e20>
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.319820] 
> CR2: 000000000000000d
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320187] 
> ---[ end trace eaf3ed2d46c78768 ]---
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320257] 
> Kernel panic - not syncing: Fatal exception in interrupt
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320329] 
> Pid: 0, comm: kworker/0:0 Tainted: G      D     2.6.37-dsiun-110105 #17
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320418] 
> Call Trace:
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320481] 
> <IRQ>  [<ffffffff8137c75e>] ? panic+0x92/0x1a2
> Jan 21 13:15:41 linkwood.u11.univ-nantes.prive kernel: [172825.320601]  
> [<ffffffff81007357>] ? oops_end+0xe7/0xf0
> 
> 
> Any ideas ??


Hi Yann

Yes this is a known problem.

Please try commit 3408404a4c2a4eead9d73b0bbbfe3f225b65f492
(inetpeer: Use correct AVL tree base pointer in inet_getpeer())

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3408404a4c2a4eead9d73b0bbbfe3f225b65f492

I believe David will send it to stable team shortly, if not already
done :)

Thanks




* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:49 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296632029.26937.871.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > w/i guest change, I played around with the parameters, for example: I could
> > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > size,
> > w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> > usage. 
> 
> I meant w/o guest change, only vhost changes. Sorry about that.
> 
> Shirley

Ah, excellent. What were the parameters?

-- 
MST


* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296630891.26937.870.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 11:14:51PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > > 
> > > > The way I am changing is only when netif queue has stopped, then we
> > > > start to count num_free descriptors to send the signal to wake netif
> > > > queue.
> > >
> > > I forgot to mention, the code change I am making is in guest kernel, in
> > > xmit call back only wake up the queue when it's stopped && num_free >=
> > > 1/2 *vq->num, I add a new API in virtio_ring.
> > 
> > Interesting. Yes, I agree an API extension would be helpful. However,
> > wouldn't just the signaling reduction be enough, without guest
> > changes?
> 
> w/i guest change, I played around with the parameters, for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage.

We need to consider them separately IMO.  What's the best we can get
without guest change?  And which parameters give it?
There will always be old guests, and as far as I can tell
it should work better from host.

> > > However vhost signaling reduction is needed as well. The patch I
> > > submitted a while ago showed both CPUs and BW improvement.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Which patch was that? 
> 
> The patch was called "vhost: TX used buffer guest signal accumulation".
Yes, a somewhat similar idea.

> You suggested to split add_used_bufs and signal.
Exactly. And this is basically what this patch does.

> I am still thinking
> what's the best approach to cooperate guest (virtio_kick) and
> vhost(handle_tx), vhost(signaling) and guest (xmit callback) to reduce
> the overheads, so I haven't submitted the new patch yet.
> 
> Thanks
> Shirley


-- 
MST


* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 10:48 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Shirley Ma, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <OFF5778D3C.84F46700-ON6525782B.00230A94-6525782B.00240890@in.ibm.com>

On Wed, Feb 02, 2011 at 12:04:37PM +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)
> 
> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.
> 
> I got a small improvement in my testing up to some number of threads
> (32 or 48?), but beyond that I was getting a regression.
> 
> Thanks,
> 
> - KK
> 
> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.

Yes, I think doing this in the host is much simpler,
just send an interrupt after there's a decent amount
of space in the queue.

Having said that the simple heuristic that I coded
might be a bit too simple.
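
The guest-side wake condition quoted above (wake the netif queue only
when it is stopped and at least half of the ring is free) could be
sketched roughly as below. This is only an illustration under stated
assumptions: struct tx_ring_state and should_wake_queue() are
hypothetical names, not the virtio_ring API, and the race Krishna
describes around free_old_xmit_skbs() is not modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the transmit ring state; not a kernel type. */
struct tx_ring_state {
	unsigned int num;       /* total descriptors in the ring (vq->num) */
	unsigned int num_free;  /* descriptors currently free */
	bool queue_stopped;     /* true once the netif queue has been stopped */
};

/*
 * Wake only a queue that is actually stopped, and only once at least
 * half of the ring is free (num_free >= 1/2 * num, written without
 * integer division so odd ring sizes are not rounded down).
 */
bool should_wake_queue(const struct tx_ring_state *r)
{
	return r->queue_stopped && 2u * r->num_free >= r->num;
}
```

With a 256-entry ring, a stopped queue would be woken at 128 free
descriptors but not at 127, and a running queue is never "woken" again.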

-- 
MST


* Re: Bonding on bond
From: Nicolas de Pesloüan @ 2011-02-02 10:19 UTC (permalink / raw)
  To: Jay Vosburgh, Jiri Bohac
  Cc: bonding-devel@lists.sourceforge.net, netdev@vger.kernel.org
In-Reply-To: <15526.1296261528@death>

On 29/01/2011 01:38, Jay Vosburgh wrote:
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

[snip]

>> However, the ingress path doesn't work at all. bond0 is unable to receive any packets (ARP or IP).
>
> 	In light of this, I don't see a problem with disallowing nesting
> of bonds.  It should be documented in bonding.txt.

Ok, I will do that.

Jiri, any trouble with me stealing your patch (code) and adding the documentation update part? Or do 
you prefer to do it yourself?

[snip]

>> That being said, we still miss a way to achieve a simple configuration
>> with several links doing load balancing to a switch and one or several
>> links doing fail over to another switch, both switches *not* being 802.3ad
>> capable.
>
> 	This is a harder problem, but it's something that doesn't work
> today (and I suspect hasn't for a long time, so if somebody was using
> this, I think there would have been some discussion).

In the meantime, I will state in the documentation that:

- nesting is not allowed.
- only the above particular setup would possibly require nesting.
- this can be achieved using 802.3ad mode, connected to 802.3ad capable switches.

>> Should we arrange for bonding to be allowed to nest, for this purpose, or
>> should we find a way to setup this configuration with a single level of
>> bonding ? I would prefer the second, but...
>
> 	I'm not sure that either is necessary; 802.3ad will do this
> today, and few current production switches lack 802.3ad support.
>
> 	Adding support for etherchannel (i.e., not 802.3ad) gang
> failover is nontrivial, because the multiple etherchannel port groups
> will have to be managed separately, and most likely assigned manually.
> Sure, it'd be nice to have, but I'm not sure if it's a benefit worth the
> effort.

I'm far from an 802.3ad (802.1AX) specialist, but... wouldn't it be possible to force the aggregator 
by hand, for every slave, to achieve the same effect as receiving LACPDU, when connected to non 
802.3ad capable switches?

echo 802.3ad > /sys/class/net/bond0/bonding/mode
echo +eth0 > /sys/class/net/bond0/bonding/slaves
echo +eth1 > /sys/class/net/bond0/bonding/slaves
echo +eth2 > /sys/class/net/bond0/bonding/slaves
echo 1 > /sys/class/net/bond0/bonding/ad_aggregator_eth0 # those sysfs entries to be created...
echo 1 > /sys/class/net/bond0/bonding/ad_aggregator_eth1
echo 2 > /sys/class/net/bond0/bonding/ad_aggregator_eth2

> 	Either way, for now, since I recall you mentioned in another
> email that you'd crashed the system from nesting bonds, I don't see a
> problem with disallowing nesting and updating the documentation with a
> bit of this discussion (e.g., "nesting doesn't work, you're probably
> trying to do gang failover, which 802.3ad already does for you").

Thanks.

	Nicolas.


* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Vlad Dogaru @ 2011-02-02  9:56 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Stephen Hemminger
In-Reply-To: <4D492289.8090708@trash.net>

On Wed, Feb 02, 2011 at 10:23:21AM +0100, Patrick McHardy wrote:
> On 02.02.2011 10:13, Vlad Dogaru wrote:
> > On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
> >> On 26.01.2011 17:41, Vlad Dogaru wrote:
> >>> Use the group keyword to specify what group the device should belong to.
> >>> Since the kernel uses numbers internally, mapping of group names to
> >>> numbers is defined in /etc/iproute2/group_map. Example usage:
> >>>
> >>>   ip link set dev eth0 group default
> >>>
> >>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
> >>>  			if (get_integer(&mtu, *argv, 0))
> >>>  				invarg("Invalid \"mtu\" value\n", *argv);
> >>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> >>> +		} else if (strcmp(*argv, "group") == 0) {
> >>> +			NEXT_ARG();
> >>> +			if (group != -1)
> >>> +				duparg("group", *argv);
> >>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> >>> +				invarg("Invalid \"group\" value\n", *argv);
> >>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
> >>
> >> I think it would be preferable to use a function similar to
> >> rt_realm_n2a() that can also handle plain numerical values.
> > 
> > The a2n() functions are rather complex for this case: they employ
> > caching and store a table. I suppose this is because multiple calls to
> > them are possible in a single run and the correspondence has to be made
> > in both ways (a2n and n2a).
> > 
> > A network group is only converted to a number at most once for each ip
> > process spawned, so storing a table is not really helpful. What could,
> > however, help is using get_integer before lookup_map_id. Only if
> > get_integer fails would we lookup the symbolic group name.
> 
> Actually that's not entirely correct, the caches are (also) maintained
> to speed up batch mode, in which case there could also be multiple name
> to group mappings.

Both comments noted. I will respin the patches dropping the devgroup
keyword and implementing caching for groups.

Thanks for the feedback.


* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-02-02  9:54 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <19551.1296268113@death>

On 29/01/2011 03:28, Jay Vosburgh wrote:
> 	I've thought about this whole thing, and here's what I view as
> the proper way to do this.
>
> 	In my mind, this proposal is two separate pieces:
>
> 	First, a piece to make round-robin a selectable hash for
> xmit_hash_policy.  The documentation for this should follow the pattern
> of the "layer3+4" hash policy, in particular noting that the new
> algorithm violates the 802.3ad standard in exciting ways, will result in
> out of order delivery, and that other 802.3ad implementations may or may
> not tolerate this.
>
> 	Second, a piece to make certain transmitted packets use the
> source MAC of the sending slave instead of the bond's MAC.  This should
> be a separate option from the round-robin hash policy.  I'd call it
> something like "mac_select" with two values: "default" (what we do now)
> and "slave_src_mac" to use the slave's real MAC for certain types of
> traffic (I'm open to better names; that's just what I came up with while
> writing this).  I believe that "certain types" means "everything but
> ARP," but might be "only IP and IPv6."  Structuring the option in this
> manner leaves the option open for additional selections in the future,
> which a simple "on/off" option wouldn't.  This option should probably
> only affect a subset of modes; I'm thinking anything except balance-tlb
> or -alb (because they do funky MAC things already) and active-backup (it
> doesn't balance traffic, and already uses fail_over_mac to control
> this).  I think this option also needs a whole new section down in the
> bottom explaining how to exploit it (the "pick special MACs on slaves to
> trick switch hash" business).
>
> 	Comments?

Looks really sensible to me.

I just propose the following option and option values : "src_mac_select" (instead of mac_select), 
with "default" and "slave_mac" (instead of slave_src_mac) as possible values. In the future, we 
might need a "dst_mac_select" option... :-)

Also, are there any risks that this kind of session load-balancing won't properly cooperate with 
multiqueue (as explained in "Overriding Configuration for Special Cases" in 
Documentation/networking/bonding.txt)? I think it is important to ensure we keep the ability to fine 
tune the egress path selection.

	Nicolas.


* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  9:23 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <20110202091315.GJ2494@cormyr>

On 02.02.2011 10:13, Vlad Dogaru wrote:
> On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
>> On 26.01.2011 17:41, Vlad Dogaru wrote:
>>> Use the group keyword to specify what group the device should belong to.
>>> Since the kernel uses numbers internally, mapping of group names to
>>> numbers is defined in /etc/iproute2/group_map. Example usage:
>>>
>>>   ip link set dev eth0 group default
>>>
>>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>>>  			if (get_integer(&mtu, *argv, 0))
>>>  				invarg("Invalid \"mtu\" value\n", *argv);
>>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
>>> +		} else if (strcmp(*argv, "group") == 0) {
>>> +			NEXT_ARG();
>>> +			if (group != -1)
>>> +				duparg("group", *argv);
>>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
>>> +				invarg("Invalid \"group\" value\n", *argv);
>>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
>>
>> I think it would be preferable to use a function similar to
>> rt_realm_n2a() that can also handle plain numerical values.
> 
> The a2n() functions are rather complex for this case: they employ
> caching and store a table. I suppose this is because multiple calls to
> them are possible in a single run and the correspondence has to be made
> in both ways (a2n and n2a).
> 
> A network group is only converted to a number at most once for each ip
> process spawned, so storing a table is not really helpful. What could,
> however, help is using get_integer before lookup_map_id. Only if
> get_integer fails would we lookup the symbolic group name.

Actually that's not entirely correct, the caches are (also) maintained
to speed up batch mode, in which case there could also be multiple name
to group mappings.


* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  9:21 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <20110202091315.GJ2494@cormyr>

On 02.02.2011 10:13, Vlad Dogaru wrote:
> On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
>> On 26.01.2011 17:41, Vlad Dogaru wrote:
>>> Use the group keyword to specify what group the device should belong to.
>>> Since the kernel uses numbers internally, mapping of group names to
>>> numbers is defined in /etc/iproute2/group_map. Example usage:
>>>
>>>   ip link set dev eth0 group default
>>>
>>> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>>>  			if (get_integer(&mtu, *argv, 0))
>>>  				invarg("Invalid \"mtu\" value\n", *argv);
>>>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
>>> +		} else if (strcmp(*argv, "group") == 0) {
>>> +			NEXT_ARG();
>>> +			if (group != -1)
>>> +				duparg("group", *argv);
>>> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
>>> +				invarg("Invalid \"group\" value\n", *argv);
>>> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
>>
>> I think it would be preferable to use a function similar to
>> rt_realm_n2a() that can also handle plain numerical values.
> 
> The a2n() functions are rather complex for this case: they employ
> caching and store a table. I suppose this is because multiple calls to
> them are possible in a single run and the correspondence has to be made
> in both ways (a2n and n2a).
> 
> A network group is only converted to a number at most once for each ip
> process spawned, so storing a table is not really helpful. What could,
> however, help is using get_integer before lookup_map_id. Only if
> get_integer fails would we lookup the symbolic group name.
> 
> Does that make sense?

Sure, that would be fine as well.

One more thing I find confusing is that for assigning a group
to a device the parameter is called "group", for performing
actions on a group it's called "devgroup". Why not simply use
"group" for both cases? The case "ip link set devgroup X group Y"
doesn't work anyways since the IFLA_GROUP attribute is used
for both.


* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Vlad Dogaru @ 2011-02-02  9:13 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Stephen Hemminger
In-Reply-To: <4D491C3C.2010805@trash.net>

On Wed, Feb 02, 2011 at 09:56:28AM +0100, Patrick McHardy wrote:
> On 26.01.2011 17:41, Vlad Dogaru wrote:
> > Use the group keyword to specify what group the device should belong to.
> > Since the kernel uses numbers internally, mapping of group names to
> > numbers is defined in /etc/iproute2/group_map. Example usage:
> > 
> >   ip link set dev eth0 group default
> > 
> > @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
> >  			if (get_integer(&mtu, *argv, 0))
> >  				invarg("Invalid \"mtu\" value\n", *argv);
> >  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> > +		} else if (strcmp(*argv, "group") == 0) {
> > +			NEXT_ARG();
> > +			if (group != -1)
> > +				duparg("group", *argv);
> > +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> > +				invarg("Invalid \"group\" value\n", *argv);
> > +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);
> 
> I think it would be preferable to use a function similar to
> rt_realm_n2a() that can also handle plain numerical values.

The a2n() functions are rather complex for this case: they employ
caching and store a table. I suppose this is because multiple calls to
them are possible in a single run and the correspondence has to be made
in both ways (a2n and n2a).

A network group is only converted to a number at most once for each ip
process spawned, so storing a table is not really helpful. What could,
however, help is using get_integer before lookup_map_id. Only if
get_integer fails would we lookup the symbolic group name.

Does that make sense?
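
A rough sketch of the numeric-first idea described above: try to parse
the argument as a plain integer, and only fall back to a symbolic name
lookup if that fails. It deliberately does not call the iproute2
helpers get_integer() or lookup_map_id(); parse_group_number(),
parse_group() and the small group_table are hypothetical, and the table
merely stands in for /etc/iproute2/group_map.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-in for /etc/iproute2/group_map. */
struct group_entry {
	const char *name;
	int id;
};

static const struct group_entry group_table[] = {
	{ "default", 0 },
	{ "uplink", 1 },	/* made-up entry, for illustration only */
};

/* Parse a non-negative integer (any C base prefix); 0 on success, -1 on failure. */
int parse_group_number(const char *arg, int *id)
{
	char *end;
	long v;

	if (arg == NULL || *arg == '\0')
		return -1;
	errno = 0;
	v = strtol(arg, &end, 0);
	if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX)
		return -1;
	*id = (int)v;
	return 0;
}

/* Numeric value first; symbolic name only if the number parse fails. */
int parse_group(const char *arg, int *id)
{
	size_t i;

	if (parse_group_number(arg, id) == 0)
		return 0;
	if (arg == NULL)
		return -1;
	for (i = 0; i < sizeof(group_table) / sizeof(group_table[0]); i++) {
		if (strcmp(arg, group_table[i].name) == 0) {
			*id = group_table[i].id;
			return 0;
		}
	}
	return -1;
}
```

In this sketch "7" and "default" are both accepted, while a string such
as "12abc" is neither a number nor a known name and is rejected.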


* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  8:56 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1296060086-18777-2-git-send-email-ddvlad@rosedu.org>

On 26.01.2011 17:41, Vlad Dogaru wrote:
> Use the group keyword to specify what group the device should belong to.
> Since the kernel uses numbers internally, mapping of group names to
> numbers is defined in /etc/iproute2/group_map. Example usage:
> 
>   ip link set dev eth0 group default
> 
> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>  			if (get_integer(&mtu, *argv, 0))
>  				invarg("Invalid \"mtu\" value\n", *argv);
>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> +		} else if (strcmp(*argv, "group") == 0) {
> +			NEXT_ARG();
> +			if (group != -1)
> +				duparg("group", *argv);
> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> +				invarg("Invalid \"group\" value\n", *argv);
> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);

I think it would be preferable to use a function similar to
rt_realm_n2a() that can also handle plain numerical values.


* RE: Using ethernet device as efficient small packet generator
From: juice @ 2011-02-02  8:13 UTC (permalink / raw)
  To: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Eric Dumazet,
	"Stephen Hemming
In-Reply-To: <8ad1defdf427ceb7af94fad4d216b006.squirrel@www.liukuma.net>

>
>> your computation of Bandwidth (as Ben Greear said) is not accounting for
>> the interframe gaps.  Maybe more useful is to note that wire speed 64
>> byte packets is 1.44 Million packets per second.
>
> I am aware of the fact that interframe gap eats away some of the bandwidth
> from actual data bytes, and I am taking that into consideration.
> My benchmark here is the Spirent AX4000 network analyzer, which can send
> and receive full utilization of GE line.
>
> The measurement when sending full line rate from AX4000 are:
>   Total bitrate:             761.903 MBits/s
>   Packet rate:               1488090 packets/s
>   Bandwidth:                 76.19% GE
>   Average packet interval:   0.67 us
>
>
>> I think you need different hardware (again) as you have saddled yourself
>> with a x1 PCIe connected adapter.  This adapter is not well suited to
>> small packet traffic because the sheer amount of transactions is
>> affected by the added latency due to the x1 connector (vs our dual port
>> 1GbE adapters with a x4 connector)
>>
>> with Core i3/5/7 or newer cpus you should be able to saturate a 1Gb link
>> with a single core/queue.  With Core2 era processors you may have some
>> difficulty, with anything older than that you won't make it. :-)
>>
>> My suggestion is to get one of the igb based adapters, 82576, or 82580
>> based that run the igb driver.
>>
>> If you can't get a hold of those you should be able to easily get 1.1M
>> pps from an 82571 adapter.
>
> I will order the 82576 card and try my tests with that.
>

Okay, I have just installed the new 82576 DualGE adapter and compiled
the igb module for the 2.6.38-rc2 kernel I am running.

The results with this adapter look very promising: I am now able to reach
the full GE bandwidth with 64 byte packets with only interrupt CPU affinity
tuning, no other tweaks needed:

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 0  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:97:21:76 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1941436194us  stopped: 1948155853us idle: 179us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 6719658(c6719479+d179) nsec, 10000000 (60byte,0frags)
  1488170pps 714Mb/sec (714321600bps) errors: 0

AX4000 measurements:
   Total bitrate:             761.910 MBits/s
   Packet rate:               1488106 packets/s
   Bandwidth:                 76.19% GE
   Average packet interval:   0.67 us
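
The pktgen and AX4000 figures above agree once the per-frame overhead is
accounted for. The following is a small Python sketch of that arithmetic.
It assumes standard Ethernet framing (8 bytes of preamble/SFD and a 12
byte interframe gap per frame) and assumes that pktgen's 60 byte size
excludes the 4 byte FCS while the analyzer counts the full 64 byte frame;
the function names are made up for illustration.

```python
# Sketch of the line-rate arithmetic; the framing constants are standard
# Ethernet values, the function names are hypothetical.
PREAMBLE_SFD = 8       # bytes sent before every frame
INTERFRAME_GAP = 12    # idle bytes between frames
LINK_BPS = 1_000_000_000


def wire_rate_pps(frame_bytes, link_bps=LINK_BPS):
    """Maximum frames per second for frames of frame_bytes (FCS included)."""
    bits_on_wire = (frame_bytes + PREAMBLE_SFD + INTERFRAME_GAP) * 8
    return link_bps / bits_on_wire


def bitrate_bps(pps, counted_bytes):
    """Bit rate when only counted_bytes of each frame are counted."""
    return pps * counted_bytes * 8


if __name__ == "__main__":
    print(f"theoretical 64 byte rate: {wire_rate_pps(64):.0f} pps")
    print(f"pktgen (60 bytes counted): {bitrate_bps(1488170, 60)} bps")
    print(f"AX4000 (64 bytes counted): {bitrate_bps(1488106, 64) / 1e6:.3f} Mbit/s")
    print(f"payload share of the link: {64 / 84:.2%}")
```

Under these assumptions the measured 1488170 pps is the theoretical
maximum for 64 byte frames to within measurement noise, and the 714 Mb/s
versus 761.9 Mbit/s difference is only a matter of whether the FCS is
counted.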

Now I need to check whether I can send at similar rates from the userspace
socket interface. If that is possible, I may not even need to create a
kernel driver for my application.

Yours, Jussi Ohenoja



^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  7:37 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296630226.26937.859.camel@localhost.localdomain>

> Shirley Ma <mashirle@us.ibm.com> wrote:
>
> > I have tried this before. There are a couple of issues:
> >
> > 1. the free count will not reduce until you run free_old_xmit_skbs,
> >    which will not run anymore since the tx queue is stopped.
> > 2. You cannot call free_old_xmit_skbs directly as it races with a
> >    queue that was just awakened (current cb was due to the delay
> >    in disabling cb's).
> >
> > You have to call free_old_xmit_skbs() under netif_queue_stopped()
> > check to avoid the race.
>
> Yes, that's what I did: when the netif queue stops, don't enable the
> queue, just call free_old_xmit_skbs(); if not enough is freed, enable
> the callback until half of the ring size is freed, then wake the netif
> queue. But somehow I didn't reach the performance I got by dropping
> packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue, I haven't
tried both together.

Thanks,

- KK


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296630891.26937.870.camel@localhost.localdomain>

On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> w/i guest change, I played around with the parameters, for example: I
> could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> CPU usage.

I meant w/o guest change, only vhost changes. Sorry about that.

Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202062950.GD3818@redhat.com>

On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > 
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> Interesting. Yes, I agree an API extension would be helpful. However,
> wouldn't just the signaling reduction be enough, without guest
> changes?

w/i guest change, I played around with the parameters, for example: I
could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar
CPU usage.

> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.
> > 
> > Thanks
> > Shirley
> 
> Which patch was that? 

The patch was called "vhost: TX used buffer guest signal accumulation".
You suggested splitting add_used_bufs and signal. I am still thinking
about the best way to coordinate guest (virtio_kick) with vhost
(handle_tx), and vhost (signaling) with guest (xmit callback), to reduce
the overhead, so I haven't submitted the new patch yet.

Thanks
Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:03 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <OFF5778D3C.84F46700-ON6525782B.00230A94-6525782B.00240890@in.ibm.com>

On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then we
> > > start to count num_free descriptors to send the signal to wake netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel, in
> > xmit call back only wake up the queue when it's stopped && num_free >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)

> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the
queue, just call free_old_xmit_skbs(); if not enough is freed, enable
the callback until half of the ring size is freed, then wake the netif
queue. But somehow I didn't reach the performance I got by dropping
packets, need to think about it more. :)

Thanks
Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  6:34 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296627549.26937.856.camel@localhost.localdomain>

> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> >
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue.
>
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.
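
The two points above can be illustrated with a toy model. This is a
Python sketch, not virtio_net code: TxQueue and its methods are
hypothetical names that only mimic the roles of free_old_xmit_skbs(),
netif_queue_stopped() and the tx callback, and the half-ring wake
threshold is the one discussed in this thread. The free count changes
only when old buffers are reclaimed, and the callback reclaims only
while the queue is stopped.

```python
# Toy model only: TxQueue and its methods are hypothetical names that mimic
# the roles of free_old_xmit_skbs() and netif_queue_stopped() in virtio_net.


class TxQueue:
    def __init__(self, ring_size=256):
        self.ring_size = ring_size
        self.num_free = ring_size   # free descriptors in the ring
        self.completed = 0          # used by the host, not yet reclaimed
        self.stopped = False

    def xmit(self):
        """Queue one packet; stop the queue when the ring fills up."""
        if self.stopped:
            return False
        self.num_free -= 1
        if self.num_free == 0:
            self.stopped = True
        return True

    def host_consumes(self, n):
        """The host marks up to n in-flight buffers as used."""
        in_flight = self.ring_size - self.num_free - self.completed
        self.completed += min(n, in_flight)

    def free_old_xmit_skbs(self):
        """Reclaim completed buffers; the only place num_free grows."""
        self.num_free += self.completed
        self.completed = 0

    def tx_callback(self):
        """Host callback; returns True if the queue was woken."""
        # Reclaim from the callback only while the queue is stopped. Once
        # the queue is awake the xmit path does the reclaiming, and doing
        # it here as well would race with it.
        if not self.stopped:
            return False
        self.free_old_xmit_skbs()
        if self.num_free >= self.ring_size // 2:
            self.stopped = False
            return True
        return False   # keep callbacks enabled, wait for more completions
```

The model ignores locking and real callback enabling and disabling; it is
only meant to show why reclaiming has to happen under the stopped check.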

I got a small improvement in my testing up to some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.


^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  6:29 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296627549.26937.856.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > 
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue. 
> 
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However,
wouldn't just the signaling reduction be enough, without guest changes?

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.
> 
> Thanks
> Shirley

Which patch was that?

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  6:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296626748.26937.852.camel@localhost.localdomain>

On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> 
> The way I am changing is only when netif queue has stopped, then we
> start to count num_free descriptors to send the signal to wake netif
> queue. 

I forgot to mention, the code change I am making is in guest kernel, in
xmit call back only wake up the queue when it's stopped && num_free >=
1/2 *vq->num, I add a new API in virtio_ring.

However vhost signaling reduction is needed as well. The patch I
submitted a while ago showed both CPUs and BW improvement.

Thanks
Shirley


^ permalink raw reply

* Re: [PATCH] include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
From: Julia Lawall @ 2011-02-02  6:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, paul.moore, kernel-janitors
In-Reply-To: <20110201.145410.115936566.davem@davemloft.net>

On Tue, 1 Feb 2011, David Miller wrote:

> From: Julia Lawall <julia@diku.dk>
> Date: Fri, 28 Jan 2011 16:43:40 +0100 (CET)
> 
> > nlmsg_cancel can accept NULL as its second argument, so for similarity,
> > this patch extends genlmsg_cancel to be able to accept a NULL second
> > argument as well.
> > 
> > Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> I did a scan of all of the cases where this interface is used, and
> I cannot find a situation where this capability would even be useful.
> 
> The use pattern is always:
> 
> 	hdr = genlmsg_put(skb, ...);
> 	if (!hdr)
> 		goto out;
> 
> 	NLA_PUT_*();
> 	NLA_PUT_*();
> 	....
> 
> 	return genlmsg_end(skb, hdr);
> 
> nla_put_failure:
> 	genlmsg_cancel(skb, hdr);
> out:
> 	return -EWHATEVER;

This pattern occurred in eg:

net/netlabel/netlabel_unlabeled.c

in the function netlbl_unlabel_staticlist_gen and in other netlabel code, 
as well as in net/wireless/nl80211.c, but with the function nl80211hdr_put 
instead of genlmsg_put.  I submitted patches for all of these cases, so 
that is perhaps why you don't see them.  But someone suggested to change 
genlmsg_cancel as well, to be as permissive as nlmsg_cancel.

For nlmsg_cancel, there are two occurrences in 
net/netfilter/nf_conntrack_netlink.c where nlmsg_cancel is reachable with 
the second argument NULL.

For nlmsg_cancel the ability to accept NULL as a second argument comes 
from the fact that it only calls nlmsg_trim, which does nothing if NULL is 
the second argument.  nlmsg_trim is also called by nla_nest_cancel.  There 
are many calls to nla_nest_cancel with NULL as the second argument in the 
directory net/sched, for example in the function gred_dump in 
net/sched/sch_gred.c.  net/sched also contains a call to nlmsg_trim with 
NULL as the second argument, in the function flow_dump, in 
net/sched/cls_flow.c.

The whole thing seems somewhat sloppy.  I'm sure that all of the 
above-cited occurrences could be rewritten as outlined above to skip over 
the cancel/trim function.

julia

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  6:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202044002.GB3818@redhat.com>

On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote:
> Just tweak the parameters with sysfs, you do not have to edit the code:
> echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce
> 
> Or in a similar way for tx_packets_coalesce (since we use indirect,
> packets will typically use 1 buffer each).

We should use packets instead of buffers: in the indirect case, one packet
has multiple buffers, but each packet uses one descriptor from the ring
(default size is 256).

echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The way I am changing is only when netif queue has stopped, then we
start to count num_free descriptors to send the signal to wake netif
queue.

Shirley


^ permalink raw reply

* Re: [PATCH] include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
From: Julia Lawall @ 2011-02-02  5:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, paul.moore, kernel-janitors
In-Reply-To: <20110201.145410.115936566.davem@davemloft.net>

On Tue, 1 Feb 2011, David Miller wrote:

> From: Julia Lawall <julia@diku.dk>
> Date: Fri, 28 Jan 2011 16:43:40 +0100 (CET)
> 
> > nlmsg_cancel can accept NULL as its second argument, so for similarity,
> > this patch extends genlmsg_cancel to be able to accept a NULL second
> > argument as well.
> > 
> > Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> I did a scan of all of the cases where this interface is used, and
> I cannot find a situation where this capability would even be useful.
> 
> The use pattern is always:
> 
> 	hdr = genlmsg_put(skb, ...);
> 	if (!hdr)
> 		goto out;
> 
> 	NLA_PUT_*();
> 	NLA_PUT_*();
> 	....
> 
> 	return genlmsg_end(skb, hdr);
> 
> nla_put_failure:
> 	genlmsg_cancel(skb, hdr);
> out:
> 	return -EWHATEVER;
> 
> Always, hdr will be non-NULL.
> 
> We have to allocate the header first, then put the netlink
> attributes.
> 
> Looking over users of nlmsg_cancel(), the situation seems to
> match identically.
> 
> Therefore, it seems to me that it makes more sense to remove
> the NULL check from nlmsg_cancel() than to add the NULL check
> to genlmsg_cancel().

I saw lots of cases that could be done like this, but were not; they had 
goto nla_put_failure instead.

I will double check.

julia

^ permalink raw reply


