Netdev List
* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy and source mac selection mode
From: Oleg V. Ukhno @ 2011-02-28 10:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fubar
In-Reply-To: <20110227.160823.35047612.davem@davemloft.net>

David, thank you for the reply.
This is actually the second version of the patch discussed previously at
http://patchwork.ozlabs.org/patch/78994/
I've reworked that patch into the current version
(patchwork.ozlabs.org/patch/83389/) in the way Jay suggested.
Jay, could you please comment on the reworked patch?

On 02/28/2011 03:08 AM, David Miller wrote:
> From: "Oleg V. Ukhno"<olegu@yandex-team.ru>
> Date: Wed, 16 Feb 2011 22:13:41 +0300
>
>>
> Can we get some feedback on this patch from bonding folks?
>
> I'm not applying it blindly without at least one bonding developer
> saying it at least looks ok.
>
> Thanks.
>


-- 
Best regards,
Head of Operations for
Commercial and Financial Services
Yandex LLC

Oleg Ukhno




* Re: Bug in kvm_set_irq
From: Michael S. Tsirkin @ 2011-02-28 10:11 UTC (permalink / raw)
  To: Jean-Philippe Menil; +Cc: kvm, netdev, virtualization
In-Reply-To: <4D6B634E.9090801@univ-nantes.fr>

On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
> On 27/02/2011 18:00, Michael S. Tsirkin wrote:
> >On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> >>Hi,
> >>
> >>Each time I try to use vhost_net, I hit a kernel bug.
> >>I do a "modprobe vhost_net" and start the guest with vhost=on.
> >>
> >>The trace below is from kernel 2.6.37, but I had the same
> >>problem with 2.6.36 (cf. https://lkml.org/lkml/2010/11/30/29).
> >2.6.36 had a theoretical race that could explain this,
> >but it should be ok in 2.6.37.
> >
> >>The bug only occurs with vhost_net loaded, so I don't know if this
> >>is a bug in the kvm module code or in the vhost_net code.
> >It could be a bug in eventfd, which is the interface
> >used by both kvm and vhost_net.
> >Just for fun, you can try 2.6.38 - eventfd code has been changed
> >a lot in 2.6.38 and if it does not trigger there
> >it's a hint that irqfd is the reason.
> >
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243100] BUG: unable to handle kernel paging request at
> >>0000000000002458
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >
> >Could you run markup_oops/ ksymoops on this please?
> >As far as I can see kvm_set_irq can only get a wrong
> >kvm pointer. Unless there's some general memory corruption,
> >I'd guess
> >
> >You can also try comparing the irqfd->kvm pointer in
> >kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in
> >virt/kvm/eventfd.c.
> >
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243556] Oops: 0000 [#1] SMP
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243692] last sysfs file:
> >>/sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.243777] CPU 0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.243820] Modules linked in: vhost_net macvtap macvlan tun
> >>powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> >>cpufreq_ondemand fre
> >>q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
> >>ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
> >>xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> >>nf_conntrack_ftp nf_connt
> >>rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
> >>dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> >>nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> >>snd_page_alloc tpm_tis tpm ps
> >>mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
> >>serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
> >>xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
> >>sd_mod enclosu
> >>re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
> >>scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> >>2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RIP: 0010:[<ffffffffa041aa8a>]  [<ffffffffa041aa8a>]
> >>kvm_set_irq+0x2a/0x130 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RSP: 0018:ffff88045fc89d30  EFLAGS: 00010246
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
> >>0000000000000001
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
> >>ffff880856a91e48
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] FS:  00007f617986c710(0000) GS:ffff88007f800000(0000)
> >>knlGS:0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
> >>00000000000006f0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>0000000000000000
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >>0000000000000400
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Process kworker/0:1 (pid: 10, threadinfo
> >>ffff88045fc88000, task ffff88085fc53c30)
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123] Stack:
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
> >>ffff88085fc53ee8
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
> >>00000000000119c0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  00000000000119c0 ffffffff8137f7ce ffff88007f80df40
> >>00000000ffffffff
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Call Trace:
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106be25>] ? worker_thread+0x145/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106f786>] ? kthread+0x96/0xa0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
> >>fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
> >>00 00 00<4
> >>9>  8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] RIP  [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123]  RSP<ffff88045fc89d30>
> >>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>685.246123] CR2: 0000000000002458
> >>
> >>
> >>Could someone help me figure out how to solve this?
> >>
> >>Regards.
> >>_______________________________________________
> >>Virtualization mailing list
> >>Virtualization@lists.linux-foundation.org
> >>https://lists.linux-foundation.org/mailman/listinfo/virtualization
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> Hi,
> 
> thanks for your response.
> 
> This is what markup_oops.pl returns:
> "No matching code found"

Well, let's try to understand what's there.

Do objdump -ldS kvm.ko
look for <kvm_set_irq>

and paste the content from start of that function
to offset 0x2a and a bit beyond.

You can also upload your kvm.ko somewhere, I'll try to take a look.


> So either this is not a vhost_net bug, or my oops is incomplete and
> markup_oops can't find the right vma offset.
> 
> I will try to compare the pointers you pointed out, even though it could be
> a little difficult for me.

Hmm, you know how to add printk to the code and rebuild, right?

> 
> Maybe I will try 2.6.38; I will wait for a response from the kvm team.
> 
> Regards.
> 
> -- 
> Jean-Philippe Menil - Pôle réseau Service IRTS
> DSI Université de Nantes
> jean-philippe.menil@univ-nantes.fr
> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09


* Re: Bug in kvm_set_irq
From: Jean-Philippe Menil @ 2011-02-28 10:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization
In-Reply-To: <20110228101139.GD28006@redhat.com>

On 28/02/2011 11:11, Michael S. Tsirkin wrote:
> On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
>> On 27/02/2011 18:00, Michael S. Tsirkin wrote:
>>> On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
>>>> Hi,
>>>>
>>>> Each time I try to use vhost_net, I hit a kernel bug.
>>>> I do a "modprobe vhost_net" and start the guest with vhost=on.
>>>>
>>>> The trace below is from kernel 2.6.37, but I had the same
>>>> problem with 2.6.36 (cf. https://lkml.org/lkml/2010/11/30/29).
>>> 2.6.36 had a theoretical race that could explain this,
>>> but it should be ok in 2.6.37.
>>>
>>>> The bug only occurs with vhost_net loaded, so I don't know if this
>>>> is a bug in the kvm module code or in the vhost_net code.
>>> It could be a bug in eventfd, which is the interface
>>> used by both kvm and vhost_net.
>>> Just for fun, you can try 2.6.38 - eventfd code has been changed
>>> a lot in 2.6.38 and if it does not trigger there
>>> it's a hint that irqfd is the reason.
>>>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243100] BUG: unable to handle kernel paging request at
>>>> 0000000000002458
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>> Could you run markup_oops/ ksymoops on this please?
>>> As far as I can see kvm_set_irq can only get a wrong
>>> kvm pointer. Unless there's some general memory corruption,
>>> I'd guess
>>>
>>> You can also try comparing the irqfd->kvm pointer in
>>> kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq in
>>> virt/kvm/eventfd.c.
>>>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243556] Oops: 0000 [#1] SMP
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243692] last sysfs file:
>>>> /sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.243777] CPU 0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.243820] Modules linked in: vhost_net macvtap macvlan tun
>>>> powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
>>>> cpufreq_ondemand fre
>>>> q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
>>>> ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
>>>> xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
>>>> nf_conntrack_ftp nf_connt
>>>> rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
>>>> dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
>>>> nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
>>>> snd_page_alloc tpm_tis tpm ps
>>>> mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
>>>> serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
>>>> xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
>>>> sd_mod enclosu
>>>> re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
>>>> scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Pid: 10, comm: kworker/0:1 Not tainted
>>>> 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RIP: 0010:[<ffffffffa041aa8a>]  [<ffffffffa041aa8a>]
>>>> kvm_set_irq+0x2a/0x130 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RSP: 0018:ffff88045fc89d30  EFLAGS: 00010246
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
>>>> 0000000000000001
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
>>>> ffff880856a91e48
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] FS:  00007f617986c710(0000) GS:ffff88007f800000(0000)
>>>> knlGS:0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
>>>> 00000000000006f0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>> 0000000000000400
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Process kworker/0:1 (pid: 10, threadinfo
>>>> ffff88045fc88000, task ffff88085fc53c30)
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123] Stack:
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
>>>> ffff88085fc53ee8
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
>>>> 00000000000119c0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  00000000000119c0 ffffffff8137f7ce ffff88007f80df40
>>>> 00000000ffffffff
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Call Trace:
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106be25>] ? worker_thread+0x145/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106f786>] ? kthread+0x96/0xa0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
>>>> fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
>>>> 00 00 00<4
>>>> 9>   8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] RIP  [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123]  RSP<ffff88045fc89d30>
>>>> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
>>>> 685.246123] CR2: 0000000000002458
>>>>
>>>>
>>>> Could someone help me figure out how to solve this?
>>>>
>>>> Regards.
>> Hi,
>>
>> thanks for your response.
>>
>> This is what markup_oops.pl returns:
>> "No matching code found"
> Well, let's try to understand what's there.
>
> Do objdump -ldS kvm.ko
> look for <kvm_set_irq>
>
> and paste the content from start of that function
> to offset 0x2a and a bit beyond.
>
> You can also upload your kvm.ko somewhere, I'll try to take a look.
>
>
>> So either this is not a vhost_net bug, or my oops is incomplete and
>> markup_oops can't find the right vma offset.
>>
>> I will try to compare the pointers you pointed out, even though it could be
>> a little difficult for me.
> Hmm, you know how to add printk to the code and rebuild, right?
>
>> Maybe I will try 2.6.38; I will wait for a response from the kvm team.
>>
>> Regards.
>>
>> -- 
>> Jean-Philippe Menil - Pôle réseau Service IRTS
>> DSI Université de Nantes
>> jean-philippe.menil@univ-nantes.fr
>> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
So, here is the objdump output for kvm.ko (the
kvm_set_irq part):

0000000000006a60 <kvm_set_irq>:
kvm_set_irq():
     6a60:       41 57                   push   %r15
     6a62:       41 89 f7                mov    %esi,%r15d
     6a65:       41 56                   push   %r14
     6a67:       41 55                   push   %r13
     6a69:       41 89 cd                mov    %ecx,%r13d
     6a6c:       41 54                   push   %r12
     6a6e:       49 89 fc                mov    %rdi,%r12
     6a71:       55                      push   %rbp
     6a72:       53                      push   %rbx
     6a73:       89 d3                   mov    %edx,%ebx
     6a75:       48 81 ec 98 00 00 00    sub    $0x98,%rsp
     6a7c:       8b 15 00 00 00 00       mov    0x0(%rip),%edx        # 6a82 <kvm_set_irq+0x22>
     6a82:       85 d2                   test   %edx,%edx
     6a84:       0f 85 c4 00 00 00       jne    6b4e <kvm_set_irq+0xee>
     6a8a:       49 8b 84 24 58 24 00    mov    0x2458(%r12),%rax
     6a91:       00
     6a92:       3b 98 28 01 00 00       cmp    0x128(%rax),%ebx
     6a98:       73 5e                   jae    6af8 <kvm_set_irq+0x98>
     6a9a:       89 db                   mov    %ebx,%ebx
     6a9c:       48 8b 84 d8 30 01 00    mov    0x130(%rax,%rbx,8),%rax
     6aa3:       00
     6aa4:       48 85 c0                test   %rax,%rax
     6aa7:       74 4f                   je     6af8 <kvm_set_irq+0x98>
     6aa9:       48 89 e2                mov    %rsp,%rdx
     6aac:       31 db                   xor    %ebx,%ebx
     6aae:       48 8b 08                mov    (%rax),%rcx
     6ab1:       83 c3 01                add    $0x1,%ebx
     6ab4:       0f 18 09                prefetcht0 (%rcx)
     6ab7:       48 8b 48 e0             mov    -0x20(%rax),%rcx
     6abb:       48 89 0a                mov    %rcx,(%rdx)
     6abe:       48 8b 48 e8             mov    -0x18(%rax),%rcx
     6ac2:       48 89 4a 08             mov    %rcx,0x8(%rdx)
     6ac6:       48 8b 48 f0             mov    -0x10(%rax),%rcx
     6aca:       48 89 4a 10             mov    %rcx,0x10(%rdx)
     6ace:       48 8b 48 f8             mov    -0x8(%rax),%rcx
     6ad2:       48 89 4a 18             mov    %rcx,0x18(%rdx)
     6ad6:       48 8b 08                mov    (%rax),%rcx
     6ad9:       48 89 4a 20             mov    %rcx,0x20(%rdx)
     6add:       48 8b 48 08             mov    0x8(%rax),%rcx
     6ae1:       48 89 4a 28             mov    %rcx,0x28(%rdx)
     6ae5:       48 8b 00                mov    (%rax),%rax
     6ae8:       48 83 c2 30             add    $0x30,%rdx
     6aec:       48 85 c0                test   %rax,%rax
     6aef:       75 bd                   jne    6aae <kvm_set_irq+0x4e>
     6af1:       eb 07                   jmp    6afa <kvm_set_irq+0x9a>
     6af3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
     6af8:       31 db                   xor    %ebx,%ebx
     6afa:       bd ff ff ff ff          mov    $0xffffffff,%ebp
     6aff:       49 89 e6                mov    %rsp,%r14
     6b02:       85 db                   test   %ebx,%ebx
     6b04:       74 34                   je     6b3a <kvm_set_irq+0xda>
     6b06:       83 eb 01                sub    $0x1,%ebx
     6b09:       44 89 e9                mov    %r13d,%ecx
     6b0c:       44 89 fa                mov    %r15d,%edx
     6b0f:       48 63 c3                movslq %ebx,%rax
     6b12:       4c 89 e6                mov    %r12,%rsi
     6b15:       48 8d 04 40             lea    (%rax,%rax,2),%rax
     6b19:       48 c1 e0 04             shl    $0x4,%rax
     6b1d:       49 8d 3c 06             lea    (%r14,%rax,1),%rdi
     6b21:       ff 54 04 08             callq  *0x8(%rsp,%rax,1)
     6b25:       85 c0                   test   %eax,%eax
     6b27:       78 d9                   js     6b02 <kvm_set_irq+0xa2>
     6b29:       85 ed                   test   %ebp,%ebp
     6b2b:       ba 00 00 00 00          mov    $0x0,%edx
     6b30:       0f 48 ea                cmovs  %edx,%ebp
     6b33:       85 db                   test   %ebx,%ebx
     6b35:       8d 2c 28                lea    (%rax,%rbp,1),%ebp
     6b38:       75 cc                   jne    6b06 <kvm_set_irq+0xa6>
     6b3a:       48 81 c4 98 00 00 00    add    $0x98,%rsp
     6b41:       89 e8                   mov    %ebp,%eax
     6b43:       5b                      pop    %rbx
     6b44:       5d                      pop    %rbp
     6b45:       41 5c                   pop    %r12
     6b47:       41 5d                   pop    %r13
     6b49:       41 5e                   pop    %r14
     6b4b:       41 5f                   pop    %r15
     6b4d:       c3                      retq
     6b4e:       48 8b 2d 00 00 00 00    mov    0x0(%rip),%rbp        # 6b55 <kvm_set_irq+0xf5>
     6b55:       48 85 ed                test   %rbp,%rbp
     6b58:       0f 84 2c ff ff ff       je     6a8a <kvm_set_irq+0x2a>
     6b5e:       48 8b 45 00             mov    0x0(%rbp),%rax
     6b62:       48 8b 7d 08             mov    0x8(%rbp),%rdi
     6b66:       48 83 c5 10             add    $0x10,%rbp
     6b6a:       44 89 f9                mov    %r15d,%ecx
     6b6d:       44 89 ea                mov    %r13d,%edx
     6b70:       89 de                   mov    %ebx,%esi
     6b72:       ff d0                   callq  *%rax
     6b74:       48 8b 45 00             mov    0x0(%rbp),%rax
     6b78:       48 85 c0                test   %rax,%rax
     6b7b:       75 e5                   jne    6b62 <kvm_set_irq+0x102>
     6b7d:       e9 08 ff ff ff          jmpq   6a8a <kvm_set_irq+0x2a>
     6b82:       66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
     6b89:       1f 84 00 00 00 00 00

I admit that this analysis is too complicated for me.
I can indeed rebuild a kernel with more printk calls and schedule a reboot.

The kvm.ko is available at the following address:
http://filex.univ-nantes.fr/get?k=k1jKhQghdcHLz12Z50H

Regards.

-- 
Jean-Philippe Menil - Pôle réseau Service IRTS
DSI Université de Nantes
jean-philippe.menil@univ-nantes.fr
Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
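
Note on reading the listing against the oops: the oops reports the fault at
kvm_set_irq+0x2a, and the function starts at 0x6a60 in kvm.ko, so the
faulting instruction is the one at 0x6a8a, "mov 0x2458(%r12),%rax". The
register dump shows R12 = 0, and 0 + 0x2458 equals the reported CR2. Since
%r12 is loaded from %rdi (the first argument under the x86-64 calling
convention) at 6a6e, this is consistent with Michael's guess of a bad kvm
pointer, here apparently NULL. The Python sketch below is only an
illustrative cross-check of that arithmetic; the helper names are
hypothetical, and every constant is copied from the oops and the objdump
output in this thread.

```python
# Values copied from the oops and the objdump listing in this thread.
FUNC_START = 0x6A60      # address of <kvm_set_irq> in kvm.ko
FAULT_OFFSET = 0x2A      # from "kvm_set_irq+0x2a/0x130"
DISPLACEMENT = 0x2458    # from "mov 0x2458(%r12),%rax"
R12 = 0x0                # R12 value in the oops register dump
CR2 = 0x2458             # faulting address reported by the oops


def faulting_instruction_address(func_start, offset):
    """Address inside the module image of the instruction that faulted."""
    return func_start + offset


def effective_address(base, displacement):
    """Address accessed by a base+displacement operand (64-bit wraparound)."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


insn = faulting_instruction_address(FUNC_START, FAULT_OFFSET)
addr = effective_address(R12, DISPLACEMENT)

print(hex(insn))          # 0x6a8a, the "mov 0x2458(%r12),%rax" line
print(hex(addr), addr == CR2)
```

If the match holds, the next step Michael suggests (printing irqfd->kvm in
kvm_irqfd_assign, irqfd_wakeup and kvm_set_irq) would show where the pointer
stops being valid.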



* Re: dccp: null-pointer dereference on close
From: Gerrit Renker @ 2011-02-28 11:21 UTC (permalink / raw)
  To: Johan Hovold; +Cc: Arnaldo Carvalho de Melo, David S. Miller, dccp, netdev
In-Reply-To: <20110226174505.GB3609@localhost>

On 32/64-bit x86 the problem has not been seen so far.

The problem seems to be that icsk->icsk_bind_hash is NULL in __inet_twsk_hashdance():

140        tw->tw_tb = icsk->icsk_bind_hash;
141        WARN_ON(!icsk->icsk_bind_hash); 

I will be looking at this later today - any hints on how to reproduce it would be appreciated.

Gerrit

Quoting Johan Hovold:
| Hi,
| 
| I triggered the null-pointer dereference below when closing a dccp
| socket on 2.6.37 the other day. The receive path is hit during
| close, and the socket has already been unhashed in dccp_set_state from
| dccp_close.
| 
| Thanks,
| Johan
| 
| 
| root@overo:~# [84140.128631] ------------[ cut here ]------------
| [84140.133575] WARNING: at net/ipv4/inet_timewait_sock.c:141 __inet_twsk_hashdance+0x48/0x128()
| [84140.142517] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
| [84140.151794] [<c0038850>] (unwind_backtrace+0x0/0xec) from [<c0055364>] (warn_slowpath_common)
| [84140.161743] [<c0055364>] (warn_slowpath_common+0x4c/0x64) from [<c0055398>] (warn_slowpath_n)
| [84140.171966] [<c0055398>] (warn_slowpath_null+0x1c/0x24) from [<c02b72d0>] (__inet_twsk_hashd)
| [84140.182373] [<c02b72d0>] (__inet_twsk_hashdance+0x48/0x128) from [<c031caa0>] (dccp_time_wai)
| [84140.192413] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
| [84140.202636] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
| [84140.213043] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
| [84140.222442] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
| [84140.231475] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
| [84140.240386] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
| [84140.249328] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
| [84140.258087] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
| [84140.266296] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
| [84140.274505] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
| [84140.283081] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
| [84140.292114] ---[ end trace b8877ec9d542c32e ]---
| [84140.296997] Unable to handle kernel NULL pointer dereference at virtual address 00000010
| [84140.305541] pgd = cedb0000
| [84140.308410] [00000010] *pgd=8ed22031, *pte=00000000, *ppte=00000000
| [84140.315032] Internal error: Oops: 17 [#1] PREEMPT
| [84140.320007] last sysfs file: /sys/kernel/uevent_seqnum
| [84140.325408] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
| [84140.334533] CPU: 0    Tainted: G        WC   (2.6.37+ #47)
| [84140.340332] PC is at __inet_twsk_hashdance+0x4c/0x128
| [84140.345642] LR is at warn_slowpath_null+0x1c/0x24
| [84140.350616] pc : [<c02b72d4>]    lr : [<c0055398>]    psr: 60000013
| [84140.350616] sp : ce975e68  ip : ce975db8  fp : cfbc5c00
| [84140.362701] r10: cfa3e400  r9 : cfbc5c18  r8 : 00000000
| [84140.368225] r7 : 00000006  r6 : cfa96110  r5 : cfa3e400  r4 : cfb54000
| [84140.375091] r3 : 00000002  r2 : 00000006  r1 : 00000000  r0 : 00000000
| [84140.381988] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
| [84140.389495] Control: 10c5387d  Table: 8edb0019  DAC: 00000015
| [84140.395538] Process be2p_ctrl (pid: 2207, stack limit = 0xce9742f0)
| [84140.402160] Stack: (0xce975e68 to 0xce976000)
| [84140.406738] 5e60:                   cfb54000 00000180 cfa3e400 c031caa0 00000007 cfbc5c00
| [84140.415374] 5e80: cfbc9824 00000020 00000007 c031c15c 00000000 00000022 00000000 00000008
| [84140.424011] 5ea0: 00000001 cfbc5c00 cfbc5c00 cfa3e400 cfbc9824 00000000 00000001 c04c11b8
| [84140.432617] 5ec0: be8ffc1c c032609c fa200000 c0033608 cfa3e400 cfa3e7b0 be8ffc1c ce975ee8
| [84140.441253] 5ee0: be8ffc1c cfbc5c00 cfa3e400 ce974000 00000000 c0286594 cfa3e474 cfa3e400
| [84140.449859] 5f00: cfa3e408 00000007 cf487c20 cf805840 cf60ca00 c031fd34 00000000 00000000
| [84140.458496] 5f20: cfb20288 cfa3e400 cf487c00 00000008 00000000 c02d9a78 00000003 00000000
| [84140.467102] 5f40: cf487c00 c0284ddc 00000000 cfb20288 cfb20280 c0284e94 00000000 c00c2e4c
| [84140.475738] 5f60: 00000000 00000000 cfb20280 00000000 cfbc50c0 00000006 c0033c04 ce974000
| [84140.484375] 5f80: 00000000 c00c0104 00000004 cfbc50c0 cfb20280 c00c01c4 400a1000 00000000
| [84140.492980] 5fa0: 0000891c c0033a80 400a1000 00000000 00000004 00000000 403d3014 00000000
| [84140.501617] 5fc0: 400a1000 00000000 0000891c 00000006 00000000 00000000 400a9000 be8ffc1c
| [84140.510223] 5fe0: 00000000 be8ffbe0 00009584 4036320c 60000010 00000004 00005153 bf0fa7d0
| [84140.518859] [<c02b72d4>] (__inet_twsk_hashdance+0x4c/0x128) from [<c031caa0>] (dccp_time_wai)
| [84140.528869] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
| [84140.539062] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
| [84140.549407] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
| [84140.558776] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
| [84140.567779] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
| [84140.576660] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
| [84140.585571] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
| [84140.594299] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
| [84140.602447] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
| [84140.610626] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
| [84140.619171] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
| [84140.628143] Code: e59f00dc e3a0108d ebf6782a e5941044 (e5912010) 
| [84140.634643] ---[ end trace b8877ec9d542c32f ]---
| [84140.639526] Kernel panic - not syncing: Fatal exception in interrupt
| 
| --
| To unsubscribe from this list: send the line "unsubscribe dccp" in
| the body of a message to majordomo@vger.kernel.org
| More majordomo info at  http://vger.kernel.org/majordomo-info.html
| 

-- 

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-28 11:23 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <AANLkTimofhhH5omyk=HhkyaNG+MGqoac4rDf=dPuR7K-@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]

Quoting Albert Cahalan <acahalan@gmail.com>:

> On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
> <jussi.kivilinna@mbnet.fi> wrote:
>
>> I made simple hack on sch_fifo with per packet time limits (attachment) this
>> weekend and have been doing limited testing on wireless link. I think
>> hardlimit is fine, it's simple and does somewhat same as what
>> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
>> hack checks for timed out packets on enqueue, might be wrong approach (on
>> other hand might allow some more burstiness).
>
> Thanks!
>
> I think the default is too high. 1 ms may even be a bit high.

Well, with a 10 ms buffer timeout, latency on a 54 Mbit wifi link
(zd1211rw driver) drops from >500 ms to 10-20 ms (ping RTT while iperf
is running at the same time). So for that it's good enough.
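
As a rough illustration of how a time limit relates to a byte or packet limit, the sketch below converts a time limit into the amount of data a link can drain in that time. The helper name is hypothetical, and using the nominal link rate is an assumption: real wifi throughput is well below 54 Mbit/s, so the real figure would be smaller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the attachment): number of bytes a
 * link running at rate_bps can drain in delay_ms milliseconds.
 */
static uint64_t bytes_for_delay(uint64_t rate_bps, uint64_t delay_ms)
{
	/* Guard against overflow of the intermediate product. */
	assert(delay_ms == 0 || rate_bps <= UINT64_MAX / delay_ms);

	return rate_bps * delay_ms / 8 / 1000;
}
```

At the nominal 54 Mbit/s, 10 ms corresponds to 67500 bytes, or 45 packets of 1500 bytes. At 56 kbit/s the same 10 ms is only 70 bytes, less than one full-size packet, which is why a minimum packet count is still needed alongside a time limit.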

>
> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.
>

I made an EWMA version of my fifo hack (attached). I added a minimum
queue limit of 2 packets and probabilistic (1%) ECN marking/dropping
once the average queueing time exceeds timeout/2.
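
The decision logic described above can be modelled in userspace roughly as follows. This is only a sketch: the struct and function names are made up for illustration, the random value is passed in by the caller, and the averaging formula is an assumption about how a weight-64 EWMA behaves rather than a copy of the kernel helper.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum verdict { VERDICT_OK, VERDICT_MARK_OR_DROP, VERDICT_DROP };

struct delay_model {
	unsigned long avg;	/* averaged queueing delay, e.g. in usec */
	unsigned long weight;	/* smoothing weight (64 in the attachment) */
	unsigned long timeout;	/* delay limit, same unit as avg */
};

/* Fold one measured queueing delay into the moving average. */
static void model_update(struct delay_model *m, unsigned long sample)
{
	assert(m->weight > 0);

	if (m->avg == 0)
		m->avg = sample;
	else
		m->avg = (m->avg * (m->weight - 1) + sample) / m->weight;
}

/*
 * Decide what to do with a newly arriving packet.
 * qlen is the current queue length in packets; rnd_permille is a
 * caller-supplied random number in 0..1000.
 */
static enum verdict model_check(const struct delay_model *m,
				unsigned int qlen, unsigned int rnd_permille)
{
	if (qlen <= 2)			/* always allow a minimum queue */
		return VERDICT_OK;
	if (m->avg > m->timeout)	/* hard limit exceeded */
		return VERDICT_DROP;
	if (m->avg > m->timeout / 2 && rnd_permille < 10)
		return VERDICT_MARK_OR_DROP;	/* about 1% of packets */
	return VERDICT_OK;
}
```

VERDICT_MARK_OR_DROP stands for "set the ECN CE bit if the packet supports it, otherwise drop", which is what the attachment does in the timeout/2 case.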

-Jussi


[-- Attachment #2: sch_fifo_ewma.c --]
[-- Type: text/x-csrc, Size: 7809 bytes --]

/*
 * sch_fifo_ewma.c	Simple FIFO EWMA timelimit queue.
 *
 * This program is free software; you can redistribute it and/or modify it under
 * the terms of the GNU General Public License as published by the Free Software
 * Foundation; either version 2 of the License, or (at your option) any later
 * version.
 *
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>

#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.h"
#else
#include <linux/average.h>
#endif

#define DEFAULT_PKT_TIMEOUT_MS		10
#define DEFAULT_PKT_TIMEOUT		PSCHED_NS2TICKS(NSEC_PER_MSEC * \
							DEFAULT_PKT_TIMEOUT_MS)
#define DEFAULT_PROB_HALF_DROP		10	/* 1% */

#define FIFO_EWMA_MIN_QDISC_LEN		2

struct tc_fifo_ewma_qopt {
	__u64	timeout;	/* Max time packet may stay in buffer */
	__u32   limit;		/* Queue length: bytes for bfifo, packets for pfifo */
};

struct fifo_ewma_skb_cb {
	psched_time_t	time_queued;
};

struct fifo_ewma_sched_data {
	psched_tdiff_t	timeout;
	u32		limit;
	struct ewma	ewma;
};

static inline
struct fifo_ewma_skb_cb *fifo_ewma_skb_cb(struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) <
		sizeof(struct qdisc_skb_cb) +
			sizeof(struct fifo_ewma_skb_cb));
	return (struct fifo_ewma_skb_cb *)qdisc_skb_cb(skb)->data;
}

static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;
	qdisc_enqueue_tail(skb, sch);

	return NET_XMIT_CN;
}

static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

/* Return a pseudo-random value in the range 0..1000 (per-mille units). */
static inline int fifo_get_prob(void)
{
	return (net_random() & 0xffff) * 1000 / 0xffff;
}

static struct sk_buff *fifo_ewma_dequeue(struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	psched_tdiff_t tdiff;

	if (likely(!q->timeout))
		goto no_ewma;

	skb = qdisc_peek_head(sch);
	if (!skb)
		return NULL;

	/* update EWMA */
	tdiff = psched_get_time() - fifo_ewma_skb_cb(skb)->time_queued;
	ewma_add(&q->ewma, tdiff);

no_ewma:
	return qdisc_dequeue_head(sch);
}

#define FIFO_EWMA_OK	0
#define FIFO_EWMA_DROP	1
#define FIFO_EWMA_CN	2

static int fifo_check_ewma_drop(struct sk_buff *skb, struct Qdisc *sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	unsigned long fifo_latency_avg;
	int ret = FIFO_EWMA_OK;

	if (likely(!q->timeout))
		goto no_ewma;

	/* lower limit */
	if (skb_queue_len(&sch->q) <= FIFO_EWMA_MIN_QDISC_LEN)
		goto no_drop;

	fifo_latency_avg = ewma_read(&q->ewma);

	/* hard drop */
	if (fifo_latency_avg > q->timeout) {
		/*printk(KERN_WARNING "fifo_ewma: hard drop\n");*/
		return FIFO_EWMA_DROP;
	}

	/* probabilistic drop */
	if (fifo_latency_avg > q->timeout / 2 &&
				fifo_get_prob() < DEFAULT_PROB_HALF_DROP) {
		if (!INET_ECN_set_ce(skb)) {
			/*printk(KERN_WARNING "fifo_ewma: prob drop\n");*/
			return FIFO_EWMA_DROP;
		}

		/*printk(KERN_WARNING "fifo_ewma: prob mark\n");*/
		ret = FIFO_EWMA_CN;
	}

no_drop:
	fifo_ewma_skb_cb(skb)->time_queued = psched_get_time();
no_ewma:
	return ret;
}

static int pfifo_ewma_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_tail_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int bfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = bfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int pfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int fifo_ewma_init(struct Qdisc *sch, struct nlattr *opt)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (opt == NULL) {
		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;

		q->limit = limit;
		q->timeout = DEFAULT_PKT_TIMEOUT;
	} else {
		struct tc_fifo_ewma_qopt *ctl = nla_data(opt);

		if (nla_len(opt) < sizeof(*ctl))
			return -EINVAL;

		q->limit = ctl->limit;
		q->timeout = ctl->timeout ? : DEFAULT_PKT_TIMEOUT;
	}

	ewma_init(&q->ewma, 1, 64);

	return 0;
}

static int fifo_ewma_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct tc_fifo_ewma_qopt opt = {
		.limit = q->limit,
		.timeout = q->timeout
	};

	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
	return skb->len;

nla_put_failure:
	return -1;
}

static struct Qdisc_ops pfifo_ewma_qdisc_ops __read_mostly = {
	.id		=	"pfifo_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	pfifo_ewma_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops bfifo_ewma_qdisc_ops __read_mostly = {
	.id		=	"bfifo_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	bfifo_ewma_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops pfifo_head_drop_ewma_qdisc_ops __read_mostly = {
	.id		=	"pfifo_hd_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	pfifo_ewma_tail_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop_head,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static int __init fifo_ewma_module_init(void)
{
	int retval;

	retval = register_qdisc(&pfifo_ewma_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&bfifo_ewma_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
	if (retval)
		goto cleanup;

	return 0;

cleanup:
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
	return retval;
}
static void __exit fifo_ewma_module_exit(void)
{
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
}

module_init(fifo_ewma_module_init)
module_exit(fifo_ewma_module_exit)
MODULE_LICENSE("GPL");

#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.c"
#endif


^ permalink raw reply

* dccp test-tree [RFC] [Patch 1/1] dccp: Only activate NN values after receiving the Confirm option
From: Gerrit Renker @ 2011-02-28 11:25 UTC (permalink / raw)
  To: Samuel Jero; +Cc: dccp, netdev

I am sending this as an RFC since I have not yet tested it thoroughly. It makes
the exchange of NN options in established state conform to RFC 4340, 6.6.1,
and is thus actually a bug fix.

>>>>>>>>>>>>>>>>>>>>>>>>> Patch <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
dccp: Only activate NN values after receiving the Confirm option

This defers changing local values using exchange of NN options in established
connection state by only activating the value after receiving the Confirm
option, as mandated by RFC 4340, 6.6.1.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/ccids/ccid2.c |    7 ++-----
 net/dccp/feat.c        |   13 ++++---------
 2 files changed, 6 insertions(+), 14 deletions(-)

--- a/net/dccp/feat.c
+++ b/net/dccp/feat.c
@@ -775,12 +775,7 @@ int dccp_feat_register_sp(struct sock *s
  * @sk: DCCP socket of an established connection
  * @feat: NN feature number from %dccp_feature_numbers
  * @nn_val: the new value to use
- * This function is used to communicate NN updates out-of-band. The difference
- * to feature negotiation during connection setup is that values are activated
- * immediately after validation, i.e. we don't wait for the Confirm: either the
- * value is accepted by the peer (and then the waiting is futile), or it is not
- * (Reset or empty Confirm). We don't accept empty Confirms - transmitted values
- * are validated, and the peer "MUST accept any valid value" (RFC 4340, 6.3.2).
+ * This function is used to communicate NN updates out-of-band.
  */
 int dccp_feat_signal_nn_change(struct sock *sk, u8 feat, u64 nn_val)
 {
@@ -805,9 +800,6 @@ int dccp_feat_signal_nn_change(struct so
 		dccp_feat_list_pop(entry);
 	}
 
-	if (dccp_feat_activate(sk, feat, 1, &fval))
-		return -EADV;
-
 	inet_csk_schedule_ack(sk);
 	return dccp_feat_push_change(fn, feat, 1, 0, &fval);
 }
@@ -1356,6 +1348,9 @@ static u8 dccp_feat_handle_nn_establishe
 		if (fval.nn != entry->val.nn)
 			return 0;
 
+		/* Only activate after receiving the Confirm option (6.6.1). */
+		dccp_feat_activate(sk, feat, local, &fval);
+
 		/* It has been confirmed - so remove the entry */
 		dccp_feat_list_pop(entry);
 
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -105,7 +105,6 @@ static void ccid2_change_l_ack_ratio(str
 		return;
 
 	ccid2_pr_debug("changing local ack ratio to %u\n", val);
-	dp->dccps_l_ack_ratio = val;
 	dccp_feat_signal_nn_change(sk, DCCPF_ACK_RATIO, val);
 }
 
@@ -117,11 +116,9 @@ static void ccid2_change_l_seq_window(st
 		val = DCCPF_SEQ_WMIN;
 	if (val > DCCPF_SEQ_WMAX)
 		val = DCCPF_SEQ_WMAX;
-	if (val == dp->dccps_l_seq_win)
-		return;
 
-	dp->dccps_l_seq_win = val;
-	dccp_feat_signal_nn_change(sk, DCCPF_SEQUENCE_WINDOW, val);
+	if (val != dp->dccps_l_seq_win)
+		dccp_feat_signal_nn_change(sk, DCCPF_SEQUENCE_WINDOW, val);
 }
 
 static void ccid2_hc_tx_rto_expire(unsigned long data)

-- 

^ permalink raw reply

* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-28 11:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> I'm working on this right now.

OK I think I was definitely on the right track.  With the send
patch made lockless I now get numbers which are even better than
those obtained with running named with multiple sockets.  That's
right, a single socket is now faster than what multiple sockets
were without the patch (of course, multiple sockets may still be
faster with the patch vs. a single socket for obvious reasons,
but I couldn't measure any significant difference).

Also worthy of note is that prior to the patch all CPUs showed
idleness (lazy bastards!), with the patch they're all maxed out.

In retrospect, the idleness was simply the result of the socket
lock scheduling away and was an indication of lock contention.

Here are the patches I used.  Please don't apply them yet as I intend
to clean them up quite a bit.

But please do test them heavily, especially if you have an AMD
NUMA machine as that's where scalability problems really show
up.  Intel tends to be a lot more forgiving.  My last AMD machine
blew up years ago :)

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Bug inkvm_set_irq
From: Michael S. Tsirkin @ 2011-02-28 11:39 UTC (permalink / raw)
  To: Jean-Philippe Menil; +Cc: kvm, netdev, virtualization
In-Reply-To: <4D6B7BAB.9070907@univ-nantes.fr>

On Mon, Feb 28, 2011 at 11:40:43AM +0100, Jean-Philippe Menil wrote:
> Le 28/02/2011 11:11, Michael S. Tsirkin a écrit :
> >On Mon, Feb 28, 2011 at 09:56:46AM +0100, Jean-Philippe Menil wrote:
> >>Le 27/02/2011 18:00, Michael S. Tsirkin a écrit :
> >>>On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> >>>>Hi,
> >>>>
> >>>>Each time I try to use vhost_net, I'm facing a kernel bug.
> >>>>I do a "modprobe vhost_net", and start the guest with vhost=on.
> >>>>
> >>>>Following is a trace with a kernel 2.6.37, but I had the same
> >>>>problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
> >>>2.6.36 had a theoretical race that could explain this,
> >>>but it should be ok in 2.6.37.
> >>>
> >>>>The bug only occurs with vhost_net loaded, so I don't know if this
> >>>>is a bug in kvm module code or in the vhost_net code.
> >>>It could be a bug in eventfd which is the interface
> >>>used by both kvm and vhost_net.
> >>>Just for fun, you can try 2.6.38 - eventfd code has been changed
> >>>a lot in 2.6.38 and if it does not trigger there
> >>>it's a hint that irqfd is the reason.
> >>>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243100] BUG: unable to handle kernel paging request at
> >>>>0000000000002458
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>>Could you run markup_oops/ ksymoops on this please?
> >>>As far as I can see kvm_set_irq can only get a wrong
> >>>kvm pointer. Unless there's some general memory corruption,
> >>>I'd guess
> >>>
> >>>You can also try comparing the irqfd->kvm pointer in
> >>>kvm_irqfd_assign irqfd_wakeup and kvm_set_irq in
> >>>virt/kvm/eventfd.c.
> >>>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243556] Oops: 0000 [#1] SMP
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243692] last sysfs file:
> >>>>/sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.243777] CPU 0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.243820] Modules linked in: vhost_net macvtap macvlan tun
> >>>>powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> >>>>cpufreq_ondemand fre
> >>>>q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
> >>>>ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
> >>>>xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> >>>>nf_conntrack_ftp nf_connt
> >>>>rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
> >>>>dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> >>>>nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> >>>>snd_page_alloc tpm_tis tpm ps
> >>>>mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
> >>>>serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
> >>>>xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
> >>>>sd_mod enclosu
> >>>>re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
> >>>>scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> >>>>2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RIP: 0010:[<ffffffffa041aa8a>]  [<ffffffffa041aa8a>]
> >>>>kvm_set_irq+0x2a/0x130 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RSP: 0018:ffff88045fc89d30  EFLAGS: 00010246
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
> >>>>0000000000000001
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
> >>>>ffff880856a91e48
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] FS:  00007f617986c710(0000) GS:ffff88007f800000(0000)
> >>>>knlGS:0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
> >>>>00000000000006f0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>>0000000000000000
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >>>>0000000000000400
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Process kworker/0:1 (pid: 10, threadinfo
> >>>>ffff88045fc88000, task ffff88085fc53c30)
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123] Stack:
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
> >>>>ffff88085fc53ee8
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
> >>>>00000000000119c0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  00000000000119c0 ffffffff8137f7ce ffff88007f80df40
> >>>>00000000ffffffff
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Call Trace:
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106be25>] ? worker_thread+0x145/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106f786>] ? kthread+0x96/0xa0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
> >>>>fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
> >>>>00 00 00<4
> >>>>9>   8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] RIP  [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123]  RSP<ffff88045fc89d30>
> >>>>Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> >>>>685.246123] CR2: 0000000000002458
> >>>>
> >>>>
> >>>>If someone can help me, on how to solve this.
> >>>>
> >>>>Regards.
> >>>>_______________________________________________
> >>>>Virtualization mailing list
> >>>>Virtualization@lists.linux-foundation.org
> >>>>https://lists.linux-foundation.org/mailman/listinfo/virtualization
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe netdev" in
> >>>the body of a message to majordomo@vger.kernel.org
> >>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Hi,
> >>
> >>thanks for your response.
> >>
> >>This is what markup_oops.pl return me:
> >>"No matching code found"
> >Well, let's try to understand what's there.
> >
> >Do objdump -ldS kvm.ko
> >look for <kvm_set_irq>
> >
> >and paste the content from start of that function
> >to offset 0x2a and a bit beyond.
> >
> >You can also upload your kvm.ko somewhere, I'll try to take a look.
> >
> >
> >>So this is not a vhost_net bug, or my oops is incomplete and
> >>markup_oops can't find the good vma offset.
> >>
> >>I will try to compare the pointers you indicate me, even it could be
> >>a little difficult for me.
> >Hmm you know how to add printk to code and rebuild, right?
> >
> >>Maybe i will try a 2.6.38, will wait a response from the kvm team.
> >>
> >>Regards.
> >>
> >>-- 
> >>Jean-Philippe Menil - Pôle réseau Service IRTS
> >>DSI Université de Nantes
> >>jean-philippe.menil@univ-nantes.fr
> >>Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09
> So, here is the result for the objdump against the kvm.ko (the
> kvm_set_irq part) :

Can you try building with -g and adding -l and -S to objdump
please? I'd rather make the tool do the legwork than
do it manually.

> 
> 0000000000006a60 <kvm_set_irq>:
> kvm_set_irq():
>     6a60:       41 57                   push   %r15
>     6a62:       41 89 f7                mov    %esi,%r15d
>     6a65:       41 56                   push   %r14
>     6a67:       41 55                   push   %r13
>     6a69:       41 89 cd                mov    %ecx,%r13d
>     6a6c:       41 54                   push   %r12
>     6a6e:       49 89 fc                mov    %rdi,%r12
>     6a71:       55                      push   %rbp
>     6a72:       53                      push   %rbx
>     6a73:       89 d3                   mov    %edx,%ebx
>     6a75:       48 81 ec 98 00 00 00    sub    $0x98,%rsp
>     6a7c:       8b 15 00 00 00 00       mov    0x0(%rip),%edx
> # 6a82 <kvm_set_irq+0x22>
>     6a82:       85 d2                   test   %edx,%edx
>     6a84:       0f 85 c4 00 00 00       jne    6b4e <kvm_set_irq+0xee>
>     6a8a:       49 8b 84 24 58 24 00    mov    0x2458(%r12),%rax

OK, 0x6a8a is the offset.
After you build with -g, try

addr2line -e kvm.ko 0x6a8a

and see which line this points to.
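
For reference, the numbers in the oops and in the disassembly can be cross-checked by hand. The sketch below uses two hypothetical helpers (names made up for illustration) to show the arithmetic: kvm_set_irq starts at 0x6a60 in the objdump output, so kvm_set_irq+0x2a is 0x6a8a, and the instruction there reads from 0x2458(%r12). With CR2 = 0x2458 and R12 = 0 in the register dump, the pointer held in %r12 (copied from the first argument) looks like it was NULL.

```c
#include <assert.h>
#include <stdint.h>

/* Offset in the object file of "symbol+off" as printed in an oops. */
static uint64_t oops_file_offset(uint64_t sym_start, uint64_t off)
{
	return sym_start + off;
}

/*
 * Base register value implied by a fault on an access of the form
 * disp(%reg): the faulting address minus the displacement.
 */
static uint64_t implied_base(uint64_t fault_addr, uint64_t disp)
{
	assert(fault_addr >= disp);
	return fault_addr - disp;
}
```

Here oops_file_offset(0x6a60, 0x2a) gives 0x6a8a, and implied_base(0x2458, 0x2458) gives 0, consistent with the R12 value in the register dump.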


>     6a91:       00
>     6a92:       3b 98 28 01 00 00       cmp    0x128(%rax),%ebx
>     6a98:       73 5e                   jae    6af8 <kvm_set_irq+0x98>
>     6a9a:       89 db                   mov    %ebx,%ebx
>     6a9c:       48 8b 84 d8 30 01 00    mov    0x130(%rax,%rbx,8),%rax
>     6aa3:       00
>     6aa4:       48 85 c0                test   %rax,%rax
>     6aa7:       74 4f                   je     6af8 <kvm_set_irq+0x98>
>     6aa9:       48 89 e2                mov    %rsp,%rdx
>     6aac:       31 db                   xor    %ebx,%ebx
>     6aae:       48 8b 08                mov    (%rax),%rcx
>     6ab1:       83 c3 01                add    $0x1,%ebx
>     6ab4:       0f 18 09                prefetcht0 (%rcx)
>     6ab7:       48 8b 48 e0             mov    -0x20(%rax),%rcx
>     6abb:       48 89 0a                mov    %rcx,(%rdx)
>     6abe:       48 8b 48 e8             mov    -0x18(%rax),%rcx
>     6ac2:       48 89 4a 08             mov    %rcx,0x8(%rdx)
>     6ac6:       48 8b 48 f0             mov    -0x10(%rax),%rcx
>     6aca:       48 89 4a 10             mov    %rcx,0x10(%rdx)
>     6ace:       48 8b 48 f8             mov    -0x8(%rax),%rcx
>     6ad2:       48 89 4a 18             mov    %rcx,0x18(%rdx)
>     6ad6:       48 8b 08                mov    (%rax),%rcx
>     6ad9:       48 89 4a 20             mov    %rcx,0x20(%rdx)
>     6add:       48 8b 48 08             mov    0x8(%rax),%rcx
>     6ae1:       48 89 4a 28             mov    %rcx,0x28(%rdx)
>     6ae5:       48 8b 00                mov    (%rax),%rax
>     6ae8:       48 83 c2 30             add    $0x30,%rdx
>     6aec:       48 85 c0                test   %rax,%rax
>     6aef:       75 bd                   jne    6aae <kvm_set_irq+0x4e>
>     6af1:       eb 07                   jmp    6afa <kvm_set_irq+0x9a>
>     6af3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>     6af8:       31 db                   xor    %ebx,%ebx
>     6afa:       bd ff ff ff ff          mov    $0xffffffff,%ebp
>     6aff:       49 89 e6                mov    %rsp,%r14
>     6b02:       85 db                   test   %ebx,%ebx
>     6b04:       74 34                   je     6b3a <kvm_set_irq+0xda>
>     6b06:       83 eb 01                sub    $0x1,%ebx
>     6b09:       44 89 e9                mov    %r13d,%ecx
>     6b0c:       44 89 fa                mov    %r15d,%edx
>     6b0f:       48 63 c3                movslq %ebx,%rax
>     6b12:       4c 89 e6                mov    %r12,%rsi
>     6b15:       48 8d 04 40             lea    (%rax,%rax,2),%rax
>     6b19:       48 c1 e0 04             shl    $0x4,%rax
>     6b1d:       49 8d 3c 06             lea    (%r14,%rax,1),%rdi
>     6b21:       ff 54 04 08             callq  *0x8(%rsp,%rax,1)
>     6b25:       85 c0                   test   %eax,%eax
>     6b27:       78 d9                   js     6b02 <kvm_set_irq+0xa2>
>     6b29:       85 ed                   test   %ebp,%ebp
>     6b2b:       ba 00 00 00 00          mov    $0x0,%edx
>     6b30:       0f 48 ea                cmovs  %edx,%ebp
>     6b33:       85 db                   test   %ebx,%ebx
>     6b35:       8d 2c 28                lea    (%rax,%rbp,1),%ebp
>     6b38:       75 cc                   jne    6b06 <kvm_set_irq+0xa6>
>     6b3a:       48 81 c4 98 00 00 00    add    $0x98,%rsp
>     6b41:       89 e8                   mov    %ebp,%eax
>     6b43:       5b                      pop    %rbx
>     6b44:       5d                      pop    %rbp
>     6b45:       41 5c                   pop    %r12
>     6b47:       41 5d                   pop    %r13
>     6b49:       41 5e                   pop    %r14
>     6b4b:       41 5f                   pop    %r15
>     6b4d:       c3                      retq
>     6b4e:       48 8b 2d 00 00 00 00    mov    0x0(%rip),%rbp
> # 6b55 <kvm_set_irq+0xf5>
>     6b55:       48 85 ed                test   %rbp,%rbp
>     6b58:       0f 84 2c ff ff ff       je     6a8a <kvm_set_irq+0x2a>
>     6b5e:       48 8b 45 00             mov    0x0(%rbp),%rax
>     6b62:       48 8b 7d 08             mov    0x8(%rbp),%rdi
>     6b66:       48 83 c5 10             add    $0x10,%rbp
>     6b6a:       44 89 f9                mov    %r15d,%ecx
>     6b6d:       44 89 ea                mov    %r13d,%edx
>     6b70:       89 de                   mov    %ebx,%esi
>     6b72:       ff d0                   callq  *%rax
>     6b74:       48 8b 45 00             mov    0x0(%rbp),%rax
>     6b78:       48 85 c0                test   %rax,%rax
>     6b7b:       75 e5                   jne    6b62 <kvm_set_irq+0x102>
>     6b7d:       e9 08 ff ff ff          jmpq   6a8a <kvm_set_irq+0x2a>
>     6b82:       66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
>     6b89:       1f 84 00 00 00 00 00
> 
> I admit that this analysis is too complicated for me.
> I, effectively, can rebuild a kernel with more printk, and program a reboot.
> 
> The kvm.ko is available through the following address:
> http://filex.univ-nantes.fr/get?k=k1jKhQghdcHLz12Z50H
> 
> Regards.

This has no debug data. Can you rebuild with -g please?

BTW if you want to rerun and get a more reliable backtrace,
try enabling frame pointers (do you know how to?). But this will change the
code, so the backtrace will no longer be valid and we will need
a new one.

> -- 
> Jean-Philippe Menil - Pôle réseau Service IRTS
> DSI Université de Nantes
> jean-philippe.menil@univ-nantes.fr
> Tel : 02.53.48.49.27 - Fax : 02.53.48.49.09

^ permalink raw reply

* [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

inet: Add ip_make_skb and ip_send_skb

This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced.  The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used, and should
have a one-to-one correspondence to sendmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/ip.h     |    8 ++++++
 net/ipv4/ip_output.c |   65 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 67fac78..a96e525 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,8 +116,16 @@ extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
+extern int		ip_send_skb(struct sk_buff *skb);
 extern int		ip_push_pending_frames(struct sock *sk);
 extern void		ip_flush_pending_frames(struct sock *sk);
+extern struct sk_buff  *ip_make_skb(struct sock *sk,
+				    int getfrag(void *from, char *to, int offset, int len,
+						int odd, struct sk_buff *skb),
+				    void *from, int length, int transhdrlen,
+				    struct ipcm_cookie *ipc,
+				    struct rtable **rtp,
+				    unsigned int flags);
 
 /* datagram.c */
 extern int		ip4_datagram_connect(struct sock *sk, 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1dd5ecc..dba14c6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-static int __ip_push_pending_frames(struct sock *sk,
-				    struct sk_buff_head *queue,
-				    struct inet_cork *cork)
+static struct sk_buff *__ip_make_skb(struct sock *sk,
+				     struct sk_buff_head *queue,
+				     struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk,
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
-	int err = 0;
 
 	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
@@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk,
 		icmp_out_count(net, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
-	/* Netfilter gets whole the not fragmented skb. */
+	ip_cork_release(cork);
+out:
+	return skb;
+}
+
+int ip_send_skb(struct sk_buff *skb)
+{
+	struct net *net = sock_net(skb->sk);
+	int err;
+
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
-out:
-	ip_cork_release(cork);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 int ip_push_pending_frames(struct sock *sk)
 {
-	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
-					&inet_sk(sk)->cork);
+	struct sk_buff *skb;
+
+	skb = __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
+	if (!skb)
+		return 0;
+
+	/* Netfilter gets whole the not fragmented skb. */
+	return ip_send_skb(skb);
 }
 
 /*
@@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk)
 	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
+struct sk_buff *ip_make_skb(struct sock *sk,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    struct ipcm_cookie *ipc, struct rtable **rtp,
+			    unsigned int flags)
+{
+	struct inet_cork cork = {};
+	struct sk_buff_head queue;
+	int err;
+
+	if (flags & MSG_PROBE)
+		return NULL;
+
+	__skb_queue_head_init(&queue);
+
+	err = ip_setup_cork(sk, &cork, ipc, rtp);
+	if (err)
+		return ERR_PTR(err);
+
+	err = __ip_append_data(sk, &queue, &cork, getfrag,
+			       from, length, transhdrlen, flags);
+	if (err) {
+		__ip_flush_pending_frames(sk, &queue, &cork);
+		return ERR_PTR(err);
+	}
+
+	return __ip_make_skb(sk, &queue, &cork);
+}
 
 /*
  *	Fetch data from kernel space and fill in checksum if needed.

^ permalink raw reply related

* [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

net: Remove unused sk_sndmsg_* from UFO

UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless.  It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/core/skbuff.c     |    3 ---
 net/ipv4/ip_output.c  |    1 -
 net/ipv6/ip6_output.c |    1 -
 3 files changed, 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d883dcc..97011a7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -ENOMEM;
 
 		/* initialize the next frag */
-		sk->sk_sndmsg_page = page;
-		sk->sk_sndmsg_off = 0;
 		skb_fill_page_desc(skb, frg_cnt, page, 0, 0);
 		skb->truesize += PAGE_SIZE;
 		atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc);
@@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -EFAULT;
 
 		/* copy was successful so update the size parameters */
-		sk->sk_sndmsg_off += copy;
 		frag->size += copy;
 		skb->len += copy;
 		skb->data_len += copy;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04c7b3b..d3a4540 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5f8d242..9965182 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 	}
 
 	err = skb_append_datato_frags(sk,skb, getfrag, from,

^ permalink raw reply related

* [PATCH 4/5] udp: Add lockless transmit path
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

udp: Add lockless transmit path

The UDP transmit path has been running under the socket lock
for a long time because of the corking feature.  This means that
transmitting to the same socket in multiple threads does not
scale at all.

However, as most users don't actually use corking, the locking
can be removed in the common case.

This patch creates a lockless fast path where corking is not used.

Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits.  In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.

As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/udp.h     |   11 +++++
 include/net/udplite.h |   12 +++++
 net/ipv4/udp.c        |  104 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..b8563ba 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udp_csum(struct sk_buff *skb)
+{
+	__wsum csum = csum_partial(skb_transport_header(skb),
+				   sizeof(struct udphdr), skb->csum);
+
+	for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) {
+		csum = csum_add(csum, skb->csum);
+	}
+	return csum;
+}
+
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
 static inline void udp_lib_hash(struct sock *sk)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..673a024 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udplite_csum(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb));
+	const int off = skb_transport_offset(skb);
+	const int len = skb->len - off;
+
+	skb->ip_summed = CHECKSUM_NONE;     /* no HW support for checksumming */
+
+	return skb_checksum(skb, off, min(cscov, len), 0);
+}
+
 extern void	udplite4_register(void);
 extern int 	udplite_get_port(struct sock *sk, unsigned short snum,
 			int (*scmp)(const struct sock *, const struct sock *));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8157b17..7fd3664 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -769,6 +769,95 @@ out:
 	return err;
 }
 
+static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst)
+{
+	struct udphdr *uh = udp_hdr(skb);
+	struct sk_buff *frags = skb_shinfo(skb)->frag_list;
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	int hlen = len;
+	__wsum csum = 0;
+
+	if (!frags) {
+		/*
+		 * Only one fragment on the socket.
+		 */
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check = ~csum_tcpudp_magic(src, dst, len,
+					       IPPROTO_UDP, 0);
+	} else {
+		/*
+		 * HW-checksum won't work as there are two or more
+		 * fragments on the socket so that all csums of sk_buffs
+		 * should be together
+		 */
+		do {
+			csum = csum_add(csum, frags->csum);
+			hlen -= frags->len;
+		} while ((frags = frags->next));
+
+		csum = skb_checksum(skb, offset, hlen, csum);
+		skb->ip_summed = CHECKSUM_NONE;
+
+		uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	}
+}
+
+static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport)
+{
+	struct sock *sk = skb->sk;
+	struct inet_sock *inet = inet_sk(sk);
+	struct udphdr *uh;
+	struct rtable *rt = (struct rtable *)skb_dst(skb);
+	int err = 0;
+	int is_udplite = IS_UDPLITE(sk);
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	__wsum csum = 0;
+
+	/*
+	 * Create a UDP header
+	 */
+	uh = udp_hdr(skb);
+	uh->source = inet->inet_sport;
+	uh->dest = dport;
+	uh->len = htons(len);
+	uh->check = 0;
+
+	if (is_udplite)
+		csum = udplite_csum(skb);
+	else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {
+		skb->ip_summed = CHECKSUM_NONE;
+		goto send;
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		udp4_hwcsum(skb, rt->rt_src, daddr);
+		goto send;
+	} else
+		csum = udp_csum(skb);
+
+	/* add protocol-dependent pseudo-header */
+	uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len,
+				      sk->sk_protocol, csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+
+send:
+	err = ip_send_skb(skb);
+	if (err) {
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP_INC_STATS_USER(sock_net(sk),
+				   UDP_MIB_OUTDATAGRAMS, is_udplite);
+	return err;
+}
+
 int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len)
 {
@@ -785,6 +874,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int err, is_udplite = IS_UDPLITE(sk);
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sk_buff *skb;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -799,6 +889,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
+	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
+
 	if (up->pending) {
 		/*
 		 * There are pending frames.
@@ -923,6 +1015,17 @@ back_from_confirm:
 	if (!ipc.addr)
 		daddr = ipc.addr = rt->rt_dst;
 
+	/* Lockless fast path for the non-corking case. */
+	if (!corkreq) {
+		skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen,
+				  sizeof(struct udphdr), &ipc, &rt,
+				  msg->msg_flags);
+		err = PTR_ERR(skb);
+		if (skb && !IS_ERR(skb))
+			err = udp_send_skb(skb, daddr, dport);
+		goto out;
+	}
+
 	lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
@@ -944,7 +1047,6 @@ back_from_confirm:
 
 do_append_data:
 	up->len += ulen;
-	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
 			sizeof(struct udphdr), &ipc, &rt,
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);

^ permalink raw reply related

* [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

net: Remove explicit write references to sk/inet in ip_append_data

In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than those that are designed to allow that such as atomic counters).

This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/inet_sock.h |   23 ++--
 net/ipv4/ip_output.c    |  238 ++++++++++++++++++++++++++++--------------------
 2 files changed, 154 insertions(+), 107 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 8181498..b3de102 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
 	return (struct inet_request_sock *)sk;
 }
 
+struct inet_cork {
+	unsigned int		flags;
+	unsigned int		fragsize;
+	struct ip_options	*opt;
+	struct dst_entry	*dst;
+	int			length; /* Total length of all frames */
+	__be32			addr;
+	struct flowi		fl;
+	struct page		*page;
+	u32			off;
+	u8			tx_flags;
+};
+
 struct ip_mc_socklist;
 struct ipv6_pinfo;
 struct rtable;
@@ -143,15 +156,7 @@ struct inet_sock {
 	int			mc_index;
 	__be32			mc_addr;
 	struct ip_mc_socklist __rcu	*mc_list;
-	struct {
-		unsigned int		flags;
-		unsigned int		fragsize;
-		struct ip_options	*opt;
-		struct dst_entry	*dst;
-		int			length; /* Total length of all frames */
-		__be32			addr;
-		struct flowi		fl;
-	} cork;
+	struct inet_cork	cork;
 };
 
 #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d3a4540..1dd5ecc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy)
 }
 
 static inline int ip_ufo_append_data(struct sock *sk,
+			struct sk_buff_head *queue,
 			int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
@@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk,
 	 * device, so create one single skb packet containing complete
 	 * udp datagram
 	 */
-	if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) {
+	if ((skb = skb_peek_tail(queue)) == NULL) {
 		skb = sock_alloc_send_skb(sk,
 			hh_len + fragheaderlen + transhdrlen + 20,
 			(flags & MSG_DONTWAIT), &err);
@@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk,
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
 		skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-		__skb_queue_tail(&sk->sk_write_queue, skb);
+		__skb_queue_tail(queue, skb);
 	}
 
 	return skb_append_datato_frags(sk, skb, getfrag, from,
 				       (length - transhdrlen));
 }
 
-/*
- *	ip_append_data() and ip_append_page() can make one large IP datagram
- *	from many pieces of data. Each pieces will be holded on the socket
- *	until ip_push_pending_frames() is called. Each piece can be a page
- *	or non-page data.
- *
- *	Not only UDP, other transport protocols - e.g. raw sockets - can use
- *	this interface potentially.
- *
- *	LATER: length must be adjusted by pad at tail, when it is required.
- */
-int ip_append_data(struct sock *sk,
-		   int getfrag(void *from, char *to, int offset, int len,
-			       int odd, struct sk_buff *skb),
-		   void *from, int length, int transhdrlen,
-		   struct ipcm_cookie *ipc, struct rtable **rtp,
-		   unsigned int flags)
+static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
+			    struct inet_cork *cork,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
 
-	struct ip_options *opt = NULL;
+	struct ip_options *opt = inet->cork.opt;
 	int hh_len;
 	int exthdrlen;
 	int mtu;
@@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk,
 	int offset = 0;
 	unsigned int maxfraglen, fragheaderlen;
 	int csummode = CHECKSUM_NONE;
-	struct rtable *rt;
-
-	if (flags&MSG_PROBE)
-		return 0;
+	struct rtable *rt = (struct rtable *)cork->dst;
 
-	if (skb_queue_empty(&sk->sk_write_queue)) {
-		/*
-		 * setup for corking.
-		 */
-		opt = ipc->opt;
-		if (opt) {
-			if (inet->cork.opt == NULL) {
-				inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation);
-				if (unlikely(inet->cork.opt == NULL))
-					return -ENOBUFS;
-			}
-			memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen);
-			inet->cork.flags |= IPCORK_OPT;
-			inet->cork.addr = ipc->addr;
-		}
-		rt = *rtp;
-		if (unlikely(!rt))
-			return -EFAULT;
-		/*
-		 * We steal reference to this route, caller should not release it
-		 */
-		*rtp = NULL;
-		inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
-					    rt->dst.dev->mtu :
-					    dst_mtu(rt->dst.path);
-		inet->cork.dst = &rt->dst;
-		inet->cork.length = 0;
-		sk->sk_sndmsg_page = NULL;
-		sk->sk_sndmsg_off = 0;
-		exthdrlen = rt->dst.header_len;
-		length += exthdrlen;
-		transhdrlen += exthdrlen;
-	} else {
-		rt = (struct rtable *)inet->cork.dst;
-		if (inet->cork.flags & IPCORK_OPT)
-			opt = inet->cork.opt;
+	exthdrlen = transhdrlen ? rt->dst.header_len : 0;
+	length += exthdrlen;
+	transhdrlen += exthdrlen;
+	mtu = inet->cork.fragsize;
 
-		transhdrlen = 0;
-		exthdrlen = 0;
-		mtu = inet->cork.fragsize;
-	}
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
 
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (cork->length + length > 0xFFFF - fragheaderlen) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
 			       mtu-exthdrlen);
 		return -EMSGSIZE;
@@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
-	skb = skb_peek_tail(&sk->sk_write_queue);
+	skb = skb_peek_tail(queue);
 
-	inet->cork.length += length;
+	cork->length += length;
 	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
 	    (sk->sk_protocol == IPPROTO_UDP) &&
 	    (rt->dst.dev->features & NETIF_F_UFO)) {
-		err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
-					 fragheaderlen, transhdrlen, mtu,
-					 flags);
+		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
+					 hh_len, fragheaderlen, transhdrlen,
+					 mtu, flags);
 		if (err)
 			goto error;
 		return 0;
@@ -960,7 +911,7 @@ alloc_new_skb:
 				else
 					/* only the initial fragment is
 					   time stamped */
-					ipc->tx_flags = 0;
+					cork->tx_flags = 0;
 			}
 			if (skb == NULL)
 				goto error;
@@ -971,7 +922,7 @@ alloc_new_skb:
 			skb->ip_summed = csummode;
 			skb->csum = 0;
 			skb_reserve(skb, hh_len);
-			skb_shinfo(skb)->tx_flags = ipc->tx_flags;
+			skb_shinfo(skb)->tx_flags = cork->tx_flags;
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1008,7 +959,7 @@ alloc_new_skb:
 			/*
 			 * Put the packet on the pending queue.
 			 */
-			__skb_queue_tail(&sk->sk_write_queue, skb);
+			__skb_queue_tail(queue, skb);
 			continue;
 		}
 
@@ -1028,8 +979,8 @@ alloc_new_skb:
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1];
-			struct page *page = sk->sk_sndmsg_page;
-			int off = sk->sk_sndmsg_off;
+			struct page *page = cork->page;
+			int off = cork->off;
 			unsigned int left;
 
 			if (page && (left = PAGE_SIZE - off) > 0) {
@@ -1041,7 +992,7 @@ alloc_new_skb:
 						goto error;
 					}
 					get_page(page);
-					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
+					skb_fill_page_desc(skb, i, page, off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
 				}
 			} else if (i < MAX_SKB_FRAGS) {
@@ -1052,8 +1003,8 @@ alloc_new_skb:
 					err = -ENOMEM;
 					goto error;
 				}
-				sk->sk_sndmsg_page = page;
-				sk->sk_sndmsg_off = 0;
+				cork->page = page;
+				cork->off = 0;
 
 				skb_fill_page_desc(skb, i, page, 0, 0);
 				frag = &skb_shinfo(skb)->frags[i];
@@ -1065,7 +1016,7 @@ alloc_new_skb:
 				err = -EFAULT;
 				goto error;
 			}
-			sk->sk_sndmsg_off += copy;
+			cork->off += copy;
 			frag->size += copy;
 			skb->len += copy;
 			skb->data_len += copy;
@@ -1079,11 +1030,87 @@ alloc_new_skb:
 	return 0;
 
 error:
-	inet->cork.length -= length;
+	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
+static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
+			 struct ipcm_cookie *ipc, struct rtable **rtp)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct ip_options *opt;
+	struct rtable *rt;
+
+	/*
+	 * setup for corking.
+	 */
+	opt = ipc->opt;
+	if (opt) {
+		if (cork->opt == NULL) {
+			cork->opt = kmalloc(sizeof(struct ip_options) + 40,
+					    sk->sk_allocation);
+			if (unlikely(cork->opt == NULL))
+				return -ENOBUFS;
+		}
+		memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+		cork->flags |= IPCORK_OPT;
+		cork->addr = ipc->addr;
+	}
+	rt = *rtp;
+	if (unlikely(!rt))
+		return -EFAULT;
+	/*
+	 * We steal reference to this route, caller should not release it
+	 */
+	*rtp = NULL;
+	cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+			 rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+	cork->dst = &rt->dst;
+	cork->length = 0;
+	cork->tx_flags = ipc->tx_flags;
+	cork->page = NULL;
+	cork->off = 0;
+
+	return 0;
+}
+
+/*
+ *	ip_append_data() and ip_append_page() can make one large IP datagram
+ *	from many pieces of data. Each pieces will be holded on the socket
+ *	until ip_push_pending_frames() is called. Each piece can be a page
+ *	or non-page data.
+ *
+ *	Not only UDP, other transport protocols - e.g. raw sockets - can use
+ *	this interface potentially.
+ *
+ *	LATER: length must be adjusted by pad at tail, when it is required.
+ */
+int ip_append_data(struct sock *sk,
+		   int getfrag(void *from, char *to, int offset, int len,
+			       int odd, struct sk_buff *skb),
+		   void *from, int length, int transhdrlen,
+		   struct ipcm_cookie *ipc, struct rtable **rtp,
+		   unsigned int flags)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	int err;
+
+	if (flags&MSG_PROBE)
+		return 0;
+
+	if (skb_queue_empty(&sk->sk_write_queue)) {
+		err = ip_setup_cork(sk, &inet->cork, ipc, rtp);
+		if (err)
+			return err;
+	} else {
+		transhdrlen = 0;
+	}
+
+	return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag,
+				from, length, transhdrlen, flags);
+}
+
 ssize_t	ip_append_page(struct sock *sk, struct page *page,
 		       int offset, size_t size, int flags)
 {
@@ -1227,40 +1254,42 @@ error:
 	return err;
 }
 
-static void ip_cork_release(struct inet_sock *inet)
+static void ip_cork_release(struct inet_cork *cork)
 {
-	inet->cork.flags &= ~IPCORK_OPT;
-	kfree(inet->cork.opt);
-	inet->cork.opt = NULL;
-	dst_release(inet->cork.dst);
-	inet->cork.dst = NULL;
+	cork->flags &= ~IPCORK_OPT;
+	kfree(cork->opt);
+	cork->opt = NULL;
+	dst_release(cork->dst);
+	cork->dst = NULL;
 }
 
 /*
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+static int __ip_push_pending_frames(struct sock *sk,
+				    struct sk_buff_head *queue,
+				    struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
 	struct ip_options *opt = NULL;
-	struct rtable *rt = (struct rtable *)inet->cork.dst;
+	struct rtable *rt = (struct rtable *)cork->dst;
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
 	int err = 0;
 
-	if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL)
+	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
 	tail_skb = &(skb_shinfo(skb)->frag_list);
 
 	/* move skb->data to ip header from ext header */
 	if (skb->data < skb_network_header(skb))
 		__skb_pull(skb, skb_network_offset(skb));
-	while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) {
+	while ((tmp_skb = __skb_dequeue(queue)) != NULL) {
 		__skb_pull(tmp_skb, skb_network_header_len(skb));
 		*tail_skb = tmp_skb;
 		tail_skb = &(tmp_skb->next);
@@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk)
 	     ip_dont_fragment(sk, &rt->dst)))
 		df = htons(IP_DF);
 
-	if (inet->cork.flags & IPCORK_OPT)
-		opt = inet->cork.opt;
+	if (cork->flags & IPCORK_OPT)
+		opt = cork->opt;
 
 	if (rt->rt_type == RTN_MULTICAST)
 		ttl = inet->mc_ttl;
@@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk)
 	iph->ihl = 5;
 	if (opt) {
 		iph->ihl += opt->optlen>>2;
-		ip_options_build(skb, opt, inet->cork.addr, rt, 0);
+		ip_options_build(skb, opt, cork->addr, rt, 0);
 	}
 	iph->tos = inet->tos;
 	iph->frag_off = df;
@@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk)
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 	 * on dst refcount
 	 */
-	inet->cork.dst = NULL;
+	cork->dst = NULL;
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
@@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk)
 	}
 
 out:
-	ip_cork_release(inet);
+	ip_cork_release(cork);
 	return err;
 
 error:
@@ -1340,17 +1369,30 @@ error:
 	goto out;
 }
 
+int ip_push_pending_frames(struct sock *sk)
+{
+	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
+					&inet_sk(sk)->cork);
+}
+
 /*
  *	Throw away all pending data on the socket.
  */
-void ip_flush_pending_frames(struct sock *sk)
+static void __ip_flush_pending_frames(struct sock *sk,
+				      struct sk_buff_head *queue,
+				      struct inet_cork *cork)
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL)
+	while ((skb = __skb_dequeue_tail(queue)) != NULL)
 		kfree_skb(skb);
 
-	ip_cork_release(inet_sk(sk));
+	ip_cork_release(cork);
+}
+
+void ip_flush_pending_frames(struct sock *sk)
+{
+	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
 

^ permalink raw reply related

* Re: [PATCH 4/5] udp: Add lockless transmit path
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf
In-Reply-To: <E1Pu1TJ-0005RA-DV@gondolin.me.apana.org.au>

On Mon, Feb 28, 2011 at 07:41:01PM +0800, Herbert Xu wrote:
> udp: Add lockless transmit path

Doh! There are only 4 patches in the series.  So you didn't
miss anything, yet :)
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-28 11:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <1298837273.8726.128.camel@edumazet-laptop>

Quoting Eric Dumazet <eric.dumazet@gmail.com>:

> Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit :
>> Quoting Albert Cahalan <acahalan@gmail.com>:
>>
>> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet  
>> <eric.dumazet@gmail.com> wrote:
>> >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
>> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >>>
>> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> > ...
>> >> Problem is some machines have slow High Resolution timing services.
>> >>
>> >> _If_ we have a time limit, it will probably use the low resolution (aka
>> >> jiffies), unless high resolution services are cheap.
>> >
>> > As long as that is totally internal to the kernel and never
>> > getting exposed by some API for setting the amount, sure.
>> >
>> >> I was thinking not having an absolute hard limit, but an EWMA based one.
>> >
>> > The whole point is to prevent stale packets, especially to prevent
>> > them from messing with TCP, so I really don't think so. I suppose
>> > you do get this to some extent via early drop.
>>
>> I made simple hack on sch_fifo with per packet time limits
>> (attachment) this weekend and have been doing limited testing on
>> wireless link. I think hardlimit is fine, it's simple and does
>> somewhat same as what packet(-hard)limited buffer does, drops packets
>> when buffer is 'full'. My hack checks for timed out packets on
>> enqueue, might be wrong approach (on other hand might allow some more
>> burstiness).
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.
>
> This is why I suggested using an EWMA plus a probabilist drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Would it be better to implement this as a generic feature instead of a
qdisc-specific one? Have qdisc_enqueue_root do the EWMA check:

static inline int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
	qdisc_skb_cb(skb)->pkt_len = skb->len;
	if (likely(!sch->use_timeout)) {
ewma_ok:
		return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
	}

	status = qdisc_check_ewma_status();
	if (status == ok)
		goto ewma_ok;

	if (status == overlimits)
		...drop...

	if (status == congestion) {
		ret = qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
		return (ret == success) ? NET_XMIT_CN : ret;
	}
}

And add qdisc_dequeue_root:

static inline struct sk_buff *qdisc_dequeue_root(struct Qdisc *sch)
{
	struct sk_buff *skb = sch->dequeue(sch);

	if (skb && unlikely(sch->use_timeout))
		qdisc_update_ewma(skb);

	return skb;
}

Then the user could specify with tc whether any given qdisc uses a
timeout. Maybe even go as far as having some default timeout for the
default qdisc(?)
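
To make the enqueue/dequeue split above concrete, here is a minimal
userspace sketch of the EWMA bookkeeping. Everything in it is an
assumption for illustration: the names ewma_state, ewma_update and
ewma_check, the 1/8 weight, and the "half the timeout" congestion
threshold are not taken from any kernel code.

```c
#include <stdint.h>

/* Hypothetical verdicts, mirroring the ok/congestion/overlimits
 * states in the pseudocode above. */
enum ewma_status { EWMA_OK, EWMA_CONGESTION, EWMA_OVERLIMITS };

/* Hypothetical per-qdisc state; all times are in microseconds. */
struct ewma_state {
	uint32_t avg_us;     /* smoothed queueing delay */
	uint32_t timeout_us; /* configured time limit */
};

/* Dequeue side: fold one measured delay into the average.
 * avg += (sample - avg) / 8, i.e. a weight of 1/8 (arbitrary choice). */
static void ewma_update(struct ewma_state *st, uint32_t sample_us)
{
	int64_t diff = (int64_t)sample_us - (int64_t)st->avg_us;

	st->avg_us = (uint32_t)((int64_t)st->avg_us + diff / 8);
}

/* Enqueue side: compare the average against the limit.
 * The half-way congestion threshold is an assumption. */
static enum ewma_status ewma_check(const struct ewma_state *st)
{
	if (st->avg_us >= st->timeout_us)
		return EWMA_OVERLIMITS;
	if (st->avg_us >= st->timeout_us / 2)
		return EWMA_CONGESTION;
	return EWMA_OK;
}
```

With this shape the caller gets its answer at enqueue time, and the
only work added to dequeue is one subtraction and one shift-sized
division.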

-Jussi

^ permalink raw reply

* [question] fcoe: bonding support
From: Jiri Pirko @ 2011-02-28 12:29 UTC (permalink / raw)
  To: robert.w.love; +Cc: netdev

Hi Robert.

I wonder what's the meaning of the following code in fcoe_interface_setup():

        /* Do not support for bonding device */
        if ((netdev->priv_flags & IFF_MASTER_ALB) ||
            (netdev->priv_flags & IFF_SLAVE_INACTIVE) ||
            (netdev->priv_flags & IFF_MASTER_8023AD)) {
                FCOE_NETDEV_DBG(netdev, "Bonded interfaces not supported\n");
                return -EOPNOTSUPP;
        }

From this I cannot tell whether bonding is not supported at all or
only the alb and 8023ad modes are not supported (leaving aside the
completely bogus check for IFF_SLAVE_INACTIVE).

How about checking IFF_BONDING only (in case bonding should not be
supported at all)?
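
A small userspace sketch of the difference between the two checks.
The bit values below are illustrative only and should be taken from
<linux/if.h> in the tree being built; the two helper names are made up
for this example.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real definitions live in
 * <linux/if.h> of the kernel tree in question. */
#define IFF_SLAVE_INACTIVE 0x04
#define IFF_MASTER_8023AD  0x08
#define IFF_MASTER_ALB     0x10
#define IFF_BONDING        0x20

/* The existing test: rejects only alb/802.3ad masters and
 * inactive slaves. */
static bool rejected_by_mode_check(unsigned int priv_flags)
{
	return (priv_flags & IFF_MASTER_ALB) ||
	       (priv_flags & IFF_SLAVE_INACTIVE) ||
	       (priv_flags & IFF_MASTER_8023AD);
}

/* The suggested test: rejects any device that bonding has marked. */
static bool rejected_by_bonding_check(unsigned int priv_flags)
{
	return (priv_flags & IFF_BONDING) != 0;
}
```

A device carrying IFF_BONDING but none of the mode-specific flags
passes the first check and is caught only by the second.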

Thanks.

Jirka

^ permalink raw reply

* [Patch] bonding: move procfs into bond_procfs.c
From: Amerigo Wang @ 2011-02-28 12:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: WANG Cong, Jay Vosburgh, netdev

bond_main.c is bloated; separate the procfs code out and
move it to bond_procfs.c.

Signed-off-by: WANG Cong <amwang@redhat.com>

---
diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 0e2737e..1d8de09 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_BONDING) += bonding.o
 
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o bond_procfs.o
 
 ipv6-$(subst m,y,$(CONFIG_IPV6)) += bond_ipv6.o
 bonding-objs += $(ipv6-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 584f97b..7abed73 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -65,8 +65,6 @@
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <linux/rtnetlink.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
 #include <linux/smp.h>
 #include <linux/if_ether.h>
 #include <net/arp.h>
@@ -173,9 +171,6 @@ MODULE_PARM_DESC(resend_igmp, "Number of IGMP membership reports to send on link
 atomic_t netpoll_block_tx = ATOMIC_INIT(0);
 #endif
 
-static const char * const version =
-	DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n";
-
 int bond_net_id __read_mostly;
 
 static __be32 arp_target[BOND_MAX_ARP_TARGETS];
@@ -245,7 +240,7 @@ static void bond_uninit(struct net_device *bond_dev);
 
 /*---------------------------- General routines -----------------------------*/
 
-static const char *bond_mode_name(int mode)
+const char *bond_mode_name(int mode)
 {
 	static const char *names[] = {
 		[BOND_MODE_ROUNDROBIN] = "load balancing (round-robin)",
@@ -3292,299 +3287,6 @@ out:
 	read_unlock(&bond->lock);
 }
 
-/*------------------------------ proc/seq_file-------------------------------*/
-
-#ifdef CONFIG_PROC_FS
-
-static void *bond_info_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
-	__acquires(&bond->lock)
-{
-	struct bonding *bond = seq->private;
-	loff_t off = 0;
-	struct slave *slave;
-	int i;
-
-	/* make sure the bond won't be taken away */
-	rcu_read_lock();
-	read_lock(&bond->lock);
-
-	if (*pos == 0)
-		return SEQ_START_TOKEN;
-
-	bond_for_each_slave(bond, slave, i) {
-		if (++off == *pos)
-			return slave;
-	}
-
-	return NULL;
-}
-
-static void *bond_info_seq_next(struct seq_file *seq, void *v, loff_t *pos)
-{
-	struct bonding *bond = seq->private;
-	struct slave *slave = v;
-
-	++*pos;
-	if (v == SEQ_START_TOKEN)
-		return bond->first_slave;
-
-	slave = slave->next;
-
-	return (slave == bond->first_slave) ? NULL : slave;
-}
-
-static void bond_info_seq_stop(struct seq_file *seq, void *v)
-	__releases(&bond->lock)
-	__releases(RCU)
-{
-	struct bonding *bond = seq->private;
-
-	read_unlock(&bond->lock);
-	rcu_read_unlock();
-}
-
-static void bond_info_show_master(struct seq_file *seq)
-{
-	struct bonding *bond = seq->private;
-	struct slave *curr;
-	int i;
-
-	read_lock(&bond->curr_slave_lock);
-	curr = bond->curr_active_slave;
-	read_unlock(&bond->curr_slave_lock);
-
-	seq_printf(seq, "Bonding Mode: %s",
-		   bond_mode_name(bond->params.mode));
-
-	if (bond->params.mode == BOND_MODE_ACTIVEBACKUP &&
-	    bond->params.fail_over_mac)
-		seq_printf(seq, " (fail_over_mac %s)",
-		   fail_over_mac_tbl[bond->params.fail_over_mac].modename);
-
-	seq_printf(seq, "\n");
-
-	if (bond->params.mode == BOND_MODE_XOR ||
-		bond->params.mode == BOND_MODE_8023AD) {
-		seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
-			xmit_hashtype_tbl[bond->params.xmit_policy].modename,
-			bond->params.xmit_policy);
-	}
-
-	if (USES_PRIMARY(bond->params.mode)) {
-		seq_printf(seq, "Primary Slave: %s",
-			   (bond->primary_slave) ?
-			   bond->primary_slave->dev->name : "None");
-		if (bond->primary_slave)
-			seq_printf(seq, " (primary_reselect %s)",
-		   pri_reselect_tbl[bond->params.primary_reselect].modename);
-
-		seq_printf(seq, "\nCurrently Active Slave: %s\n",
-			   (curr) ? curr->dev->name : "None");
-	}
-
-	seq_printf(seq, "MII Status: %s\n", netif_carrier_ok(bond->dev) ?
-		   "up" : "down");
-	seq_printf(seq, "MII Polling Interval (ms): %d\n", bond->params.miimon);
-	seq_printf(seq, "Up Delay (ms): %d\n",
-		   bond->params.updelay * bond->params.miimon);
-	seq_printf(seq, "Down Delay (ms): %d\n",
-		   bond->params.downdelay * bond->params.miimon);
-
-
-	/* ARP information */
-	if (bond->params.arp_interval > 0) {
-		int printed = 0;
-		seq_printf(seq, "ARP Polling Interval (ms): %d\n",
-				bond->params.arp_interval);
-
-		seq_printf(seq, "ARP IP target/s (n.n.n.n form):");
-
-		for (i = 0; (i < BOND_MAX_ARP_TARGETS); i++) {
-			if (!bond->params.arp_targets[i])
-				break;
-			if (printed)
-				seq_printf(seq, ",");
-			seq_printf(seq, " %pI4", &bond->params.arp_targets[i]);
-			printed = 1;
-		}
-		seq_printf(seq, "\n");
-	}
-
-	if (bond->params.mode == BOND_MODE_8023AD) {
-		struct ad_info ad_info;
-
-		seq_puts(seq, "\n802.3ad info\n");
-		seq_printf(seq, "LACP rate: %s\n",
-			   (bond->params.lacp_fast) ? "fast" : "slow");
-		seq_printf(seq, "Aggregator selection policy (ad_select): %s\n",
-			   ad_select_tbl[bond->params.ad_select].modename);
-
-		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
-			seq_printf(seq, "bond %s has no active aggregator\n",
-				   bond->dev->name);
-		} else {
-			seq_printf(seq, "Active Aggregator Info:\n");
-
-			seq_printf(seq, "\tAggregator ID: %d\n",
-				   ad_info.aggregator_id);
-			seq_printf(seq, "\tNumber of ports: %d\n",
-				   ad_info.ports);
-			seq_printf(seq, "\tActor Key: %d\n",
-				   ad_info.actor_key);
-			seq_printf(seq, "\tPartner Key: %d\n",
-				   ad_info.partner_key);
-			seq_printf(seq, "\tPartner Mac Address: %pM\n",
-				   ad_info.partner_system);
-		}
-	}
-}
-
-static void bond_info_show_slave(struct seq_file *seq,
-				 const struct slave *slave)
-{
-	struct bonding *bond = seq->private;
-
-	seq_printf(seq, "\nSlave Interface: %s\n", slave->dev->name);
-	seq_printf(seq, "MII Status: %s\n",
-		   (slave->link == BOND_LINK_UP) ?  "up" : "down");
-	seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
-	seq_printf(seq, "Duplex: %s\n", slave->duplex ? "full" : "half");
-	seq_printf(seq, "Link Failure Count: %u\n",
-		   slave->link_failure_count);
-
-	seq_printf(seq, "Permanent HW addr: %pM\n", slave->perm_hwaddr);
-
-	if (bond->params.mode == BOND_MODE_8023AD) {
-		const struct aggregator *agg
-			= SLAVE_AD_INFO(slave).port.aggregator;
-
-		if (agg)
-			seq_printf(seq, "Aggregator ID: %d\n",
-				   agg->aggregator_identifier);
-		else
-			seq_puts(seq, "Aggregator ID: N/A\n");
-	}
-	seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
-}
-
-static int bond_info_seq_show(struct seq_file *seq, void *v)
-{
-	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "%s\n", version);
-		bond_info_show_master(seq);
-	} else
-		bond_info_show_slave(seq, v);
-
-	return 0;
-}
-
-static const struct seq_operations bond_info_seq_ops = {
-	.start = bond_info_seq_start,
-	.next  = bond_info_seq_next,
-	.stop  = bond_info_seq_stop,
-	.show  = bond_info_seq_show,
-};
-
-static int bond_info_open(struct inode *inode, struct file *file)
-{
-	struct seq_file *seq;
-	struct proc_dir_entry *proc;
-	int res;
-
-	res = seq_open(file, &bond_info_seq_ops);
-	if (!res) {
-		/* recover the pointer buried in proc_dir_entry data */
-		seq = file->private_data;
-		proc = PDE(inode);
-		seq->private = proc->data;
-	}
-
-	return res;
-}
-
-static const struct file_operations bond_info_fops = {
-	.owner   = THIS_MODULE,
-	.open    = bond_info_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.release = seq_release,
-};
-
-static void bond_create_proc_entry(struct bonding *bond)
-{
-	struct net_device *bond_dev = bond->dev;
-	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
-
-	if (bn->proc_dir) {
-		bond->proc_entry = proc_create_data(bond_dev->name,
-						    S_IRUGO, bn->proc_dir,
-						    &bond_info_fops, bond);
-		if (bond->proc_entry == NULL)
-			pr_warning("Warning: Cannot create /proc/net/%s/%s\n",
-				   DRV_NAME, bond_dev->name);
-		else
-			memcpy(bond->proc_file_name, bond_dev->name, IFNAMSIZ);
-	}
-}
-
-static void bond_remove_proc_entry(struct bonding *bond)
-{
-	struct net_device *bond_dev = bond->dev;
-	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
-
-	if (bn->proc_dir && bond->proc_entry) {
-		remove_proc_entry(bond->proc_file_name, bn->proc_dir);
-		memset(bond->proc_file_name, 0, IFNAMSIZ);
-		bond->proc_entry = NULL;
-	}
-}
-
-/* Create the bonding directory under /proc/net, if doesn't exist yet.
- * Caller must hold rtnl_lock.
- */
-static void __net_init bond_create_proc_dir(struct bond_net *bn)
-{
-	if (!bn->proc_dir) {
-		bn->proc_dir = proc_mkdir(DRV_NAME, bn->net->proc_net);
-		if (!bn->proc_dir)
-			pr_warning("Warning: cannot create /proc/net/%s\n",
-				   DRV_NAME);
-	}
-}
-
-/* Destroy the bonding directory under /proc/net, if empty.
- * Caller must hold rtnl_lock.
- */
-static void __net_exit bond_destroy_proc_dir(struct bond_net *bn)
-{
-	if (bn->proc_dir) {
-		remove_proc_entry(DRV_NAME, bn->net->proc_net);
-		bn->proc_dir = NULL;
-	}
-}
-
-#else /* !CONFIG_PROC_FS */
-
-static void bond_create_proc_entry(struct bonding *bond)
-{
-}
-
-static void bond_remove_proc_entry(struct bonding *bond)
-{
-}
-
-static inline void bond_create_proc_dir(struct bond_net *bn)
-{
-}
-
-static inline void bond_destroy_proc_dir(struct bond_net *bn)
-{
-}
-
-#endif /* CONFIG_PROC_FS */
-
-
 /*-------------------------- netdev event handling --------------------------*/
 
 /*
@@ -5388,7 +5090,7 @@ static int __init bonding_init(void)
 	int i;
 	int res;
 
-	pr_info("%s", version);
+	pr_info("%s", bond_version);
 
 	res = bond_check_params(&bonding_defaults);
 	if (res)
diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
new file mode 100644
index 0000000..4db0529
--- /dev/null
+++ b/drivers/net/bonding/bond_procfs.c
@@ -0,0 +1,297 @@
+#include <linux/proc_fs.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include "bonding.h"
+
+#ifdef CONFIG_PROC_FS
+
+extern const char *bond_mode_name(int mode);
+
+static void *bond_info_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(RCU)
+	__acquires(&bond->lock)
+{
+	struct bonding *bond = seq->private;
+	loff_t off = 0;
+	struct slave *slave;
+	int i;
+
+	/* make sure the bond won't be taken away */
+	rcu_read_lock();
+	read_lock(&bond->lock);
+
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+
+	bond_for_each_slave(bond, slave, i) {
+		if (++off == *pos)
+			return slave;
+	}
+
+	return NULL;
+}
+
+static void *bond_info_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bonding *bond = seq->private;
+	struct slave *slave = v;
+
+	++*pos;
+	if (v == SEQ_START_TOKEN)
+		return bond->first_slave;
+
+	slave = slave->next;
+
+	return (slave == bond->first_slave) ? NULL : slave;
+}
+
+static void bond_info_seq_stop(struct seq_file *seq, void *v)
+	__releases(&bond->lock)
+	__releases(RCU)
+{
+	struct bonding *bond = seq->private;
+
+	read_unlock(&bond->lock);
+	rcu_read_unlock();
+}
+
+static void bond_info_show_master(struct seq_file *seq)
+{
+	struct bonding *bond = seq->private;
+	struct slave *curr;
+	int i;
+
+	read_lock(&bond->curr_slave_lock);
+	curr = bond->curr_active_slave;
+	read_unlock(&bond->curr_slave_lock);
+
+	seq_printf(seq, "Bonding Mode: %s",
+		   bond_mode_name(bond->params.mode));
+
+	if (bond->params.mode == BOND_MODE_ACTIVEBACKUP &&
+	    bond->params.fail_over_mac)
+		seq_printf(seq, " (fail_over_mac %s)",
+		   fail_over_mac_tbl[bond->params.fail_over_mac].modename);
+
+	seq_printf(seq, "\n");
+
+	if (bond->params.mode == BOND_MODE_XOR ||
+		bond->params.mode == BOND_MODE_8023AD) {
+		seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
+			xmit_hashtype_tbl[bond->params.xmit_policy].modename,
+			bond->params.xmit_policy);
+	}
+
+	if (USES_PRIMARY(bond->params.mode)) {
+		seq_printf(seq, "Primary Slave: %s",
+			   (bond->primary_slave) ?
+			   bond->primary_slave->dev->name : "None");
+		if (bond->primary_slave)
+			seq_printf(seq, " (primary_reselect %s)",
+		   pri_reselect_tbl[bond->params.primary_reselect].modename);
+
+		seq_printf(seq, "\nCurrently Active Slave: %s\n",
+			   (curr) ? curr->dev->name : "None");
+	}
+
+	seq_printf(seq, "MII Status: %s\n", netif_carrier_ok(bond->dev) ?
+		   "up" : "down");
+	seq_printf(seq, "MII Polling Interval (ms): %d\n", bond->params.miimon);
+	seq_printf(seq, "Up Delay (ms): %d\n",
+		   bond->params.updelay * bond->params.miimon);
+	seq_printf(seq, "Down Delay (ms): %d\n",
+		   bond->params.downdelay * bond->params.miimon);
+
+
+	/* ARP information */
+	if (bond->params.arp_interval > 0) {
+		int printed = 0;
+		seq_printf(seq, "ARP Polling Interval (ms): %d\n",
+				bond->params.arp_interval);
+
+		seq_printf(seq, "ARP IP target/s (n.n.n.n form):");
+
+		for (i = 0; (i < BOND_MAX_ARP_TARGETS); i++) {
+			if (!bond->params.arp_targets[i])
+				break;
+			if (printed)
+				seq_printf(seq, ",");
+			seq_printf(seq, " %pI4", &bond->params.arp_targets[i]);
+			printed = 1;
+		}
+		seq_printf(seq, "\n");
+	}
+
+	if (bond->params.mode == BOND_MODE_8023AD) {
+		struct ad_info ad_info;
+
+		seq_puts(seq, "\n802.3ad info\n");
+		seq_printf(seq, "LACP rate: %s\n",
+			   (bond->params.lacp_fast) ? "fast" : "slow");
+		seq_printf(seq, "Aggregator selection policy (ad_select): %s\n",
+			   ad_select_tbl[bond->params.ad_select].modename);
+
+		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
+			seq_printf(seq, "bond %s has no active aggregator\n",
+				   bond->dev->name);
+		} else {
+			seq_printf(seq, "Active Aggregator Info:\n");
+
+			seq_printf(seq, "\tAggregator ID: %d\n",
+				   ad_info.aggregator_id);
+			seq_printf(seq, "\tNumber of ports: %d\n",
+				   ad_info.ports);
+			seq_printf(seq, "\tActor Key: %d\n",
+				   ad_info.actor_key);
+			seq_printf(seq, "\tPartner Key: %d\n",
+				   ad_info.partner_key);
+			seq_printf(seq, "\tPartner Mac Address: %pM\n",
+				   ad_info.partner_system);
+		}
+	}
+}
+
+static void bond_info_show_slave(struct seq_file *seq,
+				 const struct slave *slave)
+{
+	struct bonding *bond = seq->private;
+
+	seq_printf(seq, "\nSlave Interface: %s\n", slave->dev->name);
+	seq_printf(seq, "MII Status: %s\n",
+		   (slave->link == BOND_LINK_UP) ?  "up" : "down");
+	seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
+	seq_printf(seq, "Duplex: %s\n", slave->duplex ? "full" : "half");
+	seq_printf(seq, "Link Failure Count: %u\n",
+		   slave->link_failure_count);
+
+	seq_printf(seq, "Permanent HW addr: %pM\n", slave->perm_hwaddr);
+
+	if (bond->params.mode == BOND_MODE_8023AD) {
+		const struct aggregator *agg
+			= SLAVE_AD_INFO(slave).port.aggregator;
+
+		if (agg)
+			seq_printf(seq, "Aggregator ID: %d\n",
+				   agg->aggregator_identifier);
+		else
+			seq_puts(seq, "Aggregator ID: N/A\n");
+	}
+	seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
+}
+
+static int bond_info_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN) {
+		seq_printf(seq, "%s\n", bond_version);
+		bond_info_show_master(seq);
+	} else
+		bond_info_show_slave(seq, v);
+
+	return 0;
+}
+
+static const struct seq_operations bond_info_seq_ops = {
+	.start = bond_info_seq_start,
+	.next  = bond_info_seq_next,
+	.stop  = bond_info_seq_stop,
+	.show  = bond_info_seq_show,
+};
+
+static int bond_info_open(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq;
+	struct proc_dir_entry *proc;
+	int res;
+
+	res = seq_open(file, &bond_info_seq_ops);
+	if (!res) {
+		/* recover the pointer buried in proc_dir_entry data */
+		seq = file->private_data;
+		proc = PDE(inode);
+		seq->private = proc->data;
+	}
+
+	return res;
+}
+
+static const struct file_operations bond_info_fops = {
+	.owner   = THIS_MODULE,
+	.open    = bond_info_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = seq_release,
+};
+
+void bond_create_proc_entry(struct bonding *bond)
+{
+	struct net_device *bond_dev = bond->dev;
+	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
+
+	if (bn->proc_dir) {
+		bond->proc_entry = proc_create_data(bond_dev->name,
+						    S_IRUGO, bn->proc_dir,
+						    &bond_info_fops, bond);
+		if (bond->proc_entry == NULL)
+			pr_warning("Warning: Cannot create /proc/net/%s/%s\n",
+				   DRV_NAME, bond_dev->name);
+		else
+			memcpy(bond->proc_file_name, bond_dev->name, IFNAMSIZ);
+	}
+}
+
+void bond_remove_proc_entry(struct bonding *bond)
+{
+	struct net_device *bond_dev = bond->dev;
+	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
+
+	if (bn->proc_dir && bond->proc_entry) {
+		remove_proc_entry(bond->proc_file_name, bn->proc_dir);
+		memset(bond->proc_file_name, 0, IFNAMSIZ);
+		bond->proc_entry = NULL;
+	}
+}
+
+/* Create the bonding directory under /proc/net, if doesn't exist yet.
+ * Caller must hold rtnl_lock.
+ */
+void __net_init bond_create_proc_dir(struct bond_net *bn)
+{
+	if (!bn->proc_dir) {
+		bn->proc_dir = proc_mkdir(DRV_NAME, bn->net->proc_net);
+		if (!bn->proc_dir)
+			pr_warning("Warning: cannot create /proc/net/%s\n",
+				   DRV_NAME);
+	}
+}
+
+/* Destroy the bonding directory under /proc/net, if empty.
+ * Caller must hold rtnl_lock.
+ */
+void __net_exit bond_destroy_proc_dir(struct bond_net *bn)
+{
+	if (bn->proc_dir) {
+		remove_proc_entry(DRV_NAME, bn->net->proc_net);
+		bn->proc_dir = NULL;
+	}
+}
+
+#else /* !CONFIG_PROC_FS */
+
+void bond_create_proc_entry(struct bonding *bond)
+{
+}
+
+void bond_remove_proc_entry(struct bonding *bond)
+{
+}
+
+void bond_create_proc_dir(struct bond_net *bn)
+{
+}
+
+void bond_destroy_proc_dir(struct bond_net *bn)
+{
+}
+
+#endif /* CONFIG_PROC_FS */
+
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index a401b8d..dfe41df 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -29,6 +29,8 @@
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
 
+#define bond_version DRV_DESCRIPTION ": v" DRV_VERSION " (" DRV_RELDATE ")\n"
+
 #define BOND_MAX_ARP_TARGETS	16
 
 #define IS_UP(dev)					   \
@@ -413,6 +415,11 @@ struct bond_net {
 #endif
 };
 
+void bond_create_proc_entry(struct bonding *bond);
+void bond_remove_proc_entry(struct bonding *bond);
+void bond_create_proc_dir(struct bond_net *bn);
+void bond_destroy_proc_dir(struct bond_net *bn);
+
 /* exported from bond_main.c */
 extern int bond_net_id;
 extern const struct bond_parm_tbl bond_lacp_tbl[];

^ permalink raw reply related

* Re: txqueuelen has wrong units; should be time
From: Eric Dumazet @ 2011-02-28 13:10 UTC (permalink / raw)
  To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <20110228134338.1241484mkljbz4w0@hayate.sektori.org>

On Monday, 28 February 2011 at 13:43 +0200, Jussi Kivilinna wrote:
> Quoting Eric Dumazet <eric.dumazet@gmail.com>:
> 
> > On Sunday, 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
> >> Quoting Albert Cahalan <acahalan@gmail.com>:
> >>
> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet  
> >> <eric.dumazet@gmail.com> wrote:
> >> >> On Sunday, 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> >> >>>
> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
> >> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
> >> > ...
> >> >> Problem is some machines have slow High Resolution timing services.
> >> >>
> >> >> _If_ we have a time limit, it will probably use the low resolution (aka
> >> >> jiffies), unless high resolution services are cheap.
> >> >
> >> > As long as that is totally internal to the kernel and never
> >> > getting exposed by some API for setting the amount, sure.
> >> >
> >> >> I was thinking not having an absolute hard limit, but an EWMA based one.
> >> >
> >> > The whole point is to prevent stale packets, especially to prevent
> >> > them from messing with TCP, so I really don't think so. I suppose
> >> > you do get this to some extent via early drop.
> >>
> >> I made simple hack on sch_fifo with per packet time limits
> >> (attachment) this weekend and have been doing limited testing on
> >> wireless link. I think hardlimit is fine, it's simple and does
> >> somewhat same as what packet(-hard)limited buffer does, drops packets
> >> when buffer is 'full'. My hack checks for timed out packets on
> >> enqueue, might be wrong approach (on other hand might allow some more
> >> burstiness).
> >>
> >
> >
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> >
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
> >
> > This is why I suggested using an EWMA plus a probabilist drop or
> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
> >
> > The absolute time limit you are trying to implement should be checked at
> > dequeue time, to cope with enqueue bursts or pauses on wire.
> >
> 
> Would it be better to implement this as generic feature instead of  
> qdisc specific? Have qdisc_enqueue_root do ewma check:

The problem is that you can have several virtual queues in a qdisc.

For example, pfifo_fast has 3 bands. You could have a global EWMA with
high values, but you would still want to let a high-priority packet go
through...
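
A rough userspace illustration of that point, under stated
assumptions: the struct and helper names are hypothetical, and the
"global" view is modelled here simply as the worst per-band average,
which is only one possible reading of a global EWMA.

```c
#include <stdbool.h>
#include <stdint.h>

#define NBANDS 3 /* pfifo_fast has 3 bands, as noted above */

/* Hypothetical state: one smoothed delay per band, in microseconds. */
struct band_ewma {
	uint32_t avg_us[NBANDS];
	uint32_t limit_us;
};

/* Per-band verdict; the caller guarantees band < NBANDS. */
static bool band_over_limit(const struct band_ewma *st, unsigned int band)
{
	return st->avg_us[band] >= st->limit_us;
}

/* Global verdict, modelled as the worst band (an assumption). */
static bool global_over_limit(const struct band_ewma *st)
{
	uint32_t worst = 0;

	for (unsigned int i = 0; i < NBANDS; i++)
		if (st->avg_us[i] > worst)
			worst = st->avg_us[i];

	return worst >= st->limit_us;
}
```

When only the lowest-priority band is backed up, the global verdict
says "over limit" while band 0 on its own would still be accepted.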

^ permalink raw reply

* [PATCH] AF_RXRPC: Handle receiving ACKALL packets
From: David Howells @ 2011-02-28 13:27 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, David Howells

The OpenAFS server is now sending ACKALL packets, so we need to handle them.
Otherwise we report a protocol error and abort.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-input.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/rxrpc/ar-input.c b/net/rxrpc/ar-input.c
index a4fc974..996d3ef 100644
--- a/net/rxrpc/ar-input.c
+++ b/net/rxrpc/ar-input.c
@@ -423,6 +423,7 @@ void rxrpc_fast_process_packet(struct rxrpc_call *call, struct sk_buff *skb)
 			goto protocol_error;
 		}
 
+	case RXRPC_PACKET_TYPE_ACKALL:
 	case RXRPC_PACKET_TYPE_ACK:
 		/* ACK processing is done in process context */
 		read_lock_bh(&call->state_lock);

^ permalink raw reply related

* [PATCH] RxRPC: Fix v1 keys
From: David Howells @ 2011-02-28 13:27 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Anton Blanchard, David Howells

From: Anton Blanchard <anton@au1.ibm.com>

commit 339412841d7 (RxRPC: Allow key payloads to be passed in XDR form)
broke klog for me. I notice the v1 key struct had a kif_version field
added:

-struct rxkad_key {
-       u16     security_index;         /* RxRPC header security index */
-       u16     ticket_len;             /* length of ticket[] */
-       u32     expiry;                 /* time at which expires */
-       u32     kvno;                   /* key version number */
-       u8      session_key[8];         /* DES session key */
-       u8      ticket[0];              /* the encrypted ticket */
-};

+struct rxrpc_key_data_v1 {
+       u32             kif_version;            /* 1 */
+       u16             security_index;
+       u16             ticket_length;
+       u32             expiry;                 /* time_t */
+       u32             kvno;
+       u8              session_key[8];
+       u8              ticket[0];
+};

However the code in rxrpc_instantiate strips it away:

	data += sizeof(kver);
	datalen -= sizeof(kver);

Removing kif_version fixes my problem.
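
The mismatch can be seen with a small userspace sketch. It assumes the
stripped version word is 4 bytes and uses fixed-width stand-ins for
the kernel types; the struct names and the helper are made up for this
illustration, and the trailing ticket[] member is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout that still begins with the version word. */
struct key_v1_with_version {
	uint32_t kif_version;
	uint16_t security_index;
	uint16_t ticket_length;
	uint32_t expiry;
	uint32_t kvno;
	uint8_t  session_key[8];
};

/* Layout matching the bytes left once the version word is skipped. */
struct key_v1_stripped {
	uint16_t security_index;
	uint16_t ticket_length;
	uint32_t expiry;
	uint32_t kvno;
	uint8_t  session_key[8];
};

/* How far every field is displaced if the first layout is applied
 * to a buffer whose version word has already been consumed. */
static size_t security_index_shift(void)
{
	return offsetof(struct key_v1_with_version, security_index) -
	       offsetof(struct key_v1_stripped, security_index);
}
```

Under these assumptions every field is read 4 bytes too late, which is
consistent with the fix of dropping kif_version from the struct.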

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/keys/rxrpc-type.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index 5cb86c3..fc48754 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -99,7 +99,6 @@ struct rxrpc_key_token {
  * structure of raw payloads passed to add_key() or instantiate key
  */
 struct rxrpc_key_data_v1 {
-	u32		kif_version;		/* 1 */
 	u16		security_index;
 	u16		ticket_length;
 	u32		expiry;			/* time_t */

^ permalink raw reply related

* [PATCH] fcoe: correct checking for bonding
From: Jiri Pirko @ 2011-02-28 13:32 UTC (permalink / raw)
  To: linux-scsi; +Cc: devel, robert.w.love, James.Bottomley, netdev

Check for IFF_BONDING, as this flag is set for all bonding devices.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/scsi/fcoe/fcoe.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 9f9600b..67714a4 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -285,9 +285,7 @@ static int fcoe_interface_setup(struct fcoe_interface *fcoe,
 	}
 
 	/* Do not support for bonding device */
-	if ((netdev->priv_flags & IFF_MASTER_ALB) ||
-	    (netdev->priv_flags & IFF_SLAVE_INACTIVE) ||
-	    (netdev->priv_flags & IFF_MASTER_8023AD)) {
+	if (netdev->priv_flags & IFF_BONDING) {
 		FCOE_NETDEV_DBG(netdev, "Bonded interfaces not supported\n");
 		return -EOPNOTSUPP;
 	}
-- 
1.7.3.4


^ permalink raw reply related

* Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet @ 2011-02-28 13:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <20110228113659.GA20726@gondor.apana.org.au>

On Monday, 28 February 2011 at 19:36 +0800, Herbert Xu wrote:
> On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> > I'm working on this right now.
> 
> OK I think I was definitely on the right track.  With the send
> patch made lockless I now get numbers which are even better than
> those obtained with running named with multiple sockets.  That's
> right, a single socket is now faster than what multiple sockets
> were without the patch (of course, multiple sockets may still
> faster with the patch vs. a single socket for obvious reasons,
> but I couldn't measure any significant difference).
> 
> Also worthy of note is that prior to the patch all CPUs showed
> idleness (lazy bastards!), with the patch they're all maxed out.
> 
> In retrospect, the idleness was simply the result of the socket
> lock scheduling away and was an indication of lock contention.
> 

Now the input path can run without finding the socket locked by the
xmit path, so skbs are queued into the receive queue, not the backlog one.

> Here are the patches I used.  Please don't them yet as I intend
> to clean them up quite a bit.
> 
> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up.  Intel tends to be a lot more forgiving.  My last AMD machine
> blew up years ago :)

I am going to test them, thanks !



^ permalink raw reply

* [PATCH net-2.6 0/7] bnx2x fixes
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem; +Cc: netdev, Eilon Greenstein, Vladislav Zolotarov

Hi Dave,

Please consider applying the series with bnx2x fixes to net-2.6.

Thanks
Dmitry


 



^ permalink raw reply

* [PATCH net-2.6 7/7] bnx2x: update driver version to 1.62.00-6
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein


Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x.h b/drivers/net/bnx2x/bnx2x.h
index 368cfcd..7897d11 100644
--- a/drivers/net/bnx2x/bnx2x.h
+++ b/drivers/net/bnx2x/bnx2x.h
@@ -22,7 +22,7 @@
  * (you will need to reboot afterwards) */
 /* #define BNX2X_STOP_ON_ERROR */
 
-#define DRV_MODULE_VERSION      "1.62.00-5"
+#define DRV_MODULE_VERSION      "1.62.00-6"
 #define DRV_MODULE_RELDATE      "2011/01/30"
 #define BNX2X_BC_VER            0x040200
 
-- 
1.7.2.2





^ permalink raw reply related

* [PATCH net-2.6 6/7] bnx2x: properly calculate lro_mss
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein, Vladislav Zolotarov


From: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_cmn.c |   48 +++++++++++++++++++++++++++++++++++-----
 1 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index a58baf3..73a1f8e 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -259,10 +259,44 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 #endif
 }
 
+/* Timestamp option length allowed for TPA aggregation:
+ *
+ *		nop nop kind length echo val
+ */
+#define TPA_TSTAMP_OPT_LEN	12
+/**
+ * Calculate the approximate value of the MSS for this
+ * aggregation using the first packet of it.
+ *
+ * @param bp
+ * @param parsing_flags Parsing flags from the START CQE
+ * @param len_on_bd Total length of the first packet for the
+ *		     aggregation.
+ */
+static inline u16 bnx2x_set_lro_mss(struct bnx2x *bp, u16 parsing_flags,
+				    u16 len_on_bd)
+{
+	/* TPA aggregation won't have any IP options or TCP options
+	 * other than timestamp.
+	 */
+	u16 hdrs_len = ETH_HLEN + sizeof(struct iphdr) + sizeof(struct tcphdr);
+
+
+	/* Check if there was a TCP timestamp; if there is, it will
+	 * always be 12 bytes long: nop nop kind length echo val.
+	 *
+	 * Otherwise FW would close the aggregation.
+	 */
+	if (parsing_flags & PARSING_FLAGS_TIME_STAMP_EXIST_FLAG)
+		hdrs_len += TPA_TSTAMP_OPT_LEN;
+
+	return len_on_bd - hdrs_len;
+}
+
 static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 			       struct sk_buff *skb,
 			       struct eth_fast_path_rx_cqe *fp_cqe,
-			       u16 cqe_idx)
+			       u16 cqe_idx, u16 parsing_flags)
 {
 	struct sw_rx_page *rx_pg, old_rx_pg;
 	u16 len_on_bd = le16_to_cpu(fp_cqe->len_on_bd);
@@ -275,8 +309,8 @@ static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 
 	/* This is needed in order to enable forwarding support */
 	if (frag_size)
-		skb_shinfo(skb)->gso_size = min((u32)SGE_PAGE_SIZE,
-					       max(frag_size, (u32)len_on_bd));
+		skb_shinfo(skb)->gso_size = bnx2x_set_lro_mss(bp, parsing_flags,
+							      len_on_bd);
 
 #ifdef BNX2X_STOP_ON_ERROR
 	if (pages > min_t(u32, 8, MAX_SKB_FRAGS)*SGE_PAGE_SIZE*PAGES_PER_SGE) {
@@ -344,6 +378,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 	if (likely(new_skb)) {
 		/* fix ip xsum and give it to the stack */
 		/* (no need to map the new skb) */
+		u16 parsing_flags =
+			le16_to_cpu(cqe->fast_path_cqe.pars_flags.flags);
 
 		prefetch(skb);
 		prefetch(((char *)(skb)) + L1_CACHE_BYTES);
@@ -373,9 +409,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		}
 
 		if (!bnx2x_fill_frag_skb(bp, fp, skb,
-					 &cqe->fast_path_cqe, cqe_idx)) {
-			if ((le16_to_cpu(cqe->fast_path_cqe.
-			    pars_flags.flags) & PARSING_FLAGS_VLAN))
+					 &cqe->fast_path_cqe, cqe_idx,
+					 parsing_flags)) {
+			if (parsing_flags & PARSING_FLAGS_VLAN)
 				__vlan_hwaccel_put_tag(skb,
 						 le16_to_cpu(cqe->fast_path_cqe.
 							     vlan_tag));
-- 
1.7.2.2
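
For readers following the arithmetic in bnx2x_set_lro_mss(), here is a minimal standalone sketch of the same calculation outside the kernel. It assumes an untagged Ethernet frame carrying IPv4 and TCP with no options other than the 12-byte timestamp, as the patch comments state. The names lro_mss, HYP_FLAG_TSTAMP and the *_HDR_LEN macros are hypothetical stand-ins, not driver identifiers, and the assert() underflow guard is an addition of this sketch that the patch does not contain.

```c
#include <assert.h>
#include <stdint.h>

/* Header sizes assumed by this sketch: untagged Ethernet, IPv4 and TCP
 * without options (14 + 20 + 20 bytes).
 */
#define ETH_HDR_LEN	14
#define IPV4_HDR_LEN	20
#define TCP_HDR_LEN	20

/* nop nop kind length echo val, as described in the patch */
#define TSTAMP_OPT_LEN	12

/* Hypothetical flag bit standing in for
 * PARSING_FLAGS_TIME_STAMP_EXIST_FLAG; the real value lives in the
 * driver headers.
 */
#define HYP_FLAG_TSTAMP	0x1

/* Approximate the MSS from the length of the first aggregated packet
 * by subtracting the fixed header lengths.
 */
static uint16_t lro_mss(uint16_t parsing_flags, uint16_t len_on_bd)
{
	uint16_t hdrs_len = ETH_HDR_LEN + IPV4_HDR_LEN + TCP_HDR_LEN;

	if (parsing_flags & HYP_FLAG_TSTAMP)
		hdrs_len += TSTAMP_OPT_LEN;

	/* Sketch-only guard against unsigned underflow. */
	assert(len_on_bd >= hdrs_len);

	return (uint16_t)(len_on_bd - hdrs_len);
}
```

Under these assumptions a 1514-byte first packet gives 1514 - 54 = 1460 without the timestamp option and 1514 - 66 = 1448 with it.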






* [PATCH net-2.6 5/7] bnx2x: perform statistics "action" before state transition.
From: Dmitry Kravkov @ 2011-02-28 13:37 UTC (permalink / raw)
  To: davem, netdev; +Cc: Eilon Greenstein, Vladislav Zolotarov


From: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_stats.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_stats.c b/drivers/net/bnx2x/bnx2x_stats.c
index bda60d5..3445ded 100644
--- a/drivers/net/bnx2x/bnx2x_stats.c
+++ b/drivers/net/bnx2x/bnx2x_stats.c
@@ -1239,14 +1239,14 @@ void bnx2x_stats_handle(struct bnx2x *bp, enum bnx2x_stats_event event)
 	if (unlikely(bp->panic))
 		return;
 
+	bnx2x_stats_stm[bp->stats_state][event].action(bp);
+
 	/* Protect a state change flow */
 	spin_lock_bh(&bp->stats_lock);
 	state = bp->stats_state;
 	bp->stats_state = bnx2x_stats_stm[state][event].next_state;
 	spin_unlock_bh(&bp->stats_lock);
 
-	bnx2x_stats_stm[state][event].action(bp);
-
 	if ((event != STATS_EVENT_UPDATE) || netif_msg_timer(bp))
 		DP(BNX2X_MSG_STATS, "state %d -> event %d -> state %d\n",
 		   state, event, bp->stats_state);
-- 
1.7.2.2
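
The reordering in bnx2x_stats_handle() can be pictured with a small table-driven state machine in which the action for the (state, event) pair runs before the state is advanced. This is only a single-threaded sketch: it omits the stats_lock spinlock, and every name in it (stats_ctx, stm, stats_handle, the ST_* and EV_* enumerators) is hypothetical rather than taken from the driver.

```c
#include <assert.h>

enum stats_state { ST_DISABLED, ST_ENABLED, ST_MAX };
enum stats_event { EV_START, EV_STOP, EV_MAX };

struct stats_ctx {
	enum stats_state state;
	enum stats_state seen_by_action;	/* state the last action observed */
	int actions_run;
};

typedef void (*stats_action)(struct stats_ctx *ctx);

/* Records which state was current while the action ran. */
static void record_action(struct stats_ctx *ctx)
{
	ctx->seen_by_action = ctx->state;
	ctx->actions_run++;
}

static void no_action(struct stats_ctx *ctx)
{
	(void)ctx;
}

/* Transition table indexed by [current state][event]. */
static const struct {
	stats_action action;
	enum stats_state next_state;
} stm[ST_MAX][EV_MAX] = {
	[ST_DISABLED] = {
		[EV_START] = { record_action, ST_ENABLED },
		[EV_STOP]  = { no_action,     ST_DISABLED },
	},
	[ST_ENABLED] = {
		[EV_START] = { no_action,     ST_ENABLED },
		[EV_STOP]  = { record_action, ST_DISABLED },
	},
};

/* Run the action first, then perform the state transition. */
static void stats_handle(struct stats_ctx *ctx, enum stats_event event)
{
	enum stats_state state = ctx->state;

	assert(state < ST_MAX && event < EV_MAX);

	stm[state][event].action(ctx);
	ctx->state = stm[state][event].next_state;
}
```

With this ordering the action still sees the old state in the context; for example, an EV_START delivered in ST_DISABLED runs record_action while ctx->state is ST_DISABLED and only then moves to ST_ENABLED.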




