* Re: net-next: warnings from sysctl_net_exit
From: Stephen Hemminger @ 2011-02-27 18:45 UTC (permalink / raw)
To: David Miller; +Cc: adobriyan, netdev
In-Reply-To: <20110226.222333.59680338.davem@davemloft.net>
On Sat, 26 Feb 2011 22:23:33 -0800 (PST)
David Miller <davem@davemloft.net> wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Sat, 26 Feb 2011 16:56:01 -0800
>
> > Seeing lots of these messages in dmesg. Something is broken
> > recently in net-next.
>
> Did you by change pull plain net-2.6 into that tree? Because one
> commit which is in net-2.6 but not in net-next-2.6 catches my eye:
>
> commit c486da34390846b430896a407b47f0cea3a4189c
> Author: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
> Date: Thu Feb 24 19:48:03 2011 +0000
>
> sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush
>
> Before this patch issuing these commands:
>
> fd = open("/proc/sys/net/ipv6/route/flush")
> unshare(CLONE_NEWNET)
> write(fd, "stuff")
>
> would flush the newly created net, not the original one.
>
> The equivalent ipv4 code is correct (stores the net inside ->extra1).
> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
I am building against pure net-next tree. Just checked by recloning.
--
^ permalink raw reply
* Re: Bug inkvm_set_irq
From: Michael S. Tsirkin @ 2011-02-27 17:00 UTC (permalink / raw)
To: Jean-Philippe Menil; +Cc: kvm, netdev, virtualization
In-Reply-To: <4D67714A.2050100@univ-nantes.fr>
On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> Hi,
>
> Each time i try tou use vhost_net, i'm facing a kernel bug.
> I do a "modprobe vhost_net", and start guest whith vhost=on.
>
> Following is a trace with a kernel 2.6.37, but i had the same
> problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
2.6.36 had a theorectical race that could explain this,
but it should be ok in 2.6.37.
>
> The bug only occurs whith vhost_net charged, so i don't know if this
> is a bug in kvm module code or in the vhost_net code.
It could be a bug in eventfd which is the interface
used by both kvm and vhost_net.
Just for fun, you can try 3.6.38 - eventfd code has been changed
a lot in 2.6.38 and if it does not trigger there
it's a hint that irqfd is the reason.
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243100] BUG: unable to handle kernel paging request at
> 0000000000002458
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243250] IP: [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
Could you run markup_oops/ ksymoops on this please?
As far as I can see kvm_set_irq can only get a wrong
kvm pointer. Unless there's some general memory corruption,
I'd guess
You can also try comparing the irqfd->kvm pointer in
kvm_irqfd_assign irqfd_wakeup and kvm_set_irq in
virt/kvm/eventfd.c.
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243556] Oops: 0000 [#1] SMP
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243692] last sysfs file:
> /sys/devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:00.0/irq
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.243777] CPU 0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243820] Modules linked in: vhost_net macvtap macvlan tun
> powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> cpufreq_ondemand fre
> q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
> ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
> xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> nf_conntrack_ftp nf_connt
> rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
> dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> snd_page_alloc tpm_tis tpm ps
> mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
> serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
> xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
> sd_mod enclosu
> re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
> scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RIP: 0010:[<ffffffffa041aa8a>] [<ffffffffa041aa8a>]
> kvm_set_irq+0x2a/0x130 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RSP: 0018:ffff88045fc89d30 EFLAGS: 00010246
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RAX: 0000000000000000 RBX: 000000000000001a RCX:
> 0000000000000001
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000000
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RBP: 0000000000000000 R08: 0000000000000001 R09:
> ffff880856a91e48
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] R10: 0000000000000000 R11: 00000000ffffffff R12:
> 0000000000000000
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] R13: 0000000000000001 R14: 0000000000000000 R15:
> 0000000000000000
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] FS: 00007f617986c710(0000) GS:ffff88007f800000(0000)
> knlGS:0000000000000000
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] CR2: 0000000000002458 CR3: 000000045d197000 CR4:
> 00000000000006f0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Process kworker/0:1 (pid: 10, threadinfo
> ffff88045fc88000, task ffff88085fc53c30)
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [ 685.246123] Stack:
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] ffff88045fc89fd8 00000000000119c0 ffff88045fc88010
> ffff88085fc53ee8
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] ffff88045fc89fd8 ffff88085fc53ee0 ffff88085fc53c30
> 00000000000119c0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] 00000000000119c0 ffffffff8137f7ce ffff88007f80df40
> 00000000ffffffff
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Call Trace:
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8137f7ce>] ? common_interrupt+0xe/0x13
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffffa041bc57>] ? irqfd_inject+0x27/0x50 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffffa041bc30>] ? irqfd_inject+0x0/0x50 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106b6f2>] ? process_one_work+0x112/0x460
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106be25>] ? worker_thread+0x145/0x410
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8103a3d0>] ? __wake_up_common+0x50/0x80
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106bce0>] ? worker_thread+0x0/0x410
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106f786>] ? kthread+0x96/0xa0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff81003ce4>] ? kernel_thread_helper+0x4/0x10
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff8106f6f0>] ? kthread+0x0/0xa0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] [<ffffffff81003ce0>] ? kernel_thread_helper+0x0/0x10
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Code: ff 41 57 41 89 f7 41 56 41 55 41 89 cd 41 54 49 89
> fc 55 53 89 d3 48 81 ec 98 00 00 00 8b 15 c6 79 03 00 85 d2 0f 85 c4
> 00 00 00 <4
> 9> 8b 84 24 58 24 00 00 3b 98 28 01 00 00 73 5e 89 db 48 8b 84
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RIP [<ffffffffa041aa8a>] kvm_set_irq+0x2a/0x130 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RSP <ffff88045fc89d30>
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] CR2: 0000000000002458
>
>
> If someone can help me, on how to solve this.
>
> Regards.
> begin:vcard
> fn:Jean-Philippe Menil
> n:Menil;Jean-Philippe
> org;quoted-printable:Universit=C3=A9 de Nantes;IRTS
> adr;dom:;;;Nantes
> email;internet:jean-philippe.menil@univ-nantes.fr
> title:Reseau
> tel;work:02.53.48.49.27
> tel;fax:02.53.48.49.97
> x-mozilla-html:FALSE
> version:2.1
> end:vcard
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Dave Täht @ 2011-02-27 16:25 UTC (permalink / raw)
To: sedat.dilek-Re5JQEeQqe8AvxtiuMwx3w
Cc: John W. Linville, bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <AANLkTi=Cp7N4xqPyU6KWF6DOzFytxaA2BeoXhnhsZ6dp-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
> On Sun, Feb 27, 2011 at 4:56 PM, Dave Täht <d@taht.net> wrote:
>> Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
>>
>>> On Sun, Feb 27, 2011 at 4:31 PM, Dave Täht <d@taht.net> wrote:
>>>>
>>>> Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
>>
>>> Are you planning debloat feature for 2.6.39?
>>
>> Depends on how many testers we get and what the results are.
>>
>> I feel the eBDP stuff will not be ready during this release cycle. SFB
>> and CHOKe are in net-next, so, probably. Various driver patches -
>> particularly those that increase the available dynamic range via
>> ethtool, (e.g lowering the bottommost TX queue limit to, like, 4,
>> especially for home gateways) may make it out if people look harder into
>> the issue.
>
> OK, thanks for the explanations.
>
> Concerning "more drivers":
> What would I have to do to modify ath5k?
> I looked into the ath9k patch in debloat-testing GIT and it was to mod
> some (TX/BUF) values only.
Yes, reducing your TX buffer size greatly is the first, best, and
easiest patch.
For wireless routers and cable home gateways especially, this research
shows that the total un-managed buffers in your system should be less
than 32.
http://www.cs.clemson.edu/~jmarty/papers/PID1154937.pdf
I found their data convincing, and there are dozens of other papers that
we are sorting out on the bufferbloat.net web site.
(PLEASE Note the key word there is un-managed)
0 would be the best value. :/
In the case of wireless, you also have retries to take into account.
I'd argue in those cases, that what I say above is that the number
should be FAR less than 32.
Now, whether there is a good compromise between throughput and latency
in that range in a DMA TX queue + TXQUEUE, remains to be seen.
> Not sure if ath9k is/was "well" prepared or only a good choice by the
> testers/committers as they own such a device.
My test network is mostly ath9k - the nanostation M5s and the WNDR5700
router described here:
http://www.bufferbloat.net/projects/bloat/wiki/Experiment_-_Bloated_LAGN_vs_debloated_WNDR5700
There are people looking into the ath6kl, but you're the first to step
up with the ath5k. :) Maybe the folk over at #ath6kl on irc can help.
The ath9k patch improves latency under load enormously - I can run voip
over it AND do big transfers and stream audio via samba... Which I
couldn't before - and DNS, ND, NTP, babel, etc behave much better, but
the currently hard coded nature of the TX queue limit does put an upper
limit on packet aggregation that the eBDP folk are trying to resolve
more generically.
In practice, at "normal" 180Mbit rates, with the new queue depth of 3, I
get most of the benefits of packet aggregation without the lag.
I do see higher packet loss than I would like, at present.
>
> - Sedat -
--
Dave Taht
http://nex-6.taht.net
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Sedat Dilek @ 2011-02-27 16:01 UTC (permalink / raw)
To: Dave Täht
Cc: John W. Linville, bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <8739n9ii7z.fsf-dm88P3lIUJl5IOjekqviT21/HaPePypd@public.gmane.org>
On Sun, Feb 27, 2011 at 4:56 PM, Dave Täht <d@taht.net> wrote:
> Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
>
>> On Sun, Feb 27, 2011 at 4:31 PM, Dave Täht <d@taht.net> wrote:
>>>
>>> Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
>>>
>>>> On Fri, Feb 25, 2011 at 11:22 PM, John W. Linville
>>>> <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> wrote:
>>>>> Announcement
>>>>>
>>>>> The bufferbloat project [1] is pleased to announce the availability
>>>>> of the debloat-testing Linux kernel git tree:
>>>>>
>>>>> git://git.infradead.org/debloat-testing.git
>>>
>>> ----snip----
>
>>> Excellent. At moment I would recommend building "low latency preempt
>>> desktop" kernels with a high HZ value (400 or 1000), enabling highres
>>> timers, and compiling in SFB as a module. (I'd like the default for SFB
>>> to be "m" rather than "n", too)
>>>
>
>> These "debloat guys" are fast :-). I was just preparing my
>> build-system (which I normally use to debianize linux-next kernels).
>> Any other recommendation for kernel-config options? For example:
>> linux-next has already CONFIG_NET_SCH_CHOKE (but I have unset it).
>
> Enable CHOKe.
>
> The HZ value change is due to my worry that we've smashed latency so
> much in the driver/mac layer that it's interacting with the higher
> layers somewhat badly... So we need to add more hooks to the servo loops
> involved in order to have a normal HZ.
>
>> Which commits are in debloat-testing GIT but not in linux-next tree?
>
> The current list was in the release announcement. More on the way
> (mostly embedded drivers at this point) git pull early and often!
>
>> Are you planning debloat feature for 2.6.39?
>
> Depends on how many testers we get and what the results are.
>
> I feel the eBDP stuff will not be ready during this release cycle. SFB
> and CHOKe are in net-next, so, probably. Various driver patches -
> particularly those that increase the available dynamic range via
> ethtool, (e.g lowering the bottommost TX queue limit to, like, 4,
> especially for home gateways) may make it out if people look harder into
> the issue.
>
>>
>> - Sedat -
>
> --
> Dave Taht
> http://nex-6.taht.net
>
OK, thanks for the explanations.
Concerning "more drivers":
What would I have to do to modify ath5k?
I looked into the ath9k patch in debloat-testing GIT and it was to mod
some (TX/BUF) values only.
Not sure if ath9k is/was "well" prepared or only a good choice by the
testers/committers as they own such a device.
- Sedat -
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Dave Täht @ 2011-02-27 15:56 UTC (permalink / raw)
To: sedat.dilek
Cc: John W. Linville, bloat-devel, netdev, linux-wireless,
linux-kernel
In-Reply-To: <AANLkTinpG-xAC-VV-j7EBJB522+BqCyyFd_6N3KEr2Cf@mail.gmail.com>
Sedat Dilek <sedat.dilek@googlemail.com> writes:
> On Sun, Feb 27, 2011 at 4:31 PM, Dave Täht <d@taht.net> wrote:
>>
>> Sedat Dilek <sedat.dilek@googlemail.com> writes:
>>
>>> On Fri, Feb 25, 2011 at 11:22 PM, John W. Linville
>>> <linville@tuxdriver.com> wrote:
>>>> Announcement
>>>>
>>>> The bufferbloat project [1] is pleased to announce the availability
>>>> of the debloat-testing Linux kernel git tree:
>>>>
>>>> git://git.infradead.org/debloat-testing.git
>>
>> ----snip----
>> Excellent. At moment I would recommend building "low latency preempt
>> desktop" kernels with a high HZ value (400 or 1000), enabling highres
>> timers, and compiling in SFB as a module. (I'd like the default for SFB
>> to be "m" rather than "n", too)
>>
> These "debloat guys" are fast :-). I was just preparing my
> build-system (which I normally use to debianize linux-next kernels).
> Any other recommendation for kernel-config options? For example:
> linux-next has already CONFIG_NET_SCH_CHOKE (but I have unset it).
Enable CHOKe.
The HZ value change is due to my worry that we've smashed latency so
much in the driver/mac layer that it's interacting with the higher
layers somewhat badly... So we need to add more hooks to the servo loops
involved in order to have a normal HZ.
> Which commits are in debloat-testing GIT but not in linux-next tree?
The current list was in the release announcement. More on the way
(mostly embedded drivers at this point) git pull early and often!
> Are you planning debloat feature for 2.6.39?
Depends on how many testers we get and what the results are.
I feel the eBDP stuff will not be ready during this release cycle. SFB
and CHOKe are in net-next, so, probably. Various driver patches -
particularly those that increase the available dynamic range via
ethtool, (e.g lowering the bottommost TX queue limit to, like, 4,
especially for home gateways) may make it out if people look harder into
the issue.
>
> - Sedat -
--
Dave Taht
http://nex-6.taht.net
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Sedat Dilek @ 2011-02-27 15:38 UTC (permalink / raw)
To: Dave Täht
Cc: John W. Linville, bloat-devel, netdev, linux-wireless,
linux-kernel
In-Reply-To: <87fwr9jxya.fsf@cruithne.co.teklibre.org>
On Sun, Feb 27, 2011 at 4:31 PM, Dave Täht <d@taht.net> wrote:
>
> Sedat Dilek <sedat.dilek@googlemail.com> writes:
>
>> On Fri, Feb 25, 2011 at 11:22 PM, John W. Linville
>> <linville@tuxdriver.com> wrote:
>>> Announcement
>>>
>>> The bufferbloat project [1] is pleased to announce the availability
>>> of the debloat-testing Linux kernel git tree:
>>>
>>> git://git.infradead.org/debloat-testing.git
>
> ----snip----
>
>> Hi,
>>
>> it should be "localversion-debloat" in the commit-subject in [1] (not
>> "localversion-wireless") :-). "-db" as suffix is IMHO not very
>> meaningful... Why not add simply a suffix called "-debloat"? (Anyway,
>> I will revert this patch because I don't want to have any suffix added
>> automatically.)
>>
>> I have several other questions, but I start compiling first and test
>> this debloat kernel.
>
> Excellent. At moment I would recommend building "low latency preempt
> desktop" kernels with a high HZ value (400 or 1000), enabling highres
> timers, and compiling in SFB as a module. (I'd like the default for SFB
> to be "m" rather than "n", too)
>
>>
>> Regards,
>> - Sedat -
>>
>> [1] "Add localversion-wireless to identify builds from this tree."
>> http://git.infradead.org/debloat-testing.git/commit/3f9bdb4f44b076feda72d353d8ad717831416f36
>> _______________________________________________
>> Bloat-devel mailing list
>> Bloat-devel@lists.bufferbloat.net
>> https://lists.bufferbloat.net/listinfo/bloat-devel
>
> --
> Dave Taht
> http://nex-6.taht.net
>
These "debloat guys" are fast :-).
I was just preparing my build-system (which I normally use to
debianize linux-next kernels).
Any other recommendation for kernel-config options?
For example:
linux-next has already CONFIG_NET_SCH_CHOKE (but I have unset it).
Which commits are in debloat-testing GIT but not in linux-next tree?
Are you planning debloat feature for 2.6.39?
- Sedat -
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Dave Täht @ 2011-02-27 15:31 UTC (permalink / raw)
To: sedat.dilek-Re5JQEeQqe8AvxtiuMwx3w
Cc: John W. Linville, bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <AANLkTingkcc-dvs_8Nr0vYXdXvuDHEn6sz14tnHzLp8W-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Sedat Dilek <sedat.dilek-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> writes:
> On Fri, Feb 25, 2011 at 11:22 PM, John W. Linville
> <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> wrote:
>> Announcement
>>
>> The bufferbloat project [1] is pleased to announce the availability
>> of the debloat-testing Linux kernel git tree:
>>
>> git://git.infradead.org/debloat-testing.git
----snip----
> Hi,
>
> it should be "localversion-debloat" in the commit-subject in [1] (not
> "localversion-wireless") :-). "-db" as suffix is IMHO not very
> meaningful... Why not add simply a suffix called "-debloat"? (Anyway,
> I will revert this patch because I don't want to have any suffix added
> automatically.)
>
> I have several other questions, but I start compiling first and test
> this debloat kernel.
Excellent. At moment I would recommend building "low latency preempt
desktop" kernels with a high HZ value (400 or 1000), enabling highres
timers, and compiling in SFB as a module. (I'd like the default for SFB
to be "m" rather than "n", too)
>
> Regards,
> - Sedat -
>
> [1] "Add localversion-wireless to identify builds from this tree."
> http://git.infradead.org/debloat-testing.git/commit/3f9bdb4f44b076feda72d353d8ad717831416f36
> _______________________________________________
> Bloat-devel mailing list
> Bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org
> https://lists.bufferbloat.net/listinfo/bloat-devel
--
Dave Taht
http://nex-6.taht.net
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [GIT PULL nf-next-2.6] IPVS
From: Patrick McHardy @ 2011-02-27 15:25 UTC (permalink / raw)
To: Simon Horman
Cc: lvs-devel, netdev, netfilter-devel, netfilter, Changli Gao,
Wensong Zhang, Julian Anastasov
In-Reply-To: <1298601812-8168-1-git-send-email-horms@verge.net.au>
On 25.02.2011 03:43, Simon Horman wrote:
> Hi Patrick,
>
> please consider pulling
> git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-test-2.6.git master
> go get the following changes by Changli.
>
> ipvs: use hlist instead of list
> ipvs: use enum to instead of magic numbers
> ipvs: unify the formula to estimate the overhead of processing connections
Pulled, thanks Simon.
^ permalink raw reply
* Re: ANNOUNCE: debloat-testing kernel git tree
From: Sedat Dilek @ 2011-02-27 15:23 UTC (permalink / raw)
To: John W. Linville
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1
In-Reply-To: <20110225222210.GA3618-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
On Fri, Feb 25, 2011 at 11:22 PM, John W. Linville
<linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org> wrote:
> Announcement
>
> The bufferbloat project [1] is pleased to announce the availability
> of the debloat-testing Linux kernel git tree:
>
> git://git.infradead.org/debloat-testing.git
>
> The purpose of this tree is to provide a reasonably stable base for
> the development and testing of new algorithms, miscellaneous fixes,
> and maybe a few hacks intended to advance the cause of eliminating
> or at least mitigating bufferbloat in the Linux world.
>
> Introduction
>
> Bufferbloat is a term coined by Jim Gettys to describe the increasing
> prevalence of large and (particularly) unmanaged network buffers along
> the network links that comprise the Internet [2]. If you are not aware
> of the problems with network latency under load that the Internet is
> already encountering, we encourage you to visit Jim Gettys' blog [3].
> There Jim has begun to fit together enough puzzle pieces to at least
> frame the issue.
>
> Jim has also made available slides and an audio recording (edited
> for time) from a presentation on this topic:
>
> http://mirrors.bufferbloat.net/Talks/BellLabs01192011/
>
> Kernel Bits
>
> The debloat-testing tree is intended to track full and -rc releases
> from linux-2.6, with interesting patches cherry-picked from net-next
> and various experimental bits added on top. The current stable of
> such patches includes the following:
>
> Eric Dumazet (based on original work by Juliusz Chroboczek):
> net_sched: SFB flow scheduler
>
> stephen hemminger:
> sched: CHOKe flow scheduler
>
> John Fastabend:
> net: implement mechanism for HW based QOS
> net_sched: implement a root container qdisc sch_mqprio
>
> John W. Linville:
> mac80211: implement eBDP algorithm to fight bufferbloat
>
> Nathaniel J. Smith:
> iwlwifi: Simplify tx queue management
> iwlwifi: Convert the tx queue high_mark to an atomic_t
> iwlwifi: Invert the sense of the queue high_mark
> iwlwifi: auto-tune tx queue size to minimize latency
> iwlwifi: make current tx queue sizes visible in debugfs
>
> Dave Taht:
> Bufferbloat reduction for the e1000 driver that started it all
> Reduce bufferbloated default for e1000e, increase dynamic range
> Smash bufferbloat in the ath9k driver
>
> Userland Bits
>
> Patches for the userspace tc utility incorporating support for both the
> CHOKe AQM and the Stochastic Fair Blue scheduler (SFB) are available:
>
> https://github.com/dtaht/iproute2bufferbloat
>
> Contributions
>
> Please send any experimental or research-oriented patches related to
> bufferbloat to the bloat-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org list. Reminders
> of more mainstream patches that may be relevant and/or interesting
> for cherry-picking into debloat-testing are welcome there as well.
>
> Obviously, patches that are ready for normal merge consideration
> should continue to be sent to netdev, linux-wireless, linux-kernel,
> or whatever other existing list is appropriate for them.
>
> Thanks
>
> Finally, we want to offer a huge thanks to the 130+ new members of
> the bloat mailing list [4] for leaping into the fray, and to David
> Woodhouse for hosting the debloat-testing tree at infradead.
>
> Please help us beat the bloat. Good luck, and happy debloating!
>
> Notes
>
> [1] http://bufferbloat.net
> [2] http://gettys.wordpress.com/what-is-bufferbloat-anyway/
> [3] http://en.wordpress.com/tag/bufferbloat/
> [4] https://lists.bufferbloat.net
> --
> John W. Linville Someday the world will need a hero, and you
> linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org might be all we have. Be ready.
>
Hi,
it should be "localversion-debloat" in the commit-subject in [1] (not
"localversion-wireless") :-).
"-db" as suffix is IMHO not very meaningful... Why not add simply a
suffix called "-debloat"?
(Anyway, I will revert this patch because I don't want to have any
suffix added automatically.)
I have several other questions, but I start compiling first and test
this debloat kernel.
Regards,
- Sedat -
[1] "Add localversion-wireless to identify builds from this tree."
http://git.infradead.org/debloat-testing.git/commit/3f9bdb4f44b076feda72d353d8ad717831416f36
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [Lxc-users] Bad checksums and lost packets with macvlan on dummy
From: Daniel Lezcano @ 2011-02-27 15:14 UTC (permalink / raw)
To: Andrian Nord; +Cc: lxc-users, Patrick McHardy, Linux Netdev List, Eric Dumazet
In-Reply-To: <20110223170512.GA10277@nord.niifaq.ru>
On 02/23/2011 06:13 PM, Andrian Nord wrote:
> On Mon, Feb 21, 2011 at 05:07:31PM +0100, Daniel Lezcano wrote:
>> I Cc'ed the netdev mailing list and Patrick in case my analysis is wrong
>> or incomplete.
> I'm confirming, that this happens only when macvlan's are onto dummy net
> device. In case of some physical interface under macvlan there is no lost
> packages and no broken checksums.
I did some tests with a 2.6.35 kernel version and it seems the checksum
errors do not appear.
I noticed there are some changes in the dummy setup function:
dev->features |= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
dev->features |= NETIF_F_NO_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
May be that was introduced by commit:
commit 6d81f41c58c69ddde497e9e640ba5805aa26e78c
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon Sep 27 20:50:33 2010 +0000
dummy: percpu stats and lockless xmit
Converts dummy network device driver to :
- percpu stats
- 64bit stats
- lockless xmit (NETIF_F_LLTX)
- performance features added (NETIF_F_SG | NETIF_F_FRAGLIST |
NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric,
Andrian is observing, with a couple of macvlan (in bridge mode) on top
of a dummy interface, a lot of checksums error and packets drop.
Each macvlan is in a different network namespace and the dummy interface
is in the init_net.
Any ideas ?
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-27 14:17 UTC (permalink / raw)
To: Jiri Pirko, David Miller
Cc: kaber, eric.dumazet, netdev, shemminger, fubar, andy
In-Reply-To: <20110223190541.GB2783@psychotron.redhat.com>
Le 23/02/2011 20:05, Jiri Pirko a écrit :
> This patch converts bonding to use rx_handler. Results in cleaner
> __netif_receive_skb() with much less exceptions needed. Also
> bond-specific work is moved into bond code.
>
> Did performance test using pktgen and counting incoming packets by
> iptables. No regression noted.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>
> v1->v2:
> using skb_iif instead of new input_dev to remember original
> device
>
> v2->v3:
> do another loop in case skb->dev is changed. That way orig_dev
> core can be left untouched.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
> ---
[snip]
> +static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
> +{
> + struct net_device *slave_dev;
> + struct net_device *bond_dev;
> +
> + skb = skb_share_check(skb, GFP_ATOMIC);
> + if (unlikely(!skb))
> + return NULL;
> + slave_dev = skb->dev;
> + bond_dev = ACCESS_ONCE(slave_dev->master);
> + if (unlikely(!bond_dev))
> + return skb;
> +
> + if (bond_dev->priv_flags& IFF_MASTER_ARPMON)
> + slave_dev->last_rx = jiffies;
> +
> + if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
> + skb->deliver_no_wcard = 1;
> + return skb;
Shouldn't we return NULL here ?
> + }
> +
> + skb->dev = bond_dev;
> +
> + if (bond_dev->priv_flags& IFF_MASTER_ALB&&
> + bond_dev->priv_flags& IFF_BRIDGE_PORT&&
> + skb->pkt_type == PACKET_HOST) {
> + u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
> +
> + memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
> + }
> +
> + return skb;
> +}
> +
[snip]
> +static void vlan_on_bond_hook(struct sk_buff *skb)
> {
> - if (skb->pkt_type == PACKET_HOST) {
> - u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
> + /*
> + * Make sure ARP frames received on VLAN interfaces stacked on
> + * bonding interfaces still make their way to any base bonding
> + * device that may have registered for a specific ptype.
> + */
> + if (skb->dev->priv_flags& IFF_802_1Q_VLAN&&
> + vlan_dev_real_dev(skb->dev)->priv_flags& IFF_BONDING&&
> + skb->protocol == htons(ETH_P_ARP)) {
The vlan_on_bond case used to be cost effective. Now, we clone the skb and call netif_rx...
> + struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
>
> - memcpy(dest, master->dev_addr, ETH_ALEN);
> + if (!skb2)
> + return;
> + skb2->dev = vlan_dev_real_dev(skb->dev);
> + netif_rx(skb2);
> }
> }
[snip]
> if (rx_handler) {
> + struct net_device *prev_dev;
> +
> if (pt_prev) {
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = NULL;
> }
> + prev_dev = skb->dev;
> skb = rx_handler(skb);
> if (!skb)
> goto out;
I would instead consider NULL as meaning exact-match-delivery-only. (The same effect as
dev_bond_should_drop() returning true).
> + if (skb->dev != prev_dev)
> + goto another_round;
> }
Anyway, all my comments can't be postponed to follow-up patchs. Thanks Jiri.
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
^ permalink raw reply
* [PATCH net-next-2.6 v2 2/2] dcbnl: add support for retrieving peer configuration - cee
From: Shmulik Ravid @ 2011-02-27 15:04 UTC (permalink / raw)
To: davem; +Cc: John Fastabend, Eilon Greenstein, netdev
This patch adds the support for retrieving the remote or peer DCBX
configuration via dcbnl for embedded DCBX stacks supporting the CEE DCBX
standard.
Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
---
include/linux/dcbnl.h | 71 +++++++++++++++++++++++++++++++++++++++++
include/net/dcbnl.h | 3 ++
net/dcb/dcbnl.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 155 insertions(+), 4 deletions(-)
diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 2542685..a3680a1 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -87,6 +87,45 @@ struct ieee_pfc {
__u64 indications[IEEE_8021QAZ_MAX_TCS];
};
+/* CEE DCBX std supported values */
+#define CEE_DCBX_MAX_PGS 8
+#define CEE_DCBX_MAX_PRIO 8
+
+/**
+ * struct cee_pg - CEE Prioity-Group managed object
+ *
+ * @willing: willing bit in the PG tlv
+ * @error: error bit in the PG tlv
+ * @pg_en: enable bit of the PG feature
+ * @tcs_supported: number of traffic classes supported
+ * @pg_bw: bandwidth percentage for each priority group
+ * @prio_pg: priority to PG mapping indexed by priority
+ */
+struct cee_pg {
+ __u8 willing;
+ __u8 error;
+ __u8 pg_en;
+ __u8 tcs_supported;
+ __u8 pg_bw[CEE_DCBX_MAX_PGS];
+ __u8 prio_pg[CEE_DCBX_MAX_PGS];
+};
+
+/**
+ * struct cee_pfc - CEE PFC managed object
+ *
+ * @willing: willing bit in the PFC tlv
+ * @error: error bit in the PFC tlv
+ * @pfc_en: bitmap indicating pfc enabled traffic classes
+ * @tcs_supported: number of traffic classes supported
+ */
+struct cee_pfc {
+ __u8 willing;
+ __u8 error;
+ __u8 pfc_en;
+ __u8 tcs_supported;
+};
+
+
/* This structure contains the IEEE 802.1Qaz APP managed object. This
* object is also used for the CEE std as well. There is no difference
* between the objects.
@@ -158,6 +197,7 @@ struct dcbmsg {
* @DCB_CMD_SDCBX: set DCBX engine configuration
* @DCB_CMD_GFEATCFG: get DCBX features flags
* @DCB_CMD_SFEATCFG: set DCBX features negotiation flags
+ * @DCB_CMD_CEE_GET: get CEE aggregated configuration
*/
enum dcbnl_commands {
DCB_CMD_UNDEFINED,
@@ -200,6 +240,8 @@ enum dcbnl_commands {
DCB_CMD_GFEATCFG,
DCB_CMD_SFEATCFG,
+ DCB_CMD_CEE_GET,
+
__DCB_CMD_ENUM_MAX,
DCB_CMD_MAX = __DCB_CMD_ENUM_MAX - 1,
};
@@ -222,6 +264,7 @@ enum dcbnl_commands {
* @DCB_ATTR_IEEE: IEEE 802.1Qaz supported attributes (NLA_NESTED)
* @DCB_ATTR_DCBX: DCBX engine configuration in the device (NLA_U8)
* @DCB_ATTR_FEATCFG: DCBX features flags (NLA_NESTED)
+ * @DCB_ATTR_CEE: CEE std supported attributes (NLA_NESTED)
*/
enum dcbnl_attrs {
DCB_ATTR_UNDEFINED,
@@ -245,6 +288,9 @@ enum dcbnl_attrs {
DCB_ATTR_DCBX,
DCB_ATTR_FEATCFG,
+ /* CEE nested attributes */
+ DCB_ATTR_CEE,
+
__DCB_ATTR_ENUM_MAX,
DCB_ATTR_MAX = __DCB_ATTR_ENUM_MAX - 1,
};
@@ -280,6 +326,31 @@ enum ieee_attrs_app {
#define DCB_ATTR_IEEE_APP_MAX (__DCB_ATTR_IEEE_APP_MAX - 1)
/**
+ * enum cee_attrs - CEE DCBX get attributes
+ *
+ * @DCB_ATTR_CEE_UNSPEC: unspecified
+ * @DCB_ATTR_CEE_PEER_PG: peer PG configuration - get only
+ * @DCB_ATTR_CEE_PEER_PFC: peer PFC configuration - get only
+ * @DCB_ATTR_CEE_PEER_APP: peer APP tlv - get only
+ */
+enum cee_attrs {
+ DCB_ATTR_CEE_UNSPEC,
+ DCB_ATTR_CEE_PEER_PG,
+ DCB_ATTR_CEE_PEER_PFC,
+ DCB_ATTR_CEE_PEER_APP_TABLE,
+ __DCB_ATTR_CEE_MAX
+};
+#define DCB_ATTR_CEE_MAX (__DCB_ATTR_CEE_MAX - 1)
+
+enum peer_app_attr {
+ DCB_ATTR_CEE_PEER_APP_UNSPEC,
+ DCB_ATTR_CEE_PEER_APP_INFO,
+ DCB_ATTR_CEE_PEER_APP,
+ __DCB_ATTR_CEE_PEER_APP_MAX
+};
+#define DCB_ATTR_CEE_PEER_APP_MAX (__DCB_ATTR_CEE_PEER_APP_MAX - 1)
+
+/**
* enum dcbnl_pfc_attrs - DCB Priority Flow Control user priority nested attrs
*
* @DCB_PFC_UP_ATTR_UNDEFINED: unspecified attribute to catch errors
diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index 7b7180e..e5983c9 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -84,6 +84,9 @@ struct dcbnl_rtnl_ops {
u16 *);
int (*peer_getapptable)(struct net_device *, struct dcb_app *);
+ /* CEE peer */
+ int (*cee_peer_getpg) (struct net_device *, struct cee_pg *);
+ int (*cee_peer_getpfc) (struct net_device *, struct cee_pfc *);
};
#endif /* __NET_DCBNL_H__ */
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 2e6dcf2..d8b4f72 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1224,7 +1224,9 @@ err:
return err;
}
-static int dcbnl_build_peer_app(struct net_device *netdev, struct sk_buff* skb)
+static int dcbnl_build_peer_app(struct net_device *netdev, struct sk_buff* skb,
+ int app_nested_type, int app_info_type,
+ int app_entry_type)
{
struct dcb_peer_app_info info;
struct dcb_app *table = NULL;
@@ -1256,12 +1258,15 @@ static int dcbnl_build_peer_app(struct net_device *netdev, struct sk_buff* skb)
*/
err = -EMSGSIZE;
- app = nla_nest_start(skb, DCB_ATTR_IEEE_PEER_APP);
+ app = nla_nest_start(skb, app_nested_type);
if (!app)
goto nla_put_failure;
+ if (app_info_type)
+ NLA_PUT(skb, app_info_type, sizeof(info), &info);
+
for (i = 0; i < app_count; i++)
- NLA_PUT(skb, DCB_ATTR_IEEE_APP, sizeof(struct dcb_app),
+ NLA_PUT(skb, app_entry_type, sizeof(struct dcb_app),
&table[i]);
nla_nest_end(skb, app);
@@ -1352,7 +1357,10 @@ static int dcbnl_ieee_get(struct net_device *netdev, struct nlattr **tb,
}
if (ops->peer_getappinfo && ops->peer_getapptable) {
- err = dcbnl_build_peer_app(netdev, skb);
+ err = dcbnl_build_peer_app(netdev, skb,
+ DCB_ATTR_IEEE_PEER_APP,
+ DCB_ATTR_IEEE_APP_UNSPEC,
+ DCB_ATTR_IEEE_APP);
if (err)
goto nla_put_failure;
}
@@ -1510,6 +1518,71 @@ err:
return ret;
}
+/* Handle CEE DCBX GET commands. */
+static int dcbnl_cee_get(struct net_device *netdev, struct nlattr **tb,
+ u32 pid, u32 seq, u16 flags)
+{
+ struct sk_buff *skb;
+ struct nlmsghdr *nlh;
+ struct dcbmsg *dcb;
+ struct nlattr *cee;
+ const struct dcbnl_rtnl_ops *ops = netdev->dcbnl_ops;
+ int err;
+
+ if (!ops)
+ return -EOPNOTSUPP;
+
+ skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!skb)
+ return -ENOBUFS;
+
+ nlh = NLMSG_NEW(skb, pid, seq, RTM_GETDCB, sizeof(*dcb), flags);
+
+ dcb = NLMSG_DATA(nlh);
+ dcb->dcb_family = AF_UNSPEC;
+ dcb->cmd = DCB_CMD_CEE_GET;
+
+ NLA_PUT_STRING(skb, DCB_ATTR_IFNAME, netdev->name);
+
+ cee = nla_nest_start(skb, DCB_ATTR_CEE);
+ if (!cee)
+ goto nla_put_failure;
+
+ /* get peer info if available */
+ if (ops->cee_peer_getpg) {
+ struct cee_pg pg;
+ err = ops->cee_peer_getpg(netdev, &pg);
+ if (!err)
+ NLA_PUT(skb, DCB_ATTR_CEE_PEER_PG, sizeof(pg), &pg);
+ }
+
+ if (ops->cee_peer_getpfc) {
+ struct cee_pfc pfc;
+ err = ops->cee_peer_getpfc(netdev, &pfc);
+ if (!err)
+ NLA_PUT(skb, DCB_ATTR_CEE_PEER_PFC, sizeof(pfc), &pfc);
+ }
+
+ if (ops->peer_getappinfo && ops->peer_getapptable) {
+ err = dcbnl_build_peer_app(netdev, skb,
+ DCB_ATTR_CEE_PEER_APP_TABLE,
+ DCB_ATTR_CEE_PEER_APP_INFO,
+ DCB_ATTR_CEE_PEER_APP);
+ if (err)
+ goto nla_put_failure;
+ }
+
+ nla_nest_end(skb, cee);
+ nlmsg_end(skb, nlh);
+
+ return rtnl_unicast(skb, &init_net, pid);
+nla_put_failure:
+ nlmsg_cancel(skb, nlh);
+nlmsg_failure:
+ kfree_skb(skb);
+ return -1;
+}
+
static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
{
struct net *net = sock_net(skb->sk);
@@ -1639,6 +1712,10 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
ret = dcbnl_setfeatcfg(netdev, tb, pid, nlh->nlmsg_seq,
nlh->nlmsg_flags);
goto out;
+ case DCB_CMD_CEE_GET:
+ ret = dcbnl_cee_get(netdev, tb, pid, nlh->nlmsg_seq,
+ nlh->nlmsg_flags);
+ goto out;
default:
goto errout;
}
--
1.7.3.5
^ permalink raw reply related
* [PATCH net-next-2.6 v2 1/2] dcbnl: add support for retrieving peer configuration - ieee
From: Shmulik Ravid @ 2011-02-27 15:04 UTC (permalink / raw)
To: davem; +Cc: John Fastabend, Eilon Greenstein, netdev
These 2 patches add the support for retrieving the remote or peer DCBX
configuration via dcbnl for embedded DCBX stacks. The peer configuration
is part of the DCBX MIB and is useful for debugging and diagnostics of
the overall DCB configuration. The first patch add this support for IEEE
802.1Qaz standard the second patch add the same support for the older
CEE standard. Diff for v2 - the peer-app-info is CEE specific.
Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
---
include/linux/dcbnl.h | 28 ++++++++++++++++++++
include/net/dcbnl.h | 6 ++++
net/dcb/dcbnl.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 103 insertions(+), 0 deletions(-)
diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 4c5b26e..2542685 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -110,6 +110,20 @@ struct dcb_app {
__u16 protocol;
};
+/**
+ * struct dcb_peer_app_info - APP feature information sent by the peer
+ *
+ * @willing: willing bit in the peer APP tlv
+ * @error: error bit in the peer APP tlv
+ *
+ * In addition to this information the full peer APP tlv also contains
+ * a table of 'app_count' APP objects defined above.
+ */
+struct dcb_peer_app_info {
+ __u8 willing;
+ __u8 error;
+};
+
struct dcbmsg {
__u8 dcb_family;
__u8 cmd;
@@ -235,11 +249,25 @@ enum dcbnl_attrs {
DCB_ATTR_MAX = __DCB_ATTR_ENUM_MAX - 1,
};
+/**
+ * enum ieee_attrs - IEEE 802.1Qaz get/set attributes
+ *
+ * @DCB_ATTR_IEEE_UNSPEC: unspecified
+ * @DCB_ATTR_IEEE_ETS: negotiated ETS configuration
+ * @DCB_ATTR_IEEE_PFC: negotiated PFC configuration
+ * @DCB_ATTR_IEEE_APP_TABLE: negotiated APP configuration
+ * @DCB_ATTR_IEEE_PEER_ETS: peer ETS configuration - get only
+ * @DCB_ATTR_IEEE_PEER_PFC: peer PFC configuration - get only
+ * @DCB_ATTR_IEEE_PEER_APP: peer APP tlv - get only
+ */
enum ieee_attrs {
DCB_ATTR_IEEE_UNSPEC,
DCB_ATTR_IEEE_ETS,
DCB_ATTR_IEEE_PFC,
DCB_ATTR_IEEE_APP_TABLE,
+ DCB_ATTR_IEEE_PEER_ETS,
+ DCB_ATTR_IEEE_PEER_PFC,
+ DCB_ATTR_IEEE_PEER_APP,
__DCB_ATTR_IEEE_MAX
};
#define DCB_ATTR_IEEE_MAX (__DCB_ATTR_IEEE_MAX - 1)
diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index a8e7852..7b7180e 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -43,6 +43,8 @@ struct dcbnl_rtnl_ops {
int (*ieee_setpfc) (struct net_device *, struct ieee_pfc *);
int (*ieee_getapp) (struct net_device *, struct dcb_app *);
int (*ieee_setapp) (struct net_device *, struct dcb_app *);
+ int (*ieee_peer_getets) (struct net_device *, struct ieee_ets *);
+ int (*ieee_peer_getpfc) (struct net_device *, struct ieee_pfc *);
/* CEE std */
u8 (*getstate)(struct net_device *);
@@ -77,6 +79,10 @@ struct dcbnl_rtnl_ops {
u8 (*getdcbx)(struct net_device *);
u8 (*setdcbx)(struct net_device *, u8);
+ /* peer apps */
+ int (*peer_getappinfo)(struct net_device *, struct dcb_peer_app_info *,
+ u16 *);
+ int (*peer_getapptable)(struct net_device *, struct dcb_app *);
};
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index d5074a5..2e6dcf2 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1224,6 +1224,54 @@ err:
return err;
}
+static int dcbnl_build_peer_app(struct net_device *netdev, struct sk_buff* skb)
+{
+ struct dcb_peer_app_info info;
+ struct dcb_app *table = NULL;
+ const struct dcbnl_rtnl_ops *ops = netdev->dcbnl_ops;
+ u16 app_count;
+ int err;
+
+
+ /**
+ * retrieve the peer app configuration form the driver. If the driver
+ * handlers fail exit without doing anything
+ */
+ err = ops->peer_getappinfo(netdev, &info, &app_count);
+ if (!err && app_count) {
+ table = kmalloc(sizeof(struct dcb_app) * app_count, GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+
+ err = ops->peer_getapptable(netdev, table);
+ }
+
+ if (!err) {
+ u16 i;
+ struct nlattr *app;
+
+ /**
+ * build the message, from here on the only possible failure
+ * is due to the skb size
+ */
+ err = -EMSGSIZE;
+
+ app = nla_nest_start(skb, DCB_ATTR_IEEE_PEER_APP);
+ if (!app)
+ goto nla_put_failure;
+
+ for (i = 0; i < app_count; i++)
+ NLA_PUT(skb, DCB_ATTR_IEEE_APP, sizeof(struct dcb_app),
+ &table[i]);
+
+ nla_nest_end(skb, app);
+ }
+ err = 0;
+
+nla_put_failure:
+ kfree(table);
+ return err;
+}
/* Handle IEEE 802.1Qaz GET commands. */
static int dcbnl_ieee_get(struct net_device *netdev, struct nlattr **tb,
@@ -1288,6 +1336,27 @@ static int dcbnl_ieee_get(struct net_device *netdev, struct nlattr **tb,
spin_unlock(&dcb_lock);
nla_nest_end(skb, app);
+ /* get peer info if available */
+ if (ops->ieee_peer_getets) {
+ struct ieee_ets ets;
+ err = ops->ieee_peer_getets(netdev, &ets);
+ if (!err)
+ NLA_PUT(skb, DCB_ATTR_IEEE_PEER_ETS, sizeof(ets), &ets);
+ }
+
+ if (ops->ieee_peer_getpfc) {
+ struct ieee_pfc pfc;
+ err = ops->ieee_peer_getpfc(netdev, &pfc);
+ if (!err)
+ NLA_PUT(skb, DCB_ATTR_IEEE_PEER_PFC, sizeof(pfc), &pfc);
+ }
+
+ if (ops->peer_getappinfo && ops->peer_getapptable) {
+ err = dcbnl_build_peer_app(netdev, skb);
+ if (err)
+ goto nla_put_failure;
+ }
+
nla_nest_end(skb, ieee);
nlmsg_end(skb, nlh);
--
1.7.3.5
^ permalink raw reply related
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-27 12:58 UTC (permalink / raw)
To: Jay Vosburgh
Cc: nicolas.2p.debian, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy, Fischer, Anna
In-Reply-To: <27369.1298749377@death>
Sat, Feb 26, 2011 at 08:42:57PM CET, fubar@us.ibm.com wrote:
>Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:
>
>>Le 22/02/2011 00:20, Nicolas de Pesloüan a écrit :
>>
>>> After checking every protocol handlers installed by dev_add_pack(), it
>>> appears that only 4 of them really use the orig_dev parameter given by
>>> __netif_receive_skb():
>>>
>>> - bond_3ad_lacpdu_recv() @ drivers/net/bonding/bond_3ad.c
>>> - bond_arp_recv() @ drivers/net/bonding/bond_main.c
>>> - packet_rcv() @ net/packet/af_packet.c
>>> - tpacket_rcv() @ net/packet/af_packet.c
>>>
>>> From the bonding point of view, the meaning of orig_dev is obviously
>>> "the device one layer below the bonding device, through which the packet
>>> reached the bonding device". It is used by bond_3ad_lacpdu_recv() and
>>> bond_arp_recv(), to find the underlying slave device through which the
>>> LACPDU or ARP was received. (The protocol handler is registered at the
>>> bonding device level).
>>>
>>> From the af_packet point of view, the meaning is documented (in commit
>>> "[AF_PACKET]: Add option to return orig_dev to userspace") as the
>>> "physical device [that] actually received the traffic, instead of having
>>> the encapsulating device hide that information."
>>>
>>> When the bonding device is just one level above the physical device, the
>>> two meanings happen to match the same device, by chance.
>>>
>>> So, currently, a bonding device cannot stack properly on top of anything
>>> but physical devices. It might not be a problem today, but may change in
>>> the future...
>>
>>Hi Jay,
>>
>>Still thinking about this orig_dev stuff, I wonder why the protocol
>>handlers used in bonding (bond_3ad_lacpdu_recv() and bond_arp_rcv()) are
>>registered at the master level instead of at the slave level ?
>>
>>If they were registered at the slave level, they would simply receive
>>skb->dev as the ingress interface and use this value instead of needing
>>the orig_dev value given to them when they are registered at the master
>>level.
>>
>>As orig_dev is only used by bonding and by af_packet, but they disagree on
>>the exact meaning of orig_dev, one way to fix this discrepancy would be to
>>remove one of the usage. As the af_packet usage is exposed to user space,
>>bonding seems the right place to stop using orig_dev, even if orig_dev was
>>introduced for bonding :-)
>>
>>I understand that this would add one entry per slave device to the
>>ptype_base list, but this seems to be the only bad effect of registering
>>at the slave level. Can you confirm that this was the reason to register
>>at the master level instead?
>
> My recollection is that it was done the way it is because there
>was no "orig_dev" delivery logic at the time. A handler registered to a
>slave dev would receive no packets at all because assignment of skb->dev
>to the master happened first, and the "orig_dev" knowledge was lost.
>
> When 802.3ad was added, a skb->real_dev field was created, but
>it wasn't used for delivery. 802.3ad used real_dev to figure out which
>slave a LACPDU arrived on. The skb->real_dev was eventually replaced
>with the orig_dev business that's there now.
>
> Later, I did the arp_validate stuff the same way as 802.3ad
>because it worked and was easier than registering a handler per slave.
>
>>If you think registering at the slave level would cause too much impact on
>>ptype_base, then we might have another way to stop using orig_dev for
>>bonding:
>>
>>In __skb_bond_should_drop(), we already test for the two interesting protocols:
>>
>>if ((dev->priv_flags & IFF_SLAVE_NEEDARP) && skb->protocol == __cpu_to_be16(ETH_P_ARP))
>> return 0;
>>
>>if (master->priv_flags & IFF_MASTER_8023AD && skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>> return 0;
>>
>>Would it be possible to call the right handlers directly from inside
>>__skb_bond_should_drop() then let __skb_bond_should_drop() return 1
>>("should drop") after processing the frames that are only of interest for
>>bonding?
>
> Isn't one purpose of switching to rx_handler that there won't
>need to be any skb_bond_should_drop logic in __netif_receive_skb at all?
Yes, that (hopefully most) would be eventually removed.
>
> Still, if you're just trying to simplify __netif_receive_skb
>first, I don't see any reason not to register the packet handlers at the
>slave level. Looking at the ptype_base hash, I don't think that the
>protocols bonding is registering (ARP and SLOW) will hash collide with
>IP or IPv6, so I suspect there won't be much impact.
>
> Once an rx_handler is used, then I suspect there's no need for
>the packet handlers at all, since the rx_handler is within bonding and
>can just deal with the ARP or LACPDU directly.
That is very true. And given that af_packet uses orig_dev to obtain
ifindex, it can be replaced by skb->skb_iif. That way we can get rid of
orig_dev parameter for good.
So I suggest to take V3 of my patch now and do multiple follow-on
patches to get us where we want to get.
Thanks
>
> -J
>
>---
> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply
* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Vasiliy Kulikov @ 2011-02-27 11:44 UTC (permalink / raw)
To: David Miller
Cc: bhutchings, netdev, linux-kernel, kuznet, pekkas, jmorris,
yoshfuji, kaber, eric.dumazet, therbert, xiaosuo, jesse,
kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <20110225.111606.115927805.davem@davemloft.net>
David,
On Fri, Feb 25, 2011 at 11:16 -0800, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 25 Feb 2011 19:07:59 +0000
>
> > You realise that module loading doesn't actually run in the context of
> > request_module(), right?
>
> Why is that a barrier? We could simply pass a capability mask into
> request_module if necessary.
>
> It's an implementation detail, and not a deterrant to my suggested
> scheme.
Let's discuss your scheme. AFAIU, you suggest to change:
1. a) request_module("%s", devname) =>
request_module_with_caps(CAP_NET_ADMIN, "%s", devname)
b) call_usermodehelper() => call_usermodehelper_with_caps()
c) add some bits/sections into kernel module image indicating that
this module is safe to be loaded via CAP_NET_ADMIN
d) run modprobe with CAP_NET_ADMIN only
e) in load_module() check whether (the process has CAP_SYS_MODULE) or
(the process has CAP_NET_ADMIN and bit SAFE_NET_MODULE is raised in
the module image)
This obviously doesn't work - the kernel is not able to verify whether
the bit/section is not malformed by user with CAP_NET_ADMIN.
-OR-
1. a) request_module("%s", devname) => request_module_with_argument("--netdev", "%s", devname)
b) patch modprobe to add "--netmodule-only" argument (or bitmask,
whatever), this would indicate that only net/** modules may be loaded.
Then the things are still broken - a user has to update modprobe
together with the kernel, otherwise the updated kernel would call
"modprobe" with unsupported argument and even "sit0" wouldn't work.
Additionally this touches module loading process, which is not buggy.
Or you propose something else besides these 2 ways? Please clarify.
Thanks,
--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments
^ permalink raw reply
* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-27 11:06 UTC (permalink / raw)
To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev
In-Reply-To: <20110227110205.GE9763@canuck.infradead.org>
On Sun, Feb 27, 2011 at 06:02:05AM -0500, Thomas Graf wrote:
>
> I still suggest to merge this patch as a immediate workaround fix
> until we scale properly on a single socket and also as a workaround
> for applications which can't get rid of their per socket mutex quickly.
I disagree completely.
This patch adds a user-space API that we will have to carry
with us for perpetuity. I would only support this if we had
no other way around the problem.
If this does turn out to be mostly due to sendmsg contention
then fixing it is going to be much simpler than making the UDP
stack multiqueue capable.
I'm working on this right now.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-27 10:55 UTC (permalink / raw)
To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <AANLkTinyE10wtM_xJsufT_3s3hvti7CN+9nyqScWa6SA@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]
Quoting Albert Cahalan <acahalan@gmail.com>:
> On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
>>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>>
>>> > Nanoseconds seems fine; it's unlikely you'd ever want
>>> > more than 4.2 seconds (32-bit unsigned) of queue.
> ...
>> Problem is some machines have slow High Resolution timing services.
>>
>> _If_ we have a time limit, it will probably use the low resolution (aka
>> jiffies), unless high resolution services are cheap.
>
> As long as that is totally internal to the kernel and never
> getting exposed by some API for setting the amount, sure.
>
>> I was thinking not having an absolute hard limit, but an EWMA based one.
>
> The whole point is to prevent stale packets, especially to prevent
> them from messing with TCP, so I really don't think so. I suppose
> you do get this to some extent via early drop.
I made simple hack on sch_fifo with per packet time limits
(attachment) this weekend and have been doing limited testing on
wireless link. I think hardlimit is fine, it's simple and does
somewhat same as what packet(-hard)limited buffer does, drops packets
when buffer is 'full'. My hack checks for timed out packets on
enqueue, might be wrong approach (on other hand might allow some more
burstiness).
-Jussi
[-- Attachment #2: sch_fifo_to.c --]
[-- Type: text/x-csrc, Size: 6138 bytes --]
/*
* sch_fifo_timeout.c Simple FIFO queue with per packet timeout.
*
* This program is free software; you can redistribute it and/or modify it under
* the terms of the GNU General Public License as published by the Free Software
* Foundation; either version 2 of the License, or (at your option) any later
* version.
*
*/
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>
#define DEFAULT_TIMEOUT_PKT_MS 10
#define DEFAULT_TIMEOUT_PKT PSCHED_NS2TICKS((u64)NSEC_PER_SEC * \
DEFAULT_TIMEOUT_PKT_MS / 1000)
struct tc_fifo_timeout_qopt {
__u64 timeout; /* Max time packet may stay in buffer */
__u32 limit; /* Queue length: bytes for bfifo, packets for pfifo */
};
struct fifo_timeout_skb_cb {
psched_time_t time_queued;
};
struct fifo_timeout_sched_data {
psched_tdiff_t timeout;
u32 limit;
};
static inline
struct fifo_timeout_skb_cb *fifo_timeout_skb_cb(struct sk_buff *skb)
{
BUILD_BUG_ON(sizeof(skb->cb) <
sizeof(struct qdisc_skb_cb) +
sizeof(struct fifo_timeout_skb_cb));
return (struct fifo_timeout_skb_cb *)qdisc_skb_cb(skb)->data;
}
static void pfifo_timeout_drop_timedout_packets(struct Qdisc *sch,
psched_time_t now)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
check_next:
skb = qdisc_peek_head(sch);
if (likely(!skb))
return;
if (likely(fifo_timeout_skb_cb(skb)->time_queued + q->timeout > now))
return;
__qdisc_queue_drop_head(sch, &sch->q);
sch->qstats.drops++;
goto check_next;
}
static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
if (likely(skb_queue_len(&sch->q) < q->limit))
return qdisc_enqueue_tail(skb, sch);
/* queue full, remove one skb to fulfill the limit */
__qdisc_queue_drop_head(sch, &sch->q);
sch->qstats.drops++;
qdisc_enqueue_tail(skb, sch);
return NET_XMIT_CN;
}
static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
return qdisc_enqueue_tail(skb, sch);
return qdisc_reshape_fail(skb, sch);
}
static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
if (likely(skb_queue_len(&sch->q) < q->limit))
return qdisc_enqueue_tail(skb, sch);
return qdisc_reshape_fail(skb, sch);
}
static int pfifo_timeout_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
psched_time_t now = psched_get_time();
fifo_timeout_skb_cb(skb)->time_queued = now;
pfifo_timeout_drop_timedout_packets(sch, now);
return pfifo_tail_enqueue(skb, sch);
}
static int bfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
psched_time_t now = psched_get_time();
fifo_timeout_skb_cb(skb)->time_queued = now;
pfifo_timeout_drop_timedout_packets(sch, now);
return bfifo_enqueue(skb, sch);
}
static int pfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
psched_time_t now = psched_get_time();
fifo_timeout_skb_cb(skb)->time_queued = now;
pfifo_timeout_drop_timedout_packets(sch, now);
return pfifo_enqueue(skb, sch);
}
static int fifo_timeout_init(struct Qdisc *sch, struct nlattr *opt)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
if (opt == NULL) {
u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;
q->limit = limit;
q->timeout = DEFAULT_TIMEOUT_PKT;
} else {
struct tc_fifo_timeout_qopt *ctl = nla_data(opt);
if (nla_len(opt) < sizeof(*ctl))
return -EINVAL;
q->limit = ctl->limit;
q->timeout = ctl->timeout ? : DEFAULT_TIMEOUT_PKT;
}
return 0;
}
static int fifo_timeout_dump(struct Qdisc *sch, struct sk_buff *skb)
{
struct fifo_timeout_sched_data *q = qdisc_priv(sch);
struct tc_fifo_timeout_qopt opt = {
.limit = q->limit,
.timeout = q->timeout
};
NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
return skb->len;
nla_put_failure:
return -1;
}
static struct Qdisc_ops pfifo_timeout_qdisc_ops __read_mostly = {
.id = "pfifo_timeout",
.priv_size = sizeof(struct fifo_timeout_sched_data),
.enqueue = pfifo_timeout_enqueue,
.dequeue = qdisc_dequeue_head,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop,
.init = fifo_timeout_init,
.reset = qdisc_reset_queue,
.change = fifo_timeout_init,
.dump = fifo_timeout_dump,
.owner = THIS_MODULE,
};
static struct Qdisc_ops bfifo_timeout_qdisc_ops __read_mostly = {
.id = "bfifo_timeout",
.priv_size = sizeof(struct fifo_timeout_sched_data),
.enqueue = bfifo_timeout_enqueue,
.dequeue = qdisc_dequeue_head,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop,
.init = fifo_timeout_init,
.reset = qdisc_reset_queue,
.change = fifo_timeout_init,
.dump = fifo_timeout_dump,
.owner = THIS_MODULE,
};
static struct Qdisc_ops pfifo_head_drop_timeout_qdisc_ops __read_mostly = {
.id = "pfifo_hd_tout",
.priv_size = sizeof(struct fifo_timeout_sched_data),
.enqueue = pfifo_timeout_tail_enqueue,
.dequeue = qdisc_dequeue_head,
.peek = qdisc_peek_head,
.drop = qdisc_queue_drop_head,
.init = fifo_timeout_init,
.reset = qdisc_reset_queue,
.change = fifo_timeout_init,
.dump = fifo_timeout_dump,
.owner = THIS_MODULE,
};
static int __init fifo_timeout_module_init(void)
{
int retval;
retval = register_qdisc(&pfifo_timeout_qdisc_ops);
if (retval)
goto cleanup;
retval = register_qdisc(&bfifo_timeout_qdisc_ops);
if (retval)
goto cleanup;
retval = register_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
if (retval)
goto cleanup;
return 0;
cleanup:
unregister_qdisc(&pfifo_timeout_qdisc_ops);
unregister_qdisc(&bfifo_timeout_qdisc_ops);
unregister_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
return retval;
}
static void __exit fifo_timeout_module_exit(void)
{
unregister_qdisc(&pfifo_timeout_qdisc_ops);
unregister_qdisc(&bfifo_timeout_qdisc_ops);
unregister_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
}
module_init(fifo_timeout_module_init)
module_exit(fifo_timeout_module_exit)
MODULE_LICENSE("GPL");
^ permalink raw reply
* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-27 11:02 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
netdev
In-Reply-To: <20110226005718.GA19889@gondor.apana.org.au>
On Sat, Feb 26, 2011 at 08:57:18AM +0800, Herbert Xu wrote:
> I'm fairly certain the bottleneck is indeed in the kernel, and
> in the UDP stack in particular.
>
> This is born out by a test where I used two named worker threads,
> both working on the same socket. Stracing shows that they're
> working flat out only doing sendmsg/recvmsg.
>
> The result was that they obtained (in aggregate) half the throughput
> of a single worker thread.
I agree. This is the bottleneck that I described were the kernel is
not able to deliver enough queries for BIND to show the lock
contention issues.
But there is also the situation where netperf RR performance numbers
indicate a mugh higher kernel capability but BIND is not able to
deliver more even though the CPU utilization is very low. This is
the situation where we see the large number of futex calls indicating
the lock contention due to too many queries on a single socket.
> Which is why I'm quite skeptical about this REUSEPORT patch as
> IMHO the only reason it produces a great result is solely because
> it is allowing parallel sends going out.
>
> Rather than modifying all UDP applications out there to fix what
> is fundamentally a kernel problem, I think what we should do is
> fix the UDP stack so that it actually scales.
I am not suggesting that this is the ultimate and final fix for this
problem. It is fixing a symptom rather than fixing the cause but
sometimes being able to fix the symptom becomes really handy :-)
Adding SO_REUSEPORT does not prevent us from fixing the UDP stack
in the long run.
> It isn't all that hard since the easy way would be to only take
> the lock if we're already corked or about to cork.
>
> For the receive side we also don't need REUSEPORT as we can simply
> make our UDP stack multiqueue.
OK, it is not required and there is definitely a better way to fix
the kernel bottleneck in the long term. Even better.
I still suggest to merge this patch as a immediate workaround fix
until we scale properly on a single socket and also as a workaround
for applications which can't get rid of their per socket mutex quickly.
^ permalink raw reply
* Re: EPT: Misconfiguration
From: Avi Kivity @ 2011-02-27 10:46 UTC (permalink / raw)
To: Ruben Kerkhof; +Cc: Marcelo Tosatti, kvm, netdev
In-Reply-To: <AANLkTiknMneQtYqgmX7gvXsMoSO-yiLXr-dwbEej80Uy@mail.gmail.com>
Copying netdev: looks like memory corruption in the networking stack.
Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the
attachment).
On 02/24/2011 11:15 PM, Ruben Kerkhof wrote:
> >
> > On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti<mtosatti@redhat.com> wrote:
>
> >> This and the others reported. So yes, it looks something is corrupting
> >> memory. Ruben, you can try to boot with slub_debug=ZFPU kernel option.
>
> Ok, there are now only 6 vms left on this host, and I've booted it
> with the slub_debug=ZFPU option.
> After a few hours, I got the following result:
>
> 2011-02-24T21:41:30.818496+01:00 phy005 kernel:
> =============================================================================
> 2011-02-24T21:41:30.818517+01:00 phy005 kernel: BUG kmalloc-2048 (Not
> tainted): Object padding overwritten
> 2011-02-24T21:41:30.818523+01:00 phy005 kernel:
> -----------------------------------------------------------------------------
> 2011-02-24T21:41:30.818526+01:00 phy005 kernel:
> 2011-02-24T21:41:30.818530+01:00 phy005 kernel: INFO:
> 0xffff8806230752ca-0xffff8806230752cf. First byte 0x0 instead of 0x5a
> 2011-02-24T21:41:30.818534+01:00 phy005 kernel: INFO: Allocated in
> __netdev_alloc_skb+0x34/0x51 age=2231 cpu=8 pid=0
> 2011-02-24T21:41:30.818537+01:00 phy005 kernel: INFO: Freed in
> skb_release_data+0xc9/0xce age=2368 cpu=8 pid=2159
> 2011-02-24T21:41:30.818541+01:00 phy005 kernel: INFO: Slab
> 0xffffea00157a9880 objects=15 used=13 fp=0xffff8806230752d0
> flags=0x40000000004083
> 2011-02-24T21:41:30.818545+01:00 phy005 kernel: INFO: Object
> 0xffff880623074a88 @offset=19080 fp=0xffff8806230752d0
>
> The rest of the output is attached since it's quite large.
>
> Kind regards,
>
> Ruben
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* [PATCH nex-next] netdevice: make initial group visible to userspace
From: Vlad Dogaru @ 2011-02-27 8:39 UTC (permalink / raw)
To: NetDev; +Cc: Stephen Hemminger, David Miller, Patrick McHardy
In-Reply-To: <20110225124345.0d691789@nehalam>
On Fri, Feb 25, 2011 at 12:43:45PM -0800, Stephen Hemminger wrote:
> On Wed, 2 Feb 2011 20:23:40 +0200
> Vlad Dogaru <ddvlad@rosedu.org> wrote:
>
> > User can specify device group to list by using the group keyword:
> >
> > ip link show group test
> >
> > If no group is specified, 0 (default) is implied.
> >
> > Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
>
> I applied this to net-next for iproute2
> but INIT_NETDEV_GROUP is in a part of netdevice.h that is not exported
> (ie inside #ifdef KERNEL).
Sorry, here is a patch for net-next that fixes the issue:
[PATCH net-next] netdevice: make initial group visible to userspace
INIT_NETDEV_GROUP is needed by userspace, move it outside __KERNEL__
guards.
Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
---
include/linux/netdevice.h | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ffe56c1..8be4056 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -75,9 +75,6 @@ struct wireless_dev;
#define NET_RX_SUCCESS 0 /* keep 'em coming, baby */
#define NET_RX_DROP 1 /* packet dropped */
-/* Initial net device group. All devices belong to group 0 by default. */
-#define INIT_NETDEV_GROUP 0
-
/*
* Transmit return codes: transmit return codes originate from three different
* namespaces:
@@ -141,6 +138,9 @@ static inline bool dev_xmit_complete(int rc)
#define MAX_ADDR_LEN 32 /* Largest hardware address length */
+/* Initial net device group. All devices belong to group 0 by default. */
+#define INIT_NETDEV_GROUP 0
+
#ifdef __KERNEL__
/*
* Compute the worst case header length according to the protocols
--
1.7.1
^ permalink raw reply related
* Re: txqueuelen has wrong units; should be time
From: Albert Cahalan @ 2011-02-27 8:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <1298793252.8726.45.camel@edumazet-laptop>
On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>
>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> > more than 4.2 seconds (32-bit unsigned) of queue.
...
> Problem is some machines have slow High Resolution timing services.
>
> _If_ we have a time limit, it will probably use the low resolution (aka
> jiffies), unless high resolution services are cheap.
As long as that is totally internal to the kernel and never
getting exposed by some API for setting the amount, sure.
> I was thinking not having an absolute hard limit, but an EWMA based one.
The whole point is to prevent stale packets, especially to prevent
them from messing with TCP, so I really don't think so. I suppose
you do get this to some extent via early drop.
^ permalink raw reply
* Re: txqueuelen has wrong units; should be time
From: Eric Dumazet @ 2011-02-27 7:54 UTC (permalink / raw)
To: Mikael Abrahamsson; +Cc: Albert Cahalan, linux-kernel, netdev
In-Reply-To: <alpine.DEB.1.10.1102270758580.11974@uplift.swm.pp.se>
Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>
> > Nanoseconds seems fine; it's unlikely you'd ever want
> > more than 4.2 seconds (32-bit unsigned) of queue.
>
> I think this is shortsighted and I'm sure someone will come up with a case
> where 4.2 seconds isn't enough. Let's not build in those kinds of
> limitations from start.
>
> Why not make it 64bit and go to picoseconds from start?
>
> If you need to make it 32bit unsigned, I'd suggest to start from
> microseconds instead. It's less likely someone would want less than a
> microsecond of queue, than someone wanting more than 4.2 seconds of queue.
>
32 or 64 bits doesnt matter a lot. At Qdisc stage we have up to 40 bytes
available in skb->sb[] for our usage.
Problem is some machines have slow High Resolution timing services.
_If_ we have a time limit, it will probably use the low resolution (aka
jiffies), unless high resolution services are cheap.
I was thinking not having an absolute hard limit, but an EWMA based one.
^ permalink raw reply
* Re: txqueuelen has wrong units; should be time
From: Mikael Abrahamsson @ 2011-02-27 7:02 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTimd5GQwtUFP2fD_An=M8ajBD8DJpzxQJezv8fB8@mail.gmail.com>
On Sun, 27 Feb 2011, Albert Cahalan wrote:
> Nanoseconds seems fine; it's unlikely you'd ever want
> more than 4.2 seconds (32-bit unsigned) of queue.
I think this is shortsighted and I'm sure someone will come up with a case
where 4.2 seconds isn't enough. Let's not build in those kinds of
limitations from start.
Why not make it 64bit and go to picoseconds from start?
If you need to make it 32bit unsigned, I'd suggest to start from
microseconds instead. It's less likely someone would want less than a
microsecond of queue, than someone wanting more than 4.2 seconds of queue.
--
Mikael Abrahamsson email: swmike@swm.pp.se
^ permalink raw reply
* Re: [patch 1/1] [PATCH] qeth: remove needless IPA-commands in offline
From: David Miller @ 2011-02-27 6:41 UTC (permalink / raw)
To: frank.blaschka; +Cc: netdev, linux-s390, ursula.braun
In-Reply-To: <20110218142343.763210392@de.ibm.com>
From: frank.blaschka@de.ibm.com
Date: Fri, 18 Feb 2011 15:22:59 +0100
> From: Ursula Braun <ursula.braun@de.ibm.com>
>
> If a qeth device is set offline, data and control subchannels are
> cleared, which means removal of all IP Assist Primitive settings
> implicitly. There is no need to delete those settings explicitly.
> This patch removes all IP Assist invocations from offline.
>
> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Applied.
^ permalink raw reply
* Re: net-next: warnings from sysctl_net_exit
From: David Miller @ 2011-02-27 6:23 UTC (permalink / raw)
To: shemminger; +Cc: adobriyan, netdev
In-Reply-To: <20110226165601.48858003@nehalam>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 26 Feb 2011 16:56:01 -0800
> Seeing lots of these messages in dmesg. Something is broken
> recently in net-next.
Did you by change pull plain net-2.6 into that tree? Because one
commit which is in net-2.6 but not in net-next-2.6 catches my eye:
commit c486da34390846b430896a407b47f0cea3a4189c
Author: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Date: Thu Feb 24 19:48:03 2011 +0000
sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush
Before this patch issuing these commands:
fd = open("/proc/sys/net/ipv6/route/flush")
unshare(CLONE_NEWNET)
write(fd, "stuff")
would flush the newly created net, not the original one.
The equivalent ipv4 code is correct (stores the net inside ->extra1).
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox