* 2.6.34-rc3-git8: Reported regressions 2.6.32 -> 2.6.33
From: Rafael J. Wysocki @ 2010-04-08 22:54 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Maciej Rutecki, Andrew Morton, Linus Torvalds,
Kernel Testers List, Network Development, Linux ACPI,
Linux PM List, Linux SCSI List, Linux Wireless List, DRI
This message contains a list of some post-2.6.32 regressions introduced before
2.6.33, for which there are no fixes in the mainline known to the tracking team.
If any of them have been fixed already, please let us know.
If you know of any other unresolved post-2.6.32 regressions, please let us know
as well and we'll add them to the list. Also, please let us know if any
of the entries below are invalid.
Each entry from the list will be sent additionally in an automatic reply to
this message with CCs to the people involved in reporting and handling the
issue.
Listed regressions statistics:
Date          Total  Pending  Unresolved
----------------------------------------
2010-04-09      140       34          33
2010-03-21      133       38          34
2010-02-21      115       34          27
2010-02-15      112       34          31
2010-02-07       97       27          20
2010-02-01       85       26          21
2010-01-24       75       29          23
2010-01-10       55       33          21
2009-12-29       36       34          27
Unresolved regressions
----------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15733
Subject : Crash when accessing nonexistent GTT entries in i915
Submitter : Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Date : 2010-03-10 22:09 (30 days old)
Message-ID : <1268258994.2183.14.camel@carter>
References : http://marc.info/?l=linux-kernel&m=126825901326111&w=4
Handled-By : Zhenyu Wang <zhenyuw@linux.intel.com>
Andrew Morton <akpm@linux-foundation.org>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15714
Subject : PROBLEM: intelfb driver causes trace
Submitter : Troilo, Domenic <Domenic.Troilo@gwl.ca>
Date : 2010-04-01 14:52 (8 days old)
Message-ID : <CBF11783AA883E41A86C009F3AE5D47F0CAE08C1@GWCORPMAIL4.gwl.bz>
References : http://marc.info/?l=linux-kernel&m=127013359722664&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15710
Subject : NumLock LED stays on after PC poweroff.
Submitter : aceman <acelists@atlas.sk>
Date : 2010-04-07 15:13 (2 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15699
Subject : rt2500usb driver cannot remain connected
Submitter : <abcd@gentoo.org>
Date : 2010-04-05 19:30 (4 days old)
Handled-By : Ivo van Doorn <IvDoorn@gmail.com>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15695
Subject : calling pm-suspend freezes system
Submitter : Werner Lemberg <wl@gnu.org>
Date : 2010-04-05 05:13 (4 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15693
Subject : Plugging or unplugging notebook charger renders Atheros card unusable
Submitter : <registosites1@hotmail.com>
Date : 2010-04-04 21:03 (5 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15604
Subject : r8169: Reports incorrect link information
Submitter : Michael B. Trausch <mike@trausch.us>
Date : 2010-03-22 04:19 (18 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15585
Subject : [Bisected Regression in 2.6.32.8] i915 with KMS enabled causes memory corruption when resuming from suspend-to-disk
Submitter : M. Vefa Bicakci <bicave@superonline.com>
Date : 2010-03-13 5:11 (27 days old)
First-Bad-Commit: http://git.kernel.org/git/linus/d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
Message-ID : <4B9B1E8F.5090806@superonline.com>
References : http://marc.info/?l=linux-kernel&m=126845754409543&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15544
Subject : black screen upon S3 resume, syslog has "render error" and "page table error"
Submitter : Sanjoy Mahajan <sanjoy@mit.edu>
Date : 2010-03-16 00:45 (24 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15534
Subject : 07ca:b808 crashing and breaking usb's
Submitter : Alex Fiestas <alex@eyeos.org>
Date : 2010-03-14 15:56 (26 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15525
Subject : Blank screen after some time, after hibernation/suspend
Submitter : <capsel@matrix.inten.pl>
Date : 2010-03-12 17:24 (28 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15502
Subject : render error detected, EIR: 0x00000010
Submitter : Artem Anisimov <aanisimov@inbox.ru>
Date : 2010-03-10 05:45 (30 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15466
Subject : 2.6.33 dies on modprobe
Submitter : M G Berberich <berberic@fmi.uni-passau.de>
Date : 2010-02-28 22:12 (40 days old)
Message-ID : <20100228221257.GA8858@invalid>
References : http://marc.info/?l=linux-kernel&m=126739570819208&w=2
Handled-By : Américo Wang <xiyou.wangcong@gmail.com>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15465
Subject : 2.6.33 problems
Submitter : werner@guyane.dyn-o-saur.com
Date : 2010-02-27 17:09 (41 days old)
Message-ID : <1267290551.13148@guyane.dyn-o-saur.com>
References : http://marc.info/?l=linux-kernel&m=126729183719672&w=2
Handled-By : Tejun Heo <tj@kernel.org>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15454
Subject : r8169 exits with error -22 since 2.6.33
Submitter : Conrad Kostecki <ConiKost@gmx.de>
Date : 2010-03-05 22:32 (35 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15439
Subject : Laptop does consume more power when booted "cold" -- Thinkpad X200s
Submitter : <johannes.schlatow@googlemail.com>
Date : 2010-03-03 23:09 (37 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15418
Subject : battery status info broken; missing entry in ec_dmi_table for specific MSI hardware (notebook)
Submitter : Tom-Steve Watzke <tswatzke@arcor.de>
Date : 2010-03-01 07:25 (39 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15392
Subject : The kernel does not start up.
Submitter : Kristóf Ralovich <kristof.ralovich@gmail.com>
Date : 2010-02-25 06:52 (43 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15376
Subject : regression (oops) with usb in 2.6.33-rc8
Submitter : Christophe Fergeau <cfergeau@mandriva.com>
Date : 2010-02-23 10:58 (45 days old)
Handled-By : Sarah Sharp <sarah.a.sharp@linux.intel.com>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15317
Subject : Lockdep report while running aplay with pulse as the default
Submitter : Ed Tomlinson <edt@aei.ca>
Date : 2010-02-13 17:17 (55 days old)
Message-ID : <201002131217.10579.edt@aei.ca>
References : http://marc.info/?l=linux-kernel&m=126608146427546&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15311
Subject : Starting pulseaudio causes a NULL pointer hit
Submitter : Ed Tomlinson <edt@aei.ca>
Date : 2010-02-14 23:41 (54 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15305
Subject : Dell video dies when booting
Submitter : David Ronis <ronis@ronispc.chem.mcgill.ca>
Date : 2010-02-14 1:07 (54 days old)
Message-ID : <1266109622.11290.10.camel@montroll.chem.mcgill.ca>
References : http://marc.info/?l=linux-kernel&m=126611098225127&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15278
Subject : lockdep warning for iscsi in 2.6.33-rc6
Submitter : Tao Ma <tao.ma@oracle.com>
Date : 2010-02-09 6:59 (59 days old)
Message-ID : <4B7107CF.3060703@oracle.com>
References : http://marc.info/?l=linux-kernel&m=126569884330200&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15277
Subject : 2.6.33-rc6 crashes on resume
Submitter : Bill Davidsen <davidsen@roadwarrior3.tmr.com>
Date : 2010-02-08 23:03 (60 days old)
Message-ID : <4B70982F.8090208@roadwarrior3.tmr.com>
References : http://marc.info/?l=linux-kernel&m=126567021801935&w=2
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15276
Subject : latest git kernel: general protection fault: 0000 [#1]
Submitter : Markus Trippelsdorf <markus@trippelsdorf.de>
Date : 2010-02-09 8:36 (59 days old)
Message-ID : <20100209083605.GA1766@arch.tripp.de>
References : http://marc.info/?l=linux-kernel&m=126570498804223&w=2
Handled-By : Jérôme Glisse <glisse@freedesktop.org>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15259
Subject : Corruption with OpenGL since Intel's big DRM push on i945
Submitter : Alexandre Demers <papouta@hotmail.com>
Date : 2010-02-08 13:19 (60 days old)
First-Bad-Commit: http://git.kernel.org/git/linus/76446cac68568fc7f5168a27deaf803ed22a4360
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15246
Subject : BUG: Bad page state in process portageq
Submitter : Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Date : 2010-02-07 0:45 (61 days old)
References : http://marc.info/?l=linux-kernel&m=126550356515887&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15244
Subject : PROBLEM: hda-intel divide by zero kernel crash in azx_position_ok()
Submitter : Jody Bruchon <jody@nctritech.com>
Date : 2010-02-06 0:32 (62 days old)
References : http://marc.info/?l=linux-kernel&m=126541276028173&w=2
Handled-By : Takashi Iwai <tiwai@suse.de>
Jody Bruchon <jody@nctritech.com>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15076
Subject : System panic under load with clockevents_program_event
Submitter : okias <d.okias@gmail.com>
Date : 2010-01-17 13:03 (82 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15036
Subject : soft lockup in dmesg after suspend/resume
Submitter : ykzhao <yakui.zhao@intel.com>
Date : 2010-01-04 5:36 (95 days old)
References : http://marc.info/?l=linux-kernel&m=126258356202722&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14950
Subject : tbench regression with 2.6.33-rc1
Submitter : Lin Ming <ming.m.lin@intel.com>
Date : 2009-12-25 11:11 (105 days old)
References : http://marc.info/?l=linux-kernel&m=126174044213172&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14937
Subject : WARNING: at kernel/lockdep.c:2830
Submitter : Grant Wilson <grant.wilson@zen.co.uk>
Date : 2009-12-27 13:35 (103 days old)
References : http://marc.info/?l=linux-kernel&m=126192220404829&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14792
Subject : Misdetection of the TV output
Submitter : Santi <santi@agolina.net>
Date : 2009-12-12 13:28 (118 days old)
First-Bad-Commit: http://git.kernel.org/git/linus/27dfaf4f5825a119305db1bc63bef30ed400e376
Handled-By : Zhao Yakui <yakui.zhao@intel.com>
Regressions with patches
------------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15328
Subject : high load avg, extreme sluggishness on T41 w/ Radeon Mobility M7
Submitter : John W. Linville <linville@tuxdriver.com>
Date : 2010-02-16 20:25 (52 days old)
Handled-By : Francisco Jerez <currojerez@riseup.net>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=25118
For details, please visit the bug entries and follow the links given in
references.
As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions introduced
between 2.6.32 and 2.6.33, unresolved as well as resolved, at:
http://bugzilla.kernel.org/show_bug.cgi?id=14885
Please let the tracking team know if there are any Bugzilla entries that
should be added to the list in there.
Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: linux-next: build failure after merge of the final tree
From: Stephen Rothwell @ 2010-04-08 22:59 UTC (permalink / raw)
To: John Linn
Cc: David Miller, netdev, linux-next, linux-kernel, jtyner,
grant.likely
In-Reply-To: <6125d80a-81d5-4699-ac6e-9408bd0c1145@SG2EHSMHS010.ehs.local>
Hi John,
On Thu, 8 Apr 2010 08:15:12 -0600 John Linn <John.Linn@xilinx.com> wrote:
>
> I'm not pushing back here, just trying to make sure I understand and do
> it better next time :)
>
> I don't see that my patch has touched that part of the driver as that
> call was already in the driver before my patch (but maybe I'm just
> missing it).
>
> My patch did change the dependency in the Kconfig so that it only
> depends on powerpc rather than powerpc DCR and maybe that exposed
> something that wasn't previously exposed.
Yeah, virt_to_bus() is only defined on 32bit PowerPC, not 64 bit.
CONFIG_PPC is set for both 32 and 64 bit PowerPC builds.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
* RE: linux-next: build failure after merge of the final tree
From: John Linn @ 2010-04-08 23:01 UTC (permalink / raw)
To: Stephen Rothwell
Cc: David Miller, netdev, linux-next, linux-kernel, jtyner,
grant.likely
In-Reply-To: <20100409085918.cc984fb1.sfr@canb.auug.org.au>
> -----Original Message-----
> From: Stephen Rothwell [mailto:sfr@canb.auug.org.au]
> Sent: Thursday, April 08, 2010 4:59 PM
> To: John Linn
> Cc: David Miller; netdev@vger.kernel.org; linux-next@vger.kernel.org;
> linux-kernel@vger.kernel.org; jtyner@cs.ucr.edu; grant.likely@secretlab.ca
> Subject: Re: linux-next: build failure after merge of the final tree
>
> Hi John,
>
> On Thu, 8 Apr 2010 08:15:12 -0600 John Linn <John.Linn@xilinx.com> wrote:
> >
> > I'm not pushing back here, just trying to make sure I understand and
> > do it better next time :)
> >
> > I don't see that my patch has touched that part of the driver as that
> > call was already in the driver before my patch (but maybe I'm just
> > missing it).
> >
> > My patch did change the dependency in the Kconfig so that it only
> > depends on powerpc rather than powerpc DCR and maybe that exposed
> > something that wasn't previously exposed.
>
> Yeah, virt_to_bus() is only defined on 32bit PowerPC, not 64 bit.
>
> CONFIG_PPC is set for both 32 and 64 bit PowerPC builds.
> --
Thanks for confirming that. Spun a new patch (set) to hopefully take
care of that.
-- John
> Cheers,
> Stephen Rothwell sfr@canb.auug.org.au
> http://www.canb.auug.org.au/~sfr/
* Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
From: Sridhar Samudrala @ 2010-04-09 0:05 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Tom Lendacky, netdev, kvm@vger.kernel.org
In-Reply-To: <1270488911.27874.43.camel@w-sridhar.beaverton.ibm.com>
On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:
> On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
> > > Make vhost scalable by creating a separate vhost thread per vhost
> > > device. This provides better scaling across multiple guests and with
> > > multiple interfaces in a guest.
> >
> > Thanks for looking into this. An alternative approach is
> > to simply replace create_singlethread_workqueue with
> > create_workqueue which would get us a thread per host CPU.
> >
> > It seems that in theory this should be the optimal approach
> > wrt CPU locality, however, in practice a single thread
> > seems to get better numbers. I have a TODO to investigate this.
> > Could you try looking into this?
>
> Yes. I tried using create_workqueue(), but the results were not good
> at least when the number of guest interfaces is less than the number
> of CPUs. I didn't try more than 8 guests.
> Creating a separate thread per guest interface seems to be more
> scalable based on the testing I have done so far.
>
> I will try some more tests and get some numbers to compare the following
> 3 options.
> - single vhost thread
> - vhost thread per cpu
> - vhost thread per guest virtio interface
Here are the results with netperf TCP_STREAM 64K guest to host on a
8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host
CPU utilization.
Current default single vhost thread
-----------------------------------
1 guest: 12500 37%
2 guests: 12800 46%
3 guests: 12600 47%
4 guests: 12200 47%
5 guests: 12000 47%
6 guests: 11700 47%
7 guests: 11340 47%
8 guests: 11200 48%
vhost thread per cpu
--------------------
1 guest: 4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%
vhost thread per guest interface
--------------------------------
1 guest: 12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%
Thanks
Sridhar
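For readers comparing the three tables above, the scaling relative to the single-thread baseline can be worked out directly from the posted figures. The following is a minimal sketch in C; the bandwidth numbers are copied from the tables, while the array and function names are illustrative assumptions and not part of the vhost patch.

```c
#include <stddef.h>

/* Cumulative bandwidth in Mbps for 1..8 guests, copied from the tables above. */
static const int single_thread_mbps[8] = {12500, 12800, 12600, 12200,
                                          12000, 11700, 11340, 11200};
static const int per_cpu_mbps[8]       = {4900, 10800, 17100, 20400,
                                          21000, 22500, 23500, 24500};
static const int per_iface_mbps[8]     = {12500, 21000, 21600, 21600,
                                          22500, 22800, 24500, 26400};

/* Ratio of a configuration's bandwidth to the single-thread baseline for
 * the given guest count (1-based). Returns 0.0 if the count is out of range.
 * Hypothetical helper, named only for this illustration. */
double speedup_vs_single(const int *mbps, size_t guests)
{
    if (guests < 1 || guests > 8)
        return 0.0;
    return (double)mbps[guests - 1] / (double)single_thread_mbps[guests - 1];
}
```

With these figures, the per-interface configuration reaches roughly 2.36 times the single-thread bandwidth at 8 guests (26400 / 11200), while the per-cpu configuration delivers only about 0.39 times the baseline at 1 guest (4900 / 12500).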
* Re: linux-next: powerpc boot failure
From: Stephen Rothwell @ 2010-04-09 0:08 UTC (permalink / raw)
To: Timo Teräs; +Cc: David Miller, netdev, linux-next, LKML
In-Reply-To: <4BBD966D.8020404@iki.fi>
Hi Timo,
On Thu, 08 Apr 2010 11:40:13 +0300 Timo Teräs <timo.teras@iki.fi> wrote:
>
> Can you try if this helps?
That patch allows my machine to boot.
Thanks.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
* Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
From: Rick Jones @ 2010-04-09 0:14 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: Michael S. Tsirkin, Tom Lendacky, netdev, kvm@vger.kernel.org
In-Reply-To: <1270771542.31186.397.camel@w-sridhar.beaverton.ibm.com>
> Here are the results with netperf TCP_STREAM 64K guest to host on a
> 8-cpu Nehalem system.
I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?
Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a lot!-)
but I find it incomplete and also like to run things like single-instance TCP_RR
and multiple-instance, multiple "transaction" (./configure --enable-burst)
TCP_RR tests, particularly when concerned with "scaling" issues.
happy benchmarking,
rick jones
> It shows cumulative bandwidth in Mbps and host
> CPU utilization.
>
> Current default single vhost thread
> -----------------------------------
> 1 guest: 12500 37%
> 2 guests: 12800 46%
> 3 guests: 12600 47%
> 4 guests: 12200 47%
> 5 guests: 12000 47%
> 6 guests: 11700 47%
> 7 guests: 11340 47%
> 8 guests: 11200 48%
>
> vhost thread per cpu
> --------------------
> 1 guest: 4900 25%
> 2 guests: 10800 49%
> 3 guests: 17100 67%
> 4 guests: 20400 84%
> 5 guests: 21000 90%
> 6 guests: 22500 92%
> 7 guests: 23500 96%
> 8 guests: 24500 99%
>
> vhost thread per guest interface
> --------------------------------
> 1 guest: 12500 37%
> 2 guests: 21000 72%
> 3 guests: 21600 79%
> 4 guests: 21600 85%
> 5 guests: 22500 89%
> 6 guests: 22800 94%
> 7 guests: 24500 98%
> 8 guests: 26400 99%
>
> Thanks
> Sridhar
* Re: re-submit3 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
From: Elina Pasheva @ 2010-04-09 0:39 UTC (permalink / raw)
To: David Miller
Cc: dbrownell-Rn4VEauK+AKRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
Rory Filer, linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20100407.214530.05341654.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
On Wed, 2010-04-07 at 21:45 -0700, David Miller wrote:
> From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
> Date: Mon, 5 Apr 2010 18:39:08 -0700
>
> > Subject: re-submit3 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
> > From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
>
> I want you to tell me exactly how you generated this patch.
>
> It doesn't apply, and I suspect that you tried to fix the excess empty
> lines at the end of certain files by editing the patch by hand.
>
> If so, did you test the result?
>
> The patch is corrupted and more importantly git won't accept it.
>
> davem@sunset:~/src/GIT/net-2.6$ git am --signoff re-submit3-ANNOUNCEMENT-NET-usb-sierra_net.c-driver.patch
> Applying: re-submit3 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
> /home/davem/src/GIT/net-2.6/.git/rebase-apply/patch:36: new blank line at EOF.
> +
> error: drivers/net/usb/sierra_net.c: does not exist in index
> Patch failed at 0001 re-submit3 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
> When you have resolved this problem run "git am --resolved".
> If you would prefer to skip this patch, instead run "git am --skip".
> To restore the original branch and stop patching run "git am --abort".
>
Hi Dave,
I reproduced the problem here.
I fixed it by adding an empty sierra_net.c file in my master branch
(net-2.6).
Regards,
Elina
* linux-next: manual merge of the net tree with Linus' tree
From: Stephen Rothwell @ 2010-04-09 0:41 UTC (permalink / raw)
To: David Miller, netdev; +Cc: linux-next, linux-kernel, chavey
Hi all,
Today's linux-next merge of the net tree got a conflict in
net/core/ethtool.c between commit
5a0e3ad6af8660be21ca98a971cd00f331318c05 ("include cleanup: Update gfp.h
and slab.h includes to prepare for breaking implicit slab.h inclusion
from percpu.h") from Linus' tree and commit
97f8aefbbfb5aa5c9944e5fa8149f1fdaf71c7b6 ("net: fix ethtool coding style
errors and warnings") from the net tree.
Just context changes. I fixed it up (see below) and can carry the fix
for a while.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
diff --cc net/core/ethtool.c
index 9d55c57,99e9f85..0000000
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@@ -18,8 -18,7 +18,8 @@@
#include <linux/ethtool.h>
#include <linux/netdevice.h>
#include <linux/bitops.h>
+ #include <linux/uaccess.h>
+#include <linux/slab.h>
- #include <asm/uaccess.h>
/*
* Some useful ethtool_ops methods that're device independent.
* Re: [PATCH 1/1] add ethtool loopback support
From: Jeff Garzik @ 2010-04-09 0:46 UTC (permalink / raw)
To: Laurent Chavey; +Cc: Ben Hutchings, davem, netdev, therbert
In-Reply-To: <j2r97949e3e1004081543h6258125dm60083556ff28fa88@mail.gmail.com>
On 04/08/2010 06:43 PM, Laurent Chavey wrote:
> On Thu, Apr 8, 2010 at 12:35 PM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
>> On Thu, 2010-04-08 at 12:17 -0700, Laurent Chavey wrote:
>>> On Thu, Apr 8, 2010 at 11:29 AM, Ben Hutchings
>>> <bhutchings@solarflare.com> wrote:
>>>> On Thu, 2010-04-08 at 10:35 -0700, chavey@google.com wrote:
>> [...]
>>>>> +enum ethtool_loopback_type {
>>>>> + ETH_MAC = 0x00000001,
>>>>> + ETH_PHY_INT = 0x00000002,
>>>>> + ETH_PHY_EXT = 0x00000004
>>>>> +};
>>>> [...]
>>>>
>>>> There are many different places you can loop back within a MAC or PHY,
>>>> not to mention bypassing the MAC altogether. See
>>>> drivers/net/sfc/mcdi_pcol.h, starting from the line
>>>> '#define MC_CMD_LOOPBACK_NONE 0'. I believe we implement all of those
>>>> loopback modes on at least one board.
>>>>
>>>> Also are these supposed to be an enumeration or flags? In theory you
>>> those are enums that can be OR'ed together.
>>
>> I.e. they are flags. So how do you answer this:
>>
>>>> could use wire-side and host-side loopback at the same time if they
>>>> don't overlap, but it's probably too much trouble to bother with. Any
>>>> other combination is meaningless.
> since the intent is to enable the sending and receiving of packets at
> the hw/driver interfaces, a simple loopback mode on/off is sufficient
> and the ethtool_loopback_type are not necessary. the implementor can choose
> how to implement the loopback. From drivers/net/sfc/mcdi_pcol.h it is clear
> that the ethtool_loopback_type values as defined are meaningless.
If an off/on switch is sufficient, the existing ethtool flags interface
should work just fine.
Jeff
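The enumeration-versus-flags question raised in the quoted exchange can be illustrated with a short user-space sketch. The enum values are the ones from the quoted patch; the mask macro and the helper function are hypothetical names introduced only for this illustration. Once the values are OR'ed together, the result is no longer one of the enumerators, which is why such values behave as a bit mask rather than an enumeration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as proposed in the quoted patch. */
enum ethtool_loopback_type {
    ETH_MAC     = 0x00000001,
    ETH_PHY_INT = 0x00000002,
    ETH_PHY_EXT = 0x00000004
};

/* All bits defined above (hypothetical helper macro). */
#define LOOPBACK_KNOWN_MASK (ETH_MAC | ETH_PHY_INT | ETH_PHY_EXT)

/* True if exactly one known loopback bit is set in mask.
 * Hypothetical helper: rejects 0, unknown bits, and any combination. */
bool loopback_is_single_mode(uint32_t mask)
{
    if (mask == 0 || (mask & ~(uint32_t)LOOPBACK_KNOWN_MASK))
        return false;
    /* A power of two has exactly one bit set. */
    return (mask & (mask - 1)) == 0;
}
```

A check of this kind is one way a driver could refuse combinations it considers meaningless; whether any combination should be accepted is exactly the open question in the thread.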
* Re: mmotm 2010-04-05-16-09 uploaded
From: Valdis.Kletnieks @ 2010-04-09 0:50 UTC (permalink / raw)
To: Patrick McHardy
Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, David S. Miller,
linux-kernel, netfilter-devel, netdev
In-Reply-To: <4BBDF7E7.708@trash.net>
On Thu, 08 Apr 2010 17:36:07 +0200, Patrick McHardy said:
> Valdis.Kletnieks@vt.edu wrote:
> > Well, it *changed* it. Does the rcu_dereference_check() only fire on the
> > first time it hits something, so we've fixed the first one and now we get to
> > see the second one?
>
> It appears that way, otherwise you should have seen a second warning in
> nf_conntrack_ecache the last time.
>
> > (For what it's worth, if this is going to be one-at-a-time whack-a-mole, I'm
> > OK on that, just want to know up front.)
>
> I went through the other files and I believe this should be it.
> We already removed most of these incorrect rcu_dereference()
> calls a while back.
Confirming - the second version of the patch fixes all the network-related
RCU complaints I've been able to trigger...
* Re: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Stephen Hemminger @ 2010-04-09 0:52 UTC (permalink / raw)
To: Xin, Xiaohui
Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, mingo@elte.hu, mst@redhat.com,
jdike@c2.user-mode-linux.org, davem@davemloft.net
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5C2A@shzsmsx502.ccr.corp.intel.com>
On Tue, 6 Apr 2010 14:26:29 +0800
"Xin, Xiaohui" <xiaohui.xin@intel.com> wrote:
> >How do you deal with the DoS problem of hostile user space app posting huge
> >number of receives and never getting anything.
>
> That's a problem we are trying to deal with. It's critical for the long term.
> Currently, we tried to limit the pages it can pin, but we are not sure how much is reasonable.
> For now, the buffers submitted come from the guest virtio-net driver, so it's safe to some
> extent.
It is critical even now. Once you get past toy benchmarks you will see things like
Java processes with 1000 threads all reading at once.
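The idea of limiting how many pages may be pinned, mentioned in the quoted reply, can be sketched as simple budget accounting. This is a user-space illustration under stated assumptions: the structure, the function names, and the notion of a fixed per-device limit are all hypothetical and are not taken from the patch series, which does not say how the limit is enforced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device accounting. Not thread-safe as written; kernel
 * code would need a lock or atomic counters around these updates. */
struct pin_budget {
    size_t limit;   /* maximum pages that may be pinned at once */
    size_t pinned;  /* pages currently pinned (invariant: pinned <= limit) */
};

/* Charge nr_pages against the budget; refuse if it would exceed the limit. */
bool pin_budget_charge(struct pin_budget *b, size_t nr_pages)
{
    if (nr_pages > b->limit - b->pinned)
        return false;
    b->pinned += nr_pages;
    return true;
}

/* Release pages once a posted receive buffer completes or is cancelled. */
void pin_budget_uncharge(struct pin_budget *b, size_t nr_pages)
{
    b->pinned -= (nr_pages > b->pinned) ? b->pinned : nr_pages;
}
```

With a scheme like this, an application that posts a huge number of receives is refused once the budget is exhausted, rather than pinning memory without bound. Choosing a reasonable limit remains the unresolved part.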
* Re: 2.6.34-rc3-git8: Reported regressions 2.6.32 -> 2.6.33
From: Gertjan van Wingerde @ 2010-04-09 2:36 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, Maciej Rutecki, Andrew Morton,
Linus Torvalds, Kernel Testers List, Network Development,
Linux ACPI, Linux PM List, Linux SCSI List, Linux Wireless List,
DRI
In-Reply-To: <cjwJ7Pf9h0L.A.9pB.1DmvLB@chimera>
On 04/09/10 00:54, Rafael J. Wysocki wrote:
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15699
> Subject : rt2500usb driver cannot remain connected
> Submitter : <abcd@gentoo.org>
> Date : 2010-04-05 19:30 (4 days old)
> Handled-By : Ivo van Doorn <IvDoorn@gmail.com>
>
This one ought to be fixed by commit 9e76ad2a27f592c1390248867391880c7efe78b3
in Linus' tree.
---
Gertjan.
* Re: FEC driver: rcv is not +last
From: Bryan Wu @ 2010-04-09 5:42 UTC (permalink / raw)
To: Matthias Kaehlcke; +Cc: netdev, Sascha Hauer
In-Reply-To: <20100408104033.GI3787@darwin>
On 04/08/2010 06:40 PM, Matthias Kaehlcke wrote:
> hi,
>
> i have problems with the FEC on a i.MX25 3-Stack board. the kernel is
> v2.6.34-rc2 plus the following patch:
> http://patchwork.ozlabs.org/patch/41235/
>
> the following traces are generated at boot time:
>
> FEC Ethernet Driver
> fec: PHY @ 0x1, ID 0x20005ce1 -- unknown PHY!
> ...
Matt,
please try this patch on your hardware; it introduces phylib support in the fec.c
driver:
http://lists.infradead.org/pipermail/linux-arm-kernel/2010-March/012214.html
Thanks
--
Bryan Wu <bryan.wu@canonical.com>
Kernel Developer +86.138-1617-6545 Mobile
Ubuntu Kernel Team | Hardware Enablement Team
Canonical Ltd. www.canonical.com
Ubuntu - Linux for human beings | www.ubuntu.com
* Re: net-next: 2.6.34-rc1 regression: panic when running diagnostic on interface with IPv6
From: Stephen Hemminger @ 2010-04-09 0:54 UTC (permalink / raw)
To: David Miller; +Cc: emil.s.tantilov, netdev
In-Reply-To: <20100405.165317.89399272.davem@davemloft.net>
On Mon, 05 Apr 2010 16:53:17 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
> Date: Mon, 5 Apr 2010 17:50:38 -0600
>
> > David Miller wrote:
> >> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
> >> Date: Mon, 5 Apr 2010 17:03:56 -0600
> >>
> >>> David Miller wrote:
> >>>> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
> >>>> Date: Tue, 23 Mar 2010 12:28:08 -0600
> >>>>
> >>>>> Bisecting points to this patch:
> >>>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commitdiff;h=84e8b803f1e16f3a2b8b80f80a63fa2f2f8a9be6
> >>>>>
> >>>>> And I confirmed that the issue goes away after reverting it.
> >>>>>
> >>>>> Steps to reproduce:
> >>>>> 1. Load the driver and configure IPv6 address.
> >>>>> 2. Run ethtool diag:
> >>>>> ethtool -t eth0
> >>>>>
> >>>>> 3. If this doesn't break it try again, or just do ifdown/up. Other
> >>>>> operations on the interface will eventually panic the system:
> >>>>
> >>>> Stephen please fix this, thanks.
> >>>
> >>> Just FYI - I still see this issue with latest pull from net-2.6.
> >>
> >> It's net-next-2.6 that introduced the problem and has the follow-on
> >> fixes, not net-2.6
> >
> > Same in net-next:
>
> Ok, Stephen please look into this, we've had this regression
> for almost two weeks now.
I can't reproduce it, on e1000e.
Since the symptoms match the original problem that you already fixed,
I have to assume that it is fixed but the reporter (Emil) had not
updated his kernel correctly.
* Re: [v3 Patch 2/3] bridge: make bridge support netpoll
From: Cong Wang @ 2010-04-09 5:43 UTC (permalink / raw)
To: Stephen Hemminger
Cc: linux-kernel, netdev, bridge, Andy Gospodarek, Neil Horman,
Jeff Moyer, Matt Mackall, bonding-devel, Jay Vosburgh,
David Miller
In-Reply-To: <20100408083710.2b61ee44@nehalam>
Stephen Hemminger wrote:
> On Thu, 8 Apr 2010 02:18:58 -0400
> Amerigo Wang <amwang@redhat.com> wrote:
>
>> Based on the previous patch, make bridge support netpoll by:
>>
>> 1) implement the 2 methods to support netpoll for bridge;
>>
>> 2) modify netpoll during forwarding packets via bridge;
>>
>> 3) disable netpoll support of bridge when a netpoll-incapable device
>> is added to bridge;
>>
>> 4) enable netpoll support when all underlying devices support netpoll.
>>
>> Cc: David Miller <davem@davemloft.net>
>> Cc: Neil Horman <nhorman@tuxdriver.com>
>> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
>> Cc: Matt Mackall <mpm@selenic.com>
>> Signed-off-by: WANG Cong <amwang@redhat.com>
>>
>> ---
>>
>> Index: linux-2.6/net/bridge/br_device.c
>> ===================================================================
>> --- linux-2.6.orig/net/bridge/br_device.c
>> +++ linux-2.6/net/bridge/br_device.c
>> @@ -13,8 +13,10 @@
>>
>> #include <linux/kernel.h>
>> #include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> #include <linux/etherdevice.h>
>> #include <linux/ethtool.h>
>> +#include <linux/list.h>
>>
>> #include <asm/uaccess.h>
>> #include "br_private.h"
>> @@ -162,6 +164,59 @@ static int br_set_tx_csum(struct net_dev
>> return 0;
>> }
>>
>> +#ifdef CONFIG_NET_POLL_CONTROLLER
>> +bool br_devices_support_netpoll(struct net_bridge *br)
>> +{
>> + struct net_bridge_port *p;
>> + bool ret = true;
>> + int count = 0;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&br->lock, flags);
>> + list_for_each_entry(p, &br->port_list, list) {
>> + count++;
>> + if (p->dev->priv_flags & IFF_DISABLE_NETPOLL
>> + || !p->dev->netdev_ops->ndo_poll_controller)
>> + ret = false;
>> + }
>> + spin_unlock_irqrestore(&br->lock, flags);
>> + return count != 0 && ret;
>> +}
>> +
>> +static void br_poll_controller(struct net_device *br_dev)
>> +{
>> + struct netpoll *np = br_dev->npinfo->netpoll;
>> +
>> + if (np->real_dev != br_dev)
>> + netpoll_poll_dev(np->real_dev);
>> +}
>> +
>> +void br_netpoll_cleanup(struct net_device *br_dev)
>> +{
>> + struct net_bridge *br = netdev_priv(br_dev);
>> + struct net_bridge_port *p, *n;
>> + const struct net_device_ops *ops;
>> +
>> + br->dev->npinfo = NULL;
>> + list_for_each_entry_safe(p, n, &br->port_list, list) {
>> + if (p->dev) {
>> + ops = p->dev->netdev_ops;
>> + if (ops->ndo_netpoll_cleanup)
>> + ops->ndo_netpoll_cleanup(p->dev);
>> + else
>> + p->dev->npinfo = NULL;
>> + }
>> + }
>> +}
>> +
>> +#else
>> +
>> +void br_netpoll_cleanup(struct net_device *br_dev)
>> +{
>> +}
>> +
>> +#endif
>
> Could you use more stub functions to eliminate #ifdefs in the code?
Probably not, because br_netpoll_cleanup() is the only function that is
called regardless of whether CONFIG_NET_POLL_CONTROLLER is defined.
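For readers unfamiliar with the stub pattern being discussed, a minimal sketch follows. It is an illustration only: MY_FEATURE, feature_poll() and forward() are hypothetical names and do not appear in the patch. The idea is that the header provides an empty inline version when the option is off, so the caller needs no #ifdef of its own.

```c
#include <stdbool.h>

/* Hypothetical switch standing in for CONFIG_NET_POLL_CONTROLLER. */
#ifdef MY_FEATURE
static inline bool feature_poll(int *counter)
{
	(*counter)++;		/* real work when the feature is built in */
	return true;
}
#else
/* Stub: does nothing, so callers need no #ifdef of their own. */
static inline bool feature_poll(int *counter)
{
	(void)counter;
	return false;
}
#endif

/* The caller is free of preprocessor conditionals. */
static int forward(int *counter)
{
	if (feature_poll(counter))
		return 1;	/* handled by the optional feature */
	return 0;		/* normal path */
}
```

Whether this is worth doing here depends on how many call sites would need a stub, which is the point under discussion.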
>> @@ -50,7 +51,13 @@ int br_dev_queue_push_xmit(struct sk_buf
>> else {
>> skb_push(skb, ETH_HLEN);
>>
>> - dev_queue_xmit(skb);
>> +#ifdef CONFIG_NET_POLL_CONTROLLER
>> + if (skb->dev->priv_flags & IFF_IN_NETPOLL) {
>> + netpoll_send_skb(skb->dev->npinfo->netpoll, skb);
>> + skb->dev->priv_flags &= ~IFF_IN_NETPOLL;
>> + } else
>> +#endif
>
> There is no protection on dev->priv_flags for SMP access.
> It would be better to use a bit in dev->state if you are using it as a control flag.
>
> Then you could use
> if (unlikely(test_and_clear_bit(__IN_NETPOLL, &skb->dev->state)))
> netpoll_send_skb(...)
>
Yes? netpoll_send_skb() needs to see that IFF_IN_NETPOLL is set, so
we can't clear this bit before calling it.
But we do need to find a safe way to check/set this flag.
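One possible shape for the ordering described here, sketched in userspace C11 atomics rather than kernel bitops: the flag is tested atomically, the send runs while the flag is still visible, and only then is the flag cleared atomically. All names (struct fake_dev, fake_netpoll_send(), xmit(), IN_NETPOLL_BIT) are hypothetical, and the sketch does not claim to close every race between the test and the clear.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IN_NETPOLL_BIT 0x1u	/* hypothetical flag bit */

struct fake_dev {
	atomic_uint state;	/* stands in for dev->state */
	bool sent_via_netpoll;
};

/* Stand-in for netpoll_send_skb(): it expects the flag to still be set. */
static void fake_netpoll_send(struct fake_dev *dev)
{
	dev->sent_via_netpoll =
		(atomic_load(&dev->state) & IN_NETPOLL_BIT) != 0;
}

/* Test atomically, send while the flag is visible, then clear atomically. */
static bool xmit(struct fake_dev *dev)
{
	if (!(atomic_load(&dev->state) & IN_NETPOLL_BIT))
		return false;	/* normal transmit path */
	fake_netpoll_send(dev);
	atomic_fetch_and(&dev->state, ~IN_NETPOLL_BIT);
	return true;
}
```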
>> static void __br_deliver(const struct net_bridge_port *to, struct sk_buff *skb)
>> {
>> +#ifdef CONFIG_NET_POLL_CONTROLLER
>> + struct net_bridge *br = to->br;
>> + if (br->dev->priv_flags & IFF_IN_NETPOLL) {
>> + struct netpoll *np;
>> + to->dev->npinfo = skb->dev->npinfo;
>> + np = skb->dev->npinfo->netpoll;
>> + np->real_dev = np->dev = to->dev;
>> + to->dev->priv_flags |= IFF_IN_NETPOLL;
>> + }
>> +#endif
>
> This is in the hot path, so use unlikely()
Ok, good point.
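As a reminder of what the hint amounts to, here is a small userspace sketch. The macro my_unlikely() and the function deliver() are hypothetical; the sketch assumes a GCC- or Clang-compatible compiler, since it relies on __builtin_expect.

```c
/* Same idea as the kernel's unlikely(): a branch-prediction hint. */
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Returns 1 on the rare path, 0 on the common one. */
static int deliver(unsigned int flags, unsigned int in_netpoll_bit)
{
	if (my_unlikely(flags & in_netpoll_bit))
		return 1;	/* rare netpoll path */
	return 0;		/* common forwarding path */
}
```

The hint does not change behaviour; it only tells the compiler which branch to lay out as the fall-through.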
>> +#ifdef CONFIG_NET_POLL_CONTROLLER
>> + if (br_devices_support_netpoll(br)) {
>> + br->dev->priv_flags &= ~IFF_DISABLE_NETPOLL;
>> + if (br->dev->npinfo)
>> + dev->npinfo = br->dev->npinfo;
>> + } else if (!(br->dev->priv_flags & IFF_DISABLE_NETPOLL)) {
>> + br->dev->priv_flags |= IFF_DISABLE_NETPOLL;
>> + printk(KERN_INFO "New device %s does not support netpoll\n",
>> + dev->name);
>> + printk(KERN_INFO "Disabling netpoll for %s\n",
>> + br->dev->name);
>
> One message is sufficient.
>
Yes? The first message explains the reason for the second message.
Thanks.
* [PATCH v3] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-09 6:33 UTC (permalink / raw)
To: davem, netdev; +Cc: eric.dumazet
Version 3 of RFS:
- Use sysctl instead of a kernel init parameter and alloc_large_system_hash
- Created inline function for "queue->input_queue_head++" to reduce number of #ifdef's
- Added RFS support for connected UDP sockets (thanks Eric!)
---
This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash, which is stored in the socket structure. The rxhash is passed in skbs received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table; if a valid CPU is set, the packet is steered to that CPU using the RPS mechanisms.
The complication with this simple approach is that it could allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets; we consider this a non-starter.
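The record/lookup idea in this paragraph can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the patch's code: the fixed table size TABLE_ENTRIES and the function names are hypothetical, and NO_CPU simply reuses the 0xffff sentinel that the patch defines as RPS_NO_CPU.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_CPU 0xffffu		/* same value as RPS_NO_CPU in the patch */
#define TABLE_ENTRIES 8u	/* hypothetical size; must be a power of two */

struct sock_flow_table {
	unsigned int mask;		/* entries - 1 */
	uint16_t ents[TABLE_ENTRIES];	/* desired CPU per hash bucket */
};

static void table_init(struct sock_flow_table *t)
{
	t->mask = TABLE_ENTRIES - 1;
	for (size_t i = 0; i < TABLE_ENTRIES; i++)
		t->ents[i] = NO_CPU;
}

/* recvmsg/sendmsg side: remember where the application runs. */
static void record_flow(struct sock_flow_table *t, uint32_t rxhash,
			uint16_t cpu)
{
	if (rxhash)		/* hash 0 means "no hash recorded" */
		t->ents[rxhash & t->mask] = cpu;
}

/* Receive side: which CPU should handle a packet with this hash? */
static uint16_t lookup_flow(const struct sock_flow_table *t, uint32_t rxhash)
{
	return rxhash ? t->ents[rxhash & t->mask] : NO_CPU;
}
```

Two flows whose hashes share the low bits land in the same bucket and overwrite each other's entry, which is acceptable because the table only carries a hint.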
To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table.
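The counter arithmetic can be illustrated with a tiny userspace model. The struct and function names below are hypothetical; only the relationship "tail = head count + queue length" is taken from the description above.

```c
struct backlog {
	unsigned int head;	/* incremented on every dequeue */
	unsigned int qlen;	/* packets currently queued */
};

/* Tail counter as described above: head count plus queue length. */
static unsigned int backlog_tail(const struct backlog *q)
{
	return q->head + q->qlen;
}

/* Enqueue one packet; returns the tail value to save in the flow entry. */
static unsigned int backlog_enqueue(struct backlog *q)
{
	q->qlen++;
	return backlog_tail(q);
}

/* Dequeue one packet; the head counter advances. */
static void backlog_dequeue(struct backlog *q)
{
	if (q->qlen) {
		q->qlen--;
		q->head++;
	}
}
```

Once the head counter has reached the value that was saved at enqueue time, every packet queued up to and including that enqueue has been dequeued.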
And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery.
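The switch rule in this list can be written as one small predicate. The sketch below is a userspace illustration: may_switch() and its cpu_is_online parameter (a stand-in for cpu_online()) are hypothetical, NO_CPU reuses the 0xffff value the patch defines as RPS_NO_CPU, and a 32-bit unsigned int is assumed. The signed difference mirrors the cast used in the patch's get_rps_cpu(), which keeps the comparison meaningful when the counters wrap.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu

struct dev_flow {
	uint16_t cpu;			/* current CPU for the flow */
	unsigned int last_qtail;	/* tail counter saved at last enqueue */
};

/*
 * May the flow move from its current CPU to the desired one?
 * head is the current CPU's queue head counter.
 */
static bool may_switch(const struct dev_flow *f, uint16_t desired,
		       unsigned int head, bool cpu_is_online)
{
	if (f->cpu == desired)
		return false;	/* already on the desired CPU */
	if (f->cpu == NO_CPU || !cpu_is_online)
		return true;
	/* Signed difference stays correct across counter wraparound. */
	return (int)(head - f->last_qtail) >= 0;
}
```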
Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to the interrupting CPU is good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb; we assume that this is only called from device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The "rps_sock_flow_entries" sysctl sets the number of entries in the rps_sock_flow_table; the per-rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded up to a power of two.
The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the application's processing; this can result in increased performance (higher pps, lower latency).
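Power-of-two sizing is what lets both tables index with a mask instead of a modulo. A minimal userspace sketch of the rounding and the resulting lookup follows; round_up_pow2() and bucket() are hypothetical helpers written for illustration (the patch itself calls the kernel's roundup_pow_of_two()).

```c
/* Round n up to the next power of two; valid for 1 <= n <= 2^31. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* With a power-of-two size, the bucket index is a single AND. */
static unsigned int bucket(unsigned int hash, unsigned int size)
{
	return hash & (size - 1);
}
```

So a requested size of 1000 entries becomes 1024, and the stored mask is 1023.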
The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of this patch. The netperf test has 500 instances of the netperf TCP_RR test with 1 byte req. and resp. The RPC test is a request/response test similar in structure to the netperf RR test with 100 threads on each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU

   RPC test          tps   CPU%   50/90/99% usec latency   StdDev
   No RFS or RPS     103K  48%    757/900/3185             4472.35
   RPS only:         174K  73%    415/993/2468             491.66
   RFS               223K  73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..573e775 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,77 @@ struct rps_map {
};
#define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+ u16 cpu;
+ u16 fill;
+ unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+ unsigned int mask;
+ struct rcu_head rcu;
+ struct work_struct free_work;
+ struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+ (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+ unsigned int mask;
+ u16 ents[0];
+};
+#define RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
+ (_num * sizeof(u16)))
+
+extern int rps_sock_flow_sysctl(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
+#define RPS_NO_CPU 0xffff
+
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+ u32 hash)
+{
+ if (table && hash) {
+ unsigned int cpu, index = hash & table->mask;
+
+ /* We only give a hint, preemption can change cpu under us */
+ cpu = raw_smp_processor_id();
+
+ if (table->ents[index] != cpu)
+ table->ents[index] = cpu;
+ }
+}
+
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+ u32 hash)
+{
+ if (table && hash)
+ table->ents[hash & table->mask] = RPS_NO_CPU;
+}
+
+extern struct rps_sock_flow_table *rps_sock_flow_table;
+
/* This structure contains an instance of an RX queue. */
struct netdev_rx_queue {
struct rps_map *rps_map;
+ struct rps_dev_flow_table *rps_flow_table;
struct kobject kobj;
struct netdev_rx_queue *first;
atomic_t count;
} ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
/*
* This structure defines the management hooks for network devices.
@@ -1331,13 +1394,21 @@ struct softnet_data {
struct sk_buff *completion_queue;
/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
struct call_single_data csd ____cacheline_aligned_in_smp;
+ unsigned int input_queue_head;
#endif
struct sk_buff_head input_pkt_queue;
struct napi_struct backlog;
};
+static inline void incr_input_queue_head(struct softnet_data *queue)
+{
+#ifdef CONFIG_RPS
+ queue->input_queue_head++;
+#endif
+}
+
DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
#define HAVE_NETIF_QUEUE
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83fd344..b487bc1 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -21,6 +21,7 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/jhash.h>
+#include <linux/netdevice.h>
#include <net/flow.h>
#include <net/sock.h>
@@ -101,6 +102,7 @@ struct rtable;
* @uc_ttl - Unicast TTL
* @inet_sport - Source port
* @inet_id - ID counter for DF pkts
+ * @rxhash - flow hash received from netif layer
* @tos - TOS
* @mc_ttl - Multicasting TTL
* @is_icsk - is this an inet_connection_sock?
@@ -124,6 +126,9 @@ struct inet_sock {
__u16 cmsg_flags;
__be16 inet_sport;
__u16 inet_id;
+#ifdef CONFIG_RPS
+ __u32 rxhash;
+#endif
struct ip_options *opt;
__u8 tos;
@@ -219,4 +224,37 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
}
+static inline void inet_rps_record_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+ struct rps_sock_flow_table *sock_flow_table;
+
+ rcu_read_lock();
+ sock_flow_table = rcu_dereference(rps_sock_flow_table);
+ rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+ rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_reset_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+ struct rps_sock_flow_table *sock_flow_table;
+
+ rcu_read_lock();
+ sock_flow_table = rcu_dereference(rps_sock_flow_table);
+ rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+ rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+{
+#ifdef CONFIG_RPS
+ if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
+ inet_rps_reset_flow(sk);
+ inet_sk(sk)->rxhash = rxhash;
+ }
+#endif
+}
#endif /* _INET_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..29ef1db 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
#include <linux/random.h>
#include <trace/events/napi.h>
#include <linux/pci.h>
+#include <linux/bootmem.h>
#include "net-sysfs.h"
@@ -2202,22 +2203,80 @@ int weight_p __read_mostly = 64; /* old backlog weight */
DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
#ifdef CONFIG_RPS
+/* One global table that all flow-based protocols share. */
+struct rps_sock_flow_table *rps_sock_flow_table;
+EXPORT_SYMBOL(rps_sock_flow_table);
+
+int rps_sock_flow_sysctl(ctl_table *table, int write, void __user *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ unsigned int orig_size, size;
+ int ret, i;
+ ctl_table tmp = {
+ .data = &size,
+ .maxlen = sizeof(size),
+ .mode = table->mode
+ };
+ struct rps_sock_flow_table *orig_sock_table, *sock_table;
+
+ rcu_read_lock();
+
+ orig_sock_table = rcu_dereference(rps_sock_flow_table);
+ size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
+
+ ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+
+ if (write) {
+ if (size) {
+ size = roundup_pow_of_two(size);
+ if (size != orig_size) {
+ sock_table =
+ vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
+ if (!sock_table) {
+ rcu_read_unlock();
+ return -ENOMEM;
+ }
+
+ sock_table->mask = size - 1;
+ } else
+ sock_table = orig_sock_table;
+
+ for (i = 0; i < size; i++)
+ sock_table->ents[i] = RPS_NO_CPU;
+ } else
+ sock_table = NULL;
+
+ if (sock_table != orig_sock_table) {
+ rcu_assign_pointer(rps_sock_flow_table, sock_table);
+ synchronize_rcu();
+ vfree(orig_sock_table);
+ }
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
/*
* get_rps_cpu is called from netif_receive_skb and returns the target
* CPU from the RPS map of the receiving queue for a given skb.
+ * rcu_read_lock must be held on entry.
*/
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+ struct rps_dev_flow **rflowp)
{
struct ipv6hdr *ip6;
struct iphdr *ip;
struct netdev_rx_queue *rxqueue;
struct rps_map *map;
+ struct rps_dev_flow_table *flow_table;
+ struct rps_sock_flow_table *sock_flow_table;
int cpu = -1;
u8 ip_proto;
+ u16 tcpu;
u32 addr1, addr2, ports, ihl;
- rcu_read_lock();
-
if (skb_rx_queue_recorded(skb)) {
u16 index = skb_get_rx_queue(skb);
if (unlikely(index >= dev->num_rx_queues)) {
@@ -2232,7 +2291,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
} else
rxqueue = dev->_rx;
- if (!rxqueue->rps_map)
+ if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
goto done;
if (skb->rxhash)
@@ -2284,9 +2343,48 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
skb->rxhash = 1;
got_hash:
+ flow_table = rcu_dereference(rxqueue->rps_flow_table);
+ sock_flow_table = rcu_dereference(rps_sock_flow_table);
+ if (flow_table && sock_flow_table) {
+ u16 next_cpu;
+ struct rps_dev_flow *rflow;
+
+ rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
+ tcpu = rflow->cpu;
+
+ next_cpu = sock_flow_table->ents[skb->rxhash &
+ sock_flow_table->mask];
+
+ /*
+ * If the desired CPU (where last recvmsg was done) is
+ * different from current CPU (one in the rx-queue flow
+ * table entry), switch if one of the following holds:
+ * - Current CPU is unset (equal to RPS_NO_CPU).
+ * - Current CPU is offline.
+ * - The current CPU's queue tail has advanced beyond the
+ * last packet that was enqueued using this table entry.
+ * This guarantees that all previous packets for the flow
+ * have been dequeued, thus preserving in order delivery.
+ */
+ if (unlikely(tcpu != next_cpu) &&
+ (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
+ ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
+ rflow->last_qtail)) >= 0)) {
+ tcpu = rflow->cpu = next_cpu;
+ if (tcpu != RPS_NO_CPU)
+ rflow->last_qtail = per_cpu(softnet_data,
+ tcpu).input_queue_head;
+ }
+ if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
+ *rflowp = rflow;
+ cpu = tcpu;
+ goto done;
+ }
+ }
+
map = rcu_dereference(rxqueue->rps_map);
if (map) {
- u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+ tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
if (cpu_online(tcpu)) {
cpu = tcpu;
@@ -2295,7 +2393,6 @@ got_hash:
}
done:
- rcu_read_unlock();
return cpu;
}
@@ -2321,13 +2418,14 @@ static void trigger_softirq(void *data)
__napi_schedule(&queue->backlog);
__get_cpu_var(netdev_rx_stat).received_rps++;
}
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_RPS */
/*
* enqueue_to_backlog is called to queue an skb to a per CPU backlog
* queue (may be a remote CPU queue).
*/
-static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
+ unsigned int *qtail)
{
struct softnet_data *queue;
unsigned long flags;
@@ -2342,6 +2440,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
if (queue->input_pkt_queue.qlen) {
enqueue:
__skb_queue_tail(&queue->input_pkt_queue, skb);
+#ifdef CONFIG_RPS
+ *qtail = queue->input_queue_head +
+ queue->input_pkt_queue.qlen;
+#endif
rps_unlock(queue);
local_irq_restore(flags);
return NET_RX_SUCCESS;
@@ -2356,11 +2458,10 @@ enqueue:
cpu_set(cpu, rcpus->mask[rcpus->select]);
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
- } else
- __napi_schedule(&queue->backlog);
-#else
- __napi_schedule(&queue->backlog);
+ goto enqueue;
+ }
#endif
+ __napi_schedule(&queue->backlog);
}
goto enqueue;
}
@@ -2391,7 +2492,7 @@ enqueue:
int netif_rx(struct sk_buff *skb)
{
- int cpu;
+ unsigned int qtail;
/* if netpoll wants it, pretend we never saw it */
if (netpoll_rx(skb))
@@ -2401,14 +2502,24 @@ int netif_rx(struct sk_buff *skb)
net_timestamp(skb);
#ifdef CONFIG_RPS
- cpu = get_rps_cpu(skb->dev, skb);
- if (cpu < 0)
- cpu = smp_processor_id();
-#else
- cpu = smp_processor_id();
+ {
+ struct rps_dev_flow voidflow, *rflow = &voidflow;
+ int cpu, err;
+
+ rcu_read_lock();
+
+ cpu = get_rps_cpu(skb->dev, skb, &rflow);
+ if (cpu >= 0) {
+ err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+ rcu_read_unlock();
+ return err;
+ }
+
+ rcu_read_unlock();
+ }
#endif
- return enqueue_to_backlog(skb, cpu);
+ return enqueue_to_backlog(skb, smp_processor_id(), &qtail);
}
EXPORT_SYMBOL(netif_rx);
@@ -2775,17 +2886,22 @@ out:
int netif_receive_skb(struct sk_buff *skb)
{
#ifdef CONFIG_RPS
- int cpu;
+ struct rps_dev_flow voidflow, *rflow = &voidflow;
+ int cpu, err;
+
+ rcu_read_lock();
- cpu = get_rps_cpu(skb->dev, skb);
+ cpu = get_rps_cpu(skb->dev, skb, &rflow);
- if (cpu < 0)
- return __netif_receive_skb(skb);
- else
- return enqueue_to_backlog(skb, cpu);
-#else
- return __netif_receive_skb(skb);
+ if (cpu >= 0) {
+ err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+ rcu_read_unlock();
+ return err;
+ }
+
+ rcu_read_unlock();
#endif
+ return __netif_receive_skb(skb);
}
EXPORT_SYMBOL(netif_receive_skb);
@@ -2801,6 +2917,7 @@ static void flush_backlog(void *arg)
if (skb->dev == dev) {
__skb_unlink(skb, &queue->input_pkt_queue);
kfree_skb(skb);
+ incr_input_queue_head(queue);
}
rps_unlock(queue);
}
@@ -3124,6 +3241,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
local_irq_enable();
break;
}
+ incr_input_queue_head(queue);
rps_unlock(queue);
local_irq_enable();
@@ -5487,8 +5605,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
local_irq_enable();
/* Process offline CPU's input_pkt_queue */
- while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+ while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
netif_rx(skb);
+ incr_input_queue_head(oldsd);
+ }
return NOTIFY_OK;
}
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1e7fdd6..95863b2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -600,22 +600,105 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
return len;
}
+static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+ struct rx_queue_attribute *attr,
+ char *buf)
+{
+ struct rps_dev_flow_table *flow_table;
+ unsigned int val = 0;
+
+ rcu_read_lock();
+ flow_table = rcu_dereference(queue->rps_flow_table);
+ if (flow_table)
+ val = flow_table->mask + 1;
+ rcu_read_unlock();
+
+ return sprintf(buf, "%u\n", val);
+}
+
+static void rps_dev_flow_table_release_work(struct work_struct *work)
+{
+ struct rps_dev_flow_table *table = container_of(work,
+ struct rps_dev_flow_table, free_work);
+
+ vfree(table);
+}
+
+static void rps_dev_flow_table_release(struct rcu_head *rcu)
+{
+ struct rps_dev_flow_table *table = container_of(rcu,
+ struct rps_dev_flow_table, rcu);
+
+ INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
+ schedule_work(&table->free_work);
+}
+
+ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+ struct rx_queue_attribute *attr,
+ const char *buf, size_t len)
+{
+ unsigned int count;
+ char *endp;
+ struct rps_dev_flow_table *table, *old_table;
+ static DEFINE_SPINLOCK(rps_dev_flow_lock);
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ count = simple_strtoul(buf, &endp, 0);
+ if (endp == buf)
+ return -EINVAL;
+
+ if (count) {
+ int i;
+
+ count = roundup_pow_of_two(count);
+ table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
+ if (!table)
+ return -ENOMEM;
+
+ table->mask = count - 1;
+ for (i = 0; i < count; i++)
+ table->flows[i].cpu = RPS_NO_CPU;
+ } else
+ table = NULL;
+
+ spin_lock(&rps_dev_flow_lock);
+ old_table = queue->rps_flow_table;
+ rcu_assign_pointer(queue->rps_flow_table, table);
+ spin_unlock(&rps_dev_flow_lock);
+
+ if (old_table)
+ call_rcu(&old_table->rcu, rps_dev_flow_table_release);
+
+ return len;
+}
+
static struct rx_queue_attribute rps_cpus_attribute =
__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
+
+static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
+ __ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
+ show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+
static struct attribute *rx_queue_default_attrs[] = {
&rps_cpus_attribute.attr,
+ &rps_dev_flow_table_cnt_attribute.attr,
NULL
};
static void rx_queue_release(struct kobject *kobj)
{
struct netdev_rx_queue *queue = to_rx_queue(kobj);
- struct rps_map *map = queue->rps_map;
struct netdev_rx_queue *first = queue->first;
- if (map)
- call_rcu(&map->rcu, rps_map_release);
+ if (queue->rps_map)
+ call_rcu(&queue->rps_map->rcu, rps_map_release);
+
+ if (queue->rps_flow_table)
+ call_rcu(&queue->rps_flow_table->rcu,
+ rps_dev_flow_table_release);
if (atomic_dec_and_test(&first->count))
kfree(first);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 0612487..f597189 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -81,6 +81,14 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+#ifdef CONFIG_RPS
+ {
+ .procname = "rps_sock_flow_entries",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = rps_sock_flow_sysctl
+ },
+#endif
#endif /* CONFIG_NET */
{
.procname = "netdev_budget",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b5924f1..cc46052 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -418,6 +418,8 @@ int inet_release(struct socket *sock)
if (sk) {
long timeout;
+ inet_rps_reset_flow(sk);
+
/* Applications forget to leave groups before exiting */
ip_mc_drop_socket(sk);
@@ -719,6 +721,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
{
struct sock *sk = sock->sk;
+ inet_rps_record_flow(sk);
+
/* We may need to bind the socket. */
if (!inet_sk(sk)->inet_num && inet_autobind(sk))
return -EAGAIN;
@@ -727,12 +731,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(inet_sendmsg);
-
static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
struct sock *sk = sock->sk;
+ inet_rps_record_flow(sk);
+
/* We may need to bind the socket. */
if (!inet_sk(sk)->inet_num && inet_autobind(sk))
return -EAGAIN;
@@ -742,6 +747,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
return sock_no_sendpage(sock, page, offset, size, flags);
}
+int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+ size_t size, int flags)
+{
+ struct sock *sk = sock->sk;
+ int addr_len = 0;
+ int err;
+
+ inet_rps_record_flow(sk);
+
+ err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
+ flags & ~MSG_DONTWAIT, &addr_len);
+ if (err >= 0)
+ msg->msg_namelen = addr_len;
+ return err;
+}
+EXPORT_SYMBOL(inet_recvmsg);
int inet_shutdown(struct socket *sock, int how)
{
@@ -871,7 +892,7 @@ const struct proto_ops inet_stream_ops = {
.setsockopt = sock_common_setsockopt,
.getsockopt = sock_common_getsockopt,
.sendmsg = tcp_sendmsg,
- .recvmsg = sock_common_recvmsg,
+ .recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
.sendpage = tcp_sendpage,
.splice_read = tcp_splice_read,
@@ -898,7 +919,7 @@ const struct proto_ops inet_dgram_ops = {
.setsockopt = sock_common_setsockopt,
.getsockopt = sock_common_getsockopt,
.sendmsg = inet_sendmsg,
- .recvmsg = sock_common_recvmsg,
+ .recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
.sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
@@ -928,7 +949,7 @@ static const struct proto_ops inet_sockraw_ops = {
.setsockopt = sock_common_setsockopt,
.getsockopt = sock_common_getsockopt,
.sendmsg = inet_sendmsg,
- .recvmsg = sock_common_recvmsg,
+ .recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
.sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4df5f9..2f40fe0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1674,6 +1674,8 @@ process:
skb->dev = NULL;
+ inet_rps_save_rxhash(sk, skb->rxhash);
+
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7af756d..11c7ce3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1216,6 +1216,7 @@ int udp_disconnect(struct sock *sk, int flags)
sk->sk_state = TCP_CLOSE;
inet->inet_daddr = 0;
inet->inet_dport = 0;
+ inet_rps_save_rxhash(sk, 0);
sk->sk_bound_dev_if = 0;
if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
inet_reset_saddr(sk);
@@ -1257,8 +1258,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
- int rc = sock_queue_rcv_skb(sk, skb);
+ int rc;
+
+ if (inet_sk(sk)->inet_daddr)
+ inet_rps_save_rxhash(sk, skb->rxhash);
+ rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
* Re: [PATCH v3] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-09 7:17 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004082331250.19688@pokey.mtv.corp.google.com>
On Thursday, 08 April 2010 at 23:33 -0700, Tom Herbert wrote:
> Version 3 of RFS:
> - Use sysctl instead of a kernel init parameter and alloc_large_system_hash
> - Created inline function for "queue->input_queue_head++" to reduce number of #ifdef's
> - Added RFS support for connected UDP sockets (thanks Eric!)
> ---
> This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS).
>
> The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash, which is stored in the socket structure. The rxhash is passed in skbs received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table; if a valid CPU is set, the packet is steered to that CPU using the RPS mechanisms.
>
> The complication with this simple approach is that it could allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets; we consider this a non-starter.
>
> To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table.
>
> rps_sock_flow_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows.
>
> rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry.
>
> Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table.
>
> And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true:
>
> - The current CPU is unset (equal to RPS_NO_CPU)
> - Current CPU is offline
> - The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery.
>
> Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to the interrupting CPU is good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb; we assume that this is only called from device napi_poll, which is non-reentrant.
>
> This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols.
>
> There are two configuration parameters for RFS. The "rps_sock_flow_entries" sysctl sets the number of entries in the rps_sock_flow_table; the per-rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded up to a power of two.
>
> The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the application's processing; this can result in increased performance (higher pps, lower latency).
>
> The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well.
>
> Below are some benchmark results which show the potential benefit of this patch. The netperf test has 500 instances of the netperf TCP_RR test with 1 byte req. and resp. The RPC test is a request/response test similar in structure to the netperf RR test with 100 threads on each host, but does more work in userspace than netperf.
>
> e1000e on 8 core Intel
>    No RFS or RPS              104K tps at 30% CPU
>    No RFS (best RPS config):  290K tps at 63% CPU
>    RFS                        303K tps at 61% CPU
>
> RPC test          tps    CPU%   50/90/99% usec latency   StdDev
>    No RFS or RPS   103K   48%    757/900/3185             4472.35
>    RPS only:       174K   73%    415/993/2468             491.66
>    RFS             223K   73%    379/651/1382             315.61
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
Changelog messages should be formatted with short lines (70-character limit).
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d1a21b5..573e775 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -530,14 +530,77 @@ struct rps_map {
> };
> #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
>
> +/*
> + * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
> + * tail pointer for that CPU's input queue at the time of last enqueue.
> + */
> +struct rps_dev_flow {
> + u16 cpu;
> + u16 fill;
> + unsigned int last_qtail;
> +};
> +
> +/*
> + * The rps_dev_flow_table structure contains a table of flow mappings.
> + */
> +struct rps_dev_flow_table {
> + unsigned int mask;
> + struct rcu_head rcu;
> + struct work_struct free_work;
> + struct rps_dev_flow flows[0];
> +};
> +#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
> + (_num * sizeof(struct rps_dev_flow)))
> +
> +/*
> + * The rps_sock_flow_table contains mappings of flows to the last CPU
> + * on which they were processed by the application (set in recvmsg).
> + */
> +struct rps_sock_flow_table {
> + unsigned int mask;
> + u16 ents[0];
> +};
> +#define RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
> + (_num * sizeof(u16)))
> +
> +extern int rps_sock_flow_sysctl(ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos);
> +
> +#define RPS_NO_CPU 0xffff
> +
> +static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
> + u32 hash)
> +{
> + if (table && hash) {
> + unsigned int cpu, index = hash & table->mask;
> +
> + /* We only give a hint, preemption can change cpu under us */
> + cpu = raw_smp_processor_id();
> +
> + if (table->ents[index] != cpu)
> + table->ents[index] = cpu;
> + }
> +}
> +
> +static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
> + u32 hash)
> +{
> + if (table && hash)
> + table->ents[hash & table->mask] = RPS_NO_CPU;
> +}
> +
> +extern struct rps_sock_flow_table *rps_sock_flow_table;
> +
> /* This structure contains an instance of an RX queue. */
> struct netdev_rx_queue {
> struct rps_map *rps_map;
> + struct rps_dev_flow_table *rps_flow_table;
> struct kobject kobj;
> struct netdev_rx_queue *first;
> atomic_t count;
> } ____cacheline_aligned_in_smp;
> -#endif
> +#endif /* CONFIG_RPS */
>
> /*
> * This structure defines the management hooks for network devices.
> @@ -1331,13 +1394,21 @@ struct softnet_data {
> struct sk_buff *completion_queue;
>
> /* Elements below can be accessed between CPUs for RPS */
> -#ifdef CONFIG_SMP
> +#ifdef CONFIG_RPS
> struct call_single_data csd ____cacheline_aligned_in_smp;
> + unsigned int input_queue_head;
> #endif
> struct sk_buff_head input_pkt_queue;
> struct napi_struct backlog;
> };
>
> +static inline void incr_input_queue_head(struct softnet_data *queue)
> +{
> +#ifdef CONFIG_RPS
> + queue->input_queue_head++;
> +#endif
> +}
> +
> DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
>
> #define HAVE_NETIF_QUEUE
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index 83fd344..b487bc1 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -21,6 +21,7 @@
> #include <linux/string.h>
> #include <linux/types.h>
> #include <linux/jhash.h>
> +#include <linux/netdevice.h>
>
> #include <net/flow.h>
> #include <net/sock.h>
> @@ -101,6 +102,7 @@ struct rtable;
> * @uc_ttl - Unicast TTL
> * @inet_sport - Source port
> * @inet_id - ID counter for DF pkts
> + * @rxhash - flow hash received from netif layer
> * @tos - TOS
> * @mc_ttl - Multicasting TTL
> * @is_icsk - is this an inet_connection_sock?
> @@ -124,6 +126,9 @@ struct inet_sock {
> __u16 cmsg_flags;
> __be16 inet_sport;
> __u16 inet_id;
> +#ifdef CONFIG_RPS
> + __u32 rxhash;
> +#endif
I am a bit worried, because dirtying this cache line might hurt non-RPS
setups (if network interrupts are balanced across all CPUs).
The best place would be to put rxhash close to sk_refcnt (because we dirty
it to get a reference on RCU sk lookups).
I believe we have a 32-bit hole on 64-bit arches for this :)
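A minimal sketch of the kind of padding hole referred to above. The structures are hypothetical and do not reproduce the real layout of struct sock; they only show that, on a typical LP64 ABI, a 32-bit field placed between another 32-bit field and a pointer occupies existing padding and does not grow the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a 32-bit counter followed by a pointer. */
struct without_hash {
	int refcnt;		/* 4 bytes */
	/* on LP64, 4 bytes of padding here so `next` is 8-byte aligned */
	void *next;
};

/* Same layout with a 32-bit hash placed in the padding hole. */
struct with_hash {
	int refcnt;
	uint32_t rxhash;	/* sits next to refcnt, which is already dirtied */
	void *next;
};

/* Return 1 if adding rxhash left the structure size unchanged. */
static int hash_fits_in_hole(void)
{
	/* rxhash must start right after refcnt for the argument to hold. */
	assert(offsetof(struct with_hash, rxhash) == sizeof(int));
	return sizeof(struct with_hash) == sizeof(struct without_hash);
}
```

On a 32-bit ABI there is no such hole, so the function is expected to return 0 there; tools such as pahole can show the real holes in a kernel structure.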
While testing the latest net-next-2.6 on my Nehalem machine, I got a crash
(in RPS, I am afraid...)
I am going to fix this crash before testing RFS and will let you know the
results.
Thanks
^ permalink raw reply
* [PATCH net-next-2.6] net: Dont use netdev_warn()
From: Eric Dumazet @ 2010-04-09 7:26 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Tom Herbert, Joe Perches
Don't use netdev_warn() in dev_cap_txqueue() and get_rps_cpu(), so that we
can catch the following warnings without a crash.
bond0.2240 received packet on queue 6, but number of RX queues is 1
bond0.2240 received packet on queue 11, but number of RX queues is 1
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
Or maybe the netdev_warn() implementation should be changed?
net/core/dev.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..ad51ffb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1986,9 +1986,9 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
{
if (unlikely(queue_index >= dev->real_num_tx_queues)) {
if (net_ratelimit()) {
- netdev_warn(dev, "selects TX queue %d, but "
- "real number of TX queues is %d\n",
- queue_index, dev->real_num_tx_queues);
+ pr_warning("%s selects TX queue %d, but "
+ "real number of TX queues is %d\n",
+ dev->name, queue_index, dev->real_num_tx_queues);
}
return 0;
}
@@ -2222,9 +2222,9 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
u16 index = skb_get_rx_queue(skb);
if (unlikely(index >= dev->num_rx_queues)) {
if (net_ratelimit()) {
- netdev_warn(dev, "received packet on queue "
- "%u, but number of RX queues is %u\n",
- index, dev->num_rx_queues);
+ pr_warning("%s received packet on queue "
+ "%u, but number of RX queues is %u\n",
+ dev->name, index, dev->num_rx_queues);
}
goto done;
}
^ permalink raw reply related
* Re: [PATCH v3] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-09 7:31 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004082331250.19688@pokey.mtv.corp.google.com>
On Thursday, 8 April 2010 at 23:33 -0700, Tom Herbert wrote:
> @@ -1257,8 +1258,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
>
> static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> {
> - int rc = sock_queue_rcv_skb(sk, skb);
> + int rc;
> +
> + if (inet_sk(sk)->inet_daddr)
> + inet_rps_save_rxhash(sk, skb->rxhash);
>
> + rc = sock_queue_rcv_skb(sk, skb);
> if (rc < 0) {
> int is_udplite = IS_UDPLITE(sk);
>
There is an extra space before "rc = sock_queue_rcv_skb(sk, skb);"
^ permalink raw reply
* [RFC PATCH 0/2] netdev: Add tracepoint to network/driver interface
From: Koki Sanagi @ 2010-04-09 7:37 UTC (permalink / raw)
To: netdev; +Cc: izumi.taku, kaneshige.kenji, davem, nhorman
These patches add tracepoints to the network/driver interface.
These tracepoints help to determine whether a packet passed through that
interface. For example, when a heartbeat connection is lost, this
information helps to determine whether the cause is on the driver/device
side.
Sample output is below.
sshd-2443 [001] 68238.415621: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
<idle>-0 [001] 68238.417058: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
<idle>-0 [001] 68238.704363: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
sshd-2443 [001] 68238.705459: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
<idle>-0 [001] 68238.706891: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
<idle>-0 [001] 68238.878736: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
sshd-2443 [001] 68238.880361: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
As another use case, we can get per-interface throughput with a perf
script. I plan to write one.
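As a rough sketch of the per-interface throughput idea, the snippet below parses one line of the trace output shown above and accumulates byte counts for one device. It is plain user-space C rather than a perf script, `parse_trace_line` and `struct dev_bytes` are hypothetical names, and it assumes the `dev=... skbaddr=... len=...` layout from the sample output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical accumulator for one interface. */
struct dev_bytes {
	char name[16];		/* interface name to match, e.g. "eth3" */
	unsigned long tx_bytes;
	unsigned long rx_bytes;
};

/*
 * Parse one trace line and add its length to `acc` if the device matches.
 * Returns 1 if the line was counted, 0 otherwise.
 */
static int parse_trace_line(const char *line, struct dev_bytes *acc)
{
	const char *ev;
	char dev[16];
	unsigned int len;
	int is_xmit;

	assert(line && acc);

	if ((ev = strstr(line, "netdev_start_xmit: ")) != NULL)
		is_xmit = 1;
	else if ((ev = strstr(line, "netdev_receive_skb: ")) != NULL)
		is_xmit = 0;
	else
		return 0;

	ev = strchr(ev, ' ') + 1;	/* skip past "event_name: " */
	if (sscanf(ev, "dev=%15s skbaddr=%*s len=%u", dev, &len) != 2)
		return 0;
	if (strcmp(dev, acc->name) != 0)
		return 0;

	if (is_xmit)
		acc->tx_bytes += len;
	else
		acc->rx_bytes += len;
	return 1;
}
```

Dividing the accumulated byte counts by the span of the timestamps in the trace would then give an approximate throughput for the interface.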
Thanks
Koki Sanagi
^ permalink raw reply
* Re: [PATCH v3] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-09 7:37 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004082331250.19688@pokey.mtv.corp.google.com>
On Thursday, 8 April 2010 at 23:33 -0700, Tom Herbert wrote:
> #ifdef CONFIG_RPS
> +/* One global table that all flow-based protocols share. */
> +struct rps_sock_flow_table *rps_sock_flow_table;
> +EXPORT_SYMBOL(rps_sock_flow_table);
> +
> +int rps_sock_flow_sysctl(ctl_table *table, int write, void __user *buffer,
> + size_t *lenp, loff_t *ppos)
> +{
> + unsigned int orig_size, size;
> + int ret, i;
> + ctl_table tmp = {
> + .data = &size,
> + .maxlen = sizeof(size),
> + .mode = table->mode
> + };
> + struct rps_sock_flow_table *orig_sock_table, *sock_table;
> +
> + rcu_read_lock();
> +
> + orig_sock_table = rcu_dereference(rps_sock_flow_table);
> + size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
> +
> + ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +
> + if (write) {
> + if (size) {
> + size = roundup_pow_of_two(size);
> + if (size != orig_size) {
> + sock_table =
> + vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
> + if (!sock_table) {
> + rcu_read_unlock();
> + return -ENOMEM;
> + }
> +
> + sock_table->mask = size - 1;
> + } else
> + sock_table = orig_sock_table;
> +
> + for (i = 0; i < size; i++)
> + sock_table->ents[i] = RPS_NO_CPU;
> + } else
> + sock_table = NULL;
> +
> + if (sock_table != orig_sock_table) {
> + rcu_assign_pointer(rps_sock_flow_table, sock_table);
> + synchronize_rcu();
> + vfree(orig_sock_table);
> + }
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> /
It is not allowed to call vmalloc() inside an rcu_read_lock() section.
Anyway, rcu_read_lock() is not appropriate (you want mutual exclusion
between concurrent writers here).
You should use a mutex here.
Or a spinlock (if you do the vmalloc()/vfree() things outside of the
locked section).
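The pattern suggested above can be sketched in user space with pthreads: allocate the new table before taking the lock, swap the pointer under the lock, and free the old table after dropping it. This is only an analogy. `table_resize`, `flow_table`, `cur_table`, and `table_lock` are hypothetical names, `malloc()`/`free()` stand in for vmalloc()/vfree(), and the sketch has no concurrent readers, so it omits the RCU grace period that kernel code would need before freeing the old table.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct flow_table {
	unsigned int mask;
	uint16_t ents[];
};

static struct flow_table *cur_table;	/* published table, may be NULL */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Replace the current table with one of `size` entries (a power of two),
 * or remove it when size is 0. Returns 0 on success, -1 if out of memory.
 */
static int table_resize(unsigned int size)
{
	struct flow_table *new_table = NULL, *old_table;
	unsigned int i;

	assert((size & (size - 1)) == 0);	/* zero or a power of two */

	/* Allocation may sleep, so do it before taking the lock. */
	if (size) {
		new_table = malloc(sizeof(*new_table) +
				   size * sizeof(new_table->ents[0]));
		if (!new_table)
			return -1;
		new_table->mask = size - 1;
		for (i = 0; i < size; i++)
			new_table->ents[i] = 0xffff;
	}

	/* Only the pointer swap needs mutual exclusion between writers. */
	pthread_mutex_lock(&table_lock);
	old_table = cur_table;
	cur_table = new_table;
	pthread_mutex_unlock(&table_lock);

	/* A kernel version would wait for readers (RCU) before this free. */
	free(old_table);
	return 0;
}
```

With a mutex the allocation could also happen inside the locked section, since sleeping is allowed there; keeping it outside is what makes the spinlock variant possible.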
^ permalink raw reply
* [RFC PATCH 1/2] netdev: Add tracepoint to dev_hard_start_xmit
From: Koki Sanagi @ 2010-04-09 7:39 UTC (permalink / raw)
To: netdev; +Cc: izumi.taku, kaneshige.kenji, davem, nhorman
In-Reply-To: <4BBED951.8040406@jp.fujitsu.com>
This patch adds a tracepoint to dev_hard_start_xmit.
It reports when a transmitted packet passes the network/driver interface.
Sample output is below.
sshd-2443 [001] 68238.415621: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
sshd-2443 [001] 68238.705459: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
sshd-2443 [001] 68238.880361: netdev_start_xmit: dev=eth3 skbaddr=f3db5138 len=114
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/trace/events/skb.h | 28 ++++++++++++++++++++++++++++
net/core/dev.c | 3 +++
2 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..425a062 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -9,6 +9,34 @@
#include <linux/tracepoint.h>
/*
+ * netdev_start_xmit - invoked when skb is passed to the driver
+ * @skb: pointer to struct sk_buff
+ * @dev: pointer to struct net_device
+ */
+TRACE_EVENT(netdev_start_xmit,
+
+ TP_PROTO(struct sk_buff *skb,
+ struct net_device *dev),
+
+ TP_ARGS(skb, dev),
+
+ TP_STRUCT__entry(
+ __field( const void *, skbaddr )
+ __field( unsigned int, len )
+ __string( name, dev->name )
+ ),
+
+ TP_fast_assign(
+ __entry->skbaddr = skb;
+ __entry->len = skb->len;
+ __assign_str(name, dev->name);
+ ),
+
+ TP_printk("dev=%s skbaddr=%p len=%u",
+ __get_str(name), __entry->skbaddr, __entry->len)
+);
+
+/*
* Tracepoint for free an sk_buff:
*/
TRACE_EVENT(kfree_skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 2a9b7dd..4667a96 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
#include <linux/jhash.h>
#include <linux/random.h>
#include <trace/events/napi.h>
+#include <trace/events/skb.h>
#include <linux/pci.h>
#include "net-sysfs.h"
@@ -1903,6 +1904,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
+ trace_netdev_start_xmit(skb, dev);
rc = ops->ndo_start_xmit(skb, dev);
if (rc == NETDEV_TX_OK)
txq_trans_update(txq);
@@ -1937,6 +1939,7 @@ gso:
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(nskb);
+ trace_netdev_start_xmit(nskb, dev);
rc = ops->ndo_start_xmit(nskb, dev);
if (unlikely(rc != NETDEV_TX_OK)) {
if (rc & ~NETDEV_TX_MASK)
^ permalink raw reply related
* [RFC PATCH 2/2] netdev: Add tracepoint to netif_receive_skb
From: Koki Sanagi @ 2010-04-09 7:41 UTC (permalink / raw)
To: netdev; +Cc: izumi.taku, kaneshige.kenji, davem, nhorman
In-Reply-To: <4BBED951.8040406@jp.fujitsu.com>
This patch adds a tracepoint to netif_receive_skb.
It reports when a received packet passes the network/driver interface.
Sample output is below.
<idle>-0 [001] 68238.417058: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
<idle>-0 [001] 68238.704363: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
<idle>-0 [001] 68238.706891: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=52
<idle>-0 [001] 68238.878736: netdev_receive_skb: dev=eth3 skbaddr=f3c81540 len=100
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
include/trace/events/skb.h | 28 ++++++++++++++++++++++++++++
net/core/dev.c | 1 +
2 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 425a062..ca78d49 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -37,6 +37,34 @@ TRACE_EVENT(netdev_start_xmit,
);
/*
+ * netdev_receive_skb - invoked when skb is received from the driver
+ * @skb: pointer to struct sk_buff
+ * @dev: pointer to struct net_device
+ */
+TRACE_EVENT(netdev_receive_skb,
+
+ TP_PROTO(struct sk_buff *skb,
+ struct net_device *dev),
+
+ TP_ARGS(skb, dev),
+
+ TP_STRUCT__entry(
+ __field( const void *, skbaddr )
+ __field( unsigned int, len )
+ __string( name, dev->name )
+ ),
+
+ TP_fast_assign(
+ __entry->skbaddr = skb;
+ __entry->len = skb->len;
+ __assign_str(name, dev->name);
+ ),
+
+ TP_printk("dev=%s skbaddr=%p len=%u",
+ __get_str(name), __entry->skbaddr, __entry->len)
+);
+
+/*
* Tracepoint for free an sk_buff:
*/
TRACE_EVENT(kfree_skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 4667a96..7281286 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2683,6 +2683,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
__get_cpu_var(netdev_rx_stat).total++;
+ trace_netdev_receive_skb(skb, orig_dev);
skb_reset_network_header(skb);
skb_reset_transport_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
^ permalink raw reply related
* Re: Crashes in xfrm_lookup
From: Herbert Xu @ 2010-04-09 8:09 UTC (permalink / raw)
To: Timo Teräs; +Cc: broonie, netdev
In-Reply-To: <4BBDBCF9.5060906@iki.fi>
Timo Teräs <timo.teras@iki.fi> wrote:
>
> Happens because CONFIG_XFRM_SUB_POLICY is not enabled, and one of
> the helper functions I used did unexpected things in that case.
>
> Try the following:
Ugh, can we fix this some other way?
The policy array should really only exist if SUB_POLICY is defined.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: Crashes in xfrm_lookup
From: Timo Teräs @ 2010-04-09 8:11 UTC (permalink / raw)
To: Herbert Xu; +Cc: broonie, netdev
In-Reply-To: <20100409080907.GA2029@gondor.apana.org.au>
Herbert Xu wrote:
> Timo Teräs <timo.teras@iki.fi> wrote:
>> Happens because CONFIG_XFRM_SUB_POLICY is not enabled, and one of
>> the helper functions I used did unexpected things in that case.
>>
>> Try the following:
>
> Ugh, can we fix this some other way?
>
> The policy array should really only exist if SUB_POLICY is defined.
The problem was that my code could call it with zero policies,
assuming it'd do the right thing.
It's still really misleading to have a generic function that does not
do the expected thing depending on some config option. The compiler
should know how to optimize the for loop away if it's being called
with a fixed array size.
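A small sketch of the expectation described above: a helper that walks a policy array should be a no-op when called with zero entries, whatever the configuration. The names `policy` and `policies_put` are hypothetical and this is not the xfrm code; it only illustrates a loop form that handles a zero count and that a compiler can unroll or drop when the count is a compile-time constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted policy object. */
struct policy {
	int refcnt;
};

/* Drop one reference on each of the first `npols` policies. */
static void policies_put(struct policy **pols, size_t npols)
{
	size_t i;

	assert(pols || npols == 0);
	for (i = 0; i < npols; i++)
		pols[i]->refcnt--;
}
```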
^ permalink raw reply