* RE: [PATCH] e100: fix 10sec DHCP delay (regression for 82559ER)
From: Allan, Bruce W @ 2009-10-23 21:45 UTC (permalink / raw)
To: Bernhard Kaindl, David S. Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4ADF6F15.5040301@gmx.net>
After further testing, this introduces a problem on 82552 parts where it no longer autonegs to 100Full. I've created a different patch that reverts my code put in to 2.6.30 which isolates all PHY ids (which causes the problem discovered by Bernhard) yet still works around the problem on 82552. It is currently queued up for testing here at Intel and will be pushed to netdev in the next few days.
>-----Original Message-----
>From: Bernhard Kaindl [mailto:bernhard.kaindl@gmx.net]
>Sent: Wednesday, October 21, 2009 1:29 PM
>To: David S. Miller
>Cc: Allan, Bruce W; netdev@vger.kernel.org
>Subject: [PATCH] e100: fix 10sec DHCP delay (regression for 82559ER)
>
>From: Bernhard Kaindl <bernhard.kaindl@gmx.net>
>
>With 2.6.30, we noticed a regression on our Intel 82559ER-equipped
>boards regarding the PHY initialization. The Intel 82559ER uses an
>internal PHY bus, so our PHY environment should be not be very special.
>
>The symptom which we observed on these boards was that the boot-time
>DHCP negotiation was stalled for ~5secs and went very slowly until
>it was finally completed after ~10secs after initial interface start.
>
>I reported our finding along with a proposal for a fix to netdev and
>Bruce Allan@Intel, whose patch to support the new Intel 82552 adapter
>included a workaround for the 82552 with this side effect for our env.
>
>Bruce worked on a way to have both environments working and tested
>it on as many 10/100 parts he could get his hands on and started
>a process which allows for more testing and patch submission from
>
>So pending this process which allows for more testing at Intel, I am
>submitting our common patch. For PHYs other than the 82552, it resembles
>mostly how 2.6.23-2.6.29 have been selecting and isolating the PHYs.
>
>This patch applies to 2.6.30.9, 2.6.31.4 and 2.6.32-rc5-git1 and is
>essentially what Bruce has been testing:
>
>Signed-off-by: Bernhard Kaindl <bernhard.kaindl@gmx.net>
>Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
>---
>
> drivers/net/e100.c | 10 +++++++---
> 1 files changed, 7 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/e100.c b/drivers/net/e100.c
>index 5d2f48f..aed18a4 100644
>--- a/drivers/net/e100.c
>+++ b/drivers/net/e100.c
>@@ -1427,13 +1427,17 @@ static int e100_phy_init(struct nic *nic)
> } else
> DPRINTK(HW, DEBUG, "phy_addr = %d\n", nic->mii.phy_id);
>
>- /* Isolate all the PHY ids */
>- for (addr = 0; addr < 32; addr++)
>- mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
> /* Select the discovered PHY */
> bmcr &= ~BMCR_ISOLATE;
> mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);
>
>+ if (nic->phy != phy_82552_v) {
>+ /* Isolate the unused PHY ids */
>+ for (addr = 0; addr < 32; addr++)
>+ if (addr != nic->mii.phy_id)
>+ mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
>+ }
>+
> /* Get phy ID */
> id_lo = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID1);
> id_hi = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID2);
^ permalink raw reply
* Re: [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-23 22:40 UTC (permalink / raw)
To: Ben Greear; +Cc: David Miller, netdev, atis, eric.dumazet, zenczykowski
In-Reply-To: <4AE1CD19.8090901@candelatech.com>
On Fri, 2009-10-23 at 08:34 -0700, Ben Greear wrote:
> I implemented something similar while allowing for virtual router like
> applications. I had to add a mark very early in the pkt rx logic in dev.c,
> and had to add a 'skb_default_mark' member to the netdevice because
> the route lookup is done before the normal iptables logic ran.
You dont need to add a new construct to netdev.
Here's how youd tag all packets coming on eth0 with mark 7:
----
tc filter add dev eth0 parent ffff: protocol ip \
pref 10 u32 match u32 0 0 flowid 1:17 \
action skbedit mark 7
---
Of course you could also be very flow specific, example:
----
tc filter add dev eth0 parent ffff: protocol ip \
pref 9 u32 match ip src 64.233.169.99/32 flowid 1:5 \
action skbedit mark 5
---
Or even use iptable marker
---
tc filter add dev eth2 parent 1:0 protocol ip \
prio 5 u32 match ip dst 10.0.0.90/32 flowid 1:12 \
action ipt -j mark --set-mark 2
----
You could even slice bread with this stuff. Example
you could use certain policy routing tables only
if a flow misbehaved (works well with routing not
local destined packets), example
---
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 \
match ip src 10.0.0.90/32 flowid 1:10 \
action ipt -j mark --set-mark 1 \
action police rate 100kbit burst 90k pipe \
action ipt -j mark --set-mark 2 \
action police rate 50kbit burst 50k pipe \
action ipt -j mark --set-mark 3 \
action police rate 50kbit burst 50k drop
----
As a warning ipt could be shaky in some distros because
of the morphing iptables interface.
> Without
> this, if a flow already existed for pkts coming in eth1, if the packet came
> back in eth2, it would use eth1's flow.
True. Of course you can avoid that with the patch i posted
meeting the conditions i described with RPF.
> I'll dig out the patch if anyone is interested...
If you can do overlapping IP addresses, it would be interesting
to see. Otherwise all is achievable with smart policy routing.
cheers,
jamal
^ permalink raw reply
* Re: Irq architecture for multi-core network driver.
From: Eric W. Biederman @ 2009-10-23 23:22 UTC (permalink / raw)
To: Jesse Brandeburg
Cc: David Daney, Chris Friesen, netdev, Linux Kernel Mailing List,
linux-mips
In-Reply-To: <4807377b0910231028g60b479cfycdbf3f4e25384c58@mail.gmail.com>
Jesse Brandeburg <jesse.brandeburg@gmail.com> writes:
> On Fri, Oct 23, 2009 at 12:59 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> David Daney <ddaney@caviumnetworks.com> writes:
>>> Certainly this is one mode of operation that should be supported, but I would
>>> also like to be able to go for raw throughput and have as many cores as possible
>>> reading from a single queue (like I currently have).
>>
>> I believe will detect false packet drops and ask for unnecessary
>> retransmits if you have multiple cores processing a single queue,
>> because you are processing the packets out of order.
>
> So, the way the default linux kernel configures today's many core
> server systems is to leave the affinity mask by default at 0xffffffff,
> and most current Intel hardware based on 5000 (older core cpus), or
> 5500 chipset (used with Core i7 processors) that I have seen will
> allow for round robin interrupts by default. This kind of sucks for
> the above unless you run irqbalance or set smp_affinity by hand.
On x86 if you have > 8 cores the hardware does not support any form of
irq balancing. You do have an interesting point.
How often and how much does irq balancing hurt us.
> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another)
irqbalance is actually more likely to move irqs than the hardware.
I have heard promises it won't move network irqs but I have seen
the opposite behavior.
> The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?
Good question. I would aim for the kernel to work better by default.
Ideally we should have a coupling between which sockets applications have
open, which cpus those applications run on, and which core the irqs arrive
at.
> I don't believe we should re-enable the kernel irq balancer, but
> should we consider only setting a single bit in each new interrupt's
> irq affinity? Doing it with a random spread for the initial affinity
> would be better than setting them all to one.
Not a bad idea. The practical problem is that we usually have the irqs
setup before we have the additional cpus. But that isn't entirely true,
I'm thinking of mostly pre-acpi rules. With ACPI we do some kind of
on-demand setup of the gsi in the device initialization.
How irq threads interact also ways in here.
Eric
^ permalink raw reply
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Paul E. McKenney @ 2009-10-23 21:13 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>
On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Octavian Purdila a écrit :
> > On Sunday 18 October 2009 21:21:44 you wrote:
> >>> The msleep(250) should be tuned first. Then if this is really necessary
> >>> to dismantle 100.000 netdevices per second, we might have to think a bit
> >>> more.
> >>> Just try msleep(1 or 2), it should work quite well.
> >> My goal is tearing down 100,000 interfaces in a few seconds, which really
> >> is necessary. Right now we're running about 40,000 interfaces on a not
> >> yet saturated 10Gbps link. Going to dual 10Gbps links means pushing more
> >> than 100,000 subscriber interfaces, and it looks like a modern dual socket
> >> system can handle that.
> >>
> >
> > I would also like to see this patch in, we are running into scalability issues
> > with creating/deleting lots of interfaces as well.
>
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
>
> Adding lots of interfaces only needs several threads to run concurently.
>
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
>
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
>
> netdev_run_todo() should also use a better algorithm to allow parallelism.
>
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
>
> [PATCH] net: allow netdev_wait_allrefs() to run faster
>
> netdev_wait_allrefs() waits that all references to a device vanishes.
>
> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> Some users report that no more than 4 devices can be dismantled per second,
> this is a pretty serious problem for extreme setups.
>
> Most likely, references only wait for a rcu grace period that should come
> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
Is this a place where synchronize_rcu_expedited() is appropriate?
(It went in to 2.6.32-rc1.)
Thanx, Paul
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> net/core/dev.c | 2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 28b0b9e..fca2e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
> rebroadcast_time = jiffies;
> }
>
> - msleep(250);
> + schedule_timeout_uninterruptible(1);
>
> if (time_after(jiffies, warning_time + 10 * HZ)) {
> printk(KERN_EMERG "unregister_netdevice: "
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC] [PATCH] cxgb3: Set the rxq
From: Divy Le Ray @ 2009-10-24 0:29 UTC (permalink / raw)
To: Krishna Kumar, davem; +Cc: netdev
In-Reply-To: <20091023111321.4726.29354.sendpatchset@localhost.localdomain>
On Fri, 23 Oct 2009 04:13:21 -0700, Krishna Kumar <krkumar2@in.ibm.com>
wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Set the rxq# for LRO when processing the last fragment of a
> frame. This helps in fast txq selection for routing workloads.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Divy Le Ray <divy@chelsio.com>
> ---
>
> drivers/net/cxgb3/sge.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff -ruNp org/drivers/net/cxgb3/sge.c new/drivers/net/cxgb3/sge.c
> --- org/drivers/net/cxgb3/sge.c 2009-10-19 11:58:16.000000000 +0530
> +++ new/drivers/net/cxgb3/sge.c 2009-10-20 09:40:40.000000000 +0530
> @@ -2135,6 +2135,7 @@ static void lro_add_page(struct adapter
> if (!complete)
> return;
> + skb_record_rx_queue(skb, qs - &adap->sge.qs[0]);
> skb->ip_summed = CHECKSUM_UNNECESSARY;
> cpl = qs->lro_va;
>
^ permalink raw reply
* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Christoph Lameter @ 2009-10-24 1:52 UTC (permalink / raw)
To: Pekka Enberg
Cc: Mel Gorman, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev,
linux-kernel, "li
In-Reply-To: <84144f020910220747nba30d8bkc83c2569da79bd7c@mail.gmail.com>
On Thu, 22 Oct 2009, Pekka Enberg wrote:
> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
Bug fixes go into main not linux-next. Lets make sure these fixes really
work and then merge.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* RE: [patch next 3/4] netxen: fix bonding support
From: Dhananjay Phadke @ 2009-10-24 1:54 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: netdev@vger.kernel.org
In-Reply-To: <m1ljj2ps80.fsf@fess.ebiederm.org>
Yes, this was a bug until recent times.
It should be fixed in 2.6.32 development cycle by this commit -
db4cfd8a6149e778befb2ff6e6f91cdc6394cbe6 ("netxen: handle firmware load errors").
This added a check for adapter->is_up before touching txq lock.
-Dhananjay
________________________________________
From: Eric W. Biederman [ebiederm@xmission.com]
Sent: Thursday, October 22, 2009 10:57 PM
To: Dhananjay Phadke
Cc: netdev@vger.kernel.org
Subject: Re: [patch next 3/4] netxen: fix bonding support
Dhananjay Phadke <dhananjay@netxen.com> writes:
> o Pause traffic during mac addr change.
> o Enable setting mac address for NX3031.
In 2.6.31 when I try and use bonding with the netxen or
even simply set the mac address after the driver has
first loaded but before the nic is brought up I get:
As I read that oops. The problem is that netxen_send_cmd_descs
is being called before tx_ring->txq has been initialized.
tx_ring->txq happens in open, when the device is brought up.
Any chance you can look at the problem?
Eric
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
PGD 1fe76067 PUD b4d05067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
CPU 2
Modules linked in: ipt_REDIRECT vfat fat loop appletalk ipx p8023 rose ax25 tulip xt_hl ipt_LOG xt_limit ipt_REJECT xt_state xt_tcpudp dummy iptable_filter veth macvlan nfsd lockd nfs_acl auth_rpcgss exportfs sco bridge stp bnep l2cap bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table ext4 jbd2 crc16 dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput bonding ipv6 kvm_intel kvm fuse xt_multiport iptable_nat ip_tables nf_nat x_tables nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 tun 8021q i2c_i801 igb i2c_core i5k_amb hwmon i5400_edac edac_core iTCO_wdt iTCO_vendor_support pcspkr shpchp myri10ge floppy netxen_nic sr_mod cdrom serio_raw sg ata_generic pata_acpi ata_piix libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 22313, comm: ip Not tainted 2.6.31.2-193574.2006Arora.fc11.x86_64 #1 X7DWU
RIP: 0010:[<ffffffffa0128942>] [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
RSP: 0018:ffff88005753d638 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88005753d688 RDI: ffff88012dc8f580
RBP: ffff88005753d678 R08: ffff88012dc8f580 R09: ffff88005753d688
R10: 00000000a388069b R11: 0000000000000246 R12: ffff88012dc8f580
R13: ffff88001fcb6c00 R14: ffff88005753d738 R15: 0000000000000020
FS: 00007f4360f656f0(0000) GS:ffff88002805a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 00000000b4de5000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ip (pid: 22313, threadinfo ffff88005753c000, task ffff88002fbdb700)
Stack:
0000000000000000 00000000a388069b ffffea000136a578 00000000a388069b
<0> ffff88005753d778 ffff88012dc8f580 ffff88012e2cf910 ffff88001fcb6c00
<0> ffff88005753d6d8 ffffffffa0128ae5 000000000a000000 0000000000010001
Call Trace:
[<ffffffffa0128ae5>] nx_p3_sre_macaddr_change+0x60/0x76 [netxen_nic]
[<ffffffffa0128ca8>] nx_p3_nic_add_mac+0x129/0x14c [netxen_nic]
[<ffffffffa0128d57>] netxen_p3_nic_set_multi+0x8c/0x169 [netxen_nic]
[<ffffffffa0128e54>] netxen_p3_nic_set_mac_addr+0x20/0x38 [netxen_nic]
[<ffffffffa0129ed2>] netxen_nic_set_mac+0x96/0xd6 [netxen_nic]
[<ffffffff813128b1>] do_setlink+0x17c/0x360
[<ffffffff81312e97>] rtnl_newlink+0x2e1/0x4e6
[<ffffffff81312c53>] ? rtnl_newlink+0x9d/0x4e6
[<ffffffff810ec697>] ? page_add_new_anon_rmap+0x4d/0x73
[<ffffffff812f8920>] ? sock_rmalloc+0x39/0xa8
[<ffffffff810fdc8c>] ? virt_to_head_page+0x1c/0x51
[<ffffffff813122fe>] rtnetlink_rcv_msg+0x1d0/0x201
[<ffffffff81327856>] ? netlink_sendmsg+0x18f/0x2ac
[<ffffffff8131212e>] ? rtnetlink_rcv_msg+0x0/0x201
[<ffffffff81327b7d>] netlink_rcv_skb+0x4d/0xb4
[<ffffffff81312113>] rtnetlink_rcv+0x30/0x4b
[<ffffffff8132764a>] netlink_unicast+0x12f/0x1ac
[<ffffffff81327950>] netlink_sendmsg+0x289/0x2ac
[<ffffffff812f4146>] __sock_sendmsg+0x6b/0x8a
[<ffffffff812f4abf>] sock_sendmsg+0xd6/0x103
[<ffffffff812f494b>] ? sock_recvmsg+0xd9/0x106
[<ffffffff8106a257>] ? autoremove_wake_function+0x0/0x5a
[<ffffffff812ff6b4>] ? verify_iovec+0x5b/0xaf
[<ffffffff812f4d08>] sys_sendmsg+0x21c/0x2a0
[<ffffffff810e3711>] ? handle_mm_fault+0x2d6/0x681
[<ffffffff811ced89>] ? __up_read+0x9c/0xbb
[<ffffffff8106e264>] ? up_read+0x1c/0x32
[<ffffffff813a9c3f>] ? do_page_fault+0x260/0x2a4
[<ffffffff8100bdc2>] system_call_fastpath+0x16/0x1b
Code: 5b 41 5c c9 c3 55 48 89 e5 41 55 41 54 49 89 fc 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 8b 9f 28 01 00 00 <4c> 8b 6b 20 48 89 75 c8 49 8d 7d 40 e8 5e e9 27 e1 65 8b 04 25
RIP [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
RSP <ffff88005753d638>
CR2: 0000000000000020
^ permalink raw reply
* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
From: Christoph Lameter @ 2009-10-24 2:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Stephan von Krawczynski, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
Pekka Enberg, Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Kernel Testers List, netdev, linux-kernel
In-Reply-To: <20091022163752.GU11778@csn.ul.ie>
There are now rt dependencies in the page allocator that screw things up?
And an rt flag causes the page allocator to try harder meaning it adds
latency.
?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [patch next 3/4] netxen: fix bonding support
From: Eric W. Biederman @ 2009-10-24 2:13 UTC (permalink / raw)
To: Dhananjay Phadke; +Cc: netdev@vger.kernel.org
In-Reply-To: <7608421F3572AB4292BB2532AE89D5658B0B6E4F72@AVEXMB1.qlogic.org>
Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:
> Yes, this was a bug until recent times.
>
> It should be fixed in 2.6.32 development cycle by this commit -
> db4cfd8a6149e778befb2ff6e6f91cdc6394cbe6 ("netxen: handle firmware load errors").
>
> This added a check for adapter->is_up before touching txq lock.
Yes. That should prevent the null pointer deference. Will it also
allow setting the mac address when the NIC is down?
Eric
^ permalink raw reply
* Re: [Patch] sctp: remove deprecated SCTP_GET_*_OLD stuffs
From: Cong Wang @ 2009-10-24 2:41 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: Sam Ravnborg, linux-kernel, netdev, akpm
In-Reply-To: <4AE1D2AA.8060200@hp.com>
Vlad Yasevich wrote:
>
> Sam Ravnborg wrote:
>> On Thu, Oct 22, 2009 at 04:53:30PM -0400, Vlad Yasevich wrote:
>>>> diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
>>>> index be2334a..0991f1b 100644
>>>> --- a/include/net/sctp/user.h
>>>> +++ b/include/net/sctp/user.h
>>>> @@ -131,14 +131,6 @@ enum sctp_optname {
>>>> #define SCTP_SOCKOPT_BINDX_REM SCTP_SOCKOPT_BINDX_REM
>>>> SCTP_SOCKOPT_PEELOFF, /* peel off association. */
>>>> #define SCTP_SOCKOPT_PEELOFF SCTP_SOCKOPT_PEELOFF
>>>> - SCTP_GET_PEER_ADDRS_NUM_OLD, /* Get number of peer addresss. */
>>>> -#define SCTP_GET_PEER_ADDRS_NUM_OLD SCTP_GET_PEER_ADDRS_NUM_OLD
>>>> - SCTP_GET_PEER_ADDRS_OLD, /* Get all peer addresss. */
>>>> -#define SCTP_GET_PEER_ADDRS_OLD SCTP_GET_PEER_ADDRS_OLD
>>>> - SCTP_GET_LOCAL_ADDRS_NUM_OLD, /* Get number of local addresss. */
>>>> -#define SCTP_GET_LOCAL_ADDRS_NUM_OLD SCTP_GET_LOCAL_ADDRS_NUM_OLD
>>>> - SCTP_GET_LOCAL_ADDRS_OLD, /* Get all local addresss. */
>>>> -#define SCTP_GET_LOCAL_ADDRS_OLD SCTP_GET_LOCAL_ADDRS_OLD
>>>> SCTP_SOCKOPT_CONNECTX_OLD, /* CONNECTX old requests. */
>>> After running the regression suite against this patch I find that we can't
>>> remove the enum values. Removing the enums changes the value for the remainder
>>> of the definitions and breaks binary compatibility for applications that use
>>> those trailing options.
>>>
>>> You should be ok with removing the #defines and actual code that uses them,
>>> but not the enums. You can even rename the enums, but we must preserve
>>> numeric ordering.
>> If we really depend on the actual value of an enum as in this case,
>> then e should assign them direct to better document this.
>>
>> Sam
>>
>
> I agree. I have a patch that converts the enum to just a #define section that
> I'll apply on top of this removal patch and document the deletion.
Hi, Vlad.
I was busy, sorry for joining late. Thanks for doing this!
^ permalink raw reply
* Re: Irq architecture for multi-core network driver.
From: David Miller @ 2009-10-24 3:19 UTC (permalink / raw)
To: jesse.brandeburg
Cc: ebiederm, ddaney, cfriesen, netdev, linux-kernel, linux-mips
In-Reply-To: <4807377b0910231028g60b479cfycdbf3f4e25384c58@mail.gmail.com>
From: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Date: Fri, 23 Oct 2009 10:28:10 -0700
> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another) The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?
I think requiring irqbalanced for optimal behavior is more
than reasonable.
And since we explicitly took that policy logic out of the
kernel it makes absolutely no sense to put it back there.
It's policy, and policy is (largely) userspace.
^ permalink raw reply
* [PATCH 1/2 net-next-2.6] ipv6 sit: RCU conversion phase I
From: Eric Dumazet @ 2009-10-24 3:51 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki
SIT tunnels use one rwlock to protect their prl entries.
This first patch adds RCU locking for prl management,
with standard call_rcu() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/ipip.h | 1
net/ipv6/sit.c | 73 +++++++++++++++++++++++++++++--------------
2 files changed, 51 insertions(+), 23 deletions(-)
diff --git a/include/net/ipip.h b/include/net/ipip.h
index 86f1c8b..290effb 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -45,6 +45,7 @@ struct ip_tunnel_prl_entry
struct ip_tunnel_prl_entry *next;
__be32 addr;
u16 flags;
+ struct rcu_head rcu_head;
};
#define IPTUNNEL_XMIT() do { \
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 510d31f..8cdcc2a 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -240,15 +240,22 @@ failed:
return NULL;
}
+static DEFINE_SPINLOCK(ipip6_prl_lock);
+
+#define for_each_prl_rcu(start) \
+ for (prl = rcu_dereference(start); \
+ prl; \
+ prl = rcu_dereference(prl->next))
+
static struct ip_tunnel_prl_entry *
__ipip6_tunnel_locate_prl(struct ip_tunnel *t, __be32 addr)
{
- struct ip_tunnel_prl_entry *p = (struct ip_tunnel_prl_entry *)NULL;
+ struct ip_tunnel_prl_entry *prl;
- for (p = t->prl; p; p = p->next)
- if (p->addr == addr)
+ for_each_prl_rcu(t->prl)
+ if (prl->addr == addr)
break;
- return p;
+ return prl;
}
@@ -273,7 +280,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
kcalloc(cmax, sizeof(*kp), GFP_KERNEL) :
NULL;
- read_lock(&ipip6_lock);
+ rcu_read_lock();
ca = t->prl_count < cmax ? t->prl_count : cmax;
@@ -291,7 +298,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
}
c = 0;
- for (prl = t->prl; prl; prl = prl->next) {
+ for_each_prl_rcu(t->prl) {
if (c >= cmax)
break;
if (kprl.addr != htonl(INADDR_ANY) && prl->addr != kprl.addr)
@@ -303,7 +310,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
break;
}
out:
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
len = sizeof(*kp) * c;
ret = 0;
@@ -324,12 +331,14 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
if (a->addr == htonl(INADDR_ANY))
return -EINVAL;
- write_lock(&ipip6_lock);
+ spin_lock(&ipip6_prl_lock);
for (p = t->prl; p; p = p->next) {
if (p->addr == a->addr) {
- if (chg)
- goto update;
+ if (chg) {
+ p->flags = a->flags;
+ goto out;
+ }
err = -EEXIST;
goto out;
}
@@ -346,46 +355,63 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
goto out;
}
+ INIT_RCU_HEAD(&p->rcu_head);
p->next = t->prl;
- t->prl = p;
- t->prl_count++;
-update:
p->addr = a->addr;
p->flags = a->flags;
+ t->prl_count++;
+ rcu_assign_pointer(t->prl, p);
out:
- write_unlock(&ipip6_lock);
+ spin_unlock(&ipip6_prl_lock);
return err;
}
+static void prl_entry_destroy_rcu(struct rcu_head *head)
+{
+ kfree(container_of(head, struct ip_tunnel_prl_entry, rcu_head));
+}
+
+static void prl_list_destroy_rcu(struct rcu_head *head)
+{
+ struct ip_tunnel_prl_entry *p, *n;
+
+ p = container_of(head, struct ip_tunnel_prl_entry, rcu_head);
+ do {
+ n = p->next;
+ kfree(p);
+ p = n;
+ } while (p);
+}
+
static int
ipip6_tunnel_del_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a)
{
struct ip_tunnel_prl_entry *x, **p;
int err = 0;
- write_lock(&ipip6_lock);
+ spin_lock(&ipip6_prl_lock);
if (a && a->addr != htonl(INADDR_ANY)) {
for (p = &t->prl; *p; p = &(*p)->next) {
if ((*p)->addr == a->addr) {
x = *p;
*p = x->next;
- kfree(x);
+ call_rcu(&x->rcu_head, prl_entry_destroy_rcu);
t->prl_count--;
goto out;
}
}
err = -ENXIO;
} else {
- while (t->prl) {
+ if (t->prl) {
+ t->prl_count = 0;
x = t->prl;
- t->prl = t->prl->next;
- kfree(x);
- t->prl_count--;
+ call_rcu(&x->rcu_head, prl_list_destroy_rcu);
+ t->prl = NULL;
}
}
out:
- write_unlock(&ipip6_lock);
+ spin_unlock(&ipip6_prl_lock);
return err;
}
@@ -395,7 +421,7 @@ isatap_chksrc(struct sk_buff *skb, struct iphdr *iph, struct ip_tunnel *t)
struct ip_tunnel_prl_entry *p;
int ok = 1;
- read_lock(&ipip6_lock);
+ rcu_read_lock();
p = __ipip6_tunnel_locate_prl(t, iph->saddr);
if (p) {
if (p->flags & PRL_DEFAULT)
@@ -411,7 +437,7 @@ isatap_chksrc(struct sk_buff *skb, struct iphdr *iph, struct ip_tunnel *t)
else
ok = 0;
}
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
return ok;
}
@@ -1192,6 +1218,7 @@ static void __exit sit_cleanup(void)
xfrm4_tunnel_deregister(&sit_handler, AF_INET6);
unregister_pernet_gen_device(sit_net_id, &sit_net_ops);
+ rcu_barrier(); /* Wait for completion of call_rcu()'s */
}
static int __init sit_init(void)
^ permalink raw reply related
* [PATCH 2/2 net-next-2.6] ipv6 sit: RCU conversion phase II
From: Eric Dumazet @ 2009-10-24 3:52 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki
SIT tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/ipv6/sit.c | 45 +++++++++++++++++++++++++++------------------
1 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 8cdcc2a..b6b1626 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -77,8 +77,17 @@ struct sit_net {
struct net_device *fb_tunnel_dev;
};
-static DEFINE_RWLOCK(ipip6_lock);
+/*
+ * Locking : hash tables are protected by RCU and a spinlock
+ */
+static DEFINE_SPINLOCK(ipip6_lock);
+
+#define for_each_ip_tunnel_rcu(start) \
+ for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+/*
+ * Must be invoked with rcu_read_lock
+ */
static struct ip_tunnel * ipip6_tunnel_lookup(struct net *net,
struct net_device *dev, __be32 remote, __be32 local)
{
@@ -87,26 +96,26 @@ static struct ip_tunnel * ipip6_tunnel_lookup(struct net *net,
struct ip_tunnel *t;
struct sit_net *sitn = net_generic(net, sit_net_id);
- for (t = sitn->tunnels_r_l[h0^h1]; t; t = t->next) {
+ for_each_ip_tunnel_rcu(sitn->tunnels_r_l[h0 ^ h1]) {
if (local == t->parms.iph.saddr &&
remote == t->parms.iph.daddr &&
(!dev || !t->parms.link || dev->iflink == t->parms.link) &&
(t->dev->flags & IFF_UP))
return t;
}
- for (t = sitn->tunnels_r[h0]; t; t = t->next) {
+ for_each_ip_tunnel_rcu(sitn->tunnels_r[h0]) {
if (remote == t->parms.iph.daddr &&
(!dev || !t->parms.link || dev->iflink == t->parms.link) &&
(t->dev->flags & IFF_UP))
return t;
}
- for (t = sitn->tunnels_l[h1]; t; t = t->next) {
+ for_each_ip_tunnel_rcu(sitn->tunnels_l[h1]) {
if (local == t->parms.iph.saddr &&
(!dev || !t->parms.link || dev->iflink == t->parms.link) &&
(t->dev->flags & IFF_UP))
return t;
}
- t = sitn->tunnels_wc[0];
+ t = rcu_dereference(sitn->tunnels_wc[0]);
if ((t != NULL) && (t->dev->flags & IFF_UP))
return t;
return NULL;
@@ -143,9 +152,9 @@ static void ipip6_tunnel_unlink(struct sit_net *sitn, struct ip_tunnel *t)
for (tp = ipip6_bucket(sitn, t); *tp; tp = &(*tp)->next) {
if (t == *tp) {
- write_lock_bh(&ipip6_lock);
+ spin_lock_bh(&ipip6_lock);
*tp = t->next;
- write_unlock_bh(&ipip6_lock);
+ spin_unlock_bh(&ipip6_lock);
break;
}
}
@@ -155,10 +164,10 @@ static void ipip6_tunnel_link(struct sit_net *sitn, struct ip_tunnel *t)
{
struct ip_tunnel **tp = ipip6_bucket(sitn, t);
+ spin_lock_bh(&ipip6_lock);
t->next = *tp;
- write_lock_bh(&ipip6_lock);
- *tp = t;
- write_unlock_bh(&ipip6_lock);
+ rcu_assign_pointer(*tp, t);
+ spin_unlock_bh(&ipip6_lock);
}
static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
@@ -447,9 +456,9 @@ static void ipip6_tunnel_uninit(struct net_device *dev)
struct sit_net *sitn = net_generic(net, sit_net_id);
if (dev == sitn->fb_tunnel_dev) {
- write_lock_bh(&ipip6_lock);
+ spin_lock_bh(&ipip6_lock);
sitn->tunnels_wc[0] = NULL;
- write_unlock_bh(&ipip6_lock);
+ spin_unlock_bh(&ipip6_lock);
dev_put(dev);
} else {
ipip6_tunnel_unlink(sitn, netdev_priv(dev));
@@ -502,7 +511,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
err = -ENOENT;
- read_lock(&ipip6_lock);
+ rcu_read_lock();
t = ipip6_tunnel_lookup(dev_net(skb->dev),
skb->dev,
iph->daddr,
@@ -520,7 +529,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
t->err_count = 1;
t->err_time = jiffies;
out:
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
return err;
}
@@ -540,7 +549,7 @@ static int ipip6_rcv(struct sk_buff *skb)
iph = ip_hdr(skb);
- read_lock(&ipip6_lock);
+ rcu_read_lock();
tunnel = ipip6_tunnel_lookup(dev_net(skb->dev), skb->dev,
iph->saddr, iph->daddr);
if (tunnel != NULL) {
@@ -554,7 +563,7 @@ static int ipip6_rcv(struct sk_buff *skb)
if ((tunnel->dev->priv_flags & IFF_ISATAP) &&
!isatap_chksrc(skb, iph, tunnel)) {
tunnel->dev->stats.rx_errors++;
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
kfree_skb(skb);
return 0;
}
@@ -565,12 +574,12 @@ static int ipip6_rcv(struct sk_buff *skb)
nf_reset(skb);
ipip6_ecn_decapsulate(iph, skb);
netif_rx(skb);
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
return 0;
}
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
- read_unlock(&ipip6_lock);
+ rcu_read_unlock();
out:
kfree_skb(skb);
return 0;
^ permalink raw reply related
* [PATCH net-next-2.6] xfrm6_tunnel: RCU conversion
From: Eric Dumazet @ 2009-10-24 4:19 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki
xfrm6_tunnels use one rwlock to protect their hash tables.
Plain and straightforward conversion to RCU locking to permit better SMP
performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/ipv6/xfrm6_tunnel.c | 47 ++++++++++++++++++++++----------------
1 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index 81a95c0..438831d 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -23,7 +23,7 @@
*/
#include <linux/module.h>
#include <linux/xfrm.h>
-#include <linux/list.h>
+#include <linux/rculist.h>
#include <net/ip.h>
#include <net/xfrm.h>
#include <net/ipv6.h>
@@ -36,14 +36,15 @@
* per xfrm_address_t.
*/
struct xfrm6_tunnel_spi {
- struct hlist_node list_byaddr;
- struct hlist_node list_byspi;
- xfrm_address_t addr;
- u32 spi;
- atomic_t refcnt;
+ struct hlist_node list_byaddr;
+ struct hlist_node list_byspi;
+ xfrm_address_t addr;
+ u32 spi;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
};
-static DEFINE_RWLOCK(xfrm6_tunnel_spi_lock);
+static DEFINE_SPINLOCK(xfrm6_tunnel_spi_lock);
static u32 xfrm6_tunnel_spi;
@@ -107,6 +108,7 @@ static void xfrm6_tunnel_spi_fini(void)
if (!hlist_empty(&xfrm6_tunnel_spi_byspi[i]))
return;
}
+ rcu_barrier();
kmem_cache_destroy(xfrm6_tunnel_spi_kmem);
xfrm6_tunnel_spi_kmem = NULL;
}
@@ -116,7 +118,7 @@ static struct xfrm6_tunnel_spi *__xfrm6_tunnel_spi_lookup(xfrm_address_t *saddr)
struct xfrm6_tunnel_spi *x6spi;
struct hlist_node *pos;
- hlist_for_each_entry(x6spi, pos,
+ hlist_for_each_entry_rcu(x6spi, pos,
&xfrm6_tunnel_spi_byaddr[xfrm6_tunnel_spi_hash_byaddr(saddr)],
list_byaddr) {
if (memcmp(&x6spi->addr, saddr, sizeof(x6spi->addr)) == 0)
@@ -131,10 +133,10 @@ __be32 xfrm6_tunnel_spi_lookup(xfrm_address_t *saddr)
struct xfrm6_tunnel_spi *x6spi;
u32 spi;
- read_lock_bh(&xfrm6_tunnel_spi_lock);
+ rcu_read_lock_bh();
x6spi = __xfrm6_tunnel_spi_lookup(saddr);
spi = x6spi ? x6spi->spi : 0;
- read_unlock_bh(&xfrm6_tunnel_spi_lock);
+ rcu_read_unlock_bh();
return htonl(spi);
}
@@ -185,14 +187,15 @@ alloc_spi:
if (!x6spi)
goto out;
+ INIT_RCU_HEAD(&x6spi->rcu_head);
memcpy(&x6spi->addr, saddr, sizeof(x6spi->addr));
x6spi->spi = spi;
atomic_set(&x6spi->refcnt, 1);
- hlist_add_head(&x6spi->list_byspi, &xfrm6_tunnel_spi_byspi[index]);
+ hlist_add_head_rcu(&x6spi->list_byspi, &xfrm6_tunnel_spi_byspi[index]);
index = xfrm6_tunnel_spi_hash_byaddr(saddr);
- hlist_add_head(&x6spi->list_byaddr, &xfrm6_tunnel_spi_byaddr[index]);
+ hlist_add_head_rcu(&x6spi->list_byaddr, &xfrm6_tunnel_spi_byaddr[index]);
out:
return spi;
}
@@ -202,26 +205,32 @@ __be32 xfrm6_tunnel_alloc_spi(xfrm_address_t *saddr)
struct xfrm6_tunnel_spi *x6spi;
u32 spi;
- write_lock_bh(&xfrm6_tunnel_spi_lock);
+ spin_lock_bh(&xfrm6_tunnel_spi_lock);
x6spi = __xfrm6_tunnel_spi_lookup(saddr);
if (x6spi) {
atomic_inc(&x6spi->refcnt);
spi = x6spi->spi;
} else
spi = __xfrm6_tunnel_alloc_spi(saddr);
- write_unlock_bh(&xfrm6_tunnel_spi_lock);
+ spin_unlock_bh(&xfrm6_tunnel_spi_lock);
return htonl(spi);
}
EXPORT_SYMBOL(xfrm6_tunnel_alloc_spi);
+static void x6spi_destroy_rcu(struct rcu_head *head)
+{
+ kmem_cache_free(xfrm6_tunnel_spi_kmem,
+ container_of(head, struct xfrm6_tunnel_spi, rcu_head));
+}
+
void xfrm6_tunnel_free_spi(xfrm_address_t *saddr)
{
struct xfrm6_tunnel_spi *x6spi;
struct hlist_node *pos, *n;
- write_lock_bh(&xfrm6_tunnel_spi_lock);
+ spin_lock_bh(&xfrm6_tunnel_spi_lock);
hlist_for_each_entry_safe(x6spi, pos, n,
&xfrm6_tunnel_spi_byaddr[xfrm6_tunnel_spi_hash_byaddr(saddr)],
@@ -229,14 +238,14 @@ void xfrm6_tunnel_free_spi(xfrm_address_t *saddr)
{
if (memcmp(&x6spi->addr, saddr, sizeof(x6spi->addr)) == 0) {
if (atomic_dec_and_test(&x6spi->refcnt)) {
- hlist_del(&x6spi->list_byaddr);
- hlist_del(&x6spi->list_byspi);
- kmem_cache_free(xfrm6_tunnel_spi_kmem, x6spi);
+ hlist_del_rcu(&x6spi->list_byaddr);
+ hlist_del_rcu(&x6spi->list_byspi);
+ call_rcu(&x6spi->rcu_head, x6spi_destroy_rcu);
break;
}
}
}
- write_unlock_bh(&xfrm6_tunnel_spi_lock);
+ spin_unlock_bh(&xfrm6_tunnel_spi_lock);
}
EXPORT_SYMBOL(xfrm6_tunnel_free_spi);
^ permalink raw reply related
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-24 4:35 UTC (permalink / raw)
To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <20091023211338.GA6145@linux.vnet.ibm.com>
Paul E. McKenney a écrit :
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> [PATCH] net: allow netdev_wait_allrefs() to run faster
>>
>> netdev_wait_allrefs() waits that all references to a device vanishes.
>>
>> It currently uses a _very_ pessimistic 250 ms delay between each probe.
>> Some users report that no more than 4 devices can be dismantled per second,
>> this is a pretty serious problem for extreme setups.
>>
>> Most likely, references only wait for a rcu grace period that should come
>> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
>
> Is this a place where synchronize_rcu_expedited() is appropriate?
> (It went in to 2.6.32-rc1.)
>
Thanks for the tip Paul
I believe netdev_wait_allrefs() is not a perfect candidate, because
synchronize_sched_expedited() seems really expensive.
Maybe we could call it once only, if we had to call 1 times
the jiffie delay ?
diff --git a/net/core/dev.c b/net/core/dev.c
index fa88dcd..9b04b9a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
static void netdev_wait_allrefs(struct net_device *dev)
{
unsigned long rebroadcast_time, warning_time;
+ unsigned int count = 0;
rebroadcast_time = warning_time = jiffies;
while (atomic_read(&dev->refcnt) != 0) {
@@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
rebroadcast_time = jiffies;
}
- msleep(250);
+ if (count++ == 1)
+ synchronize_rcu_expedited();
+ else
+ schedule_timeout_uninterruptible(1);
if (time_after(jiffies, warning_time + 10 * HZ)) {
printk(KERN_EMERG "unregister_netdevice: "
^ permalink raw reply related
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Paul E. McKenney @ 2009-10-24 5:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4AE28429.6040608@gmail.com>
On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> >> [PATCH] net: allow netdev_wait_allrefs() to run faster
> >>
> >> netdev_wait_allrefs() waits that all references to a device vanishes.
> >>
> >> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> >> Some users report that no more than 4 devices can be dismantled per second,
> >> this is a pretty serious problem for extreme setups.
> >>
> >> Most likely, references only wait for a rcu grace period that should come
> >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
> >
> > Is this a place where synchronize_rcu_expedited() is appropriate?
> > (It went in to 2.6.32-rc1.)
>
> Thanks for the tip Paul
>
> I believe netdev_wait_allrefs() is not a perfect candidate, because
> synchronize_sched_expedited() seems really expensive.
It does indeed keep the CPUs quite busy for a bit. ;-)
> Maybe we could call it once only, if we had to call 1 times
> the jiffie delay ?
This could be a very useful approach!
However, please keep in mind that although synchronize_rcu_expedited()
forces a grace period, it does nothing to speed the invocation of other
RCU callbacks. In short, synchronize_rcu_expedited() is a faster version
of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
or call_rcu() invocations.
The reason I point this out is that it looks to me that the code below is
waiting for some other task which is in turn waiting on a grace period.
But I don't know this code, so could easily be confused.
Thanx, paul
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa88dcd..9b04b9a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
> static void netdev_wait_allrefs(struct net_device *dev)
> {
> unsigned long rebroadcast_time, warning_time;
> + unsigned int count = 0;
>
> rebroadcast_time = warning_time = jiffies;
> while (atomic_read(&dev->refcnt) != 0) {
> @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
> rebroadcast_time = jiffies;
> }
>
> - msleep(250);
> + if (count++ == 1)
> + synchronize_rcu_expedited();
> + else
> + schedule_timeout_uninterruptible(1);
>
> if (time_after(jiffies, warning_time + 10 * HZ)) {
> printk(KERN_EMERG "unregister_netdevice: "
>
^ permalink raw reply
* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Pekka Enberg @ 2009-10-24 6:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
" \" <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, torvalds@l
In-Reply-To: <alpine.DEB.1.10.0910232151380.2001@V090114053VZO-1>
On Thu, 22 Oct 2009, Pekka Enberg wrote:
>> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
Christoph Lameter wrote:
> Bug fixes go into main not linux-next. Lets make sure these fixes really
> work and then merge.
Regardless, patches 1-2 and should _really_ go to Linus' tree (and
eventually -stable) while we figure out the rest of the problems. They
fix obvious regressions in the code paths and we have reports from
people that they help. Yes, they don't fix everything for everyone but
we there's no upside in holding back fixes that are simple one line
fixes to regressions.
Pekka
^ permalink raw reply
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-24 8:49 UTC (permalink / raw)
To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <20091024054943.GA6638@linux.vnet.ibm.com>
Paul E. McKenney a écrit :
> On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
>
>> Maybe we could call it once only, if we had to call 1 times
>> the jiffie delay ?
>
> This could be a very useful approach!
>
> However, please keep in mind that although synchronize_rcu_expedited()
> forces a grace period, it does nothing to speed the invocation of other
> RCU callbacks. In short, synchronize_rcu_expedited() is a faster version
> of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
> or call_rcu() invocations.
>
> The reason I point this out is that it looks to me that the code below is
> waiting for some other task which is in turn waiting on a grace period.
> But I don't know this code, so could easily be confused.
>
Normally, we need a synchronize_rcu() calls, but I feel its bit more than really
needed here.
On my dev machine, a synchronize_rcu() lasts between 2 an 12 ms
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.580259] synchronize_net() 4045596 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.588262] synchronize_net() 7769327 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.625014] synchronize_net() 4772052 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.633008] synchronize_net() 7773896 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.669260] synchronize_net() 3958141 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.677259] synchronize_net() 7755817 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.712011] synchronize_net() 2502544 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.720011] synchronize_net() 7767748 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.754259] synchronize_net() 2087946 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.762258] synchronize_net() 7738054 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.796011] synchronize_net() 3392760 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.808025] synchronize_net() 11814619 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.848010] synchronize_net() 8970220 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.856015] synchronize_net() 7800782 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.893008] synchronize_net() 6650174 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.897012] synchronize_net() 3744808 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.940202] synchronize_net() 8354366 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.952137] synchronize_net() 11693215 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.985010] synchronize_net() 2355970 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.989009] synchronize_net() 3771419 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.028137] synchronize_net() 7661195 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.036152] synchronize_net() 7800056 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.083135] synchronize_net() 6774026 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.089145] synchronize_net() 5727189 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.130385] synchronize_net() 10133932 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.134399] synchronize_net() 3773058 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.170136] synchronize_net() 4479194 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.178138] synchronize_net() 7710466 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.217198] synchronize_net() 4323437 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.226206] synchronize_net() 8723108 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.268013] synchronize_net() 6221155 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.280007] synchronize_net() 11719297 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.324008] synchronize_net() 11654511 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.332009] synchronize_net() 7744182 ns
^ permalink raw reply
* Re: [PATCH 00/27] sfc: Cleanup and preparation for new hardware
From: David Miller @ 2009-10-24 11:29 UTC (permalink / raw)
To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 23 Oct 2009 19:27:21 +0100
> This is a mixture of small cleanups, refactoring, renaming and an update
> to the register definitions. There's more to come, in particular
> support for the new hardware, but I'll let you digest this batch first.
All applied, thanks Ben.
^ permalink raw reply
* Re: [PATCH net-next-2.6] xfrm6_tunnel: RCU conversion
From: David Miller @ 2009-10-24 13:09 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, yoshfuji
In-Reply-To: <4AE28047.1020904@gmail.com>
Nice work Eric, I applied all of your RCU conversions to
net-next-2.6, thanks!
Applying: ipv6 sit: RCU conversion phase I
Applying: ipv6 sit: RCU conversion phase II
Applying: xfrm6_tunnel: RCU conversion
Applying: ipip: convert hash tables locking to RCU
Applying: ip6tnl: convert hash tables locking to RCU
Applying: gre: convert hash tables locking to RCU
^ permalink raw reply
* Re: [RFC] net,socket: introduce build_sockaddr_check helper to catch overflow at build time
From: David Miller @ 2009-10-24 13:12 UTC (permalink / raw)
To: gorcunov; +Cc: netdev
In-Reply-To: <20091023214306.GA30616@lenovo>
From: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Sat, 24 Oct 2009 01:43:06 +0400
> Or say it could be something like that
>
> #define __sockaddr(type, src) \
> ({ build_sockaddr_check(sizeof(type)); (type *) src; })
>
> and say in function af_inet.c:inet_getname instead of
>
> struct sockaddr_in *sin = (struct sockaddr_in *)uaddr;
>
> we may write like
>
> struct sockaddr_in *sin = __sockaddr(struct sockaddr_in, uaddr);
>
> which would check the size.
Or even a "DECLARE_SOCKADDR(type, src, dest)" which encapsulates the
entire declaration statement.
^ permalink raw reply
* Re: [PATCH net-next-2.6] rtnetlink: speedup rtnl_dump_ifinfo()
From: David Miller @ 2009-10-24 13:13 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, bcrl
In-Reply-To: <4AE06D81.4050003@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 22 Oct 2009 16:34:41 +0200
> When handling large number of netdevice, rtnl_dump_ifinfo()
> is very slow because it has O(N^2) complexity.
>
> Instead of scanning one single list, we can use the 256 sub lists
> of the dev_index hash table.
>
> This considerably speedups "ip link" operations
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks!
^ permalink raw reply
* Re: [PATCH] atm: Cleanup redundant tests on unsigned
From: David Miller @ 2009-10-24 13:19 UTC (permalink / raw)
To: eric.dumazet; +Cc: roel.kluin, chas, netdev, akpm
In-Reply-To: <4AE1D4FE.5090902@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Oct 2009 18:08:30 +0200
> Roel Kluin a écrit :
>> The variables are unsigned so the `< 0' test always fails, the
>> other part of the test catches wrapped values.
>>
>> Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
>> ---
>
> I believe such patches were posted in the past.
>
> General consensus is that compiler is able to do this optimization for us,
> and reader doesnt have to ask to himself : "Is the test safe enough ?"
Conversely, the reader can say "Wow this can be negative? There's
tests missing elsewhere! ... oh, nevermind it's unsigned"
That really wastes people's time too.
I suspect I'll apply these patches. :-)
^ permalink raw reply
* Request: how do packet routing based on input gateway
From: Luca Dionisi @ 2009-10-24 13:22 UTC (permalink / raw)
To: netdev, linux-net
Hi all
I must instruct my server to route packets based on destination prefix
and on the gateway from which the packet arrived.
E.g.
I am 192.168.0.1
a packet for 192.168.1.0/24 which arrived from gateway 192.168.0.2
must go through 192.168.1.1 dev eth0
a packet for 192.168.1.0/24 which arrived from gateway 192.168.0.3
must go through 192.168.1.2 dev eth1
a packet for 192.168.1.0/24 which arrived from any other gateway must
go through 192.168.1.3 dev eth2
The gateway I mean is just the last hop traversed, before me. It's not
necessarily the source of the packet.
How am I supposed to do that?
Regards.
Luca Dionisi
^ permalink raw reply
* Re: Irq architecture for multi-core network driver.
From: David Miller @ 2009-10-24 13:23 UTC (permalink / raw)
To: ddaney; +Cc: cfriesen, netdev, linux-kernel, linux-mips
In-Reply-To: <4AE0DB98.1000101@caviumnetworks.com>
From: David Daney <ddaney@caviumnetworks.com>
Date: Thu, 22 Oct 2009 15:24:24 -0700
> Certainly this is one mode of operation that should be supported, but
> I would also like to be able to go for raw throughput and have as many
> cores as possible reading from a single queue (like I currently have).
You can't do this, at least within the same flow, since as you even
mention in your original posting this can result in packet reordering
which we must avoid as much as is possible.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox