From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jay Vosburgh
Subject: Re: RTNL: assertion failed at net/core/dev.c (4494) and RTNL: assertion failed at net/core/rtnetlink.c (940)
Date: Thu, 06 Feb 2014 14:07:42 -0800
Message-ID: <31272.1391724462@death.nxdomain>
References: <20140206205106.GA10488@glanzmann.de> <30988.1391723318@death.nxdomain>
Cc: Thomas Glanzmann, Eric Dumazet, netdev, Veaceslav Falico, andy@greyhouse.net, Jiří Pírko
To: Cong Wang
In-reply-to: <30988.1391723318@death.nxdomain>
Sender: netdev-owner@vger.kernel.org

Jay Vosburgh wrote:

>Cong Wang wrote:
>
>>On Thu, Feb 6, 2014 at 12:51 PM, Thomas Glanzmann wrote:
>>> Hello,
>>> this morning I checked out Linus tip and compiled it after booting my
>>> dmesg is full of:
>>>
>>> [    8.944991] RTNL: assertion failed at net/core/dev.c (4494)
>>> [    8.950640] CPU: 3 PID: 388 Comm: kworker/u24:4 Not tainted 3.14.0-rc1+ #3
>>> [    8.950642] Hardware name: Supermicro X9SRD-F/X9SRD-F, BIOS 1.0a 10/15/2012
>>> [    8.950654] Workqueue: bond0 bond_3ad_state_machine_handler [bonding]
>>> [    8.950658]  0000000000000000 ffff881020c88000 ffffffff8138e219 ffff881020c88000
>>> [    8.950664]  ffffffff812d3091 ffff881023961040 ffffffff812e3132 0000000000000246
>>> [    8.950670]  0000000000000020 ffff881020ab1be8 0000000020ab1ba8 0000000000000000
>>> [    8.950675] Call Trace:
>>> [    8.950686]  [] ? dump_stack+0x41/0x51
>>> [    8.950694]  [] ? netdev_master_upper_dev_get+0x2a/0x4d
>>> [    8.950699]  [] ? rtnl_fill_ifinfo+0x2c/0xac4
>>> [    8.950707]  [] ? print_time.part.5+0x50/0x54
>>> [    8.950715]  [] ? __kmalloc_reserve.isra.42+0x2a/0x6d
>>> [    8.950721]  [] ? ksize+0x12/0x1e
>>> [    8.950726]  [] ? __alloc_skb+0xb5/0x1a9
>>> [    8.950731]  [] ? rtmsg_ifinfo+0x6c/0xd6
>>> [    8.950739]  [] ? __enable_port.isra.17+0x51/0x5a [bonding]
>>> [    8.950747]  [] ? ad_agg_selection_logic+0x3d3/0x3ed [bonding]
>>> [    8.950754]  [] ? bond_3ad_state_machine_handler+0x555/0x918 [bonding]
>>> [    8.950761]  [] ? process_one_work+0x191/0x293
>>> [    8.950766]  [] ? worker_thread+0x121/0x1e7
>>> [    8.950770]  [] ? rescuer_thread+0x269/0x269
>>> [    8.950777]  [] ? kthread+0x99/0xa1
>>> [    8.950782]  [] ? __kthread_parkme+0x59/0x59
>>> [    8.950789]  [] ? ret_from_fork+0x7c/0xb0
>>> [    8.950794]  [] ? __kthread_parkme+0x59/0x59
>>
>>
>>Hmm, rtmsg_ifinfo() should be called with rtnl lock, but
>>__enable_port() is called with rcu_read_lock() which means we
>>can't block inside it, therefore we probably should take rtnl
>>lock outside:
>>
>>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>>index cce1f1b..3c09ffa 100644
>>--- a/drivers/net/bonding/bond_3ad.c
>>+++ b/drivers/net/bonding/bond_3ad.c
>>@@ -2065,6 +2065,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>>        struct slave *slave;
>>        struct port *port;
>>
>>+       rtnl_lock();
>>        read_lock(&bond->lock);
>>        rcu_read_lock();
>>
>>@@ -2123,6 +2124,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>> re_arm:
>>        rcu_read_unlock();
>>        read_unlock(&bond->lock);
>>+       rtnl_unlock();
>>        queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
>> }
>
>        That would eliminate the warning, but is suboptimal.  Acquiring
>RTNL is not necessary on the vast majority of state machine runs
>(because no state changes take place, i.e., no ports are disabled or
>enabled).  The above change would add 10 round trips per second to RTNL,
>which seems excessive.
>
>        Also, we cannot unconditionally acquire RTNL in this function,
>as it would race with the call to cancel_delayed_work_sync from
>bond_close (via bond_work_cancel_all).

        Thought of one more problem: we can't hold a regular
(non-sleeping) lock such as bond->lock while calling rtmsg_ifinfo, as
it may sleep in alloc_skb.  The rtmsg_ifinfo call has to be made
holding RTNL and nothing else.

        -J

---
        -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
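
        For illustration only, the three constraints raised in this
thread (RTNL is needed only on the rare runs where a port changes
state, the work handler must not block on RTNL, and the notification
must be sent with no other lock held) can be modelled in userspace.
The following is a minimal sketch in C with pthreads, not kernel code
and not a proposed patch.  All names in it are hypothetical stand-ins:
big_lock plays the role of RTNL, state_lock the role of bond->lock,
state_machine_run the role of the periodic handler, and
send_notification the role of the rtmsg_ifinfo call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins: big_lock models RTNL, state_lock models bond->lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

static bool big_lock_held;      /* true only while this code holds big_lock */
static bool port_enabled;       /* state protected by state_lock */
static int notifications_sent;  /* counts calls to send_notification() */

/* Models a notification that may sleep and requires big_lock. */
static void send_notification(void)
{
    /* Loosely analogous to an ASSERT_RTNL() style check. */
    assert(big_lock_held);
    notifications_sent++;
}

/*
 * One run of the periodic handler.
 * 'pending' carries over a notification that an earlier run could not send.
 * Returns true if a notification is still owed and the next run should retry.
 */
bool state_machine_run(bool want_enabled, bool pending)
{
    bool should_notify = pending;

    /* Update state under state_lock, but only record that a
     * notification is needed; do not send it here. */
    pthread_mutex_lock(&state_lock);
    if (port_enabled != want_enabled) {
        port_enabled = want_enabled;
        should_notify = true;
    }
    pthread_mutex_unlock(&state_lock);

    /* Common case: nothing changed, so big_lock is never touched. */
    if (!should_notify)
        return false;

    /* Never block on big_lock: if it is contended, defer to the next run. */
    if (pthread_mutex_trylock(&big_lock) != 0)
        return true;

    big_lock_held = true;
    send_notification();        /* big_lock held, state_lock not held */
    big_lock_held = false;
    pthread_mutex_unlock(&big_lock);
    return false;
}
```

        A caller would invoke state_machine_run() once per period and
feed the returned value back in as 'pending' on the next call.  The
try-lock is the design choice that matters here: because the handler
never blocks on big_lock, it cannot stall a thread that holds big_lock
while waiting for the handler to finish.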