From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752020Ab1JSSlf (ORCPT );
	Wed, 19 Oct 2011 14:41:35 -0400
Received: from mail.vyatta.com ([76.74.103.46]:50806 "EHLO mail.vyatta.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751249Ab1JSSld (ORCPT );
	Wed, 19 Oct 2011 14:41:33 -0400
Date: Wed, 19 Oct 2011 11:41:29 -0700
From: Stephen Hemminger
To: Jay Vosburgh
Cc: Mitsuo Hayasaka, Andy Gospodarek, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, yrl.pp-manager.tt@hitachi.com,
	WANG Cong
Subject: Re: [PATCH net -v2] [BUGFIX] bonding: use flush_delayed_work_sync
	in bond_close
Message-ID: <20111019114129.162d895a@nehalam.linuxnetplumber.net>
In-Reply-To: <27007.1319047262@death>
References: <20111019081757.12455.24788.stgit@ltc219.sdl.hitachi.co.jp>
	<27007.1319047262@death>
Organization: Vyatta
X-Mailer: Claws Mail 3.7.10 (GTK+ 2.24.6; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 19 Oct 2011 11:01:02 -0700
Jay Vosburgh wrote:

> Mitsuo Hayasaka wrote:
>
> >The bond_close() calls cancel_delayed_work() to cancel delayed works.
> >It, however, cannot cancel works that have already been queued on the
> >workqueue. The bond_open() initializes work->data, and
> >process_one_work() dereferences get_work_cwq(work)->wq->flags. The
> >get_work_cwq() returns NULL when work->data has been re-initialized.
> >Thus, a panic occurs.
> >
> >This patch uses flush_delayed_work_sync() instead of
> >cancel_delayed_work() in bond_close(). It cancels the delayed timer
> >and waits for the work to finish execution, so it avoids the NULL
> >pointer dereference caused by process_one_work() running in parallel
> >with the re-initialization done by bond_open().
>
> 	I'm setting up to test this.
> 	I have a dim recollection that we
> tried this some years ago, and there was a different deadlock that
> manifested through the flush path. Perhaps changes since then have
> removed that problem.
>
> 	-J

Won't this deadlock on RTNL? The problem is that:

	CPU0                          CPU1
	rtnl_lock
	bond_close
	                              delayed_work
	                                mii_work
	                                  read_lock(bond->lock);
	                                  read_unlock(bond->lock);
	                                  rtnl_lock...
	                                    waiting for CPU0
	flush_delayed_work_sync
	  waiting for delayed_work to finish...