From mboxrd@z Thu Jan 1 00:00:00 1970 Return-path: Received: from mail.candelatech.com ([208.74.158.172]:60005 "EHLO ns3.lanforge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752295Ab0KLU5j (ORCPT ); Fri, 12 Nov 2010 15:57:39 -0500 Message-ID: <4CDDAA3B.9090007@candelatech.com> Date: Fri, 12 Nov 2010 12:57:31 -0800 From: Ben Greear MIME-Version: 1.0 To: Johannes Berg CC: linux-wireless@vger.kernel.org, Tejun Heo Subject: Re: [PATCH] mac80211: Fix deadlock in ieee80211_do_stop. References: <1289592426-5367-1-git-send-email-greearb@candelatech.com> <1289594998.3736.11.camel@jlt3.sipsolutions.net> In-Reply-To: <1289594998.3736.11.camel@jlt3.sipsolutions.net> Content-Type: text/plain; charset=UTF-8; format=flowed Sender: linux-wireless-owner@vger.kernel.org List-ID: On 11/12/2010 12:49 PM, Johannes Berg wrote: > On Fri, 2010-11-12 at 12:07 -0800, greearb@candelatech.com wrote: > >> - flush_work(&sdata->work); >> + /* Cannot call flush_work here because we are holding >> + * RTNL and the worker thread(s) that will be called upon to >> + * do the flushing might already be running a piece of work >> + * that is blocking on RTNL. That leads to deadlock and/or >> + * OOM. >> + */ >> + cancel_work_sync(&sdata->work); > > I don't think that comment really belongs into the sources at all ... if > we did that all the time we'd have no sources left between the > comments :-) I didn't want anyone to ever revert this un-knowingly, but assuming the patch is otherwise agreed to be correct, I'll remove the comment if you wish. :) > However, I also don't think it should be necessary to do this. > sdata->work is always queued on local->workqueue, which is created using > alloc_ordered_workqueue(), and there is no work on this workqueue that > uses the RTNL. Therefore, even flushing the entire workqueue must work, > unless alloc_ordered_workqueue() has no such guarantee any more -- which > I would consider to be a bug in the new workqueue framework. The problem appears (to me) to be that the flush_work() attempts to wait for the worker to complete it's current task. The worker can be doing a completely separate task (ie, wireless_nlevent_process), but that task can never complete because do_stop() holds rtnl and the task-in-progress may block on acquiring rtnl. So, flush_work() cannot make any progress. The stack-traces for hung programs I originally posted seem to agree with this analysis. So far, I reproduced the bug around 20 times in a row witout the patch, and since I added this patch, I have two good runs in a row, so it definitely has an affect. If my assumptions are correct, it would seem to unsafe to EVER call flush_work() while holding rtnl (or indeed, any other lock that any other work could possibly require). Thanks, Ben > > johannes -- Ben Greear Candela Technologies Inc http://www.candelatech.com