From mboxrd@z Thu Jan 1 00:00:00 1970
Return-path:
Received: from mail.candelatech.com ([208.74.158.172]:47236 "EHLO
	ns3.lanforge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755502Ab0LHSTg (ORCPT );
	Wed, 8 Dec 2010 13:19:36 -0500
Message-ID: <4CFFCC31.1050408@candelatech.com>
Date: Wed, 08 Dec 2010 10:19:29 -0800
From: Ben Greear
MIME-Version: 1.0
To: "Luis R. Rodriguez"
CC: Johannes Berg, Tejun Heo, linux-wireless@vger.kernel.org
Subject: Re: [PATCH] mac80211: Fix deadlock in ieee80211_do_stop.
References: <1289592426-5367-1-git-send-email-greearb@candelatech.com>
	<1289594998.3736.11.camel@jlt3.sipsolutions.net>
	<4CDDAA3B.9090007@candelatech.com>
	<1289596096.3736.13.camel@jlt3.sipsolutions.net>
	<4CDE699B.70401@kernel.org>
	<4CE1A344.7040201@candelatech.com>
	<4CE292F7.4090200@kernel.org>
	<1289929258.3673.1.camel@jlt3.sipsolutions.net>
	<4CE396A9.1050908@kernel.org>
	<1290020005.3777.6.camel@jlt3.sipsolutions.net>
	<4CE4C8DD.6010806@kernel.org>
	<51f5dd53c39a77fff4efc1a99b189725@localhost>
	<4CE4D41F.1080005@kernel.org>
	<1290099585.3801.1.camel@jlt3.sipsolutions.net>
	<4CE68AF4.8060507@kernel.org>
	<1290189452.3768.3.camel@jlt3.sipsolutions.net>
	<4CE6E430.6080804@candelatech.com>
	<4CFFC214.6000608@candelatech.com>
In-Reply-To: <4CFFC214.6000608@candelatech.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Sender: linux-wireless-owner@vger.kernel.org
List-ID:

On 12/08/2010 09:36 AM, Ben Greear wrote:
> On 11/19/2010 02:27 PM, Luis R. Rodriguez wrote:
>> On Fri, Nov 19, 2010 at 12:55 PM, Ben Greear wrote:
>>> On 11/19/2010 09:57 AM, Johannes Berg wrote:
>>>>
>>>> On Fri, 2010-11-19 at 15:34 +0100, Tejun Heo wrote:
>>>>
>>>>> Awesome. :-)
>>>>>
>>>>> Ben, if you have trouble generating full trace, please let me know if
>>>>> there's something I can buy which isn't too expensive to reproduce the
>>>>> problem. I would be happy to track it down myself.
>>>>
>>>> Maybe you can try Ben's setup in kvm (or directly on your box if you
>>>> like) with mac80211_hwsim. From a mac80211 POV it should be almost
>>>> equivalent, although it'll do different memory allocation patterns etc.
>>>
>>> I tried manually backing out my patch, and now I can no longer reproduce
>>> the problem. Maybe something in -rc2 fixed it, or maybe some changes
>>> to my environment just made it harder to hit.
>>>
>>> If you see no logical reason why calling flush_work with RTNL held
>>> would cause trouble, then I guess we can just leave the code as is
>>> for now.
>>>
>>> If you do want to play with this yourself, I think any ath5k type adapter
>>> with 64+ virtual stations configured would be a valid test case. My
>>> application calls ifdown/ifup on them a few times after being created
>>> and then generates traffic (and gathers stats, calls 'iwconfig', etc).
>>> As configured in the original scenario that reproduced the problem,
>>> the STAs had no encryption and were all associating with a single AP.
>>> wpa_supplicant was not being used.
>>
>> FWIW, I had to do similar tests before and Ben offered up a perl
>> script to do something similar to what his proprietary app does upon
>> device bring up. I've modified it just a bit and you can find it here:
>>
>> http://www.kernel.org/pub/linux/kernel/people/mcgrof/scripts/poo.pl
>
> Well, I backed out my work-around patch yesterday, and then let
> the system run overnight. This morning it is mostly dead, spewing
> OOM errors and with a bunch of 'sh' processes using maximum amount
> of CPU, blocked on trying to acquire rtnl.
>
> There is one 'ip' process that appears to hold rtnl and is trying
> to call ieee80211_do_stop, which is probably blocked down in
> the work-queue logic just like last time. Lots of worker processes
> attempting to grab rtnl (and many other processes as well.)
>
> Lockdep was disabled because a proprietary module of mine was attempted
> to be loaded, but it doesn't actually load due to symbol mismatch
> (it's compiled against a non-debug kernel).
>
> If the lockdep info is critical, I can attempt to reproduce with
> my module completely removed from the file system so it cannot attempt
> to load, but it seems like last time the 'sysrq t' was of more interest
> anyway.
>
> I have uploaded what I believe is a full 'sysrq t' output, interspersed
> with OOM warnings that are constantly spewing to the console,
> here:
>
> http://www.candelatech.com/~greearb/minicom_ath9k_log.txt

And here's a log with lockdep enabled:

http://www.candelatech.com/~greearb/minicom_ath9k_log2.txt

The sysrq output starts at line 1346 in this file.

Seems I have a decent environment for reproducing this today, in case
you have any debug you'd like me to add.

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com