Message-ID: <4CDD76FF.7060307@candelatech.com>
Date: Fri, 12 Nov 2010 09:18:55 -0800
From: Ben Greear
To: Tejun Heo
CC: Johannes Berg, "linux-wireless@vger.kernel.org"
Subject: Re: ath5k/mac80211: Reproducible deadlock with 64-stations.
References: <4CDB2488.4040802@candelatech.com>
 <1289437356.3748.25.camel@jlt3.sipsolutions.net>
 <4CDBB716.7020802@kernel.org> <4CDC2016.8020200@candelatech.com>
 <4CDC354C.2060503@candelatech.com> <4CDC7860.3070307@candelatech.com>
 <4CDD12BD.7030208@kernel.org>
In-Reply-To: <4CDD12BD.7030208@kernel.org>

On 11/12/2010 02:11 AM, Tejun Heo wrote:
> Hello,
>
> On 11/12/2010 12:12 AM, Ben Greear wrote:
>> The lockup (or extreme slowdown?) happens before the serious memory
>> pressure.
>>
>> One thing I noticed is that at one point near (at?) the beginning of
>> the slowdown, it took 36 seconds to complete the flush_work() call in
>> ieee80211_do_stop in iface.c.
>>
>> From some printk's I added:
>>
>> Nov 11 14:58:13 localhost kernel: do_stop: sta14 flushing work: e51298b4
>> Nov 11 14:58:49 localhost kernel: do_stop: sta14 flushed.
>>
>> It is holding RTNL for this entire time, which of course stops a large
>> number of other useful processes from making progress.
>>
>> Is there any good reason for the flush to take so long?
>
> It depends on what the work being flushed was doing.  Which one is it
> trying to flush?  Also, if the memory pressure is high enough, due to

It's trying to flush sdata->work.  I have no idea which worker that is --
how can I tell?

I can reproduce this every time, and I don't mind adding debugging code,
so please let me know if there is something I can do to get you better
information.

> the dynamic nature of the workqueue, processing of works can be delayed
> while trying to create new workers to process them.  Situations like
> that usually don't happen often, as it's likely that workers get freed
> up as other works finish; however, if workers are piling up on
> rtnl_lock, there really isn't much it can do.  If there's a work user
> which can behave like that, it would be a good idea to restrict its
> maximum concurrency using a separate workqueue.

In my case, memory seems to be OK at first, but as the deadlock happens
the system runs slowly for a bit and then goes OOM.  I compiled with a
2G/2G split so that I have extra low memory, and with that it runs a bit
longer before locking up completely (or just endlessly spewing
allocation-failure messages).

Thanks,
Ben

>
> Thanks.

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
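
(A minimal sketch of the kind of extra debugging discussed above -- not
the actual printk's from the thread.  It assumes the 2.6.36-era code
where ieee80211_do_stop() in net/mac80211/iface.c calls
flush_work(&sdata->work), that struct work_struct exposes its handler as
.func, and that sdata->name holds the interface name; "%pf" prints the
symbol name of a function pointer on kernels of that era.)

/* In ieee80211_do_stop(), net/mac80211/iface.c -- illustrative only:
 * report which handler backs sdata->work and how long the flush takes
 * while RTNL is held. */
{
	unsigned long start = jiffies;

	printk(KERN_DEBUG "do_stop: %s flushing work %pf\n",
	       sdata->name, sdata->work.func);
	flush_work(&sdata->work);
	printk(KERN_DEBUG "do_stop: %s flushed after %u ms\n",
	       sdata->name, jiffies_to_msecs(jiffies - start));
}

With something like this in place, the log names the work function the
36-second flush is waiting on, rather than only the raw pointer
(e51298b4) shown in the quoted log lines.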
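
(Likewise, Tejun's closing suggestion -- restricting a misbehaving work
item's maximum concurrency with its own workqueue -- would look roughly
like the sketch below on the cmwq API merged in 2.6.36.  The queue name
and the helper functions are made up for illustration.)

#include <linux/workqueue.h>

static struct workqueue_struct *sdata_wq;

static int example_create_wq(void)
{
	/* flags = 0, max_active = 1: at most one work item queued here
	 * runs at a time, so instances piling up behind rtnl_lock cannot
	 * tie up the shared worker pool. */
	sdata_wq = alloc_workqueue("sdata_wq", 0, 1);
	return sdata_wq ? 0 : -ENOMEM;
}

static void example_queue(struct work_struct *work)
{
	/* Queue onto the dedicated queue instead of the global one. */
	queue_work(sdata_wq, work);
}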