Message-ID: <4CDD76FF.7060307@candelatech.com>
Date: Fri, 12 Nov 2010 09:18:55 -0800
From: Ben Greear
To: Tejun Heo
CC: Johannes Berg, "linux-wireless@vger.kernel.org"
Subject: Re: ath5k/mac80211: Reproducible deadlock with 64-stations.
References: <4CDB2488.4040802@candelatech.com>
 <1289437356.3748.25.camel@jlt3.sipsolutions.net>
 <4CDBB716.7020802@kernel.org> <4CDC2016.8020200@candelatech.com>
 <4CDC354C.2060503@candelatech.com> <4CDC7860.3070307@candelatech.com>
 <4CDD12BD.7030208@kernel.org>
In-Reply-To: <4CDD12BD.7030208@kernel.org>

On 11/12/2010 02:11 AM, Tejun Heo wrote:
> Hello,
>
> On 11/12/2010 12:12 AM, Ben Greear wrote:
>> The lockup (or extreme slowdown?) happens before the serious memory
>> pressure.
>>
>> One thing I noticed is that at one point near (at?) the beginning of
>> the slowdown, it took 36 seconds to complete the flush_work() call in
>> ieee80211_do_stop in iface.c.
>>
>> From some printk's I added:
>>
>> Nov 11 14:58:13 localhost kernel: do_stop: sta14 flushing work: e51298b4
>> Nov 11 14:58:49 localhost kernel: do_stop: sta14 flushed.
>>
>> It is holding RTNL for this entire time, which of course stops a large
>> number of other useful processes from making progress.
>>
>> Is there any good reason for the flush to take so long?
>
> It depends on what the work being flushed was doing.  Which one is it
> trying to flush?  Also, if the memory pressure is high enough, due to

It's trying to flush sdata->work.  I have no idea which worker that is --
how can I tell?

I can reproduce this every time, and I don't mind adding debugging code,
so please let me know if there is something I can do to get you better
information.

> the dynamic nature of the workqueue, processing of works can be delayed
> while trying to create new workers to process them.  Situations like
> that usually don't happen often, as it's likely that workers get freed
> up as other works finish; however, if workers are piling up on
> rtnl_lock, there really isn't much it can do.  If there's a work user
> which can behave like that, it would be a good idea to restrict its
> maximum concurrency using a separate workqueue.

In my case, memory seems to be OK at first, but as the deadlock happens
the system runs slowly for a bit and then goes OOM.  I compiled with a
2G/2G split so that I have extra low memory, and with that it runs a bit
longer before locking up completely (or just endlessly spewing
allocation-failure messages).

Thanks,
Ben

>
> Thanks.

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
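
(A minimal sketch of the kind of extra debugging discussed above -- not
the actual printk's from the thread.  It assumes the 2.6.36-era code
where ieee80211_do_stop() in net/mac80211/iface.c calls
flush_work(&sdata->work), that struct work_struct exposes its handler as
.func, and that sdata->name holds the interface name; "%pf" prints the
symbol name of a function pointer on kernels of that era.)

/* In ieee80211_do_stop(), net/mac80211/iface.c -- illustrative only:
 * report which handler backs sdata->work and how long the flush takes
 * while RTNL is held. */
{
	unsigned long start = jiffies;

	printk(KERN_DEBUG "do_stop: %s flushing work %pf\n",
	       sdata->name, sdata->work.func);
	flush_work(&sdata->work);
	printk(KERN_DEBUG "do_stop: %s flushed after %u ms\n",
	       sdata->name, jiffies_to_msecs(jiffies - start));
}

With something like this in place, the log names the work function the
36-second flush is waiting on, rather than only the raw pointer
(e51298b4) shown in the quoted log lines.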
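
(Likewise, Tejun's closing suggestion -- restricting a misbehaving work
item's maximum concurrency with its own workqueue -- would look roughly
like the sketch below on the cmwq API merged in 2.6.36.  The queue name
and the helper functions are made up for illustration.)

#include <linux/workqueue.h>

static struct workqueue_struct *sdata_wq;

static int example_create_wq(void)
{
	/* flags = 0, max_active = 1: at most one work item queued here
	 * runs at a time, so instances piling up behind rtnl_lock cannot
	 * tie up the shared worker pool. */
	sdata_wq = alloc_workqueue("sdata_wq", 0, 1);
	return sdata_wq ? 0 : -ENOMEM;
}

static void example_queue(struct work_struct *work)
{
	/* Queue onto the dedicated queue instead of the global one. */
	queue_work(sdata_wq, work);
}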