From mboxrd@z Thu Jan  1 00:00:00 1970
From: Felix Fietkau <nbd@openwrt.org>
Date: Thu, 01 Mar 2012 21:40:22 +0100
Subject: [ath9k-devel] "Failed to stop Tx DMA" and "Could not stop RX"
 with AR9485
In-Reply-To: <20120301195341.5283.qmail@stuge.se>
References: <4F4E9A36.2080809@lacto.se>
	<CAD2nsn2ifGDPr_J7LtBS04yCkxg12jQShrCDCBUERmOC5kodyA@mail.gmail.com>
	<20120301144642.13047.qmail@stuge.se>
	<4F4FB06F.6030201@openwrt.org>
	<20120301175339.28512.qmail@stuge.se>
	<4F4FBC4E.6050109@openwrt.org>
	<20120301184243.32168.qmail@stuge.se>
	<4F4FCBA1.4080601@openwrt.org> <20120301195341.5283.qmail@stuge.se>
Message-ID: <4F4FDEB6.5030305@openwrt.org>
List-Id: <ath9k-devel.lists.ath9k.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ath9k-devel@lists.ath9k.org

On 2012-03-01 8:53 PM, Peter Stuge wrote:
>> >> > Still, DMA is not exotic, and here are DMA problems again.
>> >> 
>> >> That last sentence makes no sense at all.
>> > 
>> > My point is that DMA by peripheral devices and the drivers to manage
>> > it are established technologies in computer busses across the world,
>> > so it keeps surprising me that drivers in 2012 get it wrong.
>> 
>> Which proves to me yet again that you're completely missing the point of
>> what I'm saying about the likelihood of this being *NOT* DMA related.
> 
> It is, because the error leads to either or both sides thinking about
> DMA when they should not.
Either or both sides of what? Thinking about DMA? That sentence
unfortunately makes no sense to me.

Most of the time when the driver tries to stop DMA, the hardware doesn't
respond in time, because either the MAC or the host interface state has
locked up. It can also be caused by the MAC not fully waking up from a
sleep state (hence the powersave related suggestion).
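
To illustrate the failure mode described above, here is a minimal sketch
of a bounded poll for "DMA stopped". It is not ath9k code: struct fake_hw,
fake_dma_idle() and stop_dma() are hypothetical names, and the "register
read" is simulated so the example can run on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; not an ath9k structure. */
struct fake_hw {
    bool locked_up;        /* MAC or host interface state is stuck */
    unsigned polls_needed; /* reads required before DMA reports idle */
    unsigned polls_seen;   /* reads performed so far */
};

/* Simulated status read: true once the DMA engine reports idle. */
static bool fake_dma_idle(struct fake_hw *hw)
{
    if (hw->locked_up)
        return false; /* a locked-up device never reports idle */
    return ++hw->polls_seen >= hw->polls_needed;
}

/* Bounded poll: true if DMA stopped within max_polls status reads. */
static bool stop_dma(struct fake_hw *hw, unsigned max_polls)
{
    assert(max_polls > 0);
    for (unsigned i = 0; i < max_polls; i++) {
        if (fake_dma_idle(hw))
            return true;
        /* a real driver would wait briefly between reads here */
    }
    /* timeout: this is where a driver would log the failure */
    return false;
}
```

In this sketch the timeout fires the same way whether the device is
permanently stuck or merely slow to respond, which is why the error
message alone does not say what actually went wrong.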

>> >> How do you define 'lowest level'?
>> > 
>> > An example would be to monitor state machines inside the device using
>> > side channel debugging, while in parallel monitoring state machines
>> > inside the driver. Then comparing them after the fact and seeing
>> > where one goes wrong. Find out why, and audit the complete driver for
>> > similar types of errors.
>> 
>> In my experience, this generates *way* too much data to be of any use to
>> narrow down the source of the problem.
> 
> Filtering out unimportant data is of course a big part of the
> debugging. The very first thing to do I'd say, and also a recurring
> thing to do, in order to build the complete picture of what is going
> on.
> 
> The more knowledge about the hardware one has, the faster this
> process is. It might need a few days of work next to the logic
> analyzer for someone with a predisposition.
> 
> 
>> The lower you go on the level of abstraction, the more data any kind of
>> debugging approach generates, meaning way more effort to analyze the
>> data and make sense of it.
> 
> Only if the data can't be interpreted, for example, due to lack of
> documentation.
> 
> 
>> I don't typically start with the part that's hardest to analyze. High
>> level states are easier to look into and make sense of, so they're also
>> easier to rule out.
> 
> But they mask a myriad of lower level operations. I also did not
> universally argue against the top-down approach, but if there is a
> tricky problem then I much prefer to acquire all the data once and
> look at it carefully, instead of trying to turn knobs to see what
> happens.

>> OK, so let me get this straight: You can't imagine how the test result
>> from disabling PS could be useful for tracking down the problem, so you
>> automatically assume that it *must* be a lame workaround suggestion?
>> That seems rather narrow-minded of you.
> 
> The first time it was suggested and tried it gave a bit of valuable
> information. Now, several years later, the same test doesn't keep on
> giving so many useful bits of information anymore.
> 
> Since it's a high level test it can't determine with much certainty
> what the key lower level effects are which make the symptom
> disappear.
> 
> 
>> Of course I can see multiple ways in which this information would
>> be useful, but I guess that may not matter to you.
> 
> Right, the fact that it can mean *multiple* things is what makes it
> useless.
In the early stages of debugging, you will usually *never* get data
points that only mean one thing. One data point leads to ideas for
further tests.

> I never said that the test does not narrow the search, but it's not
> sufficient to *identify* any issue, so when it is the only suggestion
> given, over and over, it is, in fact, just a workaround.
Just because it cannot be used by itself to fully identify the issue
doesn't mean it's a bad idea to get that data.

I believe acquiring *all* data at once is impossible (or at least
completely impractical), and I believe that the structured approach of
going up one level at a time is horribly inefficient and often grossly
misleading for bugs that show low level symptoms but have high level
causes.

Incidentally, the 'trying to turn knobs to see what happens' approach
plus code review have been very efficient for me in dealing with that
class of bugs, and I'll take that over a random guy's random theory of
how debugging should and *must* be done any day.

So one of the differences between us appears to be this: You advocate
that there is one way that things must be done, and if the prerequisites
for that approach aren't met, then it's impossible.
My opinion is that this is nothing but a lame excuse, proven by the fact
that I've been able to easily deal with similar situations in other
drivers, and that I know people that have found and fixed some weird
bugs in ath9k without having had any access to documentation and without
having spent weeks on what you call 'reverse engineering'.

I will now refrain from any further discussion with you on debugging
approaches, since you seem quite comfortable and content in staying
within your view of limitations and impossibilities, whereas I prefer to
get some real work done. :)

- Felix