Re: memory clobber in rx path, maybe related to ath9k.

linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Ben Greear <greearb@candelatech.com>
To: "Luis R. Rodriguez" <lrodriguez@Atheros.com>
Cc: Luis Rodriguez <Luis.Rodriguez@Atheros.com>,
	linux-wireless <linux-wireless@vger.kernel.org>
Subject: Re: memory clobber in rx path, maybe related to ath9k.
Date: Fri, 15 Oct 2010 17:07:44 -0700	[thread overview]
Message-ID: <4CB8ECD0.20401@candelatech.com> (raw)
In-Reply-To: <20101015234117.GB1866@tux>

On 10/15/2010 04:41 PM, Luis R. Rodriguez wrote:
> On Fri, Oct 15, 2010 at 04:38:14PM -0700, Luis Rodriguez wrote:
>> On Fri, Oct 15, 2010 at 04:33:50PM -0700, Ben Greear wrote:
>>> On 10/15/2010 04:21 PM, Luis R. Rodriguez wrote:
>>>> Ben, please give this patch a shot. I addresses three races on the PCU:
>>>>
>>>>     * When we were stopping the CPU for non-EDMA cards we never locked against
>>>>       anything starting the PCU again
>>>>
>>>>     * ath9k_hw_startpcureceive() was being called without locking
>>>>
>>>>     * Although we lock on the rxbuf lock for contention against starting/stopping
>>>>       the PCU, we also need to lock on the driver in locations where we start/stop
>>>>       the PCU within the same location otherwise we end up in inconsistant states
>>>>       and the hardware may end up proessing an incorrect buffer for DMA. To
>>>>       protect against this we use a new PCU lock on the main part of the driver to
>>>>       ensure each start/stop/reset operation is done atomically.
>>>>
>>>> And fixes one issue as a side effect:
>>>>
>>>>     * No more packet loss on ping flood when you have one STA associated :)
>>>>
>>>> The only issue I see with this is I eventually run out of memory and my box
>>>> becomes useless, unless I am mistaking that for some other issue.
>>>>
>>>> Please give this a shot and if it cures your woes I'll split it up into
>>>> 3 separate patches, or maybe just two, one for the first two and one for
>>>> the last issue.
>>>
>>> Sounds good, but this lockdep splat happens almost immediately upon starting
>>> my app:
>>>
>>> =======================================================
>>> [ INFO: possible circular locking dependency detected ]
>>> 2.6.36-rc8-wl+ #32
>>> -------------------------------------------------------
>>> swapper/0 is trying to acquire lock:
>>>    (&(&sc->rx.pcu_lock)->rlock){+.-...}, at: [<fa16e5c7>] ath9k_tasklet+0x7e/0x140 [ath9k]
>>>
>>> but task is already holding lock:
>>>    (&(&sc->rx.rxflushlock)->rlock){+.-...}, at: [<fa16e5b9>] ath9k_tasklet+0x70/0x140 [ath9k]
>>>
>>> which lock already depends on the new lock.
>>>
>>>
>>> the existing dependency chain (in reverse order) is:
>>>
>>> ->  #1 (&(&sc->rx.rxflushlock)->rlock){+.-...}:
>>>          [<c0457639>] lock_acquire+0x5a/0x78
>>>          [<c075f6ed>] _raw_spin_lock_bh+0x20/0x2f
>>>          [<fa170513>] ath_flushrecv+0x14/0x61 [ath9k]
>>
>> Ah we just need to nuke the flush lock, one second. Also remove my
>> skb_copy() otherwise you will really run out of memory quickly,
>> unless you really want to test with it :)
>
> Try this new one, note I if (0)'d the skb_copy() set that to if (1) if you want
> to force memory clobber.

Ahh, getting better.  Ran for around 10 minutes with my app, never saw the poison,
but system got into a bad state.  It could be some other bug exposed by my test
that was previously hidden by the poison problem, or maybe still bugs with the
locking.

First, I saw this..then things seemed to recover for a bit..  I've seen these
before...

ieee80211 phy0: sta85: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
ieee80211 phy0: sta73: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
ieee80211 phy0: sta109: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
ath: Failed to stop TX DMA in 100 msec after killing last frame
ath: Failed to stop TX DMA. Resetting hardware!
ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
sta29: authenticate with 00:14:d1:c6:d2:54 (try 1)
sta38: authenticate with 00:14:d1:c6:d2:54 (try 1)

Then a little later, hardware seems to go bad and won't recover.  Interesting that
the e1000e driver is also complaining..and I seem to have lost network connectivity
to the system...

Oct 15 16:54:38 localhost kernel: ieee80211 phy0: sta131: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
Oct 15 16:54:38 localhost kernel: ieee80211 phy0: sta85: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
Oct 15 16:54:38 localhost kernel: ieee80211 phy0: sta73: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
Oct 15 16:54:38 localhost kernel: ieee80211 phy0: sta109: No probe response from AP 00:14:d1:c6:d2:54 after 500ms, try 0
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:54:38 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:54:38 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:54:38 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:54:38 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:38 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
...
Oct 15 16:54:40 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:54:40 localhost kernel: e1000e 0000:06:00.0: eth0: Detected Hardware Unit Hang:
Oct 15 16:54:40 localhost kernel:  TDH                  <3d>
Oct 15 16:54:40 localhost kernel:  TDT                  <3f>
Oct 15 16:54:40 localhost kernel:  next_to_use          <3f>
Oct 15 16:54:40 localhost kernel:  next_to_clean        <3d>
Oct 15 16:54:40 localhost kernel: buffer_info[next_to_clean]:
Oct 15 16:54:40 localhost kernel:  time_stamp           <3225b>
Oct 15 16:54:40 localhost kernel:  next_to_watch        <3e>
Oct 15 16:54:40 localhost kernel:  jiffies              <32c84>
Oct 15 16:54:40 localhost kernel:  next_to_watch.status <0>
Oct 15 16:54:40 localhost kernel: MAC Status             <80080f83>
Oct 15 16:54:40 localhost kernel: PHY Status             <796d>
Oct 15 16:54:40 localhost kernel: PHY 1000BASE-T Status  <7c00>
Oct 15 16:54:40 localhost kernel: PHY Extended Status    <3000>
Oct 15 16:54:40 localhost kernel: PCI Status             <4010>
Oct 15 16:54:40 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:54:40 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
...

And then HD complains:

Oct 15 16:55:18 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:55:18 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:55:18 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!
Oct 15 16:55:18 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
Oct 15 16:55:18 localhost kernel: ata1: device not ready (errno=-16), forcing hardreset
Oct 15 16:55:18 localhost kernel: ata1: soft resetting link
Oct 15 16:55:18 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame
Oct 15 16:55:18 localhost kernel: ath: Failed to stop TX DMA. Resetting hardware!

Maybe IRQs were disabled and not re-enabled somehow?

There were no lockdep splats..

Here is a dump of the locks held...

Oct 15 17:01:29 localhost kernel: SysRq : Show Locks Held
Oct 15 17:01:29 localhost kernel:
Oct 15 17:01:29 localhost kernel: Showing all locks held in the system:
Oct 15 17:01:29 localhost kernel: 4 locks held by flush-253:0/502:
Oct 15 17:01:29 localhost kernel: #0:  (&type->s_umount_key#22){++++..}, at: [<c04cf1e0>] writeback_inodes_wb+0x95/0xf5
Oct 15 17:01:29 localhost kernel: #1:  (jbd2_handle){+.+...}, at: [<c053814d>] start_this_handle+0x4b6/0x4ec
Oct 15 17:01:29 localhost kernel: #2:  (&ei->i_data_sem){++++..}, at: [<c051549a>] ext4_map_blocks+0xab/0x164
Oct 15 17:01:29 localhost kernel: #3:  (&lg->lg_mutex){+.+...}, at: [<c0529792>] ext4_mb_initialize_context+0x1db/0x1e5
Oct 15 17:01:29 localhost kernel: 1 lock held by flush-253:2/1148:
Oct 15 17:01:29 localhost kernel: #0:  (&type->s_umount_key#22){++++..}, at: [<c04cf1e0>] writeback_inodes_wb+0x95/0xf5
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1620:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1622:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1624:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1626:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1628:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by mingetty/1630:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 3 locks held by kworker/0:2/1632:
Oct 15 17:01:29 localhost kernel: #0:  (events){+.+.+.}, at: [<c0443cf6>] process_one_work+0x145/0x295
Oct 15 17:01:29 localhost kernel: #1:  ((&ap->scsi_rescan_task)){+.+.+.}, at: [<c0443cf6>] process_one_work+0x145/0x295
Oct 15 17:01:29 localhost kernel: #2:  (&ap->scsi_scan_mutex){+.+.+.}, at: [<c063f3c6>] ata_scsi_dev_rescan+0x1e/0xb6
Oct 15 17:01:29 localhost kernel: 1 lock held by bash/1814:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by bash/1863:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 1 lock held by bash/1935:
Oct 15 17:01:29 localhost kernel: #0:  (&tty->atomic_read_lock){+.+.+.}, at: [<c05f384d>] n_tty_read+0x1d1/0x5ed
Oct 15 17:01:29 localhost kernel: 2 locks held by btserver/2422:
Oct 15 17:01:29 localhost kernel: #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<c04871bf>] generic_file_aio_write+0x4d/0xa6
Oct 15 17:01:29 localhost kernel: #1:  (jbd2_handle){+.+...}, at: [<c053814d>] start_this_handle+0x4b6/0x4ec
Oct 15 17:01:29 localhost kernel: 2 locks held by ip/23410:
Oct 15 17:01:29 localhost kernel: #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<c04871bf>] generic_file_aio_write+0x4d/0xa6
Oct 15 17:01:29 localhost kernel: #1:  (jbd2_handle){+.+...}, at: [<c053814d>] start_this_handle+0x4b6/0x4ec
Oct 15 17:01:29 localhost kernel: 4 locks held by sh/23416:
Oct 15 17:01:29 localhost kernel: #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<c04b6c36>] do_truncate+0x5a/0x7d
Oct 15 17:01:29 localhost kernel: #1:  (&sb->s_type->i_alloc_sem_key#4){+.+...}, at: [<c04c82e2>] notify_change+0x12a/0x218
Oct 15 17:01:29 localhost kernel: #2:  (jbd2_handle){+.+...}, at: [<c053814d>] start_this_handle+0x4b6/0x4ec
Oct 15 17:01:29 localhost kernel: #3:  (&sbi->s_orphan_lock){+.+...}, at: [<c051975f>] ext4_orphan_add+0x1d8/0x1f7
Oct 15 17:01:29 localhost kernel:
Oct 15 17:01:29 localhost kernel: =============================================

Can't get processor info from sysrq..don't know why:

Oct 15 17:04:04 localhost kernel: ath: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020
SysRq : Show backtrace of all active CPUs
Oct 15 17:04:14 localhost kernel: SysRq : Show backtrace of all active CPUs
Oct 15 17:04:14 localhost kernel: sending NMI to all CPUs:
Oct 15 17:04:14 localhost kernel: ath: Failed to stop TX DMA in 100 msec after killing last frame

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

next prev parent reply	other threads:[~2010-10-16  0:07 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-05 17:00 memory clobber in rx path, maybe related to ath9k Ben Greear
2010-10-05 17:16 ` Luis R. Rodriguez
2010-10-05 17:24   ` Ben Greear
2010-10-05 17:36     ` Luis R. Rodriguez
2010-10-05 17:38       ` Ben Greear
2010-10-05 17:43         ` Luis R. Rodriguez
2010-10-05 17:47           ` Ben Greear
2010-10-05 17:55             ` Luis R. Rodriguez
2010-10-05 18:14               ` Ben Greear
2010-10-05 21:12                 ` Ben Greear
2010-10-07 17:33                 ` Ben Greear
2010-10-07 18:14                   ` Johannes Berg
2010-10-07 18:29                     ` Luis R. Rodriguez
2010-10-07 18:39                       ` Ben Greear
2010-10-07 18:42                         ` Luis R. Rodriguez
2010-10-07 18:45                           ` Ben Greear
2010-10-07 19:14                             ` Ben Greear
2010-10-07 19:17                               ` Johannes Berg
2010-10-07 19:22                           ` Ben Greear
2010-10-07 19:27                             ` Johannes Berg
2010-10-07 21:31                               ` Luis R. Rodriguez
2010-10-07 21:36                                 ` Luis R. Rodriguez
2010-10-07 21:59                                   ` Luis R. Rodriguez
2010-10-11 20:51                                     ` Ben Greear
2010-10-12  1:03                                       ` Luis R. Rodriguez
2010-10-12  3:27                                         ` Ben Greear
2010-10-12  6:10                                           ` Luis R. Rodriguez
2010-10-12 18:35                                             ` Ben Greear
2010-10-12 18:40                                               ` Luis R. Rodriguez
2010-10-12 18:43                                                 ` Ben Greear
2010-10-12 19:51                                                   ` Ben Greear
2010-10-13 17:12                                                 ` Ben Greear
2010-10-13 17:29                                                   ` Luis R. Rodriguez
2010-10-13 17:48                                                     ` Ben Greear
2010-10-14 21:25                                                       ` Luis R. Rodriguez
2010-10-14 21:31                                                         ` Ben Greear
2010-10-14 21:32                                                           ` Luis R. Rodriguez
2010-10-14 21:39                                                             ` Ben Greear
2010-10-14 21:45                                                       ` Johannes Berg
2010-10-14 21:47                                                         ` Ben Greear
2010-10-13  5:31                                               ` Vasanthakumar Thiagarajan
2010-10-13 16:39                                                 ` Ben Greear
2010-10-13 19:56                                                   ` Björn Smedman
2010-10-13 20:03                                                     ` Luis R. Rodriguez
2010-10-14 19:15                                                       ` Ben Greear
2010-10-14 19:17                                                         ` Luis R. Rodriguez
2010-10-14 21:52                                                     ` Björn Smedman
2010-10-14 22:05                                                       ` Ben Greear
2010-10-14 22:16                                                         ` Luis R. Rodriguez
2010-10-14 22:29                                                           ` Luis R. Rodriguez
2010-10-14 22:35                                                             ` Luis R. Rodriguez
2010-10-14 22:44                                                               ` Ben Greear
2010-10-14 22:54                                                                 ` Luis R. Rodriguez
2010-10-14 22:51                                                               ` Luis R. Rodriguez
2010-10-14 23:19                                                                 ` Luis R. Rodriguez
2010-10-14 23:30                                                                   ` Ben Greear
2010-10-14 23:39                                                                     ` Luis R. Rodriguez
2010-10-14 23:48                                                                       ` Luis R. Rodriguez
2010-10-15 16:51                                                                         ` Ben Greear
2010-10-15 18:47                                                                           ` Luis R. Rodriguez
2010-10-15 19:36                                                                             ` Ben Greear
2010-10-15 21:07                                                                               ` Luis R. Rodriguez
2010-10-15 23:21                                                                                 ` Luis R. Rodriguez
2010-10-15 23:33                                                                                   ` Ben Greear
2010-10-15 23:38                                                                                     ` Luis R. Rodriguez
2010-10-15 23:41                                                                                       ` Luis R. Rodriguez
2010-10-16  0:07                                                                                         ` Ben Greear [this message]
2010-10-15 23:42                                                                                       ` Ben Greear
2010-10-15 23:57                                                                                         ` Luis R. Rodriguez
2010-10-17 19:44                                                                                           ` Ben Greear
2010-10-18 22:46                                                                                             ` Luis R. Rodriguez
2010-10-15 23:39                                                                                     ` Ben Greear
2010-10-14 23:51                                                                       ` Ben Greear
2010-10-14 22:47                                                       ` Ben Greear
2010-10-14 23:46                                                         ` Björn Smedman
2010-10-18 13:48                                                           ` Björn Smedman
2010-10-18 17:24                                                             ` Luis R. Rodriguez
2010-10-18 22:34                                                               ` Björn Smedman
2010-10-18 22:41                                                                 ` Luis R. Rodriguez
2010-10-14  5:37                                                   ` Vasanthakumar Thiagarajan
2010-10-07 21:52                                 ` Ben Greear
2010-10-08  0:42                               ` Bruno Randolf
2010-10-08  2:30                                 ` Ben Greear
2010-10-05 17:22 ` Johannes Berg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CB8ECD0.20401@candelatech.com \
    --to=greearb@candelatech.com \
    --cc=Luis.Rodriguez@Atheros.com \
    --cc=linux-wireless@vger.kernel.org \
    --cc=lrodriguez@Atheros.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).