* Crash in hacked kernel with CT firmware.
@ 2014-07-30 15:46 Ben Greear
  2014-07-31  7:58 ` Michal Kazior
  0 siblings, 1 reply; 2+ messages in thread
From: Ben Greear @ 2014-07-30 15:46 UTC
  To: ath10k

Not sure how relevant this is to upstream, but just in case someone
wants to look at it:

Kernel is modified 3.14.14+, with a good bit of backported ath10k and some
patches of my own to help stabilize ath10k with my workload and to support
CT firmware features.

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary

Firmware is CT firmware, and it has a bug in this test case where it crashes
fairly often upon removal of a vdev after some traffic tests have been
running.  Likely this firmware bug is something that I have added or
at least exacerbated, and I am working on fixing it.

But, when it crashes, it takes the kernel down shortly afterwards
in a reliable manner:

[firmware crashes]

ath10k: failed with wmi_cmd_timeout 4 times, attempting hardware reset.
sta2: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta3: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta4: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta5: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta12: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta13: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta15: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta16: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta17: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta18: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta21: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta22: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta23: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta30: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta31: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta32: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta33: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta34: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta35: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta36: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta37: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta38: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta39: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta40: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta41: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta42: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta43: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta44: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta45: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta46: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta47: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta48: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta49: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta50: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta51: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta52: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta53: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta54: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta55: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta56: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta57: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta58: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta59: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta60: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta61: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta62: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting
sta63: Failed to send nullfunc to AP 04:f0:21:37:e3:2e after 1000ms, disconnecting


BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffffa06a318d>] ath10k_txrx_tx_unref+0x91/0x3c7 [ath10k_core]
PGD 44cf64067 PUD 449c87067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nfnetlink nf_nat_ipv4 nf_nat 8021q garp stp mrp llc m]
CPU: 0 PID: 5945 Comm: ip Tainted: G        WC O 3.14.14+ #38
Hardware name: Supermicro X9SRL-F/X9SRL-F, BIOS 3.0a 12/05/2013
task: ffff88043675c2a0 ti: ffff88043e4d4000 task.ti: ffff88043e4d4000
RIP: 0010:[<ffffffffa06a318d>]  [<ffffffffa06a318d>] ath10k_txrx_tx_unref+0x91/0x3c7 [ath10k_]
RSP: 0018:ffff88043e4d54b8  EFLAGS: 00010282
RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffff880449e7c000 RSI: 0000000000000007 RDI: 0000000000000008
RBP: ffff88043e4d54e8 R08: 0000000000000000 R09: ffffffffa06a24b7
R10: ffffffffa06a24b7 R11: 0000000000000000 R12: ffff88043e4d5502
R13: ffff88046b02b428 R14: ffff88046c74e098 R15: ffff88046b02bc78
FS:  00007fe6c9271740(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000068 CR3: 0000000443978000 CR4: 00000000001407f0
Stack:
 0000000000000007 ffff88046b02b428 0000000000000007 ffff88046b02b568
 ffff88046b02b908 ffff88046b02bc78 ffff88043e4d5528 ffffffffa06a28b4
 ffff88043e4d5528 0001000000071c28 ffff88046b02b428 ffff88046b02b428
Call Trace:
 [<ffffffffa06a28b4>] ath10k_htt_tx_detach+0x70/0xd1 [ath10k_core]
 [<ffffffffa06a04cf>] ath10k_htt_detach+0x16/0x1b [ath10k_core]
 [<ffffffffa069eab3>] ath10k_core_stop+0x4f/0x70 [ath10k_core]
 [<ffffffffa069ae32>] ath10k_halt+0xde/0x161 [ath10k_core]
 [<ffffffffa069aeed>] ath10k_stop+0x38/0x89 [ath10k_core]
 [<ffffffffa05b0ae6>] ieee80211_stop_device+0x58/0x84 [mac80211]
 [<ffffffffa069541c>] ? spin_lock_bh+0x9/0xb [ath10k_core]
 [<ffffffffa059d0d3>] ieee80211_do_stop+0x625/0x67d [mac80211]
 [<ffffffff810fdf6a>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810c6d42>] ? __local_bh_enable_ip+0xaf/0xd9
 [<ffffffff815d8156>] ? _raw_spin_unlock_bh+0x31/0x35
 [<ffffffff8153a693>] ? dev_deactivate_many+0x129/0x172
 [<ffffffffa059d140>] ieee80211_stop+0x15/0x19 [mac80211]
 [<ffffffff8151beff>] __dev_close_many+0x95/0xba
 [<ffffffff8151bfa5>] __dev_close+0x48/0x67
 [<ffffffff81522696>] __dev_change_flags+0xa6/0x14a
 [<ffffffff8152276d>] dev_change_flags+0x23/0x59
 [<ffffffff8152c318>] do_setlink+0x2d7/0x793
 [<ffffffff8152ef6e>] rtnl_newlink+0x36f/0x5a7
 [<ffffffff8152ed0a>] ? rtnl_newlink+0x10b/0x5a7


(gdb) l *(ath10k_txrx_tx_unref+0x91)
0xe18d is in ath10k_txrx_tx_unref (/mnt/sda/home/greearb/git/linux-3.14.dev.y/drivers/net/wireless/ath/ath10k/txrx.c:109).
104		}
105	
106		msdu = htt->pending_tx[tx_done->msdu_id];
107		skb_cb = ATH10K_SKB_CB(msdu);
108	
109		dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
110	
111		if (skb_cb->htt.txbuf)
112			dma_pool_free(htt->tx_pool,
113				      skb_cb->htt.txbuf,
(gdb)
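
A note on reading the fault address: a "NULL pointer dereference at
0000000000000068" is what you get when a field at offset 0x68 is read
through a NULL base pointer, which would fit the msdu->len load on line 109
if msdu is NULL. The sketch below only illustrates that arithmetic. The
struct is a hypothetical stand-in whose padding is chosen so that len lands
at 0x68; the real struct sk_buff layout depends on the kernel version and
config and is not taken from this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; NOT the real struct sk_buff. */
struct fake_skb {
	unsigned char earlier_fields[0x68]; /* placeholder for preceding members */
	unsigned int len;                   /* assumed to sit at offset 0x68 */
};

/* Address the CPU touches when reading a field through a given base pointer. */
uintptr_t field_access_address(uintptr_t base, size_t field_offset)
{
	return base + field_offset;
}
```

Calling field_access_address(0, offsetof(struct fake_skb, len)) gives 0x68,
the same value the oops reports in CR2.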


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k


* Re: Crash in hacked kernel with CT firmware.
  2014-07-30 15:46 Crash in hacked kernel with CT firmware Ben Greear
@ 2014-07-31  7:58 ` Michal Kazior
  0 siblings, 0 replies; 2+ messages in thread
From: Michal Kazior @ 2014-07-31  7:58 UTC
  To: Ben Greear; +Cc: ath10k

On 30 July 2014 17:46, Ben Greear <greearb@candelatech.com> wrote:
> Not sure how relevant this is to upstream, but just in case someone
> wants to look at it:
>
> Kernel is modified 3.14.14+, with a good bit of backported ath10k and some
> patches of my own to help stabilize ath10k with my workload and to support
> CT firmware features.
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary
>
> Firmware is CT firmware, and it has a bug in this test case where it crashes
> fairly often upon removal of a vdev after some traffic tests have been
> running.  Likely this firmware bug is something that I have added or
> at least exacerbated, and I am working on fixing it.
>
> But, when it crashes, it takes the kernel down shortly afterwards
> in a reliable manner:
[...]
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
> IP: [<ffffffffa06a318d>] ath10k_txrx_tx_unref+0x91/0x3c7 [ath10k_core]
[...]
> Call Trace:
>  [<ffffffffa06a28b4>] ath10k_htt_tx_detach+0x70/0xd1 [ath10k_core]
>  [<ffffffffa06a04cf>] ath10k_htt_detach+0x16/0x1b [ath10k_core]
>  [<ffffffffa069eab3>] ath10k_core_stop+0x4f/0x70 [ath10k_core]
>  [<ffffffffa069ae32>] ath10k_halt+0xde/0x161 [ath10k_core]
>  [<ffffffffa069aeed>] ath10k_stop+0x38/0x89 [ath10k_core]
>  [<ffffffffa05b0ae6>] ieee80211_stop_device+0x58/0x84 [mac80211]
>  [<ffffffffa069541c>] ? spin_lock_bh+0x9/0xb [ath10k_core]
>  [<ffffffffa059d0d3>] ieee80211_do_stop+0x625/0x67d [mac80211]
>  [<ffffffff810fdf6a>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff810c6d42>] ? __local_bh_enable_ip+0xaf/0xd9
>  [<ffffffff815d8156>] ? _raw_spin_unlock_bh+0x31/0x35
>  [<ffffffff8153a693>] ? dev_deactivate_many+0x129/0x172
>  [<ffffffffa059d140>] ieee80211_stop+0x15/0x19 [mac80211]
[...]
> (gdb) l *(ath10k_txrx_tx_unref+0x91)
> 0xe18d is in ath10k_txrx_tx_unref (/mnt/sda/home/greearb/git/linux-3.14.dev.y/drivers/net/wireless/ath/ath10k/txrx.c:109).
> 104             }
> 105
> 106             msdu = htt->pending_tx[tx_done->msdu_id];
> 107             skb_cb = ATH10K_SKB_CB(msdu);
> 108
> 109             dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);

Okay.. So `msdu` is NULL. I can't seem to find unpaired used_msdu_ids
and pending_tx accesses. This suggests htt->pending_tx itself is
invalid (as well as used_msdu_ids) - perhaps use-after-free (both
pointers aren't NULLed). This in turn suggests ath10k_htt_tx_detach()
was called before and this is the second call. Stack trace suggests
the (allegedly second) call originates from drv_stop(). When ath10k
crashes, the ath10k_core_restart() worker calls ath10k_halt() directly, sets
RESTARTING state and queues mac80211 hw restart. ath10k_stop() calls
ath10k_halt() only if state is ON, RESTARTED or WEDGED. RESTARTING
isn't one of them, but since you have more than 1 entry point for hw
recovery (pci indication, wmi_send, flush) you can trigger the
ath10k_core_restart() worker with RESTARTING state (i.e. crash within a
crash before ath10k_start() is called) which changes state to WEDGED.
WEDGED allows ath10k_halt() to be called in ath10k_stop(). QED.
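
The sequence above can be sketched as a small state-machine model. This is
not driver code: every model_* name is hypothetical, only the transitions
described in this mail are modelled, and "bad_detach_calls" simply counts
how often the teardown path runs after the tx resources are already gone
(the point where the real driver would touch stale pending_tx memory).

```c
#include <stdbool.h>

/* Simplified model of the states named in this thread. */
enum model_state {
	MODEL_OFF, MODEL_ON, MODEL_RESTARTING, MODEL_RESTARTED, MODEL_WEDGED
};

struct model_dev {
	enum model_state state;
	bool htt_attached;     /* true while tx resources are allocated */
	int bad_detach_calls;  /* teardown calls made after resources were freed */
};

/* Stand-in for the halt path, which ends up detaching the tx resources. */
static void model_halt(struct model_dev *d)
{
	if (!d->htt_attached)
		d->bad_detach_calls++;  /* second detach: stale pointers in the driver */
	d->htt_attached = false;
}

/* Stand-in for the restart worker; other states are omitted from the model. */
static void model_restart_work(struct model_dev *d)
{
	switch (d->state) {
	case MODEL_ON:
		d->state = MODEL_RESTARTING;
		model_halt(d);          /* first teardown happens here */
		break;
	case MODEL_RESTARTING:
		d->state = MODEL_WEDGED; /* crash within a crash */
		break;
	default:
		break;
	}
}

/* Stand-in for the stop callback: halts only in ON, RESTARTED or WEDGED. */
static void model_stop(struct model_dev *d)
{
	if (d->state == MODEL_ON || d->state == MODEL_RESTARTED ||
	    d->state == MODEL_WEDGED)
		model_halt(d);
	d->state = MODEL_OFF;
}

/* Two crashes before the interface is stopped: teardown runs twice. */
int model_double_crash(void)
{
	struct model_dev d = { MODEL_ON, true, 0 };

	model_restart_work(&d);  /* ON -> RESTARTING, halt #1 */
	model_restart_work(&d);  /* RESTARTING -> WEDGED */
	model_stop(&d);          /* WEDGED permits halt #2 */
	return d.bad_detach_calls;
}

/* A single crash followed by stop: RESTARTING blocks the second halt. */
int model_single_crash(void)
{
	struct model_dev d = { MODEL_ON, true, 0 };

	model_restart_work(&d);
	model_stop(&d);
	return d.bad_detach_calls;
}
```

In this model model_double_crash() reports one teardown on already-freed
resources, while model_single_crash() reports none, which matches the
reasoning that only the WEDGED path lets the second halt through.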

The following (it has been in upstream for some time now) should fix
the problem:

commit c5058f5b82f226b236dc5a65015152ed3c23efff
Author: Michal Kazior <michal.kazior@tieto.com>
Date:   Mon May 26 12:46:03 2014 +0300

    ath10k: perform hw restart lazily

    This reduces risk of races and prepares for more
    hw restart fixes.

    It also makes sense to perform teardown after
    mac80211 starts its restart routine as it
    guarantees it has stopped itself by then
    (including tx queues).

    Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
    Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>

This probably makes your ieee80211_stop_queues() in ath10k_halt() obsolete too.


Michał

