corruption causing crash in __queue_work

From: Nikolay Borisov <kernel@kyup.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Linux-Kernel@Vger. Kernel. Org" <linux-kernel@vger.kernel.org>,
	SiteGround Operations <operations@siteground.com>
Subject: corruption causing crash in __queue_work
Date: Wed, 9 Dec 2015 14:08:56 +0200
Message-ID: <566819D8.5090804@kyup.com>

Hello Tejun, 

I've been observing the following crashes on kernel 4.2.6:

[73309.529940] BUG: unable to handle kernel NULL pointer dereference at           (null)
[73309.530238] IP: [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.530466] PGD 0 
[73309.530681] Oops: 0000 [#1] SMP 
[73309.530947] Modules linked in: dm_snapshot dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c ipv6 xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables ext2 dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 igb i2c_algo_bit i2c_core lpc_ich mfd_core ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca
[73309.533556] CPU: 19 PID: 0 Comm: swapper/19 Not tainted 4.2.6-wbpatch-qib #1
[73309.533734] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
[73309.533911] task: ffff880276501b80 ti: ffff880276510000 task.ti: ffff880276510000
[73309.534093] RIP: 0010:[<ffffffff8106b663>]  [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.534321] RSP: 0018:ffff88047fce3d58  EFLAGS: 00010086
[73309.534495] RAX: ffff880277812400 RBX: ffff8801e53e24c0 RCX: 00000000000100f0
[73309.534672] RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffff8801e53e24c0
[73309.534849] RBP: ffff88047fce3de8 R08: 000042ad628a3480 R09: 0000000000000000
[73309.535023] R10: ffffffff816099d5 R11: 0000000000000000 R12: ffffffff8106b940
[73309.535196] R13: 0000000000000013 R14: ffff8803df464c00 R15: 0000000000000013
[73309.535370] FS:  0000000000000000(0000) GS:ffff88047fce0000(0000) knlGS:0000000000000000
[73309.535544] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73309.535714] CR2: 0000000000000000 CR3: 0000000001a0e000 CR4: 00000000000406e0
[73309.535886] Stack:
[73309.536049]  ffff88047fcefcd8 0000000000000092 0000000000000000 ffff8803df464d10
[73309.536415]  0000000000000032 00000000000100f0 0000000000000000 ffff88047fcf4a00
[73309.536785]  ffff88047fcf4a00 0000000000000013 0000000000000000 ffff880276501b80
[73309.537152] Call Trace:
[73309.537319]  <IRQ> 
[73309.537373]  [<ffffffff8106b940>] ? __queue_work+0x390/0x390
[73309.537714]  [<ffffffff8106b958>] delayed_work_timer_fn+0x18/0x20
[73309.537891]  [<ffffffff810ad1d7>] call_timer_fn+0x47/0x110
[73309.538071]  [<ffffffff810be302>] ? tick_sched_timer+0x52/0xa0
[73309.538249]  [<ffffffff810adb6f>] run_timer_softirq+0x17f/0x2b0
[73309.538425]  [<ffffffff8106b940>] ? __queue_work+0x390/0x390
[73309.538604]  [<ffffffff81057f40>] __do_softirq+0xe0/0x290
[73309.538778]  [<ffffffff810581e6>] irq_exit+0xa6/0xb0
[73309.538952]  [<ffffffff8159413a>] smp_apic_timer_interrupt+0x4a/0x59
[73309.539128]  [<ffffffff815926bb>] apic_timer_interrupt+0x6b/0x70
[73309.539300]  <EOI> 
[73309.539355]  [<ffffffff8148b136>] ? cpuidle_enter_state+0x136/0x290
[73309.539694]  [<ffffffff8148b12d>] ? cpuidle_enter_state+0x12d/0x290
[73309.539870]  [<ffffffff8158d9ed>] ? __schedule+0x37d/0x840
[73309.540045]  [<ffffffff8148b2a7>] cpuidle_enter+0x17/0x20
[73309.540222]  [<ffffffff810936c5>] cpuidle_idle_call+0x95/0x140
[73309.540398]  [<ffffffff81072766>] ? atomic_notifier_call_chain+0x16/0x20
[73309.540574]  [<ffffffff810938b5>] cpu_idle_loop+0x145/0x200
[73309.540748]  [<ffffffff8109398b>] ? cpu_startup_entry+0x1b/0x70
[73309.540924]  [<ffffffff813a1948>] ? get_random_bytes+0x48/0x90
[73309.541098]  [<ffffffff810939cf>] cpu_startup_entry+0x5f/0x70
[73309.541274]  [<ffffffff81033832>] start_secondary+0xc2/0xd0
[73309.541446] Code: 49 8b 96 08 01 00 00 49 63 c7 48 03 14 c5 e0 af ab 81 48 89 55 80 48 89 df e8 0a ee ff ff 48 8b 55 80 48 85 c0 0f 84 3e 01 00 00 <48> 8b 3a 48 39 f8 0f 84 35 01 00 00 48 89 c7 48 89 85 78 ff ff 
[73309.545008] RIP  [<ffffffff8106b663>] __queue_work+0xb3/0x390
[73309.545231]  RSP <ffff88047fce3d58>
[73309.545399] CR2: 0000000000000000

The gist is that this fails on the following line:

if (last_pool && last_pool != pwq->pool) {

The pointer 'pwq' (held in %rdx) is wrong: in this case it is
0000000000000000. Looking at the function's source, pwq should not
be loaded by per_cpu_ptr, since the  if (!(wq->flags & WQ_UNBOUND))
check should evaluate to false. So pwq is loaded as the result of
unbound_pwq_by_node(wq, cpu_to_node(cpu));

Here are the flags of the workqueue: 
crash> struct workqueue_struct.flags 0xffff8803df464c00
  flags = 131082

(0xffff8803df464c00 is indeed the pointer to the workqueue struct, 
so the flags aren't bogus).
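
For reference, the decimal flags value can be decoded by hand. The
following is a minimal userspace sketch, not kernel code. It assumes the
flag bit positions from include/linux/workqueue.h as of the 4.2 series;
they are reproduced from memory here and should be checked against the
tree actually in use. wq_decode_flags and wq_unknown_bits are
hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: bit positions as in include/linux/workqueue.h around v4.2.
 * Verify against the exact kernel source before relying on them. */
struct wq_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct wq_flag_name wq_flag_names[] = {
	{ 1u << 1,  "WQ_UNBOUND" },
	{ 1u << 2,  "WQ_FREEZABLE" },
	{ 1u << 3,  "WQ_MEM_RECLAIM" },
	{ 1u << 4,  "WQ_HIGHPRI" },
	{ 1u << 5,  "WQ_CPU_INTENSIVE" },
	{ 1u << 6,  "WQ_SYSFS" },
	{ 1u << 7,  "WQ_POWER_EFFICIENT" },
	{ 1u << 16, "__WQ_DRAINING" },
	{ 1u << 17, "__WQ_ORDERED" },
};

#define WQ_NFLAGS (sizeof(wq_flag_names) / sizeof(wq_flag_names[0]))

/* Write the names of all set, known flags into buf, separated by '|'.
 * Output is truncated if buf is too small. */
static void wq_decode_flags(unsigned int flags, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < WQ_NFLAGS; i++) {
		if (!(flags & wq_flag_names[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, wq_flag_names[i].name, len - strlen(buf) - 1);
	}
}

/* Return the bits of flags that are not covered by the table above. */
static unsigned int wq_unknown_bits(unsigned int flags)
{
	size_t i;

	for (i = 0; i < WQ_NFLAGS; i++)
		flags &= ~wq_flag_names[i].bit;
	return flags;
}
```

Under those assumptions, 131082 is 0x2000a, so wq_decode_flags(131082,
buf, sizeof(buf)) produces WQ_UNBOUND|WQ_MEM_RECLAIM|__WQ_ORDERED and
wq_unknown_bits(131082) returns 0. WQ_UNBOUND being set is what makes
the !(wq->flags & WQ_UNBOUND) check false.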

So reading the numa_pwq_tbl it seems that it's uninitialised: 

crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
  numa_pwq_tbl = 0xffff8803df464d10
crash> rd -64 0xffff8803df464d10 3
ffff8803df464d10:  0000000000000000 0000000000000000   ................
ffff8803df464d20:  0000000000000000                    ........

The machine where the crash occurred has a single NUMA node, so at the
very least I would have expected the first entry to hold a valid
pointer rather than NULL.

Also this crash is not isolated in that I have observed it on multiple
other nodes running vanilla 4.2.5/4.2.6 kernels. 

Any advice on how to debug this further?

