From: Nikolay Borisov <kernel@kyup.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Linux-Kernel@Vger. Kernel. Org" <linux-kernel@vger.kernel.org>,
	SiteGround Operations <operations@siteground.com>
Subject: Re: corruption causing crash in __queue_work
Date: Wed, 9 Dec 2015 18:23:15 +0200
Message-ID: <56685573.1020805@kyup.com>
In-Reply-To: <20151209160803.GK30240@mtj.duckdns.org>



On 12/09/2015 06:08 PM, Tejun Heo wrote:
> Hello, Nikolay.
> 
> On Wed, Dec 09, 2015 at 02:08:56PM +0200, Nikolay Borisov wrote:
>> [73309.529940] BUG: unable to handle kernel NULL pointer dereference at           (null)
>> [73309.530238] IP: [<ffffffff8106b663>] __queue_work+0xb3/0x390
> ...
>> [73309.537319]  <IRQ> 
>> [73309.537373]  [<ffffffff8106b940>] ? __queue_work+0x390/0x390
>> [73309.537714]  [<ffffffff8106b958>] delayed_work_timer_fn+0x18/0x20
>> [73309.537891]  [<ffffffff810ad1d7>] call_timer_fn+0x47/0x110
>> [73309.538071]  [<ffffffff810be302>] ? tick_sched_timer+0x52/0xa0
>> [73309.538249]  [<ffffffff810adb6f>] run_timer_softirq+0x17f/0x2b0
>> [73309.538425]  [<ffffffff8106b940>] ? __queue_work+0x390/0x390
>> [73309.538604]  [<ffffffff81057f40>] __do_softirq+0xe0/0x290
>> [73309.538778]  [<ffffffff810581e6>] irq_exit+0xa6/0xb0
>> [73309.538952]  [<ffffffff8159413a>] smp_apic_timer_interrupt+0x4a/0x59
>> [73309.539128]  [<ffffffff815926bb>] apic_timer_interrupt+0x6b/0x70
> ...
>> The gist is that this fails on the following line:
>>
>> if (last_pool && last_pool != pwq->pool) {
> 
> That's new.
> 
>> Since the pointer 'pwq' is wrong (it is loaded in %rdx) which in this 
>> case is 0000000000000000. Looking at the function's source pwq should 
>> be loaded by per_cpu_ptr since the  if (!(wq->flags & WQ_UNBOUND)) 
>> check should evaluate to false. So pwq is loaded as the result from 
>> unbound_pwq_by_node(wq, cpu_to_node(cpu));
>>
>> Here are the flags of the workqueue: 
>> crash> struct workqueue_struct.flags 0xffff8803df464c00
>>   flags = 131082
> 
> That's ordered unbound workqueue w/ a rescuer.

So the name of the queue is 'dm-thin'. Looking at the dm-thin sources,
the only place where a workqueue is allocated is here:

pool->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);

But in this case I guess the caller can't be the culprit? I admit I'm
biased wrt dm-thin, because I've hit multiple bugs in it over the past
few months.
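
For reference, the flags value 131082 (0x2000a) quoted above can be
decoded mechanically. Below is a minimal userspace sketch. The bit
positions are the ones I believe include/linux/workqueue.h used around
v4.2 and should be checked against the tree actually running; the helper
name decode_wq_flags() and the table are made up for illustration and are
not kernel API.

```c
#include <stdio.h>
#include <string.h>

/* Assumed bit positions, believed to match include/linux/workqueue.h
 * around v4.2. Verify against the kernel tree in use. */
struct wq_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct wq_flag_name wq_flag_names[] = {
	{ 1u << 1,  "WQ_UNBOUND" },
	{ 1u << 2,  "WQ_FREEZABLE" },
	{ 1u << 3,  "WQ_MEM_RECLAIM" },
	{ 1u << 4,  "WQ_HIGHPRI" },
	{ 1u << 5,  "WQ_CPU_INTENSIVE" },
	{ 1u << 6,  "WQ_SYSFS" },
	{ 1u << 7,  "WQ_POWER_EFFICIENT" },
	{ 1u << 16, "__WQ_DRAINING" },
	{ 1u << 17, "__WQ_ORDERED" },
};

/* Hypothetical helper: writes the names of all known bits set in flags
 * into buf, separated by '|' (truncated if buf is too small).
 * Returns the set bits that have no entry in the table. */
static unsigned int decode_wq_flags(unsigned int flags, char *buf, size_t len)
{
	unsigned int unknown = flags;
	size_t used = 0;
	size_t i;

	if (len > 0)
		buf[0] = '\0';

	for (i = 0; i < sizeof(wq_flag_names) / sizeof(wq_flag_names[0]); i++) {
		int n;

		if (!(flags & wq_flag_names[i].bit))
			continue;
		unknown &= ~wq_flag_names[i].bit;

		if (used >= len)
			continue;	/* no room left, keep clearing bits only */
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", wq_flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			len = used;		/* stop appending */
			continue;
		}
		used += (size_t)n;
	}
	return unknown;
}
```

Under those assumed bit values, decode_wq_flags(131082, buf, sizeof(buf))
leaves "WQ_UNBOUND|WQ_MEM_RECLAIM|__WQ_ORDERED" in buf and returns 0. That
matches "ordered unbound workqueue w/ a rescuer", and it is what I would
expect from alloc_ordered_workqueue(..., WQ_MEM_RECLAIM), which as far as
I know ORs in WQ_UNBOUND and __WQ_ORDERED itself.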

> 
>> (0xffff8803df464c00 is indeed the pointer to the workqueue struct, 
>> so the flags aren't bogus).
>>
>> So reading the numa_pwq_tbl it seems that it's uninitialised: 
>>
>> crash> struct workqueue_struct.numa_pwq_tbl 0xffff8803df464c00
>>   numa_pwq_tbl = 0xffff8803df464d10
>> crash> rd -64 0xffff8803df464d10 3
>> ffff8803df464d10:  0000000000000000 0000000000000000   ................
>> ffff8803df464d20:  0000000000000000                    ........
>>
>> The machine where the crash occurred has a single NUMA node, so at the 
>> very least I would have expected to have a pointer, rather than NULL ptr. 
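
To make the failure mode above concrete, here is a small userspace model
of the lookup. It assumes unbound_pwq_by_node() is a plain table read
with no NULL check, which is how I remember it in 4.2. All model_* names
and MODEL_MAX_NODES are invented for this sketch; none of this is kernel
code.

```c
#include <stddef.h>

#define MODEL_MAX_NODES 4	/* hypothetical bound for this sketch */

struct model_pool { int id; };
struct model_pwq  { struct model_pool *pool; };

struct model_wq {
	unsigned int flags;
	struct model_pwq *numa_pwq_tbl[MODEL_MAX_NODES];
};

/* Models unbound_pwq_by_node(): returns whatever the slot holds. */
static struct model_pwq *model_unbound_pwq_by_node(struct model_wq *wq,
						    int node)
{
	return wq->numa_pwq_tbl[node];
}

/* Models the quoted comparison, but reports an empty slot instead of
 * dereferencing it.
 * Returns -1 if the slot is NULL (where the real code would oops),
 *          1 if last_pool is set and differs from pwq->pool,
 *          0 otherwise. */
static int model_check_last_pool(struct model_wq *wq, int node,
				 struct model_pool *last_pool)
{
	struct model_pwq *pwq = model_unbound_pwq_by_node(wq, node);

	if (!pwq)
		return -1;
	return last_pool && last_pool != pwq->pool;
}
```

With an all-zero table, as in the rd output above, the model returns -1
for node 0 whenever it is called. In the real function the
pwq->pool read sits behind the last_pool test, so it would fault only
when last_pool is non-NULL.
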
>>
>> Also this crash is not isolated in that I have observed it on multiple
>> other nodes running vanilla 4.2.5/4.2.6 kernels. 
>>
>> Any advice how to further debug that?
> 
> Adding printk or tracepoints at numa_pwq_tbl_install() to dump what's
> being installed would be helpful.  It should at least tell us whether
> it's the table being corrupted by something else or workqueue failing
> to set it up correctly to begin with.  How reproducible is the
> problem?
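
As a rough illustration of what such instrumentation could record, here
is a userspace sketch of an install helper that logs the old and new
slot contents. It is not the kernel's numa_pwq_tbl_install(): the
sketch_* names, the installs[] counter and the log format are all
assumptions, and the real function also deals with RCU and pwq linking,
which this model leaves out.

```c
#include <stdio.h>

#define SKETCH_MAX_NODES 4	/* hypothetical bound for this sketch */

struct sketch_pwq { int id; };

struct sketch_wq {
	const char *name;
	struct sketch_pwq *numa_pwq_tbl[SKETCH_MAX_NODES];
	unsigned int installs[SKETCH_MAX_NODES];	/* per-node install count */
};

/* Installs pwq into the node's slot, logs the transition and returns
 * the previous slot contents. A slot that is NULL at crash time but
 * was logged as installed points at later corruption; a slot that was
 * never logged points at a setup problem. */
static struct sketch_pwq *sketch_numa_pwq_tbl_install(struct sketch_wq *wq,
						      int node,
						      struct sketch_pwq *pwq)
{
	struct sketch_pwq *old = wq->numa_pwq_tbl[node];

	fprintf(stderr, "wq=%s node=%d old=%p new=%p\n",
		wq->name, node, (void *)old, (void *)pwq);

	wq->numa_pwq_tbl[node] = pwq;
	wq->installs[node]++;
	return old;
}
```

In the kernel the equivalent would be a printk or tracepoint inside the
real function, carrying the same three pieces of information: which
workqueue, which node, and which pointers were swapped.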

I think we are seeing this at least daily on at least one server (we have
multiple servers like that). So adding printks is likely the way to go.
Is there anything in particular you would be interested in knowing? I see
RCU being used around there, so it might be a tricky race condition.


> 
> Thanks.
> 
