From: Nikolay Borisov <kernel@kyup.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Linux-Kernel@Vger. Kernel. Org" <linux-kernel@vger.kernel.org>,
SiteGround Operations <operations@siteground.com>
Subject: Re: corruption causing crash in __queue_work
Date: Fri, 11 Dec 2015 17:57:22 +0200 [thread overview]
Message-ID: <566AF262.8050009@kyup.com> (raw)
In-Reply-To: <20151210152901.GR30240@mtj.duckdns.org>
On 12/10/2015 05:29 PM, Tejun Heo wrote:
> On Thu, Dec 10, 2015 at 11:28:02AM +0200, Nikolay Borisov wrote:
>> On 12/09/2015 06:27 PM, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
>>>> I think we are seeing this at least daily on at least 1 server (we have
>>>> multiple servers like that). So adding printk's would likely be the way
>>>> to go, anything in particular you might be interested in knowing? I see
>>>> RCU stuff around so might be tricky race condition.
>>>
>>> Printing out the workqueue's pointer, name, pwq's pointer, the node
>>> being installed for and the installed pointer should give us enough
>>> clues. There's RCU involved but the pointers shouldn't be becoming
>>> NULLs unless we're installing NULL ptrs.
>>
>> So the debug patch has been rolled on 1 server and several more
>> are in the process, here it is what it prints:
>>
>> WQ: ffff88046f00ba00 (events_unbound) old_pwq: (null) new_pwq: ffff88046f00d300 node: 0
> ...
>> Is this format ok? Also I observed the exact same crash
>> on a machine running 4.1.12 kernel as well.
>
> Yeah, I think it can be a good starting point.
So I had a server with the patch just crash on me:
Here is how the queue looks like:
crash> struct workqueue_struct 0xffff8802420a4a00
struct workqueue_struct {
pwqs = {
next = 0xffff8802420a4c00,
prev = 0xffff8802420a4a00
},
list = {
next = 0xffff880351f9b210,
prev = 0xdead000000200200
},
mutex = {
count = {
counter = 1
},
wait_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
wait_list = {
next = 0xffff8802420a4a28,
prev = 0xffff8802420a4a28
},
owner = 0x0,
osq = {
tail = {
counter = 0
}
}
},
work_color = 3,
flush_color = 3,
nr_pwqs_to_flush = {
counter = 0
},
first_flusher = 0x0,
flusher_queue = {
next = 0xffff8802420a4a60,
prev = 0xffff8802420a4a60
},
flusher_overflow = {
next = 0xffff8802420a4a70,
prev = 0xffff8802420a4a70
},
maydays = {
next = 0xffff8802420a4a80,
prev = 0xffff8802420a4a80
},
rescuer = 0xffff88046932ce40,
nr_drainers = 0,
saved_max_active = 1,
unbound_attrs = 0xffff8801a76c1f00,
dfl_pwq = 0x0,
wq_dev = 0x0,
name =
"dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
rcu = {
next = 0xffff8802531c4c20,
func = 0xffffffff810692e0 <rcu_free_wq>
},
flags = 131082,
cpu_pwqs = 0x0,
numa_pwq_tbl = 0xffff8802420a4b10
}
crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
'2' argument)
ffff8802420a4b10: 0000000000000000 0000000000000000 ................
At the same time searching for 0xffff8802420a4a00 in the debug output
shows nothing IOW it seems that the numa_pwq_tbl is never installed for
this workqueue apparently:
[root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
Also dumping all the logs from the dmesg contained in the vmcore image I
find nothing and when I do the following correlation:
[root@smallvault8 ~]# grep \(null\) wq.log | wc -l
1940
[root@smallvault8 ~]# wc -l wq.log
1940 wq.log
It seems what's happening is really just changing the numa_pwq_tbl on
workqueue creation i.e. it is never re-assigned. So at this point I
think it seems that there is a situation where the wqattr are not being
applied at all.
>
> Thanks.
>
next prev parent reply other threads:[~2015-12-11 15:57 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-09 12:08 corruption causing crash in __queue_work Nikolay Borisov
2015-12-09 16:08 ` Tejun Heo
2015-12-09 16:23 ` Nikolay Borisov
2015-12-09 16:27 ` Tejun Heo
2015-12-10 9:28 ` Nikolay Borisov
2015-12-10 15:29 ` Tejun Heo
2015-12-11 15:57 ` Nikolay Borisov [this message]
2015-12-11 17:08 ` Tejun Heo
2015-12-11 18:00 ` Nikolay Borisov
2015-12-11 19:14 ` Mike Snitzer
2015-12-12 11:49 ` Nikolay Borisov
2015-12-14 8:41 ` Nikolay Borisov
2015-12-14 15:31 ` Mike Snitzer
2015-12-14 20:11 ` Nikolay Borisov
2015-12-14 20:31 ` Mike Snitzer
2015-12-17 10:46 ` Nikolay Borisov
2015-12-17 15:33 ` Tejun Heo
2015-12-17 15:43 ` Nikolay Borisov
2015-12-17 15:50 ` Tejun Heo
2015-12-17 17:15 ` Mike Snitzer
[not found] ` <CAJFSNy5Lqv_xy7Lf1GEDPczHpZU8+a2CYCM-3ZR=VkDPJptmcg@mail.gmail.com>
2015-12-21 21:44 ` Tejun Heo
2015-12-21 21:45 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=566AF262.8050009@kyup.com \
--to=kernel@kyup.com \
--cc=linux-kernel@vger.kernel.org \
--cc=operations@siteground.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).