From: Ingo Molnar <mingo@kernel.org>
To: Sasha Levin <sasha.levin@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Dave Jones <davej@codemonkey.org.uk>,
LKML <linux-kernel@vger.kernel.org>,
nicolas.pitre@linaro.org,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: sched: divide error in sg_capacity_factor
Date: Tue, 10 Mar 2015 05:29:59 +0100 [thread overview]
Message-ID: <20150310042959.GA8591@gmail.com> (raw)
In-Reply-To: <54FDDC83.3010202@oracle.com>
* Sasha Levin <sasha.levin@oracle.com> wrote:
> Hi all,
>
> While fuzzing with trinity inside the latest -next kernel using trinity I've stumbled on:
>
> [ 936.784266] divide error: 0000 [#1] PREEMPT SMP KASAN
> [ 936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
Hm, these line numbers don't seem to match up very well with my
version of linux-next:
28855005be1d Add linux-next specific files for 20150306
and the Git version info included in the oops seems useless:
4.0.0-rc1-sasha-00044-ge21109a
$ git log e21109a
fatal: ambiguous argument 'e21109a': unknown revision or path not in the working tree.
I think the kernel's SHA1 should be made at least 12 char wide,
regardless of the user's gitconfig::core.abbrev settings.
Also, latest linux-next is -rc2 based, while your version string says
-rc1.
> [ 936.829403] load_balance (kernel/sched/fair.c:6857)
this does not match up either.
> [ 936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 kernel/sched/fair.c:7659)
The line numbers are not even close to anything related:
run_rebalance_domains() starts at line 7666 and ends at 7680.
Also, why are the offsets into the function missing from the output?
Those would allow the rough determination of the crash site, even if
debuginfo is crap.
I also checked Linus's latest, and they do seem to match up better:
affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
and the line number gives:
capacity_factor = min_t(unsigned,
capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
but that's a division with a constant? Should not trap.
So I rebuild a kernel with debug info, pattern matched the disassembly
you provided, and that gave me this division:
(gdb) list *0xffffffff8107d958
0xffffffff8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
6157 capacity = group->sgc->capacity;
6158 capacity_orig = group->sgc->capacity_orig;
6159 cpus = group->group_weight;
6160
6161 /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
6162 smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
6163 capacity_factor = cpus / smt; /* cores */
6164
6165 capacity_factor = min_t(unsigned,
6166 capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
So this too seems not very plausible: 'capacity_orig' comes straight
from group->sgc->capacity_orig, which is:
- boot time initialized
- sometimes recalculated during CPU hot-plug: not sure how much of
that your tests are doing?
- but otherwise it's fairly constant and should have crashed your
system early on if it was set up wrong
unless I missed something that is.
> [ 936.829403] __do_softirq (kernel/softirq.c:273 include/linux/jump_label.h:114 include/trace/events/irq.h:126 kernel/softirq.c:274)
> [ 936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [ 936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [ 936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
> [ 936.829403] <EOI>
> [ 936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
> [ 936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
> [ 936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
> [ 936.829403] __set_page_owner (mm/page_owner.c:72)
> [ 936.829403] get_page_from_freelist (include/linux/page_owner.h:26 mm/page_alloc.c:2176)
> [ 936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
> [ 936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
> [ 936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 mm/memory.c:3269 mm/memory.c:3298)
> [ 936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
> [ 936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
> [ 936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [ 936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
So debug info weirdnesses aside, other divisions in
find_busiest_group():
sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
/ sds.total_capacity;
total_capacity ought to be zero only on a totally borked machine
(unlikely to boot), or on memory corruption.
if calculate_imbalance() got inlined, then:
load_above_capacity /= busiest->group_capacity;
that too ought to only get corrupted in the most serious cases, we
don't recalculate it runtime.
So I'm baffled. Some tentative handwaving, pointing away from the
scheduler:
- Your stack trace is 'weird' not just due to debug info: an async
page fault doing allocations, doing a stack trace, interruted by a
timer irq, doing scheduler rebalancing...
- The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
tracing facility got enabled explicitly via the page_owner=on boot
parameter, right? Not many people are doing that I suspect.
- CONFIG_KASAN=y is enabled in your kernel. New, invasive option,
using compiler features that weren't used by kernel code before.
- async page faults are virtualization specials: not used much
elsewhere.
- There's a 'W' taint in your oops. Probably some harmless prior
warning?
So your crash signature has the combination of 3 'uncommon' kernel
features, and a scheduler crash with a relatively constant value that
should never be zero and which should crash everywhere.
So right now I'd blame the other 3 guys, I wasn't even there that
night, officer!
Cc:-ed others as well.
Thanks,
Ingo
[*] Please name debugging features accordingly: CONFIG_DEBUG_PAGE_OWNER.
Maybe even prefix them with the subsystem: CONFIG_DEBUG_VM_PAGE_OWNER.
We already have a nice set of CONFIG_DEBUG_VM* options:
CONFIG_DEBUG_VM
CONFIG_DEBUG_VM_RB
CONFIG_DEBUG_VM_VMACACHE
Also, there's no penalty for including a verb, so that people know
wth it's doing, at a glance: CONFIG_DEBUG_VM_TRACK_PAGE_OWNER?
next prev parent reply other threads:[~2015-03-10 4:30 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-09 17:46 sched: divide error in sg_capacity_factor Sasha Levin
2015-03-10 4:29 ` Ingo Molnar [this message]
2015-03-10 10:55 ` Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150310042959.GA8591@gmail.com \
--to=mingo@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=davej@codemonkey.org.uk \
--cc=linux-kernel@vger.kernel.org \
--cc=nicolas.pitre@linaro.org \
--cc=peterz@infradead.org \
--cc=sasha.levin@oracle.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).