From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752224AbbCJK4c (ORCPT <rfc822;w@1wt.eu>);
	Tue, 10 Mar 2015 06:56:32 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:43361 "EHLO
	aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751285AbbCJK4a (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 10 Mar 2015 06:56:30 -0400
Message-ID: <54FECDAE.1070205@oracle.com>
Date: Tue, 10 Mar 2015 06:55:42 -0400
From: Sasha Levin <sasha.levin@oracle.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>,
        Dave Jones <davej@codemonkey.org.uk>,
        LKML <linux-kernel@vger.kernel.org>, nicolas.pitre@linaro.org,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Andrey Ryabinin <a.ryabinin@samsung.com>
Subject: Re: sched: divide error in sg_capacity_factor
References: <54FDDC83.3010202@oracle.com> <20150310042959.GA8591@gmail.com>
In-Reply-To: <20150310042959.GA8591@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
X-Source-IP: acsinet22.oracle.com [141.146.126.238]
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/10/2015 12:29 AM, Ingo Molnar wrote:
> 
> * Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> Hi all,
>>
>> While fuzzing with trinity inside the latest -next kernel using trinity I've stumbled on:
>>
>> [  936.784266] divide error: 0000 [#1] PREEMPT SMP KASAN
>> [  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
> 
> Hm, these line numbers don't seem to match up very well with my 
> version of linux-next:
> 
>   28855005be1d Add linux-next specific files for 20150306
> 
> and the Git version info included in the oops seems useless:
> 
>   4.0.0-rc1-sasha-00044-ge21109a
> 
>   $ git log e21109a
>   fatal: ambiguous argument 'e21109a': unknown revision or path not in the working tree.
> 
> I think the kernel's SHA1 should be made at least 12 char wide, 
> regardless of the user's gitconfig::core.abbrev settings.
> 
> Also, latest linux-next is -rc2 based, while your version string says 
> -rc1.
> 
>> [  936.829403] load_balance (kernel/sched/fair.c:6857)
> 
> this does not match up either.
> 
>> [  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 kernel/sched/fair.c:7659)
> 
> The line numbers are not even close to anything related: 
> run_rebalance_domains() starts at line 7666 and ends at 7680.

Right, this is my fuck up. It seems that I was fuzzing 4.0-rc1 rather than
-next as I thought I was. I forgot to go back to -next after I tested a few
things on Linus's tree.

So the line numbers should match correctly with Linus's tree as you've already
guessed below.

> Also, why are the offsets into the function missing from the output? 
> Those would allow the rough determination of the crash site, even if 
> debuginfo is crap.

I found that offsets are useless here because of the really odd things the
compiler does based on my config. There's so many things that got inlined
in this case that I think offsets wouldn't mean anything to you here.

For example, in this case the division by 0 happened on load_balance+0x88a/0x2399.

> I also checked Linus's latest, and they do seem to match up better:
> 
>   affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
> 
> and the line number gives:
> 
>         capacity_factor = min_t(unsigned,
>                 capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
> 
> but that's a division with a constant? Should not trap.
> 
> So I rebuild a kernel with debug info, pattern matched the disassembly 
> you provided, and that gave me this division:
> 
> (gdb) list *0xffffffff8107d958
> 0xffffffff8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
> 6157            capacity = group->sgc->capacity;
> 6158            capacity_orig = group->sgc->capacity_orig;
> 6159            cpus = group->group_weight;
> 6160
> 6161            /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
> 6162            smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
> 6163            capacity_factor = cpus / smt; /* cores */
> 6164
> 6165            capacity_factor = min_t(unsigned,
> 6166                    capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));

This is the division I was seeing as well.

> So this too seems not very plausible: 'capacity_orig' comes straight 
> from group->sgc->capacity_orig, which is:
> 
>  - boot time initialized
> 
>  - sometimes recalculated during CPU hot-plug: not sure how much of 
>    that your tests are doing?

I'm not forcing them, but they do happen pretty often.

>  - but otherwise it's fairly constant and should have crashed your 
>    system early on if it was set up wrong
> 
> unless I missed something that is.
> 
>> [  936.829403] __do_softirq (kernel/softirq.c:273 include/linux/jump_label.h:114 include/trace/events/irq.h:126 kernel/softirq.c:274)
>> [  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
>> [  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
>> [  936.829403]  <EOI>
>> [  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
>> [  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
>> [  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
>> [  936.829403] __set_page_owner (mm/page_owner.c:72)
>> [  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 mm/page_alloc.c:2176)
>> [  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
>> [  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
>> [  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 mm/memory.c:3269 mm/memory.c:3298)
>> [  936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
>> [  936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
>> [  936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
>> [  936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
> 
> So debug info weirdnesses aside, other divisions in 
> find_busiest_group():
> 
>         sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
>                                                 / sds.total_capacity;
> 
> total_capacity ought to be zero only on a totally borked machine 
> (unlikely to boot), or on memory corruption.
> 
> if calculate_imbalance() got inlined, then:
> 
>                 load_above_capacity /= busiest->group_capacity;
> 
> that too ought to only get corrupted in the most serious cases, we 
> don't recalculate it runtime.
> 
> So I'm baffled. Some tentative handwaving, pointing away from the 
> scheduler:
> 
>  - Your stack trace is 'weird' not just due to debug info: an async 
>    page fault doing allocations, doing a stack trace, interruted by a 
>    timer irq, doing scheduler rebalancing...

Yeah, it's not the most straightforward trace, but it's not "broken" -
it's a plausible scenario.

>  - The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
>    tracing facility got enabled explicitly via the page_owner=on boot 
>    parameter, right? Not many people are doing that I suspect.

Right. But all it did here was save a stack trace, no?

>  - CONFIG_KASAN=y is enabled in your kernel. New, invasive option, 
>    using compiler features that weren't used by kernel code before.

Agreed, I've Cc'ed Andrey and hope that he could see if there's anything
up with the assembly that's fishy.

>  - async page faults are virtualization specials: not used much 
>    elsewhere.
> 
>  - There's a 'W' taint in your oops. Probably some harmless prior
>    warning?

Yup, just one of my debug patches that helps me track down a different
bug.

> So your crash signature has the combination of 3 'uncommon' kernel 
> features, and a scheduler crash with a relatively constant value that 
> should never be zero and which should crash everywhere.
> 
> So right now I'd blame the other 3 guys, I wasn't even there that 
> night, officer!

Fair enough, I've only seen it once myself so I can ignore it until (if)
I hit it again.

Sorry again about the tree mixup. I'll go back to fuzzing -next now :)


Thanks,
Sasha

> Cc:-ed others as well.
> 
> Thanks,
> 
> 	Ingo
> 
> [*] Please name debugging features accordingly: CONFIG_DEBUG_PAGE_OWNER.
>     Maybe even prefix them with the subsystem: CONFIG_DEBUG_VM_PAGE_OWNER.
> 
>     We already have a nice set of CONFIG_DEBUG_VM* options:
> 
>       CONFIG_DEBUG_VM
>       CONFIG_DEBUG_VM_RB
>       CONFIG_DEBUG_VM_VMACACHE
> 
>     Also, there's no penalty for including a verb, so that people know 
>     wth it's doing, at a glance: CONFIG_DEBUG_VM_TRACK_PAGE_OWNER?
>