From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752224AbbCJK4c (ORCPT ); Tue, 10 Mar 2015 06:56:32 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:43361 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751285AbbCJK4a (ORCPT ); Tue, 10 Mar 2015 06:56:30 -0400
Message-ID: <54FECDAE.1070205@oracle.com>
Date: Tue, 10 Mar 2015 06:55:42 -0400
From: Sasha Levin
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Ingo Molnar
CC: Peter Zijlstra, Dave Jones, LKML, nicolas.pitre@linaro.org, Linus Torvalds, Andrew Morton, Andrey Ryabinin
Subject: Re: sched: divide error in sg_capacity_factor
References: <54FDDC83.3010202@oracle.com> <20150310042959.GA8591@gmail.com>
In-Reply-To: <20150310042959.GA8591@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
X-Source-IP: acsinet22.oracle.com [141.146.126.238]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/10/2015 12:29 AM, Ingo Molnar wrote:
>
> * Sasha Levin wrote:
>
>> Hi all,
>>
>> While fuzzing with trinity inside the latest -next kernel I've stumbled on:
>>
>> [ 936.784266] divide error: 0000 [#1] PREEMPT SMP KASAN
>> [ 936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
>
> Hm, these line numbers don't seem to match up very well with my
> version of linux-next:
>
>   28855005be1d Add linux-next specific files for 20150306
>
> and the Git version info included in the oops seems useless:
>
>   4.0.0-rc1-sasha-00044-ge21109a
>
>   $ git log e21109a
>   fatal: ambiguous argument 'e21109a': unknown revision or path not in the working tree.
>
> I think the kernel's SHA1 should be made at least 12 char wide,
> regardless of the user's gitconfig::core.abbrev settings.
>
> Also, latest linux-next is -rc2 based, while your version string says
> -rc1.
>
>> [ 936.829403] load_balance (kernel/sched/fair.c:6857)
>
> this does not match up either.
>
>> [ 936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 kernel/sched/fair.c:7659)
>
> The line numbers are not even close to anything related:
> run_rebalance_domains() starts at line 7666 and ends at 7680.

Right, this is my fuck up. It seems that I was fuzzing 4.0-rc1 rather than -next
as I thought I was. I forgot to go back to -next after I tested a few things on
Linus's tree.

So the line numbers should match correctly with Linus's tree as you've already
guessed below.

> Also, why are the offsets into the function missing from the output?
> Those would allow the rough determination of the crash site, even if
> debuginfo is crap.

I found that offsets are useless here because of the really odd things the
compiler does based on my config. There are so many things that got inlined in
this case that I think offsets wouldn't mean anything to you here.

For example, in this case the division by 0 happened on load_balance+0x88a/0x2399.

> I also checked Linus's latest, and they do seem to match up better:
>
>   affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
>
> and the line number gives:
>
>         capacity_factor = min_t(unsigned,
>                 capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
>
> but that's a division with a constant? Should not trap.
>
> So I rebuilt a kernel with debug info, pattern matched the disassembly
> you provided, and that gave me this division:
>
>  (gdb) list *0xffffffff8107d958
>  0xffffffff8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
>  6157           capacity = group->sgc->capacity;
>  6158           capacity_orig = group->sgc->capacity_orig;
>  6159           cpus = group->group_weight;
>  6160
>  6161           /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
>  6162           smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
>  6163           capacity_factor = cpus / smt; /* cores */
>  6164
>  6165           capacity_factor = min_t(unsigned,
>  6166                   capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));

This is the division I was seeing as well.

> So this too seems not very plausible: 'capacity_orig' comes straight
> from group->sgc->capacity_orig, which is:
>
>   - boot time initialized
>
>   - sometimes recalculated during CPU hot-plug: not sure how much of
>     that your tests are doing?

I'm not forcing them, but they do happen pretty often.

>   - but otherwise it's fairly constant and should have crashed your
>     system early on if it was set up wrong
>
> unless I missed something that is.
>
>> [ 936.829403] __do_softirq (kernel/softirq.c:273 include/linux/jump_label.h:114 include/trace/events/irq.h:126 kernel/softirq.c:274)
>> [ 936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [ 936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
>> [ 936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
>> [ 936.829403]
>> [ 936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
>> [ 936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
>> [ 936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
>> [ 936.829403] __set_page_owner (mm/page_owner.c:72)
>> [ 936.829403] get_page_from_freelist (include/linux/page_owner.h:26 mm/page_alloc.c:2176)
>> [ 936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
>> [ 936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
>> [ 936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 mm/memory.c:3269 mm/memory.c:3298)
>> [ 936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
>> [ 936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
>> [ 936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
>> [ 936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
>
> So debug info weirdnesses aside, other divisions in
> find_busiest_group():
>
>         sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
>                                                 / sds.total_capacity;
>
> total_capacity ought to be zero only on a totally borked machine
> (unlikely to boot), or on memory corruption.
>
> if calculate_imbalance() got inlined, then:
>
>                 load_above_capacity /= busiest->group_capacity;
>
> that too ought to only get corrupted in the most serious cases, we
> don't recalculate it at runtime.
>
> So I'm baffled. Some tentative handwaving, pointing away from the
> scheduler:
>
>  - Your stack trace is 'weird' not just due to debug info: an async
>    page fault doing allocations, doing a stack trace, interrupted by a
>    timer irq, doing scheduler rebalancing...

Yeah, it's not the most straightforward trace, but it's not "broken" - it's
a plausible scenario.

>  - The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
>    tracing facility got enabled explicitly via the page_owner=on boot
>    parameter, right? Not many people are doing that I suspect.

Right. But all it did here was save a stack trace, no?

>  - CONFIG_KASAN=y is enabled in your kernel. New, invasive option,
>    using compiler features that weren't used by kernel code before.

Agreed, I've Cc'ed Andrey and hope that he could see if there's anything up
with the assembly that's fishy.

>  - async page faults are virtualization specials: not used much
>    elsewhere.
>
>  - There's a 'W' taint in your oops. Probably some harmless prior
>    warning?

Yup, just one of my debug patches that helps me track down a different bug.

> So your crash signature has the combination of 3 'uncommon' kernel
> features, and a scheduler crash with a relatively constant value that
> should never be zero and which should crash everywhere.
>
> So right now I'd blame the other 3 guys, I wasn't even there that
> night, officer!

Fair enough, I've only seen it once myself so I can ignore it until (if) I hit
it again.

Sorry again about the tree mixup. I'll go back to fuzzing -next now :)


Thanks,
Sasha

> Cc:-ed others as well.
>
> Thanks,
>
>         Ingo
>
> [*] Please name debugging features accordingly: CONFIG_DEBUG_PAGE_OWNER.
>     Maybe even prefix them with the subsystem: CONFIG_DEBUG_VM_PAGE_OWNER.
>
>     We already have a nice set of CONFIG_DEBUG_VM* options:
>
>      CONFIG_DEBUG_VM
>      CONFIG_DEBUG_VM_RB
>      CONFIG_DEBUG_VM_VMACACHE
>
>     Also, there's no penalty for including a verb, so that people know
>     wth it's doing, at a glance: CONFIG_DEBUG_VM_TRACK_PAGE_OWNER?
>
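
For reference, the arithmetic quoted above (fair.c lines 6157-6166) can be
sketched in userspace C. This is only an illustration, not the kernel code:
the two macros are simplified unsigned-only stand-ins for the kernel's
DIV_ROUND_UP and DIV_ROUND_CLOSEST, SCHED_CAPACITY_SCALE is assumed to be
1024, the variables are assumed to be unsigned int, and both function names
are hypothetical. Under those assumptions it suggests that the quoted lines
can only trap if capacity_orig is zero or if smt comes out as zero.

```c
#include <stdbool.h>

/* Assumption: the scale is 1 << 10, as in the kernel under discussion. */
#define SCHED_CAPACITY_SCALE 1024U

/* Simplified, unsigned-only stand-ins for the kernel macros of the same name. */
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))
#define DIV_ROUND_CLOSEST(x, d) (((x) + ((d) / 2)) / (d))

/*
 * Hypothetical helper: reports whether the quoted computation would
 * divide by zero for the given inputs, without performing the division.
 */
static bool capacity_factor_would_trap(unsigned int cpus,
                                       unsigned int capacity_orig)
{
        if (capacity_orig == 0)
                return true;    /* divisor inside DIV_ROUND_UP is zero */

        /* smt == 0 would make the following "cpus / smt" trap */
        return DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig) == 0;
}

/*
 * Hypothetical mirror of the quoted lines. Call it only when
 * capacity_factor_would_trap() returned false for the same inputs.
 */
static unsigned int capacity_factor(unsigned int cpus,
                                    unsigned int capacity,
                                    unsigned int capacity_orig)
{
        unsigned int smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus,
                                        capacity_orig);
        unsigned int factor = cpus / smt;       /* cores */
        /* Division by a non-zero constant: cannot trap. */
        unsigned int by_capacity = DIV_ROUND_CLOSEST(capacity,
                                                     SCHED_CAPACITY_SCALE);

        return factor < by_capacity ? factor : by_capacity;
}
```

With these definitions, capacity_factor_would_trap(2, 0) and
capacity_factor_would_trap(0, 1024) both report a trap, while a sane input
such as capacity_factor_would_trap(2, 2048) does not.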