Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale_stime()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stanislaw Gruszka <sgruszka@redhat.com>
To: Christian Engelmayer <cengelma@gmx.at>
Cc: Ingo Molnar <mingo@redhat.com>,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale_stime()
Date: Mon, 18 Nov 2013 15:02:24 +0100	[thread overview]
Message-ID: <20131118140224.GA3330@redhat.com> (raw)
In-Reply-To: <20131116223740.3cd579cb@spike>

Hi Christian 

On Sat, Nov 16, 2013 at 10:37:40PM +0100, Christian Engelmayer wrote:
> Since upgrading from v3.8 to v3.12 I see random crashes in function scale_stime()
> in kernel/sched/cputime.c:
> 
>    divide error: 0000 [#1] SMP 
>    Modules linked in: parport_pc(F) ppdev(F) bnep rfcomm bluetooth binfmt_misc(F)
>    zl10353 cx88_dvb cx88_vp3054_i2c videobuf_dvb dvb_core intel_powerclamp coretemp
>    kvm_intel(F) tuner_xc2028 kvm(F) i915 snd_hda_codec_hdmi snd_hda_codec_realtek
>    cx8800 cx8802 tuner snd_hda_intel snd_hda_codec cx88_alsa crct10dif_pclmul(F)
>    crc32_pclmul(F) snd_hwdep(F) snd_pcm(F) snd_page_alloc(F) ghash_clmulni_intel(F)
>    aesni_intel(F) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F)
>    joydev(F) cx88xx snd_seq_device(F) snd_timer(F) aes_x86_64(F) lrw(F) gf128mul(F)
>    glue_helper(F) video(F) btcx_risc drm_kms_helper ablk_helper(F) tveeprom cryptd(F)
>    lp(F) videobuf_dma_sg rc_core drm v4l2_common videobuf_core mei_me parport(F)
>    snd(F) mei soundcore(F) videodev i2c_algo_bit serio_raw(F) microcode(F) mac_hid
>    lpc_ich asus_atk0110 hid_generic usbhid hid usb_storage(F) firewire_ohci ahci(F)
>    libahci(F) firewire_core r8169 crc_itu_t(F) mii(F)
>    CPU: 3 PID: 15367 Comm: htop Tainted: GF            3.12.0-031200-generic #201311031935
>    Hardware name: System manufacturer System Product Name/P7H55-M PRO, BIOS 1709    01/04/2011
>    task: ffff8800cc09e000 ti: ffff8800af620000 task.ti: ffff8800af620000
>    RIP: 0010:[<ffffffff81099de0>]  [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
>    RSP: 0018:ffff8800af621cc8  EFLAGS: 00010847
>    RAX: 85fdc1fef4047c00 RBX: 0000000000000000 RCX: ffff8800af621df8
>    RDX: 0000000000000000 RSI: ffff8800cc0634d0 RDI: 0000000000000000
>    RBP: ffff8800af621cd8 R08: 00000000fffffffe R09: 0000000000000000
>    R10: 0000000000000000 R11: fffffffe03427acc R12: ffff8800af621df0
>    R13: ffff8800af621df8 R14: 0000000000000000 R15: ffff8800cc063300
>    FS:  00007f22a387d740(0000) GS:ffff880117c60000(0000) knlGS:0000000000000000
>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    CR2: 00007f22a3892000 CR3: 0000000097023000 CR4: 00000000000007e0
>    Stack:
>     ffff8800c37f0000 ffff8800af621df0 ffff8800af621d18 ffffffff8109aa51
>     000000000a7d8c00 0000000042fee100 fffffffe03427acc ffff8800bf112a80
>     ffff8800c37f0000 ffff8800c307c280 ffff8800af621e50 ffffffff8121f74b
>    Call Trace:
>     [<ffffffff8109aa51>] thread_group_cputime_adjusted+0x41/0x50
>     [<ffffffff8121f74b>] do_task_stat+0x8eb/0xb60
>     [<ffffffff81176400>] ? vma_compute_subtree_gap+0x50/0x50
>     [<ffffffff81220314>] proc_tgid_stat+0x14/0x20
>     [<ffffffff8121b16d>] proc_single_show+0x4d/0x90
>     [<ffffffff811d6eee>] seq_read+0x14e/0x390
>     [<ffffffff811b3725>] vfs_read+0x95/0x160
>     [<ffffffff811b4239>] SyS_read+0x49/0xa0
>     [<ffffffff81723ced>] system_call_fastpath+0x1a/0x1f
>    Code: 89 fa 49 c1 ea 20 4d 85 d2 74 ca 4c 89 c2 48 d1 ef 49 89 c0 48 d1 ea 48
>          89 d0 eb 9f 0f 1f 80 00 00 00 00 89 c0 31 d2 49 0f af c0 <48> f7 f7 4c
>          89 df 48 29 c7 49 89 c3 e9 31 ff ff ff 66 66 66 66 
>    RIP  [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
>     RSP <ffff8800af621cc8>
>    ---[ end trace dbafd2159a385dd6 ]---
> 
> The affected LOC performing the division by 0 was introduced in commit
> 
>    commit 55eaa7c1f511af5fb6ef808b5328804f4d4e5243
>    Author: Stanislaw Gruszka <sgruszka@redhat.com>
>    Date:   Tue Apr 30 17:14:42 2013 +0200
>    sched: Avoid cputime scaling overflow
> 
> For the problem to occur the function is called eg. with the following
> input parameters
> 
>    stime: 0x3567e00
>    rtime: 0xffffffffbf1abfdb
>    total: 0x3938700
> 
> which leads to 'total' being shifted to 0 during the adaption of the precision
> and is then used without further check in
> 
>    scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
> 
> The root cause triggering this issue seems to be an overflowed value of
> 
>    rtime = nsecs_to_cputime(curr->sum_exec_runtime);
> 
> On the affected machine the problem can be triggered by loading the
> previously idle system by starting a full kernel build. The problem occurs
> within a minute after the ondemand frequency scaling governor adjusts the
> frequency from the minimum to the maximum.
> 
> The x86 init check whether all booted CPUs have their TSC's synchronized, never
> failed so far, however, the tsc clocksource is sporadically marked unstable.
> 
>    Clocksource tsc unstable (delta = -74994678 ns)
> 
> The used CPU provides an Intel Invariant TSC as stated by CPUID.80000007H:EDX[8]:
> 
>     eax in    eax      ebx      ecx      edx
>    00000000 0000000b 756e6547 6c65746e 49656e69
>    00000001 00020652 04100800 0298e3ff bfebfbff
>    00000002 55035a01 00f0b2e3 00000000 09ca212c
>    00000003 00000000 00000000 00000000 00000000
>    00000004 00000000 00000000 00000000 00000000
>    00000005 00000040 00000040 00000003 00001120
>    00000006 00000005 00000002 00000001 00000000
>    00000007 00000000 00000000 00000000 00000000
>    00000008 00000000 00000000 00000000 00000000
>    00000009 00000000 00000000 00000000 00000000
>    0000000a 07300403 00000004 00000000 00000603
>    0000000b 00000000 00000000 0000002c 00000004
>    80000000 80000008 00000000 00000000 00000000
>    80000001 00000000 00000000 00000001 28100800
>    80000002 65746e49 2952286c 726f4320 4d542865
>    80000003 35692029 55504320 20202020 20202020
>    80000004 30353620 20402020 30322e33 007a4847
>    80000005 00000000 00000000 00000000 00000000
>    80000006 00000000 00000000 01006040 00000000
>    80000007 00000000 00000000 00000000 00000100
>    80000008 00003024 00000000 00000000 00000000
> 
>    Vendor ID: "GenuineIntel"; CPUID level 11
> 
>    Intel-specific functions:
>    Version 00020652:
>    Type 0 - Original OEM
>    Family 6 - Pentium Pro
>    Model 5 - Pentium II Model 5/Xeon/Celeron
>    Stepping 2
>    Reserved 8
> 
>    Extended brand string: "Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz"
>    CLFLUSH instruction cache line size: 8
>    Initial APIC ID: 4
>    Hyper threading siblings: 16
> 
>    Feature flags bfebfbff:
>    FPU    Floating Point Unit
>    VME    Virtual 8086 Mode Enhancements
>    DE     Debugging Extensions
>    PSE    Page Size Extensions
>    TSC    Time Stamp Counter
>    MSR    Model Specific Registers
>    PAE    Physical Address Extension
>    MCE    Machine Check Exception
>    CX8    COMPXCHG8B Instruction
>    APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
>    SEP    Fast System Call
>    MTRR   Memory Type Range Registers
>    PGE    PTE Global Flag
>    MCA    Machine Check Architecture
>    CMOV   Conditional Move and Compare Instructions
>    FGPAT  Page Attribute Table
>    PSE-36 36-bit Page Size Extension
>    CLFSH  CFLUSH instruction
>    DS     Debug store
>    ACPI   Thermal Monitor and Clock Ctrl
>    MMX    MMX instruction set
>    FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
>    SSE    Streaming SIMD Extensions instruction set
>    SSE2   SSE2 extensions
>    SS     Self Snoop
>    HT     Hyper Threading
>    TM     Thermal monitor
>    31     reserved
> 
> Nevertheless, when looking into the issue I saw occurences of sched_clock going
> backwards as if the TSCs were read out of sync. Accordingly the problem does
> not occur when either booting with option 'nosmp' or when forcing the TSC to be
> marked as unstable for sched_clock. Booting with 'acpi=off' and no frequency
> scaling works too.
> 
> Having a look at the scheduler code I see the following pattern that would also
> catch a time warp, eg. kernel/sched/rt.c update_curr_rt():
> 
>    u64 delta_exec;
>    delta_exec = rq_clock_task(rq) - curr->se.exec_start;
>    if (unlikely((s64)delta_exec <= 0))
>            return;
> 
> However, there are still vulnerable places, eg. kernel/sched/fair.c update_curr():
> 
>    /*
>     * Get the amount of time the current task was running
>     * since the last time we changed load (this cannot
>     * overflow on 32 bits):
>     */
>    delta_exec = (unsigned long)(now - curr->exec_start);
>    if (!delta_exec)
>            return;

Thanks for great analyse. I'm not sure where this problem should be
fixed (in TSC clocksource or sched/fair.c or sched/cputime.c), however
since RT scheduler is protected for sched_clock going backward, we
probably can protect fair scheduler as well (also do_task_delta_exec()
has similar check).

Does the below patch fixes the issue on affected machine?

Stanislaw

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..f02a567 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -727,7 +727,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	u64 now = rq_clock_task(rq_of(cfs_rq));
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!curr || now <= curr->exec_start))
 		return;
 
 	/*
@@ -736,8 +736,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	 * overflow on 32 bits):
 	 */
 	delta_exec = (unsigned long)(now - curr->exec_start);
-	if (!delta_exec)
-		return;
 
 	__update_curr(cfs_rq, curr, delta_exec);
 	curr->exec_start = now;

next prev parent reply	other threads:[~2013-11-18 13:05 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-16 21:37 [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale_stime() Christian Engelmayer
2013-11-18 14:02 ` Stanislaw Gruszka [this message]
2013-11-18 14:19   ` Peter Zijlstra
2013-11-18 15:39     ` Ingo Molnar
2013-11-18 17:27   ` Peter Zijlstra
2013-11-20 12:08     ` Stanislaw Gruszka
2013-11-25  0:43       ` Christian Engelmayer
2013-11-26 14:44         ` [PATCH] sched, fair: Rework sched_fair time accounting Peter Zijlstra
2013-12-12  9:52     ` [tip:sched/urgent] sched/fair: " tip-bot for Peter Zijlstra

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:7c70201 dfblob:f02a567 )
 OR (
bs:"Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale_stime()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131118140224.GA3330@redhat.com \
    --to=sgruszka@redhat.com \
    --cc=cengelma@gmx.at \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.