From: Terry Loftin <terry.loftin@hp.com>
To: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	Peter Zijlstra <peterz@infradead.org>
Cc: Bob Montgomery <bob.montgomery@hp.com>
Subject: [PATCH 0/2] sched: Fix "divide error: 0000" in find_busiest_group
Date: Tue, 19 Jul 2011 14:58:42 -0600
Message-ID: <4E25F002.2080503@hp.com>

Howdy,

The divide error occurs in the inlined function update_sg_lb_stats() in
kernel/sched.c when we adjust the relative CPU power of a group by
dividing group_load by group->cpu_power:

    /* Adjust by relative CPU power of the group */
    sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

In this case, group->cpu_power is zero.  This was set in
update_cpu_power(), which depends on scale_rt_power() among other things.
scale_rt_power() is based in part on the rq->clock and rq->age_stamp
values for the runqueue:

    total = sched_avg_period() + (rq->clock - rq->age_stamp);

The clock and age_stamp values are in nanoseconds and come from
__cycles_2_ns() which converts the CPU tsc counter to nanoseconds.
On 64-bit systems, the value computed by __cycles_2_ns() wraps once
the nanosecond value reaches 2^54 (about 208.5 days of uptime).
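
As a sanity check on the 208.5 day figure, the arithmetic can be
sketched as below.  The helper wrap_days() is a hypothetical name used
only for illustration; it is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of days until a nanosecond counter
 * that is limited to `bits` bits wraps around.
 */
static double wrap_days(unsigned int bits)
{
    assert(bits < 64);                  /* keep the shift well defined */

    uint64_t ns = (uint64_t)1 << bits;  /* first value that no longer fits */

    return (double)ns / 1e9 / 86400.0;  /* ns -> seconds -> days */
}
```

Calling wrap_days(54) gives a value just under 208.5, which matches
the uptime at which the crashes start to appear.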

The rq->age_stamp is designed to follow the clock value but does not
account for the fact that the clock value may wrap, and it is never reset.
After rq->clock wraps, the expression (rq->clock - rq->age_stamp) leads
to large negative values which in turn lead to very large values for
scale_rt_power().
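
The effect of the wrap on that subtraction can be sketched in
isolation.  This assumes both values are held as unsigned 64-bit
quantities, so the "negative" difference shows up as a very large
positive number; clock_delta() is a hypothetical stand-in, not the
kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the (rq->clock - rq->age_stamp) term.
 * Unsigned subtraction wraps modulo 2^64, so a clock that has gone
 * backwards relative to age_stamp yields a huge result rather than
 * a small negative one.
 */
static uint64_t clock_delta(uint64_t clock, uint64_t age_stamp)
{
    return clock - age_stamp;
}
```

For example, with age_stamp just below 2^54 and clock restarted near
zero, clock_delta() returns a value close to 2^64 instead of a few
microseconds.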

In update_cpu_power(), an unsigned long local variable, 'power', is
used to hold the intermediate result, including the return value from
scale_rt_power(), before it is placed in an unsigned int rq->cpu_power.
If the power calculated in update_cpu_power() needs more than 32 bits
but its low-order 32 bits are all zero, the value is truncated on
assignment and rq->cpu_power is set to zero, leading to the
divide-by-zero error.
There is a protective check immediately before the assignment, but it
compares the full 64-bit value instead of the 32-bit portion that will
be stored in rq->cpu_power.
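
The truncation can be reproduced outside the kernel with a minimal
sketch.  Both function names are hypothetical, the fixed-width types
stand in for unsigned long and unsigned int on a 64-bit system, and
the second function only illustrates checking the truncated value; it
is not necessarily how the patches in this series do it.

```c
#include <stdint.h>

/*
 * Models the failure: the zero check looks at the full 64-bit value,
 * then the assignment keeps only the low 32 bits.
 */
static uint32_t store_power_unchecked(uint64_t power)
{
    if (!power)
        power = 1;

    return (uint32_t)power;     /* 0 when only bits 32 and up are set */
}

/*
 * Illustrative alternative: truncate first, then check the value
 * that will actually be stored.
 */
static uint32_t store_power_checked(uint64_t power)
{
    uint32_t stored = (uint32_t)power;

    if (!stored)
        stored = 1;

    return stored;
}
```

With power = 1ULL << 32, store_power_unchecked() returns 0 even though
the protective check ran, while store_power_checked() returns 1.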

I have analyzed two crash dumps from systems that were up 220 and 230
days to confirm this.

-T

Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Signed-off-by: Bob Montgomery <bob.montgomery@hp.com>


