Need advice how to fix an access to uninitialized per_cpu clock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Igor Mammedov <imammedo@redhat.com>
To: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: kvm@vger.kernel.org,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	kosaki.motohiro@jp.fujitsu.com, tj@kernel.org, x86@kernel.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Rik van Riel <riel@redhat.com>, "avi@redhat.com" <avi@redhat.com>
Subject: Need advice how to fix an access to uninitialized per_cpu clock
Date: Thu, 02 Feb 2012 12:43:00 +0100	[thread overview]
Message-ID: <4F2A76C4.2010601@redhat.com> (raw)

While playing with kvm cpu hot-plug, I've probably stumbled on general
kernel bug. So I'm lookng for advice on approach to fix it.

When kvm guest uses kvmclock, it may hang on cpu hot-plug at

     BSP:
      smp_apic_timer_interrupt
        ...
          -> do_timer
            -> update_wall_time

and a being on-lined CPU at waiting on sync with BSP at:
      start_secondary:
        while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
                     cpu_relax();

because of unusually big value returned by clock->read(clock) in
update_wall_time.
This happens due to overflow in pvclock_get_nsec_offset

         u64 delta = tsc - shadow->tsc_timestamp;

when shadow->tsc_timestamp is bigger than tsc.
And since pvclock_clocksource_read remembers and returns largest
value of any clock that ever was returned, clock for affected guest
virtually freezes.

Overflow happens due to reading undefined values from uninitialized
per_cpu variable hv_clock. In case of cpu hot-plug, clock is read at
least once on being on-lined cpu before it is initialized for this
cpu:

       start_secondary
         |-> smp_callin
         |  ->  smp_store_cpu_info
         |    -> identify_secondary_cpu
         |      -> mtrr_ap_init
         |        -> mtrr_restore
         |          -> stop_machine_from_inactive_cpu
         |            -> queue_stop_cpus_work
         |              ...
         |                -> kvm_clock_read
         |...
         |x86_cpuinit.setup_percpu_clockev

full call chain is below:
[    0.002999]  [<ffffffff8102fc74>] pvclock_clocksource_read+0x9f/0x275
[    0.002999]  [<ffffffff81071f9e>] ? check_preempt_wakeup+0x11c/0x1c2
[    0.002999]  [<ffffffff8102f05d>] kvm_clock_read+0x21/0xd3
[    0.002999]  [<ffffffff810154b1>] sched_clock+0x9/0xd
[    0.002999]  [<ffffffff81070509>] sched_clock_local+0x12/0x75
[    0.002999]  [<ffffffff8107065e>] sched_clock_cpu+0x84/0xc6
[    0.002999]  [<ffffffff8106bfd3>] update_rq_clock+0x28/0x108
[    0.002999]  [<ffffffff8106c15e>] enqueue_task+0x1d/0x64
[    0.002999]  [<ffffffff8106c500>] activate_task+0x22/0x24
[    0.002999]  [<ffffffff8106c90c>] ttwu_do_activate.constprop.39+0x32/0x61
[    0.002999]  [<ffffffff8106cd0c>] try_to_wake_up+0x17e/0x1e1
[    0.002999]  [<ffffffff8106cd98>] wake_up_process+0x15/0x17
[    0.002999]  [<ffffffff8109ebbf>] cpu_stop_queue_work+0x3d/0x5f
[    0.002999]  [<ffffffff8109ec6c>] queue_stop_cpus_work+0x8b/0xb6
[    0.002999]  [<ffffffff8109f0e3>] stop_machine_from_inactive_cpu+0xb4/0xed
[    0.002999]  [<ffffffff81023896>] ? mtrr_restore+0x4a/0x4a
[    0.002999]  [<ffffffff81023eb3>] mtrr_ap_init+0x5a/0x5c
[    0.002999]  [<ffffffff814b95f7>] identify_secondary_cpu+0x19/0x1b
[    0.002999]  [<ffffffff814bc0d3>] smp_store_cpu_info+0x3c/0x3e
[    0.002999]  [<ffffffff814bc4ed>] start_secondary+0x122/0x263

Looking at native_smp_prepare_cpus in arch/x86/kernel/smpboot.c
I see that it unconditionally calls set_mtrr_aps_delayed_init which
in turn effectively converts calls to mtrr_ap_init to nop ops.
And later call to native_smp_cpus_done -> mtrr_aps_init completes
mtrr initialization.

The same pattern might be noticed in suspend/hibernate handlers, that
call enable_nonboot_cpus.

   enable_nonboot_cpus
     -> arch_enable_nonboot_cpus_begin
          -> set_mtrr_aps_delayed_init

     -> boot secondary cpus

     -> arch_enable_nonboot_cpus_end
          -> mtrr_aps_init

So question is if the calling mtrr_ap_init so early in
identify_secondary_cpu is really necessary?
 From current code it looks like it is never called at normal smp
boot/resume time. And only path that triggers it is cpu hot-plug one.

I see following possible solutions:

1. Could the call to mtrr_ap_init be just moved from identify_secondary_cpu
to start_secondary right after x86_cpuinit.setup_percpu_clockev call?
This will prevent an access to uninitialized per_cpu clock. Or move
x86_cpuinit.setup_percpu_clockev before smp_callin?

2. Another way to prevent access to uninitialized per_cpu clock is to
introduce hook.early_percpu_clock_init that will be called before
smp_callin or right before mtrr_ap_init.

I hope to see opinions about this matter from a more experienced people than me.

-- 
Thanks,
  Igor

next             reply	other threads:[~2012-02-02 11:43 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-02 11:43 Igor Mammedov [this message]
2012-02-03 17:42 ` Need advice how to fix an access to uninitialized per_cpu clock Marcelo Tosatti

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F2A76C4.2010601@redhat.com \
    --to=imammedo@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=hpa@zytor.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.