* AMD Magny-Cours and HPET
@ 2011-08-16  9:47 Andrew Cooper
  2011-08-16 10:09 ` Jan Beulich
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Cooper @ 2011-08-16  9:47 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com, Christoph Egger, Wei Huang

Hello,

We have had a bug raised against Xen-3.4 that the kexec path fails, on
HP BL465c G7 blades.  The problem does not reproduce on any other AMD
machines I have to hand.

On further investigation, it appears that if the crashing cpu is #0,
then the kexec path hangs forever trying to grab the already locked
legacy_hpet_event.lock in hpet_disable_legacy_broadcast().  Removing the
lock/unlock pair causes the kexec crash path to work as expected.

If the crashing cpu is not #0, then local_time_calibration() gets
worried and dumps the calibration data, and hangs at some later point
which I have yet to find.  This hang happens while performing the NMI
shootdown of other cpus.

The support engineer who raised the bug says that it doesn't occur with
Xen-4.1.  Is there anything architecturally new in the Magny-Cours
processors which might explain this behavior?

I am unwilling to try and backport the hpet code from Xen-4.x without
understanding the problem, although it is a possible solution.

Thanks

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: AMD Magny-Cours and HPET
  2011-08-16  9:47 AMD Magny-Cours and HPET Andrew Cooper
@ 2011-08-16 10:09 ` Jan Beulich
  2011-08-16 12:32   ` Andrew Cooper
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Beulich @ 2011-08-16 10:09 UTC (permalink / raw)
  To: ChristophEgger, Wei Huang, Andrew Cooper,
	xen-devel@lists.xensource.com

>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> We have had a bug raised against Xen-3.4 that the kexec path fails, on
> HP BL465c G7 blades.  The problem does not reproduce on any other AMD
> machines I have to hand.
> 
> On further investigation, it appears that if the crashing cpu is #0,
> then the kexec path hangs forever trying to grab the already locked
> legacy_hpet_event.lock in hpet_disable_legacy_broadcast().  Removing the
> lock/unlock pair causes the kexec crash path to work as expected.

Are you sure it is locked (rather than never initialized)? The problem
could be that hpet_broadcast_is_available() returns true because of
num_hpets_used > 0, yet hpet_broadcast_init() didn't make it down
to spin_lock_init(&legacy_hpet_event.lock).
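
The hypothesis above can be sketched as a toy model. This is not the Xen
source: the struct, its fields and the _model suffixes are illustrative
assumptions, and a plain flag stands in for spin_lock_init() having run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not Xen code: every name here is illustrative. */
struct hpet_model {
    unsigned int num_hpets_used;   /* per-channel timers claimed for broadcast */
    bool legacy_lock_initialised;  /* stands in for spin_lock_init() having run */
};

/* Models the suspected early exit: when per-channel timers are found,
 * the function returns before the legacy lock is ever initialised. */
static void hpet_broadcast_init_model(struct hpet_model *h,
                                      unsigned int channels_found)
{
    assert(h != NULL);
    h->num_hpets_used = channels_found;
    h->legacy_lock_initialised = false;

    if (h->num_hpets_used > 0)
        return;                    /* legacy setup is skipped entirely */

    h->legacy_lock_initialised = true;
}

/* Models a predicate that reports "available" for either broadcast mode. */
static bool hpet_broadcast_is_available_model(const struct hpet_model *h)
{
    assert(h != NULL);
    return h->legacy_lock_initialised || h->num_hpets_used > 0;
}
```

With three channels found, the predicate is true while the lock flag is
still false, which is the state in which a caller would go on to take a
lock that was never set up.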

> If the crashing cpu is not #0, then local_time_calibration() gets
> worried and dumps the calibration data, and hangs at some later point
> which I have yet to find.  This hang happens while performing the NMI
> shootdown of other cpus.
> 
> The support engineer who raised the bug says that it doesn't occur with
> Xen-4.1.  Is there anything architecturally new in the Magny-Cours
> processors which might explain this behavior?

Possibly more a question of the surrounding platform, namely whether
there are HPETs in the system, and whether they get used for the
C-state broadcasting.

Jan


* Re: AMD Magny-Cours and HPET
  2011-08-16 10:09 ` Jan Beulich
@ 2011-08-16 12:32   ` Andrew Cooper
  2011-08-16 12:55     ` Jan Beulich
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Cooper @ 2011-08-16 12:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: ChristophEgger, Wei Huang, xen-devel@lists.xensource.com



On 16/08/11 11:09, Jan Beulich wrote:
>>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> We have had a bug raised against Xen-3.4 that the kexec path fails, on
>> HP BL465c G7 blades.  The problem does not reproduce on any other AMD
>> machines I have to hand.
>>
>> On further investigation, it appears that if the crashing cpu is #0,
>> then the kexec path hangs forever trying to grab the already locked
>> legacy_hpet_event.lock in hpet_disable_legacy_broadcast().  Removing the
>> lock/unlock pair causes the kexec crash path to work as expected.
> Are you sure it is locked (rather than never initialized)? The problem
> could be that hpet_broadcast_is_available() returns true because of
> num_hpets_used > 0, yet hpet_broadcast_init() didn't make it down
> to spin_lock_init(&legacy_hpet_event.lock).

That is a very good point.  I had not considered it, and it turns out
that legacy broadcast is never set up:

(XEN) HPET: starting hpet_broadcast_init()
(XEN) HPET: hpet_setup() successful
(XEN) HPET: 4 timers in total, 3 timers will be used for broadcast

hpet_broadcast_init() exits inside the "if ( num_hpets_used > 0 )"
clause (as the boot dmesg doesn't printk the line immediately following
the if clause), meaning that legacy broadcasts are never set up.

Therefore, the logic

if ( hpet_broadcast_is_available() )
    hpet_disable_legacy_broadcast();

in several places is wrong, and should be "if hpet legacy broadcast
used".  Judging by the similarities in this regard between Xen-3.4 and
Xen-4.x, I am now not certain that Xen-4.x is immune and will now
proceed to investigate this.
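
A minimal sketch of the narrower guard, as a toy model rather than a
patch.  hpet_legacy_broadcast_used() is a hypothetical name, as are the
struct and its fields; a counter stands in for the call to
hpet_disable_legacy_broadcast().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not Xen code: every name here is illustrative. */
struct broadcast_state {
    unsigned int num_hpets_used;        /* per-channel timers used for broadcast */
    bool legacy_in_use;                 /* set only if legacy setup completed */
    unsigned int legacy_disable_calls;  /* entries into the legacy teardown */
};

/* Hypothetical predicate: true only when legacy broadcast was set up,
 * regardless of how many per-channel timers are in use. */
static bool hpet_legacy_broadcast_used(const struct broadcast_state *s)
{
    assert(s != NULL);
    return s->legacy_in_use;
}

/* Crash-path teardown guarded by the narrower predicate. */
static void crash_teardown_model(struct broadcast_state *s)
{
    assert(s != NULL);
    if (hpet_legacy_broadcast_used(s))
        s->legacy_disable_calls++;  /* stands in for hpet_disable_legacy_broadcast() */
}
```

On a machine in the state seen here (per-channel timers in use, legacy
never set up) the teardown is skipped, so the legacy lock is never
touched.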

>> If the crashing cpu is not #0, then local_time_calibration() gets
>> worried and dumps the calibration data, and hangs at some later point
>> which I have yet to find.  This hang happens while performing the NMI
>> shootdown of other cpus.
>>
>> The support engineer who raised the bug says that it doesn't occur with
>> Xen-4.1.  Is there anything architecturally new in the Magny-Cours
>> processors which might explain this behavior?
> Possibly more a question of the surrounding platform, namely whether
> there are HPETs in the system, and whether they get used for the
> C-state broadcasting.
>
> Jan
>

Why would C-state broadcasting make a difference at this point?  I have
narrowed the crash down a bit, and local_time_calibration() is dumping
its state after one_cpu_only() and before the shootdown actually
occurs.  However, I can't see any code between these two points which
alters the state of the other CPU, which should still be running
normally at this point.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


* Re: AMD Magny-Cours and HPET
  2011-08-16 12:32   ` Andrew Cooper
@ 2011-08-16 12:55     ` Jan Beulich
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Beulich @ 2011-08-16 12:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: ChristophEgger, Wei Huang, xen-devel@lists.xensource.com

>>> On 16.08.11 at 14:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 16/08/11 11:09, Jan Beulich wrote:
>>>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> The support engineer who raised the bug says that it doesn't occur with
>>> Xen-4.1.  Is there anything architecturally new in the Magny-Cours
>>> processors which might explain this behavior?
>> Possibly more a question of the surrounding platform, namely whether
>> there are HPETs in the system, and whether they get used for the
>> C-state broadcasting.
> 
> Why would C-state broadcasting make a difference at this point?  I have
> narrowed the crash down a bit, and local_time_calibration() is dumping
> its state after one_cpu_only() and before the shootdown actually
> occurs.  However, I can't see any code between these two points which
> alters the state of the other CPU, which should still be running
> normally at this point.

That "num_hpets_used > 0" check in hpet_broadcast_is_available()
could be false for all other AMD systems you had tried this on, and
hence you might not be getting into hpet_disable_legacy_broadcast()
there at all.

(4.0.2 and 4.1.1 have, btw., an extra non-zero check against
legacy_hpet_event.shift in hpet_disable_legacy_broadcast();
4.0.1 and 4.1.0 don't.)
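
Roughly, such a check amounts to the following toy model.  It is not the
actual code from those releases: the names are illustrative, and it
assumes shift only becomes non-zero once legacy broadcast has been set
up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not Xen code: every name here is illustrative. */
struct legacy_event_model {
    uint32_t shift;                  /* assumed non-zero only after legacy setup */
    unsigned int lock_acquisitions;  /* how often the lock would have been taken */
};

/* Returns true if the teardown ran, false if it bailed out early. */
static bool disable_legacy_broadcast_model(struct legacy_event_model *ev)
{
    assert(ev != NULL);

    if (ev->shift == 0)
        return false;                /* legacy mode never set up: leave the lock alone */

    ev->lock_acquisitions++;         /* stands in for the lock/unlock pair */
    return true;
}
```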

Jan

