* Xen HPET improvement proposal
@ 2013-10-25 12:21 Andrew Cooper
2013-10-25 12:47 ` Jan Beulich
0 siblings, 1 reply; 5+ messages in thread
From: Andrew Cooper @ 2013-10-25 12:21 UTC (permalink / raw)
To: Xen-devel List, Keir Fraser, Jan Beulich, Tim Deegan
Hello,
Having now had enough time to mentally unravel the current HPET code, I
present a proposal here for new logic, to replace the currently buggy
handling, which can be observed using a debug build of Xen, where it is
possible to have a stack overflow because of incorrectly pointed
interrupts, and errors nesting them.
In a system using HPETs in combination with idle, there are N HPETs and
M cpus. When a cpu wishes to idle, it must set up an external interrupt
to wake it back up again in time for its next deadline. If there are
free HPETs, this is easy; grab a free one and program it to interrupt
you at some point in the future. If however all the HPETs are already
allocated, one must be shared.
The root problems with the current situation are twofold. Xen runs the
action handler for all IRQs with interrupts enabled and having already
been acknowledged at the LAPIC. This allows arbitrary stacking of
interrupts, including lower priority interrupts. (This is a serious
problem which shall be fixed, but not part of this proposal).
Currently, HPET interrupts use the regular IRQ machinery including irq
migration, which results in the HPET interrupts being delivered to the
wrong cpus. It also means that they can be delayed for arbitrary
lengths of time behind an active line level interrupt being delivered to
a guest. Furthermore, there appears to be extra, overly-complicated
fixup code in the interrupt handler itself, apparently working around a
lack of understanding of why the interrupts are arriving at the wrong
cpus/wrong time.
Independently of the HPET issues themselves, I have identified a race
condition in the mwait-idle routines where a cpu which is preparing to
sleep can arrange for another cpu to wake it up, and have that other cpu
wake it up before it has enabled its mwait trigger, meaning that it will
idle for an arbitrary length of time in mwait. Realistically, the cpu
will be woken up by the time calibration rendezvous once a second, and
possibly by the watchdog NMI every half second.
For the new mechanism, I propose that HPET interrupts get a
direct_apic_vector and completely bypass the IRQ mechanism. This gives
the HPET interrupts guaranteed higher priority than all guest
interrupts. When a cpu wishes to idle, tries to find an HPET. If there
is a free HPET, the cpu becomes the owner of the HPET. It sets the HPET
up to interrupt itself at some point in the future and goes to sleep.
If there is not a free HPET, a cpu will need to share with another cpu.
If this cpu can find another HPET which will fire at an appropriate
time, the cpu can merely ask for it to be woken up by the HPET owner
when the owner wakes up. If all the HPETs are programmed to fire a
sufficient time into the future, one needs to be shortened. The cpu
should choose the soonest HPET, add itself to the owner's list of other
pcpus to wake, and reprogram the HPET to fire sooner. It should not
reprogram the HPET to point to itself.
The final requirement makes it far far easier to validate the
correctness of the correctness of the fix, and in particular that
interrupts are arriving at the expected cpu. Given a validated solution
proved to work, it might be possible to relax the requirement, so long
as a reasonable solution to waking up the original owner is found (and I
can't offhand think of a neat way of doing this, as ownership could move
around arbitrarily).
I would appreciate thoughts and comments. This will end up being a
substantial rewrite of most of hpet.c, but I believe the result will be
shorter, more simple and far more reliable.
~Andrew
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Xen HPET improvement proposal
2013-10-25 12:21 Xen HPET improvement proposal Andrew Cooper
@ 2013-10-25 12:47 ` Jan Beulich
2013-10-25 13:11 ` Andrew Cooper
0 siblings, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2013-10-25 12:47 UTC (permalink / raw)
To: andrew.cooper3, xen-devel, keir, tim
>>> Andrew Cooper <andrew.cooper3@citrix.com> 10/25/13 2:22 PM >>>
>The root problems with the current situation are twofold. Xen runs the
>action handler for all IRQs with interrupts enabled and having already
>been acknowledged at the LAPIC.
In a subset of cases. And as pointed out before this early ack-ing is
very likely wrong in the HPET case.
>Independently of the HPET issues themselves, I have identified a race
>condition in the mwait-idle routines where a cpu which is preparing to
>sleep can arrange for another cpu to wake it up, and have that other cpu
>wake it up before it has enabled its mwait trigger, meaning that it will
>idle for an arbitrary length of time in mwait. Realistically, the cpu
>will be woken up by the time calibration rendezvous once a second, and
>possibly by the watchdog NMI every half second.
Which is an awfully long period of time... Looking forward to see
further details on this.
>For the new mechanism, I propose that HPET interrupts get a
>direct_apic_vector and completely bypass the IRQ mechanism. This gives
>the HPET interrupts guaranteed higher priority than all guest
>interrupts. When a cpu wishes to idle, tries to find an HPET. If there
>is a free HPET, the cpu becomes the owner of the HPET. It sets the HPET
>up to interrupt itself at some point in the future and goes to sleep.
I agree - it should have been done this way from the beginning.
>If there is not a free HPET, a cpu will need to share with another cpu.
>If this cpu can find another HPET which will fire at an appropriate
>time, the cpu can merely ask for it to be woken up by the HPET owner
>when the owner wakes up. If all the HPETs are programmed to fire a
>sufficient time into the future, one needs to be shortened. The cpu
>should choose the soonest HPET, add itself to the owner's list of other
>pcpus to wake, and reprogram the HPET to fire sooner. It should not
>reprogram the HPET to point to itself.
I think blindly looking for the one with the closest wakeup is not ideal:
For one, on huge systems this requires you to scan through too many
other CPUs. And taking NUMA aspects into consideration here would
seem at the very least desirable too (i.e. prefer sharing with a CPU
close to the one looking for a "partner").
>The final requirement makes it far far easier to validate the
>correctness of the correctness of the fix, and in particular that
>interrupts are arriving at the expected cpu. Given a validated solution
>proved to work, it might be possible to relax the requirement, so long
>as a reasonable solution to waking up the original owner is found (and I
>can't offhand think of a neat way of doing this, as ownership could move
>around arbitrarily).
Perhaps if this ownership change would be restricted (in that only the
owner itself can transfer [or give up] ownership), there wouldn't be
much of a problem: Since it's always the owner that gets woken, it would
simply need to re-establish a suitable new timeout. Once it's the owner's
turn, it would transfer ownership (or mark the channel unowned). Since
the interrupts wouldn't be subject to (normal) IRQ migration anymore,
there shouldn't then also be any false wakeups (which otherwise could
introduce races).
Jan
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Xen HPET improvement proposal
2013-10-25 12:47 ` Jan Beulich
@ 2013-10-25 13:11 ` Andrew Cooper
2013-10-28 11:20 ` Jan Beulich
0 siblings, 1 reply; 5+ messages in thread
From: Andrew Cooper @ 2013-10-25 13:11 UTC (permalink / raw)
To: Jan Beulich; +Cc: tim, keir, xen-devel
On 25/10/13 13:47, Jan Beulich wrote:
>
>> Independently of the HPET issues themselves, I have identified a race
>> condition in the mwait-idle routines where a cpu which is preparing to
>> sleep can arrange for another cpu to wake it up, and have that other cpu
>> wake it up before it has enabled its mwait trigger, meaning that it will
>> idle for an arbitrary length of time in mwait. Realistically, the cpu
>> will be woken up by the time calibration rendezvous once a second, and
>> possibly by the watchdog NMI every half second.
> Which is an awfully long period of time... Looking forward to see
> further details on this.
The fix is fairly simple. The mwait code must set up the trigger on its
mwait region before arranging to be woken up. That way, if the other
cpu does wake up (early perhaps), it will activate the trigger, and we
will bounce straight back out of mwait rather than sleeping indefinitely.
Currently, there is a window between arranging to be woken up and
activating the mwait trigger where the other cpu might have already
written to the mwait region.
>> If there is not a free HPET, a cpu will need to share with another cpu.
>> If this cpu can find another HPET which will fire at an appropriate
>> time, the cpu can merely ask for it to be woken up by the HPET owner
>> when the owner wakes up. If all the HPETs are programmed to fire a
>> sufficient time into the future, one needs to be shortened. The cpu
>> should choose the soonest HPET, add itself to the owner's list of other
>> pcpus to wake, and reprogram the HPET to fire sooner. It should not
>> reprogram the HPET to point to itself.
> I think blindly looking for the one with the closest wakeup is not ideal:
> For one, on huge systems this requires you to scan through too many
> other CPUs. And taking NUMA aspects into consideration here would
> seem at the very least desirable too (i.e. prefer sharing with a CPU
> close to the one looking for a "partner").
I was actually thinking of just searching through the HPETs. There are
typically far fewer hpet channels than cpus (the most hpet channels I
have encountered in our test lab is 8). There is also a possibility of
maintaining some form of priority-structure, so the next-to-fire HPET is
trivial to identify. (My concern here is of the overhead with
maintaining the priority structure).
I see your point about NUMA, and shall consider it as I am developing
the code (although I might end up with v1 doing the dumb thing first,
before turning towards NUMA optimisation). The NUMA aspect plays the
other way round as well, with the (usually single) HPET being on the
southbridge/pch, therefore likely hanging off numa node 0.
~Andrew
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Xen HPET improvement proposal
2013-10-25 13:11 ` Andrew Cooper
@ 2013-10-28 11:20 ` Jan Beulich
2013-10-28 11:30 ` Jan Beulich
0 siblings, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2013-10-28 11:20 UTC (permalink / raw)
To: Andrew Cooper; +Cc: tim, keir, xen-devel
>>> On 25.10.13 at 15:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 25/10/13 13:47, Jan Beulich wrote:
>>
>>> Independently of the HPET issues themselves, I have identified a race
>>> condition in the mwait-idle routines where a cpu which is preparing to
>>> sleep can arrange for another cpu to wake it up, and have that other cpu
>>> wake it up before it has enabled its mwait trigger, meaning that it will
>>> idle for an arbitrary length of time in mwait. Realistically, the cpu
>>> will be woken up by the time calibration rendezvous once a second, and
>>> possibly by the watchdog NMI every half second.
>> Which is an awfully long period of time... Looking forward to see
>> further details on this.
>
> The fix is fairly simple. The mwait code must set up the trigger on its
> mwait region before arranging to be woken up. That way, if the other
> cpu does wake up (early perhaps), it will activate the trigger, and we
> will bounce straight back out of mwait rather than sleeping indefinitely.
Actually I think rather than setting the trigger earlier (i.e. moving
the invocation of __monitor() out of mwait_idle_with_hints()) we
should rather check for an already occurred wakeup right after
having invoked __monitor(). Since the boolean-ness of the variable
pointed to by mwait_wakeup() is currently unused, we could easily
use that one. Patch in the works...
Jan
Jan
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Xen HPET improvement proposal
2013-10-28 11:20 ` Jan Beulich
@ 2013-10-28 11:30 ` Jan Beulich
0 siblings, 0 replies; 5+ messages in thread
From: Jan Beulich @ 2013-10-28 11:30 UTC (permalink / raw)
To: Andrew Cooper; +Cc: tim, keir, xen-devel
>>> On 28.10.13 at 12:20, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 25.10.13 at 15:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 25/10/13 13:47, Jan Beulich wrote:
>>>
>>>> Independently of the HPET issues themselves, I have identified a race
>>>> condition in the mwait-idle routines where a cpu which is preparing to
>>>> sleep can arrange for another cpu to wake it up, and have that other cpu
>>>> wake it up before it has enabled its mwait trigger, meaning that it will
>>>> idle for an arbitrary length of time in mwait. Realistically, the cpu
>>>> will be woken up by the time calibration rendezvous once a second, and
>>>> possibly by the watchdog NMI every half second.
>>> Which is an awfully long period of time... Looking forward to see
>>> further details on this.
>>
>> The fix is fairly simple. The mwait code must set up the trigger on its
>> mwait region before arranging to be woken up. That way, if the other
>> cpu does wake up (early perhaps), it will activate the trigger, and we
>> will bounce straight back out of mwait rather than sleeping indefinitely.
>
> Actually I think rather than setting the trigger earlier (i.e. moving
> the invocation of __monitor() out of mwait_idle_with_hints()) we
> should rather check for an already occurred wakeup right after
> having invoked __monitor(). Since the boolean-ness of the variable
> pointed to by mwait_wakeup() is currently unused, we could easily
> use that one. Patch in the works...
But then again - are you sure there is a race? _After_ having called
__monitor(), mwait_idle_with_hints() checks another time that the
expiry time indeed still is in the future.
Jan
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-10-28 11:30 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-25 12:21 Xen HPET improvement proposal Andrew Cooper
2013-10-25 12:47 ` Jan Beulich
2013-10-25 13:11 ` Andrew Cooper
2013-10-28 11:20 ` Jan Beulich
2013-10-28 11:30 ` Jan Beulich
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).