* [Qemu-devel] Help on TLB Flush
@ 2015-02-12 14:35 Mark Burton
2015-02-12 14:45 ` Alexander Graf
0 siblings, 1 reply; 22+ messages in thread
From: Mark Burton @ 2015-02-12 14:35 UTC (permalink / raw)
To: Peter Maydell, qemu-devel, mttcg
TLB Flush:
We have spent a few days on this issue and still haven’t settled on the best path.
Our solution seems to work most of the time, but we still have some strange issues - so I want to check that what we are proposing has a chance of working.
Our plan is to allow all CPUs to continue. Potentially one CPU will want to write to the TLBs. Subsequent to the write, it requests a TLB flush. We are proposing to implement this by signalling all other CPUs to exit (and requesting that they flush before restarting). In other words, this would happen asynchronously.
This means there is a theoretical period of time when one CPU is writing to the TLBs while other CPUs are executing. Our belief is that this has to be handled by software anyway, and that it should not be an issue from QEMU’s point of view.
The alternative would be to force all other CPUs to exit before writing the TLBs - this is both expensive and very painful to organise (as we get into horrid deadlocks whichever way we turn)…
We’d appreciate some thoughts on this...
Cheers
Mark.
+44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77x 210
+33 (0)603762104
mark.burton
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 14:35 [Qemu-devel] Help on TLB Flush Mark Burton
@ 2015-02-12 14:45 ` Alexander Graf
2015-02-12 14:58 ` Peter Maydell
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Alexander Graf @ 2015-02-12 14:45 UTC (permalink / raw)
To: Mark Burton; +Cc: mttcg, Peter Maydell, qemu-devel
> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>
>
> TLB Flush:
>
> We have spent a few days on this issue, and still haven’t resolved the best path.
>
> Our solution seems to work, most of the time, but we still have some strange issues - so I want to check that what we are proposing has a chance of working.
>
>
> Our plan is to allow all CPU’s to continue. Potentially one CPU will want to write to the TLBs. Subsequent to the write, it requests a TLB Flush.
Local or global? For local TLB flushes you don't notify the other CPUs at all. For global ones, the semantics of the call usually dictate atomicity.
> We are proposing to implement this by signalling all other CPU’s to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.
For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.
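A minimal sketch of the counter scheme described above, using C11 atomics. The names FlushRequest, flush_request_init, vcpu_ack_flush and flush_request_done are made up for illustration and are not QEMU APIs; how the other vCPUs actually get kicked is left out.

```c
/* Sketch only: all names here are hypothetical, not QEMU APIs. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct FlushRequest {
    atomic_int acked;   /* incremented once by each kicked vCPU */
    int n_kicked;       /* how many vCPUs were asked to flush */
} FlushRequest;

/* Initiator: set up the payload before kicking the other vCPUs. */
void flush_request_init(FlushRequest *req, int n_kicked)
{
    assert(n_kicked >= 0);
    atomic_init(&req->acked, 0);
    req->n_kicked = n_kicked;
}

/* Kicked vCPU: call this after flushing the local TLB. */
void vcpu_ack_flush(FlushRequest *req)
{
    /* Release: the flush is visible before the ack is. */
    atomic_fetch_add_explicit(&req->acked, 1, memory_order_release);
}

/* Initiator: poll (or sleep and re-check) until this returns true. */
bool flush_request_done(FlushRequest *req)
{
    return atomic_load_explicit(&req->acked, memory_order_acquire)
           == req->n_kicked;
}
```

The initiator would call flush_request_init(), hand the same pointer to every kicked vCPU, and only continue once flush_request_done() returns true.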
FWIW TLBs are always CPU local. When there's a "global TLB flush" instruction, it pretty much does stall the CPU, notifies the others to also flush their TLBs, waits and then continues.
If this really does become a performance bottleneck (which I doubt, since almost nobody except x86 does global flushes), you can also do some nasty hacky tricks, such as (atomically) changing the valid bit in remote CPUs' TLB entries. But really only do this as a last resort if the clean version doesn't perform well.
Alex
> This means - there is a theoretical period of time when one CPU is writing to the TLBs while other CPU’s are executing. Our belief is that this has to be handled by software anyway, and this should not be an issue from Qemu’s point of view.
> The alternative would be to force all other CPU’s to exit before writing the TLB’s - this is both expensive and very painful to organise (as we get into horrid deadlocks whichever way we turn)…
>
> We’d appreciate some thoughts on this...
>
> Cheers
>
> Mark.
>
>
>
> +44 (0)20 7100 3485 x 210
> +33 (0)5 33 52 01 77x 210
>
> +33 (0)603762104
> mark.burton
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 14:45 ` Alexander Graf
@ 2015-02-12 14:58 ` Peter Maydell
2015-02-12 15:38 ` Alexander Graf
2015-02-12 15:01 ` Peter Maydell
2015-02-12 15:11 ` Mark Burton
2 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2015-02-12 14:58 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Mark Burton, qemu-devel
On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
> almost nobody except x86 does global flushes
All ARM TLB maintenance operations have both "this CPU only"
and "all TLBs in the Inner Shareable domain" [that's ARM-speak
for "every CPU core in the cluster"] variants (the latter
being the TLB *IS operations). Looking at Linux's
arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
most of the operations defined there use the IS variants.
-- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 14:45 ` Alexander Graf
2015-02-12 14:58 ` Peter Maydell
@ 2015-02-12 15:01 ` Peter Maydell
2015-02-12 15:08 ` Mark Burton
2015-02-12 15:11 ` Mark Burton
2 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2015-02-12 15:01 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Mark Burton, qemu-devel
On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>
>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>> We are proposing to implement this by signalling all other CPU’s
>> to exit (and requesting they flush before re-starting). In other
>> words, this would happen asynchronously.
>
> For global flushes, give them a pointer payload along with the flush
> request and tell all cpus to increment it atomically. In your main
> thread, wait until *ptr == nKickedCpus.
I bet this will not be the only situation where you want to
do a "get all other CPUs to do $something and wait until they
have done so" kind of operation, so some lightweight but generic
infrastructure for doing that would not be a bad plan. (Similarly
"get all other CPUs to stop, then I can do $something and let
the others continue".)
-- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:01 ` Peter Maydell
@ 2015-02-12 15:08 ` Mark Burton
2015-02-12 15:19 ` Alexander Graf
2015-02-12 15:31 ` Dr. David Alan Gilbert
0 siblings, 2 replies; 22+ messages in thread
From: Mark Burton @ 2015-02-12 15:08 UTC (permalink / raw)
To: Peter Maydell; +Cc: mttcg, Alexander Graf, qemu-devel
> On 12 Feb 2015, at 16:01, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>
>>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>>> We are proposing to implement this by signalling all other CPU’s
>>> to exit (and requesting they flush before re-starting). In other
>>> words, this would happen asynchronously.
>>
>> For global flushes, give them a pointer payload along with the flush
>> request and tell all cpus to increment it atomically. In your main
>> thread, wait until *ptr == nKickedCpus.
>
> I bet this will not be the only situation where you want to
> do an "get all other CPUs to do $something and wait til they
> have done so" kind of operation, so some lightweight but generic
> infrastructure for doing that would not be a bad plan. (Similarly
> "get all other CPUs to stop, then I can do $something and let
> the others continue”.)
We tried this - we ended up in knots.
We had 2 CPUs trying to flush at about the same time, both waiting for the other.
We had CPUs trying to get the global mutex to finish what they were doing, while being told to flush.
We had CPUs holding the global mutex trying to do something that would cause a flush… etc.
We had spaghetti with extra Bolognese sauce…
We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier.
e.g. - asking all CPUs to “exit and do something” is easy - waiting for them to do that is a whole other problem…
Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPUs to flush themselves asynchronously?
Cheers
Mark.
>
> -- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 14:45 ` Alexander Graf
2015-02-12 14:58 ` Peter Maydell
2015-02-12 15:01 ` Peter Maydell
@ 2015-02-12 15:11 ` Mark Burton
2 siblings, 0 replies; 22+ messages in thread
From: Mark Burton @ 2015-02-12 15:11 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Peter Maydell, qemu-devel
OK - Alex - your implication is that it has to be atomic, we need the sync…
:-(
I have a horrid feeling that the atomicity of global flush can’t be causing the (almost, but not quite reproducible) errors we’re seeing - but… anyway ;-)
Cheers
Mark.
> On 12 Feb 2015, at 15:45, Alexander Graf <agraf@suse.de> wrote:
>
>
>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>>
>>
>> TLB Flush:
>>
>> We have spent a few days on this issue, and still haven’t resolved the best path.
>>
>> Our solution seems to work, most of the time, but we still have some strange issues - so I want to check that what we are proposing has a chance of working.
>>
>>
>> Our plan is to allow all CPU’s to continue. Potentially one CPU will want to write to the TLBs. Subsequent to the write, it requests a TLB Flush.
>
> Local or global? For local TLB flushes you don't notify the other CPUs at all. For global ones, the semantics of the call usually dictate atomicity.
>
>> We are proposing to implement this by signalling all other CPU’s to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.
>
> For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.
>
> FWIW TLBs are always CPU local. When there's a "global TLB flush" instruction, it pretty much does stall the CPU, notifies the others to also flush their TLBs, waits and then continues.
>
> If this really does become a performance bottleneck (which I doubt it does, almost nobody except x86 does global flushes), you can also do some nasty hacky tricks, such as (atomically) change the valid bit in remote CPUs TLB entries. But really only do this as a last resort if the clean version doesn't perform well.
>
>
> Alex
>
>> This means - there is a theoretical period of time when one CPU is writing to the TLBs while other CPU’s are executing. Our belief is that this has to be handled by software anyway, and this should not be an issue from Qemu’s point of view.
>> The alternative would be to force all other CPU’s to exit before writing the TLB’s - this is both expensive and very painful to organise (as we get into horrid deadlocks whichever way we turn)…
>>
>> We’d appreciate some thoughts on this...
>>
>> Cheers
>>
>> Mark.
>>
>>
>>
>> +44 (0)20 7100 3485 x 210
>> +33 (0)5 33 52 01 77x 210
>>
>> +33 (0)603762104
>> mark.burton
>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:08 ` Mark Burton
@ 2015-02-12 15:19 ` Alexander Graf
2015-02-12 21:57 ` Peter Maydell
2015-02-12 15:31 ` Dr. David Alan Gilbert
1 sibling, 1 reply; 22+ messages in thread
From: Alexander Graf @ 2015-02-12 15:19 UTC (permalink / raw)
To: Mark Burton, Peter Maydell; +Cc: mttcg, qemu-devel
On 12.02.15 16:08, Mark Burton wrote:
>
>> On 12 Feb 2015, at 16:01, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>>
>>>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>>>> We are proposing to implement this by signalling all other CPU’s
>>>> to exit (and requesting they flush before re-starting). In other
>>>> words, this would happen asynchronously.
>>>
>>> For global flushes, give them a pointer payload along with the flush
>>> request and tell all cpus to increment it atomically. In your main
>>> thread, wait until *ptr == nKickedCpus.
>>
>> I bet this will not be the only situation where you want to
>> do an "get all other CPUs to do $something and wait til they
>> have done so" kind of operation, so some lightweight but generic
>> infrastructure for doing that would not be a bad plan. (Similarly
>> "get all other CPUs to stop, then I can do $something and let
>> the others continue”.)
>
> We tried this - we ended up in knots.
> We had 2 CPU’s trying to flush at about the same time, both waiting for the other.
> We had CPU’s trying to get the global mutex to finish what they were doing, while being told to flush,
> We had CPU’s in the global mutex trying to do something that would cause a flush… etc....
> We had spaghetti with extra Bolognese sauce…
>
> We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier.
> e.g. - ask all CPU’s to “exit and do something” is easy - wait for them to do that is a whole other problem…
>
> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPU’s to flush themselves asynchronously….
The respective target architecture specs will tell you. And I very much
doubt that it is ok in most cases.
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:08 ` Mark Burton
2015-02-12 15:19 ` Alexander Graf
@ 2015-02-12 15:31 ` Dr. David Alan Gilbert
2015-02-12 18:44 ` Mark Burton
1 sibling, 1 reply; 22+ messages in thread
From: Dr. David Alan Gilbert @ 2015-02-12 15:31 UTC (permalink / raw)
To: Mark Burton; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
* Mark Burton (mark.burton@greensocs.com) wrote:
>
> > On 12 Feb 2015, at 16:01, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
> >>
> >>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
> >>> We are proposing to implement this by signalling all other CPU’s
> >>> to exit (and requesting they flush before re-starting). In other
> >>> words, this would happen asynchronously.
> >>
> >> For global flushes, give them a pointer payload along with the flush
> >> request and tell all cpus to increment it atomically. In your main
> >> thread, wait until *ptr == nKickedCpus.
> >
> > I bet this will not be the only situation where you want to
> > do an "get all other CPUs to do $something and wait til they
> > have done so" kind of operation, so some lightweight but generic
> > infrastructure for doing that would not be a bad plan. (Similarly
> > "get all other CPUs to stop, then I can do $something and let
> > the others continue”.)
>
> We tried this - we ended up in knots.
> We had 2 CPU’s trying to flush at about the same time, both waiting for the other.
> We had CPU’s trying to get the global mutex to finish what they were doing, while being told to flush,
> We had CPU’s in the global mutex trying to do something that would cause a flush… etc....
> We had spaghetti with extra Bolognese sauce…
This is the hard problem of multithreaded emulation.
You've always got to let CPUs get back to a point where you can
invalidate a mapping/page quickly.
Thus you've also got to be very careful about where any CPU might
get into a loop or take another lock that would stop another CPU
causing an invalidate. Either that or you need a way of somehow
breaking locks or recovering from the situation.
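One way to avoid the "two CPUs flushing at the same time, both waiting for the other" case is for a waiting initiator to keep servicing flush requests addressed to itself. The sketch below is an assumption about how that could look, not something proposed in this thread; FlushCPU, service_requests, request_global_flush and wait_step are hypothetical names, and the local flush is reduced to a counter.

```c
/* Sketch only: hypothetical names, not QEMU code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define N_FLUSH_CPUS 2

typedef struct FlushCPU {
    atomic_uint requested_by; /* bit i set: CPU i asked us to flush */
    atomic_int acks_missing;  /* acks this CPU still needs as initiator */
    int flushes;              /* local flush count, touched by owner only */
} FlushCPU;

static FlushCPU fcpus[N_FLUSH_CPUS];

/* Perform one local flush and acknowledge every initiator that asked. */
void service_requests(int self)
{
    unsigned mask = atomic_exchange(&fcpus[self].requested_by, 0u);
    if (mask == 0) {
        return;
    }
    fcpus[self].flushes++;  /* stand-in for the real local TLB flush */
    for (int i = 0; i < N_FLUSH_CPUS; i++) {
        if (mask & (1u << i)) {
            atomic_fetch_sub(&fcpus[i].acks_missing, 1);
        }
    }
}

/* Initiator: flush locally and post a request to every other CPU. */
void request_global_flush(int self)
{
    assert(self >= 0 && self < N_FLUSH_CPUS);
    atomic_store(&fcpus[self].acks_missing, N_FLUSH_CPUS - 1);
    fcpus[self].flushes++;
    for (int i = 0; i < N_FLUSH_CPUS; i++) {
        if (i != self) {
            atomic_fetch_or(&fcpus[i].requested_by, 1u << self);
        }
    }
}

/* One iteration of the initiator's wait loop: never block without
 * first serving requests aimed at ourselves. Returns true when done. */
bool wait_step(int self)
{
    service_requests(self);
    return atomic_load(&fcpus[self].acks_missing) == 0;
}
```

Because each pass through the wait loop first serves incoming requests, two CPUs that start a global flush at the same moment both make progress instead of waiting on each other.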
> We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier.
> e.g. - ask all CPU’s to “exit and do something” is easy - wait for them to do that is a whole other problem…
Which is why you've got to bound how long it might take
those CPUs to get back to you, and optimise out cases where
it's not really needed later.
> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPU’s to flush themselves asynchronously….
Always assume the worst.
Dave
>
> Cheers
>
> Mark.
>
>
>
> >
> > -- PMM
>
>
> +44 (0)20 7100 3485 x 210
> +33 (0)5 33 52 01 77x 210
>
> +33 (0)603762104
> mark.burton
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 14:58 ` Peter Maydell
@ 2015-02-12 15:38 ` Alexander Graf
2015-02-12 16:02 ` Mark Burton
2015-02-12 22:02 ` Peter Maydell
0 siblings, 2 replies; 22+ messages in thread
From: Alexander Graf @ 2015-02-12 15:38 UTC (permalink / raw)
To: Peter Maydell; +Cc: mttcg, Mark Burton, qemu-devel
On 12.02.15 15:58, Peter Maydell wrote:
> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>> almost nobody except x86 does global flushes
>
> All ARM TLB maintenance operations have both "this CPU only"
> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
> for "every CPU core in the cluster"] variants (the latter
> being the TLB *IS operations). Looking at Linux's
> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
> most of the operations defined there use the IS variants.
Wow, did anyone benchmark this? I know that PPC switched away from
global flushes and instead tracks the CPUs a task was running on to
limit the scope of CPUs that need to flush.
Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:38 ` Alexander Graf
@ 2015-02-12 16:02 ` Mark Burton
2015-02-12 22:10 ` Lluís Vilanova
2015-02-12 22:02 ` Peter Maydell
1 sibling, 1 reply; 22+ messages in thread
From: Mark Burton @ 2015-02-12 16:02 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Peter Maydell, qemu-devel
> On 12 Feb 2015, at 16:38, Alexander Graf <agraf@suse.de> wrote:
>
>
>
> On 12.02.15 15:58, Peter Maydell wrote:
>> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>> almost nobody except x86 does global flushes
>>
>> All ARM TLB maintenance operations have both "this CPU only"
>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>> for "every CPU core in the cluster"] variants (the latter
>> being the TLB *IS operations). Looking at Linux's
>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>> most of the operations defined there use the IS variants.
>
> Wow, did anyone benchmark this? I know that PPC switched away from
> global flushes and instead tracks the CPUs a task was running on to
> limit the scope of CPUs that need to flush.
Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too?
Cheers
Mark.
>
>
> Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:31 ` Dr. David Alan Gilbert
@ 2015-02-12 18:44 ` Mark Burton
0 siblings, 0 replies; 22+ messages in thread
From: Mark Burton @ 2015-02-12 18:44 UTC (permalink / raw)
To: Dr. David Alan Gilbert; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
> On 12 Feb 2015, at 16:31, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Mark Burton (mark.burton@greensocs.com) wrote:
>>
>>> On 12 Feb 2015, at 16:01, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>
>>> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>>>
>>>>> On 12.02.2015, at 15:35, Mark Burton <mark.burton@greensocs.com> wrote:
>>>>> We are proposing to implement this by signalling all other CPU’s
>>>>> to exit (and requesting they flush before re-starting). In other
>>>>> words, this would happen asynchronously.
>>>>
>>>> For global flushes, give them a pointer payload along with the flush
>>>> request and tell all cpus to increment it atomically. In your main
>>>> thread, wait until *ptr == nKickedCpus.
>>>
>>> I bet this will not be the only situation where you want to
>>> do an "get all other CPUs to do $something and wait til they
>>> have done so" kind of operation, so some lightweight but generic
>>> infrastructure for doing that would not be a bad plan. (Similarly
>>> "get all other CPUs to stop, then I can do $something and let
>>> the others continue”.)
>>
>> We tried this - we ended up in knots.
>> We had 2 CPU’s trying to flush at about the same time, both waiting for the other.
>> We had CPU’s trying to get the global mutex to finish what they were doing, while being told to flush,
>> We had CPU’s in the global mutex trying to do something that would cause a flush… etc....
>> We had spaghetti with extra Bolognese sauce…
>
> This is the hard problem of multithreaded emulation.
> You've always got to let CPUs get back to a point where you can
> invalidate a mapping/page quickly.
>
> Thus you've also got to be very careful about where any CPU might
> get into a loop or take another lock that would stop another CPU
> causing an invalidate. Either that or you need a way of somehow
> breaking locks or recovering from the situation.
Indeed -
for now - we’re building something which will likely be less than ideal. Once we have some evidence that it works, and (hopefully) more reliably than the approach we have right now, we can come up with a more elegant scheme.
>
>> We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier.
>> e.g. - ask all CPU’s to “exit and do something” is easy - wait for them to do that is a whole other problem…
>
> Which is why you've got to bound how long it might take
> those CPUs to get back to you, and optimise out cases where
> it's not really needed later.
>
>> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPU’s to flush themselves asynchronously….
>
> Always assume the worst.
:-)
Cheers
Mark.
>
> Dave
>
>>
>> Cheers
>>
>> Mark.
>>
>>
>>
>>>
>>> -- PMM
>>
>>
>> +44 (0)20 7100 3485 x 210
>> +33 (0)5 33 52 01 77x 210
>>
>> +33 (0)603762104
>> mark.burton
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:19 ` Alexander Graf
@ 2015-02-12 21:57 ` Peter Maydell
2015-02-13 9:34 ` Paolo Bonzini
0 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2015-02-12 21:57 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Mark Burton, qemu-devel
On 12 February 2015 at 15:19, Alexander Graf <agraf@suse.de> wrote:
> On 12.02.15 16:08, Mark Burton wrote:
>> Our question is - do we need this ‘sync’ (before the flush),
>> or can we actually allow CPU’s to flush themselves asynchronously….
>
> The respective target architecture specs will tell you. And I very much
> doubt that it is ok in most cases.
For ARM note that TLB maintenance operations do not have to
complete synchronously. They can be reordered relative to other
TLB maintenance ops or to loads or stores (by this CPU or
by other CPUs if this is a global invalidate). The only
requirement is that if the CPU that did the TLB maintenance
op executes a DSB (barrier) then the TLB op must finish
before the barrier completes execution. So you could split
the "kick off TLB invalidate" and "make sure all CPUs
are done" phases if you wanted. [cf v8 ARM ARM rev A.e
section D4.7.2 and in particular the subsection on
"ordering and completion".]
This only applies to ARM guests, of course. ("Other CPU
architectures are available." :-))
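The split between "kick off the invalidate" and "make sure all CPUs are done" could be sketched roughly as follows. This is only an illustration under stated assumptions: SplitCPU, tlbi_broadcast_async, vcpu_safe_point and barrier_may_complete are invented names, the flush itself is reduced to a generation counter, and a single global in-flight counter is used, which makes a barrier wait for everyone's outstanding flushes (conservative, but simple).

```c
/* Sketch only: hypothetical names, not QEMU or ARM-defined interfaces. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define N_SPLIT_CPUS 4

typedef struct SplitCPU {
    atomic_bool flush_requested; /* set by a remote invalidate */
    int tlb_generation;          /* bumped on each local flush */
} SplitCPU;

static SplitCPU scpus[N_SPLIT_CPUS];
static atomic_int flushes_in_flight; /* requested but not yet performed */

/* Phase 1: kick off the invalidate and return immediately. */
void tlbi_broadcast_async(int self)
{
    assert(self >= 0 && self < N_SPLIT_CPUS);
    scpus[self].tlb_generation++;    /* local flush happens right away */
    for (int i = 0; i < N_SPLIT_CPUS; i++) {
        if (i == self) {
            continue;
        }
        /* Count first, so the counter never under-reports pending work. */
        atomic_fetch_add(&flushes_in_flight, 1);
        if (atomic_exchange(&scpus[i].flush_requested, true)) {
            /* A flush was already pending there; it covers this one too. */
            atomic_fetch_sub(&flushes_in_flight, 1);
        }
    }
}

/* Remote vCPU: run at a safe point, e.g. between translated blocks. */
void vcpu_safe_point(int self)
{
    if (atomic_exchange(&scpus[self].flush_requested, false)) {
        scpus[self].tlb_generation++;
        atomic_fetch_sub(&flushes_in_flight, 1);
    }
}

/* Phase 2: the guest barrier may only complete once this is true. */
bool barrier_may_complete(void)
{
    return atomic_load(&flushes_in_flight) == 0;
}
```

With this shape the invalidate itself never blocks; only the emulated barrier instruction has to wait, which moves the synchronisation point rather than removing it.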
-- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 15:38 ` Alexander Graf
2015-02-12 16:02 ` Mark Burton
@ 2015-02-12 22:02 ` Peter Maydell
1 sibling, 0 replies; 22+ messages in thread
From: Peter Maydell @ 2015-02-12 22:02 UTC (permalink / raw)
To: Alexander Graf; +Cc: mttcg, Mark Burton, qemu-devel
On 12 February 2015 at 15:38, Alexander Graf <agraf@suse.de> wrote:
> On 12.02.15 15:58, Peter Maydell wrote:
>> All ARM TLB maintenance operations have both "this CPU only"
>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>> for "every CPU core in the cluster"] variants (the latter
>> being the TLB *IS operations). Looking at Linux's
>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>> most of the operations defined there use the IS variants.
>
> Wow, did anyone benchmark this? I know that PPC switched away from
> global flushes and instead tracks the CPUs a task was running on to
> limit the scope of CPUs that need to flush.
That would be a valid implementation. The CPU has to behave
as the spec says it must, but there's no reason you couldn't
implement "flush by ASID for all TLBs" via some implementation
specific tracking of ASID use per CPU to limit which cores
you sent the flush request to, if you thought that was a
better way to do it.
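As a rough illustration of such tracking, the sketch below keeps one bitmask per ASID recording which CPUs may hold entries for it. Everything here (asid_users, note_asid_in_use, flush_targets_for_asid, the 8-CPU and 256-ASID limits) is an assumption made for the example; it is single-threaded, so a multithreaded version would need atomics or a lock around the table.

```c
/* Sketch only: hypothetical bookkeeping, not taken from QEMU. */
#include <assert.h>
#include <stdint.h>

#define N_TRACKED_CPUS 8
#define N_ASIDS        256

/* Bit c of asid_users[a] is set if CPU c may hold TLB entries for ASID a. */
static uint32_t asid_users[N_ASIDS];

/* Call whenever a CPU starts running with the given ASID. */
void note_asid_in_use(int cpu, uint8_t asid)
{
    assert(cpu >= 0 && cpu < N_TRACKED_CPUS);
    asid_users[asid] |= UINT32_C(1) << cpu;
}

/* Returns the set of CPUs that must receive a "flush by ASID" request.
 * The mask is never cleared here, so it can only over-approximate,
 * which is safe: an unnecessary flush costs time but not correctness. */
uint32_t flush_targets_for_asid(uint8_t asid)
{
    return asid_users[asid];
}
```

A CPU that never ran with the ASID is simply absent from the mask and is not kicked at all.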
-- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 16:02 ` Mark Burton
@ 2015-02-12 22:10 ` Lluís Vilanova
2015-02-13 7:16 ` Mark Burton
0 siblings, 1 reply; 22+ messages in thread
From: Lluís Vilanova @ 2015-02-12 22:10 UTC (permalink / raw)
To: Mark Burton; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
Mark Burton writes:
>> On 12 Feb 2015, at 16:38, Alexander Graf <agraf@suse.de> wrote:
>>
>>
>>
>> On 12.02.15 15:58, Peter Maydell wrote:
>>> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>>> almost nobody except x86 does global flushes
>>>
>>> All ARM TLB maintenance operations have both "this CPU only"
>>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>>> for "every CPU core in the cluster"] variants (the latter
>>> being the TLB *IS operations). Looking at Linux's
>>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>>> most of the operations defined there use the IS variants.
>>
>> Wow, did anyone benchmark this? I know that PPC switched away from
>> global flushes and instead tracks the CPUs a task was running on to
>> limit the scope of CPUs that need to flush.
> Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too?
Yup. AFAIR, Linux on x86-64 queues a request to a per-CPU request list, and uses
IPIs to signal these types of operations to the target CPU:
http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386
Waiting for completion is implemented on top by incrementing some counter from
each CPU, and waiting for it to have the correct final value.
If something were implemented along these lines, it could be used as a generic
cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU
structure that TCG would check to break away from guest code; I believe
something similar is already being used - icount? -).
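A sketch of what such a per-CPU request list with a completion counter might look like. All names (VCpu, WorkItem, queue_work, drain_work, exit_request, toy_flush) are hypothetical, the queue is a small fixed-size ring guarded by a spinlock, and the "IPI" is reduced to setting a flag that the execution loop is assumed to poll.

```c
/* Sketch only: hypothetical names, not the Linux or QEMU implementation. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define N_VCPUS   4
#define QUEUE_LEN 8

typedef void (*WorkFn)(int cpu, void *arg);

typedef struct WorkItem {
    WorkFn fn;
    void *arg;
    atomic_int *done;          /* completion counter owned by the sender */
} WorkItem;

typedef struct VCpu {
    atomic_flag lock;          /* protects queue, head and count */
    WorkItem queue[QUEUE_LEN];
    int head, count;
    atomic_bool exit_request;  /* polled by the execution loop */
} VCpu;

static VCpu vcpus[N_VCPUS];

void vcpu_lock(VCpu *c)   { while (atomic_flag_test_and_set(&c->lock)) { } }
void vcpu_unlock(VCpu *c) { atomic_flag_clear(&c->lock); }

void vcpus_init(void)
{
    for (int i = 0; i < N_VCPUS; i++) {
        atomic_flag_clear(&vcpus[i].lock);
        vcpus[i].head = vcpus[i].count = 0;
        atomic_init(&vcpus[i].exit_request, false);
    }
}

/* Sender: queue work for one vCPU and ask it to leave guest code. */
bool queue_work(int target, WorkFn fn, void *arg, atomic_int *done)
{
    VCpu *c = &vcpus[target];
    bool ok = false;
    assert(fn != 0);
    vcpu_lock(c);
    if (c->count < QUEUE_LEN) {
        c->queue[(c->head + c->count) % QUEUE_LEN] = (WorkItem){ fn, arg, done };
        c->count++;
        ok = true;
    }
    vcpu_unlock(c);
    if (ok) {
        atomic_store(&c->exit_request, true);
    }
    return ok;
}

/* Target: called from the vCPU's own loop once exit_request is seen. */
void drain_work(int self)
{
    VCpu *c = &vcpus[self];
    atomic_store(&c->exit_request, false);
    for (;;) {
        WorkItem item;
        vcpu_lock(c);
        if (c->count == 0) {
            vcpu_unlock(c);
            break;
        }
        item = c->queue[c->head];
        c->head = (c->head + 1) % QUEUE_LEN;
        c->count--;
        vcpu_unlock(c);
        item.fn(self, item.arg);            /* run outside the lock */
        if (item.done) {
            atomic_fetch_add(item.done, 1); /* report completion */
        }
    }
}

/* Example work item: stand-in for a local TLB flush. */
void toy_flush(int cpu, void *arg)
{
    int *flush_counts = arg;
    flush_counts[cpu]++;
}
```

The sender waits by comparing its counter against the number of items it queued, so the same mechanism serves TLB flushes and any other "do $something on CPU X" request.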
PS: To be honest, I still don't know which TLBs we're talking about here, and
which cases trigger these TLB flush operations.
Cheers,
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 22:10 ` Lluís Vilanova
@ 2015-02-13 7:16 ` Mark Burton
2015-02-13 7:24 ` Peter Maydell
0 siblings, 1 reply; 22+ messages in thread
From: Mark Burton @ 2015-02-13 7:16 UTC (permalink / raw)
To: Lluís Vilanova; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
Up top - thanks Peter, I think you may have given us an idea!
> On 12 Feb 2015, at 23:10, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>
> Mark Burton writes:
>
>>> On 12 Feb 2015, at 16:38, Alexander Graf <agraf@suse.de> wrote:
>>>
>>>
>>>
>>> On 12.02.15 15:58, Peter Maydell wrote:
>>>> On 12 February 2015 at 14:45, Alexander Graf <agraf@suse.de> wrote:
>>>>> almost nobody except x86 does global flushes
>>>>
>>>> All ARM TLB maintenance operations have both "this CPU only"
>>>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>>>> for "every CPU core in the cluster"] variants (the latter
>>>> being the TLB *IS operations). Looking at Linux's
>>>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>>>> most of the operations defined there use the IS variants.
>>>
>>> Wow, did anyone benchmark this? I know that PPC switched away from
>>> global flushes and instead tracks the CPUs a task was running on to
>>> limit the scope of CPUs that need to flush.
>
>> Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too?
>
> Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and uses
> IPIs to signal these types of operations to the target CPU:
>
> http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386
>
> Waiting for completion is implemented on top by incrementing some counter from
> each CPU, and waiting for it to have the correct final value.
If the kernel is doing this, then effectively, for x86, each CPU only flushes its own TLB (from the perspective of QEMU) - correct?
(In which case, for QEMU itself on x86, we don’t need to implement a global flush, and hence we don’t need to build the mechanism to sync?)
If I understand correctly, then the processor that causes some pain is ARM, which has (and uses) global flushes, but the mitigating factor is that those flushes can be asynchronous so long as they complete before a memory barrier…
Cheers
Mark.
>
> If something were implemented on these lines, it could be used as a generic
> cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU
> structure that TCG would check to break away from guest code; I believe
> something similar is already being used - icount? -).
>
> PS: To be honest, I still don't know which TLBs we're talking about here, and
> which cases trigger these TLB flush operations.
>
>
> Cheers,
> Lluis
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 7:16 ` Mark Burton
@ 2015-02-13 7:24 ` Peter Maydell
2015-02-13 7:37 ` Mark Burton
0 siblings, 1 reply; 22+ messages in thread
From: Peter Maydell @ 2015-02-13 7:24 UTC (permalink / raw)
To: Mark Burton; +Cc: qemu-devel, mttcg, Lluís Vilanova, Alexander Graf
On 13 February 2015 at 07:16, Mark Burton <mark.burton@greensocs.com> wrote:
> If the kernel is doing this - then effectively - for X86, each CPU only
> flush’s it’s own TLB (from the perspective of Qemu) - correct?
> (in which case, for Qemu itself - for x86) - we dont need to implement
> a global flush, and hence we dont need to build the mechanism to sync ?
The semantics you need are "flush the QEMU TLB for CPU X" (where
X may not be the CPU you're running on). This is what tlb_flush()
does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
We then use that to implement the target's required semantics
(eg in ARM the tlbiall_is_write() function is handled by iterating
through all CPUs and calling tlb_flush on them).
If you don't want the pain of checking the semantics of every
backend and figuring out a new set of primitives to implement,
then what you need to do is continue to provide the guarantees
the current tlb_flush function does: when it returns then the
CPU it's supposed to have acted on has definitely done so.
You can try and be cleverer if you want to, but personally
I would recommend keeping the scope of your work simple
where you can.
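
As an illustration of the semantics described above, here is a minimal sketch in the form of a self-contained toy model. It is not QEMU's code: ToyCPU, toy_tlb_flush() and toy_tlb_flush_all_cpus() are hypothetical names, and the TLB is reduced to an array of valid bits. Only the shape is taken from the text: a per-CPU flush primitive that has finished by the time it returns, and a target-level "flush everything" built by iterating over all CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_TLB_SIZE 8   /* assumption: tiny TLB, for illustration only */
#define TOY_NUM_CPUS 4   /* assumption: fixed number of CPUs */

/* Hypothetical stand-in for a CPU and its software TLB. */
typedef struct ToyCPU {
    bool tlb_valid[TOY_TLB_SIZE];
} ToyCPU;

static ToyCPU toy_cpus[TOY_NUM_CPUS];

/* Mark every entry valid so that a flush has something to do. */
static void toy_tlb_fill(ToyCPU *cpu)
{
    assert(cpu != NULL);
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        cpu->tlb_valid[i] = true;
    }
}

/* Count the valid entries left in one CPU's TLB. */
static size_t toy_tlb_count_valid(const ToyCPU *cpu)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        n += cpu->tlb_valid[i] ? 1 : 0;
    }
    return n;
}

/*
 * Flush the TLB of one specific CPU, which need not be the caller's.
 * Guarantee modelled here: when this returns, that CPU's TLB is empty.
 */
static void toy_tlb_flush(ToyCPU *cpu)
{
    assert(cpu != NULL);
    memset(cpu->tlb_valid, 0, sizeof(cpu->tlb_valid));
}

/* Target-level "invalidate on all CPUs", built on the per-CPU primitive. */
static void toy_tlb_flush_all_cpus(void)
{
    for (size_t i = 0; i < TOY_NUM_CPUS; i++) {
        toy_tlb_flush(&toy_cpus[i]);
    }
}
```

In this single-threaded model the guarantee holds trivially. The difficulty discussed in this thread is keeping it once every CPU runs in its own thread.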
-- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 7:24 ` Peter Maydell
@ 2015-02-13 7:37 ` Mark Burton
2015-02-13 13:30 ` Lluís Vilanova
0 siblings, 1 reply; 22+ messages in thread
From: Mark Burton @ 2015-02-13 7:37 UTC (permalink / raw)
To: Peter Maydell; +Cc: qemu-devel, mttcg, Lluís Vilanova, Alexander Graf
> On 13 Feb 2015, at 08:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On 13 February 2015 at 07:16, Mark Burton <mark.burton@greensocs.com> wrote:
>> If the kernel is doing this - then effectively - for X86, each CPU only
>> flush’s it’s own TLB (from the perspective of Qemu) - correct?
>> (in which case, for Qemu itself - for x86) - we dont need to implement
>> a global flush, and hence we dont need to build the mechanism to sync ?
> The semantics you need are "flush the QEMU TLB for CPU X" (where
> X may not be the CPU you're running on). This is what tlb_flush()
> does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
> We then use that to implement the target's required semantics
> (eg in ARM the tlbiall_is_write() function is handled by iterating
> through all CPUs and calling tlb_flush on them).
What Lluis seemed to imply is that the kernel arranges to signal the CPU that should flush. Hence, for x86, we would only ever flush our own TLB.
>
> If you don't want the pain of checking the semantics of every
> backend and figuring out a new set of primitives to implement,
> then what you need to do is continue to provide the guarantees
> the current tlb_flush function does: when it returns then the
> CPU it's supposed to have acted on has definitely done so.
>
> You can try and be cleverer if you want to, but personally
> I would recommend keeping the scope of your work simple
> where you can.
Yes, though keeping it simple (silly) seems to have some complexities in this case, which is why we are trying to reduce the guarantees that tlb_flush() provides.
At present, the ‘foreach cpu, tlb_flush()’ is effectively atomic, as no other CPU will be executing at the same time.
Adding multi-threading, we can already say that this ‘atomicity’ isn’t strictly required. As you say, the only thing tlb_flush() needs to guarantee is that the CPU concerned has flushed.
That already helps, and I agree with you that this is the right direction to take tlb_flush().
Of course, when only the current CPU is flushed, things are much simpler (and already handled).
For our immediate concern, in the interests of getting the thing working and making sure we’ve turned over all the stones, on ARM it MAY help us to check that the flush has happened ‘in the next memory barrier’.
I don’t know if that will help us or not and, even if it does, I agree with you that it would be messier than it needs to be.
However, in the interests of making sure that there are no other issues, we may ‘hack’ something before we put a more elegant solution in place.
(Right now we have some mutex issues; shifting the sync to the barrier MAY help us avoid them. To be seen. In any case, it would only be a temporary fix.)
Cheers
Mark.
>
> -- PMM
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-12 21:57 ` Peter Maydell
@ 2015-02-13 9:34 ` Paolo Bonzini
2015-02-13 9:37 ` Mark Burton
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2015-02-13 9:34 UTC (permalink / raw)
To: Peter Maydell, Alexander Graf; +Cc: mttcg, Mark Burton, qemu-devel
On 12/02/2015 22:57, Peter Maydell wrote:
> The only
> requirement is that if the CPU that did the TLB maintenance
> op executes a DMB (barrier) then the TLB op must finish
> before the barrier completes execution. So you could split
> the "kick off TLB invalidate" and "make sure all CPUs
> are done" phases if you wanted. [cf v8 ARM ARM rev A.e
> section D4.7.2 and in particular the subsection on
> "ordering and completion".]
You can just make DMB start a new translation block. Then when the TLB
flush helpers call cpu_exit() or cpu_interrupt() the flush request is
serviced.
Paolo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 9:34 ` Paolo Bonzini
@ 2015-02-13 9:37 ` Mark Burton
2015-02-13 9:49 ` Paolo Bonzini
0 siblings, 1 reply; 22+ messages in thread
From: Mark Burton @ 2015-02-13 9:37 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
The memory barrier is on the CPU requesting the flush, isn’t it (not on the CPU that is being flushed)?
Cheers
Mark.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 9:37 ` Mark Burton
@ 2015-02-13 9:49 ` Paolo Bonzini
0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2015-02-13 9:49 UTC (permalink / raw)
To: Mark Burton; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
On 13/02/2015 10:37, Mark Burton wrote:
> the memory barrier is on the cpu requesting the flush isn’t it (not
> on the CPU that is being flushed)?
Oops, I misread Peter's explanation.
In that case, perhaps DMB can be treated in a similar way as WFI, using
cpu->halted. Queueing work on other CPUs can be done with
async_run_on_cpu, which exits the idle loop in qemu_tcg_wait_io_event
(this avoids the deadlocks). Checking that other CPUs have flushed the
TLBs can be done in cpu_has_work ("always return false if cpu->halted ==
true and there are outstanding TLB requests").
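
The idea above can be sketched as a single-threaded toy model. Everything in it is an assumption made for illustration: SketchCPU and the sk_* functions are hypothetical, no QEMU API is called, and the queue that async_run_on_cpu would provide is reduced to a per-CPU flag for each requester. It only shows the bookkeeping: the requester queues flushes without waiting, a DMB parks it while any are outstanding, and the has-work check keeps it parked until every target has flushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NUM_CPUS 3   /* assumption: fixed number of CPUs */

/* Hypothetical per-CPU state for the sketch. */
typedef struct SketchCPU {
    bool halted;                    /* parked inside a "DMB" */
    bool tlb_stale;                 /* TLB still holds old entries */
    int outstanding_flushes;        /* flushes requested on others, not yet done */
    bool pending_from[SK_NUM_CPUS]; /* flush queued here by requester r */
} SketchCPU;

static SketchCPU sk_cpus[SK_NUM_CPUS];

/* Requester: flush locally, queue a flush on every other CPU, do not wait. */
static void sk_request_global_flush(SketchCPU *self)
{
    size_t self_index = (size_t)(self - sk_cpus);
    assert(self_index < SK_NUM_CPUS);
    self->tlb_stale = false;
    for (size_t i = 0; i < SK_NUM_CPUS; i++) {
        SketchCPU *other = &sk_cpus[i];
        if (other == self || other->pending_from[self_index]) {
            continue; /* nothing to queue, or already queued by us */
        }
        other->pending_from[self_index] = true;
        self->outstanding_flushes++;
    }
}

/* Target: run when the CPU leaves its execution loop to service queued work. */
static void sk_run_queued_work(SketchCPU *cpu)
{
    for (size_t r = 0; r < SK_NUM_CPUS; r++) {
        if (!cpu->pending_from[r]) {
            continue;
        }
        cpu->tlb_stale = false;            /* flush before acknowledging */
        cpu->pending_from[r] = false;
        sk_cpus[r].outstanding_flushes--;  /* tell requester r we are done */
    }
}

/* DMB treated like WFI: park the CPU if its flushes have not completed. */
static void sk_execute_dmb(SketchCPU *self)
{
    self->halted = self->outstanding_flushes > 0;
}

/* A parked CPU has no work while its TLB requests are still outstanding. */
static bool sk_cpu_has_work(const SketchCPU *cpu)
{
    return !(cpu->halted && cpu->outstanding_flushes > 0);
}

/* Idle-loop step: leave the halted state once there is work again. */
static void sk_try_resume(SketchCPU *cpu)
{
    if (cpu->halted && sk_cpu_has_work(cpu)) {
        cpu->halted = false;
    }
}
```

A real multi-threaded version would need atomic or locked updates of the counters and flags. The sketch deliberately leaves that out.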
Paolo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 7:37 ` Mark Burton
@ 2015-02-13 13:30 ` Lluís Vilanova
2015-02-13 13:32 ` Mark Burton
0 siblings, 1 reply; 22+ messages in thread
From: Lluís Vilanova @ 2015-02-13 13:30 UTC (permalink / raw)
To: Mark Burton; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
Mark Burton writes:
>> On 13 Feb 2015, at 08:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On 13 February 2015 at 07:16, Mark Burton <mark.burton@greensocs.com> wrote:
>>> If the kernel is doing this - then effectively - for X86, each CPU only
>>> flush’s it’s own TLB (from the perspective of Qemu) - correct?
>>> (in which case, for Qemu itself - for x86) - we dont need to implement
>>> a global flush, and hence we dont need to build the mechanism to sync ?
>> The semantics you need are "flush the QEMU TLB for CPU X" (where
>> X may not be the CPU you're running on). This is what tlb_flush()
>> does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
>> We then use that to implement the target's required semantics
>> (eg in ARM the tlbiall_is_write() function is handled by iterating
>> through all CPUs and calling tlb_flush on them).
> What Lluis implied seemed to be that the kernel arranged to signal the CPU that would flush. Hence, (for X86), we would only ever flush our own TLB.
That's correct.
[...]
> For our immediate concern, in the interests of getting the thing working and
> making sure we’ve turned over all the stones, on ARM - it MAY help us to check
> that the flush has happened ‘in the next memory barrier’….
> - I dont know if that will help us or not, and - even if it does, I agree with you, it would be more messy than it need be.
> However, in the interests of making sure that there are no other issues - we may ‘hack’ something before we put in place a more elegant solution….
> (right now, we have some mutex issues, shifting the sync to the barrier MAY help us avoid that…. To Be Seen…. and anyway - it would only be a temporary fix).
But you shouldn't assume that everyone either uses x86's semantics (i.e., each
CPU gets an IPI), or the ARM semantics you described where the global TLB flush
instruction has asynchronous effects. First, in ARM you still have to ensure
other CPUs did what you asked them to (whenever the arch manual says you must do
so). Second, it seems like ARM does not always behave in the way you described:
http://lxr.free-electrons.com/source/arch/arm/kernel/smp.c?v=2.6.32#L630
Granted, this is just the same behaviour as x86, but no one guarantees you that
some other operation in any of the multiple architectures supported by QEMU will
never need a synchronous instruction with global effects.
I understand the pressure of getting something running and working from that, but I
think that having a framework for asynchronous cross-CPU messaging would be
rather useful in the future. That can then be complemented with a mechanism to
wait for these asynchronous messages. You can achieve any desired behaviour by
composing these two.
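
To make the composition concrete, here is a minimal sketch under stated assumptions. It is a single-threaded toy model, not a proposal for QEMU's API: XcMailbox, XcCompletion, xc_post(), xc_drain() and xc_completed() are all hypothetical names. xc_post() is the asynchronous half, xc_drain() is what a target CPU would run once it breaks out of guest code, and xc_completed() is the predicate that a blocking wait would be built on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XC_NUM_CPUS 3   /* assumption: fixed number of CPUs */
#define XC_QUEUE_LEN 8  /* assumption: small fixed-size mailbox */

/* Tracks a group of posted messages so that a sender can wait for them. */
typedef struct XcCompletion {
    int posted;
    int done;
} XcCompletion;

typedef void (*XcHandler)(int cpu_index, void *arg);

typedef struct XcMessage {
    XcHandler fn;
    void *arg;
    XcCompletion *completion; /* may be NULL for fire-and-forget */
} XcMessage;

/* Per-CPU ring buffer of pending messages. */
typedef struct XcMailbox {
    XcMessage slots[XC_QUEUE_LEN];
    size_t head;
    size_t count;
} XcMailbox;

static XcMailbox xc_mailboxes[XC_NUM_CPUS];

/* Asynchronous half: enqueue and return at once; false if the mailbox is full. */
static bool xc_post(int target, XcHandler fn, void *arg, XcCompletion *completion)
{
    assert(target >= 0 && target < XC_NUM_CPUS);
    XcMailbox *mb = &xc_mailboxes[target];
    if (mb->count == XC_QUEUE_LEN) {
        return false;
    }
    size_t tail = (mb->head + mb->count) % XC_QUEUE_LEN;
    mb->slots[tail] = (XcMessage){ fn, arg, completion };
    mb->count++;
    if (completion) {
        completion->posted++;
    }
    return true;
}

/* Target side: run every queued message, then acknowledge it. */
static void xc_drain(int cpu_index)
{
    XcMailbox *mb = &xc_mailboxes[cpu_index];
    while (mb->count > 0) {
        XcMessage msg = mb->slots[mb->head];
        mb->head = (mb->head + 1) % XC_QUEUE_LEN;
        mb->count--;
        msg.fn(cpu_index, msg.arg);
        if (msg.completion) {
            msg.completion->done++;
        }
    }
}

/* Waiting half: true once every message tracked by c has been handled. */
static bool xc_completed(const XcCompletion *c)
{
    return c->done == c->posted;
}

/* Example handler: pretend to flush a per-CPU TLB by clearing a flag. */
static bool xc_tlb_stale[XC_NUM_CPUS];

static void xc_flush_handler(int cpu_index, void *arg)
{
    (void)arg;
    xc_tlb_stale[cpu_index] = false;
}
```

A synchronous global operation is then "post to every CPU, then wait until xc_completed() holds", while an asynchronous one simply skips or defers the wait. With real threads, the mailbox and counters would need locking or atomics, and the wait would block instead of polling.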
Cheers,
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] Help on TLB Flush
2015-02-13 13:30 ` Lluís Vilanova
@ 2015-02-13 13:32 ` Mark Burton
0 siblings, 0 replies; 22+ messages in thread
From: Mark Burton @ 2015-02-13 13:32 UTC (permalink / raw)
To: Lluís Vilanova; +Cc: mttcg, Peter Maydell, Alexander Graf, qemu-devel
Agreed
Cheers
Mark.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2015-02-13 13:32 UTC | newest]
Thread overview: 22+ messages
2015-02-12 14:35 [Qemu-devel] Help on TLB Flush Mark Burton
2015-02-12 14:45 ` Alexander Graf
2015-02-12 14:58 ` Peter Maydell
2015-02-12 15:38 ` Alexander Graf
2015-02-12 16:02 ` Mark Burton
2015-02-12 22:10 ` Lluís Vilanova
2015-02-13 7:16 ` Mark Burton
2015-02-13 7:24 ` Peter Maydell
2015-02-13 7:37 ` Mark Burton
2015-02-13 13:30 ` Lluís Vilanova
2015-02-13 13:32 ` Mark Burton
2015-02-12 22:02 ` Peter Maydell
2015-02-12 15:01 ` Peter Maydell
2015-02-12 15:08 ` Mark Burton
2015-02-12 15:19 ` Alexander Graf
2015-02-12 21:57 ` Peter Maydell
2015-02-13 9:34 ` Paolo Bonzini
2015-02-13 9:37 ` Mark Burton
2015-02-13 9:49 ` Paolo Bonzini
2015-02-12 15:31 ` Dr. David Alan Gilbert
2015-02-12 18:44 ` Mark Burton
2015-02-12 15:11 ` Mark Burton