Supporting core-specific instruction sets (e.g. big.LITTLE) with restartable sequences

From: Mathieu Desnoyers @ 2018-11-02 15:12 UTC
  To: Richard Henderson
  Cc: Will Deacon, linux-kernel, libc-alpha, Carlos O'Donell,
	Florian Weimer, Joseph Myers, Szabolcs Nagy, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng

Hi Richard,

I stumbled on these articles:

- https://medium.com/@jadr2ddude/a-big-little-problem-a-tale-of-big-little-gone-wrong-e7778ce744bb
- https://www.mono-project.com/news/2016/09/12/arm64-icache/

and discussed them with Will Deacon. He told me you were looking into gcc atomics and it might be
worthwhile to discuss the possible use of the new rseq system call that has been added in Linux 4.18
for those use-cases.

Basically, the use-cases targeted are those where some cores on the system support a larger instruction
set than others. For instance, some cores could use a faster atomic add instruction, while the others
would rely on a slower fallback. The same goes for reading the performance monitoring unit counters
from user-space: it depends on the feature-set supported by the CPU on which the instruction is issued.
The same applies to cores having different cache-line sizes.

The main problem is that the kernel can migrate a thread at any point between user-space reading the
current cpu number and issuing the instruction. This is where rseq can help.

The core idea to solve the instruction set issue is to set a mask of cpus supporting the new instruction
in a library constructor, and then load cpu_id, use it with the mask, and branch to either the new or
old instruction, all within a rseq critical section. If the kernel needs to abort due to preemption or
signal delivery, the abort behavior would be to issue the fallback (slow) atomic operation, which
guarantees progress even if single-stepping.
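
As a rough illustration of the load/test/branch step described above, here is a minimal C sketch. It is
not a working rseq critical section: the names `fast_cpu_mask`, `set_fast_cpu_mask()`, `cpu_is_fast()`
and `dispatch_add()` are hypothetical, the CPU number is passed in as a parameter rather than read from
the rseq area, and both paths use the same C11 atomic as a stand-in for the fast and slow instructions.
Without rseq around it, this code is exposed to exactly the migration race mentioned earlier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: bit N set means CPU N supports the fast instruction.
 * The proposal would fill this in from a library constructor. */
static uint64_t fast_cpu_mask;

/* Counters that exist only to make the chosen path observable. */
static unsigned long fast_path_hits, slow_path_hits;

static void set_fast_cpu_mask(uint64_t mask)
{
    fast_cpu_mask = mask;
}

/* Test one CPU number against the mask; out-of-range CPUs count as slow. */
static int cpu_is_fast(int cpu)
{
    if (cpu < 0 || cpu >= 64)
        return 0;
    return (int)((fast_cpu_mask >> cpu) & 1u);
}

/*
 * Add v to *p and return the new value.
 *
 * In the proposal, the cpu_id load, the mask test, the branch and the
 * chosen instruction would all sit inside one rseq critical section, and
 * an abort would jump to the slow path. None of that is modeled here:
 * this only shows the dispatch decision.
 */
static long dispatch_add(atomic_long *p, long v, int cpu)
{
    assert(p != NULL);
    if (cpu_is_fast(cpu)) {
        fast_path_hits++;                   /* stand-in for the new instruction */
        return atomic_fetch_add(p, v) + v;
    }
    slow_path_hits++;                       /* stand-in for the fallback */
    return atomic_fetch_add(p, v) + v;
}
```

With `set_fast_cpu_mask(0x5)`, CPUs 0 and 2 take the fast path and every other CPU takes the slow one.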

As long as the load, test and branch is faster than the performance delta between the old and new atomic
instruction, it would be worth it.

In the case of PMU read from user-space, using rseq to figure out how to issue the PMU read enables a
use-case which is not otherwise possible to do on big.LITTLE. On rseq abort, it would fall back to a
system call to read the PMU counter. This abort behavior guarantees forward progress.

The second article is about cache line size discrepancy between CPUs. Here again, doing the cacheline
flushing in a rseq critical section could allow tuning it to characteristics of the actual core it is
running on. The fast-path would use a stride fitting the current core characteristics, and if rseq
needs to abort, the slow-path would fall back to a conservative value which would fit all cores (the smallest
cache line size on the overall system). Once again, this abort behavior guarantees forward progress.
This would only work, of course, if cacheline invalidation done on a big core ends up being propagated
to other cores in a way that clears all the cache lines corresponding to the one targeted on the big
core.

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: Supporting core-specific instruction sets (e.g. big.LITTLE) with restartable sequences
From: Mark Rutland @ 2018-11-02 16:08 UTC
  To: Mathieu Desnoyers
  Cc: Richard Henderson, Will Deacon, linux-kernel, libc-alpha,
	Carlos O'Donell, Florian Weimer, Joseph Myers, Szabolcs Nagy,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, linux-api

Hi Mathieu, Richard,

On Fri, Nov 02, 2018 at 11:12:24AM -0400, Mathieu Desnoyers wrote:
> Hi Richard,
> 
> I stumbled on these articles:
> 
> - https://medium.com/@jadr2ddude/a-big-little-problem-a-tale-of-big-little-gone-wrong-e7778ce744bb
> - https://www.mono-project.com/news/2016/09/12/arm64-icache/
> 
> and discussed them with Will Deacon. He told me you were looking into
> gcc atomics and it might be worthwhile to discuss the possible use of
> the new rseq system call that has been added in Linux 4.18 for those
> use-cases.
> 
> Basically, the use-cases targeted are those where some cores on the
> system support a larger instruction set than others. So for instance,
> some cores could use a faster atomic add instruction than others,
> which should rely on a slower fallback. This is also the same story
> for reading the performance monitoring unit counters from user-space:
> it depends on the feature-set supported by the CPU on which the
> instruction is issued. Same applies to cores having different
> cache-line sizes.

Please note that upstream arm64 Linux does not expose mismatched ISA
features to userspace. We go to great pains to expose a uniform set of
supported features.

The two issues referenced above are both handled by the kernel, and no
userspace changes are required to handle them.

We do not intend or expect to expose mismatched features to userspace.
Correctly-written userspace should not use optional instructions unless
the kernel has advertised their presence via a hwcap (or via ID register
emulation).
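
For reference, checking a hwcap from user-space looks roughly like the
sketch below. It assumes Linux with a libc that provides `getauxval()` in
`<sys/auxv.h>`; `HWCAP_ATOMICS` is the arm64 bit for the ARMv8.1 atomics,
and the helper names `hwcap_has()` and `kernel_advertises_lse_atomics()`
are made up for this example. On any other architecture, or if the libc
headers do not define `HWCAP_ATOMICS`, the function simply reports 0.

```c
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */

/* Return 1 if the given feature bit is set in a hwcap word, else 0. */
static int hwcap_has(unsigned long caps, unsigned long bit)
{
    return (caps & bit) != 0;
}

/*
 * Return 1 if the kernel advertises the ARMv8.1 atomic instructions to
 * this process, 0 otherwise. The answer is the kernel's system-wide
 * view, not a property of the core the thread happens to run on.
 */
static int kernel_advertises_lse_atomics(void)
{
#if defined(__aarch64__) && defined(HWCAP_ATOMICS)
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_ATOMICS);
#else
    return 0;
#endif
}
```

A library would typically call this once at startup and pick its
implementation from the result.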

> The main problem is that the kernel can migrate a thread at any point
> between user-space reading the current cpu number and issuing the
> instruction. This is where rseq can help.
> 
> The core idea to solve the instruction set issue is to set a mask of
> cpus supporting the new instruction in a library constructor, and then
> load cpu_id, use it with the mask, and branch to either the new or old
> instruction, all within a rseq critical section. If the kernel needs to
> abort due to preemption or signal delivery, the abort behavior would
> be to issue the fallback (slow) atomic operation, which guarantees
> progress even if single-stepping.
> 
> As long as the load, test and branch is faster than the performance
> delta between the old and new atomic instruction, it would be worth
> it.

Specifically w.r.t. the atomics, the kernel will only expose the
presence of the ARMv8.1 atomic instructions when supported by all CPUs
in the system.

> In the case of PMU read from user-space, using rseq to figure out how
> to issue the PMU read enables a use-case which is not otherwise
> possible to do on big.LITTLE. On rseq abort, it would fall back to a
> system call to read the PMU counter. This abort behavior guarantees
> forward progress.

We do not currently expose any PMU registers to userspace. If we were to
expose them for big.LITTLE, rseq may be of use, but no-one has done the
groundwork to investigate this.

> The second article is about cache line size discrepancy between CPUs.
> Here again, doing the cacheline flushing in a rseq critical section
> could allow tuning it to characteristics of the actual core it is
> running on. The fast-path would use a stride fitting the current core
> characteristics, and if rseq needs to abort, the slow-path would
> fall back to a conservative value which would fit all cores (smaller
> cache line size on the overall system).

This is already handled by the kernel, and the proposed rseq approach is
not correct -- cache maintenance must *always* use the system-wide
minimum cacheline size, or stale entries will be left on some CPUs,
which will result in later failures.
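
To make the failure mode concrete, here is a small C model (not real
cache maintenance code). It assumes, as a simplification, that a
by-address maintenance operation only covers the line containing that
address on a core with `line`-byte lines. `min_line_size()` and
`lines_missed()` are hypothetical helpers, and the 64/128-byte sizes used
in the note that follows are made-up values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest cache line size among all cores in the system. */
static size_t min_line_size(const size_t *line_sizes, size_t n)
{
    assert(line_sizes != NULL && n > 0);
    size_t min = line_sizes[0];
    for (size_t i = 1; i < n; i++)
        if (line_sizes[i] < min)
            min = line_sizes[i];
    return min;
}

/*
 * Walk [start, start + len) issuing one (modeled) maintenance operation
 * every `stride` bytes, starting from `start` aligned down to `stride`.
 * Return how many `line`-byte lines overlapping the range were never
 * targeted by any operation.
 */
static size_t lines_missed(uintptr_t start, size_t len,
                           size_t stride, size_t line)
{
    assert(stride > 0 && line > 0 && len > 0);
    uintptr_t end = start + len;
    size_t missed = 0;

    for (uintptr_t l = start - (start % line); l < end; l += line) {
        int touched = 0;
        for (uintptr_t a = start - (start % stride); a < end; a += stride) {
            if (a >= l && a < l + line) {
                touched = 1;
                break;
            }
        }
        if (!touched)
            missed++;
    }
    return missed;
}
```

With line sizes of 64 and 128 bytes, `min_line_size()` gives 64. Walking
a 256-byte range with a 128-byte stride leaves 2 of the 4 64-byte lines
untouched (`lines_missed(0, 256, 128, 64)` is 2), while a 64-byte stride
leaves none.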

Thanks,
Mark.

* Re: Supporting core-specific instruction sets (e.g. big.LITTLE) with restartable sequences
From: Florian Weimer @ 2018-11-02 19:18 UTC
  To: Mathieu Desnoyers
  Cc: Richard Henderson, Will Deacon, linux-kernel, libc-alpha,
	Carlos O'Donell, Joseph Myers, Szabolcs Nagy, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Dave Watson, Paul Turner, linux-api

* Mathieu Desnoyers:

> Basically, the use-cases targeted are those where some cores on the
> system support a larger instruction set than others. So for instance,
> some cores could use a faster atomic add instruction than others,
> which should rely on a slower fallback. This is also the same story
> for reading the performance monitoring unit counters from user-space:
> it depends on the feature-set supported by the CPU on which the
> instruction is issued. Same applies to cores having different
> cache-line sizes.

The kernel needs to present a consistent view to userspace, the common
denominator.  I don't think there is any other way.

The situation is not new at all, by the way.  It also arises with VM and
process migration.  In glibc, we do not re-run CPU feature selection
upon resume (and how could we? function pointers would have to change),
and we have no plans to implement anything differently.

Thanks,
Florian

* Re: Supporting core-specific instruction sets (e.g. big.LITTLE) with restartable sequences
From: Andrew Pinski @ 2018-11-02 19:27 UTC
  To: mathieu.desnoyers
  Cc: Richard Henderson, Will Deacon, LKML, GNU C Library,
	Carlos O'Donell, Florian Weimer, Joseph S. Myers,
	Szabolcs Nagy, Thomas Gleixner, bmaurer, Peter Zijlstra,
	Paul E. McKenney, boqun.feng, davejwatson, pjt, linux-api

On Fri, Nov 2, 2018 at 8:12 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Hi Richard,
>
> I stumbled on these articles:
>
> - https://medium.com/@jadr2ddude/a-big-little-problem-a-tale-of-big-little-gone-wrong-e7778ce744bb
> - https://www.mono-project.com/news/2016/09/12/arm64-icache/
>
> and discussed them with Will Deacon. He told me you were looking into gcc atomics and it might be
> worthwhile to discuss the possible use of the new rseq system call that has been added in Linux 4.18
> for those use-cases.
>
> Basically, the use-cases targeted are those where some cores on the system support a larger instruction
> set than others. So for instance, some cores could use a faster atomic add instruction than others, which
> should rely on a slower fallback. This is also the same story for reading the performance monitoring
> unit counters from user-space: it depends on the feature-set supported by the CPU on which the instruction
> is issued. Same applies to cores having different cache-line sizes.
>
> The main problem is that the kernel can migrate a thread at any point between user-space reading the
> current cpu number and issuing the instruction. This is where rseq can help.
>
> The core idea to solve the instruction set issue is to set a mask of cpus supporting the new instruction
> in a library constructor, and then load cpu_id, use it with the mask, and branch to either the new or
> old instruction, all within a rseq critical section. If the kernel needs to abort due to preemption or
> signal delivery, the abort behavior would be to issue the fallback (slow) atomic operation, which
> guarantees progress even if single-stepping.
>
> As long as the load, test and branch is faster than the performance delta between the old and new atomic
> instruction, it would be worth it.
>
> In the case of PMU read from user-space, using rseq to figure out how to issue the PMU read enables a
> use-case which is not otherwise possible to do on big.LITTLE. On rseq abort, it would fall back to a
> system call to read the PMU counter. This abort behavior guarantees forward progress.
>
> The second article is about cache line size discrepancy between CPUs. Here again, doing the cacheline
> flushing in a rseq critical section could allow tuning it to characteristics of the actual core it is
> running on. The fast-path would use a stride fitting the current core characteristics, and if rseq
> needs to abort, the slow-path would fall back to a conservative value which would fit all cores (smaller
> cache line size on the overall system). Once again, this abort behavior guarantees forward progress.
> This would only work, of course, if cacheline invalidation done on a big core ends up being propagated
> to other cores in a way that clears all the cache lines corresponding to the one targeted on the big
> core.

Cache flushing is only one of the things affected by cache line size
differences.  Another is the "dc ZVA" instruction, which is used in
memset and would need to be either emulated in software or disabled.
There are most likely others too, for example dealing with dmb/dsb sizes.

Thanks,
Andrew
