* crazy idea: big percpu lock (Re: task isolation)
@ 2015-10-08 21:25 Andy Lutomirski
2015-10-08 22:01 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Andy Lutomirski @ 2015-10-08 21:25 UTC (permalink / raw)
To: Chris Metcalf
Cc: Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
This whole isolation vs vmstat, etc thing made me think:
It seems to me that a big part of the problem is that there's all
kinds of per-cpu deferred housekeeping work that can be done on the
CPU in question without any complicated or heavyweight locking but
that can't be done remotely without a mess. This presumably includes
vmstat, draining the LRU list, etc. This is a problem for allowing
CPUs to spend a long time without any interrupts.
I want to propose a new primitive that might go a long way toward
solving this issue. The new primitive would be called the "big percpu
lock". Non-nohz CPUs would hold their big percpu lock all the time.
Nohz CPUs would hold it all the time unless idle. Full nohz cpus
would hold it all the time except when idle or in user mode. No CPU
promises to hold it while processing an NMI or similar NMI-like work.
This should help in a ton of cases.
For vunmap global kernel TLB flushes, we could stick the flushes in a
list of deferred flushes to be processed on entry, and that list would
be protected by the big percpu lock. For any kind of draining of
non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
housekeeping cpu try to do it using the big percpu lock.
There's a race here that affects task isolation. On exit to user
mode, there's no obvious way to tell that an IPI is already pending.
We could add that, too: whenever we send an IPI to a nohz_full CPU, we
increment a percpu pending IPI count, then try to get the big percpu
lock, and then, if we fail, send the IPI. IOW, we might want a helper
that takes a remote big percpu lock or calls a remote function that
guards against this race.
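In code, the shape I'm imagining is something like this. This is just a
userspace sketch with invented names (bpl_*, send_ipi), with stdatomic
standing in for the real percpu and entry/exit machinery:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the "big percpu lock" discipline described
 * above.  All names are invented; a real version would live in the
 * kernel entry/exit paths, not in userspace. */
struct big_percpu_lock {
	atomic_flag held;		/* taken by the owning CPU in kernel mode */
	atomic_int  ipi_pending;	/* IPIs queued against this CPU */
};

/* Owning CPU: take the lock on kernel entry, drop it on exit to user. */
static void bpl_kernel_enter(struct big_percpu_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin: a remote CPU is touching our percpu data */
}

static void bpl_kernel_exit(struct big_percpu_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Remote side: try to do the work directly; fall back to an IPI.
 * Returns true if the work was done without an IPI. */
static bool bpl_remote_work(struct big_percpu_lock *l,
			    void (*work)(void *), void *arg,
			    void (*send_ipi)(void))
{
	atomic_fetch_add(&l->ipi_pending, 1);
	if (!atomic_flag_test_and_set(&l->held)) {
		/* Target is in user mode or idle: do its work for it. */
		work(arg);
		atomic_flag_clear(&l->held);
		atomic_fetch_sub(&l->ipi_pending, 1);
		return true;
	}
	/* Target is in the kernel: fall back to an IPI.  A real version
	 * would keep ipi_pending elevated until the IPI is handled, so
	 * the exit-to-user path can see it and close the race. */
	send_ipi();
	atomic_fetch_sub(&l->ipi_pending, 1);
	return false;
}
```

The point being that the accessing side is just a trylock, and the nohz
side pays one acquire on entry and one release on exit.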
Thoughts? Am I nuts?
--Andy
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Christoph Lameter @ 2015-10-08 22:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Chris Metcalf, Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt,
Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Thu, 8 Oct 2015, Andy Lutomirski wrote:
> It seems to me that a big part of the problem is that there's all
> kinds of per-cpu deferred housekeeping work that can be done on the
> CPU in question without any complicated or heavyweight locking but
> that can't be done remotely without a mess. This presumably includes
> vmstat, draining the LRU list, etc. This is a problem for allowing
> CPUs to spend a long time without any interrupts.
Well it's not a problem if the task does a prctl to ask for the kernel to
quiet down. In that case we can simply flush all the pending stuff on the
cpu that owns the percpu section.
> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock". Non-nohz CPUs would hold their big percpu lock all the time.
> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
> would hold it all the time except when idle or in user mode. No CPU
> promises to hold it while processing an NMI or similar NMI-like work.
Not sure that there is an issue to solve. So this is a lock per cpu that
signals that the processor can handle its per cpu data alone. If it's not
held then other cpus can access the percpu data remotely?
> This should help in a ton of cases.
>
> For vunmap global kernel TLB flushes, we could stick the flushes in a
> list of deferred flushes to be processed on entry, and that list would
> be protected by the big percpu lock. For any kind of draining of
> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
> housekeeping cpu try to do it using the big percpu lock
Ok what is the problem with using the cpu that owns the percpu data to
flush it? Or simply ignore the situation until the cpu is entering the
kernel again? Caches can be useful later again when the process wants to
allocate memory etc. We would have to repopulate them if we flush them.
> There's a race here that affects task isolation. On exit to user
> mode, there's no obvious way to tell that an IPI is already pending.
> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
> increment a percpu pending IPI count, then try to get the big percpu
> lock, and then, if we fail, send the IPI. IOW, we might want a helper
> that takes a remote big percpu lock or calls a remote function that
> guards against this race.
>
> Thoughts? Am I nuts?
Generally having a lock that signals that others can access the per cpu
data may make sense. However, what is the overhead of handling that lock?
One definitely does not want to handle that in latency-critical sections.
And one cannot handle the lock in interrupt-disabled sections like IPIs.
But if one can remotely acquire that lock then no IPI is needed anymore if
the only thing we want to do is manipulate per cpu data.
There is a complication that many of these flushing functions are written
using this_cpu operations that can only be run on the cpu owning the per
cpu section because the per cpu base is different on other processors. If
you want to change that then more expensive instructions have to be used.
So you end up with two different versions of the function.
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Andy Lutomirski @ 2015-10-08 22:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Chris Metcalf, Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt,
Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Thu, Oct 8, 2015 at 3:01 PM, Christoph Lameter <cl@linux.com> wrote:
> On Thu, 8 Oct 2015, Andy Lutomirski wrote:
>
>> It seems to me that a big part of the problem is that there's all
>> kinds of per-cpu deferred housekeeping work that can be done on the
>> CPU in question without any complicated or heavyweight locking but
>> that can't be done remotely without a mess. This presumably includes
>> vmstat, draining the LRU list, etc. This is a problem for allowing
>> CPUs to spend a long time without any interrupts.
>
> Well its not a problem if the task does a prctl to ask for the kernel to
> quiet down. In that case we can simply flush all the pending stuff on the
> cpu that owns the percpu section.
>
Will this really end up working? I can see two problems:
1. It's rather expensive. For processes that still make syscalls but
just not many, it means that you're forcibly quiescing every time.
2. It only really makes sense for work that results from local kernel
actions, happens once, and won't recur. I admit that I don't know how
many of the offenders are like this, but I can imagine there being
some periodic tasks that could be done locally or remotely with a big
percpu lock.
>> I want to propose a new primitive that might go a long way toward
>> solving this issue. The new primitive would be called the "big percpu
>> lock". Non-nohz CPUs would hold their big percpu lock all the time.
>> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
>> would hold it all the time except when idle or in user mode. No CPU
>> promises to hold it while processing an NMI or similar NMI-like work.
>
> Not sure that there is an issue to solve. So this is a lock per cpu that
> signals that the processor can handle its per cpu data alone. If its not
> held then other cpus can access the percpu data remotely?
>
>> This should help in a ton of cases.
>>
>> For vunmap global kernel TLB flushes, we could stick the flushes in a
>> list of deferred flushes to be processed on entry, and that list would
>> be protected by the big percpu lock. For any kind of draining of
>> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
>> housekeeping cpu try to do it using the big percpu lock
>
> Ok what is the problem with using the cpu that owns the percpu data to
> flush it?
Nothing, but only if flushing gets the job done.
> Or simply ignore the situation until the cpu is entering the
> kernel again?
Maybe. I wonder if, for things like vmstat, that would be better in
general (not just NOHZ). We have task_work nowadays...
> Caches can be useful later again when the process wants to
> allocate memory etc. We would have to repopulate them if we flush them.
True. But we don't need to flush them at all until there's memory
pressure, and the big percpu lock solves this particular problem quite
nicely -- a remote CPU can simply drain the cache itself instead of
using an IPI.
>
>> There's a race here that affects task isolation. On exit to user
>> mode, there's no obvious way to tell that an IPI is already pending.
>> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
>> increment a percpu pending IPI count, then try to get the big percpu
>> lock, and then, if we fail, send the IPI. IOW, we might want a helper
>> that takes a remote big percpu lock or calls a remote function that
>> guards against this race.
>>
>> Thoughts? Am I nuts?
>
> Generally having a lock that signals that other can access the per cpu
> data may make sense. However, what is the overhead of handling that lock?
>
> One definitely does not want to handle that in latency critical sections.
On the accessing side, it's just a standard try spinlock operation.
On the nohz side, it's a spinlock acquire on entry and a spinlock
release on exit. This is actually probably considerably cheaper than
whatever the context tracking code already does, but it does put a
lower bound on how cheap we can make it.
>
> And one cannot handle the lock in interrupt disabled sections like IPIs.
> But if one can remotely acquire that lock then no IPI is needed anymore if
> the only thing we want to do is manipulate per cpu data.
Sure you can. If you're in an IPI handler, you have the lock.
>
> There is a complication that many of these flushing functions are written
> using this_cpu operations that can only be run on the cpu owning the per
> cpu section because the per cpu base is different on other processors. If
> you want to change that then more expensive instructions have to be used.
> So you end up with two different versions of the function.
>
That's a fair point.
How many of these things are there?
--Andy
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Peter Zijlstra @ 2015-10-09 9:08 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Chris Metcalf, Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt,
Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo,
Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Thu, Oct 08, 2015 at 02:25:23PM -0700, Andy Lutomirski wrote:
> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock".
Never, ever, combine big and lock :-) You want small granular locks, big
locks are a guaranteed recipe for pain.
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Thomas Gleixner @ 2015-10-09 9:27 UTC (permalink / raw)
To: Andy Lutomirski, Christoph Lameter, Peter Zijlstra
Cc: Chris Metcalf, Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt,
Ingo Molnar, Andrew Morton, Rik van Riel, Tejun Heo,
Frederic Weisbecker, Paul E. McKenney, Viresh Kumar,
Catalin Marinas, Will Deacon, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, Linus Torvalds
On Thu, 8 Oct 2015, Andy Lutomirski wrote:
> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock".
It took us 15+ years to get rid of the "Big Kernel Lock", so we really
don't want to add a new "Big XXX Lock". We have enough pain already
with preempt_disable() and local_irq_disable() which are basically
"Big CPU Locks".
Don't ever put BIG and LOCK into one context, really.
Thanks,
tglx
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Christoph Lameter @ 2015-10-09 11:24 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Chris Metcalf, Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt,
Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel,
Tejun Heo, Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Thu, 8 Oct 2015, Andy Lutomirski wrote:
> Will this really end up working? I can see two problems:
>
> 1. It's rather expensive. For processes that still make syscalls but
> just not many, it means that you're forcibly quiescing every time.
A process that does a lot of networking syscalls must usually tolerate
noise. Task isolation is for cases where there is little need for support
from the OS.
> 2. It only really makes sense for work that results from local kernel
> actions, happens once, and won't recur. I admit that I don't know how
> many of the offenders are like this, but I can imagine there being
> some periodic tasks that could be done locally or remotely with a big
> percpu lock.
The percpu data is used because critical code sections run faster with
data that is not contended. Touching the data remotely causes a performance
regression. You do not want that unless the task will stay away from OS
actions for a while.
> > Or simply ignore the situation until the cpu is entering the
> > kernel again?
>
> Maybe. I wonder if, for things like vmstat, that would be better in
> general (not just NOHZ). We have task_work nowadays...
vmstat uses per cpu data that should be local on a processor for
performance reasons. Doing remote write accesses will cause cache
misses to occur and will result in performance issues.
> > Caches can be useful later again when the process wants to
> > allocate memory etc. We would have to repopulate them if we flush them.
>
> True. But we don't need to flush them at all until there's memory
> pressure, and the big percpu lock solves this particular problem quite
> nicely -- a remote CPU can simply drain the cache itself instead of
> using an IPI.
You still have not answered why you would want them flushed at all.
Memory reclaim can occur in a number of ways. No need to flush everything.
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Andy Lutomirski @ 2015-10-09 18:56 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Christoph Lameter, Peter Zijlstra, Chris Metcalf, Luiz Capitulino,
Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Andrew Morton,
Rik van Riel, Tejun Heo, Frederic Weisbecker, Paul E. McKenney,
Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Fri, Oct 9, 2015 at 2:27 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 8 Oct 2015, Andy Lutomirski wrote:
>> I want to propose a new primitive that might go a long way toward
>> solving this issue. The new primitive would be called the "big percpu
>> lock".
>
> It took us 15+ years to get rid of the "Big Kernel Lock", so we really
> don't want to add a new "Big XXX Lock". We have enough pain already
> with preempt_disable() and local_irq_disable() which are basically
> "Big CPU Locks".
>
> Don't ever put BIG and LOCK into one context, really.
I knew I shouldn't have called it that. The basic useful idea (if
it's actually useful) is to have an efficient way to poke another
CPU's percpu data structures in cases where we're reasonably confident
that locality doesn't matter. And maybe even doing anything lock-like
is a bad idea for the problems I'm trying to help solve.
lru_add_drain_all is an example. That function already effectively
takes a massive lock. It reads (racily, but it doesn't matter) remote
percpu data (which already forces the cachelines to become shared) and
then, if needed, it schedules work on that CPU and waits for it. It
seems like a very one-sided lock (no overhead on the victim CPU),
except that it requires that the victim schedule in order to make
forward progress. There may even be a forced IPI in there, although I
haven't dug far enough to find it.
Ideally, under memory pressure, we'd have a way to just grab the
remote LRU list directly. We could easily have coarse-grained or
fine-grained locking to enable that, but it might be better to have
just one instance of the resulting barriers on user and idle entry and
exit rather than doing it on every LRU access.
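As a rough sketch of what grabbing the remote LRU directly might look
like (all names invented, a userspace stand-in with a spinlock in place
of the real percpu pagevec code, not anything resembling an actual
implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical percpu pagevec whose drain can run either locally or
 * from a remote cpu under memory pressure, instead of the
 * schedule-work-and-wait dance lru_add_drain_all() does today. */
#define PVEC_SIZE 14

struct percpu_pvec {
	atomic_flag lock;		/* stand-in for the percpu lock */
	int nr;
	void *pages[PVEC_SIZE];
};

static int nr_drained;			/* pages "moved" to the shared LRU */

static void pvec_lock(struct percpu_pvec *p)
{
	while (atomic_flag_test_and_set(&p->lock))
		;			/* spin */
}

static void pvec_unlock(struct percpu_pvec *p)
{
	atomic_flag_clear(&p->lock);
}

static void __drain(struct percpu_pvec *p)
{
	nr_drained += p->nr;		/* stand-in for splicing to the LRU */
	p->nr = 0;
}

/* Local fast path: the owning cpu batches pages into its pagevec. */
static void lru_cache_add_local(struct percpu_pvec *p, void *page)
{
	pvec_lock(p);
	p->pages[p->nr++] = page;
	if (p->nr == PVEC_SIZE)
		__drain(p);
	pvec_unlock(p);
}

/* Remote drain: no queued work, no IPI, no waiting on the victim cpu. */
static void lru_drain_remote(struct percpu_pvec *p)
{
	pvec_lock(p);
	__drain(p);
	pvec_unlock(p);
}
```

The tradeoff is exactly the one above: the local fast path now takes a
lock on every access, unless the barriers can be hoisted to entry/exit.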
flush_tlb_kernel_range and such are also examples, and they will
currently kill isolation, but maybe we should just have a way to mark
the kernel TLB as idle when we enter user mode and have a way to
recognize that we need a flush when we go back to kernel (or maybe
even NMI) mode.
--Andy
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Chris Metcalf @ 2015-10-28 18:42 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On 10/08/2015 05:25 PM, Andy Lutomirski wrote:
> This whole isolation vs vmstat, etc thing made me think:
>
> It seems to me that a big part of the problem is that there's all
> kinds of per-cpu deferred housekeeping work that can be done on the
> CPU in question without any complicated or heavyweight locking but
> that can't be done remotely without a mess. This presumably includes
> vmstat, draining the LRU list, etc. This is a problem for allowing
> CPUs to spend a long time without any interrupts.
>
> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock". Non-nohz CPUs would hold their big percpu lock all the time.
> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
> would hold it all the time except when idle or in user mode. No CPU
> promises to hold it while processing an NMI or similar NMI-like work.
>
> This should help in a ton of cases.
>
> For vunmap global kernel TLB flushes, we could stick the flushes in a
> list of deferred flushes to be processed on entry, and that list would
> be protected by the big percpu lock. For any kind of draining of
> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
> housekeeping cpu try to do it using the big percpu lock
>
> There's a race here that affects task isolation. On exit to user
> mode, there's no obvious way to tell that an IPI is already pending.
> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
> increment a percpu pending IPI count, then try to get the big percpu
> lock, and then, if we fail, send the IPI. IOW, we might want a helper
> that takes a remote big percpu lock or calls a remote function that
> guards against this race.
>
> Thoughts? Am I nuts?
The Tilera code has support for avoiding TLB flushes to kernel VAs
while running in userspace on nohz_full cores, but I didn't try to
upstream it yet because it is generally less critical than the other
stuff.
The model I chose is to have a per-cpu state that indicates whether
the core is in kernel space, in user space, or in user space with
a TLB flush pending. On entry to user space with task isolation
in effect we just set the state to "user". When doing a remote
TLB flush we decide whether or not to actually issue the flush by
doing a cmpxchg() from "user" to "user pending", and if the
old state was either "user" or "user pending", we don't issue the
flush. Finally, on entry to the kernel for a task-isolation task we
do an atomic xchg() to set the state to "kernel", and if we discover
a flush was pending, we just globally flush the kernel's full VA range
(no real reason to optimize for this case).
This is basically equivalent to your lock model, where you would
remotely trylock, and if you succeed, set a bit for the core indicating
it needs a kernel TLB flush, and if you fail, just doing the remote
flush yourself. And, on kernel entry for task isolation, you lock
(possibly waiting while someone updates the kernel TLB flush state)
and then if the kernel TLB flush bit is on, do the flush before
completing the entry to the kernel.
But, it turns out you also need to keep track of whether TLB flushes
are currently pending for a given core, since you could start a
remote TLB flush to a task-isolation core just as it was getting
ready to return to userspace. Since the caller issues these flushes
synchronously, we would just bump a counter atomically for the
remote core before issuing, and decrement it when it was done.
Then when returning to userspace, we first flip the bit saying that
we are now in the "user" state, and then we actually spin and wait
for the counter to hit zero as well, in case a TLB flush was in progress.
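Roughly, the state machine plus the in-flight counter could be sketched
like this (invented names, C11 atomics; the real Tilera code differs in
detail):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the per-cpu TLB-flush state described above. */
enum tlb_state { ST_KERNEL, ST_USER, ST_USER_PENDING };

struct cpu_tlb {
	atomic_int state;		/* enum tlb_state */
	atomic_int flushes_in_flight;	/* remote flushes being issued */
};

/* Remote side: returns true if a real flush must be sent. */
static bool remote_flush_needed(struct cpu_tlb *c)
{
	int expected = ST_USER;

	if (atomic_compare_exchange_strong(&c->state, &expected,
					   ST_USER_PENDING))
		return false;		/* was "user": flush deferred */
	if (expected == ST_USER_PENDING)
		return false;		/* a flush is already pending */
	return true;			/* in kernel: flush for real */
}

/* Caller brackets a flush it actually issues with these. */
static void flush_begin(struct cpu_tlb *c)
{
	atomic_fetch_add(&c->flushes_in_flight, 1);
}

static void flush_end(struct cpu_tlb *c)
{
	atomic_fetch_sub(&c->flushes_in_flight, 1);
}

/* Kernel entry on a task-isolation cpu: pick up any deferred flush. */
static bool kernel_entry_flush_pending(struct cpu_tlb *c)
{
	return atomic_exchange(&c->state, ST_KERNEL) == ST_USER_PENDING;
}

/* Return to user: flip the state first, then wait out in-flight
 * flushes that raced with us. */
static void return_to_user(struct cpu_tlb *c)
{
	atomic_store(&c->state, ST_USER);
	while (atomic_load(&c->flushes_in_flight))
		;			/* spin until they complete */
}
```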
For the tilegx architecture we had support for modifying how pages
were statically cache homed, and this caused a lot of these TLB
flushes since we needed to adjust the kernel linear mapping as
well as the userspace mappings. It's a lot less common otherwise,
just vunmap and the like, but still a good thing for a follow-up patch.
--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Andy Lutomirski @ 2015-10-28 18:45 UTC (permalink / raw)
To: Chris Metcalf
Cc: Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
Peter Zijlstra, Andrew Morton, Rik van Riel, Tejun Heo,
Frederic Weisbecker, Thomas Gleixner, Paul E. McKenney,
Christoph Lameter, Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On Wed, Oct 28, 2015 at 11:42 AM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> On 10/08/2015 05:25 PM, Andy Lutomirski wrote:
>>
>> This whole isolation vs vmstat, etc thing made me think:
>>
>> It seems to me that a big part of the problem is that there's all
>> kinds of per-cpu deferred housekeeping work that can be done on the
>> CPU in question without any complicated or heavyweight locking but
>> that can't be done remotely without a mess. This presumably includes
>> vmstat, draining the LRU list, etc. This is a problem for allowing
>> CPUs to spend a long time without any interrupts.
>>
>> I want to propose a new primitive that might go a long way toward
>> solving this issue. The new primitive would be called the "big percpu
>> lock". Non-nohz CPUs would hold their big percpu lock all the time.
>> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
>> would hold it all the time except when idle or in user mode. No CPU
>> promises to hold it while processing an NMI or similar NMI-like work.
>>
>> This should help in a ton of cases.
>>
>> For vunmap global kernel TLB flushes, we could stick the flushes in a
>> list of deferred flushes to be processed on entry, and that list would
>> be protected by the big percpu lock. For any kind of draining of
>> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
>> housekeeping cpu try to do it using the big percpu lock
>>
>> There's a race here that affects task isolation. On exit to user
>> mode, there's no obvious way to tell that an IPI is already pending.
>> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
>> increment a percpu pending IPI count, then try to get the big percpu
>> lock, and then, if we fail, send the IPI. IOW, we might want a helper
>> that takes a remote big percpu lock or calls a remote function that
>> guards against this race.
>>
>> Thoughts? Am I nuts?
>
>
> The Tilera code has support for avoiding TLB flushes to kernel VAs
> while running in userspace on nohz_full cores, but I didn't try to
> upstream it yet because it is generally less critical than the other
> stuff.
>
> The model I chose is to have a per-cpu state that indicates whether
> the core is in kernel space, in user space, or in user space with
> a TLB flush pending. On entry to user space with task isolation
> in effect we just set the state to "user". When doing a remote
> TLB flush we decide whether or not to actually issue the flush by
> doing a cmpxchg() from "user" to "user pending", and if the
> old state was either "user" or "user pending", we don't issue the
> flush. Finally, on entry to the kernel for a task-isolation task we
> do an atomic xchg() to set the state to "kernel", and if we discover
> a flush was pending, we just globally flush the kernel's full VA range
> (no real reason to optimize for this case).
This sounds like it belongs in the generic context tracking code, or
at least the tracking part and the option to handle deferred work
should go there.
--Andy
* Re: crazy idea: big percpu lock (Re: task isolation)
From: Rik van Riel @ 2015-11-10 14:19 UTC (permalink / raw)
To: Andy Lutomirski, Chris Metcalf
Cc: Luiz Capitulino, Gilad Ben Yossef, Steven Rostedt, Ingo Molnar,
Peter Zijlstra, Andrew Morton, Tejun Heo, Frederic Weisbecker,
Thomas Gleixner, Paul E. McKenney, Christoph Lameter,
Viresh Kumar, Catalin Marinas, Will Deacon,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
Linus Torvalds
On 10/28/2015 02:45 PM, Andy Lutomirski wrote:
>> The model I chose is to have a per-cpu state that indicates whether
>> the core is in kernel space, in user space, or in user space with
>> a TLB flush pending. On entry to user space with task isolation
>> in effect we just set the state to "user". When doing a remote
>> TLB flush we decide whether or not to actually issue the flush by
>> doing a cmpxchg() from "user" to "user pending", and if the
>> old state was either "user" or "user pending", we don't issue the
>> flush. Finally, on entry to the kernel for a task-isolation task we
>> do an atomic xchg() to set the state to "kernel", and if we discover
>> a flush was pending, we just globally flush the kernel's full VA range
>> (no real reason to optimize for this case).
>
> This sounds like it belongs in the generic context tracking code, or
> at least the tracking part and the option to handle deferred work
> should go there.
I wonder if your whole idea could also piggyback on the
context tracking code.
If the remote CPU is in a quiescent state, have the per-cpu
thing done by the calling CPU, instead of interrupting the
quiescent CPU.
In the unlikely event the quiescent CPU wakes up (or returns to
kernel space) while some per-cpu thing is being done remotely, it
can wait for completion before it continues into the kernel.
The amount of time it waits will, at worst, be as much as
it would have spent doing that per-cpu thing itself.
--
All rights reversed
end of thread
Thread overview: 10+ messages
2015-10-08 21:25 crazy idea: big percpu lock (Re: task isolation) Andy Lutomirski
2015-10-08 22:01 ` Christoph Lameter
2015-10-08 22:28 ` Andy Lutomirski
2015-10-09 11:24 ` Christoph Lameter
2015-10-09 9:08 ` Peter Zijlstra
2015-10-09 9:27 ` Thomas Gleixner
2015-10-09 18:56 ` Andy Lutomirski
2015-10-28 18:42 ` Chris Metcalf
2015-10-28 18:45 ` Andy Lutomirski
2015-11-10 14:19 ` Rik van Riel