PCID review?

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* PCID review?
@ 2017-02-07 18:56 Andy Lutomirski
  2017-02-07 19:11 ` Kees Cook
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-07 18:56 UTC (permalink / raw)
  To: Borislav Petkov, Kees Cook, Dave Hansen, linux-mm@kvack.org,
	Paul E. McKenney

Quite a few people have expressed interest in enabling PCID on (x86)
Linux.  Here's the code:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

The main hold-up is that the code needs to be reviewed very carefully.
It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
entries using PCID" ought to be looked at carefully to make sure the
locking is right, but there are plenty of other ways this this could
all break.

Anyone want to take a look or maybe scare up some other reviewers?
(Kees, you seemed *really* excited about getting this in.)

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 18:56 PCID review? Andy Lutomirski
@ 2017-02-07 19:11 ` Kees Cook
  2017-02-07 19:24   ` Thomas Garnier
  2017-02-07 19:37 ` Nadav Amit
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Kees Cook @ 2017-02-07 19:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, linux-mm@kvack.org,
	Paul E. McKenney, Thomas Garnier

On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
> Quite a few people have expressed interest in enabling PCID on (x86)
> Linux.  Here's the code:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>
> The main hold-up is that the code needs to be reviewed very carefully.
> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> entries using PCID" ought to be looked at carefully to make sure the
> locking is right, but there are plenty of other ways this this could
> all break.
>
> Anyone want to take a look or maybe scare up some other reviewers?
> (Kees, you seemed *really* excited about getting this in.)

Yeah, I'd really like to build on it to gain SMAP emulation, though
both implementing that and reviewing the existing series is outside my
current skills (well, okay, you could add "Reviewed-by:"-me to the
first 3 patches ;)). I don't know Intel guts well enough to
meaningfully do anything on the others. :)

I've added Thomas Garnier to CC, in case this is something he might be
able to assist with.

Does this need benchmarking or other testing? Perhaps bring it to the
kernel-hardening list for that?

Also, what's needed to gain SMAP emulation?

-Kees

-- 
Kees Cook
Pixel Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 19:11 ` Kees Cook
@ 2017-02-07 19:24   ` Thomas Garnier
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Garnier @ 2017-02-07 19:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, linux-mm@kvack.org,
	Paul E. McKenney

On Tue, Feb 7, 2017 at 11:11 AM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> Quite a few people have expressed interest in enabling PCID on (x86)
>> Linux.  Here's the code:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>>
>> The main hold-up is that the code needs to be reviewed very carefully.
>> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
>> entries using PCID" ought to be looked at carefully to make sure the
>> locking is right, but there are plenty of other ways this this could
>> all break.
>>
>> Anyone want to take a look or maybe scare up some other reviewers?
>> (Kees, you seemed *really* excited about getting this in.)
>
> Yeah, I'd really like to build on it to gain SMAP emulation, though
> both implementing that and reviewing the existing series is outside my
> current skills (well, okay, you could add "Reviewed-by:"-me to the
> first 3 patches ;)). I don't know Intel guts well enough to
> meaningfully do anything on the others. :)
>
> I've added Thomas Garnier to CC, in case this is something he might be
> able to assist with.

It would be great to add but I have limited cycles and definitely
lacking knowledge on that front.

>
> Does this need benchmarking or other testing? Perhaps bring it to the
> kernel-hardening list for that?
>
> Also, what's needed to gain SMAP emulation?
>
> -Kees
>
> --
> Kees Cook
> Pixel Security



-- 
Thomas

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 18:56 PCID review? Andy Lutomirski
  2017-02-07 19:11 ` Kees Cook
@ 2017-02-07 19:37 ` Nadav Amit
  2017-02-08 16:24   ` Andy Lutomirski
  2017-02-07 21:06 ` Paul E. McKenney
  2017-02-08 20:51 ` Andy Lutomirski
  3 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2017-02-07 19:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Kees Cook, Dave Hansen, linux-mm@kvack.org,
	Paul E. McKenney

Just a small question: should try_to_unmap_flush() work with these changes? 

> On Feb 7, 2017, at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> Quite a few people have expressed interest in enabling PCID on (x86)
> Linux.  Here's the code:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> 
> The main hold-up is that the code needs to be reviewed very carefully.
> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> entries using PCID" ought to be looked at carefully to make sure the
> locking is right, but there are plenty of other ways this this could
> all break.
> 
> Anyone want to take a look or maybe scare up some other reviewers?
> (Kees, you seemed *really* excited about getting this in.)
> 
> --Andy
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 19:37 ` Nadav Amit
@ 2017-02-08 16:24   ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-08 16:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Borislav Petkov, Kees Cook, Dave Hansen,
	linux-mm@kvack.org, Paul E. McKenney

On Tue, Feb 7, 2017 at 11:37 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Just a small question: should try_to_unmap_flush() work with these changes?

Ugh, I probably need to fix that up.  Thanks!

>
>> On Feb 7, 2017, at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Quite a few people have expressed interest in enabling PCID on (x86)
>> Linux.  Here's the code:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>>
>> The main hold-up is that the code needs to be reviewed very carefully.
>> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
>> entries using PCID" ought to be looked at carefully to make sure the
>> locking is right, but there are plenty of other ways this this could
>> all break.
>>
>> Anyone want to take a look or maybe scare up some other reviewers?
>> (Kees, you seemed *really* excited about getting this in.)
>>
>> --Andy
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 18:56 PCID review? Andy Lutomirski
  2017-02-07 19:11 ` Kees Cook
  2017-02-07 19:37 ` Nadav Amit
@ 2017-02-07 21:06 ` Paul E. McKenney
  2017-02-08 16:25   ` Andy Lutomirski
  2017-02-08 20:51 ` Andy Lutomirski
  3 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2017-02-07 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Kees Cook, Dave Hansen, linux-mm@kvack.org

On Tue, Feb 07, 2017 at 10:56:59AM -0800, Andy Lutomirski wrote:
> Quite a few people have expressed interest in enabling PCID on (x86)
> Linux.  Here's the code:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> 
> The main hold-up is that the code needs to be reviewed very carefully.
> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> entries using PCID" ought to be looked at carefully to make sure the
> locking is right, but there are plenty of other ways this this could
> all break.
> 
> Anyone want to take a look or maybe scare up some other reviewers?
> (Kees, you seemed *really* excited about getting this in.)

Cool!

So I can drop 61ec4c556b0d "rcu: Maintain special bits at bottom of
->dynticks counter", correct?

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 21:06 ` Paul E. McKenney
@ 2017-02-08 16:25   ` Andy Lutomirski
  2017-02-08 16:52     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-08 16:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Borislav Petkov, Kees Cook, Dave Hansen,
	linux-mm@kvack.org

On Tue, Feb 7, 2017 at 1:06 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Feb 07, 2017 at 10:56:59AM -0800, Andy Lutomirski wrote:
>> Quite a few people have expressed interest in enabling PCID on (x86)
>> Linux.  Here's the code:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>>
>> The main hold-up is that the code needs to be reviewed very carefully.
>> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
>> entries using PCID" ought to be looked at carefully to make sure the
>> locking is right, but there are plenty of other ways this this could
>> all break.
>>
>> Anyone want to take a look or maybe scare up some other reviewers?
>> (Kees, you seemed *really* excited about getting this in.)
>
> Cool!
>
> So I can drop 61ec4c556b0d "rcu: Maintain special bits at bottom of
> ->dynticks counter", correct?

Nope.  That's a different optimization.  If you consider that patch
ready, want to email me, the dynticks folks, and linux-mm as a
reminder?

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-08 16:25   ` Andy Lutomirski
@ 2017-02-08 16:52     ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2017-02-08 16:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Borislav Petkov, Kees Cook, Dave Hansen,
	linux-mm@kvack.org

On Wed, Feb 08, 2017 at 08:25:16AM -0800, Andy Lutomirski wrote:
> On Tue, Feb 7, 2017 at 1:06 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Feb 07, 2017 at 10:56:59AM -0800, Andy Lutomirski wrote:
> >> Quite a few people have expressed interest in enabling PCID on (x86)
> >> Linux.  Here's the code:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> >>
> >> The main hold-up is that the code needs to be reviewed very carefully.
> >> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> >> entries using PCID" ought to be looked at carefully to make sure the
> >> locking is right, but there are plenty of other ways this this could
> >> all break.
> >>
> >> Anyone want to take a look or maybe scare up some other reviewers?
> >> (Kees, you seemed *really* excited about getting this in.)
> >
> > Cool!
> >
> > So I can drop 61ec4c556b0d "rcu: Maintain special bits at bottom of
> > ->dynticks counter", correct?
> 
> Nope.  That's a different optimization.  If you consider that patch
> ready, want to email me, the dynticks folks, and linux-mm as a
> reminder?

Hmmm...  Good point.  I never have gotten any review feedback, and I
don't have a specific test for it.  Let me take another look at it.
There is probably something still broken.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-07 18:56 PCID review? Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-02-07 21:06 ` Paul E. McKenney
@ 2017-02-08 20:51 ` Andy Lutomirski
  2017-02-09  0:10   ` Mel Gorman
  3 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-08 20:51 UTC (permalink / raw)
  To: Andy Lutomirski, Nadav Amit, Mel Gorman
  Cc: Borislav Petkov, Kees Cook, Dave Hansen, linux-mm@kvack.org,
	Paul E. McKenney

On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
> Quite a few people have expressed interest in enabling PCID on (x86)
> Linux.  Here's the code:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>
> The main hold-up is that the code needs to be reviewed very carefully.
> It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> entries using PCID" ought to be looked at carefully to make sure the
> locking is right, but there are plenty of other ways this this could
> all break.
>
> Anyone want to take a look or maybe scare up some other reviewers?
> (Kees, you seemed *really* excited about getting this in.)

Nadav pointed out that this doesn't work right with
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.  Mel, here's the issue:

I want to add ASID (Intel calls it PCID) support to x86.  This means
that "flush the TLB on a given CPU" will no longer be a particularly
well defined operation because it's not clear which ASID tag to flush.
Instead there's "flush the TLB for a given mm on a given CPU".

If I'm understanding the batched flush code, all it's trying to do is
to flush more than one mm at a time.  Would it make sense to add a new
arch API to flush more than one mm?  Presumably it would take a linked
list, and the batched flush code would fall back to flushing in pieces
if it can't allocate a new linked list node when needed.

Thoughts?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-08 20:51 ` Andy Lutomirski
@ 2017-02-09  0:10   ` Mel Gorman
  2017-02-10  2:46     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2017-02-09  0:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Wed, Feb 08, 2017 at 12:51:24PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > Quite a few people have expressed interest in enabling PCID on (x86)
> > Linux.  Here's the code:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> >
> > The main hold-up is that the code needs to be reviewed very carefully.
> > It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
> > entries using PCID" ought to be looked at carefully to make sure the
> > locking is right, but there are plenty of other ways this this could
> > all break.
> >
> > Anyone want to take a look or maybe scare up some other reviewers?
> > (Kees, you seemed *really* excited about getting this in.)
> 
> Nadav pointed out that this doesn't work right with
> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.  Mel, here's the issue:
> 

I confess I didn't read the thread or patch, I'm only looking at this
question.

> I want to add ASID (Intel calls it PCID) support to x86.  This means
> that "flush the TLB on a given CPU" will no longer be a particularly
> well defined operation because it's not clear which ASID tag to flush.
> Instead there's "flush the TLB for a given mm on a given CPU".
> 

Understood.

> If I'm understanding the batched flush code, all it's trying to do is
> to flush more than one mm at a time. 

More or less. It would be more accurate to say it's flushing any CPU TLB
that potentially accessed a list of pages recently.  It's not flushing
"multiple MMs at a time" as such, it's flushing "only CPUs that accessed
pages mapped by a mm recently". The distinction may not matter.

The requirements do matter though.

mm/vmscan.c is where TTU_BATCH_FLUSH is set and that is processing a list
of pages that can be mapped by multiple MMs.

The TLBs must be flushed before either IO starts (try_to_unmap_flush_dirty)
or they are freed to the page allocator (try_to_unmap_flush).

To do this, all it has to track is a simple mask of CPUs, whether a flush
is necessary and whether any of the PTEs were dirty. This is trivial to
assemble during an rmap walk as it's a PTE check and a cpumask_or.

try_to_unmap_flush then flushes the entire TLB as the cost of targetted
a specific page to flush was so high (both maintaining the PFNs and the
individual flush operations).

> Would it make sense to add a new
> arch API to flush more than one mm?  Presumably it would take a linked
> list, and the batched flush code would fall back to flushing in pieces
> if it can't allocate a new linked list node when needed.
> 

Conceptually it's ok but the details are a headache.

The defer code would need to maintain a list of mm's (or ASIDs) that is
unbounded in size to match the number of IPIs sent as the current code as
opposed to a simple cpumask. There are SWAP_CLUSTER_MAX pages to consider
with each page potentially mapped by an arbitrary number of MMs. The same
mm's could exist on multiple lists for each active kswapd instance and
direct reclaimer.

As multiple reclaimers are interested in the same mm, that pretty much
rules out linking them off mm_struct unless the locking would serialise
the parallel reclaimers and prevent an mm existing on more than one list
at a time. You could try allowing multiple tasks to share the one list
(not sure how to find that list quickly) but each entry would have to
be locked and as each user can flush at any time, multiple reclaimers
potentially have to block while an IPI is being sent. It's hard to see
how this could be scaled to match the existing code.

It would be easier to track via an array stored in task_struct but the
number of MMs is unknown in advance so all you can do is guess a reasonable
size. It would have to flush if the array files resulting in more IPIs
than the current code depending on how many MMs map the list of pages.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-09  0:10   ` Mel Gorman
@ 2017-02-10  2:46     ` Andy Lutomirski
  2017-02-10 11:01       ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-10  2:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Wed, Feb 8, 2017 at 4:10 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Wed, Feb 08, 2017 at 12:51:24PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > Quite a few people have expressed interest in enabling PCID on (x86)
>> > Linux.  Here's the code:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>> >
>> > The main hold-up is that the code needs to be reviewed very carefully.
>> > It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
>> > entries using PCID" ought to be looked at carefully to make sure the
>> > locking is right, but there are plenty of other ways this this could
>> > all break.
>> >
>> > Anyone want to take a look or maybe scare up some other reviewers?
>> > (Kees, you seemed *really* excited about getting this in.)
>>
>> Nadav pointed out that this doesn't work right with
>> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.  Mel, here's the issue:
>>
>
> I confess I didn't read the thread or patch, I'm only looking at this
> question.

That's fine with me :)

>
>> I want to add ASID (Intel calls it PCID) support to x86.  This means
>> that "flush the TLB on a given CPU" will no longer be a particularly
>> well defined operation because it's not clear which ASID tag to flush.
>> Instead there's "flush the TLB for a given mm on a given CPU".
>>
>
> Understood.
>
>> If I'm understanding the batched flush code, all it's trying to do is
>> to flush more than one mm at a time.
>
> More or less. It would be more accurate to say it's flushing any CPU TLB
> that potentially accessed a list of pages recently.  It's not flushing
> "multiple MMs at a time" as such, it's flushing "only CPUs that accessed
> pages mapped by a mm recently". The distinction may not matter.
>
> The requirements do matter though.
>
> mm/vmscan.c is where TTU_BATCH_FLUSH is set and that is processing a list
> of pages that can be mapped by multiple MMs.
>
> The TLBs must be flushed before either IO starts (try_to_unmap_flush_dirty)
> or they are freed to the page allocator (try_to_unmap_flush).
>
> To do this, all it has to track is a simple mask of CPUs, whether a flush
> is necessary and whether any of the PTEs were dirty. This is trivial to
> assemble during an rmap walk as it's a PTE check and a cpumask_or.
>
> try_to_unmap_flush then flushes the entire TLB as the cost of targetted
> a specific page to flush was so high (both maintaining the PFNs and the
> individual flush operations).

I could just maybe make it possible to remotely poke a CPU to record
which mms need flushing, but the possible races there are a bit
terrifying.

>
>> Would it make sense to add a new
>> arch API to flush more than one mm?  Presumably it would take a linked
>> list, and the batched flush code would fall back to flushing in pieces
>> if it can't allocate a new linked list node when needed.
>>
>
> Conceptually it's ok but the details are a headache.
>
> The defer code would need to maintain a list of mm's (or ASIDs) that is
> unbounded in size to match the number of IPIs sent as the current code as
> opposed to a simple cpumask. There are SWAP_CLUSTER_MAX pages to consider
> with each page potentially mapped by an arbitrary number of MMs. The same
> mm's could exist on multiple lists for each active kswapd instance and
> direct reclaimer.
>
> As multiple reclaimers are interested in the same mm, that pretty much
> rules out linking them off mm_struct unless the locking would serialise
> the parallel reclaimers and prevent an mm existing on more than one list
> at a time. You could try allowing multiple tasks to share the one list
> (not sure how to find that list quickly) but each entry would have to
> be locked and as each user can flush at any time, multiple reclaimers
> potentially have to block while an IPI is being sent. It's hard to see
> how this could be scaled to match the existing code.
>
> It would be easier to track via an array stored in task_struct but the
> number of MMs is unknown in advance so all you can do is guess a reasonable
> size. It would have to flush if the array files resulting in more IPIs
> than the current code depending on how many MMs map the list of pages.

What if I just allocate a smallish array on the stack and then extend
with kmalloc(GFP_ATOMIC) as needed?  An allocation failure would just
force an immediate flush, so there shouldn't be any deadlock risk.

Anyway, I need to rework the arch code to make this work at all.
Currently I'm taking a spinlock per mm when flushing that mm, but that
would mean I need to *sort* the list to flush more than one at a time,
and that just sounds nasty.  I can probably get rid of the spinlock.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10  2:46     ` Andy Lutomirski
@ 2017-02-10 11:01       ` Mel Gorman
  2017-02-10 16:44         ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2017-02-10 11:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Thu, Feb 09, 2017 at 06:46:57PM -0800, Andy Lutomirski wrote:
> > try_to_unmap_flush then flushes the entire TLB as the cost of targetted
> > a specific page to flush was so high (both maintaining the PFNs and the
> > individual flush operations).
> 
> I could just maybe make it possible to remotely poke a CPU to record
> which mms need flushing, but the possible races there are a bit
> terrifying.
> 

The overhead is concerning. You may incur a remote cache miss accessing the
data which is costly or you have to send an IPI which is also severe. You
could attempt to do the same as the scheduler and directly modify if the
CPUs share cache and IPI otherwise but you're looking at a lot of overhead
either way.

> >
> >> Would it make sense to add a new
> >> arch API to flush more than one mm?  Presumably it would take a linked
> >> list, and the batched flush code would fall back to flushing in pieces
> >> if it can't allocate a new linked list node when needed.
> >>
> >
> > Conceptually it's ok but the details are a headache.
> >
> > The defer code would need to maintain a list of mm's (or ASIDs) that is
> > unbounded in size to match the number of IPIs sent as the current code as
> > opposed to a simple cpumask. There are SWAP_CLUSTER_MAX pages to consider
> > with each page potentially mapped by an arbitrary number of MMs. The same
> > mm's could exist on multiple lists for each active kswapd instance and
> > direct reclaimer.
> >
> > As multiple reclaimers are interested in the same mm, that pretty much
> > rules out linking them off mm_struct unless the locking would serialise
> > the parallel reclaimers and prevent an mm existing on more than one list
> > at a time. You could try allowing multiple tasks to share the one list
> > (not sure how to find that list quickly) but each entry would have to
> > be locked and as each user can flush at any time, multiple reclaimers
> > potentially have to block while an IPI is being sent. It's hard to see
> > how this could be scaled to match the existing code.
> >
> > It would be easier to track via an array stored in task_struct but the
> > number of MMs is unknown in advance so all you can do is guess a reasonable
> > size. It would have to flush if the array files resulting in more IPIs
> > than the current code depending on how many MMs map the list of pages.
> 
> What if I just allocate a smallish array on the stack and then extend
> with kmalloc(GFP_ATOMIC) as needed?  An allocation failure would just
> force an immediate flush, so there shouldn't be any deadlock risk.
> 

It won't deadlock but it's an atomic allocation (which accesses reserves)
at the time when we are definitely reclaiming with a fallback being an IPI
the current code would avoid. It'll indirectly increase risks of other
atomic allocation failures although that risk is slight. The allocation
in that context will still raise eyebrows and it made me wince. I know I
recently considered doing an atomic allocation under similar circumstances
but it was fine to completely fail the allocation and a day later, I got
rid of it anyway.

> Anyway, I need to rework the arch code to make this work at all.
> Currently I'm taking a spinlock per mm when flushing that mm, but that
> would mean I need to *sort* the list to flush more than one at a time,
> and that just sounds nasty.  I can probably get rid of the spinlock.
> 

That all sounds fairly nasty. Don't get me wrong, I think you can make
it functionally work but it's a severe uphill battle.

The key concern that it'll be evaluated against is that any complexity has
to be less than doing a "batched full TLB flush and refill". The refill is
expected to be cheap as the page table structures are likely to be cache hot.
It was way cheaper than trying to be clever about flushing individual TLB
entries. I recognise that you'll be trying to balance this against processes
that are carefully isolated that do not want interference from unrelated
processes doing a TLB flush but it'll be hard to prove that it's worth it.

It's almost certain that this will be Linus' primary concern
given his contributions to similar conversations in the past
(e.g. https://lkml.org/lkml/2015/6/25/666). It's also likely to be of
major concern to Ingo (e.g. https://lkml.org/lkml/2015/6/9/276) as he had
valid objections against clever flushing at the time the batching was
introduced. Based on previous experience, I have my own concerns but I
don't count as I'm highlighing them now :P

The outcome of the TLB batch flushiing discussion was that it was way
cheaper to flush the full TLB and take the refill cost than flushing
individual pages which had the cost of tracking the PFNs and the cost of
each individual page flush operation.

The current code is basically "build a cpumask and flush the TLB for
multiple entries". We're talking about complex tracking of mm's with
difficult locking, potential remote cache misses, potentially more IPIs or
alternatively doing allocations from reclaim context. It'll be difficult
to prove that doing this in the name of flushing ASID is cheaper and
universally a good idea than just flushing the entire TLB.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 11:01       ` Mel Gorman
@ 2017-02-10 16:44         ` Andy Lutomirski
  2017-02-10 21:57           ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-10 16:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 3:01 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Thu, Feb 09, 2017 at 06:46:57PM -0800, Andy Lutomirski wrote:
>> > try_to_unmap_flush then flushes the entire TLB as the cost of targetted
>> > a specific page to flush was so high (both maintaining the PFNs and the
>> > individual flush operations).
>>
>> I could just maybe make it possible to remotely poke a CPU to record
>> which mms need flushing, but the possible races there are a bit
>> terrifying.
>>
>
> The overhead is concerning. You may incur a remote cache miss accessing the
> data which is costly or you have to send an IPI which is also severe. You
> could attempt to do the same as the scheduler and directly modify if the
> CPUs share cache and IPI otherwise but you're looking at a lot of overhead
> either way.

I think all of these approaches suck and I'll give up on this particular avenue.

>> Anyway, I need to rework the arch code to make this work at all.
>> Currently I'm taking a spinlock per mm when flushing that mm, but that
>> would mean I need to *sort* the list to flush more than one at a time,
>> and that just sounds nasty.  I can probably get rid of the spinlock.
>>
>
> That all sounds fairly nasty. Don't get me wrong, I think you can make
> it functionally work but it's a severe uphill battle.
>
> The key concern that it'll be evaluated against is that any complexity has
> to be less than doing a "batched full TLB flush and refill". The refill is
> expected to be cheap as the page table structures are likely to be cache hot.
> It was way cheaper than trying to be clever about flushing individual TLB
> entries.

You're assuming that Intel CPUs are more sensible than they really
are.  My suspicion, based on some benchmarking, is that "switch pgd
and clear the TLB" is very slow because of the "clear the TLB bit" --
that is, above and beyond the refill cost, merely clearing the TLB
takes hundreds of cycles.  So I'd like to minimize unnecessary
flushes.  Unfortunately, "flush the TLB for all ASIDs" and "flush
everything" are also both very very slow.

Sigh.

> I recognise that you'll be trying to balance this against processes
> that are carefully isolated that do not want interference from unrelated
> processes doing a TLB flush but it'll be hard to prove that it's worth it.
>
> It's almost certain that this will be Linus' primary concern
> given his contributions to similar conversations in the past
> (e.g. https://lkml.org/lkml/2015/6/25/666). It's also likely to be of
> major concern to Ingo (e.g. https://lkml.org/lkml/2015/6/9/276) as he had
> valid objections against clever flushing at the time the batching was
> introduced. Based on previous experience, I have my own concerns but I
> don't count as I'm highlighing them now :P

I fully agree with those objections, but back then we didn't have the
capability to avoid a flush when switching mms.

All that being said, I agree: making this stuff too complicated is a bad idea.

>
> The outcome of the TLB batch flushiing discussion was that it was way
> cheaper to flush the full TLB and take the refill cost than flushing
> individual pages which had the cost of tracking the PFNs and the cost of
> each individual page flush operation.
>
> The current code is basically "build a cpumask and flush the TLB for
> multiple entries". We're talking about complex tracking of mm's with
> difficult locking, potential remote cache misses, potentially more IPIs or
> alternatively doing allocations from reclaim context. It'll be difficult
> to prove that doing this in the name of flushing ASID is cheaper and
> universally a good idea than just flushing the entire TLB.
>

Maybe there's a middle ground.  I could keep track of whether more
than one mm is targetted in a deferred flush and just flush everything
if so.  As a future improvement, I or someone else could add:

struct mm_struct *mms[16];
int num_mms;

to struct tlbflush_unmap_batch.  if num_mms > 16, then this just means
that we've given up on tracking them all and we do the global flush,
and, if not, we could teach the IPI handler to understand a list of
target mms.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 16:44         ` Andy Lutomirski
@ 2017-02-10 21:57           ` Mel Gorman
  2017-02-10 22:07             ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2017-02-10 21:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 08:44:04AM -0800, Andy Lutomirski wrote:
> On Fri, Feb 10, 2017 at 3:01 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > On Thu, Feb 09, 2017 at 06:46:57PM -0800, Andy Lutomirski wrote:
> >> > try_to_unmap_flush then flushes the entire TLB as the cost of targetted
> >> > a specific page to flush was so high (both maintaining the PFNs and the
> >> > individual flush operations).
> >>
> >> I could just maybe make it possible to remotely poke a CPU to record
> >> which mms need flushing, but the possible races there are a bit
> >> terrifying.
> >>
> >
> > The overhead is concerning. You may incur a remote cache miss accessing the
> > data which is costly or you have to send an IPI which is also severe. You
> > could attempt to do the same as the scheduler and directly modify if the
> > CPUs share cache and IPI otherwise but you're looking at a lot of overhead
> > either way.
> 
> I think all of these approaches suck and I'll give up on this particular avenue.
> 

Ok, probably for the best albeit that is based on an inability to figure
out how it could be done efficiently and a suspicion that if it could be
done, the scheduler would be doing it already.

> >> Anyway, I need to rework the arch code to make this work at all.
> >> Currently I'm taking a spinlock per mm when flushing that mm, but that
> >> would mean I need to *sort* the list to flush more than one at a time,
> >> and that just sounds nasty.  I can probably get rid of the spinlock.
> >>
> >
> > That all sounds fairly nasty. Don't get me wrong, I think you can make
> > it functionally work but it's a severe uphill battle.
> >
> > The key concern that it'll be evaluated against is that any complexity has
> > to be less than doing a "batched full TLB flush and refill". The refill is
> > expected to be cheap as the page table structures are likely to be cache hot.
> > It was way cheaper than trying to be clever about flushing individual TLB
> > entries.
> 
> You're assuming that Intel CPUs are more sensible than they really
> are.  My suspicion, based on some benchmarking, is that "switch pgd
> and clear the TLB" is very slow because of the "clear the TLB bit" --
> that is, above and beyond the refill cost, merely clearing the TLB
> takes hundreds of cycles.  So I'd like to minimize unnecessary
> flushes.  Unfortunately, "flush the TLB for all ASIDs" and "flush
> everything" are also both very very slow.
> 

And so is flushing per page. However, you make an interesting point.
Based on previous benchmarks and evaluation, the conclusion was reached
that targeted per-page flushing was too slow and a full flush was
faster. Several hazards were identified on how it could be even measured
on out-of-order processors and other difficulties so your benchmark will
be called into question [1]. MMtests has reference to an old tlb benchmark
(config-global-dhp__tlbflush-performance) but even that was considered
to be flawed but there were not any alternatives. I have not used it in
a long time due to the level of uncertainity it had.

There was the caveat that processors like Atom cared and this was dismissed
on the grounds the processor was "crippled" (at the time) and such concerns
just didn't apply on the general case. Other concerns were raised that
Knights Landing might care but that was conference talk and it was never
pushed hard that I can recall.

If you have a methodology that proves that the fullflush+refill is
a terrible idea (even if it's CPU-specific) then it'll be important to
include it in the changelog. There will be some attacking on the benchmark
and the methodology but that's to be expected. If you're right for the
processors that are capable then it'll be fine.

But the starting point of the discussionm, not necessarily from me, will be
that Intel CPUs are super fast at refills and the flush cost doesn't matter
because the context switch clears it anyway so complexity should be as low
as possible. There may be some comments that things like ASID were great on
processors that had crippled TLBs (rightly or wrongly).  ASIDs alter the
equation but then you'll be tackled on how often the ASIDs is preserved
and that it may be workload dependant and you may get some hand waving
about how many ASIDs are available[2]. It depends on the specifics of the
Intel implementation which I haven't looked up but it will need to be in
the changelog or you'll get bogged down in "this doesn't matter" discussions.

It's possible that covering all of this is overkill but it's the avenues
of concern I'd expect if I was working on ASID support.

[1] I could be completely wrong, I'm basing this on how people have
    behaved in the past during TLB-flush related discussions. They
    might have changed their mind.

[2] This could be covered already in the specifications and other
    discussions. Again, I didn't actually look into what's truly new with
    the Intel ASID.

> > I recognise that you'll be trying to balance this against processes
> > that are carefully isolated that do not want interference from unrelated
> > processes doing a TLB flush but it'll be hard to prove that it's worth it.
> >
> > It's almost certain that this will be Linus' primary concern
> > given his contributions to similar conversations in the past
> > (e.g. https://lkml.org/lkml/2015/6/25/666). It's also likely to be of
> > major concern to Ingo (e.g. https://lkml.org/lkml/2015/6/9/276) as he had
> > valid objections against clever flushing at the time the batching was
> > introduced. Based on previous experience, I have my own concerns but I
> > don't count as I'm highlighing them now :P
> 
> I fully agree with those objections, but back then we didn't have the
> capability to avoid a flush when switching mms.
> 

True, so watch for questions on what the odds are of switching an MM will
flush the TLB information anyway due to replacement policies.

> >
> > The outcome of the TLB batch flushiing discussion was that it was way
> > cheaper to flush the full TLB and take the refill cost than flushing
> > individual pages which had the cost of tracking the PFNs and the cost of
> > each individual page flush operation.
> >
> > The current code is basically "build a cpumask and flush the TLB for
> > multiple entries". We're talking about complex tracking of mm's with
> > difficult locking, potential remote cache misses, potentially more IPIs or
> > alternatively doing allocations from reclaim context. It'll be difficult
> > to prove that doing this in the name of flushing ASID is cheaper and
> > universally a good idea than just flushing the entire TLB.
> >
> 
> Maybe there's a middle ground.  I could keep track of whether more
> than one mm is targetted in a deferred flush and just flush everything
> if so.

That would work and side-steps much of the state tracking concerns. It
might even be a good fit for use cases like "limited number of VMs on a
machine" or "one major application that must be isolated and some admin
processes with little CPU time or kthreads" because you don't want to get
burned with the "only a microbenchmark sees any benefit" hammer[3].

> As a future improvement, I or someone else could add:
> 
> struct mm_struct *mms[16];
> int num_mms;
> 
> to struct tlbflush_unmap_batch.  if num_mms > 16, then this just means
> that we've given up on tracking them all and we do the global flush,
> and, if not, we could teach the IPI handler to understand a list of
> target mms.

I *much* prefer a fallback of a full flush than kmallocing additional
space. It's also something that feasibly could be switchable at runtime with
a union of cpumask and an array of mms depending on the CPU capabilities with
static branches determining which is used to minimise overhead.  That would
have only minor overhead and with a debugging patch could allow switching
between them at boot-time for like-like comparisons on a range of workloads.

[3] Can you tell I've been burned a few times by the "only
    microbenchmarks care" feedback?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 21:57           ` Mel Gorman
@ 2017-02-10 22:07             ` Andy Lutomirski
  2017-02-10 22:25               ` Borislav Petkov
  2017-02-13 10:05               ` Mel Gorman
  0 siblings, 2 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-10 22:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 1:57 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Fri, Feb 10, 2017 at 08:44:04AM -0800, Andy Lutomirski wrote:
>> On Fri, Feb 10, 2017 at 3:01 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> > On Thu, Feb 09, 2017 at 06:46:57PM -0800, Andy Lutomirski wrote:
>> >> > try_to_unmap_flush then flushes the entire TLB as the cost of targetted
>> >> > a specific page to flush was so high (both maintaining the PFNs and the
>> >> > individual flush operations).
>> >>
>> >> I could just maybe make it possible to remotely poke a CPU to record
>> >> which mms need flushing, but the possible races there are a bit
>> >> terrifying.
>> >>
>> >
>> > The overhead is concerning. You may incur a remote cache miss accessing the
>> > data which is costly or you have to send an IPI which is also severe. You
>> > could attempt to do the same as the scheduler and directly modify if the
>> > CPUs share cache and IPI otherwise but you're looking at a lot of overhead
>> > either way.
>>
>> I think all of these approaches suck and I'll give up on this particular avenue.
>>
>
> Ok, probably for the best albeit that is based on an inability to figure
> out how it could be done efficiently and a suspicion that if it could be
> done, the scheduler would be doing it already.
>

FWIW, I am doing a bit of this.  For remote CPUs that aren't currently
running a given mm, I just bump a per-mm generation count so that they
know to flush next time around in switch_mm().  I'll need to add a new
hook to the batched flush code to get this right, and I'll cc you on
that.  Stay tuned.

> It's possible that covering all of this is overkill but it's the avenues
> of concern I'd expect if I was working on ASID support.

Agreed.

>
> [1] I could be completely wrong, I'm basing this on how people have
>     behaved in the past during TLB-flush related discussions. They
>     might have changed their mind.

We'll see.  The main benchmark that I'm relying on (so far) is that
context switches get way faster, just ping ponging back and forth.  I
suspect that the TLB refill cost is only a small part.

>
> [2] This could be covered already in the specifications and other
>     discussions. Again, I didn't actually look into what's truly new with
>     the Intel ASID.

I suspect I could find out how many ASIDs there really are under NDA,
but even that would be challenging and only dubiously useful.  For
now, I'm using a grand total of four ASIDs. :)

>
>> > I recognise that you'll be trying to balance this against processes
>> > that are carefully isolated that do not want interference from unrelated
>> > processes doing a TLB flush but it'll be hard to prove that it's worth it.
>> >
>> > It's almost certain that this will be Linus' primary concern
>> > given his contributions to similar conversations in the past
>> > (e.g. https://lkml.org/lkml/2015/6/25/666). It's also likely to be of
>> > major concern to Ingo (e.g. https://lkml.org/lkml/2015/6/9/276) as he had
>> > valid objections against clever flushing at the time the batching was
>> > introduced. Based on previous experience, I have my own concerns but I
>> > don't count as I'm highlighing them now :P
>>
>> I fully agree with those objections, but back then we didn't have the
>> capability to avoid a flush when switching mms.
>>
>
> True, so watch for questions on what the odds are of switching an MM will
> flush the TLB information anyway due to replacement policies.
>
>> >
>> > The outcome of the TLB batch flushiing discussion was that it was way
>> > cheaper to flush the full TLB and take the refill cost than flushing
>> > individual pages which had the cost of tracking the PFNs and the cost of
>> > each individual page flush operation.
>> >
>> > The current code is basically "build a cpumask and flush the TLB for
>> > multiple entries". We're talking about complex tracking of mm's with
>> > difficult locking, potential remote cache misses, potentially more IPIs or
>> > alternatively doing allocations from reclaim context. It'll be difficult
>> > to prove that doing this in the name of flushing ASID is cheaper and
>> > universally a good idea than just flushing the entire TLB.
>> >
>>
>> Maybe there's a middle ground.  I could keep track of whether more
>> than one mm is targetted in a deferred flush and just flush everything
>> if so.
>
> That would work and side-steps much of the state tracking concerns. It
> might even be a good fit for use cases like "limited number of VMs on a
> machine" or "one major application that must be isolated and some admin
> processes with little CPU time or kthreads" because you don't want to get
> burned with the "only a microbenchmark sees any benefit" hammer[3].
>
>> As a future improvement, I or someone else could add:
>>
>> struct mm_struct *mms[16];
>> int num_mms;
>>
>> to struct tlbflush_unmap_batch.  if num_mms > 16, then this just means
>> that we've given up on tracking them all and we do the global flush,
>> and, if not, we could teach the IPI handler to understand a list of
>> target mms.
>
> I *much* prefer a fallback of a full flush than kmallocing additional
> space. It's also something that feasibly could be switchable at runtime with
> a union of cpumask and an array of mms depending on the CPU capabilities with
> static branches determining which is used to minimise overhead.  That would
> have only minor overhead and with a debugging patch could allow switching
> between them at boot-time for like-like comparisons on a range of workloads.

Sounds good.  This means I need to make my code understand the concept
of a full flush, but that's manageable.

>
> [3] Can you tell I've been burned a few times by the "only
>     microbenchmarks care" feedback?
>

:)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 22:07             ` Andy Lutomirski
@ 2017-02-10 22:25               ` Borislav Petkov
  2017-02-10 22:58                 ` Andy Lutomirski
  2017-02-13 10:05               ` Mel Gorman
  1 sibling, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2017-02-10 22:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mel Gorman, Andy Lutomirski, Nadav Amit, Kees Cook, Dave Hansen,
	linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 02:07:19PM -0800, Andy Lutomirski wrote:
> We'll see.  The main benchmark that I'm relying on (so far) is that
> context switches get way faster, just ping ponging back and forth.  I
> suspect that the TLB refill cost is only a small part.

Is that a microbenchmark or something more "presentable"?

We really should pay attention to the complexity and what that actually
brings us in the end.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 22:25               ` Borislav Petkov
@ 2017-02-10 22:58                 ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2017-02-10 22:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mel Gorman, Andy Lutomirski, Nadav Amit, Kees Cook, Dave Hansen,
	linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 2:25 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Feb 10, 2017 at 02:07:19PM -0800, Andy Lutomirski wrote:
>> We'll see.  The main benchmark that I'm relying on (so far) is that
>> context switches get way faster, just ping ponging back and forth.  I
>> suspect that the TLB refill cost is only a small part.
>
> Is that a microbenchmark or something more "presentable"?

It's a microbenchmark, but the change is fairly large.  It would be
nice to see what the effect is on real workloads.

>
> We really should pay attention to the complexity and what that actually
> brings us in the end.

Agreed.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PCID review?
  2017-02-10 22:07             ` Andy Lutomirski
  2017-02-10 22:25               ` Borislav Petkov
@ 2017-02-13 10:05               ` Mel Gorman
  1 sibling, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2017-02-13 10:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Nadav Amit, Borislav Petkov, Kees Cook,
	Dave Hansen, linux-mm@kvack.org, Paul E. McKenney

On Fri, Feb 10, 2017 at 02:07:19PM -0800, Andy Lutomirski wrote:
> > Ok, probably for the best albeit that is based on an inability to figure
> > out how it could be done efficiently and a suspicion that if it could be
> > done, the scheduler would be doing it already.
> >
> 
> FWIW, I am doing a bit of this.  For remote CPUs that aren't currently
> running a given mm, I just bump a per-mm generation count so that they
> know to flush next time around in switch_mm().  I'll need to add a new
> hook to the batched flush code to get this right, and I'll cc you on
> that.  Stay tuned.
> 

Ok, thanks.

> > [1] I could be completely wrong, I'm basing this on how people have
> >     behaved in the past during TLB-flush related discussions. They
> >     might have changed their mind.
> 
> We'll see.  The main benchmark that I'm relying on (so far) is that
> context switches get way faster, just ping ponging back and forth.  I
> suspect that the TLB refill cost is only a small part.
> 

Note that such a benchmark is not going to measure the TLB flush cost.
In itself, this is not bad but I suspect that the applications that care
about interference from TLB flushes by unrelated processes are not
applications that are context-switch intensive.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-02-13 10:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-02-07 18:56 PCID review? Andy Lutomirski
2017-02-07 19:11 ` Kees Cook
2017-02-07 19:24   ` Thomas Garnier
2017-02-07 19:37 ` Nadav Amit
2017-02-08 16:24   ` Andy Lutomirski
2017-02-07 21:06 ` Paul E. McKenney
2017-02-08 16:25   ` Andy Lutomirski
2017-02-08 16:52     ` Paul E. McKenney
2017-02-08 20:51 ` Andy Lutomirski
2017-02-09  0:10   ` Mel Gorman
2017-02-10  2:46     ` Andy Lutomirski
2017-02-10 11:01       ` Mel Gorman
2017-02-10 16:44         ` Andy Lutomirski
2017-02-10 21:57           ` Mel Gorman
2017-02-10 22:07             ` Andy Lutomirski
2017-02-10 22:25               ` Borislav Petkov
2017-02-10 22:58                 ` Andy Lutomirski
2017-02-13 10:05               ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).