All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, Christoph Hellwig <hch@lst.de>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
	linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 12:31:04 +0200	[thread overview]
Message-ID: <87ttwb5jx3.ffs@tglx> (raw)
In-Reply-To: <ABA0B923-FD56-4787-9ED1-994BEA13C496@gmail.com>

Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether 1 IPI with
> potentially global flush or detailed list of ranges to flush.  

Correct and there is obviously a tradeoff too, which has yet to be
determined.

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.

Key is a subsequent instruction which might depend on the to be flushed
TLB entry. That's obvious, but I'm having a hard time to construct that
dependent intruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are not
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed whether you have 1 remote core or N
> remote core is not a whole issue (putting aside NUMA). But you will get
> first a snoop to the initiator cache by the responding core, and then,
> after the TLB invalidation is completed, an RFO by the initiator once
> it writes to the cache again. If the invalidation data is on the stack
> (as you did), this is even more likely to happen shortly after.

That's correct and there might be smarter ways to handle that list muck.

>> So just dismissing this on 10 years old experience is not really
>> helpful, though I'm happy to confirm your points once I had the time and
>> opportunity to actually run real testing over it, unless you beat me to
>> it.
>
> I really don’t know what “dismissing” you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of the scope) that I saw having performance
> benefit. That’s all to say that caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of a
decision making to investigate that.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric on not even necessarily true on x86.
>> 
>> Whether or not this is benefitial on x86 that's a completey separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially that the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense of batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.

The way how arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case having the list with individual ranges is not
really wrong.

Maybe ARM[64] could do this smarter, but that would require to rewrite a
lot of code I assume.

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs' is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process bound
>> seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won’t), but isn’t
> the seccomp/BPF VMAP ranges are mapped on all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make senes, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Thomas Gleixner <tglx@linutronix.de>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, Christoph Hellwig <hch@lst.de>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
	linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 12:31:04 +0200	[thread overview]
Message-ID: <87ttwb5jx3.ffs@tglx> (raw)
In-Reply-To: <ABA0B923-FD56-4787-9ED1-994BEA13C496@gmail.com>

Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether 1 IPI with
> potentially global flush or detailed list of ranges to flush.  

Correct and there is obviously a tradeoff too, which has yet to be
determined.

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.

Key is a subsequent instruction which might depend on the to be flushed
TLB entry. That's obvious, but I'm having a hard time to construct that
dependent intruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are not
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed whether you have 1 remote core or N
> remote core is not a whole issue (putting aside NUMA). But you will get
> first a snoop to the initiator cache by the responding core, and then,
> after the TLB invalidation is completed, an RFO by the initiator once
> it writes to the cache again. If the invalidation data is on the stack
> (as you did), this is even more likely to happen shortly after.

That's correct and there might be smarter ways to handle that list muck.

>> So just dismissing this on 10 years old experience is not really
>> helpful, though I'm happy to confirm your points once I had the time and
>> opportunity to actually run real testing over it, unless you beat me to
>> it.
>
> I really don’t know what “dismissing” you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of the scope) that I saw having performance
> benefit. That’s all to say that caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of a
decision making to investigate that.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric on not even necessarily true on x86.
>> 
>> Whether or not this is benefitial on x86 that's a completey separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially that the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense of batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.

The way how arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case having the list with individual ranges is not
really wrong.

Maybe ARM[64] could do this smarter, but that would require to rewrite a
lot of code I assume.

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs' is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process bound
>> seccomp/BPF muck clearly does not justify to kick isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won’t), but isn’t
> the seccomp/BPF VMAP ranges are mapped on all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make senes, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx


  reply	other threads:[~2023-05-17 10:31 UTC|newest]

Thread overview: 150+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-15 16:43 Excessive TLB flush ranges Thomas Gleixner
2023-05-15 16:43 ` Thomas Gleixner
2023-05-15 16:59 ` Russell King (Oracle)
2023-05-15 16:59   ` Russell King (Oracle)
2023-05-15 19:46   ` Thomas Gleixner
2023-05-15 19:46     ` Thomas Gleixner
2023-05-15 21:11     ` Thomas Gleixner
2023-05-15 21:11       ` Thomas Gleixner
2023-05-15 21:31       ` Russell King (Oracle)
2023-05-15 21:31         ` Russell King (Oracle)
2023-05-16  6:37         ` Thomas Gleixner
2023-05-16  6:37           ` Thomas Gleixner
2023-05-16  6:46           ` Thomas Gleixner
2023-05-16  6:46             ` Thomas Gleixner
2023-05-16  8:18           ` Thomas Gleixner
2023-05-16  8:18             ` Thomas Gleixner
2023-05-16  8:20             ` Thomas Gleixner
2023-05-16  8:20               ` Thomas Gleixner
2023-05-16  8:27               ` Russell King (Oracle)
2023-05-16  8:27                 ` Russell King (Oracle)
2023-05-16  9:03                 ` Thomas Gleixner
2023-05-16  9:03                   ` Thomas Gleixner
2023-05-16 10:05                   ` Baoquan He
2023-05-16 10:05                     ` Baoquan He
2023-05-16 14:21                     ` Thomas Gleixner
2023-05-16 14:21                       ` Thomas Gleixner
2023-05-16 19:03                       ` Thomas Gleixner
2023-05-16 19:03                         ` Thomas Gleixner
2023-05-17  9:38                         ` Thomas Gleixner
2023-05-17  9:38                           ` Thomas Gleixner
2023-05-17 10:52                           ` Baoquan He
2023-05-17 10:52                             ` Baoquan He
2023-05-19 11:22                             ` Thomas Gleixner
2023-05-19 11:22                               ` Thomas Gleixner
2023-05-19 11:49                               ` Baoquan He
2023-05-19 11:49                                 ` Baoquan He
2023-05-19 14:13                                 ` Thomas Gleixner
2023-05-19 14:13                                   ` Thomas Gleixner
2023-05-19 12:01                         ` [RFC PATCH 1/3] mm/vmalloc.c: try to flush vmap_area one by one Baoquan He
2023-05-19 12:01                           ` Baoquan He
2023-05-19 14:16                           ` Thomas Gleixner
2023-05-19 14:16                             ` Thomas Gleixner
2023-05-19 12:02                         ` [RFC PATCH 2/3] mm/vmalloc.c: Only flush VM_FLUSH_RESET_PERMS area immediately Baoquan He
2023-05-19 12:02                           ` Baoquan He
2023-05-19 12:03                         ` [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly Baoquan He
2023-05-19 12:03                           ` Baoquan He
2023-05-19 14:17                           ` Thomas Gleixner
2023-05-19 14:17                             ` Thomas Gleixner
2023-05-19 18:38                           ` Thomas Gleixner
2023-05-19 18:38                             ` Thomas Gleixner
2023-05-19 23:46                             ` Baoquan He
2023-05-19 23:46                               ` Baoquan He
2023-05-21 23:10                               ` Thomas Gleixner
2023-05-21 23:10                                 ` Thomas Gleixner
2023-05-22 11:21                                 ` Baoquan He
2023-05-22 11:21                                   ` Baoquan He
2023-05-22 12:02                                   ` Thomas Gleixner
2023-05-22 12:02                                     ` Thomas Gleixner
2023-05-22 14:34                                     ` Baoquan He
2023-05-22 14:34                                       ` Baoquan He
2023-05-22 20:21                                       ` Thomas Gleixner
2023-05-22 20:21                                         ` Thomas Gleixner
2023-05-22 20:44                                         ` Thomas Gleixner
2023-05-22 20:44                                           ` Thomas Gleixner
2023-05-23  9:35                                         ` Baoquan He
2023-05-23  9:35                                           ` Baoquan He
2023-05-19 13:49                   ` Excessive TLB flush ranges Thomas Gleixner
2023-05-19 13:49                     ` Thomas Gleixner
2023-05-16  8:21             ` Russell King (Oracle)
2023-05-16  8:21               ` Russell King (Oracle)
2023-05-16  8:19           ` Russell King (Oracle)
2023-05-16  8:19             ` Russell King (Oracle)
2023-05-16  8:44             ` Thomas Gleixner
2023-05-16  8:44               ` Thomas Gleixner
2023-05-16  8:48               ` Russell King (Oracle)
2023-05-16  8:48                 ` Russell King (Oracle)
2023-05-16 12:09                 ` Thomas Gleixner
2023-05-16 12:09                   ` Thomas Gleixner
2023-05-16 13:42                   ` Uladzislau Rezki
2023-05-16 13:42                     ` Uladzislau Rezki
2023-05-16 14:38                     ` Thomas Gleixner
2023-05-16 14:38                       ` Thomas Gleixner
2023-05-16 15:01                       ` Uladzislau Rezki
2023-05-16 15:01                         ` Uladzislau Rezki
2023-05-16 17:04                         ` Thomas Gleixner
2023-05-16 17:04                           ` Thomas Gleixner
2023-05-17 11:26                           ` Uladzislau Rezki
2023-05-17 11:26                             ` Uladzislau Rezki
2023-05-17 11:58                             ` Thomas Gleixner
2023-05-17 11:58                               ` Thomas Gleixner
2023-05-17 12:15                               ` Uladzislau Rezki
2023-05-17 12:15                                 ` Uladzislau Rezki
2023-05-17 16:32                                 ` Thomas Gleixner
2023-05-17 16:32                                   ` Thomas Gleixner
2023-05-19 10:01                                   ` Uladzislau Rezki
2023-05-19 10:01                                     ` Uladzislau Rezki
2023-05-19 14:56                                     ` Thomas Gleixner
2023-05-19 14:56                                       ` Thomas Gleixner
2023-05-19 15:14                                       ` Uladzislau Rezki
2023-05-19 15:14                                         ` Uladzislau Rezki
2023-05-19 16:32                                         ` Thomas Gleixner
2023-05-19 16:32                                           ` Thomas Gleixner
2023-05-19 17:02                                           ` Uladzislau Rezki
2023-05-19 17:02                                             ` Uladzislau Rezki
2023-05-16 17:56                       ` Nadav Amit
2023-05-16 17:56                         ` Nadav Amit
2023-05-16 19:32                         ` Thomas Gleixner
2023-05-16 19:32                           ` Thomas Gleixner
2023-05-17  0:23                           ` Thomas Gleixner
2023-05-17  0:23                             ` Thomas Gleixner
2023-05-17  1:23                             ` Nadav Amit
2023-05-17  1:23                               ` Nadav Amit
2023-05-17 10:31                               ` Thomas Gleixner [this message]
2023-05-17 10:31                                 ` Thomas Gleixner
2023-05-17 11:47                                 ` Thomas Gleixner
2023-05-17 11:47                                   ` Thomas Gleixner
2023-05-17 22:41                                   ` Nadav Amit
2023-05-17 22:41                                     ` Nadav Amit
2023-05-17 14:43                                 ` Mark Rutland
2023-05-17 14:43                                   ` Mark Rutland
2023-05-17 16:41                                   ` Thomas Gleixner
2023-05-17 16:41                                     ` Thomas Gleixner
2023-05-17 22:57                                 ` Nadav Amit
2023-05-17 22:57                                   ` Nadav Amit
2023-05-19 11:49                                   ` Thomas Gleixner
2023-05-19 11:49                                     ` Thomas Gleixner
2023-05-17 12:12                               ` Russell King (Oracle)
2023-05-17 12:12                                 ` Russell King (Oracle)
2023-05-17 23:14                                 ` Nadav Amit
2023-05-17 23:14                                   ` Nadav Amit
2023-05-15 18:17 ` Uladzislau Rezki
2023-05-15 18:17   ` Uladzislau Rezki
2023-05-16  2:26   ` Baoquan He
2023-05-16  2:26     ` Baoquan He
2023-05-16  6:40     ` Thomas Gleixner
2023-05-16  6:40       ` Thomas Gleixner
2023-05-16  8:07       ` Baoquan He
2023-05-16  8:07         ` Baoquan He
2023-05-16  8:10         ` Baoquan He
2023-05-16  8:10           ` Baoquan He
2023-05-16  8:45         ` Russell King (Oracle)
2023-05-16  8:45           ` Russell King (Oracle)
2023-05-16  9:13           ` Thomas Gleixner
2023-05-16  9:13             ` Thomas Gleixner
2023-05-16  8:54         ` Thomas Gleixner
2023-05-16  8:54           ` Thomas Gleixner
2023-05-16  9:48           ` Baoquan He
2023-05-16  9:48             ` Baoquan He
2023-05-15 20:02 ` Nadav Amit
2023-05-15 20:02   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ttwb5jx3.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=hch@lst.de \
    --cc=jogness@linutronix.de \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@armlinux.org.uk \
    --cc=lstoakes@gmail.com \
    --cc=mark.rutland@arm.com \
    --cc=maz@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=peterz@infradead.org \
    --cc=urezki@gmail.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.