From: Uladzislau Rezki <urezki@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Christoph Hellwig <hch@lst.de>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
	linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Fri, 19 May 2023 12:01:56 +0200
Message-ID: <ZGdJFI/4POEUl4bK@pc636>
In-Reply-To: <87edne6hra.ffs@tglx>

On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 14:15, Uladzislau Rezki wrote:
> > On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
> >> Keeping executable mappings around until some other flush happens is
> >> obviously neither a brilliant idea nor correct.
> >> 
> > It avoids blocking a caller in vfree() by deferring the freeing into
> > a workqueue context. At least I got the feeling that "your task" that
> > does vfree() blocks for an unacceptable time. That can happen only if it
> > performs a VM_FLUSH_RESET_PERMS freeing (other freeings are deferred):
> >
> > <snip>
> > if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> > 	vm_reset_perms(vm);
> > <snip>
> >
> > In this case vfree() can take some time instead of returning to
> > the caller asap. Is that your issue? I am not talking about TLB flushing
> > taking time; in this case holding on a mutex can also take time.
> 
> This is absolutely not the problem at all. This comes via do_exit() and
> I explained already here:
> 
>  https://lore.kernel.org/all/871qjg8wqe.ffs@tglx
> 
> what made us look into this and I'm happy to quote myself for your
> convenience:
> 
>  "The scenario which made us look is that CPU0 is housekeeping and CPU1 is
>   isolated for RT.
> 
>   Now CPU0 does that flush nonsense and the RT workload on CPU1 suffers
>   because the compute time is suddenly factor 2-3 larger, IOW, it misses
>   the deadline. That means a one off event is already a problem."
> 
> So it does not matter at all how long the operations on CPU0 take. The
> only thing which matters is how much these operations affect the
> workload on CPU1.
> 
Thanks. I focused on your first email, which did not mention the second
part: that you have one housekeeping CPU and another one isolated for
RT activity.

>
> That made me look into this coalescing code. I understand why you want
> to batch and coalesce and rather do a rare full tlb flush than sending
> gazillions of IPIs.
> 
Your issue has no connection with merging. But the place you looked at
was correct :)

>
> But that creates a policy at the core code which does not leave any
> decision to make for the architecture, whether it's worth to do full or
> single flushes. That's what I worried about and not about the question
> whether that free takes 1ms or 10us. That's a completely different
> debate.
> 
> Whether that list based flush turns out to be the better solution or
> not, has still to be decided by deeper analysis.
>
I had a look at how per-VA TLB flushing behaves on x86_64 under heavy load:

<snip>
commit 776a33ed63f0f15b5b3f6254bcb927a45e37298d (HEAD -> master)
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri May 19 11:35:35 2023 +0200

    mm: vmalloc: Flush TLB per-va

    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9683573f1225..6ff95f3d1fa1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        if (unlikely(list_empty(&local_purge_list)))
                goto out;

-       start = min(start,
-               list_first_entry(&local_purge_list,
-                       struct vmap_area, list)->va_start);
+       /* OK. A per-cpu wants to flush an exact range. */
+       if (start != ULONG_MAX)
+               flush_tlb_kernel_range(start, end);

-       end = max(end,
-               list_last_entry(&local_purge_list,
-                       struct vmap_area, list)->va_end);
+       /* Flush per-VA. */
+       list_for_each_entry(va, &local_purge_list, list)
+               flush_tlb_kernel_range(va->va_start, va->va_end);

-       flush_tlb_kernel_range(start, end);
        resched_threshold = lazy_max_pages() << 1;

        spin_lock(&free_vmap_area_lock);
<snip>
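
To make the difference between the two strategies concrete, here is a
minimal user-space sketch, not kernel code. It assumes 4 KiB pages and a
purge list that is sorted by address. struct va_range, struct flush_cost,
cost_coalesced() and cost_per_va() are hypothetical names made up for this
illustration. It only counts how many flush calls each strategy issues and
how many pages those calls cover:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Simplified stand-in for struct vmap_area: only the range matters here. */
struct va_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

struct flush_cost {
	unsigned long calls;	/* number of range-flush calls issued */
	unsigned long pages;	/* pages covered by those calls */
};

/* Default behaviour: one flush from the first start to the last end. */
static struct flush_cost cost_coalesced(const struct va_range *v, size_t n)
{
	struct flush_cost c = { 0, 0 };

	assert(n > 0);
	c.calls = 1;
	c.pages = (v[n - 1].end - v[0].start) >> PAGE_SHIFT;
	return c;
}

/* Experimental behaviour: one flush per area, exact ranges only. */
static struct flush_cost cost_per_va(const struct va_range *v, size_t n)
{
	struct flush_cost c = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		c.calls++;
		c.pages += (v[i].end - v[i].start) >> PAGE_SHIFT;
	}
	return c;
}
```

For two areas that are far apart, for example [0x1000, 0x3000) and
[0x100000, 0x101000), the coalesced variant issues one call covering 256
pages, while the per-VA variant issues two calls covering 3 pages in total.
Each call can mean another round of IPIs, which is the cost being compared.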

There are at least two observations:

1. asm_sysvec_call_function adds an extra 12% in terms of cycles

# per-VA TLB flush
   - 12.00% native_queued_spin_lock_slowpath
      - 11.90% asm_sysvec_call_function
         - sysvec_call_function
           __sysvec_call_function
         - __flush_smp_call_function_queue
            - 1.64% __flush_tlb_all
                 native_flush_tlb_global
                 native_write_cr4

# default
0.18%     0.16%  [kernel]            [k] asm_sysvec_call_function

2. Memory footprint grows (under heavy load) because the TLB flush plus the
extra lazy-list scan take longer.
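
A note on why __flush_tlb_all can still show up in the per-VA profile: as
far as I understand the x86 code, flush_tlb_kernel_range() falls back to a
full flush on all CPUs once the range is larger than
tlb_single_page_flush_ceiling pages. The sketch below models only that
decision in user space. The default of 33 pages and the 4 KiB page size are
assumptions, and range_becomes_full_flush() is a hypothetical helper, not a
kernel function:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12UL				/* assumption: 4 KiB pages */
#define SINGLE_PAGE_FLUSH_CEILING 33UL		/* assumed x86 default */

/*
 * Returns true when a kernel range flush of [start, end) would be
 * promoted to a full TLB flush under the assumed ceiling.
 */
static bool range_becomes_full_flush(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return (end - start) > (SINGLE_PAGE_FLUSH_CEILING << PAGE_SHIFT);
}
```

So a single large vmap area on the lazy list is enough to trigger a global
flush even when every area is flushed individually.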

I hope this is useful for you.

--
Uladzislau Rezki
