linux-doc.vger.kernel.org archive mirror
* [PATCH v3 00/19] arm64 kernel text replication
@ 2024-01-17  8:53 Hao Jia
  2024-01-17  9:41 ` Russell King (Oracle)
  0 siblings, 1 reply; 5+ messages in thread
From: Hao Jia @ 2024-01-17  8:53 UTC (permalink / raw)
  To: mark.rutland, rmk+kernel, catalin.marinas, corbet, will, willy
  Cc: linux-arm-kernel, linux-doc, root

From: root <root@n144-101-220.byted.org>

Many thanks to Russell King for his previous work on
arm64 kernel text replication.
https://lore.kernel.org/all/ZMKNYEkM7YnrDtOt@shell.armlinux.org.uk

After applying these patches, our tests showed that the performance of
our business workloads improved by more than 5% and that memory
bandwidth was more evenly balanced across NUMA nodes.
I've recently been working on making it support different numbers of
page-table levels and page sizes, so I have updated this patch set to v3.

Patch overview:

Patches 1-16 are based on Russell King's previous arm64 kernel text
replication series, rebased on commit 052d534373b7.

The following three patches are new in v3:
patch 17 fixes a compilation warning.

patch 18 adapts arm64 kernel text replication to support more
page-table level counts and page sizes, in addition to 16K page size
and 4-level page tables.

patch 19 fixes abnormal startup caused by module_alloc(), which may
allocate an address above KIMAGE_VADDR when kernel text replication
is enabled.

[v2] https://lore.kernel.org/all/ZMKNYEkM7YnrDtOt@shell.armlinux.org.uk
[RFC] https://lore.kernel.org/all/ZHYCUVa8fzmB4XZV@shell.armlinux.org.uk

Please correct me if I've made a mistake, thank you very much!

Original message below.

Problem
-------

NUMA systems have greater latency when accessing data and instructions
across nodes, which can lead to a reduction in performance on CPU cores
that mainly perform accesses beyond their local node.

Normally when an ARM64 system boots, the kernel will end up placed in
memory, and each CPU core will have to fetch instructions and data from
whichever NUMA node the kernel has been placed in. This means that while
executing kernel code, CPUs local to that node will run faster than
CPUs in remote nodes.

The higher the latency to access remote NUMA node memory, the more the
kernel performance suffers on those nodes.

If there is a local copy of the kernel text in each node's RAM, and
each node runs the kernel using its local copy of the kernel text,
then it stands to reason that the kernel will run faster due to fewer
stalls while instructions are fetched from remote memory.

The question then arises how to achieve this.

Background
----------

An important issue to contend with is what happens when a thread
migrates between nodes. Essentially, the thread's state (including
instruction pointer) is saved to memory, and the scheduler on that CPU
loads some other thread's state and that CPU resumes executing that
new thread.

The CPU gaining the migrating thread loads the saved state, again
including the instruction pointer, and the gaining CPU resumes fetching
instructions at the virtual address where the original CPU left off.

The key point is that the virtual address is what matters here, and
this gives us a way to implement kernel text replication fairly easily.
At a practical level, all we need to do is to ensure that the virtual
addresses which contain the kernel text point to a local copy of
that text.

This is exactly how this proposal of kernel text replication achieves
the replication. We can go a little bit further and include most of
the read-only data in this replication, as that will never be written
to by the kernel (and thus remains constant.)
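
As a rough illustration of the point above (same virtual address, different
physical copy per node), here is a minimal user-space sketch in C. It is not
code from the series: the virtual base, the region size, the per-node
physical bases and the function name ktext_virt_to_phys() are all assumptions
made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES    2                      /* hypothetical node count */
#define KTEXT_VADDR 0xffff800008000000ULL  /* hypothetical virtual base of the kernel text */
#define KTEXT_SIZE  0x200000ULL            /* hypothetical size of the replicated region */

/* Hypothetical physical base of each node's private copy of the text. */
static const uint64_t ktext_phys_base[NR_NODES] = {
	0x80000000ULL,   /* copy in node 0 RAM */
	0x880000000ULL,  /* copy in node 1 RAM */
};

/*
 * Resolve a kernel text virtual address for a CPU on node 'nid'.
 * The virtual address is identical on every node; only the physical
 * address it resolves to differs.
 */
static uint64_t ktext_virt_to_phys(int nid, uint64_t vaddr)
{
	assert(nid >= 0 && nid < NR_NODES);
	assert(vaddr >= KTEXT_VADDR && vaddr - KTEXT_VADDR < KTEXT_SIZE);
	return ktext_phys_base[nid] + (vaddr - KTEXT_VADDR);
}
```

A thread that migrates from node 0 to node 1 keeps the same saved
instruction pointer, but in this sketch the fetch on node 1 would be served
from the node 1 copy.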

Solution
--------

So, what we need to achieve is:

1. multiple identical copies of the kernel text (and read-only data)
2. point the virtual mappings to the appropriate copy of kernel text
   for the NUMA node.

(1) is fairly easy to achieve - we just need to allocate some memory
in the appropriate node and copy the parts of the kernel we want to
replicate. However, we also need to deal with ARM64's kernel patching.
There are two functions that patch the kernel text,
__apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
these need to be modified to update all copies of the kernel text.
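
The requirement in (1) can be sketched in plain C as follows. This is only a
user-space model under stated assumptions, not the patching code from the
series: the array ktext_copy and the helpers ktext_replicate(),
ktext_patch_all() and ktext_copies_identical() are hypothetical names, and
real patching also needs cache maintenance, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES  4   /* hypothetical node count */
#define TEXT_WORDS 16  /* hypothetical text size, in 32-bit instructions */

/* One copy of the "kernel text" per node; node 0 holds the original. */
static uint32_t ktext_copy[MAX_NODES][TEXT_WORDS];

/* Copy node 0's text into every other node's copy. */
static void ktext_replicate(void)
{
	for (int nid = 1; nid < MAX_NODES; nid++)
		memcpy(ktext_copy[nid], ktext_copy[0], sizeof(ktext_copy[0]));
}

/* Patch one instruction at the same index in every node's copy. */
static void ktext_patch_all(size_t index, uint32_t insn)
{
	assert(index < TEXT_WORDS);
	for (int nid = 0; nid < MAX_NODES; nid++)
		ktext_copy[nid][index] = insn;
}

/* Return 1 if every node's copy matches node 0's copy, 0 otherwise. */
static int ktext_copies_identical(void)
{
	for (int nid = 1; nid < MAX_NODES; nid++)
		if (memcmp(ktext_copy[nid], ktext_copy[0], sizeof(ktext_copy[0])))
			return 0;
	return 1;
}
```

If a patch were applied to node 0's copy alone, ktext_copies_identical()
would return 0, which is the divergence both patching functions have to
avoid.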

(2) is slightly harder.

Firstly, the aarch64 architecture has a very useful feature here - the
kernel page tables are entirely separate from the user page tables.
The hardware contains two page table pointers, one is used for user
mappings, the other is used for kernel mappings.

Therefore, we only have one page table to be concerned with: the table
which maps kernel space. We do not need to be concerned with each
user process's page table.

The approach taken here is to ensure that the kernel is located in an
area of kernel virtual address space covered by a level-0 page table
entry which is not shared with any other user. We can then maintain
separate per-node level-0 page tables for kernel space where the only
difference between them is this level-0 page table entry.

This gives a couple of benefits. Firstly, when updates to the level-0
page table happen (e.g. when establishing new mappings) these updates
can simply be copied to the other level-0 page tables provided it isn't
for the kernel image. Secondly, we don't need complexity at lower
levels of the page table code to figure out whether a level-1 or lower
update needs to be propagated to other nodes.

The level-0 page table entry for the kernel can then be used to point
at a node-unique set of level 1..N page tables to map the appropriate
copy of the kernel text (and read-only data) into kernel space, while
keeping the kernel read-write data shared between nodes.
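
The per-node level-0 table scheme described above can be modelled in a few
lines of C. This is a sketch under assumptions, not the series' mm code: the
table size assumes 4K pages with 4 levels, the slot KIMAGE_L0_INDEX is an
arbitrary choice, and l0_set_kimage(), l0_set_shared() and
l0_count_differences() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES        2    /* hypothetical node count */
#define PTRS_PER_L0     512  /* level-0 entries, assuming 4K pages and 4 levels */
#define KIMAGE_L0_INDEX 510  /* hypothetical slot reserved for the kernel image */

typedef uint64_t l0_entry_t;

/* One level-0 table per node. */
static l0_entry_t l0_table[NR_NODES][PTRS_PER_L0];

/* Point one node's kernel-image entry at that node's private lower-level tables. */
static void l0_set_kimage(int nid, l0_entry_t val)
{
	assert(nid >= 0 && nid < NR_NODES);
	l0_table[nid][KIMAGE_L0_INDEX] = val;
}

/* Any other level-0 update is simply copied to every node's table. */
static void l0_set_shared(unsigned int index, l0_entry_t val)
{
	assert(index < PTRS_PER_L0 && index != KIMAGE_L0_INDEX);
	for (int nid = 0; nid < NR_NODES; nid++)
		l0_table[nid][index] = val;
}

/* Count the slots in which two nodes' level-0 tables differ. */
static unsigned int l0_count_differences(int a, int b)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < PTRS_PER_L0; i++)
		if (l0_table[a][i] != l0_table[b][i])
			n++;
	return n;
}
```

In this model nothing below level 0 needs to know about replication: shared
entries point at the same lower-level tables on every node, and only the
kernel image slot differs.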

Performance Analysis
--------------------

Needless to say, the performance results from kernel text replication
are workload specific, but appear to show a gain of between 6% and
17% for database-centric workloads. When combined with userspace
awareness of NUMA, this can result in a gain of over 50%.

Problems
--------

There are a few areas that are a problem for kernel text replication:
1) As this series changes the kernel space virtual address space
   layout, it breaks KASAN - and I've zero knowledge of KASAN so I
   have no idea how to fix it. I would be grateful for suggestions
   from KASAN folk on how to fix this.

2) KASLR can not be used with kernel text replication, since we need
   to place the kernel in its own L0 page table entry, not in vmalloc
   space. KASLR is disabled when support for kernel text replication
   is enabled.

3) Changing the kernel virtual address space layout also means that
   kaslr_offset() and kaslr_enabled() need to become macros rather
   than inline functions due to the use of PGDIR_SIZE in the
   calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
   constant, but asm/memory.h is included by asm/pgtable.h, having
   this symbol available would produce a circular include
   dependency, so I don't think there is any choice here.

4) Read-only protection for replicated kernel images is not yet
   implemented.
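
To illustrate why item 3 uses macros, here is a small self-contained C
sketch. All SKETCH_* and sketch_* names and values are invented for the
example and are not the kernel's definitions. The point it shows is a plain
C preprocessor property: a macro body is only expanded where it is used, so
it may refer to a constant that is defined later, whereas an inline function
body is compiled where it is written.

```c
#include <assert.h>
#include <stdint.h>

/* --- Stands in for the earlier header: SKETCH_PGDIR_SIZE is not defined yet. --- */

/* Macro bodies are not expanded here, so the missing definitions do no harm. */
#define SKETCH_KIMAGE_VADDR (SKETCH_BASE + SKETCH_PGDIR_SIZE)
#define sketch_kaslr_offset(kimage_vaddr) \
	((uint64_t)(kimage_vaddr) - SKETCH_KIMAGE_VADDR)

/*
 * An inline function at this point would not compile, because its body
 * needs SKETCH_PGDIR_SIZE immediately:
 *
 *   static inline uint64_t sketch_kaslr_offset_fn(uint64_t v)
 *   { return v - SKETCH_KIMAGE_VADDR; }
 */

/* --- Stands in for the later header that defines the constants. --- */
#define SKETCH_BASE       0xffff800000000000ULL  /* hypothetical base address */
#define SKETCH_PGDIR_SIZE (1ULL << 39)           /* one level-0 entry with 4K pages, 4 levels */

/* By the time the macro is used, everything it refers to is defined. */
static uint64_t sketch_offset_of(uint64_t kimage_vaddr)
{
	return sketch_kaslr_offset(kimage_vaddr);
}
```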

Hao Jia (3):
  arm64: text replication: fix compilation warning
  arm64: text replication: support more page sizes and levels
  arm64: text replication: keep modules inside module region when
    REPLICATE_KTEXT is enabled

Russell King (Oracle) (16):
  arm64: provide cpu_replace_ttbr1_phys()
  arm64: make clean_dcache_range_nopatch() visible
  arm64: place kernel in its own L0 page table entry
  arm64: text replication: add init function
  arm64: text replication: add sanity checks
  arm64: text replication: copy initial kernel text
  arm64: text replication: add node text patching
  arm64: text replication: add node 0 page table definitions
  arm64: text replication: add swapper page directory helpers
  arm64: text replication: create per-node kernel page tables
  arm64: text replication: boot secondary CPUs with appropriate TTBR1
  arm64: text replication: update cnp support
  arm64: text replication: setup page tables for copied kernel
  arm64: text replication: include most of read-only data as well
  arm64: text replication: early kernel option to enable replication
  arm64: text replication: add Kconfig

 .../admin-guide/kernel-parameters.txt         |   5 +
 arch/arm64/Kconfig                            |  10 +-
 arch/arm64/include/asm/cacheflush.h           |   2 +
 arch/arm64/include/asm/ktext.h                |  45 ++++
 arch/arm64/include/asm/memory.h               |  36 ++-
 arch/arm64/include/asm/mmu_context.h          |  11 +-
 arch/arm64/include/asm/pgtable.h              |  31 ++-
 arch/arm64/include/asm/smp.h                  |   1 +
 arch/arm64/kernel/alternative.c               |   4 +-
 arch/arm64/kernel/asm-offsets.c               |   1 +
 arch/arm64/kernel/head.S                      |   3 +-
 arch/arm64/kernel/hibernate.c                 |   2 +-
 arch/arm64/kernel/kaslr.c                     |   1 +
 arch/arm64/kernel/module.c                    |  20 +-
 arch/arm64/kernel/patching.c                  |   7 +-
 arch/arm64/kernel/smp.c                       |   3 +
 arch/arm64/mm/Makefile                        |   2 +
 arch/arm64/mm/init.c                          |   3 +
 arch/arm64/mm/ktext.c                         | 213 ++++++++++++++++++
 arch/arm64/mm/mmu.c                           |  73 +++++-
 20 files changed, 446 insertions(+), 27 deletions(-)
 create mode 100644 arch/arm64/include/asm/ktext.h
 create mode 100644 arch/arm64/mm/ktext.c

-- 
2.20.1



* Re: [PATCH v3 00/19] arm64 kernel text replication
  2024-01-17  8:53 Hao Jia
@ 2024-01-17  9:41 ` Russell King (Oracle)
  0 siblings, 0 replies; 5+ messages in thread
From: Russell King (Oracle) @ 2024-01-17  9:41 UTC (permalink / raw)
  To: Hao Jia
  Cc: mark.rutland, catalin.marinas, corbet, will, willy,
	linux-arm-kernel, linux-doc, root

On Wed, Jan 17, 2024 at 04:53:38PM +0800, Hao Jia wrote:
> From: root <root@n144-101-220.byted.org>
> 
> Many thanks to Russell King for his previous work on
> arm64 kernel text replication.
> https://lore.kernel.org/all/ZMKNYEkM7YnrDtOt@shell.armlinux.org.uk
> 
> After applying these patches, we tested that our business performance
> increased by more than 5% and the NUMA node memory bandwidth was more
> balanced.
> I've recently been trying to make it work with different numbers of
> page tables/page sizes, so updated this patch set to V3.
> 
> Patch overview:
> 
> Patch 1-16 is a patch set based on Russell King's previous arm64
> kernel text replication, rebased on commit 052d534373b7.
> 
> The following three patches are new in v3:
> patch 17 fixes compilation warning
> 
> patch 18 adapts arm64 kernel text replication to support more
> page tables/page sizes, in addition to 16K page size and
> 4-level page tables.
> 
> patch 19 fixes the abnormal startup problem caused by module_alloc()
> which may allocate an address larger than KIMAGE_VADDR when kernel text
> replication is enabled.
> 
> [v2] https://lore.kernel.org/all/ZMKNYEkM7YnrDtOt@shell.armlinux.org.uk
> [RFC] https://lore.kernel.org/all/ZHYCUVa8fzmB4XZV@shell.armlinux.org.uk
> 
> Please correct me if I've made a mistake, thank you very much!

Note that, even though I haven't posted an update (I see it as mostly
pointless because *no one* commented on the previous posting) I do
maintain these patches:

  git://git.armlinux.org.uk/~rmk/linux-arm.git aarch64/ktext/head

currently has them against v6.7

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


* Re:[PATCH v3 00/19] arm64 kernel text replication
@ 2024-01-23 10:35 Yuquan Wang
  2024-01-23 11:32 ` [External] " Hao Jia
  2024-01-23 17:25 ` [PATCH " Russell King (Oracle)
  0 siblings, 2 replies; 5+ messages in thread
From: Yuquan Wang @ 2024-01-23 10:35 UTC (permalink / raw)
  To: rmk+kernel, jiahao.os; +Cc: linux-arm-kernel, linux-doc

> 
> After applying these patches, we tested that our business performance
> increased by more than 5% and the NUMA node memory bandwidth was more
> balanced.
> 

I have successfully applied your patches to my arm64 Linux kernel, and I
can boot it on a QEMU machine (virt). However, I don't know how to test
the performance gain they bring to the kernel. Do you have any suggestions?

Any help would be greatly appreciated.

Many thanks
Yuquan



* Re: [External] Re:[PATCH v3 00/19] arm64 kernel text replication
  2024-01-23 10:35 Re:[PATCH v3 00/19] arm64 kernel text replication Yuquan Wang
@ 2024-01-23 11:32 ` Hao Jia
  2024-01-23 17:25 ` [PATCH " Russell King (Oracle)
  1 sibling, 0 replies; 5+ messages in thread
From: Hao Jia @ 2024-01-23 11:32 UTC (permalink / raw)
  To: Yuquan Wang, rmk+kernel; +Cc: linux-arm-kernel, linux-doc



On 2024/1/23 Yuquan Wang wrote:
>>
>> After applying these patches, we tested that our business performance
>> increased by more than 5% and the NUMA node memory bandwidth was more
>> balanced.
>>
> 
> I have successfully applied your patches on my arm64 linux. And I could
> start it with a qemu machine(virt). However, I don't know the way to test
> the performance it brings to the kernel. Do you have some suggestions?
> 

Kernel text replication performance test results depend on your
workload; different workloads will behave differently. Also, performance
testing in a virtual machine is not very accurate.


Thanks,
Hao


* Re: [PATCH v3 00/19] arm64 kernel text replication
  2024-01-23 10:35 Re:[PATCH v3 00/19] arm64 kernel text replication Yuquan Wang
  2024-01-23 11:32 ` [External] " Hao Jia
@ 2024-01-23 17:25 ` Russell King (Oracle)
  1 sibling, 0 replies; 5+ messages in thread
From: Russell King (Oracle) @ 2024-01-23 17:25 UTC (permalink / raw)
  To: Yuquan Wang; +Cc: jiahao.os, linux-arm-kernel, linux-doc

On Tue, Jan 23, 2024 at 06:35:09PM +0800, Yuquan Wang wrote:
> > 
> > After applying these patches, we tested that our business performance
> > increased by more than 5% and the NUMA node memory bandwidth was more
> > balanced.
> > 
> 
> I have successfully applied your patches on my arm64 linux. And I could 
> start it with a qemu machine(virt). However, I don't know the way to test
> the performance it brings to the kernel. Do you have some suggestions?

Please can I make one thing utterly clear... kernel text replication
in a virtual machine generally doesn't make sense unless one can
set up the virtual machine to be truly NUMA. In other words, groups
of CPUs with their local memory and remote-node memory having higher
latency.

Kernel text replication is something which solves the problem on
bare metal NUMA machines where running kernel text that is located
in a foreign node results in the CPU running slower than it would
do if the kernel text were in its local RAM.

Unless the VM is set up in exactly that way, then kernel text
replication has no place in a VM, and probably would result in
poorer performance.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

