[PATCH 0/6] Support Asynchronous Page Fault

From: Gavin Shan <gshan@redhat.com>
To: kvmarm@lists.cs.columbia.edu
Cc: maz@kernel.org, shan.gavin@gmail.com, pbonzini@redhat.com,
	will@kernel.org
Subject: [PATCH 0/6] Support Asynchronous Page Fault
Date: Tue, 18 Aug 2020 11:13:13 +1000	[thread overview]
Message-ID: <20200818011319.91777-1-gshan@redhat.com> (raw)

There are two stages of page fault. The guest kernel is responsible
for handling stage one page fault, while the host kernel is to take
care of the stage two page fault. When page fault is triggered because
of stage two page fault, the guest is suspended until the requested
memory (page) is populated. Sometimes, the cost to populate the requested
page isn't cheap and can take hundreds of milliseconds in extreme
cases. This impacts the overall guest's performance.

This series introduces the feature (asynchronous page fault) to resolve
the issue and improve the guest's performance. It depends on the series
to support SDEI virtualization and refactoring SDEI client driver. This
also depends on QEMU changes to export SDEI/APFT tables. All the code
including this series can be found from github:

   https://github.com/gwshan/linux ("sdei_client")
   https://github.com/gwshan/linux ("sdei")
   https://github.com/gwshan/linux ("apf")
   https://github.com/gwshan/qemu  ("apf")

The functionality is driven by two notifications: page-not-present and
page-ready. They're delivered from the host to guest via SDEI event and
PPI separately. In the mean while, each notification is always associated
with a token, used to identify the notification. The token is passed by
the shared memory between host/guest. Besides, the SMCCC interface is
mitigated by the guest to configure, enable or disable the functionality.
It's traditional control path.

When the guest is trapped to host because of stage two page fault, a
page-not-present notification is raised by the host, and sent to the guest
through (KVM private) SDEI event (0x40200001) if the requested page can't
be populated immediately. In the mean while, a (background) worker is also
started to populate the requested page. On receiving the SDEI event, the
guest marks the current running process with special flag (TIF_ASYNC_PF)
and associates the process with a pre-defined waitqueue. At same time,
a (reschedule) IPI is sent to the CPU where the process was running.
After the SDEI event is acknoledged by the guest, the (reschedule) IPI
is delivered and it causes context switch from kernel to user space.
During the context switch, the process with TIF_ASYNC_PF flag is
suspended on the associated waitqueue.

Later on, a page-ready notification is sent to guest after the requested
page is populated by the (background) worker. On receiving the interrupt,
the guest uses the associated token to locate the process, which was
previously suspended because of page-not-present, and wakes it up.

The series is organized as below:

PATCH[1-2]: support KVM hypervisor SMCCC services, which are developed by
            Will Deacon.
PATCH[3]:   export kvm_handle_user_mem_abort() with @prefault parameter
            supported, which is prepatory work to support the feature.
PATCH[4]:   support asynchronous page fault in host side.
PATCH[5]:   exposes APFT (Asynchronous Page Fault Table) ACPI table, which
            will be used by guest kernel to support the feature
PATCH[6]:   support asynchronous page fault in guest side.

=======
Testing
=======

In the test case [1] and [2], "testsuite mem" is executed to allocate
the specified percentage of free memory (90%) and then release them.
In the mean while, the calculation thread is started or not. When
the calculation thread isn't started, there isn't obvious performance
degradtion. When the calculation thread is started, the performance
is improved by 27.7% and 28.6% separately, depending on THP enablement
sttus on the host side.

In test case [3] and [4], the kernel image is built and check the
used time. The performance is improved by 9.7% and 9.9% separately,
depending on THP enablement status on the host side.

[1] Two threads to allocate/free memory and do calculation
    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP disabled
    command:               "testsuite mem 90 1 [thread]"
    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T"  With the calculation thread

    Idx  -      +     Output  T-                   T+                   Output
    ==========================================================================
    1   93.1s  93.6s    -     223.8s 21117147961   391.9s 49845637101     -
    2   93.3s  94.2s    -     237.9s 23394567744   397.0s 50506074773     -
    3   93.5s  94.3s    -     244.2s 24305177553   405.8s 51853498870     -
    4   94.1s  95.0s    -     262.8s 27113310073   421.7s 54338181069     -
    5   94.3s  95.2s    -     272.7s 28565479414   434.3s 56171922019     -
    ==========================================================================
        93.6s  94.4s   -0.8%  248.2s 24899136549   410.1s 52543062766
                                       100318841/s          128122562/s  +27.7%

[2] Two threads to allocate/free memory and do calculation
    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP enabled
    command:               "testsuite mem 90 1 [thread]"
    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T"  With the calculation thread

    Idx  -      +     Output  T-                   T+                   Output
    ==========================================================================
    1   91.3s  91.2s    -     218.8s 20319612017   389.6s 49016175698     -
    2   91.7s  91.6s    -     233.9s 22619566161   402.0s 50901616319     -
    3   91.8s  91.9s    -     251.1s 25066180266   405.3s 51247353704     -
    4   92.7s  92.2s    -     251.1s 25262121229   406.9s 51692420054     -
    5   93.1s  92.2s    -     260.7s 26532616925   425.4s 54412348724     -
    ==========================================================================
        92.1s  91.8s   +3.0%  243.1s 23960019319   405.8  51453982899
                                       98560342/s          126796409/s  +28.6%

[3] Clear kernel image and rebuild it.
    vCPU: 24  Memory:      8GB
    memory.limit_in_bytes: 2GB
    memory.swapiness:      100
    Host:                  THP disabled
    command:               "make -j 24 clean > /dev/null 2>&1 &&
                            make -j 24 > /dev/null 2>&1"

    Idx   Disabled   Enabled   Output
    ==================================
    1     2211s      2000s     +9.5%
    2     2333s      2060s     +11.7%
    3     2568s      2192s     +14.6%
    4     2631s      2423s     +7.9%
    5     2756s      2605s     +5.4%
    ==================================
          2499s      2256s     +9.7%

[4] Clear kernel image and rebuild it.
    vCPU: 24  Memory:      8GB
    memory.limit_in_bytes: 2GB
    memory.swapiness:      100
    Host:                  THP enabled
    command:               "make -j 24 clean > /dev/null 2>&1 &&
                            make -j 24 > /dev/null 2>&1"

    Idx   Disabled   Enabled   Output
    ==================================
    1     2049s      1850s     +9.7%
    2     2144s      1947s     +9.1%
    3     2164s      1997s     +7.7%
    4     2192s      2031s     +7.3%
    5     2515s      2141s     +14.8%
    ==================================
          2214s      1993s     +9.9%

Gavin Shan (4):
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  arm64/kvm: Support async page fault
  drivers/acpi: Import ACPI APF table
  arm64/kernel: Support async page fault

Will Deacon (2):
  arm64: Probe for the presence of KVM hypervisor services during boot
  arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

 arch/arm64/Kconfig                     |  11 +
 arch/arm64/include/asm/esr.h           |   5 +
 arch/arm64/include/asm/hypervisor.h    |  11 +
 arch/arm64/include/asm/kvm_emulate.h   |   8 +-
 arch/arm64/include/asm/kvm_host.h      |  54 +++
 arch/arm64/include/asm/kvm_para.h      |  41 +++
 arch/arm64/include/asm/processor.h     |   1 +
 arch/arm64/include/asm/thread_info.h   |   4 +-
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  23 ++
 arch/arm64/kernel/Makefile             |   1 +
 arch/arm64/kernel/kvm.c                | 478 +++++++++++++++++++++++++
 arch/arm64/kernel/setup.c              |  32 ++
 arch/arm64/kernel/signal.c             |  17 +
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   1 +
 arch/arm64/kvm/arm.c                   |  45 ++-
 arch/arm64/kvm/async_pf.c              | 462 ++++++++++++++++++++++++
 arch/arm64/kvm/hypercalls.c            |  37 +-
 arch/arm64/kvm/mmu.c                   |  47 ++-
 arch/arm64/kvm/sdei.c                  |   8 +
 include/acpi/actbl2.h                  |  18 +
 include/linux/arm-smccc.h              |  41 +++
 23 files changed, 1321 insertions(+), 27 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 arch/arm64/kernel/kvm.c
 create mode 100644 arch/arm64/kvm/async_pf.c

-- 
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm