From: Benjamin Berg <benjamin@sipsolutions.net>
To: Johannes Berg <johannes@sipsolutions.net>,
Anton Ivanov <anton.ivanov@cambridgegreys.com>,
linux-um@lists.infradead.org
Subject: Re: [RFC PATCH 0/3] um: clean up mm creation - another attempt
Date: Wed, 17 Jan 2024 18:17:49 +0100
Message-ID: <820dc103cce16124a195c490ecc80dab4a9af96c.camel@sipsolutions.net>
In-Reply-To: <300c0b28952b2837efbb5c59f38944b02b7722f2.camel@sipsolutions.net>
Hi,
On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> [SNIP]
> Once we are there, we can look for optimizations. The fundamental
> problem is that page faults (even minor ones) are extremely expensive
> for us.
>
> Just throwing out ideas on what we could do:
>  1. SECCOMP as that reduces the amount of context switches.
>     (Yes, I know I should resubmit the patchset)
>  2. Maybe we can disable/cripple page access tracking? If we assume
>     initially mark all pages as accessed by userspace (i.e.
>     pte_mkyoung), then we avoid a minor page fault on first access.
>     Doing that will mess with page eviction though.
>  3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>     host kernel rather than through UM.
>     With a hostfs like file system, one should be able to add an
>     intermediate block device that maps host files to physical pages,
>     then do DAX in the FS.
>     For disk images, the existing iomem infrastructure should be
>     usable, this should work with any DAX enabled filesystems (ext2,
>     ext4, xfs, virtiofs, erofs).
So, I experimented quite a bit over Christmas (including getting DAX to
work with virtiofs). At the end of all this my conclusion is that
insufficient page table synchronization is our main problem.

Basically, right now we rely on the flush_tlb_* functions from the
kernel, but these are only called when TLB entries are removed, *not*
when new PTEs are added (there is also update_mmu_cache, but it isn't
enough either). Effectively this means that new page table entries will
often only be synced because the userspace code runs into an
unnecessary segfault.

Really, what we need is a set_pte_at() implementation that marks the
memory range for synchronization. Then we can make sure we sync it
before switching to the userspace process (the equivalent of running
flush_tlb_mm_range right now).
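
To make the idea concrete, here is a minimal user-space sketch of the
bookkeeping, not actual UML code. It assumes a single pending range per mm;
struct sync_range, mark_for_sync(), take_pending() and SKETCH_PAGE_SIZE are
all hypothetical names made up for illustration. mark_for_sync() stands in
for what a set_pte_at() hook might record, and take_pending() for the check
that would run before switching to the userspace process.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size, illustration only */

/* Hypothetical per-mm state: one pending [start, end) address range whose
 * page table entries changed since the last sync. start == end means that
 * nothing is pending. */
struct sync_range {
	unsigned long start;
	unsigned long end;
};

/* What a set_pte_at() hook might do: widen the pending range so that it
 * covers the page containing addr. */
void mark_for_sync(struct sync_range *r, unsigned long addr)
{
	unsigned long page = addr & ~(SKETCH_PAGE_SIZE - 1);

	if (r->start == r->end) {
		r->start = page;
		r->end = page + SKETCH_PAGE_SIZE;
		return;
	}
	if (page < r->start)
		r->start = page;
	if (page + SKETCH_PAGE_SIZE > r->end)
		r->end = page + SKETCH_PAGE_SIZE;
}

/* To be called before switching to userspace: hands out the pending range
 * (if any) and resets the tracker. Returns false when there is nothing
 * to sync. */
bool take_pending(struct sync_range *r, unsigned long *start,
		  unsigned long *end)
{
	assert(r->start <= r->end);
	if (r->start == r->end)
		return false;
	*start = r->start;
	*end = r->end;
	r->start = r->end = 0;
	return true;
}
```

A single range is the simplest choice, but it over-syncs when two updates
are far apart; whether that matters in practice would need measuring.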
I think we should:
 * Rewrite the userspace syscall code
   - Support delaying the execution of syscalls
   - Only support mmap/munmap/mprotect and LDT
   - Do simple compression of consecutive syscalls here
   - Drop the hand-written assembler
 * Improve the tlb.c code
   - remove the HVC abstraction
   - never force immediate syscall execution
 * Let set_pte_at() track which memory ranges need syncing
 * At that point we should be able to:
   - drop copy_context_skas0
   - make flush_tlb_* no-ops
   - drop flush_tlb_page from handle_page_fault
   - move unmap() from flush_thread to init_new_context
     (or do it as part of start_userspace)
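
As a rough illustration of "delaying" plus "simple compression of
consecutive syscalls", here is a self-contained sketch, not the real stub
code. The names (struct pending_op, struct op_queue, queue_op(),
MAX_PENDING) and the merge rules are assumptions made for this example: a
new operation is folded into the previously queued one only if it is of the
same kind, starts exactly where that one ends, and has matching arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op_type { OP_MMAP, OP_MUNMAP, OP_MPROTECT };

/* Hypothetical record of one delayed syscall. */
struct pending_op {
	enum op_type type;
	unsigned long addr;
	unsigned long len;
	int prot;             /* ignored for OP_MUNMAP */
	unsigned long offset; /* backing offset, only used for OP_MMAP */
};

#define MAX_PENDING 16 /* arbitrary queue size for the sketch */

struct op_queue {
	struct pending_op ops[MAX_PENDING];
	size_t count;
};

/* True if b directly continues a: same kind, adjacent addresses and
 * compatible arguments. */
static bool can_merge(const struct pending_op *a, const struct pending_op *b)
{
	if (a->type != b->type || a->addr + a->len != b->addr)
		return false;
	if (a->type != OP_MUNMAP && a->prot != b->prot)
		return false;
	if (a->type == OP_MMAP && a->offset + a->len != b->offset)
		return false;
	return true;
}

/* Queue an operation instead of executing it immediately. Returns false
 * when the queue is full, in which case the caller has to flush first. */
bool queue_op(struct op_queue *q, struct pending_op op)
{
	assert(op.len > 0);
	if (q->count > 0 && can_merge(&q->ops[q->count - 1], &op)) {
		q->ops[q->count - 1].len += op.len;
		return true;
	}
	if (q->count == MAX_PENDING)
		return false;
	q->ops[q->count++] = op;
	return true;
}
```

Only merging with the last queued entry keeps the ordering of the delayed
syscalls intact, which is why the sketch does not search the whole queue.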
So, I did try this using nasty hacks and IIRC one of my runs was going
from 21s to 16s and another from 63s to 56s. Which seems like a nice
improvement.
Benjamin
PS: As for DAX, it doesn't really seem to help performance. It didn't
seem to lower the number of page faults in UML. And, from my
perspective, it isn't really worth it just for the memory sharing.
PPS: dirty/young tracking seemed to cause only a small number of
page faults in the grand scheme. So probably not something worth
following up on.