All of lore.kernel.org
 help / color / mirror / Atom feed
From: Muhammad Usama Anjum <usama.anjum@collabora.com>
To: David Hildenbrand <david@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
	<x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>,
	Shuah Khan <shuah@kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	"open list:PROC FILESYSTEM" <linux-fsdevel@vger.kernel.org>,
	"open list:ABI/API" <linux-api@vger.kernel.org>,
	"open list:GENERIC INCLUDE/ASM HEADER FILES" 
	<linux-arch@vger.kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	"open list:PERFORMANCE EVENTS SUBSYSTEM" 
	<linux-perf-users@vger.kernel.org>,
	"open list:KERNEL SELFTEST FRAMEWORK" 
	<linux-kselftest@vger.kernel.org>,
	krisman@collabora.com
Cc: usama.anjum@collabora.com, kernel@collabora.com
Subject: Re: [PATCH 0/5] Add process_memwatch syscall
Date: Wed, 10 Aug 2022 21:39:26 +0500	[thread overview]
Message-ID: <54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com> (raw)
In-Reply-To: <95ed1a81-ff8e-2c48-8838-4b3995af51b7@redhat.com>

Hello,

Thank you for reviewing and commenting.

On 8/10/22 2:03 PM, David Hildenbrand wrote:
> On 26.07.22 18:18, Muhammad Usama Anjum wrote:
>> Hello,
> 
> Hi,
> 
>>
>> This patch series implements a new syscall, process_memwatch. Currently,
>> only the support to watch soft-dirty PTE bit is added. This syscall is
>> generic to watch the memory of the process. There is enough room to add
>> more operations like this to watch memory in the future.
>>
>> Soft-dirty PTE bit of the memory pages can be viewed by using pagemap
>> procfs file. The soft-dirty PTE bit for the memory in a process can be
>> cleared by writing to the clear_refs file. This series adds features that
>> weren't possible through the Proc FS interface.
>> - There is no atomic get soft-dirty PTE bit status and clear operation
>>   possible.
> 
> Such an interface might be easy to add, no?
Are you referring to ioctl? I think this syscall can be used in future
for adding other operations like soft-dirty. This is why syscall has
been added.

If community doesn't agree, I can translate this syscall to the ioctl
same as it is.

> 
>> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
> 
> Same.
> 
> So I'm curious why we need a new syscall for that.
> 
>>
>> Historically, soft-dirty PTE bit tracking has been used in the CRIU
>> project. The Proc FS interface is enough for that as I think the process
>> is frozen. We have the use case where we need to track the soft-dirty
>> PTE bit for running processes. We need this tracking and clear mechanism
>> of a region of memory while the process is running to emulate the
>> getWriteWatch() syscall of Windows. This syscall is used by games to keep
>> track of dirty pages and keep processing only the dirty pages. This
>> syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information.
>>
>> As in the current kernel there is no way to clear a part of memory (instead
>> of clearing the Soft-Dirty bits for the entire processi) and get+clear
>> operation cannot be performed atomically, there are other methods to mimic
>> this information entirely in userspace with poor performance:
>> - The mprotect syscall and SIGSEGV handler for bookkeeping
>> - The userfaultfd syscall with the handler for bookkeeping
> 
> You write "poor performance". Did you actually implement a prototype
> using userfaultfd-wp? Can you share numbers for comparison?
> 
> Adding an new syscall just for handling a corner case feature
> (soft-dirty, which we all love, of course) needs good justification.

The cycles are given in thousands. 60 means 60k cycles here which have
been measured with rdtsc().

|   | Region size in Pages | 1    | 10   | 100   | 1000  | 10000  |
|---|----------------------|------|------|-------|-------|--------|
| 1 | MEMWATCH             | 7    | 58   | 281   | 1178  | 17563  |
| 2 | MEMWATCH Perf        | 4    | 23   | 107   | 1331  | 8924   |
| 3 | USERFAULTFD          | 5405 | 6550 | 10387 | 55708 | 621522 |
| 4 | MPROTECT_SEGV        | 35   | 611  | 1060  | 6646  | 60149  |

1. MEMWATCH --> process_memwatch considering VM_SOFTDIRT (splitting is
possible)
2. MEMWATCH Perf --> process_memwatch without considering VM_SOFTDIRTY
3. Userafaultfd --> userfaultfd with handling is userspace
4. Mprotect_segv --> mprotect and signal handler in userspace

Note: Implementation of mprotect_segv is very similar to userfaultfd. In
both of these, the signal/fault is being handled in the userspace. In
mprotect_segv, the memory region is write-protected through mprotect and
SEGV signal is received when something is written to this region. This
signal's handler is where we do calculations about soft dirty pages.
Mprotect_segv mechanism must be lighter than userfaultfd inside kernel.

My benchmark application is purely single threaded to keep effort to a
minimum until we decide to spend more time. It has been written to
measure the time taken in a serial execution of these statements without
locks. If the multi-threaded application is used and randomization is
introduced, it should affect `MPROTECT_SEGV` and `userfaultd`
implementations more than memwatch. But in this particular setting,
memwatch and mprotect_segv perform closely.


> 
>>
>>         long process_memwatch(int pidfd, unsigned long start, int len,
>>                               unsigned int flags, void *vec, int vec_len);
>>
>> This syscall can be used by the CRIU project and other applications which
>> require soft-dirty PTE bit information. The following operations are
>> supported in this syscall:
>> - Get the pages that are soft-dirty.
>> - Clear the pages which are soft-dirty.
>> - The optional flag to ignore the VM_SOFTDIRTY and only track per page
>> soft-dirty PTE bit
> 
> Huh, why? VM_SOFTDIRTY is an internal implementation detail and should
> remain such.
> 
> VM_SOFTDIRTY translates to "all pages in this VMA are soft-dirty".
Clearing soft-dirty bit for a range of memory may result in splitting
the VMA. Soft-dirty bit of the per page need to be cleared. The
VM_SOFTDIRTY flag from this splitted VMA need to be cleared. The kernel
may decide to merge this splitted VMA back. Please note that kernel
doesn't take into account the VM_SOFTDIRTY flag of the VMAs when it
decides to merge the VMAs. This not only gives performance hit, but also
the non-dirty pages of the whole VMA start to appear as dirty again
after the VMA merging. To avoid this penalty,
MEMWATCH_SD_NO_REUSED_REGIONS flag has been added to ignore the
VM_SOFTDIRTY and just rely on the soft-dirty bit present on the per
page. The user is aware about the constraint that the new regions will
not be found dirty if this flag is specified.

> 

-- 
Muhammad Usama Anjum

  reply	other threads:[~2022-08-10 16:39 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-26 16:18 [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 1/5] fs/proc/task_mmu: make functions global to be used in other files Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 2/5] mm: Implement process_memwatch syscall Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 3/5] mm: wire up process_memwatch syscall for x86 Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 4/5] selftests: vm: add process_memwatch syscall tests Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 5/5] mm: add process_memwatch syscall documentation Muhammad Usama Anjum
2022-08-10  8:45 ` [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
2022-08-10  9:03 ` David Hildenbrand
2022-08-10 16:39   ` Muhammad Usama Anjum [this message]
2022-08-10 17:05   ` Gabriel Krisman Bertazi
2022-08-10  9:22 ` Peter.Enderborg
2022-08-10 16:44   ` Muhammad Usama Anjum
2022-08-10 16:53   ` Gabriel Krisman Bertazi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com \
    --to=usama.anjum@collabora.com \
    --cc=acme@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jolsa@kernel.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.