* [patch 15/39] mm/madvise: pass task and mm to do_madvise
[not found] <20200814172939.55d6d80b6e21e4241f1ee1f3@linux-foundation.org>
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 16/39] pid: move pidfd_get_pid() to pid.c Andrew Morton
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: pass task and mm to do_madvise
Patch series "introduce memory hinting API for external process", v8.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
that, application could give hints to kernel what memory range are
preferred to be reclaimed. However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To solve the concern, this patch introduces new syscall -
process_madvise(2). Bascially, it's same with madvise(2) syscall but it
has some differences.
1. It needs pidfd of target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
moment. Other hints in madvise will be opened when there are explicit
requests from community to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for other process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 4):
In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it
via access_mm() once (in the following patch) and only use that pointer
[1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers
are safe, so we can use them further down the call stack.
And let's pass *current* and current->mm as arguments of do_madvise so
it shouldn't change existing behavior but prepare next patch to make
review easy.
Note: io_madvise passes NULL as target_task argument of do_madvise because
it couldn't know who is target.
[1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/io_uring.c | 2 +-
include/linux/mm.h | 3 ++-
mm/madvise.c | 40 +++++++++++++++++++++++-----------------
3 files changed, 26 insertions(+), 19 deletions(-)
--- a/fs/io_uring.c~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/fs/io_uring.c
@@ -3731,7 +3731,7 @@ static int io_madvise(struct io_kiocb *r
if (force_nonblock)
return -EAGAIN;
- ret = do_madvise(ma->addr, ma->len, ma->advice);
+ ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
if (ret < 0)
req_set_fail_links(req);
io_req_complete(req, ret);
--- a/include/linux/mm.h~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/include/linux/mm.h
@@ -2547,7 +2547,8 @@ extern int __do_munmap(struct mm_struct
struct list_head *uf, bool downgrade);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
-extern int do_madvise(unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior);
#ifdef CONFIG_MMU
extern int __mm_populate(unsigned long addr, unsigned long len,
--- a/mm/madvise.c~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/mm/madvise.c
@@ -22,12 +22,14 @@
#include <linux/file.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
+#include <linux/compat.h>
#include <linux/pagewalk.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/mm.h>
+#include <linux/uio.h>
#include <asm/tlb.h>
@@ -255,6 +257,7 @@ static long madvise_willneed(struct vm_a
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
+ struct mm_struct *mm = vma->vm_mm;
struct file *file = vma->vm_file;
loff_t offset;
@@ -289,12 +292,12 @@ static long madvise_willneed(struct vm_a
*/
*prev = NULL; /* tell sys_madvise we drop mmap_lock */
get_file(file);
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
fput(file);
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
return 0;
}
@@ -683,7 +686,6 @@ out:
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
arch_leave_lazy_mmu_mode();
@@ -763,6 +765,8 @@ static long madvise_dontneed_free(struct
unsigned long start, unsigned long end,
int behavior)
{
+ struct mm_struct *mm = vma->vm_mm;
+
*prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -770,8 +774,8 @@ static long madvise_dontneed_free(struct
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_lock has been dropped, prev is stale */
- mmap_read_lock(current->mm);
- vma = find_vma(current->mm, start);
+ mmap_read_lock(mm);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -825,6 +829,7 @@ static long madvise_remove(struct vm_are
loff_t offset;
int error;
struct file *f;
+ struct mm_struct *mm = vma->vm_mm;
*prev = NULL; /* tell sys_madvise we drop mmap_lock */
@@ -852,13 +857,13 @@ static long madvise_remove(struct vm_are
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_lock was not released by userfaultfd_remove() */
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
return error;
}
@@ -1051,7 +1056,8 @@ madvise_behavior_valid(int behavior)
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
-int do_madvise(unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1089,7 +1095,7 @@ int do_madvise(unsigned long start, size
write = madvise_need_mmap_write(behavior);
if (write) {
- if (mmap_write_lock_killable(current->mm))
+ if (mmap_write_lock_killable(mm))
return -EINTR;
/*
@@ -1104,12 +1110,12 @@ int do_madvise(unsigned long start, size
* but for now we have the mmget_still_valid()
* model.
*/
- if (!mmget_still_valid(current->mm)) {
- mmap_write_unlock(current->mm);
+ if (!mmget_still_valid(mm)) {
+ mmap_write_unlock(mm);
return -EINTR;
}
} else {
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
}
/*
@@ -1117,7 +1123,7 @@ int do_madvise(unsigned long start, size
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
@@ -1154,19 +1160,19 @@ int do_madvise(unsigned long start, size
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_lock */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- mmap_write_unlock(current->mm);
+ mmap_write_unlock(mm);
else
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
return error;
}
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
- return do_madvise(start, len_in, behavior);
+ return do_madvise(current, current->mm, start, len_in, behavior);
}
_
^ permalink raw reply [flat|nested] 12+ messages in thread
* [patch 16/39] pid: move pidfd_get_pid() to pid.c
[not found] <20200814172939.55d6d80b6e21e4241f1ee1f3@linux-foundation.org>
2020-08-15 0:30 ` [patch 15/39] mm/madvise: pass task and mm to do_madvise Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
3 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
dancol, hannes, jannh, joaodias, joel, ktkhai, linux-man,
linux-mm, mhocko, minchan, mm-commits, oleksandr, rientjes,
shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb, timmurray,
torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: pid: move pidfd_get_pid() to pid.c
process_madvise syscall needs pidfd_get_pid function to translate pidfd to
pid so this patch move the function to kernel/pid.c.
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/pid.h | 1 +
kernel/exit.c | 17 -----------------
kernel/pid.c | 17 +++++++++++++++++
3 files changed, 18 insertions(+), 17 deletions(-)
--- a/include/linux/pid.h~pid-move-pidfd_get_pid-to-pidc
+++ a/include/linux/pid.h
@@ -77,6 +77,7 @@ extern const struct file_operations pidf
struct file;
extern struct pid *pidfd_pid(const struct file *file);
+struct pid *pidfd_get_pid(unsigned int fd);
static inline struct pid *get_pid(struct pid *pid)
{
--- a/kernel/exit.c~pid-move-pidfd_get_pid-to-pidc
+++ a/kernel/exit.c
@@ -1474,23 +1474,6 @@ end:
return retval;
}
-static struct pid *pidfd_get_pid(unsigned int fd)
-{
- struct fd f;
- struct pid *pid;
-
- f = fdget(fd);
- if (!f.file)
- return ERR_PTR(-EBADF);
-
- pid = pidfd_pid(f.file);
- if (!IS_ERR(pid))
- get_pid(pid);
-
- fdput(f);
- return pid;
-}
-
static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
int options, struct rusage *ru)
{
--- a/kernel/pid.c~pid-move-pidfd_get_pid-to-pidc
+++ a/kernel/pid.c
@@ -519,6 +519,23 @@ struct pid *find_ge_pid(int nr, struct p
return idr_get_next(&ns->idr, &nr);
}
+struct pid *pidfd_get_pid(unsigned int fd)
+{
+ struct fd f;
+ struct pid *pid;
+
+ f = fdget(fd);
+ if (!f.file)
+ return ERR_PTR(-EBADF);
+
+ pid = pidfd_pid(f.file);
+ if (!IS_ERR(pid))
+ get_pid(pid);
+
+ fdput(f);
+ return pid;
+}
+
/**
* pidfd_create() - Create a new pid file descriptor.
*
_
^ permalink raw reply [flat|nested] 12+ messages in thread
* [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
[not found] <20200814172939.55d6d80b6e21e4241f1ee1f3@linux-foundation.org>
2020-08-15 0:30 ` [patch 15/39] mm/madvise: pass task and mm to do_madvise Andrew Morton
2020-08-15 0:30 ` [patch 16/39] pid: move pidfd_get_pid() to pid.c Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-16 8:12 ` Christian Brauner
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
3 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is usecase that System Management Software(SMS) want to give a
memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to
the app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to
initiate reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall process_madvise(2).
It uses pidfd of an external process to give the hint. It also supports
vector address range because Android app has thousands of vmas due to
zygote so it's totally waste of CPU and power if we should call the
syscall one by one for each vma.(With testing 2000-vma syscall vs
1-vector syscall, it showed 15% performance improvement. I think it
would be bigger in real practice because the testing ran very cache
friendly environment).
Another potential use case for the vector range is to amortize the cost
ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this
could benefit users like TCP receive zerocopy and malloc implementations.
In future, we could find more usecases for other advises so let's make it
happens as API since we introduce a new syscall at this moment. With
that, existing madvise(2) user could replace it with process_madvise(2)
with their own pid if they want to have batch address ranges support
feature.
ince it could affect other process's address range, only privileged
process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the
same UID) gives it the right to ptrace the process could use it
successfully. The flag argument is reserved for future use if we need to
extend the API.
I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky. Because we are not sure all hints
make sense from external process and implementation for the hint may
rely on the caller being in the current context so it could be
error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this
patch.
If someone want to add other hints, we could hear hear the usecase and
review it for each hint. It's safer for maintenance rather than
introducing a buggy syscall but hard to fix it later.
So finally, the API is as follows,
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges from external process as well as
local process. It provides the advice to address ranges of process
described by iovec and vlen. The goal of such advice is to improve system
or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidofd_open(2) for further information)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
The iovec describes address ranges beginning at address(iov_base)
and with size length of bytes(iov_len).
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
The process_madvise supports every advice madvise(2) has if target
process is in same thread group with calling process so user could
use process_madvise(2) to extend existing madvise(2) to support
vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How to guarantee the race(i.e., object validation) between when
giving a hint from an external process and get the hint from the target
process?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the space
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. It
also apply other APIs like move_pages, process_vm_write.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization by using other API or
design from userside. It shouldn't be part of API itself. If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.
To make the API extend, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act the hint in caller's context, not the
callee's, because the callee is usually limited in cpuset/cgroups or
even freezed state so they can't act by themselves quick enough, which
causes more thrashing/kill. It doesn't work if the target process are
ptraced(e.g., strace, debugger, minidump) because a process can have at
most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd.h | 2
arch/arm64/include/asm/unistd32.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 2
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/compat.h | 4
include/linux/syscalls.h | 2
include/uapi/asm-generic/unistd.h | 4
kernel/sys_ni.c | 2
mm/madvise.c | 121 ++++++++++++++++++
23 files changed, 152 insertions(+), 2 deletions(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
+550 common process_madvise sys_process_madvise
--- a/arch/arm64/include/asm/unistd32.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_process_madvise 440
+__SYSCALL(__NR_process_madvise, compat_sys_process_madvise)
/*
* Please add new compat syscalls above this comment and update
--- a/arch/arm64/include/asm/unistd.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 440
+#define __NR_compat_syscalls 441
#endif
#define __ARCH_WANT_SYS_CLONE
--- a/arch/arm/tools/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/ia64/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/m68k/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/microblaze/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
+440 n32 process_madvise compat_sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
+440 n64 process_madvise sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
+440 o32 process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/parisc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/powerpc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -529,3 +529,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/s390/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/sh/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/sparc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/x86/entry/syscalls/syscall_32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/x86/entry/syscalls/syscall_64.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 64 process_madvise sys_process_madvise
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -404,3 +405,4 @@
545 x32 execveat compat_sys_execveat
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 process_madvise compat_sys_process_madvise
--- a/arch/xtensa/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/include/linux/compat.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/linux/compat.h
@@ -823,6 +823,10 @@ asmlinkage long compat_sys_pwritev64v2(u
unsigned long vlen, loff_t pos, rwf_t flags);
#endif
+asmlinkage ssize_t compat_sys_process_madvise(compat_int_t pidfd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, compat_int_t behavior,
+ compat_uint_t flags);
/*
* Deprecated system calls which are still defined in
--- a/include/linux/syscalls.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/linux/syscalls.h
@@ -880,6 +880,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
+ unsigned long vlen, int behavior, unsigned int flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
--- a/include/uapi/asm-generic/unistd.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_process_madvise 440
+__SC_COMP(__NR_process_madvise, sys_process_madvise, compat_sys_process_madvise)
#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
/*
* 32 bit systems traditionally used different
--- a/kernel/sys_ni.c~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/kernel/sys_ni.c
@@ -280,6 +280,8 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
+COND_SYSCALL_COMPAT(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
--- a/mm/madvise.c~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/mm/madvise.c
@@ -17,6 +17,7 @@
#include <linux/falloc.h>
#include <linux/fadvise.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
}
}
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* The madvise(2) system call.
*
@@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
* MADV_DONTDUMP - the application wants to prevent pages in the given range
* from being included in its core dump.
* MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ * MADV_COLD - the application is not expected to use this memory soon,
+ * deactivate pages in this range so that they can be reclaimed
+ * easily if memory pressure hanppens.
+ * MADV_PAGEOUT - the application is not expected to use this memory soon,
+ * page out the pages in this range immediately.
*
* return values:
* zero - success
@@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long,
{
return do_madvise(current, current->mm, start, len_in, behavior);
}
+
+static int process_madvise_vec(struct task_struct *target_task,
+ struct mm_struct *mm, struct iov_iter *iter, int behavior)
+{
+ struct iovec iovec;
+ int ret = 0;
+
+ while (iov_iter_count(iter)) {
+ iovec = iov_iter_iovec(iter);
+ ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+ iovec.iov_len, behavior);
+ if (ret < 0)
+ break;
+ iov_iter_advance(iter, iovec.iov_len);
+ }
+
+ return ret;
+}
+
+static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
+ int behavior, unsigned int flags)
+{
+ ssize_t ret;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+ size_t total_len = iov_iter_count(iter);
+
+ if (flags != 0)
+ return -EINVAL;
+
+ pid = pidfd_get_pid(pidfd);
+ if (IS_ERR(pid))
+ return PTR_ERR(pid);
+
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ ret = -ESRCH;
+ goto put_pid;
+ }
+
+ if (task->mm != current->mm &&
+ !process_madvise_behavior_valid(behavior)) {
+ ret = -EINVAL;
+ goto release_task;
+ }
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (IS_ERR_OR_NULL(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ goto release_task;
+ }
+
+ ret = process_madvise_vec(task, mm, iter, behavior);
+ if (ret >= 0)
+ ret = total_len - iov_iter_count(iter);
+
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+put_pid:
+ put_pid(pid);
+ return ret;
+}
+
+SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
+ unsigned long, vlen, int, behavior, unsigned int, flags)
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(pidfd, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
+ const struct compat_iovec __user *, vec,
+ compat_ulong_t, vlen,
+ compat_int_t, behavior,
+ compat_uint_t, flags)
+
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
+ &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(pidfd, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+#endif
_
^ permalink raw reply [flat|nested] 12+ messages in thread
* [patch 18/39] mm/madvise: check fatal signal pending of target process
[not found] <20200814172939.55d6d80b6e21e4241f1ee1f3@linux-foundation.org>
` (2 preceding siblings ...)
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 2:53 ` Linus Torvalds
3 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: check fatal signal pending of target process
Bail out to prevent unnecessary CPU overhead if target process has pending
fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
Link: http://lkml.kernel.org/r/20200302193630.68771-4-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/madvise.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
--- a/mm/madvise.c~mm-madvise-check-fatal-signal-pending-of-target-process
+++ a/mm/madvise.c
@@ -39,6 +39,7 @@
struct madvise_walk_private {
struct mmu_gather *tlb;
bool pageout;
+ struct task_struct *target_task;
};
/*
@@ -319,6 +320,10 @@ static int madvise_cold_or_pageout_pte_r
if (fatal_signal_pending(current))
return -EINTR;
+ if (private->target_task &&
+ fatal_signal_pending(private->target_task))
+ return -EINTR;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (pmd_trans_huge(*pmd)) {
pmd_t orig_pmd;
@@ -480,12 +485,14 @@ static const struct mm_walk_ops cold_wal
};
static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct task_struct *task,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
struct madvise_walk_private walk_private = {
.pageout = false,
.tlb = tlb,
+ .target_task = task,
};
tlb_start_vma(tlb, vma);
@@ -493,7 +500,8 @@ static void madvise_cold_page_range(stru
tlb_end_vma(tlb, vma);
}
-static long madvise_cold(struct vm_area_struct *vma,
+static long madvise_cold(struct task_struct *task,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
{
@@ -506,19 +514,21 @@ static long madvise_cold(struct vm_area_
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct task_struct *task,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
struct madvise_walk_private walk_private = {
.pageout = true,
.tlb = tlb,
+ .target_task = task,
};
tlb_start_vma(tlb, vma);
@@ -542,7 +552,8 @@ static inline bool can_do_pageout(struct
inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
}
-static long madvise_pageout(struct vm_area_struct *vma,
+static long madvise_pageout(struct task_struct *task,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
{
@@ -558,7 +569,7 @@ static long madvise_pageout(struct vm_ar
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
@@ -938,7 +949,8 @@ static int madvise_inject_error(int beha
#endif
static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
@@ -947,9 +959,9 @@ madvise_vma(struct vm_area_struct *vma,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
+ return madvise_cold(task, vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(task, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -1166,7 +1178,8 @@ int do_madvise(struct task_struct *targe
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(target_task, vma, &prev,
+ start, tmp, behavior);
if (error)
goto out;
start = tmp;
_
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
@ 2020-08-15 2:53 ` Linus Torvalds
2020-08-15 4:59 ` Minchan Kim
0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-08-15 2:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Duyck, Jens Axboe, Brian Geffon, Christian Brauner,
Christian Brauner, Daniel Colascione, Johannes Weiner, Jann Horn,
John Dias, Joel Fernandes, Kirill Tkhai, linux-man, Linux-MM,
Michal Hocko, Minchan Kim, mm-commits, Oleksandr Natalenko,
David Rientjes, Shakeel Butt, sj38.park, sjpark, Sonny Rao,
Sandeep Patil, Suren Baghdasaryan, Tim Murray, Vlastimil Babka
On Fri, Aug 14, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Minchan Kim <minchan@kernel.org>
> Subject: mm/madvise: check fatal signal pending of target process
>
> Bail out to prevent unnecessary CPU overhead if target process has pending
> fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
This seems bogus.
Returning -EINTR when *SOMEBODY ELSE* has a signal is crazy talk.
It also seems to be the reason for the previous patches inexplicably
passing in the task pointer.
Finally, it has absolutely no explanations for why this would matter,
and why it's magically and suddenly an issue for process_madvise(),
when in the history of the *real* madvise() this hasn't been an issue
for "current".
I'm dropping the madvise() series.
If the issue is that you can generate a long series or areas with that
iovec, maybe the code should re-consider. Or maybe the signal pending
case should be done there, not passing down an odd task pointer to the
low-level madvise code.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 2:53 ` Linus Torvalds
@ 2020-08-15 4:59 ` Minchan Kim
2020-08-15 14:57 ` Linus Torvalds
0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2020-08-15 4:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Fri, Aug 14, 2020 at 07:53:09PM -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Minchan Kim <minchan@kernel.org>
> > Subject: mm/madvise: check fatal signal pending of target process
> >
> > Bail out to prevent unnecessary CPU overhead if target process has pending
> > fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
>
> This seems bogus.
>
> Returning -EINTR when *SOMEBODY ELSE* has a signal is crazy talk.
It doesn't propagate -EINTR to the user but just aiming for canceling
the entire operation.
>
> It also seems to be the reason for the previous patches inexplicably
> passing in the task pointer.
>
> Finally, it has absolutely no explanations for why this would matter,
> and why it's magically and suddenly an issue for process_madvise(),
> when in the history of the *real* madvise() this hasn't been an issue
> for "current".
Currently, madvise(MADV_COLD|PAGEOUT) already have done it. I just wanted
to sync with it with process_madvise. Ting was process_madvise couldn't
get target task while madvise could get it easily.
>
> I'm dropping the madvise() series.
>
> If the issue is that you can generate a long series or areas with that
> iovec, maybe the code should re-consider. Or maybe the signal pending
> case should be done there, not passing down an odd task pointer to the
> low-level madvise code.
Do you mean you want to drop target signal check madvise as well as
process_madvise?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 4:59 ` Minchan Kim
@ 2020-08-15 14:57 ` Linus Torvalds
2020-08-15 18:34 ` Minchan Kim
0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-08-15 14:57 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Fri, Aug 14, 2020 at 9:59 PM Minchan Kim <minchan@kernel.org> wrote:
>
> Currently, madvise(MADV_COLD|PAGEOUT) already have done it. I just wanted
> to sync with it with process_madvise. Ting was process_madvise couldn't
> get target task while madvise could get it easily.
The thing is, for "current" it makes sense.
It makes sense because "current" is also the one doing the action, so
when current is dying, stopping the action is sane too.
But when you target somebody else, the signal handling simply doesn't
make any sense at all.
It doesn't make sense because the error code doesn't make sense (EINTR
really is about the _actor_ getting interrupted, not the target), but
it also doesn't make sense simply because there is no 1:1 relationship
between the target mm and the target thread.
The pid that was the target may be dying, but that does *not* mean
that the mm itself is dying. So stopping the operation arbitrarily
somewhere in the middle is a fundamentally insane operation. You've
done something partial to a mm that may well still be active.
So I think it's simply conceptually wrong to look at some "target
thread signal state" in ways that it isn't to look at "current signal
state".
Now, it might be worth it to have some kind of "this mm is dying,
don't bother" thing. We _kind_ of have things like that already in the
form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
image of it).
It might be reasonable to have a MMF_DYING flag, but I'm not even sure
how to implement it, exactly because of that "this thread group may be
dying, but the MM might still be attached to other tasks" issue.
For example, if you do "vfork()" and the child is killed, the mm is
still active and attached to the vfork() parent.
Similarly, on a trivial level, a particular thread might be killed
without the rest of the threads being necessarily killed.
Again, for regular "madvise()" it makes sense to look at whether the
_current_ thread is being killed - because that fundamentally
interrupts the operator. But for somebody else, operating on the mm of
a thread, I really think it's wrong to look at the target thread state
and leave the MM operation in some half-way state.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 14:57 ` Linus Torvalds
@ 2020-08-15 18:34 ` Minchan Kim
2020-08-16 1:43 ` Linus Torvalds
0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2020-08-15 18:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 07:57:15AM -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 9:59 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Currently, madvise(MADV_COLD|PAGEOUT) already have done it. I just wanted
> > to sync with it with process_madvise. Ting was process_madvise couldn't
> > get target task while madvise could get it easily.
>
> The thing is, for "current" it makes sense.
>
> It makes sense because "current" is also the one doing the action, so
> when current is dying, stopping the action is sane too.
True.
>
> But when you target somebody else, the signal handling simply doesn't
> make any sense at all.
>
> It doesn't make sense because the error code doesn't make sense (EINTR
> really is about the _actor_ getting interrupted, not the target), but
> it also doesn't make sense simply because there is no 1:1 relationship
> between the target mm and the target thread.
>
> The pid that was the target may be dying, but that does *not* mean
> that the mm itself is dying. So stopping the operation arbitrarily
> somewhere in the middle is a fundamentally insane operation. You've
> done something partial to a mm that may well still be active.
>
> So I think it's simply conceptually wrong to look at some "target
> thread signal state" in ways that it isn't to look at "current signal
> state".
Agreed.
>
> Now, it might be worth it to have some kind of "this mm is dying,
> don't bother" thing. We _kind_ of have things like that already in the
> form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> image of it).
>
> It might be reasonable to have a MMF_DYING flag, but I'm not even sure
> how to implement it, exactly because of that "this thread group may be
> dying, but the MM might still be attached to other tasks" issue.
>
> For example, if you do "vfork()" and the child is killed, the mm is
> still active and attached to the vfork() parent.
Maybe, we could use mm_struct's mm_users to check caller is exclusive
owner so that rest of all are existing.
>
> Similarly, on a trivial level, a particular thread might be killed
> without the rest of the threads being necessarily killed.
>
> Again, for regular "madvise()" it makes sense to look at whether the
> _current_ thread is being killed - because that fundamentally
> interrupts the operator. But for somebody else, operating on the mm of
> a thread, I really think it's wrong to look at the target thread state
> and leave the MM operation in some half-way state.
I agreed. I will drop this single patch with revising previous patch
not to make passing task_struct since the idea.
We could revist if someting real trouble happens.
Please tell me if you found something weird in this patchset series
so that in next cycle we could go smooth.
Thanks for the review, Linus.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 18:34 ` Minchan Kim
@ 2020-08-16 1:43 ` Linus Torvalds
2020-08-16 5:58 ` Minchan Kim
0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-08-16 1:43 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 11:35 AM Minchan Kim <minchan@kernel.org> wrote:
>
> > Now, it might be worth it to have some kind of "this mm is dying,
> > don't bother" thing. We _kind_ of have things like that already in the
> > form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> > image of it).
>
> Maybe, we could use mm_struct's mm_users to check caller is exclusive
> owner so that rest of all are existing.
Hmm. Checking mm_users sounds sane. But I think the /proc reference by
any get_task_mm() site will also count as a mm_user, so it's not quite
as useful as it could be.
In an optimal world, all the temporary "grab a reference to the mm"
would use mmgrab/mmdrop() that increments the mm_count, and "mm_users"
would mean the number of actual threads that are actively using it.
But that's not how it ends up working. mmgrab/mmdrop only keeps the
"struct mm_struct" around - it doesn't keep the vma's or the page
tables. So all the /proc users really do want to increase mm_users.
I don't see any obvious thing to check for that would be about "this
mm no longer makes sense to madvise on, because nobody cares any
more".
> Please tell me if you found something weird in this patchset series
> so that in next cycle we could go smooth.
No, the only other thing that worried me was just possible locking,
but it looked like we already have all the "access page tables from
other processes" situations and it didn't seem to introduce anything
new in that respect.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-16 1:43 ` Linus Torvalds
@ 2020-08-16 5:58 ` Minchan Kim
0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2020-08-16 5:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 06:43:08PM -0700, Linus Torvalds wrote:
> On Sat, Aug 15, 2020 at 11:35 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > > Now, it might be worth it to have some kind of "this mm is dying,
> > > don't bother" thing. We _kind_ of have things like that already in the
> > > form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> > > image of it).
> >
> > Maybe, we could use mm_struct's mm_users to check caller is exclusive
> > owner so that rest of all are existing.
>
> Hmm. Checking mm_users sounds sane. But I think the /proc reference by
> any get_task_mm() site will also count as a mm_user, so it's not quite
> as useful as it could be.
>
> In an optimal world, all the temporary "grab a reference to the mm"
> would use mmgrab/mmdrop() that increments the mm_count, and "mm_users"
> would mean the number of actual threads that are actively using it.
>
> But that's not how it ends up working. mmgrab/mmdrop only keeps the
> "struct mm_struct" around - it doesn't keep the vma's or the page
> tables. So all the /proc users really do want to increase mm_users.
>
> I don't see any obvious thing to check for that would be about "this
> mm no longer makes sense to madvise on, because nobody cares any
> more".
Yeah, there are bunch of places where makes false negaive potentially
as well as proc but I expected it would be rather rare and even though
it happens, finally, we can catch it up if they are temporally holding
the refcount but our operation runs long.
At worst case, it could make the operation void so we just wastes but
when I consider the logic as optimization, it wouldn't be harmful to
start with such *simple check* rather than adding more complication.
If you still don't like the idea, at this point, I will drop the single
patch as I mentioned because I don't think I have strong justification
to add more complication here.
>
> > Please tell me if you found something weird in this patchset series
> > so that in next cycle we could go smooth.
>
> No, the only other thing that worried me was just possible locking,
> but it looked like we already have all the "access page tables from
> other processes" situations and it didn't seem to introduce anything
> new in that respect.
Thanks for the review, Linus.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
@ 2020-08-16 8:12 ` Christian Brauner
2020-08-17 15:10 ` Minchan Kim
0 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2020-08-16 8:12 UTC (permalink / raw)
To: minchan
Cc: alexander.h.duyck, axboe, bgeffon, christian, dancol, hannes,
jannh, joaodias, joel, ktkhai, linux-man, linux-mm, mhocko,
mm-commits, oleksandr, rientjes, shakeelb, sj38.park, sjpark,
sonnyrao, sspatil, surenb, timmurray, torvalds, vbabka,
Andrew Morton
On Fri, Aug 14, 2020 at 05:30:58PM -0700, Andrew Morton wrote:
> From: Minchan Kim <minchan@kernel.org>
> Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
>
<snip>
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> + unsigned long, vlen, int, behavior, unsigned int, flags)
> +{
> + ssize_t ret;
> + struct iovec iovstack[UIO_FASTIOV];
> + struct iovec *iov = iovstack;
> + struct iov_iter iter;
> +
> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> + if (ret >= 0) {
> + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> + kfree(iov);
> + }
> + return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
> + const struct compat_iovec __user *, vec,
> + compat_ulong_t, vlen,
> + compat_int_t, behavior,
> + compat_uint_t, flags)
> +
> +{
> + ssize_t ret;
> + struct iovec iovstack[UIO_FASTIOV];
> + struct iovec *iov = iovstack;
> + struct iov_iter iter;
> +
> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> + &iov, &iter);
> + if (ret >= 0) {
> + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> + kfree(iov);
> + }
> + return ret;
> +}
> +#endif
Note, I'm only commenting on this patch because it has already been
dropped for this merge window. Otherwise I wouldn't interfer with stuff
that has already been sent for inclusion.
I haven't noticed this before but why do you need this
COMPAT_SYSCALL_DEFINE5()? New code we add today tries pretty hard to
avoid the compat syscall definitions. (See what we did for
pidfd_send_signal(), seccomp, and in io_uring and in various other places.)
Afaict, this could just be sm like (__completely untested__):
static inline int madv_import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs,
unsigned fast_segs, struct iovec **iov, struct iov_iter *i)
{
#ifdef CONFIG_COMPAT
if (in_compat_syscall())
return compat_import_iovec(type, (struct compat_iovec __user *)uvec, nr_segs,
fast_segs, iov, i);
#endif
return import_iovec(type, uvec, nr_segs, fast_segs, iov, i);
}
SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
unsigned long, vlen, int, behavior, unsigned int, flags)
{
ssize_t ret;
struct iovec iovstack[UIO_FASTIOV];
struct iovec *iov = iovstack;
struct iov_iter iter;
ret = madv_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
if (ret < 0)
return ret;
ret = do_process_madvise(pidfd, &iter, behavior, flags);
kfree(iov);
return ret;
}
or is there are specific reason this wouldn't work here?
Christian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
2020-08-16 8:12 ` Christian Brauner
@ 2020-08-17 15:10 ` Minchan Kim
0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2020-08-17 15:10 UTC (permalink / raw)
To: Christian Brauner
Cc: alexander.h.duyck, axboe, bgeffon, christian, dancol, hannes,
jannh, joaodias, joel, ktkhai, linux-man, linux-mm, mhocko,
mm-commits, oleksandr, rientjes, shakeelb, sj38.park, sjpark,
sonnyrao, sspatil, surenb, timmurray, torvalds, vbabka,
Andrew Morton
On Sun, Aug 16, 2020 at 10:12:27AM +0200, Christian Brauner wrote:
> On Fri, Aug 14, 2020 at 05:30:58PM -0700, Andrew Morton wrote:
> > From: Minchan Kim <minchan@kernel.org>
> > Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
> >
>
> <snip>
>
> > +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > + unsigned long, vlen, int, behavior, unsigned int, flags)
> > +{
> > + ssize_t ret;
> > + struct iovec iovstack[UIO_FASTIOV];
> > + struct iovec *iov = iovstack;
> > + struct iov_iter iter;
> > +
> > + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> > + if (ret >= 0) {
> > + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > + kfree(iov);
> > + }
> > + return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
> > + const struct compat_iovec __user *, vec,
> > + compat_ulong_t, vlen,
> > + compat_int_t, behavior,
> > + compat_uint_t, flags)
> > +
> > +{
> > + ssize_t ret;
> > + struct iovec iovstack[UIO_FASTIOV];
> > + struct iovec *iov = iovstack;
> > + struct iov_iter iter;
> > +
> > + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> > + &iov, &iter);
> > + if (ret >= 0) {
> > + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > + kfree(iov);
> > + }
> > + return ret;
> > +}
> > +#endif
>
> Note, I'm only commenting on this patch because it has already been
> dropped for this merge window. Otherwise I wouldn't interfer with stuff
> that has already been sent for inclusion.
>
> I haven't noticed this before but why do you need this
> COMPAT_SYSCALL_DEFINE5()? New code we add today tries pretty hard to
> avoid the compat syscall definitions. (See what we did for
> pidfd_send_signal(), seccomp, and in io_uring and in various other places.)
>
> Afaict, this could just be sm like (__completely untested__):
>
> static inline int madv_import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs,
> unsigned fast_segs, struct iovec **iov, struct iov_iter *i)
> {
> #ifdef CONFIG_COMPAT
> if (in_compat_syscall())
> return compat_import_iovec(type, (struct compat_iovec __user *)uvec, nr_segs,
> fast_segs, iov, i);
> #endif
>
> return import_iovec(type, uvec, nr_segs, fast_segs, iov, i);
> }
>
> SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> unsigned long, vlen, int, behavior, unsigned int, flags)
> {
> ssize_t ret;
> struct iovec iovstack[UIO_FASTIOV];
> struct iovec *iov = iovstack;
> struct iov_iter iter;
>
> ret = madv_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> if (ret < 0)
> return ret;
>
> ret = do_process_madvise(pidfd, &iter, behavior, flags);
> kfree(iov);
> return ret;
> }
>
> or is there are specific reason this wouldn't work here?
No, I just didn't know such trend to avoid compact syscall definitions.
Thanks for the information.
I think your suggestion will work. Let me have it at respin.
Thanks!
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-08-17 15:10 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20200814172939.55d6d80b6e21e4241f1ee1f3@linux-foundation.org>
2020-08-15 0:30 ` [patch 15/39] mm/madvise: pass task and mm to do_madvise Andrew Morton
2020-08-15 0:30 ` [patch 16/39] pid: move pidfd_get_pid() to pid.c Andrew Morton
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
2020-08-16 8:12 ` Christian Brauner
2020-08-17 15:10 ` Minchan Kim
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
2020-08-15 2:53 ` Linus Torvalds
2020-08-15 4:59 ` Minchan Kim
2020-08-15 14:57 ` Linus Torvalds
2020-08-15 18:34 ` Minchan Kim
2020-08-16 1:43 ` Linus Torvalds
2020-08-16 5:58 ` Minchan Kim
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).