* Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
From: Yann Droneaud @ 2019-05-31 10:06 UTC (permalink / raw)
To: Minchan Kim, Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton
In-Reply-To: <20190531064313.193437-7-minchan@kernel.org>
Hi,
Le vendredi 31 mai 2019 à 15:43 +0900, Minchan Kim a écrit :
>
> diff --git a/include/uapi/asm-generic/mman-common.h
> b/include/uapi/asm-generic/mman-common.h
> index 92e347a89ddc..220c2b5eb961 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -75,4 +75,15 @@
> #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> PKEY_DISABLE_WRITE)
>
> +struct pr_madvise_param {
> + int size; /* the size of this structure */
> + int cookie; /* reserved to support atomicity */
> + int nr_elem; /* count of below arrary fields */
Those should be unsigned.
There's an implicit hole here on ABI with 64bits aligned pointers
> + int __user *hints; /* hints for each range */
> + /* to store result of each operation */
> + const struct iovec __user *results;
> + /* input address ranges */
> + const struct iovec __user *ranges;
Using pointer type in uAPI structure require a 'compat' version of the
syscall need to be provided.
If using iovec too.
> +};
> +
> #endif /* __ASM_GENERIC_MMAN_COMMON_H */
Regards.
--
Yann Droneaud
OPTEYA
^ permalink raw reply
* [PATCH v3 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-05-31 9:47 UTC (permalink / raw)
To: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel; +Cc: linux-api
This patch implements an extension of eventfd (EFD_VPOLL) to define file descriptors
whose I/O events can be generated at user level. These file descriptors
trigger notifications for [p]select/[p]poll/epoll.
This feature is useful for user-level implementations of network stacks
or virtual device drivers as libraries.
Networking programs use system calls implementing the Berkeley sockets API:
socket, accept, connect, listen, recv*, send* etc. Programs dealing with a
device use system calls like open, read, write, ioctl etc.
When somebody wants to write a library able to behave like a network stack (say
lwipv6, picotcp) or a device, they can implement functions like my_socket,
my_accept, my_open or my_ioctl, as drop-in replacement of their system
call counterpart. (It is also possible to use dynamic library magic to
rename/divert the system call requests to use their 'virtual'
implementation provided by the library: socket maps to my_socket, recv
to my_recv etc).
In this way portability and compatibility is easier, using a well known API
instead of inventing new ones.
Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll. These system calls can refer at the same time
to file descriptors created by 'real' system calls like socket, open, signalfd...
and to file descriptors returned by my_open, your_socket.
While it is possible to provide a partial support (e.g. using pipes or
socketpairs), a clean and complete solution is still missing (as far as I
have seen); e.g. I have not seen any clean way to generate EPOLLPRI,
EPOLLERR, etc.
Example:
Let us suppose there is an application waiting for a TCP OOB message. It uses poll to wait
for POLLPRI and then reads the message (e.g. by 'recv').
If I want to port that application to use a network stack implemented as a library
I have to rewrite the code about 'poll' as it is not possible to receive a POLLPRI.
>From a pipe I can just receive a POLLIN, I have to encode in an external data structure
any further information.
Using EFD_VPOLL the solution is straightforward: the function mysocket (used in place
of socket to create a file descripor behaving as a 'real'socket) returns a file
descriptor created by eventfd/EFD_VPOLL, so the poll system call can be left
unmodified in the code. When the OOB message is available the library can trigger
an EPOLLPRI and the message can be received using my_recv.
This proposal is based on a new tag for eventfd2(2): EFD_VPOLL.
This statement:
fd = eventfd(EPOLLOUT, EFD_VPOLL | EFD_CLOEXEC);
creates a file descriptor for I/O event generation. In this case EPOLLOUT is
initially true.
Likewise all the other eventfs services, read(2) and write(2) use a 8-byte
integer argument.
read(2) returns the current state of the pending events.
The argument of write(2) is an or-composition of a control command
(EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the
bitmap of events to be added, deleted to the current set of pending events.
EFD_VPOLL_MODEVENTS completely redefines the set of pending events.
e.g.:
uint64_t request = EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI;
write(fd, &request, sizeof(request);
adds EPOLLIN and EPOLLPRI to the set of pending events.
There can be other approaches than EFD_VPOLL: e.g. add two specific new system
calls like "vpollfd_create" and "vpollfd_ctl".
Their signature could be:
int vpollfd_create(unsigned int init_events, int flags);
where flags are the usual NONBLOCK/CLOEXEC
int vpollfd_ctl(int fd, int op, unsigned int events);
where op can be VPOLL_ADDEVENTS, VPOLL_DELEVENTS, VPOLL_MODEVENTS
It possible to reimplement the patch this way. It needs the definition of the new system calls.
I am proposing just a new tag for eventfd as eventfd purpose is conceptually close to the new feature.
Eventfd creates a file descriptor which generates events. The default eventfd mode uses counters while
EFD_VPOLL uses event flags. The new feature can be implemented on eventfd with a very limited
impact on the kernel core code.
Instead of syscalls, the vpollfd_create/vpollfd_ctl API could be provided by the glibc as (very simple)
library functions, as it is the case for eventfd_read/eventfd_write in /usr/include/sys/eventfd.h
These are examples of messages asking for a feature like EFD_VPOLL:
https://stackoverflow.com/questions/909189/simulating-file-descriptor-in-user-space
https://stackoverflow.com/questions/1648147/running-a-simple-tcp-server-with-poll-how-do-i-trigger-events-artificially
... and I need it to write networking and device modules for vuos:
https://github.com/virtualsquare/vuos
(it is the new codebase of ViewOS, see www.virtualsquare.org).
EXAMPLE of program using EFD_VPOLL:
The following program creates an eventfd/EFD_VPOLL file descriptor and then forks
a child process. While the parent waits for events using epoll_wait the child
generates a sequence of events. When the parent receives an event (or a set of events)
it prints it and disarm it.
The following shell session shows a sample run of the program:
timeout...
timeout...
GOT event 1
timeout...
GOT event 1
timeout...
GOT event 3
timeout...
GOT event 2
timeout...
GOT event 4
timeout...
GOT event 10
Program source:
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h> /* Definition of uint64_t */
#ifndef EFD_VPOLL
#define EFD_VPOLL (1 << 1)
#define EFD_VPOLL_ADDEVENTS (1ULL << 32)
#define EFD_VPOLL_DELEVENTS (2ULL << 32)
#define EFD_VPOLL_MODEVENTS (3ULL << 32)
#endif
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
static void vpoll_ctl(int fd, uint64_t request) {
ssize_t s;
s = write(fd, &request, sizeof(request));
if (s != sizeof(uint64_t))
handle_error("write");
}
int
main(int argc, char *argv[])
{
int efd, epollfd;
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI;
ev.data.u64 = 0;
efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
if (efd == -1)
handle_error("eventfd");
epollfd = epoll_create1(EPOLL_CLOEXEC);
if (efd == -1)
handle_error("epoll_create1");
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
handle_error("epoll_ctl");
switch (fork()) {
case 0:
sleep(3);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
sleep(2);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
sleep(2);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
sleep(2);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
sleep(2);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
sleep(2);
vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
exit(EXIT_SUCCESS);
default:
while (1) {
int nfds;
nfds = epoll_wait(epollfd, &ev, 1, 1000);
if (nfds < 0)
handle_error("epoll_wait");
else if (nfds == 0)
printf("timeout...\n");
else {
printf("GOT event %x\n", ev.events);
vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
if (ev.events & EPOLLHUP)
break;
}
}
case -1:
handle_error("fork");
}
close(epollfd);
close(efd);
return 0;
}
Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
Reported-by: kbuild test robot <lkp@intel.com>
---
fs/eventfd.c | 116 +++++++++++++++++++++++++++++++--
include/linux/eventfd.h | 7 +-
include/uapi/linux/eventpoll.h | 2 +
3 files changed, 117 insertions(+), 8 deletions(-)
Changes in v2:
- Fix size of EFD_VPOLL_*EVENTS constants for 32 bit architectures
Changes in v3:
- Fix sparse warnings and wrong arg of wake_up_locked_poll in eventfd_vpoll_write
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..6cdb1b854341 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -24,18 +24,32 @@
#include <linux/seq_file.h>
#include <linux/idr.h>
+#define EPOLLALLMASK64 ((__force __u64)EPOLLALLMASK)
+
static DEFINE_IDA(eventfd_ida);
struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
/*
- * Every time that a write(2) is performed on an eventfd, the
- * value of the __u64 being written is added to "count" and a
- * wakeup is performed on "wqh". A read(2) will return the "count"
- * value to userspace, and will reset "count" to zero. The kernel
- * side eventfd_signal() also, adds to the "count" counter and
- * issue a wakeup.
+ * If the EFD_VPOLL flag was NOT set at eventfd creation:
+ * Every time that a write(2) is performed on an eventfd, the
+ * value of the __u64 being written is added to "count" and a
+ * wakeup is performed on "wqh". A read(2) will return the "count"
+ * value to userspace, and will reset "count" to zero (or decrement
+ * "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+ * side eventfd_signal() also, adds to the "count" counter and
+ * issue a wakeup.
+ *
+ * If the EFD_VPOLL flag was set at eventfd creation:
+ * count is the set of pending EPOLL events.
+ * read(2) returns the current value of count.
+ * The argument of write(2) is an 8-byte integer:
+ * it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+ * EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+ * events to be added, deleted to the current set of pending events.
+ * (i.e. which bits of "count" must be set or reset).
+ * EFD_VPOLL_MODEVENTS redefines the set of pending events.
*/
__u64 count;
unsigned int flags;
@@ -295,6 +309,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return res;
}
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ __poll_t events = 0;
+ u64 count;
+
+ poll_wait(file, &ctx->wqh, wait);
+
+ count = READ_ONCE(ctx->count);
+
+ events = (((__force __poll_t)count) & EPOLLALLMASK);
+
+ return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ ssize_t res;
+ __u64 ucnt = 0;
+
+ if (count < sizeof(ucnt))
+ return -EINVAL;
+ res = sizeof(ucnt);
+ ucnt = READ_ONCE(ctx->count);
+ if (put_user(ucnt, (__u64 __user *)buf))
+ return -EFAULT;
+
+ return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ ssize_t res;
+ __u64 ucnt;
+ __u32 events;
+
+ if (count < sizeof(ucnt))
+ return -EINVAL;
+ if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+ return -EFAULT;
+ spin_lock_irq(&ctx->wqh.lock);
+
+ events = ucnt & EPOLLALLMASK64;
+ res = sizeof(ucnt);
+ switch (ucnt & ~EPOLLALLMASK64) {
+ case EFD_VPOLL_ADDEVENTS:
+ ctx->count |= events;
+ break;
+ case EFD_VPOLL_DELEVENTS:
+ ctx->count &= ~(events);
+ break;
+ case EFD_VPOLL_MODEVENTS:
+ ctx->count = (ctx->count & ~EPOLLALLMASK64) | events;
+ break;
+ default:
+ res = -EINVAL;
+ }
+
+ /* wake up waiting threads */
+ if (res >= 0 && waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh, ((__force __poll_t)ctx->count) & EPOLLALLMASK);
+
+ spin_unlock_irq(&ctx->wqh.lock);
+
+ return res;
+
+}
+
#ifdef CONFIG_PROC_FS
static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
{
@@ -319,6 +405,17 @@ static const struct file_operations eventfd_fops = {
.llseek = noop_llseek,
};
+static const struct file_operations eventfd_vpoll_fops = {
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = eventfd_show_fdinfo,
+#endif
+ .release = eventfd_release,
+ .poll = eventfd_vpoll_poll,
+ .read = eventfd_vpoll_read,
+ .write = eventfd_vpoll_write,
+ .llseek = noop_llseek,
+};
+
/**
* eventfd_fget - Acquire a reference of an eventfd file descriptor.
* @fd: [in] Eventfd file descriptor.
@@ -391,6 +488,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
static int do_eventfd(unsigned int count, int flags)
{
struct eventfd_ctx *ctx;
+ const struct file_operations *fops = &eventfd_fops;
int fd;
/* Check the EFD_* constants for consistency. */
@@ -410,7 +508,11 @@ static int do_eventfd(unsigned int count, int flags)
ctx->flags = flags;
ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
- fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
+ if (flags & EFD_VPOLL) {
+ fops = &eventfd_vpoll_fops;
+ ctx->count &= EPOLLALLMASK64;
+ }
+ fd = anon_inode_getfd("[eventfd]", fops, ctx,
O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
if (fd < 0)
eventfd_free_ctx(ctx);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ffcc7724ca21..5b1e6ef56651 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -21,11 +21,16 @@
* shared O_* flags.
*/
#define EFD_SEMAPHORE (1 << 0)
+#define EFD_VPOLL (1 << 1)
#define EFD_CLOEXEC O_CLOEXEC
#define EFD_NONBLOCK O_NONBLOCK
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
+
+#define EFD_VPOLL_ADDEVENTS (1ULL << 32)
+#define EFD_VPOLL_DELEVENTS (2ULL << 32)
+#define EFD_VPOLL_MODEVENTS (3ULL << 32)
struct eventfd_ctx;
struct file;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..814de6d869c7 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -41,6 +41,8 @@
#define EPOLLMSG (__force __poll_t)0x00000400
#define EPOLLRDHUP (__force __poll_t)0x00002000
+#define EPOLLALLMASK ((__force __poll_t)0x0fffffff)
+
/* Set exclusive wakeup mode for the target file descriptor */
#define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28))
--
2.20.1
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..6cdb1b854341 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -24,18 +24,32 @@
#include <linux/seq_file.h>
#include <linux/idr.h>
+#define EPOLLALLMASK64 ((__force __u64)EPOLLALLMASK)
+
static DEFINE_IDA(eventfd_ida);
struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
/*
- * Every time that a write(2) is performed on an eventfd, the
- * value of the __u64 being written is added to "count" and a
- * wakeup is performed on "wqh". A read(2) will return the "count"
- * value to userspace, and will reset "count" to zero. The kernel
- * side eventfd_signal() also, adds to the "count" counter and
- * issue a wakeup.
+ * If the EFD_VPOLL flag was NOT set at eventfd creation:
+ * Every time that a write(2) is performed on an eventfd, the
+ * value of the __u64 being written is added to "count" and a
+ * wakeup is performed on "wqh". A read(2) will return the "count"
+ * value to userspace, and will reset "count" to zero (or decrement
+ * "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+ * side eventfd_signal() also, adds to the "count" counter and
+ * issue a wakeup.
+ *
+ * If the EFD_VPOLL flag was set at eventfd creation:
+ * count is the set of pending EPOLL events.
+ * read(2) returns the current value of count.
+ * The argument of write(2) is an 8-byte integer:
+ * it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+ * EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+ * events to be added, deleted to the current set of pending events.
+ * (i.e. which bits of "count" must be set or reset).
+ * EFD_VPOLL_MODEVENTS redefines the set of pending events.
*/
__u64 count;
unsigned int flags;
@@ -295,6 +309,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return res;
}
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ __poll_t events = 0;
+ u64 count;
+
+ poll_wait(file, &ctx->wqh, wait);
+
+ count = READ_ONCE(ctx->count);
+
+ events = (((__force __poll_t)count) & EPOLLALLMASK);
+
+ return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ ssize_t res;
+ __u64 ucnt = 0;
+
+ if (count < sizeof(ucnt))
+ return -EINVAL;
+ res = sizeof(ucnt);
+ ucnt = READ_ONCE(ctx->count);
+ if (put_user(ucnt, (__u64 __user *)buf))
+ return -EFAULT;
+
+ return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct eventfd_ctx *ctx = file->private_data;
+ ssize_t res;
+ __u64 ucnt;
+ __u32 events;
+
+ if (count < sizeof(ucnt))
+ return -EINVAL;
+ if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+ return -EFAULT;
+ spin_lock_irq(&ctx->wqh.lock);
+
+ events = ucnt & EPOLLALLMASK64;
+ res = sizeof(ucnt);
+ switch (ucnt & ~EPOLLALLMASK64) {
+ case EFD_VPOLL_ADDEVENTS:
+ ctx->count |= events;
+ break;
+ case EFD_VPOLL_DELEVENTS:
+ ctx->count &= ~(events);
+ break;
+ case EFD_VPOLL_MODEVENTS:
+ ctx->count = (ctx->count & ~EPOLLALLMASK64) | events;
+ break;
+ default:
+ res = -EINVAL;
+ }
+
+ /* wake up waiting threads */
+ if (res >= 0 && waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh, ((__force __poll_t)ctx->count) & EPOLLALLMASK);
+
+ spin_unlock_irq(&ctx->wqh.lock);
+
+ return res;
+
+}
+
#ifdef CONFIG_PROC_FS
static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
{
@@ -319,6 +405,17 @@ static const struct file_operations eventfd_fops = {
.llseek = noop_llseek,
};
+static const struct file_operations eventfd_vpoll_fops = {
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = eventfd_show_fdinfo,
+#endif
+ .release = eventfd_release,
+ .poll = eventfd_vpoll_poll,
+ .read = eventfd_vpoll_read,
+ .write = eventfd_vpoll_write,
+ .llseek = noop_llseek,
+};
+
/**
* eventfd_fget - Acquire a reference of an eventfd file descriptor.
* @fd: [in] Eventfd file descriptor.
@@ -391,6 +488,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
static int do_eventfd(unsigned int count, int flags)
{
struct eventfd_ctx *ctx;
+ const struct file_operations *fops = &eventfd_fops;
int fd;
/* Check the EFD_* constants for consistency. */
@@ -410,7 +508,11 @@ static int do_eventfd(unsigned int count, int flags)
ctx->flags = flags;
ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
- fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
+ if (flags & EFD_VPOLL) {
+ fops = &eventfd_vpoll_fops;
+ ctx->count &= EPOLLALLMASK64;
+ }
+ fd = anon_inode_getfd("[eventfd]", fops, ctx,
O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
if (fd < 0)
eventfd_free_ctx(ctx);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ffcc7724ca21..5b1e6ef56651 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -21,11 +21,16 @@
* shared O_* flags.
*/
#define EFD_SEMAPHORE (1 << 0)
+#define EFD_VPOLL (1 << 1)
#define EFD_CLOEXEC O_CLOEXEC
#define EFD_NONBLOCK O_NONBLOCK
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
+
+#define EFD_VPOLL_ADDEVENTS (1ULL << 32)
+#define EFD_VPOLL_DELEVENTS (2ULL << 32)
+#define EFD_VPOLL_MODEVENTS (3ULL << 32)
struct eventfd_ctx;
struct file;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..814de6d869c7 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -41,6 +41,8 @@
#define EPOLLMSG (__force __poll_t)0x00000400
#define EPOLLRDHUP (__force __poll_t)0x00002000
+#define EPOLLALLMASK ((__force __poll_t)0x0fffffff)
+
/* Set exclusive wakeup mode for the target file descriptor */
#define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28))
--
2.20.1
^ permalink raw reply related
* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Roman Penyaev @ 2019-05-31 9:34 UTC (permalink / raw)
To: Renzo Davoli
Cc: Greg KH, Alexander Viro, Davide Libenzi, linux-fsdevel,
linux-kernel, linux-api, linux-kernel-owner
In-Reply-To: <20190527133621.GC26073@cs.unibo.it>
Hi Renzo,
On 2019-05-27 15:36, Renzo Davoli wrote:
> On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
>> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
>> > This patch implements an extension of eventfd to define file descriptors
>> > whose I/O events can be generated at user level. These file descriptors
>> > trigger notifications for [p]select/[p]poll/epoll.
>> >
>> > This feature is useful for user-level implementations of network stacks
>> > or virtual device drivers as libraries.
>>
>> How can this be used to create a "virtual device driver"? Do you have
>> any examples of this new interface being used anywhere?
>
> Networking programs use system calls implementing the Berkeley sockets
> API:
> socket, accept, connect, listen, recv*, send* etc. Programs dealing
> with a
> device use system calls like open, read, write, ioctl etc.
>
> When somebody wants to write a library able to behave like a network
> stack (say
> lwipv6, picotcp) or a device, they can implement functions like
> my_socket,
> my_accept, my_open or my_ioctl, as drop-in replacement of their system
> call counterpart. (It is also possible to use dynamic library magic to
> rename/divert the system call requests to use their 'virtual'
> implementation provided by the library: socket maps to my_socket, recv
> to my_recv etc).
>
> In this way portability and compatibility is easier, using a well known
> API
> instead of inventing new ones.
>
> Unfortunately this approach cannot be applied to
> poll/select/ppoll/pselect/epoll.
If you have to override other systemcalls, what is the problem to
override
poll family? It will add, let's say, 50 extra code lines complexity to
your
userspace code. All you need is to be woken up by *any* event and check
one mask variable, in order to understand what you need to do: read or
write,
basically exactly what you do in your eventfd modification, but only in
userspace.
>> Why can it not be less than 64?
> This is the imeplementation of 'write'. The 64 bits include the
> 'command'
> EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
> most
> significant 32 bits) and the set of events (in the lowest 32 bits).
Do you really need add/del/mod semantics? Userspace still has to keep
mask
somewhere, so you can have one simple command, which does:
ctx->count = events;
in kernel, so no masks and this games with bits are needed. That will
simplify API.
--
Roman
^ permalink raw reply
* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-05-31 8:50 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton
In-Reply-To: <20190531064313.193437-4-minchan@kernel.org>
On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint kernel that the pages can be
> reclaimed instantly but data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
>
> This patch introduces the new MADV_PAGEOUT hint to madvise(2)
> syscall. MADV_PAGEOUT can be used by a process to mark a memory
> range as not expected to be used for a long time so that kernel
> reclaims the memory instantly. The hint can help kernel in deciding
> which pages to evict proactively.
Again, are there any restictions on what kind of memory can be paged out?
Private/Shared, anonymous/file backed. Any restrictions on mapping type.
Etc. Please make sure all that is in the changelog.
What are the failure modes? E.g. what if the swap is full, does the call
fails or it silently ignores the error?
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-05-31 8:47 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton
In-Reply-To: <20190531064313.193437-2-minchan@kernel.org>
On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use. This could reduce
> workingset eviction so it ends up increasing performance.
>
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
>
> Internally, it works via deactivating pages from active list to inactive's
> head if the page is private because inactive list could be full of
> used-once pages which are first candidate for the reclaiming and that's a
> reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> if the memory pressure happens, they will be reclaimed earlier than other
> active pages unless there is no access until the time.
[I am intentionally not looking at the implementation because below
points should be clear from the changelog - sorry about nagging ;)]
What kind of pages can be deactivated? Anonymous/File backed.
Private/shared? If shared, are there any restrictions?
Are there any restrictions on mappings? E.g. what would be an effect of
this operation on hugetlbfs mapping?
Also you are talking about inactive LRU but what kind of LRU is that? Is
it the anonymous LRU? If yes, don't we have the same problem as with the
early MADV_FREE implementation when enough page cache causes that
deactivated anonymous memory doesn't get reclaimed anytime soon. Or
worse never when there is no swap available?
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-05-31 8:47 UTC (permalink / raw)
To: David Howells
Cc: Greg KH, viro, raven, linux-fsdevel, linux-api, linux-block,
keyrings, linux-security-module, linux-kernel
In-Reply-To: <31936.1559146000@warthog.procyon.org.uk>
On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:
> Looking at the perf ring buffer, there appears to be a missing barrier in
> perf_aux_output_end():
>
> rb->user_page->aux_head = rb->aux_head;
>
> should be:
>
> smp_store_release(&rb->user_page->aux_head, rb->aux_head);
I've answered that in another email; the aux bit is 'magic'.
> It should also be using smp_load_acquire(). See
> Documentation/core-api/circular-buffers.rst
We use the control dependency instead, as described in the comment of
perf_output_put_handle():
* kernel user
*
* if (LOAD ->data_tail) { LOAD ->data_head
* (A) smp_rmb() (C)
* STORE $data LOAD $data
* smp_wmb() (B) smp_mb() (D)
* STORE ->data_head STORE ->data_tail
* }
*
* Where A pairs with D, and B pairs with C.
*
* In our case (A) is a control dependency that separates the load of
* the ->data_tail and the stores of $data. In case ->data_tail
* indicates there is no room in the buffer to store $data we do not.
*
* D needs to be a full barrier since it separates the data READ
* from the tail WRITE.
*
* For B a WMB is sufficient since it separates two WRITEs, and for C
* an RMB is sufficient since it separates two READs.
Userspace can choose to use smp_load_acquire() over the first smp_rmb()
if that is efficient for the architecture (for w ahole bunch of archs
load-acquire would end up using mb() while rmb() is adequate and
cheaper).
^ permalink raw reply
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
From: Michal Hocko @ 2019-05-31 8:37 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton
In-Reply-To: <20190531064313.193437-6-minchan@kernel.org>
On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It could give a hint to the exeternal process of pidfd.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
>
> The syscall has a cookie argument to privode atomicity(i.e., detect
> target process's address space change since monitor process has parsed
> the address range of target process so the operaion could fail in case
> of happening race). Although there is no interface to get a cookie
> at this moment, it could be useful to consider it as argument to avoid
> introducing another new syscall in future. It could support *atomicity*
> for disruptive hint(e.g., MADV_DONTNEED|FREE).
> flag argument is reserved for future use if we need to extend the API.
Providing an API that is incomplete will not fly. Really. As this really
begs for much more discussion and it would be good to move on with the
core idea of the pro active memory memory management from userspace
usecase. Could you split out the core change so that we can move on and
leave the external for a later discussion. I believe this would lead to
a smoother integration.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Peter Zijlstra @ 2019-05-31 8:35 UTC (permalink / raw)
To: Andrea Parri
Cc: Greg KH, David Howells, viro, raven, linux-fsdevel, linux-api,
linux-block, keyrings, linux-security-module, linux-kernel,
Will Deacon, Paul E. McKenney, Mark Rutland
In-Reply-To: <20190530095039.GA5137@andrea>
On Thu, May 30, 2019 at 11:50:39AM +0200, Andrea Parri wrote:
> > > Looking at the perf ring buffer, there appears to be a missing barrier in
> > > perf_aux_output_end():
> > >
> > > rb->user_page->aux_head = rb->aux_head;
> > >
> > > should be:
> > >
> > > smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> > >
> > > It should also be using smp_load_acquire(). See
> > > Documentation/core-api/circular-buffers.rst
> > >
> > > And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249
> >
> > So, if that's all that needs to be fixed, can you use the same
> > buffer/code if that patch is merged?
>
> That's about one year old...: let me add the usual suspects in Cc: ;-)
> since I'm not sure what the plan was (or if I'm missing something) ...
The AUX crud is 'special' and smp_store_release() doesn't really help in
many cases. Notable, AUX is typically used in combination with a
hardware writer. The driver is in charge of odering here, the generic
code doesn't know what the appropriate barrier (if any) is and would
have to resort to the most expensive/heavy one available.
Also see the comment right above this function:
"It is the
pmu driver's responsibility to observe ordering rules of the hardware,
so that all the data is externally visible before this is called."
^ permalink raw reply
* Re: [PATCH v1 1/2] fork: add clone3
From: Arnd Bergmann @ 2019-05-31 8:14 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: Christian Brauner, Al Viro, Linux Kernel Mailing List,
Linus Torvalds, Jann Horn, Florian Weimer, Oleg Nesterov,
David Howells, Pavel Emelyanov, Andrew Morton, Adrian Reber,
Andrei Vagin, Linux API
In-Reply-To: <20190530132012.GS16415@port70.net>
On Thu, May 30, 2019 at 3:20 PM Szabolcs Nagy <nsz@port70.net> wrote:
> * Christian Brauner <christian@brauner.io> [2019-05-29 17:22:36 +0200]:
> > /* uapi */
> > struct clone_args {
> > __aligned_u64 flags;
> > __aligned_u64 pidfd;
> > __aligned_u64 parent_tidptr;
> > __aligned_u64 child_tidptr;
> > __aligned_u64 stack;
> > __aligned_u64 stack_size;
> > __aligned_u64 tls;
> > };
>
> is this new linux syscall api style to pass pointers as u64?
This is common for ioctls passing structures now. I don't think
we've had many system calls with structures containing pointers,
but the idea is the same, i.e. we want structures to be identical
on 32-bit and 64-bit architectures.
> i think it will look a bit ugly in userspace where cast
> to u64 would signextend pointers on most 32bit targets, so
> user code would have to do something like
>
> arg.ptr = (uint64_t)(uintptr_t)ptr;
>
> such ugliness can be hidden by the libc with a different
> struct definition, but it will require bigendian and alignment
> hackery (or translation in libc, but that does not really work
> when user calls raw syscall).
Right. Note also that user space should do zero-extension
of the variables in order for the kernel to not care about
what called it. Just leaving padding fields in the structure
is not enough here.
User space that calls the raw syscall certainly has to
go through the uintptr_t cast, but I would also expect that
applications don't normally do that, and instead call a
library function that has regular C calling conventions
with individual arguments instead of a structure.
Arnd
^ permalink raw reply
* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Florian Weimer @ 2019-05-31 8:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
linux-api
In-Reply-To: <2022553041.20966.1559249801435.JavaMail.zimbra@efficios.com>
* Mathieu Desnoyers:
> I found that it's because touching a __thread variable from
> ld-linux-x86-64.so.2 ends up setting the DF_STATIC_TLS flag
> for that .so, which is really not expected.
>
> Even if I tweak the assert to make it more lenient there,
> touching the __thread variable ends up triggering a SIGFPE.
Sorry, I got distracted at this critical juncture. Yes, I forgot that
there isn't TLS support in the dynamic loader today.
> So rather than touching the TLS from ld-linux-x86-64.so.2,
> I've rather experimented with moving the rseq initialization
> for both SHARED and !SHARED cases to a library constructor
> within libc.so.
>
> Are you aware of any downside to this approach ?
The information whether the kernel supports rseq would not be available
to IFUNC resolvers. And in some cases, ELF constructors for application
libraries could run before the libc.so.6 constructor, so applications
would see a transition from lack of kernel support to kernel support.
> +static
> +__attribute__ ((constructor))
> +void __rseq_libc_init (void)
> +{
> + rseq_init ();
> + /* Register rseq ABI to the kernel. */
> + (void) rseq_register_current_thread ();
> +}
I think the call to rseq_init (and the __rseq_handled variable) should
still be part of the dynamic loader. Otherwise there could be confusion
about whether glibc handles the registration (due the constructor
ordering issue).
Thanks,
Florian
^ permalink raw reply
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
From: Oleksandr Natalenko @ 2019-05-31 7:04 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
In-Reply-To: <20190531064313.193437-5-minchan@kernel.org>
On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> This patch factor out madvise's core functionality so that upcoming
> patch can reuse it without duplication. It shouldn't change any behavior.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> 1 file changed, 101 insertions(+), 87 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9d749a1420b4..466623ea8c36 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> struct page *page;
> int isolated = 0;
> struct vm_area_struct *vma = walk->vma;
> + struct task_struct *task = walk->private;
> unsigned long next;
>
> - if (fatal_signal_pending(current))
> + if (fatal_signal_pending(task))
> return -EINTR;
>
> next = pmd_addr_end(addr, end);
> @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> }
>
> static void madvise_pageout_page_range(struct mmu_gather *tlb,
> - struct vm_area_struct *vma,
> - unsigned long addr, unsigned long end)
> + struct task_struct *task,
> + struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> {
> struct mm_walk warm_walk = {
> .pmd_entry = madvise_pageout_pte_range,
> .mm = vma->vm_mm,
> + .private = task,
> };
>
> tlb_start_vma(tlb, vma);
> @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> }
>
>
> -static long madvise_pageout(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start_addr, unsigned long end_addr)
> +static long madvise_pageout(struct task_struct *task,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> + unsigned long start_addr, unsigned long end_addr)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct mmu_gather tlb;
> @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> tlb_finish_mmu(&tlb, start_addr, end_addr);
>
> return 0;
> @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> return 0;
> }
>
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> +static long madvise_dontneed_free(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end,
> int behavior)
> @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> if (!userfaultfd_remove(vma, start, end)) {
> *prev = NULL; /* mmap_sem has been dropped, prev is stale */
>
> - down_read(¤t->mm->mmap_sem);
> - vma = find_vma(current->mm, start);
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> if (!vma)
> return -ENOMEM;
> if (start < vma->vm_start) {
> @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> * Application wants to free up the pages and associated backing store.
> * This is effectively punching a hole into the middle of a file.
> */
> -static long madvise_remove(struct vm_area_struct *vma,
> +static long madvise_remove(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end)
> {
> @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> get_file(f);
> if (userfaultfd_remove(vma, start, end)) {
> /* mmap_sem was not released by userfaultfd_remove() */
> - up_read(¤t->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
> }
> error = vfs_fallocate(f,
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> offset, end - start);
> fput(f);
> - down_read(¤t->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> return error;
> }
>
> @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> #endif
>
> static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> unsigned long start, unsigned long end, int behavior)
> {
> switch (behavior) {
> case MADV_REMOVE:
> - return madvise_remove(vma, prev, start, end);
> + return madvise_remove(mm, vma, prev, start, end);
> case MADV_WILLNEED:
> return madvise_willneed(vma, prev, start, end);
> case MADV_COLD:
> return madvise_cold(vma, prev, start, end);
> case MADV_PAGEOUT:
> - return madvise_pageout(vma, prev, start, end);
> + return madvise_pageout(task, vma, prev, start, end);
> case MADV_FREE:
> case MADV_DONTNEED:
> - return madvise_dontneed_free(vma, prev, start, end, behavior);
> + return madvise_dontneed_free(mm, vma, prev, start,
> + end, behavior);
> default:
> return madvise_behavior(vma, prev, start, end, behavior);
> }
> @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> }
> }
>
> -/*
> - * The madvise(2) system call.
> - *
> - * Applications can use madvise() to advise the kernel how it should
> - * handle paging I/O in this VM area. The idea is to help the kernel
> - * use appropriate read-ahead and caching techniques. The information
> - * provided is advisory only, and can be safely disregarded by the
> - * kernel without affecting the correct operation of the application.
> - *
> - * behavior values:
> - * MADV_NORMAL - the default behavior is to read clusters. This
> - * results in some read-ahead and read-behind.
> - * MADV_RANDOM - the system should read the minimum amount of data
> - * on any access, since it is unlikely that the appli-
> - * cation will need more than what it asks for.
> - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> - * once, so they can be aggressively read ahead, and
> - * can be freed soon after they are accessed.
> - * MADV_WILLNEED - the application is notifying the system to read
> - * some pages ahead.
> - * MADV_DONTNEED - the application is finished with the given range,
> - * so the kernel can free resources associated with it.
> - * MADV_FREE - the application marks pages in the given range as lazy free,
> - * where actual purges are postponed until memory pressure happens.
> - * MADV_REMOVE - the application wants to free up the given range of
> - * pages and associated backing store.
> - * MADV_DONTFORK - omit this area from child's address space when forking:
> - * typically, to avoid COWing pages pinned by get_user_pages().
> - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> - * range after a fork.
> - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> - * were corrupted by unrecoverable hardware memory failure.
> - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> - * this area with pages of identical content from other such areas.
> - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> - * huge pages in the future. Existing pages might be coalesced and
> - * new pages might be allocated as THP.
> - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> - * transparent huge pages so the existing pages will not be
> - * coalesced into THP and new pages will not be allocated as THP.
> - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> - * from being included in its core dump.
> - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> - *
> - * return values:
> - * zero - success
> - * -EINVAL - start + len < 0, start is not page-aligned,
> - * "behavior" is not a valid value, or application
> - * is attempting to release locked or shared pages,
> - * or the specified address range includes file, Huge TLB,
> - * MAP_SHARED or VMPFNMAP range.
> - * -ENOMEM - addresses in the specified range are not currently
> - * mapped, or are outside the AS of the process.
> - * -EIO - an I/O error occurred while paging in data.
> - * -EBADF - map exists, but area maps something that isn't a file.
> - * -EAGAIN - a kernel resource was temporarily unavailable.
> - */
> -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> + unsigned long start, size_t len_in, int behavior)
Just a minor nitpick, but can we please have it named madvise_common,
not madvise_core? This would follow a usual naming scheme, when some
common functionality is factored out (like, for mutexes, semaphores
etc), and within the kernel "core" usually means something completely
different.
> {
> unsigned long end, tmp;
> struct vm_area_struct *vma, *prev;
> @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> - return madvise_inject_error(behavior, start, start + len_in);
> + return madvise_inject_error(behavior,
> + start, start + len_in);
Not sure what this change is about except changing the line length.
Note, madvise_inject_error() still operates on "current" through
get_user_pages_fast() and gup_pgd_range(), but that was not changed
here. I Know you've filtered out this hint later, so technically this
is not an issue, but, maybe, this needs some attention too since we've
already spotted it?
> #endif
>
> write = madvise_need_mmap_write(behavior);
> if (write) {
> - if (down_write_killable(¤t->mm->mmap_sem))
> + if (down_write_killable(&mm->mmap_sem))
> return -EINTR;
Do you still need that trick with mmget_still_valid() here?
Something like:
if (current->mm != mm && !mmget_still_valid(mm))
goto skip_mm;
and that skip_mm label would be before
if (write)
up_write(&mm->mmap_sem);
below.
(see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
> } else {
> - down_read(¤t->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> }
>
> /*
> @@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> * ranges, just ignore them, but return -ENOMEM at the end.
> * - different from the way of handling in mlock etc.
> */
> - vma = find_vma_prev(current->mm, start, &prev);
> + vma = find_vma_prev(mm, start, &prev);
> if (vma && start > vma->vm_start)
> prev = vma;
>
> @@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> tmp = end;
>
> /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> - error = madvise_vma(vma, &prev, start, tmp, behavior);
> + error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
> if (error)
> goto out;
> start = tmp;
> @@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> if (prev)
> vma = prev->vm_next;
> else /* madvise_remove dropped mmap_sem */
> - vma = find_vma(current->mm, start);
> + vma = find_vma(mm, start);
> }
> out:
> blk_finish_plug(&plug);
skip_mm:
> if (write)
> - up_write(¤t->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
> else
> - up_read(¤t->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
>
> return error;
> }
> +
> +/*
> + * The madvise(2) system call.
> + *
> + * Applications can use madvise() to advise the kernel how it should
> + * handle paging I/O in this VM area. The idea is to help the kernel
> + * use appropriate read-ahead and caching techniques. The information
> + * provided is advisory only, and can be safely disregarded by the
> + * kernel without affecting the correct operation of the application.
> + *
> + * behavior values:
> + * MADV_NORMAL - the default behavior is to read clusters. This
> + * results in some read-ahead and read-behind.
> + * MADV_RANDOM - the system should read the minimum amount of data
> + * on any access, since it is unlikely that the appli-
> + * cation will need more than what it asks for.
> + * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> + * once, so they can be aggressively read ahead, and
> + * can be freed soon after they are accessed.
> + * MADV_WILLNEED - the application is notifying the system to read
> + * some pages ahead.
> + * MADV_DONTNEED - the application is finished with the given range,
> + * so the kernel can free resources associated with it.
> + * MADV_FREE - the application marks pages in the given range as lazy free,
> + * where actual purges are postponed until memory pressure happens.
> + * MADV_REMOVE - the application wants to free up the given range of
> + * pages and associated backing store.
> + * MADV_DONTFORK - omit this area from child's address space when forking:
> + * typically, to avoid COWing pages pinned by get_user_pages().
> + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> + * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> + * range after a fork.
> + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> + * MADV_HWPOISON - trigger memory error handler as if the given memory range
> + * were corrupted by unrecoverable hardware memory failure.
> + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> + * this area with pages of identical content from other such areas.
> + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + * MADV_HUGEPAGE - the application wants to back the given range by transparent
> + * huge pages in the future. Existing pages might be coalesced and
> + * new pages might be allocated as THP.
> + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> + * transparent huge pages so the existing pages will not be
> + * coalesced into THP and new pages will not be allocated as THP.
> + * MADV_DONTDUMP - the application wants to prevent pages in the given range
> + * from being included in its core dump.
> + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *
> + * return values:
> + * zero - success
> + * -EINVAL - start + len < 0, start is not page-aligned,
> + * "behavior" is not a valid value, or application
> + * is attempting to release locked or shared pages,
> + * or the specified address range includes file, Huge TLB,
> + * MAP_SHARED or VMPFNMAP range.
> + * -ENOMEM - addresses in the specified range are not currently
> + * mapped, or are outside the AS of the process.
> + * -EIO - an I/O error occurred while paging in data.
> + * -EBADF - map exists, but area maps something that isn't a file.
> + * -EAGAIN - a kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +{
> + return madvise_core(current, current->mm, start, len_in, behavior);
> +}
> --
> 2.22.0.rc1.257.g3120a18244-goog
>
--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer
^ permalink raw reply
* [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
Currently, process_madvise syscall works for only one address range so
user should call the syscall several times to give hints to multiple
address ranges. However, it's not efficient to support atomicity of
address range opreations as well as performance perspective.
This patch extends process_madvise syscall to support multiple hints,
address ranges and return vaules so user could give hints all at once.
struct pr_madvise_param {
int size; /* the size of this structure */
int cookie; /* reserved to support atomicity */
int nr_elem; /* count of below arrary fields */
int __user *hints; /* hints for each range */
/* to store result of each operation */
const struct iovec __user *results;
/* input address ranges */
const struct iovec __user *ranges;
};
int process_madvise(int pidfd, struct pr_madvise_param *u_param,
unsigned long flags);
About cookie, Daniel Colascione suggested a idea[1] to support atomicity
as well as improving parsing speed of address ranges of the target process.
The process_getinfo(2) syscall could create vma configuration sequence
number and returns(e.g., the seq number will be increased when target process
holds mmap_sem exclusive lock) the number with address ranges as binary form.
With calling the this vector syscall with the sequence number and address
ranges we got from process_getinfo, we could detect there was race of
the target process address space layout and makes the fail of the syscall
if user want to have atomicity. It also speed up the address range
parsing because we don't need to parse human-friend strings from /proc fs.
[1] https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
struct pr_madvise_param {
int size; /* the size of this structure */
int cookie; /* reserved to support atomicity */
int nr_elem; /* count of below arrary fields */
int *hints; /* hints for each range */
/* to store result of each operation */
const struct iovec *results;
/* input address ranges */
const struct iovec *ranges;
};
int main(int argc, char *argv[])
{
struct pr_madvise_param param;
int hints[NR_ADDR_RANGE];
int ret[NR_ADDR_RANGE];
struct iovec ret_vec[NR_ADDR_RANGE];
struct iovec range_vec[NR_ADDR_RANGE];
void *addr[NR_ADDR_RANGE];
pid_t pid;
addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if (MAP_FAILED == addr[0]) {
printf("Fail to alloc\n");
return 1;
}
addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if (MAP_FAILED == addr[1]) {
printf("Fail to alloc\n");
return 1;
}
ret_vec[0].iov_base = &ret[0];
ret_vec[0].iov_len = sizeof(long);
ret_vec[1].iov_base = &ret[1];
ret_vec[1].iov_len = sizeof(long);
range_vec[0].iov_base = addr[0];
range_vec[0].iov_len = ALLOC_SIZE;
range_vec[1].iov_base = addr[1];
range_vec[1].iov_len = ALLOC_SIZE;
hints[0] = MADV_COLD;
hints[1] = MADV_PAGEOUT;
param.size = sizeof(struct pr_madvise_param);
param.cookie = 0;
param.nr_elem = NR_ADDR_RANGE;
param.hints = hints;
param.results = ret_vec;
param.ranges = range_vec;
pid = fork();
if (!pid) {
sleep(10);
} else {
int pidfd = syscall(__NR_pidfd_open, pid, 0);
if (pidfd < 0) {
printf("Fail to open process file descriptor\n");
return 1;
}
munmap(addr[0], ALLOC_SIZE);
munmap(addr[1], ALLOC_SIZE);
system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
if (syscall(__NR_process_madvise, pidfd, ¶m, 0))
perror("process_madvise fail\n");
system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
}
return 0;
}
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/syscalls.h | 6 +-
include/uapi/asm-generic/mman-common.h | 11 +++
mm/madvise.c | 126 ++++++++++++++++++++++---
3 files changed, 126 insertions(+), 17 deletions(-)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6ba081c955f6..05627718a547 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,9 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
-asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
- size_t len, int behavior,
- unsigned long cookie, unsigned long flags);
+asmlinkage long sys_process_madvise(int pidfd,
+ struct pr_madvise_param __user *u_params,
+ unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 92e347a89ddc..220c2b5eb961 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -75,4 +75,15 @@
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)
+struct pr_madvise_param {
+ int size; /* the size of this structure */
+ int cookie; /* reserved to support atomicity */
+ int nr_elem; /* count of below arrary fields */
+ int __user *hints; /* hints for each range */
+ /* to store result of each operation */
+ const struct iovec __user *results;
+ /* input address ranges */
+ const struct iovec __user *ranges;
+};
+
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index fd205e928a1b..94d782097afd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1107,6 +1107,56 @@ static int madvise_core(struct task_struct *task, struct mm_struct *mm,
return error;
}
+static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
+ struct pr_madvise_param *param)
+{
+ u32 size;
+ int ret;
+
+ memset(param, 0, sizeof(*param));
+
+ ret = get_user(size, &u_param->size);
+ if (ret)
+ return ret;
+
+ if (size > PAGE_SIZE)
+ return -E2BIG;
+
+ if (!size || size > sizeof(struct pr_madvise_param))
+ return -EINVAL;
+
+ ret = copy_from_user(param, u_param, size);
+ if (ret)
+ return -EFAULT;
+
+ return ret;
+}
+
+static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
+ int *behaviors,
+ struct iov_iter *iter,
+ const struct iovec *range_vec,
+ unsigned long riovcnt)
+{
+ int i;
+ long err;
+
+ for (i = 0; i < riovcnt && iov_iter_count(iter); i++) {
+ err = -EINVAL;
+ if (process_madvise_behavior_valid(behaviors[i]))
+ err = madvise_core(tsk, mm,
+ (unsigned long)range_vec[i].iov_base,
+ range_vec[i].iov_len, behaviors[i]);
+
+ if (copy_to_iter(&err, sizeof(long), iter) !=
+ sizeof(long)) {
+ return -EFAULT;
+ }
+ }
+
+ return 0;
+}
+
/*
* The madvise(2) system call.
*
@@ -1173,37 +1223,78 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return madvise_core(current, current->mm, start, len_in, behavior);
}
-SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
- size_t, len_in, int, behavior, unsigned long, cookie,
- unsigned long, flags)
+
+SYSCALL_DEFINE3(process_madvise, int, pidfd,
+ struct pr_madvise_param __user *, u_params,
+ unsigned long, flags)
{
int ret;
struct fd f;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
+ struct pr_madvise_param params;
+ const struct iovec __user *result_vec, __user *range_vec;
+ int *behaviors;
+ struct iovec iovstack_result[UIO_FASTIOV];
+ struct iovec iovstack_r[UIO_FASTIOV];
+ struct iovec *iov_l = iovstack_result;
+ struct iovec *iov_r = iovstack_r;
+ struct iov_iter iter;
+ int nr_elem;
if (flags != 0)
return -EINVAL;
+ ret = pr_madvise_copy_param(u_params, ¶ms);
+ if (ret)
+ return ret;
+
/*
- * We don't support cookie to gaurantee address space change
- * atomicity yet.
+ * We don't support cookie to gaurantee address space atomicity yet.
+ * Once we implment cookie, process_madvise_core need to hold mmap_sme
+ * during entire operation to guarantee atomicity.
*/
- if (cookie != 0)
+ if (params.cookie != 0)
return -EINVAL;
- if (!process_madvise_behavior_valid(behavior))
- return return -EINVAL;
+ range_vec = params.ranges;
+ result_vec = params.results;
+ nr_elem = params.nr_elem;
+
+ behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
+ if (!behaviors)
+ return -ENOMEM;
+
+ ret = copy_from_user(behaviors, params.hints, sizeof(int) * nr_elem);
+ if (ret < 0)
+ goto free_behavior_vec;
+
+ ret = import_iovec(READ, result_vec, params. nr_elem, UIO_FASTIOV,
+ &iov_l, &iter);
+ if (ret < 0)
+ goto free_behavior_vec;
+
+ if (!iov_iter_count(&iter)) {
+ ret = -EINVAL;
+ goto free_iovecs;
+ }
+
+ ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
+ UIO_FASTIOV, iovstack_r, &iov_r);
+ if (ret <= 0)
+ goto free_iovecs;
f = fdget(pidfd);
- if (!f.file)
- return -EBADF;
+ if (!f.file) {
+ ret = -EBADF;
+ goto free_iovecs;
+ }
pid = pidfd_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
- goto err;
+ goto put_fd;
}
rcu_read_lock();
@@ -1211,7 +1302,7 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
if (!task) {
rcu_read_unlock();
ret = -ESRCH;
- goto err;
+ goto put_fd;
}
get_task_struct(task);
@@ -1225,11 +1316,18 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
goto release_task;
}
- ret = madvise_core(task, mm, start, len_in, behavior);
+ ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem);
mmput(mm);
release_task:
put_task_struct(task);
-err:
+put_fd:
fdput(f);
+free_iovecs:
+ if (iov_r != iovstack_r)
+ kfree(iov_r);
+ kfree(iov_l);
+free_behavior_vec:
+ kfree(behaviors);
+
return ret;
}
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 5/6] mm: introduce external memory hinting API
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
There is some usecase that centralized userspace daemon want to give
a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
ActivityManagerService is one of them.
It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.
To solve the issue, this patch introduces new syscall process_madvise(2).
It could give a hint to the exeternal process of pidfd.
int process_madvise(int pidfd, void *addr, size_t length, int advise,
unsigned long cookie, unsigned long flag);
Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrace the process could use it successfully.
The syscall has a cookie argument to privode atomicity(i.e., detect
target process's address space change since monitor process has parsed
the address range of target process so the operaion could fail in case
of happening race). Although there is no interface to get a cookie
at this moment, it could be useful to consider it as argument to avoid
introducing another new syscall in future. It could support *atomicity*
for disruptive hint(e.g., MADV_DONTNEED|FREE).
flag argument is reserved for future use if we need to extend the API.
I think supporting all hints madvise has/will supported/support to
process_madvise are rather risky. Because I'm not sure all hints makes
sense from external process and implementation for the hint may rely on
caller is current context like MADV_INJECT_ERROR so it could be error-prone
if the caller is external process. Other example is userfaultfd because
userfaultfd_remove need to release mmap_sem during the operation, which
wouuld be a obstacle to implement atomicity later. So, I just limited hints
as MADV_[COLD|PAGEOUT] at this moment. If someone want to expose other hint
we need to hear their workload/scenario and could review carefully by
design/implmentation of each hint. It's more safe for maintainace -
Once we open a buggy syscall but found hard to fix it later, we never
roll back.
TODO: once we agree the direction, I need to define the syscall for
every architecture.
* RFCv1
* not export pidfd_to_pid. Use pidfd_pid - Christian
* use mm struct instead of task_struct for madvise_core - Oleg
* add cookie variable as syscall argument to guarantee atomicity - dancol
* internal review
* use ptrace capability - surenb, dancol
I didn't solve the issue Olge pointed out(task we got from
pid_task could be a zombie leader so that syscall will fai
by mm_access even though process is alive with other threads)
because it's not a only problem of this new syscall but it's
general problem for other MM syscalls like process_vm_readv,
move_pages so need more discussion.
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/pid.h | 4 ++
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 8 +++
kernel/signal.c | 7 ++-
kernel/sys_ni.c | 1 +
mm/madvise.c | 87 ++++++++++++++++++++++++++
9 files changed, 113 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 43e4429a5272..5f44a29b7882 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -439,3 +439,4 @@
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
+435 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bee0a77fdd3..35e91f3e9646 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -356,6 +356,7 @@
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
+435 common process_madvise __x64_sys_process_madvise
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index a261712ac3fe..a49ef789c034 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,10 @@ extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
extern int pidfd_create(struct pid *pid);
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1a4dc53f40d9..6ba081c955f6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,6 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+ size_t len, int behavior,
+ unsigned long cookie, unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e5684a4512c0..082d1f3fe3a2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -846,9 +846,11 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
__SYSCALL(__NR_fspick, sys_fspick)
#define __NR_pidfd_open 434
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
+#define __NR_process_madvise 435
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
#undef __NR_syscalls
-#define __NR_syscalls 435
+#define __NR_syscalls 436
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f238cdd886e..b76aade51631 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1738,6 +1738,14 @@ const struct file_operations pidfd_fops = {
#endif
};
+struct pid *pidfd_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return ERR_PTR(-EBADF);
+}
+
/**
* pidfd_create() - Create a new pid file descriptor.
*
diff --git a/kernel/signal.c b/kernel/signal.c
index b477e21ecafc..b376870d7565 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3702,8 +3702,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
static struct pid *pidfd_to_pid(const struct file *file)
{
- if (file->f_op == &pidfd_fops)
- return file->private_data;
+ struct pid *pid;
+
+ pid = pidfd_pid(file);
+ if (pid)
+ return pid;
return tgid_pidfd_to_pid(file);
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5277421795ab 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 466623ea8c36..fd205e928a1b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -15,6 +15,7 @@
#include <linux/hugetlb.h>
#include <linux/falloc.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -983,6 +984,31 @@ madvise_behavior_valid(int behavior)
}
}
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+/*
+ * madvise_core - request behavior hint to address range of the target process
+ *
+ * @task: task_struct got behavior hint, not giving the hint
+ * @mm: mm_struct got behavior hint, not giving the hint
+ * @start: base address of the hinted range
+ * @len_in: length of the hinted range
+ * @behavior: requested hint
+ *
+ * @task could be a zombie leader if it calls sys_exit so accessing mm_struct
+ * via task->mm is prohibited. Please use @mm insetad of task->mm.
+ */
static int madvise_core(struct task_struct *task, struct mm_struct *mm,
unsigned long start, size_t len_in, int behavior)
{
@@ -1146,3 +1172,64 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return madvise_core(current, current->mm, start, len_in, behavior);
}
+
+SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
+ size_t, len_in, int, behavior, unsigned long, cookie,
+ unsigned long, flags)
+{
+ int ret;
+ struct fd f;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ /*
+ * We don't support cookie to gaurantee address space change
+ * atomicity yet.
+ */
+ if (cookie != 0)
+ return -EINVAL;
+
+ if (!process_madvise_behavior_valid(behavior))
+ return return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid)) {
+ ret = PTR_ERR(pid);
+ goto err;
+ }
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ rcu_read_unlock();
+ ret = -ESRCH;
+ goto err;
+ }
+
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (!mm || IS_ERR(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ if (ret == -EACCES)
+ ret = -EPERM;
+ goto release_task;
+ }
+
+ ret = madvise_core(task, mm, start, len_in, behavior);
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+err:
+ fdput(f);
+ return ret;
+}
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 4/6] mm: factor out madvise's core functionality
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
This patch factor out madvise's core functionality so that upcoming
patch can reuse it without duplication. It shouldn't change any behavior.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
1 file changed, 101 insertions(+), 87 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9d749a1420b4..466623ea8c36 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
int isolated = 0;
struct vm_area_struct *vma = walk->vma;
+ struct task_struct *task = walk->private;
unsigned long next;
- if (fatal_signal_pending(current))
+ if (fatal_signal_pending(task))
return -EINTR;
next = pmd_addr_end(addr, end);
@@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct task_struct *task,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
struct mm_walk warm_walk = {
.pmd_entry = madvise_pageout_pte_range,
.mm = vma->vm_mm,
+ .private = task,
};
tlb_start_vma(tlb, vma);
@@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
}
-static long madvise_pageout(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct task_struct *task,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
@@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
@@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
return 0;
}
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
int behavior)
@@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_sem has been dropped, prev is stale */
- down_read(¤t->mm->mmap_sem);
- vma = find_vma(current->mm, start);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
@@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_sem was not released by userfaultfd_remove() */
- up_read(¤t->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- down_read(¤t->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
return error;
}
@@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
#endif
static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
+ return madvise_remove(mm, vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(task, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(mm, vma, prev, start,
+ end, behavior);
default:
return madvise_behavior(vma, prev, start, end, behavior);
}
@@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
}
}
-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area. The idea is to help the kernel
- * use appropriate read-ahead and caching techniques. The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- * MADV_NORMAL - the default behavior is to read clusters. This
- * results in some read-ahead and read-behind.
- * MADV_RANDOM - the system should read the minimum amount of data
- * on any access, since it is unlikely that the appli-
- * cation will need more than what it asks for.
- * MADV_SEQUENTIAL - pages in the given range will probably be accessed
- * once, so they can be aggressively read ahead, and
- * can be freed soon after they are accessed.
- * MADV_WILLNEED - the application is notifying the system to read
- * some pages ahead.
- * MADV_DONTNEED - the application is finished with the given range,
- * so the kernel can free resources associated with it.
- * MADV_FREE - the application marks pages in the given range as lazy free,
- * where actual purges are postponed until memory pressure happens.
- * MADV_REMOVE - the application wants to free up the given range of
- * pages and associated backing store.
- * MADV_DONTFORK - omit this area from child's address space when forking:
- * typically, to avoid COWing pages pinned by get_user_pages().
- * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- * MADV_WIPEONFORK - present the child process with zero-filled memory in this
- * range after a fork.
- * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- * MADV_HWPOISON - trigger memory error handler as if the given memory range
- * were corrupted by unrecoverable hardware memory failure.
- * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- * this area with pages of identical content from other such areas.
- * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- * MADV_HUGEPAGE - the application wants to back the given range by transparent
- * huge pages in the future. Existing pages might be coalesced and
- * new pages might be allocated as THP.
- * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- * transparent huge pages so the existing pages will not be
- * coalesced into THP and new pages will not be allocated as THP.
- * MADV_DONTDUMP - the application wants to prevent pages in the given range
- * from being included in its core dump.
- * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- * zero - success
- * -EINVAL - start + len < 0, start is not page-aligned,
- * "behavior" is not a valid value, or application
- * is attempting to release locked or shared pages,
- * or the specified address range includes file, Huge TLB,
- * MAP_SHARED or VMPFNMAP range.
- * -ENOMEM - addresses in the specified range are not currently
- * mapped, or are outside the AS of the process.
- * -EIO - an I/O error occurred while paging in data.
- * -EBADF - map exists, but area maps something that isn't a file.
- * -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_core(struct task_struct *task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
- return madvise_inject_error(behavior, start, start + len_in);
+ return madvise_inject_error(behavior,
+ start, start + len_in);
#endif
write = madvise_need_mmap_write(behavior);
if (write) {
- if (down_write_killable(¤t->mm->mmap_sem))
+ if (down_write_killable(&mm->mmap_sem))
return -EINTR;
} else {
- down_read(¤t->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
}
/*
@@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
@@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
if (error)
goto out;
start = tmp;
@@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- up_write(¤t->mm->mmap_sem);
+ up_write(&mm->mmap_sem);
else
- up_read(¤t->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return error;
}
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques. The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ * MADV_NORMAL - the default behavior is to read clusters. This
+ * results in some read-ahead and read-behind.
+ * MADV_RANDOM - the system should read the minimum amount of data
+ * on any access, since it is unlikely that the appli-
+ * cation will need more than what it asks for.
+ * MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ * once, so they can be aggressively read ahead, and
+ * can be freed soon after they are accessed.
+ * MADV_WILLNEED - the application is notifying the system to read
+ * some pages ahead.
+ * MADV_DONTNEED - the application is finished with the given range,
+ * so the kernel can free resources associated with it.
+ * MADV_FREE - the application marks pages in the given range as lazy free,
+ * where actual purges are postponed until memory pressure happens.
+ * MADV_REMOVE - the application wants to free up the given range of
+ * pages and associated backing store.
+ * MADV_DONTFORK - omit this area from child's address space when forking:
+ * typically, to avoid COWing pages pinned by get_user_pages().
+ * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ * MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ * range after a fork.
+ * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ * MADV_HWPOISON - trigger memory error handler as if the given memory range
+ * were corrupted by unrecoverable hardware memory failure.
+ * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ * this area with pages of identical content from other such areas.
+ * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ * MADV_HUGEPAGE - the application wants to back the given range by transparent
+ * huge pages in the future. Existing pages might be coalesced and
+ * new pages might be allocated as THP.
+ * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ * transparent huge pages so the existing pages will not be
+ * coalesced into THP and new pages will not be allocated as THP.
+ * MADV_DONTDUMP - the application wants to prevent pages in the given range
+ * from being included in its core dump.
+ * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ * zero - success
+ * -EINVAL - start + len < 0, start is not page-aligned,
+ * "behavior" is not a valid value, or application
+ * is attempting to release locked or shared pages,
+ * or the specified address range includes file, Huge TLB,
+ * MAP_SHARED or VMPFNMAP range.
+ * -ENOMEM - addresses in the specified range are not currently
+ * mapped, or are outside the AS of the process.
+ * -EIO - an I/O error occurred while paging in data.
+ * -EBADF - map exists, but area maps something that isn't a file.
+ * -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+ return madvise_core(current, current->mm, start, len_in, behavior);
+}
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 3/6] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
When a process expects no accesses to a certain memory range
for a long time, it could hint kernel that the pages can be
reclaimed instantly but data should be preserved for future use.
This could reduce workingset eviction so it ends up increasing
performance.
This patch introduces the new MADV_PAGEOUT hint to madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory
range as not expected to be used for a long time so that kernel
reclaims the memory instantly. The hint can help kernel in deciding
which pages to evict proactively.
* RFCv1
* rename from MADV_COLD to MADV_PAGEOUT - hannes
* bail out if process is being killed - Hillf
* fix reclaim_pages bugs - Hillf
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 126 +++++++++++++++++++++++++
mm/vmscan.c | 77 +++++++++++++++
4 files changed, 205 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list);
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
#define MADV_COLD 5 /* deactivatie these pages */
+#define MADV_PAGEOUT 6 /* reclaim these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index bff150eab6da..9d749a1420b4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_COLD:
+ case MADV_PAGEOUT:
case MADV_FREE:
return 0;
default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
}
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ LIST_HEAD(page_list);
+ struct page *page;
+ int isolated = 0;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ if (isolate_lru_page(page))
+ goto huge_unlock;
+
+ list_add(&page->lru, &page_list);
+huge_unlock:
+ spin_unlock(ptl);
+ reclaim_pages(&page_list);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ if (isolate_lru_page(page))
+ continue;
+
+ isolated++;
+ list_add(&page->lru, &page_list);
+ if (isolated >= SWAP_CLUSTER_MAX) {
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ isolated = 0;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ orig_pte = pte;
+ }
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk warm_walk = {
+ .pmd_entry = madvise_pageout_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &warm_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
+ case MADV_PAGEOUT:
+ return madvise_pageout(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
case MADV_DONTNEED:
case MADV_FREE:
case MADV_COLD:
+ case MADV_PAGEOUT:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..280dd808fb91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_deactivate, nr_rotated, sc->priority, file);
}
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+ int nid = -1;
+ unsigned long nr_isolated[2] = {0, };
+ unsigned long nr_reclaimed = 0;
+ LIST_HEAD(node_page_list);
+ struct reclaim_stat dummy_stat;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ };
+
+ while (!list_empty(page_list)) {
+ struct page *page;
+
+ page = lru_to_page(page_list);
+ if (nid == -1) {
+ nid = page_to_nid(page);
+ INIT_LIST_HEAD(&node_page_list);
+ nr_isolated[0] = nr_isolated[1] = 0;
+ }
+
+ if (nid == page_to_nid(page)) {
+ list_move(&page->lru, &node_page_list);
+ nr_isolated[!!page_is_file_cache(page)] +=
+ hpage_nr_pages(page);
+ continue;
+ }
+
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+ nid = -1;
+ }
+
+ if (!list_empty(&node_page_list)) {
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+
+ }
+
+ return nr_reclaimed;
+}
+
/*
* The inactive anon list should be small enough that the VM never has
* to do too much work.
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default. It is for preventing to reclaim dirty pages when CMA try to
migrate pages. Strictly speaking, we don't need it because CMA didn't allow
to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.
So this patch makes references's default value to PAGEREF_RECLAIM and
rename force_reclaim with ignore_references to make it more clear.
This is a preparatory work for next patch.
* RFCv1
* use ignore_referecnes as parameter name - hannes
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum ttu_flags ttu_flags,
struct reclaim_stat *stat,
- bool force_reclaim)
+ bool ignore_references)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- enum page_references references = PAGEREF_RECLAIM_CLEAN;
+ enum page_references references = PAGEREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}
- if (!force_reclaim)
+ if (!ignore_references)
references = page_check_references(page, sc);
switch (references) {
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 1/6] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
In-Reply-To: <20190531064313.193437-1-minchan@kernel.org>
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
Internally, it works via deactivating pages from active list to inactive's
head if the page is private because inactive list could be full of
used-once pages which are first candidate for the reclaiming and that's a
reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
if the memory pressure happens, they will be reclaimed earlier than other
active pages unless there is no access until the time.
* RFCv1
* renaming from MADV_COOL to MADV_COLD - hannes
* internal review
* use clear_page_youn in deactivate_page - joelaf
* Revise the description - surenb
* Renaming from MADV_WARM to MADV_COOL - surenb
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 ++++
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 111 +++++++++++++++++++++++++
mm/swap.c | 43 ++++++++++
6 files changed, 172 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
SetPageYoung(page);
}
+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
+static void clear_page_young(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_COLD 5 /* deactivatie these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..bff150eab6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_COLD:
case MADV_FREE:
return 0;
default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ pmdp_test_and_clear_young(vma, addr, pmd);
+ deactivate_page(page);
+huge_unlock:
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ ptep_test_and_clear_young(vma, addr, pte);
+ deactivate_page(page);
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk cool_walk = {
+ .pmd_entry = madvise_cold_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &cool_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_COLD:
+ return madvise_cold(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_FREE:
+ case MADV_COLD:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ clear_page_young(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
}
}
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) || PageCompound(page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
/**
* mark_page_lazyfree - make an anon page lazyfree
* @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [RFCv2 0/6] introduce memory hinting API for external process
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
oleksandr, hdanton, Minchan Kim
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster so
we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps should
be killed in a process called ActivityManagerService. ActivityManagerService
tracks every Android app or service that the user could be interacting with
at any time and translates that into a ranked list for lmkd(low memory
killer daemon). They are likely to be killed by lmkd if the system has to
reclaim memory. In that sense they are similar to entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a cached
process to be killed(we measured performance swapping out vs. zapping the
memory by killing a process. Unsurprisingly, zapping is 10x times faster
even though we use zram which is much faster than real storage) so kill
from lmkd will often satisfy the high zone watermark, resulting in very
few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.
To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed when memory pressure rises.
This approach is similar in spirit to madvise(MADV_WONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To solve the concern, this patch introduces new syscall -
struct pr_madvise_param {
int size; /* the size of this structure */
int cookie; /* reserved to support atomicity */
int nr_elem; /* count of below arrary fields */
int __user *hints; /* hints for each range */
/* to store result of each operation */
const struct iovec __user *results;
/* input address ranges */
const struct iovec __user *ranges;
};
int process_madvise(int pidfd, struct pr_madvise_param *u_param,
unsigned long flags);
The syscall get pidfd to give hints to external process and provides
pair of result/ranges vector arguments so that it could give several
hints to each address range all at once. It also has cookie variable
to support atomicity of the API for address ranges operations. IOW, if
target process changes address space since monitor process has parsed
address ranges via map_files or maps, the API can detect the race so
could cancel entire address space operation. It's not implemented yet.
Daniel Colascione suggested a idea(Please read description in patch[6/6])
and this patchset adds cookie a variable for the future.
- Experiment
We did bunch of testing with several hundreds of real users, not artificial
benchmark on android. We saw about 17% cold start decreasement without any
significant battery/app startup latency issues. And with artificial benchmark
which launches and switching apps, we saw average 7% app launching improvement,
18% less lmkd kill and good stat from vmstat.
A is vanilla and B is process_madvise.
A B delta ratio(%)
allocstall_dma 0 0 0 0.00
allocstall_movable 1464 457 -1007 -69.00
allocstall_normal 263210 190763 -72447 -28.00
allocstall_total 264674 191220 -73454 -28.00
compact_daemon_wake 26912 25294 -1618 -7.00
compact_fail 17885 14151 -3734 -21.00
compact_free_scanned 4204766409 3835994922 -368771487 -9.00
compact_isolated 3446484 2967618 -478866 -14.00
compact_migrate_scanned 1621336411 1324695710 -296640701 -19.00
compact_stall 19387 15343 -4044 -21.00
compact_success 1502 1192 -310 -21.00
kswapd_high_wmark_hit_quickly 234 184 -50 -22.00
kswapd_inodesteal 221635 233093 11458 5.00
kswapd_low_wmark_hit_quickly 66065 54009 -12056 -19.00
nr_dirtied 259934 296476 36542 14.00
nr_vmscan_immediate_reclaim 2587 2356 -231 -9.00
nr_vmscan_write 1274232 2661733 1387501 108.00
nr_written 1514060 2937560 1423500 94.00
pageoutrun 67561 55133 -12428 -19.00
pgactivate 2335060 1984882 -350178 -15.00
pgalloc_dma 13743011 14096463 353452 2.00
pgalloc_movable 0 0 0 0.00
pgalloc_normal 18742440 16802065 -1940375 -11.00
pgalloc_total 32485451 30898528 -1586923 -5.00
pgdeactivate 4262210 2930670 -1331540 -32.00
pgfault 30812334 31085065 272731 0.00
pgfree 33553970 31765164 -1788806 -6.00
pginodesteal 33411 15084 -18327 -55.00
pglazyfreed 0 0 0 0.00
pgmajfault 551312 1508299 956987 173.00
pgmigrate_fail 43927 29330 -14597 -34.00
pgmigrate_success 1399851 1203922 -195929 -14.00
pgpgin 24141776 19032156 -5109620 -22.00
pgpgout 959344 1103316 143972 15.00
pgpgoutclean 4639732 3765868 -873864 -19.00
pgrefill 4884560 3006938 -1877622 -39.00
pgrotated 37828 25897 -11931 -32.00
pgscan_direct 1456037 957567 -498470 -35.00
pgscan_direct_throttle 0 0 0 0.00
pgscan_kswapd 6667767 5047360 -1620407 -25.00
pgscan_total 8123804 6004927 -2118877 -27.00
pgskip_dma 0 0 0 0.00
pgskip_movable 0 0 0 0.00
pgskip_normal 14907 25382 10475 70.00
pgskip_total 14907 25382 10475 70.00
pgsteal_direct 1118986 690215 -428771 -39.00
pgsteal_kswapd 4750223 3657107 -1093116 -24.00
pgsteal_total 5869209 4347322 -1521887 -26.00
pswpin 417613 1392647 975034 233.00
pswpout 1274224 2661731 1387507 108.00
slabs_scanned 13686905 10807200 -2879705 -22.00
workingset_activate 668966 569444 -99522 -15.00
workingset_nodereclaim 38957 32621 -6336 -17.00
workingset_refault 2816795 2179782 -637013 -23.00
workingset_restore 294320 168601 -125719 -43.00
pgmajfault is increased by 173% because swapin is increased by 200% by
process_madvise hint. However, swap read based on zram is much cheaper
than file IO in performance point of view and app hot start by swapin is
also cheaper than cold start from the beginning of app which needs many IO
from storage and initialization steps.
Brian Geffon in ChromeOS team had an experiment with process_madvise(2)
Quote form him:
"What I found is that by using process_madvise after a tab has been back
grounded for more than 45 seconds reduced the average tab switch times by
25%! This is a huge result and very obvious validation that process_madvise
hints works well for the ChromeOS use case."
This patchset is against on next-20190530.
Minchan Kim (6):
[1/6] mm: introduce MADV_COLD
[2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
[3/6] mm: introduce MADV_PAGEOUT
[4/6] mm: factor out madvise's core functionality
[5/6] mm: introduce external memory hinting API
[6/6] mm: extend process_madvise syscall to support vector arrary
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 +
include/linux/pid.h | 4 +
include/linux/swap.h | 2 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/mman-common.h | 13 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 8 +
kernel/signal.c | 7 +-
kernel/sys_ni.c | 1 +
mm/madvise.c | 586 +++++++++++++++++++++----
mm/swap.c | 43 ++
mm/vmscan.c | 83 +++-
15 files changed, 691 insertions(+), 81 deletions(-)
* from RFCv1
* Dropped MADV_[ANONYMOUS|FILE]_FILTER option. The option gave several
hundered millisecond improvement via removing address range parsing
overhead. However, there is other suggestion to create general API
to provide processs information(process_getinfo(2)) which would be
very fast via binary form and get only necessary information.
* There was lots of discussion how to provide *consistency*,*atomicity*
against on target process's address space change. It needs more
discussion. Let's do that in [6/6] if you have still a argument.
* There was a concern about the vector support because it didn't show
great performance benefit. However, I still included it because
it could make us address space range operation's atomicity easier
in future without introducing other new syscall.
* Naming of process_madvise - there was request to change naming from
process_madvise to pidfd_madvise for the consistency of existing
pidfd syscall but some of people including me still want to have
syscall name focus on what it's doing rather than how it does.
* I limited hints new syscall supports as MADV_COLD|PAGEOUT at this
moment because I'm not sure all hints makes sense for external
processs and would be some lurked bugs which relies on the caller
context. Please see description in [5/6].
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-05-31 0:20 UTC (permalink / raw)
To: Paul Moore
Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
ebiederm, nhorman
In-Reply-To: <CAHC9VhT5HPt9rCJoDutdvA3r1Y1GOHfpXe2eJ54atNC1=Vd8LA@mail.gmail.com>
On 2019-05-30 19:26, Paul Moore wrote:
> On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > >
> > > [REMINDER: It is an "*audit* container ID" and not a general
> > > "container ID" ;) Smiley aside, I'm not kidding about that part.]
> >
> > This sort of seems like a distinction without a difference; presumably
> > audit is going to want to differentiate between everything that people
> > in userspace call a container. So you'll have to support all this
> > insanity anyway, even if it's "not a container ID".
>
> That's not quite right. Audit doesn't care about what a container is,
> or is not, it also doesn't care if the "audit container ID" actually
> matches the ID used by the container engine in userspace and I think
> that is a very important line to draw. Audit is simply given a value
> which it calls the "audit container ID", it ensures that the value is
> inherited appropriately (e.g. children inherit their parent's audit
> container ID), and it uses the value in audit records to provide some
> additional context for log analysis. The distinction isn't limited to
> the value itself, but also to how it is used; it is an "audit
> container ID" and not a "container ID" because this value is
> exclusively for use by the audit subsystem. We are very intentionally
> not adding a generic container ID to the kernel. If the kernel does
> ever grow a general purpose container ID we will be one of the first
> ones in line to make use of it, but we are not going to be the ones to
> generically add containers to the kernel. Enough people already hate
> audit ;)
>
> > > I'm not interested in supporting/merging something that isn't useful;
> > > if this doesn't work for your use case then we need to figure out what
> > > would work. It sounds like nested containers are much more common in
> > > the lxc world, can you elaborate a bit more on this?
> > >
> > > As far as the possible solutions you mention above, I'm not sure I
> > > like the per-userns audit container IDs, I'd much rather just emit the
> > > necessary tracking information via the audit record stream and let the
> > > log analysis tools figure it out. However, the bigger question is how
> > > to limit (re)setting the audit container ID when you are in a non-init
> > > userns. For reasons already mentioned, using capable() is a non
> > > starter for everything but the initial userns, and using ns_capable()
> > > is equally poor as it essentially allows any userns the ability to
> > > munge it's audit container ID (obviously not good). It appears we
> > > need a different method for controlling access to the audit container
> > > ID.
> >
> > One option would be to make it a string, and have it be append only.
> > That should be safe with no checks.
> >
> > I know there was a long thread about what type to make this thing. I
> > think you could accomplish the append-only-ness with a u64 if you had
> > some rule about only allowing setting lower order bits than those that
> > are already set. With 4 bits for simplicity:
> >
> > 1100 # initial container id
> > 1100 -> 1011 # not allowed
> > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > # no lower order bits left
> >
> > There are probably fancier ways to do it if you actually understand
> > math :)
>
> ;)
>
> > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > you have 64 bits, this might be reasonable. You could just teach
> > container engines to use the first say N bits for themselves, with a 1
> > bit for the barrier at the end.
>
> I like the creativity, but I worry that at some point these
> limitations are going to be raised (limits have a funny way of doing
> that over time) and we will be in trouble. I say "trouble" because I
> want to be able to quickly do an audit container ID comparison and
> we're going to pay a penalty for these larger values (we'll need this
> when we add multiple auditd support and the requisite record routing).
>
> Thinking about this makes me also realize we probably need to think a
> bit longer about audit container ID conflicts between orchestrators.
> Right now we just take the value that is given to us by the
> orchestrator, but if we want to allow multiple container orchestrators
> to work without some form of cooperation in userspace (I think we have
> to assume the orchestrators will not talk to each other) we likely
> need to have some way to block reuse of an audit container ID. We
> would either need to prevent the orchestrator from explicitly setting
> an audit container ID to a currently in use value, or instead generate
> the audit container ID in the kernel upon an event triggered by the
> orchestrator (e.g. a write to a /proc file). I suspect we should
> start looking at the idr code, I think we will need to make use of it.
My first reaction to using the IDR code is that once an idr is given up,
it can be reused. I suppose we request IDRs and then never give them up
to avoid reuse...
I already had some ideas of preventing an existing ID from being reused,
but that makes the practice of some container engines injecting
processes into existing containers difficult if not impossible.
> paul moore
> www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-30 23:26 UTC (permalink / raw)
To: Tycho Andersen
Cc: Serge E. Hallyn, Richard Guy Briggs, containers, linux-api,
Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
ebiederm, nhorman
In-Reply-To: <20190530212900.GC5739@cisco>
On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> >
> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;) Smiley aside, I'm not kidding about that part.]
>
> This sort of seems like a distinction without a difference; presumably
> audit is going to want to differentiate between everything that people
> in userspace call a container. So you'll have to support all this
> insanity anyway, even if it's "not a container ID".
That's not quite right. Audit doesn't care about what a container is,
or is not, it also doesn't care if the "audit container ID" actually
matches the ID used by the container engine in userspace and I think
that is a very important line to draw. Audit is simply given a value
which it calls the "audit container ID", it ensures that the value is
inherited appropriately (e.g. children inherit their parent's audit
container ID), and it uses the value in audit records to provide some
additional context for log analysis. The distinction isn't limited to
the value itself, but also to how it is used; it is an "audit
container ID" and not a "container ID" because this value is
exclusively for use by the audit subsystem. We are very intentionally
not adding a generic container ID to the kernel. If the kernel does
ever grow a general purpose container ID we will be one of the first
ones in line to make use of it, but we are not going to be the ones to
generically add containers to the kernel. Enough people already hate
audit ;)
> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work. It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out. However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns. For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good). It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> One option would be to make it a string, and have it be append only.
> That should be safe with no checks.
>
> I know there was a long thread about what type to make this thing. I
> think you could accomplish the append-only-ness with a u64 if you had
> some rule about only allowing setting lower order bits than those that
> are already set. With 4 bits for simplicity:
>
> 1100 # initial container id
> 1100 -> 1011 # not allowed
> 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> # no lower order bits left
>
> There are probably fancier ways to do it if you actually understand
> math :)
;)
> Since userns nesting is limited to 32 levels (right now, IIRC), and
> you have 64 bits, this might be reasonable. You could just teach
> container engines to use the first say N bits for themselves, with a 1
> bit for the barrier at the end.
I like the creativity, but I worry that at some point these
limitations are going to be raised (limits have a funny way of doing
that over time) and we will be in trouble. I say "trouble" because I
want to be able to quickly do an audit container ID comparison and
we're going to pay a penalty for these larger values (we'll need this
when we add multiple auditd support and the requisite record routing).
Thinking about this makes me also realize we probably need to think a
bit longer about audit container ID conflicts between orchestrators.
Right now we just take the value that is given to us by the
orchestrator, but if we want to allow multiple container orchestrators
to work without some form of cooperation in userspace (I think we have
to assume the orchestrators will not talk to each other) we likely
need to have some way to block reuse of an audit container ID. We
would either need to prevent the orchestrator from explicitly setting
an audit container ID to a currently in use value, or instead generate
the audit container ID in the kernel upon an event triggered by the
orchestrator (e.g. a write to a /proc file). I suspect we should
start looking at the idr code, I think we will need to make use of it.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Tycho Andersen @ 2019-05-30 21:29 UTC (permalink / raw)
To: Paul Moore
Cc: nhorman, Richard Guy Briggs, linux-api, containers, LKML,
dhowells, Linux-Audit Mailing List, netfilter-devel, ebiederm,
simo, netdev, linux-fsdevel, Eric Paris, Serge E. Hallyn
In-Reply-To: <CAHC9VhThLiQzGYRUWmSuVfOC6QCDmA75BDB7Eg7V8HX4x7ymQg@mail.gmail.com>
On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
>
> [REMINDER: It is an "*audit* container ID" and not a general
> "container ID" ;) Smiley aside, I'm not kidding about that part.]
This sort of seems like a distinction without a difference; presumably
audit is going to want to differentiate between everything that people
in userspace call a container. So you'll have to support all this
insanity anyway, even if it's "not a container ID".
> I'm not interested in supporting/merging something that isn't useful;
> if this doesn't work for your use case then we need to figure out what
> would work. It sounds like nested containers are much more common in
> the lxc world, can you elaborate a bit more on this?
>
> As far as the possible solutions you mention above, I'm not sure I
> like the per-userns audit container IDs, I'd much rather just emit the
> necessary tracking information via the audit record stream and let the
> log analysis tools figure it out. However, the bigger question is how
> to limit (re)setting the audit container ID when you are in a non-init
> userns. For reasons already mentioned, using capable() is a non
> starter for everything but the initial userns, and using ns_capable()
> is equally poor as it essentially allows any userns the ability to
> munge it's audit container ID (obviously not good). It appears we
> need a different method for controlling access to the audit container
> ID.
One option would be to make it a string, and have it be append only.
That should be safe with no checks.
I know there was a long thread about what type to make this thing. I
think you could accomplish the append-only-ness with a u64 if you had
some rule about only allowing setting lower order bits than those that
are already set. With 4 bits for simplicity:
1100 # initial container id
1100 -> 1011 # not allowed
1100 -> 1101 # allowed, but now 1101 is set in stone since there are
# no lower order bits left
There are probably fancier ways to do it if you actually understand
math :)
Since userns nesting is limited to 32 levels (right now, IIRC), and
you have 64 bits, this might be reasonable. You could just teach
container engines to use the first say N bits for themselves, with a 1
bit for the barrier at the end.
Tycho
^ permalink raw reply
* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Richard Guy Briggs @ 2019-05-30 21:10 UTC (permalink / raw)
To: Paul Moore
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhR6oqKer_p6Xsu6oO2j3bMZGPXWHnGchZOqUoMx9yJFwQ@mail.gmail.com>
On 2019-05-30 16:45, Paul Moore wrote:
> On Thu, May 30, 2019 at 4:37 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-30 10:34, Paul Moore wrote:
> > > On Thu, May 30, 2019 at 10:20 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >
> > > > On 2019-05-29 18:16, Paul Moore wrote:
> > > > > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > >
> > > > > > Implement audit container identifier filtering using the AUDIT_CONTID
> > > > > > field name to send an 8-character string representing a u64 since the
> > > > > > value field is only u32.
> > > > > >
> > > > > > Sending it as two u32 was considered, but gathering and comparing two
> > > > > > fields was more complex.
> > > > > >
> > > > > > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> > > > > >
> > > > > > Please see the github audit kernel issue for the contid filter feature:
> > > > > > https://github.com/linux-audit/audit-kernel/issues/91
> > > > > > Please see the github audit userspace issue for filter additions:
> > > > > > https://github.com/linux-audit/audit-userspace/issues/40
> > > > > > Please see the github audit testsuiite issue for the test case:
> > > > > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > > > > Please see the github audit wiki for the feature overview:
> > > > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > > > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > > > > ---
> > > > > > include/linux/audit.h | 1 +
> > > > > > include/uapi/linux/audit.h | 5 ++++-
> > > > > > kernel/audit.h | 1 +
> > > > > > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > kernel/auditsc.c | 4 ++++
> > > > > > 5 files changed, 57 insertions(+), 1 deletion(-)
> > > > >
> > > > > ...
> > > > >
> > > > > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > > > > > index 63f8b3f26fab..407b5bb3b4c6 100644
> > > > > > --- a/kernel/auditfilter.c
> > > > > > +++ b/kernel/auditfilter.c
> > > > > > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +int audit_comparator64(u64 left, u32 op, u64 right)
> > > > > > +{
> > > > > > + switch (op) {
> > > > > > + case Audit_equal:
> > > > > > + return (left == right);
> > > > > > + case Audit_not_equal:
> > > > > > + return (left != right);
> > > > > > + case Audit_lt:
> > > > > > + return (left < right);
> > > > > > + case Audit_le:
> > > > > > + return (left <= right);
> > > > > > + case Audit_gt:
> > > > > > + return (left > right);
> > > > > > + case Audit_ge:
> > > > > > + return (left >= right);
> > > > > > + case Audit_bitmask:
> > > > > > + return (left & right);
> > > > > > + case Audit_bittest:
> > > > > > + return ((left & right) == right);
> > > > > > + default:
> > > > > > + BUG();
> > > > >
> > > > > A little birdy mentioned the BUG() here as a potential issue and while
> > > > > I had ignored it in earlier patches because this is likely a
> > > > > cut-n-paste from another audit comparator function, I took a closer
> > > > > look this time. It appears as though we will never have an invalid op
> > > > > value as audit_data_to_entry()/audit_to_op() ensure that the op value
> > > > > is a a known good value. Removing the BUG() from all the audit
> > > > > comparators is a separate issue, but I think it would be good to
> > > > > remove it from this newly added comparator; keeping it so that we
> > > > > return "0" in the default case seems reasoanble.
> > > >
> > > > Fair enough. That BUG(); can be removed.
> > >
> > > Please send a fixup patch for this.
> >
> > The fixup patch is trivial.
>
> Yes, I know.
>
> > The rebase to v5.2-rc1 audit/next had merge
> > conflicts with four recent patchsets. It may be simpler to submit a new
> > patchset and look at a diff of the two sets. I'm testing the rebase
> > now.
>
> Great thanks. Although you might want to hold off a bit on posting
> the next revision until we sort out the discussion which is happening
> in patch 02/10; unfortunately I fear we may need to change some of the
> logic.
I'm watching... I have no immediate ideas on how to address that
discussion yet. I'm optimistic it can be adjusted after the initial
commit without changing the API.
> paul moore www.paul-moore.com
- RGB
^ permalink raw reply
* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-05-30 20:56 UTC (permalink / raw)
To: Florian Weimer
Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
linux-api
In-Reply-To: <140718133.18261.1559144710554.JavaMail.zimbra@efficios.com>
----- On May 29, 2019, at 11:45 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> ----- On May 27, 2019, at 3:27 PM, Mathieu Desnoyers
> mathieu.desnoyers@efficios.com wrote:
>
>> ----- On May 27, 2019, at 7:19 AM, Florian Weimer fweimer@redhat.com wrote:
>>
>
> [...]
>
>>>
>>> Furthermore, the reference to ELF constructors is misleading. I believe
>>> the code you added to __libc_start_main to initialize __rseq_handled and
>>> register __seq_abi with the kernel runs *after* ELF constructors have
>>> executed (and not at all if the main program is written in Go, alas).
>>> All initialization activity for the shared case needs to happen in
>>> elf/rtld.c or called from there, probably as part of the security
>>> initialization code or thereabouts.
>>
>> in elf/rtld.c:dl_main() we have the following code:
>>
>> /* We do not initialize any of the TLS functionality unless any of the
>> initial modules uses TLS. This makes dynamic loading of modules with
>> TLS impossible, but to support it requires either eagerly doing setup
>> now or lazily doing it later. Doing it now makes us incompatible with
>> an old kernel that can't perform TLS_INIT_TP, even if no TLS is ever
>> used. Trying to do it lazily is too hairy to try when there could be
>> multiple threads (from a non-TLS-using libpthread). */
>> bool was_tls_init_tp_called = tls_init_tp_called;
>> if (tcbp == NULL)
>> tcbp = init_tls ();
>>
>> If I understand your point correctly, I should move the rseq_init() and
>> rseq_register_current_thread() for the SHARED case just after this
>> initialization, otherwise calling those from LIBC_START_MAIN() is too
>> late and it runs after initial modules constructors (or not at all for
>> Go). However, this means glibc will start using TLS internally. I'm
>> concerned that this is not quite in line with the above comment which
>> states that TLS is not initialized if no initial modules use TLS.
>>
>> For the !SHARED use-case, if my understanding is correct, I should keep
>> rseq_init() and rseq_register_current_thread() calls within LIBC_START_MAIN().
>
> I've moved the rseq initialization for SHARED case to the very end of
> elf/rtld.c:init_tls(), and get the following error on make check:
>
> Generating locale am_ET.UTF-8: this might take a while...
> Inconsistency detected by ld.so: get-dynamic-info.h: 143: elf_get_dynamic_info:
> Assertion `info[DT_FLAGS] == NULL || (info[DT_FLAGS]->d_un.d_val &
> ~DF_BIND_NOW) == 0' failed!
> Charmap: "UTF-8" Inputfile: "am_ET" Outputdir: "am_ET.UTF-8" failed
> /bin/sh: 4: cannot create
> /home/efficios/git/glibc-build/localedata/am_ET.UTF-8/LC_CTYPE.test-result:
> Directory nonexistent
>
> This error goes away if I comment out the call to rseq_register_current_thread
> (),
> which touches the __rseq_abi __thread variable and issues a system call.
>
> Currently, the __rseq_abi __thread variable is within
> sysdeps/unix/sysv/linux/rseq-sym.c, which is added to the
> sysdep_routines within sysdeps/unix/sysv/linux/Makefile. I
> suspect it may need to be moved elsewhere.
>
> Any thoughts on how to solve this ?
I found that it's because touching a __thread variable from
ld-linux-x86-64.so.2 ends up setting the DF_STATIC_TLS flag
for that .so, which is really not expected.
Even if I tweak the assert to make it more lenient there,
touching the __thread variable ends up triggering a SIGFPE.
So rather than touching the TLS from ld-linux-x86-64.so.2,
I've rather experimented with moving the rseq initialization
for both SHARED and !SHARED cases to a library constructor
within libc.so.
Are you aware of any downside to this approach ?
diff --git a/csu/libc-start.c b/csu/libc-start.c
index 5d9c3675fa..9755ed5467 100644
--- a/csu/libc-start.c
+++ b/csu/libc-start.c
@@ -22,6 +22,7 @@
#include <ldsodefs.h>
#include <exit-thread.h>
#include <libc-internal.h>
+#include <rseq-internal.h>
#include <elf/dl-tunables.h>
@@ -81,6 +82,14 @@ apply_irel (void)
}
#endif
+static
+__attribute__ ((constructor))
+void __rseq_libc_init (void)
+{
+ rseq_init ();
+ /* Register rseq ABI to the kernel. */
+ (void) rseq_register_current_thread ();
+}
#ifdef LIBC_START_MAIN
# ifdef LIBC_START_DISABLE_INLINE
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply related
* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Paul Moore @ 2019-05-30 20:45 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190530203702.fibsrazabbiifjvf@madcap2.tricolour.ca>
On Thu, May 30, 2019 at 4:37 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 10:34, Paul Moore wrote:
> > On Thu, May 30, 2019 at 10:20 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >
> > > On 2019-05-29 18:16, Paul Moore wrote:
> > > > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > >
> > > > > Implement audit container identifier filtering using the AUDIT_CONTID
> > > > > field name to send an 8-character string representing a u64 since the
> > > > > value field is only u32.
> > > > >
> > > > > Sending it as two u32 was considered, but gathering and comparing two
> > > > > fields was more complex.
> > > > >
> > > > > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> > > > >
> > > > > Please see the github audit kernel issue for the contid filter feature:
> > > > > https://github.com/linux-audit/audit-kernel/issues/91
> > > > > Please see the github audit userspace issue for filter additions:
> > > > > https://github.com/linux-audit/audit-userspace/issues/40
> > > > > Please see the github audit testsuiite issue for the test case:
> > > > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > > > Please see the github audit wiki for the feature overview:
> > > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > > > ---
> > > > > include/linux/audit.h | 1 +
> > > > > include/uapi/linux/audit.h | 5 ++++-
> > > > > kernel/audit.h | 1 +
> > > > > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > kernel/auditsc.c | 4 ++++
> > > > > 5 files changed, 57 insertions(+), 1 deletion(-)
> > > >
> > > > ...
> > > >
> > > > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > > > > index 63f8b3f26fab..407b5bb3b4c6 100644
> > > > > --- a/kernel/auditfilter.c
> > > > > +++ b/kernel/auditfilter.c
> > > > > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > > > > }
> > > > > }
> > > > >
> > > > > +int audit_comparator64(u64 left, u32 op, u64 right)
> > > > > +{
> > > > > + switch (op) {
> > > > > + case Audit_equal:
> > > > > + return (left == right);
> > > > > + case Audit_not_equal:
> > > > > + return (left != right);
> > > > > + case Audit_lt:
> > > > > + return (left < right);
> > > > > + case Audit_le:
> > > > > + return (left <= right);
> > > > > + case Audit_gt:
> > > > > + return (left > right);
> > > > > + case Audit_ge:
> > > > > + return (left >= right);
> > > > > + case Audit_bitmask:
> > > > > + return (left & right);
> > > > > + case Audit_bittest:
> > > > > + return ((left & right) == right);
> > > > > + default:
> > > > > + BUG();
> > > >
> > > > A little birdy mentioned the BUG() here as a potential issue and while
> > > > I had ignored it in earlier patches because this is likely a
> > > > cut-n-paste from another audit comparator function, I took a closer
> > > > look this time. It appears as though we will never have an invalid op
> > > > value as audit_data_to_entry()/audit_to_op() ensure that the op value
> > > > is a a known good value. Removing the BUG() from all the audit
> > > > comparators is a separate issue, but I think it would be good to
> > > > remove it from this newly added comparator; keeping it so that we
> > > > return "0" in the default case seems reasoanble.
> > >
> > > Fair enough. That BUG(); can be removed.
> >
> > Please send a fixup patch for this.
>
> The fixup patch is trivial.
Yes, I know.
> The rebase to v5.2-rc1 audit/next had merge
> conflicts with four recent patchsets. It may be simpler to submit a new
> patchset and look at a diff of the two sets. I'm testing the rebase
> now.
Great thanks. Although you might want to hold off a bit on posting
the next revision until we sort out the discussion which is happening
in patch 02/10; unfortunately I fear we may need to change some of the
logic.
^ permalink raw reply
* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Richard Guy Briggs @ 2019-05-30 20:37 UTC (permalink / raw)
To: Paul Moore
Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhQhkzCtVOXhPL7BzaqvF0y+8gBQwhOo1EQDS2OUyZbV5g@mail.gmail.com>
On 2019-05-30 10:34, Paul Moore wrote:
> On Thu, May 30, 2019 at 10:20 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > On 2019-05-29 18:16, Paul Moore wrote:
> > > On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >
> > > > Implement audit container identifier filtering using the AUDIT_CONTID
> > > > field name to send an 8-character string representing a u64 since the
> > > > value field is only u32.
> > > >
> > > > Sending it as two u32 was considered, but gathering and comparing two
> > > > fields was more complex.
> > > >
> > > > The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
> > > >
> > > > Please see the github audit kernel issue for the contid filter feature:
> > > > https://github.com/linux-audit/audit-kernel/issues/91
> > > > Please see the github audit userspace issue for filter additions:
> > > > https://github.com/linux-audit/audit-userspace/issues/40
> > > > Please see the github audit testsuiite issue for the test case:
> > > > https://github.com/linux-audit/audit-testsuite/issues/64
> > > > Please see the github audit wiki for the feature overview:
> > > > https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > > > Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> > > > ---
> > > > include/linux/audit.h | 1 +
> > > > include/uapi/linux/audit.h | 5 ++++-
> > > > kernel/audit.h | 1 +
> > > > kernel/auditfilter.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > kernel/auditsc.c | 4 ++++
> > > > 5 files changed, 57 insertions(+), 1 deletion(-)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > > > index 63f8b3f26fab..407b5bb3b4c6 100644
> > > > --- a/kernel/auditfilter.c
> > > > +++ b/kernel/auditfilter.c
> > > > @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
> > > > }
> > > > }
> > > >
> > > > +int audit_comparator64(u64 left, u32 op, u64 right)
> > > > +{
> > > > + switch (op) {
> > > > + case Audit_equal:
> > > > + return (left == right);
> > > > + case Audit_not_equal:
> > > > + return (left != right);
> > > > + case Audit_lt:
> > > > + return (left < right);
> > > > + case Audit_le:
> > > > + return (left <= right);
> > > > + case Audit_gt:
> > > > + return (left > right);
> > > > + case Audit_ge:
> > > > + return (left >= right);
> > > > + case Audit_bitmask:
> > > > + return (left & right);
> > > > + case Audit_bittest:
> > > > + return ((left & right) == right);
> > > > + default:
> > > > + BUG();
> > >
> > > A little birdy mentioned the BUG() here as a potential issue and while
> > > I had ignored it in earlier patches because this is likely a
> > > cut-n-paste from another audit comparator function, I took a closer
> > > look this time. It appears as though we will never have an invalid op
> > > value as audit_data_to_entry()/audit_to_op() ensure that the op value
> > > is a a known good value. Removing the BUG() from all the audit
> > > comparators is a separate issue, but I think it would be good to
> > > remove it from this newly added comparator; keeping it so that we
> > > return "0" in the default case seems reasoanble.
> >
> > Fair enough. That BUG(); can be removed.
>
> Please send a fixup patch for this.
The fixup patch is trivial. The rebase to v5.2-rc1 audit/next had merge
conflicts with four recent patchsets. It may be simpler to submit a new
patchset and look at a diff of the two sets. I'm testing the rebase
now.
> paul moore www.paul-moore.com
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox