* [PATCH 1/3] Checkpoint/restart epoll sets
@ 2009-10-19 17:04 Matt Helsley
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Matt Helsley @ 2009-10-19 17:04 UTC (permalink / raw)
To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Save/restore epoll items during checkpoint/restart respectively.
kmalloc failures should be dealt with more kindly than just error-out
because epoll is made to poll many thousands of file descriptors.
Subsequent patches will change epoll c/r to "chunk" its output/input
respectively.
Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Changelog:
v5:
Fix potential recursion during collect.
Replace call to ckpt_obj_collect() with ckpt_collect_file().
[Oren]
Fix checkpoint leak detection when there are more items than
expected.
Cleanup/simplify error write paths. (will complicate in a later
patch) [Oren]
Remove files_deferq bits. [Oren]
Remove extra newline. [Oren]
Remove aggregate check on number of watches added. [Oren]
This is OK since these will be done individually anyway.
Remove check for negative objrefs during restart. [Oren]
Fixup comment regarding race that indicates checkpoint leaks.
[Oren]
s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
Patch for lots of epoll items follows.
Moved sys_close(epfd) right under fget(). [Oren]
Use CKPT_HDR_BUFFER rather than custome ckpt_read/write_*
This makes it more similar to the pid array code. [Oren]
It also simplifies the error recovery paths.
Tested polling a pipe and 50,000 UNIX sockets.
v4: ckpt-v18
Use files_deferq as submitted by Dan Smith
Cleanup to only report >= 1 items when debugging.
v3: [unposted]
Removed most of the TODOs -- the remainder will be removed by
subsequent patches.
Fixed missing ep_file_collect() [Serge]
Rather than include checkpoint_hdr.h declare (but do not define)
the two structs needed in eventpoll.h [Oren]
Complain with ckpt_write_err() when we detect checkpoint obj
leaks. [Oren]
Remove redundant is_epoll_file() check in collect. [Oren]
Move epfile_objref lookup to simplify error handling. [Oren]
Simplify error handling with early return in
ep_eventpoll_checkpoint(). [Oren]
Cleaned up a comment. [Oren]
Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
Renumbered to indicate that it follows the file table.
Renamed the epoll struct in checkpoint_hdr.h [Oren]
Also renamed substruct.
Fixup return of empty ep_file_restore(). [Oren]
Changed some error returns. [Oren]
Changed some tests to BUG_ON(). [Oren]
Factored out watch insert with epoll_ctl() into do_epoll_ctl().
[Cedric, Oren]
Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
checkpoint/files.c | 8 +
fs/eventpoll.c | 308 ++++++++++++++++++++++++++++++++++++----
include/linux/checkpoint_hdr.h | 18 +++
include/linux/eventpoll.h | 17 ++-
4 files changed, 322 insertions(+), 29 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index f6de07e..6ea2389 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
#include <linux/deferqueue.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
+#include <linux/eventpoll.h>
#include <net/sock.h>
@@ -607,6 +608,13 @@ static struct restore_file_ops restore_file_ops[] = {
.file_type = CKPT_FILE_TTY,
.restore = tty_file_restore,
},
+#ifdef CONFIG_EPOLL
+ {
+ .file_name = "EPOLL",
+ .file_type = CKPT_FILE_EPOLL,
+ .restore = ep_file_restore,
+ },
+#endif
};
static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 085c5c0..4706ec5 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,12 @@
#include <asm/mman.h>
#include <asm/atomic.h>
+#ifdef CONFIG_CHECKPOINT
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/deferqueue.h>
+#endif
+
/*
* LOCKING:
* There are three level of locking required by epoll :
@@ -671,10 +677,20 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
return pollflags != -1 ? pollflags : 0;
}
+#ifdef CONFIG_CHECKPOINT
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define ep_eventpoll_checkpoint NULL
+#define ep_file_collect NULL
+#endif
+
/* File callbacks that implement the eventpoll file behaviour */
static const struct file_operations eventpoll_fops = {
.release = ep_eventpoll_release,
- .poll = ep_eventpoll_poll
+ .poll = ep_eventpoll_poll,
+ .checkpoint = ep_eventpoll_checkpoint,
+ .collect = ep_file_collect,
};
/* Fast test to see if the file is an evenpoll file */
@@ -1226,35 +1242,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
- struct epoll_event __user *, event)
+int do_epoll_ctl(int op, int fd,
+ struct file *file, struct file *tfile,
+ struct epoll_event *epds)
{
int error;
- struct file *file, *tfile;
struct eventpoll *ep;
struct epitem *epi;
- struct epoll_event epds;
-
- error = -EFAULT;
- if (ep_op_has_event(op) &&
- copy_from_user(&epds, event, sizeof(struct epoll_event)))
- goto error_return;
-
- /* Get the "struct file *" for the eventpoll file */
- error = -EBADF;
- file = fget(epfd);
- if (!file)
- goto error_return;
-
- /* Get the "struct file *" for the target file */
- tfile = fget(fd);
- if (!tfile)
- goto error_fput;
/* The target file descriptor must support poll */
error = -EPERM;
if (!tfile->f_op || !tfile->f_op->poll)
- goto error_tgt_fput;
+ return error;
/*
* We have to check that the file structure underneath the file descriptor
@@ -1263,7 +1262,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
*/
error = -EINVAL;
if (file == tfile || !is_file_epoll(file))
- goto error_tgt_fput;
+ return error;
/*
* At this point it is safe to assume that the "private_data" contains
@@ -1284,8 +1283,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
switch (op) {
case EPOLL_CTL_ADD:
if (!epi) {
- epds.events |= POLLERR | POLLHUP;
- error = ep_insert(ep, &epds, tfile, fd);
+ epds->events |= POLLERR | POLLHUP;
+ error = ep_insert(ep, epds, tfile, fd);
} else
error = -EEXIST;
break;
@@ -1297,15 +1296,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
break;
case EPOLL_CTL_MOD:
if (epi) {
- epds.events |= POLLERR | POLLHUP;
- error = ep_modify(ep, epi, &epds);
+ epds->events |= POLLERR | POLLHUP;
+ error = ep_modify(ep, epi, epds);
} else
error = -ENOENT;
break;
}
mutex_unlock(&ep->mtx);
-error_tgt_fput:
+ return error;
+}
+
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ struct epoll_event __user *, event)
+{
+ int error;
+ struct file *file, *tfile;
+ struct epoll_event epds;
+
+ error = -EFAULT;
+ if (ep_op_has_event(op) &&
+ copy_from_user(&epds, event, sizeof(struct epoll_event)))
+ goto error_return;
+
+ /* Get the "struct file *" for the eventpoll file */
+ error = -EBADF;
+ file = fget(epfd);
+ if (!file)
+ goto error_return;
+
+ /* Get the "struct file *" for the target file */
+ tfile = fget(fd);
+ if (!tfile)
+ goto error_fput;
+
+ error = do_epoll_ctl(op, fd, file, tfile, &epds);
fput(tfile);
error_fput:
fput(file);
@@ -1413,6 +1443,228 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
#endif /* HAVE_SET_RESTORE_SIGMASK */
+#ifdef CONFIG_CHECKPOINT
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+ struct rb_node *rbp;
+ struct eventpoll *ep;
+ int ret = 0;
+
+ ep = file->private_data;
+ mutex_lock(&ep->mtx);
+ for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+ struct epitem *epi;
+
+ epi = rb_entry(rbp, struct epitem, rbn);
+ if (is_file_epoll(epi->ffd.file))
+ continue; /* Don't recurse */
+ ret = ckpt_collect_file(ctx, epi->ffd.file);
+ if (ret < 0)
+ break;
+ }
+ mutex_unlock(&ep->mtx);
+ return ret;
+}
+
+struct epoll_deferq_entry {
+ struct ckpt_ctx *ctx;
+ struct file *epfile;
+};
+
+static int ep_items_checkpoint(void *data)
+{
+ struct epoll_deferq_entry *dq_entry = data;
+ struct ckpt_ctx *ctx;
+ struct ckpt_hdr_eventpoll_items *h;
+ struct ckpt_eventpoll_item *items;
+ struct rb_node *rbp;
+ struct eventpoll *ep;
+ __s32 epfile_objref;
+ int i, num_items, ret;
+
+ ctx = dq_entry->ctx;
+
+ epfile_objref = ckpt_obj_lookup(ctx, dq_entry->epfile, CKPT_OBJ_FILE);
+ BUG_ON(epfile_objref <= 0);
+
+ ep = dq_entry->epfile->private_data;
+ mutex_lock(&ep->mtx);
+ for (i = 0, rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), i++) {}
+ mutex_unlock(&ep->mtx);
+ num_items = i;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+ if (!h)
+ return -ENOMEM;
+ h->num_items = num_items;
+ h->epfile_objref = epfile_objref;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret || !num_items)
+ return ret;
+
+ items = kzalloc(sizeof(*items)*num_items, GFP_KERNEL);
+ if (!items)
+ return -ENOMEM;
+ ret = 0;
+ i = 0;
+ mutex_lock(&ep->mtx);
+ for (rbp = rb_first(&ep->rbr); i < num_items && rbp; rbp = rb_next(rbp),
+ i++) {
+ struct epitem *epi;
+ int objref;
+
+ epi = rb_entry(rbp, struct epitem, rbn);
+ items[i].fd = epi->ffd.fd;
+ items[i].events = epi->event.events;
+ items[i].data = epi->event.data;
+ objref = ckpt_obj_lookup(ctx, epi->ffd.file, CKPT_OBJ_FILE);
+ if (objref <= 0) {
+ ret = -EBUSY; /* missing item -- checkpoint obj leak */
+ break;
+ }
+ items[i].file_objref = objref;
+ }
+ mutex_unlock(&ep->mtx);
+ if (i == num_items && rbp)
+ ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
+ if (!ret)
+ ret = ckpt_write_buffer(ctx, items, sizeof(*items)*num_items);
+ else
+ ckpt_write_err(ctx, "E", "checkpoint leak detected.\n", ret);
+ kfree(items);
+ return ret;
+}
+
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+ struct ckpt_hdr_file *h;
+ struct epoll_deferq_entry dq_entry;
+ int ret = -ENOMEM;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+ if (!h)
+ return -ENOMEM;
+ h->f_type = CKPT_FILE_EPOLL;
+ ret = checkpoint_file_common(ctx, file, h);
+ if (ret < 0)
+ goto out;
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ /*
+ * Defer saving the epoll items until all of the ffd.file pointers
+ * have an objref; after the file table has been checkpointed.
+ */
+ dq_entry.ctx = ctx;
+ dq_entry.epfile = file;
+ ret = deferqueue_add(ctx->files_deferq, &dq_entry,
+ sizeof(dq_entry), ep_items_checkpoint, NULL);
+out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+static int ep_items_restore(void *data)
+{
+ struct ckpt_ctx *ctx = deferqueue_data_ptr(data);
+ struct ckpt_hdr_eventpoll_items *h;
+ struct ckpt_eventpoll_item *items = NULL;
+ struct eventpoll *ep;
+ struct file *epfile = NULL;
+ int ret, num_items, i = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+ num_items = h->num_items;
+ epfile = ckpt_obj_fetch(ctx, h->epfile_objref, CKPT_OBJ_FILE);
+ ckpt_hdr_put(ctx, h);
+
+ /* Make sure userspace didn't give us a ref to a non-epoll file. */
+ if (IS_ERR(epfile))
+ return PTR_ERR(epfile);
+ if (!is_file_epoll(epfile))
+ return -EINVAL;
+ if (!num_items)
+ return 0;
+
+ ret = ckpt_read_payload(ctx, (void**)&items, num_items*sizeof(*items),
+ CKPT_HDR_BUFFER);
+ if (!items)
+ return -ENOMEM;
+
+ /* Make sure the items match the size we expect */
+ if (num_items != (ret / sizeof(*items))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ep = epfile->private_data;
+
+ /* Restore the epoll items/watches */
+ for (ret = 0, i = 0; !ret && i < num_items; i++) {
+ struct epoll_event epev;
+ struct file *tfile;
+
+ tfile = ckpt_obj_fetch(ctx, items[i].file_objref,
+ CKPT_OBJ_FILE);
+ if (IS_ERR(tfile)) {
+ ret = PTR_ERR(tfile);
+ break;
+ }
+ epev.events = items[i].events;
+ epev.data = items[i].data;
+ ret = do_epoll_ctl(EPOLL_CTL_ADD, items[i].fd,
+ epfile, tfile, &epev);
+ }
+out:
+ kfree(items);
+ return ret;
+}
+
+struct file* ep_file_restore(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *h)
+{
+ struct file *epfile;
+ int epfd, ret;
+
+ if (h->h.type != CKPT_HDR_FILE ||
+ h->h.len != sizeof(*h) ||
+ h->f_type != CKPT_FILE_EPOLL)
+ return ERR_PTR(-EINVAL);
+
+ epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
+ if (epfd < 0)
+ return ERR_PTR(epfd);
+ epfile = fget(epfd);
+ sys_close(epfd); /* harmless even if an error occured */
+ BUG_ON(!epfile);
+
+ /*
+ * Needed before we can properly restore the watches and enforce the
+ * limit on watch numbers.
+ */
+ ret = restore_file_common(ctx, epfile, h);
+ if (ret < 0)
+ goto fput_out;
+
+ /*
+ * Defer restoring the epoll items until the file table is
+ * fully restored. Ensures that valid file objrefs will resolve.
+ */
+ ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
+ if (ret < 0) {
+fput_out:
+ fput(epfile);
+ epfile = ERR_PTR(ret);
+ }
+ return epfile;
+}
+
+#endif /* CONFIG_CHECKPOINT */
+
static int __init eventpoll_init(void)
{
struct sysinfo si;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ca2500d..1a3edab 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -119,6 +119,8 @@ enum {
#define CKPT_HDR_TTY CKPT_HDR_TTY
CKPT_HDR_TTY_LDISC,
#define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
+ CKPT_HDR_EPOLL_ITEMS = 391, /* Follows file-table */
+#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS
CKPT_HDR_MM = 401,
#define CKPT_HDR_MM CKPT_HDR_MM
@@ -469,6 +471,8 @@ enum file_type {
#define CKPT_FILE_SOCKET CKPT_FILE_SOCKET
CKPT_FILE_TTY,
#define CKPT_FILE_TTY CKPT_FILE_TTY
+ CKPT_FILE_EPOLL,
+#define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
CKPT_FILE_MAX
#define CKPT_FILE_MAX CKPT_FILE_MAX
};
@@ -573,6 +577,20 @@ struct ckpt_hdr_file_socket {
__s32 sock_objref;
} __attribute__((aligned(8)));
+struct ckpt_hdr_eventpoll_items {
+ struct ckpt_hdr h;
+ __s32 epfile_objref;
+ __u32 num_items;
+} __attribute__((aligned(8)));
+
+/* Contained in a CKPT_HDR_BUFFER following the ckpt_hdr_eventpoll_items */
+struct ckpt_eventpoll_item {
+ __u64 data;
+ __u32 fd;
+ __s32 file_objref;
+ __u32 events;
+} __attribute__((aligned(8)));
+
/* memory layout */
struct ckpt_hdr_mm {
struct ckpt_hdr h;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f6856a5..34538be 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -56,6 +56,9 @@ struct file;
#ifdef CONFIG_EPOLL
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
/* Used to initialize the epoll bits inside the "struct file" */
static inline void eventpoll_init_file(struct file *file)
@@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file)
eventpoll_release_file(file);
}
-#else
+#ifdef CONFIG_CHECKPOINT
+extern struct file* ep_file_restore(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *h);
+#endif
+#else
+/* !defined(CONFIG_EPOLL) */
static inline void eventpoll_init_file(struct file *file) {}
static inline void eventpoll_release(struct file *file) {}
+#ifdef CONFIG_CHECKPOINT
+static inline struct file* ep_file_restore(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *ptr)
+{
+ return ERR_PTR(-ENOSYS);
+}
+#endif
#endif
#endif /* #ifdef __KERNEL__ */
--
1.5.6.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-19 17:04 ` Matt Helsley
[not found] ` <d0fd1f3eb4eaa326488f59955e5b4790080f3073.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 17:04 ` [PATCH 3/3] epoll: Add support for restoring many " Matt Helsley
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Matt Helsley @ 2009-10-19 17:04 UTC (permalink / raw)
To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Currently we allocate memory to output all of the epoll items in one
big chunk. At 20 bytes per item, and since epoll was designed to
support on the order of 10,000 items, we may find ourselves kmalloc'ing
200,000 bytes. That's an order 7 allocation whereas the heuristic for
difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
Instead, output the epoll header and items separately. Chunk the output
much like the pid array gets chunked. This ensures that even sub-order 0
allocations will enable checkpoint of large epoll sets. A subsequent
patch will do something similar for the restore path.
Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
fs/eventpoll.c | 71 ++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 46 insertions(+), 25 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4706ec5..2506b40 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1480,7 +1480,7 @@ static int ep_items_checkpoint(void *data)
struct rb_node *rbp;
struct eventpoll *ep;
__s32 epfile_objref;
- int i, num_items, ret;
+ int num_items = 0, nchunk, ret;
ctx = dq_entry->ctx;
@@ -1489,9 +1489,8 @@ static int ep_items_checkpoint(void *data)
ep = dq_entry->epfile->private_data;
mutex_lock(&ep->mtx);
- for (i = 0, rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), i++) {}
+ for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), num_items++) {}
mutex_unlock(&ep->mtx);
- num_items = i;
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
if (!h)
@@ -1503,36 +1502,58 @@ static int ep_items_checkpoint(void *data)
if (ret || !num_items)
return ret;
- items = kzalloc(sizeof(*items)*num_items, GFP_KERNEL);
+ ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
+ CKPT_HDR_BUFFER);
+ if (ret < 0)
+ return ret;
+
+ nchunk = num_items;
+ do {
+ items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
+ if (items)
+ break;
+ nchunk = nchunk >> 1;
+ } while (nchunk > 0);
if (!items)
return -ENOMEM;
+
+ /*
+ * Walk the rbtree copying items into the chunk of memory and then
+ * writing them to the checkpoint image
+ */
ret = 0;
- i = 0;
mutex_lock(&ep->mtx);
- for (rbp = rb_first(&ep->rbr); i < num_items && rbp; rbp = rb_next(rbp),
- i++) {
- struct epitem *epi;
- int objref;
-
- epi = rb_entry(rbp, struct epitem, rbn);
- items[i].fd = epi->ffd.fd;
- items[i].events = epi->event.events;
- items[i].data = epi->event.data;
- objref = ckpt_obj_lookup(ctx, epi->ffd.file, CKPT_OBJ_FILE);
- if (objref <= 0) {
- ret = -EBUSY; /* missing item -- checkpoint obj leak */
- break;
+ rbp = rb_first(&ep->rbr);
+ while ((num_items > 0) && rbp) {
+ int n = min(num_items, nchunk);
+ int j;
+
+ for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) {
+ struct epitem *epi;
+ int objref;
+
+ epi = rb_entry(rbp, struct epitem, rbn);
+ items[j].fd = epi->ffd.fd;
+ items[j].events = epi->event.events;
+ items[j].data = epi->event.data;
+ objref = ckpt_obj_lookup(ctx, epi->ffd.file,
+ CKPT_OBJ_FILE);
+ if (objref <= 0)
+ goto unlock;
+ items[j].file_objref = objref;
}
- items[i].file_objref = objref;
+ ret = ckpt_kwrite(ctx, items, n*sizeof(*items));
+ if (ret < 0)
+ break;
+ num_items -= n;
}
+unlock:
mutex_unlock(&ep->mtx);
- if (i == num_items && rbp)
- ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
- if (!ret)
- ret = ckpt_write_buffer(ctx, items, sizeof(*items)*num_items);
- else
- ckpt_write_err(ctx, "E", "checkpoint leak detected.\n", ret);
kfree(items);
+ if (num_items != 0 || (num_items == 0 && rbp))
+ ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
+ if (ret)
+ ckpt_write_err(ctx, "E", " checkpointing epoll items.\n", ret);
return ret;
}
--
1.5.6.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/3] epoll: Add support for restoring many epoll items
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 17:04 ` [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items Matt Helsley
@ 2009-10-19 17:04 ` Matt Helsley
[not found] ` <8e4344b801150b95cd54f2d09b660525601de256.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 0:31 ` [PATCH 1/3] Checkpoint/restart epoll sets Serge E. Hallyn
2009-10-23 23:41 ` Oren Laadan
3 siblings, 1 reply; 16+ messages in thread
From: Matt Helsley @ 2009-10-19 17:04 UTC (permalink / raw)
To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
This completes the work necessary to make checkpoint/restart of
thousands of epoll items more reliable when higher order kmallocs
would fail. We grab a piece of memory suitable to store a "chunk"
of items for input. Read the input one chunk at a time and add
epoll items for each item in the chunk.
Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
fs/eventpoll.c | 58 +++++++++++++++++++++++++++++++++++--------------------
1 files changed, 37 insertions(+), 21 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2506b40..c261263 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1594,7 +1594,7 @@ static int ep_items_restore(void *data)
struct ckpt_eventpoll_item *items = NULL;
struct eventpoll *ep;
struct file *epfile = NULL;
- int ret, num_items, i = 0;
+ int ret, num_items, nchunk;
h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
if (IS_ERR(h))
@@ -1611,34 +1611,50 @@ static int ep_items_restore(void *data)
if (!num_items)
return 0;
- ret = ckpt_read_payload(ctx, (void**)&items, num_items*sizeof(*items),
- CKPT_HDR_BUFFER);
+ ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+ if (ret < 0)
+ return ret;
+ /* Make sure the items match the size we expect */
+ if (num_items != (ret / sizeof(*items)))
+ return -EINVAL;
+
+ nchunk = num_items;
+ do {
+ items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
+ if (items)
+ break;
+ nchunk = nchunk/2;
+ } while (nchunk > 0);
if (!items)
return -ENOMEM;
- /* Make sure the items match the size we expect */
- if (num_items != (ret / sizeof(*items))) {
- ret = -EINVAL;
- goto out;
- }
-
ep = epfile->private_data;
- /* Restore the epoll items/watches */
- for (ret = 0, i = 0; !ret && i < num_items; i++) {
- struct epoll_event epev;
- struct file *tfile;
+ while (num_items > 0) {
+ int n = min(num_items, nchunk);
+ int j;
- tfile = ckpt_obj_fetch(ctx, items[i].file_objref,
- CKPT_OBJ_FILE);
- if (IS_ERR(tfile)) {
- ret = PTR_ERR(tfile);
+ ret = ckpt_kread(ctx, items, n*sizeof(*items));
+ if (ret < 0)
break;
+
+ /* Restore the epoll items/watches */
+ for (j = 0; !ret && j < n; j++) {
+ struct epoll_event epev;
+ struct file *tfile;
+
+ tfile = ckpt_obj_fetch(ctx, items[j].file_objref,
+ CKPT_OBJ_FILE);
+ if (IS_ERR(tfile)) {
+ ret = PTR_ERR(tfile);
+ goto out;
+ }
+ epev.events = items[j].events;
+ epev.data = items[j].data;
+ ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd,
+ epfile, tfile, &epev);
}
- epev.events = items[i].events;
- epev.data = items[i].data;
- ret = do_epoll_ctl(EPOLL_CTL_ADD, items[i].fd,
- epfile, tfile, &epev);
+ num_items -= n;
}
out:
kfree(items);
--
1.5.6.3
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Checkpoint/restart epoll sets
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 17:04 ` [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items Matt Helsley
2009-10-19 17:04 ` [PATCH 3/3] epoll: Add support for restoring many " Matt Helsley
@ 2009-10-21 0:31 ` Serge E. Hallyn
[not found] ` <20091021003128.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23 23:41 ` Oren Laadan
3 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 0:31 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> @@ -1226,35 +1242,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
> * the eventpoll file that enables the insertion/removal/change of
> * file descriptors inside the interest set.
> */
> -SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> - struct epoll_event __user *, event)
> +int do_epoll_ctl(int op, int fd,
> + struct file *file, struct file *tfile,
> + struct epoll_event *epds)
> {
> int error;
> - struct file *file, *tfile;
> struct eventpoll *ep;
> struct epitem *epi;
> - struct epoll_event epds;
> -
> - error = -EFAULT;
> - if (ep_op_has_event(op) &&
> - copy_from_user(&epds, event, sizeof(struct epoll_event)))
> - goto error_return;
> -
> - /* Get the "struct file *" for the eventpoll file */
> - error = -EBADF;
> - file = fget(epfd);
> - if (!file)
> - goto error_return;
> -
> - /* Get the "struct file *" for the target file */
> - tfile = fget(fd);
> - if (!tfile)
> - goto error_fput;
>
> /* The target file descriptor must support poll */
> error = -EPERM;
> if (!tfile->f_op || !tfile->f_op->poll)
> - goto error_tgt_fput;
> + return error;
>
> /*
> * We have to check that the file structure underneath the file descriptor
> @@ -1263,7 +1262,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> */
> error = -EINVAL;
> if (file == tfile || !is_file_epoll(file))
> - goto error_tgt_fput;
> + return error;
>
> /*
> * At this point it is safe to assume that the "private_data" contains
> @@ -1284,8 +1283,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> switch (op) {
> case EPOLL_CTL_ADD:
> if (!epi) {
> - epds.events |= POLLERR | POLLHUP;
> - error = ep_insert(ep, &epds, tfile, fd);
> + epds->events |= POLLERR | POLLHUP;
> + error = ep_insert(ep, epds, tfile, fd);
> } else
> error = -EEXIST;
> break;
> @@ -1297,15 +1296,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> break;
> case EPOLL_CTL_MOD:
> if (epi) {
> - epds.events |= POLLERR | POLLHUP;
> - error = ep_modify(ep, epi, &epds);
> + epds->events |= POLLERR | POLLHUP;
> + error = ep_modify(ep, epi, epds);
> } else
> error = -ENOENT;
> break;
> }
> mutex_unlock(&ep->mtx);
>
> -error_tgt_fput:
> + return error;
> +}
> +
> +/*
> + * The following function implements the controller interface for
> + * the eventpoll file that enables the insertion/removal/change of
> + * file descriptors inside the interest set.
> + */
> +SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
> + struct epoll_event __user *, event)
> +{
> + int error;
> + struct file *file, *tfile;
> + struct epoll_event epds;
> +
> + error = -EFAULT;
> + if (ep_op_has_event(op) &&
> + copy_from_user(&epds, event, sizeof(struct epoll_event)))
> + goto error_return;
> +
> + /* Get the "struct file *" for the eventpoll file */
> + error = -EBADF;
> + file = fget(epfd);
> + if (!file)
> + goto error_return;
> +
> + /* Get the "struct file *" for the target file */
> + tfile = fget(fd);
> + if (!tfile)
> + goto error_fput;
> +
> + error = do_epoll_ctl(op, fd, file, tfile, &epds);
> fput(tfile);
> error_fput:
> fput(file);
(Just figured I'd do a sanity check of this code) looks ok to me
...
> +struct file* ep_file_restore(struct ckpt_ctx *ctx,
> + struct ckpt_hdr_file *h)
> +{
> + struct file *epfile;
> + int epfd, ret;
> +
> + if (h->h.type != CKPT_HDR_FILE ||
> + h->h.len != sizeof(*h) ||
> + h->f_type != CKPT_FILE_EPOLL)
> + return ERR_PTR(-EINVAL);
> +
> + epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
> + if (epfd < 0)
> + return ERR_PTR(epfd);
> + epfile = fget(epfd);
> + sys_close(epfd); /* harmless even if an error occured */
> + BUG_ON(!epfile);
Would perhaps return ERR_PTR(-ENOENT) be nicer? (And maybe safer - I'm
not quite clear on under which arches BUG_ON does nothing).
> +
> + /*
> + * Needed before we can properly restore the watches and enforce the
> + * limit on watch numbers.
> + */
> + ret = restore_file_common(ctx, epfile, h);
> + if (ret < 0)
> + goto fput_out;
> +
> + /*
> + * Defer restoring the epoll items until the file table is
> + * fully restored. Ensures that valid file objrefs will resolve.
> + */
> + ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
> + if (ret < 0) {
> +fput_out:
> + fput(epfile);
> + epfile = ERR_PTR(ret);
> + }
> + return epfile;
> +}
> +
> +#endif /* CONFIG_CHECKPOINT */
> +
> static int __init eventpoll_init(void)
> {
> struct sysinfo si;
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index ca2500d..1a3edab 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -119,6 +119,8 @@ enum {
> #define CKPT_HDR_TTY CKPT_HDR_TTY
> CKPT_HDR_TTY_LDISC,
> #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
> + CKPT_HDR_EPOLL_ITEMS = 391, /* Follows file-table */
What is the comment supposed to mean (other than that such
comments inevitably become stale :)?
-serge
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <d0fd1f3eb4eaa326488f59955e5b4790080f3073.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 14:59 ` Serge E. Hallyn
[not found] ` <20091021145950.GA13327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-23 23:51 ` Oren Laadan
2009-10-23 23:58 ` Oren Laadan
2 siblings, 1 reply; 16+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 14:59 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Currently we allocate memory to output all of the epoll items in one
> big chunk. At 20 bytes per item, and since epoll was designed to
> support on the order of 10,000 items, we may find ourselves kmalloc'ing
> 200,000 bytes. That's an order 7 allocation whereas the heuristic for
> difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
>
> Instead, output the epoll header and items separately. Chunk the output
> much like the pid array gets chunked. This ensures that even sub-order 0
> allocations will enable checkpoint of large epoll sets. A subsequent
> patch will do something similar for the restore path.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Feels a bit auto-tune-magic-happy :) but looks good
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
> fs/eventpoll.c | 71 ++++++++++++++++++++++++++++++++++++-------------------
> 1 files changed, 46 insertions(+), 25 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 4706ec5..2506b40 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1480,7 +1480,7 @@ static int ep_items_checkpoint(void *data)
> struct rb_node *rbp;
> struct eventpoll *ep;
> __s32 epfile_objref;
> - int i, num_items, ret;
> + int num_items = 0, nchunk, ret;
>
> ctx = dq_entry->ctx;
>
> @@ -1489,9 +1489,8 @@ static int ep_items_checkpoint(void *data)
>
> ep = dq_entry->epfile->private_data;
> mutex_lock(&ep->mtx);
> - for (i = 0, rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), i++) {}
> + for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), num_items++) {}
> mutex_unlock(&ep->mtx);
> - num_items = i;
>
> h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
> if (!h)
> @@ -1503,36 +1502,58 @@ static int ep_items_checkpoint(void *data)
> if (ret || !num_items)
> return ret;
>
> - items = kzalloc(sizeof(*items)*num_items, GFP_KERNEL);
> + ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
> + CKPT_HDR_BUFFER);
> + if (ret < 0)
> + return ret;
> +
> + nchunk = num_items;
> + do {
> + items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
> + if (items)
> + break;
> + nchunk = nchunk >> 1;
> + } while (nchunk > 0);
> if (!items)
> return -ENOMEM;
> +
> + /*
> + * Walk the rbtree copying items into the chunk of memory and then
> + * writing them to the checkpoint image
> + */
> ret = 0;
> - i = 0;
> mutex_lock(&ep->mtx);
> - for (rbp = rb_first(&ep->rbr); i < num_items && rbp; rbp = rb_next(rbp),
> - i++) {
> - struct epitem *epi;
> - int objref;
> -
> - epi = rb_entry(rbp, struct epitem, rbn);
> - items[i].fd = epi->ffd.fd;
> - items[i].events = epi->event.events;
> - items[i].data = epi->event.data;
> - objref = ckpt_obj_lookup(ctx, epi->ffd.file, CKPT_OBJ_FILE);
> - if (objref <= 0) {
> - ret = -EBUSY; /* missing item -- checkpoint obj leak */
> - break;
> + rbp = rb_first(&ep->rbr);
> + while ((num_items > 0) && rbp) {
> + int n = min(num_items, nchunk);
> + int j;
> +
> + for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) {
> + struct epitem *epi;
> + int objref;
> +
> + epi = rb_entry(rbp, struct epitem, rbn);
> + items[j].fd = epi->ffd.fd;
> + items[j].events = epi->event.events;
> + items[j].data = epi->event.data;
> + objref = ckpt_obj_lookup(ctx, epi->ffd.file,
> + CKPT_OBJ_FILE);
> + if (objref <= 0)
> + goto unlock;
> + items[j].file_objref = objref;
> }
> - items[i].file_objref = objref;
> + ret = ckpt_kwrite(ctx, items, n*sizeof(*items));
> + if (ret < 0)
> + break;
> + num_items -= n;
> }
> +unlock:
> mutex_unlock(&ep->mtx);
> - if (i == num_items && rbp)
> - ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
> - if (!ret)
> - ret = ckpt_write_buffer(ctx, items, sizeof(*items)*num_items);
> - else
> - ckpt_write_err(ctx, "E", "checkpoint leak detected.\n", ret);
> kfree(items);
> + if (num_items != 0 || (num_items == 0 && rbp))
> + ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
> + if (ret)
> + ckpt_write_err(ctx, "E", " checkpointing epoll items.\n", ret);
> return ret;
> }
>
> --
> 1.5.6.3
>
>
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] epoll: Add support for restoring many epoll items
[not found] ` <8e4344b801150b95cd54f2d09b660525601de256.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 15:09 ` Serge E. Hallyn
2009-10-23 23:56 ` Oren Laadan
1 sibling, 0 replies; 16+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 15:09 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> This completes the work necessary to make checkpoint/restart of
> thousands of epoll items more reliable when higher order kmallocs
> would fail. We grab a piece of memory suitable to store a "chunk"
> of items for input. Read the input one chunk at a time and add
> epoll items for each item in the chunk.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
> fs/eventpoll.c | 58 +++++++++++++++++++++++++++++++++++--------------------
> 1 files changed, 37 insertions(+), 21 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 2506b40..c261263 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1594,7 +1594,7 @@ static int ep_items_restore(void *data)
> struct ckpt_eventpoll_item *items = NULL;
> struct eventpoll *ep;
> struct file *epfile = NULL;
> - int ret, num_items, i = 0;
> + int ret, num_items, nchunk;
>
> h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
> if (IS_ERR(h))
> @@ -1611,34 +1611,50 @@ static int ep_items_restore(void *data)
> if (!num_items)
> return 0;
>
> - ret = ckpt_read_payload(ctx, (void**)&items, num_items*sizeof(*items),
> - CKPT_HDR_BUFFER);
> + ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
> + if (ret < 0)
> + return ret;
> + /* Make sure the items match the size we expect */
> + if (num_items != (ret / sizeof(*items)))
> + return -EINVAL;
> +
> + nchunk = num_items;
> + do {
> + items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
> + if (items)
> + break;
> + nchunk = nchunk/2;
> + } while (nchunk > 0);
> if (!items)
> return -ENOMEM;
>
> - /* Make sure the items match the size we expect */
> - if (num_items != (ret / sizeof(*items))) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> ep = epfile->private_data;
>
> - /* Restore the epoll items/watches */
> - for (ret = 0, i = 0; !ret && i < num_items; i++) {
> - struct epoll_event epev;
> - struct file *tfile;
> + while (num_items > 0) {
> + int n = min(num_items, nchunk);
> + int j;
>
> - tfile = ckpt_obj_fetch(ctx, items[i].file_objref,
> - CKPT_OBJ_FILE);
> - if (IS_ERR(tfile)) {
> - ret = PTR_ERR(tfile);
> + ret = ckpt_kread(ctx, items, n*sizeof(*items));
> + if (ret < 0)
> break;
> +
> + /* Restore the epoll items/watches */
> + for (j = 0; !ret && j < n; j++) {
> + struct epoll_event epev;
> + struct file *tfile;
> +
> + tfile = ckpt_obj_fetch(ctx, items[j].file_objref,
> + CKPT_OBJ_FILE);
> + if (IS_ERR(tfile)) {
> + ret = PTR_ERR(tfile);
> + goto out;
> + }
> + epev.events = items[j].events;
> + epev.data = items[j].data;
> + ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd,
> + epfile, tfile, &epev);
> }
> - epev.events = items[i].events;
> - epev.data = items[i].data;
> - ret = do_epoll_ctl(EPOLL_CTL_ADD, items[i].fd,
> - epfile, tfile, &epev);
> + num_items -= n;
> }
> out:
> kfree(items);
> --
> 1.5.6.3
>
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Checkpoint/restart epoll sets
[not found] ` <20091021003128.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-22 6:29 ` Matt Helsley
[not found] ` <20091022062909.GF7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-23 23:30 ` Oren Laadan
1 sibling, 1 reply; 16+ messages in thread
From: Matt Helsley @ 2009-10-22 6:29 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Tue, Oct 20, 2009 at 07:31:28PM -0500, Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
<snip>
> > +struct file* ep_file_restore(struct ckpt_ctx *ctx,
> > + struct ckpt_hdr_file *h)
> > +{
> > + struct file *epfile;
> > + int epfd, ret;
> > +
> > + if (h->h.type != CKPT_HDR_FILE ||
> > + h->h.len != sizeof(*h) ||
> > + h->f_type != CKPT_FILE_EPOLL)
> > + return ERR_PTR(-EINVAL);
> > +
> > + epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
> > + if (epfd < 0)
> > + return ERR_PTR(epfd);
> > + epfile = fget(epfd);
> > + sys_close(epfd); /* harmless even if an error occured */
> > + BUG_ON(!epfile);
>
> Would perhaps return ERR_PTR(-ENOENT) be nicer? (And maybe safer - I'm
> not quite clear on under which arches BUG_ON does nothing).
OK, good idea. Mind if I post it as a separate patch to be merged with
this series?
>
> > +
> > + /*
> > + * Needed before we can properly restore the watches and enforce the
> > + * limit on watch numbers.
> > + */
> > + ret = restore_file_common(ctx, epfile, h);
> > + if (ret < 0)
> > + goto fput_out;
> > +
> > + /*
> > + * Defer restoring the epoll items until the file table is
> > + * fully restored. Ensures that valid file objrefs will resolve.
> > + */
> > + ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
> > + if (ret < 0) {
> > +fput_out:
> > + fput(epfile);
> > + epfile = ERR_PTR(ret);
> > + }
> > + return epfile;
> > +}
> > +
> > +#endif /* CONFIG_CHECKPOINT */
> > +
> > static int __init eventpoll_init(void)
> > {
> > struct sysinfo si;
> > diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> > index ca2500d..1a3edab 100644
> > --- a/include/linux/checkpoint_hdr.h
> > +++ b/include/linux/checkpoint_hdr.h
> > @@ -119,6 +119,8 @@ enum {
> > #define CKPT_HDR_TTY CKPT_HDR_TTY
> > CKPT_HDR_TTY_LDISC,
> > #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
> > + CKPT_HDR_EPOLL_ITEMS = 391, /* Follows file-table */
>
> What is the comment supposed to mean (other than that such
> comments inevitably become stale :)?
It means that these items must be present some point after the
file-table in order for the checkpoint image to restart. That's
why I chose a number near 400 -- it looked like [300-400) were set
aside for file objects.
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <20091021145950.GA13327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-22 6:40 ` Matt Helsley
[not found] ` <20091022064007.GG7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Matt Helsley @ 2009-10-22 6:40 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Wed, Oct 21, 2009 at 09:59:50AM -0500, Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > Currently we allocate memory to output all of the epoll items in one
> > big chunk. At 20 bytes per item, and since epoll was designed to
> > support on the order of 10,000 items, we may find ourselves kmalloc'ing
> > 200,000 bytes. That's an order 7 allocation whereas the heuristic for
> > difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
> >
> > Instead, output the epoll header and items separately. Chunk the output
> > much like the pid array gets chunked. This ensures that even sub-order 0
> > allocations will enable checkpoint of large epoll sets. A subsequent
> > patch will do something similar for the restore path.
> >
> > Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>
> Feels a bit auto-tune-magic-happy :) but looks good
Well it's not magic compared to guessing what a good number would be.
There can be lots of these items and I figured that writing them in
the biggest chunks possible could be useful.
>
> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
<snip>
For anyone who's curious, it's only 7 additional lines (++ below).
> > ++ int nchunk;
...
> > ++ nchunk = num_items;
> > ++ do {
> > + items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
> > ++ if (items)
> > ++ break;
> > ++ nchunk = nchunk >> 1;
> > ++ } while (nchunk > 0);
> > if (!items)
> > return -ENOMEM;
<snip>
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Checkpoint/restart epoll sets
[not found] ` <20091022062909.GF7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-22 14:02 ` Serge E. Hallyn
0 siblings, 0 replies; 16+ messages in thread
From: Serge E. Hallyn @ 2009-10-22 14:02 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> On Tue, Oct 20, 2009 at 07:31:28PM -0500, Serge E. Hallyn wrote:
> > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
>
> <snip>
>
> > > +struct file* ep_file_restore(struct ckpt_ctx *ctx,
> > > + struct ckpt_hdr_file *h)
> > > +{
> > > + struct file *epfile;
> > > + int epfd, ret;
> > > +
> > > + if (h->h.type != CKPT_HDR_FILE ||
> > > + h->h.len != sizeof(*h) ||
> > > + h->f_type != CKPT_FILE_EPOLL)
> > > + return ERR_PTR(-EINVAL);
> > > +
> > > + epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
> > > + if (epfd < 0)
> > > + return ERR_PTR(epfd);
> > > + epfile = fget(epfd);
> > > + sys_close(epfd); /* harmless even if an error occured */
> > > + BUG_ON(!epfile);
> >
> > Would perhaps return ERR_PTR(-ENOENT) be nicer? (And maybe safer - I'm
> > not quite clear on under which arches BUG_ON does nothing).
>
> OK, good idea. Mind if I post it as a separate patch to be merged with
> this series?
Haha, yeah, pls don't repost the whole patch with a two-line change :)
-serge
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Checkpoint/restart epoll sets
[not found] ` <20091021003128.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-22 6:29 ` Matt Helsley
@ 2009-10-23 23:30 ` Oren Laadan
1 sibling, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:30 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
>> @@ -1226,35 +1242,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>> * the eventpoll file that enables the insertion/removal/change of
>> * file descriptors inside the interest set.
[...]
>> + if (h->h.type != CKPT_HDR_FILE ||
>> + h->h.len != sizeof(*h) ||
>> + h->f_type != CKPT_FILE_EPOLL)
>> + return ERR_PTR(-EINVAL);
>> +
>> + epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
>> + if (epfd < 0)
>> + return ERR_PTR(epfd);
>> + epfile = fget(epfd);
>> + sys_close(epfd); /* harmless even if an error occured */
>> + BUG_ON(!epfile);
>
> Would perhaps return ERR_PTR(-ENOENT) be nicer? (And maybe safer - I'm
> not quite clear on under which arches BUG_ON does nothing).
Serge is right: malicious userspace could fork the restarting tasks
to all share fdtable with a non-restarting task, and that other task
could close the fd ...
I'll write a patch that ensures that the root task doesn't share
anything with its parent (coordinator).
But the race still exists for self-restart. So I'd do -EBUSY here
instead.
Oren.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Checkpoint/restart epoll sets
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2009-10-21 0:31 ` [PATCH 1/3] Checkpoint/restart epoll sets Serge E. Hallyn
@ 2009-10-23 23:41 ` Oren Laadan
3 siblings, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:41 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Matt Helsley wrote:
> Save/restore epoll items during checkpoint/restart respectively.
> kmalloc failures should be dealt with more kindly than just error-out
> because epoll is made to poll many thousands of file descriptors.
> Subsequent patches will change epoll c/r to "chunk" its output/input
> respectively.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
See one comment below (fixed).
Also fixed is the BUG_ON pointed out by Serge.
Oren.
>
> Changelog:
>
> v5:
> Fix potential recursion during collect.
> Replace call to ckpt_obj_collect() with ckpt_collect_file().
> [Oren]
> Fix checkpoint leak detection when there are more items than
> expected.
> Cleanup/simplify error write paths. (will complicate in a later
> patch) [Oren]
> Remove files_deferq bits. [Oren]
> Remove extra newline. [Oren]
> Remove aggregate check on number of watches added. [Oren]
> This is OK since these will be done individually anyway.
> Remove check for negative objrefs during restart. [Oren]
> Fixup comment regarding race that indicates checkpoint leaks.
> [Oren]
> s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
> Patch for lots of epoll items follows.
> Moved sys_close(epfd) right under fget(). [Oren]
> Use CKPT_HDR_BUFFER rather than custome ckpt_read/write_*
> This makes it more similar to the pid array code. [Oren]
> It also simplifies the error recovery paths.
> Tested polling a pipe and 50,000 UNIX sockets.
>
> v4: ckpt-v18
> Use files_deferq as submitted by Dan Smith
> Cleanup to only report >= 1 items when debugging.
>
> v3: [unposted]
> Removed most of the TODOs -- the remainder will be removed by
> subsequent patches.
> Fixed missing ep_file_collect() [Serge]
> Rather than include checkpoint_hdr.h declare (but do not define)
> the two structs needed in eventpoll.h [Oren]
> Complain with ckpt_write_err() when we detect checkpoint obj
> leaks. [Oren]
> Remove redundant is_epoll_file() check in collect. [Oren]
> Move epfile_objref lookup to simplify error handling. [Oren]
> Simplify error handling with early return in
> ep_eventpoll_checkpoint(). [Oren]
> Cleaned up a comment. [Oren]
> Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
> Renumbered to indicate that it follows the file table.
> Renamed the epoll struct in checkpoint_hdr.h [Oren]
> Also renamed substruct.
> Fixup return of empty ep_file_restore(). [Oren]
> Changed some error returns. [Oren]
> Changed some tests to BUG_ON(). [Oren]
> Factored out watch insert with epoll_ctl() into do_epoll_ctl().
> [Cedric, Oren]
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
> checkpoint/files.c | 8 +
> fs/eventpoll.c | 308 ++++++++++++++++++++++++++++++++++++----
> include/linux/checkpoint_hdr.h | 18 +++
> include/linux/eventpoll.h | 17 ++-
> 4 files changed, 322 insertions(+), 29 deletions(-)
>
> diff --git a/checkpoint/files.c b/checkpoint/files.c
> index f6de07e..6ea2389 100644
> --- a/checkpoint/files.c
> +++ b/checkpoint/files.c
> @@ -22,6 +22,7 @@
> #include <linux/deferqueue.h>
> #include <linux/checkpoint.h>
> #include <linux/checkpoint_hdr.h>
> +#include <linux/eventpoll.h>
> #include <net/sock.h>
>
>
> @@ -607,6 +608,13 @@ static struct restore_file_ops restore_file_ops[] = {
> .file_type = CKPT_FILE_TTY,
> .restore = tty_file_restore,
> },
> +#ifdef CONFIG_EPOLL
> + {
> + .file_name = "EPOLL",
> + .file_type = CKPT_FILE_EPOLL,
> + .restore = ep_file_restore,
> + },
> +#endif
There must be an entry here even if CONFIG_EPOLL isn't selected. And
indeed you define ep_file_restore() for this case to return -ENOSYS
(toward the end of this patch). So I dropped the #ifdef/#endif pair.
[...]
> diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
> index f6856a5..34538be 100644
> --- a/include/linux/eventpoll.h
> +++ b/include/linux/eventpoll.h
> @@ -56,6 +56,9 @@ struct file;
>
>
> #ifdef CONFIG_EPOLL
> +struct ckpt_ctx;
> +struct ckpt_hdr_file;
> +
>
> /* Used to initialize the epoll bits inside the "struct file" */
> static inline void eventpoll_init_file(struct file *file)
> @@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file)
> eventpoll_release_file(file);
> }
>
> -#else
>
> +#ifdef CONFIG_CHECKPOINT
> +extern struct file* ep_file_restore(struct ckpt_ctx *ctx,
> + struct ckpt_hdr_file *h);
> +#endif
> +#else
> +/* !defined(CONFIG_EPOLL) */
> static inline void eventpoll_init_file(struct file *file) {}
> static inline void eventpoll_release(struct file *file) {}
>
> +#ifdef CONFIG_CHECKPOINT
> +static inline struct file* ep_file_restore(struct ckpt_ctx *ctx,
> + struct ckpt_hdr_file *ptr)
> +{
> + return ERR_PTR(-ENOSYS);
> +}
> +#endif
> #endif
>
> #endif /* #ifdef __KERNEL__ */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <d0fd1f3eb4eaa326488f59955e5b4790080f3073.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 14:59 ` Serge E. Hallyn
@ 2009-10-23 23:51 ` Oren Laadan
2009-10-23 23:58 ` Oren Laadan
2 siblings, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:51 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Matt Helsley wrote:
> Currently we allocate memory to output all of the epoll items in one
> big chunk. At 20 bytes per item, and since epoll was designed to
> support on the order of 10,000 items, we may find ourselves kmalloc'ing
> 200,000 bytes. That's an order 7 allocation whereas the heuristic for
> difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
>
> Instead, output the epoll header and items separately. Chunk the output
> much like the pid array gets chunked. This ensures that even sub-order 0
> allocations will enable checkpoint of large epoll sets. A subsequent
> patch will do something similar for the restore path.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
> fs/eventpoll.c | 71 ++++++++++++++++++++++++++++++++++++-------------------
> 1 files changed, 46 insertions(+), 25 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 4706ec5..2506b40 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1480,7 +1480,7 @@ static int ep_items_checkpoint(void *data)
> struct rb_node *rbp;
> struct eventpoll *ep;
> __s32 epfile_objref;
> - int i, num_items, ret;
> + int num_items = 0, nchunk, ret;
>
> ctx = dq_entry->ctx;
>
> @@ -1489,9 +1489,8 @@ static int ep_items_checkpoint(void *data)
>
> ep = dq_entry->epfile->private_data;
> mutex_lock(&ep->mtx);
> - for (i = 0, rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), i++) {}
> + for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), num_items++) {}
> mutex_unlock(&ep->mtx);
> - num_items = i;
>
> h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
> if (!h)
> @@ -1503,36 +1502,58 @@ static int ep_items_checkpoint(void *data)
> if (ret || !num_items)
> return ret;
>
> - items = kzalloc(sizeof(*items)*num_items, GFP_KERNEL);
> + ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
> + CKPT_HDR_BUFFER);
> + if (ret < 0)
> + return ret;
> +
> + nchunk = num_items;
> + do {
> + items = kzalloc(sizeof(*items)*nchunk, GFP_KERNEL);
> + if (items)
> + break;
> + nchunk = nchunk >> 1;
> + } while (nchunk > 0);
An allocation may or may not succeed for num_items; however, it if
does succeed, it may unnecessarily fragment the memory.
So I wonder if it's simpler to set the chunk size to 1-2 pages, like
in the pids code ?
The other advantage is that if we eventually optimize by allocating
a generic buffer for the c/r (e.g. ctx->buffer), we could easily
reuse it here.
[...]
Oren.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <20091022064007.GG7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2009-10-23 23:54 ` Oren Laadan
0 siblings, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:54 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Matt Helsley wrote:
> On Wed, Oct 21, 2009 at 09:59:50AM -0500, Serge E. Hallyn wrote:
>> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
>>> Currently we allocate memory to output all of the epoll items in one
>>> big chunk. At 20 bytes per item, and since epoll was designed to
>>> support on the order of 10,000 items, we may find ourselves kmalloc'ing
>>> 200,000 bytes. That's an order 7 allocation whereas the heuristic for
>>> difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
>>>
>>> Instead, output the epoll header and items separately. Chunk the output
>>> much like the pid array gets chunked. This ensures that even sub-order 0
>>> allocations will enable checkpoint of large epoll sets. A subsequent
>>> patch will do something similar for the restore path.
>>>
>>> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>> Feels a bit auto-tune-magic-happy :) but looks good
>
> Well it's not magic compared to guessing what a good number would be.
> There can be lots of these items and I figured that writing them in
> the biggest chunks possible could be useful.
What qualifies as a "good" number ? Performance wise, I suspect
there isn't much difference between 4K, 8K and above buffers in
practice, compared to the total checkpoint time.
Oren.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] epoll: Add support for restoring many epoll items
[not found] ` <8e4344b801150b95cd54f2d09b660525601de256.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 15:09 ` Serge E. Hallyn
@ 2009-10-23 23:56 ` Oren Laadan
1 sibling, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:56 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Matt Helsley wrote:
> This completes the work necessary to make checkpoint/restart of
> thousands of epoll items more reliable when higher order kmallocs
> would fail. We grab a piece of memory suitable to store a "chunk"
> of items for input. Read the input one chunk at a time and add
> epoll items for each item in the chunk.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
As with patch 2/3, I prefer constant chunk size.
Otherwise:
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <d0fd1f3eb4eaa326488f59955e5b4790080f3073.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 14:59 ` Serge E. Hallyn
2009-10-23 23:51 ` Oren Laadan
@ 2009-10-23 23:58 ` Oren Laadan
[not found] ` <4AE24340.9030203-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2 siblings, 1 reply; 16+ messages in thread
From: Oren Laadan @ 2009-10-23 23:58 UTC (permalink / raw)
To: Matt Helsley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Matt Helsley wrote:
> Currently we allocate memory to output all of the epoll items in one
> big chunk. At 20 bytes per item, and since epoll was designed to
> support on the order of 10,000 items, we may find ourselves kmalloc'ing
> 200,000 bytes. That's an order 7 allocation whereas the heuristic for
> difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
>
> Instead, output the epoll header and items separately. Chunk the output
> much like the pid array gets chunked. This ensures that even sub-order 0
> allocations will enable checkpoint of large epoll sets. A subsequent
> patch will do something similar for the restore path.
>
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
BTW, In the future, please use checkpatch.pl before sending;
Otherwise eventually I get yelled at ... :p
I'll fix this patchset as I pull.
I'll also change the auto-tune-magic to fixed chunk, unless
somebody screams.
Oren.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items
[not found] ` <4AE24340.9030203-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-24 4:32 ` Matt Helsley
0 siblings, 0 replies; 16+ messages in thread
From: Matt Helsley @ 2009-10-24 4:32 UTC (permalink / raw)
To: Oren Laadan
Cc: Andy Whitcroft,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Fri, Oct 23, 2009 at 07:58:56PM -0400, Oren Laadan wrote:
>
>
> Matt Helsley wrote:
> > Currently we allocate memory to output all of the epoll items in one
> > big chunk. At 20 bytes per item, and since epoll was designed to
> > support on the order of 10,000 items, we may find ourselves kmalloc'ing
> > 200,000 bytes. That's an order 7 allocation whereas the heuristic for
> > difficult allocations, PAGE_ALLOC_COST_ORDER, is 3.
> >
> > Instead, output the epoll header and items separately. Chunk the output
> > much like the pid array gets chunked. This ensures that even sub-order 0
> > allocations will enable checkpoint of large epoll sets. A subsequent
> > patch will do something similar for the restore path.
> >
> > Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>
> Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
>
> BTW, In the future, please use checkpatch.pl before sending;
OK, I'll add it to my script.
> Otherwise eventually I get yelled at ... :p
Hmm, I was suprised this triggered anything from checkpatch.
Some of the checkpatch.pl complaints look OK (omitted). Others don't
seem right.
WARNING: braces {} are not necessary for single statement blocks
#296: FILE: fs/eventpoll.c:1492:
+ for (i = 0, rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp),
i++) {}
This looks like a checkpatch bug. It's a zero statement block. The
braces prevent the compiler from putting the next statement in the loop body.
Cc'ing Andy Whitcroft.
ERROR: "(foo**)" should be "(foo **)"
#397: FILE: fs/eventpoll.c:1593:
+ ret = ckpt_read_payload(ctx, (void**)&items,
num_items*sizeof(*items),
Now that just seems dumb. ) is not an identifier so the "foo *bar" pattern
which seems to inspire this pattern does not apply. The only spaces in
typecasts should be like (struct foo*) IMHO. This looks weird since typecast
is a unary operator and there should be no space between a unary
operator and its argument. Imagine:
if (! x)
do_something()
This looks just as absurd to me:
(foo **)x Or worse: (foo **) x
Also CondingStyle says nothing about this from what I could see. So it
shouldn't be an ERROR IMHO. Ignoring this one in protest.
WARNING: line over 80 characters
#462: FILE: fs/eventpoll.c:1658:
+ ret = deferqueue_add_ptr(ctx->files_deferq, ctx,
ep_items_restore, NULL);
Over by one. My understanding is it's a warning for cases like this.
Ignoring.
...
total: 5 errors, 2 warnings, 457 lines checked
./checkpatchme.patch has style problems, please review. If any of these
errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Done.
> I'll also change the auto-tune-magic to fixed chunk, unless
> somebody screams.
*grumble* OK.
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-10-24 4:32 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-19 17:04 [PATCH 1/3] Checkpoint/restart epoll sets Matt Helsley
[not found] ` <ce2e15faf44e254b80578c6c62e71d8685516896.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-19 17:04 ` [PATCH 2/3] epoll: Add support for checkpointing large numbers of epoll items Matt Helsley
[not found] ` <d0fd1f3eb4eaa326488f59955e5b4790080f3073.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 14:59 ` Serge E. Hallyn
[not found] ` <20091021145950.GA13327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-22 6:40 ` Matt Helsley
[not found] ` <20091022064007.GG7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-23 23:54 ` Oren Laadan
2009-10-23 23:51 ` Oren Laadan
2009-10-23 23:58 ` Oren Laadan
[not found] ` <4AE24340.9030203-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-24 4:32 ` Matt Helsley
2009-10-19 17:04 ` [PATCH 3/3] epoll: Add support for restoring many " Matt Helsley
[not found] ` <8e4344b801150b95cd54f2d09b660525601de256.1255971848.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-21 15:09 ` Serge E. Hallyn
2009-10-23 23:56 ` Oren Laadan
2009-10-21 0:31 ` [PATCH 1/3] Checkpoint/restart epoll sets Serge E. Hallyn
[not found] ` <20091021003128.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-22 6:29 ` Matt Helsley
[not found] ` <20091022062909.GF7757-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2009-10-22 14:02 ` Serge E. Hallyn
2009-10-23 23:30 ` Oren Laadan
2009-10-23 23:41 ` Oren Laadan
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.