From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
To: Matt Helsley <matthltc@us.ibm.com>
Cc: containers@lists.linux-foundation.org,
Eric Sandeen <sandeen@redhat.com>,
"Theodore Ts'o" <tytso@mit.edu>,
Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>,
Miklos Szeredi <miklos@szeredi.hu>,
Jamie Lokier <jamie@shareable.org>,
Amir Goldstein <amir73il@users.sf.net>,
Christoph Hellwig <hch@infradead.org>,
Andreas Dilger <adilger.kernel@dilger.ca>,
linux-fsdevel@vger.kernel.org, Jan Kara <jack@suse.cz>,
linux-ext4@vger.kernel.org, Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 5/6] [RFC] Checkpoint/restart unlinked files
Date: Fri, 22 Oct 2010 16:43:44 -0700 [thread overview]
Message-ID: <20101022234344.GA23663@us.ibm.com> (raw)
In-Reply-To: <7da8a050193a91b69ecb3899ce2eda541ecd2473.1285278339.git.matthltc@us.ibm.com>
Matt Helsley [matthltc@us.ibm.com] wrote:
<snip>
| To understand why relinking is extremely useful for checkpoint/restart
| consider this simple pseudocode program and a specific example checkpoint
| of it:
I can see how relinking the file simplifies C/R :-) But patch 2 indicates
not all filesystems can support relink. Hope they aren't too many of those.
|
| a_fd = open("a"); /* example: size of the file at "a" is 1GB */
| link("a", "b");
| unlink("a");
| creat("a");
| <---- example: checkpoint happens here
| write(a_fd, "bar");
|
| The file "a" is unlinked and a different file has been placed at that
| path. a_fd still refers to the inode shared with "b".
|
| Without relinking we would need to walk the entire filesystem to find out
| that "b" is a path to the same inode
You may want to mention here that to checkpoint/restart a file, we save/
restore the pathname. So finding a path for the unliked file 'a' would
require walking the entire filesystem to find any alias.
| (another variation on this case: "b"
| would also have been unlinked). We'd need to do this for every
| unlinked file that remains open in every task to checkpoint. Even then
| there is no guarantee such a "b" exists for every unlinked file -- the
| inodes could be "orphans" -- and we'd need to preserve their contents
| some other way.
|
| I considered a couple alternatives to preserving unlinked file contents:
s/couple/couple of/
| copying and file handles. Each has significant drawbacks.
|
| First I attempted to copy the file contents into the image and then
| recreate and unlink the file during restart. Using a simple version of
| that method the write above would not reach "b". One fix would be to search
| the filesystem for a file with the same inode number (inode of "b") and
| either open it or hardlink it to "a". Another would be to record the inode
| number. This either shifts the search from checkpoint time to restart time
| or has all the drawbacks of the second method I considered: file handles.
|
| Instead of copying contents or recording inodes I also considered using
| file handles. We'd need to ensure that the filehandles persist in storage,
| can be snapshotted/backed up, and can be migrated. Can handlefs or any
| generic file handle system do this? My _guess_ is "no" but folks are
| welcome to tell me I'm wrong.
|
| In contrast, linking the file from a_fd back into its filesystem can avoid
| these complexities. Relinking avoids the search for matching inodes and
| copying large quantities of data from storage only to write it back (in
| fact the data would be read-and-written twice -- once for checkpoint and
| once for restart). Like file handles it does require changes to the
| filesystem code. Unlike file handles, enabling relinking does not require
| every filesystem to support a new kind of filesystem "object" -- only
| an operation that is quite similar to one that already exists: link.
|
| Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
| Cc: Eric Sandeen <sandeen@redhat.com>
| Cc: Theodore Ts'o <tytso@mit.edu>
| Cc: Andreas Dilger <adilger.kernel@dilger.ca>
| Cc: linux-ext4@vger.kernel.org
| Cc: Jan Kara <jack@suse.cz>
| Cc: containers@lists.linux-foundation.org
| Cc: Oren Laadan <orenl@cs.columbia.edu>
| Cc: linux-fsdevel@vger.kernel.org
| Cc: Al Viro <viro@zeniv.linux.org.uk>
| Cc: Christoph Hellwig <hch@infradead.org>
| Cc: Jamie Lokier <jamie@shareable.org>
| Cc: Amir Goldstein <amir73il@users.sf.net>
| Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
| Cc: Miklos Szeredi <miklos@szeredi.hu>
| ---
| fs/checkpoint.c | 51 ++++++++++++++-----
| fs/namei.c | 102 ++++++++++++++++++++++++++++++++++++++
| fs/pipe.c | 2 +-
| include/linux/checkpoint.h | 3 +-
| include/linux/checkpoint_hdr.h | 3 +
| include/linux/checkpoint_types.h | 3 +
| 6 files changed, 149 insertions(+), 15 deletions(-)
|
| diff --git a/fs/checkpoint.c b/fs/checkpoint.c
| index 87d7c6e..9c7caec 100644
| --- a/fs/checkpoint.c
| +++ b/fs/checkpoint.c
| @@ -16,6 +16,7 @@
| #include <linux/sched.h>
| #include <linux/file.h>
| #include <linux/namei.h>
| +#include <linux/mount.h>
| #include <linux/fs_struct.h>
| #include <linux/fs.h>
| #include <linux/fdtable.h>
| @@ -26,6 +27,7 @@
| #include <linux/checkpoint.h>
| #include <linux/eventpoll.h>
| #include <linux/eventfd.h>
| +#include <linux/sys-wrapper.h>
| #include <net/sock.h>
|
| /**************************************************************************
| @@ -174,6 +176,9 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
| h->f_pos = file->f_pos;
| h->f_version = file->f_version;
|
| + if (d_unlinked(file->f_dentry))
| + /* Perform post-checkpoint and post-restart unlink() */
| + h->f_restart_flags |= RESTART_FILE_F_UNLINK;
| h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
| if (h->f_credref < 0)
| return h->f_credref;
| @@ -197,16 +202,6 @@ int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
| struct ckpt_hdr_file_generic *h;
| int ret;
|
| - /*
| - * FIXME: when we'll add support for unlinked files/dirs, we'll
| - * need to distinguish between unlinked filed and unlinked dirs.
| - */
| - if (d_unlinked(file->f_dentry)) {
| - ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
| - file);
| - return -EBADF;
| - }
| -
| h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
| if (!h)
| return -ENOMEM;
| @@ -220,6 +215,9 @@ int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
| if (ret < 0)
| goto out;
| ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
Hmm, what file name will be checkpointed here, if the file is unlinked ?
| + if (ret < 0)
| + goto out;
| + ret = checkpoint_file_links(ctx, file);
| out:
| ckpt_hdr_put(ctx, h);
| return ret;
| @@ -570,9 +568,11 @@ static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname)
| /**
| * restore_open_fname - read a file name and open a file
| * @ctx: checkpoint context
| + * @restore_unlinked: unlink the opened file
| * @flags: file flags
| */
| -struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
| +struct file *restore_open_fname(struct ckpt_ctx *ctx,
| + int restore_unlinked, int flags)
nit: s/restore_unlinked/unlinked/ ?
| {
| struct file *file;
| char *fname;
| @@ -586,8 +586,33 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
| if (len < 0)
| return ERR_PTR(len);
| ckpt_debug("fname '%s' flags %#x\n", fname, flags);
| -
| + if (restore_unlinked) {
| + kfree(fname);
| + fname = NULL;
| + len = ckpt_read_payload(ctx, (void **)&fname, PATH_MAX,
| + CKPT_HDR_BUFFER);
Hmm, is there a reason we need a special way to read the file name for
unlinked files ? After re-linking the file during checkpoint, can we
not treat it like any other open file (except for the flag) ?
| + if (len < 0)
| + return ERR_PTR(len);
| + fname[len] = '\0';
| + }
| file = filp_open(fname, flags, 0);
| + if (IS_ERR(file)) {
| + ckpt_err(ctx, PTR_ERR(file), "Could not open file \"%s\"\n", fname);
| +
| + goto out;
| + }
| + if (!restore_unlinked)
| + goto out;
| + if (S_ISDIR(file->f_mapping->host->i_mode))
| + len = kernel_sys_rmdir(fname);
| + else
| + len = kernel_sys_unlink(fname);
| + if (len < 0) {
| + ckpt_err(ctx, len, "Could not unlink \"%s\"\n", fname);
| + fput(file);
| + file = ERR_PTR(len);
| + }
nit: how about moving this unlink block to a smaller function ?
| +out:
| kfree(fname);
|
| return file;
| @@ -692,7 +717,7 @@ static struct file *generic_file_restore(struct ckpt_ctx *ctx,
| ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
| return ERR_PTR(-EINVAL);
|
| - file = restore_open_fname(ctx, ptr->f_flags);
| + file = restore_open_fname(ctx, !!(ptr->f_restart_flags & RESTART_FILE_F_UNLINK), ptr->f_flags);
nit: long line
| if (IS_ERR(file))
| return file;
|
| diff --git a/fs/namei.c b/fs/namei.c
| index 8c9663d..69c4f4e 100644
| --- a/fs/namei.c
| +++ b/fs/namei.c
| @@ -32,6 +32,9 @@
| #include <linux/fcntl.h>
| #include <linux/device_cgroup.h>
| #include <linux/fs_struct.h>
| +#ifdef CONFIG_CHECKPOINT
| +#include <linux/checkpoint.h>
| +#endif
| #include <asm/uaccess.h>
|
| #include "internal.h"
| @@ -2543,6 +2546,105 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
| return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
| }
|
| +#ifdef CONFIG_CHECKPOINT
| +
| +/* Path relative to the mounted filesystem's root -- not a "global" root or even a namespace root. The unique_name_count is unique for the entire checkpoint. */
| +#define CKPT_RELINKAT_FMT "lost+found/checkpoint/unlink-at-restart-%08llx-%u"
| +
| +static int checkpoint_fill_relink_fname(struct ckpt_ctx *ctx,
nit. since it is a static function, we could probably drop the 'checkpoint_'
prefix in the name ?
| + struct file *for_file,
| + char relink_dir_pathname[PATH_MAX],
| + int *lenp)
| +{
| + struct path relink_dir_path;
nit. since the function name has "relink", maybe variable names can skip
(code is easier to read with smaller variable names).
| + char *tmp;
| + int len;
| +
| + /* Find path to mount */
| + relink_dir_path.mnt = for_file->f_path.mnt;
| + relink_dir_path.dentry = relink_dir_path.mnt->mnt_root;
| + tmp = d_path(&relink_dir_path, relink_dir_pathname, PATH_MAX);
| + if (IS_ERR(tmp))
| + return PTR_ERR(tmp);
| +
| + /* Append path to relinked file. */
| + len = strlen(tmp);
| + if (len <= 0)
| + return -ENOENT;
| + memmove(relink_dir_pathname, tmp, len);
| + tmp = relink_dir_pathname + len - 1;
| + /* Ensure we've got a single dir separator */
| + if (*tmp == '/')
| + tmp++;
| + else {
| + tmp++;
we could simplify the 'if-else' by making the tmp++ unconditional (or by
removing the -1 above).
| + *tmp = '/';
| + tmp++;
| + len++;
| + }
| + len += snprintf(tmp, PATH_MAX - len, CKPT_RELINKAT_FMT,
| + ctx->ktime_begin.tv64,
| + ++ctx->unique_name_count);
Since the format is dependent on additional parameters (tv64, unique_name_count)
any changes to the format will require updates in multiple places in the future
right ? That would make the CKPT_RELINKAT_FMT macro less useful.
Instead how about a function like this that could be used during both checkpoint
and restart:
static inline int generate_relinked_path(ctx, buf, len)
{
return sprintf(...);
}
| + relink_dir_pathname[len] = '\0';
| + *lenp = len;
| + return 0;
| +}
| +
| +static int checkpoint_file_relink(struct ckpt_ctx *ctx,
| + struct file *file,
| + char new_path[PATH_MAX])
| +{
| + int ret, len;
| +
| + /*
| + * Relinking arbitrary files without searching a path
| + * (which non-existent if the file is unlinked) requires
s/which/which is/
s/file is/file was/
| + * special privileges.
| + */
| + if (!capable(CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH)) {
| + ckpt_err(ctx, -EPERM, "%(T)Relinking unlinked files requires CAP_DAC_{OVERRIDE,READ_SEARCH}\n");
nit: long line
| + return -EPERM;
| + }
nit: a blank line here might help
| + ret = checkpoint_fill_relink_fname(ctx, file, new_path, &len);
| + if (ret)
| + return ret;
| + ret = do_kern_linkat(&file->f_path, file->f_dentry,
| + AT_FDCWD, new_path, 0);
| + if (ret)
| + ckpt_err(ctx, ret, "%(T)%(P)%(V)Failed to relink unlinked file.\n", file, file->f_op);
nit: long line
| + return ret;
| +}
| +
| +int checkpoint_file_links(struct ckpt_ctx *ctx, struct file *file)
| +{
| + char *new_link_path;
| + int ret, len;
| +
| + if (!d_unlinked(file->f_dentry))
| + return 0;
| +
| + /*
| + * Unlinked files need at least one hardlink for the post-sys_checkpoint
| + * filesystem backup/snapshot.
| + */
| + new_link_path = kmalloc(PATH_MAX, GFP_KERNEL);
| + if (!new_link_path)
| + return -ENOMEM;
| + ret = checkpoint_file_relink(ctx, file, new_link_path);
| + if (ret < 0)
| + goto out_free;
| + len = strlen(new_link_path);
| + ret = ckpt_write_obj_type(ctx, NULL, len + 1, CKPT_HDR_BUFFER);
| + if (ret < 0)
| + goto out_free;
| + ret = ckpt_kwrite(ctx, new_link_path, len + 1);
| +out_free:
| + kfree(new_link_path);
| +
| + return ret;
| +}
nit: some blank lines separating the different sections of the function will
help readability
Sukadev
next prev parent reply other threads:[~2010-10-22 23:43 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1285278812-16972-1-git-send-email-matthltc@us.ibm.com>
[not found] ` <6987185123220ec2034677299859c5a63eaf2aed.1285278339.git.matthltc@us.ibm.com>
2010-09-23 21:53 ` [PATCH 2/6] [RFC] Create the .relink file_operation Matt Helsley
2010-09-23 21:53 ` [PATCH 3/6] [RFC] ext3/4: Allow relinking to unlinked files Matt Helsley
2010-09-23 21:53 ` [PATCH 4/6] [RFC] Split do_linkat() out of sys_linkat Matt Helsley
2010-09-23 21:53 ` [PATCH 5/6] [RFC] Checkpoint/restart unlinked files Matt Helsley
2010-09-29 22:22 ` Oren Laadan
2010-09-30 1:17 ` Matt Helsley
2010-09-30 15:45 ` Oren Laadan
2010-10-22 23:43 ` Sukadev Bhattiprolu [this message]
[not found] ` <9eade1f1ec10c23dae296feda8af9fe87085e843.1285278339.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-09-29 20:28 ` [PATCH 4/6] [RFC] Split do_linkat() out of sys_linkat Oren Laadan
[not found] ` <d91e3660266d3a2956c4d1aebc9cb618081b21ef.1285278339.git.matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-09-26 19:08 ` [PATCH 2/6] [RFC] Create the .relink file_operation Brad Boyer
2010-09-27 19:16 ` Matt Helsley
2010-09-27 22:03 ` Brad Boyer
2010-09-29 20:16 ` Oren Laadan
2010-09-29 20:19 ` Oren Laadan
2010-09-29 23:07 ` Matt Helsley
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101022234344.GA23663@us.ibm.com \
--to=sukadev@linux.vnet.ibm.com \
--cc=adilger.kernel@dilger.ca \
--cc=amir73il@users.sf.net \
--cc=aneesh.kumar@linux.vnet.ibm.com \
--cc=containers@lists.linux-foundation.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=jamie@shareable.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=matthltc@us.ibm.com \
--cc=miklos@szeredi.hu \
--cc=sandeen@redhat.com \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).