* [PATCH 1/9][cr][v2]: Add uid, euid params to f_modown()
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 2/9][cr][v2]: Add uid, euid params to __f_setown() Sukadev Bhattiprolu
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Checkpoint/restart of file-owner.
Add uid, euid parameters to f_modown(). These parameters will be needed
when restarting an application (and hence restoring the file information),
from a checkpoint image.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/fcntl.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2079af0..aeab1f4 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -197,7 +197,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
}
static void f_modown(struct file *filp, struct pid *pid, enum pid_type type,
- int force)
+ uid_t uid, uid_t euid, int force)
{
write_lock_irq(&filp->f_owner.lock);
if (force || !filp->f_owner.pid) {
@@ -206,9 +206,8 @@ static void f_modown(struct file *filp, struct pid *pid, enum pid_type type,
filp->f_owner.pid_type = type;
if (pid) {
- const struct cred *cred = current_cred();
- filp->f_owner.uid = cred->uid;
- filp->f_owner.euid = cred->euid;
+ filp->f_owner.uid = uid;
+ filp->f_owner.euid = euid;
}
}
write_unlock_irq(&filp->f_owner.lock);
@@ -223,7 +222,7 @@ int __f_setown(struct file *filp, struct pid *pid, enum pid_type type,
if (err)
return err;
- f_modown(filp, pid, type, force);
+ f_modown(filp, pid, type, current_uid(), current_euid(), force);
return 0;
}
EXPORT_SYMBOL(__f_setown);
@@ -249,7 +248,7 @@ EXPORT_SYMBOL(f_setown);
void f_delown(struct file *filp)
{
- f_modown(filp, NULL, PIDTYPE_PID, 1);
+ f_modown(filp, NULL, PIDTYPE_PID, current_uid(), current_euid(), 1);
}
pid_t f_getown(struct file *filp)
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 2/9][cr][v2]: Add uid, euid params to __f_setown()
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 1/9][cr][v2]: Add uid, euid params to f_modown() Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 3/9][cr][v2]: Checkpoint file-owner information Sukadev Bhattiprolu
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Instead of letting __f_setown() use the UID and EUID of the calling process,
pass them in as parameters.
This modified interface will be useful when checkpointing and restarting an
application that has a 'file owner' specified for an open file. When
checkpointing the application, the UID and EUID of the process setting up
the owner are saved in the checkpoint image.
When the application is restarted, we use the UID and EUID values saved in
the checkpoint-image, rather than that of the calling process.
Changelog[v2]:
- [Matthew Wilcox] Rather than a new __f_setown_uid() interface add
the uid parameters to __f_setown() itself.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
drivers/char/tty_io.c | 3 ++-
drivers/net/tun.c | 3 ++-
fs/fcntl.c | 10 ++++++----
fs/locks.c | 3 ++-
fs/notify/dnotify/dnotify.c | 3 ++-
include/linux/fs.h | 3 ++-
6 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index d264000..3f2f115 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -1968,7 +1968,8 @@ static int tty_fasync(int fd, struct file *filp, int on)
}
get_pid(pid);
spin_unlock_irqrestore(&tty->ctrl_lock, flags);
- retval = __f_setown(filp, pid, type, 0);
+ retval = __f_setown(filp, pid, type, current_uid(),
+ current_euid(), 0);
put_pid(pid);
if (retval)
goto out;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4326520..dcbc37d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1400,7 +1400,8 @@ static int tun_chr_fasync(int fd, struct file *file, int on)
goto out;
if (on) {
- ret = __f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+ ret = __f_setown(file, task_pid(current), PIDTYPE_PID,
+ current_uid(), current_euid(), 0);
if (ret)
goto out;
tun->flags |= TUN_FASYNC;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index aeab1f4..f44327d 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -214,7 +214,7 @@ static void f_modown(struct file *filp, struct pid *pid, enum pid_type type,
}
int __f_setown(struct file *filp, struct pid *pid, enum pid_type type,
- int force)
+ uid_t uid, uid_t euid, int force)
{
int err;
@@ -222,7 +222,7 @@ int __f_setown(struct file *filp, struct pid *pid, enum pid_type type,
if (err)
return err;
- f_modown(filp, pid, type, current_uid(), current_euid(), force);
+ f_modown(filp, pid, type, uid, euid, force);
return 0;
}
EXPORT_SYMBOL(__f_setown);
@@ -240,7 +240,8 @@ int f_setown(struct file *filp, unsigned long arg, int force)
}
rcu_read_lock();
pid = find_vpid(who);
- result = __f_setown(filp, pid, type, force);
+ result = __f_setown(filp, pid, type, current_uid(), current_euid(),
+ force);
rcu_read_unlock();
return result;
}
@@ -296,7 +297,8 @@ static int f_setown_ex(struct file *filp, unsigned long arg)
if (owner.pid && !pid)
ret = -ESRCH;
else
- ret = __f_setown(filp, pid, type, 1);
+ ret = __f_setown(filp, pid, type, current_uid(),
+ current_euid(), 1);
rcu_read_unlock();
return ret;
diff --git a/fs/locks.c b/fs/locks.c
index 9cd859e..ca0c7e2 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1514,7 +1514,8 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
goto out_unlock;
}
- error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+ error = __f_setown(filp, task_pid(current), PIDTYPE_PID, current_uid(),
+ current_euid(), 0);
out_unlock:
unlock_kernel();
return error;
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 0a63bf6..3e025e5 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -409,7 +409,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
goto out;
}
- error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+ error = __f_setown(filp, task_pid(current), PIDTYPE_PID, current_uid(),
+ current_euid(), 0);
if (error) {
/* if we added, we must shoot */
if (dnentry == new_dnentry)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ee725ff..b4a6fb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1304,7 +1304,8 @@ extern void kill_fasync(struct fasync_struct **, int, int);
/* only for net: no internal synchronization */
extern void __kill_fasync(struct fasync_struct *, int, int);
-extern int __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
+extern int __f_setown(struct file *filp, struct pid *pid, enum pid_type type,
+ uid_t uid, uid_t euid, int force);
extern int f_setown(struct file *filp, unsigned long arg, int force);
extern void f_delown(struct file *filp);
extern pid_t f_getown(struct file *filp);
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/9][cr][v2]: Checkpoint file-owner information
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 1/9][cr][v2]: Add uid, euid params to f_modown() Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 2/9][cr][v2]: Add uid, euid params to __f_setown() Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 4/9][cr][v2]: Restore file_owner info Sukadev Bhattiprolu
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Checkpoint the file->f_owner information for an open file. This
information will be used to restore the file-owner information
when the application is restarted from the checkpoint.
The file->f_owner information is "private" to each 'struct file' i.e.
fown_struct is not an external object shared with other file structures.
So the information can directly be added to the 'ckpt_hdr_file' object.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/checkpoint.c | 27 +++++++++++----------------
include/linux/checkpoint_hdr.h | 5 +++++
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index e036a7a..0fa4ce8 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -168,12 +168,19 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
struct ckpt_hdr_file *h)
{
struct cred *f_cred = (struct cred *) file->f_cred;
+ struct pid *pid = file->f_owner.pid;
h->f_flags = file->f_flags;
h->f_mode = file->f_mode;
h->f_pos = file->f_pos;
h->f_version = file->f_version;
+ h->f_owner_pid = pid_nr_ns(pid, ns_of_pid(pid));
+ h->f_owner_pid_type = file->f_owner.pid_type;
+ h->f_owner_uid = file->f_owner.uid;
+ h->f_owner_euid = file->f_owner.euid;
+ h->f_owner_signum = file->f_owner.signum;
+
h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
if (h->f_credref < 0)
return h->f_credref;
@@ -184,10 +191,10 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
return h->f_secref;
}
- ckpt_debug("file %s credref %d secref %d\n",
- file->f_dentry->d_name.name, h->f_credref, h->f_secref);
-
- /* FIX: need also file->f_owner, etc */
+ ckpt_debug("file %s credref %d secref %d, fowner-pid %d, type %d, "
+ "fowner-signum %d\n", file->f_dentry->d_name.name,
+ h->f_credref, h->f_secref, h->f_owner_pid,
+ h->f_owner_pid_type, h->f_owner_signum);
return 0;
}
@@ -267,7 +274,6 @@ static int checkpoint_file_desc(struct ckpt_ctx *ctx,
struct fdtable *fdt;
int objref, ret;
int coe = 0; /* avoid gcc warning */
- pid_t pid;
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
if (!h)
@@ -302,17 +308,6 @@ static int checkpoint_file_desc(struct ckpt_ctx *ctx,
}
/*
- * TODO: Implement c/r of fowner and f_sigio. Should be
- * trivial, but for now we just refuse its checkpoint
- */
- pid = f_getown(file);
- if (pid) {
- ret = -EBUSY;
- ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd);
- goto out;
- }
-
- /*
* if seen first time, this will add 'file' to the objhash, keep
* a reference to it, dump its state while at it.
*/
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 790214f..44e2a0d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -570,6 +570,11 @@ struct ckpt_hdr_file {
__u64 f_pos;
__u64 f_version;
__s32 f_secref;
+ __s32 f_owner_pid;
+ __u32 f_owner_pid_type;
+ __u32 f_owner_uid;
+ __u32 f_owner_euid;
+ __s32 f_owner_signum;
} __attribute__((aligned(8)));
struct ckpt_hdr_file_generic {
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 4/9][cr][v2]: Restore file_owner info
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (2 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 3/9][cr][v2]: Checkpoint file-owner information Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-06-15 4:05 ` Oren Laadan
2010-05-19 3:07 ` [PATCH 5/9][cr][v2]: Move file_lock macros into linux/fs.h Sukadev Bhattiprolu
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Restore the file-owner information for each 'struct file'. This is
essentially is like a new fcntl(F_SETOWN) and fcntl(F_SETSIG) calls,
except that the pid, uid, euid and signum values are read from the
checkpoint image.
Changelog[v2]:
- [Matt Helsley, Serge Hallyn]: Don't trust uids in checkpoint image.
(added CAP_KILL check)
- Check that signal number read from the checkpoint image is valid.
(not sure it is required, since its an incomplete check for tampering)
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/checkpoint.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 0fa4ce8..e82f4f1 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -615,6 +615,57 @@ static int attach_file(struct file *file)
return fd;
}
+static int restore_file_owner(struct ckpt_ctx *ctx, struct ckpt_hdr_file *h,
+ struct file *file)
+{
+ int ret;
+ struct pid *pid;
+ uid_t uid, euid;
+
+ uid = h->f_owner_uid;
+ euid = h->f_owner_euid;
+
+ ckpt_debug("restore_file_owner(): uid %u, euid %u, pid %d, type %d\n",
+ uid, euid, h->f_owner_pid, h->f_owner_pid_type);
+ /*
+ * We can't trust the uids in the checkpoint image and normally need
+ * CAP_KILL. But if the uids match our ids, should be fine since we
+ * have access to the file.
+ *
+ * TODO: Move this check to __f_setown() ?
+ */
+ ret = -EACCES;
+ if (!capable(CAP_KILL) &&
+ (uid != current_uid() || euid != current_euid())) {
+ ckpt_err(ctx, ret, "image uids [%d, %d] don't match current "
+ "process uids [%d, %d] and no CAP_KILL\n",
+ uid, euid, current_uid(), current_euid());
+ return ret;
+ }
+
+ ret = -EINVAL;
+ if (!valid_signal(h->f_owner_signum)) {
+ ckpt_err(ctx, ret, "Invalid signum %d\n", h->f_owner_signum);
+ return ret;
+ }
+ file->f_owner.signum = h->f_owner_signum;
+
+ rcu_read_lock();
+ pid = find_vpid(h->f_owner_pid);
+ /*
+ * TODO: Do we need to force==1 or can it be 0 ? 'force' is used to
+ * modify the owner, if one is already set. Can it be set when
+ * we restart an application ?
+ */
+ ret = __f_setown(file, pid, h->f_owner_pid_type, uid, euid, 1);
+ rcu_read_unlock();
+
+ if (ret < 0)
+ ckpt_err(ctx, ret, "__fsetown_uid() failed\n");
+
+ return ret;
+}
+
#define CKPT_SETFL_MASK \
(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
@@ -648,6 +699,10 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
if (ret < 0)
return ret;
+ ret = restore_file_owner(ctx, h, file);
+ if (ret < 0)
+ return ret;
+
/*
* Normally f_mode is set by open, and modified only via
* fcntl(), so its value now should match that at checkpoint.
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 4/9][cr][v2]: Restore file_owner info
2010-05-19 3:07 ` [PATCH 4/9][cr][v2]: Restore file_owner info Sukadev Bhattiprolu
@ 2010-06-15 4:05 ` Oren Laadan
2010-07-28 19:25 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 22+ messages in thread
From: Oren Laadan @ 2010-06-15 4:05 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
On 05/18/2010 11:07 PM, Sukadev Bhattiprolu wrote:
> Restore the file-owner information for each 'struct file'. This is
> essentially is like a new fcntl(F_SETOWN) and fcntl(F_SETSIG) calls,
> except that the pid, uid, euid and signum values are read from the
> checkpoint image.
>
> Changelog[v2]:
> - [Matt Helsley, Serge Hallyn]: Don't trust uids in checkpoint image.
> (added CAP_KILL check)
> - Check that signal number read from the checkpoint image is valid.
> (not sure it is required, since its an incomplete check for tampering)
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> ---
[...]
>
> +static int restore_file_owner(struct ckpt_ctx *ctx, struct ckpt_hdr_file *h,
> + struct file *file)
> +{
> + int ret;
> + struct pid *pid;
> + uid_t uid, euid;
> +
> + uid = h->f_owner_uid;
> + euid = h->f_owner_euid;
> +
> + ckpt_debug("restore_file_owner(): uid %u, euid %u, pid %d, type %d\n",
> + uid, euid, h->f_owner_pid, h->f_owner_pid_type);
> + /*
> + * We can't trust the uids in the checkpoint image and normally need
> + * CAP_KILL. But if the uids match our ids, should be fine since we
> + * have access to the file.
> + *
> + * TODO: Move this check to __f_setown() ?
> + */
> + ret = -EACCES;
> + if (!capable(CAP_KILL) &&
> + (uid != current_uid() || euid != current_euid())) {
> + ckpt_err(ctx, ret, "image uids [%d, %d] don't match current "
> + "process uids [%d, %d] and no CAP_KILL\n",
> + uid, euid, current_uid(), current_euid());
> + return ret;
> + }
> +
> + ret = -EINVAL;
> + if (!valid_signal(h->f_owner_signum)) {
> + ckpt_err(ctx, ret, "Invalid signum %d\n", h->f_owner_signum);
> + return ret;
> + }
> + file->f_owner.signum = h->f_owner_signum;
> +
> + rcu_read_lock();
> + pid = find_vpid(h->f_owner_pid);
What if this fails - the pid is invalid/non-existent ?
> + /*
> + * TODO: Do we need to force==1 or can it be 0 ? 'force' is used to
> + * modify the owner, if one is already set. Can it be set when
> + * we restart an application ?
> + */
> + ret = __f_setown(file, pid, h->f_owner_pid_type, uid, euid, 1);
> + rcu_read_unlock();
I wonder if this would be a problem in terms of security on a
non-container restart (e.g. not in a new pid-ns): one could set
any pid as owner and any signal to be sent, and cause an arbitrary
signal to be sent to an arbitrary process ?
> +
> + if (ret < 0)
> + ckpt_err(ctx, ret, "__fsetown_uid() failed\n");
> +
> + return ret;
> +}
> +
> #define CKPT_SETFL_MASK \
> (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
>
> @@ -648,6 +699,10 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
> if (ret < 0)
> return ret;
>
> + ret = restore_file_owner(ctx, h, file);
> + if (ret < 0)
> + return ret;
> +
> /*
> * Normally f_mode is set by open, and modified only via
> * fcntl(), so its value now should match that at checkpoint.
Oren.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/9][cr][v2]: Restore file_owner info
2010-06-15 4:05 ` Oren Laadan
@ 2010-07-28 19:25 ` Sukadev Bhattiprolu
2010-07-28 22:20 ` Matt Helsley
0 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-28 19:25 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Oren Laadan [orenl@cs.columbia.edu] wrote:
| > +
| > + rcu_read_lock();
| > + pid = find_vpid(h->f_owner_pid);
|
| What if this fails - the pid is invalid/non-existent ?
Good point. ->f_owner_pid can be 0 (in the normal case) and __fsetown()
below will set the owner to NULL pid. But if ->f_owner_pid is non-zero,
we should ensure we found a valid pid - added a check for this.
|
| > + /*
| > + * TODO: Do we need to force==1 or can it be 0 ? 'force' is used to
| > + * modify the owner, if one is already set. Can it be set when
| > + * we restart an application ?
| > + */
| > + ret = __f_setown(file, pid, h->f_owner_pid_type, uid, euid, 1);
| > + rcu_read_unlock();
|
| I wonder if this would be a problem in terms of security on a
| non-container restart (e.g. not in a new pid-ns): one could set
| any pid as owner and any signal to be sent, and cause an arbitrary
| signal to be sent to an arbitrary process ?
Yes, Matt and Serge pointed it out and for now we need CAP_KILL
capability to restore an application that has file-leases.
Sukadev
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/9][cr][v2]: Restore file_owner info
2010-07-28 19:25 ` Sukadev Bhattiprolu
@ 2010-07-28 22:20 ` Matt Helsley
2010-07-29 19:00 ` Serge E. Hallyn
0 siblings, 1 reply; 22+ messages in thread
From: Matt Helsley @ 2010-07-28 22:20 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Oren Laadan, serue, Matt Helsley, matthew, linux-fsdevel,
Containers
On Wed, Jul 28, 2010 at 12:25:03PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
> | > +
> | > + rcu_read_lock();
> | > + pid = find_vpid(h->f_owner_pid);
> |
> | What if this fails - the pid is invalid/non-existent ?
>
> Good point. ->f_owner_pid can be 0 (in the normal case) and __fsetown()
> below will set the owner to NULL pid. But if ->f_owner_pid is non-zero,
> we should ensure we found a valid pid - added a check for this.
>
> |
> | > + /*
> | > + * TODO: Do we need to force==1 or can it be 0 ? 'force' is used to
> | > + * modify the owner, if one is already set. Can it be set when
> | > + * we restart an application ?
> | > + */
> | > + ret = __f_setown(file, pid, h->f_owner_pid_type, uid, euid, 1);
> | > + rcu_read_unlock();
> |
> | I wonder if this would be a problem in terms of security on a
> | non-container restart (e.g. not in a new pid-ns): one could set
> | any pid as owner and any signal to be sent, and cause an arbitrary
> | signal to be sent to an arbitrary process ?
>
> Yes, Matt and Serge pointed it out and for now we need CAP_KILL
> capability to restore an application that has file-leases.
I looked at this some.
The pid and the signal are looked up when the signal is generated. Then
permissions are checked. See sigio_perm() called from send_sigio_to_task().
Thus I think the main thing we need to be certain of are the uid and euid
to pass in to own the file. Those are going to require some userns bits
I think. Then the signal number and recipient will be checked as normal
-- we don't need to do anything special during restart.
For reference, here's sigio_perm():
static inline int sigio_perm(struct task_struct *p,
struct fown_struct *fown, int sig)
{
const struct cred *cred;
int ret;
rcu_read_lock();
cred = __task_cred(p);
ret = ((fown->euid == 0 ||
fown->euid == cred->suid || fown->euid == cred->uid ||
fown->uid == cred->suid || fown->uid == cred->uid) &&
!security_file_send_sigiotask(p, fown, sig));
rcu_read_unlock();
return ret;
}
[ My Notes: unlike check_kill_permission() it does not check CAP_KILL.
Also check_kill_permission() calls audit as if the signal is about to be
delivered but sigio_perm() does not. ]
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/9][cr][v2]: Restore file_owner info
2010-07-28 22:20 ` Matt Helsley
@ 2010-07-29 19:00 ` Serge E. Hallyn
0 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2010-07-29 19:00 UTC (permalink / raw)
To: Matt Helsley
Cc: Sukadev Bhattiprolu, matthew, Containers, linux-fsdevel, serue
Quoting Matt Helsley (matthltc@us.ibm.com):
> For reference, here's sigio_perm():
>
> static inline int sigio_perm(struct task_struct *p,
> struct fown_struct *fown, int sig)
> {
> const struct cred *cred;
> int ret;
>
> rcu_read_lock();
> cred = __task_cred(p);
> ret = ((fown->euid == 0 ||
> fown->euid == cred->suid || fown->euid == cred->uid ||
> fown->uid == cred->suid || fown->uid == cred->uid) &&
> !security_file_send_sigiotask(p, fown, sig));
> rcu_read_unlock();
> return ret;
> }
>
> [ My Notes: unlike check_kill_permission() it does not check CAP_KILL.
Right, that's bc we don't store capabilities in the fown_struct.
So fown->euid==0 is all we can do. Since this can be called from
interrupt, current is not useful.
> Also check_kill_permission() calls audit as if the signal is about to be
> delivered but sigio_perm() does not. ]
-serge
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/9][cr][v2]: Move file_lock macros into linux/fs.h
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (3 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 4/9][cr][v2]: Restore file_owner info Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 6/9][cr][v2]: Checkpoint file-locks Sukadev Bhattiprolu
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Move IS_POSIX(), IS_FLOCK(), IS_LEASE() and 'for_each_lock()' into
include/linux/fs.h since these are also needed to checkpoint/restart
file-locks.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/locks.c | 7 -------
include/linux/fs.h | 7 +++++++
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index ca0c7e2..741b8b7 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -130,16 +130,9 @@
#include <asm/uaccess.h>
-#define IS_POSIX(fl) (fl->fl_flags & FL_POSIX)
-#define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK)
-#define IS_LEASE(fl) (fl->fl_flags & FL_LEASE)
-
int leases_enable = 1;
int lease_break_time = 45;
-#define for_each_lock(inode, lockp) \
- for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
-
static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4a6fb0..abd2267 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1088,6 +1088,13 @@ struct file_lock {
} fl_u;
};
+#define IS_POSIX(fl) (fl->fl_flags & FL_POSIX)
+#define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK)
+#define IS_LEASE(fl) (fl->fl_flags & FL_LEASE)
+
+#define for_each_lock(inode, lockp) \
+ for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
+
/* The following constant reflects the upper bound of the file/locking space */
#ifndef OFFSET_MAX
#define INT_LIMIT(x) (~((x)1 << (sizeof(x)*8 - 1)))
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 6/9][cr][v2]: Checkpoint file-locks
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (4 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 5/9][cr][v2]: Move file_lock macros into linux/fs.h Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-06-15 4:13 ` Oren Laadan
2010-05-19 3:07 ` [PATCH 7/9][cr][v2]: Define flock_set() Sukadev Bhattiprolu
` (2 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
While checkpointing each file-descriptor, find all the locks on the
file and save information about the lock in the checkpoint-image.
A follow-on patch will use this informaiton to restore the file-locks.
Changelog[v2]:
[Matt Helsley]: Use fixed sizes (__s64) instead of 'loff_t' in
'struct ckpt_hdr_file_lock'.
[Matt Helsley, Serge Hallyn]: Highlight new use of BKL (using
lock_flocks() macros as suggested by Serge).
[Matt Helsley]: Reorg code a bit to simplify error handling.
[Matt Helsley]: Reorg code to initialize marker-lock (Pass a
NULL lock to checkpoint_one_lock() to indicate marker).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/checkpoint.c | 101 ++++++++++++++++++++++++++++++++++-----
include/linux/checkpoint_hdr.h | 10 ++++
2 files changed, 98 insertions(+), 13 deletions(-)
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index e82f4f1..7773488 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -26,8 +26,19 @@
#include <linux/checkpoint.h>
#include <linux/eventpoll.h>
#include <linux/eventfd.h>
+#include <linux/smp_lock.h>
#include <net/sock.h>
+/*
+ * TODO: This code uses the BKL for consistency with other uses of
+ * 'for_each_lock()'. But since the BKL may be replaced with another
+ * lock in the future, use lock_flocks() macros instead. lock_flocks()
+ * are currently used in BKL-fix sand boxes and when those changes
+ * are merged, the following macros can be removed
+ */
+#define lock_flocks() lock_kernel()
+#define unlock_flocks() unlock_kernel()
+
/**************************************************************************
* Checkpoint
*/
@@ -256,8 +267,79 @@ static int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
return ret;
}
+static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
+ struct file_lock *lock)
+{
+ int rc;
+ struct ckpt_hdr_file_lock *h;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_LOCK);
+ if (!h)
+ return -ENOMEM;
+
+ if (lock) {
+ h->fl_start = lock->fl_start;
+ h->fl_end = lock->fl_end;
+ h->fl_type = lock->fl_type;
+ h->fl_flags = lock->fl_flags;
+ } else {
+ /* Checkpoint a dummy lock as a marker */
+ h->fl_start = -1;
+ h->fl_flags = FL_POSIX;
+ }
+
+ rc = ckpt_write_obj(ctx, &h->h);
+
+ ckpt_hdr_put(ctx, h);
+
+ return rc;
+}
+
+int
+checkpoint_file_locks(struct ckpt_ctx *ctx, struct files_struct *files,
+ struct file *file)
+{
+ int rc;
+ struct inode *inode;
+ struct file_lock **lockpp;
+ struct file_lock *lockp;
+
+ lock_flocks();
+ inode = file->f_path.dentry->d_inode;
+ for_each_lock(inode, lockpp) {
+ lockp = *lockpp;
+ ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", lockp->fl_start,
+ lockp->fl_end, lockp->fl_type, lockp->fl_flags);
+
+ if (lockp->fl_owner != files)
+ continue;
+
+ rc = -EBADF;
+ if (IS_POSIX(lockp))
+ rc = checkpoint_one_file_lock(ctx, file, lockp);
+
+ if (rc < 0) {
+ ckpt_err(ctx, rc, "%(T), checkpoint of lock "
+ "[%lld, %lld, %d, 0x%x] failed\n",
+ lockp->fl_start, lockp->fl_end,
+ lockp->fl_type, lockp->fl_flags);
+ goto out;
+ }
+ }
+
+ /*
+ * At the end of file-locks for this file, checkpoint a marker.
+ */
+ rc = checkpoint_one_file_lock(ctx, file, NULL);
+ if (rc < 0)
+ ckpt_err(ctx, rc, "%(T), checkpoint marker-lock failed\n");
+out:
+ unlock_flocks();
+ return rc;
+}
+
/**
- * ckpt_write_file_desc - dump the state of a given file descriptor
+ * checkpoint_file_desc - dump the state of a given file descriptor
* @ctx: checkpoint context
* @files: files_struct pointer
* @fd: file descriptor
@@ -288,18 +370,6 @@ static int checkpoint_file_desc(struct ckpt_ctx *ctx,
}
rcu_read_unlock();
- ret = find_locks_with_owner(file, files);
- /*
- * find_locks_with_owner() returns an error when there
- * are no locks found, so we *want* it to return an error
- * code. Its success means we have to fail the checkpoint.
- */
- if (!ret) {
- ret = -EBADF;
- ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
- goto out;
- }
-
/* sanity check (although this shouldn't happen) */
ret = -EBADF;
if (!file) {
@@ -323,6 +393,11 @@ static int checkpoint_file_desc(struct ckpt_ctx *ctx,
h->fd_close_on_exec = coe;
ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ ret = checkpoint_file_locks(ctx, files, file);
+
out:
ckpt_hdr_put(ctx, h);
if (file)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 44e2a0d..4509016 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -144,6 +144,8 @@ enum {
#define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
CKPT_HDR_EPOLL_ITEMS, /* must be after file-table */
#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS
+ CKPT_HDR_FILE_LOCK,
+#define CKPT_HDR_FILE_LOCK CKPT_HDR_FILE_LOCK
CKPT_HDR_MM = 401,
#define CKPT_HDR_MM CKPT_HDR_MM
@@ -581,6 +583,14 @@ struct ckpt_hdr_file_generic {
struct ckpt_hdr_file common;
} __attribute__((aligned(8)));
+struct ckpt_hdr_file_lock {
+ struct ckpt_hdr h;
+ __s64 fl_start;
+ __s64 fl_end;
+ __u8 fl_type;
+ __u8 fl_flags;
+};
+
struct ckpt_hdr_file_pipe {
struct ckpt_hdr_file common;
__s32 pipe_objref;
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 6/9][cr][v2]: Checkpoint file-locks
2010-05-19 3:07 ` [PATCH 6/9][cr][v2]: Checkpoint file-locks Sukadev Bhattiprolu
@ 2010-06-15 4:13 ` Oren Laadan
2010-07-28 19:26 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 22+ messages in thread
From: Oren Laadan @ 2010-06-15 4:13 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
On 05/18/2010 11:07 PM, Sukadev Bhattiprolu wrote:
> While checkpointing each file-descriptor, find all the locks on the
> file and save information about the lock in the checkpoint-image.
> A follow-on patch will use this informaiton to restore the file-locks.
>
> Changelog[v2]:
> [Matt Helsley]: Use fixed sizes (__s64) instead of 'loff_t' in
> 'struct ckpt_hdr_file_lock'.
> [Matt Helsley, Serge Hallyn]: Highlight new use of BKL (using
> lock_flocks() macros as suggested by Serge).
> [Matt Helsley]: Reorg code a bit to simplify error handling.
> [Matt Helsley]: Reorg code to initialize marker-lock (Pass a
> NULL lock to checkpoint_one_lock() to indicate marker).
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[...]
>
> +static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
> + struct file_lock *lock)
> +{
> + int rc;
> + struct ckpt_hdr_file_lock *h;
> +
> + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_LOCK);
> + if (!h)
> + return -ENOMEM;
> +
> + if (lock) {
> + h->fl_start = lock->fl_start;
> + h->fl_end = lock->fl_end;
> + h->fl_type = lock->fl_type;
> + h->fl_flags = lock->fl_flags;
> + } else {
> + /* Checkpoint a dummy lock as a marker */
> + h->fl_start = -1;
Maybe designate some constant for this ? e.g. CKPT_FLOCK_NONE ?
In any case, you need a (loff_t) -1 (like in the restore code).
[...]
Oren.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 6/9][cr][v2]: Checkpoint file-locks
2010-06-15 4:13 ` Oren Laadan
@ 2010-07-28 19:26 ` Sukadev Bhattiprolu
[not found] ` <20100728192649.GB14570-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-28 19:26 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Oren Laadan [orenl@cs.columbia.edu] wrote:
|
|
| > + if (lock) {
| > + h->fl_start = lock->fl_start;
| > + h->fl_end = lock->fl_end;
| > + h->fl_type = lock->fl_type;
| > + h->fl_flags = lock->fl_flags;
| > + } else {
| > + /* Checkpoint a dummy lock as a marker */
| > + h->fl_start = -1;
|
| Maybe designate some constant for this ? e.g. CKPT_FLOCK_NONE ?
| In any case, you need a (loff_t) -1 (like in the restore code).
Ok. Defining macros CKPT_HDR_SET_MARKER_LOCK() and
CKPT_HDR_CHECK_MARKER_LOCK(). Also added the missing (loff_t).
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 7/9][cr][v2]: Define flock_set()
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (5 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 6/9][cr][v2]: Checkpoint file-locks Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 8/9][cr][v2]: Define flock64_set() Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 9/9][cr][v2]: Restore file-locks Sukadev Bhattiprolu
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Extract core functionality of fcntl_setlk() into a separate function,
flock_set(). flock_set() can be also used when restarting a checkpointed
application and restoring its file-locks.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/locks.c | 44 +++++++++++++++++++++++++++-----------------
include/linux/fs.h | 1 +
2 files changed, 28 insertions(+), 17 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 741b8b7..cb83741 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1759,14 +1759,10 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
return error;
}
-/* Apply the lock described by l to an open file descriptor.
- * This implements both the F_SETLK and F_SETLKW commands of fcntl().
- */
-int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
- struct flock __user *l)
+int flock_set(unsigned int fd, struct file *filp, unsigned int cmd,
+ struct flock *flock)
{
struct file_lock *file_lock = locks_alloc_lock();
- struct flock flock;
struct inode *inode;
struct file *f;
int error;
@@ -1774,13 +1770,6 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
if (file_lock == NULL)
return -ENOLCK;
- /*
- * This might block, so we do it before checking the inode.
- */
- error = -EFAULT;
- if (copy_from_user(&flock, l, sizeof(flock)))
- goto out;
-
inode = filp->f_path.dentry->d_inode;
/* Don't allow mandatory locks on files that may be memory mapped
@@ -1792,7 +1781,7 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
}
again:
- error = flock_to_posix_lock(filp, file_lock, &flock);
+ error = flock_to_posix_lock(filp, file_lock, flock);
if (error)
goto out;
if (cmd == F_SETLKW) {
@@ -1800,7 +1789,7 @@ again:
}
error = -EBADF;
- switch (flock.l_type) {
+ switch (flock->l_type) {
case F_RDLCK:
if (!(filp->f_mode & FMODE_READ))
goto out;
@@ -1830,8 +1819,8 @@ again:
spin_lock(¤t->files->file_lock);
f = fcheck(fd);
spin_unlock(¤t->files->file_lock);
- if (!error && f != filp && flock.l_type != F_UNLCK) {
- flock.l_type = F_UNLCK;
+ if (!error && f != filp && flock->l_type != F_UNLCK) {
+ flock->l_type = F_UNLCK;
goto again;
}
@@ -1840,6 +1829,27 @@ out:
return error;
}
+/* Apply the lock described by l to an open file descriptor.
+ * This implements both the F_SETLK and F_SETLKW commands of fcntl().
+ */
+int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
+ struct flock __user *l)
+{
+ int error;
+ struct flock flock;
+
+ /*
+ * This might block, so we do it before checking the inode
+ * in flock_set().
+ */
+ error = -EFAULT;
+ if (copy_from_user(&flock, l, sizeof(flock)))
+ return error;
+
+ return flock_set(fd, filp, cmd, &flock);
+}
+
+
#if BITS_PER_LONG == 32
/* Report the first existing lock that would conflict with l.
* This implements the F_GETLK command of fcntl().
diff --git a/include/linux/fs.h b/include/linux/fs.h
index abd2267..fd66f37 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1112,6 +1112,7 @@ extern void send_sigio(struct fown_struct *fown, int fd, int band);
extern int fcntl_getlk(struct file *, struct flock __user *);
extern int fcntl_setlk(unsigned int, struct file *, unsigned int,
struct flock __user *);
+extern int flock_set(unsigned int, struct file *, unsigned int, struct flock *);
#if BITS_PER_LONG == 32
extern int fcntl_getlk64(struct file *, struct flock64 __user *);
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 8/9][cr][v2]: Define flock64_set()
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (6 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 7/9][cr][v2]: Define flock_set() Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-19 3:07 ` [PATCH 9/9][cr][v2]: Restore file-locks Sukadev Bhattiprolu
8 siblings, 0 replies; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Extract core functionality of fcntl_setlk64() into a separate function,
flock64_set(). flock64_set() can be also used when restarting a checkpointed
application and restoring its file-locks.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/locks.c | 38 ++++++++++++++++++++++++--------------
include/linux/fs.h | 2 ++
2 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index cb83741..c62ab7f 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1890,11 +1890,10 @@ out:
/* Apply the lock described by l to an open file descriptor.
* This implements both the F_SETLK and F_SETLKW commands of fcntl().
*/
-int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd,
- struct flock64 __user *l)
+int flock64_set(unsigned int fd, struct file *filp, unsigned int cmd,
+ struct flock64 *flock)
{
struct file_lock *file_lock = locks_alloc_lock();
- struct flock64 flock;
struct inode *inode;
struct file *f;
int error;
@@ -1902,13 +1901,6 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd,
if (file_lock == NULL)
return -ENOLCK;
- /*
- * This might block, so we do it before checking the inode.
- */
- error = -EFAULT;
- if (copy_from_user(&flock, l, sizeof(flock)))
- goto out;
-
inode = filp->f_path.dentry->d_inode;
/* Don't allow mandatory locks on files that may be memory mapped
@@ -1920,7 +1912,7 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd,
}
again:
- error = flock64_to_posix_lock(filp, file_lock, &flock);
+ error = flock64_to_posix_lock(filp, file_lock, flock);
if (error)
goto out;
if (cmd == F_SETLKW64) {
@@ -1928,7 +1920,7 @@ again:
}
error = -EBADF;
- switch (flock.l_type) {
+ switch (flock->l_type) {
case F_RDLCK:
if (!(filp->f_mode & FMODE_READ))
goto out;
@@ -1953,8 +1945,8 @@ again:
spin_lock(¤t->files->file_lock);
f = fcheck(fd);
spin_unlock(¤t->files->file_lock);
- if (!error && f != filp && flock.l_type != F_UNLCK) {
- flock.l_type = F_UNLCK;
+ if (!error && f != filp && flock->l_type != F_UNLCK) {
+ flock->l_type = F_UNLCK;
goto again;
}
@@ -1962,6 +1954,24 @@ out:
locks_free_lock(file_lock);
return error;
}
+
+/* Apply the lock described by l to an open file descriptor.
+ * This implements both the F_SETLK and F_SETLKW commands of fcntl().
+ */
+int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd,
+ struct flock64 __user *l)
+{
+ struct flock64 flock;
+
+ /*
+ * This might block, so we do it before checking the inode in
+ * flock64_set().
+ */
+ if (copy_from_user(&flock, l, sizeof(flock)))
+ return -EFAULT;
+
+ return flock64_set(fd, filp, cmd, &flock);
+}
#endif /* BITS_PER_LONG == 32 */
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd66f37..49d4eeb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1118,6 +1118,8 @@ extern int flock_set(unsigned int, struct file *, unsigned int, struct flock *);
extern int fcntl_getlk64(struct file *, struct flock64 __user *);
extern int fcntl_setlk64(unsigned int, struct file *, unsigned int,
struct flock64 __user *);
+extern int flock64_set(unsigned int, struct file *, unsigned int,
+ struct flock64 *);
#endif
extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 9/9][cr][v2]: Restore file-locks
2010-05-19 3:07 [PATCH 0/9][cr][v2]: C/R file owner and posix file locks Sukadev Bhattiprolu
` (7 preceding siblings ...)
2010-05-19 3:07 ` [PATCH 8/9][cr][v2]: Define flock64_set() Sukadev Bhattiprolu
@ 2010-05-19 3:07 ` Sukadev Bhattiprolu
2010-05-26 7:48 ` steve
8 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-19 3:07 UTC (permalink / raw)
To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers
Restore POSIX file-locks of an application from its checkpoint image.
Read the saved file-locks from the checkpoint image and for each POSIX
lock, call flock_set() to set the lock on the file.
As pointed out by Matt Helsley, no special handling is necessary for a
process P2 in the checkpointed container that is blocked on a lock, L1
held by another process P1. Since processes in the restarted container
begin execution only after all processes have restored. If the blocked
process P2 is restored first, first, it will prepare to return an
-ERESTARTSYS from the fcntl() system call, but wait for P1 to be
restored. When P1 is restored, it will re-acquire the lock L1 before P1
and P2 begin actual execution. This ensures that even if P2 is scheduled
to run before P1, P2 will go back to waiting for the lock L1.
Changelog[v2]:
- Add support for C/R of F_SETLK64/F_GETLK64
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
fs/checkpoint.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 168 insertions(+), 2 deletions(-)
diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 7773488..64809db 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -914,8 +914,170 @@ static void *restore_file(struct ckpt_ctx *ctx)
return (void *)file;
}
+#if BITS_PER_LONG == 32
+
+/*
+ * NOTE: Even if we checkpointed a lock that was set with 'struct flock'
+ * restore the lock using 'struct flock64'. Note that both these lock
+ * types are first converted to a posix_file_lock before processing so
+ * converting to 'struct flock64' is (hopefully) not a problem.
+ * NFS for instance uses IS_SETLK() instead of cmd == F_SETLK.
+ *
+ * TODO: Are there filesystems that implement F_SETLK but not F_SETLK64 ?
+ * If there are, restore_one_file_lock() will fail.
+ */
+static int
+ckpt_hdr_file_lock_to_flock64(struct ckpt_hdr_file_lock *h, struct flock64 *fl)
+{
+ /*
+ * We checkpoint the 'raw' fl_type which in case of leases includes
+ * the F_INPROGRESS flag. But for posix-locks, the fl_type should
+ * be simple.
+ */
+ switch(h->fl_type) {
+ case F_RDLCK:
+ case F_WRLCK:
+ case F_UNLCK:
+ break;
+ default:
+ ckpt_debug("Bad posix lock type 0x%x ?\n", h->fl_type);
+ return -EINVAL;
+ }
+
+ memset(fl, 0, sizeof(*fl));
+ fl->l_type = h->fl_type;
+ fl->l_start = h->fl_start;
+ fl->l_len = h->fl_end == OFFSET_MAX ? 0 : h->fl_end - h->fl_start + 1;
+ fl->l_whence = SEEK_SET;
+
+ /* TODO: Init ->l_sysid, l_pid fields */
+ ckpt_debug("Restoring filelock [%lld, %lld, %d]\n", fl->l_start,
+ fl->l_len, fl->l_type);
+
+ return 0;
+}
+
+static int restore_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
+ int fd, struct ckpt_hdr_file_lock *h)
+{
+ struct flock64 fl;
+ int ret;
+
+ ret = ckpt_hdr_file_lock_to_flock64(h, &fl);
+ if (ret < 0) {
+ ckpt_err(ctx, ret, "%(T) Unexpected flock\n");
+ return ret;
+ }
+
+ /*
+ * Use F_SETLK because we should not have to wait for the lock. If
+ * another process holds the lock, it indicates that filesystem-state
+ * is not consistent with what it was at checkpoint. In which case we
+ * better fail.
+ */
+ ret = flock64_set(fd, file, F_SETLK64, &fl);
+ if (ret)
+ ckpt_err(ctx, ret, "flock64_set(): %d\n", (int)h->fl_type);
+
+ return ret;
+}
+
+#else
+
+static int
+ckpt_hdr_file_lock_to_flock(struct ckpt_hdr_file_lock *h, struct flock *fl)
+{
+ /*
+ * We checkpoint the 'raw' fl_type which in case of leases includes
+ * the F_INPROGRESS flag. But for posix-locks, the fl_type should
+ * be simple.
+ */
+ switch(h->fl_type) {
+ case F_RDLCK:
+ case F_WRLCK:
+ case F_UNLCK:
+ break;
+ default:
+ ckpt_debug("Bad posix lock type 0x%x ?\n", h->fl_type);
+ return -EINVAL;
+ }
+
+ memset(fl, 0, sizeof(*fl));
+
+ fl->l_type = h->fl_type;
+ fl->l_start = h->fl_start;
+ fl->l_len = fl->fl_end == OFFSET_MAX ? 0 : h->fl_end - h->fl_start + 1;
+ fl->l_whence = SEEK_SET;
+
+ ckpt_debug("Restoring filelock [%lld, %lld, %d]\n", fl->l_start,
+ fl->l_len, fl->l_type);
+
+ /* TODO: Init ->l_sysid, l_pid fields */
+
+ return 0;
+}
+
+static int restore_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
+ int fd, struct ckpt_hdr_file_lock *h)
+{
+ struct flock fl;
+ int ret;
+
+ ret = ckpt_hdr_file_lock_to_flock(h, &fl);
+ if (ret < 0) {
+ ckpt_err(ctx, ret, "%(T) Unexpected flock\n");
+ break;
+ }
+
+ /*
+ * Use F_SETLK because we should not have to wait for the lock. If
+ * another process holds the lock, it indicates that filesystem-state
+ * is not consistent with what it was at checkpoint. In which case we
+ * better fail.
+ */
+ ret = flock_set(fd, file, F_SETLK, &fl);
+ if (ret)
+ ckpt_err(ctx, ret, "flock_set(): %d\n", (int)h->fl_type);
+
+ return ret;
+}
+#endif
+
+static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
+{
+ int ret;
+ struct ckpt_hdr_file_lock *h;
+
+ ret = 0;
+ while (!ret) {
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_LOCK);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", h->fl_start,
+ h->fl_end, (int)h->fl_type, h->fl_flags);
+
+ /*
+ * If it is a dummy-lock, we are done with this fd.
+ */
+ if ((h->fl_flags & FL_POSIX) && (h->fl_start == (loff_t)-1)) {
+ ckpt_debug("Found last lock for fd\n");
+ break;
+ }
+
+ ret = -EBADF;
+ if (h->fl_flags & FL_POSIX)
+ ret = restore_one_file_lock(ctx, file, fd, h);
+
+ if (ret < 0)
+ ckpt_err(ctx, ret, "%(T) fl_flags 0x%x\n", h->fl_flags);
+ }
+ return ret;
+}
+
/**
- * ckpt_read_file_desc - restore the state of a given file descriptor
+ * restore_file_desc - restore the state of a given file descriptor
* @ctx: checkpoint context
*
* Restores the state of a file descriptor; looks up the objref (in the
@@ -961,7 +1123,11 @@ static int restore_file_desc(struct ckpt_ctx *ctx)
}
set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
- ret = 0;
+
+ ret = restore_file_locks(ctx, file, h->fd_descriptor);
+ if (ret < 0)
+ ckpt_err(ctx, ret, "Error on fd %d\n", h->fd_descriptor);
+
out:
ckpt_hdr_put(ctx, h);
return ret;
--
1.6.0.4
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 9/9][cr][v2]: Restore file-locks
2010-05-19 3:07 ` [PATCH 9/9][cr][v2]: Restore file-locks Sukadev Bhattiprolu
@ 2010-05-26 7:48 ` steve
2010-05-26 23:57 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 22+ messages in thread
From: steve @ 2010-05-26 7:48 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Oren Laadan, serue, Matt Helsley, matthew, linux-fsdevel,
Containers
Hi,
On Tue, May 18, 2010 at 08:07:32PM -0700, Sukadev Bhattiprolu wrote:
> Restore POSIX file-locks of an application from its checkpoint image.
>
> Read the saved file-locks from the checkpoint image and for each POSIX
> lock, call flock_set() to set the lock on the file.
>
> As pointed out by Matt Helsley, no special handling is necessary for a
> process P2 in the checkpointed container that is blocked on a lock, L1
> held by another process P1. Since processes in the restarted container
> begin execution only after all processes have restored. If the blocked
> process P2 is restored first, first, it will prepare to return an
> -ERESTARTSYS from the fcntl() system call, but wait for P1 to be
> restored. When P1 is restored, it will re-acquire the lock L1 before P1
> and P2 begin actual execution. This ensures that even if P2 is scheduled
> to run before P1, P2 will go back to waiting for the lock L1.
>
Does that imply certain conditions wrt checkpointed processes and
NFS exports? I'm not sure I exactly undertstand the use case which
this is intended to address.
I was hoping to figure out whether it would also still be safe on
a cluster filesystem as well,
Steve.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 9/9][cr][v2]: Restore file-locks
2010-05-26 7:48 ` steve
@ 2010-05-26 23:57 ` Sukadev Bhattiprolu
2010-06-15 4:22 ` Oren Laadan
0 siblings, 1 reply; 22+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26 23:57 UTC (permalink / raw)
To: steve; +Cc: Oren Laadan, serue, Matt Helsley, matthew, linux-fsdevel,
Containers
steve@chygwyn.com [steve@chygwyn.com] wrote:
| Hi,
|
| On Tue, May 18, 2010 at 08:07:32PM -0700, Sukadev Bhattiprolu wrote:
| > Restore POSIX file-locks of an application from its checkpoint image.
| >
| > Read the saved file-locks from the checkpoint image and for each POSIX
| > lock, call flock_set() to set the lock on the file.
| >
| > As pointed out by Matt Helsley, no special handling is necessary for a
| > process P2 in the checkpointed container that is blocked on a lock, L1
| > held by another process P1. Since processes in the restarted container
| > begin execution only after all processes have restored. If the blocked
| > process P2 is restored first, first, it will prepare to return an
| > -ERESTARTSYS from the fcntl() system call, but wait for P1 to be
| > restored. When P1 is restored, it will re-acquire the lock L1 before P1
| > and P2 begin actual execution. This ensures that even if P2 is scheduled
| > to run before P1, P2 will go back to waiting for the lock L1.
| >
| Does that imply certain conditions wrt checkpointed processes and
| NFS exports? I'm not sure I exactly undertstand the use case which
| this is intended to address.
Well, yes this assumes some pre-requisites are met.
First lets look at a single system. We expect that the application
process tree is run inside a container. This means that the file
system(s) (and other resources like pipes, IPC) that the application
is working with are not modified by a process outside the container.
We also require that the application process tree be frozen before
checkpointing the application. So even if the checkpoint process takes
a few minutes, the state of the resources (files, pipes, signals etc)
does not change since a) application is containerized b) container is
frozen.
We already have the ability to run applications inside containers, using
the clone() system call (see lxc.sf.net for example) and the ability to
freeze the application using the freezer cgroup in the linux kenrnel.
|
| I was hoping to figure out whether it would also still be safe on
| a cluster filesystem as well,
For clusters and NFS, an external protocol has to be established so that
the distrubuted application can be started/frozen/checkpointed/restarated
in a coordinated way.
I think that is something that would have to be built on top of the
checkpoint/restart functionality that we are working on. Or maybe there
are existing implementations that we would need to plug into.
Hope that helps, but its possible I missed your question :-). If so
please let me know.
|
| Steve.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 9/9][cr][v2]: Restore file-locks
2010-05-26 23:57 ` Sukadev Bhattiprolu
@ 2010-06-15 4:22 ` Oren Laadan
0 siblings, 0 replies; 22+ messages in thread
From: Oren Laadan @ 2010-06-15 4:22 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: steve, serue, Matt Helsley, matthew, linux-fsdevel, Containers
On 05/26/2010 07:57 PM, Sukadev Bhattiprolu wrote:
> steve@chygwyn.com [steve@chygwyn.com] wrote:
> | Hi,
> |
> | On Tue, May 18, 2010 at 08:07:32PM -0700, Sukadev Bhattiprolu wrote:
> | > Restore POSIX file-locks of an application from its checkpoint image.
> | >
> | > Read the saved file-locks from the checkpoint image and for each POSIX
> | > lock, call flock_set() to set the lock on the file.
> | >
> | > As pointed out by Matt Helsley, no special handling is necessary for a
> | > process P2 in the checkpointed container that is blocked on a lock, L1
> | > held by another process P1. Since processes in the restarted container
> | > begin execution only after all processes have restored. If the blocked
> | > process P2 is restored first, first, it will prepare to return an
> | > -ERESTARTSYS from the fcntl() system call, but wait for P1 to be
> | > restored. When P1 is restored, it will re-acquire the lock L1 before P1
> | > and P2 begin actual execution. This ensures that even if P2 is scheduled
> | > to run before P1, P2 will go back to waiting for the lock L1.
> | >
> | Does that imply certain conditions wrt checkpointed processes and
> | NFS exports? I'm not sure I exactly undertstand the use case which
> | this is intended to address.
>
> Well, yes this assumes some pre-requisites are met.
>
> First lets look at a single system. We expect that the application
> process tree is run inside a container. This means that the file
> system(s) (and other resources like pipes, IPC) that the application
> is working with are not modified by a process outside the container.
To be precise, we require that (a) resources won't change during
the checkpoint, and (b) the filesystem view at restart would be
the same as at checkpoint.
Running applications inside an isolated container is one way to
achieve that (and more so, to provide guarantees on that). Doing
that provides certain assurance on the resulting checkpoint image.
However, the requirements may be satisfied even outside a container
by, for example, a well behaved applications; except that then we
can't say it's safe - it depends on the application.
> We also require that the application process tree be frozen before
> checkpointing the application. So even if the checkpoint process takes
> a few minutes, the state of the resources (files, pipes, signals etc)
> does not change since a) application is containerized b) container is
> frozen.
>
> We already have the ability to run applications inside containers, using
> the clone() system call (see lxc.sf.net for example) and the ability to
> freeze the application using the freezer cgroup in the linux kenrnel.
>
> |
> | I was hoping to figure out whether it would also still be safe on
> | a cluster filesystem as well,
>
> For clusters and NFS, an external protocol has to be established so that
> the distrubuted application can be started/frozen/checkpointed/restarated
> in a coordinated way.
>
> I think that is something that would have to be built on top of the
> checkpoint/restart functionality that we are working on. Or maybe there
> are existing implementations that we would need to plug into.
>
> Hope that helps, but its possible I missed your question :-). If so
> please let me know.
What you refer to is checkpoint in a cluster, or distributed checkpoint
of multiple cooperating processes/applications that run on multiple
hosts. Indeed, one simple way to do it is coordinated distributed
checkpoint and restart.
However, I think the question was about a single application (or
container) that is accessing remote and clustered (and possibly
distributed) *file systems*.
Ideally, we would like to have a method to snapshot a filesystem
at checkpoint to guarantee that at restart we have a consistent
view of that file system. This is regardless of whether the file
system is local or remote. In the absence of such a mechanism,
we will have to rely on the file system not being changed (at least
those parts that the checkpoint application expects to remain as
they had been) until the restart.
Oren.
^ permalink raw reply [flat|nested] 22+ messages in thread