From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
To: Oren Laadan <orenl@cs.columbia.edu>
Cc: Serge Hallyn <serge@hallyn.com>,
Matt Helsley <matthltc@us.ibm.com>, Dan Smith <danms@us.ibm.com>,
John Stultz <johnstul@us.ibm.com>,
Matthew Wilcox <matthew@wil.cx>,
Jamie Lokier <jamie@shareable.org>,
Steven Whitehouse <swhiteho@redhat.com>,
<linux-fsdevel@vger.kernel.org>,
Containers <containers@lists.linux-foundation.org>
Subject: [PATCH 17/17][cr][v4]: Document design of C/R of file-locks and leases
Date: Mon, 16 Aug 2010 12:43:21 -0700
Message-ID: <1281987801-1293-18-git-send-email-sukadev@linux.vnet.ibm.com>
In-Reply-To: <1281987801-1293-1-git-send-email-sukadev@linux.vnet.ibm.com>
Summarize the file-system consistency requirements and the design of
the C/R of file-locks and leases.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
Documentation/checkpoint/file-locks | 126 +++++++++++++++++++++++++++++++++++
1 files changed, 126 insertions(+), 0 deletions(-)
create mode 100644 Documentation/checkpoint/file-locks
diff --git a/Documentation/checkpoint/file-locks b/Documentation/checkpoint/file-locks
new file mode 100644
index 0000000..e562990
--- /dev/null
+++ b/Documentation/checkpoint/file-locks
@@ -0,0 +1,126 @@
+
+Filesystem consistency across C/R.
+==================================
+
+To checkpoint/restart a process that is using any filesystem resource, the
+kernel assumes that the file system state at the time of restart is consistent
+with its state at the time of checkpoint. In general, this consistency can be
+achieved by:
+
+ a. running the application inside a container (to ensure no process
+ outside the container modifies the filesystem/IPC or other states)
+
+ b. freezing the application before checkpoint
+ c. taking a snapshot of the file system while the application is frozen
+ d. checkpointing the application while it is frozen
+
+ e. restoring the file system state to its snapshot
+ f. restarting the application inside a container
+
+i.e. the kernel assumes that the file system state is consistent, but it does
+not and cannot verify this. The administrator must provide this consistency,
+taking into account the file system type, including whether it is local or
+remote, and the tools available for it (e.g. btrfs snapshot tools or
+rsync).
+
+For distributed applications operating on distributed filesystems, it is
+expected that an external mechanism will coordinate the freeze/checkpoint/
+snapshot/restart across the nodes. IOW, the current semantics in the kernel
+provide for C/R on a single node.
+
+Checkpoint/restart of file-locks.
+=================================
+
+To checkpoint file-locks in an application, we start with each file-descriptor
+and count the number of file-locks on that file-descriptor. We save this count
+in the checkpoint image, and then information about each file-lock on the
+file-descriptor.
+
+When restarting the application from the checkpoint, we read the file-lock
+count for each file-descriptor and then read the information about each
+file-lock. For each file-lock, we call flock_set() to set a new file-lock.
+
+No special handling is necessary for a process P2 in the checkpointed container
+that is blocked on a file-lock L1 held by another process P1. Processes in the
+restarted container begin execution only after all of them have been restored.
+If the blocked process P2 is restored first, it will prepare to return
+-ERESTARTSYS from the fcntl() system call, but wait for P1 to be restored.
+When P1 is restored, it will re-acquire the file-lock L1 before P1 and P2 begin
+actual execution.
+
+This ensures that even if P2 is scheduled to run before P1, P2 will go
+back to waiting for the file-lock L1.
+
+Checkpoint/restart of file-leases.
+==================================
+
+C/R of file-leases depends on whether the lease is currently being broken
+(i.e. F_INPROGRESS is set). If the file-lease is not being broken, checkpoint/
+restart of the file-lease is similar to that of file-locks: save the type of
+the lease for the file in the checkpoint image and, when restarting, restore
+the lease by calling do_setlease().
+
+C/R of file-leases gets complicated if a process is checkpointed while its
+lease is being revoked. For example, if P1 has an F_WRLCK lease on file F1
+and P2 opens F1 for write, P2's open is blocked for lease_break_time (45
+secs), P1's lease is revoked (i.e. set to F_UNLCK) and P1 is notified via
+SIGIO to flush any dirty data.
+
+Basic design:
+
+To address "in-progress" leases, we checkpoint additional information about
+the lease:
+
+ - the previous lease type (file_lock->fl_type_prev)
+ - the time remaining in the lease (->fl_rem_lease), and
+ - whether we already notified the lease-holder about the lease-break
+ (->fl_break_notified)
+
+To restore an "in-progress" lease, we temporarily re-assign the original
+lease type (that we saved in ->fl_type_prev) to the lease-holder (i.e. in the
+above example, give P1 an F_WRLCK lease). When the lease-breaker (P2) is
+restarted after checkpoint, its open() system call fails with -ERESTARTSYS and
+it will retry the open(). This open() will re-initiate the lease-break protocol
+(i.e. P2 will go back to waiting and P1 will be notified).
+
+Some observations about this approach:
+
+1. We must use ->fl_type_prev because, when the lease is being broken,
+ ->fl_type is already set to F_UNLCK and would not result in a
+ lease-break protocol when P2 is restarted.
+
+2. When the lease-break is initiated and we signal the lease-holder, we set
+ the ->fl_break_notified field. When restarting the lease and repeating
+ the lease-break protocol, we check the ->fl_break_notified field and
+ signal the lease-holder only if we did not signal it before the checkpoint.
+
+3. If P1 was checkpointed 40 seconds into the lease_break_time (i.e.
+ it had 5 seconds remaining in the lease), we would ideally want to ensure
+ that after restart, P1 gets at least 5 seconds to finish cleaning up
+ the lease.
+
+ But the actual time that P1 gets after the application is restarted depends
+ on many factors (number of processes in the application process tree, load
+ on system at the time of restart etc).
+
+ Jamie Lokier had suggested that we favor the lease-holder (P1) during
+ restart, even if it meant giving the lease-holder the entire lease-break
+ interval (45 seconds) again after the restart. Oren Laadan suggested
+ that rather than make that a kernel policy, we let the user choose a
+ policy based on the application's behavior.
+
+ The current design computes and checkpoints the remaining-lease and
+ uses this value to restore the lease, i.e. the kernel simply uses the
+ "remaining-lease" value stored in the checkpoint image. Userspace tools
+ can be developed to alter the remaining-lease value in the checkpoint
+ image to either favor the lease-holder or the lease-breaker or to add
+ a fixed delta.
+
+4. The current design of C/R of file-leases assumes that both lease-holder
+ and lease-breaker are restarted. If only the lease-holder is restarted,
+ the kernel will re-assign the original lease (F_WRLCK in the example) to
+ lease-holder. If no lease-breaker comes along, the kernel will leave the
+ lease assigned to lease-holder.
+
+ This should not be a problem because, as far as the lease-holder is
+ concerned, the lease was revoked and it will/should reacquire the lease.
--
1.6.0.4
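
Illustration only, not part of the patch: observation 3 in the document leaves
the choice of restart policy for an in-progress lease to userspace tools that
edit the remaining-lease value in the checkpoint image. The C fragment below is
a minimal sketch of what such an adjustment might look like. The struct, the
enum and both function names are hypothetical; the comments only echo the
kernel fields named in the document (->fl_type_prev, ->fl_rem_lease,
->fl_break_notified), and the real checkpoint image layout is not described
here. The 45-second interval is the lease_break_time value quoted in the text.

```c
#include <stdbool.h>

/* Lease-break interval quoted in the document (lease_break_time). */
#define LEASE_BREAK_TIME_SECS 45

/*
 * Hypothetical userspace view of the extra state checkpointed for an
 * in-progress lease. This is NOT the image format used by the patches.
 */
struct ckpt_lease_sketch {
	int  type_prev;       /* lease type before the break (->fl_type_prev) */
	long rem_lease_secs;  /* time left in the break (->fl_rem_lease) */
	bool break_notified;  /* holder already signalled (->fl_break_notified) */
};

/* Policies a tool might offer; the names are invented for this sketch. */
enum lease_policy {
	POLICY_KEEP,          /* use the value exactly as checkpointed */
	POLICY_FAVOR_HOLDER,  /* give the holder the whole interval again */
	POLICY_FAVOR_BREAKER, /* let the break complete immediately */
	POLICY_ADD_DELTA      /* add a fixed number of seconds */
};

/* Return the remaining-lease value to store back, never below zero. */
long adjust_rem_lease(long rem, enum lease_policy policy, long delta)
{
	long out = rem;

	switch (policy) {
	case POLICY_FAVOR_HOLDER:
		out = LEASE_BREAK_TIME_SECS;
		break;
	case POLICY_FAVOR_BREAKER:
		out = 0;
		break;
	case POLICY_ADD_DELTA:
		out = rem + delta;
		break;
	case POLICY_KEEP:
	default:
		break;
	}
	return out < 0 ? 0 : out;
}

/* Observation 2: signal the holder only if it was not signalled before. */
bool must_signal_holder(const struct ckpt_lease_sketch *lease)
{
	return !lease->break_notified;
}
```

With the example from the document (P1 checkpointed with 5 seconds left),
POLICY_KEEP preserves the 5 seconds, POLICY_FAVOR_HOLDER corresponds to Jamie
Lokier's suggestion of granting the full interval again, and POLICY_ADD_DELTA
covers the "fixed delta" case.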
Thread overview: 23+ messages
2010-08-16 19:43 [PATCH 00/17][cr][v4]: C/R file owner, locks, leases Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 01/17][cr][v4]: Add uid, euid params to f_modown() Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 02/17][cr][v4]: Add uid, euid params to __f_setown() Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 03/17][cr][v4]: Checkpoint file-owner information Sukadev Bhattiprolu
[not found] ` <1281987801-1293-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-09-16 23:34 ` Oren Laadan
2010-08-16 19:43 ` [PATCH 04/17][cr][v4]: Restore file_owner info Sukadev Bhattiprolu
2010-09-16 23:45 ` Oren Laadan
2010-08-16 19:43 ` [PATCH 05/17][cr][v4]: Move file_lock macros into linux/fs.h Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 06/17][cr][v4]: Checkpoint file-locks Sukadev Bhattiprolu
2010-09-17 0:03 ` Oren Laadan
2010-08-16 19:43 ` [PATCH 07/17][cr][v4]: Define flock_set() Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 08/17][cr][v4]: Define flock64_set() Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 09/17][cr][v4]: Restore file-locks Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 10/17][cr][v4]: Initialize ->fl_break_time to 0 Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 11/17][cr][v4]: Add ->fl_type_prev field Sukadev Bhattiprolu
2010-09-17 0:06 ` Oren Laadan
2010-08-16 19:43 ` [PATCH 12/17][cr][v4]: Add ->fl_break_notified field Sukadev Bhattiprolu
2010-09-17 0:07 ` Oren Laadan
2010-08-16 19:43 ` [PATCH 13/17][cr][v4]: Add jiffies_begin field to ckpt_ctx Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 14/17][cr][v4]: Checkpoint file-leases Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 15/17][cr][v4]: Define do_setlease() Sukadev Bhattiprolu
2010-08-16 19:43 ` [PATCH 16/17][cr][v4]: Restore file-leases Sukadev Bhattiprolu
2010-08-16 19:43 ` Sukadev Bhattiprolu [this message]