Re: [RFC PATCH] Filesystem Journal Notifications

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH] Filesystem Journal Notifications
       [not found] ` <20080915023126.GF4090@webber.adilger.int>
@ 2008-09-15  5:39   ` Andreas Dilger
  2008-09-15 19:36     ` Abhijit Paithankar
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2008-09-15  5:39 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, linux-ext4

On Sep 14, 2008  19:31 -0700, Andreas Dilger wrote:
> We had a much more flexible journal commit notification mechanism.  The
> caller would register a callback with the journal layer that contained
> a caller-private callback function and callback data.  This avoided
> exposing the journal transaction IDs outside of the filesystem, and the
> caller could use any identifier desired.
> 
> There was also no limit on the number/type of callbacks registered with
> the filesystem.  The basic data structure was a linked list of callbacks
> registered on the transaction:
> 
> struct journal_callback {
>        struct list_head jcb_list;
>        void (*jcb_func)(struct journal_callback *jcb, int error);
>        /* caller data goes here */
> };
> 
> and in journal_commit_transaction() the callbacks are called:
> 
>                list_for_each_entry_safe(jcb, n, &commit_transaction->t_jcb,
> 					jcb_list) {
>                        list_del_init(&jcb->jcb_list);
>                        jcb->jcb_func(jcb, error);
>                }

Here is the patch (against 2.6.18 kernels, but I don't think anything has
changed in this area in a long time):


Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h	2006-07-15 16:08:35.000000000 +0800
+++ linux-2.6/include/linux/jbd.h	2006-07-15 16:13:01.000000000 +0800
@@ -356,6 +356,27 @@ static inline void jbd_unlock_bh_journal
 	bit_spin_unlock(BH_JournalHead, &bh->b_state);
 }
 
+#define HAVE_JOURNAL_CALLBACK_STATUS
+/**
+ *   struct journal_callback - Base structure for callback information
+ *   @jcb_list: list information for other callbacks attached to the same handle
+ *   @jcb_func: Function to call with this callback structure
+ *
+ *   This struct is a 'seed' structure for a using with your own callback
+ *   structs. If you are using callbacks you must allocate one of these
+ *   or another struct of your own definition which has this struct 
+ *   as it's first element and pass it to journal_callback_set().
+ *
+ *   This is used internally by jbd to maintain callback information.
+ *
+ *   See journal_callback_set for more information.
+ **/
+struct journal_callback {
+	struct list_head jcb_list;		/* t_jcb_lock */
+	void (*jcb_func)(struct journal_callback *jcb, int error);
+	/* caller data goes here */
+};
+
 struct jbd_revoke_table_s;
 
 /**
@@ -364,6 +385,7 @@ struct jbd_revoke_table_s;
  * @h_transaction: Which compound transaction is this update a part of?
  * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
  * @h_ref: Reference count on this handle
+ * @h_jcb: List of application registered callbacks for this handle.
  * @h_err: Field for caller's use to track errors through large fs operations
  * @h_sync: flag for sync-on-close
  * @h_jdata: flag to force data journaling
@@ -389,6 +411,13 @@ struct handle_s 
 	/* operations */
 	int			h_err;
 
+	/*
+	 * List of application registered callbacks for this handle. The
+	 * function(s) will be called after the transaction that this handle is
+	 * part of has been committed to disk. [t_jcb_lock]
+	 */
+	struct list_head	h_jcb;
+
 	/* Flags [no locking] */
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
@@ -430,6 +459,8 @@ struct handle_s 
  *    j_state_lock
  *    ->j_list_lock			(journal_unmap_buffer)
  *
+ *    t_handle_lock
+ *    ->t_jcb_lock
  */
 
 struct transaction_s 
@@ -559,6 +590,15 @@ struct transaction_s 
 	 */
 	int t_handle_count;
 
+	/*
+	 * Protects the callback list
+	 */
+	spinlock_t		t_jcb_lock;
+	/*
+	 * List of registered callback functions for this transaction.
+	 * Called when the transaction is committed. [t_jcb_lock]
+	 */
+	struct list_head	t_jcb;
 };
 
 /**
@@ -906,6 +946,10 @@ extern void	 journal_invalidatepage(jour
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 journal_stop(handle_t *);
 extern int	 journal_flush (journal_t *);
+extern void	 journal_callback_set(handle_t *handle,
+				      void (*fn)(struct journal_callback *,int),
+				      struct journal_callback *jcb);
+
 extern void	 journal_lock_updates (journal_t *);
 extern void	 journal_unlock_updates (journal_t *);
 
Index: linux-2.6/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd/checkpoint.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/checkpoint.c	2006-07-15 16:13:01.000000000 +0800
@@ -688,6 +688,7 @@ void __journal_drop_transaction(journal_
 	J_ASSERT(transaction->t_checkpoint_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_io_list == NULL);
 	J_ASSERT(transaction->t_updates == 0);
+	J_ASSERT(list_empty(&transaction->t_jcb));
 	J_ASSERT(journal->j_committing_transaction != transaction);
 	J_ASSERT(journal->j_running_transaction != transaction);
 
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/commit.c	2006-07-15 16:13:01.000000000 +0800
@@ -708,6 +708,21 @@ wait_for_iobuf:
            transaction can be removed from any checkpoint list it was on
            before. */
 
+	/* Call any callbacks that had been registered for handles in this
+	 * transaction.  It is up to the callback to free any allocated
+	 * memory.  No locking is required, since this is the only process
+	 * that is processing this transaction anymore. */
+	if (!list_empty(&commit_transaction->t_jcb)) {
+		struct journal_callback *jcb, *n;
+		int error = is_journal_aborted(journal);
+
+		list_for_each_entry_safe(jcb, n, &commit_transaction->t_jcb,
+					 jcb_list) {
+			list_del_init(&jcb->jcb_list);
+			jcb->jcb_func(jcb, error);
+		}
+	}
+
 	jbd_debug(3, "JBD: commit phase 7\n");
 
 	J_ASSERT(commit_transaction->t_sync_datalist == NULL);
Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/journal.c	2006-07-15 16:13:01.000000000 +0800
@@ -58,6 +58,7 @@ EXPORT_SYMBOL(journal_sync_buffer);
 #endif
 EXPORT_SYMBOL(journal_flush);
 EXPORT_SYMBOL(journal_revoke);
+EXPORT_SYMBOL(journal_callback_set);
 
 EXPORT_SYMBOL(journal_init_dev);
 EXPORT_SYMBOL(journal_init_inode);
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c	2006-07-15 16:08:35.000000000 +0800
+++ linux-2.6/fs/jbd/transaction.c	2006-07-15 16:13:01.000000000 +0800
@@ -50,6 +50,8 @@ get_transaction(journal_t *journal, tran
 	transaction->t_state = T_RUNNING;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
+	INIT_LIST_HEAD(&transaction->t_jcb);
+	spin_lock_init(&transaction->t_jcb_lock);
 	spin_lock_init(&transaction->t_handle_lock);
 
 	/* Set up the commit timer for the new transaction. */
@@ -241,6 +243,7 @@ static handle_t *new_handle(int nblocks)
 	memset(handle, 0, sizeof(*handle));
 	handle->h_buffer_credits = nblocks;
 	handle->h_ref = 1;
+	INIT_LIST_HEAD(&handle->h_jcb);
 
 	return handle;
 }
@@ -1291,6 +1294,36 @@ drop:
 }
 
 /**
+ * void journal_callback_set() -  Register a callback function for this handle.
+ * @handle: handle to attach the callback to.
+ * @func: function to callback.
+ * @jcb:  structure with additional information required by func() , and
+ *        some space for jbd internal information.
+ * 
+ * The function will be called when the transaction that this handle is
+ * part of has been committed to disk with the original callback data
+ * struct and the error status of the journal as parameters.  There is no
+ * guarantee of ordering between handles within a single transaction, nor
+ * between callbacks registered on the same handle.
+ *
+ * The caller is responsible for allocating the journal_callback struct.
+ * This is to allow the caller to add as much extra data to the callback
+ * as needed, but reduce the overhead of multiple allocations.  The caller
+ * allocated struct must start with a struct journal_callback at offset 0,
+ * and has the caller-specific data afterwards.
+ */
+void journal_callback_set(handle_t *handle,
+			  void (*func)(struct journal_callback *jcb, int error),
+			  struct journal_callback *jcb)
+{
+	jcb->jcb_func = func;
+	spin_lock(&handle->h_transaction->t_jcb_lock);
+	list_add_tail(&jcb->jcb_list, &handle->h_jcb);
+	spin_unlock(&handle->h_transaction->t_jcb_lock);
+}
+
+/**
  * int journal_stop() - complete a transaction
  * @handle: tranaction to complete.
  * 
@@ -1363,6 +1396,11 @@ int journal_stop(handle_t *handle)
 			wake_up(&journal->j_wait_transaction_locked);
 	}
 
+	/* Move callbacks from the handle to the transaction. */
+	spin_lock(&transaction->t_jcb_lock);
+	list_splice(&handle->h_jcb, &transaction->t_jcb);
+	spin_unlock(&transaction->t_jcb_lock);
+
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
 	 * going!  We also want to force a commit if the current


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15  5:39   ` [RFC PATCH] Filesystem Journal Notifications Andreas Dilger
@ 2008-09-15 19:36     ` Abhijit Paithankar
  2008-09-15 23:37       ` Andreas Dilger
  2008-09-20  5:15       ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Abhijit Paithankar @ 2008-09-15 19:36 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

Thanks for your feedback.
It gave me new points to think about.

I have used inotify simply because it is a mechanism that already exists
in the kernel and is easily extensible.

The use of transaction ids as "journal cookie" was done to simplify
things. A journal transaction ID is a readily available monotonically
increasing ID unique to that partition (super block).

Journal cookies themselves are completely opaque to the application
and as long as it is a monotonically increasing number that signifies
a journal commit, it could be anything that a file system chooses it
to be.

Ext3/Ext4 gather multiple changes into a single transaction. So,
you will see a single 'commit' notification for multiple of meta-data
changes. The 'commit' messages, therefore, won't be that many.


This patch (v2):

In this version of the patch I've done away with most of the file-system
specific changes. The file system now only needs to send out a journal
commit notification.

The 'journal cookie' is simply a u64 that increases every time a journal
'commit' is made. This will make the code more portable to other file
systems.

Also, we no longer need those many journal message types. We can easily
piggy-back on top of fsnotify. The only remaining one is
IN_JOURNAL_COMMIT.

My next step is to allow multiple registrations for JOURNAL_COMMIT
notification. I will post that patch once it is ready.

Signed-off-by: Abhijit Paithankar <apaithan@akamai.com>
---

 fs/ext3/ext3_jbd.c       |    9 +++++
 fs/ext4/ext4_jbd2.c      |    9 +++++
 fs/inotify.c             |   17 ++++++---
 fs/inotify_user.c        |   81 ++++++++++++++++++++++++++++++++++++++++++-----
 fs/jbd/commit.c          |    1 
 fs/jbd2/commit.c         |    1 
 include/linux/fs.h       |    2 +
 include/linux/fsnotify.h |   11 ++++++
 include/linux/inotify.h  |   19 ++++++++++-
 include/linux/jbd.h      |    1 
 include/linux/jbd2.h     |    1 
 11 files changed, 138 insertions(+), 14 deletions(-)

---


Index: linux-2.6/fs/ext3/ext3_jbd.c
===================================================================
--- linux-2.6.orig/fs/ext3/ext3_jbd.c	2008-09-15 10:12:25.000000000 -0700
+++ linux-2.6/fs/ext3/ext3_jbd.c	2008-09-15 10:15:54.000000000 -0700
@@ -3,6 +3,7 @@
  */
 
 #include <linux/ext3_jbd.h>
+#include <linux/fsnotify.h>
 
 int __ext3_journal_get_undo_access(const char *where, handle_t *handle,
 				struct buffer_head *bh)
@@ -57,3 +58,11 @@
 		ext3_journal_abort_handle(where, __func__, bh, handle,err);
 	return err;
 }
+
+void journal_notify_commit (journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	if (sb->s_priv_inode && sb->s_journal_cookie)
+		fsnotify_journal_commit(sb);
+}
+
Index: linux-2.6/fs/ext4/ext4_jbd2.c
===================================================================
--- linux-2.6.orig/fs/ext4/ext4_jbd2.c	2008-09-15 10:14:12.000000000 -0700
+++ linux-2.6/fs/ext4/ext4_jbd2.c	2008-09-15 10:15:50.000000000 -0700
@@ -2,6 +2,7 @@
  * Interface between ext4 and JBD
  */
 
+#include <linux/fsnotify.h>
 #include "ext4_jbd2.h"
 
 int __ext4_journal_get_undo_access(const char *where, handle_t *handle,
@@ -57,3 +58,11 @@
 		ext4_journal_abort_handle(where, __func__, bh, handle, err);
 	return err;
 }
+
+void jbd2_journal_notify_commit (journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	if (sb->s_priv_inode && sb->s_journal_cookie)
+		fsnotify_journal_commit(sb);
+}
+
Index: linux-2.6/fs/inotify.c
===================================================================
--- linux-2.6.orig/fs/inotify.c	2008-09-15 10:12:29.000000000 -0700
+++ linux-2.6/fs/inotify.c	2008-09-15 10:13:44.000000000 -0700
@@ -231,7 +231,7 @@
 				 struct inotify_watch *watch)
 {
 	remove_watch_no_event(watch, ih);
-	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL);
+	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL, 0);
 }
 EXPORT_SYMBOL_GPL(inotify_remove_watch_locked);
 
@@ -280,10 +280,17 @@
 			       const char *name, struct inode *n_inode)
 {
 	struct inotify_watch *watch, *next;
+	u64 journal_cookie = 0;
 
 	if (!inotify_inode_watched(inode))
 		return;
 
+	if (mask & (IN_ATTRIB | IN_MOVE | IN_CREATE | IN_DELETE |
+		    IN_DELETE_SELF | IN_MOVE_SELF)) {
+		if (inode->i_sb && inode->i_sb->s_journal_cookie)
+			journal_cookie = inode->i_sb->s_journal_cookie;
+	}
+
 	mutex_lock(&inode->inotify_mutex);
 	list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) {
 		u32 watch_mask = watch->mask;
@@ -293,7 +300,7 @@
 			if (watch_mask & IN_ONESHOT)
 				remove_watch_no_event(watch, ih);
 			ih->in_ops->handle_event(watch, watch->wd, mask, cookie,
-						 name, n_inode);
+						 name, n_inode, journal_cookie);
 			mutex_unlock(&ih->mutex);
 		}
 	}
@@ -409,7 +416,7 @@
 			struct inotify_handle *ih= watch->ih;
 			mutex_lock(&ih->mutex);
 			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
-						 NULL, NULL);
+						 NULL, NULL, 0);
 			inotify_remove_watch_locked(ih, watch);
 			mutex_unlock(&ih->mutex);
 		}
@@ -576,7 +583,7 @@
 		mask_add = 1;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 
@@ -623,7 +630,7 @@
 	int newly_watched;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 	watch->mask = mask;
Index: linux-2.6/fs/inotify_user.c
===================================================================
--- linux-2.6.orig/fs/inotify_user.c	2008-09-15 10:12:34.000000000 -0700
+++ linux-2.6/fs/inotify_user.c	2008-09-15 10:51:43.000000000 -0700
@@ -185,7 +185,7 @@
  * This function can sleep.
  */
 static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
-						  const char *name)
+						  const char *name, u64 journal_cookie)
 {
 	struct inotify_kernel_event *kevent;
 
@@ -204,6 +204,7 @@
 
 	if (name) {
 		size_t len, rem, event_size = sizeof(struct inotify_event);
+		size_t journal_cookie_sz;
 
 		/*
 		 * We need to pad the filename so as to properly align an
@@ -213,6 +214,10 @@
 		 * simple and safe for all architectures.
 		 */
 		len = strlen(name) + 1;
+		if (journal_cookie) {
+			journal_cookie_sz = sizeof(u64);
+			len = len + journal_cookie_sz;
+		}
 		rem = event_size - len;
 		if (len > event_size) {
 			rem = event_size - (len % event_size);
@@ -225,13 +230,52 @@
 			kmem_cache_free(event_cachep, kevent);
 			return NULL;
 		}
-		memcpy(kevent->name, name, len);
+		if (journal_cookie) {
+			char *ch;
+
+			memcpy(kevent->name, name, strlen(name));
+			ch = kevent->name;
+			ch += strlen(name);
+			memset(ch++, 0, 1);
+			sprintf(ch, "%llu", journal_cookie);
+		}
+		else {
+			memcpy(kevent->name, name, len);
+		}
 		if (rem)
 			memset(kevent->name + len, 0, rem);
 		kevent->event.len = len + rem;
 	} else {
-		kevent->event.len = 0;
-		kevent->name = NULL;
+		if (journal_cookie) {
+			size_t len, rem, event_size = sizeof(struct inotify_event);
+			size_t journal_cookie_sz = sizeof(u64);
+			char *ch;
+
+			len = 1 + journal_cookie_sz;
+			rem = event_size - len;
+			if (len > event_size) {
+				rem = event_size - (len % event_size);
+				if (len % event_size == 0)
+					rem = 0;
+			}
+
+			kevent->name = kmalloc(len + rem, GFP_KERNEL);
+			if (unlikely(!kevent->name)) {
+				kmem_cache_free(event_cachep, kevent);
+				return NULL;
+			}
+			memset(kevent->name, 0, 1);
+			ch = kevent->name;
+			ch += 1;
+			sprintf(ch, "%llu", journal_cookie);
+			if (rem)
+				memset(kevent->name + len, 0, rem);
+			kevent->event.len = len + rem;
+		}
+		else {
+			kevent->event.len = 0;
+			kevent->name = NULL;
+		}
 	}
 
 	return kevent;
@@ -269,7 +313,7 @@
  */
 static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
 				    u32 cookie, const char *name,
-				    struct inode *ignored)
+				    struct inode *ignored, u64 journal_cookie)
 {
 	struct inotify_user_watch *watch;
 	struct inotify_device *dev;
@@ -280,11 +324,25 @@
 
 	mutex_lock(&dev->ev_mutex);
 
+	if (mask & IN_JOURNAL_COMMIT) {
+		if (!(w->inode) || !(w->inode->i_sb)
+			|| !(w->inode->i_sb->s_priv_inode)
+			|| !(w->inode->i_sb->s_journal_cookie))
+			goto out;
+	}
 	/* we can safely put the watch as we don't reference it while
 	 * generating the event
 	 */
-	if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
+	if (mask & IN_IGNORED || w->mask & IN_ONESHOT) {
+                if (w->mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = NULL;
+                        if (w->inode && w->inode->i_sb)
+                                sb = w->inode->i_sb;
+                        if (sb && sb->s_priv_inode)
+                                sb->s_priv_inode = NULL;
+                }
 		put_inotify_watch(w); /* final put */
+	}
 
 	/* coalescing: drop this event if it is a dupe of the previous */
 	last = inotify_dev_get_last_event(dev);
@@ -304,9 +362,9 @@
 
 	/* if the queue overflows, we need to notify user space */
 	if (unlikely(dev->event_count == dev->max_events))
-		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
+		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL, 0);
 	else
-		kevent = kernel_event(wd, mask, cookie, name);
+		kevent = kernel_event(wd, mask, cookie, name, journal_cookie);
 
 	if (unlikely(!kevent))
 		goto out;
@@ -686,6 +744,13 @@
 	ret = inotify_find_update_watch(dev->ih, inode, mask);
 	if (ret == -ENOENT)
 		ret = create_watch(dev, inode, mask);
+	if (ret >= 0) {
+		if (mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = inode->i_sb;
+                        if (sb)
+                                sb->s_priv_inode = inode;
+		}
+	}
 	mutex_unlock(&dev->up_mutex);
 
 	path_put(&path);
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2008-09-15 10:12:42.000000000 -0700
+++ linux-2.6/fs/jbd/commit.c	2008-09-15 10:13:44.000000000 -0700
@@ -928,6 +928,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	journal_notify_commit(journal);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd2/commit.c	2008-09-15 10:14:00.000000000 -0700
+++ linux-2.6/fs/jbd2/commit.c	2008-09-15 10:16:09.000000000 -0700
@@ -990,6 +990,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	jbd2_journal_notify_commit(journal);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-09-15 10:12:47.000000000 -0700
+++ linux-2.6/include/linux/fs.h	2008-09-15 10:13:44.000000000 -0700
@@ -1106,6 +1106,8 @@
 	char s_id[32];				/* Informational name */
 
 	void 			*s_fs_info;	/* Filesystem private info */
+	u64			s_journal_cookie;	/* current journal cookie # */
+	struct inode		*s_priv_inode;	/* used for journal notifications */
 
 	/*
 	 * The next field is for VFS *only*. No filesystems have any business
Index: linux-2.6/include/linux/fsnotify.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify.h	2008-09-15 10:12:53.000000000 -0700
+++ linux-2.6/include/linux/fsnotify.h	2008-09-15 10:13:44.000000000 -0700
@@ -34,6 +34,17 @@
 	inotify_d_move(entry);
 }
 
+static inline void fsnotify_journal_commit (struct super_block *sb)
+{
+	struct inode *inode;
+	sb->s_journal_cookie += 1;
+	if (sb->s_journal_cookie && sb->s_priv_inode) {
+		inode = sb->s_priv_inode;
+		inotify_inode_queue_event(inode, IN_JOURNAL_COMMIT, 0,
+					  NULL, NULL);
+	}
+}
+
 /*
  * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
  */
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h	2008-09-15 10:12:59.000000000 -0700
+++ linux-2.6/include/linux/inotify.h	2008-09-15 10:13:44.000000000 -0700
@@ -44,6 +44,23 @@
 #define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
 #define IN_IGNORED		0x00008000	/* File was ignored */
 
+/* This is a journal commit event for file systems. These events are sent
+ * from a journaling filesystem or the journal itself.
+ *
+ * IN_JOURNAL_COMMIT is special in that only one file per journaling filesystem
+ * partition (super-block) is allowed. Registering for this event with multiple
+ * files will overwrite previous registrations. The last registration for this
+ * event will be taken as current and commit notifications will be delivered
+ * for that inode only.
+ *
+ * The inode registered for this event should not be unlinked. This will disable
+ * any further journal commit notifications.
+ *
+ * If this event is not registered for, journal commit messages will not be
+ * delivered
+ */
+#define IN_JOURNAL_COMMIT	0x00010000	/* Journal on this superblock has committed a transaction */
+
 /* helper events */
 #define IN_CLOSE		(IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
 #define IN_MOVE			(IN_MOVED_FROM | IN_MOVED_TO) /* moves */
@@ -97,7 +114,7 @@
 
 struct inotify_operations {
 	void (*handle_event)(struct inotify_watch *, u32, u32, u32,
-			     const char *, struct inode *);
+			     const char *, struct inode *, u64);
 	void (*destroy_watch)(struct inotify_watch *);
 };
 
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h	2008-09-15 10:13:08.000000000 -0700
+++ linux-2.6/include/linux/jbd.h	2008-09-15 10:13:44.000000000 -0700
@@ -114,6 +114,7 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+extern void journal_notify_commit(journal_t *journal);
 #endif
 
 /*
Index: linux-2.6/include/linux/jbd2.h
===================================================================
--- linux-2.6.orig/include/linux/jbd2.h	2008-09-15 10:16:18.000000000 -0700
+++ linux-2.6/include/linux/jbd2.h	2008-09-15 10:16:45.000000000 -0700
@@ -115,6 +115,7 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+extern void jbd2_journal_notify_commit(journal_t *journal);
 #endif
 
 /*


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 19:36     ` Abhijit Paithankar
@ 2008-09-15 23:37       ` Andreas Dilger
  2008-09-16 23:05         ` Abhijit Paithankar
  2008-09-20  5:15       ` Jan Kara
  1 sibling, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2008-09-15 23:37 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

On Sep 15, 2008  12:36 -0700, Abhijit Paithankar wrote:
> In this version of the patch I've done away with most of the file-system
> specific changes. The file system now only needs to send out a journal
> commit notification.

Is there a reason NOT to use the journal commit callback mechanism that
I posted?  This would only require registering a single callback for
each transaction for the inotify, but has the benefit of being completely
generic and can be used for other commit notifiers in the future.

Most of the rest of my comments are related to following the Linux
coding style.  Please also generate your patch with "diff -up" so
that the context includes the function name.

> +void journal_notify_commit (journal_t *journal)
> +{

No space between the function name and the '(', journal_notify_commit(

> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}

Please add an empty line between variable declarations and the code.

> +void jbd2_journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}

Same comments as above.

> +		if (journal_cookie) {
> +			size_t len, rem, event_size = sizeof(struct inotify_event);

Wrap at 80 columns, likely putting "size_t event_size = ..." on a new line.

>  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
> +	if (mask & IN_JOURNAL_COMMIT) {
> +		if (!(w->inode) || !(w->inode->i_sb)
> +			|| !(w->inode->i_sb->s_priv_inode)
> +			|| !(w->inode->i_sb->s_journal_cookie))
> +			goto out;
> +	}

Operators like "||" should go at the end of the line, and the new line
should be aligned with the opening parenthesis:

		if (!(w->inode) || !(w->inode->i_sb) ||
		    !(w->inode->i_sb->s_priv_inode) ||
		    !(w->inode->i_sb->s_journal_cookie))

> +	journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);

Note that this is also a "late" notification.  At phase 7 the transaction
is complete and committed to disk, the rest of journal_commit_transaction()
is just doing cleanup and background processing.

> +/* This is a journal commit event for file systems. These events are sent
> + * from a journaling filesystem or the journal itself.
> + *
> + * IN_JOURNAL_COMMIT is special in that only one file per journaling
> + * filesystem partition (super-block) is allowed. Registering for this
> + * event with multiple files will overwrite previous registrations.

This is a major limitation of your implementation - what if 2 applications
want to be notified of this event?  Using the journal_callback_set() API
in the patch I sent out would allow an arbitrary number of applications
to monitor the commit.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 23:37       ` Andreas Dilger
@ 2008-09-16 23:05         ` Abhijit Paithankar
  2008-09-19 20:34           ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Abhijit Paithankar @ 2008-09-16 23:05 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

Hi,

Thanks for your feedback.
My comments are inline.

* Andreas Dilger (adilger@sun.com) wrote:
> On Sep 15, 2008  12:36 -0700, Abhijit Paithankar wrote:
> > In this version of the patch I've done away with most of the file-system
> > specific changes. The file system now only needs to send out a journal
> > commit notification.
> 
> Is there a reason NOT to use the journal commit callback mechanism that
> I posted?  This would only require registering a single callback for
> each transaction for the inotify, but has the benefit of being completely
> generic and can be used for other commit notifiers in the future.
> 

In the current patch if two applications want to be notified on the journal
commit event they could register for IN_JOURNAL_COMMIT on the same inode,
which could be the root inode of the file system.

There is no limit on the number of applications that could register on the
root inode for journal commit notifications. The only limitation is that
they all have to register on the same inode for that partition.

The journal commit callback mechanism could be viable way to notify
journal commits. However it is not clear to me how an application would
register the callback.

I suspect we need to have some other mechanism to register that for the
application.


> Most of the rest of my comments are related to following the Linux
> coding style.  Please also generate your patch with "diff -up" so
> that the context includes the function name.
> 
> > +void journal_notify_commit (journal_t *journal)
> > +{
> 
> No space between the function name and the '(', journal_notify_commit(
ok, will fix.

> 
> > +	struct super_block *sb = journal->j_private;
> > +	if (sb->s_priv_inode && sb->s_journal_cookie)
> > +		fsnotify_journal_commit(sb);
> > +}
> 
> Please add an empty line between variable declarations and the code.
ack.

> 
> > +void jbd2_journal_notify_commit (journal_t *journal)
> > +{
> > +	struct super_block *sb = journal->j_private;
> > +	if (sb->s_priv_inode && sb->s_journal_cookie)
> > +		fsnotify_journal_commit(sb);
> > +}
> 
> Same comments as above.
ack.

> 
> > +		if (journal_cookie) {
> > +			size_t len, rem, event_size = sizeof(struct inotify_event);
> 
> Wrap at 80 columns, likely putting "size_t event_size = ..." on a new line.
sure.

> 
> >  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
> > +	if (mask & IN_JOURNAL_COMMIT) {
> > +		if (!(w->inode) || !(w->inode->i_sb)
> > +			|| !(w->inode->i_sb->s_priv_inode)
> > +			|| !(w->inode->i_sb->s_journal_cookie))
> > +			goto out;
> > +	}
> 
> Operators like "||" should go at the end of the line, and the new line
> should be aligned with the opening parenthesis:
> 
> 		if (!(w->inode) || !(w->inode->i_sb) ||
> 		    !(w->inode->i_sb->s_priv_inode) ||
> 		    !(w->inode->i_sb->s_journal_cookie))
> 
will do.

> > +	journal_notify_commit(journal);
> >  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
> >  		  journal->j_commit_sequence, journal->j_tail_sequence);
> 
> Note that this is also a "late" notification.  At phase 7 the transaction
> is complete and committed to disk, the rest of journal_commit_transaction()
> is just doing cleanup and background processing.
Yes, will move it up.

> 
> > +/* This is a journal commit event for file systems. These events are sent
> > + * from a journaling filesystem or the journal itself.
> > + *
> > + * IN_JOURNAL_COMMIT is special in that only one file per journaling
> > + * filesystem partition (super-block) is allowed. Registering for this
> > + * event with multiple files will overwrite previous registrations.
> 
> This is a major limitation of your implementation - what if 2 applications
> want to be notified of this event?  Using the journal_callback_set() API
> in the patch I sent out would allow an arbitrary number of applications
> to monitor the commit.

Any number of applications can register for the journal event as long as they
register on the same inode.

It is not clear to me how using journal_callback_set API would change that if
we use inotify to register callbacks.

Thanks!
Abhijit

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-16 23:05         ` Abhijit Paithankar
@ 2008-09-19 20:34           ` Andreas Dilger
  2008-09-20 15:51             ` Jamie Lokier
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2008-09-19 20:34 UTC (permalink / raw)
  To: Abhijit Paithankar
  Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner,
	Joel Becker

On Sep 16, 2008  16:05 -0700, Abhijit Paithankar wrote:
> > Is there a reason NOT to use the journal commit callback mechanism that
> > I posted?  This would only require registering a single callback for
> > each transaction for the inotify, but has the benefit of being completely
> > generic and can be used for other commit notifiers in the future.
> 
> In the current patch if two applications want to be notified on the journal
> commit event they could register for IN_JOURNAL_COMMIT on the same inode,
> which could be the root inode of the file system.
> 
> There is no limit on the number of applications that could register on the
> root inode for journal commit notifications. The only limitation is that
> they all have to register on the same inode for that partition.
>
> Any number of applications can register for the journal event as long as they
> register on the same inode.

That assumes agreement between the applications that are using this
interface.  It isn't at all desirable that applications have to know
the "mountpoint" of the filesystem in order to use inotify, and in
some cases (e.g. bind mount in a new namespace) there isn't even access
to the root inode.

> The journal commit callback mechanism could be viable way to notify
> journal commits. However it is not clear to me how an application would
> register the callback.

The application registers for the callback via the inotify interface,
which in turn needs to have an interface to the filesystem to use the
fs-specific transaction commit callback.

> It is not clear to me how using journal_callback_set API would change
> that if we use inotify to register callbacks.

The journal_callback_set() API doesn't need to change at all (though in
fact Joel Becker of OCFS2 is implementing a more generic API that can be
used for other kinds of commit callbacks.  He will hopefully post it to
this list soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 19:36     ` Abhijit Paithankar
  2008-09-15 23:37       ` Andreas Dilger
@ 2008-09-20  5:15       ` Jan Kara
  2008-09-23 22:26         ` Andreas Dilger
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-09-20  5:15 UTC (permalink / raw)
  To: Abhijit Paithankar
  Cc: Andreas Dilger, linux-fsdevel, linux-ext4, Jamie Lokier,
	Dave Chinner

  Hello,

> I have used inotify simply because it is a mechanism that already exists
> in the kernel and is easily extensible.
> 
> The use of transaction ids as "journal cookie" was done to simplify
> things. A journal transaction ID is a readily available monotonically
> increasing ID unique to that partition (super block).
> 
> Journal cookies themselves are completely opaque to the application
> and as long as it is a monotonically increasing number that signifies
> a journal commit, it could be anything that a file system chooses it
> to be.
> 
> Ext3/Ext4 gather multiple changes into a single transaction. So,
> you will see a single 'commit' notification for multiple of meta-data
> changes. The 'commit' messages, therefore, won't be that many.
  I have quite some comments (additional to Andreas') to the coding
style but let's put them aside (they can be easily fixed) and first
solve some more fundamental problems:
  1) At the first sight it seems to be a useful idea to send to
userspace notifications when some change is safely on disk. But at the
second sight, it's technically really not that simple as it may seem.
Filesystems such as ext2 have no good way of providing such information,
even ext4 with delayed allocation has no way of giving you such
information for writes and I guess similar thing holds for xfs. Also log
structured filesystems and similar can tell you when a change is safely
on disk but have no concept of something like journal commit. So at this
point one should really ask himself whether a feature having a chance to
work only on a limited number of setups has a serious chance of being
used by application developpers...
  Also note that on ext3 and ext4 a single 'write' call from userspace
can be split among several transactions but there's just one inotify
event so this has to be taken care of.
  2) If I've understood correctly, you append to the name in the inotify
event the journal cookie. Sorry, but that is a really ugly hack,
definitely not a way how an interface to the userspace should look like.
Actually, the 'cookie' argument of inotify seems to quite match your
needs - it is supposed to be used for connecting related events and that
is exactly what you need. The only problem with this is that rename
already uses the cookie and so you cannot change it arbitrarily.
  3) Using just a single inode for commit notification is ugly and as
Andreas said not even workable in some cases. A sensible semantics here
could be that you'd simply send "changes safely on disk" event to all
inodes where some change went to disk. That still might have some
technical issues (like being too slow, having to identify which inodes
should get the event etc.) but would at least look sound from the
logical point of view.

  I'm sorry to turn your ideas down like this. I still appreciate
you're trying to improve Linux :). Thanks.

								Honza

> This patch (v2):
> 
> In this version of the patch I've done away with most of the file-system
> specific changes. The file system now only needs to send out a journal
> commit notification.
> 
> The 'journal cookie' is simply a u64 that increases every time a journal
> 'commit' is made. This will make the code more portable to other file
> systems.
> 
> Also, we no longer need those many journal message types. We can easily
> piggy-back on top of fsnotify. The only remaining one is
> IN_JOURNAL_COMMIT.
> 
> My next step is to allow multiple registrations for JOURNAL_COMMIT
> notification. I will post that patch once it is ready.
> 
> Signed-off-by: Abhijit Paithankar <apaithan@akamai.com>
> ---
> 
>  fs/ext3/ext3_jbd.c       |    9 +++++
>  fs/ext4/ext4_jbd2.c      |    9 +++++
>  fs/inotify.c             |   17 ++++++---
>  fs/inotify_user.c        |   81 ++++++++++++++++++++++++++++++++++++++++++-----
>  fs/jbd/commit.c          |    1 
>  fs/jbd2/commit.c         |    1 
>  include/linux/fs.h       |    2 +
>  include/linux/fsnotify.h |   11 ++++++
>  include/linux/inotify.h  |   19 ++++++++++-
>  include/linux/jbd.h      |    1 
>  include/linux/jbd2.h     |    1 
>  11 files changed, 138 insertions(+), 14 deletions(-)
> 
> ---
> 
> 
> Index: linux-2.6/fs/ext3/ext3_jbd.c
> ===================================================================
> --- linux-2.6.orig/fs/ext3/ext3_jbd.c	2008-09-15 10:12:25.000000000 -0700
> +++ linux-2.6/fs/ext3/ext3_jbd.c	2008-09-15 10:15:54.000000000 -0700
> @@ -3,6 +3,7 @@
>   */
>  
>  #include <linux/ext3_jbd.h>
> +#include <linux/fsnotify.h>
>  
>  int __ext3_journal_get_undo_access(const char *where, handle_t *handle,
>  				struct buffer_head *bh)
> @@ -57,3 +58,11 @@
>  		ext3_journal_abort_handle(where, __func__, bh, handle,err);
>  	return err;
>  }
> +
> +void journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}
> +
> Index: linux-2.6/fs/ext4/ext4_jbd2.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/ext4_jbd2.c	2008-09-15 10:14:12.000000000 -0700
> +++ linux-2.6/fs/ext4/ext4_jbd2.c	2008-09-15 10:15:50.000000000 -0700
> @@ -2,6 +2,7 @@
>   * Interface between ext4 and JBD
>   */
>  
> +#include <linux/fsnotify.h>
>  #include "ext4_jbd2.h"
>  
>  int __ext4_journal_get_undo_access(const char *where, handle_t *handle,
> @@ -57,3 +58,11 @@
>  		ext4_journal_abort_handle(where, __func__, bh, handle, err);
>  	return err;
>  }
> +
> +void jbd2_journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}
> +
> Index: linux-2.6/fs/inotify.c
> ===================================================================
> --- linux-2.6.orig/fs/inotify.c	2008-09-15 10:12:29.000000000 -0700
> +++ linux-2.6/fs/inotify.c	2008-09-15 10:13:44.000000000 -0700
> @@ -231,7 +231,7 @@
>  				 struct inotify_watch *watch)
>  {
>  	remove_watch_no_event(watch, ih);
> -	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL);
> +	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL, 0);
>  }
>  EXPORT_SYMBOL_GPL(inotify_remove_watch_locked);
>  
> @@ -280,10 +280,17 @@
>  			       const char *name, struct inode *n_inode)
>  {
>  	struct inotify_watch *watch, *next;
> +	u64 journal_cookie = 0;
>  
>  	if (!inotify_inode_watched(inode))
>  		return;
>  
> +	if (mask & (IN_ATTRIB | IN_MOVE | IN_CREATE | IN_DELETE |
> +		    IN_DELETE_SELF | IN_MOVE_SELF)) {
> +		if (inode->i_sb && inode->i_sb->s_journal_cookie)
> +			journal_cookie = inode->i_sb->s_journal_cookie;
> +	}
> +
>  	mutex_lock(&inode->inotify_mutex);
>  	list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) {
>  		u32 watch_mask = watch->mask;
> @@ -293,7 +300,7 @@
>  			if (watch_mask & IN_ONESHOT)
>  				remove_watch_no_event(watch, ih);
>  			ih->in_ops->handle_event(watch, watch->wd, mask, cookie,
> -						 name, n_inode);
> +						 name, n_inode, journal_cookie);
>  			mutex_unlock(&ih->mutex);
>  		}
>  	}
> @@ -409,7 +416,7 @@
>  			struct inotify_handle *ih= watch->ih;
>  			mutex_lock(&ih->mutex);
>  			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
> -						 NULL, NULL);
> +						 NULL, NULL, 0);
>  			inotify_remove_watch_locked(ih, watch);
>  			mutex_unlock(&ih->mutex);
>  		}
> @@ -576,7 +583,7 @@
>  		mask_add = 1;
>  
>  	/* don't allow invalid bits: we don't want flags set */
> -	mask &= IN_ALL_EVENTS | IN_ONESHOT;
> +	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
>  	if (unlikely(!mask))
>  		return -EINVAL;
>  
> @@ -623,7 +630,7 @@
>  	int newly_watched;
>  
>  	/* don't allow invalid bits: we don't want flags set */
> -	mask &= IN_ALL_EVENTS | IN_ONESHOT;
> +	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
>  	if (unlikely(!mask))
>  		return -EINVAL;
>  	watch->mask = mask;
> Index: linux-2.6/fs/inotify_user.c
> ===================================================================
> --- linux-2.6.orig/fs/inotify_user.c	2008-09-15 10:12:34.000000000 -0700
> +++ linux-2.6/fs/inotify_user.c	2008-09-15 10:51:43.000000000 -0700
> @@ -185,7 +185,7 @@
>   * This function can sleep.
>   */
>  static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
> -						  const char *name)
> +						  const char *name, u64 journal_cookie)
>  {
>  	struct inotify_kernel_event *kevent;
>  
> @@ -204,6 +204,7 @@
>  
>  	if (name) {
>  		size_t len, rem, event_size = sizeof(struct inotify_event);
> +		size_t journal_cookie_sz;
>  
>  		/*
>  		 * We need to pad the filename so as to properly align an
> @@ -213,6 +214,10 @@
>  		 * simple and safe for all architectures.
>  		 */
>  		len = strlen(name) + 1;
> +		if (journal_cookie) {
> +			journal_cookie_sz = sizeof(u64);
> +			len = len + journal_cookie_sz;
> +		}
>  		rem = event_size - len;
>  		if (len > event_size) {
>  			rem = event_size - (len % event_size);
> @@ -225,13 +230,52 @@
>  			kmem_cache_free(event_cachep, kevent);
>  			return NULL;
>  		}
> -		memcpy(kevent->name, name, len);
> +		if (journal_cookie) {
> +			char *ch;
> +
> +			memcpy(kevent->name, name, strlen(name));
> +			ch = kevent->name;
> +			ch += strlen(name);
> +			memset(ch++, 0, 1);
> +			sprintf(ch, "%llu", journal_cookie);
> +		}
> +		else {
> +			memcpy(kevent->name, name, len);
> +		}
>  		if (rem)
>  			memset(kevent->name + len, 0, rem);
>  		kevent->event.len = len + rem;
>  	} else {
> -		kevent->event.len = 0;
> -		kevent->name = NULL;
> +		if (journal_cookie) {
> +			size_t len, rem, event_size = sizeof(struct inotify_event);
> +			size_t journal_cookie_sz = sizeof(u64);
> +			char *ch;
> +
> +			len = 1 + journal_cookie_sz;
> +			rem = event_size - len;
> +			if (len > event_size) {
> +				rem = event_size - (len % event_size);
> +				if (len % event_size == 0)
> +					rem = 0;
> +			}
> +
> +			kevent->name = kmalloc(len + rem, GFP_KERNEL);
> +			if (unlikely(!kevent->name)) {
> +				kmem_cache_free(event_cachep, kevent);
> +				return NULL;
> +			}
> +			memset(kevent->name, 0, 1);
> +			ch = kevent->name;
> +			ch += 1;
> +			sprintf(ch, "%llu", journal_cookie);
> +			if (rem)
> +				memset(kevent->name + len, 0, rem);
> +			kevent->event.len = len + rem;
> +		}
> +		else {
> +			kevent->event.len = 0;
> +			kevent->name = NULL;
> +		}
>  	}
>  
>  	return kevent;
> @@ -269,7 +313,7 @@
>   */
>  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
>  				    u32 cookie, const char *name,
> -				    struct inode *ignored)
> +				    struct inode *ignored, u64 journal_cookie)
>  {
>  	struct inotify_user_watch *watch;
>  	struct inotify_device *dev;
> @@ -280,11 +324,25 @@
>  
>  	mutex_lock(&dev->ev_mutex);
>  
> +	if (mask & IN_JOURNAL_COMMIT) {
> +		if (!(w->inode) || !(w->inode->i_sb)
> +			|| !(w->inode->i_sb->s_priv_inode)
> +			|| !(w->inode->i_sb->s_journal_cookie))
> +			goto out;
> +	}
>  	/* we can safely put the watch as we don't reference it while
>  	 * generating the event
>  	 */
> -	if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
> +	if (mask & IN_IGNORED || w->mask & IN_ONESHOT) {
> +                if (w->mask & IN_JOURNAL_COMMIT) {
> +                        struct super_block *sb = NULL;
> +                        if (w->inode && w->inode->i_sb)
> +                                sb = w->inode->i_sb;
> +                        if (sb && sb->s_priv_inode)
> +                                sb->s_priv_inode = NULL;
> +                }
>  		put_inotify_watch(w); /* final put */
> +	}
>  
>  	/* coalescing: drop this event if it is a dupe of the previous */
>  	last = inotify_dev_get_last_event(dev);
> @@ -304,9 +362,9 @@
>  
>  	/* if the queue overflows, we need to notify user space */
>  	if (unlikely(dev->event_count == dev->max_events))
> -		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
> +		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL, 0);
>  	else
> -		kevent = kernel_event(wd, mask, cookie, name);
> +		kevent = kernel_event(wd, mask, cookie, name, journal_cookie);
>  
>  	if (unlikely(!kevent))
>  		goto out;
> @@ -686,6 +744,13 @@
>  	ret = inotify_find_update_watch(dev->ih, inode, mask);
>  	if (ret == -ENOENT)
>  		ret = create_watch(dev, inode, mask);
> +	if (ret >= 0) {
> +		if (mask & IN_JOURNAL_COMMIT) {
> +                        struct super_block *sb = inode->i_sb;
> +                        if (sb)
> +                                sb->s_priv_inode = inode;
> +		}
> +	}
>  	mutex_unlock(&dev->up_mutex);
>  
>  	path_put(&path);
> Index: linux-2.6/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd/commit.c	2008-09-15 10:12:42.000000000 -0700
> +++ linux-2.6/fs/jbd/commit.c	2008-09-15 10:13:44.000000000 -0700
> @@ -928,6 +928,7 @@
>  	}
>  	spin_unlock(&journal->j_list_lock);
>  
> +	journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
>  
> Index: linux-2.6/fs/jbd2/commit.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd2/commit.c	2008-09-15 10:14:00.000000000 -0700
> +++ linux-2.6/fs/jbd2/commit.c	2008-09-15 10:16:09.000000000 -0700
> @@ -990,6 +990,7 @@
>  	}
>  	spin_unlock(&journal->j_list_lock);
>  
> +	jbd2_journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
>  
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2008-09-15 10:12:47.000000000 -0700
> +++ linux-2.6/include/linux/fs.h	2008-09-15 10:13:44.000000000 -0700
> @@ -1106,6 +1106,8 @@
>  	char s_id[32];				/* Informational name */
>  
>  	void 			*s_fs_info;	/* Filesystem private info */
> +	u64			s_journal_cookie;	/* current journal cookie # */
> +	struct inode		*s_priv_inode;	/* used for journal notifications */
>  
>  	/*
>  	 * The next field is for VFS *only*. No filesystems have any business
> Index: linux-2.6/include/linux/fsnotify.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fsnotify.h	2008-09-15 10:12:53.000000000 -0700
> +++ linux-2.6/include/linux/fsnotify.h	2008-09-15 10:13:44.000000000 -0700
> @@ -34,6 +34,17 @@
>  	inotify_d_move(entry);
>  }
>  
> +static inline void fsnotify_journal_commit (struct super_block *sb)
> +{
> +	struct inode *inode;
> +	sb->s_journal_cookie += 1;
> +	if (sb->s_journal_cookie && sb->s_priv_inode) {
> +		inode = sb->s_priv_inode;
> +		inotify_inode_queue_event(inode, IN_JOURNAL_COMMIT, 0,
> +					  NULL, NULL);
> +	}
> +}
> +
>  /*
>   * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
>   */
> Index: linux-2.6/include/linux/inotify.h
> ===================================================================
> --- linux-2.6.orig/include/linux/inotify.h	2008-09-15 10:12:59.000000000 -0700
> +++ linux-2.6/include/linux/inotify.h	2008-09-15 10:13:44.000000000 -0700
> @@ -44,6 +44,23 @@
>  #define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
>  #define IN_IGNORED		0x00008000	/* File was ignored */
>  
> +/* This is a journal commit event for file systems. These events are sent
> + * from a journaling filesystem or the journal itself.
> + *
> + * IN_JOURNAL_COMMIT is special in that only one file per journaling filesystem
> + * partition (super-block) is allowed. Registering for this event with multiple
> + * files will overwrite previous registrations. The last registration for this
> + * event will be taken as current and commit notifications will be delivered
> + * for that inode only.
> + *
> + * The inode registered for this event should not be unlinked. This will disable
> + * any further journal commit notifications.
> + *
> + * If this event is not registered for, journal commit messages will not be
> + * delivered
> + */
> +#define IN_JOURNAL_COMMIT	0x00010000	/* Journal on this superblock has committed a transaction */
> +
>  /* helper events */
>  #define IN_CLOSE		(IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
>  #define IN_MOVE			(IN_MOVED_FROM | IN_MOVED_TO) /* moves */
> @@ -97,7 +114,7 @@
>  
>  struct inotify_operations {
>  	void (*handle_event)(struct inotify_watch *, u32, u32, u32,
> -			     const char *, struct inode *);
> +			     const char *, struct inode *, u64);
>  	void (*destroy_watch)(struct inotify_watch *);
>  };
>  
> Index: linux-2.6/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.orig/include/linux/jbd.h	2008-09-15 10:13:08.000000000 -0700
> +++ linux-2.6/include/linux/jbd.h	2008-09-15 10:13:44.000000000 -0700
> @@ -114,6 +114,7 @@
>   * This is an opaque datatype.
>   **/
>  typedef struct journal_s	journal_t;	/* Journal control structure */
> +extern void journal_notify_commit(journal_t *journal);
>  #endif
>  
>  /*
> Index: linux-2.6/include/linux/jbd2.h
> ===================================================================
> --- linux-2.6.orig/include/linux/jbd2.h	2008-09-15 10:16:18.000000000 -0700
> +++ linux-2.6/include/linux/jbd2.h	2008-09-15 10:16:45.000000000 -0700
> @@ -115,6 +115,7 @@
>   * This is an opaque datatype.
>   **/
>  typedef struct journal_s	journal_t;	/* Journal control structure */
> +extern void jbd2_journal_notify_commit(journal_t *journal);
>  #endif
>  
>  /*
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-19 20:34           ` Andreas Dilger
@ 2008-09-20 15:51             ` Jamie Lokier
  0 siblings, 0 replies; 8+ messages in thread
From: Jamie Lokier @ 2008-09-20 15:51 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Abhijit Paithankar, linux-fsdevel, linux-ext4, Dave Chinner,
	Joel Becker

Andreas Dilger wrote:
> That assumes agreement between the applications that are using this
> interface.  It isn't at all desirable that applications have to know
> the "mountpoint" of the filesystem in order to use inotify, and in
> some cases (e.g. bind mount in a new namespace) there isn't even access
> to the root inode.

A filesystem's root inode needn't be mounted at all.

You don't need a new namespace - bind mount is enough by itself.

It has me wondering - how can an application even tell when it has the
root inode of a filesystem?  You can't tell from /etc/mtab or
/proc/mounts, nor from traversing the filesystem itself - except for
filesystems where you know the expected inode number of the root inode.

-- Jamie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-20  5:15       ` Jan Kara
@ 2008-09-23 22:26         ` Andreas Dilger
  0 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2008-09-23 22:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Abhijit Paithankar, linux-fsdevel, linux-ext4, Jamie Lokier,
	Dave Chinner

On Sep 20, 2008  07:15 +0200, Jan Kara wrote:
>   1) At the first sight it seems to be a useful idea to send to
> userspace notifications when some change is safely on disk. But at the
> second sight, it's technically really not that simple as it may seem.
> Filesystems such as ext2 have no good way of providing such information,
> even ext4 with delayed allocation has no way of giving you such
> information for writes and I guess similar thing holds for xfs. Also log
> structured filesystems and similar can tell you when a change is safely
> on disk but have no concept of something like journal commit. So at this
> point one should really ask himself whether a feature having a chance to
> work only on a limited number of setups has a serious chance of being
> used by application developpers...
>   Also note that on ext3 and ext4 a single 'write' call from userspace
> can be split among several transactions but there's just one inotify
> event so this has to be taken care of.

I think the assumption is that this isn't an API used by "generic"
applications, but rather ones that are being optimized for the system
they are running on.  ext3/4 is common enough that this isn't a very
limiting optimization.  I don't think Akamai are going to care about
running on JFFS2 or VFAT.  Probably registering for such events on
filesystems that don't understand them should just return an error.

We'll be stuck in 1981 forever if we continue to limit interfaces to
what all filesystems understand.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-09-23 22:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20080912220649.GB26999@apaithan-desktop.sanmateo.corp.akamai.com>
     [not found] ` <20080915023126.GF4090@webber.adilger.int>
2008-09-15  5:39   ` [RFC PATCH] Filesystem Journal Notifications Andreas Dilger
2008-09-15 19:36     ` Abhijit Paithankar
2008-09-15 23:37       ` Andreas Dilger
2008-09-16 23:05         ` Abhijit Paithankar
2008-09-19 20:34           ` Andreas Dilger
2008-09-20 15:51             ` Jamie Lokier
2008-09-20  5:15       ` Jan Kara
2008-09-23 22:26         ` Andreas Dilger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox