[RFC PATCH] Filesystem Journal Notifications

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] Filesystem Journal Notifications
@ 2008-09-12 22:06 Abhijit Paithankar
  2008-09-13  9:19 ` Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Abhijit Paithankar @ 2008-09-12 22:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andreas Dilger

Hi,

On journaling file systems, applications need to know when the meta-data
changes to files are written to disks in a manner which is persistent to
crashes and reboots.

One way to do it is to fsync every few operations. However, fsync is
blocking and affects performance.

The other (more efficient) way is to have the filesystem notify the
application when a transaction/change is written to disk.

This patch implements such a notification mechanism based on inotify for
Ext3 and Ext4.

Every time an application creates, renames, unlinks, etc a file or
directory, a journal_cookie is sent via the inotify mechanism. Ofcourse,
the application has to first register for such notifications on the
files/directories of interest.

Once the transaction corresponding to that journal_cookie is committed
to the journal another notification is sent to the application
indicating that the change is now persistent.

This initial patch includes changes to the inotify layer, ext3-fs, jbd,
ext4-fs and jbd2. It is based on the latest Linus' git tree.


Signed-off-by: Abhijit Paithankar <apaithan@akamai.com>
---

 fs/ext3/namei.c               |   44 ++++++++++++++++
 fs/ext4/ext4_jbd2.h           |    4 +
 fs/ext4/namei.c               |   40 ++++++++++++++
 fs/inotify.c                  |   20 ++++---
 fs/inotify_user.c             |   79 ++++++++++++++++++++++++++---
 fs/jbd/commit.c               |    1
 fs/jbd/journal.c              |    7 ++
 fs/jbd2/commit.c              |    1
 fs/jbd2/journal.c             |    7 ++
 include/linux/ext3_fsnotify.h |   83 ++++++++++++++++++++++++++++++
 include/linux/ext3_jbd.h      |    5 +
 include/linux/ext4_fsnotify.h |   81 ++++++++++++++++++++++++++++++
 include/linux/fs.h            |    1
 include/linux/fsnotify.h      |  113 +++++++++++++++++++++++++++++++++---------
 include/linux/inotify.h       |   32 ++++++++++-
 include/linux/jbd.h           |    4 +
 include/linux/jbd2.h          |    4 +
 include/linux/journal-head.h  |    1
 18 files changed, 486 insertions(+), 41 deletions(-)

---

Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c	2008-09-11 15:48:16.000000000 -0700
+++ linux-2.6/fs/ext3/namei.c	2008-09-12 12:17:11.000000000 -0700
@@ -36,6 +36,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 #include <linux/bio.h>
+#include <linux/ext3_fsnotify.h>
 
 #include "namei.h"
 #include "xattr.h"
@@ -1709,6 +1710,10 @@
 		ext3_set_aops(inode);
 		err = ext3_add_nondir(handle, dentry, inode);
 	}
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_create(dir, dentry, journal_cookie);
+	}
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -1744,6 +1749,10 @@
 #endif
 		err = ext3_add_nondir(handle, dentry, inode);
 	}
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_create(dir, dentry, journal_cookie);
+	}
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -1817,6 +1826,12 @@
 	ext3_update_dx_flag(dir);
 	ext3_mark_inode_dirty(handle, dir);
 	d_instantiate(dentry, inode);
+
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_mkdir(dir, dentry, journal_cookie);
+	}
+
 out_stop:
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
@@ -2093,6 +2108,10 @@
 	ext3_update_dx_flag(dir);
 	ext3_mark_inode_dirty(handle, dir);
 
+	if (!handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_rmdir(dentry, inode, journal_cookie);
+	}
 end_rmdir:
 	ext3_journal_stop(handle);
 	brelse (bh);
@@ -2147,6 +2166,11 @@
 	ext3_mark_inode_dirty(handle, inode);
 	retval = 0;
 
+	if (!handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_unlink(dentry, inode, journal_cookie);
+	}
+
 end_unlink:
 	ext3_journal_stop(handle);
 	brelse (bh);
@@ -2202,6 +2226,10 @@
 	}
 	EXT3_I(inode)->i_disksize = inode->i_size;
 	err = ext3_add_nondir(handle, dentry, inode);
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_create(dir, dentry, journal_cookie);
+	}
 out_stop:
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
@@ -2239,6 +2267,10 @@
 	atomic_inc(&inode->i_count);
 
 	err = ext3_add_nondir(handle, dentry, inode);
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+		ext3_fsnotify_link(dir, inode, dentry, journal_cookie);
+	}
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -2260,6 +2292,7 @@
 	struct buffer_head * old_bh, * new_bh, * dir_bh;
 	struct ext3_dir_entry_2 * old_de, * new_de;
 	int retval;
+	const char *old_name = NULL;
 
 	old_bh = new_bh = dir_bh = NULL;
 
@@ -2273,6 +2306,7 @@
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	old_name = ext3_fsnotify_oldname_init(old_dentry->d_name.name);
 	if (IS_DIRSYNC(old_dir) || IS_DIRSYNC(new_dir))
 		handle->h_sync = 1;
 
@@ -2397,7 +2431,17 @@
 	}
 	retval = 0;
 
+	if (!handle->h_sync) {
+		int isdir = S_ISDIR(old_dentry->d_inode->i_mode);
+		const char *new_name = new_dentry->d_name.name;
+		journal_cookie_t journal_cookie = ext3_journal_get_cookie(handle);
+
+		ext3_fsnotify_move(old_dir, new_dir, old_name, new_name, isdir,
+			new_dentry->d_inode, old_dentry, journal_cookie);
+	}
+
 end_rename:
+	ext3_fsnotify_oldname_free(old_name);
 	brelse (dir_bh);
 	brelse (old_bh);
 	brelse (new_bh);
Index: linux-2.6/fs/inotify.c
===================================================================
--- linux-2.6.orig/fs/inotify.c	2008-09-11 15:48:21.000000000 -0700
+++ linux-2.6/fs/inotify.c	2008-09-11 15:50:08.000000000 -0700
@@ -231,7 +231,7 @@
 				 struct inotify_watch *watch)
 {
 	remove_watch_no_event(watch, ih);
-	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL);
+	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL, 0);
 }
 EXPORT_SYMBOL_GPL(inotify_remove_watch_locked);
 
@@ -275,9 +275,12 @@
  * @cookie: cookie for synchronization, or zero
  * @name: filename, if any
  * @n_inode: inode associated with name
+ * @journal_cookie: journal cookie associated with this operation if the
+ * underlying filesystem supports journaling
  */
 void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie,
-			       const char *name, struct inode *n_inode)
+			       const char *name, struct inode *n_inode,
+			       journal_cookie_t journal_cookie)
 {
 	struct inotify_watch *watch, *next;
 
@@ -293,7 +296,7 @@
 			if (watch_mask & IN_ONESHOT)
 				remove_watch_no_event(watch, ih);
 			ih->in_ops->handle_event(watch, watch->wd, mask, cookie,
-						 name, n_inode);
+						 name, n_inode, journal_cookie);
 			mutex_unlock(&ih->mutex);
 		}
 	}
@@ -309,7 +312,8 @@
  * @name: filename, if any
  */
 void inotify_dentry_parent_queue_event(struct dentry *dentry, u32 mask,
-				       u32 cookie, const char *name)
+				       u32 cookie, const char *name,
+					journal_cookie_t journal_cookie)
 {
 	struct dentry *parent;
 	struct inode *inode;
@@ -325,7 +329,7 @@
 		dget(parent);
 		spin_unlock(&dentry->d_lock);
 		inotify_inode_queue_event(inode, mask, cookie, name,
-					  dentry->d_inode);
+					  dentry->d_inode, journal_cookie);
 		dput(parent);
 	} else
 		spin_unlock(&dentry->d_lock);
@@ -409,7 +413,7 @@
 			struct inotify_handle *ih= watch->ih;
 			mutex_lock(&ih->mutex);
 			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
-						 NULL, NULL);
+						 NULL, NULL, 0);
 			inotify_remove_watch_locked(ih, watch);
 			mutex_unlock(&ih->mutex);
 		}
@@ -576,7 +580,7 @@
 		mask_add = 1;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_ALL_EVENTS | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 
@@ -623,7 +627,7 @@
 	int newly_watched;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_ALL_EVENTS | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 	watch->mask = mask;
Index: linux-2.6/fs/inotify_user.c
===================================================================
--- linux-2.6.orig/fs/inotify_user.c	2008-09-11 15:48:28.000000000 -0700
+++ linux-2.6/fs/inotify_user.c	2008-09-11 16:42:41.000000000 -0700
@@ -185,7 +185,7 @@
  * This function can sleep.
  */
 static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
-						  const char *name)
+						  const char *name, journal_cookie_t journal_cookie)
 {
 	struct inotify_kernel_event *kevent;
 
@@ -204,6 +204,7 @@
 
 	if (name) {
 		size_t len, rem, event_size = sizeof(struct inotify_event);
+		size_t journal_cookie_sz;
 
 		/*
 		 * We need to pad the filename so as to properly align an
@@ -213,6 +214,10 @@
 		 * simple and safe for all architectures.
 		 */
 		len = strlen(name) + 1;
+		if (journal_cookie) {
+			journal_cookie_sz = sizeof(journal_cookie_t);
+			len = len + journal_cookie_sz;
+		}
 		rem = event_size - len;
 		if (len > event_size) {
 			rem = event_size - (len % event_size);
@@ -225,13 +230,52 @@
 			kmem_cache_free(event_cachep, kevent);
 			return NULL;
 		}
-		memcpy(kevent->name, name, len);
+		if (journal_cookie) {
+			char *ch;
+
+			memcpy(kevent->name, name, strlen(name));
+			ch = kevent->name;
+			ch += strlen(name);
+			memset(ch++, 0, 1);
+			sprintf(ch, "%d", journal_cookie);
+		}
+		else {
+			memcpy(kevent->name, name, len);
+		}
 		if (rem)
 			memset(kevent->name + len, 0, rem);
 		kevent->event.len = len + rem;
 	} else {
-		kevent->event.len = 0;
-		kevent->name = NULL;
+		if (journal_cookie) {
+			size_t len, rem, event_size = sizeof(struct inotify_event);
+			size_t journal_cookie_sz = sizeof(journal_cookie_t);
+			char *ch;
+
+			len = 1 + journal_cookie_sz;
+			rem = event_size - len;
+			if (len > event_size) {
+				rem = event_size - (len % event_size);
+				if (len % event_size == 0)
+					rem = 0;
+			}
+
+			kevent->name = kmalloc(len + rem, GFP_KERNEL);
+			if (unlikely(!kevent->name)) {
+				kmem_cache_free(event_cachep, kevent);
+				return NULL;
+			}
+			memset(kevent->name, 0, 1);
+			ch = kevent->name;
+			ch += 1;
+			sprintf(ch, "%d", journal_cookie);
+			if (rem)
+				memset(kevent->name + len, 0, rem);
+			kevent->event.len = len + rem;
+		}
+		else {
+			kevent->event.len = 0;
+			kevent->name = NULL;
+		}
 	}
 
 	return kevent;
@@ -269,7 +313,7 @@
  */
 static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
 				    u32 cookie, const char *name,
-				    struct inode *ignored)
+				    struct inode *ignored, journal_cookie_t journal_cookie)
 {
 	struct inotify_user_watch *watch;
 	struct inotify_device *dev;
@@ -280,11 +324,23 @@
 
 	mutex_lock(&dev->ev_mutex);
 
+	if (mask & IN_JOURNAL_COMMIT) {
+		if (!(w->inode) || !(w->inode->i_sb) || !(w->inode->i_sb->s_priv_inode))
+			goto out;
+	}
 	/* we can safely put the watch as we don't reference it while
 	 * generating the event
 	 */
-	if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
+	if (mask & IN_IGNORED || w->mask & IN_ONESHOT) {
+                if (w->mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = NULL;
+                        if (w->inode && w->inode->i_sb)
+                                sb = w->inode->i_sb;
+                        if (sb && sb->s_priv_inode)
+                                sb->s_priv_inode = NULL;
+                }
 		put_inotify_watch(w); /* final put */
+	}
 
 	/* coalescing: drop this event if it is a dupe of the previous */
 	last = inotify_dev_get_last_event(dev);
@@ -304,9 +360,9 @@
 
 	/* if the queue overflows, we need to notify user space */
 	if (unlikely(dev->event_count == dev->max_events))
-		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
+		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL, 0);
 	else
-		kevent = kernel_event(wd, mask, cookie, name);
+		kevent = kernel_event(wd, mask, cookie, name, journal_cookie);
 
 	if (unlikely(!kevent))
 		goto out;
@@ -686,6 +742,13 @@
 	ret = inotify_find_update_watch(dev->ih, inode, mask);
 	if (ret == -ENOENT)
 		ret = create_watch(dev, inode, mask);
+	if (ret >= 0) {
+		if (mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = inode->i_sb;
+                        if (sb)
+                                sb->s_priv_inode = inode;
+		}
+	}
 	mutex_unlock(&dev->up_mutex);
 
 	path_put(&path);
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2008-09-11 15:48:38.000000000 -0700
+++ linux-2.6/fs/jbd/commit.c	2008-09-12 11:31:15.000000000 -0700
@@ -928,6 +928,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	journal_notify_commit(journal, journal->j_commit_sequence);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c	2008-09-11 15:48:44.000000000 -0700
+++ linux-2.6/fs/jbd/journal.c	2008-09-11 15:50:08.000000000 -0700
@@ -80,6 +80,7 @@
 EXPORT_SYMBOL(journal_invalidatepage);
 EXPORT_SYMBOL(journal_try_to_free_buffers);
 EXPORT_SYMBOL(journal_force_commit);
+EXPORT_SYMBOL(journal_get_cookie);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
@@ -1607,6 +1608,12 @@
 	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
 }
 
+/* We don't need to do anything here other than return the txid. */
+journal_cookie_t journal_get_cookie(handle_t *handle)
+{
+	return handle->h_transaction->t_tid;
+}
+
 /*
  * Journal_head storage management
  */
Index: linux-2.6/include/linux/ext3_fsnotify.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/ext3_fsnotify.h	2008-09-12 11:31:44.000000000 -0700
@@ -0,0 +1,83 @@
+/*
+ * linux/include/linux/ext3_fsnotify.h
+ *
+ * Copyright (C) 2008 Akamai Technologies
+ * Abhijit Paithankar <apaithan@akamai.com>
+ *
+ * fsnotify interface for Ext3 with journal cookies
+ */
+
+#ifndef EXT3_FSNOTIFY_H_
+#define EXT3_FSNOTIFY_H_
+
+#include <linux/fs.h>
+#include <linux/jbd.h>
+#include <linux/ext3_fs.h>
+#include <linux/ext3_jbd.h>
+#include <linux/fsnotify.h>
+
+#define SB(x) x->h_transaction->t_journal->j_private
+
+static const char *ext3_fsnotify_oldname_init(const char *name)
+{
+	return fsnotify_oldname_init(name);
+}
+
+static void ext3_fsnotify_oldname_free(const char *old_name)
+{
+	fsnotify_oldname_free(old_name);
+}
+
+static void ext3_fsnotify_unlink(struct dentry *dentry, struct inode *inode,
+				journal_cookie_t journal_cookie)
+{
+	int isdir = S_ISDIR(dentry->d_inode->i_mode);
+	if (!inode->i_nlink)
+		fsnotify_journal_inoderemove(inode, journal_cookie);
+
+	fsnotify_journal_nameremove(dentry, isdir, journal_cookie);
+}
+
+static void ext3_fsnotify_move(struct inode *old_dir, struct inode *new_dir,
+				      const char *old_name, const char *new_name,
+				      int isdir, struct inode *target, struct dentry *moved,
+				      journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_move(old_dir, new_dir, old_name, new_name, isdir,
+			target, moved, journal_cookie);
+}
+
+static void ext3_fsnotify_create(struct inode *dir, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_create(dir, dentry, journal_cookie);
+}
+
+static void ext3_fsnotify_mkdir(struct inode *dir, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_mkdir(dir, dentry, journal_cookie);
+}
+
+static void ext3_fsnotify_rmdir(struct dentry *dentry, struct inode *inode,
+				journal_cookie_t journal_cookie)
+{
+	ext3_fsnotify_unlink(dentry, inode, journal_cookie);
+
+}
+
+static void ext3_fsnotify_link(struct inode *dir, struct inode *inode, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_link(dir, inode, dentry, journal_cookie);
+}
+
+void journal_notify_commit(journal_t *journal, journal_cookie_t journal_cookie)
+{
+	struct super_block *sb = journal->j_private;
+
+	if (sb && sb->s_priv_inode)
+		fsnotify_journal_commit(sb->s_priv_inode, journal_cookie);
+}
+
+#endif /* EXT3_FSNOTIFY_H_ */
Index: linux-2.6/include/linux/ext3_jbd.h
===================================================================
--- linux-2.6.orig/include/linux/ext3_jbd.h	2008-09-11 15:48:56.000000000 -0700
+++ linux-2.6/include/linux/ext3_jbd.h	2008-09-11 15:50:08.000000000 -0700
@@ -223,4 +223,9 @@
 	return 0;
 }
 
+static inline journal_cookie_t ext3_journal_get_cookie(handle_t *handle)
+{
+	return journal_get_cookie(handle);
+}
+
 #endif	/* _LINUX_EXT3_JBD_H */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-09-11 15:49:00.000000000 -0700
+++ linux-2.6/include/linux/fs.h	2008-09-11 15:50:08.000000000 -0700
@@ -1105,6 +1105,7 @@
 
 	char s_id[32];				/* Informational name */
 
+	struct inode		*s_priv_inode;	/* used by EXT3 for journal notifications */
 	void 			*s_fs_info;	/* Filesystem private info */
 
 	/*
Index: linux-2.6/include/linux/fsnotify.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify.h	2008-09-11 15:49:04.000000000 -0700
+++ linux-2.6/include/linux/fsnotify.h	2008-09-12 10:27:09.000000000 -0700
@@ -34,6 +34,12 @@
 	inotify_d_move(entry);
 }
 
+static inline void fsnotify_journal_commit(struct inode *inode, journal_cookie_t journal_cookie)
+{
+	inotify_inode_queue_event(inode, IN_JOURNAL_COMMIT, 0,
+				  NULL, NULL, journal_cookie);
+}
+
 /*
  * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
  */
@@ -54,21 +60,47 @@
 	if (isdir)
 		isdir = IN_ISDIR;
 	inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,cookie,old_name,
-				  source);
+				  source, 0);
 	inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
-				  source);
+				  source, 0);
 
 	if (target) {
-		inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
+		inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL, 0);
 		inotify_inode_is_dead(target);
 	}
 
 	if (source) {
-		inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
+		inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL, 0);
 	}
 	audit_inode_child(new_name, moved, new_dir);
 }
 
+static inline void fsnotify_journal_move(struct inode *old_dir, struct inode *new_dir,
+					 const char *old_name, const char *new_name,
+					 int isdir, struct inode *target, struct dentry *moved,
+					 journal_cookie_t journal_cookie)
+{
+	struct inode *source = moved->d_inode;
+	u32 cookie = inotify_get_cookie();
+
+	if (isdir)
+		isdir = IN_ISDIR;
+	inotify_inode_queue_event(old_dir, IN_JOURNAL_MOVED_FROM|isdir,cookie,old_name,
+				  source, journal_cookie);
+	inotify_inode_queue_event(new_dir, IN_JOURNAL_MOVED_TO|isdir, cookie, new_name,
+				  source, journal_cookie);
+
+	if (target) {
+		inotify_inode_queue_event(target, IN_JOURNAL_DELETE_SELF, 0, NULL, NULL,
+					 journal_cookie);
+	}
+
+	if (source) {
+		inotify_inode_queue_event(source, IN_JOURNAL_MOVE_SELF, 0, NULL, NULL,
+					 journal_cookie);
+	}
+}
+
 /*
  * fsnotify_nameremove - a filename was removed from a directory
  */
@@ -77,7 +109,16 @@
 	if (isdir)
 		isdir = IN_ISDIR;
 	dnotify_parent(dentry, DN_DELETE);
-	inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
+	inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name, 0);
+}
+
+static inline void fsnotify_journal_nameremove(struct dentry *dentry, int isdir,
+						journal_cookie_t journal_cookie)
+{
+        if (isdir)
+                isdir = IN_ISDIR;
+        inotify_dentry_parent_queue_event(dentry, IN_JOURNAL_DELETE|isdir, 0, dentry->d_name.name,
+					 journal_cookie);
 }
 
 /*
@@ -85,16 +126,24 @@
  */
 static inline void fsnotify_inoderemove(struct inode *inode)
 {
-	inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
+	inotify_inode_queue_event(inode, IN_DELETE_SELF, 0,
+						NULL, NULL, 0);
 	inotify_inode_is_dead(inode);
 }
 
+static inline void fsnotify_journal_inoderemove(struct inode *inode,
+						journal_cookie_t journal_cookie)
+{
+	inotify_inode_queue_event(inode, IN_JOURNAL_DELETE_SELF, 0,
+				 NULL, NULL, journal_cookie);
+}
+
 /*
  * fsnotify_link_count - inode's link count changed
  */
 static inline void fsnotify_link_count(struct inode *inode)
 {
-	inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);
+	inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL, 0);
 }
 
 /*
@@ -104,10 +153,17 @@
 {
 	inode_dir_notify(inode, DN_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
-				  dentry->d_inode);
+				  dentry->d_inode, 0);
 	audit_inode_child(dentry->d_name.name, dentry, inode);
 }
 
+static inline void fsnotify_journal_create(struct inode *inode, struct dentry *dentry,
+						journal_cookie_t journal_cookie)
+{
+        inotify_inode_queue_event(inode, IN_JOURNAL_CREATE, 0, dentry->d_name.name,
+                                  dentry->d_inode, journal_cookie);
+}
+
 /*
  * fsnotify_link - new hardlink in 'inode' directory
  * Note: We have to pass also the linked inode ptr as some filesystems leave
@@ -117,11 +173,17 @@
 {
 	inode_dir_notify(dir, DN_CREATE);
 	inotify_inode_queue_event(dir, IN_CREATE, 0, new_dentry->d_name.name,
-				  inode);
+				  inode, 0);
 	fsnotify_link_count(inode);
 	audit_inode_child(new_dentry->d_name.name, new_dentry, dir);
 }
 
+static inline void fsnotify_journal_link(struct inode *dir, struct inode *inode,
+				struct dentry *new_dentry, journal_cookie_t journal_cookie)
+{
+	inotify_inode_queue_event(dir, IN_JOURNAL_CREATE, 0, new_dentry->d_name.name,
+				  inode, journal_cookie);
+}
 /*
  * fsnotify_mkdir - directory 'name' was created
  */
@@ -129,10 +191,17 @@
 {
 	inode_dir_notify(inode, DN_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0, 
-				  dentry->d_name.name, dentry->d_inode);
+				  dentry->d_name.name, dentry->d_inode, 0);
 	audit_inode_child(dentry->d_name.name, dentry, inode);
 }
 
+static inline void fsnotify_journal_mkdir(struct inode *inode, struct dentry *dentry,
+					 journal_cookie_t journal_cookie)
+{
+	inotify_inode_queue_event(inode, IN_JOURNAL_CREATE | IN_ISDIR, 0,
+                                  dentry->d_name.name, dentry->d_inode, journal_cookie);
+}
+
 /*
  * fsnotify_access - file was read
  */
@@ -145,8 +214,8 @@
 		mask |= IN_ISDIR;
 
 	dnotify_parent(dentry, DN_ACCESS);
-	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
-	inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name, 0);
+	inotify_inode_queue_event(inode, mask, 0, NULL, NULL, 0);
 }
 
 /*
@@ -161,8 +230,8 @@
 		mask |= IN_ISDIR;
 
 	dnotify_parent(dentry, DN_MODIFY);
-	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
-	inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name, 0);
+	inotify_inode_queue_event(inode, mask, 0, NULL, NULL, 0);
 }
 
 /*
@@ -176,8 +245,8 @@
 	if (S_ISDIR(inode->i_mode))
 		mask |= IN_ISDIR;
 
-	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
-	inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name, 0);
+	inotify_inode_queue_event(inode, mask, 0, NULL, NULL, 0);
 }
 
 /*
@@ -194,8 +263,8 @@
 	if (S_ISDIR(inode->i_mode))
 		mask |= IN_ISDIR;
 
-	inotify_dentry_parent_queue_event(dentry, mask, 0, name);
-	inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+	inotify_dentry_parent_queue_event(dentry, mask, 0, name, 0);
+	inotify_inode_queue_event(inode, mask, 0, NULL, NULL, 0);
 }
 
 /*
@@ -209,8 +278,8 @@
 	if (S_ISDIR(inode->i_mode))
 		mask |= IN_ISDIR;
 
-	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
-	inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+	inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name, 0);
+	inotify_inode_queue_event(inode, mask, 0, NULL, NULL, 0);
 }
 
 /*
@@ -257,9 +326,9 @@
 	if (in_mask) {
 		if (S_ISDIR(inode->i_mode))
 			in_mask |= IN_ISDIR;
-		inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
+		inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL, 0);
 		inotify_dentry_parent_queue_event(dentry, in_mask, 0,
-						  dentry->d_name.name);
+						  dentry->d_name.name, 0);
 	}
 }
 
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h	2008-09-11 15:49:08.000000000 -0700
+++ linux-2.6/include/linux/inotify.h	2008-09-12 12:37:16.000000000 -0700
@@ -44,6 +44,28 @@
 #define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
 #define IN_IGNORED		0x00008000	/* File was ignored */
 
+/*
+ * These are journal events. They are sent from a journaling
+ * file system or journal.
+ *
+ * IN_JOURNAL_COMMIT is special. Only one file per journaling filesystem super-block
+ * is allowed. Registering for this event with multiple files will overwrite previous
+ * registrations. The last registration for this will be taken as current and commit
+ * notifications will be delivered to only that file.
+ *
+ * The inode registered for this event should not be unlinked. This will disable
+ * any notifications for journal commits.
+ *
+ * If this event is not registered for, journal commit messages will not be delivered.
+ */
+#define IN_JOURNAL_COMMIT	0x00010000	/* Journal on this partition committed a transaction */
+#define IN_JOURNAL_MOVED_FROM	0x00020000	/* Journal transaction to move file from X started */
+#define IN_JOURNAL_MOVED_TO	0x00040000	/* Journal transaction to move file to Y started */
+#define IN_JOURNAL_MOVE_SELF	0x00080000	/* Journal transaction to move self started */
+#define IN_JOURNAL_DELETE	0x00100000	/* Journal transaction to delete subfile started */
+#define IN_JOURNAL_DELETE_SELF	0x00200000	/* Journal transaction to delete self started */
+#define IN_JOURNAL_CREATE	0x00400000	/* Journal transaction to create file started */
+
 /* helper events */
 #define IN_CLOSE		(IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
 #define IN_MOVE			(IN_MOVED_FROM | IN_MOVED_TO) /* moves */
@@ -69,10 +91,14 @@
 #define IN_CLOEXEC O_CLOEXEC
 #define IN_NONBLOCK O_NONBLOCK
 
+#define IN_JOURNAL_ALL_EVENTS	(IN_JOURNAL_MOVED_FROM | IN_JOURNAL_MOVED_TO | \
+				 IN_JOURNAL_MOVE_SELF | IN_JOURNAL_DELETE | \
+				 IN_JOURNAL_DELETE_SELF | IN_JOURNAL_CREATE)
 #ifdef __KERNEL__
 
 #include <linux/dcache.h>
 #include <linux/fs.h>
+#include <linux/journal-head.h>
 
 /*
  * struct inotify_watch - represents a watch request on a specific inode
@@ -97,7 +123,7 @@
 
 struct inotify_operations {
 	void (*handle_event)(struct inotify_watch *, u32, u32, u32,
-			     const char *, struct inode *);
+			     const char *, struct inode *, journal_cookie_t);
 	void (*destroy_watch)(struct inotify_watch *);
 };
 
@@ -108,9 +134,9 @@
 extern void inotify_d_instantiate(struct dentry *, struct inode *);
 extern void inotify_d_move(struct dentry *);
 extern void inotify_inode_queue_event(struct inode *, __u32, __u32,
-				      const char *, struct inode *);
+				      const char *, struct inode *, journal_cookie_t);
 extern void inotify_dentry_parent_queue_event(struct dentry *, __u32, __u32,
-					      const char *);
+					      const char *, journal_cookie_t);
 extern void inotify_unmount_inodes(struct list_head *);
 extern void inotify_inode_is_dead(struct inode *);
 extern u32 inotify_get_cookie(void);
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h	2008-09-11 15:49:12.000000000 -0700
+++ linux-2.6/include/linux/jbd.h	2008-09-12 12:34:58.000000000 -0700
@@ -114,6 +114,9 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+
+extern void journal_notify_commit(journal_t *journal, journal_cookie_t journal_cookie);
+
 #endif
 
 /*
@@ -919,6 +922,7 @@
 extern int	   journal_clear_err  (journal_t *);
 extern int	   journal_bmap(journal_t *, unsigned long, unsigned long *);
 extern int	   journal_force_commit(journal_t *);
+extern journal_cookie_t journal_get_cookie(handle_t *handle);
 
 /*
  * journal_head management
Index: linux-2.6/fs/ext4/ext4_jbd2.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4_jbd2.h	2008-09-12 11:53:15.000000000 -0700
+++ linux-2.6/fs/ext4/ext4_jbd2.h	2008-09-12 11:55:56.000000000 -0700
@@ -239,4 +239,8 @@
 	return 0;
 }
 
+static inline journal_cookie_t ext4_journal_get_cookie(handle_t *handle)
+{
+	return jbd2_journal_get_cookie(handle);
+}
 #endif	/* _EXT4_JBD2_H */
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c	2008-09-12 11:29:43.000000000 -0700
+++ linux-2.6/fs/ext4/namei.c	2008-09-12 12:26:00.000000000 -0700
@@ -34,6 +34,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 #include <linux/bio.h>
+#include <linux/ext4_fsnotify.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 
@@ -1741,6 +1742,10 @@
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 	}
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_create(dir, dentry, journal_cookie);
+	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -1776,6 +1781,10 @@
 #endif
 		err = ext4_add_nondir(handle, dentry, inode);
 	}
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_create(dir, dentry, journal_cookie);
+	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -1846,6 +1855,10 @@
 	ext4_update_dx_flag(dir);
 	ext4_mark_inode_dirty(handle, dir);
 	d_instantiate(dentry, inode);
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_mkdir(dir, dentry, journal_cookie);
+	}
 out_stop:
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
@@ -2122,6 +2135,10 @@
 	ext4_update_dx_flag(dir);
 	ext4_mark_inode_dirty(handle, dir);
 
+	if (!handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_rmdir(dentry, inode, journal_cookie);
+	}
 end_rmdir:
 	ext4_journal_stop(handle);
 	brelse (bh);
@@ -2176,6 +2193,10 @@
 	ext4_mark_inode_dirty(handle, inode);
 	retval = 0;
 
+	if (!handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_unlink(dentry, inode, journal_cookie);
+	}
 end_unlink:
 	ext4_journal_stop(handle);
 	brelse (bh);
@@ -2233,6 +2254,10 @@
 	}
 	EXT4_I(inode)->i_disksize = inode->i_size;
 	err = ext4_add_nondir(handle, dentry, inode);
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_create(dir, dentry, journal_cookie);
+	}
 out_stop:
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
@@ -2271,6 +2296,10 @@
 	atomic_inc(&inode->i_count);
 
 	err = ext4_add_nondir(handle, dentry, inode);
+	if (!err && !handle->h_sync) {
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+		ext4_fsnotify_link(dir, inode, dentry, journal_cookie);
+	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
@@ -2292,6 +2321,7 @@
 	struct buffer_head * old_bh, * new_bh, * dir_bh;
 	struct ext4_dir_entry_2 * old_de, * new_de;
 	int retval;
+	const char *old_name = NULL;
 
 	old_bh = new_bh = dir_bh = NULL;
 
@@ -2305,6 +2335,7 @@
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	old_name = ext4_fsnotify_oldname_init(old_dentry->d_name.name);
 	if (IS_DIRSYNC(old_dir) || IS_DIRSYNC(new_dir))
 		handle->h_sync = 1;
 
@@ -2432,7 +2463,16 @@
 	}
 	retval = 0;
 
+	if (!handle->h_sync) {
+		int isdir = S_ISDIR(old_dentry->d_inode->i_mode);
+		const char *new_name = new_dentry->d_name.name;
+		journal_cookie_t journal_cookie = ext4_journal_get_cookie(handle);
+
+		ext4_fsnotify_move(old_dir, new_dir, old_name, new_name, isdir,
+				   new_dentry->d_inode, old_dentry, journal_cookie);
+	}
 end_rename:
+	ext4_fsnotify_oldname_free(old_name);
 	brelse (dir_bh);
 	brelse (old_bh);
 	brelse (new_bh);
Index: linux-2.6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd2/commit.c	2008-09-12 11:01:08.000000000 -0700
+++ linux-2.6/fs/jbd2/commit.c	2008-09-12 11:36:00.000000000 -0700
@@ -990,6 +990,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	jbd2_journal_notify_commit(journal, journal->j_commit_sequence);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd2/journal.c	2008-09-12 11:44:10.000000000 -0700
+++ linux-2.6/fs/jbd2/journal.c	2008-09-12 11:47:28.000000000 -0700
@@ -84,6 +84,7 @@
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
+EXPORT_SYMBOL(jbd2_journal_get_cookie);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
@@ -1940,6 +1941,12 @@
 	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
 }
 
+/* We don't need to do anything here other than return the txid */
+journal_cookie_t jbd2_journal_get_cookie(handle_t *handle)
+{
+	return handle->h_transaction->t_tid;
+}
+
 /*
  * helper functions to deal with 32 or 64bit block numbers.
  */
Index: linux-2.6/include/linux/ext4_fsnotify.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/ext4_fsnotify.h	2008-09-12 11:36:37.000000000 -0700
@@ -0,0 +1,81 @@
+/*
+ * linux/include/linux/ext4_fsnotify.h
+ *
+ * Copyright (C) 2008 Akamai Technologies
+ * Abhijit Paithankar <apaithan@akamai.com>
+ *
+ * fsnotify interface for Ext4 with journal cookies
+ */
+
+#ifndef EXT4_FSNOTIFY_H_
+#define EXT4_FSNOTIFY_H_
+
+#include <linux/fs.h>
+#include <linux/jbd2.h>
+#include <linux/fsnotify.h>
+
+#define SB(x) x->h_transaction->t_journal->j_private
+
+static const char *ext4_fsnotify_oldname_init(const char *name)
+{
+	return fsnotify_oldname_init(name);
+}
+
+static void ext4_fsnotify_oldname_free(const char *old_name)
+{
+	fsnotify_oldname_free(old_name);
+}
+
+static void ext4_fsnotify_unlink(struct dentry *dentry, struct inode *inode,
+				journal_cookie_t journal_cookie)
+{
+	int isdir = S_ISDIR(dentry->d_inode->i_mode);
+	if (!inode->i_nlink)
+		fsnotify_journal_inoderemove(inode, journal_cookie);
+
+	fsnotify_journal_nameremove(dentry, isdir, journal_cookie);
+}
+
+static void ext4_fsnotify_move(struct inode *old_dir, struct inode *new_dir,
+				      const char *old_name, const char *new_name,
+				      int isdir, struct inode *target, struct dentry *moved,
+				      journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_move(old_dir, new_dir, old_name, new_name, isdir,
+			target, moved, journal_cookie);
+}
+
+static void ext4_fsnotify_create(struct inode *dir, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_create(dir, dentry, journal_cookie);
+}
+
+static void ext4_fsnotify_mkdir(struct inode *dir, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_mkdir(dir, dentry, journal_cookie);
+}
+
+static void ext4_fsnotify_rmdir(struct dentry *dentry, struct inode *inode,
+				journal_cookie_t journal_cookie)
+{
+	ext4_fsnotify_unlink(dentry, inode, journal_cookie);
+
+}
+
+static void ext4_fsnotify_link(struct inode *dir, struct inode *inode, struct dentry *dentry,
+				journal_cookie_t journal_cookie)
+{
+	fsnotify_journal_link(dir, inode, dentry, journal_cookie);
+}
+
+void jbd2_journal_notify_commit(journal_t *journal, journal_cookie_t journal_cookie)
+{
+	struct super_block *sb = journal->j_private;
+
+	if (sb && sb->s_priv_inode)
+		fsnotify_journal_commit(sb->s_priv_inode, journal_cookie);
+}
+
+#endif /* EXT4_FSNOTIFY_H_ */
Index: linux-2.6/include/linux/jbd2.h
===================================================================
--- linux-2.6.orig/include/linux/jbd2.h	2008-09-12 10:55:04.000000000 -0700
+++ linux-2.6/include/linux/jbd2.h	2008-09-12 12:35:14.000000000 -0700
@@ -115,6 +115,8 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+
+extern void jbd2_journal_notify_commit(journal_t *journal, journal_cookie_t journal_cookie);
 #endif
 
 /*
@@ -1076,6 +1078,8 @@
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
 extern void	   jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
 
+extern journal_cookie_t jbd2_journal_get_cookie(handle_t *handle);
+
 /*
  * journal_head management
  */
Index: linux-2.6/include/linux/journal-head.h
===================================================================
--- linux-2.6.orig/include/linux/journal-head.h	2008-09-12 12:35:48.000000000 -0700
+++ linux-2.6/include/linux/journal-head.h	2008-09-12 12:36:06.000000000 -0700
@@ -13,6 +13,7 @@
 typedef unsigned int		tid_t;		/* Unique transaction ID */
 typedef struct transaction_s	transaction_t;	/* Compound transaction type */
 struct buffer_head;
+typedef tid_t journal_cookie_t; /* Opaque journal cookie */
 
 struct journal_head {
 	/*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-12 22:06 [RFC PATCH] Filesystem Journal Notifications Abhijit Paithankar
@ 2008-09-13  9:19 ` Dave Chinner
  2008-09-15  2:31 ` Andreas Dilger
  2008-09-15 13:23 ` Jamie Lokier
  2 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2008-09-13  9:19 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, Andreas Dilger

On Fri, Sep 12, 2008 at 03:06:49PM -0700, Abhijit Paithankar wrote:
> Hi,
> 
> On journaling file systems, applications need to know when the meta-data
> changes to files are written to disks in a manner which is persistent to
> crashes and reboots.
> 
> One way to do it is to fsync every few operations. However, fsync is
> blocking and affects performance.
> 
> The other (more efficient) way is to have the filesystem notify the
> application when a transaction/change is written to disk.
> 
> This patch implements such a notification mechanism based on inotify for
> Ext3 and Ext4.

And as such is un-implementable for other filesystems like XFS.

e.g. the "journal cookie" is a 32 bit type. XFS uses a 64 bit
type for transaction ID in the journal (the log sequence number).
I am assuming the journal cookie in a userspace visible type,
as the application is going to get these in events and have to
compare them, right?

Also, XFS doesn't assign the LSN until the transaction
commits into the in-core log buffer as the LSN is (partially) a
function of the physical offset into the log. Hence at the time we
resolve what the cookie might be, we are so far away from the
create/link/unlink/etc code that we cannot build an fsnotify
function....

If you are going to provide a userspace interface for this, the
cookie really needs to be independent of filesystem journalling
implementation. Perhaps it should be a sequence number that is
opaque to the filesystem that is assigned by inotify. The filesystem
then needs to carry that sequence number with the transaction and
returns the provided cookie to the inotify commit callback. This
avoids encoding a dependency on a specific journaling model into the
inotify interface....

That way, you can make well defined rules for doing comparisons
on cookies. i.e. determine from a pair of cookies which one is
older. Why? because...

> Every time an application creates, renames, unlinks, etc a file or
> directory, a journal_cookie is sent via the inotify mechanism. Ofcourse,
> the application has to first register for such notifications on the
> files/directories of interest.
> 
> Once the transaction corresponding to that journal_cookie is committed
> to the journal another notification is sent to the application
> indicating that the change is now persistent.

... that's pretty inefficient. We have two inotify events now for
every transaction and one of them has to be issued in journal I/O
completion context. We can reduce this down to a single event with a
simple polling interface to the filesystem to find out what the last
committed cookie was.

With a well defined cookie we can then determine how many of the
outstanding events have been committed with a simple comparison.
That will halve the amount of events, and allow batch processing
which will be far more efficient that processing an event at a time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-12 22:06 [RFC PATCH] Filesystem Journal Notifications Abhijit Paithankar
  2008-09-13  9:19 ` Dave Chinner
@ 2008-09-15  2:31 ` Andreas Dilger
  2008-09-15  5:39   ` Andreas Dilger
  2008-09-15 13:23 ` Jamie Lokier
  2 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2008-09-15  2:31 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel

On Sep 12, 2008  15:06 -0700, Abhijit Paithankar wrote:
> On journaling file systems, applications need to know when the meta-data
> changes to files are written to disks in a manner which is persistent to
> crashes and reboots.
> 
> One way to do it is to fsync every few operations. However, fsync is
> blocking and affects performance.
> 
> The other (more efficient) way is to have the filesystem notify the
> application when a transaction/change is written to disk.
> 
> This patch implements such a notification mechanism based on inotify for
> Ext3 and Ext4.
> 
> Every time an application creates, renames, unlinks, etc a file or
> directory, a journal_cookie is sent via the inotify mechanism. Ofcourse,
> the application has to first register for such notifications on the
> files/directories of interest.
> 
> Once the transaction corresponding to that journal_cookie is committed
> to the journal another notification is sent to the application
> indicating that the change is now persistent.

We had a much more flexible journal commit notification mechanism.  The
caller would register a callback with the journal layer that contained
a caller-private callback function and callback data.  This avoided
exposing the journal transaction IDs outside of the filesystem, and the
caller could use any identifier desired.

There was also no limit on the number/type of callbacks registered with
the filesystem.  The basic data structure was a linked list of callbacks
registered on the transaction:

struct journal_callback {
       struct list_head jcb_list;
       void (*jcb_func)(struct journal_callback *jcb, int error);
       /* caller data goes here */
};

and in journal_commit_transaction() the callbacks are called:

               list_for_each_entry_safe(jcb, n, &commit_transaction->t_jcb,
					jcb_list) {
                       list_del_init(&jcb->jcb_list);
                       jcb->jcb_func(jcb, error);
               }


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15  2:31 ` Andreas Dilger
@ 2008-09-15  5:39   ` Andreas Dilger
  2008-09-15 19:36     ` Abhijit Paithankar
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2008-09-15  5:39 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, linux-ext4

On Sep 14, 2008  19:31 -0700, Andreas Dilger wrote:
> We had a much more flexible journal commit notification mechanism.  The
> caller would register a callback with the journal layer that contained
> a caller-private callback function and callback data.  This avoided
> exposing the journal transaction IDs outside of the filesystem, and the
> caller could use any identifier desired.
> 
> There was also no limit on the number/type of callbacks registered with
> the filesystem.  The basic data structure was a linked list of callbacks
> registered on the transaction:
> 
> struct journal_callback {
>        struct list_head jcb_list;
>        void (*jcb_func)(struct journal_callback *jcb, int error);
>        /* caller data goes here */
> };
> 
> and in journal_commit_transaction() the callbacks are called:
> 
>                list_for_each_entry_safe(jcb, n, &commit_transaction->t_jcb,
> 					jcb_list) {
>                        list_del_init(&jcb->jcb_list);
>                        jcb->jcb_func(jcb, error);
>                }

Here is the patch (against 2.6.18 kernels, but I don't think anything has
changed in this area in a long time):


Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h	2006-07-15 16:08:35.000000000 +0800
+++ linux-2.6/include/linux/jbd.h	2006-07-15 16:13:01.000000000 +0800
@@ -356,6 +356,27 @@ static inline void jbd_unlock_bh_journal
 	bit_spin_unlock(BH_JournalHead, &bh->b_state);
 }
 
+#define HAVE_JOURNAL_CALLBACK_STATUS
+/**
+ *   struct journal_callback - Base structure for callback information
+ *   @jcb_list: list information for other callbacks attached to the same handle
+ *   @jcb_func: Function to call with this callback structure
+ *
+ *   This struct is a 'seed' structure for a using with your own callback
+ *   structs. If you are using callbacks you must allocate one of these
+ *   or another struct of your own definition which has this struct 
+ *   as it's first element and pass it to journal_callback_set().
+ *
+ *   This is used internally by jbd to maintain callback information.
+ *
+ *   See journal_callback_set for more information.
+ **/
+struct journal_callback {
+	struct list_head jcb_list;		/* t_jcb_lock */
+	void (*jcb_func)(struct journal_callback *jcb, int error);
+	/* caller data goes here */
+};
+
 struct jbd_revoke_table_s;
 
 /**
@@ -364,6 +385,7 @@ struct jbd_revoke_table_s;
  * @h_transaction: Which compound transaction is this update a part of?
  * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
  * @h_ref: Reference count on this handle
+ * @h_jcb: List of application registered callbacks for this handle.
  * @h_err: Field for caller's use to track errors through large fs operations
  * @h_sync: flag for sync-on-close
  * @h_jdata: flag to force data journaling
@@ -389,6 +411,13 @@ struct handle_s 
 	/* operations */
 	int			h_err;
 
+	/*
+	 * List of application registered callbacks for this handle. The
+	 * function(s) will be called after the transaction that this handle is
+	 * part of has been committed to disk. [t_jcb_lock]
+	 */
+	struct list_head	h_jcb;
+
 	/* Flags [no locking] */
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
@@ -430,6 +459,8 @@ struct handle_s 
  *    j_state_lock
  *    ->j_list_lock			(journal_unmap_buffer)
  *
+ *    t_handle_lock
+ *    ->t_jcb_lock
  */
 
 struct transaction_s 
@@ -559,6 +590,15 @@ struct transaction_s 
 	 */
 	int t_handle_count;
 
+	/*
+	 * Protects the callback list
+	 */
+	spinlock_t		t_jcb_lock;
+	/*
+	 * List of registered callback functions for this transaction.
+	 * Called when the transaction is committed. [t_jcb_lock]
+	 */
+	struct list_head	t_jcb;
 };
 
 /**
@@ -906,6 +946,10 @@ extern void	 journal_invalidatepage(jour
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
 extern int	 journal_stop(handle_t *);
 extern int	 journal_flush (journal_t *);
+extern void	 journal_callback_set(handle_t *handle,
+				      void (*fn)(struct journal_callback *,int),
+				      struct journal_callback *jcb);
+
 extern void	 journal_lock_updates (journal_t *);
 extern void	 journal_unlock_updates (journal_t *);
 
Index: linux-2.6/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd/checkpoint.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/checkpoint.c	2006-07-15 16:13:01.000000000 +0800
@@ -688,6 +688,7 @@ void __journal_drop_transaction(journal_
 	J_ASSERT(transaction->t_checkpoint_list == NULL);
 	J_ASSERT(transaction->t_checkpoint_io_list == NULL);
 	J_ASSERT(transaction->t_updates == 0);
+	J_ASSERT(list_empty(&transaction->t_jcb));
 	J_ASSERT(journal->j_committing_transaction != transaction);
 	J_ASSERT(journal->j_running_transaction != transaction);
 
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/commit.c	2006-07-15 16:13:01.000000000 +0800
@@ -708,6 +708,21 @@ wait_for_iobuf:
            transaction can be removed from any checkpoint list it was on
            before. */
 
+	/* Call any callbacks that had been registered for handles in this
+	 * transaction.  It is up to the callback to free any allocated
+	 * memory.  No locking is required, since this is the only process
+	 * that is processing this transaction anymore. */
+	if (!list_empty(&commit_transaction->t_jcb)) {
+		struct journal_callback *jcb, *n;
+		int error = is_journal_aborted(journal);
+
+		list_for_each_entry_safe(jcb, n, &commit_transaction->t_jcb,
+					 jcb_list) {
+			list_del_init(&jcb->jcb_list);
+			jcb->jcb_func(jcb, error);
+		}
+	}
+
 	jbd_debug(3, "JBD: commit phase 7\n");
 
 	J_ASSERT(commit_transaction->t_sync_datalist == NULL);
Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c	2006-07-15 16:08:36.000000000 +0800
+++ linux-2.6/fs/jbd/journal.c	2006-07-15 16:13:01.000000000 +0800
@@ -58,6 +58,7 @@ EXPORT_SYMBOL(journal_sync_buffer);
 #endif
 EXPORT_SYMBOL(journal_flush);
 EXPORT_SYMBOL(journal_revoke);
+EXPORT_SYMBOL(journal_callback_set);
 
 EXPORT_SYMBOL(journal_init_dev);
 EXPORT_SYMBOL(journal_init_inode);
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c	2006-07-15 16:08:35.000000000 +0800
+++ linux-2.6/fs/jbd/transaction.c	2006-07-15 16:13:01.000000000 +0800
@@ -50,6 +50,8 @@ get_transaction(journal_t *journal, tran
 	transaction->t_state = T_RUNNING;
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
+	INIT_LIST_HEAD(&transaction->t_jcb);
+	spin_lock_init(&transaction->t_jcb_lock);
 	spin_lock_init(&transaction->t_handle_lock);
 
 	/* Set up the commit timer for the new transaction. */
@@ -241,6 +243,7 @@ static handle_t *new_handle(int nblocks)
 	memset(handle, 0, sizeof(*handle));
 	handle->h_buffer_credits = nblocks;
 	handle->h_ref = 1;
+	INIT_LIST_HEAD(&handle->h_jcb);
 
 	return handle;
 }
@@ -1291,6 +1294,36 @@ drop:
 }
 
 /**
+ * void journal_callback_set() -  Register a callback function for this handle.
+ * @handle: handle to attach the callback to.
+ * @func: function to callback.
+ * @jcb:  structure with additional information required by func() , and
+ *        some space for jbd internal information.
+ * 
+ * The function will be called when the transaction that this handle is
+ * part of has been committed to disk with the original callback data
+ * struct and the error status of the journal as parameters.  There is no
+ * guarantee of ordering between handles within a single transaction, nor
+ * between callbacks registered on the same handle.
+ *
+ * The caller is responsible for allocating the journal_callback struct.
+ * This is to allow the caller to add as much extra data to the callback
+ * as needed, but reduce the overhead of multiple allocations.  The caller
+ * allocated struct must start with a struct journal_callback at offset 0,
+ * and has the caller-specific data afterwards.
+ */
+void journal_callback_set(handle_t *handle,
+			  void (*func)(struct journal_callback *jcb, int error),
+			  struct journal_callback *jcb)
+{
+	jcb->jcb_func = func;
+	spin_lock(&handle->h_transaction->t_jcb_lock);
+	list_add_tail(&jcb->jcb_list, &handle->h_jcb);
+	spin_unlock(&handle->h_transaction->t_jcb_lock);
+}
+
+/**
  * int journal_stop() - complete a transaction
  * @handle: tranaction to complete.
  * 
@@ -1363,6 +1396,11 @@ int journal_stop(handle_t *handle)
 			wake_up(&journal->j_wait_transaction_locked);
 	}
 
+	/* Move callbacks from the handle to the transaction. */
+	spin_lock(&transaction->t_jcb_lock);
+	list_splice(&handle->h_jcb, &transaction->t_jcb);
+	spin_unlock(&transaction->t_jcb_lock);
+
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
 	 * going!  We also want to force a commit if the current


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-12 22:06 [RFC PATCH] Filesystem Journal Notifications Abhijit Paithankar
  2008-09-13  9:19 ` Dave Chinner
  2008-09-15  2:31 ` Andreas Dilger
@ 2008-09-15 13:23 ` Jamie Lokier
  2008-09-17  0:06   ` Abhijit Paithankar
  2 siblings, 1 reply; 14+ messages in thread
From: Jamie Lokier @ 2008-09-15 13:23 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, Andreas Dilger

Abhijit Paithankar wrote:
> One way to do it is to fsync every few operations. However, fsync is
> blocking and affects performance.
>
> The other (more efficient) way is to have the filesystem notify the
> application when a transaction/change is written to disk.

Is this more efficient than aio_fsync, and if so, why?

For file writes, aio_fsync seems like a cleaner interface, and if
that's not fast enough, it could be made faster - perhaps using code
form this patch.  (An aio_fsync_ranges would be even better).

-- Jamie

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15  5:39   ` Andreas Dilger
@ 2008-09-15 19:36     ` Abhijit Paithankar
  2008-09-15 23:37       ` Andreas Dilger
  2008-09-20  5:15       ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Abhijit Paithankar @ 2008-09-15 19:36 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

Thanks for your feedback.
It gave me new points to think about.

I have used inotify simply because it is a mechanism that already exists
in the kernel and is easily extensible.

The use of transaction ids as "journal cookie" was done to simplify
things. A journal transaction ID is a readily available monotonically
increasing ID unique to that partition (super block).

Journal cookies themselves are completely opaque to the application
and as long as it is a monotonically increasing number that signifies
a journal commit, it could be anything that a file system chooses it
to be.

Ext3/Ext4 gather multiple changes into a single transaction. So,
you will see a single 'commit' notification for multiple of meta-data
changes. The 'commit' messages, therefore, won't be that many.


This patch (v2):

In this version of the patch I've done away with most of the file-system
specific changes. The file system now only needs to send out a journal
commit notification.

The 'journal cookie' is simply a u64 that increases every time a journal
'commit' is made. This will make the code more portable to other file
systems.

Also, we no longer need those many journal message types. We can easily
piggy-back on top of fsnotify. The only remaining one is
IN_JOURNAL_COMMIT.

My next step is to allow multiple registrations for JOURNAL_COMMIT
notification. I will post that patch once it is ready.

Signed-off-by: Abhijit Paithankar <apaithan@akamai.com>
---

 fs/ext3/ext3_jbd.c       |    9 +++++
 fs/ext4/ext4_jbd2.c      |    9 +++++
 fs/inotify.c             |   17 ++++++---
 fs/inotify_user.c        |   81 ++++++++++++++++++++++++++++++++++++++++++-----
 fs/jbd/commit.c          |    1 
 fs/jbd2/commit.c         |    1 
 include/linux/fs.h       |    2 +
 include/linux/fsnotify.h |   11 ++++++
 include/linux/inotify.h  |   19 ++++++++++-
 include/linux/jbd.h      |    1 
 include/linux/jbd2.h     |    1 
 11 files changed, 138 insertions(+), 14 deletions(-)

---


Index: linux-2.6/fs/ext3/ext3_jbd.c
===================================================================
--- linux-2.6.orig/fs/ext3/ext3_jbd.c	2008-09-15 10:12:25.000000000 -0700
+++ linux-2.6/fs/ext3/ext3_jbd.c	2008-09-15 10:15:54.000000000 -0700
@@ -3,6 +3,7 @@
  */
 
 #include <linux/ext3_jbd.h>
+#include <linux/fsnotify.h>
 
 int __ext3_journal_get_undo_access(const char *where, handle_t *handle,
 				struct buffer_head *bh)
@@ -57,3 +58,11 @@
 		ext3_journal_abort_handle(where, __func__, bh, handle,err);
 	return err;
 }
+
+void journal_notify_commit (journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	if (sb->s_priv_inode && sb->s_journal_cookie)
+		fsnotify_journal_commit(sb);
+}
+
Index: linux-2.6/fs/ext4/ext4_jbd2.c
===================================================================
--- linux-2.6.orig/fs/ext4/ext4_jbd2.c	2008-09-15 10:14:12.000000000 -0700
+++ linux-2.6/fs/ext4/ext4_jbd2.c	2008-09-15 10:15:50.000000000 -0700
@@ -2,6 +2,7 @@
  * Interface between ext4 and JBD
  */
 
+#include <linux/fsnotify.h>
 #include "ext4_jbd2.h"
 
 int __ext4_journal_get_undo_access(const char *where, handle_t *handle,
@@ -57,3 +58,11 @@
 		ext4_journal_abort_handle(where, __func__, bh, handle, err);
 	return err;
 }
+
+void jbd2_journal_notify_commit (journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	if (sb->s_priv_inode && sb->s_journal_cookie)
+		fsnotify_journal_commit(sb);
+}
+
Index: linux-2.6/fs/inotify.c
===================================================================
--- linux-2.6.orig/fs/inotify.c	2008-09-15 10:12:29.000000000 -0700
+++ linux-2.6/fs/inotify.c	2008-09-15 10:13:44.000000000 -0700
@@ -231,7 +231,7 @@
 				 struct inotify_watch *watch)
 {
 	remove_watch_no_event(watch, ih);
-	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL);
+	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL, 0);
 }
 EXPORT_SYMBOL_GPL(inotify_remove_watch_locked);
 
@@ -280,10 +280,17 @@
 			       const char *name, struct inode *n_inode)
 {
 	struct inotify_watch *watch, *next;
+	u64 journal_cookie = 0;
 
 	if (!inotify_inode_watched(inode))
 		return;
 
+	if (mask & (IN_ATTRIB | IN_MOVE | IN_CREATE | IN_DELETE |
+		    IN_DELETE_SELF | IN_MOVE_SELF)) {
+		if (inode->i_sb && inode->i_sb->s_journal_cookie)
+			journal_cookie = inode->i_sb->s_journal_cookie;
+	}
+
 	mutex_lock(&inode->inotify_mutex);
 	list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) {
 		u32 watch_mask = watch->mask;
@@ -293,7 +300,7 @@
 			if (watch_mask & IN_ONESHOT)
 				remove_watch_no_event(watch, ih);
 			ih->in_ops->handle_event(watch, watch->wd, mask, cookie,
-						 name, n_inode);
+						 name, n_inode, journal_cookie);
 			mutex_unlock(&ih->mutex);
 		}
 	}
@@ -409,7 +416,7 @@
 			struct inotify_handle *ih= watch->ih;
 			mutex_lock(&ih->mutex);
 			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
-						 NULL, NULL);
+						 NULL, NULL, 0);
 			inotify_remove_watch_locked(ih, watch);
 			mutex_unlock(&ih->mutex);
 		}
@@ -576,7 +583,7 @@
 		mask_add = 1;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 
@@ -623,7 +630,7 @@
 	int newly_watched;
 
 	/* don't allow invalid bits: we don't want flags set */
-	mask &= IN_ALL_EVENTS | IN_ONESHOT;
+	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
 	if (unlikely(!mask))
 		return -EINVAL;
 	watch->mask = mask;
Index: linux-2.6/fs/inotify_user.c
===================================================================
--- linux-2.6.orig/fs/inotify_user.c	2008-09-15 10:12:34.000000000 -0700
+++ linux-2.6/fs/inotify_user.c	2008-09-15 10:51:43.000000000 -0700
@@ -185,7 +185,7 @@
  * This function can sleep.
  */
 static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
-						  const char *name)
+						  const char *name, u64 journal_cookie)
 {
 	struct inotify_kernel_event *kevent;
 
@@ -204,6 +204,7 @@
 
 	if (name) {
 		size_t len, rem, event_size = sizeof(struct inotify_event);
+		size_t journal_cookie_sz;
 
 		/*
 		 * We need to pad the filename so as to properly align an
@@ -213,6 +214,10 @@
 		 * simple and safe for all architectures.
 		 */
 		len = strlen(name) + 1;
+		if (journal_cookie) {
+			journal_cookie_sz = sizeof(u64);
+			len = len + journal_cookie_sz;
+		}
 		rem = event_size - len;
 		if (len > event_size) {
 			rem = event_size - (len % event_size);
@@ -225,13 +230,52 @@
 			kmem_cache_free(event_cachep, kevent);
 			return NULL;
 		}
-		memcpy(kevent->name, name, len);
+		if (journal_cookie) {
+			char *ch;
+
+			memcpy(kevent->name, name, strlen(name));
+			ch = kevent->name;
+			ch += strlen(name);
+			memset(ch++, 0, 1);
+			sprintf(ch, "%llu", journal_cookie);
+		}
+		else {
+			memcpy(kevent->name, name, len);
+		}
 		if (rem)
 			memset(kevent->name + len, 0, rem);
 		kevent->event.len = len + rem;
 	} else {
-		kevent->event.len = 0;
-		kevent->name = NULL;
+		if (journal_cookie) {
+			size_t len, rem, event_size = sizeof(struct inotify_event);
+			size_t journal_cookie_sz = sizeof(u64);
+			char *ch;
+
+			len = 1 + journal_cookie_sz;
+			rem = event_size - len;
+			if (len > event_size) {
+				rem = event_size - (len % event_size);
+				if (len % event_size == 0)
+					rem = 0;
+			}
+
+			kevent->name = kmalloc(len + rem, GFP_KERNEL);
+			if (unlikely(!kevent->name)) {
+				kmem_cache_free(event_cachep, kevent);
+				return NULL;
+			}
+			memset(kevent->name, 0, 1);
+			ch = kevent->name;
+			ch += 1;
+			sprintf(ch, "%llu", journal_cookie);
+			if (rem)
+				memset(kevent->name + len, 0, rem);
+			kevent->event.len = len + rem;
+		}
+		else {
+			kevent->event.len = 0;
+			kevent->name = NULL;
+		}
 	}
 
 	return kevent;
@@ -269,7 +313,7 @@
  */
 static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
 				    u32 cookie, const char *name,
-				    struct inode *ignored)
+				    struct inode *ignored, u64 journal_cookie)
 {
 	struct inotify_user_watch *watch;
 	struct inotify_device *dev;
@@ -280,11 +324,25 @@
 
 	mutex_lock(&dev->ev_mutex);
 
+	if (mask & IN_JOURNAL_COMMIT) {
+		if (!(w->inode) || !(w->inode->i_sb)
+			|| !(w->inode->i_sb->s_priv_inode)
+			|| !(w->inode->i_sb->s_journal_cookie))
+			goto out;
+	}
 	/* we can safely put the watch as we don't reference it while
 	 * generating the event
 	 */
-	if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
+	if (mask & IN_IGNORED || w->mask & IN_ONESHOT) {
+                if (w->mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = NULL;
+                        if (w->inode && w->inode->i_sb)
+                                sb = w->inode->i_sb;
+                        if (sb && sb->s_priv_inode)
+                                sb->s_priv_inode = NULL;
+                }
 		put_inotify_watch(w); /* final put */
+	}
 
 	/* coalescing: drop this event if it is a dupe of the previous */
 	last = inotify_dev_get_last_event(dev);
@@ -304,9 +362,9 @@
 
 	/* if the queue overflows, we need to notify user space */
 	if (unlikely(dev->event_count == dev->max_events))
-		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
+		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL, 0);
 	else
-		kevent = kernel_event(wd, mask, cookie, name);
+		kevent = kernel_event(wd, mask, cookie, name, journal_cookie);
 
 	if (unlikely(!kevent))
 		goto out;
@@ -686,6 +744,13 @@
 	ret = inotify_find_update_watch(dev->ih, inode, mask);
 	if (ret == -ENOENT)
 		ret = create_watch(dev, inode, mask);
+	if (ret >= 0) {
+		if (mask & IN_JOURNAL_COMMIT) {
+                        struct super_block *sb = inode->i_sb;
+                        if (sb)
+                                sb->s_priv_inode = inode;
+		}
+	}
 	mutex_unlock(&dev->up_mutex);
 
 	path_put(&path);
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2008-09-15 10:12:42.000000000 -0700
+++ linux-2.6/fs/jbd/commit.c	2008-09-15 10:13:44.000000000 -0700
@@ -928,6 +928,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	journal_notify_commit(journal);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd2/commit.c	2008-09-15 10:14:00.000000000 -0700
+++ linux-2.6/fs/jbd2/commit.c	2008-09-15 10:16:09.000000000 -0700
@@ -990,6 +990,7 @@
 	}
 	spin_unlock(&journal->j_list_lock);
 
+	jbd2_journal_notify_commit(journal);
 	jbd_debug(1, "JBD: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-09-15 10:12:47.000000000 -0700
+++ linux-2.6/include/linux/fs.h	2008-09-15 10:13:44.000000000 -0700
@@ -1106,6 +1106,8 @@
 	char s_id[32];				/* Informational name */
 
 	void 			*s_fs_info;	/* Filesystem private info */
+	u64			s_journal_cookie;	/* current journal cookie # */
+	struct inode		*s_priv_inode;	/* used for journal notifications */
 
 	/*
 	 * The next field is for VFS *only*. No filesystems have any business
Index: linux-2.6/include/linux/fsnotify.h
===================================================================
--- linux-2.6.orig/include/linux/fsnotify.h	2008-09-15 10:12:53.000000000 -0700
+++ linux-2.6/include/linux/fsnotify.h	2008-09-15 10:13:44.000000000 -0700
@@ -34,6 +34,17 @@
 	inotify_d_move(entry);
 }
 
+static inline void fsnotify_journal_commit (struct super_block *sb)
+{
+	struct inode *inode;
+	sb->s_journal_cookie += 1;
+	if (sb->s_journal_cookie && sb->s_priv_inode) {
+		inode = sb->s_priv_inode;
+		inotify_inode_queue_event(inode, IN_JOURNAL_COMMIT, 0,
+					  NULL, NULL);
+	}
+}
+
 /*
  * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
  */
Index: linux-2.6/include/linux/inotify.h
===================================================================
--- linux-2.6.orig/include/linux/inotify.h	2008-09-15 10:12:59.000000000 -0700
+++ linux-2.6/include/linux/inotify.h	2008-09-15 10:13:44.000000000 -0700
@@ -44,6 +44,23 @@
 #define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
 #define IN_IGNORED		0x00008000	/* File was ignored */
 
+/* This is a journal commit event for file systems. These events are sent
+ * from a journaling filesystem or the journal itself.
+ *
+ * IN_JOURNAL_COMMIT is special in that only one file per journaling filesystem
+ * partition (super-block) is allowed. Registering for this event with multiple
+ * files will overwrite previous registrations. The last registration for this
+ * event will be taken as current and commit notifications will be delivered
+ * for that inode only.
+ *
+ * The inode registered for this event should not be unlinked. This will disable
+ * any further journal commit notifications.
+ *
+ * If this event is not registered for, journal commit messages will not be
+ * delivered
+ */
+#define IN_JOURNAL_COMMIT	0x00010000	/* Journal on this superblock has committed a transaction */
+
 /* helper events */
 #define IN_CLOSE		(IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
 #define IN_MOVE			(IN_MOVED_FROM | IN_MOVED_TO) /* moves */
@@ -97,7 +114,7 @@
 
 struct inotify_operations {
 	void (*handle_event)(struct inotify_watch *, u32, u32, u32,
-			     const char *, struct inode *);
+			     const char *, struct inode *, u64);
 	void (*destroy_watch)(struct inotify_watch *);
 };
 
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h	2008-09-15 10:13:08.000000000 -0700
+++ linux-2.6/include/linux/jbd.h	2008-09-15 10:13:44.000000000 -0700
@@ -114,6 +114,7 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+extern void journal_notify_commit(journal_t *journal);
 #endif
 
 /*
Index: linux-2.6/include/linux/jbd2.h
===================================================================
--- linux-2.6.orig/include/linux/jbd2.h	2008-09-15 10:16:18.000000000 -0700
+++ linux-2.6/include/linux/jbd2.h	2008-09-15 10:16:45.000000000 -0700
@@ -115,6 +115,7 @@
  * This is an opaque datatype.
  **/
 typedef struct journal_s	journal_t;	/* Journal control structure */
+extern void jbd2_journal_notify_commit(journal_t *journal);
 #endif
 
 /*


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 19:36     ` Abhijit Paithankar
@ 2008-09-15 23:37       ` Andreas Dilger
  2008-09-16 23:05         ` Abhijit Paithankar
  2008-09-20  5:15       ` Jan Kara
  1 sibling, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2008-09-15 23:37 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

On Sep 15, 2008  12:36 -0700, Abhijit Paithankar wrote:
> In this version of the patch I've done away with most of the file-system
> specific changes. The file system now only needs to send out a journal
> commit notification.

Is there a reason NOT to use the journal commit callback mechanism that
I posted?  This would only require registering a single callback for
each transaction for the inotify, but has the benefit of being completely
generic and can be used for other commit notifiers in the future.

Most of the rest of my comments are related to following the Linux
coding style.  Please also generate your patch with "diff -up" so
that the context includes the function name.

> +void journal_notify_commit (journal_t *journal)
> +{

No space between the function name and the '(', journal_notify_commit(

> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}

Please add an empty line between variable declarations and the code.

> +void jbd2_journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}

Same comments as above.

> +		if (journal_cookie) {
> +			size_t len, rem, event_size = sizeof(struct inotify_event);

Wrap at 80 columns, likely putting "size_t event_size = ..." on a new line.

>  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
> +	if (mask & IN_JOURNAL_COMMIT) {
> +		if (!(w->inode) || !(w->inode->i_sb)
> +			|| !(w->inode->i_sb->s_priv_inode)
> +			|| !(w->inode->i_sb->s_journal_cookie))
> +			goto out;
> +	}

Operators like "||" should go at the end of the line, and the new line
should be aligned with the opening parenthesis:

		if (!(w->inode) || !(w->inode->i_sb) ||
		    !(w->inode->i_sb->s_priv_inode) ||
		    !(w->inode->i_sb->s_journal_cookie))

> +	journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);

Note that this is also a "late" notification.  At phase 7 the transaction
is complete and committed to disk, the rest of journal_commit_transaction()
is just doing cleanup and background processing.

> +/* This is a journal commit event for file systems. These events are sent
> + * from a journaling filesystem or the journal itself.
> + *
> + * IN_JOURNAL_COMMIT is special in that only one file per journaling
> + * filesystem partition (super-block) is allowed. Registering for this
> + * event with multiple files will overwrite previous registrations.

This is a major limitation of your implementation - what if 2 applications
want to be notified of this event?  Using the journal_callback_set() API
in the patch I sent out would allow an arbitrary number of applications
to monitor the commit.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 23:37       ` Andreas Dilger
@ 2008-09-16 23:05         ` Abhijit Paithankar
  2008-09-19 20:34           ` Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Abhijit Paithankar @ 2008-09-16 23:05 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner

Hi,

Thanks for your feedback.
My comments are inline.

* Andreas Dilger (adilger@sun.com) wrote:
> On Sep 15, 2008  12:36 -0700, Abhijit Paithankar wrote:
> > In this version of the patch I've done away with most of the file-system
> > specific changes. The file system now only needs to send out a journal
> > commit notification.
> 
> Is there a reason NOT to use the journal commit callback mechanism that
> I posted?  This would only require registering a single callback for
> each transaction for the inotify, but has the benefit of being completely
> generic and can be used for other commit notifiers in the future.
> 

In the current patch if two applications want to be notified on the journal
commit event they could register for IN_JOURNAL_COMMIT on the same inode,
which could be the root inode of the file system.

There is no limit on the number of applications that could register on the
root inode for journal commit notifications. The only limitation is that
they all have to register on the same inode for that partition.

The journal commit callback mechanism could be viable way to notify
journal commits. However it is not clear to me how an application would
register the callback.

I suspect we need to have some other mechanism to register that for the
application.


> Most of the rest of my comments are related to following the Linux
> coding style.  Please also generate your patch with "diff -up" so
> that the context includes the function name.
> 
> > +void journal_notify_commit (journal_t *journal)
> > +{
> 
> No space between the function name and the '(', journal_notify_commit(
ok, will fix.

> 
> > +	struct super_block *sb = journal->j_private;
> > +	if (sb->s_priv_inode && sb->s_journal_cookie)
> > +		fsnotify_journal_commit(sb);
> > +}
> 
> Please add an empty line between variable declarations and the code.
ack.

> 
> > +void jbd2_journal_notify_commit (journal_t *journal)
> > +{
> > +	struct super_block *sb = journal->j_private;
> > +	if (sb->s_priv_inode && sb->s_journal_cookie)
> > +		fsnotify_journal_commit(sb);
> > +}
> 
> Same comments as above.
ack.

> 
> > +		if (journal_cookie) {
> > +			size_t len, rem, event_size = sizeof(struct inotify_event);
> 
> Wrap at 80 columns, likely putting "size_t event_size = ..." on a new line.
sure.

> 
> >  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
> > +	if (mask & IN_JOURNAL_COMMIT) {
> > +		if (!(w->inode) || !(w->inode->i_sb)
> > +			|| !(w->inode->i_sb->s_priv_inode)
> > +			|| !(w->inode->i_sb->s_journal_cookie))
> > +			goto out;
> > +	}
> 
> Operators like "||" should go at the end of the line, and the new line
> should be aligned with the opening parenthesis:
> 
> 		if (!(w->inode) || !(w->inode->i_sb) ||
> 		    !(w->inode->i_sb->s_priv_inode) ||
> 		    !(w->inode->i_sb->s_journal_cookie))
> 
will do.

> > +	journal_notify_commit(journal);
> >  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
> >  		  journal->j_commit_sequence, journal->j_tail_sequence);
> 
> Note that this is also a "late" notification.  At phase 7 the transaction
> is complete and committed to disk, the rest of journal_commit_transaction()
> is just doing cleanup and background processing.
Yes, will move it up.

> 
> > +/* This is a journal commit event for file systems. These events are sent
> > + * from a journaling filesystem or the journal itself.
> > + *
> > + * IN_JOURNAL_COMMIT is special in that only one file per journaling
> > + * filesystem partition (super-block) is allowed. Registering for this
> > + * event with multiple files will overwrite previous registrations.
> 
> This is a major limitation of your implementation - what if 2 applications
> want to be notified of this event?  Using the journal_callback_set() API
> in the patch I sent out would allow an arbitrary number of applications
> to monitor the commit.

Any number of applications can register for the journal event as long as they
register on the same inode.

It is not clear to me how using journal_callback_set API would change that if
we use inotify to register callbacks.

Thanks!
Abhijit

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 13:23 ` Jamie Lokier
@ 2008-09-17  0:06   ` Abhijit Paithankar
  2008-09-17 11:35     ` Jamie Lokier
  0 siblings, 1 reply; 14+ messages in thread
From: Abhijit Paithankar @ 2008-09-17  0:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, Andreas Dilger

* Jamie Lokier (jamie@shareable.org) wrote:
> Abhijit Paithankar wrote:
> > One way to do it is to fsync every few operations. However, fsync is
> > blocking and affects performance.
> >
> > The other (more efficient) way is to have the filesystem notify the
> > application when a transaction/change is written to disk.
> 
> Is this more efficient than aio_fsync, and if so, why?
> 
> For file writes, aio_fsync seems like a cleaner interface, and if
> that's not fast enough, it could be made faster - perhaps using code
> form this patch.  (An aio_fsync_ranges would be even better).
> 
> -- Jamie

Hi,

aio_fsync does not provide any relationship between the metadata
operations and journal commit. There is no mechanism to track
operations which made it to the journal and the ones that did not.

With the journal notifications patch we will be able to know exactly
which metadata change hit the journal and when.

Thanks!
Abhijit

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-17  0:06   ` Abhijit Paithankar
@ 2008-09-17 11:35     ` Jamie Lokier
  0 siblings, 0 replies; 14+ messages in thread
From: Jamie Lokier @ 2008-09-17 11:35 UTC (permalink / raw)
  To: Abhijit Paithankar; +Cc: linux-fsdevel, Andreas Dilger

Abhijit Paithankar wrote:
> > > One way to do it is to fsync every few operations. However, fsync is
> > > blocking and affects performance.
> > >
> > > The other (more efficient) way is to have the filesystem notify the
> > > application when a transaction/change is written to disk.
> > 
> > Is this more efficient than aio_fsync, and if so, why?
> > 
> > For file writes, aio_fsync seems like a cleaner interface, and if
> > that's not fast enough, it could be made faster - perhaps using code
> > form this patch.  (An aio_fsync_ranges would be even better).
> 
> aio_fsync does not provide any relationship between the metadata
> operations and journal commit. There is no mechanism to track
> operations which made it to the journal and the ones that did not.

Oh, do you mean that aio_fsync tells you all operations have reached
the journal up to the time of aio_fsync, but your notifications say
_which_ operations reach it one at a time, so you can wait on specific
ones without waiting for them all?

If the latter, I agree that what you're doing is good :-)

-- Jamie

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-16 23:05         ` Abhijit Paithankar
@ 2008-09-19 20:34           ` Andreas Dilger
  2008-09-20 15:51             ` Jamie Lokier
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2008-09-19 20:34 UTC (permalink / raw)
  To: Abhijit Paithankar
  Cc: linux-fsdevel, linux-ext4, Jamie Lokier, Dave Chinner,
	Joel Becker

On Sep 16, 2008  16:05 -0700, Abhijit Paithankar wrote:
> > Is there a reason NOT to use the journal commit callback mechanism that
> > I posted?  This would only require registering a single callback for
> > each transaction for the inotify, but has the benefit of being completely
> > generic and can be used for other commit notifiers in the future.
> 
> In the current patch if two applications want to be notified on the journal
> commit event they could register for IN_JOURNAL_COMMIT on the same inode,
> which could be the root inode of the file system.
> 
> There is no limit on the number of applications that could register on the
> root inode for journal commit notifications. The only limitation is that
> they all have to register on the same inode for that partition.
>
> Any number of applications can register for the journal event as long as they
> register on the same inode.

That assumes agreement between the applications that are using this
interface.  It isn't at all desirable that applications have to know
the "mountpoint" of the filesystem in order to use inotify, and in
some cases (e.g. bind mount in a new namespace) there isn't even access
to the root inode.

> The journal commit callback mechanism could be viable way to notify
> journal commits. However it is not clear to me how an application would
> register the callback.

The application registers for the callback via the inotify interface,
which in turn needs to have an interface to the filesystem to use the
fs-specific transaction commit callback.

> It is not clear to me how using journal_callback_set API would change
> that if we use inotify to register callbacks.

The journal_callback_set() API doesn't need to change at all (though in
fact Joel Becker of OCFS2 is implementing a more generic API that can be
used for other kinds of commit callbacks.  He will hopefully post it to
this list soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-15 19:36     ` Abhijit Paithankar
  2008-09-15 23:37       ` Andreas Dilger
@ 2008-09-20  5:15       ` Jan Kara
  2008-09-23 22:26         ` Andreas Dilger
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2008-09-20  5:15 UTC (permalink / raw)
  To: Abhijit Paithankar
  Cc: Andreas Dilger, linux-fsdevel, linux-ext4, Jamie Lokier,
	Dave Chinner

  Hello,

> I have used inotify simply because it is a mechanism that already exists
> in the kernel and is easily extensible.
> 
> The use of transaction ids as "journal cookie" was done to simplify
> things. A journal transaction ID is a readily available monotonically
> increasing ID unique to that partition (super block).
> 
> Journal cookies themselves are completely opaque to the application
> and as long as it is a monotonically increasing number that signifies
> a journal commit, it could be anything that a file system chooses it
> to be.
> 
> Ext3/Ext4 gather multiple changes into a single transaction. So,
> you will see a single 'commit' notification for multiple of meta-data
> changes. The 'commit' messages, therefore, won't be that many.
  I have quite some comments (additional to Andreas') to the coding
style but let's put them aside (they can be easily fixed) and first
solve some more fundamental problems:
  1) At the first sight it seems to be a useful idea to send to
userspace notifications when some change is safely on disk. But at the
second sight, it's technically really not that simple as it may seem.
Filesystems such as ext2 have no good way of providing such information,
even ext4 with delayed allocation has no way of giving you such
information for writes and I guess similar thing holds for xfs. Also log
structured filesystems and similar can tell you when a change is safely
on disk but have no concept of something like journal commit. So at this
point one should really ask himself whether a feature having a chance to
work only on a limited number of setups has a serious chance of being
used by application developpers...
  Also note that on ext3 and ext4 a single 'write' call from userspace
can be split among several transactions but there's just one inotify
event so this has to be taken care of.
  2) If I've understood correctly, you append to the name in the inotify
event the journal cookie. Sorry, but that is a really ugly hack,
definitely not a way how an interface to the userspace should look like.
Actually, the 'cookie' argument of inotify seems to quite match your
needs - it is supposed to be used for connecting related events and that
is exactly what you need. The only problem with this is that rename
already uses the cookie and so you cannot change it arbitrarily.
  3) Using just a single inode for commit notification is ugly and as
Andreas said not even workable in some cases. A sensible semantics here
could be that you'd simply send "changes safely on disk" event to all
inodes where some change went to disk. That still might have some
technical issues (like being too slow, having to identify which inodes
should get the event etc.) but would at least look sound from the
logical point of view.

  I'm sorry to turn your ideas down like this. I still appreciate
you're trying to improve Linux :). Thanks.

								Honza

> This patch (v2):
> 
> In this version of the patch I've done away with most of the file-system
> specific changes. The file system now only needs to send out a journal
> commit notification.
> 
> The 'journal cookie' is simply a u64 that increases every time a journal
> 'commit' is made. This will make the code more portable to other file
> systems.
> 
> Also, we no longer need those many journal message types. We can easily
> piggy-back on top of fsnotify. The only remaining one is
> IN_JOURNAL_COMMIT.
> 
> My next step is to allow multiple registrations for JOURNAL_COMMIT
> notification. I will post that patch once it is ready.
> 
> Signed-off-by: Abhijit Paithankar <apaithan@akamai.com>
> ---
> 
>  fs/ext3/ext3_jbd.c       |    9 +++++
>  fs/ext4/ext4_jbd2.c      |    9 +++++
>  fs/inotify.c             |   17 ++++++---
>  fs/inotify_user.c        |   81 ++++++++++++++++++++++++++++++++++++++++++-----
>  fs/jbd/commit.c          |    1 
>  fs/jbd2/commit.c         |    1 
>  include/linux/fs.h       |    2 +
>  include/linux/fsnotify.h |   11 ++++++
>  include/linux/inotify.h  |   19 ++++++++++-
>  include/linux/jbd.h      |    1 
>  include/linux/jbd2.h     |    1 
>  11 files changed, 138 insertions(+), 14 deletions(-)
> 
> ---
> 
> 
> Index: linux-2.6/fs/ext3/ext3_jbd.c
> ===================================================================
> --- linux-2.6.orig/fs/ext3/ext3_jbd.c	2008-09-15 10:12:25.000000000 -0700
> +++ linux-2.6/fs/ext3/ext3_jbd.c	2008-09-15 10:15:54.000000000 -0700
> @@ -3,6 +3,7 @@
>   */
>  
>  #include <linux/ext3_jbd.h>
> +#include <linux/fsnotify.h>
>  
>  int __ext3_journal_get_undo_access(const char *where, handle_t *handle,
>  				struct buffer_head *bh)
> @@ -57,3 +58,11 @@
>  		ext3_journal_abort_handle(where, __func__, bh, handle,err);
>  	return err;
>  }
> +
> +void journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}
> +
> Index: linux-2.6/fs/ext4/ext4_jbd2.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/ext4_jbd2.c	2008-09-15 10:14:12.000000000 -0700
> +++ linux-2.6/fs/ext4/ext4_jbd2.c	2008-09-15 10:15:50.000000000 -0700
> @@ -2,6 +2,7 @@
>   * Interface between ext4 and JBD
>   */
>  
> +#include <linux/fsnotify.h>
>  #include "ext4_jbd2.h"
>  
>  int __ext4_journal_get_undo_access(const char *where, handle_t *handle,
> @@ -57,3 +58,11 @@
>  		ext4_journal_abort_handle(where, __func__, bh, handle, err);
>  	return err;
>  }
> +
> +void jbd2_journal_notify_commit (journal_t *journal)
> +{
> +	struct super_block *sb = journal->j_private;
> +	if (sb->s_priv_inode && sb->s_journal_cookie)
> +		fsnotify_journal_commit(sb);
> +}
> +
> Index: linux-2.6/fs/inotify.c
> ===================================================================
> --- linux-2.6.orig/fs/inotify.c	2008-09-15 10:12:29.000000000 -0700
> +++ linux-2.6/fs/inotify.c	2008-09-15 10:13:44.000000000 -0700
> @@ -231,7 +231,7 @@
>  				 struct inotify_watch *watch)
>  {
>  	remove_watch_no_event(watch, ih);
> -	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL);
> +	ih->in_ops->handle_event(watch, watch->wd, IN_IGNORED, 0, NULL, NULL, 0);
>  }
>  EXPORT_SYMBOL_GPL(inotify_remove_watch_locked);
>  
> @@ -280,10 +280,17 @@
>  			       const char *name, struct inode *n_inode)
>  {
>  	struct inotify_watch *watch, *next;
> +	u64 journal_cookie = 0;
>  
>  	if (!inotify_inode_watched(inode))
>  		return;
>  
> +	if (mask & (IN_ATTRIB | IN_MOVE | IN_CREATE | IN_DELETE |
> +		    IN_DELETE_SELF | IN_MOVE_SELF)) {
> +		if (inode->i_sb && inode->i_sb->s_journal_cookie)
> +			journal_cookie = inode->i_sb->s_journal_cookie;
> +	}
> +
>  	mutex_lock(&inode->inotify_mutex);
>  	list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) {
>  		u32 watch_mask = watch->mask;
> @@ -293,7 +300,7 @@
>  			if (watch_mask & IN_ONESHOT)
>  				remove_watch_no_event(watch, ih);
>  			ih->in_ops->handle_event(watch, watch->wd, mask, cookie,
> -						 name, n_inode);
> +						 name, n_inode, journal_cookie);
>  			mutex_unlock(&ih->mutex);
>  		}
>  	}
> @@ -409,7 +416,7 @@
>  			struct inotify_handle *ih= watch->ih;
>  			mutex_lock(&ih->mutex);
>  			ih->in_ops->handle_event(watch, watch->wd, IN_UNMOUNT, 0,
> -						 NULL, NULL);
> +						 NULL, NULL, 0);
>  			inotify_remove_watch_locked(ih, watch);
>  			mutex_unlock(&ih->mutex);
>  		}
> @@ -576,7 +583,7 @@
>  		mask_add = 1;
>  
>  	/* don't allow invalid bits: we don't want flags set */
> -	mask &= IN_ALL_EVENTS | IN_ONESHOT;
> +	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
>  	if (unlikely(!mask))
>  		return -EINVAL;
>  
> @@ -623,7 +630,7 @@
>  	int newly_watched;
>  
>  	/* don't allow invalid bits: we don't want flags set */
> -	mask &= IN_ALL_EVENTS | IN_ONESHOT;
> +	mask &= IN_ALL_EVENTS | IN_ONESHOT | IN_JOURNAL_COMMIT;
>  	if (unlikely(!mask))
>  		return -EINVAL;
>  	watch->mask = mask;
> Index: linux-2.6/fs/inotify_user.c
> ===================================================================
> --- linux-2.6.orig/fs/inotify_user.c	2008-09-15 10:12:34.000000000 -0700
> +++ linux-2.6/fs/inotify_user.c	2008-09-15 10:51:43.000000000 -0700
> @@ -185,7 +185,7 @@
>   * This function can sleep.
>   */
>  static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
> -						  const char *name)
> +						  const char *name, u64 journal_cookie)
>  {
>  	struct inotify_kernel_event *kevent;
>  
> @@ -204,6 +204,7 @@
>  
>  	if (name) {
>  		size_t len, rem, event_size = sizeof(struct inotify_event);
> +		size_t journal_cookie_sz;
>  
>  		/*
>  		 * We need to pad the filename so as to properly align an
> @@ -213,6 +214,10 @@
>  		 * simple and safe for all architectures.
>  		 */
>  		len = strlen(name) + 1;
> +		if (journal_cookie) {
> +			journal_cookie_sz = sizeof(u64);
> +			len = len + journal_cookie_sz;
> +		}
>  		rem = event_size - len;
>  		if (len > event_size) {
>  			rem = event_size - (len % event_size);
> @@ -225,13 +230,52 @@
>  			kmem_cache_free(event_cachep, kevent);
>  			return NULL;
>  		}
> -		memcpy(kevent->name, name, len);
> +		if (journal_cookie) {
> +			char *ch;
> +
> +			memcpy(kevent->name, name, strlen(name));
> +			ch = kevent->name;
> +			ch += strlen(name);
> +			memset(ch++, 0, 1);
> +			sprintf(ch, "%llu", journal_cookie);
> +		}
> +		else {
> +			memcpy(kevent->name, name, len);
> +		}
>  		if (rem)
>  			memset(kevent->name + len, 0, rem);
>  		kevent->event.len = len + rem;
>  	} else {
> -		kevent->event.len = 0;
> -		kevent->name = NULL;
> +		if (journal_cookie) {
> +			size_t len, rem, event_size = sizeof(struct inotify_event);
> +			size_t journal_cookie_sz = sizeof(u64);
> +			char *ch;
> +
> +			len = 1 + journal_cookie_sz;
> +			rem = event_size - len;
> +			if (len > event_size) {
> +				rem = event_size - (len % event_size);
> +				if (len % event_size == 0)
> +					rem = 0;
> +			}
> +
> +			kevent->name = kmalloc(len + rem, GFP_KERNEL);
> +			if (unlikely(!kevent->name)) {
> +				kmem_cache_free(event_cachep, kevent);
> +				return NULL;
> +			}
> +			memset(kevent->name, 0, 1);
> +			ch = kevent->name;
> +			ch += 1;
> +			sprintf(ch, "%llu", journal_cookie);
> +			if (rem)
> +				memset(kevent->name + len, 0, rem);
> +			kevent->event.len = len + rem;
> +		}
> +		else {
> +			kevent->event.len = 0;
> +			kevent->name = NULL;
> +		}
>  	}
>  
>  	return kevent;
> @@ -269,7 +313,7 @@
>   */
>  static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
>  				    u32 cookie, const char *name,
> -				    struct inode *ignored)
> +				    struct inode *ignored, u64 journal_cookie)
>  {
>  	struct inotify_user_watch *watch;
>  	struct inotify_device *dev;
> @@ -280,11 +324,25 @@
>  
>  	mutex_lock(&dev->ev_mutex);
>  
> +	if (mask & IN_JOURNAL_COMMIT) {
> +		if (!(w->inode) || !(w->inode->i_sb)
> +			|| !(w->inode->i_sb->s_priv_inode)
> +			|| !(w->inode->i_sb->s_journal_cookie))
> +			goto out;
> +	}
>  	/* we can safely put the watch as we don't reference it while
>  	 * generating the event
>  	 */
> -	if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
> +	if (mask & IN_IGNORED || w->mask & IN_ONESHOT) {
> +                if (w->mask & IN_JOURNAL_COMMIT) {
> +                        struct super_block *sb = NULL;
> +                        if (w->inode && w->inode->i_sb)
> +                                sb = w->inode->i_sb;
> +                        if (sb && sb->s_priv_inode)
> +                                sb->s_priv_inode = NULL;
> +                }
>  		put_inotify_watch(w); /* final put */
> +	}
>  
>  	/* coalescing: drop this event if it is a dupe of the previous */
>  	last = inotify_dev_get_last_event(dev);
> @@ -304,9 +362,9 @@
>  
>  	/* if the queue overflows, we need to notify user space */
>  	if (unlikely(dev->event_count == dev->max_events))
> -		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
> +		kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL, 0);
>  	else
> -		kevent = kernel_event(wd, mask, cookie, name);
> +		kevent = kernel_event(wd, mask, cookie, name, journal_cookie);
>  
>  	if (unlikely(!kevent))
>  		goto out;
> @@ -686,6 +744,13 @@
>  	ret = inotify_find_update_watch(dev->ih, inode, mask);
>  	if (ret == -ENOENT)
>  		ret = create_watch(dev, inode, mask);
> +	if (ret >= 0) {
> +		if (mask & IN_JOURNAL_COMMIT) {
> +                        struct super_block *sb = inode->i_sb;
> +                        if (sb)
> +                                sb->s_priv_inode = inode;
> +		}
> +	}
>  	mutex_unlock(&dev->up_mutex);
>  
>  	path_put(&path);
> Index: linux-2.6/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd/commit.c	2008-09-15 10:12:42.000000000 -0700
> +++ linux-2.6/fs/jbd/commit.c	2008-09-15 10:13:44.000000000 -0700
> @@ -928,6 +928,7 @@
>  	}
>  	spin_unlock(&journal->j_list_lock);
>  
> +	journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
>  
> Index: linux-2.6/fs/jbd2/commit.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd2/commit.c	2008-09-15 10:14:00.000000000 -0700
> +++ linux-2.6/fs/jbd2/commit.c	2008-09-15 10:16:09.000000000 -0700
> @@ -990,6 +990,7 @@
>  	}
>  	spin_unlock(&journal->j_list_lock);
>  
> +	jbd2_journal_notify_commit(journal);
>  	jbd_debug(1, "JBD: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
>  
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2008-09-15 10:12:47.000000000 -0700
> +++ linux-2.6/include/linux/fs.h	2008-09-15 10:13:44.000000000 -0700
> @@ -1106,6 +1106,8 @@
>  	char s_id[32];				/* Informational name */
>  
>  	void 			*s_fs_info;	/* Filesystem private info */
> +	u64			s_journal_cookie;	/* current journal cookie # */
> +	struct inode		*s_priv_inode;	/* used for journal notifications */
>  
>  	/*
>  	 * The next field is for VFS *only*. No filesystems have any business
> Index: linux-2.6/include/linux/fsnotify.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fsnotify.h	2008-09-15 10:12:53.000000000 -0700
> +++ linux-2.6/include/linux/fsnotify.h	2008-09-15 10:13:44.000000000 -0700
> @@ -34,6 +34,17 @@
>  	inotify_d_move(entry);
>  }
>  
> +static inline void fsnotify_journal_commit (struct super_block *sb)
> +{
> +	struct inode *inode;
> +	sb->s_journal_cookie += 1;
> +	if (sb->s_journal_cookie && sb->s_priv_inode) {
> +		inode = sb->s_priv_inode;
> +		inotify_inode_queue_event(inode, IN_JOURNAL_COMMIT, 0,
> +					  NULL, NULL);
> +	}
> +}
> +
>  /*
>   * fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
>   */
> Index: linux-2.6/include/linux/inotify.h
> ===================================================================
> --- linux-2.6.orig/include/linux/inotify.h	2008-09-15 10:12:59.000000000 -0700
> +++ linux-2.6/include/linux/inotify.h	2008-09-15 10:13:44.000000000 -0700
> @@ -44,6 +44,23 @@
>  #define IN_Q_OVERFLOW		0x00004000	/* Event queued overflowed */
>  #define IN_IGNORED		0x00008000	/* File was ignored */
>  
> +/* This is a journal commit event for file systems. These events are sent
> + * from a journaling filesystem or the journal itself.
> + *
> + * IN_JOURNAL_COMMIT is special in that only one file per journaling filesystem
> + * partition (super-block) is allowed. Registering for this event with multiple
> + * files will overwrite previous registrations. The last registration for this
> + * event will be taken as current and commit notifications will be delivered
> + * for that inode only.
> + *
> + * The inode registered for this event should not be unlinked. This will disable
> + * any further journal commit notifications.
> + *
> + * If this event is not registered for, journal commit messages will not be
> + * delivered
> + */
> +#define IN_JOURNAL_COMMIT	0x00010000	/* Journal on this superblock has committed a transaction */
> +
>  /* helper events */
>  #define IN_CLOSE		(IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
>  #define IN_MOVE			(IN_MOVED_FROM | IN_MOVED_TO) /* moves */
> @@ -97,7 +114,7 @@
>  
>  struct inotify_operations {
>  	void (*handle_event)(struct inotify_watch *, u32, u32, u32,
> -			     const char *, struct inode *);
> +			     const char *, struct inode *, u64);
>  	void (*destroy_watch)(struct inotify_watch *);
>  };
>  
> Index: linux-2.6/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.orig/include/linux/jbd.h	2008-09-15 10:13:08.000000000 -0700
> +++ linux-2.6/include/linux/jbd.h	2008-09-15 10:13:44.000000000 -0700
> @@ -114,6 +114,7 @@
>   * This is an opaque datatype.
>   **/
>  typedef struct journal_s	journal_t;	/* Journal control structure */
> +extern void journal_notify_commit(journal_t *journal);
>  #endif
>  
>  /*
> Index: linux-2.6/include/linux/jbd2.h
> ===================================================================
> --- linux-2.6.orig/include/linux/jbd2.h	2008-09-15 10:16:18.000000000 -0700
> +++ linux-2.6/include/linux/jbd2.h	2008-09-15 10:16:45.000000000 -0700
> @@ -115,6 +115,7 @@
>   * This is an opaque datatype.
>   **/
>  typedef struct journal_s	journal_t;	/* Journal control structure */
> +extern void jbd2_journal_notify_commit(journal_t *journal);
>  #endif
>  
>  /*
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-19 20:34           ` Andreas Dilger
@ 2008-09-20 15:51             ` Jamie Lokier
  0 siblings, 0 replies; 14+ messages in thread
From: Jamie Lokier @ 2008-09-20 15:51 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Abhijit Paithankar, linux-fsdevel, linux-ext4, Dave Chinner,
	Joel Becker

Andreas Dilger wrote:
> That assumes agreement between the applications that are using this
> interface.  It isn't at all desirable that applications have to know
> the "mountpoint" of the filesystem in order to use inotify, and in
> some cases (e.g. bind mount in a new namespace) there isn't even access
> to the root inode.

A filesystem's root inode needn't be mounted at all.

You don't need a new namespace - bind mount is enough by itself.

It has me wondering - how can an application even tell when it has the
root inode of a filesystem?  You can't tell from /etc/mtab or
/proc/mounts, nor from traversing the filesystem itself - except for
filesystems where you know the expected inode number of the root inode.

-- Jamie

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH] Filesystem Journal Notifications
  2008-09-20  5:15       ` Jan Kara
@ 2008-09-23 22:26         ` Andreas Dilger
  0 siblings, 0 replies; 14+ messages in thread
From: Andreas Dilger @ 2008-09-23 22:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Abhijit Paithankar, linux-fsdevel, linux-ext4, Jamie Lokier,
	Dave Chinner

On Sep 20, 2008  07:15 +0200, Jan Kara wrote:
>   1) At the first sight it seems to be a useful idea to send to
> userspace notifications when some change is safely on disk. But at the
> second sight, it's technically really not that simple as it may seem.
> Filesystems such as ext2 have no good way of providing such information,
> even ext4 with delayed allocation has no way of giving you such
> information for writes and I guess similar thing holds for xfs. Also log
> structured filesystems and similar can tell you when a change is safely
> on disk but have no concept of something like journal commit. So at this
> point one should really ask himself whether a feature having a chance to
> work only on a limited number of setups has a serious chance of being
> used by application developpers...
>   Also note that on ext3 and ext4 a single 'write' call from userspace
> can be split among several transactions but there's just one inotify
> event so this has to be taken care of.

I think the assumption is that this isn't an API used by "generic"
applications, but rather ones that are being optimized for the system
they are running on.  ext3/4 is common enough that this isn't a very
limiting optimization.  I don't think Akamai are going to care about
running on JFFS2 or VFAT.  Probably registering for such events on
filesystems that don't understand them should just return an error.

We'll be stuck in 1981 forever if we continue to limit interfaces to
what all filesystems understand.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-09-23 22:26 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-12 22:06 [RFC PATCH] Filesystem Journal Notifications Abhijit Paithankar
2008-09-13  9:19 ` Dave Chinner
2008-09-15  2:31 ` Andreas Dilger
2008-09-15  5:39   ` Andreas Dilger
2008-09-15 19:36     ` Abhijit Paithankar
2008-09-15 23:37       ` Andreas Dilger
2008-09-16 23:05         ` Abhijit Paithankar
2008-09-19 20:34           ` Andreas Dilger
2008-09-20 15:51             ` Jamie Lokier
2008-09-20  5:15       ` Jan Kara
2008-09-23 22:26         ` Andreas Dilger
2008-09-15 13:23 ` Jamie Lokier
2008-09-17  0:06   ` Abhijit Paithankar
2008-09-17 11:35     ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).