* [PATCH -v1 01/11] fsnotify: unified filesystem notification backend
@ 2009-02-09 21:15 Eric Paris
2009-02-09 21:15 ` [PATCH -v1 02/11] fsnotify: add group priorities Eric Paris
` (10 more replies)
0 siblings, 11 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
fsnotify is a backend for filesystem notification. fsnotify does
not provide any userspace interface but does provide the basis
needed for other notification schemes such as dnotify. fsnotify
can be extended to be the backend for inotify or the upcoming
fanotify. fsnotify provides a mechanism for "groups" to register
for some set of filesystem events; events are then delivered to
the registered groups for processing.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/Kconfig | 13 +++
fs/notify/Makefile | 2
fs/notify/fsnotify.c | 78 +++++++++++++++++++
fs/notify/fsnotify.h | 19 +++++
fs/notify/group.c | 157 ++++++++++++++++++++++++++++++++++++++
fs/notify/notification.c | 133 ++++++++++++++++++++++++++++++++
include/linux/fsnotify.h | 55 +++++++++++--
include/linux/fsnotify_backend.h | 148 ++++++++++++++++++++++++++++++++++++
8 files changed, 597 insertions(+), 8 deletions(-)
create mode 100644 fs/notify/fsnotify.c
create mode 100644 fs/notify/fsnotify.h
create mode 100644 fs/notify/group.c
create mode 100644 fs/notify/notification.c
create mode 100644 include/linux/fsnotify_backend.h
diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig
index 50914d7..31dac7e 100644
--- a/fs/notify/Kconfig
+++ b/fs/notify/Kconfig
@@ -1,2 +1,15 @@
+config FSNOTIFY
+ bool "Filesystem notification backend"
+ default y
+ ---help---
+ fsnotify is a backend for filesystem notification. fsnotify does
+ not provide any userspace interface but does provide the basis
+ needed for other notification schemes such as dnotify, inotify,
+ and fanotify.
+
+ Say Y here to enable fsnotify support.
+
+ If unsure, say Y.
+
source "fs/notify/dnotify/Kconfig"
source "fs/notify/inotify/Kconfig"
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 5a95b60..7cb285a 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -1,2 +1,4 @@
obj-y += dnotify/
obj-y += inotify/
+
+obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
new file mode 100644
index 0000000..b51f90a
--- /dev/null
+++ b/fs/notify/fsnotify.c
@@ -0,0 +1,78 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/srcu.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{
+ struct fsnotify_group *group;
+ struct fsnotify_event *event = NULL;
+ int idx;
+
+ if (list_empty(&fsnotify_groups))
+ return;
+
+ if (!(mask & fsnotify_mask))
+ return;
+
+ /*
+ * SRCU!! The groups list is very much read-only and this path is
+ * very hot (assuming something is using fsnotify). Not blocking while
+ * walking this list is ugly. We could preallocate an event and an
+ * event holder for every group the event might need to be put on, but
+ * all that possibly wasted allocation is nuts. For all we know there
+ * are already mark entries, groups don't need this event, or all
+ * sorts of reasons to believe not every kernel action is going to get
+ * sent to userspace. Taking a mutex here instead would needlessly
+ * serialize read/write/open/close across the whole system, so we
+ * accept the SRCU complexity.
+ */
+ idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
+ list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
+ if (mask & group->mask) {
+ if (!event) {
+ event = fsnotify_create_event(to_tell, mask, data, data_is);
+ /* we OOM'd and have no way to report it; the event is silently dropped */
+ if (!event)
+ break;
+ }
+ group->ops->handle_event(group, event);
+ }
+ }
+ srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
+ /*
+ * fsnotify_create_event() took a reference so the event can't be cleaned
+ * up while we are still trying to add it to lists, drop that one.
+ */
+ if (event)
+ fsnotify_put_event(event);
+}
+EXPORT_SYMBOL_GPL(fsnotify);
+
+static __init int fsnotify_init(void)
+{
+ return init_srcu_struct(&fsnotify_grp_srcu_struct);
+}
+subsys_initcall(fsnotify_init);
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
new file mode 100644
index 0000000..9c53600
--- /dev/null
+++ b/fs/notify/fsnotify.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_FSNOTIFY_PRIVATE_H
+#define _LINUX_FSNOTIFY_PRIVATE_H
+
+#include <linux/dcache.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/spinlock.h>
+
+#include <linux/fsnotify.h>
+
+#include <asm/atomic.h>
+
+extern struct srcu_struct fsnotify_grp_srcu_struct;
+extern struct list_head fsnotify_groups;
+extern __u64 fsnotify_mask;
+
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
new file mode 100644
index 0000000..a877929
--- /dev/null
+++ b/fs/notify/group.c
@@ -0,0 +1,157 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/rculist.h>
+#include <linux/wait.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+#include <asm/atomic.h>
+
+DEFINE_MUTEX(fsnotify_grp_mutex);
+struct srcu_struct fsnotify_grp_srcu_struct;
+LIST_HEAD(fsnotify_groups);
+__u64 fsnotify_mask;
+
+void fsnotify_recalc_global_mask(void)
+{
+ struct fsnotify_group *group;
+ __u64 mask = 0;
+ int idx;
+
+ idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
+ list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
+ mask |= group->mask;
+ }
+ srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
+ fsnotify_mask = mask;
+}
+
+static void fsnotify_add_group(struct fsnotify_group *group)
+{
+ list_add_rcu(&group->group_list, &fsnotify_groups);
+ group->evicted = 0;
+}
+
+void fsnotify_get_group(struct fsnotify_group *group)
+{
+ atomic_inc(&group->refcnt);
+}
+
+static void fsnotify_destroy_group(struct fsnotify_group *group)
+{
+ if (group->ops->free_group_priv)
+ group->ops->free_group_priv(group);
+
+ kfree(group);
+}
+
+void fsnotify_evict_group(struct fsnotify_group *group)
+{
+ mutex_lock(&fsnotify_grp_mutex);
+ if (!group->evicted)
+ list_del_rcu(&group->group_list);
+ group->evicted = 1;
+ mutex_unlock(&fsnotify_grp_mutex);
+}
+
+void fsnotify_put_group(struct fsnotify_group *group)
+{
+ if (atomic_dec_and_test(&group->refcnt)) {
+ int refcnt;
+
+ fsnotify_evict_group(group);
+
+ synchronize_srcu(&fsnotify_grp_srcu_struct);
+
+ fsnotify_recalc_global_mask();
+
+ mutex_lock(&fsnotify_grp_mutex);
+ refcnt = atomic_read(&group->refcnt);
+ mutex_unlock(&fsnotify_grp_mutex);
+
+ /* something else found us during the sync, let them clean it up */
+ if (refcnt)
+ return;
+
+ fsnotify_destroy_group(group);
+ }
+}
+
+static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+{
+ struct fsnotify_group *group_iter;
+ struct fsnotify_group *group = NULL;
+
+ list_for_each_entry_rcu(group_iter, &fsnotify_groups, group_list) {
+ if (group_iter->group_num == group_num) {
+ if ((group_iter->mask == mask) &&
+ (group_iter->ops == ops)) {
+ fsnotify_get_group(group_iter);
+ group = group_iter;
+ } else
+ group = ERR_PTR(-EEXIST);
+ }
+ }
+ return group;
+}
+
+/*
+ * Low-use function. It could be faster to check whether the group exists
+ * before doing the allocation and initialization, but this is only called
+ * when notification systems make changes, so keep it simple.
+ */
+struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+{
+ struct fsnotify_group *group, *tgroup;
+
+ group = kmalloc(sizeof(struct fsnotify_group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&group->refcnt, 1);
+
+ group->group_num = group_num;
+ group->mask = mask;
+
+ group->ops = ops;
+
+ mutex_lock(&fsnotify_grp_mutex);
+ tgroup = fsnotify_find_group(group_num, mask, ops);
+ /* we raced and something else inserted the same group */
+ if (tgroup) {
+ mutex_unlock(&fsnotify_grp_mutex);
+ /* destroy the new one we made */
+ fsnotify_put_group(group);
+ return tgroup;
+ }
+
+ /* ok, no races here, add it */
+ fsnotify_add_group(group);
+ mutex_unlock(&fsnotify_grp_mutex);
+
+ if (mask)
+ fsnotify_recalc_global_mask();
+
+ return group;
+}
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
new file mode 100644
index 0000000..c893873
--- /dev/null
+++ b/fs/notify/notification.c
@@ -0,0 +1,133 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/path.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <asm/atomic.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+static struct kmem_cache *event_kmem_cache;
+
+void fsnotify_get_event(struct fsnotify_event *event)
+{
+ atomic_inc(&event->refcnt);
+}
+
+void fsnotify_put_event(struct fsnotify_event *event)
+{
+ if (!event)
+ return;
+
+ if (atomic_dec_and_test(&event->refcnt)) {
+ if (event->flag == FSNOTIFY_EVENT_PATH) {
+ path_put(&event->path);
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ }
+
+ event->mask = 0;
+
+ BUG_ON(!list_empty(&event->private_data_list));
+ kmem_cache_free(event_kmem_cache, event);
+ }
+}
+
+struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_event_private_data *lpriv;
+ struct fsnotify_event_private_data *priv = NULL;
+
+ list_for_each_entry(lpriv, &event->private_data_list, event_list) {
+ if (lpriv->group == group) {
+ priv = lpriv;
+ break;
+ }
+ }
+ return priv;
+}
+
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{
+ struct fsnotify_event *event;
+
+ event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
+ if (!event)
+ return NULL;
+
+ atomic_set(&event->refcnt, 1);
+
+ spin_lock_init(&event->lock);
+
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ event->inode = NULL;
+
+ INIT_LIST_HEAD(&event->private_data_list);
+
+ event->to_tell = to_tell;
+
+ switch (data_is) {
+ case FSNOTIFY_EVENT_FILE: {
+ struct file *file = data;
+ struct path *path = &file->f_path;
+ event->path.dentry = path->dentry;
+ event->path.mnt = path->mnt;
+ path_get(&event->path);
+ event->flag = FSNOTIFY_EVENT_PATH;
+ break;
+ }
+ case FSNOTIFY_EVENT_PATH: {
+ struct path *path = data;
+ event->path.dentry = path->dentry;
+ event->path.mnt = path->mnt;
+ path_get(&event->path);
+ event->flag = FSNOTIFY_EVENT_PATH;
+ break;
+ }
+ case FSNOTIFY_EVENT_INODE:
+ event->inode = data;
+ event->flag = FSNOTIFY_EVENT_INODE;
+ break;
+ default:
+ BUG();
+ }
+
+ event->mask = mask;
+
+ return event;
+}
+
+__init int fsnotify_notification_init(void)
+{
+ event_kmem_cache = kmem_cache_create("fsnotify_event", sizeof(struct fsnotify_event), 0, SLAB_PANIC, NULL);
+
+ return 0;
+}
+subsys_initcall(fsnotify_notification_init);
+
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 00fbd5b..d1f7de2 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -13,6 +13,7 @@
#include <linux/dnotify.h>
#include <linux/inotify.h>
+#include <linux/fsnotify_backend.h>
#include <linux/audit.h>
/*
@@ -43,28 +44,45 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
{
struct inode *source = moved->d_inode;
u32 cookie = inotify_get_cookie();
+ __u64 old_dir_mask = 0;
+ __u64 new_dir_mask = 0;
- if (old_dir == new_dir)
+ if (old_dir == new_dir) {
inode_dir_notify(old_dir, DN_RENAME);
- else {
+ old_dir_mask = FS_DN_RENAME;
+ } else {
inode_dir_notify(old_dir, DN_DELETE);
+ old_dir_mask = FS_DELETE;
inode_dir_notify(new_dir, DN_CREATE);
+ new_dir_mask = FS_CREATE;
}
- if (isdir)
+ if (isdir) {
isdir = IN_ISDIR;
+ old_dir_mask |= FS_IN_ISDIR;
+ new_dir_mask |= FS_IN_ISDIR;
+ }
+
+ old_dir_mask |= FS_MOVED_FROM;
+ new_dir_mask |= FS_MOVED_TO;
+
inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,cookie,old_name,
source);
inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
source);
+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE);
+
if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
+ fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE);
}
if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -87,6 +105,8 @@ static inline void fsnotify_inoderemove(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);
+
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -95,6 +115,8 @@ static inline void fsnotify_inoderemove(struct inode *inode)
static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);
+
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -106,6 +128,8 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
+
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -120,6 +144,8 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
inode);
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);
+
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -131,6 +157,8 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0,
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
+
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -139,7 +167,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
static inline void fsnotify_access(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- u32 mask = IN_ACCESS;
+ __u64 mask = IN_ACCESS;
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
@@ -147,6 +175,8 @@ static inline void fsnotify_access(struct dentry *dentry)
dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -155,7 +185,7 @@ static inline void fsnotify_access(struct dentry *dentry)
static inline void fsnotify_modify(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- u32 mask = IN_MODIFY;
+ __u64 mask = IN_MODIFY;
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
@@ -163,6 +193,8 @@ static inline void fsnotify_modify(struct dentry *dentry)
dnotify_parent(dentry, DN_MODIFY);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -171,13 +203,15 @@ static inline void fsnotify_modify(struct dentry *dentry)
static inline void fsnotify_open(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- u32 mask = IN_OPEN;
+ __u64 mask = IN_OPEN;
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -189,13 +223,15 @@ static inline void fsnotify_close(struct file *file)
struct inode *inode = dentry->d_inode;
const char *name = dentry->d_name.name;
fmode_t mode = file->f_mode;
- u32 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;
+ __u64 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}
/*
@@ -204,13 +240,15 @@ static inline void fsnotify_close(struct file *file)
static inline void fsnotify_xattr(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- u32 mask = IN_ATTRIB;
+ __u64 mask = IN_ATTRIB;
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
/*
@@ -260,6 +298,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, in_mask, 0,
dentry->d_name.name);
+ fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
new file mode 100644
index 0000000..98ffcd6
--- /dev/null
+++ b/include/linux/fsnotify_backend.h
@@ -0,0 +1,148 @@
+/*
+ * Filesystem access notification for Linux
+ *
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
+ */
+
+#ifndef _LINUX_FSNOTIFY_BACKEND_H
+#define _LINUX_FSNOTIFY_BACKEND_H
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h> /* struct inode */
+#include <linux/list.h>
+#include <linux/path.h> /* struct path */
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+
+#include <asm/atomic.h>
+
+/*
+ * IN_* from inotify.h lines up EXACTLY with FS_*, this is so we can easily
+ * convert between them. dnotify only needs conversion at watch creation
+ * so no perf loss there. fanotify isn't defined yet, so it can use the
+ * holes if it needs more events.
+ */
+#define FS_ACCESS 0x0000000000000001ull /* File was accessed */
+#define FS_MODIFY 0x0000000000000002ull /* File was modified */
+#define FS_ATTRIB 0x0000000000000004ull /* Metadata changed */
+#define FS_CLOSE_WRITE 0x0000000000000008ull /* Writable file was closed */
+#define FS_CLOSE_NOWRITE 0x0000000000000010ull /* Unwritable file closed */
+#define FS_OPEN 0x0000000000000020ull /* File was opened */
+#define FS_MOVED_FROM 0x0000000000000040ull /* File was moved from X */
+#define FS_MOVED_TO 0x0000000000000080ull /* File was moved to Y */
+#define FS_CREATE 0x0000000000000100ull /* Subfile was created */
+#define FS_DELETE 0x0000000000000200ull /* Subfile was deleted */
+#define FS_DELETE_SELF 0x0000000000000400ull /* Self was deleted */
+#define FS_MOVE_SELF 0x0000000000000800ull /* Self was moved */
+
+#define FS_UNMOUNT 0x0000000000002000ull /* inode on umount fs */
+#define FS_Q_OVERFLOW 0x0000000000004000ull /* Event queued overflowed */
+#define FS_IN_IGNORED 0x0000000000008000ull /* last inotify event here */
+
+#define FS_IN_ISDIR 0x0000000040000000ull /* event occurred against dir */
+#define FS_IN_ONESHOT 0x0000000080000000ull /* only send event once */
+
+/*
+ * FSNOTIFY has decided to separate out events for self vs events delivered to
+ * a parent based on the actions of the child. dnotify does this for 4 events:
+ * ACCESS, MODIFY, ATTRIB, and DELETE. inotify adds to that list CLOSE_WRITE,
+ * CLOSE_NOWRITE, CREATE, MOVE_FROM, MOVE_TO, OPEN. So all of these _CHILD
+ * events are defined the same as the regular events, only << 32 for easy conversion.
+ */
+#define FS_ACCESS_CHILD 0x0000000100000000ull /* child was accessed */
+#define FS_MODIFY_CHILD 0x0000000200000000ull /* child was modified */
+#define FS_ATTRIB_CHILD 0x0000000400000000ull /* child attributes changed */
+#define FS_CLOSE_WRITE_CHILD 0x0000000800000000ull /* Writable file was closed */
+#define FS_CLOSE_NOWRITE_CHILD 0x0000001000000000ull /* Unwritable file closed */
+#define FS_OPEN_CHILD 0x0000002000000000ull /* File was opened */
+#define FS_MOVED_FROM_CHILD 0x0000004000000000ull /* File was moved from X */
+#define FS_MOVED_TO_CHILD 0x0000008000000000ull /* File was moved to Y */
+#define FS_CREATE_CHILD 0x0000010000000000ull /* Subfile was created */
+#define FS_DELETE_CHILD 0x0000020000000000ull /* child was deleted */
+
+#define FS_DN_RENAME 0x1000000000000000ull /* file renamed */
+#define FS_DN_MULTISHOT 0x2000000000000000ull /* dnotify multishot */
+
+/* when calling fsnotify tell it if the data is a path or inode */
+#define FSNOTIFY_EVENT_PATH 1
+#define FSNOTIFY_EVENT_INODE 2
+#define FSNOTIFY_EVENT_FILE 3
+
+struct fsnotify_group;
+struct fsnotify_event;
+
+struct fsnotify_ops {
+ int (*handle_event)(struct fsnotify_group *group, struct fsnotify_event *event);
+ void (*free_group_priv)(struct fsnotify_group *group);
+ void (*free_event_priv)(struct fsnotify_group *group, struct fsnotify_event *event);
+};
+
+struct fsnotify_group {
+ struct list_head group_list; /* list of all groups on the system */
+ __u64 mask; /* mask of events this group cares about */
+ atomic_t refcnt; /* num of processes with a special file open */
+ unsigned int group_num; /* the 'name' of the event */
+
+ const struct fsnotify_ops *ops; /* how this group handles things */
+
+ unsigned int evicted :1; /* has this group been evicted? */
+
+ /* groups can define private fields here */
+ union {
+ };
+};
+
+struct fsnotify_event_private_data {
+ struct fsnotify_group *group;
+ struct list_head event_list;
+ char data[0];
+};
+
+/*
+ * all of the information about the original object we want to now send to
+ * a scanner. If you want to carry more info from the accessing task to the
+ * listener this structure is where you need to be adding fields.
+ */
+struct fsnotify_event {
+ spinlock_t lock; /* protection for the associated event_holder and private_list */
+ struct inode *to_tell;
+ /*
+ * depending on the event type we should have either a path or inode
+ * we should never have more than one....
+ */
+ union {
+ struct path path;
+ struct inode *inode;
+ };
+ int flag; /* which of the above we have */
+ atomic_t refcnt; /* how many groups still are using/need to send this event */
+ __u64 mask; /* the type of access */
+
+ struct list_head private_data_list;
+};
+
+#ifdef CONFIG_FSNOTIFY
+
+/* called from the vfs to signal fs events */
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+
+/* called from fsnotify interfaces, such as fanotify or dnotify */
+extern void fsnotify_recalc_global_mask(void);
+extern struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
+extern void fsnotify_put_group(struct fsnotify_group *group);
+extern void fsnotify_get_group(struct fsnotify_group *group);
+
+extern void fsnotify_get_event(struct fsnotify_event *event);
+extern void fsnotify_put_event(struct fsnotify_event *event);
+extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
+
+#else
+
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{}
+#endif /* CONFIG_FSNOTIFY */
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_FSNOTIFY_BACKEND_H */
* [PATCH -v1 02/11] fsnotify: add group priorities
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
@ 2009-02-09 21:15 ` Eric Paris
2009-02-09 21:15 ` [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings Eric Paris
` (9 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
In preparation for blocking fsnotify calls, group priorities must be added.
When multiple groups request the same event type, the lowest priority group
will receive the notification first.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/group.c | 27 ++++++++++++++++++++++-----
include/linux/fsnotify_backend.h | 3 ++-
2 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/fs/notify/group.c b/fs/notify/group.c
index a877929..d7cabe5 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -49,8 +49,21 @@ void fsnotify_recalc_global_mask(void)
static void fsnotify_add_group(struct fsnotify_group *group)
{
- list_add_rcu(&group->group_list, &fsnotify_groups);
+ int priority = group->priority;
+ struct fsnotify_group *group_iter;
+
group->evicted = 0;
+ list_for_each_entry(group_iter, &fsnotify_groups, group_list) {
+ /* insert in front of this one? */
+ if (priority < group_iter->priority) {
+ /* list_add_tail_rcu() inserts in front of group_iter */
+ list_add_tail_rcu(&group->group_list, &group_iter->group_list);
+ return;
+ }
+ }
+
+ /* apparently we need to be the last entry */
+ list_add_tail_rcu(&group->group_list, &fsnotify_groups);
}
void fsnotify_get_group(struct fsnotify_group *group)
@@ -98,14 +111,16 @@ void fsnotify_put_group(struct fsnotify_group *group)
}
}
-static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+static struct fsnotify_group *fsnotify_find_group(unsigned int priority, unsigned int group_num,
+ __u64 mask, const struct fsnotify_ops *ops)
{
struct fsnotify_group *group_iter;
struct fsnotify_group *group = NULL;
list_for_each_entry_rcu(group_iter, &fsnotify_groups, group_list) {
- if (group_iter->group_num == group_num) {
+ if (group_iter->priority == priority) {
if ((group_iter->mask == mask) &&
+ (group_iter->group_num == group_num) &&
(group_iter->ops == ops)) {
fsnotify_get_group(group_iter);
group = group_iter;
@@ -121,7 +136,8 @@ static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64
* the allocation and the initialization, but this is only called when notification
* systems make changes, so why make it more complex?
*/
-struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num,
+ __u64 mask, const struct fsnotify_ops *ops)
{
struct fsnotify_group *group, *tgroup;
@@ -131,13 +147,14 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask,
atomic_set(&group->refcnt, 1);
+ group->priority = priority;
group->group_num = group_num;
group->mask = mask;
group->ops = ops;
mutex_lock(&fsnotify_grp_mutex);
- tgroup = fsnotify_find_group(group_num, mask, ops);
+ tgroup = fsnotify_find_group(priority, group_num, mask, ops);
/* we raced and something else inserted the same group */
if (tgroup) {
mutex_unlock(&fsnotify_grp_mutex);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 98ffcd6..e156df5 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -86,6 +86,7 @@ struct fsnotify_group {
const struct fsnotify_ops *ops; /* how this group handles things */
+ unsigned int priority; /* order this group should receive msgs. low first */
unsigned int evicted :1; /* has this group been evicted? */
/* groups can define private fields here */
@@ -129,7 +130,7 @@ extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
-extern struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
+extern struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
extern void fsnotify_put_group(struct fsnotify_group *group);
extern void fsnotify_get_group(struct fsnotify_group *group);
* [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
2009-02-09 21:15 ` [PATCH -v1 02/11] fsnotify: add group priorities Eric Paris
@ 2009-02-09 21:15 ` Eric Paris
2009-02-12 21:57 ` Andrew Morton
2009-02-09 21:15 ` [PATCH -v1 04/11] fsnotify: parent event notification Eric Paris
` (8 subsequent siblings)
10 siblings, 1 reply; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
This patch adds in-inode fsnotify markings. dnotify will use in-inode
marks to select the inodes it wants events for; fanotify will use them to
mark inodes it does not want events for.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
Documentation/filesystems/fsnotify.txt | 192 ++++++++++++++++++++++++++
fs/inode.c | 9 +
fs/notify/Makefile | 2
fs/notify/fsnotify.c | 13 ++
fs/notify/fsnotify.h | 3
fs/notify/group.c | 24 +++
fs/notify/inode_mark.c | 238 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 5 +
include/linux/fsnotify.h | 9 +
include/linux/fsnotify_backend.h | 50 +++++++
10 files changed, 543 insertions(+), 2 deletions(-)
create mode 100644 Documentation/filesystems/fsnotify.txt
create mode 100644 fs/notify/inode_mark.c
diff --git a/Documentation/filesystems/fsnotify.txt b/Documentation/filesystems/fsnotify.txt
new file mode 100644
index 0000000..6da6cc1
--- /dev/null
+++ b/Documentation/filesystems/fsnotify.txt
@@ -0,0 +1,192 @@
+fsnotify inode mark locking, lifetime, and refcounting
+
+struct fsnotify_mark_entry {
+ __u64 mask; /* mask this mark entry is for */
+ atomic_t refcnt; /* active things looking at this mark */
+ int freeme; /* free when this is set and refcnt hits 0 */
+ struct inode *inode; /* inode this entry is associated with */
+ struct fsnotify_group *group; /* group this mark entry is for */
+ struct list_head i_list; /* list of mark_entries by inode->i_fsnotify_mark_entries */
+ struct list_head g_list; /* list of mark_entries by group->i_fsnotify_mark_entries */
+ spinlock_t lock; /* protect group, inode, and freeme */
+ struct list_head free_i_list; /* tmp list used when freeing this mark */
+ struct list_head free_g_list; /* tmp list used when freeing this mark */
+ void (*free_private)(struct fsnotify_mark_entry *entry); /* called on final put+free */
+};
+
+REFCNT:
+The mark->refcnt tells how many "things" in the kernel currently are
+referencing this object. The object typically will live inside the kernel
+with a refcnt of 0. Any task that finds the fsnotify_mark_entry and either
+already has a reference or holds the inode->i_lock or group->mark_lock can
+take a reference. The mark will survive until the reference hits 0 and the
+object has been marked for death (freeme).
+
+LOCKING:
+There are 3 spinlocks involved with fsnotify inode marks and they MUST
+be taken in the following order:
+
+entry->lock
+group->mark_lock
+inode->i_lock
+
+entry->lock protects 3 things: entry->group, entry->inode, and entry->freeme.
+
+group->mark_lock protects the mark_entries list anchored inside a given group;
+each entry is hooked in via its g_list. It also loosely protects the
+free_g_list, which, when used, is anchored by a private list on the stack of
+the task which held the group->mark_lock.
+
+inode->i_lock protects the i_fsnotify_mark_entries list anchored inside a
+given inode; each entry is hooked in via its i_list. (and, similarly, the
+free_i_list)
+
+
+LIFETIME:
+Inode marks survive between when they are added to an inode and when their
+refcnt==0 and freeme has been set. fsnotify_mark_entries normally live in the
+kernel with a refcnt==0 and only have positive refcnt when "something" is
+actively referencing its contents. freeme is set when the fsnotify_mark is no
+longer on either an inode or a group list. If it is off both lists we
+know the mark is unreachable by anything else and when the refcnt hits 0 we
+can free.
+
+The inode mark can be cleared for a number of different reasons including:
+The inode is unlinked for the last time. (fsnotify_inoderemove)
+The inode is being evicted from cache. (fsnotify_inode_delete)
+The fs the inode is on is unmounted. (fsnotify_inode_delete/fsnotify_unmount_inodes)
+Something explicitly requests that it be removed. (fsnotify_destroy_mark_by_entry)
+The fsnotify_group associated with the mark is going away and all such marks
+need to be cleaned up. (fsnotify_clear_marks_by_group)
+
+Worst case we are given an inode and need to clean up all the marks on that
+inode. We take i_lock and walk the i_fsnotify_mark_entries safely. While
+walking the list we list_del_init the i_list, take a reference, and using
+free_i_list hook this into a private list we anchor on the stack. At this
+point we can safely drop the i_lock and walk the private list we anchored on
+the stack (remember we hold a reference to everything on this list.) For each
+entry we take the entry->lock and check if the ->group and ->inode pointers
+are set. If they are, we take their respective locks and now hold all the
+interesting locks. We can safely list_del_init the i_list and g_list and can
+set ->inode and ->group to NULL. Once off both lists and everything is surely
+NULL we set freeme for cleanup.
+
+Freeing by group works very similarly, except we use the free_g_list.
+
+This has the very interesting property of being able to run concurrently with
+any (or all) of the other teardown paths. Let's walk through what happens with
+several things trying to simultaneously mark this entry for destruction.
+
+A finds this entry by some means and takes a reference. (this could be any
+means, including, in the case of inotify, through an idr, which is known to be
+safe since the idr entry itself holds a reference)
+B finds this entry by some means and takes a reference.
+
+At this point.
+ refcnt == 2
+ i_list -> inode
+ inode -> inode
+ g_list -> group
+ group -> group
+ free_i_list -> NUL
+ free_g_list -> NUL
+ freeme = 0
+
+C comes in and tries to free all of the fsnotify_mark_entries attached to an inode.
+---- C will take the i_lock and walk the i_fsnotify_mark_entries list calling
+ list_del_init() on i_list, adding each entry to its private list via
+ free_i_list, and taking a reference. C releases the i_lock, starts
+ walking the private list, and blocks on the entry->lock (held by A
+ below)
+
+At this point.
+ refcnt == 3
+ i_list -> NUL
+ inode -> inode
+ g_list -> group
+ group -> group
+ free_i_list -> private list on stack
+ free_g_list -> NUL
+ freeme = 0
+
+refcnt == 3; the entry is no longer on the i_list.
+D comes in and tries to free all of the marks attached to the same inode.
+---- D will take the i_lock, won't find this entry on the list, and so does
+ nothing. (this is the end of D)
+
+E comes along and wants to free all of the marks in the group.
+---- E takes the group->mark_lock and walks the group->mark_entries list,
+ grabbing a reference to the mark and calling list_del_init on the
+ g_list. E adds the mark to the free_g_list, releases the
+ group->mark_lock, then starts walking the new private list and blocks
+ on entry->lock.
+
+At this point.
+ refcnt == 4
+ i_list -> NUL
+ inode -> inode
+ g_list -> NUL
+ group -> group
+ free_i_list -> private list on stack
+ free_g_list -> private list on another stack
+ freeme = 0
+
+A finally decides it wants to kill this entry for some reason.
+---- A will take the entry->lock. It will check if mark->group is non-NULL
+ and if so take mark->group->mark_lock (it may have blocked here on E
+ above). It checks ->inode and if set takes mark->inode->i_lock (again
+ it may have blocked on C). A now owns all the locks, so it calls
+ list_del_init on i_list and g_list, sets ->inode and ->group to NULL,
+ and sets freeme = 1. It unlocks i_lock, mark_lock, and entry->lock
+ and drops its reference. (this is the end of A)
+
+At this point.
+ refcnt == 3
+ i_list -> NUL
+ inode -> NULL
+ g_list -> NUL
+ group -> NULL
+ free_i_list -> private list on stack
+ free_g_list -> private list on another stack
+ freeme = 1
+
+E happens to be the one to win the entry->lock.
+---- E sees that ->inode and ->group are NULL, so it doesn't bother to
+ grab those locks (if they are NULL we know the entry is off the
+ relevant lists). E now proceeds as if it held all the locks (even
+ though it only holds entry->lock): it calls list_del_init on i_list
+ and g_list, sets ->inode and ->group to NULL (they already were), and
+ sets freeme = 1 (it already was). E then unlocks all the locks it
+ holds and drops its reference.
+
+At this point.
+ refcnt == 2
+ i_list -> NUL
+ inode -> NULL
+ g_list -> NUL
+ group -> NULL
+ free_i_list -> private list on stack
+ free_g_list -> undefined
+ freeme = 1
+
+C does the same thing as E, after which the mark looks like:
+
+At this point.
+ refcnt == 1
+ i_list -> NUL
+ inode -> NULL
+ g_list -> NUL
+ group -> NULL
+ free_i_list -> undefined
+ free_g_list -> undefined
+ freeme = 1
+
+B is the only thing left with a reference and it will do something similar to
+A, only holding just entry->lock (it won't take the inode or group
+lock since they are NULL in the entry). It will also clear everything,
+set freeme, unlock, and drop its reference via fsnotify_put_mark.
+
+fsnotify_put_mark calls atomic_dec_and_lock(&mark->refcnt, &mark->lock) and
+will then test if freeme is set. It is set. That means we are the last thing
+inside the kernel with a pointer to this fsnotify_mark_entry and it should be
+destroyed. fsnotify_put_mark proceeds to clean up any private data and frees
+things up.
diff --git a/fs/inode.c b/fs/inode.c
index 40e37c0..f6a51a1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/fsnotify.h>
#include <linux/mount.h>
#include <linux/async.h>
@@ -189,6 +190,10 @@ struct inode *inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_private = NULL;
inode->i_mapping = mapping;
+#ifdef CONFIG_FSNOTIFY
+ inode->i_fsnotify_mask = 0;
+#endif
+
return inode;
out_free_security:
@@ -220,6 +225,7 @@ void destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
+ fsnotify_inode_delete(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
@@ -251,6 +257,9 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->inotify_watches);
mutex_init(&inode->inotify_mutex);
#endif
+#ifdef CONFIG_FSNOTIFY
+ INIT_LIST_HEAD(&inode->i_fsnotify_mark_entries);
+#endif
}
EXPORT_SYMBOL(inode_init_once);
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 7cb285a..47b60f3 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -1,4 +1,4 @@
obj-y += dnotify/
obj-y += inotify/
-obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o
+obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o inode_mark.o
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index b51f90a..e7e53f7 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -25,6 +25,15 @@
#include <linux/fsnotify_backend.h>
#include "fsnotify.h"
+void __fsnotify_inode_delete(struct inode *inode, int flag)
+{
+ if (list_empty(&fsnotify_groups))
+ return;
+
+ fsnotify_clear_marks_by_inode(inode, flag);
+}
+EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);
+
void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
{
struct fsnotify_group *group;
@@ -37,6 +46,8 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
if (!(mask & fsnotify_mask))
return;
+ if (!(mask & to_tell->i_fsnotify_mask))
+ return;
/*
* SRCU!! the groups list is very very much read only and the path is
* very hot (assuming something is using fsnotify) Not blocking while
@@ -52,6 +63,8 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
if (mask & group->mask) {
+ if (!group->ops->should_send_event(group, to_tell, mask))
+ continue;
if (!event) {
event = fsnotify_create_event(to_tell, mask, data, data_is);
/* shit, we OOM'd and now we can't tell, lets hope something else blows up */
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 9c53600..bad4da9 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -16,4 +16,7 @@ extern struct list_head fsnotify_groups;
extern __u64 fsnotify_mask;
extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+
+extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
+extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
index d7cabe5..bac7d8d 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -47,6 +47,24 @@ void fsnotify_recalc_global_mask(void)
fsnotify_mask = mask;
}
+void fsnotify_recalc_group_mask(struct fsnotify_group *group)
+{
+ __u64 mask = 0;
+ unsigned long old_mask = group->mask;
+ struct fsnotify_mark_entry *entry;
+
+ spin_lock(&group->mark_lock);
+ list_for_each_entry(entry, &group->mark_entries, g_list) {
+ mask |= entry->mask;
+ }
+ spin_unlock(&group->mark_lock);
+
+ group->mask = mask;
+
+ if (old_mask != mask)
+ fsnotify_recalc_global_mask();
+}
+
static void fsnotify_add_group(struct fsnotify_group *group)
{
int priority = group->priority;
@@ -73,6 +91,9 @@ void fsnotify_get_group(struct fsnotify_group *group)
static void fsnotify_destroy_group(struct fsnotify_group *group)
{
+ /* clear all inode mark entries for this group */
+ fsnotify_clear_marks_by_group(group);
+
if (group->ops->free_group_priv)
group->ops->free_group_priv(group);
@@ -151,6 +172,9 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int
group->group_num = group_num;
group->mask = mask;
+ spin_lock_init(&group->mark_lock);
+ INIT_LIST_HEAD(&group->mark_entries);
+
group->ops = ops;
mutex_lock(&fsnotify_grp_mutex);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
new file mode 100644
index 0000000..9a65fbc
--- /dev/null
+++ b/fs/notify/inode_mark.c
@@ -0,0 +1,238 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <asm/atomic.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+static void fsnotify_destroy_mark(struct fsnotify_mark_entry *entry)
+{
+ entry->group = NULL;
+ entry->inode = NULL;
+ entry->mask = 0;
+ INIT_LIST_HEAD(&entry->i_list);
+ INIT_LIST_HEAD(&entry->g_list);
+ INIT_LIST_HEAD(&entry->free_i_list);
+ INIT_LIST_HEAD(&entry->free_g_list);
+ entry->free_private(entry);
+}
+
+void fsnotify_get_mark(struct fsnotify_mark_entry *entry)
+{
+ atomic_inc(&entry->refcnt);
+}
+
+void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
+{
+ if (atomic_dec_and_lock(&entry->refcnt, &entry->lock)) {
+ /*
+ * if refcnt hit 0 and freeme was set when it hit zero it's
+ * time to clean up!
+ */
+ if (entry->freeme) {
+ spin_unlock(&entry->lock);
+ fsnotify_destroy_mark(entry);
+ return;
+ }
+ spin_unlock(&entry->lock);
+ }
+}
+
+void fsnotify_recalc_inode_mask_locked(struct inode *inode)
+{
+ struct fsnotify_mark_entry *entry;
+ __u64 new_mask = 0;
+
+ list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
+ new_mask |= entry->mask;
+ }
+ inode->i_fsnotify_mask = new_mask;
+}
+
+void fsnotify_recalc_inode_mask(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ fsnotify_recalc_inode_mask_locked(inode);
+ spin_unlock(&inode->i_lock);
+}
+
+void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
+{
+ struct fsnotify_mark_entry *lentry, *entry;
+ struct inode *inode;
+ LIST_HEAD(free_list);
+
+ spin_lock(&group->mark_lock);
+ list_for_each_entry_safe(entry, lentry, &group->mark_entries, g_list) {
+ list_del_init(&entry->g_list);
+ list_add(&entry->free_g_list, &free_list);
+ fsnotify_get_mark(entry);
+ }
+ spin_unlock(&group->mark_lock);
+
+ list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
+ spin_lock(&entry->lock);
+ inode = entry->inode;
+ if (!inode) {
+ entry->group = NULL;
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+ continue;
+ }
+ spin_lock(&inode->i_lock);
+
+ list_del_init(&entry->i_list);
+ entry->inode = NULL;
+ list_del_init(&entry->g_list);
+ entry->group = NULL;
+ entry->freeme = 1;
+
+ fsnotify_recalc_inode_mask_locked(inode);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&entry->lock);
+
+ fsnotify_put_mark(entry);
+ }
+}
+
+void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry)
+{
+ struct fsnotify_group *group;
+ struct inode *inode;
+
+ spin_lock(&entry->lock);
+
+ group = entry->group;
+ if (group)
+ spin_lock(&group->mark_lock);
+
+ inode = entry->inode;
+ if (inode)
+ spin_lock(&inode->i_lock);
+
+ list_del_init(&entry->i_list);
+ entry->inode = NULL;
+ list_del_init(&entry->g_list);
+ entry->group = NULL;
+ entry->freeme = 1;
+
+ if (inode) {
+ fsnotify_recalc_inode_mask_locked(inode);
+ spin_unlock(&inode->i_lock);
+ }
+ if (group)
+ spin_unlock(&group->mark_lock);
+
+ spin_unlock(&entry->lock);
+}
+
+void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
+{
+ struct fsnotify_mark_entry *lentry, *entry;
+ LIST_HEAD(free_list);
+
+ spin_lock(&inode->i_lock);
+ list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
+ list_del_init(&entry->i_list);
+ list_add(&entry->free_i_list, &free_list);
+ fsnotify_get_mark(entry);
+ }
+ spin_unlock(&inode->i_lock);
+
+ /*
+ * at this point destroy_by_* might race.
+ *
+ * we used list_del_init() so it can be list_del_init'd again, no harm.
+ * we were called from an inode function so we know the inode is still
+ * pinned; another user can try to grab entry->inode->i_lock without a
+ * problem.
+ */
+ list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
+ entry->group->ops->mark_clear_inode(entry, inode, flags);
+ fsnotify_put_mark(entry);
+ }
+
+ fsnotify_recalc_inode_mask(inode);
+}
+
+/* caller must hold inode->i_lock */
+struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode)
+{
+ struct fsnotify_mark_entry *entry;
+
+ list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
+ if (entry->group == group) {
+ fsnotify_get_mark(entry);
+ return entry;
+ }
+ }
+ return NULL;
+}
+
+void fsnotify_init_mark(struct fsnotify_mark_entry *entry, struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ spin_lock_init(&entry->lock);
+ atomic_set(&entry->refcnt, 1);
+ entry->group = group;
+ entry->mask = mask;
+ entry->inode = inode;
+ entry->freeme = 0;
+ entry->free_private = group->ops->free_mark;
+}
+
+int fsnotify_add_mark(struct fsnotify_mark_entry *entry)
+{
+ struct fsnotify_mark_entry *lentry;
+ struct fsnotify_group *group = entry->group;
+ struct inode *inode = entry->inode;
+ int ret = 0;
+
+ /*
+ * LOCKING ORDER!!!!
+ * entry->lock
+ * group->mark_lock
+ * inode->i_lock
+ */
+ spin_lock(&group->mark_lock);
+ spin_lock(&inode->i_lock);
+
+ lentry = fsnotify_find_mark_entry(group, inode);
+ if (!lentry) {
+ list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
+ list_add(&entry->g_list, &group->mark_entries);
+ fsnotify_recalc_inode_mask_locked(inode);
+ }
+
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&group->mark_lock);
+
+ if (lentry) {
+ ret = -EEXIST;
+ fsnotify_put_mark(lentry);
+ }
+
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7c68aa9..c5ec88f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -690,6 +690,11 @@ struct inode {
__u32 i_generation;
+#ifdef CONFIG_FSNOTIFY
+ __u64 i_fsnotify_mask; /* all events this inode cares about */
+ struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
+#endif
+
#ifdef CONFIG_DNOTIFY
unsigned long i_dnotify_mask; /* Directory notify events */
struct dnotify_struct *i_dnotify; /* for directory notifications */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index d1f7de2..6ae332f 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -99,6 +99,14 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
}
/*
+ * fsnotify_inode_delete - an inode is being evicted from cache; cleanup is needed
+ */
+static inline void fsnotify_inode_delete(struct inode *inode)
+{
+ __fsnotify_inode_delete(inode, FSNOTIFY_INODE_DESTROY);
+}
+
+/*
* fsnotify_inoderemove - an inode is going away
*/
static inline void fsnotify_inoderemove(struct inode *inode)
@@ -107,6 +115,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_is_dead(inode);
fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
+ __fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}
/*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index e156df5..a9508c5 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -13,7 +13,6 @@
#include <linux/list.h>
#include <linux/path.h> /* struct path */
#include <linux/spinlock.h>
-#include <linux/wait.h>
#include <asm/atomic.h>
@@ -69,13 +68,21 @@
#define FSNOTIFY_EVENT_INODE 2
#define FSNOTIFY_EVENT_FILE 3
+/* these tell __fsnotify_inode_delete what kind of event this is */
+#define FSNOTIFY_LAST_DENTRY 1
+#define FSNOTIFY_INODE_DESTROY 2
+
struct fsnotify_group;
struct fsnotify_event;
+struct fsnotify_mark_entry;
struct fsnotify_ops {
+ int (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u64 mask);
int (*handle_event)(struct fsnotify_group *group, struct fsnotify_event *event);
void (*free_group_priv)(struct fsnotify_group *group);
+ void (*mark_clear_inode)(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags);
void (*free_event_priv)(struct fsnotify_group *group, struct fsnotify_event *event);
+ void (*free_mark)(struct fsnotify_mark_entry *entry);
};
struct fsnotify_group {
@@ -86,6 +93,10 @@ struct fsnotify_group {
const struct fsnotify_ops *ops; /* how this group handles things */
+ /* stores all fastpath entries assoc with this group so they can be cleaned on unregister */
+ spinlock_t mark_lock; /* protect mark_entries list */
+ struct list_head mark_entries; /* all inode mark entries for this group */
+
unsigned int priority; /* order this group should receive msgs. low first */
unsigned int evicted :1; /* has this group been evicted? */
@@ -123,13 +134,39 @@ struct fsnotify_event {
struct list_head private_data_list;
};
+/*
+ * a mark is simply an entry attached to an in core inode which allows an
+ * fsnotify listener to indicate they are either no longer interested in events
+ * of a type matching mask or only interested in those events.
+ *
+ * these are flushed when an inode is evicted from core and may be flushed
+ * when the inode is modified (as seen by fsnotify_access). Some fsnotify users
+ * (such as dnotify) will flush these when the open fd is closed and not at
+ * inode eviction or modification.
+ */
+struct fsnotify_mark_entry {
+ __u64 mask; /* mask this mark entry is for */
+ atomic_t refcnt; /* active things looking at this mark */
+ int freeme; /* free when this is set and refcnt hits 0 */
+ struct inode *inode; /* inode this entry is associated with */
+ struct fsnotify_group *group; /* group this mark entry is for */
+ struct list_head i_list; /* list of mark_entries by inode->i_fsnotify_mark_entries */
+ struct list_head g_list; /* list of mark_entries by group->i_fsnotify_mark_entries */
+ spinlock_t lock; /* protect group, inode, and freeme */
+ struct list_head free_i_list; /* tmp list used when freeing this mark */
+ struct list_head free_g_list; /* tmp list used when freeing this mark */
+ void (*free_private)(struct fsnotify_mark_entry *entry); /* called on final put+free */
+};
+
#ifdef CONFIG_FSNOTIFY
/* called from the vfs to signal fs events */
extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern void __fsnotify_inode_delete(struct inode *inode, int flag);
/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
+extern void fsnotify_recalc_group_mask(struct fsnotify_group *group);
extern struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
extern void fsnotify_put_group(struct fsnotify_group *group);
extern void fsnotify_get_group(struct fsnotify_group *group);
@@ -138,10 +175,21 @@ extern void fsnotify_get_event(struct fsnotify_event *event);
extern void fsnotify_put_event(struct fsnotify_event *event);
extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
+extern void fsnotify_recalc_inode_mask(struct inode *inode);
+extern void fsnotify_init_mark(struct fsnotify_mark_entry *entry, struct fsnotify_group *group, struct inode *inode, __u64 mask);
+extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode);
+extern int fsnotify_add_mark(struct fsnotify_mark_entry *entry);
+extern void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry);
+extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
+extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
#else
static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
{}
+
+static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
+{}
+
#endif /* CONFIG_FSNOTIFY */
#endif /* __KERNEL__ */
* [PATCH -v1 04/11] fsnotify: parent event notification
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
2009-02-09 21:15 ` [PATCH -v1 02/11] fsnotify: add group priorities Eric Paris
2009-02-09 21:15 ` [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings Eric Paris
@ 2009-02-09 21:15 ` Eric Paris
2009-02-09 21:15 ` [PATCH -v1 05/11] dnotify: reimplement dnotify using fsnotify Eric Paris
` (7 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
inotify and dnotify both use a similar parent notification mechanism. We
add a generic parent notification mechanism to fsnotify for both of these
to use. This new mechanism also extends the dentry flag optimization, which
previously existed only for inotify, to dnotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/inode_mark.c | 4 +
include/linux/dcache.h | 3 +
include/linux/fsnotify.h | 111 +++++++++++++++++++++++++++++++++++++-
include/linux/fsnotify_backend.h | 5 ++
4 files changed, 119 insertions(+), 4 deletions(-)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 9a65fbc..840bd91 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -78,6 +78,8 @@ void fsnotify_recalc_inode_mask(struct inode *inode)
spin_lock(&inode->i_lock);
fsnotify_recalc_inode_mask_locked(inode);
spin_unlock(&inode->i_lock);
+
+ fsnotify_update_dentry_child_flags(inode);
}
void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
@@ -232,6 +234,8 @@ int fsnotify_add_mark(struct fsnotify_mark_entry *entry)
if (lentry) {
ret = -EEXIST;
fsnotify_put_mark(lentry);
+ } else {
+ fsnotify_update_dentry_child_flags(inode);
}
return ret;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c66d224..2b935f2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -180,7 +180,8 @@ d_iput: no no no yes
#define DCACHE_REFERENCED 0x0008 /* Recently used, don't discard. */
#define DCACHE_UNHASHED 0x0010
-#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */
+#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched by inotify */
+#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0040 /* Parent inode is watched by some fsnotify listener */
#define DCACHE_COOKIE 0x0040 /* For use by dcookie subsystem */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 6ae332f..736bb28 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -16,14 +16,102 @@
#include <linux/fsnotify_backend.h>
#include <linux/audit.h>
+static inline int fsnotify_inode_watches_children(struct inode *inode)
+{
+ if (inode->i_fsnotify_mask & (FS_EVENTS_WITH_CHILD << 32))
+ return 1;
+ return 0;
+}
+
+/*
+ * Get the child dentry flags into sync with the parent inode.
+ * The flag should always be clear for negative dentries.
+ *
+ * It's possible the flag could lie (false positive). fsnotify only
+ * automatically updates the flag when a new mark is added. The flag is NOT
+ * automatically updated when marks are removed because of the horrid tactics
+ * that would be needed to do this. When we remove marks from an inode we have
+ * no idea if that inode is sticking around aside from holding the i_lock. We
+ * clearly can't hold the i_lock across this function. It doesn't really
+ * matter though. The whole point is just to shortcut the parent lookup for
+ * performance. False positives cost a little performance, but they are not a
+ * correctness problem.
+ */
+static inline void fsnotify_update_dentry_child_flags(struct inode *inode)
+{
+ struct dentry *alias;
+ int watched = fsnotify_inode_watches_children(inode);
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+ struct dentry *child;
+
+ list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
+ if (!child->d_inode)
+ continue;
+
+ spin_lock(&child->d_lock);
+ if (watched)
+ child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&child->d_lock);
+ }
+ }
+ spin_unlock(&dcache_lock);
+}
+
/*
* fsnotify_d_instantiate - instantiate a dentry for inode
* Called with dcache_lock held.
*/
-static inline void fsnotify_d_instantiate(struct dentry *entry,
- struct inode *inode)
+static inline void fsnotify_d_instantiate(struct dentry *dentry, struct inode *inode)
{
- inotify_d_instantiate(entry, inode);
+ struct dentry *parent;
+ struct inode *p_inode;
+
+ if (!inode)
+ return;
+
+ spin_lock(&dentry->d_lock);
+ parent = dentry->d_parent;
+ p_inode = parent->d_inode;
+
+ if (p_inode && fsnotify_inode_watches_children(p_inode))
+ dentry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&dentry->d_lock);
+
+ /* call the legacy inotify shit */
+ inotify_d_instantiate(dentry, inode);
+}
+
+/* Notify this dentry's parent about a child's events. */
+static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
+{
+ struct dentry *parent;
+ struct inode *p_inode;
+ __u64 mask;
+
+ if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
+ return;
+
+ /* we are notifying a parent so come up with the new mask which
+ * specifies these are events which came from a child. */
+ mask = (orig_mask & FS_EVENTS_WITH_CHILD) << 32;
+ /* need to remember if the child was a dir */
+ mask |= (orig_mask & FS_IN_ISDIR);
+
+ spin_lock(&dentry->d_lock);
+ parent = dentry->d_parent;
+ p_inode = parent->d_inode;
+
+ if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
+ dget(parent);
+ spin_unlock(&dentry->d_lock);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ dput(parent);
+ } else {
+ spin_unlock(&dentry->d_lock);
+ }
}
/*
@@ -32,6 +120,14 @@ static inline void fsnotify_d_instantiate(struct dentry *entry,
*/
static inline void fsnotify_d_move(struct dentry *entry)
{
+ struct dentry *parent;
+
+ parent = entry->d_parent;
+ if (fsnotify_inode_watches_children(parent->d_inode))
+ entry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ entry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+
inotify_d_move(entry);
}
@@ -96,6 +192,8 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
isdir = IN_ISDIR;
dnotify_parent(dentry, DN_DELETE);
inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
+
+ fsnotify_parent(dentry, FS_DELETE|isdir);
}
/*
@@ -185,6 +283,7 @@ static inline void fsnotify_access(struct dentry *dentry)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
@@ -203,6 +302,7 @@ static inline void fsnotify_modify(struct dentry *dentry)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
@@ -220,6 +320,7 @@ static inline void fsnotify_open(struct dentry *dentry)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
@@ -240,6 +341,7 @@ static inline void fsnotify_close(struct file *file)
inotify_dentry_parent_queue_event(dentry, mask, 0, name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}
@@ -257,6 +359,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
@@ -307,6 +410,8 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, in_mask, 0,
dentry->d_name.name);
+
+ fsnotify_parent(dentry, in_mask);
fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index a9508c5..ad7294f 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -63,6 +63,11 @@
#define FS_DN_RENAME 0x1000000000000000ull /* file renamed */
#define FS_DN_MULTISHOT 0x2000000000000000ull /* dnotify multishot */
+#define FS_EVENTS_WITH_CHILD (FS_ACCESS | FS_MODIFY | FS_ATTRIB |\
+ FS_CLOSE_WRITE | FS_CLOSE_NOWRITE | FS_OPEN |\
+ FS_MOVED_FROM | FS_MOVED_TO | FS_CREATE |\
+ FS_DELETE)
+
/* when calling fsnotify tell it if the data is a path or inode */
#define FSNOTIFY_EVENT_PATH 1
#define FSNOTIFY_EVENT_INODE 2
* [PATCH -v1 05/11] dnotify: reimplement dnotify using fsnotify
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
` (2 preceding siblings ...)
2009-02-09 21:15 ` [PATCH -v1 04/11] fsnotify: parent event notification Eric Paris
@ 2009-02-09 21:15 ` Eric Paris
2009-02-09 21:15 ` [PATCH -v1 06/11] fsnotify: generic notification queue and waitq Eric Paris
` (6 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
Reimplement dnotify using fsnotify.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
MAINTAINERS | 2
fs/notify/dnotify/Kconfig | 1
fs/notify/dnotify/dnotify.c | 449 ++++++++++++++++++++++++++++++--------
include/linux/dnotify.h | 29 +-
include/linux/fs.h | 5
include/linux/fsnotify.h | 70 ++----
include/linux/fsnotify_backend.h | 3
7 files changed, 386 insertions(+), 173 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index e8bbc45..daba68f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1413,6 +1413,8 @@ S: Orphan
DIRECTORY NOTIFICATION (DNOTIFY)
P: Stephen Rothwell
M: sfr@canb.auug.org.au
+P: Eric Paris
+M: eparis@parisplace.org
L: linux-kernel@vger.kernel.org
S: Supported
diff --git a/fs/notify/dnotify/Kconfig b/fs/notify/dnotify/Kconfig
index 26adf5d..904ff8d 100644
--- a/fs/notify/dnotify/Kconfig
+++ b/fs/notify/dnotify/Kconfig
@@ -1,5 +1,6 @@
config DNOTIFY
bool "Dnotify support"
+ depends on FSNOTIFY
default y
help
Dnotify is a directory-based per-fd file change notification system
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index b0aa2cd..004c11c 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -3,6 +3,9 @@
*
* Copyright (C) 2000,2001,2002 Stephen Rothwell
*
+ * Copyright (C) 2009 Eric Paris <Red Hat Inc>
+ * dnotify was largely rewritten to use the new fsnotify infrastructure
+ *
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2, or (at your option) any
@@ -21,24 +24,163 @@
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/fdtable.h>
+#include <linux/fsnotify_backend.h>
int dir_notify_enable __read_mostly = 1;
static struct kmem_cache *dn_cache __read_mostly;
+static struct kmem_cache *dnotify_inode_mark_cache __read_mostly;
+
+struct dnotify_mark_entry {
+ struct fsnotify_mark_entry fsn_entry;
+ struct dnotify_struct *dn;
+};
-static void redo_inode_mask(struct inode *inode)
+static struct fsnotify_ops dnotify_fsnotify_ops;
+
+/* this horribleness only works because dnotify only ever has 1 group */
+static inline struct fsnotify_mark_entry *dnotify_get_mark(struct inode *inode)
{
- unsigned long new_mask;
+ struct fsnotify_mark_entry *lentry;
+ struct fsnotify_mark_entry *entry = NULL;
+
+ spin_lock(&inode->i_lock);
+ list_for_each_entry(lentry, &inode->i_fsnotify_mark_entries, i_list) {
+ if (lentry->group->ops == &dnotify_fsnotify_ops) {
+ fsnotify_get_mark(lentry);
+ entry = lentry;
+ break;
+ }
+ }
+ spin_unlock(&inode->i_lock);
+ return entry;
+}
+
+/* holding the entry->lock to protect ->dn data. */
+static void dnotify_recalc_inode_mask(struct fsnotify_mark_entry *entry)
+{
+ __u64 new_mask, old_mask;
struct dnotify_struct *dn;
+ struct dnotify_mark_entry *dnentry = container_of(entry, struct dnotify_mark_entry, fsn_entry);
+ old_mask = entry->mask;
new_mask = 0;
- for (dn = inode->i_dnotify; dn != NULL; dn = dn->dn_next)
- new_mask |= dn->dn_mask & ~DN_MULTISHOT;
- inode->i_dnotify_mask = new_mask;
+ dn = dnentry->dn;
+ for (; dn != NULL; dn = dn->dn_next)
+ new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
+ entry->mask = new_mask;
+
+ if (old_mask == new_mask)
+ return;
+
+ if (entry->inode)
+ fsnotify_recalc_inode_mask(entry->inode);
+}
+
+static int dnotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_mark_entry *entry = NULL;
+ struct dnotify_mark_entry *dnentry;
+ struct inode *to_tell;
+ struct dnotify_struct *dn;
+ struct dnotify_struct **prev;
+ struct fown_struct *fown;
+
+ to_tell = event->to_tell;
+
+ spin_lock(&to_tell->i_lock);
+ entry = fsnotify_find_mark_entry(group, to_tell);
+ spin_unlock(&to_tell->i_lock);
+
+ /* unlikely since we already passed dnotify_should_send_event() */
+ if (unlikely(!entry))
+ return 0;
+ dnentry = container_of(entry, struct dnotify_mark_entry, fsn_entry);
+
+ spin_lock(&entry->lock);
+ prev = &dnentry->dn;
+ while ((dn = *prev) != NULL) {
+ if ((dn->dn_mask & event->mask) == 0) {
+ prev = &dn->dn_next;
+ continue;
+ }
+ fown = &dn->dn_filp->f_owner;
+ send_sigio(fown, dn->dn_fd, POLL_MSG);
+ if (dn->dn_mask & FS_DN_MULTISHOT)
+ prev = &dn->dn_next;
+ else {
+ *prev = dn->dn_next;
+ kmem_cache_free(dn_cache, dn);
+ dnotify_recalc_inode_mask(entry);
+ }
+ }
+
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+
+ return 0;
+}
+
+static void dnotify_clear_mark(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags)
+{
+ /* if we got here when this inode just closed its last dentry, or when
+ * the inode is being kicked out of core, we screwed up since it should
+ * have already been flushed in dnotify_flush() */
+ BUG();
+}
+
+static int dnotify_should_send_event(struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ struct fsnotify_mark_entry *entry;
+ int send;
+
+ /* events should never get here if !dir_notify_enable; don't waste time checking
+ if (!dir_notify_enable)
+ return 0; */
+
+ /* not a dir, dnotify doesn't care */
+ if (!S_ISDIR(inode->i_mode))
+ return 0;
+
+ spin_lock(&inode->i_lock);
+ entry = fsnotify_find_mark_entry(group, inode);
+ spin_unlock(&inode->i_lock);
+
+ /* no mark means no dnotify watch */
+ if (!entry)
+ return 0;
+
+ spin_lock(&entry->lock);
+ send = !!(mask & entry->mask);
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+
+ return send;
+}
+
+static void dnotify_free_mark(struct fsnotify_mark_entry *entry)
+{
+ struct dnotify_mark_entry *dnentry = container_of(entry, struct dnotify_mark_entry, fsn_entry);
+
+ BUG_ON(dnentry->dn);
+
+ kmem_cache_free(dnotify_inode_mark_cache, dnentry);
}
+static struct fsnotify_ops dnotify_fsnotify_ops = {
+ .handle_event = dnotify_handle_event,
+ .mark_clear_inode = dnotify_clear_mark,
+ .should_send_event = dnotify_should_send_event,
+ .free_group_priv = NULL,
+ .free_event_priv = NULL,
+ .free_mark = dnotify_free_mark,
+};
+
void dnotify_flush(struct file *filp, fl_owner_t id)
{
+ struct fsnotify_group *dnotify_group = NULL;
+ struct fsnotify_mark_entry *entry;
+ struct dnotify_mark_entry *dnentry;
struct dnotify_struct *dn;
struct dnotify_struct **prev;
struct inode *inode;
@@ -46,145 +188,254 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
inode = filp->f_path.dentry->d_inode;
if (!S_ISDIR(inode->i_mode))
return;
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
+
+ entry = dnotify_get_mark(inode);
+ if (!entry)
+ return;
+ dnentry = container_of(entry, struct dnotify_mark_entry, fsn_entry);
+
+ spin_lock(&entry->lock);
+ prev = &dnentry->dn;
while ((dn = *prev) != NULL) {
if ((dn->dn_owner == id) && (dn->dn_filp == filp)) {
*prev = dn->dn_next;
- redo_inode_mask(inode);
kmem_cache_free(dn_cache, dn);
+ dnotify_recalc_inode_mask(entry);
break;
}
prev = &dn->dn_next;
}
- spin_unlock(&inode->i_lock);
+
+ /* last dnotify watch on this inode is gone */
+ if (dnentry->dn == NULL)
+ dnotify_group = entry->group;
+
+ spin_unlock(&entry->lock);
+
+ if (dnotify_group) {
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_group(dnotify_group);
+ }
+
+ fsnotify_put_mark(entry);
}
-int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
+/* this conversion is done only at watch creation */
+static inline unsigned long convert_arg(unsigned long arg)
+{
+ unsigned long new_mask = 0;
+
+ if (arg & DN_MULTISHOT)
+ new_mask |= FS_DN_MULTISHOT;
+ if (arg & DN_DELETE)
+ new_mask |= (FS_DELETE | FS_MOVED_FROM);
+ if (arg & DN_MODIFY)
+ new_mask |= FS_MODIFY;
+ if (arg & DN_ACCESS)
+ new_mask |= FS_ACCESS;
+ if (arg & DN_ATTRIB)
+ new_mask |= FS_ATTRIB;
+ if (arg & DN_RENAME)
+ new_mask |= FS_DN_RENAME;
+ if (arg & DN_CREATE)
+ new_mask |= (FS_CREATE | FS_MOVED_TO);
+
+ new_mask |= (new_mask & FS_EVENTS_WITH_CHILD) << 32;
+
+ return new_mask;
+}
+
+/* this has some really ugly tricky semantics.
+ * first, fsnotify_obtain_group took a reference to group. If we add a new mark
+ * to this inode we keep that reference. Otherwise we drop it.
+ *
+ * second, if we return an error the caller will try to free *dnentry_p. So on
+ * error we really need to do the cleanup ourselves and NULL out *dnentry_p
+ */
+static int find_dnotify_mark_entry(struct fsnotify_group *group, struct inode *inode,
+ struct dnotify_mark_entry **dnentry_p)
+{
+ struct dnotify_mark_entry *dnentry = *dnentry_p;
+ struct fsnotify_mark_entry *entry = &dnentry->fsn_entry;
+ int ret = 0;
+
+retry:
+ /* look for a previous entry */
+ entry = fsnotify_find_mark_entry(group, inode);
+ if (!entry) {
+ entry = &dnentry->fsn_entry;
+ /* if none, add the new one we allocated */
+ ret = fsnotify_add_mark(entry);
+ /* if we raced and someone else added, start over */
+ if (ret == -EEXIST)
+ goto retry;
+ else if (ret) {
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_mark(entry);
+ /* we didn't add this new entry, so put the group */
+ fsnotify_put_group(group);
+ dnentry = NULL;
+ }
+ } else {
+ /* found an existing one, kill the new one */
+ kmem_cache_free(dnotify_inode_mark_cache, dnentry);
+ dnentry = container_of(entry, struct dnotify_mark_entry, fsn_entry);
+ /* we found an old entry, didn't add a new one, so put group */
+ fsnotify_put_group(group);
+ }
+
+ *dnentry_p = dnentry;
+ return ret;
+}
+
+static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark_entry *dnentry, fl_owner_t id,
+ int fd, struct file *filp, __u64 mask)
{
- struct dnotify_struct *dn;
struct dnotify_struct *odn;
struct dnotify_struct **prev;
+
+ prev = &dnentry->dn;
+ while ((odn = *prev) != NULL) {
+ /* do we already have a dnotify struct and we are just adding more events? */
+ if ((odn->dn_owner == id) && (odn->dn_filp == filp)) {
+ odn->dn_fd = fd;
+ odn->dn_mask |= mask;
+ return -EEXIST;
+ }
+ prev = &odn->dn_next;
+ }
+
+ dn->dn_mask = mask;
+ dn->dn_fd = fd;
+ dn->dn_filp = filp;
+ dn->dn_owner = id;
+ dn->dn_next = dnentry->dn;
+ dnentry->dn = dn;
+
+ return 0;
+}
+
+int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
+{
+ struct fsnotify_group *dnotify_group = NULL;
+ struct dnotify_mark_entry *new_dnentry, *dnentry = NULL;
+ struct fsnotify_mark_entry *entry;
+ struct dnotify_struct *dn = NULL;
struct inode *inode;
fl_owner_t id = current->files;
struct file *f;
- int error = 0;
+ int error = 0, destroy = 0;
+ __u64 mask;
+
+ if (!dir_notify_enable)
+ return -EINVAL;
if ((arg & ~DN_MULTISHOT) == 0) {
dnotify_flush(filp, id);
return 0;
}
- if (!dir_notify_enable)
- return -EINVAL;
inode = filp->f_path.dentry->d_inode;
if (!S_ISDIR(inode->i_mode))
return -ENOTDIR;
+
+ /* convert the userspace DN_* "arg" to the internal FS_* defines in fsnotify */
+ mask = convert_arg(arg);
+
+ /* expect most fcntl to add new rather than augment old */
dn = kmem_cache_alloc(dn_cache, GFP_KERNEL);
- if (dn == NULL)
+ if (!dn)
return -ENOMEM;
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
- while ((odn = *prev) != NULL) {
- if ((odn->dn_owner == id) && (odn->dn_filp == filp)) {
- odn->dn_fd = fd;
- odn->dn_mask |= arg;
- inode->i_dnotify_mask |= arg & ~DN_MULTISHOT;
- goto out_free;
- }
- prev = &odn->dn_next;
+
+ /*
+ * I really don't like using ALL_DNOTIFY_EVENTS. We could probably do
+ * better by setting group->mask to only those events dnotify watches
+ * care about, but removing events means walking the entire
+ * group->mark_entries list to recalculate the mask. It also makes it
+ * harder to find the right group, but this is not a fast path, so
+ * harder doesn't mean bad. Maybe a future performance win since it
+ * could result in faster fsnotify() processing.
+ */
+ dnotify_group = fsnotify_obtain_group(DNOTIFY_GROUP_NUM, DNOTIFY_GROUP_NUM, ALL_DNOTIFY_EVENTS, &dnotify_fsnotify_ops);
+ if (IS_ERR(dnotify_group)) {
+ error = PTR_ERR(dnotify_group);
+ goto out_err;
}
- rcu_read_lock();
- f = fcheck(fd);
- rcu_read_unlock();
- /* we'd lost the race with close(), sod off silently */
- /* note that inode->i_lock prevents reordering problems
- * between accesses to descriptor table and ->i_dnotify */
- if (f != filp)
- goto out_free;
+ new_dnentry = dnentry = kmem_cache_alloc(dnotify_inode_mark_cache, GFP_KERNEL);
+ if (!dnentry) {
+ error = -ENOMEM;
+ goto out_err;
+ }
+ entry = &dnentry->fsn_entry;
+ fsnotify_init_mark(entry, dnotify_group, inode, mask);
+ dnentry->dn = NULL;
- error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+ error = find_dnotify_mark_entry(dnotify_group, inode, &dnentry);
if (error)
- goto out_free;
+ goto out_err;
+ entry = &dnentry->fsn_entry;
- dn->dn_mask = arg;
- dn->dn_fd = fd;
- dn->dn_filp = filp;
- dn->dn_owner = id;
- inode->i_dnotify_mask |= arg & ~DN_MULTISHOT;
- dn->dn_next = inode->i_dnotify;
- inode->i_dnotify = dn;
- spin_unlock(&inode->i_lock);
- return 0;
-
-out_free:
- spin_unlock(&inode->i_lock);
- kmem_cache_free(dn_cache, dn);
- return error;
-}
+ spin_lock(&entry->lock);
-void __inode_dir_notify(struct inode *inode, unsigned long event)
-{
- struct dnotify_struct * dn;
- struct dnotify_struct **prev;
- struct fown_struct * fown;
- int changed = 0;
+ rcu_read_lock();
+ f = fcheck(fd);
+ rcu_read_unlock();
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
- while ((dn = *prev) != NULL) {
- if ((dn->dn_mask & event) == 0) {
- prev = &dn->dn_next;
- continue;
- }
- fown = &dn->dn_filp->f_owner;
- send_sigio(fown, dn->dn_fd, POLL_MSG);
- if (dn->dn_mask & DN_MULTISHOT)
- prev = &dn->dn_next;
- else {
- *prev = dn->dn_next;
- changed = 1;
- kmem_cache_free(dn_cache, dn);
- }
+ /* if (f != filp) means that we lost a race and another task/thread
+ * actually closed the fd we are still playing with before we grabbed
+ * the entry->lock. Since closing the fd is the only time we clean
+ * up the mark entries, we need to get our mark off the list. */
+ if (f != filp) {
+ /* if we added ourselves, shoot ourselves, it's possible that
+ * the flush actually did shoot this entry. That's fine too
+ * since multiple calls to destroy_mark are perfectly safe */
+ if (dnentry == new_dnentry)
+ destroy = 1;
+ /* if we just found a dnentry already there, just sod off
+ * silently as the flush at close time dealt with it */
+ goto out;
}
- if (changed)
- redo_inode_mask(inode);
- spin_unlock(&inode->i_lock);
-}
-EXPORT_SYMBOL(__inode_dir_notify);
+ error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+ if (error)
+ goto out;
-/*
- * This is hopelessly wrong, but unfixable without API changes. At
- * least it doesn't oops the kernel...
- *
- * To safely access ->d_parent we need to keep d_move away from it. Use the
- * dentry's d_lock for this.
- */
-void dnotify_parent(struct dentry *dentry, unsigned long event)
-{
- struct dentry *parent;
+ error = attach_dn(dn, dnentry, id, fd, filp, mask);
+ /* !error means that we attached the dn to the dnentry, so don't free it */
+ if (!error)
+ dn = NULL;
+ /* -EEXIST means that we didn't add this new dn and used an old one.
+ * that isn't an error (and the unused dn should be freed) */
+ else if (error == -EEXIST)
+ error = 0;
- if (!dir_notify_enable)
- return;
+ dnotify_recalc_inode_mask(entry);
+out:
+ spin_unlock(&entry->lock);
+ if (destroy)
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_mark(entry);
+ if (dn)
+ kmem_cache_free(dn_cache, dn);
+ return error;
- spin_lock(&dentry->d_lock);
- parent = dentry->d_parent;
- if (parent->d_inode->i_dnotify_mask & event) {
- dget(parent);
- spin_unlock(&dentry->d_lock);
- __inode_dir_notify(parent->d_inode, event);
- dput(parent);
- } else {
- spin_unlock(&dentry->d_lock);
+out_err:
+ if (dnentry) {
+ entry = &dnentry->fsn_entry;
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_mark(entry);
}
+ if (dn)
+ kmem_cache_free(dn_cache, dn);
+ return error;
}
-EXPORT_SYMBOL_GPL(dnotify_parent);
static int __init dnotify_init(void)
{
dn_cache = kmem_cache_create("dnotify_cache",
sizeof(struct dnotify_struct), 0, SLAB_PANIC, NULL);
+ dnotify_inode_mark_cache = kmem_cache_create("dnotify_inode_mark",
+ sizeof(struct dnotify_mark_entry), 0, SLAB_PANIC, NULL);
return 0;
}
diff --git a/include/linux/dnotify.h b/include/linux/dnotify.h
index 102a902..e8c4256 100644
--- a/include/linux/dnotify.h
+++ b/include/linux/dnotify.h
@@ -10,7 +10,7 @@
struct dnotify_struct {
struct dnotify_struct * dn_next;
- unsigned long dn_mask;
+ __u64 dn_mask;
int dn_fd;
struct file * dn_filp;
fl_owner_t dn_owner;
@@ -21,23 +21,18 @@ struct dnotify_struct {
#ifdef CONFIG_DNOTIFY
-extern void __inode_dir_notify(struct inode *, unsigned long);
+#define ALL_DNOTIFY_EVENTS (FS_DELETE | FS_DELETE_CHILD |\
+ FS_MODIFY | FS_MODIFY_CHILD |\
+ FS_ACCESS | FS_ACCESS_CHILD |\
+ FS_ATTRIB | FS_ATTRIB_CHILD |\
+ FS_CREATE | FS_DN_RENAME |\
+ FS_MOVED_FROM | FS_MOVED_TO)
+
extern void dnotify_flush(struct file *, fl_owner_t);
extern int fcntl_dirnotify(int, struct file *, unsigned long);
-extern void dnotify_parent(struct dentry *, unsigned long);
-
-static inline void inode_dir_notify(struct inode *inode, unsigned long event)
-{
- if (inode->i_dnotify_mask & (event))
- __inode_dir_notify(inode, event);
-}
#else
-static inline void __inode_dir_notify(struct inode *inode, unsigned long event)
-{
-}
-
static inline void dnotify_flush(struct file *filp, fl_owner_t id)
{
}
@@ -47,14 +42,6 @@ static inline int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
return -EINVAL;
}
-static inline void dnotify_parent(struct dentry *dentry, unsigned long event)
-{
-}
-
-static inline void inode_dir_notify(struct inode *inode, unsigned long event)
-{
-}
-
#endif /* CONFIG_DNOTIFY */
#endif /* __KERNEL__ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5ec88f..1eed69c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -695,11 +695,6 @@ struct inode {
struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
#endif
-#ifdef CONFIG_DNOTIFY
- unsigned long i_dnotify_mask; /* Directory notify events */
- struct dnotify_struct *i_dnotify; /* for directory notifications */
-#endif
-
#ifdef CONFIG_INOTIFY
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 736bb28..f5ce85b 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -144,13 +144,7 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
__u64 new_dir_mask = 0;
if (old_dir == new_dir) {
- inode_dir_notify(old_dir, DN_RENAME);
old_dir_mask = FS_DN_RENAME;
- } else {
- inode_dir_notify(old_dir, DN_DELETE);
- old_dir_mask = FS_DELETE;
- inode_dir_notify(new_dir, DN_CREATE);
- new_dir_mask = FS_CREATE;
}
if (isdir) {
@@ -190,7 +184,6 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
{
if (isdir)
isdir = IN_ISDIR;
- dnotify_parent(dentry, DN_DELETE);
inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
fsnotify_parent(dentry, FS_DELETE|isdir);
@@ -231,7 +224,6 @@ static inline void fsnotify_link_count(struct inode *inode)
*/
static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
{
- inode_dir_notify(inode, DN_CREATE);
inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
@@ -246,7 +238,6 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
*/
static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct dentry *new_dentry)
{
- inode_dir_notify(dir, DN_CREATE);
inotify_inode_queue_event(dir, IN_CREATE, 0, new_dentry->d_name.name,
inode);
fsnotify_link_count(inode);
@@ -260,7 +251,6 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
*/
static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
{
- inode_dir_notify(inode, DN_CREATE);
inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0,
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
@@ -279,7 +269,6 @@ static inline void fsnotify_access(struct dentry *dentry)
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
- dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -298,7 +287,6 @@ static inline void fsnotify_modify(struct dentry *dentry)
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
- dnotify_parent(dentry, DN_MODIFY);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -370,49 +358,35 @@ static inline void fsnotify_xattr(struct dentry *dentry)
static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
{
struct inode *inode = dentry->d_inode;
- int dn_mask = 0;
- u32 in_mask = 0;
+ __u64 mask = 0;
+
+ if (ia_valid & ATTR_UID)
+ mask |= IN_ATTRIB;
+ if (ia_valid & ATTR_GID)
+ mask |= IN_ATTRIB;
+ if (ia_valid & ATTR_SIZE)
+ mask |= IN_MODIFY;
- if (ia_valid & ATTR_UID) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
- if (ia_valid & ATTR_GID) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
- if (ia_valid & ATTR_SIZE) {
- in_mask |= IN_MODIFY;
- dn_mask |= DN_MODIFY;
- }
/* both times implies a utime(s) call */
if ((ia_valid & (ATTR_ATIME | ATTR_MTIME)) == (ATTR_ATIME | ATTR_MTIME))
- {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- } else if (ia_valid & ATTR_ATIME) {
- in_mask |= IN_ACCESS;
- dn_mask |= DN_ACCESS;
- } else if (ia_valid & ATTR_MTIME) {
- in_mask |= IN_MODIFY;
- dn_mask |= DN_MODIFY;
- }
- if (ia_valid & ATTR_MODE) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
+ mask |= IN_ATTRIB;
+ else if (ia_valid & ATTR_ATIME)
+ mask |= IN_ACCESS;
+ else if (ia_valid & ATTR_MTIME)
+ mask |= IN_MODIFY;
+
+ if (ia_valid & ATTR_MODE)
+ mask |= IN_ATTRIB;
- if (dn_mask)
- dnotify_parent(dentry, dn_mask);
- if (in_mask) {
+ if (mask) {
if (S_ISDIR(inode->i_mode))
- in_mask |= IN_ISDIR;
- inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
- inotify_dentry_parent_queue_event(dentry, in_mask, 0,
+ mask |= IN_ISDIR;
+ inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ inotify_dentry_parent_queue_event(dentry, mask, 0,
dentry->d_name.name);
- fsnotify_parent(dentry, in_mask);
- fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify_parent(dentry, mask);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index ad7294f..2bada0f 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -77,6 +77,9 @@
#define FSNOTIFY_LAST_DENTRY 1
#define FSNOTIFY_INODE_DESTROY 2
+/* listeners that hard code group numbers near the top */
+#define DNOTIFY_GROUP_NUM UINT_MAX
+
struct fsnotify_group;
struct fsnotify_event;
struct fsnotify_mark_entry;
* [PATCH -v1 06/11] fsnotify: generic notification queue and waitq
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
` (3 preceding siblings ...)
2009-02-09 21:15 ` [PATCH -v1 05/11] dnotify: reimplement dnotify using fsnotify Eric Paris
@ 2009-02-09 21:15 ` Eric Paris
2009-02-09 21:15 ` [PATCH -v1 07/11] fsnotify: include pathnames with entries when possible Eric Paris
` (5 subsequent siblings)
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
inotify needs to do async notification in which event information is stored
on a queue until the listener is ready to receive it. This patch
implements a generic notification queue for inotify (and later fanotify) to
store events to be sent at a later time.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/fsnotify.h | 3 +
fs/notify/group.c | 9 ++
fs/notify/notification.c | 181 +++++++++++++++++++++++++++++++++++++-
include/linux/fsnotify_backend.h | 34 +++++++
4 files changed, 223 insertions(+), 4 deletions(-)
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index bad4da9..6d6942c 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -15,8 +15,11 @@ extern struct srcu_struct fsnotify_grp_srcu_struct;
extern struct list_head fsnotify_groups;
extern __u64 fsnotify_mask;
+extern void fsnotify_flush_notif(struct fsnotify_group *group);
+
extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
+
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
index bac7d8d..9168e81 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -91,6 +91,9 @@ void fsnotify_get_group(struct fsnotify_group *group)
static void fsnotify_destroy_group(struct fsnotify_group *group)
{
+ /* clear the notification queue of all events */
+ fsnotify_flush_notif(group);
+
/* clear all inode mark entries for this group */
fsnotify_clear_marks_by_group(group);
@@ -172,6 +175,12 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int
group->group_num = group_num;
group->mask = mask;
+ mutex_init(&group->notification_mutex);
+ INIT_LIST_HEAD(&group->notification_list);
+ init_waitqueue_head(&group->notification_waitq);
+ group->q_len = 0;
+ group->max_events = UINT_MAX;
+
spin_lock_init(&group->mark_lock);
INIT_LIST_HEAD(&group->mark_entries);
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index c893873..3884879 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -33,6 +33,14 @@
#include "fsnotify.h"
static struct kmem_cache *event_kmem_cache;
+static struct kmem_cache *event_holder_kmem_cache;
+static struct fsnotify_event q_overflow_event;
+
+/* return 1 if something is available, return 0 otherwise */
+int fsnotify_check_notif_queue(struct fsnotify_group *group)
+{
+ return !list_empty(&group->notification_list);
+}
void fsnotify_get_event(struct fsnotify_event *event)
{
@@ -58,6 +66,16 @@ void fsnotify_put_event(struct fsnotify_event *event)
}
}
+struct fsnotify_event_holder *alloc_event_holder(void)
+{
+ return kmem_cache_alloc(event_holder_kmem_cache, GFP_KERNEL);
+}
+
+void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder)
+{
+ kmem_cache_free(event_holder_kmem_cache, holder);
+}
+
struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event)
{
struct fsnotify_event_private_data *lpriv;
@@ -72,14 +90,152 @@ struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify
return priv;
}
-struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+static inline int event_compare(struct fsnotify_event *old, struct fsnotify_event *new)
+{
+ if ((old->mask == new->mask) &&
+ (old->to_tell == new->to_tell) &&
+ (old->flag == new->flag)) {
+ if ((old->flag == FSNOTIFY_EVENT_INODE) &&
+ (old->inode == new->inode))
+ return 1;
+ else if ((old->flag == FSNOTIFY_EVENT_PATH) &&
+ (old->path.mnt == new->path.mnt) &&
+ (old->path.dentry == new->path.dentry))
+ return 1;
+ else if (old->flag == FSNOTIFY_EVENT_NONE)
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * we do NOT compare private data when determining if the last event in the
+ * notification queue is the same as this event someone is trying to add
+ */
+int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event *event, struct fsnotify_event_private_data *priv)
+{
+ struct fsnotify_event_holder *holder;
+ struct list_head *list = &group->notification_list;
+ struct fsnotify_event_holder *last_holder;
+ struct fsnotify_event *last_event;
+
+ /*
+ * holder locking
+ *
+ * only this task is going to be adding this event to lists, thus only
+ * this task can add the in-event holder to a list.
+ *
+ * other tasks may be removing this event from some other group's
+ * notification_list.
+ *
+ * those other tasks will blank the in-event holder's list entry under
+ * the holder spinlock. If we see it blank we know that once we
+ * get that lock the in-event holder will be ok for us to (re)use.
+ */
+ if (list_empty(&event->holder.event_list))
+ holder = (struct fsnotify_event_holder *)event;
+ else
+ holder = alloc_event_holder();
+
+ if (!holder)
+ return -ENOMEM;
+
+ mutex_lock(&group->notification_mutex);
+
+ if (group->q_len >= group->max_events)
+ event = &q_overflow_event;
+
+ spin_lock(&event->lock);
+
+ if (!list_empty(list)) {
+ last_holder = list_entry(list->prev, struct fsnotify_event_holder, event_list);
+ last_event = last_holder->event;
+ if (event_compare(last_event, event)) {
+ spin_unlock(&event->lock);
+ mutex_unlock(&group->notification_mutex);
+ if (holder != (struct fsnotify_event_holder *)event)
+ fsnotify_destroy_event_holder(holder);
+ return 0;
+ }
+ }
+
+ group->q_len++;
+ holder->event = event;
+
+ fsnotify_get_event(event);
+ list_add_tail(&holder->event_list, list);
+ if (priv)
+ list_add_tail(&priv->event_list, &event->private_data_list);
+ spin_unlock(&event->lock);
+ mutex_unlock(&group->notification_mutex);
+
+ wake_up(&group->notification_waitq);
+ return 0;
+}
+
+/*
+ * Caller must hold group->notification_mutex and must know an event is
+ * present. It is the caller's responsibility to call fsnotify_put_event()
+ * on the returned event.
+ */
+struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group)
{
struct fsnotify_event *event;
+ struct fsnotify_event_holder *holder;
- event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
- if (!event)
- return NULL;
+ holder = list_first_entry(&group->notification_list, struct fsnotify_event_holder, event_list);
+
+ event = holder->event;
+
+ spin_lock(&event->lock);
+ holder->event = NULL;
+ list_del_init(&holder->event_list);
+ spin_unlock(&event->lock);
+
+ /* holder == &event->holder means we used the event's embedded holder */
+ if (holder != &event->holder)
+ fsnotify_destroy_event_holder(holder);
+
+ group->q_len--;
+
+ return event;
+}
+
+/*
+ * Caller must hold group->notification_mutex and must know an event is
+ * present. This does not remove the event; that must be done with
+ * fsnotify_remove_notif_event().
+ */
+struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group)
+{
+ struct fsnotify_event *event;
+ struct fsnotify_event_holder *holder;
+
+ holder = list_first_entry(&group->notification_list, struct fsnotify_event_holder, event_list);
+ event = holder->event;
+ return event;
+}
+
+void fsnotify_flush_notif(struct fsnotify_group *group)
+{
+ struct fsnotify_event *event;
+
+ /* FIXME: is the mutex really needed here? the group may be safe
+ * to walk locklessly by this point */
+ mutex_lock(&group->notification_mutex);
+ while (fsnotify_check_notif_queue(group)) {
+ event = fsnotify_remove_notif_event(group);
+ if (group->ops->free_event_priv)
+ group->ops->free_event_priv(group, event);
+ fsnotify_put_event(event);
+ }
+ mutex_unlock(&group->notification_mutex);
+}
+
+static void initialize_event(struct fsnotify_event *event)
+{
+ event->holder.event = NULL;
+ INIT_LIST_HEAD(&event->holder.event_list);
atomic_set(&event->refcnt, 1);
spin_lock_init(&event->lock);
@@ -87,9 +243,22 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
event->path.dentry = NULL;
event->path.mnt = NULL;
event->inode = NULL;
+ event->flag = FSNOTIFY_EVENT_NONE;
INIT_LIST_HEAD(&event->private_data_list);
+ event->to_tell = NULL;
+}
+
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{
+ struct fsnotify_event *event;
+
+ event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
+ if (!event)
+ return NULL;
+
+ initialize_event(event);
event->to_tell = to_tell;
switch (data_is) {
@@ -126,6 +295,10 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
__init int fsnotify_notification_init(void)
{
event_kmem_cache = kmem_cache_create("fsnotify_event", sizeof(struct fsnotify_event), 0, SLAB_PANIC, NULL);
+ event_holder_kmem_cache = kmem_cache_create("fsnotify_event_holder", sizeof(struct fsnotify_event_holder), 0, SLAB_PANIC, NULL);
+
+ initialize_event(&q_overflow_event);
+ q_overflow_event.mask = FS_Q_OVERFLOW;
return 0;
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 2bada0f..6223efa 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -69,6 +69,7 @@
FS_DELETE)
/* when calling fsnotify tell it if the data is a path or inode */
+#define FSNOTIFY_EVENT_NONE 0
#define FSNOTIFY_EVENT_PATH 1
#define FSNOTIFY_EVENT_INODE 2
#define FSNOTIFY_EVENT_FILE 3
@@ -101,6 +102,13 @@ struct fsnotify_group {
const struct fsnotify_ops *ops; /* how this group handles things */
+ /* needed to send notification to userspace */
+ struct mutex notification_mutex; /* protect the notification_list */
+ struct list_head notification_list; /* list of event_holders this group needs to send to userspace */
+ wait_queue_head_t notification_waitq; /* read() on the notification file blocks on this waitq */
+ unsigned int q_len; /* events on the queue */
+ unsigned int max_events; /* maximum events allowed on the list */
+
/* stores all mark entries associated with this group so they can be cleaned up on unregister */
spinlock_t mark_lock; /* protect mark_entries list */
struct list_head mark_entries; /* all inode mark entries for this group */
@@ -113,6 +121,21 @@ struct fsnotify_group {
};
};
+/*
+ * A single event can be queued on multiple group->notification_lists.
+ *
+ * Each group->notification_list points to an event_holder which in turn
+ * points to the actual event that needs to be sent to userspace.
+ *
+ * It is cheaper to create one refcounted event plus a small holder for
+ * every group than to create a separate event for every group.
+ */
+struct fsnotify_event_holder {
+ struct fsnotify_event *event;
+ struct list_head event_list;
+};
+
struct fsnotify_event_private_data {
struct fsnotify_group *group;
struct list_head event_list;
@@ -125,6 +148,12 @@ struct fsnotify_event_private_data {
* listener this structure is where you need to be adding fields.
*/
struct fsnotify_event {
+ /*
+ * If we create an event we are also likely going to need a holder
+ * to link it to a group, so one holder is embedded in the event.
+ * This means only one allocation in the common case where the
+ * event goes to a single group.
+ */
+ struct fsnotify_event_holder holder;
spinlock_t lock; /* protection for the associated event_holder and private_list */
struct inode *to_tell;
/*
@@ -183,6 +212,11 @@ extern void fsnotify_get_event(struct fsnotify_event *event);
extern void fsnotify_put_event(struct fsnotify_event *event);
extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
+extern int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event *event, struct fsnotify_event_private_data *priv);
+extern int fsnotify_check_notif_queue(struct fsnotify_group *group);
+extern struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group);
+extern struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group);
+
extern void fsnotify_recalc_inode_mask(struct inode *inode);
extern void fsnotify_init_mark(struct fsnotify_mark_entry *entry, struct fsnotify_group *group, struct inode *inode, __u64 mask);
extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode);
* [PATCH -v1 07/11] fsnotify: include pathnames with entries when possible
From: Eric Paris @ 2009-02-09 21:15 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
When inotify wants to send events to a directory about a child it includes
the name of the original file. This patch collects that filename and makes
it available for notification.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/fsnotify.c | 4 ++--
fs/notify/fsnotify.h | 5 ++++-
fs/notify/notification.c | 18 +++++++++++++++++-
include/linux/fsnotify.h | 32 ++++++++++++++++----------------
include/linux/fsnotify_backend.h | 7 +++++--
5 files changed, 44 insertions(+), 22 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index e7e53f7..2aa437b 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -34,7 +34,7 @@ void __fsnotify_inode_delete(struct inode *inode, int flag)
}
EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);
-void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name)
{
struct fsnotify_group *group;
struct fsnotify_event *event = NULL;
@@ -66,7 +66,7 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
if (!group->ops->should_send_event(group, to_tell, mask))
continue;
if (!event) {
- event = fsnotify_create_event(to_tell, mask, data, data_is);
+ event = fsnotify_create_event(to_tell, mask, data, data_is, file_name);
/* we OOM'd and now we can't tell; let's hope something else blows up */
if (!event)
break;
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 6d6942c..72e163e 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -17,9 +17,12 @@ extern __u64 fsnotify_mask;
extern void fsnotify_flush_notif(struct fsnotify_group *group);
-extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
+extern struct fsnotify_event_holder *fsnotify_alloc_event_holder(void);
+extern void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder);
+
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 3884879..21a8b03 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -62,6 +62,7 @@ void fsnotify_put_event(struct fsnotify_event *event)
event->mask = 0;
BUG_ON(!list_empty(&event->private_data_list));
+ kfree(event->file_name);
kmem_cache_free(event_kmem_cache, event);
}
}
@@ -248,9 +249,12 @@ static void initialize_event(struct fsnotify_event *event)
INIT_LIST_HEAD(&event->private_data_list);
event->to_tell = NULL;
+
+ event->file_name = NULL;
+ event->name_len = 0;
}
-struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name)
{
struct fsnotify_event *event;
@@ -259,6 +263,15 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
return NULL;
initialize_event(event);
+
+ if (name) {
+ event->file_name = kstrdup(name, GFP_KERNEL);
+ if (!event->file_name) {
+ kmem_cache_free(event_kmem_cache, event);
+ return NULL;
+ }
+ event->name_len = strlen(event->file_name);
+ }
event->to_tell = to_tell;
switch (data_is) {
@@ -284,6 +297,9 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
event->flag = FSNOTIFY_EVENT_INODE;
break;
default:
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ event->inode = NULL;
BUG();
};
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f5ce85b..3081d86 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -107,7 +107,7 @@ static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
dget(parent);
spin_unlock(&dentry->d_lock);
- fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
dput(parent);
} else {
spin_unlock(&dentry->d_lock);
@@ -161,18 +161,18 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
source);
- fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE);
- fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE);
+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name);
if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
- fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE);
+ fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL);
}
if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
- fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -205,7 +205,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);
- fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL);
__fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}
@@ -216,7 +216,7 @@ static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);
- fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL);
}
/*
@@ -228,7 +228,7 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
- fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
}
/*
@@ -243,7 +243,7 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);
- fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name);
}
/*
@@ -255,7 +255,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
- fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
}
/*
@@ -273,7 +273,7 @@ static inline void fsnotify_access(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
/*
@@ -291,7 +291,7 @@ static inline void fsnotify_modify(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
/*
@@ -309,7 +309,7 @@ static inline void fsnotify_open(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
/*
@@ -330,7 +330,7 @@ static inline void fsnotify_close(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
}
/*
@@ -348,7 +348,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
/*
@@ -386,7 +386,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
dentry->d_name.name);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 6223efa..b86b4ef 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -168,6 +168,9 @@ struct fsnotify_event {
atomic_t refcnt; /* how many groups still are using/need to send this event */
__u64 mask; /* the type of access */
+ char *file_name;
+ size_t name_len;
+
struct list_head private_data_list;
};
@@ -198,7 +201,7 @@ struct fsnotify_mark_entry {
#ifdef CONFIG_FSNOTIFY
/* called from the vfs to signal fs events */
-extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
extern void __fsnotify_inode_delete(struct inode *inode, int flag);
/* called from fsnotify interfaces, such as fanotify or dnotify */
@@ -226,7 +229,7 @@ extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
#else
-static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name)
{}
static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
* [PATCH -v1 08/11] fsnotify: add correlations between events
From: Eric Paris @ 2009-02-09 21:16 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
inotify sends userspace a correlation between events when they are
related (i.e., when dentries are moved). This patch adds the same
support for all fsnotify events.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/fsnotify.c | 4 ++--
fs/notify/fsnotify.h | 2 +-
fs/notify/notification.c | 14 +++++++++++++-
include/linux/fsnotify.h | 39 +++++++++++++++++++-------------------
include/linux/fsnotify_backend.h | 11 +++++++++--
5 files changed, 45 insertions(+), 25 deletions(-)
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 2aa437b..8b816c0 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -34,7 +34,7 @@ void __fsnotify_inode_delete(struct inode *inode, int flag)
}
EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);
-void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name)
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name, u32 cookie)
{
struct fsnotify_group *group;
struct fsnotify_event *event = NULL;
@@ -66,7 +66,7 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const
if (!group->ops->should_send_event(group, to_tell, mask))
continue;
if (!event) {
- event = fsnotify_create_event(to_tell, mask, data, data_is, file_name);
+ event = fsnotify_create_event(to_tell, mask, data, data_is, file_name, cookie);
/* we OOM'd and now we can't tell; let's hope something else blows up */
if (!event)
break;
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 72e163e..b384fa8 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -17,7 +17,7 @@ extern __u64 fsnotify_mask;
extern void fsnotify_flush_notif(struct fsnotify_group *group);
-extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie);
extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 21a8b03..a636a1f 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
+#include <linux/module.h>
#include <linux/mount.h>
#include <linux/mutex.h>
#include <linux/namei.h>
@@ -35,6 +36,13 @@
static struct kmem_cache *event_kmem_cache;
static struct kmem_cache *event_holder_kmem_cache;
static struct fsnotify_event q_overflow_event;
+static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0);
+
+u32 fsnotify_get_cookie(void)
+{
+ return atomic_inc_return(&fsnotify_sync_cookie);
+}
+EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
/* return 1 if something is available, return 0 otherwise */
int fsnotify_check_notif_queue(struct fsnotify_group *group)
@@ -252,9 +260,11 @@ static void initialize_event(struct fsnotify_event *event)
event->file_name = NULL;
event->name_len = 0;
+
+ event->sync_cookie = 0;
}
-struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name)
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie)
{
struct fsnotify_event *event;
@@ -272,6 +282,8 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
}
event->name_len = strlen(event->file_name);
}
+
+ event->sync_cookie = cookie;
event->to_tell = to_tell;
switch (data_is) {
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 3081d86..bdbc897 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -107,7 +107,7 @@ static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
dget(parent);
spin_unlock(&dentry->d_lock);
- fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
dput(parent);
} else {
spin_unlock(&dentry->d_lock);
@@ -139,7 +139,8 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
int isdir, struct inode *target, struct dentry *moved)
{
struct inode *source = moved->d_inode;
- u32 cookie = inotify_get_cookie();
+ u32 in_cookie = inotify_get_cookie();
+ u32 fs_cookie = fsnotify_get_cookie();
__u64 old_dir_mask = 0;
__u64 new_dir_mask = 0;
@@ -156,23 +157,23 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
old_dir_mask |= FS_MOVED_FROM;
new_dir_mask |= FS_MOVED_TO;
- inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,cookie,old_name,
+ inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir, in_cookie, old_name,
source);
- inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
+ inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, in_cookie, new_name,
source);
- fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name);
- fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name);
+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name, fs_cookie);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name, fs_cookie);
if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
- fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL, 0);
}
if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
- fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -205,7 +206,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);
- fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
__fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}
@@ -216,7 +217,7 @@ static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);
- fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
/*
@@ -228,7 +229,7 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
- fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
}
/*
@@ -243,7 +244,7 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);
- fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name);
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name, 0);
}
/*
@@ -255,7 +256,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
- fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
}
/*
@@ -273,7 +274,7 @@ static inline void fsnotify_access(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
/*
@@ -291,7 +292,7 @@ static inline void fsnotify_modify(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
/*
@@ -309,7 +310,7 @@ static inline void fsnotify_open(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
/*
@@ -330,7 +331,7 @@ static inline void fsnotify_close(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}
/*
@@ -348,7 +349,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
/*
@@ -386,7 +387,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
dentry->d_name.name);
fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index b86b4ef..c86795d 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -168,6 +168,7 @@ struct fsnotify_event {
atomic_t refcnt; /* how many groups still are using/need to send this event */
__u64 mask; /* the type of access */
+ u32 sync_cookie; /* used to correlate events, namely inotify mv events */
char *file_name;
size_t name_len;
@@ -201,8 +202,9 @@ struct fsnotify_mark_entry {
#ifdef CONFIG_FSNOTIFY
/* called from the vfs to signal fs events */
-extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie);
extern void __fsnotify_inode_delete(struct inode *inode, int flag);
+extern u32 fsnotify_get_cookie(void);
/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
@@ -229,12 +231,17 @@ extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
#else
-static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie)
{}
static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
{}
+static inline u32 fsnotify_get_cookie(void)
+{
+ return 0;
+}
+
#endif /* CONFIG_FSNOTIFY */
#endif /* __KERNEL__ */
* [PATCH -v1 09/11] fsnotify: fsnotify marks on inodes pin them in core
From: Eric Paris @ 2009-02-09 21:16 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
This patch pins any inode with an fsnotify mark in core. The idea is that
as soon as the mark is removed from the inode->fsnotify_mark_entries list
the inode will be iput. In reality it doesn't work exactly that way: the
igrab happens when the mark is added to an inode, but the iput happens
when the inode pointer is NULL'd inside the mark.
It's possible for two racing paths to try to remove the mark from
different directions. One may try to remove the mark because of an
explicit request and the other because the inode was deleted. The
removal due to inode deletion may take the mark off the inode's list,
while the removal by explicit request actually sets entry->inode = NULL
and calls the iput. This is safe.
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/notify/inode_mark.c | 25 ++++++++++++++++++++-----
1 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 840bd91..ff65e62 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -108,13 +108,16 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
spin_lock(&inode->i_lock);
list_del_init(&entry->i_list);
- entry->inode = NULL;
list_del_init(&entry->g_list);
- entry->group = NULL;
- entry->freeme = 1;
fsnotify_recalc_inode_mask_locked(inode);
spin_unlock(&inode->i_lock);
+
+ entry->group = NULL;
+ entry->freeme = 1;
+ entry->inode = NULL;
+ iput(inode);
+
spin_unlock(&entry->lock);
fsnotify_put_mark(entry);
@@ -137,14 +140,17 @@ void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry)
spin_lock(&inode->i_lock);
list_del_init(&entry->i_list);
- entry->inode = NULL;
list_del_init(&entry->g_list);
+
+ entry->inode = NULL;
entry->group = NULL;
entry->freeme = 1;
if (inode) {
fsnotify_recalc_inode_mask_locked(inode);
spin_unlock(&inode->i_lock);
+
+ iput(inode);
}
if (group)
spin_unlock(&group->mark_lock);
@@ -173,6 +179,11 @@ void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
* try to grab entry->inode->i_lock without a problem.
*/
list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
+ spin_lock(&entry->lock);
+ if (entry->inode)
+ iput(entry->inode);
+ entry->inode = NULL;
+ spin_unlock(&entry->lock);
entry->group->ops->mark_clear_inode(entry, inode, flags);
fsnotify_put_mark(entry);
}
@@ -209,9 +220,13 @@ int fsnotify_add_mark(struct fsnotify_mark_entry *entry)
{
struct fsnotify_mark_entry *lentry;
struct fsnotify_group *group = entry->group;
- struct inode *inode = entry->inode;
+ struct inode *inode;
int ret = 0;
+ inode = igrab(entry->inode);
+ if (unlikely(!inode))
+ return -EINVAL;
+
/*
* LOCKING ORDER!!!!
* entry->lock
* [PATCH -v1 10/11] fsnotify: handle filesystem unmounts with fsnotify marks
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
` (7 preceding siblings ...)
2009-02-09 21:16 ` [PATCH -v1 09/11] fsnotify: fsnotify marks on inodes pin them in core Eric Paris
@ 2009-02-09 21:16 ` Eric Paris
2009-02-09 21:16 ` [PATCH -v1 11/11] inotify: reimplement inotify using fsnotify Eric Paris
2009-02-09 21:28 ` [PATCH -v1 00/11] fsnotify: unified filesystem notification backend Eric Paris
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:16 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
When an fs is unmounted with an fsnotify mark entry attached to one of its
inodes we need to destroy that mark entry and we also (like inotify) send
an unmount event.
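The traversal trick this patch uses in fsnotify_unmount_inodes() (pin the *next* list entry before dropping inode_lock, so that releasing the current entry cannot invalidate the cursor) can be sketched as a userspace mock on a plain linked list. This is a hedged illustration of the refcount bookkeeping only; the real function also filters on i_state and handles concurrent removal:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an inode on sb->s_inodes with a reference count. */
struct node {
	struct node *next;
	int refs;
	int visited;
};

static void get_node(struct node *n) { n->refs++; }

static void put_node(struct node *n)
{
	assert(n->refs > 0);
	n->refs--;
}

/*
 * Walk the list while holding references across a blocking section,
 * mirroring the need_iput lookahead in fsnotify_unmount_inodes():
 * the next entry is pinned before we "drop the lock" on the current one.
 */
static void walk_pinned(struct node *head)
{
	struct node *n = head;
	struct node *prepinned = NULL;	/* like need_iput */

	while (n) {
		struct node *next = n->next;

		if (n != prepinned)
			get_node(n);	/* pin current unless lookahead did */
		prepinned = NULL;

		if (next) {
			get_node(next);	/* lookahead pin keeps 'next' alive */
			prepinned = next;
		}

		/* inode_lock would be dropped here; work may block */
		n->visited = 1;

		put_node(n);		/* drop current's pin */
		n = next;
	}
}
```

After the walk every node has been visited and every reference count is back at its baseline, which is the invariant the kernel loop must also preserve.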
Signed-off-by: Eric Paris <eparis@redhat.com>
---
fs/inode.c | 1 +
fs/notify/inode_mark.c | 73 ++++++++++++++++++++++++++++++++++++++
include/linux/fsnotify_backend.h | 5 +++
3 files changed, 79 insertions(+), 0 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index f6a51a1..bec51e3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -405,6 +405,7 @@ int invalidate_inodes(struct super_block * sb)
mutex_lock(&iprune_mutex);
spin_lock(&inode_lock);
inotify_unmount_inodes(&sb->s_inodes);
+ fsnotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
spin_unlock(&inode_lock);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index ff65e62..f9c5cfe 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -23,6 +23,7 @@
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/writeback.h> /* for inode_lock */
#include <asm/atomic.h>
@@ -255,3 +256,75 @@ int fsnotify_add_mark(struct fsnotify_mark_entry *entry)
return ret;
}
+
+/**
+ * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
+ * @list: list of inodes being unmounted (sb->s_inodes)
+ *
+ * Called with inode_lock held, protecting the unmounting super block's list
+ * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * We temporarily drop inode_lock, however, and CAN block.
+ */
+void fsnotify_unmount_inodes(struct list_head *list)
+{
+ struct inode *inode, *next_i, *need_iput = NULL;
+
+ list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+ struct inode *need_iput_tmp;
+
+ /*
+ * If i_count is zero, the inode cannot have any watches and
+ * doing an __iget/iput with MS_ACTIVE clear would actually
+ * evict all inodes with zero i_count from icache which is
+ * unnecessarily violent and may in fact be illegal to do.
+ */
+ if (!atomic_read(&inode->i_count))
+ continue;
+
+ /*
+ * We cannot __iget() an inode in state I_CLEAR, I_FREEING, or
+ * I_WILL_FREE which is fine because by that point the inode
+ * cannot have any associated watches.
+ */
+ if (inode->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))
+ continue;
+
+ need_iput_tmp = need_iput;
+ need_iput = NULL;
+
+ /* In case fsnotify_inode_delete() drops a reference. */
+ if (inode != need_iput_tmp)
+ __iget(inode);
+ else
+ need_iput_tmp = NULL;
+
+ /* In case the dropping of a reference would nuke next_i. */
+ if ((&next_i->i_sb_list != list) &&
+ atomic_read(&next_i->i_count) &&
+ !(next_i->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+
+ /*
+ * We can safely drop inode_lock here because we hold
+ * references on both inode and next_i. Also no new inodes
+ * will be added since the umount has begun. Finally,
+ * iprune_mutex keeps shrink_icache_memory() away.
+ */
+ spin_unlock(&inode_lock);
+
+ if (need_iput_tmp)
+ iput(need_iput_tmp);
+
+ /* for each watch, send FS_UNMOUNT and then remove it */
+ fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+
+ fsnotify_inode_delete(inode);
+
+ iput(inode);
+
+ spin_lock(&inode_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(fsnotify_unmount_inodes);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index c86795d..203842c 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -229,6 +229,8 @@ extern int fsnotify_add_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry);
extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
+
+extern void fsnotify_unmount_inodes(struct list_head *list);
#else
static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie);
@@ -242,6 +244,9 @@ static inline u32 fsnotify_get_cookie(void)
return 0;
}
+static inline void fsnotify_unmount_inodes(struct list_head *list)
+{}
+
#endif /* CONFIG_FSNOTIFY */
#endif /* __KERNEL __ */
* [PATCH -v1 11/11] inotify: reimplement inotify using fsnotify
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
` (8 preceding siblings ...)
2009-02-09 21:16 ` [PATCH -v1 10/11] fsnotify: handle filesystem unmounts with fsnotify marks Eric Paris
@ 2009-02-09 21:16 ` Eric Paris
2009-02-09 21:28 ` [PATCH -v1 00/11] fsnotify: unified filesystem notification backend Eric Paris
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:16 UTC (permalink / raw)
To: linux-kernel; +Cc: viro, hch, alan, sfr, john, rlove, malware-list, akpm
Yes, holy shit, I'm trying to reimplement inotify as fsnotify...
Signed-off-by: Eric Paris <eparis@redhat.com>
---
MAINTAINERS | 2
fs/notify/inotify/Kconfig | 20 +
fs/notify/inotify/Makefile | 2
fs/notify/inotify/inotify.h | 108 ++++++
fs/notify/inotify/inotify_fsnotify.c | 176 ++++++++++
fs/notify/inotify/inotify_kernel.c | 236 +++++++++++++
fs/notify/inotify/inotify_user.c | 598 +++++++++-------------------------
fs/notify/notification.c | 30 +-
include/linux/fsnotify.h | 39 +-
include/linux/fsnotify_backend.h | 11 +
10 files changed, 744 insertions(+), 478 deletions(-)
create mode 100644 fs/notify/inotify/inotify.h
create mode 100644 fs/notify/inotify/inotify_fsnotify.c
create mode 100644 fs/notify/inotify/inotify_kernel.c
diff --git a/MAINTAINERS b/MAINTAINERS
index daba68f..0bc3f24 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2243,6 +2243,8 @@ P: John McCutchan
M: john@johnmccutchan.com
P: Robert Love
M: rlove@rlove.org
+P: Eric Paris
+M: eparis@parisplace.org
L: linux-kernel@vger.kernel.org
S: Maintained
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index 4467928..5356884 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,26 +1,30 @@
config INOTIFY
bool "Inotify file change notification support"
- default y
+ default n
---help---
- Say Y here to enable inotify support. Inotify is a file change
- notification system and a replacement for dnotify. Inotify fixes
- numerous shortcomings in dnotify and introduces several new features
- including multiple file events, one-shot support, and unmount
- notification.
+ Say Y here to enable legacy in kernel inotify support. Inotify is a
+ file change notification system. It is a replacement for dnotify.
+ This option only provides the legacy inotify in kernel API. There
+ are no in tree kernel users of this interface since it is deprecated.
+ You only need this if you are loading an out of tree kernel module
+ that uses inotify.
For more information, see <file:Documentation/filesystems/inotify.txt>
- If unsure, say Y.
+ If unsure, say N.
config INOTIFY_USER
bool "Inotify support for userspace"
- depends on INOTIFY
+ depends on FSNOTIFY
default y
---help---
Say Y here to enable inotify support for userspace, including the
associated system calls. Inotify allows monitoring of both files and
directories via a single open fd. Events are read from the file
descriptor, which is also select()- and poll()-able.
+ Inotify fixes numerous shortcomings in dnotify and introduces several
+ new features including multiple file events, one-shot support, and
+ unmount notification.
For more information, see <file:Documentation/filesystems/inotify.txt>
diff --git a/fs/notify/inotify/Makefile b/fs/notify/inotify/Makefile
index e290f3b..aff7f68 100644
--- a/fs/notify/inotify/Makefile
+++ b/fs/notify/inotify/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_INOTIFY) += inotify.o
-obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
+obj-$(CONFIG_INOTIFY_USER) += inotify_fsnotify.o inotify_kernel.o inotify_user.o
diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
new file mode 100644
index 0000000..e37d24c
--- /dev/null
+++ b/fs/notify/inotify/inotify.h
@@ -0,0 +1,108 @@
+/*
+ * fs/notify/inotify/inotify.h - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <ttb@tentacle.dhs.org>
+ * Robert Love <rml@novell.com>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/inotify.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+#include <linux/fsnotify.h>
+
+#include <asm/ioctls.h>
+
+extern struct kmem_cache *inotify_inode_mark_cachep;
+extern struct kmem_cache *event_priv_cachep;
+extern int inotify_max_user_watches;
+
+struct inotify_event_private_data {
+ struct fsnotify_event_private_data fsnotify_event_priv_data;
+ int wd;
+};
+
+struct inotify_inode_mark_entry {
+ /* fsnotify_mark_entry MUST be the first thing */
+ struct fsnotify_mark_entry fsn_entry;
+ int wd;
+};
+
+static inline __u64 inotify_arg_to_mask(u32 arg)
+{
+ /* everything should accept its own ignored */
+ __u64 mask = FS_IN_IGNORED;
+
+ BUILD_BUG_ON(IN_ACCESS != FS_ACCESS);
+ BUILD_BUG_ON(IN_MODIFY != FS_MODIFY);
+ BUILD_BUG_ON(IN_ATTRIB != FS_ATTRIB);
+ BUILD_BUG_ON(IN_CLOSE_WRITE != FS_CLOSE_WRITE);
+ BUILD_BUG_ON(IN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
+ BUILD_BUG_ON(IN_OPEN != FS_OPEN);
+ BUILD_BUG_ON(IN_MOVED_FROM != FS_MOVED_FROM);
+ BUILD_BUG_ON(IN_MOVED_TO != FS_MOVED_TO);
+ BUILD_BUG_ON(IN_CREATE != FS_CREATE);
+ BUILD_BUG_ON(IN_DELETE != FS_DELETE);
+ BUILD_BUG_ON(IN_DELETE_SELF != FS_DELETE_SELF);
+ BUILD_BUG_ON(IN_MOVE_SELF != FS_MOVE_SELF);
+ BUILD_BUG_ON(IN_Q_OVERFLOW != FS_Q_OVERFLOW);
+
+ BUILD_BUG_ON(IN_UNMOUNT != FS_UNMOUNT);
+ BUILD_BUG_ON(IN_ISDIR != FS_IN_ISDIR);
+ BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
+ BUILD_BUG_ON(IN_ONESHOT != FS_IN_ONESHOT);
+
+ mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT));
+
+ mask |= ((mask & FS_EVENTS_WITH_CHILD) << 32);
+
+ return mask;
+}
+
+static inline u32 inotify_mask_to_arg(__u64 mask)
+{
+ u32 arg;
+
+ arg = (mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED | IN_Q_OVERFLOW));
+
+ arg |= ((mask >> 32) & FS_EVENTS_WITH_CHILD);
+
+ return arg;
+}
+
+
+int find_inode(const char __user *dirname, struct path *path, unsigned flags);
+void inotify_destroy_mark_entry(struct fsnotify_mark_entry *entry);
+int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg);
+struct fsnotify_group *inotify_new_group(struct user_struct *user, unsigned int max_events);
+void __inotify_free_event_priv(struct inotify_event_private_data *event_priv);
+
+extern const struct fsnotify_ops inotify_fsnotify_ops;
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
new file mode 100644
index 0000000..90dca51
--- /dev/null
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -0,0 +1,176 @@
+/*
+ * fs/notify/inotify/inotify_fsnotify.c - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <ttb@tentacle.dhs.org>
+ * Robert Love <rml@novell.com>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * Copyright (C) 2009 Eric Paris <Red Hat Inc>
+ * inotify was largely rewritten to make use of the fsnotify infrastructure
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/inotify.h>
+#include <linux/list.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+
+#include "inotify.h"
+
+#include <asm/ioctls.h>
+
+static int inotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_mark_entry *entry;
+ struct inotify_inode_mark_entry *ientry;
+ struct inode *to_tell;
+ struct inotify_event_private_data *event_priv;
+ int wd;
+
+ to_tell = event->to_tell;
+
+ spin_lock(&to_tell->i_lock);
+ entry = fsnotify_find_mark_entry(group, to_tell);
+ spin_unlock(&to_tell->i_lock);
+ /* race with watch removal? */
+ if (!entry)
+ return 0;
+ ientry = (struct inotify_inode_mark_entry *)entry;
+
+ wd = ientry->wd;
+
+ fsnotify_put_mark(entry);
+
+ event_priv = kmem_cache_alloc(event_priv_cachep, GFP_KERNEL);
+ if (unlikely(!event_priv))
+ return -ENOMEM;
+
+ event_priv->fsnotify_event_priv_data.group = group;
+ event_priv->wd = wd;
+
+ return fsnotify_add_notif_event(group, event, (struct fsnotify_event_private_data *)event_priv);
+}
+
+static void inotify_mark_clear_inode(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags)
+{
+ if (unlikely((flags != FSNOTIFY_LAST_DENTRY) && (flags != FSNOTIFY_INODE_DESTROY))) {
+ BUG();
+ return;
+ }
+
+ /*
+ * so no matter what we need to put this entry back on the inode's list.
+ * we need it there so fsnotify can find it to send the ignore message.
+ *
+ * I didn't realize how brilliant this was until I did it. Our caller
+ * blanked the inode->i_fsnotify_mark_entries list so we will be the
+ * only mark on the list when fsnotify runs so only our group will get
+ * this FS_IN_IGNORED.
+ *
+ * Bloody brilliant.
+ */
+ spin_lock(&inode->i_lock);
+ list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
+ spin_unlock(&inode->i_lock);
+
+ fsnotify(inode, FS_IN_IGNORED, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+ inotify_destroy_mark_entry(entry);
+}
+
+static int inotify_should_send_event(struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ struct fsnotify_mark_entry *entry;
+ int send;
+
+ spin_lock(&inode->i_lock);
+ entry = fsnotify_find_mark_entry(group, inode);
+ spin_unlock(&inode->i_lock);
+ if (!entry)
+ return 0;
+
+ spin_lock(&entry->lock);
+ send = !!(entry->mask & mask);
+ spin_unlock(&entry->lock);
+
+ /* find took a reference */
+ fsnotify_put_mark(entry);
+
+ return send;
+}
+
+static int idr_callback(int id, void *p, void *data)
+{
+ BUG();
+ return 0;
+}
+
+static void inotify_free_group_priv(struct fsnotify_group *group)
+{
+ /* ideally the idr is empty and we won't hit the BUG in the callback */
+ idr_for_each(&group->inotify_data.idr, idr_callback, NULL);
+ idr_remove_all(&group->inotify_data.idr);
+ idr_destroy(&group->inotify_data.idr);
+}
+
+void __inotify_free_event_priv(struct inotify_event_private_data *event_priv)
+{
+ list_del_init(&event_priv->fsnotify_event_priv_data.event_list);
+ kmem_cache_free(event_priv_cachep, event_priv);
+}
+
+static void inotify_free_event_priv(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct inotify_event_private_data *event_priv;
+
+ spin_lock(&event->lock);
+
+ event_priv = (struct inotify_event_private_data *)fsnotify_get_priv_from_event(group, event);
+ BUG_ON(!event_priv);
+
+ __inotify_free_event_priv(event_priv);
+
+ spin_unlock(&event->lock);
+}
+
+/* ding dong the mark is dead */
+static void inotify_free_mark(struct fsnotify_mark_entry *entry)
+{
+ struct inotify_inode_mark_entry *ientry = (struct inotify_inode_mark_entry *)entry;
+
+ kmem_cache_free(inotify_inode_mark_cachep, ientry);
+}
+
+const struct fsnotify_ops inotify_fsnotify_ops = {
+ .handle_event = inotify_handle_event,
+ .mark_clear_inode = inotify_mark_clear_inode,
+ .should_send_event = inotify_should_send_event,
+ .free_group_priv = inotify_free_group_priv,
+ .free_event_priv = inotify_free_event_priv,
+ .free_mark = inotify_free_mark,
+};
diff --git a/fs/notify/inotify/inotify_kernel.c b/fs/notify/inotify/inotify_kernel.c
new file mode 100644
index 0000000..1d8673b
--- /dev/null
+++ b/fs/notify/inotify/inotify_kernel.c
@@ -0,0 +1,236 @@
+/*
+ * fs/notify/inotify/inotify_kernel.c - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <ttb@tentacle.dhs.org>
+ * Robert Love <rml@novell.com>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * Copyright (C) 2009 Eric Paris <Red Hat Inc>
+ * inotify was largely rewritten to make use of the fsnotify infrastructure
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/inotify.h>
+#include <linux/list.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+
+#include "inotify.h"
+
+#include <asm/ioctls.h>
+
+struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
+struct kmem_cache *event_priv_cachep __read_mostly;
+
+atomic_t inotify_grp_num;
+
+/*
+ * find_inode - resolve a user-given path to a specific inode
+ */
+int find_inode(const char __user *dirname, struct path *path, unsigned flags)
+{
+ int error;
+
+ error = user_path_at(AT_FDCWD, dirname, flags, path);
+ if (error)
+ return error;
+ /* you can only watch an inode if you have read permissions on it */
+ error = inode_permission(path->dentry->d_inode, MAY_READ);
+ if (error)
+ path_put(path);
+ return error;
+}
+
+void inotify_destroy_mark_entry(struct fsnotify_mark_entry *entry)
+{
+ struct inotify_inode_mark_entry *ientry = container_of(entry, struct inotify_inode_mark_entry, fsn_entry);
+ struct fsnotify_group *group;
+ struct idr *idr;
+
+ spin_lock(&entry->lock);
+
+ group = entry->group;
+ if (!group) {
+ /* racing with group tear down, let it do it */
+ spin_unlock(&entry->lock);
+ return;
+ }
+
+ /* group can't disappear on us, since it has to shoot this entry to disappear
+ * and it's going to block on entry->lock */
+
+ /* remove this entry from the idr */
+ idr = &group->inotify_data.idr;
+ spin_lock(&group->inotify_data.idr_lock);
+ idr_remove(idr, ientry->wd);
+ spin_unlock(&group->inotify_data.idr_lock);
+
+ spin_unlock(&entry->lock);
+
+ /* mark the entry to die */
+ fsnotify_destroy_mark_by_entry(entry);
+
+ /* removed from idr, drop that reference */
+ fsnotify_put_mark(entry);
+}
+
+int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg)
+{
+ struct fsnotify_mark_entry *entry = NULL;
+ struct inotify_inode_mark_entry *ientry;
+ int ret = 0;
+ int add = (arg & IN_MASK_ADD);
+ __u64 mask;
+ __u64 old_mask, new_mask;
+
+ /* don't allow invalid bits: we don't want flags set */
+ mask = inotify_arg_to_mask(arg);
+ if (unlikely(!mask))
+ return -EINVAL;
+
+ ientry = kmem_cache_alloc(inotify_inode_mark_cachep, GFP_KERNEL);
+ if (unlikely(!ientry))
+ return -ENOMEM;
+ /* we set the mask at the end after attaching it */
+ fsnotify_init_mark(&ientry->fsn_entry, group, inode, 0);
+ ientry->wd = 0;
+
+find_entry:
+ entry = fsnotify_find_mark_entry(group, inode);
+ if (entry) {
+ kmem_cache_free(inotify_inode_mark_cachep, ientry);
+ ientry = container_of(entry, struct inotify_inode_mark_entry, fsn_entry);
+ } else {
+ if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches) {
+ ret = -ENOSPC;
+ goto out_err;
+ }
+
+ ret = fsnotify_add_mark(&ientry->fsn_entry);
+ if (ret == -EEXIST)
+ goto find_entry;
+ else if (ret)
+ goto out_err;
+
+ entry = &ientry->fsn_entry;
+retry:
+ ret = -ENOMEM;
+ if (unlikely(!idr_pre_get(&group->inotify_data.idr, GFP_KERNEL)))
+ goto out_err;
+
+ spin_lock(&group->inotify_data.idr_lock);
+ /* if entry is added to the idr we keep the reference obtained
+ * through fsnotify_mark_add. remember to drop this reference
+ * when entry is removed from idr */
+ ret = idr_get_new_above(&group->inotify_data.idr, entry,
+ ++group->inotify_data.last_wd,
+ &ientry->wd);
+ spin_unlock(&group->inotify_data.idr_lock);
+ if (ret) {
+ if (ret == -EAGAIN)
+ goto retry;
+ goto out_err;
+ }
+ atomic_inc(&group->inotify_data.user->inotify_watches);
+ }
+
+ spin_lock(&entry->lock);
+
+ old_mask = entry->mask;
+ if (add) {
+ entry->mask |= mask;
+ new_mask = entry->mask;
+ } else {
+ entry->mask = mask;
+ new_mask = entry->mask;
+ }
+
+ spin_unlock(&entry->lock);
+
+ if (old_mask != new_mask) {
+ /* more bits in old than in new? */
+ int dropped = (old_mask & ~new_mask);
+ /* more bits in this entry than the inode's mask? */
+ int do_inode = (new_mask & ~inode->i_fsnotify_mask);
+ /* more bits in this entry than the group? */
+ int do_group = (new_mask & ~group->mask);
+
+ /* update the inode with this new entry */
+ if (dropped || do_inode)
+ fsnotify_recalc_inode_mask(inode);
+
+ /* update the group mask with the new mask */
+ if (dropped || do_group)
+ fsnotify_recalc_group_mask(group);
+ }
+
+ return ientry->wd;
+
+out_err:
+ /* see this isn't supposed to happen, just kill the watch */
+ if (entry) {
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_mark(entry);
+ }
+ return ret;
+}
+
+struct fsnotify_group *inotify_new_group(struct user_struct *user, unsigned int max_events)
+{
+ struct fsnotify_group *group;
+ unsigned int grp_num;
+
+ /* fsnotify_obtain_group took a reference to group, we put this when we kill the file in the end */
+ grp_num = (INOTIFY_GROUP_NUM - atomic_inc_return(&inotify_grp_num));
+ group = fsnotify_obtain_group(grp_num, grp_num, 0, &inotify_fsnotify_ops);
+ if (IS_ERR(group))
+ return group;
+
+ group->max_events = max_events;
+
+ spin_lock_init(&group->inotify_data.idr_lock);
+ idr_init(&group->inotify_data.idr);
+ group->inotify_data.last_wd = 0;
+ group->inotify_data.user = user;
+ group->inotify_data.fa = NULL;
+
+ return group;
+}
+
+static int __init inotify_kernel_setup(void)
+{
+ inotify_inode_mark_cachep = kmem_cache_create("inotify_mark_entry",
+ sizeof(struct inotify_inode_mark_entry),
+ 0, SLAB_PANIC, NULL);
+ event_priv_cachep = kmem_cache_create("inotify_event_priv_cache",
+ sizeof(struct inotify_event_private_data),
+ 0, SLAB_PANIC, NULL);
+ return 0;
+}
+subsys_initcall(inotify_kernel_setup);
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index bed766e..44b9844 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -8,6 +8,9 @@
* Copyright (C) 2005 John McCutchan
* Copyright 2006 Hewlett-Packard Development Company, L.P.
*
+ * Copyright (C) 2009 Eric Paris <Red Hat Inc>
+ * inotify was largely rewritten to make use of the fsnotify infrastructure
+ *
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2, or (at your option) any
@@ -24,89 +27,32 @@
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/poll.h>
#include <linux/init.h>
-#include <linux/list.h>
#include <linux/inotify.h>
+#include <linux/list.h>
#include <linux/syscalls.h>
+#include <linux/string.h>
#include <linux/magic.h>
+#include <linux/writeback.h>
-#include <asm/ioctls.h>
+#include "inotify.h"
-static struct kmem_cache *watch_cachep __read_mostly;
-static struct kmem_cache *event_cachep __read_mostly;
+#include <asm/ioctls.h>
static struct vfsmount *inotify_mnt __read_mostly;
+/* this just sits here and wastes global memory. used to just pad userspace messages with zeros */
+static struct inotify_event nul_inotify_event;
+
/* these are configurable via /proc/sys/fs/inotify/ */
static int inotify_max_user_instances __read_mostly;
-static int inotify_max_user_watches __read_mostly;
static int inotify_max_queued_events __read_mostly;
-
-/*
- * Lock ordering:
- *
- * inotify_dev->up_mutex (ensures we don't re-add the same watch)
- * inode->inotify_mutex (protects inode's watch list)
- * inotify_handle->mutex (protects inotify_handle's watch list)
- * inotify_dev->ev_mutex (protects device's event queue)
- */
-
-/*
- * Lifetimes of the main data structures:
- *
- * inotify_device: Lifetime is managed by reference count, from
- * sys_inotify_init() until release. Additional references can bump the count
- * via get_inotify_dev() and drop the count via put_inotify_dev().
- *
- * inotify_user_watch: Lifetime is from create_watch() to the receipt of an
- * IN_IGNORED event from inotify, or when using IN_ONESHOT, to receipt of the
- * first event, or to inotify_destroy().
- */
-
-/*
- * struct inotify_device - represents an inotify instance
- *
- * This structure is protected by the mutex 'mutex'.
- */
-struct inotify_device {
- wait_queue_head_t wq; /* wait queue for i/o */
- struct mutex ev_mutex; /* protects event queue */
- struct mutex up_mutex; /* synchronizes watch updates */
- struct list_head events; /* list of queued events */
- struct user_struct *user; /* user who opened this dev */
- struct inotify_handle *ih; /* inotify handle */
- struct fasync_struct *fa; /* async notification */
- atomic_t count; /* reference count */
- unsigned int queue_size; /* size of the queue (bytes) */
- unsigned int event_count; /* number of pending events */
- unsigned int max_events; /* maximum number of events */
-};
-
-/*
- * struct inotify_kernel_event - An inotify event, originating from a watch and
- * queued for user-space. A list of these is attached to each instance of the
- * device. In read(), this list is walked and all events that can fit in the
- * buffer are returned.
- *
- * Protected by dev->ev_mutex of the device in which we are queued.
- */
-struct inotify_kernel_event {
- struct inotify_event event; /* the user-space event */
- struct list_head list; /* entry in inotify_device's list */
- char *name; /* filename, if any */
-};
-
-/*
- * struct inotify_user_watch - our version of an inotify_watch, we add
- * a reference to the associated inotify_device.
- */
-struct inotify_user_watch {
- struct inotify_device *dev; /* associated device */
- struct inotify_watch wdata; /* inotify watch data */
-};
+int inotify_max_user_watches __read_mostly;
#ifdef CONFIG_SYSCTL
@@ -149,280 +95,17 @@ ctl_table inotify_table[] = {
};
#endif /* CONFIG_SYSCTL */
-static inline void get_inotify_dev(struct inotify_device *dev)
-{
- atomic_inc(&dev->count);
-}
-
-static inline void put_inotify_dev(struct inotify_device *dev)
-{
- if (atomic_dec_and_test(&dev->count)) {
- atomic_dec(&dev->user->inotify_devs);
- free_uid(dev->user);
- kfree(dev);
- }
-}
-
-/*
- * free_inotify_user_watch - cleans up the watch and its references
- */
-static void free_inotify_user_watch(struct inotify_watch *w)
-{
- struct inotify_user_watch *watch;
- struct inotify_device *dev;
-
- watch = container_of(w, struct inotify_user_watch, wdata);
- dev = watch->dev;
-
- atomic_dec(&dev->user->inotify_watches);
- put_inotify_dev(dev);
- kmem_cache_free(watch_cachep, watch);
-}
-
-/*
- * kernel_event - create a new kernel event with the given parameters
- *
- * This function can sleep.
- */
-static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
- const char *name)
-{
- struct inotify_kernel_event *kevent;
-
- kevent = kmem_cache_alloc(event_cachep, GFP_NOFS);
- if (unlikely(!kevent))
- return NULL;
-
- /* we hand this out to user-space, so zero it just in case */
- memset(&kevent->event, 0, sizeof(struct inotify_event));
-
- kevent->event.wd = wd;
- kevent->event.mask = mask;
- kevent->event.cookie = cookie;
-
- INIT_LIST_HEAD(&kevent->list);
-
- if (name) {
- size_t len, rem, event_size = sizeof(struct inotify_event);
-
- /*
- * We need to pad the filename so as to properly align an
- * array of inotify_event structures. Because the structure is
- * small and the common case is a small filename, we just round
- * up to the next multiple of the structure's sizeof. This is
- * simple and safe for all architectures.
- */
- len = strlen(name) + 1;
- rem = event_size - len;
- if (len > event_size) {
- rem = event_size - (len % event_size);
- if (len % event_size == 0)
- rem = 0;
- }
-
- kevent->name = kmalloc(len + rem, GFP_KERNEL);
- if (unlikely(!kevent->name)) {
- kmem_cache_free(event_cachep, kevent);
- return NULL;
- }
- memcpy(kevent->name, name, len);
- if (rem)
- memset(kevent->name + len, 0, rem);
- kevent->event.len = len + rem;
- } else {
- kevent->event.len = 0;
- kevent->name = NULL;
- }
-
- return kevent;
-}
-
-/*
- * inotify_dev_get_event - return the next event in the given dev's queue
- *
- * Caller must hold dev->ev_mutex.
- */
-static inline struct inotify_kernel_event *
-inotify_dev_get_event(struct inotify_device *dev)
-{
- return list_entry(dev->events.next, struct inotify_kernel_event, list);
-}
-
-/*
- * inotify_dev_get_last_event - return the last event in the given dev's queue
- *
- * Caller must hold dev->ev_mutex.
- */
-static inline struct inotify_kernel_event *
-inotify_dev_get_last_event(struct inotify_device *dev)
-{
- if (list_empty(&dev->events))
- return NULL;
- return list_entry(dev->events.prev, struct inotify_kernel_event, list);
-}
-
-/*
- * inotify_dev_queue_event - event handler registered with core inotify, adds
- * a new event to the given device
- *
- * Can sleep (calls kernel_event()).
- */
-static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
- u32 cookie, const char *name,
- struct inode *ignored)
-{
- struct inotify_user_watch *watch;
- struct inotify_device *dev;
- struct inotify_kernel_event *kevent, *last;
-
- watch = container_of(w, struct inotify_user_watch, wdata);
- dev = watch->dev;
-
- mutex_lock(&dev->ev_mutex);
-
- /* we can safely put the watch as we don't reference it while
- * generating the event
- */
- if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
- put_inotify_watch(w); /* final put */
-
- /* coalescing: drop this event if it is a dupe of the previous */
- last = inotify_dev_get_last_event(dev);
- if (last && last->event.mask == mask && last->event.wd == wd &&
- last->event.cookie == cookie) {
- const char *lastname = last->name;
-
- if (!name && !lastname)
- goto out;
- if (name && lastname && !strcmp(lastname, name))
- goto out;
- }
-
- /* the queue overflowed and we already sent the Q_OVERFLOW event */
- if (unlikely(dev->event_count > dev->max_events))
- goto out;
-
- /* if the queue overflows, we need to notify user space */
- if (unlikely(dev->event_count == dev->max_events))
- kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
- else
- kevent = kernel_event(wd, mask, cookie, name);
-
- if (unlikely(!kevent))
- goto out;
-
- /* queue the event and wake up anyone waiting */
- dev->event_count++;
- dev->queue_size += sizeof(struct inotify_event) + kevent->event.len;
- list_add_tail(&kevent->list, &dev->events);
- wake_up_interruptible(&dev->wq);
- kill_fasync(&dev->fa, SIGIO, POLL_IN);
-
-out:
- mutex_unlock(&dev->ev_mutex);
-}
-
-/*
- * remove_kevent - cleans up the given kevent
- *
- * Caller must hold dev->ev_mutex.
- */
-static void remove_kevent(struct inotify_device *dev,
- struct inotify_kernel_event *kevent)
-{
- list_del(&kevent->list);
-
- dev->event_count--;
- dev->queue_size -= sizeof(struct inotify_event) + kevent->event.len;
-}
-
-/*
- * free_kevent - frees the given kevent.
- */
-static void free_kevent(struct inotify_kernel_event *kevent)
-{
- kfree(kevent->name);
- kmem_cache_free(event_cachep, kevent);
-}
-
-/*
- * inotify_dev_event_dequeue - destroy an event on the given device
- *
- * Caller must hold dev->ev_mutex.
- */
-static void inotify_dev_event_dequeue(struct inotify_device *dev)
-{
- if (!list_empty(&dev->events)) {
- struct inotify_kernel_event *kevent;
- kevent = inotify_dev_get_event(dev);
- remove_kevent(dev, kevent);
- free_kevent(kevent);
- }
-}
-
-/*
- * find_inode - resolve a user-given path to a specific inode
- */
-static int find_inode(const char __user *dirname, struct path *path,
- unsigned flags)
-{
- int error;
-
- error = user_path_at(AT_FDCWD, dirname, flags, path);
- if (error)
- return error;
- /* you can only watch an inode if you have read permissions on it */
- error = inode_permission(path->dentry->d_inode, MAY_READ);
- if (error)
- path_put(path);
- return error;
-}
-
-/*
- * create_watch - creates a watch on the given device.
- *
- * Callers must hold dev->up_mutex.
- */
-static int create_watch(struct inotify_device *dev, struct inode *inode,
- u32 mask)
-{
- struct inotify_user_watch *watch;
- int ret;
-
- if (atomic_read(&dev->user->inotify_watches) >=
- inotify_max_user_watches)
- return -ENOSPC;
-
- watch = kmem_cache_alloc(watch_cachep, GFP_KERNEL);
- if (unlikely(!watch))
- return -ENOMEM;
-
- /* save a reference to device and bump the count to make it official */
- get_inotify_dev(dev);
- watch->dev = dev;
-
- atomic_inc(&dev->user->inotify_watches);
-
- inotify_init_watch(&watch->wdata);
- ret = inotify_add_watch(dev->ih, &watch->wdata, inode, mask);
- if (ret < 0)
- free_inotify_user_watch(&watch->wdata);
-
- return ret;
-}
-
-/* Device Interface */
-
+/* inotify userspace file descriptor functions */
static unsigned int inotify_poll(struct file *file, poll_table *wait)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
int ret = 0;
- poll_wait(file, &dev->wq, wait);
- mutex_lock(&dev->ev_mutex);
- if (!list_empty(&dev->events))
+ poll_wait(file, &group->notification_waitq, wait);
+ mutex_lock(&group->notification_mutex);
+ if (fsnotify_check_notif_queue(group))
ret = POLLIN | POLLRDNORM;
- mutex_unlock(&dev->ev_mutex);
+ mutex_unlock(&group->notification_mutex);
return ret;
}
@@ -432,26 +115,29 @@ static unsigned int inotify_poll(struct file *file, poll_table *wait)
* enough to fit in "count". Return an error pointer if
* not large enough.
*
- * Called with the device ev_mutex held.
+ * Called with the group->notification_mutex held.
*/
-static struct inotify_kernel_event *get_one_event(struct inotify_device *dev,
- size_t count)
+static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
+ size_t count)
{
size_t event_size = sizeof(struct inotify_event);
- struct inotify_kernel_event *kevent;
+ struct fsnotify_event *event;
- if (list_empty(&dev->events))
+ if (!fsnotify_check_notif_queue(group))
return NULL;
- kevent = inotify_dev_get_event(dev);
- if (kevent->name)
- event_size += kevent->event.len;
+ event = fsnotify_peek_notif_event(group);
+
+ event_size += roundup(event->name_len, event_size);
if (event_size > count)
return ERR_PTR(-EINVAL);
- remove_kevent(dev, kevent);
- return kevent;
+ /* held the notification_mutex the whole time, so this is the
+ * same event we peeked above */
+ fsnotify_remove_notif_event(group);
+
+ return event;
}
/*
@@ -460,51 +146,82 @@ static struct inotify_kernel_event *get_one_event(struct inotify_device *dev,
* We already checked that the event size is smaller than the
* buffer we had in "get_one_event()" above.
*/
-static ssize_t copy_event_to_user(struct inotify_kernel_event *kevent,
+static ssize_t copy_event_to_user(struct fsnotify_group *group,
+ struct fsnotify_event *event,
char __user *buf)
{
+ struct inotify_event inotify_event;
+ struct inotify_event_private_data *priv;
size_t event_size = sizeof(struct inotify_event);
+ size_t name_len;
+
+ /* we get the inotify watch descriptor from the event private data */
+ spin_lock(&event->lock);
+ priv = (struct inotify_event_private_data *)fsnotify_get_priv_from_event(group, event);
+ inotify_event.wd = priv->wd;
+ __inotify_free_event_priv(priv);
+ spin_unlock(&event->lock);
+
+ /* round up event->name_len so it is a multiple of event_size */
+ name_len = roundup(event->name_len, event_size);
+ inotify_event.len = name_len;
+
+ inotify_event.mask = inotify_mask_to_arg(event->mask);
+ inotify_event.cookie = event->sync_cookie;
- if (copy_to_user(buf, &kevent->event, event_size))
+ /* send the main event */
+ if (copy_to_user(buf, &inotify_event, event_size))
return -EFAULT;
- if (kevent->name) {
- buf += event_size;
+ buf += event_size;
- if (copy_to_user(buf, kevent->name, kevent->event.len))
+ /*
+ * fsnotify only stores the pathname, so here we have to send the pathname
+ * and then pad that pathname out to a multiple of sizeof(inotify_event)
+ * with zeros. I get my zeros from the nul_inotify_event.
+ */
+ if (name_len) {
+ unsigned int len_to_zero = name_len - event->name_len;
+ /* copy the path name */
+ if (copy_to_user(buf, event->file_name, event->name_len))
return -EFAULT;
+ buf += event->name_len;
- event_size += kevent->event.len;
+ /* fill userspace with 0's from nul_inotify_event */
+ if (copy_to_user(buf, &nul_inotify_event, len_to_zero))
+ return -EFAULT;
+ buf += len_to_zero;
+ event_size += name_len;
}
+
return event_size;
}
static ssize_t inotify_read(struct file *file, char __user *buf,
size_t count, loff_t *pos)
{
- struct inotify_device *dev;
+ struct fsnotify_group *group;
+ struct fsnotify_event *kevent;
char __user *start;
int ret;
DEFINE_WAIT(wait);
start = buf;
- dev = file->private_data;
+ group = file->private_data;
while (1) {
- struct inotify_kernel_event *kevent;
+ prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
- prepare_to_wait(&dev->wq, &wait, TASK_INTERRUPTIBLE);
-
- mutex_lock(&dev->ev_mutex);
- kevent = get_one_event(dev, count);
- mutex_unlock(&dev->ev_mutex);
+ mutex_lock(&group->notification_mutex);
+ kevent = get_one_event(group, count);
+ mutex_unlock(&group->notification_mutex);
if (kevent) {
ret = PTR_ERR(kevent);
if (IS_ERR(kevent))
break;
- ret = copy_event_to_user(kevent, buf);
- free_kevent(kevent);
+ ret = copy_event_to_user(group, kevent, buf);
+ fsnotify_put_event(kevent);
if (ret < 0)
break;
buf += ret;
@@ -525,7 +242,7 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
schedule();
}
- finish_wait(&dev->wq, &wait);
+ finish_wait(&group->notification_waitq, &wait);
if (start != buf && ret != -EFAULT)
ret = buf - start;
return ret;
@@ -533,25 +250,35 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
static int inotify_fasync(int fd, struct file *file, int on)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
- return fasync_helper(fd, file, on, &dev->fa) >= 0 ? 0 : -EIO;
+ return fasync_helper(fd, file, on, &group->inotify_data.fa) >= 0 ? 0 : -EIO;
}
static int inotify_release(struct inode *ignored, struct file *file)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
+ struct fsnotify_mark_entry *entry;
- inotify_destroy(dev->ih);
+ /* run all the entries, remove them from the idr, and drop that ref */
+ spin_lock(&group->mark_lock);
+ while (!list_empty(&group->mark_entries)) {
+ entry = list_first_entry(&group->mark_entries, struct fsnotify_mark_entry, g_list);
- /* destroy all of the events on this device */
- mutex_lock(&dev->ev_mutex);
- while (!list_empty(&dev->events))
- inotify_dev_event_dequeue(dev);
- mutex_unlock(&dev->ev_mutex);
+ /* make sure entry can't get freed */
+ fsnotify_get_mark(entry);
+ spin_unlock(&group->mark_lock);
- /* free this device: the put matching the get in inotify_init() */
- put_inotify_dev(dev);
+ inotify_destroy_mark_entry(entry);
+
+ /* ok, free it */
+ fsnotify_put_mark(entry);
+ spin_lock(&group->mark_lock);
+ }
+ spin_unlock(&group->mark_lock);
+
+ /* free this group, matching get was inotify_init->fsnotify_obtain_group */
+ fsnotify_put_group(group);
return 0;
}
@@ -559,16 +286,25 @@ static int inotify_release(struct inode *ignored, struct file *file)
static long inotify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
- struct inotify_device *dev;
+ struct fsnotify_group *group;
+ struct fsnotify_event_holder *holder;
+ struct fsnotify_event *event;
void __user *p;
int ret = -ENOTTY;
+ size_t send_len = 0;
- dev = file->private_data;
+ group = file->private_data;
p = (void __user *) arg;
switch (cmd) {
case FIONREAD:
- ret = put_user(dev->queue_size, (int __user *) p);
+ mutex_lock(&group->notification_mutex);
+ list_for_each_entry(holder, &group->notification_list, event_list) {
+ event = holder->event;
+ send_len += sizeof(struct inotify_event) + event->name_len;
+ }
+ mutex_unlock(&group->notification_mutex);
+ ret = put_user(send_len, (int __user *) p);
break;
}
@@ -576,23 +312,18 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
}
static const struct file_operations inotify_fops = {
- .poll = inotify_poll,
- .read = inotify_read,
- .fasync = inotify_fasync,
- .release = inotify_release,
- .unlocked_ioctl = inotify_ioctl,
+ .poll = inotify_poll,
+ .read = inotify_read,
+ .fasync = inotify_fasync,
+ .release = inotify_release,
+ .unlocked_ioctl = inotify_ioctl,
.compat_ioctl = inotify_ioctl,
};
-static const struct inotify_operations inotify_user_ops = {
- .handle_event = inotify_dev_queue_event,
- .destroy_watch = free_inotify_user_watch,
-};
-
+/* inotify syscalls */
SYSCALL_DEFINE1(inotify_init1, int, flags)
{
- struct inotify_device *dev;
- struct inotify_handle *ih;
+ struct fsnotify_group *group;
struct user_struct *user;
struct file *filp;
int fd, ret;
@@ -621,45 +352,27 @@ SYSCALL_DEFINE1(inotify_init1, int, flags)
goto out_free_uid;
}
- dev = kmalloc(sizeof(struct inotify_device), GFP_KERNEL);
- if (unlikely(!dev)) {
- ret = -ENOMEM;
+ /* fsnotify_obtain_group took a reference to group, we put this when we kill the file in the end */
+ group = inotify_new_group(user, inotify_max_queued_events);
+ if (IS_ERR(group)) {
+ ret = PTR_ERR(group);
goto out_free_uid;
}
- ih = inotify_init(&inotify_user_ops);
- if (IS_ERR(ih)) {
- ret = PTR_ERR(ih);
- goto out_free_dev;
- }
- dev->ih = ih;
- dev->fa = NULL;
-
filp->f_op = &inotify_fops;
filp->f_path.mnt = mntget(inotify_mnt);
filp->f_path.dentry = dget(inotify_mnt->mnt_root);
filp->f_mapping = filp->f_path.dentry->d_inode->i_mapping;
filp->f_mode = FMODE_READ;
filp->f_flags = O_RDONLY | (flags & O_NONBLOCK);
- filp->private_data = dev;
-
- INIT_LIST_HEAD(&dev->events);
- init_waitqueue_head(&dev->wq);
- mutex_init(&dev->ev_mutex);
- mutex_init(&dev->up_mutex);
- dev->event_count = 0;
- dev->queue_size = 0;
- dev->max_events = inotify_max_queued_events;
- dev->user = user;
- atomic_set(&dev->count, 0);
-
- get_inotify_dev(dev);
+ filp->private_data = group;
+
atomic_inc(&user->inotify_devs);
+
fd_install(fd, filp);
return fd;
-out_free_dev:
- kfree(dev);
+
out_free_uid:
free_uid(user);
put_filp(filp);
@@ -676,8 +389,8 @@ SYSCALL_DEFINE0(inotify_init)
SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
u32, mask)
{
+ struct fsnotify_group *group;
struct inode *inode;
- struct inotify_device *dev;
struct path path;
struct file *filp;
int ret, fput_needed;
@@ -699,19 +412,19 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
flags |= LOOKUP_DIRECTORY;
ret = find_inode(pathname, &path, flags);
- if (unlikely(ret))
+ if (ret)
goto fput_and_out;
- /* inode held in place by reference to path; dev by fget on fd */
+ /* inode held in place by reference to path; group by fget on fd */
inode = path.dentry->d_inode;
- dev = filp->private_data;
+ group = filp->private_data;
- mutex_lock(&dev->up_mutex);
- ret = inotify_find_update_watch(dev->ih, inode, mask);
- if (ret == -ENOENT)
- ret = create_watch(dev, inode, mask);
- mutex_unlock(&dev->up_mutex);
+ /* create/update an inode mark */
+ ret = inotify_update_watch(group, inode, mask);
+ if (unlikely(ret))
+ goto path_put_and_out;
+path_put_and_out:
path_put(&path);
fput_and_out:
fput_light(filp, fput_needed);
@@ -720,9 +433,10 @@ fput_and_out:
SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
{
+ struct fsnotify_group *group;
+ struct fsnotify_mark_entry *entry;
struct file *filp;
- struct inotify_device *dev;
- int ret, fput_needed;
+ int ret = 0, fput_needed;
filp = fget_light(fd, &fput_needed);
if (unlikely(!filp))
@@ -734,10 +448,20 @@ SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
goto out;
}
- dev = filp->private_data;
+ group = filp->private_data;
- /* we free our watch data when we get IN_IGNORED */
- ret = inotify_rm_wd(dev->ih, wd);
+ spin_lock(&group->inotify_data.idr_lock);
+ entry = idr_find(&group->inotify_data.idr, wd);
+ if (unlikely(!entry)) {
+ spin_unlock(&group->inotify_data.idr_lock);
+ ret = -EINVAL;
+ goto out;
+ }
+ fsnotify_get_mark(entry);
+ spin_unlock(&group->inotify_data.idr_lock);
+
+ inotify_destroy_mark_entry(entry);
+ fsnotify_put_mark(entry);
out:
fput_light(filp, fput_needed);
@@ -753,9 +477,9 @@ inotify_get_sb(struct file_system_type *fs_type, int flags,
}
static struct file_system_type inotify_fs_type = {
- .name = "inotifyfs",
- .get_sb = inotify_get_sb,
- .kill_sb = kill_anon_super,
+ .name = "inotifyfs",
+ .get_sb = inotify_get_sb,
+ .kill_sb = kill_anon_super,
};
/*
@@ -779,14 +503,6 @@ static int __init inotify_user_setup(void)
inotify_max_user_instances = 128;
inotify_max_user_watches = 8192;
- watch_cachep = kmem_cache_create("inotify_watch_cache",
- sizeof(struct inotify_user_watch),
- 0, SLAB_PANIC, NULL);
- event_cachep = kmem_cache_create("inotify_event_cache",
- sizeof(struct inotify_kernel_event),
- 0, SLAB_PANIC, NULL);
-
return 0;
}
-
module_init(inotify_user_setup);
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index a636a1f..14af58d 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -123,7 +123,7 @@ static inline int event_compare(struct fsnotify_event *old, struct fsnotify_even
*/
int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event *event, struct fsnotify_event_private_data *priv)
{
- struct fsnotify_event_holder *holder;
+ struct fsnotify_event_holder *holder = NULL;
struct list_head *list = &group->notification_list;
struct fsnotify_event_holder *last_holder;
struct fsnotify_event *last_event;
@@ -141,21 +141,35 @@ int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event
* the holder spinlock. If we see it blank we know that once we
* get that lock the in event holder will be ok for us to (re)use.
*/
- if (list_empty(&event->holder.event_list))
- holder = (struct fsnotify_event_holder *)event;
- else
+ if (!list_empty(&event->holder.event_list)) {
holder = alloc_event_holder();
-
- if (!holder)
- return -ENOMEM;
+ if (!holder)
+ return -ENOMEM;
+ }
mutex_lock(&group->notification_mutex);
- if (group->q_len >= group->max_events)
+ if (group->q_len >= group->max_events)
event = &q_overflow_event;
spin_lock(&event->lock);
+ if (list_empty(&event->holder.event_list)) {
+ if (unlikely(holder))
+ fsnotify_destroy_event_holder(holder);
+ holder = &event->holder;
+ } else if (unlikely(!holder)) {
+ /* this only happens if we had room in the original event's in-event
+ * holder but we switched to the overflow event and its in-event
+ * holder was in use */
+ holder = alloc_event_holder();
+ if (!holder) {
+ spin_unlock(&event->lock);
+ mutex_unlock(&group->notification_mutex);
+ return -ENOMEM;
+ }
+ }
+
if (!list_empty(list)) {
last_holder = list_entry(list->prev, struct fsnotify_event_holder, event_list);
last_event = last_holder->event;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index bdbc897..e60c943 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -265,10 +265,10 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
static inline void fsnotify_access(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_ACCESS;
+ __u64 mask = FS_ACCESS;
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -283,10 +283,10 @@ static inline void fsnotify_access(struct dentry *dentry)
static inline void fsnotify_modify(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_MODIFY;
+ __u64 mask = FS_MODIFY;
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -301,10 +301,10 @@ static inline void fsnotify_modify(struct dentry *dentry)
static inline void fsnotify_open(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_OPEN;
+ __u64 mask = FS_OPEN;
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -320,14 +320,13 @@ static inline void fsnotify_close(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- const char *name = dentry->d_name.name;
fmode_t mode = file->f_mode;
- __u64 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;
+ __u64 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
- inotify_dentry_parent_queue_event(dentry, mask, 0, name);
+ inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
fsnotify_parent(dentry, mask);
@@ -340,10 +339,10 @@ static inline void fsnotify_close(struct file *file)
static inline void fsnotify_xattr(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_ATTRIB;
+ __u64 mask = FS_ATTRIB;
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -362,26 +361,26 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
__u64 mask = 0;
if (ia_valid & ATTR_UID)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
if (ia_valid & ATTR_GID)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
if (ia_valid & ATTR_SIZE)
- mask |= IN_MODIFY;
+ mask |= FS_MODIFY;
/* both times implies a utime(s) call */
if ((ia_valid & (ATTR_ATIME | ATTR_MTIME)) == (ATTR_ATIME | ATTR_MTIME))
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
else if (ia_valid & ATTR_ATIME)
- mask |= IN_ACCESS;
+ mask |= FS_ACCESS;
else if (ia_valid & ATTR_MTIME)
- mask |= IN_MODIFY;
+ mask |= FS_MODIFY;
if (ia_valid & ATTR_MODE)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
if (mask) {
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, mask, 0,
dentry->d_name.name);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 203842c..0070cc3 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -9,6 +9,7 @@
#ifdef __KERNEL__
+#include <linux/idr.h> /* inotify uses this */
#include <linux/fs.h> /* struct inode */
#include <linux/list.h>
#include <linux/path.h> /* struct path */
@@ -80,6 +81,7 @@
/* listeners that hard code group numbers near the top */
#define DNOTIFY_GROUP_NUM UINT_MAX
+#define INOTIFY_GROUP_NUM (DNOTIFY_GROUP_NUM-1)
struct fsnotify_group;
struct fsnotify_event;
@@ -118,6 +120,15 @@ struct fsnotify_group {
/* groups can define private fields here */
union {
+#ifdef CONFIG_INOTIFY_USER
+ struct inotify_group_private_data {
+ spinlock_t idr_lock;
+ struct idr idr;
+ u32 last_wd;
+ struct fasync_struct *fa; /* async notification */
+ struct user_struct *user;
+ } inotify_data;
+#endif
};
};
* [PATCH -v1 00/11] fsnotify: unified filesystem notification backend
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
` (9 preceding siblings ...)
2009-02-09 21:16 ` [PATCH -v1 11/11] inotify: reimplement inotify using fsnotify Eric Paris
@ 2009-02-09 21:28 ` Eric Paris
10 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-09 21:28 UTC (permalink / raw)
To: linux-kernel; +Cc: sfr, rlove, malware-list, hch, viro, akpm, john, alan
Whoops, forgot the cover note. The following is an implementation of
fsnotify, a novel unified filesystem notification backend. This patch
set implements the backend and converts both dnotify and inotify to it.
The series also reduces the size of struct inode when both dnotify and
inotify are in use.
I have patches which move the inotify-based audit functionality over to
fsnotify, and patches which implement fanotify, my new on-access
notification and access-decision mechanism, on this backend; but to keep
the series small I'm only sending the fsnotify, dnotify and inotify changes.
I would love the most careful and scrutinizing eyes on patches 3, 9, and 10,
which implement the changes to inodes and introduce by far the most
complex locking, refcounting, and object lifetimes.
dnotify changes are in patch 5
inotify changes are in patch 11
Documentation/filesystems/fsnotify.txt | 192 ++++++++++
MAINTAINERS | 4
fs/inode.c | 10
fs/notify/Kconfig | 13
fs/notify/Makefile | 2
fs/notify/dnotify/Kconfig | 1
fs/notify/dnotify/dnotify.c | 449 +++++++++++++++++++-----
fs/notify/fsnotify.c | 91 +++++
fs/notify/fsnotify.h | 28 +
fs/notify/group.c | 207 +++++++++++
fs/notify/inode_mark.c | 330 ++++++++++++++++++
fs/notify/inotify/Kconfig | 20 -
fs/notify/inotify/Makefile | 2
fs/notify/inotify/inotify.h | 108 +++++
fs/notify/inotify/inotify_fsnotify.c | 176 +++++++++
fs/notify/inotify/inotify_kernel.c | 236 +++++++++++++
fs/notify/inotify/inotify_user.c | 598 ++++++++-------------------------
fs/notify/notification.c | 348 +++++++++++++++++++
include/linux/dcache.h | 3
include/linux/dnotify.h | 29 -
include/linux/fs.h | 6
include/linux/fsnotify.h | 255 ++++++++++----
include/linux/fsnotify_backend.h | 265 ++++++++++++++
23 files changed, 2735 insertions(+), 638 deletions(-)
* Re: [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings
2009-02-09 21:15 ` [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings Eric Paris
@ 2009-02-12 21:57 ` Andrew Morton
2009-02-17 17:26 ` Eric Paris
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2009-02-12 21:57 UTC (permalink / raw)
To: Eric Paris; +Cc: linux-kernel, viro, hch, alan, sfr, john, rlove, malware-list
<picks a patch at random>
I don't have a good sense of what value all of this work brings to
Linux. Why should I get excited about this? Why is it all worth the
time/effort/risk/etc which would be involved in getting it integrated?
All rather unclear, but very important!
On Mon, 09 Feb 2009 16:15:37 -0500
Eric Paris <eparis@redhat.com> wrote:
> This patch creates in inode fsnotify markings.
I often come to a patch intending to review it, run into trouble and I
instead end up reviewing the code's _reviewability_, rather than
reviewing the code itself.
IOW, I often find the code to be too hard to understand. It's not that
I _couldn't_ understand it, but I feel that the code requires too much
time and work to understand, and that this effort could be reduced were
the originator to go back and make the code more approachable.
I figure that if we do this, we end up with code which is more
maintainable. Plus the review ends up being much more effective.
This patch is one of those patches.
> dnotify will make use of in
> inode markings to mark which inodes it wishes to send events for. fanotify
> will use this to mark which inodes it does not wish to send events for.
>
Please use checkpatch? The patchset introduces numerous trivial layout
glitches which you did not intend to add.
>
> ...
>
> --- /dev/null
> +++ b/Documentation/filesystems/fsnotify.txt
> @@ -0,0 +1,192 @@
> +fsnotify inode mark locking/lifetime/and refcnting
> +
> +struct fsnotify_mark_entry {
> + __u64 mask; /* mask this mark entry is for */
> + atomic_t refcnt; /* active things looking at this mark */
> + int freeme; /* free when this is set and refcnt hits 0 */
> + struct inode *inode; /* inode this entry is associated with */
> + struct fsnotify_group *group; /* group this mark entry is for */
> + struct list_head i_list; /* list of mark_entries by inode->i_fsnotify_mark_entries */
> + struct list_head g_list; /* list of mark_entries by group->i_fsnotify_mark_entries */
> + spinlock_t lock; /* protect group, inode, and killme */
s/killme/freeme/?
> + struct list_head free_i_list; /* tmp list used when freeing this mark */
> + struct list_head free_g_list; /* tmp list used when freeing this mark */
> + void (*free_private)(struct fsnotify_mark_entry *entry); /* called on final put+free */
> +};
Why does `freeme' exist at all? Normally the refcount is sufficient.
What is different here?
> +REFCNT:
> +The mark->refcnt tells how many "things" in the kernel currently are
> +referencing this object. The object typically will live inside the kernel
> +with a refcnt of 0. Any task that finds the fsnotify_mark_entry and either
> +already has a reference or holds the inode->i_lock or group->mark_lock can
> +take a reference. The mark will survive until the reference hits 0 and the
> +object has been marked for death (freeme)
See, we'd more typically "mark it for death" by decrementing its
refcount at that time. If that made it go to zero then the object gets
destroyed immediately.
> +LOCKING:
> +There are 3 spinlocks involved with fsnotify inode marks and they MUST
> +be taken in order as follows:
> +
> +entry->lock
> +group->mark_lock
> +inode->i_lock
> +
> +entry->lock protects 3 things, entry->group, entry->inode, and entry->freeme.
> +
> +group->mark_lock protects the mark_entries list anchored inside a given group
> +and each entry is hooked via the g_list. It also sorta protects the
> +free_g_list, which when used is anchored by a private list on the stack of the
> +task which held the group->mark_lock.
> +
> +inode->i_lock protects the i_fsnotify_mark_entries list anchored inside a
> +given inode and each entry is hooked via the i_list. (and sorta the
> +free_i_list)
> +
> +
> +LIFETIME:
> +Inode marks survive between when they are added to an inode and when their
> +refcnt==0 and freeme has been set. fsnotify_mark_entries normally live in the
> +kernel with a refcnt==0 and only have positive refcnt when "something" is
> +actively referencing its contents. freeme is set when the fsnotify_mark is no
> +longer on either an inode or a group list. If it is off both lists we
> +know the mark is unreachable by anything else and when the refcnt hits 0 we
> +can free.
Incrementing the refcount by 1 for each time an object is present on a
list is a fairly common technique.
> +The inode mark can be cleared for a number of different reasons including:
> +The inode is unlinked for the last time. (fsnotify_inoderemove)
> +The inode is being evicted from cache. (fsnotify_inode_delete)
> +The fs the inode is on is unmounted. (fsnotify_inode_delete/fsnotify_unmount_inodes)
> +Something explicitly requests that it be removed. (fsnotify_destroy_mark_by_entry)
> +The fsnotify_group associated with the mark is going away and all such marks
> +need to be cleaned up. (fsnotify_clear_marks_by_group)
> +
> +Worst case we are given an inode and need to clean up all the marks on that
> +inode. We take i_lock and walk the i_fsnotify_mark_entries safely. While
> +walking the list we list_del_init the i_list, take a reference, and using
> +free_i_list hook this into a private list we anchor on the stack. At this
> +point we can safely drop the i_lock and walk the private list we anchored on
> +the stack (remember we hold a reference to everything on this list.) For each
> +entry we take the entry->lock and check if the ->group and ->inode pointers
> +are set. If they are, we take their respective locks and now hold all the
> +interesting locks. We can safely list_del_init the i_list and g_list and can
> +set ->inode and ->group to NULL. Once off both lists and everything is surely
> +NULL we set freeme for cleanup.
> +
> +Very similarly for freeing by group, except we use free_g_list.
> +
> +This has the very interesting property of being able to run concurrently with
> +any (or all) other directions. Lets walk through what happens with 4 things
> +trying to simultaneously mark this entry for destruction.
> +
> +A finds this event by some means and takes a reference. (this could be any
> +means including in the case of inotify through an idr, which is known to be
> +safe since the idr entry itself holds a reference)
> +B finds this event by some meand and takes a reference.
"means"
> +
> +At this point.
> + refcnt == 2
> + i_list -> inode
> + inode -> inode
> + g_list -> group
> + group -> group
> + free_i_list -> NUL
> + free_g_list -> NUL
> + freeme = 0
> +
> +C comes in and tries to free all of the fsnotify_mark attached to an inode.
> +---- C will take the i_lock and walk the i_fsnotify_mark entries list calling
> + list_del_init() on i_list, adding the entry to its private list via
> + free_i_list, and taking a reference. C releases the i_lock. Start
> + walking the private list and block on the entry->lock (held by A
> + below)
> +
> +At this point.
> + refcnt == 3
> + i_list -> NUL
> + inode -> inode
> + g_list -> group
> + group -> group
> + free_i_list -> private list on stack
> + free_g_list -> NUL
> + freeme = 0
> +
> +refcnt == 3. The event is not on the i_list.
> +D comes in and tries to free all of the marks attached to the same inode.
> +---- D will take the i_lock and won't find this entry on the list and does
> + nothing. (this is the end of D)
> +
> +E comes along and wants to free all of the marks in the group.
> +---- E takes the group->mark_lock and walks group->mark_entries. It grabs a
> +	reference to the mark, list_del_init's the g_list, and adds the mark
> +	to the free_g_list. Release the group->mark_lock. Now start walking
> +	the new private list and block on entry->lock.
> +
> +At this point.
> + refcnt == 4
> + i_list -> NUL
> + inode -> inode
> + g_list -> NUL
> + group -> group
> + free_i_list -> private list on stack
> + free_g_list -> private list on another stack
> + freeme = 0
> +
> +A finally decides it wants to kill this entry for some reason.
> +---- A will take the entry->lock. It will check if mark->group is non-NULL
> +	and if so takes mark->group->mark_lock (it may have blocked here on E
> + above). Check the ->inode and if set take mark->inode->i_lock (again
> + we may have been blocking on C). We now own all the locks. So
> + list_del_init on i_list and g_list. set ->inode and ->group = NULL
> + and set freeme = 1. Unlock i_lock, mark_lock, and entry->lock. Drop
> + reference. (this is the end of A)
> +
> +At this point.
> + refcnt == 3
> + i_list -> NUL
> + inode -> NULL
> + g_list -> NUL
> + group -> NULL
> + free_i_list -> private list on stack
> + free_g_list -> private list on another stack
> + freeme = 1
> +
> +E happens to be the one to win the entry->lock.
> +---- E sees that ->inode and ->group are NULL so it just doesn't bother to
> +	grab those locks (if they are NULL we know the entry is off the
> +	relevant lists). E now pretends it has all the locks (even though it
> +	only has mark->lock) so it calls list_del_init on i_list and g_list,
> +	sets ->inode and ->group to NULL (they already were), and sets
> +	freeme = 1 (it already was). E unlocks all the locks it holds and
> +	drops its reference.
> +
> +At this point.
> + refcnt == 2
> + i_list -> NUL
> + inode -> NULL
> + g_list -> NUL
> + group -> NULL
> + free_i_list -> private list on stack
> + free_g_list -> undefined
> + freeme = 1
> +
> +C does the same thing as E and the mark looks like:
> +
> +At this point.
> + refcnt == 1
> + i_list -> NUL
> + inode -> NULL
> + g_list -> NUL
> + group -> NULL
> + free_i_list -> undefined
> + free_g_list -> undefined
> + freeme = 1
Holy cow.
Does the benefit of this new subsystem really justify the new
complexity and risk?
> +B is the only thing left with a reference and it will do something similar to
> +A, only it will only be holding mark->lock (it won't take the inode or group
> +locks since they are NULL in the entry). It will also clear everything,
> +set freeme, unlock, and drop its reference via fsnotify_put_mark.
> +
> +fsnotify_put_mark calls atomic_dec_and_lock(&mark->refcnt, &mark->lock) and
> +will then test if freeme is set. It is set. That means we are the last thing
> +inside the kernel with a pointer to this fsnotify_mark_entry and it should be
> +destroyed. fsnotify_put_mark proceeds to clean up any private data and frees
> +things up.
> diff --git a/fs/inode.c b/fs/inode.c
> index 40e37c0..f6a51a1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -22,6 +22,7 @@
> #include <linux/cdev.h>
> #include <linux/bootmem.h>
> #include <linux/inotify.h>
> +#include <linux/fsnotify.h>
> #include <linux/mount.h>
> #include <linux/async.h>
>
> @@ -189,6 +190,10 @@ struct inode *inode_init_always(struct super_block *sb, struct inode *inode)
> inode->i_private = NULL;
> inode->i_mapping = mapping;
>
> +#ifdef CONFIG_FSNOTIFY
> + inode->i_fsnotify_mask = 0;
> +#endif
That is the nineteenth open-coded instance of
inode->foo = 0;
in this function.
Maybe it's time to just memset the whole thing? We were going to do
that a while back but for some reason it didn't happen.
> return inode;
>
> out_free_security:
> @@ -220,6 +225,7 @@ void destroy_inode(struct inode *inode)
> {
> BUG_ON(inode_has_buffers(inode));
> security_inode_free(inode);
> + fsnotify_inode_delete(inode);
> if (inode->i_sb->s_op->destroy_inode)
> inode->i_sb->s_op->destroy_inode(inode);
> else
> @@ -251,6 +257,9 @@ void inode_init_once(struct inode *inode)
> INIT_LIST_HEAD(&inode->inotify_watches);
> mutex_init(&inode->inotify_mutex);
> #endif
> +#ifdef CONFIG_FSNOTIFY
> + INIT_LIST_HEAD(&inode->i_fsnotify_mark_entries);
> +#endif
> }
>
> EXPORT_SYMBOL(inode_init_once);
> diff --git a/fs/notify/Makefile b/fs/notify/Makefile
> index 7cb285a..47b60f3 100644
> --- a/fs/notify/Makefile
> +++ b/fs/notify/Makefile
> @@ -1,4 +1,4 @@
> obj-y += dnotify/
> obj-y += inotify/
>
> -obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o
> +obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o inode_mark.o
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index b51f90a..e7e53f7 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -25,6 +25,15 @@
> #include <linux/fsnotify_backend.h>
> #include "fsnotify.h"
>
> +void __fsnotify_inode_delete(struct inode *inode, int flag)
> +{
> + if (list_empty(&fsnotify_groups))
> + return;
> +
> + fsnotify_clear_marks_by_inode(inode, flag);
> +}
> +EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);
What a strange function.
- it tests some global list
- it has no locking for the access to that list
- it is wholly unobvious why this is safe - suppose that someone
added something to fsnotify_groups one nanosecond after the
list_empty() test failed?
It needs comments explaining all of this, at least?
- it has an argument called "flag". This name is only slightly
less bad than the infamous "tmp". Please think up an identifier which
better communicates the information which this scalar contains, then
use that identifier consistently across the patchset.
> void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
> {
> struct fsnotify_group *group;
> @@ -37,6 +46,8 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
> if (!(mask & fsnotify_mask))
> return;
>
> + if (!(mask & to_tell->i_fsnotify_mask))
> + return;
> /*
> * SRCU!! the groups list is very very much read only and the path is
> * very hot (assuming something is using fsnotify) Not blocking while
>
> ...
>
> --- a/fs/notify/group.c
> +++ b/fs/notify/group.c
> @@ -47,6 +47,24 @@ void fsnotify_recalc_global_mask(void)
> fsnotify_mask = mask;
> }
>
> +void fsnotify_recalc_group_mask(struct fsnotify_group *group)
> +{
> + __u64 mask = 0;
> + unsigned long old_mask = group->mask;
> + struct fsnotify_mark_entry *entry;
> +
> + spin_lock(&group->mark_lock);
> + list_for_each_entry(entry, &group->mark_entries, g_list) {
> + mask |= entry->mask;
> + }
Unneeded braces. (There are multiple instances of this in the patchset)
> + spin_unlock(&group->mark_lock);
> +
> + group->mask = mask;
> +
> + if (old_mask != mask)
> + fsnotify_recalc_global_mask();
> +}
The function appears to be confused about the width of `mask'. Is it
64-bit or 32? Seems to be 64-bit, and that this is a bug.
A nice comment explaining this function's role in the world would be good.
> static void fsnotify_add_group(struct fsnotify_group *group)
> {
> int priority = group->priority;
> @@ -73,6 +91,9 @@ void fsnotify_get_group(struct fsnotify_group *group)
>
> static void fsnotify_destroy_group(struct fsnotify_group *group)
> {
> + /* clear all inode mark entries for this group */
> + fsnotify_clear_marks_by_group(group);
> +
> if (group->ops->free_group_priv)
> group->ops->free_group_priv(group);
>
>
> ...
>
> +static void fsnotify_destroy_mark(struct fsnotify_mark_entry *entry)
> +{
> + entry->group = NULL;
> + entry->inode = NULL;
> + entry->mask = 0;
> + INIT_LIST_HEAD(&entry->i_list);
> + INIT_LIST_HEAD(&entry->g_list);
> + INIT_LIST_HEAD(&entry->free_i_list);
> + INIT_LIST_HEAD(&entry->free_g_list);
> + entry->free_private(entry);
> +}
Why does a `destroy' function initialise the object? Seems strange.
Perhaps because it is using a slab cache constructor somewhere? If so,
perhaps it should stop doing that and switch to plain old
initialise-at-allocation-time.
> +void fsnotify_get_mark(struct fsnotify_mark_entry *entry)
> +{
> + atomic_inc(&entry->refcnt);
> +}
>
> +void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
> +{
> + if (atomic_dec_and_lock(&entry->refcnt, &entry->lock)) {
> + /*
> + * if refcnt hit 0 and freeme was set when it hit zero it's
> + * time to clean up!
> + */
> + if (entry->freeme) {
> + spin_unlock(&entry->lock);
> + fsnotify_destroy_mark(entry);
> + return;
> + }
> + spin_unlock(&entry->lock);
> + }
> +}
This is strange in two ways.
a) the possibly-unneeded `freeme', mentioned above.
b) it takes a lock which is _internal_ to the object which is being
discarded. A much more common pattern is to take a higher-level
lock when the object's refcount falls to zero. eg:
if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
iput_final(inode);
So something unusual is happening here. I think the challenge
for you is to
- convince yourself (and others) that this design is
appropriate and optimal.
- find a way of communicating the design to code-readers and to
code-reviewers. So that others have a hope of maintaining it.
And personally, I don't find ../../Documentation/foo.txt to be
terribly useful. I just don't trust those files to be complete
and to be maintained. It is better if the code can be understood
by reading the C file.
> +void fsnotify_recalc_inode_mask_locked(struct inode *inode)
> +{
> + struct fsnotify_mark_entry *entry;
> + __u64 new_mask = 0;
> +
> + list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
> + new_mask |= entry->mask;
> + }
> + inode->i_fsnotify_mask = new_mask;
> +}
The code should document which lock is needed. assert_spin_locked() is
one way. Or a comment.
> +void fsnotify_recalc_inode_mask(struct inode *inode)
> +{
> + spin_lock(&inode->i_lock);
> + fsnotify_recalc_inode_mask_locked(inode);
> + spin_unlock(&inode->i_lock);
> +}
"Reader" is now wondering under what circumstances this function is
called. Why does the kernel need to "recalculate" an inode mask? As a
result of some syscall which did something to an inode, perhaps? It's
pretty hard to find this out, isn't it? Perhaps "Writer" can find a way
to help "Reader" in this quest ;)
> +void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
> +{
> + struct fsnotify_mark_entry *lentry, *entry;
> + struct inode *inode;
> + LIST_HEAD(free_list);
> +
> + spin_lock(&group->mark_lock);
> + list_for_each_entry_safe(entry, lentry, &group->mark_entries, g_list) {
> + list_del_init(&entry->g_list);
> + list_add(&entry->free_g_list, &free_list);
> + fsnotify_get_mark(entry);
> + }
> + spin_unlock(&group->mark_lock);
> +
> + list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
> + spin_lock(&entry->lock);
> + inode = entry->inode;
> + if (!inode) {
> + entry->group = NULL;
> + spin_unlock(&entry->lock);
> + fsnotify_put_mark(entry);
> + continue;
> + }
> + spin_lock(&inode->i_lock);
> +
> + list_del_init(&entry->i_list);
> + entry->inode = NULL;
> + list_del_init(&entry->g_list);
> + entry->group = NULL;
> + entry->freeme = 1;
> +
> + fsnotify_recalc_inode_mask_locked(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&entry->lock);
> +
> + fsnotify_put_mark(entry);
> + }
> +}
OK, so we have the concept of a "mark". fsnotify_mark_entry is fairly
nicely described, at its definition site.
We also have a concept of a "group". I wonder what that is.
Logically, I go to the definition site of `struct fsnotify_group' and
learn nothing!
> +void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry)
> +{
> + struct fsnotify_group *group;
> + struct inode *inode;
> +
> + spin_lock(&entry->lock);
> +
> + group = entry->group;
> + if (group)
> + spin_lock(&group->mark_lock);
> +
> + inode = entry->inode;
> + if (inode)
> + spin_lock(&inode->i_lock);
> +
> + list_del_init(&entry->i_list);
> + entry->inode = NULL;
> + list_del_init(&entry->g_list);
> + entry->group = NULL;
> + entry->freeme = 1;
> +
> + if (inode) {
> + fsnotify_recalc_inode_mask_locked(inode);
> + spin_unlock(&inode->i_lock);
> + }
> + if (group)
> + spin_unlock(&group->mark_lock);
> +
> + spin_unlock(&entry->lock);
> +}
I find that if one understands the data structures, and the
relationship between them then comprehending the implementation is
fairly straightforward. That is particularly the case in this code.
`struct fsnotify_mark_entry' is undocumented, which didn't help.
otoh, to an unusually high level, a large part of the complexity in
this code revolves around the dynamic behaviour - what can race with
what, which locks protect which data from which activities, etc. That
part of it _has_ been documented. It made my head spin, and not
understanding the data didn't help.
> +void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
> +{
> + struct fsnotify_mark_entry *lentry, *entry;
> + LIST_HEAD(free_list);
> +
> + spin_lock(&inode->i_lock);
> + list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
> + list_del_init(&entry->i_list);
> + list_add(&entry->free_i_list, &free_list);
> + fsnotify_get_mark(entry);
> + }
> + spin_unlock(&inode->i_lock);
> +
> + /*
> + * at this point destroy_by_* might race.
> + *
> + * we used list_del_init() so it can be list_del_init'd again, no harm.
> + * we were called from an inode function so we know that other user can
> + * try to grab entry->inode->i_lock without a problem.
> + */
Is this hacky?
This reader is wondering "well, why didn't we just take the lock, to
prevent the race?". AFAICT there is no way for the reader to determine
this from the code, which isn't a good situation.
> + list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
> + entry->group->ops->mark_clear_inode(entry, inode, flags);
> + fsnotify_put_mark(entry);
> + }
> +
> + fsnotify_recalc_inode_mask(inode);
> +}
> +
> +/* caller must hold inode->i_lock */
w00t!
I kinda like the use of assert_spin_locked() for this.
> +struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode)
> +{
> + struct fsnotify_mark_entry *entry;
> +
> + list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
> + if (entry->group == group) {
> + fsnotify_get_mark(entry);
> + return entry;
> + }
> + }
> + return NULL;
> +}
> +
> +void fsnotify_init_mark(struct fsnotify_mark_entry *entry, struct fsnotify_group *group, struct inode *inode, __u64 mask)
> +{
> + spin_lock_init(&entry->lock);
> + atomic_set(&entry->refcnt, 1);
> + entry->group = group;
> + entry->mask = mask;
> + entry->inode = inode;
> + entry->freeme = 0;
> + entry->free_private = group->ops->free_mark;
> +}
> +
> +int fsnotify_add_mark(struct fsnotify_mark_entry *entry)
> +{
> + struct fsnotify_mark_entry *lentry;
> + struct fsnotify_group *group = entry->group;
> + struct inode *inode = entry->inode;
> + int ret = 0;
> +
> + /*
> + * LOCKING ORDER!!!!
> + * entry->lock
> + * group->mark_lock
> + * inode->i_lock
> + */
> + spin_lock(&group->mark_lock);
> + spin_lock(&inode->i_lock);
> +
> + lentry = fsnotify_find_mark_entry(group, inode);
> + if (!lentry) {
> + list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
> + list_add(&entry->g_list, &group->mark_entries);
> + fsnotify_recalc_inode_mask_locked(inode);
> + }
> +
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&group->mark_lock);
> +
> + if (lentry) {
> + ret = -EEXIST;
> + fsnotify_put_mark(lentry);
> + }
> +
> + return ret;
> +}
From the name I'd guess that the function adds a mark to, umm,
something. Actually it adds it to the inode and to a list which is in
an object which is pointed to by the to-be-added object. IOW I just
discovered that we have bidirectional references happening here.
All very interesting.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7c68aa9..c5ec88f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -690,6 +690,11 @@ struct inode {
>
> __u32 i_generation;
>
> +#ifdef CONFIG_FSNOTIFY
> + __u64 i_fsnotify_mask; /* all events this inode cares about */
OK. Where are these events enumerated?
> + struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
Reader wonders:
- what type of objects are strung onto this list?
- what is the locking protocol for this list?
> +#endif
> +
> #ifdef CONFIG_DNOTIFY
> unsigned long i_dnotify_mask; /* Directory notify events */
> struct dnotify_struct *i_dnotify; /* for directory notifications */
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index d1f7de2..6ae332f 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -99,6 +99,14 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
> }
>
> ...
>
> +/*
> + * a mark is simply an entry attached to an in core inode which allows an
> + * fsnotify listener to indicate they are either no longer interested in events
> + * of a type matching mask or only interested in those events.
> + *
> + * these are flushed when an inode is evicted from core and may be flushed
> + * when the inode is modified (as seen by fsnotify_access). Some fsnotify users
> + * (such as dnotify) will flush these when the open fd is closed and not at
> + * inode eviction or modification.
> + */
That's a good comment!
I'm surprised that a caller adds a mark to opt out of notifications
rather than to opt in. What's the thinking there?
> +struct fsnotify_mark_entry {
> + __u64 mask; /* mask this mark entry is for */
> + atomic_t refcnt; /* active things looking at this mark */
> + int freeme; /* free when this is set and refcnt hits 0 */
> + struct inode *inode; /* inode this entry is associated with */
> + struct fsnotify_group *group; /* group this mark entry is for */
> + struct list_head i_list; /* list of mark_entries by inode->i_fsnotify_mark_entries */
> + struct list_head g_list; /* list of mark_entries by group->i_fsnotify_mark_entries */
> + spinlock_t lock; /* protect group, inode, and killme */
s/killme/freeme/
> + struct list_head free_i_list; /* tmp list used when freeing this mark */
> + struct list_head free_g_list; /* tmp list used when freeing this mark */
> + void (*free_private)(struct fsnotify_mark_entry *entry); /* called on final put+free */
> +};
> +
* Re: [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings
2009-02-12 21:57 ` Andrew Morton
@ 2009-02-17 17:26 ` Eric Paris
0 siblings, 0 replies; 14+ messages in thread
From: Eric Paris @ 2009-02-17 17:26 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, viro, hch, alan, sfr, john, rlove, malware-list
On Thu, 2009-02-12 at 13:57 -0800, Andrew Morton wrote:
> <picks a patch at random>
>
> I don't have a good sense of what value all of this work brings to
> Linux. Why should I get excited about this? Why is it all worth the
> time/effort/risk/etc which would be involved in getting it integrated?
>
> All rather unclear, but very important!
I have largely taken your comments and rewritten this patch (and a
couple others) to be easier to understand and review.
current patches are at http://people.redhat.com/~eparis/fsnotify
I received some offline comments which I intend to address and will be
reposting later this week.
thank you!
end of thread, other threads:[~2009-02-17 17:27 UTC | newest]
Thread overview: 14+ messages -- links below jump to the message on this page --
2009-02-09 21:15 [PATCH -v1 01/11] fsnotify: unified filesystem notification backend Eric Paris
2009-02-09 21:15 ` [PATCH -v1 02/11] fsnotify: add group priorities Eric Paris
2009-02-09 21:15 ` [PATCH -v1 03/11] fsnotify: add in inode fsnotify markings Eric Paris
2009-02-12 21:57 ` Andrew Morton
2009-02-17 17:26 ` Eric Paris
2009-02-09 21:15 ` [PATCH -v1 04/11] fsnotify: parent event notification Eric Paris
2009-02-09 21:15 ` [PATCH -v1 05/11] dnotify: reimplement dnotify using fsnotify Eric Paris
2009-02-09 21:15 ` [PATCH -v1 06/11] fsnotify: generic notification queue and waitq Eric Paris
2009-02-09 21:15 ` [PATCH -v1 07/11] fsnotify: include pathnames with entries when possible Eric Paris
2009-02-09 21:16 ` [PATCH -v1 08/11] fsnotify: add correlations between events Eric Paris
2009-02-09 21:16 ` [PATCH -v1 09/11] fsnotify: fsnotify marks on inodes pin them in core Eric Paris
2009-02-09 21:16 ` [PATCH -v1 10/11] fsnotify: handle filesystem unmounts with fsnotify marks Eric Paris
2009-02-09 21:16 ` [PATCH -v1 11/11] inotify: reimplement inotify using fsnotify Eric Paris
2009-02-09 21:28 ` [PATCH -v1 00/11] fsnotify: unified filesystem notification backend Eric Paris