[PATCH 0/7] ceph: misc fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/7] ceph: misc fixes
@ 2013-03-01  6:46 Yan, Zheng
  2013-03-01  6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

These patches are also in:
  git://github.com/ukernel/linux.git wip-ceph

Regards
Yan, Zheng

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/7] ceph: fix LSSNAP regression
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-01  6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

commit 6e8575faa8 makes parse_reply_info_extra() return -EIO for LSSNAP

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/mds_client.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d958420..608ffcf 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -265,7 +265,8 @@ static int parse_reply_info_extra(void **p, void *end,
 {
 	if (info->head->op == CEPH_MDS_OP_GETFILELOCK)
 		return parse_reply_info_filelock(p, end, info, features);
-	else if (info->head->op == CEPH_MDS_OP_READDIR)
+	else if (info->head->op == CEPH_MDS_OP_READDIR ||
+		 info->head->op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_dir(p, end, info, features);
 	else if (info->head->op == CEPH_MDS_OP_CREATE)
 		return parse_reply_info_create(p, end, info, features);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/7] ceph: queue cap release when trimming cap
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
  2013-03-01  6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-01  6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

So the client will later send cap release message to MDS

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/caps.c       | 6 +++---
 fs/ceph/mds_client.c | 2 ++
 fs/ceph/super.h      | 2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 1e1e020..5d5c32b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -997,9 +997,9 @@ static int send_cap_msg(struct ceph_mds_session *session,
 	return 0;
 }
 
-static void __queue_cap_release(struct ceph_mds_session *session,
-				u64 ino, u64 cap_id, u32 migrate_seq,
-				u32 issue_seq)
+void __queue_cap_release(struct ceph_mds_session *session,
+			 u64 ino, u64 cap_id, u32 migrate_seq,
+			 u32 issue_seq)
 {
 	struct ceph_msg *msg;
 	struct ceph_mds_cap_release *head;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 608ffcf..ccc68b0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1197,6 +1197,8 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 	session->s_trim_caps--;
 	if (oissued) {
 		/* we aren't the only cap.. just remove us */
+		__queue_cap_release(session, ceph_ino(inode), cap->cap_id,
+				    cap->mseq, cap->issue_seq);
 		__ceph_remove_cap(cap);
 	} else {
 		/* try to drop referring dentries */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 604526a..4353ebc 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -755,6 +755,8 @@ static inline void ceph_remove_cap(struct ceph_cap *cap)
 extern void ceph_put_cap(struct ceph_mds_client *mdsc,
 			 struct ceph_cap *cap);
 
+extern void __queue_cap_release(struct ceph_mds_session *session, u64 ino,
+				u64 cap_id, u32 migrate_seq, u32 issue_seq);
 extern void ceph_queue_caps_release(struct inode *inode);
 extern int ceph_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern int ceph_fsync(struct file *file, loff_t start, loff_t end,
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/7] ceph: set mds_want according to cap import message
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
  2013-03-01  6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
  2013-03-01  6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-01  6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/caps.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5d5c32b..61f3833 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -553,6 +553,7 @@ retry:
 		cap->implemented = 0;
 		cap->mds = mds;
 		cap->mds_wanted = 0;
+		cap->mseq = 0;
 
 		cap->ci = ci;
 		__insert_cap_node(ci, cap);
@@ -628,7 +629,10 @@ retry:
 	cap->cap_id = cap_id;
 	cap->issued = issued;
 	cap->implemented |= issued;
-	cap->mds_wanted |= wanted;
+	if (mseq > cap->mseq)
+		cap->mds_wanted = wanted;
+	else
+		cap->mds_wanted |= wanted;
 	cap->seq = seq;
 	cap->issue_seq = seq;
 	cap->mseq = mseq;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
                   ` (2 preceding siblings ...)
  2013-03-01  6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-04 18:19   ` Sage Weil
  2013-03-01  6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/caps.c       |  8 ++++---
 fs/ceph/dir.c        | 62 ++++++++++------------------------------------------
 fs/ceph/inode.c      | 30 +++++++++++--------------
 fs/ceph/mds_client.c |  6 ++---
 fs/ceph/super.h      | 23 ++-----------------
 5 files changed, 34 insertions(+), 95 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 61f3833..76634f4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -490,15 +490,17 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 		ci->i_rdcache_gen++;
 
 	/*
-	 * if we are newly issued FILE_SHARED, clear D_COMPLETE; we
+	 * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
 	 * don't know what happened to this directory while we didn't
 	 * have the cap.
 	 */
 	if ((issued & CEPH_CAP_FILE_SHARED) &&
 	    (had & CEPH_CAP_FILE_SHARED) == 0) {
 		ci->i_shared_gen++;
-		if (S_ISDIR(ci->vfs_inode.i_mode))
-			ceph_dir_clear_complete(&ci->vfs_inode);
+		if (S_ISDIR(ci->vfs_inode.i_mode)) {
+			dout(" marking %p NOT complete\n", &ci->vfs_inode);
+			ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
+		}
 	}
 }
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 8c1aabe..76821be 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
  * falling back to a "normal" sync readdir if any dentries in the dir
  * are dropped.
  *
- * D_COMPLETE tells indicates we have all dentries in the dir.  It is
+ * I_COMPLETE tells indicates we have all dentries in the dir.  It is
  * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
  * the MDS if/when the directory is modified).
  */
@@ -198,8 +198,8 @@ more:
 	filp->f_pos++;
 
 	/* make sure a dentry wasn't dropped while we didn't have parent lock */
-	if (!ceph_dir_test_complete(dir)) {
-		dout(" lost D_COMPLETE on %p; falling back to mds\n", dir);
+	if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
+		dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
 		err = -EAGAIN;
 		goto out;
 	}
@@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
 	if ((filp->f_pos == 2 || fi->dentry) &&
 	    !ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
 	    ceph_snap(inode) != CEPH_SNAPDIR &&
-	    ceph_dir_test_complete(inode) &&
+	    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
 	    __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
 		spin_unlock(&ci->i_ceph_lock);
 		err = __dcache_readdir(filp, dirent, filldir);
@@ -350,7 +350,7 @@ more:
 
 		if (!req->r_did_prepopulate) {
 			dout("readdir !did_prepopulate");
-			fi->dir_release_count--;    /* preclude D_COMPLETE */
+			fi->dir_release_count--;    /* preclude I_COMPLETE */
 		}
 
 		/* note next offset and last dentry name */
@@ -429,7 +429,8 @@ more:
 	 */
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_release_count == fi->dir_release_count) {
-		ceph_dir_set_complete(inode);
+		dout(" marking %p complete\n", inode);
+		ci->i_ceph_flags |= CEPH_I_COMPLETE;
 		ci->i_max_offset = filp->f_pos;
 	}
 	spin_unlock(&ci->i_ceph_lock);
@@ -604,7 +605,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 			    fsc->mount_options->snapdir_name,
 			    dentry->d_name.len) &&
 		    !is_root_ceph_dentry(dir, dentry) &&
-		    ceph_dir_test_complete(dir) &&
+		    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
 		    (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
 			spin_unlock(&ci->i_ceph_lock);
 			dout(" dir %p complete, -ENOENT\n", dir);
@@ -908,7 +909,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
 		 */
 
 		/* d_move screws up d_subdirs order */
-		ceph_dir_clear_complete(new_dir);
+		ceph_i_clear(new_dir, CEPH_I_COMPLETE);
 
 		d_move(old_dentry, new_dentry);
 
@@ -1065,44 +1066,6 @@ static int ceph_snapdir_d_revalidate(struct dentry *dentry,
 }
 
 /*
- * Set/clear/test dir complete flag on the dir's dentry.
- */
-void ceph_dir_set_complete(struct inode *inode)
-{
-	struct dentry *dentry = d_find_any_alias(inode);
-	
-	if (dentry && ceph_dentry(dentry) &&
-	    ceph_test_mount_opt(ceph_sb_to_client(dentry->d_sb), DCACHE)) {
-		dout(" marking %p (%p) complete\n", inode, dentry);
-		set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
-	}
-	dput(dentry);
-}
-
-void ceph_dir_clear_complete(struct inode *inode)
-{
-	struct dentry *dentry = d_find_any_alias(inode);
-
-	if (dentry && ceph_dentry(dentry)) {
-		dout(" marking %p (%p) complete\n", inode, dentry);
-		set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
-	}
-	dput(dentry);
-}
-
-bool ceph_dir_test_complete(struct inode *inode)
-{
-	struct dentry *dentry = d_find_any_alias(inode);
-
-	if (dentry && ceph_dentry(dentry)) {
-		dout(" marking %p (%p) NOT complete\n", inode, dentry);
-		clear_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
-	}
-	dput(dentry);
-	return false;
-}
-
-/*
  * When the VFS prunes a dentry from the cache, we need to clear the
  * complete flag on the parent directory.
  *
@@ -1110,15 +1073,13 @@ bool ceph_dir_test_complete(struct inode *inode)
  */
 static void ceph_d_prune(struct dentry *dentry)
 {
-	struct ceph_dentry_info *di;
-
 	dout("ceph_d_prune %p\n", dentry);
 
 	/* do we have a valid parent? */
 	if (IS_ROOT(dentry))
 		return;
 
-	/* if we are not hashed, we don't affect D_COMPLETE */
+	/* if we are not hashed, we don't affect I_COMPLETE */
 	if (d_unhashed(dentry))
 		return;
 
@@ -1126,8 +1087,7 @@ static void ceph_d_prune(struct dentry *dentry)
 	 * we hold d_lock, so d_parent is stable, and d_fsdata is never
 	 * cleared until d_release
 	 */
-	di = ceph_dentry(dentry->d_parent);
-	clear_bit(CEPH_D_COMPLETE, &di->flags);
+	ceph_i_clear(dentry->d_parent->d_inode, CEPH_I_COMPLETE);
 }
 
 /*
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2971eaa..42c5769 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -561,7 +561,6 @@ static int fill_inode(struct inode *inode,
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	int i;
 	int issued = 0, implemented;
-	int updating_inode = 0;
 	struct timespec mtime, atime, ctime;
 	u32 nsplits;
 	struct ceph_buffer *xattr_blob = NULL;
@@ -601,7 +600,6 @@ static int fill_inode(struct inode *inode,
 	    (ci->i_version & ~1) >= le64_to_cpu(info->version))
 		goto no_change;
 	
-	updating_inode = 1;
 	issued = __ceph_caps_issued(ci, &implemented);
 	issued |= implemented | __ceph_caps_dirty(ci);
 
@@ -716,6 +714,17 @@ static int fill_inode(struct inode *inode,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
+	/* set dir completion flag? */
+	if (S_ISDIR(inode->i_mode) &&
+	    ci->i_files == 0 && ci->i_subdirs == 0 &&
+	    ceph_snap(inode) == CEPH_NOSNAP &&
+	    (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
+	    (issued & CEPH_CAP_FILE_EXCL) == 0 &&
+	    (ci->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
+		dout(" marking %p complete (empty)\n", inode);
+		ci->i_ceph_flags |= CEPH_I_COMPLETE;
+		ci->i_max_offset = 2;
+	}
 no_change:
 	spin_unlock(&ci->i_ceph_lock);
 
@@ -766,19 +775,6 @@ no_change:
 		__ceph_get_fmode(ci, cap_fmode);
 	}
 
-	/* set dir completion flag? */
-	if (S_ISDIR(inode->i_mode) &&
-	    updating_inode &&                 /* didn't jump to no_change */
-	    ci->i_files == 0 && ci->i_subdirs == 0 &&
-	    ceph_snap(inode) == CEPH_NOSNAP &&
-	    (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
-	    (issued & CEPH_CAP_FILE_EXCL) == 0 &&
-	    !ceph_dir_test_complete(inode)) {
-		dout(" marking %p complete (empty)\n", inode);
-		ceph_dir_set_complete(inode);
-		ci->i_max_offset = 2;
-	}
-
 	/* update delegation info? */
 	if (dirinfo)
 		ceph_fill_dirfrag(inode, dirinfo);
@@ -860,7 +856,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
 	di = ceph_dentry(dn);
 
 	spin_lock(&ci->i_ceph_lock);
-	if (!ceph_dir_test_complete(inode)) {
+	if ((ceph_inode(inode)->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
 		spin_unlock(&ci->i_ceph_lock);
 		return;
 	}
@@ -1065,7 +1061,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
 			 * d_move() puts the renamed dentry at the end of
 			 * d_subdirs.  We need to assign it an appropriate
 			 * directory offset so we can behave when holding
-			 * D_COMPLETE.
+			 * I_COMPLETE.
 			 */
 			ceph_set_dentry_offset(req->r_old_dentry);
 			dout("dn %p gets new offset %lld\n", req->r_old_dentry, 
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccc68b0..e52b0fb 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2029,7 +2029,7 @@ out:
 }
 
 /*
- * Invalidate dir D_COMPLETE, dentry lease state on an aborted MDS
+ * Invalidate dir I_COMPLETE, dentry lease state on an aborted MDS
  * namespace request.
  */
 void ceph_invalidate_dir_request(struct ceph_mds_request *req)
@@ -2037,9 +2037,9 @@ void ceph_invalidate_dir_request(struct ceph_mds_request *req)
 	struct inode *inode = req->r_locked_dir;
 	struct ceph_inode_info *ci = ceph_inode(inode);
 
-	dout("invalidate_dir_request %p (D_COMPLETE, lease(s))\n", inode);
+	dout("invalidate_dir_request %p (I_COMPLETE, lease(s))\n", inode);
 	spin_lock(&ci->i_ceph_lock);
-	ceph_dir_clear_complete(inode);
+	ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
 	ci->i_release_count++;
 	spin_unlock(&ci->i_ceph_lock);
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 4353ebc..efbcb56 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -204,7 +204,6 @@ struct ceph_inode_xattr {
  * Ceph dentry state
  */
 struct ceph_dentry_info {
-	unsigned long flags;
 	struct ceph_mds_session *lease_session;
 	u32 lease_gen, lease_shared_gen;
 	u32 lease_seq;
@@ -215,18 +214,6 @@ struct ceph_dentry_info {
 	u64 offset;
 };
 
-/*
- * dentry flags
- *
- * The locking for D_COMPLETE is a bit odd:
- *  - we can clear it at almost any time (see ceph_d_prune)
- *  - it is only meaningful if:
- *    - we hold dir inode i_ceph_lock
- *    - we hold dir FILE_SHARED caps
- *    - the dentry D_COMPLETE is set
- */
-#define CEPH_D_COMPLETE 1  /* if set, d_u.d_subdirs is complete directory */
-
 struct ceph_inode_xattrs_info {
 	/*
 	 * (still encoded) xattr blob. we avoid the overhead of parsing
@@ -267,7 +254,7 @@ struct ceph_inode_info {
 	struct timespec i_rctime;
 	u64 i_rbytes, i_rfiles, i_rsubdirs;
 	u64 i_files, i_subdirs;
-	u64 i_max_offset;  /* largest readdir offset, set with D_COMPLETE */
+	u64 i_max_offset;  /* largest readdir offset, set with I_COMPLETE */
 
 	struct rb_root i_fragtree;
 	struct mutex i_fragtree_mutex;
@@ -432,6 +419,7 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 /*
  * Ceph inode.
  */
+#define CEPH_I_COMPLETE  1  /* we have complete directory cached */
 #define CEPH_I_NODELAY   4  /* do not delay cap release */
 #define CEPH_I_FLUSH     8  /* do not delay flush of dirty metadata */
 #define CEPH_I_NOFLUSH  16  /* do not flush dirty caps */
@@ -489,13 +477,6 @@ static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
 }
 
 /*
- * set/clear directory D_COMPLETE flag
- */
-void ceph_dir_set_complete(struct inode *inode);
-void ceph_dir_clear_complete(struct inode *inode);
-bool ceph_dir_test_complete(struct inode *inode);
-
-/*
  * caps helpers
  */
 static inline bool __ceph_is_any_real_caps(struct ceph_inode_info *ci)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/7] ceph: revert commit 22cddde104
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
                   ` (3 preceding siblings ...)
  2013-03-01  6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-01  6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

commit 22cddde104 breaks the atomicity of write operation, it also
introduces a deadlock between write and truncate.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/addr.c       | 51 +++---------------------------------
 fs/ceph/file.c       | 73 +++++++++++++++++++++++++++++++---------------------
 fs/ceph/mds_client.c |  1 +
 3 files changed, 48 insertions(+), 77 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cfef3e0..d662025 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1067,51 +1067,23 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
 			    struct page **pagep, void **fsdata)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct ceph_inode_info *ci = ceph_inode(inode);
-	struct ceph_file_info *fi = file->private_data;
 	struct page *page;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	int r, want, got = 0;
-
-	if (fi->fmode & CEPH_FILE_MODE_LAZY)
-		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
-	else
-		want = CEPH_CAP_FILE_BUFFER;
-
-	dout("write_begin %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
-	     inode, ceph_vinop(inode), pos, len, inode->i_size);
-	r = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos+len);
-	if (r < 0)
-		return r;
-	dout("write_begin %p %llx.%llx %llu~%u  got cap refs on %s\n",
-	     inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
-	if (!(got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO))) {
-		ceph_put_cap_refs(ci, got);
-		return -EAGAIN;
-	}
+	int r;
 
 	do {
 		/* get a page */
 		page = grab_cache_page_write_begin(mapping, index, 0);
-		if (!page) {
-			r = -ENOMEM;
-			break;
-		}
+		if (!page)
+			return -ENOMEM;
+		*pagep = page;
 
 		dout("write_begin file %p inode %p page %p %d~%d\n", file,
 		     inode, page, (int)pos, (int)len);
 
 		r = ceph_update_writeable_page(file, pos, len, page);
-		if (r)
-			page_cache_release(page);
 	} while (r == -EAGAIN);
 
-	if (r) {
-		ceph_put_cap_refs(ci, got);
-	} else {
-		*pagep = page;
-		*(int *)fsdata = got;
-	}
 	return r;
 }
 
@@ -1125,12 +1097,10 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 			  struct page *page, void *fsdata)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
 	int check_cap = 0;
-	int got = (unsigned long)fsdata;
 
 	dout("write_end file %p inode %p page %p %d~%d (%d)\n", file,
 	     inode, page, (int)pos, (int)copied, (int)len);
@@ -1153,19 +1123,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 	up_read(&mdsc->snap_rwsem);
 	page_cache_release(page);
 
-	if (copied > 0) {
-		int dirty;
-		spin_lock(&ci->i_ceph_lock);
-		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
-		spin_unlock(&ci->i_ceph_lock);
-		if (dirty)
-			__mark_inode_dirty(inode, dirty);
-	}
-
-	dout("write_end %p %llx.%llx %llu~%u  dropping cap refs on %s\n",
-	     inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
-	ceph_put_cap_refs(ci, got);
-
 	if (check_cap)
 		ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY, NULL);
 
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 9c4325e..a949805 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -718,53 +718,63 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct ceph_osd_client *osdc =
 		&ceph_sb_to_client(inode->i_sb)->client->osdc;
 	loff_t endoff = pos + iov->iov_len;
-	int got = 0;
-	int ret, err, written;
+	int want, got = 0;
+	int ret, err;
 
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
 retry_snap:
-	written = 0;
 	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
 		return -ENOSPC;
 	__ceph_do_pending_vmtruncate(inode);
+	dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
+	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+	     inode->i_size);
+	if (fi->fmode & CEPH_FILE_MODE_LAZY)
+		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
+	else
+		want = CEPH_CAP_FILE_BUFFER;
+	ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
+	if (ret < 0)
+		goto out_put;
 
-	/*
-	 * try to do a buffered write.  if we don't have sufficient
-	 * caps, we'll get -EAGAIN from generic_file_aio_write, or a
-	 * short write if we only get caps for some pages.
-	 */
-	if (!(iocb->ki_filp->f_flags & O_DIRECT) &&
-	    !(inode->i_sb->s_flags & MS_SYNCHRONOUS) &&
-	    !(fi->flags & CEPH_F_SYNC)) {
-		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-		if (ret >= 0)
-			written = ret;
+	dout("aio_write %p %llx.%llx %llu~%u  got cap refs on %s\n",
+	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+	     ceph_cap_string(got));
+
+	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
+	    (iocb->ki_filp->f_flags & O_DIRECT) ||
+	    (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
+	    (fi->flags & CEPH_F_SYNC)) {
+		ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
+			&iocb->ki_pos);
+	} else {
+		/*
+		 * buffered write; drop Fw early to avoid slow
+		 * revocation if we get stuck on balance_dirty_pages
+		 */
+		int dirty;
 
+		spin_lock(&ci->i_ceph_lock);
+		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
+		spin_unlock(&ci->i_ceph_lock);
+		ceph_put_cap_refs(ci, got);
+
+		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
 		if ((ret >= 0 || ret == -EIOCBQUEUED) &&
 		    ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
 		     || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
-			err = vfs_fsync_range(file, pos, pos + written - 1, 1);
+			err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
 			if (err < 0)
 				ret = err;
 		}
-		if ((ret < 0 && ret != -EAGAIN) || pos + written >= endoff)
-			goto out;
-	}
 
-	dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
-	     inode, ceph_vinop(inode), pos + written,
-	     (unsigned)iov->iov_len - written, inode->i_size);
-	ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, 0, &got, endoff);
-	if (ret < 0)
+		if (dirty)
+			__mark_inode_dirty(inode, dirty);
 		goto out;
+	}
 
-	dout("aio_write %p %llx.%llx %llu~%u  got cap refs on %s\n",
-	     inode, ceph_vinop(inode), pos + written,
-	     (unsigned)iov->iov_len - written, ceph_cap_string(got));
-	ret = ceph_sync_write(file, iov->iov_base + written,
-			      iov->iov_len - written, &iocb->ki_pos);
 	if (ret >= 0) {
 		int dirty;
 		spin_lock(&ci->i_ceph_lock);
@@ -773,10 +783,13 @@ retry_snap:
 		if (dirty)
 			__mark_inode_dirty(inode, dirty);
 	}
+
+out_put:
 	dout("aio_write %p %llx.%llx %llu~%u  dropping cap refs on %s\n",
-	     inode, ceph_vinop(inode), pos + written,
-	     (unsigned)iov->iov_len - written, ceph_cap_string(got));
+	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+	     ceph_cap_string(got));
 	ceph_put_cap_refs(ci, got);
+
 out:
 	if (ret == -EOLDSNAPC) {
 		dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e52b0fb..ab899c8 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1916,6 +1916,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
 		req = list_entry(tmp_list.next,
 				 struct ceph_mds_request, r_wait);
 		list_del_init(&req->r_wait);
+		dout(" wake request %p tid %llu\n", req, req->r_tid);
 		__do_request(mdsc, req);
 	}
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/7] ceph: don't early drop Fw cap
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
                   ` (4 preceding siblings ...)
  2013-03-01  6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-04 18:26   ` Gregory Farnum
  2013-03-01  6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
  2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
  7 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/file.c | 42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a949805..28ef273 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
+	sb_start_write(inode->i_sb);
 retry_snap:
-	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
-		return -ENOSPC;
+	if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
+		ret = -ENOSPC;
+		goto out;
+	}
 	__ceph_do_pending_vmtruncate(inode);
 	dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
@@ -750,29 +753,10 @@ retry_snap:
 		ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
 			&iocb->ki_pos);
 	} else {
-		/*
-		 * buffered write; drop Fw early to avoid slow
-		 * revocation if we get stuck on balance_dirty_pages
-		 */
-		int dirty;
-
-		spin_lock(&ci->i_ceph_lock);
-		dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
-		spin_unlock(&ci->i_ceph_lock);
-		ceph_put_cap_refs(ci, got);
-
-		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-		if ((ret >= 0 || ret == -EIOCBQUEUED) &&
-		    ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
-		     || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
-			err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
-			if (err < 0)
-				ret = err;
-		}
-
-		if (dirty)
-			__mark_inode_dirty(inode, dirty);
-		goto out;
+		mutex_lock(&inode->i_mutex);
+		ret = __generic_file_aio_write(iocb, iov, nr_segs,
+					       &iocb->ki_pos);
+		mutex_unlock(&inode->i_mutex);
 	}
 
 	if (ret >= 0) {
@@ -790,12 +774,20 @@ out_put:
 	     ceph_cap_string(got));
 	ceph_put_cap_refs(ci, got);
 
+	if (ret >= 0 &&
+	    ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
+	     ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
+		err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
+		if (err < 0)
+			ret = err;
+	}
 out:
 	if (ret == -EOLDSNAPC) {
 		dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
 		     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
 		goto retry_snap;
 	}
+	sb_end_write(inode->i_sb);
 
 	return ret;
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
                   ` (5 preceding siblings ...)
  2013-03-01  6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
@ 2013-03-01  6:46 ` Yan, Zheng
  2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
  7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01  6:46 UTC (permalink / raw)
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

make __ceph_do_pending_vmtruncate() acquire i_mutex if the caller
does not hold the mutex, so ceph_aio_read() can call it safely.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 fs/ceph/file.c  |  6 +++---
 fs/ceph/inode.c | 18 +++++++++---------
 fs/ceph/super.h |  2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 28ef273..b9eedd4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -653,7 +653,7 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)len, inode);
 again:
-	__ceph_do_pending_vmtruncate(inode);
+	__ceph_do_pending_vmtruncate(inode, true);
 	if (fi->fmode & CEPH_FILE_MODE_LAZY)
 		want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
 	else
@@ -730,7 +730,7 @@ retry_snap:
 		ret = -ENOSPC;
 		goto out;
 	}
-	__ceph_do_pending_vmtruncate(inode);
+	__ceph_do_pending_vmtruncate(inode, true);
 	dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
 	     inode->i_size);
@@ -801,7 +801,7 @@ static loff_t ceph_llseek(struct file *file, loff_t offset, int whence)
 	int ret;
 
 	mutex_lock(&inode->i_mutex);
-	__ceph_do_pending_vmtruncate(inode);
+	__ceph_do_pending_vmtruncate(inode, false);
 
 	if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) {
 		ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42c5769..2b3fee7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1416,7 +1416,7 @@ out:
 
 
 /*
- * called by trunc_wq; take i_mutex ourselves
+ * called by trunc_wq;
  *
  * We also truncate in a separate thread as well.
  */
@@ -1427,9 +1427,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
 	struct inode *inode = &ci->vfs_inode;
 
 	dout("vmtruncate_work %p\n", inode);
-	mutex_lock(&inode->i_mutex);
-	__ceph_do_pending_vmtruncate(inode);
-	mutex_unlock(&inode->i_mutex);
+	__ceph_do_pending_vmtruncate(inode, true);
 	iput(inode);
 }
 
@@ -1453,12 +1451,10 @@ void ceph_queue_vmtruncate(struct inode *inode)
 }
 
 /*
- * called with i_mutex held.
- *
  * Make sure any pending truncation is applied before doing anything
  * that may depend on it.
  */
-void __ceph_do_pending_vmtruncate(struct inode *inode)
+void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	u64 to;
@@ -1491,7 +1487,11 @@ retry:
 	     ci->i_truncate_pending, to);
 	spin_unlock(&ci->i_ceph_lock);
 
+	if (needlock)
+		mutex_lock(&inode->i_mutex);
 	truncate_inode_pages(inode->i_mapping, to);
+	if (needlock)
+		mutex_unlock(&inode->i_mutex);
 
 	spin_lock(&ci->i_ceph_lock);
 	if (to == ci->i_truncate_size) {
@@ -1544,7 +1544,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
 	if (ceph_snap(inode) != CEPH_NOSNAP)
 		return -EROFS;
 
-	__ceph_do_pending_vmtruncate(inode);
+	__ceph_do_pending_vmtruncate(inode, false);
 
 	err = inode_change_ok(inode, attr);
 	if (err != 0)
@@ -1722,7 +1722,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
 	     ceph_cap_string(dirtied), mask);
 
 	ceph_mdsc_put_request(req);
-	__ceph_do_pending_vmtruncate(inode);
+	__ceph_do_pending_vmtruncate(inode, false);
 	return err;
 out:
 	spin_unlock(&ci->i_ceph_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index efbcb56..e5f1875 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -694,7 +694,7 @@ extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 extern int ceph_inode_holds_cap(struct inode *inode, int mask);
 
 extern int ceph_inode_set_size(struct inode *inode, loff_t size);
-extern void __ceph_do_pending_vmtruncate(struct inode *inode);
+extern void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock);
 extern void ceph_queue_vmtruncate(struct inode *inode);
 
 extern void ceph_queue_invalidate(struct inode *inode);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
  2013-03-01  6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
@ 2013-03-04 18:19   ` Sage Weil
  0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2013-03-04 18:19 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

On Fri, 1 Mar 2013, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> 
> commit c6ffe10015 moved the flag that tracks if the dcache contents
> for a directory are complete to dentry. The problem is there are
> lots of places that use ceph_dir_{set,clear,test}_complete() while
> holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
> sleep because they call dput().
> 
> This patch basically reverts that commit. For ceph_d_prune(), it's
> called with both the dentry to prune and the parent dentry are
> locked. So it's safe to access the parent dentry's d_inode and
> clear I_COMPLETE flag.

I'm trying to remember why I thought the D_COMPETE flag was necessary.  
Maybe I didn't think that i_ceph_lock could safely nest inside of d_lock?  
Or that the parent was locked?

Anyway, assuming both of those things are in fact true, this looks good 
(and simpler :).

sage


> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  fs/ceph/caps.c       |  8 ++++---
>  fs/ceph/dir.c        | 62 ++++++++++------------------------------------------
>  fs/ceph/inode.c      | 30 +++++++++++--------------
>  fs/ceph/mds_client.c |  6 ++---
>  fs/ceph/super.h      | 23 ++-----------------
>  5 files changed, 34 insertions(+), 95 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 61f3833..76634f4 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -490,15 +490,17 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
>  		ci->i_rdcache_gen++;
>  
>  	/*
> -	 * if we are newly issued FILE_SHARED, clear D_COMPLETE; we
> +	 * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
>  	 * don't know what happened to this directory while we didn't
>  	 * have the cap.
>  	 */
>  	if ((issued & CEPH_CAP_FILE_SHARED) &&
>  	    (had & CEPH_CAP_FILE_SHARED) == 0) {
>  		ci->i_shared_gen++;
> -		if (S_ISDIR(ci->vfs_inode.i_mode))
> -			ceph_dir_clear_complete(&ci->vfs_inode);
> +		if (S_ISDIR(ci->vfs_inode.i_mode)) {
> +			dout(" marking %p NOT complete\n", &ci->vfs_inode);
> +			ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
> +		}
>  	}
>  }
>  
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 8c1aabe..76821be 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
>   * falling back to a "normal" sync readdir if any dentries in the dir
>   * are dropped.
>   *
> - * D_COMPLETE tells indicates we have all dentries in the dir.  It is
> + * I_COMPLETE tells indicates we have all dentries in the dir.  It is
>   * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
>   * the MDS if/when the directory is modified).
>   */
> @@ -198,8 +198,8 @@ more:
>  	filp->f_pos++;
>  
>  	/* make sure a dentry wasn't dropped while we didn't have parent lock */
> -	if (!ceph_dir_test_complete(dir)) {
> -		dout(" lost D_COMPLETE on %p; falling back to mds\n", dir);
> +	if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
> +		dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
>  		err = -EAGAIN;
>  		goto out;
>  	}
> @@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
>  	if ((filp->f_pos == 2 || fi->dentry) &&
>  	    !ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
>  	    ceph_snap(inode) != CEPH_SNAPDIR &&
> -	    ceph_dir_test_complete(inode) &&
> +	    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
>  	    __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
>  		spin_unlock(&ci->i_ceph_lock);
>  		err = __dcache_readdir(filp, dirent, filldir);
> @@ -350,7 +350,7 @@ more:
>  
>  		if (!req->r_did_prepopulate) {
>  			dout("readdir !did_prepopulate");
> -			fi->dir_release_count--;    /* preclude D_COMPLETE */
> +			fi->dir_release_count--;    /* preclude I_COMPLETE */
>  		}
>  
>  		/* note next offset and last dentry name */
> @@ -429,7 +429,8 @@ more:
>  	 */
>  	spin_lock(&ci->i_ceph_lock);
>  	if (ci->i_release_count == fi->dir_release_count) {
> -		ceph_dir_set_complete(inode);
> +		dout(" marking %p complete\n", inode);
> +		ci->i_ceph_flags |= CEPH_I_COMPLETE;
>  		ci->i_max_offset = filp->f_pos;
>  	}
>  	spin_unlock(&ci->i_ceph_lock);
> @@ -604,7 +605,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
>  			    fsc->mount_options->snapdir_name,
>  			    dentry->d_name.len) &&
>  		    !is_root_ceph_dentry(dir, dentry) &&
> -		    ceph_dir_test_complete(dir) &&
> +		    (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
>  		    (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
>  			spin_unlock(&ci->i_ceph_lock);
>  			dout(" dir %p complete, -ENOENT\n", dir);
> @@ -908,7 +909,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
>  		 */
>  
>  		/* d_move screws up d_subdirs order */
> -		ceph_dir_clear_complete(new_dir);
> +		ceph_i_clear(new_dir, CEPH_I_COMPLETE);
>  
>  		d_move(old_dentry, new_dentry);
>  
> @@ -1065,44 +1066,6 @@ static int ceph_snapdir_d_revalidate(struct dentry *dentry,
>  }
>  
>  /*
> - * Set/clear/test dir complete flag on the dir's dentry.
> - */
> -void ceph_dir_set_complete(struct inode *inode)
> -{
> -	struct dentry *dentry = d_find_any_alias(inode);
> -	
> -	if (dentry && ceph_dentry(dentry) &&
> -	    ceph_test_mount_opt(ceph_sb_to_client(dentry->d_sb), DCACHE)) {
> -		dout(" marking %p (%p) complete\n", inode, dentry);
> -		set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> -	}
> -	dput(dentry);
> -}
> -
> -void ceph_dir_clear_complete(struct inode *inode)
> -{
> -	struct dentry *dentry = d_find_any_alias(inode);
> -
> -	if (dentry && ceph_dentry(dentry)) {
> -		dout(" marking %p (%p) complete\n", inode, dentry);
> -		set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> -	}
> -	dput(dentry);
> -}
> -
> -bool ceph_dir_test_complete(struct inode *inode)
> -{
> -	struct dentry *dentry = d_find_any_alias(inode);
> -
> -	if (dentry && ceph_dentry(dentry)) {
> -		dout(" marking %p (%p) NOT complete\n", inode, dentry);
> -		clear_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> -	}
> -	dput(dentry);
> -	return false;
> -}
> -
> -/*
>   * When the VFS prunes a dentry from the cache, we need to clear the
>   * complete flag on the parent directory.
>   *
> @@ -1110,15 +1073,13 @@ bool ceph_dir_test_complete(struct inode *inode)
>   */
>  static void ceph_d_prune(struct dentry *dentry)
>  {
> -	struct ceph_dentry_info *di;
> -
>  	dout("ceph_d_prune %p\n", dentry);
>  
>  	/* do we have a valid parent? */
>  	if (IS_ROOT(dentry))
>  		return;
>  
> -	/* if we are not hashed, we don't affect D_COMPLETE */
> +	/* if we are not hashed, we don't affect I_COMPLETE */
>  	if (d_unhashed(dentry))
>  		return;
>  
> @@ -1126,8 +1087,7 @@ static void ceph_d_prune(struct dentry *dentry)
>  	 * we hold d_lock, so d_parent is stable, and d_fsdata is never
>  	 * cleared until d_release
>  	 */
> -	di = ceph_dentry(dentry->d_parent);
> -	clear_bit(CEPH_D_COMPLETE, &di->flags);
> +	ceph_i_clear(dentry->d_parent->d_inode, CEPH_I_COMPLETE);
>  }
>  
>  /*
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 2971eaa..42c5769 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -561,7 +561,6 @@ static int fill_inode(struct inode *inode,
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	int i;
>  	int issued = 0, implemented;
> -	int updating_inode = 0;
>  	struct timespec mtime, atime, ctime;
>  	u32 nsplits;
>  	struct ceph_buffer *xattr_blob = NULL;
> @@ -601,7 +600,6 @@ static int fill_inode(struct inode *inode,
>  	    (ci->i_version & ~1) >= le64_to_cpu(info->version))
>  		goto no_change;
>  	
> -	updating_inode = 1;
>  	issued = __ceph_caps_issued(ci, &implemented);
>  	issued |= implemented | __ceph_caps_dirty(ci);
>  
> @@ -716,6 +714,17 @@ static int fill_inode(struct inode *inode,
>  		       ceph_vinop(inode), inode->i_mode);
>  	}
>  
> +	/* set dir completion flag? */
> +	if (S_ISDIR(inode->i_mode) &&
> +	    ci->i_files == 0 && ci->i_subdirs == 0 &&
> +	    ceph_snap(inode) == CEPH_NOSNAP &&
> +	    (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
> +	    (issued & CEPH_CAP_FILE_EXCL) == 0 &&
> +	    (ci->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
> +		dout(" marking %p complete (empty)\n", inode);
> +		ci->i_ceph_flags |= CEPH_I_COMPLETE;
> +		ci->i_max_offset = 2;
> +	}
>  no_change:
>  	spin_unlock(&ci->i_ceph_lock);
>  
> @@ -766,19 +775,6 @@ no_change:
>  		__ceph_get_fmode(ci, cap_fmode);
>  	}
>  
> -	/* set dir completion flag? */
> -	if (S_ISDIR(inode->i_mode) &&
> -	    updating_inode &&                 /* didn't jump to no_change */
> -	    ci->i_files == 0 && ci->i_subdirs == 0 &&
> -	    ceph_snap(inode) == CEPH_NOSNAP &&
> -	    (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
> -	    (issued & CEPH_CAP_FILE_EXCL) == 0 &&
> -	    !ceph_dir_test_complete(inode)) {
> -		dout(" marking %p complete (empty)\n", inode);
> -		ceph_dir_set_complete(inode);
> -		ci->i_max_offset = 2;
> -	}
> -
>  	/* update delegation info? */
>  	if (dirinfo)
>  		ceph_fill_dirfrag(inode, dirinfo);
> @@ -860,7 +856,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
>  	di = ceph_dentry(dn);
>  
>  	spin_lock(&ci->i_ceph_lock);
> -	if (!ceph_dir_test_complete(inode)) {
> +	if ((ceph_inode(inode)->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
>  		spin_unlock(&ci->i_ceph_lock);
>  		return;
>  	}
> @@ -1065,7 +1061,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
>  			 * d_move() puts the renamed dentry at the end of
>  			 * d_subdirs.  We need to assign it an appropriate
>  			 * directory offset so we can behave when holding
> -			 * D_COMPLETE.
> +			 * I_COMPLETE.
>  			 */
>  			ceph_set_dentry_offset(req->r_old_dentry);
>  			dout("dn %p gets new offset %lld\n", req->r_old_dentry, 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index ccc68b0..e52b0fb 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2029,7 +2029,7 @@ out:
>  }
>  
>  /*
> - * Invalidate dir D_COMPLETE, dentry lease state on an aborted MDS
> + * Invalidate dir I_COMPLETE, dentry lease state on an aborted MDS
>   * namespace request.
>   */
>  void ceph_invalidate_dir_request(struct ceph_mds_request *req)
> @@ -2037,9 +2037,9 @@ void ceph_invalidate_dir_request(struct ceph_mds_request *req)
>  	struct inode *inode = req->r_locked_dir;
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  
> -	dout("invalidate_dir_request %p (D_COMPLETE, lease(s))\n", inode);
> +	dout("invalidate_dir_request %p (I_COMPLETE, lease(s))\n", inode);
>  	spin_lock(&ci->i_ceph_lock);
> -	ceph_dir_clear_complete(inode);
> +	ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
>  	ci->i_release_count++;
>  	spin_unlock(&ci->i_ceph_lock);
>  
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 4353ebc..efbcb56 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -204,7 +204,6 @@ struct ceph_inode_xattr {
>   * Ceph dentry state
>   */
>  struct ceph_dentry_info {
> -	unsigned long flags;
>  	struct ceph_mds_session *lease_session;
>  	u32 lease_gen, lease_shared_gen;
>  	u32 lease_seq;
> @@ -215,18 +214,6 @@ struct ceph_dentry_info {
>  	u64 offset;
>  };
>  
> -/*
> - * dentry flags
> - *
> - * The locking for D_COMPLETE is a bit odd:
> - *  - we can clear it at almost any time (see ceph_d_prune)
> - *  - it is only meaningful if:
> - *    - we hold dir inode i_ceph_lock
> - *    - we hold dir FILE_SHARED caps
> - *    - the dentry D_COMPLETE is set
> - */
> -#define CEPH_D_COMPLETE 1  /* if set, d_u.d_subdirs is complete directory */
> -
>  struct ceph_inode_xattrs_info {
>  	/*
>  	 * (still encoded) xattr blob. we avoid the overhead of parsing
> @@ -267,7 +254,7 @@ struct ceph_inode_info {
>  	struct timespec i_rctime;
>  	u64 i_rbytes, i_rfiles, i_rsubdirs;
>  	u64 i_files, i_subdirs;
> -	u64 i_max_offset;  /* largest readdir offset, set with D_COMPLETE */
> +	u64 i_max_offset;  /* largest readdir offset, set with I_COMPLETE */
>  
>  	struct rb_root i_fragtree;
>  	struct mutex i_fragtree_mutex;
> @@ -432,6 +419,7 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
>  /*
>   * Ceph inode.
>   */
> +#define CEPH_I_COMPLETE  1  /* we have complete directory cached */
>  #define CEPH_I_NODELAY   4  /* do not delay cap release */
>  #define CEPH_I_FLUSH     8  /* do not delay flush of dirty metadata */
>  #define CEPH_I_NOFLUSH  16  /* do not flush dirty caps */
> @@ -489,13 +477,6 @@ static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
>  }
>  
>  /*
> - * set/clear directory D_COMPLETE flag
> - */
> -void ceph_dir_set_complete(struct inode *inode);
> -void ceph_dir_clear_complete(struct inode *inode);
> -bool ceph_dir_test_complete(struct inode *inode);
> -
> -/*
>   * caps helpers
>   */
>  static inline bool __ceph_is_any_real_caps(struct ceph_inode_info *ci)
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] ceph: don't early drop Fw cap
  2013-03-01  6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
@ 2013-03-04 18:26   ` Gregory Farnum
  2013-03-05  1:57     ` Yan, Zheng
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-03-04 18:26 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil

On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
> cap dirty before data is copied to page cache and inode size is
> updated. The optimization avoids slow cap revocation caused by
> balance_dirty_pages(), but introduces inode size update race. If
> ceph_check_caps() flushes the dirty cap before the inode size is
> updated, MDS can miss the new inode size. So just remove the
> optimization.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  fs/ceph/file.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index a949805..28ef273 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
>         if (ceph_snap(inode) != CEPH_NOSNAP)
>                 return -EROFS;
>
> +       sb_start_write(inode->i_sb);
>  retry_snap:
> -       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> -               return -ENOSPC;
> +       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
> +               ret = -ENOSPC;
> +               goto out;
> +       }
>         __ceph_do_pending_vmtruncate(inode);
>         dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
>              inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> @@ -750,29 +753,10 @@ retry_snap:
>                 ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
>                         &iocb->ki_pos);
>         } else {
> -               /*
> -                * buffered write; drop Fw early to avoid slow
> -                * revocation if we get stuck on balance_dirty_pages
> -                */
> -               int dirty;
> -
> -               spin_lock(&ci->i_ceph_lock);
> -               dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
> -               spin_unlock(&ci->i_ceph_lock);
> -               ceph_put_cap_refs(ci, got);
> -
> -               ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
> -               if ((ret >= 0 || ret == -EIOCBQUEUED) &&
> -                   ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
> -                    || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> -                       err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> -                       if (err < 0)
> -                               ret = err;
> -               }
> -
> -               if (dirty)
> -                       __mark_inode_dirty(inode, dirty);
> -               goto out;
> +               mutex_lock(&inode->i_mutex);
> +               ret = __generic_file_aio_write(iocb, iov, nr_segs,
> +                                              &iocb->ki_pos);
> +               mutex_unlock(&inode->i_mutex);

Hmm, you're here passing in a different value than the removed
generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
Everything else is using the pos parameter so I rather expect that
should still be used here?
Also a quick skim of the interfaces makes me think that the two
versions aren't interchangeable — __generic_file_aio_write() also
handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
(I haven't checked deep enough to see if the unlocked version is
correct or not, but it does say that's what most filesystems should
use.)
-Greg


>         }
>
>         if (ret >= 0) {
> @@ -790,12 +774,20 @@ out_put:
>              ceph_cap_string(got));
>         ceph_put_cap_refs(ci, got);
>
> +       if (ret >= 0 &&
> +           ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> +            ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> +               err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> +               if (err < 0)
> +                       ret = err;
> +       }
>  out:
>         if (ret == -EOLDSNAPC) {
>                 dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
>                      inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
>                 goto retry_snap;
>         }
> +       sb_end_write(inode->i_sb);
>
>         return ret;
>  }
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] ceph: misc fixes
  2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
                   ` (6 preceding siblings ...)
  2013-03-01  6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
@ 2013-03-04 18:49 ` Gregory Farnum
       [not found]   ` <513568D1.60305@intel.com>
  7 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-03-04 18:49 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil

On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> These patches are also in:
>   git://github.com/ukernel/linux.git wip-ceph

1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's
concerns on 4 and my questions on 6 I'll be happy to pull these in. :)
-Greg

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] ceph: don't early drop Fw cap
  2013-03-04 18:26   ` Gregory Farnum
@ 2013-03-05  1:57     ` Yan, Zheng
  2013-03-05 20:42       ` Greg Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-05  1:57 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, Sage Weil

On 03/05/2013 02:26 AM, Gregory Farnum wrote:
> On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
>> cap dirty before data is copied to page cache and inode size is
>> updated. The optimization avoids slow cap revocation caused by
>> balance_dirty_pages(), but introduces inode size update race. If
>> ceph_check_caps() flushes the dirty cap before the inode size is
>> updated, MDS can miss the new inode size. So just remove the
>> optimization.
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  fs/ceph/file.c | 42 +++++++++++++++++-------------------------
>>  1 file changed, 17 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a949805..28ef273 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
>>         if (ceph_snap(inode) != CEPH_NOSNAP)
>>                 return -EROFS;
>>
>> +       sb_start_write(inode->i_sb);
>>  retry_snap:
>> -       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
>> -               return -ENOSPC;
>> +       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
>> +               ret = -ENOSPC;
>> +               goto out;
>> +       }
>>         __ceph_do_pending_vmtruncate(inode);
>>         dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
>>              inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> @@ -750,29 +753,10 @@ retry_snap:
>>                 ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
>>                         &iocb->ki_pos);
>>         } else {
>> -               /*
>> -                * buffered write; drop Fw early to avoid slow
>> -                * revocation if we get stuck on balance_dirty_pages
>> -                */
>> -               int dirty;
>> -
>> -               spin_lock(&ci->i_ceph_lock);
>> -               dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
>> -               spin_unlock(&ci->i_ceph_lock);
>> -               ceph_put_cap_refs(ci, got);
>> -
>> -               ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
>> -               if ((ret >= 0 || ret == -EIOCBQUEUED) &&
>> -                   ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
>> -                    || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> -                       err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> -                       if (err < 0)
>> -                               ret = err;
>> -               }
>> -
>> -               if (dirty)
>> -                       __mark_inode_dirty(inode, dirty);
>> -               goto out;
>> +               mutex_lock(&inode->i_mutex);
>> +               ret = __generic_file_aio_write(iocb, iov, nr_segs,
>> +                                              &iocb->ki_pos);
>> +               mutex_unlock(&inode->i_mutex);
> 
> Hmm, you're here passing in a different value than the removed
> generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
> Everything else is using the pos parameter so I rather expect that
> should still be used here?

They always have the same value, see the BUG_ON in generic_file_aio_write()

> Also a quick skim of the interfaces makes me think that the two
> versions aren't interchangeable — __generic_file_aio_write() also
> handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?

ceph has its own code that handles O_SYNC case. I want to make sb_start_write()
covers ceph_sync_write(), that's the reason I use __generic_file_aio_write() here.

Regards
Yan, Zheng

> (I haven't checked deep enough to see if the unlocked version is
> correct or not, but it does say that's what most filesystems should
> use.)
> -Greg
> 
> 
>>         }
>>
>>         if (ret >= 0) {
>> @@ -790,12 +774,20 @@ out_put:
>>              ceph_cap_string(got));
>>         ceph_put_cap_refs(ci, got);
>>
>> +       if (ret >= 0 &&
>> +           ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
>> +            ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> +               err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> +               if (err < 0)
>> +                       ret = err;
>> +       }
>>  out:
>>         if (ret == -EOLDSNAPC) {
>>                 dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
>>                      inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
>>                 goto retry_snap;
>>         }
>> +       sb_end_write(inode->i_sb);
>>
>>         return ret;
>>  }
>> --
>> 1.7.11.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] ceph: don't early drop Fw cap
  2013-03-05  1:57     ` Yan, Zheng
@ 2013-03-05 20:42       ` Greg Farnum
  0 siblings, 0 replies; 14+ messages in thread
From: Greg Farnum @ 2013-03-05 20:42 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil

On Monday, March 4, 2013 at 5:57 PM, Yan, Zheng wrote:
> On 03/05/2013 02:26 AM, Gregory Farnum wrote:
> > On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)> wrote:
> > > From: "Yan, Zheng" <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> > >  
> > > ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
> > > cap dirty before data is copied to page cache and inode size is
> > > updated. The optimization avoids slow cap revocation caused by
> > > balance_dirty_pages(), but introduces inode size update race. If
> > > ceph_check_caps() flushes the dirty cap before the inode size is
> > > updated, MDS can miss the new inode size. So just remove the
> > > optimization.
> > >  
> > > Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> > > ---
> > > fs/ceph/file.c | 42 +++++++++++++++++-------------------------
> > > 1 file changed, 17 insertions(+), 25 deletions(-)
> > >  
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index a949805..28ef273 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > if (ceph_snap(inode) != CEPH_NOSNAP)
> > > return -EROFS;
> > >  
> > > + sb_start_write(inode->i_sb);
> > > retry_snap:
> > > - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> > > - return -ENOSPC;
> > > + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
> > > + ret = -ENOSPC;
> > > + goto out;
> > > + }
> > > __ceph_do_pending_vmtruncate(inode);
> > > dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> > > inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> > > @@ -750,29 +753,10 @@ retry_snap:
> > > ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> > > &iocb->ki_pos);
> > > } else {
> > > - /*
> > > - * buffered write; drop Fw early to avoid slow
> > > - * revocation if we get stuck on balance_dirty_pages
> > > - */
> > > - int dirty;
> > > -
> > > - spin_lock(&ci->i_ceph_lock);
> > > - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
> > > - spin_unlock(&ci->i_ceph_lock);
> > > - ceph_put_cap_refs(ci, got);
> > > -
> > > - ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
> > > - if ((ret >= 0 || ret == -EIOCBQUEUED) &&
> > > - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
> > > - || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> > > - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> > > - if (err < 0)
> > > - ret = err;
> > > - }
> > > -
> > > - if (dirty)
> > > - __mark_inode_dirty(inode, dirty);
> > > - goto out;
> > > + mutex_lock(&inode->i_mutex);
> > > + ret = __generic_file_aio_write(iocb, iov, nr_segs,
> > > + &iocb->ki_pos);
> > > + mutex_unlock(&inode->i_mutex);
> >  
> >  
> >  
> > Hmm, you're here passing in a different value than the removed
> > generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
> > Everything else is using the pos parameter so I rather expect that
> > should still be used here?
>  
>  
>  
> They always have the same value, see the BUG_ON in generic_file_aio_write()
>  
> > Also a quick skim of the interfaces makes me think that the two
> > versions aren't interchangeable — __generic_file_aio_write() also
> > handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
>  
>  
>  
> ceph has its own code that handles O_SYNC case. I want to make sb_start_write()
> covers ceph_sync_write(), that's the reason I use __generic_file_aio_write() here.
>  
> Regards
> Yan, Zheng
>  
Ah, yep — sounds good!
-Greg  


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] ceph: misc fixes
       [not found]   ` <513568D1.60305@intel.com>
@ 2013-03-05 20:44     ` Greg Farnum
  0 siblings, 0 replies; 14+ messages in thread
From: Greg Farnum @ 2013-03-05 20:44 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

I've merged this series into the testing branch, with appropriate Reviewed-by tags from me (and Sage on #4). Thanks much for the code and helping me go through it. :) 
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Monday, March 4, 2013 at 7:38 PM, Yan, Zheng wrote:

> On 03/05/2013 02:49 AM, Gregory Farnum wrote:
> > On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> > > From: "Yan, Zheng" <zheng.z.yan@intel.com>
> > > 
> > > These patches are also in:
> > > git://github.com/ukernel/linux.git (http://github.com/ukernel/linux.git) wip-ceph
> > 
> > 
> > 
> > 1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's
> > concerns on 4 and my questions on 6 I'll be happy to pull these in. :)
> 
> 
> 
> I rechecked locking assumption in patch 4, nothing goes wrong.
> 
> Regards
> Yan, Zheng
> 
> > -Greg 



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-03-05 20:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
2013-03-01  6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
2013-03-01  6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
2013-03-01  6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
2013-03-01  6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
2013-03-04 18:19   ` Sage Weil
2013-03-01  6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
2013-03-01  6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
2013-03-04 18:26   ` Gregory Farnum
2013-03-05  1:57     ` Yan, Zheng
2013-03-05 20:42       ` Greg Farnum
2013-03-01  6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
     [not found]   ` <513568D1.60305@intel.com>
2013-03-05 20:44     ` Greg Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.