* [PATCH 0/7] ceph: misc fixes
@ 2013-03-01 6:46 Yan, Zheng
2013-03-01 6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
` (7 more replies)
0 siblings, 8 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
These patches are also in:
git://github.com/ukernel/linux.git wip-ceph
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/7] ceph: fix LSSNAP regression
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-01 6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
` (6 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
commit 6e8575faa8 makes parse_reply_info_extra() return -EIO for LSSNAP
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/mds_client.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d958420..608ffcf 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -265,7 +265,8 @@ static int parse_reply_info_extra(void **p, void *end,
{
if (info->head->op == CEPH_MDS_OP_GETFILELOCK)
return parse_reply_info_filelock(p, end, info, features);
- else if (info->head->op == CEPH_MDS_OP_READDIR)
+ else if (info->head->op == CEPH_MDS_OP_READDIR ||
+ info->head->op == CEPH_MDS_OP_LSSNAP)
return parse_reply_info_dir(p, end, info, features);
else if (info->head->op == CEPH_MDS_OP_CREATE)
return parse_reply_info_create(p, end, info, features);
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/7] ceph: queue cap release when trimming cap
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
2013-03-01 6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-01 6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
` (5 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
So the client will later send cap release message to MDS
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 6 +++---
fs/ceph/mds_client.c | 2 ++
fs/ceph/super.h | 2 ++
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 1e1e020..5d5c32b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -997,9 +997,9 @@ static int send_cap_msg(struct ceph_mds_session *session,
return 0;
}
-static void __queue_cap_release(struct ceph_mds_session *session,
- u64 ino, u64 cap_id, u32 migrate_seq,
- u32 issue_seq)
+void __queue_cap_release(struct ceph_mds_session *session,
+ u64 ino, u64 cap_id, u32 migrate_seq,
+ u32 issue_seq)
{
struct ceph_msg *msg;
struct ceph_mds_cap_release *head;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 608ffcf..ccc68b0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1197,6 +1197,8 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
session->s_trim_caps--;
if (oissued) {
/* we aren't the only cap.. just remove us */
+ __queue_cap_release(session, ceph_ino(inode), cap->cap_id,
+ cap->mseq, cap->issue_seq);
__ceph_remove_cap(cap);
} else {
/* try to drop referring dentries */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 604526a..4353ebc 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -755,6 +755,8 @@ static inline void ceph_remove_cap(struct ceph_cap *cap)
extern void ceph_put_cap(struct ceph_mds_client *mdsc,
struct ceph_cap *cap);
+extern void __queue_cap_release(struct ceph_mds_session *session, u64 ino,
+ u64 cap_id, u32 migrate_seq, u32 issue_seq);
extern void ceph_queue_caps_release(struct inode *inode);
extern int ceph_write_inode(struct inode *inode, struct writeback_control *wbc);
extern int ceph_fsync(struct file *file, loff_t start, loff_t end,
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/7] ceph: set mds_want according to cap import message
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
2013-03-01 6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
2013-03-01 6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-01 6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
` (4 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5d5c32b..61f3833 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -553,6 +553,7 @@ retry:
cap->implemented = 0;
cap->mds = mds;
cap->mds_wanted = 0;
+ cap->mseq = 0;
cap->ci = ci;
__insert_cap_node(ci, cap);
@@ -628,7 +629,10 @@ retry:
cap->cap_id = cap_id;
cap->issued = issued;
cap->implemented |= issued;
- cap->mds_wanted |= wanted;
+ if (mseq > cap->mseq)
+ cap->mds_wanted = wanted;
+ else
+ cap->mds_wanted |= wanted;
cap->seq = seq;
cap->issue_seq = seq;
cap->mseq = mseq;
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
` (2 preceding siblings ...)
2013-03-01 6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-04 18:19 ` Sage Weil
2013-03-01 6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
` (3 subsequent siblings)
7 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().
This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 8 ++++---
fs/ceph/dir.c | 62 ++++++++++------------------------------------------
fs/ceph/inode.c | 30 +++++++++++--------------
fs/ceph/mds_client.c | 6 ++---
fs/ceph/super.h | 23 ++-----------------
5 files changed, 34 insertions(+), 95 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 61f3833..76634f4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -490,15 +490,17 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
ci->i_rdcache_gen++;
/*
- * if we are newly issued FILE_SHARED, clear D_COMPLETE; we
+ * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
* don't know what happened to this directory while we didn't
* have the cap.
*/
if ((issued & CEPH_CAP_FILE_SHARED) &&
(had & CEPH_CAP_FILE_SHARED) == 0) {
ci->i_shared_gen++;
- if (S_ISDIR(ci->vfs_inode.i_mode))
- ceph_dir_clear_complete(&ci->vfs_inode);
+ if (S_ISDIR(ci->vfs_inode.i_mode)) {
+ dout(" marking %p NOT complete\n", &ci->vfs_inode);
+ ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
+ }
}
}
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 8c1aabe..76821be 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
* falling back to a "normal" sync readdir if any dentries in the dir
* are dropped.
*
- * D_COMPLETE tells indicates we have all dentries in the dir. It is
+ * I_COMPLETE tells indicates we have all dentries in the dir. It is
* defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
* the MDS if/when the directory is modified).
*/
@@ -198,8 +198,8 @@ more:
filp->f_pos++;
/* make sure a dentry wasn't dropped while we didn't have parent lock */
- if (!ceph_dir_test_complete(dir)) {
- dout(" lost D_COMPLETE on %p; falling back to mds\n", dir);
+ if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
+ dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
err = -EAGAIN;
goto out;
}
@@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
if ((filp->f_pos == 2 || fi->dentry) &&
!ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
ceph_snap(inode) != CEPH_SNAPDIR &&
- ceph_dir_test_complete(inode) &&
+ (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
spin_unlock(&ci->i_ceph_lock);
err = __dcache_readdir(filp, dirent, filldir);
@@ -350,7 +350,7 @@ more:
if (!req->r_did_prepopulate) {
dout("readdir !did_prepopulate");
- fi->dir_release_count--; /* preclude D_COMPLETE */
+ fi->dir_release_count--; /* preclude I_COMPLETE */
}
/* note next offset and last dentry name */
@@ -429,7 +429,8 @@ more:
*/
spin_lock(&ci->i_ceph_lock);
if (ci->i_release_count == fi->dir_release_count) {
- ceph_dir_set_complete(inode);
+ dout(" marking %p complete\n", inode);
+ ci->i_ceph_flags |= CEPH_I_COMPLETE;
ci->i_max_offset = filp->f_pos;
}
spin_unlock(&ci->i_ceph_lock);
@@ -604,7 +605,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
fsc->mount_options->snapdir_name,
dentry->d_name.len) &&
!is_root_ceph_dentry(dir, dentry) &&
- ceph_dir_test_complete(dir) &&
+ (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
(__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
spin_unlock(&ci->i_ceph_lock);
dout(" dir %p complete, -ENOENT\n", dir);
@@ -908,7 +909,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
*/
/* d_move screws up d_subdirs order */
- ceph_dir_clear_complete(new_dir);
+ ceph_i_clear(new_dir, CEPH_I_COMPLETE);
d_move(old_dentry, new_dentry);
@@ -1065,44 +1066,6 @@ static int ceph_snapdir_d_revalidate(struct dentry *dentry,
}
/*
- * Set/clear/test dir complete flag on the dir's dentry.
- */
-void ceph_dir_set_complete(struct inode *inode)
-{
- struct dentry *dentry = d_find_any_alias(inode);
-
- if (dentry && ceph_dentry(dentry) &&
- ceph_test_mount_opt(ceph_sb_to_client(dentry->d_sb), DCACHE)) {
- dout(" marking %p (%p) complete\n", inode, dentry);
- set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
- }
- dput(dentry);
-}
-
-void ceph_dir_clear_complete(struct inode *inode)
-{
- struct dentry *dentry = d_find_any_alias(inode);
-
- if (dentry && ceph_dentry(dentry)) {
- dout(" marking %p (%p) complete\n", inode, dentry);
- set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
- }
- dput(dentry);
-}
-
-bool ceph_dir_test_complete(struct inode *inode)
-{
- struct dentry *dentry = d_find_any_alias(inode);
-
- if (dentry && ceph_dentry(dentry)) {
- dout(" marking %p (%p) NOT complete\n", inode, dentry);
- clear_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
- }
- dput(dentry);
- return false;
-}
-
-/*
* When the VFS prunes a dentry from the cache, we need to clear the
* complete flag on the parent directory.
*
@@ -1110,15 +1073,13 @@ bool ceph_dir_test_complete(struct inode *inode)
*/
static void ceph_d_prune(struct dentry *dentry)
{
- struct ceph_dentry_info *di;
-
dout("ceph_d_prune %p\n", dentry);
/* do we have a valid parent? */
if (IS_ROOT(dentry))
return;
- /* if we are not hashed, we don't affect D_COMPLETE */
+ /* if we are not hashed, we don't affect I_COMPLETE */
if (d_unhashed(dentry))
return;
@@ -1126,8 +1087,7 @@ static void ceph_d_prune(struct dentry *dentry)
* we hold d_lock, so d_parent is stable, and d_fsdata is never
* cleared until d_release
*/
- di = ceph_dentry(dentry->d_parent);
- clear_bit(CEPH_D_COMPLETE, &di->flags);
+ ceph_i_clear(dentry->d_parent->d_inode, CEPH_I_COMPLETE);
}
/*
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2971eaa..42c5769 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -561,7 +561,6 @@ static int fill_inode(struct inode *inode,
struct ceph_inode_info *ci = ceph_inode(inode);
int i;
int issued = 0, implemented;
- int updating_inode = 0;
struct timespec mtime, atime, ctime;
u32 nsplits;
struct ceph_buffer *xattr_blob = NULL;
@@ -601,7 +600,6 @@ static int fill_inode(struct inode *inode,
(ci->i_version & ~1) >= le64_to_cpu(info->version))
goto no_change;
- updating_inode = 1;
issued = __ceph_caps_issued(ci, &implemented);
issued |= implemented | __ceph_caps_dirty(ci);
@@ -716,6 +714,17 @@ static int fill_inode(struct inode *inode,
ceph_vinop(inode), inode->i_mode);
}
+ /* set dir completion flag? */
+ if (S_ISDIR(inode->i_mode) &&
+ ci->i_files == 0 && ci->i_subdirs == 0 &&
+ ceph_snap(inode) == CEPH_NOSNAP &&
+ (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
+ (issued & CEPH_CAP_FILE_EXCL) == 0 &&
+ (ci->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
+ dout(" marking %p complete (empty)\n", inode);
+ ci->i_ceph_flags |= CEPH_I_COMPLETE;
+ ci->i_max_offset = 2;
+ }
no_change:
spin_unlock(&ci->i_ceph_lock);
@@ -766,19 +775,6 @@ no_change:
__ceph_get_fmode(ci, cap_fmode);
}
- /* set dir completion flag? */
- if (S_ISDIR(inode->i_mode) &&
- updating_inode && /* didn't jump to no_change */
- ci->i_files == 0 && ci->i_subdirs == 0 &&
- ceph_snap(inode) == CEPH_NOSNAP &&
- (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
- (issued & CEPH_CAP_FILE_EXCL) == 0 &&
- !ceph_dir_test_complete(inode)) {
- dout(" marking %p complete (empty)\n", inode);
- ceph_dir_set_complete(inode);
- ci->i_max_offset = 2;
- }
-
/* update delegation info? */
if (dirinfo)
ceph_fill_dirfrag(inode, dirinfo);
@@ -860,7 +856,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
di = ceph_dentry(dn);
spin_lock(&ci->i_ceph_lock);
- if (!ceph_dir_test_complete(inode)) {
+ if ((ceph_inode(inode)->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
spin_unlock(&ci->i_ceph_lock);
return;
}
@@ -1065,7 +1061,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
* d_move() puts the renamed dentry at the end of
* d_subdirs. We need to assign it an appropriate
* directory offset so we can behave when holding
- * D_COMPLETE.
+ * I_COMPLETE.
*/
ceph_set_dentry_offset(req->r_old_dentry);
dout("dn %p gets new offset %lld\n", req->r_old_dentry,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccc68b0..e52b0fb 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2029,7 +2029,7 @@ out:
}
/*
- * Invalidate dir D_COMPLETE, dentry lease state on an aborted MDS
+ * Invalidate dir I_COMPLETE, dentry lease state on an aborted MDS
* namespace request.
*/
void ceph_invalidate_dir_request(struct ceph_mds_request *req)
@@ -2037,9 +2037,9 @@ void ceph_invalidate_dir_request(struct ceph_mds_request *req)
struct inode *inode = req->r_locked_dir;
struct ceph_inode_info *ci = ceph_inode(inode);
- dout("invalidate_dir_request %p (D_COMPLETE, lease(s))\n", inode);
+ dout("invalidate_dir_request %p (I_COMPLETE, lease(s))\n", inode);
spin_lock(&ci->i_ceph_lock);
- ceph_dir_clear_complete(inode);
+ ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
ci->i_release_count++;
spin_unlock(&ci->i_ceph_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 4353ebc..efbcb56 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -204,7 +204,6 @@ struct ceph_inode_xattr {
* Ceph dentry state
*/
struct ceph_dentry_info {
- unsigned long flags;
struct ceph_mds_session *lease_session;
u32 lease_gen, lease_shared_gen;
u32 lease_seq;
@@ -215,18 +214,6 @@ struct ceph_dentry_info {
u64 offset;
};
-/*
- * dentry flags
- *
- * The locking for D_COMPLETE is a bit odd:
- * - we can clear it at almost any time (see ceph_d_prune)
- * - it is only meaningful if:
- * - we hold dir inode i_ceph_lock
- * - we hold dir FILE_SHARED caps
- * - the dentry D_COMPLETE is set
- */
-#define CEPH_D_COMPLETE 1 /* if set, d_u.d_subdirs is complete directory */
-
struct ceph_inode_xattrs_info {
/*
* (still encoded) xattr blob. we avoid the overhead of parsing
@@ -267,7 +254,7 @@ struct ceph_inode_info {
struct timespec i_rctime;
u64 i_rbytes, i_rfiles, i_rsubdirs;
u64 i_files, i_subdirs;
- u64 i_max_offset; /* largest readdir offset, set with D_COMPLETE */
+ u64 i_max_offset; /* largest readdir offset, set with I_COMPLETE */
struct rb_root i_fragtree;
struct mutex i_fragtree_mutex;
@@ -432,6 +419,7 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* Ceph inode.
*/
+#define CEPH_I_COMPLETE 1 /* we have complete directory cached */
#define CEPH_I_NODELAY 4 /* do not delay cap release */
#define CEPH_I_FLUSH 8 /* do not delay flush of dirty metadata */
#define CEPH_I_NOFLUSH 16 /* do not flush dirty caps */
@@ -489,13 +477,6 @@ static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
}
/*
- * set/clear directory D_COMPLETE flag
- */
-void ceph_dir_set_complete(struct inode *inode);
-void ceph_dir_clear_complete(struct inode *inode);
-bool ceph_dir_test_complete(struct inode *inode);
-
-/*
* caps helpers
*/
static inline bool __ceph_is_any_real_caps(struct ceph_inode_info *ci)
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 5/7] ceph: revert commit 22cddde104
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
` (3 preceding siblings ...)
2013-03-01 6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-01 6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
` (2 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
commit 22cddde104 breaks the atomicity of write operation, it also
introduces a deadlock between write and truncate.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/addr.c | 51 +++---------------------------------
fs/ceph/file.c | 73 +++++++++++++++++++++++++++++++---------------------
fs/ceph/mds_client.c | 1 +
3 files changed, 48 insertions(+), 77 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cfef3e0..d662025 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1067,51 +1067,23 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = file->f_dentry->d_inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
- struct ceph_file_info *fi = file->private_data;
struct page *page;
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
- int r, want, got = 0;
-
- if (fi->fmode & CEPH_FILE_MODE_LAZY)
- want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
- else
- want = CEPH_CAP_FILE_BUFFER;
-
- dout("write_begin %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
- inode, ceph_vinop(inode), pos, len, inode->i_size);
- r = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos+len);
- if (r < 0)
- return r;
- dout("write_begin %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
- if (!(got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO))) {
- ceph_put_cap_refs(ci, got);
- return -EAGAIN;
- }
+ int r;
do {
/* get a page */
page = grab_cache_page_write_begin(mapping, index, 0);
- if (!page) {
- r = -ENOMEM;
- break;
- }
+ if (!page)
+ return -ENOMEM;
+ *pagep = page;
dout("write_begin file %p inode %p page %p %d~%d\n", file,
inode, page, (int)pos, (int)len);
r = ceph_update_writeable_page(file, pos, len, page);
- if (r)
- page_cache_release(page);
} while (r == -EAGAIN);
- if (r) {
- ceph_put_cap_refs(ci, got);
- } else {
- *pagep = page;
- *(int *)fsdata = got;
- }
return r;
}
@@ -1125,12 +1097,10 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
struct page *page, void *fsdata)
{
struct inode *inode = file->f_dentry->d_inode;
- struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
struct ceph_mds_client *mdsc = fsc->mdsc;
unsigned from = pos & (PAGE_CACHE_SIZE - 1);
int check_cap = 0;
- int got = (unsigned long)fsdata;
dout("write_end file %p inode %p page %p %d~%d (%d)\n", file,
inode, page, (int)pos, (int)copied, (int)len);
@@ -1153,19 +1123,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
up_read(&mdsc->snap_rwsem);
page_cache_release(page);
- if (copied > 0) {
- int dirty;
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
- spin_unlock(&ci->i_ceph_lock);
- if (dirty)
- __mark_inode_dirty(inode, dirty);
- }
-
- dout("write_end %p %llx.%llx %llu~%u dropping cap refs on %s\n",
- inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
- ceph_put_cap_refs(ci, got);
-
if (check_cap)
ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY, NULL);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 9c4325e..a949805 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -718,53 +718,63 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct ceph_osd_client *osdc =
&ceph_sb_to_client(inode->i_sb)->client->osdc;
loff_t endoff = pos + iov->iov_len;
- int got = 0;
- int ret, err, written;
+ int want, got = 0;
+ int ret, err;
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
retry_snap:
- written = 0;
if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
return -ENOSPC;
__ceph_do_pending_vmtruncate(inode);
+ dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ inode->i_size);
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
+ else
+ want = CEPH_CAP_FILE_BUFFER;
+ ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
+ if (ret < 0)
+ goto out_put;
- /*
- * try to do a buffered write. if we don't have sufficient
- * caps, we'll get -EAGAIN from generic_file_aio_write, or a
- * short write if we only get caps for some pages.
- */
- if (!(iocb->ki_filp->f_flags & O_DIRECT) &&
- !(inode->i_sb->s_flags & MS_SYNCHRONOUS) &&
- !(fi->flags & CEPH_F_SYNC)) {
- ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
- if (ret >= 0)
- written = ret;
+ dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ ceph_cap_string(got));
+
+ if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
+ (iocb->ki_filp->f_flags & O_DIRECT) ||
+ (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
+ (fi->flags & CEPH_F_SYNC)) {
+ ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
+ &iocb->ki_pos);
+ } else {
+ /*
+ * buffered write; drop Fw early to avoid slow
+ * revocation if we get stuck on balance_dirty_pages
+ */
+ int dirty;
+ spin_lock(&ci->i_ceph_lock);
+ dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
+ spin_unlock(&ci->i_ceph_lock);
+ ceph_put_cap_refs(ci, got);
+
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
if ((ret >= 0 || ret == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
|| ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
- err = vfs_fsync_range(file, pos, pos + written - 1, 1);
+ err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
if (err < 0)
ret = err;
}
- if ((ret < 0 && ret != -EAGAIN) || pos + written >= endoff)
- goto out;
- }
- dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
- inode, ceph_vinop(inode), pos + written,
- (unsigned)iov->iov_len - written, inode->i_size);
- ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, 0, &got, endoff);
- if (ret < 0)
+ if (dirty)
+ __mark_inode_dirty(inode, dirty);
goto out;
+ }
- dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), pos + written,
- (unsigned)iov->iov_len - written, ceph_cap_string(got));
- ret = ceph_sync_write(file, iov->iov_base + written,
- iov->iov_len - written, &iocb->ki_pos);
if (ret >= 0) {
int dirty;
spin_lock(&ci->i_ceph_lock);
@@ -773,10 +783,13 @@ retry_snap:
if (dirty)
__mark_inode_dirty(inode, dirty);
}
+
+out_put:
dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
- inode, ceph_vinop(inode), pos + written,
- (unsigned)iov->iov_len - written, ceph_cap_string(got));
+ inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+ ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
+
out:
if (ret == -EOLDSNAPC) {
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e52b0fb..ab899c8 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1916,6 +1916,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
req = list_entry(tmp_list.next,
struct ceph_mds_request, r_wait);
list_del_init(&req->r_wait);
+ dout(" wake request %p tid %llu\n", req, req->r_tid);
__do_request(mdsc, req);
}
}
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 6/7] ceph: don't early drop Fw cap
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
` (4 preceding siblings ...)
2013-03-01 6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-04 18:26 ` Gregory Farnum
2013-03-01 6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
7 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/file.c | 42 +++++++++++++++++-------------------------
1 file changed, 17 insertions(+), 25 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a949805..28ef273 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
+ sb_start_write(inode->i_sb);
retry_snap:
- if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
- return -ENOSPC;
+ if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
+ ret = -ENOSPC;
+ goto out;
+ }
__ceph_do_pending_vmtruncate(inode);
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
@@ -750,29 +753,10 @@ retry_snap:
ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
&iocb->ki_pos);
} else {
- /*
- * buffered write; drop Fw early to avoid slow
- * revocation if we get stuck on balance_dirty_pages
- */
- int dirty;
-
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
- spin_unlock(&ci->i_ceph_lock);
- ceph_put_cap_refs(ci, got);
-
- ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
- if ((ret >= 0 || ret == -EIOCBQUEUED) &&
- ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
- || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
- err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
- if (err < 0)
- ret = err;
- }
-
- if (dirty)
- __mark_inode_dirty(inode, dirty);
- goto out;
+ mutex_lock(&inode->i_mutex);
+ ret = __generic_file_aio_write(iocb, iov, nr_segs,
+ &iocb->ki_pos);
+ mutex_unlock(&inode->i_mutex);
}
if (ret >= 0) {
@@ -790,12 +774,20 @@ out_put:
ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
+ if (ret >= 0 &&
+ ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
+ ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
+ err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
+ if (err < 0)
+ ret = err;
+ }
out:
if (ret == -EOLDSNAPC) {
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
goto retry_snap;
}
+ sb_end_write(inode->i_sb);
return ret;
}
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
` (5 preceding siblings ...)
2013-03-01 6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
@ 2013-03-01 6:46 ` Yan, Zheng
2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
7 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2013-03-01 6:46 UTC (permalink / raw)
To: ceph-devel, sage; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
make __ceph_do_pending_vmtruncate() acquire i_mutex if the caller
does not hold the mutex, so ceph_aio_read() can call it safely.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/file.c | 6 +++---
fs/ceph/inode.c | 18 +++++++++---------
fs/ceph/super.h | 2 +-
3 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 28ef273..b9eedd4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -653,7 +653,7 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
inode, ceph_vinop(inode), pos, (unsigned)len, inode);
again:
- __ceph_do_pending_vmtruncate(inode);
+ __ceph_do_pending_vmtruncate(inode, true);
if (fi->fmode & CEPH_FILE_MODE_LAZY)
want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
else
@@ -730,7 +730,7 @@ retry_snap:
ret = -ENOSPC;
goto out;
}
- __ceph_do_pending_vmtruncate(inode);
+ __ceph_do_pending_vmtruncate(inode, true);
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
inode->i_size);
@@ -801,7 +801,7 @@ static loff_t ceph_llseek(struct file *file, loff_t offset, int whence)
int ret;
mutex_lock(&inode->i_mutex);
- __ceph_do_pending_vmtruncate(inode);
+ __ceph_do_pending_vmtruncate(inode, false);
if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) {
ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42c5769..2b3fee7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1416,7 +1416,7 @@ out:
/*
- * called by trunc_wq; take i_mutex ourselves
+ * called by trunc_wq;
*
* We also truncate in a separate thread as well.
*/
@@ -1427,9 +1427,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
struct inode *inode = &ci->vfs_inode;
dout("vmtruncate_work %p\n", inode);
- mutex_lock(&inode->i_mutex);
- __ceph_do_pending_vmtruncate(inode);
- mutex_unlock(&inode->i_mutex);
+ __ceph_do_pending_vmtruncate(inode, true);
iput(inode);
}
@@ -1453,12 +1451,10 @@ void ceph_queue_vmtruncate(struct inode *inode)
}
/*
- * called with i_mutex held.
- *
* Make sure any pending truncation is applied before doing anything
* that may depend on it.
*/
-void __ceph_do_pending_vmtruncate(struct inode *inode)
+void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock)
{
struct ceph_inode_info *ci = ceph_inode(inode);
u64 to;
@@ -1491,7 +1487,11 @@ retry:
ci->i_truncate_pending, to);
spin_unlock(&ci->i_ceph_lock);
+ if (needlock)
+ mutex_lock(&inode->i_mutex);
truncate_inode_pages(inode->i_mapping, to);
+ if (needlock)
+ mutex_unlock(&inode->i_mutex);
spin_lock(&ci->i_ceph_lock);
if (to == ci->i_truncate_size) {
@@ -1544,7 +1544,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
- __ceph_do_pending_vmtruncate(inode);
+ __ceph_do_pending_vmtruncate(inode, false);
err = inode_change_ok(inode, attr);
if (err != 0)
@@ -1722,7 +1722,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
ceph_cap_string(dirtied), mask);
ceph_mdsc_put_request(req);
- __ceph_do_pending_vmtruncate(inode);
+ __ceph_do_pending_vmtruncate(inode, false);
return err;
out:
spin_unlock(&ci->i_ceph_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index efbcb56..e5f1875 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -694,7 +694,7 @@ extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
extern int ceph_inode_holds_cap(struct inode *inode, int mask);
extern int ceph_inode_set_size(struct inode *inode, loff_t size);
-extern void __ceph_do_pending_vmtruncate(struct inode *inode);
+extern void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock);
extern void ceph_queue_vmtruncate(struct inode *inode);
extern void ceph_queue_invalidate(struct inode *inode);
--
1.7.11.7
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
2013-03-01 6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
@ 2013-03-04 18:19 ` Sage Weil
0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2013-03-04 18:19 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel
On Fri, 1 Mar 2013, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> commit c6ffe10015 moved the flag that tracks if the dcache contents
> for a directory are complete to dentry. The problem is there are
> lots of places that use ceph_dir_{set,clear,test}_complete() while
> holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
> sleep because they call dput().
>
> This patch basically reverts that commit. For ceph_d_prune(), it's
> called with both the dentry to prune and the parent dentry are
> locked. So it's safe to access the parent dentry's d_inode and
> clear I_COMPLETE flag.
I'm trying to remember why I thought the D_COMPETE flag was necessary.
Maybe I didn't think that i_ceph_lock could safely nest inside of d_lock?
Or that the parent was locked?
Anyway, assuming both of those things are in fact true, this looks good
(and simpler :).
sage
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/caps.c | 8 ++++---
> fs/ceph/dir.c | 62 ++++++++++------------------------------------------
> fs/ceph/inode.c | 30 +++++++++++--------------
> fs/ceph/mds_client.c | 6 ++---
> fs/ceph/super.h | 23 ++-----------------
> 5 files changed, 34 insertions(+), 95 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 61f3833..76634f4 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -490,15 +490,17 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
> ci->i_rdcache_gen++;
>
> /*
> - * if we are newly issued FILE_SHARED, clear D_COMPLETE; we
> + * if we are newly issued FILE_SHARED, clear I_COMPLETE; we
> * don't know what happened to this directory while we didn't
> * have the cap.
> */
> if ((issued & CEPH_CAP_FILE_SHARED) &&
> (had & CEPH_CAP_FILE_SHARED) == 0) {
> ci->i_shared_gen++;
> - if (S_ISDIR(ci->vfs_inode.i_mode))
> - ceph_dir_clear_complete(&ci->vfs_inode);
> + if (S_ISDIR(ci->vfs_inode.i_mode)) {
> + dout(" marking %p NOT complete\n", &ci->vfs_inode);
> + ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
> + }
> }
> }
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 8c1aabe..76821be 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
> * falling back to a "normal" sync readdir if any dentries in the dir
> * are dropped.
> *
> - * D_COMPLETE tells indicates we have all dentries in the dir. It is
> + * I_COMPLETE tells indicates we have all dentries in the dir. It is
> * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
> * the MDS if/when the directory is modified).
> */
> @@ -198,8 +198,8 @@ more:
> filp->f_pos++;
>
> /* make sure a dentry wasn't dropped while we didn't have parent lock */
> - if (!ceph_dir_test_complete(dir)) {
> - dout(" lost D_COMPLETE on %p; falling back to mds\n", dir);
> + if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
> + dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
> err = -EAGAIN;
> goto out;
> }
> @@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
> if ((filp->f_pos == 2 || fi->dentry) &&
> !ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
> ceph_snap(inode) != CEPH_SNAPDIR &&
> - ceph_dir_test_complete(inode) &&
> + (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
> __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
> spin_unlock(&ci->i_ceph_lock);
> err = __dcache_readdir(filp, dirent, filldir);
> @@ -350,7 +350,7 @@ more:
>
> if (!req->r_did_prepopulate) {
> dout("readdir !did_prepopulate");
> - fi->dir_release_count--; /* preclude D_COMPLETE */
> + fi->dir_release_count--; /* preclude I_COMPLETE */
> }
>
> /* note next offset and last dentry name */
> @@ -429,7 +429,8 @@ more:
> */
> spin_lock(&ci->i_ceph_lock);
> if (ci->i_release_count == fi->dir_release_count) {
> - ceph_dir_set_complete(inode);
> + dout(" marking %p complete\n", inode);
> + ci->i_ceph_flags |= CEPH_I_COMPLETE;
> ci->i_max_offset = filp->f_pos;
> }
> spin_unlock(&ci->i_ceph_lock);
> @@ -604,7 +605,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
> fsc->mount_options->snapdir_name,
> dentry->d_name.len) &&
> !is_root_ceph_dentry(dir, dentry) &&
> - ceph_dir_test_complete(dir) &&
> + (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
> (__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
> spin_unlock(&ci->i_ceph_lock);
> dout(" dir %p complete, -ENOENT\n", dir);
> @@ -908,7 +909,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
> */
>
> /* d_move screws up d_subdirs order */
> - ceph_dir_clear_complete(new_dir);
> + ceph_i_clear(new_dir, CEPH_I_COMPLETE);
>
> d_move(old_dentry, new_dentry);
>
> @@ -1065,44 +1066,6 @@ static int ceph_snapdir_d_revalidate(struct dentry *dentry,
> }
>
> /*
> - * Set/clear/test dir complete flag on the dir's dentry.
> - */
> -void ceph_dir_set_complete(struct inode *inode)
> -{
> - struct dentry *dentry = d_find_any_alias(inode);
> -
> - if (dentry && ceph_dentry(dentry) &&
> - ceph_test_mount_opt(ceph_sb_to_client(dentry->d_sb), DCACHE)) {
> - dout(" marking %p (%p) complete\n", inode, dentry);
> - set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> - }
> - dput(dentry);
> -}
> -
> -void ceph_dir_clear_complete(struct inode *inode)
> -{
> - struct dentry *dentry = d_find_any_alias(inode);
> -
> - if (dentry && ceph_dentry(dentry)) {
> - dout(" marking %p (%p) complete\n", inode, dentry);
> - set_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> - }
> - dput(dentry);
> -}
> -
> -bool ceph_dir_test_complete(struct inode *inode)
> -{
> - struct dentry *dentry = d_find_any_alias(inode);
> -
> - if (dentry && ceph_dentry(dentry)) {
> - dout(" marking %p (%p) NOT complete\n", inode, dentry);
> - clear_bit(CEPH_D_COMPLETE, &ceph_dentry(dentry)->flags);
> - }
> - dput(dentry);
> - return false;
> -}
> -
> -/*
> * When the VFS prunes a dentry from the cache, we need to clear the
> * complete flag on the parent directory.
> *
> @@ -1110,15 +1073,13 @@ bool ceph_dir_test_complete(struct inode *inode)
> */
> static void ceph_d_prune(struct dentry *dentry)
> {
> - struct ceph_dentry_info *di;
> -
> dout("ceph_d_prune %p\n", dentry);
>
> /* do we have a valid parent? */
> if (IS_ROOT(dentry))
> return;
>
> - /* if we are not hashed, we don't affect D_COMPLETE */
> + /* if we are not hashed, we don't affect I_COMPLETE */
> if (d_unhashed(dentry))
> return;
>
> @@ -1126,8 +1087,7 @@ static void ceph_d_prune(struct dentry *dentry)
> * we hold d_lock, so d_parent is stable, and d_fsdata is never
> * cleared until d_release
> */
> - di = ceph_dentry(dentry->d_parent);
> - clear_bit(CEPH_D_COMPLETE, &di->flags);
> + ceph_i_clear(dentry->d_parent->d_inode, CEPH_I_COMPLETE);
> }
>
> /*
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 2971eaa..42c5769 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -561,7 +561,6 @@ static int fill_inode(struct inode *inode,
> struct ceph_inode_info *ci = ceph_inode(inode);
> int i;
> int issued = 0, implemented;
> - int updating_inode = 0;
> struct timespec mtime, atime, ctime;
> u32 nsplits;
> struct ceph_buffer *xattr_blob = NULL;
> @@ -601,7 +600,6 @@ static int fill_inode(struct inode *inode,
> (ci->i_version & ~1) >= le64_to_cpu(info->version))
> goto no_change;
>
> - updating_inode = 1;
> issued = __ceph_caps_issued(ci, &implemented);
> issued |= implemented | __ceph_caps_dirty(ci);
>
> @@ -716,6 +714,17 @@ static int fill_inode(struct inode *inode,
> ceph_vinop(inode), inode->i_mode);
> }
>
> + /* set dir completion flag? */
> + if (S_ISDIR(inode->i_mode) &&
> + ci->i_files == 0 && ci->i_subdirs == 0 &&
> + ceph_snap(inode) == CEPH_NOSNAP &&
> + (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
> + (issued & CEPH_CAP_FILE_EXCL) == 0 &&
> + (ci->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
> + dout(" marking %p complete (empty)\n", inode);
> + ci->i_ceph_flags |= CEPH_I_COMPLETE;
> + ci->i_max_offset = 2;
> + }
> no_change:
> spin_unlock(&ci->i_ceph_lock);
>
> @@ -766,19 +775,6 @@ no_change:
> __ceph_get_fmode(ci, cap_fmode);
> }
>
> - /* set dir completion flag? */
> - if (S_ISDIR(inode->i_mode) &&
> - updating_inode && /* didn't jump to no_change */
> - ci->i_files == 0 && ci->i_subdirs == 0 &&
> - ceph_snap(inode) == CEPH_NOSNAP &&
> - (le32_to_cpu(info->cap.caps) & CEPH_CAP_FILE_SHARED) &&
> - (issued & CEPH_CAP_FILE_EXCL) == 0 &&
> - !ceph_dir_test_complete(inode)) {
> - dout(" marking %p complete (empty)\n", inode);
> - ceph_dir_set_complete(inode);
> - ci->i_max_offset = 2;
> - }
> -
> /* update delegation info? */
> if (dirinfo)
> ceph_fill_dirfrag(inode, dirinfo);
> @@ -860,7 +856,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
> di = ceph_dentry(dn);
>
> spin_lock(&ci->i_ceph_lock);
> - if (!ceph_dir_test_complete(inode)) {
> + if ((ceph_inode(inode)->i_ceph_flags & CEPH_I_COMPLETE) == 0) {
> spin_unlock(&ci->i_ceph_lock);
> return;
> }
> @@ -1065,7 +1061,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req,
> * d_move() puts the renamed dentry at the end of
> * d_subdirs. We need to assign it an appropriate
> * directory offset so we can behave when holding
> - * D_COMPLETE.
> + * I_COMPLETE.
> */
> ceph_set_dentry_offset(req->r_old_dentry);
> dout("dn %p gets new offset %lld\n", req->r_old_dentry,
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index ccc68b0..e52b0fb 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2029,7 +2029,7 @@ out:
> }
>
> /*
> - * Invalidate dir D_COMPLETE, dentry lease state on an aborted MDS
> + * Invalidate dir I_COMPLETE, dentry lease state on an aborted MDS
> * namespace request.
> */
> void ceph_invalidate_dir_request(struct ceph_mds_request *req)
> @@ -2037,9 +2037,9 @@ void ceph_invalidate_dir_request(struct ceph_mds_request *req)
> struct inode *inode = req->r_locked_dir;
> struct ceph_inode_info *ci = ceph_inode(inode);
>
> - dout("invalidate_dir_request %p (D_COMPLETE, lease(s))\n", inode);
> + dout("invalidate_dir_request %p (I_COMPLETE, lease(s))\n", inode);
> spin_lock(&ci->i_ceph_lock);
> - ceph_dir_clear_complete(inode);
> + ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
> ci->i_release_count++;
> spin_unlock(&ci->i_ceph_lock);
>
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 4353ebc..efbcb56 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -204,7 +204,6 @@ struct ceph_inode_xattr {
> * Ceph dentry state
> */
> struct ceph_dentry_info {
> - unsigned long flags;
> struct ceph_mds_session *lease_session;
> u32 lease_gen, lease_shared_gen;
> u32 lease_seq;
> @@ -215,18 +214,6 @@ struct ceph_dentry_info {
> u64 offset;
> };
>
> -/*
> - * dentry flags
> - *
> - * The locking for D_COMPLETE is a bit odd:
> - * - we can clear it at almost any time (see ceph_d_prune)
> - * - it is only meaningful if:
> - * - we hold dir inode i_ceph_lock
> - * - we hold dir FILE_SHARED caps
> - * - the dentry D_COMPLETE is set
> - */
> -#define CEPH_D_COMPLETE 1 /* if set, d_u.d_subdirs is complete directory */
> -
> struct ceph_inode_xattrs_info {
> /*
> * (still encoded) xattr blob. we avoid the overhead of parsing
> @@ -267,7 +254,7 @@ struct ceph_inode_info {
> struct timespec i_rctime;
> u64 i_rbytes, i_rfiles, i_rsubdirs;
> u64 i_files, i_subdirs;
> - u64 i_max_offset; /* largest readdir offset, set with D_COMPLETE */
> + u64 i_max_offset; /* largest readdir offset, set with I_COMPLETE */
>
> struct rb_root i_fragtree;
> struct mutex i_fragtree_mutex;
> @@ -432,6 +419,7 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> /*
> * Ceph inode.
> */
> +#define CEPH_I_COMPLETE 1 /* we have complete directory cached */
> #define CEPH_I_NODELAY 4 /* do not delay cap release */
> #define CEPH_I_FLUSH 8 /* do not delay flush of dirty metadata */
> #define CEPH_I_NOFLUSH 16 /* do not flush dirty caps */
> @@ -489,13 +477,6 @@ static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
> }
>
> /*
> - * set/clear directory D_COMPLETE flag
> - */
> -void ceph_dir_set_complete(struct inode *inode);
> -void ceph_dir_clear_complete(struct inode *inode);
> -bool ceph_dir_test_complete(struct inode *inode);
> -
> -/*
> * caps helpers
> */
> static inline bool __ceph_is_any_real_caps(struct ceph_inode_info *ci)
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 6/7] ceph: don't early drop Fw cap
2013-03-01 6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
@ 2013-03-04 18:26 ` Gregory Farnum
2013-03-05 1:57 ` Yan, Zheng
0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-03-04 18:26 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil
On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
> cap dirty before data is copied to page cache and inode size is
> updated. The optimization avoids slow cap revocation caused by
> balance_dirty_pages(), but introduces inode size update race. If
> ceph_check_caps() flushes the dirty cap before the inode size is
> updated, MDS can miss the new inode size. So just remove the
> optimization.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/file.c | 42 +++++++++++++++++-------------------------
> 1 file changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index a949805..28ef273 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
> if (ceph_snap(inode) != CEPH_NOSNAP)
> return -EROFS;
>
> + sb_start_write(inode->i_sb);
> retry_snap:
> - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> - return -ENOSPC;
> + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
> + ret = -ENOSPC;
> + goto out;
> + }
> __ceph_do_pending_vmtruncate(inode);
> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> @@ -750,29 +753,10 @@ retry_snap:
> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> &iocb->ki_pos);
> } else {
> - /*
> - * buffered write; drop Fw early to avoid slow
> - * revocation if we get stuck on balance_dirty_pages
> - */
> - int dirty;
> -
> - spin_lock(&ci->i_ceph_lock);
> - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
> - spin_unlock(&ci->i_ceph_lock);
> - ceph_put_cap_refs(ci, got);
> -
> - ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
> - if ((ret >= 0 || ret == -EIOCBQUEUED) &&
> - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
> - || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> - if (err < 0)
> - ret = err;
> - }
> -
> - if (dirty)
> - __mark_inode_dirty(inode, dirty);
> - goto out;
> + mutex_lock(&inode->i_mutex);
> + ret = __generic_file_aio_write(iocb, iov, nr_segs,
> + &iocb->ki_pos);
> + mutex_unlock(&inode->i_mutex);
Hmm, you're here passing in a different value than the removed
generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
Everything else is using the pos parameter so I rather expect that
should still be used here?
Also a quick skim of the interfaces makes me think that the two
versions aren't interchangeable — __generic_file_aio_write() also
handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
(I haven't checked deep enough to see if the unlocked version is
correct or not, but it does say that's what most filesystems should
use.)
-Greg
> }
>
> if (ret >= 0) {
> @@ -790,12 +774,20 @@ out_put:
> ceph_cap_string(got));
> ceph_put_cap_refs(ci, got);
>
> + if (ret >= 0 &&
> + ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> + ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> + err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> + if (err < 0)
> + ret = err;
> + }
> out:
> if (ret == -EOLDSNAPC) {
> dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
> goto retry_snap;
> }
> + sb_end_write(inode->i_sb);
>
> return ret;
> }
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/7] ceph: misc fixes
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
` (6 preceding siblings ...)
2013-03-01 6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
@ 2013-03-04 18:49 ` Gregory Farnum
[not found] ` <513568D1.60305@intel.com>
7 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2013-03-04 18:49 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil
On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> These patches are also in:
> git://github.com/ukernel/linux.git wip-ceph
1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's
concerns on 4 and my questions on 6 I'll be happy to pull these in. :)
-Greg
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 6/7] ceph: don't early drop Fw cap
2013-03-04 18:26 ` Gregory Farnum
@ 2013-03-05 1:57 ` Yan, Zheng
2013-03-05 20:42 ` Greg Farnum
0 siblings, 1 reply; 14+ messages in thread
From: Yan, Zheng @ 2013-03-05 1:57 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, Sage Weil
On 03/05/2013 02:26 AM, Gregory Farnum wrote:
> On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
>> cap dirty before data is copied to page cache and inode size is
>> updated. The optimization avoids slow cap revocation caused by
>> balance_dirty_pages(), but introduces inode size update race. If
>> ceph_check_caps() flushes the dirty cap before the inode size is
>> updated, MDS can miss the new inode size. So just remove the
>> optimization.
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>> fs/ceph/file.c | 42 +++++++++++++++++-------------------------
>> 1 file changed, 17 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a949805..28ef273 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
>> if (ceph_snap(inode) != CEPH_NOSNAP)
>> return -EROFS;
>>
>> + sb_start_write(inode->i_sb);
>> retry_snap:
>> - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
>> - return -ENOSPC;
>> + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
>> + ret = -ENOSPC;
>> + goto out;
>> + }
>> __ceph_do_pending_vmtruncate(inode);
>> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
>> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> @@ -750,29 +753,10 @@ retry_snap:
>> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
>> &iocb->ki_pos);
>> } else {
>> - /*
>> - * buffered write; drop Fw early to avoid slow
>> - * revocation if we get stuck on balance_dirty_pages
>> - */
>> - int dirty;
>> -
>> - spin_lock(&ci->i_ceph_lock);
>> - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
>> - spin_unlock(&ci->i_ceph_lock);
>> - ceph_put_cap_refs(ci, got);
>> -
>> - ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
>> - if ((ret >= 0 || ret == -EIOCBQUEUED) &&
>> - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
>> - || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> - if (err < 0)
>> - ret = err;
>> - }
>> -
>> - if (dirty)
>> - __mark_inode_dirty(inode, dirty);
>> - goto out;
>> + mutex_lock(&inode->i_mutex);
>> + ret = __generic_file_aio_write(iocb, iov, nr_segs,
>> + &iocb->ki_pos);
>> + mutex_unlock(&inode->i_mutex);
>
> Hmm, you're here passing in a different value than the removed
> generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
> Everything else is using the pos parameter so I rather expect that
> should still be used here?
They always have the same value, see the BUG_ON in generic_file_aio_write()
> Also a quick skim of the interfaces makes me think that the two
> versions aren't interchangeable — __generic_file_aio_write() also
> handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
ceph has its own code that handles O_SYNC case. I want to make sb_start_write()
covers ceph_sync_write(), that's the reason I use __generic_file_aio_write() here.
Regards
Yan, Zheng
> (I haven't checked deep enough to see if the unlocked version is
> correct or not, but it does say that's what most filesystems should
> use.)
> -Greg
>
>
>> }
>>
>> if (ret >= 0) {
>> @@ -790,12 +774,20 @@ out_put:
>> ceph_cap_string(got));
>> ceph_put_cap_refs(ci, got);
>>
>> + if (ret >= 0 &&
>> + ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
>> + ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> + err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> + if (err < 0)
>> + ret = err;
>> + }
>> out:
>> if (ret == -EOLDSNAPC) {
>> dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
>> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
>> goto retry_snap;
>> }
>> + sb_end_write(inode->i_sb);
>>
>> return ret;
>> }
>> --
>> 1.7.11.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 6/7] ceph: don't early drop Fw cap
2013-03-05 1:57 ` Yan, Zheng
@ 2013-03-05 20:42 ` Greg Farnum
0 siblings, 0 replies; 14+ messages in thread
From: Greg Farnum @ 2013-03-05 20:42 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Sage Weil
On Monday, March 4, 2013 at 5:57 PM, Yan, Zheng wrote:
> On 03/05/2013 02:26 AM, Gregory Farnum wrote:
> > On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)> wrote:
> > > From: "Yan, Zheng" <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> > >
> > > ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
> > > cap dirty before data is copied to page cache and inode size is
> > > updated. The optimization avoids slow cap revocation caused by
> > > balance_dirty_pages(), but introduces inode size update race. If
> > > ceph_check_caps() flushes the dirty cap before the inode size is
> > > updated, MDS can miss the new inode size. So just remove the
> > > optimization.
> > >
> > > Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> > > ---
> > > fs/ceph/file.c | 42 +++++++++++++++++-------------------------
> > > 1 file changed, 17 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index a949805..28ef273 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
> > > if (ceph_snap(inode) != CEPH_NOSNAP)
> > > return -EROFS;
> > >
> > > + sb_start_write(inode->i_sb);
> > > retry_snap:
> > > - if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
> > > - return -ENOSPC;
> > > + if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
> > > + ret = -ENOSPC;
> > > + goto out;
> > > + }
> > > __ceph_do_pending_vmtruncate(inode);
> > > dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> > > inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> > > @@ -750,29 +753,10 @@ retry_snap:
> > > ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> > > &iocb->ki_pos);
> > > } else {
> > > - /*
> > > - * buffered write; drop Fw early to avoid slow
> > > - * revocation if we get stuck on balance_dirty_pages
> > > - */
> > > - int dirty;
> > > -
> > > - spin_lock(&ci->i_ceph_lock);
> > > - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
> > > - spin_unlock(&ci->i_ceph_lock);
> > > - ceph_put_cap_refs(ci, got);
> > > -
> > > - ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
> > > - if ((ret >= 0 || ret == -EIOCBQUEUED) &&
> > > - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
> > > - || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> > > - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> > > - if (err < 0)
> > > - ret = err;
> > > - }
> > > -
> > > - if (dirty)
> > > - __mark_inode_dirty(inode, dirty);
> > > - goto out;
> > > + mutex_lock(&inode->i_mutex);
> > > + ret = __generic_file_aio_write(iocb, iov, nr_segs,
> > > + &iocb->ki_pos);
> > > + mutex_unlock(&inode->i_mutex);
> >
> >
> >
> > Hmm, you're here passing in a different value than the removed
> > generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
> > Everything else is using the pos parameter so I rather expect that
> > should still be used here?
>
>
>
> They always have the same value, see the BUG_ON in generic_file_aio_write()
>
> > Also a quick skim of the interfaces makes me think that the two
> > versions aren't interchangeable — __generic_file_aio_write() also
> > handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?
>
>
>
> ceph has its own code that handles O_SYNC case. I want to make sb_start_write()
> covers ceph_sync_write(), that's the reason I use __generic_file_aio_write() here.
>
> Regards
> Yan, Zheng
>
Ah, yep — sounds good!
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/7] ceph: misc fixes
[not found] ` <513568D1.60305@intel.com>
@ 2013-03-05 20:44 ` Greg Farnum
0 siblings, 0 replies; 14+ messages in thread
From: Greg Farnum @ 2013-03-05 20:44 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel
I've merged this series into the testing branch, with appropriate Reviewed-by tags from me (and Sage on #4). Thanks much for the code and helping me go through it. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
On Monday, March 4, 2013 at 7:38 PM, Yan, Zheng wrote:
> On 03/05/2013 02:49 AM, Gregory Farnum wrote:
> > On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> > > From: "Yan, Zheng" <zheng.z.yan@intel.com>
> > >
> > > These patches are also in:
> > > git://github.com/ukernel/linux.git (http://github.com/ukernel/linux.git) wip-ceph
> >
> >
> >
> > 1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's
> > concerns on 4 and my questions on 6 I'll be happy to pull these in. :)
>
>
>
> I rechecked locking assumption in patch 4, nothing goes wrong.
>
> Regards
> Yan, Zheng
>
> > -Greg
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2013-03-05 20:45 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-01 6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
2013-03-01 6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
2013-03-01 6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
2013-03-01 6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
2013-03-01 6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
2013-03-04 18:19 ` Sage Weil
2013-03-01 6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
2013-03-01 6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
2013-03-04 18:26 ` Gregory Farnum
2013-03-05 1:57 ` Yan, Zheng
2013-03-05 20:42 ` Greg Farnum
2013-03-01 6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
[not found] ` <513568D1.60305@intel.com>
2013-03-05 20:44 ` Greg Farnum
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.