* [PATCH 1/4] ceph: add osd request to inode unsafe list in advance
@ 2013-04-12 8:11 Yan, Zheng
2013-04-12 8:11 ` [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued Yan, Zheng
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
Right now we add osd requests to inode unsafe list after starting them.
The problem here is osd request may have already finished when adding
it to inode unsafe list.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 14 ++++++++++----
fs/ceph/file.c | 30 ++++++++++++------------------
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 124e8a1..0da2e94 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1806,10 +1806,16 @@ static void sync_write_wait(struct inode *inode)
if (list_empty(head))
goto out;
- /* set upper bound as _last_ entry in chain */
- req = list_entry(head->prev, struct ceph_osd_request,
- r_unsafe_item);
- last_tid = req->r_tid;
+ /* set upper bound as last started request */
+ last_tid = (u64)-1;
+ list_for_each_entry_reverse(req, head, r_unsafe_item) {
+ if (req->r_tid != (u64)-1) {
+ last_tid = req->r_tid;
+ break;
+ }
+ }
+ if (last_tid == (u64)-1)
+ goto out;
do {
ceph_osdc_get_request(req);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b7e6caa..546a705 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -571,7 +571,14 @@ more:
if ((file->f_flags & O_SYNC) == 0) {
/* get a second commit callback */
req->r_safe_callback = sync_write_commit;
+ req->r_inode = inode;
+ req->r_tid = (u64)-1;
own_pages = true;
+ spin_lock(&ci->i_unsafe_lock);
+ list_add_tail(&req->r_unsafe_item,
+ &ci->i_unsafe_writes);
+ spin_unlock(&ci->i_unsafe_lock);
+ ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR);
}
}
osd_req_op_extent_osd_data_pages(req, 0, true, pages, len,
@@ -582,25 +589,12 @@ more:
ret = ceph_osdc_start_request(&fsc->client->osdc, req, false);
if (!ret) {
- if (req->r_safe_callback) {
- /*
- * Add to inode unsafe list only after we
- * start_request so that a tid has been assigned.
- */
- spin_lock(&ci->i_unsafe_lock);
- list_add_tail(&req->r_unsafe_item,
- &ci->i_unsafe_writes);
- spin_unlock(&ci->i_unsafe_lock);
- ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR);
- }
-
ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
- if (ret < 0 && req->r_safe_callback) {
- spin_lock(&ci->i_unsafe_lock);
- list_del_init(&req->r_unsafe_item);
- spin_unlock(&ci->i_unsafe_lock);
- ceph_put_cap_refs(ci, CEPH_CAP_FILE_WR);
- }
+ } else if (req->r_safe_callback) {
+ spin_lock(&ci->i_unsafe_lock);
+ list_del_init(&req->r_unsafe_item);
+ spin_unlock(&ci->i_unsafe_lock);
+ ceph_put_cap_refs(ci, CEPH_CAP_FILE_WR);
}
if (file->f_flags & O_DIRECT)
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
@ 2013-04-12 8:11 ` Yan, Zheng
2013-04-17 21:58 ` Gregory Farnum
2013-04-12 8:11 ` [PATCH 2/4] ceph: take i_mutex before getting Fw cap Yan, Zheng
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
There is a total of 22 cap bits and file lock uses 8 cap bits.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
src/mds/CInode.h | 4 ++--
src/mds/Locker.cc | 3 ++-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/src/mds/CInode.h b/src/mds/CInode.h
index 43af9c4..a3fe3cf 100644
--- a/src/mds/CInode.h
+++ b/src/mds/CInode.h
@@ -736,9 +736,9 @@ public:
// caps issued, wanted
int get_caps_issued(int *ploner = 0, int *pother = 0, int *pxlocker = 0,
- int shift = 0, int mask = 0xffff);
+ int shift = 0, int mask = -1);
bool is_any_caps_wanted();
- int get_caps_wanted(int *ploner = 0, int *pother = 0, int shift = 0, int mask = 0xffff);
+ int get_caps_wanted(int *ploner = 0, int *pother = 0, int shift = 0, int mask = -1);
bool issued_caps_need_gather(SimpleLock *lock);
void replicate_relax_locks();
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 7ba9f44..f86ba88 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -616,7 +616,8 @@ void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, list<C
int loner_issued = 0, other_issued = 0, xlocker_issued = 0;
if (caps && in->is_head()) {
- in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued, lock->get_cap_shift(), 3);
+ in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued,
+ lock->get_cap_shift(), lock->get_cap_mask());
dout(10) << " next state is " << lock->get_state_name(next)
<< " issued/allows loner " << gcap_string(loner_issued)
<< "/" << gcap_string(lock->gcaps_allowed(CAP_LONER, next))
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/4] ceph: take i_mutex before getting Fw cap
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
2013-04-12 8:11 ` [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued Yan, Zheng
@ 2013-04-12 8:11 ` Yan, Zheng
2013-04-15 16:15 ` Alex Elder
2013-04-17 3:42 ` Sage Weil
2013-04-12 8:11 ` [PATCH 2/2] mds: change XLOCK/XLOCKDONE's next state to LOCK Yan, Zheng
` (2 subsequent siblings)
4 siblings, 2 replies; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
There is deadlock as illustrated bellow. The fix is taking i_mutex
before getting Fw cap reference.
write truncate MDS
--------------------- -------------------- --------------
get Fw cap
lock i_mutex
lock i_mutex (blocked)
request setattr.size ->
<- revoke Fw cap
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 13 +++++++------
fs/ceph/file.c | 12 ++++++------
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 0da2e94..8737572 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2058,6 +2058,13 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
goto out;
}
+ /* finish pending truncate */
+ while (ci->i_truncate_pending) {
+ spin_unlock(&ci->i_ceph_lock);
+ __ceph_do_pending_vmtruncate(inode, !(need & CEPH_CAP_FILE_WR));
+ spin_lock(&ci->i_ceph_lock);
+ }
+
if (need & CEPH_CAP_FILE_WR) {
if (endoff >= 0 && endoff > (loff_t)ci->i_max_size) {
dout("get_cap_refs %p endoff %llu > maxsize %llu\n",
@@ -2079,12 +2086,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
}
have = __ceph_caps_issued(ci, &implemented);
- /*
- * disallow writes while a truncate is pending
- */
- if (ci->i_truncate_pending)
- have &= ~CEPH_CAP_FILE_WR;
-
if ((have & need) == need) {
/*
* Look at (implemented & ~have & not) so that we keep waiting
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 546a705..5490598 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -647,7 +647,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
inode, ceph_vinop(inode), pos, (unsigned)len, inode);
again:
- __ceph_do_pending_vmtruncate(inode, true);
if (fi->fmode & CEPH_FILE_MODE_LAZY)
want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
else
@@ -724,7 +723,7 @@ retry_snap:
ret = -ENOSPC;
goto out;
}
- __ceph_do_pending_vmtruncate(inode, true);
+ mutex_lock(&inode->i_mutex);
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
inode->i_size);
@@ -733,8 +732,10 @@ retry_snap:
else
want = CEPH_CAP_FILE_BUFFER;
ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
- if (ret < 0)
- goto out_put;
+ if (ret < 0) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
@@ -744,10 +745,10 @@ retry_snap:
(iocb->ki_filp->f_flags & O_DIRECT) ||
(inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
(fi->flags & CEPH_F_SYNC)) {
+ mutex_unlock(&inode->i_mutex);
ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
&iocb->ki_pos);
} else {
- mutex_lock(&inode->i_mutex);
ret = __generic_file_aio_write(iocb, iov, nr_segs,
&iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
@@ -762,7 +763,6 @@ retry_snap:
__mark_inode_dirty(inode, dirty);
}
-out_put:
dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
ceph_cap_string(got));
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] mds: change XLOCK/XLOCKDONE's next state to LOCK
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
2013-04-12 8:11 ` [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued Yan, Zheng
2013-04-12 8:11 ` [PATCH 2/4] ceph: take i_mutex before getting Fw cap Yan, Zheng
@ 2013-04-12 8:11 ` Yan, Zheng
2013-04-12 8:11 ` [PATCH 3/4] ceph: fix symlink inode operations Yan, Zheng
2013-04-12 8:11 ` [PATCH 4/4] ceph: apply write checks in ceph_aio_write Yan, Zheng
4 siblings, 0 replies; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
For simplelock and filelock, XLOCK/XLOCKDONE's next state is SYNC.
But filelock in XLOCK/XLOCKDONE state allow Fb caps, filelock in
SYNC state does not. So filelock can be stuck in XLOCK/XLOCKDONE
state forever if there are Fb caps issued.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
src/mds/Locker.cc | 8 ++++++--
src/mds/locks.c | 4 ++--
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index f86ba88..9f0043c 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -756,13 +756,17 @@ void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, list<C
(static_cast<ScatterLock *>(lock))->clear_scatter_wanted();
break;
+ case LOCK_XLOCK:
+ case LOCK_XLOCKDONE:
+ if (next != LOCK_SYNC)
+ break;
+ // fall-thru
+
// to sync
case LOCK_EXCL_SYNC:
case LOCK_LOCK_SYNC:
case LOCK_MIX_SYNC:
case LOCK_XSYN_SYNC:
- case LOCK_XLOCK:
- case LOCK_XLOCKDONE:
if (lock->get_parent()->is_replicated()) {
bufferlist softdata;
lock->encode_locked_state(softdata);
diff --git a/src/mds/locks.c b/src/mds/locks.c
index 69b6bd6..c7dd5be 100644
--- a/src/mds/locks.c
+++ b/src/mds/locks.c
@@ -103,8 +103,8 @@ const struct sm_state_t filelock[LOCK_MAX] = {
[LOCK_MIX_LOCK2] = { LOCK_LOCK, false, LOCK_LOCK, AUTH, 0, REQ, 0, 0, 0, 0, 0,0,0,0 },
[LOCK_PREXLOCK] = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0, 0, 0, 0, ANY, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,0,0 },
- [LOCK_XLOCK] = { LOCK_SYNC, false, LOCK_LOCK, 0, XCL, 0, 0, 0, 0, 0, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,0,0 },
- [LOCK_XLOCKDONE] = { LOCK_SYNC, false, LOCK_LOCK, XCL, XCL, XCL, 0, 0, XCL, 0, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,CEPH_CAP_GSHARED,0 },
+ [LOCK_XLOCK] = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0, 0, 0, 0, 0, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,0,0 },
+ [LOCK_XLOCKDONE] = { LOCK_LOCK, false, LOCK_LOCK, XCL, XCL, XCL, 0, 0, XCL, 0, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,CEPH_CAP_GSHARED,0 },
[LOCK_LOCK_XLOCK]= { LOCK_PREXLOCK,false,LOCK_LOCK,0, XCL, 0, 0, 0, 0, XCL, CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,0,0 },
[LOCK_MIX] = { 0, false, LOCK_MIX, 0, 0, REQ, ANY, 0, 0, 0, CEPH_CAP_GRD|CEPH_CAP_GWR|CEPH_CAP_GLAZYIO,0,0,CEPH_CAP_GRD },
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/4] ceph: fix symlink inode operations
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
` (2 preceding siblings ...)
2013-04-12 8:11 ` [PATCH 2/2] mds: change XLOCK/XLOCKDONE's next state to LOCK Yan, Zheng
@ 2013-04-12 8:11 ` Yan, Zheng
2013-04-15 16:15 ` Alex Elder
2013-04-12 8:11 ` [PATCH 4/4] ceph: apply write checks in ceph_aio_write Yan, Zheng
4 siblings, 1 reply; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
add getattr/setattr and xattrs related methods.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/inode.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 5d730d4..d5cad38 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1523,6 +1523,12 @@ static void *ceph_sym_follow_link(struct dentry *dentry, struct nameidata *nd)
static const struct inode_operations ceph_symlink_iops = {
.readlink = generic_readlink,
.follow_link = ceph_sym_follow_link,
+ .setattr = ceph_setattr,
+ .getattr = ceph_getattr,
+ .setxattr = ceph_setxattr,
+ .getxattr = ceph_getxattr,
+ .listxattr = ceph_listxattr,
+ .removexattr = ceph_removexattr,
};
/*
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/4] ceph: apply write checks in ceph_aio_write
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
` (3 preceding siblings ...)
2013-04-12 8:11 ` [PATCH 3/4] ceph: fix symlink inode operations Yan, Zheng
@ 2013-04-12 8:11 ` Yan, Zheng
2013-04-15 16:15 ` Alex Elder
4 siblings, 1 reply; 13+ messages in thread
From: Yan, Zheng @ 2013-04-12 8:11 UTC (permalink / raw)
To: ceph-devel; +Cc: greg, elder, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
copy write checks in __generic_file_aio_write to ceph_aio_write.
To make these checks cover sync write path.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/file.c | 94 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 59 insertions(+), 35 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5490598..b034c3a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -470,7 +470,7 @@ static void sync_write_commit(struct ceph_osd_request *req,
* objects, rollback on failure, etc.)
*/
static ssize_t ceph_sync_write(struct file *file, const char __user *data,
- size_t left, loff_t *offset)
+ size_t left, loff_t pos, loff_t *ppos)
{
struct inode *inode = file->f_dentry->d_inode;
struct ceph_inode_info *ci = ceph_inode(inode);
@@ -481,7 +481,6 @@ static ssize_t ceph_sync_write(struct file *file, const char __user *data,
int num_ops = 1;
struct page **pages;
int num_pages;
- long long unsigned pos;
u64 len;
int written = 0;
int flags;
@@ -495,14 +494,9 @@ static ssize_t ceph_sync_write(struct file *file, const char __user *data,
if (ceph_snap(file->f_dentry->d_inode) != CEPH_NOSNAP)
return -EROFS;
- dout("sync_write on file %p %lld~%u %s\n", file, *offset,
+ dout("sync_write on file %p %lld~%u %s\n", file, pos,
(unsigned)left, (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
- if (file->f_flags & O_APPEND)
- pos = i_size_read(inode);
- else
- pos = *offset;
-
ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + left);
if (ret < 0)
return ret;
@@ -613,7 +607,7 @@ out:
goto more;
ret = written;
- *offset = pos;
+ *ppos = pos;
if (pos > i_size_read(inode))
check_caps = ceph_inode_set_size(inode, pos);
if (check_caps)
@@ -710,51 +704,75 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_osd_client *osdc =
&ceph_sb_to_client(inode->i_sb)->client->osdc;
- loff_t endoff = pos + iov->iov_len;
- int want, got = 0;
- int ret, err;
+ ssize_t count, written = 0;
+ int err, want, got;
+ bool hold_mutex;
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
sb_start_write(inode->i_sb);
+ mutex_lock(&inode->i_mutex);
+ hold_mutex = true;
+
+ err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+ if (err)
+ goto out;
+
+ /* We can write back this queue in page reclaim */
+ current->backing_dev_info = file->f_mapping->backing_dev_info;
+
+ err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
+ if (err)
+ goto out;
+
+ if (count == 0)
+ goto out;
+
+ err = file_remove_suid(file);
+ if (err)
+ goto out;
+
+ err = file_update_time(file);
+ if (err)
+ goto out;
+
retry_snap:
if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
- ret = -ENOSPC;
+ err = -ENOSPC;
goto out;
}
- mutex_lock(&inode->i_mutex);
- dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
- inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
- inode->i_size);
+
+ dout("aio_write %p %llx.%llx %llu~%ld getting caps. i_size %llu\n",
+ inode, ceph_vinop(inode), pos, count, inode->i_size);
if (fi->fmode & CEPH_FILE_MODE_LAZY)
want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
else
want = CEPH_CAP_FILE_BUFFER;
- ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
- if (ret < 0) {
- mutex_unlock(&inode->i_mutex);
+ got = 0;
+ err = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos + count);
+ if (err < 0)
goto out;
- }
- dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
- ceph_cap_string(got));
+ dout("aio_write %p %llx.%llx %llu~%ld got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
(iocb->ki_filp->f_flags & O_DIRECT) ||
(inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
(fi->flags & CEPH_F_SYNC)) {
mutex_unlock(&inode->i_mutex);
- ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
- &iocb->ki_pos);
+ written = ceph_sync_write(file, iov->iov_base, count,
+ pos, &iocb->ki_pos);
} else {
- ret = __generic_file_aio_write(iocb, iov, nr_segs,
- &iocb->ki_pos);
+ written = generic_file_buffered_write(iocb, iov, nr_segs,
+ pos, &iocb->ki_pos,
+ count, 0);
mutex_unlock(&inode->i_mutex);
}
+ hold_mutex = false;
- if (ret >= 0) {
+ if (written >= 0) {
int dirty;
spin_lock(&ci->i_ceph_lock);
dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
@@ -768,22 +786,28 @@ retry_snap:
ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
- if (ret >= 0 &&
+ if (written >= 0 &&
((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
- err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
+ err = vfs_fsync_range(file, pos, pos + written - 1, 1);
if (err < 0)
- ret = err;
+ written = err;
}
-out:
- if (ret == -EOLDSNAPC) {
+
+ if (written == -EOLDSNAPC) {
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
+ mutex_lock(&inode->i_mutex);
+ hold_mutex = true;
goto retry_snap;
}
+out:
+ if (hold_mutex)
+ mutex_unlock(&inode->i_mutex);
sb_end_write(inode->i_sb);
+ current->backing_dev_info = NULL;
- return ret;
+ return written ? written : err;;
}
/*
--
1.7.11.7
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] ceph: take i_mutex before getting Fw cap
2013-04-12 8:11 ` [PATCH 2/4] ceph: take i_mutex before getting Fw cap Yan, Zheng
@ 2013-04-15 16:15 ` Alex Elder
2013-04-17 3:42 ` Sage Weil
1 sibling, 0 replies; 13+ messages in thread
From: Alex Elder @ 2013-04-15 16:15 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel, greg
On 04/12/2013 03:11 AM, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> There is deadlock as illustrated bellow. The fix is taking i_mutex
> before getting Fw cap reference.
>
> write truncate MDS
> --------------------- -------------------- --------------
> get Fw cap
> lock i_mutex
> lock i_mutex (blocked)
> request setattr.size ->
> <- revoke Fw cap
This looks good to me, but I would prefer if Sage
or someone else with deeper ceph FS knowledge said
so as well.
Reviewed-by: Alex Elder <elder@inktank.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/caps.c | 13 +++++++------
> fs/ceph/file.c | 12 ++++++------
> 2 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 0da2e94..8737572 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2058,6 +2058,13 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> goto out;
> }
>
> + /* finish pending truncate */
> + while (ci->i_truncate_pending) {
> + spin_unlock(&ci->i_ceph_lock);
> + __ceph_do_pending_vmtruncate(inode, !(need & CEPH_CAP_FILE_WR));
> + spin_lock(&ci->i_ceph_lock);
> + }
> +
> if (need & CEPH_CAP_FILE_WR) {
> if (endoff >= 0 && endoff > (loff_t)ci->i_max_size) {
> dout("get_cap_refs %p endoff %llu > maxsize %llu\n",
> @@ -2079,12 +2086,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> }
> have = __ceph_caps_issued(ci, &implemented);
>
> - /*
> - * disallow writes while a truncate is pending
> - */
> - if (ci->i_truncate_pending)
> - have &= ~CEPH_CAP_FILE_WR;
> -
> if ((have & need) == need) {
> /*
> * Look at (implemented & ~have & not) so that we keep waiting
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 546a705..5490598 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -647,7 +647,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
> dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
> inode, ceph_vinop(inode), pos, (unsigned)len, inode);
> again:
> - __ceph_do_pending_vmtruncate(inode, true);
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
> else
> @@ -724,7 +723,7 @@ retry_snap:
> ret = -ENOSPC;
> goto out;
> }
> - __ceph_do_pending_vmtruncate(inode, true);
> + mutex_lock(&inode->i_mutex);
> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> inode->i_size);
> @@ -733,8 +732,10 @@ retry_snap:
> else
> want = CEPH_CAP_FILE_BUFFER;
> ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
> - if (ret < 0)
> - goto out_put;
> + if (ret < 0) {
> + mutex_unlock(&inode->i_mutex);
> + goto out;
> + }
>
> dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> @@ -744,10 +745,10 @@ retry_snap:
> (iocb->ki_filp->f_flags & O_DIRECT) ||
> (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
> (fi->flags & CEPH_F_SYNC)) {
> + mutex_unlock(&inode->i_mutex);
> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> &iocb->ki_pos);
> } else {
> - mutex_lock(&inode->i_mutex);
> ret = __generic_file_aio_write(iocb, iov, nr_segs,
> &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> @@ -762,7 +763,6 @@ retry_snap:
> __mark_inode_dirty(inode, dirty);
> }
>
> -out_put:
> dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> ceph_cap_string(got));
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/4] ceph: fix symlink inode operations
2013-04-12 8:11 ` [PATCH 3/4] ceph: fix symlink inode operations Yan, Zheng
@ 2013-04-15 16:15 ` Alex Elder
0 siblings, 0 replies; 13+ messages in thread
From: Alex Elder @ 2013-04-15 16:15 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel, greg
On 04/12/2013 03:11 AM, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> add getattr/setattr and xattrs related methods.
Looks good.
Reviewed-by: Alex Elder <elder@inktank.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/inode.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 5d730d4..d5cad38 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1523,6 +1523,12 @@ static void *ceph_sym_follow_link(struct dentry *dentry, struct nameidata *nd)
> static const struct inode_operations ceph_symlink_iops = {
> .readlink = generic_readlink,
> .follow_link = ceph_sym_follow_link,
> + .setattr = ceph_setattr,
> + .getattr = ceph_getattr,
> + .setxattr = ceph_setxattr,
> + .getxattr = ceph_getxattr,
> + .listxattr = ceph_listxattr,
> + .removexattr = ceph_removexattr,
> };
>
> /*
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/4] ceph: apply write checks in ceph_aio_write
2013-04-12 8:11 ` [PATCH 4/4] ceph: apply write checks in ceph_aio_write Yan, Zheng
@ 2013-04-15 16:15 ` Alex Elder
0 siblings, 0 replies; 13+ messages in thread
From: Alex Elder @ 2013-04-15 16:15 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel, greg
On 04/12/2013 03:11 AM, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> copy write checks in __generic_file_aio_write to ceph_aio_write.
> To make these checks cover sync write path.
This one is important, and it looks good to me. This is another
one I'd like another opinion on though.
Extra semicolon at the very end, but I can clean that up
before committing.
Reviewed-by: Alex Elder <elder@inktank.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/file.c | 94 ++++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 59 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5490598..b034c3a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -470,7 +470,7 @@ static void sync_write_commit(struct ceph_osd_request *req,
> * objects, rollback on failure, etc.)
> */
> static ssize_t ceph_sync_write(struct file *file, const char __user *data,
> - size_t left, loff_t *offset)
> + size_t left, loff_t pos, loff_t *ppos)
> {
> struct inode *inode = file->f_dentry->d_inode;
> struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -481,7 +481,6 @@ static ssize_t ceph_sync_write(struct file *file, const char __user *data,
> int num_ops = 1;
> struct page **pages;
> int num_pages;
> - long long unsigned pos;
> u64 len;
> int written = 0;
> int flags;
> @@ -495,14 +494,9 @@ static ssize_t ceph_sync_write(struct file *file, const char __user *data,
> if (ceph_snap(file->f_dentry->d_inode) != CEPH_NOSNAP)
> return -EROFS;
>
> - dout("sync_write on file %p %lld~%u %s\n", file, *offset,
> + dout("sync_write on file %p %lld~%u %s\n", file, pos,
> (unsigned)left, (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
>
> - if (file->f_flags & O_APPEND)
> - pos = i_size_read(inode);
> - else
> - pos = *offset;
> -
> ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + left);
> if (ret < 0)
> return ret;
> @@ -613,7 +607,7 @@ out:
> goto more;
>
> ret = written;
> - *offset = pos;
> + *ppos = pos;
> if (pos > i_size_read(inode))
> check_caps = ceph_inode_set_size(inode, pos);
> if (check_caps)
> @@ -710,51 +704,75 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
> struct ceph_inode_info *ci = ceph_inode(inode);
> struct ceph_osd_client *osdc =
> &ceph_sb_to_client(inode->i_sb)->client->osdc;
> - loff_t endoff = pos + iov->iov_len;
> - int want, got = 0;
> - int ret, err;
> + ssize_t count, written = 0;
> + int err, want, got;
> + bool hold_mutex;
>
> if (ceph_snap(inode) != CEPH_NOSNAP)
> return -EROFS;
>
> sb_start_write(inode->i_sb);
> + mutex_lock(&inode->i_mutex);
> + hold_mutex = true;
> +
> + err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
> + if (err)
> + goto out;
> +
> + /* We can write back this queue in page reclaim */
> + current->backing_dev_info = file->f_mapping->backing_dev_info;
> +
> + err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
> + if (err)
> + goto out;
> +
> + if (count == 0)
> + goto out;
> +
> + err = file_remove_suid(file);
> + if (err)
> + goto out;
> +
> + err = file_update_time(file);
> + if (err)
> + goto out;
> +
> retry_snap:
> if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
> - ret = -ENOSPC;
> + err = -ENOSPC;
> goto out;
> }
> - mutex_lock(&inode->i_mutex);
> - dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> - inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> - inode->i_size);
> +
> + dout("aio_write %p %llx.%llx %llu~%ld getting caps. i_size %llu\n",
> + inode, ceph_vinop(inode), pos, count, inode->i_size);
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> else
> want = CEPH_CAP_FILE_BUFFER;
> - ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
> - if (ret < 0) {
> - mutex_unlock(&inode->i_mutex);
> + got = 0;
> + err = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos + count);
> + if (err < 0)
> goto out;
> - }
>
> - dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
> - inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> - ceph_cap_string(got));
> + dout("aio_write %p %llx.%llx %llu~%ld got cap refs on %s\n",
> + inode, ceph_vinop(inode), pos, count, ceph_cap_string(got));
>
> if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
> (iocb->ki_filp->f_flags & O_DIRECT) ||
> (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
> (fi->flags & CEPH_F_SYNC)) {
> mutex_unlock(&inode->i_mutex);
> - ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> - &iocb->ki_pos);
> + written = ceph_sync_write(file, iov->iov_base, count,
> + pos, &iocb->ki_pos);
> } else {
> - ret = __generic_file_aio_write(iocb, iov, nr_segs,
> - &iocb->ki_pos);
> + written = generic_file_buffered_write(iocb, iov, nr_segs,
> + pos, &iocb->ki_pos,
> + count, 0);
> mutex_unlock(&inode->i_mutex);
> }
> + hold_mutex = false;
>
> - if (ret >= 0) {
> + if (written >= 0) {
> int dirty;
> spin_lock(&ci->i_ceph_lock);
> dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
> @@ -768,22 +786,28 @@ retry_snap:
> ceph_cap_string(got));
> ceph_put_cap_refs(ci, got);
>
> - if (ret >= 0 &&
> + if (written >= 0 &&
> ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> - err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
> + err = vfs_fsync_range(file, pos, pos + written - 1, 1);
> if (err < 0)
> - ret = err;
> + written = err;
> }
> -out:
> - if (ret == -EOLDSNAPC) {
> +
> + if (written == -EOLDSNAPC) {
> dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
> + mutex_lock(&inode->i_mutex);
> + hold_mutex = true;
> goto retry_snap;
> }
> +out:
> + if (hold_mutex)
> + mutex_unlock(&inode->i_mutex);
> sb_end_write(inode->i_sb);
> + current->backing_dev_info = NULL;
>
> - return ret;
> + return written ? written : err;;
> }
>
> /*
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] ceph: take i_mutex before getting Fw cap
2013-04-12 8:11 ` [PATCH 2/4] ceph: take i_mutex before getting Fw cap Yan, Zheng
2013-04-15 16:15 ` Alex Elder
@ 2013-04-17 3:42 ` Sage Weil
2013-04-17 12:14 ` Yan, Zheng
1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2013-04-17 3:42 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel, greg, elder
On Fri, 12 Apr 2013, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> There is deadlock as illustrated bellow. The fix is taking i_mutex
> before getting Fw cap reference.
>
> write truncate MDS
> --------------------- -------------------- --------------
> get Fw cap
> lock i_mutex
> lock i_mutex (blocked)
> request setattr.size ->
> <- revoke Fw cap
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> fs/ceph/caps.c | 13 +++++++------
> fs/ceph/file.c | 12 ++++++------
> 2 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 0da2e94..8737572 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2058,6 +2058,13 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> goto out;
> }
>
> + /* finish pending truncate */
> + while (ci->i_truncate_pending) {
> + spin_unlock(&ci->i_ceph_lock);
> + __ceph_do_pending_vmtruncate(inode, !(need & CEPH_CAP_FILE_WR));
> + spin_lock(&ci->i_ceph_lock);
I think if we retake i_ceph_lock we need to goto the top to make sure our
local variables aren't stale.. in this case, just file_wanted.
> + }
> +
> if (need & CEPH_CAP_FILE_WR) {
> if (endoff >= 0 && endoff > (loff_t)ci->i_max_size) {
> dout("get_cap_refs %p endoff %llu > maxsize %llu\n",
> @@ -2079,12 +2086,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> }
> have = __ceph_caps_issued(ci, &implemented);
>
> - /*
> - * disallow writes while a truncate is pending
> - */
> - if (ci->i_truncate_pending)
> - have &= ~CEPH_CAP_FILE_WR;
> -
> if ((have & need) == need) {
> /*
> * Look at (implemented & ~have & not) so that we keep waiting
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 546a705..5490598 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -647,7 +647,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
> dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
> inode, ceph_vinop(inode), pos, (unsigned)len, inode);
> again:
> - __ceph_do_pending_vmtruncate(inode, true);
> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
> else
> @@ -724,7 +723,7 @@ retry_snap:
> ret = -ENOSPC;
> goto out;
> }
> - __ceph_do_pending_vmtruncate(inode, true);
> + mutex_lock(&inode->i_mutex);
> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> inode->i_size);
> @@ -733,8 +732,10 @@ retry_snap:
> else
> want = CEPH_CAP_FILE_BUFFER;
> ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
> - if (ret < 0)
> - goto out_put;
> + if (ret < 0) {
> + mutex_unlock(&inode->i_mutex);
> + goto out;
> + }
>
> dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> @@ -744,10 +745,10 @@ retry_snap:
> (iocb->ki_filp->f_flags & O_DIRECT) ||
> (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
> (fi->flags & CEPH_F_SYNC)) {
> + mutex_unlock(&inode->i_mutex);
> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> &iocb->ki_pos);
> } else {
> - mutex_lock(&inode->i_mutex);
> ret = __generic_file_aio_write(iocb, iov, nr_segs,
> &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> @@ -762,7 +763,6 @@ retry_snap:
> __mark_inode_dirty(inode, dirty);
> }
>
> -out_put:
> dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> ceph_cap_string(got));
Mechanically, the rest of this looks correct. I seem to remember us
changing this (or something similar) so that we *didn't* hold i_mutex
while waiting on the caps in order to avoid some deadlock. But... looking
through the git history I can't find anything.
I think the race to consider is if we are blocked waiting for the WR cap
and the MDS sends us a truncate. It will queue the async truncate work
but that will block waiting for i_mutex. Can that deadlock? (I think no,
but perhaps the pending truncate check needs to happen after we acquire
the cap, too!)
Similarly, if we block holding i_mutex and wait for WR, but the MDS
revokes some other cap (say, WRBUFFER), could we deadlock from teh
async writeback worker?
Both sound dangerous to me. I wonder if something in the spirit of
while (true) {
get_cap(Fw)
if (try_lock_mutex(...))
break;
put_cap(Fw);
lock_mutex(...)
unlock_mutex(...)
}
would be simpler. Or, make a get_cap variant that drops i_mutex while
waiting, but takes it before grabbing the actual Fw cap ref.
sage
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] ceph: take i_mutex before getting Fw cap
2013-04-17 3:42 ` Sage Weil
@ 2013-04-17 12:14 ` Yan, Zheng
2013-04-17 15:28 ` Sage Weil
0 siblings, 1 reply; 13+ messages in thread
From: Yan, Zheng @ 2013-04-17 12:14 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, greg, elder
On 04/17/2013 11:42 AM, Sage Weil wrote:
> On Fri, 12 Apr 2013, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> There is deadlock as illustrated bellow. The fix is taking i_mutex
>> before getting Fw cap reference.
>>
>> write truncate MDS
>> --------------------- -------------------- --------------
>> get Fw cap
>> lock i_mutex
>> lock i_mutex (blocked)
>> request setattr.size ->
>> <- revoke Fw cap
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>> fs/ceph/caps.c | 13 +++++++------
>> fs/ceph/file.c | 12 ++++++------
>> 2 files changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 0da2e94..8737572 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -2058,6 +2058,13 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>> goto out;
>> }
>>
>> + /* finish pending truncate */
>> + while (ci->i_truncate_pending) {
>> + spin_unlock(&ci->i_ceph_lock);
>> + __ceph_do_pending_vmtruncate(inode, !(need & CEPH_CAP_FILE_WR));
>> + spin_lock(&ci->i_ceph_lock);
>
> I think if we retake i_ceph_lock we need to goto the top to make sure our
> local variables aren't stale.. in this case, just file_wanted.
>
>> + }
>> +
>> if (need & CEPH_CAP_FILE_WR) {
>> if (endoff >= 0 && endoff > (loff_t)ci->i_max_size) {
>> dout("get_cap_refs %p endoff %llu > maxsize %llu\n",
>> @@ -2079,12 +2086,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>> }
>> have = __ceph_caps_issued(ci, &implemented);
>>
>> - /*
>> - * disallow writes while a truncate is pending
>> - */
>> - if (ci->i_truncate_pending)
>> - have &= ~CEPH_CAP_FILE_WR;
>> -
>> if ((have & need) == need) {
>> /*
>> * Look at (implemented & ~have & not) so that we keep waiting
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 546a705..5490598 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -647,7 +647,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>> dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>> inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>> again:
>> - __ceph_do_pending_vmtruncate(inode, true);
>> if (fi->fmode & CEPH_FILE_MODE_LAZY)
>> want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
>> else
>> @@ -724,7 +723,7 @@ retry_snap:
>> ret = -ENOSPC;
>> goto out;
>> }
>> - __ceph_do_pending_vmtruncate(inode, true);
>> + mutex_lock(&inode->i_mutex);
>> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
>> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> inode->i_size);
>> @@ -733,8 +732,10 @@ retry_snap:
>> else
>> want = CEPH_CAP_FILE_BUFFER;
>> ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
>> - if (ret < 0)
>> - goto out_put;
>> + if (ret < 0) {
>> + mutex_unlock(&inode->i_mutex);
>> + goto out;
>> + }
>>
>> dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
>> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> @@ -744,10 +745,10 @@ retry_snap:
>> (iocb->ki_filp->f_flags & O_DIRECT) ||
>> (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
>> (fi->flags & CEPH_F_SYNC)) {
>> + mutex_unlock(&inode->i_mutex);
>> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
>> &iocb->ki_pos);
>> } else {
>> - mutex_lock(&inode->i_mutex);
>> ret = __generic_file_aio_write(iocb, iov, nr_segs,
>> &iocb->ki_pos);
>> mutex_unlock(&inode->i_mutex);
>> @@ -762,7 +763,6 @@ retry_snap:
>> __mark_inode_dirty(inode, dirty);
>> }
>>
>> -out_put:
>> dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
>> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> ceph_cap_string(got));
>
> Mechanically, the rest of this looks correct. I seem to remember us
> changing this (or something similar) so that we *didn't* hold i_mutex
> while waiting on the caps in order to avoid some deadlock. But... looking
> through the git history I can't find anything.
>
> I think the race to consider is if we are blocked waiting for the WR cap
> and the MDS sends us a truncate. It will queue the async truncate work
> but that will block waiting for i_mutex. Can that deadlock? (I think no,
> but perhaps the pending truncate check needs to happen after we acquire
> the cap, too!)
Maybe we should also call wake_up_all(&ci->i_cap_wq) when receiving truncate from
MDS.
To truncate a file, the MDS xlock the filelock firstly, which revokes all Fw caps
from clients. Then the MDS sends truncate to clients and finally drops the xlock.
It's impossible that client receives a truncate request while having Fw cap. So I
don't think we need check pending truncate after acquiring the Fw cap.
>
> Similarly, if we block holding i_mutex and wait for WR, but the MDS
> revokes some other cap (say, WRBUFFER), could we deadlock from teh
> async writeback worker?
I think no, i_mutex is not involved in writeback.
>
> Both sound dangerous to me. I wonder if something in the spirit of
>
> while (true) {
> get_cap(Fw)
> if (try_lock_mutex(...))
> break;
> put_cap(Fw);
> lock_mutex(...)
> unlock_mutex(...)
> }
>
> would be simpler. Or, make a get_cap variant that drops i_mutex while
> waiting, but takes it before grabbing the actual Fw cap ref.
>
See my patch "ceph: apply write checks in ceph_aio_write". I think we really
should acquire i_mutex before getting caps. ceph_get_caps() needs a parameter
'endoff', if the file is opened in append mode, the endoff is calculated from
i_size. If we drop i_mutex in ceph_get_caps(), someone else may change the
i_size.
Regards
Yan, Zheng
> sage
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] ceph: take i_mutex before getting Fw cap
2013-04-17 12:14 ` Yan, Zheng
@ 2013-04-17 15:28 ` Sage Weil
0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2013-04-17 15:28 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel, greg, elder
On Wed, 17 Apr 2013, Yan, Zheng wrote:
> On 04/17/2013 11:42 AM, Sage Weil wrote:
> > On Fri, 12 Apr 2013, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>
> >> There is deadlock as illustrated bellow. The fix is taking i_mutex
> >> before getting Fw cap reference.
> >>
> >> write truncate MDS
> >> --------------------- -------------------- --------------
> >> get Fw cap
> >> lock i_mutex
> >> lock i_mutex (blocked)
> >> request setattr.size ->
> >> <- revoke Fw cap
> >>
> >> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> >> ---
> >> fs/ceph/caps.c | 13 +++++++------
> >> fs/ceph/file.c | 12 ++++++------
> >> 2 files changed, 13 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> >> index 0da2e94..8737572 100644
> >> --- a/fs/ceph/caps.c
> >> +++ b/fs/ceph/caps.c
> >> @@ -2058,6 +2058,13 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> >> goto out;
> >> }
> >>
> >> + /* finish pending truncate */
> >> + while (ci->i_truncate_pending) {
> >> + spin_unlock(&ci->i_ceph_lock);
> >> + __ceph_do_pending_vmtruncate(inode, !(need & CEPH_CAP_FILE_WR));
> >> + spin_lock(&ci->i_ceph_lock);
> >
> > I think if we retake i_ceph_lock we need to goto the top to make sure our
> > local variables aren't stale.. in this case, just file_wanted.
> >
> >> + }
> >> +
> >> if (need & CEPH_CAP_FILE_WR) {
> >> if (endoff >= 0 && endoff > (loff_t)ci->i_max_size) {
> >> dout("get_cap_refs %p endoff %llu > maxsize %llu\n",
> >> @@ -2079,12 +2086,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
> >> }
> >> have = __ceph_caps_issued(ci, &implemented);
> >>
> >> - /*
> >> - * disallow writes while a truncate is pending
> >> - */
> >> - if (ci->i_truncate_pending)
> >> - have &= ~CEPH_CAP_FILE_WR;
> >> -
> >> if ((have & need) == need) {
> >> /*
> >> * Look at (implemented & ~have & not) so that we keep waiting
> >> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> >> index 546a705..5490598 100644
> >> --- a/fs/ceph/file.c
> >> +++ b/fs/ceph/file.c
> >> @@ -647,7 +647,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
> >> dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
> >> inode, ceph_vinop(inode), pos, (unsigned)len, inode);
> >> again:
> >> - __ceph_do_pending_vmtruncate(inode, true);
> >> if (fi->fmode & CEPH_FILE_MODE_LAZY)
> >> want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
> >> else
> >> @@ -724,7 +723,7 @@ retry_snap:
> >> ret = -ENOSPC;
> >> goto out;
> >> }
> >> - __ceph_do_pending_vmtruncate(inode, true);
> >> + mutex_lock(&inode->i_mutex);
> >> dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
> >> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> >> inode->i_size);
> >> @@ -733,8 +732,10 @@ retry_snap:
> >> else
> >> want = CEPH_CAP_FILE_BUFFER;
> >> ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
> >> - if (ret < 0)
> >> - goto out_put;
> >> + if (ret < 0) {
> >> + mutex_unlock(&inode->i_mutex);
> >> + goto out;
> >> + }
> >>
> >> dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
> >> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> >> @@ -744,10 +745,10 @@ retry_snap:
> >> (iocb->ki_filp->f_flags & O_DIRECT) ||
> >> (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
> >> (fi->flags & CEPH_F_SYNC)) {
> >> + mutex_unlock(&inode->i_mutex);
> >> ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
> >> &iocb->ki_pos);
> >> } else {
> >> - mutex_lock(&inode->i_mutex);
> >> ret = __generic_file_aio_write(iocb, iov, nr_segs,
> >> &iocb->ki_pos);
> >> mutex_unlock(&inode->i_mutex);
> >> @@ -762,7 +763,6 @@ retry_snap:
> >> __mark_inode_dirty(inode, dirty);
> >> }
> >>
> >> -out_put:
> >> dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
> >> inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
> >> ceph_cap_string(got));
> >
> > Mechanically, the rest of this looks correct. I seem to remember us
> > changing this (or something similar) so that we *didn't* hold i_mutex
> > while waiting on the caps in order to avoid some deadlock. But... looking
> > through the git history I can't find anything.
> >
> > I think the race to consider is if we are blocked waiting for the WR cap
> > and the MDS sends us a truncate. It will queue the async truncate work
> > but that will block waiting for i_mutex. Can that deadlock? (I think no,
> > but perhaps the pending truncate check needs to happen after we acquire
> > the cap, too!)
>
> Maybe we should also call wake_up_all(&ci->i_cap_wq) when receiving truncate from
> MDS.
>
> To truncate a file, the MDS xlock the filelock firstly, which revokes all Fw caps
> from clients. Then the MDS sends truncate to clients and finally drops the xlock.
> It's impossible that client receives a truncate request while having Fw cap. So I
> don't think we need check pending truncate after acquiring the Fw cap.
Okay, sounds good to me then!
Perhaps we should add a comment and WARN_ON that we don't have the Fw cap
at that time.
Reviewed-by: Sage Weil <sage@inktank.com>
> >
> > Similarly, if we block holding i_mutex and wait for WR, but the MDS
> > revokes some other cap (say, WRBUFFER), could we deadlock from teh
> > async writeback worker?
>
> I think no, i_mutex is not involved in writeback.
>
> >
> > Both sound dangerous to me. I wonder if something in the spirit of
> >
> > while (true) {
> > get_cap(Fw)
> > if (try_lock_mutex(...))
> > break;
> > put_cap(Fw);
> > lock_mutex(...)
> > unlock_mutex(...)
> > }
> >
> > would be simpler. Or, make a get_cap variant that drops i_mutex while
> > waiting, but takes it before grabbing the actual Fw cap ref.
> >
>
> See my patch "ceph: apply write checks in ceph_aio_write". I think we really
> should acquire i_mutex before getting caps. ceph_get_caps() needs a parameter
> 'endoff', if the file is opened in append mode, the endoff is calculated from
> i_size. If we drop i_mutex in ceph_get_caps(), someone else may change the
> i_size.
>
>
> Regards
> Yan, Zheng
>
>
> > sage
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued
2013-04-12 8:11 ` [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued Yan, Zheng
@ 2013-04-17 21:58 ` Gregory Farnum
0 siblings, 0 replies; 13+ messages in thread
From: Gregory Farnum @ 2013-04-17 21:58 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel@vger.kernel.org, Alex Elder
This looks good to me, and I got Sage to review the second patch as
well. These are both merged into the next branch — thanks!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
On Fri, Apr 12, 2013 at 1:11 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> There is a total of 22 cap bits and file lock uses 8 cap bits.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> src/mds/CInode.h | 4 ++--
> src/mds/Locker.cc | 3 ++-
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/src/mds/CInode.h b/src/mds/CInode.h
> index 43af9c4..a3fe3cf 100644
> --- a/src/mds/CInode.h
> +++ b/src/mds/CInode.h
> @@ -736,9 +736,9 @@ public:
>
> // caps issued, wanted
> int get_caps_issued(int *ploner = 0, int *pother = 0, int *pxlocker = 0,
> - int shift = 0, int mask = 0xffff);
> + int shift = 0, int mask = -1);
> bool is_any_caps_wanted();
> - int get_caps_wanted(int *ploner = 0, int *pother = 0, int shift = 0, int mask = 0xffff);
> + int get_caps_wanted(int *ploner = 0, int *pother = 0, int shift = 0, int mask = -1);
> bool issued_caps_need_gather(SimpleLock *lock);
> void replicate_relax_locks();
>
> diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
> index 7ba9f44..f86ba88 100644
> --- a/src/mds/Locker.cc
> +++ b/src/mds/Locker.cc
> @@ -616,7 +616,8 @@ void Locker::eval_gather(SimpleLock *lock, bool first, bool *pneed_issue, list<C
>
> int loner_issued = 0, other_issued = 0, xlocker_issued = 0;
> if (caps && in->is_head()) {
> - in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued, lock->get_cap_shift(), 3);
> + in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued,
> + lock->get_cap_shift(), lock->get_cap_mask());
> dout(10) << " next state is " << lock->get_state_name(next)
> << " issued/allows loner " << gcap_string(loner_issued)
> << "/" << gcap_string(lock->gcaps_allowed(CAP_LONER, next))
> --
> 1.7.11.7
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2013-04-17 21:58 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-12 8:11 [PATCH 1/4] ceph: add osd request to inode unsafe list in advance Yan, Zheng
2013-04-12 8:11 ` [PATCH 1/2] mds: pass proper mask to CInode::get_caps_issued Yan, Zheng
2013-04-17 21:58 ` Gregory Farnum
2013-04-12 8:11 ` [PATCH 2/4] ceph: take i_mutex before getting Fw cap Yan, Zheng
2013-04-15 16:15 ` Alex Elder
2013-04-17 3:42 ` Sage Weil
2013-04-17 12:14 ` Yan, Zheng
2013-04-17 15:28 ` Sage Weil
2013-04-12 8:11 ` [PATCH 2/2] mds: change XLOCK/XLOCKDONE's next state to LOCK Yan, Zheng
2013-04-12 8:11 ` [PATCH 3/4] ceph: fix symlink inode operations Yan, Zheng
2013-04-15 16:15 ` Alex Elder
2013-04-12 8:11 ` [PATCH 4/4] ceph: apply write checks in ceph_aio_write Yan, Zheng
2013-04-15 16:15 ` Alex Elder
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.