* [PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap
@ 2012-11-01 9:03 Yan, Zheng
2012-11-01 9:03 ` [PATCH 1/2] mds: Don't acquire replica object's versionlock Yan, Zheng
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Yan, Zheng @ 2012-11-01 9:03 UTC (permalink / raw)
To: sage, ceph-devel; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
The cap from non-auth mds doesn't have a meaningful max_size value.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/caps.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 3251e9c..c633d1d 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2388,7 +2388,7 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant,
&atime);
/* max size increase? */
- if (max_size != ci->i_max_size) {
+ if (ci->i_auth_cap == cap && max_size != ci->i_max_size) {
dout("max_size %lld -> %llu\n", ci->i_max_size, max_size);
ci->i_max_size = max_size;
if (max_size >= ci->i_wanted_max_size) {
--
1.7.11.7
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 1/2] mds: Don't acquire replica object's versionlock
2012-11-01 9:03 [PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap Yan, Zheng
@ 2012-11-01 9:03 ` Yan, Zheng
2012-11-05 18:52 ` Sage Weil
2012-11-01 9:03 ` [PATCH 2/2] ceph: Fix i_size update race Yan, Zheng
2012-11-01 9:03 ` [PATCH 2/2] mds: Allow try_eval to eval unstable locks in freezing object Yan, Zheng
2 siblings, 1 reply; 6+ messages in thread
From: Yan, Zheng @ 2012-11-01 9:03 UTC (permalink / raw)
To: sage, ceph-devel; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
Both CInode and CDentry's versionlocks are of type LocalLock.
Acquiring LocalLock in replica object is useless and problematic.
For example, if two requests try acquiring a replica object's
versionlock, the first request succeeds, the second request
is added to wait queue. Later when the first request finishes,
MDCache::request_drop_foreign_locks() finds the lock's parent is
non-auth, it skips waking requests in the wait queue. So the
second request hangs.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
src/mds/Locker.cc | 6 ++++++
src/mds/Server.cc | 25 ++++++++++---------------
3 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index e033bbe..6474743 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -196,6 +196,7 @@ bool Locker::acquire_locks(MDRequest *mdr,
// augment xlock with a versionlock?
if ((*p)->get_type() == CEPH_LOCK_DN) {
CDentry *dn = (CDentry*)(*p)->get_parent();
+ assert(dn->is_auth());
if (xlocks.count(&dn->versionlock))
continue; // we're xlocking the versionlock too; don't wrlock it!
@@ -213,6 +214,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
if ((*p)->get_type() > CEPH_LOCK_IVERSION) {
// inode version lock?
CInode *in = (CInode*)(*p)->get_parent();
+ if (!in->is_auth())
+ continue;
if (mdr->is_master()) {
// master. wrlock versionlock so we can pipeline inode updates to journal.
wrlocks.insert(&in->versionlock);
@@ -3899,6 +3902,7 @@ void Locker::local_wrlock_grab(LocalLock *lock, Mutation *mut)
dout(7) << "local_wrlock_grab on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
assert(lock->can_wrlock());
assert(!mut->wrlocks.count(lock));
lock->get_wrlock(mut->get_client());
@@ -3911,6 +3915,7 @@ bool Locker::local_wrlock_start(LocalLock *lock, MDRequest *mut)
dout(7) << "local_wrlock_start on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
if (lock->can_wrlock()) {
assert(!mut->wrlocks.count(lock));
lock->get_wrlock(mut->get_client());
@@ -3942,6 +3947,7 @@ bool Locker::local_xlock_start(LocalLock *lock, MDRequest *mut)
dout(7) << "local_xlock_start on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
if (!lock->can_xlock_local()) {
lock->add_waiter(SimpleLock::WAIT_WR|SimpleLock::WAIT_STABLE, new C_MDS_RetryRequest(mdcache, mut));
return false;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4642a13..45c890a 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5204,25 +5204,20 @@ void Server::handle_client_rename(MDRequest *mdr)
wrlocks.insert(&straydn->get_dir()->inode->nestlock);
}
- // xlock versionlock on srci if remote?
- // this ensures it gets safely remotely auth_pinned, avoiding deadlock;
- // strictly speaking, having the slave node freeze the inode is
- // otherwise sufficient for avoiding conflicts with inode locks, etc.
- if (!srcdn->is_auth() && srcdnl->is_primary()) // xlock versionlock on srci if there are any witnesses
- xlocks.insert(&srci->versionlock);
-
// xlock versionlock on dentries if there are witnesses.
// replicas can't see projected dentry linkages, and will get
// confused if we try to pipeline things.
if (!witnesses.empty()) {
- if (srcdn->is_projected())
- xlocks.insert(&srcdn->versionlock);
- if (destdn->is_projected())
- xlocks.insert(&destdn->versionlock);
- // also take rdlock on all ancestor dentries for destdn. this ensures that the
- // destdn can be traversed to by the witnesses.
- for (int i=0; i<(int)desttrace.size(); i++)
- xlocks.insert(&desttrace[i]->versionlock);
+ // take xlock on all projected dentries for srcdn and destdn. this ensures
+ // that the srcdn and destdn can be traversed to by the witnesses.
+ for (int i= 0; i<(int)srctrace.size(); i++) {
+ if (srctrace[i]->is_auth() && srctrace[i]->is_projected())
+ xlocks.insert(&srctrace[i]->versionlock);
+ }
+ for (int i=0; i<(int)desttrace.size(); i++) {
+ if (desttrace[i]->is_auth() && desttrace[i]->is_projected())
+ xlocks.insert(&desttrace[i]->versionlock);
+ }
}
// we need to update srci's ctime. xlock its least contended lock to do that...
--
1.7.11.7
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/2] ceph: Fix i_size update race
2012-11-01 9:03 [PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap Yan, Zheng
2012-11-01 9:03 ` [PATCH 1/2] mds: Don't acquire replica object's versionlock Yan, Zheng
@ 2012-11-01 9:03 ` Yan, Zheng
2012-11-01 9:03 ` [PATCH 2/2] mds: Allow try_eval to eval unstable locks in freezing object Yan, Zheng
2 siblings, 0 replies; 6+ messages in thread
From: Yan, Zheng @ 2012-11-01 9:03 UTC (permalink / raw)
To: sage, ceph-devel; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR
dirty before data is copied to page cache and inode size is updated.
If sceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. The fix is move
ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call
__ceph_mark_dirty_caps() after inode size is updated.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
fs/ceph/addr.c | 43 ++++++++++++++++++++++++++++++++++--
fs/ceph/file.c | 69 +++++++++++++++++++++++-----------------------------------
2 files changed, 68 insertions(+), 44 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 22b6e45..8dd8d05 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1078,23 +1078,47 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = file->f_dentry->d_inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_file_info *fi = file->private_data;
struct page *page;
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
- int r;
+ int r, want, got = 0;
+
+ if (fi->fmode & CEPH_FILE_MODE_LAZY)
+ want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
+ else
+ want = CEPH_CAP_FILE_BUFFER;
+ r = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos+len);
+ if (r < 0)
+ return r;
+
+ dout("write_begin %p %llx.%llx %llu~%u got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
+ if (!(got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO))) {
+ ceph_put_cap_refs(ci, got);
+ return -EAGAIN;
+ }
do {
/* get a page */
page = grab_cache_page_write_begin(mapping, index, 0);
if (!page)
return -ENOMEM;
- *pagep = page;
dout("write_begin file %p inode %p page %p %d~%d\n", file,
inode, page, (int)pos, (int)len);
r = ceph_update_writeable_page(file, pos, len, page);
+ if (r)
+ page_cache_release(page);
} while (r == -EAGAIN);
+ if (r) {
+ ceph_put_cap_refs(ci, got);
+ } else {
+ *pagep = page;
+ *(int *)fsdata = got;
+ }
return r;
}
@@ -1108,10 +1132,12 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
struct page *page, void *fsdata)
{
struct inode *inode = file->f_dentry->d_inode;
+ struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
struct ceph_mds_client *mdsc = fsc->mdsc;
unsigned from = pos & (PAGE_CACHE_SIZE - 1);
int check_cap = 0;
+ int got = (unsigned long)fsdata;
dout("write_end file %p inode %p page %p %d~%d (%d)\n", file,
inode, page, (int)pos, (int)copied, (int)len);
@@ -1134,6 +1160,19 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
up_read(&mdsc->snap_rwsem);
page_cache_release(page);
+ if (copied > 0) {
+ int dirty;
+ spin_lock(&ci->i_ceph_lock);
+ dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
+ spin_unlock(&ci->i_ceph_lock);
+ if (dirty)
+ __mark_inode_dirty(inode, dirty);
+ }
+
+ dout("write_end %p %llx.%llx %llu~%u dropping cap refs on %s\n",
+ inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
+ ceph_put_cap_refs(ci, got);
+
if (check_cap)
ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY, NULL);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5840d2a..266f6e0 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -712,63 +712,49 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct ceph_osd_client *osdc =
&ceph_sb_to_client(inode->i_sb)->client->osdc;
loff_t endoff = pos + iov->iov_len;
- int want, got = 0;
- int ret, err;
+ int got = 0;
+ int ret, err, written;
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
retry_snap:
+ written = 0;
if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
return -ENOSPC;
__ceph_do_pending_vmtruncate(inode);
- dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
- inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
- inode->i_size);
- if (fi->fmode & CEPH_FILE_MODE_LAZY)
- want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
- else
- want = CEPH_CAP_FILE_BUFFER;
- ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, endoff);
- if (ret < 0)
- goto out_put;
-
- dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
- inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
- ceph_cap_string(got));
-
- if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
- (iocb->ki_filp->f_flags & O_DIRECT) ||
- (inode->i_sb->s_flags & MS_SYNCHRONOUS) ||
- (fi->flags & CEPH_F_SYNC)) {
- ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
- &iocb->ki_pos);
- } else {
- /*
- * buffered write; drop Fw early to avoid slow
- * revocation if we get stuck on balance_dirty_pages
- */
- int dirty;
-
- spin_lock(&ci->i_ceph_lock);
- dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
- spin_unlock(&ci->i_ceph_lock);
- ceph_put_cap_refs(ci, got);
+ if (!(iocb->ki_filp->f_flags & O_DIRECT) &&
+ !(inode->i_sb->s_flags & MS_SYNCHRONOUS) &&
+ !(fi->flags & CEPH_F_SYNC)) {
ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+ if (ret >= 0)
+ written = ret;
+
if ((ret >= 0 || ret == -EIOCBQUEUED) &&
((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
|| ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
- err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
+ err = vfs_fsync_range(file, pos, pos + written - 1, 1);
if (err < 0)
ret = err;
}
-
- if (dirty)
- __mark_inode_dirty(inode, dirty);
- goto out;
+ if ((ret < 0 && ret != -EAGAIN) || pos + written >= endoff)
+ goto out;
}
+ dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
+ inode, ceph_vinop(inode), pos + written,
+ (unsigned)iov->iov_len - written, inode->i_size);
+ ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, 0, &got, endoff);
+ if (ret < 0)
+ goto out_put;
+
+ dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
+ inode, ceph_vinop(inode), pos + written,
+ (unsigned)iov->iov_len - written, ceph_cap_string(got));
+
+ ret = ceph_sync_write(file, iov->iov_base + written,
+ iov->iov_len - written, &iocb->ki_pos);
if (ret >= 0) {
int dirty;
spin_lock(&ci->i_ceph_lock);
@@ -780,10 +766,9 @@ retry_snap:
out_put:
dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
- inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
- ceph_cap_string(got));
+ inode, ceph_vinop(inode), pos + written,
+ (unsigned)iov->iov_len - written, ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
-
out:
if (ret == -EOLDSNAPC) {
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
--
1.7.11.7
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/2] mds: Allow try_eval to eval unstable locks in freezing object
2012-11-01 9:03 [PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap Yan, Zheng
2012-11-01 9:03 ` [PATCH 1/2] mds: Don't acquire replica object's versionlock Yan, Zheng
2012-11-01 9:03 ` [PATCH 2/2] ceph: Fix i_size update race Yan, Zheng
@ 2012-11-01 9:03 ` Yan, Zheng
2 siblings, 0 replies; 6+ messages in thread
From: Yan, Zheng @ 2012-11-01 9:03 UTC (permalink / raw)
To: sage, ceph-devel; +Cc: Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
Unstable locks hold auth_pins on the object, it prevents the freezing
object become frozen and then unfreeze. So try_eval() should not wait
for freezing object
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
src/mds/Locker.cc | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 6474743..68ecc16 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -837,8 +837,8 @@ void Locker::try_eval(MDSCacheObject *p, int mask)
return;
}
- if (p->is_auth() && !p->can_auth_pin()) {
- dout(7) << "try_eval can't auth_pin, waiting on " << *p << dendl;
+ if (p->is_auth() && p->is_frozen()) {
+ dout(7) << "try_eval frozen, waiting on " << *p << dendl;
p->add_waiter(MDSCacheObject::WAIT_UNFREEZE, new C_Locker_Eval(this, p, mask));
return;
}
--
1.7.11.7
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] mds: Don't acquire replica object's versionlock
2012-11-01 9:03 ` [PATCH 1/2] mds: Don't acquire replica object's versionlock Yan, Zheng
@ 2012-11-05 18:52 ` Sage Weil
2012-11-06 8:22 ` Yan, Zheng
0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2012-11-05 18:52 UTC (permalink / raw)
To: Yan, Zheng; +Cc: ceph-devel
On Thu, 1 Nov 2012, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> Both CInode and CDentry's versionlocks are of type LocalLock.
> Acquiring LocalLock in replica object is useless and problematic.
> For example, if two requests try acquiring a replica object's
> versionlock, the first request succeeds, the second request
> is added to wait queue. Later when the first request finishes,
> MDCache::request_drop_foreign_locks() finds the lock's parent is
> non-auth, it skips waking requests in the wait queue. So the
> second request hangs.
I don't remmeber the details, but the iversion locking on replicas came up
while testing renaming and export thrashing. i.e., running with
mds thrash exports = 1
and doing some rename workload (fsstress maybe?).
Maybe the fix is just to wake the requests in the queue?
s
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
> src/mds/Locker.cc | 6 ++++++
> src/mds/Server.cc | 25 ++++++++++---------------
> 3 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
> index e033bbe..6474743 100644
> --- a/src/mds/Locker.cc
> +++ b/src/mds/Locker.cc
> @@ -196,6 +196,7 @@ bool Locker::acquire_locks(MDRequest *mdr,
> // augment xlock with a versionlock?
> if ((*p)->get_type() == CEPH_LOCK_DN) {
> CDentry *dn = (CDentry*)(*p)->get_parent();
> + assert(dn->is_auth());
>
> if (xlocks.count(&dn->versionlock))
> continue; // we're xlocking the versionlock too; don't wrlock it!
> @@ -213,6 +214,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
> if ((*p)->get_type() > CEPH_LOCK_IVERSION) {
> // inode version lock?
> CInode *in = (CInode*)(*p)->get_parent();
> + if (!in->is_auth())
> + continue;
> if (mdr->is_master()) {
> // master. wrlock versionlock so we can pipeline inode updates to journal.
> wrlocks.insert(&in->versionlock);
> @@ -3899,6 +3902,7 @@ void Locker::local_wrlock_grab(LocalLock *lock, Mutation *mut)
> dout(7) << "local_wrlock_grab on " << *lock
> << " on " << *lock->get_parent() << dendl;
>
> + assert(lock->get_parent()->is_auth());
> assert(lock->can_wrlock());
> assert(!mut->wrlocks.count(lock));
> lock->get_wrlock(mut->get_client());
> @@ -3911,6 +3915,7 @@ bool Locker::local_wrlock_start(LocalLock *lock, MDRequest *mut)
> dout(7) << "local_wrlock_start on " << *lock
> << " on " << *lock->get_parent() << dendl;
>
> + assert(lock->get_parent()->is_auth());
> if (lock->can_wrlock()) {
> assert(!mut->wrlocks.count(lock));
> lock->get_wrlock(mut->get_client());
> @@ -3942,6 +3947,7 @@ bool Locker::local_xlock_start(LocalLock *lock, MDRequest *mut)
> dout(7) << "local_xlock_start on " << *lock
> << " on " << *lock->get_parent() << dendl;
>
> + assert(lock->get_parent()->is_auth());
> if (!lock->can_xlock_local()) {
> lock->add_waiter(SimpleLock::WAIT_WR|SimpleLock::WAIT_STABLE, new C_MDS_RetryRequest(mdcache, mut));
> return false;
> diff --git a/src/mds/Server.cc b/src/mds/Server.cc
> index 4642a13..45c890a 100644
> --- a/src/mds/Server.cc
> +++ b/src/mds/Server.cc
> @@ -5204,25 +5204,20 @@ void Server::handle_client_rename(MDRequest *mdr)
> wrlocks.insert(&straydn->get_dir()->inode->nestlock);
> }
>
> - // xlock versionlock on srci if remote?
> - // this ensures it gets safely remotely auth_pinned, avoiding deadlock;
> - // strictly speaking, having the slave node freeze the inode is
> - // otherwise sufficient for avoiding conflicts with inode locks, etc.
> - if (!srcdn->is_auth() && srcdnl->is_primary()) // xlock versionlock on srci if there are any witnesses
> - xlocks.insert(&srci->versionlock);
> -
> // xlock versionlock on dentries if there are witnesses.
> // replicas can't see projected dentry linkages, and will get
> // confused if we try to pipeline things.
> if (!witnesses.empty()) {
> - if (srcdn->is_projected())
> - xlocks.insert(&srcdn->versionlock);
> - if (destdn->is_projected())
> - xlocks.insert(&destdn->versionlock);
> - // also take rdlock on all ancestor dentries for destdn. this ensures that the
> - // destdn can be traversed to by the witnesses.
> - for (int i=0; i<(int)desttrace.size(); i++)
> - xlocks.insert(&desttrace[i]->versionlock);
> + // take xlock on all projected dentries for srcdn and destdn. this ensures
> + // that the srcdn and destdn can be traversed to by the witnesses.
> + for (int i= 0; i<(int)srctrace.size(); i++) {
> + if (srctrace[i]->is_auth() && srctrace[i]->is_projected())
> + xlocks.insert(&srctrace[i]->versionlock);
> + }
> + for (int i=0; i<(int)desttrace.size(); i++) {
> + if (desttrace[i]->is_auth() && desttrace[i]->is_projected())
> + xlocks.insert(&desttrace[i]->versionlock);
> + }
> }
>
> // we need to update srci's ctime. xlock its least contended lock to do that...
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] mds: Don't acquire replica object's versionlock
2012-11-05 18:52 ` Sage Weil
@ 2012-11-06 8:22 ` Yan, Zheng
0 siblings, 0 replies; 6+ messages in thread
From: Yan, Zheng @ 2012-11-06 8:22 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 11/06/2012 02:52 AM, Sage Weil wrote:
> On Thu, 1 Nov 2012, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> Both CInode and CDentry's versionlocks are of type LocalLock.
>> Acquiring LocalLock in replica object is useless and problematic.
>> For example, if two requests try acquiring a replica object's
>> versionlock, the first request succeeds, the second request
>> is added to wait queue. Later when the first request finishes,
>> MDCache::request_drop_foreign_locks() finds the lock's parent is
>> non-auth, it skips waking requests in the wait queue. So the
>> second request hangs.
>
> I don't remmeber the details, but the iversion locking on replicas came up
> while testing renaming and export thrashing. i.e., running with
>
> mds thrash exports = 1
>
> and doing some rename workload (fsstress maybe?).
I saw assertion in Server::handle_slave_rename_prep() was triggered when I applied
a wrong fix. but never see it again with this patch.
>
> Maybe the fix is just to wake the requests in the queue?
>
If I'm not wrong the version locks are used to wait for all rename operation on
the paths to srcdn/destdn to complete. So witness see consistent paths for srcdn
and destdn no matter how many times the OP_RENAMEPREP slave request is dispatched.
I think the main reason we need version lock is that: For an auth dentry, we may
rdlock it even it is already xlocked. But for non-auth dentry, we only can rdlock
it when the lock is sync state, it guarantees the dentry is not xlocked.
I find a bug in previous patch. 'assert(dn->is_auth())' in Locker::acquire_locks
should be if '(!dn->is_auth()) continue'.
Regards
Yan, Zheng
---
From 6683482d9f9517b990d3e4bae18af275f32491e4 Mon Sep 17 00:00:00 2001
From: "Yan, Zheng" <zheng.z.yan@intel.com>
Date: Thu, 1 Nov 2012 13:21:21 +0800
Subject: [PATCH] mds: Don't acquire replica object's versionlock
Both CInode and CDentry's versionlocks are of type LocalLock.
Acquiring LocalLock in replica object is useless and problematic.
For example, if two requests try acquiring a replica object's
versionlock, the first request succeeds, the second request
is added to wait queue. Later when the first request finishes,
MDCache::request_drop_foreign_locks() finds the lock's parent is
non-auth, it skips waking requests in the wait queue. So the
second request hangs.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
src/mds/Locker.cc | 7 +++++++
src/mds/Server.cc | 25 ++++++++++---------------
2 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 7b6d449..a1f957a 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -196,6 +196,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
// augment xlock with a versionlock?
if ((*p)->get_type() == CEPH_LOCK_DN) {
CDentry *dn = (CDentry*)(*p)->get_parent();
+ if (!dn->is_auth())
+ continue;
if (xlocks.count(&dn->versionlock))
continue; // we're xlocking the versionlock too; don't wrlock it!
@@ -213,6 +215,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
if ((*p)->get_type() > CEPH_LOCK_IVERSION) {
// inode version lock?
CInode *in = (CInode*)(*p)->get_parent();
+ if (!in->is_auth())
+ continue;
if (mdr->is_master()) {
// master. wrlock versionlock so we can pipeline inode updates to journal.
wrlocks.insert(&in->versionlock);
@@ -3899,6 +3903,7 @@ void Locker::local_wrlock_grab(LocalLock *lock, Mutation *mut)
dout(7) << "local_wrlock_grab on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
assert(lock->can_wrlock());
assert(!mut->wrlocks.count(lock));
lock->get_wrlock(mut->get_client());
@@ -3911,6 +3916,7 @@ bool Locker::local_wrlock_start(LocalLock *lock, MDRequest *mut)
dout(7) << "local_wrlock_start on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
if (lock->can_wrlock()) {
assert(!mut->wrlocks.count(lock));
lock->get_wrlock(mut->get_client());
@@ -3942,6 +3948,7 @@ bool Locker::local_xlock_start(LocalLock *lock, MDRequest *mut)
dout(7) << "local_xlock_start on " << *lock
<< " on " << *lock->get_parent() << dendl;
+ assert(lock->get_parent()->is_auth());
if (!lock->can_xlock_local()) {
lock->add_waiter(SimpleLock::WAIT_WR|SimpleLock::WAIT_STABLE, new C_MDS_RetryRequest(mdcache, mut));
return false;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4642a13..45c890a 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5204,25 +5204,20 @@ void Server::handle_client_rename(MDRequest *mdr)
wrlocks.insert(&straydn->get_dir()->inode->nestlock);
}
- // xlock versionlock on srci if remote?
- // this ensures it gets safely remotely auth_pinned, avoiding deadlock;
- // strictly speaking, having the slave node freeze the inode is
- // otherwise sufficient for avoiding conflicts with inode locks, etc.
- if (!srcdn->is_auth() && srcdnl->is_primary()) // xlock versionlock on srci if there are any witnesses
- xlocks.insert(&srci->versionlock);
-
// xlock versionlock on dentries if there are witnesses.
// replicas can't see projected dentry linkages, and will get
// confused if we try to pipeline things.
if (!witnesses.empty()) {
- if (srcdn->is_projected())
- xlocks.insert(&srcdn->versionlock);
- if (destdn->is_projected())
- xlocks.insert(&destdn->versionlock);
- // also take rdlock on all ancestor dentries for destdn. this ensures that the
- // destdn can be traversed to by the witnesses.
- for (int i=0; i<(int)desttrace.size(); i++)
- xlocks.insert(&desttrace[i]->versionlock);
+ // take xlock on all projected dentries for srcdn and destdn. this ensures
+ // that the srcdn and destdn can be traversed to by the witnesses.
+ for (int i= 0; i<(int)srctrace.size(); i++) {
+ if (srctrace[i]->is_auth() && srctrace[i]->is_projected())
+ xlocks.insert(&srctrace[i]->versionlock);
+ }
+ for (int i=0; i<(int)desttrace.size(); i++) {
+ if (desttrace[i]->is_auth() && desttrace[i]->is_projected())
+ xlocks.insert(&desttrace[i]->versionlock);
+ }
}
// we need to update srci's ctime. xlock its least contended lock to do that...
--
1.7.11.7
^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2012-11-06 8:22 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-01 9:03 [PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap Yan, Zheng
2012-11-01 9:03 ` [PATCH 1/2] mds: Don't acquire replica object's versionlock Yan, Zheng
2012-11-05 18:52 ` Sage Weil
2012-11-06 8:22 ` Yan, Zheng
2012-11-01 9:03 ` [PATCH 2/2] ceph: Fix i_size update race Yan, Zheng
2012-11-01 9:03 ` [PATCH 2/2] mds: Allow try_eval to eval unstable locks in freezing object Yan, Zheng
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.