* [Ocfs2-devel] [RFC] Why unlink performance is low?
@ 2014-01-06 2:11 Goldwyn Rodrigues
0 siblings, 0 replies; only message in thread
From: Goldwyn Rodrigues @ 2014-01-06 2:11 UTC (permalink / raw)
To: ocfs2-devel
After a delete, the system thread calls evict_inode which calls the
following sequence:
ocfs2_evict_inode() -> ocfs2_delete_inode() ->
ocfs2_query_inode_wipe() -> ocfs2_try_open_lock() on d1, it fails
with -EAGAIN. The open lock fails because on the remote node
a PR->EX convert takes longer than a simple EX grant.
This starts a checkpoint because OCFS2_INODE_DELETED flag is not set.
Now, a checkpoint interferes with the journaling of the inodes deleted
in the following unlinks. I had earlier concluded that this happens
for directories only, however I was wrong. This happens for files as well.
The patch attached is *not* correct. I am sending this to show that where
the problem lies. I worked this on a "hypothetical" situation where the
files created by other nodes are not open on any other node during the
time of deletion. I agree that open lock should not block during
inode eviction.
The root problem is that open lock fails with -EAGAIN even if the file
is not open on any other node of the cluster. The reason we get -EAGAIN
is because the lock is on the remote end and the whole locking sequence
does not complete with LKF_NOQUEUE set. Here are some numbers:
Without patch (native) times:
-------------------------------------------------
| # files | create #s | copy #s | remove #s |
-------------------------------------------------
| 1 | 0:00.03 | 0:00.25 | 0:00.94 |
| 2 | 0:00.12 | 0:00.20 | 0:01.12 |
| 4 | 0:00.16 | 0:00.31 | 0:03.50 |
| 8 | 0:00.11 | 0:00.38 | 0:08.15 |
| 16 | 0:00.11 | 0:00.60 | 0:14.64 |
| 32 | 0:00.15 | 0:00.89 | 0:28.04 |
| 64 | 0:00.24 | 0:03.49 | 0:59.96 |
| 128 | 0:00.42 | 0:08.73 | 1:52.14 |
| 256 | 0:01.05 | 0:18.03 | 3:54.81 |
| 1024 | 0:02.74 | 0:44.13 | 14:46.36 |
With patch times:
-------------------------------------------------
| # files | create #s | copy #s | remove #s |
-------------------------------------------------
| 1 | 0:00.02 | 0:00.83 | 0:00.33 |
| 2 | 0:00.04 | 0:00.18 | 0:00.27 |
| 4 | 0:00.07 | 0:00.26 | 0:00.27 |
| 8 | 0:00.08 | 0:00.29 | 0:00.44 |
| 16 | 0:00.10 | 0:00.39 | 0:00.69 |
| 32 | 0:00.14 | 0:00.60 | 0:01.26 |
| 64 | 0:00.23 | 0:01.19 | 0:02.33 |
| 128 | 0:00.51 | 0:02.15 | 0:04.60 |
| 256 | 0:00.87 | 0:04.59 | 0:09.74 |
| 1024 | 0:02.78 | 0:17.64 | 0:37.93 |
The numbers show that the improvement is not just with unlinks
but with other operations as well because the journal is
no longer overworked.
I am looking for suggestions where we can overcome this design issue
to make sure that try open locks succeed if the file is not
opened on any node. I think the semantics of DLM_LKF_NOQUEUE
may be interpreted incorrectly, or we are probably not waiting for the
lksb status to be updated, but I am not sure and some insight into this
would be helpful.
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index f2d48c8..eb3baac 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1681,9 +1681,9 @@ void ocfs2_rw_unlock(struct inode *inode, int write)
/*
* ocfs2_open_lock always get PR mode lock.
*/
-int ocfs2_open_lock(struct inode *inode)
+int ocfs2_open_lock(struct inode *inode, int ex)
{
- int status = 0;
+ int status = 0, level;
struct ocfs2_lock_res *lockres;
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
@@ -1696,9 +1696,10 @@ int ocfs2_open_lock(struct inode *inode)
goto out;
lockres = &OCFS2_I(inode)->ip_open_lockres;
+ level = ex ? DLM_LOCK_EX : DLM_LOCK_PR;
status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
- DLM_LOCK_PR, 0, 0);
+ level, 0, 0);
if (status < 0)
mlog_errno(status);
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index 1d596d8..12766a1 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -110,7 +110,7 @@ int ocfs2_create_new_inode_locks(struct inode *inode);
int ocfs2_drop_inode_locks(struct inode *inode);
int ocfs2_rw_lock(struct inode *inode, int write);
void ocfs2_rw_unlock(struct inode *inode, int write);
-int ocfs2_open_lock(struct inode *inode);
+int ocfs2_open_lock(struct inode *inode, int ex);
int ocfs2_try_open_lock(struct inode *inode, int write);
void ocfs2_open_unlock(struct inode *inode);
int ocfs2_inode_lock_atime(struct inode *inode,
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index f87f9bd..792dba7 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -454,7 +454,7 @@ static int ocfs2_read_locked_inode(struct inode *inode,
0, inode);
if (can_lock) {
- status = ocfs2_open_lock(inode);
+ status = ocfs2_open_lock(inode, 0);
if (status) {
make_bad_inode(inode);
mlog_errno(status);
@@ -922,7 +922,7 @@ static int ocfs2_query_inode_wipe(struct inode *inode,
* Though we call this with the meta data lock held, the
* trylock keeps us from ABBA deadlock.
*/
- status = ocfs2_try_open_lock(inode, 1);
+ status = ocfs2_open_lock(inode, 1);
if (status == -EAGAIN) {
status = 0;
reason = 3;
@@ -997,6 +997,7 @@ static void ocfs2_delete_inode(struct inode *inode)
ocfs2_cleanup_delete_inode(inode, 0);
goto bail_unblock;
}
+
/* Lock down the inode. This gives us an up to date view of
* it's metadata (for verification), and allows us to
* serialize delete_inode on multiple nodes.
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index be3f867..ac67f2d 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -2307,7 +2307,7 @@ int ocfs2_create_inode_in_orphan(struct inode *dir,
}
/* get open lock so that only nodes can't remove it from orphan dir. */
- status = ocfs2_open_lock(inode);
+ status = ocfs2_open_lock(inode, 0);
if (status < 0)
mlog_errno(status);
--
Goldwyn
^ permalink raw reply related [flat|nested] only message in thread
only message in thread, other threads:[~2014-01-06 2:11 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-06 2:11 [Ocfs2-devel] [RFC] Why unlink performance is low? Goldwyn Rodrigues
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.