* [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock put
2016-02-24 17:12 ` [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock put Bob Peterson
@ 2016-02-24 18:47 ` Bob Peterson
2016-02-25 11:10 ` Steven Whitehouse
1 sibling, 0 replies; 3+ messages in thread
From: Bob Peterson @ 2016-02-24 18:47 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> Hi,
>
> On 11 February, Steve Whitehouse shot a hole through my latest attempt
> to solve the livelock related to GFS2 and the inode shrinker.
> The following is a new attempt, and it seems to be working well in
> testing.
>
> This patch approaches the problem as a purely glock-related problem.
> The problem is that the final glock put can't be done by inode evict
> (or it results in the livelock as described below). This solution
> simply defers the "eviction put" so that it is done by way of a new
> function in the quota daemon.
>
> One advantage of this method is that deferring the put shouldn't
> cause any problems if another process comes in and re-uses the glock
> for a new incarnation of the dinode. The existing code will take the
> glock off the lru list, and reuses it. In that case the "eviction put"
> won't be the final put, so it won't tell the DLM to unlock the
> lkb out of order (as was problematic with earlier patches); it will
> just adjust the reference count as needed. Note that this, in theory,
> could happen multiple times in a row, so we keep a counter rather
> than a flag. This is similar to how GFS1 handled this situation.
> ---
> Patch description:
>
> This patch tries to solve the livelock problem that goes like this:
> 1. An inode is being evicted from memory, which calls clear_glock
> 2. clear_glock does a final glock_put which calls into DLM to tell
> DLM to do the final unlock of the glock.
> 3. DLM calls into comm to do the unlock
> 4. The comm layer needs memory to satisfy the DLM communication
> 5. The memory allocate blocks due to lack of memory, and so the VFS
> inode shrinker is called to free memory.
> 6. The inode shrinker calls into GFS2 to free one of its glocks
> 7. Goto 1.
> The patch avoids this circular dependency by making the final glock
> put done for inodes happen later. After the glock is queued to the
> glock LRU, it wakes up the quota daemon. A new function of the
> quota daemon checks if there are any inode glocks on the LRU
> list that have a deferred glock_put, and if so, does the put.
> This allows the inode shrinker to proceed, since the final put
> is done later. The final put, called from the quota daemon, has no
> dependencies, and makes the call into DLM to do the final unlock.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
Hi,
Self-NACK for now.
In testing I discovered a rare race condition that seems to have
caused lru list corruption. I'm investigating now.
Regards,
Bob Peterson
Red Hat File Systems
^ permalink raw reply [flat|nested] 3+ messages in thread
* [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock put
2016-02-24 17:12 ` [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock put Bob Peterson
2016-02-24 18:47 ` Bob Peterson
@ 2016-02-25 11:10 ` Steven Whitehouse
1 sibling, 0 replies; 3+ messages in thread
From: Steven Whitehouse @ 2016-02-25 11:10 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 24/02/16 17:12, Bob Peterson wrote:
> Hi,
>
> On 11 February, Steve Whitehouse shot a hole through my latest attempt
> to solve the livelock related to GFS2 and the inode shrinker.
> The following is a new attempt, and it seems to be working well in
> testing.
>
> This patch approaches the problem as a purely glock-related problem.
> The problem is that the final glock put can't be done by inode evict
> (or it results in the livelock as described below). This solution
> simply defers the "eviction put" so that it is done by way of a new
> function in the quota daemon.
>
> One advantage of this method is that deferring the put shouldn't
> cause any problems if another process comes in and re-uses the glock
> for a new incarnation of the dinode. The existing code will take the
> glock off the lru list, and reuses it. In that case the "eviction put"
> won't be the final put, so it won't tell the DLM to unlock the
> lkb out of order (as was problematic with earlier patches); it will
> just adjust the reference count as needed. Note that this, in theory,
> could happen multiple times in a row, so we keep a counter rather
> than a flag. This is similar to how GFS1 handled this situation.
> ---
> Patch description:
>
> This patch tries to solve the livelock problem that goes like this:
> 1. An inode is being evicted from memory, which calls clear_glock
> 2. clear_glock does a final glock_put which calls into DLM to tell
> DLM to do the final unlock of the glock.
> 3. DLM calls into comm to do the unlock
> 4. The comm layer needs memory to satisfy the DLM communication
> 5. The memory allocate blocks due to lack of memory, and so the VFS
> inode shrinker is called to free memory.
> 6. The inode shrinker calls into GFS2 to free one of its glocks
> 7. Goto 1.
> The patch avoids this circular dependency by making the final glock
> put done for inodes happen later. After the glock is queued to the
> glock LRU, it wakes up the quota daemon. A new function of the
> quota daemon checks if there are any inode glocks on the LRU
> list that have a deferred glock_put, and if so, does the put.
> This allows the inode shrinker to proceed, since the final put
> is done later. The final put, called from the quota daemon, has no
> dependencies, and makes the call into DLM to do the final unlock.
I think that should do the trick. There may possibly be a performance
issue if the LRU list is long. It would be possible to avoid that by
using a different list head for delayed gfs2_glock_put()s perhaps, or
alternatively something closer to the delayed_fput() code that Al wrote
a little while ago that borrows the RCU head,
Steve.
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 8dedece..952a981 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -127,10 +127,12 @@ static int demote_ok(const struct gfs2_glock *gl)
> }
>
>
> -void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
> +void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool put_deferred)
> {
> spin_lock(&lru_lock);
>
> + if (put_deferred)
> + atomic_inc(&gl->gl_puts_deferred);
> if (!list_empty(&gl->gl_lru))
> list_del_init(&gl->gl_lru);
> else
> @@ -178,6 +180,42 @@ void gfs2_glock_put(struct gfs2_glock *gl)
> }
>
> /**
> + * gfs2_do_deferred_puts - Do glock put for glocks that have been deferred
> + *
> + * Glocks may be put on the lru list, even for a final put, then rescued by
> + * another process that subsequently opens the file. In this case, it will
> + * be taken back off the lru list and reused. However, we need to keep the
> + * gl_ref count sane. That's what gl->gl_puts_deferred is for.
> + */
> +void gfs2_do_deferred_puts(void)
> +{
> + LIST_HEAD(gl_putlist);
> + struct gfs2_glock *gl, *tmp;
> +
> + spin_lock(&lru_lock);
> + list_for_each_entry_safe(gl, tmp, &lru_list, gl_lru) {
> + if (atomic_read(&gl->gl_puts_deferred) > 0) {
> + list_move_tail(&gl->gl_lru, &gl_putlist);
> + atomic_dec(&lru_count);
> + clear_bit(GLF_LRU, &gl->gl_flags);
> + }
> + }
> + spin_unlock(&lru_lock);
> +
> + while (!list_empty(&gl_putlist)) {
> + gl = list_entry(gl_putlist.next, struct gfs2_glock, gl_lru);
> + list_del_init(&gl->gl_lru);
> + flush_delayed_work(&gl->gl_work);
> + while (atomic_read(&gl->gl_puts_deferred) > 0) {
> + GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
> + gfs2_glock_put(gl);
> + atomic_dec(&gl->gl_puts_deferred);
> + }
> + cond_resched();
> + }
> +}
> +
> +/**
> * may_grant - check if its ok to grant a new lock
> * @gl: The glock
> * @gh: The lock request which we wish to grant
> @@ -698,6 +736,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
> gl->gl_tchange = jiffies;
> gl->gl_object = NULL;
> gl->gl_hold_time = GL_GLOCK_DFT_HOLD;
> + atomic_set(&gl->gl_puts_deferred, 0);
> INIT_DELAYED_WORK(&gl->gl_work, glock_work_func);
> INIT_WORK(&gl->gl_delete, delete_work_func);
>
> @@ -1031,7 +1070,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
> }
> if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl) &&
> (glops->go_flags & GLOF_LRU))
> - gfs2_glock_add_to_lru(gl);
> + gfs2_glock_add_to_lru(gl, false);
>
> trace_gfs2_glock_queue(gh, 0);
> spin_unlock(&gl->gl_lockref.lock);
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 46ab67f..420583f 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -234,8 +234,9 @@ extern void gfs2_glock_complete(struct gfs2_glock *gl, int ret);
> extern void gfs2_gl_hash_clear(struct gfs2_sbd *sdp);
> extern void gfs2_glock_finish_truncate(struct gfs2_inode *ip);
> extern void gfs2_glock_thaw(struct gfs2_sbd *sdp);
> -extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl);
> +extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool do_final_put);
> extern void gfs2_glock_free(struct gfs2_glock *gl);
> +extern void gfs2_do_deferred_puts(void);
>
> extern int __init gfs2_glock_init(void);
> extern void gfs2_glock_exit(void);
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a6a3389..ee63406 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -371,6 +371,7 @@ struct gfs2_glock {
> } gl_vm;
> };
> struct rhash_head gl_node;
> + atomic_t gl_puts_deferred;
> };
>
> #define GFS2_MIN_LVB_SIZE 32 /* Min size of LVB that gfs2 supports */
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index be6d9c4..52b4b0e 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -1537,6 +1537,7 @@ int gfs2_quotad(void *data)
>
> while (!kthread_should_stop()) {
>
> + gfs2_do_deferred_puts();
> /* Update the master statfs file */
> if (sdp->sd_statfs_force_sync) {
> int error = gfs2_statfs_sync(sdp->sd_vfs, 0);
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 07c0265..df22a8c 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -717,7 +717,7 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
> spin_lock(&gl->gl_lockref.lock);
> gl->gl_object = NULL;
> spin_unlock(&gl->gl_lockref.lock);
> - gfs2_glock_add_to_lru(gl);
> + gfs2_glock_add_to_lru(gl, false);
> gfs2_glock_put(gl);
> }
>
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index f8a0cd8..77603e7 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -850,6 +850,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
> flush_workqueue(gfs2_delete_workqueue);
> gfs2_quota_sync(sdp->sd_vfs, 0);
> gfs2_statfs_sync(sdp->sd_vfs, 0);
> + gfs2_do_deferred_puts();
>
> gfs2_log_flush(sdp, NULL, SHUTDOWN_FLUSH);
> wait_event(sdp->sd_reserving_log_wait, atomic_read(&sdp->sd_reserving_log) == 0);
> @@ -922,6 +923,7 @@ restart:
> gfs2_clear_rgrpd(sdp);
> gfs2_jindex_free(sdp);
> /* Take apart glock structures and buffer lists */
> + gfs2_do_deferred_puts();
> gfs2_gl_hash_clear(sdp);
> /* Unmount the locking protocol */
> gfs2_lm_unmount(sdp);
> @@ -1628,10 +1630,9 @@ out:
> clear_inode(inode);
> gfs2_dir_hash_inval(ip);
> ip->i_gl->gl_object = NULL;
> - flush_delayed_work(&ip->i_gl->gl_work);
> - gfs2_glock_add_to_lru(ip->i_gl);
> - gfs2_glock_put(ip->i_gl);
> + gfs2_glock_add_to_lru(ip->i_gl, true);
> ip->i_gl = NULL;
> + wake_up(&sdp->sd_quota_wait);
> if (ip->i_iopen_gh.gh_gl) {
> ip->i_iopen_gh.gh_gl->gl_object = NULL;
> ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
>
^ permalink raw reply [flat|nested] 3+ messages in thread