From mboxrd@z Thu Jan  1 00:00:00 1970
From: Steven Whitehouse <swhiteho@redhat.com>
Date: Thu, 25 Feb 2016 11:10:37 +0000
Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock
 put
In-Reply-To: <477791445.29170660.1456333948888.JavaMail.zimbra@redhat.com>
References: <477791445.29170660.1456333948888.JavaMail.zimbra@redhat.com>
Message-ID: <56CEE12D.3060800@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On 24/02/16 17:12, Bob Peterson wrote:
> Hi,
>
> On 11 February, Steve Whitehouse shot a hole through my latest attempt
> to solve the livelock related to GFS2 and the inode shrinker.
> The following is a new attempt, and it seems to be working well in
> testing.
>
> This patch approaches the problem as a purely glock-related problem.
> The problem is that the final glock put can't be done by inode evict
> (or it results in the livelock as described below). This solution
> simply defers the "eviction put" so that it is done by way of a new
> function in the quota daemon.
>
> One advantage of this method is that deferring the put shouldn't
> cause any problems if another process comes in and re-uses the glock
> for a new incarnation of the dinode. The existing code will take the
> glock off the lru list, and reuses it. In that case the "eviction put"
> won't be the final put, so it won't tell the DLM to unlock the
> lkb out of order (as was problematic with earlier patches); it will
> just adjust the reference count as needed. Note that this, in theory,
> could happen multiple times in a row, so we keep a counter rather
> than a flag. This is similar to how GFS1 handled this situation.
> ---
> Patch description:
>
> This patch tries to solve the livelock problem that goes like this:
> 1. An inode is being evicted from memory, which calls clear_glock
> 2. clear_glock does a final glock_put which calls into DLM to tell
>     DLM to do the final unlock of the glock.
> 3. DLM calls into comm to do the unlock
> 4. The comm layer needs memory to satisfy the DLM communication
> 5. The memory allocate blocks due to lack of memory, and so the VFS
>     inode shrinker is called to free memory.
> 6. The inode shrinker calls into GFS2 to free one of its glocks
> 7. Goto 1.
> The patch avoids this circular dependency by making the final glock
> put done for inodes happen later. After the glock is queued to the
> glock LRU, it wakes up the quota daemon. A new function of the
> quota daemon checks if there are any inode glocks on the LRU
> list that have a deferred glock_put, and if so, does the put.
> This allows the inode shrinker to proceed, since the final put
> is done later. The final put, called from the quota daemon, has no
> dependencies, and makes the call into DLM to do the final unlock.
I think that should do the trick. There may possibly be a performance 
issue if the LRU list is long. It would be possible to avoid that by 
using a different list head for delayed gfs2_glock_put()s perhaps, or 
alternatively something closer to the delayed_fput() code that Al wrote 
a little while ago that borrows the RCU head,

Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 8dedece..952a981 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -127,10 +127,12 @@ static int demote_ok(const struct gfs2_glock *gl)
>   }
>   
>   
> -void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
> +void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool put_deferred)
>   {
>   	spin_lock(&lru_lock);
>   
> +	if (put_deferred)
> +		atomic_inc(&gl->gl_puts_deferred);
>   	if (!list_empty(&gl->gl_lru))
>   		list_del_init(&gl->gl_lru);
>   	else
> @@ -178,6 +180,42 @@ void gfs2_glock_put(struct gfs2_glock *gl)
>   }
>   
>   /**
> + * gfs2_do_deferred_puts - Do glock put for glocks that have been deferred
> + *
> + * Glocks may be put on the lru list, even for a final put, then rescued by
> + * another process that subsequently opens the file. In this case, it will
> + * be taken back off the lru list and reused. However, we need to keep the
> + * gl_ref count sane. That's what gl->gl_puts_deferred is for.
> + */
> +void gfs2_do_deferred_puts(void)
> +{
> +	LIST_HEAD(gl_putlist);
> +	struct gfs2_glock *gl, *tmp;
> +
> +	spin_lock(&lru_lock);
> +	list_for_each_entry_safe(gl, tmp, &lru_list, gl_lru) {
> +		if (atomic_read(&gl->gl_puts_deferred) > 0) {
> +			list_move_tail(&gl->gl_lru, &gl_putlist);
> +			atomic_dec(&lru_count);
> +			clear_bit(GLF_LRU, &gl->gl_flags);
> +		}
> +	}
> +	spin_unlock(&lru_lock);
> +
> +	while (!list_empty(&gl_putlist)) {
> +		gl = list_entry(gl_putlist.next, struct gfs2_glock, gl_lru);
> +		list_del_init(&gl->gl_lru);
> +		flush_delayed_work(&gl->gl_work);
> +		while (atomic_read(&gl->gl_puts_deferred) > 0) {
> +			GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref));
> +			gfs2_glock_put(gl);
> +			atomic_dec(&gl->gl_puts_deferred);
> +		}
> +		cond_resched();
> +	}
> +}
> +
> +/**
>    * may_grant - check if its ok to grant a new lock
>    * @gl: The glock
>    * @gh: The lock request which we wish to grant
> @@ -698,6 +736,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
>   	gl->gl_tchange = jiffies;
>   	gl->gl_object = NULL;
>   	gl->gl_hold_time = GL_GLOCK_DFT_HOLD;
> +	atomic_set(&gl->gl_puts_deferred, 0);
>   	INIT_DELAYED_WORK(&gl->gl_work, glock_work_func);
>   	INIT_WORK(&gl->gl_delete, delete_work_func);
>   
> @@ -1031,7 +1070,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
>   	}
>   	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl) &&
>   	    (glops->go_flags & GLOF_LRU))
> -		gfs2_glock_add_to_lru(gl);
> +		gfs2_glock_add_to_lru(gl, false);
>   
>   	trace_gfs2_glock_queue(gh, 0);
>   	spin_unlock(&gl->gl_lockref.lock);
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 46ab67f..420583f 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -234,8 +234,9 @@ extern void gfs2_glock_complete(struct gfs2_glock *gl, int ret);
>   extern void gfs2_gl_hash_clear(struct gfs2_sbd *sdp);
>   extern void gfs2_glock_finish_truncate(struct gfs2_inode *ip);
>   extern void gfs2_glock_thaw(struct gfs2_sbd *sdp);
> -extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl);
> +extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool do_final_put);
>   extern void gfs2_glock_free(struct gfs2_glock *gl);
> +extern void gfs2_do_deferred_puts(void);
>   
>   extern int __init gfs2_glock_init(void);
>   extern void gfs2_glock_exit(void);
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index a6a3389..ee63406 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -371,6 +371,7 @@ struct gfs2_glock {
>   		} gl_vm;
>   	};
>   	struct rhash_head gl_node;
> +	atomic_t gl_puts_deferred;
>   };
>   
>   #define GFS2_MIN_LVB_SIZE 32	/* Min size of LVB that gfs2 supports */
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index be6d9c4..52b4b0e 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -1537,6 +1537,7 @@ int gfs2_quotad(void *data)
>   
>   	while (!kthread_should_stop()) {
>   
> +		gfs2_do_deferred_puts();
>   		/* Update the master statfs file */
>   		if (sdp->sd_statfs_force_sync) {
>   			int error = gfs2_statfs_sync(sdp->sd_vfs, 0);
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 07c0265..df22a8c 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -717,7 +717,7 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
>   			spin_lock(&gl->gl_lockref.lock);
>   			gl->gl_object = NULL;
>   			spin_unlock(&gl->gl_lockref.lock);
> -			gfs2_glock_add_to_lru(gl);
> +			gfs2_glock_add_to_lru(gl, false);
>   			gfs2_glock_put(gl);
>   		}
>   
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index f8a0cd8..77603e7 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -850,6 +850,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
>   	flush_workqueue(gfs2_delete_workqueue);
>   	gfs2_quota_sync(sdp->sd_vfs, 0);
>   	gfs2_statfs_sync(sdp->sd_vfs, 0);
> +	gfs2_do_deferred_puts();
>   
>   	gfs2_log_flush(sdp, NULL, SHUTDOWN_FLUSH);
>   	wait_event(sdp->sd_reserving_log_wait, atomic_read(&sdp->sd_reserving_log) == 0);
> @@ -922,6 +923,7 @@ restart:
>   	gfs2_clear_rgrpd(sdp);
>   	gfs2_jindex_free(sdp);
>   	/*  Take apart glock structures and buffer lists  */
> +	gfs2_do_deferred_puts();
>   	gfs2_gl_hash_clear(sdp);
>   	/*  Unmount the locking protocol  */
>   	gfs2_lm_unmount(sdp);
> @@ -1628,10 +1630,9 @@ out:
>   	clear_inode(inode);
>   	gfs2_dir_hash_inval(ip);
>   	ip->i_gl->gl_object = NULL;
> -	flush_delayed_work(&ip->i_gl->gl_work);
> -	gfs2_glock_add_to_lru(ip->i_gl);
> -	gfs2_glock_put(ip->i_gl);
> +	gfs2_glock_add_to_lru(ip->i_gl, true);
>   	ip->i_gl = NULL;
> +	wake_up(&sdp->sd_quota_wait);
>   	if (ip->i_iopen_gh.gh_gl) {
>   		ip->i_iopen_gh.gh_gl->gl_object = NULL;
>   		ip->i_iopen_gh.gh_flags |= GL_NOCACHE;
>