From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Whitehouse Date: Thu, 25 Feb 2016 11:10:37 +0000 Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Defer the final inode glock put In-Reply-To: <477791445.29170660.1456333948888.JavaMail.zimbra@redhat.com> References: <477791445.29170660.1456333948888.JavaMail.zimbra@redhat.com> Message-ID: <56CEE12D.3060800@redhat.com> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Hi, On 24/02/16 17:12, Bob Peterson wrote: > Hi, > > On 11 February, Steve Whitehouse shot a hole through my latest attempt > to solve the livelock related to GFS2 and the inode shrinker. > The following is a new attempt, and it seems to be working well in > testing. > > This patch approaches the problem as a purely glock-related problem. > The problem is that the final glock put can't be done by inode evict > (or it results in the livelock as described below). This solution > simply defers the "eviction put" so that it is done by way of a new > function in the quota daemon. > > One advantage of this method is that deferring the put shouldn't > cause any problems if another process comes in and re-uses the glock > for a new incarnation of the dinode. The existing code will take the > glock off the lru list, and reuses it. In that case the "eviction put" > won't be the final put, so it won't tell the DLM to unlock the > lkb out of order (as was problematic with earlier patches); it will > just adjust the reference count as needed. Note that this, in theory, > could happen multiple times in a row, so we keep a counter rather > than a flag. This is similar to how GFS1 handled this situation. > --- > Patch description: > > This patch tries to solve the livelock problem that goes like this: > 1. An inode is being evicted from memory, which calls clear_glock > 2. clear_glock does a final glock_put which calls into DLM to tell > DLM to do the final unlock of the glock. > 3. DLM calls into comm to do the unlock > 4. The comm layer needs memory to satisfy the DLM communication > 5. The memory allocate blocks due to lack of memory, and so the VFS > inode shrinker is called to free memory. > 6. The inode shrinker calls into GFS2 to free one of its glocks > 7. Goto 1. > The patch avoids this circular dependency by making the final glock > put done for inodes happen later. After the glock is queued to the > glock LRU, it wakes up the quota daemon. A new function of the > quota daemon checks if there are any inode glocks on the LRU > list that have a deferred glock_put, and if so, does the put. > This allows the inode shrinker to proceed, since the final put > is done later. The final put, called from the quota daemon, has no > dependencies, and makes the call into DLM to do the final unlock. I think that should do the trick. There may possibly be a performance issue if the LRU list is long. It would be possible to avoid that by using a different list head for delayed gfs2_glock_put()s perhaps, or alternatively something closer to the delayed_fput() code that Al wrote a little while ago that borrows the RCU head, Steve. > Signed-off-by: Bob Peterson > --- > diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c > index 8dedece..952a981 100644 > --- a/fs/gfs2/glock.c > +++ b/fs/gfs2/glock.c > @@ -127,10 +127,12 @@ static int demote_ok(const struct gfs2_glock *gl) > } > > > -void gfs2_glock_add_to_lru(struct gfs2_glock *gl) > +void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool put_deferred) > { > spin_lock(&lru_lock); > > + if (put_deferred) > + atomic_inc(&gl->gl_puts_deferred); > if (!list_empty(&gl->gl_lru)) > list_del_init(&gl->gl_lru); > else > @@ -178,6 +180,42 @@ void gfs2_glock_put(struct gfs2_glock *gl) > } > > /** > + * gfs2_do_deferred_puts - Do glock put for glocks that have been deferred > + * > + * Glocks may be put on the lru list, even for a final put, then rescued by > + * another process that subsequently opens the file. In this case, it will > + * be taken back off the lru list and reused. However, we need to keep the > + * gl_ref count sane. That's what gl->gl_puts_deferred is for. > + */ > +void gfs2_do_deferred_puts(void) > +{ > + LIST_HEAD(gl_putlist); > + struct gfs2_glock *gl, *tmp; > + > + spin_lock(&lru_lock); > + list_for_each_entry_safe(gl, tmp, &lru_list, gl_lru) { > + if (atomic_read(&gl->gl_puts_deferred) > 0) { > + list_move_tail(&gl->gl_lru, &gl_putlist); > + atomic_dec(&lru_count); > + clear_bit(GLF_LRU, &gl->gl_flags); > + } > + } > + spin_unlock(&lru_lock); > + > + while (!list_empty(&gl_putlist)) { > + gl = list_entry(gl_putlist.next, struct gfs2_glock, gl_lru); > + list_del_init(&gl->gl_lru); > + flush_delayed_work(&gl->gl_work); > + while (atomic_read(&gl->gl_puts_deferred) > 0) { > + GLOCK_BUG_ON(gl, __lockref_is_dead(&gl->gl_lockref)); > + gfs2_glock_put(gl); > + atomic_dec(&gl->gl_puts_deferred); > + } > + cond_resched(); > + } > +} > + > +/** > * may_grant - check if its ok to grant a new lock > * @gl: The glock > * @gh: The lock request which we wish to grant > @@ -698,6 +736,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number, > gl->gl_tchange = jiffies; > gl->gl_object = NULL; > gl->gl_hold_time = GL_GLOCK_DFT_HOLD; > + atomic_set(&gl->gl_puts_deferred, 0); > INIT_DELAYED_WORK(&gl->gl_work, glock_work_func); > INIT_WORK(&gl->gl_delete, delete_work_func); > > @@ -1031,7 +1070,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh) > } > if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl) && > (glops->go_flags & GLOF_LRU)) > - gfs2_glock_add_to_lru(gl); > + gfs2_glock_add_to_lru(gl, false); > > trace_gfs2_glock_queue(gh, 0); > spin_unlock(&gl->gl_lockref.lock); > diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h > index 46ab67f..420583f 100644 > --- a/fs/gfs2/glock.h > +++ b/fs/gfs2/glock.h > @@ -234,8 +234,9 @@ extern void gfs2_glock_complete(struct gfs2_glock *gl, int ret); > extern void gfs2_gl_hash_clear(struct gfs2_sbd *sdp); > extern void gfs2_glock_finish_truncate(struct gfs2_inode *ip); > extern void gfs2_glock_thaw(struct gfs2_sbd *sdp); > -extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl); > +extern void gfs2_glock_add_to_lru(struct gfs2_glock *gl, bool do_final_put); > extern void gfs2_glock_free(struct gfs2_glock *gl); > +extern void gfs2_do_deferred_puts(void); > > extern int __init gfs2_glock_init(void); > extern void gfs2_glock_exit(void); > diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h > index a6a3389..ee63406 100644 > --- a/fs/gfs2/incore.h > +++ b/fs/gfs2/incore.h > @@ -371,6 +371,7 @@ struct gfs2_glock { > } gl_vm; > }; > struct rhash_head gl_node; > + atomic_t gl_puts_deferred; > }; > > #define GFS2_MIN_LVB_SIZE 32 /* Min size of LVB that gfs2 supports */ > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c > index be6d9c4..52b4b0e 100644 > --- a/fs/gfs2/quota.c > +++ b/fs/gfs2/quota.c > @@ -1537,6 +1537,7 @@ int gfs2_quotad(void *data) > > while (!kthread_should_stop()) { > > + gfs2_do_deferred_puts(); > /* Update the master statfs file */ > if (sdp->sd_statfs_force_sync) { > int error = gfs2_statfs_sync(sdp->sd_vfs, 0); > diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c > index 07c0265..df22a8c 100644 > --- a/fs/gfs2/rgrp.c > +++ b/fs/gfs2/rgrp.c > @@ -717,7 +717,7 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp) > spin_lock(&gl->gl_lockref.lock); > gl->gl_object = NULL; > spin_unlock(&gl->gl_lockref.lock); > - gfs2_glock_add_to_lru(gl); > + gfs2_glock_add_to_lru(gl, false); > gfs2_glock_put(gl); > } > > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c > index f8a0cd8..77603e7 100644 > --- a/fs/gfs2/super.c > +++ b/fs/gfs2/super.c > @@ -850,6 +850,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp) > flush_workqueue(gfs2_delete_workqueue); > gfs2_quota_sync(sdp->sd_vfs, 0); > gfs2_statfs_sync(sdp->sd_vfs, 0); > + gfs2_do_deferred_puts(); > > gfs2_log_flush(sdp, NULL, SHUTDOWN_FLUSH); > wait_event(sdp->sd_reserving_log_wait, atomic_read(&sdp->sd_reserving_log) == 0); > @@ -922,6 +923,7 @@ restart: > gfs2_clear_rgrpd(sdp); > gfs2_jindex_free(sdp); > /* Take apart glock structures and buffer lists */ > + gfs2_do_deferred_puts(); > gfs2_gl_hash_clear(sdp); > /* Unmount the locking protocol */ > gfs2_lm_unmount(sdp); > @@ -1628,10 +1630,9 @@ out: > clear_inode(inode); > gfs2_dir_hash_inval(ip); > ip->i_gl->gl_object = NULL; > - flush_delayed_work(&ip->i_gl->gl_work); > - gfs2_glock_add_to_lru(ip->i_gl); > - gfs2_glock_put(ip->i_gl); > + gfs2_glock_add_to_lru(ip->i_gl, true); > ip->i_gl = NULL; > + wake_up(&sdp->sd_quota_wait); > if (ip->i_iopen_gh.gh_gl) { > ip->i_iopen_gh.gh_gl->gl_object = NULL; > ip->i_iopen_gh.gh_flags |= GL_NOCACHE; >