From mboxrd@z Thu Jan  1 00:00:00 1970
From: Benjamin Marzinski <bmarzins@redhat.com>
Date: Wed, 4 Sep 2013 11:59:42 -0500
Subject: [Cluster-devel] [PATCH] GFS2: dirty inode correctly in
 gfs2_write_end
In-Reply-To: <1378306503.2740.27.camel@menhir>
References: <20130903215942.GM2113@dhcp80-209.msp.redhat.com>
	<1378306503.2740.27.camel@menhir>
Message-ID: <20130904165942.GN2113@dhcp80-209.msp.redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > it actually increased the file size.  If gfs2_fsync was called without
> > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > returning, so any metadata or journaled data changes were not getting
> > fsynced. This meant that writes to the middle of files were not always
> > getting fsynced properly.
> > 
> > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > updated during a write. It also make gfs2_sync flush the incore log
> > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > will make sure that all incore logged data gets written to disk before
> > returning from a fsync.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  fs/gfs2/aops.c |    9 +++++++--
> >  fs/gfs2/file.c |    4 +++-
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> > 
> > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > ===================================================================
> > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> >  	unsigned int to = from + len;
> >  	int ret;
> > +	struct gfs2_trans *tr = current->journal_info;
> > +	BUG_ON(!tr);
> >  
> >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> >  
> > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> >  		goto failed;
> >  	}
> >  
> > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > -
> >  	if (gfs2_is_stuffed(ip))
> >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> >  
> > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> >  		gfs2_page_add_databufs(ip, page, from, to);
> >  
> >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	if (tr->tr_num_buf_new)
> > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > +	else
> > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > +
> I'm not sure I follow this bit... generic_write_end() will call
> mark_inode_dirty() if the file size has changed. So that case should be
> covered. That leaves only timestamps which I think should be
> I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> conditional here on tr_num_buf_new ? Does the "else" actually occur in
> practice?

The idea is to only set I_DIRTY_DATASYNC if we've actually updated
metadata because of the write.  Otherwise we flush the log for every
write, even if we aren't data journalling and haven't changed any
metadata.  If the write happened without adding any buffers to the
transaction (which happens if we are just writing over already allocated
space), then tr->tr_num_buf_new will be 0. This means we haven't
actually dirtied any metadata, and we don't need to flush it to the log
during the fsync.

With instrumentation in the code, I can definitely see that with the
check for tr->tr_num_buf_new, I_DIRTY_DATASYNC isn't set if we are just
rewriting over already allocated space, and gfs2_fsync doesn't need to
do all the extra I_DIRTY_DATASYNC work, cutting down the performance
impact of this patch.

If you are doing data journalling, you obviously need to flush the
changed datablocks to the journal in gfs2_fsync.  Otherwise it won't
write them back to their inplace locations. This is true even if the
write didn't update metadata, and so I_DIRTY_DATASYNC isn't set.  That's
why I have gfs2_fsync check for I_DIRTY_PAGES if an inode is using data
journalling.

Does that make sense, or am I missing something?

Now the part of my patch that I don't understand is the bit after the
"else".  See my questions at

https://bugzilla.redhat.com/show_bug.cgi?id=991596#c19

I added the "else" to keep gfs2_write_end always adding the inode to the
transaction, but I don't know why that is necessary. If we aren't
changing any metadata, why do we need to add the inode to the
transaction at all.

-Ben

> 
> Otherwise I think the patch looks good,
> 
> Steve.
> 
> >  
> >  	if (inode == sdp->sd_rindex) {
> >  		adjust_fs_space(inode);
> > Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > ===================================================================
> > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> > +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
> >  {
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct inode *inode = mapping->host;
> > -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > +	int sync_state = inode->i_state & I_DIRTY;
> >  	struct gfs2_inode *ip = GFS2_I(inode);
> >  	int ret = 0, ret1 = 0;
> >  
> > @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
> >  			return ret1;
> >  	}
> >  
> > +	if (!gfs2_is_jdata(ip))
> > +		sync_state &= ~I_DIRTY_PAGES;
> >  	if (datasync)
> >  		sync_state &= ~I_DIRTY_SYNC;
> >  
> > 
>