From mboxrd@z Thu Jan  1 00:00:00 1970
From: Steven Whitehouse <swhiteho@redhat.com>
Date: Thu, 05 Sep 2013 09:28:09 +0100
Subject: [Cluster-devel] [PATCH] GFS2: dirty inode correctly in
 gfs2_write_end
In-Reply-To: <20130905034835.GA16256@dhcp80-209.msp.redhat.com>
References: <20130903215942.GM2113@dhcp80-209.msp.redhat.com>
	<1378306503.2740.27.camel@menhir>
	<20130904165942.GN2113@dhcp80-209.msp.redhat.com>
	<1378330576.2740.46.camel@menhir>
	<20130905034835.GA16256@dhcp80-209.msp.redhat.com>
Message-ID: <1378369689.2698.1.camel@menhir>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On Wed, 2013-09-04 at 22:48 -0500, Benjamin Marzinski wrote:
> On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> > > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > > > Hi,
> > > > 
> > > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > > > it actually increased the file size.  If gfs2_fsync was called without
> > > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > > > returning, so any metadata or journaled data changes were not getting
> > > > > fsynced. This meant that writes to the middle of files were not always
> > > > > getting fsynced properly.
> > > > > 
> > > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > > > updated during a write. It also make gfs2_sync flush the incore log
> > > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > > > will make sure that all incore logged data gets written to disk before
> > > > > returning from a fsync.
> > > > > 
> > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > > > ---
> > > > >  fs/gfs2/aops.c |    9 +++++++--
> > > > >  fs/gfs2/file.c |    4 +++-
> > > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > > > 
> > > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > ===================================================================
> > > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > > > >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > > > >  	unsigned int to = from + len;
> > > > >  	int ret;
> > > > > +	struct gfs2_trans *tr = current->journal_info;
> > > > > +	BUG_ON(!tr);
> > > > >  
> > > > >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > > > >  
> > > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > > > >  		goto failed;
> > > > >  	}
> > > > >  
> > > > > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > -
> > > > >  	if (gfs2_is_stuffed(ip))
> > > > >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > > > >  
> > > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > > > >  		gfs2_page_add_databufs(ip, page, from, to);
> > > > >  
> > > > >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > +	if (tr->tr_num_buf_new)
> > > > > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > > +	else
> > > > > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > +
> > > > I'm not sure I follow this bit... generic_write_end() will call
> > > > mark_inode_dirty() if the file size has changed. So that case should be
> > > > covered. That leaves only timestamps which I think should be
> > > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > > > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > > > practice?
> > > 
> > > The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> > > metadata because of the write.
> > 
> > I think we ought to be updating the metadata for every write (although
> > see case #3 below), since the timestamp should change each time. If the
> > size of the file doesn't change then we don't need to flush the metadata
> > on fdatasync, even though the timestamp has changed.
> > 
> > I think we have three cases:
> > 
> > 1. i_size changes: metadata must be flushed on fsync/fdatasync, so
> > I_DIRTY_DATASYNC should be set (done in generic_write_end())
> > 2. only timestamps change: metatdata must be flushed on fsync but not on
> > fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
> > problematic case)
> > 3. write is of 0 bytes: I think we are allowed to skip all metadata
> > updates (need to check that we do this)
> 
> I wasn't counting timestamp changes, since like you mention, those use
> I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes
> that cause block allocations.  The specific test in the bugzilla uses a
> holey file.  The writes that aren't getting synced ARE causing an
> allocation, but ARE NOT increasing i_size, since they are not at the end
> of the file.  This is a fourth case, and it's the case where we were
> falling down. I probably should have clarified that earlier.
> 
Ok, that makes perfect sense to me. Lets go with the patch "as is".
Sorry for not spotting that earlier.

> > 
> > >  Otherwise we flush the log for every
> > > write, even if we aren't data journalling and haven't changed any
> > > metadata.  If the write happened without adding any buffers to the
> > > transaction (which happens if we are just writing over already allocated
> > > space), then tr->tr_num_buf_new will be 0. This means we haven't
> > > actually dirtied any metadata, and we don't need to flush it to the log
> > > during the fsync.
> > > 
> > Does that happen for zero sized writes? Otherwise I'd be expecting us to
> > be updating time stamps at least on each write.
> 
> In rewriting already allocated data, we will update the timestamp and
> the inode will be added to the incore log, but we don't need to flush
> that change in gfs2_fsync, since we are asking for fdatasync. I thought
> the whole point of the seperation between I_DIRTY_SYNC and
> I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only
> timestamps had changed. The change will still get to disk, but if we
> happen to crash before that, the mtime on the file might be wrong. Isn't
> that just the price you pay for doing a datasync instead of a full sync?
> 
Yes, that all sounds correct to me,

Steve.