cluster-devel.redhat.com archive mirror
 help / color / mirror / Atom feed
* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
@ 2013-09-03 21:59 Benjamin Marzinski
  2013-09-04 13:24 ` Bob Peterson
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Benjamin Marzinski @ 2013-09-03 21:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
it actually increased the file size.  If gfs2_fsync was called without
I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
returning, so any metadata or journaled data changes were not getting
fsynced. This meant that writes to the middle of files were not always
getting fsynced properly.

This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
updated during a write. It also make gfs2_sync flush the incore log
if I_DIRTY_PAGES is set, and the file is using data journalling. This
will make sure that all incore logged data gets written to disk before
returning from a fsync.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 fs/gfs2/aops.c |    9 +++++++--
 fs/gfs2/file.c |    4 +++-
 2 files changed, 10 insertions(+), 3 deletions(-)

Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
===================================================================
--- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
+++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
@@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
 	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
 	unsigned int to = from + len;
 	int ret;
+	struct gfs2_trans *tr = current->journal_info;
+	BUG_ON(!tr);
 
 	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
 
@@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
 		goto failed;
 	}
 
-	gfs2_trans_add_meta(ip->i_gl, dibh);
-
 	if (gfs2_is_stuffed(ip))
 		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
 
@@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
 		gfs2_page_add_databufs(ip, page, from, to);
 
 	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	if (tr->tr_num_buf_new)
+		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	else
+		gfs2_trans_add_meta(ip->i_gl, dibh);
+
 
 	if (inode == sdp->sd_rindex) {
 		adjust_fs_space(inode);
Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
===================================================================
--- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
+++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
@@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
 {
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
+	int sync_state = inode->i_state & I_DIRTY;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	int ret = 0, ret1 = 0;
 
@@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
 			return ret1;
 	}
 
+	if (!gfs2_is_jdata(ip))
+		sync_state &= ~I_DIRTY_PAGES;
 	if (datasync)
 		sync_state &= ~I_DIRTY_SYNC;
 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-03 21:59 [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end Benjamin Marzinski
@ 2013-09-04 13:24 ` Bob Peterson
  2013-09-04 14:55 ` Steven Whitehouse
  2013-09-05  9:34 ` Steven Whitehouse
  2 siblings, 0 replies; 8+ messages in thread
From: Bob Peterson @ 2013-09-04 13:24 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
| GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
| it actually increased the file size.  If gfs2_fsync was called without
| I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
| returning, so any metadata or journaled data changes were not getting
| fsynced. This meant that writes to the middle of files were not always
| getting fsynced properly.
| 
| This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
| updated during a write. It also make gfs2_sync flush the incore log
| if I_DIRTY_PAGES is set, and the file is using data journalling. This
| will make sure that all incore logged data gets written to disk before
| returning from a fsync.
| 
| Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Hi,

ACK

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-03 21:59 [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end Benjamin Marzinski
  2013-09-04 13:24 ` Bob Peterson
@ 2013-09-04 14:55 ` Steven Whitehouse
  2013-09-04 16:59   ` Benjamin Marzinski
  2013-09-05  9:34 ` Steven Whitehouse
  2 siblings, 1 reply; 8+ messages in thread
From: Steven Whitehouse @ 2013-09-04 14:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> it actually increased the file size.  If gfs2_fsync was called without
> I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> returning, so any metadata or journaled data changes were not getting
> fsynced. This meant that writes to the middle of files were not always
> getting fsynced properly.
> 
> This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> updated during a write. It also make gfs2_sync flush the incore log
> if I_DIRTY_PAGES is set, and the file is using data journalling. This
> will make sure that all incore logged data gets written to disk before
> returning from a fsync.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  fs/gfs2/aops.c |    9 +++++++--
>  fs/gfs2/file.c |    4 +++-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> ===================================================================
> --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
>  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
>  	unsigned int to = from + len;
>  	int ret;
> +	struct gfs2_trans *tr = current->journal_info;
> +	BUG_ON(!tr);
>  
>  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
>  
> @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
>  		goto failed;
>  	}
>  
> -	gfs2_trans_add_meta(ip->i_gl, dibh);
> -
>  	if (gfs2_is_stuffed(ip))
>  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
>  
> @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
>  		gfs2_page_add_databufs(ip, page, from, to);
>  
>  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> +	if (tr->tr_num_buf_new)
> +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> +	else
> +		gfs2_trans_add_meta(ip->i_gl, dibh);
> +
I'm not sure I follow this bit... generic_write_end() will call
mark_inode_dirty() if the file size has changed. So that case should be
covered. That leaves only timestamps which I think should be
I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
conditional here on tr_num_buf_new ? Does the "else" actually occur in
practice?

Otherwise I think the patch looks good,

Steve.

>  
>  	if (inode == sdp->sd_rindex) {
>  		adjust_fs_space(inode);
> Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> ===================================================================
> --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
>  {
>  	struct address_space *mapping = file->f_mapping;
>  	struct inode *inode = mapping->host;
> -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> +	int sync_state = inode->i_state & I_DIRTY;
>  	struct gfs2_inode *ip = GFS2_I(inode);
>  	int ret = 0, ret1 = 0;
>  
> @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
>  			return ret1;
>  	}
>  
> +	if (!gfs2_is_jdata(ip))
> +		sync_state &= ~I_DIRTY_PAGES;
>  	if (datasync)
>  		sync_state &= ~I_DIRTY_SYNC;
>  
> 




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-04 14:55 ` Steven Whitehouse
@ 2013-09-04 16:59   ` Benjamin Marzinski
  2013-09-04 21:36     ` Steven Whitehouse
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Marzinski @ 2013-09-04 16:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > it actually increased the file size.  If gfs2_fsync was called without
> > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > returning, so any metadata or journaled data changes were not getting
> > fsynced. This meant that writes to the middle of files were not always
> > getting fsynced properly.
> > 
> > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > updated during a write. It also make gfs2_sync flush the incore log
> > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > will make sure that all incore logged data gets written to disk before
> > returning from a fsync.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  fs/gfs2/aops.c |    9 +++++++--
> >  fs/gfs2/file.c |    4 +++-
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> > 
> > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > ===================================================================
> > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> >  	unsigned int to = from + len;
> >  	int ret;
> > +	struct gfs2_trans *tr = current->journal_info;
> > +	BUG_ON(!tr);
> >  
> >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> >  
> > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> >  		goto failed;
> >  	}
> >  
> > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > -
> >  	if (gfs2_is_stuffed(ip))
> >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> >  
> > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> >  		gfs2_page_add_databufs(ip, page, from, to);
> >  
> >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	if (tr->tr_num_buf_new)
> > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > +	else
> > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > +
> I'm not sure I follow this bit... generic_write_end() will call
> mark_inode_dirty() if the file size has changed. So that case should be
> covered. That leaves only timestamps which I think should be
> I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> conditional here on tr_num_buf_new ? Does the "else" actually occur in
> practice?

The idea is to only set I_DIRTY_DATASYNC if we've actually updated
metadata because of the write.  Otherwise we flush the log for every
write, even if we aren't data journalling and haven't changed any
metadata.  If the write happened without adding any buffers to the
transaction (which happens if we are just writing over already allocated
space), then tr->tr_num_buf_new will be 0. This means we haven't
actually dirtied any metadata, and we don't need to flush it to the log
during the fsync.

With instrumentation in the code, I can definitely see that with the
check for tr->tr_num_buf_new, I_DIRTY_DATASYNC isn't set if we are just
rewriting over already allocated space, and gfs2_fsync doesn't need to
do all the extra I_DIRTY_DATASYNC work, cutting down the performance
impact of this patch.

If you are doing data journalling, you obviously need to flush the
changed datablocks to the journal in gfs2_fsync.  Otherwise it won't
write them back to their inplace locations. This is true even if the
write didn't update metadata, and so I_DIRTY_DATASYNC isn't set.  That's
why I have gfs2_fsync check for I_DIRTY_PAGES if an inode is using data
journalling.

Does that make sense, or am I missing something?

Now the part of my patch that I don't understand is the bit after the
"else".  See my questions at

https://bugzilla.redhat.com/show_bug.cgi?id=991596#c19

I added the "else" to keep gfs2_write_end always adding the inode to the
transaction, but I don't know why that is necessary. If we aren't
changing any metadata, why do we need to add the inode to the
transaction at all.

-Ben

> 
> Otherwise I think the patch looks good,
> 
> Steve.
> 
> >  
> >  	if (inode == sdp->sd_rindex) {
> >  		adjust_fs_space(inode);
> > Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > ===================================================================
> > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> > +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
> >  {
> >  	struct address_space *mapping = file->f_mapping;
> >  	struct inode *inode = mapping->host;
> > -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > +	int sync_state = inode->i_state & I_DIRTY;
> >  	struct gfs2_inode *ip = GFS2_I(inode);
> >  	int ret = 0, ret1 = 0;
> >  
> > @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
> >  			return ret1;
> >  	}
> >  
> > +	if (!gfs2_is_jdata(ip))
> > +		sync_state &= ~I_DIRTY_PAGES;
> >  	if (datasync)
> >  		sync_state &= ~I_DIRTY_SYNC;
> >  
> > 
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-04 16:59   ` Benjamin Marzinski
@ 2013-09-04 21:36     ` Steven Whitehouse
  2013-09-05  3:48       ` Benjamin Marzinski
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Whitehouse @ 2013-09-04 21:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > it actually increased the file size.  If gfs2_fsync was called without
> > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > returning, so any metadata or journaled data changes were not getting
> > > fsynced. This meant that writes to the middle of files were not always
> > > getting fsynced properly.
> > > 
> > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > updated during a write. It also make gfs2_sync flush the incore log
> > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > will make sure that all incore logged data gets written to disk before
> > > returning from a fsync.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > ---
> > >  fs/gfs2/aops.c |    9 +++++++--
> > >  fs/gfs2/file.c |    4 +++-
> > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > 
> > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > ===================================================================
> > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > >  	unsigned int to = from + len;
> > >  	int ret;
> > > +	struct gfs2_trans *tr = current->journal_info;
> > > +	BUG_ON(!tr);
> > >  
> > >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > >  
> > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > >  		goto failed;
> > >  	}
> > >  
> > > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > > -
> > >  	if (gfs2_is_stuffed(ip))
> > >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > >  
> > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > >  		gfs2_page_add_databufs(ip, page, from, to);
> > >  
> > >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > +	if (tr->tr_num_buf_new)
> > > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > +	else
> > > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > > +
> > I'm not sure I follow this bit... generic_write_end() will call
> > mark_inode_dirty() if the file size has changed. So that case should be
> > covered. That leaves only timestamps which I think should be
> > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > practice?
> 
> The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> metadata because of the write.

I think we ought to be updating the metadata for every write (although
see case #3 below), since the timestamp should change each time. If the
size of the file doesn't change then we don't need to flush the metadata
on fdatasync, even though the timestamp has changed.

I think we have three cases:

1. i_size changes: metadata must be flushed on fsync/fdatasync, so
I_DIRTY_DATASYNC should be set (done in generic_write_end())
2. only timestamps change: metatdata must be flushed on fsync but not on
fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
problematic case)
3. write is of 0 bytes: I think we are allowed to skip all metadata
updates (need to check that we do this)

>  Otherwise we flush the log for every
> write, even if we aren't data journalling and haven't changed any
> metadata.  If the write happened without adding any buffers to the
> transaction (which happens if we are just writing over already allocated
> space), then tr->tr_num_buf_new will be 0. This means we haven't
> actually dirtied any metadata, and we don't need to flush it to the log
> during the fsync.
> 
Does that happen for zero sized writes? Otherwise I'd be expecting us to
be updating time stamps at least on each write.

> With instrumentation in the code, I can definitely see that with the
> check for tr->tr_num_buf_new, I_DIRTY_DATASYNC isn't set if we are just
> rewriting over already allocated space, and gfs2_fsync doesn't need to
> do all the extra I_DIRTY_DATASYNC work, cutting down the performance
> impact of this patch.
> 
> If you are doing data journalling, you obviously need to flush the
> changed datablocks to the journal in gfs2_fsync.  Otherwise it won't
> write them back to their inplace locations. This is true even if the
> write didn't update metadata, and so I_DIRTY_DATASYNC isn't set.  That's
> why I have gfs2_fsync check for I_DIRTY_PAGES if an inode is using data
> journalling.
> 
> Does that make sense, or am I missing something?
> 
Yes, I certainly agree with the changes at the fsync end wrt jdata. That
all makes sense to me.

> Now the part of my patch that I don't understand is the bit after the
> "else".  See my questions at
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=991596#c19
> 
> I added the "else" to keep gfs2_write_end always adding the inode to the
> transaction, but I don't know why that is necessary. If we aren't
> changing any metadata, why do we need to add the inode to the
> transaction at all.
> 
> -Ben
> 
Ah, I see now. I'd not spotted those questions before. I think the
answer to question #1 most likely relates to time stamps, and I hope
that I've more or less answered it above.

I suspect that the answer to question #2 is no. We call
filemap_fdatawrite/wait to writeback the pages of the two address
spaces, and flush the log, etc in order to ensure that all the metadata
is in a suitable state to be written back. So the issue is just to make
sure that everything gets written back and waited for, and I think we
are doing that correctly,

Steve.

> > 
> > Otherwise I think the patch looks good,
> > 
> > Steve.
> > 
> > >  
> > >  	if (inode == sdp->sd_rindex) {
> > >  		adjust_fs_space(inode);
> > > Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > ===================================================================
> > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> > > +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
> > >  {
> > >  	struct address_space *mapping = file->f_mapping;
> > >  	struct inode *inode = mapping->host;
> > > -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > +	int sync_state = inode->i_state & I_DIRTY;
> > >  	struct gfs2_inode *ip = GFS2_I(inode);
> > >  	int ret = 0, ret1 = 0;
> > >  
> > > @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
> > >  			return ret1;
> > >  	}
> > >  
> > > +	if (!gfs2_is_jdata(ip))
> > > +		sync_state &= ~I_DIRTY_PAGES;
> > >  	if (datasync)
> > >  		sync_state &= ~I_DIRTY_SYNC;
> > >  
> > > 
> > 




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-04 21:36     ` Steven Whitehouse
@ 2013-09-05  3:48       ` Benjamin Marzinski
  2013-09-05  8:28         ` Steven Whitehouse
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Marzinski @ 2013-09-05  3:48 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > > it actually increased the file size.  If gfs2_fsync was called without
> > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > > returning, so any metadata or journaled data changes were not getting
> > > > fsynced. This meant that writes to the middle of files were not always
> > > > getting fsynced properly.
> > > > 
> > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > > updated during a write. It also make gfs2_sync flush the incore log
> > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > > will make sure that all incore logged data gets written to disk before
> > > > returning from a fsync.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > > ---
> > > >  fs/gfs2/aops.c |    9 +++++++--
> > > >  fs/gfs2/file.c |    4 +++-
> > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > > 
> > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > ===================================================================
> > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > > >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > > >  	unsigned int to = from + len;
> > > >  	int ret;
> > > > +	struct gfs2_trans *tr = current->journal_info;
> > > > +	BUG_ON(!tr);
> > > >  
> > > >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > > >  
> > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > > >  		goto failed;
> > > >  	}
> > > >  
> > > > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > -
> > > >  	if (gfs2_is_stuffed(ip))
> > > >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > > >  
> > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > > >  		gfs2_page_add_databufs(ip, page, from, to);
> > > >  
> > > >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > +	if (tr->tr_num_buf_new)
> > > > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > +	else
> > > > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > +
> > > I'm not sure I follow this bit... generic_write_end() will call
> > > mark_inode_dirty() if the file size has changed. So that case should be
> > > covered. That leaves only timestamps which I think should be
> > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > > practice?
> > 
> > The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> > metadata because of the write.
> 
> I think we ought to be updating the metadata for every write (although
> see case #3 below), since the timestamp should change each time. If the
> size of the file doesn't change then we don't need to flush the metadata
> on fdatasync, even though the timestamp has changed.
> 
> I think we have three cases:
> 
> 1. i_size changes: metadata must be flushed on fsync/fdatasync, so
> I_DIRTY_DATASYNC should be set (done in generic_write_end())
> 2. only timestamps change: metatdata must be flushed on fsync but not on
> fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
> problematic case)
> 3. write is of 0 bytes: I think we are allowed to skip all metadata
> updates (need to check that we do this)

I wasn't counting timestamp changes, since like you mention, those use
I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes
that cause block allocations.  The specific test in the bugzilla uses a
holey file.  The writes that aren't getting synced ARE causing an
allocation, but ARE NOT increasing i_size, since they are not at the end
of the file.  This is a fourth case, and it's the case where we were
falling down. I probably should have clarified that earlier.

> 
> >  Otherwise we flush the log for every
> > write, even if we aren't data journalling and haven't changed any
> > metadata.  If the write happened without adding any buffers to the
> > transaction (which happens if we are just writing over already allocated
> > space), then tr->tr_num_buf_new will be 0. This means we haven't
> > actually dirtied any metadata, and we don't need to flush it to the log
> > during the fsync.
> > 
> Does that happen for zero sized writes? Otherwise I'd be expecting us to
> be updating time stamps at least on each write.

In rewriting already allocated data, we will update the timestamp and
the inode will be added to the incore log, but we don't need to flush
that change in gfs2_fsync, since we are asking for fdatasync. I thought
the whole point of the seperation between I_DIRTY_SYNC and
I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only
timestamps had changed. The change will still get to disk, but if we
happen to crash before that, the mtime on the file might be wrong. Isn't
that just the price you pay for doing a datasync instead of a full sync?

> > With instrumentation in the code, I can definitely see that with the
> > check for tr->tr_num_buf_new, I_DIRTY_DATASYNC isn't set if we are just
> > rewriting over already allocated space, and gfs2_fsync doesn't need to
> > do all the extra I_DIRTY_DATASYNC work, cutting down the performance
> > impact of this patch.
> > 
> > If you are doing data journalling, you obviously need to flush the
> > changed datablocks to the journal in gfs2_fsync.  Otherwise it won't
> > write them back to their inplace locations. This is true even if the
> > write didn't update metadata, and so I_DIRTY_DATASYNC isn't set.  That's
> > why I have gfs2_fsync check for I_DIRTY_PAGES if an inode is using data
> > journalling.
> > 
> > Does that make sense, or am I missing something?
> > 
> Yes, I certainly agree with the changes at the fsync end wrt jdata. That
> all makes sense to me.
> 
> > Now the part of my patch that I don't understand is the bit after the
> > "else".  See my questions at
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=991596#c19
> > 
> > I added the "else" to keep gfs2_write_end always adding the inode to the
> > transaction, but I don't know why that is necessary. If we aren't
> > changing any metadata, why do we need to add the inode to the
> > transaction at all.
> > 
> > -Ben
> > 
> Ah, I see now. I'd not spotted those questions before. I think the
> answer to question #1 most likely relates to time stamps, and I hope
> that I've more or less answered it above.

I thought that mtime didn't get updated if the last modification was
less than a second ago. And doesn't the code that sets mtime do it's own
mark_inode_dirty(), which should call gfs2_dirty_inode() and add the
inode there.  It seems redundant to do it here as well, especially if
there are times when it's not actually necessary.

> I suspect that the answer to question #2 is no. We call
> filemap_fdatawrite/wait to writeback the pages of the two address
> spaces, and flush the log, etc in order to ensure that all the metadata
> is in a suitable state to be written back. So the issue is just to make
> sure that everything gets written back and waited for, and I think we
> are doing that correctly,
> 
> Steve.
> 
> > > 
> > > Otherwise I think the patch looks good,
> > > 
> > > Steve.
> > > 
> > > >  
> > > >  	if (inode == sdp->sd_rindex) {
> > > >  		adjust_fs_space(inode);
> > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > > ===================================================================
> > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> > > > @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
> > > >  {
> > > >  	struct address_space *mapping = file->f_mapping;
> > > >  	struct inode *inode = mapping->host;
> > > > -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > > +	int sync_state = inode->i_state & I_DIRTY;
> > > >  	struct gfs2_inode *ip = GFS2_I(inode);
> > > >  	int ret = 0, ret1 = 0;
> > > >  
> > > > @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
> > > >  			return ret1;
> > > >  	}
> > > >  
> > > > +	if (!gfs2_is_jdata(ip))
> > > > +		sync_state &= ~I_DIRTY_PAGES;
> > > >  	if (datasync)
> > > >  		sync_state &= ~I_DIRTY_SYNC;
> > > >  
> > > > 
> > > 
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-05  3:48       ` Benjamin Marzinski
@ 2013-09-05  8:28         ` Steven Whitehouse
  0 siblings, 0 replies; 8+ messages in thread
From: Steven Whitehouse @ 2013-09-05  8:28 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Wed, 2013-09-04 at 22:48 -0500, Benjamin Marzinski wrote:
> On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote:
> > > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote:
> > > > Hi,
> > > > 
> > > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> > > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> > > > > it actually increased the file size.  If gfs2_fsync was called without
> > > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> > > > > returning, so any metadata or journaled data changes were not getting
> > > > > fsynced. This meant that writes to the middle of files were not always
> > > > > getting fsynced properly.
> > > > > 
> > > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> > > > > updated during a write. It also make gfs2_sync flush the incore log
> > > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This
> > > > > will make sure that all incore logged data gets written to disk before
> > > > > returning from a fsync.
> > > > > 
> > > > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > > > ---
> > > > >  fs/gfs2/aops.c |    9 +++++++--
> > > > >  fs/gfs2/file.c |    4 +++-
> > > > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > > > 
> > > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > ===================================================================
> > > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> > > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> > > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
> > > > >  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
> > > > >  	unsigned int to = from + len;
> > > > >  	int ret;
> > > > > +	struct gfs2_trans *tr = current->journal_info;
> > > > > +	BUG_ON(!tr);
> > > > >  
> > > > >  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
> > > > >  
> > > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
> > > > >  		goto failed;
> > > > >  	}
> > > > >  
> > > > > -	gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > -
> > > > >  	if (gfs2_is_stuffed(ip))
> > > > >  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
> > > > >  
> > > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
> > > > >  		gfs2_page_add_databufs(ip, page, from, to);
> > > > >  
> > > > >  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > +	if (tr->tr_num_buf_new)
> > > > > +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > > +	else
> > > > > +		gfs2_trans_add_meta(ip->i_gl, dibh);
> > > > > +
> > > > I'm not sure I follow this bit... generic_write_end() will call
> > > > mark_inode_dirty() if the file size has changed. So that case should be
> > > > covered. That leaves only timestamps which I think should be
> > > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a
> > > > conditional here on tr_num_buf_new ? Does the "else" actually occur in
> > > > practice?
> > > 
> > > The idea is to only set I_DIRTY_DATASYNC if we've actually updated
> > > metadata because of the write.
> > 
> > I think we ought to be updating the metadata for every write (although
> > see case #3 below), since the timestamp should change each time. If the
> > size of the file doesn't change then we don't need to flush the metadata
> > on fdatasync, even though the timestamp has changed.
> > 
> > I think we have three cases:
> > 
> > 1. i_size changes: metadata must be flushed on fsync/fdatasync, so
> > I_DIRTY_DATASYNC should be set (done in generic_write_end())
> > 2. only timestamps change: metatdata must be flushed on fsync but not on
> > fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the
> > problematic case)
> > 3. write is of 0 bytes: I think we are allowed to skip all metadata
> > updates (need to check that we do this)
> 
> I wasn't counting timestamp changes, since like you mention, those use
> I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes
> that cause block allocations.  The specific test in the bugzilla uses a
> holey file.  The writes that aren't getting synced ARE causing an
> allocation, but ARE NOT increasing i_size, since they are not at the end
> of the file.  This is a fourth case, and it's the case where we were
> falling down. I probably should have clarified that earlier.
> 
Ok, that makes perfect sense to me. Lets go with the patch "as is".
Sorry for not spotting that earlier.

> > 
> > >  Otherwise we flush the log for every
> > > write, even if we aren't data journalling and haven't changed any
> > > metadata.  If the write happened without adding any buffers to the
> > > transaction (which happens if we are just writing over already allocated
> > > space), then tr->tr_num_buf_new will be 0. This means we haven't
> > > actually dirtied any metadata, and we don't need to flush it to the log
> > > during the fsync.
> > > 
> > Does that happen for zero sized writes? Otherwise I'd be expecting us to
> > be updating time stamps at least on each write.
> 
> In rewriting already allocated data, we will update the timestamp and
> the inode will be added to the incore log, but we don't need to flush
> that change in gfs2_fsync, since we are asking for fdatasync. I thought
> the whole point of the seperation between I_DIRTY_SYNC and
> I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only
> timestamps had changed. The change will still get to disk, but if we
> happen to crash before that, the mtime on the file might be wrong. Isn't
> that just the price you pay for doing a datasync instead of a full sync?
> 
Yes, that all sounds correct to me,

Steve.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end
  2013-09-03 21:59 [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end Benjamin Marzinski
  2013-09-04 13:24 ` Bob Peterson
  2013-09-04 14:55 ` Steven Whitehouse
@ 2013-09-05  9:34 ` Steven Whitehouse
  2 siblings, 0 replies; 8+ messages in thread
From: Steven Whitehouse @ 2013-09-05  9:34 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Now in the -nmw git tree. Thanks,

Steve.

On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote:
> GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
> it actually increased the file size.  If gfs2_fsync was called without
> I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
> returning, so any metadata or journaled data changes were not getting
> fsynced. This meant that writes to the middle of files were not always
> getting fsynced properly.
> 
> This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
> updated during a write. It also make gfs2_sync flush the incore log
> if I_DIRTY_PAGES is set, and the file is using data journalling. This
> will make sure that all incore logged data gets written to disk before
> returning from a fsync.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  fs/gfs2/aops.c |    9 +++++++--
>  fs/gfs2/file.c |    4 +++-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> ===================================================================
> --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c
> +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c
> @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f
>  	unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
>  	unsigned int to = from + len;
>  	int ret;
> +	struct gfs2_trans *tr = current->journal_info;
> +	BUG_ON(!tr);
>  
>  	BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
>  
> @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f
>  		goto failed;
>  	}
>  
> -	gfs2_trans_add_meta(ip->i_gl, dibh);
> -
>  	if (gfs2_is_stuffed(ip))
>  		return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
>  
> @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f
>  		gfs2_page_add_databufs(ip, page, from, to);
>  
>  	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> +	if (tr->tr_num_buf_new)
> +		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> +	else
> +		gfs2_trans_add_meta(ip->i_gl, dibh);
> +
>  
>  	if (inode == sdp->sd_rindex) {
>  		adjust_fs_space(inode);
> Index: gfs2-3.0-nmw-130830/fs/gfs2/file.c
> ===================================================================
> --- gfs2-3.0-nmw-130830.orig/fs/gfs2/file.c
> +++ gfs2-3.0-nmw-130830/fs/gfs2/file.c
> @@ -650,7 +650,7 @@ static int gfs2_fsync(struct file *file,
>  {
>  	struct address_space *mapping = file->f_mapping;
>  	struct inode *inode = mapping->host;
> -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> +	int sync_state = inode->i_state & I_DIRTY;
>  	struct gfs2_inode *ip = GFS2_I(inode);
>  	int ret = 0, ret1 = 0;
>  
> @@ -660,6 +660,8 @@ static int gfs2_fsync(struct file *file,
>  			return ret1;
>  	}
>  
> +	if (!gfs2_is_jdata(ip))
> +		sync_state &= ~I_DIRTY_PAGES;
>  	if (datasync)
>  		sync_state &= ~I_DIRTY_SYNC;
>  
> 




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-09-05  9:34 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-03 21:59 [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end Benjamin Marzinski
2013-09-04 13:24 ` Bob Peterson
2013-09-04 14:55 ` Steven Whitehouse
2013-09-04 16:59   ` Benjamin Marzinski
2013-09-04 21:36     ` Steven Whitehouse
2013-09-05  3:48       ` Benjamin Marzinski
2013-09-05  8:28         ` Steven Whitehouse
2013-09-05  9:34 ` Steven Whitehouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).