From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Whitehouse Date: Thu, 05 Sep 2013 09:28:09 +0100 Subject: [Cluster-devel] [PATCH] GFS2: dirty inode correctly in gfs2_write_end In-Reply-To: <20130905034835.GA16256@dhcp80-209.msp.redhat.com> References: <20130903215942.GM2113@dhcp80-209.msp.redhat.com> <1378306503.2740.27.camel@menhir> <20130904165942.GN2113@dhcp80-209.msp.redhat.com> <1378330576.2740.46.camel@menhir> <20130905034835.GA16256@dhcp80-209.msp.redhat.com> Message-ID: <1378369689.2698.1.camel@menhir> List-Id: To: cluster-devel.redhat.com MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Hi, On Wed, 2013-09-04 at 22:48 -0500, Benjamin Marzinski wrote: > On Wed, Sep 04, 2013 at 10:36:16PM +0100, Steven Whitehouse wrote: > > Hi, > > > > On Wed, 2013-09-04 at 11:59 -0500, Benjamin Marzinski wrote: > > > On Wed, Sep 04, 2013 at 03:55:03PM +0100, Steven Whitehouse wrote: > > > > Hi, > > > > > > > > On Tue, 2013-09-03 at 16:59 -0500, Benjamin Marzinski wrote: > > > > > GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when > > > > > it actually increased the file size. If gfs2_fsync was called without > > > > > I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before > > > > > returning, so any metadata or journaled data changes were not getting > > > > > fsynced. This meant that writes to the middle of files were not always > > > > > getting fsynced properly. > > > > > > > > > > This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been > > > > > updated during a write. It also make gfs2_sync flush the incore log > > > > > if I_DIRTY_PAGES is set, and the file is using data journalling. This > > > > > will make sure that all incore logged data gets written to disk before > > > > > returning from a fsync. > > > > > > > > > > Signed-off-by: Benjamin Marzinski > > > > > --- > > > > > fs/gfs2/aops.c | 9 +++++++-- > > > > > fs/gfs2/file.c | 4 +++- > > > > > 2 files changed, 10 insertions(+), 3 deletions(-) > > > > > > > > > > Index: gfs2-3.0-nmw-130830/fs/gfs2/aops.c > > > > > =================================================================== > > > > > --- gfs2-3.0-nmw-130830.orig/fs/gfs2/aops.c > > > > > +++ gfs2-3.0-nmw-130830/fs/gfs2/aops.c > > > > > @@ -815,6 +815,8 @@ static int gfs2_write_end(struct file *f > > > > > unsigned int from = pos & (PAGE_CACHE_SIZE - 1); > > > > > unsigned int to = from + len; > > > > > int ret; > > > > > + struct gfs2_trans *tr = current->journal_info; > > > > > + BUG_ON(!tr); > > > > > > > > > > BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL); > > > > > > > > > > @@ -825,8 +827,6 @@ static int gfs2_write_end(struct file *f > > > > > goto failed; > > > > > } > > > > > > > > > > - gfs2_trans_add_meta(ip->i_gl, dibh); > > > > > - > > > > > if (gfs2_is_stuffed(ip)) > > > > > return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page); > > > > > > > > > > @@ -834,6 +834,11 @@ static int gfs2_write_end(struct file *f > > > > > gfs2_page_add_databufs(ip, page, from, to); > > > > > > > > > > ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata); > > > > > + if (tr->tr_num_buf_new) > > > > > + __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > > > + else > > > > > + gfs2_trans_add_meta(ip->i_gl, dibh); > > > > > + > > > > I'm not sure I follow this bit... generic_write_end() will call > > > > mark_inode_dirty() if the file size has changed. So that case should be > > > > covered. That leaves only timestamps which I think should be > > > > I_DIRTY_SYNC since we don't need those for fdatasync. But why is there a > > > > conditional here on tr_num_buf_new ? Does the "else" actually occur in > > > > practice? > > > > > > The idea is to only set I_DIRTY_DATASYNC if we've actually updated > > > metadata because of the write. > > > > I think we ought to be updating the metadata for every write (although > > see case #3 below), since the timestamp should change each time. If the > > size of the file doesn't change then we don't need to flush the metadata > > on fdatasync, even though the timestamp has changed. > > > > I think we have three cases: > > > > 1. i_size changes: metadata must be flushed on fsync/fdatasync, so > > I_DIRTY_DATASYNC should be set (done in generic_write_end()) > > 2. only timestamps change: metatdata must be flushed on fsync but not on > > fdatasync, so I_DIRTY_SYNC should be set (I assume perhaps the > > problematic case) > > 3. write is of 0 bytes: I think we are allowed to skip all metadata > > updates (need to check that we do this) > > I wasn't counting timestamp changes, since like you mention, those use > I_DIRTY_SYNC, and are ignored on fdatasync. What I care about is writes > that cause block allocations. The specific test in the bugzilla uses a > holey file. The writes that aren't getting synced ARE causing an > allocation, but ARE NOT increasing i_size, since they are not at the end > of the file. This is a fourth case, and it's the case where we were > falling down. I probably should have clarified that earlier. > Ok, that makes perfect sense to me. Lets go with the patch "as is". Sorry for not spotting that earlier. > > > > > Otherwise we flush the log for every > > > write, even if we aren't data journalling and haven't changed any > > > metadata. If the write happened without adding any buffers to the > > > transaction (which happens if we are just writing over already allocated > > > space), then tr->tr_num_buf_new will be 0. This means we haven't > > > actually dirtied any metadata, and we don't need to flush it to the log > > > during the fsync. > > > > > Does that happen for zero sized writes? Otherwise I'd be expecting us to > > be updating time stamps at least on each write. > > In rewriting already allocated data, we will update the timestamp and > the inode will be added to the incore log, but we don't need to flush > that change in gfs2_fsync, since we are asking for fdatasync. I thought > the whole point of the seperation between I_DIRTY_SYNC and > I_DIRTY_DATASYNC was so that you could avoid syncing inodes where only > timestamps had changed. The change will still get to disk, but if we > happen to crash before that, the mtime on the file might be wrong. Isn't > that just the price you pay for doing a datasync instead of a full sync? > Yes, that all sounds correct to me, Steve.