From: Andrew Morton <akpm@linux-foundation.org>
To: Jan Kara <jack@suse.cz>
Cc: aneesh.kumar@linux.vnet.ibm.com, linux-ext4@vger.kernel.org,
Nick Piggin <nickpiggin@yahoo.com.au>
Subject: Re: [PATCH] ext3: Avoid false EIO errors (version 4)
Date: Tue, 31 Mar 2009 17:06:21 -0700 [thread overview]
Message-ID: <20090331170621.d2ae5693.akpm@linux-foundation.org> (raw)
In-Reply-To: <20090331094618.GE11808@duck.suse.cz>
On Tue, 31 Mar 2009 11:46:18 +0200
Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> as Aneesh pointed to me, I forgot to add truncation to the data=journaled
> mode. This patch fixes it. Hopefully final version of the fix ;).
>
I _think_ I got the right version here.
>
> >From ec5c2977f87328fb93fee1aa35043bafeb53cea1 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 25 Mar 2009 18:51:52 +0100
> Subject: [PATCH] ext3: Avoid false EIO errors (version 4)
>
> Sometimes block_write_begin() can map buffers in a page but later we fail to
> copy data into those buffers (because the source page has been paged out in the
> mean time).
Really? We should just page it back in again. Do you mean "it was an
invalid address and we got -EFAULT"? Or "a parallel thread unmapped
it" or...?
> We then end up with !uptodate mapped buffers.
OK, I suppose that has to be a legal buffer state.
> To add a bit more to
> the confusion, block_write_end() does not commit any data (and thus does not
> any mark buffers as uptodate) if we didn't succeed with copying all the data.
>
> Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> missed these cases and thus we were inserting non-uptodate buffers to
> transaction's list which confuses JBD code and it reports IO errors, aborts
> a transaction and generally makes users afraid about their data ;-P.
hm. Did any other filesystems break for these reasons?
> This patch fixes the problem by reorganizing ext3_..._write_end() code to
> first call block_write_end() to mark buffers with valid data uptodate and
> after that we file only uptodate buffers to transaction's lists.
>
> We also fix a problem where we could leave blocks allocated beyond i_size
> (i_disksize in fact) because of failed write. We now add inode to orphan
> list when write fails (to be safe in case we crash) and then truncate blocks
> beyond i_size in a separate transaction.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
> fs/ext3/inode.c | 139 +++++++++++++++++++++++++++++--------------------------
> 1 files changed, 74 insertions(+), 65 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 4a09ff1..d3ef656 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1149,12 +1149,15 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
> + int ret;
> handle_t *handle;
> int retries = 0;
> struct page *page;
> pgoff_t index;
> unsigned from, to;
> + /* Reserve one block more for addition to orphan list in case
> + * we allocate blocks but write fails for some reason */
> + int needed_blocks = ext3_writepage_trans_blocks(inode) + 1;
>
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> @@ -1184,15 +1187,20 @@ retry:
> }
> write_begin_failed:
> if (ret) {
> - ext3_journal_stop(handle);
> - unlock_page(page);
> - page_cache_release(page);
> /*
> * block_write_begin may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Add inode to orphan list in case we crash before truncate
> + * finishes.
> */
> if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ext3_journal_stop(handle);
> + unlock_page(page);
> + page_cache_release(page);
> + if (pos + len > inode->i_size)
> vmtruncate(inode, inode->i_size);
> }
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> @@ -1211,6 +1219,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> return err;
> }
>
> +/* For ordered writepage and write_end functions */
> +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> +{
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (buffer_mapped(bh) && buffer_uptodate(bh))
> + return ext3_journal_dirty_data(handle, bh);
> + return 0;
> +}
> +
> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1221,26 +1241,20 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> }
>
> /*
> - * Generic write_end handler for ordered and writeback ext3 journal modes.
> - * We can't use generic_write_end, because that unlocks the page and we need to
> - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> - * after block_write_end.
> + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> + * for the whole page but later we failed to copy the data in. Update inode
> + * size according to what we managed to copy. The rest is going to be
> + * truncated in write_end function.
> */
> -static int ext3_generic_write_end(struct file *file,
> - struct address_space *mapping,
> - loff_t pos, unsigned len, unsigned copied,
> - struct page *page, void *fsdata)
> +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
> {
> - struct inode *inode = file->f_mapping->host;
> -
> - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -
> - if (pos+copied > inode->i_size) {
> - i_size_write(inode, pos+copied);
> + /* What matters to us is i_disksize. We don't write i_size anywhere */
> + if (pos + copied > inode->i_size)
> + i_size_write(inode, pos + copied);
> + if (pos + copied > EXT3_I(inode)->i_disksize) {
> + EXT3_I(inode)->i_disksize = pos + copied;
> mark_inode_dirty(inode);
> }
> -
> - return copied;
> }
>
> /*
> @@ -1260,35 +1274,29 @@ static int ext3_ordered_write_end(struct file *file,
> unsigned from, to;
> int ret = 0, ret2;
>
> - from = pos & (PAGE_CACHE_SIZE - 1);
> - to = from + len;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
>
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> ret = walk_page_buffers(handle, page_buffers(page),
> - from, to, NULL, ext3_journal_dirty_data);
> + from, to, NULL, journal_dirty_data_fn);
>
> - if (ret == 0) {
> - /*
> - * generic_write_end() will run mark_inode_dirty() if i_size
> - * changes. So let's piggyback the i_disksize mark_inode_dirty
> - * into that.
> - */
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> - }
> + if (ret == 0)
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> ret2 = ext3_journal_stop(handle);
> if (!ret)
> ret = ret2;
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1299,25 +1307,22 @@ static int ext3_writeback_write_end(struct file *file,
> {
> handle_t *handle = ext3_journal_current_handle();
> struct inode *inode = file->f_mapping->host;
> - int ret = 0, ret2;
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> -
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> + int ret;
>
> - ret2 = ext3_journal_stop(handle);
> - if (!ret)
> - ret = ret2;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ret = ext3_journal_stop(handle);
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1338,15 +1343,23 @@ static int ext3_journalled_write_end(struct file *file,
> if (copied < len) {
> if (!PageUptodate(page))
> copied = 0;
> - page_zero_new_buffers(page, from+copied, to);
> + page_zero_new_buffers(page, from + copied, to);
> + to = from + copied;
> }
>
> ret = walk_page_buffers(handle, page_buffers(page), from,
> to, &partial, write_end_fn);
> if (!partial)
> SetPageUptodate(page);
> - if (pos+copied > inode->i_size)
> - i_size_write(inode, pos+copied);
> +
> + if (pos + copied > inode->i_size)
> + i_size_write(inode, pos + copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
> if (inode->i_size > EXT3_I(inode)->i_disksize) {
> EXT3_I(inode)->i_disksize = inode->i_size;
> @@ -1361,6 +1374,8 @@ static int ext3_journalled_write_end(struct file *file,
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1428,17 +1443,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> -static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> -{
> - if (buffer_mapped(bh))
> - return ext3_journal_dirty_data(handle, bh);
> - return 0;
> -}
> -
> static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
> {
> return !buffer_mapped(bh);
> }
> +
> /*
> * Note that we always start a transaction even if we're not journalling
> * data. This is to preserve ordering: any hole instantiation within
> --
> 1.6.0.2
next prev parent reply other threads:[~2009-04-01 0:08 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-03-26 18:21 [PATCH] ext3: Avoid false EIO errors Jan Kara
2009-03-27 18:08 ` Aneesh Kumar K.V
2009-03-27 20:24 ` Jan Kara
2009-03-30 8:25 ` Aneesh Kumar K.V
2009-03-30 10:32 ` Jan Kara
2009-03-30 10:58 ` Aneesh Kumar K.V
2009-03-30 16:05 ` Jan Kara
2009-03-31 4:45 ` Aneesh Kumar K.V
2009-03-31 9:29 ` [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure Aneesh Kumar K.V
2009-03-31 9:29 ` [PATCH 2/2] [PATCH] ext4: truncate the file properly if we fail to copy data from userspace Aneesh Kumar K.V
2009-03-31 9:38 ` Jan Kara
2009-04-05 3:22 ` Theodore Tso
2009-04-06 10:07 ` Jan Kara
2009-03-31 9:34 ` [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure Jan Kara
2009-04-05 3:11 ` Theodore Tso
2009-04-06 10:05 ` Jan Kara
2009-06-05 4:31 ` Theodore Tso
2009-06-05 6:22 ` Theodore Tso
2009-06-05 7:24 ` Theodore Tso
2009-06-05 23:42 ` Jan Kara
2009-06-05 23:44 ` Jan Kara
2009-06-08 4:35 ` [PATCH -V2 1/2] " Aneesh Kumar K.V
2009-06-08 4:35 ` [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace Aneesh Kumar K.V
2009-06-08 16:29 ` Theodore Tso
2009-06-08 16:43 ` Aneesh Kumar K.V
2009-06-08 19:14 ` Theodore Tso
2009-06-08 19:23 ` Jan Kara
2009-06-08 20:20 ` Aneesh Kumar K.V
2009-06-09 10:12 ` Jan Kara
2009-06-08 16:29 ` [PATCH -V2 1/2] ext4: Add inode to the orphan list during block allocation failure Theodore Tso
2009-03-31 9:46 ` [PATCH] ext3: Avoid false EIO errors (version 4) Jan Kara
2009-04-01 0:06 ` Andrew Morton [this message]
2009-04-01 9:49 ` Jan Kara
2009-03-30 13:53 ` [PATCH] ext3: Avoid false EIO errors Eric Sandeen
2009-03-30 14:45 ` Aneesh Kumar K.V
2009-03-30 15:12 ` Eric Sandeen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090331170621.d2ae5693.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=aneesh.kumar@linux.vnet.ibm.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=nickpiggin@yahoo.com.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.