From: Tao Ma <tao.ma@oracle.com>
To: Dave Chinner <david@fromorbit.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Linux Kernel <linux-kernel@vger.kernel.org>,
ocfs2-devel@oss.oracle.com, Dave Chinner <dchinner@redhat.com>,
Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>,
Joel Becker <joel.becker@oracle.com>
Subject: Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2
Date: Mon, 05 Jul 2010 11:51:44 +0800 [thread overview]
Message-ID: <4C3156D0.5090001@oracle.com> (raw)
In-Reply-To: <20100703213219.GB21262@mail.oracle.com>
Hi Joel,
On 07/04/2010 05:32 AM, Joel Becker wrote:
> Here's the second version of my corruption fix. It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
> that.
> 2) Fix an off-by-one when checking whether the write position is within
> the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster. This can be larger than a block
> or even a memory page. This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks. Unfortunately,
> block_write_full_page() drops the pages past i_size. This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size. If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems. Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker<joel.becker@oracle.com>
> ---
> fs/ocfs2/aops.c | 22 ++++----
> fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
> fs/ocfs2/file.h | 2 +
> 3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
> return ret;
> }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail(). It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> + struct buffer_head *di_bh,
> + loff_t pos, u64 *start_blkno,
> + u64 *blocks)
> +{
> + int rc = 0;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + u32 tail_cpos, pos_cpos, p_cpos;
> + u64 tail_blkno, pos_blkno, blocks_to_zero;
> + unsigned int num_clusters = 0;
> + unsigned int ext_flags = 0;
> +
> + /*
> + * The block containing i_size has already been zeroed, so our tail
> + * block is the first block after i_size. The block containing
> + * pos will be zeroed. So we only need to do anything if
> + * tail_blkno is before pos_blkno.
> + */
> + tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1;
> + pos_blkno = pos>> inode->i_sb->s_blocksize_bits;
> + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> + if (pos_blkno<= tail_blkno)
> + goto out;
> + blocks_to_zero = pos_blkno - tail_blkno;
> +
> + /*
> + * If tail_blkno is in the cluster past i_size, we don't need
> + * to touch the cluster containing i_size at all.
> + */
> + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> + tail_blkno);
Can we always set tail_cpos in one line?
tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster and
both works for tail_blkno I guess?
> +
> + rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters,
> + &ext_flags);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> +
> + /* Is there a cluster to zero? */
> + if (!p_cpos)
> + goto out;
For unwritten extent, we also need to clear the pages? If yes, the
solution doesn't complete if we have 2 unwritten extent, one contains
i_size while one passes i_size. Here we only clear the pages for the 1st
unwritten extent and leave the 2nd one untouched.
> +
> + pos_cpos = pos>> osb->s_clustersize_bits;
> + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned int)tail_cpos, (unsigned int)num_clusters,
> + (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> + (unsigned long long)pos_blkno);
From here to the call of CoW is a bit hard to understand. In 'if',
num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it
isn't easy for the reader to tell why these 2 clauses are setting
different values. So how about my code below? It looks more
straightforward I think.
> + if ((tail_cpos + num_clusters)> pos_cpos) {
> + num_clusters = pos_cpos - tail_cpos;
> + if (pos_blkno>
> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> + num_clusters += 1;
> + } else {
> + blocks_to_zero =
> + ocfs2_clusters_to_blocks(inode->i_sb,
> + tail_cpos + num_clusters);
> + blocks_to_zero -= tail_blkno;
> + }
> +
> + /* Now CoW the clusters we're about to zero */
> + if (ext_flags& OCFS2_EXT_REFCOUNTED) {
> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> + num_clusters, UINT_MAX);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> + }
/* Decrease blocks_to_zero if there is some hole after extent */
if (tail_cpos + num_clusters <= pos_cpos) {
blocks_to_zero =
ocfs2_clusters_to_blocks(inode->i_sb,
tail_cpos + num_clusters);
blocks_to_zero -= tail_blkno;
}
/* Now CoW if we have some refcounted clusters. */
if (ext_flags & OCFS2_EXT_REFCOUNTED) {
/*
* We add one more cluster here since it will be
* written shortly and if the pos_blkno isn't aligned
* to the cluster size, we have to zero the blocks
* before it.
*/
if (tail_cpos + num_clusters > pos_cpos)
num_clusters = pos_cpos - tail_cpos + 1;
rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
num_clusters, UINT_MAX);
if (rc) {
mlog_errno(rc);
goto out;
}
}
> +
> + *start_blkno = tail_blkno;
> + *blocks = blocks_to_zero;
> + mlog(0, "start_blkno = %llu, blocks = %llu\n",
> + (unsigned long long)(*start_blkno),
> + (unsigned long long)(*blocks));
> +
> +out:
> + return rc;
> +}
Regards,
Tao
next prev parent reply other threads:[~2010-07-05 3:53 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-28 17:35 [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Joel Becker
2010-06-29 0:24 ` Dave Chinner
2010-06-29 0:54 ` Joel Becker
2010-06-29 1:12 ` Linus Torvalds
2010-06-29 1:58 ` [Ocfs2-devel] " Joel Becker
2010-06-29 2:20 ` Linus Torvalds
2010-06-29 2:44 ` Dave Chinner
2010-06-29 8:16 ` Joel Becker
2010-06-30 1:30 ` Joel Becker
2010-07-06 19:06 ` Joel Becker
2010-06-29 1:56 ` Dave Chinner
2010-06-29 2:04 ` Joel Becker
2010-06-29 2:27 ` Dave Chinner
2010-06-29 7:18 ` Joel Becker
2010-07-02 22:49 ` [PATCH] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-03 21:32 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Joel Becker
2010-07-03 21:33 ` [PATCH 2/2] ocfs2: No need to zero pages past i_size. " Joel Becker
2010-07-04 15:13 ` Tao Ma
2010-07-05 1:38 ` Tao Ma
2010-07-06 7:10 ` Joel Becker
2010-07-06 7:09 ` Joel Becker
2010-07-06 18:39 ` [Ocfs2-devel] " Joel Becker
2010-07-05 3:51 ` Tao Ma [this message]
2010-07-06 7:17 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past " Joel Becker
2010-07-06 7:54 ` Tao Ma
2010-07-06 11:58 ` Joel Becker
2010-07-07 0:42 ` Tao Ma
2010-07-07 2:03 ` Joel Becker
2010-07-06 18:48 ` Joel Becker
2010-07-06 18:57 ` Joel Becker
2010-07-07 11:16 ` [PATCH 0/3] ocfs2: Tail zeroing fixes Joel Becker
2010-07-12 22:45 ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16 ` [PATCH 1/3] ocfs2: When zero extending, do it by page Joel Becker
2010-07-07 15:19 ` Tao Ma
2010-07-07 20:04 ` Joel Becker
2010-07-08 3:44 ` Tao Ma
2010-07-08 9:51 ` Joel Becker
2010-07-07 11:16 ` [PATCH 2/3] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-07 11:16 ` [PATCH 3/3] ocfs2: No need to zero pages " Joel Becker
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4C3156D0.5090001@oracle.com \
--to=tao.ma@oracle.com \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=hch@lst.de \
--cc=joel.becker@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mfasheh@suse.com \
--cc=ocfs2-devel@oss.oracle.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox