[Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries

ocfs2-devel.oss.oracle.com archive mirror
 help / color / mirror / Atom feed

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
@ 2011-02-17 15:44 Goldwyn Rodrigues
  2011-02-20  7:09 ` Joel Becker
  2011-03-26 22:48 ` Joel Becker
  0 siblings, 2 replies; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-17 15:44 UTC (permalink / raw)
  To: ocfs2-devel

When a hole spans across page boundaries, the next write forces
a read of the block. This could end up reading existing garbage
data from the disk in ocfs2_map_page_blocks. This leads to
non-zero holes. In order to avoid this, mark the writes as new
when the holes span across page boundaries.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>

---
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1fbb0e2..4be220d 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1026,6 +1026,12 @@ static int ocfs2_prepare_page_for_write(struct
inode *inode, u64 *p_blkno,
 	ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
 					&cluster_start, &cluster_end);

+	/* treat the write as new if the a hole/lseek spanned across
+	 * the page boundary.
+	 */
+	new = new | ((i_size_read(inode) <= page_offset(page)) &&
+			(page_offset(page) <= user_pos));
+
 	if (page == wc->w_target_page) {
 		map_from = user_pos & (PAGE_CACHE_SIZE - 1);
 		map_to = map_from + user_len;

-- 
Goldwyn

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-17 15:44 [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries Goldwyn Rodrigues
@ 2011-02-20  7:09 ` Joel Becker
  2011-02-20 18:45   ` Goldwyn Rodrigues
  2011-03-26 22:48 ` Joel Becker
  1 sibling, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-20  7:09 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Feb 17, 2011 at 09:44:40AM -0600, Goldwyn Rodrigues wrote:
> When a hole spans across page boundaries, the next write forces
> a read of the block. This could end up reading existing garbage
> data from the disk in ocfs2_map_page_blocks. This leads to
> non-zero holes. In order to avoid this, mark the writes as new
> when the holes span across page boundaries.

	Is this a new approach to your earlier patch, or an additional
change?

> ---
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 1fbb0e2..4be220d 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -1026,6 +1026,12 @@ static int ocfs2_prepare_page_for_write(struct
> inode *inode, u64 *p_blkno,
>  	ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
>  					&cluster_start, &cluster_end);
> 
> +	/* treat the write as new if the a hole/lseek spanned across
> +	 * the page boundary.
> +	 */
> +	new = new | ((i_size_read(inode) <= page_offset(page)) &&
> +			(page_offset(page) <= user_pos));

	There are two problems here.  First, It's not safe to claim
existing data is 'new'.  Imagine you have a 4K page and a 512B
blocksize.  The first 2 blocks of the page have data in them, but your
code change will cause them to be set_uptodate() even if we haven't read
them in yet.
	Secondly, ocfs2_should_read_blk() already checks for blocks past
i_size and skips reading them.  So if you are trying to avoid reading
them, it is already handled.

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-20  7:09 ` Joel Becker
@ 2011-02-20 18:45   ` Goldwyn Rodrigues
  2011-02-21  2:08     ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-20 18:45 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Sun, Feb 20, 2011 at 1:09 AM, Joel Becker <jlbec@evilplan.org> wrote:
> On Thu, Feb 17, 2011 at 09:44:40AM -0600, Goldwyn Rodrigues wrote:
>> When a hole spans across page boundaries, the next write forces
>> a read of the block. This could end up reading existing garbage
>> data from the disk in ocfs2_map_page_blocks. This leads to
>> non-zero holes. In order to avoid this, mark the writes as new
>> when the holes span across page boundaries.
>
> ? ? ? ?Is this a new approach to your earlier patch, or an additional
> change?

This is a new approach. You can scrap the previous attempt. I started
reading about write prepare's once you advised to look at the write
front.

>
>> ---
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 1fbb0e2..4be220d 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -1026,6 +1026,12 @@ static int ocfs2_prepare_page_for_write(struct
>> inode *inode, u64 *p_blkno,
>> ? ? ? ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &cluster_start, &cluster_end);
>>
>> + ? ? /* treat the write as new if the a hole/lseek spanned across
>> + ? ? ?* the page boundary.
>> + ? ? ?*/
>> + ? ? new = new | ((i_size_read(inode) <= page_offset(page)) &&
>> + ? ? ? ? ? ? ? ? ? ? (page_offset(page) <= user_pos));
>
> ? ? ? ?There are two problems here. ?First, It's not safe to claim
> existing data is 'new'. ?Imagine you have a 4K page and a 512B
> blocksize. ?The first 2 blocks of the page have data in them, but your
> code change will cause them to be set_uptodate() even if we haven't read
> them in yet.

I did not understand this. Could you explain this in terms of
blocksize=512, page_offset, i_size and pos?
However I'll try to interpret: Won't the i_size be 1024 (assuming it
is the start of file) and the new flag set to false?


> ? ? ? ?Secondly, ocfs2_should_read_blk() already checks for blocks past
> i_size and skips reading them. ?So if you are trying to avoid reading
> them, it is already handled.

If the previous call was lseek, i_size is not updated.
ocfs2_should_read_blk returns true and junk read.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-20 18:45   ` Goldwyn Rodrigues
@ 2011-02-21  2:08     ` Joel Becker
  2011-02-21  5:44       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-21  2:08 UTC (permalink / raw)
  To: ocfs2-devel

On Sun, Feb 20, 2011 at 12:45:05PM -0600, Goldwyn Rodrigues wrote:
> On Sun, Feb 20, 2011 at 1:09 AM, Joel Becker <jlbec@evilplan.org> wrote:
> > ? ? ? ?Is this a new approach to your earlier patch, or an additional
> > change?
> 
> This is a new approach. You can scrap the previous attempt. I started
> reading about write prepare's once you advised to look at the write
> front.

	Ok, cool.  It definitely is along the track I was suggesting.

> >> ---
> >> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> >> index 1fbb0e2..4be220d 100644
> >> --- a/fs/ocfs2/aops.c
> >> +++ b/fs/ocfs2/aops.c
> >> @@ -1026,6 +1026,12 @@ static int ocfs2_prepare_page_for_write(struct
> >> inode *inode, u64 *p_blkno,
> >> ? ? ? ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &cluster_start, &cluster_end);
> >>
> >> + ? ? /* treat the write as new if the a hole/lseek spanned across
> >> + ? ? ?* the page boundary.
> >> + ? ? ?*/
> >> + ? ? new = new | ((i_size_read(inode) <= page_offset(page)) &&
> >> + ? ? ? ? ? ? ? ? ? ? (page_offset(page) <= user_pos));
> >
> > ? ? ? ?There are two problems here. ?First, It's not safe to claim
> > existing data is 'new'. ?Imagine you have a 4K page and a 512B
> > blocksize. ?The first 2 blocks of the page have data in them, but your
> > code change will cause them to be set_uptodate() even if we haven't read
> > them in yet.
> 
> I did not understand this. Could you explain this in terms of
> blocksize=512, page_offset, i_size and pos?
> However I'll try to interpret: Won't the i_size be 1024 (assuming it
> is the start of file) and the new flag set to false?

	What if i_size is 1000?  The second block will have data,
including zeros to the end of the block.  I have to think about this
some more to be sure where the offsets lie -- this is hard code to think
about -- but essentially there is a difference between blocks we have
allocated but not used yet, and blocks that are part of a brand new
allocation.  We need to make sure we're not confusing the two.

> > ? ? ? ?Secondly, ocfs2_should_read_blk() already checks for blocks past
> > i_size and skips reading them. ?So if you are trying to avoid reading
> > them, it is already handled.
> 
> If the previous call was lseek, i_size is not updated.
> ocfs2_should_read_blk returns true and junk read.

	Why would it return true if the f_pos is > i_size?  It should
return 0.

Joel

-- 

Life's Little Instruction Book #314

	"Never underestimate the power of forgiveness."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-21  2:08     ` Joel Becker
@ 2011-02-21  5:44       ` Goldwyn Rodrigues
  2011-02-22  8:36         ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-21  5:44 UTC (permalink / raw)
  To: ocfs2-devel

On Sun, Feb 20, 2011 at 8:08 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Sun, Feb 20, 2011 at 12:45:05PM -0600, Goldwyn Rodrigues wrote:
>> On Sun, Feb 20, 2011 at 1:09 AM, Joel Becker <jlbec@evilplan.org> wrote:
>> > ? ? ? ?Is this a new approach to your earlier patch, or an additional
>> > change?
>>
>> This is a new approach. You can scrap the previous attempt. I started
>> reading about write prepare's once you advised to look at the write
>> front.
>
> ? ? ? ?Ok, cool. ?It definitely is along the track I was suggesting.
>
>> >> ---
>> >> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> >> index 1fbb0e2..4be220d 100644
>> >> --- a/fs/ocfs2/aops.c
>> >> +++ b/fs/ocfs2/aops.c
>> >> @@ -1026,6 +1026,12 @@ static int ocfs2_prepare_page_for_write(struct
>> >> inode *inode, u64 *p_blkno,
>> >> ? ? ? ocfs2_figure_cluster_boundaries(OCFS2_SB(inode->i_sb), cpos,
>> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &cluster_start, &cluster_end);
>> >>
>> >> + ? ? /* treat the write as new if the a hole/lseek spanned across
>> >> + ? ? ?* the page boundary.
>> >> + ? ? ?*/
>> >> + ? ? new = new | ((i_size_read(inode) <= page_offset(page)) &&
>> >> + ? ? ? ? ? ? ? ? ? ? (page_offset(page) <= user_pos));
>> >
>> > ? ? ? ?There are two problems here. ?First, It's not safe to claim
>> > existing data is 'new'. ?Imagine you have a 4K page and a 512B
>> > blocksize. ?The first 2 blocks of the page have data in them, but your
>> > code change will cause them to be set_uptodate() even if we haven't read
>> > them in yet.
>>
>> I did not understand this. Could you explain this in terms of
>> blocksize=512, page_offset, i_size and pos?
>> However I'll try to interpret: Won't the i_size be 1024 (assuming it
>> is the start of file) and the new flag set to false?
>
> ? ? ? ?What if i_size is 1000? ?The second block will have data,

Still false, because page_offset is zero.

> including zeros to the end of the block. ?I have to think about this
> some more to be sure where the offsets lie -- this is hard code to think
> about -- but essentially there is a difference between blocks we have
> allocated but not used yet, and blocks that are part of a brand new
> allocation. ?We need to make sure we're not confusing the two.
>

Sure, take your time. It would be nice to have an explanation.

>> > ? ? ? ?Secondly, ocfs2_should_read_blk() already checks for blocks past
>> > i_size and skips reading them. ?So if you are trying to avoid reading
>> > them, it is already handled.
>>
>> If the previous call was lseek, i_size is not updated.
>> ocfs2_should_read_blk returns true and junk read.
>
> ? ? ? ?Why would it return true if the f_pos is > i_size? ?It should
> return 0.
>

Sorry, I was wrong in my previous explanation on why
ocfs2_should_read_blk returns 1. ocfs2_should_read_blk returns 1
because of ocfs2_sparse_alloc() and hence forces the garbage read.

On a different note, what I have not been able to explain as yet is,
when the file is created on a nosparse filesystem, the holes have junk
(not 0xbaadfeed, ie what was written previously but junk). Could you
think of a reason why? In any case, this patch resolves this issue.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-21  5:44       ` Goldwyn Rodrigues
@ 2011-02-22  8:36         ` Joel Becker
  2011-02-22  9:37           ` Joel Becker
  2011-02-22 19:02           ` Goldwyn Rodrigues
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-22  8:36 UTC (permalink / raw)
  To: ocfs2-devel

On Sun, Feb 20, 2011 at 11:44:05PM -0600, Goldwyn Rodrigues wrote:
> >> > ? ? ? ?Secondly, ocfs2_should_read_blk() already checks for blocks past
> >> > i_size and skips reading them. ?So if you are trying to avoid reading
> >> > them, it is already handled.
> >>
 
> Sorry, I was wrong in my previous explanation on why
> ocfs2_should_read_blk returns 1. ocfs2_should_read_blk returns 1
> because of ocfs2_sparse_alloc() and hence forces the garbage read.

	You're right, that's why ocfs2_should_read_block() returns 1.
But why would it ever want to read blocks that are outside i_size?
We've guaranteed that they won't be initialized...
	I wonder if that's the bug?  We originally used to zero to EOC.
For sparse files, that could never be any clusters past i_size.  With
the fix I put in for writepage, we no longer zero the blocks past
i_size.  I think the sparse check there is no longer accurate.
	Please try the attached patch and see if it fixes the problem.
It should work on its own, without your changes.  If I have it wrong,
we'll continue to evaluate the problem.  I'd test it myself, but my VM
setup is currently broken.

> On a different note, what I have not been able to explain as yet is,
> when the file is created on a nosparse filesystem, the holes have junk
> (not 0xbaadfeed, ie what was written previously but junk). Could you
> think of a reason why? In any case, this patch resolves this issue.

	Do they have junk without your patch, or just with your patch?

Joel

-- 

"Always give your best, never get discouraged, never be petty; always
 remember, others may hate you.  Those who hate you don't win unless
 you hate them.  And then you destroy yourself."
	- Richard M. Nixon

			http://www.jlbec.org/
			jlbec at evilplan.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-ocfs2-Don-t-read-blocks-past-i_size-they-re-not-init.patch
Type: text/x-diff
Size: 1609 bytes
Desc: not available
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20110222/be785faf/attachment.bin 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22  8:36         ` Joel Becker
@ 2011-02-22  9:37           ` Joel Becker
  2011-02-22 19:02           ` Goldwyn Rodrigues
  1 sibling, 0 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-22  9:37 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 12:36:49AM -0800, Joel Becker wrote:
> 	Please try the attached patch and see if it fixes the problem.
> It should work on its own, without your changes.  If I have it wrong,
> we'll continue to evaluate the problem.  I'd test it myself, but my VM
> setup is currently broken.

	Also, can someone run tailtest out of ocfs2-test against it.  I
want to make sure I didn't break anything else.  Grr on broken test
setup.  I'm also sad that tailtest didn't catch Goldwyn's case.  I
wonder if we can make it.

Joel

-- 

"Get right to the heart of matters.
 It's the heart that matters more."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22  8:36         ` Joel Becker
  2011-02-22  9:37           ` Joel Becker
@ 2011-02-22 19:02           ` Goldwyn Rodrigues
  2011-02-22 21:39             ` Joel Becker
  1 sibling, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-22 19:02 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Tue, Feb 22, 2011 at 2:36 AM, Joel Becker <jlbec@evilplan.org> wrote:
> On Sun, Feb 20, 2011 at 11:44:05PM -0600, Goldwyn Rodrigues wrote:
>> >> > ? ? ? ?Secondly, ocfs2_should_read_blk() already checks for blocks past
>> >> > i_size and skips reading them. ?So if you are trying to avoid reading
>> >> > them, it is already handled.
>> >>
>
>> Sorry, I was wrong in my previous explanation on why
>> ocfs2_should_read_blk returns 1. ocfs2_should_read_blk returns 1
>> because of ocfs2_sparse_alloc() and hence forces the garbage read.
>
> ? ? ? ?You're right, that's why ocfs2_should_read_block() returns 1.
> But why would it ever want to read blocks that are outside i_size?
> We've guaranteed that they won't be initialized...
> ? ? ? ?I wonder if that's the bug? ?We originally used to zero to EOC.
> For sparse files, that could never be any clusters past i_size. ?With
> the fix I put in for writepage, we no longer zero the blocks past
> i_size. ?I think the sparse check there is no longer accurate.
> ? ? ? ?Please try the attached patch and see if it fixes the problem.
> It should work on its own, without your changes. ?If I have it wrong,
> we'll continue to evaluate the problem. ?I'd test it myself, but my VM
> setup is currently broken.
>

It does not work. However, it shows the behavior similar to "nosparse"
without patch. So, I would say what you are targeting is achieved but
another problem resurfaces. This is because nothing zeros pages beyond
i_size in ocfs2_map_page_blocks(), since we return early because of -

        if (ret == 0 || !new)
                return ret;


>> On a different note, what I have not been able to explain as yet is,
>> when the file is created on a nosparse filesystem, the holes have junk
>> (not 0xbaadfeed, ie what was written previously but junk). Could you
>> think of a reason why? In any case, this patch resolves this issue.
>
> ? ? ? ?Do they have junk without your patch, or just with your patch?
>

Without my patch, there is junk in the holes. With my patch holes are zeros.
Understood why it happens, explained before..

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22 19:02           ` Goldwyn Rodrigues
@ 2011-02-22 21:39             ` Joel Becker
  2011-02-22 21:54               ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-22 21:39 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 01:02:13PM -0600, Goldwyn Rodrigues wrote:
> On Tue, Feb 22, 2011 at 2:36 AM, Joel Becker <jlbec@evilplan.org> wrote:
> > ? ? ? ?Please try the attached patch and see if it fixes the problem.
> > It should work on its own, without your changes. ?If I have it wrong,
> > we'll continue to evaluate the problem. ?I'd test it myself, but my VM
> > setup is currently broken.
> >
> 
> It does not work. However, it shows the behavior similar to "nosparse"
> without patch. So, I would say what you are targeting is achieved but
> another problem resurfaces. This is because nothing zeros pages beyond
> i_size in ocfs2_map_page_blocks(), since we return early because of -
> 
>         if (ret == 0 || !new)
>                 return ret;

	Returning here is correct, because ret should == 0.  The new
part is about zeroing in the case of error.  We should obviously handle
it better before we get to this line.
	I like that we come in line nosparse and sparse.  What does
tailtest do for you?  Does it fail or succeed?

Joel

-- 

"Here's a nickle -- get yourself a better X server."
	- Keith Packard

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22 21:39             ` Joel Becker
@ 2011-02-22 21:54               ` Joel Becker
  2011-02-22 22:09                 ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-22 21:54 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 01:39:28PM -0800, Joel Becker wrote:
> On Tue, Feb 22, 2011 at 01:02:13PM -0600, Goldwyn Rodrigues wrote:
> > On Tue, Feb 22, 2011 at 2:36 AM, Joel Becker <jlbec@evilplan.org> wrote:
> > > ? ? ? ?Please try the attached patch and see if it fixes the problem.
> > > It should work on its own, without your changes. ?If I have it wrong,
> > > we'll continue to evaluate the problem. ?I'd test it myself, but my VM
> > > setup is currently broken.
> > >
> > 
> > It does not work. However, it shows the behavior similar to "nosparse"
> > without patch. So, I would say what you are targeting is achieved but
> > another problem resurfaces. This is because nothing zeros pages beyond
> > i_size in ocfs2_map_page_blocks(), since we return early because of -
> > 
> >         if (ret == 0 || !new)
> >                 return ret;
> 
> 	Returning here is correct, because ret should == 0.  The new
> part is about zeroing in the case of error.  We should obviously handle
> it better before we get to this line.
> 	I like that we come in line nosparse and sparse.  What does
> tailtest do for you?  Does it fail or succeed?

	Looking some more.  ocfs2_zero_tail() is supposed to handle
this, isn't it?  ocfs2_write_begin_nolock() calls ocfs2_zero_tail(pos),
which calls ocfs2_zero_extend(pos), which works up to
ocfs2_clusters_for_bytes(pos).  Doesn't this include your cluster?
Shouldn't the page already be loaded into the pagecache with zeros?

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22 21:54               ` Joel Becker
@ 2011-02-22 22:09                 ` Goldwyn Rodrigues
  2011-02-22 23:34                   ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-22 22:09 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 3:54 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Tue, Feb 22, 2011 at 01:39:28PM -0800, Joel Becker wrote:
>> On Tue, Feb 22, 2011 at 01:02:13PM -0600, Goldwyn Rodrigues wrote:
>> > On Tue, Feb 22, 2011 at 2:36 AM, Joel Becker <jlbec@evilplan.org> wrote:
>> > > ? ? ? ?Please try the attached patch and see if it fixes the problem.
>> > > It should work on its own, without your changes. ?If I have it wrong,
>> > > we'll continue to evaluate the problem. ?I'd test it myself, but my VM
>> > > setup is currently broken.
>> > >
>> >
>> > It does not work. However, it shows the behavior similar to "nosparse"
>> > without patch. So, I would say what you are targeting is achieved but
>> > another problem resurfaces. This is because nothing zeros pages beyond
>> > i_size in ocfs2_map_page_blocks(), since we return early because of -
>> >
>> > ? ? ? ? if (ret == 0 || !new)
>> > ? ? ? ? ? ? ? ? return ret;
>>
>> ? ? ? Returning here is correct, because ret should == 0. ?The new
>> part is about zeroing in the case of error. ?We should obviously handle
>> it better before we get to this line.
>> ? ? ? I like that we come in line nosparse and sparse. ?What does
>> tailtest do for you? ?Does it fail or succeed?
>
> ? ? ? ?Looking some more. ?ocfs2_zero_tail() is supposed to handle
> this, isn't it? ?ocfs2_write_begin_nolock() calls ocfs2_zero_tail(pos),
> which calls ocfs2_zero_extend(pos), which works up to
> ocfs2_clusters_for_bytes(pos). ?Doesn't this include your cluster?
> Shouldn't the page already be loaded into the pagecache with zeros?
>
>

Din't you discard that idea in my previous attempt? ;)

On a serious note, no it doesn't because zero_start and zero_to_size
are the same when pos is on a block boundary and hence the loop does
not execute.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22 22:09                 ` Goldwyn Rodrigues
@ 2011-02-22 23:34                   ` Joel Becker
  2011-02-23  0:05                     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-22 23:34 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 04:09:47PM -0600, Goldwyn Rodrigues wrote:
> On Tue, Feb 22, 2011 at 3:54 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > On Tue, Feb 22, 2011 at 01:39:28PM -0800, Joel Becker wrote:
> > ? ? ? ?Looking some more. ?ocfs2_zero_tail() is supposed to handle
> > this, isn't it? ?ocfs2_write_begin_nolock() calls ocfs2_zero_tail(pos),
> > which calls ocfs2_zero_extend(pos), which works up to
> > ocfs2_clusters_for_bytes(pos). ?Doesn't this include your cluster?
> > Shouldn't the page already be loaded into the pagecache with zeros?
> >
> >
> 
> Din't you discard that idea in my previous attempt? ;)

	I might have ;-)  I'm reloading state here, so I might be coming
back to something you've already seen.  I'm trying to ask questions that
make both of us come to the same understanding.  So my apologies if I
didn't get what you were saying before.
	For example, I thought you were saying that a write from 0 to
1023 (new i_size of 1024), with a blocksize of 4K and a clustersize of
8K, should zero from 1024 to 8191.  That is, zeroing the entire cluster
at allocation time, even though the second block does not contain
i_size.  My contention is that 1024 to 4095 should be zeroed for this
write.  We would only zero the second block when the file is later
extended past 4096.
	I may have been confused because ocfs2_zero_extend() only used
to matter for nonsparse, where this behavior was different.  Which of
the above did you mean?

> On a serious note, no it doesn't because zero_start and zero_to_size
> are the same when pos is on a block boundary and hence the loop does
> not execute.

	If they are the same (i_size == pos), there is no hole.  I don't
see how there is a problem.

Joel

-- 

Life's Little Instruction Book #464

	"Don't miss the magic of the moment by focusing on what's
	 to come."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-22 23:34                   ` Joel Becker
@ 2011-02-23  0:05                     ` Goldwyn Rodrigues
  2011-02-23  9:39                       ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23  0:05 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 5:34 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Tue, Feb 22, 2011 at 04:09:47PM -0600, Goldwyn Rodrigues wrote:
>> On Tue, Feb 22, 2011 at 3:54 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > On Tue, Feb 22, 2011 at 01:39:28PM -0800, Joel Becker wrote:
>> > ? ? ? ?Looking some more. ?ocfs2_zero_tail() is supposed to handle
>> > this, isn't it? ?ocfs2_write_begin_nolock() calls ocfs2_zero_tail(pos),
>> > which calls ocfs2_zero_extend(pos), which works up to
>> > ocfs2_clusters_for_bytes(pos). ?Doesn't this include your cluster?
>> > Shouldn't the page already be loaded into the pagecache with zeros?
>> >
>> >
>>
>> Din't you discard that idea in my previous attempt? ;)
>
> ? ? ? ?I might have ;-) ?I'm reloading state here, so I might be coming
> back to something you've already seen. ?I'm trying to ask questions that
> make both of us come to the same understanding. ?So my apologies if I
> didn't get what you were saying before.
> ? ? ? ?For example, I thought you were saying that a write from 0 to
> 1023 (new i_size of 1024), with a blocksize of 4K and a clustersize of
> 8K, should zero from 1024 to 8191. ?That is, zeroing the entire cluster
> at allocation time, even though the second block does not contain
> i_size. ?My contention is that 1024 to 4095 should be zeroed for this
> write. ?We would only zero the second block when the file is later
> extended past 4096.
> ? ? ? ?I may have been confused because ocfs2_zero_extend() only used
> to matter for nonsparse, where this behavior was different. ?Which of
> the above did you mean?
>

Hmm, explanation follows.. let me know if you still have questions.

>> On a serious note, no it doesn't because zero_start and zero_to_size
>> are the same when pos is on a block boundary and hence the loop does
>> not execute.
>
> ? ? ? ?If they are the same (i_size == pos), there is no hole. ?I don't
> see how there is a problem.
>
>

Not now, but it may in the future and that is the problem.

So consider your case of usual parameters, but i_size=4096.
Now you append 32 bytes of data to it, so pos = i_size = 4096
and len=32
write(fd, 32 bytes of data);

In this case, zero_start and zero_to_size in zero_extend_file are both
4096 and it will not zero this page in ocfs2_zero_extend.

After the write, i_size=4128.

Now lseek 512 bytes -
lseek(fd, 512, SEEKCUR);


..and write 4096 bytes again.
write(fd, 4096bytes of data);

pos=4640 len=4096 bytes, and after the transaction i_size=8736.

i_size is beyond 8192 now

Hole between 4128-4640 was never zeroed and carries junk. Does that
make sense or am I doing the math wrong here?

Linux does not zero the rest of the page until the page with i_size is
written. Since i_size has been beyond 8192, Linux never got a chance
to zero it.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23  0:05                     ` Goldwyn Rodrigues
@ 2011-02-23  9:39                       ` Joel Becker
  2011-02-23 14:43                         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-23  9:39 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Feb 22, 2011 at 06:05:20PM -0600, Goldwyn Rodrigues wrote:
> On Tue, Feb 22, 2011 at 5:34 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > ? ? ? ?I may have been confused because ocfs2_zero_extend() only used
> > to matter for nonsparse, where this behavior was different. ?Which of
> > the above did you mean?
> >

Key:
 EOF -> End of File, or i_size.
 EOB -> End of Block, the end of the block containing i_size.  So if
        i_size is 8000 and blocksize is 4096, EOB is 8192.
 EOC -> End of Cluster, the end of the cluster containing i_size.  So if
        i_size is 8000, blocksize is 4096, and clustersize is 16K, EOC
        is 16384, two blocks past i_size.

Hey Goldwyn,
	I want to apologize for not quite getting the point of your
original patch.  The legacy ocfs2_zero_extend() name put me in the mind
of our old nonsparse allocation code, and I saw EOF->EOC and assumed it
was at allocation time, not i_size change time.  I knew I was still
reloading the state in my brain and could be missing something, but that
still must have been frustrating to you.

> Not now, but it may in the future and that is the problem.
> 
> So consider your case of usual parameters, but i_size=4096.
> Now you append 32 bytes of data to it, so pos = i_size = 4096
> and len=32
> write(fd, 32 bytes of data);
> 
> In this case, zero_start and zero_to_size in zero_extend_file are both
> 4096 and it will not zero this page in ocfs2_zero_extend.

	Correct.

> After the write, i_size=4128.

	And the block is zeroed out to 8191 (or at least it should be)
by.

> Now lseek 512 bytes -
> lseek(fd, 512, SEEKCUR);
> 
> 
> ..and write 4096 bytes again.
> write(fd, 4096bytes of data);
> 
> pos=4640 len=4096 bytes, and after the transaction i_size=8736.
> 
> i_size is beyond 8192 now
> 
> Hole between 4128-4640 was never zeroed and carries junk. Does that
> make sense or am I doing the math wrong here?

	That may be the problem, but that is not what should have
happened.  write_begin() should have zeroed 4128-8192 while processing
the 32-byte write.  Linux always assumes that you zero a block when you
bring it into service.  Now, this is not the responsibility of
ocfs2_zero_tail(), because i_size == pos.  ocfs2_zero_tail() is supposed
to work when i_size < pos.  It's when you prepare the pages that this
matters.
	Think about your first patch.  It zeros from i_size to pos.  But
we're supposed to be guaranteed that zeros exist from i_size to EOB.
Thus I read it as zeroing during allocation, not when i_size moves to
cover already allocated clusters.  The "correct" approach (as Linux sees
it) would zero from EOB to pos (or, actually, EOB to the start of the
block containing pos).  We're already supposed to be doing that; that's
part of what I recently fixed.  Later in in our discussion I thought you
might have found a bug in my fix.  That's why I looked at the
ocfs2_should_read_block() code.  Now I see that is not the case (though
my patch is probably correct for ocfs2_should_read_block(), it is
unrelated to the problem you are seeing).
	Given your example above, at write(fd, 32 bytes of data),
ocfs2_prepare_pages_for_write() should be zeroing 4128-8192 before the
write code is allowed to copy the 32 bytes into 4096-4128.  Everyone
expects it to work this way (xfs, extN, etc).  That's exactly what
write_begin() is for.  It gives back a page array where all the pages
are properly allocated and zeroed.  Ours is complex because we have to
handle cluster allocation, where clusters, blocks, and pages can be all
be various multiples of each other.
	Your last patch, where you force 'new', isn't quite right.  The
cluster boundaries passed to map_pages aren't right, and the later
zeroing could be wrong (zeroing earlier parts of the file that contained
data).  At least, that's how I read it.  But I think you're in the right
place to start.

Joel

-- 

Life's Little Instruction Book #198

	"Feed a stranger's expired parking meter."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23  9:39                       ` Joel Becker
@ 2011-02-23 14:43                         ` Goldwyn Rodrigues
  2011-02-23 19:13                           ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 14:43 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 3:39 AM, Joel Becker <jlbec@evilplan.org> wrote:
> On Tue, Feb 22, 2011 at 06:05:20PM -0600, Goldwyn Rodrigues wrote:
>> On Tue, Feb 22, 2011 at 5:34 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > ? ? ? ?I may have been confused because ocfs2_zero_extend() only used
>> > to matter for nonsparse, where this behavior was different. ?Which of
>> > the above did you mean?
>> >
>
> Key:
> ?EOF -> End of File, or i_size.
> ?EOB -> End of Block, the end of the block containing i_size. ?So if
> ? ? ? ?i_size is 8000 and blocksize is 4096, EOB is 8192.
> ?EOC -> End of Cluster, the end of the cluster containing i_size. ?So if
> ? ? ? ?i_size is 8000, blocksize is 4096, and clustersize is 16K, EOC
> ? ? ? ?is 16384, two blocks past i_size.
>
> Hey Goldwyn,
> ? ? ? ?I want to apologize for not quite getting the point of your
> original patch. ?The legacy ocfs2_zero_extend() name put me in the mind
> of our old nonsparse allocation code, and I saw EOF->EOC and assumed it
> was at allocation time, not i_size change time. ?I knew I was still
> reloading the state in my brain and could be missing something, but that
> still must have been frustrating to you.
>

I understand. However, it opened more avenues to look for trouble-making code.

>> Not now, but it may in the future and that is the problem.
>>
>> So consider your case of usual parameters, but i_size=4096.
>> Now you append 32 bytes of data to it, so pos = i_size = 4096
>> and len=32
>> write(fd, 32 bytes of data);
>>
>> In this case, zero_start and zero_to_size in zero_extend_file are both
>> 4096 and it will not zero this page in ocfs2_zero_extend.
>
> ? ? ? ?Correct.
>
>> After the write, i_size=4128.
>
> ? ? ? ?And the block is zeroed out to 8191 (or at least it should be)
> by.
>

by...?


>> Now lseek 512 bytes -
>> lseek(fd, 512, SEEKCUR);
>>
>>
>> ..and write 4096 bytes again.
>> write(fd, 4096bytes of data);
>>
>> pos=4640 len=4096 bytes, and after the transaction i_size=8736.
>>
>> i_size is beyond 8192 now
>>
>> Hole between 4128-4640 was never zeroed and carries junk. Does that
>> make sense or am I doing the math wrong here?
>
> ? ? ? ?That may be the problem, but that is not what should have
> happened. ?write_begin() should have zeroed 4128-8192 while processing
> the 32-byte write. ?Linux always assumes that you zero a block when you
> bring it into service. ?Now, this is not the responsibility of
> ocfs2_zero_tail(), because i_size == pos. ?ocfs2_zero_tail() is supposed
> to work when i_size < pos. ?It's when you prepare the pages that this
> matters.

So, who should zero the page? I assume map_pages

> ? ? ? ?Think about your first patch. ?It zeros from i_size to pos. ?But

Actually, it zeroed from i_size to EOC. But again I think that might
be incorrect because of the commits you mentioned.

> we're supposed to be guaranteed that zeros exist from i_size to EOB.
> Thus I read it as zeroing during allocation, not when i_size moves to
> cover already allocated clusters. ?The "correct" approach (as Linux sees
> it) would zero from EOB to pos (or, actually, EOB to the start of the
> block containing pos). ?We're already supposed to be doing that; that's
> part of what I recently fixed. ?Later in in our discussion I thought you
> might have found a bug in my fix. ?That's why I looked at the
> ocfs2_should_read_block() code. ?Now I see that is not the case (though
> my patch is probably correct for ocfs2_should_read_block(), it is
> unrelated to the problem you are seeing).

Yes, I agree it is correct, and does partially resolve the issue.

> ? ? ? ?Given your example above, at write(fd, 32 bytes of data),
> ocfs2_prepare_pages_for_write() should be zeroing 4128-8192 before the
> write code is allowed to copy the 32 bytes into 4096-4128. ?Everyone
> expects it to work this way (xfs, extN, etc). ?That's exactly what
> write_begin() is for. ?It gives back a page array where all the pages
> are properly allocated and zeroed. ?Ours is complex because we have to
> handle cluster allocation, where clusters, blocks, and pages can be all
> be various multiples of each other.

I agree, but bouncing off "len" over functions might be an overkill,
so I thought zero the page (4096-8192) before writing to it. (refer:
"set new" patch)

> ? ? ? ?Your last patch, where you force 'new', isn't quite right. ?The
> cluster boundaries passed to map_pages aren't right, and the later
> zeroing could be wrong (zeroing earlier parts of the file that contained
> data). ?At least, that's how I read it. ?But I think you're in the right
> place to start.

I still don't understand why? I am not considering clusters at all.
The flag is set to new if -
i_size <= page_offset <= pos

There can be no data beyond i_size, so I don't think it will zero any
previously written data. Can you explain in terms of values of i_size,
cluster size, block size, page size=4k, pos, len etc? I think I have
asked this before as well.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 14:43                         ` Goldwyn Rodrigues
@ 2011-02-23 19:13                           ` Joel Becker
  2011-02-23 19:28                             ` Joel Becker
                                               ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-23 19:13 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 08:43:44AM -0600, Goldwyn Rodrigues wrote:
> On Wed, Feb 23, 2011 at 3:39 AM, Joel Becker <jlbec@evilplan.org> wrote:
> >> After the write, i_size=4128.
> >
> > ? ? ? ?And the block is zeroed out to 8191 (or at least it should be)
> > by.
> >
> 
> by...?

	Haha, where'd the rest of that sentence go?  By the
page prepare code (prepare_write() in older linux, write_begin() in
newer linux).

> So, who should zero the page? I assume map_pages

	ocfs2_map_page_blocks() explicitly describes itself as not
handling the zeroing.  Let's trust that for now.

> > ? ? ? ?Think about your first patch. ?It zeros from i_size to pos. ?But
> 
> Actually, it zeroed from i_size to EOC. But again I think that might
> be incorrect because of the commits you mentioned.

	Sure.  But it shouldn't be necessary.

> > ? ? ? ?Your last patch, where you force 'new', isn't quite right. ?The
> > cluster boundaries passed to map_pages aren't right, and the later
> > zeroing could be wrong (zeroing earlier parts of the file that contained
> > data). ?At least, that's how I read it. ?But I think you're in the right
> > place to start.
> 
> I still don't understand why? I am not considering clusters at all.
> The flag is set to new if -
> i_size <= page_offset <= pos

	Thinking about it, I think you're right about this part of its
safety.  My big concern is that 'new' means "I have a whole empty
cluster" and that might not be true.
	It's starting to look like your code is even closer to correct,
but might not be located exactly where it needs to be.  Let's look at
what should_zero means.
	In ocfs2_write_cluster(), it means the calling functions have an
entire new or unwritten cluster.  That's why we take v_blkno as cpos
when should_zero is true.  We know that the caller has set us up with an
entire cluster and we have to zero the front or all of it.  That's great
when we call down to ocfs2_prepare_page_for_write(), as we are able to
trust we can zero anything we see with impunity.  That's why we zero
whole ranges (cluster_start..cluster_end).  Remember here that a write
context can have multiple pages (to cover the front of a large cluster
that's just been allocated) or multiple clusters per page (16K/64K page
size).  should_zero (from c_needs_zero) tells us that the entire cluster
we're looking at is newly active.
	Down in ocfs2_map_page_blocks(), we must not venture outside
from/to, as other code (not the write path) assumes it will not pull in
blocks that it doesn't need.  So that's not a place to change.
	What about the loop at the bottom of ocfs2_write_cluster()?  The
one that calls ocfs2_prepare_page_for_write()?  It can walk multiple
pages, some of which may be the front of a newly allocated cluster
(Imagine this write is allocating a 1MB cluster at the end of the file,
but pos is 512K inside the new cluster.  This code handles zeroing all
the pages between the start of the cluster and pos).  I suppose we could
check here, but ocfs2_prepare_page_for_write() has logic based on target
vs non-target page, and we don't want to override it.
	So we do want to make a change inside
ocfs2_prepare_page_for_write().  Let's look at your change.  You trigger
'new' even if we're not the target page.  This means the BUG_ON() in the
else{} can never fail for the extend I described in the last paragraph.
The calling code should have made sure we are correctly passing
should_zero to all non-target pages.  Having new overridden for them is
wrong.  So maybe we move your check inside the target_page if.  It only
really matters for the target page.
	But wait!  Check out block_write_full_page_endio() in
fs/buffer.c!  Especially this part:

	zero_user_segment(page, offset, PAGE_CACHE_SIZE);

It is zeroing the tail of a page when it straddles i_size to make sure
it goes out correctly.  But, of course, blocks past i_size never go to
disk.  And we don't care; readpage should handle them correctly, right?
So if writepage is going to zero for us, and readpage is going to read
correctly, how are you getting bad data?
	I was about to think we were done, but now I'm not so sure.  I
have a question: with my change to ocfs2_should_read_block(), but
without your patch, what do you see for corruption?  Your pattern
written in the first pass, or complete random garbage?

Joel

-- 

"The opposite of a correct statement is a false statement. The
 opposite of a profound truth may well be another profound truth."
         - Niels Bohr 

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 19:13                           ` Joel Becker
@ 2011-02-23 19:28                             ` Joel Becker
  2011-02-23 21:00                               ` Goldwyn Rodrigues
  2011-02-23 20:54                             ` Goldwyn Rodrigues
  2011-02-23 21:09                             ` Goldwyn Rodrigues
  2 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-23 19:28 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 11:13:39AM -0800, Joel Becker wrote:
> The calling code should have made sure we are correctly passing
> should_zero to all non-target pages.  Having new overridden for them is
> wrong.  So maybe we move your check inside the target_page if.  It only
> really matters for the target page.

	Actually, I have a question: what is the clustersize of your
test?  Because if the filesystem is 4k/4k, the write at 4096 of 32 bytes
should allocate a new cluster.  'new' should already be set, and your
check irrelevant.

> 	But wait!  Check out block_write_full_page_endio() in
> fs/buffer.c!  Especially this part:
> 
> 	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> 
> It is zeroing the tail of a page when it straddles i_size to make sure
> it goes out correctly.  But, of course, blocks past i_size never go to
> disk.  And we don't care; readpage should handle them correctly, right?
> So if writepage is going to zero for us, and readpage is going to read
> correctly, how are you getting bad data?
> 	I was about to think we were done, but now I'm not so sure.  I
> have a question: with my change to ocfs2_should_read_block(), but
> without your patch, what do you see for corruption?  Your pattern
> written in the first pass, or complete random garbage?

	One thing I would like to see are the results with fsync()
between writeSparse() and checkSparselyWrittenFile().  Forcing the file
to disk before checking the values will tell us something.  This is with
my patch but without yours.
	Another good test would be to umount/mount between writeSparse()
and checkSparselyWrittenFile().  This tells us what actually hit disk.

Joel

-- 

"The question of whether computers can think is just like the question
 of whether submarines can swim."
	- Edsger W. Dijkstra

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 19:13                           ` Joel Becker
  2011-02-23 19:28                             ` Joel Becker
@ 2011-02-23 20:54                             ` Goldwyn Rodrigues
  2011-02-23 21:17                               ` Joel Becker
  2011-02-23 21:09                             ` Goldwyn Rodrigues
  2 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 20:54 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 1:13 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > ? ? ? ?Your last patch, where you force 'new', isn't quite right. ?The
>> > cluster boundaries passed to map_pages aren't right, and the later
>> > zeroing could be wrong (zeroing earlier parts of the file that contained
>> > data). ?At least, that's how I read it. ?But I think you're in the right
>> > place to start.
>>
>> I still don't understand why? I am not considering clusters at all.
>> The flag is set to new if -
>> i_size <= page_offset <= pos
>
> ? ? ? ?Thinking about it, I think you're right about this part of its
> safety. ?My big concern is that 'new' means "I have a whole empty
> cluster" and that might not be true.
> ? ? ? ?It's starting to look like your code is even closer to correct,
> but might not be located exactly where it needs to be. ?Let's look at
> what should_zero means.
> ? ? ? ?In ocfs2_write_cluster(), it means the calling functions have an
> entire new or unwritten cluster. ?That's why we take v_blkno as cpos
> when should_zero is true. ?We know that the caller has set us up with an
> entire cluster and we have to zero the front or all of it. ?That's great
> when we call down to ocfs2_prepare_page_for_write(), as we are able to
> trust we can zero anything we see with impunity. ?That's why we zero
> whole ranges (cluster_start..cluster_end). ?Remember here that a write
> context can have multiple pages (to cover the front of a large cluster
> that's just been allocated) or multiple clusters per page (16K/64K page
> size). ?should_zero (from c_needs_zero) tells us that the entire cluster
> we're looking at is newly active.
> ? ? ? ?Down in ocfs2_map_page_blocks(), we must not venture outside
> from/to, as other code (not the write path) assumes it will not pull in
> blocks that it doesn't need. ?So that's not a place to change.
> ? ? ? ?What about the loop at the bottom of ocfs2_write_cluster()? ?The
> one that calls ocfs2_prepare_page_for_write()? ?It can walk multiple
> pages, some of which may be the front of a newly allocated cluster
> (Imagine this write is allocating a 1MB cluster at the end of the file,
> but pos is 512K inside the new cluster. ?This code handles zeroing all
> the pages between the start of the cluster and pos). ?I suppose we could
> check here, but ocfs2_prepare_page_for_write() has logic based on target
> vs non-target page, and we don't want to override it.
> ? ? ? ?So we do want to make a change inside
> ocfs2_prepare_page_for_write(). ?Let's look at your change. ?You trigger
> 'new' even if we're not the target page. ?This means the BUG_ON() in the
> else{} can never fail for the extend I described in the last paragraph.
> The calling code should have made sure we are correctly passing
> should_zero to all non-target pages. ?Having new overridden for them is
> wrong. ?So maybe we move your check inside the target_page if. ?It only
> really matters for the target page.
> ? ? ? ?But wait! ?Check out block_write_full_page_endio() in
> fs/buffer.c! ?Especially this part:
>
> ? ? ? ?zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>
> It is zeroing the tail of a page when it straddles i_size to make sure
> it goes out correctly. ?But, of course, blocks past i_size never go to
> disk. ?And we don't care; readpage should handle them correctly, right?
> So if writepage is going to zero for us, and readpage is going to read
> correctly, how are you getting bad data?
> ? ? ? ?I was about to think we were done, but now I'm not so sure. ?I
> have a question: with my change to ocfs2_should_read_block(), but
> without your patch, what do you see for corruption? ?Your pattern
> written in the first pass, or complete random garbage?
>

Joel, you are getting confused in the list of parameters to reproduce
the bug. I have listed it in the past, but not in a single location so
it is difficult to pick up the bits. I will put it together for you.
Conditions are -

1. blocksize < clustersize
2.  the previous write ended in a page boundary, but not a cluster
boundary. eg, clustersize=16K but pos is at the page boundary = 4k
3. you wrote something@the page boundary
4. there is a hole created in the page from 4k-8k in *between* say 4128-4640
5. i_size has moved past the second page, or past 8192

What I am seeing is pure junk(not 0xbaadfeed), and not what I had
written earlier.
The first 4k of the cluster is always zeroed, it is the consecutive
ones which are not.

In retrospect, since we have covered the rest of the bases, we can set
new on the condition:

i_size == page_offset == pos

I interpreted new as new page or new data vs new cluster, and perhaps
that is the reason I placed the condition there.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 19:28                             ` Joel Becker
@ 2011-02-23 21:00                               ` Goldwyn Rodrigues
  2011-02-23 21:23                                 ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 21:00 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 1:28 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Wed, Feb 23, 2011 at 11:13:39AM -0800, Joel Becker wrote:
>> The calling code should have made sure we are correctly passing
>> should_zero to all non-target pages. ?Having new overridden for them is
>> wrong. ?So maybe we move your check inside the target_page if. ?It only
>> really matters for the target page.
>
> ? ? ? ?Actually, I have a question: what is the clustersize of your
> test? ?Because if the filesystem is 4k/4k, the write at 4096 of 32 bytes
> should allocate a new cluster. ?'new' should already be set, and your
> check irrelevant.

It is 16K and does not happen for clustersize==pagesize/blocksize.
Don't have a machine with a bigger pagesize handy.

>
>> ? ? ? But wait! ?Check out block_write_full_page_endio() in
>> fs/buffer.c! ?Especially this part:
>>
>> ? ? ? zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>>
>> It is zeroing the tail of a page when it straddles i_size to make sure
>> it goes out correctly. ?But, of course, blocks past i_size never go to
>> disk. ?And we don't care; readpage should handle them correctly, right?
>> So if writepage is going to zero for us, and readpage is going to read
>> correctly, how are you getting bad data?
>> ? ? ? I was about to think we were done, but now I'm not so sure. ?I
>> have a question: with my change to ocfs2_should_read_block(), but
>> without your patch, what do you see for corruption? ?Your pattern
>> written in the first pass, or complete random garbage?
>
> ? ? ? ?One thing I would like to see are the results with fsync()
> between writeSparse() and checkSparselyWrittenFile(). ?Forcing the file
> to disk before checking the values will tell us something. ?This is with
> my patch but without yours.
> ? ? ? ?Another good test would be to umount/mount between writeSparse()
> and checkSparselyWrittenFile(). ?This tells us what actually hit disk.
>

No, fsync does not work. I still get junk in the holes. Same is the
case if I umount and re-mount the disk.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 19:13                           ` Joel Becker
  2011-02-23 19:28                             ` Joel Becker
  2011-02-23 20:54                             ` Goldwyn Rodrigues
@ 2011-02-23 21:09                             ` Goldwyn Rodrigues
  2011-02-23 21:21                               ` Joel Becker
  2 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 21:09 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 1:13 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Wed, Feb 23, 2011 at 08:43:44AM -0600, Goldwyn Rodrigues wrote:
>> On Wed, Feb 23, 2011 at 3:39 AM, Joel Becker <jlbec@evilplan.org> wrote:
>> >> After the write, i_size=4128.
>> >
>> > ? ? ? ?And the block is zeroed out to 8191 (or at least it should be)
>> > by.
>> >
>>
>> by...?
>
> ? ? ? ?Haha, where'd the rest of that sentence go? ?By the
> page prepare code (prepare_write() in older linux, write_begin() in
> newer linux).
>
>> So, who should zero the page? I assume map_pages
>
> ? ? ? ?ocfs2_map_page_blocks() explicitly describes itself as not
> handling the zeroing. ?Let's trust that for now.
>
>> > ? ? ? ?Think about your first patch. ?It zeros from i_size to pos. ?But
>>
>> Actually, it zeroed from i_size to EOC. But again I think that might
>> be incorrect because of the commits you mentioned.
>
> ? ? ? ?Sure. ?But it shouldn't be necessary.
>
>> > ? ? ? ?Your last patch, where you force 'new', isn't quite right. ?The
>> > cluster boundaries passed to map_pages aren't right, and the later
>> > zeroing could be wrong (zeroing earlier parts of the file that contained
>> > data). ?At least, that's how I read it. ?But I think you're in the right
>> > place to start.
>>
>> I still don't understand why? I am not considering clusters at all.
>> The flag is set to new if -
>> i_size <= page_offset <= pos
>
> ? ? ? ?Thinking about it, I think you're right about this part of its
> safety. ?My big concern is that 'new' means "I have a whole empty
> cluster" and that might not be true.
> ? ? ? ?It's starting to look like your code is even closer to correct,
> but might not be located exactly where it needs to be. ?Let's look at
> what should_zero means.
> ? ? ? ?In ocfs2_write_cluster(), it means the calling functions have an
> entire new or unwritten cluster. ?That's why we take v_blkno as cpos
> when should_zero is true. ?We know that the caller has set us up with an
> entire cluster and we have to zero the front or all of it. ?That's great
> when we call down to ocfs2_prepare_page_for_write(), as we are able to
> trust we can zero anything we see with impunity. ?That's why we zero
> whole ranges (cluster_start..cluster_end). ?Remember here that a write
> context can have multiple pages (to cover the front of a large cluster
> that's just been allocated) or multiple clusters per page (16K/64K page
> size). ?should_zero (from c_needs_zero) tells us that the entire cluster
> we're looking at is newly active.
> ? ? ? ?Down in ocfs2_map_page_blocks(), we must not venture outside
> from/to, as other code (not the write path) assumes it will not pull in
> blocks that it doesn't need. ?So that's not a place to change.
> ? ? ? ?What about the loop at the bottom of ocfs2_write_cluster()? ?The
> one that calls ocfs2_prepare_page_for_write()? ?It can walk multiple
> pages, some of which may be the front of a newly allocated cluster
> (Imagine this write is allocating a 1MB cluster at the end of the file,
> but pos is 512K inside the new cluster. ?This code handles zeroing all
> the pages between the start of the cluster and pos). ?I suppose we could
> check here, but ocfs2_prepare_page_for_write() has logic based on target
> vs non-target page, and we don't want to override it.
> ? ? ? ?So we do want to make a change inside
> ocfs2_prepare_page_for_write(). ?Let's look at your change. ?You trigger
> 'new' even if we're not the target page. ?This means the BUG_ON() in the
> else{} can never fail for the extend I described in the last paragraph.
> The calling code should have made sure we are correctly passing
> should_zero to all non-target pages. ?Having new overridden for them is
> wrong. ?So maybe we move your check inside the target_page if. ?It only
> really matters for the target page.
> ? ? ? ?But wait! ?Check out block_write_full_page_endio() in
> fs/buffer.c! ?Especially this part:
>
> ? ? ? ?zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>

Are you sure that block_write_full_page_endio() will be called before
the next system call on this file comes in and moves I_size beyond end
of this page?


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 20:54                             ` Goldwyn Rodrigues
@ 2011-02-23 21:17                               ` Joel Becker
  2011-02-23 21:35                                 ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-23 21:17 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 02:54:19PM -0600, Goldwyn Rodrigues wrote:
> On Wed, Feb 23, 2011 at 1:13 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > ? ? ? ?I was about to think we were done, but now I'm not so sure. ?I
> > have a question: with my change to ocfs2_should_read_block(), but
> > without your patch, what do you see for corruption? ?Your pattern
> > written in the first pass, or complete random garbage?
> 
> Joel, you are getting confused in the list of parameters to reproduce
> the bug. I have listed it in the past, but not in a single location so
> it is difficult to pick up the bits. I will put it together for you.

Goldwyn,
	I'm just not remembering off the top of my head, so I'm asking.
I'm actually describing the myriad cases we can encounter in this code,
because any change we make has to be safe for all possible cases.

> 1. blocksize < clustersize

	Ok.

> 2.  the previous write ended in a page boundary, but not a cluster
> boundary. eg, clustersize=16K but pos is at the page boundary = 4k

	Must it be a page boundary?  What about a block boundary? ;-)

> 3. you wrote something at the page boundary

	I think it's more block than page, really.  But that is the same
for 4K blocks and pages.

> 4. there is a hole created in the page from 4k-8k in *between* say 4128-4640

	There is no hole here.  We have allocation.  Even if the user
doesn't directly write to this location, it is not a hole, because we
have a cluster there.  You just mean that the user didn't write to this
location (sparse writing).  That's just a terminology thing, but it's
important for other folks understanding you.

> What I am seeing is pure junk(not 0xbaadfeed), and not what I had
> written earlier.

	Ok.  That's what I expected.  I wanted to clarify we were not
reading stale data, but were instead failing to zero memory.

> In retrospect, since we have covered the rest of the bases, we can set
> new on the condition:
> 
> i_size == page_offset == pos
> 
> I interpreted new as new page or new data vs new cluster, and perhaps
> that is the reason I placed the condition there.

	I don't like this.  We still haven't figured out where we're
failing.
	The core piece of this, why I'm going back and forth with you
discussing the issue, is that we never want to paper over a symptom.
Sure, we know that i_size==page_offset==pos triggers this, but is it the
only case?  If we figure out *why* we're here, when the code is supposed
to not have this problem, then we can be sure we've closed out all
exposure.
	To wit: If we really only have a problem when i_size is on a
page boundary, doesn't that sound like an off-by-one error?  We've had
those before.  I'm not sure that it is, but shouldn't we be sure about
it?

Joel

-- 

"But then she looks me in the eye
 And says, 'We're going to last forever,'
 And man you know I can't begin to doubt it.
 Cause it just feels so good and so free and so right,
 I know we ain't never going to change our minds about it, Hey!
 Here comes my girl."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:09                             ` Goldwyn Rodrigues
@ 2011-02-23 21:21                               ` Joel Becker
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-23 21:21 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 03:09:22PM -0600, Goldwyn Rodrigues wrote:
> Are you sure that block_write_full_page_endio() will be called before
> the next system call on this file comes in and moves I_size beyond end
> of this page?

	Nope, and that's why I'm talking about things like fsync(), etc.
I'm pretty sure we're racing this sort of thing, because we should have
zeroed it ourselves.
	This comes back to block-based vs page-based zeroing.  The
kernel code should really handle blocks outside of our i_size (this is
what the block_write_full_page_endio() stuff is aimed for).  readpage,
writepage, etc just handle it.  However, the kernel is not responsible
for zeros in blocks we claim to be controlling, and it also doesn't
claim responsibility for blocks we're extending past.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:00                               ` Goldwyn Rodrigues
@ 2011-02-23 21:23                                 ` Joel Becker
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-23 21:23 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 03:00:58PM -0600, Goldwyn Rodrigues wrote:
> On Wed, Feb 23, 2011 at 1:28 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > On Wed, Feb 23, 2011 at 11:13:39AM -0800, Joel Becker wrote:
> >> The calling code should have made sure we are correctly passing
> >> should_zero to all non-target pages. ?Having new overridden for them is
> >> wrong. ?So maybe we move your check inside the target_page if. ?It only
> >> really matters for the target page.
> >
> > ? ? ? ?Actually, I have a question: what is the clustersize of your
> > test? ?Because if the filesystem is 4k/4k, the write at 4096 of 32 bytes
> > should allocate a new cluster. ?'new' should already be set, and your
> > check irrelevant.
> 
> It is 16K and does not happen for clustersize==pagesize/blocksize.
> Don't have a machine with a bigger pagesize handy.

	That's what I expected.

Joel

-- 

"What no boss of a programmer can ever understand is that a programmer
 is working when he's staring out of the window"
	- With apologies to Burton Rascoe

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:17                               ` Joel Becker
@ 2011-02-23 21:35                                 ` Goldwyn Rodrigues
  2011-02-23 21:37                                   ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 21:35 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 3:17 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Wed, Feb 23, 2011 at 02:54:19PM -0600, Goldwyn Rodrigues wrote:
>> On Wed, Feb 23, 2011 at 1:13 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > ? ? ? ?I was about to think we were done, but now I'm not so sure. ?I
>> > have a question: with my change to ocfs2_should_read_block(), but
>> > without your patch, what do you see for corruption? ?Your pattern
>> > written in the first pass, or complete random garbage?
>>
>> Joel, you are getting confused in the list of parameters to reproduce
>> the bug. I have listed it in the past, but not in a single location so
>> it is difficult to pick up the bits. I will put it together for you.
>
> Goldwyn,
> ? ? ? ?I'm just not remembering off the top of my head, so I'm asking.
> I'm actually describing the myriad cases we can encounter in this code,
> because any change we make has to be safe for all possible cases.
>
>> 1. blocksize < clustersize
>
> ? ? ? ?Ok.
>
>> 2. ?the previous write ended in a page boundary, but not a cluster
>> boundary. eg, clustersize=16K but pos is at the page boundary = 4k
>
> ? ? ? ?Must it be a page boundary? ?What about a block boundary? ;-)

I am not sure of this, but I would like to stick with page boundary. I
am too poor to afford a machine with a page size bigger than 4k to
test this ;)

>
>> 3. you wrote something at the page boundary
>
> ? ? ? ?I think it's more block than page, really. ?But that is the same
> for 4K blocks and pages.
>
>> 4. there is a hole created in the page from 4k-8k in *between* say 4128-4640
>
> ? ? ? ?There is no hole here. ?We have allocation. ?Even if the user
> doesn't directly write to this location, it is not a hole, because we
> have a cluster there. ?You just mean that the user didn't write to this
> location (sparse writing). ?That's just a terminology thing, but it's
> important for other folks understanding you.

Agreed. hole in user-space? ;)

>
>> What I am seeing is pure junk(not 0xbaadfeed), and not what I had
>> written earlier.
>
> ? ? ? ?Ok. ?That's what I expected. ?I wanted to clarify we were not
> reading stale data, but were instead failing to zero memory.
>
>> In retrospect, since we have covered the rest of the bases, we can set
>> new on the condition:
>>
>> i_size == page_offset == pos
>>
>> I interpreted new as new page or new data vs new cluster, and perhaps
>> that is the reason I placed the condition there.
>
> ? ? ? ?I don't like this. ?We still haven't figured out where we're
> failing.

What I meant with rest of the bases is the zeroing of pages when pos != i_size.

> ? ? ? ?The core piece of this, why I'm going back and forth with you
> discussing the issue, is that we never want to paper over a symptom.
> Sure, we know that i_size==page_offset==pos triggers this, but is it the
> only case? ?If we figure out *why* we're here, when the code is supposed
> to not have this problem, then we can be sure we've closed out all
> exposure.

I agree absolutely, and this is the reason I am in the discussion. All
I am saying from the start is that the (2nd-) page is never zeroed. We
just have to figure out who and where it has to be zeroed. I am not
keen on pushing my solution, but the correct one.


> ? ? ? ?To wit: If we really only have a problem when i_size is on a
> page boundary, doesn't that sound like an off-by-one error? ?We've had
> those before. ?I'm not sure that it is, but shouldn't we be sure about
> it?
>

Yes it is. It is one of those corner cases which gets missed and we
should be sure of it.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:35                                 ` Goldwyn Rodrigues
@ 2011-02-23 21:37                                   ` Joel Becker
  2011-02-23 21:44                                     ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-23 21:37 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 03:35:36PM -0600, Goldwyn Rodrigues wrote:
> On Wed, Feb 23, 2011 at 3:17 PM, Joel Becker <jlbec@evilplan.org> wrote:
> >> 2. ?the previous write ended in a page boundary, but not a cluster
> >> boundary. eg, clustersize=16K but pos is at the page boundary = 4k
> >
> > ? ? ? ?Must it be a page boundary? ?What about a block boundary? ;-)
> 
> I am not sure of this, but I would like to stick with page boundary. I
> am too poor to afford a machine with a page size bigger than 4k to
> test this ;)

	Create a filesystem with 1K blocks and 16K clusters.  Now you
have four blocks per page, and you can test block vs page boundaries.

Joel

-- 

"There is a country in Europe where multiple-choice tests are
 illegal."
        - Sigfried Hulzer

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:37                                   ` Joel Becker
@ 2011-02-23 21:44                                     ` Joel Becker
  2011-02-23 22:31                                       ` Goldwyn Rodrigues
  2011-02-24 16:32                                       ` Goldwyn Rodrigues
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Becker @ 2011-02-23 21:44 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 01:37:31PM -0800, Joel Becker wrote:
> On Wed, Feb 23, 2011 at 03:35:36PM -0600, Goldwyn Rodrigues wrote:
> > On Wed, Feb 23, 2011 at 3:17 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > >> 2. ?the previous write ended in a page boundary, but not a cluster
> > >> boundary. eg, clustersize=16K but pos is at the page boundary = 4k
> > >
> > > ? ? ? ?Must it be a page boundary? ?What about a block boundary? ;-)
> > 
> > I am not sure of this, but I would like to stick with page boundary. I
> > am too poor to afford a machine with a page size bigger than 4k to
> > test this ;)
> 
> 	Create a filesystem with 1K blocks and 16K clusters.  Now you
> have four blocks per page, and you can test block vs page boundaries.

	I would love to see you modify tailtest to expose this bug.  You
should be able to do it with one set of writes (your write(4k),
write(32b), write(4K at someplace past 4K+32b)) test).  It will be a lot
easier to debug if it is a simple script rather than a bunch of C
writes.

Joel

-- 

"In the long run...we'll all be dead."
                                        -Unknown

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:44                                     ` Joel Becker
@ 2011-02-23 22:31                                       ` Goldwyn Rodrigues
  2011-02-28 18:03                                         ` Joel Becker
  2011-02-24 16:32                                       ` Goldwyn Rodrigues
  1 sibling, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-23 22:31 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 3:44 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Wed, Feb 23, 2011 at 01:37:31PM -0800, Joel Becker wrote:
>> On Wed, Feb 23, 2011 at 03:35:36PM -0600, Goldwyn Rodrigues wrote:
>> > On Wed, Feb 23, 2011 at 3:17 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > >> 2. ?the previous write ended in a page boundary, but not a cluster
>> > >> boundary. eg, clustersize=16K but pos is at the page boundary = 4k
>> > >
>> > > ? ? ? ?Must it be a page boundary? ?What about a block boundary? ;-)
>> >
>> > I am not sure of this, but I would like to stick with page boundary. I
>> > am too poor to afford a machine with a page size bigger than 4k to
>> > test this ;)
>>
>> ? ? ? Create a filesystem with 1K blocks and 16K clusters. ?Now you
>> have four blocks per page, and you can test block vs page boundaries.

Proved, it is page boundary.

>
> ? ? ? ?I would love to see you modify tailtest to expose this bug. ?You
> should be able to do it with one set of writes (your write(4k),
> write(32b), write(4K at someplace past 4K+32b)) test). ?It will be a lot
> easier to debug if it is a simple script rather than a bunch of C
> writes.
>

Here is a simple script. I will incorporate later into tailtest later.

FILENAME=/mnt/f2

for i in `seq 0 256`; do
	let s=$i*4096
	echo "a" | dd of=$FILENAME count=1 bs=1 seek=$s conv=notrunc 2>/dev/null
	let t=$s+4095
	echo "b" | dd of=$FILENAME count=1 bs=1 seek=$t conv=notrunc 2>/dev/null
done


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 21:44                                     ` Joel Becker
  2011-02-23 22:31                                       ` Goldwyn Rodrigues
@ 2011-02-24 16:32                                       ` Goldwyn Rodrigues
  2011-02-24 18:16                                         ` Mark Fasheh
  1 sibling, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-24 16:32 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Wed, Feb 23, 2011 at 3:44 PM, Joel Becker <jlbec@evilplan.org> wrote:
>
> ? ? ? ?I would love to see you modify tailtest to expose this bug. ?You
> should be able to do it with one set of writes (your write(4k),
> write(32b), write(4K at someplace past 4K+32b)) test). ?It will be a lot
> easier to debug if it is a simple script rather than a bunch of C
> writes.
>

I have posted the patch for ocfs2-test on the test mailing list.

While coding the test, I discovered that the problem does not occur
for sparse,inline-data. I thought it might be because it is the first
test in the series so I changed the order and confirmed that the test
does not fail for sparse,inline-data. Any ideas?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-24 16:32                                       ` Goldwyn Rodrigues
@ 2011-02-24 18:16                                         ` Mark Fasheh
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Fasheh @ 2011-02-24 18:16 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Feb 24, 2011 at 10:32:44AM -0600, Goldwyn Rodrigues wrote:
> Hi Joel,
> 
> On Wed, Feb 23, 2011 at 3:44 PM, Joel Becker <jlbec@evilplan.org> wrote:
> >
> > ? ? ? ?I would love to see you modify tailtest to expose this bug. ?You
> > should be able to do it with one set of writes (your write(4k),
> > write(32b), write(4K at someplace past 4K+32b)) test). ?It will be a lot
> > easier to debug if it is a simple script rather than a bunch of C
> > writes.
> >
> 
> I have posted the patch for ocfs2-test on the test mailing list.
> 
> While coding the test, I discovered that the problem does not occur
> for sparse,inline-data. I thought it might be because it is the first
> test in the series so I changed the order and confirmed that the test
> does not fail for sparse,inline-data. Any ideas?

Inline gets a bit of special handling in the write cases, especially when
you cross the boundary into extents. My guess is that the zeroing in that
case is somehow already correct? Possibly that's just by dumb luck.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-23 22:31                                       ` Goldwyn Rodrigues
@ 2011-02-28 18:03                                         ` Joel Becker
  2011-02-28 19:07                                           ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-28 18:03 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Feb 23, 2011 at 04:31:25PM -0600, Goldwyn Rodrigues wrote:
> Here is a simple script. I will incorporate later into tailtest later.
> 
> FILENAME=/mnt/f2
> 
> for i in `seq 0 256`; do
> 	let s=$i*4096
> 	echo "a" | dd of=$FILENAME count=1 bs=1 seek=$s conv=notrunc 2>/dev/null
> 	let t=$s+4095
> 	echo "b" | dd of=$FILENAME count=1 bs=1 seek=$t conv=notrunc 2>/dev/null
> done

	You shouldn't need to do 256 runs.  I would like to see directed
tests that just hit one spot in a file and expose the corruption.  For
example, your described problem case should work like so:

  # Write the first three blocks of the file, getting us past inline_data
  dd if=/dev/urandom of=$FILENAME count=3 bs=4096
  # Write a byte in the next page
  dd if=/dev/urandom of=$FILENAME count=1 bs=1 seek=12228 conv=notrunc
  # Write after some partial-block portion, trying to expose failed zeroing
  dd if=/dev/urandom of=$FILENAME count=4084 bs=1 seek=12300 conv=notrunc

I would expect this to expose the problem as you've described for
clustersize >= 8K.

Joel

-- 

 Joel's First Law:

	Nature abhors a GUI.

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-28 18:03                                         ` Joel Becker
@ 2011-02-28 19:07                                           ` Goldwyn Rodrigues
  2011-02-28 19:40                                             ` Joel Becker
  0 siblings, 1 reply; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-02-28 19:07 UTC (permalink / raw)
  To: ocfs2-devel

Hi Joel,

On Mon, Feb 28, 2011 at 12:03 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Wed, Feb 23, 2011 at 04:31:25PM -0600, Goldwyn Rodrigues wrote:
>> Here is a simple script. I will incorporate later into tailtest later.
>>
>> FILENAME=/mnt/f2
>>
>> for i in `seq 0 256`; do
>> ? ? ? let s=$i*4096
>> ? ? ? echo "a" | dd of=$FILENAME count=1 bs=1 seek=$s conv=notrunc 2>/dev/null
>> ? ? ? let t=$s+4095
>> ? ? ? echo "b" | dd of=$FILENAME count=1 bs=1 seek=$t conv=notrunc 2>/dev/null
>> done
>
> ? ? ? ?You shouldn't need to do 256 runs. ?I would like to see directed
> tests that just hit one spot in a file and expose the corruption. ?For
> example, your described problem case should work like so:
>
> ?# Write the first three blocks of the file, getting us past inline_data
> ?dd if=/dev/urandom of=$FILENAME count=3 bs=4096
> ?# Write a byte in the next page
> ?dd if=/dev/urandom of=$FILENAME count=1 bs=1 seek=12228 conv=notrunc
> ?# Write after some partial-block portion, trying to expose failed zeroing
> ?dd if=/dev/urandom of=$FILENAME count=4084 bs=1 seek=12300 conv=notrunc
>
> I would expect this to expose the problem as you've described for
> clustersize >= 8K.
>


Yes, that will work as well, as long as you can differentiate between
the random characters from urandom and the (unexpected) non-zero
characters in the (userspace) holes. You can read the data from
[12229-12299] in the file you have created and check for zeros.

If you don't want too many changes - s/256/3/ in the script I posted
and check for the 4th block.

But yes, you are on the right path of re-creating the issue.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-28 19:07                                           ` Goldwyn Rodrigues
@ 2011-02-28 19:40                                             ` Joel Becker
  2011-03-01  2:11                                               ` Goldwyn Rodrigues
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Becker @ 2011-02-28 19:40 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, Feb 28, 2011 at 01:07:21PM -0600, Goldwyn Rodrigues wrote:
> On Mon, Feb 28, 2011 at 12:03 PM, Joel Becker <jlbec@evilplan.org> wrote:
> > On Wed, Feb 23, 2011 at 04:31:25PM -0600, Goldwyn Rodrigues wrote:
> >> Here is a simple script. I will incorporate later into tailtest later.
> >>
> >> FILENAME=/mnt/f2
> >>
> >> for i in `seq 0 256`; do
> >> ? ? ? let s=$i*4096
> >> ? ? ? echo "a" | dd of=$FILENAME count=1 bs=1 seek=$s conv=notrunc 2>/dev/null
> >> ? ? ? let t=$s+4095
> >> ? ? ? echo "b" | dd of=$FILENAME count=1 bs=1 seek=$t conv=notrunc 2>/dev/null
> >> done
> >
> > ? ? ? ?You shouldn't need to do 256 runs. ?I would like to see directed
> > tests that just hit one spot in a file and expose the corruption. ?For
> > example, your described problem case should work like so:
> >
> > ?# Write the first three blocks of the file, getting us past inline_data
> > ?dd if=/dev/urandom of=$FILENAME count=3 bs=4096
> > ?# Write a byte in the next page
> > ?dd if=/dev/urandom of=$FILENAME count=1 bs=1 seek=12228 conv=notrunc
> > ?# Write after some partial-block portion, trying to expose failed zeroing
> > ?dd if=/dev/urandom of=$FILENAME count=4084 bs=1 seek=12300 conv=notrunc
> >
> > I would expect this to expose the problem as you've described for
> > clustersize >= 8K.
> 
> 
> Yes, that will work as well, as long as you can differentiate between
> the random characters from urandom and the (unexpected) non-zero
> characters in the (userspace) holes. You can read the data from
> [12229-12299] in the file you have created and check for zeros.

	You're right about the urandom being bad for this.  Maybe it
should be zeros.  But still.

> If you don't want too many changes - s/256/3/ in the script I posted
> and check for the 4th block.

	I do want too many changes.  Your loop doesn't directly specify
the parameters.

Joel

-- 

"I'm so tired of being tired,
 Sure as night will follow day.
 Most things I worry about
 Never happen anyway."

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-28 19:40                                             ` Joel Becker
@ 2011-03-01  2:11                                               ` Goldwyn Rodrigues
  0 siblings, 0 replies; 34+ messages in thread
From: Goldwyn Rodrigues @ 2011-03-01  2:11 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, Feb 28, 2011 at 1:40 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Mon, Feb 28, 2011 at 01:07:21PM -0600, Goldwyn Rodrigues wrote:
>> On Mon, Feb 28, 2011 at 12:03 PM, Joel Becker <jlbec@evilplan.org> wrote:
>> > On Wed, Feb 23, 2011 at 04:31:25PM -0600, Goldwyn Rodrigues wrote:
>> >> Here is a simple script. I will incorporate later into tailtest later.
>> >>

>>
>> Yes, that will work as well, as long as you can differentiate between
>> the random characters from urandom and the (unexpected) non-zero
>> characters in the (userspace) holes. You can read the data from
>> [12229-12299] in the file you have created and check for zeros.
>
> ? ? ? ?You're right about the urandom being bad for this. ?Maybe it
> should be zeros. ?But still.
>

I used /usr/bin/yes instead. The patch is on the ocfs2-test-devel ML.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries
  2011-02-17 15:44 [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries Goldwyn Rodrigues
  2011-02-20  7:09 ` Joel Becker
@ 2011-03-26 22:48 ` Joel Becker
  1 sibling, 0 replies; 34+ messages in thread
From: Joel Becker @ 2011-03-26 22:48 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Feb 17, 2011 at 09:44:40AM -0600, Goldwyn Rodrigues wrote:
> When a hole spans across page boundaries, the next write forces
> a read of the block. This could end up reading existing garbage
> data from the disk in ocfs2_map_page_blocks. This leads to
> non-zero holes. In order to avoid this, mark the writes as new
> when the holes span across page boundaries.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>

This patch is now part of the merge-window branch of ocfs2.git.

I'm still not sure this is the final fix.  I worry that it papers over
our expectations of our new block/page/cluster code.  But we have more
than report of the problem, and this seems to alleviate those reports.
I can't see any way it breaks existing operation.  So I think it should
go in while we later revisit whether it is the end of the discussion.

Joel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

			http://www.jlbec.org/
			jlbec at evilplan.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-03-26 22:48 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-02-17 15:44 [Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries Goldwyn Rodrigues
2011-02-20  7:09 ` Joel Becker
2011-02-20 18:45   ` Goldwyn Rodrigues
2011-02-21  2:08     ` Joel Becker
2011-02-21  5:44       ` Goldwyn Rodrigues
2011-02-22  8:36         ` Joel Becker
2011-02-22  9:37           ` Joel Becker
2011-02-22 19:02           ` Goldwyn Rodrigues
2011-02-22 21:39             ` Joel Becker
2011-02-22 21:54               ` Joel Becker
2011-02-22 22:09                 ` Goldwyn Rodrigues
2011-02-22 23:34                   ` Joel Becker
2011-02-23  0:05                     ` Goldwyn Rodrigues
2011-02-23  9:39                       ` Joel Becker
2011-02-23 14:43                         ` Goldwyn Rodrigues
2011-02-23 19:13                           ` Joel Becker
2011-02-23 19:28                             ` Joel Becker
2011-02-23 21:00                               ` Goldwyn Rodrigues
2011-02-23 21:23                                 ` Joel Becker
2011-02-23 20:54                             ` Goldwyn Rodrigues
2011-02-23 21:17                               ` Joel Becker
2011-02-23 21:35                                 ` Goldwyn Rodrigues
2011-02-23 21:37                                   ` Joel Becker
2011-02-23 21:44                                     ` Joel Becker
2011-02-23 22:31                                       ` Goldwyn Rodrigues
2011-02-28 18:03                                         ` Joel Becker
2011-02-28 19:07                                           ` Goldwyn Rodrigues
2011-02-28 19:40                                             ` Joel Becker
2011-03-01  2:11                                               ` Goldwyn Rodrigues
2011-02-24 16:32                                       ` Goldwyn Rodrigues
2011-02-24 18:16                                         ` Mark Fasheh
2011-02-23 21:09                             ` Goldwyn Rodrigues
2011-02-23 21:21                               ` Joel Becker
2011-03-26 22:48 ` Joel Becker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).