fio test triggering bad data on ext4

public inbox for linux-ext4@vger.kernel.org

* fio test triggering bad data on ext4
@ 2010-06-18  8:07 Jens Axboe
  2010-06-18 14:02 ` Eric Sandeen
  2010-07-07 14:26 ` Eric Sandeen
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2010-06-18  8:07 UTC (permalink / raw)
  To: tytso, adilger; +Cc: linux-ext4

Hi,

I was writing a small fio job file to do writes and read verifies on a
device. It forks 32 processes, each writing randomly to 4 files with a
block size between 4k and 16k. When it has written 1024 of those blocks,
it'll verify the oldest 512 of them. Each block is checksummed for every
512b. It uses libaio and O_DIRECT.

It works on ext2 and btrfs. I haven't run it to completion yet, but they
survive 15-20 minutes just fine. ext4 doesn't even go a full minute
before this triggers:

Bad verify header 0 at 10137600
fio: pid=9943, err=84/file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character

writers: (groupid=0, jobs=32): err=84 (file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=9943

which tells us that where we expected to find the correct verify magic
in the header, it was all zeroes. The job file used is below, and to
reproduce you want to use the latest fio (1.40) since some earlier
versions don't do verify_interval properly for non-pattern verifies. You
can get fio here:

http://brick.kernel.dk/snaps/fio-1.40.tar.gz

or from git at:

git://git.kernel.dk/fio.git

The kernel used is 2.6.35-rc3 and I ran this on a raid0 that had 8 SSD
drives.

--- snip job file ---

[global]
direct=1
group_reporting=1
exitall
runtime=4h
time_based=1

# writers, will repeatedly randomly write and verify data
[writers]
rw=randwrite
bsrange=4k-16k
ioengine=libaio
iodepth=4
directory=/data
verify=crc32c
verify_backlog=1024
verify_backlog_batch=512
verify_interval=512
size=512m
nrfiles=4
filesize=64m-256m
numjobs=32
create_serialize=0

--- snip job file ---

-- 
Jens Axboe


* Re: fio test triggering bad data on ext4
  2010-06-18  8:07 fio test triggering bad data on ext4 Jens Axboe
@ 2010-06-18 14:02 ` Eric Sandeen
  2010-06-18 14:59   ` Eric Sandeen
  2010-07-07 14:26 ` Eric Sandeen
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2010-06-18 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: tytso, adilger, linux-ext4

Jens Axboe wrote:
> Hi,
> 
> I was writing a small fio job file to do writes and read verifies on a
> device. It forks 32 processes, each writing randomly to 4 files with a
> block size between 4k and 16k. When it has written 1024 of those blocks,
> it'll verify the oldest 512 of them. Each block is checksummed for every
> 512b. It uses libaio and O_DIRECT.
> 
> It works on ext2 and btrfs. I haven't run it to completion yet, but they
> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
> before this triggers:

Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
(and I'm not sure about btrfs in that regard), ext4 may be most similar
to xfs's behavior on the test ... wondering how it fares.

Thanks,
-Eric



* Re: fio test triggering bad data on ext4
  2010-06-18 14:02 ` Eric Sandeen
@ 2010-06-18 14:59   ` Eric Sandeen
  2010-06-18 15:13     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2010-06-18 14:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: tytso, adilger, linux-ext4

Eric Sandeen wrote:
> Jens Axboe wrote:
>> Hi,
>>
>> I was writing a small fio job file to do writes and read verifies on a
>> device. It forks 32 processes, each writing randomly to 4 files with a
>> block size between 4k and 16k. When it has written 1024 of those blocks,
>> it'll verify the oldest 512 of them. Each block is checksummed for every
>> 512b. It uses libaio and O_DIRECT.
>>
>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>> before this triggers:
> 
> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
> (and I'm not sure about btrfs in that regard), ext4 may be most similar
> to xfs's behavior on the test ... wondering how it fares.
> 
> Thanks,
> -Eric

Actually mingming had a patch for direct-io.c which may be related, I'll
test that out.

-Eric


* Re: fio test triggering bad data on ext4
  2010-06-18 14:59   ` Eric Sandeen
@ 2010-06-18 15:13     ` Jens Axboe
  2010-06-18 15:28       ` Eric Sandeen
  2010-06-18 17:36       ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2010-06-18 15:13 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 18/06/10 16.59, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Jens Axboe wrote:
>>> Hi,
>>>
>>> I was writing a small fio job file to do writes and read verifies on a
>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>> 512b. It uses libaio and O_DIRECT.
>>>
>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>> before this triggers:
>>
>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>> to xfs's behavior on the test ... wondering how it fares.
>>
>> Thanks,
>> -Eric
> 
> Actually mingming had a patch for direct-io.c which may be related, I'll
> test that out.

OK, I'll try XFS tonight as well.


-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
  2010-06-18 15:13     ` Jens Axboe
@ 2010-06-18 15:28       ` Eric Sandeen
  2010-06-18 17:32         ` Jens Axboe
  2010-06-18 17:36       ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2010-06-18 15:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

Jens Axboe wrote:
> On 18/06/10 16.59, Eric Sandeen wrote:
>   
>> Eric Sandeen wrote:
>>     
>>> Jens Axboe wrote:
>>>       
>>>> Hi,
>>>>
>>>> I was writing a small fio job file to do writes and read verifies on a
>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>> 512b. It uses libaio and O_DIRECT.
>>>>
>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>> before this triggers:
>>>>         
>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>> to xfs's behavior on the test ... wondering how it fares.
>>>
>>> Thanks,
>>> -Eric
>>>       
>> Actually mingming had a patch for direct-io.c which may be related, I'll
>> test that out.
>>     
>
> OK, I'll try XFS tonight as well.
>
>
>   
I haven't been able to reproduce it on ext4 here, yet.

FWIW here's the patch from mingming:

For unaligned DIO writes, skip zeroing out the block if the buffer is marked
unwritten: that means an asynchronous direct IO (append or hole fill) is
still pending.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/direct-io.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-git/fs/direct-io.c
===================================================================
--- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
+++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
@@ -740,7 +740,8 @@
 	struct page *page;
 
 	dio->start_zero_done = 1;
-	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
+	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
+	    || buffer_unwritten(&dio->map_bh))
 		return;
 
 	dio_blocks_per_fs_block = 1 << dio->blkfactor;




* Re: fio test triggering bad data on ext4
  2010-06-18 15:28       ` Eric Sandeen
@ 2010-06-18 17:32         ` Jens Axboe
  2010-06-18 18:04           ` Eric Sandeen
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2010-06-18 17:32 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 2010-06-18 17:28, Eric Sandeen wrote:
> Jens Axboe wrote:
>> On 18/06/10 16.59, Eric Sandeen wrote:
>>   
>>> Eric Sandeen wrote:
>>>     
>>>> Jens Axboe wrote:
>>>>       
>>>>> Hi,
>>>>>
>>>>> I was writing a small fio job file to do writes and read verifies on a
>>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>>> 512b. It uses libaio and O_DIRECT.
>>>>>
>>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>>> before this triggers:
>>>>>         
>>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>>> to xfs's behavior on the test ... wondering how it fares.
>>>>
>>>> Thanks,
>>>> -Eric
>>>>       
>>> Actually mingming had a patch for direct-io.c which may be related, I'll
>>> test that out.
>>>     
>>
>> OK, I'll try XFS tonight as well.
>>
>>
>>   
> I haven't been able to reproduce it on ext4 here, yet.
> 
> FWIW here's the patch from mingming:
> 
> For unaligned DIO writes, skip zeroing out the block if the buffer is marked
> unwritten: that means an asynchronous direct IO (append or hole fill) is
> still pending.
> 
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> ---
>  fs/direct-io.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-git/fs/direct-io.c
> ===================================================================
> --- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
> +++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
> @@ -740,7 +740,8 @@
>  	struct page *page;
>  
>  	dio->start_zero_done = 1;
> -	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
> +	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
> +	    || buffer_unwritten(&dio->map_bh))
>  		return;
>  
>  	dio_blocks_per_fs_block = 1 << dio->blkfactor;
> 
> 

What is this patch against?

-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
  2010-06-18 17:32         ` Jens Axboe
@ 2010-06-18 18:04           ` Eric Sandeen
  2010-06-18 18:14             ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2010-06-18 18:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

Jens Axboe wrote:
> On 2010-06-18 17:28, Eric Sandeen wrote:
>> Jens Axboe wrote:
>>> On 18/06/10 16.59, Eric Sandeen wrote:
>>>   
>>>> Eric Sandeen wrote:
>>>>     
>>>>> Jens Axboe wrote:
>>>>>       
>>>>>> Hi,
>>>>>>
>>>>>> I was writing a small fio job file to do writes and read verifies on a
>>>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>>>> 512b. It uses libaio and O_DIRECT.
>>>>>>
>>>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>>>> before this triggers:
>>>>>>         
>>>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>>>> to xfs's behavior on the test ... wondering how it fares.
>>>>>
>>>>> Thanks,
>>>>> -Eric
>>>>>       
>>>> Actually mingming had a patch for direct-io.c which may be related, I'll
>>>> test that out.
>>>>     
>>> OK, I'll try XFS tonight as well.
>>>
>>>
>>>   
>> I haven't been able to reproduce it on ext4 here, yet.
>>
>> FWIW here's the patch from mingming:
>>
>> For unaligned DIO writes, skip zeroing out the block if the buffer is marked
>> unwritten: that means an asynchronous direct IO (append or hole fill) is
>> still pending.
>>
>> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
>> ---
>>  fs/direct-io.c |    3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> Index: linux-git/fs/direct-io.c
>> ===================================================================
>> --- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
>> +++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
>> @@ -740,7 +740,8 @@
>>  	struct page *page;
>>  
>>  	dio->start_zero_done = 1;
>> -	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
>> +	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
>> +	    || buffer_unwritten(&dio->map_bh))
>>  		return;
>>  
>>  	dio_blocks_per_fs_block = 1 << dio->blkfactor;
>>
>>
> 
> What is this patch against?
> 

Applied to 2.6.32, seems to apply upstream as well.

It hits dio_zero_block().

-Eric


* Re: fio test triggering bad data on ext4
  2010-06-18 18:04           ` Eric Sandeen
@ 2010-06-18 18:14             ` Jens Axboe
  2010-06-21 10:20               ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2010-06-18 18:14 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 2010-06-18 20:04, Eric Sandeen wrote:
> Jens Axboe wrote:
>> On 2010-06-18 17:28, Eric Sandeen wrote:
>>> Jens Axboe wrote:
>>>> On 18/06/10 16.59, Eric Sandeen wrote:
>>>>   
>>>>> Eric Sandeen wrote:
>>>>>     
>>>>>> Jens Axboe wrote:
>>>>>>       
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was writing a small fio job file to do writes and read verifies on a
>>>>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>>>>> 512b. It uses libaio and O_DIRECT.
>>>>>>>
>>>>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>>>>> before this triggers:
>>>>>>>         
>>>>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>>>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>>>>> to xfs's behavior on the test ... wondering how it fares.
>>>>>>
>>>>>> Thanks,
>>>>>> -Eric
>>>>>>       
>>>>> Actually mingming had a patch for direct-io.c which may be related, I'll
>>>>> test that out.
>>>>>     
>>>> OK, I'll try XFS tonight as well.
>>>>
>>>>
>>>>   
>>> I haven't been able to reproduce it on ext4 here, yet.
>>>
>>> FWIW here's the patch from mingming:
>>>
>>> For unaligned DIO writes, skip zeroing out the block if the buffer is marked
>>> unwritten: that means an asynchronous direct IO (append or hole fill) is
>>> still pending.
>>>
>>> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
>>> ---
>>>  fs/direct-io.c |    3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-git/fs/direct-io.c
>>> ===================================================================
>>> --- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
>>> +++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
>>> @@ -740,7 +740,8 @@
>>>  	struct page *page;
>>>  
>>>  	dio->start_zero_done = 1;
>>> -	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
>>> +	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
>>> +	    || buffer_unwritten(&dio->map_bh))
>>>  		return;
>>>  
>>>  	dio_blocks_per_fs_block = 1 << dio->blkfactor;
>>>
>>>
>>
>> What is this patch against?
>>
> 
> Applied to 2.6.32, seems to apply upstream as well.
> 
> It hits dio_zero_block().

Irk indeed, I am blind. The patch does not fix it.

-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
  2010-06-18 18:14             ` Jens Axboe
@ 2010-06-21 10:20               ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2010-06-21 10:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 2010-06-18 20:14, Jens Axboe wrote:
> On 2010-06-18 20:04, Eric Sandeen wrote:
>> Jens Axboe wrote:
>>> On 2010-06-18 17:28, Eric Sandeen wrote:
>>>> Jens Axboe wrote:
>>>>> On 18/06/10 16.59, Eric Sandeen wrote:
>>>>>   
>>>>>> Eric Sandeen wrote:
>>>>>>     
>>>>>>> Jens Axboe wrote:
>>>>>>>       
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was writing a small fio job file to do writes and read verifies on a
>>>>>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>>>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>>>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>>>>>> 512b. It uses libaio and O_DIRECT.
>>>>>>>>
>>>>>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>>>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>>>>>> before this triggers:
>>>>>>>>         
>>>>>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>>>>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>>>>>> to xfs's behavior on the test ... wondering how it fares.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Eric
>>>>>>>       
>>>>>> Actually mingming had a patch for direct-io.c which may be related, I'll
>>>>>> test that out.
>>>>>>     
>>>>> OK, I'll try XFS tonight as well.
>>>>>
>>>>>
>>>>>   
>>>> I haven't been able to reproduce it on ext4 here, yet.
>>>>
>>>> FWIW here's the patch from mingming:
>>>>
>>>> For unaligned DIO writes, skip zeroing out the block if the buffer is marked
>>>> unwritten: that means an asynchronous direct IO (append or hole fill) is
>>>> still pending.
>>>>
>>>> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
>>>> ---
>>>>  fs/direct-io.c |    3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> Index: linux-git/fs/direct-io.c
>>>> ===================================================================
>>>> --- linux-git.orig/fs/direct-io.c	2010-05-07 15:42:22.855033403 -0700
>>>> +++ linux-git/fs/direct-io.c	2010-05-07 15:44:17.695007770 -0700
>>>> @@ -740,7 +740,8 @@
>>>>  	struct page *page;
>>>>  
>>>>  	dio->start_zero_done = 1;
>>>> -	if (!dio->blkfactor || !buffer_new(&dio->map_bh))
>>>> +	if (!dio->blkfactor || !buffer_new(&dio->map_bh)
>>>> +	    || buffer_unwritten(&dio->map_bh))
>>>>  		return;
>>>>  
>>>>  	dio_blocks_per_fs_block = 1 << dio->blkfactor;
>>>>
>>>>
>>>
>>> What is this patch against?
>>>
>>
>> Applied to 2.6.32, seems to apply upstream as well.
>>
>> It hits dio_zero_block().
> 
> Irk indeed, I am blind. The patch does not fix it.

So, just to confirm that this isn't a new regression: 2.6.34 fails in the
same way. If I change the test so that the random writes overwrite
existing blocks instead of filling holes, then the problem does not
occur.

-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
  2010-06-18 15:13     ` Jens Axboe
  2010-06-18 15:28       ` Eric Sandeen
@ 2010-06-18 17:36       ` Jens Axboe
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2010-06-18 17:36 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 2010-06-18 17:13, Jens Axboe wrote:
> On 18/06/10 16.59, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> I was writing a small fio job file to do writes and read verifies on a
>>>> device. It forks 32 processes, each writing randomly to 4 files with a
>>>> block size between 4k and 16k. When it has written 1024 of those blocks,
>>>> it'll verify the oldest 512 of them. Each block is checksummed for every
>>>> 512b. It uses libaio and O_DIRECT.
>>>>
>>>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>>>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>>>> before this triggers:
>>>
>>> Jens, can you try XFS too?  Since ext3 can't do direct IO to a hole,
>>> (and I'm not sure about btrfs in that regard), ext4 may be most similar
>>> to xfs's behavior on the test ... wondering how it fares.
>>>
>>> Thanks,
>>> -Eric
>>
>> Actually mingming had a patch for direct-io.c which may be related, I'll
>> test that out.
> 
> OK, I'll try XFS tonight as well.

XFS fails too.

-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
  2010-06-18  8:07 fio test triggering bad data on ext4 Jens Axboe
  2010-06-18 14:02 ` Eric Sandeen
@ 2010-07-07 14:26 ` Eric Sandeen
  2010-07-07 19:39   ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2010-07-07 14:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: tytso, adilger, linux-ext4

Jens Axboe wrote:
> Hi,
> 
> I was writing a small fio job file to do writes and read verifies on a
> device. It forks 32 processes, each writing randomly to 4 files with a
> block size between 4k and 16k. When it has written 1024 of those blocks,
> it'll verify the oldest 512 of them. Each block is checksummed for every
> 512b. It uses libaio and O_DIRECT.
> 
> It works on ext2 and btrfs. I haven't run it to completion yet, but they
> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
> before this triggers:
> 
> Bad verify header 0 at 10137600
> fio: pid=9943, err=84/file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
> 
> writers: (groupid=0, jobs=32): err=84 (file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=9943

FYI:

I asked Jens to test hch's and Jiaying's aio completion patches with this,
and apparently those fixed this problem for him.

-Eric
 
> which tells us that where we expected to find the correct verify magic
> in the header, it was all zeroes. The job file used is below, and to
> reproduce you want to use the latest fio (1.40) since some earlier
> versions don't do verify_interval properly for non-pattern verifies. You
> can get fio here:
> 
> http://brick.kernel.dk/snaps/fio-1.40.tar.gz
> 
> or from git at:
> 
> git://git.kernel.dk/fio.git
> 
> The kernel used is 2.6.35-rc3 and I ran this on a raid0 that had 8 SSD
> drives.
> 
> --- snip job file ---
> 
> [global]
> direct=1
> group_reporting=1
> exitall
> runtime=4h
> time_based=1
> 
> # writers, will repeatedly randomly write and verify data
> [writers]
> rw=randwrite
> bsrange=4k-16k
> ioengine=libaio
> iodepth=4
> directory=/data
> verify=crc32c
> verify_backlog=1024
> verify_backlog_batch=512
> verify_interval=512
> size=512m
> nrfiles=4
> filesize=64m-256m
> numjobs=32
> create_serialize=0
> 
> --- snip job file ---
> 



* Re: fio test triggering bad data on ext4
  2010-07-07 14:26 ` Eric Sandeen
@ 2010-07-07 19:39   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2010-07-07 19:39 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: tytso@mit.edu, adilger@sun.com, linux-ext4@vger.kernel.org

On 07/07/10 16.26, Eric Sandeen wrote:
> Jens Axboe wrote:
>> Hi,
>>
>> I was writing a small fio job file to do writes and read verifies on a
>> device. It forks 32 processes, each writing randomly to 4 files with a
>> block size between 4k and 16k. When it has written 1024 of those blocks,
>> it'll verify the oldest 512 of them. Each block is checksummed for every
>> 512b. It uses libaio and O_DIRECT.
>>
>> It works on ext2 and btrfs. I haven't run it to completion yet, but they
>> survive 15-20 minutes just fine. ext4 doesn't even go a full minute
>> before this triggers:
>>
>> Bad verify header 0 at 10137600
>> fio: pid=9943, err=84/file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character
>>
>> writers: (groupid=0, jobs=32): err=84 (file:io_u.c:1212, func=io_u_queued_complete, error=Invalid or incomplete multibyte or wide character): pid=9943
> 
> FYI:
> 
> I asked Jens to test hch's and Jiaying's aio completion patches with this,
> and apparently those fixed this problem for him.

At least for a shorter run, but long enough that all the holes should
have been filled at this point. So it at least fixes my test case.
I can try and expand the run a bit if there's any interest in that,
and see if that still verifies correctly.

-- 
Jens Axboe



* Re: fio test triggering bad data on ext4
@ 2010-06-21  9:37 Frank Mehnert
  0 siblings, 0 replies; 13+ messages in thread
From: Frank Mehnert @ 2010-06-21  9:37 UTC (permalink / raw)
  To: linux-ext4

Hi,

I would like to add that we have a similar test case, which probably triggers
much faster than Jens' test case; see here:

  https://bugzilla.kernel.org/show_bug.cgi?id=16165

We believe that this bug is responsible for data corruption of VirtualBox
disk images located on an ext4 file system. Please let me know how we can
help you debug this issue.

Kind regards,

Frank
-- 
Dr.-Ing. Frank Mehnert

Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, 85551 Kirchheim-Heimstetten
Amtsgericht München: HRB 161028
Geschäftsführer: Jürgen Kunz



end of thread

Thread overview: 13+ messages
2010-06-18  8:07 fio test triggering bad data on ext4 Jens Axboe
2010-06-18 14:02 ` Eric Sandeen
2010-06-18 14:59   ` Eric Sandeen
2010-06-18 15:13     ` Jens Axboe
2010-06-18 15:28       ` Eric Sandeen
2010-06-18 17:32         ` Jens Axboe
2010-06-18 18:04           ` Eric Sandeen
2010-06-18 18:14             ` Jens Axboe
2010-06-21 10:20               ` Jens Axboe
2010-06-18 17:36       ` Jens Axboe
2010-07-07 14:26 ` Eric Sandeen
2010-07-07 19:39   ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2010-06-21  9:37 Frank Mehnert
