* Re: Software raid0 will crash the file-system, when each disk is 5TB
[not found] <659F626D666070439A4A5965CD6EBF406836C6@gazelle.ad.endace.com>
@ 2007-05-15 23:29 ` Michal Piotrowski
2007-05-16 0:03 ` Neil Brown
0 siblings, 1 reply; 18+ messages in thread
From: Michal Piotrowski @ 2007-05-15 23:29 UTC (permalink / raw)
To: Jeff Zheng
Cc: Ingo Molnar, Neil Brown, linux-raid, linux-kernel, linux-fsdevel
[Ingo, Neil, linux-raid added to CC]
On 16/05/07, Jeff Zheng <Jeff.Zheng@endace.com> wrote:
> Hi everyone:
>
> We are experiencing problems with software raid0, with very
> large disk arrays.
> We are using two 3ware disk array controllers, each of them connected to
> 8 750GB hard drives, and we build a software raid0 on top of that. The
> total capacity is 5.5TB+5.5TB=11TB.
>
> We use jfs as the file-system, and we have a test application that writes
> data continuously to the disks. After writing 52 10GB files, jfs
> crashed, and we are not able to recover it; fsck doesn't recognise it
> anymore.
> We then tried xfs with the same application; it lasted a little longer,
> but gave a kernel crash later.
>
> We then reconfigured the hardware array; this time we configured two
> disk arrays from each controller, so we have 4 disk arrays, each of
> them with 4 750GB hard drives. Then we built a new software raid0 on top
> of that. Total capacity is still the same, but 2.75T+2.75T+2.75T+2.75T=11T.
>
> This time we managed to fill the whole 11T with data without problems; we
> are still doing validation on all 11TB of data written to the disks.
>
> It happened on 2.6.20 and 2.6.13.
>
> So I think the problem is in the way software raid handles very
> large disks, maybe an integer overflow or something. I've searched on the
> web and only found another guy complaining about the same thing on the
> xfs mailing list.
>
> Anybody have a clue?
>
>
> Jeff
Regards,
Michal
--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-15 23:29 ` Software raid0 will crash the file-system, when each disk is 5TB Michal Piotrowski
@ 2007-05-16 0:03 ` Neil Brown
2007-05-16 1:56 ` Jeff Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-16 0:03 UTC (permalink / raw)
To: Michal Piotrowski
Cc: Jeff Zheng, Ingo Molnar, linux-raid, linux-kernel, linux-fsdevel
On Wednesday May 16, michal.k.k.piotrowski@gmail.com wrote:
> >
> > Anybody have a clue?
> >
No...
When a raid0 array is assembled, quite a lot of messages get printed
about the number of zones, hash_spacing, etc. Can you collect and post
those? Both for the failing case (2*5.5T) and the working case
(4*2.75T) if possible.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-16 0:03 ` Neil Brown
@ 2007-05-16 1:56 ` Jeff Zheng
2007-05-16 17:28 ` Bill Davidsen
2007-05-17 0:48 ` Neil Brown
0 siblings, 2 replies; 18+ messages in thread
From: Jeff Zheng @ 2007-05-16 1:56 UTC (permalink / raw)
To: Neil Brown, Michal Piotrowski
Cc: Ingo Molnar, linux-raid, linux-kernel, linux-fsdevel
Here is the information of the created raid0. Hope it is enough.
Jeff
The crashing one:
md: bind<sdd>
md: bind<sde>
md: raid0 personality registered for level 0
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sde
raid0: comparing sde(5859284992) with sde(5859284992)
raid0: END
raid0: ==> UNIQUE
raid0: 1 zones
raid0: looking at sdd
raid0: comparing sdd(5859284992) with sde(5859284992)
raid0: EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718569984 blocks.
raid0 : conf->hash_spacing is 11718569984 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536
The working one:
md: bind<sde>
md: bind<sdf>
md: bind<sdg>
md: bind<sdd>
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sdd
raid0: comparing sdd(2929641472) with sdd(2929641472)
raid0: END
raid0: ==> UNIQUE
raid0: 1 zones
raid0: looking at sdg
raid0: comparing sdg(2929641472) with sdd(2929641472)
raid0: EQUAL
raid0: looking at sdf
raid0: comparing sdf(2929641472) with sdd(2929641472)
raid0: EQUAL
raid0: looking at sde
raid0: comparing sde(2929641472) with sdd(2929641472)
raid0: EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718565888 blocks.
raid0 : conf->hash_spacing is 11718565888 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536
-----Original Message-----
From: Neil Brown [mailto:neilb@suse.de]
Sent: Wednesday, 16 May 2007 12:04 p.m.
To: Michal Piotrowski
Cc: Jeff Zheng; Ingo Molnar; linux-raid@vger.kernel.org;
linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org
Subject: Re: Software raid0 will crash the file-system, when each disk
is 5TB
On Wednesday May 16, michal.k.k.piotrowski@gmail.com wrote:
> >
> > Anybody have a clue?
> >
No...
When a raid0 array is assembled, quite a lot of messages get printed
about the number of zones, hash_spacing, etc. Can you collect and post
those? Both for the failing case (2*5.5T) and the working case
(4*2.75T) if possible.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-16 1:56 ` Jeff Zheng
@ 2007-05-16 17:28 ` Bill Davidsen
2007-05-16 17:58 ` david
2007-05-17 0:48 ` Neil Brown
1 sibling, 1 reply; 18+ messages in thread
From: Bill Davidsen @ 2007-05-16 17:28 UTC (permalink / raw)
To: Jeff Zheng
Cc: Neil Brown, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
Jeff Zheng wrote:
> Here is the information of the created raid0. Hope it is enough.
>
>
If I read this correctly, the problem is with JFS rather than RAID? Have
you tried not mounting the JFS filesystem but just starting the array
which crashes, so you can read bits of it, etc, and verify that the
array itself is working?
And can you run an fsck on the filesystem, if that makes sense? I assume
you got to actually write a f/s at one time, and I've never used JFS
under Linux. I spent five+ years using it on AIX, though, complex but
robust.
> The crashing one:
> md: bind<sdd>
> md: bind<sde>
> md: raid0 personality registered for level 0
> md0: setting max_sectors to 4096, segment boundary to 1048575
> raid0: looking at sde
> raid0: comparing sde(5859284992) with sde(5859284992)
> raid0: END
> raid0: ==> UNIQUE
> raid0: 1 zones
> raid0: looking at sdd
> raid0: comparing sdd(5859284992) with sde(5859284992)
> raid0: EQUAL
> raid0: FINAL 1 zones
> raid0: done.
> raid0 : md_size is 11718569984 blocks.
> raid0 : conf->hash_spacing is 11718569984 blocks.
> raid0 : nb_zone is 2.
> raid0 : Allocating 8 bytes for hash.
> JFS: nTxBlock = 8192, nTxLock = 65536
>
> The working one:
> md: bind<sde>
> md: bind<sdf>
> md: bind<sdg>
> md: bind<sdd>
> md0: setting max_sectors to 4096, segment boundary to 1048575
> raid0: looking at sdd
> raid0: comparing sdd(2929641472) with sdd(2929641472)
> raid0: END
> raid0: ==> UNIQUE
> raid0: 1 zones
> raid0: looking at sdg
> raid0: comparing sdg(2929641472) with sdd(2929641472)
> raid0: EQUAL
> raid0: looking at sdf
> raid0: comparing sdf(2929641472) with sdd(2929641472)
> raid0: EQUAL
> raid0: looking at sde
> raid0: comparing sde(2929641472) with sdd(2929641472)
> raid0: EQUAL
> raid0: FINAL 1 zones
> raid0: done.
> raid0 : md_size is 11718565888 blocks.
> raid0 : conf->hash_spacing is 11718565888 blocks.
> raid0 : nb_zone is 2.
> raid0 : Allocating 8 bytes for hash.
> JFS: nTxBlock = 8192, nTxLock = 65536
>
> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Wednesday, 16 May 2007 12:04 p.m.
> To: Michal Piotrowski
> Cc: Jeff Zheng; Ingo Molnar; linux-raid@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org
> Subject: Re: Software raid0 will crash the file-system, when each disk
> is 5TB
>
> On Wednesday May 16, michal.k.k.piotrowski@gmail.com wrote:
>
>>> Anybody have a clue?
>>>
>>>
>
> No...
> When a raid0 array is assemble, quite a lot of message get printed
> about number of zones and hash_spacing etc. Can you collect and post
> those. Both for the failing case (2*5.5T) and the working case
> (4*2.55T) is possible.
>
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-16 17:28 ` Bill Davidsen
@ 2007-05-16 17:58 ` david
0 siblings, 0 replies; 18+ messages in thread
From: david @ 2007-05-16 17:58 UTC (permalink / raw)
To: Bill Davidsen
Cc: Jeff Zheng, Neil Brown, Michal Piotrowski, Ingo Molnar,
linux-raid, linux-kernel, linux-fsdevel
On Wed, 16 May 2007, Bill Davidsen wrote:
> Jeff Zheng wrote:
>> Here is the information of the created raid0. Hope it is enough.
>>
>>
> If I read this correctly, the problem is with JFS rather than RAID?
he had the same problem with xfs.
David Lang
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-16 1:56 ` Jeff Zheng
2007-05-16 17:28 ` Bill Davidsen
@ 2007-05-17 0:48 ` Neil Brown
2007-05-17 2:09 ` Jeff Zheng
1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-17 0:48 UTC (permalink / raw)
To: Jeff Zheng
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
On Wednesday May 16, Jeff.Zheng@endace.com wrote:
> Here is the information of the created raid0. Hope it is enough.
Thanks.
Everything looks fine here.
The only difference of any significance between the working and
non-working configurations is that in the non-working, the component
devices are larger than 2Gig, and hence have sector offsets greater
than 32 bits.
This does cause a slightly different code path in one place, but I
cannot see it making a difference. But maybe it does.
What architecture is this running on?
What C compiler are you using?
Can you try with this patch? It is the only thing that I can find
that could conceivably go wrong.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid0.c | 1 +
1 file changed, 1 insertion(+)
diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c 2007-05-17 10:33:30.000000000 +1000
+++ ./drivers/md/raid0.c 2007-05-17 10:34:02.000000000 +1000
@@ -461,6 +461,7 @@ static int raid0_make_request (request_q
while (block >= (zone->zone_offset + zone->size))
zone++;
+ BUG_ON(block < zone->zone_offset);
sect_in_chunk = bio->bi_sector & ((chunk_size<<1) -1);
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 0:48 ` Neil Brown
@ 2007-05-17 2:09 ` Jeff Zheng
2007-05-17 2:45 ` Neil Brown
0 siblings, 1 reply; 18+ messages in thread
From: Jeff Zheng @ 2007-05-17 2:09 UTC (permalink / raw)
To: Neil Brown
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
> The only difference of any significance between the working
> and non-working configurations is that in the non-working,
> the component devices are larger than 2Gig, and hence have
> sector offsets greater than 32 bits.
Do you mean 2T here? But in both configurations, the component devices are
larger than 2T (2.25T & 5.5T).
> This does cause a slightly different code path in one place,
> but I cannot see it making a difference. But maybe it does.
>
> What architecture is this running on?
> What C compiler are you using?
i386 (i686).
GCC 4.0.2 20051125.
Distro is Fedora Core; we've tried FC4 and FC6.
> Can you try with this patch? It is the only thing that I can
> find that could conceivably go wrong.
>
OK, I will try the patch and post the result.
Best Regards
Jeff Zheng
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 2:09 ` Jeff Zheng
@ 2007-05-17 2:45 ` Neil Brown
2007-05-17 3:11 ` Jeff Zheng
2007-05-17 4:45 ` david
0 siblings, 2 replies; 18+ messages in thread
From: Neil Brown @ 2007-05-17 2:45 UTC (permalink / raw)
To: Jeff Zheng
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
On Thursday May 17, Jeff.Zheng@endace.com wrote:
>
> > The only difference of any significance between the working
> > and non-working configurations is that in the non-working,
> > the component devices are larger than 2Gig, and hence have
> > sector offsets greater than 32 bits.
>
> Do u mean 2T here?, but in both configuartion, the component devices are
> larger than 2T (2.25T&5.5T).
Yes, I meant 2T, and yes, the components are always over 2T. So I'm
at a complete loss. The raid0 code follows the same paths and does
the same things and uses 64bit arithmetic where needed.
So I have no idea how there could be a difference between these two
cases.
I'm at a loss...
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 2:45 ` Neil Brown
@ 2007-05-17 3:11 ` Jeff Zheng
2007-05-17 4:32 ` Neil Brown
2007-05-17 4:45 ` david
1 sibling, 1 reply; 18+ messages in thread
From: Jeff Zheng @ 2007-05-17 3:11 UTC (permalink / raw)
To: Neil Brown
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
I tried the patch; the same problem shows up, but no BUG_ON report.
Is there anything else I can do?
Jeff
> Yes, I meant 2T, and yes, the components are always over 2T.
> So I'm at a complete loss. The raid0 code follows the same
> paths and does the same things and uses 64bit arithmetic where needed.
>
> So I have no idea how there could be a difference between
> these two cases.
>
> I'm at a loss...
>
> NeilBrown
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 3:11 ` Jeff Zheng
@ 2007-05-17 4:32 ` Neil Brown
2007-05-17 5:08 ` Jeff Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-17 4:32 UTC (permalink / raw)
To: Jeff Zheng
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
On Thursday May 17, Jeff.Zheng@endace.com wrote:
> I tried the patch, same problem show up, but no bug_on report
>
> Is there any other things I can do?
>
What is the nature of the corruption? Is it data in a file that is
wrong when you read it back, or does the filesystem metadata get
corrupted?
Can you try the configuration that works, and sha1sum the files after
you have written them to make sure that they really are correct?
My thought here is "maybe there is a bad block on one device, and the
block is used for data in the 'working' config, and for metadata in
the 'broken' config".
Can you try a degraded raid10 configuration. e.g.
mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \
/dev/second missing
That will lay out the data in exactly the same place as with raid0,
but will use totally different code paths to access it. If you still
get a problem, then it isn't in the raid0 code.
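For example, with the two 5.5T components named /dev/sdd and /dev/sde as in
the assembly log above (device names assumed from that log; adjust them to
match your setup), that would be roughly:

  mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/sdd missing \
        /dev/sde missing

With every second member of the 4-disk raid10 missing, one copy of each
stripe survives, giving the same on-disk layout as the 2-disk raid0.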
Maybe try version 1 metadata (mdadm --metadata=1). I doubt that would
make a difference, but as I am grasping at straws already, it may be a
straw worth trying.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 2:45 ` Neil Brown
2007-05-17 3:11 ` Jeff Zheng
@ 2007-05-17 4:45 ` david
2007-05-17 5:03 ` Neil Brown
1 sibling, 1 reply; 18+ messages in thread
From: david @ 2007-05-17 4:45 UTC (permalink / raw)
To: Neil Brown
Cc: Jeff Zheng, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
On Thu, 17 May 2007, Neil Brown wrote:
> On Thursday May 17, Jeff.Zheng@endace.com wrote:
>>
>>> The only difference of any significance between the working
>>> and non-working configurations is that in the non-working,
>>> the component devices are larger than 2Gig, and hence have
>>> sector offsets greater than 32 bits.
>>
>> Do u mean 2T here?, but in both configuartion, the component devices are
>> larger than 2T (2.25T&5.5T).
>
> Yes, I meant 2T, and yes, the components are always over 2T.
2T decimal or 2T binary?
> So I'm
> at a complete loss. The raid0 code follows the same paths and does
> the same things and uses 64bit arithmetic where needed.
>
> So I have no idea how there could be a difference between these two
> cases.
>
> I'm at a loss...
>
> NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 4:45 ` david
@ 2007-05-17 5:03 ` Neil Brown
2007-05-17 5:31 ` Neil Brown
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-17 5:03 UTC (permalink / raw)
To: david
Cc: Jeff Zheng, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
On Wednesday May 16, david@lang.hm wrote:
> On Thu, 17 May 2007, Neil Brown wrote:
>
> > On Thursday May 17, Jeff.Zheng@endace.com wrote:
> >>
> >>> The only difference of any significance between the working
> >>> and non-working configurations is that in the non-working,
> >>> the component devices are larger than 2Gig, and hence have
> >>> sector offsets greater than 32 bits.
> >>
> >> Do u mean 2T here?, but in both configuartion, the component devices are
> >> larger than 2T (2.25T&5.5T).
> >
> > Yes, I meant 2T, and yes, the components are always over 2T.
>
> 2T decimal or 2T binary?
>
Either. The smallest was actually 2.75T (typo above).
Precisely it was
2929641472 kilobytes
or
5859282944 sectors
or
0x15D3D9000 sectors.
So it is over 32bits already...
Uhm, I just noticed something.
'chunk' is unsigned long, and when it gets shifted up, we might lose
bits. That could still happen with the 4*2.75T arrangement, but is
much more likely in the 2*5.5T arrangement.
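As a user-space illustration (a minimal sketch, not from the original thread:
the component size is taken from the assembly log above, and the 64KB chunk
size, hence chunksize_bits = 6, is an assumed default):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Assumed values: a 5.5TB component is 5859284992 KB (from the
	 * assembly log above); with a 64KB chunk, chunksize_bits is 6. */
	uint64_t dev_size_kb = 5859284992ULL;
	unsigned chunksize_bits = 6;

	/* Highest chunk index on that component. */
	uint64_t chunk = dev_size_kb >> chunksize_bits;

	/* On i386 an unsigned long is 32 bits, so shifting the chunk index
	 * back up to a kilobyte offset wraps once the result exceeds 2^32. */
	uint32_t wrapped = (uint32_t)chunk << chunksize_bits;
	uint64_t correct = (uint64_t)chunk << chunksize_bits;

	printf("32-bit shift: %u KB\n", (unsigned)wrapped);   /* 1564317696 */
	printf("64-bit shift: %llu KB\n",
	       (unsigned long long)correct);                  /* 5859284992 */
	return 0;
}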
Jeff, can you try this patch?
Thanks.
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid0.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c 2007-05-17 10:33:30.000000000 +1000
+++ ./drivers/md/raid0.c 2007-05-17 15:02:15.000000000 +1000
@@ -475,7 +475,7 @@ static int raid0_make_request (request_q
x = block >> chunksize_bits;
tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
}
- rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
+ rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
+ sect_in_chunk;
bio->bi_bdev = tmp_dev->bdev;
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 4:32 ` Neil Brown
@ 2007-05-17 5:08 ` Jeff Zheng
0 siblings, 0 replies; 18+ messages in thread
From: Jeff Zheng @ 2007-05-17 5:08 UTC (permalink / raw)
To: Neil Brown
Cc: Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
> What is the nature of the corruption? Is it data in a file
> that is wrong when you read it back, or does the filesystem
> metadata get corrupted?
The corruption is in fs metadata; jfs is completely destroyed. After
umount, fsck does not recognize it as jfs anymore. Xfs gives a kernel
crash, but seems still recoverable.
>
> Can you try the configuration that works, and sha1sum the
> files after you have written them to make sure that they
> really are correct?
We have verified the data on the working configuration: we have written
around 900 identical 10G files and verified that the md5sums are actually
the same. The verification took two days though :)
> My thought here is "maybe there is a bad block on one device,
> and the block is used for data in the 'working' config, and
> for metadata in the 'broken' config.
>
> Can you try a degraded raid10 configuration. e.g.
>
> mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \
> /dev/second missing
>
> That will lay out the data in exactly the same place as with
> raid0, but will use totally different code paths to access
> it. If you still get a problem, then it isn't in the raid0 code.
I will try this later today, as I'm now trying different sizes for the
components.
3.4T seems to be working; testing 4.1T right now.
> Maybe try version 1 metadata (mdadm --metadata=1). I doubt
> that would make a difference, but as I am grasping at straws
> already, it may be a straw woth trying.
Well, the problem may also be in the 3ware disk array, or the disk array
driver. The guy complaining about the same problem is also using a 3ware
disk array controller. But there is no way to verify that, and a single
disk array has been working fine for us.
Jeff
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 5:03 ` Neil Brown
@ 2007-05-17 5:31 ` Neil Brown
2007-05-17 5:38 ` Jeff Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-05-17 5:31 UTC (permalink / raw)
To: david, Jeff Zheng, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
On Thursday May 17, neilb@suse.de wrote:
>
> Uhm, I just noticed something.
> 'chunk' is unsigned long, and when it gets shifted up, we might lose
> bits. That could still happen with the 4*2.75T arrangement, but is
> much more likely in the 2*5.5T arrangement.
Actually, it cannot be a problem with the 4*2.75T arrangement.
chunk << chunksize_bits
will not exceed the size of the underlying device *in kilobytes*.
In that case that is 0xAE9EC800, which will fit in a 32-bit long.
We don't double it to make sectors until after we add
zone->dev_offset, which is "sector_t" and so 64-bit arithmetic is used.
So I'm quite certain this bug will cause exactly the problems
experienced!!
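Putting rough numbers on that (component sizes taken from the assembly logs
above; the truncated value is illustrative arithmetic, not something reported
in the thread):

  4 x 2.75T: largest per-component KB offset ~ 2929641472 = 0xAE9EC800,
             which is below 2^32, so the 32-bit shift never wraps.
  2 x 5.5T : largest per-component KB offset ~ 5859284992 = 0x15D3D9800,
             which exceeds 2^32 and is truncated to 0x5D3D9800
             (1564317696 KB), about 4TB too low, so late writes land on
             top of earlier data and metadata.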
>
> Jeff, can you try this patch?
Don't bother about the other tests I mentioned, just try this one.
Thanks.
NeilBrown
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
> ./drivers/md/raid0.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> --- .prev/drivers/md/raid0.c 2007-05-17 10:33:30.000000000 +1000
> +++ ./drivers/md/raid0.c 2007-05-17 15:02:15.000000000 +1000
> @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
> x = block >> chunksize_bits;
> tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
> }
> - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> + rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
> + sect_in_chunk;
>
> bio->bi_bdev = tmp_dev->bdev;
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 5:31 ` Neil Brown
@ 2007-05-17 5:38 ` Jeff Zheng
2007-05-17 22:55 ` Jeff Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Jeff Zheng @ 2007-05-17 5:38 UTC (permalink / raw)
To: Neil Brown, david, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
Yeah, seems you've locked it down :D. I've written 600GB of data now,
and everything is still fine.
Will let it run overnight and fill the whole 11T. I'll post the result
tomorrow.
Thanks a lot though.
Jeff
> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Thursday, 17 May 2007 5:31 p.m.
> To: david@lang.hm; Jeff Zheng; Michal Piotrowski; Ingo
> Molnar; linux-raid@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org
> Subject: RE: Software raid0 will crash the file-system, when
> each disk is 5TB
>
> On Thursday May 17, neilb@suse.de wrote:
> >
> > Uhm, I just noticed something.
> > 'chunk' is unsigned long, and when it gets shifted up, we
> might lose
> > bits. That could still happen with the 4*2.75T arrangement, but is
> > much more likely in the 2*5.5T arrangement.
>
> Actually, it cannot be a problem with the 4*2.75T arrangement.
> chuck << chunksize_bits
>
> will not exceed the size of the underlying device *in*kilobytes*.
> In that case that is 0xAE9EC800 which will git in a 32bit long.
> We don't double it to make sectors until after we add
> zone->dev_offset, which is "sector_t" and so 64bit arithmetic is used.
>
> So I'm quite certain this bug will cause exactly the problems
> experienced!!
>
> >
> > Jeff, can you try this patch?
>
> Don't bother about the other tests I mentioned, just try this one.
> Thanks.
>
> NeilBrown
>
> > Signed-off-by: Neil Brown <neilb@suse.de>
> >
> > ### Diffstat output
> > ./drivers/md/raid0.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> > --- .prev/drivers/md/raid0.c 2007-05-17
> 10:33:30.000000000 +1000
> > +++ ./drivers/md/raid0.c 2007-05-17 15:02:15.000000000 +1000
> > @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
> > x = block >> chunksize_bits;
> > tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
> > }
> > - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> > + rsect = ((((sector_t)chunk << chunksize_bits) +
> > +zone->dev_offset)<<1)
> > + sect_in_chunk;
> >
> > bio->bi_bdev = tmp_dev->bdev;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 5:38 ` Jeff Zheng
@ 2007-05-17 22:55 ` Jeff Zheng
2007-05-18 0:21 ` Neil Brown
2007-05-22 21:31 ` Bill Davidsen
0 siblings, 2 replies; 18+ messages in thread
From: Jeff Zheng @ 2007-05-17 22:55 UTC (permalink / raw)
To: Neil Brown, david, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
Fix confirmed: filled the whole 11T array without crashing.
I presume this would go into 2.6.22.
Thanks again.
Jeff
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org
> [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Jeff Zheng
> Sent: Thursday, 17 May 2007 5:39 p.m.
> To: Neil Brown; david@lang.hm; Michal Piotrowski; Ingo
> Molnar; linux-raid@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org
> Subject: RE: Software raid0 will crash the file-system, when
> each disk is 5TB
>
>
> Yeah, seems you've locked it down, :D. I've written 600GB of
> data now, and anything is still fine.
> Will let it run overnight, and fill the whole 11T. I'll post
> the result tomorrow
>
> Thanks a lot though.
>
> Jeff
>
> > -----Original Message-----
> > From: Neil Brown [mailto:neilb@suse.de]
> > Sent: Thursday, 17 May 2007 5:31 p.m.
> > To: david@lang.hm; Jeff Zheng; Michal Piotrowski; Ingo Molnar;
> > linux-raid@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-fsdevel@vger.kernel.org
> > Subject: RE: Software raid0 will crash the file-system,
> when each disk
> > is 5TB
> >
> > On Thursday May 17, neilb@suse.de wrote:
> > >
> > > Uhm, I just noticed something.
> > > 'chunk' is unsigned long, and when it gets shifted up, we
> > might lose
> > > bits. That could still happen with the 4*2.75T
> arrangement, but is
> > > much more likely in the 2*5.5T arrangement.
> >
> > Actually, it cannot be a problem with the 4*2.75T arrangement.
> > chuck << chunksize_bits
> >
> > will not exceed the size of the underlying device *in*kilobytes*.
> > In that case that is 0xAE9EC800 which will git in a 32bit long.
> > We don't double it to make sectors until after we add
> > zone->dev_offset, which is "sector_t" and so 64bit
> arithmetic is used.
> >
> > So I'm quite certain this bug will cause exactly the problems
> > experienced!!
> >
> > >
> > > Jeff, can you try this patch?
> >
> > Don't bother about the other tests I mentioned, just try this one.
> > Thanks.
> >
> > NeilBrown
> >
> > > Signed-off-by: Neil Brown <neilb@suse.de>
> > >
> > > ### Diffstat output
> > > ./drivers/md/raid0.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> > > --- .prev/drivers/md/raid0.c 2007-05-17
> > 10:33:30.000000000 +1000
> > > +++ ./drivers/md/raid0.c 2007-05-17 15:02:15.000000000 +1000
> > > @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
> > > x = block >> chunksize_bits;
> > > tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
> > > }
> > > - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> > > + rsect = ((((sector_t)chunk << chunksize_bits) +
> > > +zone->dev_offset)<<1)
> > > + sect_in_chunk;
> > >
> > > bio->bi_bdev = tmp_dev->bdev;
> >
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 22:55 ` Jeff Zheng
@ 2007-05-18 0:21 ` Neil Brown
2007-05-22 21:31 ` Bill Davidsen
1 sibling, 0 replies; 18+ messages in thread
From: Neil Brown @ 2007-05-18 0:21 UTC (permalink / raw)
To: Jeff Zheng
Cc: david, Michal Piotrowski, Ingo Molnar, linux-raid, linux-kernel,
linux-fsdevel
On Friday May 18, Jeff.Zheng@endace.com wrote:
> Fix confirmed, filled the whole 11T hard disk, without crashing.
> I presume this would go into 2.6.22
Yes, and probably 2.6.21.y, though the patch will be slightly
different, see below.
>
> Thanks again.
And thank-you for pursuing this with me.
NeilBrown
---------------------------
Avoid overflow in raid0 calculation with large components.
If a raid0 array has a component device larger than 4TB, and is accessed on
a 32-bit machine, then, as 'chunk' is an unsigned long,
chunk << chunksize_bits
can overflow (this can be as high as the size of the device in KB).
chunk itself will not overflow (without triggering a BUG).
So change 'chunk' to be 'sector_t', and get rid of the 'BUG' as it becomes
impossible to hit.
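For reference, the threshold implied here is the point where a kilobyte
offset no longer fits in 32 bits (simple arithmetic, not spelled out in the
thread):

  2^32 KB = 4294967296 KB = 4TB (binary)

so any single raid0 component of 4TB or more can trigger the wrap on a
32-bit host, which matches the 5.5T components failing while the 2.75T
ones were fine.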
Cc: "Jeff Zheng" <Jeff.Zheng@endace.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid0.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c 2007-05-17 10:33:30.000000000 +1000
+++ ./drivers/md/raid0.c 2007-05-17 16:14:12.000000000 +1000
@@ -415,7 +415,7 @@ static int raid0_make_request (request_q
raid0_conf_t *conf = mddev_to_conf(mddev);
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;
- unsigned long chunk;
+ sector_t chunk;
sector_t block, rsect;
const int rw = bio_data_dir(bio);
@@ -470,7 +470,6 @@ static int raid0_make_request (request_q
sector_div(x, zone->nb_dev);
chunk = x;
- BUG_ON(x != (sector_t)chunk);
x = block >> chunksize_bits;
tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Software raid0 will crash the file-system, when each disk is 5TB
2007-05-17 22:55 ` Jeff Zheng
2007-05-18 0:21 ` Neil Brown
@ 2007-05-22 21:31 ` Bill Davidsen
1 sibling, 0 replies; 18+ messages in thread
From: Bill Davidsen @ 2007-05-22 21:31 UTC (permalink / raw)
To: Jeff Zheng
Cc: Neil Brown, david, Michal Piotrowski, Ingo Molnar, linux-raid,
linux-kernel, linux-fsdevel
Jeff Zheng wrote:
> Fix confirmed, filled the whole 11T hard disk, without crashing.
> I presume this would go into 2.6.22
>
Since it results in a full loss of data, I would hope it goes into
2.6.21.x -stable.
> Thanks again.
>
> Jeff
>
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org
>> [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Jeff Zheng
>> Sent: Thursday, 17 May 2007 5:39 p.m.
>> To: Neil Brown; david@lang.hm; Michal Piotrowski; Ingo
>> Molnar; linux-raid@vger.kernel.org;
>> linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org
>> Subject: RE: Software raid0 will crash the file-system, when
>> each disk is 5TB
>>
>>
>> Yeah, seems you've locked it down, :D. I've written 600GB of
>> data now, and anything is still fine.
>> Will let it run overnight, and fill the whole 11T. I'll post
>> the result tomorrow
>>
>> Thanks a lot though.
>>
>> Jeff
>>
>>> -----Original Message-----
>>> From: Neil Brown [mailto:neilb@suse.de]
>>> Sent: Thursday, 17 May 2007 5:31 p.m.
>>> To: david@lang.hm; Jeff Zheng; Michal Piotrowski; Ingo Molnar;
>>> linux-raid@vger.kernel.org; linux-kernel@vger.kernel.org;
>>> linux-fsdevel@vger.kernel.org
>>> Subject: RE: Software raid0 will crash the file-system,
>> when each disk
>>> is 5TB
>>>
>>> On Thursday May 17, neilb@suse.de wrote:
>>>> Uhm, I just noticed something.
>>>> 'chunk' is unsigned long, and when it gets shifted up, we
>>> might lose
>>>> bits. That could still happen with the 4*2.75T
>> arrangement, but is
>>>> much more likely in the 2*5.5T arrangement.
>>> Actually, it cannot be a problem with the 4*2.75T arrangement.
>>> chuck << chunksize_bits
>>>
>>> will not exceed the size of the underlying device *in*kilobytes*.
>>> In that case that is 0xAE9EC800 which will git in a 32bit long.
>>> We don't double it to make sectors until after we add
>>> zone->dev_offset, which is "sector_t" and so 64bit
>> arithmetic is used.
>>> So I'm quite certain this bug will cause exactly the problems
>>> experienced!!
>>>
>>>> Jeff, can you try this patch?
>>> Don't bother about the other tests I mentioned, just try this one.
>>> Thanks.
>>>
>>> NeilBrown
>>>
>>>> Signed-off-by: Neil Brown <neilb@suse.de>
>>>>
>>>> ### Diffstat output
>>>> ./drivers/md/raid0.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
>>>> --- .prev/drivers/md/raid0.c 2007-05-17
>>> 10:33:30.000000000 +1000
>>>> +++ ./drivers/md/raid0.c 2007-05-17 15:02:15.000000000 +1000
>>>> @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
>>>> x = block >> chunksize_bits;
>>>> tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
>>>> }
>>>> - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
>>>> + rsect = ((((sector_t)chunk << chunksize_bits) +
>>>> +zone->dev_offset)<<1)
>>>> + sect_in_chunk;
>>>>
>>>> bio->bi_bdev = tmp_dev->bdev;
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
^ permalink raw reply [flat|nested] 18+ messages in thread
Thread overview: 18+ messages
[not found] <659F626D666070439A4A5965CD6EBF406836C6@gazelle.ad.endace.com>
2007-05-15 23:29 ` Software raid0 will crash the file-system, when each disk is 5TB Michal Piotrowski
2007-05-16 0:03 ` Neil Brown
2007-05-16 1:56 ` Jeff Zheng
2007-05-16 17:28 ` Bill Davidsen
2007-05-16 17:58 ` david
2007-05-17 0:48 ` Neil Brown
2007-05-17 2:09 ` Jeff Zheng
2007-05-17 2:45 ` Neil Brown
2007-05-17 3:11 ` Jeff Zheng
2007-05-17 4:32 ` Neil Brown
2007-05-17 5:08 ` Jeff Zheng
2007-05-17 4:45 ` david
2007-05-17 5:03 ` Neil Brown
2007-05-17 5:31 ` Neil Brown
2007-05-17 5:38 ` Jeff Zheng
2007-05-17 22:55 ` Jeff Zheng
2007-05-18 0:21 ` Neil Brown
2007-05-22 21:31 ` Bill Davidsen