* make filesystem failed while the capacity of raid5 is big than 16TB
@ 2012-09-12  7:04 vincent
  2012-09-12  7:32 ` Jack Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: vincent @ 2012-09-12  7:04 UTC (permalink / raw)
  To: linux-raid

Hi, everyone:
        I am Vincent, and I am writing to ask a question about creating a
file system on my raid5.
        I created a raid5 from 16 * 2T disks, and that worked fine.
        Then I used mke2fs to create a file system on the raid5.
        Unfortunately, it failed.
        The output was:
        # mke2fs -t ext4 /dev/md126
          mke2fs 1.41.12 (17-May-2010)
          mke2fs: Size of device /dev/md126 too big to be expressed in 32 bits
          using a blocksize of 4096.
        Has anyone had the same problem? Could you help me?
        The version of my mdadm is 3.2.2, and the version of my kernel is
2.6.38.
        Thanks.


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12  7:04 make filesystem failed while the capacity of raid5 is big than 16TB vincent
@ 2012-09-12  7:32 ` Jack Wang
  2012-09-12  7:37 ` Chris Dunlop
  2012-09-12  7:58 ` David Brown
  2 siblings, 0 replies; 22+ messages in thread
From: Jack Wang @ 2012-09-12  7:32 UTC (permalink / raw)
  To: vincent; +Cc: linux-raid

You need to use a bigger blocksize or switch to XFS.
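
For example, the XFS route could be as simple as this (the mount point is
just an example, and this is untested here):

    mkfs.xfs /dev/md126           # XFS has no 16TB limit with 4K blocks
    mount /dev/md126 /mnt/storage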

Jack

2012/9/12 vincent <hanguozhong@meganovo.com>:
> Hi, everyone:
>         I am Vincent, I am writing to you to ask a question about how to
> make file system about my raid5.
>         I created a raid5 with 16 *2T disks, it was OK.
>         Then I used mk2fs to make file system for the raid5.
>         Unfortunately, it was failed.
>         The output was:
>         # mke2fs -t ext4 /dev/md126
>           mke2fs 1.41.12 (17-May-2010)
>           mke2fs: Size of device /dev/md126 too big to be expressed in 32
> bits
>           using a blocksize of 4096.
>         Is anyone had the same problem? Could you help me?
>         The version of my mdadm is 3.2.2, and the version of my kernel is
> 2.6.38
>         Thanks.

* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12  7:04 make filesystem failed while the capacity of raid5 is big than 16TB vincent
  2012-09-12  7:32 ` Jack Wang
@ 2012-09-12  7:37 ` Chris Dunlop
  2012-09-12  7:58 ` David Brown
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Dunlop @ 2012-09-12  7:37 UTC (permalink / raw)
  To: linux-raid

On 2012-09-12, vincent <hanguozhong@meganovo.com> wrote:
> Hi, everyone:
>         I am Vincent, I am writing to you to ask a question about how to
> make file system about my raid5.
>         I created a raid5 with 16 *2T disks, it was OK.
>         Then I used mk2fs to make file system for the raid5.
>         Unfortunately, it was failed.
>         The output was:
>         # mke2fs -t ext4 /dev/md126
>           mke2fs 1.41.12 (17-May-2010)
>           mke2fs: Size of device /dev/md126 too big to be expressed in 32
> bits
>           using a blocksize of 4096.
>         Is anyone had the same problem? Could you help me?
>         The version of my mdadm is 3.2.2, and the version of my kernel is
> 2.6.38

You need a newer mke2fs: file systems > 16 TB weren't supported until 1.42:

http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42
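
For example, with 1.42 or later something like this should work (early 1.42
releases may need the 64-bit feature requested explicitly, so treat this as
a sketch):

    mke2fs -V                            # check the installed version
    mke2fs -t ext4 -O 64bit /dev/md126   # the 64bit feature allows > 16TB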

Chris


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12  7:04 make filesystem failed while the capacity of raid5 is big than 16TB vincent
  2012-09-12  7:32 ` Jack Wang
  2012-09-12  7:37 ` Chris Dunlop
@ 2012-09-12  7:58 ` David Brown
       [not found]   ` <CACY-59cLmV2SRY+FrvhHxseDD1+r-B-3bOKPGzJdGttW+9U2mw@mail.gmail.com>
  2 siblings, 1 reply; 22+ messages in thread
From: David Brown @ 2012-09-12  7:58 UTC (permalink / raw)
  To: vincent; +Cc: linux-raid

On 12/09/2012 09:04, vincent wrote:
> Hi, everyone:
>          I am Vincent, I am writing to you to ask a question about how to
> make file system about my raid5.
>          I created a raid5 with 16 *2T disks, it was OK.
>          Then I used mk2fs to make file system for the raid5.
>          Unfortunately, it was failed.
>          The output was:
>          # mke2fs -t ext4 /dev/md126
>            mke2fs 1.41.12 (17-May-2010)
>            mke2fs: Size of device /dev/md126 too big to be expressed in 32
> bits
>            using a blocksize of 4096.
>          Is anyone had the same problem? Could you help me?
>          The version of my mdadm is 3.2.2, and the version of my kernel is
> 2.6.38
>          Thanks.
>

You need e2fsprogs version 1.42 or above to create an ext4 filesystem 
larger than 16 TB.

However, it is more common to use XFS for such large filesystems.

Another possibility is putting an LVM physical volume on the array and 
making multiple smaller logical partitions for your filesystem.
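
A rough sketch of the LVM approach (the volume group and logical volume
names here are just examples, and the sizes should be adjusted to taste):

    pvcreate /dev/md126
    vgcreate vg_storage /dev/md126
    lvcreate -L 8T -n lv_data1 vg_storage
    lvcreate -L 8T -n lv_data2 vg_storage
    mke2fs -t ext4 /dev/vg_storage/lv_data1   # each LV stays below 16TB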

Almost certainly, however, a raid5 of 16 disks is a bad idea. 
Performance for writes will be terrible, as will parallel reads and 
writes (though that will improve dramatically as the current 
developments in multi-threaded raid5 make their way into mainstream 
distros).  And it is very poor from a reliability viewpoint - your risk 
of a second failure during a rebuild is high with a 16 disk raid5.

What is your actual application here?  If you tell us how this system 
will be used, it will be a lot easier to advise you on a better solution 
(perhaps a raid6, perhaps a raid10, perhaps a number of raid5 systems 
connected with raid0, perhaps multiple raid0 or raid5 arrays with a 
linear concatenation and XFS).
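
As one concrete illustration of those alternatives, a single 16-disk raid6
with XFS could be created along these lines (drive names are placeholders,
untested here):

    mdadm --create /dev/md0 --level=6 --raid-devices=16 /dev/sd[b-q]
    mkfs.xfs /dev/md0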



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
       [not found]   ` <CACY-59cLmV2SRY+FrvhHxseDD1+r-B-3bOKPGzJdGttW+9U2mw@mail.gmail.com>
@ 2012-09-12  9:46     ` David Brown
  2012-09-12 14:13       ` Stan Hoeppner
  2012-09-13  3:21       ` GuoZhong Han
  0 siblings, 2 replies; 22+ messages in thread
From: David Brown @ 2012-09-12  9:46 UTC (permalink / raw)
  To: GuoZhong Han, Linux RAID

Hi,

(Please remember to include the linux raid mailing list in your replies, 
unless you have particular reason for posting off-list.

And please post in plain-text only, not HTML.  HTML screws up quoting 
and formatting for emails.)


On 12/09/2012 11:14, GuoZhong Han wrote:
> Hi David:
>              Thanks for your reply.
>              From your point of view, I feel like I made some mistakes.
> And when I created the raid5,
>              I noticed that the speed of the recovery is slower than
> 4*2T raid5.while 4*2T can reach to 140MB/s,16*2T raid 70MB/s.

I'll let others say exactly what is going on here, but I suspect it is 
simply that the single-threaded nature of raid5 gives poor performance 
here (as mentioned earlier, the md raid developers are working hard on 
this).

>              The requirement of my application is :
>              1.There are 16 2T disks in the system, the app must be able
> to identify these disks.
>              2.The users can create a raid0,raid10 or raid5 use the
> disks they designated.
>              3.Performance for writes of the array will reach at least
> 100MB per second.

This does not make sense as a set of requirements unless you are making 
a disk tester.  1 and 2 are a list of possible solutions, not a 
description of the application and requirements.

Your requirements should be in terms of the required/desired storage 
space, the required/desired speeds, the required/desired redundancy, and 
the required/desired cost.

The best solution to this depends highly on the mixture of reads and 
writes, the type of access (lots of small accesses or large streamed 
accesses), the level of parallelism (a few big client machines or 
processes, or lots in parallel), and the types of files (a few big 
files, lots of small files, databases, etc.).


>              I had not tested the write performace for write of 16*2T raid5.
>              There was a same problem about 4*2T raid5 and 8*2T raid5.
> when the array was going to be full,
>              the speed of the write performance tend to slower, and it
> can not reach to 100MB/s.
>              could you give me some advice?

Yes - don't make raid5 arrays from large numbers of disks.  The 
performance is always bad (except for individual large streamed reads), 
and the redundancy is bad.  If you are trying to maximise the storage 
space for a given cost, then at least use raid6 unless your data is 
worthless (though performance will be even worse, especially for 
writes).  Otherwise there are better ways to structure your array.

Once we know what you are trying to do, we can give better advice.

mvh.,

David



> [...]


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12  9:46     ` David Brown
@ 2012-09-12 14:13       ` Stan Hoeppner
  2012-09-13  7:06         ` David Brown
  2012-09-13  3:21       ` GuoZhong Han
  1 sibling, 1 reply; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-12 14:13 UTC (permalink / raw)
  To: David Brown; +Cc: GuoZhong Han, Linux RAID

On 9/12/2012 4:46 AM, David Brown wrote:

>>              The requirement of my application is :
>>              1.There are 16 2T disks in the system, the app must be able
>> to identify these disks.
>>              2.The users can create a raid0,raid10 or raid5 use the
>> disks they designated.
>>              3.Performance for writes of the array will reach at least
>> 100MB per second.
> 
> This does not make sense as a set of requirements unless you are making
> a disk tester.  1 and 2 are a list of possible solutions, not a
> description of the application and requirements.

It makes perfect sense if the OP is designing a storage appliance
product and a management front end for it.  Based on the information
given, this seems to be the case.

-- 
Stan



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12  9:46     ` David Brown
  2012-09-12 14:13       ` Stan Hoeppner
@ 2012-09-13  3:21       ` GuoZhong Han
  2012-09-13  3:34         ` Mathias Buren
                           ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: GuoZhong Han @ 2012-09-13  3:21 UTC (permalink / raw)
  To: David Brown; +Cc: stan, linux-raid

Hi David:

         I am sorry that my last mail did not describe the
requirements of the system very clearly.

         I will describe the requirements of the system in more detail.

         This system has a 36-core CPU; the frequency of each core is
1.2 GHz. The system is designed to be a storage appliance product with a
management front end. The users can insert up to 16 disks into
the system and use the interface provided by the appliance to
manage the disks and the arrays. The users can create a raid0, raid10
or raid5 from the disks they designate. After the array has been
created, the users can write whatever data they want to the array.

         1. The system must support parallel writes to more than 150
files; the write speed of each file will reach 1 MB/s. If the array is
full, its data is wiped so it can be written again.

         2. The system must be able to read multiple files in parallel.

         3. Use as much of the storage space as possible.

         4. The system must have a certain amount of redundancy; when a
disk fails, the users can replace the failed disk with another disk.

         5. The system must support disk hot-swap.

         I have tested the write performance of a 4*2T raid5 and an
8*2T raid5, both with an ext4 file system, a 128K chunk size and a
stripe_cache_size of 2048. At the beginning, these two raid5s
worked well. But they showed the same problem: when the array was close
to full, the write speed slowed down, and a lot of data was lost while
writing 1 MB/s to 150 files in parallel.

         As you said, the write performance of a 16*2T raid5 will be
terrible, so how many disks do you think would be appropriate for a
raid5?

         I do not know whether I have described the requirements of the
system accurately. I hope I can get your advice.

2012/9/12 David Brown <david.brown@hesbynett.no>
> [...]


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  3:21       ` GuoZhong Han
@ 2012-09-13  3:34         ` Mathias Buren
  2012-09-13  7:13           ` David Brown
  2012-09-13  7:30         ` David Brown
  2012-09-13 13:25         ` Stan Hoeppner
  2 siblings, 1 reply; 22+ messages in thread
From: Mathias Buren @ 2012-09-13  3:34 UTC (permalink / raw)
  To: GuoZhong Han; +Cc: David Brown, stan, linux-raid

On 13/09/12 11:21, GuoZhong Han wrote:
> Hi David:
>
>           I am sorry for last mail that I had not described the
> requirements of the system very clear.
>
>           I will detail for you to describe the requirements of the system.
>

(snip)

>
>           As you said, the performance for write of 16*2T raid5 will be
> terrible, so what do you think that how many disks to be build to a
> raid5 will be more appropriate?

Personally I wouldn't use more than 5 drives in a RAID5 with drives 
larger than 1TB; the failure risk is too high. With 16x 2TB drives, how 
about two RAID6 arrays of 8 drives each, then RAID0 them? (RAID60)

Or, two RAID6 arrays with 7 drives each, 2 hot spares, and RAID0 on top. 
(RAID60 + 2 hot spares)
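
A sketch of the first layout, with placeholder drive names (untested):

    mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[b-i]
    mdadm --create /dev/md2 --level=6 --raid-devices=8 /dev/sd[j-q]
    mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2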

You mention 36 cores. Perhaps you should try the very latest mdadm 
versions and Linux kernels (perhaps from the MD Linux git tree), and 
enable the multicore option.
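
If I remember correctly, the relevant switch in kernels of this era is the
experimental CONFIG_MULTICORE_RAID456 option; something like this should
show whether it is enabled (the config file path varies by distro):

    grep MULTICORE_RAID456 /boot/config-$(uname -r)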

Mathias


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-12 14:13       ` Stan Hoeppner
@ 2012-09-13  7:06         ` David Brown
  0 siblings, 0 replies; 22+ messages in thread
From: David Brown @ 2012-09-13  7:06 UTC (permalink / raw)
  To: stan; +Cc: GuoZhong Han, Linux RAID

On 12/09/2012 16:13, Stan Hoeppner wrote:
> On 9/12/2012 4:46 AM, David Brown wrote:
>
>>>               The requirement of my application is :
>>>               1.There are 16 2T disks in the system, the app must be able
>>> to identify these disks.
>>>               2.The users can create a raid0,raid10 or raid5 use the
>>> disks they designated.
>>>               3.Performance for writes of the array will reach at least
>>> 100MB per second.
>>
>> This does not make sense as a set of requirements unless you are making
>> a disk tester.  1 and 2 are a list of possible solutions, not a
>> description of the application and requirements.
>
> It makes perfect sense if the OP is designing a storage appliance
> product and a management front end for it.  Based on the information
> given, this seems to be the case.
>

If he had said "up to 16 disks" of "up to 2TB", and a more general 
performance description, then I would have agreed.  I read the 
requirements as being /exactly/ 16 2TB disks, which I thought odd.  But 
the OP has replied with more information anyway.

mvh.,

David



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  3:34         ` Mathias Buren
@ 2012-09-13  7:13           ` David Brown
  0 siblings, 0 replies; 22+ messages in thread
From: David Brown @ 2012-09-13  7:13 UTC (permalink / raw)
  To: Mathias Buren; +Cc: GuoZhong Han, stan, linux-raid

On 13/09/2012 05:34, Mathias Buren wrote:
> On 13/09/12 11:21, GuoZhong Han wrote:
>> Hi David:
>>
>>           I am sorry for last mail that I had not described the
>> requirements of the system very clear.
>>
>>           I will detail for you to describe the requirements of the
>> system.
>>
>
> (snip)
>
>>
>>           As you said, the performance for write of 16*2T raid5 will be
>> terrible, so what do you think that how many disks to be build to a
>> raid5 will be more appropriate?
>
> Personally I wouldn't use more than 5 drives in a RAID5 with drives
> larger than 1TB, the failure risk is too high. With 16x 2TB drives, how
> about two RAID6 arrays of 8 drives each, then RAID0 them? (RAID60)
>
> Or, two RAID6 arrays with 7 drives each, 2 hotspares, and RAID0 on top.
> (RAID10 + 2 HSP)
>

I wouldn't bother with hotspares with RAID6 unless service and 
replacement of a dead disk is going to take a long time - you already 
have double redundancy with the raid6.  Raid6 on 8 disks is already 
orders of magnitude safer than raid5 with 16 disks - once you have a 
higher risk of the power supply catching fire and burning /all/ your 
disks, you don't benefit from even greater redundancy!

> You mention 36 cores. Perhaps you should try the very latest mdadm
> versions and Linux kernels (perhaps from the MD Linux git tree), and
> enable the multicore option.
>

If that is possible for the OP, then that is definitely worth trying. 
It is this kind of setup that will benefit most from the newer 
multithreading support.




* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  3:21       ` GuoZhong Han
  2012-09-13  3:34         ` Mathias Buren
@ 2012-09-13  7:30         ` David Brown
  2012-09-13  7:43           ` John Robinson
  2012-09-13 13:25         ` Stan Hoeppner
  2 siblings, 1 reply; 22+ messages in thread
From: David Brown @ 2012-09-13  7:30 UTC (permalink / raw)
  To: GuoZhong Han; +Cc: stan, linux-raid

On 13/09/2012 05:21, GuoZhong Han wrote:
> Hi David:
>
>           I am sorry for last mail that I had not described the
> requirements of the system very clear.
>
>           I will detail for you to describe the requirements of the system.
>
>           This system has a 36 cores CPU, the frequency of each core is
> 1.2G. The system is designed to be a storage appliance product and a
> management front end for it. The users can insert up to 16 disks to
> the system and uses the interface that given by the appliance to
> manage the disks and the arrays. The users can create a raid0, raid10
> and raid5 use the disks they designated. After the array is be
> created, the users can write to the array where the data they want.
>

First ask yourself if your users really need that flexibility.  A 
16-disk raid0 only makes sense if the data is worthless, or at least 
very easily restored.  If you can decide in advance that you will 
support just one array arrangement, it will make life much easier for 
both you and your customers.

>           1. The system must support parallel write more than 150
> files; the speed of each will reach to 1M/s. If the array is full,
> wipe its data to re-write.
>
>           2. Necessarily parallel the ability to read multiple files.
>
>           3. as much as possible to use the storage space
>
>           4. The system must have certain redundancy, when a disk
> failed, the users can use other disk instead of the failed disk.
>

There is always a bit of trade-off here.  XFS over a linear cat of 8 
raid1 pairs (or 7 pairs and 2 hot spares) is going to give you the best 
parallel performance, especially for smaller accesses - but you lose 
half your disk space with raid1.  Using 2 x 8-disk raid6 connected by 
raid0 (or linear concat for XFS) will be faster and safer than a single 
large raid5/6 array, but unless you use the latest multithreaded raid 
kernel then it will still be very slow.  4 x 4-disk raid5 could also be 
a good choice, especially if you can't use multithreaded raid.  (Again, 
using XFS over a linear cat of the parts scales better for parallel 
access than using ext4 over a raid0.)

>           5. The system must support disk hot-swap
>

That should be possible.  But be /very/ careful that you have a way of 
being sure which disk should be replaced!  I would actually recommend 
raid6 combined with a write-intent bitmap, rather than raid5, when using 
hot-swap - that way if you pop the wrong disk, you can just put it back 
in again and carry on.  Remember to protect users against human failure 
as well as hardware failure!
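
For example (the device and drive names are placeholders only):

    # create a raid6 with an internal write-intent bitmap
    mdadm --create /dev/md0 --level=6 --raid-devices=8 --bitmap=internal /dev/sd[b-i]
    # or add a bitmap to an existing array later
    mdadm --grow /dev/md0 --bitmap=internal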

>           I have tested the performance for write of 4*2T raid5 and
> 8*2T raid5 of which the file system is ext4, the chuck size is 128K
> and the strip_cache_size is 2048. At the beginning, these two raid5s
> worked well. But there was a same problem, when the array was going to
> be full, the speeds of the write performance tend to slower, there
> were lots of data lost while parallel write 1M/s to 150 files.
>
>           As you said, the performance for write of 16*2T raid5 will be
> terrible, so what do you think that how many disks to be build to a
> raid5 will be more appropriate?
>
>           I do not know whether I describe the requirement of the
> system accurately. I hope I can get your advice.
>
> [...]


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  7:30         ` David Brown
@ 2012-09-13  7:43           ` John Robinson
  2012-09-13  9:15             ` David Brown
  0 siblings, 1 reply; 22+ messages in thread
From: John Robinson @ 2012-09-13  7:43 UTC (permalink / raw)
  To: David Brown; +Cc: GuoZhong Han, stan, linux-raid

On 13/09/2012 08:30, David Brown wrote:
[...]
>   Using 2 x 8-disk raid6 connected by
> raid0 (or linear concat for XFS) will be faster and safer than a single
> large raid5/6 array, but unless you use the latest multithreaded raid
> kernel then it will still be very slow.

Not as far as I understand it, it won't. The multithreaded code is only 
really a benefit on SSDs which can manage tens of thousands of IOPS, 
while on spinning rust HDDs which can only manage hundreds of IOPS, the 
single-threaded code is fine.

Cheers,

John.

-- 
John Robinson, yuiop IT services
0131 557 9577 / 07771 784 058
46/12 Broughton Road, Edinburgh EH7 4EE


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  7:43           ` John Robinson
@ 2012-09-13  9:15             ` David Brown
  0 siblings, 0 replies; 22+ messages in thread
From: David Brown @ 2012-09-13  9:15 UTC (permalink / raw)
  To: John Robinson; +Cc: GuoZhong Han, stan, linux-raid

On 13/09/2012 09:43, John Robinson wrote:
> On 13/09/2012 08:30, David Brown wrote:
> [...]
>>   Using 2 x 8-disk raid6 connected by
>> raid0 (or linear concat for XFS) will be faster and safer than a single
>> large raid5/6 array, but unless you use the latest multithreaded raid
>> kernel then it will still be very slow.
>
> Not as far as I understand it, it won't. The multithreaded code is only
> really a benefit on SSDs which can manage tens of thousands of IOPS,
> while on spinning rust HDDs which can only manage hundreds of IOPS, the
> single-threaded code is fine.
>

HDDs have longer latencies, so they can't do lots of different accesses 
in rapid succession.  But they have pretty high throughputs for streamed 
reads and writes, once they get going.  Raid6 writes involve quite a bit 
of processing to calculate the second parity - when you have enough HD 
spindles you will saturate the performance of a single CPU core in 
processing speed, memory bandwidth and IO bandwidth.  I freely admit 
that I'm speculating here without real numbers, but I believe this cpu 
bottleneck is part of the problems people see with larger raid5/6 arrays.

It is certainly true that you will see the biggest difference with SSDs, 
especially in the IOPS numbers.



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13  3:21       ` GuoZhong Han
  2012-09-13  3:34         ` Mathias Buren
  2012-09-13  7:30         ` David Brown
@ 2012-09-13 13:25         ` Stan Hoeppner
  2012-09-13 13:52           ` David Brown
  2012-09-18  9:35           ` GuoZhong Han
  2 siblings, 2 replies; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-13 13:25 UTC (permalink / raw)
  To: GuoZhong Han; +Cc: David Brown, linux-raid

On 9/12/2012 10:21 PM, GuoZhong Han wrote:

>          This system has a 36 cores CPU, the frequency of each core is
> 1.2G. 

Obviously not an x86 CPU.  36 cores.  Must be a Tilera chip.

GuoZhong, be aware that high core count systems are a poor match for
Linux md/RAID levels 1/5/6/10.  These md/RAID drivers currently utilize
a single write thread, and thus can only use one CPU core at a time.

To begin to sufficiently scale these md array types across 36x 1.2GHz
cores you would need something like the following configurations, all
striped together or concatenated with md or LVM:

72x md/RAID1 mirror pairs
36x 4 disk RAID10 arrays
36x 4 disk RAID6 arrays
36x 3 disk RAID5 arrays

Patches are currently being developed to increase the parallelism of
RAID1/5/6/10 but will likely not be ready for production kernels for
some time.   These patches will however still not allow scaling an
md/RAID driver across such a high core count.  You'll still need
multiple arrays to take advantage of 36 cores.  Thus, this 16 drive
storage appliance would have much better performance with a single/dual
core CPU with a 2-3GHz clock speed.

> The users can create a raid0, raid10
> and raid5 use the disks they designated.

This is a storage appliance.  Due to the market you're targeting, the
RAID level should be chosen by the manufacturer and not selectable by
the user.  Choice is normally a good thing.  But with this type of
product, allowing users the choice of array type will simply cause your
company may problems.  You will constantly field support issues about
actual performance not meeting expectations, etc.  And you don't want to
allow RAID5 under any circumstances for a storage appliance product.  In
this category, most users won't immediately replace failed drives, so
you need to "force" the extra protection of RAID6 or RAID10 upon the
customer.

If I were doing such a product, I'd immediately toss out the 36 core
logic platform and switch to a low power single/dual core x86 chip.  And
as much as I disdain parity RAID, for such an appliance I'd make RAID6
the factory default, not changeable by the user.  Since md/RAID doesn't
scale well across multicore CPUs, and because wide parity arrays yield
poor performance, I would make 2x 8 drive RAID6 arrays at the factory,
concatenate them with md/RAID linear, and format the linear device with
XFS.  Manually force a 64KB chunk size for the RAID6 arrays.  You don't
want the 512KB default in a storage appliance.  Specify stripe alignment
when formatting with XFS.  In this case, su=64K and sw=6.  See "man
mdadm" and "man mkfs.xfs".
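
Putting that together, roughly (drive names and md device numbers are
placeholders, untested here):

    mdadm --create /dev/md1 --level=6 --chunk=64 --raid-devices=8 /dev/sd[b-i]
    mdadm --create /dev/md2 --level=6 --chunk=64 --raid-devices=8 /dev/sd[j-q]
    mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
    mkfs.xfs -d su=64k,sw=6 /dev/md0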

>          1. The system must support parallel write more than 150
> files; the speed of each will reach to 1M/s. 

For highly parallel write workloads you definitely want XFS.

> If the array is full,
> wipe its data to re-write.

What do you mean by this?  Surely you don't mean to arbitrarily erase
user data to make room for more user data.

>          2. Necessarily parallel the ability to read multiple files.

Again, XFS best fits this requirement.

>          3. as much as possible to use the storage space

RAID6 is the best option here for space efficiency and resilience to
array failure.  RAID5 is asking for heartache, especially in an
appliance product, where users tend to neglect the box until it breaks
to the point of no longer working.

>          4. The system must have certain redundancy, when a disk
> failed, the users can use other disk instead of the failed disk.

That's what RAID is for, so you're on the right track. ;)

>          5. The system must support disk hot-swap

That's up to your hardware design.  Lots of pre-built solutions are
already on the OEM market.

>          I have tested the performance for write of 4*2T raid5 and
> 8*2T raid5 of which the file system is ext4, the chuck size is 128K
> and the strip_cache_size is 2048. At the beginning, these two raid5s
> worked well. But there was a same problem, when the array was going to
> be full, the speeds of the write performance tend to slower, there
> were lots of data lost while parallel write 1M/s to 150 files.

You shouldn't have lost data doing this.  That suggests some other
problem.  EXT4 is not particularly adept at managing free space
fragmentation.  XFS will do much better here.  But even with XFS,
depending on the workload and the "aging" of the filesystem, it
will slow down considerably when the filesystem approaches ~95%
full.  This obviously depends a bit on drive size and total array size
as well.  5% of a 12TB filesystem is quite a bit less than 5% of a 36TB
filesystem, 600GB vs 1.8TB.  And the degradation depends on what types of files
you're writing and how many in parallel to your nearly full XFS.

>          As you said, the performance for write of 16*2T raid5 will be
> terrible, so what do you think that how many disks to be build to a
> raid5 will be more appropriate?

Again, do not use RAID5 for a storage appliance.  Use RAID6 instead, and
use multiple RAID6 arrays concatenated together.

>          I do not know whether I describe the requirement of the
> system accurately. I hope I can get your advice.

You described it well, except for the part about wipe data and rewrite
when array is full.

-- 
Stan



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13 13:25         ` Stan Hoeppner
@ 2012-09-13 13:52           ` David Brown
  2012-09-13 22:47             ` Stan Hoeppner
  2012-09-18  9:35           ` GuoZhong Han
  1 sibling, 1 reply; 22+ messages in thread
From: David Brown @ 2012-09-13 13:52 UTC (permalink / raw)
  To: stan; +Cc: GuoZhong Han, linux-raid

On 13/09/2012 15:25, Stan Hoeppner wrote:
> On 9/12/2012 10:21 PM, GuoZhong Han wrote:
>
>>           This system has a 36 cores CPU, the frequency of each core is
>> 1.2G.
>
> Obviously not an x86 CPU.  36 cores.  Must be a Tilera chip.
>

I don't know of any other 36-core chips - but the OP would have to 
answer that.

> GuoZhong, be aware that high core count systems are a poor match for
> Linux md/RAID levels 1/5/6/10.  These md/RAID drivers currently utilize
> a single write thread, and thus can only use one CPU core at a time.
>

Even with multitheaded raid support, such high core-count chips are not 
ideal for this sort of application.

> To begin to sufficiently scale these md array types across 36x 1.2GHz
> cores you would need something like the following configurations, all
> striped together or concatenated with md or LVM:
>
> 72x md/RAID1 mirror pairs
>   36x 4 disk RAID10 arrays
>   36x 4 disk RAID6 ararys
>   36x 3 disk RAID5 arrays
>
> Patches are currently being developed to increase the parallelism of
> RAID1/5/6/10 but will likely not be ready for production kernels for
> some time.   These patches will however still not allow scaling an
> md/RAID driver across such a high core count.  You'll still need
> multiple arrays to take advantage of 36 cores.  Thus, this 16 drive
> storage appliance would have much better performance with a single/dual
> core CPU with a 2-3GHz clock speed.
>

I doubt if the OP is aiming to saturate all 36 cores.  There is no need 
to scale across all the cores - the aim is just to spread the load 
amongst enough cores that processing power is not a bottleneck.  If you 
can achieve this with four cores in use and 32 cores sitting idle, then 
that is just as good as running 36 cores at 10% capacity.

But I absolutely agree that it is a lot easier to achieve the required 
performance with a few fast cores than lots of slower cores.

The other issue to consider here is IO and memory bandwidths - high core 
count chips don't have the bandwidth to fully utilise the cores on 
storage applications.

> If I were doing such a product, I'd immediately toss out the 36 core
> logic platform and switch to a low power single/dual core x86 chip.

I'd go for at least two, but probably four cores - the difference in 
price is going to be irrelevant compared to the rest of the hardware. 
But I agree that large numbers of cores are probably wasted.

The only reason I would want lots of cores here is if the device is more 
than just a storage array.  For example, if you are compressing or 
encrypting the data, or using encryption on the network connections, 
then extra cores will be useful.



* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13 13:52           ` David Brown
@ 2012-09-13 22:47             ` Stan Hoeppner
  0 siblings, 0 replies; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-13 22:47 UTC (permalink / raw)
  To: David Brown; +Cc: GuoZhong Han, linux-raid

On 9/13/2012 8:52 AM, David Brown wrote:
> On 13/09/2012 15:25, Stan Hoeppner wrote:

>> If I were doing such a product, I'd immediately toss out the 36 core
>> logic platform and switch to a low power single/dual core x86 chip.
> 
> I'd go for at least two, but probably four cores - the difference in
> price is going to be irrelevant compared to the rest of the hardware.
> But I agree that large numbers of cores are probably wasted.

Price is only one concern of many.  With an embedded storage appliance
TDP and low power draw are as important as price.  For instance, a 64bit
dual core, 4 thread 1.86GHz Intel Atom N2800 has a TDP of 6.5 watts.
Each core is easily capable of handling IO and parity for 8 rust drives
making it an excellent fit for this storage appliance.  There are dual
core MIPS64 chips that are suitable as well.

-- 
Stan




* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-13 13:25         ` Stan Hoeppner
  2012-09-13 13:52           ` David Brown
@ 2012-09-18  9:35           ` GuoZhong Han
  2012-09-18 10:22             ` David Brown
  2012-09-18 21:20             ` Stan Hoeppner
  1 sibling, 2 replies; 22+ messages in thread
From: GuoZhong Han @ 2012-09-18  9:35 UTC (permalink / raw)
  To: stan; +Cc: David Brown, linux-raid

Hi Stan:
        Thanks for your advice. In your last mail, you mentioned the XFS
file system. Following your suggestion, I changed the file system on the
raid5 (4*2T, chunk size: 128K, stripe_cache_size: 2048) from ext4 to XFS.
Then I did a write performance test on XFS.
The test was as follows:
        My program used 4 threads to write to 30 files in parallel, at a
writing speed of 1 MB/s per file. Each thread was bound to a
single core. The expected total speed should be stable at 30 MB/s. I
recorded the total writing speed every second during the test. Compared
with ext4, the performance of XFS when the array was close to full
had indeed improved. The time needed to create the XFS
file system was also much less than for ext4. However, I found that
the total speed wasn't steady. Although most of the time the speed
reached 30 MB/s, it fell to only about 10 MB/s in rare cases. Writing to
30 files in parallel was supposed to be easy. Why did this happen?


2012/9/13 Stan Hoeppner <stan@hardwarefreak.com>:
> [...]

* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-18  9:35           ` GuoZhong Han
@ 2012-09-18 10:22             ` David Brown
  2012-09-18 21:38               ` Stan Hoeppner
  2012-09-18 21:20             ` Stan Hoeppner
  1 sibling, 1 reply; 22+ messages in thread
From: David Brown @ 2012-09-18 10:22 UTC (permalink / raw)
  To: GuoZhong Han; +Cc: stan, linux-raid

On 18/09/2012 11:35, GuoZhong Han wrote:
> Hi Stan:
>          Thanks for your advice. In your last mail, you mentioned XFS
> file system. According to your suggestion, I changed the file system
> from raid5 (4*2T, chunksize: 128K, strip_catch_size:2048) to XFS. Then
> I did a write performance test on XFS.
> The test was as follows:
>          My program used 4 threads to do parallel writing to 30 files
> with 1MB/s writing speed on each file. Each thread was bound on a
> single core. The estimated total speed should be stable at 30MB/s. I
> recorded the total writing speed every second in the test. Compared
> with speed of ext4, when the array was going to be full, the
> performance of XFS has indeed increased. The time to create the XFS
> file system was much less than the cost of ext4. However, I found that
> the total speed wasn’t steady. Although most of time the speed can
> reach to 30M/s, it fell to only about 10MB/s in rare cases. Writing to
> 30 files in parallel was supposed to be easy. Why did this happen?
>
>

Two questions - what is the XFS built on? 4 x 2TB in a linear 
concatenation, or something else?

Secondly, are all your files in the same directory, or in different 
directories?  XFS scales by using multiple threads for different 
allocation groups, and putting these groups in different places on the 
underlying disk or disks - but files in the same directory go in the 
same allocation group.  So 30 files in 30 directories will give much 
more parallelism than 30 files in 1 directory.
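
So it may be worth re-running the test with one directory per stream for
comparison, e.g. something along these lines (paths are examples only):

    # one directory per writer, so XFS can spread the streams across AGs
    for i in $(seq 1 30); do
        mkdir -p /mnt/test/stream$i
        dd if=/dev/zero of=/mnt/test/stream$i/file bs=1M count=1024 &
    done
    wait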




* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-18  9:35           ` GuoZhong Han
  2012-09-18 10:22             ` David Brown
@ 2012-09-18 21:20             ` Stan Hoeppner
  1 sibling, 0 replies; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-18 21:20 UTC (permalink / raw)
  To: GuoZhong Han; +Cc: David Brown, linux-raid, xfs@oss.sgi.com

I'm copying the XFS list as this discussion has migrated more toward
filesystem/workload tuning.

On 9/18/2012 4:35 AM, GuoZhong Han wrote:
> Hi Stan:
>         Thanks for your advice. In your last mail, you mentioned XFS
> file system. According to your suggestion, I changed the file system
> from raid5 (4*2T, chunksize: 128K, strip_catch_size:2048) to XFS. Then
> I did a write performance test on XFS.
> The test was as follows:
>         My program used 4 threads to do parallel writing to 30 files
> with 1MB/s writing speed on each file. Each thread was bound on a
> single core. The estimated total speed should be stable at 30MB/s. I
> recorded the total writing speed every second in the test. Compared
> with speed of ext4, when the array was going to be full, the
> performance of XFS has indeed increased. The time to create the XFS
> file system was much less than the cost of ext4. However, I found that
> the total speed wasn’t steady. Although most of time the speed can
> reach to 30M/s, it fell to only about 10MB/s in rare cases. Writing to
> 30 files in parallel was supposed to be easy. Why did this happen?

We'll need more details of your test program and the kernel version
you're using, as well as the directory/file layout used in testing.
Your fstab entry for the filesystem, as well as xfs_info output, are
also needed.

In general, this type of behavior is due to the disks not being able to
seek quickly enough to satisfy all requests, causing latency, and thus
the dip in bandwidth.  Writing 30 files in parallel to 3x SATA stripe
members is going to put a large seek load on the disks.  If one of your
tests adds some metadata writes to this workload, the extra writes to
the journal and directory inodes may be enough to saturate the head
actuators.  Additionally, write barriers are enabled by default, and so
flushing of the drive caches after journal writes may be playing a role
here as well.
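
For reference, the kind of detail requested above could be collected with
something like the following (the device and mount point names are
placeholders for whatever your test box actually uses):

    uname -r                # kernel version
    grep xfs /etc/fstab     # the fstab entry / mount options in use
    xfs_info /mnt/xfs       # AG count, sunit/swidth, log size
    cat /proc/mdstat        # array layout and chunk size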


> 2012/9/13 Stan Hoeppner <stan@hardwarefreak.com>:
>> On 9/12/2012 10:21 PM, GuoZhong Han wrote:
>>
>>>          This system has a 36-core CPU; the frequency of each core is
>>> 1.2GHz.
>>
>> Obviously not an x86 CPU.  36 cores.  Must be a Tilera chip.
>>
>> GuoZhong, be aware that high core count systems are a poor match for
>> Linux md/RAID levels 1/5/6/10.  These md/RAID drivers currently utilize
>> a single write thread, and thus can only use one CPU core at a time.
>>
>> To begin to sufficiently scale these md array types across 36x 1.2GHz
>> cores you would need something like the following configurations, all
>> striped together or concatenated with md or LVM:
>>
>> 72x md/RAID1 mirror pairs
>>  36x 4 disk RAID10 arrays
>>  36x 4 disk RAID6 arrays
>>  36x 3 disk RAID5 arrays
>>
>> Patches are currently being developed to increase the parallelism of
>> RAID1/5/6/10 but will likely not be ready for production kernels for
>> some time.   These patches will however still not allow scaling an
>> md/RAID driver across such a high core count.  You'll still need
>> multiple arrays to take advantage of 36 cores.  Thus, this 16 drive
>> storage appliance would have much better performance with a single/dual
>> core CPU with a 2-3GHz clock speed.
>>
>>> The users can create a raid0, raid10
>>> and raid5 using the disks they designated.
>>
>> This is a storage appliance.  Due to the market you're targeting, the
>> RAID level should be chosen by the manufacturer and not selectable by
>> the user.  Choice is normally a good thing.  But with this type of
>> product, allowing users the choice of array type will simply cause your
>> company may problems.  You will constantly field support issues about
>> actual performance not meeting expectations, etc.  And you don't want to
>> allow RAID5 under any circumstances for a storage appliance product.  In
>> this category, most users won't immediately replace failed drives, so
>> you need to "force" the extra protection of RAID6 or RAID10 upon the
>> customer.
>>
>> If I were doing such a product, I'd immediately toss out the 36 core
>> logic platform and switch to a low power single/dual core x86 chip.  And
>> as much as I disdain parity RAID, for such an appliance I'd make RAID6
>> the factory default, not changeable by the user.  Since md/RAID doesn't
>> scale well across multicore CPUs, and because wide parity arrays yield
>> poor performance, I would make 2x 8 drive RAID6 arrays at the factory,
>> concatenate them with md/RAID linear, and format the linear device with
>> XFS.  Manually force a 64KB chunk size for the RAID6 arrays.  You don't
>> want the 512KB default in a storage appliance.  Specify stripe alignment
>> when formatting with XFS.  In this case, su=64K and sw=6.  See "man
>> mdadm" and "man mkfs.xfs".
>>
>>>          1. The system must support parallel writes to more than 150
>>> files; the speed of each will reach 1MB/s.
>>
>> For highly parallel write workloads you definitely want XFS.
>>
>>> If the array is full,
>>> wipe its data to re-write.
>>
>> What do you mean by this?  Surely you don't mean to arbitrarily erase
>> user data to make room for more user data.
>>
>>>          2. The ability to read multiple files in parallel is necessary.
>>
>> Again, XFS best fits this requirement.
>>
>>>          3. Use as much of the storage space as possible.
>>
>> RAID6 is the best option here for space efficiency and resilience to
>> array failure.  RAID5 is asking for heartache, especially in an
>> appliance product, where users tend to neglect the box until it breaks
>> to the point of no longer working.
>>
>>>          4. The system must have some redundancy: when a disk
>>> fails, the user can substitute another disk for the failed one.
>>
>> That's what RAID is for, so you're on the right track. ;)
>>
>>>          5. The system must support disk hot-swap
>>
>> That's up to your hardware design.  Lots of pre-built solutions are
>> already on the OEM market.
>>
>>>          I have tested the write performance of a 4*2T raid5 and an
>>> 8*2T raid5, both with an ext4 file system, a chunk size of 128K
>>> and a stripe_cache_size of 2048. At the beginning, these two raid5s
>>> worked well. But they had the same problem: when the array was getting
>>> full, write performance tended to slow down, and lots of data was lost
>>> while writing 1MB/s to 150 files in parallel.
>>
>> You shouldn't have lost data doing this.  That suggests some other
>> problem.  EXT4 is not particularly adept at managing free space
>> fragmentation.  XFS will do much better here.  But even with XFS,
>> depending on the workload and the "aging" of the filesystem, even XFS
>> will slow down considerably when the filesystem approaches ~95%
>> full.  This obviously depends a bit on drive size and total array size
>> as well.  5% of a 12TB filesystem is quite a bit less than 5% of a
>> 36TB filesystem: 600GB vs 1.8TB.  And the degradation depends on what
>> types of files
>> you're writing and how many in parallel to your nearly full XFS.
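
If it helps, free space fragmentation can be watched as the filesystem
fills; a rough sketch (mount point and device names are placeholders, and
xfs_db is best run against an unmounted or quiesced filesystem):

    df -h /mnt/xfs                        # overall fill level
    xfs_db -r -c "freesp -s" /dev/md126   # free extent size histogram/summary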
>>
>>>          As you said, the write performance of a 16*2T raid5 will be
>>> terrible, so how many disks do you think would be more appropriate for
>>> a raid5?
>>
>> Again, do not use RAID5 for a storage appliance.  Use RAID6 instead, and
>> use multiple RAID6 arrays concatenated together.
>>
>>>          I do not know whether I describe the requirement of the
>>> system accurately. I hope I can get your advice.
>>
>> You described it well, except for the part about wipe data and rewrite
>> when array is full.
>>
>> --
>> Stan
>>


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-18 10:22             ` David Brown
@ 2012-09-18 21:38               ` Stan Hoeppner
  2012-09-19  7:20                 ` David Brown
  0 siblings, 1 reply; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-18 21:38 UTC (permalink / raw)
  To: David Brown; +Cc: GuoZhong Han, linux-raid

On 9/18/2012 5:22 AM, David Brown wrote:
> On 18/09/2012 11:35, GuoZhong Han wrote:
>> Hi Stan:
>>          Thanks for your advice. In your last mail, you mentioned XFS
>> file system. According to your suggestion, I changed the file system
>> from raid5 (4*2T, chunksize: 128K, stripe_cache_size:2048) to XFS. Then
>> I did a write performance test on XFS.
>> The test was as follows:
>>          My program used 4 threads to do parallel writing to 30 files
>> with 1MB/s writing speed on each file. Each thread was bound on a
>> single core. The estimated total speed should be stable at 30MB/s. I
>> recorded the total writing speed every second in the test. Compared
>> with speed of ext4, when the array was going to be full, the
>> performance of XFS has indeed increased. The time to create the XFS
>> file system was much less than the cost of ext4. However, I found that
>> the total speed wasn’t steady. Although most of time the speed can
>> reach to 30M/s, it fell to only about 10MB/s in rare cases. Writing to
>> 30 files in parallel was supposed to be easy. Why did this happen?
>>
>>
> 
> Two questions - what is the XFS built on? 4 x 2TB in a linear
> concatenation, or something else?

According to the above it's a 4 drive RAID5.

> Secondly, are all your files in the same directory, or in different
> directories?  XFS scales by using multiple threads for different
> allocation groups, 

This is partially correct if he's using the inode64 allocator.  Do note
multiple XFS write threads can target the same AG and get parallel
performance.  What you are referring to above is writing to multiple AGs
in parallel, where each AG resides on a different member device of a
concatenation.
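
For completeness, inode64 is a mount option, and on a kernel as old as the
2.6.38 mentioned earlier in the thread it is not yet the default; enabling
it would look something like this (device and mount point are placeholders):

    mount -o inode64 /dev/md126 /mnt/xfs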

Writing to say 16 AGs in parallel where all reside on the same disk
array will actually decrease performance compared to 16 writes to one AG
on that array.  The reason is the latter causes far less head travel
between writes.

> and putting these groups in different places on the
> underlying disk or disks - but files in the same directory go in the
> same allocation group.  So 30 files in 30 directories will give much
> more parallelism than 30 files in 1 directory.

Actually, no.  The level of parallelism is the same--30 concurrent
writes.  As noted above, the increase in performance comes from locating
each of the AGs on a different disk, or array.  This decreases the
number of seeks required per write, especially with parity arrays.

-- 
Stan


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-18 21:38               ` Stan Hoeppner
@ 2012-09-19  7:20                 ` David Brown
  2012-09-19 16:00                   ` Stan Hoeppner
  0 siblings, 1 reply; 22+ messages in thread
From: David Brown @ 2012-09-19  7:20 UTC (permalink / raw)
  To: stan; +Cc: GuoZhong Han, linux-raid

On 18/09/2012 23:38, Stan Hoeppner wrote:
> On 9/18/2012 5:22 AM, David Brown wrote:
>> On 18/09/2012 11:35, GuoZhong Han wrote:
>>> Hi Stan:
>>>           Thanks for your advice. In your last mail, you mentioned XFS
>>> file system. According to your suggestion, I changed the file system
>>> from raid5 (4*2T, chunksize: 128K, stripe_cache_size:2048) to XFS. Then
>>> I did a write performance test on XFS.
>>> The test was as follows:
>>>           My program used 4 threads to do parallel writing to 30 files
>>> with 1MB/s writing speed on each file. Each thread was bound on a
>>> single core. The estimated total speed should be stable at 30MB/s. I
>>> recorded the total writing speed every second in the test. Compared
>>> with speed of ext4, when the array was going to be full, the
>>> performance of XFS has indeed increased. The time to create the XFS
>>> file system was much less than the cost of ext4. However, I found that
>>> the total speed wasn’t steady. Although most of time the speed can
>>> reach to 30M/s, it fell to only about 10MB/s in rare cases. Writing to
>>> 30 files in parallel was supposed to be easy. Why did this happen?
>>>
>>>
>>
>> Two questions - what is the XFS built on? 4 x 2TB in a linear
>> concatenation, or something else?
>
> According to the above it's a 4 drive RAID5.

He wrote "I changed the file system from raid5 (4 x 2T) to XFS", so I am 
looking for clarification here.

>
>> Secondly, are all your files in the same directory, or in different
>> directories?  XFS scales by using multiple threads for different
>> allocation groups,
>
> This is partially correct if he's using the inode64 allocator.  Do note
> multiple XFS write threads can target the same AG and get parallel
> performance.

I didn't know that - there is always something new to learn!

However, I don't think that should make a huge difference - after all, 
the work done by these threads is going to be fairly small until you 
actually get to writing out the data to the AG.  Latency for the 
application might be reduced a little, but disk throughput will not 
benefit much.

> What you are referring to above is writing to multiple AGs
> in parallel, where each AG resides on a different member device of a
> concatenation.

Yes, although I know that each AG does not necessarily reside on a 
different member device.

As far as I see it now, there are three stages -

1. write threads (can be several per AG)
2. AGs (can be several per disk)
3. Disks (members of a linear concat)
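
As a purely hypothetical illustration of those three stages on a small box
(placeholder device names, not the OP's 4-drive RAID5): mirror pairs joined
in a linear concat, with enough allocation groups that several sit on each
member:

    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd /dev/sde
    mdadm --create /dev/md20 --level=linear --raid-devices=2 /dev/md10 /dev/md11
    mkfs.xfs -d agcount=8 /dev/md20    # four AGs end up on each mirror pair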

>
> Writing to say 16 AGs in parallel where all reside on the same disk
> array will actually decrease performance compared to 16 writes to one AG
> on that array.  The reason is the latter causes far less head travel
> between writes.
>

Yes.

>> and putting these groups in different places on the
>> underlying disk or disks - but files in the same directory go in the
>> same allocation group.  So 30 files in 30 directories will give much
>> more parallelism than 30 files in 1 directory.
>
> Actually, no.  The level of parallelism is the same--30 concurrent
> writes.  As noted above, the increase in performance comes from locating
> each of the AGs on a different disk, or array.  This decreases the
> number of seeks required per write, especially with parity arrays.
>

OK, so you get 30 parallel logical writes, but if it does not translate 
into multiple parallel physical writes to the disks by having multiple 
member disks, then the gains are small.


* Re: make filesystem failed while the capacity of raid5 is big than 16TB
  2012-09-19  7:20                 ` David Brown
@ 2012-09-19 16:00                   ` Stan Hoeppner
  0 siblings, 0 replies; 22+ messages in thread
From: Stan Hoeppner @ 2012-09-19 16:00 UTC (permalink / raw)
  To: David Brown; +Cc: GuoZhong Han, linux-raid

On 9/19/2012 2:20 AM, David Brown wrote:
> On 18/09/2012 23:38, Stan Hoeppner wrote:

>> Actually, no.  The level of parallelism is the same--30 concurrent
>> writes.  As noted above, the increase in performance comes from locating
>> each of the AGs on a different disk, or array.  This decreases the
>> number of seeks required per write, especially with parity arrays.

> OK, so you get 30 parallel logical writes, but if it does not translate
> into multiple parallel physical writes to the disks by having multiple
> member disks, then the gains are small.

The problem in the OP's case isn't a lack of physical write parallelism,
but most likely a problem of seek starvation caused by write parallelism.
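
A rough way to confirm that from the outside (sysstat's iostat, with
placeholder member-disk names) is to watch per-disk latency while the
30-stream test runs:

    iostat -dx 1 /dev/sd[b-e]
    # if await/%util on the members spike while aggregate throughput dips,
    # the heads are saturated with seeks rather than the filesystem stalling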

-- 
Stan



Thread overview: 22+ messages
2012-09-12  7:04 make filesystem failed while the capacity of raid5 is big than 16TB vincent
2012-09-12  7:32 ` Jack Wang
2012-09-12  7:37 ` Chris Dunlop
2012-09-12  7:58 ` David Brown
     [not found]   ` <CACY-59cLmV2SRY+FrvhHxseDD1+r-B-3bOKPGzJdGttW+9U2mw@mail.gmail.com>
2012-09-12  9:46     ` David Brown
2012-09-12 14:13       ` Stan Hoeppner
2012-09-13  7:06         ` David Brown
2012-09-13  3:21       ` GuoZhong Han
2012-09-13  3:34         ` Mathias Buren
2012-09-13  7:13           ` David Brown
2012-09-13  7:30         ` David Brown
2012-09-13  7:43           ` John Robinson
2012-09-13  9:15             ` David Brown
2012-09-13 13:25         ` Stan Hoeppner
2012-09-13 13:52           ` David Brown
2012-09-13 22:47             ` Stan Hoeppner
2012-09-18  9:35           ` GuoZhong Han
2012-09-18 10:22             ` David Brown
2012-09-18 21:38               ` Stan Hoeppner
2012-09-19  7:20                 ` David Brown
2012-09-19 16:00                   ` Stan Hoeppner
2012-09-18 21:20             ` Stan Hoeppner
