linux-raid.vger.kernel.org archive mirror
* Best practice for large storage?
@ 2013-02-14 17:28 Roy Sigurd Karlsbakk
  2013-02-14 17:33 ` Jeff Johnson
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-14 17:28 UTC (permalink / raw)
  To: Linux RAID

Hi all

It seems we may need some storage for video soon. This is a 20,000-student college in Norway, with quite a few students in media-related studies. Since these students produce rather large amounts of raw material, typically stored for the duration of the semester, we may need some 50-100TiB, perhaps more. I have set up systems with this amount of storage before on ZFS, but may be using Linux MD for this project. I'm aware of the lack of checksumming, snapshots, etc. with Linux, but may be using it because of greater Linux knowledge among the sysadmins here. In such a setup, I guess nearline SAS drives on a SAS expander will be used, and with the amount of storage needed, I won't be using a single RAID-6 (too insecure) or RAID-10 (too expensive) for the lot. In ZFS-land I used smallish VDEVs (~10 drives each) in a large pool.

 - Would using LVM on top of RAID-6 give me something similar?
 - If so, should I stripe the RAID sets? And if striping them, will it be as easy to add new RAID sets as we run out of space?

Thanks, and best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotype etymology. In most cases, adequate and relevant synonyms exist in Norwegian.

* Re: Best practice for large storage?
  2013-02-14 17:28 Roy Sigurd Karlsbakk
@ 2013-02-14 17:33 ` Jeff Johnson
  2013-02-14 17:39   ` Roy Sigurd Karlsbakk
  2013-02-15  2:18 ` Chris Murphy
  2013-02-16  7:09 ` Stan Hoeppner
  2 siblings, 1 reply; 15+ messages in thread
From: Jeff Johnson @ 2013-02-14 17:33 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, Linux RAID

Roy,

Why not use ZFS on Linux? (http://www.zfsonlinux.org)

ZFS on Linux is being merged into the upcoming Lustre 2.4 parallel 
filesystem release.

--Jeff

On 2/14/13 9:28 AM, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> It seems we may need some storage for video soon. This is a 20,000-student college in Norway, with quite a few students in media-related studies. Since these students produce rather large amounts of raw material, typically stored for the duration of the semester, we may need some 50-100TiB, perhaps more. I have set up systems with this amount of storage before on ZFS, but may be using Linux MD for this project. I'm aware of the lack of checksumming, snapshots, etc. with Linux, but may be using it because of greater Linux knowledge among the sysadmins here. In such a setup, I guess nearline SAS drives on a SAS expander will be used, and with the amount of storage needed, I won't be using a single RAID-6 (too insecure) or RAID-10 (too expensive) for the lot. In ZFS-land I used smallish VDEVs (~10 drives each) in a large pool.
>
>   - Would using LVM on top of RAID-6 give me something similar?
>   - If so, should I stripe the RAID sets? And if striping them, will it be as easy to add new RAID sets as we run out of space?
>
> Thanks, and best regards
>
> roy


-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
m: 619-204-9061

/* New Address */
4170 Morena Boulevard, Suite D - San Diego, CA 92117


* Re: Best practice for large storage?
  2013-02-14 17:33 ` Jeff Johnson
@ 2013-02-14 17:39   ` Roy Sigurd Karlsbakk
  2013-02-14 17:48     ` Jeff Johnson
  0 siblings, 1 reply; 15+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-14 17:39 UTC (permalink / raw)
  To: Jeff Johnson; +Cc: Linux RAID

I don't feel like using the FUSE version, but is zfsonlinux really stable yet?

----- Original message -----
> Roy,
> 
> Why not use ZFS on Linux? (http://www.zfsonlinux.org)
> 
> ZFS on Linux is being merged into the upcoming Lustre 2.4 parallel
> filesystem release.
> 
> --Jeff
> 
> On 2/14/13 9:28 AM, Roy Sigurd Karlsbakk wrote:
> > [original message quoted above; snipped]

-- 
Vennlige hilsener / Best regards

roy

* Re: Best practice for large storage?
  2013-02-14 17:39   ` Roy Sigurd Karlsbakk
@ 2013-02-14 17:48     ` Jeff Johnson
  2013-02-15  9:51       ` Sebastian Riemer
  2013-02-16  5:40       ` Stan Hoeppner
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff Johnson @ 2013-02-14 17:48 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Linux RAID

Stable enough that it is being used at Lawrence Livermore National
Laboratory on a 55PB Lustre resource.

I've been using it on a pre-release Lustre 2.4 and I have not had any 
issues.


On 2/14/13 9:39 AM, Roy Sigurd Karlsbakk wrote:
> I don't feel like using the FUSE version, but is zfsonlinux really stable yet?
>
> ----- Original message -----
>> Roy,
>>
>> Why not use ZFS on Linux? (http://www.zfsonlinux.org)
>>
>> ZFS on Linux is being merged into the upcoming Lustre 2.4 parallel
>> filesystem release.
>>
>> --Jeff
>>
>> On 2/14/13 9:28 AM, Roy Sigurd Karlsbakk wrote:
>>> [original message quoted above; snipped]




* Re: Best practice for large storage?
       [not found] <CAH3kUhFbR3coJSwPvqqOGrqcsvoJpdAPgAAiAgc_th1ym33DzA@mail.gmail.com>
@ 2013-02-14 18:27 ` Roy Sigurd Karlsbakk
  0 siblings, 0 replies; 15+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-14 18:27 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Jeff Johnson, Linux RAID

----- Original message -----
> why not btrfs or xfs?

Well, because btrfs isn't stable and doesn't support RAID-[56] unless you're using the git tree where that is being tested, which is rather more unstable than the rest; and xfs is just a filesystem that needs to sit on top of a RAID of some kind.

Vennlige hilsener / Best regards

roy

* Re: Best practice for large storage?
       [not found] <CAH3kUhGVt1iyn9tt=2-+f6H++obOGSK3x0pBBPZV8CFUXjp5yw@mail.gmail.com>
@ 2013-02-14 23:23 ` Roy Sigurd Karlsbakk
  2013-02-14 23:42   ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-14 23:23 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Jeff Johnson, Linux RAID

xfs+lvm+mdadm works, but I'm still back to my original question. I don't care about what filesystem is on top, only about the storage underneath. Read the original question again, please.

----- Original message -----
> xfs+lvm+mdadm doesn't work? too complex?
>
> 2013/2/14 Roy Sigurd Karlsbakk <roy@karlsbakk.net>:
> > ----- Original message -----
> > > why not btrfs or xfs?
> >
> > Well, because btrfs isn't stable and doesn't support RAID-[56] unless you're using the git tree where that is being tested, which is rather more unstable than the rest; and xfs is just a filesystem that needs to sit on top of a RAID of some kind.
> >
> > [signature snipped]
>
> --
> Roberto Spadim


--
Vennlige hilsener / Best regards

roy

* Re: Best practice for large storage?
  2013-02-14 23:23 ` Roy Sigurd Karlsbakk
@ 2013-02-14 23:42   ` Roberto Spadim
  2013-02-15  1:01     ` Adam Goryachev
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2013-02-14 23:42 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Jeff Johnson, Linux RAID

Is the problem checksumming? I really don't understand why the filesystem
is not the problem, and why the storage is the 'key'.

storage -> mdadm + hard disks / hardware RAID + hard disks / network
storage + hard disks -> here the key is data integrity WITHOUT SILENT
DATA LOSS; today I have only seen this on enterprise hardware RAID
controllers + enterprise SAS disks.

lvm + filesystem -> no problem increasing storage -> here a filesystem
that can grow without problems is mandatory, since you will want to add
more disks... the lvm part is to work easily with the devices (a rough
sketch below).

any part that I forgot?
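For the grow path, something like this (an untested sketch -- the volume
group, LV, device names, and mount point are all placeholders):

~$ mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[b-i]
~$ pvcreate /dev/md1                  # make the new RAID6 an LVM PV
~$ vgextend bigvg /dev/md1            # add it to the volume group
~$ lvextend -l +100%FREE bigvg/data   # grow the logical volume
~$ xfs_growfs /srv/data               # grow the mounted filesystem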

2013/2/14 Roy Sigurd Karlsbakk <roy@karlsbakk.net>:
> xfs+lvm+mdadm works, but I'm still back to my original question. I don't care about what filesystem is on top, only about the storage underneath. Read the original question again, please.
>
> [earlier exchange quoted above; snipped]



--
Roberto Spadim

* Re: Best practice for large storage?
  2013-02-14 23:42   ` Roberto Spadim
@ 2013-02-15  1:01     ` Adam Goryachev
  2013-02-15  1:13       ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2013-02-15  1:01 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Roy Sigurd Karlsbakk, Jeff Johnson, Linux RAID

On 15/02/13 10:42, Roberto Spadim wrote:
> [Roberto's message quoted above; snipped]
>
> 2013/2/14 Roy Sigurd Karlsbakk <roy@karlsbakk.net>:
>> xfs+lvm+mdadm works, but I'm still back to my original question. I don't care about what filesystem is on top, only about the storage underneath. Read the original question again, please.

I assume the question can be reduced to:
1) You need a very large amount of space (requires a large number of disks)
2) You need to be able to expand that space over time
3) You want decent data redundancy
4) You will have a reasonable amount of concurrent access by multiple users and want decent performance

From my readings of the list, it would seem the suggestion is to use RAID6 + concatenation, with around 6 to 8 drives in each RAID6, and use XFS with suitable parameters to ensure it balances the directories across the multiple RAID6 groups. Basically, you want to put as many drives as possible into each RAID6 to reduce wasted space, but not too many, or else you risk a triple drive failure losing the whole lot.

If you did not need to grow the space, then you would use RAID60, and do striping, but I think you can't grow that, although some pages I just read suggest it might be possible to grow a raid0 by converting to raid4 and back again.

Another option would be to use LVM to join multiple RAID6s together.
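An untested sketch of that idea, using md/linear for the concatenation
(device names and the 8-drive split are placeholders, not a recommendation):

~$ mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[b-i]
~$ mdadm --create /dev/md2 --level=6 --raid-devices=8 /dev/sd[j-q]
~$ mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
~$ mkfs.xfs -d su=512k,sw=6 /dev/md0   # su = md chunk size, sw = data disks per RAID6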

Don't know if this helps, but hopefully it does.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000                            adam@websitemanagers.com.au
Fax: +61 2 8304 0001                            www.websitemanagers.com.au



* Re: Best practice for large storage?
  2013-02-15  1:01     ` Adam Goryachev
@ 2013-02-15  1:13       ` Roberto Spadim
  0 siblings, 0 replies; 15+ messages in thread
From: Roberto Spadim @ 2013-02-15  1:13 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Roy Sigurd Karlsbakk, Jeff Johnson, Linux RAID

The point on performance is:
will you do a 'per user' performance measure, or a 'total system' one?
In other words, will users each have a 'fixed' disk space, or will
everybody share all the disks?
A per-user space could allow you to do a RAID1 (or another RAID level)
for about 100 users?! In other words, 100 users have a total of 320MB/s
(SAS) disk performance and 1TB of disk space? Do your users need a
low-latency system? Did you consider a big cache, or an SSD cache?
If everybody will use the whole system, how do you want to share the
disks? Should they be striped or mirrored?
striped for better sequential reading
mirrored for better parallel reading

What does your system need?

2013/2/14 Adam Goryachev <adam@websitemanagers.com.au>:
> [message quoted above; snipped]



--
Roberto Spadim


* Re: Best practice for large storage?
  2013-02-14 17:28 Roy Sigurd Karlsbakk
  2013-02-14 17:33 ` Jeff Johnson
@ 2013-02-15  2:18 ` Chris Murphy
  2013-02-16  7:09 ` Stan Hoeppner
  2 siblings, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2013-02-15  2:18 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Linux RAID


On Feb 14, 2013, at 10:28 AM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:

> This is a 20,000-student college
>  large amounts of raw material
> 50-100TiB, perhaps more
> I guess nearline SAS drives
> I won't be using a single RAID-6 (too insecure) or RAID-10 (too expensive)

This could be a case for GlusterFS or Ceph.  You might look at those groups and see what they'd suggest for your use case. Even if it doesn't make sense right away, it might make sense sooner rather than later, in which case it's good to have an initial deployment that doesn't make it a hassle to move to Gluster when you're ready.

Stan makes a good case for SAS HBAs capable of doing RAID: they're inexpensive, fast, and reliable, you get support, and they don't really cost any more than the HBA you need anyway to connect all of these drives. Definitely use XFS for the resulting arrays. Then hand those over to GlusterFS as storage bricks (Ceph has a different arrangement and terms).

You can inquire whether the GlusterFS NFS client is suitable for this task if the clients are Windows or Mac, or whether it's better to set up one or more NFSv4 servers which are themselves using the native GlusterFS client. It likely depends on network bandwidth (for video, 10GigE is common), how many clients, etc. Gluster also scales well in performance and capacity; just add more bricks. And you won't need to bust the drive cap in your arrays, or stripe them.
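Creating a distributed Gluster volume from XFS-backed bricks is then
roughly (an untested sketch; hostnames, volume name, and brick paths are
placeholders):

~$ gluster volume create mediavol server1:/bricks/b1 server2:/bricks/b1
~$ gluster volume start mediavol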

As storage gets really big, the risk of non-drive failures increases to the point that it needs to be mitigated. And that's what a distributed file system will help you do.

Chris Murphy


* Re: Best practice for large storage?
  2013-02-14 17:48     ` Jeff Johnson
@ 2013-02-15  9:51       ` Sebastian Riemer
  2013-02-16 12:48         ` Roy Sigurd Karlsbakk
  2013-02-16  5:40       ` Stan Hoeppner
  1 sibling, 1 reply; 15+ messages in thread
From: Sebastian Riemer @ 2013-02-15  9:51 UTC (permalink / raw)
  To: Jeff Johnson; +Cc: Roy Sigurd Karlsbakk, Linux RAID

On 14.02.2013 18:48, Jeff Johnson wrote:
> Stable enough that it is being used at Lawrence Livermore National
> Laboratory on a 55PB Lustre resource.
> 
> I've been using it on a pre-release Lustre 2.4 and I have not had any
> issues.

ZFS fragments completely if you've got massive parallel write IO --
especially with Solaris 11. You'll get only 2-3 MiB/s after some time,
as everything is then stored completely randomly. So if you don't really
need the snapshots, you shouldn't use ZFS. NILFS is also good for
snapshots.

> On 2/14/13 9:39 AM, Roy Sigurd Karlsbakk wrote:
>> I don't feel like using the FUSE version, but is zfsonlinux really
>> stable yet?
>>
>> ----- Original message -----
>>> Roy,
>>>
>>> Why not use ZFS on Linux? (http://www.zfsonlinux.org)
>>>
>>> ZFS on Linux is being merged into the upcoming Lustre 2.4 parallel
>>> filesystem release.
>>>
>>> --Jeff



* Re: Best practice for large storage?
  2013-02-14 17:48     ` Jeff Johnson
  2013-02-15  9:51       ` Sebastian Riemer
@ 2013-02-16  5:40       ` Stan Hoeppner
  1 sibling, 0 replies; 15+ messages in thread
From: Stan Hoeppner @ 2013-02-16  5:40 UTC (permalink / raw)
  To: Jeff Johnson; +Cc: Roy Sigurd Karlsbakk, Linux RAID

On 2/14/2013 11:48 AM, Jeff Johnson wrote:
> Stable enough that it is being used at Lawrence Livermore National
> Laboratory on a 55PB Lustre resource.

That's a tad misleading.  LLNL's Sequoia has ZFS striped across three 8+2
hardware RAID6 arrays using 3TB drives.  Lustre is then layered atop
those.  So here ZFS sits atop 72TB raw.  It is not scaling to 55PB.

Something worth noting in this "if they use it so should you" context is
that US gov't computer labs tend to live on the bleeding edge, and have
the budget, resources, and personnel on staff to fix anything, including
rewriting Lustre and ZFS to fit their needs.

The name Donald Becker may be familiar to many here.  He wrote a good
number of the Linux ethernet device drivers while building Beowulf
clusters at NASA.  They bought a bunch of hardware, no Linux drivers
existed, so he wrote them to enable their hardware.  Eventually they
made it into mainline.

The moral of this story should be obvious.

-- 
Stan



* Re: Best practice for large storage?
  2013-02-14 17:28 Roy Sigurd Karlsbakk
  2013-02-14 17:33 ` Jeff Johnson
  2013-02-15  2:18 ` Chris Murphy
@ 2013-02-16  7:09 ` Stan Hoeppner
  2 siblings, 0 replies; 15+ messages in thread
From: Stan Hoeppner @ 2013-02-16  7:09 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Linux RAID

On 2/14/2013 11:28 AM, Roy Sigurd Karlsbakk wrote:
> Hi all
> 
> It seems we may need some storage for video soon. This is a 20,000-student college in Norway, with quite a few students in media-related studies. Since these students produce rather large amounts of raw material, typically stored for the duration of the semester, we may need some 50-100TiB, perhaps more. I have set up systems with this amount of storage before on ZFS, but may be using Linux MD for this project. I'm aware of the lack of checksumming, snapshots, etc. with Linux, but may be using it because of greater Linux knowledge among the sysadmins here. In such a setup, I guess nearline SAS drives on a SAS expander will be used, and with the amount of storage needed, I won't be using a single RAID-6 (too insecure) or RAID-10 (too expensive) for the lot. In ZFS-land I used smallish VDEVs (~10 drives each) in a large pool.
> 
>  - Would using LVM on top of RAID-6 give me something similar?

You would use LVM concatenation or md/linear to assemble the individual
RAID6 arrays into a single logical device which you'd format with XFS.

>  - If so, should I stripe the RAID sets? And if striping them, will it be as easy to add new RAID sets as we run out of space?

You *could* put a stripe over the RAID6s *IF* you build the system and
leave it as is, permanently, as you can never expand a stripe.  But even
then it's not recommended due to the complexity of determining the
proper chunk sizes for the nested stripes, and aligning the filesystem
to the resulting device.

It's better to create, say, 10+2 RAID6 arrays, and add them to an
md/linear array.  This linear array is nearly infinitely expandable by
adding more identical 10+2 RAID6 arrays.  Your chunk size and thus
stripe width stays the same as well.  The default RAID6 chunk of 512KB
is probably fine for large video files as that would yield a 5MB stripe
width.  When expanding with identical constituent RAID6 arrays, you
don't have to touch the XFS stripe alignment configuration, but simply
grow the filesystem after adding additional arrays to the md/linear array.
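Concretely, the expansion might look like this (an untested sketch;
/dev/md0 is the md/linear array, /dev/md4 a new identical 10+2 RAID6, and
the mount point is a placeholder):

~$ mdadm --grow /dev/md0 --add /dev/md4   # append the new RAID6 to the linear array
~$ xfs_growfs /srv/video                  # grow XFS into the new space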

The reason I recommend the 10+2 is twofold.  First, large video file
ingestion works well with a wide RAID6 stripe.  Second, because you
could start with an LSI 9207-8e and two of something like this chassis:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047

using 48x 3TB Seagate Constellation (enterprise) SATA drives:
http://www.newegg.com/Product/Product.aspx?Item=N82E16822178324

all of which are rather inexpensive compared to Dell, HP, IBM, Fujitsu
Siemens, etc, at least here in the US.  I know this chassis is available
in Switzerland and Germany, but I don't know about Norway.  Each chassis
holds 24 drives, allowing for two 10+2 RAID6 arrays per chassis, four
arrays total.  You'd put the four md/RAID6 arrays in one md/linear
array, and format it with XFS such as:

~$ mkfs.xfs -d su=512k,sw=10 /dev/md0

This will give you a filesystem with a little under 120TB of net free
space with 120 allocation groups evenly distributed over the 4 arrays.
All AGs can be written in parallel, yielding a high-performance video
ingestion and playback system.  Before mounting you would modify fstab
to include the inode64 option.  Don't even bother attempting to use
EXT3/4 on a 120TB filesystem--they'll fall over after some use.  JFS
will work, but it's not well maintained, hasn't seen meaningful updates
in many years, and is slower than XFS in most areas.  XFS is the one
*nix filesystem that was created and optimized specifically for large
files and concurrent high bandwidth streaming IO.
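For example (the mount point is a placeholder), the fstab entry might
look like:

/dev/md0   /srv/video   xfs   inode64   0 0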

See 'man mdadm' 'man mkfs.xfs' 'man mount' and 'man xfs' for more
specific information, commands, options, etc.

This may be a little more detail than you wanted, but it should give a
rough idea of at least one possible way to achieve your goal.

-- 
Stan



* Re: Best practice for large storage?
  2013-02-15  9:51       ` Sebastian Riemer
@ 2013-02-16 12:48         ` Roy Sigurd Karlsbakk
  2013-02-18 10:35           ` Sebastian Riemer
  0 siblings, 1 reply; 15+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-16 12:48 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: Linux RAID, Jeff Johnson

> On 14.02.2013 18:48, Jeff Johnson wrote:
> > Stable enough that it is being used at Lawrence Livermore National
> > Laboratory on a 55PB Lustre resource.
> >
> > I've been using it on a pre-release Lustre 2.4 and I have not had
> > any issues.
> 
> ZFS fragments completely if you've got massive parallel write IO --
> especially with Solaris 11. You'll get only 2-3 MiB/s after some time,
> as everything is then stored completely randomly. So if you don't really
> need the snapshots, you shouldn't use ZFS. NILFS is also good for
> snapshots.

This won't be massively parallel I/O, just a fileserver with a limited number of users. Also, can you document this claim?

Vennlige hilsener / Best regards

roy

* Re: Best practice for large storage?
  2013-02-16 12:48         ` Roy Sigurd Karlsbakk
@ 2013-02-18 10:35           ` Sebastian Riemer
  0 siblings, 0 replies; 15+ messages in thread
From: Sebastian Riemer @ 2013-02-18 10:35 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Linux RAID, Jeff Johnson

On 16.02.2013 13:48, Roy Sigurd Karlsbakk wrote:
>> On 14.02.2013 18:48, Jeff Johnson wrote:
>>> Stable enough that it is being used at Lawrence Livermore National
>>> Laboratory on a 55PB Lustre resource.
>>>
>>> I've been using it on a pre-release Lustre 2.4 and I have not had
>>> any issues.
>>
>> ZFS fragments completely if you've got massive parallel write IO --
>> especially with Solaris 11. You'll get only 2-3 MiB/s after some time,
>> as everything is then stored completely randomly. So if you don't really
>> need the snapshots, you shouldn't use ZFS. NILFS is also good for
>> snapshots.
> 
> This won't be massively parallel I/O, just a fileserver with a limited number of users. Also, can you document this claim?

Of course. We had ZFS in production in our IaaS public cloud. In such a
cloud nearly everything is random: customers create and delete their
storage from time to time, and some of them do a lot of small writes. It
fills up quite fast. We didn't even have snapshots, and we already had the
ZIL dedicated on enterprise SSDs.

http://thomas.gouverneur.name/2011/06/20110609zfs-fragmentation-issue-examining-the-zil/
http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-slogzil-and-data-fragmentation/
http://www.eall.com.br/blog/?p=2481
http://www.techforce.com.br/news/layout/set/print/linux_blog/zfs_part_4_sustained_random_small_files_sync_write_iops

ZFS as a block device with COMSTAR exports is really crap. You've got
mostly synchronous (and small database) IO. This is why we switched to a
Linux storage setup with LVM (without the thin-provisioning stuff). The
customers run their own filesystems in their VMs anyway.

Cheers,
Sebastian

