Re: Best practice for large storage?


* Re: Best practice for large storage?
       [not found] <CAH3kUhGVt1iyn9tt=2-+f6H++obOGSK3x0pBBPZV8CFUXjp5yw@mail.gmail.com>
@ 2013-02-14 23:23 ` Roy Sigurd Karlsbakk
  2013-02-14 23:42   ` Roberto Spadim
  0 siblings, 1 reply; 5+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-02-14 23:23 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Jeff Johnson, Linux RAID

xfs+lvm+mdadm works, but I'm still back to my original question. I don't care which filesystem is on top, only about the storage underneath. Please read the original question again.

----- Original message -----

xfs+lvm+mdadm doesn't work? Too complex?



2013/2/14 Roy Sigurd Karlsbakk < roy@karlsbakk.net > 


----- Original message -----

> why not btrfs or xfs?

Well, because btrfs isn't stable and doesn't support RAID-[56] unless you're using the git tree for testing that, which is rather more unstable than the rest, and xfs is just a filesystem that needs to sit on top of a RAID of some kind.



Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases adequate and relevant synonyms exist in Norwegian.




-- 
Roberto Spadim 


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Best practice for large storage?
  2013-02-14 23:23 ` Best practice for large storage? Roy Sigurd Karlsbakk
@ 2013-02-14 23:42   ` Roberto Spadim
  2013-02-15  1:01     ` Adam Goryachev
  0 siblings, 1 reply; 5+ messages in thread
From: Roberto Spadim @ 2013-02-14 23:42 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Jeff Johnson, Linux RAID

Is the problem checksumming? I really don't understand why the filesystem
is not the problem, and why storage is the 'key'.

storage -> mdadm + hard disks / hardware RAID + hard disks / network
storage + hard disks -> here the key is data integrity WITHOUT SILENT
DATA LOSS; so far I have only seen this on enterprise hardware RAID
controllers + enterprise SAS disks

lvm + filesystem -> no problem increasing storage -> here a
filesystem that can grow without problems is mandatory, since you want
to add more disks... the lvm part is there to make the devices easy to
work with

Did I forget any part?




--
Roberto Spadim

* Re: Best practice for large storage?
  2013-02-14 23:42   ` Roberto Spadim
@ 2013-02-15  1:01     ` Adam Goryachev
  2013-02-15  1:13       ` Roberto Spadim
  2013-02-15  1:49       ` Growing a raid60 (Was: Re: Best practice for large storage?) Daniel Browning
  0 siblings, 2 replies; 5+ messages in thread
From: Adam Goryachev @ 2013-02-15  1:01 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Roy Sigurd Karlsbakk, Jeff Johnson, Linux RAID

On 15/02/13 10:42, Roberto Spadim wrote:
> Is the problem checksumming? I really don't understand why the filesystem
> is not the problem, and why storage is the 'key'.
>
> storage -> mdadm + hard disks / hardware RAID + hard disks / network
> storage + hard disks -> here the key is data integrity WITHOUT SILENT
> DATA LOSS; so far I have only seen this on enterprise hardware RAID
> controllers + enterprise SAS disks
>
> lvm + filesystem -> no problem increasing storage -> here a
> filesystem that can grow without problems is mandatory, since you want
> to add more disks... the lvm part is there to make the devices easy to
> work with
>
> Did I forget any part?
>
> 2013/2/14 Roy Sigurd Karlsbakk <roy@karlsbakk.net>:
>> xfs+lvm+mdadm works, but I'm still back to my original question. I don't care which filesystem is on top, only about the storage underneath. Please read the original question again.

I assume the question can be reduced to:
1) You need a very large amount of space (requires a large number of disks)
2) You need to be able to expand that space over time
3) You want decent data redundancy
4) You will have a reasonable amount of concurrent access by multiple users and want decent performance

From my reading of the list, the usual suggestion seems to be RAID6 + concatenation, with around 6 to 8 drives in each RAID6, and XFS with certain parameters to ensure it balances the directories across the multiple RAID6 groups. Basically, you want as many drives as possible in each RAID6 to reduce wasted space, but not so many that you risk a triple drive failure and lose the whole lot.
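
To put rough numbers on that trade-off, here is a minimal sketch of the arithmetic (an illustration only, not something posted to the list). It assumes every RAID6 group gives up two drives' worth of space to parity and that the groups are simply concatenated; the function name, drive counts and drive size are hypothetical.

```shell
#!/bin/sh
# Usable capacity of several equally sized RAID6 groups joined end to end.
# Each RAID6 group spends two drives' worth of space on parity.
# Arguments: drives per group, number of groups, size of one drive in TB.
raid6_concat_usable_tb() {
    drives_per_group=$1
    groups=$2
    drive_tb=$3
    echo $(( (drives_per_group - 2) * groups * drive_tb ))
}

# Hypothetical example: 24 drives of 4 TB each, split two ways.
raid6_concat_usable_tb 8 3 4   # three 8-drive groups -> prints 72
raid6_concat_usable_tb 6 4 4   # four 6-drive groups  -> prints 64
```

Wider groups lose less space to parity, but each group still survives only two failed drives, which is the argument against making them too wide.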

If you did not need to grow the space, then you would use RAID60, and do striping, but I think you can't grow that, although some pages I just read suggest it might be possible to grow a raid0 by converting to raid4 and back again.

Another option would be to use LVM to join multiple RAID6 arrays together.
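
A hedged sketch of what the LVM route could look like. The device names /dev/md21 to /dev/md23 are borrowed from the test script later in this thread, and the volume group and logical volume names are made up for illustration. The script only prints the commands unless DRY_RUN=0 is set, so it is safe to read through before trying it on scratch devices.

```shell
#!/bin/sh
# Dry run by default: commands are only printed. Set DRY_RUN=0 (as root, on
# scratch devices) to actually execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Join two existing RAID6 arrays into one linear logical volume.
lvm_concat_plan() {
    run pvcreate /dev/md21 /dev/md22
    run vgcreate vg_storage /dev/md21 /dev/md22
    run lvcreate -n lv_storage -l 100%FREE vg_storage
}

# Later: add a third RAID6 and extend the volume into it.
lvm_extend_plan() {
    run pvcreate /dev/md23
    run vgextend vg_storage /dev/md23
    run lvextend -l +100%FREE /dev/vg_storage/lv_storage
}

lvm_concat_plan
lvm_extend_plan
```

After extending the logical volume, the filesystem on top still has to be grown separately (xfs_growfs on the mount point, for XFS).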

I don't know if this helps, but hopefully it does.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000                            adam@websitemanagers.com.au
Fax: +61 2 8304 0001                            www.websitemanagers.com.au



* Re: Best practice for large storage?
  2013-02-15  1:01     ` Adam Goryachev
@ 2013-02-15  1:13       ` Roberto Spadim
  2013-02-15  1:49       ` Growing a raid60 (Was: Re: Best practice for large storage?) Daniel Browning
  1 sibling, 0 replies; 5+ messages in thread
From: Roberto Spadim @ 2013-02-15  1:13 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Roy Sigurd Karlsbakk, Jeff Johnson, Linux RAID

The question about performance is:
will you measure 'per-user performance' or 'total system performance'?
In other words... will each user have a 'fixed' amount of disk space, or
will everybody use all the disks?
Per-user space could allow you to use a raid1 (or another raid level) for
about 100 users?! In other words, 100 users share a total of 320MB/s
(SAS) disk performance and 1TB of disk space? Do your users need a
low-latency system? Have you considered a large cache? Or an SSD cache?
If you say that everybody will use the whole system, how do you want to
share the disks? Should it be striped or mirrored?
Striped for better sequential reading,
mirrored for better parallel reading.

What does your system need?




--
Roberto Spadim


* Growing a raid60 (Was: Re: Best practice for large storage?)
  2013-02-15  1:01     ` Adam Goryachev
  2013-02-15  1:13       ` Roberto Spadim
@ 2013-02-15  1:49       ` Daniel Browning
  1 sibling, 0 replies; 5+ messages in thread
From: Daniel Browning @ 2013-02-15  1:49 UTC (permalink / raw)
  To: Adam Goryachev
  Cc: Roberto Spadim, Roy Sigurd Karlsbakk, Jeff Johnson, Linux RAID

On Thursday 14 February 2013 5:01:53 pm Adam Goryachev wrote:
> If you did not need to grow the space, then you would use RAID60, and do
> striping, but I think you can't grow that, although some pages I just read
> suggest it might be possible to grow a raid0 by converting to raid4 and
> back again.

Those pages you just read are correct, except that md does the whole raid4
conversion for you behind the scenes, automatically. Obviously, the
transformation takes a while as it re-balances the raid across the new
member, but it's online and read-write the whole time. When it's done, the
array looks as if it was created that way. You can even change the chunk size
(if desired) with a little off-array temporary storage.

I attached a script that demonstrates one way to set it up and test it.

I was concerned about what would happen if there was a crash or power failure
in the middle of the reshape, so I set up a test VM and simulated a power
failure by stopping the VM. After it came back up, md continued the reshape
right where it had left off, without missing a beat and without any
corruption. (I checked for corruption with a sha512 sum of the contents of
the test filesystem on the raid device.)
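
A minimal sketch of that kind of before/after check (an illustration, not the commands used in the test described above). A scratch directory stands in for the filesystem on the raid device, and the file names are hypothetical.

```shell
#!/bin/sh
# Record a checksum before the event under test, verify it afterwards.
# A scratch directory stands in for the filesystem on the raid device.
workdir=$(mktemp -d)
dd if=/dev/urandom of="$workdir/test_file" bs=1M count=4 2> /dev/null

# Before: store the checksum next to the data.
( cd "$workdir" && sha512sum test_file > test_file.sha512 )

# ... reshape, simulated power failure, reboot and remount would happen here ...

# After: sha512sum -c exits non-zero if the contents changed.
if ( cd "$workdir" && sha512sum -c --quiet test_file.sha512 ); then
    verify_result=intact
else
    verify_result=corrupted
fi
echo "$verify_result"
rm -rf "$workdir"
```

In a real test the checksum file would be kept somewhere that does not depend on the array being tested, so that damage to the array cannot also damage the reference value.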

To me, this is a killer feature of linux raid. ZFS certainly doesn't have it, 
and I doubt that any sub-$10k hardware raids do either. And even if cheap 
hardware raid cards did have it, they don't tend to have enough ports to make 
the feature all that useful. Whereas with software raid, you can almost always 
add another HBA to the box.

In fact, there is yet another cool feature of md: single-member raid60. That's
a raid0 of a single raid6. Sounds silly, right? Well, then you can grow that
raid0 online to 2, 3, or 10 members. You have to pass --force the first time
you set it up, because mdadm is justifiably surprised at a single-member raid0.

The downside is that other layers in the stack may not be so flexible. For 
example, with XFS you can optimize performance at the time you run mkfs.xfs by 
telling it the chunk size and stripe width parameters of the underlying raid 
device. For some workloads, it's better to set sunit/swidth to the individual 
raid6 members, for others (large sequential I/Os) it is better to set it to 
the raid0. In the latter case, reshaping the raid60 would leave XFS without
optimal parameters. It would be nice if XFS had an online "reshape" like
mdadm's to modify these parameters, but since there isn't one, I just went
with the underlying raid6 params even though my workload might have benefited
a little from the other setting.
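
As a rough sketch of the "individual raid6 member" choice: mkfs.xfs accepts the stripe unit as su (the md chunk size) and the stripe width as sw (a multiple of su, normally the number of data drives). The helper name and the example values below are assumptions for illustration; the function only prints the command line and does not run mkfs.xfs.

```shell
#!/bin/sh
# Print a mkfs.xfs command line aligned to one RAID6 member.
# su = md chunk size, sw = number of data drives (total drives minus 2 parity).
# Arguments: device, chunk size in KiB, total drives in the RAID6.
xfs_raid6_mkfs_cmd() {
    device=$1
    chunk_kib=$2
    total_drives=$3
    echo "mkfs.xfs -d su=${chunk_kib}k,sw=$(( total_drives - 2 )) ${device}"
}

# Hypothetical values: an 8-drive RAID6 with a 512 KiB chunk.
xfs_raid6_mkfs_cmd /dev/md24 512 8
# prints: mkfs.xfs -d su=512k,sw=6 /dev/md24
```

The actual chunk size of an array can be read from the output of mdadm --detail before filling in the values.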

All that said, there may not be a significant performance difference between
raid60 and raid6+linear concat (e.g. via LVM) in the particular use case that 
Roy Sigurd Karlsbakk is working on. And linear concat is certainly simpler
and more widely used, so probably safer.

--
Daniel Browning
Kavod Technologies

# Note, this test uses /dev/loop8 through /dev/loop19.
# Most boxes only have loop0 through loop7.

mkdir -p tmp/raid-test
cd tmp/raid-test

dd if=/dev/zero of=test-p1c1.img bs=1M count=100 2> /dev/null
losetup /dev/loop8 test-p1c1.img
dd if=/dev/zero of=test-p1c2.img bs=1M count=100 2> /dev/null
losetup /dev/loop9 test-p1c2.img
dd if=/dev/zero of=test-p1c3.img bs=1M count=100 2> /dev/null
losetup /dev/loop10 test-p1c3.img
dd if=/dev/zero of=test-p1c4.img bs=1M count=100 2> /dev/null
losetup /dev/loop11 test-p1c4.img

mdadm --create --verbose /dev/md21 --level=6 --raid-devices=4 /dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11

dd if=/dev/zero of=test-p2c1.img bs=1M count=100 2> /dev/null
losetup /dev/loop12 test-p2c1.img
dd if=/dev/zero of=test-p2c2.img bs=1M count=100 2> /dev/null
losetup /dev/loop13 test-p2c2.img
dd if=/dev/zero of=test-p2c3.img bs=1M count=100 2> /dev/null
losetup /dev/loop14 test-p2c3.img
dd if=/dev/zero of=test-p2c4.img bs=1M count=100 2> /dev/null
losetup /dev/loop15 test-p2c4.img

mdadm --create --verbose /dev/md22 --level=6 --raid-devices=4 /dev/loop12 /dev/loop13 /dev/loop14 /dev/loop15

cat /proc/mdstat

dd if=/dev/zero of=test-p3c1.img bs=1M count=100 2> /dev/null
losetup /dev/loop16 test-p3c1.img
dd if=/dev/zero of=test-p3c2.img bs=1M count=100 2> /dev/null
losetup /dev/loop17 test-p3c2.img
dd if=/dev/zero of=test-p3c3.img bs=1M count=100 2> /dev/null
losetup /dev/loop18 test-p3c3.img
dd if=/dev/zero of=test-p3c4.img bs=1M count=100 2> /dev/null
losetup /dev/loop19 test-p3c4.img

mdadm --create --verbose /dev/md23 --level=6 --raid-devices=4 /dev/loop16 /dev/loop17 /dev/loop18 /dev/loop19

cat /proc/mdstat

mdadm --create --verbose /dev/md24 --level=0 --raid-devices=1 --force /dev/md21

mkfs.xfs /dev/md24
cat /proc/mdstat

mkdir test_mount/
mount /dev/md24 test_mount/
# populate with data to 95% or so.
dd if=/dev/urandom of=test_mount/test_file bs=1M count=385
sha256sum test_mount/test_file > test_mount/test_file.sha256sum

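# Now grow to two. Per mdadm(8), a RAID0 cannot hold spares, so the new
# member is added in the same --grow command.
mdadm --grow /dev/md24 --raid-devices=2 --add /dev/md22
# Wait for the reshape to finish before reshaping again.
mdadm --wait /dev/md24

# Or three.
mdadm --grow /dev/md24 --raid-devices=3 --add /dev/md23
mdadm --wait /dev/md24

# Grow the filesystem into the new space and check that the data survived.
xfs_growfs test_mount/
sha256sum -c test_mount/test_file.sha256sum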

# Cleanup
umount test_mount/
mdadm --stop /dev/md24
mdadm --stop /dev/md23
mdadm --stop /dev/md21
mdadm --stop /dev/md22
losetup -d /dev/loop8
losetup -d /dev/loop9
losetup -d /dev/loop10
losetup -d /dev/loop11
losetup -d /dev/loop12
losetup -d /dev/loop13
losetup -d /dev/loop14
losetup -d /dev/loop15
losetup -d /dev/loop16
losetup -d /dev/loop17
losetup -d /dev/loop18
losetup -d /dev/loop19
# Uncomment to delete the image files as well (we are still inside tmp/raid-test):
#cd ../.. && rm -Rf ./tmp/raid-test
