Linux RAID subsystem development

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Phil Turmel @ 2016-08-26  2:22 UTC (permalink / raw)
  To: Ben, linux-raid
In-Reply-To: <57BF9965.1020403@gmail.com>

On 08/25/2016 09:20 PM, Ben wrote:

> I read a lot of conflicting info on SCT/ERC online (well, TLER anyway)
> -- Adam likes it enabled. What say the rest of you?

Adam is correct, and it's not a matter of "like".  You either must have
it enabled, or you *must* apply the kernel driver timeout work-around
(180 seconds) for that drive.  Failure to do so results in crashed arrays.

Enterprise and NAS drives work out of the box.  Desktop/green drives do not.

Some reading assignments from old discussions (read whole threads if you
have time):

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2
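
The either/or above (enable ERC on the drive, or raise the kernel driver
timeout to 180 seconds) can be sketched as a small shell function.  This is
a rough sketch under assumptions, not a vetted recipe: the 7.0 second ERC
value (scterc,70,70), the function name, and the SMARTCTL and SYSFS_ROOT
override variables are illustrative and not from this thread, and it assumes
smartctl exits non-zero when the drive rejects the ERC command.

```shell
# Sketch: prefer drive-level ERC, fall back to the 180 s driver timeout.
# SMARTCTL and SYSFS_ROOT are hypothetical overrides for dry runs.
apply_timeout_policy() {
    dev=$1    # bare name such as sde, not /dev/sde
    # scterc values are in tenths of a second: 70 = 7.0 s for reads and writes.
    if "${SMARTCTL:-smartctl}" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: ERC enabled"
    else
        # Assumed to mean "no ERC control": give the drive time to finish
        # its own retries before the kernel resets the link.
        echo 180 > "${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
        echo "$dev: ERC unavailable, driver timeout set to 180"
    fi
}
```

The sysfs timeout resets at boot and ERC settings typically do not survive
a power cycle, so a loop such as
  for d in sdb sdc sdd sde; do apply_timeout_policy "$d"; done
would have to run as root on every boot.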


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Phil Turmel @ 2016-08-26  2:33 UTC (permalink / raw)
  To: Chris Murphy, Linux-RAID
In-Reply-To: <CAJCQCtSWOwf6pHD2O7CtaNwF4BDbHO+hh7XrXpcaRsqPkvmFrQ@mail.gmail.com>

On 08/25/2016 06:32 PM, Chris Murphy wrote:

>> It's possible, but why would you ever end up with a GPT in a partition?
> 
> In every case I've seen, it was user error. I haven't heard of things
> putting GPTs in partitions, and in a sense I'd say it's a bug if any
> utility lets a user do that. Nesting GPT's in partitions, bad idea,
> although it *should* be innocuous because it shouldn't be seen/honored
> by anything that doesn't go looking for it because it doesn't belong
> there.

It is possible to run gdisk or parted on /dev/sdX1 accidentally instead
of /dev/sdX.  Pretty simple user error.

It is also possible and appropriate if using v0.90 or v1.0 metadata on
an array and you partition the array itself.  Then it'll show up on
member 0, any mirror of member 0, and possibly on a parity disk (if
intervening blocks are zero).

Phil
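
Why a partition table written to the array shows up on a member follows from
where each metadata version keeps its superblock.  A minimal sketch of that
rule, based on the mdadm manual's description of the formats (0.90 and 1.0
at the end of the device, 1.1 at the start, 1.2 at 4K from the start); the
function name is made up for illustration:

```shell
# Succeeds when array sector 0 coincides with member sector 0,
# i.e. when the superblock lives at the end of the member device.
md_data_starts_at_member_start() {
    case "$1" in
        0.90|1.0) return 0 ;;   # superblock at the end, data begins at offset 0
        1.1|1.2)  return 1 ;;   # superblock at or near the start, data is offset
        *) echo "unknown metadata version: $1" >&2; return 2 ;;
    esac
}
```

So "md_data_starts_at_member_start 1.0" succeeds, which is the case where a
GPT written to /dev/md0 is also visible at the start of member 0; with 1.2
it fails, because the array's first sector sits past the superblock.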


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-26  2:48 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <e75dcc05-f0c2-2152-9d24-5ff42ea02657@turmel.org>

On Thu, Aug 25, 2016 at 8:33 PM, Phil Turmel <philip@turmel.org> wrote:
> On 08/25/2016 06:32 PM, Chris Murphy wrote:
>
>>> It's possible, but why would you ever end up with a GPT in a partition?
>>
>> In every case I've seen, it was user error. I haven't heard of things
>> putting GPTs in partitions, and in a sense I'd say it's a bug if any
>> utility lets a user do that. Nesting GPT's in partitions, bad idea,
>> although it *should* be innocuous because it shouldn't be seen/honored
>> by anything that doesn't go looking for it because it doesn't belong
>> there.
>
> It is possible to run gdisk or parted on /dev/sdX1 accidentally instead
> of /dev/sdX.  Pretty simple user error.
>
> It is also possible and appropriate if using v0.90 or v1.0 metadata on
> an array and you partition the array itself.  Then it'll show up on
> member 0, any mirror of member 0, and possibly on a parity disk (if
> intervening blocks are zero).

Right, so something like GPT on /dev/sda and /dev/sdb to create sda1
and sdb1, then mdadm -C /dev/md0 --metadata=1.0 ... /dev/sda1
/dev/sdb1, and then create a GPT on /dev/md0. The result is /dev/md0,
/dev/sda1, and /dev/sdb1 will all appear to have the same GPT on them.

I would say that's probably a bad idea, I know some tools allow it,
but it creates an ambiguity. It could be argued to be inconsistent
with the UEFI spec. The only nesting it describes is MBR on a GPT
partition, not GPT nested in a GPT partition. This is probably also
better done using LVM. Otherwise we get nutty things...

-- 
Chris Murphy


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis @ 2016-08-26  2:50 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <CAJCQCtSWOwf6pHD2O7CtaNwF4BDbHO+hh7XrXpcaRsqPkvmFrQ@mail.gmail.com>

On Thu, Aug 25, 2016 at 04:32:12PM -0600, Chris Murphy wrote:
> that's not good, but not unfixable. The mdadm super block starts at
> LBA 8, 4096 bytes from the start of that partition, so it's safe to
> zero the first 4096 bytes. The GPT is mainly in the first three
> sectors so you could just write zeros for a count of 3, although it is
> more complete to zero with a count=8, for the partition, not the whole
> device.

Useful info, thanks.

> Looks like the mdadm super block might have been stepped on by
> something. You'd need to look for some evidence of it using something
> like
> 
> dd if=/dev/sdf1 count=9 2>/dev/null | hexdump -C
> 
> If it's intact it should be at offset 0x1000 and again just a matter of
> wiping the first 8 sectors, again of the partition, not the whole
> device.
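
A cautious way to combine the two steps quoted above (confirm the superblock
is intact at 0x1000, then zero only the first 8 sectors of the partition)
might look like the sketch below.  Assumptions: 512-byte logical sectors,
v1.2 metadata, and the md superblock magic 0xa92b4efc stored little-endian;
the function names are illustrative.  Try it on an image file before
pointing it at a real partition.

```shell
# Succeeds when the 4 bytes at offset 4096 hold the md v1.x superblock magic.
has_md_magic() {
    magic=$(dd if="$1" bs=1 skip=4096 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "fc4e2ba9" ]   # 0xa92b4efc, little-endian byte order
}

# Zero the first 8 sectors (4096 bytes) of a partition, but only if the
# superblock that starts right after them looks intact.
wipe_leading_4k() {
    has_md_magic "$1" || { echo "no md superblock at 0x1000, refusing" >&2; return 1; }
    # conv=notrunc only matters for image files; it keeps the rest of the file.
    dd if=/dev/zero of="$1" bs=512 count=8 conv=notrunc 2>/dev/null
}
```

The target would be the partition (for example /dev/sdf1), never the whole
device, exactly as the quoted advice says.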

> > Sadly, I can't do a mdadm -D because I can't assemble the RAID.
> > $ sudo mdadm -E /dev/md127
> 
> Again, wrong command, you should use -D for this.

# mdadm -D /dev/md127 
mdadm: md device /dev/md127 does not appear to be active.

> This is not a bug report. There's no reproduce steps, there's no
> evidence of a bug. I'm not experiencing random replacement of mdadm
> superblock data with MBR and GPT signatures.

I realize it's not terribly actionable.  But enough circumstantial
evidence from enough people and one starts looking for things which
can exhibit that behavior.

> That's not really what
> I'd expect of drive or enclosure firmware which by design should be
> partition agnostic, as there's more than one or two valid kinds of
> partitioning. Plus, it'd be scary even if it picked the right one, it
> could clobber a legitimate existing one.

I've had some weird shit, but you're right that it's odd that it'd
write a partition table out to /dev/sdd1 instead of /dev/sdd, that
almost sounds like something that would require the OS to get
involved, to get that offset confused.

> So I'd say it's something else.

Do you have any idea what that could be?  I haven't logged into this
box in months, and nobody else has either.  If it's not USB or drive
firmware, I'm fresh out of ideas.  Repartitioning disks isn't exactly
something most stuff does automatically and without prompting, as it's
pretty dangerous.

> In every case I've seen, it was user error. I haven't heard of things
> putting GPTs in partitions, and in a sense I'd say it's a bug if any
> utility lets a user do that. Nesting GPT's in partitions, bad idea,
> although it *should* be innocuous because it shouldn't be seen/honored
> by anything that doesn't go looking for it because it doesn't belong
> there.

That's entirely possible.  When I had this problem the _first_ few times
I assumed it was the fact I was using raw disks and not partitioned disks.
I had a very similar problem, where something would wipe out the mdlabel,
but only on the last two drives of the array.

In fact, I decided to grep around for /dev/sdd1 and /dev/sde1 which seem
to get trounced (but not /dev/sd[bc]1) and what do you know:

# grep -R /dev/sde1 /etc/
/etc/lvm/cache/.cache:          "/dev/sde1",

That certainly looks promising.  I wonder if you just solved my problem
without hardware upgrade.

> > I've certainly encountered this "GPT outside cylinder 0" on these two
> > drives before,
> 
> Keep in mind cylinders are gone, they don't exist anymore. Drives all
> speak in LBAs now. *shrug* The GPT typically involves LBAs 0, 1 and 2
> at least, more if there are more than 4 partitions.

Shorthand for "before partition 1".

> I don't recognize the above stuff, so I'm not sure what it is. I'd
> usually expect it to be zeros if it's not a boot drive.

It was used as a raw disk in an encrypted RAID before.

> OK it does in fact have a PMBR and GPT in the 1st and 2nd sector of
> this partition. Pretty weird how it got there. There is a UUID
> starting at offset 0x238 so you can look around and see if anything
> else has that UUID or if that UUID ever changed or comes back after
> you fix this. If it's not the same UUID, something is creating it with
> a random UUID each time, which would mean it's not just being copied
> from somewhere.

Got it.  Good idea.

> We kinda expect sdd to have a valid PMBR and GPT though... so that's
> sane. I just don't know what to make of the stuff in LBA 0 before the
> PMBR.

It's just random fill from a previous incarnation.

> It is common. I prefer gdisk, which has a nomenclature similar to
> fdisk. The nomenclature of parted is confusing.

I think somewhere in learning parted and repartitioning all the disks,
I managed to type /dev/sdX1 instead of /dev/sdX when creating the
partitions.

> FWIW it's probably a lot simpler layout if you wanted to do either
> linear or raid0, to just blow away all four drives with hdparm and ATA
> security erase to get rid of all signatures; and then make all of them
> into LVM physical volumes without any partitioning first, and then
> make a logical volume, which by default is linear/concat, or you can
> choose to use raid0 (this is a per logical volume characteristic), and
> then encrypt the LV, and then format the LUKS volume. There's no
> advantage to adding either partitions or mdadm RAIDs if you're going
> to use LVM anyway and this is a Linux only storage enclosure.

Good call, reduces the diversity of layers in the stack too.  Thanks.
-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Benjammin2068 @ 2016-08-26  2:54 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <f3d5add5-cfad-e33f-2f75-934189fa501d@turmel.org>



On 08/25/2016 09:22 PM, Phil Turmel wrote:
> On 08/25/2016 09:20 PM, Ben wrote:
>
>> I read a lot of conflicting info on SCT/ERC online (well, TLER anyway)
>> -- Adam likes it enabled. What say the rest of you?
> Adam is correct, and it's not a matter of "like".  

"like" was just an expression.

>
>
> You either must have
> it enabled, or you *must* apply the kernel driver timeout work-around
> (180 seconds) for that drive.  Failure to do so results in crashed arrays.

For the ST1000DM003, its SMART capabilities state "SCT Status Supported" -- What does that mean in comparison with the other HD103SJ drives?

It does SCT but doesn't let the user control it or it doesn't do it at all?

(smartctl -l scterc /dev/sde yields a message that implies control is not supported)

>
> Enterprise and NAS drives work out of the box.  Desktop/green drives do not.

Yea - I didn't buy any green drives (purposefully anyway) for this system.

>
> Some reading assignments from old discussions (read whole threads if you
> have time):
>
> http://marc.info/?l=linux-raid&m=139050322510249&w=2
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
> http://marc.info/?l=linux-raid&m=135811522817345&w=1
> http://marc.info/?l=linux-raid&m=133761065622164&w=2
> http://marc.info/?l=linux-raid&m=132477199207506
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
> http://marc.info/?l=linux-raid&m=142487508806844&w=3
> http://marc.info/?l=linux-raid&m=144535576302583&w=2
>

Thanks, will go read.


  -Ben


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-26  3:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, Linux-RAID
In-Reply-To: <CAJCQCtQyqH2eOmyqmpz0sZzD9ia4Q8RsA8eb+6inuDY6QP+Tyg@mail.gmail.com>

On Thu, Aug 25, 2016 at 08:48:24PM -0600, Chris Murphy wrote:
> Right, so something like GPT on /dev/sda and /dev/sdb to create sda1
> and sdb1, then mdadm -C /dev/md0 --metadata=1.0 ... /dev/sda1
> /dev/sdb1, and then create a GPT on /dev/md0. The result is /dev/md0,
> /dev/sda1, and /dev/sdb1 will all appear to have the same GPT on them.
> 
> I would say that's probably a bad idea, I know some tools allow it,
> but it creates an ambiguity. It could be argued to be inconsistent
> with the UEFI spec. The only nesting it describes is MBR on a GPT
> partition, not GPT nested in a GPT partition. This is probably also
> better done using LVM. Otherwise we get nutty things...

We had similar ambiguities in MBR-land, if you set the active flag on
more than one partition, or if you have more than one extended
partition.

Since the behavior in those cases is undefined, it seems wise to avoid
creating them.

Better if the specification avoids these situations - that no
combination of bits can create an ambiguous interpretation - but in
the occasional cases where that can't be avoided you're best off not
creating those situations.
-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-26  3:21 UTC (permalink / raw)
  To: Chris Murphy, Linux-RAID
In-Reply-To: <20160826025012.GO32250@subspacefield.org>

On Thu, Aug 25, 2016 at 8:50 PM,  <travis@subspacefield.org> wrote:

>
>> So I'd say it's something else.
>
> Do you have any idea what that could be?

User error, you even suspect it yourself later...

> In fact, I decided to grep around for /dev/sdd1 and /dev/sde1 which seem
> to get trounced (but not /dev/sd[bc]1) and what do you know:
>
> # grep -R /dev/sde1 /etc/
> /etc/lvm/cache/.cache:          "/dev/sde1",
>
> That certainly looks promising.  I wonder if you just solved my problem
> without hardware upgrade.

That just contains a listing of LVM devices, I don't think that's
related to this problem.

>> > I've certainly encountered this "GPT outside cylinder 0" on these two
>> > drives before,
>>
>> Keep in mind cylinders are gone, they don't exist anymore. Drives all
>> speak in LBAs now. *shrug* The GPT typically involves LBAs 0, 1 and 2
>> at least, more if there are more than 4 partitions.
>
> Shorthand for "before partition 1".

Unreliable. By convention most tools used to start it at LBA 63 which
*was* based on CHS, but that's the Pleistocene (again). It's been many
years, maybe nearing a decade, since a tool would default to that.
First, 62 sectors isn't big enough to embed a bootloader these days.
Second, it's not 4096 byte aligned for 4K sector drives, which now
pretty much every hard drive is, except some higher end SCSI/SAS
drives come with the option of 512 byte physical sectors still. But
these are quickly vanishing. Macs typically start the first partition
at LBA 40, and on Windows and Linux these days it's usually LBA 2048
(1MiB gap to the first partition).
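
The alignment point is easy to verify with shell arithmetic.  A small
sketch, assuming 512-byte logical sectors; the function name is
illustrative:

```shell
# Succeeds when a start LBA (512-byte logical sectors assumed)
# falls on a 4096-byte boundary.
is_4k_aligned() {
    [ $(( $1 * 512 % 4096 )) -eq 0 ]
}

for lba in 63 40 2048; do
    if is_4k_aligned "$lba"; then state=aligned; else state=misaligned; fi
    echo "LBA $lba = byte $(( lba * 512 )): $state"
done
```

LBA 63 lands on byte 32256, which is not a multiple of 4096; LBA 40 (byte
20480) and LBA 2048 (byte 1048576, exactly 1 MiB) both are.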

>
>> I don't recognize the above stuff, so I'm not sure what it is. I'd
>> usually expect it to be zeros if it's not a boot drive.
>
> It was used as a raw disk in an encrypted RAID before.

OK

>> It is common. I prefer gdisk, which has a nomenclature similar to
>> fdisk. The nomenclature of parted is confusing.
>
> I think somewhere in learning parted and repartitioning all the disks,
> I managed to type /dev/sdX1 instead of /dev/sdX when creating the
> partitions.

Bingo. That would do it.

The thing to get in the habit of when retasking anything, be it a
drive, a partition, or logical volume:

1. Tear down with wipefs -a from most recent structure created (file
system) to the first; or
2. Full disk encryption. If you merely luksFormat, you've obliterated
all the previous signatures, effectively, so no need for a tear down;
or
3. hdparm ATA secure erase; or
4. write zeros with something like badblocks.
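
Option 1 is the one where order matters.  A dry-run sketch that only prints
the wipefs commands, newest layer first; the function name and the example
device names are hypothetical, and the arguments are assumed to be given in
the order the layers were created:

```shell
# Print "wipefs -a" commands in reverse creation order (newest layer first).
# Nothing is erased; the output is a plan to review before running it.
teardown_plan() {
    plan=""
    for dev in "$@"; do
        # Prepend, so the most recently created layer ends up on top.
        plan="wipefs -a $dev
$plan"
    done
    printf '%s' "$plan"
}
```

For example, "teardown_plan /dev/sdX1 /dev/md0 /dev/mapper/vg-lv" prints the
wipefs line for the logical volume first and the partition last.  Running
wipefs without -a on each device first lists the signatures it would erase.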

-- 
Chris Murphy


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis @ 2016-08-26  3:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <CAJCQCtRwF1XCNQRxibFGZ0SVQ8=RKo1birv5ZWZBvZGpuu0ndQ@mail.gmail.com>

On Thu, Aug 25, 2016 at 09:21:30PM -0600, Chris Murphy wrote:
> On Thu, Aug 25, 2016 at 8:50 PM,  <travis@subspacefield.org> wrote:
> 
> >
> >> So I'd say it's something else.
> >
> > Do you have any idea what that could be?
> 
> User error, you even suspect it yourself later...

Yeah, *when I created this disk layout* I might have created a GPT in
partition 1.

That was probably a year or more ago.

That has nothing to do with this crash, which is perhaps the fourth of
its kind.

I haven't touched the box in weeks before this happened, when I was
away on vacation.  Although, it could have lurked for some time,
and only been uncovered by a crash or kpanic.

> > Shorthand for "before partition 1".
> 
> Unreliable. By convention most tools used to start it at LBA 63 which
> *was* based on CHS, but that's the Pleistocene (again). It's been many
> years, maybe nearing a decade, since a tool would default to that.
> First, 62 sectors isn't big enough to embed a bootloader these days.
> Second, it's not 4096 byte aligned for 4K sector drives, which now
> pretty much every hard drive is, except some higher end SCSI/SAS
> drives come with the option of 512 byte physical sectors still. But
> these are quickly vanishing. Macs typically start the first partition
> at LBA 40, and on Windows and Linux these days it's usually LBA 2048
> (1MiB gap to the first partition).

Relax, it's just to save time.  I'm aware we don't use CHS addressing
any more.  I'm also aware that Kleenex is a brand name, even though I
use it to mean tissue.  GNU/Linux, not Linux.  I get it.  Let's move on.
-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-26  4:06 UTC (permalink / raw)
  To: Chris Murphy, Linux-RAID
In-Reply-To: <20160826035850.GQ32250@subspacefield.org>

On Thu, Aug 25, 2016 at 08:58:50PM -0700, travis@subspacefield.org wrote:
> Yeah, *when I created this disk layout* I might have created a GPT in
> partition 1.
> 
> That was probably a year or more ago.
> 
> That has nothing to do with this crash, which is perhaps the fourth of
> its kind.
> 
> I haven't touched the box in weeks before this happened, when I was
> away on vacation.  Although, it could have lurked for some time,
> and only been uncovered by a crash or kpanic.

I certainly have not repartitioned the disks multiple times.  There's
no need for that.  It leads to these sorts of problems.

The kernel does panic from time to time.

To repeat, this box was *completely unattended* for several weeks
before the crash.  No administration at all.  I simply rsync'd things
off it as necessary.  As a non-root user.

I am curious about the fact that /dev/sdd1 and /dev/sde1 were listed
together in this lvm cache, and those are the two disks that normally
get blasted every 6 months or so.  That's an odd coincidence, and my
best lead yet.
-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-26  4:25 UTC (permalink / raw)
  To: Chris Murphy, Linux-RAID
In-Reply-To: <20160826040619.GR32250@subspacefield.org>

On Thu, Aug 25, 2016 at 10:06 PM,
<travis+ml-linux-raid@subspacefield.org> wrote:
> On Thu, Aug 25, 2016 at 08:58:50PM -0700, travis@subspacefield.org wrote:
>> Yeah, *when I created this disk layout* I might have created a GPT in
>> partition 1.
>>
>> That was probably a year or more ago.
>>
>> That has nothing to do with this crash, which is perhaps the fourth of
>> its kind.
>>
>> I haven't touched the box in weeks before this happened, when I was
>> away on vacation.  Although, it could have lurked for some time,
>> and only been uncovered by a crash or kpanic.
>
> I certainly have not repartitioned the disks multiple times.  There's
> no need for that.  It leads to these sorts of problems.
>
> The kernel does panic from time to time.
>
> To repeat, this box was *completely unattended* for several weeks
> before the crash.  No administration at all.  I simply rsync'd things
> off it as necessary.  As a non-root user.
>
> I am curious about the fact that /dev/sdd1 and /dev/sde1 were listed
> together in this lvm cache, and those are the two disks that normally
> get blasted every 6 months or so.  That's an odd coincidence, and my
> best lead yet.

Well that file does seem stale, because those partitions aren't
actually part of LVM. They're members of an mdadm array. I don't know
where LVM comes into this because we don't have the complete layout.



-- 
Chris Murphy


* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Phil Turmel @ 2016-08-26 12:38 UTC (permalink / raw)
  To: Benjammin2068, linux-raid
In-Reply-To: <57BFAF75.1080807@gmail.com>

On 08/25/2016 10:54 PM, Benjammin2068 wrote:
>> You either must have
>> it enabled, or you *must* apply the kernel driver timeout work-around
>> (180 seconds) for that drive.  Failure to do so results in crashed arrays.
> 
> For the ST1000DM003, its SMART capabilities state "SCT Status Supported" -- What does that mean in comparison with the other HD103SJ drives?
> 
> It does SCT but doesn't let the user control it or it doesn't do it at all?

ERC is a feature within the SCT standard.  For modern hard drives,
claiming "SCT" support is comparable to a bottled water supplier
advertising that their product is wet.

> (smartctl -l scterc /dev/sde yields a message that implies control is not supported)

ERC on the other hand is a valuable feature that modern drive
manufacturers make you pay extra for.

>> Enterprise and NAS drives work out of the box.  Desktop/green drives do not.
> 
> Yea - I didn't buy any green drives (purposefully anyway) for this system.

I originally wrote that sentence as "Desktop drives do not."  I added
"/green" to clarify that some non-enterprise, non-NAS drives aren't
marketed as desktop drives, but still lack ERC functionality.

Your ST1000DM003 is marketed as a desktop drive.  Seagate's product page
for this model has links to other models for specialty use cases,
including NAS.

>> Some reading assignments from old discussions (read whole threads if you
>> have time):
>>
>> http://marc.info/?l=linux-raid&m=139050322510249&w=2
>> http://marc.info/?l=linux-raid&m=135863964624202&w=2
>> http://marc.info/?l=linux-raid&m=135811522817345&w=1
>> http://marc.info/?l=linux-raid&m=133761065622164&w=2
>> http://marc.info/?l=linux-raid&m=132477199207506
>> http://marc.info/?l=linux-raid&m=133665797115876&w=2
>> http://marc.info/?l=linux-raid&m=142487508806844&w=3
>> http://marc.info/?l=linux-raid&m=144535576302583&w=2
> 
> Thanks, will go read.

You will find detailed explanations for my comments above in these old
threads.

Phil


* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-26 13:01 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Mdadm
In-Reply-To: <83f67452-9f09-2d6f-f82a-77d83309f618@websitemanagers.com.au>

On Thu, Aug 25, 2016 at 6:39 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>> Do you think more RAM might be beneficial then?
>
> I'm not sure of this, but I can suggest that you try various sizes for the
> stripe_cache_size, in my testing, I tried various values up to 64k, but 4k
> ended up being the optimal value (I only have 8 disks with 64k chunk
> size)...
>
> You should find out if you are swapping with vmstat:
> vmstat 5
> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
> indeed swapping.
>
> You might find that if there is insufficient memory, then the kernel will
> automatically reduce/limit the value for the stripe_cache_size (I'm only
> guessing, but my memory tells me that the kernel locks this memory and it
> can't be swapped/etc).

Good ideas.  I actually halved the amount of physical memory in this
machine.  I replaced the original eight 8GB DIMMs with eight 4GB
DIMMs.  So no change in number of modules, but total RAM went from 64
GB to 32 GB.

I then cranked the stripe_cache_size up to 32k, degraded the array,
and kicked off my reader test.

Performance is basically the same.  And I'm definitely not swapping,
vmstat shows both swap values constant at zero.  So it appears the
kernel is smart enough to scale back the stripe_cache_size to avoid
swapping.


>> On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
>>>
>>> 2. the state machine runs in a single thread, which is a bottleneck. try
>>> to
>>> increase group_thread_cnt, which will make the handling multi-thread.
>>
>> For others' reference, this parameter is in
>> /sys/block/<device>/md/stripe_cache_size.
>>
>> On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
>> it to 4, and the degraded reads went up dramatically.  Need to
>> experiment with this (and all the other tunables) some more, but that
>> change alone put me up to 2.5 GB/s read from the degraded array!
>
>
> Did you mean group_thread_cnt which defaults to 0?
> I don't recall the default for stripe_cache_size, but I'm pretty certain it
> is not 0...
> Note, in your case, it might increase the "test read scenario" but since
> your "live" scenario has a lot more CPU overhead, then this option might
> decrease overall results... Unfortunately, only testing with "live" load
> will really provide the information you will need to decide on this.

Yes, sorry, that is a typo, meant to write group_thread_cnt.  That
defaults to 0.  stripe_cache_size appears to default to 256.  (At
least on CentOS/RHEL 7.2.)

Agreed, yes, upping group_thread_cnt could improve one thing only to
the detriment of something else.  Nothing like a little "testing in
production" to make the higher-ups sweat.  :)

Thanks again all!
Matt
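
For anyone repeating the experiment, a small sketch for changing one of
these tunables and reading the result back, which also shows whether the
kernel accepted the requested value.  The function name and the SYSFS_ROOT
override are illustrative; the sysfs files themselves are
/sys/block/<device>/md/group_thread_cnt and
/sys/block/<device>/md/stripe_cache_size.

```shell
# Write an md tunable and report the value before and after the write.
# SYSFS_ROOT is a hypothetical override so the function can be dry-run.
set_md_tunable() {
    md=$1 name=$2 value=$3
    f="${SYSFS_ROOT:-/sys}/block/$md/md/$name"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    old=$(cat "$f")
    echo "$value" > "$f"
    # Read back rather than trusting the write: the kernel has the last word.
    echo "$name: $old -> $(cat "$f")"
}
```

As root, "set_md_tunable md0 group_thread_cnt 4" would reproduce the change
described above, and the same call with stripe_cache_size shows what value
actually took effect.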


* md-cluster Module Requirement
From: Marc Smith @ 2016-08-26 14:40 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm attempting to use md-cluster from Linux 4.5.2 with mdadm 3.4 and
I'm running into this when attempting to create a RAID1 device with
the clustered bitmap:

--snip--
[64782.619968] md: bind<dm-4>
[64782.629336] md: bind<dm-3>
[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)
[64782.630531] md127: bitmap file superblock:
[64782.630532]          magic: 6d746962
[64782.630533]        version: 5
[64782.630534]           uuid: 10fee18f.f553d7f2.deb926f1.c7c4bd4b
[64782.630534]         events: 0
[64782.630535] events cleared: 0
[64782.630536]          state: 00000000
[64782.630537]      chunksize: 67108864 B
[64782.630537]   daemon sleep: 5s
[64782.630538]      sync size: 878956288 KB
[64782.630539] max write behind: 0
[64782.630541] md127: failed to create bitmap (-2)
[64782.630577] md: md127 stopped.
[64782.630581] md: unbind<dm-3>
[64782.635133] md: export_rdev(dm-3)
[64782.635145] md: unbind<dm-4>
[64782.643111] md: export_rdev(dm-4)
--snip--

I'm using md-cluster built-in, not as a module:
# zcat /proc/config.gz | grep MD_CLUSTER
CONFIG_MD_CLUSTER=y

It seems the driver is attempting to load the 'md-cluster' module
(from drivers/md/md.c):
--snip--
        err = request_module("md-cluster");
        if (err) {
                pr_err("md-cluster module not found.\n");
                return -ENOENT;
        }
--snip--

I looked at linux-next and it appears this code is the same; is there
a test we can do before attempting to load the module in the case that
its built-in, or is there some other requirement that md-cluster needs
to be built as a module?

Thanks for your time.


--Marc
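
One thing that may be worth checking from userspace: request_module() hands
the name to modprobe, and kmod's modprobe is understood to treat a name
listed in modules.builtin as already present.  A sketch of that check,
assuming the usual /lib/modules/<release>/modules.builtin location; the
function name is made up, and this does not claim to be the cause of the
failure above:

```shell
# Succeeds when <name>.ko is listed as built into the running kernel.
# The second argument is an optional path to a modules.builtin file.
module_is_builtin() {
    name=$1
    list=${2:-/lib/modules/$(uname -r)/modules.builtin}
    grep -q "/${name}\.ko\$" "$list" 2>/dev/null
}
```

If "module_is_builtin md-cluster" fails on a kernel with
CONFIG_MD_CLUSTER=y, the installed modules.builtin may be missing or may
not match the running kernel.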


* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Wols Lists @ 2016-08-26 18:07 UTC (permalink / raw)
  To: Ben, linux-raid
In-Reply-To: <57BF9965.1020403@gmail.com>

On 26/08/16 02:20, Ben wrote:
> [root@quantum ~]# smartctl -a /dev/sde
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
> Device Model:     ST1000DM003-1ER162
> Serial Number:    Z4YDLXWJ
> LU WWN Device Id: 5 000c50 091877801
> Firmware Version: CC45
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Device is:        In smartctl database [for details use: -P show]

Sorry Ben - that drive was NOT a smart buy !!! Seagate Barracuda :-(

You MUST enable the timeout on this drive :-(

Gut feel tells me most 1TB or less drives are okay in a raid - the
Barracudas are an exception :-( I've got two 3TB Barracudas mirrored,
and from reading the list, there's no way I'd go raid5 for more capacity
without ditching them.

Most people seem to get WD Reds - I've asked about Seagate NAS but I've
not picked up on any reports about them - good or bad. Barracudas - the
news is pretty much all bad :-(

Cheers,
Wol


* Re: kernel checksumming performance vs actual raid device performance
From: Wols Lists @ 2016-08-26 18:11 UTC (permalink / raw)
  To: Adam Goryachev, Matt Garman; +Cc: Mdadm
In-Reply-To: <83f67452-9f09-2d6f-f82a-77d83309f618@websitemanagers.com.au>

On 26/08/16 00:39, Adam Goryachev wrote:
> You should find out if you are swapping with vmstat:
> vmstat 5
> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
> indeed swapping.
> 
> You might find that if there is insufficient memory, then the kernel
> will automatically reduce/limit the value for the stripe_cache_size (I'm
> only guessing, but my memory tells me that the kernel locks this memory
> and it can't be swapped/etc).

Are you using a gui :-) ?

Download and build the latest version of xosview (assuming it builds,
when I last tried, "bleeding edge" was bleeding ... :-(

git://github.com/mromberg/xosview

That'll give you a nice little overview of both raid and swap. The
current version of xosview is fine for swap, but the raid monitor is broken.

Cheers,
Wol


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-26 20:04 UTC (permalink / raw)
  To: Matt Garman; +Cc: Adam Goryachev, Mdadm
In-Reply-To: <CAJvUf-DJJ4ZqR+kS6SCqgaq9_MXuQxFwyHJbv7YH7hVx5xbBRQ@mail.gmail.com>

On Fri, Aug 26, 2016 at 6:01 AM, Matt Garman <matthew.garman@gmail.com> wrote:
> On Thu, Aug 25, 2016 at 6:39 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>>> Do you think more RAM might be beneficial then?
>>
>> I'm not sure of this, but I can suggest that you try various sizes for the
>> stripe_cache_size, in my testing, I tried various values up to 64k, but 4k
>> ended up being the optimal value (I only have 8 disks with 64k chunk
>> size)...
>>
>> You should find out if you are swapping with vmstat:
>> vmstat 5
>> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
>> indeed swapping.
>>
>> You might find that if there is insufficient memory, then the kernel will
>> automatically reduce/limit the value for the stripe_cache_size (I'm only
>> guessing, but my memory tells me that the kernel locks this memory and it
>> can't be swapped/etc).
>
> Good ideas.  I actually halved the amount of physical memory in this
> machine.  I replaced the original eight 8GB DIMMs with eight 4GB
> DIMMs.  So no change in number of modules, but total RAM went from 64
> GB to 32 GB.
>
> I then cranked the stripe_cache_size up to 32k, degraded the array,
> and kicked off my reader test.
>
> Performance is basically the same.  And I'm definitely not swapping,
> vmstat shows both swap values constant at zero.  So it appears the
> kernel is smart enough to scale back the stripe_cache_size to avoid
> swapping.

The documentation implies that 32K is the upper limit for stripe_cache_size.

It is not immediately clear from the documentation or the code whether
a "stripe" is a page, a control structure, or a chunk.  I "think" it
is a control structure with a bio plus a single page.

I took a simple array from stripe_cache_size 256 => 32K and the system
allocated 265 MB of RAM (crude number via free), so this implies that
the stripe cache is 8K per entry.  The stripe cache struct appears to
have a bio plus a bunch of other control items in the struct.  I am
not sure if it has a statically allocated page, but at 8K it looks
like it does.  So I think the minimum/static memory allocated by the
stripe cache is 8K per entry.  This "might" also be the maximum, or
the cache size might grow to handle longer requests.

My test array is 16K chunks, and 8K is lower than 16K, so the max
might be (4K+stripe_cache_size) * chunk_size, but I suspect it is
actually (4K+4K) * stripe_cache_size.  Others write and breathe this
code more than me, so clarification would be helpful.

If it were actually chunk size, the upper limits would be really bad
(32K * 512K) = 16GB.  RAID 5/6 is "compatible as a swap device", so
memory allocations during IO are generally not allowed.  So I think that
the stripe cache gets bumped and just stays there with little (or no)
dynamic allocation during operation.  If you run out of stripe cache
buckets, the driver "stalls" the calling IO operations until stripe
caches become available.  This "stall" of calling IOs will lower the
number of outstanding IOs to the member drives, which probably
explains your performance at 200 MB/sec.  Once stripe_cache_size gets
big enough to handle your workload, additional allocation does not help.
You can look at stripe_cache_active to see what is in use during your
run.

Doug

[... rest snipped ...]

^ permalink raw reply

* Re: kernel checksumming performance vs actual raid device performance
From: Phil Turmel @ 2016-08-26 21:57 UTC (permalink / raw)
  To: doug, Matt Garman; +Cc: Adam Goryachev, Mdadm
In-Reply-To: <CAFx4rwTekSp6mNUtt7UfEmTi93Fu+25rYLGrp63XwoxYvwV54g@mail.gmail.com>

On 08/26/2016 04:04 PM, Doug Dumitru wrote:
> I took a simple array from stripe_cache_size 256 => 32K and the system
> allocated 265 MB of RAM (crude number via free), so this implies that
> the stripe cache is 8K per entry.  The stripe cache struct appears to
> have a bio plus a bunch of other control items in the struct.  I am
> not sure if it has a statically allocated page, but at 8K it looks
> like it does.  So I think the minimum/static memory allocated by the
> stripe cache is 8K per entry.  This "might" also be the maximum, or
> the cache size might grow to handle longer requests.

This was answered three days ago.  Allow me to quote myself:

> This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
> blocks.  The stripe cache for an array is a collection of 4k elements
> per member device.  Chunk size doesn't factor into the cache itself.

Phil


^ permalink raw reply

* Re: kernel checksumming performance vs actual raid device performance
From: Doug Dumitru @ 2016-08-26 22:11 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Matt Garman, Adam Goryachev, Mdadm
In-Reply-To: <51fda2c2-ef42-c052-8f7d-c00af6a30b1a@turmel.org>

Phil,

My apologies for missing this.  This thread is getting long.

Regardless, the max stripe_cache_size will not use more than 256MB of
RAM (32K x 8K) for a single device, and the memory usage will be
static.

Doug


On Fri, Aug 26, 2016 at 2:57 PM, Phil Turmel <philip@turmel.org> wrote:
> On 08/26/2016 04:04 PM, Doug Dumitru wrote:
>> I took a simple array from stripe_cache_size 256 => 32K and the system
>> allocated 265 MB of RAM (crude number via free), so this implies that
>> the stripe cache is 8K per entry.  The stripe cache struct appears to
>> have a bio plus a bunch of other control items in the struct.  I am
>> not sure if it has a statically allocated page, but at 8K it looks
>> like it does.  So I think the minimum/static memory allocated by the
>> stripe cache is 8K per entry.  This "might" also be the maximum, or
>> the cache size might grow to handle longer requests.
>
> This was answered three days ago.  Allow me to quote myself:
>
>> This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
>> blocks.  The stripe cache for an array is a collection of 4k elements
>> per member device.  Chunk size doesn't factor into the cache itself.
>
> Phil
>



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply

* [PATCH v2] raid6: fix the input of raid6 algorithm
From: liuzhengyuan @ 2016-08-28  6:51 UTC (permalink / raw)
  To: hpa
  Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521,
	ZhengYuan Liu

From: ZhengYuan Liu <liuzhengyuan@kylinos.cn>

To test and choose the best algorithm for raid6, a disk count and
disk data must be provided. At present these inputs depend on the
page size and the gfmul table, which makes the disk count less than
4 when the page size is more than 64KB. This patch supports
arbitrary page sizes by defining a macro for the disk count and
using a PRNG based on a linear congruential generator to fill the
disk data.

Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
---
 lib/raid6/algos.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 975c6e0..3991ac6 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -30,6 +30,8 @@ EXPORT_SYMBOL(raid6_empty_zero_page);
 #endif
 #endif
 
+#define RAID6_DISKS	8
+
 struct raid6_calls raid6_call;
 EXPORT_SYMBOL_GPL(raid6_call);
 
@@ -129,7 +131,7 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void)
 }
 
 static inline const struct raid6_calls *raid6_choose_gen(
-	void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
+	void *(*const dptrs)[RAID6_DISKS], const int disks)
 {
 	unsigned long perf, bestgenperf, bestxorperf, j0, j1;
 	int start = (disks>>1)-1, stop = disks-3;	/* work on the second half of the disks */
@@ -206,27 +208,36 @@ static inline const struct raid6_calls *raid6_choose_gen(
 
 int __init raid6_select_algo(void)
 {
-	const int disks = (65536/PAGE_SIZE)+2;
+	const int disks = RAID6_DISKS;
 
 	const struct raid6_calls *gen_best;
 	const struct raid6_recov_calls *rec_best;
-	char *syndromes;
-	void *dptrs[(65536/PAGE_SIZE)+2];
-	int i;
-
-	for (i = 0; i < disks-2; i++)
-		dptrs[i] = ((char *)raid6_gfmul) + PAGE_SIZE*i;
+	char *disk_ptr;
+	void *dptrs[RAID6_DISKS];
+	int32_t seed;
+	int i, j;
 
-	/* Normal code - use a 2-page allocation to avoid D$ conflict */
-	syndromes = (void *) __get_free_pages(GFP_KERNEL, 1);
+	/* use a 8-page allocation, The first 6 pages for disks
+	   and the last 2 pages for syndromes */
+	disk_ptr = (void *) __get_free_pages(GFP_KERNEL, 3);
 
-	if (!syndromes) {
+	if (!disk_ptr) {
 		pr_err("raid6: Yikes!  No memory available.\n");
 		return -ENOMEM;
 	}
 
-	dptrs[disks-2] = syndromes;
-	dptrs[disks-1] = syndromes + PAGE_SIZE;
+	/*use a PRNG based on LCB to fill the disks*/
+	seed = 1;
+	for (i = 0; i < disks-2; i++) {
+		dptrs[i] = disk_ptr + PAGE_SIZE*i;
+		for (j = 0; j < PAGE_SIZE; j = j + 4) {
+			seed = ((seed * 1103515245) + 12345) & 0x7fffffff;
+			*(int32_t *)(dptrs[i]+j) = seed;
+		}
+	}
+
+	dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);
+	dptrs[disks-1] = disk_ptr + PAGE_SIZE*(disks-1);
 
 	/* select raid gen_syndrome function */
 	gen_best = raid6_choose_gen(&dptrs, disks);
@@ -234,7 +245,7 @@ int __init raid6_select_algo(void)
 	/* select raid recover functions */
 	rec_best = raid6_choose_recov();
 
-	free_pages((unsigned long)syndromes, 1);
+	free_pages((unsigned long)disk_ptr, 3);
 
 	return gen_best && rec_best ? 0 : -EINVAL;
 }
-- 
1.9.1





^ permalink raw reply related

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Benjammin2068 @ 2016-08-28 18:29 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <57C0856D.8050209@youngman.org.uk>



On 08/26/2016 01:07 PM, Wols Lists wrote:
> On 26/08/16 02:20, Ben wrote:
>> [root@quantum ~]# smartctl -a /dev/sde
>> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
>> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
>> Device Model:     ST1000DM003-1ER162
>> Serial Number:    Z4YDLXWJ
>> LU WWN Device Id: 5 000c50 091877801
>> Firmware Version: CC45
>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>> Device is:        In smartctl database [for details use: -P show]
> Sorry Ben - that drive was NOT a smart buy !!! Seagate Barracuda :-(
>
> You MUST enable the timeout on this drive :-(
>
> Gut feel tells me most 1TB or less drives are okay in a raid - the
> Barracudas are an exception :-( I've got two 3TB Barracudas mirrored,
> and from reading the list, there's no way I'd go raid5 for more capacity
> without ditching them.
>
> Most people seem to get WD Reds - I've asked about Seagate NAS but I've
> not picked up on any reports about them - good or bad. Barracudas - the
> news is pretty much all bad :-(
>
>

Yea, I figured that out -- just couldn't find a decent detailed reference with what "SCT status supported" means versus the more fully featured.

And this drive (sort of  - but not this sub model -- and that's the replacement that Seagate recommended.) is not going to stay in the array.

I'm going to get some more WD Reds (or decent NAS-friendly mechs) and pull this puppy out of the stack and use it elsewhere.

Thanks for the confirmations!

 -Ben



^ permalink raw reply

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Anthony Youngman @ 2016-08-28 19:20 UTC (permalink / raw)
  To: Benjammin2068, linux-raid
In-Reply-To: <57C32D8E.9030102@gmail.com>

On 28/08/16 19:29, Benjammin2068 wrote:
> And this drive (sort of  - but not this sub model -- and that's the replacement that Seagate recommended.) is not going to stay in the array.

If they knew you were using it in a raid, and recommended it, then I 
don't know about your laws but over here in the UK I'd send it back as 
"unfit for purpose". Under SOGA (Sale Of Goods Act) they've sold you a 
pup and it's their problem, not yours.

(UK law assumes the salesman knows more than you, and so long as you 
tell them what you want, that forms part of the contract. Which means if 
they sell you something that does not meet the requirements you told 
them, they have to put matters right - either swap the drive for 
something that is suitable, or give you a refund. They can charge the 
difference if "suitable" means a more expensive drive, but a lot of UK 
shops would swallow the loss if they had recommended the wrong drive.)

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH v2] raid6: fix the input of raid6 algorithm
From: H. Peter Anvin @ 2016-08-28 21:08 UTC (permalink / raw)
  To: liuzhengyuan; +Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521
In-Reply-To: <1472367064-10935-1-git-send-email-liuzhengyuan@kylinos.cn>

On August 27, 2016 11:51:04 PM PDT, liuzhengyuan@kylinos.cn wrote:
>From: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
>
>To test and choose the best algorithm for raid6, a disk count and
>disk data must be provided. At present these inputs depend on the
>page size and the gfmul table, which makes the disk count less than
>4 when the page size is more than 64KB. This patch supports
>arbitrary page sizes by defining a macro for the disk count and
>using a PRNG based on a linear congruential generator to fill the
>disk data.
>
>Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
>---
> lib/raid6/algos.c | 39 +++++++++++++++++++++++++--------------
> 1 file changed, 25 insertions(+), 14 deletions(-)
>
>diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
>index 975c6e0..3991ac6 100644
>--- a/lib/raid6/algos.c
>+++ b/lib/raid6/algos.c
>@@ -30,6 +30,8 @@ EXPORT_SYMBOL(raid6_empty_zero_page);
> #endif
> #endif
> 
>+#define RAID6_DISKS	8
>+
> struct raid6_calls raid6_call;
> EXPORT_SYMBOL_GPL(raid6_call);
> 
>@@ -129,7 +131,7 @@ static inline const struct raid6_recov_calls
>*raid6_choose_recov(void)
> }
> 
> static inline const struct raid6_calls *raid6_choose_gen(
>-	void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks)
>+	void *(*const dptrs)[RAID6_DISKS], const int disks)
> {
> 	unsigned long perf, bestgenperf, bestxorperf, j0, j1;
>	int start = (disks>>1)-1, stop = disks-3;	/* work on the second half
>of the disks */
>@@ -206,27 +208,36 @@ static inline const struct raid6_calls
>*raid6_choose_gen(
> 
> int __init raid6_select_algo(void)
> {
>-	const int disks = (65536/PAGE_SIZE)+2;
>+	const int disks = RAID6_DISKS;
> 
> 	const struct raid6_calls *gen_best;
> 	const struct raid6_recov_calls *rec_best;
>-	char *syndromes;
>-	void *dptrs[(65536/PAGE_SIZE)+2];
>-	int i;
>-
>-	for (i = 0; i < disks-2; i++)
>-		dptrs[i] = ((char *)raid6_gfmul) + PAGE_SIZE*i;
>+	char *disk_ptr;
>+	void *dptrs[RAID6_DISKS];
>+	int32_t seed;
>+	int i, j;
> 
>-	/* Normal code - use a 2-page allocation to avoid D$ conflict */
>-	syndromes = (void *) __get_free_pages(GFP_KERNEL, 1);
>+	/* use a 8-page allocation, The first 6 pages for disks
>+	   and the last 2 pages for syndromes */
>+	disk_ptr = (void *) __get_free_pages(GFP_KERNEL, 3);
> 
>-	if (!syndromes) {
>+	if (!disk_ptr) {
> 		pr_err("raid6: Yikes!  No memory available.\n");
> 		return -ENOMEM;
> 	}
> 
>-	dptrs[disks-2] = syndromes;
>-	dptrs[disks-1] = syndromes + PAGE_SIZE;
>+	/*use a PRNG based on LCB to fill the disks*/
>+	seed = 1;
>+	for (i = 0; i < disks-2; i++) {
>+		dptrs[i] = disk_ptr + PAGE_SIZE*i;
>+		for (j = 0; j < PAGE_SIZE; j = j + 4) {
>+			seed = ((seed * 1103515245) + 12345) & 0x7fffffff;
>+			*(int32_t *)(dptrs[i]+j) = seed;
>+		}
>+	}
>+
>+	dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);
>+	dptrs[disks-1] = disk_ptr + PAGE_SIZE*(disks-1);
> 
> 	/* select raid gen_syndrome function */
> 	gen_best = raid6_choose_gen(&dptrs, disks);
>@@ -234,7 +245,7 @@ int __init raid6_select_algo(void)
> 	/* select raid recover functions */
> 	rec_best = raid6_choose_recov();
> 
>-	free_pages((unsigned long)syndromes, 1);
>+	free_pages((unsigned long)disk_ptr, 3);
> 
> 	return gen_best && rec_best ? 0 : -EINVAL;
> }

Linear congruential is exactly what we don't want...
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply

* Raid settings
From: o1bigtenor @ 2016-08-28 21:43 UTC (permalink / raw)
  To: Linux-RAID

Greetings

I have been doing some research thinking toward the future.

Is there a 'best' raid setup?

It seems to me (a noob) that each of the options carries some negatives with it.

Is there a good option for say:

2 - 5 disks
4 - 8 disks
6 - 12 disks
10 - 30 disks
etc.

I looked at raid 5/6/10/50/60/100 and I am wondering where is the
'best' use of each of these options?

TIA

Dee

^ permalink raw reply

* Re: Raid settings
From: Wols Lists @ 2016-08-28 21:59 UTC (permalink / raw)
  To: o1bigtenor, Linux-RAID
In-Reply-To: <CAPpdf5_T=JjjUw2H5pAShmj4V6b5U8aALxDsiZvjyTUYy58DmQ@mail.gmail.com>

On 28/08/16 22:43, o1bigtenor wrote:
> Greetings
> 
> I have been doing some research thinking toward the future.
> 
> Is there a 'best' raid setup?

What do you want to achieve? There's no such thing as "best" - there's
only "most suitable for the circumstances".
> 
> It seems to me (a noob) that each of the options carries some negatives with it.
> 
> Is there a good option for say:
> 
> 2 - 5 disks
> 4 - 8 disks
> 6 - 12 disks
> 10 - 30 disks
> etc.
> 
> I looked at raid 5/6/10/50/60/100 and I am wondering where is the
> 'best' use of each of these options?
> 
Ignoring linear or stripe (which you seem to have done), with 2 disks
the only option is raid 1 (mirror). 3 disks gives you raid 5, and 4
disks gives you raid 6.

But do you want to make maximum use of the disk space (raid 6 is your
friend) or do you want maximum redundancy (raid 1)?

For my home system I've got 2 x 3TB in a raid1 config. I had intended to
add a 3rd drive and go raid5, but with two Barracudas I'd be an idiot
:-( If I want to go that route, I need three new proper raid drives :-(
I want maximum disk capacity with some redundancy, so raid 5 or 6 makes
most sense for me.

Without knowing what you want, we can't know what's best for you.

Cheers,
Wol

^ permalink raw reply

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Adam Goryachev @ 2016-08-28 23:54 UTC (permalink / raw)
  To: Benjammin2068, linux-raid
In-Reply-To: <57C32D8E.9030102@gmail.com>

On 29/08/16 04:29, Benjammin2068 wrote:
>
> On 08/26/2016 01:07 PM, Wols Lists wrote:
>> On 26/08/16 02:20, Ben wrote:
>>> [root@quantum ~]# smartctl -a /dev/sde
>>> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
>>> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>>>
>>> === START OF INFORMATION SECTION ===
>>> Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
>>> Device Model:     ST1000DM003-1ER162
>>> Serial Number:    Z4YDLXWJ
>>> LU WWN Device Id: 5 000c50 091877801
>>> Firmware Version: CC45
>>> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>> Device is:        In smartctl database [for details use: -P show]
>> Sorry Ben - that drive was NOT a smart buy !!! Seagate Barracuda :-(
>>
>> You MUST enable the timeout on this drive :-(
>>
>> Gut feel tells me most 1TB or less drives are okay in a raid - the
>> Barracudas are an exception :-( I've got two 3TB Barracudas mirrored,
>> and from reading the list, there's no way I'd go raid5 for more capacity
>> without ditching them.
>>
>> Most people seem to get WD Reds - I've asked about Seagate NAS but I've
>> not picked up on any reports about them - good or bad. Barracudas - the
>> news is pretty much all bad :-(
>>
>>
> Yea, I figured that out -- just couldn't find a decent detailed reference with what "SCT status supported" means versus the more fully featured.
When I saw this, I assumed it meant you can ask for the status, and it
will tell you it is disabled, but there is no support for modifying the
status (i.e., turning it on). Totally useless for all intents and purposes....

Then again, I could be wrong... but compare it to your other drive, which
showed additional capabilities, or to my one here:
SCT capabilities:              (0x0039) SCT Status supported.
                                         SCT Error Recovery Control supported.
                                         SCT Feature Control supported.
                                         SCT Data Table supported.

i.e., the second one is probably what you want, the third allows you to
turn it on/off, and I have no idea about the last option....

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply
