Linux RAID subsystem development

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Phil Turmel @ 2016-08-26  2:22 UTC
  To: Ben, linux-raid
In-Reply-To: <57BF9965.1020403@gmail.com>

On 08/25/2016 09:20 PM, Ben wrote:

> I read a lot of conflicting info on SCT/ERC online (well, TLER anyway)
> -- Adam likes it enabled. What say the rest of you?

Adam is correct, and it's not a matter of "like".  You either must have
it enabled, or you *must* apply the kernel driver timeout work-around
(180 seconds) for that drive.  Failure to do so results in crashed arrays.

Enterprise and NAS drives work out of the box.  Desktop/green drives do not.
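
For reference, a minimal sketch of the two options described above: set a short ERC limit where the drive accepts it, otherwise raise the kernel driver timeout to 180 seconds. It assumes smartmontools is installed and that smartctl returns a non-zero exit status when a drive rejects the SCT ERC command, which is worth checking on your version. The names apply_timeout_policy and SYSFS_ROOT are hypothetical, and the 7-second limit (70 tenths of a second) is a commonly used value, not one given in this thread.

```shell
#!/bin/sh
# Sketch only: prefer a 7.0 s SCT ERC limit; fall back to a 180 s kernel
# command timeout when the drive does not accept ERC.
# SYSFS_ROOT is a hypothetical override, useful for dry runs against a
# fake directory tree; it defaults to the real /sys.
apply_timeout_policy() {
    sysfs_root="${SYSFS_ROOT:-/sys}"
    for blk in "$sysfs_root"/block/sd*; do
        # Skip an unmatched glob and anything without a timeout file.
        [ -e "$blk/device/timeout" ] || continue
        dev="/dev/${blk##*/}"
        # scterc values are in tenths of a second: 70 = 7.0 s.
        if smartctl -l scterc,70,70 "$dev" >/dev/null 2>&1; then
            echo "$dev: SCT ERC set to 7.0 s"
        else
            echo 180 > "$blk/device/timeout"
            echo "$dev: ERC not set, kernel timeout raised to 180 s"
        fi
    done
}

# Usage (as root):
#   apply_timeout_policy
```

Many drives reportedly forget the ERC setting after a power cycle, and the sysfs timeout is reset at boot, so something like this would normally be run from a boot script.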

Some reading assignments from old discussions (read whole threads if you
have time):

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2

* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Ben @ 2016-08-26  1:20 UTC
  To: linux-raid
In-Reply-To: <933228e0-bce4-ffad-f48d-034bf89bc07f@websitemanagers.com.au>

[-- Attachment #1: Type: text/plain, Size: 2646 bytes --]

As an update,

Adam's been helping me out (and I'm not used to hitting "reply-all" on mailing lists, as pretty much all the ones I'm on set "Reply-To:").

I've turned on SCT/ERC for the drives... and the one that went bonkers during the rebuild (sde) would still have read issues during a rebuild.

SMART reports it's OK, but... (shrug) I ended up running ddrescue onto the new replacement drive (sdc), the one that kept getting put back into spare status whenever the rebuilds failed.

So I just copied sde -> sdc, which went pretty much flawlessly (ddrescue completed without any final complaints).

I also played with badblocks after doing my copy and could find bad blocks -- but apparently ddrescue had no issues.

So - I went back to:

* bringing up the array. No problems.
* adding ANOTHER new drive (that I ordered Sunday night), and it rebuilt fine.
* doing an fsck -n first, which reported no issues - so I did a regular fsck (without -y) and it never prompted me for anything.

My last step is to run rsync -n from my backup to see if it can find any issues between my last backup and the current data for any files with byte oddities.

All this has me wondering whether those old bad sectors left some files with a sector of garbage in them.

Adam seems to think everything is fine -- so far, that seems to be the case.

A last few questions I have are:

The new drive I got was (supposed to be) the same model as the last Seagate I ordered, but SMART reports them differently. (see attached)

The question on the new drive is that it says it does offline collection... but with gsmartcontrol, I can't seem to turn it on.

This new drive also doesn't seem to support SCT/ERC the same way.

Again,

/dev/sdc - old new spare (bought after seagate bought Samsung and discontinued the HD103SJ model)
/dev/sdd - original RAID member
/dev/sde - brand spanking new drive purchased Sunday.
/dev/sdf - original RAID member

I realize now one says: ST1000DM005 vs ST1000DM003 - Grrr!!!

So I'd like recommendations on whether I should get better matching drives (I can use these elsewhere) or it doesn't matter.

Can I mix/match this array with WD REDs? (and eventually retire all these HD103SJ drives) Do people even like these? They seem ok?

I read a lot of conflicting info on SCT/ERC online (well, TLER anyway) -- Adam likes it enabled. What say the rest of you?

And last -- any caveats as to upgrading this array to RAID6 from RAID5? Can I even do that while in place?

Thanks all, (especially Adam!)

  -Ben

p.s. Check out some of the SMART parms on the /dev/sde. Head flying hours?? And they're not zero. Weird. :/ This drive kinda creeps me out.


[-- Attachment #2: RAID.smart-info.txt --]
[-- Type: text/plain, Size: 20353 bytes --]

[root@quantum ~]# smartctl -a /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST1000DM005 HD103SJ
Serial Number:    S246JQ0D800949
LU WWN Device Id: 5 0000f0 080bb4909
Firmware Version: 1AJ10001
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Thu Aug 25 20:04:06 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 9120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 152) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   054   054   000    Old_age   Always       -       8630
  3 Spin_Up_Time            0x0023   076   071   025    Pre-fail  Always       -       7526
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       133
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   063   000    Old_age   Always       -       30 (Min/Max 21/37)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       10
200 Multi_Zone_Error_Rate   0x002a   100   096   000    Old_age   Always       -       558
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       14
========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD103SJ
Serial Number:    S246J9AB404176
LU WWN Device Id: 5 0024e9 204fbf695
Firmware Version: 1AJ10001
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Thu Aug 25 20:05:32 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 9180) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 153) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       195
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   073   070   025    Pre-fail  Always       -       8310
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       37763
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       75
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   062   000    Old_age   Always       -       31 (Min/Max 20/43)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       8
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       146
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       77
========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sde
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST1000DM003-1ER162
Serial Number:    Z4YDLXWJ
LU WWN Device Id: 5 000c50 091877801
Firmware Version: CC45
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 (unknown minor revision code: 0x001f)
Local Time is:    Thu Aug 25 20:06:33 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   80) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 105) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   100   006    Pre-fail  Always       -       18255632
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       269743
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       2
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   071   068   045    Old_age   Always       -       29 (Min/Max 26/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 25 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       109964047679495
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3907074414
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       5102115

========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sdf
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD103SJ
Serial Number:    S246J9AB404174
LU WWN Device Id: 5 0024e9 204fbf676
Firmware Version: 1AJ10001
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Thu Aug 25 20:07:19 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 9360) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 156) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       353
  2 Throughput_Performance  0x0026   055   055   000    Old_age   Always       -       8559
  3 Spin_Up_Time            0x0023   073   069   025    Pre-fail  Always       -       8389
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       43724
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       92
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   063   000    Old_age   Always       -       30 (Min/Max 15/40)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       91
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       229
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       100

========================================================================================================================





* Re: kernel checksumming performance vs actual raid device performance
From: Adam Goryachev @ 2016-08-25 23:39 UTC
  To: Matt Garman; +Cc: Mdadm
In-Reply-To: <CAJvUf-DXC6AtO3a=ox2XOinpRWgAv5NMPkRWAcsSZmBggF5_Dw@mail.gmail.com>

On 26/08/16 01:07, Matt Garman wrote:
>
>> Makes sense.  I know the stripe cache size is conservative by default
>> because of the fact that it's not shared with the page cache, so you
>> might as well consider it's memory lost.  When you upped it to 64k, and
>> you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
>> allowed stripes which is a maximum memory consumption of around 700GB
>> RAM.  I doubt you have that much in your machine, so I'm guessing it's
>> simply using all available RAM that the page cache or something else
>> isn't already using.  That also explains why setting it higher doesn't
>> provide any additional benefits ;-).
> Do you think more RAM might be beneficial then?
I'm not sure, but I suggest trying various values for
stripe_cache_size. In my testing I tried values up to 64k, but 4k ended
up being the optimal value (I only have 8 disks with a 64k chunk
size)...
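
A rough way to size the cache before experimenting: the kernel md documentation gives the cost as memory_consumed = system_page_size * nr_disks * stripe_cache_size. The value counts cache entries, each holding one page per member device, not whole chunk-sized stripes, so the result is much smaller than the per-stripe estimate quoted above. The sketch below just evaluates that formula; stripe_cache_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: evaluate the formula from the kernel md documentation,
#   memory_consumed = system_page_size * nr_disks * stripe_cache_size
# Arguments: <stripe_cache_size> <nr_disks> [page_size_bytes]
stripe_cache_bytes() {
    entries="$1"
    disks="$2"
    # Default to the running system's page size when none is given.
    page_size="${3:-$(getconf PAGESIZE)}"
    echo $(( $entries * $disks * $page_size ))
}

# Usage:
#   stripe_cache_bytes 4096 8 4096     # prints 134217728  (128 MiB)
#   stripe_cache_bytes 65536 22 4096   # prints 5905580032 (about 5.5 GiB)
```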
>
>> I would try to tune your stripe cache size such that the kswapd?
>> processes go to sleep.  Those are reading/writing swap.  That won't help
>> your overall performance.
> Do you mean swapping as in swapping memory to disk?  I don't think
> that is happening.  I have 32 GB of swap space, but according to "free
> -k" only 48k of swap is being used, and that number never grows.
> Also, I don't have any of the classic telltale signs of disk-swapping,
> e.g. overall laggy system feel.
>
> Also, I re-set the stripe_cache_size back down to 256, and those
> kswapd processes continue to peg a couple CPUs.  IOW,
> stripe_cache_size doesn't appear to have much effect on kswapd.
You can find out whether you are swapping with vmstat:
vmstat 5
Watch the swap columns (si and so); if they are non-zero, then you are
indeed swapping.
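
The same check can be scripted against the cumulative pswpin and pswpout counters in /proc/vmstat, which is handy for logging during a long test run. This is a sketch under that assumption; swap_counters and swapped_during are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: print the cumulative swap-in and swap-out page counters.
# Argument: path to a vmstat-style file (defaults to /proc/vmstat).
swap_counters() {
    awk 'BEGIN { i = 0; o = 0 }
         $1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i, o }' "${1:-/proc/vmstat}"
}

# Sample the counters twice and report whether they moved.
# Arguments: [interval_seconds] [path]
swapped_during() {
    interval="${1:-5}"
    src="${2:-/proc/vmstat}"
    before=$(swap_counters "$src")
    sleep "$interval"
    after=$(swap_counters "$src")
    if [ "$before" = "$after" ]; then
        echo "no swap traffic"
    else
        echo "swap traffic: $before -> $after"
    fi
}

# Usage:
#   swapped_during 5
```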

You might find that if there is insufficient memory, then the kernel 
will automatically reduce/limit the value for the stripe_cache_size (I'm 
only guessing, but my memory tells me that the kernel locks this memory 
and it can't be swapped/etc).

>
> On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
>> 2. the state machine runs in a single thread, which is a bottleneck. try to
>> increase group_thread_cnt, which will make the handling multi-thread.
> For others' reference, this parameter is in
> /sys/block/<device>/md/stripe_cache_size.
>
> On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
> it to 4, and the degraded reads went up dramatically.  Need to
> experiment with this (and all the other tunables) some more, but that
> change alone put me up to 2.5 GB/s read from the degraded array!

Did you mean group_thread_cnt, which defaults to 0?
I don't recall the default for stripe_cache_size, but I'm pretty certain
it is not 0...
Note that in your case it might improve the "test read scenario", but since
your "live" scenario has a lot more CPU overhead, this option might
decrease overall results... Unfortunately, only testing with a "live" load
will really give you the information you need to decide on this.
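
To avoid mixing up the two tunables, a small sketch that prints both from the same sysfs directory, /sys/block/<device>/md/. Mainline kernels are generally understood to default to 256 for stripe_cache_size and 0 for group_thread_cnt, but check your own system. show_md_tunables and SYSFS_ROOT are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sketch: print the two RAID5/6 tunables discussed above for one array.
# Argument: md device name, e.g. md0.
# SYSFS_ROOT is a hypothetical override for dry runs; defaults to /sys.
show_md_tunables() {
    md="$1"
    sysfs_root="${SYSFS_ROOT:-/sys}"
    for name in stripe_cache_size group_thread_cnt; do
        f="$sysfs_root/block/$md/md/$name"
        if [ -r "$f" ]; then
            printf '%s=%s\n' "$name" "$(cat "$f")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Usage:
#   show_md_tunables md0
# Changing a value (as root) is a plain write, for example:
#   echo 4 > /sys/block/md0/md/group_thread_cnt
```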

Regards,
Adam



-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-25 22:32 UTC
  To: Chris Murphy, Linux-RAID
In-Reply-To: <20160825062501.GN32250@subspacefield.org>

On Thu, Aug 25, 2016 at 12:25 AM,
<travis+ml-linux-raid@subspacefield.org> wrote:

> $ sudo mdadm -E /dev/sdd1
> /dev/sdd1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : <elided>
>            Name : <elided>
>   Creation Time : Wed Aug 10 11:33:41 2016
>      Raid Level : raid0
>    Raid Devices : 4
>
>  Avail Dev Size : 7814035071 (3726.02 GiB 4000.79 GB)
>     Data Offset : 16 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : <elided)
>
>     Update Time : Wed Aug 10 11:33:41 2016
>        Checksum : 490b562f - correct
>          Events : 0
>
>      Chunk Size : 512K
>
>    Device Role : Active device 0
>    Array State : AAAA ('A' == active, '.' == missing)

I'm confused by Events: 0, even though I see the same thing with raid0
and linear arrays. As writes happen and the array is stopped and started,
this Events count does not increase. A parity-raid-only thing, I guess?

Anyway, sdd1 has both an mdadm superblock on it, as shown above, and
it also has a GPT on it as shown in your first message and below -
that's not good, but not unfixable. The mdadm super block starts at
LBA 8, 4096 bytes from the start of that partition, so it's safe to
zero the first 4096 bytes. The GPT is mainly in the first three
sectors so you could just write zeros for a count of 3, although it is
more complete to zero with a count=8, for the partition, not the whole
device.
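
A hedged sketch of that count=8 wipe, with a backup taken first so it can be undone. This is destructive: the target must be the partition (for example /dev/sdd1), never the whole disk, and the array should be stopped. wipe_partition_head and the backup path are hypothetical names; the sector count assumes a 1.2 superblock at sector 8 as shown in the -E output above.

```shell
#!/bin/sh
# Sketch: save, then zero, the first 8 sectors (4096 bytes) of a
# partition, leaving the md v1.2 superblock at sector 8 untouched.
# Arguments: <partition> <backup_file>
wipe_partition_head() {
    part="$1"
    backup="$2"
    # Keep a copy of the 4096 bytes about to be overwritten.
    dd if="$part" of="$backup" bs=512 count=8 2>/dev/null || return 1
    # conv=notrunc has no effect on a block device; it only keeps a
    # regular file (as used in a dry run) from being truncated.
    dd if=/dev/zero of="$part" bs=512 count=8 conv=notrunc 2>/dev/null
}

# Usage (as root, array stopped):
#   wipe_partition_head /dev/sdd1 /root/sdd1-head.bak
# Undo:
#   dd if=/root/sdd1-head.bak of=/dev/sdd1 bs=512 count=8
```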


>
> Here is what should be the same, only device 2 in the array
> (device 3 is similar or identical):
>
> $ sudo mdadm -E /dev/sdf1
> /dev/sdf1:
>    MBR Magic : aa55
> Partition[0] :   4294967295 sectors at            1 (type ee)

Looks like the mdadm super block might have been stepped on by
something. You'd need to look for some evidence of it using something
like

dd if=/dev/sdf1 count=9 2>/dev/null | hexdump -C

If it's intact it should be at offset 0x1000, and again it's just a
matter of wiping the first 8 sectors of the partition, not the whole
device.
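
Rather than reading the hexdump by eye, the check can be narrowed to the four magic bytes. The magic a92b4efc shown by mdadm -E is stored little-endian on disk, so at offset 0x1000 it appears as fc 4e 2b a9. A minimal sketch, assuming a 1.2 superblock; has_md12_magic is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if the md v1.x magic is present 4096 bytes into the
# given partition (the location used by 1.2 metadata).
# Argument: <partition>
has_md12_magic() {
    # Magic 0xa92b4efc is stored little-endian: fc 4e 2b a9.
    bytes=$(dd if="$1" bs=1 skip=4096 count=4 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    [ "$bytes" = "fc4e2ba9" ]
}

# Usage (as root):
#   if has_md12_magic /dev/sdf1; then
#       echo "superblock magic found"
#   else
#       echo "no magic at offset 0x1000"
#   fi
```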






> $ sudo mdadm -D /dev/sdf1
> mdadm: /dev/sdf1 does not appear to be an md device

You're getting the commands confused. -E applies to /dev/sdXY member
devices, and -D applies to /dev/mdX arrays.


>
> Sadly, I can't do a mdadm -D because I can't assemble the RAID.
> $ sudo mdadm -E /dev/md127

Again, wrong command, you should use -D for this.


> $
>
> The command history is gone, but I would imagine that the RAID was
> created with something like this:
>
> mdadm --create /dev/md/bu --level=0 --raid-devices=4 /dev/sd{b,c,d,e}1
>
> Although it could have been level=linear.
>
> To summarize my email:
> "Is this a known problem? If not, here is a bug report"

This is not a bug report. There are no steps to reproduce, there's no
evidence of a bug. I'm not experiencing random replacement of mdadm
superblock data with MBR and GPT signatures. That's not really what
I'd expect of drive or enclosure firmware which by design should be
partition agnostic, as there's more than one or two valid kinds of
partitioning. Plus, it'd be scary: even if it picked the right one, it
could clobber a legitimate existing one.

So I'd say it's something else.


>> It's purely speculation, but it sounds like to me in the history of
>> one or more drives, the previous signatures weren't removed before the
>> drive was retasked for its new purpose. That's the folly of not wiping
>> the signatures in the reverse order they were created, and just
>> expecting that starting over will wipe those old signatures.
>
> It's possible, but why would you ever end up with a GPT in a partition?

In every case I've seen, it was user error. I haven't heard of things
putting GPTs in partitions, and in a sense I'd say it's a bug if any
utility lets a user do that. Nesting GPTs in partitions is a bad idea,
although it *should* be innocuous: nothing should see or honor a nested
GPT unless it goes looking for one, because it doesn't belong there.



>
> I've certainly encountered this "GPT outside cylinder 0" on these two
> drives before,

Keep in mind cylinders are gone, they don't exist anymore. Drives all
speak in LBAs now. *shrug* The GPT typically involves LBAs 0, 1 and 2
at least, more if there are more than 4 partitions.

> but it goes away with a forcible reassemble or recreate
> (which I did last time), because the mdlabel blows it away.

Umm, I think that only happens with -U, --update.


>Unless
> it's something this list knows about, I suspect it is a firmware
> glitch in the USB enclosure.

Doubtful.




>
>> But I think there is a legitimate gripe that parted probably should
>> not operate on partitions like this. It's not valid to have nested
>> GPTs like this. And I have no idea if parted is showing you valid or
>> bogus information. You'd need to do something like:
>>
>> dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C
>
> ## Good disk (for comparison):
> $ sudo dd if=/dev/sdd1 count=2 2> /dev/null | file -
> /dev/stdin: data
> $ sudo dd if=/dev/sdd1 count=2 2> /dev/null | hexdump -C | head -20
> 00000000  ff 02 19 2e 03 ee fa d8  6d d7 24 78 e1 d4 04 3d  |........m.$x...=|
> 00000010  c9 92 33 97 17 7a 10 d3  05 bd 39 36 b4 a9 7c 14  |..3..z....96..|.|
> 00000020  a7 de 66 b6 cd d9 ff ef  45 27 74 6e 94 0a 03 49  |..f.....E'tn...I|
> 00000030  d4 43 26 2d 45 39 d1 93  8a 35 91 91 ff c9 a4 8e  |.C&-E9...5......|
> 00000040  bd 9a 06 6d cc f2 89 65  c0 91 87 1c 1b f0 da 2f  |...m...e......./|
> 00000050  83 c2 12 eb 80 3c c2 4c  68 cc 65 40 26 13 e0 77  |.....<.Lh.e@&..w|
> 00000060  38 15 ed 78 27 76 4c 91  71 99 3e 9f 99 f1 3f 51  |8..x'vL.q.>...?Q|
> 00000070  19 db 12 a3 ac b6 61 12  ff d9 37 87 31 1f 8b dd  |......a...7.1...|
> 00000080  88 82 de fb db f2 a5 31  10 2a d2 03 be 12 be bd  |.......1.*......|
> 00000090  19 46 9f c1 3b ea a1 37  81 d2 4d 00 54 e7 b4 55  |.F..;..7..M.T..U|
> 000000a0  b7 65 6c 3f 95 40 b0 f4  28 ff 90 62 22 cb 22 fd  |.el?.@..(..b".".|
> 000000b0  6b 4d 90 56 32 4b c6 22  35 b1 62 76 e1 fd 82 d5  |kM.V2K."5.bv....|
> 000000c0  03 40 c0 85 4b ac 5a 44  9e 6a 25 97 d3 7f bd fe  |.@..K.ZD.j%.....|
> 000000d0  0c 2d a8 bb 33 f4 00 df  7a 05 ae 6d b3 3e f3 7d  |.-..3...z..m.>.}|
> 000000e0  34 9e 0e 57 14 de d8 e0  28 63 82 a6 2a 8a 1f fc  |4..W....(c..*...|
> 000000f0  fe 2f b0 69 67 ac 0a e9  c2 53 a7 d8 36 1a 18 5a  |./.ig....S..6..Z|
> 00000100  d6 d4 e6 ce df f7 fc 67  13 eb 25 08 45 50 10 7b  |.......g..%.EP.{|
> 00000110  c6 23 1e 59 dc 2d c2 65  53 90 ca ec 21 e7 28 74  |.#.Y.-.eS...!.(t|
> 00000120  41 7f 3e 58 72 08 75 c1  d5 ca d0 91 55 5f 43 6a  |A.>Xr.u.....U_Cj|
> 00000130  4e 84 d5 7f aa f2 b5 27  e4 86 5d 28 ae 6c 29 a1  |N......'..](.l).|

OK, I don't know why you used head; I needed to see past offset 0x130.
Offset lines 0x1f0 and 0x200 have the MBR and GPT signatures, so the
above doesn't really tell me anything.

I don't recognize the above stuff, so I'm not sure what it is. I'd
usually expect it to be zeros if it's not a boot drive.
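
For reference, the two signatures sit at fixed offsets, so the check can
be sketched in a few lines of Python. This is only an illustration on a
synthetic buffer; the function name is hypothetical, and the offsets are
the standard ones (55 aa at the end of sector 0, "EFI PART" at the start
of sector 1, assuming 512-byte sectors).

```python
def boot_signatures(buf: bytes) -> dict:
    """Report which signatures appear in the first two 512-byte sectors."""
    return {
        # MBR boot signature: last two bytes of sector 0 (offsets 0x1fe-0x1ff).
        "mbr": buf[0x1FE:0x200] == b"\x55\xaa",
        # GPT header signature: first eight bytes of sector 1 (offset 0x200).
        "gpt": buf[0x200:0x208] == b"EFI PART",
    }

# Synthetic two-sector buffers: one blank, one carrying both signatures.
blank = bytes(1024)
marked = bytearray(1024)
marked[0x1FE:0x200] = b"\x55\xaa"
marked[0x200:0x208] = b"EFI PART"
print(boot_signatures(blank))
print(boot_signatures(bytes(marked)))
```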

>
> ## Bad disk:
> $ sudo dd if=/dev/sdf1 count=2 2> /dev/null | file -
> /dev/stdin: x86 boot sector; partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, code offset 0x6f
> $ sudo dd if=/dev/sdf1 count=2 2> /dev/null | hexdump -C
> 00000000  38 6f 96 52 ea 9c 31 cd  10 a2 84 58 a2 f0 f5 43  |8o.R..1....X...C|
> 00000010  0f f2 5a 9b c7 ff 82 b2  d8 59 86 60 15 bc 31 65  |..Z......Y.`..1e|
> 00000020  bc d7 77 f9 31 6a c8 16  3f 13 90 24 b7 57 ff 6b  |..w.1j..?..$.W.k|
> 00000030  64 7e e2 99 2a 99 f7 32  69 be aa 56 36 31 f7 db  |d~..*..2i..V61..|
> 00000040  8c 4c 4c 12 68 19 77 0f  f6 3b 92 bf 18 92 c2 45  |.LL.h.w..;.....E|
> 00000050  73 d5 b7 93 cc ae 6b b9  b0 bd 0c 85 a9 c3 19 f7  |s.....k.........|
> 00000060  87 34 b8 be 0a 95 cd 03  03 d5 01 49 b5 b0 86 fe  |.4.........I....|
> 00000070  71 1c d2 f6 42 ed ce b0  eb c3 5f 4c 07 34 30 c7  |q...B....._L.40.|
> 00000080  8a 1f 91 c4 8b 28 b9 07  8e da ae 7d 7d c5 24 2b  |.....(.....}}.$+|
> 00000090  6d f9 ea a3 6a 83 9d b8  6a 1f 6d db 3a 01 22 c7  |m...j...j.m.:.".|
> 000000a0  56 fc 2a 46 f8 b2 84 31  d1 8b 58 55 b6 5a 36 7b  |V.*F...1..XU.Z6{|
> 000000b0  48 5d 98 2a 3f f0 ae 80  2b f8 6b b2 7f 1e 27 c2  |H].*?...+.k...'.|
> 000000c0  59 65 d0 bf c7 f0 5b 18  dc 59 8e 68 46 03 b6 ca  |Ye....[..Y.hF...|
> 000000d0  42 06 7a 52 7a 49 36 03  0d d5 9b 67 a2 03 3b 13  |B.zRzI6....g..;.|
> 000000e0  40 23 19 f5 1a a6 bd fb  c8 d5 5b 26 f5 6a 86 ab  |@#........[&.j..|
> 000000f0  89 77 98 d8 09 cb b7 59  80 03 81 48 ba c6 ce 77  |.w.....Y...H...w|
> 00000100  3c 6c d2 ba a0 71 c3 20  18 fd 77 db ca a8 8a e3  |<l...q. ..w.....|
> 00000110  8d 6c 1f 17 d5 9f e5 81  bf 50 62 c3 bc f8 6c 5d  |.l.......Pb...l]|
> 00000120  f7 3f a6 37 6b a9 53 2b  88 15 5d 6e 1e 48 4f b4  |.?.7k.S+..]n.HO.|
> 00000130  db af b4 f7 f5 7b 4d f3  3f 60 44 60 6e a2 c4 6d  |.....{M.?`D`n..m|
> 00000140  b9 6c 88 04 e8 66 d1 7c  a0 09 10 66 32 de 70 e1  |.l...f.|...f2.p.|
> 00000150  98 40 54 5e 1d f2 af b8  2e d1 75 0d 3c 46 1f f8  |.@T^......u.<F..|
> 00000160  85 72 49 87 ad 92 59 28  fd 9d 22 8e 1b 9f 2c 00  |.rI...Y(.."...,.|
> 00000170  87 58 74 01 63 a5 94 13  e3 9c ea ec 3f 21 22 41  |.Xt.c.......?!"A|
> 00000180  05 13 78 f3 a8 46 b3 02  9e 23 cb 9d 21 db a6 ae  |..x..F...#..!...|
> 00000190  08 a8 70 48 18 6c e2 38  e4 ac 03 6e 06 74 17 7c  |..pH.l.8...n.t.||
> 000001a0  90 ca 9f 5e 2e 2b 84 ef  52 2c 08 9a 48 98 f9 46  |...^.+..R,..H..F|
> 000001b0  f4 9f 00 cd ec a0 11 d7  00 00 00 00 00 00 00 00  |................|
> 000001c0  02 00 ee ff ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
> 000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
> 00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
> 00000210  3a dc 43 c4 00 00 00 00  01 00 00 00 00 00 00 00  |:.C.............|
> 00000220  8e b6 c0 d1 01 00 00 00  22 00 00 00 00 00 00 00  |........".......|
> 00000230  6d b6 c0 d1 01 00 00 00  a5 4f bd 75 f6 c8 4f 43  |m........O.u..OC|
> 00000240  92 31 ab b6 a9 59 aa 04  02 00 00 00 00 00 00 00  |.1...Y..........|
> 00000250  80 00 00 00 80 00 00 00  59 04 3d 4a 00 00 00 00  |........Y.=J....|
> 00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|


OK, it does in fact have a PMBR and GPT in the 1st and 2nd sectors of
this partition. Pretty weird how it got there. There is a UUID (the GPT
disk GUID) starting at offset 0x238, so you can look around and see if
anything else has that UUID, or if that UUID ever changes or comes back
after you fix this. If it's not the same UUID, something is creating it
with a random UUID each time, which would mean it's not just being
copied from somewhere.
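
To make that comparison easier, here is a sketch that decodes the GPT
header bytes from the sdf1 dump above (offsets 0x200-0x25f, copied by
hand). It assumes the standard 92-byte little-endian GPT header layout,
in which the disk GUID sits 0x38 bytes into the header, hence 0x238 in
the dump. The function and key names are made up for illustration.

```python
import struct
import uuid

# Bytes 0x200-0x25f of /dev/sdf1, copied from the hexdump above.
HEADER_HEX = """
45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00
3a dc 43 c4 00 00 00 00 01 00 00 00 00 00 00 00
8e b6 c0 d1 01 00 00 00 22 00 00 00 00 00 00 00
6d b6 c0 d1 01 00 00 00 a5 4f bd 75 f6 c8 4f 43
92 31 ab b6 a9 59 aa 04 02 00 00 00 00 00 00 00
80 00 00 00 80 00 00 00 59 04 3d 4a 00 00 00 00
"""

def parse_gpt_header(hdr: bytes) -> dict:
    """Decode the fixed fields of a GPT header (assumed standard 92-byte layout)."""
    (sig, _rev, size, _crc, _reserved, my_lba, alt_lba, first_usable,
     last_usable, guid, entries_lba, n_entries, entry_size,
     _entries_crc) = struct.unpack_from("<8sIIIIQQQQ16sQIII", hdr)
    return {
        "signature": sig,
        "header_size": size,
        "my_lba": my_lba,
        "alternate_lba": alt_lba,
        "first_usable_lba": first_usable,
        "last_usable_lba": last_usable,
        # GPT stores GUIDs in mixed-endian form, which bytes_le handles.
        "disk_guid": uuid.UUID(bytes_le=guid),
        "entries_lba": entries_lba,
        "num_entries": n_entries,
        "entry_size": entry_size,
    }

fields = parse_gpt_header(bytes.fromhex("".join(HEADER_HEX.split())))
for key, value in fields.items():
    print(f"{key:>17}: {value}")
```

The disk_guid line is the value to compare against other devices, or
against whatever shows up if the nested GPT reappears after a wipe.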


>
> ## is that the same as the boot sector itself?  Interesting q.
> # dd if=/dev/sdd count=2 of=/tmp/foo && dd if=/dev/sdd1 count=2 of=/tmp/bar && cmp /tmp/foo /tmp/bar
> ## Nope, how do they differ?  Well that's a bit unpleasant to do manually but here...
> # dd if=/dev/sdd count=2 2> /dev/null | hexdump -C
> 00000000  10 06 27 48 33 df bb 55  8b 28 fe 60 5e 18 6d 38  |..'H3..U.(.`^.m8|
> 00000010  fc b3 17 36 55 de fd 83  d0 52 72 19 d0 76 12 f0  |...6U....Rr..v..|
> 00000020  1e 23 bc 4d c5 4d c2 d6  5a d4 2b cd 16 78 c9 28  |.#.M.M..Z.+..x.(|
> 00000030  77 21 c4 9f c4 b7 48 ad  e0 7b 08 d6 f5 8e 92 a7  |w!....H..{......|
> 00000040  bc 88 35 02 e7 f8 b8 3b  05 97 db a3 ad e7 96 4b  |..5....;.......K|
> 00000050  84 d9 e2 a4 3a 5a 07 ac  fc a2 78 58 d7 c8 5a 19  |....:Z....xX..Z.|
> 00000060  88 9c f6 f2 c0 ec 99 55  d9 5d 00 87 3a 86 52 01  |.......U.]..:.R.|
> 00000070  92 58 25 82 99 50 8e 28  0f 42 07 71 9a a3 db 82  |.X%..P.(.B.q....|
> 00000080  00 d9 b8 28 9d d8 97 85  9d c6 fb 5e 4d 94 3a 6e  |...(.......^M.:n|
> 00000090  19 3c a6 ce 57 6b a0 52  d6 72 0c 41 2e cd cb a2  |.<..Wk.R.r.A....|
> 000000a0  15 c8 d4 c8 8c 90 34 5f  15 ab 69 96 af 3d 7e 30  |......4_..i..=~0|
> 000000b0  25 e1 72 35 d6 c4 b2 5e  78 72 0b 3f 9a 96 40 7e  |%.r5...^xr.?..@~|
> 000000c0  c6 aa 0e 5a da 99 ae fe  a3 93 8b 5b c4 bf 91 64  |...Z.......[...d|
> 000000d0  d5 62 12 ea 70 15 a9 05  81 8d e4 fb 36 15 c9 63  |.b..p.......6..c|
> 000000e0  ba f9 d2 5c f6 df 28 71  d8 d5 82 95 2b 83 40 db  |...\..(q....+.@.|
> 000000f0  9b fe e2 a7 9b 38 5e 5f  51 a6 6e e6 7b 4e bf 02  |.....8^_Q.n.{N..|
> 00000100  d2 fb aa f9 2c 7a 5b f5  47 ad ac 7e d1 1c f3 1b  |....,z[.G..~....|
> 00000110  a3 8e 54 9f a4 8d 1a 02  3f cc 81 f0 ca e9 28 1e  |..T.....?.....(.|
> 00000120  33 9e d8 71 dd f2 aa b7  d4 06 96 cb 0c 8e f1 6a  |3..q...........j|
> 00000130  88 1d 2a 8a a3 33 00 8c  ef d4 d8 39 3e 70 18 34  |..*..3.....9>p.4|
> 00000140  e6 3a cd e7 0b d6 82 a8  a4 aa ff bd b3 69 0a cc  |.:...........i..|
> 00000150  32 9e e3 26 34 bb cc 0e  b0 69 5f 9a c5 f3 57 7d  |2..&4....i_...W}|
> 00000160  47 82 bc 66 44 55 c4 de  3c 2c 14 d0 9a 73 6a da  |G..fDU..<,...sj.|
> 00000170  3c 5e f8 99 26 5b f4 8a  13 a1 f1 c8 a9 20 4c 3a  |<^..&[....... L:|
> 00000180  bd 03 4e e9 83 25 46 32  3f 80 3e 42 58 e7 18 27  |..N..%F2?.>BX..'|
> 00000190  8a c8 7c 8c 74 99 96 61  d4 e2 58 c2 27 71 8c 3b  |..|.t..a..X.'q.;|
> 000001a0  da 33 f8 7f b5 c1 a7 a0  c2 7b 54 29 0d 47 b4 b5  |.3.......{T).G..|
> 000001b0  4c 62 5b f8 e9 6f bc 29  00 00 00 00 00 00 00 00  |Lb[..o.)........|
> 000001c0  02 00 ee ff ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
> 000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
> 00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
> 00000210  62 01 85 1f 00 00 00 00  01 00 00 00 00 00 00 00  |b...............|
> 00000220  af be c0 d1 01 00 00 00  22 00 00 00 00 00 00 00  |........".......|
> 00000230  8e be c0 d1 01 00 00 00  e2 89 58 78 77 63 52 44  |..........XxwcRD|
> 00000240  93 9e 4a 93 16 06 86 6b  02 00 00 00 00 00 00 00  |..J....k........|
> 00000250  80 00 00 00 80 00 00 00  5d ff 7e 02 00 00 00 00  |........].~.....|
> 00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

We kinda expect sdd to have a valid PMBR and GPT though... so that's
sane. I just don't know what to make of the stuff in LBA 0 before the
PMBR.


> I understand and can probably acquire the most recent stable and
> compile from source, if you think that would prove useful enough to
> justify the effort.  TBH once GPT came out I lost track of which
> partitioning tool was appropriate to use, it seemed like (IIRC)
> cfdisk, sfdisk, parted were all vying for my attention... is parted
> now the standard?

It is common. I prefer gdisk, which has a nomenclature similar to
fdisk. The nomenclature of parted is confusing.


>
> At the current moment I am backing up the drives so that I can try a
> forcible reassemble.  I think that last time this happened, that
> effectively relabeled the mdraid partitions and fixed the problem.
> The underlying mdraid has an LVM on LUKS, but last time this happened
> I managed to fsck and get 99% of the data back, with only a few things
> ending up in lost+found.  Presumably there might have been some data
> corruption, but since it's a backup server only I consider it
> tolerable, modulo the failed Windows system which needs to restore
> from it.

FWIW, if you want either linear or raid0, there's probably a much
simpler layout. Blow away all four drives with hdparm and ATA security
erase to get rid of all signatures. Then make all of them into LVM
physical volumes without any partitioning, and make a logical volume,
which by default is linear/concat, or you can choose raid0 (this is a
per logical volume characteristic). Then encrypt the LV and format the
LUKS volume. There's no advantage to adding either partitions or mdadm
RAIDs if you're going to use LVM anyway and this is a Linux-only storage
enclosure.


-- 
Chris Murphy

^ permalink raw reply

* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Wols Lists @ 2016-08-25 21:06 UTC (permalink / raw)
  To: Linux-RAID, travis+ml-linux-raid
In-Reply-To: <20160825062501.GN32250@subspacefield.org>

On 25/08/16 07:25, travis+ml-linux-raid@subspacefield.org wrote:
> I understand and can probably acquire the most recent stable and
> compile from source, if you think that would prove useful enough to
> justify the effort.  TBH once GPT came out I lost track of which
> partitioning tool was appropriate to use, it seemed like (IIRC)
> cfdisk, sfdisk, parted were all vying for my attention... is parted
> now the standard?

To add to the fun, I use gdisk (or is it gfdisk?).

Like so many things gnu, when I looked at parted I ran away screaming
from the feature overkill ... :-)

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Jes Sorensen @ 2016-08-25 17:45 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <20160824161044.20887-1-robert@leblancnet.us>

Robert LeBlanc <robert@leblancnet.us> writes:
> Linux allows for 32 character device names. When using the maximum size device name and also
> storing "/dev/", devname needs to be 37 character long to store the complete device name.
> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>
> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
> ---
>  mdopen.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Looks good - I corrected your comment to fit into a proper editor width
of 80 characters, and also fixed up the SOB since it needs to say
Signed-off-by rather than signed-off.

Applied!
Jes

^ permalink raw reply

* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: Shaohua Li @ 2016-08-25 17:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87bn0hfnq6.fsf@notabene.neil.brown.name>

On Thu, Aug 25, 2016 at 02:59:13PM +1000, Neil Brown wrote:
> On Wed, Aug 24 2016, Shaohua Li wrote:
> 
> > On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
> >> On Wed, Aug 17 2016, Shaohua Li wrote:
> >> >> >
> >> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
> >> >> > thread. As stopping the thread will wait for the thread, which probably is
> >> >> > doing r5l_do_reclaim and writting the superblock. Since we are writting the
> >> >> > superblock, we must hold the reconfig_mutex.
> >> >> 
> >> >> When you say "writing the superblock" you presumably mean "blocked in
> >> >> r5l_write_super_and_discard_space(), waiting for  MD_CHANGE_PENDING to
> >> >> be cleared" ??
> >> > right
> >> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
> >> >> ->quiesce to be set, and then exit gracefully.
> >> >
> >> > Can you give details about this please? .quiesce is called with reconfig_mutex
> >> > hold, so the MD_CHANGE_PENDING will never get cleared.
> >> 
> >> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
> >> 
> >> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
> >> 
> >> But the reclaim thread might be in
> >>    r5l_do_reclaim() -> r5l_write_super_and_discard_space()
> >> waiting for MD_CHANGE_PENDING to clear.  That will only get cleared when
> >> the main thread can get the reconfig_mutex, which the thread calling
> >> raid5_quiesce() might hold.  So we get a deadlock.
> >> 
> >> My suggestion is to change r5l_write_super_and_discard_space() so that
> >> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
> >> to be set.  That will avoid the deadlock.
> >> 
> >> Whatever thread called raid5_quiesce() will now be in control of the
> >> array without any async IO going on.  If it needs the metadata to be
> >> sync, it can do that itself.  If not, then it doesn't really matter that
> >> r5l_write_super_and_discard_space() didn't wait.
> >
> > I'm afraid waiting conf->quiesce set isn't safe. The reason to wait for
> > superblock write isn't because of async IO. discard could zero data, so before
> > we do discard, we must make sure superblock points to correct log tail,
> > otherwise recovery will not work. This is the reason we wait for superblock
> > write.
> >
> >> r5l_write_super_and_discard_space() shouldn't call discard if the
> >> superblock write didn't complete, and probably r5l_do_reclaim()
> >> shouldn't update last_checkpoint and last_cp_seq in that case.
> >> This is what I mean by "with a bit of care" and "exit gracefully".
> >> Maybe I should have said "abort cleanly".  The goal is to get the thread
> >> to exit.  It doesn't need to complete what it was doing, it just needs
> >> to make sure that it leaves things in a tidy state so that when it
> >> starts up again, it can pick up where it left off.
> >
> > Agree, we could ignore discard sometime, which happens occasionally, so impact
> > is little. I tested something like below recently. Assume this is the solution
> > we agree on?
> 
> Yes, this definitely looks like it is heading in the right direction.
> 
> I thought that
> 
> > -		set_mask_bits(&mddev->flags, 0,
> > -			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
> > -		md_wakeup_thread(mddev->thread);
> 
> would still be there in the case that the lock cannot be claimed.

yep, this makes sense.
> You could even record the ->events value before setting the flags,
> and record the range that needs to be discarded.  Next time
> r5l_do_reclaim is entered, if ->events has moved on, then it should be
> safe to discard the recorded range.  Maybe.

I thought of something like this too, but it looks like more work is
needed to make it happen. We updated the log, so the range could be
reused soon. And if the array is being stopped, we don't get a chance to
re-enter reclaim, which I believe is the most common case where the lock
can't be taken. Missing a discard isn't a big issue, especially since it
happens rarely. I'm going to commit the patch below if there are no
objections.

Thanks,
Shaohua


commit 93e297c0b152667cc4a17db6fe7360dab7e3e9d5
Author: Shaohua Li <shli@fb.com>
Date:   Thu Aug 25 10:09:39 2016 -0700

    raid5-cache: fix a deadlock in superblock write
    
    There is a potential deadlock in superblock write. Discard could zero data,
    so before discard we must make sure the superblock is updated to the new log
    tail. Updating the superblock (either by calling md_update_sb() directly or
    by depending on the md thread) must hold the reconfig mutex. On the other
    hand, raid5_quiesce() is called with reconfig_mutex held. The first step of
    raid5_quiesce() is waiting for all IO to finish, hence waiting for the
    reclaim thread, while the reclaim thread is calling
    r5l_write_super_and_discard_space() and waiting for the reconfig mutex. So
    there is a deadlock. We work around this issue with a trylock. The downside
    of the solution is that we could miss a discard if we can't take the
    reconfig mutex. But this should happen rarely (mainly when a raid array is
    stopped), so a missed discard shouldn't be a big problem.
    
    Cc: NeilBrown <neilb@suse.com>
    Signed-off-by: Shaohua Li <shli@fb.com>

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5504ce2..2b0589f 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -96,7 +96,6 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
-	bool in_teardown;
 };
 
 /*
@@ -704,31 +703,22 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 
 	mddev = log->rdev->mddev;
 	/*
-	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
-	 * wait for this thread to finish. This thread waits for
-	 * MD_CHANGE_PENDING clear, which is supposed to be done in
-	 * md_check_recovery(). md_check_recovery() tries to get
-	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
-	 * md_check_recovery() fails, so the PENDING never get cleared. The
-	 * in_teardown check workaround this issue.
+	 * Discard could zero data, so before discard we must make sure
+	 * superblock is updated to new log tail. Updating superblock (either
+	 * directly call md_update_sb() or depend on md thread) must hold
+	 * reconfig mutex. On the other hand, raid5_quiesce is called with
+	 * reconfig_mutex hold. The first step of raid5_quiesce() is waitting
+	 * for all IO finish, hence waitting for reclaim thread, while reclaim
+	 * thread is calling this function and waitting for reconfig mutex. So
+	 * there is a deadlock. We workaround this issue with a trylock.
+	 * FIXME: we could miss discard if we can't take reconfig mutex
 	 */
-	if (!log->in_teardown) {
-		set_mask_bits(&mddev->flags, 0,
-			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
-		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait,
-			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
-			log->in_teardown);
-		/*
-		 * r5l_quiesce could run after in_teardown check and hold
-		 * mutex first. Superblock might get updated twice.
-		 */
-		if (log->in_teardown)
-			md_update_sb(mddev, 1);
-	} else {
-		WARN_ON(!mddev_is_locked(mddev));
-		md_update_sb(mddev, 1);
-	}
+	set_mask_bits(&mddev->flags, 0,
+		BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+	if (!mddev_trylock(mddev))
+		return;
+	md_update_sb(mddev, 1);
+	mddev_unlock(mddev);
 
 	/* discard IO error really doesn't matter, ignore it */
 	if (log->last_checkpoint < end) {
@@ -827,7 +817,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	if (!log || state == 2)
 		return;
 	if (state == 0) {
-		log->in_teardown = 0;
 		/*
 		 * This is a special case for hotadd. In suspend, the array has
 		 * no journal. In resume, journal is initialized as well as the
@@ -838,11 +827,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
 	} else if (state == 1) {
-		/*
-		 * at this point all stripes are finished, so io_unit is at
-		 * least in STRIPE_END state
-		 */
-		log->in_teardown = 1;
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
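
For readers less familiar with the pattern, the trylock idea in the patch
above can be illustrated outside the kernel. The Python below is only an
analogy, not md code: a threading.Lock stands in for reconfig_mutex, and
the function and message strings are invented for the illustration.

```python
import threading

reconfig_mutex = threading.Lock()    # stand-in for the array's reconfig_mutex

def write_super_and_discard(events: list) -> bool:
    """Update the 'superblock' and discard only if the lock is free; never block."""
    if not reconfig_mutex.acquire(blocking=False):    # analogous to mddev_trylock()
        events.append("lock busy: skipped superblock update and discard")
        return False
    try:
        events.append("superblock updated")           # analogous to md_update_sb()
    finally:
        reconfig_mutex.release()                      # analogous to mddev_unlock()
    events.append("discard issued")
    return True

events = []
with reconfig_mutex:                  # a "quiesce" caller already holds the mutex
    write_super_and_discard(events)   # returns at once instead of deadlocking
write_super_and_discard(events)       # the mutex is free now, so the full path runs
print(events)
```

The first call gives up immediately, which is the missed discard the
commit message accepts as the price of avoiding the deadlock.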

^ permalink raw reply related

* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-25 15:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mdadm
In-Reply-To: <20160824010241.GC57645@kernel.org>

Note: again I consolidated several previous posts into one for inline replies...

On Tue, Aug 23, 2016 at 2:41 PM, Doug Dumitru <doug@easyco.com> wrote:
> So you are up at 1GB/sec, which is only 1/4 the degraded speed, but
> 1/2 the expected speed based on drive data transfers required.  This
> is actually pretty good.

I get 8 GB/sec non-degraded.  So I'd say I'm still only at 1/8 of
non-degraded speed, and about 1/4 of what I expect in the degraded
state.  I.e., I expect 4 GB/sec degraded.  However, based on what I'm
reading in this thread, maybe I can't do any better?  But
group_thread_cnt might save the day...

> If you need this to go faster, then it is either a raid re-design, or
> perhaps you should consider cutting your array into two parts.  Two 12
> drives raid-6 arrays will give you more bandwidth both because the
> failures are less "wide", so a single drive will only do 11 reads
> instead of 22.  Plus you get the benefit of two raid-6 threads should
> you have dead drives on both halves.  You can raid-0 the arrays
> together.  Then again, you lose two drives worth of space.

Yes, that's on the list to test.  Actually we'll try three 8-disk
raid-5s striped into one big raid0.  That only loses one drive's worth
of space (compared to a single 24-disk raid6).  Space is at a premium
here, as we're really needing to build this system with 4 TB drives.

The loss of resiliency using raid5 instead of raid6 "shouldn't" be an
issue here.  The design is to deliberately over-provision these
servers so that we have one more than we need.  Then in case of
failure (or major degradation) of a single server, we can migrate
clients to the other ones.

On Tue, Aug 23, 2016 at 3:15 PM, Doug Ledford <dledford@redhat.com> wrote:
> OK, 50 sequential I/Os at a time.  Good point to know.

Note that's just the test workload.  The real workload has literally
*thousands* of sequential reads at once.  However, those thousands of
reads aren't reading at full speed like dd of=/dev/null.  In the real
workload, after a chunk of data is read, some computations are done.
IOW, when the storage backend is working optimally, the read processes
are CPU bound.  But it's extremely hard to accurately generate this
kind of test workload, so we have fewer reader threads (50 in this
case), but they are pure read-as-fast-as-we-can jobs, as opposed to
read-and-compute.

> You're raid device has a good chunk size for your usage pattern.  If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently.  But, then again, maybe I'm wrong and that
> would help.  With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.

For some reason I thought we had a 64k chunk size, which I believe is
the mdadm default?  But, you're right, it is indeed 512k.  I will try
to experiment with different chunk sizes, as my Internet-research
suggests that's a very application-dependent setting; I can't seem to
find any rules of thumb as to what our ideal chunk size might be for
this particular workload.  My intuition says bigger is better, since
we're dealing with sequential reads of generally large-ish files.

> Makes sense.  I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider it's memory lost.  When you upped it to 64k, and
> you have 22 disks at 512k chunk, that 11MB per stripe and 65536 total
> allowed stripes which is a maximum memory consumption of around 700GB
> RAM.  I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using.  That's also explains why setting it higher doesn't
> provide any additional benefits ;-).

Do you think more RAM might be beneficial then?
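
The arithmetic behind the quoted estimate can be written out as a quick
sketch. The first function reproduces the quoted reasoning (one full
512k chunk per data disk per cached stripe). The second is a different
model, one page per member disk per cached stripe, which is an
assumption here and worth checking against the md documentation for the
running kernel. The function names and the 4096-byte page size are also
assumptions.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def chunk_based_estimate(data_disks: int, chunk_bytes: int, stripes: int) -> int:
    """The estimate as quoted: one full chunk per data disk per cached stripe."""
    return data_disks * chunk_bytes * stripes

def page_based_estimate(page_bytes: int, nr_disks: int, stripes: int) -> int:
    """Assumed alternative: one page per member disk per cached stripe."""
    return page_bytes * nr_disks * stripes

per_stripe = chunk_based_estimate(22, 512 * KIB, 1)
print(per_stripe / MIB, "MiB per stripe")
print(chunk_based_estimate(22, 512 * KIB, 65536) / GIB, "GiB (chunk-based)")
print(page_based_estimate(4096, 24, 65536) / GIB, "GiB (page-based)")
```

The two models differ by roughly two orders of magnitude, so it matters
which one the kernel actually follows before deciding whether more RAM
would help.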

> The math fits.  Most quad channel Intel CPUs have memory bandwidths in
> the 50GByte/s range theoretical maximum, but it's not bidirectional,
> it's not even multi-access, so you have to remember that the usage looks
> like this on a good read:

I'll have to re-read your explanation a few more times to fully grasp
it, but thank you for that!

For what it's worth, this is a NUMA system: two E5-2620v3 CPUs.  More
cores, but I understand the complexities added by memory controller
and PCIe node locality.

>> My colleague tested that exact same config with hardware raid5, and
>> striped the three raid5 arrays together with software raid1.
>
> That's a huge waste, are you sure he didn't use raid0 for the stripe?

Sorry, typo, that was raid0 indeed.

> I would try to tune your stripe cache size such that the kswapd?
> processes go to sleep.  Those are reading/writing swap.  That won't help
> your overall performance.

Do you mean swapping as in swapping memory to disk?  I don't think
that is happening.  I have 32 GB of swap space, but according to "free
-k" only 48k of swap is being used, and that number never grows.
Also, I don't have any of the classic telltale signs of disk-swapping,
e.g. overall laggy system feel.

Also, I re-set the stripe_cache_size back down to 256, and those
kswapd processes continue to peg a couple CPUs.  IOW,
stripe_cache_size doesn't appear to have much effect on kswapd.

On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
> 2. the state machine runs in a single thread, which is a bottleneck. try to
> increase group_thread_cnt, which will make the handling multi-thread.

For others' reference, this parameter is in
/sys/block/<device>/md/group_thread_cnt.

On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
it to 4, and the degraded reads went up dramatically.  Need to
experiment with this (and all the other tunables) some more, but that
change alone put me up to 2.5 GB/s read from the degraded array!
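
In case it helps anyone scripting these experiments, here is a small
sketch for reading and writing md tunables such as group_thread_cnt
under /sys/block/<device>/md/. The helper names and the "md0" device
name are made up, and the demo runs against a throwaway directory
instead of the real /sys, so it changes nothing on a live system.

```python
import tempfile
from pathlib import Path

def md_param(device: str, name: str, sysfs: str = "/sys") -> Path:
    """Path of an md tunable, e.g. /sys/block/<device>/md/group_thread_cnt."""
    return Path(sysfs) / "block" / device / "md" / name

def get_md_param(device: str, name: str, sysfs: str = "/sys") -> int:
    return int(md_param(device, name, sysfs).read_text().strip())

def set_md_param(device: str, name: str, value: int, sysfs: str = "/sys") -> None:
    # On a real system this write needs root.
    md_param(device, name, sysfs).write_text(f"{value}\n")

# Dry run against a temporary tree that mimics the sysfs layout.
with tempfile.TemporaryDirectory() as fake_sys:
    md_param("md0", "group_thread_cnt", fake_sys).parent.mkdir(parents=True)
    set_md_param("md0", "group_thread_cnt", 0, fake_sys)   # the default seen here
    set_md_param("md0", "group_thread_cnt", 4, fake_sys)   # the value tried above
    print(get_md_param("md0", "group_thread_cnt", fake_sys))
```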

Thanks again,
Matt

^ permalink raw reply

* Re: [dm-devel] [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Shaun Tancheff @ 2016-08-25  7:52 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <CAJVOszDvg6-VBndG=4XdGbfwEXbBj6-oYJsGNtvtkrQ-J6JPbQ@mail.gmail.com>

On Thu, Aug 25, 2016 at 2:44 AM, Shaun Tancheff
<shaun.tancheff@seagate.com> wrote:
> On Wed, Aug 24, 2016 at 11:10 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> Linux allows for 32 character device names. When using the maximum size device name and also
>> storing "/dev/", devname needs to be 37 character long to store the complete device name.
>> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>>
>> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
>> ---
>>  mdopen.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mdopen.c b/mdopen.c
>> index f818fdf..5af344b 100644
>> --- a/mdopen.c
>> +++ b/mdopen.c
>> @@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
>>         struct createinfo *ci = conf_get_create_info();
>>         int parts;
>>         char *cname;
>> -       char devname[20];
>> +       char devname[37];
>
> I think you want 38 here.
>    5 + 32 + '\0'.

>>         char devnm[32];

Ah, sorry: the 32 in devnm[32] already includes the terminating NUL,
so the name itself is at most 31 characters.

Looks fine.
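
Spelled out as arithmetic (Python used purely as a calculator here; the
function name is made up): "/dev/" is 5 characters, devnm[32] holds at
most 31 characters plus its NUL, and devname needs one NUL of its own.

```python
PREFIX = "/dev/"
DEVNM_SIZE = 32                      # char devnm[32]: 31 characters plus NUL

def devname_size(devnm_size: int = DEVNM_SIZE, prefix: str = PREFIX) -> int:
    """Bytes needed for prefix + longest possible devnm + one terminating NUL."""
    longest_name = devnm_size - 1    # devnm's own NUL is not counted twice
    return len(prefix) + longest_name + 1

# The example string from the patch description, without its trailing NUL.
example = "/dev/md_abcdefghijklmnopqrstuvwxyz12"
print(devname_size())
print(len(example) + 1)
```

Both lines print 37, matching the size in the patch.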

>>         char cbuf[400];
>>         if (chosen == NULL)
>> --
>> 2.9.3
>>
>
> Also a sprintf() to snprintf() cleanup might not be a bad idea ..
> --
> Shaun Tancheff



-- 
Shaun Tancheff

^ permalink raw reply

* Re: [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Shaun Tancheff @ 2016-08-25  7:44 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <20160824161044.20887-1-robert@leblancnet.us>

On Wed, Aug 24, 2016 at 11:10 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Linux allows for 32 character device names. When using the maximum size device name and also
> storing "/dev/", devname needs to be 37 character long to store the complete device name.
> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>
> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
> ---
>  mdopen.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mdopen.c b/mdopen.c
> index f818fdf..5af344b 100644
> --- a/mdopen.c
> +++ b/mdopen.c
> @@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
>         struct createinfo *ci = conf_get_create_info();
>         int parts;
>         char *cname;
> -       char devname[20];
> +       char devname[37];

I think you want 38 here.
   5 + 32 + '\0'.

>         char devnm[32];
>         char cbuf[400];
>         if (chosen == NULL)
> --
> 2.9.3
>

Also a sprintf() to snprintf() cleanup might not be a bad idea ..
-- 
Shaun Tancheff

^ permalink raw reply

* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-25  6:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <CAJCQCtSY=D-ASQ22km8GJjfju4jUgJSOBTAH5+XveCZq1BvT7w@mail.gmail.com>

On Wed, Aug 24, 2016 at 11:15:58AM -0600, Chris Murphy wrote:
> OK well you don't tell us what the mdadm create command was, there's
> no information on the metadata version, no mdadm -E or -D output, etc.
> There's really nothing to go on here. So we can't tell what the
> problem is either, or what your question is.

Thanks for the response, I learned some interesting things!

Here is one of the non-nuked drives:

$ sudo mdadm -E /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : <elided>
           Name : <elided>
  Creation Time : Wed Aug 10 11:33:41 2016
     Raid Level : raid0
   Raid Devices : 4

 Avail Dev Size : 7814035071 (3726.02 GiB 4000.79 GB)
    Data Offset : 16 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : <elided>

    Update Time : Wed Aug 10 11:33:41 2016
       Checksum : 490b562f - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)

Here is what should be the same, only device 2 in the array
(device 3 is similar or identical):

$ sudo mdadm -E /dev/sdf1
/dev/sdf1:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
$ sudo mdadm -D /dev/sdf1
mdadm: /dev/sdf1 does not appear to be an md device

Sadly, I can't do a mdadm -D because I can't assemble the RAID.
$ sudo mdadm -E /dev/md127
$
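
The difference between the two -E outputs comes down to what sits at the superblock offset. A minimal sketch of that check, assuming v1.2 metadata as reported above (magic a92b4efc, super offset 8 sectors of 512 bytes, little-endian on disk); has_md_v1_2_magic() is a hypothetical helper operating on a buffer already read from the start of the member device, not part of mdadm:

```c
#include <stdint.h>
#include <stddef.h>

#define MD_SB_MAGIC       0xa92b4efcU /* "Magic" line in the mdadm -E output */
#define MD_V1_2_SB_OFFSET (8 * 512)   /* "Super Offset : 8 sectors" */

/* Read a little-endian 32-bit value without assuming alignment. */
static uint32_t le32_at(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 1 if buf (the first len bytes of a member device) carries the md
 * magic at the v1.2 superblock offset, 0 otherwise. */
static int has_md_v1_2_magic(const unsigned char *buf, size_t len)
{
	if (len < MD_V1_2_SB_OFFSET + 4)
		return 0;
	return le32_at(buf + MD_V1_2_SB_OFFSET) == MD_SB_MAGIC;
}
```

This only looks at the magic number; it does not validate the checksum or any other superblock field.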

The command history is gone, but I would imagine that the RAID was
created with something like this:

mdadm --create /dev/md/bu --level=0 --raid-devices=4 /dev/sd{b,c,d,e}1

Although it could have been level=linear.

To summarize my email:
"Is this a known problem? If not, here is a bug report"

> > Any recommendations on a low power hardware with a well-supported
> > distro, that matches up well with a real backplane and SATA
> > connections instead of USB.  The only caveat is that I want to encrypt
> > raw disks and it has to not be very noisy - so no rackmount gear
> > with 65dB 1" dog whistle fans.  Obviously, whatever backplane must
> > be well-supported by the distro.
> 
> OK so you just want to give up on the existing setup and you want
> advice on a whole new setup? From my perspective you're basically on
> three separate threads at this point.

Depends on the circumstances.  I'm prepared to if there are no obvious
fixes.  My intuition tells me the issue may be in the 4-bay switched
SATA enclosure, or the USB connection, or the driver thereof, and not
mdraid itself.  I'm happy to be wrong on that.

BTW, in case this rings any bells as being buggy, here is the enclosure:
https://www.amazon.com/Mediasonic-ProBox-HF2-SU3S2-SATA-Enclosure/dp/B003X26VV4/

> It's a WDC Red with a physical sector size of 4096B, so it looks like
> the USB enclosure is doing the typical thing of masking the true
> physical sector size from the kernel. This is better than the opposite
> where the enclosure reports the drive as 4096B/4096B logical/physical,
> where the drive itself has 512B logical sectors, as this will cause
> problems if the drive is ever removed from that enclosure, or put into
> one that doesn't report 4096B logical sectors.

Oooh, that's meaty information thank you.  I hadn't kept up with
things since the great 2TB changeover.  That could explain some crap I
see with larger drives and USB enclosures.  The problems you describe,
I saw back in the great 2GB switchover. Seagate had some boot sector
magic that would make things work by changing the cylinder sizes,
until it didn't....

> > # parted /dev/sdd1
> > GNU Parted 2.3
> > Using /dev/sdd1
> > Welcome to GNU Parted! Type 'help' to view a list of commands.
> > (parted) p
> > Model: Unknown (unknown)
> > Disk /dev/sdd1: 4001GB
> > Sector size (logical/physical): 512B/512B
> > Partition Table: gpt
> >
> > Number  Start   End     Size    File system  Name        Flags
> >  1      1049kB  4001GB  4001GB               Linux RAID  raid
> 
> It's purely speculation, but it sounds to me like, in the history of
> one or more drives, the previous signatures weren't removed before the
> drive was retasked for its new purpose. That's the folly of not wiping
> the signatures in the reverse order they were created, and just
> expecting that starting over will wipe those old signatures.

It's possible, but why would you ever end up with a GPT in a partition?

I've certainly encountered this "GPT outside cylinder 0" on these two
drives before, but it goes away with a forcible reassemble or recreate
(which I did last time), because the mdlabel blows it away. Unless
it's something this list knows about, I suspect it is a firmware
glitch in the USB enclosure.

> But I think there is a legitimate gripe that parted probably should
> not operate on partitions like this. It's not valid to have nested
> GPTs like this. And I have no idea if parted is showing you valid or
> bogus information. You'd need to do something like:
> 
> dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C

## Good disk (for comparison):
$ sudo dd if=/dev/sdd1 count=2 2> /dev/null | file -
/dev/stdin: data
$ sudo dd if=/dev/sdd1 count=2 2> /dev/null | hexdump -C | head -20 
00000000  ff 02 19 2e 03 ee fa d8  6d d7 24 78 e1 d4 04 3d  |........m.$x...=|
00000010  c9 92 33 97 17 7a 10 d3  05 bd 39 36 b4 a9 7c 14  |..3..z....96..|.|
00000020  a7 de 66 b6 cd d9 ff ef  45 27 74 6e 94 0a 03 49  |..f.....E'tn...I|
00000030  d4 43 26 2d 45 39 d1 93  8a 35 91 91 ff c9 a4 8e  |.C&-E9...5......|
00000040  bd 9a 06 6d cc f2 89 65  c0 91 87 1c 1b f0 da 2f  |...m...e......./|
00000050  83 c2 12 eb 80 3c c2 4c  68 cc 65 40 26 13 e0 77  |.....<.Lh.e@&..w|
00000060  38 15 ed 78 27 76 4c 91  71 99 3e 9f 99 f1 3f 51  |8..x'vL.q.>...?Q|
00000070  19 db 12 a3 ac b6 61 12  ff d9 37 87 31 1f 8b dd  |......a...7.1...|
00000080  88 82 de fb db f2 a5 31  10 2a d2 03 be 12 be bd  |.......1.*......|
00000090  19 46 9f c1 3b ea a1 37  81 d2 4d 00 54 e7 b4 55  |.F..;..7..M.T..U|
000000a0  b7 65 6c 3f 95 40 b0 f4  28 ff 90 62 22 cb 22 fd  |.el?.@..(..b".".|
000000b0  6b 4d 90 56 32 4b c6 22  35 b1 62 76 e1 fd 82 d5  |kM.V2K."5.bv....|
000000c0  03 40 c0 85 4b ac 5a 44  9e 6a 25 97 d3 7f bd fe  |.@..K.ZD.j%.....|
000000d0  0c 2d a8 bb 33 f4 00 df  7a 05 ae 6d b3 3e f3 7d  |.-..3...z..m.>.}|
000000e0  34 9e 0e 57 14 de d8 e0  28 63 82 a6 2a 8a 1f fc  |4..W....(c..*...|
000000f0  fe 2f b0 69 67 ac 0a e9  c2 53 a7 d8 36 1a 18 5a  |./.ig....S..6..Z|
00000100  d6 d4 e6 ce df f7 fc 67  13 eb 25 08 45 50 10 7b  |.......g..%.EP.{|
00000110  c6 23 1e 59 dc 2d c2 65  53 90 ca ec 21 e7 28 74  |.#.Y.-.eS...!.(t|
00000120  41 7f 3e 58 72 08 75 c1  d5 ca d0 91 55 5f 43 6a  |A.>Xr.u.....U_Cj|
00000130  4e 84 d5 7f aa f2 b5 27  e4 86 5d 28 ae 6c 29 a1  |N......'..](.l).|

## Bad disk:
$ sudo dd if=/dev/sdf1 count=2 2> /dev/null | file -
/dev/stdin: x86 boot sector; partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, code offset 0x6f
$ sudo dd if=/dev/sdf1 count=2 2> /dev/null | hexdump -C 
00000000  38 6f 96 52 ea 9c 31 cd  10 a2 84 58 a2 f0 f5 43  |8o.R..1....X...C|
00000010  0f f2 5a 9b c7 ff 82 b2  d8 59 86 60 15 bc 31 65  |..Z......Y.`..1e|
00000020  bc d7 77 f9 31 6a c8 16  3f 13 90 24 b7 57 ff 6b  |..w.1j..?..$.W.k|
00000030  64 7e e2 99 2a 99 f7 32  69 be aa 56 36 31 f7 db  |d~..*..2i..V61..|
00000040  8c 4c 4c 12 68 19 77 0f  f6 3b 92 bf 18 92 c2 45  |.LL.h.w..;.....E|
00000050  73 d5 b7 93 cc ae 6b b9  b0 bd 0c 85 a9 c3 19 f7  |s.....k.........|
00000060  87 34 b8 be 0a 95 cd 03  03 d5 01 49 b5 b0 86 fe  |.4.........I....|
00000070  71 1c d2 f6 42 ed ce b0  eb c3 5f 4c 07 34 30 c7  |q...B....._L.40.|
00000080  8a 1f 91 c4 8b 28 b9 07  8e da ae 7d 7d c5 24 2b  |.....(.....}}.$+|
00000090  6d f9 ea a3 6a 83 9d b8  6a 1f 6d db 3a 01 22 c7  |m...j...j.m.:.".|
000000a0  56 fc 2a 46 f8 b2 84 31  d1 8b 58 55 b6 5a 36 7b  |V.*F...1..XU.Z6{|
000000b0  48 5d 98 2a 3f f0 ae 80  2b f8 6b b2 7f 1e 27 c2  |H].*?...+.k...'.|
000000c0  59 65 d0 bf c7 f0 5b 18  dc 59 8e 68 46 03 b6 ca  |Ye....[..Y.hF...|
000000d0  42 06 7a 52 7a 49 36 03  0d d5 9b 67 a2 03 3b 13  |B.zRzI6....g..;.|
000000e0  40 23 19 f5 1a a6 bd fb  c8 d5 5b 26 f5 6a 86 ab  |@#........[&.j..|
000000f0  89 77 98 d8 09 cb b7 59  80 03 81 48 ba c6 ce 77  |.w.....Y...H...w|
00000100  3c 6c d2 ba a0 71 c3 20  18 fd 77 db ca a8 8a e3  |<l...q. ..w.....|
00000110  8d 6c 1f 17 d5 9f e5 81  bf 50 62 c3 bc f8 6c 5d  |.l.......Pb...l]|
00000120  f7 3f a6 37 6b a9 53 2b  88 15 5d 6e 1e 48 4f b4  |.?.7k.S+..]n.HO.|
00000130  db af b4 f7 f5 7b 4d f3  3f 60 44 60 6e a2 c4 6d  |.....{M.?`D`n..m|
00000140  b9 6c 88 04 e8 66 d1 7c  a0 09 10 66 32 de 70 e1  |.l...f.|...f2.p.|
00000150  98 40 54 5e 1d f2 af b8  2e d1 75 0d 3c 46 1f f8  |.@T^......u.<F..|
00000160  85 72 49 87 ad 92 59 28  fd 9d 22 8e 1b 9f 2c 00  |.rI...Y(.."...,.|
00000170  87 58 74 01 63 a5 94 13  e3 9c ea ec 3f 21 22 41  |.Xt.c.......?!"A|
00000180  05 13 78 f3 a8 46 b3 02  9e 23 cb 9d 21 db a6 ae  |..x..F...#..!...|
00000190  08 a8 70 48 18 6c e2 38  e4 ac 03 6e 06 74 17 7c  |..pH.l.8...n.t.||
000001a0  90 ca 9f 5e 2e 2b 84 ef  52 2c 08 9a 48 98 f9 46  |...^.+..R,..H..F|
000001b0  f4 9f 00 cd ec a0 11 d7  00 00 00 00 00 00 00 00  |................|
000001c0  02 00 ee ff ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
00000210  3a dc 43 c4 00 00 00 00  01 00 00 00 00 00 00 00  |:.C.............|
00000220  8e b6 c0 d1 01 00 00 00  22 00 00 00 00 00 00 00  |........".......|
00000230  6d b6 c0 d1 01 00 00 00  a5 4f bd 75 f6 c8 4f 43  |m........O.u..OC|
00000240  92 31 ab b6 a9 59 aa 04  02 00 00 00 00 00 00 00  |.1...Y..........|
00000250  80 00 00 00 80 00 00 00  59 04 3d 4a 00 00 00 00  |........Y.=J....|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
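
What file(1) reports for the bad disk can be reproduced by looking at three spots in the first two sectors. A minimal sketch, assuming 512-byte logical sectors as parted reports; the helper names are hypothetical and the buffer is expected to hold the same two sectors that the dd command reads:

```c
#include <string.h>
#include <stddef.h>

#define SECTOR_SIZE 512 /* assumption: 512-byte logical sectors */

/* Returns 1 if sector 0 ends with the 55 aa boot signature and its first
 * partition entry (at offset 0x1be) has type 0xee, i.e. a protective MBR. */
static int has_protective_mbr(const unsigned char *buf, size_t len)
{
	if (len < SECTOR_SIZE)
		return 0;
	return buf[510] == 0x55 && buf[511] == 0xaa && buf[0x1be + 4] == 0xee;
}

/* Returns 1 if sector 1 starts with the "EFI PART" GPT header signature. */
static int has_gpt_header(const unsigned char *buf, size_t len)
{
	if (len < 2 * SECTOR_SIZE)
		return 0;
	return memcmp(buf + SECTOR_SIZE, "EFI PART", 8) == 0;
}
```

In the dump above, offset 0x1c2 holds ee, offsets 0x1fe-0x1ff hold 55 aa, and offset 0x200 starts with "EFI PART", so both checks would succeed for /dev/sdf1.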

## is that the same as the boot sector itself?  Interesting q.
# dd if=/dev/sdd count=2 of=/tmp/foo && dd if=/dev/sdd1 count=2 of=/tmp/bar && cmp /tmp/foo /tmp/bar
## Nope, how do they differ?  Well that's a bit unpleasant to do manually but here...
# dd if=/dev/sdd count=2 2> /dev/null | hexdump -C
00000000  10 06 27 48 33 df bb 55  8b 28 fe 60 5e 18 6d 38  |..'H3..U.(.`^.m8|
00000010  fc b3 17 36 55 de fd 83  d0 52 72 19 d0 76 12 f0  |...6U....Rr..v..|
00000020  1e 23 bc 4d c5 4d c2 d6  5a d4 2b cd 16 78 c9 28  |.#.M.M..Z.+..x.(|
00000030  77 21 c4 9f c4 b7 48 ad  e0 7b 08 d6 f5 8e 92 a7  |w!....H..{......|
00000040  bc 88 35 02 e7 f8 b8 3b  05 97 db a3 ad e7 96 4b  |..5....;.......K|
00000050  84 d9 e2 a4 3a 5a 07 ac  fc a2 78 58 d7 c8 5a 19  |....:Z....xX..Z.|
00000060  88 9c f6 f2 c0 ec 99 55  d9 5d 00 87 3a 86 52 01  |.......U.]..:.R.|
00000070  92 58 25 82 99 50 8e 28  0f 42 07 71 9a a3 db 82  |.X%..P.(.B.q....|
00000080  00 d9 b8 28 9d d8 97 85  9d c6 fb 5e 4d 94 3a 6e  |...(.......^M.:n|
00000090  19 3c a6 ce 57 6b a0 52  d6 72 0c 41 2e cd cb a2  |.<..Wk.R.r.A....|
000000a0  15 c8 d4 c8 8c 90 34 5f  15 ab 69 96 af 3d 7e 30  |......4_..i..=~0|
000000b0  25 e1 72 35 d6 c4 b2 5e  78 72 0b 3f 9a 96 40 7e  |%.r5...^xr.?..@~|
000000c0  c6 aa 0e 5a da 99 ae fe  a3 93 8b 5b c4 bf 91 64  |...Z.......[...d|
000000d0  d5 62 12 ea 70 15 a9 05  81 8d e4 fb 36 15 c9 63  |.b..p.......6..c|
000000e0  ba f9 d2 5c f6 df 28 71  d8 d5 82 95 2b 83 40 db  |...\..(q....+.@.|
000000f0  9b fe e2 a7 9b 38 5e 5f  51 a6 6e e6 7b 4e bf 02  |.....8^_Q.n.{N..|
00000100  d2 fb aa f9 2c 7a 5b f5  47 ad ac 7e d1 1c f3 1b  |....,z[.G..~....|
00000110  a3 8e 54 9f a4 8d 1a 02  3f cc 81 f0 ca e9 28 1e  |..T.....?.....(.|
00000120  33 9e d8 71 dd f2 aa b7  d4 06 96 cb 0c 8e f1 6a  |3..q...........j|
00000130  88 1d 2a 8a a3 33 00 8c  ef d4 d8 39 3e 70 18 34  |..*..3.....9>p.4|
00000140  e6 3a cd e7 0b d6 82 a8  a4 aa ff bd b3 69 0a cc  |.:...........i..|
00000150  32 9e e3 26 34 bb cc 0e  b0 69 5f 9a c5 f3 57 7d  |2..&4....i_...W}|
00000160  47 82 bc 66 44 55 c4 de  3c 2c 14 d0 9a 73 6a da  |G..fDU..<,...sj.|
00000170  3c 5e f8 99 26 5b f4 8a  13 a1 f1 c8 a9 20 4c 3a  |<^..&[....... L:|
00000180  bd 03 4e e9 83 25 46 32  3f 80 3e 42 58 e7 18 27  |..N..%F2?.>BX..'|
00000190  8a c8 7c 8c 74 99 96 61  d4 e2 58 c2 27 71 8c 3b  |..|.t..a..X.'q.;|
000001a0  da 33 f8 7f b5 c1 a7 a0  c2 7b 54 29 0d 47 b4 b5  |.3.......{T).G..|
000001b0  4c 62 5b f8 e9 6f bc 29  00 00 00 00 00 00 00 00  |Lb[..o.)........|
000001c0  02 00 ee ff ff ff 01 00  00 00 ff ff ff ff 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
00000210  62 01 85 1f 00 00 00 00  01 00 00 00 00 00 00 00  |b...............|
00000220  af be c0 d1 01 00 00 00  22 00 00 00 00 00 00 00  |........".......|
00000230  8e be c0 d1 01 00 00 00  e2 89 58 78 77 63 52 44  |..........XxwcRD|
00000240  93 9e 4a 93 16 06 86 6b  02 00 00 00 00 00 00 00  |..J....k........|
00000250  80 00 00 00 80 00 00 00  5d ff 7e 02 00 00 00 00  |........].~.....|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
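
The GPT headers in the two dumps can be compared field by field. A minimal sketch that decodes the LBA fields of a GPT header, using the header offsets from the UEFI specification (MyLBA at 24, AlternateLBA at 32, FirstUsableLBA at 40, LastUsableLBA at 48); the struct and function names are hypothetical:

```c
#include <stdint.h>

/* Byte offsets within the GPT header, which itself starts at LBA 1. */
#define GPT_MY_LBA        24
#define GPT_ALTERNATE_LBA 32
#define GPT_FIRST_USABLE  40
#define GPT_LAST_USABLE   48

struct gpt_lbas {
	uint64_t my_lba;
	uint64_t alternate_lba;
	uint64_t first_usable;
	uint64_t last_usable;
};

/* Read a little-endian 64-bit value without assuming alignment. */
static uint64_t le64_at(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* hdr points at the "EFI PART" signature (offset 0x200 in the dumps). */
static struct gpt_lbas gpt_decode_lbas(const unsigned char *hdr)
{
	struct gpt_lbas l;

	l.my_lba = le64_at(hdr + GPT_MY_LBA);
	l.alternate_lba = le64_at(hdr + GPT_ALTERNATE_LBA);
	l.first_usable = le64_at(hdr + GPT_FIRST_USABLE);
	l.last_usable = le64_at(hdr + GPT_LAST_USABLE);
	return l;
}
```

Decoded this way, the AlternateLBA bytes in the /dev/sdf1 dump (8e b6 c0 d1 01) give 7814035086, while those in the /dev/sdd dump (af be c0 d1 01) give 7814037167. The mdadm -E output for sdd1 reports 7814035071 available sectors plus a 16-sector data offset, i.e. 7814035087 sectors with a last LBA of 7814035086. If sdf1 is the same size, which is an assumption, the nested header describes a table sized to the partition rather than a copy of the whole-disk one.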

> And then we can see if there really is a PMBR and GPT in that first
> sector that parted is picking up. But where it could be coming from in
> an mdadm linear layout? No idea.
> 
> The other thing to check is the end of the partition, because GPT has
> a primary and backup. So the 2nd to last sector of sdd1 may have a
> backup GPT on it, and possibly something is wrongly restoring it
> sometimes.
> 
> In any case I would still look to using something much much newer than
> parted 2.3, it's basically Pleistocene old, and the version of mdadm
> is also likewise old. But this is what happens with LTS releases,
> ancient software for which no one except its maintainers remembers the
> state and history.

I understand and can probably acquire the most recent stable and
compile from source, if you think that would prove useful enough to
justify the effort.  TBH once GPT came out I lost track of which
partitioning tool was appropriate to use, it seemed like (IIRC)
cfdisk, sfdisk, parted were all vying for my attention... is parted
now the standard?

At the current moment I am backing up the drives so that I can try a
forcible reassemble.  I think that last time this happened, that
effectively relabeled the mdraid partitions and fixed the problem.
The underlying mdraid has an LVM on LUKS, but last time this happened
I managed to fsck and get 99% of the data back, with only a few things
ending up in lost+found.  Presumably there might have been some data
corruption, but since it's a backup server only I consider it
tolerable, modulo the failed Windows system which needs to restore
from it.
-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977

* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: NeilBrown @ 2016-08-25  4:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20160824052512.GA1921@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3542 bytes --]

On Wed, Aug 24 2016, Shaohua Li wrote:

> On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
>> On Wed, Aug 17 2016, Shaohua Li wrote:
>> >> >
>> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
>> >> > thread. As stopping the thread will wait for the thread, which probably is
>> >> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
>> >> > superblock, we must hold the reconfig_mutex.
>> >> 
>> >> When you say "writing the superblock" you presumably mean "blocked in
>> >> r5l_write_super_and_discard_space(), waiting for  MD_CHANGE_PENDING to
>> >> be cleared" ??
>> > right
>> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
>> >> ->quiesce to be set, and then exit gracefully.
>> >
>> > Can you give details about this please? .quiesce is called with reconfig_mutex
>> > held, so MD_CHANGE_PENDING will never get cleared.
>> 
>> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
>> 
>> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
>> 
>> But the reclaim thread might be in
>>    r5l_do_reclaim() -> r5l_write_super_and_discard_space()
>> waiting for MD_CHANGE_PENDING to clear.  That will only get cleared when
>> the main thread can get the reconfig_mutex, which the thread calling
>> raid5_quiesce() might hold.  So we get a deadlock.
>> 
>> My suggestion is to change r5l_write_super_and_discard_space() so that
>> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
>> to be set.  That will avoid the deadlock.
>> 
>> Whatever thread called raid5_quiesce() will now be in control of the
>> array without any async IO going on.  If it needs the metadata to be
>> sync, it can do that itself.  If not, then it doesn't really matter that
>> r5l_write_super_and_discard_space() didn't wait.
>
> I'm afraid waiting for conf->quiesce to be set isn't safe. The reason we wait
> for the superblock write isn't async IO. Discard could zero data, so before we
> discard we must make sure the superblock points to the correct log tail;
> otherwise recovery will not work. This is why we wait for the superblock
> write.
>
>> r5l_write_super_and_discard_space() shouldn't call discard if the
>> superblock write didn't complete, and probably r5l_do_reclaim()
>> shouldn't update last_checkpoint and last_cp_seq in that case.
>> This is what I mean by "with a bit of care" and "exit gracefully".
>> Maybe I should have said "abort cleanly".  The goal is to get the thread
>> to exit.  It doesn't need to complete what it was doing, it just needs
>> to make sure that it leaves things in a tidy state so that when it
>> starts up again, it can pick up where it left off.
>
> Agreed. We could skip the discard in that case; it happens only occasionally,
> so the impact is small. I recently tested something like the patch below. I
> assume this is the solution we agree on?

Yes, this definitely looks like it is heading in the right direction.

I thought that

> -		set_mask_bits(&mddev->flags, 0,
> -			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
> -		md_wakeup_thread(mddev->thread);

would still be there in the case that the lock cannot be claimed.

You could even record the ->events value before setting the flags,
and record the range that needs to be discarded.  Next time
r5l_do_reclaim is entered, if ->events has moved on, then it should be
safe to discard the recorded range.  Maybe.
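
The rules discussed above can be written down as a small userspace model. This is only a sketch of the decision logic, not kernel code: struct log_model and all function names are hypothetical stand-ins for MD_CHANGE_PENDING, conf->quiesce and mddev->events, and no locking or wakeups are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state discussed in the thread. */
struct log_model {
	bool change_pending;      /* models MD_CHANGE_PENDING */
	bool quiesce;             /* models conf->quiesce */
	uint64_t events;          /* models mddev->events */
	bool deferred;            /* a discard was skipped and recorded */
	uint64_t deferred_events; /* ->events when the discard was skipped */
	uint64_t deferred_start;  /* recorded range to discard later */
	uint64_t deferred_end;
};

/* The reclaim thread may stop waiting when the superblock write finished
 * OR a quiesce was requested; the second case avoids the deadlock. */
static bool wait_is_over(const struct log_model *m)
{
	return !m->change_pending || m->quiesce;
}

/* Discard is only safe once the superblock write has really completed. */
static bool may_discard_now(const struct log_model *m)
{
	return !m->change_pending;
}

/* Aborting cleanly: remember what still needs to be discarded. */
static void defer_discard(struct log_model *m, uint64_t start, uint64_t end)
{
	m->deferred = true;
	m->deferred_events = m->events;
	m->deferred_start = start;
	m->deferred_end = end;
}

/* On the next reclaim pass the recorded range is considered safe only if
 * ->events has moved on since it was recorded. */
static bool deferred_discard_safe(const struct log_model *m)
{
	return m->deferred && m->events != m->deferred_events;
}
```

The point of splitting wait_is_over() from may_discard_now() is that a quiesce ends the wait without implying that the superblock points at the new log tail.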

Thanks,
NeilBrown

* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-24 17:15 UTC (permalink / raw)
  To: Linux-RAID
In-Reply-To: <20160823050947.GL32250@subspacefield.org>

On Mon, Aug 22, 2016 at 11:09 PM,
<travis+ml-linux-raid@subspacefield.org> wrote:
> Hello all,
>
> So I have an Intel NUC (for low power Linux) plugged via USB into a 4
> bay enclosure doing linear (yeah I know; it's the backup server, the
> primary is raid10).
>
> And every once in a while, this happens (*see end).  The partition 1
> that would normally contain a MD slice ends up being a replica of the
> boot cylinder.  I can't tell if it's the mdraid linear impl, the
> kernel doing something weird, the USB drivers, the enclosure firmware,
> or what.

OK well you don't tell us what the mdadm create command was, there's
no information on the metadata version, no mdadm -E or -D output, etc.
There's really nothing to go on here. So we can't tell what the
problem is either, or what your question is.

>
> Anyway, this happened while I was restoring a Windows machine whose
> root drive suddenly took a nosedive, and it happens every 6 months
> or so.  Today it happened while I was in the middle of recovering
> a Windows machine whose 1TB SSD threw up on C: and totally nuked
> the data.

OK? I don't follow this at all, how it relates to the NUC, how it
relates to the USB drives connected to the NUC.

>
> The last low-power option I tried was an OpenRD Ultimate based around
> ARMv5TE which was basically unsupported by debian by the time I got
> it, and subsequently became ultra-flaky due to what seemed to be RAM
> problems - it was crashing every 3 days with kernel panics, and every
> once in a while would do something worse.

This is definitely superfluous information that just clutters the thread...

> Any recommendations on a low power hardware with a well-supported
> distro, that matches up well with a real backplane and SATA
> connections instead of USB.  The only caveat is that I want to encrypt
> raw disks and it has to not be very noisy - so no rackmount gear
> with 65dB 1" dog whistle fans.  Obviously, whatever backplane must
> be well-supported by the distro.

OK so you just want to give up on the existing setup and you want
advice on a whole new setup? From my perspective you're basically on
three separate threads at this point.

>
> Also, does anyone have experience with cryptsetup on multiple
> partitions?  I can do that but get prompted multiple times and I was
> wondering if anyone knew an easy way to fix the boot time scripts to
> avoid that, only prompting once per unique underlying crypttab.

And now you're on your fourth subject for an entirely new thread that
also has nothing to do with this list. This is probably a distribution
question. On the distribution I use, the thing that prompts for a
passphrase tries that passphrase on all cryptluks devices, so in the
event they share the same passphrase, they're all opened just by
entering the passphrase one time. If the passphrase is entered
incorrectly, now I'm stuck and have to enter the passphrase per LUKS
instance.

>
> And finally, I have a story about buggy drive firmware that you
> might enjoy, especially if you were doing this sort of stuff in
> the 90s as well.  Cheers:

OK...fifth subject and thread.

> # parted /dev/sde
> GNU Parted 2.3

I would start out by using a non-ancient version of parted. This is 6 years old.

> Using /dev/sde
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: WDC WD40 EFRX-68WT0N0 (scsi)
> Disk /dev/sde: 4001GB
> Sector size (logical/physical): 512B/512B

It's a WDC Red with a physical sector size of 4096B, so it looks like
the USB enclosure is doing the typical thing of masking the true
physical sector size from the kernel. This is better than the opposite
where the enclosure reports the drive as 4096B/4096B logical/physical,
where the drive itself has 512B logical sectors, as this will cause
problems if the drive is ever removed from that enclosure, or put into
one that doesn't report 4096B logical sectors.
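
The sizes parted prints are whatever the kernel was told by the device or bridge. A minimal sketch for reading them directly on Linux, using the BLKSSZGET (logical) and BLKPBSZGET (physical) block-device ioctls; query_sector_sizes() is a hypothetical helper, and behind a USB bridge it would report the bridge's answer, not necessarily the drive's:

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h> /* BLKSSZGET, BLKPBSZGET */

/* Returns 0 and fills both sizes on success, -1 on any failure
 * (device missing, no permission, or not a block device). */
static int query_sector_sizes(const char *path, int *logical,
			      unsigned int *physical)
{
	int fd = open(path, O_RDONLY);
	int rc = -1;

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKSSZGET, logical) == 0 &&
	    ioctl(fd, BLKPBSZGET, physical) == 0)
		rc = 0;
	close(fd);
	return rc;
}
```

Comparing the result for the same drive attached via the enclosure and attached directly over SATA would show whether the enclosure is masking the physical size.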

> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name        Flags
>  1      1049kB  4001GB  4001GB               Linux RAID  raid
>
> (parted) q
> # parted /dev/sdd1
> GNU Parted 2.3
> Using /dev/sdd1
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: Unknown (unknown)
> Disk /dev/sdd1: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name        Flags
>  1      1049kB  4001GB  4001GB               Linux RAID  raid

It's purely speculation, but it sounds to me like, in the history of
one or more drives, the previous signatures weren't removed before the
drive was retasked for its new purpose. That's the folly of not wiping
the signatures in the reverse order they were created, and just
expecting that starting over will wipe those old signatures.

But I think there is a legitimate gripe that parted probably should
not operate on partitions like this. It's not valid to have nested
GPTs like this. And I have no idea if parted is showing you valid or
bogus information. You'd need to do something like:

dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C

And then we can see if there really is a PMBR and GPT in that first
sector that parted is picking up. But where it could be coming from in
an mdadm linear layout? No idea.

The other thing to check is the end of the partition, because GPT has
a primary and backup. So the 2nd to last sector of sdd1 may have a
backup GPT on it, and possibly something is wrongly restoring it
sometimes.

In any case I would still look to using something much much newer than
parted 2.3, it's basically Pleistocene old, and the version of mdadm
is also likewise old. But this is what happens with LTS releases,
ancient software for which no one except its maintainers remembers the
state and history.

-- 
Chris Murphy

* [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Robert LeBlanc @ 2016-08-24 16:10 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, robert

Linux allows for 32 character device names. When using the maximum size device name and also
storing "/dev/", devname needs to be 37 characters long to store the complete device name.
i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"

Signed-Off: Robert LeBlanc<robert@leblancnet.us>
---
 mdopen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mdopen.c b/mdopen.c
index f818fdf..5af344b 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 	struct createinfo *ci = conf_get_create_info();
 	int parts;
 	char *cname;
-	char devname[20];
+	char devname[37];
 	char devnm[32];
 	char cbuf[400];
 	if (chosen == NULL)
-- 
2.9.3

* Re: [PATCH] raid6: fix the input of raid6 algorithm
From: liuzhengyuan @ 2016-08-24  7:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521
In-Reply-To: <FAA53096-C767-4142-B45C-01889986EDAF@zytor.com>

Oh, get_random_*() is really expensive. Thanks for your tips. The boot log from my aarch64
machine, shown below, indicates that filling the disk data took about 0.6 seconds.
  
  [    0.172831] DMA: preallocated 256 KiB pool for atomic allocations
  [    0.788664] raid6: int64x1  gen()   121 MB/s
  [    0.856613] raid6: int64x1  xor()    74 MB/s
  [    0.924665] raid6: int64x2  gen()   166 MB/s
  [    0.992846] raid6: int64x2  xor()    95 MB/s
  [    1.060681] raid6: int64x4  gen()   290 MB/s
  [    1.128774] raid6: int64x4  xor()   160 MB/s
  [    1.196933] raid6: int64x8  gen()   238 MB/s
  [    1.264937] raid6: int64x8  xor()   148 MB/s
  [    1.332878] raid6: neonx1   gen()   256 MB/s
  [    1.400975] raid6: neonx1   xor()   130 MB/s
  [    1.468951] raid6: neonx2   gen()   333 MB/s
  [    1.537085] raid6: neonx2   xor()   181 MB/s
  [    1.605042] raid6: neonx4   gen()   451 MB/s
  [    1.673121] raid6: neonx4   xor()   289 MB/s
  [    1.741143] raid6: neonx8   gen()   452 MB/s
  [    1.809151] raid6: neonx8   xor()   277 MB/s
  [    1.809154] raid6: using algorithm neonx8 gen() 452 MB/s
  [    1.809157] raid6: .... xor() 277 MB/s, rmw enabled
  [    1.809160] raid6: using intx1 recovery algorithm

I replaced get_random_*() with a local PRNG based on the well-known
linear congruential method. The patch looks like this:

  +/* use the linear congruential bit. */
  +static int32_t get_random_number_by_lcb(void)
  +{
  +        static int32_t seed = 1;
  +        int32_t ret = 0;
  +        ret = ((seed * 1103515245) + 12345) & 0x7fffffff;
  +        seed = ret;
  +        return ret;
  +}
 
   /* Try to pick the best algorithm */
   /* This code uses the gfmul table as convenient data set to abuse */
  @@ -229,8 +238,8 @@ int __init raid6_select_algo(void)
          for (i = 0; i < disks-2; i++) {
                  dptrs[i] = disk_ptr + PAGE_SIZE*i;
  -               for (j = 0; j < PAGE_SIZE; j++)
  -                       get_random_bytes(dptrs[i]+j, 1);
  +               for (j = 0; j < PAGE_SIZE; j = j + 4)
  +                       *(int32_t *)(dptrs[i]+j) = get_random_number_by_lcb();
          }
   
          dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);

The boot log with this patch is shown below; it took about 0.08 seconds.

  [    0.172858] DMA: preallocated 256 KiB pool for atomic allocations
  [    0.256673] raid6: int64x1  gen()   121 MB/s
  [    0.324484] raid6: int64x1  xor()    73 MB/s
  [    0.392606] raid6: int64x2  gen()   166 MB/s
  [    0.460309] raid6: int64x2  xor()    92 MB/s
  [    0.528368] raid6: int64x4  gen()   290 MB/s
  [    0.596401] raid6: int64x4  xor()   156 MB/s
  [    0.664601] raid6: int64x8  gen()   238 MB/s
  [    0.732609] raid6: int64x8  xor()   148 MB/s
  [    0.800523] raid6: neonx1   gen()   256 MB/s
  [    0.868730] raid6: neonx1   xor()   129 MB/s
  [    0.936741] raid6: neonx2   gen()   334 MB/s
  [    1.004717] raid6: neonx2   xor()   202 MB/s
  [    1.072692] raid6: neonx4   gen()   451 MB/s
  [    1.140763] raid6: neonx4   xor()   260 MB/s
  [    1.208842] raid6: neonx8   gen()   452 MB/s
  [    1.276887] raid6: neonx8   xor()   277 MB/s
  [    1.276890] raid6: using algorithm neonx8 gen() 452 MB/s
  [    1.276894] raid6: .... xor() 277 MB/s, rmw enabled
  [    1.276897] raid6: using intx1 recovery algorithm
  [    1.276941] ACPI: Interpreter disabled.

I'm not familiar with spurious D$ conflicts and CPU cache behavior. What do you
think of this PRNG, and is there anything else I need to do?
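
For anyone wanting to experiment with the generator outside the kernel, here is a userspace sketch using the same constants as the patch above. It differs in two ways that are choices of this sketch, not of the patch: the arithmetic is unsigned, because signed overflow is undefined in ISO C, and the state is passed in explicitly instead of living in a static variable. lcg_next() and lcg_fill() are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One step of the linear congruential generator, kept to 31 bits as in
 * the patch. Unsigned arithmetic wraps modulo 2^32 by definition. */
static uint32_t lcg_next(uint32_t *state)
{
	*state = (*state * 1103515245u + 12345u) & 0x7fffffffu;
	return *state;
}

/* Fill buf with generator output, 4 bytes at a time. Any trailing bytes
 * beyond a multiple of 4 are left untouched. memcpy avoids alignment and
 * aliasing assumptions about buf. */
static void lcg_fill(unsigned char *buf, size_t len, uint32_t seed)
{
	uint32_t state = seed;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		uint32_t v = lcg_next(&state);

		memcpy(buf + i, &v, sizeof(v));
	}
}
```

Because the top bit is masked off, every fourth byte never has its high bit set on a little-endian machine; whether that matters for the benchmark data is a separate question.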

------------------ Original ------------------
From:  "H. Peter Anvin"<hpa@zytor.com>;
Date:  Tue, Aug 23, 2016 11:53 AM
To:  "liuzhengyuan"<liuzhengyuan@kylinos.cn>;
Cc:  "shli"<shli@kernel.org>; "linux-raid"<linux-raid@vger.kernel.org>; "fenghua.yu"<fenghua.yu@intel.com>; "linux-kernel"<linux-kernel@vger.kernel.org>; "liuzhengyuang521"<liuzhengyuang521@gmail.com>;
Subject:  Re: [PATCH] raid6: fix the input of raid6 algorithm
 
Do you have any idea how long this takes to run?  People are already complaining about the boot time penalty.  get_random_*() is quite expensive and is overkill...
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: Shaohua Li @ 2016-08-24  5:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87k2f6g496.fsf@notabene.neil.brown.name>

On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
> On Wed, Aug 17 2016, Shaohua Li wrote:
> >> >
> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
> >> > thread. As stopping the thread will wait for the thread, which probably is
> >> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
> >> > superblock, we must hold the reconfig_mutex.
> >> 
> >> When you say "writing the superblock" you presumably mean "blocked in
> >> r5l_write_super_and_discard_space(), waiting for  MD_CHANGE_PENDING to
> >> be cleared" ??
> > right
> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
> >> ->quiesce to be set, and then exit gracefully.
> >
> > Can you give details about this please? .quiesce is called with reconfig_mutex
> > held, so MD_CHANGE_PENDING will never get cleared.
> 
> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
> 
> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
> 
> But the reclaim thread might be in
>    r5l_do_reclaim() -> r5l_write_super_and_discard_space()
> waiting for MD_CHANGE_PENDING to clear.  That will only get cleared when
> the main thread can get the reconfig_mutex, which the thread calling
> raid5_quiesce() might hold.  So we get a deadlock.
> 
> My suggestion is to change r5l_write_super_and_discard_space() so that
> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
> to be set.  That will avoid the deadlock.
> 
> Whatever thread called raid5_quiesce() will now be in control of the
> array without any async IO going on.  If it needs the metadata to be
> sync, it can do that itself.  If not, then it doesn't really matter that
> r5l_write_super_and_discard_space() didn't wait.

I'm afraid that waiting for conf->quiesce to be set isn't safe. The reason we
wait for the superblock write isn't async IO. Discard could zero data, so
before we discard we must make sure the superblock points to the correct log
tail; otherwise recovery will not work. This is why we wait for the superblock
write.

> r5l_write_super_and_discard_space() shouldn't call discard if the
> superblock write didn't complete, and probably r5l_do_reclaim()
> shouldn't update last_checkpoint and last_cp_seq in that case.
> This is what I mean by "with a bit of care" and "exit gracefully".
> Maybe I should have said "abort cleanly".  The goal is to get the thread
> to exit.  It doesn't need to complete what it was doing, it just needs
> to make sure that it leaves things in a tidy state so that when it
> starts up again, it can pick up where it left off.

Agreed. We could skip the discard occasionally; since this happens rarely, the
impact is small. I recently tested something like the patch below. Can we
assume this is the solution we agree on?


diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5504ce2..cd34e66 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -96,7 +96,6 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
-	bool in_teardown;
 };
 
 /*
@@ -703,32 +702,22 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 		return;
 
 	mddev = log->rdev->mddev;
+
 	/*
-	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
-	 * wait for this thread to finish. This thread waits for
-	 * MD_CHANGE_PENDING clear, which is supposed to be done in
-	 * md_check_recovery(). md_check_recovery() tries to get
-	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
-	 * md_check_recovery() fails, so the PENDING never get cleared. The
-	 * in_teardown check workaround this issue.
+	 * Discard could zero data, so before discard we must make sure
+	 * superblock is updated to new log tail. Updating superblock (either
+	 * directly call md_update_sb() or depend on md thread) must hold
+	 * reconfig mutex. On the other hand, raid5_quiesce is called with
+	 * reconfig_mutex hold. The first step of raid5_quiesce() is waitting
+	 * for all IO finish, hence waitting for reclaim thread, while reclaim
+	 * thread is calling this function and waitting for reconfig mutex. So
+	 * there is a deadlock. We workaround this issue with a trylock.
+	 * FIXME: we could miss discard if we can't take reconfig mutex
 	 */
-	if (!log->in_teardown) {
-		set_mask_bits(&mddev->flags, 0,
-			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
-		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait,
-			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
-			log->in_teardown);
-		/*
-		 * r5l_quiesce could run after in_teardown check and hold
-		 * mutex first. Superblock might get updated twice.
-		 */
-		if (log->in_teardown)
-			md_update_sb(mddev, 1);
-	} else {
-		WARN_ON(!mddev_is_locked(mddev));
-		md_update_sb(mddev, 1);
-	}
+	if (!mddev_trylock(mddev))
+		return;
+	md_update_sb(mddev, 1);
+	mddev_unlock(mddev);
 
 	/* discard IO error really doesn't matter, ignore it */
 	if (log->last_checkpoint < end) {
@@ -827,7 +816,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	if (!log || state == 2)
 		return;
 	if (state == 0) {
-		log->in_teardown = 0;
 		/*
 		 * This is a special case for hotadd. In suspend, the array has
 		 * no journal. In resume, journal is initialized as well as the
@@ -838,11 +826,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
 	} else if (state == 1) {
-		/*
-		 * at this point all stripes are finished, so io_unit is at
-		 * least in STRIPE_END state
-		 */
-		log->in_teardown = 1;
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
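
The trylock approach in the patch above can be illustrated outside the kernel.
What follows is a minimal user-space sketch, not kernel code: reconfig_mutex,
write_super_and_discard and the log dictionary are hypothetical stand-ins for
the mddev lock, r5l_write_super_and_discard_space() and struct r5l_log. It only
shows the control flow: if the lock is already held, the function skips both
the superblock update and the discard instead of blocking.

```python
import threading

# Hypothetical stand-in for mddev's reconfig_mutex.
reconfig_mutex = threading.Lock()


def write_super_and_discard(log, end):
    """Try to persist the new log tail, then discard; never block on the lock."""
    # Same idea as `if (!mddev_trylock(mddev)) return;` in the patch.
    if not reconfig_mutex.acquire(blocking=False):
        log["missed_discards"] += 1
        return False
    try:
        # Stand-in for md_update_sb(): the superblock now points at the new tail.
        log["sb_tail"] = end
    finally:
        reconfig_mutex.release()
    # Only after the superblock is updated is it safe to discard old log space.
    log["discarded_to"] = end
    return True
```

The cost of this design is the one named in the FIXME: a discard is
occasionally missed when the lock is contended, in exchange for never
deadlocking against the thread that holds the mutex.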


* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: NeilBrown @ 2016-08-24  4:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Shaohua Li
In-Reply-To: <20160817012803.GA86961@kernel.org>


On Wed, Aug 17 2016, Shaohua Li wrote:
>> >
>> > We will have the same deadlock issue with just stopping/restarting the reclaim
>> > thread. As stopping the thread will wait for the thread, which probably is
>> > doing r5l_do_reclaim and writting the superblock. Since we are writting the
>> > superblock, we must hold the reconfig_mutex.
>> 
>> When you say "writing the superblock" you presumably mean "blocked in
>> r5l_write_super_and_discard_space(), waiting for  MD_CHANGE_PENDING to
>> be cleared" ??
> right
>> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
>> ->quiesce to be set, and then exit gracefully.
>
> Can you give details about this please? .quiesce is called with reconfig_mutex
> hold, so the MD_CHANGE_PENDING will never get cleared.

raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().

r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.

But the reclaim thread might be in
   r5l_do_reclaim() -> r5l_write_super_and_discard_space()
waiting for MD_CHANGE_PENDING to clear.  That will only get cleared when
the main thread can get the reconfig_mutex, which the thread calling
raid5_quiesce() might hold.  So we get a deadlock.

My suggestion is to change r5l_write_super_and_discard_space() so that
it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
to be set.  That will avoid the deadlock.

Whatever thread called raid5_quiesce() will now be in control of the
array without any async IO going on.  If it needs the metadata to be
sync, it can do that itself.  If not, then it doesn't really matter that
r5l_write_super_and_discard_space() didn't wait.

r5l_write_super_and_discard_space() shouldn't call discard if the
superblock write didn't complete, and probably r5l_do_reclaim()
shouldn't update last_checkpoint and last_cp_seq in that case.
This is what I mean by "with a bit of care" and "exit gracefully".
Maybe I should have said "abort cleanly".  The goal is to get the thread
to exit.  It doesn't need to complete what it was doing, it just needs
to make sure that it leaves things in a tidy state so that when it
starts up again, it can pick up where it left off.
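
A minimal user-space sketch of this "wait for either condition, then abort
cleanly" idea. It assumes a hypothetical SbState object in place of the
MD_CHANGE_PENDING flag and conf->quiesce, and a plain dictionary in place of
the log; none of these names come from the kernel. The waiter reports whether
the superblock write actually completed, so the caller knows whether it may
discard and advance its checkpoint.

```python
import threading


class SbState:
    """Hypothetical stand-in for MD_CHANGE_PENDING and conf->quiesce."""

    def __init__(self):
        self.cond = threading.Condition()
        self.change_pending = True
        self.quiesce = False


def wait_for_sb_or_quiesce(state, timeout=None):
    """Return True if the superblock write completed, False otherwise.

    False covers both "quiesce was requested first" and "timed out".
    """
    with state.cond:
        state.cond.wait_for(
            lambda: not state.change_pending or state.quiesce, timeout
        )
        return not state.change_pending


def reclaim_step(state, log, end):
    """Advance the checkpoint only if the superblock was really written."""
    if not wait_for_sb_or_quiesce(state, timeout=1.0):
        return False  # abort cleanly: last_checkpoint stays untouched
    log["last_checkpoint"] = end
    return True
```

A thread that clears change_pending or sets quiesce would do so while holding
state.cond and then call state.cond.notify_all().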

NeilBrown



* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-24  2:14 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20160823050947.GL32250@subspacefield.org>

$ mdadm --version
mdadm - v3.2.5 - 18th May 2012
$ uname -a
Linux hostname 3.2.0-107-generic #148-Ubuntu SMP Mon Jul 18 20:22:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"

And I think there must be a bug in referencing the beginning of a
partition vs the beginning of the disk which leads to this.  Back when
I was using raw disk devices I had corruption in the first cylinders
which also held the mdlabel and I thought the lack of a partition
table was the problem... obviously not.

Could very well be a bug in USB enclosure firmware too.  Hard to know
how to proceed.

On Mon, Aug 22, 2016 at 10:09:47PM -0700, travis+ml-linux-raid@subspacefield.org wrote:
> Hello all,
> 
> So I have an Intel NUC (for low power Linux) plugged via USB into a 4
> bay enclosure doing linear (yeah I know; it's the backup server, the
> primary is raid10).
> 
> And every once in a while, this happens (*see end).  The partition 1
> that would normally contain a MD slice ends up being a replica of the
> boot cylinder.  I can't tell if it's the mdraid linear impl, the
> kernel doing something weird, the USB drivers, the enclosure firmware,
> or what.
> 
> Anyway, this happened while I was restoring a Windows machine whose
> root drive suddenly took a nosedive, and it happens every 6 months
> or so.  Today it happened while I was in the middle of recovering
> a Windows machine whose 1TB SSD threw up on C: and totally nuked
> the data.
> 
> The last low-power option I tried was an OpenRD Ultimate based around
> ARMv5TE which was basically unsupported by debian by the time I got
> it, and subsequently became ultra-flaky due to what seemed to be RAM
> problems - it was crashing every 3 days with kernel panics, and every
> once in a while would do something worse.
> 
> Any recommendations on a low power hardware with a well-supported
> distro, that matches up well with a real backplane and SATA
> connections instead of USB.  The only caveat is that I want to encrypt
> raw disks and it has to not be very noisy - so no rackmount gear
> with 65dB 1" dog whistle fans.  Obviously, whatever backplane must
> be well-supported by the distro.
> 
> Also, does anyone have experience with cryptsetup on multiple
> partitions?  I can do that but get prompted multiple times and I was
> wondering if anyone knew an easy way to fix the boot time scripts to
> avoid that, only prompting once per unique underlying crypttab.
> 
> And finally, I have a story about buggy drive firmware that you
> might enjoy, especially if you were doing this sort of stuff in
> the 90s as well.  Cheers:
> 
> http://www.subspacefield.org/security/hard_drives_of_doom/
> 
> 
> [*]
> 
> # parted /dev/sde
> GNU Parted 2.3
> Using /dev/sde
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p                                                                
> Model: WDC WD40 EFRX-68WT0N0 (scsi)
> Disk /dev/sde: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> 
> Number  Start   End     Size    File system  Name        Flags
>  1      1049kB  4001GB  4001GB               Linux RAID  raid
> 
> (parted) q                                                                
> # parted /dev/sdd1
> GNU Parted 2.3
> Using /dev/sdd1
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p                                                                
> Model: Unknown (unknown)
> Disk /dev/sdd1: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> 
> Number  Start   End     Size    File system  Name        Flags
>  1      1049kB  4001GB  4001GB               Linux RAID  raid
> 
> -- 
> http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
> "Computer crime, the glamor crime of the 1970s, will become in the
> 1980s one of the greatest sources of preventable business loss."
> John M. Carroll, "Computer Security", first edition cover flap, 1977

-- 
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977


* Re: kernel checksumming performance vs actual raid device performance
From: Shaohua Li @ 2016-08-24  1:02 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm
In-Reply-To: <CAJvUf-C-Nr8sSnSPL-5jt1NLOAiZjhZ=bjDRUbX_RjphRL+yWA@mail.gmail.com>

On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.
> 
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place.  The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
> 
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
> 
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
> 
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
> 
> Dmesg seems to give some hints:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> Perhaps naively, I would expect that second-to-last line:
> 
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> 
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
> 
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?

In non-degraded mode, raid6 just dispatches IO directly to the raid disks, so
software involvement is very small. In degraded mode, the data must be
calculated. A lot of factors affect the performance:
1. Entering the raid6 state machine, which has a long code path. (This is
debatable: if a read doesn't touch the faulty disk and it's a small random
read, raid6 doesn't need to run the state machine. Fixing this could hugely
improve the performance.)
2. The state machine runs in a single thread, which is a bottleneck. Try
increasing group_thread_cnt, which makes the handling multi-threaded.
3. The stripe cache is involved. Try increasing stripe_cache_size.
4. The faulty disk's data must be calculated, which involves reads from the
other disks. If this is a NUMA machine and each disk interrupts different
cpus/nodes, there will be a big impact (cache, wakeup IPI).
5. The xor calculation overhead. Actually I don't think the impact is big;
modern CPUs can do the calculation fast.
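
Points 2 and 3 refer to per-array sysfs attributes. Below is a minimal sketch
of adjusting them, assuming the array is md0 so that the attributes live under
/sys/block/md0/md/. The helper name set_md_tunable is made up for illustration,
and writing these files requires root.

```python
from pathlib import Path


def set_md_tunable(md_dir, name, value):
    """Write `value` to the md sysfs attribute `name`; return the previous value."""
    attr = Path(md_dir) / name
    previous = attr.read_text().strip()
    attr.write_text(f"{value}\n")
    return previous
```

For example, set_md_tunable("/sys/block/md0/md", "group_thread_cnt", 4) or
set_md_tunable("/sys/block/md0/md", "stripe_cache_size", 16384). The values are
examples only; the returned previous value makes it easy to revert after
measuring.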

Thanks,
Shaohua


* Re: [PATCH v2] raid10: record correct address of bad block
From: Shaohua Li @ 2016-08-24  0:12 UTC (permalink / raw)
  To: Tomasz Majchrzak
  Cc: linux-raid, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt
In-Reply-To: <1471942437-16720-1-git-send-email-tomasz.majchrzak@intel.com>

On Tue, Aug 23, 2016 at 10:53:57AM +0200, Tomasz Majchrzak wrote:
> For failed write request record block address on a device, not block
> address in an array.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  drivers/md/raid10.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index cfa96b5..cd8d197 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2465,18 +2465,19 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
>  
>  	while (sect_to_write) {
>  		struct bio *wbio;
> +		sector_t wsector;
>  		if (sectors > sect_to_write)
>  			sectors = sect_to_write;
>  		/* Write at 'sector' for 'sectors' */
>  		wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>  		bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
> -		wbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
> -				   choose_data_offset(r10_bio, rdev) +
> -				   (sector - r10_bio->sector));
> +		wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
> +		wbio->bi_iter.bi_sector = wsector +
> +				   choose_data_offset(r10_bio, rdev);
>  		wbio->bi_bdev = rdev->bdev;
>  		if (submit_bio_wait(WRITE, wbio) < 0)
>  			/* Failure! */
> -			ok = rdev_set_badblocks(rdev, sector,
> +			ok = rdev_set_badblocks(rdev, wsector,
>  						sectors, 0)
>  				&& ok;

Applied, thanks!
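
To make the difference between the two addresses concrete, here is a small
sketch of the arithmetic in the patched loop. The function name and the sample
numbers are invented for illustration; the variable names mirror the patch. It
assumes, as the patch description says, that the bad block log expects an
address on the member device rather than an address in the array.

```python
def narrow_write_sectors(dev_addr, r10_sector, sector, data_offset):
    """Return (sector to record as bad, sector the write bio is sent to)."""
    # Offset of this piece within the request, applied to the device address.
    wsector = dev_addr + (sector - r10_sector)
    # The bio additionally needs the data offset of the member device.
    bio_sector = wsector + data_offset
    return wsector, bio_sector


# Array sector 5008 belongs to a request that starts at array sector 5000 and
# maps to device address 1000, on a member whose data starts at sector 2048.
wsector, bio_sector = narrow_write_sectors(1000, 5000, 5008, 2048)
print(wsector, bio_sector)
```

Before the patch, the array sector (5008 in this example) was passed to
rdev_set_badblocks(); after it, the device address (1008) is recorded.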


* Re: [PATCH -next] md-cluster: fix error return code in join()
From: Shaohua Li @ 2016-08-24  0:09 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: Wei Yongjun, linux-raid
In-Reply-To: <1471790545-3301-1-git-send-email-weiyj.lk@gmail.com>

On Sun, Aug 21, 2016 at 02:42:25PM +0000, Wei Yongjun wrote:
> From: Wei Yongjun <weiyongjun1@huawei.com>
> 
> Fix to return error code -ENOMEM from the lockres_init() error
> handling case instead of 0, as done elsewhere in this function.
> 
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
> ---
>  drivers/md/md-cluster.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 333f0cf..2b13117 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -874,8 +874,10 @@ static int join(struct mddev *mddev, int nodes)
>  		goto err;
>  	}
>  	cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
> -	if (!cinfo->ack_lockres)
> +	if (!cinfo->ack_lockres) {
> +		ret = -ENOMEM;
>  		goto err;
> +	}
>  	/* get sync CR lock on ACK. */
>  	if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
>  		pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
> @@ -889,8 +891,10 @@ static int join(struct mddev *mddev, int nodes)
>  	pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number);
>  	snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1);
>  	cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1);
> -	if (!cinfo->bitmap_lockres)
> +	if (!cinfo->bitmap_lockres) {
> +		ret = -ENOMEM;
>  		goto err;
> +	}
>  	if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) {
>  		pr_err("Failed to get bitmap lock\n");
>  		ret = -EINVAL;
> @@ -898,8 +902,10 @@ static int join(struct mddev *mddev, int nodes)
>  	}
>  
>  	cinfo->resync_lockres = lockres_init(mddev, "resync", NULL, 0);
> -	if (!cinfo->resync_lockres)
> +	if (!cinfo->resync_lockres) {
> +		ret = -ENOMEM;
>  		goto err;
> +	}
>  
>  	return 0;
>  err:

applied, thanks! 


* Re: kernel checksumming performance vs actual raid device performance
From: Phil Turmel @ 2016-08-23 21:42 UTC (permalink / raw)
  To: Doug Ledford, Matt Garman, Doug Dumitru; +Cc: Mdadm
In-Reply-To: <3e239b96-b06e-d33b-2e99-42ffa170d804@redhat.com>

On 08/23/2016 04:15 PM, Doug Ledford wrote:

> Your raid device has a good chunk size for your usage pattern.  If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently.  But, then again, maybe I'm wrong and that
> would help.  With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.

This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
blocks.  The stripe cache for an array is a collection of 4k elements
per member device.  Chunk size doesn't factor into the cache itself.
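
As a rough sketch of that arithmetic: assuming one 4 KiB page per member device
per cache entry, memory use is page size times member count times
stripe_cache_size, and chunk size does not appear at all. The function name is
made up for illustration, and the 24 members are taken from the array described
in this thread.

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Approximate stripe cache memory: one page per member device per entry."""
    return stripe_cache_size * nr_disks * page_size


GIB = 1024 ** 3
for entries in (256, 16384, 32768):
    print(entries, "entries:", stripe_cache_bytes(entries, 24) / GIB, "GiB")
```

Under these assumptions a 24-member array with 16384 entries needs about
1.5 GiB and with 32768 entries about 3 GiB, which is far below the figure
quoted in the next paragraph.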

But see below....

> Makes sense.  I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider its memory lost.  When you upped it to 64k, and
> you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
> allowed stripes, which is a maximum memory consumption of around 700GB
> RAM.  I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using.  That also explains why setting it higher doesn't
> provide any additional benefits ;-).

More likely the parity thread saturated and no more speed was possible.
Also possible that there would be a step change in performance again at
a much larger cache size.

>> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
>> 8000 MB/s, per dmesg:
>>
>> [    6.386820] xor: automatically using best checksumming function:
>> [    6.396690]    avx       : 24064.000 MB/sec
>> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
>> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
>> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
>> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
>> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
>> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
>> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>> [    6.499774] raid6: using avx2x2 recovery algorithm
>>
>> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

Parity operations in raid must always involve all (available) member
devices.  Read operations when not degraded won't generate any parity
operations.  Most large write operations and any degraded read
operations will involve all members, even if those members' data is not
part of the larger read/write request.

As chunk sizes get larger the odds grow that any given array I/O will
touch a fraction of the slice, causing I/O to members purely for parity
math.  Also, the odds rise that the starting point or ending point of an
array I/O operation will not be aligned to the stripe, making more
member I/O solely for parity math.

Then add in the fact that dd issues I/O requests one block at a time,
per the bs=? parameter.  So it is possible that data that would have
been sequential without parallel pressure (still in the stripe cache for
later reads) generates multiple parity calculations for fractional
stripe operations, just due to stripe size/alignment mismatch on single
dd dispatches.

What bs=? value are you using in your dd commands?  Based on your 512k
chunk, it should be 10240k for aligned operations and much larger than
that for unaligned.
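
A small sketch of the full-stripe arithmetic behind that bs= suggestion: one
full stripe carries chunk size times the number of data members. The function
name is invented for illustration. Note that 10240k corresponds to 20 data
members; for the 24-member raid6 shown in /proc/mdstat elsewhere in this thread
(22 data members) the same formula gives 11264k, so check the member count of
your own array before picking a value.

```python
def full_stripe_kib(chunk_kib, total_disks, parity_disks=2):
    """Data carried by one full stripe: chunk size times number of data members."""
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("array needs at least one data member")
    return chunk_kib * data_disks


print(full_stripe_kib(512, 22))  # 20 data members
print(full_stripe_kib(512, 24))  # 22 data members
```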

FWIW, I use small chunk sizes -- usually 16k.

Phil


* [PATCH] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Robert LeBlanc @ 2016-08-23 20:37 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, robert

Signed-Off: Robert LeBlanc<robert@leblancnet.us>
---
 mdopen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mdopen.c b/mdopen.c
index f818fdf..5af344b 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
 	struct createinfo *ci = conf_get_create_info();
 	int parts;
 	char *cname;
-	char devname[20];
+	char devname[37];
 	char devnm[32];
 	char cbuf[400];
 	if (chosen == NULL)
-- 
2.9.3
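
The new size of 37 can be checked with a little arithmetic. This sketch assumes
that devname is built by prefixing "/dev/" to the contents of devnm, which the
patch itself does not show; the function name is made up for illustration.

```python
def devname_buffer_size(devnm_buf_size=32, prefix="/dev/"):
    """Bytes needed for prefix + longest devnm string + terminating NUL."""
    longest_devnm = devnm_buf_size - 1  # a 32-byte C buffer holds 31 chars + NUL
    return len(prefix) + longest_devnm + 1


def max_devnm_chars(devname_buf_size, prefix="/dev/"):
    """Longest devnm string that still fits in a devname buffer of this size."""
    return devname_buf_size - len(prefix) - 1


print(devname_buffer_size())  # size needed for any devnm[32] content
print(max_devnm_chars(20))    # what the old devname[20] could hold
```

Under that assumption the old 20-byte buffer overflows as soon as devnm is
longer than 14 characters.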



* Re: [RFC] Some fixes to allow for more than 128 md devices.
From: Robert LeBlanc @ 2016-08-23 20:37 UTC (permalink / raw)
  To: linux-raid; +Cc: dm-devel, Robert LeBlanc
In-Reply-To: <CAANLjFqv33upkC5tLN8i77ysZCcFWqRYpqVUduBaCCAEOcZAqA@mail.gmail.com>

I found an email thread [0] talking about the new way to do this. I
did find a buffer overrun and will submit a patch for it.

Robert LeBlanc

[0] http://www.spinics.net/lists/raid/msg52300.html
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Aug 22, 2016 at 10:03 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Apparently, the mdadm source on git-kernel.org (commit 13db17bd)
> already has the fixes to properly create the device nodes, but I still
> have the unexpected failure opening /dev/md1048574.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Aug 19, 2016 at 8:10 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> I'm stuck and need some help getting this across the finish line. This
>> is in no way complete, but to help show what I'm working on.
>>
>> When we added more than 128 md devices, we started getting failures.
>> Looking through the code it seems that the minor dev number was being
>> stored in an int and causing overflow and wreaking havoc on everything.
>> I finally got the mknod in mdadm to correctly make the dev node with
>> minors up to 1048574 as expected in the mdadm code. However, I can
>> only create md devices up to 511. Trying to create an md higher than
>> that has an error where the device can't be read/opened strace reports:
>> open("/dev/.tmp.md.15341:9:1048574", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO
>> (No such device or address)
>> while Python reports:
>> IOError: [Errno 6] No such device or address: '/dev/.tmp.md.3279:9:512'
>>
>> A corresponding node is not created in /sys/block/md* for mds over 511.
>>
>> I believe that there may be a bug in the kernel code that is now being
>> hit. After looking through the kernel code, I can't seem to find where
>> this might be. Please help me by either pointing me to the source
>> location that this might be a problem or fixing it based on these
>> patches I've worked on so far. I'm using 4.7.0 currently.
>>
>> I'm using this for testing:
>> ./mdadm --create /dev/md1048574 --assume-clean --verbose --level=1 \
>> --raid-devices=2 /dev/loop0 missing
>>
>> Yes, we have a real need for more than 128 and 512 md devices.
>>
>> Please include me in any replies as I'm not on the ML.
>>
>> Thank you.
>>
>> Robert LeBlanc (1):
>>   Some fixes to allow for more than 128 md devices.
>>
>>  Manage.c |  5 +++--
>>  lib.c    |  2 +-
>>  mdadm.h  |  6 +++---
>>  util.c   | 25 +++++++++++++------------
>>  4 files changed, 20 insertions(+), 18 deletions(-)
>>
>> --
>> 2.8.1
>>


* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 20:15 UTC (permalink / raw)
  To: Matt Garman, Doug Dumitru; +Cc: Mdadm
In-Reply-To: <CAJvUf-ApkKJXm7Jjiq=gXY9b9RrEvwA5u35xrMUjX2x0btVL4g@mail.gmail.com>



On 8/23/2016 3:26 PM, Matt Garman wrote:
> Doug & Doug,
> 
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
> 
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OPs test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
> 
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
> 
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.

OK, 50 sequential I/Os at a time.  Good point to know.

> 
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
> 
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
> 
> Personalities : [raid1] [raid6] [raid5] [raid4]
> 
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
> 
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
> 
>       bitmap: 0/15 pages [0KB], 65536KB chunk

Your raid device has a good chunk size for your usage pattern.  If you
had a smallish chunk size (like 64k or 32k), I would actually expect
things to behave differently.  But, then again, maybe I'm wrong and that
would help.  With a smaller chunk size, you would be able to fit more
stripes in the stripe cache using less memory.

> 
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
> 
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.

Makes sense.  I know the stripe cache size is conservative by default
because of the fact that it's not shared with the page cache, so you
might as well consider its memory lost.  When you upped it to 64k, and
you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
allowed stripes, which is a maximum memory consumption of around 700GB
RAM.  I doubt you have that much in your machine, so I'm guessing it's
simply using all available RAM that the page cache or something else
isn't already using.  That also explains why setting it higher doesn't
provide any additional benefits ;-).

>  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.

You probably have maxed out your single CPU performance and won't see
any benefit without having a multi-threaded XOR routine.

> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
> 
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course effects the overall system.
> 
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> I'm assuming however the kernel does its testing is fairly optimal,

It is *highly* optimal.  What's more, it uses 100% CPU during this time.
 The raid6 thread doing your recovery is responsible for lots of stuff,
issuing reads, doing xor, fulfilling write requests, maintaining the
cache, etc.  It has to have time to actually do other work.  So start
with that 8GB/s figure, but immediately start subtracting from that
because the CPU needs to do other things as well.  Then remember that we
are under *extreme* memory pressure.  When you have to bring in 22 reads
in order to reconstruct just 1 block of the same size, then for 100MB/s
of degraded reads you are generating 2200MB/s of PCI DMA -> MEM
bandwidth consumption, followed by 2200MB/s of MEM -> register load
bandwidth consumption.  I'd have to read the avx xor routine to know
how much write bandwidth it is using, but it's at least 100MB/s, and
likely at least four or five times that much, because it probably
doesn't do all 22 blocks in a single xor pass.  It likely loads parity,
then reads up to maybe four blocks and xors them together and then
stores the parity, so each pass will re-read and re-store the parity
block.  The point of all of this is that people forget to do the
math on the memory bandwidth used by these XOR operations.  The faster
they are, the higher the percentage of main memory bandwidth you are
consuming.  Now you have to subtract all of that main memory bandwidth
from the total main memory bandwidth for the CPU, and what's left over
is all you have for doing other productive work.  Even if you aren't
blowing your caches doing all of this XOR work, you are blowing your
main memory bandwidth.  Other threads or other actions end up stalling
waiting on main memory accesses to complete.
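
To put rough numbers on that, here is a back-of-the-envelope sketch.
The function and its parameters are hypothetical names of mine, and the
four-blocks-per-pass figure is only the guess made above, not something
I have verified against the xor routine.

```python
def degraded_read_traffic(read_mb_s, blocks_per_fixup=22, blocks_per_xor_pass=4):
    """Estimate memory traffic (MB/s) needed to serve reconstructed reads."""
    dma_in = read_mb_s * blocks_per_fixup   # PCI DMA -> main memory
    loads = read_mb_s * blocks_per_fixup    # main memory -> CPU registers
    # Assumed: one parity store per xor pass (ceiling division for the passes).
    passes = -(-blocks_per_fixup // blocks_per_xor_pass)
    stores = read_mb_s * passes
    return {
        "dma_in": dma_in,
        "loads": loads,
        "stores": stores,
        "total": dma_in + loads + stores,
    }


if __name__ == "__main__":
    for rate in (100, 1000):
        print(rate, "MB/s degraded ->", degraded_read_traffic(rate))
```

Under those assumptions 100MB/s of degraded reads costs about 5000MB/s
of memory traffic, and 1000MB/s costs about ten times that.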

> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests...

It will never be that good, and you can thank your stars that it isn't,
because if it were, your computer would be ground to a halt with nothing
happening but data XOR computations.

> but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...

The math fits.  Most quad channel Intel CPUs have memory bandwidths in
the 50GByte/s range theoretical maximum, but it's not bidirectional,
it's not even multi-access, so you have to remember that the usage looks
like this on a good read:

copy 1: DMA from PCI bus to main memory
copy 2: Load from main memory to CPU for copy_to_user
copy 3: Store from CPU to main memory for user

To get 8GB/s of undegraded read performance then requires 24GB/s of
actual memory bandwidth just for the copies.  That's half of your entire
memory bandwidth (unless you have multiple sockets, then things get more
complex, but this is still true for one socket of the multiple socket
machine).  Once you add the XOR routine into the figure, the 3 accesses
are the same for part of it, but for degraded fixups, it is much worse.
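
The same sum as a tiny sketch, in case you want to plug in your own
numbers.  The names are made up for illustration, and the 50GB/s is the
theoretical maximum quoted above, not a measured figure for your box.

```python
COPIES_PER_READ = 3          # DMA in, load for copy_to_user, store to user
THEORETICAL_MAX_GB_S = 50    # quad channel figure quoted above (assumed)


def copy_bandwidth_gb_s(read_gb_s, copies=COPIES_PER_READ):
    """Memory bandwidth consumed just by moving the data around."""
    return read_gb_s * copies


needed = copy_bandwidth_gb_s(8)
share = needed / THEORETICAL_MAX_GB_S

print(f"{needed} GB/s for copies, {share:.0%} of theoretical bandwidth")
```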

> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...

You could try that, but I doubt it will affect much.
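
If you do want to experiment, a minimal sketch of setting the affinity
from userspace follows.  It uses Python's os.sched_setaffinity(), which
is Linux-only.  The helper name is mine, and whether the kernel accepts
the change for a particular kernel thread is an assumption you would
have to test (some kernel threads refuse it); you will need root.

```python
import os
import sys


def pin_to_cpu(pid, cpu):
    """Restrict task `pid` to a single CPU and return its new affinity set."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Hypothetical usage: sudo python3 pin.py <pid of md0_raid6> <cpu>
    if len(sys.argv) == 3:
        print(pin_to_cpu(int(sys.argv[1]), int(sys.argv[2])))
    else:
        print("usage: pin.py PID CPU")
```

Take the PID of the md0_raid6 thread from top or ps, and pick a CPU that
you have kept other work off of.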

>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
> 
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?

Yes.  The default setting is conservative, you told it to use as much
memory as it needed.

>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
> 
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.

That's a huge waste, are you sure he didn't use raid0 for the stripe?

>  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.

I would try it.  If you are OK with single disk failures anyway.

>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
> 
> I'm certain head movement time isn't the issue, as these are SSDs.  :)

Fair enough ;-).  And given these are SSDs, I'd be just fine doing
something like four 6 disk raid5s then striped in a raid0 myself.  The
main cause for concern with spinning disks is latent bad sectors causing
a read error on rebuild, with SSDs that's much less of a concern.

> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
> 
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
> 
> 
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)

I would try to tune your stripe cache size such that the kswapd?
processes go to sleep.  Those are reading/writing swap.  That won't help
your overall performance.

> Here is a representative view of a non-first iteration of "iostat -mxt 5":
> 
> 
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00    83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00    10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00   511.00     5.93   28.75   28.75    0.00   4.31  88.10

I'm not sure how much I trust some of these numbers.  According to this,
you are issuing 200 read/s, at an average size of 511KB, which should
work out to roughly 100MB/s of data read, but rMB/s is only 51.  I
wonder if the read requests from the raid6 thread are bypassing the
rMB/s accounting because they aren't coming from the VFS or some such?
It would explain why the rMB/s is only half of what it should be based
upon requests and average request size.
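
A quick cross-check of the sdd row, as a sketch.  One thing worth ruling
out before blaming the accounting: the iostat man page describes
avgrq-sz in sectors, not KB.  I'm assuming 512-byte sectors here, and the
helper name is just for illustration.

```python
SECTOR = 512  # bytes; assumption: avgrq-sz is counted in 512-byte sectors


def throughput_mib_s(reads_per_s, avg_request, unit_bytes):
    """Throughput implied by request rate times average request size."""
    return reads_per_s * avg_request * unit_bytes / 2**20


# sdd row: r/s = 204.40, avgrq-sz = 511.00, reported rMB/s = 51.00
as_kb = throughput_mib_s(204.40, 511.00, 1024)
as_sectors = throughput_mib_s(204.40, 511.00, SECTOR)

print(f"if avgrq-sz is KB:      {as_kb:.1f} MB/s")
print(f"if avgrq-sz is sectors: {as_sectors:.1f} MB/s")
```

If the sector reading is right, the reported rMB/s lines up with the
request count and request size, and nothing is being missed.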

> sde           13002.60     0.00  205.20    0.00    51.20     0.00   511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00   509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00   506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00   503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00   511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00   501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00   500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00   503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00   503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00   500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00   501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00   500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00   501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00   496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00   493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00   502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00   502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00   501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00   499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00   496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00   501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00   511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00    71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00    81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00   508.41     0.00    0.00    0.00    0.00   0.00   0.00
> 
> 
> sdy and sdz are the system drives, so they are uninteresting.
> 
> sda is the md0 drive I failed, that's why it stays at zero.
> 
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
> 
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
> 
> 
> That's my first time using the perf tool, so I need a little hand-holding here.

You might get more interesting perf results if you could pin the md
raid6 thread to a single CPU and then filter the perf results to just
that CPU.


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD

