* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Phil Turmel @ 2016-08-26 2:22 UTC
To: Ben, linux-raid
In-Reply-To: <57BF9965.1020403@gmail.com>
On 08/25/2016 09:20 PM, Ben wrote:
> I read a lot of conflicting info on SCT/ERC online (well, TLER anyway)
> -- Adam likes it enabled. What say the rest of you?
Adam is correct, and it's not a matter of "like". You either must have
it enabled, or you *must* apply the kernel driver timeout work-around
(180 seconds) for that drive. Failure to do so results in crashed arrays.
Enterprise and NAS drives work out of the box. Desktop/green drives do not.
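For anyone who wants to script that choice, here is a rough sketch (not a definitive implementation) in the spirit of what the threads below discuss. It assumes smartctl exits non-zero when a drive refuses the scterc command. SYSFS_ROOT and SMARTCTL are hypothetical overrides I added so the function can be dry-run against a scratch directory. The settings are generally not persistent, so something like this would have to run at every boot.

```shell
# Sketch only: per drive, try to enable SCT ERC at 7.0 seconds; if the drive
# refuses, raise the kernel command timeout to 180 seconds instead.
# SYSFS_ROOT and SMARTCTL are hypothetical overrides for dry runs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
SMARTCTL="${SMARTCTL:-smartctl}"

fix_drive_timeout() {
    local dev="$1"    # kernel name without /dev/, e.g. sdc
    # scterc takes read,write limits in tenths of a second (70 = 7.0 s)
    if "$SMARTCTL" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC set to 7.0 s"
    else
        echo 180 > "$SYSFS_ROOT/block/$dev/device/timeout"
        echo "$dev: SCT ERC unavailable, kernel timeout set to 180 s"
    fi
}

# Usage (as root), for example:
#   for d in sdc sdd sde sdf; do fix_drive_timeout "$d"; done
```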
Some reading assignments from old discussions (read whole threads if you
have time):
http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2
* Re: Need Help with crashed RAID5 (that was rebuilding and then had SATA error on another drive)
From: Ben @ 2016-08-26 1:20 UTC
To: linux-raid
In-Reply-To: <933228e0-bce4-ffad-f48d-034bf89bc07f@websitemanagers.com.au>
As an update,
Adam's been helping me out (and I'm not used to hitting "reply-all" for mailing lists, as pretty much all the ones I'm on set the "reply-to:").
I've turned on SCT/ERC for the drives... and the one that went bonkers during the rebuild (sde) still had read issues during a rebuild.
SMART reports it's OK, but... (shrug) I ended up running ddrescue to the new replacement drive (sdc) that kept getting put back into spare status when the rebuilds failed.
So I just copied sde -> sdc, which went pretty much flawlessly (ddrescue completed without any final complaints).
I also played with badblocks after doing my copy and could find bad blocks -- but apparently ddrescue had no issues.
So - I went back to:
* bringing up the array. No problems.
* adding ANOTHER new drive (that I ordered Sunday night), and it rebuilt fine.
* doing an fsck -n first, which reported no issues - so I did a regular fsck (without -y) and it never prompted me for anything.
My last step is to run rsync -n from my backup to see if it can find any issues between my last backup and the current data for any files with byte oddities.
All this has me wondering whether those old bad sectors left some files with a sector of garbage in them.
Adam seems to think everything is fine -- so far, that seems to be the case.
A last few questions I have are:
The new drive I got was (supposed to be) the same model as the last Seagate I ordered, but SMART reports them differently. (see attached)
The question on the new drive is that it says it does offline collection... but with gsmartcontrol, I can't seem to turn it on.
This new drive also doesn't seem to support SCT/ERC the same way.
Again,
/dev/sdc - old new spare (bought after Seagate bought Samsung and discontinued the HD103SJ model)
/dev/sdd - original RAID member
/dev/sde - brand spanking new drive purchased Sunday.
/dev/sdf - original RAID member
I realize now one says: ST1000DM005 vs ST1000DM003 - Grrr!!!
So I'd like recommendations on whether I should get better matching drives (I can use these elsewhere) or it doesn't matter.
Can I mix/match this array with WD REDs? (and eventually retire all these HD103SJ drives) Do people even like these? They seem ok?
I read a lot of conflicting info on SCT/ERC online (well, TLER anyway) -- Adam likes it enabled. What say the rest of you?
And last -- any caveats as to upgrading this array to RAID6 from RAID5? Can I even do that while in place?
Thanks all, (especially Adam!)
-Ben
p.s. Check out some of the SMART params on /dev/sde. Head flying hours?? And they're not zero. Weird. :/ This drive kinda creeps me out.
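A note on that huge Head_Flying_Hours raw value: a possible reading, assuming (as I understand is common for Seagate drives) that the 48-bit raw field packs the hour counter into its low 32 bits and something else into the upper bits. The sketch below only does the bit arithmetic on the number from the attached report; the interpretation of the fields is my assumption, not something smartctl 5.43 states.

```shell
# Sketch: split a 48-bit SMART raw value into its low 32 bits and the rest.
# Assumption: for Head_Flying_Hours the low 32 bits are the hour counter.
decode_head_flying_hours() {
    local raw="$1"
    echo "hours=$(( raw & 0xFFFFFFFF )) upper_bits=$(( raw >> 32 ))"
}

# Raw value reported for /dev/sde in the attachment
decode_head_flying_hours 109964047679495
```

For the value above this prints hours=7 upper_bits=25603, which would sit comfortably next to the 9 Power_On_Hours the same drive reports.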
[-- Attachment #2: RAID.smart-info.txt --]
[-- Type: text/plain, Size: 20353 bytes --]
[root@quantum ~]# smartctl -a /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: ST1000DM005 HD103SJ
Serial Number: S246JQ0D800949
LU WWN Device Id: 5 0000f0 080bb4909
Firmware Version: 1AJ10001
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Thu Aug 25 20:04:06 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 9120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 152) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 0
2 Throughput_Performance 0x0026 054 054 000 Old_age Always - 8630
3 Spin_Up_Time 0x0023 076 071 025 Pre-fail Always - 7526
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 133
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
191 G-Sense_Error_Rate 0x0022 252 252 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 063 000 Old_age Always - 30 (Min/Max 21/37)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 10
200 Multi_Zone_Error_Rate 0x002a 100 096 000 Old_age Always - 558
223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F3
Device Model: SAMSUNG HD103SJ
Serial Number: S246J9AB404176
LU WWN Device Id: 5 0024e9 204fbf695
Firmware Version: 1AJ10001
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Thu Aug 25 20:05:32 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 9180) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 153) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 195
2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
3 Spin_Up_Time 0x0023 073 070 025 Pre-fail Always - 8310
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 58
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 37763
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 75
191 G-Sense_Error_Rate 0x0022 252 252 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 062 000 Old_age Always - 31 (Min/Max 20/43)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 8
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 146
223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 77
========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sde
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST1000DM003-1ER162
Serial Number: Z4YDLXWJ
LU WWN Device Id: 5 000c50 091877801
Firmware Version: CC45
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ACS-2 (unknown minor revision code: 0x001f)
Local Time is: Thu Aug 25 20:06:33 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 80) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 105) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 108 100 006 Pre-fail Always - 18255632
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 2
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 269743
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 2
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0022 071 068 045 Old_age Always - 29 (Min/Max 26/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 21
194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 25 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 109964047679495
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3907074414
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 5102115
========================================================================================================================
[root@quantum ~]# smartctl -a /dev/sdf
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.el6.centos.plus.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F3
Device Model: SAMSUNG HD103SJ
Serial Number: S246J9AB404174
LU WWN Device Id: 5 0024e9 204fbf676
Firmware Version: 1AJ10001
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Thu Aug 25 20:07:19 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 9360) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 156) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 353
2 Throughput_Performance 0x0026 055 055 000 Old_age Always - 8559
3 Spin_Up_Time 0x0023 073 069 025 Pre-fail Always - 8389
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 74
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 43724
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 92
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 1
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 063 000 Old_age Always - 30 (Min/Max 15/40)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 91
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 229
223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 100
========================================================================================================================
* Re: kernel checksumming performance vs actual raid device performance
From: Adam Goryachev @ 2016-08-25 23:39 UTC
To: Matt Garman; +Cc: Mdadm
In-Reply-To: <CAJvUf-DXC6AtO3a=ox2XOinpRWgAv5NMPkRWAcsSZmBggF5_Dw@mail.gmail.com>
On 26/08/16 01:07, Matt Garman wrote:
>
>> Makes sense. I know the stripe cache size is conservative by default
>> because of the fact that it's not shared with the page cache, so you
>> might as well consider it's memory lost. When you upped it to 64k, and
>> you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
>> allowed stripes which is a maximum memory consumption of around 700GB
>> RAM. I doubt you have that much in your machine, so I'm guessing it's
>> simply using all available RAM that the page cache or something else
>> isn't already using. That also explains why setting it higher doesn't
>> provide any additional benefits ;-).
> Do you think more RAM might be beneficial then?
I'm not sure of this, but I suggest you try various sizes for
stripe_cache_size. In my testing I tried various values up to 64k,
but 4k ended up being the optimal value (I only have 8 disks with a 64k
chunk size)...
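To put numbers on the memory cost while experimenting: as I understand the kernel md documentation, stripe_cache_size counts entries of one page per member device, so the memory used is roughly page size * number of disks * stripe_cache_size, and the chunk size does not enter into it. Treat that formula as my reading of the docs rather than gospel. A small sketch of the arithmetic (function names are made up for illustration):

```shell
# Sketch: estimate stripe cache memory as page_size * nr_disks * entries.
# The formula is an assumption taken from my reading of the md documentation.
stripe_cache_bytes() {
    local page_size="$1" nr_disks="$2" entries="$3"
    echo $(( page_size * nr_disks * entries ))
}

stripe_cache_mib() {
    echo $(( $(stripe_cache_bytes "$@") / 1048576 ))
}

# Usage: stripe_cache_mib "$(getconf PAGESIZE)" 22 65536
```

With 4096-byte pages, 22 disks at 65536 entries comes to 5632 MiB by this estimate, and 8 disks at 4096 entries comes to 128 MiB.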
>
>> I would try to tune your stripe cache size such that the kswapd?
>> processes go to sleep. Those are reading/writing swap. That won't help
>> your overall performance.
> Do you mean swapping as in swapping memory to disk? I don't think
> that is happening. I have 32 GB of swap space, but according to "free
> -k" only 48k of swap is being used, and that number never grows.
> Also, I don't have any of the classic telltale signs of disk-swapping,
> e.g. overall laggy system feel.
>
> Also, I re-set the stripe_cache_size back down to 256, and those
> kswapd processes continue to peg a couple CPUs. IOW,
> stripe_cache_size doesn't appear to have much effect on kswapd.
You can find out whether you are swapping with vmstat:
vmstat 5
Watch the swap columns (si and so); if they are non-zero, then you are
indeed swapping.
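If you would rather not eyeball it, something like the following could summarize a run. It is only a sketch: it finds the si/so columns from the header line and ignores the first sample, since vmstat's first row is an average since boot. The function name is made up.

```shell
# Sketch: read vmstat output on stdin and report whether any sample after the
# first shows swap-in (si) or swap-out (so) activity.
swap_activity() {
    awk '
    {
        header = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "si") { si = i; header = 1 }
            if ($i == "so") { so = i; header = 1 }
        }
        if (header || !si || !so) next      # skip banners and header rows
        if ($1 !~ /^[0-9]+$/) next          # skip anything that is not a sample
        if (++samples == 1) next            # first sample is the since-boot average
        if ($si + 0 > 0 || $so + 0 > 0) active = 1
    }
    END { print (active ? "swapping" : "no swap activity") }
    '
}

# Usage: vmstat 5 4 | swap_activity
```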
You might find that if there is insufficient memory, then the kernel
will automatically reduce/limit the value for the stripe_cache_size (I'm
only guessing, but my memory tells me that the kernel locks this memory
and it can't be swapped/etc).
>
> On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
>> 2. the state machine runs in a single thread, which is a bottleneck. try to
>> increase group_thread_cnt, which will make the handling multi-thread.
> For others' reference, this parameter is in
> /sys/block/<device>/md/stripe_cache_size.
>
> On this CentOS (RHEL) 7.2 server, the parameter defaults to 0. I set
> it to 4, and the degraded reads went up dramatically. Need to
> experiment with this (and all the other tunables) some more, but that
> change alone put me up to 2.5 GB/s read from the degraded array!
Did you mean group_thread_cnt which defaults to 0?
I don't recall the default for stripe_cache_size, but I'm pretty certain
it is not 0...
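To avoid mixing the two up, it may help to print both tunables side by side. A minimal sketch, assuming both files live under /sys/block/<device>/md/ (SYSFS_ROOT is a hypothetical override so it can be tried against a scratch directory). As far as I know, stripe_cache_size normally defaults to 256 and group_thread_cnt to 0.

```shell
# Sketch: print the two md tunables discussed here for one array.
# SYSFS_ROOT is a hypothetical override for testing; the default is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_md_tunables() {
    local md="$1" name    # md is the array name without /dev/, e.g. md0
    for name in stripe_cache_size group_thread_cnt; do
        echo "$name=$(cat "$SYSFS_ROOT/block/$md/md/$name")"
    done
}

# Usage: show_md_tunables md0
# To change one (as root): echo 4 > /sys/block/md0/md/group_thread_cnt
```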
Note, in your case, it might improve the "test read scenario", but since
your "live" scenario has a lot more CPU overhead, this option might
decrease overall results... Unfortunately, only testing with "live" load
will really provide the information you need to decide on this.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-25 22:32 UTC
To: Chris Murphy, Linux-RAID
In-Reply-To: <20160825062501.GN32250@subspacefield.org>
On Thu, Aug 25, 2016 at 12:25 AM,
<travis+ml-linux-raid@subspacefield.org> wrote:
> $ sudo mdadm -E /dev/sdd1
> /dev/sdd1:
> Magic : a92b4efc
> Version : 1.2
> Feature Map : 0x0
> Array UUID : <elided>
> Name : <elided>
> Creation Time : Wed Aug 10 11:33:41 2016
> Raid Level : raid0
> Raid Devices : 4
>
> Avail Dev Size : 7814035071 (3726.02 GiB 4000.79 GB)
> Data Offset : 16 sectors
> Super Offset : 8 sectors
> State : clean
> Device UUID : <elided>
>
> Update Time : Wed Aug 10 11:33:41 2016
> Checksum : 490b562f - correct
> Events : 0
>
> Chunk Size : 512K
>
> Device Role : Active device 0
> Array State : AAAA ('A' == active, '.' == missing)
I'm confused by Events: 0, even though I see the same thing with raid0
and linear arrays. As writes happen and the array is stopped and
started, this Events count does not increase. A parity-raid-only thing,
I guess?
Anyway, sdd1 has both an mdadm superblock on it, as shown above, and
it also has a GPT on it, as shown in your first message and below -
that's not good, but not unfixable. The mdadm superblock starts at
LBA 8, 4096 bytes from the start of that partition, so it's safe to
zero the first 4096 bytes. The GPT is mainly in the first three
sectors, so you could just write zeros for a count of 3, although it is
more complete to zero with count=8, on the partition, not the whole
device.
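A cautious way to do that zeroing might look like the sketch below: it first saves the 8 sectors to a backup file, then overwrites them. The function name and the backup-file argument are made up for illustration. Double-check that the target is the partition (e.g. /dev/sdX1) and not the whole disk, because this is destructive.

```shell
# Sketch: back up, then zero, the first 8 sectors (4096 bytes) of a partition.
# Both arguments are supplied by the caller; nothing here picks a device.
wipe_partition_head() {
    local part="$1" backup="$2"
    # Keep a copy of the 8 sectors about to be overwritten
    dd if="$part" of="$backup" bs=512 count=8 2>/dev/null || return 1
    # Zero 8 x 512 = 4096 bytes; conv=notrunc only matters for image files
    dd if=/dev/zero of="$part" bs=512 count=8 conv=notrunc 2>/dev/null
}

# Usage (as root): wipe_partition_head /dev/sdX1 /root/sdX1-head.bak
```

Trying it on a scratch image file first is a cheap way to confirm that nothing beyond byte 4095 is touched.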
>
> Here is what should be the same, only device 2 in the array
> (device 3 is similar or identical):
>
> $ sudo mdadm -E /dev/sdf1
> /dev/sdf1:
> MBR Magic : aa55
> Partition[0] : 4294967295 sectors at 1 (type ee)
Looks like the mdadm super block might have been stepped on by
something. You'd need to look for some evidence of it using something
like
dd if=/dev/sdf1 count=9 2>/dev/null | hexdump -C
If it's intact it should be at offset 0x1000, and again it's just a
matter of wiping the first 8 sectors of the partition, not the whole
device.
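Instead of reading the hexdump by eye, a check along these lines could look for the magic directly. A sketch, assuming a version 1.2 superblock at byte offset 4096 and the magic a92b4efc stored little-endian on disk (so the bytes read fc 4e 2b a9). The function name is made up.

```shell
# Sketch: succeed if the 4 bytes at offset 0x1000 are the md magic a92b4efc
# (stored little-endian, i.e. fc 4e 2b a9 on disk).
has_md_superblock() {
    local part="$1" magic
    # Skip one 4096-byte block, then dump the next 4 bytes as hex
    magic=$(dd if="$part" bs=4096 skip=1 count=1 2>/dev/null \
            | od -An -tx1 -N4 | tr -d ' \n')
    [ "$magic" = "fc4e2ba9" ]
}

# Usage: has_md_superblock /dev/sdf1 && echo "md superblock still present"
```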
> $ sudo mdadm -D /dev/sdf1
> mdadm: /dev/sdf1 does not appear to be an md device
You're getting the commands confused. -E applies to /dev/sdXY member
devices, and -D applies to /dev/mdX arrays.
>
> Sadly, I can't do a mdadm -D because I can't assemble the RAID.
> $ sudo mdadm -E /dev/md127
Again, wrong command, you should use -D for this.
> $
>
> The command history is gone, but I would imagine that the RAID was
> created with something like this:
>
> mdadm --create /dev/md/bu --level=0 --raid-devices=4 /dev/sd{b,c,d,e}1
>
> Although it could have been level=linear.
>
> To summarize my email:
> "Is this a known problem? If not, here is a bug report"
This is not a bug report. There are no steps to reproduce, and there's
no evidence of a bug. I'm not experiencing random replacement of mdadm
superblock data with MBR and GPT signatures. That's not really what
I'd expect of drive or enclosure firmware, which by design should be
partition agnostic, as there's more than one or two valid kinds of
partitioning. Plus, it'd be scary even if it picked the right one; it
could clobber a legitimate existing one.
So I'd say it's something else.
>> It's purely speculation, but it sounds like to me in the history of
>> one or more drives, the previous signatures weren't removed before the
>> drive was retasked for its new purpose. That's the folly of not wiping
>> the signatures in the reverse order they were created, and just
>> expecting that starting over will wipe those old signatures.
>
> It's possible, but why would you ever end up with a GPT in a partition?
In every case I've seen, it was user error. I haven't heard of things
putting GPTs in partitions, and in a sense I'd say it's a bug if any
utility lets a user do that. Nesting GPTs in partitions is a bad idea,
although it *should* be innocuous, because it shouldn't be seen/honored
by anything that doesn't go looking for it, since it doesn't belong
there.
>
> I've certainly encountered this "GPT outside cylinder 0" on these two
> drives before,
Keep in mind cylinders are gone, they don't exist anymore. Drives all
speak in LBAs now. *shrug* The GPT typically involves LBAs 0, 1 and 2
at least, more if there are more than 4 partitions.
> but it goes away with a forcible reassemble or recreate
> (which I did last time), because the mdlabel blows it away.
Umm, I think that only happens with -U, --update.
> Unless
> it's something this list knows about, I suspect it is a firmware
> glitch in the USB enclosure.
Doubtful.
>
>> But I think there is a legitimate gripe that parted probably should
>> not operate on partitions like this. It's not valid to have nested
>> GPTs like this. And I have no idea if parted is showing you valid or
>> bogus information. You'd need to do something like:
>>
>> dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C
>
> ## Good disk (for comparison):
> $ sudo dd if=/dev/sdd1 count=2 2> /dev/null | file -
> /dev/stdin: data
> $ sudo dd if=/dev/sdd1 count=2 2> /dev/null | hexdump -C | head -20
> 00000000 ff 02 19 2e 03 ee fa d8 6d d7 24 78 e1 d4 04 3d |........m.$x...=|
> 00000010 c9 92 33 97 17 7a 10 d3 05 bd 39 36 b4 a9 7c 14 |..3..z....96..|.|
> 00000020 a7 de 66 b6 cd d9 ff ef 45 27 74 6e 94 0a 03 49 |..f.....E'tn...I|
> 00000030 d4 43 26 2d 45 39 d1 93 8a 35 91 91 ff c9 a4 8e |.C&-E9...5......|
> 00000040 bd 9a 06 6d cc f2 89 65 c0 91 87 1c 1b f0 da 2f |...m...e......./|
> 00000050 83 c2 12 eb 80 3c c2 4c 68 cc 65 40 26 13 e0 77 |.....<.Lh.e@&..w|
> 00000060 38 15 ed 78 27 76 4c 91 71 99 3e 9f 99 f1 3f 51 |8..x'vL.q.>...?Q|
> 00000070 19 db 12 a3 ac b6 61 12 ff d9 37 87 31 1f 8b dd |......a...7.1...|
> 00000080 88 82 de fb db f2 a5 31 10 2a d2 03 be 12 be bd |.......1.*......|
> 00000090 19 46 9f c1 3b ea a1 37 81 d2 4d 00 54 e7 b4 55 |.F..;..7..M.T..U|
> 000000a0 b7 65 6c 3f 95 40 b0 f4 28 ff 90 62 22 cb 22 fd |.el?.@..(..b".".|
> 000000b0 6b 4d 90 56 32 4b c6 22 35 b1 62 76 e1 fd 82 d5 |kM.V2K."5.bv....|
> 000000c0 03 40 c0 85 4b ac 5a 44 9e 6a 25 97 d3 7f bd fe |.@..K.ZD.j%.....|
> 000000d0 0c 2d a8 bb 33 f4 00 df 7a 05 ae 6d b3 3e f3 7d |.-..3...z..m.>.}|
> 000000e0 34 9e 0e 57 14 de d8 e0 28 63 82 a6 2a 8a 1f fc |4..W....(c..*...|
> 000000f0 fe 2f b0 69 67 ac 0a e9 c2 53 a7 d8 36 1a 18 5a |./.ig....S..6..Z|
> 00000100 d6 d4 e6 ce df f7 fc 67 13 eb 25 08 45 50 10 7b |.......g..%.EP.{|
> 00000110 c6 23 1e 59 dc 2d c2 65 53 90 ca ec 21 e7 28 74 |.#.Y.-.eS...!.(t|
> 00000120 41 7f 3e 58 72 08 75 c1 d5 ca d0 91 55 5f 43 6a |A.>Xr.u.....U_Cj|
> 00000130 4e 84 d5 7f aa f2 b5 27 e4 86 5d 28 ae 6c 29 a1 |N......'..](.l).|
OK I don't know why you used head, I needed to see past offset 0x130.
Offset lines 0x1f0 and 0x200 have the MBR and GPT signatures, so the
above doesn't really tell me anything.
I don't recognize the above stuff, so I'm not sure what it is. I'd
usually expect it to be zeros if it's not a boot drive.
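A quicker test than paging through a full dump would be to look only at those two spots. The sketch below checks for the 55 aa boot signature at byte 510 and the "EFI PART" string at byte 512. The function name is made up, and it assumes 512-byte logical sectors.

```shell
# Sketch: succeed if a device or image starts with an MBR boot signature
# (55 aa at byte 510) followed by a GPT header signature ("EFI PART" at 512).
has_pmbr_and_gpt() {
    local dev="$1" mbr_sig gpt_sig
    mbr_sig=$(od -An -tx1 -j 510 -N 2 "$dev" | tr -d ' \n')
    gpt_sig=$(od -An -tx1 -j 512 -N 8 "$dev" | tr -d ' \n')
    # 45 46 49 20 50 41 52 54 is ASCII for "EFI PART"
    [ "$mbr_sig" = "55aa" ] && [ "$gpt_sig" = "4546492050415254" ]
}

# Usage: has_pmbr_and_gpt /dev/sdd1 && echo "nested PMBR + GPT found"
```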
>
> ## Bad disk:
> $ sudo dd if=/dev/sdf1 count=2 2> /dev/null | file -
> /dev/stdin: x86 boot sector; partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, code offset 0x6f
> $ sudo dd if=/dev/sdf1 count=2 2> /dev/null | hexdump -C
> 00000000 38 6f 96 52 ea 9c 31 cd 10 a2 84 58 a2 f0 f5 43 |8o.R..1....X...C|
> 00000010 0f f2 5a 9b c7 ff 82 b2 d8 59 86 60 15 bc 31 65 |..Z......Y.`..1e|
> 00000020 bc d7 77 f9 31 6a c8 16 3f 13 90 24 b7 57 ff 6b |..w.1j..?..$.W.k|
> 00000030 64 7e e2 99 2a 99 f7 32 69 be aa 56 36 31 f7 db |d~..*..2i..V61..|
> 00000040 8c 4c 4c 12 68 19 77 0f f6 3b 92 bf 18 92 c2 45 |.LL.h.w..;.....E|
> 00000050 73 d5 b7 93 cc ae 6b b9 b0 bd 0c 85 a9 c3 19 f7 |s.....k.........|
> 00000060 87 34 b8 be 0a 95 cd 03 03 d5 01 49 b5 b0 86 fe |.4.........I....|
> 00000070 71 1c d2 f6 42 ed ce b0 eb c3 5f 4c 07 34 30 c7 |q...B....._L.40.|
> 00000080 8a 1f 91 c4 8b 28 b9 07 8e da ae 7d 7d c5 24 2b |.....(.....}}.$+|
> 00000090 6d f9 ea a3 6a 83 9d b8 6a 1f 6d db 3a 01 22 c7 |m...j...j.m.:.".|
> 000000a0 56 fc 2a 46 f8 b2 84 31 d1 8b 58 55 b6 5a 36 7b |V.*F...1..XU.Z6{|
> 000000b0 48 5d 98 2a 3f f0 ae 80 2b f8 6b b2 7f 1e 27 c2 |H].*?...+.k...'.|
> 000000c0 59 65 d0 bf c7 f0 5b 18 dc 59 8e 68 46 03 b6 ca |Ye....[..Y.hF...|
> 000000d0 42 06 7a 52 7a 49 36 03 0d d5 9b 67 a2 03 3b 13 |B.zRzI6....g..;.|
> 000000e0 40 23 19 f5 1a a6 bd fb c8 d5 5b 26 f5 6a 86 ab |@#........[&.j..|
> 000000f0 89 77 98 d8 09 cb b7 59 80 03 81 48 ba c6 ce 77 |.w.....Y...H...w|
> 00000100 3c 6c d2 ba a0 71 c3 20 18 fd 77 db ca a8 8a e3 |<l...q. ..w.....|
> 00000110 8d 6c 1f 17 d5 9f e5 81 bf 50 62 c3 bc f8 6c 5d |.l.......Pb...l]|
> 00000120 f7 3f a6 37 6b a9 53 2b 88 15 5d 6e 1e 48 4f b4 |.?.7k.S+..]n.HO.|
> 00000130 db af b4 f7 f5 7b 4d f3 3f 60 44 60 6e a2 c4 6d |.....{M.?`D`n..m|
> 00000140 b9 6c 88 04 e8 66 d1 7c a0 09 10 66 32 de 70 e1 |.l...f.|...f2.p.|
> 00000150 98 40 54 5e 1d f2 af b8 2e d1 75 0d 3c 46 1f f8 |.@T^......u.<F..|
> 00000160 85 72 49 87 ad 92 59 28 fd 9d 22 8e 1b 9f 2c 00 |.rI...Y(.."...,.|
> 00000170 87 58 74 01 63 a5 94 13 e3 9c ea ec 3f 21 22 41 |.Xt.c.......?!"A|
> 00000180 05 13 78 f3 a8 46 b3 02 9e 23 cb 9d 21 db a6 ae |..x..F...#..!...|
> 00000190 08 a8 70 48 18 6c e2 38 e4 ac 03 6e 06 74 17 7c |..pH.l.8...n.t.||
> 000001a0 90 ca 9f 5e 2e 2b 84 ef 52 2c 08 9a 48 98 f9 46 |...^.+..R,..H..F|
> 000001b0 f4 9f 00 cd ec a0 11 d7 00 00 00 00 00 00 00 00 |................|
> 000001c0 02 00 ee ff ff ff 01 00 00 00 ff ff ff ff 00 00 |................|
> 000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
> 00000200 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
> 00000210 3a dc 43 c4 00 00 00 00 01 00 00 00 00 00 00 00 |:.C.............|
> 00000220 8e b6 c0 d1 01 00 00 00 22 00 00 00 00 00 00 00 |........".......|
> 00000230 6d b6 c0 d1 01 00 00 00 a5 4f bd 75 f6 c8 4f 43 |m........O.u..OC|
> 00000240 92 31 ab b6 a9 59 aa 04 02 00 00 00 00 00 00 00 |.1...Y..........|
> 00000250 80 00 00 00 80 00 00 00 59 04 3d 4a 00 00 00 00 |........Y.=J....|
> 00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
OK it does in fact have a PMBR and GPT in the 1st and 2nd sector of
this partition. Pretty weird how it got there. There is a UUID
starting at offset 0x238 so you can look around and see if anything
else has that UUID or if that UUID ever changed or comes back after
you fix this. If it's not the same UUID, something is creating it with
a random UUID each time, which would mean it's not just being copied
from somewhere.
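
For anyone who wants to track that UUID across rebuilds, here is a
minimal Python sketch (not part of the original exchange) that pulls the
disk GUID out of a buffer holding LBA 0 and LBA 1. It assumes 512-byte
logical sectors; the function name gpt_disk_guid and the constant names
are made up for illustration. The header bytes are copied from the
quoted dump above.

```python
import uuid

GPT_SIGNATURE = b"EFI PART"
DISK_GUID_OFFSET = 56  # 0x38 within the GPT header; 0x200 + 0x38 = 0x238


def gpt_disk_guid(first_sectors: bytes, sector_size: int = 512) -> uuid.UUID:
    """Return the disk GUID from a buffer holding LBA 0 and LBA 1."""
    header = first_sectors[sector_size:sector_size * 2]
    if header[:8] != GPT_SIGNATURE:
        raise ValueError("no GPT header signature at LBA 1")
    raw = header[DISK_GUID_OFFSET:DISK_GUID_OFFSET + 16]
    # GPT stores the first three GUID fields little-endian.
    return uuid.UUID(bytes_le=raw)


# Header bytes copied from the quoted dump (offsets 0x200 through 0x24f).
QUOTED_HEADER = bytes.fromhex(
    "4546492050415254000001005c000000"
    "3adc43c4000000000100000000000000"
    "8eb6c0d1010000002200000000000000"
    "6db6c0d101000000a54fbd75f6c84f43"
    "9231abb6a959aa040200000000000000"
)

if __name__ == "__main__":
    # LBA 0 content does not matter for this check, so it is zero-filled.
    image = bytes(512) + QUOTED_HEADER.ljust(512, b"\0")
    print(gpt_disk_guid(image))
```

On a live system the buffer would come from reading the first 1024
bytes of the partition (as root) instead of the hard-coded bytes.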
>
> ## is that the same as the boot sector itself? Interesting q.
> # dd if=/dev/sdd count=2 of=/tmp/foo && dd if=/dev/sdd1 count=2 of=/tmp/bar && cmp /tmp/foo /tmp/bar
> ## Nope, how do they differ? Well that's a bit unpleasant to do manually but here...
> # dd if=/dev/sdd count=2 2> /dev/null | hexdump -C
> 00000000 10 06 27 48 33 df bb 55 8b 28 fe 60 5e 18 6d 38 |..'H3..U.(.`^.m8|
> 00000010 fc b3 17 36 55 de fd 83 d0 52 72 19 d0 76 12 f0 |...6U....Rr..v..|
> 00000020 1e 23 bc 4d c5 4d c2 d6 5a d4 2b cd 16 78 c9 28 |.#.M.M..Z.+..x.(|
> 00000030 77 21 c4 9f c4 b7 48 ad e0 7b 08 d6 f5 8e 92 a7 |w!....H..{......|
> 00000040 bc 88 35 02 e7 f8 b8 3b 05 97 db a3 ad e7 96 4b |..5....;.......K|
> 00000050 84 d9 e2 a4 3a 5a 07 ac fc a2 78 58 d7 c8 5a 19 |....:Z....xX..Z.|
> 00000060 88 9c f6 f2 c0 ec 99 55 d9 5d 00 87 3a 86 52 01 |.......U.]..:.R.|
> 00000070 92 58 25 82 99 50 8e 28 0f 42 07 71 9a a3 db 82 |.X%..P.(.B.q....|
> 00000080 00 d9 b8 28 9d d8 97 85 9d c6 fb 5e 4d 94 3a 6e |...(.......^M.:n|
> 00000090 19 3c a6 ce 57 6b a0 52 d6 72 0c 41 2e cd cb a2 |.<..Wk.R.r.A....|
> 000000a0 15 c8 d4 c8 8c 90 34 5f 15 ab 69 96 af 3d 7e 30 |......4_..i..=~0|
> 000000b0 25 e1 72 35 d6 c4 b2 5e 78 72 0b 3f 9a 96 40 7e |%.r5...^xr.?..@~|
> 000000c0 c6 aa 0e 5a da 99 ae fe a3 93 8b 5b c4 bf 91 64 |...Z.......[...d|
> 000000d0 d5 62 12 ea 70 15 a9 05 81 8d e4 fb 36 15 c9 63 |.b..p.......6..c|
> 000000e0 ba f9 d2 5c f6 df 28 71 d8 d5 82 95 2b 83 40 db |...\..(q....+.@.|
> 000000f0 9b fe e2 a7 9b 38 5e 5f 51 a6 6e e6 7b 4e bf 02 |.....8^_Q.n.{N..|
> 00000100 d2 fb aa f9 2c 7a 5b f5 47 ad ac 7e d1 1c f3 1b |....,z[.G..~....|
> 00000110 a3 8e 54 9f a4 8d 1a 02 3f cc 81 f0 ca e9 28 1e |..T.....?.....(.|
> 00000120 33 9e d8 71 dd f2 aa b7 d4 06 96 cb 0c 8e f1 6a |3..q...........j|
> 00000130 88 1d 2a 8a a3 33 00 8c ef d4 d8 39 3e 70 18 34 |..*..3.....9>p.4|
> 00000140 e6 3a cd e7 0b d6 82 a8 a4 aa ff bd b3 69 0a cc |.:...........i..|
> 00000150 32 9e e3 26 34 bb cc 0e b0 69 5f 9a c5 f3 57 7d |2..&4....i_...W}|
> 00000160 47 82 bc 66 44 55 c4 de 3c 2c 14 d0 9a 73 6a da |G..fDU..<,...sj.|
> 00000170 3c 5e f8 99 26 5b f4 8a 13 a1 f1 c8 a9 20 4c 3a |<^..&[....... L:|
> 00000180 bd 03 4e e9 83 25 46 32 3f 80 3e 42 58 e7 18 27 |..N..%F2?.>BX..'|
> 00000190 8a c8 7c 8c 74 99 96 61 d4 e2 58 c2 27 71 8c 3b |..|.t..a..X.'q.;|
> 000001a0 da 33 f8 7f b5 c1 a7 a0 c2 7b 54 29 0d 47 b4 b5 |.3.......{T).G..|
> 000001b0 4c 62 5b f8 e9 6f bc 29 00 00 00 00 00 00 00 00 |Lb[..o.)........|
> 000001c0 02 00 ee ff ff ff 01 00 00 00 ff ff ff ff 00 00 |................|
> 000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
> 00000200 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
> 00000210 62 01 85 1f 00 00 00 00 01 00 00 00 00 00 00 00 |b...............|
> 00000220 af be c0 d1 01 00 00 00 22 00 00 00 00 00 00 00 |........".......|
> 00000230 8e be c0 d1 01 00 00 00 e2 89 58 78 77 63 52 44 |..........XxwcRD|
> 00000240 93 9e 4a 93 16 06 86 6b 02 00 00 00 00 00 00 00 |..J....k........|
> 00000250 80 00 00 00 80 00 00 00 5d ff 7e 02 00 00 00 00 |........].~.....|
> 00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
We kinda expect sdd to have a valid PMBR and GPT though... so that's
sane. I just don't know what to make of the stuff in LBA 0 before the
PMBR.
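
To check the PMBR part mechanically, a small Python sketch along these
lines decodes the first MBR partition entry and the boot signature. It
says nothing about the bytes before offset 0x1b8; it only confirms that
what follows looks like a protective MBR. The function name decode_pmbr
is hypothetical, and the entry bytes are the ones visible at 0x1be in
the dumps above.

```python
import struct


def decode_pmbr(sector0: bytes) -> dict:
    """Decode the first MBR partition entry and the boot signature of LBA 0."""
    if len(sector0) < 512:
        raise ValueError("need at least 512 bytes")
    # Entry layout at 0x1be: status, CHS start (3 bytes, skipped), type,
    # CHS end (3 bytes, skipped), start LBA (u32 LE), sector count (u32 LE).
    status, ptype, start_lba, sectors = struct.unpack_from(
        "<B3xB3xII", sector0, 0x1BE
    )
    return {
        "status": status,
        "type": ptype,
        "start_lba": start_lba,
        "sectors": sectors,
        "boot_signature_ok": sector0[0x1FE:0x200] == b"\x55\xaa",
        "is_protective": ptype == 0xEE and start_lba == 1,
    }


def sample_sector() -> bytes:
    """Build a 512-byte sector carrying the entry seen in the dumps."""
    sector = bytearray(512)
    sector[0x1BE:0x1CE] = bytes.fromhex("00000200eeffffff01000000ffffffff")
    sector[0x1FE:0x200] = b"\x55\xaa"
    return bytes(sector)


if __name__ == "__main__":
    print(decode_pmbr(sample_sector()))
```

The decoded values (type 0xee, start LBA 1, 4294967295 sectors) match
what mdadm -E reported for the bad partition elsewhere in this thread.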
> I understand and can probably acquire the most recent stable and
> compile from source, if you think that would prove useful enough to
> justify the effort. TBH once GPT came out I lost track of which
> partitioning tool was appropriate to use, it seemed like (IIRC)
> cfdisk, sfdisk, parted were all vying for my attention... is parted
> now the standard?
It is common. I prefer gdisk, which has a nomenclature similar to
fdisk. The nomenclature of parted is confusing.
>
> At the current moment I am backing up the drives so that I can try a
> forcible reassemble. I think that last time this happened, that
> effectively relabeled the mdraid partitions and fixed the problem.
> The underlying mdraid has an LVM on LUKS, but last time this happened
> I managed to fsck and get 99% of the data back, with only a few things
> ending up in lost+found. Presumably there might have been some data
> corruption, but since it's a backup server only I consider it
> tolerable, modulo the failed Windows system which needs to restore
> from it.
FWIW, if you want either linear or raid0, a much simpler layout is
probably this: blow away all four drives with hdparm and ATA security
erase to get rid of all signatures, then make all of them into LVM
physical volumes without any partitioning, then make a logical volume
(linear/concat by default, or raid0 if you choose; this is a per
logical volume characteristic), then encrypt the LV, and finally format
the LUKS volume. There's no advantage to adding either partitions or
mdadm RAIDs if you're going to use LVM anyway and this is a Linux-only
storage enclosure.
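
As a rough illustration of that ordering, the Python sketch below only
builds the command list for a dry run; it does not execute anything.
The volume group name, logical volume name, mapper name and the choice
of ext4 are all assumptions for the example, and the ATA security erase
step is deliberately left out because it is destructive and
drive-specific.

```python
def plan_layout(devices, vg="backup_vg", lv="backup_lv", striped=False):
    """Return the commands for whole-disk PVs -> VG -> LV -> LUKS -> fs."""
    cmds = [["pvcreate", dev] for dev in devices]
    cmds.append(["vgcreate", vg, *devices])
    lvcreate = ["lvcreate", "-n", lv, "-l", "100%FREE"]
    if striped:
        # -i sets the number of stripes; without it the LV is linear.
        lvcreate += ["-i", str(len(devices))]
    lvcreate.append(vg)
    cmds.append(lvcreate)
    lv_path = f"/dev/{vg}/{lv}"
    mapper_name = f"{lv}_crypt"
    cmds.append(["cryptsetup", "luksFormat", lv_path])
    cmds.append(["cryptsetup", "open", lv_path, mapper_name])
    cmds.append(["mkfs.ext4", f"/dev/mapper/{mapper_name}"])
    return cmds


if __name__ == "__main__":
    # Device names are placeholders; review every line before running it.
    for cmd in plan_layout(["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]):
        print(" ".join(cmd))
```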
--
Chris Murphy
* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Wols Lists @ 2016-08-25 21:06 UTC (permalink / raw)
To: Linux-RAID, travis+ml-linux-raid
In-Reply-To: <20160825062501.GN32250@subspacefield.org>
On 25/08/16 07:25, travis+ml-linux-raid@subspacefield.org wrote:
> I understand and can probably acquire the most recent stable and
> compile from source, if you think that would prove useful enough to
> justify the effort. TBH once GPT came out I lost track of which
> partitioning tool was appropriate to use, it seemed like (IIRC)
> cfdisk, sfdisk, parted were all vying for my attention... is parted
> now the standard?
To add to the fun, I use gdisk (or is it gfdisk?).
Like so many things gnu, when I looked at parted I ran away screaming
from the feature overkill ... :-)
Cheers,
Wol
* Re: [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Jes Sorensen @ 2016-08-25 17:45 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <20160824161044.20887-1-robert@leblancnet.us>
Robert LeBlanc <robert@leblancnet.us> writes:
> Linux allows for 32 character device names. When using the maximum size device name and also
> storing "/dev/", devname needs to be 37 character long to store the complete device name.
> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>
> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
> ---
> mdopen.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Looks good - I corrected your comment to fit into a proper editor width
of 80 characters, and also fixed up the SOB since it needs to say
Signed-off-by rather than signed-off.
Applied!
Jes
* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: Shaohua Li @ 2016-08-25 17:17 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87bn0hfnq6.fsf@notabene.neil.brown.name>
On Thu, Aug 25, 2016 at 02:59:13PM +1000, Neil Brown wrote:
> On Wed, Aug 24 2016, Shaohua Li wrote:
>
> > On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
> >> On Wed, Aug 17 2016, Shaohua Li wrote:
> >> >> >
> >> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
> >> >> > thread. As stopping the thread will wait for the thread, which probably is
> >> >> > doing r5l_do_reclaim and writting the superblock. Since we are writting the
> >> >> > superblock, we must hold the reconfig_mutex.
> >> >>
> >> >> When you say "writing the superblock" you presumably mean "blocked in
> >> >> r5l_write_super_and_discard_space(), waiting for MD_CHANGE_PENDING to
> >> >> be cleared" ??
> >> > right
> >> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
> >> >> ->quiesce to be set, and then exit gracefully.
> >> >
> >> > Can you give details about this please? .quiesce is called with reconfig_mutex
> >> > hold, so the MD_CHANGE_PENDING will never get cleared.
> >>
> >> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
> >>
> >> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
> >>
> >> But the reclaim thread might be in
> >> r5l_do_reclaim() -> r5l_write_super_and_discard_space()
> >> waiting for MD_CHANGE_PENDING to clear. That will only get cleared when
> >> the main thread can get the reconfig_mutex, which the thread calling
> >> raid5_quiesce() might hold. So we get a deadlock.
> >>
> >> My suggestion is to change r5l_write_super_and_discard_space() so that
> >> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
> >> to be set. That will avoid the deadlock.
> >>
> >> Whatever thread called raid5_quiesce() will now be in control of the
> >> array without any async IO going on. If it needs the metadata to be
> >> sync, it can do that itself. If not, then it doesn't really matter that
> >> r5l_write_super_and_discard_space() didn't wait.
> >
> > I'm afraid waiting conf->quiesce set isn't safe. The reason to wait for
> > superblock write isn't because of async IO. discard could zero data, so before
> > we do discard, we must make sure superblock points to correct log tail,
> > otherwise recovery will not work. This is the reason we wait for superblock
> > write.
> >
> >> r5l_write_super_and_discard_space() shouldn't call discard if the
> >> superblock write didn't complete, and probably r5l_do_reclaim()
> >> shouldn't update last_checkpoint and last_cp_seq in that case.
> >> This is what I mean by "with a bit of care" and "exit gracefully".
> >> Maybe I should have said "abort cleanly". The goal is to get the thread
> >> to exit. It doesn't need to complete what it was doing, it just needs
> >> to make sure that it leaves things in a tidy state so that when it
> >> starts up again, it can pick up where it left off.
> >
> > Agree, we could ignore discard sometime, which happens occasionally, so impact
> > is little. I tested something like below recently. Assume this is the solution
> > we agree on?
>
> Yes, this definitely looks like it is heading in the right direction.
>
> I thought that
>
> > - set_mask_bits(&mddev->flags, 0,
> > - BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
> > - md_wakeup_thread(mddev->thread);
>
> would still be there in the case that the lock cannot be claimed.
yep, this makes sense.
> You could even record the ->events value before setting the flags,
> and record the range that needs to be discarded. Next time
> r5l_do_reclaim is entered, if ->events has moved on, then it should be
> safe to discard the recorded range. Maybe.
I thought of something like this too, but it looks like more work is needed to
make this happen. We updated the log, so the range could be reused soon. And if
the raid array is being stopped, we don't get a chance to re-enter reclaim,
which I believe is the most common case where the lock can't be taken. Missing
a discard isn't a big issue, especially since it happens rarely. I'm going to
commit the patch below if there are no objections.
Thanks,
Shaohua
commit 93e297c0b152667cc4a17db6fe7360dab7e3e9d5
Author: Shaohua Li <shli@fb.com>
Date: Thu Aug 25 10:09:39 2016 -0700
raid5-cache: fix a deadlock in superblock write
There is a potential deadlock in superblock write. Discard could zero data, so
before discard we must make sure superblock is updated to new log tail.
Updating superblock (either directly call md_update_sb() or depend on md
thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called
with reconfig_mutex held. The first step of raid5_quiesce() is waiting for all
IO to finish, hence waiting for reclaim thread, while reclaim thread is calling
this function and waiting for reconfig mutex. So there is a deadlock. We
work around this issue with a trylock. The downside of the solution is we could
miss discard if we can't take reconfig mutex. But this should happen rarely
(mainly in raid array stop), so a missed discard shouldn't be a big problem.
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5504ce2..2b0589f 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -96,7 +96,6 @@ struct r5l_log {
spinlock_t no_space_stripes_lock;
bool need_cache_flush;
- bool in_teardown;
};
/*
@@ -704,31 +703,22 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
mddev = log->rdev->mddev;
/*
- * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
- * wait for this thread to finish. This thread waits for
- * MD_CHANGE_PENDING clear, which is supposed to be done in
- * md_check_recovery(). md_check_recovery() tries to get
- * reconfig_mutex. Since r5l_quiesce already holds the mutex,
- * md_check_recovery() fails, so the PENDING never get cleared. The
- * in_teardown check workaround this issue.
+ * Discard could zero data, so before discard we must make sure
+ * superblock is updated to new log tail. Updating superblock (either
+ * directly call md_update_sb() or depend on md thread) must hold
+ * reconfig mutex. On the other hand, raid5_quiesce is called with
+ * reconfig_mutex hold. The first step of raid5_quiesce() is waitting
+ * for all IO finish, hence waitting for reclaim thread, while reclaim
+ * thread is calling this function and waitting for reconfig mutex. So
+ * there is a deadlock. We workaround this issue with a trylock.
+ * FIXME: we could miss discard if we can't take reconfig mutex
*/
- if (!log->in_teardown) {
- set_mask_bits(&mddev->flags, 0,
- BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
- md_wakeup_thread(mddev->thread);
- wait_event(mddev->sb_wait,
- !test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
- log->in_teardown);
- /*
- * r5l_quiesce could run after in_teardown check and hold
- * mutex first. Superblock might get updated twice.
- */
- if (log->in_teardown)
- md_update_sb(mddev, 1);
- } else {
- WARN_ON(!mddev_is_locked(mddev));
- md_update_sb(mddev, 1);
- }
+ set_mask_bits(&mddev->flags, 0,
+ BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+ if (!mddev_trylock(mddev))
+ return;
+ md_update_sb(mddev, 1);
+ mddev_unlock(mddev);
/* discard IO error really doesn't matter, ignore it */
if (log->last_checkpoint < end) {
@@ -827,7 +817,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
if (!log || state == 2)
return;
if (state == 0) {
- log->in_teardown = 0;
/*
* This is a special case for hotadd. In suspend, the array has
* no journal. In resume, journal is initialized as well as the
@@ -838,11 +827,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
log->rdev->mddev, "reclaim");
} else if (state == 1) {
- /*
- * at this point all stripes are finished, so io_unit is at
- * least in STRIPE_END state
- */
- log->in_teardown = 1;
/* make sure r5l_write_super_and_discard_space exits */
mddev = log->rdev->mddev;
wake_up(&mddev->sb_wait);
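
The trylock idea in the patch above can be shown outside the kernel.
The following is a user-space Python analogy, not kernel code: a
threading.Lock stands in for reconfig_mutex, and the state dictionary
and its keys are hypothetical stand-ins for the superblock and discard
bookkeeping. The point is only that the reclaim path never blocks on a
lock that the quiescing thread may already hold, and that the discard is
skipped whenever the superblock could not be written first.

```python
import threading

reconfig_mutex = threading.Lock()  # stands in for mddev's reconfig_mutex


def write_super_and_discard(state: dict) -> bool:
    """Return True if the superblock was written and the discard issued."""
    state["change_pending"] = True  # loosely analogous to MD_CHANGE_PENDING
    # Non-blocking attempt: never wait on a lock the quiescing side may hold.
    if not reconfig_mutex.acquire(blocking=False):
        return False  # skip the discard; the new log tail is not on disk yet
    try:
        state["superblock_writes"] += 1  # loosely analogous to md_update_sb()
        state["change_pending"] = False
    finally:
        reconfig_mutex.release()
    state["discards"] += 1  # only reached after the superblock update
    return True


if __name__ == "__main__":
    state = {"change_pending": False, "superblock_writes": 0, "discards": 0}
    with reconfig_mutex:  # the "quiesce" side holds the mutex
        print("while locked:", write_super_and_discard(state))
    print("after unlock:", write_super_and_discard(state))
    print(state)
```

A blocking acquire in the same place would hang in the first call of
the demo, which mirrors the deadlock described in the commit message.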
* Re: kernel checksumming performance vs actual raid device performance
From: Matt Garman @ 2016-08-25 15:07 UTC (permalink / raw)
To: Shaohua Li; +Cc: Mdadm
In-Reply-To: <20160824010241.GC57645@kernel.org>
Note: again I consolidated several previous posts into one for inline replies...
On Tue, Aug 23, 2016 at 2:41 PM, Doug Dumitru <doug@easyco.com> wrote:
> So you are up at 1GB/sec, which is only 1/4 the degraded speed, but
> 1/2 the expected speed based on drive data transfers required. This
> is actually pretty good.
I get 8 GB/sec non-degraded. So I'd say I'm still at only 1/8 of
non-degraded speed, and about 1/4 of what I expect in the degraded state.
I.e., I expect 4 GB/sec degraded. However, based on what I'm
reading in this thread, maybe I can't do any better? But
group_thread_cnt might save the day...
> If you need this to go faster, then it is either a raid re-design, or
> perhaps you should consider cutting your array into two parts. Two 12
> drives raid-6 arrays will give you more bandwidth both because the
> failures are less "wide", so a single drive will only do 11 reads
> instead of 22. Plus you get the benefit of two raid-6 threads should
> you have dead drives on both halves. You can raid-0 the arrays
> together. Then again, you lose two drives worth of space.
Yes, that's on the list to test. Actually we'll try three 8-disk
raid-5s striped into one big raid0. That only loses one drive's worth
of space (compared to a single 24-disk raid6). Space is at a premium
here, as we're really needing to build this system with 4 TB drives.
The loss of resiliency using raid5 instead of raid6 "shouldn't" be an
issue here. The design is to deliberately over-provision these
servers so that we have one more than we need. Then in case of
failure (or major degradation) of a single server, we can migrate
clients to the other ones.
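
The space trade-off between these layouts is simple arithmetic. A quick
Python sketch, assuming 24 drives of 4 TB each and counting only parity
overhead (the function and dictionary names are made up for this
example):

```python
def data_drives(arrays: int, drives_per_array: int, parity_per_array: int) -> int:
    """Number of drives' worth of usable space across striped arrays."""
    return arrays * (drives_per_array - parity_per_array)


DRIVE_TB = 4  # assumed drive size, per the 4 TB drives mentioned above

LAYOUTS = {
    "1 x 24-disk raid6": data_drives(1, 24, 2),
    "2 x 12-disk raid6": data_drives(2, 12, 2),
    "3 x 8-disk raid5": data_drives(3, 8, 1),
}

if __name__ == "__main__":
    for name, drives in LAYOUTS.items():
        print(f"{name}: {drives} data drives, {drives * DRIVE_TB} TB usable")
```

This gives 22, 20 and 21 data drives respectively, so the three-way
raid5 layout gives up one drive of space relative to the single raid6,
and the two-way raid6 layout gives up two.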
On Tue, Aug 23, 2016 at 3:15 PM, Doug Ledford <dledford@redhat.com> wrote:
> OK, 50 sequential I/Os at a time. Good point to know.
Note that's just the test workload. The real workload has literally
*thousands* of sequential reads at once. However, those thousands of
reads aren't reading at full speed like dd of=/dev/null. In the real
workload, after a chunk of data is read, some computations are done.
IOW, when the storage backend is working optimally, the read processes
are CPU bound. But it's extremely hard to accurately generate this
kind of test workload, so we have fewer reader threads (50 in this
case), but they are pure read-as-fast-as-we-can jobs, as opposed to
read-and-compute.
> Your raid device has a good chunk size for your usage pattern. If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently. But, then again, maybe I'm wrong and that
> would help. With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.
For some reason I thought we had a 64k chunk size, which I believe is
the mdadm default? But, you're right, it is indeed 512k. I will try
to experiment with different chunk sizes, as my Internet-research
suggests that's a very application-dependent setting; I can't seem to
find any rules of thumb as to what our ideal chunk size might be for
this particular workload. My intuition says bigger is better, since
we're dealing with sequential reads of generally large-ish files.
> Makes sense. I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider it's memory lost. When you upped it to 64k, and
> you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
> allowed stripes which is a maximum memory consumption of around 700GB
> RAM. I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using. That also explains why setting it higher doesn't
> provide any additional benefits ;-).
Do you think more RAM might be beneficial then?
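
For reference, the arithmetic behind the quoted estimate can be
reproduced in a few lines of Python. This only restates the figures
given in the quote (22 disks, 512k chunk, 65536 stripes); it is not a
statement about how the kernel actually accounts for stripe cache
memory, and the function name is made up.

```python
KIB = 1024
GIB = 1024 ** 3


def quoted_estimate_bytes(disks: int, chunk_kib: int, stripes: int) -> int:
    """Bytes implied by 'disks x chunk per stripe, times allowed stripes'."""
    per_stripe = disks * chunk_kib * KIB
    return per_stripe * stripes


if __name__ == "__main__":
    total = quoted_estimate_bytes(disks=22, chunk_kib=512, stripes=65536)
    print(f"per stripe: {22 * 512 // 1024} MiB, total: {total // GIB} GiB")
```

That is 11 MiB per stripe and 704 GiB in total, in line with the
"around 700GB" in the quote.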
> The math fits. Most quad channel Intel CPUs have memory bandwidths in
> the 50GByte/s range theoretical maximum, but it's not bidirectional,
> it's not even multi-access, so you have to remember that the usage looks
> like this on a good read:
I'll have to re-read your explanation a few more times to fully grasp
it, but thank you for that!
For what it's worth, this is a NUMA system: two E5-2620v3 CPUs. More
cores, but I understand the complexities added by memory controller
and PCIe node locality.
>> My colleague tested that exact same config with hardware raid5, and
>> striped the three raid5 arrays together with software raid1.
>
> That's a huge waste, are you sure he didn't use raid0 for the stripe?
Sorry, typo, that was raid0 indeed.
> I would try to tune your stripe cache size such that the kswapd?
> processes go to sleep. Those are reading/writing swap. That won't help
> your overall performance.
Do you mean swapping as in swapping memory to disk? I don't think
that is happening. I have 32 GB of swap space, but according to "free
-k" only 48k of swap is being used, and that number never grows.
Also, I don't have any of the classic telltale signs of disk-swapping,
e.g. overall laggy system feel.
Also, I re-set the stripe_cache_size back down to 256, and those
kswapd processes continue to peg a couple CPUs. IOW,
stripe_cache_size doesn't appear to have much effect on kswapd.
On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
> 2. the state machine runs in a single thread, which is a bottleneck. try to
> increase group_thread_cnt, which will make the handling multi-thread.
For others' reference, this parameter is in
/sys/block/<device>/md/group_thread_cnt.
On this CentOS (RHEL) 7.2 server, the parameter defaults to 0. I set
it to 4, and the degraded reads went up dramatically. Need to
experiment with this (and all the other tunables) some more, but that
change alone put me up to 2.5 GB/s read from the degraded array!
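
For anyone scripting these experiments, here is a small Python sketch
for reading and writing md tunables such as group_thread_cnt under
/sys/block/<device>/md/. The helper names and the sysfs_root parameter
are made up for this example (the parameter exists so the helpers can
be pointed at a scratch directory); writing to the real sysfs files
needs root, and "md0" is only a placeholder device name.

```python
from pathlib import Path


def md_tunable_path(device: str, name: str, sysfs_root: str = "/sys/block") -> Path:
    """Path of an md tunable, e.g. /sys/block/md0/md/group_thread_cnt."""
    return Path(sysfs_root) / device / "md" / name


def read_md_tunable(device: str, name: str, sysfs_root: str = "/sys/block") -> str:
    return md_tunable_path(device, name, sysfs_root).read_text().strip()


def write_md_tunable(device: str, name: str, value, sysfs_root: str = "/sys/block") -> None:
    md_tunable_path(device, name, sysfs_root).write_text(f"{value}\n")


if __name__ == "__main__":
    path = md_tunable_path("md0", "group_thread_cnt")
    if path.exists():
        print("group_thread_cnt =", read_md_tunable("md0", "group_thread_cnt"))
        # To change it (as root): write_md_tunable("md0", "group_thread_cnt", 4)
    else:
        print(f"{path} not found; adjust the device name")
```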
Thanks again,
Matt
* Re: [dm-devel] [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Shaun Tancheff @ 2016-08-25 7:52 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <CAJVOszDvg6-VBndG=4XdGbfwEXbBj6-oYJsGNtvtkrQ-J6JPbQ@mail.gmail.com>
On Thu, Aug 25, 2016 at 2:44 AM, Shaun Tancheff
<shaun.tancheff@seagate.com> wrote:
> On Wed, Aug 24, 2016 at 11:10 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> Linux allows for 32 character device names. When using the maximum size device name and also
>> storing "/dev/", devname needs to be 37 character long to store the complete device name.
>> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>>
>> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
>> ---
>> mdopen.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mdopen.c b/mdopen.c
>> index f818fdf..5af344b 100644
>> --- a/mdopen.c
>> +++ b/mdopen.c
>> @@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
>> struct createinfo *ci = conf_get_create_info();
>> int parts;
>> char *cname;
>> - char devname[20];
>> + char devname[37];
>
> I think you want 38 here.
> 5 + 32 + '\0'.
>> char devnm[32];
Ah sorry, that 32 was including the null already
implied by devnm.
Looks fine.
>> char cbuf[400];
>> if (chosen == NULL)
>> --
>> 2.9.3
>>
>
> Also a sprintf() to snprintf() cleanup might not be a bad idea ..
> --
> Shaun Tancheff
--
Shaun Tancheff
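
The sizing argument settled in this exchange can be checked with a few
lines of Python. This is only a sketch of the arithmetic, assuming (as
stated above) that devnm[32] already includes room for the terminating
NUL; the constant and function names are made up for illustration.

```python
DEV_PREFIX = "/dev/"
DEVNM_SIZE = 32  # char devnm[32]: up to 31 name bytes plus the NUL


def devname_size_needed(devnm_size: int = DEVNM_SIZE, prefix: str = DEV_PREFIX) -> int:
    """Bytes needed for prefix + longest device name + terminating NUL."""
    longest_name = devnm_size - 1  # the NUL is already counted in devnm_size
    return len(prefix) + longest_name + 1


# The example from the patch description, without the trailing "\0".
EXAMPLE = "/dev/md_abcdefghijklmnopqrstuvwxyz12"

if __name__ == "__main__":
    print("needed:", devname_size_needed())
    print("example length plus NUL:", len(EXAMPLE.encode()) + 1)
```

Both numbers come out to 37, which agrees with the patch; 38 would only
be needed if the name itself could be 32 bytes long before the NUL.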
* Re: [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Shaun Tancheff @ 2016-08-25 7:44 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid, dm-devel
In-Reply-To: <20160824161044.20887-1-robert@leblancnet.us>
On Wed, Aug 24, 2016 at 11:10 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Linux allows for 32 character device names. When using the maximum size device name and also
> storing "/dev/", devname needs to be 37 character long to store the complete device name.
> i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
>
> Signed-Off: Robert LeBlanc<robert@leblancnet.us>
> ---
> mdopen.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mdopen.c b/mdopen.c
> index f818fdf..5af344b 100644
> --- a/mdopen.c
> +++ b/mdopen.c
> @@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
> struct createinfo *ci = conf_get_create_info();
> int parts;
> char *cname;
> - char devname[20];
> + char devname[37];
I think you want 38 here.
5 + 32 + '\0'.
> char devnm[32];
> char cbuf[400];
> if (chosen == NULL)
> --
> 2.9.3
>
Also a sprintf() to snprintf() cleanup might not be a bad idea ..
--
Shaun Tancheff
* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-25 6:25 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux-RAID
In-Reply-To: <CAJCQCtSY=D-ASQ22km8GJjfju4jUgJSOBTAH5+XveCZq1BvT7w@mail.gmail.com>
On Wed, Aug 24, 2016 at 11:15:58AM -0600, Chris Murphy wrote:
> OK well you don't tell us what the mdadm create command was, there's
> no information on the metadata version, no mdadm -E or -D output, etc.
> There's really nothing to go on here. So we can't tell what the
> problem is either, or what your question is.
Thanks for the response, I learned some interesting things!
Here is one of the non-nuked drives:
$ sudo mdadm -E /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : <elided>
Name : <elided>
Creation Time : Wed Aug 10 11:33:41 2016
Raid Level : raid0
Raid Devices : 4
Avail Dev Size : 7814035071 (3726.02 GiB 4000.79 GB)
Data Offset : 16 sectors
Super Offset : 8 sectors
State : clean
Device UUID : <elided>
Update Time : Wed Aug 10 11:33:41 2016
Checksum : 490b562f - correct
Events : 0
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
Here is what should be the same, only device 2 in the array
(device 3 is similar or identical):
$ sudo mdadm -E /dev/sdf1
/dev/sdf1:
MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
$ sudo mdadm -D /dev/sdf1
mdadm: /dev/sdf1 does not appear to be an md device
Sadly, I can't do a mdadm -D because I can't assemble the RAID.
$ sudo mdadm -E /dev/md127
$
The command history is gone, but I would imagine that the RAID was
created with something like this:
mdadm --create /dev/md/bu --level=0 --raid-devices=4 /dev/sd{b,c,d,e}1
Although it could have been level=linear.
To summarize my email:
"Is this a known problem? If not, here is a bug report"
> > Any recommendations on a low power hardware with a well-supported
> > distro, that matches up well with a real backplane and SATA
> > connections instead of USB. The only caveat is that I want to encrypt
> > raw disks and it has to not be very noisy - so no rackmount gear
> > with 65dB 1" dog whistle fans. Obviously, whatever backplane must
> > be well-supported by the distro.
>
> OK so you just want to give up on the existing setup and you want
> advice on a whole new setup? From my perspective you're basically on
> three separate threads at this point.
Depends on the circumstances. I'm prepared to if there are no obvious
fixes. My intuition tells me the issue may be in the 4-bay switched
SATA enclosure, or the USB connection, or the driver thereof, and not
mdraid itself. I'm happy to be wrong on that.
BTW, in case this rings any bells as being buggy, here is the enclosure:
https://www.amazon.com/Mediasonic-ProBox-HF2-SU3S2-SATA-Enclosure/dp/B003X26VV4/
> It's a WDC Red with a physical sector size of 4096B, so it looks like
> the USB enclosure is doing the typical thing of masking the true
> physical sector size from the kernel. This is better than the opposite
> where the enclosure reports the drive as 4096B/4096B logical/physical,
> where the drive itself has 512B logical sectors, as this will cause
> problems if the drive is ever removed from that enclosure, or put into
> one that doesn't report 4096B logical sectors.
Oooh, that's meaty information thank you. I hadn't kept up with
things since the great 2TB changeover. That could explain some crap I
see with larger drives and USB enclosures. The problems you describe,
I saw back in the great 2GB switchover. Seagate had some boot sector
magic that would make things work by changing the cylinder sizes,
until it didn't....
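
A quick way to see what the kernel believes about a drive's sector
sizes is to read the queue attributes in sysfs. The Python sketch below
does that; the function names and the sysfs_root parameter are made up
for this example, "sdd" is only a placeholder, and the labels returned
by classify() are informal shorthand. Note that it reports what the
enclosure presents to the kernel, which, as described above, may not be
the drive's true physical sector size.

```python
from pathlib import Path


def block_sizes(device: str, sysfs_root: str = "/sys/block") -> tuple:
    """Return (logical, physical) block sizes in bytes as the kernel sees them."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def classify(logical: int, physical: int) -> str:
    if logical == 512 and physical == 4096:
        return "512e (4096B physical, 512B logical)"
    if logical == physical == 4096:
        return "4Kn (4096B logical and physical)"
    if logical == physical == 512:
        return "512B/512B (native 512B, or a bridge masking the physical size)"
    return f"{logical}B/{physical}B"


if __name__ == "__main__":
    device = "sdd"  # placeholder device name
    if (Path("/sys/block") / device / "queue").exists():
        print(device, classify(*block_sizes(device)))
    else:
        print(f"/sys/block/{device}/queue not found; adjust the device name")
```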
> > # parted /dev/sdd1
> > GNU Parted 2.3
> > Using /dev/sdd1
> > Welcome to GNU Parted! Type 'help' to view a list of commands.
> > (parted) p
> > Model: Unknown (unknown)
> > Disk /dev/sdd1: 4001GB
> > Sector size (logical/physical): 512B/512B
> > Partition Table: gpt
> >
> > Number Start End Size File system Name Flags
> > 1 1049kB 4001GB 4001GB Linux RAID raid
>
> It's purely speculation, but it sounds like to me in the history of
> one or more drives, the previous signatures weren't removed before the
> drive was retasked for its new purpose. That's the folly of not wiping
> the signatures in the reverse order they were created, and just
> expecting that starting over will wipe those old signatures.
It's possible, but why would you ever end up with a GPT in a partition?
I've certainly encountered this "GPT outside cylinder 0" on these two
drives before, but it goes away with a forcible reassemble or recreate
(which I did last time), because the mdlabel blows it away. Unless
it's something this list knows about, I suspect it is a firmware
glitch in the USB enclosure.
> But I think there is a legitimate gripe that parted probably should
> not operate on partitions like this. It's not valid to have nested
> GPTs like this. And I have no idea if parted is showing you valid or
> bogus information. You'd need to do something like:
>
> dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C
## Good disk (for comparison):
$ sudo dd if=/dev/sdd1 count=2 2> /dev/null | file -
/dev/stdin: data
$ sudo dd if=/dev/sdd1 count=2 2> /dev/null | hexdump -C | head -20
00000000 ff 02 19 2e 03 ee fa d8 6d d7 24 78 e1 d4 04 3d |........m.$x...=|
00000010 c9 92 33 97 17 7a 10 d3 05 bd 39 36 b4 a9 7c 14 |..3..z....96..|.|
00000020 a7 de 66 b6 cd d9 ff ef 45 27 74 6e 94 0a 03 49 |..f.....E'tn...I|
00000030 d4 43 26 2d 45 39 d1 93 8a 35 91 91 ff c9 a4 8e |.C&-E9...5......|
00000040 bd 9a 06 6d cc f2 89 65 c0 91 87 1c 1b f0 da 2f |...m...e......./|
00000050 83 c2 12 eb 80 3c c2 4c 68 cc 65 40 26 13 e0 77 |.....<.Lh.e@&..w|
00000060 38 15 ed 78 27 76 4c 91 71 99 3e 9f 99 f1 3f 51 |8..x'vL.q.>...?Q|
00000070 19 db 12 a3 ac b6 61 12 ff d9 37 87 31 1f 8b dd |......a...7.1...|
00000080 88 82 de fb db f2 a5 31 10 2a d2 03 be 12 be bd |.......1.*......|
00000090 19 46 9f c1 3b ea a1 37 81 d2 4d 00 54 e7 b4 55 |.F..;..7..M.T..U|
000000a0 b7 65 6c 3f 95 40 b0 f4 28 ff 90 62 22 cb 22 fd |.el?.@..(..b".".|
000000b0 6b 4d 90 56 32 4b c6 22 35 b1 62 76 e1 fd 82 d5 |kM.V2K."5.bv....|
000000c0 03 40 c0 85 4b ac 5a 44 9e 6a 25 97 d3 7f bd fe |.@..K.ZD.j%.....|
000000d0 0c 2d a8 bb 33 f4 00 df 7a 05 ae 6d b3 3e f3 7d |.-..3...z..m.>.}|
000000e0 34 9e 0e 57 14 de d8 e0 28 63 82 a6 2a 8a 1f fc |4..W....(c..*...|
000000f0 fe 2f b0 69 67 ac 0a e9 c2 53 a7 d8 36 1a 18 5a |./.ig....S..6..Z|
00000100 d6 d4 e6 ce df f7 fc 67 13 eb 25 08 45 50 10 7b |.......g..%.EP.{|
00000110 c6 23 1e 59 dc 2d c2 65 53 90 ca ec 21 e7 28 74 |.#.Y.-.eS...!.(t|
00000120 41 7f 3e 58 72 08 75 c1 d5 ca d0 91 55 5f 43 6a |A.>Xr.u.....U_Cj|
00000130 4e 84 d5 7f aa f2 b5 27 e4 86 5d 28 ae 6c 29 a1 |N......'..](.l).|
## Bad disk:
$ sudo dd if=/dev/sdf1 count=2 2> /dev/null | file -
/dev/stdin: x86 boot sector; partition 1: ID=0xee, starthead 0, startsector 1, 4294967295 sectors, code offset 0x6f
$ sudo dd if=/dev/sdf1 count=2 2> /dev/null | hexdump -C
00000000 38 6f 96 52 ea 9c 31 cd 10 a2 84 58 a2 f0 f5 43 |8o.R..1....X...C|
00000010 0f f2 5a 9b c7 ff 82 b2 d8 59 86 60 15 bc 31 65 |..Z......Y.`..1e|
00000020 bc d7 77 f9 31 6a c8 16 3f 13 90 24 b7 57 ff 6b |..w.1j..?..$.W.k|
00000030 64 7e e2 99 2a 99 f7 32 69 be aa 56 36 31 f7 db |d~..*..2i..V61..|
00000040 8c 4c 4c 12 68 19 77 0f f6 3b 92 bf 18 92 c2 45 |.LL.h.w..;.....E|
00000050 73 d5 b7 93 cc ae 6b b9 b0 bd 0c 85 a9 c3 19 f7 |s.....k.........|
00000060 87 34 b8 be 0a 95 cd 03 03 d5 01 49 b5 b0 86 fe |.4.........I....|
00000070 71 1c d2 f6 42 ed ce b0 eb c3 5f 4c 07 34 30 c7 |q...B....._L.40.|
00000080 8a 1f 91 c4 8b 28 b9 07 8e da ae 7d 7d c5 24 2b |.....(.....}}.$+|
00000090 6d f9 ea a3 6a 83 9d b8 6a 1f 6d db 3a 01 22 c7 |m...j...j.m.:.".|
000000a0 56 fc 2a 46 f8 b2 84 31 d1 8b 58 55 b6 5a 36 7b |V.*F...1..XU.Z6{|
000000b0 48 5d 98 2a 3f f0 ae 80 2b f8 6b b2 7f 1e 27 c2 |H].*?...+.k...'.|
000000c0 59 65 d0 bf c7 f0 5b 18 dc 59 8e 68 46 03 b6 ca |Ye....[..Y.hF...|
000000d0 42 06 7a 52 7a 49 36 03 0d d5 9b 67 a2 03 3b 13 |B.zRzI6....g..;.|
000000e0 40 23 19 f5 1a a6 bd fb c8 d5 5b 26 f5 6a 86 ab |@#........[&.j..|
000000f0 89 77 98 d8 09 cb b7 59 80 03 81 48 ba c6 ce 77 |.w.....Y...H...w|
00000100 3c 6c d2 ba a0 71 c3 20 18 fd 77 db ca a8 8a e3 |<l...q. ..w.....|
00000110 8d 6c 1f 17 d5 9f e5 81 bf 50 62 c3 bc f8 6c 5d |.l.......Pb...l]|
00000120 f7 3f a6 37 6b a9 53 2b 88 15 5d 6e 1e 48 4f b4 |.?.7k.S+..]n.HO.|
00000130 db af b4 f7 f5 7b 4d f3 3f 60 44 60 6e a2 c4 6d |.....{M.?`D`n..m|
00000140 b9 6c 88 04 e8 66 d1 7c a0 09 10 66 32 de 70 e1 |.l...f.|...f2.p.|
00000150 98 40 54 5e 1d f2 af b8 2e d1 75 0d 3c 46 1f f8 |.@T^......u.<F..|
00000160 85 72 49 87 ad 92 59 28 fd 9d 22 8e 1b 9f 2c 00 |.rI...Y(.."...,.|
00000170 87 58 74 01 63 a5 94 13 e3 9c ea ec 3f 21 22 41 |.Xt.c.......?!"A|
00000180 05 13 78 f3 a8 46 b3 02 9e 23 cb 9d 21 db a6 ae |..x..F...#..!...|
00000190 08 a8 70 48 18 6c e2 38 e4 ac 03 6e 06 74 17 7c |..pH.l.8...n.t.||
000001a0 90 ca 9f 5e 2e 2b 84 ef 52 2c 08 9a 48 98 f9 46 |...^.+..R,..H..F|
000001b0 f4 9f 00 cd ec a0 11 d7 00 00 00 00 00 00 00 00 |................|
000001c0 02 00 ee ff ff ff 01 00 00 00 ff ff ff ff 00 00 |................|
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
00000200 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
00000210 3a dc 43 c4 00 00 00 00 01 00 00 00 00 00 00 00 |:.C.............|
00000220 8e b6 c0 d1 01 00 00 00 22 00 00 00 00 00 00 00 |........".......|
00000230 6d b6 c0 d1 01 00 00 00 a5 4f bd 75 f6 c8 4f 43 |m........O.u..OC|
00000240 92 31 ab b6 a9 59 aa 04 02 00 00 00 00 00 00 00 |.1...Y..........|
00000250 80 00 00 00 80 00 00 00 59 04 3d 4a 00 00 00 00 |........Y.=J....|
00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
## is that the same as the boot sector itself? Interesting q.
# dd if=/dev/sdd count=2 of=/tmp/foo && dd if=/dev/sdd1 count=2 of=/tmp/bar && cmp /tmp/foo /tmp/bar
## Nope, how do they differ? Well that's a bit unpleasant to do manually but here...
# dd if=/dev/sdd count=2 2> /dev/null | hexdump -C
00000000 10 06 27 48 33 df bb 55 8b 28 fe 60 5e 18 6d 38 |..'H3..U.(.`^.m8|
00000010 fc b3 17 36 55 de fd 83 d0 52 72 19 d0 76 12 f0 |...6U....Rr..v..|
00000020 1e 23 bc 4d c5 4d c2 d6 5a d4 2b cd 16 78 c9 28 |.#.M.M..Z.+..x.(|
00000030 77 21 c4 9f c4 b7 48 ad e0 7b 08 d6 f5 8e 92 a7 |w!....H..{......|
00000040 bc 88 35 02 e7 f8 b8 3b 05 97 db a3 ad e7 96 4b |..5....;.......K|
00000050 84 d9 e2 a4 3a 5a 07 ac fc a2 78 58 d7 c8 5a 19 |....:Z....xX..Z.|
00000060 88 9c f6 f2 c0 ec 99 55 d9 5d 00 87 3a 86 52 01 |.......U.]..:.R.|
00000070 92 58 25 82 99 50 8e 28 0f 42 07 71 9a a3 db 82 |.X%..P.(.B.q....|
00000080 00 d9 b8 28 9d d8 97 85 9d c6 fb 5e 4d 94 3a 6e |...(.......^M.:n|
00000090 19 3c a6 ce 57 6b a0 52 d6 72 0c 41 2e cd cb a2 |.<..Wk.R.r.A....|
000000a0 15 c8 d4 c8 8c 90 34 5f 15 ab 69 96 af 3d 7e 30 |......4_..i..=~0|
000000b0 25 e1 72 35 d6 c4 b2 5e 78 72 0b 3f 9a 96 40 7e |%.r5...^xr.?..@~|
000000c0 c6 aa 0e 5a da 99 ae fe a3 93 8b 5b c4 bf 91 64 |...Z.......[...d|
000000d0 d5 62 12 ea 70 15 a9 05 81 8d e4 fb 36 15 c9 63 |.b..p.......6..c|
000000e0 ba f9 d2 5c f6 df 28 71 d8 d5 82 95 2b 83 40 db |...\..(q....+.@.|
000000f0 9b fe e2 a7 9b 38 5e 5f 51 a6 6e e6 7b 4e bf 02 |.....8^_Q.n.{N..|
00000100 d2 fb aa f9 2c 7a 5b f5 47 ad ac 7e d1 1c f3 1b |....,z[.G..~....|
00000110 a3 8e 54 9f a4 8d 1a 02 3f cc 81 f0 ca e9 28 1e |..T.....?.....(.|
00000120 33 9e d8 71 dd f2 aa b7 d4 06 96 cb 0c 8e f1 6a |3..q...........j|
00000130 88 1d 2a 8a a3 33 00 8c ef d4 d8 39 3e 70 18 34 |..*..3.....9>p.4|
00000140 e6 3a cd e7 0b d6 82 a8 a4 aa ff bd b3 69 0a cc |.:...........i..|
00000150 32 9e e3 26 34 bb cc 0e b0 69 5f 9a c5 f3 57 7d |2..&4....i_...W}|
00000160 47 82 bc 66 44 55 c4 de 3c 2c 14 d0 9a 73 6a da |G..fDU..<,...sj.|
00000170 3c 5e f8 99 26 5b f4 8a 13 a1 f1 c8 a9 20 4c 3a |<^..&[....... L:|
00000180 bd 03 4e e9 83 25 46 32 3f 80 3e 42 58 e7 18 27 |..N..%F2?.>BX..'|
00000190 8a c8 7c 8c 74 99 96 61 d4 e2 58 c2 27 71 8c 3b |..|.t..a..X.'q.;|
000001a0 da 33 f8 7f b5 c1 a7 a0 c2 7b 54 29 0d 47 b4 b5 |.3.......{T).G..|
000001b0 4c 62 5b f8 e9 6f bc 29 00 00 00 00 00 00 00 00 |Lb[..o.)........|
000001c0 02 00 ee ff ff ff 01 00 00 00 ff ff ff ff 00 00 |................|
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
00000200 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
00000210 62 01 85 1f 00 00 00 00 01 00 00 00 00 00 00 00 |b...............|
00000220 af be c0 d1 01 00 00 00 22 00 00 00 00 00 00 00 |........".......|
00000230 8e be c0 d1 01 00 00 00 e2 89 58 78 77 63 52 44 |..........XxwcRD|
00000240 93 9e 4a 93 16 06 86 6b 02 00 00 00 00 00 00 00 |..J....k........|
00000250 80 00 00 00 80 00 00 00 5d ff 7e 02 00 00 00 00 |........].~.....|
00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
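Rather than eyeballing the two dumps, the GPT header at offset 0x200 can be decoded field by field. What follows is only a sketch: it assumes the standard 92-byte UEFI GPT header layout and hard-codes the header bytes copied from the two hexdumps above (/dev/sdf1 and /dev/sdd). The names parse_gpt_header, SDF1_HDR and SDD_HDR are made up for illustration.

```python
import struct
import uuid

# Fixed part of a GPT header (UEFI layout, little-endian), 92 bytes.
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")

# Header bytes copied from the hexdumps above, offsets 0x200-0x25b.
SDF1_HDR = bytes.fromhex(
    "4546492050415254000001005c000000"
    "3adc43c4000000000100000000000000"
    "8eb6c0d1010000002200000000000000"
    "6db6c0d101000000a54fbd75f6c84f43"
    "9231abb6a959aa040200000000000000"
    "800000008000000059043d4a"
)
SDD_HDR = bytes.fromhex(
    "4546492050415254000001005c000000"
    "6201851f000000000100000000000000"
    "afbec0d1010000002200000000000000"
    "8ebec0d101000000e289587877635244"
    "939e4a931606866b0200000000000000"
    "80000000800000005dff7e02"
)


def parse_gpt_header(raw):
    """Decode the fixed part of a GPT header into a dict."""
    (sig, _rev, size, _crc, _reserved, my_lba, alt_lba, first_usable,
     last_usable, guid, entries_lba, n_entries, _entry_size,
     _entries_crc) = GPT_HEADER.unpack(raw[:GPT_HEADER.size])
    if sig != b"EFI PART":
        raise ValueError("no GPT signature")
    return {
        "header_size": size,
        "my_lba": my_lba,
        "backup_lba": alt_lba,
        "first_usable_lba": first_usable,
        "last_usable_lba": last_usable,
        "disk_guid": str(uuid.UUID(bytes_le=guid)),
        "entries_lba": entries_lba,
        "num_entries": n_entries,
    }


for name, raw in (("sdf1", SDF1_HDR), ("sdd", SDD_HDR)):
    print(name, parse_gpt_header(raw))
```

The backup-LBA field records the last sector of whatever device the label was written for. In these dumps it decodes to 7814037167 for /dev/sdd and 7814035086 for /dev/sdf1, and the disk GUIDs differ too, so at least these two headers are not byte-for-byte copies of each other.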
> And then we can see if there really is a PMBR and GPT in that first
> sector that parted is picking up. But where it could be coming from in
> an mdadm linear layout? No idea.
>
> The other thing to check is the end of the partition, because GPT has
> a primary and backup. So the 2nd to last sector of sdd1 may have a
> backup GPT on it, and possibly something is wrongly restoring it
> sometimes.
>
> In any case I would still look into using something much, much newer than
> parted 2.3; it's basically Pleistocene old, and the version of mdadm
> is likewise old. But this is what happens with LTS releases:
> ancient software for which no one except its maintainers remembers the
> state and history.
I understand and can probably acquire the most recent stable and
compile from source, if you think that would prove useful enough to
justify the effort. TBH once GPT came out I lost track of which
partitioning tool was appropriate to use, it seemed like (IIRC)
cfdisk, sfdisk, parted were all vying for my attention... is parted
now the standard?
At the current moment I am backing up the drives so that I can try a
forcible reassemble. I think that last time this happened, that
effectively relabeled the mdraid partitions and fixed the problem.
The underlying mdraid has an LVM on LUKS, but last time this happened
I managed to fsck and get 99% of the data back, with only a few things
ending up in lost+found. Presumably there might have been some data
corruption, but since it's a backup server only I consider it
tolerable, modulo the failed Windows system which needs to restore
from it.
--
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977
* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: NeilBrown @ 2016-08-25 4:59 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20160824052512.GA1921@kernel.org>
On Wed, Aug 24 2016, Shaohua Li wrote:
> On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
>> On Wed, Aug 17 2016, Shaohua Li wrote:
>> >> >
>> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
>> >> > thread. As stopping the thread will wait for the thread, which probably is
>> >> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
>> >> > superblock, we must hold the reconfig_mutex.
>> >>
>> >> When you say "writing the superblock" you presumably mean "blocked in
>> >> r5l_write_super_and_discard_space(), waiting for MD_CHANGE_PENDING to
>> >> be cleared" ??
>> > right
>> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
>> >> ->quiesce to be set, and then exit gracefully.
>> >
>> > Can you give details about this please? .quiesce is called with reconfig_mutex
>> > held, so the MD_CHANGE_PENDING will never get cleared.
>>
>> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
>>
>> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
>>
>> But the reclaim thread might be in
>> r5l_do_reclaim() -> r5l_write_super_and_discard_space()
>> waiting for MD_CHANGE_PENDING to clear. That will only get cleared when
>> the main thread can get the reconfig_mutex, which the thread calling
>> raid5_quiesce() might hold. So we get a deadlock.
>>
>> My suggestion is to change r5l_write_super_and_discard_space() so that
>> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
>> to be set. That will avoid the deadlock.
>>
>> Whatever thread called raid5_quiesce() will now be in control of the
>> array without any async IO going on. If it needs the metadata to be
>> sync, it can do that itself. If not, then it doesn't really matter that
>> r5l_write_super_and_discard_space() didn't wait.
>
> I'm afraid waiting for conf->quiesce to be set isn't safe. The reason we wait
> for the superblock write isn't async IO. Discard could zero data, so before
> we discard, we must make sure the superblock points to the correct log tail;
> otherwise recovery will not work. This is the reason we wait for the
> superblock write.
>
>> r5l_write_super_and_discard_space() shouldn't call discard if the
>> superblock write didn't complete, and probably r5l_do_reclaim()
>> shouldn't update last_checkpoint and last_cp_seq in that case.
>> This is what I mean by "with a bit of care" and "exit gracefully".
>> Maybe I should have said "abort cleanly". The goal is to get the thread
>> to exit. It doesn't need to complete what it was doing, it just needs
>> to make sure that it leaves things in a tidy state so that when it
>> starts up again, it can pick up where it left off.
>
> Agreed, we could skip the discard sometimes; that happens only occasionally, so
> the impact is small. I tested something like the patch below recently. I assume
> this is the solution we agree on?
Yes, this definitely looks like it is heading in the right direction.
I thought that
> - set_mask_bits(&mddev->flags, 0,
> - BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
> - md_wakeup_thread(mddev->thread);
would still be there in the case that the lock cannot be claimed.
You could even record the ->events value before setting the flags,
and record the range that needs to be discarded. Next time
r5l_do_reclaim is entered, if ->events has moved on, then it should be
safe to discard the recorded range. Maybe.
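The bookkeeping for that idea could look roughly like the sketch below. It is illustrative userspace Python, not kernel code: DeferredDiscard and its methods are hypothetical names, and whether an advanced ->events count really guarantees that the new log tail is on disk is exactly the "Maybe" above.

```python
class DeferredDiscard:
    """Remember a discard range until the superblock is known to be newer."""

    def __init__(self):
        # (events_at_request, start, end) or None when nothing is pending.
        self.pending = None

    def request(self, events, start, end):
        # The lock could not be taken: note the range and the event count
        # observed *before* asking for a superblock update.
        self.pending = (events, start, end)

    def take_if_safe(self, events_now):
        # If the event count has advanced since the request, assume a
        # superblock write completed and hand back the recorded range.
        if self.pending is not None and events_now > self.pending[0]:
            _, start, end = self.pending
            self.pending = None
            return (start, end)
        return None


d = DeferredDiscard()
d.request(events=10, start=100, end=200)
print(d.take_if_safe(10))   # events unchanged: nothing to discard yet
print(d.take_if_safe(11))   # events moved on: the recorded range is returned
```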
Thanks,
NeilBrown
* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: Chris Murphy @ 2016-08-24 17:15 UTC (permalink / raw)
To: Linux-RAID
In-Reply-To: <20160823050947.GL32250@subspacefield.org>
On Mon, Aug 22, 2016 at 11:09 PM,
<travis+ml-linux-raid@subspacefield.org> wrote:
> Hello all,
>
> So I have an Intel NUC (for low power Linux) plugged via USB into a 4
> bay enclosure doing linear (yeah I know; it's the backup server, the
> primary is raid10).
>
> And every once in a while, this happens (*see end). The partition 1
> that would normally contain a MD slice ends up being a replica of the
> boot cylinder. I can't tell if it's the mdraid linear impl, the
> kernel doing something weird, the USB drivers, the enclosure firmware,
> or what.
OK well you don't tell us what the mdadm create command was, there's
no information on the metadata version, no mdadm -E or -D output, etc.
There's really nothing to go on here. So we can't tell what the
problem is either, or what your question is.
>
> Anyway, this happened while I was restoring a Windows machine whose
> root drive suddenly took a nosedive, and it happens every 6 months
> or so. Today it happened while I was in the middle of recovering
> a Windows machine whose 1TB SSD threw up on C: and totally nuked
> the data.
OK? I don't follow this at all, how it relates to the NUC, how it
relates to the USB drives connected to the NUC.
>
> The last low-power option I tried was an OpenRD Ultimate based around
> ARMv5TE which was basically unsupported by debian by the time I got
> it, and subsequently became ultra-flaky due to what seemed to be RAM
> problems - it was crashing every 3 days with kernel panics, and every
> once in a while would do something worse.
This is definitely superfluous information that just clutters the thread...
> Any recommendations on a low power hardware with a well-supported
> distro, that matches up well with a real backplane and SATA
> connections instead of USB. The only caveat is that I want to encrypt
> raw disks and it has to not be very noisy - so no rackmount gear
> with 65dB 1" dog whistle fans. Obviously, whatever backplane must
> be well-supported by the distro.
OK so you just want to give up on the existing setup and you want
advice on a whole new setup? From my perspective you're basically on
three separate threads at this point.
>
> Also, does anyone have experience with cryptsetup on multiple
> partitions? I can do that but get prompted multiple times and I was
> wondering if anyone knew an easy way to fix the boot time scripts to
> avoid that, only prompting once per unique underlying crypttab.
And now you're on your fourth subject for an entirely new thread that
also has nothing to do with this list. This is probably a distribution
question. On the distribution I use, the thing that prompts for a
passphrase tries that passphrase on all cryptluks devices, so in the
event they share the same passphrase, they're all opened just by
entering the passphrase one time. If the passphrase is entered
incorrectly, now I'm stuck and have to enter the passphrase per LUKS
instance.
>
> And finally, I have a story about buggy drive firmware that you
> might enjoy, especially if you were doing this sort of stuff in
> the 90s as well. Cheers:
OK...fifth subject and thread.
> # parted /dev/sde
> GNU Parted 2.3
I would start out by using a non-ancient version of parted. This is 6 years old.
> Using /dev/sde
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: WDC WD40 EFRX-68WT0N0 (scsi)
> Disk /dev/sde: 4001GB
> Sector size (logical/physical): 512B/512B
It's a WDC Red with a physical sector size of 4096B, so it looks like
the USB enclosure is doing the typical thing of masking the true
physical sector size from the kernel. This is better than the opposite
case, where the enclosure reports the drive as 4096B/4096B logical/physical
while the drive itself has 512B logical sectors, as that will cause
problems if the drive is ever removed from that enclosure, or put into
one that doesn't report 4096B logical sectors.
> Partition Table: gpt
>
> Number Start End Size File system Name Flags
> 1 1049kB 4001GB 4001GB Linux RAID raid
>
> (parted) q
> # parted /dev/sdd1
> GNU Parted 2.3
> Using /dev/sdd1
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: Unknown (unknown)
> Disk /dev/sdd1: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number Start End Size File system Name Flags
> 1 1049kB 4001GB 4001GB Linux RAID raid
It's pure speculation, but it sounds to me like at some point in the history of
one or more drives, the previous signatures weren't removed before the
drive was retasked for its new purpose. That's the folly of not wiping
the signatures in the reverse order they were created, and just
expecting that starting over will wipe those old signatures.
But I think there is a legitimate gripe that parted probably should
not operate on partitions like this. It's not valid to have nested
GPTs like this. And I have no idea if parted is showing you valid or
bogus information. You'd need to do something like:
dd if=/dev/sdd1 count=2 2>/dev/null | hexdump -C
And then we can see if there really is a PMBR and GPT in that first
sector that parted is picking up. But where it could be coming from in
an mdadm linear layout? No idea.
The other thing to check is the end of the partition, because GPT has
a primary and backup. So the 2nd to last sector of sdd1 may have a
backup GPT on it, and possibly something is wrongly restoring it
sometimes.
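A quick way to look for a stray backup header at the end of the partition is sketched below. It assumes 512-byte logical sectors (as parted reports) and the UEFI layout, in which the backup GPT header sits in the very last LBA with its entry array just before it; the last two sectors are checked to cover both readings. find_gpt_signatures is a hypothetical helper name.

```python
import io

SECTOR = 512  # logical sector size as reported through the enclosure


def find_gpt_signatures(dev, tail_sectors=2, sector=SECTOR):
    """Return byte offsets, among the last `tail_sectors` sectors of the
    open binary file `dev`, that start with the GPT signature."""
    size = dev.seek(0, io.SEEK_END)
    hits = []
    for n in range(1, tail_sectors + 1):
        off = size - n * sector
        if off < 0:
            break
        dev.seek(off)
        if dev.read(8) == b"EFI PART":
            hits.append(off)
    return hits


# Self-contained demo on an in-memory image of four sectors with a
# signature planted in the last one.
image = bytearray(4 * SECTOR)
image[3 * SECTOR:3 * SECTOR + 8] = b"EFI PART"
print(find_gpt_signatures(io.BytesIO(bytes(image))))
```

Against the real device this would be run as root with something like open("/dev/sdd1", "rb") in place of the in-memory image.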
In any case I would still look into using something much, much newer than
parted 2.3; it's basically Pleistocene old, and the version of mdadm
is likewise old. But this is what happens with LTS releases:
ancient software for which no one except its maintainers remembers the
state and history.
--
Chris Murphy
* [PATCH v2] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Robert LeBlanc @ 2016-08-24 16:10 UTC (permalink / raw)
To: linux-raid; +Cc: dm-devel, robert
Linux allows device names of up to 32 bytes, including the terminating NUL. When using a
maximum-length device name and also storing "/dev/", devname needs to be 37 characters long
to hold the complete device name,
i.e. "/dev/md_abcdefghijklmnopqrstuvwxyz12\0"
Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
---
mdopen.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mdopen.c b/mdopen.c
index f818fdf..5af344b 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
struct createinfo *ci = conf_get_create_info();
int parts;
char *cname;
- char devname[20];
+ char devname[37];
char devnm[32];
char cbuf[400];
if (chosen == NULL)
--
2.9.3
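The arithmetic behind the new size can be checked with a few lines. This is a sketch that assumes devnm[32] holds at most 31 name characters plus the NUL, as the declaration in the hunk suggests; devname_buffer_size is a hypothetical helper, not part of mdadm.

```python
DEV_PREFIX = "/dev/"
DEVNM_SIZE = 32  # char devnm[32]: up to 31 characters plus the NUL


def devname_buffer_size(devnm_size=DEVNM_SIZE, prefix=DEV_PREFIX):
    """Bytes needed for the prefix, the longest name and the final NUL."""
    longest_name = devnm_size - 1
    return len(prefix) + longest_name + 1


example = "/dev/md_abcdefghijklmnopqrstuvwxyz12"
# Both numbers should agree with the devname[37] declaration.
print(devname_buffer_size(), len(example) + 1)
```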
* Re: [PATCH] raid6: fix the input of raid6 algorithm
From: liuzhengyuan @ 2016-08-24 7:58 UTC (permalink / raw)
To: H. Peter Anvin
Cc: shli, linux-raid, fenghua.yu, linux-kernel, liuzhengyuang521
In-Reply-To: <FAA53096-C767-4142-B45C-01889986EDAF@zytor.com>
Oh, get_random_*() is really expensive. Thanks for the tip. The boot log from my aarch64
machine, shown below, indicates that it took about 0.6 seconds to fill the pages with disk data.
[ 0.172831] DMA: preallocated 256 KiB pool for atomic allocations
[ 0.788664] raid6: int64x1 gen() 121 MB/s
[ 0.856613] raid6: int64x1 xor() 74 MB/s
[ 0.924665] raid6: int64x2 gen() 166 MB/s
[ 0.992846] raid6: int64x2 xor() 95 MB/s
[ 1.060681] raid6: int64x4 gen() 290 MB/s
[ 1.128774] raid6: int64x4 xor() 160 MB/s
[ 1.196933] raid6: int64x8 gen() 238 MB/s
[ 1.264937] raid6: int64x8 xor() 148 MB/s
[ 1.332878] raid6: neonx1 gen() 256 MB/s
[ 1.400975] raid6: neonx1 xor() 130 MB/s
[ 1.468951] raid6: neonx2 gen() 333 MB/s
[ 1.537085] raid6: neonx2 xor() 181 MB/s
[ 1.605042] raid6: neonx4 gen() 451 MB/s
[ 1.673121] raid6: neonx4 xor() 289 MB/s
[ 1.741143] raid6: neonx8 gen() 452 MB/s
[ 1.809151] raid6: neonx8 xor() 277 MB/s
[ 1.809154] raid6: using algorithm neonx8 gen() 452 MB/s
[ 1.809157] raid6: .... xor() 277 MB/s, rmw enabled
[ 1.809160] raid6: using intx1 recovery algorithm
I replaced get_random_* with a local PRNG based on the well-known
linear congruential method. The patch looks like this:
+/* use the linear congruential bit. */
+static int32_t get_random_number_by_lcb(void)
+{
+ static int32_t seed = 1;
+ int32_t ret = 0;
+ ret = ((seed * 1103515245) + 12345) & 0x7fffffff;
+ seed = ret;
+ return ret;
+}
/* Try to pick the best algorithm */
/* This code uses the gfmul table as convenient data set to abuse */
@@ -229,8 +238,8 @@ int __init raid6_select_algo(void)
for (i = 0; i < disks-2; i++) {
dptrs[i] = disk_ptr + PAGE_SIZE*i;
- for (j = 0; j < PAGE_SIZE; j++)
- get_random_bytes(dptrs[i]+j, 1);
+ for (j = 0; j < PAGE_SIZE; j = j + 4)
+ *(int32_t *)(dptrs[i]+j) = get_random_number_by_lcb();
}
dptrs[disks-2] = disk_ptr + PAGE_SIZE*(disks-2);
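For anyone who wants to look at the byte stream this generator produces without booting a kernel, here is a userspace sketch of the same recurrence. It assumes a 4096-byte page and little-endian stores; lcg_next and fill_page are illustrative names, and Python's unbounded integers stand in for the wrapping 32-bit multiply in the C code.

```python
import struct

PAGE_SIZE = 4096  # assumed; the kernel's PAGE_SIZE depends on the arch


def lcg_next(seed):
    """One step of x -> (x * 1103515245 + 12345) mod 2**31."""
    return (seed * 1103515245 + 12345) & 0x7FFFFFFF


def fill_page(seed=1, page_size=PAGE_SIZE):
    """Fill one page with successive LCG outputs, four bytes at a time.

    Returns the page and the final seed so the next page can continue
    the sequence, as the static seed does in the patch.
    """
    page = bytearray(page_size)
    for off in range(0, page_size, 4):
        seed = lcg_next(seed)
        struct.pack_into("<I", page, off, seed)
    return bytes(page), seed


page, last_seed = fill_page()
print(len(page), page[:8].hex())
```

Two properties are worth keeping in mind when judging the data: the mask leaves the top bit of every 32-bit word at zero, and with a power-of-two modulus the low-order bits of such a generator repeat with short periods. Whether either matters for the benchmark is not something this sketch answers.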
The boot log with this patch is shown below; it took about 0.08 seconds.
[ 0.172858] DMA: preallocated 256 KiB pool for atomic allocations
[ 0.256673] raid6: int64x1 gen() 121 MB/s
[ 0.324484] raid6: int64x1 xor() 73 MB/s
[ 0.392606] raid6: int64x2 gen() 166 MB/s
[ 0.460309] raid6: int64x2 xor() 92 MB/s
[ 0.528368] raid6: int64x4 gen() 290 MB/s
[ 0.596401] raid6: int64x4 xor() 156 MB/s
[ 0.664601] raid6: int64x8 gen() 238 MB/s
[ 0.732609] raid6: int64x8 xor() 148 MB/s
[ 0.800523] raid6: neonx1 gen() 256 MB/s
[ 0.868730] raid6: neonx1 xor() 129 MB/s
[ 0.936741] raid6: neonx2 gen() 334 MB/s
[ 1.004717] raid6: neonx2 xor() 202 MB/s
[ 1.072692] raid6: neonx4 gen() 451 MB/s
[ 1.140763] raid6: neonx4 xor() 260 MB/s
[ 1.208842] raid6: neonx8 gen() 452 MB/s
[ 1.276887] raid6: neonx8 xor() 277 MB/s
[ 1.276890] raid6: using algorithm neonx8 gen() 452 MB/s
[ 1.276894] raid6: .... xor() 277 MB/s, rmw enabled
[ 1.276897] raid6: using intx1 recovery algorithm
[ 1.276941] ACPI: Interpreter disabled.
I'm not familiar with spurious D$ conflicts and CPU cache behavior. What do you
think of this PRNG, and is there anything else I need to do?
------------------ Original ------------------
From: "H. Peter Anvin"<hpa@zytor.com>;
Date: Tue, Aug 23, 2016 11:53 AM
To: "liuzhengyuan"<liuzhengyuan@kylinos.cn>;
Cc: "shli"<shli@kernel.org>; "linux-raid"<linux-raid@vger.kernel.org>; "fenghua.yu"<fenghua.yu@intel.com>; "linux-kernel"<linux-kernel@vger.kernel.org>; "liuzhengyuang521"<liuzhengyuang521@gmail.com>;
Subject: Re: [PATCH] raid6: fix the input of raid6 algorithm
Do you have any idea how long this takes to run? People are already complaining about the boot time penalty. get_random_*() is quite expensive and is overkill...
* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: Shaohua Li @ 2016-08-24 5:25 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87k2f6g496.fsf@notabene.neil.brown.name>
On Wed, Aug 24, 2016 at 02:49:57PM +1000, Neil Brown wrote:
> On Wed, Aug 17 2016, Shaohua Li wrote:
> >> >
> >> > We will have the same deadlock issue with just stopping/restarting the reclaim
> >> > thread. As stopping the thread will wait for the thread, which probably is
> >> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
> >> > superblock, we must hold the reconfig_mutex.
> >>
> >> When you say "writing the superblock" you presumably mean "blocked in
> >> r5l_write_super_and_discard_space(), waiting for MD_CHANGE_PENDING to
> >> be cleared" ??
> > right
> >> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
> >> ->quiesce to be set, and then exit gracefully.
> >
> > Can you give details about this please? .quiesce is called with reconfig_mutex
> > held, so the MD_CHANGE_PENDING will never get cleared.
>
> raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
>
> r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
>
> But the reclaim thread might be in
> r5l_do_reclaim() -> r5l_write_super_and_discard_space()
> waiting for MD_CHANGE_PENDING to clear. That will only get cleared when
> the main thread can get the reconfig_mutex, which the thread calling
> raid5_quiesce() might hold. So we get a deadlock.
>
> My suggestion is to change r5l_write_super_and_discard_space() so that
> it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
> to be set. That will avoid the deadlock.
>
> Whatever thread called raid5_quiesce() will now be in control of the
> array without any async IO going on. If it needs the metadata to be
> sync, it can do that itself. If not, then it doesn't really matter that
> r5l_write_super_and_discard_space() didn't wait.
I'm afraid waiting for conf->quiesce to be set isn't safe. The reason we wait
for the superblock write isn't async IO. Discard could zero data, so before
we discard, we must make sure the superblock points to the correct log tail;
otherwise recovery will not work. This is the reason we wait for the
superblock write.
> r5l_write_super_and_discard_space() shouldn't call discard if the
> superblock write didn't complete, and probably r5l_do_reclaim()
> shouldn't update last_checkpoint and last_cp_seq in that case.
> This is what I mean by "with a bit of care" and "exit gracefully".
> Maybe I should have said "abort cleanly". The goal is to get the thread
> to exit. It doesn't need to complete what it was doing, it just needs
> to make sure that it leaves things in a tidy state so that when it
> starts up again, it can pick up where it left off.
Agreed, we could skip the discard sometimes; that happens only occasionally, so
the impact is small. I tested something like the patch below recently. I assume
this is the solution we agree on?
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5504ce2..cd34e66 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -96,7 +96,6 @@ struct r5l_log {
spinlock_t no_space_stripes_lock;
bool need_cache_flush;
- bool in_teardown;
};
/*
@@ -703,32 +702,22 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
return;
mddev = log->rdev->mddev;
+
/*
- * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
- * wait for this thread to finish. This thread waits for
- * MD_CHANGE_PENDING clear, which is supposed to be done in
- * md_check_recovery(). md_check_recovery() tries to get
- * reconfig_mutex. Since r5l_quiesce already holds the mutex,
- * md_check_recovery() fails, so the PENDING never get cleared. The
- * in_teardown check workaround this issue.
+ * Discard could zero data, so before discard we must make sure
+ * superblock is updated to new log tail. Updating superblock (either
+ * directly call md_update_sb() or depend on md thread) must hold
+ * reconfig mutex. On the other hand, raid5_quiesce is called with
+ * reconfig_mutex held. The first step of raid5_quiesce() is waiting
+ * for all IO to finish, hence waiting for reclaim thread, while reclaim
+ * thread is calling this function and waiting for reconfig mutex. So
+ * there is a deadlock. We workaround this issue with a trylock.
+ * FIXME: we could miss discard if we can't take reconfig mutex
*/
- if (!log->in_teardown) {
- set_mask_bits(&mddev->flags, 0,
- BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
- md_wakeup_thread(mddev->thread);
- wait_event(mddev->sb_wait,
- !test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
- log->in_teardown);
- /*
- * r5l_quiesce could run after in_teardown check and hold
- * mutex first. Superblock might get updated twice.
- */
- if (log->in_teardown)
- md_update_sb(mddev, 1);
- } else {
- WARN_ON(!mddev_is_locked(mddev));
- md_update_sb(mddev, 1);
- }
+ if (!mddev_trylock(mddev))
+ return;
+ md_update_sb(mddev, 1);
+ mddev_unlock(mddev);
/* discard IO error really doesn't matter, ignore it */
if (log->last_checkpoint < end) {
@@ -827,7 +816,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
if (!log || state == 2)
return;
if (state == 0) {
- log->in_teardown = 0;
/*
* This is a special case for hotadd. In suspend, the array has
* no journal. In resume, journal is initialized as well as the
@@ -838,11 +826,6 @@ void r5l_quiesce(struct r5l_log *log, int state)
log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
log->rdev->mddev, "reclaim");
} else if (state == 1) {
- /*
- * at this point all stripes are finished, so io_unit is at
- * least in STRIPE_END state
- */
- log->in_teardown = 1;
/* make sure r5l_write_super_and_discard_space exits */
mddev = log->rdev->mddev;
wake_up(&mddev->sb_wait);
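The trylock pattern the patch relies on can be shown in isolation. This is a toy userspace model, not the md code: reconfig_mutex here is an ordinary threading.Lock, and the list passed in merely records which steps ran.

```python
import threading

reconfig_mutex = threading.Lock()  # stand-in for mddev's reconfig_mutex


def write_super_and_discard(log):
    """Update the superblock and discard, but only if the lock is free."""
    if not reconfig_mutex.acquire(blocking=False):
        # Lock is busy (for example held by the quiesce path): give up
        # on this discard instead of waiting, so no deadlock can form.
        return False
    try:
        log.append("md_update_sb")
    finally:
        reconfig_mutex.release()
    log.append("discard")
    return True


steps = []
print(write_super_and_discard(steps), steps)

# With the lock already held, the call returns at once and does nothing.
with reconfig_mutex:
    skipped = []
    print(write_super_and_discard(skipped), skipped)
```

The cost is the one the FIXME in the patch names: a discard is simply missed whenever the lock happens to be taken.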
* Re: [PATCH 1/3] MD: hold mddev lock for .quiesce in md_do_sync
From: NeilBrown @ 2016-08-24 4:49 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, Shaohua Li
In-Reply-To: <20160817012803.GA86961@kernel.org>
On Wed, Aug 17 2016, Shaohua Li wrote:
>> >
>> > We will have the same deadlock issue with just stopping/restarting the reclaim
>> > thread. As stopping the thread will wait for the thread, which probably is
>> > doing r5l_do_reclaim and writing the superblock. Since we are writing the
>> > superblock, we must hold the reconfig_mutex.
>>
>> When you say "writing the superblock" you presumably mean "blocked in
>> r5l_write_super_and_discard_space(), waiting for MD_CHANGE_PENDING to
>> be cleared" ??
> right
>> With a bit of care you could wait for MD_CHANGE_PENDING to clear, or for
>> ->quiesce to be set, and then exit gracefully.
>
> Can you give details about this please? .quiesce is called with reconfig_mutex
> hold, so the MD_CHANGE_PENDING will never get cleared.
raid5_quiesce(mddev, 1) sets conf->quiesce and then calls r5l_quiesce().
r5l_quiesce() tells the reclaim_thread to exit and waits for it to do so.
But the reclaim thread might be in
r5l_do_reclaim() -> r5l_write_super_and_discard_space()
waiting for MD_CHANGE_PENDING to clear. That will only get cleared when
the main thread can get the reconfig_mutex, which the thread calling
raid5_quiesce() might hold. So we get a deadlock.
My suggestion is to change r5l_write_super_and_discard_space() so that
it waits for *either* MD_CHANGE_PENDING to be clear, or conf->quiesce
to be set. That will avoid the deadlock.
Whatever thread called raid5_quiesce() will now be in control of the
array without any async IO going on. If it needs the metadata to be
sync, it can do that itself. If not, then it doesn't really matter that
r5l_write_super_and_discard_space() didn't wait.
r5l_write_super_and_discard_space() shouldn't call discard if the
superblock write didn't complete, and probably r5l_do_reclaim()
shouldn't update last_checkpoint and last_cp_seq in that case.
This is what I mean by "with a bit of care" and "exit gracefully".
Maybe I should have said "abort cleanly". The goal is to get the thread
to exit. It doesn't need to complete what it was doing, it just needs
to make sure that it leaves things in a tidy state so that when it
starts up again, it can pick up where it left off.
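The "wait for either condition, then abort cleanly" idea can be modelled in a few lines of userspace code. This is only an analogy, assuming a condition variable in place of mddev->sb_wait; SbWait and its method names are invented for the example.

```python
import threading


class SbWait:
    """Toy model of a wait queue guarded by two flags."""

    def __init__(self):
        self.cond = threading.Condition()
        self.change_pending = True   # plays the role of MD_CHANGE_PENDING
        self.quiesce = False         # plays the role of conf->quiesce

    def set_quiesce(self):
        with self.cond:
            self.quiesce = True
            self.cond.notify_all()

    def clear_pending(self):
        with self.cond:
            self.change_pending = False
            self.cond.notify_all()

    def wait_for_sb_or_quiesce(self, timeout=None):
        """Block until the pending flag clears or quiesce is set.

        Returns True only if the superblock write completed, so the
        caller knows whether discarding is allowed.
        """
        with self.cond:
            self.cond.wait_for(
                lambda: not self.change_pending or self.quiesce, timeout)
            return not self.change_pending


w = SbWait()
w.set_quiesce()
# Quiesce arrived first: the waiter wakes up but must not discard.
print(w.wait_for_sb_or_quiesce(timeout=1))
```

A caller would discard and advance its checkpoint only when the return value is True, and otherwise leave its state untouched and exit.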
NeilBrown
* Re: bootsect replicated in p1, RAID enclosure suggestions?
From: travis+ml-linux-raid @ 2016-08-24 2:14 UTC (permalink / raw)
To: linux-raid
In-Reply-To: <20160823050947.GL32250@subspacefield.org>
$ mdadm --version
mdadm - v3.2.5 - 18th May 2012
$ uname -a
Linux hostname 3.2.0-107-generic #148-Ubuntu SMP Mon Jul 18 20:22:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"
And I think there must be a bug in referencing the beginning of a
partition vs the beginning of the disk which leads to this. Back when
I was using raw disk devices I had corruption in the first cylinders
which also held the mdlabel and I thought the lack of a partition
table was the problem... obviously not.
Could very well be a bug in USB enclosure firmware too. Hard to know
how to proceed.
On Mon, Aug 22, 2016 at 10:09:47PM -0700, travis+ml-linux-raid@subspacefield.org wrote:
> Hello all,
>
> So I have an Intel NUC (for low power Linux) plugged via USB into a 4
> bay enclosure doing linear (yeah I know; it's the backup server, the
> primary is raid10).
>
> And every once in a while, this happens (*see end). The partition 1
> that would normally contain a MD slice ends up being a replica of the
> boot cylinder. I can't tell if it's the mdraid linear impl, the
> kernel doing something weird, the USB drivers, the enclosure firmware,
> or what.
>
> Anyway, this happened while I was restoring a Windows machine whose
> root drive suddenly took a nosedive, and it happens every 6 months
> or so. Today it happened while I was in the middle of recovering
> a Windows machine whose 1TB SSD threw up on C: and totally nuked
> the data.
>
> The last low-power option I tried was an OpenRD Ultimate based around
> ARMv5TE which was basically unsupported by debian by the time I got
> it, and subsequently became ultra-flaky due to what seemed to be RAM
> problems - it was crashing every 3 days with kernel panics, and every
> once in a while would do something worse.
>
> Any recommendations on a low power hardware with a well-supported
> distro, that matches up well with a real backplane and SATA
> connections instead of USB. The only caveat is that I want to encrypt
> raw disks and it has to not be very noisy - so no rackmount gear
> with 65dB 1" dog whistle fans. Obviously, whatever backplane must
> be well-supported by the distro.
>
> Also, does anyone have experience with cryptsetup on multiple
> partitions? I can do that but get prompted multiple times and I was
> wondering if anyone knew an easy way to fix the boot time scripts to
> avoid that, only prompting once per unique underlying crypttab.
>
> And finally, I have a story about buggy drive firmware that you
> might enjoy, especially if you were doing this sort of stuff in
> the 90s as well. Cheers:
>
> http://www.subspacefield.org/security/hard_drives_of_doom/
>
>
> [*]
>
> # parted /dev/sde
> GNU Parted 2.3
> Using /dev/sde
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: WDC WD40 EFRX-68WT0N0 (scsi)
> Disk /dev/sde: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number Start End Size File system Name Flags
> 1 1049kB 4001GB 4001GB Linux RAID raid
>
> (parted) q
> # parted /dev/sdd1
> GNU Parted 2.3
> Using /dev/sdd1
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: Unknown (unknown)
> Disk /dev/sdd1: 4001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number Start End Size File system Name Flags
> 1 1049kB 4001GB 4001GB Linux RAID raid
>
> --
> http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
> "Computer crime, the glamor crime of the 1970s, will become in the
> 1980s one of the greatest sources of preventable business loss."
> John M. Carroll, "Computer Security", first edition cover flap, 1977
--
http://www.subspacefield.org/~travis/ | if spammer then john@subspacefield.org
"Computer crime, the glamor crime of the 1970s, will become in the
1980s one of the greatest sources of preventable business loss."
John M. Carroll, "Computer Security", first edition cover flap, 1977
^ permalink raw reply
* Re: kernel checksumming performance vs actual raid device performance
From: Shaohua Li @ 2016-08-24 1:02 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm
In-Reply-To: <CAJvUf-C-Nr8sSnSPL-5jt1NLOAiZjhZ=bjDRUbX_RjphRL+yWA@mail.gmail.com>
On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs. We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads). This system is an NFS server for
> about 50 compute nodes that continually read its data.
>
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place. The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
>
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
>
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
>
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
>
> Dmesg seems to give some hints:
>
> [ 6.386820] xor: automatically using best checksumming function:
> [ 6.396690] avx : 24064.000 MB/sec
> [ 6.414706] raid6: sse2x1 gen() 7636 MB/s
> [ 6.431725] raid6: sse2x2 gen() 3656 MB/s
> [ 6.448742] raid6: sse2x4 gen() 3917 MB/s
> [ 6.465753] raid6: avx2x1 gen() 5425 MB/s
> [ 6.482766] raid6: avx2x2 gen() 7593 MB/s
> [ 6.499773] raid6: avx2x4 gen() 8648 MB/s
> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [ 6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> Perhaps naively, I would expect that second-to-last line:
>
> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
>
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput? Is there a way I can "convert" that number
> to expected throughput of a degraded array?
In non-degraded mode, raid6 just dispatches IO directly to the member disks, so
software involvement is very small. In degraded mode, the missing data must be
calculated. A lot of factors impact the performance:
1. Entering the raid6 state machine, which has a long code path. (This is
debatable: if a read doesn't touch the faulty disk and it's a small random read,
raid6 doesn't need to run the state machine. Fixing this could hugely improve
the performance.)
2. The state machine runs in a single thread, which is a bottleneck. Try
increasing group_thread_cnt, which makes the handling multi-threaded.
3. The stripe cache is involved. Try increasing stripe_cache_size.
4. The faulty disk's data must be calculated, which involves reading from the
other disks. If this is a NUMA machine and each disk interrupts different
cpus/nodes, there will be a big impact (cache, wakeup IPI).
5. The xor calculation overhead. Actually I don't think the impact is big;
a modern cpu can do the calculation fast.
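For points 2 and 3, both knobs live under /sys/block/<md>/md/. A minimal
sketch of reading and setting them follows. The helper names and the
sysfs_root parameter are made up for illustration, writing needs root, and
suitable values depend on the machine.

```python
from pathlib import Path


def md_tunable_path(md_name, tunable, sysfs_root="/sys"):
    """Return the sysfs path of an md tunable, e.g. /sys/block/md0/md/stripe_cache_size."""
    return Path(sysfs_root) / "block" / md_name / "md" / tunable


def read_md_tunable(md_name, tunable, sysfs_root="/sys"):
    """Read an integer md tunable such as stripe_cache_size or group_thread_cnt."""
    return int(md_tunable_path(md_name, tunable, sysfs_root).read_text().strip())


def write_md_tunable(md_name, tunable, value, sysfs_root="/sys"):
    """Write an integer md tunable (requires root on a real system)."""
    md_tunable_path(md_name, tunable, sysfs_root).write_text(f"{int(value)}\n")
```

For example, write_md_tunable("md0", "stripe_cache_size", 16384) followed by
write_md_tunable("md0", "group_thread_cnt", 4), where 4 is only a placeholder
to experiment with.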
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH v2] raid10: record correct address of bad block
From: Shaohua Li @ 2016-08-24 0:12 UTC (permalink / raw)
To: Tomasz Majchrzak
Cc: linux-raid, aleksey.obitotskiy, pawel.baldysiak,
artur.paszkiewicz, maksymilian.kunt
In-Reply-To: <1471942437-16720-1-git-send-email-tomasz.majchrzak@intel.com>
On Tue, Aug 23, 2016 at 10:53:57AM +0200, Tomasz Majchrzak wrote:
> For failed write request record block address on a device, not block
> address in an array.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
> drivers/md/raid10.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index cfa96b5..cd8d197 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2465,18 +2465,19 @@ static int narrow_write_error(struct r10bio *r10_bio, int i)
>
> while (sect_to_write) {
> struct bio *wbio;
> + sector_t wsector;
> if (sectors > sect_to_write)
> sectors = sect_to_write;
> /* Write at 'sector' for 'sectors' */
> wbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> bio_trim(wbio, sector - bio->bi_iter.bi_sector, sectors);
> - wbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
> - choose_data_offset(r10_bio, rdev) +
> - (sector - r10_bio->sector));
> + wsector = r10_bio->devs[i].addr + (sector - r10_bio->sector);
> + wbio->bi_iter.bi_sector = wsector +
> + choose_data_offset(r10_bio, rdev);
> wbio->bi_bdev = rdev->bdev;
> if (submit_bio_wait(WRITE, wbio) < 0)
> /* Failure! */
> - ok = rdev_set_badblocks(rdev, sector,
> + ok = rdev_set_badblocks(rdev, wsector,
> sectors, 0)
> && ok;
Applied, thanks!
^ permalink raw reply
* Re: [PATCH -next] md-cluster: fix error return code in join()
From: Shaohua Li @ 2016-08-24 0:09 UTC (permalink / raw)
To: Wei Yongjun; +Cc: Wei Yongjun, linux-raid
In-Reply-To: <1471790545-3301-1-git-send-email-weiyj.lk@gmail.com>
On Sun, Aug 21, 2016 at 02:42:25PM +0000, Wei Yongjun wrote:
> From: Wei Yongjun <weiyongjun1@huawei.com>
>
> Fix to return error code -ENOMEM from the lockres_init() error
> handling case instead of 0, as done elsewhere in this function.
>
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
> ---
> drivers/md/md-cluster.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 333f0cf..2b13117 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -874,8 +874,10 @@ static int join(struct mddev *mddev, int nodes)
> goto err;
> }
> cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
> - if (!cinfo->ack_lockres)
> + if (!cinfo->ack_lockres) {
> + ret = -ENOMEM;
> goto err;
> + }
> /* get sync CR lock on ACK. */
> if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
> pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
> @@ -889,8 +891,10 @@ static int join(struct mddev *mddev, int nodes)
> pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number);
> snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1);
> cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1);
> - if (!cinfo->bitmap_lockres)
> + if (!cinfo->bitmap_lockres) {
> + ret = -ENOMEM;
> goto err;
> + }
> if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) {
> pr_err("Failed to get bitmap lock\n");
> ret = -EINVAL;
> @@ -898,8 +902,10 @@ static int join(struct mddev *mddev, int nodes)
> }
>
> cinfo->resync_lockres = lockres_init(mddev, "resync", NULL, 0);
> - if (!cinfo->resync_lockres)
> + if (!cinfo->resync_lockres) {
> + ret = -ENOMEM;
> goto err;
> + }
>
> return 0;
> err:
applied, thanks!
^ permalink raw reply
* Re: kernel checksumming performance vs actual raid device performance
From: Phil Turmel @ 2016-08-23 21:42 UTC (permalink / raw)
To: Doug Ledford, Matt Garman, Doug Dumitru; +Cc: Mdadm
In-Reply-To: <3e239b96-b06e-d33b-2e99-42ffa170d804@redhat.com>
On 08/23/2016 04:15 PM, Doug Ledford wrote:
> Your raid device has a good chunk size for your usage pattern. If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently. But, then again, maybe I'm wrong and that
> would help. With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.
This is not correct. Parity operations in MD raid4/5/6 operate on 4k
blocks. The stripe cache for an array is a collection of 4k elements
per member device. Chunk size doesn't factor into the cache itself.
But see below....
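A rough sketch of what that means for memory, assuming a 4 KiB page and the
24 members of this array: the md documentation gives the cost as
page_size * nr_disks * stripe_cache_size. The function name is made up for
illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages, as on x86_64


def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=PAGE_SIZE):
    """Memory used by the stripe cache: one page per member device per entry."""
    return stripe_cache_size * nr_disks * page_size


# 24 member devices, as in the array discussed in this thread
for entries in (256, 16384, 32768, 65536):
    mib = stripe_cache_bytes(entries, 24) / 2**20
    print(f"stripe_cache_size={entries:>6}: {mib:6.0f} MiB")
```

Under those assumptions a setting of 16384 costs about 1.5 GiB and 65536
about 6 GiB, independent of the chunk size.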
> Makes sense. I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider it memory lost. When you upped it to 64k, and
> you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
> allowed stripes which is a maximum memory consumption of around 700GB
> RAM. I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using. That also explains why setting it higher doesn't
> provide any additional benefits ;-).
More likely the parity thread saturated and no more speed was possible.
Also possible that there would be a step change in performance again at
a much larger cache size.
>> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
>> 8000 MB/s, per dmesg:
>>
>> [ 6.386820] xor: automatically using best checksumming function:
>> [ 6.396690] avx : 24064.000 MB/sec
>> [ 6.414706] raid6: sse2x1 gen() 7636 MB/s
>> [ 6.431725] raid6: sse2x2 gen() 3656 MB/s
>> [ 6.448742] raid6: sse2x4 gen() 3917 MB/s
>> [ 6.465753] raid6: avx2x1 gen() 5425 MB/s
>> [ 6.482766] raid6: avx2x2 gen() 7593 MB/s
>> [ 6.499773] raid6: avx2x4 gen() 8648 MB/s
>> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>> [ 6.499774] raid6: using avx2x2 recovery algorithm
>>
>> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
Parity operations in raid must always involve all (available) member
devices. Read operations when not degraded won't generate any parity
operations. Most large write operations and any degraded read
operations will involve all members, even if those members' data is not
part of the larger read/write request.
As chunk sizes get larger the odds grow that any given array I/O will
touch a fraction of the slice, causing I/O to members purely for parity
math. Also, the odds rise that the starting point or ending point of an
array I/O operation will not be aligned to the stripe, making more
member I/O solely for parity math.
Then add in the fact that dd issues I/O requests one block at a time,
per the bs=? parameter. So it is possible that data that would have
been sequential without parallel pressure (still in the stripe cache for
later reads) generates multiple parity calculations for fractional
stripe operations, just due to stripe size/alignment mismatch on single
dd dispatches.
What bs=? value are you using in your dd commands? Based on your 512k
chunk, it should be 10240k for aligned operations and much larger than
that for unaligned.
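A small sketch of the alignment arithmetic, with hypothetical helper names.
The full stripe is the chunk size times the number of data members, so the
result depends on the member count plugged in: 20 data members give 10240k,
while the 24-disk raid6 shown in /proc/mdstat elsewhere in this thread
(22 data members) gives 11264k.

```python
def full_stripe_kib(chunk_kib, total_disks, parity_disks):
    """Data carried by one full stripe: chunk size times the number of data members."""
    return chunk_kib * (total_disks - parity_disks)


def is_stripe_aligned(offset_kib, length_kib, stripe_kib):
    """True if a request starts and ends on full-stripe boundaries."""
    return offset_kib % stripe_kib == 0 and length_kib % stripe_kib == 0


stripe = full_stripe_kib(512, 24, 2)  # 24-disk raid6, 512k chunk
print(stripe)                                 # full stripe in KiB
print(is_stripe_aligned(0, stripe, stripe))   # one full stripe from offset 0
print(is_stripe_aligned(0, 10240, stripe))    # a 10240k request on this geometry
```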
FWIW, I use small chunk sizes -- usually 16k.
Phil
^ permalink raw reply
* [PATCH] mdopen: Prevent overrunning the devname buffer when copying devnm into it for long md names.
From: Robert LeBlanc @ 2016-08-23 20:37 UTC (permalink / raw)
To: linux-raid; +Cc: dm-devel, robert
Signed-Off: Robert LeBlanc<robert@leblancnet.us>
---
mdopen.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mdopen.c b/mdopen.c
index f818fdf..5af344b 100644
--- a/mdopen.c
+++ b/mdopen.c
@@ -144,7 +144,7 @@ int create_mddev(char *dev, char *name, int autof, int trustworthy,
struct createinfo *ci = conf_get_create_info();
int parts;
char *cname;
- char devname[20];
+ char devname[37];
char devnm[32];
char cbuf[400];
if (chosen == NULL)
--
2.9.3
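The new size can be sanity-checked with a little arithmetic. This sketch
assumes devname is filled with "/dev/" followed by devnm, which is an
inference from the patch subject and not something shown in the diff.

```python
DEV_PREFIX = "/dev/"  # assumed prefix written in front of devnm
DEVNM_BUF = 32        # size of devnm[] in the patched function


def required_devname_buf(prefix=DEV_PREFIX, devnm_buf=DEVNM_BUF):
    """Bytes needed to hold prefix + longest devnm string + terminating NUL."""
    longest_devnm = devnm_buf - 1  # devnm[] itself reserves one byte for NUL
    return len(prefix) + longest_devnm + 1


print(required_devname_buf())
```

Under that assumption the result is 37, which matches the new array size and
exceeds the old 20-byte buffer.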
^ permalink raw reply related
* Re: [RFC] Some fixes to allow for more than 128 md devices.
From: Robert LeBlanc @ 2016-08-23 20:37 UTC (permalink / raw)
To: linux-raid; +Cc: dm-devel, Robert LeBlanc
In-Reply-To: <CAANLjFqv33upkC5tLN8i77ysZCcFWqRYpqVUduBaCCAEOcZAqA@mail.gmail.com>
I found an email thread [0] talking about the new way to do this. I
did find a buffer overrun and will submit a patch for it.
Robert LeBlanc
[0] http://www.spinics.net/lists/raid/msg52300.html
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Mon, Aug 22, 2016 at 10:03 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> Apparently, the mdadm source on git-kernel.org (commit 13db17bd)
> already has the fixes to properly create the device nodes, but I still
> have the unexpected failure opening /dev/md1048574.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Aug 19, 2016 at 8:10 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> I'm stuck and need some help getting this across the finish line. This
>> is in no way complete, but to help show what I'm working on.
>>
>> When we added more than 128 md devices, we started getting failures.
>> Looking through the code it seems that the minor dev number was being
>> stored in an int and causing overflow and wreaking havoc on everything.
>> I finally got the mknod in mdadm to correctly make the dev node with
>> minors up to 1048574 as expected in the mdadm code. However, I can
>> only create md devices up to 511. Trying to create an md higher than
>> that fails because the device can't be read/opened. strace reports:
>> open("/dev/.tmp.md.15341:9:1048574", O_RDWR|O_EXCL|O_DIRECT) = -1 ENXIO
>> (No such device or address)
>> while Python reports:
>> IOError: [Errno 6] No such device or address: '/dev/.tmp.md.3279:9:512'
>>
>> A corresponding node is not created in /sys/block/md* for mds over 511.
>>
>> I believe that there may be a bug in the kernel code that is now being
>> hit. After looking through the kernel code, I can't seem to find where
>> this might be. Please help me by either pointing me to the source
>> location that this might be a problem or fixing it based on these
>> patches I've worked on so far. I'm using 4.7.0 currently.
>>
>> I'm using this for testing:
>> ./mdadm --create /dev/md1048574 --assume-clean --verbose --level=1 \
>> --raid-devices=2 /dev/loop0 missing
>>
>> Yes, we have a real need for more than 128 and 512 md devices.
>>
>> Please include me in any replies as I'm not on the ML.
>>
>> Thank you.
>>
>> Robert LeBlanc (1):
>> Some fixes to allow for more than 128 md devices.
>>
>> Manage.c | 5 +++--
>> lib.c | 2 +-
>> mdadm.h | 6 +++---
>> util.c | 25 +++++++++++++------------
>> 4 files changed, 20 insertions(+), 18 deletions(-)
>>
>> --
>> 2.8.1
>>
^ permalink raw reply
* Re: kernel checksumming performance vs actual raid device performance
From: Doug Ledford @ 2016-08-23 20:15 UTC (permalink / raw)
To: Matt Garman, Doug Dumitru; +Cc: Mdadm
In-Reply-To: <CAJvUf-ApkKJXm7Jjiq=gXY9b9RrEvwA5u35xrMUjX2x0btVL4g@mail.gmail.com>
On 8/23/2016 3:26 PM, Matt Garman wrote:
> Doug & Doug,
>
> Thank you for your helpful replies. I merged both of your posts into
> one, see inline comments below:
>
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course. I didn't mean to imply otherwise. The read size is the read
>> size. But, since the OPs test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files. That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
>
> Yes, multiple parallel sequential reads. Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers). The file generation is a one-time thing,
> and we don't really care about its performance.
>
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd". But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.
OK, 50 sequential I/Os at a time. Good point to know.
>
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array. I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array? If so, then
>> that's 22 data disks and 2 parity disks per stripe. I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
>
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
>
> 44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
>
> bitmap: 0/15 pages [0KB], 65536KB chunk
Your raid device has a good chunk size for your usage pattern. If you
had a smallish chunk size (like 64k or 32k), I would actually expect
things to behave differently. But, then again, maybe I'm wrong and that
would help. With a smaller chunk size, you would be able to fit more
stripes in the stripe cache using less memory.
>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data. With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result. If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed. I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384. Doubling it again to 32k didn't
> seem to bring any further benefit.
Makes sense. I know the stripe cache size is conservative by default
because of the fact that it's not shared with the page cache, so you
might as well consider it memory lost. When you upped it to 64k, and
you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
allowed stripes which is a maximum memory consumption of around 700GB
RAM. I doubt you have that much in your machine, so I'm guessing it's
simply using all available RAM that the page cache or something else
isn't already using. That also explains why setting it higher doesn't
provide any additional benefits ;-).
> So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state. When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top. Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
You probably have maxed out your single CPU performance and won't see
any benefit without having a multi-threaded XOR routine.
> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
>
>> 1) 200MB/s of XOR is not insignificant. Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this. Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
> course affects the overall system.
>
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
>
> [ 6.386820] xor: automatically using best checksumming function:
> [ 6.396690] avx : 24064.000 MB/sec
> [ 6.414706] raid6: sse2x1 gen() 7636 MB/s
> [ 6.431725] raid6: sse2x2 gen() 3656 MB/s
> [ 6.448742] raid6: sse2x4 gen() 3917 MB/s
> [ 6.465753] raid6: avx2x1 gen() 5425 MB/s
> [ 6.482766] raid6: avx2x2 gen() 7593 MB/s
> [ 6.499773] raid6: avx2x4 gen() 8648 MB/s
> [ 6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [ 6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> I'm assuming however the kernel does its testing is fairly optimal,
It is *highly* optimal. What's more, it uses 100% CPU during this time.
The raid6 thread doing your recovery is responsible for lots of stuff,
issuing reads, doing xor, fulfilling write requests, maintaining the
cache, etc. It has to have time to actually do other work. So start
with that 8GB/s figure, but immediately start subtracting from that
because the CPU needs to do other things as well. Then remember that we
are under *extreme* memory pressure. When you have to bring in 22 reads
in order to reconstruct just 1 block of the same size, then for 100MB/s
of degraded reads you are generating 2200MB/s of PCI DMA -> MEM
bandwidth consumption, followed by 2200MB/s of MEM -> register load
bandwidth consumption, then I'd have to read the avx xor routine to know
how much write bandwidth it is using, but it's at least 100MB/s of
bandwidth, and likely at least four or five times that much because it
probably doesn't do all 22 blocks in a single xor pass, it likely loads
parity, then reads up to maybe four blocks and xors them together and
then stores the parity, so each pass will re-read and re-store the
parity block. The point of all of this is that people forget to do the
math on the memory bandwidth used by these XOR operations. The faster
they are, the higher the percentage of main memory bandwidth you are
consuming. Now you have to subtract all of that main memory bandwidth
from the total main memory bandwidth for the CPU, and what's left over
is all you have for doing other productive work. Even if you aren't
blowing your caches doing all of this XOR work, you are blowing your
main memory bandwidth. Other threads or other actions end up stalling
waiting on main memory accesses to complete.
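A back-of-the-envelope version of that accounting, using only the figures
above (22 blocks read per reconstructed block, stores at least equal to the
rebuilt data). The function name is made up, and the real store traffic is
likely several times the minimum.

```python
def reconstruction_traffic_mb_s(rebuilt_mb_s, blocks_read_per_rebuilt_block=22):
    """Lower-bound memory traffic for reconstructing data at rebuilt_mb_s."""
    dma_in = rebuilt_mb_s * blocks_read_per_rebuilt_block  # PCI DMA -> memory
    cpu_loads = dma_in                                     # memory -> registers for XOR
    min_stores = rebuilt_mb_s                              # at least the result written back
    return {
        "dma_in": dma_in,
        "cpu_loads": cpu_loads,
        "min_stores": min_stores,
        "total_min": dma_in + cpu_loads + min_stores,
    }


print(reconstruction_traffic_mb_s(100))
```

For 100MB/s of reconstructed reads this gives at least 4500MB/s of memory
traffic before any of the other work the thread has to do.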
> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests...
It will never be that good, and you can thank your stars that it isn't,
because if it were, your computer would be ground to a halt with nothing
happening but data XOR computations.
> but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...
The math fits. Most quad channel Intel CPUs have memory bandwidths in
the 50GByte/s range theoretical maximum, but it's not bidirectional,
it's not even multi-access, so you have to remember that the usage looks
like this on a good read:
copy 1: DMA from PCI bus to main memory
copy 2: Load from main memory to CPU for copy_to_user
copy 3: Store from CPU to main memory for user
To get 8GB/s of read performance undegraded then requires 24GB/s of
actual memory bandwidth just for the copies. That's half of your entire
memory bandwidth (unless you have multiple sockets, then things get more
complex, but this is still true for one socket of the multiple socket
machine). Once you add the XOR routine into the figure, the 3 accesses
is the same for part of it, but for degraded fixups, it is much worse.
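The same estimate as a quick calculation, taking the three copies and the
roughly 50GB/s theoretical figure from the text as given. The names are
illustrative only.

```python
def copy_bandwidth_gb_s(read_gb_s, copies=3):
    """Memory bandwidth consumed when each byte read crosses memory `copies` times."""
    return read_gb_s * copies


needed = copy_bandwidth_gb_s(8)   # DMA in, load for copy_to_user, store for user
share = needed / 50               # against a ~50GB/s theoretical maximum
print(needed, f"{share:.0%}")
```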
> Is it possible to pin kernel threads to a CPU? I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...
You could try that, but I doubt it will affect much.
>> Possible fixes for this might include:
>> c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?
Yes. The default setting is conservative, you told it to use as much
memory as it needed.
>> d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.
That's a huge waste, are you sure he didn't use raid0 for the stripe?
> So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance. I do intend to test a pure software
> raid-50 implementation.
I would try it. If you are OK with single disk failures anyway.
>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>
> I'm certain head movement time isn't the issue, as these are SSDs. :)
Fair enough ;-). And given these are SSDs, I'd be just fine doing
something like four 6 disk raid5s then striped in a raid0 myself. The
main cause for concern with spinning disks is latent bad sectors causing
a read error on rebuild, with SSDs that's much less of a concern.
> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up. Even better would be
>> a perf capture, but you might not have all the tools installed. You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 1228 root 20 0 0 0 0 R 100.0 0.0 562:16.83 md0_raid6
> 1315 root 20 0 4372 684 524 S 17.3 0.0 57:20.92 rngd
> 107 root 20 0 0 0 0 S 9.6 0.0 65:16.63 kswapd0
> 108 root 20 0 0 0 0 S 8.6 0.0 65:19.58 kswapd1
> 19424 root 20 0 108972 1676 560 D 3.3 0.0 0:00.52 dd
> 6909 root 20 0 108972 1676 560 D 2.7 0.0 0:01.53 dd
> 18383 root 20 0 108972 1680 560 D 2.7 0.0 0:00.63 dd
>
>
> I truncated the output. The "dd" processes are part of our testing
> tool that generates the huge read load on the array. Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four. (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)
I would try to tune your stripe cache size such that the kswapd?
processes go to sleep. Those are reading/writing swap. That won't help
your overall performance.
> Here is a representative view of a non-first iteration of "iostat -mxt 5":
>
>
> 08/23/2016 01:37:59 PM
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4.84 0.00 27.41 67.59 0.00 0.17
>
> Device:    rrqm/s wrqm/s      r/s   w/s    rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdy          0.00   0.40     0.80  0.60     0.05  0.00    83.43     0.00   1.00    0.50    1.67   1.00   0.14
> sdz          0.00   0.40     0.00  0.60     0.00  0.00    10.67     0.00   2.00    0.00    2.00   2.00   0.12
> sdd      12927.00   0.00   204.40  0.00    51.00  0.00   511.00     5.93  28.75   28.75    0.00   4.31  88.10
I'm not sure how much I trust some of these numbers. According to this,
you are issuing 200 read/s, at an average size of 511KB, which should
work out to roughly 100MB/s of data read, but rMB/s is only 51. I
wonder if the read requests from the raid6 thread are bypassing the
rMB/s accounting because they aren't coming from the VFS or some such?
It would explain why the rMB/s is only half of what it should be based
upon requests and average request size.
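One possible explanation, assuming avgrq-sz is reported in 512-byte sectors
as the iostat man page describes, and not in KB: the sdd row can then be
recomputed like this (the function name is made up for illustration).

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in sectors


def read_mib_per_s(reads_per_s, avgrq_sz_sectors):
    """Throughput implied by request rate and average request size in sectors."""
    return reads_per_s * avgrq_sz_sectors * SECTOR_BYTES / 2**20


# sdd row: 204.40 r/s with avgrq-sz 511.00
print(round(read_mib_per_s(204.40, 511.00), 2))
```

Under that assumption the result is about 51, in line with the rMB/s column,
so the accounting would be consistent.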
> sde      13002.60   0.00   205.20  0.00    51.20  0.00   511.00     6.29  30.39   30.39    0.00   4.59  94.12
> sdf      12976.80   0.00   205.00  0.00    51.00  0.00   509.50     6.17  29.76   29.76    0.00   4.57  93.78
> sdg      12950.20   0.00   205.60  0.00    50.80  0.00   506.03     6.20  29.75   29.75    0.00   4.57  93.88
> sdh      12949.00   0.00   207.20  0.00    50.90  0.00   503.11     6.36  30.35   30.35    0.00   4.59  95.10
> sdb      12196.40   0.00   192.60  0.00    48.10  0.00   511.47     5.48  28.15   28.15    0.00   4.38  84.36
> sda          0.00   0.00     0.00  0.00     0.00  0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> sdi      12923.00   0.00   208.40  0.00    51.00  0.00   501.20     6.79  32.31   32.31    0.00   4.65  96.84
> sdj      12796.20   0.00   206.80  0.00    50.50  0.00   500.12     6.62  31.73   31.73    0.00   4.62  95.64
> sdk      12746.60   0.00   204.00  0.00    50.20  0.00   503.97     6.38  30.77   30.77    0.00   4.60  93.86
> sdl      12570.00   0.00   202.20  0.00    49.70  0.00   503.39     6.39  31.19   31.19    0.00   4.63  93.68
> sdn      12594.00   0.00   204.20  0.00    49.95  0.00   500.97     6.40  30.99   30.99    0.00   4.58  93.54
> sdm      12569.00   0.00   203.80  0.00    49.90  0.00   501.45     6.30  30.58   30.58    0.00   4.45  90.60
> sdp      12568.80   0.00   205.20  0.00    50.10  0.00   500.03     6.37  30.79   30.79    0.00   4.52  92.72
> sdo      12569.20   0.00   204.00  0.00    49.95  0.00   501.46     6.40  31.07   31.07    0.00   4.58  93.42
> sdw      12568.60   0.00   206.20  0.00    50.00  0.00   496.60     6.34  30.71   30.71    0.00   4.24  87.48
> sdx      12038.60   0.00   197.40  0.00    47.60  0.00   493.84     6.01  30.21   30.21    0.00   4.40  86.86
> sdq      12570.20   0.00   204.20  0.00    50.15  0.00   502.97     6.23  30.41   30.41    0.00   4.44  90.68
> sdr      12571.00   0.00   204.60  0.00    50.25  0.00   502.99     6.15  30.26   30.26    0.00   4.18  85.62
> sds      12495.20   0.00   203.80  0.00    49.95  0.00   501.95     6.00  29.62   29.62    0.00   4.24  86.38
> sdu      12695.60   0.00   207.80  0.00    50.65  0.00   499.17     6.22  30.00   30.00    0.00   4.16  86.38
> sdv      12619.00   0.00   207.80  0.00    50.35  0.00   496.22     6.23  30.03   30.03    0.00   4.20  87.32
> sdt      12671.20   0.00   206.20  0.00    50.50  0.00   501.56     6.05  29.30   29.30    0.00   4.24  87.44
> sdc      12851.60   0.00   203.00  0.00    50.70  0.00   511.50     5.84  28.49   28.49    0.00   4.17  84.64
> md126        0.00   0.00     0.60  1.00     0.05  0.00    71.00     0.00   0.00    0.00    0.00   0.00   0.00
> dm-0         0.00   0.00     0.60  0.80     0.05  0.00    81.14     0.00   2.29    0.67    3.50   1.14   0.16
> dm-1         0.00   0.00     0.00  0.00     0.00  0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> md0          0.00   0.00  4475.20  0.00  1110.95  0.00   508.41     0.00   0.00    0.00    0.00   0.00   0.00
>
>
> sdy and sdz are the system drives, so they are uninteresting.
>
> sda is the md0 drive I failed, that's why it stays at zero.
>
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
>
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead Command Shared Object Symbol
> 52.85% swapper [kernel.kallsyms] [k] cpu_startup_entry
> 4.47% md0_raid6 [kernel.kallsyms] [k] memcpy
> 3.39% dd [kernel.kallsyms] [k] __find_stripe
> 2.50% md0_raid6 [kernel.kallsyms] [k] analyse_stripe
> 2.43% dd [kernel.kallsyms] [k] _raw_spin_lock_irq
> 1.75% rngd rngd [.] 0x000000000000288b
> 1.74% md0_raid6 [kernel.kallsyms] [k] xor_avx_5
> 1.49% dd [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 1.33% md0_raid6 [kernel.kallsyms] [k] ops_run_io
> 0.65% dd [kernel.kallsyms] [k] raid5_compute_sector
> 0.60% md0_raid6 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 0.55% ps libc-2.17.so [.] _IO_vfscanf
> 0.53% ps [kernel.kallsyms] [k] vsnprintf
> 0.51% ps [kernel.kallsyms] [k] format_decode
> 0.47% ps [kernel.kallsyms] [k] number.isra.2
> 0.41% md0_raid6 [kernel.kallsyms] [k] raid_run_ops
> 0.40% md0_raid6 [kernel.kallsyms] [k] __blk_segment_map_sg
>
>
> That's my first time using the perf tool, so I need a little hand-holding here.
You might get more interesting perf results if you could pin the md
raid6 thread to a single CPU and then filter the perf results to just
that CPU.
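
A minimal sketch of one way to do that, assuming a Linux box with perf installed, root privileges, and a kernel that allows the md thread's affinity to be changed. The thread name md0_raid6 is taken from the perf output above; the CPU number, the 30 second duration, and the helper names are hypothetical choices. The script only prints the perf command and leaves the actual pinning to the caller.

```python
import os


def find_pids_by_comm(comm):
    """Return PIDs whose /proc/<pid>/comm equals comm (empty list if /proc is absent)."""
    pids = []
    if not os.path.isdir("/proc"):
        return pids
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == comm:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def pin_to_cpu(pid, cpu):
    """Restrict pid to a single CPU; pinning other users' or kernel threads needs root."""
    os.sched_setaffinity(pid, {cpu})


def perf_record_cmd(cpu, seconds, output="perf.data"):
    """Build a perf command that samples only the given CPU for the given duration."""
    return ["perf", "record", "-g", "-C", str(cpu), "-o", output,
            "--", "sleep", str(seconds)]


CPU = 2  # hypothetical: pick a CPU that is otherwise mostly idle
raid_pids = find_pids_by_comm("md0_raid6")
print("md0_raid6 PIDs found:", raid_pids)
print("after pin_to_cpu(pid, CPU) for each PID, run:")
print(" ".join(perf_record_cmd(CPU, 30)))
```

Running `perf report -i perf.data` afterwards should then show only what ran on that CPU, so the swapper, dd, and ps samples from other CPUs drop out of the picture.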
--
Doug Ledford <dledford@redhat.com>
GPG Key ID: 0E572FDD