Linux RAID subsystem development

* "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-10 19:46 UTC
  To: Shaohua Li; +Cc: linux-raid, linux-block

Hi Shaohua,

one of the major issues with Ming Lei's multipage biovec work
is that we can't easily enable the MD RAID code for it.  I had
a quick chat on that with Chris and Jens and they suggested talking
to you about it.

It's mostly about the RAID1 and RAID10 code which does a lot of funny
things with the bi_io_vec and bi_vcnt fields, which we'd prefer that
drivers don't touch.  One example is the r1buf_pool_alloc code,
which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
case, which would also take care of r1buf_pool_free.  I'm not sure
about all the other cases, as some bits don't fully make sense to me,
e.g. why we're trying to do single page I/O out of a bigger bio.

Maybe you have some better ideas what's going on there?

Another not quite as urgent issue is how the RAID5 code abuses
->bi_phys_segments as an outstanding I/O counter, and I have no
really good answer to that either.

* Re: Help in recovering a RAID5 volume
From: Wols Lists @ 2016-11-10 18:58 UTC
  To: Felipe Kich; +Cc: linux-raid
In-Reply-To: <CA+GmhESqEQd6J3CSteR08qw907NnC7r1ewecs3NkWUmpei6BVw@mail.gmail.com>

On 10/11/16 17:47, Felipe Kich wrote:
> Hi Anthony,
> 
> Thanks for the reply. Here's some answers to your questions and also
> another question.
> 
> It really seems that 2 disks are bad, but 2 are still good, according
> to SMART. I'll replace them ASAP.
> For now, I don't need to increase the array size. It's more than
> enough for what I need.
> 
You might find the extra price of larger drives is minimal. It's down to
you. And even 2TB drives would give you the space to go raid-6.

> About the drive duplication, I don't have spare discs available now
> for that, I only have one 4TB disk at hand, so I'd like to know if
> it's possible to create device images that I can mount and try to
> rebuild the array, to test if it would work, then I can go and buy new
> disks to replace the defective ones.

Okay, if you've got a 4TB drive ...

I can't remember what the second bad drive was ... iirc the one that was
truly dud was sdc ...

So. What I'd do is create two partitions on the 4TB that are the same
size as (or possibly slightly larger than) your sdx1 partition. ddrescue
the 1 partition from the best of the dud drives across. Create two
partitions the same size as (or larger than) your sdx2 partition, and
likewise ddrescue the 2 partition.
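
As a rough check that one 4TB disk has room for this, the sketch below
adds up two copies of each partition, using the sector counts that
mdadm --examine /dev/sda reports in the original post. The 4TB figure
is taken as a nominal 4 * 10^12 bytes, which is an assumption (the real
drive will differ slightly), and the function name is only illustrative.

```python
SECTOR_BYTES = 512  # logical sector size reported by smartctl for these drives

# Partition sizes in sectors, from `mdadm --examine /dev/sda` in the original post.
PART1_SECTORS = 4080509      # sdX1, member of the small raid1 array
PART2_SECTORS = 1949444658   # sdX2, member of the raid5 array


def bytes_needed(copies: int) -> int:
    """Bytes needed to hold `copies` images of each of the two partitions."""
    return copies * (PART1_SECTORS + PART2_SECTORS) * SECTOR_BYTES


NOMINAL_4TB = 4 * 10**12  # assumption: nominal capacity, not a measured value

needed = bytes_needed(copies=2)
print(f"needed: {needed / 1e12:.2f} TB, fits: {needed < NOMINAL_4TB}")
```

Two copies of both partitions come to roughly half of the nominal
capacity, so partitions "slightly larger" than the originals still fit
comfortably.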

Do a --force assembly, and then mount the arrays read-only. The
partition should be fine. Look over it and see. I think you can do a
fsck without it actually changing anything. fsck will probably find a
few problems.

If everything's fine, add in the other two partitions and let it rebuild.

And then replace the drives as quickly as possible. With this setup
you're critically vulnerable to the 4TB failing. Read up on the
--replace option to replace the drives with minimal risk.
> 
> And sure, I'll send you the logs you asked, no problem.
> 
> Regards.
> 
Ta muchly.

Cheers,
Wol

* Re: Help in recovering a RAID5 volume
From: Felipe Kich @ 2016-11-10 17:47 UTC
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <5824A918.3030300@youngman.org.uk>

Hi Anthony,

Thanks for the reply. Here's some answers to your questions and also
another question.

It really seems that 2 disks are bad, but 2 are still good, according
to SMART. I'll replace them ASAP.
For now, I don't need to increase the array size. It's more than
enough for what I need.

About the drive duplication, I don't have spare discs available now
for that, I only have one 4TB disk at hand, so I'd like to know if
it's possible to create device images that I can mount and try to
rebuild the array, to test if it would work, then I can go and buy new
disks to replace the defective ones.

And sure, I'll send you the logs you asked, no problem.

Regards.

-
Felipe Kich
51-9622-2067


2016-11-10 15:06 GMT-02:00 Wols Lists <antlists@youngman.org.uk>:
> On 10/11/16 15:41, Felipe Kich wrote:
>> So, with that info, I could verify some things that are frequently
>> mentioned on the posts:
>> - SCT Error Recovery Control is disabled for both Read and Write operations;
>> - Events counter in the devices are the same, except for one disk, but
>> the difference is small (<50);
>> - Magic Numbers and Checksums are all correct;
>>
>> Hope someone can give some advice as how to proceed next.
>>
> Okay. It says the drives are failing, so the first thing is to go out
> and get four new drives :-( Ouch!
>
> Preferably WD Reds or Seagate NAS (Toshibas seem to support ERC too, I'm
> not sure...)
>
> DON'T TOUCH A 3TB BARRACUDA. Barracudas aren't a good idea but the 3TB
> disk is apparently an especially bad choice.
>
> Do you want to upgrade your array size? Or do you want to go Raid-6?
> Four 2TB drives will give you a 4TB Raid-6 array. And look at getting 3-
> or 4TB drives, they're good value for money. You might decide it's not
> worth it.
>
> Copy and replace all the failing drives with ddrescue. Hopefully you'll
> get a perfect copy. Don't worry that the old drive is smaller than the
> new one if you get 2TB or larger drives.
>
> Assuming everything copies fine, find the three drives that are copies
> of sda, sdb, sdd (ie the ones with the highest event counts), and
> assemble with --force. You should now have a new array working fine. Do
> a fsck to make sure everything's okay - you'll probably lose a file or
> two :-(
>
> Add in the fourth disk - it'll trigger a rebuild, but that's normal.
>
> Now if your new disks are bigger than the old ones, you can expand the
> array to use the space. You can either create a new partition in the
> empty drive space for a third array, or you can use a utility to
> move/expand the partitions. If you take the latter step, you should be
> able to convert your raid-5 to a raid-6 (I'll let the experts chime in
> on that). You can then expand the array to use all the available space,
> and expand the filesystem on the array to use it.
>
> NB: If you don't get a perfect ddrescue copy, can you please email me
> the log files - especially where it logs the blocks it can't copy. One
> of the things I want to do is work out how to write that utility
> mentioned on the "programming" page of the wiki.
>
> Cheers,
> Wol
>

* Re: Help in recovering a RAID5 volume
From: Wols Lists @ 2016-11-10 17:32 UTC
  To: Felipe Kich, linux-raid
In-Reply-To: <5824A918.3030300@youngman.org.uk>

On 10/11/16 17:06, Wols Lists wrote:
> Add in the fourth disk - it'll trigger a rebuild, but that's normal.
> 
Just had a thought. Especially if you get larger drives, and you can
identify and copy just the three good disks, then don't bother with the
bad one.

Just partition the new fourth disk the way you plan to do it, and then
add it back in. You can then use the utilities to re-arrange the other
drives.

Or, and it's a bit more work, partition the new drives the way you want,
and ddrescue the old drives partition by partition, rather than a drive
at a time. But it'll save moving the partitions around later.

Cheers,
Wol

* Re: Help in recovering a RAID5 volume
From: Wols Lists @ 2016-11-10 17:06 UTC
  To: Felipe Kich, linux-raid
In-Reply-To: <CA+GmhESH-2hDJrOjvNzVBsgCP65BnUE2AbfOZ1Bbrp02jyBNUQ@mail.gmail.com>

On 10/11/16 15:41, Felipe Kich wrote:
> So, with that info, I could verify some things that are frequently
> mentioned on the posts:
> - SCT Error Recovery Control is disabled for both Read and Write operations;
> - Events counter in the devices are the same, except for one disk, but
> the difference is small (<50);
> - Magic Numbers and Checksums are all correct;
> 
> Hope someone can give some advice as how to proceed next.
> 
Okay. It says the drives are failing, so the first thing is to go out
and get four new drives :-( Ouch!

Preferably WD Reds or Seagate NAS (Toshibas seem to support ERC too, I'm
not sure...)

DON'T TOUCH A 3TB BARRACUDA. Barracudas aren't a good idea but the 3TB
disk is apparently an especially bad choice.

Do you want to upgrade your array size? Or do you want to go Raid-6?
Four 2TB drives will give you a 4TB Raid-6 array. And look at getting 3-
or 4TB drives, they're good value for money. You might decide it's not
worth it.
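
The capacity arithmetic behind that is simple: raid-5 gives up one
member's worth of space to parity and raid-6 gives up two. A small
sketch (the function name is illustrative, not anything from mdadm),
checked against the Array Size that mdadm --examine reports for the
existing raid-5 in the original post:

```python
def usable_capacity(level: int, members: int, member_size: int) -> int:
    """Usable size of a raid-5 or raid-6 array, in the same unit as member_size."""
    parity = {5: 1, 6: 2}[level]  # members' worth of space used for parity
    if members <= parity:
        raise ValueError("no space left for data with this few members")
    return (members - parity) * member_size


# Four 2TB drives as raid-6.
print(usable_capacity(6, 4, 2), "TB")

# Existing array: Used Dev Size is 1949444352 sectors of 512 bytes,
# i.e. 974722176 KiB; mdadm reports Array Size in KiB.
print(usable_capacity(5, 4, 1949444352 // 2), "KiB")
```

The second figure matches the "Array Size : 2924166528" line in the
--examine output for the raid-5 members.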

Copy and replace all the failing drives with ddrescue. Hopefully you'll
get a perfect copy. Don't worry that the old drive is smaller than the
new one if you get 2TB or larger drives.

Assuming everything copies fine, find the three drives that are copies
of sda, sdb, sdd (ie the ones with the highest event counts), and
assemble with --force. You should now have a new array working fine. Do
a fsck to make sure everything's okay - you'll probably lose a file or
two :-(

Add in the fourth disk - it'll trigger a rebuild, but that's normal.

Now if your new disks are bigger than the old ones, you can expand the
array to use the space. You can either create a new partition in the
empty drive space for a third array, or you can use a utility to
move/expand the partitions. If you take the latter step, you should be
able to convert your raid-5 to a raid-6 (I'll let the experts chime in
on that). You can then expand the array to use all the available space,
and expand the filesystem on the array to use it.

NB: If you don't get a perfect ddrescue copy, can you please email me
the log files - especially where it logs the blocks it can't copy. One
of the things I want to do is work out how to write that utility
mentioned on the "programming" page of the wiki.

Cheers,
Wol

* Re: Question on blocks periodic writes
From: Wols Lists @ 2016-11-10 16:10 UTC
  To: NeilBrown, Theophanis Kontogiannis, Linux RAID
In-Reply-To: <8760nwm6vu.fsf@notabene.neil.brown.name>

On 10/11/16 02:00, NeilBrown wrote:
>> [ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)
> This is XFS doing something.  md cannot possibly stop all IO while the
> filesystem performs occasional IO.  If these continue, you need to
> discuss with xfs developers how to stop it.  If the writes to individual
> drives continue after there are no writes to 'md1', then it is worth
> coming back here to ask.
> 
> 
Would the new journal feature be any help?

I haven't dug in enough to understand it properly, and it would increase
the vulnerability of the system to a journal failure, but the feature
itself seems almost perfect for batching writes and enabling the disks
to spin down for extended periods.

Cheers,
Wol

* Help in recovering a RAID5 volume
From: Felipe Kich @ 2016-11-10 15:41 UTC
  To: linux-raid

Hello,

I have an Iomega IX4-200D bought in 2009 with 4 Seagate Barracuda LP
1TB drives that came pre-installed, and since then it's been working
fine, never had any real complaints about it in those 7 years. This
week, the samba shares disappeared. Accessing the web admin page, I
saw that the shares were gone, but the disk usage was correct (1,2TB
in use / 1,5TB free), and the status of the disks was the problem.
Disks 1, 2 and 4 had an alert and disk 3 was offline. Problem is that
until then, the unit never gave any warnings or signs that the disks
could fail. Well, doesn't really matter now. So, I turned off the unit
and started reading about what can be done to recover the files
inside.

I've set up a Linux PC, connected all disks, and began collecting
information about the condition of the HDDs, partitions, all I could
find. After reading the Linux Raid wiki and lots of threads on the
topic I'm still unable to mount the RAID5 volume in question. So, I'm
posting below the info I gathered from the RAID config in hopes
someone can give me some advice. Before posting the info, I've already
read about using hard disks designed for NAS usage, SCT Error Recovery
Control support, Desktop vs Enterprise drives, etc, but that's what we
could afford to buy at the time, unfortunately.

So, here's the info I got so far:

--------------------------------------------------------------------------------
Index
--------------------------------------------------------------------------------
1) smartctl -H -i -l scterc (for all disks)
2a) mdadm --examine /dev/sda (for the disk and both partitions)
2b) mdadm --examine /dev/sdb (for the disk and both partitions)
2c) mdadm --examine /dev/sdc (for the disk and both partitions)
2d) mdadm --examine /dev/sdd (for the disk and both partitions)
3) lsdrv
4) cat /proc/mdstat

--------------------------------------------------------------------------------
1) smartctl -H -i -l scterc
--------------------------------------------------------------------------------
root@it:/home/it/Desktop# smartctl -H -i -l scterc /dev/sda
smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.0-31-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST31000520AS
Serial Number:    9VX0Y8JW
LU WWN Device Id: 5 000c50 026dca9fb
Firmware Version: CC37
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Nov 10 14:53:37 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

root@it:/home/it/Desktop# smartctl -H -i -l scterc /dev/sdb
smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.0-31-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST31000520AS
Serial Number:    9VX0WRVM
LU WWN Device Id: 5 000c50 026ca4019
Firmware Version: CC37
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Nov 10 14:54:07 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

root@it:/home/it/Desktop# smartctl -H -i -l scterc /dev/sdc
smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.0-31-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST31000520AS
Serial Number:    9VX0XD1S
LU WWN Device Id: 5 000c50 026dbdbf0
Firmware Version: CC38
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Nov 10 14:54:09 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   002   002   036    Pre-fail  Always   FAILING_NOW 4033

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

root@it:/home/it/Desktop# smartctl -H -i -l scterc /dev/sdd
smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.0-31-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST31000520AS
Serial Number:    9VX0Y9JW
LU WWN Device Id: 5 000c50 026d7169b
Firmware Version: CC38
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Nov 10 14:54:10 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   003   003   036    Pre-fail  Always   FAILING_NOW 4013

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

--------------------------------------------------------------------------------
2a) mdadm --examine /dev/sda (for the disk and both partitions)
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# mdadm --examine /dev/sda
/dev/sda:
   MBR Magic : aa55
Partition[0] :      4080509 sectors at            1 (type 83)
Partition[1] :   1949444658 sectors at      4080510 (type 83)


root@it:/home/it/Desktop# mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : ab0d7fdf:373ee9f2:5d8fd52f:304e1b90
  Creation Time : Thu May  6 20:34:46 2010
     Raid Level : raid1
  Used Dev Size : 2040128 (1992.65 MiB 2089.09 MB)
     Array Size : 2040128 (1992.65 MiB 2089.09 MB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  9 16:49:29 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : bd7c68c8 - correct
         Events : 37056

      Number   Major   Minor   RaidDevice State
this     0       8        1        0      active sync   /dev/sda1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8       17        3      active sync   /dev/sdb1


root@it:/home/it/Desktop# mdadm --examine /dev/sda2
/dev/sda2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : b570d224:d61d7f45:8352223d:f9c68ac4
           Name : storage:1
  Creation Time : Thu Feb 17 10:22:16 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1949444384 (929.57 GiB 998.12 GB)
     Array Size : 2924166528 (2788.70 GiB 2994.35 GB)
  Used Dev Size : 1949444352 (929.57 GiB 998.12 GB)
   Super Offset : 1949444640 sectors
   Unused Space : before=0 sectors, after=288 sectors
          State : clean
    Device UUID : e0b08740:62497ceb:c107ad71:6bade30e

    Update Time : Wed Nov  9 16:05:03 2016
       Checksum : 70a9b667 - correct
         Events : 161174

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)

--------------------------------------------------------------------------------
2b) mdadm --examine /dev/sdb (for the disk and both partitions)
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# mdadm --examine /dev/sdb
/dev/sdb:
   MBR Magic : aa55
Partition[0] :      4080447 sectors at           63 (type 83)
Partition[1] :   1949444658 sectors at      4080510 (type 83)


root@it:/home/it/Desktop# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : ab0d7fdf:373ee9f2:5d8fd52f:304e1b90
  Creation Time : Thu May  6 20:34:46 2010
     Raid Level : raid1
  Used Dev Size : 2040128 (1992.65 MiB 2089.09 MB)
     Array Size : 2040128 (1992.65 MiB 2089.09 MB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  9 16:49:29 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : bd7c68de - correct
         Events : 37056

      Number   Major   Minor   RaidDevice State
this     3       8       17        3      active sync   /dev/sdb1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8       17        3      active sync   /dev/sdb1


root@it:/home/it/Desktop# mdadm --examine /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : b570d224:d61d7f45:8352223d:f9c68ac4
           Name : storage:1
  Creation Time : Thu Feb 17 10:22:16 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1949444384 (929.57 GiB 998.12 GB)
     Array Size : 2924166528 (2788.70 GiB 2994.35 GB)
  Used Dev Size : 1949444352 (929.57 GiB 998.12 GB)
   Super Offset : 1949444640 sectors
   Unused Space : before=0 sectors, after=288 sectors
          State : clean
    Device UUID : c07ecc29:5939c5c0:dda4e6fd:343fbf57

    Update Time : Wed Nov  9 16:05:03 2016
       Checksum : 44ca328 - correct
         Events : 161174

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)

--------------------------------------------------------------------------------
2c) mdadm --examine /dev/sdc (for the disk and both partitions)
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# mdadm --examine /dev/sdc
/dev/sdc:
   MBR Magic : aa55
Partition[0] :      4080509 sectors at            1 (type 83)
Partition[1] :   1949444658 sectors at      4080510 (type 83)


root@it:/home/it/Desktop# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : ab0d7fdf:373ee9f2:5d8fd52f:304e1b90
  Creation Time : Thu May  6 20:34:46 2010
     Raid Level : raid1
  Used Dev Size : 2040128 (1992.65 MiB 2089.09 MB)
     Array Size : 2040128 (1992.65 MiB 2089.09 MB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  9 12:55:15 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : bd7c31b0 - correct
         Events : 37022

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       0        0        2      faulty removed
   3     3       8       17        3      active sync   /dev/sdb1


root@it:/home/it/Desktop# mdadm --examine /dev/sdc2
/dev/sdc2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : b570d224:d61d7f45:8352223d:f9c68ac4
           Name : storage:1
  Creation Time : Thu Feb 17 10:22:16 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1949444384 (929.57 GiB 998.12 GB)
     Array Size : 2924166528 (2788.70 GiB 2994.35 GB)
  Used Dev Size : 1949444352 (929.57 GiB 998.12 GB)
   Super Offset : 1949444640 sectors
   Unused Space : before=0 sectors, after=288 sectors
          State : active
    Device UUID : ceb844db:855e415a:cfc9efe5:4c2db02d

    Update Time : Wed Nov  9 12:55:49 2016
       Checksum : d39e909 - correct
         Events : 161163

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)

--------------------------------------------------------------------------------
2d) mdadm --examine /dev/sdd (for the disk and both partitions)
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# mdadm --examine /dev/sdd
/dev/sdd:
   MBR Magic : aa55
Partition[0] :      4080509 sectors at            1 (type 83)
Partition[1] :   1949444658 sectors at      4080510 (type 83)


root@it:/home/it/Desktop# mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : ab0d7fdf:373ee9f2:5d8fd52f:304e1b90
  Creation Time : Thu May  6 20:34:46 2010
     Raid Level : raid1
  Used Dev Size : 2040128 (1992.65 MiB 2089.09 MB)
     Array Size : 2040128 (1992.65 MiB 2089.09 MB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0

    Update Time : Wed Nov  9 16:49:29 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : bd7c68fa - correct
         Events : 37056

      Number   Major   Minor   RaidDevice State
this     1       8       49        1      active sync   /dev/sdd1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       8       17        3      active sync   /dev/sdb1


root@it:/home/it/Desktop# mdadm --examine /dev/sdd2
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : b570d224:d61d7f45:8352223d:f9c68ac4
           Name : storage:1
  Creation Time : Thu Feb 17 10:22:16 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 1949444384 (929.57 GiB 998.12 GB)
     Array Size : 2924166528 (2788.70 GiB 2994.35 GB)
  Used Dev Size : 1949444352 (929.57 GiB 998.12 GB)
   Super Offset : 1949444640 sectors
   Unused Space : before=0 sectors, after=288 sectors
          State : clean
    Device UUID : c95e2f61:d146c52c:dc6336fc:c2987aab

    Update Time : Wed Nov  9 16:05:03 2016
       Checksum : f9bab3b4 - correct
         Events : 161174

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : spare
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)

--------------------------------------------------------------------------------
3) lsdrv
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# ./lsdrv
PCI [ahci] 00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
├scsi 0:0:0:0 ATA      ST31000520AS     {9VX0Y8JW}
│└sda 931.51g [8:0] Partitioned (dos)
│ ├sda1 1.95g [8:1] MD raid1 (4) inactive {ab0d7fdf-373e-e9f2-5d8f-d52f304e1b90}
│ └sda2 929.57g [8:2] MD raid5 (4) inactive 'storage:1' {b570d224-d61d-7f45-8352-223df9c68ac4}
├scsi 1:0:0:0 ATA      ST31000520AS     {9VX0WRVM}
│└sdb 931.51g [8:16] Partitioned (dos)
│ ├sdb1 1.95g [8:17] MD raid1 (4) inactive {ab0d7fdf-373e-e9f2-5d8f-d52f304e1b90}
│ └sdb2 929.57g [8:18] MD raid5 (4) inactive 'storage:1' {b570d224-d61d-7f45-8352-223df9c68ac4}
├scsi 2:0:0:0 ATA      ST31000520AS     {9VX0XD1S}
│└sdc 931.51g [8:32] Partitioned (dos)
│ ├sdc1 1.95g [8:33] MD raid1 (4) inactive {ab0d7fdf-373e-e9f2-5d8f-d52f304e1b90}
│ └sdc2 929.57g [8:34] MD raid5 (4) inactive 'storage:1' {b570d224-d61d-7f45-8352-223df9c68ac4}
└scsi 3:0:0:0 ATA      ST31000520AS     {9VX0Y9JW}
 └sdd 931.51g [8:48] Partitioned (dos)
  ├sdd1 1.95g [8:49] MD raid1 (4) inactive {ab0d7fdf-373e-e9f2-5d8f-d52f304e1b90}
  └sdd2 929.57g [8:50] MD raid5 (4) inactive 'storage:1' {b570d224-d61d-7f45-8352-223df9c68ac4}
USB [usb-storage] Bus 002 Device 002: ID 0781:5530 SanDisk Corp. Cruzer {2005244391081570854A}
└scsi 4:0:0:0 SanDisk  Cruzer
 └sde 14.91g [8:64] Partitioned (dos)
  └sde1 14.91g [8:65] vfat 'FK16GB_LIVE' {1214-3C58}
   └Mounted as /dev/sde1 @ /cdrom
Other Block Devices
├loop0 820.33m [7:0] squashfs
│└Mounted as /dev/loop0 @ /rofs
├loop1 0.00k [7:1] Empty/Unknown
├loop2 0.00k [7:2] Empty/Unknown
├loop3 0.00k [7:3] Empty/Unknown
├loop4 0.00k [7:4] Empty/Unknown
├loop5 0.00k [7:5] Empty/Unknown
├loop6 0.00k [7:6] Empty/Unknown
├loop7 0.00k [7:7] Empty/Unknown
├md0 0.00k [9:0] MD vnone  () clear, None (None) None {None}
│                Empty/Unknown
├md1 0.00k [9:1] MD vnone  () clear, None (None) None {None}
│                Empty/Unknown
├md5 0.00k [9:5] MD vnone  () clear, None (None) None {None}
│                Empty/Unknown
├ram0 64.00m [1:0] Empty/Unknown
├ram1 64.00m [1:1] Empty/Unknown
├ram2 64.00m [1:2] Empty/Unknown
├ram3 64.00m [1:3] Empty/Unknown
├ram4 64.00m [1:4] Empty/Unknown
├ram5 64.00m [1:5] Empty/Unknown
├ram6 64.00m [1:6] Empty/Unknown
├ram7 64.00m [1:7] Empty/Unknown
├ram8 64.00m [1:8] Empty/Unknown
├ram9 64.00m [1:9] Empty/Unknown
├ram10 64.00m [1:10] Empty/Unknown
├ram11 64.00m [1:11] Empty/Unknown
├ram12 64.00m [1:12] Empty/Unknown
├ram13 64.00m [1:13] Empty/Unknown
├ram14 64.00m [1:14] Empty/Unknown
├ram15 64.00m [1:15] Empty/Unknown
├zram0 910.69m [251:0] swap {dd565600-cbd9-4d3c-bfa8-d534f6b0edea}
├zram1 910.69m [251:1] swap {6cc52777-6aef-4046-8acf-fd7b88eb5d74}
├zram2 910.69m [251:2] swap {7d6eba27-e88b-46a9-9edc-b36fc273b63a}
└zram3 910.69m [251:3] swap {ce871e97-37f7-4a37-b09d-bef1f1e288b9}

--------------------------------------------------------------------------------
4) cat /proc/mdstat
--------------------------------------------------------------------------------

root@it:/home/it/Desktop# cat /proc/mdstat
Personalities : [raid1]
unused devices: <none>

--------------------------------------------------------------------------------


So, with that info, I could verify some things that are frequently
mentioned in the posts:
- SCT Error Recovery Control is disabled for both Read and Write operations;
- Event counters on the devices are the same, except for one disk (sdc),
but the difference is small (<50);
- Magic Numbers and Checksums are all correct;
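
For reference, the event counters above can be compared with a few lines
of Python. This is only a sketch over numbers copied by hand from the
--examine output in sections 2a to 2d; the helper name is made up for
illustration.

```python
# Event counters copied from the `mdadm --examine` output above.
EVENTS = {
    "raid1": {"sda1": 37056, "sdb1": 37056, "sdc1": 37022, "sdd1": 37056},
    "raid5": {"sda2": 161174, "sdb2": 161174, "sdc2": 161163, "sdd2": 161174},
}


def stale_members(events: dict) -> dict:
    """Map each member behind the newest event count to how far behind it is."""
    newest = max(events.values())
    return {dev: newest - count for dev, count in events.items() if count < newest}


for array, events in EVENTS.items():
    print(array, stale_members(events))
```

In both arrays only the sdc partition is behind, by 34 events on the
raid1 and 11 on the raid5.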

Hope someone can give some advice on how to proceed next.

Best regards.

-
Felipe Kich
51-9622-2067

* [PATCH 4/4] IMSM: 4Kn drives support - adapt general migration record
From: Pawel Baldysiak @ 2016-11-10 14:28 UTC
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak, Tomasz Majchrzak
In-Reply-To: <1478788098-32041-1-git-send-email-pawel.baldysiak@intel.com>

Convert general migration record for 4Kn drives prior to write and post
read. Calculate record location based on sector size, don't just assume
it's 512. Ensure the buffer address is aligned to 4096 so the write
operation avoids caching.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 93 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 70 insertions(+), 23 deletions(-)
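
The arithmetic this patch introduces can be modelled in a few lines of
Python. This is a sketch, not mdadm code: IMSM_4K_DIV is assumed to be 8
(4096 / 512), since its definition is not part of this hunk, and the
helpers below only mirror the names used in the diff. In particular,
split_ull here returns a tuple instead of writing through pointers.

```python
IMSM_4K_DIV = 8               # assumption: 4096 / 512, not shown in this patch
MIGR_REC_SECTOR_POSITION = 1  # from the patch: record position, in sectors


def migr_rec_offset(dev_size: int, sector_size: int) -> int:
    """Byte offset of the migration record, counted back from the device end."""
    return dev_size - sector_size * MIGR_REC_SECTOR_POSITION


def join_u32(lo: int, hi: int) -> int:
    """Combine two 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


def split_ull(value: int) -> tuple:
    """Split a 64-bit value into (lo, hi) 32-bit halves."""
    return value & 0xFFFFFFFF, value >> 32


def vol_cap_to_4k(lo: int, hi: int) -> tuple:
    """Rescale a count of 512-byte sectors, stored as two halves, to 4K sectors."""
    return split_ull(join_u32(lo, hi) // IMSM_4K_DIV)


dev_size = 4096 * 1000  # hypothetical device size in bytes
print(migr_rec_offset(dev_size, 512), migr_rec_offset(dev_size, 4096))
print(vol_cap_to_4k(0, 1))
```

With a fixed 512-byte offset the record would land in the middle of the
last sector on a 4Kn drive; scaling by the sector size keeps it on a
sector boundary.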

diff --git a/super-intel.c b/super-intel.c
index fa3e96d..81bec16 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -243,9 +243,9 @@ static char *map_state_str[] = { "normal", "uninitialized", "degraded", "failed"
 
 #define GEN_MIGR_AREA_SIZE 2048 /* General Migration Copy Area size in blocks */
 
-#define MIGR_REC_BUF_SIZE 512 /* size of migr_record i/o buffer */
-#define MIGR_REC_POSITION 512 /* migr_record position offset on disk,
-			       * MIGR_REC_BUF_SIZE <= MIGR_REC_POSITION
+#define MIGR_REC_BUF_SECTORS 1 /* size of migr_record i/o buffer in sectors */
+#define MIGR_REC_SECTOR_POSITION 1 /* migr_record position offset on disk,
+			       * MIGR_REC_BUF_SECTORS <= MIGR_REC_SECTOR_POS
 			       */
 
 #define UNIT_SRC_NORMAL     0   /* Source data for curr_migr_unit must
@@ -1256,6 +1256,19 @@ static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
 	       human_size(sz * 512));
 }
 
+void convert_to_4k_imsm_migr_rec(struct intel_super *super)
+{
+	struct migr_record *migr_rec = super->migr_rec;
+
+	migr_rec->blocks_per_unit /= IMSM_4K_DIV;
+	migr_rec->ckpt_area_pba /= IMSM_4K_DIV;
+	migr_rec->dest_1st_member_lba /= IMSM_4K_DIV;
+	migr_rec->dest_depth_per_unit /= IMSM_4K_DIV;
+	split_ull((join_u32(migr_rec->post_migr_vol_cap,
+		 migr_rec->post_migr_vol_cap_hi) / IMSM_4K_DIV),
+		 &migr_rec->post_migr_vol_cap, &migr_rec->post_migr_vol_cap_hi);
+}
+
 void convert_to_4k_imsm_disk(struct imsm_disk *disk)
 {
 	set_total_blocks(disk, (total_blocks(disk)/IMSM_4K_DIV));
@@ -1356,6 +1369,20 @@ void examine_migr_rec_imsm(struct intel_super *super)
 }
 #endif /* MDASSEMBLE */
 
+void convert_from_4k_imsm_migr_rec(struct intel_super *super)
+{
+	struct migr_record *migr_rec = super->migr_rec;
+
+	migr_rec->blocks_per_unit *= IMSM_4K_DIV;
+	migr_rec->ckpt_area_pba *= IMSM_4K_DIV;
+	migr_rec->dest_1st_member_lba *= IMSM_4K_DIV;
+	migr_rec->dest_depth_per_unit *= IMSM_4K_DIV;
+	split_ull((join_u32(migr_rec->post_migr_vol_cap,
+		 migr_rec->post_migr_vol_cap_hi) * IMSM_4K_DIV),
+		 &migr_rec->post_migr_vol_cap,
+		 &migr_rec->post_migr_vol_cap_hi);
+}
+
 void convert_from_4k(struct intel_super *super)
 {
 	struct imsm_super *mpb = super->anchor;
@@ -2498,21 +2525,26 @@ static int imsm_level_to_layout(int level)
 static int read_imsm_migr_rec(int fd, struct intel_super *super)
 {
 	int ret_val = -1;
+	unsigned int sector_size = super->sector_size;
 	unsigned long long dsize;
 
 	get_dev_size(fd, NULL, &dsize);
-	if (lseek64(fd, dsize - MIGR_REC_POSITION, SEEK_SET) < 0) {
+	if (lseek64(fd, dsize - (sector_size*MIGR_REC_SECTOR_POSITION),
+		   SEEK_SET) < 0) {
 		pr_err("Cannot seek to anchor block: %s\n",
 		       strerror(errno));
 		goto out;
 	}
-	if (read(fd, super->migr_rec_buf, MIGR_REC_BUF_SIZE) !=
-							    MIGR_REC_BUF_SIZE) {
+	if (read(fd, super->migr_rec_buf,
+	    MIGR_REC_BUF_SECTORS*sector_size) !=
+	    MIGR_REC_BUF_SECTORS*sector_size) {
 		pr_err("Cannot read migr record block: %s\n",
 		       strerror(errno));
 		goto out;
 	}
 	ret_val = 0;
+	if (sector_size == 4096)
+		convert_from_4k_imsm_migr_rec(super);
 
 out:
 	return ret_val;
@@ -2658,6 +2690,7 @@ static void imsm_update_metadata_locally(struct supertype *st,
 static int write_imsm_migr_rec(struct supertype *st)
 {
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	unsigned long long dsize;
 	char nm[30];
 	int fd = -1;
@@ -2679,6 +2712,8 @@ static int write_imsm_migr_rec(struct supertype *st)
 
 	map = get_imsm_map(dev, MAP_0);
 
+	if (sector_size == 4096)
+		convert_to_4k_imsm_migr_rec(super);
 	for (sd = super->disks ; sd ; sd = sd->next) {
 		int slot = -1;
 
@@ -2696,13 +2731,15 @@ static int write_imsm_migr_rec(struct supertype *st)
 		if (fd < 0)
 			continue;
 		get_dev_size(fd, NULL, &dsize);
-		if (lseek64(fd, dsize - MIGR_REC_POSITION, SEEK_SET) < 0) {
+		if (lseek64(fd, dsize - (MIGR_REC_SECTOR_POSITION*sector_size),
+		    SEEK_SET) < 0) {
 			pr_err("Cannot seek to anchor block: %s\n",
 			       strerror(errno));
 			goto out;
 		}
-		if (write(fd, super->migr_rec_buf, MIGR_REC_BUF_SIZE) !=
-							    MIGR_REC_BUF_SIZE) {
+		if (write(fd, super->migr_rec_buf,
+		    MIGR_REC_BUF_SECTORS*sector_size) !=
+		    MIGR_REC_BUF_SECTORS*sector_size) {
 			pr_err("Cannot write migr record block: %s\n",
 			       strerror(errno));
 			goto out;
@@ -2710,9 +2747,10 @@ static int write_imsm_migr_rec(struct supertype *st)
 		close(fd);
 		fd = -1;
 	}
+	if (sector_size == 4096)
+		convert_from_4k_imsm_migr_rec(super);
 	/* update checkpoint information in metadata */
 	len = imsm_create_metadata_checkpoint_update(super, &u);
-
 	if (len <= 0) {
 		dprintf("imsm: Cannot prepare update\n");
 		goto out;
@@ -3836,7 +3874,8 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 	sectors = mpb_sectors(anchor, sector_size) - 1;
 	free(anchor);
 
-	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
+	if (posix_memalign(&super->migr_rec_buf, sector_size,
+	    MIGR_REC_BUF_SECTORS*sector_size) != 0) {
 		pr_err("could not allocate migr_rec buffer\n");
 		free(super->buf);
 		return 2;
@@ -4854,8 +4893,8 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 			pr_err("could not allocate new mpb\n");
 			return 0;
 		}
-		if (posix_memalign(&super->migr_rec_buf, 512,
-				   MIGR_REC_BUF_SIZE) != 0) {
+		if (posix_memalign(&super->migr_rec_buf, sector_size,
+				   MIGR_REC_BUF_SECTORS*sector_size) != 0) {
 			pr_err("could not allocate migr_rec buffer\n");
 			free(super->buf);
 			free(super);
@@ -5016,7 +5055,8 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 		pr_err("could not allocate superblock\n");
 		return 0;
 	}
-	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
+	if (posix_memalign(&super->migr_rec_buf, 4096,
+	    MIGR_REC_BUF_SECTORS*4096) != 0) {
 		pr_err("could not allocate migr_rec buffer\n");
 		free(super->buf);
 		free(super);
@@ -5294,10 +5334,12 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	}
 
 	/* clear migr_rec when adding disk to container */
-	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
-	if (lseek64(fd, size - MIGR_REC_POSITION, SEEK_SET) >= 0) {
+	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SECTORS*super->sector_size);
+	if (lseek64(fd, size - MIGR_REC_SECTOR_POSITION*super->sector_size,
+	    SEEK_SET) >= 0) {
 		if (write(fd, super->migr_rec_buf,
-			MIGR_REC_BUF_SIZE) != MIGR_REC_BUF_SIZE)
+		    MIGR_REC_BUF_SECTORS*super->sector_size) !=
+		    MIGR_REC_BUF_SECTORS*super->sector_size)
 			perror("Write migr_rec failed");
 	}
 
@@ -5473,7 +5515,8 @@ static int write_super_imsm(struct supertype *st, int doclose)
 		super->clean_migration_record_by_mdmon = 0;
 	}
 	if (clear_migration_record)
-		memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
+		memset(super->migr_rec_buf, 0,
+		    MIGR_REC_BUF_SECTORS*sector_size);
 
 	if (sector_size == 4096)
 		convert_to_4k(super);
@@ -5487,9 +5530,11 @@ static int write_super_imsm(struct supertype *st, int doclose)
 			unsigned long long dsize;
 
 			get_dev_size(d->fd, NULL, &dsize);
-			if (lseek64(d->fd, dsize - 512, SEEK_SET) >= 0) {
+			if (lseek64(d->fd, dsize - sector_size,
+			    SEEK_SET) >= 0) {
 				if (write(d->fd, super->migr_rec_buf,
-					MIGR_REC_BUF_SIZE) != MIGR_REC_BUF_SIZE)
+				    MIGR_REC_BUF_SECTORS*sector_size) !=
+				    MIGR_REC_BUF_SECTORS*sector_size)
 					perror("Write migr_rec failed");
 			}
 		}
@@ -10713,6 +10758,7 @@ static int imsm_manage_reshape(
 	int ret_val = 0;
 	struct intel_super *super = st->sb;
 	struct intel_dev *dv;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_dev *dev = NULL;
 	struct imsm_map *map_src;
 	int migr_vol_qan = 0;
@@ -10907,17 +10953,18 @@ static int imsm_manage_reshape(
 	/* clear migr_rec on disks after successful migration */
 	struct dl *d;
 
-	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
+	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SECTORS*sector_size);
 	for (d = super->disks; d; d = d->next) {
 		if (d->index < 0 || is_failed(&d->disk))
 			continue;
 		unsigned long long dsize;
 
 		get_dev_size(d->fd, NULL, &dsize);
-		if (lseek64(d->fd, dsize - MIGR_REC_POSITION,
+		if (lseek64(d->fd, dsize - MIGR_REC_SECTOR_POSITION*sector_size,
 			    SEEK_SET) >= 0) {
 			if (write(d->fd, super->migr_rec_buf,
-				MIGR_REC_BUF_SIZE) != MIGR_REC_BUF_SIZE)
+			    MIGR_REC_BUF_SECTORS*sector_size) !=
+			    MIGR_REC_BUF_SECTORS*sector_size)
 				perror("Write migr_rec failed");
 		}
 	}
-- 
2.7.4


^ permalink raw reply related

* [PATCH 3/4] IMSM: Add support for 4Kn sector size drives
From: Pawel Baldysiak @ 2016-11-10 14:28 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak, Tomasz Majchrzak
In-Reply-To: <1478788098-32041-1-git-send-email-pawel.baldysiak@intel.com>

This patch adds support for drives with 4Kn sector size
for IMSM metadata. Mixing member drives with 4Kn and 512-byte
sectors is not allowed. Some offsets are now aligned to the sector size.
Internal metadata representation and all calculations
are still based on 512-byte sectors. This
implementation converts only sector-based values
when reading from or writing to a drive, because they need to be
stored in metadata according to the actual member drive sector size.
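
A minimal sketch of the unit conversion described above, assuming the
divisor of 8 (4096 / 512) that the patch defines as IMSM_4K_DIV. The helper
names are hypothetical, and the alignment assert is an addition of this
sketch; the patch itself simply divides and multiplies:

```c
#include <assert.h>
#include <stdint.h>

#define IMSM_4K_DIV 8 /* 4096 / 512, as defined in the patch */

/* Hypothetical helper: convert a count of 512-byte blocks (internal
 * representation) to 4096-byte blocks (on-disk value for a 4Kn drive). */
static uint64_t blocks_512_to_4k(uint64_t blocks_512)
{
	/* A count that is not a multiple of 8 cannot be stored exactly. */
	assert(blocks_512 % IMSM_4K_DIV == 0);
	return blocks_512 / IMSM_4K_DIV;
}

/* Hypothetical helper: the inverse conversion, applied after reading. */
static uint64_t blocks_4k_to_512(uint64_t blocks_4k)
{
	return blocks_4k * IMSM_4K_DIV;
}
```

Converting on write and converting back on read is what lets the rest of
the code keep working in 512-byte units.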

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 199 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 154 insertions(+), 45 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 8a2d993..fa3e96d 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -318,14 +318,15 @@ static void set_migr_type(struct imsm_dev *dev, __u8 migr_type)
 	}
 }
 
-static unsigned int sector_count(__u32 bytes)
+static unsigned int sector_count(__u32 bytes, unsigned int sector_size)
 {
-	return ROUND_UP(bytes, 512) / 512;
+	return ROUND_UP(bytes, sector_size) / sector_size;
 }
 
-static unsigned int mpb_sectors(struct imsm_super *mpb)
+static unsigned int mpb_sectors(struct imsm_super *mpb,
+					unsigned int sector_size)
 {
-	return sector_count(__le32_to_cpu(mpb->mpb_size));
+	return sector_count(__le32_to_cpu(mpb->mpb_size), sector_size);
 }
 
 struct intel_dev {
@@ -915,12 +916,12 @@ static unsigned long long num_data_stripes(struct imsm_map *map)
 		return 0;
 	return join_u32(map->num_data_stripes_lo, map->num_data_stripes_hi);
 }
+#endif
 
 static void set_total_blocks(struct imsm_disk *disk, unsigned long long n)
 {
 	split_ull(n, &disk->total_blocks_lo, &disk->total_blocks_hi);
 }
-#endif
 
 static void set_pba_of_lba0(struct imsm_map *map, unsigned long long n)
 {
@@ -1122,6 +1123,8 @@ static unsigned long long min_acceptable_spare_size_imsm(struct supertype *st)
 
 static int is_gen_migration(struct imsm_dev *dev);
 
+#define IMSM_4K_DIV 8
+
 #ifndef MDASSEMBLE
 static __u64 blocks_per_migr_unit(struct intel_super *super,
 				  struct imsm_dev *dev);
@@ -1253,6 +1256,48 @@ static void print_imsm_disk(struct imsm_disk *disk, int index, __u32 reserved)
 	       human_size(sz * 512));
 }
 
+void convert_to_4k_imsm_disk(struct imsm_disk *disk)
+{
+	set_total_blocks(disk, (total_blocks(disk)/IMSM_4K_DIV));
+}
+
+void convert_to_4k(struct intel_super *super)
+{
+	struct imsm_super *mpb = super->anchor;
+	struct imsm_disk *disk;
+	int i;
+
+	for (i = 0; i < mpb->num_disks ; i++) {
+		disk = __get_imsm_disk(mpb, i);
+		/* disk */
+		convert_to_4k_imsm_disk(disk);
+	}
+	for (i = 0; i < mpb->num_raid_devs; i++) {
+		struct imsm_dev *dev = __get_imsm_dev(mpb, i);
+		struct imsm_map *map = get_imsm_map(dev, MAP_0);
+		/* dev */
+		split_ull((join_u32(dev->size_low, dev->size_high)/IMSM_4K_DIV),
+				 &dev->size_low, &dev->size_high);
+		dev->vol.curr_migr_unit /= IMSM_4K_DIV;
+
+		/* map0 */
+		set_blocks_per_member(map, blocks_per_member(map)/IMSM_4K_DIV);
+		map->blocks_per_strip /= IMSM_4K_DIV;
+		set_pba_of_lba0(map, pba_of_lba0(map)/IMSM_4K_DIV);
+
+		if (dev->vol.migr_state) {
+			/* map1 */
+			map = get_imsm_map(dev, MAP_1);
+			set_blocks_per_member(map,
+			    blocks_per_member(map)/IMSM_4K_DIV);
+			map->blocks_per_strip /= IMSM_4K_DIV;
+			set_pba_of_lba0(map, pba_of_lba0(map)/IMSM_4K_DIV);
+		}
+	}
+
+	mpb->check_sum = __gen_imsm_checksum(mpb);
+}
+
 void examine_migr_rec_imsm(struct intel_super *super)
 {
 	struct migr_record *migr_rec = super->migr_rec;
@@ -1310,6 +1355,45 @@ void examine_migr_rec_imsm(struct intel_super *super)
 	}
 }
 #endif /* MDASSEMBLE */
+
+void convert_from_4k(struct intel_super *super)
+{
+	struct imsm_super *mpb = super->anchor;
+	struct imsm_disk *disk;
+	int i;
+
+	for (i = 0; i < mpb->num_disks ; i++) {
+		disk = __get_imsm_disk(mpb, i);
+		/* disk */
+		set_total_blocks(disk, (total_blocks(disk)*IMSM_4K_DIV));
+	}
+
+	for (i = 0; i < mpb->num_raid_devs; i++) {
+		struct imsm_dev *dev = __get_imsm_dev(mpb, i);
+		struct imsm_map *map = get_imsm_map(dev, MAP_0);
+		/* dev */
+		split_ull((join_u32(dev->size_low, dev->size_high)*IMSM_4K_DIV),
+				 &dev->size_low, &dev->size_high);
+		dev->vol.curr_migr_unit *= IMSM_4K_DIV;
+
+		/* map0 */
+		set_blocks_per_member(map, blocks_per_member(map)*IMSM_4K_DIV);
+		map->blocks_per_strip *= IMSM_4K_DIV;
+		set_pba_of_lba0(map, pba_of_lba0(map)*IMSM_4K_DIV);
+
+		if (dev->vol.migr_state) {
+			/* map1 */
+			map = get_imsm_map(dev, MAP_1);
+			set_blocks_per_member(map,
+			    blocks_per_member(map)*IMSM_4K_DIV);
+			map->blocks_per_strip *= IMSM_4K_DIV;
+			set_pba_of_lba0(map, pba_of_lba0(map)*IMSM_4K_DIV);
+		}
+	}
+
+	mpb->check_sum = __gen_imsm_checksum(mpb);
+}
+
 /*******************************************************************************
  * function: imsm_check_attributes
  * Description: Function checks if features represented by attributes flags
@@ -1430,7 +1514,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost)
 	sum = __le32_to_cpu(mpb->check_sum);
 	printf("       Checksum : %08x %s\n", sum,
 		__gen_imsm_checksum(mpb) == sum ? "correct" : "incorrect");
-	printf("    MPB Sectors : %d\n", mpb_sectors(mpb));
+	printf("    MPB Sectors : %d\n", mpb_sectors(mpb, super->sector_size));
 	printf("          Disks : %d\n", mpb->num_disks);
 	printf("   RAID Devices : %d\n", mpb->num_raid_devs);
 	print_imsm_disk(__get_imsm_disk(mpb, super->disks->index), super->disks->index, reserved);
@@ -1527,7 +1611,7 @@ static void export_examine_super_imsm(struct supertype *st)
 
 static int copy_metadata_imsm(struct supertype *st, int from, int to)
 {
-	/* The second last 512byte sector of the device contains
+	/* The second last sector of the device contains
 	 * the "struct imsm_super" metadata.
 	 * This contains mpb_size which is the size in bytes of the
 	 * extended metadata.  This is located immediately before
@@ -1540,7 +1624,9 @@ static int copy_metadata_imsm(struct supertype *st, int from, int to)
 	unsigned long long dsize, offset;
 	int sectors;
 	struct imsm_super *sb;
-	int written = 0;
+	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
+	unsigned int written = 0;
 
 	if (posix_memalign(&buf, 4096, 4096) != 0)
 		return 1;
@@ -1548,21 +1634,21 @@ static int copy_metadata_imsm(struct supertype *st, int from, int to)
 	if (!get_dev_size(from, NULL, &dsize))
 		goto err;
 
-	if (lseek64(from, dsize-1024, 0) < 0)
+	if (lseek64(from, dsize-(2*sector_size), 0) < 0)
 		goto err;
-	if (read(from, buf, 512) != 512)
+	if (read(from, buf, sector_size) != sector_size)
 		goto err;
 	sb = buf;
 	if (strncmp((char*)sb->sig, MPB_SIGNATURE, MPB_SIG_LEN) != 0)
 		goto err;
 
-	sectors = mpb_sectors(sb) + 2;
-	offset = dsize - sectors * 512;
+	sectors = mpb_sectors(sb, sector_size) + 2;
+	offset = dsize - sectors * sector_size;
 	if (lseek64(from, offset, 0) < 0 ||
 	    lseek64(to, offset, 0) < 0)
 		goto err;
-	while (written < sectors * 512) {
-		int n = sectors*512 - written;
+	while (written < sectors * sector_size) {
+		int n = sectors*sector_size - written;
 		if (n > 4096)
 			n = 4096;
 		if (read(from, buf, n) != n)
@@ -2678,13 +2764,14 @@ int imsm_reshape_blocks_arrays_changes(struct intel_super *super)
 }
 static unsigned long long imsm_component_size_aligment_check(int level,
 					      int chunk_size,
+					      unsigned int sector_size,
 					      unsigned long long component_size)
 {
 	unsigned int component_size_alligment;
 
 	/* check component size aligment
 	*/
-	component_size_alligment = component_size % (chunk_size/512);
+	component_size_alligment = component_size % (chunk_size/sector_size);
 
 	dprintf("(Level: %i, chunk_size = %i, component_size = %llu), component_size_alligment = %u\n",
 		level, chunk_size, component_size,
@@ -2795,6 +2882,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->component_size = imsm_component_size_aligment_check(
 							info->array.level,
 							info->array.chunk_size,
+							super->sector_size,
 							info->component_size);
 
 	memset(info->uuid, 0, sizeof(info->uuid));
@@ -3615,8 +3703,9 @@ static int parse_raid_devices(struct intel_super *super)
 	if (__le32_to_cpu(mpb->mpb_size) + space_needed > super->len) {
 		void *buf;
 
-		len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + space_needed, 512);
-		if (posix_memalign(&buf, 512, len) != 0)
+		len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + space_needed,
+			      super->sector_size);
+		if (posix_memalign(&buf, 4096, len) != 0)
 			return 1;
 
 		memcpy(buf, super->buf, super->len);
@@ -3689,31 +3778,32 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 {
 	unsigned long long dsize;
 	unsigned long long sectors;
+	unsigned int sector_size = super->sector_size;
 	struct stat;
 	struct imsm_super *anchor;
 	__u32 check_sum;
 
 	get_dev_size(fd, NULL, &dsize);
-	if (dsize < 1024) {
+	if (dsize < 2*sector_size) {
 		if (devname)
 			pr_err("%s: device to small for imsm\n",
 			       devname);
 		return 1;
 	}
 
-	if (lseek64(fd, dsize - (512 * 2), SEEK_SET) < 0) {
+	if (lseek64(fd, dsize - (sector_size * 2), SEEK_SET) < 0) {
 		if (devname)
 			pr_err("Cannot seek to anchor block on %s: %s\n",
 			       devname, strerror(errno));
 		return 1;
 	}
 
-	if (posix_memalign((void**)&anchor, 512, 512) != 0) {
+	if (posix_memalign((void **)&anchor, sector_size, sector_size) != 0) {
 		if (devname)
 			pr_err("Failed to allocate imsm anchor buffer on %s\n", devname);
 		return 1;
 	}
-	if (read(fd, anchor, 512) != 512) {
+	if (read(fd, anchor, sector_size) != sector_size) {
 		if (devname)
 			pr_err("Cannot read anchor block on %s: %s\n",
 			       devname, strerror(errno));
@@ -3733,17 +3823,17 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 
 	/* capability and hba must be updated with new super allocation */
 	find_intel_hba_capability(fd, super, devname);
-	super->len = ROUND_UP(anchor->mpb_size, 512);
-	if (posix_memalign(&super->buf, 512, super->len) != 0) {
+	super->len = ROUND_UP(anchor->mpb_size, sector_size);
+	if (posix_memalign(&super->buf, 4096, super->len) != 0) {
 		if (devname)
 			pr_err("unable to allocate %zu byte mpb buffer\n",
 			       super->len);
 		free(anchor);
 		return 2;
 	}
-	memcpy(super->buf, anchor, 512);
+	memcpy(super->buf, anchor, sector_size);
 
-	sectors = mpb_sectors(anchor) - 1;
+	sectors = mpb_sectors(anchor, sector_size) - 1;
 	free(anchor);
 
 	if (posix_memalign(&super->migr_rec_buf, 512, MIGR_REC_BUF_SIZE) != 0) {
@@ -3768,14 +3858,15 @@ static int load_imsm_mpb(int fd, struct intel_super *super, char *devname)
 	}
 
 	/* read the extended mpb */
-	if (lseek64(fd, dsize - (512 * (2 + sectors)), SEEK_SET) < 0) {
+	if (lseek64(fd, dsize - (sector_size * (2 + sectors)), SEEK_SET) < 0) {
 		if (devname)
 			pr_err("Cannot seek to extended mpb on %s: %s\n",
 			       devname, strerror(errno));
 		return 1;
 	}
 
-	if ((unsigned)read(fd, super->buf + 512, super->len - 512) != super->len - 512) {
+	if ((unsigned int)read(fd, super->buf + sector_size,
+		    super->len - sector_size) != super->len - sector_size) {
 		if (devname)
 			pr_err("Cannot read extended mpb on %s: %s\n",
 			       devname, strerror(errno));
@@ -3836,6 +3927,8 @@ load_and_parse_mpb(int fd, struct intel_super *super, char *devname, int keep_fd
 	err = load_imsm_mpb(fd, super, devname);
 	if (err)
 		return err;
+	if (super->sector_size == 4096)
+		convert_from_4k(super);
 	err = load_imsm_disk(fd, super, devname, keep_fd);
 	if (err)
 		return err;
@@ -4733,6 +4826,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	 * so st->sb is already set.
 	 */
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	struct intel_dev *dv;
 	struct imsm_dev *dev;
@@ -4754,9 +4848,9 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	size_new = disks_to_mpb_size(info->nr_disks);
 	if (size_new > size_old) {
 		void *mpb_new;
-		size_t size_round = ROUND_UP(size_new, 512);
+		size_t size_round = ROUND_UP(size_new, sector_size);
 
-		if (posix_memalign(&mpb_new, 512, size_round) != 0) {
+		if (posix_memalign(&mpb_new, sector_size, size_round) != 0) {
 			pr_err("could not allocate new mpb\n");
 			return 0;
 		}
@@ -4911,10 +5005,10 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 	if (info)
 		mpb_size = disks_to_mpb_size(info->nr_disks);
 	else
-		mpb_size = 512;
+		mpb_size = 4096;
 
 	super = alloc_super();
-	if (super && posix_memalign(&super->buf, 512, mpb_size) != 0) {
+	if (super && posix_memalign(&super->buf, 4096, mpb_size) != 0) {
 		free(super);
 		super = NULL;
 	}
@@ -5261,9 +5355,9 @@ static int remove_from_super_imsm(struct supertype *st, mdu_disk_info_t *dk)
 static int store_imsm_mpb(int fd, struct imsm_super *mpb);
 
 static union {
-	char buf[512];
+	char buf[4096];
 	struct imsm_super anchor;
-} spare_record __attribute__ ((aligned(512)));
+} spare_record __attribute__ ((aligned(4096)));
 
 /* spare records have their own family number and do not have any defined raid
  * devices
@@ -5294,6 +5388,9 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
 		if (__le32_to_cpu(d->disk.total_blocks_hi) > 0)
 			spare->attributes |= MPB_ATTRIB_2TB_DISK;
 
+		if (super->sector_size == 4096)
+			convert_to_4k_imsm_disk(&spare->disk[0]);
+
 		sum = __gen_imsm_checksum(spare);
 		spare->family_num = __cpu_to_le32(sum);
 		spare->orig_family_num = 0;
@@ -5317,6 +5414,7 @@ static int write_super_imsm_spares(struct intel_super *super, int doclose)
 static int write_super_imsm(struct supertype *st, int doclose)
 {
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	struct dl *d;
 	__u32 generation;
@@ -5377,6 +5475,9 @@ static int write_super_imsm(struct supertype *st, int doclose)
 	if (clear_migration_record)
 		memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
 
+	if (sector_size == 4096)
+		convert_to_4k(super);
+
 	/* write the mpb for disks that compose raid devices */
 	for (d = super->disks; d ; d = d->next) {
 		if (d->index < 0 || is_failed(&d->disk))
@@ -5500,6 +5601,8 @@ static int store_super_imsm(struct supertype *st, int fd)
 		return 1;
 
 #ifndef MDASSEMBLE
+	if (super->sector_size == 4096)
+		convert_to_4k(super);
 	return store_imsm_mpb(fd, mpb);
 #else
 	return 1;
@@ -7574,27 +7677,30 @@ static int store_imsm_mpb(int fd, struct imsm_super *mpb)
 	__u32 mpb_size = __le32_to_cpu(mpb->mpb_size);
 	unsigned long long dsize;
 	unsigned long long sectors;
+	unsigned int sector_size;
 
+	get_dev_sector_size(fd, NULL, &sector_size);
 	get_dev_size(fd, NULL, &dsize);
 
-	if (mpb_size > 512) {
+	if (mpb_size > sector_size) {
 		/* -1 to account for anchor */
-		sectors = mpb_sectors(mpb) - 1;
+		sectors = mpb_sectors(mpb, sector_size) - 1;
 
 		/* write the extended mpb to the sectors preceeding the anchor */
-		if (lseek64(fd, dsize - (512 * (2 + sectors)), SEEK_SET) < 0)
+		if (lseek64(fd, dsize - (sector_size * (2 + sectors)),
+		   SEEK_SET) < 0)
 			return 1;
 
-		if ((unsigned long long)write(fd, buf + 512, 512 * sectors)
-		    != 512 * sectors)
+		if ((unsigned long long)write(fd, buf + sector_size,
+		   sector_size * sectors) != sector_size * sectors)
 			return 1;
 	}
 
 	/* first block is stored on second to last sector of the disk */
-	if (lseek64(fd, dsize - (512 * 2), SEEK_SET) < 0)
+	if (lseek64(fd, dsize - (sector_size * 2), SEEK_SET) < 0)
 		return 1;
 
-	if (write(fd, buf, 512) != 512)
+	if (write(fd, buf, sector_size) != sector_size)
 		return 1;
 
 	return 0;
@@ -8830,6 +8936,7 @@ static int imsm_prepare_update(struct supertype *st,
 	 */
 	enum imsm_update_type type;
 	struct intel_super *super = st->sb;
+	unsigned int sector_size = super->sector_size;
 	struct imsm_super *mpb = super->anchor;
 	size_t buf_len;
 	size_t len = 0;
@@ -9066,12 +9173,13 @@ static int imsm_prepare_update(struct supertype *st,
 		 * if this allocation fails process_update will notice that
 		 * ->next_len is set and ->next_buf is NULL
 		 */
-		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + len, 512);
+		buf_len = ROUND_UP(__le32_to_cpu(mpb->mpb_size) + len,
+				  sector_size);
 		if (super->next_buf)
 			free(super->next_buf);
 
 		super->next_len = buf_len;
-		if (posix_memalign(&super->next_buf, 512, buf_len) == 0)
+		if (posix_memalign(&super->next_buf, sector_size, buf_len) == 0)
 			memset(super->next_buf, 0, buf_len);
 		else
 			super->next_buf = NULL;
@@ -9566,6 +9674,7 @@ int recover_backup_imsm(struct supertype *st, struct mdinfo *info)
 	int new_disks, i, err;
 	char *buf = NULL;
 	int retval = 1;
+	unsigned int sector_size = super->sector_size;
 	unsigned long curr_migr_unit = __le32_to_cpu(migr_rec->curr_migr_unit);
 	unsigned long num_migr_units = __le32_to_cpu(migr_rec->num_migr_units);
 	char buffer[20];
@@ -9602,7 +9711,7 @@ int recover_backup_imsm(struct supertype *st, struct mdinfo *info)
 			pba_of_lba0(map_dest)) * 512;
 
 	unit_len = __le32_to_cpu(migr_rec->dest_depth_per_unit) * 512;
-	if (posix_memalign((void **)&buf, 512, unit_len) != 0)
+	if (posix_memalign((void **)&buf, sector_size, unit_len) != 0)
 		goto abort;
 	targets = xcalloc(new_disks, sizeof(int));
 
@@ -10148,7 +10257,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 		 */
 		geo->size = imsm_component_size_aligment_check(
 				    get_imsm_raid_level(dev->vol.map),
-				    chunk * 1024,
+				    chunk * 1024, super->sector_size,
 				    geo->size * 2);
 		if (geo->size == 0) {
 			pr_err("Error. Size expansion is supported only (current size is %llu, requested size /rounded/ is 0).\n",
@@ -10182,7 +10291,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 			 */
 			max_size = imsm_component_size_aligment_check(
 					get_imsm_raid_level(dev->vol.map),
-					chunk * 1024,
+					chunk * 1024, super->sector_size,
 					max_size);
 		}
 		if (geo->size == MAX_SIZE) {
-- 
2.7.4


^ permalink raw reply related

* [PATCH 2/4] IMSM: Read and store device sector size
From: Pawel Baldysiak @ 2016-11-10 14:28 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1478788098-32041-1-git-send-email-pawel.baldysiak@intel.com>

This patch retrieves the device sector size at startup
and stores it in intel_super, so it can be used in other places.
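
The diff also rejects a member whose sector size differs from the one
already recorded. A simplified sketch of that rule, using a hypothetical
struct container in place of struct intel_super and a hypothetical
accept_member function:

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of struct intel_super used here. */
struct container {
	unsigned int sector_size; /* 0 until the first member is added */
};

/* Returns 0 if the member is accepted, 1 if its sector size does not match
 * the size recorded from the first member. */
static int accept_member(struct container *c, unsigned int member_sector_size)
{
	assert(member_sector_size != 0);

	if (c->sector_size == 0) {
		/* First device: it defines the container's sector size. */
		c->sector_size = member_sector_size;
		return 0;
	}
	return member_sector_size != c->sector_size;
}
```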

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 super-intel.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/super-intel.c b/super-intel.c
index 21e8532..8a2d993 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -366,6 +366,7 @@ struct intel_super {
 	unsigned long long create_offset; /* common start for 'current_vol' */
 	__u32 random; /* random data for seeding new family numbers */
 	struct intel_dev *devlist;
+	unsigned int sector_size; /* sector size of used member drives */
 	struct dl {
 		struct dl *next;
 		int index;
@@ -4491,6 +4492,7 @@ static int get_super_block(struct intel_super **super_list, char *devnm, char *d
 		goto error;
 	}
 
+	get_dev_sector_size(dfd, NULL, &s->sector_size);
 	find_intel_hba_capability(dfd, s, devname);
 	err = load_and_parse_mpb(dfd, s, NULL, keep_fd);
 
@@ -4570,6 +4572,7 @@ static int load_super_imsm(struct supertype *st, int fd, char *devname)
 	free_super_imsm(st);
 
 	super = alloc_super();
+	get_dev_sector_size(fd, NULL, &super->sector_size);
 	/* Load hba and capabilities if they exist.
 	 * But do not preclude loading metadata in case capabilities or hba are
 	 * non-compliant and ignore_hw_compat is set.
@@ -5102,6 +5105,7 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	struct intel_super *super = st->sb;
 	struct dl *dd;
 	unsigned long long size;
+	unsigned int member_sector_size;
 	__u32 id;
 	int rv;
 	struct stat stb;
@@ -5182,6 +5186,19 @@ static int add_to_super_imsm(struct supertype *st, mdu_disk_info_t *dk,
 	}
 
 	get_dev_size(fd, NULL, &size);
+	get_dev_sector_size(fd, NULL, &member_sector_size);
+
+	if (super->sector_size == 0) {
+		/* this a first device, so sector_size is not set yet */
+		super->sector_size = member_sector_size;
+	} else if (member_sector_size != super->sector_size) {
+		pr_err("Mixing between different sector size is forbidden, aborting...\n");
+		if (dd->devname)
+			free(dd->devname);
+		free(dd);
+		return 1;
+	}
+
 	/* clear migr_rec when adding disk to container */
 	memset(super->migr_rec_buf, 0, MIGR_REC_BUF_SIZE);
 	if (lseek64(fd, size - MIGR_REC_POSITION, SEEK_SET) >= 0) {
@@ -5529,6 +5546,12 @@ static int validate_geometry_imsm_container(struct supertype *st, int level,
 	 * note that there is no fd for the disks in array.
 	 */
 	super = alloc_super();
+	if (!get_dev_sector_size(fd, NULL, &super->sector_size)) {
+		close(fd);
+		free_imsm(super);
+		return 0;
+	}
+
 	rv = find_intel_hba_capability(fd, super, verbose > 0 ? dev : NULL);
 	if (rv != 0) {
 #if DEBUG
-- 
2.7.4


^ permalink raw reply related

* [PATCH 1/4] Add function for getting member drive sector size
From: Pawel Baldysiak @ 2016-11-10 14:28 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak
In-Reply-To: <1478788098-32041-1-git-send-email-pawel.baldysiak@intel.com>

This patch introduces a function for getting the sector size of
a given device (fd).
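
For readers unfamiliar with the underlying call, here is a small sketch of
querying the logical sector size with the Linux BLKSSZGET ioctl.
sector_size_or_512 is a hypothetical wrapper, not part of this patch; the
512-byte fallback mirrors what init_afd() in super1.c does when the query
fails:

```c
#include <assert.h>
#include <linux/fs.h>   /* BLKSSZGET */
#include <sys/ioctl.h>

/* Hypothetical wrapper: logical sector size of fd in bytes, or 512 if fd
 * is not a block device or the ioctl fails. */
static unsigned int sector_size_or_512(int fd)
{
	int size = 0; /* BLKSSZGET stores an int */

	assert(fd >= 0);
	if (ioctl(fd, BLKSSZGET, &size) != 0 || size <= 0)
		return 512;
	return (unsigned int)size;
}
```

Called on anything that is not a block device (a pipe, a terminal, a
regular file), the ioctl fails and the wrapper returns 512.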

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
---
 mdadm.h  |  1 +
 super1.c |  3 +--
 util.c   | 16 ++++++++++++++++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 0516c82..1aeb232 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1112,6 +1112,7 @@ static inline struct supertype *guess_super(int fd) {
 }
 extern struct supertype *dup_super(struct supertype *st);
 extern int get_dev_size(int fd, char *dname, unsigned long long *sizep);
+extern int get_dev_sector_size(int fd, char *dname, unsigned int *sectsizep);
 extern int must_be_container(int fd);
 extern int dev_size_from_id(dev_t id, unsigned long long *size);
 void wait_for(char *dev, int fd);
diff --git a/super1.c b/super1.c
index 4fef378..8f800b5 100644
--- a/super1.c
+++ b/super1.c
@@ -212,8 +212,7 @@ struct align_fd {
 static void init_afd(struct align_fd *afd, int fd)
 {
 	afd->fd = fd;
-
-	if (ioctl(afd->fd, BLKSSZGET, &afd->blk_sz) != 0)
+	if (!get_dev_sector_size(afd->fd, NULL, (unsigned int *)&afd->blk_sz))
 		afd->blk_sz = 512;
 }
 
diff --git a/util.c b/util.c
index 9e4718f..092854a 100644
--- a/util.c
+++ b/util.c
@@ -1333,6 +1333,22 @@ int get_dev_size(int fd, char *dname, unsigned long long *sizep)
 	return 1;
 }
 
+/* Return sector size of device in bytes */
+int get_dev_sector_size(int fd, char *dname, unsigned int *sectsizep)
+{
+	unsigned int sectsize;
+
+	if (ioctl(fd, BLKSSZGET, &sectsize) != 0) {
+		if (dname)
+			pr_err("Cannot get sector size of %s: %s\n",
+				dname, strerror(errno));
+		return 0;
+	}
+
+	*sectsizep = sectsize;
+	return 1;
+}
+
 /* Return true if this can only be a container, not a member device.
  * i.e. is and md device and size is zero
  */
-- 
2.7.4


^ permalink raw reply related

* [PATCH 0/4] IMSM: Add support for 4Kn sector size drives
From: Pawel Baldysiak @ 2016-11-10 14:28 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Pawel Baldysiak

This patch set adds support for IMSM with 4Kn sector size drives.
The first patch adds a generic function for retrieving the sector size;
the rest are IMSM specific.
Internal calculations are still based on 512-byte sectors;
values are converted when reading from or writing to a member drive.
Mixing devices with different sector sizes is not allowed.
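
As a rough illustration of the conversion described above (not code from
this patch set): if internal values stay in 512-byte units, a value is
scaled by sector_size / 512 whenever it crosses the on-disk boundary. The
helper names below are hypothetical and only sketch the arithmetic.

```c
#include <assert.h>

/* Hypothetical helpers, not taken from super-intel.c: convert a sector
 * count between the internal 512-byte unit and the device's native
 * sector size (e.g. 512 or 4096 bytes). */
static unsigned long long to_native_sectors(unsigned long long sectors_512,
					    unsigned int sector_size)
{
	unsigned int ratio = sector_size / 512;

	assert(sector_size >= 512 && sector_size % 512 == 0);
	/* A value that is not a multiple of the ratio cannot be expressed
	 * in native sectors without loss. */
	assert(sectors_512 % ratio == 0);
	return sectors_512 / ratio;
}

static unsigned long long from_native_sectors(unsigned long long native,
					      unsigned int sector_size)
{
	assert(sector_size >= 512 && sector_size % 512 == 0);
	return native * (sector_size / 512);
}
```

On a 4Kn drive, 2048 internal sectors correspond to 256 native sectors; on
a 512-byte drive the value is unchanged.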

Pawel Baldysiak (4):
  Add function for getting member drive sector size
  IMSM: Read and store device sector size
  IMSM: Add support for 4Kn sector size drives
  IMSM: 4Kn drives support - adapt general migration record

 mdadm.h       |   1 +
 super-intel.c | 315 +++++++++++++++++++++++++++++++++++++++++++++-------------
 super1.c      |   3 +-
 util.c        |  16 +++
 4 files changed, 265 insertions(+), 70 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH 2/2] super1: fix setting bad block log offset in write_init_super1()
From: Artur Paszkiewicz @ 2016-11-10 10:50 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid
In-Reply-To: <20161110105054.29869-1-artur.paszkiewicz@intel.com>

Commit f79bbf4f6904 ("super1: don't put the bblog at the end of the free
space.") changed the location of the bad block log to be after the
write-intent bitmap, but a fixed offset was used and it can make bbl
overlap with the bitmap, especially when using a small bitmap chunk.
This patch changes it to use the actual offset and size of the bitmap.
It also joins the cases for v1.1 and v1.2 superblock because the code
was very similar.
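
The placement rule for v1.1/v1.2 can be sketched as a standalone function.
This is only an illustration under assumptions: place_bblog() and
BBLOG_SECTORS are hypothetical names, and the logic mirrors the two
branches in the diff below rather than replacing them.

```c
#include <assert.h>

#define BBLOG_SECTORS 8	/* bad block log size used by the patch */

/*
 * All values are in 512-byte sectors. sb_offset and data_offset are
 * measured from the start of the device; bm_offset and the value stored
 * in *bblog_offset are relative to the superblock.
 * Returns 0 if there is no room for a bad block log, otherwise stores
 * the offset and returns the log size.
 */
static unsigned int place_bblog(unsigned long long sb_offset, long bm_offset,
				unsigned long bm_space,
				unsigned long long data_offset,
				unsigned long *bblog_offset)
{
	assert(bblog_offset != 0);
	assert(bm_offset >= 0);	/* v1.1/v1.2: bitmap follows the superblock */

	if (data_offset >= sb_offset + bm_offset + bm_space + BBLOG_SECTORS) {
		/* Room after the bitmap: put the log right behind it. */
		*bblog_offset = bm_offset + bm_space;
		return BBLOG_SECTORS;
	}
	if (data_offset >= sb_offset + 16) {
		/* Otherwise use the last 8 sectors before the data. */
		*bblog_offset = data_offset - BBLOG_SECTORS - sb_offset;
		return BBLOG_SECTORS;
	}
	return 0;
}
```

For example, a 128-sector bitmap at offset 8 ends at sector 136 relative to
the superblock, which is past the old fixed offset of 8 + 32*2 = 72; with
the actual bitmap size the log lands at 136 instead.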

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 super1.c | 47 +++++++++++++++++++++++------------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/super1.c b/super1.c
index 982d88c..1d03a0a 100644
--- a/super1.c
+++ b/super1.c
@@ -1693,6 +1693,7 @@ static int write_init_super1(struct supertype *st)
 	unsigned long long dsize, array_size;
 	unsigned long long sb_offset;
 	unsigned long long data_offset;
+	long bm_offset;
 
 	for (di = st->info; di; di = di->next) {
 		if (di->disk.state & (1 << MD_DISK_JOURNAL))
@@ -1760,15 +1761,25 @@ static int write_init_super1(struct supertype *st)
 		 * data_offset has already been set.
 		 */
 		array_size = __le64_to_cpu(sb->size);
-		/* work out how much space we left for a bitmap,
-		 * Add 8 sectors for bad block log */
-		bm_space = choose_bm_space(array_size) + 8;
+
+		/* work out how much space we left for a bitmap */
+		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
+			bitmap_super_t *bms = (bitmap_super_t *)
+					(((char *)sb) + MAX_SB_SIZE);
+			bm_space = calc_bitmap_size(bms, 4096) >> 9;
+			bm_offset = (long)__le32_to_cpu(sb->bitmap_offset);
+		} else {
+			bm_space = choose_bm_space(array_size);
+			bm_offset = 8;
+		}
 
 		data_offset = di->data_offset;
 		if (data_offset == INVALID_SECTORS)
 			data_offset = st->data_offset;
 		switch(st->minor_version) {
 		case 0:
+			/* Add 8 sectors for bad block log */
+			bm_space += 8;
 			if (data_offset == INVALID_SECTORS)
 				data_offset = 0;
 			sb_offset = dsize;
@@ -1785,38 +1796,26 @@ static int write_init_super1(struct supertype *st)
 			}
 			break;
 		case 1:
-			sb->super_offset = __cpu_to_le64(0);
-			if (data_offset == INVALID_SECTORS)
-				data_offset = 16;
-
-			sb->data_offset = __cpu_to_le64(data_offset);
-			sb->data_size = __cpu_to_le64(dsize - data_offset);
-			if (data_offset >= 8 + 32*2 + 8) {
-				sb->bblog_size = __cpu_to_le16(8);
-				sb->bblog_offset = __cpu_to_le32(8 + 32*2);
-			} else if (data_offset >= 16) {
-				sb->bblog_size = __cpu_to_le16(8);
-				sb->bblog_offset = __cpu_to_le32(data_offset-8);
-			}
-			break;
 		case 2:
-			sb_offset = 4*2;
+			sb_offset = st->minor_version == 2 ? 8 : 0;
 			sb->super_offset = __cpu_to_le64(sb_offset);
 			if (data_offset == INVALID_SECTORS)
-				data_offset = 24;
+				data_offset = sb_offset + 16;
 
 			sb->data_offset = __cpu_to_le64(data_offset);
 			sb->data_size = __cpu_to_le64(dsize - data_offset);
-			if (data_offset >= 16 + 32*2 + 8) {
+			if (data_offset >= sb_offset+bm_offset+bm_space+8) {
 				sb->bblog_size = __cpu_to_le16(8);
-				sb->bblog_offset = __cpu_to_le32(8 + 32*2);
-			} else if (data_offset >= 16+16) {
+				sb->bblog_offset = __cpu_to_le32(bm_offset +
+								 bm_space);
+			} else if (data_offset >= sb_offset + 16) {
 				sb->bblog_size = __cpu_to_le16(8);
-				/* '8' sectors for the bblog, and another '8'
+				/* '8' sectors for the bblog, and 'sb_offset'
 				 * because we want offset from superblock, not
 				 * start of device.
 				 */
-				sb->bblog_offset = __cpu_to_le32(data_offset-8-8);
+				sb->bblog_offset = __cpu_to_le32(data_offset -
+								 8 - sb_offset);
 			}
 			break;
 		default:
-- 
2.10.1


^ permalink raw reply related

* [PATCH 1/2] super1: make internal bitmap size calculations more consistent
From: Artur Paszkiewicz @ 2016-11-10 10:50 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid

Determining internal bitmap size is performed using two different
functions (bitmap_sectors() and calc_bitmap_size()) and in
getinfo_super1() it is calculated in yet another way. Each of these
methods gives slightly different results. The most accurate is
calc_bitmap_size() but it also has a rounding issue. So:

- fix the rounding issue in calc_bitmap_size() using bitmap_bits()
- replace usages of bitmap_sectors() and open-coded calculations with
  calc_bitmap_size()
- remove bitmap_sectors()
- move bitmap_bits() to mdadm.h as inline - otherwise mdassemble won't
  compile (it does not use bitmap.c)
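
To show where the results diverge, here is a small standalone sketch of the
two ways of counting bits, plus the bytes-on-disk step. The function names
are illustrative only, and the 256-byte header used in the example is an
assumption standing in for sizeof(bitmap_super_t).

```c
#include <assert.h>

/* Rounds up, like bitmap_bits() in this patch. array_size is in
 * 512-byte sectors, chunksize in bytes. */
static unsigned long long bits_round_up(unsigned long long array_size,
					unsigned long chunksize)
{
	assert(chunksize >= 512);
	return (array_size * 512 + chunksize - 1) / chunksize;
}

/* Truncates, like the open-coded division being replaced. */
static unsigned long long bits_truncating(unsigned long long array_size,
					  unsigned long chunksize)
{
	assert(chunksize >= 512);
	return array_size / (chunksize >> 9);
}

/* Bits -> bytes, add the header, round up to 'boundary' (a power of 2). */
static unsigned long long bitmap_bytes(unsigned long long bits,
				       unsigned long header_bytes,
				       unsigned long boundary)
{
	unsigned long long bytes = (bits + 7) >> 3;

	bytes += header_bytes;
	return (bytes + boundary - 1) & ~((unsigned long long)boundary - 1);
}
```

With an array of 1000001 sectors and a 64KiB chunk, the truncating form
yields 7812 bits and the rounding form 7813, so the last partial chunk is
only covered by the latter.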

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 bitmap.c | 15 ---------------
 mdadm.h  |  9 ++++++++-
 super1.c | 25 +++++++++----------------
 3 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/bitmap.c b/bitmap.c
index 6c1b8d8..ccedfd3 100644
--- a/bitmap.c
+++ b/bitmap.c
@@ -108,21 +108,6 @@ static int count_dirty_bits(char *buf, int num_bits)
 	return num;
 }
 
-/* calculate the size of the bitmap given the array size and bitmap chunksize */
-static unsigned long long
-bitmap_bits(unsigned long long array_size, unsigned long chunksize)
-{
-	return (array_size * 512 + chunksize - 1) / chunksize;
-}
-
-unsigned long bitmap_sectors(struct bitmap_super_s *bsb)
-{
-	unsigned long long bits = bitmap_bits(__le64_to_cpu(bsb->sync_size),
-					      __le32_to_cpu(bsb->chunksize));
-	int bits_per_sector = 8*512;
-	return (bits + bits_per_sector - 1) / bits_per_sector;
-}
-
 static bitmap_info_t *bitmap_fd_read(int fd, int brief)
 {
 	/* Note: fd might be open O_DIRECT, so we must be
diff --git a/mdadm.h b/mdadm.h
index 0516c82..41a4494 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1331,7 +1331,14 @@ extern int CreateBitmap(char *filename, int force, char uuid[16],
 extern int ExamineBitmap(char *filename, int brief, struct supertype *st);
 extern int Write_rules(char *rule_name);
 extern int bitmap_update_uuid(int fd, int *uuid, int swap);
-extern unsigned long bitmap_sectors(struct bitmap_super_s *bsb);
+
+/* calculate the size of the bitmap given the array size and bitmap chunksize */
+static inline unsigned long long
+bitmap_bits(unsigned long long array_size, unsigned long chunksize)
+{
+	return (array_size * 512 + chunksize - 1) / chunksize;
+}
+
 extern int Dump_metadata(char *dev, char *dir, struct context *c,
 			 struct supertype *st);
 extern int Restore_metadata(char *dev, char *dir, struct context *c,
diff --git a/super1.c b/super1.c
index 4fef378..982d88c 100644
--- a/super1.c
+++ b/super1.c
@@ -162,7 +162,8 @@ static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned int boundary)
 {
 	unsigned long long bits, bytes;
 
-	bits = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
+	bits = bitmap_bits(__le64_to_cpu(bms->sync_size),
+			   __le32_to_cpu(bms->chunksize));
 	bytes = (bits+7) >> 3;
 	bytes += sizeof(bitmap_super_t);
 	bytes = ROUND_UP(bytes, boundary);
@@ -973,11 +974,7 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 		earliest = super_offset + (32+4)*2; /* match kernel */
 		if (info->bitmap_offset > 0) {
 			unsigned long long bmend = info->bitmap_offset;
-			unsigned long long size = __le64_to_cpu(bsb->sync_size);
-			size /= __le32_to_cpu(bsb->chunksize) >> 9;
-			size = (size + 7) >> 3;
-			size += sizeof(bitmap_super_t);
-			size = ROUND_UP(size, 4096);
+			unsigned long long size = calc_bitmap_size(bsb, 4096);
 			size /= 512;
 			bmend += size;
 			if (bmend > earliest)
@@ -1219,11 +1216,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 	} else if (strcmp(update, "uuid") == 0) {
 		copy_uuid(sb->set_uuid, info->uuid, super1.swapuuid);
 
-		if (__le32_to_cpu(sb->feature_map)&MD_FEATURE_BITMAP_OFFSET) {
-			struct bitmap_super_s *bm;
-			bm = (struct bitmap_super_s*)(st->sb+MAX_SB_SIZE);
-			memcpy(bm->uuid, sb->set_uuid, 16);
-		}
+		if (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET)
+			memcpy(bms->uuid, sb->set_uuid, 16);
 	} else if (strcmp(update, "no-bitmap") == 0) {
 		sb->feature_map &= ~__cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
 	} else if (strcmp(update, "bbl") == 0) {
@@ -1232,15 +1226,14 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		 */
 		unsigned long long sb_offset = __le64_to_cpu(sb->super_offset);
 		unsigned long long data_offset = __le64_to_cpu(sb->data_offset);
-		long bitmap_offset = (long)(int32_t)__le32_to_cpu(sb->bitmap_offset);
+		long bitmap_offset = 0;
 		long bm_sectors = 0;
 		long space;
 
 #ifndef MDASSEMBLE
 		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
-			struct bitmap_super_s *bsb;
-			bsb = (struct bitmap_super_s *)(((char*)sb)+MAX_SB_SIZE);
-			bm_sectors = bitmap_sectors(bsb);
+			bitmap_offset = (long)(int32_t)__le32_to_cpu(sb->bitmap_offset);
+			bm_sectors = calc_bitmap_size(bms, 4096) >> 9;
 		}
 #endif
 		if (sb_offset < data_offset) {
@@ -2120,7 +2113,7 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
 		/* hot-add. allow for actual size of bitmap */
 		struct bitmap_super_s *bsb;
 		bsb = (struct bitmap_super_s *)(((char*)super)+MAX_SB_SIZE);
-		bmspace = bitmap_sectors(bsb);
+		bmspace = calc_bitmap_size(bsb, 4096) >> 9;
 	}
 #endif
 	/* Allow space for bad block log */
-- 
2.10.1


^ permalink raw reply related

* Re: Question on blocks periodic writes
From: NeilBrown @ 2016-11-10  2:00 UTC (permalink / raw)
  To: Theophanis Kontogiannis, Linux RAID
In-Reply-To: <CAOzB0-6NYUqrzUyx3iQ8CbPwFh-dctKTVwE6GKupdYM-8AfOrg@mail.gmail.com>

On Thu, Nov 10 2016, Theophanis Kontogiannis wrote:

> Hello All,
>
> I am in the middle of bringing my server's power consumption to an
> absolute minimum.
>
> Have already reduced idle power from 110W to 52W (including 10W
> consumed by the UPS).
>
> Having enabled spin down to all 5 x 2TB disks, with 5 seconds idle
> times, I noticed that the disks wake up quite often without the server
> doing anything actual.
>
> Following
>
>    'echo 1 > /proc/sys/vm/block_dump'
>
> I can not miss that
>
>    'dmesg -c'
>
> reports frequent messages like:
>
> [ 8662.496150] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
> [ 8662.496185] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
> [ 8662.496253] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
> [ 8662.496269] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
> [ 8662.496282] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)

These are probably the md bitmap being updated.
If you provided some basic details about your array, e.g. the output of
  mdadm --detail /dev/md1
  mdadm --examine /dev/sda3
it would be easier to be sure.


> [ 8664.849252] md0_raid6(631): WRITE block 8 on sda3 (1 sectors)
> [ 8664.849276] md0_raid6(631): WRITE block 8 on sdf3 (1 sectors)
> [ 8664.849287] md0_raid6(631): WRITE block 8 on sdg3 (1 sectors)
> [ 8664.849298] md0_raid6(631): WRITE block 8 on sdd3 (1 sectors)
> [ 8664.849352] md0_raid6(631): WRITE block 8 on sde3 (1 sectors)

This is probably the superblock being updated.

> [ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)

This is XFS doing something.  md cannot possibly stop all IO while the
filesystem performs occasional IO.  If these continue, you need to
discuss with xfs developers how to stop it.  If the writes to individual
drives continue after there are no writes to 'md1', then it is worth
coming back here to ask.


> [ 8664.902688] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)

> [ 8665.177050] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)

> [ 8670.269138] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)

> [ 8680.269050] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)

The delay here is 270ms, then 5 seconds, then 10 seconds.
Does it reach a stable state?  What is the period in the stable state?
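
For reference, the gaps can be derived from the four sda3 "block 16"
timestamps quoted above. A minimal sketch; write_gaps() is a hypothetical
helper, not part of any md tooling.

```c
#include <assert.h>
#include <stddef.h>

/* Write the n-1 gaps between consecutive timestamps (in seconds)
 * into gaps[], which must have room for n-1 entries. */
static void write_gaps(const double *stamps, size_t n, double *gaps)
{
	size_t i;

	assert(n >= 2);
	for (i = 1; i < n; i++)
		gaps[i - 1] = stamps[i] - stamps[i - 1];
}
```

Feeding it 8664.902688, 8665.177050, 8670.269138 and 8680.269050 gives gaps
of roughly 0.27s, 5.1s and 10.0s.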

NeilBrown

^ permalink raw reply

* Re: [md PATCH 2/3] md: remove md_super_wait() call after bitmap_flush()
From: Shaohua Li @ 2016-11-10  1:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87bmxom9s0.fsf@notabene.neil.brown.name>

On Thu, Nov 10, 2016 at 11:57:35AM +1100, Neil Brown wrote:
> On Thu, Nov 10 2016, Shaohua Li wrote:
> 
> > On Wed, Nov 09, 2016 at 10:21:32AM +1100, Neil Brown wrote:
> >> bitmap_flush() finishes with bitmap_update_sb(), and that finishes
> >> with write_page(..., 1), so write_page() will wait for all writes
> >> to complete.  So there is no point calling md_super_wait()
> >> immediately afterwards.
> >> 
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  drivers/md/md.c |    1 -
> >>  1 file changed, 1 deletion(-)
> >> 
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index f389d8abe137..1f1c7f007b68 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -5472,7 +5472,6 @@ static void __md_stop_writes(struct mddev *mddev)
> >>  	del_timer_sync(&mddev->safemode_timer);
> >>  
> >>  	bitmap_flush(mddev);
> >> -	md_super_wait(mddev);
> >
> > bitmap_flush() could be null if there is no bitmap, is this safe?
> 
> Good question.
> If there is no bitmap, then all metadata updates (both superblock
> and bad-block-list) are synchronous in md_update_sb(), which is always
> called under ->reconfig_mutex and so which cannot race with this code.
> 
> So yes, it is safe.  That md_super_wait() was only ever intended to wait
> for things that bitmap_flush() might have flushed, so it should have
> been inside that function.

Ah, yes, md_super_wait follows all md_super_write. It should be safe.
Applied, thanks!

^ permalink raw reply

* Re: [md PATCH 2/3] md: remove md_super_wait() call after bitmap_flush()
From: NeilBrown @ 2016-11-10  0:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161109205120.kodctkb5xn5x55rd@kernel.org>

On Thu, Nov 10 2016, Shaohua Li wrote:

> On Wed, Nov 09, 2016 at 10:21:32AM +1100, Neil Brown wrote:
>> bitmap_flush() finishes with bitmap_update_sb(), and that finishes
>> with write_page(..., 1), so write_page() will wait for all writes
>> to complete.  So there is no point calling md_super_wait()
>> immediately afterwards.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/md.c |    1 -
>>  1 file changed, 1 deletion(-)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f389d8abe137..1f1c7f007b68 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5472,7 +5472,6 @@ static void __md_stop_writes(struct mddev *mddev)
>>  	del_timer_sync(&mddev->safemode_timer);
>>  
>>  	bitmap_flush(mddev);
>> -	md_super_wait(mddev);
>
> bitmap_flush() could be null if there is no bitmap, is this safe?

Good question.
If there is no bitmap, then all metadata updates (both superblock
and bad-block-list) are synchronous in md_update_sb(), which is always
called under ->reconfig_mutex and so which cannot race with this code.

So yes, it is safe.  That md_super_wait() was only ever intended to wait
for things that bitmap_flush() might have flushed, so it should have
been inside that function.

Thanks,
NeilBrown

^ permalink raw reply

* Re: [md PATCH 3/3] md: define mddev flags, recovery flags and r1bio state bits using enums
From: Shaohua Li @ 2016-11-09 20:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147864729298.1076.2506132363008045138.stgit@noble>

On Wed, Nov 09, 2016 at 10:21:33AM +1100, Neil Brown wrote:
> This is less error prone than using individual #defines.

applied, thanks!

^ permalink raw reply

* Re: [md PATCH 2/3] md: remove md_super_wait() call after bitmap_flush()
From: Shaohua Li @ 2016-11-09 20:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147864729285.1076.27654779060287129.stgit@noble>

On Wed, Nov 09, 2016 at 10:21:32AM +1100, Neil Brown wrote:
> bitmap_flush() finishes with bitmap_update_sb(), and that finishes
> with write_page(..., 1), so write_page() will wait for all writes
> to complete.  So there is no point calling md_super_wait()
> immediately afterwards.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/md.c |    1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f389d8abe137..1f1c7f007b68 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5472,7 +5472,6 @@ static void __md_stop_writes(struct mddev *mddev)
>  	del_timer_sync(&mddev->safemode_timer);
>  
>  	bitmap_flush(mddev);
> -	md_super_wait(mddev);

bitmap_flush() could be null if there is no bitmap, is this safe?

Thanks,
Shaohua

^ permalink raw reply

* Re: [md PATCH 1/3] md/raid1: fix: IO can block resync indefinitely
From: Shaohua Li @ 2016-11-09 20:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147864729272.1076.727849896269121517.stgit@noble>

On Wed, Nov 09, 2016 at 10:21:32AM +1100, Neil Brown wrote:
> While performing a resync/recovery, raid1 divides the
> array space into three regions:
>  - before the resync
>  - at or shortly after the resync point
>  - much further ahead of the resync point.
> 
> Write requests to the first or third do not need to wait.  Write
> requests to the middle region do need to wait if resync requests are
> pending.
> 
> If there are any active write requests in the middle region, resync
> will wait for them.
> 
> Due to an accounting error, there is a small range of addresses,
> between conf->next_resync and conf->start_next_window, where write
> requests will *not* be blocked, but *will* be counted in the middle
> region.  This can effectively block resync indefinitely if filesystem
> writes happen repeatedly to this region.

Good catch, thanks Neil!

> As ->next_window_requests is incremented when the sector is before

I changed 'before' to 'after' when applying this patch.

Thanks,
Shaohua

^ permalink raw reply

* Re: WARNING: mismatch_cnt is not 0 on <array device>
From: Benjammin2068 @ 2016-11-09 19:52 UTC (permalink / raw)
  To: Linux-RAID
In-Reply-To: <1497737a-a307-4501-4158-9703a051ef67@turmel.org>

On 11/08/2016 02:38 PM, Phil Turmel wrote:
>
> Have you added up the peak current draws of your drives to make sure
> your power supply keeps up when all drives are writing simultaneously
> (common with parity raid)?

Keeping in mind this is a pretty empty box that's pretty sleepy until I hit compile doing FPGA/SoC development.
(i.e. no power hungry graphics cards -- it's a home file server more than a desktop)

Here are the power numbers. All of the drives are listed at peak (i.e. start-up) draw, where the average is much less, but I'm using those hypothetically as a worst case.

They still don't come close to the supply rail limits, even after considering fans (which spin at very low PWM duty cycles). (Not sure this table will print; let me know if it doesn't.)

Power supply:

  Model        Wattage   +5V   +12V   +3.3V   5V StdBy
  PWS-652-2H   650W      30A   54A    20A     4A

Drives (peak draw; wattage in W, rail currents in A):

  Slot      Model          Wattage   +5V   +12V    +3.3V   5V StdBy
  Slot 1    WD2500AAJS     21.24     -     1.77    N/A     N/A
  Slot 2    WD10EZEX       30        -     2.5     N/A     N/A
  Slot 3    ST1000DM005    24        -     2       N/A     N/A
  Slot 4    HD103SJ        36.4      2     2.2     N/A     N/A
  Slot 5    WD10EFRX-68F   14.4      -     1.2     N/A     N/A
  Slot 6    HD103SJ        36.4      2     2.2     N/A     N/A
  Slot 7    WD10EFRX-68F   14.4      -     1.2     N/A     N/A
  Slot 8    WD10EFRX-68F   14.4      -     1.2     N/A     N/A
  Slot 9    Empty          -         -     -       N/A     N/A
  Slot 10   Empty          -         -     -       N/A     N/A
  Slot 11   WD10JFCX-68N   5         1     0       N/A     N/A
  Slot 12   WD10JFCX-68N   5         1     0       N/A     N/A

  Totals:                  201.24    6     14.27
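
The per-rail check behind those totals can be sketched as follows. This is
only an illustration: the struct and function names are made up, and blank
+5V cells in the table are assumed to mean 0A.

```c
#include <assert.h>
#include <stddef.h>

struct drive_draw {
	double amps_5v;		/* peak current on the +5V rail */
	double amps_12v;	/* peak current on the +12V rail */
};

/* Sum the peak draw per rail and return 1 if both sums stay within
 * the given rail limits, 0 otherwise. */
static int rails_ok(const struct drive_draw *d, size_t n,
		    double limit_5v, double limit_12v,
		    double *sum_5v, double *sum_12v)
{
	size_t i;

	assert(sum_5v && sum_12v);
	*sum_5v = 0.0;
	*sum_12v = 0.0;
	for (i = 0; i < n; i++) {
		*sum_5v += d[i].amps_5v;
		*sum_12v += d[i].amps_12v;
	}
	return *sum_5v <= limit_5v && *sum_12v <= limit_12v;
}
```

With the ten populated slots this gives about 6A on +5V and 14.27A on +12V,
against limits of 30A and 54A.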
	


^ permalink raw reply

* Re: WARNING: mismatch_cnt is not 0 on <array device>
From: Benjammin2068 @ 2016-11-09 19:00 UTC (permalink / raw)
  To: Linux-RAID
In-Reply-To: <58223D29.6030401@youngman.org.uk>

On 11/08/2016 03:01 PM, Wols Lists wrote:
> On 08/11/16 20:38, Phil Turmel wrote:
>> Have you added up the peak current draws of your drives to make sure
>> your power supply keeps up when all drives are writing simultaneously
>> (common with parity raid)?
> On that point, be aware that many power supplies quote the sum of the
> power to all rails. It could well be that the supply is nominally plenty
> powerful enough, but the load on an individual rail is too high.
>
>

Right.. I'll check that when I do the math on the supplies.

  -Ben


^ permalink raw reply

* Re: WARNING: mismatch_cnt is not 0 on <array device>
From: Benjammin2068 @ 2016-11-09 19:00 UTC (permalink / raw)
  To: Linux-RAID
In-Reply-To: <1497737a-a307-4501-4158-9703a051ef67@turmel.org>

On 11/08/2016 02:38 PM, Phil Turmel wrote:
> On 11/08/2016 02:53 PM, Benjammin2068 wrote:
>> On 11/08/2016 12:47 PM, Benjammin2068 wrote:
>> Now that I think about it -- and have been talking out loud to myself (I don't think I'm crazy)...
>>
>> A parallel to all this is:
>>
>> I don't think the mismatch_cnt started showing up until I moved from RAID5 to RAID6.
>>
>> :O
>>
>> How painful is it to switch back to RAID5 to test that theory?
> Don't.  Sounds like raid6's stricter calculations are catching a real problem.

Ok -- no switching back to RAID5.

> Do you have ECC RAM?

Yes.
>
> If so, are you getting any machine check exceptions?

I'm not getting any machine check problems (I looked).

> If not, have you done a thorough memtest any time in the recent past?

Yes. When I started getting the mismatch counts, I took the system down and ran MEMtest on this through a couple of passes.

no problem.

> If it's not memory, can you exercise the controller channels heavily to
> see if they drop from errors?

I could but haven't -- any recommendations on tools out there?

I've also wondered whether the raid-check that runs on Sunday is actually part of the problem.

i.e. if I didn't do the weekly check, the drives wouldn't get slammed anywhere near as hard; the rest of the week they see far less load.

Does mismatch_cnt only change value during a check, or is it updated with each operation?

> Have you added up the peak current draws of your drives to make sure
> your power supply keeps up when all drives are writing simultaneously
> (common with parity raid)?

Not exactly, but I can do that. The system has a 650W supply -- I'll go do a power check and work that against the known drives in the system.

This is a "server chassis" though which came with the 8 slots in the front to power drives - so it's not exactly a "home chassis" that I put in a 300W and then jammed full of drives.

Still -- that's a reasonable question and I'll investigate.

> One more: do you have swap on top of md raid?

No. I've read about swap on RAID1 causing mismatch counts.

However, I am running a VM on this RAID volume (VirtualBox and a reasonably sleepy instance of Win7_64) and have pondered that.

 -Ben

^ permalink raw reply

* Question on blocks periodic writes
From: Theophanis Kontogiannis @ 2016-11-09 17:20 UTC (permalink / raw)
  To: Linux RAID

Hello All,

I am in the middle of bringing my server's power consumption to an
absolute minimum.

Have already reduced idle power from 110W to 52W (including 10W
consumed by the UPS).

Having enabled spin down to all 5 x 2TB disks, with 5 seconds idle
times, I noticed that the disks wake up quite often without the server
doing anything actual.

Following

   'echo 1 > /proc/sys/vm/block_dump'

I can not miss that

   'dmesg -c'

reports frequent messages like:

[ 8662.496150] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
[ 8662.496185] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
[ 8662.496253] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
[ 8662.496269] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
[ 8662.496282] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)
[ 8664.849252] md0_raid6(631): WRITE block 8 on sda3 (1 sectors)
[ 8664.849276] md0_raid6(631): WRITE block 8 on sdf3 (1 sectors)
[ 8664.849287] md0_raid6(631): WRITE block 8 on sdg3 (1 sectors)
[ 8664.849298] md0_raid6(631): WRITE block 8 on sdd3 (1 sectors)
[ 8664.849352] md0_raid6(631): WRITE block 8 on sde3 (1 sectors)
[ 8664.858104] xfsaild/md1(658): WRITE block 0 on md1 (8 sectors)
[ 8664.902688] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
[ 8664.902705] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
[ 8664.902715] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
[ 8664.902725] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
[ 8664.902735] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)
[ 8665.100056] md1_raid6(630): WRITE block 8 on sda1 (1 sectors)
[ 8665.100107] md1_raid6(630): WRITE block 8 on sdf1 (1 sectors)
[ 8665.100164] md1_raid6(630): WRITE block 8 on sde1 (1 sectors)
[ 8665.100217] md1_raid6(630): WRITE block 8 on sdg1 (1 sectors)
[ 8665.100467] md1_raid6(630): WRITE block 8 on sdd1 (1 sectors)
[ 8665.177050] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
[ 8665.177098] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
[ 8665.177154] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
[ 8665.177207] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
[ 8665.177431] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)
[ 8665.225978] md0_raid6(631): WRITE block 8 on sda3 (1 sectors)
[ 8665.225996] md0_raid6(631): WRITE block 8 on sdf3 (1 sectors)
[ 8665.226064] md0_raid6(631): WRITE block 8 on sdg3 (1 sectors)
[ 8665.226111] md0_raid6(631): WRITE block 8 on sdd3 (1 sectors)
[ 8665.226191] md0_raid6(631): WRITE block 8 on sde3 (1 sectors)
[ 8670.269138] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
[ 8670.269180] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
[ 8670.269237] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
[ 8670.269291] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
[ 8670.269344] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)
[ 8680.269050] md0_raid6(631): WRITE block 16 on sda3 (8 sectors)
[ 8680.269092] md0_raid6(631): WRITE block 16 on sdf3 (8 sectors)
[ 8680.269147] md0_raid6(631): WRITE block 16 on sdg3 (8 sectors)
[ 8680.269249] md0_raid6(631): WRITE block 16 on sdd3 (8 sectors)
[ 8680.269428] md0_raid6(631): WRITE block 16 on sde3 (8 sectors)

I guess those writes are the reason for the frequent disk spin-ups.

What is the reason behind those writes?

Can I affect it? Should I touch it?


---
Best regards,
ΜΦΧ,

Theophanis Kontogiannis

^ permalink raw reply

* [md PATCH 3/3] md: define mddev flags, recovery flags and r1bio state bits using enums
From: NeilBrown @ 2016-11-08 23:21 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147864718560.1076.2148299631932240330.stgit@noble>

This is less error prone than using individual #defines.
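
A userspace sketch of the pattern, for illustration only: the EX_* names
and ex_set_bit() are hypothetical, and BIT() is defined locally rather than
taken from the kernel headers. The enum holds bit numbers, and masks are
built from those names instead of from literal values such as (1 | 2 | 4).

```c
#include <assert.h>

#define BIT(nr) (1UL << (nr))

/* Bit numbers, assigned automatically by the compiler. */
enum ex_flags {
	EX_CHANGE_DEVS,
	EX_CHANGE_CLEAN,
	EX_CHANGE_PENDING,
	EX_ARRAY_FIRST_USE,
};

/* Mask derived from the names, so it follows any renumbering. */
#define EX_UPDATE_SB_FLAGS (BIT(EX_CHANGE_DEVS) | \
			    BIT(EX_CHANGE_CLEAN) | \
			    BIT(EX_CHANGE_PENDING))

static void ex_set_bit(int nr, unsigned long *flags)
{
	assert(nr >= 0 && nr < (int)(sizeof(*flags) * 8));
	*flags |= BIT(nr);
}

static int ex_needs_sb_update(unsigned long flags)
{
	return (flags & EX_UPDATE_SB_FLAGS) != 0;
}
```

Inserting a new flag in the middle of the enum shifts the later bit numbers,
and the mask follows without any manual edit.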

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.h    |   76 +++++++++++++++++++++++++---------------------------
 drivers/md/raid1.h |   18 +++++++-----
 2 files changed, 46 insertions(+), 48 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 21bd94fad96a..af6b33c30d2d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -192,6 +192,25 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 				int is_new);
 struct md_cluster_info;
 
+enum mddev_flags {
+	MD_CHANGE_DEVS,		/* Some device status has changed */
+	MD_CHANGE_CLEAN,	/* transition to or from 'clean' */
+	MD_CHANGE_PENDING,	/* switch from 'clean' to 'active' in progress */
+	MD_ARRAY_FIRST_USE,	/* First use of array, needs initialization */
+	MD_CLOSING,		/* If set, we are closing the array, do not open
+				 * it then */
+	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
+	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
+	MD_RELOAD_SB,		/* Reload the superblock because another node
+				 * updated it.
+				 */
+	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
+				   * already took resync lock, need to
+				   * release the lock */
+};
+#define MD_UPDATE_SB_FLAGS (BIT(MD_CHANGE_DEVS) | \
+			    BIT(MD_CHANGE_CLEAN) | \
+			    BIT(MD_CHANGE_PENDING))	/* If these are set, md_update_sb needed */
 struct mddev {
 	void				*private;
 	struct md_personality		*pers;
@@ -199,21 +218,6 @@ struct mddev {
 	int				md_minor;
 	struct list_head		disks;
 	unsigned long			flags;
-#define MD_CHANGE_DEVS	0	/* Some device status has changed */
-#define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
-#define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
-#define MD_UPDATE_SB_FLAGS (1 | 2 | 4)	/* If these are set, md_update_sb needed */
-#define MD_ARRAY_FIRST_USE 3    /* First use of array, needs initialization */
-#define MD_CLOSING	4	/* If set, we are closing the array, do not open
-				 * it then */
-#define MD_JOURNAL_CLEAN 5	/* A raid with journal is already clean */
-#define MD_HAS_JOURNAL	6	/* The raid array has journal feature set */
-#define MD_RELOAD_SB	7	/* Reload the superblock because another node
-				 * updated it.
-				 */
-#define MD_CLUSTER_RESYNC_LOCKED 8 /* cluster raid only, which means node
-				    * already took resync lock, need to
-				    * release the lock */
 
 	int				suspended;
 	atomic_t			active_io;
@@ -307,31 +311,6 @@ struct mddev {
 	int				parallel_resync;
 
 	int				ok_start_degraded;
-	/* recovery/resync flags
-	 * NEEDED:   we might need to start a resync/recover
-	 * RUNNING:  a thread is running, or about to be started
-	 * SYNC:     actually doing a resync, not a recovery
-	 * RECOVER:  doing recovery, or need to try it.
-	 * INTR:     resync needs to be aborted for some reason
-	 * DONE:     thread is done and is waiting to be reaped
-	 * REQUEST:  user-space has requested a sync (used with SYNC)
-	 * CHECK:    user-space request for check-only, no repair
-	 * RESHAPE:  A reshape is happening
-	 * ERROR:    sync-action interrupted because io-error
-	 *
-	 * If neither SYNC or RESHAPE are set, then it is a recovery.
-	 */
-#define	MD_RECOVERY_RUNNING	0
-#define	MD_RECOVERY_SYNC	1
-#define	MD_RECOVERY_RECOVER	2
-#define	MD_RECOVERY_INTR	3
-#define	MD_RECOVERY_DONE	4
-#define	MD_RECOVERY_NEEDED	5
-#define	MD_RECOVERY_REQUESTED	6
-#define	MD_RECOVERY_CHECK	7
-#define MD_RECOVERY_RESHAPE	8
-#define	MD_RECOVERY_FROZEN	9
-#define	MD_RECOVERY_ERROR	10
 
 	unsigned long			recovery;
 	/* If a RAID personality determines that recovery (of a particular
@@ -445,6 +424,23 @@ struct mddev {
 	unsigned int			good_device_nr;	/* good device num within cluster raid */
 };
 
+enum recovery_flags {
+	/*
+	 * If neither SYNC or RESHAPE are set, then it is a recovery.
+	 */
+	MD_RECOVERY_RUNNING,	/* a thread is running, or about to be started */
+	MD_RECOVERY_SYNC,	/* actually doing a resync, not a recovery */
+	MD_RECOVERY_RECOVER,	/* doing recovery, or need to try it. */
+	MD_RECOVERY_INTR,	/* resync needs to be aborted for some reason */
+	MD_RECOVERY_DONE,	/* thread is done and is waiting to be reaped */
+	MD_RECOVERY_NEEDED,	/* we might need to start a resync/recover */
+	MD_RECOVERY_REQUESTED,	/* user-space has requested a sync (used with SYNC) */
+	MD_RECOVERY_CHECK,	/* user-space request for check-only, no repair */
+	MD_RECOVERY_RESHAPE,	/* A reshape is happening */
+	MD_RECOVERY_FROZEN,	/* User request to abort, and not restart, any action */
+	MD_RECOVERY_ERROR,	/* sync-action interrupted because io-error */
+};
+
 static inline int __must_check mddev_lock(struct mddev *mddev)
 {
 	return mutex_lock_interruptible(&mddev->reconfig_mutex);
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 61c39b390cd8..5ec19449779d 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -161,14 +161,15 @@ struct r1bio {
 };
 
 /* bits for r1bio.state */
-#define	R1BIO_Uptodate	0
-#define	R1BIO_IsSync	1
-#define	R1BIO_Degraded	2
-#define	R1BIO_BehindIO	3
+enum r1bio_state {
+	R1BIO_Uptodate,
+	R1BIO_IsSync,
+	R1BIO_Degraded,
+	R1BIO_BehindIO,
 /* Set ReadError on bios that experience a readerror so that
  * raid1d knows what to do with them.
  */
-#define R1BIO_ReadError 4
+	R1BIO_ReadError,
 /* For write-behind requests, we call bi_end_io when
  * the last non-write-behind device completes, providing
  * any write was successful.  Otherwise we call when
@@ -176,10 +177,11 @@ struct r1bio {
  * with failure when last write completes (and all failed).
  * Record that bi_end_io was called with this flag...
  */
-#define	R1BIO_Returned 6
+	R1BIO_Returned,
 /* If a write for this request means we can clear some
  * known-bad-block records, we set this flag
  */
-#define	R1BIO_MadeGood 7
-#define	R1BIO_WriteError 8
+	R1BIO_MadeGood,
+	R1BIO_WriteError,
+};
 #endif



^ permalink raw reply related
