Linux RAID subsystem development


* Re: Maximizing failed disk replacement on a RAID5 array
From: Durval Menezes @ 2011-06-08  6:58 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid, Drew
In-Reply-To: <4DEDB8B7.2070708@fnarfbargle.com>

Hello,

On Tue, Jun 7, 2011 at 2:35 AM, Brad Campbell <brad@fnarfbargle.com> wrote:
> On 07/06/11 13:03, Durval Menezes wrote:
>>
>> Hello Folks,
>>
>> Just finished the "repair". It completed OK, and over SMART the HD now
>> shows a "Reallocated_Sector_Ct" of 291 (which shows that many bad
>> sectors have been remapped), but it's also still reporting 4
>> "Current_Pending_Sector" and 4 "Offline_Uncorrectable"... which I
>> think means exactly the same thing, ie, that there are 4 "active"
>> (from the HD perspective) sectors on the drive still detected as bad
>> and not remapped.
>>
>> I've been thinking about exactly what that means, and I think that
>> these 4 sectors are either A) outside the RAID partition (not very
>> probable as this partition occupies more than 99.99% of the disk,
>> leaving just a small, less than 105MB area at the beginning), or B)
>> some kind of metadata or unused space that hasn't been read and
>> rewritten by the "repair" I've just completed. I've just done a "dd
>> bs=1024k count=105 </dev/DISK >/dev/null" to account for
>> hypothesis A), and come out empty: no errors, and the drive still
>> shows 4 bad, unmapped sectors on SMART.
>>
>> So, by elimination, it must be either case B) above, or a bug in the
>> linux md code (which prevents it from hitting every needed block on
>> the disk), or a bug in SMART (which makes it report inexistent bad
>>
> Try running a SMART long test (smartctl -t long) and it will tell you whether
> the sectors are really bad or not.
> I've had instances where the firmware still thought that some previously
> pending sectors were still pending until I forced a test, at which time the
> drive came to its senses and they went away.
>
> I believe if you wait until the drive gets around to doing its periodic
> offline data collection you'll see the same thing, but a long test is nice
> as it will give you an actual block number for the first failure (if you
> have one)

I did it (smartctl -t long) and it completed (registering an error at
the very end of the disk):

     SMART Self-test log structure revision number 1
     Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
     # 1  Extended offline    Completed: read failure       10%      9942         2930273794
The SMART Attributes table still shows 4 pending/uncorrectable sectors:

    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       4

Converting the above LBA (counted in 512-byte sectors) to a 1024-byte block
number, I find 2930273794/2 = 1465136897; as this is a 1.5TB HD, this first
error (there are possibly 3 more) is right at the final 35GB of the media,
so it's inside (near the end) of the RAID partition:

     fdisk -l /dev/sdc
         Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes
          255 heads, 63 sectors/track, 182401 cylinders
          Units = cylinders of 16065 * 512 = 8225280 bytes
          Sector size (logical/physical): 512 bytes / 512 bytes
          I/O size (minimum/optimal): 512 bytes / 512 bytes
          Disk identifier: 0x6be6057c
             Device Boot      Start         End      Blocks   Id  System
          /dev/sdc1               1           1        8001    4  FAT16 <32M
          /dev/sdc2   *           2          14      104422+  83  Linux
          /dev/sdc3              15      182401  1465023577+  fd  Linux raid autodetect
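
The sector arithmetic above can be cross-checked with a short userspace C sketch. It assumes 512-byte logical sectors (as fdisk reports), a cylinder unit of 16065 sectors, and partitions that start and end exactly on the cylinder boundaries listed. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; all figures fed to them come from the smartctl
 * and fdisk output quoted in this message. */

/* Convert a 512-byte LBA to the block index used with "dd bs=1024". */
uint64_t lba_to_kib_block(uint64_t lba)
{
	return lba / 2;
}

/* Total number of 512-byte sectors on a disk of the given byte size. */
uint64_t disk_sectors(uint64_t disk_bytes)
{
	return disk_bytes / 512;
}

/* First LBA of a partition starting on a 1-based cylinder number,
 * assuming it starts exactly on the cylinder boundary. */
uint64_t partition_first_lba(uint64_t start_cylinder, uint64_t sectors_per_cylinder)
{
	assert(start_cylinder >= 1);
	return (start_cylinder - 1) * sectors_per_cylinder;
}

/* Last LBA of a partition ending on a 1-based cylinder number,
 * assuming it ends exactly on the cylinder boundary. */
uint64_t partition_last_lba(uint64_t end_cylinder, uint64_t sectors_per_cylinder)
{
	assert(end_cylinder >= 1);
	return end_cylinder * sectors_per_cylinder - 1;
}

/* Non-zero if lba lies within the inclusive range [first, last]. */
int lba_in_range(uint64_t lba, uint64_t first, uint64_t last)
{
	return lba >= first && lba <= last;
}
```

With the figures quoted above, `partition_last_lba(182401, 16065)` comes out slightly below the reported LBA 2930273794. Under these assumptions the failing sector would sit in the small unpartitioned tail of the disk, so the partition bounds are worth re-checking before concluding that the sector holds array data.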

Confirming that this block is indeed returning read errors:

    dd count=1 bs=1024 skip=1465136897 if=/dev/sdc of=/dev/null
        [long delay]
        dd: reading `/dev/sdc': Input/output error
        0+0 records in
        0+0 records out
        0 bytes (0 B) copied, 45.1076 s, 0.0 kB/s

Examining one sector before:

    dd count=1 bs=1024 skip=146513686 if=/dev/sdc | hexdump -C
        00000000  92 e1 b4 d4 c6 cd 0f 33  db 7c ff a9 be c1 c1 8e  |.......3.|......|
        00000010  71 35 fc 55 16 c4 36 ef  59 10 db 20 22 f4 57 99  |q5.U..6.Y.. ".W.|
        00000020  31 61 2b 24 e0 98 3c 94  4b 8a 17 93 23 aa e9 96  |1a+$..<.K...#...|
        00000030  b0 47 7b 8f 12 c6 52 42  99 0d 72 b4 51 02 5a 8e  |.G{...RB..r.Q.Z.|
        00000040  c6 5a ac 86 0b a5 74 9b  13 e7 87 7a db 94 e2 7f  |.Z....t....z....|
        00000050  c6 42 75 ba 53 bf 7f 20  fc 9c ad 4b 8f 3c 85 64  |.Bu.S.. ...K.<.d|
        00000060  3a b0 ac 41 6e 41 fb 95  03 70 24 7e 2e d5 df 8a  |:..AnA...p$~....|
        00000070  f9 dc d1 7d 4a 1e e1 93  9d 39 18 83 6c 9f 9f 79  |...}J....9..l..y|
        00000080  53 a3 d1 fb 7f c6 bd 44  8d 0c 40 06 0a 92 f9 7e  |S......D..@....~|
        00000090  0c 0e 87 43 66 9d fc 12  2b 0d 7a 34 ba 84 cb 73  |...Cf...+.z4...s|
        000000a0  47 3b a4 fa c9 50 d9 96  f9 50 a2 60 17 eb 7c c8  |G;...P...P.`..|.|
        000000b0  42 76 59 d0 1e 06 10 a8  3b 89 74 8d b4 04 83 88  |BvY.....;.t.....|
        000000c0  d7 9d 3c 82 cf 8f 7d 6e  a2 b6 bf 56 06 c0 aa 7c  |..<...}n...V...||
        000000d0  7d 39 ae 0a 67 48 28 b5  07 fd fc ae 49 e4 7a 08  |}9..gH(.....I.z.|
        000000e0  8a 37 94 e0 d3 d7 f0 f4  4c 49 3a ed b7 f4 84 95  |.7......LI:.....|
        000000f0  3f 0a 4f 6c 47 62 1a f4  70 ca 14 8a 52 6d 4c 1e  |?.OlGb..p...RmL.|
        00000100  da 0c 29 17 c1 a4 e1 5c  cb 43 e0 01 45 9c 72 7f  |..)....\.C..E.r.|
        00000110  78 b8 19 3f dd 35 c5 50  ff 9b 42 fb 0b d8 61 5a  |x..?.5.P..B...aZ|
        00000120  24 2b ae c9 45 e6 e5 e9  04 00 93 bb 53 c0 fd d6  |$+..E.......S...|
        00000130  9c ab 69 98 50 f0 5e 98  0d 0b b3 dc cb cb d0 7d  |..i.P.^........}|
        00000140  21 70 68 e8 fb 3c 55 fd  2d c6 6c 25 86 dd 9a 4a  |!ph..<U.-.l%...J|
        00000150  fc e2 24 a9 fb 9a 6b be  d5 e2 3b e9 a0 b1 61 ad  |..$...k...;...a.|
        00000160  1f 9a c8 31 86 91 c6 1f  86 9e 17 35 25 7e 77 42  |...1.......5%~wB|
        00000170  37 86 b2 17 08 8e c4 cf  4e e2 64 7d 83 11 05 1e  |7.......N.d}....|
        00000180  6b c1 e7 5d 0f e2 c9 f9  0a 0a b1 2b 83 a1 2a a4  |k..].......+..*.|
        00000190  1d f8 a6 13 2f e9 45 bb  b7 e2 71 e9 69 ad 3c 47  |..../.E...q.i.<G|
        000001a0  3f fa 39 7f 1e 93 0e d2  89 09 dc d2 b3 3b f8 6f  |?.9..........;.o|
        000001b0  21 21 72 b6 9e 9d 42 79  fb 78 3c 02 85 7b 1f 4f  |!!r...By.x<..{.O|
        000001c0  8b 3c 26 62 8a 58 38 a7  48 31 b9 e2 0c 0d 41 d6  |.<&b.X8.H1....A.|
        000001d0  8f 43 95 f0 1f 52 3e 0e  55 8d c0 93 f7 e3 c8 79  |.C...R>.U......y|
        000001e0  a2 bc 51 72 87 3c 16 c3  d0 f3 57 a8 e4 48 51 32  |..Qr.<....W..HQ2|
        000001f0  00 99 3e 0e 88 a3 fa e3  00 a4 c2 cb 28 7a a1 00  |..>.........(z..|
        00000200  a0 b4 1b 6d c4 2a 15 75  a3 f0 24 47 5a d6 54 74  |...m.*.u..$GZ.Tt|
        00000210  d0 ad e4 92 b1 99 5d 7a  62 47 b9 54 8f 9e 15 ca  |......]zbG.T....|
        00000220  65 09 9e d0 d3 61 51 93  88 4a 46 1e 5c 15 07 ef  |e....aQ..JF.\...|
        00000230  b0 92 fa a7 e7 3d e5 36  20 67 d2 24 b7 59 ae f4  |.....=.6 g.$.Y..|
        00000240  7c 26 57 90 e1 69 b5 f3  b4 1b 8e e6 07 2e 46 84  ||&W..i........F.|
        1+0 records in
        1+0 records out
        1024 bytes (1.0 kB) copied, 5.0224e-05 s, 20.4 MB/s

Looking at one sector after the error returns similar results.

So, I don't know about you, but the above seems pretty much like data
to me (although it could also be parity).

So I have two questions:

1) can I simply skip over these sectors (using dd_rescue or multiple
dd invocations) when off-line copying the old disk to the new one,
trusting the RAID5 to reconstruct the data correctly from the other 2
disks? Or is it better to simply do the recovery the "traditional" way
(ie, "fail" the old disk, "add" the new one, and run the risk of a
possible bad sector on one of the two remaining old disks ruining the
show completely and forcing me to recover from backups [I *do* have
up-to-date backups on this array])?

2) Is there a formula, a program or anything that can tell me exactly
what is located at the above sector (ie, whether it's RAID parity or a
data sector)?
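
For question 2, the usual md arithmetic can be sketched roughly as follows. This assumes the array uses the left-symmetric RAID5 layout (the mdadm default), and that the chunk size, the member's data offset and its slot number have been read from `mdadm --examine`. The LBA must first be made relative to the member partition by subtracting the partition's first LBA. The function names are hypothetical, and none of the example figures are taken from this array.

```c
#include <assert.h>
#include <stdint.h>

/* Stripe number that a member-device sector belongs to. dev_sector and
 * data_offset are 512-byte sectors relative to the start of the member
 * partition; chunk_sectors is the chunk size in sectors. */
uint64_t stripe_of_sector(uint64_t dev_sector, uint64_t data_offset,
			  uint64_t chunk_sectors)
{
	assert(chunk_sectors > 0);
	assert(dev_sector >= data_offset);
	return (dev_sector - data_offset) / chunk_sectors;
}

/* Slot holding parity for a stripe in the left-symmetric layout: parity
 * starts on the last slot and moves down one slot per stripe. */
unsigned parity_slot(uint64_t stripe, unsigned raid_disks)
{
	assert(raid_disks >= 3);
	return (raid_disks - 1) - (unsigned)(stripe % raid_disks);
}

/* Non-zero if the chunk containing dev_sector on the member occupying
 * 'slot' holds parity rather than data. */
int sector_is_parity(uint64_t dev_sector, uint64_t data_offset,
		     uint64_t chunk_sectors, unsigned raid_disks, unsigned slot)
{
	uint64_t stripe = stripe_of_sector(dev_sector, data_offset, chunk_sectors);

	return parity_slot(stripe, raid_disks) == slot;
}
```

For a 3-disk array the parity slot cycles 2, 1, 0, 2, ... from stripe 0 onward, so roughly one chunk in three on any member is parity. A different layout or metadata version changes the formula, so the result is only as good as the parameters fed in.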

Thanks,
-- 
   Durval Menezes.





* Re: [PATCH 06/22] FIX: Initialize reshape structure
From: NeilBrown @ 2011-06-08  6:54 UTC (permalink / raw)
  To: Krzysztof Wojcik
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski
In-Reply-To: <20110602144900.27355.65493.stgit@gklab-128-111.igk.intel.com>

On Thu, 02 Jun 2011 16:49:00 +0200 Krzysztof Wojcik
<krzysztof.wojcik@intel.com> wrote:

> From: Adam Kwolek <adam.kwolek@intel.com>
> 
> It can occur that the reshape structure contains random values.
> Because of this, analyse_change() can refuse to start a grow
> without real cause (e.g. a check of the uninitialized new_chunk).
> 
> Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
> ---
>  Grow.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/Grow.c b/Grow.c
> index 7a8ffdb..11b2214 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -1745,6 +1745,7 @@ static int reshape_array(char *container, int fd, char *devname,
>  		info->component_size = array_size / array.raid_disks;
>  	}
>  
> +	memset(&reshape, 0, sizeof(reshape));
>  	if (info->reshape_active) {
>  		int new_level = info->new_level;
>  		info->new_level = UnSet;


This doesn't make any sense to me.

I cannot see how any random numbers in 'reshape' can cause analyse_change to
do the wrong thing, and there is no "new_chunk" in 'reshape' that could be
"uninitialized".

Please explain.

NeilBrown



* Nested RAID and booting
From: Leslie Rhorer @ 2011-06-08  5:54 UTC (permalink / raw)
  To: linux-raid


	For financial reasons, I have had to temporarily create two members
of a RAID6 array by first creating a pair of RAID0 arrays from four member
disks.  The RAID6 array is currently re-shaping, and so far all seems well.
I do have a concern about what will happen when the system reboots, however.
In order to properly assemble the RAID6 array, the two RAID0 arrays will
first need to be assembled and running, correct?  How do I guarantee the two
RAID0 arrays will be up before mdadm attempts to assemble the RAID6 array?
Will simply putting them in the mdadm.conf file prior to the RAID6 array do
the trick?



* Re: [PATCH 0 of 8] Various MD patches to support device-mapper interaction
From: NeilBrown @ 2011-06-08  5:21 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <1307486733.31279.4.camel@f14.redhat.com>

On Tue, 07 Jun 2011 17:45:33 -0500 Jonathan Brassow <jbrassow@redhat.com>
wrote:

> Neil,
> 
> Please consider the following patches for inclusion:

Mostly looks good. 
The last two require some changes which I have commented on separately.
If you get me updated versions of those I will put them all at least into
-next, and then see if Linus is willing to include them in -rc3.

Thanks,
NeilBrown


> 
> 1) md-no-integrity-register-if-no-gendisk.patch
> E-mail received 5/24/2011 of patch application, but it has not yet
> landed in 3.0.0-rc2
> 
> 2) md-no-sync-IO-while-suspended.patch
> E-mail received 5/24/2011 of patch application, but it has not yet
> landed in 3.0.0-rc2
> 
> 3) md-possible-typo.patch
> You mentioned you might take this one if I changed the message instead
> of the parameter value.  I've s/ blocks/k/ - better?
> 
> 4) md-move-thread-wakeups-into-resume.patch
> No comments on this yet.
> 
> 5) md-raid1-changes-to-allow-use-by-device-mapper.patch
> No comments on this yet.
> 
> 6) md-add-sync_super-to-mddev_t-struct.patch
> This is the patch I'm proposing as the substitute to the 'analyze_sbs'
> patches I had originally posted.  We add a function pointer that can be
> called from device-mapper for the purposes of updating the superblock.
> (This way, the superblock and 'super_types' functions can be in
> dm-raid.c.)
> 
> 7) md-add-bitmap-support.patch
> Bitmap support for device-mapper created arrays.
> 
> 8) md-raid5-do-not-set-fullsync.patch
> A somewhat hackish way around the problem of RAID5 setting fullsync on a
> device that merely suffered a transient failure.
> 
> Thanks,
>  brassow
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: [PATCH 8 of 8] MD:  raid5 do not set fullsync
From: NeilBrown @ 2011-06-08  5:20 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <1307487163.31279.18.camel@f14.redhat.com>

On Tue, 07 Jun 2011 17:52:43 -0500 Jonathan Brassow <jbrassow@redhat.com>
wrote:

> Add new flag for struct mdk_rdev_s to indicate when recovery can use bitmap
> 
> Device-mapper can tell if a device is in-sync, in need of partial (bitmap aided)
> recovery, or in need of complete recovery.  The raid5 code assumes that if a
> device is not in-sync, then it must undergo complete recovery - it does not
> honor the bitmap.  The flag 'RecoverByBitmap' has been introduced to force raid5
> not to set 'conf->fullsync' if the superblock routines have already determined
> that only a partial recovery is necessary.
> 
> RFC-by: Jonathan Brassow <jbrassow@redhat.com>
> 
> Index: linux-2.6/drivers/md/raid5.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/raid5.c
> +++ linux-2.6/drivers/md/raid5.c
> @@ -4858,7 +4858,7 @@ static raid5_conf_t *setup_conf(mddev_t 
>  			printk(KERN_INFO "md/raid:%s: device %s operational as raid"
>  			       " disk %d\n",
>  			       mdname(mddev), bdevname(rdev->bdev, b), raid_disk);
> -		} else
> +		} else if (!test_bit(RecoverByBitmap, &rdev->flags))
>  			/* Cannot rely on bitmap to complete recovery */
>  			conf->fullsync = 1;

I think we can just do
                } else if (rdev->saved_raid_disk != raid_disk)

and not add an extra flag.
This is more in keeping with e.g. raid5_add_disk.

Thanks,

NeilBrown


>  	}
> Index: linux-2.6/drivers/md/md.h
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.h
> +++ linux-2.6/drivers/md/md.h
> @@ -77,6 +77,8 @@ struct mdk_rdev_s
>  #define Blocked		8		/* An error occurred on an externally
>  					 * managed array, don't allow writes
>  					 * until it is cleared */
> +#define RecoverByBitmap 9               /* Used by device-mapper to ensure this
> +					 * device is recovered by the bitmap. */
>  	wait_queue_head_t blocked_wait;
>  
>  	int desc_nr;			/* descriptor index in the superblock */
> 
> 



* Re: [PATCH 7 of 8] MD:  add bitmap support
From: NeilBrown @ 2011-06-08  5:18 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <1307487159.31279.17.camel@f14.redhat.com>

On Tue, 07 Jun 2011 17:52:39 -0500 Jonathan Brassow <jbrassow@redhat.com>
wrote:

> Add bitmap support to the device-mapper specific metadata area.
> 
> This patch allows the creation of the bitmap metadata area upon initial array
> creation via device-mapper.
> 
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
> 
> Index: linux-2.6/drivers/md/bitmap.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/bitmap.c
> +++ linux-2.6/drivers/md/bitmap.c
> @@ -534,6 +534,84 @@ void bitmap_print_sb(struct bitmap *bitm
>  	kunmap_atomic(sb, KM_USER0);
>  }
>  
> +/*
> + * bitmap_new_disk_sb
> + * @bitmap
> + *
> + * This function is somewhat the reverse of bitmap_read_sb.  bitmap_read_sb
> + * reads and verifies the on-disk bitmap superblock and populates bitmap_info.
> + * This function verifies 'bitmap_info' and populates the on-disk bitmap
> + * structure, which is to be written to disk.
> + *
> + * Returns: 0 on success, -Exxx on error
> + */
> +static int bitmap_new_disk_sb(struct bitmap *bitmap)
> +{
> +	bitmap_super_t *sb;
> +	unsigned long chunksize, daemon_sleep, write_behind;
> +	int err = -EINVAL;
> +
> +	/* page 0 is the superblock, read it... */
> +	bitmap->sb_page = read_sb_page(bitmap->mddev,
> +				       bitmap->mddev->bitmap_info.offset,
> +				       NULL, 0, sizeof(bitmap_super_t));
> +

I don't understand ... why are you reading the page from disk when you don't
expect anything to be there and are about to create an initial bitmap
superblock?
Shouldn't this just be
    bitmap->sb_page = alloc_page(GFP_KERNEL);
??


> +	if (IS_ERR(bitmap->sb_page)) {
> +		err = PTR_ERR(bitmap->sb_page);
> +		bitmap->sb_page = NULL;
> +		return err;
> +	}
> +
> +	sb = kmap_atomic(bitmap->sb_page, KM_USER0);
> +
> +	sb->magic = cpu_to_le32(BITMAP_MAGIC);
> +	sb->version = cpu_to_le32(BITMAP_MAJOR_HI);
> +
> +	chunksize = bitmap->mddev->bitmap_info.chunksize;
> +	BUG_ON(!chunksize);
> +	if ((1 << ffz(~chunksize)) != chunksize) {

Please use  is_power_of_2(chunksize) - defined in log2.h
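
Since is_power_of_2() is a kernel helper, the sketch below is a userspace stand-in for the test it performs, next to a portable rewrite of the open-coded expression from the patch. Both function names are hypothetical, and the second function is only an approximation of `(1 << ffz(~chunksize)) != chunksize`, not the patch's literal code.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel helper: a power of two has exactly
 * one bit set, so clearing the lowest set bit must leave zero. */
bool is_power_of_2_sketch(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* The open-coded idea from the patch, written portably: isolate the
 * lowest set bit and compare it with the whole value. */
bool open_coded_power_of_2(unsigned long chunksize)
{
	assert(chunksize != 0);	/* stands in for BUG_ON(!chunksize) */
	return (chunksize & (~chunksize + 1)) == chunksize;
}
```

The two agree for every non-zero input. The helper form also handles zero itself and avoids shifting a plain int by a large bit index.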


Otherwise it looks OK.

NeilBrown


> +		kunmap_atomic(sb, KM_USER0);
> +		printk(KERN_ERR "bitmap chunksize not a power of 2\n");
> +		return -EINVAL;
> +	}
> +	sb->chunksize = cpu_to_le32(chunksize);
> +
> +	daemon_sleep = bitmap->mddev->bitmap_info.daemon_sleep;
> +	if (!daemon_sleep ||
> +	    (daemon_sleep < 1) || (daemon_sleep > MAX_SCHEDULE_TIMEOUT)) {
> +		printk(KERN_INFO "Choosing daemon_sleep default (5 sec)\n");
> +		daemon_sleep = 5 * HZ;
> +	}
> +	sb->daemon_sleep = cpu_to_le32(daemon_sleep);
> +	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
> +
> +	/*
> +	 * FIXME: write_behind for RAID1.  If not specified, what
> +	 * is a good choice?  We choose COUNTER_MAX / 2 arbitrarily.
> +	 */
> +	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
> +	if (write_behind > COUNTER_MAX)
> +		write_behind = COUNTER_MAX / 2;
> +	sb->write_behind = cpu_to_le32(write_behind);
> +	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
> +
> +	/* keep the array size field of the bitmap superblock up to date */
> +	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
> +
> +	memcpy(sb->uuid, bitmap->mddev->uuid, 16);
> +
> +	bitmap->flags |= BITMAP_STALE;
> +	sb->state |= cpu_to_le32(BITMAP_STALE);
> +	bitmap->events_cleared = bitmap->mddev->events;
> +	sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
> +
> +	bitmap->flags |= BITMAP_HOSTENDIAN;
> +	sb->version = cpu_to_le32(BITMAP_MAJOR_HOSTENDIAN);
> +
> +	kunmap_atomic(sb, KM_USER0);
> +	return 0;
> +}
> +
>  /* read the superblock from the bitmap file and initialize some bitmap fields */
>  static int bitmap_read_sb(struct bitmap *bitmap)
>  {
> @@ -1076,8 +1154,8 @@ static int bitmap_init_from_disk(struct 
>  	}
>  
>  	printk(KERN_INFO "%s: bitmap initialized from disk: "
> -		"read %lu/%lu pages, set %lu bits\n",
> -		bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt);
> +	       "read %lu/%lu pages, set %lu of %lu bits\n",
> +	       bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt, chunks);
>  
>  	return 0;
>  
> @@ -1728,9 +1806,16 @@ int bitmap_create(mddev_t *mddev)
>  		vfs_fsync(file, 1);
>  	}
>  	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
> -	if (!mddev->bitmap_info.external)
> -		err = bitmap_read_sb(bitmap);
> -	else {
> +	if (!mddev->bitmap_info.external) {
> +		/*
> +		 * If 'MD_ARRAY_FIRST_USE' is set, then device-mapper is
> +		 * instructing us to create a new on-disk bitmap instance.
> +		 */
> +		if (test_and_clear_bit(MD_ARRAY_FIRST_USE, &mddev->flags))
> +			err = bitmap_new_disk_sb(bitmap);
> +		else
> +			err = bitmap_read_sb(bitmap);
> +	} else {
>  		err = 0;
>  		if (mddev->bitmap_info.chunksize == 0 ||
>  		    mddev->bitmap_info.daemon_sleep == 0)
> Index: linux-2.6/drivers/md/md.h
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.h
> +++ linux-2.6/drivers/md/md.h
> @@ -124,6 +124,7 @@ struct mddev_s
>  #define MD_CHANGE_DEVS	0	/* Some device status has changed */
>  #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
>  #define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
> +#define MD_ARRAY_FIRST_USE 3    /* First use of array, needs initialization */
>  
>  	int				suspended;
>  	atomic_t			active_io;
> 
> 



* Re: [PATCH 2/2] md/bitmap: remove unused fields from struct bitmap
From: NeilBrown @ 2011-06-08  2:38 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-raid
In-Reply-To: <1307458172-19373-2-git-send-email-namhyung@gmail.com>

On Tue,  7 Jun 2011 23:49:32 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> Get rid of ->syncchunk and ->counter_bits since they're never used.
> 
> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  drivers/md/bitmap.c |    3 ---
>  drivers/md/bitmap.h |    9 ---------
>  2 files changed, 0 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 8b40bd71bb4a..0e3b314917ab 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1754,9 +1754,6 @@ int bitmap_create(mddev_t *mddev)
>  	bitmap->chunks = chunks;
>  	bitmap->pages = pages;
>  	bitmap->missing_pages = pages;
> -	bitmap->counter_bits = COUNTER_BITS;
> -
> -	bitmap->syncchunk = ~0UL;
>  
>  #ifdef INJECT_FATAL_FAULT_1
>  	bitmap->bp = NULL;
> diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
> index d0aeaf46d932..0a239f5d0ca1 100644
> --- a/drivers/md/bitmap.h
> +++ b/drivers/md/bitmap.h
> @@ -196,19 +196,10 @@ struct bitmap {
>  
>  	mddev_t *mddev; /* the md device that the bitmap is for */
>  
> -	int counter_bits; /* how many bits per block counter */
> -
>  	/* bitmap chunksize -- how much data does each bit represent? */
>  	unsigned long chunkshift; /* chunksize = 2^chunkshift (for bitops) */
>  	unsigned long chunks; /* total number of data chunks for the array */
>  
> -	/* We hold a count on the chunk currently being synced, and drop
> -	 * it when the last block is started.  If the resync is aborted
> -	 * midway, we need to be able to drop that count, so we remember
> -	 * the counted chunk..
> -	 */
> -	unsigned long syncchunk;
> -
>  	__u64	events_cleared;
>  	int need_sync;
>  


Always happy to see code disappear!  Thanks.

I took the opportunity to also remove COUNTER_BYTE_RATIO in the same patch.

Thanks,
NeilBrown


* Re: [PATCH 1/2] md/bitmap: use proper accessor macro
From: NeilBrown @ 2011-06-08  2:37 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-raid
In-Reply-To: <1307458172-19373-1-git-send-email-namhyung@gmail.com>

On Tue,  7 Jun 2011 23:49:31 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> Use COUNTER()/NEEDED() macro instead of open-coding them.
> 
> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  drivers/md/bitmap.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 70bd738b8b99..8b40bd71bb4a 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1332,7 +1332,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
>  			return 0;
>  		}
>  
> -		if (unlikely((*bmc & COUNTER_MAX) == COUNTER_MAX)) {
> +		if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
>  			DEFINE_WAIT(__wait);
>  			/* note that it is safe to do the prepare_to_wait
>  			 * after the test as long as we do it before dropping
> @@ -1404,10 +1404,10 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
>  			sysfs_notify_dirent_safe(bitmap->sysfs_can_clear);
>  		}
>  
> -		if (!success && ! (*bmc & NEEDED_MASK))
> +		if (!success && !NEEDED(*bmc))
>  			*bmc |= NEEDED_MASK;
>  
> -		if ((*bmc & COUNTER_MAX) == COUNTER_MAX)
> +		if (COUNTER(*bmc) == COUNTER_MAX)
>  			wake_up(&bitmap->overflow_wait);
>  
>  		(*bmc)--;


Thanks....

Personally I loathe such macros - I prefer things to be open coded so I can
see what is happening without having to guess.
But as we have the macros and they are already in use we should be consistent
and use them everywhere.
So I'll apply the patch.
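
For readers without bitmap.h at hand, the macros pack two flag bits and a count into one counter word. The sketch below is modelled on that idea. The SK_ prefixed names, the 16-bit width and the exact definitions are assumptions for illustration and should be checked against drivers/md/bitmap.h rather than read as the kernel's text.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t counter_t;	/* assumed 16-bit counter word */

#define SK_COUNTER_BITS 16
#define SK_NEEDED_MASK  ((counter_t)(1u << (SK_COUNTER_BITS - 1)))	/* top bit */
#define SK_RESYNC_MASK  ((counter_t)(1u << (SK_COUNTER_BITS - 2)))	/* next bit */
#define SK_COUNTER_MAX  ((counter_t)(SK_RESYNC_MASK - 1))		/* low 14 bits */

/* Accessors: hide the masking so callers never touch the layout. */
#define SK_NEEDED(x)    (((counter_t)(x)) & SK_NEEDED_MASK)
#define SK_COUNTER(x)   (((counter_t)(x)) & SK_COUNTER_MAX)

/* Non-zero when the count field cannot be incremented any further. */
int counter_saturated(counter_t bmc)
{
	return SK_COUNTER(bmc) == SK_COUNTER_MAX;
}

/* Drop one reference while leaving the flag bits untouched. */
counter_t counter_dec(counter_t bmc)
{
	assert(SK_COUNTER(bmc) != 0);
	return (counter_t)(bmc - 1);
}
```

Written this way, `COUNTER(*bmc) == COUNTER_MAX` and `(*bmc & COUNTER_MAX) == COUNTER_MAX` are the same test; the accessor only names the intent.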

Thanks,
NeilBrown



* Re: from 2x RAID1 to 1x RAID6 ?
From: John Robinson @ 2011-06-08  1:16 UTC (permalink / raw)
  To: lists; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <4DEE6A11.1030205@xunil.at>

On 07/06/2011 19:12, Stefan G. Weichinger wrote:
>
> Greetings, could you please advise me how to proceed?
>
> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>
> md5 : active raid1 sde1[0] sdf1[1]
>        976759936 blocks [2/2] [UU]
>
> md6 : active raid1 sdh1[1] sdg1[0]
>        976759936 blocks [2/2] [UU]
>
>
> md5 and md6 are right now physical volumes (PVs) in an LVM-volume-group.
> Nearly all the space is used right now (1.7 TB out of the ~2 TB).
>
> Now I would like to move things to a more reliable RAID6 consisting of
> all the four TB-drives ...
>
> How to do that with minimum risk?
>
> For sure it would be best to move all data aside, stop the arrays and
> build a new one ... etc
>
> Failing two drives and remove them from the RAID1s to build a new
> degraded RAID6 seems dangerous to me?
>
> Maybe I overlook a clever alternative?
>
> Suggestions welcome, thanks in advance.

There may be a clever alternative, retaining single redundancy, if you 
don't mind buying one more disc, which I'm guessing you might do soon 
anyway as you're already 85% full. Or if not, it won't do too much harm 
to have a spare drive sitting on a shelf.

You can convert a 2-drive RAID1 to a 2-drive RAID5, then add the new 
drive to double the size of the array, resize the PV, then move the PEs 
over from the other RAID1, then tear down that PV and RAID1, add one or 
both of those drives into the RAID5 and grow it to a RAID6. The only 
step at which you have a little less redundancy is while you're running 
the 3-drive RAID5 (well, it's still 1 drive but against 2 drives, 
instead of 1:1).

On the other hand it might be easier to take a backup, which you 
probably ought to do anyway!

Cheers,

John.


* Re: from 2x RAID1 to 1x RAID6 ?
From: Thomas Harold @ 2011-06-07 23:59 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: lists, linux-raid
In-Reply-To: <4DEE84F0.2030205@harddata.com>

On 6/7/2011 4:07 PM, Maurice Hilarius wrote:
> On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
>> Greetings, could you please advise me how to proceed?
>>
>> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>>
>> ..
>>
>> Now I would like to move things to a more reliable RAID6 consisting of
>> all the four TB-drives ...
>>
>> How to do that with minimum risk?
>>
>> ..
>> Maybe I overlook a clever alternative?
>
> RAID 10 is as secure, and risk free, and much faster.
> And will cause much less CPU load.
>

Well, with both a pair of RAID1 arrays and a four-disk RAID-10 array, you 
can lose 2 disks without losing data, but only if the right 2 disks fail.

With RAID6, any two of the four can fail without data loss.

(I still prefer RAID-10 over RAID-6 unless space is at an absolute 
premium.  But for a four-disk setup, net disk space is the same and it's 
just a question of whether you want the speed of RAID-10 or the 
reliability of RAID-6.)
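
The difference can be made concrete by enumerating every two-disk failure on four disks. The sketch assumes the RAID-10 case is two mirrored pairs, (0,1) and (2,3), striped together, which also matches the existing pair of RAID1 arrays; other RAID-10 layouts are not modelled. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NDISKS 4

typedef bool (*survive_fn)(const bool failed[NDISKS]);

/* Mirrored pairs (0,1) and (2,3): data survives unless both halves of
 * the same pair have failed. */
bool mirror_pairs_survive(const bool failed[NDISKS])
{
	return !(failed[0] && failed[1]) && !(failed[2] && failed[3]);
}

/* RAID6 survives any pattern with at most two failed disks. */
bool raid6_survives(const bool failed[NDISKS])
{
	int count = 0;

	for (int i = 0; i < NDISKS; i++)
		count += failed[i] ? 1 : 0;
	return count <= 2;
}

/* Count how many of the two-disk failure combinations (6 in total for
 * 4 disks) the given layout survives. */
int surviving_double_failures(survive_fn survives)
{
	int ok = 0;

	assert(survives);
	for (int a = 0; a < NDISKS; a++) {
		for (int b = a + 1; b < NDISKS; b++) {
			bool failed[NDISKS] = { false };

			failed[a] = true;
			failed[b] = true;
			if (survives(failed))
				ok++;
		}
	}
	return ok;
}
```

Under these assumptions `surviving_double_failures(mirror_pairs_survive)` counts 4 of the 6 combinations, while `surviving_double_failures(raid6_survives)` counts all 6.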


* [PATCH 8 of 8] MD:  raid5 do not set fullsync
From: Jonathan Brassow @ 2011-06-07 22:52 UTC (permalink / raw)
  To: linux-raid

Add new flag for struct mdk_rdev_s to indicate when recovery can use bitmap

Device-mapper can tell if a device is in-sync, in need of partial (bitmap aided)
recovery, or in need of complete recovery.  The raid5 code assumes that if a
device is not in-sync, then it must undergo complete recovery - it does not
honor the bitmap.  The flag 'RecoverByBitmap' has been introduced to force raid5
not to set 'conf->fullsync' if the superblock routines have already determined
that only a partial recovery is necessary.

RFC-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/raid5.c
===================================================================
--- linux-2.6.orig/drivers/md/raid5.c
+++ linux-2.6/drivers/md/raid5.c
@@ -4858,7 +4858,7 @@ static raid5_conf_t *setup_conf(mddev_t 
 			printk(KERN_INFO "md/raid:%s: device %s operational as raid"
 			       " disk %d\n",
 			       mdname(mddev), bdevname(rdev->bdev, b), raid_disk);
-		} else
+		} else if (!test_bit(RecoverByBitmap, &rdev->flags))
 			/* Cannot rely on bitmap to complete recovery */
 			conf->fullsync = 1;
 	}
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -77,6 +77,8 @@ struct mdk_rdev_s
 #define Blocked		8		/* An error occurred on an externally
 					 * managed array, don't allow writes
 					 * until it is cleared */
+#define RecoverByBitmap 9               /* Used by device-mapper to ensure this
+					 * device is recovered by the bitmap. */
 	wait_queue_head_t blocked_wait;
 
 	int desc_nr;			/* descriptor index in the superblock */




* [PATCH 7 of 8] MD:  add bitmap support
From: Jonathan Brassow @ 2011-06-07 22:52 UTC (permalink / raw)
  To: linux-raid

Add bitmap support to the device-mapper specific metadata area.

This patch allows the creation of the bitmap metadata area upon initial array
creation via device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/bitmap.c
===================================================================
--- linux-2.6.orig/drivers/md/bitmap.c
+++ linux-2.6/drivers/md/bitmap.c
@@ -534,6 +534,84 @@ void bitmap_print_sb(struct bitmap *bitm
 	kunmap_atomic(sb, KM_USER0);
 }
 
+/*
+ * bitmap_new_disk_sb
+ * @bitmap
+ *
+ * This function is somewhat the reverse of bitmap_read_sb.  bitmap_read_sb
+ * reads and verifies the on-disk bitmap superblock and populates bitmap_info.
+ * This function verifies 'bitmap_info' and populates the on-disk bitmap
+ * structure, which is to be written to disk.
+ *
+ * Returns: 0 on success, -Exxx on error
+ */
+static int bitmap_new_disk_sb(struct bitmap *bitmap)
+{
+	bitmap_super_t *sb;
+	unsigned long chunksize, daemon_sleep, write_behind;
+	int err = -EINVAL;
+
+	/* page 0 is the superblock, read it... */
+	bitmap->sb_page = read_sb_page(bitmap->mddev,
+				       bitmap->mddev->bitmap_info.offset,
+				       NULL, 0, sizeof(bitmap_super_t));
+
+	if (IS_ERR(bitmap->sb_page)) {
+		err = PTR_ERR(bitmap->sb_page);
+		bitmap->sb_page = NULL;
+		return err;
+	}
+
+	sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+
+	sb->magic = cpu_to_le32(BITMAP_MAGIC);
+	sb->version = cpu_to_le32(BITMAP_MAJOR_HI);
+
+	chunksize = bitmap->mddev->bitmap_info.chunksize;
+	BUG_ON(!chunksize);
+	if ((1 << ffz(~chunksize)) != chunksize) {
+		kunmap_atomic(sb, KM_USER0);
+		printk(KERN_ERR "bitmap chunksize not a power of 2\n");
+		return -EINVAL;
+	}
+	sb->chunksize = cpu_to_le32(chunksize);
+
+	daemon_sleep = bitmap->mddev->bitmap_info.daemon_sleep;
+	if (!daemon_sleep ||
+	    (daemon_sleep < 1) || (daemon_sleep > MAX_SCHEDULE_TIMEOUT)) {
+		printk(KERN_INFO "Choosing daemon_sleep default (5 sec)\n");
+		daemon_sleep = 5 * HZ;
+	}
+	sb->daemon_sleep = cpu_to_le32(daemon_sleep);
+	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	/*
+	 * FIXME: write_behind for RAID1.  If not specified, what
+	 * is a good choice?  We choose COUNTER_MAX / 2 arbitrarily.
+	 */
+	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
+	if (write_behind > COUNTER_MAX)
+		write_behind = COUNTER_MAX / 2;
+	sb->write_behind = cpu_to_le32(write_behind);
+	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
+
+	/* keep the array size field of the bitmap superblock up to date */
+	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
+
+	memcpy(sb->uuid, bitmap->mddev->uuid, 16);
+
+	bitmap->flags |= BITMAP_STALE;
+	sb->state |= cpu_to_le32(BITMAP_STALE);
+	bitmap->events_cleared = bitmap->mddev->events;
+	sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+
+	bitmap->flags |= BITMAP_HOSTENDIAN;
+	sb->version = cpu_to_le32(BITMAP_MAJOR_HOSTENDIAN);
+
+	kunmap_atomic(sb, KM_USER0);
+	return 0;
+}
+
 /* read the superblock from the bitmap file and initialize some bitmap fields */
 static int bitmap_read_sb(struct bitmap *bitmap)
 {
@@ -1076,8 +1154,8 @@ static int bitmap_init_from_disk(struct 
 	}
 
 	printk(KERN_INFO "%s: bitmap initialized from disk: "
-		"read %lu/%lu pages, set %lu bits\n",
-		bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt);
+	       "read %lu/%lu pages, set %lu of %lu bits\n",
+	       bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt, chunks);
 
 	return 0;
 
@@ -1728,9 +1806,16 @@ int bitmap_create(mddev_t *mddev)
 		vfs_fsync(file, 1);
 	}
 	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
-	if (!mddev->bitmap_info.external)
-		err = bitmap_read_sb(bitmap);
-	else {
+	if (!mddev->bitmap_info.external) {
+		/*
+		 * If 'MD_ARRAY_FIRST_USE' is set, then device-mapper is
+		 * instructing us to create a new on-disk bitmap instance.
+		 */
+		if (test_and_clear_bit(MD_ARRAY_FIRST_USE, &mddev->flags))
+			err = bitmap_new_disk_sb(bitmap);
+		else
+			err = bitmap_read_sb(bitmap);
+	} else {
 		err = 0;
 		if (mddev->bitmap_info.chunksize == 0 ||
 		    mddev->bitmap_info.daemon_sleep == 0)
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -124,6 +124,7 @@ struct mddev_s
 #define MD_CHANGE_DEVS	0	/* Some device status has changed */
 #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
 #define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
+#define MD_ARRAY_FIRST_USE 3    /* First use of array, needs initialization */
 
 	int				suspended;
 	atomic_t			active_io;



^ permalink raw reply

* [PATCH 6 of 8] MD:  add sync_super to mddev_t struct
From: Jonathan Brassow @ 2011-06-07 22:51 UTC (permalink / raw)
  To: linux-raid

Add the 'sync_super' function pointer to the MD array structure (struct mddev_s).

If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
able to load it, there must still be a way for MD to initiate superblock
updates.  The simplest way to make this happen is to provide a pointer in
the MD array structure that can be set by device-mapper (or other module)
with a function to do this.  If the function has been set, it will be used;
otherwise, the method will be looked up via 'super_types' as usual.
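
The dispatch described above can be sketched in user-space C.  This is only
an illustration, not the kernel code: 'struct array', 'struct rdev',
'handlers', 'sync_v0', 'sync_v1' and 'sync_external' are hypothetical
stand-ins for mddev_t, mdk_rdev_t, 'super_types' and a function that a module
such as dm-raid.c might install, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for mdk_rdev_t and mddev_t; not the kernel types. */
struct rdev {
	int synced_by;	/* records which handler ran, for illustration only */
};

struct array;
typedef void (*sync_super_fn)(struct array *a, struct rdev *r);

struct array {
	size_t major_version;		/* index into the handler table */
	sync_super_fn sync_super;	/* optional override, NULL if unset */
};

/* Built-in handlers, playing the role of the 'super_types' entries. */
static void sync_v0(struct array *a, struct rdev *r) { (void)a; r->synced_by = 0; }
static void sync_v1(struct array *a, struct rdev *r) { (void)a; r->synced_by = 1; }

static const sync_super_fn handlers[] = { sync_v0, sync_v1 };

/* Example override, standing in for a function supplied by another module. */
static void sync_external(struct array *a, struct rdev *r) { (void)a; r->synced_by = 100; }

/* Use the per-array override if one is set, else fall back to the table. */
static void sync_super(struct array *a, struct rdev *r)
{
	if (a->sync_super) {
		a->sync_super(a, r);
		return;
	}

	assert(a->major_version < sizeof(handlers) / sizeof(handlers[0]));
	handlers[a->major_version](a, r);
}
```

With 'sync_super' left NULL the table entry for 'major_version' runs; once a
caller assigns 'sync_external' to the pointer, that function runs instead and
the table is not consulted.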

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -1753,6 +1753,18 @@ static struct super_type super_types[] =
 	},
 };
 
+static void sync_super(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	if (mddev->sync_super) {
+		mddev->sync_super(mddev, rdev);
+		return;
+	}
+
+	BUG_ON(mddev->major_version >= ARRAY_SIZE(super_types));
+
+	super_types[mddev->major_version].sync_super(mddev, rdev);
+}
+
 static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 {
 	mdk_rdev_t *rdev, *rdev2;
@@ -2171,8 +2183,7 @@ static void sync_sbs(mddev_t * mddev, in
 			/* Don't update this superblock */
 			rdev->sb_loaded = 2;
 		} else {
-			super_types[mddev->major_version].
-				sync_super(mddev, rdev);
+			sync_super(mddev, rdev);
 			rdev->sb_loaded = 1;
 		}
 	}
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -330,6 +330,7 @@ struct mddev_s
 	atomic_t flush_pending;
 	struct work_struct flush_work;
 	struct work_struct event_work;	/* used by dm to report failure event */
+	void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
 };
 
 



^ permalink raw reply

* [PATCH 5 of 8] MD:  raid1 changes to allow use by device mapper
From: Jonathan Brassow @ 2011-06-07 22:50 UTC (permalink / raw)
  To: linux-raid

MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)

Add the necessary congestion function and make calls that require an
array 'queue' or 'gendisk' conditional.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/raid1.c
+++ linux-2.6/drivers/md/raid1.c
@@ -497,21 +497,19 @@ static int read_balance(conf_t *conf, r1
 	return best_disk;
 }
 
-static int raid1_congested(void *data, int bits)
+int md_raid1_congested(mddev_t *mddev, int bits)
 {
-	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
-	if (mddev_congested(mddev, bits))
-		return 1;
-
 	rcu_read_lock();
 	for (i = 0; i < mddev->raid_disks; i++) {
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
 
+			BUG_ON(!q);
+
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
@@ -524,7 +522,15 @@ static int raid1_congested(void *data, i
 	rcu_read_unlock();
 	return ret;
 }
+EXPORT_SYMBOL_GPL(md_raid1_congested);
 
+static int raid1_congested(void *data, int bits)
+{
+	mddev_t *mddev = data;
+
+	return mddev_congested(mddev, bits) ||
+		md_raid1_congested(mddev, bits);
+}
 
 static void flush_pending_writes(conf_t *conf)
 {
@@ -1972,6 +1978,8 @@ static int run(mddev_t *mddev)
 		return PTR_ERR(conf);
 
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
+		if (!mddev->gendisk)
+			continue;
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
 		/* as we don't honour merge_bvec_fn, we must never risk
@@ -2013,8 +2021,10 @@ static int run(mddev_t *mddev)
 
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
-	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
-	mddev->queue->backing_dev_info.congested_data = mddev;
+	if (mddev->queue) {
+		mddev->queue->backing_dev_info.congested_fn = raid1_congested;
+		mddev->queue->backing_dev_info.congested_data = mddev;
+	}
 	return md_integrity_register(mddev);
 }
 
Index: linux-2.6/drivers/md/raid1.h
===================================================================
--- linux-2.6.orig/drivers/md/raid1.h
+++ linux-2.6/drivers/md/raid1.h
@@ -126,4 +126,6 @@ struct r1bio_s {
  */
 #define	R1BIO_Returned 6
 
+extern int md_raid1_congested(mddev_t *mddev, int bits);
+
 #endif



^ permalink raw reply

* [PATCH 4 of 8] MD:  move thread wakeups into resume
From: Jonathan Brassow @ 2011-06-07 22:49 UTC (permalink / raw)
  To: linux-raid

Move personality and sync/recovery thread starting outside md_run.

Moving the wakeups of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better aligned with device-mapper primitives
(CTR/resume/suspend/DTR).  I/O - in this case, recovery operations - should
not happen until after a resume has taken place.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -351,6 +351,9 @@ void mddev_resume(mddev_t *mddev)
 	mddev->suspended = 0;
 	wake_up(&mddev->sb_wait);
 	mddev->pers->quiesce(mddev, 0);
+
+	md_wakeup_thread(mddev->thread);
+	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
 }
 EXPORT_SYMBOL_GPL(mddev_resume);
 
@@ -4619,9 +4622,6 @@ int md_run(mddev_t *mddev)
 	if (mddev->flags)
 		md_update_sb(mddev, 0);
 
-	md_wakeup_thread(mddev->thread);
-	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
-
 	md_new_event(mddev);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
@@ -4642,6 +4642,10 @@ static int do_md_run(mddev_t *mddev)
 		bitmap_destroy(mddev);
 		goto out;
 	}
+
+	md_wakeup_thread(mddev->thread);
+	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
+
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	revalidate_disk(mddev->gendisk);
 	mddev->changed = 1;



^ permalink raw reply

* [PATCH 3 of 8] MD:  possible typo
From: Jonathan Brassow @ 2011-06-07 22:48 UTC (permalink / raw)
  To: linux-raid

Make message a bit clearer by s/blocks/k/

I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the
message.  'k' may be a bit ambiguous, but I think it's better than "blocks",
which normally means 512 bytes but means 1024 bytes in MD.
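
As a rough illustration of the units involved, the sketch below mirrors the
arithmetic in the message: both values are counted in 512-byte sectors and
halved to give units of 1024 bytes.  The helper names are hypothetical, and
the 4096-byte page size mentioned afterwards is an assumption (PAGE_SIZE
depends on the architecture).

```c
#include <stdint.h>

#define SECTOR_SIZE 512u	/* bytes per sector */

/* Two 512-byte sectors make one 1024-byte unit ("k"). */
static uint64_t sectors_to_k(uint64_t sectors)
{
	return sectors / 2;
}

/* Resync window in sectors, following window = 32*(PAGE_SIZE/512). */
static unsigned int resync_window_sectors(unsigned int page_size)
{
	return 32 * (page_size / SECTOR_SIZE);
}
```

Assuming 4096-byte pages, the window is 256 sectors, which the message
reports as 128k.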

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -6866,8 +6866,8 @@ void md_do_sync(mddev_t *mddev)
 	 * Tune reconstruction:
 	 */
 	window = 32*(PAGE_SIZE/512);
-	printk(KERN_INFO "md: using %dk window, over a total of %llu blocks.\n",
-		window/2,(unsigned long long) max_sectors/2);
+	printk(KERN_INFO "md: using %dk window, over a total of %lluk.\n",
+		window/2, (unsigned long long)max_sectors/2);
 
 	atomic_set(&mddev->recovery_active, 0);
 	last_check = 0;



^ permalink raw reply

* [PATCH 2 of 8] MD:  no sync IO while suspended
From: Jonathan Brassow @ 2011-06-07 22:47 UTC (permalink / raw)
  To: linux-raid

Disallow resync I/O while the RAID array is suspended.

Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -7045,7 +7045,6 @@ void md_do_sync(mddev_t *mddev)
 }
 EXPORT_SYMBOL_GPL(md_do_sync);
 
-
 static int remove_and_add_spares(mddev_t *mddev)
 {
 	mdk_rdev_t *rdev;
@@ -7157,6 +7156,9 @@ static void reap_sync_thread(mddev_t *md
  */
 void md_check_recovery(mddev_t *mddev)
 {
+	if (mddev->suspended)
+		return;
+
 	if (mddev->bitmap)
 		bitmap_daemon_work(mddev);
 



^ permalink raw reply

* [PATCH 1 of 8] MD:  no integrity register if no gendisk
From: Jonathan Brassow @ 2011-06-07 22:46 UTC (permalink / raw)
  To: linux-raid

Don't attempt md_integrity_register if there is no gendisk struct available.

When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -1781,8 +1781,8 @@ int md_integrity_register(mddev_t *mddev
 
 	if (list_empty(&mddev->disks))
 		return 0; /* nothing to do */
-	if (blk_get_integrity(mddev->gendisk))
-		return 0; /* already registered */
+	if (!mddev->gendisk || blk_get_integrity(mddev->gendisk))
+		return 0; /* shouldn't register, or already is */
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
 		/* skip spares and non-functional disks */
 		if (test_bit(Faulty, &rdev->flags))



^ permalink raw reply

* [PATCH 0 of 8] Various MD patches to support device-mapper interaction
From: Jonathan Brassow @ 2011-06-07 22:45 UTC (permalink / raw)
  To: linux-raid

Neil,

Please consider the following patches for inclusion:

1) md-no-integrity-register-if-no-gendisk.patch
I received e-mail on 5/24/2011 saying this patch had been applied, but it
has not yet landed in 3.0.0-rc2.

2) md-no-sync-IO-while-suspended.patch
I received e-mail on 5/24/2011 saying this patch had been applied, but it
has not yet landed in 3.0.0-rc2.

3) md-possible-typo.patch
You mentioned you might take this one if I changed the message instead
of the parameter value.  I've s/ blocks/k/ - better?

4) md-move-thread-wakeups-into-resume.patch
No comments on this yet.

5) md-raid1-changes-to-allow-use-by-device-mapper.patch
No comments on this yet.

6) md-add-sync_super-to-mddev_t-struct.patch
This is the patch I'm proposing as the substitute to the 'analyze_sbs'
patches I had originally posted.  We add a function pointer that can be
called from device-mapper for the purposes of updating the superblock.
(This way, the superblock and 'super_types' functions can be in
dm-raid.c.)

7) md-add-bitmap-support.patch
Bitmap support for device-mapper created arrays.

8) md-raid5-do-not-set-fullsync.patch
A somewhat hackish way around the problem of RAID5 setting fullsync on a
device that merely suffered a transient failure.

Thanks,
 brassow

^ permalink raw reply

* Re: md array does not detect drive removal: mdadm 3.2.1, Linux 2.6.38
From: NeilBrown @ 2011-06-07 21:33 UTC (permalink / raw)
  To: fibreraid@gmail.com; +Cc: CoolCold, linux-raid
In-Reply-To: <BANLkTikD0x7AHpbJZ2AESV_eRwrYz08oOQ@mail.gmail.com>

On Tue, 7 Jun 2011 00:01:04 -0700 "fibreraid@gmail.com" <fibreraid@gmail.com>
wrote:

> Hello,
> 
> I did test IO, and upon issuing IO, then md correctly detected the
> failure and began a rebuild. However, my opinion is that this is
> inadequate and actually, I do not believe this is correct behavior. As
> I recall from prior experiences with md, md would initiate a rebuild
> based on drive removal only as well, even without any pending IO.
> 
> I would appreciate some further feedback as to this behavior. Thanks!

MD has never been able to respond to a drive removal - only to an IO error.

If you want md to notice when a drive is removed then you need a udev rule to
tell it.  The rule can run
   mdadm --incremental --fail devicename

where 'devicename' is not "/dev/sda", as that won't exist any more, but "sda",
which is the kernel-internal name for the device.

NeilBrown


> 
> -Tommy
> 
> 
> On Mon, Jun 6, 2011 at 2:25 PM, CoolCold <coolthecold@gmail.com> wrote:
> > On Mon, Jun 6, 2011 at 10:20 PM, fibreraid@gmail.com
> > <fibreraid@gmail.com> wrote:
> >> Hello,
> >>
> >> I am running Linux kernel 2.6.38 64-bit version with mdadm 3.2.1. The
> >> server hardware has dual socket Westmere CPUs (4 cores each), 24 GB of
> >> RAM, and 24 hard drives connected via SAS.
> >>
> >> I create an md0 array with 23 active drives, 1 hot-spare, RAID 5, and
> >> 64K chunk. After synchronization is complete, I have:
> >>
> >> root::~# cat /proc/mdstat
> >> Personalities : [raid6] [raid5] [raid4]
> >> md0 : active raid5 sdf1[23](S) sdi1[22] sdh1[21] sdg1[20] sde1[19]
> >> sdd1[18] sdc1[17] sdo1[16] sdn1[15] sdq1[14] sdp1[13] sdr1[12]
> >> sdm1[11] sdl1[10] sdk1[9] sdj1[8] sdv1[7] sdu1[6] sdt1[5] sds1[4]
> >> sdy1[3] sdx1[2] sdb1[1] sdw1[0]
> >>      2149005056 blocks super 1.2 level 5, 64k chunk, algorithm 2
> >> [23/23] [UUUUUUUUUUUUUUUUUUUUUUU]
> >>
> >> Then I remove an active drive from the system by unplugging it. udev
> >> catches the event, and fdisk -l reports one less drive. In this case,
> >> I remove /dev/sdv.
> >>
> >> However, /proc/mdstat remains unchanged. It's as if md has no idea
> >> that the drive disappeared. I would expect md at this point to have
> >> detected the removal, and to have automatically kicked-off a resync
> >> using the included hot-spare. But this does not occur.
> >>
> >> If I then run mdadm -R /dev/md0, in an attempt to "wake up" md, then
> >> md does realize the change, and does start the resyncing.
> > I guess md realizes there is no drive when write/read error occurs,
> > which gonna happen pretty soon if array is in usage, can you set some
> > dd reading and then remove drive?
> >
> >>
> >> I do not believe this is normal behavior. Can you advise?
> >>
> >> Thank you!
> >> -Tommy
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> >
> >
> > --
> > Best regards,
> > [COOLCOLD-RIPN]
> >


^ permalink raw reply

* Re: from 2x RAID1 to 1x RAID6 ?
From: Maurice Hilarius @ 2011-06-07 20:07 UTC (permalink / raw)
  To: lists, linux-raid
In-Reply-To: <4DEE6A11.1030205@xunil.at>

On 6/7/2011 12:12 PM, Stefan G. Weichinger wrote:
> Greetings, could you please advise me how to proceed?
>
> On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:
>
> ..
>
> Now I would like to move things to a more reliable RAID6 consisting of
> all the four TB-drives ...
>
> How to do that with minimum risk?
>
> ..
> Maybe I overlook a clever alternative?

RAID 10 is as secure, and risk free, and much faster.
And will cause much less CPU load.

-- 
Cheers,
Maurice Hilarius
eMail: /mhilarius@gmail.com/

^ permalink raw reply

* from 2x RAID1 to 1x RAID6 ?
From: Stefan G. Weichinger @ 2011-06-07 18:12 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Greetings, could you please advise me how to proceed?

On a server I have 2 RAID1-arrays, each consisting of 2 TB-drives:

md5 : active raid1 sde1[0] sdf1[1]
      976759936 blocks [2/2] [UU]

md6 : active raid1 sdh1[1] sdg1[0]
      976759936 blocks [2/2] [UU]

md5 and md6 are right now physical volumes (PVs) in an LVM-volume-group.
Nearly all the space is used right now (1.7 TB out of the ~2 TB).

Now I would like to move things to a more reliable RAID6 consisting of
all the four TB-drives ...

How to do that with minimum risk?

For sure it would be best to move all data aside, stop the arrays and
build a new one ... etc

Failing two drives and removing them from the RAID1s to build a new
degraded RAID6 seems dangerous to me.

Maybe I overlook a clever alternative?

Suggestions welcome, thanks in advance.

Stefan

^ permalink raw reply

* [PATCH 2/2] md/bitmap: remove unused fields from struct bitmap
From: Namhyung Kim @ 2011-06-07 14:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid
In-Reply-To: <1307458172-19373-1-git-send-email-namhyung@gmail.com>

Get rid of ->syncchunk and ->counter_bits since they're never used.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/bitmap.c |    3 ---
 drivers/md/bitmap.h |    9 ---------
 2 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 8b40bd71bb4a..0e3b314917ab 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1754,9 +1754,6 @@ int bitmap_create(mddev_t *mddev)
 	bitmap->chunks = chunks;
 	bitmap->pages = pages;
 	bitmap->missing_pages = pages;
-	bitmap->counter_bits = COUNTER_BITS;
-
-	bitmap->syncchunk = ~0UL;
 
 #ifdef INJECT_FATAL_FAULT_1
 	bitmap->bp = NULL;
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index d0aeaf46d932..0a239f5d0ca1 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -196,19 +196,10 @@ struct bitmap {
 
 	mddev_t *mddev; /* the md device that the bitmap is for */
 
-	int counter_bits; /* how many bits per block counter */
-
 	/* bitmap chunksize -- how much data does each bit represent? */
 	unsigned long chunkshift; /* chunksize = 2^chunkshift (for bitops) */
 	unsigned long chunks; /* total number of data chunks for the array */
 
-	/* We hold a count on the chunk currently being synced, and drop
-	 * it when the last block is started.  If the resync is aborted
-	 * midway, we need to be able to drop that count, so we remember
-	 * the counted chunk..
-	 */
-	unsigned long syncchunk;
-
 	__u64	events_cleared;
 	int need_sync;
 
-- 
1.7.5.2


^ permalink raw reply related

* [PATCH 1/2] md/bitmap: use proper accessor macro
From: Namhyung Kim @ 2011-06-07 14:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Use the COUNTER()/NEEDED() macros instead of open-coding them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/bitmap.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 70bd738b8b99..8b40bd71bb4a 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1332,7 +1332,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
 			return 0;
 		}
 
-		if (unlikely((*bmc & COUNTER_MAX) == COUNTER_MAX)) {
+		if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
 			DEFINE_WAIT(__wait);
 			/* note that it is safe to do the prepare_to_wait
 			 * after the test as long as we do it before dropping
@@ -1404,10 +1404,10 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
 			sysfs_notify_dirent_safe(bitmap->sysfs_can_clear);
 		}
 
-		if (!success && ! (*bmc & NEEDED_MASK))
+		if (!success && !NEEDED(*bmc))
 			*bmc |= NEEDED_MASK;
 
-		if ((*bmc & COUNTER_MAX) == COUNTER_MAX)
+		if (COUNTER(*bmc) == COUNTER_MAX)
 			wake_up(&bitmap->overflow_wait);
 
 		(*bmc)--;
-- 
1.7.5.2


^ permalink raw reply related

* Re: sector I/O error cause disk to be "faulty" in raid5
From: John Robinson @ 2011-06-07  8:53 UTC (permalink / raw)
  To: hank peng; +Cc: linux-raid
In-Reply-To: <BANLkTikpcH=navhOAFA2Jw+CT+rQfpOwcg@mail.gmail.com>

On 06/06/2011 14:28, hank peng wrote:
> Hi, everybody:
> In current raid5 implementation, if a r/w error occurred at some
> specific sectors on a disk, the disk will be labeled as "faulty".
> Here, I want to say in most cases, this is failure indication of those
> sectors not the whole disk. Should we make some changes to be more
> reasonable?

It's already on the wishlist as a bad block map.

Cheers,

John.


^ permalink raw reply
