Linux RAID subsystem development


* Re: [PATCH 1/2] mdmon.8: fix possible typos
From: NeilBrown @ 2011-06-28  6:38 UTC
  To: Namhyung Kim; +Cc: linux-raid
In-Reply-To: <1308889610-8210-1-git-send-email-namhyung@gmail.com>

On Fri, 24 Jun 2011 13:26:49 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  mdmon.8 |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mdmon.8 b/mdmon.8
> index 7939a99..03b31b8 100644
> --- a/mdmon.8
> +++ b/mdmon.8
> @@ -104,7 +104,7 @@ within those disks.  MD metadata in comparison defines a 1:1
>  relationship between a set of block devices and a raid array.  For
>  example to create 2 arrays at different raid levels on a single
>  set of disks, MD metadata requires the disks be partitioned and then
> -each array can created be created with a subset of those partitions.  The
> +each array can be created with a subset of those partitions.  The
>  supported external formats perform this disk carving internally.
>  .P
>  Container devices simply hold references to all member disks and allow
> @@ -172,7 +172,7 @@ Note that
>  is automatically started by
>  .I mdadm
>  when needed and so does not need to be considered when working with
> -RAID arrays.  The only times it is run other that by
> +RAID arrays.  The only times it is run other than by
>  .I  mdadm
>  is when the boot scripts need to restart it after mounting the new
>  root filesystem.


Thanks.  I've applied both of these.

NeilBrown


* Re: [PATCH] md/raid5: fix possible NULL pointer dereference in debug routine
From: NeilBrown @ 2011-06-28  6:40 UTC
  To: Namhyung Kim; +Cc: linux-raid, linux-kernel
In-Reply-To: <1308891840-5683-1-git-send-email-namhyung@gmail.com>

On Fri, 24 Jun 2011 14:04:00 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> When I ran dynamic debug, I encountered a NULL pointer dereference
> bug in add_stripe_bio(). Prior to second call to pr_debug(), @bi
> was reused in order to check whether the request is partial write
> or not, and it could lead to set @bi to NULL.
> 
> Since we save original @bi pointer into local variable 'bip', use
> it instead of NULL-able @bi.
> 
> Also changed confusing 'bh' to 'bi' in the first message.
> 
> Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> ---
>  drivers/md/raid5.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 82c07fb38961..c814a6075c79 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2139,7 +2139,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
>  	raid5_conf_t *conf = sh->raid_conf;
>  	int firstwrite=0;
>  
> -	pr_debug("adding bh b#%llu to stripe s#%llu\n",
> +	pr_debug("adding bi b#%llu to stripe s#%llu\n",
>  		(unsigned long long)bi->bi_sector,
>  		(unsigned long long)sh->sector);
>  
> @@ -2181,7 +2181,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> -		(unsigned long long)bi->bi_sector,
> +		(unsigned long long)(*bip)->bi_sector,
>  		(unsigned long long)sh->sector, dd_idx);
>  
>  	if (conf->mddev->bitmap && firstwrite) {


Applied, thanks.

NeilBrown
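
The pattern behind this fix can be reproduced outside the kernel. The sketch below is a userspace illustration only: `struct fake_bio` and `added_sector()` are hypothetical names, and the loop is a simplified stand-in for the coverage check in add_stripe_bio(). It shows why the final message has to read through the saved pointer once 'bi' has been reused as a cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only the fields the sketch needs. */
struct fake_bio {
	unsigned long long sector;
	struct fake_bio *next;
};

/*
 * Mimics the shape of add_stripe_bio(): 'bi' is linked in through 'bip',
 * then 'bi' is reused as a loop cursor and ends up NULL.
 * Returns the sector that the final debug message would print.
 */
unsigned long long added_sector(struct fake_bio **bip, struct fake_bio *bi)
{
	assert(bip != NULL && bi != NULL);

	*bip = bi;	/* remember the bio that was just added */

	/* Reuse 'bi' as a cursor; it is NULL once the list is exhausted. */
	for (bi = *bip; bi != NULL; bi = bi->next)
		;

	/* 'bi' is NULL here, so read through the saved pointer instead. */
	return (*bip)->sector;
}
```

Returning `bi->sector` at the end of this function would dereference NULL, which is the failure the patch avoids.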


* Re: [smatch stuff] md/raid5: potential null deref in debug code
From: NeilBrown @ 2011-06-28  6:42 UTC
  To: Dan Carpenter; +Cc: linux-raid
In-Reply-To: <20110624181235.GQ14591@shale.localdomain>

On Fri, 24 Jun 2011 21:12:36 +0300 Dan Carpenter <error27@gmail.com> wrote:

> Hi Neil,
> 
> In d1b053e4de0ac33 "md/raid5: Protect some more code with
> ->device_lock." we moved some debug code around and it upsets my
> static checker.  Could you take a look?
> 
> drivers/md/raid5.c +2183 add_stripe_bio(47)
> 	error: potential null derefence 'bi'.
> 
>   2168          if (forwrite) {
>   2169                  /* check if page is covered */
>   2170                  sector_t sector = sh->dev[dd_idx].sector;
>   2171                  for (bi=sh->dev[dd_idx].towrite;
>   2172                       sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
>   2173                               bi && bi->bi_sector <= sector;
>                                      ^^
> 	It looks like "bi" can be NULL at the end of this for loop.
> 
> 
>   2174                       bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
>   2175                          if (bi->bi_sector + (bi->bi_size>>9) >= sector)
>   2176                                  sector = bi->bi_sector + (bi->bi_size>>9);
>   2177                  }
>   2178                  if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
>   2179                          set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
>   2180          }
>   2181          spin_unlock_irq(&conf->device_lock);
>   2182  
>   2183          pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
>   2184                  (unsigned long long)bi->bi_sector,
>                                             ^^^^^^^^^^^^^
> 	We dereference it here.
> 
>   2185                  (unsigned long long)sh->sector, dd_idx);
> 
> regards,
> dan carpenter


Thanks Dan,
 as it happens, I received a patch from Namhyung Kim just the day before which
 fixes that bug, so it is all fixed now.

Thanks,
NeilBrown


* Re: [PATCH resend 3/6] md: use proper little-endian bitops
From: NeilBrown @ 2011-06-28  6:43 UTC
  To: Akinobu Mita; +Cc: linux-kernel, akpm, linux-raid
In-Reply-To: <1309067876-3537-4-git-send-email-akinobu.mita@gmail.com>

On Sun, 26 Jun 2011 14:57:53 +0900 Akinobu Mita <akinobu.mita@gmail.com>
wrote:

> Using __test_and_{set,clear}_bit_le() with ignoring its return value
> can be replaced with __{set,clear}_bit_le().
> 
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> Cc: NeilBrown <neilb@suse.de>
> Cc: linux-raid@vger.kernel.org
> ---
>  drivers/md/bitmap.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 574b09a..870a7fc 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -932,7 +932,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
>  		if (bitmap->flags & BITMAP_HOSTENDIAN)
>  			set_bit(bit, kaddr);
>  		else
> -			__test_and_set_bit_le(bit, kaddr);
> +			__set_bit_le(bit, kaddr);
>  		kunmap_atomic(kaddr, KM_USER0);
>  		PRINTK("set file bit %lu page %lu\n", bit, page->index);
>  	}
> @@ -1304,7 +1304,7 @@ void bitmap_daemon_work(mddev_t *mddev)
>  						clear_bit(file_page_offset(bitmap, j),
>  							  paddr);
>  					else
> -						__test_and_clear_bit_le(file_page_offset(bitmap, j),
> +						__clear_bit_le(file_page_offset(bitmap, j),
>  							       paddr);
>  					kunmap_atomic(paddr, KM_USER0);
>  				} else


Thanks.  I've applied this one to my 'md' tree.

NeilBrown
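
For readers unfamiliar with the two families of helpers, here is a minimal userspace sketch of the difference the patch relies on. The `le_*` functions are hypothetical, non-atomic stand-ins, not the kernel implementations; they assume little-endian bit numbering, where bit n lives in byte n/8 at position n%8.

```c
#include <assert.h>
#include <stddef.h>

/* Set bit 'nr' using little-endian bit numbering; returns nothing. */
void le_set_bit(unsigned int nr, unsigned char *addr)
{
	assert(addr != NULL);
	addr[nr / 8] |= (unsigned char)(1u << (nr % 8));
}

/* Clear bit 'nr' using little-endian bit numbering; returns nothing. */
void le_clear_bit(unsigned int nr, unsigned char *addr)
{
	assert(addr != NULL);
	addr[nr / 8] &= (unsigned char)~(1u << (nr % 8));
}

/* Set bit 'nr' and also report its previous value (0 or 1). */
int le_test_and_set_bit(unsigned int nr, unsigned char *addr)
{
	int old;

	assert(addr != NULL);
	old = (addr[nr / 8] >> (nr % 8)) & 1;
	le_set_bit(nr, addr);
	return old;
}
```

When the caller discards the return value, as both call sites in bitmap.c do, the test-and-set form does the same work as the plain form and only obscures the intent.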


* Re: [PATCH 4/8] md/raid: use printk_ratelimited instead of printk_ratelimit
From: NeilBrown @ 2011-06-28  6:45 UTC
  To: Christian Dietrich; +Cc: linux-raid, linux-kernel, trivial
In-Reply-To: <2c759de6a8afa764cfaa32e2ea99b0f0099140d7.1307199715.git.christian.dietrich@informatik.uni-erlangen.de>

On Sat, 4 Jun 2011 17:36:21 +0200 Christian Dietrich
<christian.dietrich@informatik.uni-erlangen.de> wrote:

> As per printk_ratelimit comment, it should not be used
> 
> Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
> ---
>  drivers/md/raid1.c  |   22 ++++++++++++----------
>  drivers/md/raid10.c |   22 ++++++++++++----------
>  drivers/md/raid5.c  |   39 +++++++++++++++++++--------------------
>  3 files changed, 43 insertions(+), 40 deletions(-)

Thanks.  I've applied this one to my 'md' tree.

NeilBrown



> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 5d09609..30af10e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -35,6 +35,7 @@
>  #include <linux/delay.h>
>  #include <linux/blkdev.h>
>  #include <linux/seq_file.h>
> +#include <linux/ratelimit.h>
>  #include "md.h"
>  #include "raid1.h"
>  #include "bitmap.h"
> @@ -287,10 +288,11 @@ static void raid1_end_read_request(struct bio *bio, int error)
>  		 * oops, read error:
>  		 */
>  		char b[BDEVNAME_SIZE];
> -		if (printk_ratelimit())
> -			printk(KERN_ERR "md/raid1:%s: %s: rescheduling sector %llu\n",
> -			       mdname(conf->mddev),
> -			       bdevname(conf->mirrors[mirror].rdev->bdev,b), (unsigned long long)r1_bio->sector);
> +		printk_ratelimited(KERN_ERR "md/raid1:%s: %s: "
> +				   "rescheduling sector %llu\n",
> +				   mdname(conf->mddev),
> +				   bdevname(conf->mirrors[mirror].rdev->bdev, b),
> +				   (unsigned long long)r1_bio->sector);
>  		reschedule_retry(r1_bio);
>  	}
>  
> @@ -1574,12 +1576,12 @@ static void raid1d(mddev_t *mddev)
>  						      GFP_NOIO, mddev);
>  				r1_bio->bios[r1_bio->read_disk] = bio;
>  				rdev = conf->mirrors[disk].rdev;
> -				if (printk_ratelimit())
> -					printk(KERN_ERR "md/raid1:%s: redirecting sector %llu to"
> -					       " other mirror: %s\n",
> -					       mdname(mddev),
> -					       (unsigned long long)r1_bio->sector,
> -					       bdevname(rdev->bdev,b));
> +				printk_ratelimited(KERN_ERR
> +						   "md/raid1:%s: redirecting sector %llu to"
> +						   " other mirror: %s\n",
> +						   mdname(mddev),
> +						   (unsigned long long)r1_bio->sector,
> +						   bdevname(rdev->bdev, b));
>  				bio->bi_sector = r1_bio->sector + rdev->data_offset;
>  				bio->bi_bdev = rdev->bdev;
>  				bio->bi_end_io = raid1_end_read_request;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 6e84668..e80475a 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -22,6 +22,7 @@
>  #include <linux/delay.h>
>  #include <linux/blkdev.h>
>  #include <linux/seq_file.h>
> +#include <linux/ratelimit.h>
>  #include "md.h"
>  #include "raid10.h"
>  #include "raid0.h"
> @@ -277,10 +278,11 @@ static void raid10_end_read_request(struct bio *bio, int error)
>  		 * oops, read error - keep the refcount on the rdev
>  		 */
>  		char b[BDEVNAME_SIZE];
> -		if (printk_ratelimit())
> -			printk(KERN_ERR "md/raid10:%s: %s: rescheduling sector %llu\n",
> -			       mdname(conf->mddev),
> -			       bdevname(conf->mirrors[dev].rdev->bdev,b), (unsigned long long)r10_bio->sector);
> +		printk_ratelimited(KERN_ERR
> +				   "md/raid10:%s: %s: rescheduling sector %llu\n",
> +				   mdname(conf->mddev),
> +				   bdevname(conf->mirrors[dev].rdev->bdev, b),
> +				   (unsigned long long)r10_bio->sector);
>  		reschedule_retry(r10_bio);
>  	}
>  }
> @@ -1667,12 +1669,12 @@ static void raid10d(mddev_t *mddev)
>  				bio_put(bio);
>  				slot = r10_bio->read_slot;
>  				rdev = conf->mirrors[mirror].rdev;
> -				if (printk_ratelimit())
> -					printk(KERN_ERR "md/raid10:%s: %s: redirecting sector %llu to"
> -					       " another mirror\n",
> -					       mdname(mddev),
> -					       bdevname(rdev->bdev,b),
> -					       (unsigned long long)r10_bio->sector);
> +				printk_ratelimited(KERN_ERR
> +						   "md/raid10:%s: %s: redirecting "
> +						   "sector %llu to another mirror\n",
> +						   mdname(mddev),
> +						   bdevname(rdev->bdev, b),
> +						   (unsigned long long)r10_bio->sector);
>  				bio = bio_clone_mddev(r10_bio->master_bio,
>  						      GFP_NOIO, mddev);
>  				r10_bio->devs[slot].bio = bio;
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 346e69b..8927c26 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -51,6 +51,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/cpu.h>
>  #include <linux/slab.h>
> +#include <linux/ratelimit.h>
>  #include "md.h"
>  #include "raid5.h"
>  #include "raid0.h"
> @@ -96,8 +97,6 @@
>  #define __inline__
>  #endif
>  
> -#define printk_rl(args...) ((void) (printk_ratelimit() && printk(args)))
> -
>  /*
>   * We maintain a biased count of active stripes in the bottom 16 bits of
>   * bi_phys_segments, and a count of processed stripes in the upper 16 bits
> @@ -1587,12 +1586,12 @@ static void raid5_end_read_request(struct bio * bi, int error)
>  		set_bit(R5_UPTODATE, &sh->dev[i].flags);
>  		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
>  			rdev = conf->disks[i].rdev;
> -			printk_rl(KERN_INFO "md/raid:%s: read error corrected"
> -				  " (%lu sectors at %llu on %s)\n",
> -				  mdname(conf->mddev), STRIPE_SECTORS,
> -				  (unsigned long long)(sh->sector
> -						       + rdev->data_offset),
> -				  bdevname(rdev->bdev, b));
> +			printk_ratelimited(KERN_INFO "md/raid:%s: read error corrected"
> +					   " (%lu sectors at %llu on %s)\n",
> +					   mdname(conf->mddev), STRIPE_SECTORS,
> +					   (unsigned long long)(sh->sector
> +								+ rdev->data_offset),
> +					   bdevname(rdev->bdev, b));
>  			clear_bit(R5_ReadError, &sh->dev[i].flags);
>  			clear_bit(R5_ReWrite, &sh->dev[i].flags);
>  		}
> @@ -1606,21 +1605,21 @@ static void raid5_end_read_request(struct bio * bi, int error)
>  		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
>  		atomic_inc(&rdev->read_errors);
>  		if (conf->mddev->degraded >= conf->max_degraded)
> -			printk_rl(KERN_WARNING
> -				  "md/raid:%s: read error not correctable "
> -				  "(sector %llu on %s).\n",
> -				  mdname(conf->mddev),
> -				  (unsigned long long)(sh->sector
> -						       + rdev->data_offset),
> +			printk_ratelimited(KERN_WARNING
> +					   "md/raid:%s: read error not correctable "
> +					   "(sector %llu on %s).\n",
> +					   mdname(conf->mddev),
> +					   (unsigned long long)(sh->sector
> +								+ rdev->data_offset),
>  				  bdn);
>  		else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
>  			/* Oh, no!!! */
> -			printk_rl(KERN_WARNING
> -				  "md/raid:%s: read error NOT corrected!! "
> -				  "(sector %llu on %s).\n",
> -				  mdname(conf->mddev),
> -				  (unsigned long long)(sh->sector
> -						       + rdev->data_offset),
> +			printk_ratelimited(KERN_WARNING
> +					   "md/raid:%s: read error NOT corrected!! "
> +					   "(sector %llu on %s).\n",
> +					   mdname(conf->mddev),
> +					   (unsigned long long)(sh->sector
> +								+ rdev->data_offset),
>  				  bdn);
>  		else if (atomic_read(&rdev->read_errors)
>  			 > conf->max_nr_stripes)



* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 10:18 UTC
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <4E08980B.5080002@turmel.org>


On 27.6.2011 16:47, Phil Turmel wrote:
> Hi Pavel,
> 
> On 06/27/2011 10:15 AM, Pavel Hofman wrote:
>> Hi,
>> 
>> 
>> Our mdadm.conf lists the raids in proper order, corresponding to
>> their dependency.
> 
> I would first check the copy of mdadm.conf in your initramfs.  If it
> specifies just the raid1, you can end up in this situation.
> Most distributions have an 'update-initramfs' script or something
> similar which must be run after any updates to files that are needed
> in early boot.

Hi Phil,

Thanks a lot for your reply. I update the initramfs image regularly.
Just to make sure, I uncompressed the current image; its mdadm.conf lists
all the raids correctly:

DEVICE /dev/sd[a-z][1-9] /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9
ARRAY /dev/md5 level=raid1 metadata=1.0 num-devices=2 UUID=2f88c280:3d7af418:e8d459c5:782e3ed2
ARRAY /dev/md6 level=raid1 metadata=1.0 num-devices=2 UUID=1f83ea99:a9e4d498:a6543047:af0a3b38
ARRAY /dev/md7 level=raid1 metadata=1.0 num-devices=2 UUID=dde16cd5:2e17c743:fcc7926c:fcf5081e
ARRAY /dev/md3 level=raid0 num-devices=2 UUID=8c9c28dd:ac12a9ef:a6200310:fe6d9686
ARRAY /dev/md1 level=raid1 num-devices=5 UUID=588cbbfd:4835b4da:0d7a0b1c:7bf552bb
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=28714b52:55b123f5:a6200310:fe6d9686
ARRAY /dev/md4 level=raid0 num-devices=2 UUID=ce213d01:e50809ed:a6200310:fe6d9686
ARRAY /dev/md8 level=raid0 num-devices=2 metadata=00.90 UUID=5d23817a:fde9d31b:05afacbb:371c5cc4
ARRAY /dev/md9 level=raid0 num-devices=2 metadata=00.90 UUID=9854dd7a:43e8f27f:05afacbb:371c5cc4


This is my rather complex setup:
Personalities : [raid1] [raid0]
md4 : active raid0 sdb1[0] sdd3[1]
      2178180864 blocks 64k chunks

md2 : active raid1 sdc2[0] sdd2[1]
      8787456 blocks [2/2] [UU]

md3 : active raid0 sda1[0] sdc3[1]
      2178180864 blocks 64k chunks

md7 : active raid1 md6[2] md5[1]
      2178180592 blocks super 1.0 [2/1] [_U]
      [===========>.........]  recovery = 59.3% (1293749868/2178180592) finish=164746.8min speed=87K/sec

md6 : active raid1 md4[0]
      2178180728 blocks super 1.0 [2/1] [U_]

md5 : active raid1 md3[2]
      2178180728 blocks super 1.0 [2/1] [U_]
      bitmap: 9/9 pages [36KB], 131072KB chunk

md1 : active raid1 sdc1[0] sdd1[3]
      10739328 blocks [5/2] [U__U_]


You can see md7 recovering, even though both md5 and md6 were present.

Here is the relevant part of dmesg at boot:


[   11.957040] device-mapper: uevent: version 1.0.3
[   11.957040] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com
[   12.017047] md: md1 still in use.
[   12.017047] md: md1 still in use.
[   12.017821] md: md5 stopped.
[   12.133051] md: md6 stopped.
[   12.134968] md: md7 stopped.
[   12.141042] md: md3 stopped.
[   12.193037] md: bind<sdc3>
[   12.193037] md: bind<sda1>
[   12.237037] md: raid0 personality registered for level 0
[   12.237037] md3: setting max_sectors to 128, segment boundary to 32767
[   12.237037] raid0: looking at sda1
[   12.237037] raid0:   comparing sda1(732571904) with sda1(732571904)
[   12.237037] raid0:   END
[   12.237037] raid0:   ==> UNIQUE
[   12.237037] raid0: 1 zones
[   12.237037] raid0: looking at sdc3
[   12.237037] raid0:   comparing sdc3(1445608960) with sda1(732571904)
[   12.237037] raid0:   NOT EQUAL
[   12.237037] raid0:   comparing sdc3(1445608960) with sdc3(1445608960)
[   12.237037] raid0:   END
[   12.237037] raid0:   ==> UNIQUE
[   12.237037] raid0: 2 zones
[   12.237037] raid0: FINAL 2 zones
[   12.237037] raid0: zone 1
[   12.237037] raid0: checking sda1 ... nope.
[   12.237037] raid0: checking sdc3 ... contained as device 0
[   12.237037]   (1445608960) is smallest!.
[   12.237037] raid0: zone->nb_dev: 1, size: 713037056
[   12.237037] raid0: current zone offset: 1445608960
[   12.237037] raid0: done.
[   12.237037] raid0 : md_size is 2178180864 blocks.
[   12.237037] raid0 : conf->hash_spacing is 1465143808 blocks.
[   12.237037] raid0 : nb_zone is 2.
[   12.237037] raid0 : Allocating 16 bytes for hash.
[   12.241039] md: md2 stopped.
[   12.261038] md: bind<sdd2>
[   12.261038] md: bind<sdc2>
[   12.305037] raid1: raid set md2 active with 2 out of 2 mirrors
[   12.305037] md: md4 stopped.
[   12.317037] md: bind<sdd3>
[   12.317037] md: bind<sdb1>
[   12.361036] md4: setting max_sectors to 128, segment boundary to 32767
[   12.361036] raid0: looking at sdb1
[   12.361036] raid0:   comparing sdb1(732571904) with sdb1(732571904)
[   12.361036] raid0:   END
[   12.361036] raid0:   ==> UNIQUE
[   12.361036] raid0: 1 zones
[   12.361036] raid0: looking at sdd3
[   12.361036] raid0:   comparing sdd3(1445608960) with sdb1(732571904)
[   12.361036] raid0:   NOT EQUAL
[   12.361036] raid0:   comparing sdd3(1445608960) with sdd3(1445608960)
[   12.361036] raid0:   END
[   12.361036] raid0:   ==> UNIQUE
[   12.361036] raid0: 2 zones
[   12.361036] raid0: FINAL 2 zones
[   12.361036] raid0: zone 1
[   12.361036] raid0: checking sdb1 ... nope.
[   12.361036] raid0: checking sdd3 ... contained as device 0
[   12.361036]   (1445608960) is smallest!.
[   12.361036] raid0: zone->nb_dev: 1, size: 713037056
[   12.361036] raid0: current zone offset: 1445608960
[   12.361036] raid0: done.
[   12.361036] raid0 : md_size is 2178180864 blocks.
[   12.361036] raid0 : conf->hash_spacing is 1465143808 blocks.
[   12.361036] raid0 : nb_zone is 2.
[   12.361036] raid0 : Allocating 16 bytes for hash.
[   12.361036] md: md8 stopped.
[   12.413036] md: md9 stopped.
[   12.429036] md: bind<md3>
[   12.469035] raid1: raid set md5 active with 1 out of 2 mirrors
[   12.473035] md5: bitmap initialized from disk: read 1/1 pages, set 5027 bits
[   12.473035] created bitmap (9 pages) for device md5
[   12.509036] md: bind<md5>
[   12.549035] raid1: raid set md7 active with 1 out of 2 mirrors
[   12.573039] md: md6 stopped.
[   12.573039] md: bind<md4>
[   12.573039] md: md6: raid array is not clean -- starting background reconstruction
[   12.617034] raid1: raid set md6 active with 1 out of 2 mirrors

Please notice that md7 is assembled before md6, its component, is even
mentioned. After that, md6 is marked as not clean, even though both
md5 and md6 are degraded (the missing drives are connected weekly via
eSATA from an external enclosure and used for offline backups).

Plus, how can a background reconstruction be started on md6 if it is
degraded and the other half of the mirror is not even present?

Thanks a lot,

Pavel.



* Re: Raid auto-assembly upon boot - device order
From: Phil Turmel @ 2011-06-28 11:03 UTC
  To: Pavel Hofman; +Cc: linux-raid
In-Reply-To: <4E09AA68.2050302@ivitera.com>

Good morning, Pavel,

On 06/28/2011 06:18 AM, Pavel Hofman wrote:
> 
> On 27.6.2011 16:47, Phil Turmel wrote:
>> Hi Pavel,
>>
>> On 06/27/2011 10:15 AM, Pavel Hofman wrote:
>>> Hi,
>>>
>>>
>>> Our mdadm.conf lists the raids in proper order, corresponding to
>>> their dependency.
>>
>> I would first check the copy of mdadm.conf in your initramfs.  If it
>> specifies just the raid1, you can end up in this situation.
>> Most distributions have an 'update-initramfs' script or something
>> similar which must be run after any updates to files that are needed
>> in early boot.
> 
> Hi Phil,
> 
> Thanks a lot for your reply. I update the initramfs image regularly.
> Just to make sure, I uncompressed the current image; its mdadm.conf lists
> all the raids correctly:
> 
> DEVICE /dev/sd[a-z][1-9] /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9
> ARRAY /dev/md5 level=raid1 metadata=1.0 num-devices=2 UUID=2f88c280:3d7af418:e8d459c5:782e3ed2
> ARRAY /dev/md6 level=raid1 metadata=1.0 num-devices=2 UUID=1f83ea99:a9e4d498:a6543047:af0a3b38
> ARRAY /dev/md7 level=raid1 metadata=1.0 num-devices=2 UUID=dde16cd5:2e17c743:fcc7926c:fcf5081e
> ARRAY /dev/md3 level=raid0 num-devices=2 UUID=8c9c28dd:ac12a9ef:a6200310:fe6d9686
> ARRAY /dev/md1 level=raid1 num-devices=5 UUID=588cbbfd:4835b4da:0d7a0b1c:7bf552bb
> ARRAY /dev/md2 level=raid1 num-devices=2 UUID=28714b52:55b123f5:a6200310:fe6d9686
> ARRAY /dev/md4 level=raid0 num-devices=2 UUID=ce213d01:e50809ed:a6200310:fe6d9686
> ARRAY /dev/md8 level=raid0 num-devices=2 metadata=00.90 UUID=5d23817a:fde9d31b:05afacbb:371c5cc4
> ARRAY /dev/md9 level=raid0 num-devices=2 metadata=00.90 UUID=9854dd7a:43e8f27f:05afacbb:371c5cc4

OK.  Some are out of order (md3 & md4 ought to be listed before md5 & md6), but it seems not to matter.
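
To make the ordering remark concrete, here is a small illustrative check. It is a sketch only: `struct md_entry` and `first_out_of_order()` are hypothetical names, the member lists are typed in by hand from the mdstat output rather than parsed from mdadm.conf, and it says nothing about how mdadm itself processes the file.

```c
#include <assert.h>
#include <string.h>

/* One ARRAY entry: its name and the md devices it is built on, if any. */
struct md_entry {
	const char *name;
	const char *members[2];	/* NULL when the member is not an md device */
};

/*
 * Returns the index of the first entry that is listed before one of its
 * md members, or -1 if every member precedes the array built on it.
 */
int first_out_of_order(const struct md_entry *list, int n)
{
	int i, j, k;

	assert(list != NULL);
	for (i = 0; i < n; i++) {
		for (k = 0; k < 2; k++) {
			if (list[i].members[k] == NULL)
				continue;
			/* A member that appears later in the list is out of order. */
			for (j = i + 1; j < n; j++)
				if (strcmp(list[j].name, list[i].members[k]) == 0)
					return i;
		}
	}
	return -1;
}
```

With the entries in the order posted (md5, md6, md7, md3, ...), the function reports index 0, because md5 is listed before its member md3; with md3 and md4 moved to the front it reports -1.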

> This is my rather complex setup:
> Personalities : [raid1] [raid0]
> md4 : active raid0 sdb1[0] sdd3[1]
>       2178180864 blocks 64k chunks
> 
> md2 : active raid1 sdc2[0] sdd2[1]
>       8787456 blocks [2/2] [UU]
> 
> md3 : active raid0 sda1[0] sdc3[1]
>       2178180864 blocks 64k chunks
> 
> md7 : active raid1 md6[2] md5[1]
>       2178180592 blocks super 1.0 [2/1] [_U]
>       [===========>.........]  recovery = 59.3% (1293749868/2178180592) finish=164746.8min speed=87K/sec
> 
> md6 : active raid1 md4[0]
>       2178180728 blocks super 1.0 [2/1] [U_]
> 
> md5 : active raid1 md3[2]
>       2178180728 blocks super 1.0 [2/1] [U_]
>       bitmap: 9/9 pages [36KB], 131072KB chunk
> 
> md1 : active raid1 sdc1[0] sdd1[3]
>       10739328 blocks [5/2] [U__U_]
> 
> 
> You can see md7 recovering, even though both md5 and md6 were present.

Yes, but md5 & md6 are themselves degraded.  Should not have started unless you are globally enabling it.
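
The degraded state is visible in the "[n/m]" field of each status line above: n is the number of configured devices and m the number currently active. A minimal sketch of reading that field follows; `md_line_degraded()` is a hypothetical helper, and it assumes the field has exactly the "[2/1]" shape shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the line shows fewer active than configured devices,
 * 0 if all devices are active, and -1 if no "[n/m]" field is found.
 */
int md_line_degraded(const char *line)
{
	const char *p;
	int total, active;

	assert(line != NULL);

	/* Try every '[' in turn; fields such as "[U_]" or "[36KB]" do not match. */
	for (p = strchr(line, '['); p != NULL; p = strchr(p + 1, '[')) {
		if (sscanf(p, "[%d/%d]", &total, &active) == 2)
			return active < total;
	}
	return -1;
}
```

Applied to the output quoted above, the md5, md6, md7 and md1 status lines count as degraded, while the md2 line ("[2/2] [UU]") does not.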

ps.  "lsdrv" would be really useful here to understand your layering setup.

http://github.com/pturmel/lsdrv

> Here is the relevant part of dmesg at boot:
> 
> 
> [   11.957040] device-mapper: uevent: version 1.0.3
> [   11.957040] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com
> [   12.017047] md: md1 still in use.
> [   12.017047] md: md1 still in use.
> [   12.017821] md: md5 stopped.
> [   12.133051] md: md6 stopped.
> [   12.134968] md: md7 stopped.
> [   12.141042] md: md3 stopped.
> [   12.193037] md: bind<sdc3>
> [   12.193037] md: bind<sda1>
> [   12.237037] md: raid0 personality registered for level 0
> [   12.237037] md3: setting max_sectors to 128, segment boundary to 32767
> [   12.237037] raid0: looking at sda1
> [   12.237037] raid0:   comparing sda1(732571904) with sda1(732571904)
> [   12.237037] raid0:   END
> [   12.237037] raid0:   ==> UNIQUE
> [   12.237037] raid0: 1 zones
> [   12.237037] raid0: looking at sdc3
> [   12.237037] raid0:   comparing sdc3(1445608960) with sda1(732571904)
> [   12.237037] raid0:   NOT EQUAL
> [   12.237037] raid0:   comparing sdc3(1445608960) with sdc3(1445608960)
> [   12.237037] raid0:   END
> [   12.237037] raid0:   ==> UNIQUE
> [   12.237037] raid0: 2 zones
> [   12.237037] raid0: FINAL 2 zones
> [   12.237037] raid0: zone 1
> [   12.237037] raid0: checking sda1 ... nope.
> [   12.237037] raid0: checking sdc3 ... contained as device 0
> [   12.237037]   (1445608960) is smallest!.
> [   12.237037] raid0: zone->nb_dev: 1, size: 713037056
> [   12.237037] raid0: current zone offset: 1445608960
> [   12.237037] raid0: done.
> [   12.237037] raid0 : md_size is 2178180864 blocks.
> [   12.237037] raid0 : conf->hash_spacing is 1465143808 blocks.
> [   12.237037] raid0 : nb_zone is 2.
> [   12.237037] raid0 : Allocating 16 bytes for hash.
> [   12.241039] md: md2 stopped.
> [   12.261038] md: bind<sdd2>
> [   12.261038] md: bind<sdc2>
> [   12.305037] raid1: raid set md2 active with 2 out of 2 mirrors
> [   12.305037] md: md4 stopped.
> [   12.317037] md: bind<sdd3>
> [   12.317037] md: bind<sdb1>
> [   12.361036] md4: setting max_sectors to 128, segment boundary to 32767
> [   12.361036] raid0: looking at sdb1
> [   12.361036] raid0:   comparing sdb1(732571904) with sdb1(732571904)
> [   12.361036] raid0:   END
> [   12.361036] raid0:   ==> UNIQUE
> [   12.361036] raid0: 1 zones
> [   12.361036] raid0: looking at sdd3
> [   12.361036] raid0:   comparing sdd3(1445608960) with sdb1(732571904)
> [   12.361036] raid0:   NOT EQUAL
> [   12.361036] raid0:   comparing sdd3(1445608960) with sdd3(1445608960)
> [   12.361036] raid0:   END
> [   12.361036] raid0:   ==> UNIQUE
> [   12.361036] raid0: 2 zones
> [   12.361036] raid0: FINAL 2 zones
> [   12.361036] raid0: zone 1
> [   12.361036] raid0: checking sdb1 ... nope.
> [   12.361036] raid0: checking sdd3 ... contained as device 0
> [   12.361036]   (1445608960) is smallest!.
> [   12.361036] raid0: zone->nb_dev: 1, size: 713037056
> [   12.361036] raid0: current zone offset: 1445608960
> [   12.361036] raid0: done.
> [   12.361036] raid0 : md_size is 2178180864 blocks.
> [   12.361036] raid0 : conf->hash_spacing is 1465143808 blocks.
> [   12.361036] raid0 : nb_zone is 2.
> [   12.361036] raid0 : Allocating 16 bytes for hash.
> [   12.361036] md: md8 stopped.
> [   12.413036] md: md9 stopped.
> [   12.429036] md: bind<md3>
> [   12.469035] raid1: raid set md5 active with 1 out of 2 mirrors
> [   12.473035] md5: bitmap initialized from disk: read 1/1 pages, set 5027 bits
> [   12.473035] created bitmap (9 pages) for device md5
> [   12.509036] md: bind<md5>
> [   12.549035] raid1: raid set md7 active with 1 out of 2 mirrors
> [   12.573039] md: md6 stopped.
> [   12.573039] md: bind<md4>
> [   12.573039] md: md6: raid array is not clean -- starting background reconstruction
> [   12.617034] raid1: raid set md6 active with 1 out of 2 mirrors
> 
> Please notice that md7 is assembled before md6, its component, is even
> mentioned. After that, md6 is marked as not clean, even though both
> md5 and md6 are degraded (the missing drives are connected weekly via
> eSATA from an external enclosure and used for offline backups).

I suspect it is merely timing.  You are using degraded arrays deliberately as part of your backup scheme, which means you must be using "start_dirty_degraded" as a kernel parameter.  That enables md7, which you don't want degraded, to start degraded when md6 is a hundred or so milliseconds late to the party.

I think you have a couple options:

1) Don't run degraded arrays.  Use other backup tools.
2) Remove md7 from your mdadm.conf in your initramfs.  Don't let early userspace assemble it.  The extra time should then allow your initscripts on your real root fs to assemble it with both members.  This only works if md7 does not contain your real root fs.

> Plus, how can a background reconstruction be started on md6 if it is
> degraded and the other half of the mirror is not even present?

Don't know.  Maybe one of your existing drives is occupying a major/minor combination that your eSATA drive occupied on your last backup.  I'm pretty sure the message is harmless.  I noticed that md5 has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6 would change the timing enough to help you.

Relying on timing variations for successful boot doesn't sound great to me.

> Thanks a lot,
> 
> Pavel.
> 

HTH,

Phil


* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 12:01 UTC
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <4E09B50B.20306@turmel.org>

Hi Phil,

On 28.6.2011 13:03, Phil Turmel wrote:
> Good morning, Pavel,
> 
> On 06/28/2011 06:18 AM, Pavel Hofman wrote:
>> 
>> 
>> Hi Phil,
>> 
>> This is my rather complex setup:
>> Personalities : [raid1] [raid0]
>> md4 : active raid0 sdb1[0] sdd3[1]
>>       2178180864 blocks 64k chunks
>>
>> md2 : active raid1 sdc2[0] sdd2[1]
>>       8787456 blocks [2/2] [UU]
>>
>> md3 : active raid0 sda1[0] sdc3[1]
>>       2178180864 blocks 64k chunks
>>
>> md7 : active raid1 md6[2] md5[1]
>>       2178180592 blocks super 1.0 [2/1] [_U]
>>       [===========>.........]  recovery = 59.3% (1293749868/2178180592) finish=164746.8min speed=87K/sec
>>
>> md6 : active raid1 md4[0]
>>       2178180728 blocks super 1.0 [2/1] [U_]
>>
>> md5 : active raid1 md3[2]
>>       2178180728 blocks super 1.0 [2/1] [U_]
>>       bitmap: 9/9 pages [36KB], 131072KB chunk
>>
>> md1 : active raid1 sdc1[0] sdd1[3]
>>       10739328 blocks [5/2] [U__U_]
>>
>>
>> You can see md7 recovering, even though both md5 and md6 were
>> present.
> 
> Yes, but md5 & md6 are themselves degraded.  Should not have started
> unless you are globally enabling it.

> 
> ps.  "lsdrv" would be really useful here to understand your layering
> setup.
> 
> http://github.com/pturmel/lsdrv

Thanks a lot for your quick reply. And for your wonderful tool too.

orfeus:/boot# lsdrv
PCI [AMD_IDE] 00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
 └─ide 2.0 HL-DT-ST RW/DVD GCC-H20N {[No Information Found]}
    └─hde: [33:0] Empty/Unknown 4.00g
PCI [sata_nv] 00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
 ├─scsi 0:0:0:0 ATA SAMSUNG HD753LJ {S13UJDWQ912345}
 │  └─sda: [8:0] MD raid10 (4) 698.64g inactive {646f62e3:626d2cb3:05afacbb:371c5cc4}
 │     └─sda1: [8:1] MD raid0 (0/2) 698.64g md3 clean in_sync {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
 │        └─md3: [9:3] MD raid1 (0/2) 2.03t md5 active in_sync 'orfeus:5' {2f88c280:3d7af418:e8d459c5:782e3ed2}
 │           └─md5: [9:5] MD raid1 (1/2) 2.03t md7 active in_sync 'orfeus:7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
 │              └─md7: [9:7] (xfs) 2.03t 'backup' {d987301b-dfb1-4c99-8f72-f4b400ba46c9}
 │                 └─Mounted as /dev/md7 @ /mnt/raid
 └─scsi 1:0:0:0 ATA ST3750330AS {9QK0VFJ9}
    └─sdb: [8:16] Empty/Unknown 698.64g
       └─sdb1: [8:17] MD raid0 (0/2) 698.64g md4 clean in_sync {ce213d01:e50809ed:a6200310:fe6d9686}
          └─md4: [9:4] MD raid1 (0/2) 2.03t md6 active in_sync ''orfeus':6' {1f83ea99:a9e4d498:a6543047:af0a3b38}
             └─md6: [9:6] MD raid1 (0/2) 2.03t md7 active spare ''orfeus':7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
PCI [sata_nv] 00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
 ├─scsi 2:0:0:0 ATA ST31500341AS {9VS15Y1L}
 │  └─sdc: [8:32] Empty/Unknown 1.36t
 │     ├─sdc1: [8:33] MD raid1 (0/5) 10.24g md1 clean in_sync {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
 │     │  └─md1: [9:1] (ext3) 10.24g {f620df1e-6dd6-43ab-b4e6-8e1fd4a447f7}
 │     │     └─Mounted as /dev/md1 @ /
 │     ├─sdc2: [8:34] MD raid1 (0/2) 8.38g md2 clean in_sync {28714b52:55b123f5:a6200310:fe6d9686}
 │     │  └─md2: [9:2] (swap) 8.38g {1804bbc6-a61b-44ea-9cc9-ac3ce6f17305}
 │     └─sdc3: [8:35] MD raid0 (1/2) 1.35t md3 clean in_sync {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
 └─scsi 3:0:0:0 ATA ST31500341AS {9VS13H4N}
    └─sdd: [8:48] Empty/Unknown 1.36t
       ├─sdd1: [8:49] MD raid1 (3/5) 10.24g md1 clean in_sync {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
       ├─sdd2: [8:50] MD raid1 (1/2) 8.38g md2 clean in_sync {28714b52:55b123f5:a6200310:fe6d9686}
       └─sdd3: [8:51] MD raid0 (1/2) 1.35t md4 clean in_sync {ce213d01:e50809ed:a6200310:fe6d9686}

Still, you understood the setup fine at first glance, even without the visualisation :)

> 
> 
> I suspect it is merely timing.  You are using degraded arrays
> deliberately as part of your backup scheme, which means you must be
> using "start_dirty_degraded" as a kernel parameter.  That enables
> md7, which you don't want degraded, to start degraded when md6 is a
> hundred or so milliseconds late to the party.

Running rgrep on /etc and /boot reveals no such kernel parameter on this
system. I have never had problems with the arrays not starting; perhaps
it is compiled into the Debian kernel (lenny)? The config for the current
kernel in /boot does not list any such parameter either.

Would using this parameter just change the timing?

> 
> I think you have a couple options:
> 
> 1) Don't run degraded arrays.  Use other backup tools.

It took me several years to find a reasonably fast way to offline-backup
that partition with tens of millions of backuppc hardlinks :)

> 2) Remove md7
> from your mdadm.conf in your initramfs.  Don't let early userspace
> assemble it.  The extra time should then allow your initscripts on
> your real root fs to assemble it with both members.  This only works
> if md7 does not contain your real root fs.

Fantastic, I will do so. I just have to find a way to keep a different
mdadm.conf in /etc and in the initramfs while preserving the useful
update-initramfs functionality :)
> 
>> Plus how can a background reconstruction be started on md6, if
>> it is degraded and the other mirroring part is not even present?
> 
> Don't know.  Maybe one of your existing drives is occupying a
> major/minor combination that your esata drive occupied on your last
> backup.  I'm pretty sure the message is harmless.  I noticed that md5
> has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6
> would change the timing enough to help you.

Wow, there is indeed a bitmap missing on md6. I swear it was there in the
past :) It significantly cuts down the synchronization time for offline
copies. I have two offline drive sets, each rotating every two weeks.
One offline set plugs into md5, the other one into md6. This way I can
have two bitmaps, one for each set. Apparently, not now :-)
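
To show why the bitmap matters for the resync time, here is a toy model in C. It is only a sketch under assumptions: the names (wi_bitmap, wi_init, wi_mark_write, wi_resync_kib) are made up for illustration, it keeps one byte per chunk in memory, and it is not how md actually stores or updates its bitmap. The device size and the 131072KB chunk size are the md5 figures from /proc/mdstat above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy write-intent tracker: one flag per chunk (hypothetical, not md's format). */
struct wi_bitmap {
	uint64_t chunk_kib;     /* chunk size in KiB */
	uint64_t nchunks;       /* number of chunks covering the device */
	unsigned char *dirty;   /* dirty[c] != 0 means chunk c was written */
};

/* Allocate flags for a device of dev_kib KiB; returns 0 on success. */
static int wi_init(struct wi_bitmap *b, uint64_t dev_kib, uint64_t chunk_kib)
{
	b->chunk_kib = chunk_kib;
	b->nchunks = (dev_kib + chunk_kib - 1) / chunk_kib;  /* round up */
	b->dirty = calloc(b->nchunks, 1);
	return b->dirty ? 0 : -1;
}

/* Record a write of len_kib KiB starting at start_kib. */
static void wi_mark_write(struct wi_bitmap *b, uint64_t start_kib, uint64_t len_kib)
{
	uint64_t first, last, c;

	if (len_kib == 0)
		return;
	first = start_kib / b->chunk_kib;
	last = (start_kib + len_kib - 1) / b->chunk_kib;
	for (c = first; c <= last && c < b->nchunks; c++)
		b->dirty[c] = 1;
}

/* Amount of data a resync has to copy: only the chunks marked dirty. */
static uint64_t wi_resync_kib(const struct wi_bitmap *b)
{
	uint64_t c, n = 0;

	for (c = 0; c < b->nchunks; c++)
		n += b->dirty[c] ? 1 : 0;
	return n * b->chunk_kib;
}
```

With the md5 numbers the device splits into 16619 chunks, so a re-added mirror half only needs the chunks touched while it was offline rather than the whole 2 TB. Without a bitmap every chunk has to be treated as dirty.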

> 
> Relying on timing variations for successful boot doesn't sound great
> to me.

You are right. Hopefully the significantly delayed assembly will work OK.

I very much appreciate your help, thanks a lot,

Pavel.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Software RAID and TRIM
From: Tom De Mulder @ 2011-06-28 15:31 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: TEXT/PLAIN, Size: 847 bytes --]

Hi,

I'm investigating SSD performance on Linux, in particular for RAID 
devices.

As I understand it—and please correct me if I'm wrong—currently software 
RAID does not pass through TRIM to the underlying devices. TRIM is 
essential for the continued high performance of SSDs, which otherwise 
degrade over time.

I don't think there would be any harm in this command being passed through 
to underlying devices if they don't support it (they would just ignore 
it), and if they do it would make high-performance software RAID of SSDs a 
possibility.

Is this something that's in the works?

Many thanks,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 28/06/2011 : The Moon is Waning Crescent (22% of Full)

^ permalink raw reply

* Re: Raid auto-assembly upon boot - device order
From: Phil Turmel @ 2011-06-28 15:39 UTC (permalink / raw)
  To: Pavel Hofman; +Cc: linux-raid
In-Reply-To: <4E09C295.8040102@ivitera.com>

On 06/28/2011 08:01 AM, Pavel Hofman wrote:
> Hi Phil,

[...]

> Thanks a lot for your quick reply. And for your wonderful tool too.

You're welcome.

> orfeus:/boot# lsdrv
> PCI [AMD_IDE] 00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
>  └─ide 2.0 HL-DT-ST RW/DVD GCC-H20N {[No Information Found]}
>     └─hde: [33:0] Empty/Unknown 4.00g
> PCI [sata_nv] 00:05.0 IDE interface: nVidia Corporation MCP55 SATA
> Controller (rev a3)
>  ├─scsi 0:0:0:0 ATA SAMSUNG HD753LJ {S13UJDWQ912345}
>  │  └─sda: [8:0] MD raid10 (4) 698.64g inactive
> {646f62e3:626d2cb3:05afacbb:371c5cc4}
>  │     └─sda1: [8:1] MD raid0 (0/2) 698.64g md3 clean in_sync
> {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
>  │        └─md3: [9:3] MD raid1 (0/2) 2.03t md5 active in_sync
> 'orfeus:5' {2f88c280:3d7af418:e8d459c5:782e3ed2}
>  │           └─md5: [9:5] MD raid1 (1/2) 2.03t md7 active in_sync
> 'orfeus:7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
>  │              └─md7: [9:7] (xfs) 2.03t 'backup'
> {d987301b-dfb1-4c99-8f72-f4b400ba46c9}
>  │                 └─Mounted as /dev/md7 @ /mnt/raid
>  └─scsi 1:0:0:0 ATA ST3750330AS {9QK0VFJ9}
>     └─sdb: [8:16] Empty/Unknown 698.64g
>        └─sdb1: [8:17] MD raid0 (0/2) 698.64g md4 clean in_sync
> {ce213d01:e50809ed:a6200310:fe6d9686}
>           └─md4: [9:4] MD raid1 (0/2) 2.03t md6 active in_sync
> ''orfeus':6' {1f83ea99:a9e4d498:a6543047:af0a3b38}
>              └─md6: [9:6] MD raid1 (0/2) 2.03t md7 active spare
> ''orfeus':7' {dde16cd5:2e17c743:fcc7926c:fcf5081e}
> PCI [sata_nv] 00:05.1 IDE interface: nVidia Corporation MCP55 SATA
> Controller (rev a3)
>  ├─scsi 2:0:0:0 ATA ST31500341AS {9VS15Y1L}
>  │  └─sdc: [8:32] Empty/Unknown 1.36t
>  │     ├─sdc1: [8:33] MD raid1 (0/5) 10.24g md1 clean in_sync
> {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
>  │     │  └─md1: [9:1] (ext3) 10.24g {f620df1e-6dd6-43ab-b4e6-8e1fd4a447f7}
>  │     │     └─Mounted as /dev/md1 @ /
>  │     ├─sdc2: [8:34] MD raid1 (0/2) 8.38g md2 clean in_sync
> {28714b52:55b123f5:a6200310:fe6d9686}
>  │     │  └─md2: [9:2] (swap) 8.38g {1804bbc6-a61b-44ea-9cc9-ac3ce6f17305}
>  │     └─sdc3: [8:35] MD raid0 (1/2) 1.35t md3 clean in_sync
> {8c9c28dd:ac12a9ef:a6200310:fe6d9686}
>  └─scsi 3:0:0:0 ATA ST31500341AS {9VS13H4N}
>     └─sdd: [8:48] Empty/Unknown 1.36t
>        ├─sdd1: [8:49] MD raid1 (3/5) 10.24g md1 clean in_sync
> {588cbbfd:4835b4da:0d7a0b1c:7bf552bb}
>        ├─sdd2: [8:50] MD raid1 (1/2) 8.38g md2 clean in_sync
> {28714b52:55b123f5:a6200310:fe6d9686}
>        └─sdd3: [8:51] MD raid0 (1/2) 1.35t md4 clean in_sync
> {ce213d01:e50809ed:a6200310:fe6d9686}

Pretty deep layering.  I think I'm going to reduce the amount of indentation per layer.

> Still, you understood the setup fine at first glance, even without the visualisation :)
> 
>>
>>
>> I suspect it is merely timing.  You are using degraded arrays
>> deliberately as part of your backup scheme, which means you must be
>> using "start_dirty_degraded" as a kernel parameter.  That enables
>> md7, which you don't want degraded, to start degraded when md6 is a
>> hundred or so milliseconds late to the party.
> 
> Running rgrep on /etc and /boot reveals no such kernel parameter on this
> system. I have never had problems with the arrays not starting; perhaps
> it is compiled into the Debian kernel (lenny)? The config for the current
> kernel in /boot does not list any such parameter either.
> 
> Would using this parameter just change the timing?

No.  Degraded arrays are supposed to not assemble without it.  Maybe it only applies to kernel autoassembly, which I no longer use.

>> I think you have a couple options:
>>
>> 1) Don't run degraded arrays.  Use other backup tools.
> 
> It took me several years to find a reasonably fast way to offline-backup
> that partition with tens of millions of backuppc hardlinks :)

I've heard of hardlink horrors with backuppc.  I don't use it myself.  I prefer to use LVM on top of MD, then take compressed backups of LVM snapshots.

>> 2) Remove md7
>> from your mdadm.conf in your initramfs.  Don't let early userspace
>> assemble it.  The extra time should then allow your initscripts on
>> your real root fs to assemble it with both members.  This only works
>> if md7 does not contain your real root fs.
> 
> Fantastic, I will do so. I just have to find a way to keep a different
> mdadm.conf in /etc and in the initramfs while preserving the useful
> update-initramfs functionality :)

I haven't dug that deep.  I use dracut, myself.

>>> Plus how can a background reconstruction be started on md6, if
>>> it is degraded and the other mirroring part is not even present?
>>
>> Don't know.  Maybe one of your existing drives is occupying a
>> major/minor combination that your esata drive occupied on your last
>> backup.  I'm pretty sure the message is harmless.  I noticed that md5
>> has a bitmap, but md6 does not.  I wonder if adding a bitmap to md6
>> would change the timing enough to help you.
> 
> Wow, there is indeed a bitmap missing on md6. I swear it was there in the
> past :) It significantly cuts down the synchronization time for offline
> copies. I have two offline drive sets, each rotating every two weeks.
> One offline set plugs into md5, the other one into md6. This way I can
> have two bitmaps, one for each set. Apparently, not now :-)

Mirror w/ bitmap would make 1:1 backups faster.  I understand why you are doing this, but I'd be worried about filesystem integrity at the point in time you disconnect the backup drive.  Have you performed any tests to be sure you can recover usable data from the offline copy?  If I recall correctly, an LVM snapshot operation incorporates a filesystem metadata sync.

>> Relying on timing variations for successful boot doesn't sound great
>> to me.
> 
> You are right. Hopefully the significantly delayed assembly will work OK.
> 
> I very much appreciate your help, thanks a lot,
> 
> Pavel.

Phil

^ permalink raw reply

* Re: Software RAID and TRIM
From: Mathias Burén @ 2011-06-28 16:11 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: linux-raid
In-Reply-To: <alpine.OSX.2.00.1106281628320.257@trogdor.csi.cam.ac.uk>

On 28 June 2011 16:31, Tom De Mulder <tdm27@cam.ac.uk> wrote:
> Hi,
>
>
> I'm investigating SSD performance on Linux, in particular for RAID devices.
>
> As I understand it—and please correct me if I'm wrong—currently software
> RAID does not pass through TRIM to the underlying devices. TRIM is essential
> for the continued high performance of SSDs, which otherwise degrade over
> time.
>
> I don't think there would be any harm in this command being passed through
> to underlying devices if they don't support it (they would just ignore it),
> and if they do it would make high-performance software RAID of SSDs a
> possibility.
>
>
> Is this something that's in the works?
>
>
>
> Many thanks,
>
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 28/06/2011 : The Moon is Waning Crescent (22% of Full)


IIRC md can already pass TRIM down, but I think the filesystem needs
to know about the underlying architecture, or something, for TRIM to
work in RAID. There are numerous discussions on this in the archives of
this mailing list.

/M

^ permalink raw reply

* Re: Software RAID and TRIM
From: Johannes Truschnigg @ 2011-06-28 16:17 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: linux-raid
In-Reply-To: <alpine.OSX.2.00.1106281628320.257@trogdor.csi.cam.ac.uk>

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

Hi Tom,
On Tue, Jun 28, 2011 at 04:31:35PM +0100, Tom De Mulder wrote:
> Hi,
> [...]
> Is this something that's in the works?

Iirc, dm-raid supports passthru of DSM/TRIM commands for its provided RAID0
and RAID1 levels. Maybe that's already enough for your purposes?

I don't know if there's any development going on on the md side of things
in that regard. Others on this list will surely be able to answer that
question, however.

Have a nice day!
-- 
with best regards: 
- Johannes Truschnigg ( johannes@truschnigg.info )

www:  http://johannes.truschnigg.info/ 
phone: +43 650 2 133337 
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Can reading a raid drive trigger all the other drives in that set?
From: Marc MERLIN @ 2011-06-28 16:22 UTC (permalink / raw)
  To: linux-raid

I have ext4 over lvm2 on a sw raid5 with 2.6.39.1

In order to save power I have my drives spin down.

When I access my filesystem mount point, I get hangs of 30sec or a bit more
as each and every drive is woken up serially.

Is there any chance to put a patch in the block layer so that when it gets a
read on a block after a certain timeout, it just does one dummy read on all
the other drives in parallel so that all the drives have a chance to spin
back up at the same time and not serially?
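
For illustration only, the same idea can be approximated from userspace rather than in the block layer. The C sketch below starts one small read per device at the same time and waits for all of them. The function name wake_all is made up, the device paths are whatever the caller passes in, and this is not a kernel patch. One caveat: a read answered from the page cache never reaches the disk, so a real tool would need direct I/O or some other way to force a device access.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Issue one small read on every device in parallel (one child process
 * per device) and return how many reads succeeded.  Hypothetical helper.
 */
static int wake_all(const char *paths[], int n)
{
	int started = 0;
	int ok = 0;
	int i;

	for (i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid == 0) {
			/* Child: one dummy read, then report via exit status. */
			char buf[512];
			int fd = open(paths[i], O_RDONLY);
			ssize_t got = (fd >= 0) ? read(fd, buf, sizeof buf) : -1;

			_exit(got < 0 ? 1 : 0);
		}
		if (pid > 0)
			started++;
	}

	/* Parent: all reads are already in flight; collect the results. */
	for (i = 0; i < started; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

Because every child blocks on its own drive, the total wait is roughly that of the slowest drive rather than the sum over all drives.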

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* Re: Software RAID and TRIM
From: David Brown @ 2011-06-28 16:40 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <alpine.OSX.2.00.1106281628320.257@trogdor.csi.cam.ac.uk>

On 28/06/11 17:31, Tom De Mulder wrote:
> Hi,
>
>
> I'm investigating SSD performance on Linux, in particular for RAID devices.
>
> As I understand it—and please correct me if I'm wrong—currently software
> RAID does not pass through TRIM to the underlying devices. TRIM is
> essential for the continued high performance of SSDs, which otherwise
> degrade over time.
>
> I don't think there would be any harm in this command being passed
> through to underlying devices if they don't support it (they would just
> ignore it), and if they do it would make high-performance software RAID
> of SSDs a possibility.
>
>
> Is this something that's in the works?
>
>

I don't think you are wrong about software raid not passing TRIM down to 
the device (IIRC, it /can/ be passed down through LVM raid setups, but 
they are slower and less flexible than md raid).

However, AFAIUI, you are wrong about TRIM being essential for the 
continued high performance of SSDs.  As long as your SSDs have some 
over-provisioning (or you only partition something like 90% of the 
drive), and it's got good garbage collection, then TRIM will have 
minimal effect.

TRIM only makes a big difference in benchmarks which fill up most of a 
disk, then erase the files, then start writing them again, and even then 
it is mainly with older flash controllers.

I think other SSD-optimisations, such as those in BTRFS, are much more 
important.  These include bypassing or disabling code that is aimed at 
optimising disk access and minimising head movement - such code is of 
great benefit with hard disks, but helps little and adds latency on SSD 
systems.

(I haven't done any benchmarks to justify this opinion, nor do I have
direct links - it's based on my understanding of TRIM and how SSDs work,
and how SSD controllers have changed between early devices and current
ones.)


^ permalink raw reply

* [PATCH 0/3] count rdev->corrected_errors after final re-read
From: Namhyung Kim @ 2011-06-28 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Hello,

This patchset tries to fix rdev->corrected_errors counting. Read errors
are considered corrected if the write-back and re-read cycle finishes
without further problems. Thus moving the rdev->corrected_errors
counting to after the re-read looks more reasonable IMHO.
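
To make the intended ordering concrete, here is a small standalone sketch. It is not kernel code: struct fake_rdev and fix_read_error_sketch() are hypothetical names, the two booleans stand in for the results of the write-back and re-read I/O, and a plain int stands in for the atomic counter.

```c
#include <stdbool.h>

/* Stand-in for a member device (hypothetical, not the kernel's mdk_rdev_t). */
struct fake_rdev {
	bool write_ok;          /* would the write-back succeed? */
	bool reread_ok;         /* would the re-read after it succeed? */
	int corrected_errors;   /* sectors counted as corrected */
	bool faulty;            /* set where the real code would call md_error() */
};

/* Count `sectors` as corrected only once both write-back and re-read succeed. */
static void fix_read_error_sketch(struct fake_rdev *rdev, int sectors)
{
	if (!rdev->write_ok) {
		rdev->faulty = true;    /* write-back failed: nothing corrected */
		return;
	}
	if (!rdev->reread_ok) {
		rdev->faulty = true;    /* data still unreadable: nothing corrected */
		return;
	}
	rdev->corrected_errors += sectors;
}
```

With the counting placed before the re-read, a device whose re-read fails would still have the sectors added to corrected_errors even though nothing was corrected.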

For the resync case in RAID10, rdev->corrected_errors is increased right
after the first read failure. This seems to need improvement, though
it's not handled in this series.

The patches are against 'for-next' branch on git://neil.brown.name/md.

Thanks.

Namhyung Kim (3):
  md/raid1: move rdev->corrected_errors counting
  md/raid5: move rdev->corrected_errors counting
  md/raid10: move rdev->corrected_errors counting

 drivers/md/raid1.c  |   17 ++++++-----------
 drivers/md/raid10.c |    2 +-
 drivers/md/raid5.c  |    5 +----
 3 files changed, 8 insertions(+), 16 deletions(-)

-- 
1.7.6

^ permalink raw reply

* [PATCH 1/3] md/raid1: move rdev->corrected_errors counting
From: Namhyung Kim @ 2011-06-28 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid
In-Reply-To: <1309279646-4950-1-git-send-email-namhyung@gmail.com>

Read errors are considered corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting to after the re-read looks more
reasonable IMHO. Also included are a couple of whitespace fixes in the
sync_page_io() calls.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid1.c |   17 ++++++-----------
 1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6b7f5fdb35c0..4ed381488925 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1222,9 +1222,7 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
 				 * active, and resync is currently active
 				 */
 				rdev = conf->mirrors[d].rdev;
-				if (sync_page_io(rdev,
-						 sect,
-						 s<<9,
+				if (sync_page_io(rdev, sect, s<<9,
 						 bio->bi_io_vec[idx].bv_page,
 						 READ, false)) {
 					success = 1;
@@ -1259,16 +1257,13 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
 			if (r1_bio->bios[d]->bi_end_io != end_sync_read)
 				continue;
 			rdev = conf->mirrors[d].rdev;
-			if (sync_page_io(rdev,
-					 sect,
-					 s<<9,
+			if (sync_page_io(rdev, sect, s<<9,
 					 bio->bi_io_vec[idx].bv_page,
 					 WRITE, false) == 0) {
 				r1_bio->bios[d]->bi_end_io = NULL;
 				rdev_dec_pending(rdev, mddev);
 				md_error(mddev, rdev);
-			} else
-				atomic_add(s, &rdev->corrected_errors);
+			}
 		}
 		d = start;
 		while (d != r1_bio->read_disk) {
@@ -1278,12 +1273,12 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
 			if (r1_bio->bios[d]->bi_end_io != end_sync_read)
 				continue;
 			rdev = conf->mirrors[d].rdev;
-			if (sync_page_io(rdev,
-					 sect,
-					 s<<9,
+			if (sync_page_io(rdev, sect, s<<9,
 					 bio->bi_io_vec[idx].bv_page,
 					 READ, false) == 0)
 				md_error(mddev, rdev);
+			else
+				atomic_add(s, &rdev->corrected_errors);
 		}
 		sectors -= s;
 		sect += s;
-- 
1.7.6


^ permalink raw reply related

* [PATCH 2/3] md/raid5: move rdev->corrected_errors counting
From: Namhyung Kim @ 2011-06-28 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid
In-Reply-To: <1309279646-4950-1-git-send-email-namhyung@gmail.com>

Read errors are considered corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting to after the re-read looks more
reasonable IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid5.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c9eb0555f321..59acc9b4deb3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -547,10 +547,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 			bi->bi_io_vec[0].bv_offset = 0;
 			bi->bi_size = STRIPE_SIZE;
 			bi->bi_next = NULL;
-			if ((rw & WRITE) &&
-			    test_bit(R5_ReWrite, &sh->dev[i].flags))
-				atomic_add(STRIPE_SECTORS,
-					&rdev->corrected_errors);
 			generic_make_request(bi);
 		} else {
 			if (rw & WRITE)
@@ -1588,6 +1584,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 					   (unsigned long long)(sh->sector
 								+ rdev->data_offset),
 					   bdevname(rdev->bdev, b));
+			atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
 			clear_bit(R5_ReadError, &sh->dev[i].flags);
 			clear_bit(R5_ReWrite, &sh->dev[i].flags);
 		}
-- 
1.7.6


^ permalink raw reply related

* [PATCH 3/3] md/raid10: move rdev->corrected_errors counting
From: Namhyung Kim @ 2011-06-28 16:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid
In-Reply-To: <1309279646-4950-1-git-send-email-namhyung@gmail.com>

Read errors are considered corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting to after the re-read looks more
reasonable IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
For end_sync_read(), I'm not sure what I should do. But I think, at
least, the counting should be moved to end_sync_write() and that
might require additional field(s) in r10_bio. So I leave it as is
for simplicity now. What do you think?

 drivers/md/raid10.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 8628a62a02f0..0660bc9597d8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1524,7 +1524,6 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 			    test_bit(In_sync, &rdev->flags)) {
 				atomic_inc(&rdev->nr_pending);
 				rcu_read_unlock();
-				atomic_add(s, &rdev->corrected_errors);
 				if (sync_page_io(rdev,
 						 r10_bio->devs[sl].addr +
 						 sect,
@@ -1589,6 +1588,7 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
 					       (unsigned long long)(
 						       sect + rdev->data_offset),
 					       bdevname(rdev->bdev, b));
+					atomic_add(s, &rdev->corrected_errors);
 				}
 
 				rdev_dec_pending(rdev, mddev);
-- 
1.7.6


^ permalink raw reply related

* Re: Raid auto-assembly upon boot - device order
From: Pavel Hofman @ 2011-06-28 19:18 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <4E09F5A5.3020903@turmel.org>

On 28.6.2011 17:39, Phil Turmel wrote:
>> 
>> It took me several years to find a reasonably fast way to
>> offline-backup that partition with tens of millions of backuppc
>> hardlinks :)
> 
> I've heard of hardlink horrors with backuppc.  I don't use it myself.
> I prefer to use LVM on top of MD, then take compressed backups of LVM
> snapshots.

One of my goals was to be able to plug the offline backup drives into any
PC, boot from a Debian netinstall CD, chroot to the mirrored root, mount
the large data partitions, and be ready to start recovery in a
matter of minutes. Therefore I prefer to mirror whole filesystems.
> 
> Mirror w/ bitmap would make 1:1 backups faster.  I understand why you
> are doing this, but I'd be worried about filesystem integrity at the
> point in time you disconnect the backup drive.  Have you performed
> any tests to be sure you can recover usable data from the offline
> copy?  If I recall correctly, an LVM snapshot operation incorporates
> a filesystem metadata sync.

Yes, I am using a rather complicated automatic procedure. When the
resyncing finishes, the script waits until backuppc finishes the
currently running jobs (while starting new ones is disabled), shuts
backuppc down, kills all other processes accessing the partition,
umounts the filesystem, removes the offline drives from the md5/md6
arrays, mounts the backup raid again, restarts the backup processes, and
then checks the offline copy: it mounts the offline drives as a partition and
tries to read a random file from it. Only after that are the offline
drives unmounted, their array disassembled and the drives put to sleep,
until the operator removes them from their external bays and takes them
away from the company premises.

The offline drives have saved my butt several times :)

Regards,

Pavel.

^ permalink raw reply

* Can't start array and Negative "Used Dev Size"
From: Simon Matthews @ 2011-06-29  4:29 UTC (permalink / raw)
  To: LinuxRaid

Problem 1: "Used Dev Size"
====================
Note: the system is a Gentoo box, so perhaps I have missed a kernel
configuration option or use flag to deal with large hard drives.

A week or two ago, I resized a raid1 array using 2x3TB drives. I went
through the usual routine: failed one drive, installed and partitioned
(with gdisk) the new 3TB drive, added it to the array, waited for it
to sync, then did the same for the other drive. Finally, I grew the
array to max size and resized the filesystem to its maximum size.
However, after a reboot, I got many errors such as:
EXT3-fs error (device md5): ext3_get_inode_loc: unable to read inode
block - inode=150568961, block=301137922

I tracked this down to the array being the wrong size (too small), so
I unmounted the filesystem, grew the array (again) to its max size and
remounted. It seems to be working now; however, it is still syncing:
md5 : active raid1 sdd2[0] sdc2[1]
      2773437376 blocks [2/2] [UU]
      [=======>.............]  resync = 38.2% (1060384320/2773437376) finish=357.9min speed=79766K/sec

Investigating further, both sdc2 and sdd2 show a negative "Used Dev Size":
mdadm --examine /dev/sdc2
/dev/sdc2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 5e21499a:f5562ae2:3b3bf1a1:6e290ac2
  Creation Time : Tue May 15 16:33:14 2007
     Raid Level : raid1
  Used Dev Size : -1521529920 (2644.96 GiB 2840.00 GB)      <<<<<<< WTF???
     Array Size : 2773437376 (2644.96 GiB 2840.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 5

    Update Time : Tue Jun 28 21:01:14 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : dfcdddaf - correct
         Events : 2222657


      Number   Major   Minor   RaidDevice State
this     1       8       34        1      active sync   /dev/sdc2

   0     0       8       50        0      active sync   /dev/sdd2
   1     1       8       34        1      active sync   /dev/sdc2

--detail shows a negative dev size also:
mdadm --detail /dev/md5
/dev/md5:
        Version : 0.90
  Creation Time : Tue May 15 16:33:14 2007
     Raid Level : raid1
     Array Size : 2773437376 (2644.96 GiB 2840.00 GB)
  Used Dev Size : -1      <<<<<< WTF?
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 5
    Persistence : Superblock is persistent

    Update Time : Tue Jun 28 21:01:14 2011
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 38% complete

           UUID : 5e21499a:f5562ae2:3b3bf1a1:6e290ac2
         Events : 0.2222657

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync   /dev/sdd2
       1       8       34        1      active sync   /dev/sdc2

Since I obviously don't want the array to shrink again and this looks
dangerous, I would appreciate advice on how to handle this problem.

Problem 2: Can't start array
====================
Whatever I do, I can't start md4:
mdadm /dev/md4 --assemble
mdadm: /dev/md4 is already in use.

/proc/mdstat:
md4 : inactive sdc1[0](S)
      58591232 blocks super 1.2

 mdadm --detail /dev/md4
mdadm: md device /dev/md4 does not appear to be active.

# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b67311b:9732e436:07da8ce8:61e8af9c
           Name : server2:4  (local to host server2)
  Creation Time : Fri Jun 10 20:41:23 2011
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 117182464 (55.88 GiB 60.00 GB)
     Array Size : 117182320 (55.88 GiB 60.00 GB)
  Used Dev Size : 117182320 (55.88 GiB 60.00 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f8d1f97e:b15f2e09:a7d55392:b193991a

    Update Time : Tue Jun 28 19:20:08 2011
       Checksum : f6fb6a5 - correct
         Events : 53


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)

 # mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6b67311b:9732e436:07da8ce8:61e8af9c
           Name : server2:4  (local to host server2)
  Creation Time : Fri Jun 10 20:41:23 2011
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 117182464 (55.88 GiB 60.00 GB)
     Array Size : 117182320 (55.88 GiB 60.00 GB)
  Used Dev Size : 117182320 (55.88 GiB 60.00 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 44d1af39:96641daa:ee077d7b:d244ef54

    Update Time : Tue Jun 28 19:20:08 2011
       Checksum : 8e939e3f - correct
         Events : 53


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)


Thanks!
Simon

^ permalink raw reply

* Re: Can't start array and Negative "Used Dev Size"
From: NeilBrown @ 2011-06-29  5:18 UTC (permalink / raw)
  To: Simon Matthews; +Cc: LinuxRaid
In-Reply-To: <BANLkTincbuZrS3PuPBOgjidsR8jdyTGBRw@mail.gmail.com>

On Tue, 28 Jun 2011 21:29:37 -0700 Simon Matthews
<simon.d.matthews@gmail.com> wrote:

> Problem 1: "Used Dev Size"
> ====================
> Note: the system is a Gentoo box, so perhaps I have missed a kernel
> configuration option or use flag to deal with large hard drives.
> 
> A week or two ago, I resized a raid1 array using 2x3TB drives. I went

Oops.  That array is using 0.90 metadata which can only handle up to 2TB
devices.  The 'resize' code should catch that you are asking the impossible,
but it seems it doesn't.
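
The negative number in the --examine output is at least consistent with a size counted in KiB being pushed through a signed 32-bit integer, which tops out just under 2 TiB. That is an assumption about how the value is stored and printed, not something checked against the mdadm source, and the helper name as_signed_32 is made up for this sketch:

```c
#include <stdint.h>
#include <string.h>

/*
 * Show what a KiB count looks like once only its low 32 bits are kept
 * and those bits are read back as a signed value (hypothetical helper).
 */
static int32_t as_signed_32(uint64_t kib)
{
	uint32_t low = (uint32_t)kib;    /* keep the low 32 bits */
	int32_t out;

	memcpy(&out, &low, sizeof out);  /* reinterpret the same bits as signed */
	return out;
}
```

Feeding in the reported array size of 2773437376 KiB gives -1521529920, the same figure shown as "Used Dev Size" for /dev/sdc2. Anything of 2147483648 KiB (2 TiB) or more goes negative this way.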

You need to simply recreate the array as 1.0.
i.e.
 mdadm -S /dev/md5
 mdadm -C /dev/md5 --metadata 1.0 -l1 -n2 --assume-clean

Then all should be happiness.
> 
> Problem 2: Can't start array
> ====================
> Whatever I do, I can't start md4:
> mdadm /dev/md4 --assemble
> mdadm: /dev/md4 is already in use.
> 
> /proc/mdstat:
> md4 : inactive sdc1[0](S)
>       58591232 blocks super 1.2

What do you get if you:

  mdadm -S /dev/md4
  mdadm -A /dev/md4 /dev/sdc1 /dev/sdd1 --verbose
??

NeilBrown


^ permalink raw reply

* Re: Can't start array and Negative "Used Dev Size"
From: Simon Matthews @ 2011-06-29  5:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: LinuxRaid
In-Reply-To: <20110629151825.56cb4499@notabene.brown>

Neil,



On Tue, Jun 28, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
>  mdadm -S /dev/md5
>  mdadm -C /dev/md5 --metadata 1.0 -l1 -n2 --assume-clean

Will I lose data if I do this? Should I use metadata 1.2 ?

>
> Then all should be happiness.
>>
>
>  mdadm -S /dev/md4
>  mdadm -A /dev/md4 /dev/sdc1 /dev/sdd1 --verbose

That solved it. The array started.

Thanks!

Simon
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Can't start array and Negative "Used Dev Size"
From: NeilBrown @ 2011-06-29  5:37 UTC (permalink / raw)
  To: Simon Matthews; +Cc: LinuxRaid
In-Reply-To: <BANLkTimonHoSn5eoX9AB8wLJYR4SFoBd-w@mail.gmail.com>

On Tue, 28 Jun 2011 22:24:41 -0700 Simon Matthews
<simon.d.matthews@gmail.com> wrote:

> Neil,
> 
> 
> 
> On Tue, Jun 28, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
> >  mdadm -S /dev/md5
> >  mdadm -C /dev/md5 --metadata 1.0 -l1 -n2 --assume-clean
> 
> Will I lose data if I do this? Should I use metadata 1.2 ?

If you use 1.2 you will lose data.  If you use 1.0 you will not.

With 0.90 and 1.0 the data starts at the start of each device, so 1.0 will
see the same data as 0.90 would.

With 1.2 there is some metadata first and the data starts later, so if you
use that the data will appear in the wrong place.
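
To put a number on "the data starts later", here is a small sketch. It borrows the "Data Offset : 2048 sectors" figure from the --examine output of the 1.2-format md4 member quoted earlier in the thread; the offset a newly created 1.2 array would get may differ, and 512-byte sectors are an assumption. The function name offset_bytes is hypothetical.

```shell
# Assumption: mdadm reports offsets in 512-byte sectors.
sector_bytes=512

# Hypothetical helper: convert a data offset in sectors to bytes.
offset_bytes() {
    echo $(( $1 * sector_bytes ))
}

# 0.90 and 1.0: data begins at the very start of each member device.
echo "0.90 / 1.0 data starts at byte: $(offset_bytes 0)"

# 1.2: using the 2048-sector Data Offset seen in the md4 --examine output.
echo "1.2 data starts at byte:        $(offset_bytes 2048)"
```

With those figures, a filesystem that begins at byte 0 of the member would be looked for 1 MiB further in after a re-create as 1.2, which is why the existing data would appear to be in the wrong place.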

NeilBrown

> 
> >
> > Then all should be happiness.
> >>
> >
> >  mdadm -S /dev/md4
> >  mdadm -A /dev/md4 /dev/sdc1 /dev/sdd1 --verbose
> 
> That solved it. The array started.
> 
> Thanks!
> 
> Simon


^ permalink raw reply

* Re: Can't start array and Negative "Used Dev Size"
From: Simon Matthews @ 2011-06-29  5:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: LinuxRaid
In-Reply-To: <20110629153753.5279c034@notabene.brown>

Neil,



On Tue, Jun 28, 2011 at 10:37 PM, NeilBrown <neilb@suse.de> wrote:
>> On Tue, Jun 28, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
>> >  mdadm -S /dev/md5
>> >  mdadm -C /dev/md5 --metadata 1.0 -l1 -n2 --assume-clean
>>

Am I correct in thinking that this should be a quick operation?

Simon

^ permalink raw reply

* Re: Can't start array and Negative "Used Dev Size"
From: NeilBrown @ 2011-06-29  6:18 UTC (permalink / raw)
  To: Simon Matthews; +Cc: LinuxRaid
In-Reply-To: <BANLkTi=GwnMMA2jEZbtrRh3W4Ff-C18teg@mail.gmail.com>

On Tue, 28 Jun 2011 22:59:43 -0700 Simon Matthews
<simon.d.matthews@gmail.com> wrote:

> Neil,
> 
> 
> 
> On Tue, Jun 28, 2011 at 10:37 PM, NeilBrown <neilb@suse.de> wrote:
> >> On Tue, Jun 28, 2011 at 10:18 PM, NeilBrown <neilb@suse.de> wrote:
> >> >  mdadm -S /dev/md5
> >> >  mdadm -C /dev/md5 --metadata 1.0 -l1 -n2 --assume-clean
> >>
> 
> Am I correct in thinking that this should be a quick operation?
> 

Yes.  Virtually instantaneous.

NeilBrown

^ permalink raw reply
