Linux RAID subsystem development

* Can not start md0 after upgrade.
From: John McMonagle @ 2011-05-25 13:19 UTC
  To: linux-raid

I just upgraded a PowerEdge 1850 server from Debian lenny to squeeze and cannot
boot with the new 2.6.32 kernel.

From lspci, I have this controller:
SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT 
Dual Ultra320 SCSI (rev 08)

I am running mdadm RAID with root on md0.

I normally run Xen, but all the info here is from running without Xen.

I can still boot with the 2.6.26 kernel but not with the new 2.6.32 kernel.
Under 2.6.32 it fails to start md0.
In the BusyBox console I can see all the needed partitions.
What were sda and sdb are now sdb and sdc, but that should not matter, should it?
mdadm.conf is:
DEVICE partitions
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR xxxxx@advocap.org
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=6f744c89:d2578f95:c150b018:d9f789b1
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=7938d59c:28a69e5e:3facbdc2:12974557

The raid1, mptbase, mptscsih, and mptspi modules are loaded.
Everything looks right, but md0 does not start.

Any ideas?

Thanks

John


* Re: [PATCH 7 of 9] MD:  new sb type
From: NeilBrown @ 2011-05-25  4:16 UTC
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <201105240307.p4O374rN029659@f14.redhat.com>

On Mon, 23 May 2011 22:07:04 -0500 Jonathan Brassow <jbrassow@f14.redhat.com>
wrote:

> Patch name: md-new-sb-type.patch
> 
> A new MD superblock that is device-mapper specific.
> 
> The new superblock is not read or written from userspace and is not exported.
> It contains information to track resync, recovery, and reshaping progress.  It
> also maintains information on the health of the devices in the array.
> 
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
> 
> Index: linux-2.6/drivers/md/md.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.c
> +++ linux-2.6/drivers/md/md.c
> @@ -1731,6 +1731,305 @@ super_1_rdev_size_change(mdk_rdev_t *rde
>  	return num_sectors;
>  }
>  
> +/*
> + * This structure is never used by userspace.  It is only ever
> + * used in these particular super block accessing functions.
> + * Therefore, we don't put it in any .h file.
> + *
> + * It makes sense to define a new magic number here.  This way,
> + * no userspace application will confuse the device as a device
> + * that is accessible through MD operations.  Devices with this
> + * superblock should only ever be accessed via device-mapper.
> + */
> +#define MD_DM_SB_MAGIC 0x426E6F4A
> +struct mdp_superblock_2 {
> +	__le32 magic;
> +	__le32 flags; /* Used to indicate possible future changes */
> +
> +	__le64 events;
> +
> +	/*
> +	 * The following offset variables are used to indicate:
> +	 *  reshape_offset:  If the RAID level or layout of an array is
> +	 *		     being updated, this offset keeps track of the
> +	 *		     progress.
> +	 *  disk_recovery_offset:  If drives are being repaired/replaced on
> +	 *			   an individual basis, this offset tracks
> +	 *			   that progress.  This might happen when a
> +	 *			   drive fails and is replaced.
> +	 *  array_resync_offset:  When the array is constructed for the first
> +	 *			  time, all the devices must be made coherent.
> +	 *			  This offset tracks that progress.
> +	 */
> +	__le64 reshape_offset;
> +	__le64 disk_recovery_offset;
> +	__le64 array_resync_offset;
> +
> +	/*
> +	 * The following variable pairs reflect things
> +	 * that can changed during an array reshape.
> +	 */
> +	__le32 level;
> +	__le32 new_level;
> +
> +	__le32 layout;
> +	__le32 new_layout;
> +
> +	__le32 stripe_sectors;
> +	__le32 new_stripe_sectors;
> +
> +	__le32 num_devices;    /* Number of devs in RAID, Max = 64 */
> +	__le32 new_num_devices;

Presumably the dm table knows all this info as well and it is just here for
error checking - yes?


> +
> +	__le64 failed_devices; /* bitmap of devs, used to indicate a failure */
> +	__u8 pad[432];         /* Round out the struct to 512 bytes */
> +};
> +
> +static void super_2_sync(mddev_t *mddev, mdk_rdev_t *rdev)
> +{
> +	mdk_rdev_t *r, *t;
> +	uint64_t failed_devices;
> +	struct mdp_superblock_2 *sb;
> +
> +	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
> +	failed_devices = le32_to_cpu(sb->failed_devices);

failed_devices is 64 bit, so you want le64_to_cpu
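
A minimal userspace sketch of why the width matters. The helpers `load_le32`, `load_le64` and `store_le64` are hypothetical stand-ins written for this illustration, not the kernel's `le32_to_cpu`/`le64_to_cpu`; they only show that reading a 64-bit little-endian field through a 32-bit decode drops the bits for devices 32 to 63.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for le32_to_cpu(): decode four little-endian
 * bytes into a host-order value, independent of host byte order. */
static uint32_t load_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical stand-in for le64_to_cpu(): decode all eight bytes. */
static uint64_t load_le64(const unsigned char *p)
{
	return (uint64_t)load_le32(p) | ((uint64_t)load_le32(p + 4) << 32);
}

/* Hypothetical stand-in for cpu_to_le64(): store eight little-endian bytes. */
static void store_le64(unsigned char *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}
```

Storing a bitmap with only bit 40 set and reading it back with `load_le32` gives 0, while `load_le64` returns the original value.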

> +
> +	rdev_for_each(r, t, mddev)
> +		if ((r->raid_disk >= 0) && test_bit(Faulty, &r->flags))
> +			failed_devices |= (1 << r->raid_disk);

And this should be (1ULL << ....)  so that it doesn't overflow.
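
As an illustration of the shift width, here is a small userspace sketch. The function names `mark_failed` and `is_failed` are made up for this example and do not appear in the patch; the point is only that the constant being shifted must itself be 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Set the bit for raid_disk in a 64-bit failure bitmap.
 * A plain `1` has type int; with a 32-bit int, shifting it by 31 or
 * more is undefined behaviour, so the constant is written as 1ULL. */
static uint64_t mark_failed(uint64_t failed_devices, int raid_disk)
{
	assert(raid_disk >= 0 && raid_disk < 64);
	return failed_devices | (1ULL << raid_disk);
}

/* Return 1 if raid_disk is recorded as failed, 0 otherwise. */
static int is_failed(uint64_t failed_devices, int raid_disk)
{
	assert(raid_disk >= 0 && raid_disk < 64);
	return (int)((failed_devices >> raid_disk) & 1);
}
```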


> +
> +	memset(sb, 0, sizeof(*sb));
> +
> +	sb->magic  = cpu_to_le32(MD_DM_SB_MAGIC);
> +	sb->flags  = cpu_to_le32(0); /* No flags yet */
> +
> +	sb->events = cpu_to_le64(mddev->events);
> +
> +	sb->reshape_offset = cpu_to_le64(mddev->reshape_position);
> +	sb->disk_recovery_offset = cpu_to_le64(rdev->recovery_offset);
> +	sb->array_resync_offset = cpu_to_le64(mddev->recovery_cp);
> +
> +	sb->level = cpu_to_le32(mddev->level);
> +	sb->layout = cpu_to_le32(mddev->layout);
> +	sb->stripe_sectors = cpu_to_le32(mddev->chunk_sectors);
> +	sb->num_devices = cpu_to_le32(mddev->raid_disks);
> +
> +	if (mddev->reshape_position != MaxSector) {
> +		sb->new_level = cpu_to_le32(mddev->new_level);
> +		sb->new_layout = cpu_to_le32(mddev->new_layout);
> +		sb->new_stripe_sectors = cpu_to_le32(mddev->new_chunk_sectors);
> +		sb->new_num_devices = cpu_to_le32(mddev->delta_disks);
> +	} else {
> +		sb->new_level = 0;
> +		sb->new_layout = 0;
> +		sb->new_stripe_sectors = 0;
> +		sb->new_num_devices = 0;
> +	}

As these values are meaningless when reshape_position is MaxSector, and as
the structure has already been zeroed, setting them to zero again looks wrong.


> +
> +	sb->failed_devices = cpu_to_le32(failed_devices);

Again, cpu_to_le64


I haven't thought through the FirstUse and STATE_FORCED flags yet.  When I
have I might have more to say - or I might not.

Thanks,
NeilBrown





> +}
> +
> +/*
> + * super_2_load
> + *
> + * This function creates a superblock if one is not found on the device
> + * and will indicate the more appropriate device whose superblock should
> + * be used, if given two.
> + *
> + * Return: 1 if use rdev, 0 if use refdev, -Exxx otherwise
> + */
> +static int super_2_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
> +{
> +	int r;
> +	uint64_t ev1, ev2;
> +	struct mdp_superblock_2 *sb;
> +	struct mdp_superblock_2 *refsb;
> +
> +	if (sizeof(*sb) & (sizeof(*sb) - 1)) {
> +		printk(KERN_ERR "Programmer error: Bad sized superblock (%lu)\n",
> +		       sizeof(*sb));
> +		return -EIO;
> +	}
> +
> +	rdev->sb_start = 0;
> +	rdev->sb_size  = sizeof(*sb);
> +	r = read_disk_sb(rdev, rdev->sb_size);
> +	if (r)
> +		return r;
> +
> +	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
> +	if (sb->magic != cpu_to_le32(MD_DM_SB_MAGIC)) {
> +		super_2_sync(rdev->mddev, rdev);
> +
> +		set_bit(FirstUse, &rdev->flags);
> +
> +		/* Force new superblocks to disk */
> +		set_bit(MD_CHANGE_DEVS, &rdev->mddev->flags);
> +
> +		/* Any superblock is better than none, choose that if given */
> +		return refdev ? 0 : 1;
> +	}
> +
> +	if (!refdev)
> +		return 1;
> +
> +	ev1 = le64_to_cpu(sb->events);
> +	refsb = (struct mdp_superblock_2 *)page_address(refdev->sb_page);
> +	ev2 = le64_to_cpu(refsb->events);
> +
> +	return (ev1 > ev2) ? 1 : 0;
> +}
> +
> +static int super_2_init_validation(mddev_t *mddev, mdk_rdev_t *rdev)
> +{
> +	uint64_t ev1;
> +	uint32_t failed_devices;
> +	struct mdp_superblock_2 *sb;
> +	uint32_t new_devs = 0;
> +	uint32_t rebuilds = 0;
> +	mdk_rdev_t *r, *t;
> +	struct mdp_superblock_2 *sb2;
> +
> +	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
> +	ev1 = le64_to_cpu(sb->events);
> +	failed_devices = le32_to_cpu(sb->failed_devices);
> +
> +	mddev->events = ev1 ? ev1 : 1;
> +
> +	/* Reshaping is not currently allowed */
> +	if ((le32_to_cpu(sb->level) != mddev->level) ||
> +	    (le32_to_cpu(sb->layout) != mddev->layout) ||
> +	    (le32_to_cpu(sb->stripe_sectors) != mddev->chunk_sectors) ||
> +	    (le32_to_cpu(sb->num_devices) != mddev->raid_disks)) {
> +		printk(KERN_ERR
> +		       "md: %s: Reshaping arrays not yet supported.\n",
> +		       mdname(mddev));
> +		return -EINVAL;
> +	}
> +
> +	if (!test_and_clear_bit(MD_SYNC_STATE_FORCED, &mddev->flags))
> +		mddev->recovery_cp = le64_to_cpu(sb->array_resync_offset);
> +
> +	/*
> +	 * During load, we set FirstUse if a new superblock was written.
> +	 * There are two reasons we might not have a superblock:
> +	 * 1) The array is brand new - in which case, all of the
> +	 *    devices must have their In_sync bit set.  Also,
> +	 *    recovery_cp must be 0, unless forced.
> +	 * 2) This is a new device being added to an old array
> +	 *    and the new device needs to be rebuilt - in which
> +	 *    case the In_sync bit will /not/ be set and
> +	 *    recovery_cp must be MaxSector.
> +	 */
> +	rdev_for_each(r, t, mddev) {
> +		if (!test_bit(In_sync, &r->flags)) {
> +			if (!test_bit(FirstUse, &r->flags))
> +				printk(KERN_ERR "md: %s: Superblock area of "
> +				       "rebuild device %d should have been "
> +				       "cleared.\n", mdname(mddev),
> +				       r->raid_disk);
> +			set_bit(FirstUse, &r->flags);
> +			rebuilds++;
> +		} else if (test_bit(FirstUse, &r->flags))
> +			new_devs++;
> +	}
> +
> +	if (!rebuilds) {
> +		if (new_devs == mddev->raid_disks) {
> +			printk(KERN_INFO "md: %s: Superblocks created for new array\n", mdname(mddev));
> +		} else if (new_devs) {
> +			printk(KERN_ERR "md: %s: New device injected "
> +			       "into existing array without 'rebuild' "
> +			       "parameter specified\n", mdname(mddev));
> +			return -EINVAL;
> +		}
> +	} else if (new_devs) {
> +		printk(KERN_ERR "md: %s: 'rebuild' devices cannot be "
> +		       "injected into an array with other "
> +		       "first-time devices\n", mdname(mddev));
> +		return -EINVAL;
> +	} else if (mddev->recovery_cp != MaxSector) {
> +		printk(KERN_ERR "md: %s: 'rebuild' specified while "
> +		       "array is not in-sync\n",
> +		       mdname(mddev));
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * Now we set the Faulty bit for those devices that are
> +	 * recorded in the superblock as failed.
> +	 */
> +	rdev_for_each(r, t, mddev) {
> +		if (!r->sb_page)
> +			continue;
> +		sb2 = (struct mdp_superblock_2 *)
> +			page_address(r->sb_page);
> +		sb2->failed_devices = 0;
> +
> +		if ((r->raid_disk >= 0) &&
> +		    (failed_devices & (1 << r->raid_disk))) {
> +			if (test_bit(FirstUse, &r->flags)) {
> +				char b[BDEVNAME_SIZE];
> +				printk(KERN_INFO
> +				       "md: %s: Starting complete rebuild of "
> +				       "previously failed device, %s\n",
> +				       mdname(mddev), bdevname(rdev->bdev, b));
> +			} else {
> +				set_bit(Faulty, &r->flags);
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int super_2_validate(mddev_t *mddev, mdk_rdev_t *rdev)
> +{
> +	struct mdp_superblock_2 *sb;
> +
> +	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
> +
> +	/*
> +	 * mddev->events is set during the first call to super_2_validate,
> +	 * so we use that knowledge to kick off some global sanity checks
> +	 * on the first call.
> +	 */
> +	if (!mddev->events && super_2_init_validation(mddev, rdev))
> +		return -EINVAL;
> +
> +	rdev->mddev->bitmap_info.offset = 0; /* disable bitmap creation */
> +	rdev->mddev->bitmap_info.default_offset = 4096 >> 9;
> +	if (!test_bit(FirstUse, &rdev->flags)) {
> +		rdev->recovery_offset = le64_to_cpu(sb->disk_recovery_offset);
> +		if (rdev->recovery_offset != MaxSector)
> +			clear_bit(In_sync, &rdev->flags);
> +	}
> +
> +	if (test_bit(Faulty, &rdev->flags)) {
> +		clear_bit(Faulty, &rdev->flags);
> +		clear_bit(In_sync, &rdev->flags);
> +		rdev->recovery_offset = 0;
> +		printk(KERN_INFO "md: %s: Dev #%d previously marked as failed\n",
> +		       mdname(mddev), rdev->raid_disk);
> +	}
> +
> +	clear_bit(FirstUse, &rdev->flags);
> +	return 0;
> +}
> +
> +static unsigned long long
> +super_2_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
> +{
> +	/*
> +	 * Arrays built through device-mapper must use device-mapper
> +	 * tables to change the size.  A call to this function is
> +	 * invalid for this array.
> +	 */
> +	printk(KERN_ERR "md: %s: Invalid device size change request.\n",
> +	       mdname(rdev->mddev));
> +	return 0;
> +}
> +
>  static struct super_type super_types[] = {
>  	[0] = {
>  		.name	= "0.90.0",
> @@ -1748,6 +2047,14 @@ static struct super_type super_types[] =
>  		.sync_super	    = super_1_sync,
>  		.rdev_size_change   = super_1_rdev_size_change,
>  	},
> +	[2] = {
> +		.name	= "dm",
> +		.owner	= THIS_MODULE,
> +		.load_super	    = super_2_load,
> +		.validate_super	    = super_2_validate,
> +		.sync_super	    = super_2_sync,
> +		.rdev_size_change   = super_2_rdev_size_change,
> +	},
>  };
>  
>  static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
> Index: linux-2.6/drivers/md/md.h
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.h
> +++ linux-2.6/drivers/md/md.h
> @@ -77,6 +77,8 @@ struct mdk_rdev_s
>  #define Blocked		8		/* An error occurred on an externally
>  					 * managed array, don't allow writes
>  					 * until it is cleared */
> +#define FirstUse        9               /* Used by device-mapper interface when
> +					 * initializing first-time devices. */
>  	wait_queue_head_t blocked_wait;
>  
>  	int desc_nr;			/* descriptor index in the superblock */
> @@ -124,6 +126,7 @@ struct mddev_s
>  #define MD_CHANGE_DEVS	0	/* Some device status has changed */
>  #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
>  #define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
> +#define MD_SYNC_STATE_FORCED 3  /* recovery_cp is set and must be honored */
>  
>  	int				suspended;
>  	atomic_t			active_io;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: [PATCH 6 of 9] MD:  no sync IO while suspended
From: NeilBrown @ 2011-05-25  4:10 UTC
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <201105240306.p4O36rFK029586@f14.redhat.com>

On Mon, 23 May 2011 22:06:53 -0500 Jonathan Brassow <jbrassow@f14.redhat.com>
wrote:

> Patch name: md-no-sync-IO-while-suspended.patch
> 
> Disallow resync I/O while the RAID array is suspended.
> 
> Recovery, resync, and metadata I/O should not be allowed while a device is
> suspended.
> 
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
> 
> Index: linux-2.6/drivers/md/md.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.c
> +++ linux-2.6/drivers/md/md.c
> @@ -7064,7 +7064,6 @@ void md_do_sync(mddev_t *mddev)
>  }
>  EXPORT_SYMBOL_GPL(md_do_sync);
>  
> -
>  static int remove_and_add_spares(mddev_t *mddev)
>  {
>  	mdk_rdev_t *rdev;
> @@ -7176,6 +7175,9 @@ static void reap_sync_thread(mddev_t *md
>   */
>  void md_check_recovery(mddev_t *mddev)
>  {
> +	if (mddev->suspended)
> +		return;
> +
>  	if (mddev->bitmap)
>  		bitmap_daemon_work(mddev);
>  

Yep, applied.

Thanks,
NeilBrown


* Re: [PATCH 2 of 9] MD:  should_read_superblock
From: NeilBrown @ 2011-05-25  4:01 UTC
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <201105240306.p4O369g2029293@f14.redhat.com>

On Mon, 23 May 2011 22:06:09 -0500 Jonathan Brassow <jbrassow@f14.redhat.com>
wrote:

> Patch name: md-should_read_superblock.patch
> 
> Add new function to determine whether MD superblocks should be read.
> 
> It used to be sufficient to check if mddev->raid_disks was set to determine
> whether to read the superblock or not.  However, device-mapper (dm-raid.c)
> sets this value before calling md_run().  Thus, we need additional mechanisms
> for determining whether to read the superblock.  This patch adds the condition
> that if rdev->meta_bdev is set, the superblock should be read - something that
> only device-mapper does (and only when there are superblocks to be read/used).
> 
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

I've been feeling uncomfortable about this and have spent a while trying to
see if my discomfort is at all justified.  It seems that maybe it is.

The discomfort is really at analyze_sbs being used for dm arrays.  It is
really for arrays where md completely controls the metadata.  dm arrays are in
a strange intermediate situation where some metadata is controlled by
user-space (so md is told about some details of the array) and other metadata
is managed by the kernel - so md finds those bits out by itself.

It isn't yet entirely clear to me how to handle the half-way state best.

But the particular problem is that analyze_sbs can call kick_rdev_from_array.
This will call export_rdev which will call kobject_put(&rdev->kobj) which is
bad because dm-based rdevs do not get their kobj initialised.

So I think analyze_sbs should not be used for dm arrays.
Rather the code in dm-raid.c which parses the metadata_device info from the
constructor line should load_super.  Then before md_run is called it should
do the 'validate_super' step and record any failures.

So the only super_types method that md code would call on a dm-raid array
would be sync_super.
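
The split of responsibilities proposed above can be pictured with a toy userspace model. Every type and function name below (`toy_rdev`, `toy_super_ops`, `toy_ctr_check_devices`, `toy_update_sb`) is hypothetical and only mirrors the roles described; it is not md or dm-raid code, and it makes the simplifying assumption that a failed device is just flagged and skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device state; field names are invented for this sketch. */
struct toy_rdev {
	int has_valid_sb;	/* would the load step succeed? */
	int metadata_ok;	/* would the validate step succeed? */
	int faulty;		/* set by the constructor on failure */
	int sync_count;		/* how often sync_super was called */
};

struct toy_super_ops {
	int (*load_super)(struct toy_rdev *rdev);
	int (*validate_super)(struct toy_rdev *rdev);
	void (*sync_super)(struct toy_rdev *rdev);
};

static int toy_load(struct toy_rdev *r) { return r->has_valid_sb ? 0 : -1; }
static int toy_validate(struct toy_rdev *r) { return r->metadata_ok ? 0 : -1; }
static void toy_sync(struct toy_rdev *r) { r->sync_count++; }

static const struct toy_super_ops toy_ops = {
	.load_super = toy_load,
	.validate_super = toy_validate,
	.sync_super = toy_sync,
};

/* Constructor side (the dm-raid role): load and validate every device
 * before the array is started, recording failures rather than removing
 * the device.  Returns the number of devices that failed. */
static int toy_ctr_check_devices(const struct toy_super_ops *ops,
				 struct toy_rdev *rdevs, size_t n)
{
	int failures = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ops->load_super(&rdevs[i]) < 0 ||
		    ops->validate_super(&rdevs[i]) < 0) {
			rdevs[i].faulty = 1;
			failures++;
		}
	}
	return failures;
}

/* Core side (the md role): after start-up only sync_super is called,
 * and in this toy model faulty devices are skipped. */
static void toy_update_sb(const struct toy_super_ops *ops,
			  struct toy_rdev *rdevs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!rdevs[i].faulty)
			ops->sync_super(&rdevs[i]);
}
```

In this arrangement nothing on the core side ever needs to remove a device, which is the path that reaches the uninitialised kobj.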

Does that work for you?

Thanks,
NeilBrown

> 
> Index: linux-2.6/drivers/md/md.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.c
> +++ linux-2.6/drivers/md/md.c
> @@ -4421,6 +4421,20 @@ static void md_safemode_timeout(unsigned
>  	md_wakeup_thread(mddev->thread);
>  }
>  
> +static int should_read_super(mddev_t *mddev)
> +{
> +	mdk_rdev_t *rdev, *tmp;
> +
> +	if (!mddev->raid_disks)
> +		return 1;
> +
> +	rdev_for_each(rdev, tmp, mddev)
> +		if (rdev->meta_bdev)
> +			return 1;
> +
> +	return 0;
> +}
> +
>  static int start_dirty_degraded;
>  
>  int md_run(mddev_t *mddev)
> @@ -4442,7 +4456,7 @@ int md_run(mddev_t *mddev)
>  	/*
>  	 * Analyze all RAID superblock(s)
>  	 */
> -	if (!mddev->raid_disks) {
> +	if (should_read_super(mddev)) {
>  		if (!mddev->persistent)
>  			return -EINVAL;
>  		analyze_sbs(mddev);

* Re: [PATCH 1 of 9] MD:  possible typo
From: NeilBrown @ 2011-05-24 22:18 UTC
  To: Jonathan Brassow; +Cc: linux-raid
In-Reply-To: <201105240305.p4O35wmb029220@f14.redhat.com>

On Mon, 23 May 2011 22:05:58 -0500 Jonathan Brassow <jbrassow@f14.redhat.com>
wrote:

> Patch name: md-possible-typo.patch
> 
> Fix a value printed in kiB but labeled as 'blocks'.
> 
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
> 
> Index: linux-2.6/drivers/md/md.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.c
> +++ linux-2.6/drivers/md/md.c
> @@ -6867,7 +6867,7 @@ void md_do_sync(mddev_t *mddev)
>  	 */
>  	window = 32*(PAGE_SIZE/512);
>  	printk(KERN_INFO "md: using %dk window, over a total of %llu blocks.\n",
> -		window/2,(unsigned long long) max_sectors/2);
> +		window/2, (unsigned long long) max_sectors);
>  
>  	atomic_set(&mddev->recovery_active, 0);
>  	last_check = 0;


"blocks" has traditionally meant "kibibytes" in md nomenclature, so this was
"correct".

I'd be quite happy to change the word "blocks" to "KB" (or even "KiB") though.
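
For reference, the unit conversion behind that "/2" can be sketched as below. The helper name `sectors_to_kib` is invented for this example; the only fact relied on is that md counts in 512-byte sectors, so two sectors make one KiB.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a count of 512-byte sectors to KiB, the unit that md
 * messages have traditionally called "blocks". */
static uint64_t sectors_to_kib(uint64_t sectors)
{
	return sectors / 2;	/* two 512-byte sectors per KiB */
}
```

Assuming a 4096-byte page, the window of 32*(PAGE_SIZE/512) is 256 sectors, which this conversion reports as 128 KiB, matching the "%dk window" part of the message.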

NeilBrown


* mdadm problem
From: retail.mdadm @ 2011-05-24 22:05 UTC
  To: linux-raid

Recently I decided to upgrade my old Linux-based HDD enclosure:
processor arm9
kernel 2.6.12
I can't update or upgrade Linux or its kernel.

So I tried to upgrade the HDDs:

When I ran this command I realized that I need linear, not raid0:
mdadm -Cv /dev/md0 --level=0 -n2 /dev/sda2 /dev/sdb

but when I tried to zero the superblocks (I had stopped md0 beforehand) I
got a "Segmentation fault" error.
Later I connected my drives to a PC and zeroed the superblocks successfully,
but when I connected the drives back to the enclosure and tried to create
a linear array I got the "Segmentation fault" error again!

And now every time I try to do something via mdadm with
these drives I get that "Segmentation fault" error!

I haven't tried dd if=/dev/zero of=/dev/sda2 because my hard drive is 2 TB!

Here are my questions:
1. Can mdadm damage my hard drive(s)?
2. How to locate and zero superblocks completely? (I know that there
are two versions of superblocks.. but I don't know which one I have)

Thanks


* [PATCH] MD:  raid1 changes to allow use by device mapper
From: Jonathan Brassow @ 2011-05-24 21:01 UTC
  To: linux-raid

Patch name: md-raid1-changes-to-allow-use-by-device-mapper.patch

MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c)

Add the necessary congestion function and conditionalize calls requiring an
array 'queue' or 'gendisk'.

RFC-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/raid1.c
+++ linux-2.6/drivers/md/raid1.c
@@ -497,21 +497,19 @@ static int read_balance(conf_t *conf, r1
 	return best_disk;
 }
 
-static int raid1_congested(void *data, int bits)
+int md_raid1_congested(mddev_t *mddev, int bits)
 {
-	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
-	if (mddev_congested(mddev, bits))
-		return 1;
-
 	rcu_read_lock();
 	for (i = 0; i < mddev->raid_disks; i++) {
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
 
+			BUG_ON(!q);
+
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
@@ -524,7 +522,15 @@ static int raid1_congested(void *data, i
 	rcu_read_unlock();
 	return ret;
 }
+EXPORT_SYMBOL_GPL(md_raid1_congested);
 
+static int raid1_congested(void *data, int bits)
+{
+	mddev_t *mddev = data;
+
+	return mddev_congested(mddev, bits) ||
+		md_raid1_congested(mddev, bits);
+}
 
 static void flush_pending_writes(conf_t *conf)
 {
@@ -1972,6 +1978,8 @@ static int run(mddev_t *mddev)
 		return PTR_ERR(conf);
 
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
+		if (!mddev->gendisk)
+			continue;
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
 		/* as we don't honour merge_bvec_fn, we must never risk
@@ -2013,8 +2021,10 @@ static int run(mddev_t *mddev)
 
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
-	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
-	mddev->queue->backing_dev_info.congested_data = mddev;
+	if (mddev->queue) {
+		mddev->queue->backing_dev_info.congested_fn = raid1_congested;
+		mddev->queue->backing_dev_info.congested_data = mddev;
+	}
 	return md_integrity_register(mddev);
 }
 
Index: linux-2.6/drivers/md/raid1.h
===================================================================
--- linux-2.6.orig/drivers/md/raid1.h
+++ linux-2.6/drivers/md/raid1.h
@@ -126,4 +126,6 @@ struct r1bio_s {
  */
 #define	R1BIO_Returned 6
 
+extern int md_raid1_congested(mddev_t *mddev, int bits);
+
 #endif


* [PATCH] MD:  move thread wakeups into resume
From: Jonathan Brassow @ 2011-05-24 20:56 UTC
  To: linux-raid

Patch name: md-move-thread-wakeups-into-resume.patch

Move personality and sync/recovery thread starting outside md_run.

Moving the wakeups of the personality and sync/recovery threads out of
md_run and into do_md_run and mddev_resume solves two issues:
1) It allows bitmap_load to be called before the sync_thread is run, and
2) when MD personalities are used by device-mapper (dm-raid.c), the start-up
of the array is better aligned with device-mapper primitives
(CTR/resume/suspend/DTR).  I/O (in this case, recovery operations) should
not happen until after a resume has taken place.

RFC-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -351,6 +351,9 @@ void mddev_resume(mddev_t *mddev)
 	mddev->suspended = 0;
 	wake_up(&mddev->sb_wait);
 	mddev->pers->quiesce(mddev, 0);
+
+	md_wakeup_thread(mddev->thread);
+	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
 }
 EXPORT_SYMBOL_GPL(mddev_resume);
 
@@ -4948,9 +4951,6 @@ int md_run(mddev_t *mddev)
 	if (mddev->flags)
 		md_update_sb(mddev, 0);
 
-	md_wakeup_thread(mddev->thread);
-	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
-
 	md_new_event(mddev);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
@@ -4971,6 +4971,10 @@ static int do_md_run(mddev_t *mddev)
 		bitmap_destroy(mddev);
 		goto out;
 	}
+
+	md_wakeup_thread(mddev->thread);
+	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
+
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	revalidate_disk(mddev->gendisk);
 	mddev->changed = 1;


* [PATCH 9 of 9] MD:  raid5 do not set fullsync
From: Jonathan Brassow @ 2011-05-24  3:07 UTC
  To: linux-raid

Patch name: md-raid5-do-not-set-fullsync.patch

Add new flag for struct mdk_rdev_s to indicate when recovery can use bitmap

The version 2 superblock routines (device-mapper) can tell if a device is
in-sync, in need of partial (bitmap aided) recovery, or in need of complete
recovery.  The raid5 code assumes that if a device is not in-sync, then it must
undergo complete recovery - it does not honor the bitmap.  The flag
'RecoverByBitmap' has been introduced to force raid5 not to set
'conf->fullsync' if the superblock routines have already determined that only
a partial recovery is necessary.
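
The decision described above can be summarised in a small userspace sketch. The flag names and bit values here (`TOY_IN_SYNC`, `TOY_RECOVER_BY_BITMAP`) are hypothetical stand-ins, not the kernel's bit numbers, and `toy_needs_fullsync` is an invented helper that only restates the per-device rule.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the rdev flag bits; values are illustrative. */
enum toy_flag {
	TOY_IN_SYNC           = 1u << 0,
	TOY_RECOVER_BY_BITMAP = 1u << 1,
};

/* Decide whether one device forces a full resync of the array.
 * An in-sync device never does.  A device that is not in-sync does,
 * unless the superblock code has already marked it as recoverable
 * from the bitmap. */
static bool toy_needs_fullsync(unsigned int flags)
{
	if (flags & TOY_IN_SYNC)
		return false;
	return !(flags & TOY_RECOVER_BY_BITMAP);
}
```

A device with neither flag set forces a full sync; setting the bitmap flag on an out-of-sync device is what avoids it.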

RFC-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/raid5.c
===================================================================
--- linux-2.6.orig/drivers/md/raid5.c
+++ linux-2.6/drivers/md/raid5.c
@@ -4858,7 +4858,7 @@ static raid5_conf_t *setup_conf(mddev_t 
 			printk(KERN_INFO "md/raid:%s: device %s operational as raid"
 			       " disk %d\n",
 			       mdname(mddev), bdevname(rdev->bdev, b), raid_disk);
-		} else
+		} else if (!test_bit(RecoverByBitmap, &rdev->flags))
 			/* Cannot rely on bitmap to complete recovery */
 			conf->fullsync = 1;
 	}
Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -2009,8 +2009,10 @@ static int super_2_validate(mddev_t *mdd
 	if (test_bit(Faulty, &rdev->flags)) {
 		clear_bit(Faulty, &rdev->flags);
 		clear_bit(In_sync, &rdev->flags);
+		set_bit(RecoverByBitmap, &rdev->flags);
 		rdev->recovery_offset = 0;
-		printk(KERN_INFO "md: %s: Dev #%d previously marked as failed\n",
+		printk(KERN_INFO
+		       "md: %s: Dev #%d recovering from transient failure\n",
 		       mdname(mddev), rdev->raid_disk);
 	}
 
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -79,6 +79,8 @@ struct mdk_rdev_s
 					 * until it is cleared */
 #define FirstUse        9               /* Used by device-mapper interface when
 					 * initializing first-time devices. */
+#define RecoverByBitmap 10              /* Used by device-mapper to ensure
+					 * this device is recovered by bitmap. */
 	wait_queue_head_t blocked_wait;
 
 	int desc_nr;			/* descriptor index in the superblock */


* [PATCH 8 of 9] MD:  add bitmap support
From: Jonathan Brassow @ 2011-05-24  3:07 UTC
  To: linux-raid

Patch name: md-add-bitmap-support.patch

Add bitmap support to the device-mapper specific metadata area.

This patch allows the creation of the bitmap metadata area upon initial array
creation via device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -1937,6 +1937,7 @@ static int super_2_init_validation(mddev
 	if (!rebuilds) {
 		if (new_devs == mddev->raid_disks) {
 			printk(KERN_INFO "md: %s: Superblocks created for new array\n", mdname(mddev));
+			set_bit(MD_ARRAY_FIRST_USE, &mddev->flags);
 		} else if (new_devs) {
 			printk(KERN_ERR "md: %s: New device injected "
 			       "into existing array without 'rebuild' "
@@ -1997,7 +1998,7 @@ static int super_2_validate(mddev_t *mdd
 	if (!mddev->events && super_2_init_validation(mddev, rdev))
 		return -EINVAL;
 
-	rdev->mddev->bitmap_info.offset = 0; /* disable bitmap creation */
+	mddev->bitmap_info.offset = 4096 >> 9; /* enable bitmap creation */
 	rdev->mddev->bitmap_info.default_offset = 4096 >> 9;
 	if (!test_bit(FirstUse, &rdev->flags)) {
 		rdev->recovery_offset = le64_to_cpu(sb->disk_recovery_offset);
Index: linux-2.6/drivers/md/bitmap.c
===================================================================
--- linux-2.6.orig/drivers/md/bitmap.c
+++ linux-2.6/drivers/md/bitmap.c
@@ -534,6 +534,84 @@ void bitmap_print_sb(struct bitmap *bitm
 	kunmap_atomic(sb, KM_USER0);
 }
 
+/*
+ * bitmap_new_disk_sb
+ * @bitmap
+ *
+ * This function is somewhat the reverse of bitmap_read_sb.  bitmap_read_sb
+ * reads and verifies the on-disk bitmap superblock and populates bitmap_info.
+ * This function verifies 'bitmap_info' and populates the on-disk bitmap
+ * structure, which is to be written to disk.
+ *
+ * Returns: 0 on success, -Exxx on error
+ */
+static int bitmap_new_disk_sb(struct bitmap *bitmap)
+{
+	bitmap_super_t *sb;
+	unsigned long chunksize, daemon_sleep, write_behind;
+	int err = -EINVAL;
+
+	/* page 0 is the superblock, read it... */
+	bitmap->sb_page = read_sb_page(bitmap->mddev,
+				       bitmap->mddev->bitmap_info.offset,
+				       NULL, 0, sizeof(bitmap_super_t));
+
+	if (IS_ERR(bitmap->sb_page)) {
+		err = PTR_ERR(bitmap->sb_page);
+		bitmap->sb_page = NULL;
+		return err;
+	}
+
+	sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+
+	sb->magic = cpu_to_le32(BITMAP_MAGIC);
+	sb->version = cpu_to_le32(BITMAP_MAJOR_HI);
+
+	chunksize = bitmap->mddev->bitmap_info.chunksize;
+	BUG_ON(!chunksize);
+	if ((1 << ffz(~chunksize)) != chunksize) {
+		kunmap_atomic(sb, KM_USER0);
+		printk(KERN_ERR "bitmap chunksize not a power of 2\n");
+		return -EINVAL;
+	}
+	sb->chunksize = cpu_to_le32(chunksize);
+
+	daemon_sleep = bitmap->mddev->bitmap_info.daemon_sleep;
+	if (!daemon_sleep ||
+	    (daemon_sleep < 1) || (daemon_sleep > MAX_SCHEDULE_TIMEOUT)) {
+		printk(KERN_INFO "Choosing daemon_sleep default (5 sec)\n");
+		daemon_sleep = 5 * HZ;
+	}
+	sb->daemon_sleep = cpu_to_le32(daemon_sleep);
+	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	/*
+	 * FIXME: write_behind for RAID1.  If not specified, what
+	 * is a good choice?  We choose COUNTER_MAX / 2 arbitrarily.
+	 */
+	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
+	if (write_behind > COUNTER_MAX)
+		write_behind = COUNTER_MAX / 2;
+	sb->write_behind = cpu_to_le32(write_behind);
+	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
+
+	/* keep the array size field of the bitmap superblock up to date */
+	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
+
+	memcpy(sb->uuid, bitmap->mddev->uuid, 16);
+
+	bitmap->flags |= BITMAP_STALE;
+	sb->state |= cpu_to_le32(BITMAP_STALE);
+	bitmap->events_cleared = bitmap->mddev->events;
+	sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+
+	bitmap->flags |= BITMAP_HOSTENDIAN;
+	sb->version = cpu_to_le32(BITMAP_MAJOR_HOSTENDIAN);
+
+	kunmap_atomic(sb, KM_USER0);
+	return 0;
+}
+
 /* read the superblock from the bitmap file and initialize some bitmap fields */
 static int bitmap_read_sb(struct bitmap *bitmap)
 {
@@ -1076,8 +1154,8 @@ static int bitmap_init_from_disk(struct 
 	}
 
 	printk(KERN_INFO "%s: bitmap initialized from disk: "
-		"read %lu/%lu pages, set %lu bits\n",
-		bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt);
+	       "read %lu/%lu pages, set %lu of %lu bits\n",
+	       bmname(bitmap), bitmap->file_pages, num_pages, bit_cnt, chunks);
 
 	return 0;
 
@@ -1728,9 +1806,16 @@ int bitmap_create(mddev_t *mddev)
 		vfs_fsync(file, 1);
 	}
 	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
-	if (!mddev->bitmap_info.external)
-		err = bitmap_read_sb(bitmap);
-	else {
+	if (!mddev->bitmap_info.external) {
+		/*
+		 * If 'MD_ARRAY_FIRST_USE' is set, then device-mapper is
+		 * instructing us to create a new on-disk bitmap instance.
+		 */
+		if (test_and_clear_bit(MD_ARRAY_FIRST_USE, &mddev->flags))
+			err = bitmap_new_disk_sb(bitmap);
+		else
+			err = bitmap_read_sb(bitmap);
+	} else {
 		err = 0;
 		if (mddev->bitmap_info.chunksize == 0 ||
 		    mddev->bitmap_info.daemon_sleep == 0)
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -127,6 +127,7 @@ struct mddev_s
 #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
 #define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
 #define MD_SYNC_STATE_FORCED 3  /* recovery_cp is set and must be honored */
+#define MD_ARRAY_FIRST_USE   4  /* First use of array, needs initialization */
 
 	int				suspended;
 	atomic_t			active_io;
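
For readers following the chunksize check in bitmap_new_disk_sb: shifting 1 by ffz(~x) isolates the lowest set bit of x, so comparing the result with x tests for a power of two. A minimal userspace sketch of the same test follows. It is not part of the patch, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Lowest set bit of x: the value a 1UL << ffz(~x) style shift yields for nonzero x. */
static unsigned long lowest_set_bit(unsigned long x)
{
	assert(x != 0);
	return x & (~x + 1);
}

/* A power of two has exactly one bit set, so it equals its own lowest set bit. */
static bool chunksize_is_power_of_two(unsigned long chunksize)
{
	return chunksize != 0 && lowest_set_bit(chunksize) == chunksize;
}
```

Doing the arithmetic in unsigned long throughout keeps the comparison valid for chunk sizes at or above 2^31.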

^ permalink raw reply

* [PATCH 7 of 9] MD:  new sb type
From: Jonathan Brassow @ 2011-05-24  3:07 UTC (permalink / raw)
  To: linux-raid

Patch name: md-new-sb-type.patch

A new MD superblock that is device-mapper specific.

The new superblock is not read or written from userspace and is not exported.
It contains information to track resync, recovery, and reshaping progress.  It
also maintains information on the health of the devices in the array.
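
A rough userspace sketch of the superblock this patch defines, for readers who want to check the size arithmetic. It is not part of the patch: fixed-width integers stand in for the kernel's __le32/__le64 types, and the names dm_sb_sketch, size_is_power_of_two and mark_failed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order follows the patch; 80 bytes of fields plus 432 of padding. */
struct dm_sb_sketch {
	uint32_t magic;
	uint32_t flags;
	uint64_t events;
	uint64_t reshape_offset;
	uint64_t disk_recovery_offset;
	uint64_t array_resync_offset;
	uint32_t level, new_level;
	uint32_t layout, new_layout;
	uint32_t stripe_sectors, new_stripe_sectors;
	uint32_t num_devices, new_num_devices;
	uint64_t failed_devices;	/* one bit per device, up to 64 */
	uint8_t  pad[432];
};

/* Same idea as the size check in super_2_load: one bit set means power of two. */
static int size_is_power_of_two(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mark device `disk` as failed; 1ULL keeps the shift 64 bits wide. */
static uint64_t mark_failed(uint64_t mask, unsigned int disk)
{
	assert(disk < 64);
	return mask | (1ULL << disk);
}
```

With a 64-device limit, the failure mask needs 64-bit shifts and 64-bit endian conversions end to end; a plain int shift would only cover the first 31 devices.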

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -1731,6 +1731,305 @@ super_1_rdev_size_change(mdk_rdev_t *rde
 	return num_sectors;
 }
 
+/*
+ * This structure is never used by userspace.  It is only ever
+ * used in these particular super block accessing functions.
+ * Therefore, we don't put it in any .h file.
+ *
+ * It makes sense to define a new magic number here.  This way,
+ * no userspace application will confuse the device as a device
+ * that is accessible through MD operations.  Devices with this
+ * superblock should only ever be accessed via device-mapper.
+ */
+#define MD_DM_SB_MAGIC 0x426E6F4A
+struct mdp_superblock_2 {
+	__le32 magic;
+	__le32 flags; /* Used to indicate possible future changes */
+
+	__le64 events;
+
+	/*
+	 * The following offset variables are used to indicate:
+	 *  reshape_offset:  If the RAID level or layout of an array is
+	 *		     being updated, this offset keeps track of the
+	 *		     progress.
+	 *  disk_recovery_offset:  If drives are being repaired/replaced on
+	 *			   an individual basis, this offset tracks
+	 *			   that progress.  This might happen when a
+	 *			   drive fails and is replaced.
+	 *  array_resync_offset:  When the array is constructed for the first
+	 *			  time, all the devices must be made coherent.
+	 *			  This offset tracks that progress.
+	 */
+	__le64 reshape_offset;
+	__le64 disk_recovery_offset;
+	__le64 array_resync_offset;
+
+	/*
+	 * The following variable pairs reflect things
+	 * that can change during an array reshape.
+	 */
+	__le32 level;
+	__le32 new_level;
+
+	__le32 layout;
+	__le32 new_layout;
+
+	__le32 stripe_sectors;
+	__le32 new_stripe_sectors;
+
+	__le32 num_devices;    /* Number of devs in RAID, Max = 64 */
+	__le32 new_num_devices;
+
+	__le64 failed_devices; /* bitmap of devs, used to indicate a failure */
+	__u8 pad[432];         /* Round out the struct to 512 bytes */
+};
+
+static void super_2_sync(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	mdk_rdev_t *r, *t;
+	uint64_t failed_devices;
+	struct mdp_superblock_2 *sb;
+
+	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
+	failed_devices = le64_to_cpu(sb->failed_devices);
+
+	rdev_for_each(r, t, mddev)
+		if ((r->raid_disk >= 0) && test_bit(Faulty, &r->flags))
+			failed_devices |= (1ULL << r->raid_disk);
+
+	memset(sb, 0, sizeof(*sb));
+
+	sb->magic  = cpu_to_le32(MD_DM_SB_MAGIC);
+	sb->flags  = cpu_to_le32(0); /* No flags yet */
+
+	sb->events = cpu_to_le64(mddev->events);
+
+	sb->reshape_offset = cpu_to_le64(mddev->reshape_position);
+	sb->disk_recovery_offset = cpu_to_le64(rdev->recovery_offset);
+	sb->array_resync_offset = cpu_to_le64(mddev->recovery_cp);
+
+	sb->level = cpu_to_le32(mddev->level);
+	sb->layout = cpu_to_le32(mddev->layout);
+	sb->stripe_sectors = cpu_to_le32(mddev->chunk_sectors);
+	sb->num_devices = cpu_to_le32(mddev->raid_disks);
+
+	if (mddev->reshape_position != MaxSector) {
+		sb->new_level = cpu_to_le32(mddev->new_level);
+		sb->new_layout = cpu_to_le32(mddev->new_layout);
+		sb->new_stripe_sectors = cpu_to_le32(mddev->new_chunk_sectors);
+		sb->new_num_devices = cpu_to_le32(mddev->delta_disks);
+	} else {
+		sb->new_level = 0;
+		sb->new_layout = 0;
+		sb->new_stripe_sectors = 0;
+		sb->new_num_devices = 0;
+	}
+
+	sb->failed_devices = cpu_to_le64(failed_devices);
+}
+
+/*
+ * super_2_load
+ *
+ * This function creates a superblock if one is not found on the device
+ * and will indicate the more appropriate device whose superblock should
+ * be used, if given two.
+ *
+ * Return: 1 if use rdev, 0 if use refdev, -Exxx otherwise
+ */
+static int super_2_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
+{
+	int r;
+	uint64_t ev1, ev2;
+	struct mdp_superblock_2 *sb;
+	struct mdp_superblock_2 *refsb;
+
+	if (sizeof(*sb) & (sizeof(*sb) - 1)) {
+		printk(KERN_ERR "Programmer error: Bad sized superblock (%zu)\n",
+		       sizeof(*sb));
+		return -EIO;
+	}
+
+	rdev->sb_start = 0;
+	rdev->sb_size  = sizeof(*sb);
+	r = read_disk_sb(rdev, rdev->sb_size);
+	if (r)
+		return r;
+
+	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
+	if (sb->magic != cpu_to_le32(MD_DM_SB_MAGIC)) {
+		super_2_sync(rdev->mddev, rdev);
+
+		set_bit(FirstUse, &rdev->flags);
+
+		/* Force new superblocks to disk */
+		set_bit(MD_CHANGE_DEVS, &rdev->mddev->flags);
+
+		/* Any superblock is better than none, choose that if given */
+		return refdev ? 0 : 1;
+	}
+
+	if (!refdev)
+		return 1;
+
+	ev1 = le64_to_cpu(sb->events);
+	refsb = (struct mdp_superblock_2 *)page_address(refdev->sb_page);
+	ev2 = le64_to_cpu(refsb->events);
+
+	return (ev1 > ev2) ? 1 : 0;
+}
+
+static int super_2_init_validation(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	uint64_t ev1;
+	uint64_t failed_devices;
+	struct mdp_superblock_2 *sb;
+	uint32_t new_devs = 0;
+	uint32_t rebuilds = 0;
+	mdk_rdev_t *r, *t;
+	struct mdp_superblock_2 *sb2;
+
+	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
+	ev1 = le64_to_cpu(sb->events);
+	failed_devices = le64_to_cpu(sb->failed_devices);
+
+	mddev->events = ev1 ? ev1 : 1;
+
+	/* Reshaping is not currently allowed */
+	if ((le32_to_cpu(sb->level) != mddev->level) ||
+	    (le32_to_cpu(sb->layout) != mddev->layout) ||
+	    (le32_to_cpu(sb->stripe_sectors) != mddev->chunk_sectors) ||
+	    (le32_to_cpu(sb->num_devices) != mddev->raid_disks)) {
+		printk(KERN_ERR
+		       "md: %s: Reshaping arrays not yet supported.\n",
+		       mdname(mddev));
+		return -EINVAL;
+	}
+
+	if (!test_and_clear_bit(MD_SYNC_STATE_FORCED, &mddev->flags))
+		mddev->recovery_cp = le64_to_cpu(sb->array_resync_offset);
+
+	/*
+	 * During load, we set FirstUse if a new superblock was written.
+	 * There are two reasons we might not have a superblock:
+	 * 1) The array is brand new - in which case, all of the
+	 *    devices must have their In_sync bit set.  Also,
+	 *    recovery_cp must be 0, unless forced.
+	 * 2) This is a new device being added to an old array
+	 *    and the new device needs to be rebuilt - in which
+	 *    case the In_sync bit will /not/ be set and
+	 *    recovery_cp must be MaxSector.
+	 */
+	rdev_for_each(r, t, mddev) {
+		if (!test_bit(In_sync, &r->flags)) {
+			if (!test_bit(FirstUse, &r->flags))
+				printk(KERN_ERR "md: %s: Superblock area of "
+				       "rebuild device %d should have been "
+				       "cleared.\n", mdname(mddev),
+				       r->raid_disk);
+			set_bit(FirstUse, &r->flags);
+			rebuilds++;
+		} else if (test_bit(FirstUse, &r->flags))
+			new_devs++;
+	}
+
+	if (!rebuilds) {
+		if (new_devs == mddev->raid_disks) {
+			printk(KERN_INFO "md: %s: Superblocks created for new array\n", mdname(mddev));
+		} else if (new_devs) {
+			printk(KERN_ERR "md: %s: New device injected "
+			       "into existing array without 'rebuild' "
+			       "parameter specified\n", mdname(mddev));
+			return -EINVAL;
+		}
+	} else if (new_devs) {
+		printk(KERN_ERR "md: %s: 'rebuild' devices cannot be "
+		       "injected into an array with other "
+		       "first-time devices\n", mdname(mddev));
+		return -EINVAL;
+	} else if (mddev->recovery_cp != MaxSector) {
+		printk(KERN_ERR "md: %s: 'rebuild' specified while "
+		       "array is not in-sync\n",
+		       mdname(mddev));
+		return -EINVAL;
+	}
+
+	/*
+	 * Now we set the Faulty bit for those devices that are
+	 * recorded in the superblock as failed.
+	 */
+	rdev_for_each(r, t, mddev) {
+		if (!r->sb_page)
+			continue;
+		sb2 = (struct mdp_superblock_2 *)
+			page_address(r->sb_page);
+		sb2->failed_devices = 0;
+
+		if ((r->raid_disk >= 0) &&
+		    (failed_devices & (1ULL << r->raid_disk))) {
+			if (test_bit(FirstUse, &r->flags)) {
+				char b[BDEVNAME_SIZE];
+				printk(KERN_INFO
+				       "md: %s: Starting complete rebuild of "
+				       "previously failed device, %s\n",
+				       mdname(mddev), bdevname(r->bdev, b));
+			} else {
+				set_bit(Faulty, &r->flags);
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int super_2_validate(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+	struct mdp_superblock_2 *sb;
+
+	sb = (struct mdp_superblock_2 *)page_address(rdev->sb_page);
+
+	/*
+	 * mddev->events is set during the first call to super_2_validate,
+	 * so we use that knowledge to kick off some global sanity checks
+	 * on the first call.
+	 */
+	if (!mddev->events && super_2_init_validation(mddev, rdev))
+		return -EINVAL;
+
+	rdev->mddev->bitmap_info.offset = 0; /* disable bitmap creation */
+	rdev->mddev->bitmap_info.default_offset = 4096 >> 9;
+	if (!test_bit(FirstUse, &rdev->flags)) {
+		rdev->recovery_offset = le64_to_cpu(sb->disk_recovery_offset);
+		if (rdev->recovery_offset != MaxSector)
+			clear_bit(In_sync, &rdev->flags);
+	}
+
+	if (test_bit(Faulty, &rdev->flags)) {
+		clear_bit(Faulty, &rdev->flags);
+		clear_bit(In_sync, &rdev->flags);
+		rdev->recovery_offset = 0;
+		printk(KERN_INFO "md: %s: Dev #%d previously marked as failed\n",
+		       mdname(mddev), rdev->raid_disk);
+	}
+
+	clear_bit(FirstUse, &rdev->flags);
+	return 0;
+}
+
+static unsigned long long
+super_2_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
+{
+	/*
+	 * Arrays built through device-mapper must use device-mapper
+	 * tables to change the size.  A call to this function is
+	 * invalid for this array.
+	 */
+	printk(KERN_ERR "md: %s: Invalid device size change request.\n",
+	       mdname(rdev->mddev));
+	return 0;
+}
+
 static struct super_type super_types[] = {
 	[0] = {
 		.name	= "0.90.0",
@@ -1748,6 +2047,14 @@ static struct super_type super_types[] =
 		.sync_super	    = super_1_sync,
 		.rdev_size_change   = super_1_rdev_size_change,
 	},
+	[2] = {
+		.name	= "dm",
+		.owner	= THIS_MODULE,
+		.load_super	    = super_2_load,
+		.validate_super	    = super_2_validate,
+		.sync_super	    = super_2_sync,
+		.rdev_size_change   = super_2_rdev_size_change,
+	},
 };
 
 static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
Index: linux-2.6/drivers/md/md.h
===================================================================
--- linux-2.6.orig/drivers/md/md.h
+++ linux-2.6/drivers/md/md.h
@@ -77,6 +77,8 @@ struct mdk_rdev_s
 #define Blocked		8		/* An error occurred on an externally
 					 * managed array, don't allow writes
 					 * until it is cleared */
+#define FirstUse        9               /* Used by device-mapper interface when
+					 * initializing first-time devices. */
 	wait_queue_head_t blocked_wait;
 
 	int desc_nr;			/* descriptor index in the superblock */
@@ -124,6 +126,7 @@ struct mddev_s
 #define MD_CHANGE_DEVS	0	/* Some device status has changed */
 #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
 #define MD_CHANGE_PENDING 2	/* switch from 'clean' to 'active' in progress */
+#define MD_SYNC_STATE_FORCED 3  /* recovery_cp is set and must be honored */
 
 	int				suspended;
 	atomic_t			active_io;
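
The selection rule in super_2_load can be restated outside the kernel as a small decision function. This is only a sketch under stated assumptions: struct sb_view and pick_superblock are hypothetical names, and the real function also reads the disk and creates a superblock when none is found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what was read from one device. */
struct sb_view {
	bool     has_magic;	/* a valid superblock was found on disk */
	uint64_t events;	/* event counter from that superblock */
};

/* Returns 1 to use `cand`, 0 to keep `ref` (ref may be NULL). */
static int pick_superblock(const struct sb_view *cand,
			   const struct sb_view *ref)
{
	assert(cand != NULL);
	if (!cand->has_magic)
		return ref ? 0 : 1;	/* an existing superblock beats a fresh one */
	if (!ref)
		return 1;
	return cand->events > ref->events ? 1 : 0;
}
```

Note that a tie on the event counter keeps the reference device, which matches the strict greater-than comparison in the patch.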

^ permalink raw reply

* [PATCH 6 of 9] MD:  no sync IO while suspended
From: Jonathan Brassow @ 2011-05-24  3:06 UTC (permalink / raw)
  To: linux-raid

Patch name: md-no-sync-IO-while-suspended.patch

Disallow resync I/O while the RAID array is suspended.

Recovery, resync, and metadata I/O should not be allowed while a device is
suspended.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -7064,7 +7064,6 @@ void md_do_sync(mddev_t *mddev)
 }
 EXPORT_SYMBOL_GPL(md_do_sync);
 
-
 static int remove_and_add_spares(mddev_t *mddev)
 {
 	mdk_rdev_t *rdev;
@@ -7176,6 +7175,9 @@ static void reap_sync_thread(mddev_t *md
  */
 void md_check_recovery(mddev_t *mddev)
 {
+	if (mddev->suspended)
+		return;
+
 	if (mddev->bitmap)
 		bitmap_daemon_work(mddev);
 

^ permalink raw reply

* [PATCH 5 of 9] MD:  no integrity register if no gendisk
From: Jonathan Brassow @ 2011-05-24  3:06 UTC (permalink / raw)
  To: linux-raid

Patch name: md-no-integrity-register-if-no-gendisk.patch

Don't attempt md_integrity_register if there is no gendisk struct available.

When MD arrays are built via device-mapper, the gendisk structure is not
available via mddev.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
 
Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -1781,8 +1781,8 @@ int md_integrity_register(mddev_t *mddev
 
 	if (list_empty(&mddev->disks))
 		return 0; /* nothing to do */
-	if (blk_get_integrity(mddev->gendisk))
-		return 0; /* already registered */
+	if (!mddev->gendisk || blk_get_integrity(mddev->gendisk))
+		return 0; /* shouldn't register, or already is */
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
 		/* skip spares and non-functional disks */
 		if (test_bit(Faulty, &rdev->flags))

^ permalink raw reply

* [PATCH 4 of 9] MD:  analyze_sbs failure if bad superblocks
From: Jonathan Brassow @ 2011-05-24  3:06 UTC (permalink / raw)
  To: linux-raid

Patch name: md-analyze_sbs-failure-if-bad-superblocks.patch

MD's superblock reader function should fail if all superblocks are bad.

analyze_sbs should return -EINVAL if the 'freshest'/best superblock cannot
be validated.  This is especially important to dm-raid.c, as it uses md_run
at creation time to validate array transitions.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -2890,9 +2890,9 @@ static int analyze_sbs(mddev_t *mddev)
 			kick_rdev_from_array(rdev);
 		}
 
-
-	super_types[mddev->major_version].
-		validate_super(mddev, freshest);
+	/* If the freshest superblock available cannot be validated, fail. */
+	if (super_types[mddev->major_version].validate_super(mddev, freshest))
+		return -EINVAL;
 
 	i = 0;
 	rdev_for_each(rdev, tmp, mddev) {

^ permalink raw reply

* [PATCH 3 of 9] MD:  allow analyze_sbs to fail
From: Jonathan Brassow @ 2011-05-24  3:06 UTC (permalink / raw)
  To: linux-raid

Patch name: md-allow-analyze_sbs-to-fail.patch

Catch the case where md_run is called requiring non-existent super_types fns.

Other modules (like device-mapper's dm-raid.c) may call md_run, so we mustn't
assume they will always set a valid MD major_version.  This is especially true
in the case of device-mapper, which will use a new superblock type.  If patches
don't land in the kernel in the proper order, access will be attempted beyond
the end of the super_types array.
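
The guard this patch adds is the usual bounds check in front of a table of function pointers. A minimal userspace sketch of the pattern, with hypothetical handler names standing in for the real super_types entries:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for struct super_type: one handler per version. */
struct handler_sketch {
	const char *name;
	int (*load)(void);
};

static int load_v0(void) { return 0; }
static int load_v1(void) { return 1; }

static const struct handler_sketch handlers[] = {
	[0] = { .name = "v0", .load = load_v0 },
	[1] = { .name = "v1", .load = load_v1 },
};

/* Reject versions outside the table instead of indexing past its end. */
static int dispatch_load(int major_version)
{
	if (major_version < 0 ||
	    (size_t)major_version >= ARRAY_SIZE(handlers))
		return -EINVAL;
	assert(handlers[major_version].load != NULL);
	return handlers[major_version].load();
}
```

A caller asking for version 2 before a third entry exists gets -EINVAL rather than a read beyond the array.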

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -2864,12 +2864,15 @@ abort_free:
  */
 
 
-static void analyze_sbs(mddev_t * mddev)
+static int analyze_sbs(mddev_t *mddev)
 {
 	int i;
 	mdk_rdev_t *rdev, *freshest, *tmp;
 	char b[BDEVNAME_SIZE];
 
+	if (mddev->major_version >= ARRAY_SIZE(super_types))
+		return -EINVAL;
+
 	freshest = NULL;
 	rdev_for_each(rdev, tmp, mddev)
 		switch (super_types[mddev->major_version].
@@ -2921,6 +2924,7 @@ static void analyze_sbs(mddev_t * mddev)
 			clear_bit(In_sync, &rdev->flags);
 		}
 	}
+	return 0;
 }
 
 /* Read a fixed-point number.
@@ -4459,7 +4463,8 @@ int md_run(mddev_t *mddev)
 	if (should_read_super(mddev)) {
 		if (!mddev->persistent)
 			return -EINVAL;
-		analyze_sbs(mddev);
+		if (analyze_sbs(mddev))
+			return -EINVAL;
 	}
 
 	if (mddev->level != LEVEL_NONE)

^ permalink raw reply

* [PATCH 2 of 9] MD:  should_read_superblock
From: Jonathan Brassow @ 2011-05-24  3:06 UTC (permalink / raw)
  To: linux-raid

Patch name: md-should_read_superblock.patch

Add new function to determine whether MD superblocks should be read.

It used to be sufficient to check if mddev->raid_disks was set to determine
whether to read the superblock or not.  However, device-mapper (dm-raid.c)
sets this value before calling md_run().  Thus, we need additional mechanisms
for determining whether to read the superblock.  This patch adds the condition
that if rdev->meta_bdev is set, the superblock should be read - something that
only device-mapper does (and only when there are superblocks to be read/used).
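
The decision described above, restated as a self-contained userspace sketch. The struct and function names are hypothetical simplifications; the kernel version walks the mddev's rdev list instead of an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mdk_rdev_t and mddev_t. */
struct rdev_sketch {
	const void *meta_bdev;	/* non-NULL only when a metadata device is set */
};

struct mddev_sketch {
	int raid_disks;		/* 0 means "not configured yet" */
	const struct rdev_sketch *rdevs;
	size_t nr_rdevs;
};

/* Read superblocks if the array is unconfigured or any device has metadata. */
static bool should_read_super_sketch(const struct mddev_sketch *m)
{
	size_t i;

	assert(m != NULL);
	if (!m->raid_disks)
		return true;
	for (i = 0; i < m->nr_rdevs; i++)
		if (m->rdevs[i].meta_bdev)
			return true;
	return false;
}
```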

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -4421,6 +4421,20 @@ static void md_safemode_timeout(unsigned
 	md_wakeup_thread(mddev->thread);
 }

+static int should_read_super(mddev_t *mddev)
+{
+	mdk_rdev_t *rdev, *tmp;
+
+	if (!mddev->raid_disks)
+		return 1;
+
+	rdev_for_each(rdev, tmp, mddev)
+		if (rdev->meta_bdev)
+			return 1;
+
+	return 0;
+}
+
 static int start_dirty_degraded;

 int md_run(mddev_t *mddev)
@@ -4442,7 +4456,7 @@ int md_run(mddev_t *mddev)
 	/*
 	 * Analyze all RAID superblock(s)
 	 */
-	if (!mddev->raid_disks) {
+	if (should_read_super(mddev)) {
 		if (!mddev->persistent)
 			return -EINVAL;
 		analyze_sbs(mddev);

^ permalink raw reply

* [PATCH 1 of 9] MD:  possible typo
From: Jonathan Brassow @ 2011-05-24  3:05 UTC (permalink / raw)
  To: linux-raid

Patch name: md-possible-typo.patch

Fix a value printed in kiB but labeled as 'blocks'.
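
For reference, the unit arithmetic involved: MD counts in 512-byte sectors, so dividing a sector count by 2 gives KiB. A small sketch of the conversions, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Two 512-byte sectors make one KiB. */
static uint64_t sectors_to_kib(uint64_t sectors)
{
	return sectors / (1024u / SECTOR_SIZE);
}

/* Convert sectors to blocks of an arbitrary size (a multiple of 512). */
static uint64_t sectors_to_blocks(uint64_t sectors, uint32_t block_size)
{
	assert(block_size >= SECTOR_SIZE && block_size % SECTOR_SIZE == 0);
	return sectors / (block_size / SECTOR_SIZE);
}
```

With 4 KiB pages, the window of 32*(PAGE_SIZE/512) = 256 sectors comes out as 128 KiB, which is what the "%dk" in the message reports.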

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c
+++ linux-2.6/drivers/md/md.c
@@ -6867,7 +6867,7 @@ void md_do_sync(mddev_t *mddev)
 	 */
 	window = 32*(PAGE_SIZE/512);
 	printk(KERN_INFO "md: using %dk window, over a total of %llu blocks.\n",
-		window/2,(unsigned long long) max_sectors/2);
+		window/2, (unsigned long long) max_sectors);
 
 	atomic_set(&mddev->recovery_active, 0);
 	last_check = 0;

^ permalink raw reply

* [PATCH 0 of 9] MD:  Updates for dm-raid.c support
From: Jonathan Brassow @ 2011-05-24  3:05 UTC (permalink / raw)
  To: linux-raid

Neil,

The following set of 9 patches is composed of three sets:
1) Patches 1-6 are small updates and fixes to what is already in the kernel.
2) Patch 7 introduces a new superblock type for device-mapper - allowing it
to record failures and give future support for reshaping and bitmap use.
3) Patches 8-9 provide bitmap support.  Patch 9 is necessary for transient failures,
where the bitmap can be replayed to recover a disk that has been gone for a short
time.  Without patch 9, raid5 will simply force a complete resync.

The first set by itself fixes a couple of issues and protects against the
possibility that dm-raid.c attempts to utilize super_types that may not exist.
The addition of the second set provides a working solution that is useful to
LVM in addition to (userspace) dm-raid.  The final set is not absolutely
necessary, but provides the obvious recovery speed-up.

I have seen an issue related to the MD recovery thread starting up before the
bitmap has been properly loaded.  This issue is not a result of this patchset,
nor is it addressed by this patch set.  Until I find a way to fix this issue,
the 3rd set may be problematic and we may wish to defer it until I have a 
solution.

 brassow

^ permalink raw reply

* Re: disable raid autodetect at boot
From: Michael Tokarev @ 2011-05-23 13:21 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid
In-Reply-To: <BANLkTi=zLb7j_MvkTyQa=ifMAS+kDCkbvg@mail.gmail.com>

23.05.2011 16:50, Alexander Lyakas wrote:
> Michael,
> can you pls explain what do I need to look at to disable this.

> On Mon, May 23, 2011 at 1:35 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:

>> This is not kernel autodetection, this is your initramfs/initrd
>> and mdadm.  Or maybe mdadm in the regular root filesystem.

You need to find out where and how mdadm is called
on your system during bootup, and fix that place.

/mjt

^ permalink raw reply

* Re: disable raid autodetect at boot
From: Alexander Lyakas @ 2011-05-23 12:50 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid
In-Reply-To: <4DDA388C.2020000@msgid.tls.msk.ru>

Michael,
can you pls explain what do I need to look at to disable this.

Alex.


On Mon, May 23, 2011 at 1:35 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> 23.05.2011 13:36, Alexander Lyakas wrote:
>> Hello,
>>
>> I have a simple raid1 created on top of /dev/sda and /dev/sdb. After I
>> reboot, I would like to always manually assemble the raid. However,
>> every time the machine reboots, it looks like md tries to
>> automatically reassemble the raid, but usually binds only one of the
>> source devices:
>>
>> 16936 May 23 11:47:36 vc kernel: [    3.472926] md: linear personality
>> registered for level -1
> ..
>> 16971 May 23 11:47:36 vc kernel: [    6.059194] md: bind<sdb>
>
> This is not kernel autodetection, this is your initramfs/initrd
> and mdadm.  Or maybe mdadm in the regular root filesystem.
>
> /mjt
>

^ permalink raw reply

* Re: HBA Adaptor Advice
From: Joe Landman @ 2011-05-23 11:55 UTC (permalink / raw)
  To: Ed W; +Cc: linux-raid
In-Reply-To: <4DDA41B2.8010003@wildgooses.com>

On 05/23/2011 07:14 AM, Ed W wrote:
> Getting back on track of specific adaptor advice:
>
> To recap: I am looking for ideas on what to buy to upgrade our small
> office servers (not really stretched, just adding more backup disks and
> similar). My main requirement is to be able to buy equipment in single
> lots (one server at a time) and so I require the ability to take an
> array from one machine and use it in another machine using a different
> adaptor - therefore the previous thread has dissuaded me from looking at
> adaptors with writeback cache (and also hardware raid controllers)
>
> Therefore can I see a show of hands for "good value" HBA adaptors with
> 8, 12 and 24 ports? Ideally using fewer PCI slots is preferred and
> onboard expanders rather than separate expanders are preferred

Be aware that this won't be cheap.  Also be aware that many (most) 
expander designs are performance limited due to their implementations. 
We see significant contention from bandwidth oversubscription in everyday 
situations, regardless of where the expander is.

On the HBA side

http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9201-16i/index.html 


On the hardware RAID side of this:

http://www.lsi.com/storage_home/products_home/internal_raid/megaraid_sas/value_line/megaraid_sas_9260-16i/index.html

LSI controllers as HBAs are reasonably good.  Make sure you update your 
drivers and firmware to late revisions.

> Seems that previously we discovered that most LSI RAID cards were well
> supported and Marvell cards were frequently not.  Does this
> generalisation persist with pure HBA cards also?
>
> I can see the list of LSI HBA cards on their site, but any pointers for
> good value HBA adaptors appreciated? (Current chassis will be a tower
> chassis, but future upgrades are expected to be Supermicro/Norco 3/4U
> rack boxes)
> (is there a page on the wiki already covering any of this that we could
> try and distil this wisdom to?)

Don't skimp on power supply, or power distribution.  RAIDs hate that.

Don't skimp on cooling.  Drives hate that.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply

* Re: HBA Adaptor advice
From: David Brown @ 2011-05-23 11:35 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <4DDA3A00.8010904@hardwarefreak.com>

On 23/05/2011 12:42, Stan Hoeppner wrote:
> On 5/23/2011 12:54 AM, Brad Campbell wrote:
>
>> Most sane operating systems use cluster sizes of 4k or larger and have
>> done for years, so I really don't see what all the fuss is about.
>>
>> People's inability to properly align the data on their disks can be read
>> either as a failing in the technology (the partitioning applications
>> have not caught up yet) or simply a lack of understanding on how to
>> apply the technology.
>>
>> Don't blame the drive manufacturers, this should have happened _years_ ago.
>
> I don't think anyone has an issue w/native 4KB sectors and operating
> system support for it.  That would have been the big win.  What folks
> have issue with are the hybrid 512/4096 drives, which have created the
> alignment offset problems.
>
> The industry (BIOS/firmware), commercial and FOSS OSes, should have
> worked together to migrate directly to 4KB native sectors.  I don't know
> why this didn't happen, usual suspects I guess.  It seems, from my
> limited POV, that the Linux partition tool people and kernel folks
> simply don't care at this point.
>

The problem is Windows XP - neither more nor less.  XP only supports 512 
byte sectors, and the installed base is so large that manufacturers 
can't ignore it.  Linux has been happy with 4K sectors for many years - 
the problem only came when WD produced disks that had 4K sectors but 
claimed to be 512, so that they could work with XP.

The worst case is hybrid disks that offset the sector number, so that 
512-byte "sector" number 63 is at the beginning of a 4K native sector. 
The idea is that these will be fast with XP and other systems that have 
the first partition starting at sector 63 - but it screws up everything 
else.

> I've not paid recent attention.  Have fdisk, cfdisk, parted, etc, all
> come up to speed now, and automatically handle offsets correctly for
> hybrid sector size disks?
>

Modern versions should automatically align partitions appropriately. 
Typically you use 1 MB boundaries - that works well with all sorts of 
disks (SSDs prefer alignment of perhaps 64K or 128K for erase blocks), 
and should work well for future disks.
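
The alignment arithmetic described here can be checked with a few lines of C. This is a sketch under stated assumptions: lba_is_aligned is a hypothetical helper, and the `shift` parameter models the offset-by-one drives as "logical sector N sits at physical byte (N + shift) * 512", which is a simplification of how such drives report themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOGICAL_SECTOR 512ull

/*
 * True if 512-byte LBA `lba` starts on a multiple of `boundary` bytes.
 * `shift` is 0 for a plain drive and 1 for the "sector 63" variant.
 */
static bool lba_is_aligned(uint64_t lba, uint64_t shift, uint64_t boundary)
{
	assert(boundary != 0);
	return ((lba + shift) * LOGICAL_SECTOR) % boundary == 0;
}
```

On a plain 4K drive, a partition at sector 63 starts at byte 32256, which is not a multiple of 4096, while sector 2048 (1 MB) is. On an offset drive the situation reverses: sector 63 becomes aligned and the 1 MB start does not.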

^ permalink raw reply

* Re: HBA Adaptor advice
From: Stan Hoeppner @ 2011-05-23 11:21 UTC (permalink / raw)
  To: Ed W; +Cc: Brad Campbell, linux-raid
In-Reply-To: <4DDA37F5.1000003@wildgooses.com>

On 5/23/2011 5:33 AM, Ed W wrote:

> If the cheap WD drives weren't the main issue then perhaps at least this
> example shouldn't be used as an example of why NOT to use those drives?

On the contrary, it's the *perfect* example.  A system's reliability is
no greater than that of the least reliable component.  The fact that WD
Green drives are so cheap dictates that anyone using them in a RAID
setup is going to use a cheap backplane or individual hot swap carriers.
 There is zero doubt here.  No one will buy a $1500 *empty* HP hot swap
JBOD chassis/backplane and then slap 14 x $80 = $1120 of WD 2TB Green
drives in it.

The soundness of design, manufacturing, and QC of cheap backplanes and
drive carriers is quite low in many/most cases.  Traces not routed
properly on a backplane PCB can generate timing skew or not reject
noise.  This can cause excessive CRC errors on the link, causing the
drive to be kicked off line.  This is but one possible flaw that shows up
regularly in cheap backplanes and individual carriers.  The absolute
worst are cheap *active* backplanes.  These are the units with SAS/SATA
expanders on the PCB and/or I2C chips.  If you go cheap on your
backplane, you probably want to make sure you get a passive unit.

>> Either way, cheap
>> not-fit-for-RAID drives were stuffed into a cheap RAID box and disaster
>> was the result.
> 
> But likely due to what boils down to "cables falling out" is what you
> seem to be guessing?

No.  I'm saying he purchased a low ball solution and got low quality.
Some component puked momentarily and he lost a lot of data because on
top of that, he didn't know wtf he was doing and wiped the disks that
were actually ok.  We don't know exactly what caused the problem.  Root
cause analysis was never performed, or, if it was, it was never made
public.  I'm guessing the former.  The type of people who do root cause
analysis typically don't buy low ball gear.

>> WDC
>> itself says not to use the Green drives in RAID arrays. 
> 
> The problem with taking the manufacturer's word on this is that they
> provide two products and claim one is "good enough" and that the other
> "lasts way longer", and then price them quite significantly differently

Welcome to the real world.  Been this way a long time.  Why do you think
health care in America is so expensive?  Because the same probe that is
sold to vets to be stuck up a horse's ass is the same one sold for human
application.  Horses aren't litigious, but humans are.  That's one
reason why the same ass probe sold for human use costs 300 times more.
One is charged for the intended use of many products today, not
necessarily the capabilities of the product.  AMD builds both the Phenom
and Opteron on the same line, both chips are identical until the last
phase of production.  There, one or two (can't recall exactly) of the HT
links are disabled to make a Phenom, and you pay twice as much for the
Opteron.  Same chip, different "intended uses".  They gouge the business
customer because they know they can.  I'm a huge AMD fan, so please
don't think I'm down on them.  EVERYONE in business does this, Intel,
WDC, Seagate, the lot of them.

> Now, without even looking inside the two identical metal chassis, you
> have to admit: a) there is incentive for them to tell fibs here in order
> to gain a price premium and b) given the "reliable" drives are roughly
> twice the cost then there should be sufficient extra engineering in
> there that we can look for third party documentation, patents and other
> supplemental information to learn more about what that engineering is
> and gain confidence that the money is well spent?

I don't think they tout them necessarily as being more reliable, but
more capable or "compatible" in certain applications.  Drives used in
hardware RAIDs need TLER and some other specific firmware tweaks.  The
mechanicals of most "enterprise" SATA drives are shared with a number of
consumer counterparts.  Just different firmware and sticker color on the
top plate.

-- 
Stan

* HBA Adaptor Advice
From: Ed W @ 2011-05-23 11:14 UTC (permalink / raw)
  To: linux-raid

Getting back on track of specific adaptor advice:

To recap: I am looking for ideas on what to buy to upgrade our small
office servers (not really stretched, just adding more backup disks and
similar). My main requirement is to be able to buy equipment in single
lots (one server at a time), so I need to be able to take an
array from one machine and use it in another machine with a different
adaptor. The previous thread has therefore dissuaded me from looking at
adaptors with writeback cache (and also hardware RAID controllers).

Therefore can I see a show of hands for "good value" HBA adaptors with
8, 12 and 24 ports? Ideally they would use fewer PCI slots and have
onboard expanders rather than separate expanders.

Seems that previously we discovered that most LSI RAID cards were well
supported and Marvell cards were frequently not.  Does this
generalisation persist with pure HBA cards also?

I can see the list of LSI HBA cards on their site, but any pointers to
good value HBA adaptors would be appreciated. (The current chassis will
be a tower chassis, but future upgrades are expected to be
Supermicro/Norco 3/4U rack boxes.)
(Is there a page on the wiki already covering any of this that we could
try to distil this wisdom into?)

Thanks

Ed W

* Re: HBA Adaptor advice
From: John Robinson @ 2011-05-23 10:44 UTC (permalink / raw)
  To: Ed W; +Cc: linux-raid
In-Reply-To: <4DDA2D77.1050604@wildgooses.com>

On 23/05/2011 10:48, Ed W wrote:
[...]
> Pardon what is probably a very ignorant question, but someone earlier in
> this thread claimed that some adaptors report the size of the disk
> slightly differently?  Wouldn't this potentially cause problems if you
> needed to move the disks to a different controller?

Yup. RAID cards will use some of the disc for their own metadata. The 
amount used, and the location of it, is probably different for different 
controllers. This would be one reason why using a RAID controller with 
BBWC and exporting the drives as single-drive RAID0 volumes is a bit 
icky, and liable to tie you to one manufacturer.

There is a possibility (handwaving here) that using a RAID controller in 
JBOD mode would be similar. You may need to flash your controller to 
non-RAID firmware to avoid it, at which point you probably ought to have 
bought an HBA in the first place.

There is a similar problem on some OEMs' BIOSes that will set a 
"host-protected area" that will reduce the visible size of drives.

> Additionally if you needed to replace the disk then some new batch might
> be some few sectors smaller?  This seems to be the biggest reason for
> wanting to add a partition table and then deliberately partition some
> 10s MB smaller? (Think I saw this exact problem come up several times in
> the last few weeks alone?)

For spinning rust discs this hasn't been the case for several years 
since we passed about 160GB; all the manufacturers signed up to an 
industry standard[1] making all their discs a consistent number of 
sectors for any given marketing size.

It's probably a problem again now with SSDs, though.
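
To put rough numbers on the two points above, here is a small Python
sketch. It assumes the standard in question is IDEMA's LBA1-03 and uses
the constants commonly quoted from it for 512-byte sectors, so check
them against the document before relying on them. The 16 MiB headroom,
the 2048-sector start offset and alignment, and both function names are
illustrative choices of mine, not anything mandated by mdadm or the
drive vendors.

```python
# Sketch only. The two IDEMA constants are the ones commonly quoted from
# LBA1-03 for 512-byte sectors (an assumption, not taken from this thread).
IDEMA_BASE_LBAS = 97_696_368      # sector count of a 50 GB drive
IDEMA_LBAS_PER_GB = 1_953_504     # extra sectors per advertised GB above 50
SECTOR_BYTES = 512


def idema_lba_count(advertised_gb: int) -> int:
    """Expected sector count for a drive of the given marketing size in GB."""
    if advertised_gb < 50:
        raise ValueError("formula is only defined for 50 GB and above")
    return IDEMA_BASE_LBAS + IDEMA_LBAS_PER_GB * (advertised_gb - 50)


def partition_sectors(disk_sectors: int, start_sector: int = 2048,
                      headroom_mib: int = 16,
                      align_sectors: int = 2048) -> int:
    """Length in sectors of a partition that leaves headroom at the end.

    The partition starts at start_sector, leaves headroom_mib unused at the
    end of the disk, and is rounded down to a multiple of align_sectors.
    All three defaults are hypothetical values chosen for illustration.
    """
    headroom = headroom_mib * 1024 * 1024 // SECTOR_BYTES
    usable = disk_sectors - start_sector - headroom
    if usable <= 0:
        raise ValueError("start offset plus headroom exceeds disk size")
    return usable - (usable % align_sectors)


if __name__ == "__main__":
    total = idema_lba_count(1000)   # a "1 TB" drive
    print("disk sectors:     ", total)
    print("partition sectors:", partition_sectors(total))
```

If a replacement drive turns out a few sectors short of the nominal
count, a partition sized this way should still fit, as long as the
shortfall is smaller than the headroom.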

Cheers,

John.

[1] I can't remember what the standard or standards group is, and I 
can't be bothered looking it up. But of course it's a standard. We love 
standards, that's why we have so many of them![2]

[2] Sorry if I'm a bit grumpy this morning. Too many standards and not 
enough coffee make John a grumpy boy.
