* [md PATCH 01/16] md: beginnings of bad block management.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
This is the first step in allowing md to track bad blocks per-device so
that we can fail individual blocks rather than the whole device.
This patch just adds a data structure for recording bad blocks, with
routines to add, remove, and search the list.
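
For reference, the 64-bit entry layout can be exercised outside the
kernel. A minimal sketch using the BB_* macros this patch adds to md.h
(stdint types substituted for the kernel ones):

#include <stdio.h>
#include <stdint.h>

/* same layout as the md.h macros below: 1 ack bit, 54 offset bits,
 * 9 length bits storing (length - 1)
 */
#define BB_LEN_MASK    (0x00000000000001FFULL)
#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
#define BB_ACK_MASK    (0x8000000000000000ULL)
#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)    (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)    (!!((x) & BB_ACK_MASK))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((uint64_t)(!!(ack)) << 63))

int main(void)
{
	/* a 16-sector bad range starting at sector 1000, acknowledged */
	uint64_t e = BB_MAKE(1000ULL, 16, 1);

	printf("offset=%llu len=%llu ack=%d\n",
	       (unsigned long long)BB_OFFSET(e),
	       (unsigned long long)BB_LEN(e),
	       BB_ACK(e));
	/* prints: offset=1000 len=16 ack=1 */
	return 0;
}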
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/dm-raid456.c | 6 +
drivers/md/md.c | 427 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/md.h | 46 +++++
3 files changed, 475 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-raid456.c b/drivers/md/dm-raid456.c
index 3dcbc4a..5030d16 100644
--- a/drivers/md/dm-raid456.c
+++ b/drivers/md/dm-raid456.c
@@ -112,7 +112,11 @@ static int dev_parms(struct raid_set *rs, char **argv)
int err = 0;
unsigned long long offset;
- md_rdev_init(&rs->dev[i].rdev);
+ err = md_rdev_init(&rs->dev[i].rdev);
+ if (err) {
+ rs->ti->error = "Memory allocation failure";
+ return err;
+ }
rs->dev[i].rdev.raid_disk = i;
if (strcmp(argv[0], "-") == 0)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e0a9bf8..8ae8322 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1900,6 +1900,10 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
sysfs_remove_link(&rdev->kobj, "block");
sysfs_put(rdev->sysfs_state);
rdev->sysfs_state = NULL;
+ kfree(rdev->badblocks.page);
+ rdev->badblocks.count = 0;
+ rdev->badblocks.page = NULL;
+ rdev->badblocks.active_page = NULL;
/* We need to delay this, otherwise we can deadlock when
* writing to 'remove' to "dev/state". We also need
* to delay it due to rcu usage.
@@ -2738,7 +2742,7 @@ static struct kobj_type rdev_ktype = {
.default_attrs = rdev_default_attrs,
};
-void md_rdev_init(mdk_rdev_t *rdev)
+int md_rdev_init(mdk_rdev_t *rdev)
{
rdev->desc_nr = -1;
rdev->saved_raid_disk = -1;
@@ -2754,6 +2758,20 @@ void md_rdev_init(mdk_rdev_t *rdev)
INIT_LIST_HEAD(&rdev->same_set);
init_waitqueue_head(&rdev->blocked_wait);
+
+ /* Add space to store bad block list.
+ * This reserves the space even on arrays where it cannot
+ * be used - I wonder if that matters
+ */
+ rdev->badblocks.count = 0;
+ rdev->badblocks.shift = 0;
+ rdev->badblocks.page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ rdev->badblocks.active_page = rdev->badblocks.page;
+ spin_lock_init(&rdev->badblocks.lock);
+ if (rdev->badblocks.page == NULL)
+ return -ENOMEM;
+
+ return 0;
}
EXPORT_SYMBOL_GPL(md_rdev_init);
/*
@@ -2779,7 +2797,8 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
return ERR_PTR(-ENOMEM);
}
- md_rdev_init(rdev);
+ if ((err = md_rdev_init(rdev)))
+ goto abort_free;
if ((err = alloc_disk_sb(rdev)))
goto abort_free;
@@ -7212,6 +7231,410 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
}
EXPORT_SYMBOL(md_wait_for_blocked_rdev);
+
+/* Bad block management.
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than whole device.
+ * Entries in the bad-block table are 64bits wide. This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ * A 'shift' can be set so that larger blocks are tracked and
+ * consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit - the most significant bit.
+ */
+/* Locking of the bad-block table is a two-layer affair.
+ * Read access through ->active_page only requires rcu_read_lock().
+ * However, if ->active_page is found to be NULL, the table
+ * should be accessed through ->page, which requires a spinlock.
+ * Updating the page requires setting ->active_page to NULL,
+ * synchronising with rcu, then updating ->page under the same
+ * spinlock.
+ *
+ */
+/* When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad. So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ * We return
+ * 0 if there are no known bad blocks in the range
+ * 1 if there are known bad blocks which are all acknowledged
+ * -1 if there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ int hi;
+ int lo = 0;
+ u64 *p;
+ int rv = 0;
+ int havelock = 0;
+ sector_t target = s + sectors;
+
+ if (bb->shift) {
+ /* round down the start, and up the end */
+ s >>= bb->shift;
+ target |= (1<<bb->shift) - 1;
+ target++;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+ /* 'target' is now the first block after the bad range */
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock(&bb->lock);
+ p = bb->page;
+ havelock = 1;
+ }
+ hi = bb->count;
+
+ /* Binary search between lo and hi for 'target'
+ * i.e. for the last range that starts before 'target'
+ */
+ /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+ * are known not to be the last range before target.
+ * VARIANT: hi-lo is the number of possible
+ * ranges, and decreases until it reaches 1
+ */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ /* This could still be the one; earlier ranges
+ * could not. */
+ lo = mid;
+ else
+ /* This and later ranges are definitely out. */
+ hi = mid;
+ }
+ /* 'lo' might be the last that started before target, but 'hi' isn't */
+ if (hi > lo) {
+ /* need to check all ranges that end after 's' to see if
+ * any are unacknowledged.
+ */
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* starts before the end, and finishes after
+ * the start, so they must overlap
+ */
+ if (rv != -1 && BB_ACK(p[lo]))
+ rv = 1;
+ else
+ rv = -1;
+ *first_bad = BB_OFFSET(p[lo]);
+ *bad_sectors = BB_LEN(p[lo]);
+ lo--;
+ }
+ }
+
+ if (havelock)
+ spin_unlock(&bb->lock);
+ rcu_read_unlock();
+ return rv;
+}
+EXPORT_SYMBOL_GPL(md_is_badblock);
+
+/*
+ * Add a range of bad blocks to the table.
+ * This might extend the table, or might contract it
+ * if two adjacent ranges can be merged.
+ * We binary-search to find the 'insertion' point, then
+ * decide how best to handle it.
+ */
+int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
+ int acknowledged)
+{
+ u64 *p;
+ int lo, hi;
+ int rv = 1;
+
+ if (bb->shift) {
+ /* round down the start, and up the end */
+ sector_t next = s + sectors;
+ s >>= bb->shift;
+ next |= (1<<bb->shift) - 1;
+ next++;
+ next >>= bb->shift;
+ sectors = next - s;
+ }
+
+again:
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock(&bb->lock);
+ if (bb->active_page) {
+ /* someone else just unlocked, better retry */
+ spin_unlock(&bb->lock);
+ goto again;
+ }
+ /* now have exclusive access to the page */
+
+add_more:
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts at-or-before 's' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a <= s)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo && BB_OFFSET(p[lo]) > s)
+ hi = lo;
+
+ if (hi > lo) {
+ /* we found a range that might merge with the start
+ * of our new range
+ */
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t e = a + BB_LEN(p[lo]);
+ int ack = BB_ACK(p[lo]);
+ if (e >= s) {
+ /* Yes, we can merge with a previous range */
+ if (s <= a && s + sectors >= e) {
+ /* new range covers old */
+ if (!ack)
+ ack = acknowledged;
+ } else {
+ if (!acknowledged)
+ ack = acknowledged;
+ }
+ if (e < s + sectors)
+ e = s + sectors;
+ if (s + sectors <= a + BB_MAX_LEN) {
+ p[lo] = BB_MAKE(a, e-a, ack);
+ s = e;
+ } else {
+ /* does not all fit in one range,
+ * make p[lo] maximal
+ */
+ if (BB_LEN(p[lo]) != BB_MAX_LEN)
+ p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
+ s = a + BB_MAX_LEN;
+ }
+ sectors = e - s;
+ }
+ }
+ if (sectors && hi < bb->count) {
+ /* 'hi' points to the first range that starts after 's'.
+ * Maybe we can merge with the start of that range */
+ sector_t a = BB_OFFSET(p[hi]);
+ sector_t e = a + BB_LEN(p[hi]);
+ int ack = BB_ACK(p[hi]);
+ if (a <= (s + sectors)) {
+ /* merging is possible */
+ if (e < s + sectors)
+ /* full overlap */
+ e = s + sectors;
+ if (a > s)
+ a = s;
+ if (e - a <= BB_MAX_LEN) {
+ p[hi] = BB_MAKE(a, e-a, acknowledged && ack);
+ sectors = 0;
+ s = e;
+ } else {
+ p[hi] = BB_MAKE(a, BB_MAX_LEN,
+ acknowledged && ack);
+ s = a + BB_MAX_LEN;
+ sectors -= BB_MAX_LEN;
+ }
+ hi++;
+ }
+ }
+ if (sectors == 0 && hi < bb->count) {
+ /* we might be able to combine lo and hi */
+ sector_t a = BB_OFFSET(p[hi]);
+ int lolen = BB_LEN(p[lo]);
+ int hilen = BB_LEN(p[hi]);
+ int newlen = lolen + hilen - (s - a);
+ if (s >= a && newlen < BB_MAX_LEN) {
+ /* yes, we can combine them */
+ int ack = BB_ACK(p[lo]) || BB_ACK(p[hi]);
+ p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
+ memmove(p + hi, p + hi + 1,
+ (bb->count - hi) * 8);
+ bb->count--;
+ }
+ }
+ if (sectors) {
+ /* didn't merge (all of it).
+ * Need to add a range just before 'hi' */
+ if (bb->count >= MD_MAX_BADBLOCKS)
+ /* No room for more */
+ rv = 0;
+ else {
+ memmove(p + hi + 1, p + hi,
+ (bb->count - hi) * 8);
+ bb->count++;
+ if (sectors <= BB_MAX_LEN)
+ p[hi] = BB_MAKE(s, sectors, acknowledged);
+ else {
+ p[hi] = BB_MAKE(s, BB_MAX_LEN, acknowledged);
+ s += BB_MAX_LEN;
+ sectors -= BB_MAX_LEN;
+ goto add_more;
+ }
+ }
+ }
+
+ bb->changed = 1;
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock(&bb->lock);
+
+ return rv;
+}
+EXPORT_SYMBOL_GPL(md_set_badblocks);
+
+/*
+ * Remove a range of bad blocks from the table.
+ * This may involve extending the table if we split a region,
+ * but it must not fail. So if the table becomes full, we just
+ * drop the remove request.
+ */
+int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors)
+{
+ u64 *p;
+ int lo, hi;
+ sector_t target = s + sectors;
+ int rv = 0;
+
+ if (bb->shift) {
+ /* FIXME should this round the other way??? */
+ /* round down the start, and up the end?
+ * It should never matter as block shift should
+ * be aligned with basic IO size, and this
+ * way seems safer
+ */
+ s >>= bb->shift;
+ target |= (1<<bb->shift) - 1;
+ target++;
+ target >>= bb->shift;
+ sectors = target - s;
+ }
+
+again:
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock(&bb->lock);
+ if (bb->active_page) {
+ /* someone else just unlocked, better retry */
+ spin_unlock(&bb->lock);
+ goto again;
+ }
+ /* now have exclusive access to the page */
+
+ p = bb->page;
+ lo = 0;
+ hi = bb->count;
+ /* Find the last range that starts before 'target' */
+ while (hi - lo > 1) {
+ int mid = (lo + hi) / 2;
+ sector_t a = BB_OFFSET(p[mid]);
+ if (a < target)
+ lo = mid;
+ else
+ hi = mid;
+ }
+ if (hi > lo) {
+ /* p[lo] is the last range that could overlap the
+ * current range. Earlier ranges could also overlap,
+ * but only this one can overlap the end of the range.
+ */
+ if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
+ /* Partial overlap, leave the tail of this range */
+ int ack = BB_ACK(p[lo]);
+ sector_t a = BB_OFFSET(p[lo]);
+ sector_t end = a + BB_LEN(p[lo]);
+
+ if (a < s) {
+ /* we need to split this range */
+ if (bb->count >= MD_MAX_BADBLOCKS) {
+ rv = 0;
+ goto out;
+ }
+ memmove(p+lo+1, p+lo, (bb->count - lo) * 8);
+ bb->count++;
+ p[lo] = BB_MAKE(a, s-a, ack);
+ lo++;
+ }
+ p[lo] = BB_MAKE(target, end - target, ack);
+ /* there is no longer an overlap */
+ hi = lo;
+ lo--;
+ }
+ while (lo >= 0 &&
+ BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+ /* This range does overlap */
+ if (BB_OFFSET(p[lo]) < s) {
+ /* Keep the early parts of this range. */
+ int ack = BB_ACK(p[lo]);
+ sector_t start = BB_OFFSET(p[lo]);
+ p[lo] = BB_MAKE(start, s - start, ack);
+ /* now lo doesn't overlap, so.. */
+ break;
+ }
+ lo--;
+ }
+ /* 'lo' is strictly before, 'hi' is strictly after,
+ * anything between needs to be discarded
+ */
+ if (hi - lo > 1) {
+ memmove(p+lo+1, p+hi, (bb->count - hi) * 8);
+ bb->count -= (hi - lo - 1);
+ }
+ }
+
+ bb->changed = 1;
+out:
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock(&bb->lock);
+ return rv;
+}
+EXPORT_SYMBOL_GPL(md_clear_badblocks);
+
+/*
+ * Acknowledge all bad blocks in a list.
+ * This only succeeds if ->changed is clear. It is used by
+ * in-kernel metadata updates.
+ */
+void md_ack_all_badblocks(struct badblocks *bb)
+{
+ if (bb->page == NULL || bb->changed)
+ /* no point even trying */
+ return;
+again:
+ rcu_assign_pointer(bb->active_page, NULL);
+ synchronize_rcu();
+ spin_lock(&bb->lock);
+ if (bb->active_page) {
+ /* someone else just unlocked, better retry */
+ spin_unlock(&bb->lock);
+ goto again;
+ }
+ /* now have exclusive access to the page */
+
+ if (bb->changed == 0) {
+ u64 *p = bb->page;
+ int i;
+ for (i = 0; i < bb->count ; i++) {
+ if (!BB_ACK(p[i])) {
+ sector_t start = BB_OFFSET(p[i]);
+ int len = BB_LEN(p[i]);
+ p[i] = BB_MAKE(start, len, 1);
+ }
+ }
+ }
+ rcu_assign_pointer(bb->active_page, bb->page);
+ spin_unlock(&bb->lock);
+}
+EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e53b355..a24e131 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -49,6 +49,12 @@ static inline void plugger_flush(struct plug_handle *plug)
cancel_work_sync(&plug->unplug_work);
}
+/* Bad block numbers are stored sorted in a single page.
+ * 64bits is used for each block or extent.
+ * 54 bits are sector number, 9 bits extent size, 1 bit 'acknowledged'
+ */
+#define MD_MAX_BADBLOCKS (PAGE_SIZE/8)
+
/*
* MD's 'extended' device
*/
@@ -125,8 +131,46 @@ struct mdk_rdev_s
struct sysfs_dirent *sysfs_state; /* handle for 'state'
* sysfs entry */
+
+ struct badblocks {
+ int count; /* count of bad blocks */
+ int shift; /* shift from sectors to block size */
+ u64 *page; /* badblock list */
+ u64 *active_page; /* either 'page' or 'NULL' */
+ int changed;
+ spinlock_t lock;
+ } badblocks;
};
+#define BB_LEN_MASK (0x00000000000001FFULL)
+#define BB_OFFSET_MASK (0x7FFFFFFFFFFFFE00ULL)
+#define BB_ACK_MASK (0x8000000000000000ULL)
+#define BB_MAX_LEN 512
+#define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
+#define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
+#define BB_ACK(x) (!!((x) & BB_ACK_MASK))
+#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
+
+extern int md_is_badblock(struct badblocks *bb, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors);
+static inline int is_badblock(mdk_rdev_t *rdev, sector_t s, int sectors,
+ sector_t *first_bad, int *bad_sectors)
+{
+ if (unlikely(rdev->badblocks.count)) {
+ int rv = md_is_badblock(&rdev->badblocks, rdev->data_offset + s,
+ sectors,
+ first_bad, bad_sectors);
+ if (rv)
+ *first_bad -= rdev->data_offset;
+ return rv;
+ }
+ return 0;
+}
+extern int md_set_badblocks(struct badblocks *bb, sector_t s, int sectors,
+ int acknowledged);
+extern int md_clear_badblocks(struct badblocks *bb, sector_t s, int sectors);
+extern void md_ack_all_badblocks(struct badblocks *bb);
+
struct mddev_s
{
void *private;
@@ -517,7 +561,7 @@ extern void mddev_init(mddev_t *mddev);
extern int md_run(mddev_t *mddev);
extern void md_stop(mddev_t *mddev);
extern void md_stop_writes(mddev_t *mddev);
-extern void md_rdev_init(mdk_rdev_t *rdev);
+extern int md_rdev_init(mdk_rdev_t *rdev);
extern void mddev_suspend(mddev_t *mddev);
extern void mddev_resume(mddev_t *mddev);
* [md PATCH 02/16] md/bad-block-log: add sysfs interface for accessing bad-block-log.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
This can show the log (provided it fits in one page) and
allows bad blocks to be 'acknowledged', meaning that they
have been safely recorded in metadata.
Clearing bad blocks is not allowed via sysfs (except for
code testing). A bad block can only be cleared when
a write to the block succeeds.
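
As a usage illustration only, a user-space sketch of driving the two
files. The attribute names come from this patch; the directory layout
(/sys/block/<array>/md/dev-<component>/) is an assumption about where
md exposes per-device attributes, so treat the path as hypothetical:

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/md0/md/dev-sda1/bad_blocks";
	char line[64];
	FILE *f;

	/* record an acknowledged 8-sector bad range at sector 12345 */
	f = fopen(path, "w");
	if (!f)
		return 1;
	fprintf(f, "12345 8\n");
	fclose(f);

	/* read the list back: one "sector length" pair per line */
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}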
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 124 insertions(+), 0 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8ae8322..6ba2253 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2672,6 +2672,36 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
static struct rdev_sysfs_entry rdev_recovery_start =
__ATTR(recovery_start, S_IRUGO|S_IWUSR, recovery_start_show, recovery_start_store);
+
+
+static ssize_t
+badblocks_show(struct badblocks *bb, char *page, int unack);
+static ssize_t
+badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack);
+
+static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
+{
+ return badblocks_show(&rdev->badblocks, page, 0);
+}
+static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
+{
+ return badblocks_store(&rdev->badblocks, page, len, 0);
+}
+static struct rdev_sysfs_entry rdev_bad_blocks =
+__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
+
+
+static ssize_t ubb_show(mdk_rdev_t *rdev, char *page)
+{
+ return badblocks_show(&rdev->badblocks, page, 1);
+}
+static ssize_t ubb_store(mdk_rdev_t *rdev, const char *page, size_t len)
+{
+ return badblocks_store(&rdev->badblocks, page, len, 1);
+}
+static struct rdev_sysfs_entry rdev_unack_bad_blocks =
+__ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
+
static struct attribute *rdev_default_attrs[] = {
&rdev_state.attr,
&rdev_errors.attr,
@@ -2679,6 +2709,8 @@ static struct attribute *rdev_default_attrs[] = {
&rdev_offset.attr,
&rdev_size.attr,
&rdev_recovery_start.attr,
+ &rdev_bad_blocks.attr,
+ &rdev_unack_bad_blocks.attr,
NULL,
};
static ssize_t
@@ -7635,6 +7667,98 @@ again:
}
EXPORT_SYMBOL_GPL(md_ack_all_badblocks);
+/* sysfs access to the bad-blocks list.
+ * We present two files.
+ * 'bad_blocks' lists sector numbers and lengths of ranges that
+ * are recorded as bad. The list is truncated to fit within
+ * the one-page limit of sysfs.
+ * Writing "sector length" to this file adds an acknowledged
+ * bad block range.
+ * 'unacknowledged_bad_blocks' lists bad blocks that have not yet
+ * been acknowledged. Writing to this file adds bad blocks
+ * without acknowledging them. This is largely for testing.
+ *
+ */
+
+static ssize_t
+badblocks_show(struct badblocks *bb, char *page, int unack)
+{
+ size_t len = 0;
+ int i;
+ u64 *p;
+ int havelock = 0;
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock(&bb->lock);
+ p = bb->page;
+ havelock = 1;
+ }
+
+ i = 0;
+
+ while (len < PAGE_SIZE && i < bb->count) {
+ sector_t s = BB_OFFSET(p[i]);
+ unsigned int length = BB_LEN(p[i]);
+ int ack = BB_ACK(p[i]);
+ i++;
+
+ if (unack && ack)
+ continue;
+
+ len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
+ (unsigned long long)s, length);
+ }
+
+ if (havelock)
+ spin_unlock(&bb->lock);
+ rcu_read_unlock();
+
+ return strlen(page);
+}
+
+#define DO_DEBUG 1
+
+static ssize_t
+badblocks_store(struct badblocks *bb, const char *page, size_t len, int unack)
+{
+ unsigned long long sector;
+ int length;
+ char newline;
+#ifdef DO_DEBUG
+ /* Allow clearing via sysfs *only* for testing/debugging.
+ * Normally only a successful write may clear a badblock
+ */
+ int clear = 0;
+ if (page[0] == '-') {
+ clear = 1;
+ page++;
+ }
+#endif /* DO_DEBUG */
+
+ switch (sscanf(page, "%llu %d%c", &sector, &length, &newline)) {
+ case 3:
+ if (newline != '\n')
+ return -EINVAL;
+ case 2:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+#ifdef DO_DEBUG
+ if (clear) {
+ md_clear_badblocks(bb, sector, length);
+ return len;
+ }
+#endif /* DO_DEBUG */
+ if (md_set_badblocks(bb, sector, length, !unack))
+ return len;
+ else
+ return -ENOSPC;
+}
+
static int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x)
{
* [md PATCH 06/16] md/raid1: clean up read_balance.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
read_balance has three loops which all look for a 'best'
device based on slightly different criteria.
This is clumsy and makes it hard to add extra criteria.
So replace it all with a single loop that combines everything.
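
The shape of the combined loop, reduced to a stand-alone sketch (the
struct and the 'usable' test are simplified stand-ins for the rdev
checks in the diff below):

struct dev {
	int usable;		/* stands in for the In_sync/Faulty tests */
	int pending;		/* stands in for atomic nr_pending reads */
	long long head_pos;	/* last known head position */
};

/* One pass over all devices: skip unusable ones, take the first
 * usable one when balancing is disabled, stop early on a sequential
 * or idle device, and otherwise remember the smallest seek distance.
 */
static int pick_device(const struct dev *devs, int ndevs, int start,
		       long long sector, int do_balance)
{
	long long best_dist = -1;
	int best = -1;
	int i;

	for (i = 0; i < ndevs; i++) {
		int d = (start + i) % ndevs;
		long long dist;

		if (!devs[d].usable)
			continue;
		if (!do_balance)
			return d;	/* first usable device wins */
		dist = sector - devs[d].head_pos;
		if (dist < 0)
			dist = -dist;
		if (dist == 0 || !devs[d].pending)
			return d;	/* sequential or idle: use it */
		if (best_dist < 0 || dist < best_dist) {
			best_dist = dist;
			best = d;
		}
	}
	return best;			/* -1 if nothing was usable */
}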
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 144 ++++++++++++++++++++++------------------------------
1 files changed, 60 insertions(+), 84 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 82440a7..fa62c7b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -420,10 +420,13 @@ static void raid1_end_write_request(struct bio *bio, int error)
static int read_balance(conf_t *conf, r1bio_t *r1_bio)
{
const sector_t this_sector = r1_bio->sector;
- int new_disk = conf->last_used, disk = new_disk;
- int wonly_disk = -1;
const int sectors = r1_bio->sectors;
- sector_t new_distance, current_distance;
+ int do_balance;
+ int disk;
+ int start_disk;
+ int best_disk;
+ int i;
+ sector_t best_dist;
mdk_rdev_t *rdev;
rcu_read_lock();
@@ -433,100 +436,73 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
* We take the first readable disk when above the resync window.
*/
retry:
+ disk = -1;
+ best_disk = -1;
+ best_dist = MaxSector;
if (conf->mddev->recovery_cp < MaxSector &&
(this_sector + sectors >= conf->next_resync)) {
- /* Choose the first operational device, for consistancy */
- new_disk = 0;
-
- for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
- r1_bio->bios[new_disk] == IO_BLOCKED ||
- !rdev || !test_bit(In_sync, &rdev->flags)
- || test_bit(WriteMostly, &rdev->flags);
- rdev = rcu_dereference(conf->mirrors[++new_disk].rdev)) {
-
- if (rdev && test_bit(In_sync, &rdev->flags) &&
- r1_bio->bios[new_disk] != IO_BLOCKED)
- wonly_disk = new_disk;
-
- if (new_disk == conf->raid_disks - 1) {
- new_disk = wonly_disk;
- break;
- }
- }
- goto rb_out;
+ /* just choose the first */
+ start_disk = 0;
+ do_balance = 0;
+ } else {
+ /* Else start from last used */
+ start_disk = conf->last_used;
+ do_balance = 1;
}
+ for (i = 0; i < conf->raid_disks; i++) {
+ sector_t dist;
+ disk = (start_disk + i) % conf->raid_disks;
+ if (r1_bio->bios[disk] == IO_BLOCKED)
+ continue;
+ rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ if (!rdev)
+ continue;
+ if (test_bit(Faulty, &rdev->flags))
+ continue;
+ if (!test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < this_sector + sectors)
+ continue;
- /* make sure the disk is operational */
- for (rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
- r1_bio->bios[new_disk] == IO_BLOCKED ||
- !rdev || !test_bit(In_sync, &rdev->flags) ||
- test_bit(WriteMostly, &rdev->flags);
- rdev = rcu_dereference(conf->mirrors[new_disk].rdev)) {
+ if (test_bit(WriteMostly, &rdev->flags)) {
+ /* don't balance among write-mostly, just
+ * use first as a last resort */
+ if (best_disk < 0)
+ best_disk = disk;
+ continue;
+ }
+ /* This is a reasonable device to use. It might
+ * even be best.
+ */
+ if (!do_balance)
+ break;
- if (rdev && test_bit(In_sync, &rdev->flags) &&
- r1_bio->bios[new_disk] != IO_BLOCKED)
- wonly_disk = new_disk;
+ /*
+ * Don't change to another disk for sequential reads:
+ */
+ if (conf->next_seq_sect == this_sector)
+ break;
- if (new_disk <= 0)
- new_disk = conf->raid_disks;
- new_disk--;
- if (new_disk == disk) {
- new_disk = wonly_disk;
+ dist = abs(this_sector - conf->mirrors[disk].head_position);
+ if (dist == 0)
+ break;
+ if (!atomic_read(&rdev->nr_pending))
+ /* Device is idle, so use it */
break;
+ if (dist < best_dist) {
+ best_dist = dist;
+ best_disk = disk;
}
}
+ if (i == conf->raid_disks)
+ disk = best_disk;
- if (new_disk < 0)
- goto rb_out;
-
- disk = new_disk;
- /* now disk == new_disk == starting point for search */
-
- /*
- * Don't change to another disk for sequential reads:
- */
- if (conf->next_seq_sect == this_sector)
- goto rb_out;
- if (this_sector == conf->mirrors[new_disk].head_position)
- goto rb_out;
-
- current_distance = abs(this_sector - conf->mirrors[disk].head_position);
-
- /* Find the disk whose head is closest */
-
- do {
- if (disk <= 0)
- disk = conf->raid_disks;
- disk--;
-
+ if (disk >= 0) {
rdev = rcu_dereference(conf->mirrors[disk].rdev);
-
- if (!rdev || r1_bio->bios[disk] == IO_BLOCKED ||
- !test_bit(In_sync, &rdev->flags) ||
- test_bit(WriteMostly, &rdev->flags))
- continue;
-
- if (!atomic_read(&rdev->nr_pending)) {
- new_disk = disk;
- break;
- }
- new_distance = abs(this_sector - conf->mirrors[disk].head_position);
- if (new_distance < current_distance) {
- current_distance = new_distance;
- new_disk = disk;
- }
- } while (disk != conf->last_used);
-
- rb_out:
-
-
- if (new_disk >= 0) {
- rdev = rcu_dereference(conf->mirrors[new_disk].rdev);
if (!rdev)
goto retry;
atomic_inc(&rdev->nr_pending);
- if (!test_bit(In_sync, &rdev->flags)) {
+ if (test_bit(Faulty, &rdev->flags)) {
/* cannot risk returning a device that failed
* before we inc'ed nr_pending
*/
@@ -534,11 +510,11 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
goto retry;
}
conf->next_seq_sect = this_sector + sectors;
- conf->last_used = new_disk;
+ conf->last_used = disk;
}
rcu_read_unlock();
- return new_disk;
+ return disk;
}
static void unplug_slaves(mddev_t *mddev)
* [md PATCH 05/16] md: reject devices with bad blocks and v0.90 metadata.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
v0.90 metadata cannot record bad blocks, so if there are any we need
to fail the devices.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63b185e..8a888d5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1310,6 +1310,10 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)
sb->this_disk = sb->disks[rdev->desc_nr];
sb->sb_csum = calc_sb_csum(sb);
+
+ if (rdev->badblocks.count)
+ /* Cannot record individual bad blocks so... */
+ md_error(mddev, rdev);
}
/*
* [md PATCH 07/16] md: simplify raid10 read_balance
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
raid10's read_balance has two different loops for looking through
possible devices to choose the best.
Collapse those into one loop and generally make the code more
readable.
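
With one loop, the balancing criterion reduces to a single metric.
A stand-alone sketch (names simplified from the diff below): for
far-copy layouts the lowest device address always wins, otherwise
the smallest head-seek distance wins.

static long long balance_metric(int far_copies, long long addr,
				long long head_position)
{
	long long dist;

	if (far_copies > 1)
		return addr;		/* always prefer lowest address */
	dist = addr - head_position;	/* seek distance otherwise */
	return dist < 0 ? -dist : dist;
}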
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 114 +++++++++++++++++++++------------------------------
1 files changed, 48 insertions(+), 66 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 20da258..01a75d2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -497,13 +497,21 @@ static int raid10_mergeable_bvec(struct request_queue *q,
static int read_balance(conf_t *conf, r10bio_t *r10_bio)
{
const sector_t this_sector = r10_bio->sector;
- int disk, slot, nslot;
const int sectors = r10_bio->sectors;
- sector_t new_distance, current_distance;
+ int disk, slot;
+ sector_t new_distance, best_dist;
mdk_rdev_t *rdev;
+ int do_balance;
+ int best_disk, best_slot;
raid10_find_phys(conf, r10_bio);
rcu_read_lock();
+retry:
+ disk = -1;
+ best_disk = -1;
+ best_slot = -1;
+ best_dist = MaxSector;
+ do_balance = 1;
/*
* Check if we can balance. We can balance on the whole
* device if no resync is going on (recovery is ok), or below
@@ -511,86 +519,60 @@ static int read_balance(conf_t *conf, r10bio_t *r10_bio)
* above the resync window.
*/
if (conf->mddev->recovery_cp < MaxSector
- && (this_sector + sectors >= conf->next_resync)) {
- /* make sure that disk is operational */
- slot = 0;
- disk = r10_bio->devs[slot].devnum;
-
- while ((rdev = rcu_dereference(conf->mirrors[disk].rdev)) == NULL ||
- r10_bio->devs[slot].bio == IO_BLOCKED ||
- !test_bit(In_sync, &rdev->flags)) {
- slot++;
- if (slot == conf->copies) {
- slot = 0;
- disk = -1;
- break;
- }
- disk = r10_bio->devs[slot].devnum;
- }
- goto rb_out;
- }
-
+ && (this_sector + sectors >= conf->next_resync))
+ do_balance = 0;
- /* make sure the disk is operational */
- slot = 0;
- disk = r10_bio->devs[slot].devnum;
- while ((rdev=rcu_dereference(conf->mirrors[disk].rdev)) == NULL ||
- r10_bio->devs[slot].bio == IO_BLOCKED ||
- !test_bit(In_sync, &rdev->flags)) {
- slot ++;
- if (slot == conf->copies) {
- disk = -1;
- goto rb_out;
- }
+ for (slot = 0; slot < conf->copies ; slot++) {
+ if (r10_bio->devs[slot].bio == IO_BLOCKED)
+ continue;
disk = r10_bio->devs[slot].devnum;
- }
-
-
- current_distance = abs(r10_bio->devs[slot].addr -
- conf->mirrors[disk].head_position);
-
- /* Find the disk whose head is closest,
- * or - for far > 1 - find the closest to partition beginning */
-
- for (nslot = slot; nslot < conf->copies; nslot++) {
- int ndisk = r10_bio->devs[nslot].devnum;
-
-
- if ((rdev=rcu_dereference(conf->mirrors[ndisk].rdev)) == NULL ||
- r10_bio->devs[nslot].bio == IO_BLOCKED ||
- !test_bit(In_sync, &rdev->flags))
+ rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ if (rdev == NULL)
+ continue;
+ if (!test_bit(In_sync, &rdev->flags))
continue;
+ if (!do_balance)
+ break;
+
/* This optimisation is debatable, and completely destroys
* sequential read speed for 'far copies' arrays. So only
* keep it for 'near' arrays, and review those later.
*/
- if (conf->near_copies > 1 && !atomic_read(&rdev->nr_pending)) {
- disk = ndisk;
- slot = nslot;
+ if (conf->near_copies > 1 && !atomic_read(&rdev->nr_pending))
break;
- }
/* for far > 1 always use the lowest address */
if (conf->far_copies > 1)
- new_distance = r10_bio->devs[nslot].addr;
+ new_distance = r10_bio->devs[slot].addr;
else
- new_distance = abs(r10_bio->devs[nslot].addr -
- conf->mirrors[ndisk].head_position);
- if (new_distance < current_distance) {
- current_distance = new_distance;
- disk = ndisk;
- slot = nslot;
+ new_distance = abs(r10_bio->devs[slot].addr -
+ conf->mirrors[disk].head_position);
+ if (new_distance < best_dist) {
+ best_dist = new_distance;
+ best_disk = disk;
+ best_slot = slot;
}
}
+ if (slot == conf->copies) {
+ disk = best_disk;
+ slot = best_slot;
+ }
-rb_out:
- r10_bio->read_slot = slot;
-/* conf->next_seq_sect = this_sector + sectors;*/
-
- if (disk >= 0 && (rdev=rcu_dereference(conf->mirrors[disk].rdev))!= NULL)
- atomic_inc(&conf->mirrors[disk].rdev->nr_pending);
- else
+ if (disk >= 0) {
+ rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ if (!rdev)
+ goto retry;
+ atomic_inc(&rdev->nr_pending);
+ if (test_bit(Faulty, &rdev->flags)) {
+ /* Cannot risk returning a device that failed
+ * before we inc'ed nr_pending
+ */
+ rdev_dec_pending(rdev, conf->mddev);
+ goto retry;
+ }
+ r10_bio->read_slot = slot;
+ } else
disk = -1;
rcu_read_unlock();
* [md PATCH 03/16] md: don't allow arrays to contain devices with bad blocks.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.
This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 7 +++++++
drivers/md/raid10.c | 8 ++++++++
drivers/md/raid5.c | 7 +++++++
3 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a948da8..82440a7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1153,6 +1153,9 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = mddev->raid_disks - 1;
+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
@@ -2117,6 +2120,10 @@ static int run(mddev_t *mddev)
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid1: Cannot handle bad blocks yet\n");
+ return -EINVAL;
+ }
}
mddev->degraded = 0;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0372499..20da258 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1129,6 +1129,9 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;
+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
* very different from resync
@@ -2296,6 +2299,11 @@ static int run(mddev_t *mddev)
(conf->raid_disks / conf->near_copies));
list_for_each_entry(rdev, &mddev->disks, same_set) {
+
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid10: cannot handle bad blocks yet\n");
+ goto out_free_conf;
+ }
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
|| disk_idx < 0)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8ac122d..5ec9792 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5002,6 +5002,10 @@ static int run(mddev_t *mddev)
* 0 for a fully functional array, 1 or 2 for a degraded array.
*/
list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (rdev->badblocks.count) {
+ printk(KERN_ERR "md/raid5: cannot handle bad blocks yet\n");
+ goto abort;
+ }
if (rdev->raid_disk < 0)
continue;
if (test_bit(In_sync, &rdev->flags))
@@ -5310,6 +5314,9 @@ static int raid5_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = conf->raid_disks - 1;
+ if (rdev->badblocks.count)
+ return -EINVAL;
+
if (mddev->degraded > conf->max_degraded)
/* no point adding a device */
return -EINVAL;
* [md PATCH 04/16] md: load/store badblock list from v1.x metadata
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
Space must have been allocated when the array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.
We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.
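
The on-disk format packs each entry as (sector << 10) | count, with an
all-ones word terminating the list; that is what the memset to 0xff
and the "bb + 1 == 0" test below implement. A stand-alone decoder
sketch mirroring the super_1_load loop, assuming the words have
already been converted from little-endian:

#include <stdio.h>
#include <stdint.h>

static void decode_bblog(const uint64_t *bbp, int nwords, int shift)
{
	int i;

	for (i = 0; i < nwords; i++) {
		uint64_t bb = bbp[i];

		if (bb + 1 == 0)	/* 0xffff... ends the log */
			break;
		/* 54-bit sector and 10-bit count, both scaled by
		 * the block-size shift from the superblock
		 */
		printf("bad: sector %llu, %llu sectors\n",
		       (unsigned long long)((bb >> 10) << shift),
		       (unsigned long long)((bb & 0x3ff) << shift));
	}
}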
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 103 ++++++++++++++++++++++++++++++++++++++++++---
drivers/md/md.h | 5 ++
include/linux/raid/md_p.h | 13 ++++--
3 files changed, 110 insertions(+), 11 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6ba2253..63b185e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -671,6 +671,10 @@ static void free_disk_sb(mdk_rdev_t * rdev)
rdev->sb_start = 0;
rdev->sectors = 0;
}
+ if (rdev->bb_page) {
+ put_page(rdev->bb_page);
+ rdev->bb_page = NULL;
+ }
}
@@ -1433,6 +1437,46 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
else
rdev->desc_nr = le32_to_cpu(sb->dev_number);
+ if (!rdev->bb_page) {
+ rdev->bb_page = alloc_page(GFP_KERNEL);
+ if (!rdev->bb_page)
+ return -ENOMEM;
+ }
+ if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BAD_BLOCKS) &&
+ rdev->badblocks.count == 0) {
+ /* need to load the bad block list.
+ * Currently we limit it to one page.
+ */
+ s32 offset;
+ sector_t bb_sector;
+ u64 *bbp;
+ int i;
+ int sectors = le16_to_cpu(sb->bblog_size);
+ if (sectors > (PAGE_SIZE / 512))
+ return -EINVAL;
+ offset = le32_to_cpu(sb->bblog_offset);
+ if (offset == 0)
+ return -EINVAL;
+ bb_sector = rdev->sb_start + (long long)offset;
+ if (!sync_page_io(rdev->bdev, bb_sector, sectors << 9,
+ rdev->bb_page, READ))
+ return -EIO;
+ bbp = (u64 *)page_address(rdev->bb_page);
+ rdev->badblocks.shift = sb->bblog_shift;
+ for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) {
+ u64 bb = le64_to_cpu(*bbp);
+ int count = bb & (0x3ff);
+ u64 sector = bb >> 10;
+ sector <<= sb->bblog_shift;
+ count <<= sb->bblog_shift;
+ if (bb + 1 == 0)
+ break;
+ if (md_set_badblocks(&rdev->badblocks,
+ sector, count, 1) == 0)
+ return -EINVAL;
+ }
+ }
+
if (!refdev) {
ret = 1;
} else {
@@ -1586,7 +1630,6 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
sb->pad0 = 0;
sb->recovery_offset = cpu_to_le64(0);
memset(sb->pad1, 0, sizeof(sb->pad1));
- memset(sb->pad2, 0, sizeof(sb->pad2));
memset(sb->pad3, 0, sizeof(sb->pad3));
sb->utime = cpu_to_le64((__u64)mddev->utime);
@@ -1626,6 +1669,38 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
sb->new_chunk = cpu_to_le32(mddev->new_chunk_sectors);
}
+ if (rdev->badblocks.count > 0) {
+ int havelock = 0;
+ struct badblocks *bb = &rdev->badblocks;
+ u64 *bbp = (u64 *)page_address(rdev->bb_page);
+ u64 *p;
+ sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS);
+ if (bb->changed) {
+ memset(bbp, 0xff, PAGE_SIZE);
+
+ rcu_read_lock();
+ p = rcu_dereference(bb->active_page);
+ if (!p) {
+ spin_lock(&bb->lock);
+ p = bb->page;
+ havelock = 1;
+ }
+ for (i = 0 ; i < bb->count ; i++) {
+ u64 internal_bb = *p++;
+ u64 store_bb = ((BB_OFFSET(internal_bb) << 10)
+ | BB_LEN(internal_bb));
+ *bbp++ = cpu_to_le64(store_bb);
+ }
+ bb->sector = (rdev->sb_start +
+ (int)le32_to_cpu(sb->bblog_offset));
+ bb->size = le16_to_cpu(sb->bblog_size);
+ bb->changed = 0;
+ if (havelock)
+ spin_unlock(&bb->lock);
+ rcu_read_unlock();
+ }
+ }
+
max_dev = 0;
list_for_each_entry(rdev2, &mddev->disks, same_set)
if (rdev2->desc_nr+1 > max_dev)
@@ -2164,6 +2239,7 @@ static void md_update_sb(mddev_t * mddev, int force_change)
mdk_rdev_t *rdev;
int sync_req;
int nospares = 0;
+ int any_badblocks_changed = 0;
mddev->utime = get_seconds();
if (mddev->external)
@@ -2232,6 +2308,11 @@ repeat:
wake_up(&mddev->sb_wait);
return;
}
+
+ list_for_each_entry(rdev, &mddev->disks, same_set)
+ if (rdev->badblocks.changed)
+ any_badblocks_changed++;
+
sync_sbs(mddev, nospares);
spin_unlock_irq(&mddev->write_lock);
@@ -2257,6 +2338,13 @@ repeat:
bdevname(rdev->bdev,b),
(unsigned long long)rdev->sb_start);
rdev->sb_events = mddev->events;
+ if (rdev->badblocks.size) {
+ md_super_write(mddev, rdev,
+ rdev->badblocks.sector,
+ rdev->badblocks.size << 9,
+ rdev->bb_page);
+ rdev->badblocks.size = 0;
+ }
} else
dprintk(")\n");
@@ -2280,6 +2368,9 @@ repeat:
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+ if (any_badblocks_changed)
+ list_for_each_entry(rdev, &mddev->disks, same_set)
+ md_ack_all_badblocks(&rdev->badblocks);
}
/* words written to sysfs files may, or may not, be \n terminated.
@@ -2784,6 +2875,8 @@ int md_rdev_init(mdk_rdev_t *rdev)
rdev->sb_events = 0;
rdev->last_read_error.tv_sec = 0;
rdev->last_read_error.tv_nsec = 0;
+ rdev->sb_loaded = 0;
+ rdev->bb_page = NULL;
atomic_set(&rdev->nr_pending, 0);
atomic_set(&rdev->read_errors, 0);
atomic_set(&rdev->corrected_errors, 0);
@@ -2871,11 +2964,9 @@ static mdk_rdev_t *md_import_device(dev_t newdev, int super_format, int super_mi
return rdev;
abort_free:
- if (rdev->sb_page) {
- if (rdev->bdev)
- unlock_rdev(rdev);
- free_disk_sb(rdev);
- }
+ if (rdev->bdev)
+ unlock_rdev(rdev);
+ free_disk_sb(rdev);
kfree(rdev);
return ERR_PTR(err);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a24e131..087764b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -68,7 +68,7 @@ struct mdk_rdev_s
struct block_device *bdev; /* block device handle */
- struct page *sb_page;
+ struct page *sb_page, *bb_page;
int sb_loaded;
__u64 sb_events;
sector_t data_offset; /* start of data in array */
@@ -139,6 +139,9 @@ struct mdk_rdev_s
u64 *active_page; /* either 'page' or 'NULL' */
int changed;
spinlock_t lock;
+
+ sector_t sector;
+ sector_t size; /* in sectors */
} badblocks;
};
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index ffa2efb..a2c23fd 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -245,10 +245,15 @@ struct mdp_superblock_1 {
__u8 device_uuid[16]; /* user-space setable, ignored by kernel */
__u8 devflags; /* per-device flags. Only one defined...*/
#define WriteMostly1 1 /* mask for writemostly flag in above */
- __u8 pad2[64-57]; /* set to 0 when writing */
+ /* bad block log. If there are any bad blocks the feature flag is set.
+ * if offset and size are non-zero, that space is reserved and available.
+ */
+ __u8 bblog_shift; /* shift from sectors to block size for badblocklist */
+ __le16 bblog_size; /* number of sectors reserved for badblocklist */
+ __le32 bblog_offset; /* sector offset from superblock to bblog, signed */
/* array state information - 64 bytes */
- __le64 utime; /* 40 bits second, 24 btes microseconds */
+ __le64 utime; /* 40 bits second, 24 bits microseconds */
__le64 events; /* incremented when superblock updated */
__le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
__le32 sb_csum; /* checksum upto devs[max_dev] */
@@ -270,8 +275,8 @@ struct mdp_superblock_1 {
* must be honoured
*/
#define MD_FEATURE_RESHAPE_ACTIVE 4
+#define MD_FEATURE_BAD_BLOCKS 8 /* badblock list is not empty */
-#define MD_FEATURE_ALL (1|2|4)
+#define MD_FEATURE_ALL (1|2|4|8)
#endif
-
* [md PATCH 12/16] md: make error_handler functions more uniform and correct.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
- there is no need to test_bit Faulty, as that was already done in
md_error which is the only caller of these functions.
- MD_CHANGE_DEVS should be set *after* faulty is set to ensure
metadata is updated correctly.
- spinlock should be held while updating ->degraded.
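
Condensed from the two conversions below, the resulting common shape
of an error handler (a fragment only; kernel context assumed, not
compilable on its own):

	/* md_error() already filtered out Faulty devices */
	if (test_and_clear_bit(In_sync, &rdev->flags)) {
		unsigned long flags;
		/* ->degraded is only consistent under device_lock */
		spin_lock_irqsave(&conf->device_lock, flags);
		mddev->degraded++;
		spin_unlock_irqrestore(&conf->device_lock, flags);
	}
	set_bit(Faulty, &rdev->flags);
	/* request a metadata update only after Faulty is set */
	set_bit(MD_CHANGE_DEVS, &mddev->flags);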
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/multipath.c | 40 ++++++++++++++++++++++------------------
drivers/md/raid5.c | 40 +++++++++++++++++++---------------------
2 files changed, 41 insertions(+), 39 deletions(-)
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index da3654a..5ec4ca7 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -216,6 +216,7 @@ static int multipath_congested(void *data, int bits)
static void multipath_error (mddev_t *mddev, mdk_rdev_t *rdev)
{
multipath_conf_t *conf = mddev->private;
+ char b[BDEVNAME_SIZE];
if (conf->raid_disks - mddev->degraded <= 1) {
/*
@@ -224,26 +225,27 @@ static void multipath_error (mddev_t *mddev, mdk_rdev_t *rdev)
* which has just failed.
*/
printk(KERN_ALERT
- "multipath: only one IO path left and IO error.\n");
+ "multipath: only one IO path left and IO error.\n");
/* leave it active... it's all we have */
- } else {
- /*
- * Mark disk as unusable
- */
- if (!test_bit(Faulty, &rdev->flags)) {
- char b[BDEVNAME_SIZE];
- clear_bit(In_sync, &rdev->flags);
- set_bit(Faulty, &rdev->flags);
- set_bit(MD_CHANGE_DEVS, &mddev->flags);
- mddev->degraded++;
- printk(KERN_ALERT "multipath: IO failure on %s,"
- " disabling IO path.\n"
- "multipath: Operation continuing"
- " on %d IO paths.\n",
- bdevname (rdev->bdev,b),
- conf->raid_disks - mddev->degraded);
- }
+ return;
+ }
+ /*
+ * Mark disk as unusable
+ */
+ if (test_and_clear_bit(In_sync, &rdev->flags)) {
+ unsigned long flags;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ mddev->degraded++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
}
+ set_bit(Faulty, &rdev->flags);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ printk(KERN_ALERT "multipath: IO failure on %s,"
+ " disabling IO path.\n"
+ "multipath: Operation continuing"
+ " on %d IO paths.\n",
+ bdevname(rdev->bdev, b),
+ conf->raid_disks - mddev->degraded);
}
static void print_multipath_conf (multipath_conf_t *conf)
@@ -303,9 +305,11 @@ static int multipath_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
PAGE_CACHE_SIZE - 1);
}
+ spin_lock_irq(&conf->device_lock);
mddev->degraded--;
rdev->raid_disk = path;
set_bit(In_sync, &rdev->flags);
+ spin_unlock_irq(&conf->device_lock);
rcu_assign_pointer(p->rdev, rdev);
err = 0;
md_integrity_add_rdev(rdev, mddev);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5ec9792..6fb36d8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1626,28 +1626,26 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
raid5_conf_t *conf = mddev->private;
pr_debug("raid456: error called\n");
- if (!test_bit(Faulty, &rdev->flags)) {
- set_bit(MD_CHANGE_DEVS, &mddev->flags);
- if (test_and_clear_bit(In_sync, &rdev->flags)) {
- unsigned long flags;
- spin_lock_irqsave(&conf->device_lock, flags);
- mddev->degraded++;
- spin_unlock_irqrestore(&conf->device_lock, flags);
- /*
- * if recovery was running, make sure it aborts.
- */
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- }
- set_bit(Faulty, &rdev->flags);
- printk(KERN_ALERT
- "md/raid:%s: Disk failure on %s, disabling device.\n"
- KERN_ALERT
- "md/raid:%s: Operation continuing on %d devices.\n",
- mdname(mddev),
- bdevname(rdev->bdev, b),
- mdname(mddev),
- conf->raid_disks - mddev->degraded);
+ if (test_and_clear_bit(In_sync, &rdev->flags)) {
+ unsigned long flags;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ mddev->degraded++;
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ /*
+ * if recovery was running, make sure it aborts.
+ */
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
+ set_bit(Faulty, &rdev->flags);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ printk(KERN_ALERT
+ "md/raid:%s: Disk failure on %s, disabling device.\n"
+ KERN_ALERT
+ "md/raid:%s: Operation continuing on %d devices.\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b),
+ mdname(mddev),
+ conf->raid_disks - mddev->degraded);
}
/*
* [md PATCH 10/16] md: add 'write_error' flag to component devices.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 12 ++++++++++++
drivers/md/md.h | 3 +++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8a888d5..20c6792 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2430,6 +2430,10 @@ state_show(mdk_rdev_t *rdev, char *page)
len += sprintf(page+len, "%sspare", sep);
sep = ",";
}
+ if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ len += sprintf(page+len, "%swrite_error", sep);
+ sep = ",";
+ }
return len+sprintf(page+len, "\n");
}
@@ -2444,6 +2448,8 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
* blocked - sets the Blocked flag
* -blocked - clears the Blocked flag
* insync - sets Insync providing device isn't active
+ * write_error - sets WriteErrorSeen
+ * -write_error - clears WriteErrorSeen
*/
int err = -EINVAL;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -2479,6 +2485,12 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
set_bit(In_sync, &rdev->flags);
err = 0;
+ } else if (cmd_match(buf, "write_error")) {
+ set_bit(WriteErrorSeen, &rdev->flags);
+ err = 0;
+ } else if (cmd_match(buf, "-write_error")) {
+ clear_bit(WriteErrorSeen, &rdev->flags);
+ err = 0;
}
if (!err)
sysfs_notify_dirent_safe(rdev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 087764b..a085df5 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -100,6 +100,9 @@ struct mdk_rdev_s
#define Blocked 8 /* An error occured on an externally
* managed array, don't allow writes
* until it is cleared */
+#define WriteErrorSeen 9 /* A write error has been seen on this
+ * device
+ */
wait_queue_head_t blocked_wait;
int desc_nr; /* descriptor index in the superblock */
* [md PATCH 11/16] md/multipath: discard ->working_disks in favour of ->degraded
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
conf->working_disks duplicates information already available
in mddev->degraded.
So remove working_disks.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/multipath.c | 22 +++++++++++-----------
drivers/md/multipath.h | 1 -
2 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 410fb60..da3654a 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -176,7 +176,7 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
int i;
seq_printf (seq, " [%d/%d] [", conf->raid_disks,
- conf->working_disks);
+ conf->raid_disks - mddev->degraded);
for (i = 0; i < conf->raid_disks; i++)
seq_printf (seq, "%s",
conf->multipaths[i].rdev &&
@@ -217,7 +217,7 @@ static void multipath_error (mddev_t *mddev, mdk_rdev_t *rdev)
{
multipath_conf_t *conf = mddev->private;
- if (conf->working_disks <= 1) {
+ if (conf->raid_disks - mddev->degraded <= 1) {
/*
* Uh oh, we can do nothing if this is our last path, but
* first check if this is a queued request for a device
@@ -235,14 +235,13 @@ static void multipath_error (mddev_t *mddev, mdk_rdev_t *rdev)
clear_bit(In_sync, &rdev->flags);
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
- conf->working_disks--;
mddev->degraded++;
printk(KERN_ALERT "multipath: IO failure on %s,"
" disabling IO path.\n"
"multipath: Operation continuing"
" on %d IO paths.\n",
bdevname (rdev->bdev,b),
- conf->working_disks);
+ conf->raid_disks - mddev->degraded);
}
}
}
@@ -257,7 +256,7 @@ static void print_multipath_conf (multipath_conf_t *conf)
printk("(conf==NULL)\n");
return;
}
- printk(" --- wd:%d rd:%d\n", conf->working_disks,
+ printk(" --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded,
conf->raid_disks);
for (i = 0; i < conf->raid_disks; i++) {
@@ -304,7 +303,6 @@ static int multipath_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
PAGE_CACHE_SIZE - 1);
}
- conf->working_disks++;
mddev->degraded--;
rdev->raid_disk = path;
set_bit(In_sync, &rdev->flags);
@@ -421,6 +419,7 @@ static int multipath_run (mddev_t *mddev)
int disk_idx;
struct multipath_info *disk;
mdk_rdev_t *rdev;
+ int working_disks;
if (md_check_no_bitmap(mddev))
return -EINVAL;
@@ -455,7 +454,7 @@ static int multipath_run (mddev_t *mddev)
goto out_free_conf;
}
- conf->working_disks = 0;
+ working_disks = 0;
list_for_each_entry(rdev, &mddev->disks, same_set) {
disk_idx = rdev->raid_disk;
if (disk_idx < 0 ||
@@ -477,7 +476,7 @@ static int multipath_run (mddev_t *mddev)
}
if (!test_bit(Faulty, &rdev->flags))
- conf->working_disks++;
+ working_disks++;
}
conf->raid_disks = mddev->raid_disks;
@@ -485,12 +484,12 @@ static int multipath_run (mddev_t *mddev)
spin_lock_init(&conf->device_lock);
INIT_LIST_HEAD(&conf->retry_list);
- if (!conf->working_disks) {
+ if (!working_disks) {
printk(KERN_ERR "multipath: no operational IO paths for %s\n",
mdname(mddev));
goto out_free_conf;
}
- mddev->degraded = conf->raid_disks - conf->working_disks;
+ mddev->degraded = conf->raid_disks - working_disks;
conf->pool = mempool_create_kmalloc_pool(NR_RESERVED_BUFS,
sizeof(struct multipath_bh));
@@ -512,7 +511,8 @@ static int multipath_run (mddev_t *mddev)
printk(KERN_INFO
"multipath: array %s active with %d out of %d IO paths\n",
- mdname(mddev), conf->working_disks, mddev->raid_disks);
+ mdname(mddev), conf->raid_disks - mddev->degraded,
+ mddev->raid_disks);
/*
* Ok, everything is just fine now
*/
diff --git a/drivers/md/multipath.h b/drivers/md/multipath.h
index d1c2a8d..3c5a45e 100644
--- a/drivers/md/multipath.h
+++ b/drivers/md/multipath.h
@@ -9,7 +9,6 @@ struct multipath_private_data {
mddev_t *mddev;
struct multipath_info *multipaths;
int raid_disks;
- int working_disks;
spinlock_t device_lock;
struct list_head retry_list;
* [md PATCH 08/16] md/raid1: avoid reading from known bad blocks.
From: NeilBrown @ 2010-06-07 0:07 UTC
To: linux-raid
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio.
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.
This does not yet handle bad blocks when reading to perform resync,
recovery, or check.
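To make the per-bio accounting concrete, here is a minimal user-space
sketch of the counting scheme (struct master, split_once and
complete_piece are names invented for this example; the real code keeps
the count in bio->bi_phys_segments and updates it under
conf->device_lock):

#include <stdio.h>

/* Minimal model of the read accounting described above.  A count of
 * zero means the request was never split, so a single completion
 * finishes it; otherwise the count holds the number of incomplete
 * pieces and the last completion ends the master request.
 */
struct master {
	int pending;		/* stands in for bio->bi_phys_segments */
	int error;
};

static void split_once(struct master *m)
{
	if (m->pending == 0)
		m->pending = 2;	/* the piece just issued, plus the rest */
	else
		m->pending++;
}

static void complete_piece(struct master *m, int ok)
{
	if (!ok)
		m->error = 1;
	if (m->pending == 0 || --m->pending == 0)
		printf("master bio ends: %s\n", m->error ? "-EIO" : "0");
}

int main(void)
{
	struct master m = { 0, 0 };

	split_once(&m);		/* a bad block forced one split: 2 pieces */
	complete_piece(&m, 1);
	complete_piece(&m, 1);	/* last piece ends the master request */
	return 0;
}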
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 167 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fa62c7b..a5a3f01 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -237,14 +237,28 @@ static void raid_end_bio_io(r1bio_t *r1_bio)
/* if nobody has done the final endio yet, do it now */
if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ int done;
+
PRINTK(KERN_DEBUG "raid1: sync end %s on sectors %llu-%llu\n",
(bio_data_dir(bio) == WRITE) ? "write" : "read",
(unsigned long long) bio->bi_sector,
(unsigned long long) bio->bi_sector +
(bio->bi_size >> 9) - 1);
- bio_endio(bio,
- test_bit(R1BIO_Uptodate, &r1_bio->state) ? 0 : -EIO);
+ if (bio->bi_phys_segments) {
+ unsigned long flags;
+ conf_t *conf = r1_bio->mddev->private;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ bio->bi_phys_segments--;
+ done = (bio->bi_phys_segments == 0);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ } else
+ done = 1;
+
+ if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ if (done)
+ bio_endio(bio, 0);
}
free_r1bio(r1_bio);
}
@@ -417,10 +431,11 @@ static void raid1_end_write_request(struct bio *bio, int error)
*
* The rdev for the device selected will have nr_pending incremented.
*/
-static int read_balance(conf_t *conf, r1bio_t *r1_bio)
+static int read_balance(conf_t *conf, r1bio_t *r1_bio, int *max_sectors)
{
const sector_t this_sector = r1_bio->sector;
- const int sectors = r1_bio->sectors;
+ int sectors;
+ int best_good_sectors;
int do_balance;
int disk;
int start_disk;
@@ -436,9 +451,12 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
* We take the first readable disk when above the resync window.
*/
retry:
+ sectors = r1_bio->sectors;
disk = -1;
best_disk = -1;
best_dist = MaxSector;
+ best_good_sectors = 0;
+
if (conf->mddev->recovery_cp < MaxSector &&
(this_sector + sectors >= conf->next_resync)) {
/* just choose the first */
@@ -451,6 +469,8 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
}
for (i = 0; i < conf->raid_disks; i++) {
sector_t dist;
+ sector_t first_bad;
+ int bad_sectors;
disk = (start_disk + i) % conf->raid_disks;
if (r1_bio->bios[disk] == IO_BLOCKED)
@@ -474,6 +494,30 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
/* This is a reasonable device to use. It might
* even be best.
*/
+ if (is_badblock(rdev, this_sector, sectors,
+ &first_bad, &bad_sectors)) {
+ if (best_dist < MaxSector)
+ /* already have a better device */
+ continue;
+ if (first_bad <= this_sector) {
+ /* cannot read here. If this is the 'primary'
+ * device, then we must not read beyond
+ * bad_sectors from another device.
+ */
+ bad_sectors -= (this_sector - first_bad);
+ if (!do_balance && sectors > bad_sectors)
+ sectors = bad_sectors;
+ } else {
+ sector_t good_sectors = first_bad - this_sector;
+ if (good_sectors > best_good_sectors) {
+ best_good_sectors = good_sectors;
+ best_disk = disk;
+ }
+ }
+ continue;
+ } else
+ best_good_sectors = sectors;
+
if (!do_balance)
break;
@@ -513,6 +557,7 @@ static int read_balance(conf_t *conf, r1bio_t *r1_bio)
conf->last_used = disk;
}
rcu_read_unlock();
+ *max_sectors = sectors;
return disk;
}
@@ -751,6 +796,38 @@ do_sync_io:
return NULL;
}
+static void trim_bio(struct bio *bio, int offset, int size)
+{
+ /* 'bio' is a cloned bio which we need to trim to match
+ * the given offset and size.
+ * This requires adjusting bi_sector, bi_size, and bi_io_vec
+ */
+ if (offset == 0 && (size << 9) == bio->bi_size)
+ return;
+
+ bio->bi_sector += offset;
+ bio->bi_size = size << 9;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
+ while (bio->bi_idx < bio->bi_vcnt &&
+ bio->bi_io_vec[bio->bi_idx].bv_len < (offset << 9)) {
+ /* remove this whole bio_vec */
+ offset -= bio->bi_io_vec[bio->bi_idx].bv_len >> 9;
+ bio->bi_idx++;
+ }
+ if (bio->bi_idx < bio->bi_vcnt) {
+ bio->bi_io_vec[bio->bi_idx].bv_offset += (offset << 9);
+ bio->bi_io_vec[bio->bi_idx].bv_len -= (offset << 9);
+ }
+ /* avoid any complications with bi_idx being non-zero */
+ if (bio->bi_idx) {
+ memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
+ (bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
+ bio->bi_vcnt -= bio->bi_idx;
+ bio->bi_idx = 0;
+ }
+}
+
static int make_request(mddev_t *mddev, struct bio * bio)
{
conf_t *conf = mddev->private;
@@ -822,11 +899,26 @@ static int make_request(mddev_t *mddev, struct bio * bio)
r1_bio->mddev = mddev;
r1_bio->sector = bio->bi_sector;
+ /* We might need to issue multiple reads to different
+ * devices if there are bad blocks around, so we keep
+ * track of the number of reads in bio->bi_phys_segments.
+ * If this is 0, then there is only one r1_bio, and no locking is
+ * needed to increment it.
+ * If it is non-zero, that is the number of incomplete
+ * requests.
+ */
+ bio->bi_phys_segments = 0;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
if (rw == READ) {
/*
* read balancing logic:
*/
- int rdisk = read_balance(conf, r1_bio);
+ int max_sectors;
+ int rdisk;
+
+ read_again:
+ rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
/* couldn't find anywhere to read from */
@@ -847,6 +939,8 @@ static int make_request(mddev_t *mddev, struct bio * bio)
r1_bio->read_disk = rdisk;
read_bio = bio_clone(bio, GFP_NOIO);
+ trim_bio(read_bio, r1_bio->sector - bio->bi_sector,
+ max_sectors);
r1_bio->bios[rdisk] = read_bio;
@@ -856,7 +950,33 @@ static int make_request(mddev_t *mddev, struct bio * bio)
read_bio->bi_rw = READ | (do_sync << BIO_RW_SYNCIO);
read_bio->bi_private = r1_bio;
- generic_make_request(read_bio);
+ if (max_sectors < r1_bio->sectors) {
+ /* could not read all from this device, so we will
+ * need another r1_bio.
+ */
+ int sectors_handled;
+
+ sectors_handled = (r1_bio->sector + max_sectors
+ - bio->bi_sector);
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ generic_make_request(read_bio);
+
+ r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+ r1_bio->master_bio = bio;
+ r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
+ r1_bio->state = 0;
+ r1_bio->mddev = mddev;
+ r1_bio->sector = bio->bi_sector + sectors_handled;
+ goto read_again;
+ } else
+ generic_make_request(read_bio);
return 0;
}
@@ -1487,7 +1607,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
*
* 1. Retries failed read operations on working mirrors.
* 2. Updates the raid superblock when problems encounter.
- * 3. Performs writes following reads for array syncronising.
+ * 3. Performs writes following reads for array synchronising.
*/
static void fix_read_error(conf_t *conf, int read_disk,
@@ -1510,9 +1630,14 @@ static void fix_read_error(conf_t *conf, int read_disk,
* which is the thread that might remove
* a device. If raid1d ever becomes multi-threaded....
*/
+ sector_t first_bad;
+ int bad_sectors;
+
rdev = conf->mirrors[d].rdev;
if (rdev &&
test_bit(In_sync, &rdev->flags) &&
+ is_badblock(rdev, sect, s,
+ &first_bad, &bad_sectors) == 0 &&
sync_page_io(rdev->bdev,
sect + rdev->data_offset,
s<<9,
@@ -1649,6 +1774,7 @@ static void raid1d(mddev_t *mddev)
}
} else {
int disk;
+ int max_sectors;
/* we got a read error. Maybe the drive is bad. Maybe just
* the block and we can fix it.
@@ -1669,7 +1795,8 @@ static void raid1d(mddev_t *mddev)
conf->mirrors[r1_bio->read_disk].rdev);
bio = r1_bio->bios[r1_bio->read_disk];
- if ((disk=read_balance(conf, r1_bio)) == -1) {
+ disk = read_balance(conf, r1_bio, &max_sectors);
+ if (disk == -1) {
printk(KERN_ALERT "md/raid1:%s: %s: unrecoverable I/O"
" read error for block %llu\n",
mdname(mddev),
@@ -1683,6 +1810,8 @@ static void raid1d(mddev_t *mddev)
r1_bio->read_disk = disk;
bio_put(bio);
bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
+ trim_bio(bio, r1_bio->sector - bio->bi_sector,
+ max_sectors);
r1_bio->bios[r1_bio->read_disk] = bio;
rdev = conf->mirrors[disk].rdev;
if (printk_ratelimit())
@@ -1697,7 +1826,36 @@ static void raid1d(mddev_t *mddev)
bio->bi_rw = READ | (do_sync << BIO_RW_SYNCIO);
bio->bi_private = r1_bio;
unplug = 1;
- generic_make_request(bio);
+ if (max_sectors < r1_bio->sectors) {
+ /* Drat - have to split this up more */
+ struct bio *mbio = r1_bio->master_bio;
+ int sectors_handled =
+ r1_bio->sector + max_sectors
+ - mbio->bi_sector;
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (mbio->bi_phys_segments == 0)
+ mbio->bi_phys_segments = 2;
+ else
+ mbio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ generic_make_request(bio);
+
+ r1_bio = mempool_alloc(conf->r1bio_pool,
+ GFP_NOIO);
+
+ r1_bio->master_bio = mbio;
+ r1_bio->sectors = (mbio->bi_size >> 9)
+ - sectors_handled;
+ r1_bio->state = 0;
+ r1_bio->mddev = mddev;
+ r1_bio->sector = mbio->bi_sector
+ + sectors_handled;
+
+ reschedule_retry(r1_bio);
+
+ } else
+ generic_make_request(bio);
}
}
cond_resched();
* [md PATCH 09/16] md/raid1: avoid reading known bad blocks during resync
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (10 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 08/16] md/raid1: avoid reading from known bad blocks NeilBrown
@ 2010-06-07 0:07 ` NeilBrown
2010-06-07 0:07 ` [md PATCH 16/16] md/raid1: Handle write errors by updating badblock log NeilBrown
` (5 subsequent siblings)
17 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2010-06-07 0:07 UTC (permalink / raw)
To: linux-raid
When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.
Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.
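As a stand-alone illustration of the clamping this adds to
sync_request, the per-device step looks roughly like the following
(fold_device is an invented name; the (hit, first_bad, bad_sectors)
triple mirrors the calling convention of is_badblock in the diff
below):

#include <stdio.h>

typedef unsigned long long sector_t;

/* Shrink the resync window using one device's bad-block answer: stop
 * the window before the first bad sector, or, when the window already
 * starts inside a bad range on this device, remember the smallest such
 * run in min_bad (if no device is readable at all, those sectors get
 * recorded as bad on the write targets instead).
 */
static void fold_device(sector_t sector_nr, int hit,
			sector_t first_bad, int bad_sectors,
			int *good_sectors, int *min_bad)
{
	if (!hit)
		return;
	if (first_bad > sector_nr) {
		if (*good_sectors > (int)(first_bad - sector_nr))
			*good_sectors = (int)(first_bad - sector_nr);
	} else {
		bad_sectors -= (int)(sector_nr - first_bad);
		if (*min_bad == 0 || *min_bad > bad_sectors)
			*min_bad = bad_sectors;
	}
}

int main(void)
{
	sector_t sector_nr = 40;
	int good_sectors = 128;	/* stand-in for RESYNC_SECTORS */
	int min_bad = 0;

	fold_device(sector_nr, 1, 60, 16, &good_sectors, &min_bad);
	fold_device(sector_nr, 1, 32, 24, &good_sectors, &min_bad);
	printf("window: %d good sectors, min_bad %d\n",
	       good_sectors, min_bad);
	return 0;
}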
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 97 +++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 76 insertions(+), 21 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a5a3f01..8429c63 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1249,9 +1249,6 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = mddev->raid_disks - 1;
- if (rdev->badblocks.count)
- return -EINVAL;
-
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
@@ -1488,6 +1485,9 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
* We don't need to freeze the array, because being in an
* active sync request, there is no normal IO, and
* no overlapping syncs.
+ * We don't need to check is_badblock again, as we
+ * made sure that anything with a bad block in range
+ * will have bi_end_io clear.
*/
sector_t sect = r1_bio->sector;
int sectors = r1_bio->sectors;
@@ -1901,6 +1901,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
int write_targets = 0, read_targets = 0;
int sync_blocks;
int still_degraded = 0;
+ int good_sectors = RESYNC_SECTORS;
+ int min_bad = 0; /* number of sectors that are bad in all devices */
if (!conf->r1buf_pool)
if (init_resync(conf))
@@ -1988,34 +1990,89 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
if (rdev == NULL ||
test_bit(Faulty, &rdev->flags)) {
still_degraded = 1;
- continue;
} else if (!test_bit(In_sync, &rdev->flags)) {
bio->bi_rw = WRITE;
bio->bi_end_io = end_sync_write;
write_targets ++;
} else {
/* may need to read from here */
- bio->bi_rw = READ;
- bio->bi_end_io = end_sync_read;
- if (test_bit(WriteMostly, &rdev->flags)) {
- if (wonly < 0)
- wonly = i;
- } else {
- if (disk < 0)
- disk = i;
+ sector_t first_bad = MaxSector;
+ int bad_sectors;
+
+ if (is_badblock(rdev, sector_nr, good_sectors,
+ &first_bad, &bad_sectors)) {
+ if (first_bad > sector_nr)
+ good_sectors = first_bad - sector_nr;
+ else {
+ bad_sectors -= (sector_nr - first_bad);
+ if (min_bad == 0 ||
+ min_bad > bad_sectors)
+ min_bad = bad_sectors;
+ }
+ }
+ if (sector_nr < first_bad) {
+ if (test_bit(WriteMostly, &rdev->flags)) {
+ if (wonly < 0)
+ wonly = i;
+ } else {
+ if (disk < 0)
+ disk = i;
+ }
+ bio->bi_rw = READ;
+ bio->bi_end_io = end_sync_read;
+ read_targets++;
}
- read_targets++;
}
- atomic_inc(&rdev->nr_pending);
- bio->bi_sector = sector_nr + rdev->data_offset;
- bio->bi_bdev = rdev->bdev;
- bio->bi_private = r1_bio;
+ if (bio->bi_end_io) {
+ atomic_inc(&rdev->nr_pending);
+ bio->bi_sector = sector_nr + rdev->data_offset;
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_private = r1_bio;
+ }
}
rcu_read_unlock();
if (disk < 0)
disk = wonly;
r1_bio->read_disk = disk;
+ if (read_targets == 0 && min_bad > 0) {
+ /* These sectors are bad on all InSync devices, so we
+ * need to mark them bad on all write targets
+ */
+ int ok = 1;
+ for (i = 0 ; i < conf->raid_disks ; i++)
+ if (r1_bio->bios[i]->bi_end_io == end_sync_write) {
+ mdk_rdev_t *rdev =
+ rcu_dereference(conf->mirrors[i].rdev);
+ ok = md_set_badblocks(
+ &rdev->badblocks,
+ sector_nr + rdev->data_offset,
+ min_bad, 0
+ ) && ok;
+ }
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ *skipped = 1;
+ put_buf(r1_bio);
+
+ if (!ok) {
+ /* Cannot record the badblocks, so need to
+ * abort the resync.
+ * If there are multiple read targets, could just
+ * fail the really bad ones ???
+ */
+ mddev->recovery_disabled = 1;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ return 0;
+ } else
+ return min_bad;
+
+ }
+ if (min_bad > 0 && min_bad < good_sectors) {
+ /* only resync enough to read the next bad->good
+ * transition */
+ good_sectors = min_bad;
+ }
+
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && read_targets > 0)
/* extra read targets are also write targets */
write_targets += read_targets-1;
@@ -2032,6 +2089,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
if (max_sector > mddev->resync_max)
max_sector = mddev->resync_max; /* Don't do IO beyond here */
+ if (max_sector > sector_nr + good_sectors)
+ max_sector = sector_nr + good_sectors;
nr_sectors = 0;
sync_blocks = 0;
do {
@@ -2254,10 +2313,6 @@ static int run(mddev_t *mddev)
blk_queue_segment_boundary(mddev->queue,
PAGE_CACHE_SIZE - 1);
}
- if (rdev->badblocks.count) {
- printk(KERN_ERR "md/raid1: Cannot handle bad blocks yet\n");
- return -EINVAL;
- }
}
mddev->degraded = 0;
* [md PATCH 16/16] md/raid1: Handle write errors by updating badblock log.
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (11 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 09/16] md/raid1: avoid reading known bad blocks during resync NeilBrown
@ 2010-06-07 0:07 ` NeilBrown
2010-06-07 0:07 ` [md PATCH 15/16] md/raid1: clear bad-block record when write succeeds NeilBrown
` (4 subsequent siblings)
17 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2010-06-07 0:07 UTC (permalink / raw)
To: linux-raid
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.
As the write may well cover many blocks, we try writing each
block individually and log only the ones which fail.
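The narrowing loop reduces to: round the start up to the next
badblock-granularity boundary, then write one block at a time and
record each block that fails. A user-space sketch of just that
arithmetic (write_block and record_bad are invented stand-ins for the
cloned synchronous write and md_set_badblocks):

#include <stdio.h>

typedef unsigned long long sector_t;

static int write_block(sector_t sector, int sectors)
{
	/* stand-in for the trimmed synchronous write: pretend the
	 * media fails for sectors 96..103 */
	return !(sector <= 103 && sector + sectors > 96);
}

static void record_bad(sector_t sector, int sectors)
{
	printf("log bad block: %llu +%d\n", sector, sectors);
}

int main(void)
{
	int block_sectors = 1 << 3;	/* badblocks.shift of 3: 8-sector blocks */
	sector_t sector = 90;		/* start of the failed write */
	int sect_to_write = 30;		/* its length */
	/* the first piece ends at the next block boundary, so every
	 * later piece is exactly one badblock-sized block */
	int sectors = (int)(((sector + block_sectors) &
			     ~(sector_t)(block_sectors - 1)) - sector);

	while (sect_to_write) {
		if (sectors > sect_to_write)
			sectors = sect_to_write;
		if (!write_block(sector, sectors))
			record_bad(sector, sectors);	/* only 96 +8 here */
		sect_to_write -= sectors;
		sector += sectors;
		sectors = block_sectors;
	}
	return 0;
}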
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++---
drivers/md/raid1.h | 3 +-
2 files changed, 100 insertions(+), 7 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e23ec40..a6cc50c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -342,12 +342,10 @@ static void raid1_end_write_request(struct bio *bio, int error)
/*
* this branch is our 'one mirror IO has finished' event handler:
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
+ set_bit(WriteErrorSeen,
+ &conf->mirrors[mirror].rdev->flags);
+ set_bit(R1BIO_WriteError, &r1_bio->state);
} else {
/*
* Set R1BIO_Uptodate in our master bio, so that
@@ -359,6 +357,8 @@ static void raid1_end_write_request(struct bio *bio, int error)
* wait for the 'master' bio.
*/
set_bit(R1BIO_Uptodate, &r1_bio->state);
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
/* Maybe we can clear some bad blocks. */
if (is_badblock(conf->mirrors[mirror].rdev,
@@ -403,7 +403,8 @@ static void raid1_end_write_request(struct bio *bio, int error)
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
+ if (test_bit(R1BIO_BarrierRetry, &r1_bio->state) ||
+ test_bit(R1BIO_WriteError, &r1_bio->state))
reschedule_retry(r1_bio);
else {
/* it really is the end of this request */
@@ -1800,6 +1801,75 @@ static void fix_read_error(conf_t *conf, int read_disk,
}
}
+
+static void bi_complete(struct bio *bio, int error)
+{
+ complete((struct completion *)bio->bi_private);
+}
+
+static int submit_bio_wait(int rw, struct bio *bio)
+{
+ struct completion event;
+ rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);
+
+ init_completion(&event);
+ bio->bi_private = &event;
+ bio->bi_end_io = bi_complete;
+ submit_bio(rw, bio);
+ wait_for_completion(&event);
+
+ return test_bit(BIO_UPTODATE, &bio->bi_flags);
+}
+
+static int narrow_write_error(r1bio_t *r1_bio, int i)
+{
+ struct bio *bio = r1_bio->bios[i];
+ mddev_t *mddev = r1_bio->mddev;
+ conf_t *conf = mddev->private;
+ mdk_rdev_t *rdev = conf->mirrors[i].rdev;
+ /* bio has the data to be written to device 'i' where
+ * we just recently had a write error.
+ * We repeatedly clone the bio and trim it down to one block,
+ * then try the write. Where the write fails we record
+ * a bad block.
+ * It is conceivable that the bio doesn't exactly align with
+ * blocks. We must handle this somehow.
+ *
+ * We currently own a reference on the rdev.
+ */
+
+ int block_sectors = 1 << rdev->badblocks.shift;
+ sector_t sector = ((r1_bio->sector + block_sectors)
+ & ~(sector_t)(block_sectors - 1));
+ int sectors = sector - r1_bio->sector;
+ int sect_to_write = r1_bio->sectors;
+ int ok = 1;
+
+ sector = r1_bio->sector;
+
+ while (sect_to_write) {
+ struct bio *wbio;
+ if (sectors > sect_to_write)
+ sectors = sect_to_write;
+ /* Write at 'sector' for 'sectors'*/
+ wbio = bio_clone(bio, GFP_NOIO);
+ trim_bio(wbio, sector - bio->bi_sector, sectors);
+ wbio->bi_bdev = rdev->bdev;
+ if (submit_bio_wait(1, wbio) == 0)
+ /* failure! */
+ ok = md_set_badblocks(&rdev->badblocks,
+ sector + rdev->data_offset,
+ sectors, 0)
+ && ok;
+
+ bio_put(wbio);
+ sect_to_write -= sectors;
+ sector += sectors;
+ sectors = block_sectors;
+ }
+ return ok;
+}
+
static void raid1d(mddev_t *mddev)
{
r1bio_t *r1_bio;
@@ -1894,6 +1964,28 @@ static void raid1d(mddev_t *mddev)
r1_bio->bios[i] = bio;
generic_make_request(bio);
}
+ } else if (test_bit(R1BIO_WriteError, &r1_bio->state)) {
+ /* At least one drive got a write error. We
+ * need to narrow down and record precise write
+ * errors.
+ */
+ int i;
+ for (i = 0; i < conf->raid_disks; i++) {
+ struct bio *bio = r1_bio->bios[i];
+ if (!bio)
+ continue;
+ BUG_ON(test_bit(BIO_UPTODATE, &bio->bi_flags));
+ if (!narrow_write_error(r1_bio, i)) {
+ md_error(mddev, conf->mirrors[i].rdev);
+ /* an I/O failed, we can't clear
+ * the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ }
+ bio_put(bio);
+ r1_bio->bios[i] = NULL;
+ rdev_dec_pending(conf->mirrors[i].rdev, mddev);
+ }
+ raid_end_bio_io(r1_bio);
} else {
int disk;
int max_sectors;
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c91c736..532fca4 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -137,6 +137,7 @@ struct r1bio_s {
/* If a write for this request means we can clear some
* known-bad-block records, we set this flag
*/
-#define R1BIO_MadeGood 7
+#define R1BIO_MadeGood 7
+#define R1BIO_WriteError 8
#endif
* [md PATCH 15/16] md/raid1: clear bad-block record when write succeeds.
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (12 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 16/16] md/raid1: Handle write errors by updating badblock log NeilBrown
@ 2010-06-07 0:07 ` NeilBrown
2010-06-07 0:07 ` [md PATCH 13/16] md: make it easier to wait for bad blocks to be acknowledged NeilBrown
` (3 subsequent siblings)
17 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2010-06-07 0:07 UTC (permalink / raw)
To: linux-raid
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.
This requires some delayed handling as the bad-block-list update has
to happen in process-context.
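The shape of that deferral, as a small user-space model (end_write and
daemon_thread are invented names; in the kernel the marker is spotted
by raid1d, which then calls md_clear_badblocks from process context):

#include <stdio.h>

/* The completion handler runs in atomic context, so it may not take
 * the badblocks lock; it only leaves a marker in the per-mirror bio
 * slot.  The daemon thread later finds the marker and does the clear.
 */
#define IO_MADE_GOOD ((void *)2)	/* special bios[] value, as in the patch */

struct r1bio_model {
	void *bios[2];			/* one slot per mirror */
};

static void end_write(struct r1bio_model *r1, int mirror, int was_bad)
{
	/* atomic context: record only that a known-bad range was
	 * written successfully */
	r1->bios[mirror] = was_bad ? IO_MADE_GOOD : NULL;
}

static void daemon_thread(struct r1bio_model *r1)
{
	int m;

	/* process context: safe to update the bad-block list */
	for (m = 0; m < 2; m++)
		if (r1->bios[m] == IO_MADE_GOOD)
			printf("clear bad-block record on mirror %d\n", m);
}

int main(void)
{
	struct r1bio_model r1 = { { NULL, NULL } };

	end_write(&r1, 0, 1);	/* write covered a recorded bad range */
	end_write(&r1, 1, 0);
	daemon_thread(&r1);
	return 0;
}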
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++------
drivers/md/raid1.h | 13 +++++++++++-
2 files changed, 63 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d240d58..e23ec40 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -175,7 +175,7 @@ static void put_all_bios(conf_t *conf, r1bio_t *r1_bio)
for (i = 0; i < conf->raid_disks; i++) {
struct bio **bio = r1_bio->bios + i;
- if (*bio && *bio != IO_BLOCKED)
+ if (!BIO_SPECIAL(*bio))
bio_put(*bio);
*bio = NULL;
}
@@ -348,7 +348,7 @@ static void raid1_end_write_request(struct bio *bio, int error)
md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
/* an I/O failed, we can't clear the bitmap */
set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
+ } else {
/*
* Set R1BIO_Uptodate in our master bio, so that
* we will return a good error code for to the higher
@@ -360,6 +360,15 @@ static void raid1_end_write_request(struct bio *bio, int error)
*/
set_bit(R1BIO_Uptodate, &r1_bio->state);
+ /* Maybe we can clear some bad blocks. */
+ if (is_badblock(conf->mirrors[mirror].rdev,
+ r1_bio->sector, r1_bio->sectors,
+ NULL, NULL)) {
+ r1_bio->bios[mirror] = IO_MADE_GOOD;
+ set_bit(R1BIO_MadeGood, &r1_bio->state);
+ }
+ }
+
update_head_pos(mirror, r1_bio);
if (behind) {
@@ -384,7 +393,9 @@ static void raid1_end_write_request(struct bio *bio, int error)
}
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+ if (r1_bio->bios[mirror] == NULL)
+ rdev_dec_pending(conf->mirrors[mirror].rdev,
+ conf->mddev);
}
/*
*
@@ -408,7 +419,10 @@ static void raid1_end_write_request(struct bio *bio, int error)
!test_bit(R1BIO_Degraded, &r1_bio->state),
behind);
md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state))
+ reschedule_retry(r1_bio);
+ else
+ raid_end_bio_io(r1_bio);
}
}
@@ -1451,7 +1465,11 @@ static void end_sync_write(struct bio *bio, int error)
sectors_to_go -= sync_blocks;
} while (sectors_to_go > 0);
md_error(mddev, conf->mirrors[mirror].rdev);
- }
+ } else if (is_badblock(conf->mirrors[mirror].rdev,
+ r1_bio->sector,
+ r1_bio->sectors,
+ NULL, NULL))
+ set_bit(R1BIO_MadeGood, &r1_bio->state);
update_head_pos(mirror, r1_bio);
@@ -1812,8 +1830,34 @@ static void raid1d(mddev_t *mddev)
mddev = r1_bio->mddev;
conf = mddev->private;
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
- sync_request_write(mddev, r1_bio);
- unplug = 1;
+ if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ int m;
+ for (m = 0; m < conf->raid_disks ; m++)
+ if (!BIO_SPECIAL(r1_bio->bios[m]) &&
+ test_bit(BIO_UPTODATE,
+ &r1_bio->bios[m]->bi_flags)) {
+ md_clear_badblocks(
+ &conf->mirrors[m].rdev->badblocks,
+ r1_bio->sector,
+ r1_bio->sectors);
+ }
+ put_buf(r1_bio);
+ } else {
+ sync_request_write(mddev, r1_bio);
+ unplug = 1;
+ }
+ } else if (test_bit(R1BIO_MadeGood, &r1_bio->state)) {
+ int m;
+ for (m = 0; m < conf->raid_disks ; m++)
+ if (r1_bio->bios[m] == IO_MADE_GOOD) {
+ md_clear_badblocks(
+ &conf->mirrors[m].rdev->badblocks,
+ r1_bio->sector,
+ r1_bio->sectors);
+ rdev_dec_pending(conf->mirrors[m].rdev,
+ mddev);
+ }
+ raid_end_bio_io(r1_bio);
} else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
/* some requests in the r1bio were BIO_RW_BARRIER
* requests which failed with -EOPNOTSUPP. Hohumm..
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 5f2d443..c91c736 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -110,7 +110,14 @@ struct r1bio_s {
* correct the read error. To keep track of bad blocks on a per-bio
* level, we store IO_BLOCKED in the appropriate 'bios' pointer
*/
-#define IO_BLOCKED ((struct bio*)1)
+#define IO_BLOCKED ((struct bio *)1)
+/* When we successfully write to a known bad-block, we need to remove the
+ * bad-block marking which must be done from process context. So we record
+ * the success by setting bios[n] to IO_MADE_GOOD
+ */
+#define IO_MADE_GOOD ((struct bio *)2)
+
+#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
/* bits for r1bio.state */
#define R1BIO_Uptodate 0
@@ -127,5 +134,9 @@ struct r1bio_s {
* Record that bi_end_io was called with this flag...
*/
#define R1BIO_Returned 6
+/* If a write for this request means we can clear some
+ * known-bad-block records, we set this flag
+ */
+#define R1BIO_MadeGood 7
#endif
* [md PATCH 13/16] md: make it easier to wait for bad blocks to be acknowledged.
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (13 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 15/16] md/raid1: clear bad-block record when write succeeds NeilBrown
@ 2010-06-07 0:07 ` NeilBrown
2010-06-07 0:07 ` [md PATCH 14/16] md/raid1: avoid writing to known-bad blocks on known-bad drives NeilBrown
` (2 subsequent siblings)
17 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2010-06-07 0:07 UTC (permalink / raw)
To: linux-raid
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.
If it hasn't been, we need to wait for the acknowledgement.
We support that using rdev->blocked_wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.
This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.
It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.
When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).
We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.
Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.
The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.
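The resulting handshake, sketched in user-space form (the two flags
model the BlockedBadBlocks bit and badblocks.unacked_exist; the kernel
side sleeps in wait_event_timeout rather than printing, so a lost
wakeup costs at most one timeout period):

#include <stdio.h>

static int blocked_badblocks;	/* the advisory flag bit */
static int unacked_exist = 1;	/* what the writer really cares about */

static void writer(void)
{
	if (unacked_exist) {
		blocked_badblocks = 1;	/* set after test: racy, tolerated */
		printf("writer: would sleep on rdev->blocked_wait\n");
	} else
		printf("writer: safe to proceed\n");
}

static void acknowledge_all(void)
{
	unacked_exist = 0;
	blocked_badblocks = 0;	/* cleared on every ack */
	printf("ack: wake waiters; each re-checks its own blocks\n");
}

int main(void)
{
	writer();		/* finds unacknowledged bad blocks, waits */
	acknowledge_all();
	writer();		/* re-check now succeeds */
	return 0;
}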
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++-------------------
drivers/md/md.h | 22 ++++++++++++++++--
drivers/md/raid1.c | 1 +
drivers/md/raid10.c | 1 +
drivers/md/raid5.c | 1 +
5 files changed, 63 insertions(+), 25 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 20c6792..16cae2f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2313,9 +2313,12 @@ repeat:
return;
}
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->badblocks.changed)
any_badblocks_changed++;
+ if (test_bit(Faulty, &rdev->flags))
+ set_bit(FaultRecorded, &rdev->flags);
+ }
sync_sbs(mddev, nospares);
spin_unlock_irq(&mddev->write_lock);
@@ -2372,9 +2375,14 @@ repeat:
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
- if (any_badblocks_changed)
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ list_for_each_entry(rdev, &mddev->disks, same_set) {
+ if (test_and_clear_bit(FaultRecorded, &rdev->flags))
+ clear_bit(Blocked, &rdev->flags);
+
+ if (any_badblocks_changed)
md_ack_all_badblocks(&rdev->badblocks);
+ }
+ wake_up(&rdev->blocked_wait);
}
/* words written to sysfs files may, or may not, be \n terminated.
@@ -2409,7 +2417,8 @@ state_show(mdk_rdev_t *rdev, char *page)
char *sep = "";
size_t len = 0;
- if (test_bit(Faulty, &rdev->flags)) {
+ if (test_bit(Faulty, &rdev->flags) ||
+ rdev->badblocks.unacked_exist) {
len+= sprintf(page+len, "%sfaulty",sep);
sep = ",";
}
@@ -2421,7 +2430,8 @@ state_show(mdk_rdev_t *rdev, char *page)
len += sprintf(page+len, "%swrite_mostly",sep);
sep = ",";
}
- if (test_bit(Blocked, &rdev->flags)) {
+ if (test_bit(Blocked, &rdev->flags) ||
+ rdev->badblocks.unacked_exist) {
len += sprintf(page+len, "%sblocked", sep);
sep = ",";
}
@@ -2441,12 +2451,12 @@ static ssize_t
state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
{
/* can write
- * faulty - simulates and error
+ * faulty - simulates an error
* remove - disconnects the device
* writemostly - sets write_mostly
* -writemostly - clears write_mostly
- * blocked - sets the Blocked flag
- * -blocked - clears the Blocked flag
+ * blocked - sets the Blocked flags
+ * -blocked - clears the Blocked and possibly simulates an error
* insync - sets Insync providing device isn't active
* write_error - sets WriteErrorSeen
* -write_error - clears WriteErrorSeen
@@ -2476,7 +2486,15 @@ state_store(mdk_rdev_t *rdev, const char *buf, size_t len)
set_bit(Blocked, &rdev->flags);
err = 0;
} else if (cmd_match(buf, "-blocked")) {
+ if (!test_bit(Faulty, &rdev->flags) &&
+ test_bit(BlockedBadBlocks, &rdev->flags)) {
+ /* metadata handler doesn't understand badblocks,
+ * so we need to fail the device
+ */
+ md_error(rdev->mddev, rdev);
+ }
clear_bit(Blocked, &rdev->flags);
+ clear_bit(BlockedBadBlocks, &rdev->flags);
wake_up(&rdev->blocked_wait);
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
md_wakeup_thread(rdev->mddev->thread);
@@ -2792,7 +2810,11 @@ static ssize_t bb_show(mdk_rdev_t *rdev, char *page)
}
static ssize_t bb_store(mdk_rdev_t *rdev, const char *page, size_t len)
{
- return badblocks_store(&rdev->badblocks, page, len, 0);
+ int rv = badblocks_store(&rdev->badblocks, page, len, 0);
+ /* Maybe that ack was all we needed */
+ if (test_and_clear_bit(BlockedBadBlocks, &rdev->flags))
+ wake_up(&rdev->blocked_wait);
+ return rv;
}
static struct rdev_sysfs_entry rdev_bad_blocks =
__ATTR(bad_blocks, S_IRUGO|S_IWUSR, bb_show, bb_store);
@@ -6234,18 +6256,7 @@ void md_error(mddev_t *mddev, mdk_rdev_t *rdev)
if (!rdev || test_bit(Faulty, &rdev->flags))
return;
- if (mddev->external)
- set_bit(Blocked, &rdev->flags);
-/*
- dprintk("md_error dev:%s, rdev:(%d:%d), (caller: %p,%p,%p,%p).\n",
- mdname(mddev),
- MAJOR(rdev->bdev->bd_dev), MINOR(rdev->bdev->bd_dev),
- __builtin_return_address(0),__builtin_return_address(1),
- __builtin_return_address(2),__builtin_return_address(3));
-*/
- if (!mddev->pers)
- return;
- if (!mddev->pers->error_handler)
+ if (!mddev->pers || !mddev->pers->error_handler)
return;
mddev->pers->error_handler(mddev,rdev);
if (mddev->degraded)
@@ -7140,7 +7151,7 @@ static int remove_and_add_spares(mddev_t *mddev)
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->raid_disk >= 0 &&
!test_bit(In_sync, &rdev->flags) &&
- !test_bit(Blocked, &rdev->flags))
+ !test_bit(Faulty, &rdev->flags))
spares++;
if (rdev->raid_disk < 0
&& !test_bit(Faulty, &rdev->flags)) {
@@ -7364,7 +7375,8 @@ void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
{
sysfs_notify_dirent_safe(rdev->sysfs_state);
wait_event_timeout(rdev->blocked_wait,
- !test_bit(Blocked, &rdev->flags),
+ !test_bit(Blocked, &rdev->flags) &&
+ !test_bit(BlockedBadBlocks, &rdev->flags),
msecs_to_jiffies(5000));
rdev_dec_pending(rdev, mddev);
}
@@ -7623,6 +7635,8 @@ add_more:
}
bb->changed = 1;
+ if (!acknowledged)
+ bb->unacked_exist = 1;
rcu_assign_pointer(bb->active_page, bb->page);
spin_unlock(&bb->lock);
@@ -7768,6 +7782,7 @@ again:
p[i] = BB_MAKE(start, len, 1);
}
}
+ bb->unacked_exist = 0;
}
rcu_assign_pointer(bb->active_page, bb->page);
spin_unlock(&bb->lock);
@@ -7817,6 +7832,8 @@ badblocks_show(struct badblocks *bb, char *page, int unack)
len += snprintf(page+len, PAGE_SIZE-len, "%llu %u\n",
(unsigned long long)s, length);
}
+ if (unack && len == 0)
+ bb->unacked_exist = 0;
if (havelock)
spin_unlock(&bb->lock);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a085df5..3eace8e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -97,12 +97,26 @@ struct mdk_rdev_s
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
-#define Blocked 8 /* An error occured on an externally
- * managed array, don't allow writes
+#define Blocked 8 /* An error occurred but has not yet
+ * been acknowledged by the metadata
+ * handler, so don't allow writes
* until it is cleared */
#define WriteErrorSeen 9 /* A write error has been seen on this
* device
*/
+#define FaultRecorded 10 /* Intermediate state for clearing Blocked.
+ * The Fault is/will-be recorded in the
+ * metadata, but that metadata hasn't
+ * been stored safely on disk yet.
+ */
+#define BlockedBadBlocks 11 /* A writer is blocked because they found
+ * an unacknowledged bad-block. This can
+ * safely be cleared at any time, and the
+ * writer will re-check. It may be set
+ * at any time, and at worst the writer
+ * will timeout and re-check. So setting
+ * it as accurately as possible is good,
+ * but not absolutely critical. */
wait_queue_head_t blocked_wait;
int desc_nr; /* descriptor index in the superblock */
@@ -137,6 +151,10 @@ struct mdk_rdev_s
struct badblocks {
int count; /* count of bad blocks */
+ int unacked_exist; /* there probably are unacknowledged
+ * bad blocks. This is only cleared
+ * when a read discovers none
+ */
int shift; /* shift from sectors to block size */
u64 *page; /* badblock list */
u64 *active_page; /* either 'page' or 'NULL' */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8429c63..bb81681 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1160,6 +1160,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
mddev->recovery_disabled = 1;
return;
}
+ set_bit(Blocked, &rdev->flags);
if (test_and_clear_bit(In_sync, &rdev->flags)) {
unsigned long flags;
spin_lock_irqsave(&conf->device_lock, flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 01a75d2..1e586d9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1013,6 +1013,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
+ set_bit(Blocked, &rdev->flags);
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
printk(KERN_ALERT "md/raid10:%s: Disk failure on %s, disabling device.\n"
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6fb36d8..8abc28c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1636,6 +1636,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
+ set_bit(Blocked, &rdev->flags);
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
printk(KERN_ALERT
* [md PATCH 14/16] md/raid1: avoid writing to known-bad blocks on known-bad drives.
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (14 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 13/16] md: make it easier to wait for bad blocks to be acknowledged NeilBrown
@ 2010-06-07 0:07 ` NeilBrown
2010-06-07 0:28 ` [md PATCH 00/16] bad block list management for md and RAID1 Berkey B Walker
2010-06-17 12:48 ` Brett Russ
17 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2010-06-07 0:07 UTC (permalink / raw)
To: linux-raid
If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.
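The per-device clamping is essentially the following (clamp_write is
an invented name; the real loop also handles missing or faulty devices
and blocks on unacknowledged bad blocks):

#include <stdio.h>

typedef unsigned long long sector_t;

/* Decide whether one device takes part in the current write piece.
 * Returns 1 if it writes, 0 if it is skipped; in both cases
 * *max_sectors shrinks so that each piece is all-written or
 * all-skipped on any given drive.
 */
static int clamp_write(sector_t sector, int *max_sectors,
		       sector_t first_bad, int bad_sectors)
{
	if (first_bad <= sector) {
		/* piece starts inside a bad range: skip this device and
		 * keep the others from writing past the range */
		bad_sectors -= (int)(sector - first_bad);
		if (bad_sectors < *max_sectors)
			*max_sectors = bad_sectors;
		return 0;
	}
	if ((int)(first_bad - sector) < *max_sectors)
		*max_sectors = (int)(first_bad - sector);
	return 1;
}

int main(void)
{
	int max_sectors = 64;
	sector_t sector = 1000;

	/* device 0: bad range starts at 1016; device 1: starts at 1000 */
	printf("dev0 writes: %d\n", clamp_write(sector, &max_sectors, 1016, 8));
	printf("dev1 writes: %d\n", clamp_write(sector, &max_sectors, 1000, 4));
	printf("this piece covers %d sectors\n", max_sectors);
	return 0;
}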
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 147 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 112 insertions(+), 35 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bb81681..d240d58 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -838,11 +838,14 @@ static int make_request(mddev_t *mddev, struct bio * bio)
struct bitmap *bitmap;
unsigned long flags;
struct bio_list bl;
- struct page **behind_pages = NULL;
+ struct page **behind_pages;
const int rw = bio_data_dir(bio);
const bool do_sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
bool do_barriers;
mdk_rdev_t *blocked_rdev;
+ int first_clone;
+ int sectors_handled;
+ int max_sectors;
/*
* Register the new request and wait if the reconstruction
@@ -914,7 +917,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/*
* read balancing logic:
*/
- int max_sectors;
int rdisk;
read_again:
@@ -954,7 +956,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/* could not read all from this device, so we will
* need another r1_bio.
*/
- int sectors_handled;
sectors_handled = (r1_bio->sector + max_sectors
- bio->bi_sector);
@@ -983,9 +984,15 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/*
* WRITE:
*/
- /* first select target devices under spinlock and
+ /* first select target devices under rcu_lock and
* inc refcount on their rdev. Record them by setting
* bios[x] to bio
+ * If there are known/acknowledged bad blocks on any device on
+ * which we have seen a write error, we want to avoid writing those
+ * blocks.
+ * This potentially requires several writes to write around
+ * the bad blocks. We do those serially using the same
+ * bio which we repeatedly trim to size.
*/
disks = conf->raid_disks;
#if 0
@@ -998,6 +1005,7 @@ static int make_request(mddev_t *mddev, struct bio * bio)
retry_write:
blocked_rdev = NULL;
rcu_read_lock();
+ max_sectors = r1_bio->sectors;
for (i = 0; i < disks; i++) {
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
@@ -1005,17 +1013,55 @@ static int make_request(mddev_t *mddev, struct bio * bio)
blocked_rdev = rdev;
break;
}
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- atomic_inc(&rdev->nr_pending);
- if (test_bit(Faulty, &rdev->flags)) {
+ r1_bio->bios[i] = NULL;
+ if (!rdev || test_bit(Faulty, &rdev->flags)) {
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ continue;
+ }
+
+ atomic_inc(&rdev->nr_pending);
+ if (test_bit(Faulty, &rdev->flags)) {
+ rdev_dec_pending(rdev, mddev);
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ continue;
+ }
+ if (test_bit(WriteErrorSeen, &rdev->flags)) {
+ sector_t first_bad;
+ int bad_sectors, good_sectors;
+
+ if (is_badblock(rdev, r1_bio->sector,
+ max_sectors,
+ &first_bad, &bad_sectors) < 0) {
+
+ /* mustn't write here until the bad block is
+ * acknowledged */
+ blocked_rdev = rdev;
+ break;
+ }
+ if (first_bad <= r1_bio->sector) {
+ /* Cannot write here at all */
+ bad_sectors -= (r1_bio->sector - first_bad);
+ if (bad_sectors < max_sectors)
+ /* mustn't write more than bad_sectors
+ * to other devices yet
+ */
+ max_sectors = bad_sectors;
rdev_dec_pending(rdev, mddev);
- r1_bio->bios[i] = NULL;
- } else {
- r1_bio->bios[i] = bio;
- targets++;
+ /* We don't set R1BIO_Degraded as that only applies
+ * if the disk is missing, so it might be re-added,
+ * and we want to know to recover this chunk.
+ * In this case the device is here, and the fact that
+ * this chunk is not in-sync is recorded in the
+ * bad block log
+ */
+ continue;
}
- } else
- r1_bio->bios[i] = NULL;
+ good_sectors = first_bad - r1_bio->sector;
+ if (good_sectors < max_sectors)
+ max_sectors = good_sectors;
+ }
+ r1_bio->bios[i] = bio;
+ targets++;
}
rcu_read_unlock();
@@ -1035,22 +1081,26 @@ static int make_request(mddev_t *mddev, struct bio * bio)
BUG_ON(targets == 0); /* we never fail the last device */
+ if (max_sectors < r1_bio->sectors) {
+ /* We are splitting this write into multiple parts, so
+ * we need to prepare for allocating another r1_bio.
+ */
+ r1_bio->sectors = max_sectors;
+ spin_lock_irq(&conf->device_lock);
+ if (bio->bi_phys_segments == 0)
+ bio->bi_phys_segments = 2;
+ else
+ bio->bi_phys_segments++;
+ spin_unlock_irq(&conf->device_lock);
+ }
+ sectors_handled = r1_bio->sector + max_sectors - bio->bi_sector;
+
if (targets < conf->raid_disks) {
/* array is degraded, we will not clear the bitmap
* on I/O completion (see raid1_end_write_request) */
set_bit(R1BIO_Degraded, &r1_bio->state);
}
- /* do behind I/O ?
- * Not if there are too many, or cannot allocate memory,
- * or a reader on WriteMostly is waiting for behind writes
- * to flush */
- if (bitmap &&
- (atomic_read(&bitmap->behind_writes)
- < mddev->bitmap_info.max_write_behind) &&
- !waitqueue_active(&bitmap->behind_wait) &&
- (behind_pages = alloc_behind_pages(bio)) != NULL)
- set_bit(R1BIO_BehindIO, &r1_bio->state);
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);
@@ -1060,21 +1110,28 @@ static int make_request(mddev_t *mddev, struct bio * bio)
set_bit(R1BIO_Barrier, &r1_bio->state);
bio_list_init(&bl);
+ first_clone = 1;
+ behind_pages = NULL;
for (i = 0; i < disks; i++) {
struct bio *mbio;
if (!r1_bio->bios[i])
continue;
mbio = bio_clone(bio, GFP_NOIO);
- r1_bio->bios[i] = mbio;
-
- mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
- mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
- mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | (do_barriers << BIO_RW_BARRIER) |
- (do_sync << BIO_RW_SYNCIO);
- mbio->bi_private = r1_bio;
+ if (first_clone) {
+ /* do behind I/O ?
+ * Not if there are too many, or cannot
+ * allocate memory, or a reader on WriteMostly
+ * is waiting for behind writes to flush */
+ if (bitmap &&
+ (atomic_read(&bitmap->behind_writes)
+ < mddev->bitmap_info.max_write_behind) &&
+ !waitqueue_active(&bitmap->behind_wait) &&
+ (behind_pages = alloc_behind_pages(mbio)) != NULL)
+ set_bit(R1BIO_BehindIO, &r1_bio->state);
+ first_clone = 0;
+ }
if (behind_pages) {
struct bio_vec *bvec;
int j;
@@ -1092,6 +1149,17 @@ static int make_request(mddev_t *mddev, struct bio * bio)
atomic_inc(&r1_bio->behind_remaining);
}
+ trim_bio(mbio, r1_bio->sector - bio->bi_sector, max_sectors);
+ r1_bio->bios[i] = mbio;
+
+ mbio->bi_sector = (r1_bio->sector +
+ conf->mirrors[i].rdev->data_offset);
+ mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
+ mbio->bi_end_io = raid1_end_write_request;
+ mbio->bi_rw = WRITE | (do_barriers << BIO_RW_BARRIER) |
+ (do_sync << BIO_RW_SYNCIO);
+ mbio->bi_private = r1_bio;
+
atomic_inc(&r1_bio->remaining);
bio_list_add(&bl, mbio);
@@ -1110,12 +1178,21 @@ static int make_request(mddev_t *mddev, struct bio * bio)
/* In case raid1d snuck into freeze_array */
wake_up(&conf->wait_barrier);
+ if (sectors_handled < (bio->bi_size >> 9)) {
+ /* We need another r1_bio. It has already been counted
+ * in bio->bi_phys_segments
+ */
+ r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+ r1_bio->master_bio = bio;
+ r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
+ r1_bio->state = 0;
+ r1_bio->mddev = mddev;
+ r1_bio->sector = bio->bi_sector + sectors_handled;
+ goto retry_write;
+ }
+
if (do_sync)
md_wakeup_thread(mddev->thread);
-#if 0
- while ((bio = bio_list_pop(&bl)) != NULL)
- generic_make_request(bio);
-#endif
return 0;
}
* Re: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (15 preceding siblings ...)
2010-06-07 0:07 ` [md PATCH 14/16] md/raid1: avoid writing to known-bad blocks on known-bad drives NeilBrown
@ 2010-06-07 0:28 ` Berkey B Walker
2010-06-07 22:18 ` Stefan /*St0fF*/ Hübner
2010-06-17 12:48 ` Brett Russ
17 siblings, 1 reply; 26+ messages in thread
From: Berkey B Walker @ 2010-06-07 0:28 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Excellent start Neil. This will be much appreciated.
b-
NeilBrown wrote:
> In the spirit of "release early" I thought I would post some patches
> that I have been working on lately.
>
> Please don't try these on a system with valuable data - they are very
> early code and will probably do the wrong thing.
>
> The goal of these patches is to add a 'bad block list' to each device
> and use it to allow us to fail single blocks rather than whole
> devices.
>
> This is particularly useful in arrays with multiple redundancy
> (e.g. RAID6 or 3-device RAID1). In such cases, bad blocks in
> different places on different devices can leave an array that still
> has at least single redundancy on all stripes. Without this support,
> such arrays could become non-functional.
>
> This is also a necessary preparation to being able to support
> 'hot-replace' where we build a new device while the old device is
> still in service. Such a process is only really needed if the old
> device is potentially faulty, and having the bad-block-list in place
> allows it to continue to provide the best service it can even when it
> cannot provide 100% service.
>
> These patches have only seen limited testing, and are posted primarily
> for review rather than testing, though testing is always valuable,
> especially if you use the md/faulty module to simulate errors, or have
> a drive that provides you with real errors...
>
> This series provides infrastructure and integration into raid1.c only.
> raid5.c and raid10.c support are still to be written.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (16):
> md: beginnings of bad block management.
> md/bad-block-log: add sysfs interface for accessing bad-block-log.
> md: don't allow arrays to contain devices with bad blocks.
> md: load/store badblock list from v1.x metadata
> md: reject devices with bad blocks and v0.90 metadata.
> md/raid1: clean up read_balance.
> md: simplify raid10 read_balance
> md/raid1: avoid reading from known bad blocks.
> md/raid1: avoid reading known bad blocks during resync
> md: add 'write_error' flag to component devices.
> md/multipath: discard ->working_disks in favour of ->degraded
> md: make error_handler functions more uniform and correct.
> md: make it easier to wait for bad blocks to be acknowledged.
> md/raid1: avoid writing to known-bad blocks on known-bad drives.
> md/raid1: clear bad-block record when write succeeds.
> md/raid1: Handle write errors by updating badblock log.
>
>
> drivers/md/dm-raid456.c | 6
> drivers/md/md.c | 725 +++++++++++++++++++++++++++++++++++++++++++--
> drivers/md/md.h | 76 ++++-
> drivers/md/multipath.c | 60 ++--
> drivers/md/multipath.h | 1
> drivers/md/raid1.c | 714 +++++++++++++++++++++++++++++++++++---------
> drivers/md/raid1.h | 14 +
> drivers/md/raid10.c | 123 ++++----
> drivers/md/raid5.c | 48 ++-
> include/linux/raid/md_p.h | 13 +
> 10 files changed, 1475 insertions(+), 305 deletions(-)
>
>
* Re: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-07 0:28 ` [md PATCH 00/16] bad block list management for md and RAID1 Berkey B Walker
@ 2010-06-07 22:18 ` Stefan /*St0fF*/ Hübner
0 siblings, 0 replies; 26+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2010-06-07 22:18 UTC (permalink / raw)
To: Berkey B Walker; +Cc: NeilBrown, linux-raid
I 2nd that!
Stefan
Am 07.06.2010 02:28, schrieb Berkey B Walker:
> Excellent start Neil. This will be much appreciated.
> b-
>
> NeilBrown wrote:
>> In the spirit of "release early" I thought I would post some patches
>> that I have been working on lately.
>>
>> Please don't try these on a system with valuable data - they are very
>> early code and will probably do the wrong thing.
>>
>> The goal of these patches is to add a 'bad block list' to each device
>> and use it to allow us to fail single blocks rather than whole
>> devices.
>>
>> This is particularly useful in arrays with multiple redundancy
>> (e.g. RAID6 or 3-device RAID1). In such cases, bad blocks in
>> different places on different devices can leave an array that still
>> has at least single redundancy on all stripes. Without this support,
>> such arrays could become non-functional.
>>
>> This is also a necessary preparation to being able to support
>> 'hot-replace' where we build a new device while the old device is
>> still in service. Such a process is only really needed if the old
>> device is potentially faulty, and having the bad-block-list in place
>> allows it to continue to provide the best service it can even when it
>> cannot provide 100% service.
>>
>> These patches have only seen limited testing, and are posted primarily
>> for review rather than testing, though testing is always valuable,
>> especially if you use the md/faulty module to simulate errors, or have
>> a drive that provides you with real errors...
>>
>> This series provides infrastructure and integration into raid1.c only.
>> raid5.c and raid10.c support are still to be written.
>>
>> NeilBrown
>>
>>
>> ---
>>
>> NeilBrown (16):
>> md: beginnings of bad block management.
>> md/bad-block-log: add sysfs interface for accessing bad-block-log.
>> md: don't allow arrays to contain devices with bad blocks.
>> md: load/store badblock list from v1.x metadata
>> md: reject devices with bad blocks and v0.90 metadata.
>> md/raid1: clean up read_balance.
>> md: simplify raid10 read_balance
>> md/raid1: avoid reading from known bad blocks.
>> md/raid1: avoid reading known bad blocks during resync
>> md: add 'write_error' flag to component devices.
>> md/multipath: discard ->working_disks in favour of ->degraded
>> md: make error_handler functions more uniform and correct.
>> md: make it easier to wait for bad blocks to be acknowledged.
>> md/raid1: avoid writing to known-bad blocks on known-bad drives.
>> md/raid1: clear bad-block record when write succeeds.
>> md/raid1: Handle write errors by updating badblock log.
>>
>>
>> drivers/md/dm-raid456.c | 6
>> drivers/md/md.c | 725
>> +++++++++++++++++++++++++++++++++++++++++++--
>> drivers/md/md.h | 76 ++++-
>> drivers/md/multipath.c | 60 ++--
>> drivers/md/multipath.h | 1
>> drivers/md/raid1.c | 714
>> +++++++++++++++++++++++++++++++++++---------
>> drivers/md/raid1.h | 14 +
>> drivers/md/raid10.c | 123 ++++----
>> drivers/md/raid5.c | 48 ++-
>> include/linux/raid/md_p.h | 13 +
>> 10 files changed, 1475 insertions(+), 305 deletions(-)
>>
>>
* Re: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-07 0:07 [md PATCH 00/16] bad block list management for md and RAID1 NeilBrown
` (16 preceding siblings ...)
2010-06-07 0:28 ` [md PATCH 00/16] bad block list management for md and RAID1 Berkey B Walker
@ 2010-06-17 12:48 ` Brett Russ
2010-06-17 15:53 ` Graham Mitchell
2010-06-18 3:23 ` Neil Brown
17 siblings, 2 replies; 26+ messages in thread
From: Brett Russ @ 2010-06-17 12:48 UTC (permalink / raw)
To: linux-raid
On 06/06/2010 08:07 PM, NeilBrown wrote:
> The goal of these patches is to add a 'bad block list' to each device
> and use it to allow us to fail single blocks rather than whole
> devices.
Hi Neil,
This is a worthwhile addition, I think. However, one concern we have is
there appears to be no distinction between media errors (i.e. bad
blocks) and other SCSI errors. One situation we commonly see in the
enterprise is non-media SCSI errors due to e.g. path failure. We've
tested dm multipath as a solution for that but it has its own problems,
primarily performance due to its apparent decomposition of large
contiguous I/Os into smaller I/Os and we're investigating that. Until
that is fixed, we have patched md to retry failed writes (md already has
a mechanism for failed reads). Commonly these retries will succeed as
many of the path failures we've seen have been transient (i.e. a SAS
expander undergoes a reset). Today in the vanilla md code that would
cause a drive failure. In this patch, it would identify a range of
blocks as bad. Presumably later they might be revalidated and removed
from the bad block list if the original error(s) were in fact transient,
but in the meantime we lose that member from any reads.
As an aside, it would be handy to have mechanisms exposed to userspace
(via mdadm) to display, test, and possibly override the memory of these
bad blocks such that in these instances where md has (possibly
incorrectly) forced a range of blocks unavailable on a member that we
can recover data if the automated recovery doesn't succeed.
Do you have thoughts or plans to behave differently based on the type of
error? I believe today the SCSI layer only provides pass/fail, is that
correct? If so, plumbing would need to be added to make the upper layer
aware of the nature of the failure. It seems that the bad block
management in md should only take effect for media errors and that there
should be more intelligent handling of other types of errors. We would
be happy to help in this area if it aligns with your/the community's
longer term view of things.
Thanks,
Brett
* RE: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-17 12:48 ` Brett Russ
@ 2010-06-17 15:53 ` Graham Mitchell
2010-06-18 3:58 ` Neil Brown
2010-06-18 3:23 ` Neil Brown
1 sibling, 1 reply; 26+ messages in thread
From: Graham Mitchell @ 2010-06-17 15:53 UTC (permalink / raw)
To: 'Brett Russ', linux-raid
> This is a worthwhile addition, I think. However, one concern we have is there
> appears to be no distinction between media errors (i.e. bad
> blocks) and other SCSI errors.
One thing I'd like to see is the ability to import a list of bad blocks from badblocks, and also for mdadm to be able to run a 'destructive' badblocks on the drives in the array, either at create/grow time or on demand.
I say 'destructive' since it would be a bad thing (tm) if it truly were destructive on a live array, but it would be nice for mdadm to do the full destructive aa/55/ff/00 write/read/compare cycle on each disk, without actually being destructive to the data that's there. I am slightly paranoid (having been bitten in the bum in the past), so I do a full destructive badblocks on every disk BEFORE I add it to an array (and yes, it can take days when I have 3 or 4 1TB drives to add). It would be nice to be able to add the disks to the server untested, and let mdadm do the testing when it was doing the grow.
Graham
* Re: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-17 15:53 ` Graham Mitchell
@ 2010-06-18 3:58 ` Neil Brown
2010-06-18 4:30 ` Graham Mitchell
0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2010-06-18 3:58 UTC (permalink / raw)
To: Graham Mitchell; +Cc: 'Brett Russ', linux-raid
On Thu, 17 Jun 2010 11:53:40 -0400
"Graham Mitchell" <gmitch@woodlea.com> wrote:
> > This is a worthwhile addition, I think. However, one concern we have is there
> > appears to be no distinction between media errors (i.e. bad
> > blocks) and other SCSI errors.
>
> One thing I'd like to see would be being able to import a list of bad blocks from badblocks, and also have the ability for mdadm to be able to run a 'destructive' badblocks on the drives in the array, either at create/grow time, or on demand.
Importing a list of bad blocks would be quite trivial - you could write a
perl script to do it, though it might be nice to include it in mdadm.
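Something like the following C sketch could feed the entries in; the
sysfs path shown is only a guess, and the one-entry-per-line
"sector length" format is inferred from badblocks_show() in the
patches earlier in this thread:

#include <stdio.h>

/* Hypothetical importer: read "sector length" pairs (in 512-byte
 * sectors) from stdin and append them to a component device's
 * bad_blocks sysfs attribute.  Adjust the path for the real array
 * and member.
 */
int main(void)
{
	const char *attr = "/sys/block/md0/md/dev-sda/bad_blocks";
	FILE *out = fopen(attr, "w");
	unsigned long long sector;
	unsigned int length;

	if (!out) {
		perror(attr);
		return 1;
	}
	while (scanf("%llu %u", &sector, &length) == 2) {
		fprintf(out, "%llu %u\n", sector, length);
		fflush(out);	/* one entry per write into the sysfs store */
	}
	fclose(out);
	return 0;
}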
>
> I say 'destructive' since it would be a bad thing (tm) if it truly were destructive on a live array, but it would be nice for mdadm to do the full destructive aa/55/ff/00 write/read/compare cycle on each disk, without actually being destructive to the data that's there. I am slightly paranoid (having been bitten in the bum in the past), so I do a full destructive badblocks on every disk BEFORE I add It to an array (and yes, it can take days when I have 3 or 4 1TB drives to add). It would be nice to be able to add the disks to the server untested, and let mdadm do the testing when it was doing the grow.
I think it would be a mistake to incorporate bad-block detection
functionality into md or mdadm. We already have a program which does that
and probably does it better than I could code. Best to try to leverage what
already exists.
I'm not sure I see the logic though. Surely if a drive has any errors when
new, then you don't want to trust it at all - cascading failure is likely,
and tomorrow there will be more errors. So it would be best to do the
badblock scan first and only add it to the array if it were completely
successful.
However, if you really want to, you could tell md that all blocks were bad,
then have the badblock scan run, and after it finishes with some section,
tell md that section was OK and move on.
The current badblock list format allows ranges of blocks, but it is currently
limited to 512 ranges each of at most 512 blocks. I could probably relax
that without too much effort, so that a single range could cover the whole
device... if we really thought that was a good idea.
Not convinced....
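For reference, that limit of 512 ranges of up to 512 sectors falls straight
out of the packing: 64-bit entries in one 4096-byte page gives 512 entries,
and a 9-bit length field encodes lengths 1-512. A userspace sketch of the
packing (the exact bit layout, including the 'acknowledged' bit, is an
assumption here, not quoted from the patch):

#include <stdint.h>
#include <stdio.h>

/* Assumed entry layout: bit 63 an 'acknowledged' flag, bits 9-62 the
 * start sector (54 bits), bits 0-8 the length minus one. */
#define BB_LEN_MASK	0x00000000000001ffULL
#define BB_ACK_MASK	0x8000000000000000ULL

static uint64_t bb_make(uint64_t start, unsigned len, int ack)
{
	return (start << 9) | (len - 1) | ((uint64_t)(!!ack) << 63);
}

static uint64_t bb_start(uint64_t e) { return (e & ~BB_ACK_MASK) >> 9; }
static unsigned bb_len(uint64_t e)   { return (e & BB_LEN_MASK) + 1; }
static int      bb_ack(uint64_t e)   { return !!(e & BB_ACK_MASK); }

int main(void)
{
	uint64_t e = bb_make(123456789ULL, 8, 1);

	printf("start=%llu len=%u ack=%d\n",
	       (unsigned long long)bb_start(e), bb_len(e), bb_ack(e));
	/* one page of entries is where the 512-range limit comes from */
	printf("entries per 4096-byte page: %d\n",
	       (int)(4096 / sizeof(uint64_t)));
	return 0;
}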
NeilBrown
* RE: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-18 3:58 ` Neil Brown
@ 2010-06-18 4:30 ` Graham Mitchell
0 siblings, 0 replies; 26+ messages in thread
From: Graham Mitchell @ 2010-06-18 4:30 UTC (permalink / raw)
To: 'Neil Brown'; +Cc: 'Brett Russ', linux-raid
> I think it would be a mistake to incorporate bad-block detection
> functionality into md or mdadm. We already have a program which does that
> and probably does it better than I could code. Best to try to leverage
> what already exists.
I agree - I was thinking along the lines of maintenance type cases, where we
currently run an array check once a week - we could also schedule a full
'non-destructive badblocks -w' type test once a month (say), to catch disks
which are starting to go bad. Since mdadm understands the RAID layout, it
could migrate/redirect a stripe or block to another area, run badblocks on
each of the disks specifying the start and end sectors, and if the area on
one of the disks was bad, mark the area as bad - and since the data has been
redirected, we don't lose anything. If the area is good, then the data gets
moved back to its original location, and mdadm moves on to the next
stripe/block. I really think you'd need to do a fully destructive write test
on the drive though - I've actually just finished testing a Spinpoint F3
this evening, which has shown up 5 bad sectors, all on the 4th write pass
(0x00), so a quick read test probably wouldn't have shown them up.
> I'm not sure I see the logic though. Surely if a drive has any errors
> when new, then you don't want to trust it at all - cascading failure is
> likely, and tomorrow there will be more errors. So it would be best to do
> the badblock scan first and only add it to the array if it were completely
> successful.
Agreed Neil - I guess I am thinking more of the maintenance type cases, but
it would be nice to have mdadm check the drive when it's added to the array.
You could just blindly add the drive, and immediately schedule a full
badblocks test - but I guess I would still be paranoid, and still check the
disk before adding it.
* Re: [md PATCH 00/16] bad block list management for md and RAID1
2010-06-17 12:48 ` Brett Russ
2010-06-17 15:53 ` Graham Mitchell
@ 2010-06-18 3:23 ` Neil Brown
[not found] ` <4C1BABC4.3020008@tmr.com>
1 sibling, 1 reply; 26+ messages in thread
From: Neil Brown @ 2010-06-18 3:23 UTC (permalink / raw)
To: Brett Russ; +Cc: linux-raid
On Thu, 17 Jun 2010 08:48:07 -0400
Brett Russ <bruss@netezza.com> wrote:
> On 06/06/2010 08:07 PM, NeilBrown wrote:
> > The goal of these patches is to add a 'bad block list' to each device
> > and use it to allow us to fail single blocks rather than whole
> > devices.
>
> Hi Neil,
>
> This is a worthwhile addition, I think. However, one concern we have is
> there appears to be no distinction between media errors (i.e. bad
> blocks) and other SCSI errors. One situation we commonly see in the
> enterprise is non-media SCSI errors due to e.g. path failure. We've
> tested dm multipath as a solution for that but it has its own problems,
> primarily performance due to its apparent decomposition of large
> contiguous I/Os into smaller I/Os and we're investigating that. Until
> that is fixed, we have patched md to retry failed writes (md already has
> a mechanism for failed reads). Commonly these retries will succeed as
> many of the path failures we've seen have been transient (e.g. a SAS
> expander undergoes a reset). Today in the vanilla md code that would
> cause a drive failure. In this patch, it would identify a range of
> blocks as bad. Presumably later they might be revalidated and removed
> from the bad block list if the original error(s) were in fact transient,
> but in the meantime we lose that member from any reads.
Hi Brett,
thanks for your thoughts.
No, md doesn't differentiate between different types of errors. There are
two reasons for this.
1/ I don't think it gets told what sort of error there was. The bi_end_io
function is passed an error code, but I don't think that can be used to
differentiate between e.g. media and transport errors. Maybe that has
changed since I last looked....
2/ I don't think it would help.
md currently treats all errors as media errors, i.e. it assumes just that
block is bad. If it can deal with that (and bad-block-lists expand the
options of dealing with it) it does. If it cannot, it just rejects the
device.
If the error were actually a transport error, it would be very likely to
quickly lead to an error that it could not deal with (e.g. updating
metadata) and would have to reject the whole device. And that action is
the only thing that it makes sense for md to do in the face of a transport
error.
Such an error says that we cannot reliably talk to the device, so md should
stop trying.
It is simply not appropriate for md to retry on failure, just as it is not
appropriate for md to implement any timeouts. Both these actions imply some
knowledge of the characteristics of the underlying device, and md simply
does not have that knowledge.
If you have a device where temporary path failures are possible, then it is
up to the driver to deal with that possibility.
For example, I believe the 'dasd' driver (which is for some sort of
fibre-connected drives on an IBM mainframe) normally treats cable problems as a
transient error and retries indefinitely until they are repaired, or until
the sysadmin says otherwise. This seems a reasonable approach.
The only situation where it might make sense for md to retry is if it could
retry in a 'different' way (trying the same thing again and expecting a
different result is not entirely rational after all...).
e.g. md/raid1 could issue reads with the FAILFAST flag which - for dasd at
least - says not to retry transport errors indefinitely. After an error
from such a read it would be sensible not to reject the device but just direct
the read to a different device. If all devices failed with FAILFAST, then
try again without FAILFAST - and treat such a failure as hard.
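Schematically the policy would be something like the following - just a
sketch of the two-pass idea, where 'struct mirror' and read_from() are
made-up stand-ins, not md's internals:

#include <stdbool.h>
#include <stdio.h>

struct mirror { int id; bool failed; };

/* made-up stand-in for issuing a read; pretend fail-fast reads on
 * mirror 0 fail (a flaky transport) and everything else succeeds */
static int read_from(struct mirror *m, long long sector, bool failfast)
{
	(void)sector;
	return (failfast && m->id == 0) ? -1 : 0;
}

static int mirrored_read(struct mirror *mirrors, int n, long long sector)
{
	int i;

	/* pass 1: fail-fast reads; an error just redirects to the
	 * next mirror instead of rejecting the device */
	for (i = 0; i < n; i++) {
		if (!mirrors[i].failed &&
		    read_from(&mirrors[i], sector, true) == 0)
			return i;
	}
	/* pass 2: no fail-fast, so the driver does its full retries;
	 * a failure here is treated as hard */
	for (i = 0; i < n; i++) {
		if (mirrors[i].failed)
			continue;
		if (read_from(&mirrors[i], sector, false) == 0)
			return i;
		mirrors[i].failed = true;	/* reject the device */
	}
	return -1;
}

int main(void)
{
	struct mirror m[2] = { { 0, false }, { 1, false } };

	printf("read served by mirror %d\n", mirrored_read(m, 2, 12345));
	return 0;
}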
That might be nice, but the last time I tried it, different drivers treated
FAILFAST quite differently. e.g. my SATA devices would fairly often fail
FAILFAST requests even when they were otherwise working perfectly.
I don't think that FAILFAST is/was very well specified: 'fast' is a
relative term after all.
I note that there are now 3 different FAILFAST flags (DEV, TRANSPORT, and
DRIVER). Maybe they have more useful implementations, so it may be time to
revisit this issue.
However it remains that if no FAILFAST flags are present, then it is up to
the driver to do any retries that might be appropriate - md cannot be
involved in retries at that level.
>
> As an aside, it would be handy to have mechanisms exposed to userspace
> (via mdadm) to display, test, and possibly override the memory of these
> bad blocks such that in these instances where md has (possibly
> incorrectly) forced a range of blocks unavailable on a member that we
> can recover data if the automated recovery doesn't succeed.
Yes, the bad block list is entirely exposed to user-space via sysfs.
Removing entries from the list directly is not currently supported (except
for debugging purposes). To remove a bad block you just need to arrange a
successful write to the device, which can be done with the 'check' feature.
Adding and examining bad blocks is easy.
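e.g. a throwaway sketch (both sysfs paths are assumptions, as is the
"sector length" line format) that dumps a member's list and then kicks off
a 'check' pass, whose successful writes will clear entries:

#include <stdio.h>

/* 'check' is md's existing scrub action; a successful rewrite of a
 * bad range is what removes it from the list. */
int main(void)
{
	FILE *f;
	char line[64];

	f = fopen("/sys/block/md0/md/dev-sda1/bad_blocks", "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			printf("bad range: %s", line);
		fclose(f);
	}

	f = fopen("/sys/block/md0/md/sync_action", "w");
	if (f) {
		fputs("check\n", f);
		fclose(f);
	}
	return 0;
}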
>
> Do you have thoughts or plans to behave differently based on the type of
> error? I believe today the SCSI layer only provides pass/fail, is that
> correct? If so, plumbing would need to be added to make the upper layer
> aware of the nature of the failure. It seems that the bad block
> management in md should only take effect for media errors and that there
> should be more intelligent handling of other types of errors. We would
> be happy to help in this area if it aligns with your/the community's
> longer term view of things.
I've probably answered this question above, but to summarise:
I think there could be some place for responding differently to different
types of errors, but it would only be to respond more harshly than we
currently do.
I think that any differentiation should come from md making different sorts of
requests (e.g. with or without FAILFAST), and possibly retrying such
requests in a more forceful way, or after other successes have shown that it
might be appropriate.
Thanks,
NeilBrown