Linux RAID subsystem development

* Re: recovering failed raid5
From: Roman Mamedov @ 2016-10-29 10:29 UTC
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161028133304.GA11564@metamorpher.de>

On Fri, 28 Oct 2016 15:33:04 +0200
Andreas Klauer <Andreas.Klauer@metamorpher.de> wrote:

> On Fri, Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
> > One remaining question: is sdc definitely toast?
> 
> In my opinion a drive is toast starting from the very first reallocated/ 
> pending/uncorrectable sector, your drive has several of those and that's 
> only the ones the drive already knows about - there may be more.

I'd say you are overly cautious on this. Yes, there are drives for which one
reallocated sector is a sign of a coming avalanche of them, but then there
are also ones (e.g. my Hitachi 2TB) which work for years, over that period
develop 3-5-7 reallocated sectors, and THAT'S IT, they just continue to work.
And if there's an unreadable sector on rebuild because a drive found its 8th
bad sector after 3 more years of perfect operation, that's not a problem
either, because the setup they run in is RAID6. (Not to compensate for this,
but I wouldn't be running an 8-10 drive RAID5 in any case.)

-- 
With respect,
Roman

* Re: recovering failed raid5
From: Mikael Abrahamsson @ 2016-10-29  8:46 UTC
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161028234529.GA3909@metamorpher.de>

On Sat, 29 Oct 2016, Andreas Klauer wrote:

> You'd think timeouts would solve all problems. They probably don't. In 
> some exceedingly rare cases, they might not even matter at all.

My reasoning regarding timeouts, especially for home arrays is the 
following:

Turning up the timeouts to 180 means your worst case scenario is that your 
array will have a 180 second long "hiccup" in delivering data.

This can be really bad in an enterprise environment, but in a home
environment it's merely an inconvenience. It happens very rarely, and it
stops your drive from being spuriously kicked out when there is a read
error, where being kicked out can lead to much worse things happening.

So for regular use there is very little downside to setting the timeouts to
180 seconds, there are substantial upsides, and I recommend that everybody
with non-enterprise drives do that.

I wish the kernel defaults would be changed to 180, because I see these
default timeout settings cause people more problems than they solve.

This is of course just one piece of a larger puzzle, but it's one that 
it's important to get right.
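
For reference, a minimal sketch of how the 180-second setting could be
applied from C. It assumes the drive is a SCSI/SATA disk whose command
timeout is exposed at /sys/block/<dev>/device/timeout; the helper names
timeout_path() and set_disk_timeout() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the sysfs path of a disk's command timeout. */
int timeout_path(char *buf, size_t len, const char *dev)
{
	int n;

	/* reject empty names and anything that could escape /sys/block */
	if (!dev || !*dev || strchr(dev, '/'))
		return -1;

	n = snprintf(buf, len, "/sys/block/%s/device/timeout", dev);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: write the timeout in seconds; needs root. */
int set_disk_timeout(const char *dev, unsigned int seconds)
{
	char path[128];
	FILE *f;
	int ret;

	if (timeout_path(path, sizeof(path), dev))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such disk, or not permitted */

	ret = fprintf(f, "%u\n", seconds) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

Calling set_disk_timeout("sdc", 180) as root for each array member would
be the idea. The value is not persistent, so it would have to be
reapplied on every boot.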

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* [PATCH 58/60] dm-crypt: convert to bio_for_each_segment_all_rd()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-crypt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 4999c7497f95..ed0f54e51638 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1034,8 +1034,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	unsigned int i;
 	struct bio_vec *bv;
+	struct bvec_iter_all bia;
 
-	bio_for_each_segment_all(bv, clone, i) {
+	bio_for_each_segment_all_rd(bv, clone, i, bia) {
 		BUG_ON(!bv->bv_page);
 		mempool_free(bv->bv_page, cc->page_pool);
 		bv->bv_page = NULL;
-- 
2.7.4


* [PATCH 57/60] bcache: convert to bio_for_each_segment_all_rd()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Kent Overstreet, Shaohua Li,
	Hannes Reinecke, Jiri Kosina, Mike Christie, Guoqing Jiang,
	Zheng Liu, open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/btree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index b419bc91ba32..89abada6a091 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -419,8 +419,9 @@ static void do_btree_node_write(struct btree *b)
 		int j;
 		struct bio_vec *bv;
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+		struct bvec_iter_all bia;
 
-		bio_for_each_segment_all(bv, b->bio, j)
+		bio_for_each_segment_all_rd(bv, b->bio, j, bia)
 			memcpy(page_address(bv->bv_page),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
-- 
2.7.4

* [PATCH 39/60] bcache: debug: switch to bio_clone_sp()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Kent Overstreet, Shaohua Li,
	Mike Christie, Hannes Reinecke, Guoqing Jiang,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

The cloned bio has to be singlepage bvec based, so
use bio_clone_sp(). The allocated bvec table is big
enough to hold the bvecs because QUEUE_FLAG_SPLIT_MP
is set for bcache.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/debug.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 71a9f05918eb..0735015b0842 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -111,12 +111,10 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	struct bvec_iter iter, citer = { 0 };
 
 	/*
-	 * Once multipage bvec is supported, the bio_clone()
-	 * has to make sure page count in this bio can be held
-	 * in the new cloned bio because each single page need
-	 * to assign to each bvec of the new bio.
+	 * QUEUE_FLAG_SPLIT_MP can make the cloned singlepage
+	 * bvecs to be held in the allocated bvec table.
 	 */
-	check = bio_clone(bio, GFP_NOIO);
+	check = bio_clone_sp(bio, GFP_NOIO);
 	if (!check)
 		return;
 	bio_set_op_attrs(check, REQ_OP_READ, READ_SYNC);
-- 
2.7.4

* [PATCH 30/60] bcache: set flag of QUEUE_FLAG_SPLIT_MP
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Kent Overstreet, Shaohua Li,
	Eric Wheeler, Coly Li, Yijing Wang, Zheng Liu, Mike Christie,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

It isn't safe (see bch_data_verify() for example) to let bcache deal
with bios bigger than 1M built from multipage bvecs, so set this flag
so that the size of an incoming bio won't exceed BIO_SP_MAX_SECTORS.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 52876fcf2b36..fca023a1a026 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -821,6 +821,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	blk_queue_write_cache(q, true, true);
 
+	/*
+	 * Once bcache is audited that it is ready to deal with big
+	 * incoming bio with multipage bvecs, we can remove the flag.
+	 */
+	set_bit(QUEUE_FLAG_SPLIT_MP,	&d->disk->queue->queue_flags);
+
 	return 0;
 }
 
-- 
2.7.4

* [PATCH 29/60] dm: limit the max bio size as BIO_SP_MAX_SECTORS << SECTOR_SHIFT
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

For bio-based DM, some targets, such as crypt and log write,
aren't ready to deal with incoming bios bigger than 1 Mbyte,
so limit the max bio size.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ef7bf1dd6900..ce454c6c1a4e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -899,7 +899,16 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 		return -EINVAL;
 	}
 
-	ti->max_io_len = (uint32_t) len;
+	/*
+	 * BIO based queue uses its own splitting. When multipage bvecs
+	 * is switched on, size of the incoming bio may be too big to
+	 * be handled in some targets, such as crypt and log write.
+	 *
+	 * When these targets are ready for the big bio, we can remove
+	 * the limit.
+	 */
+	ti->max_io_len = min_t(uint32_t, len,
+			       BIO_SP_MAX_SECTORS << SECTOR_SHIFT);
 
 	return 0;
 }
-- 
2.7.4

* [PATCH 24/60] md: set NO_MP for request queue of md
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

MD isn't ready for multipage bvecs, so mark it as
NO_MP.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/md.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index eac84d8ff724..f8d98098dff8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5128,6 +5128,16 @@ static void md_safemode_timeout(unsigned long data)
 
 static int start_dirty_degraded;
 
+/*
+ * MD isn't ready for multipage bvecs yet, and set the flag
+ * so that MD still can see singlepage bvecs bio
+ */
+static inline void md_set_no_mp(struct mddev *mddev)
+{
+	if (mddev->queue)
+		set_bit(QUEUE_FLAG_NO_MP, &mddev->queue->queue_flags);
+}
+
 int md_run(struct mddev *mddev)
 {
 	int err;
@@ -5353,6 +5363,8 @@ int md_run(struct mddev *mddev)
 	if (mddev->flags & MD_UPDATE_SB_FLAGS)
 		md_update_sb(mddev, 0);
 
+	md_set_no_mp(mddev);
+
 	md_new_event(mddev);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
-- 
2.7.4

* [PATCH 22/60] block: comment on bio_alloc_pages()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Jens Axboe, Kent Overstreet,
	Shaohua Li, Mike Christie, Guoqing Jiang, Hannes Reinecke,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

This patch adds a comment on the usage of bio_alloc_pages(),
and also comments on one special case in bch_data_verify().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/bio.c               | 4 +++-
 drivers/md/bcache/debug.c | 6 ++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index db85c5753a76..a49d1d89a85c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -907,7 +907,9 @@ EXPORT_SYMBOL(bio_advance);
  * @bio: bio to allocate pages for
  * @gfp_mask: flags for allocation
  *
- * Allocates pages up to @bio->bi_vcnt.
+ * Allocates pages up to @bio->bi_vcnt, and this function should only
+ * be called on a newly initialized bio, which means no page has been
+ * added to the bio via bio_add_page() yet.
  *
  * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
  * freed.
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 430f3050663c..71a9f05918eb 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,6 +110,12 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	struct bio_vec bv, cbv;
 	struct bvec_iter iter, citer = { 0 };
 
+	/*
+	 * Once multipage bvec is supported, the bio_clone()
+	 * has to make sure page count in this bio can be held
+	 * in the new cloned bio because each single page need
+	 * to assign to each bvec of the new bio.
+	 */
 	check = bio_clone(bio, GFP_NOIO);
 	if (!check)
 		return;
-- 
2.7.4

* [PATCH 21/60] bcache: comment on direct access to bvec table
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Kent Overstreet, Shaohua Li,
	Mike Christie, Hannes Reinecke, Guoqing Jiang, Jiri Kosina,
	Zheng Liu, Eric Wheeler, Yijing Wang, Coly Li, Al Viro,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

All of these direct accesses look safe after multipage bvec is supported.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/btree.c | 1 +
 drivers/md/bcache/super.c | 6 ++++++
 drivers/md/bcache/util.c  | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81d3db40cd7b..b419bc91ba32 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -428,6 +428,7 @@ static void do_btree_node_write(struct btree *b)
 
 		continue_at(cl, btree_node_write_done, NULL);
 	} else {
+		/* No harm for multipage bvec since the new is just allocated */
 		b->bio->bi_vcnt = 0;
 		bch_bio_map(b->bio, i);
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d8a6d807b498..52876fcf2b36 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -207,6 +207,7 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
+	/* single page bio, safe for multipage bvec */
 	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
 	unsigned i;
 
@@ -1153,6 +1154,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->bdev->bd_holder = dc;
 
 	bio_init_with_vec_table(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
+
+	/* single page bio, safe for multipage bvec */
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
@@ -1794,6 +1797,7 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
+	/* single page bio, safe for multipage bvec */
 	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
 		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
 
@@ -1850,6 +1854,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->bdev->bd_holder = ca;
 
 	bio_init_with_vec_table(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
+
+	/* single page bio, safe for multipage bvec */
 	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index dde6172f3f10..5cc0b49a65fb 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -222,6 +222,13 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
 		: 0;
 }
 
+/*
+ * Generally it isn't good to access .bi_io_vec and .bi_vcnt
+ * directly, the preferred way is bio_add_page, but in
+ * this case, bch_bio_map() supposes that the bvec table
+ * is empty, so it is safe to access .bi_vcnt & .bi_io_vec
+ * in this way even after multipage bvec is supported.
+ */
 void bch_bio_map(struct bio *bio, void *base)
 {
 	size_t size = bio->bi_iter.bi_size;
-- 
2.7.4


* [PATCH 09/60] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Avoid accessing .bi_vcnt directly, because it may no longer be what
the driver expects once multipage bvec is supported.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-rq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1d0d2adc050a..8534cbf8ce35 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -819,7 +819,8 @@ static void dm_old_request_fn(struct request_queue *q)
 			pos = blk_rq_pos(rq);
 
 		if ((dm_old_request_peeked_before_merge_deadline(md) &&
-		     md_in_flight(md) && rq->bio && rq->bio->bi_vcnt == 1 &&
+		     md_in_flight(md) && rq->bio &&
+		     !bio_multiple_segments(rq->bio) &&
 		     md->last_rq_pos == pos && md->last_rq_rw == rq_data_dir(rq)) ||
 		    (ti->type->busy && ti->type->busy(ti))) {
 			blk_delay_queue(q, 10);
-- 
2.7.4

* [PATCH 08/60] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Firstly, we have mature bvec/bio iterator helpers for iterating over
each page in a bio, so it is not necessary to reinvent the wheel.

Secondly, the coming multipage bvecs require this patch.

Also add comments about the direct access to the bvec table.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-io.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0bf1a12e35fe..2ef573c220fc 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -162,7 +162,10 @@ struct dpages {
 			 struct page **p, unsigned long *len, unsigned *offset);
 	void (*next_page)(struct dpages *dp);
 
-	unsigned context_u;
+	union {
+		unsigned context_u;
+		struct bvec_iter context_bi;
+	};
 	void *context_ptr;
 
 	void *vma_invalidate_address;
@@ -204,25 +207,36 @@ static void list_dp_init(struct dpages *dp, struct page_list *pl, unsigned offse
 static void bio_get_page(struct dpages *dp, struct page **p,
 			 unsigned long *len, unsigned *offset)
 {
-	struct bio_vec *bvec = dp->context_ptr;
-	*p = bvec->bv_page;
-	*len = bvec->bv_len - dp->context_u;
-	*offset = bvec->bv_offset + dp->context_u;
+	struct bio_vec bv = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
+			dp->context_bi);
+
+	*p = bv.bv_page;
+	*len = bv.bv_len;
+	*offset = bv.bv_offset;
+
+	/* avoid to figure out it in bio_next_page() again */
+	dp->context_bi.bi_sector = (sector_t)bv.bv_len;
 }
 
 static void bio_next_page(struct dpages *dp)
 {
-	struct bio_vec *bvec = dp->context_ptr;
-	dp->context_ptr = bvec + 1;
-	dp->context_u = 0;
+	unsigned int len = (unsigned int)dp->context_bi.bi_sector;
+
+	bvec_iter_advance((struct bio_vec *)dp->context_ptr,
+			&dp->context_bi, len);
 }
 
 static void bio_dp_init(struct dpages *dp, struct bio *bio)
 {
 	dp->get_page = bio_get_page;
 	dp->next_page = bio_next_page;
-	dp->context_ptr = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-	dp->context_u = bio->bi_iter.bi_bvec_done;
+
+	/*
+	 * We just use bvec iterator to retrieve pages, so it is ok to
+	 * access the bvec table directly here
+	 */
+	dp->context_ptr = bio->bi_io_vec;
+	dp->context_bi = bio->bi_iter;
 }
 
 /*
-- 
2.7.4


* [PATCH 07/60] dm: crypt: use bio_add_page()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

We have a standard interface for adding a page to a bio, so don't
open-code it in a hacky way.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-crypt.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index a2768835d394..4999c7497f95 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -994,7 +994,6 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
 	unsigned i, len, remaining_size;
 	struct page *page;
-	struct bio_vec *bvec;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1019,12 +1018,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 
 		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
 
-		bvec = &clone->bi_io_vec[clone->bi_vcnt++];
-		bvec->bv_page = page;
-		bvec->bv_len = len;
-		bvec->bv_offset = 0;
-
-		clone->bi_iter.bi_size += len;
+		bio_add_page(clone, page, len, 0);
 
 		remaining_size -= len;
 	}
-- 
2.7.4

* [PATCH 06/60] bcache: debug: avoid to access .bi_io_vec directly
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Kent Overstreet, Shaohua Li,
	Mike Christie, Hannes Reinecke, Guoqing Jiang,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Use the standard iterator helpers instead.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/debug.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 333a1e5f6ae6..430f3050663c 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -107,8 +107,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 {
 	char name[BDEVNAME_SIZE];
 	struct bio *check;
-	struct bio_vec bv;
-	struct bvec_iter iter;
+	struct bio_vec bv, cbv;
+	struct bvec_iter iter, citer = { 0 };
 
 	check = bio_clone(bio, GFP_NOIO);
 	if (!check)
@@ -120,9 +120,13 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 
 	submit_bio_wait(check);
 
+	citer.bi_size = UINT_MAX;
 	bio_for_each_segment(bv, bio, iter) {
 		void *p1 = kmap_atomic(bv.bv_page);
-		void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
+		void *p2;
+
+		cbv = bio_iter_iovec(check, citer);
+		p2 = page_address(cbv.bv_page);
 
 		cache_set_err_on(memcmp(p1 + bv.bv_offset,
 					p2 + bv.bv_offset,
@@ -133,6 +137,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 				 (uint64_t) bio->bi_iter.bi_sector);
 
 		kunmap_atomic(p1);
+		bio_advance_iter(check, &citer, bv.bv_len);
 	}
 
 	bio_free_pages(check);
-- 
2.7.4


* [PATCH 02/60] block drivers: convert to bio_init_with_vec_table()
From: Ming Lei @ 2016-10-29  8:08 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Jiri Kosina, Kent Overstreet,
	Shaohua Li, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER LVM, Christoph Hellwig, Sagi Grimberg,
	Joern Engel, Prasad Joshi, Mike Christie, Hannes Reinecke,
	Rasmus Villemoes, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1477728600-12938-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/block/floppy.c        |  3 +--
 drivers/md/bcache/io.c        |  4 +---
 drivers/md/bcache/journal.c   |  4 +---
 drivers/md/bcache/movinggc.c  |  7 +++----
 drivers/md/bcache/super.c     | 13 ++++---------
 drivers/md/bcache/writeback.c |  6 +++---
 drivers/md/dm-bufio.c         |  4 +---
 drivers/md/raid5.c            |  9 ++-------
 drivers/nvme/target/io-cmd.c  |  4 +---
 fs/logfs/dev_bdev.c           |  4 +---
 10 files changed, 18 insertions(+), 40 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index e3d8e4ced4a2..cdc916a95137 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3806,8 +3806,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
 
 	cbdata.drive = drive;
 
-	bio_init(&bio);
-	bio.bi_io_vec = &bio_vec;
+	bio_init_with_vec_table(&bio, &bio_vec, 1);
 	bio_vec.bv_page = page;
 	bio_vec.bv_len = size;
 	bio_vec.bv_offset = 0;
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index e97b0acf7b8d..af9489087cd3 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -24,9 +24,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
 	struct bbio *b = mempool_alloc(c->bio_meta, GFP_NOIO);
 	struct bio *bio = &b->bio;
 
-	bio_init(bio);
-	bio->bi_max_vecs	 = bucket_pages(c);
-	bio->bi_io_vec		 = bio->bi_inline_vecs;
+	bio_init_with_vec_table(bio, bio->bi_inline_vecs, bucket_pages(c));
 
 	return bio;
 }
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6925023e12d4..b966f28d1b98 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -448,13 +448,11 @@ static void do_journal_discard(struct cache *ca)
 
 		atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
 
-		bio_init(bio);
+		bio_init_with_vec_table(bio, bio->bi_inline_vecs, 1);
 		bio_set_op_attrs(bio, REQ_OP_DISCARD, 0);
 		bio->bi_iter.bi_sector	= bucket_to_sector(ca->set,
 						ca->sb.d[ja->discard_idx]);
 		bio->bi_bdev		= ca->bdev;
-		bio->bi_max_vecs	= 1;
-		bio->bi_io_vec		= bio->bi_inline_vecs;
 		bio->bi_iter.bi_size	= bucket_bytes(ca);
 		bio->bi_end_io		= journal_discard_endio;
 
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index 5c4bddecfaf0..9d7991f69030 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -77,15 +77,14 @@ static void moving_init(struct moving_io *io)
 {
 	struct bio *bio = &io->bio.bio;
 
-	bio_init(bio);
+	bio_init_with_vec_table(bio, bio->bi_inline_vecs,
+				DIV_ROUND_UP(KEY_SIZE(&io->w->key),
+					     PAGE_SECTORS));
 	bio_get(bio);
 	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
 	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
-	bio->bi_max_vecs	= DIV_ROUND_UP(KEY_SIZE(&io->w->key),
-					       PAGE_SECTORS);
 	bio->bi_private		= &io->cl;
-	bio->bi_io_vec		= bio->bi_inline_vecs;
 	bch_bio_map(bio, NULL);
 }
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 849ad441cd76..d8a6d807b498 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1152,9 +1152,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->bdev = bdev;
 	dc->bdev->bd_holder = dc;
 
-	bio_init(&dc->sb_bio);
-	dc->sb_bio.bi_max_vecs	= 1;
-	dc->sb_bio.bi_io_vec	= dc->sb_bio.bi_inline_vecs;
+	bio_init_with_vec_table(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
@@ -1814,9 +1812,8 @@ static int cache_alloc(struct cache *ca)
 	__module_get(THIS_MODULE);
 	kobject_init(&ca->kobj, &bch_cache_ktype);
 
-	bio_init(&ca->journal.bio);
-	ca->journal.bio.bi_max_vecs = 8;
-	ca->journal.bio.bi_io_vec = ca->journal.bio.bi_inline_vecs;
+	bio_init_with_vec_table(&ca->journal.bio,
+				ca->journal.bio.bi_inline_vecs, 8);
 
 	free = roundup_pow_of_two(ca->sb.nbuckets) >> 10;
 
@@ -1852,9 +1849,7 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->bdev = bdev;
 	ca->bdev->bd_holder = ca;
 
-	bio_init(&ca->sb_bio);
-	ca->sb_bio.bi_max_vecs	= 1;
-	ca->sb_bio.bi_io_vec	= ca->sb_bio.bi_inline_vecs;
+	bio_init_with_vec_table(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
 	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index e51644e503a5..b2568cef8c86 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -106,14 +106,14 @@ static void dirty_init(struct keybuf_key *w)
 	struct dirty_io *io = w->private;
 	struct bio *bio = &io->bio;
 
-	bio_init(bio);
+	bio_init_with_vec_table(bio, bio->bi_inline_vecs,
+				DIV_ROUND_UP(KEY_SIZE(&w->key),
+					     PAGE_SECTORS));
 	if (!io->dc->writeback_percent)
 		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
 	bio->bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
-	bio->bi_max_vecs	= DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
 	bio->bi_private		= w;
-	bio->bi_io_vec		= bio->bi_inline_vecs;
 	bch_bio_map(bio, NULL);
 }
 
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 125aedc3875f..5b13e7e7c8aa 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -611,9 +611,7 @@ static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
 	char *ptr;
 	int len;
 
-	bio_init(&b->bio);
-	b->bio.bi_io_vec = b->bio_vec;
-	b->bio.bi_max_vecs = DM_BUFIO_INLINE_VECS;
+	bio_init_with_vec_table(&b->bio, b->bio_vec, DM_BUFIO_INLINE_VECS);
 	b->bio.bi_iter.bi_sector = block << b->c->sectors_per_block_bits;
 	b->bio.bi_bdev = b->c->bdev;
 	b->bio.bi_end_io = inline_endio;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 92ac251e91e6..eae7b4cf34d4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2004,13 +2004,8 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
 
-			bio_init(&dev->req);
-			dev->req.bi_io_vec = &dev->vec;
-			dev->req.bi_max_vecs = 1;
-
-			bio_init(&dev->rreq);
-			dev->rreq.bi_io_vec = &dev->rvec;
-			dev->rreq.bi_max_vecs = 1;
+			bio_init_with_vec_table(&dev->req, &dev->vec, 1);
+			bio_init_with_vec_table(&dev->rreq, &dev->rvec, 1);
 		}
 	}
 	return sh;
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 4a96c2049b7b..6a32b0b68b1e 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -37,9 +37,7 @@ static void nvmet_inline_bio_init(struct nvmet_req *req)
 {
 	struct bio *bio = &req->inline_bio;
 
-	bio_init(bio);
-	bio->bi_max_vecs = NVMET_MAX_INLINE_BIOVEC;
-	bio->bi_io_vec = req->inline_bvec;
+	bio_init_with_vec_table(bio, req->inline_bvec, NVMET_MAX_INLINE_BIOVEC);
 }
 
 static void nvmet_execute_rw(struct nvmet_req *req)
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index a8329cc47dec..2bf53b0ffe83 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -19,9 +19,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int op)
 	struct bio bio;
 	struct bio_vec bio_vec;
 
-	bio_init(&bio);
-	bio.bi_max_vecs = 1;
-	bio.bi_io_vec = &bio_vec;
+	bio_init_with_vec_table(&bio, &bio_vec, 1);
 	bio_vec.bv_page = page;
 	bio_vec.bv_len = PAGE_SIZE;
 	bio_vec.bv_offset = 0;
-- 
2.7.4


* [PATCH 00/60] block: support multipage bvec
From: Ming Lei @ 2016-10-29  8:07 UTC
  To: Jens Axboe, linux-kernel
  Cc: linux-block, linux-fsdevel, Christoph Hellwig,
	Kirill A . Shutemov, Ming Lei, Al Viro, Andrew Morton,
	Bart Van Assche, open list:GFS2 FILE SYSTEM, Coly Li,
	Dan Williams, open list:DEVICE-MAPPER  LVM, open list:DRBD DRIVER,
	Eric Wheeler, Guoqing Jiang, Hannes Reinecke, Hannes Reinecke,
	Jiri Kosina, Joe Perches, Johannes Berg, Johannes Thumshirn,
	Keith Busch, Kent

Hi,

This patchset brings multipage bvec into the block layer. Basic
xfstests (-a auto) over virtio-blk/virtio-scsi have been run
and no regression was found, so it should be good enough
to show the approach now. Any comments are welcome!

1) what is multipage bvec?

Multipage bvec means that one 'struct bio_vec' can hold
multiple physically contiguous pages, instead of the
single page it has held in the Linux kernel for a long time.
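
To illustrate the definition above, here is a minimal sketch in Python
(not kernel code) of how a run of single-page entries could be coalesced
into multipage segments. Pages are modelled only by their page frame
numbers; the names Segment and coalesce_pages and the 4096-byte page size
are assumptions made for this example and do not come from the patchset.

```python
from typing import List, NamedTuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Segment(NamedTuple):
    """Hypothetical model of one multipage bvec: a physically contiguous run."""
    first_pfn: int  # page frame number of the first page in the run
    nr_pages: int   # number of physically contiguous pages

    @property
    def length(self) -> int:
        """Length of the run in bytes."""
        return self.nr_pages * PAGE_SIZE


def coalesce_pages(pfns: List[int]) -> List[Segment]:
    """Merge neighbouring pages whose frame numbers are consecutive."""
    segments: List[Segment] = []
    for pfn in pfns:
        last = segments[-1] if segments else None
        if last is not None and last.first_pfn + last.nr_pages == pfn:
            # The page directly follows the previous run, so extend that run.
            segments[-1] = Segment(last.first_pfn, last.nr_pages + 1)
        else:
            # Not contiguous with the previous run, so start a new segment.
            segments.append(Segment(pfn, 1))
    return segments
```

With six pages at frame numbers 100, 101, 102, 200, 201 and 50, the
sketch produces three segments instead of six single-page entries.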

2) why is multipage bvec introduced?

Kent proposed the idea[1] first.

As system RAM becomes much bigger than before, and as huge
pages, transparent huge pages and memory compaction are
widely used, it is now fairly easy to see physically
contiguous pages inside the fs/block stack. On the other hand,
from the block layer's view, it isn't necessary to store the
intermediate pages in the bvec; it is enough to just store
the physically contiguous 'segment'.

Also, huge pages are being brought to filesystems[2], so we
can do IO a huge page at a time[3], which requires that one bio
can transfer at least one huge page at a time. It turns out that
simply changing BIO_MAX_PAGES isn't flexible[3]. Multipage bvec
fits this case very well.

With multipage bvec:

- bio size can be increased, which should in theory improve
some high-bandwidth IO cases[4].

- Inside the block layer, both bio splitting and sg mapping can
become more efficient than before by traversing just the
physically contiguous 'segment' instead of each page.

- there is the possibility in future to improve the memory
footprint of bvec usage.

3) how is multipage bvec implemented in this patchset?

The first 22 patches clean up direct access to the bvec table,
and comment on some special cases. With this approach,
most cases are found to be safe for multipage bvec;
only fs/buffer, pktcdvd, dm-io, MD and btrfs need to be dealt
with.

Given that a little more work is involved to clean up pktcdvd,
MD and btrfs, this patchset introduces QUEUE_FLAG_NO_MP for
them, and these components can still see/use singlepage bvec.
In the future, once the cleanup is done, the flag can be killed.

The second part (23 ~ 60) implements multipage bvec in block:

- put all tricks into the bvec/bio/rq iterators; as long as
drivers and fs use these standard iterators, they are happy
with multipage bvec

- bio_for_each_segment_all() changes
this helper passes a pointer to each bvec directly to the user, so
it has to be changed. Two new helpers (bio_for_each_segment_all_rd()
and bio_for_each_segment_all_wt()) are introduced.

- bio_clone() changes
By default bio_clone() still clones a new bio in the multipage bvec
way. A single-page version of bio_clone() is also introduced
for some special cases in which only single-page bvecs are used
for the newly cloned bio (bio bounce, ...)

These patches can be found in the following git tree:

	https://github.com/ming1/linux/tree/mp-bvec-0.3-v4.9

Thanks to Christoph for looking at the early version and providing
very good suggestions, such as introducing bio_init_with_vec_table(),
removing other unnecessary helpers for cleanup, and so on.

TODO:
	- cleanup direct access to bvec table for MD & btrfs


[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], http://lwn.net/Articles/700781/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2


Ming Lei (60):
  block: bio: introduce bio_init_with_vec_table()
  block drivers: convert to bio_init_with_vec_table()
  block: drbd: remove impossible failure handling
  block: floppy: use bio_add_page()
  target: avoid to access .bi_vcnt directly
  bcache: debug: avoid to access .bi_io_vec directly
  dm: crypt: use bio_add_page()
  dm: use bvec iterator helpers to implement .get_page and .next_page
  dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
  fs: logfs: convert to bio_add_page() in sync_request()
  fs: logfs: use bio_add_page() in __bdev_writeseg()
  fs: logfs: use bio_add_page() in do_erase()
  fs: logfs: remove unnecesary check
  block: drbd: comment on direct access bvec table
  block: loop: comment on direct access to bvec table
  block: pktcdvd: comment on direct access to bvec table
  kernel/power/swap.c: comment on direct access to bvec table
  mm: page_io.c: comment on direct access to bvec table
  fs/buffer: comment on direct access to bvec table
  f2fs: f2fs_read_end_io: comment on direct access to bvec table
  bcache: comment on direct access to bvec table
  block: comment on bio_alloc_pages()
  block: introduce flag QUEUE_FLAG_NO_MP
  md: set NO_MP for request queue of md
  block: pktcdvd: set NO_MP for pktcdvd request queue
  btrfs: set NO_MP for request queues behind BTRFS
  block: introduce BIO_SP_MAX_SECTORS
  block: introduce QUEUE_FLAG_SPLIT_MP
  dm: limit the max bio size as BIO_SP_MAX_SECTORS << SECTOR_SHIFT
  bcache: set flag of QUEUE_FLAG_SPLIT_MP
  block: introduce multipage/single page bvec helpers
  block: implement sp version of bvec iterator helpers
  block: introduce bio_for_each_segment_mp()
  block: introduce bio_clone_sp()
  bvec_iter: introduce BVEC_ITER_ALL_INIT
  block: bounce: avoid direct access to bvec from bio->bi_io_vec
  block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq
  block: bounce: convert multipage bvecs into singlepage
  bcache: debug: switch to bio_clone_sp()
  blk-merge: compute bio->bi_seg_front_size efficiently
  block: blk-merge: try to make front segments in full size
  block: use bio_for_each_segment_mp() to compute segments count
  block: use bio_for_each_segment_mp() to map sg
  block: introduce bvec_for_each_sp_bvec()
  block: bio: introduce bio_for_each_segment_all_rd() and its write pair
  block: deal with dirtying pages for multipage bvec
  block: convert to bio_for_each_segment_all_rd()
  fs/mpage: convert to bio_for_each_segment_all_rd()
  fs/direct-io: convert to bio_for_each_segment_all_rd()
  ext4: convert to bio_for_each_segment_all_rd()
  xfs: convert to bio_for_each_segment_all_rd()
  logfs: convert to bio_for_each_segment_all_rd()
  gfs2: convert to bio_for_each_segment_all_rd()
  f2fs: convert to bio_for_each_segment_all_rd()
  exofs: convert to bio_for_each_segment_all_rd()
  fs: crypto: convert to bio_for_each_segment_all_rd()
  bcache: convert to bio_for_each_segment_all_rd()
  dm-crypt: convert to bio_for_each_segment_all_rd()
  fs/buffer.c: use bvec iterator to truncate the bio
  block: enable multipage bvecs

 block/bio.c                        | 104 ++++++++++++++----
 block/blk-merge.c                  | 216 +++++++++++++++++++++++++++++--------
 block/bounce.c                     |  80 ++++++++++----
 drivers/block/drbd/drbd_bitmap.c   |   1 +
 drivers/block/drbd/drbd_receiver.c |  14 +--
 drivers/block/floppy.c             |  10 +-
 drivers/block/loop.c               |   5 +
 drivers/block/pktcdvd.c            |   8 ++
 drivers/md/bcache/btree.c          |   4 +-
 drivers/md/bcache/debug.c          |  19 +++-
 drivers/md/bcache/io.c             |   4 +-
 drivers/md/bcache/journal.c        |   4 +-
 drivers/md/bcache/movinggc.c       |   7 +-
 drivers/md/bcache/super.c          |  25 +++--
 drivers/md/bcache/util.c           |   7 ++
 drivers/md/bcache/writeback.c      |   6 +-
 drivers/md/dm-bufio.c              |   4 +-
 drivers/md/dm-crypt.c              |  11 +-
 drivers/md/dm-io.c                 |  34 ++++--
 drivers/md/dm-rq.c                 |   3 +-
 drivers/md/dm.c                    |  11 +-
 drivers/md/md.c                    |  12 +++
 drivers/md/raid5.c                 |   9 +-
 drivers/nvme/target/io-cmd.c       |   4 +-
 drivers/target/target_core_pscsi.c |   8 +-
 fs/btrfs/volumes.c                 |   3 +
 fs/buffer.c                        |  24 +++--
 fs/crypto/crypto.c                 |   3 +-
 fs/direct-io.c                     |   4 +-
 fs/exofs/ore.c                     |   3 +-
 fs/exofs/ore_raid.c                |   3 +-
 fs/ext4/page-io.c                  |   3 +-
 fs/ext4/readpage.c                 |   3 +-
 fs/f2fs/data.c                     |  13 ++-
 fs/gfs2/lops.c                     |   3 +-
 fs/gfs2/meta_io.c                  |   3 +-
 fs/logfs/dev_bdev.c                | 110 +++++++------------
 fs/mpage.c                         |   3 +-
 fs/xfs/xfs_aops.c                  |   3 +-
 include/linux/bio.h                | 108 +++++++++++++++++--
 include/linux/blk_types.h          |   6 ++
 include/linux/blkdev.h             |   4 +
 include/linux/bvec.h               | 123 +++++++++++++++++++--
 kernel/power/swap.c                |   2 +
 mm/page_io.c                       |   1 +
 45 files changed, 759 insertions(+), 276 deletions(-)

-- 
2.7.4


^ permalink raw reply

* Re: [PATCH] raid1: handle read error also in readonly mode
From: Shaohua Li @ 2016-10-29  5:01 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477658758-22637-1-git-send-email-tomasz.majchrzak@intel.com>

On Fri, Oct 28, 2016 at 02:45:58PM +0200, Tomasz Majchrzak wrote:
> If write is the first operation on a disk and it happens not to be
> aligned to page size, block layer sends read request first. If read
> operation fails, the disk is set as failed as no attempt to fix the
> error is made because array is in auto-readonly mode. Similarly, the
> disk is set as failed for read-only array.
> 
> Take the same approach as in raid10. Don't fail the disk if array is in
> readonly or auto-readonly mode. Try to redirect the request first and if
> unsuccessful, return a read error.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH] md: be careful not lot leak internal curr_resync value into metadata.
From: Shaohua Li @ 2016-10-29  5:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, Linux-RAID, Viswesh
In-Reply-To: <8737jht6f6.fsf@notabene.neil.brown.name>

On Fri, Oct 28, 2016 at 03:59:41PM +1100, Neil Brown wrote:
> 
> 
> mddev->curr_resync usually records where the current resync is up to,
> but during the starting phase it has some "magic" values.
> 
>  1 - means that the array is trying to start a resync, but has yielded
>      to another array which shares physical devices, and also needs to
>      start a resync
>  2 - means the array is trying to start resync, but has found another
>      array which shares physical devices and has already started resync.
> 
>  3 - means that resync has commenced, but it is possible that nothing
>      has actually been resynced yet.
> 
> It is important that this value not be visible to user-space and
> particularly that it doesn't get written to the metadata, as the
> resync or recovery checkpoint.  In part, this is because it may be
> slightly higher than the correct value, though this is very rare.
> In part, because it is not a multiple of 4K, and some devices only
> support 4K aligned accesses.
> 
> There are two places where this value is propagated into either
> ->curr_resync_completed or ->recovery_cp or ->recovery_offset.
> These currently avoid the propagation of values 1 and 2, but will
> allow 3 to leak through.
> 
> Change them to only propagate the value if it is > 3.
> 
> As this can cause an array to fail, the patch is suitable for -stable.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Viswesh <viswesh.vichu@gmail.com>
> Signed-off-by: NeilBrown <neilb@suse.com>

Good catch, applied, thanks!
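
For readers following along, the rule in the quoted changelog can be
modelled in a few lines. This is an illustrative Python sketch, not the
kernel change itself; the constant names and the function
checkpoint_to_record are hypothetical and appear nowhere in the patch.

```python
from typing import Optional

# "Magic" values of mddev->curr_resync during the starting phase, as
# described in the quoted changelog. The constant names are hypothetical.
RESYNC_YIELDED = 1       # yielded to another array sharing physical devices
RESYNC_DELAYED = 2       # another such array has already started resync
RESYNC_JUST_STARTED = 3  # resync commenced, possibly nothing resynced yet


def checkpoint_to_record(curr_resync: int) -> Optional[int]:
    """Return the value that may be propagated as a checkpoint, or None.

    Only values greater than 3 are real resync positions; the magic
    values 1, 2 and 3 must never reach the metadata.
    """
    return curr_resync if curr_resync > RESYNC_JUST_STARTED else None
```

The point of the "> 3" comparison is that one test excludes all three
magic values, including the 3 that the changelog says leaked through.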

^ permalink raw reply

* Re: recovering failed raid5
From: Phil Turmel @ 2016-10-29  2:53 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Alexander Shenkin, linux-raid
In-Reply-To: <20161028234529.GA3909@metamorpher.de>

Sigh.

On 10/28/2016 07:45 PM, Andreas Klauer wrote:

> Everyone has to find their own approach to things.

Humanity advances by learning from other people's mistakes.  If we all
had to learn from scratch the technology we use every day, we'd still be
living in huts.

Please read this archived mail, and if you can, the whole thread:

http://marc.info/?l=linux-raid&m=135811522817345&w=1

Then read the rest of the archived links in the wiki here:

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

Maybe you'll get it.  Maybe you won't.  I just don't want innocent
bystanders to take your undisputed commentary as gospel.

Phil

^ permalink raw reply

* Re: recovering failed raid5
From: Edward Kuns @ 2016-10-29  2:52 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Phil Turmel, Alexander Shenkin, Linux-RAID
In-Reply-To: <20161028234529.GA3909@metamorpher.de>

On Fri, Oct 28, 2016 at 6:45 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> You'd think timeouts would solve all problems. They probably don't.
> In some exceedingly rare cases, they might not even matter at all.

As someone who has experienced this problem, no-one is saying that
having correct timeouts fixes *all* problems.  Obviously, drives fail.
However, it's very clear that having mismatched timeouts can cause a
single-sector failure to escalate to the whole drive being kicked from
the array, exposing you to a much bigger risk of data loss if anything
else at all goes wrong while you have no redundancy.

Right, timeouts won't matter all the time.  They only matter when
mismatched and when you hit a condition that causes the OS to give up
before the drive does.  Is that a good reason to look the other way
and not even check to see if you are exposed to risk by having
mismatched timeouts?  Would you tell someone to go boating without a
life jacket because most of the time they might not matter at all?

         Eddie

^ permalink raw reply

* Re: recovering failed raid5
From: Andreas Klauer @ 2016-10-28 23:45 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Alexander Shenkin, linux-raid
In-Reply-To: <b878efa4-2581-9284-90ca-170ece7219b5@turmel.org>

On Fri, Oct 28, 2016 at 05:16:27PM -0400, Phil Turmel wrote:
> Andreas' approach is rather expensive in practice

Not really. Currently all of my disks are out of their warranty period. 
Whenever I bring this up the first thing I hear is that I'm just 
not noticing these errors that are happening all the time... oh well.

I run SMART selftests daily (select,cont), I run mdadm checks and check 
for mismatch_cnt afterwards (always 0 thus far). Not sure what else to 
do... haven't gone as far as patching the kernel to be more verbose. 
There's only so much you can do.

I'm mainly using cheap WD Green drives. I don't like enterprise drives, 
there's nothing that makes them more reliable, and in a home use where 
they twiddle their thumbs most of the time what's the point of it all? 
Expensive drives are more likely to turn you into a penny-pincher 
when replacement would be the right thing to do...

> manufacturers of consumer-grade drives specify an error rate of less
> than 1 per 10^14 bits read.  That's only 12.5TB.

Yes, according to that math you get stuff like that:

    http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/

Or perhaps that just isn't how failures happen.

    https://www.high-rely.com/blog/why-raid-5-stops-working-in-2009-not/

I'm sure there are better links on the topic.

If there actually was one failure for every 12.5TB, this technology 
would be unusable. It's a LOT more reliable than that, thankfully. 
So no, I don't replace my disks every 12.5TB. That'd be ridiculous.

Maybe you didn't mean it this way.

> Pending relocations are often just glitches that are gone after the
> sector is rewritten.

That's the other opinion I was referring to.

There's no way to tell what caused sectors to become unreadable. 
Is it just a glitch in the matrix, never happen again once fixed? 
Or is it a serious issue, likely to reoccur or get even worse?
Who knows? It's not like you can open it and check. 

> a weekly or monthly "check" scrub will help flush them out 
> in a timely fashion.

Our advice is not that different. You recommend regular checks. 
I recommend regular checks.

I just don't believe in the "it will magically fix itself and 
never happen again" kind of story. It's a trust issue, I just 
can't bring myself to trust disks that have already lost data 
once. Elsewhere people add checksums to filesystems because 
they worry about single bit flips, not entire sectors gone... 
how come one is completely fine but not the other?
(I'm not worried about bit flips, either.)

I see this timeout thing as a fad, it's brought up in every 
other thread about raid failures on this list, regardless of 
how little or no indication there was that timeouts were 
related in any way at all to the failure in question.

You'd think timeouts would solve all problems. They probably don't. 
In some exceedingly rare cases, they might not even matter at all.

> Andreas is flat-out wrong on this.

I say his raid failed due to not running checks, 
running checks is something you recommend too.
There is some common ground there, however tiny.

> Not that I recommend running without the SMART features

That's the general gist I get from reading your posts, though.

> -- you will still want to know when your drives have real problems.

What's a real problem then, when pending sectors and read failures 
in selftest are not real enough?

Some arbitrarily chosen number of errors...

Disks just go bad. You can make up whatever reasons to not replace them, 
but whether your RAID will survive it, seems like a gamble to me.
Backups are a failsafe. I like the safe part, I try to avoid the fail.

Everyone has to find their own approach to things.

Regards
Andreas Klauer

^ permalink raw reply

* Re: recovering failed raid5
From: Phil Turmel @ 2016-10-28 21:16 UTC (permalink / raw)
  To: Andreas Klauer, Alexander Shenkin; +Cc: linux-raid
In-Reply-To: <20161028133304.GA11564@metamorpher.de>

Good afternoon Alexander,

On 10/28/2016 09:33 AM, Andreas Klauer wrote:
> On Fri, Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
>> One remaining question: is sdc definitely toast?
> 
> In my opinion a drive is toast starting from the very first reallocated/ 
> pending/uncorrectable sector, your drive has several of those and that's 
> only the ones the drive already knows about - there may be more.

Actual vs. Pending relocations are very different things.  Andreas'
approach is rather expensive in practice, as manufacturers of
consumer-grade drives specify an error rate of less than 1 per 10^14
bits read.  That's only 12.5TB.  A moderately used media server will
encounter many of these in a four to five year life span.  Few at first,
then more as the drive ages.  If you insist on replacing drives at the
first "pending" relocation, expect to purchase many more drives than
everyone else.

Enterprise drives work the same way, BTW, just with a spec of 1 per
10^15 bits read.  Since enterprise drives are typically in constant
heavy use, a similar count in a normal lifespan is expected.
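
The figures above can be checked with a short calculation. The sketch
below is Python and uses decimal terabytes. Its second function rests on
a deliberately naive assumption that is not part of any drive
specification: that errors strike each bit independently at exactly the
specified rate. The spec is an upper bound ("less than 1 per"), so real
drives may do considerably better. The function names are hypothetical.

```python
import math

BITS_PER_BYTE = 8
TB = 10**12  # decimal terabyte


def bytes_per_error(bits_per_error: float) -> float:
    """Convert a '1 error per N bits read' spec into bytes read per error."""
    return bits_per_error / BITS_PER_BYTE


def p_at_least_one_error(bytes_read: float, bits_per_error: float) -> float:
    """Chance of one or more errors under the naive independent-bit model.

    Computes 1 - (1 - 1/N)**bits, using log1p/expm1 for numerical stability.
    """
    bits = bytes_read * BITS_PER_BYTE
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


print(bytes_per_error(1e14) / TB)  # 12.5  (consumer spec)
print(bytes_per_error(1e15) / TB)  # 125.0 (enterprise spec)
```

Under that naive model, reading 12.5TB from a drive that performs exactly
at the consumer spec gives a probability of about 0.63 of meeting at
least one unreadable sector; it is not a certainty.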

>> Or, is it possible that the Timeout Mismatch (as mentioned by Robin Hill; 
>> thanks Robin) is flagging the drive as failed, when something else is at 
>> play and perhaps the drive is actually fine?

Pending relocations are often just glitches that are gone after the
sector is rewritten.  If your drives have an error timeout that is
shorter than the OS device driver timeout, a raid array will silently
fix these errors for you and you'll never notice.  If your array is
lightly used, a weekly or monthly "check" scrub will help flush them out
in a timely fashion.

If you have a green or desktop drive that has a long timeout (greater than
the 30-second default Linux driver timeout), your array will crash when
your drives age just enough to pop up their first UREs.  Please read the
list archives linked in the wiki to help you understand how and why this
happens.
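
As a rough sketch of the comparison described above, the Python below
checks whether a drive's error recovery limit is safely shorter than the
kernel's command timeout. It assumes the kernel timeout can be read from
/sys/block/<device>/device/timeout, and that the caller supplies the
drive's own recovery limit in seconds (for example from its SCT ERC
setting), or None if the drive has no configured limit. Treating None as
a mismatch is a conservative assumption of this example, and both
function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def kernel_command_timeout(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read the kernel's command timeout in seconds for a device like 'sda'."""
    path = Path(sysfs_root, device, "device", "timeout")
    return int(path.read_text().strip())


def has_timeout_mismatch(drive_recovery_seconds: Optional[float],
                         kernel_timeout_seconds: float) -> bool:
    """True if the kernel may give up before the drive reports its error.

    drive_recovery_seconds is None when the drive has no configured
    recovery limit; this sketch conservatively treats that as a mismatch.
    """
    if drive_recovery_seconds is None:
        return True
    return drive_recovery_seconds >= kernel_timeout_seconds
```

For example, a drive limited to 7 seconds against the 30-second default
is fine, while a drive with no limit is flagged until the kernel timeout
is raised well above whatever the drive can take.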

> I don't believe in timeout mismatches, either. The timeouts are generous. 
> Waiting for a disk to wake from standby is not a problem, and that takes 
> ages already. If a disk gets stuck even longer in error correction limbo 
> and it gets kicked because of it - IMHO that's the right call.

Alex, I strongly recommend you ignore Andreas' advice on this one topic.
 Use the work-arounds for the drives you have, and buy friendlier drives
as age and capacity increases demand.  { If your livelihood or marriage
depends on the security of the contents of your array, buy enterprise
drives and verify your backup system... }

[trim /]

> Your RAID did not fail because of timeouts or not. It's not important. 
> It failed because you didn't notice broken disks in time and you had two. 
> Testing, monitoring, actually acting on the first error, is important. 

Andreas is flat-out wrong on this.  If you had the work-arounds in
place on your array, your pending errors would have been silently fixed
and your array would almost certainly never have failed.  With or
without SMART enabled.

Not that I recommend running without the SMART features -- you will
still want to know when your drives have real problems.

Phil

^ permalink raw reply

* Re: MD-RAID: Use seq_putc() in three status functions?
From: SF Markus Elfring @ 2016-10-28 20:04 UTC (permalink / raw)
  To: Hannes Reinecke, linux-raid
  Cc: Bernd Petrovitsch, Christoph Hellwig, Guoqing Jiang, Jens Axboe,
	Joe Perches, Mike Christie, Neil Brown, Shaohua Li,
	Tomasz Majchrzak, LKML, kernel-janitors, kbuild-all, ltp
In-Reply-To: <e264883f-d1b9-4bf3-aa9f-45fce5dab18e@users.sourceforge.net>

>>> So back to the original task for you: Show me in the generated output where the benefits are.

I can offer another bit of information for this software development discussion.

The following build settings were active in my "Makefile" for this Linux test case.

…
HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O0 -fomit-frame-pointer -std=gnu89
…

The affected source files can be compiled for the processor architecture "x86_64"
by a tool like "GCC 6.2.1+r239849-1.5" from the software distribution
"openSUSE Tumbleweed" with the following command example.

my_original=${my_build_dir}unchanged/test/ \
&& my_fixing=${my_build_dir}patched/test/ \
&& mkdir -p ${my_original} ${my_fixing} \
&& my_cc=/usr/bin/gcc-6 \
&& my_module=drivers/md/raid1.s \
&& git checkout next-20161014 \
&& make -j6 O="${my_original}" CC="${my_cc}" HOSTCC="${my_cc}" allmodconfig "${my_module}" \
&& git checkout next_usage_of_seq_putc_in_md_raid_1 \
&& make -j6 O="${my_fixing}" CC="${my_cc}" HOSTCC="${my_cc}" allmodconfig "${my_module}" \
&& diff -u "${my_original}${my_module}" "${my_fixing}${my_module}" > "${my_build_dir}assembler_code_comparison_$(date -I)_3.diff"

The generated file got the size "25.4 KiB" this time. I guess that only
the following two diff hunks are interesting then to show desired effects
for the suggested software refactoring around data output of a single character
(instead of a similar string).

…
@@ -4402,10 +4402,6 @@
 .LC19:
 	.string	"%s"
 	.zero	61
-	.align 32
-.LC20:
-	.string	"]"
-	.zero	62
 	.text
 	.p2align 4,,15
 	.type	raid1_status, @function
@@ -4564,8 +4560,8 @@
 	movq	$rcu_lock_map, %rdi	#,
 	call	lock_release	#
 	movq	%r14, %rdi	# seq,
-	movq	$.LC20, %rsi	#,
-	call	seq_printf	#
+	movl	$93, %esi	#,
+	call	seq_putc	#
 	addq	$16, %rsp	#,
 	popq	%rbx	#
 	popq	%r12	#
…
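
As a reading aid for the two hunks above, here is a small Python check.
It only counts the bytes visible in the first hunk (the ".string"
directive of GNU as emits the characters plus a trailing NUL byte, and
".zero 62" emits 62 zero bytes); any padding caused by the ".align 32"
line is not counted. The helper name is hypothetical.

```python
def string_directive_size(text: str) -> int:
    """Bytes emitted by a `.string` directive: the characters plus a NUL."""
    return len(text.encode("ascii")) + 1


# Second hunk: the immediate loaded into %esi for seq_putc() is the
# character code of the closing bracket.
assert chr(93) == "]"

# First hunk: `.string "]"` followed by `.zero 62` is dropped from the data.
removed_rodata = string_directive_size("]") + 62
print(removed_rodata)  # 64
```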

* Is this kind of assembler code comparison useful to clarify relevant
  differences further?

* Are any software development concerns left over for such a transformation?

Regards,
Markus

^ permalink raw reply

* Re: recovering failed raid5
From: Robin Hill @ 2016-10-28 13:36 UTC (permalink / raw)
  To: Alexander Shenkin; +Cc: linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <715b259f-1e56-9606-edc4-3e5c4d57744b@shenkin.org>

On Fri Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:

> Thanks Andreas, much appreciated.  Your points about selftests and smart 
> are well taken, and i'll implement them once i get this back up.  I'll 
> buy yet another new, non drive-from-hell (yes Roman, I did buy the same 
> damn drive again.  Will try to return it, thanks for the heads up...) 
> and follow your instructions below.
> 
> One remaining question: is sdc definitely toast?  Or, is it possible 
> that the Timeout Mismatch (as mentioned by Robin Hill; thanks Robin) is 
> flagging the drive as failed, when something else is at play and perhaps 
> the drive is actually fine?
> 
It's not definitely toast, no (but this is unrelated to the Timeout
mismatches). It has some pending reallocations, which means the drive
was unable to read from some blocks - if a write to the blocks fails
then one of the spare blocks will be reallocated instead, but a write
will often succeed and the pending reallocation will just be cleared.

Unfortunately, reconstruction of the array depends on this data being
readable, so the fact the drive isn't toast doesn't necessarily help.
I'd suggest replicating (using ddrescue) that drive to the new one (when
it arrives) as a first step. It's possible ddrescue will manage to read
the data (it'll make several attempts, so can sometimes read data that
fails initially), otherwise you'll end up with some missing data
(possibly corrupt files, possibly corrupt filesystem metadata, possibly
just a bit of extra noise in an audio/video file). Once that's done, you
can do a proper check on sdc (e.g. a badblocks read/write test), which
will either lead to sectors actually being reallocated, or to clearing
the pending reallocations. Unless you get a lot more reallocated sectors
than are currently pending, you can put the drive back into use if you
like (bearing in mind the reputation of these drives and weighing the
replacement cost against the value of your data).

If you run a regular selftest on the array, these sorts of issues would
be picked up and repaired automatically (the read errors will trigger
rewrites and either reallocate blocks, clear the pending reallocations,
or fail the drive). Otherwise they're liable to come back to bite you
when you're trying to recover from a different failure.

Timeout Mismatches will lead to drives being failed from an otherwise
healthy array - a read failure on the drive can't be corrected as the
drive is still busy trying when the write request goes through, so the
drive gets kicked out of the array. You didn't say what the issue was
with your original sdb, but if it wasn't a definite fault then it may
have been affected by a timeout mismatch.

Cheers,
    Robin

> To everyone: sorry for the multiple posts.  Was having majordomo issues...
> 
> On 10/27/2016 5:04 PM, Andreas Klauer wrote:
> > On Thu, Oct 27, 2016 at 04:06:14PM +0100, Alexander Shenkin wrote:
> >> md2: raid5 mounted on /, via sd[abcd]3
> >
> > Two failed disks...
> >
> >> md0: raid1 mounted on /boot, via sd[abcd]1
> >
> > Actually only two disks active in that one, the other two are spares.
> > It hardly matters for /boot, but you could grow it to a 4 disk raid1.
> > Spares are not useful.
> >
> >> My sdb was recently reporting problems.  Instead of second guessing
> >> those problems, I just got a new disk, replaced it, and added it to
> >> the arrays.
> >
> > Replacing right away is the right thing to do.
> > Unfortunately it seems you have another disk that is broke too.
> >
> >> 2) smartctl (disabled on drives - can enable once back up.  should I?)
> >> note: SMART only enabled after problems started cropping up.
> >
> > But... why? Why disable smart? And if you do, is it a surprise that you
> > only notice disk failures when it's already too late?
> 
> yeah, i asked myself that same question.  there was probably some reason 
> I did, but i don't remember what it was.  i'll keep smart enabled from 
> now on...
> 
> > You should enable smart, and not only that, also run regular selftests,
> > and have smartd running, and have it send you mail when something happens.
> > Same with raid checks, raid checks are at least something but it won't
> > tell you about how many reallocated sectors your drive has.
> 
> will do
> 
> >> root@machinename:/home/username# smartctl --xall /dev/sda
> >
> > Looks fine but never ran a selftest.
> >
> >> root@machinename:/home/username# smartctl --xall /dev/sdb
> >
> > Looks new. (New drives need selftests too.)
> >
> >> root@machinename:/home/username# smartctl --xall /dev/sdc
> >> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-39-generic] (local build)
> >> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> >>
> >> === START OF INFORMATION SECTION ===
> >> Model Family:     Seagate Barracuda 7200.14 (AF)
> >> Device Model:     ST3000DM001-1CH166
> >> Serial Number:    W1F1N909
> >>
> >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    8
> >> 198 Offline_Uncorrectable   ----C-   100   100   000    -    8
> >
> > This one is faulty and probably the reason why your resync failed.
> > You have no redundancy left, so an option here would be to get a
> > new drive and ddrescue it over.
> >
> > That's exactly the kind of thing you should be notified instantly
> > about via mail. And it should be discovered when running selftests.
> > Without full surface scan of the media, the disk itself won't know.
> >
> >> ==> WARNING: A firmware update for this drive may be available,
> >> see the following Seagate web pages:
> >> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> >> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> >
> > About this, *shrug*
> > I don't have these drives, you might want to check that out.
> > But it probably won't fix bad sectors.
> >
> >> root@machinename:/home/username# smartctl --xall /dev/sdd
> >
> > Some strange things in the error log here, but old.
> > Still, same as for all others - selftest.
> >
> >> ################### mdadm --examine ###########################
> >>
> >> /dev/sda1:
> >>      Raid Level : raid1
> >>    Raid Devices : 2
> >
> > A RAID 1 with two drives, could be four.
> >
> >> /dev/sdb1:
> >> /dev/sdc1:
> >
> > So these would also have data instead of being spare.
> >
> >> /dev/sda3:
> >>      Raid Level : raid5
> >>    Raid Devices : 4
> >>
> >>     Update Time : Mon Oct 24 09:02:52 2016
> >>          Events : 53547
> >>
> >>    Device Role : Active device 0
> >>    Array State : A..A ('A' == active, '.' == missing)
> >
> > RAID-5 with two failed disks.
> >
> >> /dev/sdc3:
> >>      Raid Level : raid5
> >>    Raid Devices : 4
> >>
> >>     Update Time : Mon Oct 24 08:53:57 2016
> >>          Events : 53539
> >>
> >>    Device Role : Active device 2
> >>    Array State : AAAA ('A' == active, '.' == missing)
> >
> > This one failed, 8:53.
> >
> >> ############ /proc/mdstat ############################################
> >>
> >> md2 : active raid5 sda3[0] sdc3[2](F) sdd3[3]
> >>       8760565248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2]
> >> [U__U]
> >
> > [U__U] refers to device roles as in [0123],
> > so device role 0 and 3 is okay, 1 and 2 missing.
> >
> >> md0 : active raid1 sdb1[4](S) sdc1[2](S) sda1[0] sdd1[3]
> >>       1950656 blocks super 1.2 [2/2] [UU]
> >
> > Those two spares again, could be [UUUU] instead.
> >
> > tl;dr
> > stop it all,
> > ddrescue /dev/sdc to your new disk,
> > try your luck with --assemble --force (not using /dev/sdc!),
> > get yet another new disk, add, sync, cross fingers.
> >
> > There's also mdadm --replace instead of --remove, --add,
> > that sometimes helps if there's only a few bad sectors
> > on each disk. If the disk you already removed wasn't
> > already kicked from the array by the time you replaced,
> > maybe it would have avoided this problem.
> >
> > But good disk monitoring and testing is even more important.
> 
> thanks a bunch, Andreas.  I'll monitor and test from now on...
> 
> > Regards
> > Andreas Klauer
> 

-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

^ permalink raw reply

* Re: recovering failed raid5
From: Andreas Klauer @ 2016-10-28 13:33 UTC (permalink / raw)
  To: Alexander Shenkin; +Cc: linux-raid
In-Reply-To: <715b259f-1e56-9606-edc4-3e5c4d57744b@shenkin.org>

On Fri, Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
> One remaining question: is sdc definitely toast?

In my opinion a drive is toast starting from the very first reallocated/
pending/uncorrectable sector. Your drive has several of those, and that's
only the ones the drive already knows about - there may be more.
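
For reference, the three counters mentioned above can be read from
`smartctl -A` output. A minimal Python sketch, assuming the usual
smartmontools ATA attribute table layout (raw value in the tenth column).
The function name is made up, and the sample text and its values are
invented for illustration; they are not from the drive in this thread.

```python
# Attribute names as printed by smartctl for reallocated, pending and
# uncorrectable sectors.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def bad_sector_counts(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                counts[fields[1]] = int(fields[9])
            except ValueError:
                # Raw value is not a plain integer; skip rather than guess.
                continue
    return counts


# Hypothetical sample, NOT real output from any drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22000
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

counts = bad_sector_counts(SAMPLE)
suspect = any(value > 0 for value in counts.values())
print(counts, "-> suspect" if suspect else "-> clean")
```

Where to draw the line (first bad sector, or some higher count) is a
matter of opinion; the sketch only reports any non-zero count.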

> Or, is it possible that the Timeout Mismatch (as mentioned by Robin Hill; 
> thanks Robin) is flagging the drive as failed, when something else is at 
> play and perhaps the drive is actually fine?

I don't believe in timeout mismatches, either. The timeouts are generous. 
Waiting for a disk to wake from standby is not a problem, and that takes 
ages already. If a disk gets stuck even longer in error correction limbo 
and it gets kicked because of it - IMHO that's the right call.
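
For readers unfamiliar with the term: a "timeout mismatch" usually means
the drive's internal error recovery may run longer than the kernel's
per-command timeout. A rough Python sketch of that comparison, assuming
the kernel timeout is read from /sys/block/<dev>/device/timeout (in
seconds) and the drive's SCT ERC limit is given in tenths of a second, as
smartctl reports it. The function names are made up, and the 120 second
worst case for drives without ERC is a placeholder assumption, not a
measured figure.

```python
from pathlib import Path

# Placeholder guess for how long a drive without ERC may keep retrying.
ASSUMED_WORST_CASE_S = 120.0


def kernel_timeout_seconds(device, sysfs_root="/sys/block"):
    """Return the kernel command timeout for e.g. 'sdc', or None if unreadable."""
    path = Path(sysfs_root) / device / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def is_timeout_mismatch(kernel_timeout_s, erc_deciseconds=None):
    """True if the drive may still be retrying when the kernel gives up.

    erc_deciseconds of None or 0 means ERC is disabled or unsupported, in
    which case the assumed worst case above is used instead.
    """
    if not erc_deciseconds:
        drive_limit_s = ASSUMED_WORST_CASE_S
    else:
        drive_limit_s = erc_deciseconds / 10.0
    return drive_limit_s >= kernel_timeout_s


print(is_timeout_mismatch(30, 70))   # 7 s ERC limit vs. 30 s kernel timeout
print(is_timeout_mismatch(30))       # no ERC vs. 30 s kernel timeout
print(is_timeout_mismatch(180))      # no ERC vs. 180 s kernel timeout
```

This only illustrates what is being compared; whether a mismatch matters
in practice is exactly the point under debate here.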

A disk that is unable to read its data, a disk that refuses to write data, 
a disk that needs help from the RAID layer to correct its errors, 
should be kicked because it's not able to pull its own weight.

You need drives that work without errors, without outside help, because 
during a rebuild, when the RAID is already degraded, there won't be any 
outside help. Either the disks work or your RAID is dead.

RAID redundancy is supposed to allow disks to be replaced. (mdadm --replace)
If you use it instead to keep fixing errors on other disks, there is not 
any real redundancy left. In a RAID, if one of your disks has errors, 
you get rid of it as soon as possible.

Whether your RAID failed because of timeouts or not is not important.
It failed because you didn't notice broken disks in time and you had two.
Testing, monitoring, and actually acting on the first error are important.
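
As a starting point for the monitoring part, a small Python sketch that
flags degraded arrays in /proc/mdstat text. It relies on the [n/m] [U_..]
status field, where each position corresponds to a device role ([U__U]
means roles 0 and 3 are up, 1 and 2 missing). The function name is made
up; the sample is the mdstat output quoted in this thread, with the
wrapped status line rejoined.

```python
import re

# Start of an array stanza, e.g. "md2 : active raid5 ..."
ARRAY_RE = re.compile(r"^(md\S+) : ")
# Status field, e.g. "[4/2] [U__U]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: [missing device roles]} for degraded arrays."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            # '_' at position i means device role i is missing.
            missing = [i for i, c in enumerate(status.group(3)) if c == "_"]
            if missing:
                result[current] = missing
            current = None
    return result


SAMPLE = """\
md2 : active raid5 sda3[0] sdc3[2](F) sdd3[3]
      8760565248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U__U]

md0 : active raid1 sdb1[4](S) sdc1[2](S) sda1[0] sdd1[3]
      1950656 blocks super 1.2 [2/2] [UU]
"""

print(degraded_arrays(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat. mdadm's
own --monitor mode covers the same ground and can send mail on failure
events, so a script like this is only a supplement.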

People have different opinions on this. Someone might argue.
It's up to you what risks to take.

Regards
Andreas Klauer

^ permalink raw reply
