Linux RAID subsystem development


* Ddf based RAID management software
From: Arka Sharma @ 2016-11-12 16:33 UTC
  To: linux-raid

Hello All,

Is there any tool apart from mdadm that can create software RAID
based on DDF metadata? We want to dump the metadata content and
check it against the metadata written by mdadm and by our application.

Regards,
Arka


* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-12 17:42 UTC
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block, neilb
In-Reply-To: <20161111190223.4xrq3vvvvohzgs5e@kernel.org>

On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
> > things with the bi_io_vec and bi_vcnt fields, which we'd prefer that
> > drivers don't touch.  One example is the r1buf_pool_alloc code,
> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> > case, which would also take care of r1buf_pool_free.  I'm not sure
> > about all the other cases, as some bits don't fully make sense to me,
> 
> The problem is we use the bi_io_vec to track the pages allocated. We will read
> data to the pages and write out later for resync. If we add new fields to track
> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
> avoid the tricky parts. This should work for both the resync and writebehind
> cases.

I don't think we need to track the pages specifically - if we clone
a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED case
we do one bio_kmalloc, then bio_alloc_pages, then clone it for the
other bios.  For MD_RECOVERY_REQUESTED we do a bio_kmalloc +
bio_alloc_pages for each.
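
The sharing described above can be modelled in userspace. The sketch
below is a toy, not kernel code: toy_vec, toy_bio and the toy_*
functions are names invented for illustration, and they only mimic the
ownership pattern (one bio allocates the table and its pages, the
others borrow the same table and allocate nothing).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy userspace stand-ins; NOT the kernel's struct bio_vec / struct bio. */
struct toy_vec {
	void *page;		/* stands in for bv_page */
	unsigned int len;	/* stands in for bv_len */
};

struct toy_bio {
	struct toy_vec *io_vec;	/* may be shared with other toy_bios */
	unsigned short vcnt;	/* number of entries in io_vec */
	int owns_pages;		/* only the owner frees table and pages */
};

/* Loosely "bio_kmalloc + bio_alloc_pages": allocate the table and pages. */
static int toy_bio_alloc(struct toy_bio *bio, unsigned short nr,
			 unsigned int page_size)
{
	unsigned short i;

	assert(nr > 0 && page_size > 0);
	bio->io_vec = calloc(nr, sizeof(*bio->io_vec));
	if (!bio->io_vec)
		return -1;
	for (i = 0; i < nr; i++) {
		bio->io_vec[i].page = malloc(page_size);
		if (!bio->io_vec[i].page) {
			/* Unwind the pages allocated so far. */
			while (i--)
				free(bio->io_vec[i].page);
			free(bio->io_vec);
			bio->io_vec = NULL;
			return -1;
		}
		bio->io_vec[i].len = page_size;
	}
	bio->vcnt = nr;
	bio->owns_pages = 1;
	return 0;
}

/* Loosely "clone": point at the source's table, allocate nothing. */
static void toy_bio_clone(struct toy_bio *clone, const struct toy_bio *src)
{
	clone->io_vec = src->io_vec;
	clone->vcnt = src->vcnt;
	clone->owns_pages = 0;
}

/* Free the pages only if this toy_bio allocated them. */
static void toy_bio_free(struct toy_bio *bio)
{
	unsigned short i;

	if (bio->owns_pages) {
		for (i = 0; i < bio->vcnt; i++)
			free(bio->io_vec[i].page);
		free(bio->io_vec);
	}
	bio->io_vec = NULL;
	bio->vcnt = 0;
	bio->owns_pages = 0;
}
```

In this model the !MD_RECOVERY_REQUESTED case would be one
toy_bio_alloc() followed by a toy_bio_clone() per remaining bio, and
the MD_RECOVERY_REQUESTED case would be a toy_bio_alloc() per bio.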

While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
confusing, and I'm not 100% sure it's correct.  After all we check it
in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
on these callbacks being done after the flag has been raised / cleared,
which makes me a bit suspicious, and also question why we even need the
mempool.

> 
> > e.g. why we're trying to do single page I/O out of a bigger bio.
> 
> what's this one?

fix_sync_read_error


* Re: [PATCH 01/12] block: bio: pass bvec table to bio_init()
From: Christoph Hellwig @ 2016-11-12 17:59 UTC
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Jiri Kosina, Kent Overstreet,
	Shaohua Li, Alasdair Kergon, Mike Snitzer,
	maintainer:DEVICE-MAPPER (LVM), Christoph Hellwig, Sagi Grimberg,
	Joern Engel, Prasad Joshi, Mike Christie, Hannes Reinecke,
	Rasmus Villemoes, Johannes Thumshirn, Guoqing Jiang, Eric
In-Reply-To: <1478865957-25252-2-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11, 2016 at 08:05:29PM +0800, Ming Lei wrote:
> Some drivers often use external bvec table, so introduce
> this helper for this case. It is always safe to access the
> bio->bi_io_vec in this way for this case.
> 
> After converting to this usage, it will become a bit easier
> to evaluate the remaining direct access to bio->bi_io_vec,
> so it can help to prepare for the following multipage bvec
> support.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/bio.c                   |  8 ++++++--
>  drivers/block/floppy.c        |  3 +--
>  drivers/md/bcache/io.c        |  4 +---
>  drivers/md/bcache/journal.c   |  4 +---
>  drivers/md/bcache/movinggc.c  |  6 ++----
>  drivers/md/bcache/request.c   |  2 +-
>  drivers/md/bcache/super.c     | 12 +++---------
>  drivers/md/bcache/writeback.c |  5 ++---
>  drivers/md/dm-bufio.c         |  4 +---
>  drivers/md/dm.c               |  2 +-
>  drivers/md/multipath.c        |  2 +-
>  drivers/md/raid5-cache.c      |  2 +-
>  drivers/md/raid5.c            |  9 ++-------
>  drivers/nvme/target/io-cmd.c  |  4 +---
>  fs/logfs/dev_bdev.c           |  4 +---
>  include/linux/bio.h           |  3 ++-
>  16 files changed, 27 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 2cf6ebabc68c..de257ced69b1 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -270,11 +270,15 @@ static void bio_free(struct bio *bio)
>  	}
>  }
>  
> -void bio_init(struct bio *bio)
> +void bio_init(struct bio *bio, struct bio_vec *table,
> +	      unsigned short max_vecs)
>  {
>  	memset(bio, 0, sizeof(*bio));
>  	atomic_set(&bio->__bi_remaining, 1);
>  	atomic_set(&bio->__bi_cnt, 1);
> +
> +	bio->bi_io_vec = table;
> +	bio->bi_max_vecs = max_vecs;
>  }
>  EXPORT_SYMBOL(bio_init);
>  
> @@ -480,7 +484,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		return NULL;
>  
>  	bio = p + front_pad;
> -	bio_init(bio);
> +	bio_init(bio, NULL, 0);
>  
>  	if (nr_iovecs > inline_vecs) {
>  		unsigned long idx = 0;
> diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
> index e3d8e4ced4a2..6a3ff2b2e3ae 100644
> --- a/drivers/block/floppy.c
> +++ b/drivers/block/floppy.c
> @@ -3806,8 +3806,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
>  
>  	cbdata.drive = drive;
>  
> -	bio_init(&bio);
> -	bio.bi_io_vec = &bio_vec;
> +	bio_init(&bio, &bio_vec, 1);
>  	bio_vec.bv_page = page;
>  	bio_vec.bv_len = size;
>  	bio_vec.bv_offset = 0;
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index e97b0acf7b8d..db45a88c0ce9 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -24,9 +24,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
>  	struct bbio *b = mempool_alloc(c->bio_meta, GFP_NOIO);
>  	struct bio *bio = &b->bio;
>  
> -	bio_init(bio);
> -	bio->bi_max_vecs	 = bucket_pages(c);
> -	bio->bi_io_vec		 = bio->bi_inline_vecs;
> +	bio_init(bio, bio->bi_inline_vecs, bucket_pages(c));
>  
>  	return bio;
>  }
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 6925023e12d4..1198e53d5670 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -448,13 +448,11 @@ static void do_journal_discard(struct cache *ca)
>  
>  		atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
>  
> -		bio_init(bio);
> +		bio_init(bio, bio->bi_inline_vecs, 1);
>  		bio_set_op_attrs(bio, REQ_OP_DISCARD, 0);
>  		bio->bi_iter.bi_sector	= bucket_to_sector(ca->set,
>  						ca->sb.d[ja->discard_idx]);
>  		bio->bi_bdev		= ca->bdev;
> -		bio->bi_max_vecs	= 1;
> -		bio->bi_io_vec		= bio->bi_inline_vecs;
>  		bio->bi_iter.bi_size	= bucket_bytes(ca);
>  		bio->bi_end_io		= journal_discard_endio;
>  
> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
> index 5c4bddecfaf0..13b8a907006d 100644
> --- a/drivers/md/bcache/movinggc.c
> +++ b/drivers/md/bcache/movinggc.c
> @@ -77,15 +77,13 @@ static void moving_init(struct moving_io *io)
>  {
>  	struct bio *bio = &io->bio.bio;
>  
> -	bio_init(bio);
> +	bio_init(bio, bio->bi_inline_vecs,
> +		 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
>  	bio_get(bio);
>  	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
>  
>  	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
> -	bio->bi_max_vecs	= DIV_ROUND_UP(KEY_SIZE(&io->w->key),
> -					       PAGE_SECTORS);
>  	bio->bi_private		= &io->cl;
> -	bio->bi_io_vec		= bio->bi_inline_vecs;
>  	bch_bio_map(bio, NULL);
>  }
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 0d99b5f4b3e6..f49c5417527d 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -623,7 +623,7 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
>  {
>  	struct bio *bio = &s->bio.bio;
>  
> -	bio_init(bio);
> +	bio_init(bio, NULL, 0);
>  	__bio_clone_fast(bio, orig_bio);

We have this pattern multiple times, and it almost screams for a helper.
But I think we're better off letting your patch go in as-is and sort
that out later instead of delaying it.
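
As a rough illustration of what such a helper might look like, here is
a userspace sketch. Everything in it is an assumption made for the
example: the toy_* types and functions are invented stand-ins, and
toy_bio_init_clone() is a hypothetical name, not an existing or
proposed kernel function. It only shows the shape of folding the
"init with no table, then fast-clone" pair into one call.

```c
#include <assert.h>
#include <string.h>

/* Toy userspace stand-ins; NOT the kernel's structures. */
struct toy_bio_vec {
	void *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct toy_bio {
	struct toy_bio_vec *bi_io_vec;
	unsigned short bi_max_vecs;
	int bi_cnt;
};

/* Same shape as the patched bio_init(): zero, then attach the table. */
static void toy_bio_init(struct toy_bio *bio, struct toy_bio_vec *table,
			 unsigned short max_vecs)
{
	memset(bio, 0, sizeof(*bio));
	bio->bi_cnt = 1;
	bio->bi_io_vec = table;
	bio->bi_max_vecs = max_vecs;
}

/* Simplified stand-in for a fast clone: share the source's vec table. */
static void toy_bio_clone_fast(struct toy_bio *bio, const struct toy_bio *src)
{
	bio->bi_io_vec = src->bi_io_vec;
}

/* Hypothetical helper wrapping the repeated two-call pattern. */
static void toy_bio_init_clone(struct toy_bio *bio, const struct toy_bio *src)
{
	assert(bio != src);
	toy_bio_init(bio, NULL, 0);	/* a clone has no table of its own */
	toy_bio_clone_fast(bio, src);
}
```

A caller such as do_bio_hook() would then need a single call in place
of the two shown in the diff.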

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-13 18:46 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

Hi

I'm running software RAID1 across two drives in my home machine (LVM
on LUKS on RAID1). I've just installed smartmontools and run short
tests, and promptly received emails to tell me that one of the drives
has 4 offline uncorrectable sectors and 3 current pending sectors.
I've attached smartctl --xall output for sda (good) and sdb (bad).

These drives are pretty old (over 5 years) so I'm going to replace
them as soon as I have time (and yes, I have backups), but in the
meantime I'd like advice on:

1. What exactly this means. My understanding is that some data has
been lost (or may have been lost) on the drive, but the drive still
has spare sectors to remap things once the failed sectors are written
to. Is that correct?

2. How can I tell which sectors are problematic? If it's in the swap
partition I'm far less worried. I can see two LBAs for offline
uncorrectable errors in the --xall output, but that still leaves
another two at large.

3. Assuming my understanding is correct, and that the sector falls
within the RAID1 partition on the drive, is there some way I can
recover the sectors from the other drive in the RAID1? As a last
resort I imagine I could wipe the suspect drive and then rebuild it
from the good one, but I'm hoping there's something less risky I can
do.
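
For question 2, one way to place a reported LBA is to compare it
against each partition's start and length. The sketch below is only
that arithmetic, under stated assumptions: struct part_range and both
function names are invented for the example, and the partition layout
has to be supplied by the caller (on Linux the start and size of a
partition, in 512-byte sectors, can be read from
/sys/block/sdb/sdb1/start and /sys/block/sdb/sdb1/size). The second
function converts a logical LBA to the physical sector that holds it,
which matters on drives like these with 512-byte logical and 4096-byte
physical sectors.

```c
#include <assert.h>
#include <stdint.h>

/* One partition, expressed in logical sectors (hypothetical type). */
struct part_range {
	uint64_t start_lba;	/* first logical sector of the partition */
	uint64_t nr_sectors;	/* length in logical sectors */
};

/* Return the index of the partition containing lba, or -1 if none does. */
static int find_partition(const struct part_range *parts, int nparts,
			  uint64_t lba)
{
	for (int i = 0; i < nparts; i++) {
		if (lba >= parts[i].start_lba &&
		    lba - parts[i].start_lba < parts[i].nr_sectors)
			return i;
	}
	return -1;
}

/* Index of the physical sector that contains the given logical LBA. */
static uint64_t lba_to_physical(uint64_t lba, uint32_t logical_size,
				uint32_t physical_size)
{
	assert(logical_size != 0 && physical_size % logical_size == 0);
	return lba / (physical_size / logical_size);
}
```

With the two LBAs from the attached sdb error log (632489304 and
632486488), lba_to_physical(lba, 512, 4096) gives the 4096-byte
physical sector involved, and find_partition() tells whether that
sector falls in the swap partition or in the RAID1 member.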

Thanks in advance
Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

[-- Attachment #2: sda.txt --]
[-- Type: text/plain, Size: 16559 bytes --]

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-47-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WCAZA9626479
LU WWN Device Id: 5 0014ee 2b0e3fa4c
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 13 20:30:04 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(36780) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 355) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   172   162   021    -    6400
  4 Start_Stop_Count        -O--CK   097   097   000    -    3688
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   081   081   000    -    13891
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   097   097   000    -    3683
192 Power-Off_Retract_Count -O--CK   200   200   000    -    50
193 Load_Cycle_Count        -O--CK   001   001   000    -    912124
194 Temperature_Celsius     -O---K   120   109   000    -    30
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13888         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    30 Celsius
Power Cycle Min/Max Temperature:     22/30 Celsius
Lifetime    Min/Max Temperature:     22/41 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (462)

Index    Estimated Time   Temperature Celsius
 463    2016-11-13 12:33    31  ************
 ...    ..( 16 skipped).    ..  ************
   2    2016-11-13 12:50    31  ************
   3    2016-11-13 12:51    30  ***********
   4    2016-11-13 12:52    30  ***********
   5    2016-11-13 12:53    30  ***********
   6    2016-11-13 12:54    31  ************
   7    2016-11-13 12:55    30  ***********
 ...    ..( 21 skipped).    ..  ***********
  29    2016-11-13 13:17    30  ***********
  30    2016-11-13 13:18    31  ************
 ...    ..( 14 skipped).    ..  ************
  45    2016-11-13 13:33    31  ************
  46    2016-11-13 13:34     ?  -
  47    2016-11-13 13:35    22  ***
  48    2016-11-13 13:36    22  ***
  49    2016-11-13 13:37    22  ***
  50    2016-11-13 13:38    23  ****
  51    2016-11-13 13:39    23  ****
  52    2016-11-13 13:40    24  *****
  53    2016-11-13 13:41    24  *****
  54    2016-11-13 13:42    25  ******
  55    2016-11-13 13:43    25  ******
  56    2016-11-13 13:44    26  *******
 ...    ..(  2 skipped).    ..  *******
  59    2016-11-13 13:47    26  *******
  60    2016-11-13 13:48    27  ********
 ...    ..(  4 skipped).    ..  ********
  65    2016-11-13 13:53    27  ********
  66    2016-11-13 13:54    28  *********
 ...    ..(  4 skipped).    ..  *********
  71    2016-11-13 13:59    28  *********
  72    2016-11-13 14:00    29  **********
 ...    ..( 19 skipped).    ..  **********
  92    2016-11-13 14:20    29  **********
  93    2016-11-13 14:21    30  ***********
 ...    ..( 12 skipped).    ..  ***********
 106    2016-11-13 14:34    30  ***********
 107    2016-11-13 14:35    31  ************
 ...    ..(  2 skipped).    ..  ************
 110    2016-11-13 14:38    31  ************
 111    2016-11-13 14:39    32  *************
 112    2016-11-13 14:40    31  ************
 ...    ..( 18 skipped).    ..  ************
 131    2016-11-13 14:59    31  ************
 132    2016-11-13 15:00    32  *************
 133    2016-11-13 15:01    32  *************
 134    2016-11-13 15:02    31  ************
 135    2016-11-13 15:03    32  *************
 136    2016-11-13 15:04    31  ************
 ...    ..( 10 skipped).    ..  ************
 147    2016-11-13 15:15    31  ************
 148    2016-11-13 15:16    32  *************
 149    2016-11-13 15:17    31  ************
 150    2016-11-13 15:18    31  ************
 151    2016-11-13 15:19    32  *************
 152    2016-11-13 15:20    31  ************
 ...    ..( 10 skipped).    ..  ************
 163    2016-11-13 15:31    31  ************
 164    2016-11-13 15:32     ?  -
 165    2016-11-13 15:33    20  *
 166    2016-11-13 15:34    21  **
 167    2016-11-13 15:35    21  **
 168    2016-11-13 15:36    21  **
 169    2016-11-13 15:37    22  ***
 170    2016-11-13 15:38    22  ***
 171    2016-11-13 15:39    22  ***
 172    2016-11-13 15:40    23  ****
 173    2016-11-13 15:41    24  *****
 174    2016-11-13 15:42    24  *****
 175    2016-11-13 15:43    24  *****
 176    2016-11-13 15:44    25  ******
 177    2016-11-13 15:45    25  ******
 178    2016-11-13 15:46    25  ******
 179    2016-11-13 15:47    26  *******
 ...    ..(  2 skipped).    ..  *******
 182    2016-11-13 15:50    26  *******
 183    2016-11-13 15:51    27  ********
 ...    ..(  7 skipped).    ..  ********
 191    2016-11-13 15:59    27  ********
 192    2016-11-13 16:00    28  *********
 ...    ..(  4 skipped).    ..  *********
 197    2016-11-13 16:05    28  *********
 198    2016-11-13 16:06    29  **********
 ...    ..( 13 skipped).    ..  **********
 212    2016-11-13 16:20    29  **********
 213    2016-11-13 16:21    30  ***********
 ...    ..(  5 skipped).    ..  ***********
 219    2016-11-13 16:27    30  ***********
 220    2016-11-13 16:28    31  ************
 221    2016-11-13 16:29    31  ************
 222    2016-11-13 16:30    31  ************
 223    2016-11-13 16:31    30  ***********
 224    2016-11-13 16:32    30  ***********
 225    2016-11-13 16:33    31  ************
 ...    ..(  2 skipped).    ..  ************
 228    2016-11-13 16:36    31  ************
 229    2016-11-13 16:37    30  ***********
 ...    ..(  5 skipped).    ..  ***********
 235    2016-11-13 16:43    30  ***********
 236    2016-11-13 16:44    31  ************
 237    2016-11-13 16:45    30  ***********
 ...    ..(  8 skipped).    ..  ***********
 246    2016-11-13 16:54    30  ***********
 247    2016-11-13 16:55    31  ************
 248    2016-11-13 16:56    30  ***********
 ...    ..(  9 skipped).    ..  ***********
 258    2016-11-13 17:06    30  ***********
 259    2016-11-13 17:07    31  ************
 260    2016-11-13 17:08    30  ***********
 ...    ..(  8 skipped).    ..  ***********
 269    2016-11-13 17:17    30  ***********
 270    2016-11-13 17:18    31  ************
 271    2016-11-13 17:19    31  ************
 272    2016-11-13 17:20    31  ************
 273    2016-11-13 17:21    30  ***********
 274    2016-11-13 17:22    30  ***********
 275    2016-11-13 17:23    31  ************
 276    2016-11-13 17:24    31  ************
 277    2016-11-13 17:25    30  ***********
 ...    ..(  7 skipped).    ..  ***********
 285    2016-11-13 17:33    30  ***********
 286    2016-11-13 17:34    31  ************
 ...    ..( 17 skipped).    ..  ************
 304    2016-11-13 17:52    31  ************
 305    2016-11-13 17:53    30  ***********
 306    2016-11-13 17:54    31  ************
 307    2016-11-13 17:55    30  ***********
 ...    ..(  5 skipped).    ..  ***********
 313    2016-11-13 18:01    30  ***********
 314    2016-11-13 18:02    31  ************
 315    2016-11-13 18:03    31  ************
 316    2016-11-13 18:04    30  ***********
 ...    ..(  3 skipped).    ..  ***********
 320    2016-11-13 18:08    30  ***********
 321    2016-11-13 18:09    31  ************
 ...    ..( 11 skipped).    ..  ************
 333    2016-11-13 18:21    31  ************
 334    2016-11-13 18:22    32  *************
 335    2016-11-13 18:23    31  ************
 336    2016-11-13 18:24    31  ************
 337    2016-11-13 18:25    32  *************
 338    2016-11-13 18:26    31  ************
 ...    ..( 11 skipped).    ..  ************
 350    2016-11-13 18:38    31  ************
 351    2016-11-13 18:39    32  *************
 352    2016-11-13 18:40    31  ************
 ...    ..(  5 skipped).    ..  ************
 358    2016-11-13 18:46    31  ************
 359    2016-11-13 18:47    32  *************
 360    2016-11-13 18:48    32  *************
 361    2016-11-13 18:49    32  *************
 362    2016-11-13 18:50    31  ************
 ...    ..( 14 skipped).    ..  ************
 377    2016-11-13 19:05    31  ************
 378    2016-11-13 19:06    30  ***********
 379    2016-11-13 19:07    31  ************
 380    2016-11-13 19:08    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 385    2016-11-13 19:13    30  ***********
 386    2016-11-13 19:14    31  ************
 387    2016-11-13 19:15    31  ************
 388    2016-11-13 19:16    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 393    2016-11-13 19:21    30  ***********
 394    2016-11-13 19:22    31  ************
 395    2016-11-13 19:23    30  ***********
 396    2016-11-13 19:24    31  ************
 ...    ..( 10 skipped).    ..  ************
 407    2016-11-13 19:35    31  ************
 408    2016-11-13 19:36    32  *************
 409    2016-11-13 19:37     ?  -
 410    2016-11-13 19:38    32  *************
 411    2016-11-13 19:39    31  ************
 ...    ..(  2 skipped).    ..  ************
 414    2016-11-13 19:42    31  ************
 415    2016-11-13 19:43    32  *************
 416    2016-11-13 19:44    31  ************
 417    2016-11-13 19:45    32  *************
 418    2016-11-13 19:46    31  ************
 419    2016-11-13 19:47    32  *************
 420    2016-11-13 19:48    31  ************
 421    2016-11-13 19:49    32  *************
 422    2016-11-13 19:50    31  ************
 423    2016-11-13 19:51    31  ************
 424    2016-11-13 19:52    31  ************
 425    2016-11-13 19:53    32  *************
 426    2016-11-13 19:54    31  ************
 ...    ..(  4 skipped).    ..  ************
 431    2016-11-13 19:59    31  ************
 432    2016-11-13 20:00    32  *************
 433    2016-11-13 20:01    32  *************
 434    2016-11-13 20:02    31  ************
 ...    ..(  4 skipped).    ..  ************
 439    2016-11-13 20:07    31  ************
 440    2016-11-13 20:08    32  *************
 441    2016-11-13 20:09    31  ************
 442    2016-11-13 20:10    32  *************
 443    2016-11-13 20:11    31  ************
 ...    ..(  2 skipped).    ..  ************
 446    2016-11-13 20:14    31  ************
 447    2016-11-13 20:15    32  *************
 448    2016-11-13 20:16    31  ************
 ...    ..(  5 skipped).    ..  ************
 454    2016-11-13 20:22    31  ************
 455    2016-11-13 20:23    32  *************
 456    2016-11-13 20:24    31  ************
 457    2016-11-13 20:25    32  *************
 458    2016-11-13 20:26    32  *************
 459    2016-11-13 20:27    31  ************
 ...    ..(  2 skipped).    ..  ************
 462    2016-11-13 20:30    31  ************

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x8000  4         3578  Vendor specific


[-- Attachment #3: sdb.txt --]
[-- Type: text/plain, Size: 17448 bytes --]

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-47-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WCAZA9552721
LU WWN Device Id: 5 0014ee 2b0e3f07a
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov 13 20:30:01 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(37260) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 360) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   168   162   021    -    6566
  4 Start_Stop_Count        -O--CK   097   097   000    -    3748
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   081   081   000    -    13883
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   097   097   000    -    3683
192 Power-Off_Retract_Count -O--CK   200   200   000    -    49
193 Load_Cycle_Count        -O--CK   001   001   000    -    837570
194 Temperature_Celsius     -O---K   119   108   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   001   000    -    3
198 Offline_Uncorrectable   ----CK   200   200   000    -    4
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    3
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 [1] occurred at disk power-on lifetime: 9505 hours (396 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 25 b3 05 58 40 00  Error: UNC at LBA = 0x25b30558 = 632489304

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 48 00 00 25 b3 0b 80 40 08     17:33:58.845  READ FPDMA QUEUED
  60 00 80 00 50 00 00 25 b3 0b 00 40 08     17:33:58.845  READ FPDMA QUEUED
  60 00 80 00 58 00 00 25 b3 0a 80 40 08     17:33:58.844  READ FPDMA QUEUED
  60 00 80 00 60 00 00 25 b3 0a 00 40 08     17:33:58.844  READ FPDMA QUEUED
  60 00 80 00 80 00 00 25 b3 09 80 40 08     17:33:58.832  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 9505 hours (396 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 fe 00 00 25 b2 fa 58 40 00  Error: UNC at LBA = 0x25b2fa58 = 632486488

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 d8 00 00 25 b2 ff 00 40 08     17:33:56.041  READ FPDMA QUEUED
  60 00 80 00 d0 00 00 25 b2 fe 80 40 08     17:33:56.041  READ FPDMA QUEUED
  60 00 80 00 c8 00 00 25 b2 fe 00 40 08     17:33:56.041  READ FPDMA QUEUED
  60 00 80 00 c0 00 00 25 b2 fd 80 40 08     17:33:56.041  READ FPDMA QUEUED
  60 00 80 00 b8 00 00 25 b2 fd 00 40 08     17:33:56.040  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13880         -
# 2  Short offline       Aborted by host               10%     13880         -
# 3  Short offline       Interrupted (host reset)      10%     13880         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    31 Celsius
Power Cycle Min/Max Temperature:     22/31 Celsius
Lifetime    Min/Max Temperature:     21/42 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (433)

Index    Estimated Time   Temperature Celsius
 434    2016-11-13 12:33    32  *************
 ...    ..( 15 skipped).    ..  *************
 450    2016-11-13 12:49    32  *************
 451    2016-11-13 12:50    31  ************
 452    2016-11-13 12:51    32  *************
 ...    ..(  2 skipped).    ..  *************
 455    2016-11-13 12:54    32  *************
 456    2016-11-13 12:55    31  ************
 ...    ..( 16 skipped).    ..  ************
 473    2016-11-13 13:12    31  ************
 474    2016-11-13 13:13    32  *************
 475    2016-11-13 13:14    32  *************
 476    2016-11-13 13:15    32  *************
 477    2016-11-13 13:16    31  ************
   0    2016-11-13 13:17    31  ************
   1    2016-11-13 13:18    32  *************
 ...    ..( 14 skipped).    ..  *************
  16    2016-11-13 13:33    32  *************
  17    2016-11-13 13:34     ?  -
  18    2016-11-13 13:35    22  ***
  19    2016-11-13 13:36    22  ***
  20    2016-11-13 13:37    22  ***
  21    2016-11-13 13:38    23  ****
  22    2016-11-13 13:39    24  *****
  23    2016-11-13 13:40    24  *****
  24    2016-11-13 13:41    25  ******
  25    2016-11-13 13:42    25  ******
  26    2016-11-13 13:43    25  ******
  27    2016-11-13 13:44    26  *******
  28    2016-11-13 13:45    26  *******
  29    2016-11-13 13:46    27  ********
 ...    ..(  5 skipped).    ..  ********
  35    2016-11-13 13:52    27  ********
  36    2016-11-13 13:53    28  *********
  37    2016-11-13 13:54    28  *********
  38    2016-11-13 13:55    29  **********
  39    2016-11-13 13:56    28  *********
  40    2016-11-13 13:57    29  **********
 ...    ..( 11 skipped).    ..  **********
  52    2016-11-13 14:09    29  **********
  53    2016-11-13 14:10    30  ***********
 ...    ..( 11 skipped).    ..  ***********
  65    2016-11-13 14:22    30  ***********
  66    2016-11-13 14:23    31  ************
 ...    ..( 10 skipped).    ..  ************
  77    2016-11-13 14:34    31  ************
  78    2016-11-13 14:35    32  *************
 ...    ..( 26 skipped).    ..  *************
 105    2016-11-13 15:02    32  *************
 106    2016-11-13 15:03    33  **************
 107    2016-11-13 15:04    32  *************
 ...    ..( 10 skipped).    ..  *************
 118    2016-11-13 15:15    32  *************
 119    2016-11-13 15:16    33  **************
 120    2016-11-13 15:17    32  *************
 121    2016-11-13 15:18    32  *************
 122    2016-11-13 15:19    33  **************
 123    2016-11-13 15:20    32  *************
 ...    ..(  2 skipped).    ..  *************
 126    2016-11-13 15:23    32  *************
 127    2016-11-13 15:24    33  **************
 128    2016-11-13 15:25    32  *************
 ...    ..(  5 skipped).    ..  *************
 134    2016-11-13 15:31    32  *************
 135    2016-11-13 15:32     ?  -
 136    2016-11-13 15:33    21  **
 137    2016-11-13 15:34    21  **
 138    2016-11-13 15:35    21  **
 139    2016-11-13 15:36    22  ***
 140    2016-11-13 15:37    22  ***
 141    2016-11-13 15:38    23  ****
 142    2016-11-13 15:39    23  ****
 143    2016-11-13 15:40    23  ****
 144    2016-11-13 15:41    24  *****
 145    2016-11-13 15:42    25  ******
 ...    ..(  2 skipped).    ..  ******
 148    2016-11-13 15:45    25  ******
 149    2016-11-13 15:46    26  *******
 150    2016-11-13 15:47    26  *******
 151    2016-11-13 15:48    26  *******
 152    2016-11-13 15:49    27  ********
 ...    ..(  5 skipped).    ..  ********
 158    2016-11-13 15:55    27  ********
 159    2016-11-13 15:56    28  *********
 ...    ..(  5 skipped).    ..  *********
 165    2016-11-13 16:02    28  *********
 166    2016-11-13 16:03    29  **********
 ...    ..( 10 skipped).    ..  **********
 177    2016-11-13 16:14    29  **********
 178    2016-11-13 16:15    30  ***********
 ...    ..(  6 skipped).    ..  ***********
 185    2016-11-13 16:22    30  ***********
 186    2016-11-13 16:23    31  ************
 ...    ..(  9 skipped).    ..  ************
 196    2016-11-13 16:33    31  ************
 197    2016-11-13 16:34    32  *************
 ...    ..(  8 skipped).    ..  *************
 206    2016-11-13 16:43    32  *************
 207    2016-11-13 16:44    31  ************
 208    2016-11-13 16:45    31  ************
 209    2016-11-13 16:46    31  ************
 210    2016-11-13 16:47    32  *************
 211    2016-11-13 16:48    31  ************
 ...    ..(  2 skipped).    ..  ************
 214    2016-11-13 16:51    31  ************
 215    2016-11-13 16:52    32  *************
 216    2016-11-13 16:53    32  *************
 217    2016-11-13 16:54    31  ************
 218    2016-11-13 16:55    32  *************
 219    2016-11-13 16:56    31  ************
 ...    ..(  3 skipped).    ..  ************
 223    2016-11-13 17:00    31  ************
 224    2016-11-13 17:01    32  *************
 225    2016-11-13 17:02    32  *************
 226    2016-11-13 17:03    31  ************
 ...    ..(  2 skipped).    ..  ************
 229    2016-11-13 17:06    31  ************
 230    2016-11-13 17:07    32  *************
 231    2016-11-13 17:08    32  *************
 232    2016-11-13 17:09    31  ************
 ...    ..(  2 skipped).    ..  ************
 235    2016-11-13 17:12    31  ************
 236    2016-11-13 17:13    32  *************
 ...    ..( 39 skipped).    ..  *************
 276    2016-11-13 17:53    32  *************
 277    2016-11-13 17:54    31  ************
 ...    ..(  7 skipped).    ..  ************
 285    2016-11-13 18:02    31  ************
 286    2016-11-13 18:03    32  *************
 287    2016-11-13 18:04    32  *************
 288    2016-11-13 18:05    31  ************
 289    2016-11-13 18:06    32  *************
 ...    ..( 14 skipped).    ..  *************
 304    2016-11-13 18:21    32  *************
 305    2016-11-13 18:22    33  **************
 306    2016-11-13 18:23    32  *************
 307    2016-11-13 18:24    32  *************
 308    2016-11-13 18:25    33  **************
 309    2016-11-13 18:26    32  *************
 ...    ..( 19 skipped).    ..  *************
 329    2016-11-13 18:46    32  *************
 330    2016-11-13 18:47    33  **************
 331    2016-11-13 18:48    32  *************
 332    2016-11-13 18:49    32  *************
 333    2016-11-13 18:50    33  **************
 334    2016-11-13 18:51    32  *************
 ...    ..(  9 skipped).    ..  *************
 344    2016-11-13 19:01    32  *************
 345    2016-11-13 19:02    31  ************
 ...    ..( 16 skipped).    ..  ************
 362    2016-11-13 19:19    31  ************
 363    2016-11-13 19:20    32  *************
 ...    ..( 15 skipped).    ..  *************
 379    2016-11-13 19:36    32  *************
 380    2016-11-13 19:37     ?  -
 381    2016-11-13 19:38    33  **************
 382    2016-11-13 19:39    32  *************
 ...    ..( 29 skipped).    ..  *************
 412    2016-11-13 20:09    32  *************
 413    2016-11-13 20:10    33  **************
 ...    ..( 14 skipped).    ..  **************
 428    2016-11-13 20:25    33  **************
 429    2016-11-13 20:26    32  *************
 430    2016-11-13 20:27    33  **************
 431    2016-11-13 20:28    32  *************
 432    2016-11-13 20:29    32  *************
 433    2016-11-13 20:30    32  *************

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x8000  4         3575  Vendor specific
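
As a worked check of the two UNC entries in the extended error log above, here is a minimal sketch of how the reported LBA is assembled from the register bytes. It assumes the column order smartctl prints (three LBA_48 bytes, then LH, LM, LL); the helper name `lba_from_registers` is made up for illustration.

```python
def lba_from_registers(lba_48, lh, lm, ll):
    """Combine ATA-8 LBA register bytes into one 48-bit LBA.

    lba_48 holds the three upper bytes exactly as smartctl prints them,
    most significant first; lh, lm and ll are the LBA High/Mid/Low bytes.
    """
    lba = 0
    for byte in (*lba_48, lh, lm, ll):
        lba = (lba << 8) | byte
    return lba


# Register bytes copied from "Error 2" and "Error 1" in the log above.
error2 = lba_from_registers((0x00, 0x00, 0x25), 0xB3, 0x05, 0x58)
error1 = lba_from_registers((0x00, 0x00, 0x25), 0xB2, 0xFA, 0x58)
print(hex(error2), error2)
print(hex(error1), error1)
```

Both values agree with the "Error: UNC at LBA = ..." text that smartctl printed next to the registers.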


^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Anthony Youngman @ 2016-11-13 20:18 UTC (permalink / raw)
  To: Bruce Merry, linux-raid
In-Reply-To: <CAHy4j_7_nRMxOSW16VTAY7bzdW_VMap=Jeb2M0wMiNDoNXcijQ@mail.gmail.com>

Quick first response ...

On 13/11/16 18:46, Bruce Merry wrote:
> Hi
>
> I'm running software RAID1 across two drives in my home machine (LVM
> on LUKS on RAID1). I've just installed smartmontools and run short
> tests, and promptly received emails to tell me that one of the drives
> has 4 offline uncorrectable sectors and 3 current pending sectors.
> I've attached smartctl --xall output for sda (good) and sdb (bad).
>
> These drives are pretty old (over 5 years) so I'm going to replace
> them as soon as I have time (and yes, I have backups), but in the
> meantime I'd like advice on:
>
What drives are they? I'm guessing they're hunky-dory, but they don't 
fall foul of timeout mismatch, do they?

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

> 1. What exactly this means. My understanding is that some data has
> been lost (or may have been lost) on the drive, but the drive still
> has spare sectors to remap things once the failed sectors are written
> to. Is that correct?

It may also mean that the four sectors at least, have already been 
remapped ... I'll let the experts confirm. The three pending errors 
might be where a read has failed but there's not yet been a re-write - 
and you won't have noticed because the raid dealt with it.
>
> 2. How can I tell which sectors are problematic? If it's in the swap
> partition I'm far less worried. I can see two LBAs for offline
> uncorrectable errors in the --xall output, but that still leaves
> another two at large.

I don't think you need to be worried at all. It's only a few sectors, 
there's no sign of any further trouble? and as it's raided, when the 
drive returns an error the raid code will sort it out for you.
>
> 3. Assuming my understanding is correct, and that the sector falls
> within the RAID1 partition on the drive, is there some way I can
> recover the sectors from the other drive in the RAID1? As a last
> resort I imagine I could wipe the suspect drive and then rebuild it
> from the good one, but I'm hoping there's something less risky I can
> do.

Do a scrub? You've got seven errors total, which some people will say 
"panic on the first error" and others will say "so what, the odd error 
every now and then is nothing to worry about". The point of a scrub is 
it will background-scan the entire array, and if it can't read anything, 
it will re-calculate and re-write it.

Just make sure you've not got that timeout problem, or a scrub will make 
matters a whole lot worse ...
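
For reference, a minimal sketch of starting a scrub by hand through the md sysfs interface, which means writing `check` to the array's `sync_action` file. The array name `md0` and the helper names are assumptions for illustration, and the write needs root.

```python
from pathlib import Path


def start_scrub(md_name="md0", sysfs_root="/sys/block"):
    """Ask md to read-check the whole array by writing 'check' to sync_action."""
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    action.write_text("check\n")


def scrub_state(md_name="md0", sysfs_root="/sys/block"):
    """Return the current sync_action, for example 'idle' or 'check'."""
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    return action.read_text().strip()
```

The `sysfs_root` parameter only exists so the helpers can be tried against a scratch directory before pointing them at a real array.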
>
> Thanks in advance
> Bruce
>
Cheers,
Wol

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-13 20:51 UTC (permalink / raw)
  To: Anthony Youngman; +Cc: linux-raid
In-Reply-To: <942ab8be-cd5c-c6d1-d077-cd295b355c0c@youngman.org.uk>

On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
> Quick first response ...
>
> On 13/11/16 18:46, Bruce Merry wrote:
>>
>> Hi
>>
>> I'm running software RAID1 across two drives in my home machine (LVM
>> on LUKS on RAID1). I've just installed smartmontools and run short
>> tests, and promptly received emails to tell me that one of the drives
>> has 4 offline uncorrectable sectors and 3 current pending sectors.
>> I've attached smartctl --xall output for sda (good) and sdb (bad).
>>
>> These drives are pretty old (over 5 years) so I'm going to replace
>> them as soon as I have time (and yes, I have backups), but in the
>> meantime I'd like advice on:
>>
> What drives are they? I'm guessing they're hunky-dory, but they don't fall
> foul of timeout mismatch, do they?
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

smartctl reports "SCT Error Recovery Control command not supported".
Does that mean I should be worried? Is there any way to tell whether a
given drive I can buy online supports it?

>> 1. What exactly this means. My understanding is that some data has
>> been lost (or may have been lost) on the drive, but the drive still
>> has spare sectors to remap things once the failed sectors are written
>> to. Is that correct?
>
>
> It may also mean that the four sectors at least, have already been remapped
> ... I'll let the experts confirm. The three pending errors might be where a
> read has failed but there's not yet been a re-write - and you won't have
> noticed because the raid dealt with it.

I'm guessing nothing has been remapped yet, because the
Reallocated_Sector_Ct and Reallocated_Event_Count are both zero.

>> 3. Assuming my understanding is correct, and that the sector falls
>> within the RAID1 partition on the drive, is there some way I can
>> recover the sectors from the other drive in the RAID1? As a last
>> resort I imagine I could wipe the suspect drive and then rebuild it
>> from the good one, but I'm hoping there's something less risky I can
>> do.
>
>
> Do a scrub? You've got seven errors total, which some people will say "panic
> on the first error" and others will say "so what, the odd error every now
> and then is nothing to worry about". The point of a scrub is it will
> background-scan the entire array, and if it can't read anything, it will
> re-calculate and re-write it.

Yes, that sounds like what I need. Thanks to Google I found
/usr/share/mdadm/checkarray to trigger this. It still has a few hours
to go, but now the bad drive has pending sectors == 65535 (which is
suspiciously power-of-two and I assume means it's actually higher and
is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
1408. If scrubbing is supposed to rewrite on failed reads I would have
expected pending sectors to go down rather than up, so I'm not sure
what's happening.
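
A rough sketch of the two checks described above: whether a counter has hit a 16-bit ceiling (65535 is 2**16 - 1), and what md currently reports in `mismatch_cnt`. That the drive clamps the raw value at 16 bits is only my assumption, and the helper names are invented for the example.

```python
from pathlib import Path

UINT16_MAX = 0xFFFF  # 65535 = 2**16 - 1, the largest 16-bit value


def looks_saturated(raw_value, limit=UINT16_MAX):
    """True if a counter has reached the assumed 16-bit ceiling."""
    return raw_value >= limit


def read_mismatch_cnt(md_name="md0", sysfs_root="/sys/block"):
    """Read the mismatch count that md exposes for the array."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

A saturated counter only says "at least this many", so it cannot be compared meaningfully between two runs.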

Thanks
Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Wols Lists @ 2016-11-13 21:06 UTC (permalink / raw)
  To: Bruce Merry; +Cc: linux-raid
In-Reply-To: <CAHy4j_7F=gN9=7mEH-TsdVJR0YFxBzJK98WeJfuwtANoDEy93w@mail.gmail.com>

On 13/11/16 20:51, Bruce Merry wrote:
> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>> Quick first response ...
>>
>> On 13/11/16 18:46, Bruce Merry wrote:
>>>
>>> Hi
>>>
>>> I'm running software RAID1 across two drives in my home machine (LVM
>>> on LUKS on RAID1). I've just installed smartmontools and run short
>>> tests, and promptly received emails to tell me that one of the drives
>>> has 4 offline uncorrectable sectors and 3 current pending sectors.
>>> I've attached smartctl --xall output for sda (good) and sdb (bad).
>>>
>>> These drives are pretty old (over 5 years) so I'm going to replace
>>> them as soon as I have time (and yes, I have backups), but in the
>>> meantime I'd like advice on:
>>>
>> What drives are they? I'm guessing they're hunky-dory, but they don't fall
>> foul of timeout mismatch, do they?
>>
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 
> smartctl reports "SCT Error Recovery Control command not supported".
> Does that mean I should be worried? Is there any way to tell whether a
> given drive I can buy online supports it?

You need drives that explicitly support raid. WD Reds, Seagate NAS, some
Toshibas - my 2TB laptop drive does ... Try and find a friend with a
drive you like, and check it out, or ask on this list :-)

Did you run that script to increase the kernel timeout?
> 
>>> 1. What exactly this means. My understanding is that some data has
>>> been lost (or may have been lost) on the drive, but the drive still
>>> has spare sectors to remap things once the failed sectors are written
>>> to. Is that correct?
>>
>>
>> It may also mean that the four sectors at least, have already been remapped
>> ... I'll let the experts confirm. The three pending errors might be where a
>> read has failed but there's not yet been a re-write - and you won't have
>> noticed because the raid dealt with it.
> 
> I'm guessing nothing has been remapped yet, because the
> Reallocated_Sector_Ct and Reallocator_Event_ct are both zero.
> 
>>> 3. Assuming my understanding is correct, and that the sector falls
>>> within the RAID1 partition on the drive, is there some way I can
>>> recover the sectors from the other drive in the RAID1? As a last
>>> resort I imagine I could wipe the suspect drive and then rebuild it
>>> from the good one, but I'm hoping there's something less risky I can
>>> do.
>>
>>
>> Do a scrub? You've got seven errors total, which some people will say "panic
>> on the first error" and others will say "so what, the odd error every now
>> and then is nothing to worry about". The point of a scrub is it will
>> background-scan the entire array, and if it can't read anything, it will
>> re-calculate and re-write it.
> 
> Yes, that sounds like what I need. Thanks to Google I found
> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
> to go, but now the bad drive has pending sectors == 65535 (which is
> suspiciously power-of-two and I assume means it's actually higher and
> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
> 1408. If scrubbing is supposed to rewrite on failed reads I would have
> expected pending sectors to go down rather than up, so I'm not sure
> what's happening.
> 
Ummm....

Sounds like that drive could need replacing. I'd get a new drive and do
that as soon as possible - use the --replace option of mdadm - don't
fail the old drive and add the new. Dunno where you're based, but 5mins
on the internet ordering a new drive is probably time well spent.

Note that Seagate Barracudas don't have the best of reputations if
they're the drive you've already got, and the 3TB drives are best
avoided. Sod's law, I've got two of them ...

Advice I always give ... if you're getting new drives, always consider
increasing capacity. I don't know what size your current drives are, but
look at prices of drives a bit larger than what they are, and is it
worth paying the extra?

If you do get bigger drives, there's nothing stopping you making the
partitions on it bigger before you add them in to the array. It'll be
wasted space until you increase the size of all the drives, but once
you've replaced both drives, you can use mdadm to increase the array
size. I don't know about LUKS, but I would expect you can grow that, and
then you can expand your data partitions within that.
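
A minimal sketch of the mdadm invocations this involves, built as argument lists rather than executed. The device names in the usage line are placeholders, and the helper names are made up; check the mdadm man page for your version before running anything.

```python
def replace_commands(array, old_member, new_member):
    """Build (not run) the mdadm calls for a hot replace of one member."""
    return [
        # The new device is added as a spare first ...
        ["mdadm", array, "--add", new_member],
        # ... then the old member is replaced while it stays in the array.
        ["mdadm", array, "--replace", old_member, "--with", new_member],
    ]


def grow_command(array):
    """Build the mdadm call that grows the array to the largest size its members allow."""
    return ["mdadm", "--grow", array, "--size=max"]


# Placeholder device names, purely for illustration.
for cmd in replace_commands("/dev/md0", "/dev/sdb1", "/dev/sdc1"):
    print(" ".join(cmd))
print(" ".join(grow_command("/dev/md0")))
```

The grow step only makes sense once every member has been moved to a larger partition.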


^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 22:53 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block
In-Reply-To: <20161112174238.GA11518@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 3090 bytes --]

On Sun, Nov 13 2016, Christoph Hellwig wrote:

> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > drivers don't touch.  One example is the r1buf_pool_alloc code,
>> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > case, which would also take care of r1buf_pool_free.  I'm not sure
>> > about all the others cases, as some bits don't fully make sense to me,
>> 
>> The problem is we use the iov_vec to track the pages allocated. We will read
>> data to the pages and write out later for resync. If we add new fields to track
>> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> avoid the tricky parts. This should work for both the resync and writebehind
>> cases.
>
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios.  for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.

Part of the reason for the oddities in this code is that I wanted a
collection of bios, one per device, which were all the same size.  As
different devices might impose different restrictions on the size of the
bios, I built them carefully, step by step.

Now that those restrictions are gone, we can - as you say - just
allocate a suitably sized bio and then clone it for each device.

>
> While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> confusing, and I'm not 100% sure it's correct.  After all we check it
> in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> on these callbacks being done after the flag has been raise / cleared,
> which makes me bit suspicious, and also question why we even need the
> mempool.

MD_RECOVERY_REQUESTED is only set or cleared when no recovery is running.
The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
races there.
The r1buf_pool mempool is created at the start of resync, so at that
time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
after the mempool is freed.

To perform a resync we need a pool of memory buffers.  We don't want to
have to cope with kmalloc failing, but are quite able to cope with
mempool_alloc() blocking.
We probably don't need nearly as many bufs as we allocate (4 is probably
plenty), but having a pool is certainly convenient.
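
As a userspace analogy only (this is not the kernel mempool API), a sketch of a pool whose allocation call blocks until a buffer is returned instead of failing. The class name and the use of `queue.Queue` are my own choices for illustration.

```python
import queue


class BufferPool:
    """Fixed set of preallocated buffers; get() blocks instead of failing."""

    def __init__(self, count, size):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def get(self, timeout=None):
        # With timeout=None this waits until another user calls put().
        # queue.Empty is raised only when a timeout is given and expires.
        return self._free.get(timeout=timeout)

    def put(self, buf):
        # Hand a buffer back so a blocked get() can proceed.
        self._free.put(buf)
```

With a pool of 4 buffers, a fifth concurrent request simply waits for one of the first four to finish, which is the behaviour the resync path relies on.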

>
>> 
>> > e.g. why we're trying to do single page I/O out of a bigger bio.
>> 
>> what's this one?
>
> fix_sync_read_error

The "bigger bio" might cover a large number of sectors.  If there are
media errors, there might be only one sector that is bad.  So we repeat
the read with finer granularity (pages in the current code, though
device block would be ideal) and only record bad blocks for individual
pages which are bad and cannot be fixed.
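
A simplified sketch of that idea, in userspace terms: retry a failed large read one page at a time and note which pages still fail. `read_fn` is a hypothetical callable standing in for the device read, the 4096-byte page size is assumed, and the real code also tries the other mirrors, which this leaves out.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for the sketch


def reread_in_pages(read_fn, start, length, page_size=PAGE_SIZE):
    """Retry a failed large read one page at a time.

    read_fn(offset, size) is a hypothetical callable that returns bytes
    or raises OSError on a media error. Returns (pages, bad_offsets),
    where pages maps offset -> data for every page that could be read.
    """
    pages = {}
    bad_offsets = []
    end = start + length
    for offset in range(start, end, page_size):
        size = min(page_size, end - offset)
        try:
            pages[offset] = read_fn(offset, size)
        except OSError:
            # Only this page is marked bad; its neighbours are unaffected.
            bad_offsets.append(offset)
    return pages, bad_offsets
```

A single unreadable sector then costs one page of bad-block bookkeeping rather than the whole original request.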

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 23:03 UTC (permalink / raw)
  To: Christoph Hellwig, Shaohua Li; +Cc: linux-raid, linux-block
In-Reply-To: <20161110194636.GA32241@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 720 bytes --]

On Fri, Nov 11 2016, Christoph Hellwig wrote:
>
> Another not quite as urgent issue is how the RAID5 code abuses
> ->bi_phys_segments as and outstanding I/O counter, and I have no
> really good answer to that either.

I would suggest adding a "bi_dev_private" field to the bio which is for
use by the lowest-level driver (much as bi_private is for use by the
top-level initiator).
That could be in a union with any or all of:
	unsigned int		bi_phys_segments;
	unsigned int		bi_seg_front_size;
	unsigned int		bi_seg_back_size;

(any driver that needs those, would see a 'request' rather than a 'bio'
and so could use rq->special)

raid5.c could then use bi_dev_private (or bi_special, or whatever it is called).
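
To make the layout concrete, a small model of such a union using Python's ctypes. This is only an illustration of the overlay, not a proposed struct bio change; the type names `SegmentFields` and `BioDriverArea` are invented, and grouping all three counters into one struct is my assumption.

```python
import ctypes


class SegmentFields(ctypes.Structure):
    # The three existing fields that would share storage with the new one.
    _fields_ = [
        ("bi_phys_segments", ctypes.c_uint),
        ("bi_seg_front_size", ctypes.c_uint),
        ("bi_seg_back_size", ctypes.c_uint),
    ]


class BioDriverArea(ctypes.Union):
    # The anonymous member lets the segment fields be accessed directly.
    _anonymous_ = ("seg",)
    _fields_ = [
        ("bi_dev_private", ctypes.c_void_p),
        ("seg", SegmentFields),
    ]


area = BioDriverArea()
area.bi_phys_segments = 7
# The same bytes are now visible through bi_dev_private as well.
print(ctypes.sizeof(area), area.bi_dev_private)
```

Because the members overlap, a driver may use one view or the other for a given bio, never both.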

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-13 23:03 UTC (permalink / raw)
  To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <5828D5DA.1070406@youngman.org.uk>

Hi Bruce,

On 11/13/2016 04:06 PM, Wols Lists wrote:
> On 13/11/16 20:51, Bruce Merry wrote:
>> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>>> Quick first response ...

>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>
>> smartctl reports "SCT Error Recovery Control command not supported".
>> Does that mean I should be worried? Is there any way to tell whether a
>> given drive I can buy online supports it?

You should be worried.  It is vital for proper MD raid operation that
drive timeouts be shorter than the kernel timeout for that device.  If
you can't make the drive timeout short, you *must* make the kernel
timeout long.
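
A minimal sketch of that rule and of lengthening the kernel side through sysfs (`/sys/block/<disk>/device/timeout`, in seconds). The 180 second figure is an assumption borrowed from common practice rather than something measured here, the helper names are made up, and the write needs root.

```python
from pathlib import Path

DEFAULT_KERNEL_TIMEOUT = 30  # seconds; the usual Linux SCSI command timeout
LONG_KERNEL_TIMEOUT = 180    # seconds; assumed to outlast a desktop drive's retries


def timeout_is_safe(drive_timeout_s, kernel_timeout_s):
    """The drive must give up on a bad sector before the kernel gives up on the drive."""
    return drive_timeout_s < kernel_timeout_s


def set_kernel_timeout(disk, seconds=LONG_KERNEL_TIMEOUT, sysfs_root="/sys/block"):
    """Write a new command timeout for one disk, for example 'sdb'."""
    path = Path(sysfs_root) / disk / "device" / "timeout"
    path.write_text(f"{seconds}\n")
```

The setting does not survive a reboot, so it has to be reapplied at every boot for each drive without ERC.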

> You need drives that explicitly support raid. WD Reds, Seagate NAS, some
> Toshibas - my 2TB laptop drive does ... Try and find a friend with a
> drive you like, and check it out, or ask on this list :-)

Manufacturers' data sheets for system builders usually contain enough
information to determine if ERC is supported.  Nowadays, the "NAS"
families work out of the box.  And enterprise drives, too, of course.

>>>> 1. What exactly this means. My understanding is that some data has
>>>> been lost (or may have been lost) on the drive, but the drive still
>>>> has spare sectors to remap things once the failed sectors are written
>>>> to. Is that correct?

Generally, yes.

>>> Do a scrub? You've got seven errors total, which some people will say "panic
>>> on the first error" and others will say "so what, the odd error every now
>>> and then is nothing to worry about". The point of a scrub is it will
>>> background-scan the entire array, and if it can't read anything, it will
>>> re-calculate and re-write it.
>>
>> Yes, that sounds like what I need. Thanks to Google I found
>> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
>> to go, but now the bad drive has pending sectors == 65535 (which is
>> suspiciously power-of-two and I assume means it's actually higher and
>> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
>> 1408. If scrubbing is supposed to rewrite on failed reads I would have
>> expected pending sectors to go down rather than up, so I'm not sure
>> what's happening.
>>
> Ummm....
> 
> Sounds like that drive could need replacing. I'd get a new drive and do
> that as soon as possible - use the --replace option of mdadm - don't
> fail the old drive and add the new. Dunno where you're based, but 5mins
> on the internet ordering a new drive is probably time well spent.

You have two other possibilities:

1) Swap volumes in the raid.  These are known to drop unneeded writes
when the data isn't needed, even if it made it to one of the mirrors.
That makes harmless mismatches.

2) Trim.  Well-behaved drive firmware guarantees zeros for trimmed
sectors, but many drives return random data instead.  Zing, mismatches.
It's often unhelpful with encrypted volumes, as even well-behaved
firmware can't deliver zeroed sectors *inside* the encryption.

I wouldn't panic just yet.  The check scrub should (with mitigated
timeouts) fix all of your pending sectors.  Then look at your actual
relocations to determine if you really do have a problem.

Phil

^ permalink raw reply

* [md PATCH 0/4] Improve blktrace tracing of md.
From: NeilBrown @ 2016-11-14  5:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

blktrace on md devices reports when a request is queued and when it is
split, but request completion and the mapping to subordinate devices
is not reported.
So add that, as well as some events when IO is delayed for one
reason or another (eg. bitmap updates etc).

---

NeilBrown (4):
      md: add block tracing for bio_remapping
      md: add bio completion tracing for raid1/raid10
      md/bitmap: add blktrace event for writes to the bitmap.
      md/raid1,raid10: add blktrace records when IO is delayed.


 drivers/md/bitmap.c |   11 ++++++++++-
 drivers/md/linear.c |    8 +++++++-
 drivers/md/raid0.c  |    8 +++++++-
 drivers/md/raid1.c  |   42 +++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid10.c |   38 ++++++++++++++++++++++++++++++++++++--
 5 files changed, 99 insertions(+), 8 deletions(-)

--
Signature


^ permalink raw reply

* [md PATCH 2/4] md: add bio completion tracing for raid1/raid10
From: NeilBrown @ 2016-11-14  5:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>

raid5 already has this, as does dm.
linear and raid0 do not see completions, only bio_chain_end() or bio_endio()
see those.
So just add it for raid1 and raid10.

Between
 Commit: 3a366e614d08 ("block: add missing block_bio_complete() tracepoint")
and
 Commit: 0a82a8d132b2 ("Revert "block: add missing block_bio_complete() tracepoint"")
in the 3.9-rc series, this was done centrally in bio_endio().
Until/unless that is resurrected, do the tracing in the md/raid code.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c  |    1 +
 drivers/md/raid10.c |    1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3710a792a149..0674e5a0142e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -257,6 +257,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		bio->bi_error = -EIO;
 
 	if (done) {
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, bio->bi_error);
 		bio_endio(bio);
 		/*
 		 * Wake up any possible resync thread that waits for the device
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d144c3425824..c3036099ff9a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -311,6 +311,7 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
 	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
 		bio->bi_error = -EIO;
 	if (done) {
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, bio->bi_error);
 		bio_endio(bio);
 		/*
 		 * Wake up any possible resync thread that waits for the device



^ permalink raw reply related

* [md PATCH 1/4] md: add block tracing for bio_remapping
From: NeilBrown @ 2016-11-14  5:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>

The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.

When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the original bio,
or a secondary mapping for the bio that errors.  This patch uses
the second option.

When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/linear.c |    8 +++++++-
 drivers/md/raid0.c  |    8 +++++++-
 drivers/md/raid1.c  |   33 ++++++++++++++++++++++++++++++---
 drivers/md/raid10.c |   29 +++++++++++++++++++++++++++--
 4 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 9c7d4f5483ea..8c0bccfa53a2 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -21,6 +21,7 @@
 #include <linux/seq_file.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "linear.h"
 
@@ -256,8 +257,13 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 			 !blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
 			/* Just ignore it */
 			bio_endio(split);
-		} else
+		} else {
+			if (mddev->gendisk)
+				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
+						      split, disk_devt(mddev->gendisk),
+						      bio->bi_iter.bi_sector);
 			generic_make_request(split);
+		}
 	} while (split != bio);
 	return;
 
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index b3ba77a3c3bc..841b3ad0f5ff 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -21,6 +21,7 @@
 #include <linux/seq_file.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid0.h"
 #include "raid5.h"
@@ -491,8 +492,13 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 			 !blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
 			/* Just ignore it */
 			bio_endio(split);
-		} else
+		} else {
+			if (mddev->gendisk)
+				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
+						      split, disk_devt(mddev->gendisk),
+						      bio->bi_iter.bi_sector);
 			generic_make_request(split);
+		}
 	} while (split != bio);
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9ac61cd85e5c..3710a792a149 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -37,6 +37,7 @@
 #include <linux/module.h>
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid1.h"
 #include "bitmap.h"
@@ -743,6 +744,7 @@ static void flush_pending_writes(struct r1conf *conf)
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
 			struct md_rdev *rdev = (void*)bio->bi_bdev;
+			struct r1bio *r1_bio = bio->bi_private;
 			bio->bi_next = NULL;
 			bio->bi_bdev = rdev->bdev;
 			if (test_bit(Faulty, &rdev->flags)) {
@@ -752,8 +754,13 @@ static void flush_pending_writes(struct r1conf *conf)
 					    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 				/* Just ignore it */
 				bio_endio(bio);
-			else
+			else {
+				if (conf->mddev->gendisk)
+					trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+							      bio, disk_devt(conf->mddev->gendisk),
+							      r1_bio->sector);
 				generic_make_request(bio);
+			}
 			bio = next;
 		}
 	} else
@@ -1022,6 +1029,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
 		struct md_rdev *rdev = (void*)bio->bi_bdev;
+		struct r1bio *r1_bio = bio->bi_private;
 		bio->bi_next = NULL;
 		bio->bi_bdev = rdev->bdev;
 		if (test_bit(Faulty, &rdev->flags)) {
@@ -1031,8 +1039,13 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 			/* Just ignore it */
 			bio_endio(bio);
-		else
+		else {
+			if (mddev->gendisk)
+				trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+						      bio, disk_devt(mddev->gendisk),
+						      r1_bio->sector);
 			generic_make_request(bio);
+		}
 		bio = next;
 	}
 	kfree(plug);
@@ -1162,6 +1175,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		bio_set_op_attrs(read_bio, op, do_sync);
 		read_bio->bi_private = r1_bio;
 
+		if (mddev->gendisk)
+			trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
+					      read_bio, disk_devt(mddev->gendisk),
+					      r1_bio->sector);
+
 		if (max_sectors < r1_bio->sectors) {
 			/* could not read all from this device, so we will
 			 * need another r1_bio.
@@ -2290,6 +2308,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	struct bio *bio;
 	char b[BDEVNAME_SIZE];
 	struct md_rdev *rdev;
+	dev_t bio_dev;
+	sector_t bio_sector;
 
 	clear_bit(R1BIO_ReadError, &r1_bio->state);
 	/* we got a read error. Maybe the drive is bad.  Maybe just
@@ -2303,6 +2323,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 
 	bio = r1_bio->bios[r1_bio->read_disk];
 	bdevname(bio->bi_bdev, b);
+	bio_dev = bio->bi_bdev->bd_dev;
+	bio_sector = conf->mirrors[r1_bio->read_disk].rdev->data_offset + r1_bio->sector;
 	bio_put(bio);
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
@@ -2353,6 +2375,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 			else
 				mbio->bi_phys_segments++;
 			spin_unlock_irq(&conf->device_lock);
+			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+					      bio, bio_dev, bio_sector);
 			generic_make_request(bio);
 			bio = NULL;
 
@@ -2367,8 +2391,11 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 				sectors_handled;
 
 			goto read_more;
-		} else
+		} else {
+			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+					      bio, bio_dev, bio_sector);
 			generic_make_request(bio);
+		}
 	}
 }
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5290be3d5c26..d144c3425824 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -25,6 +25,7 @@
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid10.h"
 #include "raid0.h"
@@ -859,6 +860,7 @@ static void flush_pending_writes(struct r10conf *conf)
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
 			struct md_rdev *rdev = (void*)bio->bi_bdev;
+			struct r10bio *r10_bio = bio->bi_private;
 			bio->bi_next = NULL;
 			bio->bi_bdev = rdev->bdev;
 			if (test_bit(Faulty, &rdev->flags)) {
@@ -868,8 +870,13 @@ static void flush_pending_writes(struct r10conf *conf)
 					    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 				/* Just ignore it */
 				bio_endio(bio);
-			else
+			else {
+				if (conf->mddev->gendisk)
+					trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+							      bio, disk_devt(conf->mddev->gendisk),
+							      r10_bio->sector);
 				generic_make_request(bio);
+			}
 			bio = next;
 		}
 	} else
@@ -1042,6 +1049,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
 		struct md_rdev *rdev = (void*)bio->bi_bdev;
+		struct r10bio *r10_bio = bio->bi_private;
 		bio->bi_next = NULL;
 		bio->bi_bdev = rdev->bdev;
 		if (test_bit(Faulty, &rdev->flags)) {
@@ -1051,8 +1059,13 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 			/* Just ignore it */
 			bio_endio(bio);
-		else
+		else {
+			if (conf->mddev->gendisk)
+				trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+						      bio, disk_devt(conf->mddev->gendisk),
+						      r10_bio->sector);
 			generic_make_request(bio);
+		}
 		bio = next;
 	}
 	kfree(plug);
@@ -1165,6 +1178,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		bio_set_op_attrs(read_bio, op, do_sync);
 		read_bio->bi_private = r10_bio;
 
+		if (mddev->gendisk)
+			trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
+					      read_bio, disk_devt(mddev->gendisk),
+					      r10_bio->sector);
 		if (max_sectors < r10_bio->sectors) {
 			/* Could not read all from this device, so we will
 			 * need another r10_bio.
@@ -2496,6 +2513,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	char b[BDEVNAME_SIZE];
 	unsigned long do_sync;
 	int max_sectors;
+	dev_t bio_dev;
+	sector_t bio_last_sector;
 
 	/* we got a read error. Maybe the drive is bad.  Maybe just
 	 * the block and we can fix it.
@@ -2507,6 +2526,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	 */
 	bio = r10_bio->devs[slot].bio;
 	bdevname(bio->bi_bdev, b);
+	bio_dev = bio->bi_bdev->bd_dev;
+	bio_last_sector = r10_bio->devs[slot].addr + rdev->data_offset + r10_bio->sectors;
 	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
@@ -2546,6 +2567,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
 	bio->bi_private = r10_bio;
 	bio->bi_end_io = raid10_end_read_request;
+	trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+			      bio, bio_dev,
+			      bio_last_sector - r10_bio->sectors);
+
 	if (max_sectors < r10_bio->sectors) {
 		/* Drat - have to split this up more */
 		struct bio *mbio = r10_bio->master_bio;



^ permalink raw reply related

* [md PATCH 4/4] md/raid1, raid10: add blktrace records when IO is delayed.
From: NeilBrown @ 2016-11-14  5:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>

Both raid1 and raid10 will sometimes delay handling an IO request,
such as when resync is happening or there are too many requests queued.

Add some blktrace messages so we can see when that is happening when
looking for performance artefacts.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c  |    8 ++++++++
 drivers/md/raid10.c |    8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0674e5a0142e..e94db92a4dbf 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -71,6 +71,9 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
 			  sector_t bi_sector);
 static void lower_barrier(struct r1conf *conf);
 
+#define raid1_log(md, fmt, args...)				\
+	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
+
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct pool_info *pi = data;
@@ -868,6 +871,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 		 * that queue to allow conf->start_next_window
 		 * to increase.
 		 */
+		raid1_log(conf->mddev, "wait barrier");
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->array_frozen &&
 				    (!conf->barrier ||
@@ -947,6 +951,7 @@ static void freeze_array(struct r1conf *conf, int extra)
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
+	raid1_log(conf->mddev, "wait freeze");
 	wait_event_lock_irq_cmd(conf->wait_barrier,
 				conf->nr_pending == conf->nr_queued+extra,
 				conf->resync_lock,
@@ -1157,6 +1162,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 			 * take care not to over-take any writes
 			 * that are 'behind'
 			 */
+			raid1_log(mddev, "wait behind writes");
 			wait_event(bitmap->behind_wait,
 				   atomic_read(&bitmap->behind_writes) == 0);
 		}
@@ -1221,6 +1227,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	 */
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
+		raid1_log(mddev, "wait queued");
 		wait_event(conf->wait_barrier,
 			   conf->pending_count < max_queued_requests);
 	}
@@ -1312,6 +1319,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
 		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
 		start_next_window = wait_barrier(conf, bio);
 		/*
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c3036099ff9a..15e55488a9d2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -106,6 +106,9 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
 static void end_reshape_write(struct bio *bio);
 static void end_reshape(struct r10conf *conf);
 
+#define raid10_log(md, fmt, args...)				\
+	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, ##args); } while (0)
+
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct r10conf *conf = data;
@@ -949,6 +952,7 @@ static void wait_barrier(struct r10conf *conf)
 		 * that queue to get the nr_pending
 		 * count down.
 		 */
+		raid10_log(conf->mddev, "wait barrier");
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (atomic_read(&conf->nr_pending) &&
@@ -1106,6 +1110,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		/* IO spans the reshape position.  Need to wait for
 		 * reshape to pass
 		 */
+		raid10_log(conf->mddev, "wait reshape");
 		allow_barrier(conf);
 		wait_event(conf->wait_barrier,
 			   conf->reshape_progress <= bio->bi_iter.bi_sector ||
@@ -1125,6 +1130,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		set_mask_bits(&mddev->flags, 0,
 			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
 		md_wakeup_thread(mddev->thread);
+		raid10_log(conf->mddev, "wait reshape metadata");
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 
@@ -1222,6 +1228,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	 */
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
+		raid10_log(mddev, "wait queued");
 		wait_event(conf->wait_barrier,
 			   conf->pending_count < max_queued_requests);
 	}
@@ -1349,6 +1356,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 			}
 		}
 		allow_barrier(conf);
+		raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
 		wait_barrier(conf);
 		goto retry_write;



^ permalink raw reply related

* [md PATCH 3/4] md/bitmap: add blktrace event for writes to the bitmap.
From: NeilBrown @ 2016-11-14  5:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>

We trace whenever bitmap_unplug() finds that it needs to write
to the bitmap, or when bitmap_daemon_work() finds there is work
to do.

This makes it easier to correlate bitmap updates with data writes.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/bitmap.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 1a7f402b79ba..cf77cbf9ed22 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -27,6 +27,7 @@
 #include <linux/mount.h>
 #include <linux/buffer_head.h>
 #include <linux/seq_file.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
 
@@ -1008,8 +1009,12 @@ void bitmap_unplug(struct bitmap *bitmap)
 		need_write = test_and_clear_page_attr(bitmap, i,
 						      BITMAP_PAGE_NEEDWRITE);
 		if (dirty || need_write) {
-			if (!writing)
+			if (!writing) {
 				bitmap_wait_writes(bitmap);
+				if (bitmap->mddev->queue)
+					blk_add_trace_msg(bitmap->mddev->queue,
+							  "md bitmap_unplug");
+			}
 			clear_page_attr(bitmap, i, BITMAP_PAGE_PENDING);
 			write_page(bitmap, bitmap->storage.filemap[i], 0);
 			writing = 1;
@@ -1234,6 +1239,10 @@ void bitmap_daemon_work(struct mddev *mddev)
 	}
 	bitmap->allclean = 1;
 
+	if (bitmap->mddev->queue)
+		blk_add_trace_msg(bitmap->mddev->queue,
+				  "md bitmap_daemon_work");
+
 	/* Any file-page which is PENDING now needs to be written.
 	 * So set NEEDWRITE now, then after we make any last-minute changes
 	 * we will write it.



^ permalink raw reply related

* Re: Ddf based RAID management software
From: NeilBrown @ 2016-11-14  5:48 UTC (permalink / raw)
  To: Arka Sharma, linux-raid
In-Reply-To: <CAPO=kN3xzv_MmbTSLT0bfxOpdZ+yvmybSXb6L-goBUasm+1NoQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

On Sun, Nov 13 2016, Arka Sharma wrote:

> Hello All,
>
> Is there any tool apart from mdadm available which can create software
> RAID based on Ddf metadata. We want to dump the metadata content and
> tally with metadata written by mdadm and our application.

Not that I'm aware of.
dmraid can read some ddf metadata, but I don't think it will create new
metadata.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: mdadm I/O error with Ddf RAID
From: NeilBrown @ 2016-11-14  6:00 UTC (permalink / raw)
  To: Arka Sharma, linux-raid
In-Reply-To: <CAPO=kN2QDLEMgo9p9pU3=MeLQ=J6R8eeDL1Pw9m2pHjbVsuFGg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

On Fri, Nov 11 2016, Arka Sharma wrote:

> Hi All,
>
> We have developed a RAID creation application which creates RAID with
> Ddf RAID metadata. We are using PCIe ssd as physical disks. We are
> writing the anchor, primary, secondary headers, virtual and physical
> records, configuration record and physical disk data. The offsets of
> the headers are updated in the primary, secondary and anchor headers
> correctly. The problem is that when we try to boot Ubuntu server, we
> observe that mdadm throws a disk failure error message, and from the
> block layer we get rw=0, want=7, limit=1000215216. We also
> confirmed, using a logic analyzer, that no I/O error is coming from
> the PCIe ssd. Also, the limit value 1000215216 is the
> capacity of the ssd in 512-byte blocks. Any insight will be highly
> appreciated.
>

It looks like mdadm is attempting a 4K read starting at the last sector.

Possibly the SSDs report a physical sector size of 4K.

I don't know how DDF is supposed to work on a device like that.
Should the anchor be at the start of the last 4K block,
or in the last 512-byte virtual block?

DDF support in mdadm was written with the assumption of 512 byte blocks.

I'm not at all certain this is the cause of the problem though.

I would suggest starting by finding out which READ request in mdadm is
causing the error.
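
The range check behind a "want/limit" style message can be modelled in
user space. This is only a minimal sketch, not the kernel's code:
range_beyond_end() and its parameters are hypothetical names, and all
values are assumed to be in 512-byte sectors. It shows why a 4K
(8-sector) read starting at the last sector of a 1000215216-sector
device has to fail, while a single-sector read at the same position
fits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of 512-byte sectors in one 4K block. */
#define SECTORS_PER_4K 8u

/*
 * Hypothetical helper: returns true if the read [start, start + nr_sectors)
 * would run past a device that holds `limit` sectors.
 */
static bool range_beyond_end(uint64_t start, uint64_t nr_sectors, uint64_t limit)
{
	assert(nr_sectors > 0);
	/* Compare against limit - start so that start + nr_sectors cannot overflow. */
	return start >= limit || nr_sectors > limit - start;
}
```

With limit = 1000215216, range_beyond_end(limit - 1, 1, limit) is false,
whereas range_beyond_end(limit - 1, SECTORS_PER_4K, limit) is true.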

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14  6:50 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid
In-Reply-To: <bdd9358d-2141-eb4f-e765-52177b1ec852@turmel.org>

On 14 November 2016 at 01:03, Phil Turmel <philip@turmel.org> wrote:
> Hi Bruce,
>
> On 11/13/2016 04:06 PM, Wols Lists wrote:
>> On 13/11/16 20:51, Bruce Merry wrote:
>>> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>>>> Quick first response ...
>
>>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>>
>>> smartctl reports "SCT Error Recovery Control command not supported".
>>> Does that mean I should be worried? Is there any way to tell whether a
>>> given drive I can buy online supports it?
>
> You should be worried.  It is vital for proper MD raid operation that
> drive timeouts be shorter than the kernel timeout for that device.  If
> you can't make the drive timeout short, you *must* make the kernel
> timeout long.

Okay, I'll give that script a go to increase my kernel timeout. If I
understand correctly, it's not the end of the world if the drive
doesn't support SCTERC, provided I have a long kernel timeout (and
when things go wrong it might take much longer to recover, but I can
live with that). Is that correct?
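
The rule quoted above (the drive must give up before the kernel does)
can be written as a small check. This is a sketch under stated
assumptions: timeouts_compatible() is a hypothetical helper, the drive's
worst-case recovery time is given in tenths of a second (the unit
smartctl uses for SCT ERC), and the kernel timeout is in seconds (the
unit of /sys/block/<dev>/device/timeout). The 120-second worst case used
for a drive without SCTERC is an illustrative guess, not a measured
value.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when the drive's worst-case error
 * recovery time is strictly shorter than the kernel's command timeout.
 *
 * drive_deciseconds: worst-case drive recovery time, in 0.1 s units
 * kernel_seconds:    kernel command timeout for the device, in seconds
 */
static bool timeouts_compatible(unsigned int drive_deciseconds,
				unsigned int kernel_seconds)
{
	assert(kernel_seconds > 0);
	return drive_deciseconds < kernel_seconds * 10u;
}
```

For example, a drive with ERC set to 7.0 seconds passes with the usual
30-second kernel timeout: timeouts_compatible(70, 30) is true. A drive
assumed to retry for up to 120 seconds fails with that timeout
(timeouts_compatible(1200, 30) is false) but passes once the kernel
timeout is raised to 180 seconds.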

>>> Yes, that sounds like what I need. Thanks to Google I found
>>> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
>>> to go, but now the bad drive has pending sectors == 65535 (which is
>>> suspiciously power-of-two and I assume means it's actually higher and
>>> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
>>> 1408. If scrubbing is supposed to rewrite on failed reads I would have
>>> expected pending sectors to go down rather than up, so I'm not sure
>>> what's happening.
>>>
>> Ummm....
>>
>> Sounds like that drive could need replacing. I'd get a new drive and do
>> that as soon as possible - use the --replace option of mdadm - don't
>> fail the old drive and add the new. Dunno where you're based, but 5mins
>> on the internet ordering a new drive is probably time well spent.

Oh don't worry, I wasted no time in ordering new drives already.

> You have two other possibilities:
>
> 1) Swap volumes in the raid.  These are known to drop unneeded writes
> when the data isn't needed, even if it made it to one of the mirrors.
> That makes harmless mismatches.

It won't be that - I keep separate non-RAIDed partitions for swap.

> 2) Trim.  Well-behaved drive firmware guarantees zeros for trimmed
> sectors, but many drives return random data instead.  Zing, mismatches.
> It's often unhelpful with encrypted volumes, as even well-behaved
> firmware can't deliver zeroed sectors *inside* the encryption.

Weee, sounds like fun. I hope it's that. Is there any way to tell
which blocks mismatch, so that I can tell which files are in trouble
(assuming I can figure out how to map through LVM, LUKS and
debugfs)?

Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14  8:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <87shqvj83r.fsf@notabene.neil.brown.name>

On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
> I would suggest adding a "bi_dev_private" field to the bio which is for
> use by the lowest-level driver (much as bi_private is for use by the
> top-level initiator).
> That could be in a union with any or all of:
> 	unsigned int		bi_phys_segments;
> 	unsigned int		bi_seg_front_size;
> 	unsigned int		bi_seg_back_size;
> 
> (any driver that needs those, would see a 'request' rather than a 'bio'
> and so could use rq->special)
> 
> raid5.c could then use bi_dev_private (or bi_special, or whatever it is called).

All the three above fields are those that could go away with a full
implementation of the multipage bvec scheme.  So any field for driver
use would still be overhead.  If it's just for raid5 it could
be a smaller 16 bit (or maybe even just 8 bit) one.

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14  8:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <87vavrj8jp.fsf@notabene.neil.brown.name>

On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> > confusing, and I'm not 100% sure it's correct.  After all we check it
> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> > on these callbacks being done after the flag has been raised / cleared,
> > which makes me a bit suspicious, and also question why we even need the
> > mempool.
> 
> MD_RECOVERY_REQUESTED is only set or cleared when no recovery is running.
> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
> races there.
> The r1buf_pool mempool is created at the start of resync, so at that
> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
> after the mempool is freed.
> 
> To perform a resync we need a pool of memory buffers.  We don't want to
> have to cope with kmalloc failing, but are quite able to cope with
> mempool_alloc() blocking.
> We probably don't need nearly as many bufs as we allocate (4 is probably
> plenty), but having a pool is certainly convenient.

Would it be good to create/delete the pool explicitly through methods
to start/end the sync?  Right now the behavior looks very, very
confusing.

> The "bigger bio" might cover a large number of sectors.  If there are
> media errors, there might be only one sector that is bad.  So we repeat
> the read with finer granularity (pages in the current code, though
> device block would be ideal) and only record bad blocks for individual
> pages which are bad and cannot be fixed.

I have no problems with the behavior - the point is that these days
this should be done without poking into the bio internals, but by using
a bio iterator for just the range you want to re-read.  Potentially
using a bio clone if we can't reuse the existing bio, although I'm
not sure we even need that from looking at the code.
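
The finer-granularity re-read described in the quoted text can be
modelled in user space. This is a sketch only and does not use the
kernel's bio or bvec_iter types: retry_by_page(), read_fn and toy_read()
are hypothetical names, and a page is assumed to be 8 sectors of 512
bytes. The point it illustrates is that the caller walks the failed
range in page-sized steps by advancing a (sector, length) pair, rather
than editing the buffer list of the original request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size in sectors: 4K page / 512-byte sectors. */
#define PAGE_SECTORS 8u

/* Hypothetical read callback: returns true if the read succeeded. */
typedef bool (*read_fn)(uint64_t sector, uint32_t nr_sectors, void *ctx);

/*
 * Re-read [start, start + nr_sectors) one page at a time and return the
 * number of page-sized chunks that still fail.
 */
static size_t retry_by_page(uint64_t start, uint32_t nr_sectors,
			    read_fn do_read, void *ctx)
{
	size_t failed = 0;

	assert(do_read != NULL);
	while (nr_sectors > 0) {
		uint32_t chunk = nr_sectors < PAGE_SECTORS ? nr_sectors : PAGE_SECTORS;

		if (!do_read(start, chunk, ctx))
			failed++;
		/* Advance the position instead of touching the original request. */
		start += chunk;
		nr_sectors -= chunk;
	}
	return failed;
}

/* Toy "device" for demonstration: a read fails iff it covers sector *ctx. */
static bool toy_read(uint64_t sector, uint32_t nr_sectors, void *ctx)
{
	uint64_t bad = *(const uint64_t *)ctx;

	return !(bad >= sector && bad < sector + nr_sectors);
}
```

With a single bad sector at 1003, a 64-sector read starting at sector
1000 fails as a whole, but retry_by_page(1000, 64, toy_read, &bad)
reports only one failing page out of eight.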

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14  9:43 UTC (permalink / raw)
  Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <20161114085151.GA8405@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
>> I would suggest adding a "bi_dev_private" field to the bio which is for
>> use by the lowest-level driver (much as bi_private is for use by the
>> top-level initiator).
>> That could be in a union with any or all of:
>> 	unsigned int		bi_phys_segments;
>> 	unsigned int		bi_seg_front_size;
>> 	unsigned int		bi_seg_back_size;
>> 
>> (any driver that needs those, would see a 'request' rather than a 'bio'
>> and so could use rq->special)
>> 
>> raid5.c could then use bi_dev_private (or bi_special, or whatever it is called).
>
> All the three above fields are those that could go away with a full
> implementation of the multipage bvec scheme.  So any field for driver
> use would still be overhead.  If it's just for raid5 it could
> be a smaller 16 bit (or maybe even just 8 bit) one.

We currently store 2 counters in that field, and before
commit 5b99c2ffa980528a197f26 one of the fields was only 8 bits,
and that caused problems.
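
Storing two counters in one 32-bit field can be sketched as below. This
is an illustration under assumptions, not the raid5 code: the 16/16
split, the non-atomic updates, and the names packed_active(),
packed_processed(), packed_inc_active() and packed_set_processed() are
all hypothetical. It shows why the width of each sub-field matters: a
count that does not fit in its sub-field is silently truncated unless it
is checked.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 16 bits = active count, high 16 bits = processed count. */
#define COUNTER_BITS 16u
#define COUNTER_MASK ((1u << COUNTER_BITS) - 1u)

/* Extract the low (active) counter. */
static inline uint32_t packed_active(uint32_t field)
{
	return field & COUNTER_MASK;
}

/* Extract the high (processed) counter. */
static inline uint32_t packed_processed(uint32_t field)
{
	return (field >> COUNTER_BITS) & COUNTER_MASK;
}

/* Increment the active counter, refusing to carry into the high half. */
static inline uint32_t packed_inc_active(uint32_t field)
{
	assert(packed_active(field) < COUNTER_MASK);
	return field + 1u;
}

/* Replace the processed counter, keeping the active counter unchanged. */
static inline uint32_t packed_set_processed(uint32_t field, uint32_t cnt)
{
	assert(cnt <= COUNTER_MASK);
	return (field & COUNTER_MASK) | (cnt << COUNTER_BITS);
}
```

A processed count of 300 fits in a 16-bit sub-field, but an 8-bit
sub-field would keep only 300 & 0xff, which is 44.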

We could possibly use __bi_remaining in place of
raid5_X_bi_active_stripes().  It wouldn't be a completely
straightforward conversion, but I think it could be made to work.

We *might* be able to use bvec_iter_advance() in place of
raid5_bi_processed_stripes().  A careful audit of the code would be
needed to be certain.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14  9:51 UTC (permalink / raw)
  Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <20161114085720.GB8405@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

On Mon, Nov 14 2016, Christoph Hellwig wrote:

> On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
>> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
>> > confusing, and I'm not 100% sure it's correct.  After all we check it
>> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
>> > on these callbacks being done after the flag has been raised / cleared,
>> > which makes me a bit suspicious, and also question why we even need the
>> > mempool.
>> 
>> MD_RECOVERY_REQUESTED is only set or cleared when no recovery is running.
>> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
>> races there.
>> The r1buf_pool mempool is created at the start of resync, so at that
>> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
>> after the mempool is freed.
>> 
>> To perform a resync we need a pool of memory buffers.  We don't want to
>> have to cope with kmalloc failing, but are quite able to cope with
>> mempool_alloc() blocking.
>> We probably don't need nearly as many bufs as we allocate (4 is probably
>> plenty), but having a pool is certainly convenient.
>
> Would it be good to create/delete the pool explicitly through methods
> to start/end the sync?  Right now the behavior looks very, very
> confusing.

Maybe.  It is created the first time ->sync_request is called,
and destroyed when it is called with a sector_nr at-or-beyond the end of
the device.  I guess some of that could be made a bit more obvious.
I'm not strongly against adding new methods for "start_sync" and "stop_sync"
but I don't see that it is really needed.

>
>> The "bigger bio" might cover a large number of sectors.  If there are
>> media errors, there might be only one sector that is bad.  So we repeat
>> the read with finer granularity (pages in the current code, though
>> device block would be ideal) and only record bad blocks for individual
>> pages which are bad and cannot be fixed.
>
> I have no problems with the behavior - the point is that these days
> this should be done without poking into the bio internals, but by using
> a bio iterator for just the range you want to re-read.  Potentially
> using a bio clone if we can't reuse the existing bio, although I'm
> not sure we even need that from looking at the code.

Fair enough.  The code predates bio iterators and "if it ain't broke,
don't fix it".  If it is now causing problems, then maybe it is now
"broke" and should be "fixed".

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14 15:52 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <5828D5DA.1070406@youngman.org.uk>

On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
> Sounds like that drive could need replacing. I'd get a new drive and do
> that as soon as possible - use the --replace option of mdadm - don't
> fail the old drive and add the new.

Would you mind explaining why I should use --replace instead of taking
out the suspect drive? I guess I lose redundancy for any writes that
occur while the rebuild is happening, but I'd plan to do this with the
filesystem unmounted so there wouldn't be any writes.

What I'd quite like to do is treat the "good" drive as the source for
all the data (unless it turns out to have bad sectors too...), even if
the other drive doesn't have a read error, which I'd achieve by
failing the "bad" drive. I want to do this because after doing one
scrub and starting on a second, I'm still seeing non-zero
mismatch_cnt. Does --replace do anything clever about using the old
drive only when it must, or does it just read from the whole array
like normal, which might mean taking the mismatched data from the bad
drive?

Thanks
Bruce
-- 
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Wols Lists @ 2016-11-14 15:58 UTC (permalink / raw)
  To: Bruce Merry; +Cc: linux-raid
In-Reply-To: <CAHy4j_7aC+DCqMRkmK12HPP-wY5kAmLf7W3UG_Nn=TK7ry7ARQ@mail.gmail.com>

On 14/11/16 15:52, Bruce Merry wrote:
> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>> Sounds like that drive could need replacing. I'd get a new drive and do
>> that as soon as possible - use the --replace option of mdadm - don't
>> fail the old drive and add the new.
> Would you mind explaining why I should use --replace instead of taking
> out the suspect drive? I guess I lose redundancy for any writes that
> occur while the rebuild is happening, but I'd plan to do this with the
> filesystem unmounted so there wouldn't be any writes.

Because a replace will copy from the old drive to the new, recovering
any failures from the rest of the array. A fail-and-add will have to
rebuild the entire new array from what's left of the old, stressing the
old array much more.

Okay, in your case, it probably won't make an awful lot of difference,
but it does make you vulnerable to problems on the "good" drive. To
alter your wording slightly, you lose redundancy for writes AND READS
that occur while the array is rebuilding. It's just good practice (and I
point it out because --replace is new and not well known at present).
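
To make the difference concrete, here is a small Python model of the
two procedures as described above. It is only a sketch of that
description, not of how md actually picks its read source: the
function names are made up, and each drive is reduced to a set of
block numbers that cannot be read.

```python
def rebuild_by_replace(old_unreadable, mirror_unreadable, nblocks):
    """Copy from the old drive; use the other mirror only on a read error.

    Returns (reads that had to go to the other mirror, lost blocks).
    """
    mirror_reads = 0
    lost = []
    for block in range(nblocks):
        if block not in old_unreadable:
            continue  # copied straight from the drive being replaced
        mirror_reads += 1
        if block in mirror_unreadable:
            lost.append(block)  # unreadable on both drives
    return mirror_reads, lost


def rebuild_by_fail_and_add(mirror_unreadable, nblocks):
    """Old drive already failed out: every block comes from the other mirror.

    Returns (reads that had to go to the other mirror, lost blocks).
    """
    lost = [block for block in range(nblocks) if block in mirror_unreadable]
    return nblocks, lost
```

In this model a block is lost during a replace only if both drives fail
to read the same block, whereas after a fail-and-add any unreadable
block on the remaining drive is lost, and that drive also serves every
read of the rebuild.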

Cheers,
Wol

^ permalink raw reply

* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-14 16:01 UTC (permalink / raw)
  To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <CAHy4j_6q2mmQ_y-xkh1Zx6CrP1vByQMadCXLJwibeeO85T3JgQ@mail.gmail.com>

Hi Bruce,

On 11/14/2016 01:50 AM, Bruce Merry wrote:

> Okay, I'll give that script a go to increase my kernel timeout. If I
> understand correctly, it's not the end of the world if the drive
> doesn't support SCTERC, provided I have a long kernel timeout (and
> when things go wrong it might take much longer to recover, but I can
> live with that). Is that correct?

Yes.
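
For reference, raising the kernel timeout comes down to writing a
number of seconds into the drive's sysfs timeout attribute. A minimal
Python sketch follows; the function name and the sysfs_root parameter
are made up for illustration, 180 seconds is an assumed value rather
than one taken from this thread, and it needs root to touch the real
/sys.

```python
from pathlib import Path


def set_kernel_timeout(device, seconds=180, sysfs_root="/sys"):
    """Write the SCSI command timeout for one drive and read it back.

    device is a bare name such as "sda" (hypothetical example name).
    sysfs_root exists only so the function can be tried on a scratch
    directory instead of the real /sys.
    """
    path = Path(sysfs_root) / "block" / device / "device" / "timeout"
    path.write_text(f"{seconds}\n")
    return int(path.read_text())
```

The setting does not survive a reboot, so it has to be reapplied at
boot for each member drive that lacks SCTERC.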

>> 2) Trim.  Well-behaved drive firmware guarantees zeros for trimmed
>> sectors, but many drives return random data instead.  Zing, mismatches.
>> It's often unhelpful with encrypted volumes, as even well-behaved
>> firmware can't deliver zeroed sectors *inside* the encryption.
> 
> Weee, sounds like fun. I hope it's that. Is there any way to tell
> which blocks mismatch, so that I can tell which files are in trouble
> (assuming I can figure out how to map through LVM, LUKS and
> debugfs)?

The check operation doesn't log the sector addresses, unfortunately.  At
least I don't see any such operation in the code that increments
mismatch count.  Not even a tracepoint.  Hmmm.

In the meantime, run a "repair" scrub instead of a "check" scrub to
affirmatively force no mismatches.  (Writes first member of mirrors to
the others.)
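
A scrub is started by writing "check" or "repair" to the array's
sync_action attribute, and the result shows up in mismatch_cnt. A
small Python sketch under those assumptions is below; the helper names
and the sysfs_root parameter are invented for the example, "md0" would
be a placeholder array name, and root is needed on a real system.

```python
from pathlib import Path


def md_attr(array, name, sysfs_root="/sys"):
    """Path of an md sysfs attribute, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / "block" / array / "md" / name


def start_scrub(array, action="repair", sysfs_root="/sys"):
    """Request a scrub; "check" only counts mismatches, "repair" rewrites."""
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported scrub action: {action}")
    md_attr(array, "sync_action", sysfs_root).write_text(action + "\n")


def mismatch_count(array, sysfs_root="/sys"):
    """Read the mismatch counter reported by the most recent scrub."""
    return int(md_attr(array, "mismatch_cnt", sysfs_root).read_text())
```

As far as I know, the counter read right after a repair reflects what
that pass found, so a follow-up "check" is the way to confirm that it
has dropped to zero.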

Phil

^ permalink raw reply
