* Ddf based RAID management software
From: Arka Sharma @ 2016-11-12 16:33 UTC
To: linux-raid
Hello All,
Is there any tool, apart from mdadm, that can create software RAID
based on DDF metadata? We want to dump the metadata content and
compare it with the metadata written by mdadm and by our application.
Regards,
Arka
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-12 17:42 UTC
To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block, neilb
In-Reply-To: <20161111190223.4xrq3vvvvohzgs5e@kernel.org>
On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
> > things with the bi_io_vec and bi_vcnt fields, which we'd prefer that
> > drivers don't touch. One example is the r1buf_pool_alloc code,
> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
> > case, which would also take care of r1buf_pool_free. I'm not sure
> > about all the other cases, as some bits don't fully make sense to me,
>
> The problem is we use the iov_vec to track the pages allocated. We will read
> data to the pages and write out later for resync. If we add new fields to track
> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
> avoid the tricky parts. This should work for both the resync and writebehind
> cases.
I don't think we need to track the pages specifically - if we clone
a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED case
we do one bio_kmalloc, then bio_alloc_pages, then clone it for the
other bios. For MD_RECOVERY_REQUESTED we do a bio_kmalloc +
bio_alloc_pages for each.
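A toy user-space model of that allocation scheme, in Python rather than
kernel C, may make the sharing explicit. The Page and Bio classes and
alloc_resync_bios() are hypothetical stand-ins, not kernel APIs; only the
sharing pattern is taken from the description above.

```python
class Page:
    """Stand-in for one allocated page (hypothetical, not a kernel type)."""


class Bio:
    """Stand-in for a bio: it only holds a reference to a bvec table."""

    def __init__(self, io_vec):
        self.io_vec = io_vec


def alloc_resync_bios(n_bios, n_pages, recovery_requested):
    """Return n_bios toy bios following the scheme sketched above."""
    if recovery_requested:
        # MD_RECOVERY_REQUESTED: every bio gets its own freshly allocated pages.
        return [Bio([Page() for _ in range(n_pages)]) for _ in range(n_bios)]
    # Otherwise: allocate pages once, then "clone" so all bios share one table.
    first = Bio([Page() for _ in range(n_pages)])
    return [first] + [Bio(first.io_vec) for _ in range(n_bios - 1)]
```

In the shared case only one set of pages exists, so nothing beyond the
first bio would need to be tracked in order to free them.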
While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
confusing, and I'm not 100% sure it's correct. After all we check it
in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
on these callbacks being done after the flag has been raised / cleared,
which makes me a bit suspicious, and also question why we even need the
mempool.
>
> > e.g. why we're trying to do single page I/O out of a bigger bio.
>
> what's this one?
fix_sync_read_error
* Re: [PATCH 01/12] block: bio: pass bvec table to bio_init()
From: Christoph Hellwig @ 2016-11-12 17:59 UTC
To: Ming Lei
Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
Christoph Hellwig, Jens Axboe, Jiri Kosina, Kent Overstreet,
Shaohua Li, Alasdair Kergon, Mike Snitzer,
maintainer:DEVICE-MAPPER (LVM), Christoph Hellwig, Sagi Grimberg,
Joern Engel, Prasad Joshi, Mike Christie, Hannes Reinecke,
Rasmus Villemoes, Johannes Thumshirn, Guoqing Jiang, Eric
In-Reply-To: <1478865957-25252-2-git-send-email-tom.leiming@gmail.com>
On Fri, Nov 11, 2016 at 08:05:29PM +0800, Ming Lei wrote:
> Some drivers often use an external bvec table, so introduce
> this helper for this case. It is always safe to access the
> bio->bi_io_vec in this way for this case.
>
> After converting to this usage, it will become a bit easier
> to evaluate the remaining direct access to bio->bi_io_vec,
> so it can help to prepare for the following multipage bvec
> support.
>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
> block/bio.c | 8 ++++++--
> drivers/block/floppy.c | 3 +--
> drivers/md/bcache/io.c | 4 +---
> drivers/md/bcache/journal.c | 4 +---
> drivers/md/bcache/movinggc.c | 6 ++----
> drivers/md/bcache/request.c | 2 +-
> drivers/md/bcache/super.c | 12 +++---------
> drivers/md/bcache/writeback.c | 5 ++---
> drivers/md/dm-bufio.c | 4 +---
> drivers/md/dm.c | 2 +-
> drivers/md/multipath.c | 2 +-
> drivers/md/raid5-cache.c | 2 +-
> drivers/md/raid5.c | 9 ++-------
> drivers/nvme/target/io-cmd.c | 4 +---
> fs/logfs/dev_bdev.c | 4 +---
> include/linux/bio.h | 3 ++-
> 16 files changed, 27 insertions(+), 47 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 2cf6ebabc68c..de257ced69b1 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -270,11 +270,15 @@ static void bio_free(struct bio *bio)
> }
> }
>
> -void bio_init(struct bio *bio)
> +void bio_init(struct bio *bio, struct bio_vec *table,
> + unsigned short max_vecs)
> {
> memset(bio, 0, sizeof(*bio));
> atomic_set(&bio->__bi_remaining, 1);
> atomic_set(&bio->__bi_cnt, 1);
> +
> + bio->bi_io_vec = table;
> + bio->bi_max_vecs = max_vecs;
> }
> EXPORT_SYMBOL(bio_init);
>
> @@ -480,7 +484,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> return NULL;
>
> bio = p + front_pad;
> - bio_init(bio);
> + bio_init(bio, NULL, 0);
>
> if (nr_iovecs > inline_vecs) {
> unsigned long idx = 0;
> diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
> index e3d8e4ced4a2..6a3ff2b2e3ae 100644
> --- a/drivers/block/floppy.c
> +++ b/drivers/block/floppy.c
> @@ -3806,8 +3806,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
>
> cbdata.drive = drive;
>
> - bio_init(&bio);
> - bio.bi_io_vec = &bio_vec;
> + bio_init(&bio, &bio_vec, 1);
> bio_vec.bv_page = page;
> bio_vec.bv_len = size;
> bio_vec.bv_offset = 0;
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index e97b0acf7b8d..db45a88c0ce9 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -24,9 +24,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
> struct bbio *b = mempool_alloc(c->bio_meta, GFP_NOIO);
> struct bio *bio = &b->bio;
>
> - bio_init(bio);
> - bio->bi_max_vecs = bucket_pages(c);
> - bio->bi_io_vec = bio->bi_inline_vecs;
> + bio_init(bio, bio->bi_inline_vecs, bucket_pages(c));
>
> return bio;
> }
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 6925023e12d4..1198e53d5670 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -448,13 +448,11 @@ static void do_journal_discard(struct cache *ca)
>
> atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
>
> - bio_init(bio);
> + bio_init(bio, bio->bi_inline_vecs, 1);
> bio_set_op_attrs(bio, REQ_OP_DISCARD, 0);
> bio->bi_iter.bi_sector = bucket_to_sector(ca->set,
> ca->sb.d[ja->discard_idx]);
> bio->bi_bdev = ca->bdev;
> - bio->bi_max_vecs = 1;
> - bio->bi_io_vec = bio->bi_inline_vecs;
> bio->bi_iter.bi_size = bucket_bytes(ca);
> bio->bi_end_io = journal_discard_endio;
>
> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
> index 5c4bddecfaf0..13b8a907006d 100644
> --- a/drivers/md/bcache/movinggc.c
> +++ b/drivers/md/bcache/movinggc.c
> @@ -77,15 +77,13 @@ static void moving_init(struct moving_io *io)
> {
> struct bio *bio = &io->bio.bio;
>
> - bio_init(bio);
> + bio_init(bio, bio->bi_inline_vecs,
> + DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS));
> bio_get(bio);
> bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
>
> bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
> - bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&io->w->key),
> - PAGE_SECTORS);
> bio->bi_private = &io->cl;
> - bio->bi_io_vec = bio->bi_inline_vecs;
> bch_bio_map(bio, NULL);
> }
>
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 0d99b5f4b3e6..f49c5417527d 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -623,7 +623,7 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
> {
> struct bio *bio = &s->bio.bio;
>
> - bio_init(bio);
> + bio_init(bio, NULL, 0);
> __bio_clone_fast(bio, orig_bio);
We have this pattern multiple times, and it almost screams for a helper.
But I think we're better off letting your patch go in as-is and sort
that out later instead of delaying it.
Otherwise this looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-13 18:46 UTC
To: linux-raid
Hi
I'm running software RAID1 across two drives in my home machine (LVM
on LUKS on RAID1). I've just installed smartmontools and run short
tests, and promptly received emails to tell me that one of the drives
has 4 offline uncorrectable sectors and 3 current pending sectors.
I've attached smartctl --xall output for sda (good) and sdb (bad).
These drives are pretty old (over 5 years) so I'm going to replace
them as soon as I have time (and yes, I have backups), but in the
meantime I'd like advice on:
1. What exactly this means. My understanding is that some data has
been lost (or may have been lost) on the drive, but the drive still
has spare sectors to remap things once the failed sectors are written
to. Is that correct?
2. How can I tell which sectors are problematic? If it's in the swap
partition I'm far less worried. I can see two LBAs for offline
uncorrectable errors in the --xall output, but that still leaves
another two at large.
3. Assuming my understanding is correct, and that the sector falls
within the RAID1 partition on the drive, is there some way I can
recover the sectors from the other drive in the RAID1? As a last
resort I imagine I could wipe the suspect drive and then rebuild it
from the good one, but I'm hoping there's something less risky I can
do.
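For question 2, one way to check whether a reported LBA falls inside a
given partition is to compare it against each partition's start and
length in 512-byte sectors (on Linux these can be read from
/sys/block/sdb/sdbN/start and /sys/block/sdb/sdbN/size). A minimal Python
sketch, assuming the layout has already been collected; find_partition()
is a hypothetical helper, not part of any tool.

```python
def find_partition(lba, partitions):
    """Return (name, offset) for the partition containing lba, else None.

    partitions: iterable of (name, start_sector, sector_count) tuples,
    all in 512-byte logical sectors, the same unit smartctl reports.
    """
    for name, start, count in partitions:
        if start <= lba < start + count:
            # Offset is relative to the start of the partition.
            return name, lba - start
    return None
```

With a made-up layout such as [("sdb1", 2048, 16777216), ("sdb2",
16779264, 3890249870)], find_partition(632489304, layout) returns
("sdb2", 615710040). That offset is into the partition only; the position
inside the md array would additionally depend on the array's data offset.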
Thanks in advance
Bruce
--
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/
[-- Attachment #2: sda.txt --]
[-- Type: text/plain, Size: 16559 bytes --]
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-47-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WCAZA9626479
LU WWN Device Id: 5 0014ee 2b0e3fa4c
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Nov 13 20:30:04 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (36780) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 355) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 172 162 021 - 6400
4 Start_Stop_Count -O--CK 097 097 000 - 3688
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 081 081 000 - 13891
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 097 097 000 - 3683
192 Power-Off_Retract_Count -O--CK 200 200 000 - 50
193 Load_Cycle_Count -O--CK 001 001 000 - 912124
194 Temperature_Celsius -O---K 120 109 000 - 30
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 13888 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 30 Celsius
Power Cycle Min/Max Temperature: 22/30 Celsius
Lifetime Min/Max Temperature: 22/41 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (462)
Index Estimated Time Temperature Celsius
463 2016-11-13 12:33 31 ************
... ..( 16 skipped). .. ************
2 2016-11-13 12:50 31 ************
3 2016-11-13 12:51 30 ***********
4 2016-11-13 12:52 30 ***********
5 2016-11-13 12:53 30 ***********
6 2016-11-13 12:54 31 ************
7 2016-11-13 12:55 30 ***********
... ..( 21 skipped). .. ***********
29 2016-11-13 13:17 30 ***********
30 2016-11-13 13:18 31 ************
... ..( 14 skipped). .. ************
45 2016-11-13 13:33 31 ************
46 2016-11-13 13:34 ? -
47 2016-11-13 13:35 22 ***
48 2016-11-13 13:36 22 ***
49 2016-11-13 13:37 22 ***
50 2016-11-13 13:38 23 ****
51 2016-11-13 13:39 23 ****
52 2016-11-13 13:40 24 *****
53 2016-11-13 13:41 24 *****
54 2016-11-13 13:42 25 ******
55 2016-11-13 13:43 25 ******
56 2016-11-13 13:44 26 *******
... ..( 2 skipped). .. *******
59 2016-11-13 13:47 26 *******
60 2016-11-13 13:48 27 ********
... ..( 4 skipped). .. ********
65 2016-11-13 13:53 27 ********
66 2016-11-13 13:54 28 *********
... ..( 4 skipped). .. *********
71 2016-11-13 13:59 28 *********
72 2016-11-13 14:00 29 **********
... ..( 19 skipped). .. **********
92 2016-11-13 14:20 29 **********
93 2016-11-13 14:21 30 ***********
... ..( 12 skipped). .. ***********
106 2016-11-13 14:34 30 ***********
107 2016-11-13 14:35 31 ************
... ..( 2 skipped). .. ************
110 2016-11-13 14:38 31 ************
111 2016-11-13 14:39 32 *************
112 2016-11-13 14:40 31 ************
... ..( 18 skipped). .. ************
131 2016-11-13 14:59 31 ************
132 2016-11-13 15:00 32 *************
133 2016-11-13 15:01 32 *************
134 2016-11-13 15:02 31 ************
135 2016-11-13 15:03 32 *************
136 2016-11-13 15:04 31 ************
... ..( 10 skipped). .. ************
147 2016-11-13 15:15 31 ************
148 2016-11-13 15:16 32 *************
149 2016-11-13 15:17 31 ************
150 2016-11-13 15:18 31 ************
151 2016-11-13 15:19 32 *************
152 2016-11-13 15:20 31 ************
... ..( 10 skipped). .. ************
163 2016-11-13 15:31 31 ************
164 2016-11-13 15:32 ? -
165 2016-11-13 15:33 20 *
166 2016-11-13 15:34 21 **
167 2016-11-13 15:35 21 **
168 2016-11-13 15:36 21 **
169 2016-11-13 15:37 22 ***
170 2016-11-13 15:38 22 ***
171 2016-11-13 15:39 22 ***
172 2016-11-13 15:40 23 ****
173 2016-11-13 15:41 24 *****
174 2016-11-13 15:42 24 *****
175 2016-11-13 15:43 24 *****
176 2016-11-13 15:44 25 ******
177 2016-11-13 15:45 25 ******
178 2016-11-13 15:46 25 ******
179 2016-11-13 15:47 26 *******
... ..( 2 skipped). .. *******
182 2016-11-13 15:50 26 *******
183 2016-11-13 15:51 27 ********
... ..( 7 skipped). .. ********
191 2016-11-13 15:59 27 ********
192 2016-11-13 16:00 28 *********
... ..( 4 skipped). .. *********
197 2016-11-13 16:05 28 *********
198 2016-11-13 16:06 29 **********
... ..( 13 skipped). .. **********
212 2016-11-13 16:20 29 **********
213 2016-11-13 16:21 30 ***********
... ..( 5 skipped). .. ***********
219 2016-11-13 16:27 30 ***********
220 2016-11-13 16:28 31 ************
221 2016-11-13 16:29 31 ************
222 2016-11-13 16:30 31 ************
223 2016-11-13 16:31 30 ***********
224 2016-11-13 16:32 30 ***********
225 2016-11-13 16:33 31 ************
... ..( 2 skipped). .. ************
228 2016-11-13 16:36 31 ************
229 2016-11-13 16:37 30 ***********
... ..( 5 skipped). .. ***********
235 2016-11-13 16:43 30 ***********
236 2016-11-13 16:44 31 ************
237 2016-11-13 16:45 30 ***********
... ..( 8 skipped). .. ***********
246 2016-11-13 16:54 30 ***********
247 2016-11-13 16:55 31 ************
248 2016-11-13 16:56 30 ***********
... ..( 9 skipped). .. ***********
258 2016-11-13 17:06 30 ***********
259 2016-11-13 17:07 31 ************
260 2016-11-13 17:08 30 ***********
... ..( 8 skipped). .. ***********
269 2016-11-13 17:17 30 ***********
270 2016-11-13 17:18 31 ************
271 2016-11-13 17:19 31 ************
272 2016-11-13 17:20 31 ************
273 2016-11-13 17:21 30 ***********
274 2016-11-13 17:22 30 ***********
275 2016-11-13 17:23 31 ************
276 2016-11-13 17:24 31 ************
277 2016-11-13 17:25 30 ***********
... ..( 7 skipped). .. ***********
285 2016-11-13 17:33 30 ***********
286 2016-11-13 17:34 31 ************
... ..( 17 skipped). .. ************
304 2016-11-13 17:52 31 ************
305 2016-11-13 17:53 30 ***********
306 2016-11-13 17:54 31 ************
307 2016-11-13 17:55 30 ***********
... ..( 5 skipped). .. ***********
313 2016-11-13 18:01 30 ***********
314 2016-11-13 18:02 31 ************
315 2016-11-13 18:03 31 ************
316 2016-11-13 18:04 30 ***********
... ..( 3 skipped). .. ***********
320 2016-11-13 18:08 30 ***********
321 2016-11-13 18:09 31 ************
... ..( 11 skipped). .. ************
333 2016-11-13 18:21 31 ************
334 2016-11-13 18:22 32 *************
335 2016-11-13 18:23 31 ************
336 2016-11-13 18:24 31 ************
337 2016-11-13 18:25 32 *************
338 2016-11-13 18:26 31 ************
... ..( 11 skipped). .. ************
350 2016-11-13 18:38 31 ************
351 2016-11-13 18:39 32 *************
352 2016-11-13 18:40 31 ************
... ..( 5 skipped). .. ************
358 2016-11-13 18:46 31 ************
359 2016-11-13 18:47 32 *************
360 2016-11-13 18:48 32 *************
361 2016-11-13 18:49 32 *************
362 2016-11-13 18:50 31 ************
... ..( 14 skipped). .. ************
377 2016-11-13 19:05 31 ************
378 2016-11-13 19:06 30 ***********
379 2016-11-13 19:07 31 ************
380 2016-11-13 19:08 30 ***********
... ..( 4 skipped). .. ***********
385 2016-11-13 19:13 30 ***********
386 2016-11-13 19:14 31 ************
387 2016-11-13 19:15 31 ************
388 2016-11-13 19:16 30 ***********
... ..( 4 skipped). .. ***********
393 2016-11-13 19:21 30 ***********
394 2016-11-13 19:22 31 ************
395 2016-11-13 19:23 30 ***********
396 2016-11-13 19:24 31 ************
... ..( 10 skipped). .. ************
407 2016-11-13 19:35 31 ************
408 2016-11-13 19:36 32 *************
409 2016-11-13 19:37 ? -
410 2016-11-13 19:38 32 *************
411 2016-11-13 19:39 31 ************
... ..( 2 skipped). .. ************
414 2016-11-13 19:42 31 ************
415 2016-11-13 19:43 32 *************
416 2016-11-13 19:44 31 ************
417 2016-11-13 19:45 32 *************
418 2016-11-13 19:46 31 ************
419 2016-11-13 19:47 32 *************
420 2016-11-13 19:48 31 ************
421 2016-11-13 19:49 32 *************
422 2016-11-13 19:50 31 ************
423 2016-11-13 19:51 31 ************
424 2016-11-13 19:52 31 ************
425 2016-11-13 19:53 32 *************
426 2016-11-13 19:54 31 ************
... ..( 4 skipped). .. ************
431 2016-11-13 19:59 31 ************
432 2016-11-13 20:00 32 *************
433 2016-11-13 20:01 32 *************
434 2016-11-13 20:02 31 ************
... ..( 4 skipped). .. ************
439 2016-11-13 20:07 31 ************
440 2016-11-13 20:08 32 *************
441 2016-11-13 20:09 31 ************
442 2016-11-13 20:10 32 *************
443 2016-11-13 20:11 31 ************
... ..( 2 skipped). .. ************
446 2016-11-13 20:14 31 ************
447 2016-11-13 20:15 32 *************
448 2016-11-13 20:16 31 ************
... ..( 5 skipped). .. ************
454 2016-11-13 20:22 31 ************
455 2016-11-13 20:23 32 *************
456 2016-11-13 20:24 31 ************
457 2016-11-13 20:25 32 *************
458 2016-11-13 20:26 32 *************
459 2016-11-13 20:27 31 ************
... ..( 2 skipped). .. ************
462 2016-11-13 20:30 31 ************
SCT Error Recovery Control command not supported
Device Statistics (GP Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 3 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 3578 Vendor specific
[-- Attachment #3: sdb.txt --]
[-- Type: text/plain, Size: 17448 bytes --]
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-47-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WCAZA9552721
LU WWN Device Id: 5 0014ee 2b0e3f07a
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Nov 13 20:30:01 2016 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (37260) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 360) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 168 162 021 - 6566
4 Start_Stop_Count -O--CK 097 097 000 - 3748
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 081 081 000 - 13883
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 097 097 000 - 3683
192 Power-Off_Retract_Count -O--CK 200 200 000 - 49
193 Load_Cycle_Count -O--CK 001 001 000 - 837570
194 Temperature_Celsius -O---K 119 108 000 - 31
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 001 000 - 3
198 Offline_Uncorrectable ----CK 200 200 000 - 4
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 3
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb7 GPL,SL VS 1 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 [1] occurred at disk power-on lifetime: 9505 hours (396 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 25 b3 05 58 40 00 Error: UNC at LBA = 0x25b30558 = 632489304
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 80 00 48 00 00 25 b3 0b 80 40 08 17:33:58.845 READ FPDMA QUEUED
60 00 80 00 50 00 00 25 b3 0b 00 40 08 17:33:58.845 READ FPDMA QUEUED
60 00 80 00 58 00 00 25 b3 0a 80 40 08 17:33:58.844 READ FPDMA QUEUED
60 00 80 00 60 00 00 25 b3 0a 00 40 08 17:33:58.844 READ FPDMA QUEUED
60 00 80 00 80 00 00 25 b3 09 80 40 08 17:33:58.832 READ FPDMA QUEUED
Error 1 [0] occurred at disk power-on lifetime: 9505 hours (396 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 fe 00 00 25 b2 fa 58 40 00 Error: UNC at LBA = 0x25b2fa58 = 632486488
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 80 00 d8 00 00 25 b2 ff 00 40 08 17:33:56.041 READ FPDMA QUEUED
60 00 80 00 d0 00 00 25 b2 fe 80 40 08 17:33:56.041 READ FPDMA QUEUED
60 00 80 00 c8 00 00 25 b2 fe 00 40 08 17:33:56.041 READ FPDMA QUEUED
60 00 80 00 c0 00 00 25 b2 fd 80 40 08 17:33:56.041 READ FPDMA QUEUED
60 00 80 00 b8 00 00 25 b2 fd 00 40 08 17:33:56.040 READ FPDMA QUEUED
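The LBA printed at the end of each register line can be re-derived from
the six address bytes (the three LBA_48 bytes followed by LH, LM and LL),
most significant first. The Python sketch below does that and also
reports which 4096-byte physical sector an LBA belongs to. It assumes 8
logical sectors per physical sector, as listed under Sector Sizes, and
that physical sectors are aligned to LBA 0; both function names are
hypothetical.

```python
LOGICAL_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per physical sector


def lba_from_registers(hex_bytes):
    """Combine six hex byte strings (LBA_48 x3, LH, LM, LL) into an LBA."""
    if len(hex_bytes) != 6:
        raise ValueError("expected six address bytes")
    lba = 0
    for byte in hex_bytes:
        # Shift in one byte at a time, most significant byte first.
        lba = (lba << 8) | int(byte, 16)
    return lba


def physical_sector_span(lba):
    """Return (first_lba, last_lba) of the physical sector holding lba."""
    first = lba - lba % LOGICAL_PER_PHYSICAL
    return first, first + LOGICAL_PER_PHYSICAL - 1
```

For the two errors above, the address bytes 00 00 25 b3 05 58 and
00 00 25 b2 fa 58 give 632489304 and 632486488, matching the values
smartctl prints.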
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 13880 -
# 2 Short offline Aborted by host 10% 13880 -
# 3 Short offline Interrupted (host reset) 10% 13880 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 31 Celsius
Power Cycle Min/Max Temperature: 22/31 Celsius
Lifetime Min/Max Temperature: 21/42 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (433)
Index Estimated Time Temperature Celsius
434 2016-11-13 12:33 32 *************
... ..( 15 skipped). .. *************
450 2016-11-13 12:49 32 *************
451 2016-11-13 12:50 31 ************
452 2016-11-13 12:51 32 *************
... ..( 2 skipped). .. *************
455 2016-11-13 12:54 32 *************
456 2016-11-13 12:55 31 ************
... ..( 16 skipped). .. ************
473 2016-11-13 13:12 31 ************
474 2016-11-13 13:13 32 *************
475 2016-11-13 13:14 32 *************
476 2016-11-13 13:15 32 *************
477 2016-11-13 13:16 31 ************
0 2016-11-13 13:17 31 ************
1 2016-11-13 13:18 32 *************
... ..( 14 skipped). .. *************
16 2016-11-13 13:33 32 *************
17 2016-11-13 13:34 ? -
18 2016-11-13 13:35 22 ***
19 2016-11-13 13:36 22 ***
20 2016-11-13 13:37 22 ***
21 2016-11-13 13:38 23 ****
22 2016-11-13 13:39 24 *****
23 2016-11-13 13:40 24 *****
24 2016-11-13 13:41 25 ******
25 2016-11-13 13:42 25 ******
26 2016-11-13 13:43 25 ******
27 2016-11-13 13:44 26 *******
28 2016-11-13 13:45 26 *******
29 2016-11-13 13:46 27 ********
... ..( 5 skipped). .. ********
35 2016-11-13 13:52 27 ********
36 2016-11-13 13:53 28 *********
37 2016-11-13 13:54 28 *********
38 2016-11-13 13:55 29 **********
39 2016-11-13 13:56 28 *********
40 2016-11-13 13:57 29 **********
... ..( 11 skipped). .. **********
52 2016-11-13 14:09 29 **********
53 2016-11-13 14:10 30 ***********
... ..( 11 skipped). .. ***********
65 2016-11-13 14:22 30 ***********
66 2016-11-13 14:23 31 ************
... ..( 10 skipped). .. ************
77 2016-11-13 14:34 31 ************
78 2016-11-13 14:35 32 *************
... ..( 26 skipped). .. *************
105 2016-11-13 15:02 32 *************
106 2016-11-13 15:03 33 **************
107 2016-11-13 15:04 32 *************
... ..( 10 skipped). .. *************
118 2016-11-13 15:15 32 *************
119 2016-11-13 15:16 33 **************
120 2016-11-13 15:17 32 *************
121 2016-11-13 15:18 32 *************
122 2016-11-13 15:19 33 **************
123 2016-11-13 15:20 32 *************
... ..( 2 skipped). .. *************
126 2016-11-13 15:23 32 *************
127 2016-11-13 15:24 33 **************
128 2016-11-13 15:25 32 *************
... ..( 5 skipped). .. *************
134 2016-11-13 15:31 32 *************
135 2016-11-13 15:32 ? -
136 2016-11-13 15:33 21 **
137 2016-11-13 15:34 21 **
138 2016-11-13 15:35 21 **
139 2016-11-13 15:36 22 ***
140 2016-11-13 15:37 22 ***
141 2016-11-13 15:38 23 ****
142 2016-11-13 15:39 23 ****
143 2016-11-13 15:40 23 ****
144 2016-11-13 15:41 24 *****
145 2016-11-13 15:42 25 ******
... ..( 2 skipped). .. ******
148 2016-11-13 15:45 25 ******
149 2016-11-13 15:46 26 *******
150 2016-11-13 15:47 26 *******
151 2016-11-13 15:48 26 *******
152 2016-11-13 15:49 27 ********
... ..( 5 skipped). .. ********
158 2016-11-13 15:55 27 ********
159 2016-11-13 15:56 28 *********
... ..( 5 skipped). .. *********
165 2016-11-13 16:02 28 *********
166 2016-11-13 16:03 29 **********
... ..( 10 skipped). .. **********
177 2016-11-13 16:14 29 **********
178 2016-11-13 16:15 30 ***********
... ..( 6 skipped). .. ***********
185 2016-11-13 16:22 30 ***********
186 2016-11-13 16:23 31 ************
... ..( 9 skipped). .. ************
196 2016-11-13 16:33 31 ************
197 2016-11-13 16:34 32 *************
... ..( 8 skipped). .. *************
206 2016-11-13 16:43 32 *************
207 2016-11-13 16:44 31 ************
208 2016-11-13 16:45 31 ************
209 2016-11-13 16:46 31 ************
210 2016-11-13 16:47 32 *************
211 2016-11-13 16:48 31 ************
... ..( 2 skipped). .. ************
214 2016-11-13 16:51 31 ************
215 2016-11-13 16:52 32 *************
216 2016-11-13 16:53 32 *************
217 2016-11-13 16:54 31 ************
218 2016-11-13 16:55 32 *************
219 2016-11-13 16:56 31 ************
... ..( 3 skipped). .. ************
223 2016-11-13 17:00 31 ************
224 2016-11-13 17:01 32 *************
225 2016-11-13 17:02 32 *************
226 2016-11-13 17:03 31 ************
... ..( 2 skipped). .. ************
229 2016-11-13 17:06 31 ************
230 2016-11-13 17:07 32 *************
231 2016-11-13 17:08 32 *************
232 2016-11-13 17:09 31 ************
... ..( 2 skipped). .. ************
235 2016-11-13 17:12 31 ************
236 2016-11-13 17:13 32 *************
... ..( 39 skipped). .. *************
276 2016-11-13 17:53 32 *************
277 2016-11-13 17:54 31 ************
... ..( 7 skipped). .. ************
285 2016-11-13 18:02 31 ************
286 2016-11-13 18:03 32 *************
287 2016-11-13 18:04 32 *************
288 2016-11-13 18:05 31 ************
289 2016-11-13 18:06 32 *************
... ..( 14 skipped). .. *************
304 2016-11-13 18:21 32 *************
305 2016-11-13 18:22 33 **************
306 2016-11-13 18:23 32 *************
307 2016-11-13 18:24 32 *************
308 2016-11-13 18:25 33 **************
309 2016-11-13 18:26 32 *************
... ..( 19 skipped). .. *************
329 2016-11-13 18:46 32 *************
330 2016-11-13 18:47 33 **************
331 2016-11-13 18:48 32 *************
332 2016-11-13 18:49 32 *************
333 2016-11-13 18:50 33 **************
334 2016-11-13 18:51 32 *************
... ..( 9 skipped). .. *************
344 2016-11-13 19:01 32 *************
345 2016-11-13 19:02 31 ************
... ..( 16 skipped). .. ************
362 2016-11-13 19:19 31 ************
363 2016-11-13 19:20 32 *************
... ..( 15 skipped). .. *************
379 2016-11-13 19:36 32 *************
380 2016-11-13 19:37 ? -
381 2016-11-13 19:38 33 **************
382 2016-11-13 19:39 32 *************
... ..( 29 skipped). .. *************
412 2016-11-13 20:09 32 *************
413 2016-11-13 20:10 33 **************
... ..( 14 skipped). .. **************
428 2016-11-13 20:25 33 **************
429 2016-11-13 20:26 32 *************
430 2016-11-13 20:27 33 **************
431 2016-11-13 20:28 32 *************
432 2016-11-13 20:29 32 *************
433 2016-11-13 20:30 32 *************
SCT Error Recovery Control command not supported
Device Statistics (GP Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 2 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x8000 4 3575 Vendor specific
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Anthony Youngman @ 2016-11-13 20:18 UTC (permalink / raw)
To: Bruce Merry, linux-raid
In-Reply-To: <CAHy4j_7_nRMxOSW16VTAY7bzdW_VMap=Jeb2M0wMiNDoNXcijQ@mail.gmail.com>
Quick first response ...
On 13/11/16 18:46, Bruce Merry wrote:
> Hi
>
> I'm running software RAID1 across two drives in my home machine (LVM
> on LUKS on RAID1). I've just installed smartmontools and run short
> tests, and promptly received emails to tell me that one of the drives
> has 4 offline uncorrectable sectors and 3 current pending sectors.
> I've attached smartctl --xall output for sda (good) and sdb (bad).
>
> These drives are pretty old (over 5 years) so I'm going to replace
> them as soon as I have time (and yes, I have backups), but in the
> meantime I'd like advice on:
>
What drives are they? I'm guessing they're hunky-dory, but they don't
fall foul of timeout mismatch, do they?
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 1. What exactly this means. My understanding is that some data has
> been lost (or may have been lost) on the drive, but the drive still
> has spare sectors to remap things once the failed sectors are written
> to. Is that correct?
It may also mean that the four sectors at least, have already been
remapped ... I'll let the experts confirm. The three pending errors
might be where a read has failed but there's not yet been a re-write -
and you won't have noticed because the raid dealt with it.
>
> 2. How can I tell which sectors are problematic? If it's in the swap
> partition I'm far less worried. I can see two LBAs for offline
> uncorrectable errors in the --xall output, but that still leaves
> another two at large.
I don't think you need to be worried at all. It's only a few sectors,
there's no sign of any further trouble? and as it's raided, when the
drive returns an error the raid code will sort it out for you.
>
> 3. Assuming my understanding is correct, and that the sector falls
> within the RAID1 partition on the drive, is there some way I can
> recover the sectors from the other drive in the RAID1? As a last
> resort I imagine I could wipe the suspect drive and then rebuild it
> from the good one, but I'm hoping there's something less risky I can
> do.
Do a scrub? You've got seven errors total, which some people will say
"panic on the first error" and others will say "so what, the odd error
every now and then is nothing to worry about". The point of a scrub is
it will background-scan the entire array, and if it can't read anything,
it will re-calculate and re-write it.
Just make sure you've not got that timeout problem, or a scrub will make
matters a whole lot worse ...
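For reference, a scrub can be started by hand through the md sysfs files. The following is only a minimal sketch: the array name md0 and the sysfs_root parameter are assumptions for illustration (sysfs_root exists so the code can be tried against a scratch directory), and writing to the real files needs root.

```python
from pathlib import Path


def start_scrub(md_name="md0", action="check", sysfs_root="/sys"):
    """Ask md to scrub an array by writing to its sync_action file.

    "check" reads every copy and counts mismatches; "repair" also
    rewrites them. md_name and sysfs_root are illustrative defaults.
    """
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    md_dir = Path(sysfs_root) / "block" / md_name / "md"
    (md_dir / "sync_action").write_text(action + "\n")
    return md_dir


def mismatch_count(md_name="md0", sysfs_root="/sys"):
    """Return the current value of mismatch_cnt for the array."""
    md_dir = Path(sysfs_root) / "block" / md_name / "md"
    return int((md_dir / "mismatch_cnt").read_text().strip())
```

Progress of a running check shows up in /proc/mdstat; mismatch_cnt is only meaningful once the pass has finished.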
>
> Thanks in advance
> Bruce
>
Cheers,
Wol
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-13 20:51 UTC (permalink / raw)
To: Anthony Youngman; +Cc: linux-raid
In-Reply-To: <942ab8be-cd5c-c6d1-d077-cd295b355c0c@youngman.org.uk>
On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
> Quick first response ...
>
> On 13/11/16 18:46, Bruce Merry wrote:
>>
>> Hi
>>
>> I'm running software RAID1 across two drives in my home machine (LVM
>> on LUKS on RAID1). I've just installed smartmontools and run short
>> tests, and promptly received emails to tell me that one of the drives
>> has 4 offline uncorrectable sectors and 3 current pending sectors.
>> I've attached smartctl --xall output for sda (good) and sdb (bad).
>>
>> These drives are pretty old (over 5 years) so I'm going to replace
>> them as soon as I have time (and yes, I have backups), but in the
>> meantime I'd like advice on:
>>
> What drives are they? I'm guessing they're hunky-dory, but they don't fall
> foul of timeout mismatch, do they?
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
smartctl reports "SCT Error Recovery Control command not supported".
Does that mean I should be worried? Is there any way to tell whether a
given drive I can buy online supports it?
>> 1. What exactly this means. My understanding is that some data has
>> been lost (or may have been lost) on the drive, but the drive still
>> has spare sectors to remap things once the failed sectors are written
>> to. Is that correct?
>
>
> It may also mean that the four sectors at least, have already been remapped
> ... I'll let the experts confirm. The three pending errors might be where a
> read has failed but there's not yet been a re-write - and you won't have
> noticed because the raid dealt with it.
I'm guessing nothing has been remapped yet, because the
Reallocated_Sector_Ct and Reallocated_Event_Count are both zero.
>> 3. Assuming my understanding is correct, and that the sector falls
>> within the RAID1 partition on the drive, is there some way I can
>> recover the sectors from the other drive in the RAID1? As a last
>> resort I imagine I could wipe the suspect drive and then rebuild it
>> from the good one, but I'm hoping there's something less risky I can
>> do.
>
>
> Do a scrub? You've got seven errors total, which some people will say "panic
> on the first error" and others will say "so what, the odd error every now
> and then is nothing to worry about". The point of a scrub is it will
> background-scan the entire array, and if it can't read anything, it will
> re-calculate and re-write it.
Yes, that sounds like what I need. Thanks to Google I found
/usr/share/mdadm/checkarray to trigger this. It still has a few hours
to go, but now the bad drive has pending sectors == 65535 (which is
suspiciously 2^16 - 1, so I assume it's actually higher and
is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
1408. If scrubbing is supposed to rewrite on failed reads I would have
expected pending sectors to go down rather than up, so I'm not sure
what's happening.
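To make the clamping guess concrete: 65535 is 2^16 - 1, the largest value an unsigned 16-bit field can hold. A minimal sketch, assuming the drive keeps this count in such a field and saturates instead of wrapping (that is a guess; nothing in the smartctl output confirms it):

```python
U16_MAX = (1 << 16) - 1  # largest unsigned 16-bit value


def saturate_u16(count):
    """Clamp a count to what a saturating 16-bit counter could report."""
    return min(count, U16_MAX)


print(U16_MAX)
print(saturate_u16(70000))
```

If the counter does behave this way, a reading of exactly 65535 only says "at least 65535", not the true number.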
Thanks
Bruce
--
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Wols Lists @ 2016-11-13 21:06 UTC (permalink / raw)
To: Bruce Merry; +Cc: linux-raid
In-Reply-To: <CAHy4j_7F=gN9=7mEH-TsdVJR0YFxBzJK98WeJfuwtANoDEy93w@mail.gmail.com>
On 13/11/16 20:51, Bruce Merry wrote:
> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>> Quick first response ...
>>
>> On 13/11/16 18:46, Bruce Merry wrote:
>>>
>>> Hi
>>>
>>> I'm running software RAID1 across two drives in my home machine (LVM
>>> on LUKS on RAID1). I've just installed smartmontools and run short
>>> tests, and promptly received emails to tell me that one of the drives
>>> has 4 offline uncorrectable sectors and 3 current pending sectors.
>>> I've attached smartctl --xall output for sda (good) and sdb (bad).
>>>
>>> These drives are pretty old (over 5 years) so I'm going to replace
>>> them as soon as I have time (and yes, I have backups), but in the
>>> meantime I'd like advice on:
>>>
>> What drives are they? I'm guessing they're hunky-dory, but they don't fall
>> foul of timeout mismatch, do they?
>>
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> smartctl reports "SCT Error Recovery Control command not supported".
> Does that mean I should be worried? Is there any way to tell whether a
> given drive I can buy online supports it?
You need drives that explicitly support raid. WD Reds, Seagate NAS, some
Toshibas - my 2TB laptop drive does ... Try and find a friend with a
drive you like, and check it out, or ask on this list :-)
Did you run that script to increase the kernel timeout?
>
>>> 1. What exactly this means. My understanding is that some data has
>>> been lost (or may have been lost) on the drive, but the drive still
>>> has spare sectors to remap things once the failed sectors are written
>>> to. Is that correct?
>>
>>
>> It may also mean that the four sectors at least, have already been remapped
>> ... I'll let the experts confirm. The three pending errors might be where a
>> read has failed but there's not yet been a re-write - and you won't have
>> noticed because the raid dealt with it.
>
> I'm guessing nothing has been remapped yet, because the
> Reallocated_Sector_Ct and Reallocator_Event_ct are both zero.
>
>>> 3. Assuming my understanding is correct, and that the sector falls
>>> within the RAID1 partition on the drive, is there some way I can
>>> recover the sectors from the other drive in the RAID1? As a last
>>> resort I imagine I could wipe the suspect drive and then rebuild it
>>> from the good one, but I'm hoping there's something less risky I can
>>> do.
>>
>>
>> Do a scrub? You've got seven errors total, which some people will say "panic
>> on the first error" and others will say "so what, the odd error every now
>> and then is nothing to worry about". The point of a scrub is it will
>> background-scan the entire array, and if it can't read anything, it will
>> re-calculate and re-write it.
>
> Yes, that sounds like what I need. Thanks to Google I found
> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
> to go, but now the bad drive has pending sectors == 65535 (which is
> suspiciously power-of-two and I assume means it's actually higher and
> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
> 1408. If scrubbing is supposed to rewrite on failed reads I would have
> expected pending sectors to go down rather than up, so I'm not sure
> what's happening.
>
Ummm....
Sounds like that drive could need replacing. I'd get a new drive and do
that as soon as possible - use the --replace option of mdadm - don't
fail the old drive and add the new. Dunno where you're based, but 5mins
on the internet ordering a new drive is probably time well spent.
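As a rough sketch of the --replace route: the new drive is added as a spare first, then mdadm is told to replace the old member with it, so the array keeps its redundancy while the copy runs. The device names used below (/dev/md0, /dev/sdb1, /dev/sdc1) are placeholders, not taken from this thread, and the helper names are made up for illustration.

```python
import subprocess


def replace_commands(array, old_dev, new_dev):
    """Build the two mdadm invocations for a hot replace.

    All three arguments are caller-supplied device paths.
    """
    return [
        ["mdadm", array, "--add", new_dev],
        ["mdadm", array, "--replace", old_dev, "--with", new_dev],
    ]


def run_replace(array, old_dev, new_dev, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in replace_commands(array, old_dev, new_dev):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


run_replace("/dev/md0", "/dev/sdb1", "/dev/sdc1")
```

The default is a dry run on purpose; check the printed commands against your own device names before running anything for real.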
Note that Seagate Barracudas don't have the best of reputations if
they're the drive you've already got, and the 3TB drives are best
avoided. Sod's law, I've got two of them ...
Advice I always give ... if you're getting new drives, always consider
increasing capacity. I don't know what size your current drives are, but
look at prices of drives a bit larger than what they are, and is it
worth paying the extra?
If you do get bigger drives, there's nothing stopping you making the
partitions on it bigger before you add them in to the array. It'll be
wasted space until you increase the size of all the drives, but once
you've replaced both drives, you can use mdadm to increase the array
size. I don't know about LUKS, but I would expect you can grow that, and
then you can expand your data partitions within that.
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 22:53 UTC (permalink / raw)
To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid, linux-block
In-Reply-To: <20161112174238.GA11518@infradead.org>
On Sun, Nov 13 2016, Christoph Hellwig wrote:
> On Fri, Nov 11, 2016 at 11:02:23AM -0800, Shaohua Li wrote:
>> > It's mostly about the RAID1 and RAID10 code which does a lot of funny
>> > things with the bi_iov_vec and bi_vcnt fields, which we'd prefer that
>> > drivers don't touch. One example is the r1buf_pool_alloc code,
>> > which I think should simply use bio_clone for the MD_RECOVERY_REQUESTED
>> > case, which would also take care of r1buf_pool_free. I'm not sure
>> > about all the others cases, as some bits don't fully make sense to me,
>>
>> The problem is we use the iov_vec to track the pages allocated. We will read
>> data to the pages and write out later for resync. If we add new fields to track
>> the pages in r1bio, we could use standard API bio_kmalloc/bio_add_page and
>> avoid the tricky parts. This should work for both the resync and writebehind
>> cases.
>
> I don't think we need to track the pages specificly - if we clone
> a bio we share the bio_vec, e.g. for the !MD_RECOVERY_REQUESTED
> we do one bio_kmalloc, then bio_alloc_pages then clone it for the
> others bios. for MD_RECOVERY_REQUESTED we do a bio_kmalloc +
> bio_alloc_pages for each.
Part of the reason for the oddities in this code is that I wanted a
collection of bios, one per device, which were all the same size. As
different devices might impose different restrictions on the size of the
bios, I built them carefully, step by step.
Now that those restrictions are gone, we can - as you say - just
allocate a suitably sized bio and then clone it for each device.
>
> While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> confusing, and I'm not 100% sure it's correct. After all we check it
> in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> on these callbacks being done after the flag has been raise / cleared,
> which makes me bit suspicious, and also question why we even need the
> mempool.
MD_RECOVERY_REQUESTED is only set or cleared when no recovery is running.
The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
races there.
The r1buf_pool mempool is created at the start of resync, so at that
time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
after the mempool is freed.
To perform a resync we need a pool of memory buffers. We don't want to
have to cope with kmalloc failing, but are quite able to cope with
mempool_alloc() blocking.
We probably don't need nearly as many bufs as we allocate (4 is probably
plenty), but having a pool is certainly convenient.
>
>>
>> > e.g. why we're trying to do single page I/O out of a bigger bio.
>>
>> what's this one?
>
> fix_sync_read_error
The "bigger bio" might cover a large number of sectors. If there are
media errors, there might be only one sector that is bad. So we repeat
the read with finer granularity (pages in the current code, though
device block would be ideal) and only record bad blocks for individual
pages which are bad and cannot be fixed.
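The idea of retrying at finer granularity can be sketched outside the kernel. This is a toy simulation, not the raid1 code: read_ok() stands in for a device read, PAGE_SECTORS assumes 4 KiB pages over 512-byte sectors, and all names are made up for illustration.

```python
PAGE_SECTORS = 8  # assumed: 4 KiB page / 512-byte sectors


def read_ok(start, count, bad_sectors):
    """Stand-in for a device read: fails if the range hits a bad sector."""
    return not any(start <= s < start + count for s in bad_sectors)


def find_bad_pages(start, count, bad_sectors):
    """Return the start sector of each page that cannot be read.

    The whole range is tried first; only on failure is it re-read one
    page at a time, so a single bad sector condemns one page rather
    than the whole request.
    """
    if read_ok(start, count, bad_sectors):
        return []
    bad_pages = []
    for page_start in range(start, start + count, PAGE_SECTORS):
        n = min(PAGE_SECTORS, start + count - page_start)
        if not read_ok(page_start, n, bad_sectors):
            bad_pages.append(page_start)
    return bad_pages
```

In the real code each failing page would next be tried on the other mirrors and rewritten, with a bad block recorded only if that also fails; the sketch stops at isolating the pages.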
NeilBrown
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-13 23:03 UTC (permalink / raw)
To: Christoph Hellwig, Shaohua Li; +Cc: linux-raid, linux-block
In-Reply-To: <20161110194636.GA32241@infradead.org>
On Fri, Nov 11 2016, Christoph Hellwig wrote:
>
> Another not quite as urgent issue is how the RAID5 code abuses
> ->bi_phys_segments as an outstanding I/O counter, and I have no
> really good answer to that either.
I would suggest adding a "bi_dev_private" field to the bio which is for
use by the lowest-level driver (much as bi_private is for use by the
top-level initiator).
That could be in a union with any or all of:
unsigned int bi_phys_segments;
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;
(any driver that needs those, would see a 'request' rather than a 'bio'
and so could use rq->special)
raid5.c could then use bi_dev_private (or bi_special, or whatever it is called).
NeilBrown
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-13 23:03 UTC (permalink / raw)
To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <5828D5DA.1070406@youngman.org.uk>
Hi Bruce,
On 11/13/2016 04:06 PM, Wols Lists wrote:
> On 13/11/16 20:51, Bruce Merry wrote:
>> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>>> Quick first response ...
>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>
>> smartctl reports "SCT Error Recovery Control command not supported".
>> Does that mean I should be worried? Is there any way to tell whether a
>> given drive I can buy online supports it?
You should be worried. It is vital for proper MD raid operation that
drive timeouts be shorter than the kernel timeout for that device. If
you can't make the drive timeout short, you *must* make the kernel
timeout long.
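Making the kernel timeout long comes down to one sysfs write per drive. A minimal sketch, assuming the usual /sys/block/<disk>/device/timeout file; the 180 second default here is an assumption taken from common advice for drives without ERC, and sysfs_root is only a parameter so the code can be tried on a scratch directory.

```python
from pathlib import Path


def set_kernel_timeout(disk, seconds=180, sysfs_root="/sys"):
    """Raise the kernel command timeout for one disk (e.g. "sdb").

    Returns the previous value so it can be logged or restored.
    Needs root when pointed at the real /sys.
    """
    timeout_file = Path(sysfs_root) / "block" / disk / "device" / "timeout"
    old = int(timeout_file.read_text().strip())
    timeout_file.write_text(f"{seconds}\n")
    return old
```

The setting does not survive a reboot, so it has to be reapplied from a boot script or udev rule for every member drive that lacks ERC.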
> You need drives that explicitly support raid. WD Reds, Seagate NAS, some
> Toshibas - my 2TB laptop drive does ... Try and find a friend with a
> drive you like, and check it out, or ask on this list :-)
Manufacturers' data sheets for system builders usually contain enough
information to determine if ERC is supported. Nowadays, the "NAS"
families work out of the box. And enterprise drives, too, of course.
>>>> 1. What exactly this means. My understanding is that some data has
>>>> been lost (or may have been lost) on the drive, but the drive still
>>>> has spare sectors to remap things once the failed sectors are written
>>>> to. Is that correct?
Generally, yes.
>>> Do a scrub? You've got seven errors total, which some people will say "panic
>>> on the first error" and others will say "so what, the odd error every now
>>> and then is nothing to worry about". The point of a scrub is it will
>>> background-scan the entire array, and if it can't read anything, it will
>>> re-calculate and re-write it.
>>
>> Yes, that sounds like what I need. Thanks to Google I found
>> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
>> to go, but now the bad drive has pending sectors == 65535 (which is
>> suspiciously power-of-two and I assume means it's actually higher and
>> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
>> 1408. If scrubbing is supposed to rewrite on failed reads I would have
>> expected pending sectors to go down rather than up, so I'm not sure
>> what's happening.
>>
> Ummm....
>
> Sounds like that drive could need replacing. I'd get a new drive and do
> that as soon as possible - use the --replace option of mdadm - don't
> fail the old drive and add the new. Dunno where you're based, but 5mins
> on the internet ordering a new drive is probably time well spent.
You have two other possibilities:
1) Swap volumes in the raid. These are known to drop unneeded writes
when the data isn't needed, even if it made it to one of the mirrors.
That makes harmless mismatches.
2) Trim. Well-behaved drive firmware guarantees zeros for trimmed
sectors, but many drives return random data instead. Zing, mismatches.
It's often unhelpful with encrypted volumes, as even well-behaved
firmware can't deliver zeroed sectors *inside* the encryption.
I wouldn't panic just yet. The check scrub should (with mitigated
timeouts) fix all of your pending sectors. Then look at your actual
relocations to determine if you really do have a problem.
Phil
* [md PATCH 0/4] Improve blktrace tracing of md.
From: NeilBrown @ 2016-11-14 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
blktrace on md devices reports when a request is queued and when it is
split, but request completion and the mapping to subordinate devices
is not reported.
So add that, as well as some events when IO is delayed for one
reason or another (eg. bitmap updates etc).
---
NeilBrown (4):
md: add block tracing for bio_remapping
md: add bio completion tracing for raid1/raid10
md/bitmap: add blktrace event for writes to the bitmap.
md/raid1,raid10: add blktrace records when IO is delayed.
drivers/md/bitmap.c | 11 ++++++++++-
drivers/md/linear.c | 8 +++++++-
drivers/md/raid0.c | 8 +++++++-
drivers/md/raid1.c | 42 +++++++++++++++++++++++++++++++++++++++---
drivers/md/raid10.c | 38 ++++++++++++++++++++++++++++++++++++--
5 files changed, 99 insertions(+), 8 deletions(-)
--
Signature
* [md PATCH 2/4] md: add bio completion tracing for raid1/raid10
From: NeilBrown @ 2016-11-14 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>
raid5 already has this, as does dm.
linear and raid0 do not see completions, only bio_chain_end() or bio_endio()
see those.
So just add it for raid1 and raid10.
Between
Commit: 3a366e614d08 ("block: add missing block_bio_complete() tracepoint")
and
Commit: 0a82a8d132b2 ("Revert "block: add missing block_bio_complete() tracepoint"")
in the 3.9-rc series, this was done centrally in bio_endio().
Until/unless that is resurrected, do the tracing in the md/raid code.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/raid1.c | 1 +
drivers/md/raid10.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3710a792a149..0674e5a0142e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -257,6 +257,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
bio->bi_error = -EIO;
if (done) {
+ trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, bio->bi_error);
bio_endio(bio);
/*
* Wake up any possible resync thread that waits for the device
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d144c3425824..c3036099ff9a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -311,6 +311,7 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
bio->bi_error = -EIO;
if (done) {
+ trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, bio->bi_error);
bio_endio(bio);
/*
* Wake up any possible resync thread that waits for the device
* [md PATCH 1/4] md: add block tracing for bio_remapping
From: NeilBrown @ 2016-11-14 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>
The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.
When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the original bio,
or a secondary mapping for the bio that errors. This patch uses
the second option.
When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/linear.c | 8 +++++++-
drivers/md/raid0.c | 8 +++++++-
drivers/md/raid1.c | 33 ++++++++++++++++++++++++++++++---
drivers/md/raid10.c | 29 +++++++++++++++++++++++++++--
4 files changed, 71 insertions(+), 7 deletions(-)
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 9c7d4f5483ea..8c0bccfa53a2 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -21,6 +21,7 @@
#include <linux/seq_file.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <trace/events/block.h>
#include "md.h"
#include "linear.h"
@@ -256,8 +257,13 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split);
- } else
+ } else {
+ if (mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
+ split, disk_devt(mddev->gendisk),
+ bio->bi_iter.bi_sector);
generic_make_request(split);
+ }
} while (split != bio);
return;
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index b3ba77a3c3bc..841b3ad0f5ff 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -21,6 +21,7 @@
#include <linux/seq_file.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <trace/events/block.h>
#include "md.h"
#include "raid0.h"
#include "raid5.h"
@@ -491,8 +492,13 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
!blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
/* Just ignore it */
bio_endio(split);
- } else
+ } else {
+ if (mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
+ split, disk_devt(mddev->gendisk),
+ bio->bi_iter.bi_sector);
generic_make_request(split);
+ }
} while (split != bio);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9ac61cd85e5c..3710a792a149 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -37,6 +37,7 @@
#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
+#include <trace/events/block.h>
#include "md.h"
#include "raid1.h"
#include "bitmap.h"
@@ -743,6 +744,7 @@ static void flush_pending_writes(struct r1conf *conf)
while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
struct md_rdev *rdev = (void*)bio->bi_bdev;
+ struct r1bio *r1_bio = bio->bi_private;
bio->bi_next = NULL;
bio->bi_bdev = rdev->bdev;
if (test_bit(Faulty, &rdev->flags)) {
@@ -752,8 +754,13 @@ static void flush_pending_writes(struct r1conf *conf)
!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
/* Just ignore it */
bio_endio(bio);
- else
+ else {
+ if (conf->mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, disk_devt(conf->mddev->gendisk),
+ r1_bio->sector);
generic_make_request(bio);
+ }
bio = next;
}
} else
@@ -1022,6 +1029,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
struct md_rdev *rdev = (void*)bio->bi_bdev;
+ struct r1bio *r1_bio = bio->bi_private;
bio->bi_next = NULL;
bio->bi_bdev = rdev->bdev;
if (test_bit(Faulty, &rdev->flags)) {
@@ -1031,8 +1039,13 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
/* Just ignore it */
bio_endio(bio);
- else
+ else {
+ if (mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, disk_devt(mddev->gendisk),
+ r1_bio->sector);
generic_make_request(bio);
+ }
bio = next;
}
kfree(plug);
@@ -1162,6 +1175,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
bio_set_op_attrs(read_bio, op, do_sync);
read_bio->bi_private = r1_bio;
+ if (mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
+ read_bio, disk_devt(mddev->gendisk),
+ r1_bio->sector);
+
if (max_sectors < r1_bio->sectors) {
/* could not read all from this device, so we will
* need another r1_bio.
@@ -2290,6 +2308,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
struct bio *bio;
char b[BDEVNAME_SIZE];
struct md_rdev *rdev;
+ dev_t bio_dev;
+ sector_t bio_sector;
clear_bit(R1BIO_ReadError, &r1_bio->state);
/* we got a read error. Maybe the drive is bad. Maybe just
@@ -2303,6 +2323,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
bio = r1_bio->bios[r1_bio->read_disk];
bdevname(bio->bi_bdev, b);
+ bio_dev = bio->bi_bdev->bd_dev;
+ bio_sector = conf->mirrors[r1_bio->read_disk].rdev->data_offset + r1_bio->sector;
bio_put(bio);
r1_bio->bios[r1_bio->read_disk] = NULL;
@@ -2353,6 +2375,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
else
mbio->bi_phys_segments++;
spin_unlock_irq(&conf->device_lock);
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, bio_dev, bio_sector);
generic_make_request(bio);
bio = NULL;
@@ -2367,8 +2391,11 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
sectors_handled;
goto read_more;
- } else
+ } else {
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, bio_dev, bio_sector);
generic_make_request(bio);
+ }
}
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5290be3d5c26..d144c3425824 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -25,6 +25,7 @@
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
#include <linux/kthread.h>
+#include <trace/events/block.h>
#include "md.h"
#include "raid10.h"
#include "raid0.h"
@@ -859,6 +860,7 @@ static void flush_pending_writes(struct r10conf *conf)
while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
struct md_rdev *rdev = (void*)bio->bi_bdev;
+ struct r10bio *r10_bio = bio->bi_private;
bio->bi_next = NULL;
bio->bi_bdev = rdev->bdev;
if (test_bit(Faulty, &rdev->flags)) {
@@ -868,8 +870,13 @@ static void flush_pending_writes(struct r10conf *conf)
!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
/* Just ignore it */
bio_endio(bio);
- else
+ else {
+ if (conf->mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, disk_devt(conf->mddev->gendisk),
+ r10_bio->sector);
generic_make_request(bio);
+ }
bio = next;
}
} else
@@ -1042,6 +1049,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
struct md_rdev *rdev = (void*)bio->bi_bdev;
+ struct r10bio *r10_bio = bio->bi_private;
bio->bi_next = NULL;
bio->bi_bdev = rdev->bdev;
if (test_bit(Faulty, &rdev->flags)) {
@@ -1051,8 +1059,13 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
!blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
/* Just ignore it */
bio_endio(bio);
- else
+ else {
+ if (conf->mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, disk_devt(conf->mddev->gendisk),
+ r10_bio->sector);
generic_make_request(bio);
+ }
bio = next;
}
kfree(plug);
@@ -1165,6 +1178,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
bio_set_op_attrs(read_bio, op, do_sync);
read_bio->bi_private = r10_bio;
+ if (mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
+ read_bio, disk_devt(mddev->gendisk),
+ r10_bio->sector);
if (max_sectors < r10_bio->sectors) {
/* Could not read all from this device, so we will
* need another r10_bio.
@@ -2496,6 +2513,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
char b[BDEVNAME_SIZE];
unsigned long do_sync;
int max_sectors;
+ dev_t bio_dev;
+ sector_t bio_last_sector;
/* we got a read error. Maybe the drive is bad. Maybe just
* the block and we can fix it.
@@ -2507,6 +2526,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
*/
bio = r10_bio->devs[slot].bio;
bdevname(bio->bi_bdev, b);
+ bio_dev = bio->bi_bdev->bd_dev;
+ bio_last_sector = r10_bio->devs[slot].addr + rdev->data_offset + r10_bio->sectors;
bio_put(bio);
r10_bio->devs[slot].bio = NULL;
@@ -2546,6 +2567,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
bio->bi_private = r10_bio;
bio->bi_end_io = raid10_end_read_request;
+ trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
+ bio, bio_dev,
+ bio_last_sector - r10_bio->sectors);
+
if (max_sectors < r10_bio->sectors) {
/* Drat - have to split this up more */
struct bio *mbio = r10_bio->master_bio;
^ permalink raw reply related
* [md PATCH 4/4] md/raid1, raid10: add blktrace records when IO is delayed.
From: NeilBrown @ 2016-11-14 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>
Both raid1 and raid10 will sometimes delay handling an IO request,
such as when resync is happening or there are too many requests queued.
Add some blktrace messages so we can see when that is happening when
looking for performance artefacts.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/raid1.c | 8 ++++++++
drivers/md/raid10.c | 8 ++++++++
2 files changed, 16 insertions(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0674e5a0142e..e94db92a4dbf 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -71,6 +71,9 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
sector_t bi_sector);
static void lower_barrier(struct r1conf *conf);
+#define raid1_log(md, fmt, args...) \
+ do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
+
static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
{
struct pool_info *pi = data;
@@ -868,6 +871,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
* that queue to allow conf->start_next_window
* to increase.
*/
+ raid1_log(conf->mddev, "wait barrier");
wait_event_lock_irq(conf->wait_barrier,
!conf->array_frozen &&
(!conf->barrier ||
@@ -947,6 +951,7 @@ static void freeze_array(struct r1conf *conf, int extra)
*/
spin_lock_irq(&conf->resync_lock);
conf->array_frozen = 1;
+ raid1_log(conf->mddev, "wait freeze");
wait_event_lock_irq_cmd(conf->wait_barrier,
conf->nr_pending == conf->nr_queued+extra,
conf->resync_lock,
@@ -1157,6 +1162,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
* take care not to over-take any writes
* that are 'behind'
*/
+ raid1_log(mddev, "wait behind writes");
wait_event(bitmap->behind_wait,
atomic_read(&bitmap->behind_writes) == 0);
}
@@ -1221,6 +1227,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
*/
if (conf->pending_count >= max_queued_requests) {
md_wakeup_thread(mddev->thread);
+ raid1_log(mddev, "wait queued");
wait_event(conf->wait_barrier,
conf->pending_count < max_queued_requests);
}
@@ -1312,6 +1319,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
rdev_dec_pending(conf->mirrors[j].rdev, mddev);
r1_bio->state = 0;
allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+ raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
start_next_window = wait_barrier(conf, bio);
/*
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c3036099ff9a..15e55488a9d2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -106,6 +106,9 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
static void end_reshape_write(struct bio *bio);
static void end_reshape(struct r10conf *conf);
+#define raid10_log(md, fmt, args...) \
+ do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, ##args); } while (0)
+
static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
{
struct r10conf *conf = data;
@@ -949,6 +952,7 @@ static void wait_barrier(struct r10conf *conf)
* that queue to get the nr_pending
* count down.
*/
+ raid10_log(conf->mddev, "wait barrier");
wait_event_lock_irq(conf->wait_barrier,
!conf->barrier ||
(atomic_read(&conf->nr_pending) &&
@@ -1106,6 +1110,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
/* IO spans the reshape position. Need to wait for
* reshape to pass
*/
+ raid10_log(conf->mddev, "wait reshape");
allow_barrier(conf);
wait_event(conf->wait_barrier,
conf->reshape_progress <= bio->bi_iter.bi_sector ||
@@ -1125,6 +1130,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
set_mask_bits(&mddev->flags, 0,
BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
md_wakeup_thread(mddev->thread);
+ raid10_log(conf->mddev, "wait reshape metadata");
wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_PENDING, &mddev->flags));
@@ -1222,6 +1228,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
*/
if (conf->pending_count >= max_queued_requests) {
md_wakeup_thread(mddev->thread);
+ raid10_log(mddev, "wait queued");
wait_event(conf->wait_barrier,
conf->pending_count < max_queued_requests);
}
@@ -1349,6 +1356,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
}
}
allow_barrier(conf);
+ raid10_log(conf->mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
md_wait_for_blocked_rdev(blocked_rdev, mddev);
wait_barrier(conf);
goto retry_write;
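
The raid1_log() and raid10_log() helpers added in this patch share one shape: a variadic macro wrapped in do { } while (0) that emits a message only when the device has a queue. A minimal userspace sketch of that shape follows. It assumes a hypothetical struct fake_md and a hypothetical fake_trace_msg() standing in for struct mddev and blk_add_trace_msg(); neither name exists in the kernel, and demo_log() is only an illustration of the pattern.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>	/* strcmp(), handy when inspecting last_msg */

/* Hypothetical stand-in for struct mddev: only the field the macro tests. */
struct fake_md {
	void *queue;	/* NULL means there is no queue to log against */
};

static char last_msg[128];	/* most recent message, kept for inspection */
static int msg_count;		/* number of messages emitted so far */

/* Hypothetical stand-in for blk_add_trace_msg(): formats into last_msg. */
static void fake_trace_msg(void *queue, const char *fmt, ...)
{
	va_list ap;

	assert(queue != NULL);
	va_start(ap, fmt);
	vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
	va_end(ap);
	msg_count++;
}

/*
 * Same shape as raid1_log(): the prefix is pasted onto the format string
 * at compile time, "##args" drops the trailing comma when no arguments
 * are given (a GNU C extension), and do { } while (0) makes the macro
 * behave as a single statement inside if/else.
 */
#define demo_log(md, fmt, args...) \
	do { if ((md)->queue) fake_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
```

With a non-NULL queue, demo_log(&md, "wait rdev %d blocked", 3) leaves "raid1 wait rdev 3 blocked" in last_msg; with a NULL queue nothing is recorded.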
^ permalink raw reply related
* [md PATCH 3/4] md/bitmap: add blktrace event for writes to the bitmap.
From: NeilBrown @ 2016-11-14 5:30 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <147910131504.27168.6566119701315109161.stgit@noble>
We trace whenever bitmap_unplug() finds that it needs to write
to the bitmap, or when bitmap_daemon_work() finds there is work
to do.
This makes it easier to correlate bitmap updates with data writes.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/bitmap.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 1a7f402b79ba..cf77cbf9ed22 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -27,6 +27,7 @@
#include <linux/mount.h>
#include <linux/buffer_head.h>
#include <linux/seq_file.h>
+#include <trace/events/block.h>
#include "md.h"
#include "bitmap.h"
@@ -1008,8 +1009,12 @@ void bitmap_unplug(struct bitmap *bitmap)
need_write = test_and_clear_page_attr(bitmap, i,
BITMAP_PAGE_NEEDWRITE);
if (dirty || need_write) {
- if (!writing)
+ if (!writing) {
bitmap_wait_writes(bitmap);
+ if (bitmap->mddev->queue)
+ blk_add_trace_msg(bitmap->mddev->queue,
+ "md bitmap_unplug");
+ }
clear_page_attr(bitmap, i, BITMAP_PAGE_PENDING);
write_page(bitmap, bitmap->storage.filemap[i], 0);
writing = 1;
@@ -1234,6 +1239,10 @@ void bitmap_daemon_work(struct mddev *mddev)
}
bitmap->allclean = 1;
+ if (bitmap->mddev->queue)
+ blk_add_trace_msg(bitmap->mddev->queue,
+ "md bitmap_daemon_work");
+
/* Any file-page which is PENDING now needs to be written.
* So set NEEDWRITE now, then after we make any last-minute changes
* we will write it.
^ permalink raw reply related
* Re: Ddf based RAID management software
From: NeilBrown @ 2016-11-14 5:48 UTC (permalink / raw)
To: Arka Sharma, linux-raid
In-Reply-To: <CAPO=kN3xzv_MmbTSLT0bfxOpdZ+yvmybSXb6L-goBUasm+1NoQ@mail.gmail.com>
On Sun, Nov 13 2016, Arka Sharma wrote:
> Hello All,
>
> Is there any tool apart from mdadm available which can create software
> RAID based on Ddf metadata. We want to dump the metadata content and
> tally with metadata written by mdadm and our application.
Not that I'm aware of.
dmraid can read some ddf metadata, but I don't think it will create new
metadata.
NeilBrown
^ permalink raw reply
* Re: mdadm I/O error with Ddf RAID
From: NeilBrown @ 2016-11-14 6:00 UTC (permalink / raw)
To: Arka Sharma, linux-raid
In-Reply-To: <CAPO=kN2QDLEMgo9p9pU3=MeLQ=J6R8eeDL1Pw9m2pHjbVsuFGg@mail.gmail.com>
On Fri, Nov 11 2016, Arka Sharma wrote:
> Hi All,
>
> We have developed a RAID creation application which create RAID with
> Ddf RAID metadata. We are using PCIe ssd as physical disks. We are
> writing the anchor, primary, secondary headers, virtual and physical
> records, configuration record and physical disk data. The offsets of
> the headers are updated in the primary, secondary and anchor headers
> correctly. The problem is when we try to boot to Ubuntu server and we
> observe that mdadm is throwing a disk failure error message and from
> block layer we are getting rw=0, want=7, limit=1000215216. We also
> confirmed using there is no I/O error is coming from the PCIe ssd,
> using a logic analyzer. Also the limit value 1000215216 is the
> capacity of the ssd in 512 byte blocks. Any insight will be highly
> appreciated.
>
It looks like mdadm is attempting a 4K read starting at the last sector.
Possibly the SSDs report a physical sector size of 4K.
I don't know how DDF is supposed to work on a device like that.
Should the anchor be at the start of the last 4K block,
or in the last 512byte virtual block?
DDF support in mdadm was written with the assumption of 512 byte blocks.
I'm not at all certain this is the cause of the problem though.
I would suggest starting by finding out which READ request in mdadm is
causing the error.
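
To make the arithmetic concrete: with a capacity of 1000215216 sectors of 512 bytes, the last sector is 1000215215, and a 4K (8-sector) read starting there would run 7 sectors past the end of the device. A minimal sketch of that check, assuming a request is rejected whenever its end exceeds the capacity. sectors_past_end() is a hypothetical helper for illustration, not an mdadm or kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/*
 * Hypothetical helper: how many 512-byte sectors a read of `bytes` bytes
 * starting at sector `start` would overrun a device of `capacity` sectors.
 * Returns 0 when the read fits entirely on the device.
 */
static uint64_t sectors_past_end(uint64_t start, uint32_t bytes,
				 uint64_t capacity)
{
	uint64_t nr, end;

	assert(bytes > 0);
	nr = (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;	/* round up to sectors */
	end = start + nr;				/* exclusive end sector */
	return end > capacity ? end - capacity : 0;
}
```

Under these assumptions a 512-byte read of the last sector fits, and so does a 4K read that starts at the beginning of the last 4K-aligned block (sector 1000215208); only the 4K read that starts at the very last sector overruns.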
NeilBrown
^ permalink raw reply
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14 6:50 UTC (permalink / raw)
To: Phil Turmel; +Cc: Wols Lists, linux-raid
In-Reply-To: <bdd9358d-2141-eb4f-e765-52177b1ec852@turmel.org>
On 14 November 2016 at 01:03, Phil Turmel <philip@turmel.org> wrote:
> Hi Bruce,
>
> On 11/13/2016 04:06 PM, Wols Lists wrote:
>> On 13/11/16 20:51, Bruce Merry wrote:
>>> On 13 November 2016 at 22:18, Anthony Youngman <antlists@youngman.org.uk> wrote:
>>>> Quick first response ...
>
>>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>>
>>> smartctl reports "SCT Error Recovery Control command not supported".
>>> Does that mean I should be worried? Is there any way to tell whether a
>>> given drive I can buy online supports it?
>
> You should be worried. It is vital for proper MD raid operation that
> drive timeouts be shorter than the kernel timeout for that device. If
> you can't make the drive timeout short, you *must* make the kernel
> timeout long.
Okay, I'll give that script a go to increase my kernel timeout. If I
understand correctly, it's not the end of the world if the drive
doesn't support SCTERC, provided I have a long kernel timeout (and
when things go wrong it might take much longer to recover, but I can
live with that). Is that correct?
>>> Yes, that sounds like what I need. Thanks to Google I found
>>> /usr/share/mdadm/checkarray to trigger this. It still has a few hours
>>> to go, but now the bad drive has pending sectors == 65535 (which is
>>> suspiciously power-of-two and I assume means it's actually higher and
>>> is being clamped), and /sys/block/md0/md/mismatch_cnt is currently at
>>> 1408. If scrubbing is supposed to rewrite on failed reads I would have
>>> expected pending sectors to go down rather than up, so I'm not sure
>>> what's happening.
>>>
>> Ummm....
>>
>> Sounds like that drive could need replacing. I'd get a new drive and do
>> that as soon as possible - use the --replace option of mdadm - don't
>> fail the old drive and add the new. Dunno where you're based, but 5mins
>> on the internet ordering a new drive is probably time well spent.
Oh don't worry, I wasted no time in ordering new drives already.
> You have two other possibilities:
>
> 1) Swap volumes in the raid. These are known to drop unneeded writes
> when the data isn't needed, even if it made it to one of the mirrors.
> That makes harmless mismatches.
It won't be that - I keep separate non-RAIDed partitions for swap.
> 2) Trim. Well-behaved drive firmware guarantees zeros for trimmed
> sectors, but many drives return random data instead. Zing, mismatches.
> It's often unhelpful with encrypted volumes, as even well-behaved
> firmware can't deliver zeroed sectors *inside* the encryption.
Weee, sounds like fun. I hope it's that. Is there any way to tell
which blocks mismatch, so that I can tell which files are in trouble
(assuming I can figure out how to map through LVM, LUKS and
debuge2fs)?
Bruce
--
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/
^ permalink raw reply
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14 8:51 UTC (permalink / raw)
To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <87shqvj83r.fsf@notabene.neil.brown.name>
On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
> I would suggest adding a "bi_dev_private" field to the bio which is for
> use by the lowest-level driver (much as bi_private is for use by the
> top-level initiator).
> That could be in a union with any or all of:
> unsigned int bi_phys_segments;
> unsigned int bi_seg_front_size;
> unsigned int bi_seg_back_size;
>
> (any driver that needs those, would see a 'request' rather than a 'bio'
> and so could use rq->special)
>
> raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).
All three of the fields above are ones that could go away with a full
implementation of the multipage bvec scheme. So any field for driver
use would still be overhead. If it's just for raid5 it could
be a smaller 16 bit (or maybe even just 8 bit) one.
^ permalink raw reply
* Re: "creative" bio usage in the RAID code
From: Christoph Hellwig @ 2016-11-14 8:57 UTC (permalink / raw)
To: NeilBrown; +Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <87vavrj8jp.fsf@notabene.neil.brown.name>
On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
> > confusing, and I'm not 100% sure it's correct. After all we check it
> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
> > on these callbacks being done after the flag has been raise / cleared,
> > which makes me bit suspicious, and also question why we even need the
> > mempool.
>
> MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
> races there.
> The r1buf_pool mempool is created are the start of resync, so at that
> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
> after the mempool is freed.
>
> To perform a resync we need a pool of memory buffers. We don't want to
> have to cope with kmalloc failing, but are quite able to cope with
> mempool_alloc() blocking.
> We probably don't need nearly as many bufs as we allocate (4 is probably
> plenty), but having a pool is certainly convenient.
Would it be good to create/delete the pool explicitly through methods
to start/end the sync? Right now the behavior looks very, very
confusing.
> The "bigger bio" might cover a large number of sectors. If there are
> media errors, there might be only one sector that is bad. So we repeat
> the read with finer granularity (pages in the current code, though
> device block would be ideal) and only recovery bad blocks for individual
> pages which are bad and cannot be fixed.
I have no problems with the behavior - the point is that these days
this should be done without poking into the bio internals, by using
a bio iterator for just the range you want to re-read. Potentially
using a bio clone if we can't reuse the existing bio, although I'm
not sure we even need that from looking at the code.
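
The retry strategy described in the quoted text (re-read a failed range one page at a time so that only the pages that still fail are noted) can be sketched in userspace as follows. This is an illustration only: read_range(), reread_by_page() and the single simulated bad sector are hypothetical, and the sketch does not use the bio iterator API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTORS_PER_PAGE 8u	/* assumes 4096-byte pages, 512-byte sectors */

/* Simulated device: any read that touches bad_sector fails. */
static uint64_t bad_sector;

static bool read_range(uint64_t start, uint32_t nr)
{
	return !(bad_sector >= start && bad_sector < start + nr);
}

/*
 * Try the whole range first. If that fails, retry one page at a time and
 * store the start sector of every page that still fails in bad[].
 * Returns the number of entries stored (at most max_bad).
 */
static size_t reread_by_page(uint64_t start, uint32_t nr,
			     uint64_t *bad, size_t max_bad)
{
	size_t found = 0;
	uint32_t off;

	assert(bad != NULL && max_bad > 0);
	if (read_range(start, nr))
		return 0;	/* the large read succeeded, nothing to do */

	for (off = 0; off < nr; off += SECTORS_PER_PAGE) {
		uint32_t left = nr - off;
		uint32_t len = left < SECTORS_PER_PAGE ? left : SECTORS_PER_PAGE;

		if (!read_range(start + off, len) && found < max_bad)
			bad[found++] = start + off;
	}
	return found;
}
```

With one bad sector inside a 64-sector range, the large read fails but only the single 8-sector page containing that sector ends up in bad[].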
^ permalink raw reply
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14 9:43 UTC (permalink / raw)
Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <20161114085151.GA8405@infradead.org>
On Mon, Nov 14 2016, Christoph Hellwig wrote:
> On Mon, Nov 14, 2016 at 10:03:20AM +1100, NeilBrown wrote:
>> I would suggest adding a "bi_dev_private" field to the bio which is for
>> use by the lowest-level driver (much as bi_private is for use by the
>> top-level initiator).
>> That could be in a union with any or all of:
>> unsigned int bi_phys_segments;
>> unsigned int bi_seg_front_size;
>> unsigned int bi_seg_back_size;
>>
>> (any driver that needs those, would see a 'request' rather than a 'bio'
>> and so could use rq->special)
>>
>> raid5.c could then use bi_dev_private (or bi_special, or whatever it is call).
>
> All the three above fields are those that could go away with a full
> implementation of the multipage bvec scheme. So any field for driver
> use would still be be overhead. If it's just for raid5 it could
> be a smaller 16 bit (or maybe even just 8 bit) one.
We currently store 2 counters in that field, and before
commit 5b99c2ffa980528a197f26 one of the fields was only 8 bits,
and that caused problems.
We could possibly use __bi_remaining in place of
raid5_X_bi_active_stripes(). It wouldn't be a completely
straightforward conversion, but I think it could be made to work.
We *might* be able to use bvec_iter_advance() in place of
raid5_bi_processed_stripes(). A careful audit of the code would be
needed to be certain.
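
For readers unfamiliar with the trick, here is a sketch of the general technique of keeping two counters in one 32-bit field, and of why a narrow sub-field is risky. It is a userspace illustration with hypothetical names and an assumed 16/16 split; it is not the raid5 code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 16 bits hold an "active" count,
 * high 16 bits hold a "processed" count. */
#define ACTIVE_MASK	0xffffu
#define PROCESSED_SHIFT	16

static uint32_t pack(uint16_t active, uint16_t processed)
{
	return (uint32_t)active | ((uint32_t)processed << PROCESSED_SHIFT);
}

static uint16_t get_active(uint32_t v)
{
	return (uint16_t)(v & ACTIVE_MASK);
}

static uint16_t get_processed(uint32_t v)
{
	return (uint16_t)(v >> PROCESSED_SHIFT);
}

/*
 * Incrementing the low counter with a plain add is cheap, but if the
 * counter is already at its maximum the carry spills into the
 * neighbouring counter and silently corrupts it.
 */
static uint32_t inc_active(uint32_t v)
{
	return v + 1;
}
```

The narrower a sub-field is, the sooner that carry happens: an 8-bit counter wraps after 255 increments, a 16-bit one after 65535.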
NeilBrown
^ permalink raw reply
* Re: "creative" bio usage in the RAID code
From: NeilBrown @ 2016-11-14 9:51 UTC (permalink / raw)
Cc: Christoph Hellwig, Shaohua Li, linux-raid, linux-block
In-Reply-To: <20161114085720.GB8405@infradead.org>
On Mon, Nov 14 2016, Christoph Hellwig wrote:
> On Mon, Nov 14, 2016 at 09:53:46AM +1100, NeilBrown wrote:
>> > While we're at it - I find the way MD_RECOVERY_REQUESTED is used highly
>> > confusing, and I'm not 100% sure it's correct. After all we check it
>> > in r1buf_pool_alloc, which is a mempool alloc callback, so we rely
>> > on these callbacks being done after the flag has been raise / cleared,
>> > which makes me bit suspicious, and also question why we even need the
>> > mempool.
>>
>> MD_RECOVERY_REQUEST is only set or cleared when no recovery is running.
>> The ->reconfig_mutex and MD_RECOVERY_RUNNING flags ensure there are no
>> races there.
>> The r1buf_pool mempool is created are the start of resync, so at that
>> time MD_RECOVERY_RUNNING will be stable, and it will remain stable until
>> after the mempool is freed.
>>
>> To perform a resync we need a pool of memory buffers. We don't want to
>> have to cope with kmalloc failing, but are quite able to cope with
>> mempool_alloc() blocking.
>> We probably don't need nearly as many bufs as we allocate (4 is probably
>> plenty), but having a pool is certainly convenient.
>
> Would it be good to create/delete the pool explicitly through methods
> to start/emd the sync? Right now the behavior looks very, very
> confusing.
Maybe. It is created the first time ->sync_request is called,
and destroyed when it is called with a sector_nr at-or-beyond the end of
the device. I guess some of that could be made a bit more obvious.
I'm not strongly against adding new methods for "start_sync" and "stop_sync"
but I don't see that it is really needed.
>
>> The "bigger bio" might cover a large number of sectors. If there are
>> media errors, there might be only one sector that is bad. So we repeat
>> the read with finer granularity (pages in the current code, though
>> device block would be ideal) and only recovery bad blocks for individual
>> pages which are bad and cannot be fixed.
>
> i have no problems with the behavior - the point is that these days
> this should be without poking into the bio internals, but by using
> a bio iterator for just the range you want to re-read. Potentially
> using a bio clone if we can't reusing the existing bio, although I'm
> not sure we even need that from looking at the code.
Fair enough. The code predates bio iterators and "if it ain't broke,
don't fix it". If it is now causing problems, then maybe it is now
"broke" and should be "fixed".
Thanks,
NeilBrown
^ permalink raw reply
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Bruce Merry @ 2016-11-14 15:52 UTC (permalink / raw)
To: Wols Lists; +Cc: linux-raid
In-Reply-To: <5828D5DA.1070406@youngman.org.uk>
On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
> Sounds like that drive could need replacing. I'd get a new drive and do
> that as soon as possible - use the --replace option of mdadm - don't
> fail the old drive and add the new.
Would you mind explaining why I should use --replace instead of taking
out the suspect drive? I guess I lose redundancy for any writes that
occur while the rebuild is happening, but I'd plan to do this with the
filesystem unmounted so there wouldn't be any writes.
What I'd quite like to do is treat the "good" drive as the source for
all the data (unless it turns out to have bad sectors too...), even if
the other drive doesn't have a read error; which I'd achieve by
failing the "bad" drive. I want to do this because after doing one
scrub and starting on a second, I'm still seeing non-zero
mismatch_cnt. Does --replace do anything clever about using the old
drive only when it must, or does it just read from the whole array
like normal, which might mean taking the mismatched data from the bad
drive?
Thanks
Bruce
--
Dr Bruce Merry
bmerry <@> gmail <.> com
http://www.brucemerry.org.za/
http://blog.brucemerry.org.za/
^ permalink raw reply
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Wols Lists @ 2016-11-14 15:58 UTC (permalink / raw)
To: Bruce Merry; +Cc: linux-raid
In-Reply-To: <CAHy4j_7aC+DCqMRkmK12HPP-wY5kAmLf7W3UG_Nn=TK7ry7ARQ@mail.gmail.com>
On 14/11/16 15:52, Bruce Merry wrote:
> On 13 November 2016 at 23:06, Wols Lists <antlists@youngman.org.uk> wrote:
>> > Sounds like that drive could need replacing. I'd get a new drive and do
>> > that as soon as possible - use the --replace option of mdadm - don't
>> > fail the old drive and add the new.
> Would you mind explaining why I should use --replace instead of taking
> out the suspect drive? I guess I lose redundancy for any writes that
> occur while the rebuild is happening, but I'd plan to do this with the
> filesystem unmounted so there wouldn't be any writes.
Because a replace will copy from the old drive to the new, recovering
any failures from the rest of the array. A fail-and-add will have to
rebuild the entire new array from what's left of the old, stressing the
old array much more.
Okay, in your case, it probably won't make an awful lot of difference,
but it does make you vulnerable to problems on the "good" drive. To
alter your wording slightly, you lose redundancy for writes AND READS
that occur while the array is rebuilding. It's just good practice (and I
point it out because --replace is new and not well known at present).
Cheers,
Wol
^ permalink raw reply
* Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1
From: Phil Turmel @ 2016-11-14 16:01 UTC (permalink / raw)
To: Bruce Merry; +Cc: Wols Lists, linux-raid
In-Reply-To: <CAHy4j_6q2mmQ_y-xkh1Zx6CrP1vByQMadCXLJwibeeO85T3JgQ@mail.gmail.com>
Hi Bruce,
On 11/14/2016 01:50 AM, Bruce Merry wrote:
> Okay, I'll give that script a go to increase my kernel timeout. If I
> understand correctly, it's not the end of the world if the drive
> doesn't support SCTERC, provided I have a long kernel timeout (and
> when things go wrong it might take much longer to recover, but I can
> live with that). Is that correct?
Yes.
>> 2) Trim. Well-behaved drive firmware guarantees zeros for trimmed
>> sectors, but many drives return random data instead. Zing, mismatches.
>> It's often unhelpful with encrypted volumes, as even well-behaved
>> firmware can't deliver zeroed sectors *inside* the encryption.
>
> Weee, sounds like fun. I hope it's that. Is there any way to tell
> which blocks mismatch, so that I can tell which files are in trouble
> (assuming I can figure out how to map through LVM, LUKS and
> debuge2fs).
The check operation doesn't log the sector addresses, unfortunately. At
least I don't see any such operation in the code that increments
mismatch count. Not even a tracepoint. Hmmm.
In the meantime, run a "repair" scrub instead of a "check" scrub to
affirmatively force no mismatches. (Writes first member of mirrors to
the others.)
Phil
^ permalink raw reply