Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [md PATCH 3/4] md/bitmap: add blktrace event for writes to the bitmap.
From: Shaohua Li @ 2016-11-16 19:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147910142125.27168.1639109067364223445.stgit@noble>

On Mon, Nov 14, 2016 at 04:30:21PM +1100, Neil Brown wrote:
> We trace wheneven bitmap_unplug() finds that it needs to write
> to the bitmap, or when bitmap_daemon_work() find there is work
> to do.
> 
> This makes it easier to correlate bitmap updates with data writes.

Looks good. This reminds me if we should do similar thing for md_write_start if
we write superblock. Especially for raid5-cache, as we write superblock regularly.

Thanks,
Shaohua

^ permalink raw reply

* Re: [md PATCH 1/4] md: add block tracing for bio_remapping
From: Shaohua Li @ 2016-11-16 19:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <147910142095.27168.11356591734977480053.stgit@noble>

On Mon, Nov 14, 2016 at 04:30:21PM +1100, Neil Brown wrote:
> The block tracing infrastructure (accessed with blktrace/blkparse)
> supports the tracing of mapping bios from one device to another.
> This is currently used when a bio in a partition is mapped to the
> whole device, when bios are mapped by dm, and for mapping in md/raid5.
> Other md personalities do not include this tracing yet, so add it.
> 
> When a read-error is detected we redirect the request to a different device.
> This could justifiably be seen as a new mapping for the originial bio,
> or a secondary mapping for the bio that errors.  This patch uses
> the second option.
> 
> When md is used under dm-raid, the mappings are not traced as we do
> not have access to the block device number of the parent.

Looks the the original sector (the last parameter of trace_block_bio_remap)
isn't correct.
- in linear/raid0, bio_split already updated bio->bi_iter.bi_sector
- in raid1/raid10, r1_bio->sector is updated before the bio is sent.

Thanks,
Shaohua

^ permalink raw reply

* Re: recovering failed raid5
From: Wols Lists @ 2016-11-16 16:38 UTC (permalink / raw)
  To: Alexander Shenkin, linux-raid, rm, robin
In-Reply-To: <275e84b8-4a2c-5ef2-0e88-5dfe28e42fc9@shenkin.org>

On 16/11/16 15:50, Alexander Shenkin wrote:
> 
> 
> On 11/16/2016 3:35 PM, Wols Lists wrote:
>> On 16/11/16 09:04, Alexander Shenkin wrote:
>>> Hello all,
>>>
>>> As a quick reminder, my sdb failed in a 4-disk RAID5, and then sdc
>>> failed when trying to replace sdb.  I'm now trying to recover sdc with
>>> ddrescue.
>>>
>>> After much back and forth, I've finally got ddrescue running to
>>> replicate my apparently-faulty sdc.  I'm ddrescue'ing from a seagate 3TB
>>> to a toshiba 3TB drive, and I'm getting a 'No space left on device
>>> error'.  Any thoughts?
>>>
>>> One further question: should I also try to ddrescue my original failed
>>> sdb in the hopes that anything lost on sdc would be covered by the
>>> recovered sdb?
>>
>> Depends how badly out of sync the event counts are. However, I note that
>> your ddrescue copy appeared to run without any errors (apart from
>> falling off the end of the drive :-) ?
> 
> Thanks Wol.
> 
> From my newbie reading, it looked like there was on 65kb error... but
> i'm not sure how to tell if it got read properly by ddrescue in the end
> - any tips?  I don't see any "retrying bad sectors" (-) lines in the
> logfile below...
> 
> username@Ubuntu-VirtualBox:~$ sudo ddrescue -d -f -r3 /dev/sdb /dev/sdc
> ~/rescue.logfile
> [sudo] password for username:
> GNU ddrescue 1.19
> Press Ctrl-C to interrupt
> rescued:     3000 GB,  errsize:   65536 B,  current rate:   55640 kB/s
>    ipos:     3000 GB,   errors:       1,    average rate:   83070 kB/s
>    opos:     3000 GB, run time:   10.03 h,  successful read:       0 s ago
> Copying non-tried blocks... Pass 1 (forwards)
> ddrescue: Write error: No space left on device
> 
> # Rescue Logfile. Created by GNU ddrescue version 1.19
> # Command line: ddrescue -d -f -r3 /dev/sdb /dev/sdc
> /home/username/rescue.logfile
> # Start time:   2016-11-15 13:54:24
> # Current time: 2016-11-15 23:56:25
> # Copying non-tried blocks... Pass 1 (forwards)
> # current_pos  current_status
> 0x2BAA1470000     ?
> #      pos        size  status
> 0x00000000  0x7F5A0000  +
> 0x7F5A0000  0x00010000  *
> 0x7F5B0000  0x00010000  ?
> 0x7F5C0000  0x2BA21EB0000  +
> 0x2BAA1470000  0x00006000  ?
> 
>>
>> In which case, you haven't lost anything on sdc. Which is why the wiki
>> says don't mount your array writeable while you're trying to recover it
>> - you're not going to muck up your data and have user-space provoke
>> further errors.
> 
> gotcha - i'm doing this with removed drives on a different (virtual)
> machine.  Seemed like the arrays were getting mounted read-only by
> default when the disks were having issues...
> 
>>
>> If the array barfs while it's rebuilding, it's hopefully just a
>> transient, and do another assemble with --force to get it back again.
> 
> so, i guess i put the copied drive back in as sdc, and a new blank drive
> as sdb, add sdb, and just let it rebuild from there?  Or, do I issue
> this command as appropriate?
> 
> mdadm --force --assemble /dev/mdN /dev/sd[XYZ]1

Let me get my thoughts straight - cross check what I'm writing but ...

sda and sdd have never failed. sdc is the new drive you've ddrescue'd onto.

So in order to get a working array, you need to do
"mdadm --assemble /dev/sd[adc]n"
This will give you a working, degraded array, which unfortunately
probably has a little bit of corruption - whatever you were writing when
the array first failed will not have been saved properly. You've
basically recovered the array with the two drives that are okay, and a
copy of the drive that failed most recently.

IFF the smarts report that your two failed drives are okay, then you can
add them back in. I'm hoping it was just the timeout problem - with
Barracudas that's quite likely.

MAKE SURE that you've run the timeout script on all the Barracudas, or
the array is simply going to crash again.

WIPE THE SUPERBLOCKS on the old drives. I'm not sure what the mdadm
command is, but we're adding them back in as new drives.

mdadm --add /dev/old-b /dev/old-c

This will think they are two new drives and will rebuild on to one of
them. You can then convert the array to raid 6 and it will rebuild on to
the other one.

Once you've got back to a fully-working raid-5, you can do a fsck on the
filesystem(s) to find the corruption.

Lastly, if you can get another Toshiba drive, add that in as a spare.

This will leave you with a 6-drive raid-6 - 3xdata, 2xparity, 1xspare.

If the smarts report that any of your barracudas have a load of errors,
it's not worth faffing about with them. Bin them and replace them.

Going back to an earlier point of yours - DO NOT try to force re-add the
first drive that failed back into the array. The mismatch in event count
will mean loads of corruption.

Cheers,
Wol
> 
>>
>> Once you've got the array properly back up again :-
>>
>> 1) make sure that the timeout script is run EVERY BOOT to fix the kernel
>> defaults for your remaining barracudas.
>>
>> 2) make sure smarts are enabled EVERY BOOT because barracudas forget
>> their settings on power-off.
>>
>> 3) You've now got a spare drive. If a smart self-check comes back pretty
>> clean and it looks like a transient problem not a dud drive, then put it
>> back in and convert the array to raid 6.
>>
>> 4) MONITOR MONITOR MONITOR
>>
>> You've seen the comments elsewhere about the 3TB barracudas? Barracudas
>> in general aren't bad drives, but the 3TB model has a reputation for
>> dying early and quickly. You can then plan to replace the drives at your
>> leisure, knowing that provided you catch any failure, you've still got
>> redundancy with one dead drive in a raid-6. Even better, get another
>> Toshiba and go raid-6+spare. And don't say you haven't got enough sata
>> ports - an add-in card is about £20 :-)
>>
>> Cheers,
>> Wol
>>
> 


^ permalink raw reply

* Re: recovering failed raid5
From: Alexander Shenkin @ 2016-11-16 15:50 UTC (permalink / raw)
  To: Wols Lists, linux-raid, rm, robin
In-Reply-To: <582C7CC3.40901@youngman.org.uk>



On 11/16/2016 3:35 PM, Wols Lists wrote:
> On 16/11/16 09:04, Alexander Shenkin wrote:
>> Hello all,
>>
>> As a quick reminder, my sdb failed in a 4-disk RAID5, and then sdc
>> failed when trying to replace sdb.  I'm now trying to recover sdc with
>> ddrescue.
>>
>> After much back and forth, I've finally got ddrescue running to
>> replicate my apparently-faulty sdc.  I'm ddrescue'ing from a seagate 3TB
>> to a toshiba 3TB drive, and I'm getting a 'No space left on device
>> error'.  Any thoughts?
>>
>> One further question: should I also try to ddrescue my original failed
>> sdb in the hopes that anything lost on sdc would be covered by the
>> recovered sdb?
>
> Depends how badly out of sync the event counts are. However, I note that
> your ddrescue copy appeared to run without any errors (apart from
> falling off the end of the drive :-) ?

Thanks Wol.

 From my newbie reading, it looked like there was on 65kb error... but 
i'm not sure how to tell if it got read properly by ddrescue in the end 
- any tips?  I don't see any "retrying bad sectors" (-) lines in the 
logfile below...

username@Ubuntu-VirtualBox:~$ sudo ddrescue -d -f -r3 /dev/sdb /dev/sdc 
~/rescue.logfile
[sudo] password for username:
GNU ddrescue 1.19
Press Ctrl-C to interrupt
rescued:     3000 GB,  errsize:   65536 B,  current rate:   55640 kB/s
    ipos:     3000 GB,   errors:       1,    average rate:   83070 kB/s
    opos:     3000 GB, run time:   10.03 h,  successful read:       0 s ago
Copying non-tried blocks... Pass 1 (forwards)
ddrescue: Write error: No space left on device

# Rescue Logfile. Created by GNU ddrescue version 1.19
# Command line: ddrescue -d -f -r3 /dev/sdb /dev/sdc 
/home/username/rescue.logfile
# Start time:   2016-11-15 13:54:24
# Current time: 2016-11-15 23:56:25
# Copying non-tried blocks... Pass 1 (forwards)
# current_pos  current_status
0x2BAA1470000     ?
#      pos        size  status
0x00000000  0x7F5A0000  +
0x7F5A0000  0x00010000  *
0x7F5B0000  0x00010000  ?
0x7F5C0000  0x2BA21EB0000  +
0x2BAA1470000  0x00006000  ?

>
> In which case, you haven't lost anything on sdc. Which is why the wiki
> says don't mount your array writeable while you're trying to recover it
> - you're not going to muck up your data and have user-space provoke
> further errors.

gotcha - i'm doing this with removed drives on a different (virtual) 
machine.  Seemed like the arrays were getting mounted read-only by 
default when the disks were having issues...

>
> If the array barfs while it's rebuilding, it's hopefully just a
> transient, and do another assemble with --force to get it back again.

so, i guess i put the copied drive back in as sdc, and a new blank drive 
as sdb, add sdb, and just let it rebuild from there?  Or, do I issue 
this command as appropriate?

mdadm --force --assemble /dev/mdN /dev/sd[XYZ]1

>
> Once you've got the array properly back up again :-
>
> 1) make sure that the timeout script is run EVERY BOOT to fix the kernel
> defaults for your remaining barracudas.
>
> 2) make sure smarts are enabled EVERY BOOT because barracudas forget
> their settings on power-off.
>
> 3) You've now got a spare drive. If a smart self-check comes back pretty
> clean and it looks like a transient problem not a dud drive, then put it
> back in and convert the array to raid 6.
>
> 4) MONITOR MONITOR MONITOR
>
> You've seen the comments elsewhere about the 3TB barracudas? Barracudas
> in general aren't bad drives, but the 3TB model has a reputation for
> dying early and quickly. You can then plan to replace the drives at your
> leisure, knowing that provided you catch any failure, you've still got
> redundancy with one dead drive in a raid-6. Even better, get another
> Toshiba and go raid-6+spare. And don't say you haven't got enough sata
> ports - an add-in card is about £20 :-)
>
> Cheers,
> Wol
>

^ permalink raw reply

* Re: recovering failed raid5
From: Wols Lists @ 2016-11-16 15:35 UTC (permalink / raw)
  To: Alexander Shenkin, linux-raid, rm, robin
In-Reply-To: <194ae464-2d19-cefd-055f-6f4a33676ab2@shenkin.org>

On 16/11/16 09:04, Alexander Shenkin wrote:
> Hello all,
> 
> As a quick reminder, my sdb failed in a 4-disk RAID5, and then sdc
> failed when trying to replace sdb.  I'm now trying to recover sdc with
> ddrescue.
> 
> After much back and forth, I've finally got ddrescue running to
> replicate my apparently-faulty sdc.  I'm ddrescue'ing from a seagate 3TB
> to a toshiba 3TB drive, and I'm getting a 'No space left on device
> error'.  Any thoughts?
> 
> One further question: should I also try to ddrescue my original failed
> sdb in the hopes that anything lost on sdc would be covered by the
> recovered sdb?

Depends how badly out of sync the event counts are. However, I note that
your ddrescue copy appeared to run without any errors (apart from
falling off the end of the drive :-) ?

In which case, you haven't lost anything on sdc. Which is why the wiki
says don't mount your array writeable while you're trying to recover it
- you're not going to muck up your data and have user-space provoke
further errors.

If the array barfs while it's rebuilding, it's hopefully just a
transient, and do another assemble with --force to get it back again.

Once you've got the array properly back up again :-

1) make sure that the timeout script is run EVERY BOOT to fix the kernel
defaults for your remaining barracudas.

2) make sure smarts are enabled EVERY BOOT because barracudas forget
their settings on power-off.

3) You've now got a spare drive. If a smart self-check comes back pretty
clean and it looks like a transient problem not a dud drive, then put it
back in and convert the array to raid 6.

4) MONITOR MONITOR MONITOR

You've seen the comments elsewhere about the 3TB barracudas? Barracudas
in general aren't bad drives, but the 3TB model has a reputation for
dying early and quickly. You can then plan to replace the drives at your
leisure, knowing that provided you catch any failure, you've still got
redundancy with one dead drive in a raid-6. Even better, get another
Toshiba and go raid-6+spare. And don't say you haven't got enough sata
ports - an add-in card is about £20 :-)

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 0/4] IMSM: Add support for 4Kn sector size drives
From: Jes Sorensen @ 2016-11-16 15:14 UTC (permalink / raw)
  To: Pawel Baldysiak; +Cc: linux-raid
In-Reply-To: <1478788098-32041-1-git-send-email-pawel.baldysiak@intel.com>

Pawel Baldysiak <pawel.baldysiak@intel.com> writes:
> This patch set adds support for IMSM with 4Kn sector size drives
> First patch adds the generic function for receiving sector size,
> rest are IMSM specific.
> Internal calculation are still based on 512-bytes sector,
> variables are converted during read/write from/to member drive.
> Mixing of devices with different sector size is not allowed.
>
> Pawel Baldysiak (4):
>   Add function for getting member drive sector size
>   IMSM: Read and store device sector size
>   IMSM: Add support for 4Kn sector size drives
>   IMSM: 4Kn drives support - adapt general migration record
>
>  mdadm.h       |   1 +
>  super-intel.c | 315 +++++++++++++++++++++++++++++++++++++++++++++-------------
>  super1.c      |   3 +-
>  util.c        |  16 +++
>  4 files changed, 265 insertions(+), 70 deletions(-)

Hi Pawel,

This set mostly looks good - a couple of comments:

+int get_dev_sector_size(int fd, char *dname, unsigned int *sectsizep)

This introduces a *dname but nowhere in your code is it actually used. I
am not necessarily against this, and it looks like we do it in some
places, but not others. However do you anticipate using it in future
changes you have lined up?

I noticed you changed hard coded 512 byte limits to hard coded 4096
when rounding up sizes for posix_memalign() etc. Wouldn't it be cleaner
to introduce a MAX_SECTOR_SIZE or similar?

Cheers,
Jes

^ permalink raw reply

* Re: Ddf based RAID management software
From: Jes Sorensen @ 2016-11-16 15:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Arka Sharma, linux-raid
In-Reply-To: <877f86k3we.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> On Sun, Nov 13 2016, Arka Sharma wrote:
>
>> Hello All,
>>
>> Is there any tool apart from mdadm available which can create software
>> RAID based on Ddf metadata. We want to dump the metadata content and
>> tally with metadata written by mdadm and our application.
>
> Not that I'm aware of.
> dmraid can read some ddf metadata, but I don't think it will create new
> metadata.

I thought some BIOS code would be able to do it, similar to how Intel's
IMSM BIOS does it?

Jes

^ permalink raw reply

* Re: [PATCH 2/2] super1: fix setting bad block log offset in write_init_super1()
From: Jes Sorensen @ 2016-11-16 14:58 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20161110105054.29869-2-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> Commit f79bbf4f6904 ("super1: don't put the bblog at the end of the free
> space.") changed the location of the bad block log to be after the
> write-intent bitmap, but a fixed offset was used and it can make bbl
> overlap with the bitmap, especially when using a small bitmap chunk.
> This patch changes it to use the actual offset and size of the bitmap.
> It also joins the cases for v1.1 and v1.2 superblock because the code
> was very similar.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  super1.c | 47 +++++++++++++++++++++++------------------------
>  1 file changed, 23 insertions(+), 24 deletions(-)

Applied!

Thanks,
Jes

>
> diff --git a/super1.c b/super1.c
> index 982d88c..1d03a0a 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -1693,6 +1693,7 @@ static int write_init_super1(struct supertype *st)
>  	unsigned long long dsize, array_size;
>  	unsigned long long sb_offset;
>  	unsigned long long data_offset;
> +	long bm_offset;
>  
>  	for (di = st->info; di; di = di->next) {
>  		if (di->disk.state & (1 << MD_DISK_JOURNAL))
> @@ -1760,15 +1761,25 @@ static int write_init_super1(struct supertype *st)
>  		 * data_offset has already been set.
>  		 */
>  		array_size = __le64_to_cpu(sb->size);
> -		/* work out how much space we left for a bitmap,
> -		 * Add 8 sectors for bad block log */
> -		bm_space = choose_bm_space(array_size) + 8;
> +
> +		/* work out how much space we left for a bitmap */
> +		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
> +			bitmap_super_t *bms = (bitmap_super_t *)
> +					(((char *)sb) + MAX_SB_SIZE);
> +			bm_space = calc_bitmap_size(bms, 4096) >> 9;
> +			bm_offset = (long)__le32_to_cpu(sb->bitmap_offset);
> +		} else {
> +			bm_space = choose_bm_space(array_size);
> +			bm_offset = 8;
> +		}
>  
>  		data_offset = di->data_offset;
>  		if (data_offset == INVALID_SECTORS)
>  			data_offset = st->data_offset;
>  		switch(st->minor_version) {
>  		case 0:
> +			/* Add 8 sectors for bad block log */
> +			bm_space += 8;
>  			if (data_offset == INVALID_SECTORS)
>  				data_offset = 0;
>  			sb_offset = dsize;
> @@ -1785,38 +1796,26 @@ static int write_init_super1(struct supertype *st)
>  			}
>  			break;
>  		case 1:
> -			sb->super_offset = __cpu_to_le64(0);
> -			if (data_offset == INVALID_SECTORS)
> -				data_offset = 16;
> -
> -			sb->data_offset = __cpu_to_le64(data_offset);
> -			sb->data_size = __cpu_to_le64(dsize - data_offset);
> -			if (data_offset >= 8 + 32*2 + 8) {
> -				sb->bblog_size = __cpu_to_le16(8);
> -				sb->bblog_offset = __cpu_to_le32(8 + 32*2);
> -			} else if (data_offset >= 16) {
> -				sb->bblog_size = __cpu_to_le16(8);
> -				sb->bblog_offset = __cpu_to_le32(data_offset-8);
> -			}
> -			break;
>  		case 2:
> -			sb_offset = 4*2;
> +			sb_offset = st->minor_version == 2 ? 8 : 0;
>  			sb->super_offset = __cpu_to_le64(sb_offset);
>  			if (data_offset == INVALID_SECTORS)
> -				data_offset = 24;
> +				data_offset = sb_offset + 16;
>  
>  			sb->data_offset = __cpu_to_le64(data_offset);
>  			sb->data_size = __cpu_to_le64(dsize - data_offset);
> -			if (data_offset >= 16 + 32*2 + 8) {
> +			if (data_offset >= sb_offset+bm_offset+bm_space+8) {
>  				sb->bblog_size = __cpu_to_le16(8);
> -				sb->bblog_offset = __cpu_to_le32(8 + 32*2);
> -			} else if (data_offset >= 16+16) {
> +				sb->bblog_offset = __cpu_to_le32(bm_offset +
> +								 bm_space);
> +			} else if (data_offset >= sb_offset + 16) {
>  				sb->bblog_size = __cpu_to_le16(8);
> -				/* '8' sectors for the bblog, and another '8'
> +				/* '8' sectors for the bblog, and 'sb_offset'
>  				 * because we want offset from superblock, not
>  				 * start of device.
>  				 */
> -				sb->bblog_offset = __cpu_to_le32(data_offset-8-8);
> +				sb->bblog_offset = __cpu_to_le32(data_offset -
> +								 8 - sb_offset);
>  			}
>  			break;
>  		default:

^ permalink raw reply

* Re: [PATCH 1/2] super1: make internal bitmap size calculations more consistent
From: Jes Sorensen @ 2016-11-16 14:57 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20161110105054.29869-1-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> Determining internal bitmap size is performed using two different
> functions (bitmap_sectors() and calc_bitmap_size()) and in
> getinfo_super1() it is calculated in yet another way. Each of these
> methods give slightly different results. The most accurate is
> calc_bitmap_size() but it also has a rounding issue. So:
>
> - fix the rounding issue in calc_bitmap_size() using bitmap_bits()
> - replace usages of bitmap_sectors() and open-coded calculations with
>   calc_bitmap_size()
> - remove bitmap_sectors()
> - move bitmap_bits() to mdadm.h as inline - otherwise mdassemble won't
>   compile (it does not use bitmap.c)
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  bitmap.c | 15 ---------------
>  mdadm.h  |  9 ++++++++-
>  super1.c | 25 +++++++++----------------
>  3 files changed, 17 insertions(+), 32 deletions(-)

Applied!

However, in the future please include a cover letter for multi-patch
sets.

Thanks,
Jes

> diff --git a/bitmap.c b/bitmap.c
> index 6c1b8d8..ccedfd3 100644
> --- a/bitmap.c
> +++ b/bitmap.c
> @@ -108,21 +108,6 @@ static int count_dirty_bits(char *buf, int num_bits)
>  	return num;
>  }
>  
> -/* calculate the size of the bitmap given the array size and bitmap chunksize */
> -static unsigned long long
> -bitmap_bits(unsigned long long array_size, unsigned long chunksize)
> -{
> -	return (array_size * 512 + chunksize - 1) / chunksize;
> -}
> -
> -unsigned long bitmap_sectors(struct bitmap_super_s *bsb)
> -{
> -	unsigned long long bits = bitmap_bits(__le64_to_cpu(bsb->sync_size),
> -					      __le32_to_cpu(bsb->chunksize));
> -	int bits_per_sector = 8*512;
> -	return (bits + bits_per_sector - 1) / bits_per_sector;
> -}
> -
>  static bitmap_info_t *bitmap_fd_read(int fd, int brief)
>  {
>  	/* Note: fd might be open O_DIRECT, so we must be
> diff --git a/mdadm.h b/mdadm.h
> index 0516c82..41a4494 100755
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -1331,7 +1331,14 @@ extern int CreateBitmap(char *filename, int force, char uuid[16],
>  extern int ExamineBitmap(char *filename, int brief, struct supertype *st);
>  extern int Write_rules(char *rule_name);
>  extern int bitmap_update_uuid(int fd, int *uuid, int swap);
> -extern unsigned long bitmap_sectors(struct bitmap_super_s *bsb);
> +
> +/* calculate the size of the bitmap given the array size and bitmap chunksize */
> +static inline unsigned long long
> +bitmap_bits(unsigned long long array_size, unsigned long chunksize)
> +{
> +	return (array_size * 512 + chunksize - 1) / chunksize;
> +}
> +
>  extern int Dump_metadata(char *dev, char *dir, struct context *c,
>  			 struct supertype *st);
>  extern int Restore_metadata(char *dev, char *dir, struct context *c,
> diff --git a/super1.c b/super1.c
> index 4fef378..982d88c 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -162,7 +162,8 @@ static unsigned int calc_bitmap_size(bitmap_super_t *bms, unsigned int boundary)
>  {
>  	unsigned long long bits, bytes;
>  
> -	bits = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
> +	bits = bitmap_bits(__le64_to_cpu(bms->sync_size),
> +			   __le32_to_cpu(bms->chunksize));
>  	bytes = (bits+7) >> 3;
>  	bytes += sizeof(bitmap_super_t);
>  	bytes = ROUND_UP(bytes, boundary);
> @@ -973,11 +974,7 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
>  		earliest = super_offset + (32+4)*2; /* match kernel */
>  		if (info->bitmap_offset > 0) {
>  			unsigned long long bmend = info->bitmap_offset;
> -			unsigned long long size = __le64_to_cpu(bsb->sync_size);
> -			size /= __le32_to_cpu(bsb->chunksize) >> 9;
> -			size = (size + 7) >> 3;
> -			size += sizeof(bitmap_super_t);
> -			size = ROUND_UP(size, 4096);
> +			unsigned long long size = calc_bitmap_size(bsb, 4096);
>  			size /= 512;
>  			bmend += size;
>  			if (bmend > earliest)
> @@ -1219,11 +1216,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>  	} else if (strcmp(update, "uuid") == 0) {
>  		copy_uuid(sb->set_uuid, info->uuid, super1.swapuuid);
>  
> -		if (__le32_to_cpu(sb->feature_map)&MD_FEATURE_BITMAP_OFFSET) {
> -			struct bitmap_super_s *bm;
> -			bm = (struct bitmap_super_s*)(st->sb+MAX_SB_SIZE);
> -			memcpy(bm->uuid, sb->set_uuid, 16);
> -		}
> +		if (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET)
> +			memcpy(bms->uuid, sb->set_uuid, 16);
>  	} else if (strcmp(update, "no-bitmap") == 0) {
>  		sb->feature_map &= ~__cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
>  	} else if (strcmp(update, "bbl") == 0) {
> @@ -1232,15 +1226,14 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>  		 */
>  		unsigned long long sb_offset = __le64_to_cpu(sb->super_offset);
>  		unsigned long long data_offset = __le64_to_cpu(sb->data_offset);
> -		long bitmap_offset = (long)(int32_t)__le32_to_cpu(sb->bitmap_offset);
> +		long bitmap_offset = 0;
>  		long bm_sectors = 0;
>  		long space;
>  
>  #ifndef MDASSEMBLE
>  		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
> -			struct bitmap_super_s *bsb;
> -			bsb = (struct bitmap_super_s *)(((char*)sb)+MAX_SB_SIZE);
> -			bm_sectors = bitmap_sectors(bsb);
> +			bitmap_offset = (long)__le32_to_cpu(sb->bitmap_offset);
> +			bm_sectors = calc_bitmap_size(bms, 4096) >> 9;
>  		}
>  #endif
>  		if (sb_offset < data_offset) {
> @@ -2120,7 +2113,7 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
>  		/* hot-add. allow for actual size of bitmap */
>  		struct bitmap_super_s *bsb;
>  		bsb = (struct bitmap_super_s *)(((char*)super)+MAX_SB_SIZE);
> -		bmspace = bitmap_sectors(bsb);
> +		bmspace = calc_bitmap_size(bsb, 4096) >> 9;
>  	}
>  #endif
>  	/* Allow space for bad block log */

^ permalink raw reply

* Re: [PATCH 2/8] imsm: write bad block log on metadata sync
From: Jes Sorensen @ 2016-11-16 14:47 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477925454-16809-2-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Pre-allocate memory for largest possible bad block sectionwhen monitor
> is being opened to avoid a need for memory allocation on metadata sync.
>
> If memory for a structure has been allocated in mpb buffer but it hasn't
> been used yet, it will be taken by next buffer grow, leading to
> insufficient memory on metadata flush. Start tracking such memory and
> take it into calculation when growing a buffer. Also assert has been
> added to debug mode to warn when more metadata has been written than
> memory allocated.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 49 +++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 5d6d534..0591c55 100644
> --- a/super-intel.c
> +++ b/super-intel.c

[snip]

> @@ -8854,6 +8886,9 @@ static void imsm_process_update(struct supertype *st,
>  		}
>  		break;
>  	}
> +	case update_prealloc_badblocks_mem: {
> +		break;
> +	}
>  	default:
>  		pr_err("error: unsuported process update type:(type: %d)\n",	type);
>  	}

Please don't include those awful brackets if you don't need them here.

> @@ -9094,6 +9129,11 @@ static int imsm_prepare_update(struct supertype *st,
>  	case update_add_remove_disk:
>  		/* no update->len needed */
>  		break;
> +	case update_prealloc_badblocks_mem: {
> +		super->extra_space += sizeof(struct bbm_log) -
> +			get_imsm_bbm_log_size(super->bbm_log);
> +		break;
> +	}
>  	default:
>  		return 0;
>  	}

Same here

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 1/8] imsm: parse bad block log in metadata on startup
From: Jes Sorensen @ 2016-11-16 14:43 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477925454-16809-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Always allocate memory for all log entries to avoid a need for memory
> allocation when monitor requests to record a bad block.
>
> Also some extra checks added to make static code analyzer happy.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 158 +++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 112 insertions(+), 46 deletions(-)

Tomasz,

A couple of comments on this one:

First of all: *Always* provide a cover letter when submitting
multi-patch sets.

> diff --git a/super-intel.c b/super-intel.c
> index c146bbd..5d6d534 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -217,22 +217,24 @@ struct imsm_super {
>  } __attribute__ ((packed));
>  
>  #define BBM_LOG_MAX_ENTRIES 254
> +#define BBM_LOG_MAX_LBA_ENTRY_VAL 256		/* Represents 256 LBAs */
> +#define BBM_LOG_SIGNATURE 0xABADB10C

I know the old code is messy with regard to this, but lets get it right
from now on. Numbers are lower-case and #define NAMES are upper case.

> @@ -785,6 +787,92 @@ static struct imsm_dev *get_imsm_dev(struct intel_super *super, __u8 index)
>  	return NULL;
>  }
>  
> +#if BYTE_ORDER == LITTLE_ENDIAN
> +static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
> +					       *addr)
> +{
> +	return ((((__u64)addr->dw1) << 16) | addr->w1);
> +}
> +
> +static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
> +{
> +	struct bbm_log_block_addr addr;
> +
> +	addr.w1 =  (__u16)(sec & 0xFFFF);
> +	addr.dw1 = (__u32)((sec >> 16) & 0xFFFFFFFF);
> +	return addr;
> +}

Again, 0xffff/0xffffffff

> +#elif BYTE_ORDER == BIG_ENDIAN
> +static inline unsigned long long __le48_to_cpu(const struct bbm_log_block_addr
> +					       *addr)
> +{
> +	return ((((__u64)__le32_to_cpu(addr->dw1)) << 16) |
> +		__le16_to_cpu(addr->w1));
> +}
> +
> +static inline struct bbm_log_block_addr __cpu_to_le48(unsigned long long sec)
> +{
> +	struct bbm_log_block_addr addr;
> +
> +	addr.w1 =  __cpu_to_le16((__u16)(sec & 0xFFFF));
> +	addr.dw1 = __cpu_to_le32((__u32)(sec >> 16) & 0xFFFFFFFF);
> +	return addr;
> +}

Given that __cpu_to_le*()/__le*_to_cpu*() are no-ops on little-endian,
you don't really need two versions of these. The big-endian version
should suffice for both, which makes it far less cluttered. Unless I got
something wrong here of course.

> +#else
> +#  error "unknown endianness."
> +#endif
> +
> +#ifndef MDASSEMBLE
> +/* get size of the bbm log */
> +static __u32 get_imsm_bbm_log_size(struct bbm_log *log)
> +{
> +	if (!log || log->entry_count == 0)
> +		return 0;
> +
> +	return sizeof(log->signature) +
> +		sizeof(log->entry_count) +
> +		log->entry_count * sizeof(struct bbm_log_entry);
> +}
> +#endif /* MDASSEMBLE */
> +
> +/* allocate and load BBM log from metadata */
> +static int load_bbm_log(struct intel_super *super)
> +{
> +	struct imsm_super *mpb = super->anchor;
> +	__u32 bbm_log_size =  __le32_to_cpu(mpb->bbm_log_size);
> +
> +	super->bbm_log = malloc(sizeof(struct bbm_log));
> +	if (!super->bbm_log)
> +		return 1;
> +
> +	if (bbm_log_size) {
> +		struct bbm_log *log = (void *)mpb +
> +			__le32_to_cpu(mpb->mpb_size) - bbm_log_size;
> +		__u32 entry_count;

Put the assignment on a separate line please when it's this long.

> +
> +		if (bbm_log_size < sizeof(log->signature) +
> +		    sizeof(log->entry_count))
> +			return 2;
> +
> +		entry_count = __le32_to_cpu(log->entry_count);
> +		if ((__le32_to_cpu(log->signature) != BBM_LOG_SIGNATURE) ||
> +		    (entry_count > BBM_LOG_MAX_ENTRIES))
> +			return 3;
> +
> +		if (bbm_log_size !=
> +		    sizeof(log->signature) + sizeof(log->entry_count) +
> +		    entry_count * sizeof(struct bbm_log_entry))
> +			return 4;
> +
> +		memcpy(super->bbm_log, log, bbm_log_size);
> +	} else {
> +		super->bbm_log->signature = __cpu_to_le32(BBM_LOG_SIGNATURE);
> +		super->bbm_log->entry_count = 0;
> +	}

This part looks problematic - you don't clear super->bbm_log and if
bbm_log_size == 0 you end up leaving it with random data. Wouldn't it be
better to just use calloc()?

Jes

^ permalink raw reply

* Re: [PATCH v2] RAID1: Avoid unnecessary loop to decrease conf->nr_queued in raid1d()
From: Coly Li @ 2016-11-16 14:36 UTC (permalink / raw)
  To: linux-raid; +Cc: Shaohua Li, Neil Brown
In-Reply-To: <1479305968-18473-1-git-send-email-colyli@suse.de>

在 2016/11/16 下午10:19, Coly Li 写道:
[snip]
> ---
>  drivers/md/raid1.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> Index: linux-raid1/drivers/md/raid1.c
> ===================================================================
> --- linux-raid1.orig/drivers/md/raid1.c
> +++ linux-raid1/drivers/md/raid1.c
> @@ -2387,17 +2387,17 @@ static void raid1d(struct md_thread *thr
[snip]
>  		while (!list_empty(&tmp)) {
>  			r1_bio = list_first_entry(&tmp, struct r1bio,
>  						  retry_list);
>  			list_del(&r1_bio->retry_list);
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			conf->nr_queued--;
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
[snip]

Now I work on another 2 patches for a simpler I/O barrier, and a
lockless I/O submit on raid1, where conf->nr_queued will be in atomic_t.
So spin lock expense will not exist any more. Just FYI.

Coly



^ permalink raw reply

* Re: [md PATCH 2/4] md: add bio completion tracing for raid1/raid10
From: Christoph Hellwig @ 2016-11-16 14:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid
In-Reply-To: <147910142112.27168.16834874155342134173.stgit@noble>

On Mon, Nov 14, 2016 at 04:30:21PM +1100, NeilBrown wrote:
> raid5 already has this, as does dm.
> linear and raid0 do no see completions, only bio_chain_end() or bio_endio()
> see those.
> So just add it for raid1 and raid10.
> 
> Between
>  Commit: 3a366e614d08 ("block: add missing block_bio_complete() tracepoint")
> and
>  Commit: 0a82a8d132b2 ("Revert "block: add missing block_bio_complete() tracepoint"")
> in the 3.9-rc series, this was done centrally in bio_endio().
> Until/unless that is resurected, do the tracing in the md/raid code.

We're working on getting it back for 4.10, so please don't add these
tracepoints to the MD driver for now.  Next time please also ask
linux-block first and/or Cc the list on a patch like this.

^ permalink raw reply

* [PATCH v2] RAID1: Avoid unnecessary loop to decrease conf->nr_queued in raid1d()
From: Coly Li @ 2016-11-16 14:19 UTC (permalink / raw)
  To: linux-raid; +Cc: Coly Li, Shaohua Li, Neil Brown

commit ccfc7bf1f0 points out bios from conf->bio_end_io_list also
contribute to conf->nr_queued counter, that's right. But the fix
could be improved. The original fix replaces list_add() by a while()
loop to iterate every bio on conf->bio_end_io_list, and decrease
conf->nr_queued. If we look at the code several lines after, we may
find there is another while() loop to iterate every node from tmp
list which contains the original content of conf->bio_end_io_list.
Yes, we can decrease conf->nr_queued here, then we can avoid the
previous extra while() loop, which consumes more CPU cycles and hold
a spin lock longer time when conf->bio_end_io_list is not tiny. 

This patch changes to decrease conf->nr_queued in loop of
while(!list_empty(&tmp)){}, it avoids an extra loop execution, and
avoids holding conf->device_lock for too long time.

By suggestion from Neil, in this version of the patch, I use
list_splice_init() interface to operate conf->bio_end_io_list. 

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
---
 drivers/md/raid1.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -2387,17 +2387,17 @@ static void raid1d(struct md_thread *thr
 	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+	
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))

^ permalink raw reply

* Re: recovering failed raid5
From: Andreas Klauer @ 2016-11-16 13:59 UTC (permalink / raw)
  To: Alexander Shenkin; +Cc: linux-raid
In-Reply-To: <f82a86e0-e734-5b58-4796-63fe1923d49d@shenkin.org>

On Wed, Nov 16, 2016 at 01:27:55PM +0000, Alexander Shenkin wrote:
> Does that mean that ddrescue has copied the 512 partition tables 
> to my 4k drive, and hence I just need to fix the partition table 
> on the new 4k drive?

Just so.

And the missing 4096 bytes don't matter, you had enough unpartitioned space.
It doesn't affect your data at all.

As for the 128K I counted all non+, if it's only 64K then even better.

Regards
Andreas Klauer

^ permalink raw reply

* Re: recovering failed raid5
From: Alexander Shenkin @ 2016-11-16 13:27 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161116111407.GA3348@metamorpher.de>

Thanks Andreas - replies below.

On 11/16/2016 11:14 AM, Andreas Klauer wrote:
> On Wed, Nov 16, 2016 at 09:04:29AM +0000, Alexander Shenkin wrote:
>> I'm getting a 'No space left on device error'.  Any thoughts?
>
> It's smaller by 4096 bytes, that's probably not a problem.
> ddrescue seems to have failed to copy 128K of data,
> but that's probably not a big problem either.

Regarding the failed copy, where do you see the 128k?  I see that there 
was an errsize of 65536 bytes, but I'm not sure how to tell if that was 
ever able to be read by ddrescue... i suspect ddrescue ran out of disk 
space before it was able to retry those unreadable bytes...

Won't both issues be problematic (i.e. unread data + out of space)?  I 
believe the last drive partition is the one that participates in the 
array that gets mounted as "/"... so, can't really just throw that one 
away...

https://i.imgur.com/SMdCo12.png

>
> Your problem is something else:
>
>> Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> Disklabel type: gpt
>
>> Disk /dev/sdc: 2.7 TiB, 3000592977920 bytes, 732566645 sectors
>> Sector size (logical/physical): 4096 bytes / 4096 bytes
>> Disklabel type: dos
>
> The physical sector size is different.
>
> Unfortunately GPT partition scheme still depends on sector size and is
> inherently incompatible when dd(rescue)'d to a drive with different size.
>
> In theory it would be possible to ignore this, i.e. interpret GPT
> correctly on a 4K sector drive even if it was created for a 512b drive,
> or vice versa, but Linux is quite strict about standards in this case.
> (If Linux was smarter it would work in Linux but fail for Windows...)
>
> Anyway, you'll have to fixer-upper your GPT partition tables to 4K.
> gdisk has an expert -> recovery section that might be able to do so
> automagically, or you could just manually recreate with the correct
> _byte_ offsets (sector offset will be different).

So, just trying to understand the issue here...  The original (failed) 
drive had 512 byte sectors...  Does that mean that ddrescue has copied 
the 512 partition tables to my 4k drive, and hence I just need to fix 
the partition table on the new 4k drive?

>
> Your partitions are all MiB aligned so there are no alignment issues.
>
> Once the partition table is fixed you should be able to proceed normally.
>
> Regards
> Andreas Klauer
>

^ permalink raw reply

* Re: recovering failed raid5
From: Andreas Klauer @ 2016-11-16 11:14 UTC (permalink / raw)
  To: Alexander Shenkin; +Cc: linux-raid
In-Reply-To: <194ae464-2d19-cefd-055f-6f4a33676ab2@shenkin.org>

On Wed, Nov 16, 2016 at 09:04:29AM +0000, Alexander Shenkin wrote:
> I'm getting a 'No space left on device error'.  Any thoughts?

It's smaller by 4096 bytes, that's probably not a problem. 
ddrescue seems to have failed to copy 128K of data, 
but that's probably not a big problem either.

Your problem is something else:

> Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
> Sector size (logical/physical): 512 bytes / 512 bytes
> Disklabel type: gpt

> Disk /dev/sdc: 2.7 TiB, 3000592977920 bytes, 732566645 sectors
> Sector size (logical/physical): 4096 bytes / 4096 bytes
> Disklabel type: dos

The physical sector size is different.

Unfortunately GPT partition scheme still depends on sector size and is 
inherently incompatible when dd(rescue)'d to a drive with different size.

In theory it would be possible to ignore this, i.e. interpret GPT 
correctly on a 4K sector drive even if it was created for a 512b drive, 
or vice versa, but Linux is quite strict about standards in this case. 
(If Linux was smarter it would work in Linux but fail for Windows...)

Anyway, you'll have to fixer-upper your GPT partition tables to 4K. 
gdisk has an expert -> recovery section that might be able to do so 
automagically, or you could just manually recreate with the correct 
_byte_ offsets (sector offset will be different).

Your partitions are all MiB aligned so there are no alignment issues.

Once the partition table is fixed you should be able to proceed normally.

Regards
Andreas Klauer

^ permalink raw reply

* Re: recovering failed raid5
From: Alexander Shenkin @ 2016-11-16  9:04 UTC (permalink / raw)
  To: linux-raid, Andreas Klauer, rm, robin
In-Reply-To: <20161028133626.GA27462@cthulhu.home.robinhill.me.uk>

Hello all,

As a quick reminder, my sdb failed in a 4-disk RAID5, and then sdc 
failed when trying to replace sdb.  I'm now trying to recover sdc with 
ddrescue.

After much back and forth, I've finally got ddrescue running to 
replicate my apparently-faulty sdc.  I'm ddrescue'ing from a seagate 3TB 
to a toshiba 3TB drive, and I'm getting a 'No space left on device 
error'.  Any thoughts?

One further question: should I also try to ddrescue my original failed 
sdb in the hopes that anything lost on sdc would be covered by the 
recovered sdb?

Logs below... (below, /dev/sdb is the original failed seagate sdc, and 
/dev/sdc below is the new bare toshiba drive).

Thanks,
Allie


username@Ubuntu-VirtualBox:~$ sudo ddrescue -d -f -r3 /dev/sdb /dev/sdc 
~/rescue.logfile
[sudo] password for username:
GNU ddrescue 1.19
Press Ctrl-C to interrupt
rescued:     3000 GB,  errsize:   65536 B,  current rate:   55640 kB/s
    ipos:     3000 GB,   errors:       1,    average rate:   83070 kB/s
    opos:     3000 GB, run time:   10.03 h,  successful read:       0 s ago
Copying non-tried blocks... Pass 1 (forwards)
ddrescue: Write error: No space left on device



username@Ubuntu-VirtualBox:~$ lsblk -o name,label,size,fstype,model
NAME   LABEL           SIZE FSTYPE            MODEL
sda                      8G                   VBOX HARDDISK
├─sda1                   6G ext4
├─sda2                   1K
└─sda5                   2G swap
sdb                    2.7T                   2105
├─sdb1 arrayname:0  1.9G linux_raid_member
├─sdb2                   1M
├─sdb3 arrayname:2  2.7T linux_raid_member
└─sdb4 arrayname:3  7.6G linux_raid_member
sdc                    2.7T                   Expan



username@Ubuntu-VirtualBox:~$ fdisk -l

[...]

Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 4B356AFA-8F48-4227-86F0-329565146D7A

Device          Start        End    Sectors  Size Type
/dev/sdb1        2048    3905535    3903488  1.9G Linux RAID
/dev/sdb2     3905536    3907583       2048    1M BIOS boot
/dev/sdb3     3907584 5844547583 5840640000  2.7T Linux RAID
/dev/sdb4  5844547584 5860532223   15984640  7.6G Linux RAID


Disk /dev/sdc: 2.7 TiB, 3000592977920 bytes, 732566645 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start        End    Sectors Size Id Type
/dev/sdc1           1 4294967295 4294967295  16T ee GPT



username@Ubuntu-VirtualBox:~$ cat rescue.logfile
# Rescue Logfile. Created by GNU ddrescue version 1.19
# Command line: ddrescue -d -f -r3 /dev/sdb /dev/sdc 
/home/username/rescue.logfile
# Start time:   2016-11-15 13:54:24
# Current time: 2016-11-15 23:56:25
# Copying non-tried blocks... Pass 1 (forwards)
# current_pos  current_status
0x2BAA1470000     ?
#      pos        size  status
0x00000000  0x7F5A0000  +
0x7F5A0000  0x00010000  *
0x7F5B0000  0x00010000  ?
0x7F5C0000  0x2BA21EB0000  +


On 10/28/2016 2:36 PM, Robin Hill wrote:
> On Fri Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
>
>> Thanks Andreas, much appreciated.  Your points about selftests and smart
>> are well taken, and i'll implement them once i get this back up.  I'll
>> buy yet another new, non drive-from-hell (yes Roman, I did buy the same
>> damn drive again.  Will try to return it, thanks for the heads up...)
>> and follow your instructions below.
>>
>> One remaining question: is sdc definitely toast?  Or, is it possible
>> that the Timeout Mismatch (as mentioned by Robin Hill; thanks Robin) is
>> flagging the drive as failed, when something else is at play and perhaps
>> the drive is actually fine?
>>
> It's not definitely toast, no (but this is unrelated to the Timeout
> mismatches). It has some pending reallocations, which means the drive
> was unable to read from some blocks - if a write to the blocks fails
> then one of the spare blocks will be reallocated instead, but a write
> will often succeed and the pending reallocation will just be cleared.
>
> Unfortunately, reconstruction of the array depends on this data being
> readable, so the fact the drive isn't toast doesn't necessarily help.
> I'd suggest replicating (using ddrescue) that drive to the new one (when
> it arrives) as a first step. It's possible ddrescue will manage to read
> the data (it'll make several attempts, so can sometimes read data that
> fails initially), otherwise you'll end up with some missing data
> (possibly corrupt files, possibly corrupt filesystem metadata, possibly
> just a bit of extra noise in an audio/video file). Once that's done, you
> can do a proper check on sdc (e.g. a badblocks read/write test), which
> will either lead to sector actually being reallocated, or to clearing
> the pending reallocations. Unless you get a lot more reallocated sectors
> than are currently pending, you can put the drive back into use if you
> like (bearing in mind the reputation of these drives and weighing the
> replacement cost against the value of your data).
>
> If you run a regular selftest on the array, these sort of issues would
> be picked up and repaired automatically (the read errors will trigger
> rewrites and either reallocate blocks, clear the pending reallocations,
> or fail the drive). Otherwise they're liable to come back to bite you
> when you're trying to recover from a different failure.
>
> Timeout Mismatches will lead to drives being failed from an otherwise
> healthy array - a read failure on the drive can't be corrected as the
> drive is still busy trying when the write request goes through, so the
> drive gets kicked out of the array. You didn't say what the issue was
> with your original sdb, but if it wasn't a definite fault then it may
> have been affected by a timeout mismatch.
>
> Cheers,
>     Robin
>
>> To everyone: sorry for the multiple posts.  Was having majordomo issues...
>>
>> On 10/27/2016 5:04 PM, Andreas Klauer wrote:
>>> On Thu, Oct 27, 2016 at 04:06:14PM +0100, Alexander Shenkin wrote:
>>>> md2: raid5 mounted on /, via sd[abcd]3
>>>
>>> Two failed disks...
>>>
>>>> md0: raid1 mounted on /boot, via sd[abcd]1
>>>
>>> Actually only two disks active in that one, the other two are spares.
>>> It hardly matters for /boot, but you could grow it to a 4 disk raid1.
>>> Spares are not useful.
>>>
>>>> My sdb was recently reporting problems.  Instead of second guessing
>>>> those problems, I just got a new disk, replaced it, and added it to
>>>> the arrays.
>>>
>>> Replacing right away is the right thing to do.
>>> Unfortunately it seems you have another disk that is broke too.
>>>
>>>> 2) smartctl (disabled on drives - can enable once back up.  should I?)
>>>> note: SMART only enabled after problems started cropping up.
>>>
>>> But... why? Why disable smart? And if you do, is it a surprise that you
>>> only notice disk failures when it's already too late?
>>
>> yeah, i asked myself that same question.  there was probably some reason
>> I did, but i don't remember what it was.  i'll keep smart enabled from
>> now on...
>>
>>> You should enable smart, and not only that, also run regular selftests,
>>> and have smartd running, and have it send you mail when something happens.
>>> Same with raid checks, raid checks are at least something but it won't
>>> tell you about how many reallocated sectors your drive has.
>>
>> will do
>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sda
>>>
>>> Looks fine but never ran a selftest.
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdb
>>>
>>> Looks new. (New drives need selftests too.)
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdc
>>>> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-39-generic] (local build)
>>>> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>
>>>> === START OF INFORMATION SECTION ===
>>>> Model Family:     Seagate Barracuda 7200.14 (AF)
>>>> Device Model:     ST3000DM001-1CH166
>>>> Serial Number:    W1F1N909
>>>>
>>>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    8
>>>> 198 Offline_Uncorrectable   ----C-   100   100   000    -    8
>>>
>>> This one is faulty and probably the reason why your resync failed.
>>> You have no redundancy left, so an option here would be to get a
>>> new drive and ddrescue it over.
>>>
>>> That's exactly the kind of thing you should be notified instantly
>>> about via mail. And it should be discovered when running selftests.
>>> Without full surface scan of the media, the disk itself won't know.
>>>
>>>> ==> WARNING: A firmware update for this drive may be available,
>>>> see the following Seagate web pages:
>>>> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
>>>> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
>>>
>>> About this, *shrug*
>>> I don't have these drives, you might want to check that out.
>>> But it probably won't fix bad sectors.
>>>
>>>> root@machinename:/home/username# smartctl --xall /dev/sdd
>>>
>>> Some strange things in the error log here, but old.
>>> Still, same as for all others - selftest.
>>>
>>>> ################### mdadm --examine ###########################
>>>>
>>>> /dev/sda1:
>>>>      Raid Level : raid1
>>>>    Raid Devices : 2
>>>
>>> A RAID 1 with two drives, could be four.
>>>
>>>> /dev/sdb1:
>>>> /dev/sdc1:
>>>
>>> So these would also have data instead of being spare.
>>>
>>>> /dev/sda3:
>>>>      Raid Level : raid5
>>>>    Raid Devices : 4
>>>>
>>>>     Update Time : Mon Oct 24 09:02:52 2016
>>>>          Events : 53547
>>>>
>>>>    Device Role : Active device 0
>>>>    Array State : A..A ('A' == active, '.' == missing)
>>>
>>> RAID-5 with two failed disks.
>>>
>>>> /dev/sdc3:
>>>>      Raid Level : raid5
>>>>    Raid Devices : 4
>>>>
>>>>     Update Time : Mon Oct 24 08:53:57 2016
>>>>          Events : 53539
>>>>
>>>>    Device Role : Active device 2
>>>>    Array State : AAAA ('A' == active, '.' == missing)
>>>
>>> This one failed, 8:53.
>>>
>>>> ############ /proc/mdstat ############################################
>>>>
>>>> md2 : active raid5 sda3[0] sdc3[2](F) sdd3[3]
>>>>       8760565248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2]
>>>> [U__U]
>>>
>>> [U__U] refers to device roles as in [0123],
>>> so device role 0 and 3 is okay, 1 and 2 missing.
>>>
>>>> md0 : active raid1 sdb1[4](S) sdc1[2](S) sda1[0] sdd1[3]
>>>>       1950656 blocks super 1.2 [2/2] [UU]
>>>
>>> Those two spares again, could be [UUUU] instead.
>>>
>>> tl;dr
>>> stop it all,
>>> ddrescue /dev/sdc to your new disk,
>>> try your luck with --assemble --force (not using /dev/sdc!),
>>> get yet another new disk, add, sync, cross fingers.
>>>
>>> There's also mdadm --replace instead of --remove, --add,
>>> that sometimes helps if there's only a few bad sectors
>>> on each disk. If the disk you already removed wasn't
>>> already kicked from the array by the time you replaced,
>>> maybe it would have avoided this problem.
>>>
>>> But good disk monitoring and testing is even more important.
>>
>> thanks a bunch, Andreas.  I'll monitor and test from now on...
>>
>>> Regards
>>> Andreas Klauer
>>
>

^ permalink raw reply

* Re: [PATCH v6 11/11] md/r5cache: handle alloc_page failure
From: Shaohua Li @ 2016-11-16  6:54 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-12-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:23PM -0800, Song Liu wrote:
> RMW of r5c write back cache uses an extra page to store old data for
> prexor. handle_stripe_dirtying() allocates this page by calling
> alloc_page(). However, alloc_page() may fail.
> 
> To handle alloc_page() failures, this patch adds a small mempool
> in r5l_log. When alloc_page fails, the code tries to allocate a
> page from the mempool.

this is too complicated. Can we just do rcw instead of rmw if page allocation
fails? This way we don't need allocate the memory. This will cause more IO but
since this a rare memory allocation failure case, I think it should be ok.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: Song Liu @ 2016-11-16  5:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <878tski63c.fsf@notabene.neil.brown.name>

> 
> 
>> +
>> +/*
>> + * this journal write must contain full parity,
>> + * it may also contain some data pages
>> + */
>> +static void r5c_handle_parity_cached(struct stripe_head *sh)
> 
> I don't understand how this function name corresponds to what it does or
> when it is called.
> It is parts of activating the WRITE_OUT action for a stripe that has (or
> may have) been cached to the journal.  None of that is particularly
> about "parity".

It is called parity cached, because it is part of endio of the parity write 
(to journal). We only write RAID disks (write out) after all parities are in 
journal. So if we name the function for its behavior, it will be something
like "start_write_out_to_raid"

> 
> In generally the patch looks good, but it bothers me that we need to add
> tests on R5_InJournal in lots and lots of places.  It makes all those
> test sites more complex and so easily misunderstood.
> I wonder if there is some way we could add a new flag or something that
> would subsumes several of the tests.  So instead of adding a test for InJournal,
> we could replace the current test with a test for something new?
> Or maybe gather several tests that appear together into an inline.
> Or something.

This is a great point. Let me try that. 

Thanks,
Song


^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: Song Liu @ 2016-11-16  5:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <87bmxgi8hi.fsf@notabene.neil.brown.name>

> This bothers me.  Why would a stripe *ever* be in "caching mode" (or
> "caching phase") when the array is in write-through?  It doesn't seem to
> make sense.

I was thinking about replacing STRIPE_R5C_WRITE_OUT with something
like STRIPE_R5C_CACHING. So that:

caching-phase is STRIPE_R5C_CACHING == 1
write-out phase is STRIPE_R5C_CACHING == 0 

In this case, stripes in write-through mode will always have 
STRIPE_R5C_CACHING == 0. 

This requires some changes to current state machine, but it might work out. 

How do you like this? 

> There are two actions that can be taken when where are ->towrite blocks
> on a stripe.  We can enter WRITE_OUT, or they can be cached in the
> journal.  Also we can enter WRITE_OUT when a stripe needs to be removed
> From memory or from the journal.
> This makes "writeout" and "cache" seem more like "actions" than states,
> modes, or phases.  Naming is hard.

Yes, it is more like action. We used to name it as "modes" as different *mode*
handles writes with different *action*. So at end of the day, it doesn't really 
matter?

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v6 04/11] md/r5cache: caching mode of r5cache
From: NeilBrown @ 2016-11-16  1:08 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-5-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

On Fri, Nov 11 2016, Song Liu wrote:

>  
> +static void
> +r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
> +			      struct bio_list *return_bi)
> +{
> +	struct bio *wbi, *wbi2;
> +
> +	wbi = dev->written;
> +	dev->written = NULL;
> +	while (wbi && wbi->bi_iter.bi_sector <
> +	       dev->sector + STRIPE_SECTORS) {
> +		wbi2 = r5_next_bio(wbi, dev->sector);
> +		if (!raid5_dec_bi_active_stripes(wbi)) {
> +			md_write_end(conf->mddev);
> +			bio_list_add(return_bi, wbi);
> +		}
> +		wbi = wbi2;
> +	}
> +}

This loop (with minor variations) occurs 3 times in raid5.c, and how a
fourth time in raid5-cache.c.  It would be nice if it could be factored
out (maybe as an inline, maybe not) so we only have one copy of the
code.

> +
> +/*
> + * this journal write must contain full parity,
> + * it may also contain some data pages
> + */
> +static void r5c_handle_parity_cached(struct stripe_head *sh)

I don't understand how this function name corresponds to what it does or
when it is called.
It is parts of activating the WRITE_OUT action for a stripe that has (or
may have) been cached to the journal.  None of that is particularly
about "parity".

In generally the patch looks good, but it bothers me that we need to add
tests on R5_InJournal in lots and lots of places.  It makes all those
test sites more complex and so easily misunderstood.
I wonder if there is some way we could add a new flag or something that
would subsumes several of the tests.  So instead of adding a test for InJournal,
we could replace the current test with a test for something new?
Or maybe gather several tests that appear together into an inline.
Or something.

But mostly it looks good.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v6 09/11] md/r5cache: r5cache recovery: part 2
From: Shaohua Li @ 2016-11-16  0:37 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-10-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:21PM -0800, Song Liu wrote:
> 1. In previous patch, we:
>       - add new data to r5l_recovery_ctx
>       - add new functions to recovery write-back cache
>    The new functions are not used in this patch, so this patch does not
>    change the behavior of recovery.
> 
> 2. In this patchpatch, we:
>       - modify main recovery procedure r5l_recovery_log() to call new
>         functions
>       - remove old functions
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> -/* copy data/parity from log to raid disks */
> -static void r5l_recovery_flush_log(struct r5l_log *log,
> -				   struct r5l_recovery_ctx *ctx)
> -{
> -	while (1) {
> -		if (r5l_recovery_read_meta_block(log, ctx))
> -			return;
> -		if (r5l_recovery_flush_one_meta(log, ctx))
> -			return;
> -		ctx->seq++;
> -		ctx->pos = r5l_ring_add(log, ctx->pos, ctx->meta_total_blocks);
> -	}
> -}
> -
>  static void
>  r5l_recovery_create_empty_meta_block(struct r5l_log *log,
>  				     struct page *page,
> @@ -2139,7 +2001,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
>  
>  static int r5l_recovery_log(struct r5l_log *log)
>  {
> +	struct mddev *mddev = log->rdev->mddev;
>  	struct r5l_recovery_ctx ctx;
> +	int ret;
>  
>  	ctx.pos = log->last_checkpoint;
>  	ctx.seq = log->last_cp_seq;
> @@ -2151,47 +2015,33 @@ static int r5l_recovery_log(struct r5l_log *log)
>  	if (!ctx.meta_page)
>  		return -ENOMEM;
>  
> -	r5l_recovery_flush_log(log, &ctx);
> +	ret = r5c_recovery_flush_log(log, &ctx);
>  	__free_page(ctx.meta_page);
>  
> -	/*
> -	 * we did a recovery. Now ctx.pos points to an invalid meta block. New
> -	 * log will start here. but we can't let superblock point to last valid
> -	 * meta block. The log might looks like:
> -	 * | meta 1| meta 2| meta 3|
> -	 * meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If
> -	 * superblock points to meta 1, we write a new valid meta 2n.  if crash
> -	 * happens again, new recovery will start from meta 1. Since meta 2n is
> -	 * valid now, recovery will think meta 3 is valid, which is wrong.
> -	 * The solution is we create a new meta in meta2 with its seq == meta
> -	 * 1's seq + 10 and let superblock points to meta2. The same recovery will
> -	 * not think meta 3 is a valid meta, because its seq doesn't match
> -	 */
> -	if (ctx.seq > log->last_cp_seq) {
> -		int ret;
> -
> -		ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
> -		if (ret)
> -			return ret;
> -		log->seq = ctx.seq + 11;
> -		log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
> -		r5l_write_super(log, ctx.pos);
> -		log->last_checkpoint = ctx.pos;
> -		log->next_checkpoint = ctx.pos;
> -	} else {
> -		log->log_start = ctx.pos;
> -		log->seq = ctx.seq;
> -	}
> +	if (ret)
> +		return ret;
>  
> -	/*
> -	 * This is to suppress "function defined but not used" warning.
> -	 * It will be removed when the two functions are used (next patch).
> -	 */

BTW, this isn't necessary. it's ok an intermediate patch causes compile
warning, as long as it doesn't break biset.

^ permalink raw reply

* Re: [PATCH v6 08/11] md/r5cache: r5cache recovery: part 1
From: Shaohua Li @ 2016-11-16  0:33 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161110204623.3484694-9-songliubraving@fb.com>

On Thu, Nov 10, 2016 at 12:46:20PM -0800, Song Liu wrote:
> Recovery of write-back cache has different logic to write-through only
> cache. Specifically, for write-back cache, the recovery need to scan
> through all active journal entries before flushing data out. Therefore,
> large portion of the recovery logic is rewritten here.
> 
> To make the diffs cleaner, we split the rewrite as follows:
> 
> 1. In this patch, we:
>       - add new data to r5l_recovery_ctx
>       - add new functions to recovery write-back cache
>    The new functions are not used in this patch, so this patch does not
>    change the behavior of recovery.
> 
> 2. In next patch, we:
>       - modify main recovery procedure r5l_recovery_log() to call new
>         functions
>       - remove old functions
> 
> With cache feature, there are 2 different scenarios of recovery:
> 1. Data-Parity stripe: a stripe with complete parity in journal.
> 2. Data-Only stripe: a stripe with only data in journal (or partial
>    parity).
> 
> The code differentiate Data-Parity stripe from Data-Only stripe with
> flag STRIPE_R5C_WRITE_OUT.
> 
> For Data-Parity stripes, we use the same procedure as raid5 journal,
> where all the data and parity are replayed to the RAID devices.
> 
> For Data-Only strips, we need to finish complete calculate parity and
> finish the full reconstruct write or RMW write. For simplicity, in
> the recovery, we load the stripe to stripe cache. Once the array is
> started, the stripe cache state machine will handle these stripes
> through normal write path.
> 
> r5c_recovery_flush_log contains the main procedure of recovery. The
> recovery code first scans through the journal and loads data to
> stripe cache. The code keeps tracks of all these stripes in a list
> (use sh->lru and ctx->cached_list), stripes in the list are
> organized in the order of its first appearance on the journal.
> During the scan, the recovery code assesses each stripe as
> Data-Parity or Data-Only.
> 
> During scan, the array may run out of stripe cache. In these cases,
> the recovery code will also call raid5_set_cache_size to increase
> stripe cache size. If the array still runs out of stripe cache
> because there isn't enough memory, the array will not assemble.
> 
> At the end of scan, the recovery code replays all Data-Parity
> stripes, and sets proper states for Data-Only stripes. The recovery
> code also increases seq number by 10 and rewrites all Data-Only
> stripes to journal. This is to avoid confusion after repeated
> crashes. More details is explained in raid5-cache.c before
> r5c_recovery_rewrite_data_only_stripes().

I'm ok with this patch in current stage. This one has a lot of areas to
improve:
- the r5c_recovery_lookup_stripe algorithm looks silly
- r5c_recovery_analyze_meta_block will read the data twice. One in verify
  checksum and the other in loading to stripe. We could estimate the maximum
  pages a meata could use, then read data to the pages. we could do the read in
  very checksum and then copy the data to stripe.
- r5c_recovery_rewrite_data_only_stripes doesn't need to use sync io, we can
  dispatch several io and wait in batch.

Please remember to fix these later.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v6 03/11] md/r5cache: State machine for raid5-cache write back mode
From: NeilBrown @ 2016-11-16  0:17 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161110204623.3484694-4-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 3505 bytes --]

On Fri, Nov 11 2016, Song Liu wrote:

> This patch adds state machine for raid5-cache. With log device, the
> raid456 array could operate in two different modes (r5c_journal_mode):
>   - write-back (R5C_MODE_WRITE_BACK)
>   - write-through (R5C_MODE_WRITE_THROUGH)
>
> Existing code of raid5-cache only has write-through mode. For write-back
> cache, it is necessary to extend the state machine.
>
> With write-back cache, every stripe could operate in two different
> modes:
>   - caching
>   - writing-out
>
> In caching mode, the stripe handles writes as:
>   - write to journal
>   - return IO
>
> In writing-out mode, the stripe behaviors as a stripe in write through
> mode R5C_MODE_WRITE_THROUGH.
>
> STRIPE_R5C_WRITE_OUT is added to sh->state to differentiate caching and
> writing-out mode.
>
> When the array is write-through, stripes also go between caching mode
> and writing-out mode.

This bothers me.  Why would a stripe *ever* be in "caching mode" (or
"caching phase") when the array is in write-through?  It doesn't seem to
make sense.

A stripe_head goes through several states/stages/phases/things.

 1/ write data blocks to journal
 2/ reading missing data blocks and compute parity
 3/ write data and parity to journal
 4/ write data and parity to RAID

When there is no log, we only perform 2 and 4
When there is a log in WRITE_THOUGH we perform 2,3,4
When there is a log in WRITE_BACK we perform 1 (maybe several times) 2 3 4.

Step 2 is initiated by calling handle_stripe_dirtying().
A new function is introduced "r5c_handle_stripe_dirtying()" which
determines if handle_stripe_dirtying() should do anything.
(It returns '0' if it shouldn't proceed, and -EAGAIN if it should,
 which seems a little strange).

If there is no log, or if STRIPE_R5C_WRITE_OUT is set, -EAGAIN is
returned.
Otherwise if in MODE_WRITE_THROUGH, STRIPE_R5C_WRITE_OUT is set and
-EAGAIN is returned.
else (in next patch) are more complex determination is made, but
-EAGAIN is only ever returns with STRIPE_R5C_WRITE_OUT set, or if log == NULL.

So STRIPE_R5C_WRITE_OUT is (sometimes) set to enter step 2, and cleared
when step 4 completes.

STRIPE_R5C_WRITE_OUT means that the 2(3)4 writeout sequence has
commenced and not yet completed.  I would like to see it *always* set
when that happens, including when log==NULL.
I would probably even rename r5c_handle_stripe_dirtying() to something
like  r5c_check_need_writeout() which would check to see if writeout was
needed, and would set STRIPE_R5C_WRITE_OUT if it was.
Then handle_stripe_dirtying() would be called iff STRIPE_R5C_WRITE_OUT
were set. (it could even be named to handle_stripe_writeout()??)

There are two actions that can be taken when where are ->towrite blocks
on a stripe.  We can enter WRITE_OUT, or they can be cached in the
journal.  Also we can enter WRITE_OUT when a stripe needs to be removed
From memory or from the journal.
This makes "writeout" and "cache" seem more like "actions" than states,
modes, or phases.  Naming is hard.

Trying to understand R5_InJournal:
 It is set (on pd_idx) when step 3 completes.
 It is only used (in this patch) in r5c_finish_stripe_write_out()
   to make sure WRITE_OUT isn't cleared until after InJournal is
 cleared.

So setting InJournal[pd_idx] marks the end of step 3 and clearing WRITE_OUT
marks the end of step 4.

I think that observation helps me understand the code a bit better.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox