Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* RE: md prefered minor has been renumbered
From: Randall C. Grimshaw @ 2016-11-28 20:34 UTC (permalink / raw)
  To: Phil Turmel, linux-raid@vger.kernel.org
In-Reply-To: <91b64061-6026-c738-75a6-8d80a177aec8@turmel.org>

Phil:
  Thank you for responding, it is much appreciated.
  This is a recently assigned system so I must apologize for not having all of the history re fstab.
  But when I try to boot into the 'normal' system it continues to use the new numbering, the mounts fail, (the duplicate md122 obviously fails), and the system panics.
  I suspect the answer will have something to do with the command I googled: mdadm -A --update=superminor /dev/mdNEWNUMBER  /dev/sd...
.. but I have zero experience with it.
Randall

________________________________________
From: Phil Turmel [philip@turmel.org]
Sent: Monday, November 28, 2016 2:58 PM
To: Randall C. Grimshaw; linux-raid@vger.kernel.org
Subject: Re: md prefered minor has been renumbered

On 11/28/2016 12:26 PM, Randall C. Grimshaw wrote:
> I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
> as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.

This is normal and expected for a livecd -- the numbering comes from the
mdadm.conf file in the initramfs, unless that mdadm.conf file excludes
arrays for later assembly (when the rootfs is available).

Many livecd initramfs scripts try to assemble everything, so the numbers
are allocated from 127 down.  Usually you can find a boot parameter to
suppress assembly.

> smartctl shows that the WD brand drives do support SCT
> mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.

Eww.  You should not be counting on /dev/mdX numbering in your fstab.
Specifically for this reason.  Use LABEL= or UUID= syntax in fstab.

> can someone kindly tell me the mdadm command to put the correct numbers back.

No need.  When you boot back into your normal system, its initramfs will
supply this information.  If you are changing the info, make sure you
rebuild your initramfs, so the updated mdadm.conf file is included.

Phil

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Phil Turmel @ 2016-11-28 20:22 UTC (permalink / raw)
  To: Wols Lists, Randall C. Grimshaw, linux-raid@vger.kernel.org
In-Reply-To: <583C8D73.5060406@youngman.org.uk>

On 11/28/2016 03:02 PM, Wols Lists wrote:

> In My Experience (I'll probably get told I'm wrong :-) mdadm.conf seems
> to be pretty much ignored by modern systems. And md numbers now by
> default count down from 127, not up from 0.

mdadm.conf is *not* ignored.  Defaults are taken if it isn't found, or
doesn't contain the info needed.

Most modern systems have two copies of mdadm.conf.  One in the initramfs
so you can boot a root filesystem that is inside an array, and the other
in your root filesystem to assemble everything else.

Most distros copy the latter to the former when updating the initramfs.
You may have to manually trigger this operation when you are
re-arranging your arrays.

You also have to manually intervene if you want your initramfs to have
a different mdadm.conf, like a two-phase assembly.

mdadm has been counting down from 127 for as long as I've used it.

Phil

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Wols Lists @ 2016-11-28 20:02 UTC (permalink / raw)
  To: Randall C. Grimshaw, linux-raid@vger.kernel.org
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE1BC@Exchange-MBX1.esf.edu>

On 28/11/16 17:26, Randall C. Grimshaw wrote:
> I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
> as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.
> 
> smartctl shows that the WD brand drives do support SCT
> mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.
> 
> can someone kindly tell me the mdadm command to put the correct numbers back.

In My Experience (I'll probably get told I'm wrong :-) mdadm.conf seems
to be pretty much ignored by modern systems. And md numbers now by
default count down from 127, not up from 0.

I think if you reboot back in to CentOS 5, you'll get the right numbers
back...
> 
> sda7,sdb7 (122) /dev/md5 uuid=11b7e8b7:ca181503:cf45af46:db2a5d9d /
> sda1,sdb1 (127) /dev/md0 uuid=d9a5774d:d930fbd8:cf45af46:db2a5d9d /boot
> sda3,sdb2 (125) /dev/md2 uuid=23a4116f:1d43390b:cf45af46:db2a5d9d /tmp
> sda5,sdb5 (122) /dev/md3 uuid=c89f6357:f1c7a0ef:cf45af46:db2a5d9d /var
> sda6,sdb6 (126) /dev/md4 uuid=67ad84bc:05153ebe:cf45af46:db2a5d9d /var/lib/mysql/cmdaemon_mon
> sda2,sdb2 (   1) /dev/md1 uuid=5d9f0b00:24d890eb:cf45af46:db2a5d9d swap
> sdc1                                                            /home
> 
That said, you should not be using fixed numbers any more - the linux
kernel does not guarantee discovery order or device names. It may
shuffle sda, sdb, sdc etc. It definitely shuffles mdnnn (my other gentoo
system causes me grief with this :-) because it's broken in other ways
too...)

What do you need your mdN names to be constant for? If you're just doing
emergency maintenance, then you might be better off using a rescue CD,
or just manually mounting those partitions for maintenance.

And if you're planning on upgrading, either convert the arrays to named
arrays, or refer to them by UUID.

Cheers,
Wol

^ permalink raw reply

* Re: md prefered minor has been renumbered
From: Phil Turmel @ 2016-11-28 19:58 UTC (permalink / raw)
  To: Randall C. Grimshaw, linux-raid@vger.kernel.org
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE1BC@Exchange-MBX1.esf.edu>

On 11/28/2016 12:26 PM, Randall C. Grimshaw wrote:
> I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
> as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.

This is normal and expected for a livecd -- the numbering comes from the
mdadm.conf file in the initramfs, unless that mdadm.conf file excludes
arrays for later assembly (when the rootfs is available).

Many livecd initramfs scripts try to assemble everything, so the numbers
are allocated from 127 down.  Usually you can find a boot parameter to
suppress assembly.

> smartctl shows that the WD brand drives do support SCT
> mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.

Eww.  You should not be counting on /dev/mdX numbering in your fstab.
Specifically for this reason.  Use LABEL= or UUID= syntax in fstab.

> can someone kindly tell me the mdadm command to put the correct numbers back.

No need.  When you boot back into your normal system, its initramfs will
supply this information.  If you are changing the info, make sure you
rebuild your initramfs, so the updated mdadm.conf file is included.

Phil

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Shaohua Li @ 2016-11-28 17:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid
In-Reply-To: <8f633d1b-6250-6f9d-0311-70931d31fe0a@scylladb.com>

On Mon, Nov 28, 2016 at 11:11:18AM +0200, Avi Kivity wrote:
> On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
> > On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
> > > What I guess is happening is that since the NVMe queue depth is so high, and
> > > request the driver receives is sent immediately to the disk and there is
> > > nothing to merge it to.  That could indicate the absence of plugging, or
> > > just a reluctance to merge TRIMs.
> > That is exactly the case, and it's also an issue with online trim.
> > I have work in progress block layer trim patches that always plug trims
> > and have various TRIM merge improvements including support for ranged
> > TRIMs.  It needs a bit more work, but I hope I can post it later
> > this week.
> 
> Great, good to know.
> 
> I still think it should also be fixed in the RAID layer.  There's no reason
> to break a single request in millions of smaller ones, then try to merge
> them into one request back again.  The queuing layer can merge when it's
> given bad patterns from uncontrolled sources, not as an excuse to generate
> bad patterns from within the kernel.

I think the reason is you are using 4.1 kernel. upstream kernel does support
request merge even for blk-mq. It didn't at some versions.

Thanks,
Shaohua

^ permalink raw reply

* md prefered minor has been renumbered
From: Randall C. Grimshaw @ 2016-11-28 17:26 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org
In-Reply-To: <BC4985417D7F7B49B9AB545333BDB61B191FE1A2@Exchange-MBX1.esf.edu>

I have made the mistake of booting a centos-6.8 live cd to manipulate a centos 5 system.
as a result md5 has become md122, md2 has become md125, md4 has become md126, md0 has become md127, and most unfortunately md3 has also become md122.

smartctl shows that the WD brand drives do support SCT
mdadm --examine was used to reveal the mish-mash using uuid numbers compared with the file /etc/mdadm.conf also referencing /etc/fstab.

can someone kindly tell me the mdadm command to put the correct numbers back.

sda7,sdb7 (122) /dev/md5 uuid=11b7e8b7:ca181503:cf45af46:db2a5d9d /
sda1,sdb1 (127) /dev/md0 uuid=d9a5774d:d930fbd8:cf45af46:db2a5d9d /boot
sda3,sdb2 (125) /dev/md2 uuid=23a4116f:1d43390b:cf45af46:db2a5d9d /tmp
sda5,sdb5 (122) /dev/md3 uuid=c89f6357:f1c7a0ef:cf45af46:db2a5d9d /var
sda6,sdb6 (126) /dev/md4 uuid=67ad84bc:05153ebe:cf45af46:db2a5d9d /var/lib/mysql/cmdaemon_mon
sda2,sdb2 (   1) /dev/md1 uuid=5d9f0b00:24d890eb:cf45af46:db2a5d9d swap
sdc1                                                            /home

much appreciated
Randall Grimshaw

^ permalink raw reply

* [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-11-28 14:07 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak
In-Reply-To: <wrfjr35vpw2j.fsf@redhat.com>

If metadata handler provides support for bad blocks, tell md by writing
'external_bbl' to rdev state file (both on create and assemble),
followed by a list of known bad blocks written via sysfs 'bad_blocks'
file.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 mdadm.h | 13 +++++++++++++
 sysfs.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/mdadm.h b/mdadm.h
index 240ab7f..4aa1d30 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -237,6 +237,17 @@ struct dlm_lksb {
 
 extern const char Name[];
 
+struct md_bb_entry {
+	unsigned long long sector;
+	int length;
+};
+
+struct md_bb {
+	int supported;
+	int count;
+	struct md_bb_entry *entries;
+};
+
 /* general information that might be extracted from a superblock */
 struct mdinfo {
 	mdu_array_info_t	array;
@@ -311,6 +322,8 @@ struct mdinfo {
 
 	/* info read from sysfs */
 	char		sysfs_array_state[20];
+
+	struct md_bb bb;
 };
 
 struct createinfo {
diff --git a/sysfs.c b/sysfs.c
index f8a9f0b..84c7348 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -52,8 +52,10 @@ void sysfs_free(struct mdinfo *sra)
 		while (sra->devs) {
 			struct mdinfo *d = sra->devs;
 			sra->devs = d->next;
+			free(d->bb.entries);
 			free(d);
 		}
+		free(sra->bb.entries);
 		free(sra);
 		sra = sra2;
 	}
@@ -261,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 		dbase = base + strlen(base);
 		*dbase++ = '/';
 
-		dev = xmalloc(sizeof(*dev));
+		dev = xcalloc(1, sizeof(*dev));
 
 		/* Always get slot, major, minor */
 		strcpy(dbase, "slot");
@@ -691,6 +693,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 	char nm[PATH_MAX];
 	char *dname;
 	int rv;
+	int i;
 
 	sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
 	rv = sysfs_set_str(sra, NULL, "new_dev", dv);
@@ -722,6 +725,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 		if (resume)
 			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
 	}
+	if (sd->bb.supported) {
+		if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
+			/*
+			 * backward compatibility - if kernel doesn't support
+			 * bad blocks for external metadata, let it continue
+			 * as long as there are none known so far
+			 */
+			if (sd->bb.count) {
+				pr_err("The kernel has no support for bad blocks in external metadata\n");
+				return -1;
+			}
+		}
+
+		for (i = 0; i < sd->bb.count; i++) {
+			char s[30];
+			const struct md_bb_entry *entry = &sd->bb.entries[i];
+
+			snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
+				 entry->length);
+			rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
+		}
+	}
 	return rv;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [mdadm PATCH] Add failfast support.
From: Jes Sorensen @ 2016-11-28 13:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, linux-raid, linux-block, Christoph Hellwig,
	linux-kernel, hare
In-Reply-To: <87polka0vu.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> Allow per-device "failfast" flag to be set when creating an
> array or adding devices to an array.
>
> When re-adding a device which had the failfast flag, it can be removed
> using --nofailfast.
>
> failfast status is printed in --detail and --examine output.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Hi Jes,
>  this patch adds mdadm support for the failfast functionality that
> Shaohua recently included in his for-next.
> Hopefully the man-page additions provide all necessary context.
> If there is anything that seems to be missing, I'll be very happy to
> add it.

Hi Neil,

It looks reasonable. The only minor concern I have is over the use of
hardcoded magic numbers like 'dv->failfast = 2' rather than using an
enum or defines with descriptive names. Something that can be addressed
later, so patch applied.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH 1/4 v2] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-11-28 13:43 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <wrfjh96wub5r.fsf@redhat.com>

Jes Sorensen <Jes.Sorensen@redhat.com> writes:
> Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
>> On Thu, Oct 27, 2016 at 10:53:42AM +0200, Tomasz Majchrzak wrote:
>>> If metadata handler provides support for bad blocks, tell md by writing
>>> 'external_bbl' to rdev state file (both on create and assemble),
>>> followed by a list of known bad blocks written via sysfs 'bad_blocks'
>>> file.
>>> 
>>> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
>>> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>> ---
>>>  mdadm.h | 13 +++++++++++++
>>>  sysfs.c | 29 ++++++++++++++++++++++++++++-
>>>  2 files changed, 41 insertions(+), 1 deletion(-)
>>
>> Hi Jes,
>>
>> Do you have any comments for this patch set? It is a generic code for bad
>> block support for external metadata. It comes ahead of my other patch set
>> which provides IMSM implementation of new functionality.
>>
>> Regards,
>>
>> Tomek
>
> Hi Tomek,
>
> I thought I had responded to this one - I'll have a look early next week
> (holiday here).

Tomek,

It mostly look fine, there is a nitpick in patch 1/4. In general I
discourage the use of unnecessary parenthesis in multi case if
statements, but it's not a showstopper. Ie.
 if ((foo == x) && (bar != y))

If you send me an updated 1/4 then I should be able to apply it.

and of course, always use a cover letter when sending out multi-patch
sets in the future.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH 1/4 v2] mdadm: bad block support for external metadata - initialization
From: Jes Sorensen @ 2016-11-28 13:34 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1477558425-13332-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> If metadata handler provides support for bad blocks, tell md by writing
> 'external_bbl' to rdev state file (both on create and assemble),
> followed by a list of known bad blocks written via sysfs 'bad_blocks'
> file.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h | 13 +++++++++++++
>  sysfs.c | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)

Hi Tomasz,

Most of this one looks fine, just one minor nit.

> diff --git a/mdadm.h b/mdadm.h
> index 0516c82..5156ea4 100755
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -237,6 +237,17 @@ struct dlm_lksb {
>  
>  extern const char Name[];
>  
> +struct md_bb_entry {
> +	unsigned long long sector;
> +	int length;
> +};
> +
> +struct md_bb {
> +	int supported;
> +	int count;
> +	struct md_bb_entry *entries;
> +};
> +
>  /* general information that might be extracted from a superblock */
>  struct mdinfo {
>  	mdu_array_info_t	array;
> @@ -311,6 +322,8 @@ struct mdinfo {
>  
>  	/* info read from sysfs */
>  	char		sysfs_array_state[20];
> +
> +	struct md_bb bb;
>  };
>  
>  struct createinfo {
> diff --git a/sysfs.c b/sysfs.c
> index d28e21a..c7a8e66 100644
> --- a/sysfs.c
> +++ b/sysfs.c
> @@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
>  		while (sra->devs) {
>  			struct mdinfo *d = sra->devs;
>  			sra->devs = d->next;
> +			if (d->bb.entries)
> +				free(d->bb.entries);
>  			free(d);
>  		}
> +		if (sra->bb.entries)
> +			free(sra->bb.entries);
>  		free(sra);
>  		sra = sra2;
>  	}

free() handles NULL pointers, so no need to check first.

Cheers,
Jes

^ permalink raw reply

* Re: [RFC PATCH] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-11-28 12:47 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid
In-Reply-To: <1479723009-11113-2-git-send-email-binoy.jayan@linaro.org>

On Mon, Nov 21, 2016 at 03:40:09PM +0530, Binoy Jayan wrote:
> Currently, the iv generation algorithms are implemented in dm-crypt.c.
> The goal is to move these algorithms from the dm layer to the kernel
> crypto layer by implementing them as template ciphers so they can be used
> in relation with algorithms like aes, and with multiple modes like cbc,
> ecb etc. As part of this patchset, the iv-generation code is moved from the
> dm layer to the crypto layer. The dm-layer can later be optimized to
> encrypt larger block sizes in a single call to the crypto engine. The iv
> generation algorithms implemented in geniv.c includes plain, plain64,
> essiv, benbi, null, lmk and tcw. These templates are to be configured
> and has to be invoked as:
> 
> crypto_alloc_skcipher("plain(cbc(aes))", 0, 0);
> crypto_alloc_skcipher("essiv(cbc(aes))", 0, 0);
> ...
> 
> from the dm layer.
> 
> Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>

Thanks a lot for working on this!

> +static int crypto_geniv_set_ctx(struct crypto_skcipher *cipher,
> +				void *newctx, unsigned int len)
> +{
> +	struct geniv_ctx *ctx = crypto_skcipher_ctx(cipher);
> +	/*
> +	 * TODO:
> +	 * Do we really need this API or can we append the context
> +	 * 'struct geniv_ctx' to the cipher from dm-crypt and use
> +	 * the same here.
> +	 */
> +	memcpy(ctx, (char *) newctx, len);
> +	return geniv_setkey_init_ctx(&ctx->data);
> +}

I think we should be able to do without adding another API for this.
Can we instead put the information into the key and/or the IV?

Also it would really help if you could attach a patch for dm-crypt
that goes along with this (it doesn't have to be complete or
functional, just showing how this is meant to be used)

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Testing RAID 1 redundancy
From: Arka Sharma @ 2016-11-28  9:20 UTC (permalink / raw)
  To: linux-raid

Hello,

I want test a redundancy scenario with RAID 1. I have crated an array
with mdadm and formatted ext4 and after mounting I have written a text
file. Now I want to switch off one device, and with another remaining
device I expect to view the text file. When I issue mdadm --detail it
shows the State as active,FAILED,Not Started. But in dmesg I can't
find any error message from md related to one of the physical disk
missing. I was searching about this and I found some references of
creating degraded RAID 1 with missing parameter in mdadm, but what I
want to simulate is that let's say one of the disk has gone bad and
not yet being detected under /dev and in this case we want to verify
that if data is intact.

Regards,
Arka

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Coly Li @ 2016-11-28  9:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, NeilBrown, linux-raid
In-Reply-To: <8f633d1b-6250-6f9d-0311-70931d31fe0a@scylladb.com>

On 2016/11/28 下午5:11, Avi Kivity wrote:
> On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
>> On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
>>> What I guess is happening is that since the NVMe queue depth is so
>>> high, and
>>> request the driver receives is sent immediately to the disk and there is
>>> nothing to merge it to.  That could indicate the absence of plugging, or
>>> just a reluctance to merge TRIMs.
>> That is exactly the case, and it's also an issue with online trim.
>> I have work in progress block layer trim patches that always plug trims
>> and have various TRIM merge improvements including support for ranged
>> TRIMs.  It needs a bit more work, but I hope I can post it later
>> this week.
> 
> Great, good to know.
> 
> I still think it should also be fixed in the RAID layer.  There's no
> reason to break a single request in millions of smaller ones, then try
> to merge them into one request back again.  The queuing layer can merge
> when it's given bad patterns from uncontrolled sources, not as an excuse
> to generate bad patterns from within the kernel.

Personally I agree with your opinion. We can improve both in block layer
and md code.

Coly


^ permalink raw reply

* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-28  9:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: NeilBrown, linux-raid
In-Reply-To: <20161128090053.GA9393@infradead.org>

On 11/28/2016 11:00 AM, Christoph Hellwig wrote:
> On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
>> What I guess is happening is that since the NVMe queue depth is so high, and
>> request the driver receives is sent immediately to the disk and there is
>> nothing to merge it to.  That could indicate the absence of plugging, or
>> just a reluctance to merge TRIMs.
> That is exactly the case, and it's also an issue with online trim.
> I have work in progress block layer trim patches that always plug trims
> and have various TRIM merge improvements including support for ranged
> TRIMs.  It needs a bit more work, but I hope I can post it later
> this week.

Great, good to know.

I still think it should also be fixed in the RAID layer.  There's no 
reason to break a single request in millions of smaller ones, then try 
to merge them into one request back again.  The queuing layer can merge 
when it's given bad patterns from uncontrolled sources, not as an excuse 
to generate bad patterns from within the kernel.

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  9:10 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <CAMGffEktp+raca2fYPu8fC5nxUrhGLKwjJhtHwXV2Bh_m5oViA@mail.gmail.com>

On 2016/11/28 下午5:02, Jinpu Wang wrote:
> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
>> On 2016/11/28 下午4:24, Jinpu Wang wrote:
>>> snip
>>>>>>
>>>>>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>>>>>> forgot to increase nr_queued somewhere?
>>>>>>
>>>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>>>
>>>>>> Could you give your suggestion?
>>>>>>
>>>>> Sorry, forgot to mention kernel version is 4.4.28
>>>>
>>>> This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
>>>> stable kernel or a distribution with 4.4.28 kernel ?
>>>>
>>>> Coly
>>>>
>>>>
>>> Hi Coly,
>>>
>>> I'm using Debian8 with 4.4.28 kernel.
>>
>> Hi Jinpu,
>>
>> Is it possible for your to run a upstream kernel or vanilla kernel to
>> test whether the issue still can be reproduced ? Then we can know
>> whether it is an upstream bug or a distro issue.
>>
>> Thanks.
>>
>> Coly
> 
> Hi Coly,
> 
> I did run kernel 4.4.34 (I download from kernel.org), I can reproduce
> the same bug.
> 
> I can also try latest 4.8 or 4.9 rc kernel, if you think it's necessary?
> 
Yes, please. If it can be reproduced on upstream kernel by a set of
scripts, it will be very helpful to debug and fix this issue.

Thanks in advance.

Coly

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  9:02 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <519e773d-e6e6-5d79-7224-ef94ef7c7a93@suse.de>

On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
> On 2016/11/28 下午4:24, Jinpu Wang wrote:
>> snip
>>>>>
>>>>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>>>>> forgot to increase nr_queued somewhere?
>>>>>
>>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>>
>>>>> Could you give your suggestion?
>>>>>
>>>> Sorry, forgot to mention kernel version is 4.4.28
>>>
>>> This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
>>> stable kernel or a distribution with 4.4.28 kernel ?
>>>
>>> Coly
>>>
>>>
>> Hi Coly,
>>
>> I'm using Debian8 with 4.4.28 kernel.
>
> Hi Jinpu,
>
> Is it possible for your to run a upstream kernel or vanilla kernel to
> test whether the issue still can be reproduced ? Then we can know
> whether it is an upstream bug or a distro issue.
>
> Thanks.
>
> Coly

Hi Coly,

I did run kernel 4.4.34 (I download from kernel.org), I can reproduce
the same bug.

I can also try latest 4.8 or 4.9 rc kernel, if you think it's necessary?

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Christoph Hellwig @ 2016-11-28  9:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: NeilBrown, linux-raid
In-Reply-To: <df73ebc4-9b78-09b5-022b-089c30dea17c@scylladb.com>

On Mon, Nov 28, 2016 at 10:58:24AM +0200, Avi Kivity wrote:
> What I guess is happening is that since the NVMe queue depth is so high, and
> request the driver receives is sent immediately to the disk and there is
> nothing to merge it to.  That could indicate the absence of plugging, or
> just a reluctance to merge TRIMs.

That is exactly the case, and it's also an issue with online trim.
I have work in progress block layer trim patches that always plug trims
and have various TRIM merge improvements including support for ranged
TRIMs.  It needs a bit more work, but I hope I can post it later
this week.

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Avi Kivity @ 2016-11-28  8:58 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <87inr880au.fsf@notabene.neil.brown.name>

On 11/28/2016 10:40 AM, NeilBrown wrote:
> On Mon, Nov 28 2016, Avi Kivity wrote:
>
>> On 11/28/2016 07:09 AM, NeilBrown wrote:
>>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>>
>>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>>> inappropriate) for the entire partition. As far as I can tell, md
>>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>>> components individually.
>>>>
>>>>
>>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>>> whichever is inappropriate) per RAID component, converting an O(disk
>>>> size / chunk size) operation into an O(number of RAID components)
>>>> operation, which is much faster.
>>>>
>>>>
>>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>>> the operation taking about a quarter of an hour, continuously pushing
>>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>>> to the disk. Linux 4.1.12.
>>> Surely it is the task of the underlying driver, or the queuing
>>> infrastructure, to merge small requests into large requests.
>> Here's a blkparse of that run.  As can be seen, there is no concurrency,
>> so nobody down the stack has any chance of merging anything.
> That isn't a valid conclusion to draw.

It's actually wrong, as I noted later.  The raid layer issues concurrent 
requests, but they are not merged.

> raid0 effectively calls the make_request_fn function that is registered
> by the underlying driver.
> If that function handles DISCARD synchronously, then you won't see any
> concurrency, and that is because the driver chose not to queue but to
> handle directly.
> I don't know if it actually does this though.  I don't know the insides
> of the nmve driver .... there seems to be a lightnvm thing and a scsi
> thing and a pci thing and it all confuses me.

NVMe only has queued TRIMs.  Of course, the driver could be issuing them 
synchronously, but that's doubtful, since NVMe is such a simple protocol.

What I guess is happening is that since the NVMe queue depth is so high, 
and request the driver receives is sent immediately to the disk and 
there is nothing to merge it to.  That could indicate the absence of 
plugging, or just a reluctance to merge TRIMs.

>
>>
>> No merging was happening.  This is an NVMe drive, so running with the
>> noop scheduler (which should still merge).   Does the queuing layer
>> merge trims?
> I wish I knew.  I once thought I understood about half of the block
> queuing code, but now with multi-queue, I'll need to learn it all
> again. :-(
>
>> I don't think it's the queuing layer's job, though.  At the I/O
>> scheduler you can merge to clean up sloppy patterns from the upper
>> layer, but each layer should try to generate the best pattern it can.
> Why?  How does it know what is best for the layer below?

A large request can be split at the lower layer without harming the 
upper layer (beyond the effort involved).  Merging many small requests 
at the lower layer can harm the upper layer, if it depends on the 
latency of individual requests.

In general in software engineering, when you destroy information, there 
is a potential for inefficiency.  In this case the information destroyed 
is that we are only interested when all of the range is TRIMmed, not any 
particular subrange, and the potential for inefficiency is indeed realized.

>
>> Large merges mean increased latency for the first request in the chain,
>> forcing the I/O scheduler to make a decision which can harm the
>> workload.  By generating merged requests in the first place, the upper
>> layer removes the need to make that tradeoff (splitting the requests
>> removes information: "we are interested only in when all of the range is
>> trimmed, not any particular request").
> If it is easy for the upper layer to break a very large request into a
> few very large requests, then I wouldn't necessarily object.

I can't see why it would be hard.  It's simple arithmetic.

> But unless it is very hard for the lower layer to merge requests, it
> should be doing that too.

Merging has tradeoffs.  When you merge requests R1, R2, ... Rn you make 
the latency request R1 sum of the latencies of R1..Rn.  You may gain 
some efficiency in the process, but that's not going to make up for a 
factor of n.  The queuing layer has no way to tell whether the caller is 
interested in the latency of individual requests.  By sending large 
requests, the caller indicates it's not interested in the latency of 
individual subranges.  The queuing layer is still free to internally 
split the request to smaller ranges, to satisfy hardware constraints, or 
to reduce worst-case latencies for competing request streams.

So I disagree that all the work should be pushed to the merging layer.  
It has less information to work with, so the fewer decisions it has to 
make, the better.

> When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the

This does not seem to be an NVMe driver.

> make_request_fn, it performs REQ_OP_DISCARD synchronously.  I would
> suggest that is a very poor design.  I don't know if that is affecting
> you (though a printk would find out).
>
> NeilBrown

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  8:54 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <CAMGffEmc_6o+Lgq3iJsCUyn6K=5S2w8pnrVDhXNBJxaKxd1Teg@mail.gmail.com>

On 2016/11/28 下午4:24, Jinpu Wang wrote:
> snip
>>>>
>>>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>>>> forgot to increase nr_queued somewhere?
>>>>
>>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>>
>>>> Could you give your suggestion?
>>>>
>>> Sorry, forgot to mention kernel version is 4.4.28
>>
>> This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
>> stable kernel or a distribution with 4.4.28 kernel ?
>>
>> Coly
>>
>>
> Hi Coly,
> 
> I'm using Debian8 with 4.4.28 kernel.

Hi Jinpu,

Is it possible for your to run a upstream kernel or vanilla kernel to
test whether the issue still can be reproduced ? Then we can know
whether it is an upstream bug or a distro issue.

Thanks.

Coly

^ permalink raw reply

* [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
From: Jinpu Wang @ 2016-11-28  8:45 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown, Shaohua Li

Hi folks,

We hit another hung task on bitmap_startwrite with our test machines, this time
it's hung in bitmap_startwrite.

We build MD/RAID1 over 2 block devices exported via IB,
bitmap=internal. KVM build on top of
RAID1 on compute node, disks are on remote storage node, one storage
node crash/reboot, multiple raid1 on multiple compute node KVM run
into hung task:

[106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.

[106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1

[106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000

[106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00
ffff8820377236d8

[106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648
0000000000000008

[106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
ffffffff81811c60

[106204.344818] Call Trace:

[106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80

[106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]

[106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310

[106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80

[106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127

[106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]

[106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0

[106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0

[106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0

[106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150

[106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0

[106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0

[106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0

[106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120

[106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10

[106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40

[106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50

[106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170

[106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0

[106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190

[106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0

[106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110

[106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0

[106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0

[106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200

[106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60

[106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10

[106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0

[106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10

[106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a

[106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.

[106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1

[106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000

[106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00
ffff8820377236d8

[106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648
0000000000000008

[106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728
ffffffff81811c60

[106384.346294] Call Trace:

[106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80

[106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]

[106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310

[106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80

[106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127

[106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]

[106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0

[106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0

[106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0

[106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150

[106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0


(gdb) l *bitmap_startwrite+0x122

0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).



1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {

1395                         DEFINE_WAIT(__wait);

1396                         /* note that it is safe to do the prepare_to_wait

1397                          * after the test as long as we do it
before dropping

1398                          * the spinlock.

1399                          */

1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,

1401                                         TASK_UNINTERRUPTIBLE);

1402                         spin_unlock_irq(&bitmap->counts.lock);

1403                         schedule();

1404                         finish_wait(&bitmap->overflow_wait, &__wait);

1405                         continue;

1406                 }

so seem KVM is waiting on overflow_wait queue, but somehow no one wake
him up. During reboot one storage, raid1 has a lot IO errors in that
time, I guess some error handle part is broken.

I haven't have a reproducer, just want to report to community, in case
this is known bug, or anyone has patch already :)

Thanks,

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: raid0 vs. mkfs
From: NeilBrown @ 2016-11-28  8:40 UTC (permalink / raw)
  To: Avi Kivity, linux-raid
In-Reply-To: <286a5fc1-eda3-0421-a88e-b03c09403259@scylladb.com>

[-- Attachment #1: Type: text/plain, Size: 3556 bytes --]

On Mon, Nov 28 2016, Avi Kivity wrote:

> On 11/28/2016 07:09 AM, NeilBrown wrote:
>> On Mon, Nov 28 2016, Avi Kivity wrote:
>>
>>> mkfs /dev/md0 can take a very long time, if /dev/md0 is a very large
>>> disk that supports TRIM/DISCARD (erase whichever is inappropriate).
>>> That is because mkfs issues a TRIM/DISCARD (erase whichever is
>>> inappropriate) for the entire partition. As far as I can tell, md
>>> converts the large TRIM/DISCARD (erase whichever is inappropriate) into
>>> a large number of TRIM/DISCARD (erase whichever is inappropriate)
>>> requests, one per chunk-size worth of disk, and issues them to the RAID
>>> components individually.
>>>
>>>
>>> It seems to me that md can convert the large TRIM/DISCARD (erase
>>> whichever is inappropriate) request it gets into one TRIM/DISCARD (erase
>>> whichever is inappropriate) per RAID component, converting an O(disk
>>> size / chunk size) operation into an O(number of RAID components)
>>> operation, which is much faster.
>>>
>>>
>>> I observed this with mkfs.xfs on a RAID0 of four 3TB NVMe devices, with
>>> the operation taking about a quarter of an hour, continuously pushing
>>> half-megabyte TRIM/DISCARD (erase whichever is inappropriate) requests
>>> to the disk. Linux 4.1.12.
>> Surely it is the task of the underlying driver, or the queuing
>> infrastructure, to merge small requests into large requests.
>
> Here's a blkparse of that run.  As can be seen, there is no concurrency, 
> so nobody down the stack has any chance of merging anything.

That isn't a valid conclusion to draw.
raid0 effectively calls the make_request_fn function that is registered
by the underlying driver.
If that function handles DISCARD synchronously, then you won't see any
concurrency, and that is because the driver chose not to queue but to
handle directly.
I don't know if it actually does this though.  I don't know the insides
of the nmve driver .... there seems to be a lightnvm thing and a scsi
thing and a pci thing and it all confuses me.

>
>
> No merging was happening.  This is an NVMe drive, so running with the 
> noop scheduler (which should still merge).   Does the queuing layer 
> merge trims?

I wish I knew.  I once thought I understood about half of the block
queuing code, but now with multi-queue, I'll need to learn it all
again. :-(

>
> I don't think it's the queuing layer's job, though.  At the I/O 
> scheduler you can merge to clean up sloppy patterns from the upper 
> layer, but each layer should try to generate the best pattern it can.  

Why?  How does it know what is best for the layer below?

> Large merges mean increased latency for the first request in the chain, 
> forcing the I/O scheduler to make a decision which can harm the 
> workload.  By generating merged requests in the first place, the upper 
> layer removes the need to make that tradeoff (splitting the requests 
> removes information: "we are interested only in when all of the range is 
> trimmed, not any particular request").

If it is easy for the upper layer to break a very large request into a
few very large requests, then I wouldn't necessarily object.
But unless it is very hard for the lower layer to merge requests, it
should be doing that too.
When drivers/lightnvm/rrpc.c is providing rrpc_make_rq as the
make_request_fn, it performs REQ_OP_DISCARD synchronously.  I would
suggest that is a very poor design.  I don't know if that is affecting
you (though a printk would find out).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  8:24 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey
In-Reply-To: <af7ae0a8-eb73-0070-4280-8758433ad500@suse.de>

snip
>>>
>>> every time nr_pending is 1 bigger then (nr_queued + 1), so seems we
>>> forgot to increase nr_queued somewhere?
>>>
>>> I've noticed (commit ccfc7bf1f09d61)raid1: include bio_end_io_list in
>>> nr_queued to prevent freeze_array hang. Seems it fixed similar bug.
>>>
>>> Could you give your suggestion?
>>>
>> Sorry, forgot to mention kernel version is 4.4.28
>
> This commit is Cced to stable@vger.kernel.org for v4.3+, do you use a
> stable kernel or a distribution with 4.4.28 kernel ?
>
> Coly
>
>
Hi Coly,

I'm using Debian8 with 4.4.28 kernel.

Best,
-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* [PATCH 4/4] md/raid5-cache: adjust the write position of the empty block and mark it as a checkpoint
From: JackieLiu @ 2016-11-28  8:19 UTC (permalink / raw)
  To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu
In-Reply-To: <20161128081921.5641-1-liuyun01@kylinos.cn>

When recovery is complete, we write an empty block and record his
position first, then make the data-only stripes rewritten done,
the location of the empty block as the last checkpoint position
to write into the super block. And we should update last_checkpoint
to this empty block position.

------------------------------------------------------------------
|  old log       | empty block | data only stripes | invalid log |
------------------------------------------------------------------
^                ^                                 ^
|                |- log->last_checkpoint           |- log->log_start
|                |- log->last_cp_seq               |- log->next_checkpoint
|- log->seq=n    |- log->seq=10+n

At the same time, if there is no data-only stripes, this scene may appear,
| meta1 | meta2 | meta3 |
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should 
The solution is we create a new meta in meta2 with its seq == meta1's 
seq + 10 and let superblock points to meta2. 

Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9e72180..20594f7 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2072,7 +2072,6 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		return -ENOMEM;
 	}
 
-	ctx->seq += 10;
 	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
@@ -2132,6 +2131,7 @@ static int r5l_recovery_log(struct r5l_log *log)
 	struct mddev *mddev = log->rdev->mddev;
 	struct r5l_recovery_ctx ctx;
 	int ret;
+	sector_t pos;
 
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
@@ -2149,6 +2149,10 @@ static int r5l_recovery_log(struct r5l_log *log)
 	if (ret)
 		return ret;
 
+	pos = ctx.pos;
+	r5l_log_write_empty_meta_block(log, ctx.pos, (ctx.seq += 10));
+	ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
@@ -2167,9 +2171,9 @@ static int r5l_recovery_log(struct r5l_log *log)
 
 	log->log_start = ctx.pos;
 	log->next_checkpoint = ctx.pos;
+	log->last_checkpoint = pos;
 	log->seq = ctx.seq;
-	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
-	r5l_write_super(log, ctx.pos);
+	r5l_write_super(log, pos);
 	return 0;
 }
 
-- 
2.7.4




^ permalink raw reply related

* [PATCH 3/4] md/raid5-cache: release the stripe_head at the appropriate location
From: JackieLiu @ 2016-11-28  8:19 UTC (permalink / raw)
  To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu
In-Reply-To: <20161128081921.5641-1-liuyun01@kylinos.cn>

If we released the 'stripe_head' in r5c_recovery_flush_log,
ctx->cached_list will both release the data-parity stripes and
data-only stripes, which will become empty.
And we also need to use the data-only stripes in
r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite
data-only stripes is done before releasing them.

Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6d2b1da..9e72180 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1950,7 +1950,7 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
 static int r5c_recovery_flush_log(struct r5l_log *log,
 				  struct r5l_recovery_ctx *ctx)
 {
-	struct stripe_head *sh, *next;
+	struct stripe_head *sh;
 	int ret = 0;
 
 	/* scan through the log */
@@ -1979,11 +1979,9 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
 	r5c_recovery_replay_stripes(&ctx->cached_list, ctx);
 
 	/* load data-only stripes to stripe cache */
-	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
 		WARN_ON(!test_bit(STRIPE_R5C_CACHING, &sh->state));
 		r5c_recovery_load_one_stripe(log, sh);
-		list_del_init(&sh->lru);
-		raid5_release_stripe(sh);
 		ctx->data_only_stripes++;
 	}
 
@@ -2063,7 +2061,7 @@ static int
 r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 				       struct r5l_recovery_ctx *ctx)
 {
-	struct stripe_head *sh;
+	struct stripe_head *sh, *next;
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
 
@@ -2075,7 +2073,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 	}
 
 	ctx->seq += 10;
-	list_for_each_entry(sh, &ctx->cached_list, lru) {
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
 		int offset;
@@ -2121,6 +2119,9 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		sh->log_start = ctx->pos;
 		ctx->pos = write_pos;
 		ctx->seq += 1;
+
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
 	}
 	__free_page(page);
 	return 0;
-- 
2.7.4




^ permalink raw reply related

* [PATCH 2/4] md/raid5-cache: use ring add to prevent overflow
From: JackieLiu @ 2016-11-28  8:19 UTC (permalink / raw)
  To: songliubraving; +Cc: liuzhengyuan, linux-raid, JackieLiu
In-Reply-To: <20161128081921.5641-1-liuyun01@kylinos.cn>

'write_pos' must be protected with 'r5l_ring_add', or it may overflow

Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
---
 drivers/md/raid5-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 576c88d..6d2b1da 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2086,7 +2086,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 						     ctx->pos, ctx->seq);
 		mb = page_address(page);
 		offset = le32_to_cpu(mb->meta_size);
-		write_pos = ctx->pos + BLOCK_SECTORS;
+		write_pos = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
 
 		for (i = sh->disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-- 
2.7.4




^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox