Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: Why not just return an error?
From: keld @ 2016-10-07  9:30 UTC (permalink / raw)
  To: Rudy Zijlstra; +Cc: Dark Penguin, linux-raid
In-Reply-To: <0d0109f4-2c03-484c-9f70-008a7a1a0d67@grumpydevil.homelinux.org>

On Fri, Oct 07, 2016 at 10:21:26AM +0200, Rudy Zijlstra wrote:
> 
> 
> Op 07-10-16 om 07:26 schreef keld@keldix.com:
> >On Fri, Oct 07, 2016 at 02:32:40AM +0300, Dark Penguin wrote:
> >>Greetings!
> >>
> >>The more I read about md-raid, the more I notice that the biggest
> >>problem of it: if you hit an error on a degraded RAID, it falls apart.
> >>Because of this, it is possible to lose a huge amount of data due to one
> >>tiny read error, which particularly makes raid5 the sword of Damocles.
> >>
> >>But one question keeps me increasingly frustrated. Yes, during its
> >>normal functioning, it totally makes sense to kick a faulty device out
> >>of an array. But if we're running a degraded array, and doing so will
> >>definitely result is massive data loss, why not just return a read error
> >>instead? Just add a little check: on error, if degraded -> then just
> >>return an error. I believe this is the dream of everyone who had ever
> >>dealt with RAIDs.
> >>
> >>With RAID, the first proprity is keeping data safe. Yes, it's not an
> >>alternative to backups and all that, but still - if we hit an error on a
> >>degraded array, the array should scream and panic and send all kinds of
> >>warnings, but definitely NOT collapse and warrant a visit to the RAID
> >>recovery laboratory (or this mailing list). Imagine how much headache
> >>and lost hair would that relieve!..
> >>
> >>Now, I'm probably not the first one to think of such a bright idea. So
> >>there must be a very good reason why this is not possible; I don't think
> >>the problem is just that "the existing behaviour is preferred, and
> >>anyone who does not agree is an idiot". If not for enterprise use, then
> >>at least it would be very useful for the "home archive" scenario when
> >>"uptime" and "absense of errors" hold much less meaning than "losing one
> >>file and not all the data". So, why is this not possible?..
> >Likewise, when the first disk fails, one could mark it as kind of in an 
> >error state,
> >and keep it running, and if one gets a read error, then you could get
> >the data from the good disks.
> >
> >Often read errors can be remedied by writing data to the failing disk.
> >The good data could then be obtained from the good parts of the array.
> >
> >This behaviour could be optional and could even be set during operation.
> >
> >Best regards
> >keld
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> One big reason is human behaviour. And it is human behaviour that in the 
> end causes all the collapsed raids. I have lost count how often i have 
> seen requests for help once the raid had collapsed. But the earlier 
> signal, where the RAID had become degraded was ignored. This means that 
> if you only give an error message and continue going you will -- most 
> likely in increasing rate -- have errors in the files. Very quickly it 
> will become impossible to state which file is correct and which is not. 
> Essentially you have lost at that point all information with NO ability 
> to recover. Unless you have a backup....
> 
> That is one of the big reasons the behaviour is as it is. RAID is 
> intented to guarantee the consistency and correctness of the stored 
> data. When this becomes impossible, the only way out is to clearly 
> signal this. Even a collapsed RAID has  more consistent data (although 
> it takes effort to recover) then a corrupted RAID which would be the 
> result of your proposal. The corruption resulting from your proposal 
> above CANNOT be recovered.

I believe you are incorrect. As long as it is marked which parts of the array
that are in error, we know which data is good.  Of cause some data may be unobtainable,
but this may well be just a few files, and the rest will be good. A much better result
than all data being lost!

Anyway this could be an optional feature, so that it can be chosen or not be chosen.

Best regards
keld

^ permalink raw reply

* Re: md/raid1: Improve another size determination in setup_conf()
From: SF Markus Elfring @ 2016-10-07 10:50 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Richard Weinberger, linux-raid@vger.kernel.org, Christoph Hellwig,
	Guoqing Jiang, Jens Axboe, Mike Christie, Neil Brown, Shaohua Li,
	Tomasz Majchrzak, LKML, kernel-janitors@vger.kernel.org,
	Julia Lawall
In-Reply-To: <e5cddb79-1267-b2d8-c05b-1480096db21e@nod.at>

> Linux has tons of issues, fixes for real problems are very welcome.

Is a spectrum of software improvements to reconsider there?


> But coding style bike shedding is just a waste of time.

Why do various software developers bother about coding style specifications
at all then?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH 1/2] RAID1: ignore discard error
From: Sitsofe Wheeler @ 2016-10-07 10:53 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, v3.6
In-Reply-To: <36b0390e404e6021f2aec98ee87c074eef6dc6bb.1475788662.git.shli@fb.com>

On 6 October 2016 at 22:20, Shaohua Li <shli@fb.com> wrote:
> Sitsofe,
> could you try if this patch fixes the issue too please?

As with with the previous patch the BUG_ON/crash is fixed and as
before dmesg only says this now:
[   17.884175] sd 0:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[   17.884180] sd 0:0:1:0: [sdb] tag#0 Sense Key : Illegal Request [current]
[   17.884185] sd 0:0:1:0: [sdb] tag#0 Add. Sense: Invalid command
operation code
[   17.884190] sd 0:0:1:0: [sdb] tag#0 CDB: Write same(16) 93 08 00 00
00 00 00 00 40 80 00 00 08 00 00 00
[   17.884196] blk_update_request: critical target error, dev sdb, sector 16512
[   17.884205] sd 0:0:2:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[   17.884208] sd 0:0:2:0: [sdc] tag#1 Sense Key : Illegal Request [current]
[   17.884211] sd 0:0:2:0: [sdc] tag#1 Add. Sense: Invalid command
operation code
[   17.884215] sd 0:0:2:0: [sdc] tag#1 CDB: Write same(16) 93 08 00 00
00 00 00 00 40 80 00 00 08 00 00 00
[   17.884217] blk_update_request: critical target error, dev sdc, sector 16512
[   17.924716] Buffer I/O error on dev dm-0, logical block 0, async page read

(sd[bc] are the devices the RAID is using)

However this time around blkdiscard doesn't return an error (despite
the discard not being done) and is quiet. It's hard to say if this is
desirable given that the underlying devices switch to saying they
don't support discard but perhaps that's part of a different
discussion.

Tested-by: Sitsofe Wheeler <sitsofe@gmail.com>

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply

* Re: Why not just return an error?
From: Andreas Klauer @ 2016-10-07 11:21 UTC (permalink / raw)
  To: Dark Penguin; +Cc: linux-raid
In-Reply-To: <57F6DF18.40703@yandex.ru>

On Fri, Oct 07, 2016 at 02:32:40AM +0300, Dark Penguin wrote:
> why not just return a read error instead?

You make it sound like it solves all problems, but it does not.
Errors are just not part of the concept anywhere really.

If a filesystem encounters one, it might flip into read only mode;
if a program encounters one it might do whatever.
You still have a huge data loss, corrupt databases, et cetera.

Even so, is that not what you have with "bad block log" enabled, 
within reason? I disable it everywhere. I want my disks kicked.

Using cosmetics to hide errors only works to a certain limit. 
In the end, RAID only works if the disks work. RAID 5 with 
two dead disks is dead, no way to get around that. Disks go bad 
and need to be replaced, if you don't do that, you'll just fail 
even more horribly later on.

> I believe this is the dream of everyone who had ever 
> dealt with RAIDs.

My dream is different. I don't want errors. I want it to work. ;)
And it does, as long as you make sure your disks are healthy.

And if you make every effort to keep broken disks in your arrays, 
it just won't work. All promises are off - RAID promises to survive 
one or two dead disks, but that's only if all other disks are in 
perfect working order for the time it takes to rebuild.

Your disk produces read errors, or needs 3 minutes to read a single sector, 
what use is it to anyone? I'm not letting those disks stay, no matter how 
many more people preach that "read errors are normal". No. They're not. 
Such disks are utter and complete trash and have to go.

Don't wait for MD to kick disks out either. Check your disks. 
Actually replace them if they have errors. Most RAIDs die due 
to people not monitoring their disks, or delaying replacements.

Replacing disks costs money but that is the price you have to pay 
for the luxury of using RAID (especially at home) in the first place. 
When buying a RAID system, the money for the next replacement disk 
should always be planned into your budget. If you max it out or 
overdraw your budget for those fancy enterprise RAID disks, 
you'll find they die just the same.

Also make backups. RAID never replaces backups.

Regards
Andreas Klauer

^ permalink raw reply

* Re: RAID6 - CPU At 100% Usage After Reassembly
From: Francisco Parada @ 2016-10-07 11:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: mdraid
In-Reply-To: <20161006235530.GA109312@kernel.org>

> Sorry for the long delay.

No worries at all, Shaohua!!!! I really appreciate you getting back to me.

> Looks it showsthere are read errors in two disks and the raid6 thread is keeping loop to
handle the case, but doesn't success. The disks are sdd and sdg. I have no idea
how to fix this so far though, sorry.

OK, let me try running diagnostics on the disks again.  I ran
"badblocks" on them with the "-sv" options when I originally
encountered the issue, and only had found errors on one drive.  But
I'll run them again just in case, as well as a few other diagnostics.
I'll have to come up with something to try and detect the errors.  I'm
wondering if these read errors are caused by the controller, or the
drives themselves.  I can try swapping controllers.  Unless it's an
issue with the model drives that I'm using.  I've got WD 3TB Green
EZRX drives, which I recently found out via the RAID Wiki, that they
didn't have error correction (after I spent over a thousand dollars on
the drives in 3 years) ... Had I known better, I would have opted for
different models.

Thanks a million, Shaohua!

On Thu, Oct 6, 2016 at 7:55 PM, Shaohua Li <shli@kernel.org> wrote:
> On Wed, Oct 05, 2016 at 07:12:59AM -0400, Francisco Parada wrote:
>> Hi all,
>>
>> Just wanted to send one last ping on this.  I didn't hear back and
>> really don't know where else to turn to before abandoning hope.
>> Should I just start wiping the drives so I can start from scratch at
>> this point?
>>
>> I rebuilt the kernel with debug patch that Shaohua provided and I sent
>> the list the output of the trace.  Does any one have any other
>> suggestions?
>
> Sorry for the long delay. I got a chance to look at the log today. Looks it shows
> there are read errors in two disks and the raid6 thread is keeping loop to
> handle the case, but doesn't success. The disks are sdd and sdg. I have no idea
> how to fix this so far though, sorry.
>
> Thanks,
> Shaohua

^ permalink raw reply

* Re: md/raid1: Improve another size determination in setup_conf()
From: Austin S. Hemmelgarn @ 2016-10-07 11:52 UTC (permalink / raw)
  To: SF Markus Elfring
  Cc: Dan Carpenter, Richard Weinberger, linux-raid@vger.kernel.org,
	Christoph Hellwig, Guoqing Jiang, Jens Axboe, Mike Christie,
	Neil Brown, Shaohua Li, Tomasz Majchrzak, LKML,
	kernel-janitors@vger.kernel.org, Julia Lawall
In-Reply-To: <522db506-1e1c-0563-7595-da6dc701d706@users.sourceforge.net>

On 2016-10-07 06:50, SF Markus Elfring wrote:
>> Linux has tons of issues, fixes for real problems are very welcome.
>
> Is a spectrum of software improvements to reconsider there?
>
>
>> But coding style bike shedding is just a waste of time.
>
> Why do various software developers bother about coding style specifications
> at all then?
Coding style is important, but patches that just fix coding style are a 
bad thing because they break things like `git blame` and run the risk of 
introducing new bugs without any net benefit to end users.  This goes 
double for code you don't actually work on regularly or don't completely 
understand.

^ permalink raw reply

* Re: md-cluster: Delete unnecessary braces in unlock_all_bitmaps()
From: walter harms @ 2016-10-07 13:21 UTC (permalink / raw)
  To: kernel-janitors
  Cc: Dan Carpenter, linux-raid, Guoqing Jiang, Shaohua Li, LKML,
	Julia Lawall
In-Reply-To: <9e1de37a-d98d-c885-b4ff-5a8b2aa1f7a8@users.sourceforge.net>



Am 07.10.2016 10:37, schrieb SF Markus Elfring:
>>> Do not use curly brackets at one source code place
>>> where a single statement should be sufficient.
>>
>> The original style was correct and this is wrong.  I have explained this before.
> 
> Did I change a bit too much in the proposed step according to the following
> update suggestion?
> 
> elfring@Sonne:~/Projekte/Linux/next-patched> git checkout d6385db94196b253ae5eb3678fa95cdf1f839fcc && scripts/checkpatch.pl --types BRACES -f drivers/md/md-cluster.c
> …
> WARNING: braces {} are not necessary for single statement blocks
> #1228: FILE: drivers/md/md-cluster.c:1228:
> +			if (cinfo->other_bitmap_lockres[i]) {
> +				lockres_free(cinfo->other_bitmap_lockres[i]);
> +			}
> …
> 
> 
> How do you think about to adjust this source code place a bit?
> 

perhaps we can agree to delete the if() block ?

static void lockres_free(struct dlm_lock_resource *res)
{
        int ret;

       if (!res)
                return;

....


@marcus: is can not send mail to you sf.net adresse because sf.net
         mark my domain als spam (note: nobody else does)

re,
 wh

^ permalink raw reply

* Re: Why not just return an error?
From: Phil Turmel @ 2016-10-07 14:19 UTC (permalink / raw)
  To: Dark Penguin, linux-raid
In-Reply-To: <57F6DF18.40703@yandex.ru>

On 10/06/2016 07:32 PM, Dark Penguin wrote:
> Greetings!
> 
> The more I read about md-raid, the more I notice that the biggest
> problem of it: if you hit an error on a degraded RAID, it falls apart.
> Because of this, it is possible to lose a huge amount of data due to one
> tiny read error, which particularly makes raid5 the sword of Damocles.

Because raid is about uptime through failures.  It's not backup, it's
not data consistency.  A degraded array is supposed to be a temporary
state -- the time it takes to install a new drive and rebuild.  A
single-degraded raid6 still has redundancy to carry you through a read
error during rebuild.  Raid5 does not.  That's it.  There's no other
magic, and anything else would be more bug-inducing complexity.

A degraded raid5 isn't raid anymore, just "aid".  You can minimize the
odds of a read error during rebuild by properly scrubbing your arrays
while they are non-degraded, but drive specifications make it clear that
your odds won't be good on large arrays.

> But one question keeps me increasingly frustrated. Yes, during its
> normal functioning, it totally makes sense to kick a faulty device out
> of an array.

{ Possible misconception here: linux raid arrays don't kick out drives
just for read errors.  MD raid will attempt to *fix* the bad sector
using the data from the other drives.  Only if the fix fails will the
drive be ejected.  Timeout mismatch guarantees that the fix will fail. }

> But if we're running a degraded array, and doing so will
> definitely result is massive data loss, why not just return a read error
> instead? Just add a little check: on error, if degraded -> then just
> return an error. I believe this is the dream of everyone who had ever
> dealt with RAIDs.

Stopping the array *preserves* data.  The block layer has no concept of
what's on top, and an error in one place that isn't handled could easily
turn into corruption in otherwise good places.  Layered block devices
require a sysadmin to evaluate the situation.

> With RAID, the first proprity is keeping data safe. Yes, it's not an
> alternative to backups and all that, but still - if we hit an error on a
> degraded array, the array should scream and panic and send all kinds of
> warnings, but definitely NOT collapse and warrant a visit to the RAID
> recovery laboratory (or this mailing list). Imagine how much headache
> and lost hair would that relieve!..

Linux raid is widely used.  Traffic on this list is relatively small.
I'm quite sure 99.99% of linux raid users are dealing with these events
just fine:  ddrescue the troublesome drive to another, reassemble with
that, then wipe or replace the original.  Whine to the powers that be
that raid6 would have kept their array up through the event so could
they please fund another drive?  Drown sorrows in beer if the PTB say no.

> Now, I'm probably not the first one to think of such a bright idea. So
> there must be a very good reason why this is not possible; I don't think
> the problem is just that "the existing behaviour is preferred, and
> anyone who does not agree is an idiot". If not for enterprise use, then
> at least it would be very useful for the "home archive" scenario when
> "uptime" and "absense of errors" hold much less meaning than "losing one
> file and not all the data". So, why is this not possible?..

No, you aren't the first to want a magic wand.  Sorry.

Phil

^ permalink raw reply

* Re: Why not just return an error?
From: Phil Turmel @ 2016-10-07 14:43 UTC (permalink / raw)
  To: Andreas Klauer, Dark Penguin; +Cc: linux-raid
In-Reply-To: <20161007112151.GA4405@metamorpher.de>

Good morning Andreas,

On 10/07/2016 07:21 AM, Andreas Klauer wrote:
> On Fri, Oct 07, 2016 at 02:32:40AM +0300, Dark Penguin wrote:
>> why not just return a read error instead?
> 
> You make it sound like it solves all problems, but it does not.
> Errors are just not part of the concept anywhere really.

That's not strictly true. The majority of read errors on large modern
drives are fixable by writing over the troublesome sector.  That may or
may not relocate the sector to the drive's spare area.  Read error
locations that haven't yet been overwritten are identified in the drive
firmware as "Pending Relocations", since the drive doesn't yet know if
the problem is a true media defect or just a write error (power
transient during write, whatever).

Since brand new drives almost never have errors, people assume that's
normal.  Get three or four years in and you see that's not true.  In my
experience, when actual relocations hit double digits, it's time to
replace the drive.  The drive is still operating within spec, though --
it won't be a warranty replacement.

> If a filesystem encounters one, it might flip into read only mode;
> if a program encounters one it might do whatever.
> You still have a huge data loss, corrupt databases, et cetera.

Concur.

> Even so, is that not what you have with "bad block log" enabled, 
> within reason? I disable it everywhere. I want my disks kicked.

I want my disks *fixed* if possible, not kicked.  If they're kicked, the
rest of the good data on that disk is unavailable for keeping my array
running.  I want to see the relocations growing in my daily logwatch
reports so I can use mdadm --replace to maintain the array without *any*
loss of redundancy.

> Using cosmetics to hide errors only works to a certain limit. 
> In the end, RAID only works if the disks work. RAID 5 with 
> two dead disks is dead, no way to get around that. Disks go bad 
> and need to be replaced, if you don't do that, you'll just fail 
> even more horribly later on.

Concur.  We seem to differ on where to draw the line on "bad".

> Your disk produces read errors, or needs 3 minutes to read a single sector, 
> what use is it to anyone? I'm not letting those disks stay, no matter how 
> many more people preach that "read errors are normal". No. They're not. 
> Such disks are utter and complete trash and have to go.

Really?  You get rid of drives on the first read error event?  If you're
discarding them, I'll pay shipping for you to send them to me.  That
would be an especially cost effective source of drives for me. None of
the green or desktop POSes, though.  (-:  Or are you just not noticing
the read errors because MD is silently fixing them for you?

> Don't wait for MD to kick disks out either. Check your disks. 
> Actually replace them if they have errors. Most RAIDs die due 
> to people not monitoring their disks, or delaying replacements.

Yup.

> Replacing disks costs money but that is the price you have to pay 
> for the luxury of using RAID (especially at home) in the first place. 
> When buying a RAID system, the money for the next replacement disk 
> should always be planned into your budget. If you max it out or 
> overdraw your budget for those fancy enterprise RAID disks, 
> you'll find they die just the same.

Enterprise drives are easily justified for heavily loaded arrays in a
small shop.  NAS drives are just fine for small business and home media
servers.  Green and modern desktop drives are utterly unsuited to raid duty.

> Also make backups. RAID never replaces backups.

Indeed.

Phil

^ permalink raw reply

* Re: [PATCH] Fix bus error when accessing MBR partition records
From: Jes Sorensen @ 2016-10-07 15:14 UTC (permalink / raw)
  To: James Clarke; +Cc: linux-raid
In-Reply-To: <20160929122838.66975-1-jrtc27@jrtc27.com>

James Clarke <jrtc27@jrtc27.com> writes:
> Since the MBR layout only has partition records as 2-byte aligned, the 32-bit
> fields in them are not aligned. Thus, they cannot be accessed on some
> architectures (such as SPARC) by using a "struct MBR_part_record *" pointer,
> as the compiler can assume that the pointer is properly aligned. Instead, the
> records must be accessed by going through the MBR struct itself every time.
>
> Signed-off-by: James Clarke <jrtc27@jrtc27.com>
> ---
>  super-mbr.c |  6 ++++++
>  util.c      | 14 +++++++-------
>  2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/super-mbr.c b/super-mbr.c
> index 62b3f03..303dde4 100644
> --- a/super-mbr.c
> +++ b/super-mbr.c
> @@ -57,6 +57,9 @@ static void examine_mbr(struct supertype *st, char *homehost)
>  
>  	printf("   MBR Magic : %04x\n", sb->magic);
>  	for (i = 0; i < MBR_PARTITIONS; i++)
> +		/* Have to make every access through sb rather than using a pointer to
> +		 * the partition table (or an entry), since the entries are not
> +		 * properly aligned. */
>  		if (sb->parts[i].blocks_num)
>  			printf("Partition[%d] : %12lu sectors at %12lu (type %02x)\n",
>  			       i,

Reading through this thread and Neil's comments, I think it's reasonable
to do what you are doing with pointer access. However I also believe
that packed should be applied as Neil suggests.

Second, as code lines per definition are 80 characters wide, please make
sure your patch complies with this. I know some of the older code
violates this, but we shouldn't be adding more code which does so.

If you send me an updated patch I shall be happy to apply it.

Jes

^ permalink raw reply

* Re: [PATCH] imsm: retrieve nvme serial from sysfs
From: Jes Sorensen @ 2016-10-07 15:19 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20161006091309.27263-1-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> Don't rely on SCSI ioctl for reading NVMe serials - SCSI emulation for
> NVMe devices can be disabled in the kernel config. Instead, try to get a
> serial from /sys/block/nvme*/device/serial. If that fails for whatever
> reason (i.e. no such attribute in old kernels) - fall back to the SCSI
> method.
>
> This also moves some SCSI-specific code from imsm_read_serial() to
> scsi_get_serial().
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> Reviewed-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
> ---
>  sg_io.c       | 23 +++++++++++++++++++----
>  super-intel.c | 46 +++++++++++++++++++++++++++-------------------
>  2 files changed, 46 insertions(+), 23 deletions(-)

Hi Artur,

Looks quite reasonable - applied!

Thanks,
Jes

^ permalink raw reply

* Re: md/raid1: Improve another size determination in setup_conf()
From: SF Markus Elfring @ 2016-10-07 15:27 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Dan Carpenter, Richard Weinberger, linux-raid@vger.kernel.org,
	Christoph Hellwig, Guoqing Jiang, Jens Axboe, Mike Christie,
	Neil Brown, Shaohua Li, Tomasz Majchrzak, LKML,
	kernel-janitors@vger.kernel.org, Julia Lawall
In-Reply-To: <6e2c26bc-d765-6225-af72-157832ab8785@gmail.com>

>> Why do various software developers bother about coding style specifications
>> at all then?
> Coding style is important,

Thanks that you "dare" to express also such an opinion.

> but patches that just fix coding style are a bad thing

When you find such a change opportunity so "bad", are there any circumstances
left over where you would dare to touch the corresponding source code line.

> because they break things like `git blame`

I follow your concern to some degree.

But can this argument evolve against a lot of changes generally?

> and run the risk of introducing new bugs

Did this really "happen" because of an update suggestion for this software module?

> without any net benefit to end users.

Can the proposed adjustment help to make a function like "setup_conf"
a bit more robust (together with related update steps) so that an improved
coding style compliance will hopefully influence the error probability
in positive ways?

> This goes double for code you don't actually work on regularly
> or don't completely understand.

How does such a kind of general feedback fit to the shown change
possibilities in this patch series?

Do you reject this update step?

Regards,
Markus

^ permalink raw reply

* Re: md/raid1: Improve another size determination in setup_conf()
From: Jiri Kosina @ 2016-10-07 15:35 UTC (permalink / raw)
  To: SF Markus Elfring
  Cc: Austin S. Hemmelgarn, Dan Carpenter, Richard Weinberger,
	linux-raid@vger.kernel.org, Christoph Hellwig, Guoqing Jiang,
	Jens Axboe, Mike Christie, Neil Brown, Shaohua Li,
	Tomasz Majchrzak, LKML, kernel-janitors@vger.kernel.org,
	Julia Lawall
In-Reply-To: <a3d1b5a7-4b4e-a37d-2952-bb2e01828c39@users.sourceforge.net>

On Fri, 7 Oct 2016, SF Markus Elfring wrote:

> > but patches that just fix coding style are a bad thing
> 
> When you find such a change opportunity so "bad", are there any 
> circumstances left over where you would dare to touch the corresponding 
> source code line.

If you actually rewrite the code or fix some real bug there.

> > because they break things like `git blame`
> 
> I follow your concern to some degree.
> 
> But can this argument evolve against a lot of changes generally?

If I have to reiterate git blame multiple times just because of whitespace 
or codingstyle changes, it's a pure waste of my time.

If I have to reiterate git blame multiple times to skip actual real 
changes, I have no other option than to live with that (because there was 
an actual functional reason for the change).

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply

* How to fix mistake on raid: mdadm create instead of assemble?
From: Santiago DIEZ @ 2016-10-07 15:37 UTC (permalink / raw)
  To: Linux Raid LIST

Hi guys,

I'm new to RAID although I've had a server running raid5 for a while.
It was delivered preinstalled like this and I never really wondered
how to monitor and maintain it. This quick introduction just to let
you understand why I'm such an idiot asking such a silly question.

Now what happened?

I have a server with 4 disks and raid5 configured. /dev/md10 is made
of sda10 , sdb10 , sdc10 and sdd10 .

Unfortunately, /dev/sdd died, the server crashed, etc. After restart,
md10 did not rebuilt. I understood sdd was dead and did not try to
force rebuild or even touch the existing system.

First thing I did is ddrescue the remaining partitions sd[abc]10 .
ddrescue did not stumble into any read error so I assume all remaining
partitions are perfectly safe.

Then I examined the partitions with :

================================================================================
# mdadm --examine /dev/loop[012]

/dev/loop0:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9d37bc89:711887ae:a4d2adc2:26fd5302
  Creation Time : Wed Jan 25 09:08:11 2012
     Raid Level : raid5
  Used Dev Size : 1926247296 (1837.01 GiB 1972.48 GB)
     Array Size : 5778741888 (5511.04 GiB 5917.43 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Sep  5 23:29:23 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 9d0ce26d - correct
         Events : 81589

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       10        0      active sync

   0     0       8       10        0      active sync
   1     1       8       26        1      active sync
   2     2       8       42        2      active sync
   3     3       0        0        3      faulty removed
/dev/loop1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9d37bc89:711887ae:a4d2adc2:26fd5302
  Creation Time : Wed Jan 25 09:08:11 2012
     Raid Level : raid5
  Used Dev Size : 1926247296 (1837.01 GiB 1972.48 GB)
     Array Size : 5778741888 (5511.04 GiB 5917.43 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Sep  5 23:36:23 2016
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 9d0ce487 - correct
         Events : 81626

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       26        1      active sync

   0     0       0        0        0      removed
   1     1       8       26        1      active sync
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
/dev/loop2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9d37bc89:711887ae:a4d2adc2:26fd5302
  Creation Time : Wed Jan 25 09:08:11 2012
     Raid Level : raid5
  Used Dev Size : 1926247296 (1837.01 GiB 1972.48 GB)
     Array Size : 5778741888 (5511.04 GiB 5917.43 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Sep  5 23:29:23 2016
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 9d0ce291 - correct
         Events : 81589

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       42        2      active sync

   0     0       8       10        0      active sync
   1     1       8       26        1      active sync
   2     2       8       42        2      active sync
   3     3       0        0        3      faulty removed
================================================================================

There comes my mistake: I ran the --create command instead of --assemble :

================================================================================
# mdadm --create --verbose /dev/md1 --raid-devices=4 --level=raid5
--run --readonly /dev/loop0 /dev/loop1 /dev/loop2 missing

mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: /dev/loop0 appears to contain an ext2fs file system
       size=5778741888K  mtime=Sat Sep  3 11:00:22 2016
mdadm: /dev/loop0 appears to be part of a raid array:
       level=raid5 devices=4 ctime=Wed Jan 25 09:08:11 2012
mdadm: /dev/loop1 appears to be part of a raid array:
       level=raid5 devices=4 ctime=Wed Jan 25 09:08:11 2012
mdadm: /dev/loop2 appears to be part of a raid array:
       level=raid5 devices=4 ctime=Wed Jan 25 09:08:11 2012
mdadm: size set to 1926115840K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: creation continuing despite oddities due to --run
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
================================================================================

After that, mounting failed:

================================================================================
# mount /dev/md1 /raid/
mount: /dev/md1 is write-protected, mounting read-only
mount: wrong fs type, bad option, bad superblock on /dev/md1,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
================================================================================

Here's more info about the new raid to be compared with the initial one:

================================================================================
# mdadm --examine /dev/loop[012]

/dev/loop0:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : aa56f42f:bb95fbde:11ce620e:878b2b1c
           Name : tucana.caoba.fr:1  (local to host tucana.caoba.fr)
  Creation Time : Mon Sep 19 23:17:04 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3852232703 (1836.89 GiB 1972.34 GB)
     Array Size : 5778347520 (5510.66 GiB 5917.03 GB)
  Used Dev Size : 3852231680 (1836.89 GiB 1972.34 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=1023 sectors
          State : clean
    Device UUID : b4622e59:f0735f5a:825086d1:57f89efb

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Sep 19 23:17:04 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 3ec8dda7 - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/loop1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : aa56f42f:bb95fbde:11ce620e:878b2b1c
           Name : tucana.caoba.fr:1  (local to host tucana.caoba.fr)
  Creation Time : Mon Sep 19 23:17:04 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3852232703 (1836.89 GiB 1972.34 GB)
     Array Size : 5778347520 (5510.66 GiB 5917.03 GB)
  Used Dev Size : 3852231680 (1836.89 GiB 1972.34 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=1023 sectors
          State : clean
    Device UUID : 9d42153b:4173aeea:51f41ebc:3789f98a

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Sep 19 23:17:04 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 2af1f191 - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
/dev/loop2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : aa56f42f:bb95fbde:11ce620e:878b2b1c
           Name : tucana.caoba.fr:1  (local to host tucana.caoba.fr)
  Creation Time : Mon Sep 19 23:17:04 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3852232703 (1836.89 GiB 1972.34 GB)
     Array Size : 5778347520 (5510.66 GiB 5917.03 GB)
  Used Dev Size : 3852231680 (1836.89 GiB 1972.34 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=1023 sectors
          State : clean
    Device UUID : ada52b0e:f2c4a680:ece59800:6425a9b2

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Sep 19 23:17:04 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 2a341a - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAA. ('A' == active, '.' == missing, 'R' == replacing)
================================================================================

With the help of the initial mdadm --examine , is it possible to
recreate my raid in a way that I can read data out of it?


I also posted the question here:
http://www.unix.com/unix-for-advanced-and-expert-users/268771-how-fix-mistake-raid-mdadm-create-instead-assemble-post302983179.html#post302983179

Regards
-------------------------
Santiago DIEZ
CAOBA Conseil
Paris, France

^ permalink raw reply

* Re: [PATCH - mdadm] Fix some issues found by clang
From: Jes Sorensen @ 2016-10-07 15:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID
In-Reply-To: <8737k84xuf.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> The clang compiler complained about each of these.
>
> The mdmon.h error will only affect 'far' RAID10 arrays using intel or DDF
> metadata, and there is no such thing.
>
> The mdopen.c will cause a problem if there are no free md device
> numbers in the first 512.  That is fairly unlikely.
>
> The restripe.c error would only affect the 'test_stripe' command, and
> probably doesn't change its behaviour.
>
> The super-intel.c fix is purely cosmetic.
>
> Signed-off-by: NeilBrown <neilb@suse.com>

Oh neat! There are some real gems in there. Now I don't feel so bad
about the brown paper bag bug I found in my wireless driver last night
:)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: Why not just return an error?
From: Dark Penguin @ 2016-10-07 16:23 UTC (permalink / raw)
  To: Phil Turmel, Andreas Klauer, Rudy Zijlstra, keld; +Cc: linux-raid
In-Reply-To: <e887908f-ba51-0f88-f891-f60e8bac50bd@turmel.org>

> Likewise, when the first disk fails, one could mark it as kind of in an error state,
> and keep it running, and if one gets a read error, then you could get
> the data from the good disks.

Yes!! If a drive is "faulty", it means "you should replace it because it 
is failing"; there is no need to actually stop using it and degrade the 
whole RAID operation! What's more, it would be extremely useful at 
rebuilding without any performance loss: let the array work in degraded 
mode, while the faulty drive is being copied to the new one, with only 
read errors reconstructed from the rest of the drives! But that's a 
different issue, and not a very good idea for other reasons.

> One big reason is human behaviour. And it is human behaviour that in the
> end causes all the collapsed raids.

"Human behaviour", that's what I'm talking about. If the only reason to 
do it is to force people to do what is necessary, that approach is 
called "Windows". :) And I do not suggest that it should be the default 
behaviour; instead, we should have an option "--idiotmode 
--yes-i-know-what-i-am-doing" at RAID creation for those who 
specifically want to take the risks.

And of course, no broken files will appear if we suffer from read 
*errors*. We do not suffer from *incorrect reads*, right?..

> You make it sound like it solves all problems, but it does not.
> Errors are just not part of the concept anywhere really.

It does not "solve all problems", but it lets me solve my problems my 
way, and not "the only correct and intended way" - which is what Linux 
is good at. :)

>> > I believe this is the dream of everyone who had ever dealt with RAIDs.
>
> My dream is different. I don't want errors. I want it to work. ;)
> And it does, as long as you make sure your disks are healthy.

I do not suggest that we do it my way and not yours - we have an option 
to do it your way, but we do not have one to do it my way, that's the 
problem. :)

Anyway, if I had a collapsed RAID-5, I would want to at least have an 
easy option to start it in a read-only mode in the last-known working 
state, while the faulty drives are still not out of sync, and recover 
data easily (to my single backup drive), or continue using the array for 
a while, manually deleting one "bad" file if necessary; this is of 
course not a "good thing" to do, but this way, RAID would be at least 
not worse than single drives with faulty sectors, which are capable of 
that, while RAIDs are not! I would be fine with that in my archive - as 
I'm fine with some less importand parts of the archive being on faulty 
single drives. It's just that I don't want to lose the whole drive due 
to a hardware failure - and RAID adds more causes other than that, 
instead of offering more protection against that.

>> > Using cosmetics to hide errors only works to a certain limit.
>> > In the end, RAID only works if the disks work. RAID 5 with
>> > two dead disks is dead, no way to get around that. Disks go bad
>> > and need to be replaced, if you don't do that, you'll just fail
>> > even more horribly later on.
>
> Concur.  We seem to differ on where to draw the line on "bad".

And I think that line should be easy to move, so that anyone could 
choose their own! I understand that RAID is meant for "uptime, not 
backups" - for enterprise production. And everything that you say is 
correct about this case. However, there are other uses - like mirroring 
my backup archive to protect against whole-drive failures. And in this 
case, I want different behaviour; I can take in onto myself to make sure 
a read error won't make my filesystems go into read-only mode and break 
anything, I really know what I'm doing, and I don't need my computer to 
tell me that RAID is not supposed to be used in this way. And it 
shouldn't add a lot of complex code - just a test "if idiotmode and 
lastdisk then return error, else kick drive; shout like crazy either 
way". :)

It's just that everyone has their own opinion on where to draw the line, 
and the "intended" one should of course be preached, but not forced!

-- 
darkpenguin

^ permalink raw reply

* Re: md/raid1: Improve another size determination in setup_conf()
From: SF Markus Elfring @ 2016-10-07 16:38 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Austin S. Hemmelgarn, Dan Carpenter, Richard Weinberger,
	linux-raid, Christoph Hellwig, Guoqing Jiang, Jens Axboe,
	Mike Christie, Neil Brown, Shaohua Li, Tomasz Majchrzak, LKML,
	kernel-janitors, Julia Lawall
In-Reply-To: <alpine.LRH.2.00.1610071729340.8015@gjva.wvxbf.pm>

>>> but patches that just fix coding style are a bad thing
>>
>> When you find such a change opportunity so "bad", are there any 
>> circumstances left over where you would dare to touch the corresponding 
>> source code line.
> 
> If you actually rewrite the code or fix some real bug there.

Do the proposed update steps 12 - 16 for the function "setup_conf"
(in this software module) fit to your condition?

Do you reject this update step?

Regards,
Markus

^ permalink raw reply

* Re: [GIT PULL] MD update for 4.9
From: Linus Torvalds @ 2016-10-07 16:44 UTC (permalink / raw)
  To: doug; +Cc: Shaohua Li, Linux Kernel Mailing List, linux-raid, Neil Brown
In-Reply-To: <CAFx4rwTo7s7uVH3PuGF-ys52QaEdxHoP2viYuJteRQqmeiEyqA@mail.gmail.com>

On Thu, Oct 6, 2016 at 10:39 PM, Doug Dumitru <doug@easyco.com> wrote:
>
> There is another thread in [linux-raid] discussing pre-fetches in the
> raid-6 AVX2 code.  My testing implies that the prefetch distance is
> too short.  In your new AVX512 code, it looks like there are 24
> instructions, each with latencies of 1, between the prefetch and the
> actual memory load.  I don't have a AVX512 CPU to try this on, but the
> prefetch might do better at a bigger distance.  If I am not mistaken,
> it takes a lot longer than 24 clocks to fetch 4 cache lines.

We have basically never had a case where prefetches were actually a good idea.

If the hardware doesn't do prefetching on its own (partly with just
physical memory patterns in the memory controller, partly just with
aggressive OoO), software isn't going to be able to improve on the
situation in general.

SW prefetching is a broken concept. You can make big differences for
very specific microarchitectures (usually the broken shit ones are the
ones that show it best), but in the general case it's pretty much
always a lost cause. We've had real cases where prefetching just then
made things worse on other hardware.

So just don't do it. It's benchmarketing for specific hardware, it's
not worth worrying about in the bigger picture. You'll find people
spend a lot of time tuning things for their particular hardware, and
it not helping at all on anything else.

Waste of time. Life is too short (and software is too complex) to try
to work around broken microarchitectures with sw prefetching.

              Linus

^ permalink raw reply

* Re: Why not just return an error?
From: Phil Turmel @ 2016-10-07 16:52 UTC (permalink / raw)
  To: Dark Penguin, Andreas Klauer, Rudy Zijlstra, keld; +Cc: linux-raid
In-Reply-To: <57F7CC10.3050607@yandex.ru>

Hi DP,

{It's good that you are trimming replies, but don't cut the ID of who
wrote what. }

On 10/07/2016 12:23 PM, Dark Penguin wrote:
>> Likewise, when the first disk fails, one could mark it as kind of in
>> an error state,
>> and keep it running, and if one gets a read error, then you could get
>> the data from the good disks.
> 
> Yes!! If a drive is "faulty", it means "you should replace it because it
> is failing"; there is no need to actually stop using it and degrade the
> whole RAID operation! What's more, it would be extremely useful at
> rebuilding without any performance loss: let the array work in degraded
> mode, while the faulty drive is being copied to the new one, with only
> read errors reconstructed from the rest of the drives! But that's a
> different issue, and not a very good idea for other reasons.

MD raid already does as much of this as it can, as I described.

>> One big reason is human behaviour. And it is human behaviour that in the
>> end causes all the collapsed raids.
> 
> "Human behaviour", that's what I'm talking about. If the only reason to
> do it is to force people to do what is necessary, that approach is
> called "Windows". :) And I do not suggest that it should be the default
> behaviour; instead, we should have an option "--idiotmode
> --yes-i-know-what-i-am-doing" at RAID creation for those who
> specifically want to take the risks.
> 
> And of course, no broken files will appear if we suffer from read
> *errors*. We do not suffer from *incorrect reads*, right?..

You want to push the failure condition from being "broken raid with
likely salvageable data, except for one sector" to "repeated errors to
the upper layers with unknowable corruption as side effects".

>> You make it sound like it solves all problems, but it does not.
>> Errors are just not part of the concept anywhere really.
> 
> It does not "solve all problems", but it lets me solve my problems my
> way, and not "the only correct and intended way" - which is what Linux
> is good at. :)

Then patch your kernel with your desired behavior.  "Free software"
doesn't mean someone writes what you want for free.  And I disagree with
you, so would object to it being put in the mainline kernel.

>>> > I believe this is the dream of everyone who had ever dealt with RAIDs.
>>
>> My dream is different. I don't want errors. I want it to work. ;)
>> And it does, as long as you make sure your disks are healthy.
> 
> I do not suggest that we do it my way and not yours - we have an option
> to do it your way, but we do not have one to do it my way, that's the
> problem. :)

Write the code to add the option you want.

> Anyway, if I had a collapsed RAID-5, I would want to at least have an
> easy option to start it in a read-only mode in the last-known working
> state, while the faulty drives are still not out of sync, and recover
> data easily (to my single backup drive), or continue using the array for
> a while, manually deleting one "bad" file if necessary; this is of
> course not a "good thing" to do, but this way, RAID would be at least
> not worse than single drives with faulty sectors, which are capable of
> that, while RAIDs are not! I would be fine with that in my archive - as
> I'm fine with some less importand parts of the archive being on faulty
> single drives. It's just that I don't want to lose the whole drive due
> to a hardware failure - and RAID adds more causes other than that,
> instead of offering more protection against that.

MD raid has no idea what is at any given sector.  And with a
near-infinite variety of layering choices, there's no way it's going to.
 That's why *you* have to do this.  You trimmed my description of the
only "easy option" actually trustable.

> It's just that everyone has their own opinion on where to draw the line,
> and the "intended" one should of course be preached, but not forced!

The "line" I was referring to is the decision of when to throw away a
drive vs. recondition it.  That's already in your hands.

Phil

^ permalink raw reply

* Re: [GIT PULL] MD update for 4.9
From: Shaohua Li @ 2016-10-07 17:39 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: torvalds, linux-kernel, linux-raid, neilb
In-Reply-To: <CAFx4rwTo7s7uVH3PuGF-ys52QaEdxHoP2viYuJteRQqmeiEyqA@mail.gmail.com>

On Thu, Oct 06, 2016 at 10:39:21PM -0700, Doug Dumitru wrote:
> Mr. Li,
> 
> There is another thread in [linux-raid] discussing pre-fetches in the
> raid-6 AVX2 code.  My testing implies that the prefetch distance is
> too short.  In your new AVX512 code, it looks like there are 24
> instructions, each with latencies of 1, between the prefetch and the
> actual memory load.  I don't have a AVX512 CPU to try this on, but the
> prefetch might do better at a bigger distance.  If I am not mistaken,
> it takes a lot longer than 24 clocks to fetch 4 cache lines.
> 
> Just a comment while the code is still fluid.

I did try your patch and it improved 10% in my machine, but this isn't
relevent to the pull. We can do the tunning later if necessary. I'm
hoping the intel guys can share some hints, but apparently Linus isn't a
fan for such tuning.

Thanks,
Shaohua

> On Thu, Oct 6, 2016 at 5:38 PM, Shaohua Li <shli@kernel.org> wrote:
> > Hi Linus,
> > Please pull MD update for 4.9. This update includes:
> > - new AVX512 instruction based raid6 gen/recovery algorithm
> > - A couple of md-cluster related bug fixes
> > - Fix a potential deadlock
> > - Set nonrotational bit for raid array with SSD
> > - Set correct max_hw_sectors for raid5/6, which hopefuly can improve
> >   performance a little bit
> > - Other minor fixes
> >
> > Thanks,
> > Shaohua
> >
> > The following changes since commit 7d1e042314619115153a0f6f06e4552c09a50e13:
> >
> >   Merge tag 'usercopy-v4.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux (2016-09-20 17:11:19 -0700)
> >
> > are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.9-rc1
> >
> > for you to fetch changes up to bb086a89a406b5d877ee616f1490fcc81f8e1b2b:
> >
> >   md: set rotational bit (2016-10-03 10:20:27 -0700)
> >
> > ----------------------------------------------------------------
> > Chao Yu (1):
> >       raid5: fix to detect failure of register_shrinker
> >
> > Gayatri Kammela (5):
> >       lib/raid6: Add AVX512 optimized gen_syndrome functions
> >       lib/raid6: Add AVX512 optimized recovery functions
> >       lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
> >       lib/raid6: Add AVX512 optimized xor_syndrome functions
> >       raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
> >
> > Guoqing Jiang (9):
> >       md-cluster: call md_kick_rdev_from_array once ack failed
> >       md-cluster: use FORCEUNLOCK in lockres_free
> >       md-cluster: remove some unnecessary dlm_unlock_sync
> >       md: changes for MD_STILL_CLOSED flag
> >       md-cluster: clean related infos of cluster
> >       md-cluster: protect md_find_rdev_nr_rcu with rcu lock
> >       md-cluster: convert the completion to wait queue
> >       md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
> >       md-cluster: make resync lock also could be interruptted
> >
> > Shaohua Li (5):
> >       raid5: allow arbitrary max_hw_sectors
> >       md/bitmap: fix wrong cleanup
> >       md: fix a potential deadlock
> >       raid5: handle register_shrinker failure
> >       md: set rotational bit
> >
> >  arch/x86/Makefile        |   5 +-
> >  drivers/md/bitmap.c      |   4 +-
> >  drivers/md/md-cluster.c  |  99 ++++++---
> >  drivers/md/md.c          |  44 +++-
> >  drivers/md/md.h          |   5 +-
> >  drivers/md/raid5.c       |  11 +-
> >  include/linux/raid/pq.h  |   4 +
> >  lib/raid6/Makefile       |   2 +-
> >  lib/raid6/algos.c        |  12 +
> >  lib/raid6/avx512.c       | 569 +++++++++++++++++++++++++++++++++++++++++++++++
> >  lib/raid6/recov_avx512.c | 388 ++++++++++++++++++++++++++++++++
> >  lib/raid6/test/Makefile  |   5 +-
> >  lib/raid6/test/test.c    |   7 +-
> >  lib/raid6/x86.h          |  10 +
> >  14 files changed, 1111 insertions(+), 54 deletions(-)
> >  create mode 100644 lib/raid6/avx512.c
> >  create mode 100644 lib/raid6/recov_avx512.c
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Doug Dumitru
> EasyCo LLC

^ permalink raw reply

* Re: [PATCH 1/2] RAID1: ignore discard error
From: Shaohua Li @ 2016-10-07 17:42 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Shaohua Li, linux-raid, Kernel-team, v3.6
In-Reply-To: <CALjAwxhSrcCaePBW0co1wSTNh0i3o0q09BKXeyeKtO_PRn+5_w@mail.gmail.com>

On Fri, Oct 07, 2016 at 11:53:07AM +0100, Sitsofe Wheeler wrote:
> On 6 October 2016 at 22:20, Shaohua Li <shli@fb.com> wrote:
> > Sitsofe,
> > could you try if this patch fixes the issue too please?
> 
> As with with the previous patch the BUG_ON/crash is fixed and as
> before dmesg only says this now:
> [   17.884175] sd 0:0:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [   17.884180] sd 0:0:1:0: [sdb] tag#0 Sense Key : Illegal Request [current]
> [   17.884185] sd 0:0:1:0: [sdb] tag#0 Add. Sense: Invalid command
> operation code
> [   17.884190] sd 0:0:1:0: [sdb] tag#0 CDB: Write same(16) 93 08 00 00
> 00 00 00 00 40 80 00 00 08 00 00 00
> [   17.884196] blk_update_request: critical target error, dev sdb, sector 16512
> [   17.884205] sd 0:0:2:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [   17.884208] sd 0:0:2:0: [sdc] tag#1 Sense Key : Illegal Request [current]
> [   17.884211] sd 0:0:2:0: [sdc] tag#1 Add. Sense: Invalid command
> operation code
> [   17.884215] sd 0:0:2:0: [sdc] tag#1 CDB: Write same(16) 93 08 00 00
> 00 00 00 00 40 80 00 00 08 00 00 00
> [   17.884217] blk_update_request: critical target error, dev sdc, sector 16512
> [   17.924716] Buffer I/O error on dev dm-0, logical block 0, async page read
> 
> (sd[bc] are the devices the RAID is using)
> 
> However this time around blkdiscard doesn't return an error (despite
> the discard not being done) and is quiet. It's hard to say if this is
> desirable given that the underlying devices switch to saying they
> don't support discard but perhaps that's part of a different
> discussion.

Yep, I think the spec doesn't require discard IO returns error. We probably can
make this smarter later in MD

> Tested-by: Sitsofe Wheeler <sitsofe@gmail.com>

Thanks for the testing, I'll queue the two patches.

Thanks,
Shaohua

^ permalink raw reply

* Re: Why not just return an error?
From: Dark Penguin @ 2016-10-07 17:44 UTC (permalink / raw)
  To: Phil Turmel, Andreas Klauer, Rudy Zijlstra, keld; +Cc: linux-raid
In-Reply-To: <94b1a4f4-adec-90b7-e804-2d8d2c94a7af@turmel.org>

On 07/10/16 19:52, Phil Turmel wrote:
> Hi DP,
>
> {It's good that you are trimming replies, but don't cut the ID of who
> wrote what. }

Oh, yeah, sorry.

> You want to push the failure condition from being "broken raid with
> likely salvageable data, except for one sector" to "repeated errors to
> the upper layers with unknowable corruption as side effects".

That actually describes it pretty well, yes. %) Being able to choose a 
failure condition most suitable for your specific situation, and being 
able to push it that far and still have a working RAID if you want that.

> Then patch your kernel with your desired behavior.  "Free software"
> doesn't mean someone writes what you want for free.  And I disagree with
> you, so would object to it being put in the mainline kernel.

Yes, that's one of the things on my TODO list once I become a developer 
able to do that. :) I just thought I'm probably not the only one who 
wants that, and so I wanted to learn why is it not possible. And listen 
to what other people really think about it.

>> Anyway, if I had a collapsed RAID-5, I would want to at least have an
>> easy option to start it in a read-only mode in the last-known working
>> state, while the faulty drives are still not out of sync, and recover
>> data easily (to my single backup drive), or continue using the array for
>> a while, manually deleting one "bad" file if necessary; this is of
>> course not a "good thing" to do, but this way, RAID would be at least
>> not worse than single drives with faulty sectors, which are capable of
>> that, while RAIDs are not! I would be fine with that in my archive - as
>> I'm fine with some less importand parts of the archive being on faulty
>> single drives. It's just that I don't want to lose the whole drive due
>> to a hardware failure - and RAID adds more causes other than that,
>> instead of offering more protection against that.
>
> MD raid has no idea what is at any given sector.  And with a
> near-infinite variety of layering choices, there's no way it's going to.
>   That's why *you* have to do this.  You trimmed my description of the
> only "easy option" actually trustable.

I actually wanted to ask about that. Can you really ddrescue a drive 
with a "hole" in it, re-add it and expect it to work?.. What happens if 
you try to read from that "hole" again? And while I'm talking about 
re-adding, when does it become impossible to "re-add" a drive?..

-- 
darkpenguin

^ permalink raw reply

* Re: [GIT PULL] MD update for 4.9
From: Linus Torvalds @ 2016-10-07 18:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Doug Dumitru, Linux Kernel Mailing List, linux-raid, Neil Brown
In-Reply-To: <20161007173957.GA37192@kernel.org>

On Fri, Oct 7, 2016 at 10:39 AM, Shaohua Li <shli@kernel.org> wrote:
>
> I did try your patch and it improved 10% in my machine, but this isn't
> relevent to the pull. We can do the tunning later if necessary. I'm
> hoping the intel guys can share some hints, but apparently Linus isn't a
> fan for such tuning.

We've had horrible experiences with prefetching in the past.  We've
seen microarchitectures that do really bad things when the prefetch
takes a TLB miss, for example, and suddenly they stall on the
prefetch, and actually slow the code down.

Admittedly, most of the bad cases are probably not a big deal for
streaming raid rebuild code, so it may well be that it works better
there.

So I'm not categorically against prefetching, but it needs to be
tested across a lot of different (micro-)architectures. Right now, I
guess something very specific like AVX512 means effectively just one
or two microarchitectures and then it's easy to say "it always helps".

The worst cases for the kernel have generally been in generic code.

                      Linus

^ permalink raw reply

* Re: [GIT PULL] MD update for 4.9
From: Doug Dumitru @ 2016-10-07 18:15 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linus Torvalds, linux-kernel, linux-raid, Neil Brown
In-Reply-To: <20161007173957.GA37192@kernel.org>

For the vast majority of cases, Linus is correct.  Pre-fetch is a
hardware specific tweak that just doesn't apply across the breadth of
systems Linus has to pay attention to.

In this case, our tunnel-vision is somewhat encouraged by the code and
logic at hand.  The AVX2 and AVX512 code is pretty specific to new,
64-bit, x86_64 platforms that tend to share a lot of cache behavior.
It is not like this will ever run on an Atom or ARM CPU.  Even more
important, the data access patterns are 100% defined by the task at
hand.  With RAID parity calculation, you are basically taking a stroll
through RAM with 100% defined patterns.  This is one case, where you
can pre-fetch memory enough ahead of time so that the hardware
prefetch unit never triggers and you will always be correct.

I am somewhat surprised you see 10%.  10% on an expensive function
like this is a lot.  Even more amazing is that the AVX2 and AVX512
code look to be memory bandwidth limited.  256 bytes of source reads
in about 30 clocks is about 17 GB/sec, which is faster than a single
RAM channel.  Congrats to Intel for "finding the next bottleneck".

I also see a follow-up from Linus, and again, totally agree with him.
I guess my take is that prefetch only works when you have a use case
where you are 100% correct with your pre-fetches.  I would also note
that this "raid case" is the default, clean array, parity calc code.
The same logic probably needs to be applied to the recovery cases, but
I have not looked at those yet.

Doug

On Fri, Oct 7, 2016 at 10:39 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Oct 06, 2016 at 10:39:21PM -0700, Doug Dumitru wrote:
>> Mr. Li,
>>
>> There is another thread in [linux-raid] discussing pre-fetches in the
>> raid-6 AVX2 code.  My testing implies that the prefetch distance is
>> too short.  In your new AVX512 code, it looks like there are 24
>> instructions, each with latencies of 1, between the prefetch and the
>> actual memory load.  I don't have a AVX512 CPU to try this on, but the
>> prefetch might do better at a bigger distance.  If I am not mistaken,
>> it takes a lot longer than 24 clocks to fetch 4 cache lines.
>>
>> Just a comment while the code is still fluid.
>
> I did try your patch and it improved 10% in my machine, but this isn't
> relevent to the pull. We can do the tunning later if necessary. I'm
> hoping the intel guys can share some hints, but apparently Linus isn't a
> fan for such tuning.
>
> Thanks,
> Shaohua
>
>> On Thu, Oct 6, 2016 at 5:38 PM, Shaohua Li <shli@kernel.org> wrote:
>> > Hi Linus,
>> > Please pull MD update for 4.9. This update includes:
>> > - new AVX512 instruction based raid6 gen/recovery algorithm
>> > - A couple of md-cluster related bug fixes
>> > - Fix a potential deadlock
>> > - Set nonrotational bit for raid array with SSD
>> > - Set correct max_hw_sectors for raid5/6, which hopefuly can improve
>> >   performance a little bit
>> > - Other minor fixes
>> >
>> > Thanks,
>> > Shaohua
>> >
>> > The following changes since commit 7d1e042314619115153a0f6f06e4552c09a50e13:
>> >
>> >   Merge tag 'usercopy-v4.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux (2016-09-20 17:11:19 -0700)
>> >
>> > are available in the git repository at:
>> >
>> >   git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.9-rc1
>> >
>> > for you to fetch changes up to bb086a89a406b5d877ee616f1490fcc81f8e1b2b:
>> >
>> >   md: set rotational bit (2016-10-03 10:20:27 -0700)
>> >
>> > ----------------------------------------------------------------
>> > Chao Yu (1):
>> >       raid5: fix to detect failure of register_shrinker
>> >
>> > Gayatri Kammela (5):
>> >       lib/raid6: Add AVX512 optimized gen_syndrome functions
>> >       lib/raid6: Add AVX512 optimized recovery functions
>> >       lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
>> >       lib/raid6: Add AVX512 optimized xor_syndrome functions
>> >       raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
>> >
>> > Guoqing Jiang (9):
>> >       md-cluster: call md_kick_rdev_from_array once ack failed
>> >       md-cluster: use FORCEUNLOCK in lockres_free
>> >       md-cluster: remove some unnecessary dlm_unlock_sync
>> >       md: changes for MD_STILL_CLOSED flag
>> >       md-cluster: clean related infos of cluster
>> >       md-cluster: protect md_find_rdev_nr_rcu with rcu lock
>> >       md-cluster: convert the completion to wait queue
>> >       md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
>> >       md-cluster: make resync lock also could be interruptted
>> >
>> > Shaohua Li (5):
>> >       raid5: allow arbitrary max_hw_sectors
>> >       md/bitmap: fix wrong cleanup
>> >       md: fix a potential deadlock
>> >       raid5: handle register_shrinker failure
>> >       md: set rotational bit
>> >
>> >  arch/x86/Makefile        |   5 +-
>> >  drivers/md/bitmap.c      |   4 +-
>> >  drivers/md/md-cluster.c  |  99 ++++++---
>> >  drivers/md/md.c          |  44 +++-
>> >  drivers/md/md.h          |   5 +-
>> >  drivers/md/raid5.c       |  11 +-
>> >  include/linux/raid/pq.h  |   4 +
>> >  lib/raid6/Makefile       |   2 +-
>> >  lib/raid6/algos.c        |  12 +
>> >  lib/raid6/avx512.c       | 569 +++++++++++++++++++++++++++++++++++++++++++++++
>> >  lib/raid6/recov_avx512.c | 388 ++++++++++++++++++++++++++++++++
>> >  lib/raid6/test/Makefile  |   5 +-
>> >  lib/raid6/test/test.c    |   7 +-
>> >  lib/raid6/x86.h          |  10 +
>> >  14 files changed, 1111 insertions(+), 54 deletions(-)
>> >  create mode 100644 lib/raid6/avx512.c
>> >  create mode 100644 lib/raid6/recov_avx512.c
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Doug Dumitru
>> EasyCo LLC



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply

* Re: Why not just return an error?
From: Phil Turmel @ 2016-10-07 18:41 UTC (permalink / raw)
  To: Dark Penguin, Andreas Klauer, Rudy Zijlstra, keld; +Cc: linux-raid
In-Reply-To: <57F7DF05.8090605@yandex.ru>

On 10/07/2016 01:44 PM, Dark Penguin wrote:
> On 07/10/16 19:52, Phil Turmel wrote:

>> MD raid has no idea what is at any given sector.  And with a
>> near-infinite variety of layering choices, there's no way it's going to.
>>   That's why *you* have to do this.  You trimmed my description of the
>> only "easy option" actually trustable.
> 
> I actually wanted to ask about that. Can you really ddrescue a drive
> with a "hole" in it, re-add it and expect it to work?.. What happens if
> you try to read from that "hole" again? And while I'm talking about
> re-adding, when does it become impossible to "re-add" a drive?..

Yes, ddrescue replaces unreadable areas with zeroes.  If those blocks
were part of a file, then the file will have zeroes in it.  But they
might have been where an inode or dirent were stored, in which case you
get orphaned data elsewhere.  You need fsck to minimize that.

ddrescue can provide a listing of the sectors it replaced so you can use
filesystem forensic tools to pinpoint the problems (which file, etc).

Note that all of the above are manual operations -- mdadm has no
knowledge of the upper layers.

None of the above uses --re-add.  Just assembly or forced assembly.
Re-add is only to return a kicked drive to a *functional* array when the
failure reason isn't really the drive.  (Controller, cable, power
supply, etc.)  And re-add is only helpful if the array members have
write-intent bitmaps so MD can figure out which parts of the re-added
disk are out of date.  Re-add can be used if a drive is kicked for
timeout mismatch, but is only helpful if the mismatch is addressed first.

Phil

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox