* remark and RFC
@ 2006-08-16 9:06 Peter T. Breuer
2006-08-16 10:00 ` Molle Bestefich
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 9:06 UTC (permalink / raw)
To: linux raid
Hello -
I believe the current kernel raid code retries failed reads too
quickly and gives up too soon for operation over a network device.
Over the enbd device (mine), the default mode of operation used to be
to time out requests after 30s of net stalemate, and perhaps even to
stop and restart the socket if the blockage was very prolonged, showing
the device as invalid for 5s in order to clear any in-kernel requests
that hadn't yet arrived at its own queue.
That timeout behaviour still happens, and my interpretation of the
reports reaching me is that the current raid code then sends retries
quickly, which all fail, and the device gets expelled, which is bad :-(.
BTW, with my old FR1/5 patch, the enbd device could tell the raid layer
when it felt OK again, and the patched raid code would reinsert the
device and catch up on requests marked as missed in the bitmap.
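The catch-up being described amounts to resyncing only the chunks flagged
in a write-intent bitmap. A minimal sketch of that loop, in which the chunk
size, bitmap_test() and copy_chunk() are illustrative names rather than md
internals:

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_SIZE   (64 * 1024)   /* hypothetical bitmap granularity */

/* Test bit 'chunk' in a packed write-intent bitmap. */
static int bitmap_test(const uint8_t *bitmap, size_t chunk)
{
        return bitmap[chunk / 8] & (1u << (chunk % 8));
}

/*
 * Catch up a freshly re-inserted mirror half: only the chunks whose
 * bits are set (i.e. were written while the device was away) need to
 * be copied from the good half; everything else is already in sync.
 *
 * copy_chunk() stands in for whatever actually moves the data.
 */
static void catch_up(const uint8_t *bitmap, size_t nchunks,
                     int (*copy_chunk)(size_t chunk))
{
        for (size_t chunk = 0; chunk < nchunks; chunk++)
                if (bitmap_test(bitmap, chunk))
                        copy_chunk(chunk);   /* resync this chunk only */
}
```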
Now, that mode of operation isn't available to enbd since there is no
comm channel to the official raid layer, so all I can do is make the
enbd device block on network timeouts. But that's totally
unsatisfactory, since real network outages then cause permanent blocks
on anything touching a file system mounted remotely (a la NFS
hard-erroring style). People don't like that.
And letting the enbd device error out temporarily provokes a cascade
of retries from raid, which all fail and get the enbd device ejected
permanently, which is also really bad.
So,
1) I would like raid request retries to be done with exponential
delays, so that we get a chance to overcome network brownouts.
2) I would like some channel of communication to be available
with raid that devices can use to say that they are
OK and would they please be reinserted in the array.
The latter is the RFC thing (I presume the former will either not
be objectionable or Neil will say "there's no need since you're wrong
about the way raid does retries anyway").
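To make point 1 concrete, here is a minimal userspace sketch of the retry
schedule being proposed, using the 0/1/5/10/30/60s intervals and the "cycle
it twice" idea from above; submit_request() is a hypothetical callback, not
md code:

```c
#include <stdbool.h>
#include <unistd.h>

/* Proposed retry intervals in seconds: immediate first, then backing off. */
static const unsigned int retry_delay[] = { 0, 1, 5, 10, 30, 60 };
#define NUM_DELAYS (sizeof(retry_delay) / sizeof(retry_delay[0]))
#define NUM_CYCLES 2            /* "cycle that twice for luck" */

/*
 * Retry a failed request with exponentially staged delays instead of
 * hammering the device with immediate retries.  submit_request() is a
 * placeholder for whatever re-issues the I/O; it returns true on success.
 */
static bool retry_with_backoff(bool (*submit_request)(void *req), void *req)
{
        for (int cycle = 0; cycle < NUM_CYCLES; cycle++) {
                for (unsigned int i = 0; i < NUM_DELAYS; i++) {
                        sleep(retry_delay[i]);     /* 0s the first time   */
                        if (submit_request(req))
                                return true;       /* device came back    */
                }
        }
        return false;   /* only now give up and fail the device           */
}
```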
The way the old FR1/5 code worked was to make available a couple of
ioctls.
When a device got inserted in an array, the raid code told the device,
via a special ioctl that it assumed the device implemented, that it was
now in an array (this triggers special behaviours, such as deliberately
erroring faster and blocking less, on the assumption that we have good
comms with raid and can manage our own raid state). Ditto removal.
When the device felt good (or ill) it notified the raid arrays it
knew it was in via another ioctl (really just hot-add or hot-remove),
and the raid layer would do the appropriate catchup (or start bitmapping
for it).
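Concretely, the old interface amounted to a pair of notifications in each
direction, roughly like the definitions below. The command numbers and names
are hypothetical reconstructions for illustration; they are not in the
mainline md or enbd code:

```c
#include <linux/ioctl.h>

/*
 * Hypothetical downward channel: md tells a component device that it
 * has been included in (or removed from) an array, so the device can
 * switch to "fail fast, raid will manage my state" behaviour.
 */
#define ENBD_SET_MD_MEMBER    _IOW('e', 0x20, int)   /* arg: md minor, or -1 */

/*
 * Hypothetical upward channel: the device tells md how it feels.  In
 * the old FR1/5 patch this was essentially hot-add / hot-remove issued
 * from kernel space by the enbd driver itself.
 */
#define MD_MEMBER_FEELS_ILL   _IO('e', 0x21)
#define MD_MEMBER_FEELS_WELL  _IO('e', 0x22)
```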
Can we have something like that in the official code? If so, what?
Peter
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: remark and RFC
  2006-08-16  9:06 remark and RFC Peter T. Breuer
@ 2006-08-16 10:00 ` Molle Bestefich
  2006-08-16 13:06   ` Peter T. Breuer
  2006-08-16 14:59 ` Molle Bestefich
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Molle Bestefich @ 2006-08-16 10:00 UTC (permalink / raw)
To: ptb; +Cc: linux raid

Peter T. Breuer wrote:
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.
>
> I presume the former will either not be objectionable

You want to hurt performance for every single MD user out there, just
because things don't work optimally under enbd, which is after all a
rather rare use case compared to using MD on top of real disks.

Uuuuh.. yeah, no objections there.

Besides, it seems a rather pointless exercise to try and hide the fact
from MD that the device is gone, since it *is* in fact missing. Seems
wrong at the least.

> 2) I would like some channel of communication to be available
> with raid that devices can use to say that they are
> OK and would they please be reinserted in the array.
>
> The latter is the RFC thing

It would be reasonable for MD to know the difference between
- "device has (temporarily, perhaps) gone missing" and
- "device has physical errors when reading/writing blocks",
because if MD knew that, then it would be trivial to automatically
hot-add the missing device once available again. Whereas the faulty
one would need the administrator to get off his couch.

This would help in other areas too, like when a disk controller dies,
or a cable comes (completely) loose.

Even if the IDE drivers are not mature enough to tell us which kind of
error it is, MD could still implement such a feature just to help enbd.

I don't think a comm-channel is the right answer, though.

I think the type=(missing/faulty) information should be embedded in
the I/O error message from the block layer (enbd in your case)
instead, to avoid race conditions and allow MD to take good decisions
as early as possible.

The comm channel and "hey, I'm OK" message you propose doesn't seem
that different from just hot-adding the disks from a shell script
using 'mdadm'.

> When the device felt good (or ill) it notified the raid arrays it
> knew it was in via another ioctl (really just hot-add or hot-remove),
> and the raid layer would do the appropriate catchup (or start
> bitmapping for it).

No point in bitmapping. Since with the network down and all the
devices underlying the RAID missing, there's nowhere to store data.
Right?

Some more factual data about your setup would maybe be good..

> all I can do is make the enbd device block on network timeouts.
> But that's totally unsatisfactory, since real network outages then
> cause permanent blocks on anything touching a file system
> mounted remotely. People don't like that.

If it's just this that you want to fix, you could write a DM module
which returns I/O error if the request to the underlying device takes
more than 10 seconds.

Layer that module on top of the RAID, and make your enbd device block
on network timeouts.

Now the RAID array doesn't see missing disks on network outages, and
users get near-instant errors when the array isn't responsive due to a
network outage.

^ permalink raw reply [flat|nested] 18+ messages in thread
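Molle's suggestion of embedding the missing/faulty distinction in the I/O
error itself could look roughly like this on the md side. The errno
convention and the classify_completion() helper are purely illustrative,
not an existing md interface:

```c
#include <errno.h>

/*
 * Hypothetical policy in the raid layer, keyed off the error the
 * component driver used to complete the request:
 *
 *   -ETIMEDOUT / -ENOLINK : transport problem, device is "missing";
 *                           mark it temporarily out, keep bitmapping,
 *                           and re-add it automatically later.
 *   anything else (-EIO)  : medium problem, device is "faulty";
 *                           kick it and wait for the administrator.
 */
enum member_state { MEMBER_OK, MEMBER_MISSING, MEMBER_FAULTY };

static enum member_state classify_completion(int error)
{
        switch (error) {
        case 0:
                return MEMBER_OK;
        case -ETIMEDOUT:
        case -ENOLINK:
                return MEMBER_MISSING;   /* link outage: expect it back */
        default:
                return MEMBER_FAULTY;    /* bad blocks: needs a human   */
        }
}
```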
* Re: remark and RFC 2006-08-16 10:00 ` Molle Bestefich @ 2006-08-16 13:06 ` Peter T. Breuer 2006-08-16 14:28 ` Molle Bestefich 0 siblings, 1 reply; 18+ messages in thread From: Peter T. Breuer @ 2006-08-16 13:06 UTC (permalink / raw) To: Molle Bestefich; +Cc: linux raid "Also sprach Molle Bestefich:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > 1) I would like raid request retries to be done with exponential > > delays, so that we get a chance to overcome network brownouts. > > > > I presume the former will either not be objectionable > > You want to hurt performance for every single MD user out there, just There's no performance drop! Exponentially staged retries on failure are standard in all network protocols ... it is the appropriate reaction in general, since stuffing the pipe full of immediate retries doesn't allow the would-be successful transactions to even get a look in against that competition. > because things doesn't work optimally under enbd, which is after all a > rather rare use case compared to using MD on top of real disks. Strawman. > Uuuuh.. yeah, no objections there. > > Besides, it seems a rather pointless exercise to try and hide the fact > from MD that the device is gone, since it *is* in fact missing. Well, we don't really know that for sure. As you know, it is impossible to tell in general if the net has gone awol or is simply heavily overloaded (with retry requests). The retry on error is a good thing. I am simply suggesting that if the first retry also fails that we do some back off before trying again, since it is now likely (lacking more knowledge) that the device is having trouble and may well take some time to recover. I would suspect that an interval of 0 1 5 10 30 60s would be appropriate for retries. One can cycle that twice for luck before giving up for good, if you like. The general idea in such backoff protocols is that it avoids filling a fixed bandwidth channel with retries (the sum of a constant times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also there is an _additional_ assumption that the net is likely to have brownouts and so we _ought_ to retry at intervals since retrying immediately will definitely almost always do no good. > Seems > wrong at the least. There is no effect on the normal request path, and the effect is beneficial to successful requests by reducing the competing buildup of failed requests, when they do occur. In "normal " failures there is zero delay anyway. And further, the bitmap takes care of delayed responses in the normal course of events. > > 2) I would like some channel of communication to be available > > with raid that devices can use to say that they are > > OK and would they please be reinserted in the array. > > > > The latter is the RFC thing > > It would be reasonable for MD to know the difference between > - "device has (temporarily, perhaps) gone missing" and > - "device has physical errors when reading/writing blocks", I agree. The problem is that we can't really tell what's happening (even in the lower level device) across a net that is not responding. Enbd generally hides the problem for a short period of time, then gives up and advises md (if only it could nowadays - I mean with the fr1 patch) that the device is down, and then tells md when the device comes back, so that the bitmap can be discharged and the device be caught up. 
The problem is that at the moment the md layer has no way of being told that the device is OK again (and that it decides on its own account that the device is bad when it sends umpteen retries within a short period of time only to get them all rejected). > because if MD knew that, then it would be trivial to automatically > hot-add the missing device once available again. Whereas the faulty > one would need the administrator to get off his couch. Yes. The idea is that across the net approximately ALL failures are temporary ones, to a value of something like 99.99%. The cleaning lady is usually dusting the on-off switch on the router. > This would help in other areas too, like when a disk controller dies, > or a cable comes (completely) loose. > > Even if the IDE drivers are not mature enough to tell us which kind of > error it is, MD could still implement such a feature just to help > enbd. > > I don't think a comm-channel is the right answer, though. > > I think the type=(missing/faulty) information should be embedded in > the I/O error message from the block layer (enbd in your case) > instead, to avoid race conditions and allow MD to take good decisions > as early as possible. That's a possibility. I certainly get two types of error back in the enbd driver .. remote error or network error. Remote error is when we get told by the other end that the disk has a problem. Network error is when we hear nothing, and have a timeout. I can certainly pass that on. Any suggestions? > The comm channel and "hey, I'm OK" message you propose doesn't seem > that different from just hot-adding the disks from a shell script > using 'mdadm'. Talking through userspace has subtle deadlock problems. I wouldn't rely on it in this kind of situation. Blocking a device can lead to a file system being blocked and processes getting stalled for all kinds of peripheral reasons, for example. I have seen file descriptor closes getting blocked, to name the bizarre. I am pretty sure that removal requests will be blocked when requests are outstanding. Another problem is that enbd has to _know_ it is in a raid array, and which one, in order to send the ioctl. That leads one to more or less require that the md array tell it. One could build this into the mdadm tool, but one can't guarantee that everyone uses that (same) mdadm tool, so the md driver gets nominated as the best place for the code that does that. > > When the device felt good (or ill) it notified the raid arrays it > > knew it was in via another ioctl (really just hot-add or hot-remove), > > and the raid layer would do the appropriate catchup (or start > > bitmapping for it). > > No point in bitmapping. Since with the network down and all the > devices underlying the RAID missing, there's nowhere to store data. > Right? Only one of two devices in a two-device mirror is generally networked. The standard scenario is two local disks per network node. One is a mirror half for a remote raid, the other is the mirror half for a local raid (which has a remote other half on the remote node). More complicated setups can also be built - there are entire grids of such nodes arranged in a torus, with local redundancy arranged in groups of three neighbours, each with two local devices and one remote device. Etc. > Some more factual data about your setup would maybe be good.. It's not my setup! Invent your own :-). > > all I can do is make the enbd device block on network timeouts. 
> > But that's totally unsatisfactory, since real network outages then > > cause permanent blocks on anything touching a file system > > mounted remotely. People don't like that. > > If it's just this that you want to fix, you could write a DM module > which returns I/O error if the request to the underlying device takes > more than 10 seconds. I'm not sure that another layer helps. I can timeout requests myself in 10s within enbd if I want to. The problem is that if I take ten seconds for each one when the net is down memory will fill with backed up requests. The first one that is failed (after 10s) then triggers an immediate retry from md, which also gets held for 10s. We'll simply get huge pulses of failures of entire backed up memory spaced at 10s. :-o I'm pretty sure from reports that md would error the device offline after a pulse like that. If it doesn't, then anyway enbd would decide after 30s or so that the remote end was down and take itself offline. One or the other would cause md to expell it from the array. I could try hot-add from enbd when the other end comes back, but we need to know we are in an array (and which) in order to do that. > Layer that module on top of the RAID, and make your enbd device block > on network timeouts. It shifts the problem to no avail, as far as I understand you, and my understanding is likely faulty. Can you be more specific about how this attacks the problem? > Now the RAID array doesn't see missing disks on network outages, and It wouldn't see them anyway when enbd is in normal mode - it blocks. The problem is that that behaviour is really bad for user satisfaction! Enbd used instead to tell the md device that it was feeling ill, error all requests, allowing md to chuck it out of the array. Then enbd would tell the md device when it was feeling well again, and make md reinsert it in the array. Md would catch up using the bitmap. Right now, we can't really tell md we're feeling ill (that would be a HOT_ARRRGH, but md doesn't have that). If we could, then md could decide on its own to murder all outstanding requests for us and chuck us out, with the implicit understanding that we will come back again soon and then the bitbap can catcj us up. We can't do a HOT_REMOVE while requests are outstanding, as far as I know. > users get near-instant errors when the array isn't responsive due to a > network outage. I agree that the lower level device should report errors quickly up to md. The problem is that that leads to it being chucked out unceremonially, for ever and a day .. 1) md shouldn't chuck us out for a few errors - nets are like that 2) we should be able to chuck ourselves out when we feel the net is weak 3) we should be able to chuck ourselves back in when we feel better 4) for that to happen, we need to have been told by md when we are in an array and which I simply proposed that (1) has the easy solution of md doing retries with exponential backoff for a while, instead of chucking us out. The rest needs discussion. Maybe it can be done in userspace, but be advised that I think that is remarkably tricky! In particular, it's almost impossible to test adequately ... which alone would make me aim for an embedded solution (i.e. driver code). Peter ^ permalink raw reply [flat|nested] 18+ messages in thread
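The hot-add Peter mentions is an ordinary md ioctl, so once the enbd end
knows which array it belongs to, re-insertion from user space amounts to
something like the following sketch. It assumes the HOT_ADD_DISK definition
from <linux/raid/md_u.h> and ignores the in-flight-request problem discussed
above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/raid/md_u.h>     /* HOT_ADD_DISK, HOT_REMOVE_DISK */

/* Re-add a component device (e.g. /dev/ndb) to an array (e.g. /dev/md0). */
static int hot_add(const char *array, const char *component)
{
        struct stat st;
        int fd, ret;

        if (stat(component, &st) < 0 || !S_ISBLK(st.st_mode))
                return -1;

        fd = open(array, O_RDWR);
        if (fd < 0)
                return -1;

        /* The ioctl argument is the component's device number. */
        ret = ioctl(fd, HOT_ADD_DISK, (unsigned long)st.st_rdev);
        if (ret < 0)
                perror("HOT_ADD_DISK");
        close(fd);
        return ret;
}
```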
* Re: remark and RFC 2006-08-16 13:06 ` Peter T. Breuer @ 2006-08-16 14:28 ` Molle Bestefich 2006-08-16 19:01 ` Peter T. Breuer 2006-08-16 23:43 ` Nix 0 siblings, 2 replies; 18+ messages in thread From: Molle Bestefich @ 2006-08-16 14:28 UTC (permalink / raw) To: ptb; +Cc: linux raid Peter T. Breuer wrote: > > You want to hurt performance for every single MD user out there, just > > There's no performance drop! Exponentially staged retries on failure > are standard in all network protocols ... it is the appropriate > reaction in general, since stuffing the pipe full of immediate retries > doesn't allow the would-be successful transactions to even get a look in > against that competition. That's assuming that there even is a pipe, which is something specific to ENBD / networked block devices, not something that the MD driver should in general care about. > > because things doesn't work optimally under enbd, which is after all a > > rather rare use case compared to using MD on top of real disks. > > Strawman. Quah? > > Besides, it seems a rather pointless exercise to try and hide the fact > > from MD that the device is gone, since it *is* in fact missing. > > Well, we don't really know that for sure. As you know, it is > impossible to tell in general if the net has gone awol or is simply > heavily overloaded (with retry requests). From MD's point of view, if we're unable to complete a request to the device, then it's either missing or faulty. If a call to the device blocks, then it's just very slow. I don't think it's wise to pollute these simple mechanics with a "maybe it's in a sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection. Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users. "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps fine to both real disks and NBDs. > The retry on error is a good thing. I am simply suggesting that if the > first retry also fails that we do some back off before trying again, > since it is now likely (lacking more knowledge) that the device is > having trouble and may well take some time to recover. I would suspect > that an interval of 0 1 5 10 30 60s would be appropriate for retries. Only for networked block devices. Not for real disks, there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken. > One can cycle that twice for luck before giving up for good, if you > like. The general idea in such backoff protocols is that it avoids > filling a fixed bandwidth channel with retries (the sum of a constant > times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel > bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also > there is an _additional_ assumption that the net is likely to have > brownouts and so we _ought_ to retry at intervals since retrying > immediately will definitely almost always do no good. Since the knowledge that the block device is on a network resides in ENBD, I think the most reasonable thing to do would be to implement a backoff in ENBD? Should be relatively simple to catch MD retries in ENBD and block for 0 1 5 10 30 60 seconds. That would keep the network backoff algorithm in a more right place, namely the place that knows the device is on a network. 
> In "normal " failures there is zero delay anyway. Since the first retry would succeed, or? I'm not sure what this "normal" failure is, btw. > And further, the bitmap takes care of delayed > responses in the normal course of events. Mebbe. Does it? > > It would be reasonable for MD to know the difference between > > - "device has (temporarily, perhaps) gone missing" and > > - "device has physical errors when reading/writing blocks", > > I agree. The problem is that we can't really tell what's happening > (even in the lower level device) across a net that is not responding. In the case where requests can't be delivered over the network (or a SATA cable, whatever), it's a clear case of "missing device". > > because if MD knew that, then it would be trivial to automatically > > hot-add the missing device once available again. Whereas the faulty > > one would need the administrator to get off his couch. > > Yes. The idea is that across the net approximately ALL failures are > temporary ones, to a value of something like 99.99%. The cleaning lady > is usually dusting the on-off switch on the router. > > > This would help in other areas too, like when a disk controller dies, > > or a cable comes (completely) loose. > > > > Even if the IDE drivers are not mature enough to tell us which kind of > > error it is, MD could still implement such a feature just to help > > enbd. > > > > I don't think a comm-channel is the right answer, though. > > > > I think the type=(missing/faulty) information should be embedded in > > the I/O error message from the block layer (enbd in your case) > > instead, to avoid race conditions and allow MD to take good decisions > > as early as possible. > > That's a possibility. I certainly get two types of error back in the > enbd driver .. remote error or network error. Remote error is when > we get told by the other end that the disk has a problem. Network > error is when we hear nothing, and have a timeout. > > I can certainly pass that on. Any suggestions? Let's hear from Neil what he thinks. > > The comm channel and "hey, I'm OK" message you propose doesn't seem > > that different from just hot-adding the disks from a shell script > > using 'mdadm'. > > [snip speculations on possible blocking calls] You could always try and see. Should be easy to simulate a network outage. > I am pretty sure that removal requests will be blocked when > requests are outstanding. That in particular should not be a big problem, since MD already kicks the device for you, right? A script would only have to hot-add the device once it's available again. > Another problem is that enbd has to _know_ it is in a raid array, and > which one, in order to send the ioctl. That leads one to more or less > require that the md array tell it. One could build this into the mdadm > tool, but one can't guarantee that everyone uses that (same) mdadm tool, > so the md driver gets nominated as the best place for the code that > does that. It's already in mdadm. You can only usefully query one way (array --> device): # mdadm -D /dev/md0 | grep -A100 -E '^ Number' Number Major Minor RaidDevice State 0 253 0 0 active sync /dev/mapper/sda1 1 253 1 1 active sync /dev/mapper/sdb1 That should provide you with enough information though, since devices stay in that table even after they've gone missing. (I'm not sure what happens when a spare takes over a place, though - test needed.) The optimal thing would be to query the other way, of course. 
ENBD should be able to tell a hotplug shell script (or whatever) about the name of the device that's just come back. And you *can* in fact query the other way too, but you won't get a useful Array UUID or device-name-of-assembled-array out of it: # mdadm -E /dev/mapper/sda2 [snip blah, no array information :-(] Expanding -E output to include the Array UUID would be a good feature in any case. Expanding -E output to include which array device is currently mounted, having the corresponding Array UUID would be neat, but I'm sure that most users would probably misunderstand what this means :-). > Only one of two devices in a two-device mirror is generally networked. Makes sense. > The standard scenario is two local disks per network node. One is a > mirror half for a remote raid, A local cache of sorts? > the other is the mirror half for a local raid > (which has a remote other half on the remote node). A remote backup of sorts? > More complicated setups can also be built - there are entire grids of > such nodes arranged in a torus, with local redundancy arranged in > groups of three neighbours, each with two local devices and one remote > device. Etc. Neat ;-). > > > all I can do is make the enbd device block on network timeouts. > > > But that's totally unsatisfactory, since real network outages then > > > cause permanent blocks on anything touching a file system > > > mounted remotely. People don't like that. > > > > If it's just this that you want to fix, you could write a DM module > > which returns I/O error if the request to the underlying device takes > > more than 10 seconds. > > I'm not sure that another layer helps. I can timeout requests myself in > 10s within enbd if I want to. Yeah, okay. I suggested that further up, but I guess you thought of it before I did :-). > The problem is that if I take ten seconds for each one when the > net is down memory will fill with backed up requests. The first > one that is failed (after 10s) then triggers an immediate retry > from md, which also gets held for 10s. We'll simply get > huge pulses of failures of entire backed up memory spaced at 10s. > I'm pretty sure from reports that md would error the device > offline after a pulse like that. I don't see where these "huge pulses" come into the picture. If you block one MD request for 10 seconds, surely there won't be another before you return an answer to that one? > If it doesn't, then anyway enbd would decide after 30s or so that > the remote end was down and take itself offline. > One or the other would cause md to expell it from the array. I could > try hot-add from enbd when the other end comes back, but we need to know > we are in an array (and which) in order to do that. I think that's possible using mdadm at least. > > Layer that module on top of the RAID, and make your enbd > > device block on network timeouts. > > It shifts the problem to no avail, as far as I understand you, and my > understanding is likely faulty. Can you be more specific about how this > attacks the problem? Never was much of a good explainer... I was of the impression that you wanted an error message to be propagated quickly to userspace / users, but the MD array to just be silently paused, whenever a network outage occurred. Since you've mentioned that there's actually local disk components in the RAID arrays, I imagine you would want the array to NOT be paused, since it could reasonably continue operation on one device. So just forget about that proposal, it won't work in this situation :-). 
I guess what will work is either: A) Network outage --> ENBD fails disk --> MD drops disk --> Network comes back --> ENBD brings disk back up --> Something kicks off /etc/hotplug.d/block-hotplug script --> Script queries all RAID devices and find where the disk fits --> Script hot-adds the disk Or: B) Network outage --> ENBD fails disk, I/O error type "link error" --> MD sets disk status to "temporarily missing" --> Network comes back --> ENBD brings disk back up --> MD sees a block device arrival, reintegrates the disk into array I think the latter is better, because: * Noone has to maintain husky shell scripts * It sends a nice message to the SATA/PATA/SCSI people that MD would really like to know whether it's a disk or a link problem. But then again, shell scripts _is_ the preferred Linux solution to... Everything. > Enbd used instead to tell the md device that it was feeling ill, error > all requests, allowing md to chuck it out of the array. Then enbd would > tell the md device when it was feeling well again, and make md > reinsert it in the array. Md would catch up using the bitmap. > > Right now, we can't really tell md we're feeling ill (that would be a > HOT_ARRRGH, but md doesn't have that). If we could, then md could > decide on its own to murder all outstanding requests for us and > chuck us out, with the implicit understanding that we will come back > again soon and then the bitbap can catcj us up. > > We can't do a HOT_REMOVE while requests are outstanding, as far as I > know. MD should be fixed so HOT_REMOVE won't fail but will just kick the disk, even if it happens to be blocking on I/O calls. (If there really is a reason not to kick it, then at least a HOT_REMOVE_FORCE should be added..) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC 2006-08-16 14:28 ` Molle Bestefich @ 2006-08-16 19:01 ` Peter T. Breuer 2006-08-16 21:19 ` Molle Bestefich 2006-08-16 23:43 ` Nix 1 sibling, 1 reply; 18+ messages in thread From: Peter T. Breuer @ 2006-08-16 19:01 UTC (permalink / raw) To: Molle Bestefich; +Cc: linux raid [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=UNKNOWN-8BIT, Size: 21014 bytes --] "Also sprach Molle Bestefich:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > > You want to hurt performance for every single MD user out there, just > > > > There's no performance drop! Exponentially staged retries on failure > > are standard in all network protocols ... it is the appropriate > > reaction in general, since stuffing the pipe full of immediate retries > > doesn't allow the would-be successful transactions to even get a look in > > against that competition. > > That's assuming that there even is a pipe, "Pipe" refers to a channel of fixed bandwidth. Every communication channel is one. The "pipe" for a local disk is composed of the bus, disk architecture, controller, and also the kernel architecture layers. For example, only 256 (or 1024, whatever) kernel requests can be outstanding at a time per device [queue], so if 1024 retry requests are in flight, no real work will get done (some kind of priority placement may be done in each driver .. in enbd I take care to replace retries last in the existing queue, for example). > which is something specific > to ENBD / networked block devices, not something that the MD driver > should in general care about. See above. The problem is generic to fixed bandwidth transmission channels, which, in the abstract, is "everything". As soon as one does retransmits one has a kind of obligation to keep retransmissions down to a fixed maximum percentage of the potential traffic, which is generally accomplished via exponential backoff (a time-wise solution, in other words, sdeliberately mearing retransmits out along the time axis in order to prevent spikes). The md layers now can generate retries by at least one mechanism that I know of .. a failed disk _read_ (maybe of existing data or parity data as part of an exterior write attempt) will generate a disk _write_ of the missed data (as reconstituted via redundancy info). I believe failed disk _write_ may also generate a retry, but the above is already enough, no? Anyway, the problem is merely immediately visible over the net since individual tcp packet delays of 10s are easy to observe under fairly normal conditions, and I have seen evidence of 30s trips in other people's reports. It's not _unique_ to the net, but sheeucks, if you want to think of it that way, go ahead! Such delays may in themselves cause timeouts in md - I don't know. My RFC (maybe "RFD") is aimed at raising a flag saying that something is going on here that needs better control. > > > because things doesn't work optimally under enbd, which is after all a > > > rather rare use case compared to using MD on top of real disks. > > > > Strawman. > > Quah? Above. > > > Besides, it seems a rather pointless exercise to try and hide the fact > > > from MD that the device is gone, since it *is* in fact missing. > > > > Well, we don't really know that for sure. As you know, it is > > impossible to tell in general if the net has gone awol or is simply > > heavily overloaded (with retry requests). 
> > From MD's point of view, if we're unable to complete a request to the > device, then it's either missing or faulty. If a call to the device > blocks, then it's just very slow. The underlying device has to take a decision about what to tell the upper (md) layer. I can tell you from experience that users just HATE it if the underlying device always blocks until the other end of the net connection comes back on line. C.f. nfs "hard" option. Try it and hate it. The alternative, reasonable in my opinion, is to tell the overlying md device that a io request has failed after about 10-30s of hanging around waiting for it. Unforrrrrrrtunately, the effect is BAAAAAD at the moment, because (as I indicated above), this can lead to md layer retries aimed at the same lower device, IMMMMMMEDIATELY, which are going to fail for the same reason the first io request failed. What the upper layer, md, ought to do is "back off". 1) try again immediately - if that fails, then don't give up but .. 2) wait a while before retrying again. I _suspect_ that at the moment md is trying and retrying, and probably retrying again, all immediately, causing an avalanch of (temporary) failures, and expulsion from a raid array. > I don't think it's wise to pollute these simple mechanics with a > "maybe it's in a sort-of failing due to a network outage, which might > just be a brownout" scenario. Better to solve the problem in a more > appropriate place, somewhere that knows about the fact that we're > simulating a block device over a network connection. I've already suggested a simple mechanism above .. "back off on the retries, already". It does no harm to local disk devices. If you like, the constant of backoff can be based on how long it took the underlying device to signal the io request as failed. So a local disk that replies "failed" immediately can get its range of retries run through in a couple of hop skip and millijiffies. A network device that took 10s to report a timeout can get its next retry back again in 10s. That should give it time to recover. > Not introducing network-block-device aware code in MD is a good way to > avoid wrong code paths and weird behaviour for real block device > users. Uh, the net is everywhere. When you have 10PB of storage in your intelligent house's video image file system, the parts of that array are connected by networking room to room. Supecomputers used to have simple networking between each computing node. Heck, clusters still do :). Please keep your special case code out of the kernel :-). > "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps > fine to both real disks and NBDs. It may well be a solution. I think we're still at the stage of precisely trying to identify the problem too! At the moment, most of what I can say is "definitely, there is something wrong with the way the md layer reacts or can be controlled with respect to networking brown-outs and NBDs". > > The retry on error is a good thing. I am simply suggesting that if the > > first retry also fails that we do some back off before trying again, > > since it is now likely (lacking more knowledge) that the device is > > having trouble and may well take some time to recover. I would suspect > > that an interval of 0 1 5 10 30 60s would be appropriate for retries. > > Only for networked block devices. Shrug. Make that 0, 1, 5, 10 TIMES the time it took the device to report the request as errored. 
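Peter's "make that 0, 1, 5, 10 times the time it took the device to report
the error" idea could be expressed as a helper like this (an illustrative
sketch; the multiplier table and the jiffy units are assumptions):

```c
/*
 * Scale the retry delay by how long the lower device took to report
 * the failure: a local disk that errors instantly gets near-immediate
 * retries, while a network device that took 10s to time out gets
 * retried on a 10s-ish scale, giving it time to recover.
 */
static const unsigned int multiplier[] = { 0, 1, 5, 10 };

static unsigned long next_retry_delay(unsigned long error_latency_jiffies,
                                      unsigned int attempt)
{
        unsigned int n = sizeof(multiplier) / sizeof(multiplier[0]);

        if (attempt >= n)
                attempt = n - 1;                   /* clamp at the largest step */
        if (error_latency_jiffies == 0)
                error_latency_jiffies = 1;         /* at least one jiffy        */
        return multiplier[attempt] * error_latency_jiffies;
}
```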
> Not for real disks, there you are just causing unbearable delays for > users for no good reason, in the event that this code path is taken. We are discussing _error_ semantics. There is no bad effect at all on normal working! The effect on normal working should even be _good_ when errors occur, because now max bandwidth devoted to error retries is limited, leaving more max bandwidth for normal requests. > > One can cycle that twice for luck before giving up for good, if you > > like. The general idea in such backoff protocols is that it avoids > > filling a fixed bandwidth channel with retries (the sum of a constant > > times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel > > bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also > > there is an _additional_ assumption that the net is likely to have > > brownouts and so we _ought_ to retry at intervals since retrying > > immediately will definitely almost always do no good. > > Since the knowledge that the block device is on a network resides in > ENBD, I think the most reasonable thing to do would be to implement a > backoff in ENBD? Should be relatively simple to catch MD retries in > ENBD and block for 0 1 5 10 30 60 seconds. I can't tell which request is a retry. You are allowed to write twice to the same place in normal operation! The knowledge is in MD. > That would keep the > network backoff algorithm in a more right place, namely the place that > knows the device is on a network. See above. > > In "normal " failures there is zero delay anyway. > > Since the first retry would succeed, or? Yes. > I'm not sure what this "normal" failure is, btw. A simple read failure, followed by a successful (immediate) write attempt. The local disk will take 0s to generate the read failure, and the write (rewrite) attempt will be generated and accepted 0s later. In contrast, the net device will take 10-30s to generate a timeout for the read attempt, followed by 0s to error the succeeding write request, since the local driver of the net device will have taken the device offline as it can't get a response in 30s. At that point all io to the device will fail, all hell will break loose in the md device, and the net device will be ejected from the array in a flurry of millions of failed requests. I merely ask for a little patience. Try again in 30s. > > And further, the bitmap takes care of delayed > > responses in the normal course of events. > > Mebbe. Does it? Yes. > > > It would be reasonable for MD to know the difference between > > > - "device has (temporarily, perhaps) gone missing" and > > > - "device has physical errors when reading/writing blocks", > > > > I agree. The problem is that we can't really tell what's happening > > (even in the lower level device) across a net that is not responding. > > In the case where requests can't be delivered over the network (or a > SATA cable, whatever), it's a clear case of "missing device". It's not so clear. 10-30s delays are perfectly visible in ordinary tcp and mean nothing more than congestion. How many times have you sat there hitting the keys and waiting for something to move on the screen? > > > > The comm channel and "hey, I'm OK" message you propose doesn't seem > > > that different from just hot-adding the disks from a shell script > > > using 'mdadm'. > > > > [snip speculations on possible blocking calls] > > You could always try and see. > Should be easy to simulate a network outage. 
I should add that it's easy to simulate network outages just by lowering the timeout in enbd. At the 3s mark, and running continuous writes to a file larger than memory sited on a fs on the remote device¸ one sees timeouts every minute or so - requests which took longer than 3s to go across the local net, be carried out remotely, and be acked back. Even with no other traffic on the net. Here's a typical observation sequence I commented in correspondence to the debian maintainer ... 1 Jul 30 07:32:55 betty kernel: ENBD #1187[73]: enbd_rollback (0): error out too old (783) timedout (750) req c8da00bc! The request had a timeout of 3s (750 jiffies) and was in the kernel unserviced for just over 3s (783 jiffies) before the enbd driver errored it. I lowered the base timeout to 3s (default is 10s) in order to provoke this kind of problem. 2 Jul 30 07:32:55 betty kernel: ENBD #1115[73]: enbd_error error out req c8da00bc from slot 0! This is the notification of the enbd driver erroring the request. 3 Jul 30 07:32:55 betty kernel: Buffer I/O error on device ndb, logical block 65 540 This is the kernel noticing the request has been errored. 4 Jul 30 07:32:55 betty kernel: lost page write due to I/O error on ndb Ditto. 5 Jul 30 07:32:55 betty kernel: ENBD #1506[73]: enbd_ack (0): fatal: Bad handle c8da00bc != 00000000! The request finally comes back from the enbd server, just a fraction of a second too late, just beyond the 3s limit. 6 Jul 30 07:32:55 betty kernel: ENBD #1513[73]: enbd_ack (0): ignoring ack of req c8da00bc which slot lacks And the enbd driver ignores the late return - it already told the kernel it errored. I've increased the default timeout in response to these observations, but the real problem in my view is not that the network is sometimes slow, but the way the md driver reacts to the situation in the absence of further guidance. It needs better communications facilities with the underlying devices. Their drivers need to be able to tell the md driver about the state of the underlying device. > > I am pretty sure that removal requests will be blocked when > > requests are outstanding. > > That in particular should not be a big problem, since MD already kicks > the device for you, right? A script would only have to hot-add the > device once it's available again. I can aver from experience that one should not look to a script for salvation. There are too many deadlock opportunities - we will be out of memory in a situation where writes are going full speed to a raid device, which is writing to a device across the net, and the net is congested or has a brownout (cleaning lady action with broom and cables). Buffers will be full. It is not clear that there will be memory for the tcp socket in order to build packets to allow the buffers to flush. Really, in my experience, a real good thing to do is mark the device as temporarily failed, clear all queued requests with error, thus making memory available, yea, even for tcp sockets, and then let the device reinsert itself in the MD array when contact is reestablished across the net. At that point the MD bitmap can catch up the missed requests. This is complicated by the MD device's current tendency to issue retries (one way or the other .. does it? How?). It's interfering with the simple strategy I just sggested. > > Another problem is that enbd has to _know_ it is in a raid array, and > > which one, in order to send the ioctl. That leads one to more or less > > require that the md array tell it. 
One could build this into the mdadm > > tool, but one can't guarantee that everyone uses that (same) mdadm tool, > > so the md driver gets nominated as the best place for the code that > > does that. > > It's already in mdadm. One can't rely on mdadm - no user code is likely to work when we are out of memory and in deep oxygen debt. > You can only usefully query one way (array --> device): > # mdadm -D /dev/md0 | grep -A100 -E '^ Number' > > Number Major Minor RaidDevice State > 0 253 0 0 active sync /dev/mapper/sda1 > 1 253 1 1 active sync /dev/mapper/sdb1 I'm happy to use the ioctls that mdadm uses to get that info. If it parses /proc/mdstat, then I give up :-). The format is not regular. > That should provide you with enough information though, since devices > stay in that table even after they've gone missing. (I'm not sure > what happens when a spare takes over a place, though - test needed.) That's exactly what I mean .. the /proc output is difficult to parse. > The optimal thing would be to query the other way, of course. ENBD > should be able to tell a hotplug shell script (or whatever) about the Please no shell scripts (I'm the world's biggest fan of shell scripts otherwise) - they can't be relied on in these situations. Think of a barebones installation with a root device mirrored over the net. These generally run a single process in real time mode - a data farm, processing info pouring out of, say, an atomic physics experiment, at 1GB/s. > name of the device that's just come back. > > And you *can* in fact query the other way too, but you won't get a > useful Array UUID or device-name-of-assembled-array out of it: It's all too wishy-washy. I'm sorry, but direct ioctl or similar is the only practical way. > > Only one of two devices in a two-device mirror is generally networked. > > Makes sense. > > > The standard scenario is two local disks per network node. One is a > > mirror half for a remote raid, > > A local cache of sorts? Just a local mirror half. When the node goes down, its data state will still be available on the remote half of the mirror, and processing can continue there. > > the other is the mirror half for a local raid > > (which has a remote other half on the remote node). > > A remote backup of sorts? Just the remote half of the mirror. > > The problem is that if I take ten seconds for each one when the > > net is down memory will fill with backed up requests. The first > > one that is failed (after 10s) then triggers an immediate retry > > from md, which also gets held for 10s. We'll simply get > > huge pulses of failures of entire backed up memory spaced at 10s. > > I'm pretty sure from reports that md would error the device > > offline after a pulse like that. > > I don't see where these "huge pulses" come into the picture. Because if we are writing full tilt to the network device when the net goes down, 10s later all those requests in flight at the time (1024 off) will time out simultaneously, all together, at the same time, in unison. > If you block one MD request for 10 seconds, surely there won't be > another before you return an answer to that one? See above. We will block 1024 requests for 10s, if the request pools are fully utilized at the time (and if 1024 is the default block device queue limit .. it's either that or 256, I forget which) > > If it doesn't, then anyway enbd would decide after 30s or so that > > the remote end was down and take itself offline. > > One or the other would cause md to expell it from the array. 
I could > > try hot-add from enbd when the other end comes back, but we need to know > > we are in an array (and which) in order to do that. > > I think that's possible using mdadm at least. One would have to duplicate the ioctl calls that mdadm uses, from kernel space. It's not advisable to call out _under pressure_ to a user process to do something else in kernel. > I guess what will work is either: > > A) > > Network outage --> > ENBD fails disk --> > MD drops disk --> > Network comes back --> > ENBD brings disk back up --> This is what used to happen with the FR1/5 patch. Most of that functionality is now in the kernel code, but there is still "missing" the communication layer that allowed enbd to bring the disk back up and back into the MD array. > Something kicks off /etc/hotplug.d/block-hotplug script --> > Script queries all RAID devices and find where the disk fits --> > Script hot-adds the disk Not first choice in a hole - simpler is what I had in the FR1/5 patches: 1) MD advises enbd it's in an array, or not 2) enbd tells MD to pull it in and out of that array as it senses the condition of the network connection The first required MD to use a special ioctl to each device in an array. The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl commands, being careful also to kill any requests in flight so that the remove or add would not be blocked in md or the other block device layers. (In fact, I think I needed to add HOT_REPAIR as a special extra command, but don't quote me on that). That communications layer would work if it were restored. > Or: > > B) > > Network outage --> > ENBD fails disk, I/O error type "link error" --> We can do that. > MD sets disk status to "temporarily missing" --> Well, this is merely the kernel level communication I am looking for! You seem to want MD _not_ to drop the device, however, merely to set it inactive. I am happy with that too. > Network comes back --> > ENBD brings disk back up --> > MD sees a block device arrival, reintegrates the disk into array We need to tell MD that we're OK. I will go along with that. > I think the latter is better, because: > * Noone has to maintain husky shell scripts > * It sends a nice message to the SATA/PATA/SCSI people that MD would > really like to know whether it's a disk or a link problem. I agree totally. It's the kind of "solution" I had before, so I am happy. > But then again, shell scripts _is_ the preferred Linux solution to... > Everything. It can't be relied upon here. Imagine if the entire file system is mirrored. Hic. > MD should be fixed so HOT_REMOVE won't fail but will just kick the > disk, even if it happens to be blocking on I/O calls. > > (If there really is a reason not to kick it, then at least a > HOT_REMOVE_FORCE should be added..) So .. are we settling on a solution? I like the idea that we can advise MD that we are merely temporarily out of action. Can we take it from there? (Neil?) Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
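The ioctls mdadm uses for the array-to-device query that Peter mentions
above are GET_ARRAY_INFO and GET_DISK_INFO, which avoid parsing /proc/mdstat
entirely. A sketch of the query, assuming the structures and command names
from <linux/raid/md_u.h>:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/raid/md_u.h>    /* GET_ARRAY_INFO, GET_DISK_INFO, mdu_*_t */

/* List the component devices of an md array by major:minor. */
static int list_members(const char *array)
{
        mdu_array_info_t info;
        int fd = open(array, O_RDONLY);

        if (fd < 0 || ioctl(fd, GET_ARRAY_INFO, &info) < 0) {
                perror(array);
                return -1;
        }

        /* Slot numbers can be sparse; probe a generous range, skip holes. */
        for (int slot = 0; slot < 128; slot++) {
                mdu_disk_info_t disk = { .number = slot };

                if (ioctl(fd, GET_DISK_INFO, &disk) < 0)
                        continue;
                if (disk.major == 0 && disk.minor == 0)
                        continue;       /* empty slot */
                printf("slot %d: dev %d:%d raid_disk %d state 0x%x\n",
                       slot, disk.major, disk.minor,
                       disk.raid_disk, disk.state);
        }
        close(fd);
        return 0;
}
```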
* Re: remark and RFC 2006-08-16 19:01 ` Peter T. Breuer @ 2006-08-16 21:19 ` Molle Bestefich 2006-08-16 22:19 ` Peter T. Breuer 0 siblings, 1 reply; 18+ messages in thread From: Molle Bestefich @ 2006-08-16 21:19 UTC (permalink / raw) To: ptb; +Cc: linux raid Peter T. Breuer wrote: > > > We can't do a HOT_REMOVE while requests are outstanding, > > > as far as I know. > > Actually, I'm not quite sure which kind of requests you are > talking about. > > Only one kind. Kernel requests :). They come in read and write > flavours (let's forget about the third race for the moment). I was wondering whether you were talking about requests from eg. userspace to MD, or from MD to the raw device. I guess it's not that important really, that's why I asked you off-list. Just getting in too deep, and being curious. > "Pipe" refers to a channel of fixed bandwidth. Every communication > channel is one. The "pipe" for a local disk is composed of the bus, > disk architecture, controller, and also the kernel architecture layers. [snip] > See above. The problem is generic to fixed bandwidth transmission > channels, which, in the abstract, is "everything". As soon as one > does retransmits one has a kind of obligation to keep retransmissions > down to a fixed maximum percentage of the potential traffic, which > is generally accomplished via exponential backoff (a time-wise > solution, in other words, sdeliberately mearing retransmits out along > the time axis in order to prevent spikes). Right, so with the bandwidth to local disks being, say, 150MB/s, an appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can agree on that pretty fast.. right? ;-). > The md layers now can generate retries by at least one mechanism that I > know of .. a failed disk _read_ (maybe of existing data or parity data > as part of an exterior write attempt) will generate a disk _write_ of > the missed data (as reconstituted via redundancy info). > > I believe failed disk _write_ may also generate a retry, Can't see any reason why MD would try to fix a failed write, since it's not likely to be going to be successful anyway. > Such delays may in themselves cause timeouts in md - I don't know. My > RFC (maybe "RFD") is aimed at raising a flag saying that something is > going on here that needs better control. I'm still not convinced MD does retries at all.. > What the upper layer, md, ought to do is "back off". I think it should just kick the disk. > > I don't think it's wise to pollute these simple mechanics with a > > "maybe it's in a sort-of failing due to a network outage, which might > > just be a brownout" scenario. Better to solve the problem in a more > > appropriate place, somewhere that knows about the fact that we're > > simulating a block device over a network connection. > > I've already suggested a simple mechanism above .. "back off on the > retries, already". It does no harm to local disk devices. Except if the code path gets taken, and the user has to wait 10+20+30+60s for each failed I/O request. > If you like, the constant of backoff can be based on how long it took > the underlying device to signal the io request as failed. So a local > disk that replies "failed" immediately can get its range of retries run > through in a couple of hop skip and millijiffies. A network device that > took 10s to report a timeout can get its next retry back again in 10s. > That should give it time to recover. That sounds saner to me. 
> > Not introducing network-block-device aware code in MD is a good way to > > avoid wrong code paths and weird behaviour for real block device > > users. > > Uh, the net is everywhere. When you have 10PB of storage in your > intelligent house's video image file system, the parts of that array are > connected by networking room to room. Supecomputers used to have simple > networking between each computing node. Heck, clusters still do :). > Please keep your special case code out of the kernel :-). Uhm. > > "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps > > fine to both real disks and NBDs. > > It may well be a solution. I think we're still at the stage of > precisely trying to identify the problem too! At the moment, most > of what I can say is "definitely, there is something wrong with the > way the md layer reacts or can be controlled with respect to > networking brown-outs and NBDs". > > Not for real disks, there you are just causing unbearable delays for > > users for no good reason, in the event that this code path is taken. > > We are discussing _error_ semantics. There is no bad effect at all on > normal working! In the past, I've had MD run a box to a grinding halt more times than I like. It always results in one thing: The user pushing the big red switch. That's not acceptable for a RAID solution. It should keep working, without blocking all I/O from userspace for 5 minutes just because it thinks it's a good idea to hold up all I/O requests to underlying disks for 60s each, waiting to retry them. > The effect on normal working should even be _good_ when errors > occur, because now max bandwidth devoted to error retries is > limited, leaving more max bandwidth for normal requests. Assuming you use your RAID component device as a regular device also, and that the underlying device is not able to satisfy the requests as fast as you shove them at it. Far out ;-). > > Since the knowledge that the block device is on a network resides in > > ENBD, I think the most reasonable thing to do would be to implement a > > backoff in ENBD? Should be relatively simple to catch MD retries in > > ENBD and block for 0 1 5 10 30 60 seconds. > > I can't tell which request is a retry. You are allowed to write twice > to the same place in normal operation! The knowledge is in MD. I don't think you need to either - if ENBD only blocks 10 seconds total, and fail all requests after that period of time has lapsed once, then that could have the same effect. > In contrast, the net device will take 10-30s to generate a timeout for > the read attempt, followed by 0s to error the succeeding write request, > since the local driver of the net device will have taken the device > offline as it can't get a response in 30s. > At that point all io to the device will fail, all hell will break > loose in the md device, Really? > and the net device will be ejected from the array Fair nuff.. > in a flurry of millions of failed requests. Millions? Really? > > In the case where requests can't be delivered over the network (or a > > SATA cable, whatever), it's a clear case of "missing device". > > It's not so clear. Yes it is. If the device is not faulty, but there's a link problem, then the device is just... missing :-). Whether you actually tell MD that it's missing or not, is another story. > 10-30s delays are perfectly visible in ordinary tcp and mean nothing > more than congestion. How many times have you sat there hitting the > keys and waiting for something to move on the screen? 
I get your point, I think. There's no reason to induce the overhead of a MD sync-via-bitmap, if increasing the network timeout in ENBD will prevent the component device from being kicked in the first place. As long as the timeout doesn't cause too much grief for the end user. OTOH, a bitmap sync can happen in the background, so as long as the disk is not _constantly_ being removed/added, it should be fine to kick it real fast from the array. > Really, in my experience, a real good thing to do is mark the device as > temporarily failed, clear all queued requests with error, thus making > memory available, yea, even for tcp sockets, and then let the device > reinsert itself in the MD array when contact is reestablished across the > net. At that point the MD bitmap can catch up the missed requests. > > This is complicated by the MD device's current tendency to issue > retries (one way or the other .. does it? How?). It's interfering > with the simple strategy I just sggested. There was a patch floating around at one time in which MD would ignore a certain amount of errors from a component device. I think. Can't remember the details nor the reasoning for it. Sounded stupid to me at the time, I remember :-). > simpler is what I had in the FR1/5 patches: > > 1) MD advises enbd it's in an array, or not > 2) enbd tells MD to pull it in and out of that array as > it senses the condition of the network connection > > The first required MD to use a special ioctl to each device in an > array. > > The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl > commands, being careful also to kill any requests in flight so that > the remove or add would not be blocked in md or the other block device > layers. (In fact, I think I needed to add HOT_REPAIR as a special extra > command, but don't quote me on that). > > That communications layer would work if it were restored. > So .. are we settling on a solution? I'm just proposing counter-arguments. Talk to the Neil :-). > I like the idea that we can advise MD that we are merely > temporarily out of action. Can we take it from there? (Neil?) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 21:19 ` Molle Bestefich
@ 2006-08-16 22:19   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 22:19 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
> > See above. The problem is generic to fixed bandwidth transmission
> > channels, which, in the abstract, is "everything". As soon as one
> > does retransmits one has a kind of obligation to keep retransmissions
> > down to a fixed maximum percentage of the potential traffic, which
> > is generally accomplished via exponential backoff (a time-wise
> > solution, in other words, deliberately smearing retransmits out along
> > the time axis in order to prevent spikes).
>
> Right, so with the bandwidth to local disks being, say, 150MB/s, an
> appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can
> agree on that pretty fast.. right? ;-).

Whatever .. the multiplying constant can be anything you like, and the
backoff can be statistical in nature, not deterministic. It merely has
to back off rather than pile in retries all at once and immediately.

> > The md layers now can generate retries by at least one mechanism that I
> > know of .. a failed disk _read_ (maybe of existing data or parity data
> > as part of an exterior write attempt) will generate a disk _write_ of
> > the missed data (as reconstituted via redundancy info).
> >
> > I believe failed disk _write_ may also generate a retry,
>
> Can't see any reason why MD would try to fix a failed write, since
> it's not likely to be going to be successful anyway.

Maybe.

> > Such delays may in themselves cause timeouts in md - I don't know. My
> > RFC (maybe "RFD") is aimed at raising a flag saying that something is
> > going on here that needs better control.
>
> I'm still not convinced MD does retries at all..

It certainly attempts a rewrite after a failed read. Neil can say if
anything else is tried. Bitmaps can be used to allow writes to fail
first time and then to be synced up later.

> > What the upper layer, md, ought to do is "back off".
>
> I think it should just kick the disk.

That forces us to put it back in when the net comes back to life,
which is complicated. Life would be less complicated if it were less
prone to being kicked out in the first place.

> > We are discussing _error_ semantics. There is no bad effect at all on
> > normal working!
>
> In the past, I've had MD run a box to a grinding halt more times than
> I like. It always results in one thing: The user pushing the big red
> switch.

I agree that the error path in md probably contains some deadlock. My
observation also. That's why I prefer to react to a net brownout by
taking the lower device offline and erroring outstanding requests,
PROVIDED we can put it back in again sanely. That ain't the case at
the moment, so I'd prefer if MD would not be quite so trigger-happy on
the expulsions, which I _believe_ occurs because the lower level
device errors too many requests all at once.

> That's not acceptable for a RAID solution. It should keep working,
> without blocking all I/O from userspace for 5 minutes just because it
> thinks it's a good idea to hold up all I/O requests to underlying
> disks for 60s each, waiting to retry them.

You miscalculate here ... holding up ONE request for a retry does not
hold up ALL requests. Everything else goes through. And I proposed
that we only back off after trying again immediately.
Heck, that's probably wrong, mathematically - that can double the
bandwidth occupation per timeslice, meaning that we need to reserve
50% bandwidth for errors .. ecch. Nope - one _needs_ some finite
minimal backoff. One jiffy is enough. That moves retries into the next
time slice... umm, and we need to randomly space them out a few more
jiffies too, in a poisson distribution, in order to avoid filling the
next timeslice to capacity with errors.

Yep, I'm convinced .. need exponential statistical backoff. Each retry
needs to be delayed by an amount of time that comes from a poisson
distribution (exponential decay). The average backoff can be a jiffy.

> > The effect on normal working should even be _good_ when errors
> > occur, because now max bandwidth devoted to error retries is
> > limited, leaving more max bandwidth for normal requests.
>
> Assuming you use your RAID component device as a regular device also,

?? Oh .. you are thinking of the channel to the device. I was thinking
of the kernel itself. It has to spend time and memory on this.
Allowing it to concentrate on other io that will work without having
to cope with a sharp spike of errors at the temporarily incapacitated
low level device speeds up _other_ devices.

> and that the underlying device is not able to satisfy the requests as
> fast as you shove them at it. Far out ;-).

See above.

> > > Since the knowledge that the block device is on a network resides in
> > > ENBD, I think the most reasonable thing to do would be to implement a
> > > backoff in ENBD? Should be relatively simple to catch MD retries in
> > > ENBD and block for 0 1 5 10 30 60 seconds.
> >
> > I can't tell which request is a retry. You are allowed to write twice
> > to the same place in normal operation! The knowledge is in MD.
>
> I don't think you need to either - if ENBD only blocks 10 seconds
> total, and fails all requests after that period of time has lapsed
> once, then that could have the same effect.

When the net fails, all writes to the low level device will block for
10s, then fail all at once. Md reacts by tossing the disk out. It
probably does that because it sees failed writes (even if well
intended correction attempts provoked by a failed read). It could
instead wait a while and retry. That would succeed, since the net
would decongest meanwhile. That would make the problem disappear.

The alternative is that the low level device tries to insert itself
back in the array once the net comes back up. For that to happen it
has to know it was in one, has been tossed out, and needs to get back.
All complicated.

> > In contrast, the net device will take 10-30s to generate a timeout for
> > the read attempt, followed by 0s to error the succeeding write request,
> > since the local driver of the net device will have taken the device
> > offline as it can't get a response in 30s.
>
> > At that point all io to the device will fail, all hell will break
> > loose in the md device,
>
> Really?

Well, zillions of requests will have been errored out all at once. At
least the 256-1024 backed up in the device queue.

> > and the net device will be ejected from the array
>
> Fair nuff..
>
> > in a flurry of millions of failed requests.
>
> Millions? Really?

Hundreds.

> > > In the case where requests can't be delivered over the network (or a
> > > SATA cable, whatever), it's a clear case of "missing device".
> >
> > It's not so clear.
>
> Yes it is. If the device is not faulty, but there's a link problem,
> then the device is just... missing :-).
> Whether you actually tell MD
> that it's missing or not, is another story.

We agree that not telling it simply leads to blocking behaviour when
the net is really out forever, which is not acceptable. Telling it
after 30s results in us occasionally having to say "oops, no, I'm
sorry, we're OK again" and try and reinsert ourselves in the array,
which we currently can't do easily. I would prefer we don't tell md
until a good long time has passed, and it do retries with exp backoff
meanwhile. The array performance should not be impacted. There will be
another disk there still working.

> > 10-30s delays are perfectly visible in ordinary tcp and mean nothing
> > more than congestion. How many times have you sat there hitting the
> > keys and waiting for something to move on the screen?
>
> I get your point, I think.
>
> There's no reason to induce the overhead of an MD sync-via-bitmap, if
> increasing the network timeout in ENBD will prevent the component
> device from being kicked in the first place. As long as the timeout

There's no sensible point to set a timeout. Try an ssh session .. you
can reconnect to it an hour after cutting the cable.

> doesn't cause too much grief for the end user.
>
> OTOH, a bitmap sync can happen in the background, so as long as the
> disk is not _constantly_ being removed/added, it should be fine to
> kick it real fast from the array.

But complicated to implement, as things are, since there is no special
comms channel available with the md driver.

> > simpler is what I had in the FR1/5 patches:
> >
> > 1) MD advises enbd it's in an array, or not
> > 2) enbd tells MD to pull it in and out of that array as
> > it senses the condition of the network connection
> >
> > The first required MD to use a special ioctl to each device in an
> > array.
> >
> > The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl
> > commands, being careful also to kill any requests in flight so that
> > the remove or add would not be blocked in md or the other block device
> > layers. (In fact, I think I needed to add HOT_REPAIR as a special extra
> > command, but don't quote me on that).
> >
> > That communications layer would work if it were restored.
>
> > So .. are we settling on a solution?
>
> I'm just proposing counter-arguments.
> Talk to the Neil :-).

He readeth the list!

> > I like the idea that we can advise MD that we are merely
> > temporarily out of action. Can we take it from there? (Neil?)

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
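The "exponential statistical backoff" Peter talks himself into above amounts to delaying each retry by an exponentially distributed random interval (the inter-arrival time of a Poisson process), with the mean delay growing with the attempt number. A toy userspace sketch of the sampling; the function names, the jiffy granularity and the doubling constant are illustrative only, not anything in md or enbd.

    #include <math.h>
    #include <stdlib.h>

    /* Mean delay in jiffies for the nth retry: 1, 2, 4, 8, ... capped. */
    static double mean_backoff(unsigned int attempt)
    {
        return ldexp(1.0, attempt > 10 ? 10 : attempt);   /* 2^attempt */
    }

    /* Sample an exponentially distributed delay with that mean, via
     * inverse transform sampling: delay = -mean * ln(u), u in (0,1). */
    static unsigned long backoff_jiffies(unsigned int attempt)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double delay = -mean_backoff(attempt) * log(u);
        return (unsigned long)(delay + 0.5);
    }

Because the delays are random rather than fixed, retries from many stalled requests spread themselves over the following timeslices instead of landing in one spike, which is exactly the property argued for above.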
* Re: remark and RFC
  2006-08-16 14:28 ` Molle Bestefich
  2006-08-16 19:01   ` Peter T. Breuer
@ 2006-08-16 23:43   ` Nix
  1 sibling, 0 replies; 18+ messages in thread
From: Nix @ 2006-08-16 23:43 UTC (permalink / raw)
To: Molle Bestefich; +Cc: ptb, linux raid

On 16 Aug 2006, Molle Bestefich murmured woefully:
> Peter T. Breuer wrote:
>> > The comm channel and "hey, I'm OK" message you propose doesn't seem
>> > that different from just hot-adding the disks from a shell script
>> > using 'mdadm'.
>>
>> [snip speculations on possible blocking calls]
>
> You could always try and see.
> Should be easy to simulate a network outage.

Blocking calls are not the problem. Deadlocks are.

The problem is that forking a userspace process necessarily involves
kernel memory allocations (for the task struct, userspace memory map,
possibly text pages if the necessary pieces of mdadm are not in the
page cache), and if your swap is on the remote RAID array, you can't
necessarily carry out those allocations.

Note that the same deadlock situation is currently triggered by
sending/receiving network packets, which is why swapping over NBD is a
bad idea at present: however, this is being fixed at this moment
because until it's fixed you can't reliably have a machine with all
storage on iSCSI, for instance.

However, the deadlock is only fixable for kernel allocations, because
the amount of storage that it'll need is bounded in several ways: you
can't fix it for userspace allocations. So you can never rely on
userspace working in this situation.

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly
 see the need to define levels of inconceivability.' --- Rik Steenwinkel

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  2006-08-16 10:00 ` Molle Bestefich
@ 2006-08-16 14:59 ` Molle Bestefich
  2006-08-16 16:10   ` Peter T. Breuer
  2006-08-17 1:11 ` Neil Brown
  2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 1 reply; 18+ messages in thread
From: Molle Bestefich @ 2006-08-16 14:59 UTC (permalink / raw)
To: ptb; +Cc: linux raid

Peter T. Breuer wrote:
> I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.

Hmm, I don't think MD even does retries of requests.

It does write-back as a (very successful! Thanks Neil :-D) attempt to
fix bad blocks, but that's a different thing.

Is that what you meant?
Or is there something sandwiched between MD and ENBD that performs retries...
Or am I just wrong :-)..

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 14:59 ` Molle Bestefich
@ 2006-08-16 16:10   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 16:10 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > I would like raid request retries to be done with exponential
> > delays, so that we get a chance to overcome network brownouts.
>
> Hmm, I don't think MD even does retries of requests.

I had a "robust read" patch in FR1, and I thought Neil extended that
to "robust write". In robust read, we try to make up for a failed read
with info from elsewhere, and then we rewrite the inferred data onto
the failed device, in an attempt to fix a possible defect. In robust
write, a failed write is retried.

But robust read is enough to cause a retry (as a write).

> It does write-back as a (very successful! Thanks Neil :-D) attempt to
> fix bad blocks, but that's a different thing.

Apparently not different :).

> Is that what you meant?

Plozzziby.

> Or is there something sandwiched between MD and ENBD that performs retries...

Could be too. Dunno.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  2006-08-16 10:00 ` Molle Bestefich
  2006-08-16 14:59 ` Molle Bestefich
@ 2006-08-17 1:11 ` Neil Brown
  2006-08-17 6:28   ` Peter T. Breuer
  2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2006-08-17 1:11 UTC (permalink / raw)
To: ptb; +Cc: linux raid

On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
>
> So,
>
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.
>
> 2) I would like some channel of communication to be available
> with raid that devices can use to say that they are
> OK and would they please be reinserted in the array.
>
> The latter is the RFC thing (I presume the former will either not
> be objectionable or Neil will say "there's no need since you're wrong
> about the way raid does retries anyway").

There's no need since you're ..... you know the rest :-)
Well, sort of.

When md/raid1 gets a read error it immediately retries the request in
small (page size) chunks to find out exactly where the error is (it
does this even if the original read request is only one page).

When it hits a read error during retry, it reads from another device
(if it can find one that works) and writes what it got out to the
'faulty' drive (or drives).
If this works: great.
If not, the write error causes the drive to be kicked.

I'm not interested in putting any delays in there. It is simply the
wrong place to put them. If network brownouts might be a problem,
then the network driver gets to care about that.

Point 2 should be done in user-space.
 - notice device have been ejected from array
 - discover why. act accordingly.
 - if/when it seems to be working again, add it back into the array.

I don't see any need for this to be done in the kernel.

>
> The way the old FR1/5 code worked was to make available a couple of
> ioctls.
>
> When a device got inserted in an array, the raid code told the device
> via a special ioctl it assumed the device had that it was now in an
> array (this triggers special behaviours, such as deliberately becoming
> more error-prone and less blocky, on the assumption that we have got
> good comms with raid and can manage our own raid state). Ditto
> removal.

A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.
I haven't given it any serious thought yet. I don't even know what
low level devices recognise it or what they do in response.

NeilBrown

^ permalink raw reply [flat|nested] 18+ messages in thread
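For reference, a toy model of the raid1 read-error policy Neil describes: retry the failed range in page-sized chunks, satisfy a chunk that still fails from another mirror, write the recovered data back to the suspect device, and only kick that device if the write-back fails. The struct and callbacks below are stand-ins invented for the illustration; this is not the md source.

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    struct mirror {
        bool faulty;
        int (*read)(struct mirror *m, unsigned long sector,
                    void *buf, size_t len);
        int (*write)(struct mirror *m, unsigned long sector,
                     const void *buf, size_t len);
    };

    /* sector is in 512-byte units, len in bytes.
     * Returns 0 if the region was recovered, -1 if 'bad' had to be kicked. */
    static int fix_read_error(struct mirror *bad, struct mirror *good,
                              unsigned long sector, size_t len, void *buf)
    {
        for (size_t off = 0; off < len; off += PAGE_SIZE) {
            size_t chunk = len - off < PAGE_SIZE ? len - off : PAGE_SIZE;
            char *p = (char *)buf + off;
            unsigned long s = sector + off / 512;

            if (bad->read(bad, s, p, chunk) == 0)
                continue;                    /* this chunk was actually fine */
            if (good->read(good, s, p, chunk) != 0)
                return -1;                   /* no healthy mirror either */
            /* Re-write the reconstructed chunk; a failed write here is
             * what finally gets the device ejected from the array. */
            if (bad->write(bad, s, p, chunk) != 0) {
                bad->faulty = true;
                return -1;
            }
        }
        return 0;
    }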
* Re: remark and RFC
  2006-08-17 1:11 ` Neil Brown
@ 2006-08-17 6:28   ` Peter T. Breuer
  2006-08-19 1:35     ` Gabor Gombas
  2006-08-21 1:21     ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-17 6:28 UTC (permalink / raw)
To: Neil Brown; +Cc: linux raid

HI Neil ..

"Also sprach Neil Brown:"
> On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
> > 1) I would like raid request retries to be done with exponential
> > delays, so that we get a chance to overcome network brownouts.
> >
> > 2) I would like some channel of communication to be available
> > with raid that devices can use to say that they are
> > OK and would they please be reinserted in the array.
> >
> > The latter is the RFC thing (I presume the former will either not
> > be objectionable or Neil will say "there's no need since you're wrong
> > about the way raid does retries anyway").
>
> There's no need since you're ..... you know the rest :-)
> Well, sort of.

OK, let's see ...

> When md/raid1 gets a read error it immediately retries the request in
> small (page size) chunks to find out exactly where the error is (it
> does this even if the original read request is only one page).

OK. I didn't know that. But do you mean a read request to the RAID
device, or a read request to the underlying disk device? The latter
might form part of the implementation of a write request to the RAID
device.

(has the MD blocksize moved up to 4K then? It was at 1KB for years)

> When it hits a read error during retry, it reads from another device
> (if it can find one that works) and writes what it got out to the
> 'faulty' drive (or drives).

OK. That mechanism I was aware of.

> If this works: great.
> If not, the write error causes the drive to be kicked.

Yerrs, that's also what I thought.

> I'm not interested in putting any delays in there. It is simply the
> wrong place to put them. If network brownouts might be a problem,
> then the network driver gets to care about that.

I think you might want to reconsider (not that I know the answer).

1) if the network disk device has decided to shut down wholesale
(temporarily) because of lack of contact over the net, then
retries and writes are _bound_ to fail for a while, so there
is no point in sending them now. You'd really do infinitely
better to wait a while.

2) if the network device just blocks individual requests for say 10s
while waiting for an ack, then times them out, there is more chance
of everything continuing to work since the 10s might be long enough
for the net to recover in, but occasionally a single timeout will
occur and you will boot the device from the array (whereas waiting a
bit longer would have been the right thing to do, if only we had
known). Change 10s to any reasonable length of time.

You think the device has become unreliable because write failed, but
it hasn't ... that's just the net. Try again later! If you like
we can set the req error count to -ETIMEDOUT to signal it. Real
remote write breakage can be signalled with -EIO or something.
Only boot the device on -EIO.

3) if the network device blocks essentially forever, waiting for a
reconnect, experience says that users hate that. I believe the
md array gets stuck somewhere here (from reports), possibly in trying
to read the superblock of the blocked device.
4) what the network device driver wants to do is be able to identify
the difference between primary requests and retries, and delay
retries (or repeat them internally) with some reasonable backoff
scheme to give them more chance of working in the face of a
brownout, but it has no way of doing that. You can make the problem
go away by delaying retries yourself (is there a timedue field in
requests, as well as a timeout field? If so, maybe that can be used
to signal what kind of a request it is and how to treat it).

> Point 2 should be done in user-space.

It's not reliable - we will be under memory pressure at this point, with
all that implies; the raid device might be the very device on which the
file system sits, etc. Pick your poison!

> - notice device have been ejected from array
> - discover why. act accordingly.
> - if/when it seems to be working again, add it back into the array.
>
> I don't see any need for this to be done in the kernel.

Because there might not be any userspace (embedded device) and
userspace might be blocked via subtle or not-so-subtle deadlocks.

There's no harm in making it easy! /proc/mdstat is presently too hard
to parse reliably, I am afraid. Minor differences in presentation
arise in it for reasons I don't understand!

> > The way the old FR1/5 code worked was to make available a couple of
> > ioctls.
> >
> > When a device got inserted in an array, the raid code told the device
> > via a special ioctl it assumed the device had that it was now in an
> > array (this triggers special behaviours, such as deliberately becoming
> > more error-prone and less blocky, on the assumption that we have got
> > good comms with raid and can manage our own raid state). Ditto
> > removal.
>
> A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.

It was a different one, but yes, that would have done. The FR1/5
code needed to be told also in WHICH array it was, so that it
could send ioctls (HOT_REPAIR, or such) to the right md device
later when it felt well again. And it needed to be told when it
was ejected from the array, so as not to do that next time ...

> I haven't given it any serious thought yet. I don't even know what
> low level devices recognise it or what they do in response.

As far as I am concerned, any signal is useful. Any one which
tells me which array I am in is especially useful. And I need
to be told when I leave.

Essentially I want some kernel communication channel here. Ioctls
are fine (there is a subtle kernel deadlock involved in calling an ioctl
on a device above you from within, but I got round that once, and I can
do it again).

Thanks for the replies!

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
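Peter's -ETIMEDOUT/-EIO split in point 2 above boils down to a three-way classification at request completion time. A sketch under that assumption; the md_* names are invented for the illustration and do not exist in md itself.

    #include <errno.h>

    enum md_disposition {
        MD_COMPLETE,        /* request succeeded */
        MD_RETRY_BACKOFF,   /* transient: requeue with (statistical) backoff */
        MD_KICK_DEVICE,     /* persistent: fail the component out of the array */
    };

    static enum md_disposition md_classify_error(int error)
    {
        switch (error) {
        case 0:
            return MD_COMPLETE;
        case -ETIMEDOUT:    /* link brownout: the data may yet get through */
        case -EAGAIN:
            return MD_RETRY_BACKOFF;
        case -EIO:          /* the remote end really failed the request */
        default:
            return MD_KICK_DEVICE;
        }
    }

The point of contention in the thread is precisely whether the RAID layer should honour the middle case or treat every error as the last one.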
* Re: remark and RFC
  2006-08-17 6:28 ` Peter T. Breuer
@ 2006-08-19 1:35   ` Gabor Gombas
  2006-08-19 11:27     ` Peter T. Breuer
  2006-08-21 1:21   ` Neil Brown
  1 sibling, 1 reply; 18+ messages in thread
From: Gabor Gombas @ 2006-08-19 1:35 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Neil Brown, linux raid

On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:

> 1) if the network disk device has decided to shut down wholesale
> (temporarily) because of lack of contact over the net, then
> retries and writes are _bound_ to fail for a while, so there
> is no point in sending them now. You'd really do infinitely
> better to wait a while.

On the other hand, if it's a physical disk that's gone, you _know_ it
will not come back, and stalling your mission-critical application
waiting for a never-occurring event instead of just continuing to use
the other disk does not seem right.

> You think the device has become unreliable because write failed, but
> it hasn't ... that's just the net. Try again later! If you like
> we can set the req error count to -ETIMEDOUT to signal it. Real
> remote write breakage can be signalled with -EIO or something.
> Only boot the device on -EIO.

Depending on the application, if one device is gone for an extended
period of time (and the range of seconds is a looong time), it may be
much more appropriate to just forget about that disk and continue
instead of stalling the system waiting for the device to come back.

IMHO if you want to rely on the network, use equipment that can
provide the required QoS parameters. It may cost a lot - c'est la vie.

Gabor

-- 
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-19 1:35 ` Gabor Gombas
@ 2006-08-19 11:27   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-19 11:27 UTC (permalink / raw)
To: Gabor Gombas; +Cc: linux raid

"Also sprach Gabor Gombas:"
> On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:
>
> > 1) if the network disk device has decided to shut down wholesale
> > (temporarily) because of lack of contact over the net, then
> > retries and writes are _bound_ to fail for a while, so there
> > is no point in sending them now. You'd really do infinitely
> > better to wait a while.
>
> On the other hand, if it's a physical disk that's gone, you _know_ it
> will not come back,

Possibly. Disks are physical whether over the net or not - you mean a
"nearby" disk, I think. Now, over the net we can distinguish between a
(remote) disk failure and a communications hiatus easily. The problem
appears to be that the software above us (the md layer) is not tuned
to distinguish between the two.

> and stalling your mission-critical application
> waiting for a never-occurring event instead of just continuing to use
> the other disk does not seem right.

Then don't do it. There's no need to, as I pointed out in the
following ...

> > You think the device has become unreliable because write failed, but
> > it hasn't ... that's just the net. Try again later! If you like
> > we can set the req error count to -ETIMEDOUT to signal it. Real
> > remote write breakage can be signalled with -EIO or something.
> > Only boot the device on -EIO.
>
> Depending on the application,

?

> if one device is gone for an extended
> period of time (and the range of seconds is a looong time),

Not over the net it isn't. I just had to wait 5s before these letters
appeared on screen!

> it may be
> much more appropriate to just forget about that disk and continue
> instead of stalling the system waiting for the device to come back.

Why speculate? Let us signal what's happening. We can happily set a
timeout of 2s, say, and signal -EIO if we get an error return within
2s and -ETIMEDOUT if we don't get a response of any sort back within
2s. I ask that you (above) don't sling us out of the array when we
signal -ETIMEDOUT (or -EAGAIN, or whatever). Let us decide what's
going on and we'll signal it - don't second guess us.

> IMHO if you want to rely on the network, use equipment that can provide

Your opinion (and mine) doesn't count - I think swapping over the net
is crazy too, but people do it, notwithstanding my opinion. So
argument about whether they ought to do it or not is null and void.
They do.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
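Seen from the driver's side, the 2s window Peter proposes would look roughly like this: an explicit error reply from the remote end completes the request with -EIO at once, while silence past the window completes it with -ETIMEDOUT so the layer above can treat it as "try again later". The nbd_req structure and helper are invented for the illustration; enbd's real request handling is not shown here.

    #include <errno.h>
    #include <stdbool.h>
    #include <time.h>

    struct nbd_req {
        time_t issued;          /* when the request went out on the wire */
        bool   remote_errored;  /* remote end sent back an error reply */
        bool   acked;           /* remote end acknowledged completion */
    };

    #define NBD_ACK_WINDOW 2    /* seconds, as in the 2s example above */

    static int nbd_req_status(const struct nbd_req *r, time_t now)
    {
        if (r->acked)
            return 0;
        if (r->remote_errored)
            return -EIO;                  /* real failure at the far end */
        if (now - r->issued >= NBD_ACK_WINDOW)
            return -ETIMEDOUT;            /* probably just the net */
        return -EINPROGRESS;              /* still waiting */
    }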
* Re: remark and RFC
  2006-08-17 6:28 ` Peter T. Breuer
  2006-08-19 1:35   ` Gabor Gombas
@ 2006-08-21 1:21   ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Neil Brown @ 2006-08-21 1:21 UTC (permalink / raw)
To: ptb; +Cc: linux raid

On Thursday August 17, ptb@inv.it.uc3m.es wrote:
> HI Neil ..
>
> "Also sprach Neil Brown:"
> > On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
> > > 1) I would like raid request retries to be done with exponential
> > > delays, so that we get a chance to overcome network brownouts.
> > >
> > > 2) I would like some channel of communication to be available
> > > with raid that devices can use to say that they are
> > > OK and would they please be reinserted in the array.
> > >
> > > The latter is the RFC thing (I presume the former will either not
> > > be objectionable or Neil will say "there's no need since you're wrong
> > > about the way raid does retries anyway").
> >
> > There's no need since you're ..... you know the rest :-)
> > Well, sort of.
>
> OK, let's see ...
>
> > When md/raid1 gets a read error it immediately retries the request in
> > small (page size) chunks to find out exactly where the error is (it
> > does this even if the original read request is only one page).
>
> OK. I didn't know that. But do you mean a read request to the RAID
> device, or a read request to the underlying disk device? The latter
> might form part of the implementation of a write request to the RAID
> device.

We retry the read requests to the underlying devices. I was thinking
of raid1 particularly. For raid5 we don't retry the read as all
requests are sent down from raid5 at 4K in size so refining the
location of an error is not an issue.

For raid5 we don't retry the read. We read from all other devices and
then send a write. If that works, good. If it fails we kick the
device.

>
> (has the MD blocksize moved up to 4K then? It was at 1KB for years)
>

A 0.90 superblock has always been 4K.

> > I'm not interested in putting any delays in there. It is simply the
> > wrong place to put them. If network brownouts might be a problem,
> > then the network driver gets to care about that.
>
> I think you might want to reconsider (not that I know the answer).
>
> 1) if the network disk device has decided to shut down wholesale
> (temporarily) because of lack of contact over the net, then
> retries and writes are _bound_ to fail for a while, so there
> is no point in sending them now. You'd really do infinitely
> better to wait a while.

Tell that to the network block device. md has no knowledge of the
device under it. It sends requests. They succeed or they fail. md
acts accordingly.

>
> 2) if the network device just blocks individual requests for say 10s
> while waiting for an ack, then times them out, there is more chance
> of everything continuing to work since the 10s might be long enough
> for the net to recover in, but occasionally a single timeout will
> occur and you will boot the device from the array (whereas waiting a
> bit longer would have been the right thing to do, if only we had
> known). Change 10s to any reasonable length of time.
>
> You think the device has become unreliable because write failed, but
> it hasn't ... that's just the net. Try again later! If you like
> we can set the req error count to -ETIMEDOUT to signal it. Real
> remote write breakage can be signalled with -EIO or something.
> Only boot the device on -EIO.

For read requests, I might be happy to treat -ETIMEDOUT differently.
I get the data from elsewhere and leave the original disk alone. But
for writes, what can I do? If the write fails I have to evict the
drive, otherwise the array becomes inconsistent.

If you want to implement some extra timeout and retry for writes, do
that in user-space utilising the bitmap stuff. If you keep your
monitor app small and have it mlocked, it should continue to work fine
under high memory pressure.

>
> 3) if the network device blocks essentially forever, waiting for a
> reconnect, experience says that users hate that. I believe the
> md array gets stuck somewhere here (from reports), possibly in trying
> to read the superblock of the blocked device.

So what do you expect us to do in this case? You want the app to keep
working even though the network connection to the storage isn't
working? Doesn't make sense to me.

>
> 4) what the network device driver wants to do is be able to identify
> the difference between primary requests and retries, and delay
> retries (or repeat them internally) with some reasonable backoff
> scheme to give them more chance of working in the face of a
> brownout, but it has no way of doing that. You can make the problem
> go away by delaying retries yourself (is there a timedue field in
> requests, as well as a timeout field? If so, maybe that can be used
> to signal what kind of a request it is and how to treat it).
>
>
> > Point 2 should be done in user-space.
>
> It's not reliable - we will be under memory pressure at this point, with
> all that implies; the raid device might be the very device on which the
> file system sits, etc. Pick your poison!

mlockall

>
> > - notice device have been ejected from array
> > - discover why. act accordingly.
> > - if/when it seems to be working again, add it back into the array.
> >
> > I don't see any need for this to be done in the kernel.
>
> Because there might not be any userspace (embedded device) and
> userspace might be blocked via subtle or not-so-subtle deadlocks.

Even an embedded device can have userspace. Fix the deadlocks.

> There's no harm in making it easy! /proc/mdstat is presently too hard
> to parse reliably, I am afraid. Minor differences in presentation
> arise in it for reasons I don't understand!

There is harm in putting code in the kernel to handle a very special
case.

NeilBrown

> > > The way the old FR1/5 code worked was to make available a couple of
> > > ioctls.
> > >
> > > When a device got inserted in an array, the raid code told the device
> > > via a special ioctl it assumed the device had that it was now in an
> > > array (this triggers special behaviours, such as deliberately becoming
> > > more error-prone and less blocky, on the assumption that we have got
> > > good comms with raid and can manage our own raid state). Ditto
> > > removal.
> >
> > A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.
>
> It was a different one, but yes, that would have done. The FR1/5
> code needed to be told also in WHICH array it was, so that it
> could send ioctls (HOT_REPAIR, or such) to the right md device
> later when it felt well again. And it needed to be told when it
> was ejected from the array, so as not to do that next time ...
>
> > I haven't given it any serious thought yet. I don't even know what
> > low level devices recognise it or what they do in response.
>
> As far as I am concerned, any signal is useful. Any one which
> tells me which array I am in is especially useful. And I need
> to be told when I leave.
>
> Essentially I want some kernel communication channel here. Ioctls
> are fine (there is a subtle kernel deadlock involved in calling an ioctl
> on a device above you from within, but I got round that once, and I can
> do it again).
>
> Thanks for the replies!
>
> Peter
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 18+ messages in thread
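Neil's "mlockall" answer is the key to making a userspace monitor survive memory pressure: lock every current and future page of the monitor into RAM at startup and avoid fork() thereafter. A skeleton, with the kicked-device check and the re-add left as stubs; the re-add could use the HOT_ADD_DISK ioctl shown earlier rather than spawning mdadm, since spawning needs fresh allocations.

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Stubs: detect a kicked component (e.g. by scanning sysfs or using
     * the md ioctls) and push it back in.  Left empty in this sketch. */
    static bool check_kicked(const char *md_dev)    { (void)md_dev; return false; }
    static void readd_component(const char *md_dev) { (void)md_dev; }

    int main(void)
    {
        /* Pin the whole process image, heap and stack included, so the
         * monitor never needs to page anything in from the stuck array. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        for (;;) {
            if (check_kicked("/dev/md0"))
                readd_component("/dev/md0");
            sleep(5);
        }
    }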
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  ` (2 preceding siblings ...)
  2006-08-17 1:11 ` Neil Brown
@ 2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 0 replies; 18+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2006-08-17 14:11 UTC (permalink / raw)
To: linux-raid

Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.

Hmmm, IMHO this should be implemented in nbd/enbd where it belongs,
and errors should be masked within nbd/enbd then. Since (at least) md
has no read-timeouts or something like that (please correct me if I'm
wrong), this should be no big issue.

Typically, storage media communication channels are loss-free, so
either a read is okay or it fails. A storage medium usually has no
retry-on-timeout semantics, so upper layers are with high probability
not aware of such a thing. This is the same with RAID as well as with
filesystems, so if you run an ext2 or something like that on top of
your enbd you should suffer from the same problems: if a read fails,
the filesystem goes dead, gets remounted read-only or follows whatever
error-strategy you have it configured for.

regards
   Mario

-- 
() Ascii Ribbon Campaign
/\ Support plain text e-mail

^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>]
* Re: remark and RFC
       [not found] <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>
@ 2006-08-16 16:15 ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 16:15 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
[Charset ISO-8859-1 unsupported, filtering to ASCII...]
> Peter T. Breuer wrote:
> > We can't do a HOT_REMOVE while requests are outstanding, as far as I
> > know.
>
> Actually, I'm not quite sure which kind of requests you are talking about.

Only one kind. Kernel requests :). They come in read and write
flavours (let's forget about the third race for the moment).

> Also, there's been relatively recent changes to this code:
> http://marc.theaimsgroup.com/?l=linux-raid&m=108075865413863&w=2

That's 2004 and is about the inclusion of the bitmapping code into
md/raid. We're years beyond there. What I'm talking about presupposes
all those changes in the raid layers, and laments that it's still
missing one more thing that was in the FR1/5 patches .. namely the
ability for the underlying device to communicate with the md layer
about its state of health, taking itself in and out of the array as
appropriate.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
@ 2006-08-18 7:51 Peter T. Breuer
0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-18 7:51 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, linux raid
"Also sprach ptb:"
> 4) what the network device driver wants to do is be able to identify
> the difference between primary requests and retries, and delay
> retries (or repeat them internally) with some reasonable backoff
> scheme to give them more chance of working in the face of a
> brownout, but it has no way of doing that. You can make the problem
> go away by delaying retries yourself (is there a timedue field in
> requests, as well as a timeout field? If so, maybe that can be used
> to signal what kind of a request it is and how to treat it).
If one could set the
unsigned long start_time;
field in the outgoing retry request to now + 1 jiffy, that might be
helpful. I can't see a functionally significant use of this field at
present in the kernel ... ll_rw_blk rewrites the field when merging
requests and end_that_request then uses it for the accounting stats
(duration) __disk_stat_add(disk, ticks[rw], duration) which will add a
minus 1 at worst.
Shame there isn't a timedue field in the request struct.
Silly idea, maybe.
Peter
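A toy illustration of the start_time idea above: stamp a resubmitted request with a time one tick in the future so a driver that cares can recognise it as a retry and delay it, at the cost of at most one tick of error in the accounting that normally consumes the field. The cut-down struct and helpers are invented for the illustration; this is not a claim about what the block layer would accept.

    #include <stdbool.h>

    struct toy_request {
        unsigned long start_time;   /* normally: when the request was queued */
    };

    static unsigned long jiffies;   /* stand-in for the kernel tick counter */

    static void mark_as_retry(struct toy_request *rq)
    {
        rq->start_time = jiffies + 1;        /* "due" one tick from now */
    }

    static bool looks_like_retry(const struct toy_request *rq)
    {
        /* A start time in the future can only have been planted on purpose. */
        return (long)(rq->start_time - jiffies) > 0;
    }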
^ permalink raw reply [flat|nested] 18+ messages in thread

end of thread, other threads:[~2006-08-21 1:21 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-16 9:06 remark and RFC Peter T. Breuer
2006-08-16 10:00 ` Molle Bestefich
2006-08-16 13:06 ` Peter T. Breuer
2006-08-16 14:28 ` Molle Bestefich
2006-08-16 19:01 ` Peter T. Breuer
2006-08-16 21:19 ` Molle Bestefich
2006-08-16 22:19 ` Peter T. Breuer
2006-08-16 23:43 ` Nix
2006-08-16 14:59 ` Molle Bestefich
2006-08-16 16:10 ` Peter T. Breuer
2006-08-17 1:11 ` Neil Brown
2006-08-17 6:28 ` Peter T. Breuer
2006-08-19 1:35 ` Gabor Gombas
2006-08-19 11:27 ` Peter T. Breuer
2006-08-21 1:21 ` Neil Brown
2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
[not found] <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>
2006-08-16 16:15 ` Peter T. Breuer
-- strict thread matches above, loose matches on Subject: below --
2006-08-18 7:51 Peter T. Breuer