linux-raid.vger.kernel.org archive mirror
* remark and RFC
@ 2006-08-16  9:06 Peter T. Breuer
  2006-08-16 10:00 ` Molle Bestefich
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16  9:06 UTC
  To: linux raid

Hello - 

I believe the current kernel raid code retries failed reads too
quickly and gives up too soon for operation over a network device.

Over my enbd device, the default mode of operation used to be for the
device to time out requests after 30s of net stalemate, and maybe even
to stop and restart the socket if the blockage was very prolonged,
showing the device as invalid for 5s in order to clear any in-kernel
requests that had not yet arrived at its own queue.

That happens, and my interpretation of reports reaching me is that the
current raid code sends retries quickly, which all fail, and the device
gets expelled, which is bad :-(.

BTW, with my old FR1/5 patch, the enbd device could tell the raid layer
when it felt OK again, and the patched raid code would reinsert the
device and catch up on requests marked as missed in the bitmap.

Now, that mode of operation isn't available to enbd since there is no
comm channel to the official raid layer, so all I can do is make the
enbd device block on network timeouts.  But that's totally
unsatisfactory, since real network outages then cause permanent blocks
on anything touching a file system mounted remotely (a la NFS
hard-erroring style).  People don't like that.

And letting the enbd device error temporarily provokes a cascade of
retries from raid which fail and get the enbd device ejected
permanently, also real bad.

So,

1) I would like raid request retries to be done with exponential
   delays, so that we get a chance to overcome network brownouts.

2) I would like some channel of communication with raid to be
   available, which devices can use to say that they are OK again
   and to ask to be reinserted in the array.

The latter is the RFC thing (I presume the former will either not
be objectionable or Neil will say "there's no need since you're wrong
about the way raid does retries anyway").
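
To make request 1 concrete, here is a minimal userspace sketch of an
exponential retry delay.  It is not kernel code and not how md works
today; the names retry_delay_ms, BASE_DELAY_MS and MAX_DELAY_MS are
hypothetical, and the 100ms base and 30s cap are assumed values chosen
only for illustration.

```c
#include <stdint.h>

/* Hypothetical tuning values, assumed for illustration only. */
#define BASE_DELAY_MS 100u    /* delay before the first retry */
#define MAX_DELAY_MS  30000u  /* upper bound on any single delay */

/*
 * Delay to wait before retry number `attempt` (0 = first retry).
 * The delay doubles with each attempt and is capped at MAX_DELAY_MS,
 * so the loop stops doubling as soon as the cap is reached and the
 * value cannot overflow.
 */
uint32_t retry_delay_ms(unsigned attempt)
{
    uint32_t delay = BASE_DELAY_MS;

    while (attempt-- > 0 && delay < MAX_DELAY_MS)
        delay *= 2;

    return delay > MAX_DELAY_MS ? MAX_DELAY_MS : delay;
}
```

With these assumed values the retries would be spaced 100ms, 200ms,
400ms and so on, reaching the 30s cap at the tenth retry, which leaves
room for a short network brownout to pass before the device is
expelled.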

The way the old FR1/5 code worked was to make available a couple of
ioctls.

When a device got inserted in an array, the raid code told the device
that it was now in an array, via a special ioctl that it assumed the
device supported (this triggers special behaviours in the device, such
as deliberately becoming more error-prone and less blocky, on the
assumption that it has good comms with raid and can manage its own
raid state).  Ditto on removal.

When the device felt good (or ill) it notified the raid arrays it
knew it was in via another ioctl (really just hot-add or hot-remove),
and the raid layer would do the appropriate catchup (or start bitmapping
for it).
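
The exchange described above can be modelled roughly as follows.  This
is a toy userspace sketch, not the FR1/5 patch itself: the struct and
the member_* functions are hypothetical names standing in for the
ioctls, and the "bitmap" is just an array of flags over an assumed
8-block device.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8  /* toy device size, assumed for illustration */

/* Toy model of one array member as the raid layer sees it. */
struct member {
    bool in_array;         /* device has been told it is in an array */
    bool active;           /* raid is currently sending it writes */
    bool missed[NBLOCKS];  /* writes missed while the member was out */
};

/* raid -> device: "you are now in an array" (the first ioctl). */
void member_insert(struct member *m)
{
    m->in_array = true;
    m->active = true;
    for (size_t i = 0; i < NBLOCKS; i++)
        m->missed[i] = false;
}

/* device -> raid: "I feel ill" (hot-remove); raid starts bitmapping. */
void member_report_ill(struct member *m)
{
    if (m->in_array)
        m->active = false;
}

/* A write arrives; if the member is out, only mark it in the bitmap. */
void member_write(struct member *m, size_t block)
{
    if (m->in_array && !m->active && block < NBLOCKS)
        m->missed[block] = true;
}

/*
 * device -> raid: "I feel OK" (hot-add).  Raid catches up on the
 * blocks marked as missed and reactivates the member.  Returns the
 * number of blocks that needed resyncing.
 */
size_t member_report_ok(struct member *m)
{
    size_t resynced = 0;

    if (!m->in_array)
        return 0;
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (m->missed[i]) {
            m->missed[i] = false;
            resynced++;
        }
    }
    m->active = true;
    return resynced;
}
```

The point of the model is that a temporary outage costs only a catchup
over the marked blocks, rather than permanent ejection followed by a
full resync.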

Can we have something like that in the official code? If so, what?

Peter

[parent not found: <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>]
* Re: remark and RFC
@ 2006-08-18  7:51 Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-18  7:51 UTC
  To: ptb; +Cc: Neil Brown, linux raid

"Also sprach ptb:"
> 4) what the network device driver wants to do is be able to identify
>    the difference between primary requests and retries, and delay 
>    retries (or repeat them internally) with some reasonable backoff
>    scheme to give them more chance of working in the face of a
>    brownout, but it has no way of doing that.  You can make the problem
>    go away by delaying retries yourself (is there a timedue field in
>    requests, as well as a timeout field?  If so, maybe that can be used
>    to signal what kind of a request it is and how to treat it).

If one could set the 

    unsigned long start_time;

field in the outgoing retry request to now + 1 jiffy, that might be
helpful.  I can't see a functionally significant use of this field at
present in the kernel ...  ll_rw_blk rewrites the field when merging
requests, and end_that_request then uses it for the accounting stats:
it computes a duration and calls
__disk_stat_add(disk, ticks[rw], duration), which at worst will add
minus 1.
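
A small userspace sketch of that idea, under the assumption that a
tick counter behaves like jiffies (an unsigned long that may wrap).
All four function names are hypothetical; tick_after only imitates the
style of the kernel's time_after() comparison and relies on the usual
two's-complement conversion from unsigned long to long.

```c
#include <stdbool.h>

/* Wrap-safe "a is later than b" on an unsigned tick counter. */
bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Hypothetical: raid stamps a retry as starting one tick in the future. */
unsigned long stamp_retry(unsigned long now)
{
    return now + 1;
}

/* Hypothetical driver-side check: a start time still in the future
 * marks the request as a retry. */
bool looks_like_retry(unsigned long start_time, unsigned long now)
{
    return tick_after(start_time, now);
}

/* Duration as the accounting would compute it: end minus start, in
 * unsigned arithmetic.  If the request ends in the same tick it was
 * stamped in, this wraps to (unsigned long)-1, i.e. "minus 1". */
unsigned long accounted_duration(unsigned long start_time, unsigned long end)
{
    return end - start_time;
}
```

A driver could only use such a check if it looked at the request in
the same tick in which it was stamped, so a larger offset than one
tick would probably be needed in practice.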

Shame there isn't a timedue field in the request struct.

Silly idea, maybe.

Peter


end of thread, other threads:[~2006-08-21  1:21 UTC | newest]

Thread overview: 18+ messages
2006-08-16  9:06 remark and RFC Peter T. Breuer
2006-08-16 10:00 ` Molle Bestefich
2006-08-16 13:06   ` Peter T. Breuer
2006-08-16 14:28     ` Molle Bestefich
2006-08-16 19:01       ` Peter T. Breuer
2006-08-16 21:19         ` Molle Bestefich
2006-08-16 22:19           ` Peter T. Breuer
2006-08-16 23:43       ` Nix
2006-08-16 14:59 ` Molle Bestefich
2006-08-16 16:10   ` Peter T. Breuer
2006-08-17  1:11 ` Neil Brown
2006-08-17  6:28   ` Peter T. Breuer
2006-08-19  1:35     ` Gabor Gombas
2006-08-19 11:27       ` Peter T. Breuer
2006-08-21  1:21     ` Neil Brown
2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
     [not found] <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>
2006-08-16 16:15 ` Peter T. Breuer
  -- strict thread matches above, loose matches on Subject: below --
2006-08-18  7:51 Peter T. Breuer
