* remark and RFC
@ 2006-08-16 9:06 Peter T. Breuer
2006-08-16 10:00 ` Molle Bestefich
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 9:06 UTC (permalink / raw)
To: linux raid
Hello -
I believe the current kernel raid code retries failed reads too
quickly and gives up too soon for operation over a network device.
Over the enbd device (mine), the default mode of operation used to be
to time out requests after 30s of net stalemate, and perhaps even to
stop and restart the socket if the blockage was very prolonged, showing
the device as invalid for 5s in order to clear any in-kernel requests
that hadn't yet arrived at its own queue.
That timeout behaviour still happens, and my interpretation of the
reports reaching me is that the current raid code then sends retries
quickly, which all fail, and the device gets expelled, which is bad :-(.
BTW, with my old FR1/5 patch, the enbd device could tell the raid layer
when it felt OK again, and the patched raid code would reinsert the
device and catch up on requests marked as missed in the bitmap.
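The catch-up being described amounts to resyncing only the chunks flagged
in a write-intent bitmap. A minimal sketch of that loop, in which the chunk
size, bitmap_test() and copy_chunk() are illustrative names rather than md
internals:

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_SIZE   (64 * 1024)   /* hypothetical bitmap granularity */

/* Test bit 'chunk' in a packed write-intent bitmap. */
static int bitmap_test(const uint8_t *bitmap, size_t chunk)
{
        return bitmap[chunk / 8] & (1u << (chunk % 8));
}

/*
 * Catch up a freshly re-inserted mirror half: only the chunks whose
 * bits are set (i.e. were written while the device was away) need to
 * be copied from the good half; everything else is already in sync.
 *
 * copy_chunk() stands in for whatever actually moves the data.
 */
static void catch_up(const uint8_t *bitmap, size_t nchunks,
                     int (*copy_chunk)(size_t chunk))
{
        for (size_t chunk = 0; chunk < nchunks; chunk++)
                if (bitmap_test(bitmap, chunk))
                        copy_chunk(chunk);   /* resync this chunk only */
}
```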
Now, that mode of operation isn't available to enbd since there is no
comm channel to the official raid layer, so all I can do is make the
enbd device block on network timeouts. But that's totally
unsatisfactory, since real network outages then cause permanent blocks
on anything touching a file system mounted remotely (a la NFS
hard-erroring style). People don't like that.
And letting the enbd device error out temporarily provokes a cascade
of retries from raid, which all fail and get the enbd device ejected
permanently, which is also really bad.
So,
1) I would like raid request retries to be done with exponential
delays, so that we get a chance to overcome network brownouts.
2) I would like some channel of communication to be available
with raid that devices can use to say that they are
OK and would they please be reinserted in the array.
The latter is the RFC thing (I presume the former will either not
be objectionable or Neil will say "there's no need since you're wrong
about the way raid does retries anyway").
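To make point 1 concrete, here is a minimal userspace sketch of the retry
schedule being proposed, using the 0/1/5/10/30/60s intervals and the "cycle
it twice" idea from above; submit_request() is a hypothetical callback, not
md code:

```c
#include <stdbool.h>
#include <unistd.h>

/* Proposed retry intervals in seconds: immediate first, then backing off. */
static const unsigned int retry_delay[] = { 0, 1, 5, 10, 30, 60 };
#define NUM_DELAYS (sizeof(retry_delay) / sizeof(retry_delay[0]))
#define NUM_CYCLES 2            /* "cycle that twice for luck" */

/*
 * Retry a failed request with exponentially staged delays instead of
 * hammering the device with immediate retries.  submit_request() is a
 * placeholder for whatever re-issues the I/O; it returns true on success.
 */
static bool retry_with_backoff(bool (*submit_request)(void *req), void *req)
{
        for (int cycle = 0; cycle < NUM_CYCLES; cycle++) {
                for (unsigned int i = 0; i < NUM_DELAYS; i++) {
                        sleep(retry_delay[i]);     /* 0s the first time   */
                        if (submit_request(req))
                                return true;       /* device came back    */
                }
        }
        return false;   /* only now give up and fail the device           */
}
```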
The way the old FR1/5 code worked was to make available a couple of
ioctls.
When a device got inserted in an array, the raid code told the device,
via a special ioctl that it assumed the device implemented, that it was
now in an array (this triggers special behaviours, such as deliberately
erroring faster and blocking less, on the assumption that we have good
comms with raid and can manage our own raid state). Ditto removal.
When the device felt good (or ill) it notified the raid arrays it
knew it was in via another ioctl (really just hot-add or hot-remove),
and the raid layer would do the appropriate catchup (or start bitmapping
for it).
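Concretely, the old interface amounted to a pair of notifications in each
direction, roughly like the definitions below. The command numbers and names
are hypothetical reconstructions for illustration; they are not in the
mainline md or enbd code:

```c
#include <linux/ioctl.h>

/*
 * Hypothetical downward channel: md tells a component device that it
 * has been included in (or removed from) an array, so the device can
 * switch to "fail fast, raid will manage my state" behaviour.
 */
#define ENBD_SET_MD_MEMBER    _IOW('e', 0x20, int)   /* arg: md minor, or -1 */

/*
 * Hypothetical upward channel: the device tells md how it feels.  In
 * the old FR1/5 patch this was essentially hot-add / hot-remove issued
 * from kernel space by the enbd driver itself.
 */
#define MD_MEMBER_FEELS_ILL   _IO('e', 0x21)
#define MD_MEMBER_FEELS_WELL  _IO('e', 0x22)
```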
Can we have something like that in the official code? If so, what?
Peter
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: remark and RFC
  2006-08-16  9:06 remark and RFC Peter T. Breuer
@ 2006-08-16 10:00 ` Molle Bestefich
  2006-08-16 13:06   ` Peter T. Breuer
  2006-08-16 14:59 ` Molle Bestefich
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Molle Bestefich @ 2006-08-16 10:00 UTC (permalink / raw)
To: ptb; +Cc: linux raid

Peter T. Breuer wrote:
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.
>
> I presume the former will either not be objectionable

You want to hurt performance for every single MD user out there, just
because things don't work optimally under enbd, which is after all a
rather rare use case compared to using MD on top of real disks.

Uuuuh.. yeah, no objections there.

Besides, it seems a rather pointless exercise to try and hide the fact
from MD that the device is gone, since it *is* in fact missing. Seems
wrong at the least.

> 2) I would like some channel of communication to be available
> with raid that devices can use to say that they are
> OK and would they please be reinserted in the array.
>
> The latter is the RFC thing

It would be reasonable for MD to know the difference between
- "device has (temporarily, perhaps) gone missing" and
- "device has physical errors when reading/writing blocks",
because if MD knew that, then it would be trivial to automatically
hot-add the missing device once available again. Whereas the faulty
one would need the administrator to get off his couch.

This would help in other areas too, like when a disk controller dies,
or a cable comes (completely) loose.

Even if the IDE drivers are not mature enough to tell us which kind of
error it is, MD could still implement such a feature just to help enbd.

I don't think a comm-channel is the right answer, though.

I think the type=(missing/faulty) information should be embedded in
the I/O error message from the block layer (enbd in your case)
instead, to avoid race conditions and allow MD to take good decisions
as early as possible.

The comm channel and "hey, I'm OK" message you propose doesn't seem
that different from just hot-adding the disks from a shell script
using 'mdadm'.

> When the device felt good (or ill) it notified the raid arrays it
> knew it was in via another ioctl (really just hot-add or hot-remove),
> and the raid layer would do the appropriate catchup (or start
> bitmapping for it).

No point in bitmapping. Since with the network down and all the
devices underlying the RAID missing, there's nowhere to store data.
Right?

Some more factual data about your setup would maybe be good..

> all I can do is make the enbd device block on network timeouts.
> But that's totally unsatisfactory, since real network outages then
> cause permanent blocks on anything touching a file system
> mounted remotely. People don't like that.

If it's just this that you want to fix, you could write a DM module
which returns I/O error if the request to the underlying device takes
more than 10 seconds.

Layer that module on top of the RAID, and make your enbd device block
on network timeouts.

Now the RAID array doesn't see missing disks on network outages, and
users get near-instant errors when the array isn't responsive due to a
network outage.

^ permalink raw reply [flat|nested] 18+ messages in thread
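Molle's suggestion of embedding the missing/faulty distinction in the I/O
error itself could look roughly like this on the md side. The errno
convention and the classify_completion() helper are purely illustrative,
not an existing md interface:

```c
#include <errno.h>

/*
 * Hypothetical policy in the raid layer, keyed off the error the
 * component driver used to complete the request:
 *
 *   -ETIMEDOUT / -ENOLINK : transport problem, device is "missing";
 *                           mark it temporarily out, keep bitmapping,
 *                           and re-add it automatically later.
 *   anything else (-EIO)  : medium problem, device is "faulty";
 *                           kick it and wait for the administrator.
 */
enum member_state { MEMBER_OK, MEMBER_MISSING, MEMBER_FAULTY };

static enum member_state classify_completion(int error)
{
        switch (error) {
        case 0:
                return MEMBER_OK;
        case -ETIMEDOUT:
        case -ENOLINK:
                return MEMBER_MISSING;   /* link outage: expect it back */
        default:
                return MEMBER_FAULTY;    /* bad blocks: needs a human   */
        }
}
```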
* Re: remark and RFC 2006-08-16 10:00 ` Molle Bestefich @ 2006-08-16 13:06 ` Peter T. Breuer 2006-08-16 14:28 ` Molle Bestefich 0 siblings, 1 reply; 18+ messages in thread From: Peter T. Breuer @ 2006-08-16 13:06 UTC (permalink / raw) To: Molle Bestefich; +Cc: linux raid "Also sprach Molle Bestefich:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > 1) I would like raid request retries to be done with exponential > > delays, so that we get a chance to overcome network brownouts. > > > > I presume the former will either not be objectionable > > You want to hurt performance for every single MD user out there, just There's no performance drop! Exponentially staged retries on failure are standard in all network protocols ... it is the appropriate reaction in general, since stuffing the pipe full of immediate retries doesn't allow the would-be successful transactions to even get a look in against that competition. > because things doesn't work optimally under enbd, which is after all a > rather rare use case compared to using MD on top of real disks. Strawman. > Uuuuh.. yeah, no objections there. > > Besides, it seems a rather pointless exercise to try and hide the fact > from MD that the device is gone, since it *is* in fact missing. Well, we don't really know that for sure. As you know, it is impossible to tell in general if the net has gone awol or is simply heavily overloaded (with retry requests). The retry on error is a good thing. I am simply suggesting that if the first retry also fails that we do some back off before trying again, since it is now likely (lacking more knowledge) that the device is having trouble and may well take some time to recover. I would suspect that an interval of 0 1 5 10 30 60s would be appropriate for retries. One can cycle that twice for luck before giving up for good, if you like. The general idea in such backoff protocols is that it avoids filling a fixed bandwidth channel with retries (the sum of a constant times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also there is an _additional_ assumption that the net is likely to have brownouts and so we _ought_ to retry at intervals since retrying immediately will definitely almost always do no good. > Seems > wrong at the least. There is no effect on the normal request path, and the effect is beneficial to successful requests by reducing the competing buildup of failed requests, when they do occur. In "normal " failures there is zero delay anyway. And further, the bitmap takes care of delayed responses in the normal course of events. > > 2) I would like some channel of communication to be available > > with raid that devices can use to say that they are > > OK and would they please be reinserted in the array. > > > > The latter is the RFC thing > > It would be reasonable for MD to know the difference between > - "device has (temporarily, perhaps) gone missing" and > - "device has physical errors when reading/writing blocks", I agree. The problem is that we can't really tell what's happening (even in the lower level device) across a net that is not responding. Enbd generally hides the problem for a short period of time, then gives up and advises md (if only it could nowadays - I mean with the fr1 patch) that the device is down, and then tells md when the device comes back, so that the bitmap can be discharged and the device be caught up. 
The problem is that at the moment the md layer has no way of being told that the device is OK again (and that it decides on its own account that the device is bad when it sends umpteen retries within a short period of time only to get them all rejected). > because if MD knew that, then it would be trivial to automatically > hot-add the missing device once available again. Whereas the faulty > one would need the administrator to get off his couch. Yes. The idea is that across the net approximately ALL failures are temporary ones, to a value of something like 99.99%. The cleaning lady is usually dusting the on-off switch on the router. > This would help in other areas too, like when a disk controller dies, > or a cable comes (completely) loose. > > Even if the IDE drivers are not mature enough to tell us which kind of > error it is, MD could still implement such a feature just to help > enbd. > > I don't think a comm-channel is the right answer, though. > > I think the type=(missing/faulty) information should be embedded in > the I/O error message from the block layer (enbd in your case) > instead, to avoid race conditions and allow MD to take good decisions > as early as possible. That's a possibility. I certainly get two types of error back in the enbd driver .. remote error or network error. Remote error is when we get told by the other end that the disk has a problem. Network error is when we hear nothing, and have a timeout. I can certainly pass that on. Any suggestions? > The comm channel and "hey, I'm OK" message you propose doesn't seem > that different from just hot-adding the disks from a shell script > using 'mdadm'. Talking through userspace has subtle deadlock problems. I wouldn't rely on it in this kind of situation. Blocking a device can lead to a file system being blocked and processes getting stalled for all kinds of peripheral reasons, for example. I have seen file descriptor closes getting blocked, to name the bizarre. I am pretty sure that removal requests will be blocked when requests are outstanding. Another problem is that enbd has to _know_ it is in a raid array, and which one, in order to send the ioctl. That leads one to more or less require that the md array tell it. One could build this into the mdadm tool, but one can't guarantee that everyone uses that (same) mdadm tool, so the md driver gets nominated as the best place for the code that does that. > > When the device felt good (or ill) it notified the raid arrays it > > knew it was in via another ioctl (really just hot-add or hot-remove), > > and the raid layer would do the appropriate catchup (or start > > bitmapping for it). > > No point in bitmapping. Since with the network down and all the > devices underlying the RAID missing, there's nowhere to store data. > Right? Only one of two devices in a two-device mirror is generally networked. The standard scenario is two local disks per network node. One is a mirror half for a remote raid, the other is the mirror half for a local raid (which has a remote other half on the remote node). More complicated setups can also be built - there are entire grids of such nodes arranged in a torus, with local redundancy arranged in groups of three neighbours, each with two local devices and one remote device. Etc. > Some more factual data about your setup would maybe be good.. It's not my setup! Invent your own :-). > > all I can do is make the enbd device block on network timeouts. 
> > But that's totally unsatisfactory, since real network outages then > > cause permanent blocks on anything touching a file system > > mounted remotely. People don't like that. > > If it's just this that you want to fix, you could write a DM module > which returns I/O error if the request to the underlying device takes > more than 10 seconds. I'm not sure that another layer helps. I can timeout requests myself in 10s within enbd if I want to. The problem is that if I take ten seconds for each one when the net is down memory will fill with backed up requests. The first one that is failed (after 10s) then triggers an immediate retry from md, which also gets held for 10s. We'll simply get huge pulses of failures of entire backed up memory spaced at 10s. :-o I'm pretty sure from reports that md would error the device offline after a pulse like that. If it doesn't, then anyway enbd would decide after 30s or so that the remote end was down and take itself offline. One or the other would cause md to expell it from the array. I could try hot-add from enbd when the other end comes back, but we need to know we are in an array (and which) in order to do that. > Layer that module on top of the RAID, and make your enbd device block > on network timeouts. It shifts the problem to no avail, as far as I understand you, and my understanding is likely faulty. Can you be more specific about how this attacks the problem? > Now the RAID array doesn't see missing disks on network outages, and It wouldn't see them anyway when enbd is in normal mode - it blocks. The problem is that that behaviour is really bad for user satisfaction! Enbd used instead to tell the md device that it was feeling ill, error all requests, allowing md to chuck it out of the array. Then enbd would tell the md device when it was feeling well again, and make md reinsert it in the array. Md would catch up using the bitmap. Right now, we can't really tell md we're feeling ill (that would be a HOT_ARRRGH, but md doesn't have that). If we could, then md could decide on its own to murder all outstanding requests for us and chuck us out, with the implicit understanding that we will come back again soon and then the bitbap can catcj us up. We can't do a HOT_REMOVE while requests are outstanding, as far as I know. > users get near-instant errors when the array isn't responsive due to a > network outage. I agree that the lower level device should report errors quickly up to md. The problem is that that leads to it being chucked out unceremonially, for ever and a day .. 1) md shouldn't chuck us out for a few errors - nets are like that 2) we should be able to chuck ourselves out when we feel the net is weak 3) we should be able to chuck ourselves back in when we feel better 4) for that to happen, we need to have been told by md when we are in an array and which I simply proposed that (1) has the easy solution of md doing retries with exponential backoff for a while, instead of chucking us out. The rest needs discussion. Maybe it can be done in userspace, but be advised that I think that is remarkably tricky! In particular, it's almost impossible to test adequately ... which alone would make me aim for an embedded solution (i.e. driver code). Peter ^ permalink raw reply [flat|nested] 18+ messages in thread
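The hot-add Peter mentions is an ordinary md ioctl, so once the enbd end
knows which array it belongs to, re-insertion from user space amounts to
something like the following sketch. It assumes the HOT_ADD_DISK definition
from <linux/raid/md_u.h> and ignores the in-flight-request problem discussed
above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/raid/md_u.h>     /* HOT_ADD_DISK, HOT_REMOVE_DISK */

/* Re-add a component device (e.g. /dev/ndb) to an array (e.g. /dev/md0). */
static int hot_add(const char *array, const char *component)
{
        struct stat st;
        int fd, ret;

        if (stat(component, &st) < 0 || !S_ISBLK(st.st_mode))
                return -1;

        fd = open(array, O_RDWR);
        if (fd < 0)
                return -1;

        /* The ioctl argument is the component's device number. */
        ret = ioctl(fd, HOT_ADD_DISK, (unsigned long)st.st_rdev);
        if (ret < 0)
                perror("HOT_ADD_DISK");
        close(fd);
        return ret;
}
```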
* Re: remark and RFC 2006-08-16 13:06 ` Peter T. Breuer @ 2006-08-16 14:28 ` Molle Bestefich 2006-08-16 19:01 ` Peter T. Breuer 2006-08-16 23:43 ` Nix 0 siblings, 2 replies; 18+ messages in thread From: Molle Bestefich @ 2006-08-16 14:28 UTC (permalink / raw) To: ptb; +Cc: linux raid Peter T. Breuer wrote: > > You want to hurt performance for every single MD user out there, just > > There's no performance drop! Exponentially staged retries on failure > are standard in all network protocols ... it is the appropriate > reaction in general, since stuffing the pipe full of immediate retries > doesn't allow the would-be successful transactions to even get a look in > against that competition. That's assuming that there even is a pipe, which is something specific to ENBD / networked block devices, not something that the MD driver should in general care about. > > because things doesn't work optimally under enbd, which is after all a > > rather rare use case compared to using MD on top of real disks. > > Strawman. Quah? > > Besides, it seems a rather pointless exercise to try and hide the fact > > from MD that the device is gone, since it *is* in fact missing. > > Well, we don't really know that for sure. As you know, it is > impossible to tell in general if the net has gone awol or is simply > heavily overloaded (with retry requests). From MD's point of view, if we're unable to complete a request to the device, then it's either missing or faulty. If a call to the device blocks, then it's just very slow. I don't think it's wise to pollute these simple mechanics with a "maybe it's in a sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection. Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users. "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps fine to both real disks and NBDs. > The retry on error is a good thing. I am simply suggesting that if the > first retry also fails that we do some back off before trying again, > since it is now likely (lacking more knowledge) that the device is > having trouble and may well take some time to recover. I would suspect > that an interval of 0 1 5 10 30 60s would be appropriate for retries. Only for networked block devices. Not for real disks, there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken. > One can cycle that twice for luck before giving up for good, if you > like. The general idea in such backoff protocols is that it avoids > filling a fixed bandwidth channel with retries (the sum of a constant > times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel > bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also > there is an _additional_ assumption that the net is likely to have > brownouts and so we _ought_ to retry at intervals since retrying > immediately will definitely almost always do no good. Since the knowledge that the block device is on a network resides in ENBD, I think the most reasonable thing to do would be to implement a backoff in ENBD? Should be relatively simple to catch MD retries in ENBD and block for 0 1 5 10 30 60 seconds. That would keep the network backoff algorithm in a more right place, namely the place that knows the device is on a network. 
> In "normal " failures there is zero delay anyway. Since the first retry would succeed, or? I'm not sure what this "normal" failure is, btw. > And further, the bitmap takes care of delayed > responses in the normal course of events. Mebbe. Does it? > > It would be reasonable for MD to know the difference between > > - "device has (temporarily, perhaps) gone missing" and > > - "device has physical errors when reading/writing blocks", > > I agree. The problem is that we can't really tell what's happening > (even in the lower level device) across a net that is not responding. In the case where requests can't be delivered over the network (or a SATA cable, whatever), it's a clear case of "missing device". > > because if MD knew that, then it would be trivial to automatically > > hot-add the missing device once available again. Whereas the faulty > > one would need the administrator to get off his couch. > > Yes. The idea is that across the net approximately ALL failures are > temporary ones, to a value of something like 99.99%. The cleaning lady > is usually dusting the on-off switch on the router. > > > This would help in other areas too, like when a disk controller dies, > > or a cable comes (completely) loose. > > > > Even if the IDE drivers are not mature enough to tell us which kind of > > error it is, MD could still implement such a feature just to help > > enbd. > > > > I don't think a comm-channel is the right answer, though. > > > > I think the type=(missing/faulty) information should be embedded in > > the I/O error message from the block layer (enbd in your case) > > instead, to avoid race conditions and allow MD to take good decisions > > as early as possible. > > That's a possibility. I certainly get two types of error back in the > enbd driver .. remote error or network error. Remote error is when > we get told by the other end that the disk has a problem. Network > error is when we hear nothing, and have a timeout. > > I can certainly pass that on. Any suggestions? Let's hear from Neil what he thinks. > > The comm channel and "hey, I'm OK" message you propose doesn't seem > > that different from just hot-adding the disks from a shell script > > using 'mdadm'. > > [snip speculations on possible blocking calls] You could always try and see. Should be easy to simulate a network outage. > I am pretty sure that removal requests will be blocked when > requests are outstanding. That in particular should not be a big problem, since MD already kicks the device for you, right? A script would only have to hot-add the device once it's available again. > Another problem is that enbd has to _know_ it is in a raid array, and > which one, in order to send the ioctl. That leads one to more or less > require that the md array tell it. One could build this into the mdadm > tool, but one can't guarantee that everyone uses that (same) mdadm tool, > so the md driver gets nominated as the best place for the code that > does that. It's already in mdadm. You can only usefully query one way (array --> device): # mdadm -D /dev/md0 | grep -A100 -E '^ Number' Number Major Minor RaidDevice State 0 253 0 0 active sync /dev/mapper/sda1 1 253 1 1 active sync /dev/mapper/sdb1 That should provide you with enough information though, since devices stay in that table even after they've gone missing. (I'm not sure what happens when a spare takes over a place, though - test needed.) The optimal thing would be to query the other way, of course. 
ENBD should be able to tell a hotplug shell script (or whatever) about the name of the device that's just come back. And you *can* in fact query the other way too, but you won't get a useful Array UUID or device-name-of-assembled-array out of it: # mdadm -E /dev/mapper/sda2 [snip blah, no array information :-(] Expanding -E output to include the Array UUID would be a good feature in any case. Expanding -E output to include which array device is currently mounted, having the corresponding Array UUID would be neat, but I'm sure that most users would probably misunderstand what this means :-). > Only one of two devices in a two-device mirror is generally networked. Makes sense. > The standard scenario is two local disks per network node. One is a > mirror half for a remote raid, A local cache of sorts? > the other is the mirror half for a local raid > (which has a remote other half on the remote node). A remote backup of sorts? > More complicated setups can also be built - there are entire grids of > such nodes arranged in a torus, with local redundancy arranged in > groups of three neighbours, each with two local devices and one remote > device. Etc. Neat ;-). > > > all I can do is make the enbd device block on network timeouts. > > > But that's totally unsatisfactory, since real network outages then > > > cause permanent blocks on anything touching a file system > > > mounted remotely. People don't like that. > > > > If it's just this that you want to fix, you could write a DM module > > which returns I/O error if the request to the underlying device takes > > more than 10 seconds. > > I'm not sure that another layer helps. I can timeout requests myself in > 10s within enbd if I want to. Yeah, okay. I suggested that further up, but I guess you thought of it before I did :-). > The problem is that if I take ten seconds for each one when the > net is down memory will fill with backed up requests. The first > one that is failed (after 10s) then triggers an immediate retry > from md, which also gets held for 10s. We'll simply get > huge pulses of failures of entire backed up memory spaced at 10s. > I'm pretty sure from reports that md would error the device > offline after a pulse like that. I don't see where these "huge pulses" come into the picture. If you block one MD request for 10 seconds, surely there won't be another before you return an answer to that one? > If it doesn't, then anyway enbd would decide after 30s or so that > the remote end was down and take itself offline. > One or the other would cause md to expell it from the array. I could > try hot-add from enbd when the other end comes back, but we need to know > we are in an array (and which) in order to do that. I think that's possible using mdadm at least. > > Layer that module on top of the RAID, and make your enbd > > device block on network timeouts. > > It shifts the problem to no avail, as far as I understand you, and my > understanding is likely faulty. Can you be more specific about how this > attacks the problem? Never was much of a good explainer... I was of the impression that you wanted an error message to be propagated quickly to userspace / users, but the MD array to just be silently paused, whenever a network outage occurred. Since you've mentioned that there's actually local disk components in the RAID arrays, I imagine you would want the array to NOT be paused, since it could reasonably continue operation on one device. So just forget about that proposal, it won't work in this situation :-). 
I guess what will work is either: A) Network outage --> ENBD fails disk --> MD drops disk --> Network comes back --> ENBD brings disk back up --> Something kicks off /etc/hotplug.d/block-hotplug script --> Script queries all RAID devices and find where the disk fits --> Script hot-adds the disk Or: B) Network outage --> ENBD fails disk, I/O error type "link error" --> MD sets disk status to "temporarily missing" --> Network comes back --> ENBD brings disk back up --> MD sees a block device arrival, reintegrates the disk into array I think the latter is better, because: * Noone has to maintain husky shell scripts * It sends a nice message to the SATA/PATA/SCSI people that MD would really like to know whether it's a disk or a link problem. But then again, shell scripts _is_ the preferred Linux solution to... Everything. > Enbd used instead to tell the md device that it was feeling ill, error > all requests, allowing md to chuck it out of the array. Then enbd would > tell the md device when it was feeling well again, and make md > reinsert it in the array. Md would catch up using the bitmap. > > Right now, we can't really tell md we're feeling ill (that would be a > HOT_ARRRGH, but md doesn't have that). If we could, then md could > decide on its own to murder all outstanding requests for us and > chuck us out, with the implicit understanding that we will come back > again soon and then the bitbap can catcj us up. > > We can't do a HOT_REMOVE while requests are outstanding, as far as I > know. MD should be fixed so HOT_REMOVE won't fail but will just kick the disk, even if it happens to be blocking on I/O calls. (If there really is a reason not to kick it, then at least a HOT_REMOVE_FORCE should be added..) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC 2006-08-16 14:28 ` Molle Bestefich @ 2006-08-16 19:01 ` Peter T. Breuer 2006-08-16 21:19 ` Molle Bestefich 2006-08-16 23:43 ` Nix 1 sibling, 1 reply; 18+ messages in thread From: Peter T. Breuer @ 2006-08-16 19:01 UTC (permalink / raw) To: Molle Bestefich; +Cc: linux raid [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=UNKNOWN-8BIT, Size: 21014 bytes --] "Also sprach Molle Bestefich:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > > You want to hurt performance for every single MD user out there, just > > > > There's no performance drop! Exponentially staged retries on failure > > are standard in all network protocols ... it is the appropriate > > reaction in general, since stuffing the pipe full of immediate retries > > doesn't allow the would-be successful transactions to even get a look in > > against that competition. > > That's assuming that there even is a pipe, "Pipe" refers to a channel of fixed bandwidth. Every communication channel is one. The "pipe" for a local disk is composed of the bus, disk architecture, controller, and also the kernel architecture layers. For example, only 256 (or 1024, whatever) kernel requests can be outstanding at a time per device [queue], so if 1024 retry requests are in flight, no real work will get done (some kind of priority placement may be done in each driver .. in enbd I take care to replace retries last in the existing queue, for example). > which is something specific > to ENBD / networked block devices, not something that the MD driver > should in general care about. See above. The problem is generic to fixed bandwidth transmission channels, which, in the abstract, is "everything". As soon as one does retransmits one has a kind of obligation to keep retransmissions down to a fixed maximum percentage of the potential traffic, which is generally accomplished via exponential backoff (a time-wise solution, in other words, sdeliberately mearing retransmits out along the time axis in order to prevent spikes). The md layers now can generate retries by at least one mechanism that I know of .. a failed disk _read_ (maybe of existing data or parity data as part of an exterior write attempt) will generate a disk _write_ of the missed data (as reconstituted via redundancy info). I believe failed disk _write_ may also generate a retry, but the above is already enough, no? Anyway, the problem is merely immediately visible over the net since individual tcp packet delays of 10s are easy to observe under fairly normal conditions, and I have seen evidence of 30s trips in other people's reports. It's not _unique_ to the net, but sheeucks, if you want to think of it that way, go ahead! Such delays may in themselves cause timeouts in md - I don't know. My RFC (maybe "RFD") is aimed at raising a flag saying that something is going on here that needs better control. > > > because things doesn't work optimally under enbd, which is after all a > > > rather rare use case compared to using MD on top of real disks. > > > > Strawman. > > Quah? Above. > > > Besides, it seems a rather pointless exercise to try and hide the fact > > > from MD that the device is gone, since it *is* in fact missing. > > > > Well, we don't really know that for sure. As you know, it is > > impossible to tell in general if the net has gone awol or is simply > > heavily overloaded (with retry requests). 
> > From MD's point of view, if we're unable to complete a request to the > device, then it's either missing or faulty. If a call to the device > blocks, then it's just very slow. The underlying device has to take a decision about what to tell the upper (md) layer. I can tell you from experience that users just HATE it if the underlying device always blocks until the other end of the net connection comes back on line. C.f. nfs "hard" option. Try it and hate it. The alternative, reasonable in my opinion, is to tell the overlying md device that a io request has failed after about 10-30s of hanging around waiting for it. Unforrrrrrrtunately, the effect is BAAAAAD at the moment, because (as I indicated above), this can lead to md layer retries aimed at the same lower device, IMMMMMMEDIATELY, which are going to fail for the same reason the first io request failed. What the upper layer, md, ought to do is "back off". 1) try again immediately - if that fails, then don't give up but .. 2) wait a while before retrying again. I _suspect_ that at the moment md is trying and retrying, and probably retrying again, all immediately, causing an avalanch of (temporary) failures, and expulsion from a raid array. > I don't think it's wise to pollute these simple mechanics with a > "maybe it's in a sort-of failing due to a network outage, which might > just be a brownout" scenario. Better to solve the problem in a more > appropriate place, somewhere that knows about the fact that we're > simulating a block device over a network connection. I've already suggested a simple mechanism above .. "back off on the retries, already". It does no harm to local disk devices. If you like, the constant of backoff can be based on how long it took the underlying device to signal the io request as failed. So a local disk that replies "failed" immediately can get its range of retries run through in a couple of hop skip and millijiffies. A network device that took 10s to report a timeout can get its next retry back again in 10s. That should give it time to recover. > Not introducing network-block-device aware code in MD is a good way to > avoid wrong code paths and weird behaviour for real block device > users. Uh, the net is everywhere. When you have 10PB of storage in your intelligent house's video image file system, the parts of that array are connected by networking room to room. Supecomputers used to have simple networking between each computing node. Heck, clusters still do :). Please keep your special case code out of the kernel :-). > "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps > fine to both real disks and NBDs. It may well be a solution. I think we're still at the stage of precisely trying to identify the problem too! At the moment, most of what I can say is "definitely, there is something wrong with the way the md layer reacts or can be controlled with respect to networking brown-outs and NBDs". > > The retry on error is a good thing. I am simply suggesting that if the > > first retry also fails that we do some back off before trying again, > > since it is now likely (lacking more knowledge) that the device is > > having trouble and may well take some time to recover. I would suspect > > that an interval of 0 1 5 10 30 60s would be appropriate for retries. > > Only for networked block devices. Shrug. Make that 0, 1, 5, 10 TIMES the time it took the device to report the request as errored. 
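Peter's "make that 0, 1, 5, 10 times the time it took the device to report
the error" idea could be expressed as a helper like this (an illustrative
sketch; the multiplier table and the jiffy units are assumptions):

```c
/*
 * Scale the retry delay by how long the lower device took to report
 * the failure: a local disk that errors instantly gets near-immediate
 * retries, while a network device that took 10s to time out gets
 * retried on a 10s-ish scale, giving it time to recover.
 */
static const unsigned int multiplier[] = { 0, 1, 5, 10 };

static unsigned long next_retry_delay(unsigned long error_latency_jiffies,
                                      unsigned int attempt)
{
        unsigned int n = sizeof(multiplier) / sizeof(multiplier[0]);

        if (attempt >= n)
                attempt = n - 1;                   /* clamp at the largest step */
        if (error_latency_jiffies == 0)
                error_latency_jiffies = 1;         /* at least one jiffy        */
        return multiplier[attempt] * error_latency_jiffies;
}
```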
> Not for real disks, there you are just causing unbearable delays for > users for no good reason, in the event that this code path is taken. We are discussing _error_ semantics. There is no bad effect at all on normal working! The effect on normal working should even be _good_ when errors occur, because now max bandwidth devoted to error retries is limited, leaving more max bandwidth for normal requests. > > One can cycle that twice for luck before giving up for good, if you > > like. The general idea in such backoff protocols is that it avoids > > filling a fixed bandwidth channel with retries (the sum of a constant > > times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel > > bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also > > there is an _additional_ assumption that the net is likely to have > > brownouts and so we _ought_ to retry at intervals since retrying > > immediately will definitely almost always do no good. > > Since the knowledge that the block device is on a network resides in > ENBD, I think the most reasonable thing to do would be to implement a > backoff in ENBD? Should be relatively simple to catch MD retries in > ENBD and block for 0 1 5 10 30 60 seconds. I can't tell which request is a retry. You are allowed to write twice to the same place in normal operation! The knowledge is in MD. > That would keep the > network backoff algorithm in a more right place, namely the place that > knows the device is on a network. See above. > > In "normal " failures there is zero delay anyway. > > Since the first retry would succeed, or? Yes. > I'm not sure what this "normal" failure is, btw. A simple read failure, followed by a successful (immediate) write attempt. The local disk will take 0s to generate the read failure, and the write (rewrite) attempt will be generated and accepted 0s later. In contrast, the net device will take 10-30s to generate a timeout for the read attempt, followed by 0s to error the succeeding write request, since the local driver of the net device will have taken the device offline as it can't get a response in 30s. At that point all io to the device will fail, all hell will break loose in the md device, and the net device will be ejected from the array in a flurry of millions of failed requests. I merely ask for a little patience. Try again in 30s. > > And further, the bitmap takes care of delayed > > responses in the normal course of events. > > Mebbe. Does it? Yes. > > > It would be reasonable for MD to know the difference between > > > - "device has (temporarily, perhaps) gone missing" and > > > - "device has physical errors when reading/writing blocks", > > > > I agree. The problem is that we can't really tell what's happening > > (even in the lower level device) across a net that is not responding. > > In the case where requests can't be delivered over the network (or a > SATA cable, whatever), it's a clear case of "missing device". It's not so clear. 10-30s delays are perfectly visible in ordinary tcp and mean nothing more than congestion. How many times have you sat there hitting the keys and waiting for something to move on the screen? > > > > The comm channel and "hey, I'm OK" message you propose doesn't seem > > > that different from just hot-adding the disks from a shell script > > > using 'mdadm'. > > > > [snip speculations on possible blocking calls] > > You could always try and see. > Should be easy to simulate a network outage. 
I should add that it's easy to simulate network outages just by lowering the timeout in enbd. At the 3s mark, and running continuous writes to a file larger than memory sited on a fs on the remote device¸ one sees timeouts every minute or so - requests which took longer than 3s to go across the local net, be carried out remotely, and be acked back. Even with no other traffic on the net. Here's a typical observation sequence I commented in correspondence to the debian maintainer ... 1 Jul 30 07:32:55 betty kernel: ENBD #1187[73]: enbd_rollback (0): error out too old (783) timedout (750) req c8da00bc! The request had a timeout of 3s (750 jiffies) and was in the kernel unserviced for just over 3s (783 jiffies) before the enbd driver errored it. I lowered the base timeout to 3s (default is 10s) in order to provoke this kind of problem. 2 Jul 30 07:32:55 betty kernel: ENBD #1115[73]: enbd_error error out req c8da00bc from slot 0! This is the notification of the enbd driver erroring the request. 3 Jul 30 07:32:55 betty kernel: Buffer I/O error on device ndb, logical block 65 540 This is the kernel noticing the request has been errored. 4 Jul 30 07:32:55 betty kernel: lost page write due to I/O error on ndb Ditto. 5 Jul 30 07:32:55 betty kernel: ENBD #1506[73]: enbd_ack (0): fatal: Bad handle c8da00bc != 00000000! The request finally comes back from the enbd server, just a fraction of a second too late, just beyond the 3s limit. 6 Jul 30 07:32:55 betty kernel: ENBD #1513[73]: enbd_ack (0): ignoring ack of req c8da00bc which slot lacks And the enbd driver ignores the late return - it already told the kernel it errored. I've increased the default timeout in response to these observations, but the real problem in my view is not that the network is sometimes slow, but the way the md driver reacts to the situation in the absence of further guidance. It needs better communications facilities with the underlying devices. Their drivers need to be able to tell the md driver about the state of the underlying device. > > I am pretty sure that removal requests will be blocked when > > requests are outstanding. > > That in particular should not be a big problem, since MD already kicks > the device for you, right? A script would only have to hot-add the > device once it's available again. I can aver from experience that one should not look to a script for salvation. There are too many deadlock opportunities - we will be out of memory in a situation where writes are going full speed to a raid device, which is writing to a device across the net, and the net is congested or has a brownout (cleaning lady action with broom and cables). Buffers will be full. It is not clear that there will be memory for the tcp socket in order to build packets to allow the buffers to flush. Really, in my experience, a real good thing to do is mark the device as temporarily failed, clear all queued requests with error, thus making memory available, yea, even for tcp sockets, and then let the device reinsert itself in the MD array when contact is reestablished across the net. At that point the MD bitmap can catch up the missed requests. This is complicated by the MD device's current tendency to issue retries (one way or the other .. does it? How?). It's interfering with the simple strategy I just sggested. > > Another problem is that enbd has to _know_ it is in a raid array, and > > which one, in order to send the ioctl. That leads one to more or less > > require that the md array tell it. 
One could build this into the mdadm > > tool, but one can't guarantee that everyone uses that (same) mdadm tool, > > so the md driver gets nominated as the best place for the code that > > does that. > > It's already in mdadm. One can't rely on mdadm - no user code is likely to work when we are out of memory and in deep oxygen debt. > You can only usefully query one way (array --> device): > # mdadm -D /dev/md0 | grep -A100 -E '^ Number' > > Number Major Minor RaidDevice State > 0 253 0 0 active sync /dev/mapper/sda1 > 1 253 1 1 active sync /dev/mapper/sdb1 I'm happy to use the ioctls that mdadm uses to get that info. If it parses /proc/mdstat, then I give up :-). The format is not regular. > That should provide you with enough information though, since devices > stay in that table even after they've gone missing. (I'm not sure > what happens when a spare takes over a place, though - test needed.) That's exactly what I mean .. the /proc output is difficult to parse. > The optimal thing would be to query the other way, of course. ENBD > should be able to tell a hotplug shell script (or whatever) about the Please no shell scripts (I'm the world's biggest fan of shell scripts otherwise) - they can't be relied on in these situations. Think of a barebones installation with a root device mirrored over the net. These generally run a single process in real time mode - a data farm, processing info pouring out of, say, an atomic physics experiment, at 1GB/s. > name of the device that's just come back. > > And you *can* in fact query the other way too, but you won't get a > useful Array UUID or device-name-of-assembled-array out of it: It's all too wishy-washy. I'm sorry, but direct ioctl or similar is the only practical way. > > Only one of two devices in a two-device mirror is generally networked. > > Makes sense. > > > The standard scenario is two local disks per network node. One is a > > mirror half for a remote raid, > > A local cache of sorts? Just a local mirror half. When the node goes down, its data state will still be available on the remote half of the mirror, and processing can continue there. > > the other is the mirror half for a local raid > > (which has a remote other half on the remote node). > > A remote backup of sorts? Just the remote half of the mirror. > > The problem is that if I take ten seconds for each one when the > > net is down memory will fill with backed up requests. The first > > one that is failed (after 10s) then triggers an immediate retry > > from md, which also gets held for 10s. We'll simply get > > huge pulses of failures of entire backed up memory spaced at 10s. > > I'm pretty sure from reports that md would error the device > > offline after a pulse like that. > > I don't see where these "huge pulses" come into the picture. Because if we are writing full tilt to the network device when the net goes down, 10s later all those requests in flight at the time (1024 off) will time out simultaneously, all together, at the same time, in unison. > If you block one MD request for 10 seconds, surely there won't be > another before you return an answer to that one? See above. We will block 1024 requests for 10s, if the request pools are fully utilized at the time (and if 1024 is the default block device queue limit .. it's either that or 256, I forget which) > > If it doesn't, then anyway enbd would decide after 30s or so that > > the remote end was down and take itself offline. > > One or the other would cause md to expell it from the array. 
I could > > try hot-add from enbd when the other end comes back, but we need to know > > we are in an array (and which) in order to do that. > > I think that's possible using mdadm at least. One would have to duplicate the ioctl calls that mdadm uses, from kernel space. It's not advisable to call out _under pressure_ to a user process to do something else in kernel. > I guess what will work is either: > > A) > > Network outage --> > ENBD fails disk --> > MD drops disk --> > Network comes back --> > ENBD brings disk back up --> This is what used to happen with the FR1/5 patch. Most of that functionality is now in the kernel code, but there is still "missing" the communication layer that allowed enbd to bring the disk back up and back into the MD array. > Something kicks off /etc/hotplug.d/block-hotplug script --> > Script queries all RAID devices and find where the disk fits --> > Script hot-adds the disk Not first choice in a hole - simpler is what I had in the FR1/5 patches: 1) MD advises enbd it's in an array, or not 2) enbd tells MD to pull it in and out of that array as it senses the condition of the network connection The first required MD to use a special ioctl to each device in an array. The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl commands, being careful also to kill any requests in flight so that the remove or add would not be blocked in md or the other block device layers. (In fact, I think I needed to add HOT_REPAIR as a special extra command, but don't quote me on that). That communications layer would work if it were restored. > Or: > > B) > > Network outage --> > ENBD fails disk, I/O error type "link error" --> We can do that. > MD sets disk status to "temporarily missing" --> Well, this is merely the kernel level communication I am looking for! You seem to want MD _not_ to drop the device, however, merely to set it inactive. I am happy with that too. > Network comes back --> > ENBD brings disk back up --> > MD sees a block device arrival, reintegrates the disk into array We need to tell MD that we're OK. I will go along with that. > I think the latter is better, because: > * Noone has to maintain husky shell scripts > * It sends a nice message to the SATA/PATA/SCSI people that MD would > really like to know whether it's a disk or a link problem. I agree totally. It's the kind of "solution" I had before, so I am happy. > But then again, shell scripts _is_ the preferred Linux solution to... > Everything. It can't be relied upon here. Imagine if the entire file system is mirrored. Hic. > MD should be fixed so HOT_REMOVE won't fail but will just kick the > disk, even if it happens to be blocking on I/O calls. > > (If there really is a reason not to kick it, then at least a > HOT_REMOVE_FORCE should be added..) So .. are we settling on a solution? I like the idea that we can advise MD that we are merely temporarily out of action. Can we take it from there? (Neil?) Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
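The ioctls mdadm uses for the array-to-device query that Peter mentions
above are GET_ARRAY_INFO and GET_DISK_INFO, which avoid parsing /proc/mdstat
entirely. A sketch of the query, assuming the structures and command names
from <linux/raid/md_u.h>:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/raid/md_u.h>    /* GET_ARRAY_INFO, GET_DISK_INFO, mdu_*_t */

/* List the component devices of an md array by major:minor. */
static int list_members(const char *array)
{
        mdu_array_info_t info;
        int fd = open(array, O_RDONLY);

        if (fd < 0 || ioctl(fd, GET_ARRAY_INFO, &info) < 0) {
                perror(array);
                return -1;
        }

        /* Slot numbers can be sparse; probe a generous range, skip holes. */
        for (int slot = 0; slot < 128; slot++) {
                mdu_disk_info_t disk = { .number = slot };

                if (ioctl(fd, GET_DISK_INFO, &disk) < 0)
                        continue;
                if (disk.major == 0 && disk.minor == 0)
                        continue;       /* empty slot */
                printf("slot %d: dev %d:%d raid_disk %d state 0x%x\n",
                       slot, disk.major, disk.minor,
                       disk.raid_disk, disk.state);
        }
        close(fd);
        return 0;
}
```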
* Re: remark and RFC 2006-08-16 19:01 ` Peter T. Breuer @ 2006-08-16 21:19 ` Molle Bestefich 2006-08-16 22:19 ` Peter T. Breuer 0 siblings, 1 reply; 18+ messages in thread From: Molle Bestefich @ 2006-08-16 21:19 UTC (permalink / raw) To: ptb; +Cc: linux raid Peter T. Breuer wrote: > > > We can't do a HOT_REMOVE while requests are outstanding, > > > as far as I know. > > Actually, I'm not quite sure which kind of requests you are > talking about. > > Only one kind. Kernel requests :). They come in read and write > flavours (let's forget about the third race for the moment). I was wondering whether you were talking about requests from eg. userspace to MD, or from MD to the raw device. I guess it's not that important really, that's why I asked you off-list. Just getting in too deep, and being curious. > "Pipe" refers to a channel of fixed bandwidth. Every communication > channel is one. The "pipe" for a local disk is composed of the bus, > disk architecture, controller, and also the kernel architecture layers. [snip] > See above. The problem is generic to fixed bandwidth transmission > channels, which, in the abstract, is "everything". As soon as one > does retransmits one has a kind of obligation to keep retransmissions > down to a fixed maximum percentage of the potential traffic, which > is generally accomplished via exponential backoff (a time-wise > solution, in other words, sdeliberately mearing retransmits out along > the time axis in order to prevent spikes). Right, so with the bandwidth to local disks being, say, 150MB/s, an appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can agree on that pretty fast.. right? ;-). > The md layers now can generate retries by at least one mechanism that I > know of .. a failed disk _read_ (maybe of existing data or parity data > as part of an exterior write attempt) will generate a disk _write_ of > the missed data (as reconstituted via redundancy info). > > I believe failed disk _write_ may also generate a retry, Can't see any reason why MD would try to fix a failed write, since it's not likely to be going to be successful anyway. > Such delays may in themselves cause timeouts in md - I don't know. My > RFC (maybe "RFD") is aimed at raising a flag saying that something is > going on here that needs better control. I'm still not convinced MD does retries at all.. > What the upper layer, md, ought to do is "back off". I think it should just kick the disk. > > I don't think it's wise to pollute these simple mechanics with a > > "maybe it's in a sort-of failing due to a network outage, which might > > just be a brownout" scenario. Better to solve the problem in a more > > appropriate place, somewhere that knows about the fact that we're > > simulating a block device over a network connection. > > I've already suggested a simple mechanism above .. "back off on the > retries, already". It does no harm to local disk devices. Except if the code path gets taken, and the user has to wait 10+20+30+60s for each failed I/O request. > If you like, the constant of backoff can be based on how long it took > the underlying device to signal the io request as failed. So a local > disk that replies "failed" immediately can get its range of retries run > through in a couple of hop skip and millijiffies. A network device that > took 10s to report a timeout can get its next retry back again in 10s. > That should give it time to recover. That sounds saner to me. 
> > Not introducing network-block-device aware code in MD is a good way to > > avoid wrong code paths and weird behaviour for real block device > > users. > > Uh, the net is everywhere. When you have 10PB of storage in your > intelligent house's video image file system, the parts of that array are > connected by networking room to room. Supecomputers used to have simple > networking between each computing node. Heck, clusters still do :). > Please keep your special case code out of the kernel :-). Uhm. > > "Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps > > fine to both real disks and NBDs. > > It may well be a solution. I think we're still at the stage of > precisely trying to identify the problem too! At the moment, most > of what I can say is "definitely, there is something wrong with the > way the md layer reacts or can be controlled with respect to > networking brown-outs and NBDs". > > Not for real disks, there you are just causing unbearable delays for > > users for no good reason, in the event that this code path is taken. > > We are discussing _error_ semantics. There is no bad effect at all on > normal working! In the past, I've had MD run a box to a grinding halt more times than I like. It always results in one thing: The user pushing the big red switch. That's not acceptable for a RAID solution. It should keep working, without blocking all I/O from userspace for 5 minutes just because it thinks it's a good idea to hold up all I/O requests to underlying disks for 60s each, waiting to retry them. > The effect on normal working should even be _good_ when errors > occur, because now max bandwidth devoted to error retries is > limited, leaving more max bandwidth for normal requests. Assuming you use your RAID component device as a regular device also, and that the underlying device is not able to satisfy the requests as fast as you shove them at it. Far out ;-). > > Since the knowledge that the block device is on a network resides in > > ENBD, I think the most reasonable thing to do would be to implement a > > backoff in ENBD? Should be relatively simple to catch MD retries in > > ENBD and block for 0 1 5 10 30 60 seconds. > > I can't tell which request is a retry. You are allowed to write twice > to the same place in normal operation! The knowledge is in MD. I don't think you need to either - if ENBD only blocks 10 seconds total, and fail all requests after that period of time has lapsed once, then that could have the same effect. > In contrast, the net device will take 10-30s to generate a timeout for > the read attempt, followed by 0s to error the succeeding write request, > since the local driver of the net device will have taken the device > offline as it can't get a response in 30s. > At that point all io to the device will fail, all hell will break > loose in the md device, Really? > and the net device will be ejected from the array Fair nuff.. > in a flurry of millions of failed requests. Millions? Really? > > In the case where requests can't be delivered over the network (or a > > SATA cable, whatever), it's a clear case of "missing device". > > It's not so clear. Yes it is. If the device is not faulty, but there's a link problem, then the device is just... missing :-). Whether you actually tell MD that it's missing or not, is another story. > 10-30s delays are perfectly visible in ordinary tcp and mean nothing > more than congestion. How many times have you sat there hitting the > keys and waiting for something to move on the screen? 
I get your point, I think. There's no reason to induce the overhead of a MD sync-via-bitmap, if increasing the network timeout in ENBD will prevent the component device from being kicked in the first place. As long as the timeout doesn't cause too much grief for the end user. OTOH, a bitmap sync can happen in the background, so as long as the disk is not _constantly_ being removed/added, it should be fine to kick it real fast from the array. > Really, in my experience, a real good thing to do is mark the device as > temporarily failed, clear all queued requests with error, thus making > memory available, yea, even for tcp sockets, and then let the device > reinsert itself in the MD array when contact is reestablished across the > net. At that point the MD bitmap can catch up the missed requests. > > This is complicated by the MD device's current tendency to issue > retries (one way or the other .. does it? How?). It's interfering > with the simple strategy I just sggested. There was a patch floating around at one time in which MD would ignore a certain amount of errors from a component device. I think. Can't remember the details nor the reasoning for it. Sounded stupid to me at the time, I remember :-). > simpler is what I had in the FR1/5 patches: > > 1) MD advises enbd it's in an array, or not > 2) enbd tells MD to pull it in and out of that array as > it senses the condition of the network connection > > The first required MD to use a special ioctl to each device in an > array. > > The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl > commands, being careful also to kill any requests in flight so that > the remove or add would not be blocked in md or the other block device > layers. (In fact, I think I needed to add HOT_REPAIR as a special extra > command, but don't quote me on that). > > That communications layer would work if it were restored. > So .. are we settling on a solution? I'm just proposing counter-arguments. Talk to the Neil :-). > I like the idea that we can advise MD that we are merely > temporarily out of action. Can we take it from there? (Neil?) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 21:19 ` Molle Bestefich
@ 2006-08-16 22:19   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 22:19 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
> > See above. The problem is generic to fixed bandwidth transmission
> > channels, which, in the abstract, is "everything". As soon as one
> > does retransmits one has a kind of obligation to keep retransmissions
> > down to a fixed maximum percentage of the potential traffic, which
> > is generally accomplished via exponential backoff (a time-wise
> > solution, in other words, deliberately smearing retransmits out along
> > the time axis in order to prevent spikes).
>
> Right, so with the bandwidth to local disks being, say, 150MB/s, an
> appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can
> agree on that pretty fast.. right? ;-).

Whatever .. the multiplying constant can be anything you like, and the
backoff can be statistical in nature, not deterministic. It merely has
to back off rather than pile in retries all at once and immediately.

> > The md layers now can generate retries by at least one mechanism that I
> > know of .. a failed disk _read_ (maybe of existing data or parity data
> > as part of an exterior write attempt) will generate a disk _write_ of
> > the missed data (as reconstituted via redundancy info).
> >
> > I believe failed disk _write_ may also generate a retry,
>
> Can't see any reason why MD would try to fix a failed write, since
> it's not likely to be going to be successful anyway.

Maybe.

> > Such delays may in themselves cause timeouts in md - I don't know. My
> > RFC (maybe "RFD") is aimed at raising a flag saying that something is
> > going on here that needs better control.
>
> I'm still not convinced MD does retries at all..

It certainly attempts a rewrite after a failed read. Neil can say if
anything else is tried. Bitmaps can be used to allow writes to fail
first time and then to be synced up later.

> > What the upper layer, md, ought to do is "back off".
>
> I think it should just kick the disk.

That forces us to put it back in when the net comes back to life,
which is complicated. Life would be less complicated if it were less
prone to being kicked out in the first place.

> > We are discussing _error_ semantics. There is no bad effect at all on
> > normal working!
>
> In the past, I've had MD run a box to a grinding halt more times than
> I like. It always results in one thing: The user pushing the big red
> switch.

I agree that the error path in md probably contains some deadlock. My
observation also. That's why I prefer to react to a net brownout by
taking the lower device offline and erroring outstanding requests,
PROVIDED we can put it back in again sanely. That ain't the case at
the moment, so I'd prefer if MD would not be quite so trigger-happy on
the expulsions, which I _believe_ occurs because the lower level
device errors too many requests all at once.

> That's not acceptable for a RAID solution. It should keep working,
> without blocking all I/O from userspace for 5 minutes just because it
> thinks it's a good idea to hold up all I/O requests to underlying
> disks for 60s each, waiting to retry them.

You miscalculate here ... holding up ONE request for a retry does not
hold up ALL requests. Everything else goes through. And I proposed
that we only back off after trying again immediately.
Heck, that's probably wrong, mathematically - that can double the
bandwidth occupation per timeslice, meaning that we need to reserve
50% bandwidth for errors .. ecch. Nope - one _needs_ some finite
minimal backoff. One jiffy is enough. That moves retries into the next
time slice... umm, and we need to randomly space them out a few more
jiffies too, in a poisson distribution, in order to avoid filling the
next timeslice to capacity with errors.

Yep, I'm convinced .. need exponential statistical backoff. Each retry
needs to be delayed by an amount of time that comes from a poisson
distribution (exponential decay). The average backoff can be a jiffy.

> > The effect on normal working should even be _good_ when errors
> > occur, because now max bandwidth devoted to error retries is
> > limited, leaving more max bandwidth for normal requests.
>
> Assuming you use your RAID component device as a regular device also,

?? Oh .. you are thinking of the channel to the device. I was thinking
of the kernel itself. It has to spend time and memory on this.
Allowing it to concentrate on other io that will work without having
to cope with a sharp spike of errors at the temporarily incapacitated
low level device speeds up _other_ devices.

> and that the underlying device is not able to satisfy the requests as
> fast as you shove them at it. Far out ;-).

See above.

> > > Since the knowledge that the block device is on a network resides in
> > > ENBD, I think the most reasonable thing to do would be to implement a
> > > backoff in ENBD? Should be relatively simple to catch MD retries in
> > > ENBD and block for 0 1 5 10 30 60 seconds.
> >
> > I can't tell which request is a retry. You are allowed to write twice
> > to the same place in normal operation! The knowledge is in MD.
>
> I don't think you need to either - if ENBD only blocks 10 seconds
> total, and fails all requests after that period of time has lapsed
> once, then that could have the same effect.

When the net fails, all writes to the low level device will block for
10s, then fail all at once. Md reacts by tossing the disk out. It
probably does that because it sees failed writes (even if well
intended correction attempts provoked by a failed read). It could
instead wait a while and retry. That would succeed, since the net
would decongest meanwhile. That would make the problem disappear.

The alternative is that the low level device tries to insert itself
back in the array once the net comes back up. For that to happen it
has to know it was in one, has been tossed out, and needs to get back.
All complicated.

> > In contrast, the net device will take 10-30s to generate a timeout for
> > the read attempt, followed by 0s to error the succeeding write request,
> > since the local driver of the net device will have taken the device
> > offline as it can't get a response in 30s.
>
> > At that point all io to the device will fail, all hell will break
> > loose in the md device,
>
> Really?

Well, zillions of requests will have been errored out all at once. At
least the 256-1024 backed up in the device queue.

> > and the net device will be ejected from the array
>
> Fair nuff..
>
> > in a flurry of millions of failed requests.
>
> Millions? Really?

Hundreds.

> > > In the case where requests can't be delivered over the network (or a
> > > SATA cable, whatever), it's a clear case of "missing device".
> >
> > It's not so clear.
>
> Yes it is. If the device is not faulty, but there's a link problem,
> then the device is just... missing :-).
> Whether you actually tell MD
> that it's missing or not, is another story.

We agree that not telling it simply leads to blocking behaviour when
the net is really out forever, which is not acceptable. Telling it
after 30s results in us occasionally having to say "oops, no, I'm
sorry, we're OK again" and try and reinsert ourselves in the array,
which we currently can't do easily. I would prefer we don't tell md
until a good long time has passed, and it do retries with exp backoff
meanwhile. The array performance should not be impacted. There will be
another disk there still working.

> > 10-30s delays are perfectly visible in ordinary tcp and mean nothing
> > more than congestion. How many times have you sat there hitting the
> > keys and waiting for something to move on the screen?
>
> I get your point, I think.
>
> There's no reason to induce the overhead of an MD sync-via-bitmap, if
> increasing the network timeout in ENBD will prevent the component
> device from being kicked in the first place. As long as the timeout

There's no sensible point to set a timeout. Try an ssh session .. you
can reconnect to it an hour after cutting the cable.

> doesn't cause too much grief for the end user.
>
> OTOH, a bitmap sync can happen in the background, so as long as the
> disk is not _constantly_ being removed/added, it should be fine to
> kick it real fast from the array.

But complicated to implement, as things are, since there is no special
comms channel available with the md driver.

> > simpler is what I had in the FR1/5 patches:
> >
> > 1) MD advises enbd it's in an array, or not
> > 2) enbd tells MD to pull it in and out of that array as
> > it senses the condition of the network connection
> >
> > The first required MD to use a special ioctl to each device in an
> > array.
> >
> > The second required enbd to use the MD HOT_ADD and HOT_REMOVE ioctl
> > commands, being careful also to kill any requests in flight so that
> > the remove or add would not be blocked in md or the other block device
> > layers. (In fact, I think I needed to add HOT_REPAIR as a special extra
> > command, but don't quote me on that).
> >
> > That communications layer would work if it were restored.
>
> > So .. are we settling on a solution?
>
> I'm just proposing counter-arguments.
> Talk to the Neil :-).

He readeth the list!

> > I like the idea that we can advise MD that we are merely
> > temporarily out of action. Can we take it from there? (Neil?)

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
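The "exponential statistical backoff" Peter talks himself into above amounts to delaying each retry by an exponentially distributed random interval (the inter-arrival time of a Poisson process), with the mean delay growing with the attempt number. A toy userspace sketch of the sampling; the function names, the jiffy granularity and the doubling constant are illustrative only, not anything in md or enbd.

    #include <math.h>
    #include <stdlib.h>

    /* Mean delay in jiffies for the nth retry: 1, 2, 4, 8, ... capped. */
    static double mean_backoff(unsigned int attempt)
    {
        return ldexp(1.0, attempt > 10 ? 10 : attempt);   /* 2^attempt */
    }

    /* Sample an exponentially distributed delay with that mean, via
     * inverse transform sampling: delay = -mean * ln(u), u in (0,1). */
    static unsigned long backoff_jiffies(unsigned int attempt)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double delay = -mean_backoff(attempt) * log(u);
        return (unsigned long)(delay + 0.5);
    }

Because the delays are random rather than fixed, retries from many stalled requests spread themselves over the following timeslices instead of landing in one spike, which is exactly the property argued for above.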
* Re: remark and RFC
  2006-08-16 14:28 ` Molle Bestefich
  2006-08-16 19:01   ` Peter T. Breuer
@ 2006-08-16 23:43   ` Nix
  1 sibling, 0 replies; 18+ messages in thread
From: Nix @ 2006-08-16 23:43 UTC (permalink / raw)
To: Molle Bestefich; +Cc: ptb, linux raid

On 16 Aug 2006, Molle Bestefich murmured woefully:
> Peter T. Breuer wrote:
>> > The comm channel and "hey, I'm OK" message you propose doesn't seem
>> > that different from just hot-adding the disks from a shell script
>> > using 'mdadm'.
>>
>> [snip speculations on possible blocking calls]
>
> You could always try and see.
> Should be easy to simulate a network outage.

Blocking calls are not the problem. Deadlocks are.

The problem is that forking a userspace process necessarily involves
kernel memory allocations (for the task struct, userspace memory map,
possibly text pages if the necessary pieces of mdadm are not in the
page cache), and if your swap is on the remote RAID array, you can't
necessarily carry out those allocations.

Note that the same deadlock situation is currently triggered by
sending/receiving network packets, which is why swapping over NBD is a
bad idea at present: however, this is being fixed at this moment
because until it's fixed you can't reliably have a machine with all
storage on iSCSI, for instance.

However, the deadlock is only fixable for kernel allocations, because
the amount of storage that it'll need is bounded in several ways: you
can't fix it for userspace allocations. So you can never rely on
userspace working in this situation.

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly
 see the need to define levels of inconceivability.' --- Rik Steenwinkel

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  2006-08-16 10:00 ` Molle Bestefich
@ 2006-08-16 14:59 ` Molle Bestefich
  2006-08-16 16:10   ` Peter T. Breuer
  2006-08-17 1:11 ` Neil Brown
  2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 1 reply; 18+ messages in thread
From: Molle Bestefich @ 2006-08-16 14:59 UTC (permalink / raw)
To: ptb; +Cc: linux raid

Peter T. Breuer wrote:
> I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.

Hmm, I don't think MD even does retries of requests.

It does write-back as a (very successful! Thanks Neil :-D) attempt to
fix bad blocks, but that's a different thing.

Is that what you meant?
Or is there something sandwiched between MD and ENBD that performs retries...
Or am I just wrong :-)..

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 14:59 ` Molle Bestefich
@ 2006-08-16 16:10   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 16:10 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > I would like raid request retries to be done with exponential
> > delays, so that we get a chance to overcome network brownouts.
>
> Hmm, I don't think MD even does retries of requests.

I had a "robust read" patch in FR1, and I thought Neil extended that
to "robust write". In robust read, we try to make up for a failed read
with info from elsewhere, and then we rewrite the inferred data onto
the failed device, in an attempt to fix a possible defect. In robust
write, a failed write is retried.

But robust read is enough to cause a retry (as a write).

> It does write-back as a (very successful! Thanks Neil :-D) attempt to
> fix bad blocks, but that's a different thing.

Apparently not different :).

> Is that what you meant?

Plozzziby.

> Or is there something sandwiched between MD and ENBD that performs retries...

Could be too. Dunno.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  2006-08-16 10:00 ` Molle Bestefich
  2006-08-16 14:59 ` Molle Bestefich
@ 2006-08-17 1:11 ` Neil Brown
  2006-08-17 6:28   ` Peter T. Breuer
  2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2006-08-17 1:11 UTC (permalink / raw)
To: ptb; +Cc: linux raid

On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
>
> So,
>
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.
>
> 2) I would like some channel of communication to be available
> with raid that devices can use to say that they are
> OK and would they please be reinserted in the array.
>
> The latter is the RFC thing (I presume the former will either not
> be objectionable or Neil will say "there's no need since you're wrong
> about the way raid does retries anyway").

There's no need since you're ..... you know the rest :-)
Well, sort of.

When md/raid1 gets a read error it immediately retries the request in
small (page size) chunks to find out exactly where the error is (it
does this even if the original read request is only one page).

When it hits a read error during retry, it reads from another device
(if it can find one that works) and writes what it got out to the
'faulty' drive (or drives).
If this works: great.
If not, the write error causes the drive to be kicked.

I'm not interested in putting any delays in there. It is simply the
wrong place to put them. If network brownouts might be a problem,
then the network driver gets to care about that.

Point 2 should be done in user-space.
 - notice device have been ejected from array
 - discover why. act accordingly.
 - if/when it seems to be working again, add it back into the array.

I don't see any need for this to be done in the kernel.

>
> The way the old FR1/5 code worked was to make available a couple of
> ioctls.
>
> When a device got inserted in an array, the raid code told the device
> via a special ioctl it assumed the device had that it was now in an
> array (this triggers special behaviours, such as deliberately becoming
> more error-prone and less blocky, on the assumption that we have got
> good comms with raid and can manage our own raid state). Ditto
> removal.

A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.
I haven't given it any serious thought yet. I don't even know what
low level devices recognise it or what they do in response.

NeilBrown

^ permalink raw reply [flat|nested] 18+ messages in thread
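For reference, a toy model of the raid1 read-error policy Neil describes: retry the failed range in page-sized chunks, satisfy a chunk that still fails from another mirror, write the recovered data back to the suspect device, and only kick that device if the write-back fails. The struct and callbacks below are stand-ins invented for the illustration; this is not the md source.

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    struct mirror {
        bool faulty;
        int (*read)(struct mirror *m, unsigned long sector,
                    void *buf, size_t len);
        int (*write)(struct mirror *m, unsigned long sector,
                     const void *buf, size_t len);
    };

    /* sector is in 512-byte units, len in bytes.
     * Returns 0 if the region was recovered, -1 if 'bad' had to be kicked. */
    static int fix_read_error(struct mirror *bad, struct mirror *good,
                              unsigned long sector, size_t len, void *buf)
    {
        for (size_t off = 0; off < len; off += PAGE_SIZE) {
            size_t chunk = len - off < PAGE_SIZE ? len - off : PAGE_SIZE;
            char *p = (char *)buf + off;
            unsigned long s = sector + off / 512;

            if (bad->read(bad, s, p, chunk) == 0)
                continue;                    /* this chunk was actually fine */
            if (good->read(good, s, p, chunk) != 0)
                return -1;                   /* no healthy mirror either */
            /* Re-write the reconstructed chunk; a failed write here is
             * what finally gets the device ejected from the array. */
            if (bad->write(bad, s, p, chunk) != 0) {
                bad->faulty = true;
                return -1;
            }
        }
        return 0;
    }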
* Re: remark and RFC
  2006-08-17 1:11 ` Neil Brown
@ 2006-08-17 6:28   ` Peter T. Breuer
  2006-08-19 1:35     ` Gabor Gombas
  2006-08-21 1:21     ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-17 6:28 UTC (permalink / raw)
To: Neil Brown; +Cc: linux raid

HI Neil ..

"Also sprach Neil Brown:"
> On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
> > 1) I would like raid request retries to be done with exponential
> > delays, so that we get a chance to overcome network brownouts.
> >
> > 2) I would like some channel of communication to be available
> > with raid that devices can use to say that they are
> > OK and would they please be reinserted in the array.
> >
> > The latter is the RFC thing (I presume the former will either not
> > be objectionable or Neil will say "there's no need since you're wrong
> > about the way raid does retries anyway").
>
> There's no need since you're ..... you know the rest :-)
> Well, sort of.

OK, let's see ...

> When md/raid1 gets a read error it immediately retries the request in
> small (page size) chunks to find out exactly where the error is (it
> does this even if the original read request is only one page).

OK. I didn't know that. But do you mean a read request to the RAID
device, or a read request to the underlying disk device? The latter
might form part of the implementation of a write request to the RAID
device.

(has the MD blocksize moved up to 4K then? It was at 1KB for years)

> When it hits a read error during retry, it reads from another device
> (if it can find one that works) and writes what it got out to the
> 'faulty' drive (or drives).

OK. That mechanism I was aware of.

> If this works: great.
> If not, the write error causes the drive to be kicked.

Yerrs, that's also what I thought.

> I'm not interested in putting any delays in there. It is simply the
> wrong place to put them. If network brownouts might be a problem,
> then the network driver gets to care about that.

I think you might want to reconsider (not that I know the answer).

1) if the network disk device has decided to shut down wholesale
(temporarily) because of lack of contact over the net, then
retries and writes are _bound_ to fail for a while, so there
is no point in sending them now. You'd really do infinitely
better to wait a while.

2) if the network device just blocks individual requests for say 10s
while waiting for an ack, then times them out, there is more chance
of everything continuing to work since the 10s might be long enough
for the net to recover in, but occasionally a single timeout will
occur and you will boot the device from the array (whereas waiting a
bit longer would have been the right thing to do, if only we had
known). Change 10s to any reasonable length of time.

You think the device has become unreliable because write failed, but
it hasn't ... that's just the net. Try again later! If you like
we can set the req error count to -ETIMEDOUT to signal it. Real
remote write breakage can be signalled with -EIO or something.
Only boot the device on -EIO.

3) if the network device blocks essentially forever, waiting for a
reconnect, experience says that users hate that. I believe the
md array gets stuck somewhere here (from reports), possibly in trying
to read the superblock of the blocked device.
4) what the network device driver wants to do is be able to identify
the difference between primary requests and retries, and delay
retries (or repeat them internally) with some reasonable backoff
scheme to give them more chance of working in the face of a
brownout, but it has no way of doing that. You can make the problem
go away by delaying retries yourself (is there a timedue field in
requests, as well as a timeout field? If so, maybe that can be used
to signal what kind of a request it is and how to treat it).

> Point 2 should be done in user-space.

It's not reliable - we will be under memory pressure at this point, with
all that implies; the raid device might be the very device on which the
file system sits, etc. Pick your poison!

> - notice device have been ejected from array
> - discover why. act accordingly.
> - if/when it seems to be working again, add it back into the array.
>
> I don't see any need for this to be done in the kernel.

Because there might not be any userspace (embedded device) and
userspace might be blocked via subtle or not-so-subtle deadlocks.

There's no harm in making it easy! /proc/mdstat is presently too hard
to parse reliably, I am afraid. Minor differences in presentation
arise in it for reasons I don't understand!

> > The way the old FR1/5 code worked was to make available a couple of
> > ioctls.
> >
> > When a device got inserted in an array, the raid code told the device
> > via a special ioctl it assumed the device had that it was now in an
> > array (this triggers special behaviours, such as deliberately becoming
> > more error-prone and less blocky, on the assumption that we have got
> > good comms with raid and can manage our own raid state). Ditto
> > removal.
>
> A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.

It was a different one, but yes, that would have done. The FR1/5
code needed to be told also in WHICH array it was, so that it
could send ioctls (HOT_REPAIR, or such) to the right md device
later when it felt well again. And it needed to be told when it
was ejected from the array, so as not to do that next time ...

> I haven't given it any serious thought yet. I don't even know what
> low level devices recognise it or what they do in response.

As far as I am concerned, any signal is useful. Any one which
tells me which array I am in is especially useful. And I need
to be told when I leave.

Essentially I want some kernel communication channel here. Ioctls
are fine (there is a subtle kernel deadlock involved in calling an ioctl
on a device above you from within, but I got round that once, and I can
do it again).

Thanks for the replies!

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
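Peter's -ETIMEDOUT/-EIO split in point 2 above boils down to a three-way classification at request completion time. A sketch under that assumption; the md_* names are invented for the illustration and do not exist in md itself.

    #include <errno.h>

    enum md_disposition {
        MD_COMPLETE,        /* request succeeded */
        MD_RETRY_BACKOFF,   /* transient: requeue with (statistical) backoff */
        MD_KICK_DEVICE,     /* persistent: fail the component out of the array */
    };

    static enum md_disposition md_classify_error(int error)
    {
        switch (error) {
        case 0:
            return MD_COMPLETE;
        case -ETIMEDOUT:    /* link brownout: the data may yet get through */
        case -EAGAIN:
            return MD_RETRY_BACKOFF;
        case -EIO:          /* the remote end really failed the request */
        default:
            return MD_KICK_DEVICE;
        }
    }

The point of contention in the thread is precisely whether the RAID layer should honour the middle case or treat every error as the last one.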
* Re: remark and RFC
  2006-08-17 6:28 ` Peter T. Breuer
@ 2006-08-19 1:35   ` Gabor Gombas
  2006-08-19 11:27     ` Peter T. Breuer
  2006-08-21 1:21   ` Neil Brown
  1 sibling, 1 reply; 18+ messages in thread
From: Gabor Gombas @ 2006-08-19 1:35 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Neil Brown, linux raid

On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:

> 1) if the network disk device has decided to shut down wholesale
> (temporarily) because of lack of contact over the net, then
> retries and writes are _bound_ to fail for a while, so there
> is no point in sending them now. You'd really do infinitely
> better to wait a while.

On the other hand, if it's a physical disk that's gone, you _know_ it
will not come back, and stalling your mission-critical application
waiting for a never-occurring event instead of just continuing to use
the other disk does not seem right.

> You think the device has become unreliable because write failed, but
> it hasn't ... that's just the net. Try again later! If you like
> we can set the req error count to -ETIMEDOUT to signal it. Real
> remote write breakage can be signalled with -EIO or something.
> Only boot the device on -EIO.

Depending on the application, if one device is gone for an extended
period of time (and the range of seconds is a looong time), it may be
much more appropriate to just forget about that disk and continue
instead of stalling the system waiting for the device to come back.

IMHO if you want to rely on the network, use equipment that can
provide the required QoS parameters. It may cost a lot - c'est la vie.

Gabor

-- 
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
  2006-08-19 1:35 ` Gabor Gombas
@ 2006-08-19 11:27   ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-19 11:27 UTC (permalink / raw)
To: Gabor Gombas; +Cc: linux raid

"Also sprach Gabor Gombas:"
> On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:
>
> > 1) if the network disk device has decided to shut down wholesale
> > (temporarily) because of lack of contact over the net, then
> > retries and writes are _bound_ to fail for a while, so there
> > is no point in sending them now. You'd really do infinitely
> > better to wait a while.
>
> On the other hand, if it's a physical disk that's gone, you _know_ it
> will not come back,

Possibly. Disks are physical whether over the net or not - you mean a
"nearby" disk, I think. Now, over the net we can distinguish between a
(remote) disk failure and a communications hiatus easily. The problem
appears to be that the software above us (the md layer) is not tuned
to distinguish between the two.

> and stalling your mission-critical application
> waiting for a never-occurring event instead of just continuing to use
> the other disk does not seem right.

Then don't do it. There's no need to, as I pointed out in the
following ...

> > You think the device has become unreliable because write failed, but
> > it hasn't ... that's just the net. Try again later! If you like
> > we can set the req error count to -ETIMEDOUT to signal it. Real
> > remote write breakage can be signalled with -EIO or something.
> > Only boot the device on -EIO.
>
> Depending on the application,

?

> if one device is gone for an extended
> period of time (and the range of seconds is a looong time),

Not over the net it isn't. I just had to wait 5s before these letters
appeared on screen!

> it may be
> much more appropriate to just forget about that disk and continue
> instead of stalling the system waiting for the device to come back.

Why speculate? Let us signal what's happening. We can happily set a
timeout of 2s, say, and signal -EIO if we get an error return within
2s and -ETIMEDOUT if we don't get a response of any sort back within
2s. I ask that you (above) don't sling us out of the array when we
signal -ETIMEDOUT (or -EAGAIN, or whatever). Let us decide what's
going on and we'll signal it - don't second guess us.

> IMHO if you want to rely on the network, use equipment that can provide

Your opinion (and mine) doesn't count - I think swapping over the net
is crazy too, but people do it, notwithstanding my opinion. So
argument about whether they ought to do it or not is null and void.
They do.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
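Seen from the driver's side, the 2s window Peter proposes would look roughly like this: an explicit error reply from the remote end completes the request with -EIO at once, while silence past the window completes it with -ETIMEDOUT so the layer above can treat it as "try again later". The nbd_req structure and helper are invented for the illustration; enbd's real request handling is not shown here.

    #include <errno.h>
    #include <stdbool.h>
    #include <time.h>

    struct nbd_req {
        time_t issued;          /* when the request went out on the wire */
        bool   remote_errored;  /* remote end sent back an error reply */
        bool   acked;           /* remote end acknowledged completion */
    };

    #define NBD_ACK_WINDOW 2    /* seconds, as in the 2s example above */

    static int nbd_req_status(const struct nbd_req *r, time_t now)
    {
        if (r->acked)
            return 0;
        if (r->remote_errored)
            return -EIO;                  /* real failure at the far end */
        if (now - r->issued >= NBD_ACK_WINDOW)
            return -ETIMEDOUT;            /* probably just the net */
        return -EINPROGRESS;              /* still waiting */
    }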
* Re: remark and RFC
  2006-08-17 6:28 ` Peter T. Breuer
  2006-08-19 1:35   ` Gabor Gombas
@ 2006-08-21 1:21   ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Neil Brown @ 2006-08-21 1:21 UTC (permalink / raw)
To: ptb; +Cc: linux raid

On Thursday August 17, ptb@inv.it.uc3m.es wrote:
> HI Neil ..
>
> "Also sprach Neil Brown:"
> > On Wednesday August 16, ptb@inv.it.uc3m.es wrote:
> > > 1) I would like raid request retries to be done with exponential
> > > delays, so that we get a chance to overcome network brownouts.
> > >
> > > 2) I would like some channel of communication to be available
> > > with raid that devices can use to say that they are
> > > OK and would they please be reinserted in the array.
> > >
> > > The latter is the RFC thing (I presume the former will either not
> > > be objectionable or Neil will say "there's no need since you're wrong
> > > about the way raid does retries anyway").
> >
> > There's no need since you're ..... you know the rest :-)
> > Well, sort of.
>
> OK, let's see ...
>
> > When md/raid1 gets a read error it immediately retries the request in
> > small (page size) chunks to find out exactly where the error is (it
> > does this even if the original read request is only one page).
>
> OK. I didn't know that. But do you mean a read request to the RAID
> device, or a read request to the underlying disk device? The latter
> might form part of the implementation of a write request to the RAID
> device.

We retry the read requests to the underlying devices. I was thinking
of raid1 particularly. For raid5 we don't retry the read as all
requests are sent down from raid5 at 4K in size so refining the
location of an error is not an issue.

For raid5 we don't retry the read. We read from all other devices and
then send a write. If that works, good. If it fails we kick the
device.

>
> (has the MD blocksize moved up to 4K then? It was at 1KB for years)
>

A 0.90 superblock has always been 4K.

> > I'm not interested in putting any delays in there. It is simply the
> > wrong place to put them. If network brownouts might be a problem,
> > then the network driver gets to care about that.
>
> I think you might want to reconsider (not that I know the answer).
>
> 1) if the network disk device has decided to shut down wholesale
> (temporarily) because of lack of contact over the net, then
> retries and writes are _bound_ to fail for a while, so there
> is no point in sending them now. You'd really do infinitely
> better to wait a while.

Tell that to the network block device. md has no knowledge of the
device under it. It sends requests. They succeed or they fail. md
acts accordingly.

>
> 2) if the network device just blocks individual requests for say 10s
> while waiting for an ack, then times them out, there is more chance
> of everything continuing to work since the 10s might be long enough
> for the net to recover in, but occasionally a single timeout will
> occur and you will boot the device from the array (whereas waiting a
> bit longer would have been the right thing to do, if only we had
> known). Change 10s to any reasonable length of time.
>
> You think the device has become unreliable because write failed, but
> it hasn't ... that's just the net. Try again later! If you like
> we can set the req error count to -ETIMEDOUT to signal it. Real
> remote write breakage can be signalled with -EIO or something.
> Only boot the device on -EIO.

For read requests, I might be happy to treat -ETIMEDOUT differently.
I get the data from elsewhere and leave the original disk alone. But
for writes, what can I do? If the write fails I have to evict the
drive, otherwise the array becomes inconsistent.

If you want to implement some extra timeout and retry for writes, do
that in user-space utilising the bitmap stuff. If you keep your
monitor app small and have it mlocked, it should continue to work fine
under high memory pressure.

>
> 3) if the network device blocks essentially forever, waiting for a
> reconnect, experience says that users hate that. I believe the
> md array gets stuck somewhere here (from reports), possibly in trying
> to read the superblock of the blocked device.

So what do you expect us to do in this case? You want the app to keep
working even though the network connection to the storage isn't
working? Doesn't make sense to me.

>
> 4) what the network device driver wants to do is be able to identify
> the difference between primary requests and retries, and delay
> retries (or repeat them internally) with some reasonable backoff
> scheme to give them more chance of working in the face of a
> brownout, but it has no way of doing that. You can make the problem
> go away by delaying retries yourself (is there a timedue field in
> requests, as well as a timeout field? If so, maybe that can be used
> to signal what kind of a request it is and how to treat it).
>
>
> > Point 2 should be done in user-space.
>
> It's not reliable - we will be under memory pressure at this point, with
> all that implies; the raid device might be the very device on which the
> file system sits, etc. Pick your poison!

mlockall

>
> > - notice device have been ejected from array
> > - discover why. act accordingly.
> > - if/when it seems to be working again, add it back into the array.
> >
> > I don't see any need for this to be done in the kernel.
>
> Because there might not be any userspace (embedded device) and
> userspace might be blocked via subtle or not-so-subtle deadlocks.

Even an embedded device can have userspace. Fix the deadlocks.

> There's no harm in making it easy! /proc/mdstat is presently too hard
> to parse reliably, I am afraid. Minor differences in presentation
> arise in it for reasons I don't understand!

There is harm in putting code in the kernel to handle a very special
case.

NeilBrown

> > > The way the old FR1/5 code worked was to make available a couple of
> > > ioctls.
> > >
> > > When a device got inserted in an array, the raid code told the device
> > > via a special ioctl it assumed the device had that it was now in an
> > > array (this triggers special behaviours, such as deliberately becoming
> > > more error-prone and less blocky, on the assumption that we have got
> > > good comms with raid and can manage our own raid state). Ditto
> > > removal.
> >
> > A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.
>
> It was a different one, but yes, that would have done. The FR1/5
> code needed to be told also in WHICH array it was, so that it
> could send ioctls (HOT_REPAIR, or such) to the right md device
> later when it felt well again. And it needed to be told when it
> was ejected from the array, so as not to do that next time ...
>
> > I haven't given it any serious thought yet. I don't even know what
> > low level devices recognise it or what they do in response.
>
> As far as I am concerned, any signal is useful. Any one which
> tells me which array I am in is especially useful. And I need
> to be told when I leave.
>
> Essentially I want some kernel communication channel here. Ioctls
> are fine (there is a subtle kernel deadlock involved in calling an ioctl
> on a device above you from within, but I got round that once, and I can
> do it again).
>
> Thanks for the replies!
>
> Peter
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 18+ messages in thread
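Neil's "mlockall" answer is the key to making a userspace monitor survive memory pressure: lock every current and future page of the monitor into RAM at startup and avoid fork() thereafter. A skeleton, with the kicked-device check and the re-add left as stubs; the re-add could use the HOT_ADD_DISK ioctl shown earlier rather than spawning mdadm, since spawning needs fresh allocations.

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Stubs: detect a kicked component (e.g. by scanning sysfs or using
     * the md ioctls) and push it back in.  Left empty in this sketch. */
    static bool check_kicked(const char *md_dev)    { (void)md_dev; return false; }
    static void readd_component(const char *md_dev) { (void)md_dev; }

    int main(void)
    {
        /* Pin the whole process image, heap and stack included, so the
         * monitor never needs to page anything in from the stuck array. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        for (;;) {
            if (check_kicked("/dev/md0"))
                readd_component("/dev/md0");
            sleep(5);
        }
    }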
* Re: remark and RFC
  2006-08-16 9:06 remark and RFC Peter T. Breuer
  ` (2 preceding siblings ...)
  2006-08-17 1:11 ` Neil Brown
@ 2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
  3 siblings, 0 replies; 18+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2006-08-17 14:11 UTC (permalink / raw)
To: linux-raid

Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> 1) I would like raid request retries to be done with exponential
> delays, so that we get a chance to overcome network brownouts.

Hmmm, IMHO this should be implemented in nbd/enbd where it belongs,
and errors should be masked within nbd/enbd then. Since (at least) md
has no read-timeouts or something like that (please correct me if I'm
wrong), this should be no big issue.

Typically, storage media communication channels are loss-free, so
either a read is okay or it fails. A storage medium usually has no
retry-on-timeout semantics, so upper layers are with high probability
not aware of such a thing. This is the same with RAID as well as with
filesystems, so if you run an ext2 or something like that on top of
your enbd you should suffer from the same problems: if a read fails,
the filesystem goes dead, gets remounted read-only or follows whatever
error-strategy you have it configured for.

regards
   Mario

-- 
() Ascii Ribbon Campaign
/\ Support plain text e-mail

^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>]
* Re: remark and RFC
       [not found] <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>
@ 2006-08-16 16:15 ` Peter T. Breuer
  0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-16 16:15 UTC (permalink / raw)
To: Molle Bestefich; +Cc: linux raid

"Also sprach Molle Bestefich:"
[Charset ISO-8859-1 unsupported, filtering to ASCII...]
> Peter T. Breuer wrote:
> > We can't do a HOT_REMOVE while requests are outstanding, as far as I
> > know.
>
> Actually, I'm not quite sure which kind of requests you are talking about.

Only one kind. Kernel requests :). They come in read and write
flavours (let's forget about the third race for the moment).

> Also, there's been relatively recent changes to this code:
> http://marc.theaimsgroup.com/?l=linux-raid&m=108075865413863&w=2

That's 2004 and is about the inclusion of the bitmapping code into
md/raid. We're years beyond there. What I'm talking about presupposes
all those changes in the raid layers, and laments that it's still
missing one more thing that was in the FR1/5 patches .. namely the
ability for the underlying device to communicate with the md layer
about its state of health, taking itself in and out of the array as
appropriate.

Peter

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: remark and RFC
@ 2006-08-18 7:51 Peter T. Breuer
0 siblings, 0 replies; 18+ messages in thread
From: Peter T. Breuer @ 2006-08-18 7:51 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, linux raid
"Also sprach ptb:"
> 4) what the network device driver wants to do is be able to identify
> the difference between primary requests and retries, and delay
> retries (or repeat them internally) with some reasonable backoff
> scheme to give them more chance of working in the face of a
> brownout, but it has no way of doing that. You can make the problem
> go away by delaying retries yourself (is there a timedue field in
> requests, as well as a timeout field? If so, maybe that can be used
> to signal what kind of a request it is and how to treat it).
If one could set the
unsigned long start_time;
field in the outgoing retry request to now + 1 jiffy, that might be
helpful. I can't see a functionally significant use of this field at
present in the kernel ... ll_rw_blk rewrites the field when merging
requests and end_that_request then uses it for the accounting stats
(duration) __disk_stat_add(disk, ticks[rw], duration) which will add a
minus 1 at worst.
Shame there isn't a timedue field in the request struct.
Silly idea, maybe.
Peter
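A toy illustration of the start_time idea above: stamp a resubmitted request with a time one tick in the future so a driver that cares can recognise it as a retry and delay it, at the cost of at most one tick of error in the accounting that normally consumes the field. The cut-down struct and helpers are invented for the illustration; this is not a claim about what the block layer would accept.

    #include <stdbool.h>

    struct toy_request {
        unsigned long start_time;   /* normally: when the request was queued */
    };

    static unsigned long jiffies;   /* stand-in for the kernel tick counter */

    static void mark_as_retry(struct toy_request *rq)
    {
        rq->start_time = jiffies + 1;        /* "due" one tick from now */
    }

    static bool looks_like_retry(const struct toy_request *rq)
    {
        /* A start time in the future can only have been planted on purpose. */
        return (long)(rq->start_time - jiffies) > 0;
    }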
^ permalink raw reply [flat|nested] 18+ messages in thread

end of thread, other threads:[~2006-08-21 1:21 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-08-16 9:06 remark and RFC Peter T. Breuer
2006-08-16 10:00 ` Molle Bestefich
2006-08-16 13:06 ` Peter T. Breuer
2006-08-16 14:28 ` Molle Bestefich
2006-08-16 19:01 ` Peter T. Breuer
2006-08-16 21:19 ` Molle Bestefich
2006-08-16 22:19 ` Peter T. Breuer
2006-08-16 23:43 ` Nix
2006-08-16 14:59 ` Molle Bestefich
2006-08-16 16:10 ` Peter T. Breuer
2006-08-17 1:11 ` Neil Brown
2006-08-17 6:28 ` Peter T. Breuer
2006-08-19 1:35 ` Gabor Gombas
2006-08-19 11:27 ` Peter T. Breuer
2006-08-21 1:21 ` Neil Brown
2006-08-17 14:11 ` Mario 'BitKoenig' Holbe
[not found] <62b0912f0608160746l34d4e5e5r6219cc2e0f4a9040@mail.gmail.com>
2006-08-16 16:15 ` Peter T. Breuer
-- strict thread matches above, loose matches on Subject: below --
2006-08-18 7:51 Peter T. Breuer