From: Ricardo M. Correia
Date: Wed, 16 Apr 2008 17:40:36 +0100
Subject: [Lustre-devel] Failover & Force export for the DMU
Message-ID: <1208364036.15849.133.camel@localhost>
To: lustre-devel@lists.lustre.org

On Wed, 2008-04-16 at 08:37 -0700, Peter Braam wrote:
> However, the approach is flawed. It is (theoretically, but not so
> likely) possible for the server to write something, believe it has
> been done, and read it back getting the wrong data (because it wasn't
> written), and still panic.

With the DMU there is a similar problem, but its behavior is more sane
and much more interesting.

If a write is discarded without the DMU's knowledge, then when the data
is read back the checksum will necessarily fail. This follows from the
ZFS design of storing each block's checksum in its block pointer (in the
parent block, which itself has its checksum in its parent block, and so
on up to the uberblock).

So if the checksum fails, two things can happen:

- If the read is a normal read, the caller will get an ECKSUM error,
  which propagates back to the DMU's consumer (this is what is used for
  all data reads).
- If the read is a special "must succeed" read, then the behavior will
  depend on the "failmode" property of the pool (explained below).

A "must succeed" read, as the name indicates, is a critical read which
always succeeds (the caller is blocked until it does), used in
situations where failure would lead to data loss. It is only used for
some metadata reads.

> So I would like to suggest that for the DMU we do this differently and
> rely on a normal read only device. So, the server, during recovery,
> will be using standard read only devices (and similar under the DMU).
> If the file system or DMU returns errors because writes cannot be
> performed for requests that are in progress during the failover event,
> then these errors should be handled gracefully (without panics). Note
> that the errors will never reach the client, not over the network and
> not through reply reconstruction, because failover was initiated
> before they happened.

I agree, but I'm not so sure we should continue to send read requests to
the storage devices when we are failing over.

One of the reasons the failover could be happening is a failure
somewhere in the server -> storage path. If that is the case, we may
experience delays of 30 or 60 seconds for the I/Os to time out,
especially if we're doing synchronous I/O in the ZIO threads like we are
doing now.

So I think returning EIO for reads on the backend storage might be more
appropriate during a failover.

> Ricardo - for the DMU all you need to do is make sure you can quickly
> turn a device read only below the DMU and the DMU can handle that (it's
> like doing "mount -o remount, ro").

Well, it's a bit more complicated than that.

If there is a fatal failure to write to the backend devices, the error
will be returned to the ZIO pipeline and the DMU's behavior will again
depend on the "failmode" property of the pool, which can have 3
different values:

- wait mode: I/O is blocked until the administrator corrects the problem
  manually. This is useful for regular ZFS pools, because the
  administrator has a chance to replace the device that is experiencing
  I/O failures and therefore prevent any data loss.
- continue mode: (quoting) "Returns EIO to any new write I/O requests"
  (in the transaction phase) ".. but allows reads to any of the
  remaining healthy devices. Any write requests that have yet to be
  committed to disk would be blocked."
- panic mode: in userspace, we do an abort().
  This would be a good solution for Lustre if we didn't have multiple
  ZFS pools in the same userspace server, but it's not useful at all in
  that case, since abort() takes down the whole process and every pool
  in it.

The big problem here is that neither the "wait" mode nor the "continue"
mode allows a pool with dirty data to be exported if the backend devices
are returning errors in the pwrite() calls (be it EROFS, EIO, or any
other), due to ZFS's insistence on preserving data integrity (which I
think is very well designed).

I have thought a lot about this, and my conclusion is that when
force-exporting a pool we should make the DMU discard all writes to the
backend storage, make reads (even "must succeed" reads) return EIO, and
then go through the normal DMU export process.

I believe this is the only sane way of successfully getting rid of dirty
data in the DMU without any loss of transactional integrity or weird
failures, but it will also require changing the DMU to gracefully handle
failures in "must succeed" reads, which will not be easy.

The consequence for Lustre is that the OSS/MDS servers *must* be able to
handle errors gracefully, because the DMU could return a lot of EIOs
during failover.

Cheers,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
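
P.S. A toy model of the two ideas above, written in Python purely for
illustration. It is not DMU or Lustre code: ToyDevice, ChecksumError,
write_block, read_block and begin_force_export are all hypothetical
names, and SHA-256 simply stands in for whatever checksum a pool uses.
It only sketches (a) how a checksum kept in the parent's block pointer
exposes a silently discarded write on read-back, and (b) the proposed
force-export behavior of discarding writes and failing reads with EIO.

```python
import errno
import hashlib


class ChecksumError(Exception):
    """Stand-in for the ECKSUM error described above (hypothetical name)."""


class ToyDevice:
    """Hypothetical backend device; not a real ZFS or Lustre interface."""

    def __init__(self):
        self.blocks = {}
        self.discard_writes = False  # silently drop writes
        self.fail_reads = False      # make every read fail with EIO

    def write(self, addr, data):
        # The caller is told nothing either way, as with a lost write.
        if not self.discard_writes:
            self.blocks[addr] = data

    def read(self, addr):
        if self.fail_reads:
            raise OSError(errno.EIO, "read refused during force export")
        return self.blocks.get(addr, b"")


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def write_block(dev, addr, data):
    """Write data and return a block pointer (address plus checksum).

    The pointer is held by the parent, not stored next to the data.
    """
    dev.write(addr, data)
    return {"addr": addr, "cksum": checksum(data)}


def read_block(dev, bp):
    """Read through a block pointer and verify the data against it."""
    data = dev.read(bp["addr"])
    if checksum(data) != bp["cksum"]:
        raise ChecksumError("block %d does not match its pointer" % bp["addr"])
    return data


def begin_force_export(dev):
    """Model of the proposal: discard all writes, fail all reads with EIO."""
    dev.discard_writes = True
    dev.fail_reads = True
```

In this model, writing a block, setting discard_writes, and then
rewriting the same address yields a pointer whose checksum no longer
matches what the device returns, so read_block raises ChecksumError.
After begin_force_export, every read_block call fails with an OSError
carrying EIO, which is the kind of error the OSS/MDS code would have to
tolerate during failover.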