* [Lustre-devel] Failover & Force export for the DMU
@ 2008-04-16 15:37 Peter Braam
2008-04-16 16:40 ` Ricardo M. Correia
0 siblings, 1 reply; 6+ messages in thread
From: Peter Braam @ 2008-04-16 15:37 UTC (permalink / raw)
To: lustre-devel
"Force export" for the DMU serves a similar purpose to a feature we added
for block devices in Linux in relation to exports. When failover is
initiated, the OSS/MDS servers stop sending replies, and requests that are
still being processed interact with the block devices in a model where the
devices discard write commands WITHOUT returning errors. This is different
from merely declaring the device READONLY, in which case errors are returned.
The latter is a default feature in the Linux kernel; what we did is a patch
(but it could be a mapper module).
The thinking behind this approach was (many years ago) that this avoids
exposing the server layers to errors (caused by writes to read only devices)
from the block devices which might cause the server to panic, thereby taking
out other targets inadvertently.
However, the approach is flawed. It is (theoretically, but not so likely)
possible for the server to write something, believe it has been done, and
read it back getting the wrong data (because it wasn't written), and still
panic.
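The difference between the two device behaviours, and the flaw described above, can be sketched in Python. This is only an illustration of the semantics: the class names `DiscardingDevice` and `ReadOnlyDevice` and the helper `write_then_verify` are hypothetical and do not correspond to the actual Linux patch.

```python
import errno


class Device:
    """In-memory stand-in for a block device: block number -> bytes."""

    def __init__(self):
        self.blocks = {}
        self.failing_over = False

    def read(self, blkno):
        return self.blocks.get(blkno, b"")

    def write(self, blkno, data):
        self.blocks[blkno] = data


class DiscardingDevice(Device):
    """Models the patched behaviour: writes are dropped, success is reported."""

    def write(self, blkno, data):
        if self.failing_over:
            return  # silently discarded, the caller sees no error
        super().write(blkno, data)


class ReadOnlyDevice(Device):
    """Models a normal read-only device: writes fail with EROFS."""

    def write(self, blkno, data):
        if self.failing_over:
            raise OSError(errno.EROFS, "device is read-only")
        super().write(blkno, data)


def write_then_verify(dev, blkno, data):
    """Return 'ok', 'stale-read' or 'write-error' for one write/read-back."""
    try:
        dev.write(blkno, data)
    except OSError:
        return "write-error"  # the server learns about the failure at once
    return "ok" if dev.read(blkno) == data else "stale-read"
```

With the discarding device the caller only notices the problem on read-back, as stale data; with the read-only device the failure is visible at write time, where it can be handled.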
So I would like to suggest that for the DMU we do this differently and rely
on a normal read only device. So, the server, during recovery, will be
using standard read only devices (and similar under the DMU). If the file
system or DMU returns errors because writes cannot be performed for requests
that are in progress during the failover event, then these errors should be
handled gracefully (without panics). Note that the errors will never reach
the client, not over the network and not through reply reconstruction,
because failover was initiated before they happened.
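A minimal sketch of the graceful handling asked for here, assuming a hypothetical `Server` class and a hypothetical `read_only_backend` that rejects writes. It is not Lustre code; it only shows that once failover has started, a backend error is recorded and no reply (and therefore no error) reaches the client.

```python
import errno


def read_only_backend(data):
    """Hypothetical backend that rejects every write, like a read-only device."""
    raise OSError(errno.EROFS, "read-only device")


class Server:
    def __init__(self, backend_write):
        self.backend_write = backend_write
        self.failing_over = False
        self.replies = []     # (request id, status) pairs sent to clients
        self.suppressed = []  # errors absorbed while failing over

    def handle_write(self, req_id, data):
        try:
            self.backend_write(data)
            status = 0
        except OSError as exc:
            status = -exc.errno
            if self.failing_over:
                # Record the error and carry on: no panic, no reply.
                self.suppressed.append((req_id, exc.errno))
        if not self.failing_over:
            # Replies are only sent while the server is not failing over.
            self.replies.append((req_id, status))
```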
The hacked feature retains value because it can generate an artificially
large amount of rollback data, which is useful for testing the replay
recovery mechanisms in Lustre. However, with DMU snapshots this can easily
be simulated in a different manner.
Nikita, Alex: I think the key issue here is that the error handling in the
new servers that you have written needs to be resilient enough to handle
this. Can you think about it?
Ricardo: for the DMU all you need to do is make sure you can quickly turn a
device read only below the DMU and the DMU can handle that (it's like doing
"mount -o remount,ro").
Regards
Peter
* [Lustre-devel] Failover & Force export for the DMU
2008-04-16 15:37 [Lustre-devel] Failover & Force export for the DMU Peter Braam
@ 2008-04-16 16:40 ` Ricardo M. Correia
2008-04-17 0:18 ` Peter Braam
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo M. Correia @ 2008-04-16 16:40 UTC (permalink / raw)
To: lustre-devel
On Wed, 2008-04-16 at 08:37 -0700, Peter Braam wrote:
> However, the approach is flawed. It is (theoretically, but not so
> likely) possible for the server to write something, believe it has
> been done, and read it back getting the wrong data (because it wasn't
> written), and still panic.
With the DMU there is a similar problem, but its behavior is more sane
and much more interesting. If a write is discarded without the DMU's
knowledge, when the data is read back the checksum will necessarily fail
due to the ZFS design's cleverness of storing the checksum on the block
pointer (in the parent block, which itself has its checksum on its
parent block, and so on up until the uberblock).
So if the checksum fails 2 things can happen:
- If a read is a normal read, the caller will get an ECKSUM error,
propagating the error back to the DMU's consumer (this is what is used
for all data reads).
- If a read is a special "must succeed" read, then the behavior will
depend on the "failmode" property of the pool (explained below).
A "must succeed" read, like the name indicates, is a critical read which
always succeeds (caller is blocked until it does), used in situations
where failure would lead to data loss. It is only used for some metadata
reads.
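To illustrate why a silently discarded write is detected on read-back, here is a small Python sketch of a checksum kept in the parent's block pointer. The names (`BlockPointer`, `write_block`, `read_block`, `ChecksumError`) are hypothetical, and SHA-256 is used only for convenience; this is not the DMU's actual code or checksum choice.

```python
import hashlib


class ChecksumError(Exception):
    """Stands in for the ECKSUM error mentioned above."""


class BlockPointer:
    """Parent-side reference: the child's address plus the child's checksum."""

    def __init__(self, addr, checksum):
        self.addr = addr
        self.checksum = checksum


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def write_block(disk, addr, data, discard=False):
    """Write data and return the pointer the parent would store."""
    if not discard:
        disk[addr] = data
    # The pointer is built from the intended data, whether or not the
    # device really stored it.
    return BlockPointer(addr, checksum(data))


def read_block(disk, bp):
    """Verify the block against the checksum held in its parent pointer."""
    data = disk.get(bp.addr, b"")
    if checksum(data) != bp.checksum:
        raise ChecksumError(f"block {bp.addr} failed verification")
    return data
```

A normal read would surface `ChecksumError` to the caller; how a "must succeed" read reacts is the part that depends on the pool's failmode.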
> So I would like to suggest that for the DMU we do this differently and
> rely on a normal read only device. So, the server, during recovery,
> will be using standard read only devices (and similar under the DMU).
> If the file system or DMU returns errors because writes cannot be
> performed for requests that are in progress during the failover event,
> then these errors should be handled gracefully (without panics). Note
> that the errors will never reach the client, not over the network and
> not through reply reconstruction, because failover was initiated
> before they happened.
I agree, but I'm not so sure we should still continue to send read
requests to the storage devices when we are failing over. One of the
reasons the failover could be happening is due to a failure somewhere in
the server -> storage path, and if this is happening we may experience
delays of 30 or 60 seconds for the IOs to timeout, especially if we're
doing synchronous I/O in the ZIO threads like we are doing now.
So I think returning EIO for reads on the backend storage might be more
appropriate during a failover.
> Ricardo: for the DMU all you need to do is make sure you can quickly
> turn a device read only below the DMU and the DMU can handle that (it's
> like doing "mount -o remount,ro").
Well, it's a bit more complicated than that..
If there is a fatal failure to write to the backend devices, the error
will be returned to the ZIO pipeline and the DMU's behavior will again
depend on the "failmode" property of the pool, which can have 3
different values:
- wait mode: I/O is blocked until the administrator corrects the problem
manually. This is useful for regular ZFS pools, because the
administrator has a chance to replace the device that is experiencing IO
failures and therefore prevent any data loss.
- continue mode: (quoting) "Returns EIO to any new write I/O
requests" (in the transaction phase) ".. but allows reads to any of the
remaining healthy devices. Any write requests that have yet to be
committed to disk would be blocked."
- panic mode: in userspace, we do an abort(). This would be a good
solution for Lustre if we didn't have multiple ZFS pools in the same
userspace server, but it's not useful at all in that case.
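The three modes can be summarised as a lookup table. The Python below is a simplified model of the descriptions above, not ZFS code; the outcome strings and the operation names (`read`, `new_write`, `pending_write`) are labels invented for this sketch.

```python
# Simplified outcome table; the strings are labels for this sketch only.
FAILMODE_OUTCOMES = {
    "wait": {"read": "block", "new_write": "block", "pending_write": "block"},
    "continue": {"read": "allow", "new_write": "EIO", "pending_write": "block"},
    "panic": {"read": "abort", "new_write": "abort", "pending_write": "abort"},
}


def failmode_outcome(failmode, op):
    """Return what happens to an operation after a fatal write failure.

    op is one of 'read', 'new_write' (not yet in a transaction) or
    'pending_write' (already accepted but not yet committed to disk).
    """
    try:
        return FAILMODE_OUTCOMES[failmode][op]
    except KeyError:
        raise ValueError(f"unknown failmode/op: {failmode!r}/{op!r}") from None


def pending_writes_can_drain(failmode):
    """True only if uncommitted writes can complete under this mode."""
    return failmode_outcome(failmode, "pending_write") == "allow"
```

In this model no mode lets pending writes drain, which is why dirty data becomes the obstacle once the backend starts failing.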
The big problem here is that neither the "wait" mode nor the "continue"
mode allow a pool with dirty data to be exported if the backend devices
are returning errors in the pwrite() calls (be it EROFS, EIO, or any
other), due to ZFS's insistence on preserving data integrity (which I
think is very well designed).
I have thought a lot about this, and my conclusion is that when
force-exporting a pool we should make the DMU discard all writes to the
backend storage, make reads (even "must succeed" reads) return EIO, and
then go through the normal DMU export process. I believe this is the
only sane way of successfully getting rid of dirty data in the DMU
without any loss of transactional integrity or weird failures, but it
will also require changing the DMU to gracefully handle failures in
"must succeed" reads, which will not be easy..
The consequence for Lustre is that the OSS/MDS servers *must* be able to
handle errors gracefully because the DMU could return a lot of EIOs
during failover.
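The proposed force-export behaviour can be sketched as follows. This is a toy model under stated assumptions: the `Pool` class, its fields and the `force_export` method are hypothetical and only mirror the steps described above (discard writes, fail reads with EIO, then run the normal export path).

```python
import errno


class Pool:
    def __init__(self):
        self.dirty = [b"txg-1", b"txg-2"]  # data not yet on stable storage
        self.backend_failing = False
        self.force_exporting = False
        self.exported = False

    def _backend_write(self, data):
        if self.force_exporting:
            return  # proposed behaviour: drop the write, report success
        if self.backend_failing:
            raise OSError(errno.EIO, "backend write failed")

    def read(self):
        if self.force_exporting:
            raise OSError(errno.EIO, "pool is being force-exported")
        return b"data"

    def export(self):
        """Normal export path: flush all dirty data, then mark exported."""
        while self.dirty:
            self._backend_write(self.dirty[0])
            self.dirty.pop(0)
        self.exported = True

    def force_export(self):
        """Discard writes, fail reads, then reuse the normal export path."""
        self.force_exporting = True
        self.export()
```

With a failing backend, `export()` raises and the dirty data stays queued, while `force_export()` completes; any read issued during the force export gets EIO, which is the error the OSS/MDS must tolerate.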
Cheers,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
* [Lustre-devel] Failover & Force export for the DMU
2008-04-16 16:40 ` Ricardo M. Correia
@ 2008-04-17 0:18 ` Peter Braam
2008-04-17 16:10 ` Ricardo M. Correia
0 siblings, 1 reply; 6+ messages in thread
From: Peter Braam @ 2008-04-17 0:18 UTC (permalink / raw)
To: lustre-devel
On 4/16/08 9:40 AM, "Ricardo M. Correia" <Ricardo.M.Correia@Sun.COM> wrote:
... SNIP
> I agree, but I'm not so sure we should still continue to send read requests to
> the storage devices when we are failing over. One of the reasons the failover
> could be happening is due to a failure somewhere in the server -> storage
> path, and if this is happening we may experience delays of 30 or 60 seconds
> for the IOs to timeout, especially if we're doing synchronous I/O in the ZIO
> threads like we are doing now.
>
> So I think returning EIO for reads on the backend storage might be more
> appropriate during a failover.
>
I think that is fine. Again, the key issue is not to kill the server while
it gets these errors. It may well be that the server needs a special "I'm
recovering, be gentle with errors" mode to avoid reasonable panics.
>
>> Ricardo: for the DMU all you need to do is make sure you can quickly turn a
>> device read only below the DMU and the DMU can handle that (it's like doing
>> "mount -o remount,ro").
>
> Well, it's a bit more complicated than that..
> If there is a fatal failure to write to the backend devices, the error will be
> returned to the ZIO pipeline and the DMU's behavior will again depend on the
> "failmode" property of the pool, which can have 3 different values:
>
> - wait mode: I/O is blocked until the administrator corrects the problem
> manually. This is useful for regular ZFS pools, because the administrator has
> a chance to replace the device that is experiencing IO failures and therefore
> prevent any data loss.
>
> - continue mode: (quoting) "Returns EIO to any new write I/O requests" (in the
> transaction phase) ".. but allows reads to any of the remaining healthy
> devices. Any write requests that have yet to be committed to disk would be
> blocked."
>
> - panic mode: in userspace, we do an abort(). This would be a good solution
> for Lustre if we didn't have multiple ZFS pools in the same userspace server,
> but it's not useful at all in that case.
>
Well yes, the problem is that controlled failovers are required, for example
when you fail back.
>
>
> The big problem here is that neither the "wait" mode nor the "continue" mode
> allow a pool with dirty data to be exported if the backend devices are
> returning errors in the pwrite() calls (be it EROFS, EIO, or any other), due
> to ZFS's insistence on preserving data integrity (which I think is very well
> designed).
>
Please explain why we want to export such a pool and on which node we want
to export it; in fact, what is "export" (it should be similar to unmount)?
If things are failing, then, on the node that is failing, we don't need this
pool anymore; we need to shut things down, in most cases for a reboot. We
need the pool on the failover node.
In fact there is a very useful distinction to make. There are two failover
scenarios:
1. fail over to move services away from failures on the OSS. In this case a
reboot/panic is not really harmful.
2. fail over from a fully functioning OSS/DMU to redistribute services. In
this case we need a control mechanism to turn the device read-only and clean
up the DMU.
Unfortunately we cannot consider mandating that there is only one file
system per OSS because then we need an idle node to act as the failover
node. We must handle the problem of shutting "one or more" down, but only
in the clean case (2).
>
> I have thought a lot about this, and my conclusion is that when
> force-exporting a pool we should make the DMU discard all writes to the
> backend storage, make reads (even "must succeed" reads) return EIO, and then
> go through the normal DMU export process. I believe this is the only sane way
> of successfully getting rid of dirty data in the DMU without any loss of
> transactional integrity or weird failures, but it will also require changing
> the DMU to gracefully handle failures in "must succeed" reads, which will not
> be easy..
>
Sun already has products (a CIFS server) that can failover on ZFS. It might
be interesting to ask them if they can handle failing over one ZFS file
system while keeping others, because this is essentially the same problem as
we have from a DMU perspective.
Peter
>
> The consequence for Lustre is that the OSS/MDS servers *must* be able to
> handle errors gracefully because the DMU could return a lot of EIOs during
> failover.
>
> Cheers,
> Ricardo
* [Lustre-devel] Failover & Force export for the DMU
2008-04-17 0:18 ` Peter Braam
@ 2008-04-17 16:10 ` Ricardo M. Correia
2008-04-17 17:53 ` Peter Braam
2008-04-17 17:56 ` Peter Braam
0 siblings, 2 replies; 6+ messages in thread
From: Ricardo M. Correia @ 2008-04-17 16:10 UTC (permalink / raw)
To: lustre-devel
Hi Peter,
Please see my comments.
On Wed, 2008-04-16 at 17:18 -0700, Peter Braam wrote:
> I think that is fine. Again, the key issue is not to kill the server
> while it gets these errors. It may well be that the server needs a
> special "I'm recovering, be gentle with errors" mode to avoid
> reasonable panics.
I would say any error returned by the filesystem even in normal
operation should be handled gently :)
> Please explain why we want to export such a pool and on which node we
> want to export it; in fact, what is "export" (it should be similar to
> unmount)? If things are failing, then, on the node that is failing,
> we don't need this pool anymore; we need to shut things down, in most
> cases for a reboot. We need the pool on the failover node.
The DMU has the notion of importing and exporting a pool, which is
different from mounting/unmounting a filesystem inside the pool.
Basically, an import consists of scanning and reading the labels of all
the devices of a pool to find out the pool configuration.
After this process, the pool transitions to the imported state, which
means that the DMU knows about the pool (has the pool configuration
cached) and the user can perform any operation he desires on the pool.
Usually after an import ZFS also mounts the filesystems inside the pool
automatically, but this is not relevant here.
In ZFS, an export consists of unmounting any filesystem belonging to the
pool, flushing dirty data, marking the pool as exported on-disk and then
removing the pool configuration from the cache.
In Lustre/ZFS, strictly speaking there are no filesystems mounted so we
don't do that, but of course the export would fail if Lustre has an open
objset, so we need to close them first.
After this, the user can only operate/manipulate the pool if he
re-imports it.
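The import/export lifecycle described above can be modelled as a small state machine. This Python sketch is illustrative only; the `Pool` class, its fields and the label format are hypothetical and far simpler than the real DMU.

```python
class Pool:
    """Toy model of the pool lifecycle described above."""

    def __init__(self, device_labels):
        self.device_labels = device_labels  # what is stored on each device
        self.config_cache = None            # set while the pool is imported
        self.open_objsets = set()
        self.dirty = []
        self.exported_on_disk = True

    def import_pool(self):
        # Scan every device label to rebuild the pool configuration.
        self.config_cache = sorted(label["device"] for label in self.device_labels)
        self.exported_on_disk = False

    def open_objset(self, name):
        if self.config_cache is None:
            raise RuntimeError("pool is not imported")
        self.open_objsets.add(name)

    def close_objset(self, name):
        self.open_objsets.discard(name)

    def export_pool(self):
        if self.open_objsets:
            raise RuntimeError("pool busy: objsets still open")
        self.dirty.clear()            # flush dirty data
        self.exported_on_disk = True  # mark the pool as exported on disk
        self.config_cache = None      # forget the cached configuration
```

The export refuses to proceed while an objset is open, which matches the point that Lustre has to close its objsets before the pool can be exported.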
So basically, what we need to do when things are failing (in the node
that is failing) is to close the filesystems and export the pool. The
big problem is that the DMU cannot export a pool if the devices are
experiencing fatal write failures, which is why we need a force-export
mechanism.
After that, we need to import the pool on the failover node and mount
all the MDTs/OSTs that were stored there, do recovery, etc (I'm sure you
understand this process much better than I do :)
> In fact there is a very useful distinction to make. There are two
> failover scenarios:
> 1. fail over to move services away from failures on the OSS. In
> this case a reboot/panic is not really harmful.
That's why when I heard about the need for this feature, I immediately
proposed doing a panic, which wouldn't have any consequences assuming
Lustre recovery does its job. But it's not useful in a "multiple pools
in the same server" scenario.
> 2. fail over from a fully functioning OSS/DMU to redistribute
> services. In this case we need a control mechanism to turn
> the device read-only and clean up the DMU.
Why do we need to turn the device read-only in this case? Why can't we
do a clean unmount/export if the devices are fully functioning?
Andreas has told me before that with ldiskfs, doing a clean unmount
could take a lot of time if there's a lot of dirty data, but I don't
believe this will be true with the DMU.
Even if such a problem were to arise, in the DMU it's trivial to limit
the transaction group size and therefore limit the time it takes to sync
a txg.
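As a rough back-of-the-envelope illustration of that last point, assuming a steady backend write rate: the time to sync a txg is about its size divided by the write rate, so capping the size caps the sync time. The function names and all numbers below are illustrative assumptions, not DMU tunables or measurements.

```python
def txg_sync_seconds(txg_bytes, write_bytes_per_second):
    """Estimated time to sync one transaction group at a steady write rate."""
    if write_bytes_per_second <= 0:
        raise ValueError("write rate must be positive")
    return txg_bytes / write_bytes_per_second


def max_txg_bytes(target_seconds, write_bytes_per_second):
    """Largest txg size that still syncs within the target time."""
    return int(target_seconds * write_bytes_per_second)


MIB = 1024 * 1024
# Illustrative numbers only, not measurements from any real pool.
print(txg_sync_seconds(512 * MIB, 256 * MIB))  # 2.0
print(max_txg_bytes(5, 100 * MIB) // MIB)      # 500
```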
> Unfortunately we cannot consider mandating that there is only one file
> system per OSS because then we need an idle node to act as the
> failover node. We must handle the problem of shutting "one or more"
> down, but only in the clean case (2).
In the clean case, we don't need force-export.
Force-export is only really needed if all of the following conditions
are true:
1) We have more than 1 filesystem (MDT/OST) running in the same
userspace process (note how I didn't say "same server". Also note that
for Lustre 2.0, we will have a limitation of 1 userspace process per
server).
2) The MDTs/OSTs are stored in more than 1 ZFS pool (note how I didn't
say "more than 1 device". A single ZFS pool can use multiple disk
devices.).
3) One or more, but not all of the ZFS pools are suffering from fatal IO
failures.
4) We only want to failover the MDTs/OSTs stored on the pools that are
suffering IO failures, but we still want to keep the remaining MDTs/OSTs
working in the same server.
If there is a requirement of supporting a scenario where all of these
conditions are true, then we need force-export. From my latest
discussion with Andreas about this, we do need that.
If not all of the conditions are true, we could either do a clean export
or do a panic, depending on the situation.
At least, that is my understanding :)
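The four conditions can be written as a single predicate. The Python below is a sketch of that reasoning; the function and parameter names are hypothetical, and the split between "clean-export" and "panic" in the fallback is one reading of "depending on the situation", not something stated in this thread.

```python
def needs_force_export(targets_in_process, pool_count, failing_pool_count,
                       keep_healthy_targets):
    """True only when all four conditions listed above hold."""
    return bool(
        targets_in_process > 1                    # condition 1
        and pool_count > 1                        # condition 2
        and 0 < failing_pool_count < pool_count   # condition 3
        and keep_healthy_targets                  # condition 4
    )


def failover_action(targets_in_process, pool_count, failing_pool_count,
                    keep_healthy_targets):
    """Pick an action; the non-force branches are an assumed interpretation."""
    if needs_force_export(targets_in_process, pool_count,
                          failing_pool_count, keep_healthy_targets):
        return "force-export"
    if failing_pool_count == 0:
        return "clean-export"  # devices are healthy, flush and export
    return "panic"             # rely on Lustre recovery after the restart
```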
Thanks,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
* [Lustre-devel] Failover & Force export for the DMU
2008-04-17 16:10 ` Ricardo M. Correia
@ 2008-04-17 17:53 ` Peter Braam
2008-04-17 17:56 ` Peter Braam
1 sibling, 0 replies; 6+ messages in thread
From: Peter Braam @ 2008-04-17 17:53 UTC (permalink / raw)
To: lustre-devel
On 4/17/08 9:10 AM, "Ricardo M. Correia" <Ricardo.M.Correia@Sun.COM> wrote:
>
>> In fact there is a very useful distinction to make. There are two failover
>> scenarios:
>> 1. fail over to move services away from failures on the OSS. In this case a
>> reboot/panic is not really harmful.
>
> That's why when I heard about the need for this feature, I immediately
> proposed doing a panic, which wouldn't have any consequences assuming Lustre
> recovery does its job. But it's not useful in a "multiple pools in the same
> server" scenario.
>
I don't think this is valid reasoning. If one pool is hosed, it is just as
well to reboot the node. At best what you are proposing is a "nice to have
refinement" but not necessary for proper management of Lustre clusters.
Following my proposal seems to eliminate the requirement for very
complicated work.
>
>>
>> 2. fail over from a fully functioning OSS/DMU to redistribute services. In
>> this case we need a control mechanism to turn the device read-only and clean
>> up the DMU.
>
> Why do we need to turn the device read-only in this case? Why can't we do a
> clean unmount/export if the devices are fully functioning?
> Andreas has told me before that with ldiskfs, doing a clean unmount could take
> a lot of time if there's a lot of dirty data, but I don't believe this will be
> true with the DMU.
> Even if such a problem were to arise, in the DMU it's trivial to limit the
> transaction group size and therefore limit the time it takes to sync a txg.
>
>> Unfortunately we cannot consider mandating that there is only one file
>> system per OSS because then we need an idle node to act as the failover node.
>> We must handle the problem of shutting "one or more" down, but only in the
>> clean case (2).
>
> In the clean case, we don't need force-export.
>
> Force-export is only really needed if all of the following conditions are
> true:
>
> 1) We have more than 1 filesystem (MDT/OST) running in the same userspace
> process (note how I didn't say "same server". Also note that for Lustre 2.0,
> we will have a limitation of 1 userspace process per server).
>
> 2) The MDTs/OSTs are stored in more than 1 ZFS pool (note how I didn't say
> "more than 1 device". A single ZFS pool can use multiple disk devices.).
>
> 3) One or more, but not all of the ZFS pools are suffering from fatal IO
> failures.
>
> 4) We only want to failover the MDTs/OSTs stored on the pools that are
> suffering IO failures, but we still want to keep the remaining MDTs/OSTs
> working in the same server.
>
Yes. But this is not a requirement, because for example 4) is not necessary
for customer happiness.
>
> If there is a requirement of supporting a scenario where all of these
> conditions are true, then we need force-export. From my latest discussion with
> Andreas about this, we do need that.
>
No we do not. Andreas, please get in touch with me. I think this is a
"nice to have" but not important enough.
-Peter -
>
> If not all of the conditions are true, we could either do a clean export or do
> a panic, depending on the situation.
>
> At least, that is my understanding :)
>
> Thanks,
> Ricardo
>
* [Lustre-devel] Failover & Force export for the DMU
2008-04-17 16:10 ` Ricardo M. Correia
2008-04-17 17:53 ` Peter Braam
@ 2008-04-17 17:56 ` Peter Braam
1 sibling, 0 replies; 6+ messages in thread
From: Peter Braam @ 2008-04-17 17:56 UTC (permalink / raw)
To: lustre-devel
I forgot one other comment/question: shutdown of Lustre servers was
traditionally sometimes very slow because of timeouts. However, with the
Sandia "kill the export features", is this still true?
- peter -